Exoplanet Habitability Analysis

About the Project

Project Overview

This project analyzes exoplanet data from NASA's Kepler mission to predict planetary habitability. The workflow covers data preprocessing, exploratory data analysis (EDA), feature engineering, and a Random Forest classification model.

The entire application is containerized with Docker and integrated into a Jenkins CI/CD pipeline tied to the GitHub repository for automated testing, image building, and deployment.

A Streamlit dashboard provides an interactive interface for real-time habitability predictions.

Problem Statement

Although numerous exoplanets have been identified, determining their habitability remains difficult. Traditional methods rely on manual analysis, which is time-consuming and prone to bias.

This project automates the classification of exoplanets into habitable and non-habitable categories using machine learning, which speeds up identification and helps prioritize targets for further study.

Methodology

Data Acquisition

Primary Dataset: Kepler mission data containing information about exoplanets.

Supplementary Data: Additional datasets to enrich the feature set and improve model accuracy.

Preprocessing

Handling Missing Values: Implemented strategies to address missing data, including imputation techniques.

Categorical Encoding: Converted categorical variables into numerical formats.

Feature Scaling: Applied normalization techniques for uniformity.
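
The three preprocessing steps above can be sketched with a scikit-learn ColumnTransformer. This is a minimal illustration, not the project's actual pipeline: the column names, the toy values, and the choice of median imputation, one-hot encoding, and standard scaling are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative, not the project's schema.
df = pd.DataFrame({
    "planet_radius": [1.1, np.nan, 2.4, 0.9],
    "equilibrium_temp": [288.0, 410.0, np.nan, 255.0],
    "disposition": ["CONFIRMED", "CANDIDATE", "CONFIRMED", "FALSE POSITIVE"],
})

numeric_cols = ["planet_radius", "equilibrium_temp"]
categorical_cols = ["disposition"]

# Numeric columns: fill gaps with the column median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        # Categorical columns: one-hot encode, ignoring unseen categories later.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Wrapping the steps in one transformer means the statistics learned during fitting (medians, means, category lists) are reused unchanged when new data is transformed.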

Feature Engineering

Habitability Score: A composite metric derived from existing features to quantify potential habitability.

Derived Features: Calculated additional attributes such as equilibrium temperature and stellar flux.
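
As a rough sketch of how the derived features could be computed, the functions below use the standard textbook formulas for equilibrium temperature and stellar flux relative to Earth. The function names, the default albedo of 0.3, and the input units are assumptions for illustration; the project's own formulas are not given here.

```python
import math

SOLAR_RADIUS_M = 6.957e8      # nominal solar radius in meters
SOLAR_TEFF_K = 5772.0         # nominal solar effective temperature in kelvin
AU_M = 1.495978707e11         # astronomical unit in meters


def equilibrium_temperature(star_teff_k, star_radius_rsun, semi_major_axis_au, albedo=0.3):
    """Equilibrium temperature in K: T_star * sqrt(R_star / (2a)) * (1 - A)^(1/4)."""
    star_radius_m = star_radius_rsun * SOLAR_RADIUS_M
    orbit_m = semi_major_axis_au * AU_M
    return star_teff_k * math.sqrt(star_radius_m / (2.0 * orbit_m)) * (1.0 - albedo) ** 0.25


def stellar_flux(star_teff_k, star_radius_rsun, semi_major_axis_au):
    """Incident flux relative to Earth's: (R/R_sun)^2 * (T/T_sun)^4 / a_AU^2."""
    return (star_radius_rsun ** 2) * (star_teff_k / SOLAR_TEFF_K) ** 4 / semi_major_axis_au ** 2


# Sanity check with Sun-Earth values.
print(equilibrium_temperature(5772.0, 1.0, 1.0))
print(stellar_flux(5772.0, 1.0, 1.0))
```

For Sun-Earth inputs the equilibrium temperature comes out at roughly 255 K and the relative flux at 1.0, which is a quick way to confirm the units are consistent.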

Key Findings

Habitability Mapping: Planet Radius vs. Equilibrium Temperature

This visualization shows how planet radius and equilibrium temperature relate to habitability scores. The model identifies optimal ranges for these parameters that correlate with higher habitability potential.

Key Parameters:

Radius lower limit: 0.5 Earth radii
Radius upper limit: 1.5 Earth radii
Temperature lower limit: 200 K
Temperature upper limit: 350 K
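
The limits above can be expressed as a simple range check. This is only a sketch: the function name is hypothetical, and treating the bounds as inclusive is an assumption, since the source does not say how edge values are handled.

```python
# Limits taken from the key parameters listed above.
RADIUS_RANGE_EARTH = (0.5, 1.5)   # Earth radii
TEMP_RANGE_K = (200.0, 350.0)     # kelvin


def in_habitable_window(radius_earth, eq_temp_k):
    """Return True when both radius and equilibrium temperature fall inside the limits.

    Bounds are treated as inclusive (an assumption).
    """
    radius_ok = RADIUS_RANGE_EARTH[0] <= radius_earth <= RADIUS_RANGE_EARTH[1]
    temp_ok = TEMP_RANGE_K[0] <= eq_temp_k <= TEMP_RANGE_K[1]
    return radius_ok and temp_ok


print(in_habitable_window(1.0, 288.0))   # Earth-like values
print(in_habitable_window(2.0, 288.0))   # radius too large
```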

Distribution of Habitability Scores

The distribution shows how exoplanets are classified based on their habitability scores. Most planets fall into the non-habitable categories, with only a small percentage showing high habitability potential.

0: Non-habitable
1: Marginal
2: Potentially habitable
3: Highly habitable

Model Performance

Accuracy: 93%
Precision: 91%
Recall: 94%

With accuracy, precision, and recall all above 90%, the Random Forest classifier performs well at separating habitable from non-habitable exoplanets.
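
For readers who want to see how such metrics are typically produced, here is a minimal sketch that trains a scikit-learn Random Forest and reports accuracy, precision, and recall. It runs on synthetic data from make_classification, so its numbers will not match the figures above; the binary labels, class balance, and hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exoplanet features (1 = habitable, 0 = non-habitable).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)

# Stratify so the rarer "habitable" class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42,
)

# Hyperparameters are illustrative, not the project's tuned values.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because habitable planets are rare, precision and recall on the positive class are usually more informative than accuracy alone.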

Confusion Matrix

                     CONFIRMED   FALSE POSITIVE   NOT DISPOSITIONED
CANDIDATE                    6               18                   3
CONFIRMED                    4              382                   7
FALSE POSITIVE               1               36                  20
NOT DISPOSITIONED            6               76                  21

The confusion matrix provides insights into the model's classification performance across different exoplanet categories.
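
One way to read the matrix is to total its rows and columns and look at each row as proportions. The sketch below uses the counts listed above, with three column labels and four row labels as they appear in the source. Which axis holds the true labels and which the predictions is not stated, so the code makes no assumption about it.

```python
row_labels = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE", "NOT DISPOSITIONED"]
col_labels = ["CONFIRMED", "FALSE POSITIVE", "NOT DISPOSITIONED"]

# Counts copied from the confusion matrix above, one inner list per row.
counts = [
    [6, 18, 3],
    [4, 382, 7],
    [1, 36, 20],
    [6, 76, 21],
]

row_totals = {label: sum(row) for label, row in zip(row_labels, counts)}
col_totals = {label: sum(col) for label, col in zip(col_labels, zip(*counts))}
total = sum(row_totals.values())

# Share of each row that falls into each column.
row_shares = {
    label: [count / row_totals[label] for count in row]
    for label, row in zip(row_labels, counts)
}

print("row totals:", row_totals)
print("column totals:", col_totals)
print("total samples:", total)
for label, shares in row_shares.items():
    print(label, [f"{share:.2f}" for share in shares])
```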

Technology Stack

Machine Learning Pipeline

Python: Primary programming language
Scikit-learn: Machine learning library (Random Forest classifier)
Pandas/Numpy: Data manipulation and analysis
Matplotlib/Seaborn: Data visualization
Streamlit: Interactive web dashboard

DevOps Implementation

Docker: Containerization for consistent deployment
Jenkins: CI/CD pipeline automation
GitHub: Version control and repository management
Multi-stage builds: Optimized Docker images
Automated testing: Unit tests and linting

CI/CD Pipeline

Code Commit -> Linting & Testing -> Docker Build -> Registry Push -> Deployment

Contact & Links

Project Developer

Shubham Pandey

Reg. No: 12211376

School of Computer Science and Engineering

Lovely Professional University, Phagwara, Punjab, India

Email: shubham30p@gmail.com

Project Links

GitHub Repository
DockerHub