Exoplanet Habitability Analysis

Using machine learning to identify potentially habitable planets beyond our solar system

About the Project

Project Overview

This project analyzes exoplanet data from NASA's Kepler mission to predict planetary habitability. It combines data preprocessing, exploratory data analysis (EDA), feature engineering, and a Random Forest classification model.

The entire application is containerized with Docker and integrated into a Jenkins CI/CD pipeline tied to the GitHub repository for automated testing, image building, and deployment.

A Streamlit dashboard provides an interactive interface for real-time habitability predictions.

Problem Statement

Although thousands of exoplanets have been identified, determining their habitability remains a complex challenge. Traditional methods rely on manual analysis, which is time-consuming and prone to bias.

The objective of this project is to automate the classification of exoplanets into habitable and non-habitable categories using machine learning, which speeds up identification and helps prioritize targets for further study.

Methodology

Data Acquisition

Primary Dataset: Kepler mission data containing information about exoplanets.

Supplementary Data: Additional datasets to enrich the feature set and improve model accuracy.

Preprocessing

Handling Missing Values: Implemented strategies to address missing data, including imputation techniques.

Categorical Encoding: Converted categorical variables into numerical formats.

Feature Scaling: Applied normalization techniques for uniformity.
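
The three preprocessing steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the column names, the toy rows, median imputation, one-hot encoding, and min-max scaling are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows standing in for the Kepler table; column names are hypothetical.
df = pd.DataFrame({
    "planet_radius": [1.1, np.nan, 2.4, 0.9],
    "equilibrium_temp": [288.0, 410.0, np.nan, 255.0],
    "disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"],
})

numeric_cols = ["planet_radius", "equilibrium_temp"]
categorical_cols = ["disposition"]

# Numeric columns: fill gaps with the column median, then rescale to [0, 1].
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        # Categorical columns: one indicator column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Keeping the steps inside a single `ColumnTransformer` means the same fitted transformations can be reapplied to new planets at prediction time.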

Feature Engineering

Habitability Score: A composite metric derived from existing features to quantify potential habitability.

Derived Features: Calculated additional attributes such as equilibrium temperature and stellar flux.
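
As an illustration of the derived features, the sketch below uses the standard textbook relations for equilibrium temperature and stellar flux. The function names, the default albedo of 0.3, and the assumption of uniform heat redistribution are illustrative choices; the project may compute these quantities differently.

```python
import math

SUN_TEFF_K = 5772.0          # nominal solar effective temperature in kelvin
SUN_RADIUS_AU = 0.00465047   # solar radius expressed in astronomical units


def equilibrium_temperature(star_teff_k, star_radius_rsun, semi_major_axis_au, albedo=0.3):
    """T_eq = T_star * sqrt(R_star / (2 a)) * (1 - A)^(1/4).

    Assumes the planet re-radiates absorbed light uniformly over its surface.
    """
    star_radius_au = star_radius_rsun * SUN_RADIUS_AU
    geometric_term = math.sqrt(star_radius_au / (2.0 * semi_major_axis_au))
    return star_teff_k * geometric_term * (1.0 - albedo) ** 0.25


def stellar_flux_earth_units(star_teff_k, star_radius_rsun, semi_major_axis_au):
    """Incident flux relative to Earth: (R/R_sun)^2 * (T/T_sun)^4 / (a/AU)^2."""
    luminosity_lsun = star_radius_rsun ** 2 * (star_teff_k / SUN_TEFF_K) ** 4
    return luminosity_lsun / semi_major_axis_au ** 2


# Sanity check with a Sun-like star and an Earth-like orbit.
earth_teq = equilibrium_temperature(SUN_TEFF_K, 1.0, 1.0)
earth_flux = stellar_flux_earth_units(SUN_TEFF_K, 1.0, 1.0)
print(round(earth_teq), earth_flux)
```

For a Sun-like star at 1 AU this gives an equilibrium temperature of roughly 255 K and a flux of 1.0 Earth units, which is a quick way to check that the units are consistent.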

Key Findings

Habitability Mapping: Planet Radius vs. Equilibrium Temperature

This visualization shows how planet radius and equilibrium temperature relate to habitability scores. The model identifies optimal ranges for these parameters that correlate with higher habitability potential.

Key Parameters:
  • Radius lower limit: 0.5 Earth radii
  • Radius upper limit: 1.5 Earth radii
  • Temperature lower limit: 200 K
  • Temperature upper limit: 350 K
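
The limits above can be expressed as a simple range check. This is a hedged sketch: the function name is hypothetical, the bounds are treated as inclusive by assumption, and the project's actual habitability score likely combines more features than these two.

```python
RADIUS_RANGE_EARTH = (0.5, 1.5)   # Earth radii, from the limits listed above
TEMP_RANGE_K = (200.0, 350.0)     # kelvin, from the limits listed above


def in_habitable_window(radius_earth, eq_temp_k):
    """Return True when both radius and temperature fall inside the listed ranges."""
    r_lo, r_hi = RADIUS_RANGE_EARTH
    t_lo, t_hi = TEMP_RANGE_K
    return r_lo <= radius_earth <= r_hi and t_lo <= eq_temp_k <= t_hi


print(in_habitable_window(1.0, 255.0))   # Earth-like values fall inside both ranges
print(in_habitable_window(2.0, 255.0))   # radius above the upper limit
```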

Distribution of Habitability Scores

The distribution shows how exoplanets are classified based on their habitability scores. Most planets fall into the non-habitable categories, with only a small percentage showing high habitability potential.

  • 0: Non-habitable
  • 1: Marginal
  • 2: Potentially habitable
  • 3: Highly habitable

Model Performance

  • Accuracy: 93%
  • Precision: 91%
  • Recall: 94%

The Random Forest classifier scored above 90% on all three metrics, indicating that it classifies both habitable and non-habitable exoplanets reliably.
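
For readers who want to see how such metrics are typically produced, here is a minimal scikit-learn sketch. It trains on synthetic data from `make_classification`, not on the Kepler dataset, so its scores say nothing about the project's results; the hyperparameters and the 75/25 split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for habitable vs. non-habitable labels.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Hold out a stratified test set so both classes appear in evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Evaluating on a held-out split rather than the training data is what makes numbers like these meaningful.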

Confusion Matrix

                     CONFIRMED   FALSE POSITIVE   NOT DISPOSITIONED
CANDIDATE                    6               18                   3
CONFIRMED                    4              382                   7
FALSE POSITIVE               1               36                  20
NOT DISPOSITIONED            6               76                  21

The confusion matrix provides insights into the model's classification performance across different exoplanet categories.
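
To make the idea concrete, the following sketch tallies a confusion matrix from pairs of actual and predicted labels and derives per-class recall from it. The label lists are made-up toy data, and the helper names are hypothetical; they are not taken from the project.

```python
from collections import Counter

LABELS = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE", "NOT DISPOSITIONED"]


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def per_class_recall(counts, labels):
    """Recall per class: correct predictions divided by all samples of that class."""
    recall = {}
    for label in labels:
        row_total = sum(counts[(label, pred)] for pred in labels)
        # Classes with no samples have undefined recall; report None.
        recall[label] = counts[(label, label)] / row_total if row_total else None
    return recall


# Toy labels for illustration only.
actual = ["CONFIRMED", "CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CANDIDATE"]
predicted = ["CONFIRMED", "CANDIDATE", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"]

counts = confusion_counts(actual, predicted)
recall = per_class_recall(counts, LABELS)
for label in LABELS:
    print(label, recall[label])
```

Per-class recall is useful here because a single accuracy figure can hide a class that the model almost never predicts correctly.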

Technology Stack

Machine Learning Pipeline

  • Python: Primary programming language
  • Scikit-learn: Machine learning library (Random Forest classifier)
  • Pandas/NumPy: Data manipulation and analysis
  • Matplotlib/Seaborn: Data visualization
  • Streamlit: Interactive web dashboard

DevOps Implementation

  • Docker: Containerization for consistent deployment
  • Jenkins: CI/CD pipeline automation
  • GitHub: Version control and repository management
  • Multi-stage builds: Optimized Docker images
  • Automated testing: Unit tests and linting

CI/CD Pipeline

Code Commit → Linting & Testing → Docker Build → Registry Push → Deployment

Contact & Links

Project Developer

Shubham Pandey

Reg. No: 12211376

School of Computer Science and Engineering

Lovely Professional University, Phagwara, Punjab, India

Email: shubham30p@gmail.com