This notebook, run.ipynb, is the final implementation for predicting worm lifespan based on early behavioral data. It integrates data preprocessing, feature engineering, machine learning models, and evaluation techniques to produce accurate lifespan predictions.
- Project Overview
- Prerequisites
- Overview of key functions
- Notebook Workflow
- Results and Outputs
Project Overview
The project predicts the lifespan of worms from behavioral time-course data recorded in laboratory experiments. The behavioral features include center-of-mass coordinates, speed, and other derived metrics, which are used to train machine learning models that predict lifespan.
This notebook runs the whole pipeline from start to finish, calling modularized functions from the project's external Python files for efficient computation and analysis.
Prerequisites
- Python 3.8+
- Required libraries: numpy, pandas, matplotlib, scikit-survival, scikit-learn
- To install the dependencies, run: pip install numpy pandas matplotlib scikit-survival scikit-learn
- Clone the repository: git clone https://github.com/Tournedos/ML-Project-2.git
- Navigate to the project folder: cd worm-lifespan-prediction
- Open and run the notebook: jupyter notebook run.ipynb
Overview of key functions
- run.ipynb: The main notebook that integrates all steps, from data loading and preprocessing to analysis and visualization.
- helpers.py: Contains utility functions used throughout the project.
- models.py: Includes the machine learning models used for predictions.
- nan_imputation.py: Provides functions for handling missing values in the data.
- Preprocessing.py: Handles general preprocessing tasks to clean the data.
- preprocessing_features.py: Focuses on feature-specific preprocessing, such as scaling and extraction. Calculates new features from the basic features that come with the data.
- load_data.py: Includes functions such as load_lifespan and load_earlylifespan for loading datasets.
- try.ipynb: Not used in the final notebook, but contains the raw earlier analysis that led to the final results.
Notebook Workflow
The notebook run.ipynb is structured to guide you through the complete process of worm lifespan prediction.
Part 1: Lifespan prediction based on early behavior
- Setup:
  - Import libraries (numpy, pandas, ...) and custom modules (helpers.py, models.py, ...).
  - Set up the root directory and data paths for data loading.
- Data loading:
  - Load the lifespan data and check that it loaded properly (only .csv files are read).
- Data preprocessing:
  - Clean the data by imputing NaNs.
  - Remove frames in which the worm is detected as dead.
  - Standardize the features to prepare them for modeling.
- Feature engineering:
  - Extract early-behavior metrics from the raw data.
  - Construct datasets for regression and classification tasks.
- Model training and evaluation:
  - Train machine learning models to predict lifespan from early behavioral features.
  - Evaluate model performance using metrics and visualizations.
- Results analysis:
  - Compare predictions against ground truth using metrics such as RMSE and accuracy.
  - Plot Kaplan-Meier curves for survival analysis.
  - Plot error histograms for the lifespan prediction models.
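As a rough illustration of the preprocessing steps listed above, the sketch below fills NaN gaps and standardizes a single feature in plain Python. It is not the code from nan_imputation.py or Preprocessing.py: linear interpolation is an assumed imputation strategy, and the function names `impute_nans` and `standardize` and the `speed` values are hypothetical.

```python
import math
from statistics import mean, pstdev


def impute_nans(values):
    """Fill NaN gaps by linear interpolation (assumed strategy).

    Gaps at either end of the series copy the nearest valid value.
    """
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        raise ValueError("cannot impute a series that contains only NaNs")
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def standardize(values):
    """Rescale a feature to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


# Made-up speed trace with missing frames, for illustration only.
speed = [1.0, float("nan"), 3.0, float("nan"), float("nan"), 6.0]
clean = impute_nans(speed)
scaled = standardize(clean)
```

Imputing before standardizing matters here: the mean and standard deviation are undefined while NaNs remain in the series.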
Part 2: Assessment of worm personality based on early behavior
- Setup:
  - Make any additional imports that are needed.
  - Load the data, this time specifically the Optogenetics files.
- Preprocessing the optogenetics data:
  - NaN imputation.
- Feature engineering:
  - Derive personality metrics from early movement patterns, such as consistency in movement and preferred activity levels.
- Clustering analysis:
  - Perform clustering to group worms with similar behavioral traits.
  - Visualize the clusters to identify distinct personality types.
- Behavioral traits evaluation:
  - Quantify differences between clusters using statistical methods.
  - Highlight the key behavioral features that differentiate the groups.
- Visualization:
  - Generate plots to visualize personality traits and cluster distributions.
- Insights and interpretation:
  - Draw connections between personality traits and the lifespan predictions from Part 1.
  - Provide actionable insights based on the behavioral clustering.
  - Cluster plots showing distinct worm personality types.
  - Behavioral feature distributions across clusters.
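The clustering step above can be pictured with a small example. This README does not say which clustering algorithm the notebook uses, so the sketch assumes k-means and writes it in plain Python to keep the two steps visible. The function name `kmeans`, the two features (mean speed, movement consistency) and all values are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    centroids = [list(p) for p in points[:k]]  # deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up (mean speed, movement consistency) pairs for six worms.
worms = [(0.2, 0.9), (0.3, 0.8), (0.25, 0.85), (1.5, 0.2), (1.6, 0.3), (1.4, 0.25)]
labels, centroids = kmeans(worms, k=2)
print(labels)
```

In practice the features would be standardized first, because k-means relies on Euclidean distance and a feature with a large numeric range would otherwise dominate the grouping.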
Results and Outputs
- Predictions: lifespan predictions for worms based on their early behavior.
- Visualizations: Kaplan-Meier survival curves and other plots for understanding model performance.
- Evaluation: analysis of OLS coefficients.
- Performance metrics: accuracy, RMSE, and survival analysis metrics.
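To make two of these outputs concrete, the sketch below computes RMSE and a Kaplan-Meier survival curve by hand. The notebook's own implementation is not shown in this README (scikit-survival is among the listed dependencies), so this plain-Python version only illustrates the arithmetic. The function names and every number are made up and are not results from the project.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted lifespans."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time with an observed death.

    observed[i] is True when the death of worm i was seen and False when the
    worm was censored (it left the experiment before dying).
    """
    curve = []
    survival = 1.0
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Illustrative lifespans in days; the third worm is censored at day 12.
durations = [10, 12, 12, 15, 20]
observed = [True, True, False, True, True]
curve = kaplan_meier(durations, observed)

error = rmse([10.0, 12.0, 15.0], [11.0, 12.0, 13.0])
```

The censored worm still counts as at risk up to day 12 but never triggers a drop in the curve, which is what distinguishes the Kaplan-Meier estimate from a simple fraction of worms still alive.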
To run the pipeline on new worm data, make sure that the new files are saved in the same format (.csv) and contain the same information. Then put the files in one of the 'Data' subfolders and run the notebook. If the new files contain different data (from different experiments or with different drugs), the notebook will still run, but parts of 'load_data.py' may need to be modified for the results to be meaningful.
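Before dropping new files into a 'Data' subfolder, it can help to check that their headers match the existing ones. The helper below is a hypothetical sketch and is not part of load_data.py. The column names in `EXPECTED` are placeholders and should be replaced with the columns actually present in the existing data files.

```python
import csv


def missing_columns(lines, expected_columns):
    """Return the expected column names that are absent from a CSV header.

    `lines` is any iterable of text lines, for example an open file object.
    """
    header = next(csv.reader(lines), [])
    present = {name.strip() for name in header}
    return [name for name in expected_columns if name not in present]


# Placeholder column names (assumption): use the ones from the existing files.
EXPECTED = ["Frame", "Speed", "X", "Y"]

sample = ["Frame,Speed,X\n", "1,0.4,12.5\n"]
print(missing_columns(sample, EXPECTED))  # prints ['Y']
```

For a real file, pass an open handle instead of a list, for example `with open(path, newline="") as f: missing_columns(f, EXPECTED)`, where `path` points at the new CSV file.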