Machine Learning Regression Project: Kaggle Submission
This project builds a regression model to predict insurance premium amounts using a real-world dataset of 2 million entries. The goal was to optimize model performance while maintaining interpretability and memory efficiency under resource constraints.
- Trained a Random Forest model on 2M records with custom preprocessing and feature engineering
- Applied tailored imputation strategies for missing values and encoded high-cardinality categorical features
- Achieved ~1.15 RMSE on log-transformed premiums (the competition's RMSLE metric) on Kaggle's private leaderboard
- Visualized distribution patterns and evaluated prediction errors to understand model limitations
- Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
- Jupyter Notebook (exploratory analysis and training)
- RMSE: ~1.15 (Kaggle private leaderboard; log scale, i.e. the competition's RMSLE metric)
- MAE: ~637
- Records Used: 2,000,000+
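The workflow summarized in the bullets above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column names (`age`, `income`, `occupation`, `premium`), the median and constant-fill imputation, the ordinal encoding, the `log1p` target transform, and the hyperparameters are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for train.csv; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n).astype("float32"),
    "income": rng.lognormal(mean=10.0, sigma=1.0, size=n).astype("float32"),
    "occupation": pd.Series(rng.choice(["a", "b", "c", "d"], size=n), dtype="object"),
})
df["premium"] = 200.0 + 10.0 * df["age"] + rng.gamma(2.0, 150.0, size=n)

# Knock out roughly 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "income"] = np.nan
df.loc[rng.random(n) < 0.1, "occupation"] = np.nan

num_cols = ["age", "income"]
cat_cols = ["occupation"]

# Numeric columns: median imputation.
# Categorical columns: fill with a sentinel, then map categories to integers.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), cat_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(
        n_estimators=50, max_depth=12, min_samples_leaf=5,
        n_jobs=-1, random_state=0,
    )),
])

# Assumption: train on log1p(premium) so that RMSE on this scale equals RMSLE.
X = df[num_cols + cat_cols]
y_log = np.log1p(df["premium"])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y_log, test_size=0.2, random_state=0
)

model.fit(X_train, y_train)
pred_log = model.predict(X_valid)

# RMSE on the log scale, MAE back on the original premium scale.
rmse_log = float(np.sqrt(mean_squared_error(y_valid, pred_log)))
mae_raw = float(mean_absolute_error(np.expm1(y_valid), np.expm1(pred_log)))
print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"MAE (original scale): {mae_raw:.1f}")
```

Ordinal encoding is used here because tree ensembles can split on integer codes directly, which avoids the memory cost of one-hot encoding a high-cardinality column. The numbers printed for this synthetic data are not comparable to the leaderboard figures.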
Due to file size constraints, the dataset is not included in this repository.
Please download it directly from the Kaggle competition page:
Playground Series - Season 4, Episode 12
Place the downloaded files (e.g., train.csv, test.csv) in the same directory as the notebook before running.
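Because the project targets memory efficiency, one way to load a large CSV compactly is to downcast numeric columns and store text columns as categoricals. The helper below is a hypothetical sketch (the function name `load_compact` and the demo columns are not from the notebook), shown on a tiny temporary file so that it runs without the Kaggle data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_compact(csv_path):
    """Read a CSV and shrink column dtypes where pandas can do so losslessly."""
    df = pd.read_csv(csv_path)
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_integer_dtype(s):
            # int64 -> smallest integer type that holds the values
            df[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # float64 -> float32 when values survive the cast within tolerance
            df[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            # repeated strings are stored once as category labels
            df[col] = s.astype("category")
    return df


# Demo on a small temporary file with made-up columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text(
        "age,income,occupation\n"
        "25,50000.5,engineer\n"
        "40,,teacher\n"
        "31,72000.0,engineer\n"
    )
    demo = load_compact(demo_path)

print(demo.dtypes)
print(demo.memory_usage(deep=True).sum(), "bytes")
```

With the real data, the call would be something like `load_compact("train.csv")`, assuming the file sits next to the notebook as described above.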
- Clone the repo
- Download the dataset and place it in the same directory as the notebook
- Make sure you have Python 3 installed and the following libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - seaborn
- Launch the notebook: