Abhisri436/insurance_premium_prediction

Insurance Premium Prediction

📊 Machine Learning Regression Project – Kaggle Submission

This project builds a regression model to predict insurance premium amounts using a real-world dataset of 2 million entries. The goal was to optimize model performance while maintaining interpretability and memory efficiency under resource constraints.

🧠 Project Highlights

  • Trained a Random Forest model on 2M records with custom preprocessing and feature engineering
  • Applied tailored imputation strategies for missing values and encoded high-cardinality categorical features
  • Achieved ~1.15 RMSE on Kaggle's private leaderboard (the competition scores predictions on a log scale, i.e. RMSLE)
  • Visualized distribution patterns and evaluated prediction errors to understand model limitations
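The preprocessing and training steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (age, annual_income, occupation), the synthetic data, the median/most-frequent imputation choices, the ordinal encoding, and the forest hyperparameters are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder


def build_model(numeric_cols, categorical_cols, random_state=0):
    """Imputation + encoding + Random Forest in one scikit-learn pipeline."""
    # Numeric columns: fill missing values with the column median.
    numeric = SimpleImputer(strategy="median")
    # Categorical columns: fill with the most frequent level, then map each
    # level to an integer code (compact for high-cardinality features).
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    # Shallow-ish trees and few estimators keep memory use modest (assumed values).
    forest = RandomForestRegressor(n_estimators=50, max_depth=12,
                                   n_jobs=-1, random_state=random_state)
    return Pipeline([("preprocess", preprocess), ("forest", forest)])


# Synthetic stand-in for the Kaggle data (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(18, 65, n).astype(float),
    "annual_income": rng.lognormal(10, 1, n),
    "occupation": pd.Series(
        rng.choice(["employed", "self_employed", "unemployed"], n),
        dtype=object),
})
# Knock out some values to mimic missing data.
demo.loc[rng.choice(n, 50, replace=False), "age"] = np.nan
demo.loc[rng.choice(n, 50, replace=False), "occupation"] = np.nan
premium = 200 + 10 * demo["age"].fillna(40) + rng.normal(0, 25, n)

model = build_model(["age", "annual_income"], ["occupation"])
model.fit(demo, premium)
predictions = model.predict(demo)
```

Keeping the imputers and encoder inside the pipeline means the same transformations are applied to test.csv at prediction time without refitting them.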

📈 Tools & Libraries

  • Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
  • Jupyter Notebook (exploratory analysis and training)

📊 Key Metrics

  • 📉 RMSE: ~1.15 (Kaggle leaderboard score, computed on the log scale, i.e. RMSLE)
  • 📉 MAE: ~637 (on the original premium scale)
  • 📈 Records Used: 2,000,000+
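The two error figures above differ by orders of magnitude, which is expected if the Kaggle score is measured on the log scale (RMSLE) while the MAE is in raw premium units. The sketch below shows both computations side by side on made-up numbers; the function names and sample values are illustrative assumptions, not taken from the notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error: RMSE of log1p-transformed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error in the target's original units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


# Made-up premiums, only to show that the two metrics live on different scales.
actual = [100.0, 1000.0, 2000.0]
predicted = [150.0, 800.0, 2500.0]
print(f"RMSLE: {rmsle(actual, predicted):.3f}")
print(f"MAE:   {mae(actual, predicted):.1f}")
```

Because RMSLE penalizes relative rather than absolute error, a model can post a small log-scale score while still being off by hundreds of currency units on large premiums.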

📂 Dataset

Due to file size constraints, the dataset is not included in this repository.

Please download it directly from the Kaggle competition page:
🔗 Playground Series - Season 4, Episode 12

Place the downloaded files (e.g., train.csv, test.csv) in the same directory as the notebook before running.

🚀 How to Run

  1. Clone the repo
  2. Download the dataset and place it in the same directory as the notebook
  3. Make sure you have Python 3 installed and the following libraries:
  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • seaborn
  4. Launch the notebook (for example, start Jupyter from the repository directory and open the notebook file).
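Before opening the notebook, the library list from step 3 can be verified with a short script like the one below. It is only a convenience sketch: the REQUIRED mapping and the missing_packages helper are hypothetical names, and the mapping exists because scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Install name -> import name (scikit-learn is imported as "sklearn").
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return the install names of packages that cannot be imported."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```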
