This project builds credit default prediction models using the UCI Default of Credit Card Clients dataset. It focuses on improving defaulter detection with threshold tuning, and provides explainability via SHAP.
- Train and evaluate a Random Forest model with threshold tuning (targeting Recall ≈ 0.6).
- Generate evaluation artifacts: confusion matrix, ROC curve / AUC, threshold sensitivity analysis, and SHAP plots.
- (Optional) Train & save Logistic Regression / Random Forest / XGBoost models.
- (Optional) Run an interactive Streamlit dashboard for credit risk scoring.
- Recommended: Python 3.9+
- Install dependencies:

    pip install -U pip
    pip install pandas numpy scikit-learn matplotlib seaborn shap joblib xgboost streamlit plotly

Run from the project root:

    python train_rf.py

This script will:
- Load `uci_default_cleaned.csv`
- Split train/test sets
- Train a Random Forest (class-weighted for imbalance)
- Find a threshold targeting Recall ≈ 0.6
- Print classification metrics
- Export plots and CSV outputs (see below)
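
The training and threshold-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `train_rf.py`: it uses synthetic data from `make_classification` in place of `uci_default_cleaned.csv`, the helper `pick_threshold` is a hypothetical name, and the hyperparameters are arbitrary. For brevity the threshold is chosen on the held-out split; a separate validation split would give a less optimistic estimate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class


def pick_threshold(y_true, scores, target_recall=0.6):
    """Return the highest threshold whose recall is still >= target_recall."""
    _, recall, thresholds = precision_recall_curve(y_true, scores)
    # recall has one more entry than thresholds; recall[:-1] aligns with them.
    meets_target = recall[:-1] >= target_recall
    return float(thresholds[meets_target].max())


threshold = pick_threshold(y_test, proba, target_recall=0.6)
y_pred = (proba >= threshold).astype(int)
achieved_recall = recall_score(y_test, y_pred)
print(f"threshold={threshold:.3f} recall={achieved_recall:.3f}")
```

Taking the highest threshold that still meets the recall target keeps as much precision as possible while holding recall at or above 0.6.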

(Optional) To train and save all models:

    python "Train and Save All Models.py"

It will generate:
- `lr_model.pkl`, `rf_model.pkl`, `xgb_model.pkl`
- `feature_names.pkl`, `reference_data.csv`

(Optional) To launch the Streamlit dashboard:

    streamlit run web_app.py

Generated by `python train_rf.py` (saved to the project root):

| File | Description |
|---|---|
| `confusion_matrix_final.png` | Confusion matrix (threshold-adjusted) |
| `roc_curve_final.png` | ROC curve & AUC |
| `shap_importance_bar.png` | SHAP feature importance (bar) |
| `shap_summary_plot.png` | SHAP summary plot |
| `threshold_comparison.png` | Recall/Precision/Accuracy vs. threshold |
| `threshold_sensitivity_analysis.csv` | Threshold performance table |
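
A threshold performance table of the kind listed above can be computed along these lines. This is a sketch under assumptions: the function name `threshold_table`, the column names, and the toy labels and scores are all made up for illustration, and the actual layout of `threshold_sensitivity_analysis.csv` may differ.

```python
def threshold_table(y_true, scores, thresholds):
    """Compute recall, precision, and accuracy at each threshold."""
    rows = []
    for t in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        tn = len(y_true) - tp - fp - fn
        rows.append({
            "threshold": t,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "accuracy": (tp + tn) / len(y_true),
        })
    return rows


# Toy labels and scores (illustrative only, not from the dataset).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.55, 0.6, 0.1, 0.35, 0.4, 0.05]

rows = threshold_table(y_true, scores, thresholds=[0.3, 0.5, 0.7])
for row in rows:
    print(row)
```

In this toy example, raising the threshold lowers recall and raises precision while accuracy stays flat, which is why accuracy alone is a poor guide when choosing a threshold for the defaulter class. The rows could be written to disk with `pandas.DataFrame(rows).to_csv(...)`.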
Key files/folders:

    final_project/
    ├─ Readme.md
    ├─ train_rf.py
    ├─ train_rf.ipynb
    ├─ Train and Save All Models.py
    ├─ web_app.py
    ├─ uci_default_cleaned.csv
    ├─ Dataset/               # raw + reference CSVs
    ├─ Random_Forest/         # additional RF experiments/results
    ├─ Logistic_Regression/   # LR report/code
    └─ web_source/            # web app bundle (models + assets)

- Threshold tuning is used to prioritize recall for the defaulter class.
- Model artifacts (`*.pkl`) are included to make the dashboard runnable without retraining.