An end-to-end credit default prediction system with explainable AI, demographic fairness auditing, and regulatory-compliant bias mitigation, built on 150,000+ real borrower profiles.

Overview · Architecture · Results · Fairness · Setup · Usage

The Credit Risk Intelligence Engine is a production-grade machine learning pipeline that goes beyond standard classification: it is designed to answer the hard questions that financial institutions actually face:
Can we identify high-risk applicants reliably and ensure that our model does not systematically disadvantage borrowers based on demographic characteristics?
This project combines gradient boosting, statistical feature analysis, multi-layer explainability (SHAP + LIME), and IBM AIF360 fairness constraints into a single, audit-ready system.
| # | Question |
|---|---|
| 1 | Which borrower characteristics are the strongest predictors of default? |
| 2 | How do we build a model that catches defaults while minimizing false rejections? |
| 3 | Can the model provide specific, auditable reasons for each credit decision? |
| 4 | Does the model treat all age demographics equitably under the EEOC 80% rule? |
| 5 | Can fairness gaps be closed without sacrificing predictive performance? |
```
┌─────────────────────────────────────────────────────────────────────────┐
│                     CREDIT RISK INTELLIGENCE ENGINE                     │
├───────────────┬───────────────┬───────────────┬─────────────────────────┤
│  DATA LAYER   │   ML LAYER    │   EXPLAIN.    │     FAIRNESS LAYER      │
│               │               │   LAYER       │                         │
│  Raw CSV      │  Logistic     │  SHAP Tree    │  AIF360 Reweighing      │
│  EDA          │  Regression   │  Explainer    │  (Pre-processing)       │
│  Stat Tests   │  (Baseline)   │               │                         │
│  Feature      │               │  LIME         │  Disparate Impact       │
│  Engineering  │  Random       │  Tabular      │  Analysis               │
│               │  Forest       │  Explainer    │                         │
│  Imputation   │               │               │  Equal Opportunity      │
│  Outlier      │  XGBoost      │  Global +     │  Diff (EOD)             │
│  Capping      │  (Champion)   │  Local Scope  │                         │
│               │               │               │  Threshold              │
│  StandardSc.  │  Early        │  Per-case     │  Optimization           │
│               │  Stopping     │  explanations │  (Post-processing)      │
└───────────────┴───────────────┴───────────────┴─────────────────────────┘
```
Source: Give Me Some Credit (Kaggle)
| Attribute | Value |
|---|---|
| Total Records | ~150,000 borrowers |
| Features | 11 raw + 5 engineered |
| Target Variable | SeriousDlqin2yrs (90+ day delinquency) |
| Class Imbalance | ~6.7% default rate (14:1 ratio) |
| Missing Data | MonthlyIncome (~19%), NumberOfDependents (~2.5%) |
| Column | Engineered Name | Description |
|---|---|---|
| `SeriousDlqin2yrs` | Target | 90+ day delinquency within 2 years |
| `RevolvingUtilizationOfUnsecuredLines` | Credit Usage % | Proportion of revolving credit in use |
| `age` | Age | Borrower age in years |
| `NumberOfTime30-59DaysPastDueNotWorse` | 1-Month Lates | Count of 30–59 day delinquencies |
| `DebtRatio` | Debt vs Income | Monthly obligations / monthly income |
| `MonthlyIncome` | Monthly Income | Gross monthly income |
| `NumberOfOpenCreditLinesAndLoans` | Open Accounts | Active credit lines + loans |
| `NumberOfTimes90DaysLate` | 3-Month Lates | Count of 90+ day delinquencies |
| `NumberRealEstateLoansOrLines` | Mortgages | Real estate credit lines |
| `NumberOfTime60-89DaysPastDueNotWorse` | 2-Month Lates | Count of 60–89 day delinquencies |
| `NumberOfDependents` | Family Size | Number of dependents |
| Feature | Logic | Rationale |
|---|---|---|
| `TotalPastDue` | Sum of all 30/60/90-day lates | Single delinquency severity signal |
| `CreditHistoryLength` | `(age - 18).clip(0)` | Proxy for years in credit system |
| `MonthlyPayment` | `DebtRatio × MonthlyIncome` | Actual cash-flow burden |
| `IncomePerPerson` | `MonthlyIncome / (Dependents + 1)` | Effective disposable income |
| `AgeGroup` | Binned: Young/MiddleAge/Senior/Elderly | Protected attribute for fairness audit |
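The engineered features above reduce to a few vectorized pandas operations. A minimal sketch on a hypothetical mini-batch (the exact `AgeGroup` bin edges are an assumption; the notebook defines the real cut points):

```python
import pandas as pd

# Hypothetical applicants; column names follow the Kaggle dataset
df = pd.DataFrame({
    'age': [25, 42, 67, 80],
    'DebtRatio': [0.45, 0.30, 0.10, 0.05],
    'MonthlyIncome': [3000.0, 6000.0, 4000.0, 2500.0],
    'NumberOfDependents': [2, 0, 1, 0],
    'NumberOfTime30-59DaysPastDueNotWorse': [1, 0, 0, 0],
    'NumberOfTime60-89DaysPastDueNotWorse': [0, 0, 0, 0],
    'NumberOfTimes90DaysLate': [1, 0, 0, 0],
})

late_cols = ['NumberOfTime30-59DaysPastDueNotWorse',
             'NumberOfTime60-89DaysPastDueNotWorse',
             'NumberOfTimes90DaysLate']
df['TotalPastDue'] = df[late_cols].sum(axis=1)            # single severity signal
df['CreditHistoryLength'] = (df['age'] - 18).clip(lower=0)
df['MonthlyPayment'] = df['DebtRatio'] * df['MonthlyIncome']
df['IncomePerPerson'] = df['MonthlyIncome'] / (df['NumberOfDependents'] + 1)
# Assumed bin edges for illustration only
df['AgeGroup'] = pd.cut(df['age'], bins=[0, 35, 55, 70, 120],
                        labels=['Young', 'MiddleAge', 'Senior', 'Elderly'])
```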
Before modeling, every feature was validated using the Mann-Whitney U Test + Cohen's d effect size across the default/non-default split:
| Tier | Features | Cohen's d | Business Meaning |
|---|---|---|---|
| Power Trio | `TotalPastDue`, `NumberOfTimes90DaysLate`, `RevolvingUtilization` | > 1.0 | Primary behavioral risk signals |
| Stability | `Age`, `CreditHistoryLength` | 0.2–0.5 | Protective maturity factors |
| Secondary | `MonthlyIncome`, `DebtRatio`, `NumberOfDependents` | < 0.2 | Supporting context features |
Verdict: The extreme Cohen's d of the Power Trio features confirmed that a tree-based, split-optimizing model (XGBoost) would be the ideal architecture.
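The validation step pairs a rank-based significance test with a pooled-SD effect size. A minimal sketch on synthetic stand-in samples (the distributions are illustrative, not the real data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Synthetic stand-ins for one feature, split by default status
defaulters = rng.normal(loc=2.0, scale=1.0, size=500)
non_defaulters = rng.normal(loc=0.5, scale=1.0, size=5000)

# Mann-Whitney U: does the feature's distribution differ between the groups?
u_stat, p_value = mannwhitneyu(defaulters, non_defaulters, alternative='two-sided')

def cohens_d(a, b):
    """Effect size: standardized mean difference with pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(defaulters, non_defaulters)
print(f"U={u_stat:.0f}, p={p_value:.2e}, Cohen's d={d:.2f}")  # d lands well above 1.0 here
```

A Cohen's d above 1.0, as for the Power Trio, means the two class distributions barely overlap, which is exactly the separation a split-based learner exploits.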
Three models were trained and compared in a rigorous pipeline:
**Logistic Regression (baseline)**
- `class_weight='balanced'` to address the 14:1 imbalance
- SAGA solver for large-scale convergence
- Purpose: interpretable linear baseline + recall ceiling benchmark

**Random Forest (benchmark)**
- 200 estimators, `max_depth=10`
- Non-linear interaction capture
- Bridge between the linear and boosting paradigms

**XGBoost (champion)**
- `n_estimators=1000` with `early_stopping_rounds=50`
- `scale_pos_weight` tuned to the exact class ratio (~14.0)
- `learning_rate=0.05`, `subsample=0.8`, `colsample_bytree=0.8`
- Early stopping on AUC: training halts automatically at optimal generalization
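The two baseline configurations described above might look as follows; this is a hedged sketch on synthetic imbalanced data, not the notebook's exact training code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with roughly the dataset's 14:1 class imbalance
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.933], random_state=0)

# Baseline: balanced logistic regression with the SAGA solver
lr = LogisticRegression(class_weight='balanced', solver='saga', max_iter=2000)
lr.fit(X, y)

# Benchmark: random forest matching the configuration listed above
rf = RandomForestClassifier(n_estimators=200, max_depth=10,
                            class_weight='balanced', random_state=0)
rf.fit(X, y)
```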
```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=ratio,       # ~14:1 imbalance correction
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50
)
```

| Metric | Logistic Regression | Random Forest | XGBoost |
|---|---|---|---|
| Accuracy | – | – | ~86% |
| ROC-AUC | Lower | Moderate | 0.8651 |
| Precision | Low | Higher | Highest |
| Recall | Highest | Moderate | ~79% |
| F1-Score | Lower | Moderate | Highest |
XGBoost selected as production model: highest ROC-AUC, best F1-Score, and most robust handling of class imbalance.
```
                     Predicted: No Default    Predicted: Default
Actual: No Default          22,453                  5,540
Actual: Default                430                  1,575
```
| Business Metric | Value |
|---|---|
| Default Catch Rate (Recall) | 78.55% |
| Safe Customer Clearance Rate | 80.21% |
| Missed Defaulters | ~430 (~21% of actual defaults) |
| AUC โ Discrimination Power | 0.8651 |
| Average Precision Score | 0.3976 (~6× better than random) |
The model is risk-averse by design: it errs toward flagging borderline cases, since the cost of a missed default far exceeds the cost of a rejected safe applicant.
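The business metrics in the table follow directly from the confusion-matrix counts; a quick sketch of the arithmetic:

```python
# Confusion-matrix counts from the table above
tn, fp = 22_453, 5_540   # actual non-defaulters: cleared vs. flagged
fn, tp = 430, 1_575      # actual defaulters: missed vs. caught

recall = tp / (tp + fn)        # default catch rate
specificity = tn / (tn + fp)   # safe-customer clearance rate
missed_share = fn / (tp + fn)  # share of defaulters the model misses

print(f"Recall: {recall:.2%}, Clearance: {specificity:.2%}, Missed: {missed_share:.1%}")
# -> Recall: 78.55%, Clearance: 80.21%, Missed: 21.4%
```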
SHAP TreeExplainer was applied to a stratified 1,000-sample test subset, producing a ranked, directional view of global feature influence:
| Rank | Feature | Direction | Interpretation |
|---|---|---|---|
| 1 | `TotalPastDue` | Risk ↑ with value | Strongest default signal: any delinquency history sharply raises risk |
| 2 | `RevolvingUtilizationOfUnsecuredLines` | Risk ↑ with value | Credit strain above ~70% is a heavy penalty |
| 3 | `Age` | Risk ↓ with age | Youth = higher risk; maturity acts as a protective factor |
| 4 | `MonthlyIncome` | Risk ↓ with income | Higher income modestly reduces risk |
| 5 | `DebtRatio` | Mixed | Meaningful only above extreme thresholds |
For the highest-risk case identified in the test set (predicted default probability: 97.2%), LIME decomposed the prediction:
```
Feature                          Contribution
─────────────────────────────────────────────
TotalPastDue          (7.84)     +0.33 risk
RevolvingUtilization  (2.04)     +0.29 risk
Age                  (-1.31)     +0.07 risk (young borrower)
CreditHistoryLength  (-1.31)     +0.06 risk (short history)
```
Regulatory Value: LIME explanations provide individualized, auditable reasons for each credit decision, a direct requirement under GDPR Article 22 and similar frameworks.
This is the most technically sophisticated component of the project. The fairness pipeline uses Age Group as the protected attribute and evaluates compliance with the EEOC 80% (Four-Fifths) Rule.
| Metric | Value |
|---|---|
| Senior/Elderly Approval Rate | Higher |
| Young/Middle-Age Approval Rate | Lower |
| Disparate Impact Ratio (Baseline) | < 0.80 (non-compliant) |
| Root Cause | Proxy discrimination via TotalPastDue + RevolvingUtilization (both correlated with age) |
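The Disparate Impact ratio is simply the favorable-outcome rate of the unprivileged group divided by that of the privileged group. A toy sketch with hypothetical approval decisions:

```python
import numpy as np

# Hypothetical approval decisions (1 = approved), split by protected group
young_approved = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1])   # unprivileged: 50% approved
senior_approved = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1])  # privileged: 80% approved

di_ratio = young_approved.mean() / senior_approved.mean()
print(f"Disparate Impact = {di_ratio:.3f}")  # 0.5 / 0.8 = 0.625, fails the 80% rule
```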
```python
from aif360.algorithms.preprocessing import Reweighing

RW = Reweighing(
    unprivileged_groups=[{'privileged': 0.0}],
    privileged_groups=[{'privileged': 1.0}]
)
dataset_transf = RW.fit_transform(dataset_train)
instance_weights = dataset_transf.instance_weights
xgb_fair.fit(X_train_scaled, y_train, sample_weight=instance_weights)
```

Reweighing assigns corrective importance weights to training samples, upweighting under-represented fair cases and downweighting over-represented ones, so the model learns a naturally equitable decision boundary without modifying features or labels.
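Conceptually, Reweighing gives each (group, label) cell the weight `P(group)·P(label) / P(group, label)`, the ratio of the expected joint probability under independence to the observed one. A hand-rolled sketch of that computation (pure pandas, not the AIF360 implementation):

```python
import pandas as pd

# Toy training sample: group (1 = privileged) and label (1 = default)
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 0, 0, 0, 0],
    'label': [0, 0, 0, 1, 0, 1, 1, 1],
})

n = len(df)
p_group = df['group'].value_counts(normalize=True)
p_label = df['label'].value_counts(normalize=True)
p_joint = df.groupby(['group', 'label']).size() / n

# Expected-over-observed probability ratio per (group, label) cell
weights = df.apply(
    lambda r: p_group[r['group']] * p_label[r['label']] / p_joint[(r['group'], r['label'])],
    axis=1,
)
```

Over-represented cells (here, privileged non-defaulters and unprivileged defaulters) get weights below 1, while the under-represented cells are boosted.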
A high-to-low threshold scan (0.99 → 0.01, 500 steps) identified the tightest threshold that simultaneously:
- Satisfies DI โฅ 0.80 (EEOC compliance), and
- Maximizes F1-Score (operational utility)
```python
import numpy as np
from sklearn.metrics import f1_score

best_f1, final_thresh = 0.0, 0.5
for t in np.linspace(0.99, 0.01, 500):
    preds = (xgb_fair_proba >= t).astype(int)
    # Selection (favorable-outcome) rate per protected group
    sel_unprivileged = preds[unprivileged_mask].mean()
    sel_privileged = preds[privileged_mask].mean()
    cur_di = sel_unprivileged / sel_privileged
    if 0.80 <= cur_di <= 1.25:
        f1 = f1_score(y_test, preds)
        if f1 > best_f1:
            best_f1, final_thresh = f1, t
```

| Metric | Baseline | After Mitigation | Change |
|---|---|---|---|
| Disparate Impact Ratio | ~0.796 | ≥ 0.80 | Compliant |
| Equal Opportunity Diff | Higher | Lower | Improved |
| ROC-AUC | 0.8651 | ~0.865 | Preserved |
| Accuracy Impact | – | < 1% | Negligible |
Key Finding: Fairness and predictive power are NOT mutually exclusive. The combined Reweighing + Best-F1 Threshold strategy achieves regulatory compliance while maintaining the full discriminative capacity of the original XGBoost model.
Requires Python >= 3.9.

```bash
git clone https://github.com/your-username/credit-risk-intelligence-engine.git
cd credit-risk-intelligence-engine
pip install -r requirements.txt
```

`requirements.txt`:

```
numpy>=1.23
pandas>=1.5
scikit-learn>=1.2
xgboost>=1.7
shap>=0.42
lime>=0.2
aif360>=0.5
matplotlib>=3.6
seaborn>=0.12
scipy>=1.10
```

Download `cs-training.csv` from Kaggle (Give Me Some Credit) and place it in the `data/` directory.
Open credit_risk_intelligence_engine_v2.ipynb in Jupyter or Google Colab and run all cells sequentially. The notebook is self-contained and will install missing dependencies automatically.
```python
import pickle
import json
import pandas as pd

# Load artifacts
with open('artifacts/xgboost_fair_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('artifacts/feature_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
with open('artifacts/fairness_thresholds.json') as f:
    config = json.load(f)
with open('artifacts/feature_columns.json') as f:
    features = json.load(f)

# Predict on a new applicant
applicant = pd.DataFrame([{
    'RevolvingUtilizationOfUnsecuredLines': 0.85,
    'age': 32,
    'DebtRatio': 0.45,
    'MonthlyIncome': 4500,
    'NumberOfOpenCreditLinesAndLoans': 7,
    'NumberRealEstateLoansOrLines': 1,
    'NumberOfDependents': 2,
    'CreditHistoryLength': 14,
    'TotalPastDue': 1
}])
applicant_scaled = scaler.transform(applicant[features])
risk_score = model.predict_proba(applicant_scaled)[0, 1]
threshold = config['global_fair_threshold']
decision = 'DEFAULT RISK' if risk_score >= threshold else 'LOW RISK'
print(f"Risk Score: {risk_score:.2%} -> {decision}")
```

```
1. DATA LOADING & EDA
   └─ Load cs-training.csv → shape inspection → missing value audit → class distribution
2. FEATURE ENGINEERING
   ├─ Error code correction (96/98 → 0 in delinquency columns)
   ├─ TotalPastDue aggregation
   ├─ MonthlyPayment = DebtRatio × MonthlyIncome
   ├─ IncomePerPerson = MonthlyIncome / (Dependents + 1)
   └─ AgeGroup binning (protected attribute)
3. STATISTICAL VALIDATION
   └─ Mann-Whitney U + Cohen's d → feature power ranking
4. MODEL TRAINING
   ├─ Logistic Regression (baseline)
   ├─ Random Forest (benchmark)
   └─ XGBoost (champion, early stopping + scale_pos_weight)
5. MODEL EVALUATION
   ├─ ROC-AUC, Precision, Recall, F1, Confusion Matrix
   ├─ ROC Curve + Precision-Recall Curve
   └─ Threshold analysis
6. EXPLAINABILITY
   ├─ SHAP TreeExplainer → global beeswarm plot
   └─ LIME → individual case breakdown
7. FAIRNESS AUDITING
   ├─ Baseline DI + EOD calculation (AIF360)
   ├─ AIF360 Reweighing (pre-processing)
   ├─ Re-training with instance weights
   ├─ Optimal threshold search (high→low scan)
   └─ Granular per-group audit table
8. ARTIFACT EXPORT
   └─ .pkl models + .json configs + .csv reports
```
| Decision | Approach | Why |
|---|---|---|
| Class imbalance | `scale_pos_weight` (XGBoost) + `class_weight='balanced'` (LR, RF) | Avoids majority-class collapse without SMOTE artifacts |
| Outlier handling | Clip `RevolvingUtilization` at 2.0, late-counts at 20 | Preserves the over-extension signal without extreme skew |
| Feature scaling | `StandardScaler` on XGBoost inputs | Required for LIME and fair-model convergence |
| Bias mitigation | Pre-processing (Reweighing) + post-processing (threshold) | Two-layer defense; neither alone is sufficient |
| Threshold strategy | High-to-low scan for the tightest DI-compliant F1 | Avoids the degenerate "approve everyone" solution |
| Explainability | SHAP (global) + LIME (local) | Different stakeholders need different explanation granularity |
- Streamlit Dashboard: real-time loan officer interface with per-applicant SHAP waterfall charts
- Model Drift Monitoring: PSI-based feature distribution tracking for production deployment
- Calibration Layer: Platt scaling / isotonic regression for well-calibrated probability outputs
- A/B Testing Framework: controlled threshold experimentation with statistical significance testing
- Intersectional Fairness: multi-attribute analysis (age × income group)
- API Deployment: FastAPI wrapper with model versioning and audit logging
- Kaggle Competition: Give Me Some Credit
- Explainability: SHAP (Lundberg & Lee, 2017)
- Local Explanations: LIME (Ribeiro et al., 2016)
- Fairness Toolkit: IBM AIF360
- Fairness Criterion: EEOC Uniform Guidelines, 80% Rule
- Gradient Boosting: XGBoost (Chen & Guestrin, 2016)
This project is licensed under the MIT License. See LICENSE for details.
Built with a commitment to both accuracy and equity in automated decision-making.
If this project helped you, consider starring the repo โญ