Class Imbalance Handling¶
Strategies for dealing with imbalanced classes in the Titanic dataset.
Problem¶
The Titanic dataset has imbalanced classes:

- 62% Did Not Survive (0)
- 38% Survived (1)
This can lead to models biased toward the majority class.
Solution: Stratified Cross-Validation¶
All cross-validation uses StratifiedKFold to maintain class distribution in each fold.
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
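As a usage sketch, the same `cv` object can be passed to `cross_val_score` so every model is scored on identically stratified folds. Here `X` and `y` stand in for the preprocessed Titanic features and the survival target:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# X, y: assumed preprocessed feature matrix and survival target.
# Each of the 5 folds keeps roughly the 62/38 class split of the full dataset.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```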
Benefits¶
- Keeps the class proportions in each fold close to those of the full dataset
- More reliable performance estimates
- Reduces variance across folds
- Better for small datasets
Additional Techniques¶
1. Class Weights¶
Penalize misclassifications of the minority class more heavily:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
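For intuition, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below computes the same weights explicitly with `compute_class_weight` and passes them as a dict; `X_train` and `y_train` are assumed to be the training split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# y_train: assumed training labels (0 = did not survive, 1 = survived).
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
# With a 62/38 split this is roughly {0: 0.81, 1: 1.32}, so errors on the
# minority class cost about 1.6x more than errors on the majority class.
model = LogisticRegression(class_weight=dict(zip(classes, weights)), max_iter=1000)
model.fit(X_train, y_train)
```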
2. SMOTE¶
SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, rather than simply duplicating them:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
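One caveat worth making explicit: oversampling must happen inside each training fold only, never before the split, otherwise synthetic points leak into the validation data. A minimal sketch using imblearn's `Pipeline`, which applies SMOTE only when fitting on the training portion of each fold (`X` and `y` are again assumed to exist):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE is applied only to the training part of each fold; validation folds stay untouched.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(f"F1 with SMOTE inside CV: {scores.mean():.3f}")
```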
3. Ensemble Methods¶
Use models robust to imbalance:
- Random Forest: class_weight parameter ('balanced' or 'balanced_subsample')
- XGBoost: scale_pos_weight parameter (see the sketch after this list)
- CatBoost: class_weights and auto_class_weights parameters
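As an illustration of the XGBoost option, `scale_pos_weight` is conventionally set to the ratio of negative to positive training samples. A minimal sketch, assuming `xgboost` is installed and `X_train`/`y_train` are the training split:

```python
import numpy as np
from xgboost import XGBClassifier

# Ratio of majority (did not survive) to minority (survived) samples,
# roughly 62/38, i.e. about 1.6, on the Titanic training data.
ratio = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
```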
4. Evaluation Metrics¶
Use metrics appropriate for imbalanced data:

- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Threshold-independent
- Precision-Recall Curve: Better than ROC for imbalance
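A minimal sketch of computing these metrics on held-out predictions, with hypothetical `y_test`, hard predictions `y_pred`, and predicted probabilities `y_proba`:

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# y_test: true labels; y_pred: predicted classes; y_proba: predicted P(survived).
print("F1:     ", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
# Average precision summarizes the precision-recall curve in a single number.
print("PR-AUC: ", average_precision_score(y_test, y_proba))
```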
Current Implementation¶
The project uses:
- ✅ Stratified K-Fold CV (5 folds)
- ✅ Weighted F1-Score as secondary metric
- ✅ XGBoost/CatBoost (imbalance-aware models)
- ✅ ROC curves for threshold analysis
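Putting these pieces together, a hedged sketch of what such an evaluation loop can look like (the actual project code may differ; `xgboost` is assumed to be installed and `X`, `y` to be the preprocessed features and target):

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = XGBClassifier(eval_metric="logloss", random_state=42)

# Report accuracy together with weighted F1 and ROC-AUC across the 5 stratified folds.
results = cross_validate(model, X, y, cv=cv,
                         scoring=["accuracy", "f1_weighted", "roc_auc"])
for metric in ("test_accuracy", "test_f1_weighted", "test_roc_auc"):
    print(metric, results[metric].mean().round(3))
```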
When to Use Each Technique¶
| Technique | Use When | Avoid When |
|---|---|---|
| Stratified CV | Always (best practice) | Never |
| Class Weights | Moderate imbalance (roughly 70/30 to 50/50 split) | Extreme imbalance (minority class < 5%) |
| SMOTE | Severe imbalance (minority class < 20%) | Minority class already has sufficient samples |
| Ensemble | Any imbalance | Need interpretability |