# Methodology
This document explains the machine learning approach used in the Titanic Survival Prediction project.
## 1. Data Exploration

### Initial Dataset Analysis
The Titanic dataset contains 891 training samples with 11 features:
Numeric Features:

- Age: Passenger age (177 missing values)
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Fare: Ticket price

Categorical Features:

- Pclass: Ticket class (1st, 2nd, 3rd)
- Sex: Gender (male, female)
- Embarked: Port of embarkation (C/Q/S, 2 missing values)
- Cabin: Cabin number (687 missing values)
- Ticket: Ticket number
- Name: Passenger name

Target Variable:

- Survived: 0 (did not survive) or 1 (survived)
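These counts can be verified directly from the training CSV; a minimal sketch, assuming the Kaggle file is named train.csv:

```python
import pandas as pd

# Load the Kaggle training split (file name assumed to be train.csv)
train = pd.read_csv('train.csv')

print(train.shape)         # (891, 12): 11 features plus the Survived target
print(train.isna().sum())  # Age, Cabin and Embarked contain missing values
```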
### Key Findings from EDA
- Class Imbalance: 62% did not survive vs 38% survived
- Gender Disparity: ~74% of females survived vs ~19% of males
- Class Effect: 1st class had ~63% survival rate vs 24% in 3rd class
- Age Distribution: Children (<16) had higher survival rates
- Fare Correlation: Higher fares strongly correlated with survival
## 2. Feature Engineering

### Created Features

#### 1. cabin_multiple
```python
import pandas as pd

# Number of cabins a passenger booked (0 when the Cabin value is missing)
cabin_multiple = lambda x: 0 if pd.isna(x) else len(x.split(' '))
```
#### 2. name_title
Rationale: Titles (Mr., Mrs., Master, etc.) encode both age and social status.
Title Distribution:

- Mr: 517 passengers
- Miss: 182 passengers
- Mrs: 125 passengers
- Master: 40 passengers (young boys)
- Rare titles: Rev, Dr, Col, etc.
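A minimal sketch of how this title can be extracted, assuming the standard "Surname, Title. Given names" format of the Name column and an illustrative DataFrame name df:

```python
# Take the text between the comma and the first period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
df['name_title'] = df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
```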
#### 3. norm_fare
Rationale: Fare is heavily right-skewed; a log transformation yields a much more symmetric, approximately normal distribution.
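A minimal sketch of the transformation, assuming norm_fare is defined as log(1 + Fare) so that zero fares stay finite (the project's exact formula may differ):

```python
import numpy as np

# log(1 + Fare) compresses the long right tail while keeping zero fares defined
df['norm_fare'] = np.log1p(df['Fare'])
```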
### Dropped Features
- Cabin: 77% missing values, too sparse for reliable imputation
- Ticket: High cardinality, no clear pattern
- PassengerId: Just an identifier
- Name: Information extracted into name_title
## 3. Data Preprocessing

### Missing Value Imputation
| Feature | Strategy | Rationale |
|---|---|---|
| Age | Median (28 years) | Robust to outliers, represents central tendency |
| Fare | Median ($14.45) | Only 1 missing value in test set |
| Embarked | Most frequent (S) | Only 2 missing values |
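Expressed in pandas, the imputation amounts to a few fillna calls; the sketch below is illustrative (DataFrame name df assumed), not necessarily the project's exact code:

```python
# Impute with the strategies from the table above
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # most frequent port: 'S'
```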
### Feature Scaling
StandardScaler applied to numeric features:

- Age
- SibSp
- Parch
- norm_fare
Formula: z = (x - μ) / σ
Rationale:

- Required for distance-based algorithms (KNN, SVM)
- Improves convergence for gradient descent
- Prevents features with large numeric ranges from dominating the others
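A sketch of the scaling step with scikit-learn's StandardScaler (DataFrame name df is illustrative):

```python
from sklearn.preprocessing import StandardScaler

num_cols = ['Age', 'SibSp', 'Parch', 'norm_fare']
scaler = StandardScaler()
# Replace each numeric column with its z-score: (x - mean) / std
df[num_cols] = scaler.fit_transform(df[num_cols])
```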
### Encoding Categorical Variables
OneHotEncoding applied to:

- Pclass → Pclass_1, Pclass_2, Pclass_3
- Sex → Sex_female, Sex_male
- Embarked → Embarked_C, Embarked_Q, Embarked_S
- name_title → name_title_Master, name_title_Miss, etc.
Final Feature Count: 40+ features after encoding
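The column naming above matches what pandas.get_dummies produces, so the encoding step might look like the following sketch (whether the project used get_dummies or scikit-learn's OneHotEncoder is an assumption):

```python
import pandas as pd

cat_cols = ['Pclass', 'Sex', 'Embarked', 'name_title']
# One indicator column per category, named <column>_<value>
df = pd.get_dummies(df, columns=cat_cols)
```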
## 4. Model Selection

### Baseline Models Evaluated
- Logistic Regression (82.1%)
  - Simple, interpretable baseline
  - Linear decision boundary
- K-Nearest Neighbors (80.5%)
  - Non-parametric, instance-based
  - Sensitive to scaling (hence the preprocessing above)
- Decision Tree (77.6%)
  - Non-linear decision boundaries
  - Prone to overfitting
- Random Forest (80.6%)
  - Ensemble of decision trees
  - Reduces overfitting through bagging
- Naive Bayes (72.6%)
  - Probabilistic classifier
  - Assumes feature independence (violated here)
- Support Vector Classifier (83.2%)
  - Finds the optimal separating hyperplane
  - Kernel trick for non-linearity
- XGBoost (81.8%)
  - Gradient boosting framework
  - Sequential error correction
- CatBoost (baseline not run)
  - Handles categorical features natively
  - Robust to overfitting
### Model Selection Criteria
- Primary Metric: Accuracy (Kaggle competition metric)
- Secondary Metric: Weighted F1-score (better for imbalance)
- Cross-Validation: 5-fold stratified CV
- Reproducibility: Fixed random seed (42)
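A sketch of how such a baseline comparison is typically run under these criteria; estimator settings and the variable names X_train / y_train are illustrative:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# 5-fold stratified CV with the fixed seed used throughout the project
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                    ('SVC', SVC())]:
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')
```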
## 5. Hyperparameter Tuning

### GridSearchCV Configuration
```python
from sklearn.model_selection import GridSearchCV

GridSearchCV(
    estimator=model,
    param_grid=parameters,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,           # Use all CPU cores
    verbose=1
)
```
### Tuned Hyperparameters

#### XGBoost (Best Performer: 85.3%)
```python
{
    'n_estimators': 550,
    'learning_rate': 0.5,
    'max_depth': 10,
    'colsample_bytree': 0.75,
    'subsample': 0.6,
    'gamma': 0.5,
    'reg_lambda': 10
}
```
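For reference, a sketch of instantiating XGBoost with these tuned values (X_train / y_train are the illustrative names used above):

```python
from xgboost import XGBClassifier

best_xgb = XGBClassifier(
    n_estimators=550, learning_rate=0.5, max_depth=10,
    colsample_bytree=0.75, subsample=0.6, gamma=0.5,
    reg_lambda=10, random_state=42,  # seed per the reproducibility criterion
)
best_xgb.fit(X_train, y_train)
```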
#### Random Forest (83.6%)
```python
{
    'n_estimators': 300,
    'criterion': 'gini',
    'max_depth': 15,
    'max_features': 'sqrt',
    'min_samples_split': 2,
    'min_samples_leaf': 1
}
```
#### CatBoost (84.2%)

## 6. Ensemble Methods

### Voting Classifier
Combined predictions from multiple models:
- Hard Voting: Majority vote
- Soft Voting: Average probabilities

Best Configuration:

- Models: XGBoost, Random Forest, CatBoost
- Voting: Soft
- Weights: Optimized via GridSearchCV
- Performance: 85.1% accuracy
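A minimal sketch of the soft-voting ensemble; the tuned hyperparameters are omitted for brevity and the weights shown are placeholders, since the actual weights were found via GridSearchCV:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

voting = VotingClassifier(
    estimators=[('xgb', XGBClassifier()),
                ('rf', RandomForestClassifier()),
                ('cat', CatBoostClassifier(verbose=0))],
    voting='soft',       # average the predicted class probabilities
    weights=[1, 1, 1],   # placeholder; the project tuned these
)
voting.fit(X_train, y_train)
```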
### Why Ensembles Work
- Diversity: Different algorithms capture different patterns
- Bias-Variance Trade-off: Averaging predictions reduces variance, and with it overfitting
- Robustness: More stable predictions
- Error Correction: Models compensate for each other's weaknesses
## 7. Model Evaluation

### Metrics Used
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
  - Primary Kaggle metric
  - Overall correctness
- Weighted F1-Score: class-frequency-weighted average of per-class F1 (the harmonic mean of precision and recall)
  - Better for imbalanced classes
  - Accounts for class distribution
- ROC-AUC: Area under the ROC curve
  - Threshold-independent
  - Measures ranking quality
- Confusion Matrix:
  - True Positives, False Positives
  - True Negatives, False Negatives
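These metrics map directly onto scikit-learn helpers; a minimal sketch (y_test, y_pred and y_proba are illustrative names for the held-out labels, hard predictions, and class-1 probabilities):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

print(accuracy_score(y_test, y_pred))                # primary Kaggle metric
print(f1_score(y_test, y_pred, average='weighted'))  # weighted F1
print(roc_auc_score(y_test, y_proba))                # ranking quality
print(confusion_matrix(y_test, y_pred))              # TP/FP/FN/TN breakdown
```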
### Cross-Validation Strategy

5-Fold Stratified CV:

- Maintains class proportions in each fold
- Reduces variance in performance estimates
- Detects overfitting
## 8. Feature Importance

### Top 10 Features (XGBoost)
- norm_fare (0.182): Strongest predictor
- Sex_male (0.156): Gender critical to survival
- Age (0.143): "Women and children first"
- name_title_Mr (0.089): Adult male indicator
- Pclass_3 (0.072): Lower class disadvantage
- name_title_Mrs (0.058): Married women prioritized
- SibSp (0.047): Family size effect
- Pclass_1 (0.041): First class advantage
- cabin_multiple (0.038): Cabin access
- Embarked_S (0.032): Port effect
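A ranking like this can be read straight off a fitted XGBoost model; a sketch using the illustrative names best_xgb and X_train from earlier:

```python
import pandas as pd

importances = pd.Series(best_xgb.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```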
### Insights
- Socioeconomic Status: Fare and class dominate
- Demographic Factors: Gender and age crucial
- Social Context: Titles encode multiple signals
- Feature Engineering: Created features add value
## 9. Results & Kaggle Submission

### Final Model Performance
| Submission | Model | CV Accuracy | Kaggle Score |
|---|---|---|---|
| 01 | Voting Baseline | 84.5% | TBD |
| 02 | XGBoost Tuned | 85.3% | TBD |
| 03 | Optimized Ensemble | 85.1% | TBD |
### Key Takeaways
- Feature Engineering Crucial: +2-3% improvement
- Hyperparameter Tuning Effective: +1-3% per model
- Ensembles Provide Stability: Consistent performance
- Data Quality > Model Complexity: Clean preprocessing essential
## 10. Future Improvements
Potential enhancements:
- Advanced Feature Engineering:
  - Family group survival analysis
  - Deck location from cabin
  - Ticket prefix patterns
- Model Stacking:
  - Meta-learner on top of base models
  - Potentially higher accuracy
- Neural Networks:
  - TabNet or entity embeddings
  - May capture complex interactions
- Feature Selection:
  - Recursive feature elimination
  - Reduce overfitting risk
- External Data:
  - Historical passenger manifests
  - Ship layout information
This methodology follows established best practices for tabular classification problems.