Architecture Guide¶
Understanding the system design of the Titanic Survival Predictor.
System Overview¶
The project follows a modular ML pipeline architecture with clear separation of concerns:
graph LR
A[Raw Data] --> B[Data Loader]
B --> C[Feature Engineering]
C --> D[Data Transformer]
D --> E[Model Trainer]
E --> F[Trained Model]
F --> G[Prediction Pipeline]
G --> H[Flask API]
H --> I[User/Client]
Component Overview¶
1. Data Layer (titanic_ml/data/)¶
Purpose: Load and manage raw datasets
- loader.py - Loads Kaggle CSV files, splits train/test
- transformer.py - Creates sklearn preprocessing pipelines
Key Design Decisions:
- Use pathlib for cross-platform compatibility
- Store processed data in artifacts/ for reproducibility
- Separate loading from transformation for flexibility
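As a concrete illustration of these decisions, here is a minimal sketch of what loader.py could look like. The DataLoader class and its load_data() method are referenced later in this guide; the artifact locations and split parameters shown here are assumptions, not the project's exact code.

```python
from pathlib import Path
from typing import Tuple

import pandas as pd
from sklearn.model_selection import train_test_split

ARTIFACTS_DIR = Path("artifacts")  # assumed location for processed data


class DataLoader:
    def __init__(self, raw_path: Path = Path("data/train.csv")):
        self.raw_path = raw_path

    def load_data(self) -> Tuple[Path, Path]:
        """Read the raw Kaggle CSV, split it, and persist both halves."""
        df = pd.read_csv(self.raw_path)
        train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

        ARTIFACTS_DIR.mkdir(exist_ok=True)
        train_path = ARTIFACTS_DIR / "train.csv"
        test_path = ARTIFACTS_DIR / "test.csv"
        train_df.to_csv(train_path, index=False)
        test_df.to_csv(test_path, index=False)
        return train_path, test_path
```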
2. Feature Layer (titanic_ml/features/)¶
Purpose: Domain-specific feature engineering
- build_features.py - Creates derived features
Features Created:
- cabin_multiple - Number of cabins (wealth indicator)
- name_title - Extracted title (social status)
- norm_fare - Log-normalized fare (handles skewness)
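A hedged sketch of how these three features could be derived with pandas; the column names follow the Kaggle dataset, and the actual build_features.py may implement them differently.

```python
import numpy as np
import pandas as pd


def add_titanic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative derivations; build_features.py may differ in detail."""
    df = df.copy()
    # Number of cabins listed on the ticket (0 when Cabin is missing).
    df["cabin_multiple"] = df["Cabin"].apply(
        lambda c: 0 if pd.isna(c) else len(str(c).split())
    )
    # Title from "Surname, Title. Given names" (e.g. "Mr", "Mrs", "Master").
    df["name_title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
    # Log-normalized fare to reduce right skew.
    df["norm_fare"] = np.log1p(df["Fare"])
    return df
```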
Why separate from transformation?
- Feature logic is business/domain specific
- Transformations are generic sklearn operations
- Easier to test and modify independently
3. Model Layer (titanic_ml/models/)¶
Purpose: Model training and inference
- train.py - Trains multiple models with GridSearchCV
- predict.py - Inference pipeline with preprocessing
Key Features:
- Model selection via cross-validation
- Hyperparameter tuning
- Ensemble methods (VotingClassifier)
- Model versioning and serialization
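A condensed sketch of how this training step could be wired together. Only GridSearchCV and VotingClassifier are stated in this guide; the candidate models, hyperparameter grids, and function name below are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def train_candidates(X_train, y_train):
    # Candidate estimators and grids are assumptions, not the project's config.
    candidates = {
        "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
        "Random Forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    }
    tuned = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
        search.fit(X_train, y_train)
        tuned[name] = search.best_estimator_

    # Combine the tuned candidates into a soft-voting ensemble.
    ensemble = VotingClassifier(estimators=list(tuned.items()), voting="soft")
    ensemble.fit(X_train, y_train)
    return ensemble
```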
4. API Layer (titanic_ml/app/)¶
Purpose: Serve predictions via web interface
- routes.py - Flask endpoints and form handling
Endpoints:
- / - Landing page
- /prediction - Prediction form and results
- /health - Health check for monitoring
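For example, the health check endpoint might look roughly like this (the blueprint name and response payload are assumptions):

```python
from flask import Blueprint, jsonify

bp = Blueprint("main", __name__)  # hypothetical blueprint name


@bp.route("/health")
def health():
    # Lightweight liveness probe used by the deployment platform.
    return jsonify({"status": "ok"}), 200
```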
5. Utilities (titanic_ml/utils/)¶
Purpose: Common functionality shared across components
- logger.py - Azure-ready structured logging
- exception.py - Custom exception handling
- helpers.py - Model persistence, evaluation
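A hedged sketch of the persistence side of helpers.py; the function names save_object and load_object are assumptions based on the description above.

```python
import pickle
from pathlib import Path


def save_object(path: Path, obj) -> None:
    """Serialize a model or preprocessor to disk (function name is assumed)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path: Path):
    """Load a previously saved object."""
    with open(path, "rb") as f:
        return pickle.load(f)
```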
Data Flow¶
Training Pipeline¶
# 1. Load raw data
loader = DataLoader()
train_path, test_path = loader.load_data()

# 2. Engineer features
train_df = pd.read_csv(train_path)
train_df, num_cols, cat_cols = apply_feature_engineering(train_df)

# 3. Create preprocessor
transformer = DataTransformer()
X_train, y_train, X_test, y_test, preprocessor = transformer.transform_data(
    train_path, test_path
)

# 4. Train models
trainer = ModelTrainer()
best_model, score = trainer.train(X_train, y_train, X_test, y_test)

# Models saved to: models/model.pkl, models/preprocessor.pkl
Prediction Pipeline¶
# 1. Create input data
custom_data = CustomData(age=25, sex='female', ...)

# 2. Load pipeline
pipeline = PredictPipeline()

# 3. Make prediction
predictions, probabilities = pipeline.predict(
    custom_data.get_data_as_dataframe()
)
Design Patterns¶
1. Pipeline Pattern¶
Used in: Data transformation and model training
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
Benefits:
- Composable transformations
- Prevents data leakage
- Easy to serialize and deploy
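The same pattern extends to combining the numeric and categorical branches into a single preprocessor. A sketch using sklearn's ColumnTransformer; the column lists here are illustrative, not the project's exact configuration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Fit on training data only, then reuse the fitted object for test and serving;
# this is what prevents test-set statistics from leaking into the transform.
preprocessor = ColumnTransformer([
    ("num", num_pipeline, ["Age", "SibSp", "norm_fare"]),   # illustrative columns
    ("cat", cat_pipeline, ["Pclass", "Sex", "name_title"]),
])
```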
2. Factory Pattern¶
Used in: Model creation
def get_models(self) -> Dict[str, Any]:
    return {
        'Logistic Regression': LogisticRegression(...),
        'Random Forest': RandomForestClassifier(...),
        ...
    }
Benefits:
- Centralized model configuration
- Easy to add/remove models
- Consistent interface
3. Dependency Injection¶
Used in: Configurable paths
class ModelTrainer:
    def __init__(self, model_path: Optional[Path] = None):
        self.model_path = model_path or MODEL_PATH
Benefits:
- Testable without file I/O
- Flexible configuration
- Supports different environments
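This is what makes the trainer testable without touching the real models/ directory. A hypothetical pytest case (the import path is an assumption):

```python
from pathlib import Path

from titanic_ml.models.train import ModelTrainer  # assumed import path


def test_trainer_accepts_injected_path(tmp_path: Path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    trainer = ModelTrainer(model_path=tmp_path / "model.pkl")
    assert trainer.model_path == tmp_path / "model.pkl"
```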
Configuration Management¶
All configuration centralized in titanic_ml/config/settings.py:
# Paths
PROJECT_ROOT = Path(__file__).parent.parent.parent
DATA_DIR = PROJECT_ROOT / "data"
MODEL_PATH = PROJECT_ROOT / "models" / "model.pkl"
# Model settings
CV_FOLDS = 5
RANDOM_STATE = 42
# Features
NUMERICAL_FEATURES = ['Age', 'SibSp', ...]
CATEGORICAL_FEATURES = ['Pclass', 'Sex', ...]
Benefits:
- Single source of truth
- Easy to modify
- Environment-specific configs
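One way to support environment-specific configs while keeping settings.py the single source of truth is to read environment variables with sensible defaults. A sketch; the project may handle this differently:

```python
import os
from pathlib import Path

# Defaults live in settings.py; deployments can override them via env vars.
PROJECT_ROOT = Path(__file__).parent.parent.parent
MODEL_PATH = Path(os.environ.get("MODEL_PATH", PROJECT_ROOT / "models" / "model.pkl"))
CV_FOLDS = int(os.environ.get("CV_FOLDS", 5))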
Error Handling¶
Custom exception class with logging integration:
try:
    result = risky_operation()
except Exception as e:
    logging.error(f"Operation failed: {e}")
    raise CustomException(e, sys)
Features:
- Preserves stack trace
- Structured logging
- Centralized error handling
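A sketch of what such a CustomException could look like. The real exception.py is not reproduced in this guide; the idea is simply to capture the file and line of the original error via sys.exc_info() so the log entry stays actionable.

```python
import sys


class CustomException(Exception):
    """Illustrative: wraps the original error with file/line context."""

    def __init__(self, error: Exception, sys_module=sys):
        _, _, tb = sys_module.exc_info()
        if tb is not None:
            file_name = tb.tb_frame.f_code.co_filename
            message = f"{error} (in {file_name}, line {tb.tb_lineno})"
        else:
            message = str(error)
        super().__init__(message)
```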
Logging Strategy¶
Structured Logging¶
logging.info(
    "Prediction completed",
    extra={'extra_fields': {
        'prediction': int(prediction),
        'probability': float(probability),
        'user_ip': request.remote_addr
    }}
)
Benefits:
- Queryable in Azure Application Insights
- Easy to filter and aggregate
- Production monitoring ready
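For the extra_fields above to reach Application Insights as structured data, the handler needs a formatter that serializes them. A minimal JSON-formatter sketch; logger.py may implement this differently:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge the fields passed via extra={'extra_fields': {...}}.
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```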
Log Levels¶
- DEBUG: Development debugging
- INFO: Normal operations, request tracking
- WARNING: Degraded functionality
- ERROR: Failures requiring attention
Testing Strategy¶
Unit Tests¶
Test individual functions in isolation:
def test_cabin_multiple_creation():
    df = pd.DataFrame({'Cabin': ['A1', 'B1 B2', np.nan]})
    result, _, _ = apply_feature_engineering(df)
    assert result['cabin_multiple'].tolist() == [1, 2, 0]
Integration Tests¶
Test component interactions:
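For example, a sketch of exercising the Flask app end to end through its test client. The create_app factory and the exact routes are assumptions; adjust to however the app is actually constructed:

```python
import pytest

from titanic_ml.app import create_app  # assumed app factory


@pytest.fixture
def client():
    app = create_app()
    app.config["TESTING"] = True
    return app.test_client()


def test_health_endpoint(client):
    assert client.get("/health").status_code == 200


def test_prediction_page_renders(client):
    assert client.get("/prediction").status_code == 200
```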
Deployment Architecture¶
Local Development¶
┌─────────────┐
│ Developer │
└──────┬──────┘
│
▼
┌─────────────┐
│ Flask App │ (localhost:5000)
│ + Models │
└─────────────┘
Docker Container¶
┌─────────────────────────────┐
│ Docker Container │
│ ┌─────────────┐ │
│ │ Flask App │ │
│ │ + Models │ │
│ └──────┬──────┘ │
│ │ │
│ Port 5000 │
└─────────┼───────────────────┘
│
▼
User Request
Cloud Deployment (Render/Azure)¶
┌─────────────┐
│ GitHub │
│ Repository │
└──────┬──────┘
│ Push
▼
┌─────────────────┐
│ GitHub Actions │ Build Docker Image
└──────┬──────────┘
│
▼
┌─────────────────┐
│ Docker Hub / │
│ ACR │
└──────┬──────────┘
│
▼
┌─────────────────┐
│ Render/Azure │ Auto-deploy
│ Web Service │
└──────┬──────────┘
│
▼
Public URL
Scalability Considerations¶
Current Limitations¶
- Single instance: No load balancing
- In-memory models: ~1GB RAM per instance
- Synchronous API: Blocks on long requests
Scaling Options¶
Horizontal Scaling¶
# docker-compose.yml
services:
  web:
    image: titanic-ml
    deploy:
      replicas: 3
  nginx:
    image: nginx
    # Load balancer config
Model as Service¶
Serve the model from a dedicated process or container that the web app calls over HTTP, so inference and the UI can scale independently.
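A hedged sketch of such a dedicated model service. The port, file paths, and response shape are assumptions; it simply loads the serialized preprocessor and model and exposes a single /predict endpoint:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask("model-service")
preprocessor = joblib.load("models/preprocessor.pkl")
model = joblib.load("models/model.pkl")


@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON object with the raw passenger fields.
    payload = pd.DataFrame([request.get_json()])
    features = preprocessor.transform(payload)
    probability = float(model.predict_proba(features)[0, 1])
    return jsonify({"survived": int(probability >= 0.5), "probability": probability})


if __name__ == "__main__":
    app.run(port=8001)  # assumed port, separate from the web app
```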
Caching¶
Add Redis for frequent predictions:
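A sketch of how repeated inputs could be answered from Redis before touching the model. The key scheme, TTL, and the assumption that PredictPipeline.predict returns 1-D arrays are illustrative; this also adds a redis dependency the project does not currently have:

```python
import hashlib
import json

import pandas as pd
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def cached_predict(pipeline, input_dict: dict, ttl_seconds: int = 3600):
    # Deterministic cache key from the sorted input payload.
    key = "pred:" + hashlib.sha256(
        json.dumps(input_dict, sort_keys=True).encode()
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    # Assumes PredictPipeline.predict returns 1-D prediction/probability arrays.
    predictions, probabilities = pipeline.predict(pd.DataFrame([input_dict]))
    result = {"prediction": int(predictions[0]), "probability": float(probabilities[0])}
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result
```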
Security Considerations¶
Current Implementation¶
✅ Input validation (type checking)
✅ No SQL injection (no database)
✅ HTTPS in production (via platform)
❌ No authentication
❌ No rate limiting
❌ No input sanitization
Recommended Enhancements¶
- Add authentication: JWT tokens or API keys
- Rate limiting: Flask-Limiter (see the sketch below)
- Input validation: Pydantic schemas
- CORS: Restrict allowed origins
- Security headers: Flask-Talisman
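As a concrete starting point, the rate-limiting and security-header items above could be wired in as follows. This is a sketch using Flask-Limiter and Flask-Talisman, both extra dependencies that are not part of the current project:

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
from flask_talisman import Talisman

app = Flask(__name__)

# Per-client rate limiting keyed on the remote address.
limiter = Limiter(get_remote_address, app=app, default_limits=["60 per minute"])

# Security headers (HSTS, frame options, etc.); CSP left off for the simple UI.
Talisman(app, content_security_policy=None)
```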
Performance Optimization¶
Current Performance¶
- Model loading: ~2-3s (startup)
- Prediction: ~10-50ms per request
- Memory: ~1GB (with models loaded)
Optimization Opportunities¶
- Model Quantization: Reduce model size
- Batch Predictions: Process multiple requests at once
- Model Caching: Keep models warm in memory between requests
- Async Predictions: Use Celery for long-running tasks
Technology Choices¶
Why Flask?¶
- ✅ Lightweight and simple
- ✅ Easy to deploy
- ✅ Good for MVP
- ❌ Not async by default
- ❌ Fewer features than FastAPI
Alternative: FastAPI for production scale
Why sklearn?¶
- ✅ Industry standard
- ✅ Excellent documentation
- ✅ Easy model persistence
- ✅ Broad coverage of classical ML algorithms
Why Docker?¶
- ✅ Consistent environments
- ✅ Easy deployment
- ✅ Platform agnostic
- ✅ Resource isolation
Future Improvements¶
Potential enhancements for production deployments:
- Model Monitoring - Track model performance and drift detection
- Model Versioning - Implement versioning system for model rollbacks
- A/B Testing - Framework for testing different model versions
- Explainability Dashboard - Interactive SHAP visualization dashboard
- Extended Test Coverage - Expand test suite for edge cases
Diagram: Full System¶
graph TB
subgraph "Data Layer"
A[Raw CSV] --> B[Data Loader]
B --> C[Feature Engineering]
C --> D[Data Transformer]
end
subgraph "Model Layer"
D --> E[Model Trainer]
E --> F[Model Registry]
F --> G[Trained Models]
end
subgraph "API Layer"
G --> H[Prediction Pipeline]
H --> I[Flask Routes]
I --> J[Web UI]
end
subgraph "Infrastructure"
J --> K[Docker Container]
K --> L[Render/Azure]
L --> M[Users]
end
Questions?¶
- Review API Reference
- Check Deployment Guide
- Open an issue on GitHub