Model Training and Evaluation Concepts
Introduction
Before deploying an AI model, you need to train it properly and evaluate its performance rigorously. A model that performs well on training data but fails in production is worse than no model at all — it gives false confidence.
Imagine a student who memorizes all the answers from a past exam without understanding the concepts. They will score 100% on that specific exam, but fail when faced with new questions. This is exactly what happens to a model that overfits its training data.
1. The ML Training Pipeline
The training pipeline is the structured sequence of steps that transforms raw data into a deployable model.
[Diagram: ML Training Pipeline]
Each step in this pipeline is critical. Skipping or rushing any step can lead to models that seem good on paper but fail in production.
2. Data Splitting: Train / Validation / Test
Why Split Data?
We split data into separate sets to get an honest estimate of how well our model will perform on data it has never seen before.
[Diagram: Data Splitting Strategy]
| Set | Purpose | Usage | Typical Size |
|---|---|---|---|
| Training | Learn patterns from features | Used during model.fit() | 60-70% |
| Validation | Tune hyperparameters, select best model | Used during model selection | 15-20% |
| Test | Final unbiased evaluation | Used once at the end | 15-20% |
The test set must never be used for making decisions during development. It is only used for the final evaluation. If you look at test set performance and go back to modify your model, you have contaminated your evaluation.
Code Example: Splitting Data
```python
from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Result: 60% train, 20% validation, 20% test
print(f"Training: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
```
The stratify=y parameter ensures that each subset has the same class proportions as the original dataset. This is essential for imbalanced datasets (e.g., 95% class A, 5% class B).
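To see stratification at work, check the class proportions after a split. A self-contained sketch with a synthetic 95/5 dataset (the data and variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: 95% class 0, 5% class 1
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep roughly the original 95/5 ratio
print(f"Full:  {np.bincount(y) / len(y)}")
print(f"Train: {np.bincount(y_tr) / len(y_tr)}")
print(f"Test:  {np.bincount(y_te) / len(y_te)}")
```

Without `stratify=y`, a random 20% sample could easily contain far fewer than 10 minority examples, skewing every metric computed on it.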
3. Cross-Validation
The Problem with a Single Split
A single train/validation split can be lucky or unlucky. Maybe all the "easy" examples ended up in the training set. Cross-validation solves this by testing multiple splits.
Imagine judging a restaurant by eating a single dish on a single day. Maybe the chef was sick that day, or conversely it was their signature dish. Coming back k times and trying different dishes gives a far more reliable evaluation.
K-Fold Cross-Validation
| Type | Description | When to Use |
|---|---|---|
| K-Fold | Splits into K folds, each serving as test in turn | General use, K=5 or K=10 |
| Stratified K-Fold | K-Fold preserving class proportions | Imbalanced classification |
| Leave-One-Out (LOO) | K = number of samples | Very small datasets (< 100) |
| Repeated K-Fold | Repeats K-Fold multiple times with different seeds | Very robust estimation |
| Time Series Split | Respects the temporal order of data | Sequential / time series data |
Code Example: Cross-Validation
```python
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified 5-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
print(f"Scores per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

# Multiple metrics at once
results = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
    return_train_score=True
)
for metric in ['accuracy', 'f1_weighted']:
    train_score = results[f'train_{metric}'].mean()
    test_score = results[f'test_{metric}'].mean()
    print(f"{metric}: Train={train_score:.4f}, Val={test_score:.4f}")
```
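The Time Series Split row in the table above deserves special care: shuffling sequential data would leak future information into training. sklearn's TimeSeriesSplit keeps every training fold strictly before its validation fold; a minimal sketch on toy time-ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Training indices always precede validation indices
    print(f"Fold {fold}: train={train_idx}, val={val_idx}")
```

Each fold trains on a growing prefix of the series and validates on the block that immediately follows it, mimicking how the model would be used in production.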
4. Hyperparameter Tuning
Hyperparameters are settings you choose before training (unlike model parameters, which are learned during training).
| Parameter | Hyperparameter | Learned During Training? |
|---|---|---|
| Model weights | — | ✅ Yes |
| — | Learning rate | ❌ No (you choose it) |
| — | Number of trees (n_estimators) | ❌ No |
| — | Max depth (max_depth) | ❌ No |
| — | Regularization (C, alpha) | ❌ No |
Grid Search
Tests all possible combinations. Exhaustive but expensive.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Total: 3 × 4 × 3 × 3 = 108 combinations × 5 folds = 540 fits
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_
```
Random Search
Randomly samples the search space. More efficient when the space is large.
```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 10, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # floats sampled in [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # only 50 random combinations (vs 108+ for grid)
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")
```
| Method | Advantages | Disadvantages | When to Use |
|---|---|---|---|
| Grid Search | Exhaustive; finds the best combination within the grid | Very slow (cost grows combinatorially) | Small search space |
| Random Search | Faster, explores the space better | No guarantee of optimal | Large search space |
| Bayesian Optimization | Intelligent, converges quickly | More complex to implement | Expensive models to train |
5. Evaluation Metrics
Classification Metrics
The Confusion Matrix
The confusion matrix is the foundation of all classification metrics.
- True Positive: There is a fire, the alarm sounds ✅
- False Positive: No fire, the alarm sounds anyway (burnt toast) 🚨
- False Negative: There is a fire, but the alarm doesn't sound 💀
- True Negative: No fire, the alarm stays silent ✅
An FN (missing a real fire) is much more serious than an FP (false alarm). The choice of metric depends on the cost of errors.
Metrics Derived from the Confusion Matrix
| Metric | Formula | Question It Answers |
|---|---|---|
| Accuracy | (TP + TN) / Total | What fraction of predictions is correct? |
| Precision | TP / (TP + FP) | Among positive predictions, how many are true? |
| Recall (Sensitivity) | TP / (TP + FN) | Among actual positives, how many were detected? |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall |
| Specificity | TN / (TN + FP) | Among actual negatives, how many were correctly identified? |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, ConfusionMatrixDisplay
)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred, average='weighted'):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax, cmap='Purples')
ax.set_title("Confusion Matrix")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()
```
AUC-ROC Curve
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs. the False Positive Rate at various classification thresholds. The AUC (Area Under the Curve) summarizes this into a single number.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 - 1.0 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Fair |
| 0.5 | Random (no skill) |
| < 0.5 | Worse than random |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# For binary classification: probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='#7c3aed', lw=2, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
When to Use Which Metric?
| Scenario | Priority Metric | Why |
|---|---|---|
| Spam detection | Precision | Avoid marking legit emails as spam (minimize FP) |
| Cancer screening | Recall | Don't miss real cancers (minimize FN) |
| Balanced classes | Accuracy or F1 | Both error types equally costly |
| Imbalanced classes | F1, AUC-ROC | Accuracy is misleading with class imbalance |
| Ranking predictions | AUC-ROC | Measures ranking quality across thresholds |
On a dataset with 95% class A and 5% class B, a model that always predicts A will have 95% accuracy. This is why accuracy alone is insufficient for imbalanced datasets. Always use F1, Precision, Recall, and AUC-ROC as complements.
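This 95/5 failure mode is easy to reproduce: sklearn's DummyClassifier with strategy='most_frequent' ignores the features entirely, yet posts 95% accuracy. A self-contained sketch with synthetic labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)  # 95% class 0, 5% class 1

# Always predicts the majority class, learns nothing from X
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")        # 0.95, looks great
print(f"Recall (class 1): {recall_score(y, y_pred):.2f}")  # 0.00, useless
print(f"F1 (class 1): {f1_score(y, y_pred, zero_division=0):.2f}")
```

Recall and F1 on the minority class immediately expose what accuracy hides.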
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MSE (Mean Squared Error) | Σ(y − ŷ)² / n | Heavily penalizes large errors |
| RMSE (Root MSE) | √MSE | Same unit as the target variable |
| MAE (Mean Absolute Error) | Σ\|y − ŷ\| / n | Average absolute error |
| R² (Coefficient of Determination) | 1 − (SS_res / SS_tot) | Proportion of variance explained (1 is perfect; can be negative) |
| MAPE | (100 / n) × Σ\|y − ŷ\| / \|y\| | Average error as a percentage of the true value |
```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")
```
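The table also lists MAPE, which sklearn exposes as mean_absolute_percentage_error. Note that it returns a fraction rather than a percentage, and it blows up when true values are close to zero. A quick sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

# Per-sample relative errors: 0.10, 0.05, 0.10, 0.10 → mean 0.0875
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape * 100:.1f}%")  # 8.8%
```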
6. Overfitting vs. Underfitting
[Diagram: Bias-Variance Comparison]
| Characteristic | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Train accuracy | ❌ Low | ✅ High | ✅ Very High |
| Test accuracy | ❌ Low | ✅ High | ❌ Low |
| Model complexity | Too simple | Just right | Too complex |
| Bias | High | Low | Low |
| Variance | Low | Low | High |
How to Detect
```python
# Train and evaluate on both sets
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
gap = train_score - test_score

print(f"Train: {train_score:.4f}")
print(f"Test: {test_score:.4f}")
print(f"Gap: {gap:.4f}")

# Rough heuristics; sensible thresholds depend on the task
if train_score < 0.7 and test_score < 0.7:
    print("⚠️ Underfitting: model too simple")
elif gap > 0.10:
    print("⚠️ Overfitting: model too complex")
else:
    print("✅ Good generalization")
```
Remedies
| Problem | Solutions |
|---|---|
| Underfitting | Increase model complexity, add more features, reduce regularization, train longer |
| Overfitting | Add more data, increase regularization, reduce features, use dropout, early stopping, cross-validation |
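One remedy from the table, reducing model complexity, can be observed directly by capping a decision tree's depth on noisy synthetic data. A sketch (the dataset and exact scores are illustrative and depend on the random seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects 15% label noise, so memorizing the training set is harmful
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [None, 3]:  # unlimited vs capped depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```

The unrestricted tree memorizes the training set (train accuracy 1.0) and shows a large train/test gap; capping the depth shrinks the gap.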
7. Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between two sources of error.
- High bias: The shooter consistently aims too far to the left (systematic error)
- High variance: The shots are scattered all around the target (instability)
- Goal: Grouped shots at the center (low bias + low variance)
| | Low Variance | High Variance |
|---|---|---|
| Low Bias | ✅ Ideal (good generalization) | ⚠️ Overfitting |
| High Bias | ⚠️ Underfitting | ❌ Worst case |
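The archery analogy can be quantified: fit polynomials of increasing degree to many noisy resamples of the same underlying function, then measure squared bias (how far the average fit is from the truth) and variance (how much individual fits scatter). A NumPy-only sketch, with the function and noise level chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 30)
true_y = np.sin(np.pi * x)  # ground-truth function

results = {}
for degree in [1, 4, 10]:
    all_preds = []
    for _ in range(200):  # 200 noisy resamples of the same experiment
        y_noisy = true_y + rng.normal(0, 0.3, size=x.size)
        coefs = np.polyfit(x, y_noisy, deg=degree)
        all_preds.append(np.polyval(coefs, x))
    all_preds = np.array(all_preds)  # shape (200, 30)
    bias2 = np.mean((all_preds.mean(axis=0) - true_y) ** 2)  # squared bias
    variance = np.mean(all_preds.var(axis=0))
    results[degree] = (bias2, variance)
    print(f"degree={degree:2d}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

Degree 1 underfits (high bias, low variance), degree 10 chases the noise (low bias, high variance), and an intermediate degree balances the two.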
8. Learning Curves
Learning curves plot model performance as a function of training set size (or training iterations). They are one of the most useful visual diagnostics for telling overfitting apart from underfitting.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='#7c3aed')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='#f59e0b')
plt.plot(train_sizes, train_mean, 'o-', color='#7c3aed', label='Training Score')
plt.plot(train_sizes, val_mean, 's-', color='#f59e0b', label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
Interpreting Learning Curves
| Pattern | Diagnosis | Action |
|---|---|---|
| Both curves low and converging | Underfitting | Use more complex model or more features |
| Train high, validation low, gap persists | Overfitting | Add more data, regularize, simplify model |
| Both curves high and converging | Good fit | Model is ready for deployment |
| Validation curve still rising at the end | More data needed | Collect more training data |
9. Putting It All Together
Here is a complete example combining all concepts:
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Load data
data = load_iris()
X, y = data.data, data.target

# 2. Split: 60% train, 20% val, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# 3. Build a pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# 4. Hyperparameter tuning with cross-validation
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [3, 5, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid, cv=cv,
    scoring='f1_weighted', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best CV F1-score: {grid_search.best_score_:.4f}")

# 5. Validate on validation set
best_model = grid_search.best_estimator_
val_pred = best_model.predict(X_val)
print(f"\nValidation Accuracy: {accuracy_score(y_val, val_pred):.4f}")

# 6. Final evaluation on test set
test_pred = best_model.predict(X_test)
print("\nTest Set Results:")
print(classification_report(y_test, test_pred, target_names=data.target_names))
```
Summary
🔑 Key Takeaways
- Data Splitting: Always split into train/validation/test. The test set is only for final evaluation.
- Cross-Validation: Use K-Fold (K=5 or 10) for robust performance estimates.
- Hyperparameter Tuning: GridSearch for small spaces, RandomSearch for large spaces.
- Metrics: Choose the metric based on the cost of errors (Precision vs Recall).
- Confusion Matrix: Foundation of all classification metrics. Learn to read it.
- Overfitting: High training performance, low test performance → model too complex.
- Underfitting: Low performance everywhere → model too simple.
- Bias-Variance: The goal is to minimize both simultaneously.
- Learning Curves: Essential visual tool for diagnosing problems.
Further Reading
| Resource | Link |
|---|---|
| scikit-learn Model Selection Guide | sklearn.model_selection |
| Cross-Validation: Evaluating Estimator Performance | sklearn Cross-Validation |
| Metrics and Scoring | sklearn Metrics |
| Understanding the Bias-Variance Tradeoff | Scott Fortmann-Roe |