Model Training and Evaluation Concepts
Introduction
Before deploying an AI model, you need to train it properly and evaluate its performance rigorously. A model that performs well on training data but fails in production is worse than no model at all — it gives false confidence.
Imagine a student who memorizes all the answers from a past exam without understanding the concepts. They will score 100% on that specific exam, but fail when faced with new questions. This is exactly what happens to a model that overfits its training data.
1. The ML Training Pipeline
The training pipeline is the structured sequence of steps that transforms raw data into a deployable model.
[Diagram: ML Training Pipeline]
Each step in this pipeline is critical. Skipping or rushing any step can lead to models that seem good on paper but fail in production.
2. Data Splitting: Train / Validation / Test
Why Split Data?
We split data into separate sets to get an honest estimate of how well our model will perform on data it has never seen before.
[Diagram: Data Splitting Strategy]
| Set | Purpose | Usage | Typical Size |
|---|---|---|---|
| Training | Learn patterns from features | Used during model.fit() | 60-70% |
| Validation | Tune hyperparameters, select best model | Used during model selection | 15-20% |
| Test | Final unbiased evaluation | Used once at the end | 15-20% |
The test set must never be used for making decisions during development. It is only used for the final evaluation. If you look at test set performance and go back to modify your model, you have contaminated your evaluation.
Code Example: Splitting Data
```python
from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Result: 60% train, 20% validation, 20% test
print(f"Training: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
```
The stratify=y parameter ensures that each subset has the same class proportions as the original dataset. This is essential for imbalanced datasets (e.g., 95% class A, 5% class B).
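To see stratification at work, check the class proportions after a split. A self-contained sketch with a synthetic 95/5 dataset (the data and variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: 95% class 0, 5% class 1
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep roughly the original 95/5 ratio
print(f"Full:  {np.bincount(y) / len(y)}")
print(f"Train: {np.bincount(y_tr) / len(y_tr)}")
print(f"Test:  {np.bincount(y_te) / len(y_te)}")
```

Without `stratify=y`, a random 20% sample could easily contain far fewer than 10 minority examples, skewing every metric computed on it.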
3. Cross-Validation
The Problem with a Single Split
A single train/validation split can be lucky or unlucky. Maybe all the "easy" examples ended up in the training set. Cross-validation solves this by testing multiple splits.
Imagine judging a restaurant by eating a single dish on a single day. Maybe the chef was sick that day, or conversely it was their signature dish. Coming back k times and trying different dishes gives a far more reliable evaluation.
K-Fold Cross-Validation
| Type | Description | When to Use |
|---|---|---|
| K-Fold | Splits into K folds, each serving as test in turn | General use, K=5 or K=10 |
| Stratified K-Fold | K-Fold preserving class proportions | Imbalanced classification |
| Leave-One-Out (LOO) | K = number of samples | Very small datasets (< 100) |
| Repeated K-Fold | Repeats K-Fold multiple times with different seeds | Very robust estimation |
| Time Series Split | Respects the temporal order of data | Sequential / time series data |
Code Example: Cross-Validation
```python
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified 5-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')
print(f"Scores per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

# Multiple metrics at once
results = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
    return_train_score=True
)
for metric in ['accuracy', 'f1_weighted']:
    train_score = results[f'train_{metric}'].mean()
    test_score = results[f'test_{metric}'].mean()
    print(f"{metric}: Train={train_score:.4f}, Val={test_score:.4f}")
```
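The Time Series Split row in the table above deserves special care: shuffling sequential data would leak future information into training. sklearn's TimeSeriesSplit keeps every training fold strictly before its validation fold; a minimal sketch on toy time-ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Training indices always precede validation indices
    print(f"Fold {fold}: train={train_idx}, val={val_idx}")
```

Each fold trains on a growing prefix of the series and validates on the block that immediately follows it, mimicking how the model would be used in production.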
4. Hyperparameter Tuning
Hyperparameters are settings you choose before training (unlike model parameters, which are learned during training).
| Parameter | Hyperparameter | Learned During Training? |
|---|---|---|
| Model weights | — | ✅ Yes |
| — | Learning rate | ❌ No (you choose it) |
| — | Number of trees (n_estimators) | ❌ No |
| — | Max depth (max_depth) | ❌ No |
| — | Regularization (C, alpha) | ❌ No |
Grid Search
Tests all possible combinations. Exhaustive but expensive.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Total: 3 × 4 × 3 × 3 = 108 combinations × 5 folds = 540 fits
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_
```
Random Search
Randomly samples the search space. More efficient when the space is large.
```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 10, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # floats sampled in [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # only 50 random combinations (vs 108+ for grid)
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")
```
| Method | Advantages | Disadvantages | When to Use |
|---|---|---|---|
| Grid Search | Exhaustive; finds the best combination within the grid | Very slow (cost grows combinatorially) | Small search space |
| Random Search | Faster, explores the space better | No guarantee of optimal | Large search space |
| Bayesian Optimization | Intelligent, converges quickly | More complex to implement | Expensive models to train |
5. Evaluation Metrics
Classification Metrics
The Confusion Matrix
The confusion matrix is the foundation of all classification metrics.
- True Positive: There is a fire, the alarm sounds ✅
- False Positive: No fire, the alarm sounds anyway (burnt toast) 🚨
- False Negative: There is a fire, but the alarm doesn't sound 💀
- True Negative: No fire, the alarm stays silent ✅
An FN (missing a real fire) is much more serious than an FP (false alarm). The choice of metric depends on the cost of errors.
Metrics Derived from the Confusion Matrix
| Metric | Formula | Question It Answers |
|---|---|---|
| Accuracy | (TP + TN) / Total | What fraction of predictions is correct? |
| Precision | TP / (TP + FP) | Among positive predictions, how many are true? |
| Recall (Sensitivity) | TP / (TP + FN) | Among actual positives, how many were detected? |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall |
| Specificity | TN / (TN + FP) | Among actual negatives, how many were correctly identified? |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, ConfusionMatrixDisplay
)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred, average='weighted'):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax, cmap='Purples')
ax.set_title("Confusion Matrix")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()
```
AUC-ROC Curve
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs. the False Positive Rate at various classification thresholds. The AUC (Area Under the Curve) summarizes this into a single number.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 - 1.0 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Fair |
| 0.5 | Random (no skill) |
| < 0.5 | Worse than random |
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# For binary classification: probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='#7c3aed', lw=2, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
When to Use Which Metric?
| Scenario | Priority Metric | Why |
|---|---|---|
| Spam detection | Precision | Avoid marking legit emails as spam (minimize FP) |
| Cancer screening | Recall | Don't miss real cancers (minimize FN) |
| Balanced classes | Accuracy or F1 | Both error types equally costly |
| Imbalanced classes | F1, AUC-ROC | Accuracy is misleading with class imbalance |
| Ranking predictions | AUC-ROC | Measures ranking quality across thresholds |
On a dataset with 95% class A and 5% class B, a model that always predicts A will have 95% accuracy. This is why accuracy alone is insufficient for imbalanced datasets. Always use F1, Precision, Recall, and AUC-ROC as complements.
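This 95/5 failure mode is easy to reproduce: sklearn's DummyClassifier with strategy='most_frequent' ignores the features entirely, yet posts 95% accuracy. A self-contained sketch with synthetic labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)  # 95% class 0, 5% class 1

# Always predicts the majority class, learns nothing from X
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")        # 0.95, looks great
print(f"Recall (class 1): {recall_score(y, y_pred):.2f}")  # 0.00, useless
print(f"F1 (class 1): {f1_score(y, y_pred, zero_division=0):.2f}")
```

Recall and F1 on the minority class immediately expose what accuracy hides.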
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MSE (Mean Squared Error) | Σ(y − ŷ)² / n | Heavily penalizes large errors |
| RMSE (Root MSE) | √MSE | Same unit as the target variable |
| MAE (Mean Absolute Error) | Σ\|y − ŷ\| / n | Average absolute error |
| R² (Coefficient of Determination) | 1 − (SS_res / SS_tot) | Proportion of variance explained (1 is perfect; can be negative) |
| MAPE | (100 / n) × Σ\|y − ŷ\| / \|y\| | Average error as a percentage of the true value |
```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")
```
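The table also lists MAPE, which sklearn exposes as mean_absolute_percentage_error. Note that it returns a fraction rather than a percentage, and it blows up when true values are close to zero. A quick sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

# Per-sample relative errors: 0.10, 0.05, 0.10, 0.10 → mean 0.0875
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape * 100:.1f}%")  # 8.8%
```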
6. Overfitting vs. Underfitting
[Diagram: Bias-Variance Comparison]
| Characteristic | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Train accuracy | ❌ Low | ✅ High | ✅ Very High |
| Test accuracy | ❌ Low | ✅ High | ❌ Low |
| Model complexity | Too simple | Just right | Too complex |
| Bias | High | Low | Low |
| Variance | Low | Low | High |
How to Detect
```python
# Train and evaluate on both sets
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
gap = train_score - test_score

print(f"Train: {train_score:.4f}")
print(f"Test: {test_score:.4f}")
print(f"Gap: {gap:.4f}")

# Rough heuristics; sensible thresholds depend on the task
if train_score < 0.7 and test_score < 0.7:
    print("⚠️ Underfitting: model too simple")
elif gap > 0.10:
    print("⚠️ Overfitting: model too complex")
else:
    print("✅ Good generalization")
```
Remedies
| Problem | Solutions |
|---|---|
| Underfitting | Increase model complexity, add more features, reduce regularization, train longer |
| Overfitting | Add more data, increase regularization, reduce features, use dropout, early stopping, cross-validation |
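One remedy from the table, reducing model complexity, can be observed directly by capping a decision tree's depth on noisy synthetic data. A sketch (the dataset and exact scores are illustrative and depend on the random seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects 15% label noise, so memorizing the training set is harmful
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [None, 3]:  # unlimited vs capped depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```

The unrestricted tree memorizes the training set (train accuracy 1.0) and shows a large train/test gap; capping the depth shrinks the gap.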
7. Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between two sources of error.
- High bias: The shooter consistently aims too far to the left (systematic error)
- High variance: The shots are scattered all around the target (instability)
- Goal: Grouped shots at the center (low bias + low variance)
| | Low Variance | High Variance |
|---|---|---|
| Low Bias | ✅ Ideal (good generalization) | ⚠️ Overfitting |
| High Bias | ⚠️ Underfitting | ❌ Worst case |
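The archery analogy can be quantified: fit polynomials of increasing degree to many noisy resamples of the same underlying function, then measure squared bias (how far the average fit is from the truth) and variance (how much individual fits scatter). A NumPy-only sketch, with the function and noise level chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 30)
true_y = np.sin(np.pi * x)  # ground-truth function

results = {}
for degree in [1, 4, 10]:
    all_preds = []
    for _ in range(200):  # 200 noisy resamples of the same experiment
        y_noisy = true_y + rng.normal(0, 0.3, size=x.size)
        coefs = np.polyfit(x, y_noisy, deg=degree)
        all_preds.append(np.polyval(coefs, x))
    all_preds = np.array(all_preds)  # shape (200, 30)
    bias2 = np.mean((all_preds.mean(axis=0) - true_y) ** 2)  # squared bias
    variance = np.mean(all_preds.var(axis=0))
    results[degree] = (bias2, variance)
    print(f"degree={degree:2d}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

Degree 1 underfits (high bias, low variance), degree 10 chases the noise (low bias, high variance), and an intermediate degree balances the two.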
8. Learning Curves
Learning curves plot model performance as a function of training set size (or training iterations). They are one of the most useful visual diagnostics for telling overfitting apart from underfitting.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='#7c3aed')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='#f59e0b')
plt.plot(train_sizes, train_mean, 'o-', color='#7c3aed', label='Training Score')
plt.plot(train_sizes, val_mean, 's-', color='#f59e0b', label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
Interpreting Learning Curves
| Pattern | Diagnosis | Action |
|---|---|---|
| Both curves low and converging | Underfitting | Use more complex model or more features |
| Train high, validation low, gap persists | Overfitting | Add more data, regularize, simplify model |
| Both curves high and converging | Good fit | Model is ready for deployment |
| Validation curve still rising at the end | More data needed | Collect more training data |
9. Putting It All Together
Here is a complete example combining all concepts:
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Load data
data = load_iris()
X, y = data.data, data.target

# 2. Split: 60% train, 20% val, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# 3. Build a pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# 4. Hyperparameter tuning with cross-validation
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [3, 5, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline, param_grid, cv=cv,
    scoring='f1_weighted', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best CV F1-score: {grid_search.best_score_:.4f}")

# 5. Validate on validation set
best_model = grid_search.best_estimator_
val_pred = best_model.predict(X_val)
print(f"\nValidation Accuracy: {accuracy_score(y_val, val_pred):.4f}")

# 6. Final evaluation on test set
test_pred = best_model.predict(X_test)
print("\nTest Set Results:")
print(classification_report(y_test, test_pred, target_names=data.target_names))
```
Summary
🔑 Key Takeaways
- Data Splitting: Always split into train/validation/test. The test set is only for final evaluation.
- Cross-Validation: Use K-Fold (K=5 or 10) for robust performance estimates.
- Hyperparameter Tuning: GridSearch for small spaces, RandomSearch for large spaces.
- Metrics: Choose the metric based on the cost of errors (Precision vs Recall).
- Confusion Matrix: Foundation of all classification metrics. Learn to read it.
- Overfitting: High training performance, low test performance → model too complex.
- Underfitting: Low performance everywhere → model too simple.
- Bias-Variance: The goal is to minimize both simultaneously.
- Learning Curves: Essential visual tool for diagnosing problems.
Further Reading
| Resource | Link |
|---|---|
| scikit-learn Model Selection Guide | sklearn.model_selection |
| Cross-Validation: Evaluating Estimator Performance | sklearn Cross-Validation |
| Metrics and Scoring | sklearn Metrics |
| Understanding the Bias-Variance Tradeoff | Scott Fortmann-Roe |