TP2 — Train, Evaluate and Serialize a Model
Lab Objectives
By the end of this lab, you will be able to:
- ✅ Load and explore a real dataset
- ✅ Preprocess the data (scaling, encoding)
- ✅ Train multiple classification models
- ✅ Evaluate and compare models with rigorous metrics
- ✅ Visualize results (confusion matrix, ROC curve)
- ✅ Serialize the best model in pickle, joblib, and ONNX
- ✅ Load and verify serialized models
- ✅ Generate an evaluation report
Prerequisites
| Prerequisite | Detail |
|---|---|
| Python | 3.10+ installed |
| Libraries | scikit-learn, pandas, numpy, matplotlib, seaborn |
| Knowledge | Module 2 — Concepts (Training & Serialization) |
| Environment | Virtual environment activated |
Install dependencies
pip install scikit-learn pandas numpy matplotlib seaborn joblib skl2onnx onnxruntime
Project architecture
tp2-model-evaluation/
├── tp2_train_evaluate.py # Main script
├── models/
│ ├── best_model.pkl # Serialized model (pickle)
│ ├── best_model.joblib # Serialized model (joblib)
│ ├── best_model.onnx # Serialized model (ONNX)
│ └── metadata.json # Model metadata
├── reports/
│ ├── confusion_matrix.png # Confusion matrix
│ ├── roc_curve.png # ROC curve
│ └── evaluation_report.txt # Text report
└── README.md
Step 1 — Setup and data loading
We use the Breast Cancer Wisconsin dataset from scikit-learn. This is a binary classification problem (malignant vs benign tumor) with 30 numeric features and 569 samples.
# tp2_train_evaluate.py — Step 1: Load and explore data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import os
# Create output directories
os.makedirs('models', exist_ok=True)
os.makedirs('reports', exist_ok=True)
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
# Explore the dataset
print("=" * 60)
print("DATASET EXPLORATION")
print("=" * 60)
print(f"\nShape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass names: {data.target_names}")
print(f"\nFirst 5 features:")
print(X.iloc[:, :5].describe())
✅ Expected result
============================================================
DATASET EXPLORATION
============================================================
Shape: (569, 30)
Features: 30
Samples: 569
Target distribution:
1 357
0 212
Name: target, dtype: int64
Class names: ['malignant' 'benign']
Step 2 — Data preprocessing
# Step 2: Preprocessing and data splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.0f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"\nAfter scaling — Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.4f}")
scaler.fit_transform() is called only on the training set; the validation and test sets are transformed with scaler.transform(), without refitting. Fitting the scaler on anything other than the training data would leak statistics from the validation/test sets into preprocessing (data leakage).
✅ Expected result
Training set: 341 samples (60%)
Validation set: 114 samples (20%)
Test set: 114 samples (20%)
After scaling — Train mean: -0.000000, std: 1.0000
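To make the leakage point concrete, here is a toy sketch (synthetic numbers, unrelated to the lab data) contrasting a scaler fit on train+test with one fit on train only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# "Train" rows are small; the single "test" row is a large outlier
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[100.0]])

# ❌ Leaky: fitting on train+test lets the test outlier shift the statistics
leaky = StandardScaler().fit(np.vstack([X_train_toy, X_test_toy]))

# ✅ Correct: fit on train only, then reuse the frozen statistics everywhere
clean = StandardScaler().fit(X_train_toy)

print(f"Leaky mean: {leaky.mean_[0]}")   # 26.5 — polluted by the test point
print(f"Clean mean: {clean.mean_[0]}")   # 2.0 — training data only
```

The leaky scaler's mean is dragged toward the test outlier, so the model would see standardized training values that already encode information about the test set.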
Step 3 — Training multiple models
# Step 3: Train multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
'Decision Tree': DecisionTreeClassifier(random_state=42),
}
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Train and evaluate each model with cross-validation
cv_results = {}
print("=" * 60)
print("CROSS-VALIDATION RESULTS (5-Fold)")
print("=" * 60)
for name, model in models.items():
scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='f1')
cv_results[name] = {
'mean': scores.mean(),
'std': scores.std(),
'scores': scores
}
print(f"\n{name}:")
print(f" F1 scores: {scores.round(4)}")
print(f" Mean F1: {scores.mean():.4f} ± {scores.std():.4f}")
✅ Expected result (approximate)
============================================================
CROSS-VALIDATION RESULTS (5-Fold)
============================================================
Logistic Regression:
F1 scores: [0.9783 0.9778 0.9565 0.9778 0.9778]
Mean F1: 0.9736 ± 0.0087
Random Forest:
F1 scores: [0.9778 0.9556 0.9565 0.9778 0.9556]
Mean F1: 0.9647 ± 0.0107
SVM (RBF):
F1 scores: [0.9783 0.9778 0.9783 0.9778 0.9778]
Mean F1: 0.9780 ± 0.0003
...
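The per-fold scores kept in cv_results can also be compared visually. A hedged sketch using a stand-in cv_results dict (illustrative numbers; the output path reports/cv_scores.png is my choice, not part of the lab spec):

```python
import os
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the lab's cv_results dict (illustrative scores only)
cv_results = {
    'Logistic Regression': {'scores': np.array([0.978, 0.978, 0.957, 0.978, 0.978])},
    'SVM (RBF)': {'scores': np.array([0.978, 0.978, 0.978, 0.978, 0.978])},
}

os.makedirs('reports', exist_ok=True)
fig, ax = plt.subplots(figsize=(10, 5))
names = list(cv_results.keys())
ax.boxplot([cv_results[n]['scores'] for n in names])
ax.set_xticklabels(names, rotation=20)
ax.set_ylabel('F1 score')
ax.set_title('Cross-validation score spread per model')
plt.tight_layout()
plt.savefig('reports/cv_scores.png', dpi=150)
```

A boxplot makes fold-to-fold variance visible at a glance: a model with a slightly lower mean but a much tighter spread can be the safer choice.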
Step 4 — Detailed evaluation on the validation set
# Step 4: Detailed evaluation on validation set
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
classification_report, confusion_matrix, roc_auc_score
)
# Train all models on full training set and evaluate on validation set
val_results = {}
print("\n" + "=" * 60)
print("VALIDATION SET RESULTS")
print("=" * 60)
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_val_scaled)
y_proba = model.predict_proba(X_val_scaled)[:, 1]
val_results[name] = {
'accuracy': accuracy_score(y_val, y_pred),
'precision': precision_score(y_val, y_pred),
'recall': recall_score(y_val, y_pred),
'f1': f1_score(y_val, y_pred),
'auc_roc': roc_auc_score(y_val, y_proba),
'predictions': y_pred,
'probabilities': y_proba,
}
# Display results as a comparison table
results_df = pd.DataFrame(val_results).T
# Keep only the metric columns and cast to float: the dict also holds
# prediction arrays, so the frame is object-dtype and round() would
# otherwise silently skip these columns
results_df = results_df[['accuracy', 'precision', 'recall', 'f1', 'auc_roc']].astype(float)
results_df = results_df.round(4)
results_df = results_df.sort_values('f1', ascending=False)
print("\n📊 Model Comparison Table:")
print(results_df.to_string())
# Identify best model
best_model_name = results_df['f1'].idxmax()
print(f"\n🏆 Best model: {best_model_name} (F1 = {results_df.loc[best_model_name, 'f1']:.4f})")
Step 5 — Visualization: Confusion Matrix & ROC Curve
# Step 5a: Confusion Matrix for the best model
from sklearn.metrics import ConfusionMatrixDisplay
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_val_pred = best_model.predict(X_val_scaled)
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
y_val, y_val_pred,
display_labels=data.target_names,
cmap='Purples',
ax=ax
)
ax.set_title(f'Confusion Matrix - {best_model_name}', fontsize=14)
plt.tight_layout()
plt.savefig('reports/confusion_matrix.png', dpi=150)
plt.show()
print("✅ Confusion matrix saved to reports/confusion_matrix.png")
# Step 5b: ROC Curves for all models
from sklearn.metrics import roc_curve
fig, ax = plt.subplots(figsize=(10, 7))
colors = ['#7c3aed', '#3b82f6', '#10b981', '#f59e0b', '#ef4444']
for (name, result), color in zip(val_results.items(), colors):
fpr, tpr, _ = roc_curve(y_val, result['probabilities'])
ax.plot(fpr, tpr, color=color, lw=2,
label=f"{name} (AUC = {result['auc_roc']:.4f})")
ax.plot([0, 1], [0, 1], 'k--', lw=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - Model Comparison', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('reports/roc_curve.png', dpi=150)
plt.show()
print("✅ ROC curves saved to reports/roc_curve.png")
Step 6 — Serialize the best model
We serialize the complete pipeline (scaler + model) to ensure preprocessing is included.
# Step 6: Serialize the best model in all formats
import pickle
import joblib
from sklearn.pipeline import Pipeline
# Build full pipeline with best model
best_pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', models[best_model_name])
])
best_pipeline.fit(X_train, y_train) # fit on UN-scaled data (pipeline handles it)
# 6a. Pickle
with open('models/best_model.pkl', 'wb') as f:
pickle.dump(best_pipeline, f)
print("✅ Saved: models/best_model.pkl")
# 6b. Joblib (with compression)
joblib.dump(best_pipeline, 'models/best_model.joblib', compress=3)
print("✅ Saved: models/best_model.joblib")
# 6c. ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(best_pipeline, initial_types=initial_type)
with open('models/best_model.onnx', 'wb') as f:
f.write(onnx_model.SerializeToString())
print("✅ Saved: models/best_model.onnx")
# Compare file sizes
import os
for ext in ['pkl', 'joblib', 'onnx']:
filepath = f'models/best_model.{ext}'
size_kb = os.path.getsize(filepath) / 1024
print(f" {filepath:30s} → {size_kb:8.1f} KB")
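The compress=3 choice above trades save/load time for file size. A throwaway sketch to compare levels (the random payload is illustrative only; actual gains depend heavily on the data):

```python
import os
import joblib
import numpy as np

# Throwaway payload — random floats compress poorly, so gains are modest here
payload = {'weights': np.random.RandomState(0).rand(1000, 100)}

sizes = {}
for level in (0, 3, 9):
    path = f'tmp_compress_{level}.joblib'
    joblib.dump(payload, path, compress=level)   # 0 = no compression, 9 = max
    sizes[level] = os.path.getsize(path) / 1024
    os.remove(path)

for level, kb in sizes.items():
    print(f"compress={level}: {kb:8.1f} KB")
```

Higher levels shrink the file at the cost of slower dump/load; for large fitted models, a middle level such as 3 is a common default.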
Step 7 — Load and verify serialized models
# Step 7: Load and verify all serialized models
import numpy as np
# Sample test data (first 5 samples)
X_sample = X_test.iloc[:5]
y_sample = y_test.iloc[:5]
print("=" * 60)
print("VERIFICATION - Serialized Models")
print("=" * 60)
print(f"\nTrue labels: {y_sample.values}")
# 7a. Load Pickle
with open('models/best_model.pkl', 'rb') as f:
model_pkl = pickle.load(f)
pred_pkl = model_pkl.predict(X_sample)
print(f"Pickle predictions: {pred_pkl}")
# 7b. Load Joblib
model_joblib = joblib.load('models/best_model.joblib')
pred_joblib = model_joblib.predict(X_sample)
print(f"Joblib predictions: {pred_joblib}")
# 7c. Load ONNX
import onnxruntime as ort
session = ort.InferenceSession('models/best_model.onnx')
input_name = session.get_inputs()[0].name
X_sample_float = X_sample.values.astype(np.float32)
pred_onnx = session.run(None, {input_name: X_sample_float})[0]
print(f"ONNX predictions: {pred_onnx}")
# 7d. Verify consistency across all three formats
assert np.array_equal(pred_pkl, pred_joblib), "Pickle/Joblib mismatch!"
assert np.array_equal(pred_pkl, np.asarray(pred_onnx).ravel()), "ONNX mismatch!"
print("\n✅ All serialization formats produce consistent predictions!")
# Full test set evaluation of loaded model
y_test_pred = model_joblib.predict(X_test)
final_accuracy = accuracy_score(y_test, y_test_pred)
final_f1 = f1_score(y_test, y_test_pred)
print(f"\n📊 Final Test Set Performance (loaded model):")
print(f" Accuracy: {final_accuracy:.4f}")
print(f" F1-Score: {final_f1:.4f}")
✅ Expected result
============================================================
VERIFICATION - Serialized Models
============================================================
True labels: [1 0 0 1 1]
Pickle predictions: [1 0 0 1 1]
Joblib predictions: [1 0 0 1 1]
ONNX predictions: [1 0 0 1 1]
✅ All serialization formats produce consistent predictions!
📊 Final Test Set Performance (loaded model):
Accuracy: 0.9737
F1-Score: 0.9808
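One caveat on exact-equality checks: ONNX Runtime computes in float32 while scikit-learn uses float64, so predicted probabilities (unlike hard labels) rarely match bit-for-bit. A minimal sketch of the idea, simulating the float32 round-trip rather than calling ONNX Runtime:

```python
import numpy as np

# float64 probabilities, as scikit-learn would produce them
proba_sklearn = np.array([1 / 3, 2 / 3], dtype=np.float64)

# Simulate ONNX Runtime's float32 arithmetic with a precision round-trip
proba_onnx = proba_sklearn.astype(np.float32).astype(np.float64)

# Bitwise equality fails, but a tolerance-based comparison passes
assert not np.array_equal(proba_sklearn, proba_onnx)
assert np.allclose(proba_sklearn, proba_onnx, atol=1e-5)
print("probabilities agree within atol=1e-5")
```

When comparing the real ONNX probability output (the second element returned by session.run), np.allclose with a small atol is therefore a safer check than np.array_equal.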
Step 8 — Generate evaluation report
# Step 8: Generate evaluation report
import json
from datetime import datetime
# Save metadata
metadata = {
"model_name": best_model_name,
"version": "1.0.0",
"timestamp": datetime.now().isoformat(),
"dataset": "Breast Cancer Wisconsin",
"n_samples": len(X),
"n_features": X.shape[1],
"split": {"train": len(X_train), "val": len(X_val), "test": len(X_test)},
"test_metrics": {
"accuracy": round(final_accuracy, 4),
"f1_score": round(final_f1, 4),
"precision": round(precision_score(y_test, y_test_pred), 4),
"recall": round(recall_score(y_test, y_test_pred), 4),
},
"serialization_formats": ["pickle", "joblib", "onnx"],
"hyperparameters": models[best_model_name].get_params(),
}
with open('models/metadata.json', 'w') as f:
json.dump(metadata, f, indent=2, default=str)
print("✅ Metadata saved to models/metadata.json")
# Generate text report
report_lines = [
"=" * 60,
"MODEL EVALUATION REPORT",
"=" * 60,
f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
f"Dataset: Breast Cancer Wisconsin ({len(X)} samples, {X.shape[1]} features)",
"",
"--- Data Split ---",
f"Training: {len(X_train)} samples",
f"Validation: {len(X_val)} samples",
f"Test: {len(X_test)} samples",
"",
"--- Cross-Validation Results (F1 Score) ---",
]
for name, result in sorted(cv_results.items(), key=lambda x: x[1]['mean'], reverse=True):
report_lines.append(f" {name:25s}: {result['mean']:.4f} ± {result['std']:.4f}")
report_lines.extend([
"",
"--- Validation Set Results ---",
results_df.to_string(),
"",
f"--- Best Model: {best_model_name} ---",
f"Test Accuracy: {final_accuracy:.4f}",
f"Test F1-Score: {final_f1:.4f}",
"",
"--- Classification Report (Test Set) ---",
classification_report(y_test, y_test_pred, target_names=data.target_names),
"",
"--- Serialized Files ---",
])
for ext in ['pkl', 'joblib', 'onnx']:
filepath = f'models/best_model.{ext}'
size_kb = os.path.getsize(filepath) / 1024
report_lines.append(f" {filepath}: {size_kb:.1f} KB")
report_text = "\n".join(report_lines)
with open('reports/evaluation_report.txt', 'w') as f:
f.write(report_text)
print("✅ Evaluation report saved to reports/evaluation_report.txt")
print("\n" + report_text)
Validation checklist
Before submitting your lab, verify the following points:
| # | Criterion | Verified |
|---|---|---|
| 1 | The dataset is correctly loaded and explored | ☐ |
| 2 | Data is split into 3 sets (train/val/test) | ☐ |
| 3 | Scaling is applied correctly (fit on train only) | ☐ |
| 4 | At least 3 models are trained and compared | ☐ |
| 5 | 5-fold cross-validation is used | ☐ |
| 6 | Metrics include accuracy, precision, recall, F1, AUC-ROC | ☐ |
| 7 | Confusion matrix is generated and saved | ☐ |
| 8 | ROC curves are generated and saved | ☐ |
| 9 | Best model is serialized in 3 formats (pkl, joblib, onnx) | ☐ |
| 10 | Serialized models are reloaded and verified | ☐ |
| 11 | An evaluation report is generated | ☐ |
| 12 | Metadata is saved in JSON | ☐ |
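Several of these items can be checked programmatically. A minimal self-check sketch (paths follow the project architecture above; adjust them if your layout differs):

```python
import json
import os

# Expected artifacts, per the project architecture at the top of the lab
expected_files = [
    'models/best_model.pkl',
    'models/best_model.joblib',
    'models/best_model.onnx',
    'models/metadata.json',
    'reports/confusion_matrix.png',
    'reports/roc_curve.png',
    'reports/evaluation_report.txt',
]

missing = [p for p in expected_files if not os.path.exists(p)]
print("Missing artifacts:", missing or "none")

# Spot-check the metadata contents if the file exists
if os.path.exists('models/metadata.json'):
    with open('models/metadata.json') as f:
        meta = json.load(f)
    for key in ('model_name', 'test_metrics', 'serialization_formats'):
        assert key in meta, f"metadata.json missing key: {key}"
    print("metadata.json keys OK")
```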
Bonus challenges
🚀 Challenge 1 — Hyperparameter Tuning
Add a GridSearchCV or RandomizedSearchCV step to optimize the best model's hyperparameters, then compare performance before and after tuning.
from sklearn.model_selection import GridSearchCV
# Example grid for Random Forest (adapt the keys if your best model differs)
param_grid = {
'classifier__n_estimators': [50, 100, 200, 300],
'classifier__max_depth': [3, 5, 10, None],
'classifier__min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(
best_pipeline, param_grid, cv=5,
scoring='f1', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.4f}")
🚀 Challenge 2 — Learning Curves
Generate learning curves for the best model and determine whether it is overfitting or underfitting.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
best_pipeline, X_train, y_train, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='f1', n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='#7c3aed', label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), 's-', color='#f59e0b', label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig('reports/learning_curve.png', dpi=150)
plt.show()
🚀 Challenge 3 — MLflow Tracking
Integrate MLflow to automatically log experiments, metrics, and models.
import mlflow
import mlflow.sklearn
mlflow.set_experiment("tp2-breast-cancer")
for name, model in models.items():
with mlflow.start_run(run_name=name):
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
mlflow.log_metric("f1", f1_score(y_test, y_pred))
mlflow.sklearn.log_model(pipeline, "model")