TP2 — Train, Evaluate and Serialize a Model
Lab Objectives
By the end of this lab, you will be able to:
- ✅ Load and explore a real dataset
- ✅ Preprocess the data (scaling, encoding)
- ✅ Train multiple classification models
- ✅ Evaluate and compare models with rigorous metrics
- ✅ Visualize results (confusion matrix, ROC curve)
- ✅ Serialize the best model in pickle, joblib, and ONNX
- ✅ Load and verify serialized models
- ✅ Generate an evaluation report
Prerequisites
| Prerequisite | Detail |
|---|---|
| Python | 3.10+ installed |
| Libraries | scikit-learn, pandas, numpy, matplotlib, seaborn |
| Knowledge | Module 2 — Concepts (Training & Serialization) |
| Environment | Virtual environment activated |
Install dependencies
pip install scikit-learn pandas numpy matplotlib seaborn joblib skl2onnx onnxruntime
Project architecture
tp2-model-evaluation/
├── tp2_train_evaluate.py # Main script
├── models/
│ ├── best_model.pkl # Serialized model (pickle)
│ ├── best_model.joblib # Serialized model (joblib)
│ ├── best_model.onnx # Serialized model (ONNX)
│ └── metadata.json # Model metadata
├── reports/
│ ├── confusion_matrix.png # Confusion matrix
│ ├── roc_curve.png # ROC curve
│ └── evaluation_report.txt # Text report
└── README.md
Step 1 — Setup and data loading
We use the Breast Cancer Wisconsin dataset from scikit-learn. This is a binary classification problem (malignant vs benign tumor) with 30 numeric features and 569 samples.
# tp2_train_evaluate.py — Step 1: Load and explore data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import os
# Create output directories
os.makedirs('models', exist_ok=True)
os.makedirs('reports', exist_ok=True)
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
# Explore the dataset
print("=" * 60)
print("DATASET EXPLORATION")
print("=" * 60)
print(f"\nShape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass names: {data.target_names}")
print(f"\nFirst 5 features:")
print(X.iloc[:, :5].describe())
✅ Expected result
============================================================
DATASET EXPLORATION
============================================================
Shape: (569, 30)
Features: 30
Samples: 569
Target distribution:
1 357
0 212
Name: target, dtype: int64
Class names: ['malignant' 'benign']
Step 2 — Data preprocessing
# Step 2: Preprocessing and data splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.0f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"\nAfter scaling — Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.4f}")
scaler.fit_transform() is called only on the training set; the validation and test sets are transformed with scaler.transform(), without refitting. Fitting the scaler on anything other than the training data would leak statistics from the validation/test sets into preprocessing (data leakage).
✅ Expected result
Training set: 341 samples (60%)
Validation set: 114 samples (20%)
Test set: 114 samples (20%)
After scaling — Train mean: -0.000000, std: 1.0000
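To make the leakage point concrete, here is a toy sketch (synthetic numbers, unrelated to the lab data) contrasting a scaler fit on train+test with one fit on train only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# "Train" rows are small; the single "test" row is a large outlier
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[100.0]])

# ❌ Leaky: fitting on train+test lets the test outlier shift the statistics
leaky = StandardScaler().fit(np.vstack([X_train_toy, X_test_toy]))

# ✅ Correct: fit on train only, then reuse the frozen statistics everywhere
clean = StandardScaler().fit(X_train_toy)

print(f"Leaky mean: {leaky.mean_[0]}")   # 26.5 — polluted by the test point
print(f"Clean mean: {clean.mean_[0]}")   # 2.0 — training data only
```

The leaky scaler's mean is dragged toward the test outlier, so the model would see standardized training values that already encode information about the test set.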
Step 3 — Training multiple models
# Step 3: Train multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
'Decision Tree': DecisionTreeClassifier(random_state=42),
}
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Train and evaluate each model with cross-validation
cv_results = {}
print("=" * 60)
print("CROSS-VALIDATION RESULTS (5-Fold)")
print("=" * 60)
for name, model in models.items():
scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='f1')
cv_results[name] = {
'mean': scores.mean(),
'std': scores.std(),
'scores': scores
}
print(f"\n{name}:")
print(f" F1 scores: {scores.round(4)}")
print(f" Mean F1: {scores.mean():.4f} ± {scores.std():.4f}")
✅ Expected result (approximate)
============================================================
CROSS-VALIDATION RESULTS (5-Fold)
============================================================
Logistic Regression:
F1 scores: [0.9783 0.9778 0.9565 0.9778 0.9778]
Mean F1: 0.9736 ± 0.0087
Random Forest:
F1 scores: [0.9778 0.9556 0.9565 0.9778 0.9556]
Mean F1: 0.9647 ± 0.0107
SVM (RBF):
F1 scores: [0.9783 0.9778 0.9783 0.9778 0.9778]
Mean F1: 0.9780 ± 0.0003
...
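The per-fold scores kept in cv_results can also be compared visually. A hedged sketch using a stand-in cv_results dict (illustrative numbers; the output path reports/cv_scores.png is my choice, not part of the lab spec):

```python
import os
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the lab's cv_results dict (illustrative scores only)
cv_results = {
    'Logistic Regression': {'scores': np.array([0.978, 0.978, 0.957, 0.978, 0.978])},
    'SVM (RBF)': {'scores': np.array([0.978, 0.978, 0.978, 0.978, 0.978])},
}

os.makedirs('reports', exist_ok=True)
fig, ax = plt.subplots(figsize=(10, 5))
names = list(cv_results.keys())
ax.boxplot([cv_results[n]['scores'] for n in names])
ax.set_xticklabels(names, rotation=20)
ax.set_ylabel('F1 score')
ax.set_title('Cross-validation score spread per model')
plt.tight_layout()
plt.savefig('reports/cv_scores.png', dpi=150)
```

A boxplot makes fold-to-fold variance visible at a glance: a model with a slightly lower mean but a much tighter spread can be the safer choice.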
Step 4 — Detailed evaluation on the validation set
# Step 4: Detailed evaluation on validation set
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
classification_report, confusion_matrix, roc_auc_score
)
# Train all models on full training set and evaluate on validation set
val_results = {}
print("\n" + "=" * 60)
print("VALIDATION SET RESULTS")
print("=" * 60)
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_val_scaled)
y_proba = model.predict_proba(X_val_scaled)[:, 1]
val_results[name] = {
'accuracy': accuracy_score(y_val, y_pred),
'precision': precision_score(y_val, y_pred),
'recall': recall_score(y_val, y_pred),
'f1': f1_score(y_val, y_pred),
'auc_roc': roc_auc_score(y_val, y_proba),
'predictions': y_pred,
'probabilities': y_proba,
}
# Display results as a comparison table
results_df = pd.DataFrame(val_results).T
# Keep only the metric columns and cast to float: the dict also holds
# prediction arrays, so the frame is object-dtype and round() would
# otherwise silently skip these columns
results_df = results_df[['accuracy', 'precision', 'recall', 'f1', 'auc_roc']].astype(float)
results_df = results_df.round(4)
results_df = results_df.sort_values('f1', ascending=False)
print("\n📊 Model Comparison Table:")
print(results_df.to_string())
# Identify best model
best_model_name = results_df['f1'].idxmax()
print(f"\n🏆 Best model: {best_model_name} (F1 = {results_df.loc[best_model_name, 'f1']:.4f})")
Step 5 — Visualization: Confusion Matrix & ROC Curve
# Step 5a: Confusion Matrix for the best model
from sklearn.metrics import ConfusionMatrixDisplay
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_val_pred = best_model.predict(X_val_scaled)
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
y_val, y_val_pred,
display_labels=data.target_names,
cmap='Purples',
ax=ax
)
ax.set_title(f'Confusion Matrix - {best_model_name}', fontsize=14)
plt.tight_layout()
plt.savefig('reports/confusion_matrix.png', dpi=150)
plt.show()
print("✅ Confusion matrix saved to reports/confusion_matrix.png")
# Step 5b: ROC Curves for all models
from sklearn.metrics import roc_curve
fig, ax = plt.subplots(figsize=(10, 7))
colors = ['#7c3aed', '#3b82f6', '#10b981', '#f59e0b', '#ef4444']
for (name, result), color in zip(val_results.items(), colors):
fpr, tpr, _ = roc_curve(y_val, result['probabilities'])
ax.plot(fpr, tpr, color=color, lw=2,
label=f"{name} (AUC = {result['auc_roc']:.4f})")
ax.plot([0, 1], [0, 1], 'k--', lw=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - Model Comparison', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('reports/roc_curve.png', dpi=150)
plt.show()
print("✅ ROC curves saved to reports/roc_curve.png")
Step 6 — Serialize the best model
We serialize the complete pipeline (scaler + model) to ensure preprocessing is included.
# Step 6: Serialize the best model in all formats
import pickle
import joblib
from sklearn.pipeline import Pipeline
# Build full pipeline with best model
best_pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', models[best_model_name])
])
best_pipeline.fit(X_train, y_train) # fit on UN-scaled data (pipeline handles it)
# 6a. Pickle
with open('models/best_model.pkl', 'wb') as f:
pickle.dump(best_pipeline, f)
print("✅ Saved: models/best_model.pkl")
# 6b. Joblib (with compression)
joblib.dump(best_pipeline, 'models/best_model.joblib', compress=3)
print("✅ Saved: models/best_model.joblib")
# 6c. ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(best_pipeline, initial_types=initial_type)
with open('models/best_model.onnx', 'wb') as f:
f.write(onnx_model.SerializeToString())
print("✅ Saved: models/best_model.onnx")
# Compare file sizes
import os
for ext in ['pkl', 'joblib', 'onnx']:
filepath = f'models/best_model.{ext}'
size_kb = os.path.getsize(filepath) / 1024
print(f" {filepath:30s} → {size_kb:8.1f} KB")
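The compress=3 choice above trades save/load time for file size. A throwaway sketch to compare levels (the random payload is illustrative only; actual gains depend heavily on the data):

```python
import os
import joblib
import numpy as np

# Throwaway payload — random floats compress poorly, so gains are modest here
payload = {'weights': np.random.RandomState(0).rand(1000, 100)}

sizes = {}
for level in (0, 3, 9):
    path = f'tmp_compress_{level}.joblib'
    joblib.dump(payload, path, compress=level)   # 0 = no compression, 9 = max
    sizes[level] = os.path.getsize(path) / 1024
    os.remove(path)

for level, kb in sizes.items():
    print(f"compress={level}: {kb:8.1f} KB")
```

Higher levels shrink the file at the cost of slower dump/load; for large fitted models, a middle level such as 3 is a common default.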
Step 7 — Load and verify serialized models
# Step 7: Load and verify all serialized models
import numpy as np
# Sample test data (first 5 samples)
X_sample = X_test.iloc[:5]
y_sample = y_test.iloc[:5]
print("=" * 60)
print("VERIFICATION - Serialized Models")
print("=" * 60)
print(f"\nTrue labels: {y_sample.values}")
# 7a. Load Pickle
with open('models/best_model.pkl', 'rb') as f:
model_pkl = pickle.load(f)
pred_pkl = model_pkl.predict(X_sample)
print(f"Pickle predictions: {pred_pkl}")
# 7b. Load Joblib
model_joblib = joblib.load('models/best_model.joblib')
pred_joblib = model_joblib.predict(X_sample)
print(f"Joblib predictions: {pred_joblib}")
# 7c. Load ONNX
import onnxruntime as ort
session = ort.InferenceSession('models/best_model.onnx')
input_name = session.get_inputs()[0].name
X_sample_float = X_sample.values.astype(np.float32)
pred_onnx = session.run(None, {input_name: X_sample_float})[0]
print(f"ONNX predictions: {pred_onnx}")
# 7d. Verify consistency across all three formats
assert np.array_equal(pred_pkl, pred_joblib), "Pickle/Joblib mismatch!"
assert np.array_equal(pred_pkl, np.asarray(pred_onnx).ravel()), "ONNX mismatch!"
print("\n✅ All serialization formats produce consistent predictions!")
# Full test set evaluation of loaded model
y_test_pred = model_joblib.predict(X_test)
final_accuracy = accuracy_score(y_test, y_test_pred)
final_f1 = f1_score(y_test, y_test_pred)
print(f"\n📊 Final Test Set Performance (loaded model):")
print(f" Accuracy: {final_accuracy:.4f}")
print(f" F1-Score: {final_f1:.4f}")
✅ Expected result
============================================================
VERIFICATION - Serialized Models
============================================================
True labels: [1 0 0 1 1]
Pickle predictions: [1 0 0 1 1]
Joblib predictions: [1 0 0 1 1]
ONNX predictions: [1 0 0 1 1]
✅ All serialization formats produce consistent predictions!
📊 Final Test Set Performance (loaded model):
Accuracy: 0.9737
F1-Score: 0.9808
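One caveat on exact-equality checks: ONNX Runtime computes in float32 while scikit-learn uses float64, so predicted probabilities (unlike hard labels) rarely match bit-for-bit. A minimal sketch of the idea, simulating the float32 round-trip rather than calling ONNX Runtime:

```python
import numpy as np

# float64 probabilities, as scikit-learn would produce them
proba_sklearn = np.array([1 / 3, 2 / 3], dtype=np.float64)

# Simulate ONNX Runtime's float32 arithmetic with a precision round-trip
proba_onnx = proba_sklearn.astype(np.float32).astype(np.float64)

# Bitwise equality fails, but a tolerance-based comparison passes
assert not np.array_equal(proba_sklearn, proba_onnx)
assert np.allclose(proba_sklearn, proba_onnx, atol=1e-5)
print("probabilities agree within atol=1e-5")
```

When comparing the real ONNX probability output (the second element returned by session.run), np.allclose with a small atol is therefore a safer check than np.array_equal.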
Step 8 — Generate evaluation report
# Step 8: Generate evaluation report
import json
from datetime import datetime
# Save metadata
metadata = {
"model_name": best_model_name,
"version": "1.0.0",
"timestamp": datetime.now().isoformat(),
"dataset": "Breast Cancer Wisconsin",
"n_samples": len(X),
"n_features": X.shape[1],
"split": {"train": len(X_train), "val": len(X_val), "test": len(X_test)},
"test_metrics": {
"accuracy": round(final_accuracy, 4),
"f1_score": round(final_f1, 4),
"precision": round(precision_score(y_test, y_test_pred), 4),
"recall": round(recall_score(y_test, y_test_pred), 4),
},
"serialization_formats": ["pickle", "joblib", "onnx"],
"hyperparameters": models[best_model_name].get_params(),
}
with open('models/metadata.json', 'w') as f:
json.dump(metadata, f, indent=2, default=str)
print("✅ Metadata saved to models/metadata.json")
# Generate text report
report_lines = [
"=" * 60,
"MODEL EVALUATION REPORT",
"=" * 60,
f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
f"Dataset: Breast Cancer Wisconsin ({len(X)} samples, {X.shape[1]} features)",
"",
"--- Data Split ---",
f"Training: {len(X_train)} samples",
f"Validation: {len(X_val)} samples",
f"Test: {len(X_test)} samples",
"",
"--- Cross-Validation Results (F1 Score) ---",
]
for name, result in sorted(cv_results.items(), key=lambda x: x[1]['mean'], reverse=True):
report_lines.append(f" {name:25s}: {result['mean']:.4f} ± {result['std']:.4f}")
report_lines.extend([
"",
"--- Validation Set Results ---",
results_df.to_string(),
"",
f"--- Best Model: {best_model_name} ---",
f"Test Accuracy: {final_accuracy:.4f}",
f"Test F1-Score: {final_f1:.4f}",
"",
"--- Classification Report (Test Set) ---",
classification_report(y_test, y_test_pred, target_names=data.target_names),
"",
"--- Serialized Files ---",
])
for ext in ['pkl', 'joblib', 'onnx']:
filepath = f'models/best_model.{ext}'
size_kb = os.path.getsize(filepath) / 1024
report_lines.append(f" {filepath}: {size_kb:.1f} KB")
report_text = "\n".join(report_lines)
with open('reports/evaluation_report.txt', 'w') as f:
f.write(report_text)
print("✅ Evaluation report saved to reports/evaluation_report.txt")
print("\n" + report_text)
Validation checklist
Before submitting your lab, verify the following points:
| # | Criterion | Verified |
|---|---|---|
| 1 | The dataset is correctly loaded and explored | ☐ |
| 2 | Data is split into 3 sets (train/val/test) | ☐ |
| 3 | Scaling is applied correctly (fit on train only) | ☐ |
| 4 | At least 3 models are trained and compared | ☐ |
| 5 | 5-fold cross-validation is used | ☐ |
| 6 | Metrics include accuracy, precision, recall, F1, AUC-ROC | ☐ |
| 7 | Confusion matrix is generated and saved | ☐ |
| 8 | ROC curves are generated and saved | ☐ |
| 9 | Best model is serialized in 3 formats (pkl, joblib, onnx) | ☐ |
| 10 | Serialized models are reloaded and verified | ☐ |
| 11 | An evaluation report is generated | ☐ |
| 12 | Metadata is saved in JSON | ☐ |
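Several of these items can be checked programmatically. A minimal self-check sketch (paths follow the project architecture above; adjust them if your layout differs):

```python
import json
import os

# Expected artifacts, per the project architecture at the top of the lab
expected_files = [
    'models/best_model.pkl',
    'models/best_model.joblib',
    'models/best_model.onnx',
    'models/metadata.json',
    'reports/confusion_matrix.png',
    'reports/roc_curve.png',
    'reports/evaluation_report.txt',
]

missing = [p for p in expected_files if not os.path.exists(p)]
print("Missing artifacts:", missing or "none")

# Spot-check the metadata contents if the file exists
if os.path.exists('models/metadata.json'):
    with open('models/metadata.json') as f:
        meta = json.load(f)
    for key in ('model_name', 'test_metrics', 'serialization_formats'):
        assert key in meta, f"metadata.json missing key: {key}"
    print("metadata.json keys OK")
```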
Bonus challenges
🚀 Challenge 1 — Hyperparameter Tuning
Add a GridSearchCV or RandomizedSearchCV step to optimize the best model's hyperparameters, then compare performance before and after tuning.
from sklearn.model_selection import GridSearchCV
# Example grid for Random Forest (adapt the keys if your best model differs)
param_grid = {
'classifier__n_estimators': [50, 100, 200, 300],
'classifier__max_depth': [3, 5, 10, None],
'classifier__min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(
best_pipeline, param_grid, cv=5,
scoring='f1', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.4f}")
🚀 Challenge 2 — Learning Curves
Generate learning curves for the best model and determine whether it is overfitting or underfitting.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
best_pipeline, X_train, y_train, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='f1', n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='#7c3aed', label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), 's-', color='#f59e0b', label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig('reports/learning_curve.png', dpi=150)
plt.show()
🚀 Challenge 3 — MLflow Tracking
Integrate MLflow to automatically log experiments, metrics, and models.
import mlflow
import mlflow.sklearn
mlflow.set_experiment("tp2-breast-cancer")
for name, model in models.items():
with mlflow.start_run(run_name=name):
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
mlflow.log_metric("f1", f1_score(y_test, y_pred))
mlflow.sklearn.log_model(pipeline, "model")