TP2 — Train, Evaluate, and Serialize a Model
Lab objectives
By the end of this lab, you will be able to:
- ✅ Load and explore a real dataset
- ✅ Preprocess the data (scaling, encoding)
- ✅ Train several classification models
- ✅ Evaluate and compare the models with rigorous metrics
- ✅ Visualize the results (confusion matrix, ROC curve)
- ✅ Serialize the best model with pickle, joblib, and ONNX
- ✅ Load and verify the serialized models
- ✅ Generate an evaluation report
Prerequisites
| Prerequisite | Details |
|---|---|
| Python | 3.10+ installed |
| Libraries | scikit-learn, pandas, numpy, matplotlib, seaborn |
| Knowledge | Module 2 — Concepts (Training and Serialization) |
| Environment | Virtual environment activated |
Install the dependencies
pip install scikit-learn pandas numpy matplotlib seaborn joblib skl2onnx onnxruntime
Project layout
tp2-model-evaluation/
├── tp2_train_evaluate.py      # Main script
├── models/
│   ├── best_model.pkl         # Serialized model (pickle)
│   ├── best_model.joblib      # Serialized model (joblib)
│   ├── best_model.onnx        # Serialized model (ONNX)
│   └── metadata.json          # Model metadata
├── reports/
│   ├── confusion_matrix.png   # Confusion matrix
│   ├── roc_curve.png          # ROC curve
│   └── evaluation_report.txt  # Text report
└── README.md
Step 1 — Setup and data loading
We use the Breast Cancer Wisconsin dataset from scikit-learn. It is a binary classification problem (malignant vs. benign tumor) with 30 numeric features and 569 samples.
# tp2_train_evaluate.py — Step 1: Load and explore data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import os
# Create output directories
os.makedirs('models', exist_ok=True)
os.makedirs('reports', exist_ok=True)
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
# Explore the dataset
print("=" * 60)
print("DATASET EXPLORATION")
print("=" * 60)
print(f"\nShape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass names: {data.target_names}")
print(f"\nFirst 5 features:")
print(X.iloc[:, :5].describe())
✅ Expected output
============================================================
DATASET EXPLORATION
============================================================
Shape: (569, 30)
Features: 30
Samples: 569
Target distribution:
1 357
0 212
Name: target, dtype: int64
Class names: ['malignant' 'benign']
Step 2 — Data preprocessing
# Step 2: Preprocessing and data splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.0f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
print(f"\nAfter scaling — Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.4f}")
scaler.fit_transform() is called on the training set only. The validation and test sets use scaler.transform() without fitting. This prevents statistics from held-out data from leaking into training (data leakage).
✅ Expected output
Training set: 341 samples (60%)
Validation set: 114 samples (20%)
Test set: 114 samples (20%)
After scaling — Train mean: -0.000000, std: 1.0000
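As a side illustration (not part of the lab script), here is a minimal sketch on synthetic data showing why the scaler must be fit on the training split only: fitting it on the combined train and test data pulls the learned statistics toward the test distribution.

```python
# Hypothetical illustration: fitting the scaler on train + test leaks
# test-set statistics into the preprocessing step.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(100, 1))   # training distribution
test = rng.normal(5.0, 1.0, size=(100, 1))    # shifted test distribution

# Correct: statistics come from the training split only
scaler_ok = StandardScaler().fit(train)
# Leaky: statistics come from train AND test combined
scaler_leak = StandardScaler().fit(np.vstack([train, test]))

# The leaky mean sits between the two distributions instead of near 0
print(f"train-only mean: {scaler_ok.mean_[0]:.2f}, leaky mean: {scaler_leak.mean_[0]:.2f}")
```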
Step 3 — Training multiple models
# Step 3: Train multiple models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Train and evaluate each model with cross-validation
cv_results = {}
print("=" * 60)
print("CROSS-VALIDATION RESULTS (5-Fold)")
print("=" * 60)
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='f1')
    cv_results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    print(f"\n{name}:")
    print(f"  F1 scores: {scores.round(4)}")
    print(f"  Mean F1: {scores.mean():.4f} ± {scores.std():.4f}")
✅ Expected output (approximate)
============================================================
CROSS-VALIDATION RESULTS (5-Fold)
============================================================
Logistic Regression:
F1 scores: [0.9783 0.9778 0.9565 0.9778 0.9778]
Mean F1: 0.9736 ± 0.0087
Random Forest:
F1 scores: [0.9778 0.9556 0.9565 0.9778 0.9556]
Mean F1: 0.9647 ± 0.0107
SVM (RBF):
F1 scores: [0.9783 0.9778 0.9783 0.9778 0.9778]
Mean F1: 0.9780 ± 0.0003
...
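For intuition, cross_val_score above is equivalent to looping over the folds by hand. A minimal sketch, using make_classification as a stand-in for the lab's scaled training set:

```python
# What cross_val_score does under the hood: one fit + one score per fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in cv.split(X, y):
    # Fit on the training folds, score on the held-out fold
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"Mean F1: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```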
Step 4 — Detailed evaluation on the validation set
# Step 4: Detailed evaluation on validation set
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score
)
# Train all models on full training set and evaluate on validation set
val_results = {}
print("\n" + "=" * 60)
print("VALIDATION SET RESULTS")
print("=" * 60)
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    y_proba = model.predict_proba(X_val_scaled)[:, 1]
    val_results[name] = {
        'accuracy': accuracy_score(y_val, y_pred),
        'precision': precision_score(y_val, y_pred),
        'recall': recall_score(y_val, y_pred),
        'f1': f1_score(y_val, y_pred),
        'auc_roc': roc_auc_score(y_val, y_proba),
        'predictions': y_pred,
        'probabilities': y_proba,
    }
# Display results as a comparison table
results_df = pd.DataFrame(val_results).T
results_df = results_df[['accuracy', 'precision', 'recall', 'f1', 'auc_roc']]
results_df = results_df.round(4)
results_df = results_df.sort_values('f1', ascending=False)
print("\n📊 Model Comparison Table:")
print(results_df.to_string())
# Identify best model
best_model_name = results_df['f1'].idxmax()
print(f"\n🏆 Best model: {best_model_name} (F1 = {results_df.loc[best_model_name, 'f1']:.4f})")
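The ranking above sorts by F1 rather than raw accuracy. On imbalanced data, accuracy can reward a degenerate classifier, while F1 exposes it. A hypothetical illustration, unrelated to the lab's dataset:

```python
# A classifier that always predicts the majority class looks good on
# accuracy but scores zero on F1.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 90 + [1] * 10)   # 90% majority class
y_pred = np.zeros(100, dtype=int)        # always predict the majority

print(accuracy_score(y_true, y_pred))                   # → 0.9
print(f1_score(y_true, y_pred, zero_division=0))        # → 0.0
```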
Step 5 — Visualization: confusion matrix and ROC curves
# Step 5a: Confusion Matrix for the best model
from sklearn.metrics import ConfusionMatrixDisplay
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_val_pred = best_model.predict(X_val_scaled)
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_val, y_val_pred,
    display_labels=data.target_names,
    cmap='Purples',
    ax=ax
)
ax.set_title(f'Confusion Matrix - {best_model_name}', fontsize=14)
plt.tight_layout()
plt.savefig('reports/confusion_matrix.png', dpi=150)
plt.show()
print("✅ Confusion matrix saved to reports/confusion_matrix.png")
# Step 5b: ROC Curves for all models
from sklearn.metrics import roc_curve
fig, ax = plt.subplots(figsize=(10, 7))
colors = ['#7c3aed', '#3b82f6', '#10b981', '#f59e0b', '#ef4444']
for (name, result), color in zip(val_results.items(), colors):
    fpr, tpr, _ = roc_curve(y_val, result['probabilities'])
    ax.plot(fpr, tpr, color=color, lw=2,
            label=f"{name} (AUC = {result['auc_roc']:.4f})")
ax.plot([0, 1], [0, 1], 'k--', lw=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - Model Comparison', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('reports/roc_curve.png', dpi=150)
plt.show()
print("✅ ROC curves saved to reports/roc_curve.png")
Step 6 — Serialize the best model
We serialize the full pipeline (scaler + model) so that the preprocessing is bundled with the classifier.
# Step 6: Serialize the best model in all formats
import pickle
import joblib
from sklearn.pipeline import Pipeline
# Build full pipeline with best model
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', models[best_model_name])
])
best_pipeline.fit(X_train, y_train) # fit on UN-scaled data (pipeline handles it)
# 6a. Pickle
with open('models/best_model.pkl', 'wb') as f:
    pickle.dump(best_pipeline, f)
print("✅ Saved: models/best_model.pkl")
# 6b. Joblib (with compression)
joblib.dump(best_pipeline, 'models/best_model.joblib', compress=3)
print("✅ Saved: models/best_model.joblib")
# 6c. ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(best_pipeline, initial_types=initial_type)
with open('models/best_model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
print("✅ Saved: models/best_model.onnx")
# Compare file sizes (os was already imported in Step 1)
for ext in ['pkl', 'joblib', 'onnx']:
    filepath = f'models/best_model.{ext}'
    size_kb = os.path.getsize(filepath) / 1024
    print(f"  {filepath:30s} → {size_kb:8.1f} KB")
Step 7 — Load and verify the serialized models
# Step 7: Load and verify all serialized models
import numpy as np
# Sample test data (first 5 samples)
X_sample = X_test.iloc[:5]
y_sample = y_test.iloc[:5]
print("=" * 60)
print("VERIFICATION - Serialized Models")
print("=" * 60)
print(f"\nTrue labels: {y_sample.values}")
# 7a. Load Pickle
with open('models/best_model.pkl', 'rb') as f:
    model_pkl = pickle.load(f)
pred_pkl = model_pkl.predict(X_sample)
print(f"Pickle predictions: {pred_pkl}")
# 7b. Load Joblib
model_joblib = joblib.load('models/best_model.joblib')
pred_joblib = model_joblib.predict(X_sample)
print(f"Joblib predictions: {pred_joblib}")
# 7c. Load ONNX
import onnxruntime as ort
session = ort.InferenceSession('models/best_model.onnx')
input_name = session.get_inputs()[0].name
X_sample_float = X_sample.values.astype(np.float32)
pred_onnx = session.run(None, {input_name: X_sample_float})[0]
print(f"ONNX predictions: {pred_onnx}")
# 7d. Verify consistency
assert np.array_equal(pred_pkl, pred_joblib), "Pickle/Joblib mismatch!"
assert np.array_equal(pred_pkl, np.asarray(pred_onnx).ravel()), "Pickle/ONNX mismatch!"
print("\n✅ All serialization formats produce consistent predictions!")
# Full test set evaluation of loaded model
y_test_pred = model_joblib.predict(X_test)
final_accuracy = accuracy_score(y_test, y_test_pred)
final_f1 = f1_score(y_test, y_test_pred)
print(f"\n📊 Final Test Set Performance (loaded model):")
print(f" Accuracy: {final_accuracy:.4f}")
print(f" F1-Score: {final_f1:.4f}")
✅ Expected output
============================================================
VERIFICATION - Serialized Models
============================================================
True labels: [1 0 0 1 1]
Pickle predictions: [1 0 0 1 1]
Joblib predictions: [1 0 0 1 1]
ONNX predictions: [1 0 0 1 1]
✅ All serialization formats produce consistent predictions!
📊 Final Test Set Performance (loaded model):
Accuracy: 0.9737
F1-Score: 0.9808
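Beyond spot-checking a few samples, a serialized pipeline can be verified by an exact round-trip comparison against the live model. A minimal self-contained sketch (the names pipe and restored are illustrative, not from the lab script), done in memory so no files are touched:

```python
# In-memory round-trip: serialize a fitted pipeline to bytes, load it back,
# and check that predictions match the original exactly.
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, random_state=42)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))]).fit(X, y)

# dumps/loads round-trip without writing to disk
restored = pickle.loads(pickle.dumps(pipe))
assert np.array_equal(pipe.predict(X), restored.predict(X))
print("round-trip OK")
```

The same check applies to joblib (joblib.dump / joblib.load) since both rely on pickle's object protocol.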
Step 8 — Generate the evaluation report
# Step 8: Generate evaluation report
import json
from datetime import datetime
# Save metadata
metadata = {
    "model_name": best_model_name,
    "version": "1.0.0",
    "timestamp": datetime.now().isoformat(),
    "dataset": "Breast Cancer Wisconsin",
    "n_samples": len(X),
    "n_features": X.shape[1],
    "split": {"train": len(X_train), "val": len(X_val), "test": len(X_test)},
    "test_metrics": {
        "accuracy": round(final_accuracy, 4),
        "f1_score": round(final_f1, 4),
        "precision": round(precision_score(y_test, y_test_pred), 4),
        "recall": round(recall_score(y_test, y_test_pred), 4),
    },
    "serialization_formats": ["pickle", "joblib", "onnx"],
    "hyperparameters": models[best_model_name].get_params(),
}
with open('models/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2, default=str)
print("✅ Metadata saved to models/metadata.json")
# Generate text report
report_lines = [
    "=" * 60,
    "MODEL EVALUATION REPORT",
    "=" * 60,
    f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
    f"Dataset: Breast Cancer Wisconsin ({len(X)} samples, {X.shape[1]} features)",
    "",
    "--- Data Split ---",
    f"Training: {len(X_train)} samples",
    f"Validation: {len(X_val)} samples",
    f"Test: {len(X_test)} samples",
    "",
    "--- Cross-Validation Results (F1 Score) ---",
]
for name, result in sorted(cv_results.items(), key=lambda x: x[1]['mean'], reverse=True):
    report_lines.append(f"  {name:25s}: {result['mean']:.4f} ± {result['std']:.4f}")
report_lines.extend([
    "",
    "--- Validation Set Results ---",
    results_df.to_string(),
    "",
    f"--- Best Model: {best_model_name} ---",
    f"Test Accuracy: {final_accuracy:.4f}",
    f"Test F1-Score: {final_f1:.4f}",
    "",
    "--- Classification Report (Test Set) ---",
    classification_report(y_test, y_test_pred, target_names=data.target_names),
    "",
    "--- Serialized Files ---",
])
for ext in ['pkl', 'joblib', 'onnx']:
    filepath = f'models/best_model.{ext}'
    size_kb = os.path.getsize(filepath) / 1024
    report_lines.append(f"  {filepath}: {size_kb:.1f} KB")
report_text = "\n".join(report_lines)
with open('reports/evaluation_report.txt', 'w') as f:
    f.write(report_text)
print("✅ Evaluation report saved to reports/evaluation_report.txt")
print("\n" + report_text)
Validation checklist
Before submitting your lab, check the following points:
| # | Criterion | Checked |
|---|---|---|
| 1 | The dataset is correctly loaded and explored | ☐ |
| 2 | The data is split into 3 sets (train/val/test) | ☐ |
| 3 | Scaling is applied correctly (fit on the training set only) | ☐ |
| 4 | At least 3 models are trained and compared | ☐ |
| 5 | 5-fold cross-validation is used | ☐ |
| 6 | The metrics include accuracy, precision, recall, F1, and AUC-ROC | ☐ |
| 7 | The confusion matrix is generated and saved | ☐ |
| 8 | The ROC curves are generated and saved | ☐ |
| 9 | The best model is serialized in 3 formats (pkl, joblib, onnx) | ☐ |
| 10 | The serialized models are reloaded and verified | ☐ |
| 11 | An evaluation report is generated | ☐ |
| 12 | The metadata is saved as JSON | ☐ |
Bonus challenges
🚀 Challenge 1 — Hyperparameter tuning
Add a GridSearchCV or RandomizedSearchCV step on the best model to optimize its hyperparameters. Compare performance before and after tuning.
from sklearn.model_selection import GridSearchCV
# Example for Random Forest
param_grid = {
    'classifier__n_estimators': [50, 100, 200, 300],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(
    best_pipeline, param_grid, cv=5,
    scoring='f1', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.4f}")
🚀 Challenge 2 — Learning curves
Generate learning curves for the best model and determine whether it is overfitting or underfitting.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    best_pipeline, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='f1', n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='#7c3aed', label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), 's-', color='#f59e0b', label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig('reports/learning_curve.png', dpi=150)
plt.show()
🚀 Challenge 3 — MLflow tracking
Integrate MLflow to automatically log experiments, metrics, and models.
import mlflow
import mlflow.sklearn
mlflow.set_experiment("tp2-breast-cancer")
for name, model in models.items():
    with mlflow.start_run(run_name=name):
        pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
        mlflow.log_metric("f1", f1_score(y_test, y_pred))
        mlflow.sklearn.log_model(pipeline, "model")