
Quiz — Model Training & Evaluation

Quiz · 30 min · Module 2

Instructions

This quiz covers the concepts of Module 2: training, evaluation, and serialization of ML models. It includes 25 questions in 3 sections.

Section | Theme | Questions
--- | --- | ---
A | Training Concepts | 10
B | Evaluation Metrics | 8
C | Serialization & Versioning | 7

For each question, choose the best answer among the options, then click "Show Answer" to verify.


Section A — Training Concepts (10 questions)

Question 1

What is the main role of the validation set in an ML pipeline?

  • A) Train the model on more data
  • B) Evaluate the final performance of the model
  • C) Select hyperparameters and compare models
  • D) Increase the size of the training dataset
Show Answer

Answer: C

The validation set is used to tune hyperparameters and compare models during development. The test set (B) is reserved for the final, one-time evaluation. The validation set never participates in training (A), and splitting never increases the amount of data (D).


Question 2

You split a dataset of 1000 samples into 60% train, 20% validation, 20% test. How many samples does each set contain?

  • A) Train: 600, Val: 200, Test: 200
  • B) Train: 800, Val: 100, Test: 100
  • C) Train: 600, Val: 300, Test: 100
  • D) Train: 700, Val: 150, Test: 150
Show Answer

Answer: A

60% of 1000 = 600 (train), 20% of 1000 = 200 (validation), 20% of 1000 = 200 (test). This is the standard 60/20/20 split.


Question 3

What does the stratify=y parameter mean in train_test_split?

  • A) Data is sorted by ascending order of y
  • B) The proportion of each class in y is preserved in each subset
  • C) Samples are shuffled randomly
  • D) Only majority classes are kept
Show Answer

Answer: B

stratify=y ensures that each split contains the same proportion of classes as the original dataset. This is crucial for imbalanced datasets. For example, if the dataset has 70% class 0 and 30% class 1, each split will also have approximately 70/30.
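As a quick sketch of this behavior (the 70/30 labels here are illustrative), `stratify=y` keeps the class ratio identical in both splits:

```python
# Hypothetical dataset: 70% class 0, 30% class 1.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both subsets keep the original 70/30 class ratio.
print(np.mean(y_tr == 1))  # 0.3
print(np.mean(y_te == 1))  # 0.3
```

Without `stratify=y`, a random split on a small or imbalanced dataset can easily end up with a skewed class ratio in one subset.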


Question 4

In 5-Fold cross-validation, how many times is the model trained in total?

  • A) 1 time
  • B) 4 times
  • C) 5 times
  • D) 25 times
Show Answer

Answer: C

In K-Fold with K=5, the dataset is split into 5 folds. The model is trained 5 times, each time using a different fold as validation and the other 4 as training. The final score is the average of the 5 scores obtained.
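A minimal sketch of this with scikit-learn (the dataset and model are just examples) — `cross_val_score` with `cv=5` fits the model five times and returns one score per fold:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: five fits, each holding out a different fold for validation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(len(scores))    # 5 — one score per fold
print(scores.mean())  # the cross-validated estimate
```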


Question 5

What is the main difference between GridSearchCV and RandomizedSearchCV?

  • A) GridSearch is faster than RandomSearch
  • B) GridSearch tests all combinations, RandomSearch samples randomly
  • C) RandomSearch does not do cross-validation
  • D) GridSearch only works with linear models
Show Answer

Answer: B

GridSearchCV exhaustively tests all combinations of the parameter grid. RandomizedSearchCV samples a fixed number (n_iter) of random combinations from the search space. RandomSearch is generally more efficient when the search space is large, as some hyperparameters have more impact than others.
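A side-by-side sketch of the two searches (the estimator and parameter values are illustrative) — note that GridSearchCV tries every value in the grid, while RandomizedSearchCV draws `n_iter` samples from a distribution:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# GridSearchCV: exhaustive — 4 values of C x 5 folds = 20 fits.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5
)
grid.fit(X, y)

# RandomizedSearchCV: samples n_iter=4 values of C from a log-uniform
# distribution — same budget, but covers a continuous range.
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), {"C": loguniform(1e-3, 1e2)},
    n_iter=4, cv=5, random_state=0
)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```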


Question 6

A model has a training score of 0.99 and a test score of 0.65. What is the diagnosis?

  • A) Underfitting
  • B) Good fit
  • C) Overfitting
  • D) Data leakage
Show Answer

Answer: C

A large gap between train score (0.99) and test score (0.65) is the classic sign of overfitting. The model has memorized the training data but does not generalize to new data. Solutions: add regularization, reduce model complexity, or collect more data.


Question 7

Which strategy best fixes an underfitting problem?

  • A) Add more regularization (L2)
  • B) Reduce the number of features
  • C) Use a more complex model
  • D) Reduce the size of the training dataset
Show Answer

Answer: C

Underfitting means the model is too simple to capture the data patterns. The solution is to increase complexity: use a more powerful model (Random Forest instead of Logistic Regression), add features, or reduce regularization. Options A, B and D would worsen the problem.


Question 8

What does "bias" represent in the bias-variance tradeoff?

  • A) The error due to the model's sensitivity to data noise
  • B) The systematic error due to the model's simplifying assumptions
  • C) The irreducible error inherent in the data
  • D) The error due to lack of training data
Show Answer

Answer: B

Bias is the systematic error caused by the model's simplifying assumptions. A high-bias model (e.g., linear regression on non-linear data) underfits the patterns. Variance (A) is the error due to sensitivity to data fluctuations. Irreducible error (C) is intrinsic noise.


Question 9

In a learning curve, train and validation scores converge at a high level. What is the interpretation?

  • A) Severe overfitting
  • B) Underfitting — need a more complex model
  • C) Good fit — the model generalizes well
  • D) Data leakage — scores are too high
Show Answer

Answer: C

When both curves converge at a high level, it means the model learns the training data well AND generalizes well on validation data. This is the ideal scenario. Overfitting (A) shows a large gap between the two curves. Underfitting (B) shows both curves at a low level.


Question 10

Which type of cross-validation is most appropriate for time series data?

  • A) Standard K-Fold
  • B) Stratified K-Fold
  • C) Leave-One-Out
  • D) Time Series Split
Show Answer

Answer: D

Time Series Split respects the chronological order of data: training data always precedes test data in time. The other methods (A, B, C) randomly shuffle the data, which would cause temporal data leakage — using future data to predict the past.
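A small sketch of the chronological splits (the 10-sample series is illustrative) — in every split, all training indices precede all test indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # samples in chronological order

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Training data always precedes test data — no temporal leakage.
    assert train_idx.max() < test_idx.min()
    print(train_idx, test_idx)
```

The training window grows at each split while the test window slides forward, mimicking how the model would be retrained and evaluated over time in production.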


Section B — Evaluation Metrics (8 questions)

Question 11

In a confusion matrix, what does a "False Positive" represent?

  • A) The model predicts Positive and it's correct
  • B) The model predicts Negative and it's correct
  • C) The model predicts Positive but the true value is Negative
  • D) The model predicts Negative but the true value is Positive
Show Answer

Answer: C

A False Positive (FP) = the model said "Positive" (first part) but was wrong ("False"). The true label is Negative. It's a Type I error (false alarm). Example: a legitimate email classified as spam. Option D describes a False Negative (Type II error).


Question 12

A cancer detection system has the following results: TP=90, FP=10, FN=30, TN=870. What is the precision?

  • A) 0.90
  • B) 0.75
  • C) 0.97
  • D) 0.96
Show Answer

Answer: A

Precision = TP / (TP + FP) = 90 / (90 + 10) = 90 / 100 = 0.90

This means that among all "cancer" predictions, 90% are actually cancers. Recall would be TP / (TP + FN) = 90 / 120 = 0.75, and accuracy = (TP + TN) / Total = 960/1000 = 0.96.
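The three values can be checked directly from the counts in the question:

```python
# Worked check of the metrics from Question 12 (TP=90, FP=10, FN=30, TN=870).
tp, fp, fn, tn = 90, 10, 30, 870

precision = tp / (tp + fp)                    # 90 / 100
recall    = tp / (tp + fn)                    # 90 / 120
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 960 / 1000

print(precision, recall, accuracy)  # 0.9 0.75 0.96
```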


Question 13

With the same results (TP=90, FP=10, FN=30, TN=870), what is the recall?

  • A) 0.90
  • B) 0.75
  • C) 0.97
  • D) 0.96
Show Answer

Answer: B

Recall = TP / (TP + FN) = 90 / (90 + 30) = 90 / 120 = 0.75

This means the model detects 75% of true cancer cases. 25% of cases are missed (False Negatives). In a medical context, this recall is insufficient — ideally we want recall > 0.95 to miss almost no cases.


Question 14

For a bank fraud detection system, which metric should be prioritized?

  • A) Accuracy
  • B) Precision
  • C) Recall
  • D) Specificity
Show Answer

Answer: C

Recall is prioritized because the cost of a False Negative (letting fraud pass) is much higher than a False Positive (temporarily blocking a legitimate transaction). We prefer a few false alarms rather than missing real fraud. Accuracy (A) is misleading because fraud represents < 1% of transactions.


Question 15

Why is accuracy a misleading metric on an imbalanced dataset?

  • A) Because it's too complex to calculate
  • B) Because a model always predicting the majority class gets a high score
  • C) Because it doesn't account for the number of features
  • D) Because it requires a larger test set
Show Answer

Answer: B

On a dataset with 95% class A and 5% class B, a model that always predicts class A will have 95% accuracy. Yet this model is completely useless — it detects no class B cases. This is the "accuracy paradox". Use F1-score, AUC-ROC or precision/recall for imbalanced datasets.
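The paradox is easy to reproduce with a toy 95/5 dataset and a classifier that never predicts the minority class:

```python
# "Accuracy paradox": a constant majority-class predictor.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 — high score, yet not a single class-1 case detected
```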


Question 16

What does AUC-ROC (Area Under the ROC Curve) measure?

  • A) The average error of predictions
  • B) The model's ability to distinguish positive and negative classes at different thresholds
  • C) The number of features used by the model
  • D) The model's training speed
Show Answer

Answer: B

AUC-ROC measures the model's discrimination ability: the probability that a positive sample receives a higher score than a negative sample, across all classification thresholds. AUC of 0.5 = random classifier. AUC of 1.0 = perfect classifier. It's a threshold-independent metric.
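A tiny worked example (the four scores are illustrative): of the four positive/negative pairs, the positive sample outscores the negative in three, so AUC = 3/4:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model scores, not hard predictions

# Pairs (pos, neg): (0.35, 0.1) win, (0.35, 0.4) loss,
# (0.8, 0.1) win, (0.8, 0.4) win -> 3/4.
print(roc_auc_score(y_true, scores))  # 0.75
```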


Question 17

Which regression metric penalizes large errors the most?

  • A) MAE (Mean Absolute Error)
  • B) MSE (Mean Squared Error)
  • C) MAPE (Mean Absolute Percentage Error)
  • D) R² (Coefficient of Determination)
Show Answer

Answer: B

MSE (Mean Squared Error) squares the errors. An error of 10 contributes 100 to MSE, while an error of 2 contributes only 4. This squaring disproportionately penalizes large errors. MAE (A) treats all errors proportionally. MAPE (C) normalizes by the actual value. R² (D) measures explained variance.
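A quick numeric illustration: two error vectors with the same MAE but, because of the single large error, very different MSE:

```python
# Same total absolute error, different distribution.
errors_even  = [2, 2, 2, 2, 2]   # five moderate errors
errors_spike = [10, 0, 0, 0, 0]  # one large error

def mae(errs):
    return sum(abs(e) for e in errs) / len(errs)

def mse(errs):
    return sum(e * e for e in errs) / len(errs)

print(mae(errors_even), mse(errors_even))    # 2.0 4.0
print(mae(errors_spike), mse(errors_spike))  # 2.0 20.0
```

MAE cannot tell the two cases apart; MSE penalizes the spike five-fold.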


Question 18

A regression model has an R² of 0.85. How to interpret it?

  • A) The model has 85% accuracy
  • B) The model explains 85% of the variance of the target variable
  • C) The model has 15% error
  • D) 85% of predictions are exact
Show Answer

Answer: B

R² = 0.85 means the model explains 85% of the variance of the target variable. The remaining 15% is due to factors not captured by the model (missing features, random noise). R² of 1.0 = perfect predictions. R² of 0 = the model does no better than predicting the mean. Negative R² = the model is worse than the mean.
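R² can be computed directly from its definition, 1 − SS_res / SS_tot (the four data points below are illustrative):

```python
# R² from its definition on toy numbers.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.4, 8.7]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares

r2 = 1 - ss_res / ss_tot
print(r2)  # 0.985 — 98.5% of the variance is explained
```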


Section C — Serialization & Versioning (7 questions)

Question 19

What is the main security risk associated with pickle.load()?

  • A) The file may be too large for memory
  • B) Data may be corrupted during transmission
  • C) A malicious pickle file can execute arbitrary code when loaded
  • D) Pickle files are not compatible across operating systems
Show Answer

Answer: C

pickle.load() can deserialize and execute arbitrary Python code. An attacker can create a pickle file that, when loaded, executes malicious system commands (file deletion, malware installation, etc.). That's why you should never load a pickle file from an untrusted source. ONNX is a safer alternative as it cannot execute arbitrary code.


Question 20

What is the main advantage of joblib over pickle for ML models?

  • A) Joblib is more secure
  • B) Joblib is faster and offers native compression for large NumPy arrays
  • C) Joblib works with all programming languages
  • D) Joblib doesn't require a file on disk
Show Answer

Answer: B

Joblib is optimized for objects containing large NumPy arrays, which is the case for most scikit-learn models. It is 2 to 10 times faster than pickle for these objects and offers built-in compression (compress=3). Note: joblib has the same security risks as pickle (A is false). It only works in Python (C is false).
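A minimal save/load sketch (model and file name are illustrative) showing the `compress` option:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# compress=3: built-in zlib compression, a good size/speed tradeoff.
joblib.dump(model, "model.joblib", compress=3)

# Same security caveat as pickle: only load files you trust.
restored = joblib.load("model.joblib")
print(restored.score(X, y))
```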


Question 21

Which serialization format allows deploying a model in Python, C++, JavaScript and on mobile?

  • A) Pickle
  • B) Joblib
  • C) ONNX
  • D) JSON
Show Answer

Answer: C

ONNX (Open Neural Network Exchange) is a standardized and portable format that allows deploying models on any compatible runtime (Python, C++, JavaScript, mobile with ONNX Runtime). Pickle (A) and joblib (B) are limited to Python. JSON (D) can store parameters but not a complete model with its architecture.


Question 22

Why should you serialize the full pipeline (preprocessing + model) and not just the model?

  • A) To reduce file size
  • B) To ensure the same preprocessing is applied in production as in training
  • C) To speed up inference
  • D) To make the model compatible with ONNX
Show Answer

Answer: B

If you only serialize the model without preprocessing (StandardScaler, OneHotEncoder, etc.), you must manually reproduce the same preprocessing in production. The slightest difference (different mean/std, unknown category) produces incorrect predictions. Serializing the full pipeline ensures consistency between training and production.
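A sketch of the recommended pattern (the steps and dataset are illustrative): the scaler's learned statistics travel inside the serialized pipeline, so production code feeds in raw features:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),            # fitted mean/std stored inside
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

joblib.dump(pipe, "pipeline.joblib")

# In production: load once, predict on raw features — the pipeline
# applies the exact preprocessing learned at training time.
restored = joblib.load("pipeline.joblib")
print(restored.predict(X[:3]))
```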


Question 23

Which code illustrates "data leakage"?

  • A)
X_train, X_test = train_test_split(X, ...)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
  • B)
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled, ...)
  • C)
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
pipeline.fit(X_train, y_train)
  • D)
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
Show Answer

Answer: B

In option B, scaler.fit_transform(X) computes the mean and standard deviation on the entire dataset (including test data) before the split. This means the scaler contains information from the test set, which is data leakage. Options A, C and D are correct because the preprocessing never sees the test data.


Question 24

You load a serialized model and get a ConvergenceWarning. Accuracy is 0.52 while it was 0.95 during training. What is the most likely cause?

  • A) The model was corrupted during save
  • B) The scikit-learn version is different from the one used during training
  • C) The test set is too small
  • D) The file is too large
Show Answer

Answer: B

A ConvergenceWarning combined with a drastic performance drop when loading is the classic sign of version incompatibility in scikit-learn. Internal structures of estimators change between versions, which can silently corrupt the model. The solution is to retrain with the current version or downgrade to the original version (recorded in metadata).


Question 25

What is the best practice for versioning ML models in production?

  • A) Overwrite the model file at each new training
  • B) Save each version with a unique name, metadata (metrics, hyperparameters, date) and dependency versions
  • C) Keep only the latest version and the previous version
  • D) Use git to version model binary files directly
Show Answer

Answer: B

Best practice is to keep each version of the model with a unique name (e.g., model_v1.2.0_2025-03-15.joblib) accompanied by complete metadata: performance metrics, hyperparameters, training date, Python and library versions. This enables rollback, audit, A/B testing and traceability. Option A is dangerous (no rollback). Option D is not recommended because git is not designed for large binary files (use DVC or MLflow instead).
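A minimal sketch of this practice (file names, version string, and metadata fields are illustrative, not a standard):

```python
import json
import sys
from datetime import date

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Unique, versioned file name — never overwrite previous versions.
version = "1.2.0"
stem = f"model_v{version}_{date.today().isoformat()}"
joblib.dump(model, f"{stem}.joblib")

# Metadata saved alongside the model: metrics, hyperparameters, versions.
metadata = {
    "version": version,
    "trained_on": date.today().isoformat(),
    "train_accuracy": model.score(X, y),
    "hyperparameters": model.get_params(),
    "python_version": sys.version.split()[0],
    "sklearn_version": sklearn.__version__,
}
with open(f"{stem}.json", "w") as f:
    json.dump(metadata, f, indent=2, default=str)
```

Tools like MLflow or DVC automate this bookkeeping, but the principle is the same: every artifact carries enough metadata to reproduce or roll it back.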


Results

Count your correct answers and evaluate your level:

Score | Level | Recommendation
--- | --- | ---
23-25 / 25 | ⭐⭐⭐ Excellent | You have mastered the concepts of Module 2. Move on to Module 3!
18-22 / 25 | ⭐⭐ Good | Good understanding. Review the questions you missed.
13-17 / 25 | ⭐ Adequate | Re-read the Concepts and Serialization sections carefully.
0-12 / 25 | ❌ Insufficient | Restart the module from the beginning and redo TP2.

Key Takeaways

📝 Summary of essential concepts

Training Concepts:

  • Always split into train/validation/test (60/20/20)
  • K-Fold cross-validation for robust estimates
  • GridSearch vs RandomSearch depending on search space size
  • Overfitting = train >> test, Underfitting = both low
  • Bias-Variance tradeoff = finding the balance

Evaluation Metrics:

  • Confusion matrix = foundation of all classification metrics
  • Precision = relevance of positive predictions
  • Recall = coverage of true positives
  • F1 = harmonic mean of precision/recall
  • AUC-ROC = threshold-independent discrimination
  • Choose metric based on cost of errors

Serialization:

  • Pickle: simple but dangerous (arbitrary code)
  • Joblib: optimized for sklearn, built-in compression
  • ONNX: cross-platform portable, more secure
  • Always serialize the full pipeline
  • Always save metadata (versions, metrics)
  • MLflow for tracking and model registry