Lesson 21: Train, Validate, Test — Doing It Right

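We've talked about the Train/Test split. But there is a subtle problem: if you train a model, test it, tweak the settings to get a better test score, and test again, you are accidentally "leaking" information from the test set into the model. Every tweak you keep was chosen because it scored well on those particular test examples, so you end up optimizing for the test set, and the test score stops being an honest estimate of performance on new data.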
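To see the effect in isolation, here is a small simulation, offered as a sketch rather than a realistic workflow. It assumes pure-noise data (the labels are coin flips) and a made-up toy "model" that predicts 1 when a random weighted sum of the features is positive. The names `accuracy`, `candidates`, and `X_fresh` are invented for this illustration. Nothing here can truly beat 50%, yet picking the best of many tries on the test set still produces a flattering score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Pure-noise data: labels are coin flips, so no rule can truly beat 50%.
X_test = rng.normal(size=(100, n_features))
y_test = rng.integers(0, 2, size=100)
X_fresh = rng.normal(size=(10_000, n_features))
y_fresh = rng.integers(0, 2, size=10_000)


def accuracy(w, X, y):
    """Accuracy of the toy rule 'predict 1 when X @ w > 0'."""
    return float(np.mean((X @ w > 0).astype(int) == y))


# "Tweak the settings" 200 times and keep whichever does best on the test set.
candidates = rng.normal(size=(200, n_features))
test_scores = [accuracy(w, X_test, y_test) for w in candidates]
best_w = candidates[int(np.argmax(test_scores))]

best_test_acc = max(test_scores)
fresh_acc = accuracy(best_w, X_fresh, y_fresh)

print(f"Best score on the reused test set: {best_test_acc:.2f}")
print(f"Same rule on fresh data:           {fresh_acc:.2f}")
```

The gap between the two printed numbers comes entirely from selecting on the test set, which is exactly the leak described above.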

The Validation Set

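The standard fix is a 3-way split: Train (to learn the patterns), Validate (to tune the model's settings), and Test (held in a locked vault until the very end for a final grade). Because you only ever tune against the validation set, the test set stays untouched and its score remains an honest estimate.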
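One way to build such a split is to call scikit-learn's `train_test_split` twice. The sketch below assumes a 60/20/20 ratio, stratified sampling, and a fixed `random_state`; these are illustrative choices, not something the lesson prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: lock away 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: split what is left into train and validation.
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(f"train={len(X_train)}, validate={len(X_val)}, test={len(X_test)}")
```

From here you would fit on `X_train`, compare settings on `X_val`, and touch `X_test` only once, after all tuning is finished.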

K-Fold Cross-Validation

What if your validation set just happens to be unusually easy or hard? Enter K-Fold Cross-Validation. We chop our training data into K chunks (folds). We train on K-1 chunks and validate on the 1 remaining chunk. We repeat this K times, rotating the validation chunk so that every fold serves as the validation set exactly once, and average the K scores. Because the average no longer depends on a single lucky or unlucky split, it gives us a more robust estimate of how the model will perform.
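The rotation described above can be written out by hand with scikit-learn's `KFold` splitter. This is a sketch under a few assumptions: a decision tree stands in for the model, the full iris dataset stands in for the training data, and the folds are shuffled with a fixed seed. It deliberately spells out the loop so you can see each step.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffle before splitting: iris is sorted by class, so unshuffled folds
# would each be dominated by one or two classes.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # A fresh model per fold, so nothing carries over between rounds.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # validate on 1

print("Accuracy per fold:", np.round(fold_scores, 2))
print(f"Average accuracy: {np.mean(fold_scores):.2f}")
```

Each sample lands in the validation fold exactly once, so the average reflects performance across the whole training set rather than one arbitrary slice.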

Python Challenge: Cross-Validate!

Run a 5-fold cross-validation on a Random Forest.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# random_state fixes the forest's randomness so reruns give the same model
model = RandomForestClassifier(n_estimators=10, random_state=42)

# TODO: Use cross_val_score with cv=5 to get 5 accuracy scores
# scores = ???

# print(f"Scores for each fold: {scores}")
# print(f"Average accuracy: {scores.mean():.2f}")