Train, Validate, Test: Doing It Right
We've talked about a Train/Test split. But there is a subtle problem: if you train a model, test it, tweak the settings to get a better test score, and test again, you are accidentally "leaking" information from the test set into the model. You end up optimizing for the test set, so its score is no longer an honest estimate of how the model will do on data it has never seen.
The gold standard is a 3-way split: Train (to learn the patterns), Validate (to tune the model's settings), and Test (held in a locked vault until the very end for a final grade).
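As a rough illustration of the 3-way split, here is a minimal sketch that calls scikit-learn's train_test_split twice on the iris data. The 90/30/30 sizes, the fixed random_state, and the candidate values for n_estimators are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# Step 1: lock the test set in the vault (30 samples; the size is an assumption).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0
)

# Step 2: split what is left into train (90) and validation (30).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, stratify=y_rest, random_state=0
)

# Tune a setting using only the validation set.
best_n, best_val_acc = None, -1.0
for n in (5, 10, 50):  # hypothetical candidate values
    candidate = RandomForestClassifier(n_estimators=n, random_state=0)
    candidate.fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_n, best_val_acc = n, val_acc

# Open the vault once: a single final grade on the test set.
final_model = RandomForestClassifier(n_estimators=best_n, random_state=0)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print(f"Chosen n_estimators: {best_n}")
print(f"Validation accuracy: {best_val_acc:.2f}")
print(f"Final test accuracy: {test_acc:.2f}")
```

The test set is touched exactly once, after every tuning decision has been made, so the final score stays untainted by the tweaking.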
What if your validation set just happens to be unusually easy or hard? Enter K-Fold Cross-Validation. We chop our training data into K chunks (folds). We train on K-1 chunks and validate on the 1 remaining chunk. We repeat this K times, rotating the validation chunk, and average the scores. Because every sample is used for validation exactly once, the averaged score depends far less on one lucky or unlucky split than a single validation set does.
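To make the rotation concrete, here is a minimal sketch that writes the K-fold loop out by hand with scikit-learn's KFold. It runs on the full iris data for brevity (in practice you would do this on the training portion, with the test set still locked away), and the shuffle and random_state settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# shuffle=True matters here: the iris rows are sorted by class, so
# unshuffled folds would each be dominated by a single class.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Train a fresh model on the K-1 training folds.
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X[train_idx], y[train_idx])

    # Validate on the one fold that was held out this round.
    acc = model.score(X[val_idx], y[val_idx])
    fold_scores.append(acc)
    print(f"Fold {fold}: validated on {len(val_idx)} samples, accuracy {acc:.2f}")

mean_score = float(np.mean(fold_scores))
print(f"Average accuracy across folds: {mean_score:.2f}")
```

A new model is created inside the loop on purpose: each fold's score should come from a model that has never seen that fold's validation samples.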
Run 5-fold cross-validation on a Random Forest.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10)
# TODO: Use cross_val_score with cv=5 to get 5 accuracy scores
# scores = ???
# print(f"Scores for each fold: {scores}")
# print(f"Average accuracy: {scores.mean():.2f}")