Why Single Splits Are Unreliable
With a single 80/20 train/test split, your test set is one random 20% sample. If that 20% happened to be "easy" examples, your score is optimistically high. If it was "hard" examples, it's pessimistically low. You cannot tell which.
K-fold cross-validation solves this by using every sample as both training and test:
| Step | Action |
|---|---|
| 1 | Divide data into K equal folds |
| 2 | For k = 1 to K: train on K-1 folds, evaluate on fold k |
| 3 | Average the K evaluation scores |
Every sample is used for evaluation exactly once. The K scores give you a mean and standard deviation — far more reliable than a single number.
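The loop in the table above can be sketched directly with scikit-learn's `KFold`, which yields train/test indices for each fold. The dataset here is synthetic, purely to keep the sketch self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your X, y
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores, test_indices = [], []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X[train_idx], y[train_idx])            # step 2: train on K-1 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    test_indices.extend(test_idx)                    # record which samples were evaluated

# Every sample lands in exactly one test fold
assert sorted(test_indices) == list(range(len(X)))
print(f"Mean: {np.mean(scores):.4f}, Std: {np.std(scores):.4f}")  # step 3: average
```

In practice you rarely write this loop yourself — `cross_val_score` below does the same thing in one call — but seeing the indices makes the "every sample evaluated exactly once" guarantee concrete.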
Using cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
# X, y can be any feature matrix and label vector; a built-in dataset keeps this runnable
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
# Example output: 0.9680 +/- 0.0080 (your numbers will differ)
A low standard deviation means the model performs consistently across different data splits. A high one means the score depends heavily on which samples landed in which fold — exactly the lottery a single split would have hidden.
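To see that stability argument in action, compare the fold-score spread of a heavily constrained tree against a fully grown one. This is a sketch on a synthetic dataset; the actual spreads depend entirely on your data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any X, y would do)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for depth in (2, None):  # shallow tree vs. unconstrained tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"max_depth={depth}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Whichever configuration shows the larger standard deviation is the one whose headline mean you should trust less.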
5-Fold vs 10-Fold
| Setting | Training size | Bias | Variance | Compute cost |
|---|---|---|---|---|
| 5-fold | 80% per fold | Slightly higher | Lower | Fast |
| 10-fold | 90% per fold | Lower | Higher | 2x slower |
For datasets with 2000+ samples, 5-fold is usually sufficient. Use 10-fold when data is scarce.
# Compare 5-fold and 10-fold
scores_5 = cross_val_score(model, X, y, cv=5, scoring='accuracy')
scores_10 = cross_val_score(model, X, y, cv=10, scoring='accuracy')
print(f"5-fold: {scores_5.mean():.4f} +/- {scores_5.std():.4f}")
print(f"10-fold: {scores_10.mean():.4f} +/- {scores_10.std():.4f}")
Cross-Validation vs Single Split
| Approach | Estimate quality | Detects data-dependent issues? |
|---|---|---|
| Single 80/20 split | One number — could be lucky or unlucky | No — hidden by one random split |
| 5-fold CV | Mean ± std — reveals stability | Yes — one bad fold exposes weak spots |
| Stratified CV | Same, but preserves class ratios in each fold | Yes — essential for imbalanced security data |
from sklearn.model_selection import StratifiedKFold
# Stratified ensures each fold has the same attack/benign ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print(f"Stratified 5-fold AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
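To see why stratification matters, compare the per-fold positive-class ratio under plain `KFold` versus `StratifiedKFold` on an imbalanced toy label vector (10% positives, mimicking attack/benign data; the numbers are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 90)   # 10% "attack", 90% "benign"
X = np.zeros((100, 1))              # features are irrelevant to the split itself

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    ratios = [y[test].mean() for _, test in cv.split(X, y)]
    print(f"{name}: {[f'{r:.2f}' for r in ratios]}")
# StratifiedKFold keeps every fold at exactly 0.10; plain KFold makes no such guarantee
```

With only ten positives, a plain split can easily leave a fold with one or zero attack samples, making that fold's score meaningless.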
Think Deeper
You get cross-validation scores of [0.98, 0.71, 0.95, 0.96, 0.94]. One fold is much lower. Should you worry?
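One way to start investigating: rerun the split with a fixed `KFold` so you can recover the indices of the weak fold, then inspect what those samples have in common (class balance, feature ranges, collection time). A sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your data (assumption)
X, y = make_classification(n_samples=500, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
    results.append((fold, acc, test_idx))

# Pull out the weakest fold and examine its composition
fold, acc, idx = min(results, key=lambda r: r[1])
print(f"Weakest fold: #{fold}, acc={acc:.4f}, positive ratio={y[idx].mean():.2f}")
```

Because the `KFold` object is seeded, the same indices come back on every run, so you can keep drilling into that one fold's samples.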