Why Single Splits Are Unreliable
With a single 80/20 train/test split, your test set is one random 20% sample. If that 20% happened to be "easy" examples, your score is optimistically high. If it was "hard" examples, it's pessimistically low. You cannot tell which.
K-fold cross-validation solves this by using every sample as both training and test:
| Step | Action |
|---|---|
| 1 | Divide data into K equal folds |
| 2 | For k = 1 to K: train on K-1 folds, evaluate on fold k |
| 3 | Average the K evaluation scores |
Every sample is used for evaluation exactly once. The K scores give you a mean and standard deviation — far more reliable than a single number.
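The loop in the table above can be sketched directly with scikit-learn's `KFold`, which yields train/test indices for each fold. The dataset here is synthetic, purely to keep the sketch self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your X, y
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores, test_indices = [], []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X[train_idx], y[train_idx])            # step 2: train on K-1 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    test_indices.extend(test_idx)                    # record which samples were evaluated

# Every sample lands in exactly one test fold
assert sorted(test_indices) == list(range(len(X)))
print(f"Mean: {np.mean(scores):.4f}, Std: {np.std(scores):.4f}")  # step 3: average
```

In practice you rarely write this loop yourself — `cross_val_score` below does the same thing in one call — but seeing the indices makes the "every sample evaluated exactly once" guarantee concrete.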
Using cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
# X, y can be any feature matrix and label vector; a built-in dataset keeps this runnable
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
# Example output: 0.9680 +/- 0.0080 (your numbers will differ)
A low standard deviation means the model performs consistently across different data splits. A high one means the score depends heavily on which samples landed in which fold — exactly the lottery a single split would have hidden.
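To see that stability argument in action, compare the fold-score spread of a heavily constrained tree against a fully grown one. This is a sketch on a synthetic dataset; the actual spreads depend entirely on your data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any X, y would do)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for depth in (2, None):  # shallow tree vs. unconstrained tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"max_depth={depth}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Whichever configuration shows the larger standard deviation is the one whose headline mean you should trust less.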
5-Fold vs 10-Fold
| Setting | Training size | Bias | Variance | Compute cost |
|---|---|---|---|---|
| 5-fold | 80% per fold | Slightly higher | Lower | Fast |
| 10-fold | 90% per fold | Lower | Higher | 2x slower |
For datasets with 2000+ samples, 5-fold is usually sufficient. Use 10-fold when data is scarce.
# Compare 5-fold and 10-fold
scores_5 = cross_val_score(model, X, y, cv=5, scoring='accuracy')
scores_10 = cross_val_score(model, X, y, cv=10, scoring='accuracy')
print(f"5-fold: {scores_5.mean():.4f} +/- {scores_5.std():.4f}")
print(f"10-fold: {scores_10.mean():.4f} +/- {scores_10.std():.4f}")
Cross-Validation vs Single Split
| Approach | Estimate quality | Detects data-dependent issues? |
|---|---|---|
| Single 80/20 split | One number — could be lucky or unlucky | No — hidden by one random split |
| 5-fold CV | Mean ± std — reveals stability | Yes — one bad fold exposes weak spots |
| Stratified CV | Same, but preserves class ratios in each fold | Yes — essential for imbalanced security data |
from sklearn.model_selection import StratifiedKFold
# Stratified ensures each fold has the same attack/benign ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print(f"Stratified 5-fold AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
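To see why stratification matters, compare the per-fold positive-class ratio under plain `KFold` versus `StratifiedKFold` on an imbalanced toy label vector (10% positives, mimicking attack/benign data; the numbers are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 90)   # 10% "attack", 90% "benign"
X = np.zeros((100, 1))              # features are irrelevant to the split itself

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    ratios = [y[test].mean() for _, test in cv.split(X, y)]
    print(f"{name}: {[f'{r:.2f}' for r in ratios]}")
# StratifiedKFold keeps every fold at exactly 0.10; plain KFold makes no such guarantee
```

With only ten positives, a plain split can easily leave a fold with one or zero attack samples, making that fold's score meaningless.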
Think Deeper
You get cross-validation scores of [0.98, 0.71, 0.95, 0.96, 0.94]. One fold is much lower. Should you worry?
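One way to start investigating: rerun the split with a fixed `KFold` so you can recover the indices of the weak fold, then inspect what those samples have in common (class balance, feature ranges, collection time). A sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your data (assumption)
X, y = make_classification(n_samples=500, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
    results.append((fold, acc, test_idx))

# Pull out the weakest fold and examine its composition
fold, acc, idx = min(results, key=lambda r: r[1])
print(f"Weakest fold: #{fold}, acc={acc:.4f}, positive ratio={y[idx].mean():.2f}")
```

Because the `KFold` object is seeded, the same indices come back on every run, so you can keep drilling into that one fold's samples.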