## The Train/Validation/Test Split
When tuning hyperparameters (like tree depth), you cannot use the test set — that would leak test information into your model selection. Instead, use a three-way split:
| Set | Portion | Purpose |
|---|---|---|
| Training | 60% | The model learns from this data |
| Validation | 20% | Used to choose hyperparameters (e.g., best depth) |
| Test | 20% | Final evaluation ONLY — never touched during tuning |
Two parameters in the code below deserve a note. `random_state=42` makes the split reproducible (covered in Lesson 1.2), and `stratify=y` tells sklearn to preserve the class proportions on both sides of the split: if your data is 90% benign / 10% attack, both pieces will also be 90/10 instead of drifting randomly. This is critical for imbalanced security datasets.
```python
from sklearn.model_selection import train_test_split

# First split: carve off the test set (20% of the total)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: training and validation from the remainder
# (0.25 of the remaining 80% = 20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)
```
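As a sanity check, the two-step split can be verified on a small synthetic dataset. The 1,000-sample 90/10 data below is made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 90% benign (0), 10% attack (1)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

# Same two-step split as above
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Sizes come out 60% / 20% / 20%
print(len(X_train), len(X_val), len(X_test))
# Stratification keeps the attack proportion ~0.10 in every piece
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in a small validation set by pure chance, which makes validation accuracy a noisy guide for tuning.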
## The Overfitting Diagnostic
Sweep `max_depth` from 1 to 20, recording training and validation accuracy at each depth:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

train_scores, val_scores = [], []
depths = range(1, 21)

for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42)
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    val_scores.append(tree.score(X_val, y_val))

# Plot both curves: training (solid blue) vs validation (dashed red)
plt.plot(depths, train_scores, 'b-', label='Training')
plt.plot(depths, val_scores, 'r--', label='Validation')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```
## Reading the Divergence
| Depth | Train Acc | Val Acc | Gap | Diagnosis |
|---|---|---|---|---|
| 1 | ~0.65 | ~0.65 | ~0% | Underfitting — too simple |
| 3 | ~0.93 | ~0.92 | ~1% | Learning real patterns |
| 5 | ~0.99 | ~0.97 | ~2% | Sweet spot — best validation |
| 10 | 1.00 | ~0.95 | ~5% | Starting to overfit |
| 20 | 1.00 | ~0.94 | ~6% | Overfitting — memorising noise |
The overfitting point is where the gap between training and validation accuracy grows while validation accuracy plateaus or drops. Pick the depth just before this happens.
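The table's diagnosis column can be reproduced mechanically. Using the illustrative accuracies from the table (hypothetical numbers, not real model output), a simple gap-based check looks like:

```python
import numpy as np

# Illustrative accuracies from the table above (hypothetical values)
depths = np.array([1, 3, 5, 10, 20])
train_acc = np.array([0.65, 0.93, 0.99, 1.00, 1.00])
val_acc = np.array([0.65, 0.92, 0.97, 0.95, 0.94])

gap = train_acc - val_acc
# Flag overfitting: gap is large AND validation is below its peak
for d, t, v, g in zip(depths, train_acc, val_acc, gap):
    flag = "overfit" if g > 0.03 and v < val_acc.max() else ""
    print(f"depth={d:2d}  train={t:.2f}  val={v:.2f}  gap={g:.2f}  {flag}")

best = depths[np.argmax(val_acc)]
print("Best depth:", best)
```

The `g > 0.03` threshold here is an arbitrary illustrative cutoff; in practice you read the curve rather than apply a fixed number.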
## Finding the Sweet Spot Programmatically

```python
import numpy as np

# Find the depth with the highest validation accuracy
best_idx = int(np.argmax(val_scores))
best_depth = depths[best_idx]
best_train = train_scores[best_idx]
best_val = val_scores[best_idx]

print(f"Best depth: {best_depth}")
print(f"Train acc:  {best_train:.3f}")
print(f"Val acc:    {best_val:.3f}")
print(f"Gap:        {best_train - best_val:.3f}")
```
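Once `best_depth` is chosen, a common final step is to retrain on train+val at that depth and touch the test set exactly once, consistent with the "Final evaluation ONLY" rule in the table above. A minimal end-to-end sketch, using `make_classification` as a stand-in dataset (an assumption for illustration; substitute your own `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in imbalanced dataset (~90% class 0, ~10% class 1)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Same three-way split as above: 60% / 20% / 20%
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Sweep depths, scoring on the validation set only
depths = range(1, 21)
val_scores = [DecisionTreeClassifier(max_depth=d, random_state=42)
              .fit(X_train, y_train).score(X_val, y_val) for d in depths]
best_depth = list(depths)[int(np.argmax(val_scores))]

# Final step: retrain on train+val at the chosen depth,
# then evaluate on the untouched test set exactly once
final = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(f"Test accuracy at depth {best_depth}: {final.score(X_test, y_test):.3f}")
```

Retraining on train+val after tuning is a design choice: the validation set has done its job, so folding it back in gives the final model more data without compromising the test estimate.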
## Think Deeper
Your intrusion detector scores 100% on training data and 74% on validation data. The security team says 'the model works.' What do you tell them?