Depth Controls Complexity
Every additional level lets the tree create finer distinctions:
| Depth | Behaviour | Risk |
|---|---|---|
| 1 | Single yes/no question | Underfit — too simple |
| 3–5 | Captures major patterns | Good generalisation |
| 10+ | Tiny leaves for individual samples | Overfit — memorises data |
| None | Grows until all leaves are pure | Severe overfitting |
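The effect of the depth ceiling can be seen directly. A minimal sketch on a synthetic dataset (the dataset, split, and `random_state` values here are illustrative, not from the text):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data standing in for a real problem
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

for depth in [1, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # The unlimited tree (max_depth=None) typically reaches ~100% train
    # accuracy because it keeps splitting until its leaves are pure
    print(f"max_depth={depth}: actual depth={tree.get_depth()}, "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```

Watching the actual depth (`get_depth()`) and the train/test scores side by side makes the table concrete: the stump stops at depth 1, while the unconstrained tree grows far deeper and memorises the training set.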
The Overfitting Signal
An unlimited tree achieves ~100% training accuracy but fails on new data. The gap between training and test accuracy is the overfitting indicator:
| | Underfit (depth=1) | Good (depth=5) | Overfit (depth=15) |
|---|---|---|---|
| Train accuracy | 65% | 99% | 100% |
| Test accuracy | 65% | 97% | 94% |
| Gap | 0% | 2% | 6% |
Pick the depth just before the gap starts widening — that's where generalisation is best.
Finding the Sweet Spot
```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test are already defined
train_scores, test_scores = [], []
for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

plt.plot(range(1, 21), train_scores, label='Train', marker='o')
plt.plot(range(1, 21), test_scores, label='Test', marker='s')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Depth vs Accuracy — find the sweet spot')
plt.show()
```
The test curve typically rises, plateaus, then gently declines. The plateau is your target depth.
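The plateau can also be located programmatically rather than by eye. A minimal sketch using a hypothetical list of test accuracies (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical test accuracies for max_depth = 1..10
test_scores = [0.65, 0.78, 0.88, 0.94, 0.97, 0.97, 0.96, 0.95, 0.95, 0.94]

# np.argmax returns the FIRST index of the maximum, so ties are
# broken toward the shallower (simpler) tree
best_depth = int(np.argmax(test_scores)) + 1  # +1 because depths start at 1
print(best_depth)  # → 5
```

Breaking ties toward the shallower tree follows the same principle as the table above: when two depths generalise equally well, prefer the simpler model.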
Other Ways to Control Overfitting
| Parameter | Effect |
|---|---|
| `min_samples_split=10` | Nodes with fewer than 10 samples won't split further |
| `min_samples_leaf=5` | Every leaf must have at least 5 samples |
| `max_features='sqrt'` | Only consider √n features at each split (adds randomness) |
These are regularisation techniques — they constrain the model to prevent memorisation.
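These constraints plug straight into the `DecisionTreeClassifier` constructor and can be combined. A minimal sketch using the parameter values from the table (the dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Combine several regularisers instead of relying on max_depth alone
model = DecisionTreeClassifier(
    min_samples_split=10,   # don't split nodes with fewer than 10 samples
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    max_features='sqrt',    # inspect only sqrt(n_features) candidates per split
    random_state=0,
)
model.fit(X, y)

# model.apply(X) returns the leaf index for each sample, so counting
# occurrences gives the leaf sizes — every leaf respects the floor
leaf_sizes = np.bincount(model.apply(X))
print(leaf_sizes[leaf_sizes > 0].min() >= 5)  # → True
```

Note the interaction: `min_samples_leaf=5` is a hard guarantee on leaf size, while `max_features='sqrt'` injects randomness into split selection, which is the same mechanism random forests use per tree.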
Think Deeper
Try this:
Plot training and test accuracy for depths 1–20. At what depth does the gap between them start growing fast?
Typically around depth 5–7. Training accuracy keeps rising toward 100%, but test accuracy plateaus or drops. The growing gap is the overfitting signal. In production security models, you'd pick the depth just before the gap starts widening — maximising generalisation to new, unseen traffic.
Cybersecurity tie-in: An overfit model memorises your training traffic patterns.
When an attacker uses a slightly different technique, the model fails because it learned noise, not signal.
Generalisation is the goal — a model that works on traffic it has never seen before.