Step 4: Depth and Overfitting

Finding the sweet spot between too simple and too complex


Depth Controls Complexity

Every additional level lets the tree create finer distinctions:

| Depth | Behaviour | Risk |
| --- | --- | --- |
| 1 | Single yes/no question | Underfit — too simple |
| 3–5 | Captures major patterns | Good generalisation |
| 10+ | Tiny leaves for individual samples | Overfit — memorises data |
| None | Grows until all leaves are pure | Severe overfitting |
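You can watch this complexity growth directly by counting leaves at each depth. A minimal sketch, assuming scikit-learn and its built-in breast-cancer dataset (any tabular dataset would do):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Leaf count grows rapidly with depth; max_depth=None grows until pure
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X, y)
    print(f"max_depth={depth}: {tree.get_n_leaves()} leaves")
```

A depth-1 stump has exactly 2 leaves; the unlimited tree keeps splitting until every leaf is pure, which is how it ends up carving out tiny regions for individual samples.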

The Overfitting Signal

An unlimited tree achieves ~100% training accuracy but fails on new data. The gap between training and test accuracy is the overfitting indicator:

| | Underfit (depth=1) | Good (depth=5) | Overfit (depth=15) |
| --- | --- | --- | --- |
| Train accuracy | 65% | 99% | 100% |
| Test accuracy | 65% | 97% | 94% |
| Gap | 0% | 2% | 6% |

Pick the depth just before the gap starts widening — that's where generalisation is best.
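To see the gap for yourself, fit an unlimited tree and compare its train and test scores. A short sketch, again assuming scikit-learn's breast-cancer dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No max_depth: the tree grows until every leaf is pure
unlimited = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

gap = unlimited.score(X_train, y_train) - unlimited.score(X_test, y_test)
print(f"train={unlimited.score(X_train, y_train):.2f}, "
      f"test={unlimited.score(X_test, y_test):.2f}, gap={gap:.2f}")
```

The unlimited tree scores a perfect 1.00 on the data it memorised, while the test score tells the honest story.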

Finding the Sweet Spot

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test have already been split
train_scores, test_scores = [], []

for depth in range(1, 21):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

plt.plot(range(1, 21), train_scores, label='Train', marker='o')
plt.plot(range(1, 21), test_scores, label='Test', marker='s')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Depth vs Accuracy — find the sweet spot')
plt.show()
```

The test curve typically rises, plateaus, then gently declines. The plateau is your target depth.
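You can also pick the plateau programmatically rather than by eye: take the depth where the test score first peaks. A self-contained sketch (the breast-cancer dataset here is an assumption, not part of the lesson's data):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

depths = list(range(1, 21))
test_scores = [
    DecisionTreeClassifier(max_depth=d, random_state=42)
    .fit(X_train, y_train)
    .score(X_test, y_test)
    for d in depths
]

# argmax returns the FIRST maximum, i.e. the start of the plateau
best_depth = depths[int(np.argmax(test_scores))]
print(f"Best max_depth: {best_depth}")
```

For a more robust estimate you would cross-validate each depth (e.g. with `sklearn.model_selection.cross_val_score`) instead of relying on a single train/test split.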

Other Ways to Control Overfitting

| Parameter | Effect |
| --- | --- |
| min_samples_split=10 | Nodes with fewer than 10 samples won't split further |
| min_samples_leaf=5 | Every leaf must have at least 5 samples |
| max_features='sqrt' | Only consider √n features at each split (adds randomness) |

These are regularisation techniques — they constrain the model to prevent memorisation.


Think Deeper

Plot training and test accuracy for depths 1–20. At what depth does the gap between them start growing fast?

Typically around depth 5–7. Training accuracy keeps rising toward 100%, but test accuracy plateaus or drops. The growing gap is the overfitting signal. In production security models, you'd pick the depth just before the gap starts widening — maximising generalisation to new, unseen traffic.
Cybersecurity tie-in: An overfit model memorises your training traffic patterns. When an attacker uses a slightly different technique, the model fails because it learned noise, not signal. Generalisation is the goal — a model that works on traffic it has never seen before.
