The Single-Tree Problem
A single DecisionTreeClassifier with no depth limit grows until every training sample is perfectly classified. It memorises noise, outliers, and quirks of the training set.
| Metric | Single Tree (no limit) | What it means |
|---|---|---|
| Training accuracy | 1.000 | Perfect — but suspicious |
| Test accuracy | ~0.891 | ~11-point drop = overfitting |
| Stability | Low | Small data changes produce very different trees |
The model has memorised the training data rather than learning general patterns. On new malware samples, those memorised details do not apply.
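The instability in the table above is easy to demonstrate. The sketch below is a minimal illustration using synthetic data from `make_classification` as a stand-in for a malware feature matrix (the dataset is an assumption): two unlimited-depth trees are trained on two bootstrap resamples of the same data, yet they disagree on a noticeable fraction of predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a malware feature matrix (hypothetical data)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rng = np.random.default_rng(0)

# Train two unlimited-depth trees on two bootstrap draws of the same data
preds = []
for _ in range(2):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    t = DecisionTreeClassifier(max_depth=None, random_state=0)
    t.fit(X[idx], y[idx])
    preds.append(t.predict(X))

# Fraction of samples where the two trees disagree: nonzero even though
# both trees saw resamples of the same underlying data
disagreement = np.mean(preds[0] != preds[1])
print(f"Disagreement between the two trees: {disagreement:.3f}")
```

Small perturbations of the training set change the greedy split choices near the root, and every split below inherits that change, which is why single unlimited trees are unstable.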
Bootstrap Aggregation (Bagging)
Bagging (bootstrap aggregation) reduces overfitting by creating diversity among trees:
| Step | What happens |
|---|---|
| 1. Bootstrap sample | Randomly draw N rows with replacement from training data |
| 2. Train one tree | Fit a full decision tree on that bootstrap sample |
| 3. Repeat K times | Create K trees, each seeing slightly different data |
| 4. Aggregate | Majority vote (classification) or average (regression) |
Each tree overfits differently. When you aggregate their predictions, the uncorrelated parts of their errors tend to cancel, so the ensemble generalises better than any single tree.
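The four steps above can be sketched directly in a few lines. This is a hand-rolled bagging loop on synthetic stand-in data (the dataset and the choice of K=50 trees are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the malware feature matrix
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
K = 50
trees = []

# Steps 1-3: bootstrap sample, train one full tree, repeat K times
for _ in range(K):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # N rows, with replacement
    tree = DecisionTreeClassifier(max_depth=None, random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 4: aggregate by majority vote across the K trees
votes = np.stack([t.predict(X_test) for t in trees])    # shape (K, n_test)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority of 0/1 votes

single = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
single_acc = np.mean(single.predict(X_test) == y_test)
ensemble_acc = np.mean(ensemble_pred == y_test)
print(f"Single tree test accuracy:     {single_acc:.3f}")
print(f"Bagged ensemble test accuracy: {ensemble_acc:.3f}")
```

In practice you would not write this loop yourself; `sklearn.ensemble.BaggingClassifier` and `RandomForestClassifier` implement the same idea with extra refinements.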
Key Code Pattern
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Assumes X (feature matrix) and y (labels) are already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a single tree with no depth limit.
# max_depth=None lets the tree grow until every leaf is pure
# (the overfitting trap from Lesson 1.4 Step 3).
# random_state=42 makes the run reproducible (see Lesson 1.2).
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train)

print(f"Train accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")
# Expect: Train ~1.000, Test ~0.891 (overfitting gap)
```
Why Random Forests Work
| Single tree problem | How Random Forest fixes it |
|---|---|
| Overfits training data | Each tree sees different data; ensemble averages out errors |
| Sensitive to small data changes | Random sampling creates diversity among trees |
| One bad split ruins everything | Bad splits in one tree are diluted by the other trees |
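The fixes in the table above come built in with `sklearn.ensemble.RandomForestClassifier`, which layers per-split feature randomness (`max_features`) on top of bagging. A minimal sketch, again using synthetic stand-in data (the dataset and hyperparameter values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the malware feature matrix
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of bagged trees; max_features="sqrt" considers only
# a random subset of features at each split, decorrelating the trees further
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=42)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

The train accuracy is still near 1.0 (each tree is unlimited), but the train/test gap shrinks relative to a single tree because the trees' memorised noise does not survive the vote.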
Think Deeper
Try this:
A single decision tree with no depth limit reaches 100% training accuracy on your malware dataset. Is that good news?
No -- it is a red flag. 100% training accuracy with an unbounded tree means the tree has memorised every training sample, including noise and outliers. On new malware samples it has never seen, accuracy drops significantly. This is overfitting. Bagging (Random Forest) fixes this by averaging many trees, each trained on a different bootstrap sample, which cancels out the individual trees' memorised noise.
Cybersecurity tie-in: In malware classification, new variants appear daily. A single tree trained on
last week's samples memorises specific byte patterns that tomorrow's malware won't share. A Random Forest is more
robust to distribution shift — its diversity means some trees will still catch the new variant
even if others miss it.