Step 1: From Tree to Forest

Why a single tree overfits and how bagging fixes it


The Single-Tree Problem

A single DecisionTreeClassifier with no depth limit grows until every training sample is perfectly classified. It memorises noise, outliers, and quirks of the training set.

| Metric | Single Tree (no limit) | What it means |
|---|---|---|
| Training accuracy | 1.000 | Perfect, but suspicious |
| Test accuracy | ~0.891 | ~11-point drop signals overfitting |
| Stability | Low | Small data changes produce very different trees |

The model has memorised the training data rather than learning general patterns. On new malware samples, those memorised details do not apply.

Bootstrap Aggregation (Bagging)

Bagging fixes overfitting by creating diversity:

| Step | What happens |
|---|---|
| 1. Bootstrap sample | Randomly draw N rows with replacement from the training data |
| 2. Train one tree | Fit a full decision tree on that bootstrap sample |
| 3. Repeat K times | Create K trees, each seeing slightly different data |
| 4. Aggregate | Majority vote (classification) or average (regression) |

Each tree overfits differently. When you average their predictions, the individual errors cancel out — the ensemble generalises better than any single tree.
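The four steps above can be sketched by hand in a few lines. This is a minimal illustration, not the lesson's lab code: the synthetic dataset from make_classification and the choice of 25 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Stand-in dataset; in the lab this would be the malware feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

K = 25  # number of trees in the ensemble (odd, so majority vote never ties)
trees = []
for _ in range(K):
    # Step 1: bootstrap sample -- draw N row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one full (unpruned) tree on that bootstrap sample
    t = DecisionTreeClassifier(max_depth=None).fit(X_train[idx], y_train[idx])
    trees.append(t)  # Step 3: repeat K times

# Step 4: aggregate -- majority vote across the K trees (binary labels)
votes = np.stack([t.predict(X_test) for t in trees])     # shape (K, n_test)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Ensemble test accuracy:", (ensemble_pred == y_test).mean())
```

Each tree in the loop overfits its own bootstrap sample; only the vote at the end is new, and that is where the individual errors cancel.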

Key Code Pattern

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Split the feature matrix X and labels y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a single tree with no depth limit
# max_depth=None lets the tree grow until every leaf is pure (the overfitting trap from Lesson 1.4 Step 3)
# random_state=42 makes the run reproducible (see Lesson 1.2)
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train)

print(f"Train accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {tree.score(X_test, y_test):.3f}")
# Expect: Train ~1.000, Test ~0.891 (overfitting gap)

Why Random Forests Work

| Single-tree problem | How Random Forest fixes it |
|---|---|
| Overfits training data | Each tree sees different data; the ensemble averages out errors |
| Sensitive to small data changes | Random sampling creates diversity among trees |
| One bad split ruins everything | Bad splits in one tree are diluted by the other trees |
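In practice you do not write the bagging loop yourself: sklearn's RandomForestClassifier bundles bootstrap sampling, per-split feature subsampling, and the vote. A minimal sketch of the single-tree-versus-forest comparison, using a synthetic stand-in dataset rather than the lesson's malware data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Illustrative dataset; sizes and hyperparameters are arbitrary choices
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One unpruned tree: memorises the training set
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

# 100 bagged trees with random feature subsets at each split
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"Single tree train/test: {tree.score(X_train, y_train):.3f} / "
      f"{tree.score(X_test, y_test):.3f}")
print(f"Forest      train/test: {forest.score(X_train, y_train):.3f} / "
      f"{forest.score(X_test, y_test):.3f}")
```

The single tree still hits 1.000 on training data; the point of the comparison is the smaller train-to-test gap on the forest side.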

Think Deeper

A single decision tree with no depth limit reaches 100% training accuracy on your malware dataset. Is that good news?

No -- it is a red flag. 100% training accuracy with an unbounded tree means the tree has memorised every training sample, including noise and outliers. On new malware samples it has never seen, accuracy drops significantly. This is overfitting. Bagging (Random Forest) fixes this by averaging many trees, each trained on a different bootstrap sample, which cancels out the individual trees' memorised noise.
Cybersecurity tie-in: In malware classification, new variants appear daily. A single tree trained on last week's samples memorises specific byte patterns that tomorrow's malware won't share. A Random Forest is more robust to distribution shift — its diversity means some trees will still catch the new variant even if others miss it.
