Step 1: From Tree to Forest

Why a single tree overfits and how bagging fixes it


The Single-Tree Problem

A single DecisionTreeClassifier with no depth limit grows until every training sample is perfectly classified. It memorises noise, outliers, and quirks of the training set.

| Metric | Single Tree (no limit) | What it means |
|---|---|---|
| Training accuracy | 1.000 | Perfect, but suspicious |
| Test accuracy | ~0.891 | ~11-point drop signals overfitting |
| Stability | Low | Small data changes produce very different trees |

The model has memorised the training data rather than learning general patterns. On new malware samples, those memorised details do not apply.

Bootstrap Aggregation (Bagging)

Bagging fixes overfitting by creating diversity:

| Step | What happens |
|---|---|
| 1. Bootstrap sample | Randomly draw N rows with replacement from the training data |
| 2. Train one tree | Fit a full decision tree on that bootstrap sample |
| 3. Repeat K times | Create K trees, each seeing slightly different data |
| 4. Aggregate | Majority vote (classification) or average (regression) |

Each tree overfits differently. When you average their predictions, the individual errors cancel out — the ensemble generalises better than any single tree.
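The four steps above can be sketched by hand in a few lines. This is a minimal illustration, not the lesson's lab code: the synthetic dataset from make_classification and the choice of 25 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Stand-in dataset; in the lab this would be the malware feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

K = 25  # number of trees in the ensemble (odd, so majority vote never ties)
trees = []
for _ in range(K):
    # Step 1: bootstrap sample -- draw N row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: fit one full (unpruned) tree on that bootstrap sample
    t = DecisionTreeClassifier(max_depth=None).fit(X_train[idx], y_train[idx])
    trees.append(t)  # Step 3: repeat K times

# Step 4: aggregate -- majority vote across the K trees (binary labels)
votes = np.stack([t.predict(X_test) for t in trees])     # shape (K, n_test)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Ensemble test accuracy:", (ensemble_pred == y_test).mean())
```

Each tree in the loop overfits its own bootstrap sample; only the vote at the end is new, and that is where the individual errors cancel.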

Key Code Pattern

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Split the feature matrix X and labels y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train a single tree with no depth limit
# max_depth=None lets the tree grow until every leaf is pure (the overfitting trap from Lesson 1.4 Step 3)
# random_state=42 makes the run reproducible (see Lesson 1.2)
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train)

print(f"Train accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {tree.score(X_test, y_test):.3f}")
# Expect: Train ~1.000, Test ~0.891 (overfitting gap)

Why Random Forests Work

| Single-tree problem | How Random Forest fixes it |
|---|---|
| Overfits training data | Each tree sees different data; the ensemble averages out errors |
| Sensitive to small data changes | Random sampling creates diversity among trees |
| One bad split ruins everything | Bad splits in one tree are diluted by the other trees |
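In practice you do not write the bagging loop yourself: sklearn's RandomForestClassifier bundles bootstrap sampling, per-split feature subsampling, and the vote. A minimal sketch of the single-tree-versus-forest comparison, using a synthetic stand-in dataset rather than the lesson's malware data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Illustrative dataset; sizes and hyperparameters are arbitrary choices
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One unpruned tree: memorises the training set
tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)

# 100 bagged trees with random feature subsets at each split
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"Single tree train/test: {tree.score(X_train, y_train):.3f} / "
      f"{tree.score(X_test, y_test):.3f}")
print(f"Forest      train/test: {forest.score(X_train, y_train):.3f} / "
      f"{forest.score(X_test, y_test):.3f}")
```

The single tree still hits 1.000 on training data; the point of the comparison is the smaller train-to-test gap on the forest side.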

Think Deeper

A single decision tree with no depth limit reaches 100% training accuracy on your malware dataset. Is that good news?

No -- it is a red flag. 100% training accuracy with an unbounded tree means the tree has memorised every training sample, including noise and outliers. On new malware samples it has never seen, accuracy drops significantly. This is overfitting. Bagging (Random Forest) fixes this by averaging many trees, each trained on a different bootstrap sample, which cancels out the individual trees' memorised noise.
Cybersecurity tie-in: In malware classification, new variants appear daily. A single tree trained on last week's samples memorises specific byte patterns that tomorrow's malware won't share. A Random Forest is more robust to distribution shift — its diversity means some trees will still catch the new variant even if others miss it.
