Step 4: Tune the Forest

n_estimators sweep, max_features, learning curve


n_estimators: More Trees, Diminishing Returns

Adding trees always reduces variance, but the improvement shrinks quickly. The "elbow" where accuracy plateaus is typically around 100–200 trees.

| n_estimators | Typical accuracy gain | Training time |
|--------------|-----------------------|---------------|
| 1 → 10       | +3–5%                 | Low           |
| 10 → 50      | +1–2%                 | Moderate      |
| 50 → 100     | +0.3–0.5%             | Moderate      |
| 100 → 500    | +0.1%                 | High          |
| 500 → 1000   | < 0.05%               | Very high     |

Beyond the elbow, you are paying CPU time for negligible accuracy gains.

Sweep n_estimators

import time

from sklearn.ensemble import RandomForestClassifier

# X_train, X_test, y_train, y_test come from the earlier steps
n_trees = [1, 5, 10, 25, 50, 100, 200, 500]
results = []

for n in n_trees:
    start = time.time()
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    elapsed = time.time() - start
    acc = rf.score(X_test, y_test)
    results.append((n, acc, elapsed))
    print(f"n={n:>4d}  acc={acc:.4f}  time={elapsed:.2f}s")
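Retraining from scratch at every step repeats work: the 500-tree fit rebuilds the 200 trees you already had. scikit-learn's warm_start=True lets a forest keep its existing trees and only grow the new ones, which makes a full sweep much cheaper. A minimal, self-contained sketch (using make_classification as a stand-in for the PE feature data from the earlier steps):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 7-feature data standing in for the PE features
X, y = make_classification(n_samples=2000, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# warm_start=True: each fit() call only adds the missing trees
rf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=42)
accs = {}
for n in [1, 5, 10, 25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)  # grows the forest from its current size to n
    accs[n] = rf.score(X_test, y_test)
    print(f"n={n:>4d}  acc={accs[n]:.4f}")
```

The accuracies match a from-scratch sweep closely enough for elbow-finding, at a fraction of the training cost.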

max_features: Diversity vs Quality

max_features controls how many features each split considers. Fewer features = more tree diversity = better ensemble effect, but weaker individual trees.

| max_features     | Trees use                    | Effect                              |
|------------------|------------------------------|-------------------------------------|
| None (all)       | All features at every split  | Trees are similar → worse ensemble  |
| 'sqrt' (default) | √n_features (2–3 here)       | Good diversity; most common choice  |
| 'log2'           | log₂(n_features)             | More aggressive diversity           |
| 0.5              | 50% of features              | Good for high-dimensional data      |

For PE file features (7 features), max_features='sqrt' means 2–3 features per split.
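To see the diversity-vs-quality trade-off empirically, you can cross-validate the same forest under each max_features setting. A self-contained sketch, again using make_classification as a stand-in for the 7-feature PE dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 7-feature data standing in for the PE features
X, y = make_classification(n_samples=2000, n_features=7, random_state=42)

cv_scores = {}
for mf in [None, "sqrt", "log2", 0.5]:
    rf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                random_state=42)
    # 5-fold cross-validated accuracy for this max_features setting
    cv_scores[mf] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={mf!s:>5}  cv_acc={cv_scores[mf]:.4f}")
```

With only 7 features the gaps are small; the effect of max_features grows with dimensionality, which is why 'sqrt' matters more on wide feature sets.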

Finding the Sweet Spot

# Identify the elbow: the first point where adding trees gains < 0.1% accuracy
prev_acc = None
elbow_found = False
for n, acc, t in results:
    gain = 0.0 if prev_acc is None else acc - prev_acc
    marker = ""
    if prev_acc is not None and not elbow_found and gain < 0.001:
        marker = "  <-- elbow"
        elbow_found = True
    print(f"n={n:>4d}  acc={acc:.4f}  gain={gain:+.4f}{marker}")
    prev_acc = acc

Pick the n_estimators value at the elbow. In a real-time malware scanning pipeline, inference speed matters as much as accuracy. A 500-tree forest takes 5x longer to predict than a 100-tree forest for a tiny accuracy gain.
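The prediction-time cost is easy to measure directly: inference scales roughly linearly with the number of trees, since every tree is traversed for every sample. A self-contained timing sketch (absolute numbers will vary by machine; only the ratio matters):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 7-feature data standing in for the PE features
X, y = make_classification(n_samples=2000, n_features=7, random_state=42)

times = {}
for n in [100, 500]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42).fit(X, y)
    start = time.perf_counter()
    rf.predict(X)  # time a full batch of predictions
    times[n] = time.perf_counter() - start
    print(f"{n} trees: {times[n] * 1000:.1f} ms for {len(X)} samples")
```

In a scanner with a per-file latency budget, measuring this ratio on your own hardware is what justifies (or rules out) the larger forest.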


Think Deeper

You sweep n_estimators from 10 to 500. Accuracy plateaus at 100 trees but training time keeps climbing. What do you recommend?

Recommend 100 trees, the elbow of the learning curve. Beyond that point, each additional tree costs CPU time but adds less than 0.1% accuracy. In a security pipeline processing thousands of PE files per hour, inference speed matters: a 500-tree forest takes roughly 5x longer to predict for a negligible accuracy gain. Pick the most cost-effective configuration, especially when the model runs in a real-time detection pipeline.
Cybersecurity tie-in: In a production malware scanner processing thousands of files per hour, inference latency directly impacts coverage. If your 500-tree model takes 50ms per file but the 100-tree model takes 10ms with nearly identical accuracy, the faster model can scan 5x more files in the same window. Tune for the sweet spot: maximum detection at minimum cost.
