n_estimators: More Trees, Diminishing Returns
Adding trees reduces the variance of the forest's predictions, but each additional tree helps less than the one before. The "elbow" where accuracy plateaus is often around 100–200 trees, though the exact point depends on the dataset.
| n_estimators | Typical accuracy gain | Training time |
|---|---|---|
| 1 → 10 | +3–5% | Low |
| 10 → 50 | +1–2% | Moderate |
| 50 → 100 | +0.3–0.5% | Moderate |
| 100 → 500 | +0.1% | High |
| 500 → 1000 | < 0.05% | Very high |
Beyond the elbow, you are paying CPU time for negligible accuracy gains.
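One way to see why the gains shrink is the standard variance formula for an average of B identically distributed trees. This is a simplified model rather than something measured here: σ² (the variance of a single tree's prediction) and ρ (the pairwise correlation between trees) are symbols introduced only for illustration.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks like 1/B, so going from 1 to 10 trees removes 90% of it, while going from 500 to 1000 removes only another 0.1% of its original size. The first term does not depend on B at all; only making the trees less correlated reduces it, which is what max_features influences.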
Sweep n_estimators
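import time
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test, y_test are assumed to be defined earlier
n_trees = [1, 5, 10, 25, 50, 100, 200, 500]
results = []
for n in n_trees:
    start = time.perf_counter()  # monotonic, high-resolution timer
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start  # training time only
    acc = rf.score(X_test, y_test)  # accuracy on the held-out test set
    results.append((n, acc, elapsed))
    print(f"n={n:>4d} acc={acc:.4f} time={elapsed:.2f}s")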
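The loop above assumes X_train, y_train, X_test and y_test already exist. If you want to try the sweep without the PE dataset, a synthetic stand-in might look like the sketch below. The sample count, the number of informative features and the 75/25 split are arbitrary assumptions for illustration, not properties of the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 7 features to match the
# PE feature count, binary labels, fixed seed for repeatability.
X, y = make_classification(
    n_samples=2000,
    n_features=7,
    n_informative=4,
    n_redundant=1,
    random_state=42,
)

# Hold out 25% for testing, keeping class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Run this before the sweep. Accuracy numbers on synthetic data say nothing about real malware samples; the point is only to see the shape of the curve.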
max_features: Diversity vs Quality
max_features controls how many features each split considers. Considering fewer features makes the trees less correlated with one another, which strengthens the ensemble effect, but it also makes each individual tree weaker.
| max_features | Features considered per split | Effect |
|---|---|---|
| None (all) | All features at every split | Trees are similar → worse ensemble |
| 'sqrt' (default) | √n_features | Good diversity; most common choice |
| 'log2' | log₂(n_features) | Fewer features than 'sqrt' only when n_features > 16 |
| 0.5 | 50% of features | Good for high-dimensional data |
For PE file features (7 features), √7 ≈ 2.65. scikit-learn truncates this to an integer, so max_features='sqrt' means 2 features per split.
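To make the settings concrete, here is a small sketch of how each max_features value turns into a feature count. The helper resolve_max_features is hypothetical and not part of scikit-learn; it mirrors, as an assumption, the rounding rule scikit-learn applies (truncate, but never go below 1).

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features tried at each split.

    Assumes the same rounding scikit-learn uses: truncate toward
    zero, but always consider at least one feature.
    """
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is read as a fraction of all features.
        return max(1, int(max_features * n_features))
    return int(max_features)  # an explicit integer count


for setting in (None, "sqrt", "log2", 0.5):
    count = resolve_max_features(setting, 7)
    print(f"max_features={setting!r:>6} -> {count} of 7 features")
```

Under this rule, 'sqrt' and 'log2' both resolve to 2 features when there are only 7, so the two settings should behave the same at this size.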
Finding the Sweet Spot
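# Identify the elbow: the first step where adding trees gains < 0.1 percentage points
prev_acc = None
elbow_found = False
for n, acc, t in results:
    gain = 0.0 if prev_acc is None else acc - prev_acc
    # Flag only the first small (zero or negative included) gain; the first row has no baseline
    is_elbow = prev_acc is not None and gain < 0.001 and not elbow_found
    elbow_found = elbow_found or is_elbow
    marker = " <-- elbow" if is_elbow else ""
    print(f"n={n:>4d} acc={acc:.4f} gain={gain:+.4f}{marker}")
    prev_acc = acc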
Pick the n_estimators value at the elbow. In a real-time malware scanning pipeline, inference speed matters as much as accuracy. Prediction cost grows roughly linearly with the number of trees, so a 500-tree forest takes about 5x longer to predict than a 100-tree forest for a tiny accuracy gain.
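Test accuracy can wobble from one step to the next, so a step-by-step gain check may flag an early dip as the elbow. An alternative rule, sketched here, is to take the smallest forest whose accuracy is within a tolerance of the best one seen. The function name smallest_n_within and the default tolerance are assumptions for illustration.

```python
def smallest_n_within(results, tolerance=0.001):
    """Return the smallest n whose accuracy is within `tolerance` of the best.

    `results` is a list of (n_estimators, accuracy, seconds) tuples,
    the same shape the sweep builds. `tolerance` is in accuracy units,
    so 0.001 means 0.1 percentage points.
    """
    if not results:
        raise ValueError("results is empty")
    best_acc = max(acc for _, acc, _ in results)
    # Keep every forest size that is "close enough" to the best accuracy.
    candidates = [n for n, acc, _ in results if best_acc - acc <= tolerance]
    return min(candidates)
```

Calling smallest_n_within(results) on the sweep output returns a single tree count. Loosening the tolerance trades a little accuracy for a smaller, faster forest.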
Think Deeper
You sweep n_estimators from 10 to 500. Accuracy plateaus at 100 trees but training time keeps climbing. What do you recommend?