Step 2: Train a Random Forest

RandomForestClassifier, OOB score, tree vs forest


RandomForestClassifier Parameters

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_depth=None,        # let individual trees grow fully
    max_features='sqrt',   # features considered at each split
    oob_score=True,        # free validation using out-of-bag samples
    n_jobs=-1,             # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
Parameter       Effect
n_estimators    More trees = more stable predictions; diminishing returns after ~200
max_features    Fewer features per split = more diverse trees = less correlation
max_depth       Limits individual tree depth; forests are usually grown fully
oob_score       Computes OOB accuracy automatically — free, no extra data needed
n_jobs          -1 = use all CPU cores for parallel tree training
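The diminishing returns from n_estimators are easy to see with the OOB score itself. A minimal sketch, using a synthetic dataset from make_classification as a stand-in for the lesson's PE-file features (an assumption — your real X_train/y_train will give different numbers):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the PE-file dataset (assumption: 7 features,
# mirroring the lesson's example).
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           random_state=42)

scores = {}
for n in [25, 100, 400]:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                n_jobs=-1, random_state=42)
    rf.fit(X, y)
    scores[n] = rf.oob_score_
    print(f"n_estimators={n:>3}  OOB accuracy={rf.oob_score_:.3f}")
```

Going from 25 to 100 trees typically buys more than going from 100 to 400 — accuracy plateaus while training time keeps growing linearly.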

Feature Subsampling

At each split, the random forest considers only a random subset of features (not all). This is the key difference from plain bagging.

Default: max_features='sqrt' means ⌊√(n_features)⌋ features per split. For 7 PE file features, √7 ≈ 2.6, so each split considers 2 randomly chosen features.

This forces different trees to use different primary splits, producing diverse, uncorrelated trees. The ensemble benefit comes from this diversity — if all trees made the same mistakes, averaging would not help.
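You can observe this diversity directly by checking which feature each tree splits on first at its root. A sketch on synthetic data (an assumption, not the lesson's PE dataset), comparing max_features=None (every split sees all features, as in plain bagging) against the 'sqrt' default:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data (assumption: 7 features like the lesson's example).
X, y = make_classification(n_samples=1000, n_features=7, n_informative=5,
                           random_state=0)

diversity = {}
for mf in [None, 'sqrt']:   # None = all features per split, like plain bagging
    rf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                random_state=0).fit(X, y)
    # tree_.feature[0] is the feature index used at each tree's root split
    roots = {t.tree_.feature[0] for t in rf.estimators_}
    diversity[str(mf)] = len(roots)
    print(f"max_features={mf}: {len(roots)} distinct root-split features")
```

With all features available, most trees agree on the single strongest root split; with 'sqrt', the root splits spread across many features, which is exactly the decorrelation the ensemble needs.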

Out-of-Bag (OOB) Score

Each tree is trained on a bootstrap sample (~63% of data). The remaining ~37% — the out-of-bag samples — were never seen by that tree.
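The 63%/37% split follows from sampling n items with replacement: each point is missed by one draw with probability 1 − 1/n, so it is missed by all n draws with probability (1 − 1/n)^n → 1/e ≈ 0.368. A quick simulation confirms this:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
sample = rng.integers(0, n, size=n)        # bootstrap: n draws with replacement
in_bag = np.unique(sample).size / n        # fraction of distinct points drawn
print(f"in-bag fraction:  {in_bag:.3f}")   # ≈ 1 - 1/e ≈ 0.632
print(f"out-of-bag:       {1 - in_bag:.3f}")
```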

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
# These should be close — both estimate generalisation

OOB accuracy is a free cross-validation estimate. If it closely matches test accuracy, your model is generalising well.
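Because OOB is effectively built-in cross-validation, it should roughly track an explicit k-fold estimate. A sketch on synthetic data (an assumption — not the lesson's dataset) comparing the two:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data as a stand-in (assumption).
X, y = make_classification(n_samples=1500, n_features=7, n_informative=5,
                           random_state=1)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1,
                            random_state=1)
rf.fit(X, y)
cv = cross_val_score(rf, X, y, cv=5).mean()   # explicit 5-fold CV for comparison
print(f"OOB: {rf.oob_score_:.3f}   5-fold CV: {cv:.3f}")
```

The two estimates usually land within a couple of points of each other, but OOB comes for free with a single fit instead of five.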

Tree vs Forest Comparison

Model                        Train Acc   Test Acc   Gap
Single tree (no limit)       1.000       ~0.891     0.109
Random Forest (200 trees)    ~0.999      ~0.950     0.049

The forest's test accuracy is significantly higher and its overfitting gap is smaller. The ensemble diversity pays off.
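The comparison above can be reproduced in spirit on synthetic data. A sketch (assumption: make_classification with a little label noise stands in for the PE dataset, so the exact numbers will differ from the table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 5% label noise so overfitting is visible (assumption).
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           flip_y=0.05, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

results = {}
for name, model in [("single tree", DecisionTreeClassifier(random_state=7)),
                    ("forest (200)", RandomForestClassifier(n_estimators=200,
                                                            random_state=7))]:
    model.fit(X_tr, y_tr)
    tr, te = model.score(X_tr, y_tr), model.score(X_te, y_te)
    results[name] = (tr, te)
    print(f"{name}: train={tr:.3f}  test={te:.3f}  gap={tr - te:.3f}")
```

The single fully grown tree memorises the noisy training labels and pays for it on the test set; averaging 200 decorrelated trees shrinks the gap.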


Think Deeper

You train a Random Forest with oob_score=True and get OOB accuracy of 0.94. Your test accuracy is 0.95. What does the close match tell you?

The close match between OOB and test accuracy is a good sign. OOB samples are data points each tree never saw during training (~37% per tree), so OOB accuracy is a built-in cross-validation estimate. When it closely matches the held-out test score, it confirms the model is generalising well and not overfitting. In a SOC context, this means your malware classifier should perform reliably on new samples arriving in production.
Cybersecurity tie-in: The OOB score is especially useful in security because labelled malware data is expensive. OOB gives you a generalisation estimate without sacrificing any training data for a separate validation set — every sample contributes to both training (when in-bag) and validation (when out-of-bag).
