RandomForestClassifier Parameters
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_depth=None,        # let individual trees grow fully
    max_features='sqrt',   # features considered at each split
    oob_score=True,        # free validation using out-of-bag samples
    n_jobs=-1,             # use all CPU cores
    random_state=42,       # reproducible bootstraps and feature draws
)
rf.fit(X_train, y_train)
```
| Parameter | Effect |
|---|---|
| `n_estimators` | More trees = more stable predictions; diminishing returns after ~200 |
| `max_features` | Fewer features per split = more diverse trees = less correlation |
| `max_depth` | Limits individual tree depth; forest trees are usually grown fully |
| `oob_score` | Computes OOB accuracy automatically; no extra held-out data needed |
| `n_jobs` | `-1` = use all CPU cores for parallel tree training |
Feature Subsampling
At each split, the random forest considers only a random subset of features (not all). This is the key difference from plain bagging.
Default: max_features='sqrt' means ⌊√(n_features)⌋ features per split. For 7 PE file features, each split considers 2 randomly chosen features (⌊√7⌋ = 2).
This forces different trees to use different primary splits, producing diverse, uncorrelated trees. The ensemble benefit comes from this diversity — if all trees made the same mistakes, averaging would not help.
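The sqrt rule can be sketched in plain Python. This is a toy illustration of the sampling step, not scikit-learn's internals; `feature_subset` is a hypothetical helper:

```python
import math
import random

def feature_subset(n_features, rng):
    """Pick the random subset of feature indices considered at one split
    (the sqrt rule, as with max_features='sqrt')."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
# Each of 5 hypothetical trees draws a different subset at its root split;
# this is what decorrelates the trees.
subsets = [feature_subset(7, rng) for _ in range(5)]
for s in subsets:
    print(s)
```

Because each tree's root split is chosen from a different 2-feature subset, the trees end up with structurally different decision boundaries.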
Out-of-Bag (OOB) Score
Each tree is trained on a bootstrap sample (~63% of data). The remaining ~37% — the out-of-bag samples — were never seen by that tree.
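The ~63%/37% split follows from drawing n samples with replacement: the chance a given sample is never drawn is (1 − 1/n)ⁿ → 1/e ≈ 0.368. A quick standard-library simulation confirms it:

```python
import random

# Simulate one bootstrap sample: draw n indices with replacement,
# then measure the out-of-bag fraction (indices never drawn).
n = 100_000
rng = random.Random(42)
in_bag = {rng.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(in_bag) / n
print(f"OOB fraction: {oob_fraction:.3f}")  # close to 1/e ≈ 0.368
```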
```python
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
# These should be close: both estimate generalisation performance
```
OOB accuracy is an essentially free estimate of generalisation: it behaves much like cross-validation but costs no extra training runs. If it closely matches test accuracy, your model is generalising well.
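To see the correspondence with cross-validation, the two estimates can be computed side by side. A sketch on a synthetic stand-in dataset (`make_classification` here is an assumption; the original PE-file data is not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical 7-feature dataset standing in for the PE-file features.
X, y = make_classification(n_samples=1000, n_features=7, random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X, y)

# 5-fold CV on a fresh forest, for comparison with the OOB estimate.
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    X, y, cv=5).mean()

print(f"OOB accuracy:       {rf.oob_score_:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

The two numbers typically land within a few points of each other, but the OOB estimate required only the single `fit` call.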
Tree vs Forest Comparison
| Model | Train Acc | Test Acc | Gap |
|---|---|---|---|
| Single tree (no limit) | 1.000 | ~0.891 | 0.109 |
| Random Forest (200 trees) | ~0.999 | ~0.950 | 0.049 |
The forest's test accuracy is significantly higher and its overfitting gap is smaller. The ensemble diversity pays off.
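A comparison like the table above can be reproduced on synthetic data (exact numbers will differ, since the table comes from the original dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7-feature PE-file dataset.
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for name, model in [
    ("Single tree", DecisionTreeClassifier(random_state=0)),  # no depth limit
    ("Random Forest", RandomForestClassifier(n_estimators=200,
                                             random_state=0, n_jobs=-1)),
]:
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    train_acc, test_acc = results[name]
    print(f"{name}: train={train_acc:.3f}  test={test_acc:.3f}  "
          f"gap={train_acc - test_acc:.3f}")
```

The unlimited tree memorises the training set (train accuracy 1.000) while the forest's train-test gap comes out noticeably smaller.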
Think Deeper
You train a Random Forest with oob_score=True and get OOB accuracy of 0.94. Your test accuracy is 0.95. What does the close match tell you?