End-of-lesson Quiz

5 questions · Random Forests

1 of 5
What is the core idea of a Random Forest, and why does it work better than a single tree?
A Random Forest is bagging plus feature randomness: train N trees, each on its own bootstrap sample of the data, and restrict each split to a random subset of features, then average their votes. Each individual tree overfits its sample, but the injected randomness decorrelates their errors, so averaging produces a model that generalises much better than any single tree.
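A minimal sketch of the comparison, using a synthetic dataset (the dataset, sizes, and parameters are illustrative assumptions, not from the lesson):

```python
# Compare a single unconstrained tree to a forest of 100 trees on the same split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(f"single tree test accuracy: {tree.score(X_te, y_te):.3f}")
print(f"forest test accuracy:      {forest.score(X_te, y_te):.3f}")
```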
2 of 5
You set oob_score=True and get OOB accuracy = 0.94. Your test set accuracy is 0.95. What does this tell you?
Each tree is trained on a bootstrap sample that leaves out ~37% of the training data (a point is excluded with probability (1 − 1/n)^n ≈ 1/e ≈ 0.368); those are its out-of-bag samples. Predicting each point with only the trees that never saw it gives a free, built-in validation estimate. When the OOB score closely matches your held-out test score, you have strong evidence the model will perform similarly in production.
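A sketch of checking OOB against a held-out set (synthetic data and sizes are assumptions for illustration):

```python
# oob_score=True makes the forest score each training point using only the
# trees whose bootstrap sample excluded it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X_tr, y_tr)

print(f"OOB score:  {forest.oob_score_:.3f}")
print(f"Test score: {forest.score(X_te, y_te):.3f}")
```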
3 of 5
Why should you be cautious about publishing a Random Forest's feature importances when it's used for security?
If your malware classifier puts most weight on file_entropy, an attacker can pad the binary to lower entropy and bypass detection. Feature importances are useful internally for debugging, but they're an attack surface when exposed externally.
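For internal debugging, the importances are exposed directly on the fitted model. A sketch on synthetic data (the feature names here are hypothetical stand-ins, not a real malware feature set):

```python
# Inspect feature_importances_ on a fitted forest; the values sum to 1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, random_state=0
)
names = ["file_entropy", "section_count", "import_count", "file_size"]  # hypothetical

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(
    zip(names, forest.feature_importances_), key=lambda p: -p[1]
):
    print(f"{name:15s} {imp:.3f}")
```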
4 of 5
You sweep n_estimators from 10 to 500 and find accuracy plateaus at 100 trees but training time keeps climbing. What do you recommend?
Pick the elbow of the curve. Beyond 100 trees you're paying CPU and memory for less than 0.1% accuracy gain — in a security pipeline processing millions of events/day, that 5x slowdown matters and the accuracy gain doesn't.
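The sweep itself can be sketched as follows (grid and dataset are illustrative assumptions; on real data you would also cross-validate):

```python
# Sweep n_estimators and watch accuracy plateau while fit time keeps growing.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n in (10, 50, 100, 500):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    print(f"n_estimators={n:3d}  acc={forest.score(X_te, y_te):.3f}  fit={elapsed:.2f}s")
```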
5 of 5
A single deep decision tree gets 100% training accuracy on a malware dataset. Should you ship it?
100% training accuracy with an unbounded tree means it has memorised every sample, including noise. The test accuracy will be much lower, and the production accuracy on new malware variants will be lower still. This is precisely the gap a Random Forest is designed to close.
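A quick demonstration of the gap, using synthetic data with injected label noise as a stand-in for a real malware dataset (an assumption for illustration):

```python
# An unbounded tree reaches 100% training accuracy by memorising noise;
# its test accuracy falls well short.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 randomly flips 10% of labels, simulating noisy ground truth.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print(f"train accuracy: {tree.score(X_tr, y_tr):.3f}")
print(f"test accuracy:  {tree.score(X_te, y_te):.3f}")
```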
