Why Forest Importances Are More Stable
A single tree's feature importance depends heavily on which training sample it saw. Train two trees with different random seeds and you can get very different importance rankings.
A Random Forest averages importances over hundreds of trees, each trained on a different bootstrap sample. The averaging smooths out any single tree's quirks, so the ranking is stable: retrain with a different seed and the top features rarely change.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Measure stability: train 20 times with different seeds
# (assumes X_train, y_train, and feature_names are already defined)
importances_runs = []
for seed in range(20):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)
    importances_runs.append(rf.feature_importances_)

# Low std = stable, trustworthy importance
stds = np.std(importances_runs, axis=0)
for name, std in zip(feature_names, stds):
    print(f"{name}: std = {std:.4f}")
```
PE File Feature Interpretation
In a malware-vs-benign classifier, the forest reveals which static analysis features matter most:
| Feature | Typical value in malware | Security reason |
|---|---|---|
| file_entropy | ~7.2 (near the 8-bit maximum) | Packed/encrypted malware has high entropy |
| has_packer_sig | 68% vs 5% in benign files | Packers evade antivirus static analysis |
| virtual_size_ratio | ~2.8 (inflated) | Malware unpacks itself in memory |
| import_entropy | Lower (fewer imports) | Targeted API calls (e.g., CreateRemoteThread) |
| num_imports | Lower | Minimal import table to evade static analysis |
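The file_entropy row can be made concrete: it is commonly computed as the byte-level Shannon entropy of the file, which ranges from 0 to 8 bits per byte. A minimal sketch (the helper name is illustrative, not from the source):

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Random-looking (packed/encrypted) bytes approach the 8-bit maximum,
# while repetitive data scores near zero.
print(shannon_entropy(os.urandom(65536)))  # close to 8.0
print(shannon_entropy(b"\x00" * 4096))     # 0.0
```

Packed or encrypted payloads look statistically random, which is why a score near the maximum is such a strong malware signal.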
Extracting and Plotting Importances
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# (assumes X_train, y_train, and feature_names are already defined)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Sort features by importance, most important first
indices = np.argsort(rf.feature_importances_)[::-1]
sorted_names = [feature_names[i] for i in indices]
sorted_imps = rf.feature_importances_[indices]

# Reverse again so the top feature sits at the top of the barh chart
plt.figure(figsize=(10, 5))
plt.barh(sorted_names[::-1], sorted_imps[::-1])
plt.xlabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()
```
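One caveat worth knowing: `feature_importances_` is computed from impurity decreases on the training data and can favor features with many distinct values. scikit-learn's `permutation_importance` offers a held-out cross-check. A self-contained sketch on synthetic data (the dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Permutation importance: the drop in score when one feature is shuffled,
# measured on held-out data rather than training impurity.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

If the two rankings agree on the top features, that is extra evidence the importances are trustworthy.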
Single Tree vs Forest Stability
| Aspect | Single tree | Random Forest |
|---|---|---|
| Top feature rank | Changes with random seed | Consistent across seeds |
| Importance std | High (0.03–0.08) | Low (0.005–0.015) |
| Trustworthiness | Questionable | Reliable for feature selection |
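The contrast in the table can be reproduced on synthetic data: retrain a single tree and a forest on fresh bootstrap samples and compare the spread of their importances (the dataset, run count, and helper name are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

def importance_std(make_model, n_runs=20):
    """Per-feature std of importances across retrains on bootstrap samples."""
    runs = []
    for seed in range(n_runs):
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X), len(X))  # fresh bootstrap sample
        runs.append(make_model(seed).fit(X[idx], y[idx]).feature_importances_)
    return np.std(runs, axis=0)

tree_std = importance_std(lambda s: DecisionTreeClassifier(random_state=s))
forest_std = importance_std(
    lambda s: RandomForestClassifier(n_estimators=100, random_state=s))

print(f"single tree mean importance std:   {tree_std.mean():.4f}")
print(f"random forest mean importance std: {forest_std.mean():.4f}")
```

The forest's spread should come out noticeably smaller, because each forest already averages away the sampling noise that whipsaws a single tree.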
Stable importances mean you can confidently drop low-importance features to speed up your pipeline without risking accuracy.
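One way to act on that is scikit-learn's `SelectFromModel`, sketched here on synthetic data (the threshold choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(rf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print(f"{X.shape[1]} features -> {X_reduced.shape[1]} after selection")
```

Because the importances are stable across seeds, the same features get dropped on every retrain, keeping the pipeline reproducible.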
Think Deeper
Your random forest says file_entropy is the most important feature for malware detection. An attacker learns this. What could they do?