Step 3: Feature Importance

Stable importance rankings across many trees


Why Forest Importances Are More Stable

A single tree's feature importance depends heavily on which training sample it saw. Train two trees with different random seeds and you can get very different importance rankings.

A Random Forest averages importances over hundreds of trees, each trained on a different bootstrap sample. The result is a stable ranking — retrain with a different seed and the top features remain the same.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Measure stability: train 20 times with different seeds
importances_runs = []
for seed in range(20):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)
    importances_runs.append(rf.feature_importances_)

# Low std = stable, trustworthy importance
stds = np.std(importances_runs, axis=0)
for name, std in zip(feature_names, stds):
    print(f"{name}: std = {std:.4f}")

PE File Feature Interpretation

In a malware-vs-benign classifier, the forest reveals which static analysis features matter most:

| Feature | High value in malware | Security reason |
|---|---|---|
| file_entropy | ~7.2 (near the 8-bit maximum) | Packed/encrypted malware has high entropy |
| has_packer_sig | 68% vs 5% in benign | Packers evade antivirus static analysis |
| virtual_size_ratio | ~2.8 (inflated) | Malware unpacks itself in memory |
| import_entropy | Lower (fewer imports) | Targeted API calls (e.g., CreateRemoteThread) |
| num_imports | Lower | Minimal import table to evade static analysis |
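The file_entropy feature in the table is typically the Shannon entropy of the raw file bytes, measured in bits per byte (0 to 8). A minimal sketch of how such a feature could be computed (the function name is illustrative, not part of any specific library):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sequence, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = shannon_entropy(bytes(1024))             # constant bytes: entropy 0.0
high = shannon_entropy(bytes(range(256)) * 4)  # all byte values equally likely: 8.0
print(f"constant: {low:.1f}, uniform: {high:.1f}")
```

Packed or encrypted payloads look close to uniform random bytes, which is why values near 7.2 out of 8 are a strong malware signal.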

Extracting and Plotting Importances

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Sort features by importance
indices = np.argsort(rf.feature_importances_)[::-1]
sorted_names = [feature_names[i] for i in indices]
sorted_imps = rf.feature_importances_[indices]

plt.figure(figsize=(10, 5))
plt.barh(sorted_names[::-1], sorted_imps[::-1])
plt.xlabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()

Single Tree vs Forest Stability

| Aspect | Single tree | Random Forest |
|---|---|---|
| Top feature rank | Changes with random seed | Consistent across seeds |
| Importance std | High (0.03–0.08) | Low (0.005–0.015) |
| Trustworthiness | Questionable | Reliable for feature selection |
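The comparison above can be reproduced on synthetic data (a stand-in for the malware feature matrix; the dataset shape and seed count are arbitrary choices for illustration). Each run refits a single tree on a fresh bootstrap sample, while the forest only changes its seed and bootstraps internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in for the malware feature matrix
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

tree_runs, forest_runs = [], []
for seed in range(10):
    # Single tree: refit on a different bootstrap sample each run
    Xb, yb = resample(X, y, random_state=seed)
    tree_runs.append(
        DecisionTreeClassifier(random_state=0).fit(Xb, yb).feature_importances_)
    # Forest: trained on the full data; only the seed changes
    forest_runs.append(
        RandomForestClassifier(n_estimators=100, random_state=seed)
        .fit(X, y).feature_importances_)

print("tree   mean importance std:", np.std(tree_runs, axis=0).mean())
print("forest mean importance std:", np.std(forest_runs, axis=0).mean())
```

The forest's per-feature standard deviations come out markedly lower than the single tree's, matching the ranges in the table.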

Stable importances mean you can confidently drop low-importance features to speed up your pipeline without risking accuracy.
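One way to act on stable importances is scikit-learn's SelectFromModel, which drops features below an importance threshold. A sketch on synthetic data (the dataset shape and the "median" threshold are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Keep only features whose importance is at or above the median importance
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

Retraining on X_reduced then gives a faster pipeline; comparing its accuracy against the full-feature model confirms whether the dropped features were safe to remove.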


Think Deeper

Your random forest says file_entropy is the most important feature for malware detection. An attacker learns this. What could they do?

The attacker could craft malware with artificially lowered entropy -- for example, by padding the binary with structured data or using a custom packer that produces output with entropy similar to benign files. This is adversarial evasion: once attackers know which features the model relies on, they can manipulate those exact features. This is why security teams should not publicise model feature importances and should use diverse, hard-to-manipulate features.
Cybersecurity tie-in: Feature importances tell defenders which file properties are most indicative of malware. But beware: if attackers learn your top features, they can craft evasion samples. Treat your model's feature importances as sensitive intelligence — share them within the team, not publicly.
