Step 3: Feature Importance

Stable importance rankings across many trees


Why Forest Importances Are More Stable

A single tree's feature importance depends heavily on which training sample it saw. Train two trees with different random seeds and you can get very different importance rankings.

A Random Forest averages importances over hundreds of trees, each trained on a different bootstrap sample. The result is a stable ranking — retrain with a different seed and the top features remain the same.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Measure stability: train 20 times with different seeds
importances_runs = []
for seed in range(20):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, y_train)
    importances_runs.append(rf.feature_importances_)

# Low std = stable, trustworthy importance
stds = np.std(importances_runs, axis=0)
for name, std in zip(feature_names, stds):
    print(f"{name}: std = {std:.4f}")

PE File Feature Interpretation

In a malware-vs-benign classifier, the forest reveals which static analysis features matter most:

| Feature | High value in malware | Security reason |
|---|---|---|
| file_entropy | ~7.2 (near the 8-bit maximum) | Packed/encrypted malware has high entropy |
| has_packer_sig | 68% vs 5% in benign | Packers evade antivirus static analysis |
| virtual_size_ratio | ~2.8 (inflated) | Malware unpacks itself in memory |
| import_entropy | Lower (fewer imports) | Targeted API calls (e.g., CreateRemoteThread) |
| num_imports | Lower | Minimal import table to evade static analysis |
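The file_entropy feature in the table is typically the Shannon entropy of the raw file bytes, measured in bits per byte (0 to 8). A minimal sketch of how such a feature could be computed (the function name is illustrative, not part of any specific library):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sequence, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

low = shannon_entropy(bytes(1024))             # constant bytes: entropy 0.0
high = shannon_entropy(bytes(range(256)) * 4)  # all byte values equally likely: 8.0
print(f"constant: {low:.1f}, uniform: {high:.1f}")
```

Packed or encrypted payloads look close to uniform random bytes, which is why values near 7.2 out of 8 are a strong malware signal.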

Extracting and Plotting Importances

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Sort features by importance
indices = np.argsort(rf.feature_importances_)[::-1]
sorted_names = [feature_names[i] for i in indices]
sorted_imps = rf.feature_importances_[indices]

plt.figure(figsize=(10, 5))
plt.barh(sorted_names[::-1], sorted_imps[::-1])
plt.xlabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()

Single Tree vs Forest Stability

| Aspect | Single tree | Random Forest |
|---|---|---|
| Top feature rank | Changes with random seed | Consistent across seeds |
| Importance std | High (0.03–0.08) | Low (0.005–0.015) |
| Trustworthiness | Questionable | Reliable for feature selection |
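The comparison above can be reproduced on synthetic data (a stand-in for the malware feature matrix; the dataset shape and seed count are arbitrary choices for illustration). Each run refits a single tree on a fresh bootstrap sample, while the forest only changes its seed and bootstraps internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in for the malware feature matrix
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

tree_runs, forest_runs = [], []
for seed in range(10):
    # Single tree: refit on a different bootstrap sample each run
    Xb, yb = resample(X, y, random_state=seed)
    tree_runs.append(
        DecisionTreeClassifier(random_state=0).fit(Xb, yb).feature_importances_)
    # Forest: trained on the full data; only the seed changes
    forest_runs.append(
        RandomForestClassifier(n_estimators=100, random_state=seed)
        .fit(X, y).feature_importances_)

print("tree   mean importance std:", np.std(tree_runs, axis=0).mean())
print("forest mean importance std:", np.std(forest_runs, axis=0).mean())
```

The forest's per-feature standard deviations come out markedly lower than the single tree's, matching the ranges in the table.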

Stable importances mean you can confidently drop low-importance features to speed up your pipeline without risking accuracy.
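One way to act on stable importances is scikit-learn's SelectFromModel, which drops features below an importance threshold. A sketch on synthetic data (the dataset shape and the "median" threshold are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Keep only features whose importance is at or above the median importance
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)
```

Retraining on X_reduced then gives a faster pipeline; comparing its accuracy against the full-feature model confirms whether the dropped features were safe to remove.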


Think Deeper

Your random forest says file_entropy is the most important feature for malware detection. An attacker learns this. What could they do?

The attacker could craft malware with artificially lowered entropy -- for example, by padding the binary with structured data or using a custom packer that produces output with entropy similar to benign files. This is adversarial evasion: once attackers know which features the model relies on, they can manipulate those exact features. This is why security teams should not publicise model feature importances and should use diverse, hard-to-manipulate features.
Cybersecurity tie-in: Feature importances tell defenders which file properties are most indicative of malware. But beware: if attackers learn your top features, they can craft evasion samples. Treat your model's feature importances as sensitive intelligence — share them within the team, not publicly.
