## The Four Outcomes
Every prediction for a binary classifier falls into one of four cells:
| | Predicted: Benign | Predicted: Attack |
|---|---|---|
| Actual: Benign | TN — correct pass | FP — false alarm |
| Actual: Attack | FN — missed threat | TP — caught it |

| Outcome | Security cost |
|---|---|
| TP — True Positive | Low — this is what we want |
| TN — True Negative | Low — no action needed |
| FP — False Positive | Medium — analyst time wasted |
| FN — False Negative | High — system compromised |
In security, FN cost >> FP cost. A missed attack is almost always more damaging than a false alarm.
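This asymmetry can be made concrete by scoring a confusion matrix with per-error costs instead of accuracy. The sketch below is illustrative: the cost values and the two model outcomes are made-up assumptions, not measurements from any real deployment.

```python
# Sketch: comparing two hypothetical models by expected cost rather than
# accuracy. COST_FP and COST_FN are illustrative assumptions reflecting
# "a missed attack hurts far more than a false alarm".
COST_FP = 1    # analyst triage effort for one false alarm
COST_FN = 100  # assumed damage from one missed attack

def expected_cost(tp, tn, fp, fn):
    """Total security cost implied by a confusion matrix."""
    return fp * COST_FP + fn * COST_FN

# Model A: fewer false alarms, but misses more attacks.
# Model B: noisier, but misses almost nothing.
cost_a = expected_cost(tp=40, tn=9900, fp=50, fn=10)   # 50*1 + 10*100 = 1050
cost_b = expected_cost(tp=49, tn=9750, fp=200, fn=1)   # 200*1 + 1*100 = 300
print(cost_a, cost_b)  # prints: 1050 300
```

Under these assumed costs, Model B wins decisively even though it raises four times as many false alarms, which is exactly the trade-off the table above encodes.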
## Computing the Confusion Matrix
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]

# Visualise as a heatmap
disp = ConfusionMatrixDisplay(cm, display_labels=['Benign', 'Attack'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
```
## Deriving All Metrics from the Matrix
```python
TN, FP, FN, TP = cm.ravel()

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # of those flagged, how many are real?
recall = TP / (TP + FN)      # of all attacks, how many are caught?
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```
Every evaluation metric is just a different combination of these four numbers.
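As a sanity check, the hand-computed formulas above can be verified against sklearn's built-in metric functions. The tiny label arrays below are made up for illustration.

```python
# Verify that the manual formulas agree with sklearn's implementations.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
# From these arrays: TP=4, TN=3, FP=1, FN=2

manual_precision = 4 / (4 + 1)   # 0.8
manual_recall = 4 / (4 + 2)      # 0.666...

assert precision_score(y_true, y_pred) == manual_precision
assert recall_score(y_true, y_pred) == manual_recall
print(f1_score(y_true, y_pred))  # harmonic mean of the two
```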
## Think Deeper
Try this:
Your IDS produced: TP=45, FP=300, FN=5, TN=9650. Calculate precision and recall. Is this a good system?
Precision = 45/(45+300) ≈ 13%. Recall = 45/(45+5) = 90%. It catches 90% of attacks, but only about 1 in 8 alerts is real. Whether this is 'good' depends on your SOC's capacity: 300 false alarms might be acceptable if they're auto-triaged, but unacceptable if a human must investigate each one.
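The exercise can be worked numerically, using the TP/FP/FN/TN values given above; note how accuracy stays high even with this alert flood.

```python
# Working the IDS exercise: TP=45, FP=300, FN=5, TN=9650.
TP, FP, FN, TN = 45, 300, 5, 9650

precision = TP / (TP + FP)                   # 45 / 345 ≈ 0.130
recall = TP / (TP + FN)                      # 45 / 50  = 0.90
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 9695 / 10000 = 0.9695

print(f"Precision: {precision:.2%}")  # ~13% — most alerts are noise
print(f"Recall:    {recall:.2%}")     # 90% — most attacks are caught
print(f"Accuracy:  {accuracy:.2%}")   # ~97% — misleadingly reassuring
```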
Cybersecurity tie-in: the confusion matrix is the universal language of security ML evaluation. When comparing two IDS vendors, don't compare accuracy figures; compare confusion matrices at the same decision threshold, and ask which system produces fewer FNs (missed attacks) at an acceptable FP rate.