## Distance as Anomaly Score
After K-Means assigns every sample to a cluster, the distance from its centroid becomes an anomaly score:
| Sample type | Centroid distance | Meaning |
|---|---|---|
| Normal traffic | Low (e.g., 0.8) | Close to its cluster centre — typical behaviour |
| Borderline | Medium (e.g., 3.2) | At the edge of a cluster — worth monitoring |
| Anomaly | High (e.g., 6.5) | Far from all centroids — investigate |
```python
import numpy as np

# km.transform() returns an (n_samples, K) matrix of distances
# to every centroid; np.min(..., axis=1) keeps the nearest one
distances = np.min(km.transform(X_scaled), axis=1)
```
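The snippet above assumes `km` and `X_scaled` already exist. A self-contained sketch of the full flow, using synthetic two-feature "traffic" as a stand-in for real flow data (the group locations and sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two normal traffic groups plus a handful of outliers (all synthetic)
normal = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
shifted = rng.normal(loc=[5, 5], scale=0.5, size=(200, 2))
outliers = rng.normal(loc=[10, -5], scale=0.5, size=(5, 2))
X = np.vstack([normal, shifted, outliers])

# Scale first -- distances on unscaled features are meaningless
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
distances = np.min(km.transform(X_scaled), axis=1)

# The planted outliers should sit far from both centroids
print(distances[:5].round(2), distances[-5:].round(2))
```

The outliers at the end of `X` land far from either learned centroid, so their distances dominate the score distribution.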
## Setting a Threshold
A threshold converts continuous anomaly scores into binary alerts. Common approaches:
| Method | Threshold | Tradeoff |
|---|---|---|
| Percentile | 95th percentile of distances | Flags top 5% — manageable alert volume |
| Standard deviations | mean + 2σ | Statistically principled; assumes normality |
| Manual | Domain expert picks a value | Tuned to SOC capacity |
```python
# Percentile-based threshold: flag the top 5% of distances
threshold = np.percentile(distances, 95)
anomaly_mask = distances > threshold

print(f"Threshold: {threshold:.2f}")
print(f"Flagged: {anomaly_mask.sum()} / {len(distances)}")

# Show the flagged connections
anomalies = X[anomaly_mask]
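The standard-deviation method from the table works the same way; a minimal sketch, using randomly generated distances as a stand-in for real centroid distances:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for centroid distances (roughly normal, non-negative)
distances = np.abs(rng.normal(1.0, 0.5, size=1000))

# Mean + 2 sigma threshold; under normality this flags roughly 2-3%
threshold = distances.mean() + 2 * distances.std()
anomaly_mask = distances > threshold
print(f"Flagged: {anomaly_mask.sum()} / {len(distances)}")
```

Unlike the percentile method, the flagged fraction here is not fixed in advance: a heavy-tailed distance distribution will push more samples past mean + 2σ, which is why the normality assumption in the table matters.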
## Verifying Against True Labels
If you have ground truth labels (even partial), check how well the anomaly flags overlap:
```python
# How many flagged samples are actually attacks?
flagged_attacks = y[anomaly_mask].sum()
total_attacks = y.sum()

precision = flagged_attacks / anomaly_mask.sum()  # of flagged, how many are real attacks?
recall = flagged_attacks / total_attacks          # of real attacks, how many did we flag?

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```
High recall means you are catching most attacks. High precision means you are not overwhelming analysts with false positives.
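A tiny worked check of the arithmetic, on hypothetical labels for ten connections (both arrays are made up for illustration):

```python
import numpy as np

# Hypothetical ground truth: 3 real attacks among 10 connections
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
# Hypothetical detector output: 3 connections flagged, 2 of them real attacks
anomaly_mask = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

flagged_attacks = y[anomaly_mask].sum()            # 2 true positives
precision = flagged_attacks / anomaly_mask.sum()   # 2 / 3 flagged
recall = flagged_attacks / y.sum()                 # 2 / 3 attacks caught

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
# → Precision: 0.67, Recall: 0.67
```

One flagged connection was benign (a false positive, hurting precision) and one attack slipped through unflagged (a false negative, hurting recall).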
## Limitations of K-Means Anomaly Detection
| Limitation | Why it matters | Mitigation |
|---|---|---|
| Assumes spherical clusters | Real traffic groups may be elongated or irregular | Use DBSCAN or Gaussian Mixture Models |
| Sensitive to K choice | Wrong K merges or splits real groups | Use elbow + silhouette together |
| Static baseline | Network behaviour changes over time | Retrain periodically on recent data |
| Feature scale dependent | Unscaled features produce meaningless distances | Always apply StandardScaler first |
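For the spherical-cluster limitation, the Gaussian Mixture mitigation can be sketched as follows: instead of centroid distance, use the negative log-likelihood under the fitted density as the anomaly score. The elongated synthetic data below is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Elongated cluster: stretched along one axis, where a spherical
# K-Means distance would mis-rank points on the long axis
X_scaled = rng.normal(size=(300, 2)) * np.array([3.0, 0.3])

gmm = GaussianMixture(n_components=1, random_state=7).fit(X_scaled)

# Low log-likelihood = poorly explained by the learned density = anomalous
scores = -gmm.score_samples(X_scaled)
threshold = np.percentile(scores, 95)
print(f"Flagged: {(scores > threshold).sum()} / {len(scores)}")
```

Because the mixture learns a full covariance per component, points far out along the short axis score as anomalous even when their raw Euclidean distance to the centre is modest.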
## Think Deeper
Try this:
You set your anomaly threshold at the 95th percentile of centroid distances. A week later, 20% of connections are flagged. What happened?
The network behaviour has drifted. The baseline clusters were learned on old traffic patterns, but the network has changed: perhaps a new application was deployed, a server moved, or traffic patterns shifted seasonally. This is concept drift. The fix: periodically retrain the K-Means model on recent baseline data so the centroids reflect current normal behaviour, not last month's.
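The retraining fix can be sketched as a simple refit-on-recent-window routine (the window sizes, drift amount, and `retrain_baseline` helper are all illustrative assumptions, not a production scheduler):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def retrain_baseline(recent_traffic, k=2, seed=0):
    """Refit scaler + K-Means on a recent window of traffic.
    In production this would run on a schedule, e.g. weekly."""
    scaler = StandardScaler().fit(recent_traffic)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
        scaler.transform(recent_traffic))
    return scaler, km

rng = np.random.default_rng(1)
new_window = rng.normal(loc=4.0, size=(500, 2))  # drifted behaviour

# Refit on the recent window so centroids track current "normal"
scaler, km = retrain_baseline(new_window)
distances = np.min(km.transform(scaler.transform(new_window)), axis=1)
threshold = np.percentile(distances, 95)
print(f"Flagged after retrain: {(distances > threshold).sum()} / {len(new_window)}")
```

After the refit, the 95th-percentile threshold again flags roughly 5% of the window instead of the inflated 20% a stale baseline would produce.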
Cybersecurity tie-in: Distance-based anomaly scoring is the foundation of UEBA (User and Entity Behaviour Analytics). Production SIEM platforms like Microsoft Sentinel use exactly this approach: learn a baseline of normal entity behaviour, then flag deviations. The threshold is tuned to match the SOC's capacity to investigate alerts.