## Distance as Anomaly Score
After K-Means assigns every sample to a cluster, the distance from its centroid becomes an anomaly score:
| Sample type | Centroid distance | Meaning |
|---|---|---|
| Normal traffic | Low (e.g., 0.8) | Close to its cluster centre — typical behaviour |
| Borderline | Medium (e.g., 3.2) | At the edge of a cluster — worth monitoring |
| Anomaly | High (e.g., 6.5) | Far from all centroids — investigate |
```python
import numpy as np

# km.transform() returns an (n_samples, K) matrix of distances
# to every centroid; np.min(..., axis=1) keeps the nearest one
distances = np.min(km.transform(X_scaled), axis=1)
```
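The snippet above assumes `km` and `X_scaled` already exist. A self-contained sketch of the full flow, using synthetic two-feature "traffic" as a stand-in for real flow data (the group locations and sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two normal traffic groups plus a handful of outliers (all synthetic)
normal = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
shifted = rng.normal(loc=[5, 5], scale=0.5, size=(200, 2))
outliers = rng.normal(loc=[10, -5], scale=0.5, size=(5, 2))
X = np.vstack([normal, shifted, outliers])

# Scale first -- distances on unscaled features are meaningless
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
distances = np.min(km.transform(X_scaled), axis=1)

# The planted outliers should sit far from both centroids
print(distances[:5].round(2), distances[-5:].round(2))
```

The outliers at the end of `X` land far from either learned centroid, so their distances dominate the score distribution.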
## Setting a Threshold
A threshold converts continuous anomaly scores into binary alerts. Common approaches:
| Method | Threshold | Tradeoff |
|---|---|---|
| Percentile | 95th percentile of distances | Flags top 5% — manageable alert volume |
| Standard deviations | mean + 2σ | Statistically principled; assumes normality |
| Manual | Domain expert picks a value | Tuned to SOC capacity |
```python
# Percentile-based threshold: flag the top 5% of distances
threshold = np.percentile(distances, 95)
anomaly_mask = distances > threshold

print(f"Threshold: {threshold:.2f}")
print(f"Flagged: {anomaly_mask.sum()} / {len(distances)}")

# Show the flagged connections
anomalies = X[anomaly_mask]
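The standard-deviation method from the table works the same way; a minimal sketch, using randomly generated distances as a stand-in for real centroid distances:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for centroid distances (roughly normal, non-negative)
distances = np.abs(rng.normal(1.0, 0.5, size=1000))

# Mean + 2 sigma threshold; under normality this flags roughly 2-3%
threshold = distances.mean() + 2 * distances.std()
anomaly_mask = distances > threshold
print(f"Flagged: {anomaly_mask.sum()} / {len(distances)}")
```

Unlike the percentile method, the flagged fraction here is not fixed in advance: a heavy-tailed distance distribution will push more samples past mean + 2σ, which is why the normality assumption in the table matters.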
## Verifying Against True Labels
If you have ground truth labels (even partial), check how well the anomaly flags overlap:
```python
# How many flagged samples are actually attacks?
flagged_attacks = y[anomaly_mask].sum()
total_attacks = y.sum()

precision = flagged_attacks / anomaly_mask.sum()  # of flagged, how many are real attacks?
recall = flagged_attacks / total_attacks          # of real attacks, how many did we flag?

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```
High recall means you are catching most attacks. High precision means you are not overwhelming analysts with false positives.
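A tiny worked check of the arithmetic, on hypothetical labels for ten connections (both arrays are made up for illustration):

```python
import numpy as np

# Hypothetical ground truth: 3 real attacks among 10 connections
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
# Hypothetical detector output: 3 connections flagged, 2 of them real attacks
anomaly_mask = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

flagged_attacks = y[anomaly_mask].sum()            # 2 true positives
precision = flagged_attacks / anomaly_mask.sum()   # 2 / 3 flagged
recall = flagged_attacks / y.sum()                 # 2 / 3 attacks caught

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
# → Precision: 0.67, Recall: 0.67
```

One flagged connection was benign (a false positive, hurting precision) and one attack slipped through unflagged (a false negative, hurting recall).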
## Limitations of K-Means Anomaly Detection
| Limitation | Why it matters | Mitigation |
|---|---|---|
| Assumes spherical clusters | Real traffic groups may be elongated or irregular | Use DBSCAN or Gaussian Mixture Models |
| Sensitive to K choice | Wrong K merges or splits real groups | Use elbow + silhouette together |
| Static baseline | Network behaviour changes over time | Retrain periodically on recent data |
| Feature scale dependent | Unscaled features produce meaningless distances | Always apply StandardScaler first |
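For the spherical-cluster limitation, the Gaussian Mixture mitigation can be sketched as follows: instead of centroid distance, use the negative log-likelihood under the fitted density as the anomaly score. The elongated synthetic data below is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Elongated cluster: stretched along one axis, where a spherical
# K-Means distance would mis-rank points on the long axis
X_scaled = rng.normal(size=(300, 2)) * np.array([3.0, 0.3])

gmm = GaussianMixture(n_components=1, random_state=7).fit(X_scaled)

# Low log-likelihood = poorly explained by the learned density = anomalous
scores = -gmm.score_samples(X_scaled)
threshold = np.percentile(scores, 95)
print(f"Flagged: {(scores > threshold).sum()} / {len(scores)}")
```

Because the mixture learns a full covariance per component, points far out along the short axis score as anomalous even when their raw Euclidean distance to the centre is modest.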
## Think Deeper
Try this:
You set your anomaly threshold at the 95th percentile of centroid distances. A week later, 20% of connections are flagged. What happened?
The network behaviour has drifted. The baseline clusters were learned on old traffic patterns, but the network has changed: perhaps a new application was deployed, a server moved, or traffic patterns shifted seasonally. This is concept drift. The fix: periodically retrain the K-Means model on recent baseline data so the centroids reflect current normal behaviour, not last month's.
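The retraining fix can be sketched as a simple refit-on-recent-window routine (the window sizes, drift amount, and `retrain_baseline` helper are all illustrative assumptions, not a production scheduler):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def retrain_baseline(recent_traffic, k=2, seed=0):
    """Refit scaler + K-Means on a recent window of traffic.
    In production this would run on a schedule, e.g. weekly."""
    scaler = StandardScaler().fit(recent_traffic)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
        scaler.transform(recent_traffic))
    return scaler, km

rng = np.random.default_rng(1)
new_window = rng.normal(loc=4.0, size=(500, 2))  # drifted behaviour

# Refit on the recent window so centroids track current "normal"
scaler, km = retrain_baseline(new_window)
distances = np.min(km.transform(scaler.transform(new_window)), axis=1)
threshold = np.percentile(distances, 95)
print(f"Flagged after retrain: {(distances > threshold).sum()} / {len(new_window)}")
```

After the refit, the 95th-percentile threshold again flags roughly 5% of the window instead of the inflated 20% a stale baseline would produce.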
Cybersecurity tie-in: Distance-based anomaly scoring is the foundation of UEBA (User and Entity Behaviour Analytics). Production SIEM platforms like Microsoft Sentinel use exactly this approach: learn a baseline of normal entity behaviour, then flag deviations. The threshold is tuned to match the SOC's capacity to investigate alerts.