Step 4: Anomaly Scoring

Distance from centroid as an anomaly score


Distance as Anomaly Score

After K-Means assigns every sample to a cluster, the distance from its centroid becomes an anomaly score:

| Sample type | Centroid distance | Meaning |
| --- | --- | --- |
| Normal traffic | Low (e.g., 0.8) | Close to its cluster centre — typical behaviour |
| Borderline | Medium (e.g., 3.2) | At the edge of a cluster — worth monitoring |
| Anomaly | High (e.g., 6.5) | Far from all centroids — investigate |
```python
import numpy as np

# km.transform(X_scaled) returns an (n_samples, K) matrix of
# distances from each sample to each of the K centroids;
# np.min(..., axis=1) picks the distance to the nearest centroid.
distances = np.min(km.transform(X_scaled), axis=1)
```

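If you want to run this end to end without the earlier steps, here is a self-contained sketch; the synthetic two-group data with a handful of far-away points is a hypothetical stand-in for the scaled traffic features (`X_scaled`, `km`) built previously:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# 300 "normal" samples in two groups, plus 5 far-away outliers
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(150, 2)),
    rng.normal(loc=[12, -8], scale=0.5, size=(5, 2)),
])

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Nearest-centroid distance per sample = anomaly score
distances = np.min(km.transform(X_scaled), axis=1)

# The 5 planted outliers should dominate the top of the score ranking
print(np.argsort(distances)[-5:])
```

The outliers were never given their own cluster, yet their large nearest-centroid distances still surface them — the core idea of distance-based scoring.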
Setting a Threshold

A threshold converts continuous anomaly scores into binary alerts. Common approaches:

| Method | Threshold | Tradeoff |
| --- | --- | --- |
| Percentile | 95th percentile of distances | Flags top 5% — manageable alert volume |
| Standard deviations | mean + 2σ | Statistically principled; assumes normality |
| Manual | Domain expert picks a value | Tuned to SOC capacity |
```python
# Percentile-based threshold: flag the top 5% of distances
threshold = np.percentile(distances, 95)
anomaly_mask = distances > threshold

print(f"Threshold: {threshold:.2f}")
print(f"Flagged:   {anomaly_mask.sum()} / {len(distances)}")

# Show the flagged connections
anomalies = X[anomaly_mask]
```
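The mean + 2σ rule from the table works the same way; a minimal sketch, using synthetic scores as a stand-in for the real `distances` array:

```python
import numpy as np

# Standard-deviation threshold (assumes roughly normal distances).
# Synthetic stand-in scores keep the snippet self-contained.
rng = np.random.default_rng(0)
distances = rng.normal(loc=2.0, scale=0.5, size=1000)

threshold = distances.mean() + 2 * distances.std()
anomaly_mask = distances > threshold

# With normally distributed scores, mean + 2σ flags roughly the top 2-3%
print(f"Flagged: {anomaly_mask.sum()} / {len(distances)}")
```

Note the caveat from the table: if the distance distribution is heavily skewed (common with real traffic), the 2σ rule can flag far more or far fewer samples than expected, which is why the percentile method is often the safer default.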

Verifying Against True Labels

If you have ground truth labels (even partial), check how well the anomaly flags overlap:

```python
# How many flagged samples are actually attacks?
# (y is the ground-truth label array: 1 = attack, 0 = benign)
flagged_attacks = y[anomaly_mask].sum()
total_attacks = y.sum()

precision = flagged_attacks / anomaly_mask.sum()
recall = flagged_attacks / total_attacks

print(f"Precision: {precision:.2f}")  # of flagged, how many are real attacks?
print(f"Recall:    {recall:.2f}")     # of real attacks, how many did we flag?
```

High recall means you are catching most attacks. High precision means you are not overwhelming analysts with false positives.
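The threshold directly controls this tradeoff. A sketch with hypothetical synthetic scores (benign scores clustered low, attack scores high) shows how sweeping the percentile moves precision and recall in opposite directions:

```python
import numpy as np

# Synthetic stand-ins: 950 benign connections with low scores,
# 50 attacks with high scores.
rng = np.random.default_rng(1)
distances = np.concatenate([rng.normal(1.5, 0.5, 950),   # benign
                            rng.normal(5.0, 1.0, 50)])   # attacks
y = np.concatenate([np.zeros(950), np.ones(50)])

for pct in (90, 95, 99):
    thr = np.percentile(distances, pct)
    mask = distances > thr
    precision = y[mask].sum() / mask.sum()
    recall = y[mask].sum() / y.sum()
    print(f"{pct}th pct: precision={precision:.2f}, recall={recall:.2f}")
```

A looser threshold (90th percentile) catches more attacks but floods analysts with benign flags; a stricter one (99th) keeps alerts clean but misses attacks.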

Limitations of K-Means Anomaly Detection

| Limitation | Why it matters | Mitigation |
| --- | --- | --- |
| Assumes spherical clusters | Real traffic groups may be elongated or irregular | Use DBSCAN or Gaussian Mixture Models |
| Sensitive to K choice | Wrong K merges or splits real groups | Use elbow + silhouette together |
| Static baseline | Network behaviour changes over time | Retrain periodically on recent data |
| Feature scale dependent | Unscaled features produce meaningless distances | Always apply StandardScaler first |

Think Deeper

You set your anomaly threshold at the 95th percentile of centroid distances. A week later, 20% of connections are flagged. What happened?

The network behaviour has drifted. The baseline clusters were learned on old traffic patterns, but the network has changed: perhaps a new application was deployed, a server moved, or traffic patterns shifted seasonally. This is concept drift. The fix: periodically retrain the K-Means model on recent baseline data so the centroids reflect current normal behaviour, not last month's normal.
Cybersecurity tie-in: Distance-based anomaly scoring is the foundation of UEBA (User and Entity Behaviour Analytics). Production SIEM platforms like Microsoft Sentinel use exactly this approach: learn a baseline of normal entity behaviour, then flag deviations. The threshold is tuned to match the SOC's capacity to investigate alerts.
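The periodic-retraining fix can be sketched as a small helper; the function names, window size, and K below are illustrative choices, not part of any particular platform's API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sketch: refit the scaler and K-Means on a sliding
# window of recent baseline traffic so centroids track current normal.
def retrain_baseline(recent_traffic, k=2):
    scaler = StandardScaler().fit(recent_traffic)
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(scaler.transform(recent_traffic))
    return scaler, km

def score(scaler, km, X_new):
    # Nearest-centroid distance under the *current* baseline
    return np.min(km.transform(scaler.transform(X_new)), axis=1)

# e.g. retrain weekly on the last 7 days of connections (synthetic here)
rng = np.random.default_rng(0)
last_week = rng.normal([0, 0], 1.0, size=(500, 2))
scaler, km = retrain_baseline(last_week)
print(score(scaler, km, rng.normal([0, 0], 1.0, size=(5, 2))))
```

Retraining the scaler together with the model matters: if traffic volumes grow, an old scaler alone can push all new samples past the threshold even when their behaviour is normal.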
