Step 3: Choosing K

Elbow method, silhouette score, picking the right K


The Elbow Method (Inertia)

Inertia is the sum of squared distances from each sample to its nearest centroid. Lower inertia = tighter clusters, but K=N (every point is its own cluster) trivially gives inertia=0.
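To make that definition concrete, here is a small sketch that computes inertia by hand and checks it against scikit-learn's `inertia_` attribute. Synthetic `make_blobs` data stands in for your `X_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_scaled: 300 points around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Inertia = sum over samples of squared distance to the assigned
# (nearest) centroid
diffs = X - km.cluster_centers_[km.labels_]
manual_inertia = np.sum(diffs ** 2)

print(manual_inertia, km.inertia_)  # the two values match
```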

Plot inertia vs K. The "elbow" — where the curve bends from steep to flat — suggests the right K.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
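If you want to locate the elbow numerically rather than by eye, one simple heuristic is to pick the K with the largest second difference of the inertia curve, i.e. the sharpest change in slope. The `elbow_k` helper below is a rough sketch of that idea, not the only way to detect a knee (dedicated knee-detection tools exist):

```python
import numpy as np

def elbow_k(k_values, inertias):
    """Return the K at the sharpest bend (largest discrete 2nd difference)."""
    second_diff = np.diff(inertias, n=2)  # one value per interior K
    # second_diff[i] belongs to k_values[i + 1]
    return list(k_values)[int(np.argmax(second_diff)) + 1]

# Example: inertia drops steeply until K=4, then flattens
ks = range(2, 9)
inertia_curve = [900, 500, 200, 150, 130, 120, 115]
print(elbow_k(ks, inertia_curve))  # prints 4
```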

Silhouette Score

For each sample, the silhouette score measures how well it fits its own cluster vs the nearest other cluster:

Component   Definition
a           Mean distance to other samples in the same cluster
b           Mean distance to samples in the nearest other cluster
s           (b - a) / max(a, b)

Score   Meaning
+1      Perfect — sample is far from other clusters, close to its own
 0      On the boundary between two clusters
-1      Wrong cluster — closer to another cluster than its own

import numpy as np
from sklearn.metrics import silhouette_score

scores = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

# Pick K with the highest silhouette score
best_k = K_range[np.argmax(scores)]
print(f"Best K = {best_k}, silhouette = {max(scores):.3f}")
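The average score can hide one weak cluster. `silhouette_samples` returns a score per point, so you can inspect each cluster individually. Synthetic blobs again stand in for your `X_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for X_scaled
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

sample_scores = silhouette_samples(X, labels)  # one score per sample
for c in range(4):
    cluster_scores = sample_scores[labels == c]
    print(f"cluster {c}: mean={cluster_scores.mean():.3f}, "
          f"min={cluster_scores.min():.3f}")

# The mean of the per-sample scores equals the overall silhouette_score
assert np.isclose(sample_scores.mean(), silhouette_score(X, labels))
```

A cluster whose minimum per-sample score is low (or negative) is a candidate for splitting or for a different K.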

Combining Both Methods

Method            Measures                     Pick K where...
Elbow (inertia)   Total cluster compactness    Curve bends from steep to flat
Silhouette        Cluster separation quality   Score is highest

When both methods agree, you have strong evidence for that K. When they disagree, prefer the silhouette score — it measures both compactness and separation, while inertia only measures compactness.
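A single loop can record both metrics side by side so you can check whether they agree. This sketch uses synthetic `make_blobs` data in place of your scaled feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    results[k] = (km.inertia_, silhouette_score(X, labels))

# Print both metrics per K: look for the inertia elbow and the
# silhouette peak in the same place
print(f"{'K':>2}  {'inertia':>10}  {'silhouette':>10}")
for k, (inertia, sil) in results.items():
    print(f"{k:>2}  {inertia:>10.1f}  {sil:>10.3f}")

best_k = max(results, key=lambda k: results[k][1])
```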

Why K Depends on Your Security Goals

The "right" K is not purely a statistical question:

Scenario                  K choice               Rationale
Broad anomaly detection   Fewer clusters (3–4)   Coarse baseline; anything unusual is flagged
Fine-grained profiling    More clusters (6–8)    Separate DNS, HTTP, SSH, IoT, etc.
Specific threat hunting   Domain-informed K      K matches known traffic categories

Think Deeper

The elbow method suggests K=3 but the silhouette score peaks at K=4. Which do you choose and why?

In security, prefer K=4. The silhouette score directly measures cluster quality (how well-separated and cohesive each cluster is), while the elbow method only tracks total inertia, which always decreases as K grows. A higher silhouette at K=4 means there is a genuine fourth behavioural group in the traffic. In a SOC context, that fourth cluster might separate DNS traffic from ICMP, giving you finer-grained baseline profiles for anomaly detection.
Cybersecurity tie-in: Choosing K is a security architecture decision. Too few clusters merge distinct traffic types (DNS + web) into one, making anomalies harder to detect within each type. Too many clusters create noise and false positives. The right K reflects the actual behavioural diversity of your network.
