The Elbow Method (Inertia)
Inertia is the sum of squared distances from each sample to its nearest centroid. Lower inertia means tighter clusters, but inertia always decreases as K grows: at K=N (every point is its own cluster) it is trivially 0, so you cannot simply pick the K that minimizes it.
Plot inertia vs K. The "elbow" — where the curve bends from steep to flat — suggests the right K.
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X_scaled: your standardized feature matrix
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
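Eyeballing the bend is subjective. One simple numeric heuristic (a sketch, not a scikit-learn feature) picks the K with the largest second difference of the inertia curve, i.e. the point where the rate of decrease changes most sharply. The `make_blobs` data below is synthetic and for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters (illustration only)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

K_range = range(2, 11)
inertias = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# Second difference: how sharply the slope of the inertia curve changes.
# It is defined for the interior K values, K_range[1:-1].
second_diff = np.diff(inertias, 2)
elbow_k = list(K_range)[1:-1][int(np.argmax(second_diff))]
print(f"Elbow at K = {elbow_k}")
```

This is a convenience check, not a replacement for looking at the plot: on noisy curves the largest second difference can land on a spurious kink.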
Silhouette Score
For each sample, the silhouette score measures how well it fits its own cluster vs the nearest other cluster:
| Component | Definition |
|---|---|
| a | Mean distance to other samples in the same cluster |
| b | Mean distance to samples in the nearest other cluster |
| s | (b - a) / max(a, b) |

| Score range | Meaning |
|---|---|
| +1 | Perfect — sample is far from other clusters, close to its own |
| 0 | On the boundary between two clusters |
| -1 | Wrong cluster — closer to another cluster than its own |
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_scaled and K_range as defined above
scores = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

# Pick K with the highest silhouette score
best_k = K_range[np.argmax(scores)]
print(f"Best K = {best_k}, silhouette = {max(scores):.3f}")
```
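The mean score can hide problem points. scikit-learn's `silhouette_samples` returns the per-sample values behind the table above, letting you spot individual samples on a boundary (s near 0) or likely misassigned (s below 0). The `make_blobs` data here is synthetic, for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data (illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
s = silhouette_samples(X, labels)  # one score per sample, each in [-1, 1]

# Low or negative scores flag boundary / misassigned candidates
suspect = np.where(s < 0.25)[0]
print(f"Mean silhouette = {s.mean():.3f}, {len(suspect)} low-score samples")
```

In a security context these low-score samples are worth a second look: traffic that sits between behavioral clusters is often exactly the traffic you want to investigate.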
Combining Both Methods
| Method | Measures | Pick K where... |
|---|---|---|
| Elbow (inertia) | Total cluster compactness | Curve bends from steep to flat |
| Silhouette | Cluster separation quality | Score is highest |
When both methods agree, you have strong evidence for that K. When they disagree, prefer the silhouette score — it measures both compactness and separation, while inertia only measures compactness.
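Putting the two together, here is a sketch of a selection loop that computes both metrics in one pass and reports whether they agree. The second-difference elbow is a heuristic stand-in for reading the plot, and the `make_blobs` data is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (illustration only)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=0)

K_range = list(range(2, 11))
inertias, sil_scores = [], []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X, labels))

# Elbow via largest second difference; silhouette via highest mean score
elbow_k = K_range[1:-1][int(np.argmax(np.diff(inertias, 2)))]
sil_k = K_range[int(np.argmax(sil_scores))]

if elbow_k == sil_k:
    print(f"Both methods agree: K = {sil_k}")
else:
    # When they disagree, prefer silhouette (compactness + separation)
    print(f"Elbow suggests {elbow_k}, silhouette suggests {sil_k}; using {sil_k}")
```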
Why K Depends on Your Security Goals
The "right" K is not purely a statistical question:
| Scenario | K choice | Rationale |
|---|---|---|
| Broad anomaly detection | Fewer clusters (3–4) | Coarse baseline; anything unusual is flagged |
| Fine-grained profiling | More clusters (6–8) | Separate DNS, HTTP, SSH, IoT, etc. |
| Specific threat hunting | Domain-informed K | K matches known traffic categories |
Think Deeper
The elbow method suggests K=3 but the silhouette score peaks at K=4. Which do you choose and why?