Step 2: K-Means and Visualisation

Cluster assignment, PCA to 2D, colour-coded plots


K-Means Algorithm

K-Means iteratively assigns samples to clusters and updates cluster centres:

Step | Action | Result
1. Initialise | Place K centroids randomly | K starting positions
2. Assign | Each sample goes to the nearest centroid | K groups of samples
3. Update | Move each centroid to the mean of its group | Centroids shift
4. Repeat | Go to step 2 until assignments stop changing | Stable clusters

The result: K clusters, each with a centroid. A sample's distance from its centroid indicates how "typical" it is for that cluster. High distance = anomalous.
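The four steps above can be sketched directly in NumPy. This is a minimal illustration on a hypothetical two-blob dataset (the random data and K=2 are assumptions for the demo, not part of the lab), not a replacement for scikit-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: two well-separated blobs standing in for real features
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

K = 2
# 1. Initialise: pick K random samples as starting centroids
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # 2. Assign: each sample goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. Update: move each centroid to the mean of its group
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # 4. Repeat until the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

With blobs this far apart the loop converges in a handful of iterations; scikit-learn's KMeans adds smarter initialisation (k-means++) and multiple restarts on top of exactly this loop.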

Training K-Means

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

labels  = km.labels_            # cluster assignment per point
centers = km.cluster_centers_   # centroid coordinates (K x n_features)
inertia = km.inertia_           # sum of squared distances to centroids

Attribute | What it gives you
labels_ | Array of cluster IDs (0 to K-1) for each sample
cluster_centers_ | Centroid coordinates, the "average" of each cluster
inertia_ | Total within-cluster sum of squares; lower = tighter clusters
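Beyond these attributes, KMeans.transform() returns each sample's distance to every centroid, which is what the "high distance = anomalous" idea needs. A sketch, using randomly generated 6-feature data as a stand-in for the scaled feature matrix (the 5% threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for X_scaled (200 samples, 6 features)
X_scaled = rng.normal(size=(200, 6))

km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_scaled)

# transform() gives an (n_samples x K) matrix of distances to each centroid;
# pick out each sample's distance to its own (assigned) centroid
dist_to_own = km.transform(X_scaled)[np.arange(len(X_scaled)), km.labels_]

# Flag the 5% of samples farthest from their centroid as candidate anomalies
threshold = np.quantile(dist_to_own, 0.95)
anomalies = np.where(dist_to_own > threshold)[0]
```

Because labels_ is defined as the nearest centroid, dist_to_own is simply the row-wise minimum of the transform() matrix.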

PCA for Visualisation

Network features are 6D — impossible to plot directly. PCA (Principal Component Analysis) projects to 2D by finding the two directions of maximum variance. This is only for visualisation — K-Means runs on all 6 features.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=km.labels_, cmap='viridis',
                      alpha=0.6, s=15)
plt.colorbar(scatter, label="Cluster")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-Means Clusters (PCA projection)")
plt.show()
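One caveat worth checking before trusting the plot: if the first two principal components capture only a small fraction of the total variance, the 2D picture can be badly misleading. PCA exposes this via explained_variance_ratio_. A quick check, using random 6-feature data as an assumed stand-in for X_scaled:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical stand-in for the scaled 6-feature matrix
X_scaled = rng.normal(size=(300, 6))

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each principal component
pc1, pc2 = pca.explained_variance_ratio_
print(f"PC1 {pc1:.1%}, PC2 {pc2:.1%}, total {pc1 + pc2:.1%}")
```

If the total is low (say, under half), treat apparent overlap in the 2D plot with suspicion; the clusters may separate along directions the projection discards.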

Revealing True Labels

After clustering, overlay the actual traffic labels (if available) to check alignment:

# Colour by true label instead of cluster
plt.scatter(X_2d[:, 0], X_2d[:, 1],
            c=y_true, cmap='Set1', alpha=0.6, s=15)
plt.title("True labels overlaid on PCA projection")
plt.show()

If clusters align well with true labels, K-Means has discovered the same groups that human analysts would identify. If they don't, the features may need engineering or K may be wrong.
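"Aligning well" can also be quantified rather than eyeballed. The adjusted Rand index (ARI) compares two labellings while ignoring which numeric ID each cluster happens to get. A sketch on hypothetical synthetic data with known labels (three well-separated groups, chosen here purely for the demo):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
# Hypothetical data: three separated groups with known true labels
X = np.vstack([rng.normal(c, 0.3, (40, 6)) for c in (0, 3, 6)])
y_true = np.repeat([0, 1, 2], 40)

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# ARI is 1.0 for a perfect match and near 0 for random assignment;
# it is permutation-invariant, so cluster IDs need not match y_true
ari = adjusted_rand_score(y_true, labels)
```

An ARI near 1 means K-Means recovered the labelled groups; a score near 0 suggests the features or K need rethinking, exactly as the paragraph above describes.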


Think Deeper

After running K-Means with K=4, you project to 2D with PCA and see that two clusters overlap heavily. What does this mean?

It could mean two things. First, the clusters may genuinely overlap in the two principal components but be well-separated in the full 6D space -- PCA only shows a projection, not the full picture. Second, K may be too large and those two clusters should actually be one. Check the silhouette score for those clusters -- if samples in the overlapping region have low or negative silhouette scores, the split is not justified.
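The silhouette check described above can be sketched as follows. The data here is a hypothetical two-group set deliberately over-clustered with K=4, and the 0.1 cutoff for "suspect" samples is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
# Hypothetical data: two real groups, deliberately over-clustered below
X = np.vstack([rng.normal(0, 0.5, (60, 6)),
               rng.normal(4, 0.5, (60, 6))])

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

overall = silhouette_score(X, labels)        # mean score over all samples
per_sample = silhouette_samples(X, labels)   # one score per sample

# Low or negative scores mark samples sitting between clusters;
# many such samples in the overlap region argue against the split
suspect = np.where(per_sample < 0.1)[0]
```

Because K=4 forces each real group to be split, the samples near each artificial split get silhouette scores near zero, which is the signal that the split is not justified.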
Cybersecurity tie-in: The 2D PCA plot is a powerful tool for SOC briefings. You can show analysts a visual map of network behaviour: "These four clusters are your normal traffic patterns. Anything that appears far from all clusters is worth investigating." It turns abstract ML output into an actionable picture.
