Step 2: K-Means and Visualisation

Cluster assignment, PCA to 2D, colour-coded plots


K-Means Algorithm

K-Means iteratively assigns samples to clusters and updates cluster centres:

Step | Action | Result
1. Initialise | Place K centroids randomly | K starting positions
2. Assign | Each sample goes to the nearest centroid | K groups of samples
3. Update | Move each centroid to the mean of its group | Centroids shift
4. Repeat | Go to step 2 until assignments stop changing | Stable clusters

The result: K clusters, each with a centroid. A sample's distance from its centroid indicates how "typical" it is for that cluster. High distance = anomalous.
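The four steps above can be sketched directly in NumPy. This is a minimal illustration on a hypothetical two-blob dataset (the random data and K=2 are assumptions for the demo, not part of the lab), not a replacement for scikit-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy data: two well-separated blobs standing in for real features
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

K = 2
# 1. Initialise: pick K random samples as starting centroids
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # 2. Assign: each sample goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. Update: move each centroid to the mean of its group
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # 4. Repeat until the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

With blobs this far apart the loop converges in a handful of iterations; scikit-learn's KMeans adds smarter initialisation (k-means++) and multiple restarts on top of exactly this loop.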

Training K-Means

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

labels  = km.labels_            # cluster assignment per point
centers = km.cluster_centers_   # centroid coordinates (K x n_features)
inertia = km.inertia_           # sum of squared distances to centroids

Attribute | What it gives you
labels_ | Array of cluster IDs (0 to K-1) for each sample
cluster_centers_ | Centroid coordinates, the "average" of each cluster
inertia_ | Total within-cluster sum of squares; lower = tighter clusters
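Beyond these attributes, KMeans.transform() returns each sample's distance to every centroid, which is what the "high distance = anomalous" idea needs. A sketch, using randomly generated 6-feature data as a stand-in for the scaled feature matrix (the 5% threshold is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for X_scaled (200 samples, 6 features)
X_scaled = rng.normal(size=(200, 6))

km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_scaled)

# transform() gives an (n_samples x K) matrix of distances to each centroid;
# pick out each sample's distance to its own (assigned) centroid
dist_to_own = km.transform(X_scaled)[np.arange(len(X_scaled)), km.labels_]

# Flag the 5% of samples farthest from their centroid as candidate anomalies
threshold = np.quantile(dist_to_own, 0.95)
anomalies = np.where(dist_to_own > threshold)[0]
```

Because labels_ is defined as the nearest centroid, dist_to_own is simply the row-wise minimum of the transform() matrix.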

PCA for Visualisation

Network features are 6D — impossible to plot directly. PCA (Principal Component Analysis) projects to 2D by finding the two directions of maximum variance. This is only for visualisation — K-Means runs on all 6 features.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=km.labels_, cmap='viridis',
                      alpha=0.6, s=15)
plt.colorbar(scatter, label="Cluster")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-Means Clusters (PCA projection)")
plt.show()
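One caveat worth checking before trusting the plot: if the first two principal components capture only a small fraction of the total variance, the 2D picture can be badly misleading. PCA exposes this via explained_variance_ratio_. A quick check, using random 6-feature data as an assumed stand-in for X_scaled:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical stand-in for the scaled 6-feature matrix
X_scaled = rng.normal(size=(300, 6))

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each principal component
pc1, pc2 = pca.explained_variance_ratio_
print(f"PC1 {pc1:.1%}, PC2 {pc2:.1%}, total {pc1 + pc2:.1%}")
```

If the total is low (say, under half), treat apparent overlap in the 2D plot with suspicion; the clusters may separate along directions the projection discards.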

Revealing True Labels

After clustering, overlay the actual traffic labels (if available) to check alignment:

# Colour by true label instead of cluster
plt.scatter(X_2d[:, 0], X_2d[:, 1],
            c=y_true, cmap='Set1', alpha=0.6, s=15)
plt.title("True labels overlaid on PCA projection")
plt.show()

If clusters align well with true labels, K-Means has discovered the same groups that human analysts would identify. If they don't, the features may need engineering or K may be wrong.
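"Aligning well" can also be quantified rather than eyeballed. The adjusted Rand index (ARI) compares two labellings while ignoring which numeric ID each cluster happens to get. A sketch on hypothetical synthetic data with known labels (three well-separated groups, chosen here purely for the demo):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
# Hypothetical data: three separated groups with known true labels
X = np.vstack([rng.normal(c, 0.3, (40, 6)) for c in (0, 3, 6)])
y_true = np.repeat([0, 1, 2], 40)

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# ARI is 1.0 for a perfect match and near 0 for random assignment;
# it is permutation-invariant, so cluster IDs need not match y_true
ari = adjusted_rand_score(y_true, labels)
```

An ARI near 1 means K-Means recovered the labelled groups; a score near 0 suggests the features or K need rethinking, exactly as the paragraph above describes.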


Think Deeper

After running K-Means with K=4, you project to 2D with PCA and see that two clusters overlap heavily. What does this mean?

It could mean two things. First, the clusters may genuinely overlap in the two principal components but be well-separated in the full 6D space -- PCA only shows a projection, not the full picture. Second, K may be too large and those two clusters should actually be one. Check the silhouette score for those clusters -- if samples in the overlapping region have low or negative silhouette scores, the split is not justified.
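The silhouette check described above can be sketched as follows. The data here is a hypothetical two-group set deliberately over-clustered with K=4, and the 0.1 cutoff for "suspect" samples is an illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
# Hypothetical data: two real groups, deliberately over-clustered below
X = np.vstack([rng.normal(0, 0.5, (60, 6)),
               rng.normal(4, 0.5, (60, 6))])

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)

overall = silhouette_score(X, labels)        # mean score over all samples
per_sample = silhouette_samples(X, labels)   # one score per sample

# Low or negative scores mark samples sitting between clusters;
# many such samples in the overlap region argue against the split
suspect = np.where(per_sample < 0.1)[0]
```

Because K=4 forces each real group to be split, the samples near each artificial split get silhouette scores near zero, which is the signal that the split is not justified.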
Cybersecurity tie-in: The 2D PCA plot is a powerful tool for SOC briefings. You can show analysts a visual map of network behaviour: "These four clusters are your normal traffic patterns. Anything that appears far from all clusters is worth investigating." It turns abstract ML output into an actionable picture.
