# K-Means Algorithm
K-Means iteratively assigns samples to clusters and updates cluster centres:
| Step | Action | Result |
|---|---|---|
| 1. Initialise | Place K centroids randomly | K starting positions |
| 2. Assign | Each sample goes to the nearest centroid | K groups of samples |
| 3. Update | Move each centroid to the mean of its group | Centroids shift |
| 4. Repeat | Go to step 2 until assignments stop changing | Stable clusters |
The result: K clusters, each with a centroid. A sample's distance from its centroid indicates how "typical" it is of that cluster; a large distance suggests an anomaly.
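The loop in the table above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm, not the scikit-learn implementation (it omits, for example, handling of clusters that lose all their members):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialise: pick K random samples as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign: each sample joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its group
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```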
## Training K-Means
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

labels = km.labels_            # cluster assignment per point
centers = km.cluster_centers_  # centroid coordinates (K x n_features)
inertia = km.inertia_          # sum of squared distances to centroids
```
| Attribute | What it gives you |
|---|---|
| `labels_` | Array of cluster IDs (0 to K-1), one per sample |
| `cluster_centers_` | Centroid coordinates, the "average" of each cluster |
| `inertia_` | Total within-cluster sum of squares; lower means tighter clusters |
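Following the idea that distance from the centroid signals how typical a sample is, per-sample distances can be pulled out with `transform`. This is a self-contained sketch on synthetic data; in the tutorial, the `km` and `X_scaled` from the training step would be used directly, and the 99th-percentile cutoff is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix
rng = np.random.default_rng(42)
X_scaled = rng.normal(size=(500, 6))
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_scaled)

# transform() gives the distance from every sample to every centroid
dist_to_centroids = km.transform(X_scaled)   # shape (n_samples, K)

# Each sample's distance to its *own* assigned centroid
own_dist = dist_to_centroids[np.arange(len(X_scaled)), km.labels_]

# Flag the most distant 1% as candidate anomalies
# (the 99th-percentile threshold is an arbitrary choice for illustration)
threshold = np.percentile(own_dist, 99)
anomalies = np.where(own_dist > threshold)[0]
print(len(anomalies))  # 5 of 500 samples
```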
## PCA for Visualisation
The network features are six-dimensional, so they cannot be plotted directly. PCA (Principal Component Analysis) projects them to 2D by finding the two directions of maximum variance. The projection is only for visualisation; K-Means still runs on all six features.
```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=km.labels_, cmap='viridis',
                      alpha=0.6, s=15)
plt.colorbar(scatter, label="Cluster")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("K-Means Clusters (PCA projection)")
plt.show()
```
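Since PCA keeps only two of the six directions, it is worth checking how much variance the 2-D picture actually retains before trusting it. A self-contained sketch on synthetic data; in the tutorial, the fitted `pca` from above would be inspected directly:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled 6-D feature matrix
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(500, 6))
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
# If the two components together retain little of the variance, clusters
# that overlap in the 2-D plot may still be separated in full 6-D space
print(pca.explained_variance_ratio_.sum())
```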
## Revealing True Labels
After clustering, overlay the actual traffic labels (if available) to check alignment:
```python
# Colour by true label instead of cluster
plt.scatter(X_2d[:, 0], X_2d[:, 1],
            c=y_true, cmap='Set1', alpha=0.6, s=15)
plt.title("True labels overlaid on PCA projection")
plt.show()
```
If clusters align well with true labels, K-Means has discovered the same groups that human analysts would identify. If they don't, the features may need engineering or K may be wrong.
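Eyeballing the overlay can also be made quantitative: the adjusted Rand index scores the agreement between cluster assignments and true labels regardless of how the cluster IDs are numbered. A sketch with hypothetical labels standing in for `km.labels_` and `y_true`:

```python
from sklearn.metrics import adjusted_rand_score

# 1.0 = identical groupings, ~0.0 = no better than chance
labels_pred = [0, 0, 1, 1, 2, 2]
labels_true = [1, 1, 0, 0, 2, 2]   # same grouping, different ID numbers
print(adjusted_rand_score(labels_true, labels_pred))  # 1.0
```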
## Think Deeper
After running K-Means with K=4, you project to 2D with PCA and see that two clusters overlap heavily. What does this mean?
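One way to probe that question is the silhouette score, which measures cluster separation in the full feature space rather than the 2-D projection; comparing it across values of K can suggest whether K=4 is the right choice. A sketch on synthetic 6-D data standing in for `X_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in 6-D (stand-in for real features)
rng = np.random.default_rng(42)
X_scaled = np.vstack([rng.normal(c, 0.5, size=(100, 6)) for c in (0, 3, 6)])

# Score each candidate K in the full feature space, not the PCA projection;
# higher silhouette means better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))
```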