The Unsupervised Setting
In supervised learning (Lessons 1.2–2.2), every training sample has a label. In the real world:
| Obstacle | Why labels are unavailable |
|---|---|
| Volume | Security analysts cannot label every network connection |
| Novelty | New attack types have no historical labels |
| Zero-days | Never-before-seen exploits look nothing like past attacks |
Unsupervised learning finds structure without labels. You do not need to know what an attack looks like — you just need to know when something looks different from normal.
Why Normal Behaviour Clusters
Corporate network traffic is repetitive and predictable. The same devices talk to the same servers using the same protocols day after day. This repetition creates dense clusters:
| Cluster | Behaviour | Characteristics |
|---|---|---|
| A: Web browsing | Many connections, port 443 | Moderate bytes, short duration |
| B: File server | Internal IPs, large transfers | High bytes, long duration |
| C: DNS | Small packets, UDP, port 53 | Low bytes, very frequent |
| D: IoT heartbeats | Fixed schedule, tiny payloads | Predictable intervals |
Attacks break these patterns: unusual ports, abnormal volumes, unexpected timing. They appear far from any dense cluster.
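This "far from any dense cluster" intuition can be made concrete with synthetic data. The sketch below (hypothetical cluster centres and attack point, standing in for scaled traffic features) builds four dense blobs and measures how far an off-pattern point sits from its nearest cluster centre:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Four dense "normal" behaviour clusters in a 2-D feature space
# (think scaled bytes vs. duration); centres are illustrative.
X_normal, y = make_blobs(
    n_samples=400,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.5,
    random_state=42,
)

# A point that belongs to no cluster -- our stand-in for an attack.
attack = np.array([[12.0, 12.0]])

# Distance from each sample to its nearest cluster centre.
centres = np.array([X_normal[y == k].mean(axis=0) for k in range(4)])
d_normal = np.linalg.norm(X_normal[:, None] - centres, axis=2).min(axis=1)
d_attack = np.linalg.norm(attack - centres, axis=1).min()

print(f"typical normal distance: {d_normal.mean():.2f}")
print(f"attack distance:         {d_attack:.2f}")
```

Normal samples sit well under one unit from their centre, while the attack point is several times farther away, which is exactly the separation a detector exploits.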
Supervised vs Unsupervised
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels required | Yes — benign/attack for every sample | No — learns from structure alone |
| Catches known attacks | Excellent (trained on examples) | Good (if they deviate from normal) |
| Catches novel attacks | Poor (never seen before) | Good (any deviation is flagged) |
| False positives | Lower | Higher (benign deviations are also flagged) |
The Clustering Approach
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Scale features (K-Means uses Euclidean distance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Cluster into K groups
# n_clusters=4 — we picked 4 to match the four behaviour types in the table above.
#   In real projects you'd use the elbow method or silhouette score to choose K.
# n_init=10 — run k-means 10 times with different starting points and keep the
#   best result (k-means is sensitive to where it starts, so restarts help).
# random_state=42 — reproducible runs (see Lesson 1.2).
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

# 3. Each sample now has a cluster label
print(km.labels_)  # [0, 2, 1, 0, 3, ...]
```
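The comment above says to choose K with the elbow method or silhouette score. Here is a minimal silhouette-based sketch, using synthetic blobs (with hypothetical, well-separated centres) in place of real scaled traffic features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: four well-separated clusters.
X_scaled, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.6,
    random_state=42,
)

# Fit K-Means for each candidate K and record the silhouette score
# (mean of how well each sample fits its own cluster vs. the next best).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the K with the highest silhouette score
```

Silhouette scores range from -1 to 1; higher means tighter, better-separated clusters. On real traffic data the peak is rarely this clean, so the score is a guide, not an oracle.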
Note: you must scale before clustering. If bytes_sent ranges 0–1,000,000 and duration ranges 0–300, the bytes feature dominates the distance calculation.
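Once the model is fitted, the distance from a sample to its nearest centroid makes a natural anomaly score. A minimal end-to-end sketch, using synthetic data in place of real logs and a hypothetical top-1% flagging rule:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for connection features; in practice X comes from logs.
X, _ = make_blobs(
    n_samples=500,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.6,
    random_state=42,
)

# Scale, then cluster (same pipeline as above).
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_scaled)

# km.transform gives each sample's distance to every centroid;
# the minimum is the distance to its own cluster = anomaly score.
dist = km.transform(X_scaled).min(axis=1)

# Hypothetical rule: flag the top 1% most distant samples for review.
threshold = np.quantile(dist, 0.99)
flags = dist > threshold
print(f"flagged {flags.sum()} of {len(X)} connections")
```

The threshold is a policy choice, not a property of the model: lower it to catch more anomalies at the cost of more false positives, which is the trade-off the table above describes.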
Think Deeper
Your network has 2 million connection logs from the past month but zero labelled attacks. Can you still build a detection system? How?