Step 1: Unsupervised Framing

Why labels are unavailable and what clustering finds


The Unsupervised Setting

In supervised learning (Lessons 1.2–2.2), every training sample has a label. In the real world:

| Reality | Why labels are unavailable |
|---|---|
| Volume | Security analysts cannot label every network connection |
| Novelty | New attack types have no historical labels |
| Zero-days | Never-before-seen exploits look nothing like past attacks |

Unsupervised learning finds structure without labels. You do not need to know what an attack looks like — you just need to know when something looks different from normal.

Why Normal Behaviour Clusters

Corporate network traffic is repetitive and predictable. The same devices talk to the same servers using the same protocols day after day. This repetition creates dense clusters:

| Cluster | Behaviour | Characteristics |
|---|---|---|
| A: Web browsing | Many connections, port 443 | Moderate bytes, short duration |
| B: File server | Internal IPs, large transfers | High bytes, long duration |
| C: DNS | Small packets, UDP, port 53 | Low bytes, very frequent |
| D: IoT heartbeats | Fixed schedule, tiny payloads | Predictable intervals |

Attacks break these patterns: unusual ports, abnormal volumes, unexpected timing. They appear far from any dense cluster.

Supervised vs Unsupervised

| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels required | Yes — benign/attack for every sample | No — learns from structure alone |
| Catches known attacks | Excellent (trained on examples) | Good (if they deviate from normal) |
| Catches novel attacks | Poor (never seen before) | Good (any deviation is flagged) |
| False positives | Lower | Higher (any anomaly is flagged) |

The Clustering Approach
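
The snippet below clusters a feature matrix X. As a hypothetical stand-in (not real traffic data), you could simulate the four behaviour types from the table above with synthetic [bytes_sent, duration] features — the cluster centres and spreads here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic features: [bytes_sent, duration_s] per connection.
web   = rng.normal([50_000, 5],    [10_000, 2],   size=(500, 2))   # A: web browsing
files = rng.normal([800_000, 120], [100_000, 30], size=(200, 2))   # B: file server
dns   = rng.normal([300, 0.05],    [100, 0.02],   size=(1000, 2))  # C: DNS
iot   = rng.normal([150, 1],       [20, 0.2],     size=(300, 2))   # D: IoT heartbeats

X = np.vstack([web, files, dns, iot])  # 2,000 connections, 2 features each
```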

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Scale features (K-Means uses Euclidean distance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Cluster into K groups
#    n_clusters=4 — we picked 4 to match the four behaviour types in the table above.
#                   In real projects you'd use the elbow method or silhouette score to choose K.
#    n_init=10    — run k-means 10 times with different starting points and keep the best result
#                   (k-means is sensitive to where it starts, so multiple restarts help).
#    random_state=42 — reproducible runs (see Lesson 1.2).
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

# 3. Each sample now has a cluster label
print(km.labels_)  # [0, 2, 1, 0, 3, ...]

Note: you must scale before clustering. If bytes_sent ranges 0–1,000,000 and duration ranges 0–300, the bytes feature dominates the distance calculation.
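
A minimal sketch of that effect, with invented byte/duration values: on the raw features, a byte difference dwarfs a duration difference in the Euclidean distance; after StandardScaler, both contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [1_000_000, 10.0],   # large transfer, short duration
    [500,       10.0],   # tiny transfer, same duration
    [500,       300.0],  # tiny transfer, long duration
])

# Raw distances: the bytes axis dominates completely.
d_raw_bytes = np.linalg.norm(X[0] - X[1])  # huge: driven by bytes
d_raw_dur   = np.linalg.norm(X[1] - X[2])  # small: driven by duration

# After scaling, the two differences are nearly the same size.
Xs = StandardScaler().fit_transform(X)
ds_bytes = np.linalg.norm(Xs[0] - Xs[1])
ds_dur   = np.linalg.norm(Xs[1] - Xs[2])
```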


Think Deeper

Your network has 2 million connection logs from the past month but zero labelled attacks. Can you still build a detection system? How?

Yes -- use unsupervised learning. Cluster the 2 million connections by behaviour (bytes, duration, port, rate). Normal traffic forms dense, predictable clusters. Any new connection that falls far from all cluster centres is flagged as anomalous. You never needed a single label. This is exactly how baseline anomaly detection works in production SOCs -- learn what 'normal' looks like, then alert on deviations.
Cybersecurity tie-in: Most SOC environments have millions of logs but almost no labels. Unsupervised anomaly detection lets you build a baseline of normal from unlabelled data. When a zero-day exploit generates traffic that doesn't fit any known pattern, clustering flags it — no prior knowledge of the attack required.
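
A minimal sketch of that baseline-and-deviate idea, using distance to the nearest K-Means centroid as the anomaly score — the synthetic traffic and the 99th-percentile threshold are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical baseline: 2,000 normal connections in two dense behaviour clusters
# of [bytes_sent, duration_s] features.
normal = np.vstack([
    rng.normal([50_000, 5], [5_000, 1],  size=(1000, 2)),  # web-like traffic
    rng.normal([300, 0.05], [50, 0.01],  size=(1000, 2)),  # DNS-like traffic
])

scaler = StandardScaler().fit(normal)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaler.transform(normal))

# Anomaly score = distance to the nearest cluster centre.
train_dist = km.transform(scaler.transform(normal)).min(axis=1)
threshold = np.quantile(train_dist, 0.99)  # flag the most distant 1% of baseline

def is_anomalous(conn):
    d = km.transform(scaler.transform([conn])).min(axis=1)[0]
    return d > threshold

print(is_anomalous([50_000, 5]))       # typical web connection: near a centre
print(is_anomalous([5_000_000, 600]))  # huge, long transfer: far from both centres
```

No labels were used anywhere: the threshold comes purely from how far normal traffic sits from its own cluster centres.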
