Step 1: Unsupervised Framing

Why labels are unavailable and what clustering finds


The Unsupervised Setting

In supervised learning (Lessons 1.2–2.2), every training sample has a label. In the real world:

| Reality | Why labels are unavailable |
|---|---|
| Volume | Security analysts cannot label every network connection |
| Novelty | New attack types have no historical labels |
| Zero-days | Never-before-seen exploits look nothing like past attacks |

Unsupervised learning finds structure without labels. You do not need to know what an attack looks like — you just need to know when something looks different from normal.

Why Normal Behaviour Clusters

Corporate network traffic is repetitive and predictable. The same devices talk to the same servers using the same protocols day after day. This repetition creates dense clusters:

| Cluster | Behaviour | Characteristics |
|---|---|---|
| A: Web browsing | Many connections, port 443 | Moderate bytes, short duration |
| B: File server | Internal IPs, large transfers | High bytes, long duration |
| C: DNS | Small packets, UDP, port 53 | Low bytes, very frequent |
| D: IoT heartbeats | Fixed schedule, tiny payloads | Predictable intervals |

Attacks break these patterns: unusual ports, abnormal volumes, unexpected timing. They appear far from any dense cluster.

Supervised vs Unsupervised

| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels required | Yes — benign/attack for every sample | No — learns from structure alone |
| Catches known attacks | Excellent (trained on examples) | Good (if they deviate from normal) |
| Catches novel attacks | Poor (never seen before) | Good (any deviation is flagged) |
| False positives | Lower | Higher (any anomaly is flagged) |

The Clustering Approach
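
The snippet below clusters a feature matrix X. As a hypothetical stand-in (not real traffic data), you could simulate the four behaviour types from the table above with synthetic [bytes_sent, duration] features — the cluster centres and spreads here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic features: [bytes_sent, duration_s] per connection.
web   = rng.normal([50_000, 5],    [10_000, 2],   size=(500, 2))   # A: web browsing
files = rng.normal([800_000, 120], [100_000, 30], size=(200, 2))   # B: file server
dns   = rng.normal([300, 0.05],    [100, 0.02],   size=(1000, 2))  # C: DNS
iot   = rng.normal([150, 1],       [20, 0.2],     size=(300, 2))   # D: IoT heartbeats

X = np.vstack([web, files, dns, iot])  # 2,000 connections, 2 features each
```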

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Scale features (K-Means uses Euclidean distance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Cluster into K groups
#    n_clusters=4 — we picked 4 to match the four behaviour types in the table above.
#                   In real projects you'd use the elbow method or silhouette score to choose K.
#    n_init=10    — run k-means 10 times with different starting points and keep the best result
#                   (k-means is sensitive to where it starts, so multiple restarts help).
#    random_state=42 — reproducible runs (see Lesson 1.2).
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X_scaled)

# 3. Each sample now has a cluster label
print(km.labels_)  # [0, 2, 1, 0, 3, ...]

Note: you must scale before clustering. If bytes_sent ranges 0–1,000,000 and duration ranges 0–300, the bytes feature dominates the distance calculation.
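
A minimal sketch of that effect, with invented byte/duration values: on the raw features, a byte difference dwarfs a duration difference in the Euclidean distance; after StandardScaler, both contribute comparably.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [1_000_000, 10.0],   # large transfer, short duration
    [500,       10.0],   # tiny transfer, same duration
    [500,       300.0],  # tiny transfer, long duration
])

# Raw distances: the bytes axis dominates completely.
d_raw_bytes = np.linalg.norm(X[0] - X[1])  # huge: driven by bytes
d_raw_dur   = np.linalg.norm(X[1] - X[2])  # small: driven by duration

# After scaling, the two differences are nearly the same size.
Xs = StandardScaler().fit_transform(X)
ds_bytes = np.linalg.norm(Xs[0] - Xs[1])
ds_dur   = np.linalg.norm(Xs[1] - Xs[2])
```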


Think Deeper

Your network has 2 million connection logs from the past month but zero labelled attacks. Can you still build a detection system? How?

Yes -- use unsupervised learning. Cluster the 2 million connections by behaviour (bytes, duration, port, rate). Normal traffic forms dense, predictable clusters. Any new connection that falls far from all cluster centres is flagged as anomalous. You never needed a single label. This is exactly how baseline anomaly detection works in production SOCs -- learn what 'normal' looks like, then alert on deviations.
Cybersecurity tie-in: Most SOC environments have millions of logs but almost no labels. Unsupervised anomaly detection lets you build a baseline of normal from unlabelled data. When a zero-day exploit generates traffic that doesn't fit any known pattern, clustering flags it — no prior knowledge of the attack required.
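
A minimal sketch of that baseline-and-deviate idea, using distance to the nearest K-Means centroid as the anomaly score — the synthetic traffic and the 99th-percentile threshold are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical baseline: 2,000 normal connections in two dense behaviour clusters
# of [bytes_sent, duration_s] features.
normal = np.vstack([
    rng.normal([50_000, 5], [5_000, 1],  size=(1000, 2)),  # web-like traffic
    rng.normal([300, 0.05], [50, 0.01],  size=(1000, 2)),  # DNS-like traffic
])

scaler = StandardScaler().fit(normal)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaler.transform(normal))

# Anomaly score = distance to the nearest cluster centre.
train_dist = km.transform(scaler.transform(normal)).min(axis=1)
threshold = np.quantile(train_dist, 0.99)  # flag the most distant 1% of baseline

def is_anomalous(conn):
    d = km.transform(scaler.transform([conn])).min(axis=1)[0]
    return d > threshold

print(is_anomalous([50_000, 5]))       # typical web connection: near a centre
print(is_anomalous([5_000_000, 600]))  # huge, long transfer: far from both centres
```

No labels were used anywhere: the threshold comes purely from how far normal traffic sits from its own cluster centres.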
