Step 8: The Full Pipeline

Raw log → ML-ready features

The full pipeline

You've built a complete feature engineering pipeline. Here's the journey from raw log to trained model.

1. Raw Log: 11 columns, 6 of them strings
2. Parse: strip suffixes, parse timestamps
3. Derive: bytes/sec, ratios, port risk scores
4. Encode: one-hot encode protocol, drop label column
5. Scale: StandardScaler, fit on train only
6. Model: 12 features, all numeric
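
The stages above can be sketched end to end with pandas and scikit-learn. This is a minimal illustration, not the lesson's actual code: the column names, the `B` and `s` suffixes, the sample rows, and the helper `parse_and_derive` are all assumptions made for the example. Encoding and scaling are done side by side in a `ColumnTransformer`, which is one common way to arrange those two stages.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw log rows; the real log has 11 columns.
raw = pd.DataFrame({
    "timestamp": ["2024-01-01 09:15:00", "2024-01-01 23:40:00",
                  "2024-01-02 14:05:00", "2024-01-02 03:30:00"],
    "bytes_sent": ["1200B", "980000B", "450B", "2500000B"],
    "duration": ["2.0s", "10.0s", "0.5s", "20.0s"],
    "protocol": ["TCP", "UDP", "TCP", "ICMP"],
    "label": [0, 1, 0, 1],
})

def parse_and_derive(df):
    """Parse string columns into numbers and add one derived feature."""
    out = pd.DataFrame(index=df.index)
    # Parse: strip unit suffixes and convert to float.
    out["bytes_sent"] = df["bytes_sent"].str.rstrip("B").astype(float)
    out["duration"] = df["duration"].str.rstrip("s").astype(float)
    # Parse: turn the timestamp string into an hour-of-day number.
    out["hour"] = pd.to_datetime(df["timestamp"]).dt.hour
    # Derive: transfer speed.
    out["bytes_per_second"] = out["bytes_sent"] / out["duration"]
    # Keep the categorical column for the encoder.
    out["protocol"] = df["protocol"]
    return out

X = parse_and_derive(raw)
y = raw["label"]  # the label is kept out of the feature matrix

numeric_cols = ["bytes_sent", "duration", "hour", "bytes_per_second"]
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["protocol"]),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Fit on the training split only, then reuse the fitted transformer on test.
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```

Calling `fit_transform` on the training split and only `transform` on the test split keeps test-set statistics out of the scaler, which is the point of "fit on train only".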

Before engineering (4 raw numeric columns only)
Accuracy: 95.0%
F1: 0.0%

After engineering (12 engineered features)
Accuracy: 95.0%
F1: 0.0%
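
An accuracy of 95.0% next to an F1 of 0.0% is the pattern a classifier produces when it predicts the majority class for every row. The sketch below reproduces that arithmetic on a hypothetical test set of 95 benign rows and 5 attacks; the class balance of the lesson's dataset is not stated here, so these counts are an assumption.

```python
# Hypothetical test set: 95 benign rows (0) and 5 attack rows (1).
y_true = [0] * 95 + [1] * 5
# A model that never flags an attack.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed

accuracy = (tp + tn) / len(y_true)
# Guard the divisions: with no positive predictions, precision is undefined,
# and it is treated as 0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy={accuracy:.1%}  f1={f1:.1%}")  # accuracy=95.0%  f1=0.0%
```

Accuracy rewards the 95 correct "benign" calls, while F1 only looks at how the attack class is handled, so it is the more telling number on imbalanced security data.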

What you built

Step | What you learned | Key takeaway
0-1 | Raw logs fail | sklearn needs all-numeric input
2 | Parse & derive | bytes_per_second captures transfer speed
3 | Domain knowledge | Port risk scores encode security expertise
4 | Encode categories | OneHot avoids false ordering
5 | Scaling | StandardScaler is less distorted by outliers than min-max scaling
6 | Feature impact | Engineered features boost model quality
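
The "false ordering" point from step 4 can be seen by encoding the same protocol values two ways. This is a small sketch with three example protocol names chosen for illustration; it uses scikit-learn's `OrdinalEncoder` and `OneHotEncoder`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Three example protocol values, one per row.
protocols = np.array([["ICMP"], ["TCP"], ["UDP"]])

# Integer codes follow alphabetical order: ICMP=0, TCP=1, UDP=2.
# A model may read this as "UDP is twice as far from ICMP as TCP is".
ordinal = OrdinalEncoder().fit_transform(protocols)

# One-hot gives each protocol its own 0/1 column, so no protocol is
# "between" two others and all pairs are equally far apart.
onehot = OneHotEncoder().fit_transform(protocols).toarray()

print(ordinal.ravel())
print(onehot)
```

With the integer codes, the gap between ICMP and UDP is 2 while the gap between ICMP and TCP is 1, an ordering that means nothing for protocols. In the one-hot form every pair of rows differs in exactly two positions.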

Think Deeper

A SOC analyst says 'just dump the logs into the model.' What would you tell them?

Models can't consume raw IP addresses, protocol names, or timestamps. You need to engineer features that encode security knowledge: bytes_per_second for exfiltration speed, port_risk for attack surface, is_business_hours for anomaly timing. The transformation is where the real intelligence lives.
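
As a rough illustration of those three features, the sketch below computes them for a single log record using only the standard library. The field names, the risk scores in `PORT_RISK`, the default score, and the 09:00 to 17:00 weekday definition of business hours are all assumptions made for the example, not values from the lesson.

```python
from datetime import datetime

# Hypothetical risk table; the scores are illustrative, not a standard.
PORT_RISK = {22: 0.7, 23: 0.9, 80: 0.2, 443: 0.1, 3389: 0.8}
DEFAULT_PORT_RISK = 0.5  # assumed score for ports not in the table

def engineer(record):
    """Turn one parsed log record (a dict) into three numeric features."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    # Guard against zero-length connections before dividing.
    duration = max(record["duration_s"], 1e-6)
    return {
        "bytes_per_second": record["bytes_sent"] / duration,
        "port_risk": PORT_RISK.get(record["dst_port"], DEFAULT_PORT_RISK),
        # Assumed definition: Monday to Friday, 09:00 up to 17:00.
        "is_business_hours": int(ts.weekday() < 5 and 9 <= ts.hour < 17),
    }

features = engineer({
    "timestamp": "2024-01-03 02:14:00",  # a Wednesday, shortly after 2 AM
    "bytes_sent": 5_000_000,
    "duration_s": 10.0,
    "dst_port": 3389,
})
print(features)
```

A large transfer to a remote-desktop port in the middle of the night comes out as a high `bytes_per_second`, a high `port_risk`, and `is_business_hours` of 0: three numbers a model can use where the raw strings gave it nothing.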

Cybersecurity tie-in: SOC ML projects start with raw logs and end with a feature matrix. The pipeline you built (parse, derive, encode, scale) follows the same pattern used in production malware detectors, intrusion detection systems, and UEBA platforms. Feature engineering is where security expertise becomes ML signal.
