The full pipeline
You've built a complete feature engineering pipeline. Here's the journey from raw log to trained model.
Raw Log (11 columns, 6 are strings)
  → Parse (strip suffixes, parse timestamps)
  → Derive (bytes/sec, ratios, port risk scores)
  → Encode (OneHot protocol, drop label column)
  → Scale (StandardScaler, fit on train only)
  → Model (12 features, all numeric)
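As a rough sketch, the derive/encode/scale/model stages map onto an sklearn Pipeline like this. Column names such as `proto`, `bytes_sent`, and `duration_s` are illustrative placeholders, not the lesson's exact schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy flow log with placeholder column names.
df = pd.DataFrame({
    "bytes_sent": [1200, 45000, 300, 980000, 2500, 700, 61000, 880],
    "duration_s": [2.0, 30.0, 1.0, 120.0, 5.0, 1.5, 40.0, 2.2],
    "proto":      ["tcp", "udp", "tcp", "tcp", "udp", "tcp", "tcp", "udp"],
    "label":      [0, 0, 0, 1, 0, 0, 1, 0],
})

# Derive: ratio feature computed before modeling.
df["bytes_per_second"] = df["bytes_sent"] / df["duration_s"]

X = df.drop(columns="label")          # drop the label column from the features
y = df["label"]

# Encode categoricals and scale numerics in one transformer.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["proto"]),
    ("scale", StandardScaler(), ["bytes_sent", "duration_s", "bytes_per_second"]),
])

# Wrapping preprocessing in a Pipeline guarantees the encoder and scaler
# are fit on the training split only, then merely applied to the test split.
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
model.fit(X_train, y_train)
print(model.predict(X_test))
```

Putting every transform inside the Pipeline, rather than transforming the full DataFrame up front, is what prevents test-set statistics from leaking into the scaler.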
| | Accuracy | F1 | Features |
|---|---|---|---|
| Before engineering | 95.0% | 0.0% | 4 raw numeric columns only |
| After engineering | 95.0% | 0.0% | 12 engineered features |
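The 95.0% accuracy with 0.0% F1 before engineering is the classic accuracy paradox on imbalanced data: a model that always predicts "benign" looks accurate while catching nothing. A minimal demonstration with synthetic labels (the 5% attack rate is an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 5% of flows are malicious (label 1); features are irrelevant to the dummy.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

# Always predict the majority class ("benign").
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))   # 0.95 -- looks great
print(f1_score(y, pred))         # 0.0  -- catches zero attacks
```

This is why the lesson tracks F1 alongside accuracy: on a 95/5 class split, accuracy alone cannot distinguish a useful detector from a do-nothing one.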
What you built
| Step | What you learned | Key takeaway |
|---|---|---|
| 0-1 | Raw logs fail | sklearn needs all-numeric input |
| 2 | Parse & derive | bytes_per_second captures transfer speed |
| 3 | Domain knowledge | Port risk scores encode security expertise |
| 4 | Encode categories | OneHot avoids false ordering |
| 5 | Scaling | StandardScaler handles outliers better than MinMaxScaler |
| 6 | Feature impact | Engineered features boost model quality |
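The "derive" and "domain knowledge" rows can be sketched in pandas. The port-to-risk mapping below is an illustrative assumption, not an authoritative scoring:

```python
import pandas as pd

# Illustrative port risk scores (assumed values, not a standard list):
# telnet (23) and RDP (3389) carry more attack surface than HTTPS (443).
PORT_RISK = {22: 0.7, 23: 0.9, 443: 0.1, 3389: 0.8}

df = pd.DataFrame({
    "dst_port": [443, 23, 3389],
    "bytes_sent": [5000, 120000, 800],
    "duration_s": [10.0, 4.0, 2.0],
    "timestamp": pd.to_datetime(
        ["2024-03-01 14:00", "2024-03-01 02:30", "2024-03-01 09:15"]
    ),
})

df["bytes_per_second"] = df["bytes_sent"] / df["duration_s"]   # transfer speed
df["port_risk"] = df["dst_port"].map(PORT_RISK).fillna(0.5)    # unknown port -> neutral
df["is_business_hours"] = df["timestamp"].dt.hour.between(9, 17).astype(int)

print(df[["bytes_per_second", "port_risk", "is_business_hours"]])
```

Each derived column turns a string or raw count the model cannot use into a numeric signal it can: speed of transfer, exposure of the destination port, and whether the activity happened at an anomalous hour.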
Think Deeper
Try this:
A SOC analyst says 'just dump the logs into the model.' What would you tell them?
Models can't read IP addresses, protocol names, or timestamps. You need to engineer features that encode security knowledge: bytes_per_second for exfil speed, port_risk for attack surface, is_business_hours for anomaly timing. The transformation is where the real intelligence lives.
Cybersecurity tie-in: Every SOC ML project starts with raw logs and ends with a feature matrix.
The pipeline you built — parse, derive, encode, scale — is the same one used in production malware detectors,
intrusion detection systems, and UEBA platforms. Feature engineering is where security expertise becomes ML signal.