Build the transformation plan
Every column needs a strategy. Classify each column as one of three: use directly, needs transformation, or drop.
| Column | Type | Example |
|---|---|---|
| timestamp | string | 2024-01-15 08:00:00 |
| src_ip | string | 192.168.3.42 |
| dst_ip | string | 185.23.44.102 |
| src_port | int | 52481 |
| dst_port | int | 443 |
| protocol | string | TCP |
| bytes_sent | int | 3421 |
| bytes_recv | int | 15230 |
| packets | int | 42 |
| duration_str | string | 2.34s |
| action | string | ALLOW |
The transformation plan
| Column | Type | Action |
|---|---|---|
| timestamp | string | Extract: hour_of_day, is_business_hours |
| src_ip | string | Extract: is_private, subnet |
| dst_ip | string | Extract: is_private, known-bad lookup |
| src_port | int | Use directly (or drop: client source ports are usually ephemeral and carry little signal) |
| dst_port | int | Map to port_risk_score |
| protocol | string | One-hot encode (TCP/UDP/ICMP) |
| bytes_sent | int | Use directly |
| bytes_recv | int | Use directly |
| packets | int | Use directly |
| duration_str | string | Strip 's' suffix → float |
| action | string | Drop from the features: this is the label |
4 columns are ready to use as-is, 6 need transformation, and 1 (action) is dropped from the features because it is the label.
The next steps will show you how to transform each one.
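As a rough preview, the plan in the table could be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the lesson's own implementation: the 09:00-17:00 weekday definition of business hours, the /24 subnet size, the `PORT_RISK` scores, the `KNOWN_BAD_IPS` set, and all function names are hypothetical placeholders.

```python
from datetime import datetime
import ipaddress

# Assumptions (not part of the plan itself): business hours are 09:00-17:00 on
# weekdays, subnets are /24, and the scores and bad-IP set are placeholders.
BUSINESS_START, BUSINESS_END = 9, 17
PORT_RISK = {23: 0.9, 3389: 0.8, 22: 0.6, 80: 0.3, 443: 0.1}
DEFAULT_PORT_RISK = 0.5
KNOWN_BAD_IPS = {"203.0.113.7"}  # documentation-range address, illustrative only
PROTOCOLS = ("TCP", "UDP", "ICMP")


def timestamp_features(ts):
    """Parse 'YYYY-MM-DD HH:MM:SS' into hour_of_day and is_business_hours."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    in_hours = dt.weekday() < 5 and BUSINESS_START <= dt.hour < BUSINESS_END
    return {"hour_of_day": dt.hour, "is_business_hours": int(in_hours)}


def is_private(ip):
    """Return 1 if the address is in a private/reserved range, else 0."""
    return int(ipaddress.ip_address(ip).is_private)


def subnet_of(ip):
    """Return the enclosing /24 network as a string (IPv4 assumed)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def parse_duration(value):
    """Strip a trailing 's' and convert to float seconds."""
    value = value.strip()
    if value.endswith("s"):
        value = value[:-1]
    return float(value)


def transform_row(row):
    """Apply the plan to one raw log row and return a feature dict."""
    features = {}
    features.update(timestamp_features(row["timestamp"]))
    features["src_is_private"] = is_private(row["src_ip"])
    features["src_subnet"] = subnet_of(row["src_ip"])
    features["dst_is_private"] = is_private(row["dst_ip"])
    features["dst_known_bad"] = int(row["dst_ip"] in KNOWN_BAD_IPS)
    features["src_port"] = row["src_port"]
    features["port_risk_score"] = PORT_RISK.get(row["dst_port"], DEFAULT_PORT_RISK)
    for proto in PROTOCOLS:
        features[f"protocol_{proto}"] = int(row["protocol"].upper() == proto)
    for col in ("bytes_sent", "bytes_recv", "packets"):
        features[col] = row[col]
    features["duration_s"] = parse_duration(row["duration_str"])
    return features  # 'action' is deliberately left out: it is the label


# The example values shown for each column, assembled into one row.
sample_row = {
    "timestamp": "2024-01-15 08:00:00", "src_ip": "192.168.3.42",
    "dst_ip": "185.23.44.102", "src_port": 52481, "dst_port": 443,
    "protocol": "TCP", "bytes_sent": 3421, "bytes_recv": 15230,
    "packets": 42, "duration_str": "2.34s", "action": "ALLOW",
}
features = transform_row(sample_row)
```

Note that `src_subnet` is still a string after this step, so it would need a further encoding before most models could consume it.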
Think Deeper
Try this:
Which column should you NEVER use as a feature? Why?
The action column (ALLOW/BLOCK). That's the label — what you're trying to predict. Using it as a feature is called data leakage: the model gets the answer as input. In production, you wouldn't know the action before the model decides.
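One way to guard against this kind of leakage is to separate the label from the features before any other processing. The sketch below assumes rows are plain dictionaries and that BLOCK is treated as the positive class. The function name, that encoding choice, and the second sample row are illustrative rather than taken from the lesson.

```python
def split_features_and_label(rows, label_col="action"):
    """Return (feature_rows, labels) with the label column removed from features."""
    feature_rows = [
        {key: value for key, value in row.items() if key != label_col}
        for row in rows
    ]
    # Assumption: BLOCK is the positive class (1); anything else maps to 0.
    labels = [1 if row[label_col] == "BLOCK" else 0 for row in rows]
    return feature_rows, labels


# Two tiny illustrative rows; the second one is made up for the example.
rows = [
    {"dst_port": 443, "packets": 42, "action": "ALLOW"},
    {"dst_port": 23, "packets": 7, "action": "BLOCK"},
]
X, y = split_features_and_label(rows)
```

After the split, the model only ever sees `X`, and `y` is used solely as the training target.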
Cybersecurity tie-in: A transformation plan is like an incident response playbook — you decide
before the data arrives how each field will be handled. In production ML pipelines, this plan
becomes a preprocessing module that runs on every new batch of logs.
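To make the playbook idea concrete, here is a minimal sketch of a plan-driven preprocessing step. It assumes the plan is declared up front as a dictionary and that a batch is a list of row dictionaries. The `PLAN` layout, the `USE`/`DROP` markers, and `preprocess_batch` are hypothetical names, and the plan is trimmed to five columns for brevity.

```python
USE, DROP = "use", "drop"


def strip_seconds(value):
    """Convert a string such as '2.34s' to the float 2.34."""
    value = value.strip()
    return float(value[:-1] if value.endswith("s") else value)


# Trimmed plan for illustration: each column maps to USE, DROP, or a
# (new_name, function) pair describing its transformation.
PLAN = {
    "bytes_sent": USE,
    "bytes_recv": USE,
    "packets": USE,
    "duration_str": ("duration_s", strip_seconds),
    "action": DROP,
}


def preprocess_batch(rows, plan=PLAN):
    """Apply the plan to every row; reject rows the plan does not cover."""
    processed = []
    for index, row in enumerate(rows):
        unplanned = set(row) - set(plan)
        missing = set(plan) - set(row)
        if unplanned or missing:
            raise ValueError(
                f"row {index}: unplanned={sorted(unplanned)} missing={sorted(missing)}"
            )
        features = {}
        for column, strategy in plan.items():
            if strategy == DROP:
                continue
            if strategy == USE:
                features[column] = row[column]
            else:
                new_name, transform = strategy
                features[new_name] = transform(row[column])
        processed.append(features)
    return processed


batch = [
    {"bytes_sent": 3421, "bytes_recv": 15230, "packets": 42,
     "duration_str": "2.34s", "action": "ALLOW"},
]
processed = preprocess_batch(batch)
```

Raising an error on an unplanned or missing column is a design choice: a new field in the logs forces an explicit decision in the plan instead of slipping silently into the features.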