Encode the categories
The protocol column has three values: TCP, UDP, ICMP. A model needs numbers.
Toggle between two encoding methods and see why the choice matters.
The false ordering problem
0
ICMP
← distance = 1 →
1
TCP
← distance = 1 →
2
UDP
Loading...
Loading...
Loading...
Think Deeper
Try this:
If you label-encode protocols as ICMP=0, TCP=1, UDP=2 — what false relationship does a linear model learn?
It learns that TCP is 'between' ICMP and UDP, and that UDP is 'twice as much as' TCP. A linear model multiplies the feature by a weight — so
weight × 2 (UDP) is always double weight × 1 (TCP). This is meaningless for nominal categories.
Cybersecurity tie-in: In network security, protocol type is nominal — TCP is not
"between" ICMP and UDP. Using LabelEncoder on nominal features tricks linear models into learning
nonsensical relationships. Always use one-hot encoding for categorical security fields.