Step 5: Encode Categories

Label vs. one-hot encoding


Encode the categories

The protocol column has three values: TCP, UDP, ICMP. A model needs numbers. Toggle between two encoding methods and see why the choice matters.
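As a rough sketch of the two methods, the snippet below encodes a small, made-up sample of the protocol column by hand. The sample values and the variable names are assumptions for illustration; only the three protocol names come from the lesson.

```python
# Hypothetical sample of the protocol column (only the three names come from the lesson).
protocols = ["TCP", "UDP", "ICMP", "TCP"]

# Sorting the unique values gives a stable category order: ICMP, TCP, UDP.
categories = sorted(set(protocols))

# Label encoding: one integer per category.
label_map = {name: code for code, name in enumerate(categories)}
label_encoded = [label_map[p] for p in protocols]

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot_encoded = [[1 if p == name else 0 for name in categories] for p in protocols]

print(label_encoded)    # [1, 2, 0, 1]
print(one_hot_encoded)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Label encoding keeps a single column but gives the categories an order; one-hot encoding adds a column per category and implies no order.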

The false ordering problem

ICMP = 0  ← distance = 1 →  TCP = 1  ← distance = 1 →  UDP = 2
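The distances on the number line can be checked with a few lines of code. This is a minimal sketch, assuming the ICMP=0, TCP=1, UDP=2 mapping shown above; the helper names are made up for illustration. It compares the gap between label codes with the Euclidean distance between one-hot vectors.

```python
import math

# Mapping from the lesson; the one-hot vectors follow the same category order.
label_codes = {"ICMP": 0, "TCP": 1, "UDP": 2}
one_hot = {"ICMP": (1, 0, 0), "TCP": (0, 1, 0), "UDP": (0, 0, 1)}


def label_distance(a, b):
    """Absolute gap between the integer codes of two protocols."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two protocols."""
    return math.dist(one_hot[a], one_hot[b])


pairs = [("ICMP", "TCP"), ("TCP", "UDP"), ("ICMP", "UDP")]
for a, b in pairs:
    print(a, b, label_distance(a, b), round(one_hot_distance(a, b), 3))
```

With label codes, ICMP and UDP end up twice as far apart as the other pairs. With one-hot vectors, every pair is the same distance apart, so no protocol is "closer" to another.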

Think Deeper

If you label-encode protocols as ICMP=0, TCP=1, UDP=2 — what false relationship does a linear model learn?

It learns that TCP is 'between' ICMP and UDP, and that UDP is 'twice as much as' TCP. A linear model multiplies the feature by a weight, so the contribution for UDP (weight × 2) is always double the contribution for TCP (weight × 1), and ICMP (weight × 0) contributes nothing. These relationships are meaningless for nominal categories.
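The arithmetic can be made concrete with a small sketch. The weights below are invented numbers, not values learned from any data; they only show how a single weight ties the three protocols together, while one-hot columns let each protocol have its own weight.

```python
# Label-encoded feature: a single weight is shared by all three protocols.
w = 0.5  # hypothetical weight, chosen only for illustration
label_codes = {"ICMP": 0, "TCP": 1, "UDP": 2}
label_contribution = {name: w * code for name, code in label_codes.items()}
print(label_contribution)  # {'ICMP': 0.0, 'TCP': 0.5, 'UDP': 1.0}

# One-hot features: one independent (hypothetical) weight per protocol.
one_hot_weights = {"ICMP": 0.9, "TCP": -0.2, "UDP": 0.4}


def one_hot_contribution(protocol):
    """Sum of weight * indicator over the three one-hot columns."""
    return sum(
        weight * (1 if protocol == name else 0)
        for name, weight in one_hot_weights.items()
    )


print({name: one_hot_contribution(name) for name in one_hot_weights})
```

Whatever value `w` takes, the label-encoded contribution for UDP stays exactly twice that of TCP. The one-hot weights have no such constraint.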
Cybersecurity tie-in: In network security, protocol type is nominal: TCP is not "between" ICMP and UDP. Integer codes such as those produced by LabelEncoder lead linear models to learn nonsensical relationships from nominal features. When the model is linear, prefer one-hot encoding for nominal security fields such as protocol.
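For reference, here is a minimal sketch of the same comparison with scikit-learn, assuming it is installed. The sample list is made up. Note that scikit-learn documents `LabelEncoder` as a tool for target labels rather than input features, which is one more reason not to apply it to a column like protocol.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample of the protocol column.
protocols = ["TCP", "UDP", "ICMP", "TCP"]

# LabelEncoder sorts the classes, which yields ICMP=0, TCP=1, UDP=2.
label_codes = LabelEncoder().fit_transform(protocols)
print(label_codes.tolist())  # [1, 2, 0, 1]

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform([[p] for p in protocols]).toarray()
print(encoder.categories_[0].tolist())  # ['ICMP', 'TCP', 'UDP']
print(one_hot.tolist())
```

Setting `handle_unknown="ignore"` is a design choice for this sketch: a protocol not seen during fitting is encoded as all zeros instead of raising an error.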
