Step 5: Encode Categories

Label vs. one-hot encoding

Encode the categories

The protocol column has three values: TCP, UDP, and ICMP. A model needs numbers, not strings. Compare two encoding methods, label encoding and one-hot encoding, and see why the choice matters.
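As a rough illustration of the two methods, here is a minimal plain-Python sketch. The sample `protocols` list is made up for the example, and real pipelines would more likely use a library encoder; the point is only to show what each encoding produces.

```python
# Hypothetical sample of the protocol column (illustrative values only).
protocols = ["TCP", "UDP", "ICMP", "TCP"]

# Sort the distinct values so the label mapping is ICMP=0, TCP=1, UDP=2.
categories = sorted(set(protocols))

# Label encoding: one integer per category, stored in a single column.
label_map = {name: index for index, name in enumerate(categories)}
label_encoded = [label_map[p] for p in protocols]

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot_encoded = [[1 if p == c else 0 for c in categories] for p in protocols]

print("categories:", categories)
print("label:     ", label_encoded)
for protocol, row in zip(protocols, one_hot_encoded):
    print(f"one-hot {protocol:>4}: {row}")
```

Label encoding keeps a single column but imposes an order on the values; one-hot encoding adds a column per category and imposes no order.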

The false ordering problem

ICMP ← distance = 1 → TCP ← distance = 1 → UDP
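The false ordering can be checked numerically. The sketch below assumes the label mapping ICMP=0, TCP=1, UDP=2 and uses Euclidean distance for the one-hot vectors; the helper names are illustrative, not part of the lesson.

```python
import math
from itertools import combinations

# Label codes and their one-hot counterparts for the three protocols.
label_code = {"ICMP": 0, "TCP": 1, "UDP": 2}
one_hot_code = {"ICMP": (1, 0, 0), "TCP": (0, 1, 0), "UDP": (0, 0, 1)}


def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label_code[a] - label_code[b])


def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return math.dist(one_hot_code[a], one_hot_code[b])


for a, b in combinations(label_code, 2):
    print(f"{a}-{b}: label={label_distance(a, b)}, "
          f"one-hot={one_hot_distance(a, b):.3f}")
```

Under label encoding, ICMP and UDP end up twice as far apart as either is from TCP. Under one-hot encoding, every pair of protocols is the same distance apart, so no protocol is "between" the others.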

Think Deeper

Try this:

If you label-encode protocols as ICMP=0, TCP=1, UDP=2, what false relationship does a linear model learn?

It learns that TCP is 'between' ICMP and UDP, and that UDP is 'twice as much as' TCP. A linear model multiplies the feature by a weight, so UDP's contribution (weight × 2) is always double TCP's (weight × 1), and ICMP's (weight × 0) is always zero. These relationships are meaningless for nominal categories.
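To make the arithmetic concrete, the following sketch compares what a single linear term contributes under each encoding. The weights are invented for illustration, not learned from data.

```python
# Label codes for the three protocols.
label_code = {"ICMP": 0, "TCP": 1, "UDP": 2}

# Hypothetical weight a linear model might assign to the label-encoded column.
weight = 0.5

# With label encoding, the contribution is weight * code, so the three
# contributions are locked into the ratio 0 : 1 : 2 whatever the weight is.
label_contribution = {name: weight * code for name, code in label_code.items()}

# With one-hot encoding, each category has its own column and its own weight.
one_hot_code = {"ICMP": (1, 0, 0), "TCP": (0, 1, 0), "UDP": (0, 0, 1)}
one_hot_weights = (-0.3, 0.8, 0.1)  # hypothetical, one weight per column

# The contribution is the dot product of the weights with the one-hot vector,
# which simply selects the weight belonging to that category.
one_hot_contribution = {
    name: sum(w * x for w, x in zip(one_hot_weights, vector))
    for name, vector in one_hot_code.items()
}

print("label encoding:  ", label_contribution)
print("one-hot encoding:", one_hot_contribution)
```

With one column, no choice of weight can give TCP a larger effect than UDP while keeping ICMP at zero; with three columns, each protocol's effect can be set independently.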

Cybersecurity tie-in: In network security, protocol type is nominal: TCP is not "between" ICMP and UDP. Label-encoding a nominal feature, for example with LabelEncoder, tricks linear models into learning nonsensical relationships. For nominal security fields like this one, use one-hot encoding instead.
