End-of-lesson Quiz

5 questions · Feature Engineering

1 of 5
A SOC analyst says 'just dump the firewall logs into the model.' Why is this a bad idea?
Models only understand numbers. Strings like 'TCP', '2.34s', or '192.168.1.5' need to be parsed, encoded, or transformed into numeric features. Feature engineering is where the real intelligence lives — it's how you teach the model what 'suspicious' looks like.
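To make this concrete, here is a minimal sketch of parsing one raw log line into numeric features. The log layout and field names are assumptions for illustration, not a real vendor format:

```python
# Hypothetical log line: protocol, source IP, dest port, duration, bytes sent
raw = "TCP 192.168.1.5 443 2.34s 50000"

def to_features(line):
    proto, src_ip, port, duration, bytes_sent = line.split()
    return {
        "is_tcp": 1 if proto == "TCP" else 0,   # encode the protocol string as a flag
        "duration_s": float(duration.rstrip("s")),  # strip the 's' suffix, parse as float
        "dst_port": int(port),
        "bytes_sent": int(bytes_sent),
    }

print(to_features(raw))
# {'is_tcp': 1, 'duration_s': 2.34, 'dst_port': 443, 'bytes_sent': 50000}
```

Every value in the resulting dict is numeric, which is what the model actually consumes.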
2 of 5
In a firewall log dataset where you're predicting action = ALLOW/BLOCK, which column should you never use as a feature?
Using the label as a feature is called data leakage: the model receives the answer as input, scores 100% accuracy in training, then fails completely in production, where the label is unknown. Always remove the target column from your feature set.
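A minimal sketch of the fix, with hypothetical column names: split the rows into features X and target y, and make sure the target never appears among the features.

```python
# Toy firewall rows; "action" is the label we want to predict.
rows = [
    {"bytes_sent": 50000, "duration_s": 0.5, "action": "BLOCK"},
    {"bytes_sent": 1200,  "duration_s": 3.0, "action": "ALLOW"},
]

TARGET = "action"
# Features: every column EXCEPT the target.
X = [{k: v for k, v in row.items() if k != TARGET} for row in rows]
# Target: the label column on its own.
y = [row[TARGET] for row in rows]

assert TARGET not in X[0]  # the answer is no longer part of the input
```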
3 of 5
Two connections both transferred 50,000 bytes. Connection A took 0.5 seconds, Connection B took 50 seconds. The raw bytes_sent is identical — what feature would expose the suspicious one?
Connection A is moving 100,000 B/s — possibly data exfiltration. Connection B is just 1,000 B/s — normal browsing. The raw byte count can't tell them apart, but the derived feature bytes_per_second immediately exposes the difference. This is why feature engineering matters.
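The arithmetic from the question, as a quick sketch of deriving the new feature:

```python
# Two connections with identical byte counts but very different durations.
conns = [
    {"name": "A", "bytes_sent": 50000, "duration_s": 0.5},
    {"name": "B", "bytes_sent": 50000, "duration_s": 50.0},
]

# Derived feature: transfer rate in bytes per second.
for c in conns:
    c["bytes_per_second"] = c["bytes_sent"] / c["duration_s"]

print(conns[0]["bytes_per_second"])  # 100000.0  <- exfiltration-like burst
print(conns[1]["bytes_per_second"])  # 1000.0    <- normal browsing
```

The raw `bytes_sent` column is identical for both rows; only the derived rate separates them.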
4 of 5
You label-encode protocols as ICMP=0, TCP=1, UDP=2. What hidden problem does this create for a linear model?
A linear model multiplies features by weights, so it interprets 2 (UDP) as twice 1 (TCP). For nominal categories there is no real ordering. Use one-hot encoding instead, which gives each category its own independent column.
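One-hot encoding can be sketched in plain Python (libraries like pandas `get_dummies` or scikit-learn's `OneHotEncoder` do the same thing at scale):

```python
protocols = ["ICMP", "TCP", "UDP", "TCP"]
categories = sorted(set(protocols))  # ['ICMP', 'TCP', 'UDP']

# One column per category; exactly one 1 per row, so no category
# is numerically "twice" another.
one_hot = [[1 if p == cat else 0 for cat in categories] for p in protocols]

print(one_hot[1])  # [0, 1, 0]  -> TCP
print(one_hot[2])  # [0, 0, 1]  -> UDP
```

Each protocol now gets its own weight in a linear model, instead of sharing one weight on an arbitrary 0/1/2 scale.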
5 of 5
Your dataset has one extreme outlier: a single connection with bytes_per_second = 10,000,000 while everything else is under 5,000. What happens if you scale with MinMaxScaler?
MinMax maps min→0 and max→1 linearly. One huge outlier crushes the entire normal range to near-zero. StandardScaler (z-scores) handles this much better — the outlier gets a high z-score, but normal data stays spread around 0.
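The squashing effect is easy to see with made-up numbers. This sketch applies the min-max formula by hand to four normal rates plus the one outlier:

```python
# Four normal connections plus one extreme outlier (illustrative values).
rates = [1000.0, 2000.0, 3000.0, 4000.0, 10_000_000.0]

lo, hi = min(rates), max(rates)
minmax = [(r - lo) / (hi - lo) for r in rates]  # min -> 0, max -> 1

# The outlier maps to 1.0 while every normal connection lands below 0.0004,
# so the scaled feature carries almost no information about normal traffic.
print([round(v, 6) for v in minmax])  # [0.0, 0.0001, 0.0002, 0.0003, 1.0]
```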
