A firewall just exported this log
200 connections from the last few hours. Each row is one network connection. Can you spot which columns a machine learning model can use?
| timestamp | src_ip | dst_ip | src_port | dst_port | protocol | bytes_sent | bytes_recv | packets | duration_str | action |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024-01-15 08:00:00 | 192.168.6.180 | 39.220.169.247 | 63069 | 53 | TCP | 264 | 1011 | 36 | 15.98s | ALLOW |
| 2024-01-15 08:02:00 | 192.168.7.189 | 26.98.49.153 | 51703 | 443 | TCP | 679 | 6160 | 47 | 71.95s | ALLOW |
| 2024-01-15 08:04:00 | 192.168.4.103 | 152.12.59.250 | 63461 | 3389 | TCP | 2764 | 9974 | 43 | 1.13s | ALLOW |
| 2024-01-15 08:06:00 | 192.168.9.211 | 135.56.35.173 | 62386 | 8080 | TCP | 2470 | 478 | 40 | 1.20s | BLOCK |
| 2024-01-15 08:08:00 | 192.168.6.75 | 20.64.7.238 | 64061 | 443 | UDP | 2043 | 2758 | 35 | 6.24s | ALLOW |
| 2024-01-15 08:10:00 | 192.168.7.117 | 144.141.203.115 | 54712 | 80 | TCP | 518 | 4003 | 46 | 30.19s | ALLOW |
| 2024-01-15 08:12:00 | 192.168.3.104 | 215.142.91.98 | 53441 | 443 | TCP | 2703 | 3927 | 48 | 3.64s | ALLOW |
| 2024-01-15 08:14:00 | 192.168.7.131 | 66.31.214.191 | 49998 | 80 | TCP | 3541 | 3828 | 48 | 6.70s | ALLOW |
| 2024-01-15 08:16:00 | 192.168.5.53 | 86.50.152.186 | 60106 | 22 | TCP | 476 | 1630 | 37 | 1.06s | ALLOW |
| 2024-01-15 08:18:00 | 192.168.1.88 | 63.189.124.150 | 62855 | 8080 | TCP | 1043 | 4530 | 41 | 21.61s | ALLOW |
Column types
Numeric — ready for ML
String — needs transformation
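One way to check the two column types by hand is to test whether each raw value parses as a number. The sketch below uses the first row of the table; `is_numeric` is a hypothetical helper, and the code assumes a single row is representative of its column.

```python
def is_numeric(value: str) -> bool:
    """Return True if the raw text parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# First row of the table, keyed by column name (values kept as raw text).
row = {
    "timestamp": "2024-01-15 08:00:00",
    "src_ip": "192.168.6.180",
    "dst_ip": "39.220.169.247",
    "src_port": "63069",
    "dst_port": "53",
    "protocol": "TCP",
    "bytes_sent": "264",
    "bytes_recv": "1011",
    "packets": "36",
    "duration_str": "15.98s",
    "action": "ALLOW",
}

numeric_cols = [name for name, value in row.items() if is_numeric(value)]
string_cols = [name for name in row if name not in numeric_cols]

print("Numeric:", numeric_cols)
print("String:", string_cols)
```

A successful numeric parse is only a first check; whether a numeric column is meaningful as a feature is a separate question.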
Think Deeper
Try this:
Look at the 'duration_str' column. What would happen if you passed '2.34s' to a multiplication operation?
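It depends on the other operand. Multiplying by a float ('2.34s' * 1.5) raises a TypeError, because Python can't multiply a string by a non-integer. Multiplying by an integer is worse: '2.34s' * 2 silently returns the repeated string '2.34s2.34s'. Either way, the 's' suffix makes the value a string, not a float. This is why parsing is the first step: strip the suffix, convert to float, then you can compute bytes_per_second.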
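The parsing step can be sketched in plain Python as follows. `parse_duration` and `bytes_per_second` are hypothetical helper names, and the code assumes every duration is a decimal number with an optional trailing 's', as in the sample rows.

```python
def parse_duration(duration_str: str) -> float:
    """Convert a string like '15.98s' to a float number of seconds."""
    text = duration_str.strip()
    if text.endswith("s"):
        text = text[:-1]  # drop the unit suffix
    return float(text)


def bytes_per_second(bytes_sent: int, duration_str: str) -> float:
    """Compute a transfer rate, guarding against zero-length connections."""
    seconds = parse_duration(duration_str)
    return bytes_sent / seconds if seconds > 0 else 0.0


raw = "2.34s"

# Multiplying the raw string by an int repeats it instead of doing arithmetic.
repeated = raw * 2

# Multiplying the raw string by a float raises a TypeError.
try:
    raw * 1.5
    float_multiply_failed = False
except TypeError:
    float_multiply_failed = True

# After parsing, ordinary arithmetic works.
seconds = parse_duration(raw)
rate = bytes_per_second(264, "15.98s")  # first row of the table

print(repeated, float_multiply_failed, seconds, round(rate, 2))
```

Returning 0.0 for a zero duration is one possible choice for this sketch; a real pipeline might prefer a missing value instead.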
Cybersecurity tie-in: Most SIEMs and firewalls export logs like this.
The raw export has IP addresses, timestamps, and protocol strings, none of which sklearn estimators can consume directly because they expect numeric input.
Feature engineering is the bridge between SOC data and ML models.
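As a rough illustration of that bridge, the sketch below turns one raw row into numeric features using only the standard library. The feature choices (hour of day, private-address flags, one-hot protocol) and the `engineer_features` name are assumptions made for this example, not a prescribed recipe.

```python
from datetime import datetime
import ipaddress

# Assumption: only the protocols that appear in the sample rows.
PROTOCOLS = ("TCP", "UDP")


def engineer_features(row: dict) -> dict:
    """Map raw string fields from one log row to numeric features."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    features = {
        "hour": ts.hour,
        "src_is_private": int(ipaddress.ip_address(row["src_ip"]).is_private),
        "dst_is_private": int(ipaddress.ip_address(row["dst_ip"]).is_private),
        "dst_port": int(row["dst_port"]),
    }
    # One-hot encode the protocol string.
    for proto in PROTOCOLS:
        features[f"protocol_{proto}"] = int(row["protocol"] == proto)
    return features


# First row of the table, with values as raw text.
sample = {
    "timestamp": "2024-01-15 08:00:00",
    "src_ip": "192.168.6.180",
    "dst_ip": "39.220.169.247",
    "dst_port": "53",
    "protocol": "TCP",
}

features = engineer_features(sample)
print(features)
```

Every value in the resulting dictionary is an integer, so a list of such dictionaries can be converted to a numeric array before it is handed to a model.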