Step 1: The Raw Log

What a firewall export looks like

A firewall just exported this log

The export contains 200 connections from the last few hours; ten are shown below. Each row is one network connection. Can you spot which columns a machine learning model can use as-is?

timestamp	src_ip	dst_ip	src_port	dst_port	protocol	bytes_sent	bytes_recv	packets	duration_str	action
2024-01-15 08:00:00	192.168.6.180	39.220.169.247	63069	53	TCP	264	1011	36	15.98s	ALLOW
2024-01-15 08:02:00	192.168.7.189	26.98.49.153	51703	443	TCP	679	6160	47	71.95s	ALLOW
2024-01-15 08:04:00	192.168.4.103	152.12.59.250	63461	3389	TCP	2764	9974	43	1.13s	ALLOW
2024-01-15 08:06:00	192.168.9.211	135.56.35.173	62386	8080	TCP	2470	478	40	1.20s	BLOCK
2024-01-15 08:08:00	192.168.6.75	20.64.7.238	64061	443	UDP	2043	2758	35	6.24s	ALLOW
2024-01-15 08:10:00	192.168.7.117	144.141.203.115	54712	80	TCP	518	4003	46	30.19s	ALLOW
2024-01-15 08:12:00	192.168.3.104	215.142.91.98	53441	443	TCP	2703	3927	48	3.64s	ALLOW
2024-01-15 08:14:00	192.168.7.131	66.31.214.191	49998	80	TCP	3541	3828	48	6.70s	ALLOW
2024-01-15 08:16:00	192.168.5.53	86.50.152.186	60106	22	TCP	476	1630	37	1.06s	ALLOW
2024-01-15 08:18:00	192.168.1.88	63.189.124.150	62855	8080	TCP	1043	4530	41	21.61s	ALLOW
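
One way to check the columns is to test which ones parse as plain numbers. The following is a minimal sketch using two rows copied from the export above; the helper `is_number` and the variable names are illustrative, not part of any firewall tooling.

```python
import csv
import io

# Header plus two rows copied from the export above (tab-separated).
RAW = (
    "timestamp\tsrc_ip\tdst_ip\tsrc_port\tdst_port\tprotocol\t"
    "bytes_sent\tbytes_recv\tpackets\tduration_str\taction\n"
    "2024-01-15 08:00:00\t192.168.6.180\t39.220.169.247\t63069\t53\tTCP\t"
    "264\t1011\t36\t15.98s\tALLOW\n"
    "2024-01-15 08:06:00\t192.168.9.211\t135.56.35.173\t62386\t8080\tTCP\t"
    "2470\t478\t40\t1.20s\tBLOCK\n"
)


def is_number(text):
    """Return True if the text parses as a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))

# A column counts as numeric only if every value in it parses as a number.
numeric_columns = [
    name for name in rows[0] if all(is_number(row[name]) for row in rows)
]
text_columns = [name for name in rows[0] if name not in numeric_columns]

print("numeric:", numeric_columns)
print("needs parsing or encoding:", text_columns)
```

Parsing as a number is only a first filter: ports pass the check, but they behave more like labels than quantities, so they may still need thought before modeling.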

Think Deeper

Try this:

Look at the 'duration_str' column. What would happen if you passed '2.34s' to a multiplication operation?

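It depends on the other operand. Multiplying by a float raises a TypeError, because Python can't multiply a string by a non-integer. Multiplying by an integer raises no error at all and silently repeats the string: '2.34s' * 2 gives '2.34s2.34s'. Either way, the 's' suffix makes the value a string, not a float. This is why parsing is the first step: strip the suffix, convert to float, and then you can compute features such as bytes_per_second.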
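
The parsing step can be sketched as follows. The function names are hypothetical, and defining bytes_per_second as total bytes (sent plus received) divided by duration is an assumption; the log itself does not define it.

```python
def parse_duration(duration_str):
    """Convert a value like '15.98s' to seconds as a float."""
    text = duration_str.strip()
    if not text.endswith("s"):
        raise ValueError(f"expected a trailing 's': {duration_str!r}")
    return float(text[:-1])


def bytes_per_second(bytes_sent, bytes_recv, duration_str):
    """Assumed definition: (sent + received) / duration in seconds."""
    seconds = parse_duration(duration_str)
    if seconds <= 0:
        return 0.0  # avoid dividing by zero on instantaneous connections
    return (bytes_sent + bytes_recv) / seconds


# What happens without parsing:
raw = "2.34s"
repeated = raw * 2  # no error: string repetition gives '2.34s2.34s'
try:
    raw * 1.5
    float_multiply_failed = False
except TypeError:
    float_multiply_failed = True  # str * float is not allowed

# With parsing, using the first row of the export:
rate = bytes_per_second(264, 1011, "15.98s")
print(f"{rate:.1f} bytes/s")
```

Returning 0.0 for a zero-length duration is one possible convention; treating such rows as missing values would be equally defensible.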

Cybersecurity tie-in: SIEMs and firewalls typically export logs like this. The raw export has IP addresses, timestamps, and protocol strings, none of which scikit-learn (sklearn) estimators can consume directly, because they expect numeric feature arrays. Feature engineering is the bridge between SOC data and ML models.
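
As a rough illustration of that bridge, the sketch below turns one row into numbers using only the Python standard library. The chosen features (hour of day, whether the destination is a private address, one-hot protocol flags) and the names `encode_row` and `PROTOCOLS` are assumptions made for this example, not a prescribed feature set.

```python
from datetime import datetime
import ipaddress

# Assumption: only the protocol values seen in the sample rows.
PROTOCOLS = ("TCP", "UDP")


def encode_row(row):
    """Map one raw log row (dict of strings) to numeric features."""
    ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    dst = ipaddress.ip_address(row["dst_ip"])
    features = {
        "hour": ts.hour,                        # timestamp -> hour of day
        "dst_is_private": int(dst.is_private),  # IP string -> 0/1 flag
        "dst_port": int(row["dst_port"]),
        "bytes_sent": int(row["bytes_sent"]),
        "bytes_recv": int(row["bytes_recv"]),
        "packets": int(row["packets"]),
        "duration_s": float(row["duration_str"].rstrip("s")),
    }
    # One-hot encode the protocol string.
    for name in PROTOCOLS:
        features[f"protocol_{name}"] = int(row["protocol"] == name)
    return features


# First row of the export above.
sample = {
    "timestamp": "2024-01-15 08:00:00",
    "src_ip": "192.168.6.180",
    "dst_ip": "39.220.169.247",
    "src_port": "63069",
    "dst_port": "53",
    "protocol": "TCP",
    "bytes_sent": "264",
    "bytes_recv": "1011",
    "packets": "36",
    "duration_str": "15.98s",
    "action": "ALLOW",
}
encoded = encode_row(sample)
print(encoded)
```

Every value in the resulting dictionary is an int or float, which is the shape of input a numeric model expects; the `action` column is left out here because it is the likely candidate for a label rather than a feature.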
