Step 1: How Trees Make Decisions

If/else rules, Gini impurity, information gain

The If/Else Tree

A decision tree is a flowchart of yes/no questions. Each node asks about one feature; each branch is a YES or NO answer; each leaf is a prediction.

Is connection_rate > 50?
├── YES: Is unique_dest_ports > 20?
│        ├── YES: port_scan
│        └── NO:  Is bytes_sent > 100000?
│                 ├── YES: exfiltration
│                 └── NO:  benign
└── NO:  benign

The model learns which questions to ask and which thresholds to use by finding splits that best separate the classes. Each internal question node is a learned (feature, threshold) pair; each leaf is a final prediction.
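Once learned, the tree above is nothing more than nested if/else statements. A hand-coded version (with the flowchart's thresholds, which here are illustrative rather than learned) makes that concrete:

```python
def classify(connection_rate, unique_dest_ports, bytes_sent):
    """Hand-coded version of the flowchart above (illustrative thresholds)."""
    if connection_rate > 50:
        if unique_dest_ports > 20:
            return "port_scan"
        if bytes_sent > 100000:
            return "exfiltration"
        return "benign"
    return "benign"

print(classify(55, 25, 5000))     # high rate, many ports -> port_scan
print(classify(55, 5, 150000))    # high rate, few ports, huge upload -> exfiltration
print(classify(10, 25, 150000))   # low connection rate -> benign
```

Training a decision tree is the process of discovering these thresholds automatically instead of writing them by hand.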

Gini Impurity

Gini measures how "mixed" a node's class distribution is:

Gini = 1 - Σᵢ pᵢ²

Scenario                  Gini   Interpretation
All samples are benign    0.0    Pure node: perfect
50% benign, 50% attack    0.5    Maximum uncertainty
25% each of 4 classes     0.75   Very impure

For every feature, the tree tries candidate thresholds between adjacent sorted values and picks the split that produces the largest decrease in the weighted Gini of the children (the information gain).
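Both quantities are easy to compute by hand. A minimal sketch (the function names `gini` and `gini_gain` are my own, not sklearn's):

```python
import numpy as np

def gini(counts):
    """Gini impurity from per-class sample counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Decrease in Gini when parent is split into left/right children."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

print(gini([100, 0]))          # pure node -> 0.0
print(gini([50, 50]))          # 50/50 mix -> 0.5
print(gini([25, 25, 25, 25]))  # four equal classes -> 0.75
print(gini_gain([60, 40], [60, 0], [0, 40]))  # perfect split -> 0.48
```

The three `gini` calls reproduce the table above; the `gini_gain` call shows that a perfect split recovers the parent's entire impurity as gain.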

Key Code Pattern

Two parameters in the snippet below are worth flagging now so you don't get hung up on them: max_depth=4 caps how many yes/no questions the tree can ask in a row — we'll explore why this matters in Step 3: Depth and Overfitting. random_state=42 and np.random.seed(42) are reproducibility seeds (covered in Lesson 1.2): they lock the randomness so every run produces identical results. The number 42 itself is arbitrary.

from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Synthetic network traffic dataset
np.random.seed(42)                # reproducible random data (see Lesson 1.2)
# Features: connection_rate, bytes_sent, unique_dest_ports, ...
# Labels: 'benign', 'port_scan', 'exfiltration'
# Illustrative placeholder data so the snippet runs end to end:
X_train = np.random.rand(300, 6) * [100, 200000, 40, 60, 5, 10000]
y_train = np.random.choice(['benign', 'port_scan', 'exfiltration'], 300)

model = DecisionTreeClassifier(
    max_depth=4,                  # limit tree depth — explained in Step 3
    random_state=42,              # reproducible training
)
model.fit(X_train, y_train)

# Classify a new connection
new_conn = [[55, 150000, 25, 30, 2, 5000]]
prediction = model.predict(new_conn)
print(f"Prediction: {prediction[0]}")

Think Deeper

A node has 60 benign and 40 attack samples. Calculate its Gini impurity by hand. What Gini would a perfect split produce?

Gini = 1 - (0.6² + 0.4²) = 1 - (0.36 + 0.16) = 0.48. A perfect split produces two pure children with Gini = 0.0 each, so the weighted child impurity is 0.0 and the information gain is the full parent impurity, 0.48: the maximum possible for this node. In practice, perfect splits are rare; the tree picks the best available split.
Cybersecurity tie-in: Decision trees are uniquely valuable in security because you can explain every prediction. A SOC analyst can trace the path: "flagged because connection_rate > 50 AND unique_dest_ports > 20." Try explaining a neural network's decision that clearly.
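sklearn can print a trained tree's rules in exactly this analyst-readable form via `export_text`. A sketch on a hypothetical two-feature dataset (feature names and the labeling rule are illustrative assumptions, not the lesson's dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical dataset where labels follow the flowchart's first two rules
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 2))   # connection_rate, unique_dest_ports
y = np.where((X[:, 0] > 50) & (X[:, 1] > 20), "port_scan", "benign")

model = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Print the learned rules as an indented if/else flowchart
rules = export_text(model, feature_names=["connection_rate", "unique_dest_ports"])
print(rules)
```

The printed output is the same evidence trail an analyst would cite in a ticket: each line is a threshold test, each leaf a class.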
