The If/Else Tree
A decision tree is a flowchart of yes/no questions. Each node asks about one feature; each branch is a YES or NO answer; each leaf is a prediction.
The model learns which questions to ask and which thresholds to use by finding splits that best separate the classes. Each internal node is a learned (feature, threshold) pair; each leaf is a final prediction.
Gini Impurity
Gini measures how "mixed" a node's class distribution is:

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the fraction of the node's samples belonging to class i.
| Scenario | Gini | Interpretation |
|---|---|---|
| All samples are benign | 0.0 | Pure node — perfect |
| 50% benign, 50% attack | 0.5 | Maximum uncertainty |
| 25% each of 4 classes | 0.75 | Very impure |
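The table values can be checked with a small helper function (a sketch; `gini` is a name introduced here, not part of any library):

```python
def gini(proportions):
    """Gini impurity for a list of class proportions that sum to 1."""
    return 1.0 - sum(p * p for p in proportions)

print(gini([1.0]))                      # all benign -> 0.0
print(gini([0.5, 0.5]))                 # 50/50 split -> 0.5
print(gini([0.25, 0.25, 0.25, 0.25]))  # 4 equal classes -> 0.75
```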
The tree tries every possible feature/threshold combination and picks the split that produces the largest decrease in the weighted Gini impurity of its children; this decrease is often called the information gain.
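This exhaustive search can be sketched on a single toy feature. The data values below are invented for illustration; candidate thresholds are the midpoints between neighbouring sorted values:

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy 1-D feature (e.g. a connection rate) with hypothetical labels
x = np.array([2, 3, 5, 40, 55, 60])
y = np.array(['benign', 'benign', 'benign', 'attack', 'attack', 'attack'])

parent = gini(y)
best = None
# Candidate thresholds: midpoints between neighbouring sorted values
for t in (np.sort(x)[:-1] + np.sort(x)[1:]) / 2:
    left, right = y[x <= t], y[x > t]
    # Weighted impurity of the two children after the split
    child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    gain = parent - child
    if best is None or gain > best[1]:
        best = (t, gain)

print(f"best threshold = {best[0]}, gain = {best[1]:.2f}")
# -> best threshold = 22.5, gain = 0.50 (a perfect split: both children are pure)
```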
Key Code Pattern
Two parameters in the snippet below are worth flagging now so you don't get hung up on them: max_depth=4 caps how many yes/no questions the tree can ask in a row — we'll explore why this matters in Step 3: Depth and Overfitting. random_state=42 and np.random.seed(42) are reproducibility seeds (covered in Lesson 1.2): they lock the randomness so every run produces identical results. The number 42 itself is arbitrary.
```python
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Synthetic network traffic dataset
np.random.seed(42)  # reproducible random data (see Lesson 1.2)

# Features: connection_rate, bytes_sent, unique_dest_ports, ... (6 in total)
# Labels: 'benign', 'port_scan', 'exfiltration'
# Placeholder training data (random, for illustration only)
X_train = np.random.rand(300, 6) * [100, 200000, 50, 40, 10, 10000]
y_train = np.random.choice(['benign', 'port_scan', 'exfiltration'], size=300)

model = DecisionTreeClassifier(
    max_depth=4,      # limit tree depth — explained in Step 3
    random_state=42,  # reproducible training
)
model.fit(X_train, y_train)

# Classify a new connection
new_conn = [[55, 150000, 25, 30, 2, 5000]]
prediction = model.predict(new_conn)
print(f"Prediction: {prediction[0]}")
```
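Once a tree is fitted, scikit-learn's `export_text` prints the learned (feature, threshold) questions as an indented rule list. A minimal sketch on a tiny made-up dataset (the feature names and the rule generating `y` are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

np.random.seed(42)
X = np.random.rand(200, 3)  # 3 toy features in [0, 1)
# Invented labelling rule: flag as 'port_scan' when feature 0 is high
y = np.where(X[:, 0] > 0.7, 'port_scan', 'benign')

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(
    tree,
    feature_names=['connection_rate', 'bytes_sent', 'unique_dest_ports'],
))
```

Each printed line is one yes/no question from the flowchart; the leaves show the predicted class.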
Think Deeper
A node has 60 benign and 40 attack samples. Calculate its Gini impurity by hand. What Gini would a perfect split produce?