Step 3: Train and Evaluate

Scaling, fitting, confusion matrix, classification report


Feature Scaling

Logistic regression uses gradient descent to find the best weights. If features are on very different scales (e.g., url_length ranges 10–250 while has_at_symbol is 0/1), the optimiser converges slowly.

StandardScaler transforms each feature to have mean=0 and std=1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit AND transform
X_test_scaled  = scaler.transform(X_test)         # transform only!

Critical rule: fit the scaler on training data only. If you fit on the full dataset, you leak test distribution information into the model.
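One way to make this rule hard to break is to chain the scaler and classifier with sklearn's Pipeline: fitting the pipeline fits the scaler on training data only, and predicting applies transform only. A minimal sketch on synthetic data (the two features and the label rule here are made up for illustration, not the lab's dataset):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for url_length (10-250) and has_at_symbol (0/1)
rng = np.random.default_rng(42)
X = np.column_stack([rng.integers(10, 250, 200), rng.integers(0, 2, 200)])
y = (X[:, 0] > 120).astype(int)  # synthetic label, illustration only

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# .fit() runs fit_transform on the training fold; .predict() runs
# transform only on the test fold — the leak cannot happen by accident.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The same pipeline object also works with cross_val_score, where fitting the scaler inside each fold matters even more.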

Training the Classifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(random_state=42)   # 42 keeps the run reproducible (see Lesson 1.2)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred,
                            target_names=['Legitimate', 'Phishing']))

Reading the Classification Report

| Metric    | What it measures                                       | Security meaning                         |
|-----------|--------------------------------------------------------|------------------------------------------|
| Precision | Of those flagged as phishing, how many really are?     | Low precision = too many false alarms    |
| Recall    | Of all actual phishing, how many did we catch?         | Low recall = threats getting through     |
| F1        | Harmonic mean of precision and recall                  | Balanced overall score                   |
| Accuracy  | Total correct / total predictions                      | Misleading when classes are imbalanced   |
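The formulas behind this table are simple enough to check by hand. A sketch with hypothetical counts (TP=40, FP=5, FN=10, TN=45 are invented for illustration, not results from the lab):

```python
tp, fp, fn, tn = 40, 5, 10, 45  # hypothetical confusion-matrix counts

precision = tp / (tp + fp)               # of those flagged, how many really are
recall    = tp / (tp + fn)               # of all actual phishing, how many caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy  = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
# → precision=0.889 recall=0.800 f1=0.842 accuracy=0.850
```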

The Confusion Matrix

A 2×2 table showing every possible outcome:

|                    | Predicted: Legit                  | Predicted: Phishing               |
|--------------------|-----------------------------------|-----------------------------------|
| Actually Legit     | True Negative (TN)                | False Positive (FP) — false alarm |
| Actually Phishing  | False Negative (FN) — missed threat | True Positive (TP) — caught it  |

In security, False Negatives are usually worse than False Positives. A missed phishing email can lead to a breach; a false alarm just wastes an analyst's time.
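sklearn's confusion_matrix returns exactly this layout (rows = actual, columns = predicted), so the four cells can be unpacked with .ravel(). A quick sketch on hand-made labels (0 = legitimate, 1 = phishing):

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration: 0 = legitimate, 1 = phishing
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# .ravel() flattens the 2x2 matrix in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# → TN=3 FP=1 FN=1 TP=3
```

In the lab, the same unpacking on y_test and y_pred gives the four counts to report.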


Think Deeper

Your model has 95% accuracy but only 60% recall on phishing. Your boss says '95% is great'. What do you tell them?

60% recall means 40% of phishing URLs get through to users. If 100 phishing emails arrive daily, 40 reach inboxes. Accuracy is misleading when classes are imbalanced — the model gets credit for correctly labelling the easy majority class. Recall is the metric that matters when missing a positive is dangerous.
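The exact scenario is easy to reproduce with numbers. A sketch of an imbalanced 1000-URL test set (the counts below are constructed to hit 95% accuracy with 60% recall, not taken from the lab):

```python
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced test set: 950 legitimate (0), 50 phishing (1)
y_true = [0] * 950 + [1] * 50

# A model with 30 false positives and 20 false negatives
y_pred = [0] * 920 + [1] * 30 + [1] * 30 + [0] * 20

print(accuracy_score(y_true, y_pred))  # → 0.95  sounds impressive
print(recall_score(y_true, y_pred))    # → 0.6   40% of phishing slips through
```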
Cybersecurity tie-in: The confusion matrix is the language of security ML. When a vendor claims "99% detection rate", ask: what's the false positive rate? And at what threshold? Without these numbers, the claim is meaningless.
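The threshold question is concrete too: .predict() uses a 0.5 cutoff on predict_proba by default, and lowering it trades more false positives for higher recall. A sketch on synthetic data (features and labels invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data where feature 0 drives the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]  # P(phishing) per sample

# Lowering the threshold below the 0.5 default raises recall
# at the cost of more false positives
for t in (0.5, 0.3, 0.1):
    pred = (proba >= t).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()
    fp = ((pred == 1) & (y == 0)).sum()
    print(f"threshold={t}: recall={tp / (y == 1).sum():.2f}, "
          f"false positives={fp}")
```

This is why a detection-rate claim without a threshold and a false positive rate tells you almost nothing.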
