Why Linear Regression Fails for Classification
Suppose you try to predict whether a URL is phishing (1) or legitimate (0) using linear regression. The model outputs numbers like 0.3, 0.7, 1.2, and -0.1. The last two fall outside the valid 0–1 range, so they cannot be read as probabilities.
| Problem | Explanation |
|---|---|
| Unbounded output | Predictions can be < 0 or > 1 — not valid probabilities |
| Sensitive to outliers | One extreme value can tilt the fitted line and flip predictions for points near the boundary |
| Poor fit | The true relationship is an S-curve, not a line |
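The unbounded-output problem can be seen in a minimal sketch. The feature values and labels below are made up for illustration (a single hypothetical "suspiciousness" score per URL), and the line is fitted with NumPy's least-squares `np.polyfit` rather than any particular regression library.

```python
import numpy as np

# Hypothetical single feature per URL and its 0/1 label (made-up data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares fit of a straight line: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the line below, inside, and above the training range.
x_new = np.array([-2.0, 4.5, 12.0])
preds = slope * x_new + intercept

for xi, pi in zip(x_new, preds):
    print(f"x = {xi:5.1f} -> prediction = {pi:.3f}")
```

For inputs well outside the training range, the fitted line returns a value below 0 at one end and above 1 at the other, which is why its output cannot be treated as a probability.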
The Sigmoid Function
Logistic regression squashes any number into the range (0, 1) using the sigmoid:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# z = weighted sum of features (same as linear regression computes)
| Input z | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very likely legitimate |
| -2 | 0.119 | Probably legitimate |
| 0 | 0.500 | Completely uncertain |
| +2 | 0.881 | Probably phishing |
| +5 | 0.993 | Almost certainly phishing |
The sigmoid produces an S-curve: near 0 for very negative z, near 1 for very positive z, passing through 0.5 at z=0.
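As a quick check, the values in the table can be reproduced with a short script. This sketch reuses the `sigmoid` definition from this section. The symmetry assertion at the end is an added observation, not something the lesson states.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# The same inputs as the table in this section.
z_values = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])
probs = sigmoid(z_values)

for z, p in zip(z_values, probs):
    print(f"z = {z:+.0f} -> sigmoid(z) = {p:.3f}")

# The curve is symmetric around z = 0: sigmoid(-z) equals 1 - sigmoid(z).
assert np.allclose(sigmoid(-z_values), 1 - probs)
```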
Key Code Pattern
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train) # learn weights from data
# Hard prediction: 0 or 1
y_pred = model.predict(X_test)
# Soft prediction: probability between 0 and 1
y_proba = model.predict_proba(X_test)[:, 1] # P(phishing)
By default, predict() labels a URL as phishing when its predicted probability exceeds 0.5. In security, you may want to lower that threshold to catch more threats, at the cost of more false alarms on legitimate URLs.
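One way to apply a custom threshold is to compare the soft predictions against it directly. In this sketch, `y_proba` is a hand-written array standing in for `model.predict_proba(X_test)[:, 1]`, and the 0.3 threshold is a hypothetical choice, not a recommended value.

```python
import numpy as np

# Stand-in for model.predict_proba(X_test)[:, 1]; these probabilities are made up.
y_proba = np.array([0.05, 0.35, 0.45, 0.55, 0.80, 0.95])

# Mirrors the default 0.5 cut-off.
default_pred = (y_proba > 0.5).astype(int)

# Hypothetical lower threshold: flag anything with P(phishing) above 0.3.
threshold = 0.3
low_pred = (y_proba > threshold).astype(int)

print("flagged at 0.5:", int(default_pred.sum()))
print("flagged at 0.3:", int(low_pred.sum()))
```

Lowering the threshold can only keep or increase the number of URLs flagged, so it trades missed threats for extra alerts on legitimate traffic.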
Think Deeper
Try this:
Feed a z-value of 0 into the sigmoid. What probability do you get? What does this mean for a URL with perfectly balanced evidence?
σ(0) = 0.5 — the model is completely uncertain. This is the decision boundary: when the weighted features sum to zero, the model cannot decide. In security, sitting on the boundary means you need more features or more data to break the tie.
Cybersecurity tie-in: Logistic regression is the workhorse of security classification —
phishing detection, malware triage, spam filtering. Its probabilistic output lets you set different thresholds
for different risk levels: auto-block at P > 0.9, quarantine at P > 0.6, allow below 0.3.
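The tiered thresholds above could be sketched as a small triage function. The function name `triage` and the action labels are hypothetical. The text does not say what happens between 0.3 and 0.6, so the "review" fallback for that band is an assumption made only to keep the example complete.

```python
def triage(p_phishing):
    """Map a phishing probability to an action using the tiers from the text."""
    if p_phishing > 0.9:
        return "block"        # auto-block at P > 0.9
    if p_phishing > 0.6:
        return "quarantine"   # quarantine at P > 0.6
    if p_phishing < 0.3:
        return "allow"        # allow below 0.3
    # Assumption: the source leaves the 0.3 to 0.6 band unspecified.
    return "review"

for p in (0.95, 0.70, 0.45, 0.10):
    print(f"P = {p:.2f} -> {triage(p)}")
```

In practice, `p_phishing` would come from the second column of `predict_proba`, one value per URL.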