Step 1: From Regression to Classification

The sigmoid function and why linear regression fails for yes/no


Why Linear Regression Fails for Classification

Suppose you try to predict whether a URL is phishing (1) or legitimate (0) using linear regression. The model predicts numbers like 0.3, 0.7, 1.2, -0.1 — outside the valid 0–1 range.

Problem                 Explanation
Unbounded output        Predictions can be < 0 or > 1 — not valid probabilities
Sensitive to outliers   One extreme value can pull the line and reverse predictions
Poor fit                The true relationship is an S-curve, not a line
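The unbounded-output problem is easy to see with a small sketch. The tiny one-feature dataset below is made up for illustration (think of the feature as a suspicious-character count per URL); fitting ordinary linear regression to 0/1 labels yields "probabilities" outside the valid range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical one-feature data: labels are 0 (legitimate) or 1 (phishing)
X = np.array([[1], [2], [3], [10], [12], [40]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)

# The fitted line keeps climbing: for a large feature value,
# the "probability" exceeds 1
print(lin.predict(np.array([[0], [5], [40]])))
```

For the extreme input (feature value 40) the prediction rises above 1, which is not a valid probability.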

The Sigmoid Function

Logistic regression squashes any number into the range (0, 1) using the sigmoid:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# z = weighted sum of features (same as linear regression computes)

Input z   σ(z)    Interpretation
  -5      0.007   Very likely legitimate
  -2      0.119   Probably legitimate
   0      0.500   Completely uncertain
  +2      0.881   Probably phishing
  +5      0.993   Almost certainly phishing

The sigmoid produces an S-curve: near 0 for very negative z, near 1 for very positive z, passing through 0.5 at z=0.
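The values in the table above can be checked directly with the sigmoid function defined earlier:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Reproduce the table: very negative z -> near 0, very positive z -> near 1
for z in [-5, -2, 0, 2, 5]:
    print(f"z = {z:+d}   sigma(z) = {sigmoid(z):.3f}")
```

Note that sigmoid(0) is exactly 0.5, the point of complete uncertainty.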

Key Code Pattern

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)      # learn weights from data

# Hard prediction: 0 or 1
y_pred = model.predict(X_test)

# Soft prediction: probability between 0 and 1
y_proba = model.predict_proba(X_test)[:, 1]  # P(phishing)

The default decision boundary is at probability 0.5 — but in security, you may want to lower it to catch more threats.
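Lowering the threshold is a one-line change once you have probabilities. A minimal sketch, using illustrative hard-coded probabilities in place of `predict_proba` output (the 0.3 cutoff is an example, not a recommendation):

```python
import numpy as np

# Illustrative P(phishing) values, standing in for predict_proba output
y_proba = np.array([0.15, 0.45, 0.62, 0.95])

# Default boundary: flag as phishing only when P >= 0.5
default_pred = (y_proba >= 0.5).astype(int)   # [0, 0, 1, 1]

# Security-tuned boundary: lower the bar to catch more threats
strict_pred = (y_proba >= 0.3).astype(int)    # [0, 1, 1, 1]
```

The stricter threshold flags the borderline 0.45 URL, trading more false positives for fewer missed threats.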


Think Deeper

Feed a z-value of 0 into the sigmoid. What probability do you get? What does this mean for a URL with perfectly balanced evidence?

σ(0) = 0.5 — the model is completely uncertain. This is the decision boundary: when the weighted features sum to zero, the model cannot decide. In security, sitting on the boundary means you need more features or more data to break the tie.
Cybersecurity tie-in: Logistic regression is the workhorse of security classification — phishing detection, malware triage, spam filtering. Its probabilistic output lets you set different thresholds for different risk levels: auto-block at P > 0.9, quarantine at P > 0.6, allow below 0.3.
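The tiered thresholds in the tie-in above can be sketched as a small routing function. The cutoffs match the text; routing the unmentioned 0.3–0.6 gap to manual review is an assumption added here for completeness:

```python
def triage(p):
    """Map P(phishing) to an action using the illustrative thresholds."""
    if p > 0.9:
        return "auto-block"
    if p > 0.6:
        return "quarantine"
    if p < 0.3:
        return "allow"
    return "manual review"  # assumed handling for the 0.3-0.6 gray zone

print(triage(0.95))  # auto-block
print(triage(0.70))  # quarantine
print(triage(0.10))  # allow
print(triage(0.45))  # manual review
```

Because each tier is just a comparison against `P(phishing)`, the thresholds can be retuned per deployment without retraining the model.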
