Step 1: From Regression to Classification

The sigmoid function and why linear regression fails for yes/no

Why Linear Regression Fails for Classification

Suppose you try to predict whether a URL is phishing (1) or legitimate (0) using linear regression. The model outputs unbounded numbers such as 0.3, 0.7, 1.2, or -0.1, and values like 1.2 and -0.1 fall outside the valid 0 to 1 range for a probability.

Problem	Explanation
Unbounded output	Predictions can be < 0 or > 1 — not valid probabilities
Sensitive to outliers	A single extreme point can tilt the fitted line enough to flip predictions near the threshold
Poor fit	The true relationship is an S-curve, not a line
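
The first two rows of the table can be checked numerically. The sketch below is only an illustration under stated assumptions: the one-feature dataset, its labels, and the outlier at 50 are all made up, and `np.polyfit` stands in for whatever linear regression tool you would normally use.

```python
import numpy as np

# Hypothetical toy data: one numeric feature per URL, label 1 = phishing.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line: y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
preds = slope * x + intercept

# Unbounded output: the ends of the line leave the [0, 1] range.
print(preds.min(), preds.max())

# Sensitivity to outliers: add one extreme, correctly labeled phishing URL.
x_out = np.append(x, 50.0)
y_out = np.append(y, 1.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

# Score for the URL with feature value 4, before and after the outlier.
before = slope * 4 + intercept
after = slope_out * 4 + intercept_out
print(before, after)
```

On this toy data the smallest prediction is negative and the largest exceeds 1. The score for the URL at feature value 4 also drops from above 0.5 to just below it once the outlier is added, so a 0.5 cutoff would relabel a phishing URL as legitimate.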

The Sigmoid Function

Logistic regression squashes any number into the range (0, 1) using the sigmoid:

import numpy as np

def sigmoid(z):
    # z is the weighted sum of features plus a bias term,
    # the same quantity a linear regression model computes.
    return 1 / (1 + np.exp(-z))

Input z	σ(z)	Interpretation
-5	0.007	Very likely legitimate
-2	0.119	Probably legitimate
0	0.500	Completely uncertain
+2	0.881	Probably phishing
+5	0.993	Almost certainly phishing

The sigmoid produces an S-curve: near 0 for very negative z, near 1 for very positive z, passing through 0.5 at z=0.
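
The table values and the shape of the curve can be reproduced directly. The sketch below recomputes the five rows, checks the symmetry of the S-curve, and then shows where z might come from. The `weights`, `bias`, and `features` values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Recompute the rows of the table.
z_values = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])
probs = sigmoid(z_values)
print(np.round(probs, 3))

# Symmetry of the S-curve: sigmoid(-z) equals 1 - sigmoid(z).
print(np.allclose(sigmoid(-z_values), 1 - probs))

# Hypothetical weights, bias, and features, only to show where z comes from.
weights = np.array([1.5, -0.5])
bias = -1.0
features = np.array([2.0, 1.0])
z = weights @ features + bias   # 1.5*2 - 0.5*1 - 1 = 1.5
p_phishing = sigmoid(z)
print(z, p_phishing)
```

Because z = 1.5 is positive, the resulting probability lands above 0.5, on the phishing side of the curve.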

Key Code Pattern

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)      # learn weights from data

# Hard prediction: 0 or 1
y_pred = model.predict(X_test)

# Soft prediction: probability between 0 and 1
y_proba = model.predict_proba(X_test)[:, 1]  # P(phishing)

The default decision threshold is a probability of 0.5. In security, you may want to lower it to catch more threats, at the cost of more false alarms on legitimate URLs.
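
A custom threshold is applied to the probabilities rather than to `model.predict`. The sketch below shows one way to do that. It uses hypothetical scores and labels in place of `model.predict_proba(X_test)[:, 1]` and `y_test`, and the helper names `classify` and `caught_and_false_alarms` are made up for this example.

```python
import numpy as np

# Hypothetical P(phishing) scores, standing in for model.predict_proba(X_test)[:, 1].
y_proba = np.array([0.05, 0.20, 0.35, 0.45, 0.55, 0.80, 0.95])
# Hypothetical ground truth for the same URLs (1 = phishing).
y_true = np.array([0, 0, 0, 1, 0, 1, 1])

def classify(proba, threshold=0.5):
    """Label a URL as phishing (1) when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)

def caught_and_false_alarms(y_true, y_pred):
    caught = int(np.sum((y_true == 1) & (y_pred == 1)))        # true positives
    false_alarms = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    return caught, false_alarms

default = caught_and_false_alarms(y_true, classify(y_proba, 0.5))
lowered = caught_and_false_alarms(y_true, classify(y_proba, 0.3))
print(default, lowered)
```

With these made-up numbers, lowering the threshold from 0.5 to 0.3 catches all three phishing URLs instead of two, but it also raises the false alarms from one to two.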

Think Deeper

Try this:

Feed a z-value of 0 into the sigmoid. What probability do you get? What does this mean for a URL with perfectly balanced evidence?

σ(0) = 0.5 — the model is completely uncertain. This is the decision boundary: when the weighted features sum to zero, the model cannot decide. In security, sitting on the boundary means you need more features or more data to break the tie.

Cybersecurity tie-in: Logistic regression is the workhorse of security classification — phishing detection, malware triage, spam filtering. Its probabilistic output lets you set different thresholds for different risk levels: auto-block at P > 0.9, quarantine at P > 0.6, allow below 0.3.
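
The tiered policy can be written as a small function. This is a sketch, not a recommended production policy: the function name `triage` is made up, and the text does not say what happens for probabilities between 0.3 and 0.6, so the "review" action for that band is an assumption added here.

```python
def triage(p_phishing):
    """Map P(phishing) to an action using the tiers from the text."""
    if p_phishing > 0.9:
        return "auto-block"
    if p_phishing > 0.6:
        return "quarantine"
    if p_phishing < 0.3:
        return "allow"
    # Assumption: the text leaves 0.3 <= P <= 0.6 unspecified,
    # so this sketch sends those URLs to a hypothetical manual review.
    return "review"

for p in (0.95, 0.75, 0.45, 0.10):
    print(p, triage(p))
```

Checking the highest tier first keeps each branch simple, since a URL that clears 0.9 never reaches the quarantine test.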
