Regression vs Classification
Supervised machine learning problems split into two types, depending on the kind of output they predict:
| Property | Regression | Classification |
|---|---|---|
| Output | A continuous number | A category / label |
| Example | 145 ms response time | "attack" or "benign" |
| Error metric | MSE, RMSE, R² | Accuracy, F1, AUC |
| sklearn class | LinearRegression | LogisticRegression |
Rule of thumb: if you can put your output on a number line and care about how far off you are, use regression. If you only care about which bucket, use classification.
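A minimal sketch of the two sklearn classes from the table, on toy data invented here for illustration (the thresholds and coefficients are arbitrary, not from the dataset below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Regression: the target is a continuous number on a number line.
y_reg = 3.0 * X.ravel() + 5 + rng.normal(0, 1, 100)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[4.0]]))  # a continuous prediction, roughly 17

# Classification: the target is a bucket / label.
y_cls = (X.ravel() > 5).astype(int)  # 0 = "benign", 1 = "attack"
clf = LogisticRegression().fit(X, y_cls)
print(clf.predict([[4.0]]))  # a label: 0 or 1
```

Same input, two different questions: the regressor answers "how much?", the classifier answers "which bucket?".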
The Dataset: Server Response Time
You're monitoring a web server. As traffic increases, response time goes up. Can you predict response time from requests-per-second?
| Column | Type | Meaning |
|---|---|---|
| requests_per_second | Float | HTTP requests arriving each second (input feature) |
| response_time_ms | Float | Average response time in milliseconds (target) |
The dataset contains 500 synthetic measurements with an approximately linear relationship plus Gaussian noise.
Key Code Pattern
One thing you'll see in nearly every ML script: np.random.seed(42). The number 42 is arbitrary — any integer works. Setting a seed locks NumPy's random number generator so every run produces the exact same "random" data. This is called reproducibility: it lets a teammate re-run your code and get identical results, which is essential for debugging, peer review, and audit trails. You'll see random_state=42 on sklearn objects later for the same reason.
```python
import numpy as np
import pandas as pd

# Generate synthetic data with a known relationship
np.random.seed(42)  # lock the random number generator so every run produces identical data
n_samples = 500
requests_per_second = np.random.uniform(10, 200, n_samples)
response_time_ms = 1.8 * requests_per_second + 30 + np.random.normal(0, 15, n_samples)

df = pd.DataFrame({
    'requests_per_second': requests_per_second,
    'response_time_ms': response_time_ms
})

print(df.shape)       # (500, 2)
print(df.describe())  # summary statistics
```
Exploratory Data Analysis (EDA)
Before fitting any model, always inspect your data:
- Shape — how many rows and columns?
- Types — are columns numeric or categorical?
- Summary stats — mean, min, max, standard deviation
- Missing values — any NaN entries?
- Scatter plot — does it suggest linearity?
These steps catch problems early and tell you whether a linear model is appropriate.
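The checklist above can be sketched as a short script (the dataset is regenerated here so the snippet stands alone; the correlation check stands in for eyeballing the scatter plot):

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset so this snippet runs on its own.
np.random.seed(42)
rps = np.random.uniform(10, 200, 500)
df = pd.DataFrame({
    'requests_per_second': rps,
    'response_time_ms': 1.8 * rps + 30 + np.random.normal(0, 15, 500),
})

print(df.shape)         # rows and columns
print(df.dtypes)        # numeric or categorical?
print(df.describe())    # mean, min, max, standard deviation
print(df.isna().sum())  # missing values per column

# A correlation near +/-1 suggests a linear model is a reasonable choice.
corr = df['requests_per_second'].corr(df['response_time_ms'])
print(f"Pearson correlation: {corr:.3f}")
```

For the visual check, `df.plot.scatter(x='requests_per_second', y='response_time_ms')` (with matplotlib installed) draws the scatter plot.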
Think Deeper
Change the data range from 10–200 rps to 10–2000 rps. Does the scatter plot still look linear? What does that tell you about the model's assumptions?
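One way to explore this, sketched with `np.polyfit` rather than sklearn for brevity: regenerate the data over the wider range and check how well a straight line still explains it. Because the synthetic formula is linear by construction, the fit will stay near-perfect at any range; the question is whether a real server, which saturates under heavy load, would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Same generative formula as before, but over a 10x wider traffic range.
rps = rng.uniform(10, 2000, n)
latency = 1.8 * rps + 30 + rng.normal(0, 15, n)

# Fit a line and compute R^2 to quantify how linear the data looks.
slope, intercept = np.polyfit(rps, latency, 1)
pred = slope * rps + intercept
ss_res = np.sum((latency - pred) ** 2)
ss_tot = np.sum((latency - np.mean(latency)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"slope={slope:.2f}, intercept={intercept:.1f}, R^2={r2:.4f}")
```

The high R² here reflects the synthetic formula, not reality: a linear model extrapolates its straight line forever, which is exactly the assumption that breaks when a real server hits saturation.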