Regression vs Classification
Supervised machine learning problems split into two types, depending on the kind of output they predict:
| Property | Regression | Classification |
|---|---|---|
| Output | A continuous number | A category / label |
| Example | 145 ms response time | "attack" or "benign" |
| Error metric | MSE, RMSE, R² | Accuracy, F1, AUC |
| sklearn class | LinearRegression | LogisticRegression |
Rule of thumb: if you can put your output on a number line and care about how far off you are, use regression. If you only care about which bucket, use classification.
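A minimal sketch of the two sklearn classes from the table, on toy data invented here for illustration (the thresholds and coefficients are arbitrary, not from the dataset below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Regression: the target is a continuous number on a number line.
y_reg = 3.0 * X.ravel() + 5 + rng.normal(0, 1, 100)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[4.0]]))  # a continuous prediction, roughly 17

# Classification: the target is a bucket / label.
y_cls = (X.ravel() > 5).astype(int)  # 0 = "benign", 1 = "attack"
clf = LogisticRegression().fit(X, y_cls)
print(clf.predict([[4.0]]))  # a label: 0 or 1
```

Same input, two different questions: the regressor answers "how much?", the classifier answers "which bucket?".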
The Dataset: Server Response Time
You're monitoring a web server. As traffic increases, response time goes up. Can you predict response time from requests-per-second?
| Column | Type | Meaning |
|---|---|---|
| requests_per_second | Float | HTTP requests arriving each second (input feature) |
| response_time_ms | Float | Average response time in milliseconds (target) |
The dataset contains 500 synthetic measurements with an approximately linear relationship plus Gaussian noise.
Key Code Pattern
One thing you'll see in nearly every ML script: np.random.seed(42). The number 42 is arbitrary — any integer works. Setting a seed locks NumPy's random number generator so every run produces the exact same "random" data. This is called reproducibility: it lets a teammate re-run your code and get identical results, which is essential for debugging, peer review, and audit trails. You'll see random_state=42 on sklearn objects later for the same reason.
```python
import numpy as np
import pandas as pd

# Generate synthetic data with a known relationship
np.random.seed(42)  # lock the random number generator so every run produces identical data
n_samples = 500
requests_per_second = np.random.uniform(10, 200, n_samples)
response_time_ms = 1.8 * requests_per_second + 30 + np.random.normal(0, 15, n_samples)

df = pd.DataFrame({
    'requests_per_second': requests_per_second,
    'response_time_ms': response_time_ms
})

print(df.shape)       # (500, 2)
print(df.describe())  # summary statistics
```
Exploratory Data Analysis (EDA)
Before fitting any model, always inspect your data:
- Shape — how many rows and columns?
- Types — are columns numeric or categorical?
- Summary stats — mean, min, max, standard deviation
- Missing values — any NaN entries?
- Scatter plot — does it suggest linearity?
These steps catch problems early and tell you whether a linear model is appropriate.
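The checklist above can be sketched as a short script (the dataset is regenerated here so the snippet stands alone; the correlation check stands in for eyeballing the scatter plot):

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset so this snippet runs on its own.
np.random.seed(42)
rps = np.random.uniform(10, 200, 500)
df = pd.DataFrame({
    'requests_per_second': rps,
    'response_time_ms': 1.8 * rps + 30 + np.random.normal(0, 15, 500),
})

print(df.shape)         # rows and columns
print(df.dtypes)        # numeric or categorical?
print(df.describe())    # mean, min, max, standard deviation
print(df.isna().sum())  # missing values per column

# A correlation near +/-1 suggests a linear model is a reasonable choice.
corr = df['requests_per_second'].corr(df['response_time_ms'])
print(f"Pearson correlation: {corr:.3f}")
```

For the visual check, `df.plot.scatter(x='requests_per_second', y='response_time_ms')` (with matplotlib installed) draws the scatter plot.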
Think Deeper
Change the data range from 10–200 rps to 10–2000 rps. Does the scatter plot still look linear? What does that tell you about the model's assumptions?
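One way to explore this, sketched with `np.polyfit` rather than sklearn for brevity: regenerate the data over the wider range and check how well a straight line still explains it. Because the synthetic formula is linear by construction, the fit will stay near-perfect at any range; the question is whether a real server, which saturates under heavy load, would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Same generative formula as before, but over a 10x wider traffic range.
rps = rng.uniform(10, 2000, n)
latency = 1.8 * rps + 30 + rng.normal(0, 15, n)

# Fit a line and compute R^2 to quantify how linear the data looks.
slope, intercept = np.polyfit(rps, latency, 1)
pred = slope * rps + intercept
ss_res = np.sum((latency - pred) ** 2)
ss_tot = np.sum((latency - np.mean(latency)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"slope={slope:.2f}, intercept={intercept:.1f}, R^2={r2:.4f}")
```

The high R² here reflects the synthetic formula, not reality: a linear model extrapolates its straight line forever, which is exactly the assumption that breaks when a real server hits saturation.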