Step 1: Understanding Regression

Regression vs classification, scatter plots, EDA


Regression vs Classification

Supervised machine learning problems split into two types depending on what they predict:

| Property | Regression | Classification |
| --- | --- | --- |
| Output | A continuous number | A category / label |
| Example | 145 ms response time | "attack" or "benign" |
| Error metric | MSE, RMSE, R² | Accuracy, F1, AUC |
| sklearn class | `LinearRegression` | `LogisticRegression` |

Rule of thumb: if you can put your output on a number line and care about how far off you are, use regression. If you only care about which bucket, use classification.
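To make the contrast concrete, here is a minimal sketch (assuming scikit-learn is installed; the toy data and the 150-rps attack threshold are made up for illustration) that fits both model types on the same input feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
load = rng.uniform(10, 200, (100, 1))   # requests per second (the input feature)

# Regression: the target is a continuous number (response time in ms)
latency = 1.8 * load.ravel() + 30 + rng.normal(0, 15, 100)
reg = LinearRegression().fit(load, latency)
print(reg.predict([[150.0]]))           # a point on a number line

# Classification: the target is a category (0 = benign, 1 = attack)
# Hypothetical rule: traffic above 150 rps is labeled an attack
label = (load.ravel() > 150).astype(int)
clf = LogisticRegression().fit(load, label)
print(clf.predict([[180.0]]))           # a bucket: 0 or 1
```

Same feature, different question: the regressor answers "how many milliseconds?", the classifier answers "which bucket?".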

The Dataset: Server Response Time

You're monitoring a web server. As traffic increases, response time goes up. Can you predict response time from requests-per-second?

| Column | Type | Meaning |
| --- | --- | --- |
| requests_per_second | Float | HTTP requests arriving each second (input feature) |
| response_time_ms | Float | Average response time in milliseconds (target) |

The dataset contains 500 synthetic measurements with an approximately linear relationship plus Gaussian noise.

Key Code Pattern

One thing you'll see in nearly every ML script: np.random.seed(42). The number 42 is arbitrary — any integer works. Setting a seed locks NumPy's random number generator so every run produces the exact same "random" data. This is called reproducibility: it lets a teammate re-run your code and get identical results, which is essential for debugging, peer review, and audit trails. You'll see random_state=42 on sklearn objects later for the same reason.

import numpy as np
import pandas as pd

# Generate synthetic data with a known relationship
np.random.seed(42)   # lock the random number generator so every run produces identical data
n_samples = 500
requests_per_second = np.random.uniform(10, 200, n_samples)
response_time_ms = 1.8 * requests_per_second + 30 + np.random.normal(0, 15, n_samples)

df = pd.DataFrame({
    'requests_per_second': requests_per_second,
    'response_time_ms': response_time_ms
})

print(df.shape)          # (500, 2)
print(df.describe())     # summary statistics

Exploratory Data Analysis (EDA)

Before fitting any model, always inspect your data:

  1. Shape — how many rows and columns?
  2. Types — are columns numeric or categorical?
  3. Summary stats — mean, min, max, standard deviation
  4. Missing values — any NaN entries?
  5. Scatter plot — does it suggest linearity?

These steps catch problems early and tell you whether a linear model is appropriate.
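The five checks above map directly onto pandas one-liners. A minimal sketch, reusing the synthetic `df` from the earlier snippet (the scatter plot step assumes matplotlib is installed):

```python
import numpy as np
import pandas as pd

# Rebuild the same synthetic dataset as before
np.random.seed(42)
n_samples = 500
requests_per_second = np.random.uniform(10, 200, n_samples)
response_time_ms = 1.8 * requests_per_second + 30 + np.random.normal(0, 15, n_samples)
df = pd.DataFrame({'requests_per_second': requests_per_second,
                   'response_time_ms': response_time_ms})

print(df.shape)           # 1. Shape: (500, 2)
print(df.dtypes)          # 2. Types: both columns are float64
print(df.describe())      # 3. Summary stats: mean, min, max, std
print(df.isna().sum())    # 4. Missing values: 0 in each column
# 5. Scatter plot (requires matplotlib):
# df.plot.scatter(x='requests_per_second', y='response_time_ms')
```

If step 4 reported any NaNs or step 5 showed a clear curve, you would fix the data or rethink the model before fitting anything.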


Think Deeper

Change the data range from 10–200 rps to 10–2000 rps. Does the scatter plot still look linear? What does that tell you about the model's assumptions?

At very high loads, real servers saturate — response times spike exponentially. The scatter would curve upward, and a straight line would under-predict at high loads. This is why linear regression only works when the true relationship is approximately linear. For non-linear patterns you need polynomial features or a different model.
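That fix can be sketched with scikit-learn's `PolynomialFeatures` in a pipeline. This is a minimal example, not the lesson's code: the data-generating formula below is made up to mimic saturation (a quadratic term dominating at high load):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
rps = rng.uniform(10, 2000, 500).reshape(-1, 1)
# Simulated saturation: latency grows quadratically at high load
latency = 30 + 0.5 * rps.ravel() + 0.002 * rps.ravel() ** 2 + rng.normal(0, 50, 500)

linear = LinearRegression().fit(rps, latency)
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(rps, latency)

print(linear.score(rps, latency))   # R² of a straight line
print(curved.score(rps, latency))   # R² with a squared feature added
```

On data like this, the degree-2 pipeline fits noticeably better, because it can represent the upward curve that a straight line cannot.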
Cybersecurity tie-in: A trained baseline says "at X rps, the server should respond in Y ms ± Z ms". Any observation that deviates far beyond that band is suspicious — it could indicate a DoS flood, resource exhaustion, or a background process consuming CPU. This is the foundation of anomaly detection.
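A minimal sketch of that baseline idea, assuming scikit-learn; the 3σ cutoff and the helper name `is_suspicious` are illustrative choices, not a standard API:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a baseline on the same synthetic data as before
np.random.seed(42)
rps = np.random.uniform(10, 200, 500).reshape(-1, 1)
latency = 1.8 * rps.ravel() + 30 + np.random.normal(0, 15, 500)

model = LinearRegression().fit(rps, latency)
residual_std = np.std(latency - model.predict(rps))   # the ± Z ms of the band

def is_suspicious(requests, observed_ms, k=3.0):
    """True if observed latency deviates more than k sigma from the baseline."""
    expected = model.predict([[requests]])[0]
    return abs(observed_ms - expected) > k * residual_std

print(is_suspicious(100, 210))   # near the expected ~210 ms: not flagged
print(is_suspicious(100, 900))   # far outside the band: flagged
```

The model supplies the "should be" value, and the residual spread supplies the tolerance; everything outside that band becomes an alert candidate.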
