Step 3: Train / Test Split

Why we split data and how to avoid data leakage


Why You Must Split Your Data

Imagine studying for an exam using a practice test, then taking the exact same test as the real exam. Your score looks great — but it doesn't prove you learned anything.

The same applies to ML models. If you train and evaluate on the same data:

  • The model has already "seen" those examples
  • It can memorise specific points rather than learn patterns
  • Your error metric is falsely optimistic
  • The model may fail completely in production

This is called overfitting to the training set.

The Fix: Train/Test Split

Hold out a portion of your data before training and never touch it until final evaluation.

Set            Size              Purpose
Training set   80% (400 rows)    Model learns from this
Test set       20% (100 rows)    Final evaluation only; locked until the end

Key Code Pattern

import numpy as np
from sklearn.model_selection import train_test_split

# Example data: 500 samples, 1 feature (matches the table above)
X = np.random.rand(500, 1)   # features array
y = np.random.rand(500)      # target array

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,      # 20% goes to the test set
    random_state=42     # reproducible shuffle
)

print(f"Train: {X_train.shape}  Test: {X_test.shape}")
# Train: (400, 1)  Test: (100, 1)

Data Leakage

Data leakage means information from the test set "leaks" into training, making results unrealistically good:

Leakage type   Example                                                   Fix
Direct         Evaluating on training data                               Always split first
Feature        Computing mean/std on the full dataset before splitting   Fit scalers on train only
Temporal       Using future data to predict the past                     Chronological splits
Label          Features derived from the target                          Audit feature construction
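The feature-leakage row is worth seeing in code. Below is a minimal sketch (the synthetic data and seeds are illustrative, not from the lesson) contrasting the leaky pattern of fitting a scaler on the full dataset with the correct pattern of fitting on the training set only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 1))   # synthetic features
y = rng.normal(size=500)        # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# WRONG: the scaler's mean/std are computed from ALL rows,
# so test-set statistics leak into the training features
leaky_scaler = StandardScaler().fit(X)
X_train_leaky = leaky_scaler.transform(X_train)

# RIGHT: fit on the training set only, then apply the same
# transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

For the temporal row, `train_test_split(..., shuffle=False)` keeps the rows in order, giving a chronological split for time-ordered data.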

Think Deeper

What happens if you evaluate the model on the training data instead of the test set? Try it — is R² higher or lower?

Training R² is almost always higher (better) than test R² because the model has already memorised the training examples. The gap between them is the overfitting indicator. In security ML, this means your malware detector may look perfect in the lab but fail on live traffic.
Cybersecurity tie-in: Data leakage is the #1 reason security ML models fail in production. A malware classifier trained with leakage can look 99% accurate in the lab but miss 90% of real threats. Always split your data before any preprocessing.
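You can observe that gap directly. Here is a small sketch using synthetic data and an unconstrained decision tree (the dataset and model choice are illustrative, not from the lesson); a tree with no depth limit memorises the training set, so its training R² far exceeds its test R²:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)  # noisy signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree grows one leaf per training point
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # near-perfect: points memorised
test_r2 = model.score(X_test, y_test)     # much lower on unseen data
print(f"Train R²: {train_r2:.3f}  Test R²: {test_r2:.3f}")
```

The noise in `y` is irreducible, so a perfect training score is itself evidence of memorisation rather than learning.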
