Step 3: Train / Test Split

Why we split data and how to avoid data leakage


Why You Must Split Your Data

Imagine studying for an exam using a practice test, then taking the exact same test as the real exam. Your score looks great — but it doesn't prove you learned anything.

The same applies to ML models. If you train and evaluate on the same data:

  • The model has already "seen" those examples
  • It can memorise specific points rather than learn patterns
  • Your error metric is falsely optimistic
  • The model may fail completely in production

This is called overfitting to the training set.

The Fix: Train/Test Split

Hold out a portion of your data before training and never touch it until final evaluation.

Set            Size              Purpose
Training set   80% (400 rows)    Model learns from this
Test set       20% (100 rows)    Final evaluation only; locked until the end

Key Code Pattern

import numpy as np
from sklearn.model_selection import train_test_split

# Example data: 500 samples, 1 feature (matches the table above)
X = np.random.rand(500, 1)   # features array
y = np.random.rand(500)      # target array

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,      # 20% goes to the test set
    random_state=42     # reproducible shuffle
)

print(f"Train: {X_train.shape}  Test: {X_test.shape}")
# Train: (400, 1)  Test: (100, 1)

Data Leakage

Data leakage means information from the test set "leaks" into training, making results unrealistically good:

Leakage type   Example                                                   Fix
Direct         Evaluating on training data                               Always split first
Feature        Computing mean/std on the full dataset before splitting   Fit scalers on train only
Temporal       Using future data to predict the past                     Chronological splits
Label          Features derived from the target                          Audit feature construction
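The feature-leakage row is worth seeing in code. Below is a minimal sketch (the synthetic data and seeds are illustrative, not from the lesson) contrasting the leaky pattern of fitting a scaler on the full dataset with the correct pattern of fitting on the training set only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 1))   # synthetic features
y = rng.normal(size=500)        # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# WRONG: the scaler's mean/std are computed from ALL rows,
# so test-set statistics leak into the training features
leaky_scaler = StandardScaler().fit(X)
X_train_leaky = leaky_scaler.transform(X_train)

# RIGHT: fit on the training set only, then apply the same
# transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

For the temporal row, `train_test_split(..., shuffle=False)` keeps the rows in order, giving a chronological split for time-ordered data.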

Think Deeper

What happens if you evaluate the model on the training data instead of the test set? Try it — is R² higher or lower?

Training R² is almost always higher (better) than test R² because the model has already memorised the training examples. The gap between them is the overfitting indicator. In security ML, this means your malware detector may look perfect in the lab but fail on live traffic.
Cybersecurity tie-in: Data leakage is the #1 reason security ML models fail in production. A malware classifier trained with leakage can look 99% accurate in the lab but miss 90% of real threats. Always split your data before any preprocessing.
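You can observe that gap directly. Here is a small sketch using synthetic data and an unconstrained decision tree (the dataset and model choice are illustrative, not from the lesson); a tree with no depth limit memorises the training set, so its training R² far exceeds its test R²:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)  # noisy signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree grows one leaf per training point
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # near-perfect: points memorised
test_r2 = model.score(X_test, y_test)     # much lower on unseen data
print(f"Train R²: {train_r2:.3f}  Test R²: {test_r2:.3f}")
```

The noise in `y` is irreducible, so a perfect training score is itself evidence of memorisation rather than learning.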
