# Why You Must Split Your Data
Imagine studying for an exam using a practice test, then taking the exact same test as the real exam. Your score looks great — but it doesn't prove you learned anything.
The same applies to ML models. If you train and evaluate on the same data:
- The model has already "seen" those examples
- It can memorise specific points rather than learn patterns
- Your error metric is falsely optimistic
- The model may fail completely in production
This is called overfitting to the training set.
## The Fix: Train/Test Split
Hold out a portion of your data before training and never touch it until final evaluation.
| Set | Size | Purpose |
|---|---|---|
| Training set | 80% (400 rows) | Model learns from this |
| Test set | 20% (100 rows) | Final evaluation only — locked until the end |
## Key Code Pattern
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,                # features array
    y,                # target array
    test_size=0.2,    # 20% goes to the test set
    random_state=42,  # reproducible shuffle
)

print(f"Train: {X_train.shape}  Test: {X_test.shape}")
# Train: (400, 1)  Test: (100, 1)
```
## Data Leakage
Data leakage means information from the test set "leaks" into training, making results unrealistically good:
| Leakage type | Example | Fix |
|---|---|---|
| Direct | Evaluating on training data | Always split first |
| Feature | Computing mean/std on full dataset before splitting | Fit scalers on train only |
| Temporal | Using future data to predict the past | Chronological splits |
| Label | Features derived from the target | Audit feature construction |
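The temporal fix in the table can be sketched in a few lines. Assuming the records are already sorted by timestamp (the `events` list here is a hypothetical toy log, not from the lesson), a chronological split simply cuts the timeline instead of shuffling; scikit-learn's `train_test_split(..., shuffle=False)` behaves the same way on pre-sorted data.

```python
# Hypothetical sketch: chronological split for time-ordered data.
# Assumes records are already sorted by timestamp (oldest first).
events = [{"ts": t, "label": t % 2} for t in range(100)]  # toy, pre-sorted log

cut = int(len(events) * 0.8)              # 80/20 cut on the timeline
train, test = events[:cut], events[cut:]

# Every training event is strictly earlier than every test event,
# so the model never "sees the future".
assert max(e["ts"] for e in train) < min(e["ts"] for e in test)
```

Because the split point is a position in time rather than a random draw, evaluation mimics production: the model is always scored on events that happened after everything it trained on.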
## Think Deeper
Try this:
What happens if you evaluate the model on the training data instead of the test set? Try it — is R² higher or lower?
Training R² is almost always higher (better) than test R² because the model has already memorised the training examples. The gap between them is the overfitting indicator. In security ML, this means your malware detector may look perfect in the lab but fail on live traffic.
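The exercise above can be made concrete without any real model. Here is a minimal pure-Python sketch (all names are hypothetical): a "model" that is just a lookup table of its training points scores a perfect R² on the training set yet fails badly on held-out points, because memorisation is not pattern learning.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy data: y = 2x plus small deterministic "noise"
data = [(x, 2 * x + (7 * x) % 5 - 2) for x in range(20)]
train, test = data[:16], data[16:]

# A "model" that memorises training points exactly and falls back
# to the training mean for anything it has never seen.
lookup = dict(train)
train_mean = sum(y for _, y in train) / len(train)

def predict(x):
    return lookup.get(x, train_mean)

r2_train = r_squared([y for _, y in train], [predict(x) for x, _ in train])
r2_test = r_squared([y for _, y in test], [predict(x) for x, _ in test])
print(r2_train, r2_test)  # perfect 1.0 on train, far worse on test
```

The train score of 1.0 says nothing about generalisation; only the test score, on points the lookup table has never seen, reveals that the "model" learned no pattern at all.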
Cybersecurity tie-in: Data leakage is the #1 reason security ML models fail in production.
A malware classifier trained with leakage can look 99% accurate in the lab but miss 90% of real threats.
Always split your data before any preprocessing.
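That rule can be sketched with hand-rolled standardisation (the helper names are hypothetical): the mean and standard deviation come from the training split only, then get reused unchanged on the test split. With scikit-learn you get the same effect by calling `StandardScaler.fit` on the training data and only `transform` on the test data.

```python
# Hypothetical sketch: split first, THEN fit preprocessing on train only.
def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, std

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0]                    # e.g. an outlier seen only in production

mu, sigma = fit_scaler(train_x)    # statistics from the TRAIN split only
train_scaled = transform(train_x, mu, sigma)
test_scaled = transform(test_x, mu, sigma)  # reuse the train statistics
```

Fitting the scaler on the full dataset instead would fold the test outlier into `mu` and `sigma`, quietly handing the model information it could never have at training time.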