What a batch is
Instead of computing the gradient over the entire dataset (too slow) or one sample at a time (too noisy), we use a mini-batch -- a random subset of samples. Batch size controls the trade-off between gradient quality and computation speed.
| Batch size | Batches per epoch (1,600 samples) | Gradient updates per epoch |
|---|---|---|
| 32 | 50 | 50 (noisy but frequent) |
| 256 | 7 (6 full + 1 partial) | 7 (stable but few) |
| 1,600 (full batch) | 1 | 1 (exact but expensive) |
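The counts in the table follow directly from the dataset size. Most frameworks, including Keras, run a final partial batch rather than dropping it, so the number of updates is the ceiling of the division. A quick sketch:

```python
import math

n_samples = 1600  # dataset size used in the table above

for bs in [32, 256, 1024, 1600]:
    # Frameworks typically run a final partial batch, hence the ceiling.
    updates_per_epoch = math.ceil(n_samples / bs)
    print(f"batch_size={bs:4d} -> {updates_per_epoch} gradient updates per epoch")
```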
Small vs large batches
| Property | Small batch (32) | Large batch (1024) |
|---|---|---|
| Gradient noise | High -- each batch is a different random subset | Low -- averages over many samples |
| Updates per epoch | Many (50 for 1,600 samples) | Few (~2 for 1,600 samples) |
| Wall-clock time per epoch | Slower (more overhead per update) | Faster (fewer but larger operations) |
| Generalisation | Often better -- noise helps escape sharp minima | May converge to sharp minima that generalise poorly |
| Memory usage | Lower | Higher |
The experiment: three batch sizes
```python
import time

from tensorflow import keras

# Assumes X_train, y_train, X_val, y_val already exist as NumPy arrays
# with 10 features and binary labels (e.g. 1,600 training samples).

batch_sizes = [32, 256, 1024]
results = {}

for bs in batch_sizes:
    # Rebuild the model each iteration so every batch size starts
    # from fresh, randomly initialised weights.
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    start = time.time()
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=bs, verbose=0,
    )
    elapsed = time.time() - start

    val_loss = min(history.history['val_loss'])
    results[bs] = {'time': elapsed, 'best_val_loss': val_loss}
    print(f"batch_size={bs:4d} time={elapsed:.1f}s best_val_loss={val_loss:.4f}")
```
Practical batch size guidelines
| Dataset size | Recommended batch size | Reason |
|---|---|---|
| Small (< 5,000) | 32 | Need many updates per epoch to learn effectively |
| Medium (5,000 - 100,000) | 32 - 128 | Balance noise with speed |
| Large (> 100,000) | 128 - 512 | Can afford larger batches without losing update frequency |
Rule of thumb: start with batch_size=32 and only increase it if training is too slow. If you multiply the batch size by N, the linear scaling rule suggests multiplying the learning rate by N to compensate for the smoother, lower-variance gradients; a more conservative alternative used in practice is to scale the learning rate by sqrt(N).
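A minimal sketch of the two scaling rules, assuming an illustrative base learning rate of 1e-3 at batch size 32 (both values are assumptions, not prescriptions):

```python
base_batch, base_lr = 32, 1e-3  # illustrative starting point

for bs in [64, 256, 1024]:
    n = bs / base_batch
    linear_lr = base_lr * n        # linear scaling rule: lr grows with batch size
    sqrt_lr = base_lr * n ** 0.5   # sqrt scaling: a more conservative alternative
    print(f"batch_size={bs:4d}  linear_lr={linear_lr:.4f}  sqrt_lr={sqrt_lr:.4f}")
```

Whichever rule you use, treat the scaled value as a starting point and re-tune: very large learning rates can destabilise training regardless of batch size.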
Think Deeper
Try this:
You train a phishing detector with batch_size=1024 on 10,000 samples. That is only ~10 gradient updates per epoch. What problem might this cause, and what batch size would you try instead?
With only ~10 updates per epoch, the gradient is very smooth but may converge to a sharp minimum that generalises poorly, and the model gets few opportunities per epoch to adjust its weights. Try batch_size=32 or 64, giving roughly 156 to 313 updates per epoch. The noisier gradients help escape sharp minima and often find flatter ones that generalise better to unseen phishing samples.
Cybersecurity tie-in: Security datasets are often heavily imbalanced (1% attacks, 99% benign). With 1% attacks, a batch of 32 contains no attack samples about 72% of the time (0.99^32), while a batch of 1024 almost always contains a few -- but those few are averaged against roughly 1,000 benign samples, so the minority-class gradient is diluted in every update. Small batches instead deliver occasional attack-heavy updates whose gradients are not averaged away. Either way, pair your batch size choice with class weighting or minority oversampling -- critical for recall on rare attack classes.
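As a sanity check on batch composition under class imbalance, here is the probability that a randomly drawn batch contains no attack samples at all, assuming a 1% attack rate and independent sampling (an illustrative approximation of shuffled mini-batching):

```python
attack_rate = 0.01  # assumed: 1% of samples are attacks

for bs in [32, 256, 1024]:
    # P(every sample in the batch is benign)
    p_zero = (1 - attack_rate) ** bs
    print(f"batch_size={bs:4d}  P(no attacks in batch)={p_zero:.6f}")
```

Small batches frequently contain no attacks at all, while large batches almost never do; the trade-off is how strongly each attack sample influences the update when it does appear.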