What a batch is
Instead of computing the gradient over the entire dataset (too slow) or one sample at a time (too noisy), we use a mini-batch -- a random subset of samples. Batch size controls the trade-off between gradient quality and computation speed.
| Batch size | Batches per epoch (1,600 samples) | Gradient updates per epoch |
|---|---|---|
| 32 | 50 | 50 (noisy but frequent) |
| 256 | 7 (6 full + 1 partial) | 7 (stable but few) |
| 1,600 (full batch) | 1 | 1 (exact but expensive) |
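The counts in the table follow directly from the dataset size. Most frameworks, including Keras, run a final partial batch rather than dropping it, so the number of updates is the ceiling of the division. A quick sketch:

```python
import math

n_samples = 1600  # dataset size used in the table above

for bs in [32, 256, 1024, 1600]:
    # Frameworks typically run a final partial batch, hence the ceiling.
    updates_per_epoch = math.ceil(n_samples / bs)
    print(f"batch_size={bs:4d} -> {updates_per_epoch} gradient updates per epoch")
```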
Small vs large batches
| Property | Small batch (32) | Large batch (1024) |
|---|---|---|
| Gradient noise | High -- each batch is a different random subset | Low -- averages over many samples |
| Updates per epoch | Many (50 for 1,600 samples) | Few (~2 for 1,600 samples) |
| Wall-clock time per epoch | Slower (more overhead per update) | Faster (fewer but larger operations) |
| Generalisation | Often better -- noise helps escape sharp minima | May converge to sharp minima that generalise poorly |
| Memory usage | Lower | Higher |
The experiment: three batch sizes
```python
import time

from tensorflow import keras

# Assumes X_train, y_train, X_val, y_val already exist as NumPy arrays
# with 10 features and binary labels (e.g. 1,600 training samples).

batch_sizes = [32, 256, 1024]
results = {}

for bs in batch_sizes:
    # Rebuild the model each iteration so every batch size starts
    # from fresh, randomly initialised weights.
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    start = time.time()
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=bs, verbose=0,
    )
    elapsed = time.time() - start

    val_loss = min(history.history['val_loss'])
    results[bs] = {'time': elapsed, 'best_val_loss': val_loss}
    print(f"batch_size={bs:4d} time={elapsed:.1f}s best_val_loss={val_loss:.4f}")
```
Practical batch size guidelines
| Dataset size | Recommended batch size | Reason |
|---|---|---|
| Small (< 5,000) | 32 | Need many updates per epoch to learn effectively |
| Medium (5,000 - 100,000) | 32 - 128 | Balance noise with speed |
| Large (> 100,000) | 128 - 512 | Can afford larger batches without losing update frequency |
Rule of thumb: start with batch_size=32 and only increase it if training is too slow. If you multiply the batch size by N, the linear scaling rule suggests multiplying the learning rate by N to compensate for the smoother, lower-variance gradients; a more conservative alternative used in practice is to scale the learning rate by sqrt(N).
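A minimal sketch of the two scaling rules, assuming an illustrative base learning rate of 1e-3 at batch size 32 (both values are assumptions, not prescriptions):

```python
base_batch, base_lr = 32, 1e-3  # illustrative starting point

for bs in [64, 256, 1024]:
    n = bs / base_batch
    linear_lr = base_lr * n        # linear scaling rule: lr grows with batch size
    sqrt_lr = base_lr * n ** 0.5   # sqrt scaling: a more conservative alternative
    print(f"batch_size={bs:4d}  linear_lr={linear_lr:.4f}  sqrt_lr={sqrt_lr:.4f}")
```

Whichever rule you use, treat the scaled value as a starting point and re-tune: very large learning rates can destabilise training regardless of batch size.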
Think Deeper
Try this:
You train a phishing detector with batch_size=1024 on 10,000 samples. That is only ~10 gradient updates per epoch. What problem might this cause, and what batch size would you try instead?
With only ~10 updates per epoch, the gradient is very smooth but may converge to a sharp minimum that generalises poorly, and the model gets few opportunities per epoch to adjust its weights. Try batch_size=32 or 64, giving roughly 156 to 313 updates per epoch. The noisier gradients help escape sharp minima and often find flatter ones that generalise better to unseen phishing samples.
Cybersecurity tie-in: Security datasets are often heavily imbalanced (1% attacks, 99% benign). With 1% attacks, a batch of 32 contains no attack samples about 72% of the time (0.99^32), while a batch of 1024 almost always contains a few -- but those few are averaged against roughly 1,000 benign samples, so the minority-class gradient is diluted in every update. Small batches instead deliver occasional attack-heavy updates whose gradients are not averaged away. Either way, pair your batch size choice with class weighting or minority oversampling -- critical for recall on rare attack classes.
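As a sanity check on batch composition under class imbalance, here is the probability that a randomly drawn batch contains no attack samples at all, assuming a 1% attack rate and independent sampling (an illustrative approximation of shuffled mini-batching):

```python
attack_rate = 0.01  # assumed: 1% of samples are attacks

for bs in [32, 256, 1024]:
    # P(every sample in the batch is benign)
    p_zero = (1 - attack_rate) ** bs
    print(f"batch_size={bs:4d}  P(no attacks in batch)={p_zero:.6f}")
```

Small batches frequently contain no attacks at all, while large batches almost never do; the trade-off is how strongly each attack sample influences the update when it does appear.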