Step 3: Batch Size Effects

Gradient noise vs training stability


What a batch is

Instead of computing the gradient over the entire dataset (too slow) or one sample at a time (too noisy), we use a mini-batch -- a random subset of samples. Batch size controls the trade-off between gradient quality and computation speed.

| Batch size | Batches per epoch (1,600 samples) | Gradient updates per epoch |
| --- | --- | --- |
| 32 | 50 | 50 (noisy but frequent) |
| 256 | ~6 | ~6 (stable but few) |
| 1,600 (full batch) | 1 | 1 (exact but expensive) |
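A minimal NumPy sketch (toy data, no real model) of how an epoch breaks into mini-batches, matching the 1,600-sample numbers above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_600, 10))   # toy dataset: 1,600 samples, 10 features

batch_size = 32
indices = rng.permutation(len(X))  # reshuffle once per epoch
n_batches = len(X) // batch_size   # 1,600 / 32 = 50 mini-batches

for step in range(n_batches):
    batch_idx = indices[step * batch_size:(step + 1) * batch_size]
    X_batch = X[batch_idx]         # the random subset used for one gradient update
    # ... compute the gradient on X_batch and update the weights here ...

print(n_batches)  # 50 updates per epoch
```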

Small vs large batches

| Property | Small batch (32) | Large batch (1024) |
| --- | --- | --- |
| Gradient noise | High -- each batch is a different random subset | Low -- averages over many samples |
| Updates per epoch | Many (50 for 1,600 samples) | Few (~2 for 1,600 samples) |
| Wall-clock time per epoch | Slower (more overhead per update) | Faster (fewer but larger operations) |
| Generalisation | Often better -- noise helps escape sharp minima | May converge to sharp minima that generalise poorly |
| Memory usage | Lower | Higher |
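A quick way to see the gradient-noise row in action: treat per-sample gradients as random draws and measure how much the batch-mean estimate fluctuates at each batch size (toy numbers, not a real model):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for per-sample gradient components: mean 1.0, per-sample std 5.0
per_sample = rng.normal(loc=1.0, scale=5.0, size=100_000)

noise = {}
for bs in (32, 1024):
    # A mini-batch gradient is the mean over bs samples;
    # its standard deviation shrinks roughly as 1/sqrt(bs)
    estimates = [rng.choice(per_sample, size=bs).mean() for _ in range(2_000)]
    noise[bs] = np.std(estimates)
    print(f"batch_size={bs:4d}  gradient-estimate std ~ {noise[bs]:.3f}")
```

The batch-of-32 estimate fluctuates several times more than the batch-of-1024 estimate (theoretically by sqrt(1024/32) ~ 5.7x), which is exactly the noise that small batches inject into training.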

The experiment: three batch sizes

```python
import time
from tensorflow import keras

# X_train, y_train, X_val, y_val come from the earlier data-preparation step
batch_sizes = [32, 256, 1024]
results = {}

for bs in batch_sizes:
    # Rebuild the model from scratch so each batch size starts from fresh weights
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    start = time.time()
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=bs, verbose=0,
    )
    elapsed = time.time() - start

    val_loss = min(history.history['val_loss'])
    results[bs] = {'time': elapsed, 'best_val_loss': val_loss}
    print(f"batch_size={bs:4d}  time={elapsed:.1f}s  best_val_loss={val_loss:.4f}")
```

Practical batch size guidelines

| Dataset size | Recommended batch size | Reason |
| --- | --- | --- |
| Small (< 5,000) | 32 | Need many updates per epoch to learn effectively |
| Medium (5,000 - 100,000) | 32 - 128 | Balance noise with speed |
| Large (> 100,000) | 128 - 512 | Can afford larger batches without losing update frequency |

Rule of thumb: Start with batch_size=32. Only increase if training is too slow. If you increase batch size by a factor of N, consider raising the learning rate as well -- linearly with N (the linear scaling rule), or by sqrt(N) as a more conservative heuristic -- to compensate for the smoother gradients.
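As illustrative arithmetic (the base learning rate here is an assumption, not a tuned value), going from batch 32 to 256 would rescale the learning rate like this:

```python
base_lr = 1e-3            # assumed: learning rate tuned at batch_size=32
base_bs, new_bs = 32, 256
scale = new_bs / base_bs  # N = 8

linear_lr = base_lr * scale        # linear scaling rule: multiply lr by N
sqrt_lr = base_lr * scale ** 0.5   # sqrt rule: a more conservative variant

print(f"linear rule: {linear_lr:.4g}   sqrt rule: {sqrt_lr:.4g}")
```

Either way, treat the scaled value as a starting point for re-tuning, not a guarantee.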


Think Deeper

You train a phishing detector with batch_size=1024 on 10,000 samples. That is only ~10 gradient updates per epoch. What problem might this cause, and what batch size would you try instead?

With only 10 updates per epoch, the gradient is very smooth but may converge to a sharp minimum that generalises poorly, and the gradient direction barely varies from one update to the next. Try batch_size=32 or 64 -- giving roughly 156-312 updates per epoch. The noisier gradients help escape sharp minima and often settle in flatter minima that generalise better to unseen phishing samples.
Cybersecurity tie-in: Security datasets are often heavily imbalanced (e.g. 1% attacks, 99% benign). Batch size interacts with this imbalance: at a 1% attack rate, a batch of 32 has roughly a 73% chance (0.99^32) of containing no attack samples at all, so many individual updates only reinforce "everything is benign", while a batch of 1,024 almost always contains a few attacks. Whatever batch size you choose, pair it with class weights or balanced sampling so that gradients regularly see the rare class -- critical for recall on rare attack classes.
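To quantify how often a batch misses the rare class entirely, the probability that a randomly drawn batch contains no attack samples (assuming independent sampling at a 1% attack rate) is (1 - 0.01) raised to the batch size:

```python
attack_rate = 0.01  # assumed: 1% of samples are attacks

# P(a batch of size bs contains zero attack samples)
p_miss = {bs: (1 - attack_rate) ** bs for bs in (32, 256, 1024)}

for bs, p in p_miss.items():
    print(f"batch_size={bs:4d}  P(no attacks in batch) = {p:.5f}")
```

Running the numbers shows why batch size alone cannot solve imbalance, and why class weights or balanced sampling matter regardless of the batch size you pick.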
