What Dropout does
During each forward pass in training, Dropout(rate) randomly sets a fraction of neuron outputs to zero. With rate=0.3, roughly 30% of the outputs are silenced, and a fresh random mask is drawn every time -- different neurons each pass. This forces the network to build redundant representations instead of relying on any single neuron.
| Property | Detail |
|---|---|
| Training behaviour | Randomly zero a `rate` fraction of outputs on each forward pass |
| Inference behaviour | All neurons active (no dropout) |
| Output scaling | Surviving outputs scaled by 1/(1 - rate) during training (inverted dropout), so no rescaling is needed at inference |
| Typical values | 0.2 -- 0.5 (higher = stronger regularisation) |
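The training-time behaviour in the table can be sketched in a few lines of NumPy. This is a minimal illustration of inverted dropout, not the Keras implementation -- the function and variable names are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training):
    """Inverted dropout: zero a `rate` fraction of outputs during
    training and scale survivors by 1/(1 - rate); at inference,
    pass everything through unchanged."""
    if not training:
        return x                           # inference: all neurons active
    mask = rng.random(x.shape) >= rate     # keep each output with prob 1 - rate
    return x * mask / (1.0 - rate)         # rescale to preserve expected magnitude

x = np.ones(100_000)
y = dropout(x, rate=0.3, training=True)

print((y == 0).mean())   # close to 0.3 -- the dropped fraction
print(y.mean())          # close to 1.0 -- expected magnitude preserved
```

The rescaling is why training and inference outputs have the same expected magnitude even though dropout is only active during training.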
Dropout as implicit ensemble
Each training step with dropout uses a different random subset of neurons -- effectively training a different sub-network each time. Over thousands of steps, the model trains an exponential number of overlapping sub-networks. At inference time, using all neurons approximates the ensemble average of all these sub-networks.
| Concept | Without Dropout | With Dropout(0.3) |
|---|---|---|
| Active neurons | All neurons every step | ~70% random subset each step |
| Co-adaptation | Neurons can become co-dependent | Each neuron must be independently useful |
| Effective models trained | 1 model | Exponentially many sub-networks |
| Overfitting risk | High with excess capacity | Reduced significantly |
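For a single linear layer, the ensemble-average claim can be checked numerically: averaging the outputs of many random dropout sub-networks converges to the full-network output. This is a toy sketch with made-up weights; for nonlinear networks the full-network pass is only an approximation of the ensemble average, not an exact equality:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear output layer: y = w . h, where h is a hidden activation vector.
h = rng.random(256)   # hidden activations
w = rng.random(256)   # output weights
rate = 0.3

full = w @ h          # inference: all neurons active, no dropout

# Average the outputs of many random sub-networks (inverted dropout on h).
samples = []
for _ in range(5000):
    mask = rng.random(h.size) >= rate
    samples.append(w @ (h * mask / (1 - rate)))
mc = np.mean(samples)

print(full, mc)   # the Monte Carlo ensemble average lands close to `full`
```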
Adding Dropout to the overfit model
Place a Dropout layer after each hidden Dense layer. The model architecture stays the same width, but dropout prevents it from memorising.
```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(10,)),
    keras.layers.Dropout(0.3),   # drop 30% after first hidden layer
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.3),   # drop 30% after second hidden layer
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dropout(0.3),   # drop 30% after third hidden layer
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```
Comparing dropout rates
Different rates trade off regularisation strength against model capacity:
| Dropout rate | Effect | When to use |
|---|---|---|
| 0.1 | Light regularisation | Small models, large datasets |
| 0.3 | Standard regularisation | Good default starting point |
| 0.5 | Strong regularisation | Very large models, limited data |
| 0.8 | Aggressive -- may underfit | Rarely used; only for extreme overfitting |
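One reason very high rates are rarely used: the inverted-dropout scale factor 1/(1 - rate) grows quickly with the rate, so at 0.8 each surviving activation is amplified five-fold:

```python
# Scale factor applied to surviving outputs for each rate in the table above.
for rate in (0.1, 0.3, 0.5, 0.8):
    print(f"rate={rate}: survivors scaled by {1 / (1 - rate):.2f}x")
```

This prints factors of roughly 1.11x, 1.43x, 2.00x, and 5.00x, so at rate 0.8 the network sees only 20% of its neurons, each heavily amplified -- a very noisy training signal that easily underfits.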
Think Deeper
A SOC deploys a model with Dropout(0.5). During inference on live traffic, are neurons still being dropped? What would happen if they were?
No. Keras applies dropout only during training (controlled by the layer's `training` flag); `model.predict()` runs with training=False, so all neurons stay active. If neurons were still dropped during inference, predictions would be random and inconsistent -- the same packet could be classified as malicious one second and benign the next. That is unacceptable for production alerting.