# Model capacity and overfitting
A network's capacity is, roughly, the range of functions it can represent. More parameters mean higher capacity and the ability to fit more complex patterns. But when capacity far exceeds the complexity of the real signal, the model memorises the training data instead of learning the underlying pattern.
| Scenario | Parameters | Training samples | Ratio | Risk |
|---|---|---|---|---|
| Balanced | 1,000 | 10,000 | 0.1 | Low |
| Borderline | 10,000 | 10,000 | 1.0 | Medium |
| Overfit | 134,000 | 1,600 | 84 | Severe |
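The ratio column is just parameters divided by training samples; a quick sketch reproduces the table's numbers:

```python
def param_ratio(n_params: int, n_samples: int) -> float:
    """Parameters per training sample -- a rough overfitting-risk indicator."""
    return n_params / n_samples

print(param_ratio(1_000, 10_000))    # balanced   → 0.1
print(param_ratio(10_000, 10_000))   # borderline → 1.0
print(param_ratio(134_000, 1_600))   # overfit    → 83.75 (~84)
```

There is no hard threshold, but once the ratio climbs well past 1, the network has more than enough freedom to memorise every sample.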
## The overfit architecture
In this exercise, you deliberately build a network that is far too large for the dataset: three Dense(256) layers on only 1,600 training samples.
| Layer | Size | Activation | Parameters |
|---|---|---|---|
| Input | 10 | -- | -- |
| Dense | 256 | relu | 10 x 256 + 256 = 2,816 |
| Dense | 256 | relu | 256 x 256 + 256 = 65,792 |
| Dense | 256 | relu | 256 x 256 + 256 = 65,792 |
| Output | 1 | sigmoid | 256 x 1 + 1 = 257 |
| Total | | | 134,657 |
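Each row's count follows the same formula: weights (`inputs × units`) plus one bias per unit. A small framework-independent helper reproduces the table's totals:

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameter count of a Dense layer: n_in * n_out weights + n_out biases."""
    return n_in * n_out + n_out

# (input_size, units) for each Dense layer in the overfit architecture
layers = [(10, 256), (256, 256), (256, 256), (256, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
print(counts)       # → [2816, 65792, 65792, 257]
print(sum(counts))  # → 134657
```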
## Reading the diverging loss curves
The telltale sign of overfitting is training loss that keeps decreasing while validation loss starts increasing. The gap between the two curves is the overfitting gap.
```python
from tensorflow import keras
import matplotlib.pyplot as plt

# Build the overfit model: three Dense(256) layers on 10 input features
model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(10,)),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Train and capture history
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32, verbose=0)

# Plot diverging loss curves
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Overfitting: training loss drops, val loss rises')
plt.show()
```
## Measuring the overfitting gap
Quantify overfitting by comparing training and validation metrics at the end of training:
```python
# Numerical overfitting evidence: compare final-epoch losses
train_loss = history.history['loss'][-1]
val_loss = history.history['val_loss'][-1]
gap = val_loss - train_loss
print(f"Train loss: {train_loss:.4f}")
print(f"Val loss:   {val_loss:.4f}")
print(f"Gap:        {gap:.4f} (larger = more overfit)")
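Beyond the final-epoch gap, it is often useful to locate *where* the curves diverge: the epoch with the lowest validation loss, which is where early stopping would have halted. A minimal sketch, shown here on hypothetical loss lists (with a real run you would pass `history.history['loss']` and `history.history['val_loss']`):

```python
def divergence_epoch(train_loss, val_loss):
    """Return the 0-based epoch where validation loss bottoms out.

    Training past this point only widens the overfitting gap.
    """
    return min(range(len(val_loss)), key=lambda e: val_loss[e])

# Hypothetical curves: training loss keeps falling, val loss turns at epoch 3
train = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
val   = [0.9, 0.7, 0.6, 0.55, 0.6, 0.7]
print(divergence_epoch(train, val))  # → 3
```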
## Think Deeper
A network has 134,000 parameters but only 1,600 training samples. What ratio does that give, and why is it a problem for a security ML model?