# The most important hyperparameter
Learning rate controls how far the optimizer moves the weights on each update step. It is widely considered the single most impactful hyperparameter: get it wrong, and no amount of tuning anything else will save the run.
```python
# The weight update rule (gradient descent)
new_weight = old_weight - learning_rate * gradient
```
| Learning rate | Step size | Behaviour | Loss curve pattern |
|---|---|---|---|
| 0.0001 | Tiny | Very slow convergence | Loss crawls down gradually; may never reach the minimum within budget |
| 0.001 | Normal | Usually converges well | Smooth, steady decrease -- reaches the minimum efficiently |
| 0.01 | Bigger | Faster start, may oscillate | Quick initial drop, then bouncing near the minimum |
| 0.1 | Huge | Often diverges | Loss explodes or oscillates wildly -- never converges |
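To make those step sizes concrete, here is the update rule applied by hand to a single weight at each learning rate from the table (the weight and gradient values are illustrative, not from a real model):

```python
# One gradient-descent update for a single weight, at each learning rate.
old_weight = 0.5
gradient = 2.0  # illustrative gradient for this weight

for learning_rate in (0.0001, 0.001, 0.01, 0.1):
    step = learning_rate * gradient
    new_weight = old_weight - step
    print(f"lr={learning_rate}: step={step:.4f}, new_weight={new_weight:.4f}")
```

At lr=0.1 a single update moves the weight by 0.2, almost half its own magnitude, which is exactly why large rates overshoot the minimum.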
## The experiment: three learning rates
Train the same model architecture three times, changing only the learning rate, and plot the loss curves:
```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

# Synthetic placeholder data -- substitute your own dataset.
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(800, 10)), rng.integers(0, 2, size=800)
X_val, y_val = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

learning_rates = [0.001, 0.01, 0.1]
histories = {}

for lr in learning_rates:
    # Identical architecture each run; only the learning rate changes.
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    histories[lr] = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=32, verbose=0,
    )

# Plot all three validation loss curves on one figure
for lr, h in histories.items():
    plt.plot(h.history['val_loss'], label=f'lr={lr}')
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.legend()
plt.title('Learning Rate Comparison')
plt.show()
```
## Diagnosing learning rate from the loss curve
| Symptom | Diagnosis | Fix |
|---|---|---|
| Loss barely moves after many epochs | Learning rate too low | Increase by 3-10x |
| Loss decreases smoothly, then plateaus | Learning rate about right | Keep it; consider lr scheduling |
| Loss oscillates but trends downward | Learning rate slightly high | Reduce by 2-3x |
| Loss spikes up or diverges to NaN | Learning rate much too high | Reduce by 10x or more |
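The table above can be turned into a rough automated check. This is a sketch, not a library function: `diagnose_learning_rate` is a hypothetical helper, and its thresholds (1% total improvement counts as "barely moving", more than half the epoch-to-epoch deltas changing sign counts as oscillation) are illustrative:

```python
import math

def diagnose_learning_rate(losses, flat_tol=0.01):
    """Map a validation-loss history to one of the table's diagnoses."""
    # Divergence: NaN/inf, or loss blown up far above its starting point.
    if any(not math.isfinite(l) for l in losses) or losses[-1] > losses[0] * 10:
        return "diverging: reduce learning rate by 10x or more"
    # Barely moving: under 1% total improvement over the whole run.
    if (losses[0] - losses[-1]) / losses[0] < flat_tol:
        return "barely moving: increase learning rate by 3-10x"
    # Oscillation: count sign changes in the epoch-to-epoch deltas.
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    if flips > len(deltas) // 2:
        return "oscillating: reduce learning rate by 2-3x"
    return "converging: keep it; consider lr scheduling"
```

You could run this on `history.history['val_loss']` after each experiment instead of eyeballing every curve.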
## Practical starting points
| Optimizer | Default learning rate | Search range |
|---|---|---|
| Adam | 0.001 | [0.0001, 0.001, 0.01] |
| SGD | 0.01 | [0.001, 0.01, 0.1] |
| RMSprop | 0.001 | [0.0001, 0.001, 0.01] |
**Rule of thumb:** start with the optimizer's default. If the loss is unstable, reduce by 3x; if it decreases too slowly, increase by 3x. Three tries usually lands in a working range.
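That rule of thumb can be sketched as a small search loop. Here `train_fn` is a hypothetical callable standing in for "train the model at this rate and return its loss history", and the 1% improvement threshold is illustrative:

```python
def tune_learning_rate(train_fn, lr=0.001, tries=3):
    """Sketch of the 3x up/down rule of thumb for finding a working rate."""
    for _ in range(tries):
        losses = train_fn(lr)
        if losses[-1] > losses[0]:
            lr /= 3          # unstable: loss ended higher than it started
        elif (losses[0] - losses[-1]) / losses[0] < 0.01:
            lr *= 3          # too slow: under 1% total improvement
        else:
            return lr        # stable and making progress: keep this rate
    return lr
```

In practice you would cap each probe run at a few epochs so the search stays cheap.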
## Think Deeper
Try this:
You train a threat classifier with lr=0.1 and the loss oscillates wildly, never converging. You switch to lr=0.0001 and the loss barely moves after 50 epochs. What should you try, and why?
Try lr=0.001 -- the Adam optimizer default. Learning rate 0.1 overshoots the loss minimum (too-large steps), while 0.0001 takes steps too small to make progress in your epoch budget. In security ML, time matters: you need the model retrained and deployed before the threat landscape shifts. The middle ground balances convergence speed with stability.
Cybersecurity tie-in: In security ML, you often need to retrain models on shifting distributions (new attack types appear, old ones fade). A learning rate that worked last month may not work on this month's data. Building the habit of checking loss curves after every retrain ensures your threat detection model actually converged rather than silently failing.
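That habit can be enforced with a simple gate after each retrain. A minimal sketch, assuming `val_losses` is the `history.history['val_loss']` list returned by `model.fit`; the window size and tolerance are illustrative thresholds, not standards:

```python
import math

def converged(val_losses, window=5, tol=1e-3):
    """Post-retrain gate: True only if validation loss stayed finite,
    actually improved, and has flattened out over the last `window` epochs."""
    if any(not math.isfinite(l) for l in val_losses):
        return False                           # silently diverged to NaN/inf
    recent = val_losses[-window:]
    flat = max(recent) - min(recent) < tol     # loss has stopped moving
    improved = val_losses[-1] < val_losses[0]  # and it moved downward overall
    return flat and improved
```

A retrained threat model that fails this gate should not be deployed; adjust the learning rate and rerun instead.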