Step 2: Learning Rate Sensitivity

The most important knob to turn


The most important hyperparameter

Learning rate controls how far the optimizer moves the weights at each update step. It is widely considered the single most impactful hyperparameter: get it wrong, and no amount of tuning elsewhere will save the run.

# The weight update rule
new_weight = old_weight - learning_rate * gradient
Learning rate | Step size | Behaviour           | Loss curve pattern
0.0001        | Tiny      | Very slow convergence | Loss crawls down gradually; may never reach the minimum within the epoch budget
0.001         | Normal    | Usually converges well | Smooth, steady decrease; reaches the minimum efficiently
0.01          | Bigger    | Faster start, may oscillate | Quick initial drop, then bouncing near the minimum
0.1           | Huge      | Often diverges      | Loss explodes or oscillates wildly; never converges
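The four regimes in the table can be reproduced with the update rule alone. Below is a minimal sketch on a hypothetical 1-D loss L(w) = 10·w² (gradient 20·w), with the starting weight and loss chosen purely for illustration; exact thresholds depend on the loss surface, so on this toy problem 0.01 happens to be the sweet spot:

```python
# Apply new_weight = old_weight - learning_rate * gradient repeatedly
# on the toy loss L(w) = 10 * w**2, whose gradient is 20 * w.
def run_updates(lr, steps=50, w=1.0):
    for _ in range(steps):
        grad = 20 * w          # dL/dw for L(w) = 10 * w**2
        w = w - lr * grad      # the weight update rule
    return w

for lr in [0.0001, 0.001, 0.01, 0.1]:
    print(f"lr={lr}: final w = {run_updates(lr):.6f}")
```

With lr=0.0001 the weight barely moves in 50 steps (too small), lr=0.01 reaches the minimum (w ≈ 0), and lr=0.1 makes each step overshoot so far that w flips sign every update and never settles.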

The experiment: three learning rates

Train the same model architecture three times, changing only the learning rate, and plot the loss curves:

from tensorflow import keras
import matplotlib.pyplot as plt

learning_rates = [0.001, 0.01, 0.1]
histories = {}

for lr in learning_rates:
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss='binary_crossentropy',
        metrics=['accuracy'],
    )
    # X_train, y_train, X_val, y_val are assumed to be prepared beforehand
    histories[lr] = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=32, verbose=0,
    )

# Plot all three loss curves
for lr, h in histories.items():
    plt.plot(h.history['val_loss'], label=f'lr={lr}')
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.legend()
plt.title('Learning Rate Comparison')
plt.show()

Diagnosing learning rate from the loss curve

Symptom | Diagnosis | Fix
Loss barely moves after many epochs | Learning rate too low | Increase by 3-10x
Loss decreases smoothly, then plateaus | Learning rate about right | Keep it; consider lr scheduling
Loss oscillates but trends downward | Learning rate slightly high | Reduce by 2-3x
Loss spikes up or diverges to NaN | Learning rate much too high | Reduce by 10x or more
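The symptoms above can be turned into a rough automated check. This is a sketch, not a substitute for looking at the curve: the `diagnose` helper and its thresholds (5% improvement, half the deltas changing sign) are illustrative assumptions, not standard values:

```python
import math

def diagnose(losses):
    """Map a per-epoch loss history to one of the symptoms in the table."""
    if any(math.isnan(x) or math.isinf(x) for x in losses) or losses[-1] > 10 * losses[0]:
        return "diverging: reduce lr by 10x or more"
    drop = (losses[0] - losses[-1]) / losses[0]
    if drop < 0.05:
        return "barely moving: increase lr by 3-10x"
    # Oscillation: count sign flips in the epoch-to-epoch deltas
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    flips = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
    if flips > len(deltas) / 2:
        return "oscillating but improving: reduce lr by 2-3x"
    return "looks healthy: keep lr, consider a schedule"

print(diagnose([1.0, 0.99, 0.995, 0.99, 0.992, 0.991]))  # barely moving
print(diagnose([1.0, 0.5, 0.25, 0.12, 0.06, 0.03]))      # looks healthy
```

To use it on the experiment above, pass `histories[lr].history['val_loss']`.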

Practical starting points

Optimizer | Default learning rate | Search range
Adam      | 0.001 | [0.0001, 0.001, 0.01]
SGD       | 0.01  | [0.001, 0.01, 0.1]
RMSprop   | 0.001 | [0.0001, 0.001, 0.01]

Rule of thumb: Start with the optimizer's default. If loss is unstable, reduce by 3x. If loss is too slow, increase by 3x. Three tries usually finds a working range.


Think Deeper

You train a threat classifier with lr=0.1 and the loss oscillates wildly, never converging. You switch to lr=0.0001 and the loss barely moves after 50 epochs. What should you try, and why?

Try lr=0.001 — the Adam optimizer default. Learning rate 0.1 overshoots the loss minimum (too large steps), while 0.0001 takes steps too small to make progress in your epoch budget. In security ML, time matters: you need the model retrained and deployed before the threat landscape shifts. The middle ground balances convergence speed with stability.
Cybersecurity tie-in: In security ML, you often need to retrain models on shifting distributions (new attack types appear, old ones fade). A learning rate that worked last month may not work on this month's data. Building the habit of checking loss curves after every retrain ensures your threat detection model actually converged rather than silently failing.
