Architecture as a hyperparameter
When people say "hyperparameter tuning," they often mean learning rate and batch size. But the network architecture itself is also a hyperparameter -- and it often has a bigger impact on performance than either of those.
| Architecture hyperparameter | Options | Impact |
|---|---|---|
| Number of hidden layers (depth) | 1, 2, 3, ... | More depth = more complex function approximation |
| Units per layer (width) | 16, 32, 64, 128, 256 | More width = more capacity per layer |
| Activation function | relu, tanh, leaky_relu | Affects gradient flow and expressiveness |
| Dropout rate | 0.0 - 0.5 | Regularisation strength |
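Depth and width interact to determine total model capacity. A rough parameter count for a fully connected network can be sketched in plain Python (assuming 10 input features and a single sigmoid output, as in the grid search example later in this section; each Dense layer contributes fan_in x fan_out weights plus fan_out biases):

```python
def dense_param_count(n_inputs, width, depth, n_outputs=1):
    """Count parameters in a fully connected net: each Dense layer
    has (fan_in * fan_out) weights plus fan_out biases."""
    total = 0
    fan_in = n_inputs
    for _ in range(depth):
        total += fan_in * width + width  # hidden layer weights + biases
        fan_in = width
    total += fan_in * n_outputs + n_outputs  # output layer
    return total

# Widening grows parameters roughly quadratically (width appears in both
# fan_in and fan_out of interior layers); deepening grows them linearly.
print(dense_param_count(10, 32, 1))  # 1 hidden layer, 32 units
print(dense_param_count(10, 64, 1))  # doubling width
print(dense_param_count(10, 32, 3))  # tripling depth
```

This is why "go wider" and "go deeper" are not interchangeable: they scale parameter count, and therefore overfitting risk, at different rates.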
The grid search strategy
A grid search tries every combination in a defined search space. It is systematic and thorough, but the number of combinations grows multiplicatively with every dimension you add.
```python
import itertools

import pandas as pd
from tensorflow import keras

# Assumes X_train, y_train, X_val, y_val are already defined
# (10 input features, binary labels).

# Define the search space
units_options = [32, 64, 128]
depth_options = [1, 2, 3]
# Total combinations: 3 x 3 = 9 models

results = []
for units, depth in itertools.product(units_options, depth_options):
    # Build model with this configuration
    layers = [keras.Input(shape=(10,))]
    for _ in range(depth):
        layers.append(keras.layers.Dense(units, activation='relu'))
    layers.append(keras.layers.Dense(1, activation='sigmoid'))
    model = keras.Sequential(layers)

    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=50, batch_size=32, verbose=0)

    best_val_loss = min(history.history['val_loss'])
    n_params = model.count_params()
    results.append({
        'units': units, 'depth': depth,
        'params': n_params, 'best_val_loss': round(best_val_loss, 4),
    })

df = pd.DataFrame(results).sort_values('best_val_loss')
print(df.to_string(index=False))
```
Grid search vs random search
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Grid search | Try every combination in a predefined grid | Thorough; covers entire space | Expensive; wastes time on unimportant dimensions |
| Random search | Sample randomly from the hyperparameter space | Explores more unique values per dimension; often finds good configs faster | No guarantee of finding the absolute best |
| Bayesian (Keras Tuner) | Use past results to guide where to search next | Most efficient; focuses on promising regions | More complex to set up; overhead for small searches |
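The random-search row can be sketched in plain Python. This samples configurations without training anything; the search space and the 20-trial budget are illustrative:

```python
import random

# Illustrative search space covering the hyperparameters discussed above
search_space = {
    'units': [16, 32, 64, 128, 256],
    'depth': [1, 2, 3, 4],
    'learning_rate': [1e-4, 3e-4, 1e-3, 3e-3],
    'dropout': [0.0, 0.1, 0.3, 0.5],
}

def sample_config(space, rng):
    """Draw one configuration by sampling each dimension independently."""
    return {name: rng.choice(options) for name, options in space.items()}

rng = random.Random(42)  # fixed seed so the trial list is reproducible
trials = [sample_config(search_space, rng) for _ in range(20)]

# A full grid over this space would need 5 * 4 * 4 * 4 = 320 models;
# random search covers the same space with a fixed budget of 20.
for config in trials[:3]:
    print(config)
```

Each sampled configuration would then be built and trained exactly as in the grid search loop; only the way configurations are chosen changes.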
Reading the results table
When comparing architecture search results, look for these patterns:
| Pattern | Meaning | Action |
|---|---|---|
| Wider models consistently beat narrow ones | The task needs more capacity per layer | Try even wider models, but watch for overfitting |
| Deeper models do not improve over shallow | The data is not complex enough for depth | Stick with 1-2 layers; add regularisation instead |
| Best model has far more params than training samples | Likely overfitting; val_loss may be misleading | Add dropout, reduce capacity, or get more data |
| All models perform similarly | Architecture is not the bottleneck | Focus on feature engineering or data quality instead |
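One warning sign from the table -- a parameter count far above the number of training samples -- is easy to check programmatically. The 10x threshold below is an illustrative rule of thumb, not a hard rule:

```python
def capacity_warning(n_params, n_train_samples, ratio_threshold=10.0):
    """Flag models whose parameter count dwarfs the training set.
    Returns (params-per-sample ratio, warning flag)."""
    ratio = n_params / n_train_samples
    return ratio, ratio > ratio_threshold

# A small model on 500 samples: no flag
ratio, risky = capacity_warning(n_params=2497, n_train_samples=500)
print(f"params/sample ratio: {ratio:.1f}, overfitting risk: {risky}")

# A huge model on the same 500 samples: flagged
ratio, risky = capacity_warning(n_params=250_000, n_train_samples=500)
print(f"params/sample ratio: {ratio:.1f}, overfitting risk: {risky}")
```

A flagged model is not automatically useless, but its validation loss deserves extra scrutiny (e.g. a held-out test set or cross-validation) before you trust it.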
Think Deeper
Try this:
Your grid search tests 3 widths x 3 depths x 3 learning rates x 3 batch sizes = 81 models. Each takes 2 minutes to train. How long does this take, and what is a faster alternative?
81 models x 2 min = 162 minutes (2.7 hours). A faster alternative is random search: sample 20-30 random combinations from the same space. Research by Bergstra & Bengio (2012) showed random search finds good hyperparameters in fewer trials than grid search because it explores more unique values per dimension. For security teams with limited compute, random search is the practical choice.
Cybersecurity tie-in: Security teams often have limited compute budgets and tight deployment timelines. Random search with a budget of 20 trials is usually the best practical strategy -- it finds a good-enough model quickly. Save exhaustive grid search for final production models where the difference between 95.0% and 95.5% detection rate translates to catching hundreds more real threats per day at scale.
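The budget arithmetic from the exercise generalises into a quick planning helper (the 2-minutes-per-model figure is the one assumed in the exercise above):

```python
def search_minutes(n_trials, minutes_per_model=2):
    """Total wall-clock minutes for a sequential hyperparameter search."""
    return n_trials * minutes_per_model

grid_trials = 3 * 3 * 3 * 3  # widths x depths x learning rates x batch sizes
print(search_minutes(grid_trials))  # full grid: 81 models
print(search_minutes(20))           # random-search budget: 20 models
```

Running this kind of estimate before launching a search makes the grid-versus-random trade-off concrete: here the random-search budget finishes in a quarter of the grid's wall-clock time.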