Step 4: Architecture Search

Systematic search over width and depth


Architecture as a hyperparameter

When people say "hyperparameter tuning," they often mean learning rate and batch size. But the network architecture itself is also a hyperparameter -- and often has a bigger impact on performance.

| Architecture hyperparameter | Options | Impact |
|---|---|---|
| Number of hidden layers (depth) | 1, 2, 3, ... | More depth = more complex function approximation |
| Units per layer (width) | 16, 32, 64, 128, 256 | More width = more capacity per layer |
| Activation function | relu, tanh, leaky_relu | Affects gradient flow and expressiveness |
| Dropout rate | 0.0 - 0.5 | Regularisation strength |
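Width and depth translate directly into parameter count, which is why the capacity claims in the table hold. You can verify this by counting Dense-layer parameters by hand; `count_dense_params` below is an illustrative helper, not part of the lab code:

```python
def count_dense_params(n_features, units, depth):
    # Each Dense layer contributes in*out weights plus out biases;
    # the network ends in a single sigmoid output unit.
    sizes = [n_features] + [units] * depth + [1]
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

print(count_dense_params(10, 32, 1))  # 385
print(count_dense_params(10, 32, 2))  # 1441
```

Doubling the width roughly quadruples the parameters in the hidden-to-hidden layers, while adding depth adds one width-by-width block at a time.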

The grid search strategy

A grid search tries every combination in a defined search space. It is systematic and thorough, but the number of combinations grows multiplicatively with every hyperparameter you add.

import itertools
import pandas as pd
from tensorflow import keras

# Define the search space
units_options = [32, 64, 128]
depth_options = [1, 2, 3]

# Total combinations: 3 x 3 = 9 models
results = []

for units, depth in itertools.product(units_options, depth_options):
    # Build model with this configuration
    layers = [keras.Input(shape=(10,))]
    for _ in range(depth):
        layers.append(keras.layers.Dense(units, activation='relu'))
    layers.append(keras.layers.Dense(1, activation='sigmoid'))

    model = keras.Sequential(layers)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Assumes X_train, y_train, X_val, y_val were prepared in an earlier step
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=50, batch_size=32, verbose=0)

    best_val_loss = min(history.history['val_loss'])
    n_params = model.count_params()

    results.append({
        'units': units, 'depth': depth,
        'params': n_params, 'best_val_loss': round(best_val_loss, 4),
    })

df = pd.DataFrame(results).sort_values('best_val_loss')
print(df.to_string(index=False))

Grid search vs random search

| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Grid search | Try every combination in a predefined grid | Thorough; covers the entire space | Expensive; wastes time on unimportant dimensions |
| Random search | Sample randomly from the hyperparameter space | Explores more unique values per dimension; often finds good configs faster | No guarantee of finding the absolute best |
| Bayesian (Keras Tuner) | Use past results to guide where to search next | Most efficient; focuses on promising regions | More complex to set up; overhead for small searches |
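Random search needs nothing beyond the standard library: just sample configurations instead of enumerating them. A minimal sketch (the option lists and the 10-trial budget are illustrative choices, not from the lab):

```python
import random

random.seed(42)  # reproducible sampling

units_options = [32, 64, 128, 256]
depth_options = [1, 2, 3]
lr_options = [1e-4, 1e-3, 1e-2]

# 10 random trials instead of the 4 * 3 * 3 = 36 cells a full grid would need
trials = [
    {'units': random.choice(units_options),
     'depth': random.choice(depth_options),
     'lr': random.choice(lr_options)}
    for _ in range(10)
]
for t in trials:
    print(t)
```

Each sampled configuration would then be built, trained, and logged exactly as in the grid-search loop above; only the way configurations are generated changes.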

Reading the results table

When comparing architecture search results, look for these patterns:

| Pattern | Meaning | Action |
|---|---|---|
| Wider models consistently beat narrow ones | The task needs more capacity per layer | Try even wider models, but watch for overfitting |
| Deeper models do not improve over shallow | The data is not complex enough for depth | Stick with 1-2 layers; add regularisation instead |
| Best model has far more params than training samples | Likely overfitting; val_loss may be misleading | Add dropout, reduce capacity, or get more data |
| All models perform similarly | Architecture is not the bottleneck | Focus on feature engineering or data quality instead |
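The params-versus-samples pattern can be checked mechanically once the results list exists. A sketch with made-up numbers: `n_train` and both result rows are hypothetical, not outputs of the lab:

```python
n_train = 500  # hypothetical training-set size

results = [
    {'units': 32,  'depth': 1, 'params': 385,    'best_val_loss': 0.31},
    {'units': 256, 'depth': 3, 'params': 134657, 'best_val_loss': 0.29},
]

# Flag any configuration with more parameters than training samples
suspect = [r for r in results if r['params'] > n_train]
for r in suspect:
    print(f"{r['units']}x{r['depth']}: {r['params']} params > {n_train} samples "
          "-- treat its val_loss with caution")
```

Here the big model wins on val_loss, but with 269 parameters per training sample, that win deserves scepticism before deployment.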

Think Deeper

Your grid search tests 3 widths x 3 depths x 3 learning rates x 3 batch sizes = 81 models. Each takes 2 minutes to train. How long does this take, and what is a faster alternative?

Answer: 81 models x 2 min = 162 minutes (2.7 hours). A faster alternative is random search: sample 20-30 random combinations from the same space. Research by Bergstra & Bengio (2012) showed random search finds good hyperparameters in fewer trials than grid search because it explores more unique values per dimension. For security teams with limited compute, random search is the practical choice.
Cybersecurity tie-in: Security teams often have limited compute budgets and tight deployment timelines. Random search with a budget of 20 trials is usually the best practical strategy -- it finds a good-enough model quickly. Save exhaustive grid search for final production models where the difference between 95.0% and 95.5% detection rate translates to catching hundreds more real threats per day at scale.
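The budget arithmetic above is easy to sanity-check in code (the 25-trial random budget is an assumed mid-range value from the 20-30 range mentioned):

```python
minutes_per_model = 2
grid_trials = 3 ** 4    # 3 widths x 3 depths x 3 learning rates x 3 batch sizes
random_trials = 25      # assumed random-search budget

print(grid_trials, 'grid trials ->', grid_trials * minutes_per_model, 'minutes')
print(random_trials, 'random trials ->', random_trials * minutes_per_model, 'minutes')
```

That is 162 minutes for the full grid versus 50 minutes for the random budget, a roughly 3x saving for one extra hyperparameter swept.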
