Step 4: Architecture Search

Systematic search over width and depth


Architecture as a hyperparameter

When people say "hyperparameter tuning," they often mean learning rate and batch size. But the network architecture itself is also a hyperparameter -- and often has a bigger impact on performance.

| Architecture hyperparameter | Options | Impact |
|---|---|---|
| Number of hidden layers (depth) | 1, 2, 3, ... | More depth = more complex function approximation |
| Units per layer (width) | 16, 32, 64, 128, 256 | More width = more capacity per layer |
| Activation function | relu, tanh, leaky_relu | Affects gradient flow and expressiveness |
| Dropout rate | 0.0 - 0.5 | Regularisation strength |
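Width and depth translate directly into parameter count, which is why the capacity claims in the table hold. You can verify this by counting Dense-layer parameters by hand; `count_dense_params` below is an illustrative helper, not part of the lab code:

```python
def count_dense_params(n_features, units, depth):
    # Each Dense layer contributes in*out weights plus out biases;
    # the network ends in a single sigmoid output unit.
    sizes = [n_features] + [units] * depth + [1]
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

print(count_dense_params(10, 32, 1))  # 385
print(count_dense_params(10, 32, 2))  # 1441
```

Doubling the width roughly quadruples the parameters in the hidden-to-hidden layers, while adding depth adds one width-by-width block at a time.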

The grid search strategy

A grid search tries every combination in a defined search space. It is systematic and thorough, but the number of combinations grows multiplicatively with every hyperparameter you add.

import itertools
import pandas as pd
from tensorflow import keras

# Define the search space
units_options = [32, 64, 128]
depth_options = [1, 2, 3]

# Total combinations: 3 x 3 = 9 models
results = []

for units, depth in itertools.product(units_options, depth_options):
    # Build model with this configuration
    layers = [keras.Input(shape=(10,))]
    for _ in range(depth):
        layers.append(keras.layers.Dense(units, activation='relu'))
    layers.append(keras.layers.Dense(1, activation='sigmoid'))

    model = keras.Sequential(layers)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Assumes X_train, y_train, X_val, y_val were prepared in an earlier step
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=50, batch_size=32, verbose=0)

    best_val_loss = min(history.history['val_loss'])
    n_params = model.count_params()

    results.append({
        'units': units, 'depth': depth,
        'params': n_params, 'best_val_loss': round(best_val_loss, 4),
    })

df = pd.DataFrame(results).sort_values('best_val_loss')
print(df.to_string(index=False))

Grid search vs random search

| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Grid search | Try every combination in a predefined grid | Thorough; covers the entire space | Expensive; wastes time on unimportant dimensions |
| Random search | Sample randomly from the hyperparameter space | Explores more unique values per dimension; often finds good configs faster | No guarantee of finding the absolute best |
| Bayesian (Keras Tuner) | Use past results to guide where to search next | Most efficient; focuses on promising regions | More complex to set up; overhead for small searches |
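Random search needs nothing beyond the standard library: just sample configurations instead of enumerating them. A minimal sketch (the option lists and the 10-trial budget are illustrative choices, not from the lab):

```python
import random

random.seed(42)  # reproducible sampling

units_options = [32, 64, 128, 256]
depth_options = [1, 2, 3]
lr_options = [1e-4, 1e-3, 1e-2]

# 10 random trials instead of the 4 * 3 * 3 = 36 cells a full grid would need
trials = [
    {'units': random.choice(units_options),
     'depth': random.choice(depth_options),
     'lr': random.choice(lr_options)}
    for _ in range(10)
]
for t in trials:
    print(t)
```

Each sampled configuration would then be built, trained, and logged exactly as in the grid-search loop above; only the way configurations are generated changes.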

Reading the results table

When comparing architecture search results, look for these patterns:

| Pattern | Meaning | Action |
|---|---|---|
| Wider models consistently beat narrow ones | The task needs more capacity per layer | Try even wider models, but watch for overfitting |
| Deeper models do not improve over shallow | The data is not complex enough for depth | Stick with 1-2 layers; add regularisation instead |
| Best model has far more params than training samples | Likely overfitting; val_loss may be misleading | Add dropout, reduce capacity, or get more data |
| All models perform similarly | Architecture is not the bottleneck | Focus on feature engineering or data quality instead |
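The params-versus-samples pattern can be checked mechanically once the results list exists. A sketch with made-up numbers: `n_train` and both result rows are hypothetical, not outputs of the lab:

```python
n_train = 500  # hypothetical training-set size

results = [
    {'units': 32,  'depth': 1, 'params': 385,    'best_val_loss': 0.31},
    {'units': 256, 'depth': 3, 'params': 134657, 'best_val_loss': 0.29},
]

# Flag any configuration with more parameters than training samples
suspect = [r for r in results if r['params'] > n_train]
for r in suspect:
    print(f"{r['units']}x{r['depth']}: {r['params']} params > {n_train} samples "
          "-- treat its val_loss with caution")
```

Here the big model wins on val_loss, but with 269 parameters per training sample, that win deserves scepticism before deployment.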

Think Deeper

Your grid search tests 3 widths x 3 depths x 3 learning rates x 3 batch sizes = 81 models. Each takes 2 minutes to train. How long does this take, and what is a faster alternative?

Answer: 81 models x 2 min = 162 minutes (2.7 hours). A faster alternative is random search: sample 20-30 random combinations from the same space. Research by Bergstra & Bengio (2012) showed random search finds good hyperparameters in fewer trials than grid search because it explores more unique values per dimension. For security teams with limited compute, random search is the practical choice.
Cybersecurity tie-in: Security teams often have limited compute budgets and tight deployment timelines. Random search with a budget of 20 trials is usually the best practical strategy -- it finds a good-enough model quickly. Save exhaustive grid search for final production models where the difference between 95.0% and 95.5% detection rate translates to catching hundreds more real threats per day at scale.
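The budget arithmetic above is easy to sanity-check in code (the 25-trial random budget is an assumed mid-range value from the 20-30 range mentioned):

```python
minutes_per_model = 2
grid_trials = 3 ** 4    # 3 widths x 3 depths x 3 learning rates x 3 batch sizes
random_trials = 25      # assumed random-search budget

print(grid_trials, 'grid trials ->', grid_trials * minutes_per_model, 'minutes')
print(random_trials, 'random trials ->', random_trials * minutes_per_model, 'minutes')
```

That is 162 minutes for the full grid versus 50 minutes for the random budget, a roughly 3x saving for one extra hyperparameter swept.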
