Step 2: Conv2D & MaxPooling

Sliding filters and downsampling

How a convolutional filter works

A Conv2D(32, (3,3)) layer creates 32 filters, each 3x3 pixels. Each filter slides across the image one position at a time, computing a dot product (element-wise multiply + sum) at each position plus a bias term.

The same filter weights are applied at every position -- this is weight sharing. A "horizontal edge detector" learned once detects horizontal edges everywhere in the image.
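The sliding dot product can be written out directly. Below is a minimal NumPy sketch of a single filter (strictly speaking this is cross-correlation, which is what Keras actually computes); the kernel values and the toy image are illustrative, not from any trained model:

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0):
    """Slide one kernel over a 2D image (stride 1, 'valid' padding).

    At each position: element-wise multiply, sum, add the bias.
    The same weights are reused at every position (weight sharing).
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel) + bias
    return out

# A hand-made horizontal edge detector: responds where brightness
# changes from top to bottom, at any position in the image.
edge_kernel = np.array([[ 1,  1,  1],
                        [ 0,  0,  0],
                        [-1, -1, -1]])

image = np.zeros((28, 28))
image[14:, :] = 1.0              # bright bottom half -> one horizontal edge
fmap = conv2d_single(image, edge_kernel)
print(fmap.shape)                # (26, 26): 28 - 3 + 1
```

The feature map is zero in the flat regions and nonzero only along the edge rows, wherever they fall horizontally, which is weight sharing in action.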

| Component | Description | Example |
|---|---|---|
| Filter (kernel) | Small learnable weight matrix | 3x3 = 9 weights per filter |
| Stride | How many pixels to move between positions | stride=1 (default): slide 1 pixel at a time |
| Padding | Whether to pad edges with zeros | 'valid' (no padding) or 'same' (preserve size) |
| Feature map | Output of one filter applied across the entire image | 26x26 map from 3x3 filter on 28x28 input |

Shape arithmetic

For Conv2D with padding='valid' (default) and stride=1:

output_size = input_size - kernel_size + 1
| Input size | Kernel size | Output per filter |
|---|---|---|
| 28 x 28 | 3 x 3 | 26 x 26 |
| 28 x 28 | 5 x 5 | 24 x 24 |
| 26 x 26 | 3 x 3 | 24 x 24 |
| 13 x 13 | 3 x 3 | 11 x 11 |
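The formula is easy to check in code. This small helper (an illustrative sketch, not a Keras API) computes the output size per spatial dimension and reproduces the table rows:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Spatial output size of one Conv2D dimension."""
    if padding == "same":
        return -(-input_size // stride)  # ceil; size preserved at stride 1
    return (input_size - kernel_size) // stride + 1

# Verify the table above (padding='valid', stride=1):
for inp, k in [(28, 3), (28, 5), (26, 3), (13, 3)]:
    print(f"{inp} x {inp}, kernel {k} x {k} -> {conv_output_size(inp, k)}")
```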

MaxPooling2D -- downsampling

MaxPooling2D((2,2)) takes every 2x2 block and keeps only the maximum value. This halves each spatial dimension, reducing computation and making the network more robust to small shifts in position.

# MaxPooling2D((2,2)) on a 4x4 input:
# [1, 3, 2, 1]
# [4, 2, 0, 3]  =>  [4, 3]
# [1, 0, 5, 1]      [4, 5]
# [3, 4, 2, 0]
# Each 2x2 block -> its maximum value
| Input size | Pool size | Output size |
|---|---|---|
| 26 x 26 | 2 x 2 | 13 x 13 |
| 24 x 24 | 2 x 2 | 12 x 12 |
| 11 x 11 | 2 x 2 | 5 x 5 (odd sizes are floored; the last row/column is dropped) |
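Pooling is just a reshape plus a max along the block axes. A minimal NumPy sketch (the function name is ours, not a Keras API) that reproduces the 4x4 example above:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (stride 2), like MaxPooling2D((2, 2))."""
    h, w = x.shape
    # Trim odd edges (Keras floors them), group into 2x2 blocks, take each max.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 3],
              [1, 0, 5, 1],
              [3, 4, 2, 0]])
print(max_pool_2x2(x))   # [[4 3]
                         #  [4 5]]
```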

Parameter comparison: Conv vs Dense

Weight sharing makes Conv layers dramatically more efficient:

| Layer | Calculation | Parameters |
|---|---|---|
| Conv2D(32, (3,3)) on 28x28x1 | (3 x 3 x 1) x 32 + 32 biases | 320 |
| Conv2D(64, (3,3)) on 13x13x32 | (3 x 3 x 32) x 64 + 64 biases | 18,496 |
| Dense(128) on flattened 784 | 784 x 128 + 128 biases | 100,480 |
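The parameter counts above follow from two short formulas, sketched here as plain Python (these helpers are ours; in Keras you would read the same numbers off `model.summary()`):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Conv2D parameters: (kernel weights x input channels + 1 bias) per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_features, units):
    """Dense parameters: full weight matrix + one bias per unit."""
    return in_features * units + units

print(conv2d_params(3, 3, 1, 32))    # 320
print(conv2d_params(3, 3, 32, 64))   # 18496
print(dense_params(784, 128))        # 100480
```

Note that Conv2D's count is independent of the 28x28 spatial size; Dense's count scales with every input pixel.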

Think Deeper

A Conv2D(32, (3,3)) on a 28x28 input creates 32 feature maps of size 26x26. How many parameters does this layer have compared to a Dense(676) layer on the same flattened input?

Conv2D: each filter has 3x3x1 = 9 weights + 1 bias = 10 params, times 32 filters = 320 parameters. Dense(676) on 784 inputs: 784 x 676 + 676 = 530,660 parameters. The Conv layer covers the same output positions with roughly 1,658x fewer parameters through weight sharing -- the same 3x3 filter is reused at every position.
Cybersecurity tie-in: Conv filters learn to detect local patterns regardless of position -- exactly what you need for network traffic analysis. A malicious payload signature (specific byte sequence) could appear at any offset in a packet. A 1D Conv filter trained on raw packet bytes will detect it wherever it occurs, just as an image Conv filter detects edges at any position.
