How a convolutional filter works
A Conv2D(32, (3,3)) layer creates 32 filters, each 3x3 pixels. Each filter slides across the image one position at a time, computing a dot product (element-wise multiply and sum) with the underlying patch plus a bias term at each position.
The same filter weights are applied at every position -- this is weight sharing. A "horizontal edge detector" learned once detects horizontal edges everywhere in the image.
| Component | Description | Example |
|---|---|---|
| Filter (kernel) | Small learnable weight matrix | 3x3 = 9 weights per filter |
| Stride | How many pixels to move between positions | stride=1 (default): slide 1 pixel at a time |
| Padding | Whether to pad edges with zeros | 'valid' (no padding) or 'same' (preserve size) |
| Feature map | Output of one filter applied across the entire image | 26x26 map from 3x3 filter on 28x28 input |
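The sliding dot product above can be sketched directly in NumPy. This is a minimal illustration with one hypothetical random 3x3 filter on a 28x28 single-channel input (the weights and bias here are made up for demonstration, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # single-channel 28x28 input
kernel = rng.random((3, 3))    # one 3x3 filter: 9 weights
bias = 0.1                     # one bias per filter

# 'valid' padding, stride 1: output = input - kernel + 1 = 26
out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1

feature_map = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        # dot product: element-wise multiply, sum, add bias
        feature_map[i, j] = np.sum(patch * kernel) + bias

print(feature_map.shape)  # (26, 26)
```

Note that the same `kernel` and `bias` are reused at every position; that reuse is exactly the weight sharing described above.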
Shape arithmetic
For Conv2D with padding='valid' (default) and stride=1:
output_size = input_size - kernel_size + 1
| Input size | Kernel size | Output per filter |
|---|---|---|
| 28 x 28 | 3 x 3 | 26 x 26 |
| 28 x 28 | 5 x 5 | 24 x 24 |
| 26 x 26 | 3 x 3 | 24 x 24 |
| 13 x 13 | 3 x 3 | 11 x 11 |
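The table above can be checked with a one-line helper (the general formula with stride included, which reduces to `input - kernel + 1` at stride 1):

```python
def conv_output_size(input_size, kernel_size, stride=1):
    # 'valid' padding: floor((input - kernel) / stride) + 1
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(28, 3))  # 26
print(conv_output_size(28, 5))  # 24
print(conv_output_size(26, 3))  # 24
print(conv_output_size(13, 3))  # 11
```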
MaxPooling2D -- downsampling
MaxPooling2D((2,2)) takes every non-overlapping 2x2 block and keeps only the maximum value. This halves each spatial dimension, reducing computation and making the network more robust to small shifts in position. When a dimension is odd, the leftover row or column is dropped, which is why an 11x11 input pools to 5x5 rather than 5.5x5.5.
```python
# MaxPooling2D(2,2) on a 4x4 input:
# [1, 3, 2, 1]        [4, 3]
# [4, 2, 0, 3]   =>   [4, 5]
# [1, 0, 5, 1]
# [3, 4, 2, 0]
# Each 2x2 block -> its maximum value
```
| Input size | Pool size | Output size |
|---|---|---|
| 26 x 26 | 2 x 2 | 13 x 13 |
| 11 x 11 | 2 x 2 | 5 x 5 |
| 24 x 24 | 2 x 2 | 12 x 12 |
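A compact NumPy sketch of 2x2 max pooling, applied to the 4x4 matrix from the example above (reshape into 2x2 blocks, then take the max of each block):

```python
import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 3],
              [1, 0, 5, 1],
              [3, 4, 2, 0]])

# Split the 4x4 array into 2x2 blocks: axes 0/2 index the blocks,
# axes 1/3 index positions within each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 3]
#  [4 5]]
```

This reshape trick only works when the input divides evenly by the pool size; a real pooling layer simply discards the leftover rows/columns of an odd-sized input.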
Parameter comparison: Conv vs Dense
Weight sharing makes Conv layers dramatically more efficient:
| Layer | Calculation | Parameters |
|---|---|---|
| Conv2D(32, (3,3)) on 28x28x1 | (3 x 3 x 1) x 32 + 32 biases | 320 |
| Conv2D(64, (3,3)) on 13x13x32 | (3 x 3 x 32) x 64 + 64 biases | 18,496 |
| Dense(128) on flattened 784 | 784 x 128 + 128 biases | 100,480 |
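The counts in the table follow directly from two small formulas, sketched here as helper functions:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights,
    # plus one bias per filter.
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(inputs, units):
    # Every input connects to every unit, plus one bias per unit.
    return inputs * units + units

print(conv2d_params(3, 3, 1, 32))   # 320
print(conv2d_params(3, 3, 32, 64))  # 18496
print(dense_params(784, 128))       # 100480
```

Note that the Conv2D counts depend only on the kernel size and channel counts, never on the 28x28 or 13x13 spatial size of the input; that independence is the payoff of weight sharing.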
Think Deeper
A Conv2D(32, (3,3)) on a 28x28 input creates 32 feature maps of size 26x26. How many parameters does this layer have compared to a Dense(676) layer on the same flattened input?