How a convolutional filter works
A Conv2D(32, (3,3)) layer creates 32 filters, each 3x3 pixels. Each filter slides across the image one position at a time, computing a dot product (element-wise multiply and sum) with the underlying patch plus a bias term at each position.
The same filter weights are applied at every position -- this is weight sharing. A "horizontal edge detector" learned once detects horizontal edges everywhere in the image.
| Component | Description | Example |
|---|---|---|
| Filter (kernel) | Small learnable weight matrix | 3x3 = 9 weights per filter |
| Stride | How many pixels to move between positions | stride=1 (default): slide 1 pixel at a time |
| Padding | Whether to pad edges with zeros | 'valid' (no padding) or 'same' (preserve size) |
| Feature map | Output of one filter applied across the entire image | 26x26 map from 3x3 filter on 28x28 input |
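The sliding dot product above can be sketched directly in NumPy. This is a minimal illustration with one hypothetical random 3x3 filter on a 28x28 single-channel input (the weights and bias here are made up for demonstration, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # single-channel 28x28 input
kernel = rng.random((3, 3))    # one 3x3 filter: 9 weights
bias = 0.1                     # one bias per filter

# 'valid' padding, stride 1: output = input - kernel + 1 = 26
out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1

feature_map = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        # dot product: element-wise multiply, sum, add bias
        feature_map[i, j] = np.sum(patch * kernel) + bias

print(feature_map.shape)  # (26, 26)
```

Note that the same `kernel` and `bias` are reused at every position; that reuse is exactly the weight sharing described above.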
Shape arithmetic
For Conv2D with padding='valid' (default) and stride=1:
output_size = input_size - kernel_size + 1
| Input size | Kernel size | Output per filter |
|---|---|---|
| 28 x 28 | 3 x 3 | 26 x 26 |
| 28 x 28 | 5 x 5 | 24 x 24 |
| 26 x 26 | 3 x 3 | 24 x 24 |
| 13 x 13 | 3 x 3 | 11 x 11 |
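The table above can be checked with a one-line helper (the general formula with stride included, which reduces to `input - kernel + 1` at stride 1):

```python
def conv_output_size(input_size, kernel_size, stride=1):
    # 'valid' padding: floor((input - kernel) / stride) + 1
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(28, 3))  # 26
print(conv_output_size(28, 5))  # 24
print(conv_output_size(26, 3))  # 24
print(conv_output_size(13, 3))  # 11
```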
MaxPooling2D -- downsampling
MaxPooling2D((2,2)) takes every non-overlapping 2x2 block and keeps only the maximum value. This halves each spatial dimension, reducing computation and making the network more robust to small shifts in position. When a dimension is odd, the leftover row or column is dropped, which is why an 11x11 input pools to 5x5 rather than 5.5x5.5.
```python
# MaxPooling2D(2,2) on a 4x4 input:
# [1, 3, 2, 1]        [4, 3]
# [4, 2, 0, 3]   =>   [4, 5]
# [1, 0, 5, 1]
# [3, 4, 2, 0]
# Each 2x2 block -> its maximum value
```
| Input size | Pool size | Output size |
|---|---|---|
| 26 x 26 | 2 x 2 | 13 x 13 |
| 11 x 11 | 2 x 2 | 5 x 5 |
| 24 x 24 | 2 x 2 | 12 x 12 |
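A compact NumPy sketch of 2x2 max pooling, applied to the 4x4 matrix from the example above (reshape into 2x2 blocks, then take the max of each block):

```python
import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 3],
              [1, 0, 5, 1],
              [3, 4, 2, 0]])

# Split the 4x4 array into 2x2 blocks: axes 0/2 index the blocks,
# axes 1/3 index positions within each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 3]
#  [4 5]]
```

This reshape trick only works when the input divides evenly by the pool size; a real pooling layer simply discards the leftover rows/columns of an odd-sized input.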
Parameter comparison: Conv vs Dense
Weight sharing makes Conv layers dramatically more efficient:
| Layer | Calculation | Parameters |
|---|---|---|
| Conv2D(32, (3,3)) on 28x28x1 | (3 x 3 x 1) x 32 + 32 biases | 320 |
| Conv2D(64, (3,3)) on 13x13x32 | (3 x 3 x 32) x 64 + 64 biases | 18,496 |
| Dense(128) on flattened 784 | 784 x 128 + 128 biases | 100,480 |
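The counts in the table follow directly from two small formulas, sketched here as helper functions:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights,
    # plus one bias per filter.
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(inputs, units):
    # Every input connects to every unit, plus one bias per unit.
    return inputs * units + units

print(conv2d_params(3, 3, 1, 32))   # 320
print(conv2d_params(3, 3, 32, 64))  # 18496
print(dense_params(784, 128))       # 100480
```

Note that the Conv2D counts depend only on the kernel size and channel counts, never on the 28x28 or 13x13 spatial size of the input; that independence is the payoff of weight sharing.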
Think Deeper
A Conv2D(32, (3,3)) on a 28x28 input creates 32 feature maps of size 26x26. How many parameters does this layer have compared to a Dense(676) layer on the same flattened input?