Convolution & CNNs

The building block of computer vision — slide a small learnable kernel over an image to extract features.

Even a small image has thousands of pixels, and a fully connected layer needs a separate weight from every pixel to every hidden unit, so the parameter count explodes. Convolutional Neural Networks (CNNs) solve this with one operation, convolution, which scans the image with a small shared filter and is the engine behind modern vision.

A kernel is a small grid of weights (here $3 \times 3$ ). It slides across the input; at each stop it multiplies overlapping values and sums them into one output number. Pick a kernel below and step the window to fill the feature map.

[Interactive figure: a 3×3 kernel slides over a 6×6 input to produce a 4×4 feature map; each output cell is sum(window × kernel).]

The sliding dot product

At every position the kernel computes a weighted sum of the patch it covers:

$\text{out}[r,c] = \sum_{i}\sum_{j} \text{img}[r+i,\,c+j]\;\cdot\;\text{kernel}[i,j]$

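Slide one step right, repeat; reach the edge, drop down a row. The grid of results is the feature map. With a $3 \times 3$ kernel and no padding, a $6 \times 6$ input shrinks to $4 \times 4$ because the window cannot hang off the edge (this is a valid convolution). Strictly speaking, the formula above is a cross-correlation, since the kernel is not flipped; deep learning libraries compute exactly this under the name convolution, and because the weights are learned the distinction rarely matters.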
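The sliding dot product can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration rather than how a real library implements it; the function name `conv2d_valid` and the example input are made up for this sketch.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide `kernel` over `img` with no padding and a stride of 1."""
    k_h, k_w = kernel.shape
    out_h = img.shape[0] - k_h + 1   # the window cannot hang off the edge
    out_w = img.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = img[r:r + k_h, c:c + k_w]    # region under the kernel
            out[r, c] = np.sum(patch * kernel)   # multiply and sum
    return out

img = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6x6 input
kernel = np.ones((3, 3)) / 9.0                  # box-blur kernel
feature_map = conv2d_valid(img, kernel)
print(feature_map.shape)   # (4, 4)
print(feature_map[0, 0])   # 7.0, the mean of the top-left 3x3 patch
```

The two nested loops mirror the double sum in the formula; production code vectorizes them, but the result is the same grid of weighted sums.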

Convolution arithmetic

How big is the output? For a square input of size $i$ , kernel $k$ , padding $p$ , and stride $s$ , each output side is

$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1$

Change the padding, stride, and dilation below and watch the output grid — and the formula — update as the kernel sweeps the padded input.

[Interactive figure: controls for input i, kernel k, padding p, stride s, and dilation d. Default state: i = 5, k = 3, p = 0, s = 1, d = 1, so the 5×5 input gives a 3×3 output: o = ⌊(5 + 2·0 − 3) / 1⌋ + 1 = 3.]

Padding adds a border of zeros so the kernel can reach the edges:

Valid ( $p = 0$ ): no padding; at stride 1 each output side is $k - 1$ shorter than the input.
Same ( $p = \lfloor (k-1)/2 \rfloor$ with $s = 1$ , odd $k$ ): output matches the input.
Full ( $p = k - 1$ ): every possible overlap is used and, at stride 1, each output side grows to $i + k - 1$ .
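The three padding modes can be compared directly. The sketch below zero-pads with `np.pad` and uses NumPy's `sliding_window_view` (available in NumPy 1.20 and later) to gather every window; `conv2d_padded` is a hypothetical helper name, and the all-ones input and kernel are chosen only to make the overlaps easy to read.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_padded(img, kernel, p):
    """Zero-pad `img` by `p` on every side, then convolve at stride 1."""
    padded = np.pad(img, p, mode="constant", constant_values=0)
    windows = sliding_window_view(padded, kernel.shape)  # (out_h, out_w, k, k)
    return np.einsum("rcij,ij->rc", windows, kernel)

img = np.ones((6, 6))
kernel = np.ones((3, 3))
k = kernel.shape[0]

shapes = {}
for name, p in [("valid", 0), ("same", (k - 1) // 2), ("full", k - 1)]:
    shapes[name] = conv2d_padded(img, kernel, p).shape
    print(name, "p =", p, "->", shapes[name])
# valid p = 0 -> (4, 4)
# same p = 1 -> (6, 6)
# full p = 2 -> (8, 8)
```

With an all-ones input and kernel, each output value counts how many real (non-padding) pixels the window covered, so the corner of the full output is 1 and the corner of the same output is 4.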

Stride $s$ is how far the kernel jumps each step; a stride of 2 roughly halves each dimension — a cheap way to downsample instead of pooling. Dilation $d$ spreads the kernel’s taps apart with gaps, enlarging the receptive field without adding weights. The effective kernel size becomes $k_{\text{eff}} = k + (k-1)(d-1)$ , so

$o = \left\lfloor \frac{i + 2p - k - (k-1)(d-1)}{s} \right\rfloor + 1$
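As a sanity check on the formula, the sketch below computes the output size two ways: with the closed form, and by counting the start positions at which the dilated window still fits inside the padded input. Both function names are illustrative, and the sketch assumes the padded input is at least as large as the effective kernel.

```python
def conv_output_size(i, k, p=0, s=1, d=1):
    """Closed-form output side for input i, kernel k, padding p, stride s, dilation d."""
    k_eff = k + (k - 1) * (d - 1)          # effective (dilated) kernel size
    return (i + 2 * p - k_eff) // s + 1    # // is floor division

def count_window_positions(i, k, p=0, s=1, d=1):
    """Brute force: count start offsets where the dilated window fits."""
    padded = i + 2 * p
    span = (k - 1) * d + 1                 # same quantity as k_eff
    return sum(1 for start in range(0, padded, s) if start + span <= padded)

print(conv_output_size(5, 3))              # 3
print(conv_output_size(6, 3))              # 4
print(conv_output_size(6, 3, p=1, s=2))    # 3
print(conv_output_size(7, 3, d=2))         # 3
```

The dilated case shows the trade-off: a $3 \times 3$ kernel with $d = 2$ covers a $5 \times 5$ area, so a side of 7 gives only 3 outputs, yet the kernel still has 9 weights.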

Transposed convolution (upsampling)

Encoders shrink images, but decoders, segmentation heads, and generative models must go the other way, to larger outputs. A transposed convolution (also called fractionally-strided, or loosely “deconvolution”) does this: conceptually it spaces the input out by the stride and then convolves, so each input cell paints a kernel-sized patch onto a bigger canvas. Switch the visualizer above to Transposed to watch one small grid grow. With dilation $d = 1$ and no extra output padding, the output side is

$o = s\,(i - 1) + k - 2p$

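It is the mirror of the forward pass: it has the same connectivity as a convolution with the same $k$ , $s$ , and $p$ , run backward, so it maps that convolution's output shape back to its input shape (exactly so when $s$ divides $i + 2p - k$ ). This is why it appears wherever a network reconstructs spatial detail.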
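The "paint a patch per input cell" picture translates almost directly into code. The sketch below assumes square inputs and kernels and dilation 1; `conv2d` and `conv_transpose2d` are illustrative names for this sketch, not a library API. The final check confirms the "same connectivity, run backward" claim numerically: the transposed convolution is the adjoint of the strided convolution, so the two inner products agree.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Plain strided convolution, no padding (square input and kernel)."""
    k = kernel.shape[0]
    o = (x.shape[0] - k) // stride + 1
    out = np.zeros((o, o))
    for r in range(o):
        for c in range(o):
            patch = x[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = np.sum(patch * kernel)
    return out

def conv_transpose2d(x, kernel, stride=1, padding=0):
    """Each input cell paints a scaled copy of the kernel onto a larger canvas."""
    i, k = x.shape[0], kernel.shape[0]
    full = stride * (i - 1) + k                  # canvas side before cropping
    canvas = np.zeros((full, full))
    for r in range(i):
        for c in range(i):
            canvas[r * stride:r * stride + k,
                   c * stride:c * stride + k] += x[r, c] * kernel
    # Padding p crops p cells from every edge, giving s(i - 1) + k - 2p.
    return canvas[padding:full - padding, padding:full - padding]

up = conv_transpose2d(np.ones((2, 2)), np.ones((3, 3)), stride=2)
print(up.shape)    # (5, 5)
print(up[2, 2])    # 4.0, the cell where all four painted patches overlap

# Adjoint check: <conv(x), y> equals <x, conv_transpose(y)>.
rng = np.random.default_rng(0)
x, y, kern = rng.random((5, 5)), rng.random((2, 2)), rng.random((3, 3))
lhs = np.sum(conv2d(x, kern, stride=2) * y)
rhs = np.sum(x * conv_transpose2d(y, kern, stride=2))
print(np.isclose(lhs, rhs))  # True
```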

What kernels detect

The kernel’s weights decide what pattern lights up. An edge kernel has a strong positive center against negative neighbors, so it fires where brightness changes sharply and stays near zero on flat regions. A blur (box) kernel averages the patch, smoothing noise. In a CNN these weights are not hand-designed — they are learned by gradient descent, so the network discovers whatever filters best reduce its loss.
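To see these two behaviours concretely, the sketch below applies a hand-set edge kernel and a box-blur kernel to a tiny synthetic image with a single vertical step from dark to bright. The specific kernel values (center 8, neighbors −1) and the image are assumptions chosen for illustration; a trained CNN would learn its own weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(img, kernel):
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("rcij,ij->rc", windows, kernel)

# 6x6 image: left half dark (0), right half bright (1).
img = np.zeros((6, 6))
img[:, 3:] = 1.0

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)   # weights sum to zero
blur = np.ones((3, 3)) / 9.0                   # weights sum to one

edge_map = conv2d_valid(img, edge)
blur_map = conv2d_valid(img, blur)
print(edge_map[0])  # [ 0. -3.  3.  0.]
print(blur_map[0])  # [0, 1/3, 2/3, 1]: the hard step becomes a ramp
```

Because the edge kernel's weights sum to zero, flat regions produce exactly zero and only the windows straddling the step respond. The blur kernel's weights sum to one, so flat regions keep their brightness while the step is smoothed.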

Why convolution wins for images

Parameter sharing: the same $3 \times 3$ kernel is reused at every position, so a single-channel filter has just 9 weights regardless of image size, instead of one per pixel.
Translation equivariance: shift the input and the feature map shifts with it, so a feature is detected wherever it appears, because the same filter sweeps the whole image. Pooling then adds a degree of invariance to small shifts.
Locality: each output depends only on a small neighborhood (its receptive field), matching how image structure is local.
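Two of these properties are easy to check numerically. The sketch below first compares weight counts for a hypothetical $32 \times 32$ single-channel image (the sizes are illustrative, and biases are ignored), then verifies that shifting the input by one column shifts the feature map by one column.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(img, kernel):
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("rcij,ij->rc", windows, kernel)

# Parameter sharing: map a 32x32 image to a 32x32 output.
h = w = 32
dense_weights = (h * w) * (h * w)   # one weight per (input pixel, output pixel)
conv_weights = 3 * 3                # one shared 3x3 kernel
print(dense_weights, conv_weights)  # 1048576 9

# Shift behaviour: move the image one column right and compare feature maps.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))
shifted = np.roll(img, 1, axis=1)   # wraps at the edge, so skip column 0 below
out = conv2d_valid(img, kernel)
out_shifted = conv2d_valid(shifted, kernel)
shifts_together = np.allclose(out_shifted[:, 1:], out[:, :-1])
print(shifts_together)  # True
```

The first output column is excluded from the comparison because `np.roll` wraps pixels around from the opposite edge, which is an artifact of the test rather than of convolution.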

Stacking into a network

One convolutional layer applies many kernels, producing a stack of feature maps. A pooling step (e.g. taking the max over each $2 \times 2$ block) then shrinks them, halving each side, which adds robustness to small shifts and cuts computation. Stack these layers and the features compose into a hierarchy: early layers find edges, middle layers find textures and shapes, deep layers find objects. This is the same edges $\to$ shapes $\to$ faces idea from the neural-networks lesson, now built directly into the architecture.
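The pooling step is simple enough to sketch directly. The version below assumes non-overlapping $2 \times 2$ blocks on a single feature map and drops any odd trailing row or column; `max_pool_2x2` is an illustrative name.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the max over each non-overlapping 2x2 block of a 2-D feature map."""
    h2, w2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    # Reshape so each 2x2 block gets its own pair of axes, then reduce over them.
    blocks = fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4x4 feature map
pooled = max_pool_2x2(fmap)
print(pooled)
# [[ 5.  7.]
#  [13. 15.]]
```

Each side is halved, so the next convolutional layer processes a quarter as many positions, and each of its outputs depends on a larger region of the original image.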

References

Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285 — the definitive convolution-arithmetic animations and formulas (figures & code, free to use with attribution).
CS231n — Convolutional Networks, Stanford.

Takeaways

Convolution slides a small kernel over an image, computing a dot product at each stop to build a feature map.
Kernel weights determine the detected pattern (edge, blur, …) and in a CNN are learned, not hand-set.
Parameter sharing, locality, and pooling let stacked conv layers learn a hierarchy of visual features efficiently.
Output size is $o = \lfloor (i + 2p - k)/s \rfloor + 1$ ; padding, stride, and dilation trade off resolution, cost, and receptive field, while transposed convolution upsamples ( $o = s(i-1) + k - 2p$ ).