Convolution & CNNs
The building block of computer vision — slide a small learnable kernel over an image to extract features.
A small image already has thousands of pixels; a fully connected network would need a separate weight for every pixel in every hidden unit and would quickly drown in parameters. Convolutional Neural Networks (CNNs) solve this with one elegant operation, convolution, which scans the image with a tiny shared filter and is the engine behind modern vision.
A kernel is a small grid of weights (for example $3 \times 3$). It slides across the input; at each stop it multiplies the overlapping values elementwise and sums them into one output number. Pick a kernel below and step the window to fill the feature map.
The sliding dot product
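At every position the kernel computes a weighted sum of the patch it covers. Writing $X$ for the input, $K$ for a $k \times k$ kernel, and $Y$ for the output:

$$Y[i, j] = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K[u, v] \, X[i + u, j + v]$$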
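Slide one step right and repeat; at the edge, drop down a row. The grid of results is the feature map. With a $k \times k$ kernel and no padding, an $n \times n$ input shrinks to $(n - k + 1) \times (n - k + 1)$ because the window cannot hang off the edge (this is a valid convolution). For instance, a $3 \times 3$ kernel turns a $5 \times 5$ input into a $3 \times 3$ feature map.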
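The sliding dot product can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `conv2d_valid` and the 5x5 input and 3x3 box kernel are assumptions chosen for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding (hypothetical helper)."""
    n_rows, n_cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((n_rows - k_rows + 1, n_cols - k_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k_rows, j:j + k_cols]  # region under the kernel
            out[i, j] = np.sum(patch * kernel)         # multiply and sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # illustrative 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # illustrative 3x3 box kernel
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

The explicit double loop mirrors the "slide right, then drop a row" description; real libraries vectorize the same computation.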
Convolution arithmetic
How big is the output? For a square input of size $n$, kernel $k$, padding $p$, and stride $s$, each output side is

$$o = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$$
Change the padding, stride, and dilation below and watch the output grid — and the formula — update as the kernel sweeps the padded input.
Padding adds a border of zeros so the kernel can reach the edges:
- Valid ($p = 0$): no padding; each output side shrinks by $k - 1$.
- Same ($p = (k - 1)/2$ with $s = 1$, odd $k$): output matches the input.
- Full ($p = k - 1$): every possible overlap is used and each output side grows by $k - 1$.
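Stride $s$ is how far the kernel jumps each step; a stride of 2 roughly halves each dimension, a cheap way to downsample instead of pooling. Dilation $d$ spreads the kernel’s taps apart with gaps, enlarging the receptive field without adding weights. The effective kernel size becomes $k_{\text{eff}} = d(k - 1) + 1$, so for input side $n$ and padding $p$

$$o = \left\lfloor \frac{n + 2p - d(k - 1) - 1}{s} \right\rfloor + 1$$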
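The output-size rule, including dilation, can be sketched as a small function. It assumes the standard formula $o = \lfloor (n + 2p - d(k - 1) - 1)/s \rfloor + 1$ for a square input; the function name and the sample sizes are illustrative.

```python
def conv_output_size(n, k, p=0, s=1, d=1):
    """Output side for input n, kernel k, padding p, stride s, dilation d (hypothetical helper)."""
    k_eff = d * (k - 1) + 1              # effective kernel size under dilation
    return (n + 2 * p - k_eff) // s + 1  # floor division implements the floor

print(conv_output_size(5, 3))            # valid padding: 3
print(conv_output_size(5, 3, p=1))       # same padding: 5
print(conv_output_size(5, 3, p=2))       # full padding: 7
print(conv_output_size(6, 3, p=1, s=2))  # stride 2: 3
print(conv_output_size(7, 3, d=2))       # dilation 2: 3
```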
Transposed convolution (upsampling)
Encoders shrink images, but decoders, segmentation heads, and generative models must go the other way, to larger outputs. A transposed convolution (also called fractionally-strided, or loosely “deconvolution”) does this: conceptually it spaces the input out by the stride and then convolves, so each input cell paints a kernel-sized patch onto a bigger canvas. Switch the visualizer above to Transposed to watch one small grid grow. For input side $n$, kernel $k$, stride $s$, and padding $p$, the output side is

$$o = s(n - 1) + k - 2p$$
It is the mirror of the forward pass — the same connectivity, run backward — which is why it appears wherever a network reconstructs spatial detail.
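The "each input cell paints a kernel-sized patch" picture can be sketched directly in NumPy. This is a minimal illustration assuming a square input and kernel, no dilation, and no output padding; the function name `conv_transpose2d` and the 2x2 input are hypothetical choices for the example.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=1, padding=0):
    """Scatter a scaled copy of `kernel` onto a larger canvas for each input cell."""
    n = x.shape[0]                   # assumes a square input
    k = kernel.shape[0]              # assumes a square kernel
    full = (n - 1) * stride + k      # canvas side before cropping
    canvas = np.zeros((full, full))
    for i in range(n):
        for j in range(n):
            r, c = i * stride, j * stride
            canvas[r:r + k, c:c + k] += x[i, j] * kernel  # overlapping patches add up
    if padding > 0:
        canvas = canvas[padding:-padding, padding:-padding]  # padding crops the border
    return canvas

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])           # illustrative 2x2 input
kernel = np.ones((3, 3))             # illustrative 3x3 kernel
y = conv_transpose2d(x, kernel, stride=2)
print(y.shape)  # (5, 5)
```

With stride 2 and a 3x3 kernel, the 2x2 input grows to side 2(2 - 1) + 3 = 5, and the center cell receives a contribution from all four input values.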
What kernels detect
The kernel’s weights decide what pattern lights up. An edge kernel has a strong positive center against negative neighbors, so it fires where brightness changes sharply and stays near zero on flat regions. A blur (box) kernel averages the patch, smoothing noise. In a CNN these weights are not hand-designed — they are learned by gradient descent, so the network discovers whatever filters best reduce its loss.
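To see the two behaviours side by side, the sketch below applies an edge kernel and a box blur to a tiny image with one vertical brightness step. The specific kernel values, the image, and the helper `apply_kernel` are illustrative assumptions; it relies on NumPy's `sliding_window_view` (available from NumPy 1.20).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_kernel(image, kernel):
    """Valid, stride-1 sliding dot product (hypothetical helper)."""
    windows = sliding_window_view(image, kernel.shape)  # shape (rows, cols, k, k)
    return np.einsum("ijuv,uv->ij", windows, kernel)

# Illustrative 5x6 image: dark left half (0), bright right half (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edge = np.array([[-1.0, -1.0, -1.0],
                 [-1.0,  8.0, -1.0],
                 [-1.0, -1.0, -1.0]])  # positive center, negative neighbors
blur = np.ones((3, 3)) / 9.0           # box blur: average of the patch

edge_map = apply_kernel(image, edge)
blur_map = apply_kernel(image, blur)
print(edge_map[0])  # values 0, -3, 3, 0: nonzero only around the step
print(blur_map[0])  # values 0, 1/3, 2/3, 1: the step is smoothed into a ramp
```

The edge response is zero on both flat regions because the kernel's weights sum to zero, while the blur output spreads the sharp step across neighboring positions.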
Why convolution wins for images
- Parameter sharing — the same kernel is reused at every position, so a layer has a few weights instead of one per pixel.
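- Translation equivariance — a feature is detected wherever it appears, because the same filter sweeps the whole image; shift the input and the feature map shifts with it. Pooling then adds a degree of invariance to small shifts.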
- Locality — each output depends only on a small neighborhood (its receptive field), matching how image structure is local.
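The effect of parameter sharing is easy to quantify with a rough count. The numbers below (a 28x28 grayscale image, 100 hidden units, 100 kernels of size 3x3) are illustrative assumptions, and biases are ignored.

```python
height, width = 28, 28    # illustrative grayscale image size
hidden_units = 100        # illustrative fully connected layer width
num_kernels, k = 100, 3   # illustrative conv layer: 100 kernels of 3x3

dense_weights = height * width * hidden_units  # one weight per pixel per hidden unit
conv_weights = num_kernels * k * k             # each kernel reused at every position

print(dense_weights)  # 78400
print(conv_weights)   # 900
```

The convolutional count does not depend on the image size at all, which is why the gap widens further on larger images.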
Stacking into a network
One convolutional layer applies many kernels, producing a stack of feature maps. A pooling step (e.g. taking the max over each small block, such as $2 \times 2$) then shrinks them, adding robustness and cutting computation. Stack these layers and the features compose into a hierarchy: early layers find edges, middle layers find textures and shapes, deep layers find objects. This is the same edges → shapes → faces idea from the neural-networks lesson, now built directly into the architecture.
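The pooling step can be sketched as a reshape followed by a max. This is a minimal illustration assuming non-overlapping blocks on a single 2-D feature map; the function name `max_pool2d` and the sample values are hypothetical.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Take the max over each non-overlapping size x size block."""
    rows, cols = feature_map.shape
    rows, cols = rows - rows % size, cols - cols % size  # drop ragged edges
    blocks = feature_map[:rows, :cols].reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))                       # max within each block

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 1.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [1.0, 2.0, 7.0, 8.0]])  # illustrative 4x4 feature map
pooled = max_pool2d(fmap)
print(pooled)  # rows: [4, 2] and [2, 8]
```

Each output value keeps only the strongest response in its block, so a 4x4 map shrinks to 2x2.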
References
- Dumoulin, V. & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285 — the definitive convolution-arithmetic animations and formulas (figures & code, free to use with attribution).
- CS231n — Convolutional Networks, Stanford.
Takeaways
- Convolution slides a small kernel over an image, computing a dot product at each stop to build a feature map.
- Kernel weights determine the detected pattern (edge, blur, …) and in a CNN are learned, not hand-set.
- Parameter sharing, locality, and pooling let stacked conv layers learn a hierarchy of visual features efficiently.
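- Output size is $o = \lfloor (n + 2p - k)/s \rfloor + 1$ for input $n$, kernel $k$, padding $p$, and stride $s$; padding, stride, and dilation trade off resolution, cost, and receptive field, while transposed convolution upsamples ($o = s(n - 1) + k - 2p$).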