cs.thefarshad
medium

Convolution & CNNs

The building block of computer vision — slide a small learnable kernel over an image to extract features.

A small image already has thousands of pixels; a fully connected layer needs a separate weight for every pixel in every hidden unit, so its parameter count quickly explodes. Convolutional Neural Networks (CNNs) solve this with one operation, convolution, which scans the image with a tiny shared filter and is the engine behind modern vision.

A kernel is a small grid of weights (here $3 \times 3$). It slides across the input; at each stop it multiplies overlapping values and sums them into one output number. Pick a kernel below and step the window to fill the feature map.

Figure (interactive demo): a 3x3 kernel slides over a 6x6 input → 4x4 feature map

input (6x6):
9 9 9 1 1 1
9 9 9 1 1 1
9 9 1 1 1 1
9 1 1 1 1 1
1 1 1 1 5 5
1 1 1 1 5 5

kernel (÷ 1):
-1 -1 -1
-1  8 -1
-1 -1 -1

feature map, stop 1/16: output[0][0] = sum(window × kernel) = 8.0

The sliding dot product

At every position the kernel computes a weighted sum of the patch it covers:

$$\text{out}[r,c] = \sum_{i}\sum_{j} \text{img}[r+i,\,c+j]\;\cdot\;\text{kernel}[i,j]$$

Slide one step right, repeat; reach the edge, drop down a row. The grid of results is the feature map. With a $3 \times 3$ kernel and no padding, a $6 \times 6$ input shrinks to $4 \times 4$ because the window cannot hang off the edge (this is a valid convolution).
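The sliding dot product above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation: the function name `conv2d_valid` is made up for this example, and the input grid and edge kernel are the values read from the demo above. As in the formula, the kernel is not flipped, which is the convention deep learning libraries use (strictly speaking, cross-correlation).

```python
def conv2d_valid(img, kernel):
    """Valid (no padding, stride 1) sliding dot product over 2D lists."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(img) - k_rows + 1
    out_cols = len(img[0]) - k_cols + 1
    out = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum of the patch whose top-left corner is (r, c).
            total = sum(
                img[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        out.append(row)
    return out


# 6x6 input and 3x3 edge kernel taken from the demo.
img = [
    [9, 9, 9, 1, 1, 1],
    [9, 9, 9, 1, 1, 1],
    [9, 9, 1, 1, 1, 1],
    [9, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 5, 5],
    [1, 1, 1, 1, 5, 5],
]
edge_kernel = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]

feature_map = conv2d_valid(img, edge_kernel)
print(len(feature_map), len(feature_map[0]))  # 4 4
print(feature_map[0][0])                      # 8
```

The first cell reproduces the demo's value of 8, and the window at row 0, column 3 covers only 1s, so its response is 0: the edge kernel stays silent on flat regions.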

Convolution arithmetic

How big is the output? For a square input of size $i$, kernel $k$, padding $p$, and stride $s$, each output side is

$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1$$

Change the padding, stride, and dilation below and watch the output grid — and the formula — update as the kernel sweeps the padded input.

Figure (interactive demo): output cell = sum of the window (padding = 0)

input + padding (5×5):
3 9 3 4 8
3 3 9 3 9
7 7 9 2 6
6 2 7 8 5
6 5 1 3 7

output (3×3), stop 1/9: first cell = 53

o = ⌊(5 + 2·0 − 3) / 1⌋ + 1 = 3

Padding adds a border of zeros so the kernel can reach the edges:

  • Valid ($p = 0$): no padding; with $s = 1$, each output side is $k - 1$ shorter than the input.
  • Same ($p = \lfloor (k-1)/2 \rfloor$ with $s = 1$, odd $k$): output matches the input.
  • Full ($p = k - 1$): every possible overlap is used and the output grows to $i + k - 1$ per side.
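To make the three padding modes concrete, the sketch below plugs each mode's padding into the output-size formula for a $6 \times 6$ input and a $3 \times 3$ kernel. The helper names `output_side` and `padding_for` are hypothetical, chosen for this example only, and the code assumes stride 1 and an odd kernel size.

```python
def output_side(i, k, p, s=1):
    """o = floor((i + 2p - k) / s) + 1 for a square input of side i."""
    return (i + 2 * p - k) // s + 1


def padding_for(mode, k):
    """Padding per side for the named modes (stride 1; odd k for 'same')."""
    if mode == "valid":
        return 0
    if mode == "same":
        return (k - 1) // 2
    if mode == "full":
        return k - 1
    raise ValueError(f"unknown padding mode: {mode}")


i, k = 6, 3
sizes = {
    mode: output_side(i, k, padding_for(mode, k))
    for mode in ("valid", "same", "full")
}
print(sizes)  # {'valid': 4, 'same': 6, 'full': 8}
```

Valid shrinks the side by $k - 1 = 2$, same preserves it, and full grows it by $k - 1 = 2$.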

Stride $s$ is how far the kernel jumps each step; a stride of 2 roughly halves each dimension, a cheap way to downsample instead of pooling. Dilation $d$ spreads the kernel’s taps apart with gaps, enlarging the receptive field without adding weights. The effective kernel size becomes $k_{\text{eff}} = k + (k-1)(d-1)$, so

$$o = \left\lfloor \frac{i + 2p - k - (k-1)(d-1)}{s} \right\rfloor + 1$$
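The general formula is easy to check numerically. The function below is a small sketch; the name `conv_output_side` and the example sizes are assumptions made for illustration. It computes the effective kernel size first and then applies the stride.

```python
def conv_output_side(i, k, p=0, s=1, d=1):
    """Output side for input i, kernel k, padding p, stride s, dilation d."""
    k_eff = k + (k - 1) * (d - 1)  # kernel span once gaps are inserted
    if i + 2 * p < k_eff:
        raise ValueError("dilated kernel does not fit in the padded input")
    return (i + 2 * p - k_eff) // s + 1


plain_side = conv_output_side(5, 3)               # demo setting: 5x5 input, 3x3 kernel
strided_side = conv_output_side(8, 3, p=1, s=2)   # stride 2 halves an 8-wide input
dilated_side = conv_output_side(9, 3, d=2)        # 3 taps spanning 5 cells

print(plain_side, strided_side, dilated_side)  # 3 4 5
```

With $d = 2$ a 3-tap kernel covers 5 input cells, so it behaves like a $5 \times 5$ kernel for sizing purposes while still holding only 9 weights.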

Transposed convolution (upsampling)

Encoders shrink images, but decoders, segmentation heads, and generative models must go the other way — to larger outputs. A transposed convolution (also called fractionally-strided, or loosely “deconvolution”) does this: conceptually it spaces the input out by the stride and then convolves, so each input cell paints a kernel-sized patch onto a bigger canvas. Switch the visualizer above to Transposed to watch one small grid grow, with output side

$$o = s\,(i - 1) + k - 2p$$

It is the mirror of the forward pass — the same connectivity, run backward — which is why it appears wherever a network reconstructs spatial detail.
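One way to see the "each input cell paints a kernel-sized patch" picture is to write it as a scatter loop. The sketch below is an illustrative pure-Python version, not any library's implementation; the function name `conv_transpose2d` and the toy input are assumptions. Padding is handled by cropping $p$ cells from every border of the painted canvas, which gives the output side $s(i-1) + k - 2p$.

```python
def conv_transpose2d(x, kernel, stride=1, padding=0):
    """Transposed convolution by scattering: every input cell adds a scaled
    copy of the kernel onto a larger canvas, offset by the stride."""
    i_rows, i_cols = len(x), len(x[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    full_rows = stride * (i_rows - 1) + k_rows
    full_cols = stride * (i_cols - 1) + k_cols
    canvas = [[0] * full_cols for _ in range(full_rows)]
    for r in range(i_rows):
        for c in range(i_cols):
            for a in range(k_rows):
                for b in range(k_cols):
                    canvas[r * stride + a][c * stride + b] += x[r][c] * kernel[a][b]
    if padding:
        # Padding in a transposed convolution crops the border.
        canvas = [
            row[padding:full_cols - padding]
            for row in canvas[padding:full_rows - padding]
        ]
    return canvas


x = [[1, 2], [3, 4]]
ones_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

up = conv_transpose2d(x, ones_kernel, stride=2)
print(len(up), len(up[0]))  # 5 5, since 2 * (2 - 1) + 3 = 5
print(up[2][2])             # 10, the cell where all four patches overlap

cropped = conv_transpose2d(x, ones_kernel, stride=2, padding=1)
print(len(cropped))         # 3, since 2 * (2 - 1) + 3 - 2 * 1 = 3
```

The overlap at the center shows why transposed convolutions sum contributions from neighboring input cells whenever the kernel is wider than the stride.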

What kernels detect

The kernel’s weights decide what pattern lights up. An edge kernel has a strong positive center against negative neighbors, so it fires where brightness changes sharply and stays near zero on flat regions. A blur (box) kernel averages the patch, smoothing noise. In a CNN these weights are not hand-designed — they are learned by gradient descent, so the network discovers whatever filters best reduce its loss.
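A quick way to see this is to apply both kernels to a flat patch and to a patch with a bright spike. The snippet below is a hedged sketch: `patch_response` and the two sample patches are made up for illustration, and the box kernel is written as a grid of ones with a divisor of 9.

```python
def patch_response(patch, kernel, divisor=1):
    """Dot product of one patch with a same-sized kernel, then divide."""
    total = sum(
        value * weight
        for patch_row, kernel_row in zip(patch, kernel)
        for value, weight in zip(patch_row, kernel_row)
    )
    return total / divisor


edge_kernel = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # used with divisor=9

flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
spike_patch = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]

edge_flat = patch_response(flat_patch, edge_kernel)       # 0.0: no change, no response
edge_spike = patch_response(spike_patch, edge_kernel)     # 64.0: sharp change fires
blur_flat = patch_response(flat_patch, box_kernel, 9)     # 5.0: flat stays flat
blur_spike = patch_response(spike_patch, box_kernel, 9)   # about 1.89: spike smoothed

print(edge_flat, edge_spike, blur_flat, round(blur_spike, 2))
```

The edge kernel's weights sum to zero, which is why any constant region maps to zero, while the box kernel's normalized weights sum to one and so preserve average brightness.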

Why convolution wins for images

  • Parameter sharing: the same $3 \times 3$ kernel is reused at every position, so a layer has a few weights instead of one per pixel.
  • Translation equivariance: a feature is detected wherever it appears, because the same filter sweeps the whole image; shift the input and the feature map shifts with it. Pooling then adds some invariance to small shifts.
  • Locality: each output depends only on a small neighborhood (its receptive field), matching how image structure is local.
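The effect of parameter sharing shows up in a simple weight count. The numbers below are hypothetical (a $64 \times 64$ grayscale image, 100 hidden units or 100 filters) and biases are ignored; the function names are likewise made up for this comparison.

```python
def dense_weight_count(height, width, hidden_units):
    """Fully connected layer: one weight per pixel per hidden unit."""
    return height * width * hidden_units


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Conv layer: one small kernel per (input channel, output channel) pair,
    reused at every spatial position."""
    return kernel_size * kernel_size * in_channels * out_channels


dense_weights = dense_weight_count(64, 64, 100)
conv_weights = conv_weight_count(3, 1, 100)

print(dense_weights)  # 409600
print(conv_weights)   # 900
```

Note that the convolutional count does not depend on the image size at all, so the same layer can process larger inputs without gaining weights.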

Stacking into a network

One convolutional layer applies many kernels, producing a stack of feature maps. A pooling step (e.g. taking the max over each $2 \times 2$ block) then shrinks them, adding robustness and cutting computation. Stack these layers and the features compose into a hierarchy: early layers find edges, middle layers find textures and shapes, deep layers find objects. This is the same edges $\to$ shapes $\to$ faces idea from the neural-networks lesson, now built directly into the architecture.
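The pooling step mentioned above can be sketched as follows. This is a minimal non-overlapping max pool over a 2D list; the name `max_pool2d` and the sample feature map are assumptions for illustration, and any rows or columns that do not fill a complete block are simply dropped.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each
    size x size block of the feature map."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r + a][c + b] for a in range(size) for b in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool2d(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

A $4 \times 4$ map becomes $2 \times 2$, and the output is unchanged if a strong activation moves to a different cell within its block, which is the robustness the text refers to.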

Takeaways

  • Convolution slides a small kernel over an image, computing a dot product at each stop to build a feature map.
  • Kernel weights determine the detected pattern (edge, blur, …) and in a CNN are learned, not hand-set.
  • Parameter sharing, locality, and pooling let stacked conv layers learn a hierarchy of visual features efficiently.
  • Output size is $o = \lfloor (i + 2p - k)/s \rfloor + 1$; padding, stride, and dilation trade off resolution, cost, and receptive field, while transposed convolution upsamples ($o = s(i-1) + k - 2p$).