
Regularization

Tame overfitting by penalizing complexity — how L2 regularization smooths an over-eager fit.

A flexible model can fit the training data too well. Give a polynomial enough degrees of freedom and it will thread every noisy point exactly — then swing wildly in between and fail on anything new. That is overfitting, and regularization is the standard cure: add a penalty that discourages complexity so the model prefers a simpler explanation.

Below, a polynomial is fit to noisy points. Crank the degree up with $\lambda = 0$ and the curve whips around to hit every dot. Now raise $\lambda$ and watch it relax into a smooth trend.

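[Interactive plot: polynomial fit to noisy points, with controls for degree and λ. Sample readout at degree = 9, λ = 0 (no penalty): train MSE = 0.0031, curve wiggle (sum w_j^2) = 316408908.3.]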

Push the degree high with λ = 0: the curve threads every point but whips between them (overfitting). Raise λ and the wiggle shrinks as large weights are penalized, recovering a smooth fit.

The complexity trap

A degree-$d$ polynomial has $d+1$ weights. With high $d$ the model has more freedom than the data deserves, so it spends that freedom modeling noise. The giveaway is large, alternating weights: tiny shifts in $x$ produce huge swings in the prediction. The readout under the plot shows the “wiggle”, $\sum_j w_j^2$, exploding as the fit overfits.
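The effect can be reproduced in a few lines. The sketch below is illustrative only: the data (a noisy sine sampled at 12 points), the seed, and the helper names `fit_polynomial` and `train_mse` are assumptions made for this example, not the setup behind the plot above. It fits polynomials of increasing degree by ordinary least squares and reports the training error next to the wiggle $\sum_j w_j^2$.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Ordinary least-squares fit; returns weights w_0..w_d, lowest power first."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def train_mse(x, y, w):
    """Mean squared error of the polynomial with weights w on (x, y)."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical data: a smooth trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

results = {}
for degree in (1, 3, 9):
    w = fit_polynomial(x, y, degree)
    results[degree] = (train_mse(x, y, w), float(np.sum(w ** 2)))
    print(f"degree={degree}  train MSE={results[degree][0]:.4f}  "
          f"wiggle={results[degree][1]:.1f}")
```

The training error can only go down as the degree rises, because each higher-degree model contains the lower-degree ones. The wiggle is the number to watch: it grows as the extra weights start chasing noise.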

Adding a penalty

Ordinary least squares minimizes only the data error. Regularization adds a term that grows with the size of the weights. L2 regularization (ridge regression) penalizes their squared magnitude:

$J(w) = \underbrace{\sum_i (y_i - \hat y_i)^2}_{\text{fit the data}} + \lambda \sum_j w_j^2$

Now the optimizer must balance two goals: fit the points and keep the weights small. The strength $\lambda$ is the dial:

$\lambda = 0$ — no penalty; pure least squares, free to overfit.
$\lambda$ too large — weights are crushed toward zero and the model underfits, flattening toward a nearly flat line that misses the real trend.
Just right — the wiggle is suppressed but the real trend survives. You choose $\lambda$ with a validation set or cross-validation.
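For a model that is linear in its weights, the penalized loss $J(w)$ has a closed-form minimizer, $w = (X^\top X + \lambda I)^{-1} X^\top y$, where $X$ holds the polynomial features. The sketch below uses that formula to sweep $\lambda$ and pick the value with the lowest validation error. The data, the $\lambda$ grid, and the helper names (`design`, `ridge_fit`, `mse`) are assumptions for illustration. Following the formula above, it penalizes every weight including the constant term, whereas many implementations leave the intercept unpenalized.

```python
import numpy as np

def design(x, degree):
    """Polynomial feature matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Minimize sum (y - Xw)^2 + lam * sum w_j^2 via the normal equations."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical data: same noisy trend, separate training and validation points.
rng = np.random.default_rng(1)
x_train = np.linspace(-1.0, 1.0, 15)
x_val = np.linspace(-0.95, 0.95, 15)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
y_val = np.sin(np.pi * x_val) + rng.normal(scale=0.2, size=x_val.size)

degree = 9
X_train, X_val = design(x_train, degree), design(x_val, degree)

lams = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
train_mses, val_mses, wiggles = [], [], []
for lam in lams:
    w = ridge_fit(X_train, y_train, lam)
    train_mses.append(mse(X_train, y_train, w))
    val_mses.append(mse(X_val, y_val, w))
    wiggles.append(float(np.sum(w ** 2)))
    print(f"lam={lam:<8g} train={train_mses[-1]:.4f} "
          f"val={val_mses[-1]:.4f} wiggle={wiggles[-1]:.2f}")

# Choose the strength that generalizes best to the held-out points.
best_lam = lams[int(np.argmin(val_mses))]
print("best lambda by validation MSE:", best_lam)
```

Training error rises with $\lambda$ while the wiggle falls. Validation error is what separates "just right" from the two extremes. With more data, or a need for a less noisy estimate, cross-validation would replace the single held-out split.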

L1 vs L2

The other common choice, L1 regularization (lasso), penalizes the absolute values of the weights, adding $\lambda \sum_j |w_j|$ to the loss in place of the squared term. L1 tends to drive some weights all the way to zero, performing automatic feature selection, while L2 shrinks weights smoothly without eliminating them. Both push the model toward simplicity; they differ in the shape of the penalty.
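The difference is easiest to see in a simplified special case. Assume the features are orthonormal ($X^\top X = I$), which is an assumption made for this sketch and not true of raw polynomial features. Each weight can then be solved for separately from its least-squares value $w^{\text{ols}}_j$: the L2 penalty rescales it to $w^{\text{ols}}_j / (1 + \lambda)$, while the L1 penalty shifts it toward zero by $\lambda/2$ and clips at zero (soft thresholding). The function names and the example weights below are hypothetical.

```python
import numpy as np

def ridge_shrink(w_ols, lam):
    """L2 solution under orthonormal features: uniform rescaling."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """L1 solution under orthonormal features: soft thresholding at lam / 2."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

# Hypothetical least-squares weights: two large, three small.
w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 1.0

w_ridge = ridge_shrink(w_ols, lam)
w_lasso = lasso_shrink(w_ols, lam)

print("ridge:", w_ridge)  # every weight halved, none exactly zero
print("lasso:", w_lasso)  # the three small weights become exactly zero
print("nonzero weights  ridge:", np.count_nonzero(w_ridge),
      " lasso:", np.count_nonzero(w_lasso))
```

With general, correlated features no such per-weight formula applies and lasso is solved iteratively, but the qualitative behavior is the same: L2 shrinks proportionally, L1 subtracts and clips.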

Regularization is everywhere in machine learning — ridge and lasso for linear models, weight decay and dropout for neural networks — all expressing the same idea: when in doubt, prefer the simpler model.

Further reading: scikit-learn — Ridge regression.

Takeaways

Overly flexible models overfit by modeling noise, signaled by large, swinging weights.
L2 regularization adds $\lambda \sum_j w_j^2$ to the loss, trading exact fit for smaller, smoother weights.
$\lambda$ controls the balance and is chosen by validation; L1 (lasso) instead drives some weights exactly to zero, which acts as feature selection.