
Regularization

Tame overfitting by penalizing complexity — how L2 regularization smooths an over-eager fit.

A flexible model can fit the training data too well. Give a polynomial enough degrees of freedom and it will thread every noisy point exactly — then swing wildly in between and fail on anything new. That is overfitting, and regularization is the standard cure: add a penalty that discourages complexity so the model prefers a simpler explanation.

Below, a polynomial is fit to noisy points. Crank the degree up with $\lambda = 0$ and the curve whips around to hit every dot. Now raise $\lambda$ and watch it relax into a smooth trend.

[Interactive plot: polynomial fit to noisy points. Readout: train MSE = 0.0031 | curve wiggle (sum w_j^2) = 316307443.7 | no penalty]

Push the degree high with λ = 0: the curve threads every point but whips between them (overfitting). Raise λ and the wiggle shrinks as large weights are penalized, recovering a smooth fit.

The complexity trap

A degree-$d$ polynomial has $d+1$ weights. With high $d$ the model has more freedom than the data deserves, so it spends that freedom modeling noise. The giveaway is large, alternating weights: tiny shifts in $x$ produce huge swings in the prediction. The readout under the plot shows the “wiggle”, $\sum_j w_j^2$, exploding as the fit overfits.
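To make the wiggle concrete, here is a minimal sketch that fits the same noisy points with a low and a high degree and compares $\sum_j w_j^2$. The data (a noisy sine), the seed, and the helper names `fit_polynomial` and `wiggle` are assumptions made for illustration; they are not taken from the interactive demo.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Ordinary least-squares polynomial fit; returns weights w_0..w_degree."""
    # Columns of X are 1, x, x^2, ..., x^degree, so there are degree + 1 weights.
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def wiggle(w):
    """The 'wiggle' readout: sum of squared weights."""
    return float(np.sum(w ** 2))

# Hypothetical data: 12 equally spaced points on a sine wave plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=x.size)

w_low = fit_polynomial(x, y, degree=3)    # 4 weights: cannot chase the noise
w_high = fit_polynomial(x, y, degree=11)  # 12 weights for 12 points: interpolates

print("degree 3  wiggle:", wiggle(w_low))
print("degree 11 wiggle:", wiggle(w_high))
```

With as many weights as data points, the degree-11 fit passes through every point, and it should report a much larger wiggle than the cubic.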

Adding a penalty

Ordinary least squares minimizes only the data error. Regularization adds a term that grows with the size of the weights. L2 regularization (ridge regression) penalizes their squared magnitude:

$$J(w) = \underbrace{\sum_i (y_i - \hat y_i)^2}_{\text{fit the data}} + \lambda \sum_j w_j^2$$
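For a model that is linear in its weights, this objective has a closed-form minimizer. The short derivation below assumes the predictions can be written as $\hat y = Xw$, where $X$ is a design matrix whose row $i$ holds the features of point $i$ (for a polynomial, $1, x_i, \dots, x_i^d$). The symbols $X$ and $I$ are introduced here and do not appear in the original text.

```latex
\begin{aligned}
J(w) &= \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2 \\
\nabla_w J(w) &= -2 X^\top (y - Xw) + 2 \lambda w = 0 \\
(X^\top X + \lambda I)\, w &= X^\top y \\
w &= (X^\top X + \lambda I)^{-1} X^\top y
\end{aligned}
```

For $\lambda > 0$ the matrix $X^\top X + \lambda I$ is positive definite, so the solution exists and is unique even when $X^\top X$ alone is singular or badly conditioned.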

Now the optimizer must balance two goals: fit the points and keep the weights small. The strength $\lambda$ is the dial:

  • $\lambda = 0$: no penalty; pure least squares, free to overfit.
  • $\lambda$ too large: weights are crushed toward zero and the model underfits, flattening into a nearly flat line.
  • Just right: the wiggle is suppressed but the real trend survives. You choose $\lambda$ with a validation set or cross-validation.
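The three regimes above can be sketched with a small sweep over $\lambda$. This is an illustrative sketch, not the demo's implementation: it solves the ridge normal equations $(X^\top X + \lambda I)\,w = X^\top y$ directly, penalizes every weight including the intercept (as the formula above does), and uses made-up sine data with a held-out validation set. The names `ridge_fit`, `mse`, and `make_data` are hypothetical helpers.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum (y - Xw)^2 + lam * sum w^2 via the normal equations."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

def mse(X, y, w):
    """Mean squared error of the predictions X @ w."""
    return float(np.mean((y - X @ w) ** 2))

rng = np.random.default_rng(0)
degree = 9

def make_data(n):
    # Hypothetical data: noisy samples of a sine wave on [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)
    return np.vander(x, degree + 1, increasing=True), y

X_train, y_train = make_data(15)
X_val, y_val = make_data(200)

lambdas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
results = []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)
    results.append((lam,
                    mse(X_train, y_train, w),   # training error
                    mse(X_val, y_val, w),       # validation error
                    float(np.sum(w ** 2))))     # wiggle

# Choose the lambda with the lowest validation error.
best_lam = min(results, key=lambda r: r[2])[0]

for lam, train_err, val_err, wig in results:
    print(f"lambda={lam:<8g} train={train_err:.4f} val={val_err:.4f} wiggle={wig:.3g}")
print("best lambda on validation set:", best_lam)
```

Training error can only grow as $\lambda$ increases, and the wiggle can only shrink, so the training set alone would always pick $\lambda = 0$. That is why the choice is made on held-out data. With a small dataset, k-fold cross-validation would be the more robust variant of the same idea.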

L1 vs L2

The other common choice, L1 regularization (lasso), penalizes the sum of absolute values, $\lambda \sum_j |w_j|$. L1 tends to drive some weights all the way to zero, performing automatic feature selection, while L2 shrinks weights smoothly without eliminating them. Both push the model toward simplicity; they differ in the shape of the penalty.
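The difference is easiest to see on a single weight. As a simplifying assumption, suppose the weights decouple (for example, with orthonormal features), so each one solves a one-dimensional problem around its unpenalized optimum $a$. Minimizing $(w - a)^2 + \lambda w^2$ gives $w = a / (1 + \lambda)$, while minimizing $(w - a)^2 + \lambda |w|$ gives the soft threshold $w = \operatorname{sign}(a)\,\max(|a| - \lambda/2,\, 0)$. The sketch below applies both to a made-up weight vector; the function names are hypothetical.

```python
import numpy as np

def l2_shrink(a, lam):
    """argmin_w (w - a)^2 + lam * w^2: scale every weight by 1 / (1 + lam)."""
    return a / (1.0 + lam)

def l1_shrink(a, lam):
    """argmin_w (w - a)^2 + lam * |w|: soft-threshold at lam / 2."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

# Hypothetical unpenalized weights: two large, two small.
a = np.array([3.0, -1.5, 0.4, -0.2])
lam = 1.0

w_l2 = l2_shrink(a, lam)
w_l1 = l1_shrink(a, lam)

print("L2:", w_l2, "nonzero:", np.count_nonzero(w_l2))
print("L1:", w_l1, "nonzero:", np.count_nonzero(w_l1))
```

Here L2 halves every weight and keeps all four nonzero, while L1 subtracts a fixed 0.5 from each magnitude and sets the two smallest weights exactly to zero.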

Regularization is everywhere in machine learning — ridge and lasso for linear models, weight decay and dropout for neural networks — all expressing the same idea: when in doubt, prefer the simpler model.

Further reading: scikit-learn — Ridge regression.

Takeaways

  • Overly flexible models overfit by modeling noise, signaled by large, swinging weights.
  • L2 regularization adds $\lambda \sum_j w_j^2$ to the loss, trading exact fit for smaller, smoother weights.
  • $\lambda$ controls the balance; L1 (lasso) instead drives some weights exactly to zero, which acts as feature selection.