Regularization
Tame overfitting by penalizing complexity — how L2 regularization smooths an over-eager fit.
A flexible model can fit the training data too well. Give a polynomial enough degrees of freedom and it will thread every noisy point exactly — then swing wildly in between and fail on anything new. That is overfitting, and regularization is the standard cure: add a penalty that discourages complexity so the model prefers a simpler explanation.
Below, a polynomial is fit to noisy points. Crank the degree up with λ = 0 and the curve whips around to hit every dot. Now raise λ and watch it relax into a smooth trend.
Push the degree high with λ = 0: the curve threads every point but whips between them (overfitting). Raise λ and the wiggle shrinks as large weights are penalized, recovering a smooth fit.
The complexity trap
A degree-d polynomial has d + 1 weights. With high d the model has more freedom than the data deserves, so it spends that freedom modeling noise. The giveaway is large, alternating weights: tiny shifts in x produce huge swings in the prediction. The readout under the plot shows the “wiggle” exploding as the fit overfits.
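To make the trap concrete, here is a minimal sketch that fits a low-degree and a high-degree polynomial to the same ten noisy points and compares the size of their weights. The sine-shaped trend, the noise level, the seed, and the helper name `fit_polynomial` are all illustrative assumptions, and the sum of squared weights is used only as a stand-in for the plot's wiggle readout.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
x = np.linspace(-1.0, 1.0, 10)           # ten sample locations
# Assumed data: a smooth trend plus Gaussian noise (not from the article).
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

def fit_polynomial(x, y, degree):
    """Unregularized least-squares weights w_0..w_degree for sum_j w_j * x**j."""
    X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

summary = {}
for degree in (3, 9):
    w = fit_polynomial(x, y, degree)
    X = np.vander(x, degree + 1, increasing=True)
    summary[degree] = {
        "n_weights": int(w.size),                          # degree + 1
        "train_mse": float(np.mean((X @ w - y) ** 2)),     # error on the fitted points
        "sum_sq_weights": float(np.sum(w ** 2)),           # assumed "wiggle" proxy
    }
    print(degree, summary[degree])
```

With ten points, the degree-9 polynomial has exactly as many weights as data points, so it can pass through every one of them; the price shows up in the much larger weights.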
Adding a penalty
Ordinary least squares minimizes only the data error. Regularization adds a term that grows with the size of the weights. L2 regularization (ridge regression) penalizes their squared magnitude:
$$J(w) = \sum_i \bigl(y_i - \hat{y}(x_i)\bigr)^2 + \lambda \sum_j w_j^2$$
Now the optimizer must balance two goals: fit the points and keep the weights small. The strength λ is the dial:
- λ = 0: no penalty; pure least squares, free to overfit.
- λ too large: weights are crushed toward zero and the model underfits, flattening into a near-straight line.
- Just right: the wiggle is suppressed but the real trend survives. You choose λ with a validation set or cross-validation.
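The three regimes above can be sketched with the closed-form ridge solution, w = (XᵀX + λI)⁻¹Xᵀy, and a simple hold-out validation set. This is an illustration under stated assumptions rather than a definitive implementation: the data, the degree, the λ grid, and the helper names `design`, `ridge_fit`, and `make_data` are hypothetical, and for simplicity the intercept is penalized along with the other weights.

```python
import numpy as np

def design(x, degree):
    """Polynomial feature matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 by solving the normal equations."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)

def make_data(x):
    # Assumed ground truth: a sine trend plus Gaussian noise.
    return np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)

x_train = np.linspace(-1.0, 1.0, 20)
x_val = rng.uniform(-1.0, 1.0, 200)      # held-out points for choosing lambda
y_train, y_val = make_data(x_train), make_data(x_val)

degree = 9
X_train, X_val = design(x_train, degree), design(x_val, degree)

lambdas = [0.0, 1e-4, 1e-2, 1.0, 1e2, 1e4]
results = []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)
    results.append({
        "lam": lam,
        "sum_sq_weights": float(np.sum(w ** 2)),
        "train_mse": float(np.mean((X_train @ w - y_train) ** 2)),
        "val_mse": float(np.mean((X_val @ w - y_val) ** 2)),
    })
    print(results[-1])

# Pick the lambda with the lowest validation error.
best_lambda = min(results, key=lambda r: r["val_mse"])["lam"]
print("best lambda on the validation set:", best_lambda)
```

As λ grows, the weight norm falls and the training error rises; the validation error is what tells you where to stop. With little data, k-fold cross-validation would replace the single hold-out split.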
L1 vs L2
The other common choice, L1 regularization (lasso), penalizes the absolute values of the weights, λ Σⱼ |wⱼ|. L1 tends to drive some weights all the way to zero, performing automatic feature selection, while L2 shrinks weights smoothly without eliminating them. Both push the model toward simplicity; they differ in the shape of the penalty.
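The difference is easiest to see for a single weight in isolation. Assume the unpenalized optimum is a, so the data term is (w − a)². Adding λw² gives the minimizer a / (1 + λ), which shrinks proportionally but never reaches zero. Adding λ|w| gives soft thresholding: the weight moves toward zero by λ/2 and stops at exactly zero once |a| ≤ λ/2. The sketch below applies both rules to a few made-up weights; the function names and values are illustrative, and real lasso solvers handle correlated features iteratively.

```python
def ridge_shrink(a, lam):
    """Minimizer of (w - a)**2 + lam * w**2: proportional shrinkage."""
    return a / (1.0 + lam)

def lasso_shrink(a, lam):
    """Minimizer of (w - a)**2 + lam * abs(w): soft thresholding at lam / 2."""
    magnitude = max(abs(a) - lam / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0                       # small weights are set exactly to zero
    return magnitude if a > 0 else -magnitude

unpenalized = [3.0, -1.5, 0.4, -0.2]     # hypothetical least-squares weights
lam = 1.0

ridge = [ridge_shrink(a, lam) for a in unpenalized]
lasso = [lasso_shrink(a, lam) for a in unpenalized]

print(ridge)   # [1.5, -0.75, 0.2, -0.1]
print(lasso)   # [2.5, -1.0, 0.0, 0.0]
```

Ridge keeps all four weights, only smaller; lasso discards the two weakest outright, which is the feature-selection behavior described above.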
Regularization is everywhere in machine learning — ridge and lasso for linear models, weight decay and dropout for neural networks — all expressing the same idea: when in doubt, prefer the simpler model.
Further reading: scikit-learn — Ridge regression.
Takeaways
- Overly flexible models overfit by modeling noise, signaled by large, swinging weights.
- L2 regularization adds λ Σⱼ wⱼ² to the loss, trading exact fit for smaller, smoother weights.
- λ controls the balance; L1 (lasso) instead zeroes weights for feature selection.