L2 / L1 Regularization¶
Setup¶
Model (no intercept for simplicity):

\[\hat{y} = w x\]

Data loss (sum of squared errors):

\[L_{\text{data}}(w) = \sum_i (y_i - w x_i)^2\]

L2-regularized loss (ridge):

\[L(w) = \sum_i (y_i - w x_i)^2 + \lambda w^2\]
- \(\lambda>0\) controls the strength of the penalty (larger \(\lambda\) means stronger shrinkage).
- In practice, we usually don't penalize the bias/intercept.
How L2 Penalizes the Parameter¶
Take the derivative w.r.t. \(w\) and set it to 0:

\[\frac{dL}{dw} = -2\sum_i x_i (y_i - w x_i) + 2\lambda w = 0\]

Rearrange:

\[w = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \lambda}\]

Compare to unregularized OLS:

\[w_{\text{OLS}} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}\]
L2 adds \(\lambda\) to the denominator and shrinks \(w\) toward 0.
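As a quick sanity check (a sketch using the tiny dataset from the numeric example below), the gradient of the ridge loss should vanish at the closed-form weight:

```python
import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([0., 1., 2., 60.])
lmbda = 10.0

# Closed-form ridge weight: w = sum(x*y) / (sum(x^2) + lambda)
w = np.sum(x * y) / (np.sum(x**2) + lmbda)

# Gradient of the ridge loss at that w; should be numerically zero
grad = -2 * np.sum(x * (y - w * x)) + 2 * lmbda * w
print("w =", w, "gradient at w =", grad)
```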
Why Does L2 Decrease Variance and Increase Bias?¶
L2 regularization constrains how large the parameters can get. Constraining parameters makes the fitted function smoother/less wiggly, so predictions don't swing wildly when the training sample changes; this cuts variance. The tradeoff is that the constrained model can't perfectly adapt to the true signal, so estimates are pulled toward zero (or toward simpler shapes), which introduces bias.
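This can be made concrete with a small simulation (a synthetic setup of my own, not from the text): refit \(w\) on many noisy resamples of the same underlying line \(y = 2x\), with and without the penalty, and compare the spread and center of the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 20)
true_w = 2.0
lmbda = 50.0  # illustrative, deliberately strong

ols_ws, ridge_ws = [], []
for _ in range(500):
    # Fresh noisy sample from the same underlying line y = 2x
    y = true_w * x + rng.normal(scale=5.0, size=x.size)
    ols_ws.append(np.sum(x * y) / np.sum(x**2))
    ridge_ws.append(np.sum(x * y) / (np.sum(x**2) + lmbda))

# Ridge estimates vary less (lower variance) but are pulled below 2.0 (bias)
print("OLS   mean/std:", np.mean(ols_ws), np.std(ols_ws))
print("ridge mean/std:", np.mean(ridge_ws), np.std(ridge_ws))
```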
Tiny Numeric Example¶
Data: \(x=[0,1,2,3]\), \(y=[0,1,2,60]\) (last point is an outlier)

- \(\sum x_i^2 = 14\), \(\sum x_i y_i = 185\)

Weights:

- OLS (no L2): \(185/14 \approx 13.214\)
- L2, \(\lambda=10\): \(185/(14+10) = 185/24 \approx 7.708\)
- L2, \(\lambda=100\): \(185/(14+100) = 185/114 \approx 1.623\)
As \(\lambda\) grows, \(w\) is pulled toward 0, limiting the impact of the outlier.
Gradient-Descent View (Weight Decay)¶
With learning rate \(\eta\):

\[w \leftarrow w - \eta\left(\frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w\right)\]
The \(+2\lambda w\) term is the shrinkage that steadily decays weights.
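A quick sketch (step size and iteration count chosen for illustration) showing that this decayed update converges to the same answer as the closed form on the tiny dataset:

```python
import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([0., 1., 2., 60.])
lmbda, eta = 10.0, 0.01

w = 0.0
for _ in range(2000):
    # Data-loss gradient plus the 2*lambda*w weight-decay term
    grad = -2 * np.sum(x * (y - w * x)) + 2 * lmbda * w
    w -= eta * grad

print(w)  # approaches the closed-form 185 / (14 + 10) = 7.7083...
```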
Multi-Feature Form (for reference)¶
For features \(X\in \mathbb{R}^{n\times d}\) and target \(\mathbf{y}\), the ridge loss is

\[L(\mathbf{w}) = \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2\]

with closed-form solution

\[\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\]
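A minimal NumPy sketch of this closed form on synthetic data (the data and \(\lambda\) here are illustrative); solving the linear system is preferable to forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))  # n=50 samples, d=3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lmbda = 1.0

d = X.shape[1]
# Solve (X^T X + lambda*I) w = X^T y rather than inverting the matrix
w = np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)
print(w)
```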
Copy-Paste Python¶
```python
import numpy as np

x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([0, 1, 2, 60], dtype=float)

Sxx = np.sum(x**2)  # sum of x_i^2 = 14
Sxy = np.sum(x*y)   # sum of x_i * y_i = 185

def ridge_weight(lmbda):
    return Sxy / (Sxx + lmbda)

print("w_OLS =", Sxy / Sxx)
for lmbda in [10, 100]:
    print(f"w_ridge(lambda={lmbda}) =", ridge_weight(lmbda))
```
Notes

- Standardize features before using L2/L1 (especially for linear/logistic models).
- Tune \(\lambda\) via cross-validation.
- Do not penalize the bias term.
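The standardization note can be sketched as follows (NumPy-only; the feature scales and coefficients are made up for illustration): rescale each column to zero mean and unit variance so a single \(\lambda\) penalizes all features comparably.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2)) * np.array([1.0, 100.0])  # wildly different scales
y = X[:, 0] + 0.01 * X[:, 1] + rng.normal(size=100)

# Standardize: zero mean, unit variance per column
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

lmbda = 1.0
w = np.linalg.solve(Xs.T @ Xs + lmbda * np.eye(2), Xs.T @ y)
print(w)  # coefficients now live on comparable scales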
Related Topics¶
- Overfitting & Underfitting - Why regularization helps
- Early Stopping - Alternative regularization technique
- Model Evaluation - Tuning regularization parameters