
Bias-variance is a Pythagorean decomposition


Most stats and machine learning courses introduce the bias-variance decomposition as a piece of algebra. You expand $\mathbb{E}[(\hat\theta - \theta)^2]$, add and subtract the mean of $\hat\theta$, watch a cross term vanish, and conclude

$$\text{MSE} \;=\; \text{bias}^2 + \text{variance}.$$

The cross term vanishes because $\mathbb{E}[\hat\theta - \mathbb{E}[\hat\theta]] = 0$. This is true. It is also the Pythagorean theorem. The decomposition says two specific functions are orthogonal in $L^2$, and the MSE is the squared length of their orthogonal sum. Once you see this, the bias-variance tradeoff is a right triangle, and most of the regularization toolkit is geometry on that triangle.

The setup

You estimate a parameter $\theta \in \mathbb{R}$ with a random estimator $\hat\theta$. The mean squared error is

$$\text{MSE}(\hat\theta) \;=\; \mathbb{E}\big[(\hat\theta - \theta)^2\big].$$

Let $\mu = \mathbb{E}[\hat\theta]$. Decompose the error vector into a constant part and a mean-zero part:

$$\hat\theta - \theta \;=\; \underbrace{(\mu - \theta)}_{\text{bias}} \;+\; \underbrace{(\hat\theta - \mu)}_{\text{residual}}.$$

The first piece is a deterministic offset. The second has mean zero by construction.

The inner product

Treat random variables as vectors. The natural inner product on $L^2$ is

$$\langle U, V \rangle \;=\; \mathbb{E}[UV],$$

with squared norm $\|U\|^2 = \mathbb{E}[U^2]$.

Compute the inner product of the two pieces:

$$\langle \mu - \theta,\; \hat\theta - \mu \rangle \;=\; (\mu - \theta)\, \mathbb{E}[\hat\theta - \mu] \;=\; 0.$$

The constant $\mu - \theta$ pulls out of the expectation, and what is left is the mean of a mean-zero variable.

So the bias and the residual are orthogonal vectors in $L^2$.

Pythagoras

For any orthogonal $U, V$:

$$\|U + V\|^2 \;=\; \|U\|^2 + \|V\|^2.$$

Apply it to the error decomposition:

$$\mathbb{E}\big[(\hat\theta - \theta)^2\big] \;=\; (\mu - \theta)^2 + \mathbb{E}\big[(\hat\theta - \mu)^2\big] \;=\; \text{bias}^2 + \text{variance}.$$

That is the entire derivation. The identity is Pythagoras with this particular orthogonal pair.
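
As a quick numerical sanity check, here is a minimal Monte Carlo sketch in NumPy. The estimator (a shrunken Gaussian observation) and all constants are arbitrary illustrative choices, not anything fixed by the derivation above; the point is just that the empirical cross term is near zero and bias² + variance matches the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n_rep = 2.0, 1.0, 200_000

# Illustrative estimator: shrink an observation X ~ N(theta, sigma^2) toward zero.
X = rng.normal(theta, sigma, size=n_rep)
theta_hat = 0.8 * X          # arbitrary shrinkage factor, for illustration only
mu = 0.8 * theta             # exact mean of the estimator, E[theta_hat]

mse = np.mean((theta_hat - theta) ** 2)
bias_sq = (mu - theta) ** 2
variance = np.mean((theta_hat - mu) ** 2)

# Cross term <mu - theta, theta_hat - mu> = (mu - theta) E[theta_hat - mu], should be ~ 0.
cross = (mu - theta) * np.mean(theta_hat - mu)

print(f"MSE          = {mse:.4f}")
print(f"bias^2 + var = {bias_sq + variance:.4f}")
print(f"cross term   ~ {cross:+.1e}")
```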

The picture is a right triangle: the horizontal leg is $|\text{bias}|$, the vertical leg is the residual standard deviation $\sigma_{\text{res}} = \sqrt{\text{variance}}$, and the hypotenuse has length $\sqrt{\text{MSE}}$.

Every estimator is a point in the $(|\text{bias}|, \sigma_{\text{res}})$ plane, and its MSE is the squared distance from the origin.

The shrinkage frontier

Take the canonical example. Observe $X \sim \mathcal{N}(\theta, \sigma^2)$ and use the shrinkage estimator

$$\hat\theta_\lambda \;=\; (1 - \lambda)\, X, \qquad \lambda \in [0, 1].$$

The two legs are clean functions of $\lambda$:

$$|\text{bias}(\lambda)| \;=\; \lambda |\theta|, \qquad \sigma_{\text{res}}(\lambda) \;=\; (1 - \lambda)\, \sigma.$$

In the $(|\text{bias}|, \sigma_{\text{res}})$ plane, the family traces the line segment from $(0, \sigma)$ at $\lambda = 0$ (unbiased, full variance) to $(|\theta|, 0)$ at $\lambda = 1$ (the zero estimator, all bias).

Minimizing MSE is minimizing the distance to the origin over the segment. The minimum is the perpendicular foot: setting $\frac{d}{d\lambda}\big[\lambda^2\theta^2 + (1 - \lambda)^2\sigma^2\big] = 0$ gives

$$\lambda^* \;=\; \frac{\sigma^2}{\theta^2 + \sigma^2}, \qquad \text{MSE}^* \;=\; \frac{\theta^2 \sigma^2}{\theta^2 + \sigma^2}.$$

When the noise dominates ($\sigma^2 \gg \theta^2$), shrink hard. When the signal dominates, barely shrink. The geometry is a perpendicular drop from the origin to the frontier line.
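
Here is a small sketch of that frontier calculation, with $\theta$ and $\sigma$ set to arbitrary example values. It sweeps $\lambda$ over a grid, computes $\text{MSE}(\lambda) = \lambda^2\theta^2 + (1-\lambda)^2\sigma^2$ from the two legs, and checks that the grid minimizer agrees with the closed-form $\lambda^*$.

```python
import numpy as np

theta, sigma = 1.5, 1.0   # arbitrary example values

lam = np.linspace(0.0, 1.0, 10_001)
bias_leg = lam * abs(theta)        # |bias(lambda)|, the horizontal leg
sd_leg = (1.0 - lam) * sigma       # sigma_res(lambda), the vertical leg

# MSE(lambda) is the squared distance from the origin to the point on the frontier.
mse = bias_leg**2 + sd_leg**2

lam_grid = lam[np.argmin(mse)]
lam_star = sigma**2 / (theta**2 + sigma**2)
mse_star = theta**2 * sigma**2 / (theta**2 + sigma**2)

print(f"grid minimum : lambda = {lam_grid:.4f}, MSE = {mse.min():.4f}")
print(f"closed form  : lambda = {lam_star:.4f}, MSE = {mse_star:.4f}")
```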

Slide $\theta$, $\sigma$, $\lambda$ and watch the right triangle morph. The dashed quarter-circle through the current point is the level set $\sqrt{\text{MSE}} = \text{const}$. The smallest such circle that still touches the frontier is the one tangent at $\lambda^*$, the green dot.



What this buys you

Most regularization techniques (ridge regression, weight decay, James–Stein shrinkage, early stopping, dropout) introduce some bias in exchange for a reduction in variance. The right triangle gives a single picture for all of them: each method defines its own frontier in the $(|\text{bias}|, \sigma_{\text{res}})$ plane, and the engineering question is whether that frontier dips closer to the origin than the unbiased baseline.

In high dimensions the frontier really does dip in. The geometric content of Stein’s paradox is exactly this: in $d \ge 3$, the unbiased point $(0, \sigma\sqrt{d})$ is dominated everywhere by a shrinkage estimator, because the noise leg has length $\sigma\sqrt{d}$, and a small bias buys back a much larger reduction in that long vertical leg.
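
A quick simulation of that domination claim, as a sketch rather than a proof: it compares the risk of the unbiased estimator $X$ with the standard James–Stein estimator $\big(1 - (d-2)\sigma^2/\|X\|^2\big)X$ in $d = 10$, with an arbitrary hypothetical choice of the true $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n_rep = 10, 1.0, 100_000
theta = np.full(d, 0.5)   # arbitrary true parameter vector

# One draw of X ~ N(theta, sigma^2 I_d) per replication.
X = rng.normal(theta, sigma, size=(n_rep, d))

# Unbiased estimator: X itself, with risk d * sigma^2.
risk_unbiased = np.mean(np.sum((X - theta) ** 2, axis=1))

# James-Stein: shrink X toward the origin by a data-dependent factor.
shrink = 1.0 - (d - 2) * sigma**2 / np.sum(X**2, axis=1)
theta_js = shrink[:, None] * X
risk_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))

print(f"risk of X (unbiased) : {risk_unbiased:.2f}   (theory: {d * sigma**2:.2f})")
print(f"risk of James-Stein  : {risk_js:.2f}")
```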

Multidimensional version

The decomposition generalizes verbatim. For an estimator $\hat\theta \in \mathbb{R}^d$ of $\theta \in \mathbb{R}^d$,

$$\mathbb{E}\big[\|\hat\theta - \theta\|^2\big] \;=\; \|\mathbb{E}[\hat\theta] - \theta\|^2 \;+\; \mathbb{E}\big[\|\hat\theta - \mathbb{E}[\hat\theta]\|^2\big].$$

Same proof, applied coordinate by coordinate. The constant bias vector is orthogonal to the mean-zero residual in $L^2(\mathbb{R}^d)$, and Pythagoras applies.

The picture survives, with the legs now being the norm of the bias vector and the square root of the trace of the covariance matrix.
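
One last sketch checks the vector identity empirically: with an arbitrary biased estimator of an arbitrary $\theta \in \mathbb{R}^5$, the squared norm of the bias vector plus the trace of the empirical covariance should match the risk.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_rep = 5, 200_000
theta = np.arange(1.0, d + 1.0)                       # arbitrary true vector
theta_hat = 0.9 * rng.normal(theta, 1.0, (n_rep, d))  # illustrative biased estimator

risk = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))
bias_norm_sq = np.sum((theta_hat.mean(axis=0) - theta) ** 2)
trace_cov = np.trace(np.cov(theta_hat, rowvar=False))

print(f"E||theta_hat - theta||^2 : {risk:.3f}")
print(f"||bias||^2 + tr(Cov)     : {bias_norm_sq + trace_cov:.3f}")
```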

Familiar Pythagoreans

Once you internalize the $L^2$ inner product, several other identities turn out to be the same theorem in different costumes.

Each is Pythagoras with a different orthogonal pair. The bias-variance identity is just the most familiar member of the family.

