
Bias-variance is a Pythagorean decomposition


Most stats and machine learning courses introduce the bias-variance decomposition as a piece of algebra. The textbook derivation feels like a lucky cancellation: write down the mean squared error, expand with a clever insertion, watch a cross term vanish. What is left is

$$\text{MSE} \;=\; \text{bias}^2 + \text{variance}.$$

This is true. It is also the Pythagorean theorem. The decomposition says two specific functions are orthogonal in $L^2$, the MSE is the squared length of their sum, and the cross term is the inner product of two perpendicular vectors. Once you see this, the bias-variance tradeoff is a right triangle, and most of the regularization toolkit is geometry on that triangle.

The classical derivation

You estimate a parameter $\theta \in \mathbb{R}$ with an estimator $\hat\theta$. Since $\hat\theta$ is a function of random data, it is itself a random variable. Grade it by its mean squared error, the average squared distance from the truth:

$$\text{MSE}(\hat\theta) \;:=\; \mathbb{E}\big[(\hat\theta - \theta)^2\big].$$

The expectation averages over the randomness in the data. Smaller is better.

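Two further quantities describe how $\hat\theta$ behaves under that randomness. Let $\mu := \mathbb{E}[\hat\theta]$ be the estimator’s mean value across all possible datasets. Then:

$$\text{bias}(\hat\theta) \;:=\; \mu - \theta, \qquad \text{Var}(\hat\theta) \;:=\; \mathbb{E}\big[(\hat\theta - \mu)^2\big].$$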

MSE measures the total error from the truth, bias measures the systematic part, and variance measures the random part. The textbook identity says these three are connected by the simplest possible relation.

Decompose the error inside the MSE by adding and subtracting $\mu$:

$$\hat\theta - \theta \;=\; (\hat\theta - \mu) + (\mu - \theta).$$

Square both sides:

$$(\hat\theta - \theta)^2 \;=\; (\hat\theta - \mu)^2 \;+\; 2(\hat\theta - \mu)(\mu - \theta) \;+\; (\mu - \theta)^2.$$

Take the expectation, term by term:

$$\underbrace{\mathbb{E}\big[(\hat\theta - \theta)^2\big]}_{\text{MSE}(\hat\theta)} \;=\; \underbrace{\mathbb{E}\big[(\hat\theta - \mu)^2\big]}_{\text{Var}(\hat\theta)} \;+\; \underbrace{\mathbb{E}\big[2(\hat\theta - \mu)(\mu - \theta)\big]}_{\text{cross term}} \;+\; \underbrace{\mathbb{E}\big[(\mu - \theta)^2\big]}_{\text{bias}(\hat\theta)^2}.$$

The first term is the variance by definition. The third is $(\mu - \theta)^2 = \text{bias}^2$: both $\mu$ and $\theta$ are constants, so $(\mu - \theta)^2$ has no randomness and equals its own expectation. The cross term takes one extra step:

$$\mathbb{E}\big[2(\hat\theta - \mu)(\mu - \theta)\big] \;=\; 2(\mu - \theta)\,\mathbb{E}[\hat\theta - \mu] \;=\; 2(\mu - \theta) \cdot 0 \;=\; 0.$$

The constant $\mu - \theta$ pulls out of the expectation, leaving $\mathbb{E}[\hat\theta - \mu]$. By the very definition of $\mu = \mathbb{E}[\hat\theta]$, this equals zero.

What is left is the textbook identity:

$$\text{MSE}(\hat\theta) \;=\; \text{bias}(\hat\theta)^2 + \text{Var}(\hat\theta).$$

The derivation is correct, but it reads like a lucky cancellation. The cross term vanishes only because $\mu$ was defined as the mean of $\hat\theta$. Some structure is doing the work, but the algebra by itself doesn’t make it visible.
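As a sanity check, the identity can be verified numerically. The sketch below uses an illustrative estimator that is not part of the derivation above: $0.7$ times a single draw from $\mathcal{N}(\theta, 1)$ with $\theta = 2$ (the names `draw_estimates` and `decompose` are made up for this example). Because the mean and the variance are computed under the same empirical distribution (dividing by $n$, not $n - 1$), the two sides agree up to floating-point rounding, not just approximately.

```python
import random

def draw_estimates(theta, sigma, scale, n_draws, seed=0):
    """Simulate n_draws values of the illustrative estimator scale * X, X ~ N(theta, sigma^2)."""
    rng = random.Random(seed)
    return [scale * rng.gauss(theta, sigma) for _ in range(n_draws)]

def decompose(estimates, theta):
    """Return (mse, bias_sq, variance) under the empirical distribution of the estimates."""
    n = len(estimates)
    mu = sum(estimates) / n                                # empirical mean of the estimator
    mse = sum((t - theta) ** 2 for t in estimates) / n     # average squared distance to the truth
    bias_sq = (mu - theta) ** 2                            # systematic part
    variance = sum((t - mu) ** 2 for t in estimates) / n   # random part (1/n normalization)
    return mse, bias_sq, variance

estimates = draw_estimates(theta=2.0, sigma=1.0, scale=0.7, n_draws=20000)
mse, bias_sq, variance = decompose(estimates, theta=2.0)
print(f"MSE = {mse:.4f}, bias^2 + Var = {bias_sq + variance:.4f}")
```

For this choice the population values are $\text{bias}^2 = 0.36$ and $\text{Var} = 0.49$, so both printed numbers should land near $0.85$.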

What the cancellation actually says

Treat random variables as vectors: they can be added and scaled, and the space of square-integrable ones (those with $\mathbb{E}[U^2] < \infty$) is called $L^2$. The natural inner product on $L^2$ is

$$\langle U, V \rangle \;:=\; \mathbb{E}[UV],$$

with squared norm $\|U\|^2 = \mathbb{E}[U^2]$. Two random variables are orthogonal in $L^2$ when $\langle U, V \rangle = 0$. (For mean-zero variables this reduces to being uncorrelated; the $L^2$ inner product generalizes correlation.)

The cross term in the derivation is exactly the inner product of two specific random variables, the constant $\mu - \theta$ and the mean-zero residual $\hat\theta - \mu$:

$$\langle \mu - \theta,\; \hat\theta - \mu \rangle \;=\; \mathbb{E}\big[(\mu - \theta)(\hat\theta - \mu)\big] \;=\; (\mu - \theta)\,\mathbb{E}[\hat\theta - \mu] \;=\; 0.$$

The cancellation isn’t a coincidence. It is the statement that the bias and the residual are orthogonal vectors in $L^2$.

Now apply the Pythagorean theorem. In any inner-product space, orthogonal vectors satisfy

$$\|U + V\|^2 \;=\; \|U\|^2 + \|V\|^2 \qquad \text{whenever } \langle U, V \rangle = 0.$$

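Let $U = \mu - \theta$ and $V = \hat\theta - \mu$. Their sum is $\hat\theta - \theta$, and the three squared norms are exactly the three quantities at play:

$$\|U\|^2 = (\mu - \theta)^2 = \text{bias}(\hat\theta)^2, \qquad \|V\|^2 = \mathbb{E}\big[(\hat\theta - \mu)^2\big] = \text{Var}(\hat\theta), \qquad \|U + V\|^2 = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \text{MSE}(\hat\theta).$$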

Pythagoras gives the identity immediately:

$$\text{MSE}(\hat\theta) \;=\; \text{bias}(\hat\theta)^2 + \text{Var}(\hat\theta).$$

The “cross term vanishes” step in the algebraic derivation is the orthogonality of the bias and the residual; the bias-variance identity is the squared lengths adding because the angle between them is a right angle.
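The orthogonality can also be seen in samples. The sketch below (again with an illustrative estimator, $0.7 X$ with $X \sim \mathcal{N}(2, 1)$, and made-up helper names `inner` and `cross_term`) computes the empirical inner product $\mathbb{E}[UV]$ for two ways of splitting the error into a constant plus a residual: once at the estimator's mean, and once at a different constant. Only the split at the mean produces a right angle.

```python
import random

def inner(u, v):
    """Empirical L^2 inner product <U, V> = E[UV] over paired samples."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

rng = random.Random(1)
theta = 2.0
draws = [0.7 * rng.gauss(theta, 1.0) for _ in range(20000)]  # illustrative biased estimator
mu = sum(draws) / len(draws)                                 # empirical mean of the estimator

def cross_term(center):
    """Inner product of the constant (center - theta) with the residual (draw - center)."""
    u = [center - theta] * len(draws)   # a constant, viewed as a random variable
    v = [d - center for d in draws]     # what is left of the error after removing the constant
    return inner(u, v)

at_mean = cross_term(mu)          # split at the mean: residual has mean zero, so <U, V> = 0
off_mean = cross_term(mu + 1.0)   # split elsewhere: the residual keeps a nonzero mean
print(f"<U, V> at the mean: {at_mean:.2e}; off the mean: {off_mean:.4f}")
```

Splitting at any constant $c \ne \mu$ gives a cross term of $(c - \theta)(\mu - c)$, which is zero only in the degenerate case $c = \theta$. The mean is the one split point that makes the two pieces perpendicular for every $\theta$.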

The right triangle

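Picture a right triangle with a horizontal leg of length $|\text{bias}|$, a vertical leg of length $\sigma_{\text{res}} := \sqrt{\text{Var}(\hat\theta)}$ (the standard deviation of the residual $\hat\theta - \mu$), and a hypotenuse of length $\sqrt{\text{MSE}}$.

Every estimator $\hat\theta$ is a point in the $(|\text{bias}|, \sigma_{\text{res}})$ plane, and its MSE is the squared distance from the origin.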

The shrinkage frontier

Take the canonical example. Observe $X \sim \mathcal{N}(\theta, \sigma^2)$ and use the shrinkage estimator

$$\hat\theta_\lambda \;=\; (1 - \lambda)\, X, \qquad \lambda \in [0, 1].$$

The two legs are clean functions of $\lambda$:

$$|\text{bias}(\lambda)| \;=\; \lambda |\theta|, \qquad \sigma_{\text{res}}(\lambda) \;=\; (1 - \lambda)\, \sigma.$$

In the $(|\text{bias}|, \sigma_{\text{res}})$ plane, the family traces the line segment from $(0, \sigma)$ at $\lambda = 0$ (unbiased, full variance) to $(|\theta|, 0)$ at $\lambda = 1$ (the zero estimator, all bias).
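For completeness, the two legs and the straight-line frontier follow from linearity of expectation and the scaling rule for variance. A short worked version, assuming $\theta \neq 0$ in the last line so that the division makes sense:

```latex
\begin{aligned}
\mathbb{E}[\hat\theta_\lambda] &= (1 - \lambda)\,\mathbb{E}[X] = (1 - \lambda)\,\theta, \\
\text{bias}(\lambda) &= \mathbb{E}[\hat\theta_\lambda] - \theta = -\lambda\,\theta, \\
\text{Var}(\hat\theta_\lambda) &= (1 - \lambda)^2\,\text{Var}(X) = (1 - \lambda)^2 \sigma^2, \\
\frac{|\text{bias}(\lambda)|}{|\theta|} + \frac{\sigma_{\text{res}}(\lambda)}{\sigma} &= \lambda + (1 - \lambda) = 1.
\end{aligned}
```

The last line is the intercept form of a line with intercepts $|\theta|$ and $\sigma$, which is why the frontier is a straight segment and not a curve.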

Minimizing MSE is minimizing distance to the origin over the segment. The minimum is the perpendicular foot, the point where the perpendicular from the origin meets the line, occurring at

$$\lambda^* \;=\; \frac{\sigma^2}{\theta^2 + \sigma^2}, \qquad \text{MSE}^* \;=\; \frac{\theta^2 \sigma^2}{\theta^2 + \sigma^2}.$$

When the noise dominates ($\sigma^2 \gg \theta^2$), shrink hard. When the signal dominates, barely shrink. The geometry is a perpendicular drop from the origin to the frontier line.
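A small numerical check of the perpendicular-foot claim, as a sketch with illustrative values $\theta = 2$, $\sigma = 1$ (the function names `mse` and `optimal_lambda` are made up for this example). It compares the closed-form $\lambda^*$ with a brute-force grid minimum and confirms that the optimal point, viewed as a vector from the origin, is orthogonal to the direction of the frontier.

```python
def mse(lam, theta, sigma):
    """Squared distance from the origin to the frontier point (lam*|theta|, (1-lam)*sigma)."""
    return (lam * theta) ** 2 + ((1.0 - lam) * sigma) ** 2

def optimal_lambda(theta, sigma):
    """Closed-form minimizer sigma^2 / (theta^2 + sigma^2)."""
    return sigma ** 2 / (theta ** 2 + sigma ** 2)

theta, sigma = 2.0, 1.0
lam_star = optimal_lambda(theta, sigma)

# Brute-force check: minimize over a grid of 1001 values in [0, 1].
grid = [i / 1000 for i in range(1001)]
lam_grid = min(grid, key=lambda lam: mse(lam, theta, sigma))

# Perpendicularity: the optimal point is orthogonal to the frontier's direction vector,
# which runs from (0, sigma) to (|theta|, 0).
point = (lam_star * abs(theta), (1.0 - lam_star) * sigma)
direction = (abs(theta), -sigma)
dot = point[0] * direction[0] + point[1] * direction[1]

print(f"lambda* = {lam_star:.3f}, grid argmin = {lam_grid:.3f}, "
      f"MSE* = {mse(lam_star, theta, sigma):.3f}, dot = {dot:.1e}")
```

With these values $\lambda^* = 1/5$ and $\text{MSE}^* = 4/5$, and the dot product is zero up to rounding.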

Slide $\theta$, $\sigma$, $\lambda$ and watch the right triangle morph. The dashed quarter-circle through the current point is the level set $\sqrt{\text{MSE}} = \text{const}$. The smallest such circle that still touches the frontier is the one tangent at $\lambda^*$, the green dot.

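[Interactive figure: the right triangle in the $(|\text{bias}|, \sigma_{\text{res}})$ plane. Sample readout: bias = 0.60, variance = 0.49, MSE = 0.85; minimum MSE 0.80 at $\lambda^* = 0.20$.]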

A few things to read off the picture:

What this buys you

Most regularization techniques (ridge regression, weight decay, James–Stein shrinkage, early stopping, dropout) introduce some bias in exchange for a reduction in variance. The right triangle gives a single picture for all of them: each method defines its own frontier in the $(|\text{bias}|, \sigma_{\text{res}})$ plane, and the engineering question is whether that frontier dips closer to the origin than the unbiased baseline.

In high dimensions the frontier really does dip in. The geometric content of Stein’s paradox is exactly this: in $d \ge 3$, the unbiased point $(0, \sigma\sqrt{d})$ is dominated everywhere by a shrinkage estimator, because the noise leg has length $\sigma\sqrt{d}$, and a small bias buys back a much larger reduction in that long vertical leg.
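A Monte Carlo sketch of this effect, under assumptions that go beyond the text: a single observation $X \sim \mathcal{N}(\theta, \sigma^2 I_d)$ with known $\sigma$, the plain (not positive-part) James–Stein estimator $\big(1 - (d-2)\sigma^2 / \|X\|^2\big) X$, and an arbitrary illustrative truth $\theta = (1, \dots, 1)$ in $d = 10$. The helper names `james_stein` and `risks` are made up for this example.

```python
import random

def james_stein(x, sigma):
    """Plain James-Stein shrinkage of x toward the origin (assumes len(x) >= 3, known sigma)."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (d - 2) * sigma ** 2 / norm_sq
    return [factor * v for v in x]

def risks(theta, sigma, n_trials, seed=0):
    """Monte Carlo estimates of E||X - theta||^2 and E||JS(X) - theta||^2."""
    rng = random.Random(seed)
    raw_total = 0.0
    js_total = 0.0
    for _ in range(n_trials):
        x = [rng.gauss(t, sigma) for t in theta]   # one noisy observation of theta
        js = james_stein(x, sigma)
        raw_total += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_total += sum((a - t) ** 2 for a, t in zip(js, theta))
    return raw_total / n_trials, js_total / n_trials

theta = [1.0] * 10   # illustrative truth, d = 10
raw_risk, js_risk = risks(theta, sigma=1.0, n_trials=5000)
print(f"unbiased risk ~ {raw_risk:.2f}, James-Stein risk ~ {js_risk:.2f}")
```

The unbiased risk should sit near $d\sigma^2 = 10$, with the James–Stein risk visibly below it. The gap shrinks as $\|\theta\|$ grows but, for $d \ge 3$, never closes.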

Multidimensional version

The decomposition generalizes verbatim. For an estimator $\hat\theta \in \mathbb{R}^d$ of $\theta \in \mathbb{R}^d$,

$$\mathbb{E}\big[\|\hat\theta - \theta\|^2\big] \;=\; \|\mathbb{E}[\hat\theta] - \theta\|^2 \;+\; \mathbb{E}\big[\|\hat\theta - \mathbb{E}[\hat\theta]\|^2\big].$$

Same proof, applied coordinate by coordinate. The constant bias vector is orthogonal to the mean-zero residual in $L^2(\mathbb{R}^d)$, and Pythagoras applies.

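The picture survives, with the legs now being the norm of the bias vector and the square root of the trace of the covariance matrix (the trace is the sum of the per-coordinate variances, so the squared legs are $\|\text{bias}\|^2$ and $\operatorname{tr}\,\text{Cov}(\hat\theta)$).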
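The vector identity can be checked the same way as the scalar one. The sketch below uses an illustrative estimator in $d = 3$ (a noisy observation of $\theta$ shrunk coordinate-wise by $0.8$; the helper name `vector_decomposition` is made up) and computes all three quantities under the empirical distribution, so the two sides agree up to floating-point rounding.

```python
import random

def vector_decomposition(samples, theta):
    """Return (total_mse, bias_norm_sq, trace_cov) under the empirical distribution."""
    n, d = len(samples), len(theta)
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    # E||theta_hat - theta||^2
    total_mse = sum(sum((s[j] - theta[j]) ** 2 for j in range(d)) for s in samples) / n
    # ||E[theta_hat] - theta||^2
    bias_norm_sq = sum((mean[j] - theta[j]) ** 2 for j in range(d))
    # trace of the covariance = sum of per-coordinate variances (1/n normalization)
    trace_cov = sum(sum((s[j] - mean[j]) ** 2 for s in samples) / n for j in range(d))
    return total_mse, bias_norm_sq, trace_cov

rng = random.Random(2)
theta = [1.0, -2.0, 0.5]
# Illustrative estimator: 0.8 times a unit-variance Gaussian observation of each coordinate.
samples = [[0.8 * rng.gauss(t, 1.0) for t in theta] for _ in range(5000)]
total_mse, bias_norm_sq, trace_cov = vector_decomposition(samples, theta)
print(f"total MSE = {total_mse:.4f}, ||bias||^2 + tr(Cov) = {bias_norm_sq + trace_cov:.4f}")
```

For this choice the population values are $\|\text{bias}\|^2 = 0.04 \cdot 5.25 = 0.21$ and $\operatorname{tr}\,\text{Cov} = 3 \cdot 0.64 = 1.92$, so both printed numbers should be close to $2.13$.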

Familiar Pythagoreans

Once you internalize the $L^2$ inner product, several other identities are the same theorem in different costumes:

Each is Pythagoras with a different orthogonal pair. The bias-variance identity is just the most familiar member of the family.
