
Stein's paradox


In the 1950s, the statistical orthodoxy was clear: to estimate the mean of a Gaussian, use the sample mean. It is the maximum-likelihood estimator, the minimum-variance unbiased estimator, and (in one dimension) admissible: no other estimator beats it for every value of the parameter.

In 1961, Charles Stein and Willard James proved that in three or more dimensions, the sample mean is not admissible. There is a different estimator that beats it everywhere: for every choice of the true mean $\theta \in \mathbb{R}^d$, simultaneously, in mean squared error. Even when the components are completely unrelated.

The estimator is famous and, when first encountered, jarring. It is also a precursor of ridge regression, empirical Bayes, and a great deal of modern machine learning.

The setup

You observe a single sample $X \sim \mathcal{N}(\theta, I_d)$ in $\mathbb{R}^d$ and want to estimate $\theta$. The sample mean (here, just $X$) has

$$\mathrm{MSE}_{\mathrm{MLE}} \;=\; \mathbb{E}[\|X - \theta\|^2] \;=\; d.$$

Doesn’t depend on $\theta$. Just $d$, no matter what.
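
A quick Monte Carlo check of this, as a minimal NumPy sketch (the dimension $d = 5$ and the two test means are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
for theta in (np.zeros(d), np.full(d, 100.0)):        # two very different true means
    x = theta + rng.standard_normal((200_000, d))     # X ~ N(theta, I_d)
    print(np.mean(np.sum((x - theta) ** 2, axis=1)))  # ~ 5 both times: the MSE is d, regardless of theta
```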

The James–Stein estimator

Define

$$\hat\theta_{\mathrm{JS}} \;=\; \left(1 - \frac{d - 2}{\|X\|^2}\right) X.$$

The shrinkage factor $\bigl(1 - (d-2)/\|X\|^2\bigr)$ pulls $X$ toward the origin. (When $\|X\|^2$ is small, the factor can go negative; the positive-part version $\hat\theta_{\mathrm{JS}^+} = \max\{0,\, 1 - (d-2)/\|X\|^2\}\, X$ is a strict improvement, and is what people use in practice.)
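
As a concrete reference point, here is a minimal NumPy sketch of both versions; the function names `james_stein` and `james_stein_plus` are my own, not a standard API.

```python
import numpy as np

def james_stein(x):
    """Plain James-Stein estimate of theta from one draw x ~ N(theta, I_d), d >= 3."""
    d = x.shape[0]
    factor = 1.0 - (d - 2) / np.sum(x**2)        # can go negative when ||x||^2 < d - 2
    return factor * x

def james_stein_plus(x):
    """Positive-part James-Stein: clip the shrinkage factor at zero."""
    d = x.shape[0]
    factor = max(0.0, 1.0 - (d - 2) / np.sum(x**2))
    return factor * x
```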

Theorem (James and Stein, 1961). For every $\theta \in \mathbb{R}^d$ and every $d \ge 3$,

$$\mathbb{E}\bigl[\|\hat\theta_{\mathrm{JS}} - \theta\|^2\bigr] \;<\; d \;=\; \mathbb{E}\bigl[\|X - \theta\|^2\bigr].$$

The sample mean is dominated uniformly. For $d \le 2$ the estimator degenerates and the dominance disappears. The phase transition at $d = 3$ is sharp.
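
A Monte Carlo check of the statement, as a rough sketch (the helper `mc_mse`, the seed, and the grid of test points are all my own choices): at $d = 3$ and above, the James–Stein risk lands below $d$ at every $\theta$ tried, while at $d = 2$ the two estimators agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_mse(d, theta_scale, use_js, n=200_000):
    """Monte Carlo MSE with theta = (theta_scale, ..., theta_scale) in R^d."""
    theta = np.full(d, theta_scale)
    x = theta + rng.standard_normal((n, d))
    if use_js:
        factor = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
        est = factor * x
    else:
        est = x
    return np.mean(np.sum((est - theta) ** 2, axis=1))

for d in (2, 3, 10):
    for s in (0.0, 1.0, 5.0):
        print(f"d={d:2d}  theta_scale={s:3.1f}  "
              f"MLE={mc_mse(d, s, False):.2f}  JS={mc_mse(d, s, True):.2f}")
```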

The geometric reason

Why should shrinkage help?

If $X \sim \mathcal{N}(\theta, I_d)$, then $\|X\|^2$ has a non-central chi-squared distribution with non-centrality $\|\theta\|^2$, so

$$\mathbb{E}[\|X\|^2] \;=\; \|\theta\|^2 + d.$$

The sample $X$ is on average further from the origin than $\theta$ is, by an amount that grows with $d$. This is exactly the Gaussian shell phenomenon: noise in $d$ dimensions has typical magnitude $\sqrt{d}$, and adding it to $\theta$ pushes outward.

Shrinking $X$ toward the origin partially undoes this outward push. In high dimensions, the bias introduced by shrinkage is a much smaller error than the variance you save. The dimension threshold $d = 3$ becomes natural in this picture: in $d = 1, 2$, the noise radius $\sqrt{d}$ isn’t much, and shrinkage doesn’t pay off; in $d \ge 3$, it does, uniformly.
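
The outward push is easy to see numerically; a short sketch (the dimension and $\theta$ are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
theta = np.ones(d)                                 # ||theta||^2 = 100
x = theta + rng.standard_normal((500_000, d))      # X ~ N(theta, I_d)
print(np.mean(np.sum(x**2, axis=1)))               # ~ 200 = ||theta||^2 + d: X sits farther out than theta
```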

The exact value at $\theta = 0$

A clean computation. If $\theta = 0$ then $\|X\|^2 \sim \chi^2_d$, and for $d > 2$, $\mathbb{E}[1/\|X\|^2] = 1/(d - 2)$. Plugging in,

$$\mathbb{E}\bigl[\|\hat\theta_{\mathrm{JS}}\|^2\bigr] \;=\; \mathbb{E}[\|X\|^2] - 2(d - 2) + (d - 2)^2\, \mathbb{E}[1/\|X\|^2] \;=\; d - 2(d - 2) + (d - 2) \;=\; 2.$$

The James–Stein MSE at the origin is exactly $2$, for every $d \ge 3$.

The MLE has MSE $d$. So at $\theta = 0$, JS is a factor of $d/2$ better. For $d = 100$, MSE drops from $100$ to $2$. A factor of $50$.
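
Checking the $d = 100$ claim by simulation, as a quick sketch (seed and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 100, 200_000
x = rng.standard_normal((n, d))                               # theta = 0
factor = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
print(np.mean(np.sum(x**2, axis=1)))                          # MLE MSE: ~ 100
print(np.mean(np.sum((factor * x) ** 2, axis=1)))             # JS MSE:  ~ 2
```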

Slide $d$ to see the full curve as $\|\theta\|$ varies:

[Interactive chart: MSE vs. $\|\theta\|$ for the MLE and the James–Stein estimator, with a slider for $d$. At the default $d = 3$: $\mathrm{MSE}_{\mathrm{MLE}} = 3$, $\mathrm{MSE}_{\mathrm{JS}}(\theta = 0) = 2$, a $\times 1.5$ saving at the origin.]

The MLE curve sits flat at $d$. The James–Stein curve starts at $\approx 2$ for $\theta = 0$ and rises smoothly toward $d$ as $\|\theta\|$ grows: with the true mean far from the origin, the shrinkage benefit shrinks, but the MLE never wins at any $\theta$ in $d \ge 3$. At $d = 1$ the JS curve sits above the MLE (the formula inflates rather than shrinks); at $d = 2$ they coincide.
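
If you want the same curve without the widget, here is a rough sketch that sweeps $\|\theta\|$ at fixed $d$ (the grid of radii is my arbitrary choice; by rotational symmetry, only $\|\theta\|$ matters):

```python
import numpy as np

rng = np.random.default_rng(3)

def js_mse(d, r, n=200_000):
    """Monte Carlo MSE of James-Stein when ||theta|| = r."""
    theta = np.zeros(d)
    theta[0] = r                                   # direction is irrelevant by symmetry
    x = theta + rng.standard_normal((n, d))
    factor = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
    return np.mean(np.sum((factor * x - theta) ** 2, axis=1))

d = 3
for r in (0.0, 1.0, 2.0, 5.0, 10.0):
    print(f"||theta|| = {r:4.1f}   MSE_MLE = {d}   MSE_JS = {js_mse(d, r):.2f}")
```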

What is genuinely weird

Stein’s original example is the part that breaks people’s intuition.

Suppose you want to estimate three independent things: the speed of light, the price of tea in China, and the average height of redwoods. For each, you have one Gaussian-noisy measurement. The natural recipe: estimate each one independently, by its own measurement. Stein says: shrink all three toward zero (or toward any common point) and your total MSE drops.

The three estimation problems have nothing to do with each other. The three measurements are independent. And yet coupling the three estimators, by shrinking them together, reduces the joint risk. This is profoundly counterintuitive: insisting on estimating each quantity separately carries a real cost in total MSE.

The reason this confounds intuition is that unbiasedness was the wrong objective. $\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$, and the sample mean buys its zero bias at the price of high variance. In $d \ge 3$, accepting a bit of bias (shrinkage) buys a much larger reduction in variance, every time.
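
The trade is visible if you split the James–Stein error into its two pieces; a small sketch (the choice $d = 10$, $\|\theta\| = 2$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 10, 500_000
theta = np.zeros(d)
theta[0] = 2.0
x = theta + rng.standard_normal((n, d))
factor = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
est = factor * x

bias_sq = np.sum((est.mean(axis=0) - theta) ** 2)   # squared norm of the bias vector
variance = np.sum(est.var(axis=0))                  # total variance summed over components
print(f"bias^2 = {bias_sq:.2f}   variance = {variance:.2f}   "
      f"MSE = {bias_sq + variance:.2f}   (MLE: bias^2 = 0, variance = {d})")
```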

What this led to

Stein’s paradox is the seed of a huge amount of modern statistics and ML: ridge regression, empirical Bayes, and shrinkage estimators of every kind.

All of these are descendants of the same move: the sample mean is wrong in $\ge 3$ dimensions; do something biased instead.

References

James, W. and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 361–379.