Skip to content

Stein's paradox

Posted on:

The orthodoxy was clear: to estimate the mean of a Gaussian, use the sample mean. Grading estimators by their mean squared error

MSE(θ^)  =  E[θ^θ2],\mathrm{MSE}(\hat\theta) \;=\; \mathbb{E}\bigl[\|\hat\theta - \theta\|^2\bigr],

the sample mean is the maximum-likelihood estimator, the minimum-variance unbiased estimator, and, in one dimension, admissible: no other estimator beats it for every value of the parameter. For two centuries this read as the end of the story.

In 1961, Charles Stein and Willard James proved that in three or more dimensions, the sample mean is not admissible. There is a different estimator with strictly smaller MSE for every θRd\theta \in \mathbb{R}^d, simultaneously. Even when the components correspond to completely unrelated quantities.

The estimator is famous and, when first encountered, jarring. It is also a precursor of ridge regression, empirical Bayes, and a great deal of modern machine learning.

The case for the sample mean in one dimension

You observe nn i.i.d. samples X1,,XnN(θ,σ2)X_1, \ldots, X_n \sim \mathcal{N}(\theta, \sigma^2) with known σ2\sigma^2 and unknown mean θR\theta \in \mathbb{R}. Three properties make the sample mean Xˉ=1niXi\bar X = \tfrac{1}{n}\sum_i X_i the textbook answer.

It maximizes the likelihood. The joint density of the data, viewed as a function of θ\theta with the data held fixed, is

L(θ)    exp ⁣(12σ2i=1n(Xiθ)2).L(\theta) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \theta)^2\right).

Maximizing LL over θ\theta is the same as minimizing i(Xiθ)2\sum_i (X_i - \theta)^2, a quadratic in θ\theta minimized at Xˉ\bar X. Of all values of θ\theta, the sample mean is the one that makes the observed data most plausible under the Gaussian model.

It is unbiased. E[Xˉ]=θ\mathbb{E}[\bar X] = \theta by linearity of expectation. On average, Xˉ\bar X hits the truth exactly.

It has the smallest possible MSE among unbiased estimators. The Cramér–Rao lower bound says any unbiased θ~(X1,,Xn)\tilde\theta(X_1, \ldots, X_n) satisfies

Var(θ~)    1nI(θ),\mathrm{Var}(\tilde\theta) \;\ge\; \frac{1}{n\, I(\theta)},

where I(θ)I(\theta) is the Fisher information per observation, a measure of how sharply the likelihood peaks around the truth. For the Gaussian, I(θ)=1/σ2I(\theta) = 1/\sigma^2, so the bound is σ2/n\sigma^2/n, and Xˉ\bar X saturates it. So MSE(Xˉ)=Var(Xˉ)=σ2/n\mathrm{MSE}(\bar X) = \mathrm{Var}(\bar X) = \sigma^2/n, the lowest MSE achievable by any unbiased estimator.

In one dimension, it is admissible. No estimator (biased or not) has strictly smaller MSE than Xˉ\bar X at every θ\theta simultaneously. You can find rivals that win at specific values (the constant θ^=0\hat\theta = 0 has zero MSE at θ=0\theta = 0 but MSE θ2\theta^2 everywhere else), but never one that wins everywhere. This is the substantive property among the four; the proof is below.

The MLE, the minimum-variance unbiased estimator, and unbeatable-everywhere — all three lining up convinced generations of statisticians that the sample mean was the only defensible default.

Proving admissibility in one dimension

The first three properties were direct calculations. Admissibility is the substantive one. Take the canonical case XN(θ,1)X \sim \mathcal{N}(\theta, 1) on R\mathbb{R} (one observation, unit variance, after rescaling). The sample mean is just XX, and its MSE is E[(Xθ)2]=1\mathbb{E}[(X - \theta)^2] = 1 at every θ\theta. The claim is that no estimator strictly beats 11 everywhere.

The standard proof, due to Hodges and Lehmann (1950) and Blyth (1951), is by contradiction, using a wide Gaussian prior on θ\theta.

Suppose some θ~\tilde\theta dominated XX: its risk function R(θ):=E[(θ~θ)2]R(\theta) := \mathbb{E}\bigl[(\tilde\theta - \theta)^2\bigr] satisfies R(θ)1R(\theta) \le 1 for every θ\theta, with strict inequality R(θ0)<1R(\theta_0) < 1 at some θ0\theta_0. Assuming RR is continuous in θ\theta (true for any reasonable θ~\tilde\theta), there is a small interval [θ0a,θ0+a][\theta_0 - a, \theta_0 + a] on which R(θ)1δR(\theta) \le 1 - \delta for some δ>0\delta > 0.

A wide Gaussian prior. Place a prior θπτ:=N(0,τ2)\theta \sim \pi_\tau := \mathcal{N}(0, \tau^2) for a large variance τ2\tau^2. Standard Gaussian conjugacy gives the posterior

θX    N ⁣(τ21+τ2X,  τ21+τ2),\theta \mid X \;\sim\; \mathcal{N}\!\left(\frac{\tau^2}{1+\tau^2} X,\; \frac{\tau^2}{1+\tau^2}\right),

so the Bayes estimator under squared-error loss (the posterior mean) is θ^πτ=τ21+τ2X\hat\theta_{\pi_\tau} = \frac{\tau^2}{1+\tau^2} X. Its Bayes risk — the prior-averaged MSE — is

E[(θ^πτθ)2θ]πτ(θ)dθ  =  τ21+τ2.\int \mathbb{E}\bigl[(\hat\theta_{\pi_\tau} - \theta)^2 \mid \theta\bigr]\, \pi_\tau(\theta)\, d\theta \;=\; \frac{\tau^2}{1+\tau^2}.

The defining property of the Bayes estimator is that it minimizes prior-averaged MSE. So any estimator — including our hypothetical θ~\tilde\theta — has Bayes risk at least τ21+τ2\frac{\tau^2}{1+\tau^2}:

R(θ)πτ(θ)dθ    τ21+τ2.\int R(\theta)\, \pi_\tau(\theta)\, d\theta \;\ge\; \frac{\tau^2}{1+\tau^2}.

Squeezing to a contradiction. The sample mean has R(θ,X)=1R(\theta, X) = 1 everywhere, so its prior-averaged MSE is 1πτ(θ)dθ=1\int 1 \cdot \pi_\tau(\theta)\, d\theta = 1. The Bayes risk gap between XX and θ~\tilde\theta is therefore squeezed:

[1R(θ)]πτ(θ)dθ    1τ21+τ2  =  11+τ2.\int \bigl[1 - R(\theta)\bigr]\, \pi_\tau(\theta)\, d\theta \;\le\; 1 - \frac{\tau^2}{1+\tau^2} \;=\; \frac{1}{1+\tau^2}.

The integrand is non-negative everywhere (by domination) and at least δ\delta on the interval where θ~\tilde\theta does strictly better, so

[1R(θ)]πτ(θ)dθ    δπτ([θ0a,θ0+a]).\int \bigl[1 - R(\theta)\bigr]\, \pi_\tau(\theta)\, d\theta \;\ge\; \delta \cdot \pi_\tau\bigl([\theta_0 - a, \theta_0 + a]\bigr).

For τ\tau large, the prior density near θ0\theta_0 is approximately 12πτ2\frac{1}{\sqrt{2\pi\tau^2}}, so the interval mass is 2a2πτ2\approx \frac{2a}{\sqrt{2\pi\tau^2}}. Putting the two bounds together,

2aδ2πτ2    11+τ2.\frac{2a\delta}{\sqrt{2\pi\tau^2}} \;\le\; \frac{1}{1+\tau^2}.

The left side decays like τ1\tau^{-1}; the right side like τ2\tau^{-2}. For τ\tau large enough, τ1>τ2\tau^{-1} > \tau^{-2}, contradicting the inequality. So no dominating θ~\tilde\theta exists, and XX is admissible.

Where the proof breaks in higher dimensions. Repeat the same argument in Rd\mathbb{R}^d with a N(0,τ2Id)\mathcal{N}(0, \tau^2 I_d) prior. The prior spreads its mass over a dd-dimensional region, so the density at any fixed θ0\theta_0 scales like τd\tau^{-d}, and the ball-mass contribution decays at the same rate. The Bayes risk gap, computed coordinate-by-coordinate, is at most d1+τ2τ2\frac{d}{1+\tau^2} \sim \tau^{-2}. The contradiction requires the ball-mass to decay slower than the Bayes gap — which only happens at d=1d = 1. (Admissibility actually still holds at d=2d = 2, via a more carefully chosen prior; at d3d \ge 3 admissibility genuinely fails, and Stein’s estimator is what walks through the open door.)

The Stein setup

By sufficiency, all the information about θ\theta in X1,,XnX_1, \ldots, X_n is concentrated in XˉN(θ,σ2/n)\bar X \sim \mathcal{N}(\theta, \sigma^2/n). The original problem reduces to: observe a single Gaussian sample with known variance and unknown mean. Rescaling sets the variance to one, and the multivariate version is the same story coordinate-by-coordinate. So without loss of generality, work with a single observation

X    N(θ,Id)in Rd,X \;\sim\; \mathcal{N}(\theta, I_d) \qquad \text{in } \mathbb{R}^d,

and try to estimate θRd\theta \in \mathbb{R}^d. The sample mean (here, just XX) has

MSEMLE  =  E[Xθ2]  =  d.\mathrm{MSE}_{\mathrm{MLE}} \;=\; \mathbb{E}[\|X - \theta\|^2] \;=\; d.

Doesn’t depend on θ\theta. Just dd, no matter what.

The James–Stein estimator

Define

θ^JS  =  (1d2X2)X.\hat\theta_{\mathrm{JS}} \;=\; \left(1 - \frac{d - 2}{\|X\|^2}\right) X.

The shrinkage factor (1(d2)/X2)\bigl(1 - (d-2)/\|X\|^2\bigr) pulls XX toward the origin, with the amount of shrinkage decided by the data: when X2\|X\|^2 is large, barely shrink; when it is small, shrink hard. (When X2\|X\|^2 is small enough that the factor goes negative, the positive-part version θ^JS+=max{0,1(d2)/X2}X\hat\theta_{\mathrm{JS}^+} = \max\{0, 1 - (d-2)/\|X\|^2\}\,X is a strict improvement, and is what people use in practice.)

Where the formula comes from. Try the simpler family θ^=cX\hat\theta = c X for a fixed scalar c[0,1]c \in [0, 1]. The bias-variance decomposition gives

MSE(cX)  =  (c1)2θ2+c2d,\mathrm{MSE}(cX) \;=\; (c-1)^2 \|\theta\|^2 + c^2 d,

minimized at c=θ2/(θ2+d)c^\star = \|\theta\|^2 / (\|\theta\|^2 + d). That is the optimal shrinkage if you knew θ\|\theta\|. You don’t, but the data hands you an estimate: E[X2]=θ2+d\mathbb{E}[\|X\|^2] = \|\theta\|^2 + d, so X2d\|X\|^2 - d is an unbiased estimator of θ2\|\theta\|^2. Substituting and simplifying produces a shrinkage factor of 1d/X21 - d/\|X\|^2. James and Stein replace dd with d2d - 2 — a correction that drops out of an integration-by-parts identity called Stein’s lemma — and that swap is exactly what makes the dominance proof go through.

Theorem (James and Stein, 1961). For every θRd\theta \in \mathbb{R}^d and every d3d \ge 3,

E[θ^JSθ2]  <  d  =  E[Xθ2].\mathbb{E}\bigl[\|\hat\theta_{\mathrm{JS}} - \theta\|^2\bigr] \;<\; d \;=\; \mathbb{E}\bigl[\|X - \theta\|^2\bigr].

The sample mean is dominated uniformly. For d2d \le 2 the estimator degenerates and the dominance disappears. The phase transition at d=3d = 3 is sharp.

The geometric reason

The mechanism behind shrinkage is a fact about the magnitude of XX. If XN(θ,Id)X \sim \mathcal{N}(\theta, I_d), then X2\|X\|^2 has a non-central chi-squared distribution with non-centrality θ2\|\theta\|^2, so

E[X2]  =  θ2+d.\mathbb{E}[\|X\|^2] \;=\; \|\theta\|^2 + d.

The sample XX is on average further from the origin than θ\theta is, by an amount that grows with dd. This is exactly the Gaussian shell phenomenon: noise in dd dimensions has typical magnitude d\sqrt{d}, and adding it to θ\theta pushes outward.

Shrinking XX toward the origin partially undoes this outward push. In high dimensions, the bias introduced by shrinkage is a much smaller error than the variance you save. The dimension threshold d=3d = 3 becomes natural in this picture: in d=1,2d = 1, 2, the noise radius d\sqrt{d} isn’t much, and shrinkage doesn’t pay off; in d3d \ge 3, it does, uniformly.

The exact value at θ=0\theta = 0

A clean computation. If θ=0\theta = 0 then X2χd2\|X\|^2 \sim \chi^2_d, and for d>2d > 2, E[1/X2]=1/(d2)\mathbb{E}[1/\|X\|^2] = 1/(d - 2). Plugging in,

E[θ^JS2]  =  E[X2]2(d2)+(d2)2E[1/X2]  =  d2(d2)+(d2)  =  2.\mathbb{E}\bigl[\|\hat\theta_{\mathrm{JS}}\|^2\bigr] \;=\; \mathbb{E}[\|X\|^2] - 2(d - 2) + (d - 2)^2\, \mathbb{E}[1/\|X\|^2] \;=\; d - 2(d - 2) + (d - 2) \;=\; 2.

The James–Stein MSE at the origin is exactly 22, for every d3d \ge 3.

The MLE has MSE dd. So at θ=0\theta = 0, JS is a factor of d/2d/2 better. For d=100d = 100, MSE drops from 100100 to 22. A factor of 5050.

Slide dd to see the full curve as θ\|\theta\| varies:

d = 3MSEMLE = 3MSEJS(θ=0) = 2JS savings at θ=0: ×1.5

The MLE curve sits flat at dd. The James–Stein curve starts at 2\approx 2 for θ=0\theta = 0 and rises smoothly toward dd as θ\|\theta\| grows: with the true mean far from the origin, the shrinkage benefit shrinks, but the MLE never wins at any θ\theta in d3d \ge 3. At d=1d = 1 the JS curve sits above the MLE (the formula inflates rather than shrinks); at d=2d = 2 they coincide.

What is genuinely weird

Stein’s original example is the part that breaks people’s intuition.

Suppose you want to estimate three independent things: the speed of light, the price of tea in China, and the average height of redwoods. For each, you have one Gaussian-noisy measurement. The natural recipe: estimate each one independently, by its own measurement. Stein says: shrink all three toward zero (or toward any common point) and your total MSE drops.

The three estimation problems have nothing to do with each other. The three measurements are independent. And yet, coupling the three estimators by shrinking together reduces the joint risk. This is profoundly counterintuitive: the cost of independence in MSE terms is real.

The reason this confounds intuition is that unbiasedness was the wrong objective. By the bias-variance decomposition, MSE = bias2^2 + variance, and the sample mean trades zero bias for high variance. In d3d \ge 3, accepting a bit of bias (shrinkage) buys a much larger reduction in variance, every time.

What this led to

Stein’s paradox is the seed of a huge amount of modern statistics and ML:

All of these are descendants of: the sample mean is wrong in 3\ge 3 dimensions; do something biased instead.

References



Previous Post
Marchenko-Pastur and the Wigner semicircle
Next Post
The Newton fractal