The orthodoxy was clear: to estimate the mean of a Gaussian, use the sample mean. Grading estimators by their mean squared error
$$\mathrm{MSE}(\hat\theta, \theta) = \mathbb{E}_\theta\big[(\hat\theta - \theta)^2\big],$$
the sample mean is the maximum-likelihood estimator, the minimum-variance unbiased estimator, and, in one dimension, admissible: no other estimator beats it for every value of the parameter. For two centuries this read as the end of the story.
In 1961, Charles Stein and Willard James proved that in three or more dimensions, the sample mean is not admissible. There is a different estimator with strictly smaller MSE for every $\theta$ simultaneously, even when the components correspond to completely unrelated quantities.
The estimator is famous and, when first encountered, jarring. It is also a precursor of ridge regression, empirical Bayes, and a great deal of modern machine learning.
The case for the sample mean in one dimension
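You observe i.i.d. samples $X_1, \dots, X_n \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$ and unknown mean $\theta$. Four properties make the sample mean $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ the textbook answer.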
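It maximizes the likelihood. The joint density of the data, viewed as a function of $\theta$ with the data held fixed, is
$$L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(X_i - \theta)^2}{2\sigma^2}\right).$$
Maximizing over $\theta$ is the same as minimizing $\sum_{i=1}^n (X_i - \theta)^2$, a quadratic in $\theta$ minimized at $\theta = \bar X$. Of all values of $\theta$, the sample mean is the one that makes the observed data most plausible under the Gaussian model.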
It is unbiased. $\mathbb{E}[\bar X] = \theta$ by linearity of expectation. On average, $\bar X$ hits the truth exactly.
It has the smallest possible MSE among unbiased estimators. The Cramér–Rao lower bound says any unbiased $\hat\theta$ satisfies
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,I(\theta)},$$
where $I(\theta)$ is the Fisher information per observation, a measure of how sharply the likelihood peaks around the truth. For the Gaussian, $I(\theta) = 1/\sigma^2$, so the bound is $\sigma^2/n$, and $\bar X$ saturates it. So $\mathrm{MSE}(\bar X) = \sigma^2/n$, the lowest MSE achievable by any unbiased estimator.
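The claim that the sample mean attains the Cramér–Rao bound $\sigma^2/n$ can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function name `sample_mean_mse` and the particular values of `theta`, `sigma`, and `n` are illustrative assumptions, not taken from the text.

```python
import random
import statistics

def sample_mean_mse(theta, sigma, n, trials, seed=0):
    """Monte Carlo estimate of E[(X_bar - theta)^2] for n i.i.d. N(theta, sigma^2) draws."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        squared_errors.append((statistics.fmean(xs) - theta) ** 2)
    return statistics.fmean(squared_errors)

theta, sigma, n = 1.5, 2.0, 25        # illustrative values, not from the text
cramer_rao_bound = sigma ** 2 / n     # 1 / (n * I(theta)) with I(theta) = 1 / sigma^2
estimated_mse = sample_mean_mse(theta, sigma, n, trials=20_000)

print(f"Cramer-Rao bound: {cramer_rao_bound:.4f}")
print(f"simulated MSE:    {estimated_mse:.4f}")
```

The two printed numbers should agree up to Monte Carlo noise, which shrinks as `trials` grows.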
In one dimension, it is admissible. No estimator (biased or not) has strictly smaller MSE than $\bar X$ at every $\theta$ simultaneously. You can find rivals that win at specific values (the constant $\hat\theta \equiv 0$ has zero MSE at $\theta = 0$ but MSE $\theta^2$ everywhere else), but never one that wins everywhere. This is the substantive property among the four; the proof is below.
Maximum likelihood, minimum variance among unbiased estimators, and unbeatable everywhere: with all of these lining up, generations of statisticians were convinced that the sample mean was the only defensible default.
Proving admissibility in one dimension
The first three properties were direct calculations. Admissibility is the substantive one. Take the canonical case $X \sim \mathcal{N}(\theta, 1)$ on $\mathbb{R}$ (one observation, unit variance, after rescaling). The sample mean is just $X$, and its MSE is $1$ at every $\theta$. The claim is that no estimator strictly beats $X$ everywhere.
The standard proof, due to Hodges and Lehmann (1950) and Blyth (1951), is by contradiction, using a wide Gaussian prior on $\theta$.
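Suppose some estimator $\delta$ dominated $X$: its risk function $R(\theta, \delta) = \mathbb{E}_\theta\big[(\delta(X) - \theta)^2\big]$ satisfies $R(\theta, \delta) \le 1$ for every $\theta$, with strict inequality at some $\theta_0$. Assuming $R(\theta, \delta)$ is continuous in $\theta$ (true for any reasonable $\delta$), there is a small interval $[\theta_0 - \eta, \theta_0 + \eta]$ on which $R(\theta, \delta) \le 1 - \epsilon$ for some $\epsilon > 0$.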
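A wide Gaussian prior. Place a prior $\theta \sim \mathcal{N}(0, \tau^2)$ for a large variance $\tau^2$. Standard Gaussian conjugacy gives the posterior
$$\theta \mid X \sim \mathcal{N}\!\left(\frac{\tau^2}{1+\tau^2}\,X,\ \frac{\tau^2}{1+\tau^2}\right),$$
so the Bayes estimator under squared-error loss (the posterior mean) is $\delta_\tau(X) = \frac{\tau^2}{1+\tau^2}\,X$. Its Bayes risk (the prior-averaged MSE) is
$$r_\tau = \frac{\tau^2}{1+\tau^2}.$$
The defining property of the Bayes estimator is that it minimizes prior-averaged MSE. So any estimator, including our hypothetical $\delta$, has Bayes risk at least $r_\tau$:
$$\int R(\theta, \delta)\,\pi_\tau(\theta)\,d\theta \;\ge\; \frac{\tau^2}{1+\tau^2},$$
where $\pi_\tau$ is the $\mathcal{N}(0, \tau^2)$ prior density.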
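As a sanity check on the Bayes risk $\tau^2/(1+\tau^2)$, one can minimize the prior-averaged MSE numerically over linear rules $cX$. This sketch only searches among linear rules, which is enough here because the posterior mean is itself linear in $X$; the helper name `prior_averaged_risk`, the grid, and the value of `tau2` are assumptions made for illustration.

```python
def prior_averaged_risk(c, tau2):
    """Bayes risk of the linear rule c*X when X ~ N(theta, 1) and theta ~ N(0, tau2).

    For fixed theta the risk is c^2 + (1 - c)^2 * theta^2 (variance plus squared bias);
    averaging over the prior replaces theta^2 with tau2.
    """
    return c ** 2 + (1 - c) ** 2 * tau2

tau2 = 9.0                                   # illustrative prior variance
grid = [i / 1000 for i in range(1001)]       # candidate multipliers c in [0, 1]
best_c = min(grid, key=lambda c: prior_averaged_risk(c, tau2))
posterior_weight = tau2 / (1 + tau2)         # closed form for both the multiplier and the risk

print("grid minimizer:     ", best_c)
print("closed-form weight: ", posterior_weight)
print("minimal Bayes risk: ", prior_averaged_risk(best_c, tau2))
```

In this setup the grid minimizer and the minimal risk should both land on $\tau^2/(1+\tau^2)$, while the rule $c = 1$ (the sample mean) has prior-averaged risk exactly $1$.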
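Squeezing to a contradiction. The sample mean has $R(\theta, X) = 1$ everywhere, so its prior-averaged MSE is $1$. Writing $\pi_\tau$ for the $\mathcal{N}(0, \tau^2)$ prior density, the Bayes risk gap between $X$ and $\delta$ is therefore squeezed:
$$0 \;\le\; \int \big(1 - R(\theta, \delta)\big)\,\pi_\tau(\theta)\,d\theta \;\le\; 1 - \frac{\tau^2}{1+\tau^2} = \frac{1}{1+\tau^2}.$$
The integrand is non-negative everywhere (by domination) and at least $\epsilon$ on the interval $[\theta_0 - \eta, \theta_0 + \eta]$ where $\delta$ does strictly better, so
$$\int \big(1 - R(\theta, \delta)\big)\,\pi_\tau(\theta)\,d\theta \;\ge\; \epsilon \int_{\theta_0 - \eta}^{\theta_0 + \eta} \pi_\tau(\theta)\,d\theta.$$
For $\tau$ large, the prior density near $\theta_0$ is approximately $\frac{1}{\sqrt{2\pi}\,\tau}$, so the interval mass is approximately $\frac{2\eta}{\sqrt{2\pi}\,\tau}$. Putting the two bounds together,
$$\frac{1}{1+\tau^2} \;\ge\; \epsilon \int_{\theta_0 - \eta}^{\theta_0 + \eta} \pi_\tau(\theta)\,d\theta \;\approx\; \frac{2\eta\,\epsilon}{\sqrt{2\pi}\,\tau}.$$
The left side decays like $1/\tau^2$; the right side like $1/\tau$. For $\tau$ large enough, $\frac{2\eta\,\epsilon}{\sqrt{2\pi}\,\tau} > \frac{1}{1+\tau^2}$, contradicting the inequality. So no dominating $\delta$ exists, and $X$ is admissible.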
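The mismatch between the two decay rates is easy to see numerically. The sketch below computes the exact prior mass of an interval with `math.erf` and compares the resulting lower bound on the Bayes risk gap with the upper bound $1/(1+\tau^2)$. The interval centre `theta0`, half-width `eta`, and improvement `eps` are hypothetical values chosen only for illustration.

```python
import math

def interval_mass(theta0, eta, tau):
    """P(theta0 - eta < theta < theta0 + eta) under the prior theta ~ N(0, tau^2)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / (tau * math.sqrt(2.0))))
    return cdf(theta0 + eta) - cdf(theta0 - eta)

theta0, eta, eps = 1.0, 0.1, 0.05   # hypothetical interval and risk improvement
rows = []
for tau in (1, 10, 100, 1000, 10000):
    lower = eps * interval_mass(theta0, eta, tau)   # lower bound on the Bayes risk gap
    upper = 1.0 / (1.0 + tau ** 2)                  # upper bound on the Bayes risk gap
    rows.append((tau, lower, upper))
    print(f"tau={tau:>6}  lower={lower:.3e}  upper={upper:.3e}  consistent={lower <= upper}")
```

For small `tau` the two bounds are compatible; once `tau` is large enough the lower bound exceeds the upper bound, which is the contradiction the proof relies on.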
Where the proof breaks in higher dimensions. Repeat the same argument in $\mathbb{R}^d$ with a $\mathcal{N}(0, \tau^2 I_d)$ prior. The prior spreads its mass over a $d$-dimensional region, so the density at any fixed $\theta_0$ scales like $\tau^{-d}$, and the ball-mass contribution decays at the same rate. The Bayes risk gap, computed coordinate-by-coordinate, is at most $\frac{d}{1+\tau^2}$. The contradiction requires the ball-mass to decay slower than the Bayes gap, which only happens at $d = 1$. (Admissibility actually still holds at $d = 2$, via a more carefully chosen prior; at $d \ge 3$ admissibility genuinely fails, and Stein’s estimator is what walks through the open door.)
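The rate comparison can be tabulated directly. This sketch drops all constants (ball volume, the size of the risk improvement) and looks only at how the ratio of the ball-mass rate $\tau^{-d}$ to the gap bound $d/(1+\tau^2)$ behaves as $\tau$ grows; the helper name `rate_ratio` is an assumption for illustration.

```python
def rate_ratio(d, tau):
    """Ratio of the ball-mass rate tau^(-d) to the Bayes-gap bound d / (1 + tau^2).

    Constants are dropped on purpose; only the trend in tau matters.
    """
    return tau ** (-d) / (d / (1.0 + tau ** 2))

taus = (10.0, 100.0, 1000.0)
ratios = {d: [rate_ratio(d, tau) for tau in taus] for d in (1, 2, 3)}
for d, values in ratios.items():
    print(f"d={d}: " + ", ".join(f"{v:.3g}" for v in values))
```

The contradiction needs this ratio to grow without bound. It does so for `d = 1`, levels off near one half for `d = 2` (so this particular prior is inconclusive there), and shrinks to zero for `d = 3`.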
The Stein setup
By sufficiency, all the information about $\theta$ in $X_1, \dots, X_n$ is concentrated in $\bar X$. The original problem reduces to: observe a single Gaussian sample with known variance and unknown mean. Rescaling sets the variance to one, and the multivariate version is the same story coordinate-by-coordinate. So without loss of generality, work with a single observation
$$X \sim \mathcal{N}(\theta, I_d), \qquad \theta \in \mathbb{R}^d,$$
and try to estimate $\theta$. The sample mean (here, just $X$) has
$$\mathbb{E}\,\|X - \theta\|^2 = d.$$
It doesn’t depend on $\theta$. Just $d$, no matter what.
The James–Stein estimator
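Define
$$\hat\theta_{\mathrm{JS}} = \left(1 - \frac{d-2}{\|X\|^2}\right) X.$$
The shrinkage factor $1 - \frac{d-2}{\|X\|^2}$ pulls $X$ toward the origin, with the amount of shrinkage decided by the data: when $\|X\|$ is large, barely shrink; when it is small, shrink hard. (When $\|X\|$ is small enough that the factor goes negative, the positive-part version $\hat\theta_{\mathrm{JS}}^{+} = \left(1 - \frac{d-2}{\|X\|^2}\right)_{+} X$ is a strict improvement, and is what people use in practice.)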
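In code, the estimator and its positive-part variant are each a few lines. This is a minimal sketch for a single observation stored as a list of floats; the function names and the two example vectors are illustrative assumptions.

```python
def james_stein(x):
    """Plain James-Stein estimate from one observation x in R^d (intended for d >= 3).

    Undefined at x = 0, which has probability zero under the Gaussian model.
    """
    d = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (d - 2) / norm_sq
    return [factor * v for v in x]

def james_stein_positive_part(x):
    """Positive-part variant: the shrinkage factor is clipped at zero."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    factor = max(0.0, 1.0 - (d - 2) / norm_sq)
    return [factor * v for v in x]

far = [3.0, 4.0, 0.0, 0.0]      # ||x||^2 = 25, so the factor is 1 - 2/25 = 0.92
near = [0.3, -0.2, 0.1, 0.4]    # ||x||^2 = 0.30, so the raw factor is negative

print(james_stein(far))                   # mild shrinkage
print(james_stein(near))                  # raw factor flips the sign of every coordinate
print(james_stein_positive_part(near))    # clipped: the estimate collapses to the origin
```

The second call shows why the positive-part version exists: without clipping, an observation very close to the origin is pushed through it and out the other side.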
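Where the formula comes from. Try the simpler family $\hat\theta_c = cX$ for a fixed scalar $c$. The bias-variance decomposition gives
$$\mathbb{E}\,\|cX - \theta\|^2 = c^2 d + (1 - c)^2 \|\theta\|^2,$$
minimized at $c^* = \frac{\|\theta\|^2}{d + \|\theta\|^2}$. That is the optimal shrinkage if you knew $\|\theta\|^2$. You don’t, but the data hands you an estimate: $\mathbb{E}\,\|X\|^2 = d + \|\theta\|^2$, so $\|X\|^2 - d$ is an unbiased estimator of $\|\theta\|^2$. Substituting and simplifying produces a shrinkage factor of $1 - \frac{d}{\|X\|^2}$. James and Stein replace $d$ with $d - 2$, a correction that drops out of an integration-by-parts identity called Stein’s lemma, and that swap is exactly what makes the dominance proof go through.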
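Theorem (James and Stein, 1961). For every $d \ge 3$ and every $\theta \in \mathbb{R}^d$,
$$\mathbb{E}\,\|\hat\theta_{\mathrm{JS}} - \theta\|^2 \;<\; d \;=\; \mathbb{E}\,\|X - \theta\|^2.$$
The sample mean is dominated uniformly. For $d \le 2$ the estimator degenerates and the dominance disappears. The phase transition at $d = 3$ is sharp.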
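For readers who want to see where the $d - 2$ and the dimension threshold come from, here is a sketch of the standard risk calculation via Stein’s lemma. The constant $a$ and the function $g$ are notation introduced here for illustration; the model is $X \sim \mathcal{N}(\theta, I_d)$ with $d \ge 3$, so that $\mathbb{E}[1/\|X\|^2]$ is finite.

$$
\begin{aligned}
g(x) &= \frac{a\,x}{\|x\|^2}, \qquad \nabla \cdot g(x) = \frac{a\,(d-2)}{\|x\|^2}, \\
\mathbb{E}\big[(X - \theta)^\top g(X)\big] &= \mathbb{E}\big[\nabla \cdot g(X)\big] \quad \text{(Stein's lemma)}, \\
\mathbb{E}\,\|X - g(X) - \theta\|^2 &= d - 2\,\mathbb{E}\big[(X - \theta)^\top g(X)\big] + \mathbb{E}\,\|g(X)\|^2 \\
&= d - \big(2a(d-2) - a^2\big)\,\mathbb{E}\!\left[\frac{1}{\|X\|^2}\right].
\end{aligned}
$$

The bracket $2a(d-2) - a^2$ is largest at $a = d - 2$, where it equals $(d-2)^2$. Under these assumptions the risk is $d - (d-2)^2\,\mathbb{E}[1/\|X\|^2]$, which is strictly below $d$ for every $\theta$.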
The geometric reason
The mechanism behind shrinkage is a fact about the magnitude of $X$. If $X \sim \mathcal{N}(\theta, I_d)$, then $\|X\|^2$ has a non-central chi-squared distribution with non-centrality $\|\theta\|^2$, so
$$\mathbb{E}\,\|X\|^2 = \|\theta\|^2 + d.$$
The sample $X$ is on average further from the origin than $\theta$ is, by an amount that grows with $d$. This is exactly the Gaussian shell phenomenon: noise in $d$ dimensions has typical magnitude $\sqrt{d}$, and adding it to $\theta$ pushes $X$ outward.
Shrinking toward the origin partially undoes this outward push. In high dimensions, the bias introduced by shrinkage is a much smaller error than the variance you save. The dimension threshold becomes natural in this picture: in $d \le 2$, the noise radius isn’t much, and shrinkage doesn’t pay off; in $d \ge 3$, it does, uniformly.
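The outward push is easy to observe in simulation. The sketch below estimates the mean squared norm of a Gaussian observation and compares it with the squared norm of the true mean plus the dimension; the function name `mean_squared_norm` and the choice of a 50-dimensional `theta` are assumptions for illustration.

```python
import random

def mean_squared_norm(theta, trials, seed=0):
    """Monte Carlo estimate of E||X||^2 for X = theta + standard Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum((t + rng.gauss(0.0, 1.0)) ** 2 for t in theta)
    return total / trials

theta = [2.0] * 50                           # illustrative mean: d = 50, ||theta||^2 = 200
theta_norm_sq = sum(t * t for t in theta)
predicted = theta_norm_sq + len(theta)       # ||theta||^2 + d
simulated = mean_squared_norm(theta, trials=5_000)

print(f"||theta||^2        = {theta_norm_sq:.1f}")
print(f"predicted E||X||^2 = {predicted:.1f}")
print(f"simulated E||X||^2 = {simulated:.1f}")
```

The simulated value should sit close to the prediction, and well above the squared norm of `theta` itself.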
The exact value at
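A clean computation. If $\theta = 0$ then $\|X\|^2 \sim \chi^2_d$, and for $d \ge 3$, $\mathbb{E}\big[1/\|X\|^2\big] = \frac{1}{d-2}$. Plugging in,
$$\mathbb{E}\,\|\hat\theta_{\mathrm{JS}}\|^2 = \mathbb{E}\,\|X\|^2 - 2(d-2) + (d-2)^2\,\mathbb{E}\!\left[\frac{1}{\|X\|^2}\right] = d - 2(d-2) + (d-2) = 2.$$
The James–Stein MSE at the origin is exactly $2$, for every $d \ge 3$.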
The MLE has MSE $d$. So at $\theta = 0$, JS is a factor of $d/2$ better. For, say, $d = 10$, the MSE drops from $10$ to $2$: a factor of $5$.
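The value at the origin can be confirmed by Monte Carlo. This is a rough sketch with the true mean fixed at zero; the dimension `d = 10`, the number of trials, and the helper name `losses_at_origin` are illustrative choices rather than anything prescribed by the theorem.

```python
import random

def losses_at_origin(d, rng):
    """Squared errors of James-Stein and of the MLE for one draw X ~ N(0, I_d)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (d - 2) / norm_sq
    # The truth is theta = 0, so each loss is just the squared norm of the estimate.
    return factor ** 2 * norm_sq, norm_sq

d, trials = 10, 20_000
rng = random.Random(0)
js_total = mle_total = 0.0
for _ in range(trials):
    js_loss, mle_loss = losses_at_origin(d, rng)
    js_total += js_loss
    mle_total += mle_loss
js_mse, mle_mse = js_total / trials, mle_total / trials

print(f"MLE MSE at the origin:         {mle_mse:.3f}  (theory: {d})")
print(f"James-Stein MSE at the origin: {js_mse:.3f}  (theory: 2)")
```

Changing `d` to any other value of at least 5 keeps the James–Stein figure near 2 while the MLE figure tracks `d`; for `d` of 3 or 4 the same holds in expectation, but the per-draw loss has infinite variance, so the simulated average converges more slowly.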
[Interactive figure: MSE of the MLE and of the James–Stein estimator as a function of $\|\theta\|$, with a slider for the dimension $d$.]
The MLE curve sits flat at $d$. The James–Stein curve starts at $2$ for $\theta = 0$ and rises smoothly toward $d$ as $\|\theta\|$ grows: with the true mean far from the origin, the shrinkage benefit shrinks, but the MLE never wins at any $\theta$ in $d \ge 3$. At $d = 1$ the JS curve sits above the MLE (the formula inflates rather than shrinks); at $d = 2$ they coincide.
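The shape of the two risk curves can be reproduced with a short simulation. The sketch below places the true mean at distance `radius` along the first axis, which loses nothing because the James–Stein risk depends on the mean only through its norm. Both estimators are evaluated on the same draws to reduce noise in the comparison. The function name `paired_risks`, the dimension, and the list of radii are assumptions for illustration.

```python
import random

def paired_risks(d, radius, trials, seed=0):
    """Monte Carlo risks of the MLE and James-Stein at theta = (radius, 0, ..., 0).

    Both estimators see the same draws, so their difference is less noisy.
    """
    rng = random.Random(seed)
    theta = [radius] + [0.0] * (d - 1)
    mle_total = js_total = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        norm_sq = sum(v * v for v in x)
        factor = 1.0 - (d - 2) / norm_sq
        mle_total += sum((v - t) ** 2 for v, t in zip(x, theta))
        js_total += sum((factor * v - t) ** 2 for v, t in zip(x, theta))
    return mle_total / trials, js_total / trials

d = 10
curve = [(r, *paired_risks(d, r, trials=10_000)) for r in (0.0, 1.0, 2.0, 4.0, 8.0)]
for radius, mle_risk, js_risk in curve:
    print(f"||theta||={radius:4.1f}  MLE={mle_risk:6.3f}  JS={js_risk:6.3f}")
```

In this setup the MLE column should hover around `d` while the James–Stein column starts near 2 and climbs toward `d` without crossing it.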
What is genuinely weird
Stein’s original example is the part that breaks people’s intuition.
Suppose you want to estimate three independent things: the speed of light, the price of tea in China, and the average height of redwoods. For each, you have one Gaussian-noisy measurement. The natural recipe: estimate each one independently, by its own measurement. Stein says: shrink all three toward zero (or toward any common point) and your total MSE drops.
The three estimation problems have nothing to do with each other. The three measurements are independent. And yet, coupling the three estimators by shrinking together reduces the joint risk. This is profoundly counterintuitive: the cost of independence in MSE terms is real.
The reason this confounds intuition is that unbiasedness was the wrong objective. By the bias-variance decomposition, MSE = bias$^2$ + variance, and the sample mean trades zero bias for high variance. In $d \ge 3$, accepting a bit of bias (shrinkage) buys a larger reduction in variance, at every $\theta$.
What this led to
Stein’s paradox is the seed of a huge amount of modern statistics and ML:
- Empirical Bayes. Learn the prior from data; shrink toward what the data says is the typical $\theta$.
- Ridge regression. Penalize the squared norm of the regression coefficients; the regularization parameter is a shrinkage knob.
- Hierarchical Bayesian models. Multi-level shrinkage across groups.
- L2 regularization in deep learning. Weight decay is, at heart, a shrinkage estimator.
All of these are descendants of: the sample mean is wrong in $d \ge 3$ dimensions; do something biased instead.
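To make the ridge connection concrete: when the design matrix has orthonormal columns, the ridge solution is the least-squares solution multiplied by $1/(1+\lambda)$, a fixed shrinkage factor where James–Stein uses a data-driven one. The sketch below checks this on a tiny two-feature problem by solving the ridge normal equations directly. The helper `ridge_two_features`, the design `X`, and the response `y` are made up for illustration.

```python
def ridge_two_features(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for a design with exactly two columns."""
    a11 = sum(row[0] * row[0] for row in X) + lam
    a12 = sum(row[0] * row[1] for row in X)
    a22 = sum(row[1] * row[1] for row in X) + lam
    b1 = sum(row[0] * target for row, target in zip(X, y))
    b2 = sum(row[1] * target for row, target in zip(X, y))
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the symmetric 2x2 system.
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

# Two orthonormal columns: each has unit norm and their inner product is zero.
X = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
y = [3.0, 1.0, 2.0, 0.0]
lam = 1.0

ols = ridge_two_features(X, y, 0.0)      # lam = 0 recovers least squares
ridge = ridge_two_features(X, y, lam)

print("least squares:", ols)
print("ridge:        ", ridge)
print("OLS / (1+lam):", [b / (1.0 + lam) for b in ols])
```

For a general design the shrinkage is no longer a single scalar, but the orthonormal case shows the mechanism in its simplest form.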
References
- Willard James, Charles Stein. Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
- Bradley Efron, Carl Morris. Stein’s Paradox in Statistics. Scientific American, 1977. A beautifully written exposition.
- Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2018. Chapter 8 has the geometric setup and a clean proof of the key inequality.