In the 1950s, the statistical orthodoxy was clear: to estimate the mean of a Gaussian, use the sample mean. It is the maximum-likelihood estimator, the minimum-variance unbiased estimator, and (in one dimension) admissible: no other estimator beats it for every value of the parameter.
In 1961, Charles Stein and Willard James proved that in three or more dimensions, the sample mean is not admissible. There is a different estimator that beats it everywhere: for every choice of the true mean $\mu$, simultaneously, in mean squared error. Even when the components are completely unrelated.
The estimator is famous and, when first encountered, jarring. It is also a precursor of ridge regression, empirical Bayes, and a great deal of modern machine learning.
The setup
You observe a single sample $x \sim \mathcal{N}(\mu, I_d)$ in $\mathbb{R}^d$ and want to estimate $\mu$. The sample mean (here, just $x$ itself) has

$$\mathbb{E}\,\|x - \mu\|^2 = d.$$

It doesn't depend on $\mu$. Just $d$, no matter what.
The James–Stein estimator
Define

$$\hat{\mu}_{\mathrm{JS}} = \left(1 - \frac{d-2}{\|x\|^2}\right) x.$$
The shrinkage factor $1 - \frac{d-2}{\|x\|^2}$ pulls $x$ toward the origin. (When $\|x\|^2$ is small, the factor can go negative; the positive-part version $\hat{\mu}_{\mathrm{JS}^+} = \left(1 - \frac{d-2}{\|x\|^2}\right)^{\!+} x$ is a strict improvement, and is what people use in practice.)
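For concreteness, here is a minimal NumPy sketch of both versions (the function names are mine, not a standard API):

```python
import numpy as np

def james_stein(x):
    """Plain James-Stein estimate of mu from one sample x ~ N(mu, I_d)."""
    d = x.shape[0]
    return (1.0 - (d - 2) / np.sum(x**2)) * x

def james_stein_plus(x):
    """Positive-part version: never lets the shrinkage factor go negative."""
    d = x.shape[0]
    factor = max(0.0, 1.0 - (d - 2) / np.sum(x**2))
    return factor * x
```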
Theorem (James and Stein, 1961). For every $d \ge 3$ and every $\mu \in \mathbb{R}^d$,

$$\mathbb{E}\,\|\hat{\mu}_{\mathrm{JS}} - \mu\|^2 < \mathbb{E}\,\|x - \mu\|^2 = d.$$
The sample mean is dominated uniformly. For $d \le 2$ the estimator degenerates and the dominance disappears. The phase transition at $d = 3$ is sharp.
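A quick Monte Carlo sanity check of the theorem, a sketch with arbitrary choices of $d$, $\mu$, and trial count:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_trials = 10, 200_000
mu = np.full(d, 1.0)                          # some fixed true mean, ||mu||^2 = 10

x = mu + rng.standard_normal((n_trials, d))   # n_trials independent samples
shrink = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

print(np.mean(np.sum((x  - mu)**2, axis=1)))  # MLE risk: ~ d = 10
print(np.mean(np.sum((js - mu)**2, axis=1)))  # JS risk: strictly below 10
```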
The geometric reason
Why should shrinkage help?
If $x \sim \mathcal{N}(\mu, I_d)$, then $\|x\|^2$ has a non-central chi-squared distribution with $d$ degrees of freedom and non-centrality $\|\mu\|^2$, so

$$\mathbb{E}\,\|x\|^2 = \|\mu\|^2 + d.$$
The sample $x$ is on average farther from the origin than $\mu$ is, by an amount that grows with $d$. This is exactly the Gaussian shell phenomenon: noise in $d$ dimensions has typical magnitude $\sqrt{d}$, and adding it to $\mu$ pushes $x$ outward.
Shrinking toward the origin partially undoes this outward push. In high dimensions, the bias introduced by shrinkage is a much smaller error than the variance you save. The dimension threshold becomes natural in this picture: in $d = 1$ or $d = 2$, the noise radius $\sqrt{d}$ isn't much, and shrinkage doesn't pay off; in $d \ge 3$, it does, uniformly.
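The outward push itself is easy to check numerically (a sketch; $d$ and $\mu$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 100
mu = np.full(d, 0.3)                          # ||mu||^2 = 9
x = mu + rng.standard_normal((100_000, d))

print(np.sum(mu**2))                          # 9.0: where mu actually sits
print(np.mean(np.sum(x**2, axis=1)))          # ~ 109 = ||mu||^2 + d: x sits farther out
```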
The exact value at $\mu = 0$
A clean computation. For any $\mu$, Stein's identity gives the James–Stein risk as $d - (d-2)^2\,\mathbb{E}\left[1/\|x\|^2\right]$. If $\mu = 0$ then $\|x\|^2 \sim \chi^2_d$, and for $d \ge 3$, $\mathbb{E}\left[1/\chi^2_d\right] = \frac{1}{d-2}$. Plugging in,

$$d - (d-2)^2 \cdot \frac{1}{d-2} = d - (d-2) = 2.$$
The James–Stein MSE at the origin is exactly $2$, for every $d \ge 3$.
The MLE has MSE $d$. So at $\mu = 0$, JS is better by a factor of $d/2$. For $d = 100$, MSE drops from $100$ to $2$. A factor of $50$.
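The factor-of-$50$ claim is easy to verify by simulation (a sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 100
x = rng.standard_normal((200_000, d))         # mu = 0, so x is pure noise
js = (1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)) * x

print(np.mean(np.sum(x**2, axis=1)))          # MLE MSE: ~ 100
print(np.mean(np.sum(js**2, axis=1)))         # JS MSE: ~ 2
```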
[Figure: MSE of the MLE and of James–Stein as $\|\mu\|$ varies, for fixed $d$.]
The MLE curve sits flat at $d$. The James–Stein curve starts at $2$ for $\mu = 0$ and rises smoothly toward $d$ as $\|\mu\|$ grows: with the true mean far from the origin, the shrinkage benefit shrinks, but for $d \ge 3$ the MLE never wins at any $\mu$. At $d = 1$ the JS curve sits above the MLE (the formula inflates rather than shrinks); at $d = 2$ they coincide.
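Without the interactive slider, a few lines trace the same curve by Monte Carlo (a sketch; the $\|\mu\|$ grid, $d$, and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_trials = 10, 100_000

for mu_norm in [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]:
    mu = np.zeros(d)
    mu[0] = mu_norm                   # by symmetry, risk depends on mu only via ||mu||
    x = mu + rng.standard_normal((n_trials, d))
    js = (1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)) * x
    risk = np.mean(np.sum((js - mu)**2, axis=1))
    print(f"||mu|| = {mu_norm:5.1f}   JS risk = {risk:5.2f}   (MLE risk = {d})")
```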
What is genuinely weird
Stein’s original example is the part that breaks people’s intuition.
Suppose you want to estimate three independent things: the speed of light, the price of tea in China, and the average height of redwoods. For each, you have one Gaussian-noisy measurement. The natural recipe: estimate each one independently, by its own measurement. Stein says: shrink all three toward zero (or toward any common point) and your total MSE drops.
The three estimation problems have nothing to do with each other. The three measurements are independent. And yet, coupling the three estimators by shrinking them together reduces the joint risk. This is profoundly counterintuitive: estimating each problem on its own carries a real cost in total MSE.
The reason this confounds intuition is that unbiasedness was the wrong objective. MSE = bias² + variance, and the sample mean trades zero bias for high variance. In $d \ge 3$, accepting a bit of bias (shrinkage) buys a much larger reduction in variance, every time.
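The trade is visible directly in the decomposition (a sketch; $d$ and $\mu$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 20
mu = np.ones(d)                                # ||mu||^2 = 20
x = mu + rng.standard_normal((200_000, d))
js = (1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)) * x

bias_sq = np.sum((js.mean(axis=0) - mu)**2)    # squared bias of the JS estimator
variance = np.sum(js.var(axis=0))              # total variance across components
print(bias_sq, variance, bias_sq + variance)   # bias^2 + variance < d = 20,
                                               # while the MLE has 0 + 20 = 20
```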
What this led to
Stein’s paradox is the seed of a huge amount of modern statistics and ML:
- Empirical Bayes. Learn the prior from data; shrink toward what the data says is the typical $\mu$.
- Ridge regression. Penalize the squared norm of the regression coefficients; the regularization parameter is a shrinkage knob (see the sketch after this list).
- Hierarchical Bayesian models. Multi-level shrinkage across groups.
- L2 regularization in deep learning. Weight decay is, at heart, a shrinkage estimator.
All of these are descendants of: the sample mean is wrong in $d \ge 3$ dimensions; do something biased instead.
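To make the ridge item concrete, here is a minimal sketch of ridge-as-shrinkage on synthetic data (the problem sizes and the $\lambda$ grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 30
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

# Ridge solves min_b ||y - Xb||^2 + lam * ||b||^2,
# with closed form b = (X^T X + lam * I)^{-1} X^T y.
for lam in [0.0, 1.0, 10.0, 100.0]:
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lam = {lam:6.1f}   ||b|| = {np.linalg.norm(b):.3f}")
# As lam grows, ||b|| shrinks toward zero: lam is exactly a shrinkage knob.
```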
References
- Willard James, Charles Stein. Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
- Bradley Efron, Carl Morris. Stein’s Paradox in Statistics. Scientific American, 1977. A beautifully written exposition.
- Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2018. Chapter 8 has the geometric setup and a clean proof of the key inequality.