The central limit theorem says that the sum of many independent random variables, properly standardized, converges to a deterministic shape (the Gaussian). Random matrix theory gives the analogous statement for spectra: the eigenvalues of a large random matrix, taken together as a histogram, converge to a deterministic density. Two of the cleanest results are the Marchenko-Pastur law for sample covariance matrices and the Wigner semicircle law for symmetric matrices with i.i.d. entries.
The geometric content matters in machine learning: the bias of the sample covariance matrix, the failure of nearest-neighbor and PCA in high dimensions, and a chunk of modern shrinkage theory all live inside these two densities.
Sample covariance matrices
Let $X$ be an $n \times p$ matrix with i.i.d. real entries of mean $0$ and variance $1$. The sample covariance matrix is

$$S = \frac{1}{n} X^\top X.$$

Each entry $S_{jk}$ is the empirical correlation between coordinates $j$ and $k$ across the $n$ rows of $X$. With true mean zero and unit variance, the true covariance is the identity $I_p$. By the law of large numbers, if $p$ is fixed and $n \to \infty$, then $S \to I_p$ entrywise, and every eigenvalue of $S$ approaches $1$.
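A minimal numpy sketch of this classical regime (the Gaussian entries and the specific sizes here are illustrative choices, not requirements of the theorem):

```python
import numpy as np

# Fixed dimension p, growing sample size n: every eigenvalue of S approaches 1.
rng = np.random.default_rng(0)
p = 10
for n in (100, 10_000, 1_000_000):
    X = rng.standard_normal((n, p))   # rows are samples, columns are coordinates
    S = X.T @ X / n                   # sample covariance; the true covariance is I_p
    lo, hi = np.linalg.eigvalsh(S)[[0, -1]]
    print(f"n={n:>9}: eigenvalues of S lie in [{lo:.3f}, {hi:.3f}]")
```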
This stops being true as soon as the dimension grows along with $n$. Consider the regime

$$n \to \infty, \qquad p \to \infty, \qquad \frac{p}{n} \to \gamma \in (0, \infty).$$

The dimension is now comparable to the sample size. The matrix $S$ no longer concentrates at $I_p$, and its eigenvalues do not pile up at $1$. Instead, they spread into a deterministic shape that depends only on the ratio $\gamma$.
The Marchenko-Pastur law
Marchenko and Pastur (1967) proved that, in the regime above, the empirical distribution of the eigenvalues of $S$ converges (in probability) to the Marchenko-Pastur density

$$\rho_{\mathrm{MP}}(x) = \frac{1}{2\pi \gamma x} \sqrt{(\lambda_+ - x)(x - \lambda_-)}, \qquad x \in [\lambda_-, \lambda_+],$$

where the endpoints are

$$\lambda_\pm = (1 \pm \sqrt{\gamma})^2.$$
(For $\gamma > 1$, the matrix $S$ has rank at most $n < p$, so $p - n$ of its eigenvalues are exactly $0$, contributing a point mass of weight $1 - 1/\gamma$ at the origin. The remaining $n$ eigenvalues are spread on the same interval $[\lambda_-, \lambda_+]$.)
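The density and its endpoints transcribe directly into a few lines of numpy; `mp_density` is a name chosen here for illustration, not a library function:

```python
import numpy as np

def mp_density(x, gamma):
    """Continuous part of the Marchenko-Pastur density at x.

    For gamma > 1 this part integrates to 1/gamma; the remaining mass
    1 - 1/gamma is the point mass at the origin described above.
    """
    lam_minus = (1 - np.sqrt(gamma)) ** 2
    lam_plus = (1 + np.sqrt(gamma)) ** 2
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    inside = (x > lam_minus) & (x < lam_plus)
    xi = x[inside]
    out[inside] = np.sqrt((lam_plus - xi) * (xi - lam_minus)) / (2 * np.pi * gamma * xi)
    return out

# Sanity check: for gamma <= 1 the density should integrate to 1.
xs = np.linspace(0.0, 5.0, 20001)
print(mp_density(xs, 0.5).sum() * (xs[1] - xs[0]))   # ~ 1.0
```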
The endpoints depend only on $\gamma$. The bulk straddles $1$ (the true eigenvalue of $I_p$) but is shifted and stretched by an amount determined entirely by the dimension-to-samples ratio. Three sanity checks:
- $\gamma \to 0$. $\lambda_\pm \to 1$ from both sides, and the density collapses to a point mass at $1$. This recovers the law of large numbers: with infinitely many samples per dimension, $S \to I_p$.
- $\gamma = 1$. $\lambda_- = 0$, $\lambda_+ = 4$. The bulk spreads from $0$ to $4$, and a positive fraction of eigenvalues sits arbitrarily close to zero. The matrix is on the edge of singularity.
- $\gamma \gg 1$. Most eigenvalues are zero, and the rest spread over an interval of width $4\sqrt{\gamma}$ centered at $\gamma + 1$.
The takeaway is geometric and quantitative: at finite $\gamma = p/n$, the empirical eigenvalues of $S$ are systematically biased away from those of the true covariance, by a deterministic and known amount. The smallest eigenvalues are pulled down to $\lambda_-$, the largest are pushed up to $\lambda_+$, and the bulk in between is a stretched, asymmetric arch (more mass near the lower edge for moderate $\gamma$). For example, at $\gamma = 1/4$ the bulk spans $[0.25, 2.25]$ even though every true eigenvalue is $1$.
Slide $\gamma$ on a log scale from well below $1$ to well above $1$. We fix the dimension $p$ and let the sample size scale as $n = p/\gamma$, so the slider directly controls how many samples we get per dimension. Each tick draws a fresh Gaussian $X$. For $\gamma \le 1$ the histogram shows all $p$ eigenvalues of $S$. For $\gamma > 1$ we have $n < p$, so the matrix is rank $n$; we plot the $n$ nonzero eigenvalues (computed as the eigenvalues of the smaller $n \times n$ matrix $\frac{1}{n} X X^\top$, which has the same nonzero spectrum) and report the count of exact zeros separately. The red dashed curve is $\rho_{\mathrm{MP}}$ for $\gamma \le 1$ and $\gamma \, \rho_{\mathrm{MP}}$ for $\gamma > 1$ (the conditional density of the nonzero eigenvalues, which integrates to $1$).
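A sketch of the sampling logic behind this demo, under the setup above (Gaussian $X$, $n = p/\gamma$); `mp_spectrum_sample` is a hypothetical helper name:

```python
import numpy as np

def mp_spectrum_sample(p, gamma, rng):
    """Nonzero eigenvalues of S = X^T X / n for a fresh Gaussian X with n = p / gamma."""
    n = max(int(round(p / gamma)), 1)
    X = rng.standard_normal((n, p))
    if n >= p:
        # gamma <= 1: S is (generically) full rank; return all p eigenvalues.
        return np.linalg.eigvalsh(X.T @ X / n)
    # gamma > 1: S is p x p but has rank n. The n x n matrix X X^T / n shares
    # the nonzero spectrum and is much cheaper to diagonalize.
    return np.linalg.eigvalsh(X @ X.T / n)

rng = np.random.default_rng(1)
p, gamma = 1000, 4.0
eigs = mp_spectrum_sample(p, gamma, rng)
lam_minus, lam_plus = (1 - gamma**0.5) ** 2, (1 + gamma**0.5) ** 2
print(f"{eigs.size} nonzero eigenvalues, {p - eigs.size} exact zeros")
print(f"observed support  [{eigs.min():.2f}, {eigs.max():.2f}]")
print(f"predicted support [{lam_minus:.2f}, {lam_plus:.2f}]")
```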
The blue histogram is the empirical eigenvalue distribution of a fresh draw; the red dashed curve is the deterministic limit $\rho_{\mathrm{MP}}$. The fit is tight even at moderate $p$.
The Wigner semicircle law
The Marchenko-Pastur law is for a structured random matrix: $S = \frac{1}{n} X^\top X$, which is positive semidefinite by construction. The simpler sibling drops the structure and just takes a symmetric matrix whose entries are i.i.d. (above the diagonal).
Let $W$ be an $n \times n$ symmetric real matrix with $W_{ij} \sim \mathcal{N}(0, 1/n)$ for $i < j$, $W_{ji} = W_{ij}$, and $W_{ii} \sim \mathcal{N}(0, 2/n)$. (The diagonal variance choice $2/n$ gives the Gaussian Orthogonal Ensemble, but the limit theorem holds for arbitrary i.i.d. entries with finite variance.)
Wigner (1955) proved: the empirical distribution of the eigenvalues of $W$ converges to the semicircle density

$$\rho_{\mathrm{sc}}(x) = \frac{1}{2\pi} \sqrt{4 - x^2}, \qquad |x| \le 2.$$
It is, up to normalization, the upper half of a circle of radius $2$. The density is a fixed function of $x$ alone: there is no analog of the ratio $\gamma$ to tune, because the matrix size and the variance of the entries together fix the scale.
The histogram approaches the semicircle smoothly as $n$ grows. At small $n$ the fit is rough; at large $n$ it is essentially indistinguishable from the curve.
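A minimal sketch of this convergence under the GOE normalization above; `goe` is a name chosen here:

```python
import numpy as np

def goe(n, rng):
    """Sample an n x n GOE matrix: off-diagonal variance 1/n, diagonal variance 2/n."""
    A = rng.standard_normal((n, n)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries
    return (A + A.T) / np.sqrt(2)                  # symmetrize while keeping the target variances

rng = np.random.default_rng(2)
for n in (50, 2000):
    eigs = np.linalg.eigvalsh(goe(n, rng))
    # The semicircle has support [-2, 2] and second moment 1.
    print(f"n={n:>5}: spectrum in [{eigs.min():+.2f}, {eigs.max():+.2f}], "
          f"mean squared eigenvalue = {np.mean(eigs**2):.3f}")
```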
Where these limits come from
Both theorems are concentration phenomena for linear spectral statistics. For a fixed test function $f$, the random variable

$$L_n(f) = \frac{1}{n} \sum_{i=1}^{n} f(\lambda_i)$$

(an average of $f$ over the eigenvalues $\lambda_1, \dots, \lambda_n$ of an $n \times n$ random matrix) has variance of order $1/n^2$ in good cases, and its expectation has a limit as $n \to \infty$. The expectations as $f$ ranges over polynomials are precisely the moments of the limiting density. The method-of-moments approach reduces the proof to combinatorics: counting non-crossing pair partitions for Wigner (which gives the Catalan numbers, the moments of the semicircle), and a related count for Marchenko-Pastur.
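To see the moment statement concretely: with $f(x) = x^k$, the statistic $L_n(f)$ for a single large draw should land near the Catalan numbers $1, 2, 5, 14, \dots$ for $k = 2, 4, 6, 8$ (odd moments vanish by symmetry). A sketch, using the same GOE normalization as above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000
A = rng.standard_normal((n, n)) / np.sqrt(n)
W = (A + A.T) / np.sqrt(2)          # GOE normalization from the previous section
eigs = np.linalg.eigvalsh(W)
for k, catalan in zip((2, 4, 6, 8), (1, 2, 5, 14)):
    # L_n(x^k) = (1/n) tr(W^k), computed here directly from the eigenvalues.
    print(f"m_{k}: empirical {np.mean(eigs**k):7.3f}   Catalan {catalan}")
```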
A more conceptual derivation comes from free probability, in which the semicircle plays the role that the Gaussian plays in classical probability: it is the limit of free sums of independent self-adjoint variables. Marchenko-Pastur is the free analog of the Poisson distribution (the “free Poisson” law), arising from the free version of the Poisson limit theorem.
Implications for high-dimensional statistics
Three concrete consequences for machine learning:
- Sample covariance is biased even with $n > p$, as long as $\gamma = p/n$ is not vanishingly small. The eigenvalue at the lower edge is shifted to $(1 - \sqrt{\gamma})^2$, the upper edge to $(1 + \sqrt{\gamma})^2$. PCA on $S$ inflates the leading eigenvalue and suppresses the smaller ones, in a deterministic way that depends only on $\gamma$.
- Shrinkage estimators for the covariance matrix (Ledoit-Wolf, nonlinear shrinkage, etc.) explicitly use the Marchenko-Pastur picture to undo this bias. They pull each empirical eigenvalue back toward $1$ by an amount that depends on its distance from $\lambda_-$ and $\lambda_+$; see the sketch after this list. The Marchenko-Pastur framework is what makes the correction tractable.
- The ratio $\gamma = p/n$ is the only relevant dimensionless parameter in the bulk. Doubling both $p$ and $n$ leaves the eigenvalue distribution unchanged. This is why “more data” alone does not save you in high-dimensional inference; you need more samples per dimension.
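As a concrete instance of the shrinkage bullet, here is a minimal sketch using scikit-learn's `LedoitWolf` estimator (this assumes scikit-learn is available; the true covariance is $I_p$, so the ideal output is all eigenvalues equal to $1$):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

# With gamma = p/n = 0.5, the raw eigenvalues of S spread over
# [(1 - sqrt(0.5))^2, (1 + sqrt(0.5))^2], roughly [0.09, 2.91],
# even though every true eigenvalue is 1.
rng = np.random.default_rng(4)
n, p = 1000, 500
X = rng.standard_normal((n, p))
S = X.T @ X / n

lw = LedoitWolf(assume_centered=True).fit(X)   # linear shrinkage toward a scaled identity
print("raw eigenvalue range:   ", np.linalg.eigvalsh(S)[[0, -1]].round(3))
print("shrunk eigenvalue range:", np.linalg.eigvalsh(lw.covariance_)[[0, -1]].round(3))
print("estimated shrinkage:    ", round(float(lw.shrinkage_), 3))
```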
The Wigner side has analogous consequences for any setting that produces a symmetric random matrix, including Hessians of random loss landscapes, kernel matrices with random features, and the linearized dynamics of certain neural-network training schemes.
References
- V. A. Marchenko, L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457-483, 1967.
- E. P. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, 62(3):548-564, 1955.
- T. Tao. Topics in Random Matrix Theory. Graduate Studies in Mathematics, AMS, 2012.
- Z. Bai, J. W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010.
- O. Ledoit, M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411, 2004. (Marchenko-Pastur in shrinkage estimation.)