
Marchenko-Pastur and the Wigner semicircle


The central limit theorem says that the sum of many independent random variables, properly standardized, converges to a deterministic shape (the Gaussian). Random matrix theory gives the analogous statement for spectra: the eigenvalues of a large random matrix, taken together as a histogram, converge to a deterministic density. Two of the cleanest results are the Marchenko-Pastur law for sample covariance matrices and the Wigner semicircle law for symmetric matrices with i.i.d. entries.

The geometric content matters in machine learning: the bias of the sample covariance matrix, the failure of nearest-neighbor and PCA in high dimensions, and a chunk of modern shrinkage theory all live inside these two densities.

Sample covariance matrices

Let $X$ be an $n \times d$ matrix with i.i.d. real entries of mean $0$ and variance $1$. The sample covariance matrix is

$$S \;=\; \frac{1}{n}\, X^\top X \;\in\; \mathbb{R}^{d \times d}.$$

Each entry $S_{ij} = (1/n) \sum_{k=1}^n X_{ki} X_{kj}$ is the empirical correlation between coordinates $i$ and $j$ across the $n$ rows of $X$. With true mean zero and unit variance, the true covariance is the identity $I_d$. By the law of large numbers, if $d$ is fixed and $n \to \infty$, then $S \to I_d$ entrywise, and every eigenvalue of $S$ approaches $1$.
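
This is easy to check numerically. A minimal NumPy sketch of the classical regime (fixed small $d$, large $n$; the specific sizes here are arbitrary choices, not from the demos below):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 200_000                 # fixed small dimension, many samples
X = rng.standard_normal((n, d))   # true covariance is I_d
S = X.T @ X / n                   # sample covariance

print(np.linalg.eigvalsh(S))      # every eigenvalue is close to 1
```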

This stops being true as soon as the dimension $d$ grows along with $n$. Consider the regime

$$n, d \to \infty \quad\text{with}\quad d / n \to c \in (0, \infty).$$

The dimension is now comparable to the sample size. The matrix $S$ no longer concentrates at $I_d$, and its eigenvalues do not pile up at $1$. Instead, they spread into a deterministic shape that depends only on the ratio $c$.

The Marchenko-Pastur law

Marchenko and Pastur (1967) proved that, in the regime above, the empirical distribution of the eigenvalues of $S$ converges (in probability) to the Marchenko-Pastur density

$$\rho_c(\lambda) \;=\; \frac{1}{2\pi c\, \lambda}\, \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \qquad \lambda \in [\lambda_-, \lambda_+],$$

where the endpoints are

$$\lambda_\pm \;=\; (1 \pm \sqrt{c})^2.$$

(For $c > 1$, the matrix $S$ has rank at most $n < d$, so $d - n$ of its eigenvalues are exactly $0$, contributing a point mass of weight $1 - 1/c$ at the origin. The remaining $n$ eigenvalues are spread on the same interval $[\lambda_-, \lambda_+]$.)

The endpoints depend only on $c$. The bulk straddles $1$ (the true eigenvalue of $I_d$) but is shifted and stretched by an amount determined entirely by the dimension-to-samples ratio. Three sanity checks:

  1. As $c \to 0$ (many samples per dimension), both edges $(1 \pm \sqrt{c})^2$ collapse to $1$, recovering the classical law-of-large-numbers limit.
  2. The mean of $\rho_c$ is $1$ for every $c$: the trace $\operatorname{tr}(S)/d$ is an average of $nd$ squared entries and converges to $1$, so the bulk can spread around $1$ but cannot drift away from it.
  3. At $c = 1$ the edges are $\lambda_- = 0$ and $\lambda_+ = 4$, and the density has an integrable $1/\sqrt{\lambda}$ singularity at the origin: with exactly as many samples as dimensions, the smallest eigenvalues are pushed essentially to zero.

The takeaway is geometric and quantitative: at finite $c$, the empirical eigenvalues of $S$ are systematically biased away from those of the true covariance, by a deterministic and known amount. The smallest eigenvalues are pulled down to $\lambda_-$, the largest are pushed up to $\lambda_+$, and the bulk in between is a stretched, asymmetric arch (more mass near the lower edge for moderate $c$).
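
Here is a small helper for the density and its edges, with a numerical check of the first two sanity checks above for one value of $c$ (a sketch; the function names are mine):

```python
import numpy as np

def mp_edges(c):
    """Lower and upper edge of the Marchenko-Pastur bulk."""
    return (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2

def mp_density(lam, c):
    """Marchenko-Pastur density rho_c on its support, 0 elsewhere."""
    lo, hi = mp_edges(c)
    lam = np.asarray(lam, dtype=float)
    out = np.zeros_like(lam)
    inside = (lam > lo) & (lam < hi)
    out[inside] = np.sqrt((hi - lam[inside]) * (lam[inside] - lo)) / (2 * np.pi * c * lam[inside])
    return out

c = 0.5
lo, hi = mp_edges(c)
lam = np.linspace(lo, hi, 200_001)
dx = lam[1] - lam[0]
print("edges:", lo, hi)
print("total mass:", (mp_density(lam, c) * dx).sum())   # ≈ 1 for c <= 1
print("mean:", (lam * mp_density(lam, c) * dx).sum())   # ≈ 1
```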

Slide $c$ on a log scale from $0.1$ to $10$. We fix the dimension $d = 50$ and let the sample size scale as $n = \lfloor d / c \rfloor$, so the slider directly controls how many samples we get per dimension. Each tick draws a fresh $n \times d$ Gaussian $X$. For $c \le 1$ the histogram shows all $d$ eigenvalues of $S = X^\top X / n$. For $c > 1$ we have $n < d$, so the matrix $S$ has rank $n$; we plot the $n$ nonzero eigenvalues (computed as the eigenvalues of the smaller $n \times n$ matrix $X X^\top / n$, which has the same nonzero spectrum) and report the count $d - n$ of exact zeros separately. The red dashed curve is $\rho_c$ for $c \le 1$ and $c \cdot \rho_c$ for $c > 1$ (the conditional density of the nonzero eigenvalues, which integrates to $1$).

[Interactive demo. Current settings: $c = 1.00$, $d = 100$, $n = 100$, $\lambda_- = 0.000$, $\lambda_+ = 4.000$.]

The blue histogram is the empirical eigenvalue distribution of a fresh draw; the red dashed curve is the deterministic limit $\rho_c$. The fit is tight even at $d = 50$.
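
A static version of one tick of this experiment takes a few lines of NumPy (a sketch, not the code behind the widget; the values of $d$ and $c$ here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

d, c = 50, 2.0                         # try values of c on either side of 1
n = int(d / c)
X = rng.standard_normal((n, d))

if c <= 1:
    eigs = np.linalg.eigvalsh(X.T @ X / n)   # all d eigenvalues of S
else:
    # S = X^T X / n has rank n < d; its nonzero eigenvalues are those of X X^T / n
    eigs = np.linalg.eigvalsh(X @ X.T / n)
    print(f"{d - n} eigenvalues of S are exactly zero")

lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print("empirical range:", eigs.min(), eigs.max())
print("MP edges:       ", lo, hi)
```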

The Wigner semicircle law

The Marchenko-Pastur law is for a structured random matrix: $X^\top X$, which is positive semidefinite by construction. The simpler sibling drops the structure and just takes a symmetric matrix whose entries are i.i.d. (above the diagonal).

Let $W$ be an $n \times n$ symmetric real matrix with $W_{ij} \sim \mathcal{N}(0, 1)$ for $i < j$, $W_{ji} = W_{ij}$, and $W_{ii} \sim \mathcal{N}(0, 2)$. (This diagonal variance choice gives the Gaussian Orthogonal Ensemble, but the limit theorem is universal: it holds for arbitrary i.i.d. mean-zero entries with finite variance, suitably normalized.)

Wigner (1955) proved: the empirical distribution of the eigenvalues of $W / \sqrt{n}$ converges to the semicircle density

$$\rho_{\rm sc}(\lambda) \;=\; \frac{1}{2\pi}\, \sqrt{4 - \lambda^2}, \qquad \lambda \in [-2, 2].$$

It is, up to normalization, the upper half of a circle of radius $2$. The density is a fixed function of $\lambda$ alone: there is no analog of the ratio $c$ to tune, because the matrix size and the variance of the entries together fix the scale.

[Interactive demo. Current setting: $n = 80$.]

The histogram approaches the semicircle smoothly as $n$ grows. At $n = 20$ the fit is rough; by $n = 80$ it is essentially indistinguishable from the curve.
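
The corresponding sketch for the semicircle, using the GOE normalization defined above (the value of $n$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 80
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2)      # symmetric; off-diagonal variance 1, diagonal variance 2 (GOE)
eigs = np.linalg.eigvalsh(W / np.sqrt(n))

print("spectral range:", eigs.min(), eigs.max())   # approaches [-2, 2] as n grows

# histogram heights to compare against the semicircle density at the bin centers
heights, edges = np.histogram(eigs, bins=15, density=True)
centers = (edges[:-1] + edges[1:]) / 2
rho_sc = np.sqrt(np.clip(4 - centers ** 2, 0, None)) / (2 * np.pi)
print(np.round(heights, 2))
print(np.round(rho_sc, 2))
```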

Where these limits come from

Both theorems are concentration phenomena for linear spectral statistics. For a fixed test function $f$, the random variable

$$\frac{1}{d} \sum_{i=1}^d f(\lambda_i)$$

(an average of $f$ over the eigenvalues of a $d \times d$ random matrix) has variance of order $1/d^2$ in good cases, and its expectation has a limit as $d \to \infty$. The expectations as $f$ ranges over polynomials are precisely the moments of the limiting density. The method-of-moments approach reduces the proof to combinatorics: counting non-crossing pair partitions for Wigner (which gives the Catalan numbers, the moments of the semicircle), and a related count for Marchenko-Pastur.
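
The moment machinery is visible numerically: the $2k$-th moment of the semicircle is the $k$-th Catalan number, and the trace statistics of a single large Wigner matrix already land close to it (a sketch; the choice of $n$ is arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)

n = 2000
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2)
eigs = np.linalg.eigvalsh(W / np.sqrt(n))

for k in range(1, 5):
    empirical = np.mean(eigs ** (2 * k))    # (1/n) tr((W / sqrt(n))^{2k})
    catalan = comb(2 * k, k) // (k + 1)     # k-th Catalan number: 1, 2, 5, 14
    print(k, round(empirical, 3), catalan)
```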

A more conceptual derivation comes from free probability, in which the semicircle plays the role that the Gaussian plays in classical probability: it is the limit of free sums of independent self-adjoint variables. Marchenko-Pastur is the free analog of the Poisson distribution, arising from a free compound process.

Implications for high-dimensional statistics

Three concrete consequences for machine learning:

  1. Sample covariance is biased even with $n \gg d$, as long as $c = d/n$ is not vanishingly small. The eigenvalue at the lower edge is shifted to $(1 - \sqrt{c})^2$, the upper edge to $(1 + \sqrt{c})^2$. PCA on $S$ inflates the leading eigenvalue and suppresses the smaller ones, in a deterministic way that depends only on $c$.
  2. Shrinkage estimators for the covariance matrix (Ledoit-Wolf, nonlinear shrinkage, etc.) explicitly use the Marchenko-Pastur picture to undo this bias. They pull each empirical eigenvalue back toward $1$ by an amount that depends on its distance from $\lambda_-$ and $\lambda_+$. The Marchenko-Pastur framework is what makes the correction tractable.
  3. The ratio $c$ is the only relevant dimensionless parameter in the bulk. Doubling both $d$ and $n$ leaves the eigenvalue distribution unchanged (see the sketch after this list). This is why “more data” alone does not save you in high-dimensional inference; you need more samples per dimension.
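
A minimal sketch of points 1 and 3, assuming data drawn with true covariance $I_d$: the extreme eigenvalues of $S$ sit near the Marchenko-Pastur edges rather than near $1$, and doubling $d$ and $n$ together changes essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(4)

def extreme_eigs(d, n):
    """Smallest and largest eigenvalue of the sample covariance of n i.i.d. N(0, I_d) rows."""
    X = rng.standard_normal((n, d))
    eigs = np.linalg.eigvalsh(X.T @ X / n)
    return eigs[0], eigs[-1]

c = 0.25                                     # four samples per dimension
for d in (200, 400):                         # doubling d and n together
    n = int(d / c)
    lo, hi = extreme_eigs(d, n)
    print(f"d={d:4d} n={n:4d}  min={lo:.3f}  max={hi:.3f}")

print("MP edges:", (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2)   # 0.25 and 2.25
```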

The Wigner side has analogous consequences for any setting that produces a symmetric random matrix, including Hessians of random loss landscapes, kernel matrices with random features, and the linearized dynamics of certain neural-network training schemes.

References

V. A. Marchenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Matematicheskii Sbornik (1967).

E. P. Wigner, "Characteristic vectors of bordered matrices with infinite dimensions," Annals of Mathematics (1955).

O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis (2004).