
Marchenko-Pastur and the Wigner semicircle


The central limit theorem says that the sum of many independent random variables, properly standardized, converges to a deterministic shape (the Gaussian). Random matrix theory gives the analogous statement for spectra: the eigenvalues of a large random matrix, taken together as a histogram, converge to a deterministic density. Two of the cleanest results are the Marchenko-Pastur law for sample covariance matrices and the Wigner semicircle law for symmetric matrices with i.i.d. entries.

The geometric content matters in machine learning: the bias of the sample covariance matrix, the failure of nearest-neighbor and PCA in high dimensions, and a chunk of modern shrinkage theory all live inside these two densities.

Sample covariance matrices

Let $X$ be an $n \times d$ matrix with i.i.d. real entries of mean $0$ and variance $1$. The sample covariance matrix is

$$S \;=\; \frac{1}{n}\, X^\top X \;\in\; \mathbb{R}^{d \times d}.$$

Each entry $S_{ij} = (1/n) \sum_{k=1}^n X_{ki} X_{kj}$ is the empirical correlation between coordinates $i$ and $j$ across the $n$ rows of $X$. With true mean zero and unit variance, the true covariance is the identity $I_d$. By the law of large numbers, if $d$ is fixed and $n \to \infty$, then $S \to I_d$ entrywise, and every eigenvalue of $S$ approaches $1$.
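
This is easy to check numerically. A minimal NumPy sketch of the classical regime (fixed small $d$, large $n$; the specific sizes here are arbitrary choices, not from the demos below):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 200_000                 # fixed small dimension, many samples
X = rng.standard_normal((n, d))   # true covariance is I_d
S = X.T @ X / n                   # sample covariance

print(np.linalg.eigvalsh(S))      # every eigenvalue is close to 1
```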

This stops being true as soon as the dimension $d$ grows along with $n$. Consider the regime

$$n, d \to \infty \quad\text{with}\quad d / n \to c \in (0, \infty).$$

The dimension is now comparable to the sample size. The matrix $S$ no longer concentrates at $I_d$, and its eigenvalues do not pile up at $1$. Instead, they spread into a deterministic shape that depends only on the ratio $c$.

The Marchenko-Pastur law

Marchenko and Pastur (1967) proved that, in the regime above, the empirical distribution of the eigenvalues of $S$ converges (in probability) to the Marchenko-Pastur density

$$\rho_c(\lambda) \;=\; \frac{1}{2\pi c\, \lambda}\, \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \qquad \lambda \in [\lambda_-, \lambda_+],$$

where the endpoints are

$$\lambda_\pm \;=\; (1 \pm \sqrt{c})^2.$$

(For $c > 1$, the matrix $S$ has rank at most $n < d$, so $d - n$ of its eigenvalues are exactly $0$, contributing a point mass of weight $1 - 1/c$ at the origin. The remaining $n$ eigenvalues are spread on the same interval $[\lambda_-, \lambda_+]$.)

The endpoints depend only on $c$. The bulk straddles $1$ (the true eigenvalue of $I_d$) but is shifted and stretched by an amount determined entirely by the dimension-to-samples ratio. Three sanity checks:

  1. As $c \to 0$ (many samples per dimension), both edges $(1 \pm \sqrt{c})^2$ collapse to $1$, recovering the classical law-of-large-numbers limit.
  2. The mean of $\rho_c$ is $1$ for every $c$: the trace $\operatorname{tr}(S)/d$ is an average of $nd$ squared entries and converges to $1$, so the bulk can spread around $1$ but cannot drift away from it.
  3. At $c = 1$ the edges are $\lambda_- = 0$ and $\lambda_+ = 4$, and the density has an integrable $1/\sqrt{\lambda}$ singularity at the origin: with exactly as many samples as dimensions, the smallest eigenvalues are pushed essentially to zero.

The takeaway is geometric and quantitative: at finite $c$, the empirical eigenvalues of $S$ are systematically biased away from those of the true covariance, by a deterministic and known amount. The smallest eigenvalues are pulled down to $\lambda_-$, the largest are pushed up to $\lambda_+$, and the bulk in between is a stretched, asymmetric arch (more mass near the lower edge for moderate $c$).
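
Here is a small helper for the density and its edges, with a numerical check of the first two sanity checks above for one value of $c$ (a sketch; the function names are mine):

```python
import numpy as np

def mp_edges(c):
    """Lower and upper edge of the Marchenko-Pastur bulk."""
    return (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2

def mp_density(lam, c):
    """Marchenko-Pastur density rho_c on its support, 0 elsewhere."""
    lo, hi = mp_edges(c)
    lam = np.asarray(lam, dtype=float)
    out = np.zeros_like(lam)
    inside = (lam > lo) & (lam < hi)
    out[inside] = np.sqrt((hi - lam[inside]) * (lam[inside] - lo)) / (2 * np.pi * c * lam[inside])
    return out

c = 0.5
lo, hi = mp_edges(c)
lam = np.linspace(lo, hi, 200_001)
dx = lam[1] - lam[0]
print("edges:", lo, hi)
print("total mass:", (mp_density(lam, c) * dx).sum())   # ≈ 1 for c <= 1
print("mean:", (lam * mp_density(lam, c) * dx).sum())   # ≈ 1
```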

Slide $c$ on a log scale from $0.1$ to $10$. We fix the dimension $d = 50$ and let the sample size scale as $n = \lfloor d / c \rfloor$, so the slider directly controls how many samples we get per dimension. Each tick draws a fresh $n \times d$ Gaussian $X$. For $c \le 1$ the histogram shows all $d$ eigenvalues of $S = X^\top X / n$. For $c > 1$ we have $n < d$, so the matrix $S$ has rank $n$; we plot the $n$ nonzero eigenvalues (computed as the eigenvalues of the smaller $n \times n$ matrix $X X^\top / n$, which has the same nonzero spectrum) and report the count $d - n$ of exact zeros separately. The red dashed curve is $\rho_c$ for $c \le 1$ and $c \cdot \rho_c$ for $c > 1$ (the conditional density of the nonzero eigenvalues, which integrates to $1$).

[Interactive demo. Current settings: $c = 1.00$, $d = 100$, $n = 100$, $\lambda_- = 0.000$, $\lambda_+ = 4.000$.]

The blue histogram is the empirical eigenvalue distribution of a fresh draw; the red dashed curve is the deterministic limit $\rho_c$. The fit is tight even at $d = 50$.
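
A static version of one tick of this experiment takes a few lines of NumPy (a sketch, not the code behind the widget; the values of $d$ and $c$ here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

d, c = 50, 2.0                         # try values of c on either side of 1
n = int(d / c)
X = rng.standard_normal((n, d))

if c <= 1:
    eigs = np.linalg.eigvalsh(X.T @ X / n)   # all d eigenvalues of S
else:
    # S = X^T X / n has rank n < d; its nonzero eigenvalues are those of X X^T / n
    eigs = np.linalg.eigvalsh(X @ X.T / n)
    print(f"{d - n} eigenvalues of S are exactly zero")

lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print("empirical range:", eigs.min(), eigs.max())
print("MP edges:       ", lo, hi)
```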

The Wigner semicircle law

The Marchenko-Pastur law is for a structured random matrix: $X^\top X$, which is positive semidefinite by construction. The simpler sibling drops the structure and just takes a symmetric matrix whose entries are i.i.d. (above the diagonal).

Let $W$ be an $n \times n$ symmetric real matrix with $W_{ij} \sim \mathcal{N}(0, 1)$ for $i < j$, $W_{ji} = W_{ij}$, and $W_{ii} \sim \mathcal{N}(0, 2)$. (This diagonal variance choice gives the Gaussian Orthogonal Ensemble, but the limit theorem is universal: it holds for arbitrary i.i.d. mean-zero entries with finite variance, suitably normalized.)

Wigner (1955) proved: the empirical distribution of the eigenvalues of $W / \sqrt{n}$ converges to the semicircle density

$$\rho_{\rm sc}(\lambda) \;=\; \frac{1}{2\pi}\, \sqrt{4 - \lambda^2}, \qquad \lambda \in [-2, 2].$$

It is, up to normalization, the upper half of a circle of radius $2$. The density is a fixed function of $\lambda$ alone: there is no analog of the ratio $c$ to tune, because the matrix size and the variance of the entries together fix the scale.

[Interactive demo. Current setting: $n = 80$.]

The histogram approaches the semicircle smoothly as $n$ grows. At $n = 20$ the fit is rough; by $n = 80$ it is essentially indistinguishable from the curve.
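
The corresponding sketch for the semicircle, using the GOE normalization defined above (the value of $n$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 80
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2)      # symmetric; off-diagonal variance 1, diagonal variance 2 (GOE)
eigs = np.linalg.eigvalsh(W / np.sqrt(n))

print("spectral range:", eigs.min(), eigs.max())   # approaches [-2, 2] as n grows

# histogram heights to compare against the semicircle density at the bin centers
heights, edges = np.histogram(eigs, bins=15, density=True)
centers = (edges[:-1] + edges[1:]) / 2
rho_sc = np.sqrt(np.clip(4 - centers ** 2, 0, None)) / (2 * np.pi)
print(np.round(heights, 2))
print(np.round(rho_sc, 2))
```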

Where these limits come from

Both theorems are concentration phenomena for linear spectral statistics. For a fixed test function $f$, the random variable

$$\frac{1}{d} \sum_{i=1}^d f(\lambda_i)$$

(an average of $f$ over the eigenvalues of a $d \times d$ random matrix) has variance of order $1/d^2$ in good cases, and its expectation has a limit as $d \to \infty$. The expectations as $f$ ranges over polynomials are precisely the moments of the limiting density. The method-of-moments approach reduces the proof to combinatorics: counting non-crossing pair partitions for Wigner (which gives the Catalan numbers, the moments of the semicircle), and a related count for Marchenko-Pastur.
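
The moment machinery is visible numerically: the $2k$-th moment of the semicircle is the $k$-th Catalan number, and the trace statistics of a single large Wigner matrix already land close to it (a sketch; the choice of $n$ is arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)

n = 2000
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2)
eigs = np.linalg.eigvalsh(W / np.sqrt(n))

for k in range(1, 5):
    empirical = np.mean(eigs ** (2 * k))    # (1/n) tr((W / sqrt(n))^{2k})
    catalan = comb(2 * k, k) // (k + 1)     # k-th Catalan number: 1, 2, 5, 14
    print(k, round(empirical, 3), catalan)
```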

A more conceptual derivation comes from free probability, in which the semicircle plays the role that the Gaussian plays in classical probability: it is the limit of free sums of independent self-adjoint variables. Marchenko-Pastur is the free analog of the Poisson distribution, arising from a free compound process.

Implications for high-dimensional statistics

Three concrete consequences for machine learning:

  1. Sample covariance is biased even with $n \gg d$, as long as $c = d/n$ is not vanishingly small. The eigenvalue at the lower edge is shifted to $(1 - \sqrt{c})^2$, the upper edge to $(1 + \sqrt{c})^2$. PCA on $S$ inflates the leading eigenvalue and suppresses the smaller ones, in a deterministic way that depends only on $c$.
  2. Shrinkage estimators for the covariance matrix (Ledoit-Wolf, nonlinear shrinkage, etc.) explicitly use the Marchenko-Pastur picture to undo this bias. They pull each empirical eigenvalue back toward $1$ by an amount that depends on its distance from $\lambda_-$ and $\lambda_+$. The Marchenko-Pastur framework is what makes the correction tractable.
  3. The ratio $c$ is the only relevant dimensionless parameter in the bulk. Doubling both $d$ and $n$ leaves the eigenvalue distribution unchanged (see the sketch after this list). This is why “more data” alone does not save you in high-dimensional inference; you need more samples per dimension.
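
A minimal sketch of points 1 and 3, assuming data drawn with true covariance $I_d$: the extreme eigenvalues of $S$ sit near the Marchenko-Pastur edges rather than near $1$, and doubling $d$ and $n$ together changes essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(4)

def extreme_eigs(d, n):
    """Smallest and largest eigenvalue of the sample covariance of n i.i.d. N(0, I_d) rows."""
    X = rng.standard_normal((n, d))
    eigs = np.linalg.eigvalsh(X.T @ X / n)
    return eigs[0], eigs[-1]

c = 0.25                                     # four samples per dimension
for d in (200, 400):                         # doubling d and n together
    n = int(d / c)
    lo, hi = extreme_eigs(d, n)
    print(f"d={d:4d} n={n:4d}  min={lo:.3f}  max={hi:.3f}")

print("MP edges:", (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2)   # 0.25 and 2.25
```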

The Wigner side has analogous consequences for any setting that produces a symmetric random matrix, including Hessians of random loss landscapes, kernel matrices with random features, and the linearized dynamics of certain neural-network training schemes.

References

V. A. Marchenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Matematicheskii Sbornik (1967).

E. P. Wigner, "Characteristic vectors of bordered matrices with infinite dimensions," Annals of Mathematics (1955).

O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis (2004).