
Central Limit Theorem - why sums become Gaussian


Most things in nature, when you measure them carefully, look roughly Gaussian. Heights, exam scores, voltage noise, sample averages of pretty much any quantity. This is, famously, the central limit theorem (CLT). The textbook statement uses characteristic functions and analytic limits, but here I try to present the underlying picture in a purely geometric way.

The setup

Take an i.i.d. sequence $X_1, X_2, \ldots$ with mean $\mu$ and variance $\sigma^2 < \infty$. The CLT says that the standardized sum

$$Z_n \;=\; \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$

converges in distribution to $\mathcal{N}(0, 1)$ as $n \to \infty$.
I’ve written more on this mode of convergence here.
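
To make this concrete, here is a minimal Monte Carlo sketch, assuming NumPy/SciPy; the Exponential(1) base and the sample sizes are arbitrary choices of mine. It standardizes sums and measures the sup-distance of the empirical CDF to the normal CDF:

```python
# Standardize sums of i.i.d. Exponential(1) draws (mu = 1, sigma = 1) and
# measure the distance to N(0, 1) with a Kolmogorov-Smirnov statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0  # mean and std of Exponential(1)

for n in [1, 5, 30, 200]:
    # 100_000 independent realizations of Z_n
    sums = rng.exponential(scale=1.0, size=(100_000, n)).sum(axis=1)
    z = (sums - n * mu) / (sigma * np.sqrt(n))
    ks = stats.kstest(z, "norm").statistic  # sup-distance to Phi
    print(f"n = {n:4d}   sup |F_n - Phi| ~ {ks:.4f}")
```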

A more visual restatement: summing independent random variables is convolving their densities. Why? For the sum $X + Y$ to land at a particular value $z$, the two summands have to take values that add up to $z$: if $X$ takes value $x$, then $Y$ must take the complementary value $z - x$. By independence, the joint density at any pair $(x, y)$ is the product $f(x)\,g(y)$. The density of $X + Y$ at $z$ is therefore what you get by summing (integrating) the joint density over every valid split $(x, z - x)$ that produces this $z$:

$$f_{X+Y}(z) \;=\; \int f(x)\, g(z - x)\, dx \;=:\; (f \star g)(z).$$

That integral is, by definition, the convolution of $f$ and $g$. So summing $n$ i.i.d. copies of $X$ corresponds to taking the $n$-fold convolution of its density with itself. The CLT becomes a statement about repeated self-convolution: under the right rescaling, it converges to a Gaussian.
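
A quick discrete sanity check of the sum-equals-convolution identity, using a fair die (my choice of example) and `np.convolve`, which computes exactly the sum over splits $(x, z - x)$ from the integral above in discrete form:

```python
# The pmf of the sum of two fair dice is the self-convolution of one die's pmf.
import numpy as np

die = np.full(6, 1 / 6)               # pmf of one die, faces 1..6
sum_pmf = np.convolve(die, die)       # pmf of the sum, supported on 2..12

for total, p in zip(range(2, 13), sum_pmf):
    print(f"P(X + Y = {total:2d}) = {p:.4f}")   # peaks at 7 with 6/36
```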

Watch it happen

Pick a base distribution and slide $n$. The blue bars are the exact distribution of the sum after $n$ convolutions; the red curve is the Gaussian with mean $n\mu$ and variance $n\sigma^2$.

[Interactive widget: slider for $n$; readout at $n = 1$ shows $\mu_n = 3.50$, $\sigma_n = 1.71$ (a fair die).]

A few things to notice as you slide:

- Discrete bases stay discrete: the bars live on a lattice, but their heights hug the red curve after just a handful of convolutions.
- The fit is best near the center and worst in the tails, where relative errors persist longest.
- Skewed bases take visibly longer to symmetrize than symmetric ones; the Berry–Esseen section below makes this quantitative.

Why convolution → Gaussian?

Three complementary intuitions, each capturing a different aspect of why this happens.

(1) Convolution is smoothing. Each convolution averages out sharp features of the densities being combined. Spikes get blurred, gaps get filled, rough edges get rounded. After enough convolutions, only the smoothest distribution with the right mean and variance survives. That happens to be the Gaussian.
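
A rough numerical sketch of the smoothing story, assuming NumPy; the deliberately spiky starting pmf on $\{0, 1, 4\}$ is an arbitrary choice of mine:

```python
# Self-convolve a spiky pmf repeatedly; the max pointwise gap to the
# moment-matched Gaussian shrinks as the spikes get blurred out.
import numpy as np

pmf = np.array([0.5, 0.1, 0.0, 0.0, 0.4])       # spiky pmf on {0, 1, 4}
vals = np.arange(5)
mu = (vals * pmf).sum()
var = ((vals - mu) ** 2 * pmf).sum()

dist = pmf.copy()
for n in range(1, 31):
    support = np.arange(len(dist))              # sum of n copies lives on 0..4n
    gauss = np.exp(-(support - n * mu) ** 2 / (2 * n * var)) \
            / np.sqrt(2 * np.pi * n * var)
    if n in (1, 2, 5, 10, 30):
        print(f"n = {n:2d}   max |pmf - gaussian| = "
              f"{np.abs(dist - gauss).max():.5f}")
    dist = np.convolve(dist, pmf)               # one more convolution
```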

(2) Entropy maximization. Among all densities on $\mathbb{R}$ with a given mean and variance, the Gaussian has maximum differential entropy. Standardized convolution never decreases entropy (by the entropy power inequality, $h\big((X + X')/\sqrt{2}\big) \ge h(X)$ for i.i.d. summands), and the entropy of $Z_n$ converges to the Gaussian maximum (Barron, 1986). Repeated convolution therefore drives the standardized distribution toward the entropy maximizer with the matching moments. This is the entropic CLT, and it gives convergence a thermodynamic flavor: the Gaussian is the thermal equilibrium of independent additive noise.
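
Here is a small grid-based sketch of the entropic picture; the discretization and the Uniform(0, 1) base are my choices. The differential entropy of $Z_n$ should climb toward the Gaussian maximum $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$:

```python
# Discretize Uniform(0,1), self-convolve, and track the differential entropy
# of the standardized sum. Uses h(aX) = h(X) + log a to standardize without
# rescaling the grid.
import numpy as np

dx = 1e-3
f = np.ones(1000)                       # Uniform(0,1) density on a grid
var = 1.0 / 12.0                        # variance of Uniform(0,1)

def entropy(density):
    p = density[density > 1e-12]
    return -(p * np.log(p)).sum() * dx  # differential entropy on the grid

dist = f.copy()
for n in range(1, 9):
    h_zn = entropy(dist) - np.log(np.sqrt(n * var))  # entropy of Z_n
    print(f"n = {n}   h(Z_n) = {h_zn:.4f}")
    dist = np.convolve(dist, f) * dx    # density of the sum of n+1 copies

print(f"Gaussian maximum: {0.5 * np.log(2 * np.pi * np.e):.4f}")
```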

(3) Self-similarity. The Gaussian is the unique distribution with finite variance that is preserved under the operation "add an independent copy and rescale by $1/\sqrt{2}$":

$$X + Y \;\stackrel{d}{=}\; \sqrt{2}\, X \quad\text{when } X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1).$$

For any other finite-variance distribution, this operation moves the shape closer to Gaussian. The add-and-rescale map is a contraction toward the Gaussian fixed point, and the CLT is the statement that iterating it converges.
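
A simulation sketch of the fixed-point map; note that pairing each sample with a shuffled copy only approximates an independent copy, and the exponential start is an arbitrary choice of mine:

```python
# Apply "add an independent copy, rescale by 1/sqrt(2)" to samples and watch
# skewness and excess kurtosis decay toward the Gaussian fixed point (both 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(size=1_000_000) - 1.0      # centered, skewed start

for step in range(6):
    print(f"step {step}: skew = {stats.skew(x):+.4f}, "
          f"excess kurtosis = {stats.kurtosis(x):+.4f}")
    # "independent copy" approximated by a shuffled version of the samples
    x = (x + rng.permutation(x)) / np.sqrt(2)
```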

(Without finite variance, the same fixed-point logic gives non-Gaussian limits called stable distributions: Cauchy, Lévy, and others. The Gaussian is just the most familiar member of an infinite family of self-similar limit laws.)

How fast?

The Berry–Esseen theorem makes the convergence quantitative:

$$\sup_{z \in \mathbb{R}} \big| F_n(z) - \Phi(z) \big| \;\le\; \frac{C\,\rho}{\sigma^3 \sqrt{n}},$$

where $F_n$ is the CDF of $Z_n$, $\Phi$ is the standard normal CDF, and $\rho = \mathbb{E}|X - \mu|^3$ is the third absolute central moment. The best known universal constant is $C \le 0.4748$ (Shevtsova, 2011).
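
The Bernoulli case makes the bound easy to verify numerically, since the exact CDF of $Z_n$ is a standardized binomial step function. A sketch, assuming SciPy:

```python
# Check the Berry-Esseen bound for Bernoulli(0.3). The sup-distance to Phi is
# attained at the jump points of F_n, so both sides of each jump are checked.
import numpy as np
from scipy import stats

p = 0.3
mu, sigma = p, np.sqrt(p * (1 - p))
rho = p * (1 - p) * (p**2 + (1 - p)**2)   # third absolute central moment
C = 0.4748                                 # Shevtsova (2011)

for n in [10, 100, 1000, 10000]:
    k = np.arange(n + 1)
    z = (k - n * mu) / (sigma * np.sqrt(n))       # jump locations of F_n
    binom_cdf = stats.binom.cdf(k, n, p)
    phi = stats.norm.cdf(z)
    # compare Phi to the CDF value just after and just before each jump
    sup = max(np.abs(binom_cdf - phi).max(),
              np.abs(np.concatenate(([0.0], binom_cdf[:-1])) - phi).max())
    bound = C * rho / (sigma**3 * np.sqrt(n))
    print(f"n = {n:5d}   sup = {sup:.5f}   Berry-Esseen bound = {bound:.5f}")
```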

Three takeaways:

- The rate is $O(1/\sqrt{n})$: halving the worst-case CDF error takes four times as many summands.
- The prefactor $\rho/\sigma^3$ is a scale-free measure of skewness and heavy tails; the more asymmetric or spiky the base distribution, the slower the convergence.
- The bound is uniform in $z$: it controls the worst-case error over the whole CDF, not just the error near the mean.

In the widget, the rate-of-convergence comparison is easy to feel: the symmetric uniform die approaches Gaussian faster than the asymmetric Bernoulli(0.3), which has a larger $\rho/\sigma^3$.

What can go wrong?

The standard CLT has two requirements: i.i.d. (or close to it) and finite variance. Drop either and the limit can change shape.
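
A quick illustration of the finite-variance requirement failing, using the standard Cauchy, whose sample mean is again standard Cauchy for every $n$, so no rescaling produces a Gaussian limit:

```python
# Sample means of Cauchy draws never concentrate: the IQR of the standard
# Cauchy is exactly 2, and it stays there no matter how large n gets.
import numpy as np

rng = np.random.default_rng(0)
for n in [1, 30, 1000]:
    means = rng.standard_cauchy(size=(20_000, n)).mean(axis=1)
    iqr = np.percentile(means, 75) - np.percentile(means, 25)
    print(f"n = {n:5d}   IQR of sample mean = {iqr:.3f}")
```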

Higher dimensions

In $\mathbb{R}^d$, the multivariate CLT replaces the variance with a covariance matrix $\Sigma$:

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} (X_i - \mu) \;\xrightarrow{\,d\,}\; \mathcal{N}(0, \Sigma).$$
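
A small simulation sketch of the multivariate statement; the mixing matrix `A` and the centered exponential coordinates are arbitrary choices of mine:

```python
# i.i.d. vectors with correlated, non-Gaussian coordinates: the standardized
# sum's empirical covariance should approach Sigma, with Gaussian marginals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.7, 0.5]])    # mixing matrix; Sigma = A A^T
Sigma = A @ A.T
n, reps = 500, 10_000

# exponential(1) coordinates, centered, then mixed: mean 0, covariance Sigma
x = (rng.exponential(size=(reps, n, 2)) - 1.0) @ A.T
z = x.sum(axis=1) / np.sqrt(n)            # one standardized sum per repetition

print("Sigma:\n", Sigma)
print("empirical cov of Z:\n", np.cov(z.T))
print("KS distance of first marginal to its Gaussian:",
      stats.kstest(z[:, 0] / np.sqrt(Sigma[0, 0]), "norm").statistic)
```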

A lot of high-dimensional statistics is about what happens to this picture when the dimension $d$ grows along with $n$. The covariance matrix becomes a random object itself; concentration-of-measure phenomena take over; and the spectrum of the empirical covariance follows the Marchenko–Pastur law instead of concentrating at the deterministic $\Sigma$. The relevant geometry stops being about a single Gaussian limit and becomes about many Gaussian-ish marginals interacting through random matrix theory. A topic for a separate post.

References

- Barron, A. R. (1986). Entropy and the central limit theorem. *The Annals of Probability*, 14(1), 336–342.
- Shevtsova, I. (2011). On the absolute constants in the Berry–Esseen type inequalities for identically distributed summands. arXiv:1111.6554.


