
Central Limit Theorem - why sums become Gaussian


Most things in nature, when you measure them carefully, look roughly Gaussian. Heights, exam scores, voltage noise, sample averages of pretty much any quantity. This is, famously, the central limit theorem (CLT). The textbook statement uses characteristic functions and analytic limits, but here I try to present the underlying picture in a purely geometric way.

The setup

Take an i.i.d. sequence $X_1, X_2, \ldots$ with mean $\mu$ and variance $\sigma^2 < \infty$. The CLT says that the standardized sum

$$Z_n \;=\; \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$$

converges in distribution to $\mathcal{N}(0, 1)$ as $n \to \infty$.
I’ve written more on this mode of convergence here.
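
To make this concrete, here is a minimal Monte Carlo sketch, assuming NumPy/SciPy; the Exponential(1) base and the sample sizes are arbitrary choices of mine. It standardizes sums and measures the sup-distance of the empirical CDF to the normal CDF:

```python
# Standardize sums of i.i.d. Exponential(1) draws (mu = 1, sigma = 1) and
# measure the distance to N(0, 1) with a Kolmogorov-Smirnov statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0  # mean and std of Exponential(1)

for n in [1, 5, 30, 200]:
    # 100_000 independent realizations of Z_n
    sums = rng.exponential(scale=1.0, size=(100_000, n)).sum(axis=1)
    z = (sums - n * mu) / (sigma * np.sqrt(n))
    ks = stats.kstest(z, "norm").statistic  # sup-distance to Phi
    print(f"n = {n:4d}   sup |F_n - Phi| ~ {ks:.4f}")
```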

A more visual restatement: summing independent random variables is convolving their densities. Why? For the sum $X + Y$ to land at a particular value $z$, the two summands have to take values that add up to $z$: if $X$ takes value $x$, then $Y$ must take the complementary value $z - x$. By independence, the joint density at any pair $(x, y)$ is the product $f(x)\,g(y)$. The density of $X + Y$ at $z$ is therefore what you get by summing (integrating) the joint density over every valid split $(x, z - x)$ that produces this $z$:

$$f_{X+Y}(z) \;=\; \int f(x)\, g(z - x)\, dx \;=:\; (f \star g)(z).$$

That integral is, by definition, the convolution of $f$ and $g$. So summing $n$ i.i.d. copies of $X$ corresponds to taking the $n$-fold convolution of its density with itself. The CLT becomes a statement about repeated self-convolution: under the right rescaling, it converges to a Gaussian.
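
A quick discrete sanity check of the sum-equals-convolution identity, using a fair die (my choice of example) and `np.convolve`, which computes exactly the sum over splits $(x, z - x)$ from the integral above in discrete form:

```python
# The pmf of the sum of two fair dice is the self-convolution of one die's pmf.
import numpy as np

die = np.full(6, 1 / 6)               # pmf of one die, faces 1..6
sum_pmf = np.convolve(die, die)       # pmf of the sum, supported on 2..12

for total, p in zip(range(2, 13), sum_pmf):
    print(f"P(X + Y = {total:2d}) = {p:.4f}")   # peaks at 7 with 6/36
```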

Watch it happen

Pick a base distribution and slide $n$. The blue bars are the exact distribution of the sum after $n$ convolutions; the red curve is the Gaussian with mean $n\mu$ and variance $n\sigma^2$.

[Interactive widget: slider for $n$; readout at $n = 1$ shows $\mu_n = 3.50$, $\sigma_n = 1.71$ (a fair die).]

A few things to notice as you slide:

- Discrete bases stay discrete: the bars live on a lattice, but their heights hug the red curve after just a handful of convolutions.
- The fit is best near the center and worst in the tails, where relative errors persist longest.
- Skewed bases take visibly longer to symmetrize than symmetric ones; the Berry–Esseen section below makes this quantitative.

Why convolution → Gaussian?

Three complementary intuitions, each capturing a different aspect of why this happens.

(1) Convolution is smoothing. Each convolution averages out sharp features of the densities being combined. Spikes get blurred, gaps get filled, rough edges get rounded. After enough convolutions, only the smoothest distribution with the right mean and variance survives. That happens to be the Gaussian.
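
A rough numerical sketch of the smoothing story, assuming NumPy; the deliberately spiky starting pmf on $\{0, 1, 4\}$ is an arbitrary choice of mine:

```python
# Self-convolve a spiky pmf repeatedly; the max pointwise gap to the
# moment-matched Gaussian shrinks as the spikes get blurred out.
import numpy as np

pmf = np.array([0.5, 0.1, 0.0, 0.0, 0.4])       # spiky pmf on {0, 1, 4}
vals = np.arange(5)
mu = (vals * pmf).sum()
var = ((vals - mu) ** 2 * pmf).sum()

dist = pmf.copy()
for n in range(1, 31):
    support = np.arange(len(dist))              # sum of n copies lives on 0..4n
    gauss = np.exp(-(support - n * mu) ** 2 / (2 * n * var)) \
            / np.sqrt(2 * np.pi * n * var)
    if n in (1, 2, 5, 10, 30):
        print(f"n = {n:2d}   max |pmf - gaussian| = "
              f"{np.abs(dist - gauss).max():.5f}")
    dist = np.convolve(dist, pmf)               # one more convolution
```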

(2) Entropy maximization. Among all densities on $\mathbb{R}$ with a given mean and variance, the Gaussian has maximum differential entropy. Standardized convolution never decreases entropy (by the entropy power inequality, $h\big((X + X')/\sqrt{2}\big) \ge h(X)$ for i.i.d. summands), and the entropy of $Z_n$ converges to the Gaussian maximum (Barron, 1986). Repeated convolution therefore drives the standardized distribution toward the entropy maximizer with the matching moments. This is the entropic CLT, and it gives convergence a thermodynamic flavor: the Gaussian is the thermal equilibrium of independent additive noise.
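
Here is a small grid-based sketch of the entropic picture; the discretization and the Uniform(0, 1) base are my choices. The differential entropy of $Z_n$ should climb toward the Gaussian maximum $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$:

```python
# Discretize Uniform(0,1), self-convolve, and track the differential entropy
# of the standardized sum. Uses h(aX) = h(X) + log a to standardize without
# rescaling the grid.
import numpy as np

dx = 1e-3
f = np.ones(1000)                       # Uniform(0,1) density on a grid
var = 1.0 / 12.0                        # variance of Uniform(0,1)

def entropy(density):
    p = density[density > 1e-12]
    return -(p * np.log(p)).sum() * dx  # differential entropy on the grid

dist = f.copy()
for n in range(1, 9):
    h_zn = entropy(dist) - np.log(np.sqrt(n * var))  # entropy of Z_n
    print(f"n = {n}   h(Z_n) = {h_zn:.4f}")
    dist = np.convolve(dist, f) * dx    # density of the sum of n+1 copies

print(f"Gaussian maximum: {0.5 * np.log(2 * np.pi * np.e):.4f}")
```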

(3) Self-similarity. The Gaussian is the unique distribution with finite variance that is preserved under the operation "add an independent copy and rescale by $1/\sqrt{2}$":

$$X + Y \;\stackrel{d}{=}\; \sqrt{2}\, X \quad\text{when } X, Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1).$$

For any other finite-variance distribution, this operation moves the shape closer to Gaussian. The add-and-rescale map is a contraction toward the Gaussian fixed point, and the CLT is the statement that iterating it converges.
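
A simulation sketch of the fixed-point map; note that pairing each sample with a shuffled copy only approximates an independent copy, and the exponential start is an arbitrary choice of mine:

```python
# Apply "add an independent copy, rescale by 1/sqrt(2)" to samples and watch
# skewness and excess kurtosis decay toward the Gaussian fixed point (both 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(size=1_000_000) - 1.0      # centered, skewed start

for step in range(6):
    print(f"step {step}: skew = {stats.skew(x):+.4f}, "
          f"excess kurtosis = {stats.kurtosis(x):+.4f}")
    # "independent copy" approximated by a shuffled version of the samples
    x = (x + rng.permutation(x)) / np.sqrt(2)
```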

(Without finite variance, the same fixed-point logic gives non-Gaussian limits called stable distributions: Cauchy, Lévy, and others. The Gaussian is just the most familiar member of an infinite family of self-similar limit laws.)

How fast?

The Berry–Esseen theorem makes the convergence quantitative:

$$\sup_{z \in \mathbb{R}} \big| F_n(z) - \Phi(z) \big| \;\le\; \frac{C\,\rho}{\sigma^3 \sqrt{n}},$$

where $F_n$ is the CDF of $Z_n$, $\Phi$ is the standard normal CDF, and $\rho = \mathbb{E}|X - \mu|^3$ is the third absolute central moment. The best known universal constant is $C \le 0.4748$ (Shevtsova, 2011).
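
The Bernoulli case makes the bound easy to verify numerically, since the exact CDF of $Z_n$ is a standardized binomial step function. A sketch, assuming SciPy:

```python
# Check the Berry-Esseen bound for Bernoulli(0.3). The sup-distance to Phi is
# attained at the jump points of F_n, so both sides of each jump are checked.
import numpy as np
from scipy import stats

p = 0.3
mu, sigma = p, np.sqrt(p * (1 - p))
rho = p * (1 - p) * (p**2 + (1 - p)**2)   # third absolute central moment
C = 0.4748                                 # Shevtsova (2011)

for n in [10, 100, 1000, 10000]:
    k = np.arange(n + 1)
    z = (k - n * mu) / (sigma * np.sqrt(n))       # jump locations of F_n
    binom_cdf = stats.binom.cdf(k, n, p)
    phi = stats.norm.cdf(z)
    # compare Phi to the CDF value just after and just before each jump
    sup = max(np.abs(binom_cdf - phi).max(),
              np.abs(np.concatenate(([0.0], binom_cdf[:-1])) - phi).max())
    bound = C * rho / (sigma**3 * np.sqrt(n))
    print(f"n = {n:5d}   sup = {sup:.5f}   Berry-Esseen bound = {bound:.5f}")
```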

Three takeaways:

- The rate is $O(1/\sqrt{n})$: halving the worst-case CDF error takes four times as many summands.
- The prefactor $\rho/\sigma^3$ is a scale-free measure of skewness and heavy tails; the more asymmetric or spiky the base distribution, the slower the convergence.
- The bound is uniform in $z$: it controls the worst-case error over the whole CDF, not just the error near the mean.

In the widget, the rate-of-convergence comparison is easy to feel: the symmetric uniform die approaches Gaussian faster than the asymmetric Bernoulli(0.3), which has a larger $\rho/\sigma^3$.

What can go wrong?

The standard CLT has two requirements: i.i.d. (or close to it) and finite variance. Drop either and the limit can change shape.
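
A quick illustration of the finite-variance requirement failing, using the standard Cauchy, whose sample mean is again standard Cauchy for every $n$, so no rescaling produces a Gaussian limit:

```python
# Sample means of Cauchy draws never concentrate: the IQR of the standard
# Cauchy is exactly 2, and it stays there no matter how large n gets.
import numpy as np

rng = np.random.default_rng(0)
for n in [1, 30, 1000]:
    means = rng.standard_cauchy(size=(20_000, n)).mean(axis=1)
    iqr = np.percentile(means, 75) - np.percentile(means, 25)
    print(f"n = {n:5d}   IQR of sample mean = {iqr:.3f}")
```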

Higher dimensions

In $\mathbb{R}^d$, the multivariate CLT replaces the variance with a covariance matrix $\Sigma$:

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} (X_i - \mu) \;\xrightarrow{\,d\,}\; \mathcal{N}(0, \Sigma).$$
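
A small simulation sketch of the multivariate statement; the mixing matrix `A` and the centered exponential coordinates are arbitrary choices of mine:

```python
# i.i.d. vectors with correlated, non-Gaussian coordinates: the standardized
# sum's empirical covariance should approach Sigma, with Gaussian marginals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [0.7, 0.5]])    # mixing matrix; Sigma = A A^T
Sigma = A @ A.T
n, reps = 500, 10_000

# exponential(1) coordinates, centered, then mixed: mean 0, covariance Sigma
x = (rng.exponential(size=(reps, n, 2)) - 1.0) @ A.T
z = x.sum(axis=1) / np.sqrt(n)            # one standardized sum per repetition

print("Sigma:\n", Sigma)
print("empirical cov of Z:\n", np.cov(z.T))
print("KS distance of first marginal to its Gaussian:",
      stats.kstest(z[:, 0] / np.sqrt(Sigma[0, 0]), "norm").statistic)
```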

A lot of high-dimensional statistics is about what happens to this picture when the dimension $d$ grows along with $n$. The covariance matrix becomes a random object itself; concentration-of-measure phenomena take over; and the spectrum of the empirical covariance follows the Marchenko–Pastur law instead of concentrating at the deterministic $\Sigma$. The relevant geometry stops being about a single Gaussian limit and becomes about many Gaussian-ish marginals interacting through random matrix theory. A topic for a separate post.

References

- Barron, A. R. (1986). Entropy and the central limit theorem. *The Annals of Probability*, 14(1), 336–342.
- Shevtsova, I. (2011). On the absolute constants in the Berry–Esseen type inequalities for identically distributed summands. arXiv:1111.6554.


