Skip to content

The Hanson–Wright inequality

Posted on:

A quadratic form is the expression XAX=i,jAijXiXjX^\top A X = \sum_{i,j} A_{ij} X_i X_j for a fixed n×nn \times n matrix AA and a random vector X=(X1,,Xn)X = (X_1, \dots, X_n) with independent, mean-zero entries. The squared length X2\|X\|^2 is the case A=IA = I, a sample variance is a quadratic form, and so is the energy Mx2=x(MM)x\|M x\|^2 = x^\top (M^\top M) x that a fixed linear map MM measures. The question is how tightly XAXX^\top A X concentrates around its mean. The Hanson–Wright inequality is the standard answer, and the answer is not a single Gaussian tail but two regimes.

The mean is the trace

Take the entries to be independent with mean zero and variance 11. Then E[XiXj]=δij\mathbb{E}[X_i X_j] = \delta_{ij}, equal to 11 when i=ji = j and 00 otherwise, so only the diagonal of AA survives the expectation:

E[XAX]=i,jAijE[XiXj]=iAii=tr(A).\mathbb{E}[X^\top A X] = \sum_{i,j} A_{ij}\,\mathbb{E}[X_i X_j] = \sum_i A_{ii} = \operatorname{tr}(A).

So XAXX^\top A X fluctuates around tr(A)\operatorname{tr}(A).

The Gaussian case, by rotating to the eigenbasis

Take XN(0,In)X \sim \mathcal{N}(0, I_n) and AA symmetric. (Any quadratic form sees only the symmetric part 12(A+A)\tfrac{1}{2}(A + A^\top), since xAx=xAxx^\top A x = x^\top A^\top x, so assuming symmetry loses nothing.) Diagonalize A=QΛQA = Q \Lambda Q^\top with orthonormal QQ and eigenvalues λ1,,λn\lambda_1, \dots, \lambda_n. The rotated vector g=QXg = Q^\top X is again standard Gaussian, since the standard Gaussian is rotation invariant, and in these coordinates the quadratic form is a plain weighted sum of squares:

XAX=gΛg=i=1nλigi2,giiidN(0,1).X^\top A X = g^\top \Lambda g = \sum_{i=1}^{n} \lambda_i\, g_i^2, \qquad g_i \overset{\text{iid}}{\sim} \mathcal{N}(0,1).

Subtracting the mean tr(A)=iλi\operatorname{tr}(A) = \sum_i \lambda_i,

XAXtr(A)=i=1nλi(gi21),X^\top A X - \operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i\,(g_i^2 - 1),

a sum of independent terms, one per eigenvalue.

Each term gi21g_i^2 - 1 is a centered chi-square with one degree of freedom. It is sub-exponential, meaning its tail decays like et/2e^{-t/2} rather than a Gaussian’s et2/2e^{-t^2/2}, because squaring a Gaussian fattens the tail. Its moment generating function is

Ees(gi21)=es12s    e2s2for s14.\mathbb{E}\,e^{s(g_i^2 - 1)} = \frac{e^{-s}}{\sqrt{1 - 2s}} \;\le\; e^{2 s^2} \qquad \text{for } |s| \le \tfrac{1}{4}.

Scaling the ii-th term by λi\lambda_i and using independence, the cumulant generating function of the whole sum is at most

lnEesiλi(gi21)=ilnEesλi(gi21)    2s2iλi2=2s2AF2,s14A,\ln \mathbb{E}\, e^{\,s \sum_i \lambda_i (g_i^2 - 1)} = \sum_i \ln \mathbb{E}\, e^{\,s \lambda_i (g_i^2 - 1)} \;\le\; 2 s^2 \sum_i \lambda_i^2 = 2 s^2\, \|A\|_F^2, \qquad |s| \le \frac{1}{4\|A\|},

where AF2=iλi2\|A\|_F^2 = \sum_i \lambda_i^2 is the squared Frobenius norm and the band s14A|s| \le \tfrac{1}{4\|A\|} comes from needing sλi14|s\lambda_i| \le \tfrac{1}{4} for every ii, with A=maxiλi\|A\| = \max_i |\lambda_i| the operator norm (the largest singular value).

This is the cumulant bound of a Bernstein-type variable: a Gaussian-like s2s^2 term, but valid only inside a band whose width is set by A\|A\|. Optimizing the Chernoff bound Pr[St]exp(st+2s2AF2)\Pr[S \ge t] \le \exp(-st + 2 s^2 \|A\|_F^2) over ss in that band gives the two regimes. For small tt the optimal ss stays inside the band and the bound is Gaussian; for large tt the optimum is pinned at the edge s=14As = \tfrac{1}{4\|A\|} and the bound is exponential. Together,

Pr[XAXtr(A)t]    2exp ⁣(18min ⁣(t2AF2,  tA)).\Pr\big[\,|X^\top A X - \operatorname{tr}(A)| \ge t\,\big] \;\le\; 2\exp\!\left(-\frac{1}{8}\min\!\left(\frac{t^2}{\|A\|_F^2},\; \frac{t}{\|A\|}\right)\right).

That is the Hanson–Wright inequality for Gaussian inputs.

For A=IA = I, XAX=X2X^\top A X = \|X\|^2, the mean is tr(I)=n\operatorname{tr}(I) = n, and the norms are IF2=n\|I\|_F^2 = n and I=1\|I\| = 1, so

Pr[X2nt]    2exp ⁣(18min(t2/n,  t)).\Pr\big[\,\big|\,\|X\|^2 - n\,\big| \ge t\,\big] \;\le\; 2\exp\!\left(-\tfrac{1}{8}\min(t^2/n,\; t)\right).

Fluctuations of size tnt \lesssim n are Gaussian; past tnt \approx n the tail turns exponential. This is the concentration of X2\|X\|^2 behind the thin shell in high-dimensional Gaussians: the squared length sits at nn with fluctuations of order n\sqrt{n}, so the length sits at n\sqrt{n} with fluctuations of order 11.

Two norms, two regimes

The Frobenius norm AF2=iλi2\|A\|_F^2 = \sum_i \lambda_i^2 is the total variance: Var(XAX)=2iλi2=2AF2\operatorname{Var}(X^\top A X) = 2\sum_i \lambda_i^2 = 2\|A\|_F^2 for Gaussian XX. It sets the Gaussian regime near the mean, where the fluctuation is an average over all nn eigenvalue-terms and the central-limit effect applies.

The operator norm A=maxiλi\|A\| = \max_i |\lambda_i| is the single largest weight. The heaviest-tailed term in iλi(gi21)\sum_i \lambda_i (g_i^2 - 1) is the one with the biggest λi|\lambda_i|, and it alone carries a sub-exponential tail et/(constA)e^{-t/(\text{const} \cdot \|A\|)} that no averaging removes. It sets the far tail.

The crossover between the two sits at t=AF2/At^\star = \|A\|_F^2 / \|A\|. Below it the form looks Gaussian; above it a single dominant direction takes over and the tail is exponential. The curve below is the rate min(t2/AF2,  t/A)\min(t^2/\|A\|_F^2,\; t/\|A\|) that sits in the exponent: the larger the rate, the smaller the tail. The parabola governs small deviations, the line governs large ones, and they switch at tt^\star. Slide the two norms and watch the crossover move.

AF = 5.0A‖ = 2.00crossover t = 12.5

What decides which regime matters in practice is the shape of the spectrum. Below, the bars are the eigenvalues λi\lambda_i, normalized so the largest is A=1\|A\| = 1. The slider tilts the spectrum from flat, where every direction counts equally, like A=IA = I, to spiky, where one direction dominates. A flat spectrum pushes the crossover tt^\star far out, so the form is Gaussian over a wide range; a spiky spectrum pulls it in, so the exponential tail takes over early. The ratio AF2/A2\|A\|_F^2 / \|A\|^2 that sets the crossover (in units of A\|A\|) is the stable rank, an effective count of active directions.

AF2 = 3.58A‖ = 1stable rank = 3.6t = 3.6

The general theorem

For general independent entries the rotation that handled the Gaussian case is not available: only the Gaussian is rotation invariant, so for any other distribution the rotated coordinates QXQ^\top X, while still sub-Gaussian, are no longer independent, and the clean weighted sum of squares falls apart. The inequality holds anyway. For a random vector with independent, mean-zero, sub-Gaussian entries of sub-Gaussian norm at most KK (the scale on which the entries’ own tails decay like ex2/K2e^{-x^2/K^2}),

Pr[XAXEXAXt]    2exp ⁣(cmin ⁣(t2K4AF2,  tK2A)),\Pr\big[\,|X^\top A X - \mathbb{E}\,X^\top A X| \ge t\,\big] \;\le\; 2\exp\!\left(-c\min\!\left(\frac{t^2}{K^4\|A\|_F^2},\; \frac{t}{K^2\|A\|}\right)\right),

the same two-regime shape, with KK tracking how heavy the entries are.

The general proof splits the form along its diagonal,

XAX=iAiiXi2diagonal+ijAijXiXjoff-diagonal.X^\top A X = \underbrace{\sum_i A_{ii} X_i^2}_{\text{diagonal}} + \underbrace{\sum_{i \ne j} A_{ij} X_i X_j}_{\text{off-diagonal}}.

The diagonal part is a sum of independent variables AiiXi2A_{ii} X_i^2; it carries the mean tr(A)\operatorname{tr}(A) and concentrates by Bernstein’s inequality (the same sub-exponential bound used above, now for a sum rather than after a rotation), no harder than the Gaussian case. The off-diagonal part has mean zero and is the real content of the theorem. Its terms XiXjX_i X_j share variables, so they are not independent, and the standard handle is decoupling: replace one copy of XX by an independent copy XX', so that conditionally on XX' the off-diagonal sum ijAijXiXj\sum_{i\ne j} A_{ij} X_i X_j' is a linear form in the independent variables XiX_i, which sub-Gaussian tools control directly. Carrying that through reproduces the same two norms.

Where it gets used

The inequality is a standard tool wherever a squared length or an energy has to be pinned to its mean.

References



Next Post
Covering the sphere with ε-nets