The Hanson–Wright inequality | Aseem Raj Baranwal

A quadratic form is the expression $X^\top A X = \sum_{i,j} A_{ij} X_i X_j$ for a fixed $n \times n$ matrix $A$ and a random vector $X = (X_1, \dots, X_n)$ with independent, mean-zero entries. The squared length $\|X\|^2$ is the case $A = I$ , a sample variance is a quadratic form, and so is the energy $\|M x\|^2 = x^\top (M^\top M) x$ that a fixed linear map $M$ measures. The question is how tightly $X^\top A X$ concentrates around its mean. The Hanson–Wright inequality is the standard answer, and the answer is not a single Gaussian tail but two regimes.

The mean is the trace

Take the entries to be independent with mean zero and variance $1$ . Then $\mathbb{E}[X_i X_j] = \delta_{ij}$ , equal to $1$ when $i = j$ and $0$ otherwise, so only the diagonal of $A$ survives the expectation:

\mathbb{E}[X^\top A X] = \sum_{i,j} A_{ij}\,\mathbb{E}[X_i X_j] = \sum_i A_{ii} = \operatorname{tr}(A).

So $X^\top A X$ fluctuates around $\operatorname{tr}(A)$ .
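A quick Monte Carlo check of the trace identity, as a minimal sketch. The matrix, the sample size, and the choice of Rademacher ($\pm 1$) entries are illustrative assumptions, not part of the argument above; any independent entries with mean zero and variance $1$ should behave the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000

# Arbitrary fixed matrix (illustrative); it does not need to be symmetric.
A = rng.standard_normal((n, n))

# Rademacher entries: independent, mean 0, variance 1.
X = rng.choice([-1.0, 1.0], size=(trials, n))

# One quadratic form x^T A x per sampled row x.
q = np.einsum("ti,ij,tj->t", X, A, X)

empirical_mean = q.mean()
trace = np.trace(A)
print(f"empirical mean {empirical_mean:.3f}   trace {trace:.3f}")
```

The two printed numbers should agree up to sampling noise, even though the off-diagonal entries of $A$ are not small.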

The Gaussian case, by rotating to the eigenbasis

Take $X \sim \mathcal{N}(0, I_n)$ and $A$ symmetric. (Any quadratic form sees only the symmetric part $\tfrac{1}{2}(A + A^\top)$, since $x^\top A x = x^\top A^\top x$, so assuming symmetry loses nothing.) Diagonalize $A = Q \Lambda Q^\top$ with orthogonal $Q$ and real eigenvalues $\lambda_1, \dots, \lambda_n$. The rotated vector $g = Q^\top X$ is again standard Gaussian, since the standard Gaussian is rotation invariant, and in these coordinates the quadratic form is a plain weighted sum of squares:

X^\top A X = g^\top \Lambda g = \sum_{i=1}^{n} \lambda_i\, g_i^2, \qquad g_i \overset{\text{iid}}{\sim} \mathcal{N}(0,1).

Subtracting the mean $\operatorname{tr}(A) = \sum_i \lambda_i$ ,

X^\top A X - \operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i\,(g_i^2 - 1),

a sum of independent terms, one per eigenvalue.
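The change of coordinates can be checked numerically on a single draw. This is a sketch under assumed names (`B`, `A`, `lam`, `Q` are illustrative): it confirms that a generic matrix and its symmetric part give the same quadratic form, and that the form equals the eigenvalue-weighted sum of squares in the rotated coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

B = rng.standard_normal((n, n))   # generic, non-symmetric matrix
A = 0.5 * (B + B.T)               # its symmetric part

lam, Q = np.linalg.eigh(A)        # A = Q @ diag(lam) @ Q.T with Q orthogonal
x = rng.standard_normal(n)        # one draw of X
g = Q.T @ x                       # coordinates of x in the eigenbasis

form_B = x @ B @ x                        # quadratic form of the original matrix
form_A = x @ A @ x                        # quadratic form of the symmetric part
weighted_squares = np.sum(lam * g**2)     # sum_i lambda_i g_i^2

print(form_B, form_A, weighted_squares)
```

All three printed values should coincide up to floating-point rounding.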

Each term $g_i^2 - 1$ is a centered chi-square with one degree of freedom. It is sub-exponential, meaning its tail decays like $e^{-t/2}$ rather than a Gaussian’s $e^{-t^2/2}$ , because squaring a Gaussian fattens the tail. Its moment generating function is

\mathbb{E}\,e^{s(g_i^2 - 1)} = \frac{e^{-s}}{\sqrt{1 - 2s}} \;\le\; e^{2 s^2} \qquad \text{for } |s| \le \tfrac{1}{4}.
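The inequality between the exact moment generating function and $e^{2s^2}$ can be sanity-checked on a grid. The sketch below works with logarithms and uses an arbitrary grid resolution; it is a numerical check, not a proof.

```python
import numpy as np

s = np.linspace(-0.25, 0.25, 1001)

# Exact log-MGF of g^2 - 1 for g ~ N(0, 1); finite for s < 1/2.
log_mgf = -s - 0.5 * np.log1p(-2.0 * s)

bound = 2.0 * s**2
gap = bound - log_mgf      # should be nonnegative across the band

print("smallest gap on the grid:", gap.min())
print("log-MGF at s = 1/4:", log_mgf[-1], "  bound:", bound[-1])
```

The gap closes only at $s = 0$, where both sides vanish.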

Scaling the $i$ -th term by $\lambda_i$ and using independence, the cumulant generating function (the logarithm of the moment generating function) of the whole sum is at most

\ln \mathbb{E}\, e^{\,s \sum_i \lambda_i (g_i^2 - 1)} = \sum_i \ln \mathbb{E}\, e^{\,s \lambda_i (g_i^2 - 1)} \;\le\; 2 s^2 \sum_i \lambda_i^2 = 2 s^2\, \|A\|_F^2, \qquad |s| \le \frac{1}{4\|A\|},

where $\|A\|_F^2 = \sum_i \lambda_i^2$ is the squared Frobenius norm and the band $|s| \le \tfrac{1}{4\|A\|}$ comes from needing $|s\lambda_i| \le \tfrac{1}{4}$ for every $i$ , with $\|A\| = \max_i |\lambda_i|$ the operator norm (the largest singular value).

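This is the cumulant bound of a Bernstein-type variable: a Gaussian-like $s^2$ term, but valid only inside a band whose width is set by $\|A\|$. Write $S = X^\top A X - \operatorname{tr}(A)$. Optimizing the Chernoff bound $\Pr[S \ge t] \le \exp(-st + 2 s^2 \|A\|_F^2)$ (Markov’s inequality applied to $e^{sS}$) over $s$ in that band gives the two regimes. For $t \le \|A\|_F^2 / \|A\|$ the unconstrained optimum $s = t / (4\|A\|_F^2)$ lies inside the band and the bound is Gaussian, $\exp(-t^2 / (8\|A\|_F^2))$; for larger $t$ the optimum is pinned at the edge $s = \tfrac{1}{4\|A\|}$ and the bound is exponential, at most $\exp(-t / (8\|A\|))$. The same argument applied to $-S$ handles the lower tail, and a union bound over the two tails gives the factor $2$. Together,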

\Pr\big[\,|X^\top A X - \operatorname{tr}(A)| \ge t\,\big] \;\le\; 2\exp\!\left(-\frac{1}{8}\min\!\left(\frac{t^2}{\|A\|_F^2},\; \frac{t}{\|A\|}\right)\right).

That is the Hanson–Wright inequality for Gaussian inputs.
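The Chernoff optimization behind the two regimes can be sketched in a few lines. The function names and the example spectrum below are hypothetical, chosen for illustration; the code maximizes $st - 2s^2\|A\|_F^2$ over the band $0 \le s \le \tfrac{1}{4\|A\|}$ and compares the result with the exponent $\tfrac{1}{8}\min(t^2/\|A\|_F^2,\, t/\|A\|)$ in the stated bound.

```python
import numpy as np

def chernoff_exponent(t, fro2, op):
    """Max over 0 <= s <= 1/(4*op) of  s*t - 2*s^2*fro2."""
    s_free = t / (4.0 * fro2)               # maximizer ignoring the band
    s_star = min(s_free, 1.0 / (4.0 * op))  # clip to the band edge
    return s_star * t - 2.0 * s_star**2 * fro2

def hw_rate(t, fro2, op):
    """Exponent in the stated bound: (1/8) * min(t^2/fro2, t/op)."""
    return min(t**2 / fro2, t / op) / 8.0

# Illustrative spectrum (an assumption for the demo).
lam = np.array([1.0, 0.5, 0.5, 0.25])
fro2 = float(np.sum(lam**2))        # squared Frobenius norm
op = float(np.max(np.abs(lam)))     # operator norm
t_star = fro2 / op                  # where the optimum reaches the band edge

for t in (0.25 * t_star, t_star, 4.0 * t_star):
    print(t, chernoff_exponent(t, fro2, op), hw_rate(t, fro2, op))
```

Below `t_star` the two exponents agree exactly; above it the clipped Chernoff exponent is at least as large as the stated rate, so the simpler $\min$ form gives up only a constant factor.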

For $A = I$ , $X^\top A X = \|X\|^2$ , the mean is $\operatorname{tr}(I) = n$ , and the norms are $\|I\|_F^2 = n$ and $\|I\| = 1$ , so

\Pr\big[\,\big|\,\|X\|^2 - n\,\big| \ge t\,\big] \;\le\; 2\exp\!\left(-\tfrac{1}{8}\min(t^2/n,\; t)\right).

Fluctuations of size $t \lesssim n$ are Gaussian; past $t \approx n$ the tail turns exponential. This is the concentration of $\|X\|^2$ behind the thin shell in high-dimensional Gaussians: the squared length sits at $n$ with fluctuations of order $\sqrt{n}$ , so the length sits at $\sqrt{n}$ with fluctuations of order $1$ .
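A small simulation makes the thin shell visible. The dimensions and trial count are arbitrary choices for this sketch: as $n$ grows, the spread of $\|X\|^2$ should scale like $\sqrt{n}$ while the spread of $\|X\|$ stays of order $1$.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 5_000
results = {}

for n in (10, 100, 1000):
    X = rng.standard_normal((trials, n))   # rows are draws of N(0, I_n)
    sq = np.sum(X**2, axis=1)              # squared lengths, mean n
    length = np.sqrt(sq)
    results[n] = {
        "sq_mean_over_n": sq.mean() / n,
        "sq_std_over_sqrt_n": sq.std() / np.sqrt(n),
        "length_std": length.std(),
    }
    print(n, results[n])
```

The last two columns should stay roughly constant across the three dimensions, which is the $\sqrt{n}$ versus order-$1$ scaling described above.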

Two norms, two regimes

The squared Frobenius norm $\|A\|_F^2 = \sum_i \lambda_i^2$ is, up to a factor of $2$, the total variance: $\operatorname{Var}(X^\top A X) = 2\sum_i \lambda_i^2 = 2\|A\|_F^2$ for Gaussian $X$. It sets the Gaussian regime near the mean, where the fluctuation is a sum over all $n$ eigenvalue terms and the central-limit effect applies.
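The variance formula can be checked by simulation. In this sketch the test matrix and sample size are illustrative assumptions; the two norms are computed directly from the matrix rather than from its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 8, 200_000

B = rng.standard_normal((n, n))
A = 0.5 * (B + B.T)                     # symmetric test matrix (illustrative)

X = rng.standard_normal((trials, n))    # rows are draws of N(0, I_n)
q = np.einsum("ti,ij,tj->t", X, A, X)   # one quadratic form per row

fro2 = np.sum(A**2)                     # squared Frobenius norm = sum of lambda_i^2
op = np.linalg.norm(A, 2)               # operator norm = max |lambda_i|

print("empirical variance:", q.var())
print("2 * ||A||_F^2:     ", 2.0 * fro2)
print("operator norm:     ", op)
```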

The operator norm $\|A\| = \max_i |\lambda_i|$ is the single largest weight. The heaviest-tailed term in $\sum_i \lambda_i (g_i^2 - 1)$ is the one with the biggest $|\lambda_i|$ , and it alone carries a sub-exponential tail $e^{-t/(\text{const} \cdot \|A\|)}$ that no averaging removes. It sets the far tail.

The crossover between the two sits at $t^\star = \|A\|_F^2 / \|A\|$. Below it the form looks Gaussian; above it a single dominant direction takes over and the tail is exponential. The quantity in the exponent is the rate $\min(t^2/\|A\|_F^2,\; t/\|A\|)$: the larger the rate, the smaller the tail. As a function of $t$, the parabola governs small deviations, the line governs large ones, and they switch at $t^\star$. Increasing $\|A\|_F$ with $\|A\|$ held fixed pushes the crossover out; increasing $\|A\|$ with $\|A\|_F$ held fixed pulls it in.

What decides which regime matters in practice is the shape of the spectrum. Normalize the eigenvalues $\lambda_i$ so the largest is $\|A\| = 1$, and tilt the spectrum from flat, where every direction counts equally, like $A = I$, to spiky, where one direction dominates. A flat spectrum pushes the crossover $t^\star$ far out, so the form is Gaussian over a wide range; a spiky spectrum pulls it in, so the exponential tail takes over early. The ratio $\|A\|_F^2 / \|A\|^2$ that sets the crossover (in units of $\|A\|$) is the stable rank, an effective count of active directions.
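The effect of the spectrum's shape can be tabulated directly. The geometric family $\lambda_i = e^{-\text{tilt} \cdot i}$ below is a hypothetical choice made for this sketch, as are the names `stable_rank` and `tilt`; any family that interpolates from flat to spiky would illustrate the same point.

```python
import numpy as np

def stable_rank(lam):
    """||A||_F^2 / ||A||^2 computed from the eigenvalues of a symmetric A."""
    lam = np.asarray(lam, dtype=float)
    return np.sum(lam**2) / np.max(np.abs(lam))**2

n = 50
ranks = []
for tilt in (0.0, 0.05, 0.2, 1.0):
    # Hypothetical spectrum family: tilt = 0 is flat, larger tilt is spikier.
    lam = np.exp(-tilt * np.arange(n))   # largest eigenvalue is 1, so ||A|| = 1
    r = stable_rank(lam)
    ranks.append(r)
    # With ||A|| = 1 the crossover t* = ||A||_F^2 / ||A|| equals the stable rank.
    print(f"tilt {tilt:4.2f}   stable rank = t* = {r:6.2f}")
```

The flat spectrum gives a stable rank equal to $n$, and the stable rank falls toward $1$ as the spectrum concentrates on a single direction.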

The general theorem

For general independent entries the rotation that handled the Gaussian case is not available: among vectors with independent entries only the Gaussian is rotation invariant, so for any other distribution the rotated coordinates $Q^\top X$, while still sub-Gaussian, are in general no longer independent, and the clean weighted sum of squares falls apart. The inequality holds anyway. For a random vector with independent, mean-zero, sub-Gaussian entries of sub-Gaussian norm at most $K$ (the scale on which the entries’ own tails decay like $e^{-x^2/K^2}$),

\Pr\big[\,|X^\top A X - \mathbb{E}\,X^\top A X| \ge t\,\big] \;\le\; 2\exp\!\left(-c\min\!\left(\frac{t^2}{K^4\|A\|_F^2},\; \frac{t}{K^2\|A\|}\right)\right),

the same two-regime shape, with $c > 0$ an absolute constant and $K$ tracking how heavy the entries’ tails are.

The general proof splits the form along its diagonal,

X^\top A X = \underbrace{\sum_i A_{ii} X_i^2}_{\text{diagonal}} + \underbrace{\sum_{i \ne j} A_{ij} X_i X_j}_{\text{off-diagonal}}.

The diagonal part is a sum of independent variables $A_{ii} X_i^2$; it carries the whole mean, $\sum_i A_{ii}\,\mathbb{E}X_i^2$ (equal to $\operatorname{tr}(A)$ for unit-variance entries), and concentrates by Bernstein’s inequality (the same sub-exponential bound used above, now for a sum rather than after a rotation), no harder than the Gaussian case. The off-diagonal part has mean zero and is the real content of the theorem. Its terms $X_i X_j$ share variables, so they are not independent, and the standard handle is decoupling: replace one copy of $X$ by an independent copy $X'$, so that conditionally on $X'$ the off-diagonal sum $\sum_{i\ne j} A_{ij} X_i X_j'$ is a linear form in the independent variables $X_i$, which sub-Gaussian tools control directly. Carrying that through reproduces the same two norms.
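The objects in this proof sketch can be written out concretely. The code below is an illustration under assumed choices (uniform entries on $[-\sqrt{3}, \sqrt{3}]$, which have mean zero and variance $1$, and an arbitrary matrix); it builds the diagonal part, the off-diagonal part, and the decoupled form, and checks their means. It does not verify the decoupling inequality itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 6, 100_000

A = rng.standard_normal((n, n))         # illustrative fixed matrix
A_off = A - np.diag(np.diag(A))         # off-diagonal part of A

def sample(size):
    # Uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1, bounded (so sub-Gaussian).
    return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=size)

X = sample((trials, n))
X_prime = sample((trials, n))           # independent copy used for decoupling

full = np.einsum("ti,ij,tj->t", X, A, X)
diag_part = (X**2) @ np.diag(A)                          # sum_i A_ii X_i^2
off_part = np.einsum("ti,ij,tj->t", X, A_off, X)         # sum_{i != j} A_ij X_i X_j
decoupled = np.einsum("ti,ij,tj->t", X, A_off, X_prime)  # sum_{i != j} A_ij X_i X'_j

print("split is exact:", np.allclose(full, diag_part + off_part))
print("diag mean", diag_part.mean(), " trace", np.trace(A))
print("off-diag mean", off_part.mean(), " decoupled mean", decoupled.mean())
```

For each fixed row of `X_prime`, the decoupled form is a linear combination of the entries of the matching row of `X`, which is what makes it tractable.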

Applications

The inequality is a standard tool wherever a squared length or an energy has to be pinned to its mean.

Length and distance concentration. $\|X\|^2$ around $n$ is the case $A = I$ . More generally $\|Mx\|^2 = x^\top(M^\top M)x$ concentrates around its mean for a fixed map $M$ , which is how one shows random features or embeddings preserve magnitudes.
Random projections. Projecting a fixed vector with a suitably scaled random matrix preserves its length up to a factor $1 \pm \varepsilon$ with high probability. The fluctuation of $\|\Pi x\|^2$ is a quadratic form, and Hanson–Wright supplies the failure probability that feeds the Johnson–Lindenstrauss lemma.
Covariance estimation. Controlling $x^\top(\hat\Sigma - \Sigma)x$, the error of a sample covariance in a fixed direction, means controlling a quadratic form in the data.
As a lemma. It is the usual way to show that $x^\top A x$ is close to its mean for one fixed $x$; an ε-net argument then makes the statement hold uniformly over the whole sphere of directions.
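The random-projection application can be illustrated with a Gaussian projection, which is an assumed model for this sketch (the dimensions, $\varepsilon$, and trial count are also arbitrary). For a unit vector $x$ and $\Pi = G/\sqrt{m}$ with standard Gaussian $G$, the vector $\Pi x$ has independent $\mathcal{N}(0, 1/m)$ entries, so $\|\Pi x\|^2$ is the Gaussian quadratic form with matrix $I_m/m$, for which $\operatorname{tr} = 1$, $\|A\|_F^2 = 1/m$ and $\|A\| = 1/m$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, trials = 30, 100, 1_000
eps = 0.5

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # fixed unit vector

# Gaussian projections scaled so that E ||Pi x||^2 = ||x||^2 (assumed model).
Pi = rng.standard_normal((trials, m, d)) / np.sqrt(m)
sq_norms = np.sum((Pi @ x) ** 2, axis=1)     # ||Pi x||^2 for each draw of Pi

empirical_failure = np.mean(np.abs(sq_norms - 1.0) >= eps)

# Gaussian-case bound with A = I_m / m: rate = (1/8) * min(eps^2 * m, eps * m).
hw_bound = 2.0 * np.exp(-min(eps**2 * m, eps * m) / 8.0)

print("empirical failure rate:", empirical_failure)
print("Gaussian Hanson-Wright bound:", hw_bound)
```

The bound is not tight, so the empirical failure rate should sit well below it; what matters for Johnson–Lindenstrauss is that the bound decays exponentially in $m\varepsilon^2$.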

References

David L. Hanson, Farroll T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971. The original.
Mark Rudelson, Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, no. 82, 1–9, 2013. The modern statement with the two norms and a clean proof by decoupling.
Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. Chapter 6 states and proves the inequality; the sub-exponential and Bernstein bounds it rests on are developed in Chapter 2.
Stéphane Boucheron, Gábor Lugosi, Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. Background on sub-gamma variables and the Bernstein bound behind the two regimes.