Sudakov minoration, or how big a maximum must be

We have a good instinct for averages (or equivalently, sums) of random variables. Add up many (say $N$ ) independent random numbers, divide by $N$ , and the result barely moves: that is the law of large numbers. The central limit theorem refines this, telling us the typical size of the small deviation that remains.

In this post we look not at sums but at the maximum of many random variables. The maximum behaves very differently. The largest of a batch of random numbers keeps drifting upward as the batch grows, whereas the average stays put. Two questions:

1. By how much does the maximum drift upward as we add more numbers?
2. When is the maximum guaranteed to be at least some value?

Question 1 is, for maxima, what the CLT is for averages: a precise rate for how the quantity moves as $N$ grows.

The lower-bound direction (question 2) is the useful one and is also harder to answer. There is a mature toolkit for showing a maximum is not too big (upper bounds). Showing that a maximum is unavoidably big is needed for a variety of impossibility statements, for example:

this estimator cannot beat that rate,
these two clouds of points cannot be separated,
this noise cannot be filtered out, etc.

For Gaussian variables, the simplest tool for that direction is Sudakov’s minoration inequality. It is rarely explained outside of graduate textbooks, so I will try my best to explain it intuitively. I first ran into it while working on an old paper I co-authored, where I needed exactly this kind of lower bound on a maximum.

The maximum of N Gaussians

Start with the simplest possible case. Let $X_1, \dots, X_N$ be independent standard Gaussians, each $\mathcal{N}(0,1)$ . How large is $\max_i X_i$ ?

A single standard Gaussian is almost never far from $0$ : the chance of exceeding a level $t$ falls off as (this follows directly from the Gaussian density)

$$\Pr[X > t] \;\approx\; e^{-t^2/2}$$

for large $t$ (there is a slowly varying $1/t$ factor that I am dropping, but it does not change the conclusion). Now ask how far out the largest of $N$ of them reaches. The expected number of the $N$ variables that clear level $t$ is

$$N \cdot \Pr[X > t] \;\approx\; N\, e^{-t^2/2}.$$

This count is huge when $t$ is small, and essentially zero when $t$ is large, and it passes through $1$ precisely where $N e^{-t^2/2} = 1$ , that is, where $\tfrac{t^2}{2} = \log N$ . So the maximum of $N$ of these variables piles up around

$$t^\star \;=\; \sqrt{2 \log N}.$$

This means that below $t^\star$ , there are many candidates poking above the line, while above it there are essentially none. The maximum of $N$ independent standard Gaussians sits at $\sqrt{2 \log N}$ to leading order. Notice the shape: the maximum grows without bound as $N \to \infty$ , but only at the leisurely rate of $\sqrt{\log N}$ . To double the maximum we need to increase the number of variables $N$ to $N^4$ .
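The leading-order claim can be checked with a quick Monte Carlo experiment. This is a minimal sketch, assuming NumPy is available; the function name, trial count, and seed are illustrative choices, not anything from the argument above.

```python
import numpy as np

def mean_max_of_gaussians(n, trials=500, seed=0):
    """Monte Carlo estimate of E[max of n independent N(0,1) draws]."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((trials, n))   # one row per trial
    return samples.max(axis=1).mean()            # average the row-wise maxima

for n in (10, 100, 1000, 10000):
    estimate = mean_max_of_gaussians(n)
    prediction = np.sqrt(2 * np.log(n))
    print(f"N={n:>6}  simulated={estimate:.3f}  sqrt(2 log N)={prediction:.3f}")
```

At finite $N$ the simulated value sits somewhat below $\sqrt{2\log N}$, because the heuristic drops lower-order corrections, but the gap shrinks in relative terms as $N$ grows.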

As $N$ grows, the threshold $t^\star = \sqrt{2 \log N}$ marches to the right, while the tail it cuts off always keeps mass about $1/N$.

From independent to merely far apart

Independence made that argument easy: $N$ separate tries, each with its own fresh chance at a large value. But the maxima that matter in practice are usually over things that are correlated. The projections of one noise vector onto many directions, the errors of one model across many test points, the value of one random surface at many locations: all correlated, none independent. Can we still promise the maximum is large?

The key is to measure how different two of the variables are. For jointly Gaussian $X_i$ (each mean zero), define the distance between index $i$ and index $j$ as the typical size of their gap:

$$d(i, j) \;=\; \sqrt{\mathbb{E}\big[(X_i - X_j)^2\big]}.$$

This is called the canonical metric, and for mean-zero Gaussians it pays to write it out. With unit variances,

$$d(i,j)^2 \;=\; \mathbb{E}[X_i^2] - 2\,\mathbb{E}[X_i X_j] + \mathbb{E}[X_j^2] \;=\; 2\,(1 - \rho_{ij}),$$

where $\rho_{ij}$ is the correlation between $X_i$ and $X_j$ . Distance is just correlation in disguise, running the opposite way: identical variables ( $\rho = 1$ ) sit at distance $0$ , independent ones ( $\rho = 0$ ) at distance $\sqrt{2}$ , and mirror images ( $\rho = -1$ ) at the largest distance of all, $2$ . So two indices are close only when they are near-copies of each other, and they grow farther apart as their correlation falls from $+1$ down through $0$ to $-1$ . What the maximum cares about is exactly this redundancy: two strongly correlated indices are essentially one variable counted twice, a wasted draw that adds no fresh chance at a large value, while far-apart indices are genuinely different shots. Independent points, at distance $\sqrt{2}$ , are the clean case of two fresh tries; negatively correlated points are spread wider still, and the inequality below rewards that extra spread.
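To make the correlation-to-distance dictionary concrete, here is a small sketch that computes the canonical metric from a covariance matrix. It assumes NumPy; the helper name `canonical_metric` and the four-variable example are hypothetical, chosen only to hit the three cases just discussed.

```python
import numpy as np

def canonical_metric(cov):
    """Pairwise d(i, j) = sqrt(E[(X_i - X_j)^2]) for mean-zero variables."""
    cov = np.asarray(cov, dtype=float)
    var = np.diag(cov)
    # E[(X_i - X_j)^2] = Var(X_i) + Var(X_j) - 2 Cov(X_i, X_j)
    squared = var[:, None] + var[None, :] - 2.0 * cov
    return np.sqrt(np.clip(squared, 0.0, None))  # clip guards against round-off

# Unit variances: X_1 is a copy of X_0, X_2 is independent of both, X_3 = -X_0.
cov = np.array([
    [ 1.0,  1.0, 0.0, -1.0],
    [ 1.0,  1.0, 0.0, -1.0],
    [ 0.0,  0.0, 1.0,  0.0],
    [-1.0, -1.0, 0.0,  1.0],
])
d = canonical_metric(cov)
print(d[0, 1], d[0, 2], d[0, 3])
```

The three printed distances are $0$, $\sqrt{2}$, and $2$, matching the cases $\rho = 1$, $\rho = 0$, and $\rho = -1$.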

Sudakov’s minoration turns pairwise separation directly into a lower bound on the maximum:

Sudakov minoration. Let $X_1, \dots, X_N$ be jointly Gaussian with mean zero. If every pair is at least $\varepsilon$ apart in the canonical metric, $d(i,j) \ge \varepsilon$ for all $i \ne j$ , then
$\mathbb{E}\Big[\max_{i \le N} X_i\Big] \;\ge\; c\, \varepsilon \, \sqrt{\log N},$
where $c > 0$ is a universal constant.

So if no two of the $N$ variables are nearly identical, their expected maximum is at least a constant multiple of $\varepsilon \sqrt{\log N}$ . The only thing we have to check is a pairwise separation, which is usually easy. We do not need the variables to actually be independent, and we do not need to compute anything about the joint distribution beyond the pairwise gaps.

Let us do a sanity-check on the case we already solved. For independent standard Gaussians, $\mathbb{E}[(X_i - X_j)^2] = \operatorname{Var}(X_i) + \operatorname{Var}(X_j) = 2$ , so every pair is exactly $\varepsilon = \sqrt{2}$ apart. Sudakov then gives

$$\mathbb{E}\Big[\max_i X_i\Big] \;\ge\; c\sqrt{2}\,\sqrt{\log N},$$

and we know the true answer is $\sqrt{2\log N}$ . The inequality has exactly the right shape and recovers the true growth rate up to the constant $c$ . Sudakov has dropped the convenient independence requirement and replaced it with a pairwise-distance check, yet the result loses nothing but a constant factor.
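The same check can be run on a correlated family. The sketch below is an illustration under assumptions of my choosing: equicorrelated unit-variance Gaussians built from one shared factor plus independent parts, so that every pair has correlation $\rho$ and canonical distance $\varepsilon = \sqrt{2(1-\rho)}$. The function name and parameters are hypothetical.

```python
import numpy as np

def mean_max_equicorrelated(n, rho, trials=2000, seed=0):
    """Estimate E[max_i X_i] when Corr(X_i, X_j) = rho for all i != j."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((trials, 1))    # common factor, same for every index
    private = rng.standard_normal((trials, n))   # independent part, one per index
    x = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    return x.max(axis=1).mean()

n = 1000
for rho in (0.0, 0.5, 0.9):
    eps = np.sqrt(2.0 * (1.0 - rho))             # pairwise canonical distance
    ratio = mean_max_equicorrelated(n, rho) / (eps * np.sqrt(np.log(n)))
    print(f"rho={rho:.1f}  eps={eps:.3f}  E[max] / (eps sqrt(log N)) = {ratio:.3f}")
```

In this family the ratio stays roughly constant as $\rho$ varies: the maximum shrinks as the variables become more correlated, and it shrinks in step with $\varepsilon$, which is the scaling the inequality predicts.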

A few remarks for the curious:

The general statement is phrased for a supremum over a possibly infinite index set $T$ , using covering numbers: $\mathbb{E}\Big[\sup_{t \in T} X_t\Big] \;\ge\; c \sup_{\varepsilon > 0} \varepsilon \sqrt{\log N(T, d, \varepsilon)},$ where $N(T, d, \varepsilon)$ is the smallest number of balls of radius $\varepsilon$ needed to cover $T$ (not to be confused with the number of variables $N$ above).
There is a twin inequality in the other direction. Dudley’s inequality bounds the same expected supremum from above by an integral of $\sqrt{\log N(T, d, \varepsilon)}$ over all scales. Sudakov is a single scale of that integral, used as a lower bound. For many natural processes the two match up to constants, so Sudakov is not just a crude floor; it often pins down the right order.
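For reference, here are the two bounds side by side in a commonly stated form. The constants $c, C > 0$ are universal and unspecified, and $X_t$ is a mean-zero Gaussian process; the exact constants and the range of integration vary between textbooks.

$$c \,\sup_{\varepsilon > 0} \varepsilon \sqrt{\log N(T, d, \varepsilon)} \;\le\; \mathbb{E}\Big[\sup_{t \in T} X_t\Big] \;\le\; C \int_0^\infty \sqrt{\log N(T, d, \varepsilon)}\; d\varepsilon.$$

The "single scale" remark can be read off directly. Since $N(T, d, u)$ is nonincreasing in $u$ , for every $\varepsilon > 0$ we have

$$\varepsilon \sqrt{\log N(T, d, \varepsilon)} \;\le\; \int_0^{\varepsilon} \sqrt{\log N(T, d, u)}\; du,$$

so each term in Sudakov's supremum is dominated by a piece of Dudley's integral.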

Separation rules out near-duplicates, non-duplicate points are genuinely different tries, and enough genuinely different tries force the maximum to be large. Sudakov turns that chain into a quantitative exchange rate between how far apart the points are and how large the maximum must be.

Lower bounds give impossibility results

Why would we want a lower bound on a maximum? Upper bounds are the reassuring direction: they say the worst case is not that bad, the estimator concentrates, the error is controlled. Lower bounds, on the other hand, are the source of impossibility results. If we can show some noise is quantitatively at least this large no matter what, then no method can drive the corresponding error below it. Minimax lower bounds, hardness-of-detection results, and non-separability results all rest on exactly this kind of statement.

And a supremum is exactly the right object for a worst case. “Can any direction separate these points?” is a statement about $\sup$ over directions. “Is there any test function that detects the signal?” is a $\sup$ over test functions. To prove such a worst case is unavoidably bad, we lower-bound a supremum, which is exactly what Sudakov does.

A toy example: when does a faint signal vanish into noise?

Here is a small problem where Sudakov’s bound does visible work.

We take $N$ noisy readings $Y_1, \dots, Y_N$ . We suspect that exactly one of them, we do not know which, carries a faint extra signal of size $s > 0$ : that reading is $Y_{i_0} = s + X_{i_0}$ , while all the others are pure noise, $Y_i = X_i$ . The noises $X_i$ are jointly Gaussian, and they need not be independent. Readings taken close together, or, for example, sharing a sensor, are correlated, and that is allowed. The obvious way to guess which reading is the special one is to point at the largest. When does that work?

It works only if the planted reading really does rise above all the noise, that is, if its signal clears the loudest of the pure-noise readings,

$$s \;>\; \max_{i \ne i_0} X_i.$$

So everything hinges on the size of the largest pure-noise reading, a maximum of a Gaussian process over the $N$ readings, and that is exactly what Sudakov lower-bounds. As long as the readings are genuinely different from one another, meaning their noises are pairwise at least $\varepsilon$ apart in the canonical metric, then

$$\max_i \, X_i \;\ge\; c\,\varepsilon \sqrt{\log N}$$

on average, and (because a Gaussian maximum concentrates tightly around its mean) with high probability too. There is an unavoidable noise floor of height at least about $c\,\varepsilon\sqrt{\log N}$ . A signal fainter than the floor is buried: the loudest pure-noise reading beats it, we point at the wrong one, and no cleverness in how we inspect the data can recover a signal the noise has already swallowed. This is a real detection threshold: below roughly $c\,\varepsilon\sqrt{\log N}$ , the faint signal is invisible.
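A small simulation shows the threshold in action. This is a sketch under simplifying assumptions of my own: the noise is independent $\mathcal{N}(0,1)$ (the easiest special case of the setting above), the planted index is fixed at $0$, and the function name is hypothetical.

```python
import numpy as np

def argmax_success_rate(n, s, trials=2000, seed=0):
    """Fraction of trials in which the largest reading is the planted one.

    Assumes independent N(0,1) noise, an illustrative special case.
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((trials, n))
    y[:, 0] += s                                 # plant the signal at index 0
    return np.mean(y.argmax(axis=1) == 0)

n = 1000
floor = np.sqrt(2 * np.log(n))                   # leading-order size of the noise maximum
for s in (0.5 * floor, floor, 1.5 * floor, 2.0 * floor):
    print(f"s={s:.2f}  success rate={argmax_success_rate(n, s):.3f}")
```

Well below the floor, pointing at the largest reading almost always picks a pure-noise reading; well above it, the planted reading wins almost every time.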

The figure shows the trade-off. The blue curve is the noise floor $\varepsilon\sqrt{2\log N}$ ; the red line is the signal $s$ . Where the floor stays under the signal the planted reading is detectable; once the floor climbs past, it is lost.

Two things to read off the picture.

More readings make detection harder. The floor rises with $N$ . Every extra reading is another chance for the noise to throw up a large value, and the planted signal has to clear all of them. A maximum taken over more readings can only grow; this is the cost of multiple comparisons, in one clean formula.
Averaging lowers the floor. If we can repeat each reading $m$ times independently and average, the noise scale shrinks from $\varepsilon$ to $\varepsilon/\sqrt{m}$ , and the floor drops with it. As $m$ grows, the detectable region stretches to the right. We can detect fainter signals at the cost of making $m$ times as many measurements, and Sudakov is what certifies that the noise floor really sits at this height, so the gain is a genuine threshold and not an artifact of a loose bound.
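The $1/\sqrt{m}$ scaling of the floor is easy to check numerically. The sketch below again assumes independent $\mathcal{N}(0,1)$ noise in every repeat, which is an illustrative choice and not part of the general setting; the function name and sizes are hypothetical.

```python
import numpy as np

def floor_after_averaging(n, m, trials=300, seed=0):
    """Estimate E[max over n readings] when each reading is the mean of m repeats.

    Assumes independent N(0,1) noise in every repeat (an illustrative special case).
    """
    rng = np.random.default_rng(seed)
    repeats = rng.standard_normal((trials, n, m))
    averaged = repeats.mean(axis=2)              # each averaged reading is N(0, 1/m)
    return averaged.max(axis=1).mean()

n = 500
base = floor_after_averaging(n, m=1)
for m in (1, 4, 16):
    floor = floor_after_averaging(n, m)
    print(f"m={m:>2}  floor={floor:.3f}  base/sqrt(m)={base / np.sqrt(m):.3f}")
```

Quadrupling the number of repeats should roughly halve the simulated floor, up to Monte Carlo error.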

The shape of this argument, lower-bound a worst-case noise to show that nothing below some level can be detected, is why Sudakov is worth knowing. The same skeleton drives a non-separability threshold in an old paper that I co-authored (see Theorem 1 part 3), where the readings are data points and a smoothing graph operation plays the role of the averaging above. Different dressing, identical engine.

References

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. Section 7.4 states and proves Sudakov’s minoration inequality; the Dudley upper bound is in Section 8.1, in the chapter on chaining.
Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019. Chapter 5 develops covering and packing numbers and the supremum bounds built on them.
Michel Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014. The definitive account of how far the lower and upper bounds for suprema can be pushed, via generic chaining.
Aseem Baranwal, Kimon Fountoulakis, Aukosh Jagannath. Graph Convolution for Semi-Supervised Classification: Improved Linear Separability and Out-of-Distribution Generalization. International Conference on Machine Learning (ICML), 2021. arXiv:2102.06966. The non-separability threshold (Theorem 1, part 3) is where the supremum lower bound above does its work.