We have a good instinct for averages (or equivalently, sums) of random variables. Add up many (say ) independent random numbers, divide by , and the result is pretty stable (barely moves): that is the law of large numbers. The central limit theorem is a second-order extension to this result that tells us exactly how small/large that deviation is.
In this post, we look at not the sums, but the maximum of a bunch of random variables. The maximum has a very different story, because taking the largest of a bunch of random numbers we observe that it keeps drifting upward as we add more numbers to the bunch. This is unlike averages, that are pretty stable as we add more random numbers to the bunch. The questions I try to address in this post is:
- By how much, does the maximum drift upward, as we add more numbers?
- When is the maximum guaranteed to be at least some value?
This is directly similar to how the CLT answers: by how much does the average deviate as we add more numbers?
The lower-bound direction (question 2) is the useful one and is also harder to answer. There is a mature toolkit for showing a maximum is not too big (upper bounds). Showing that a maximum is unavoidably big is needed for a variety of impossibility statements, for example:
- this estimator cannot beat that rate,
- these two clouds of points cannot be separated,
- this noise cannot be filtered out, etc.
For Gaussian variables, the simplest tool for that direction is Sudakov’s minoration inequality. It is rarely explained outside of graduate textbooks, so I will try my best to explain it intuitively. I first ran into it while working on an old paper I co-authored, where I needed exactly this kind of lower bound on a maximum.
The maximum of N Gaussians
Start with the simplest possible case. Let be independent standard Gaussians, each . How large is ?
A single standard Gaussian is almost never far from : the chance of exceeding a level falls off as (directly from the definition of Gaussian)
for large (there is a slowly-varying factor I am dropping but it does not change the conclusion). Now ask, how far out the largest of of them reaches. The expected number of the variables that clear level is
This count is huge when is small, and essentially zero when is large, and it passes through precisely where , that is, where . So the maximum of of these variables piles up around
This means that below , there are many candidates poking above the line, while above it there are essentially none. The maximum of independent standard Gaussians sits at to leading order. Notice the shape: the maximum grows without bound as , but only at the leisurely rate of . To double the maximum we need to increase the number of variables to .
Slide below and watch the threshold march to the right while the tail it cuts off keeps exactly the mass :
From independent to merely far apart
Independence made that argument easy: separate tries, each with its own fresh chance at a large value. But the maxima that matter in practice are usually over things that are correlated. The projections of one noise vector onto many directions, the errors of one model across many test points, the value of one random surface at many locations: all correlated, none independent. Can we still promise the maximum is large?
The key is to measure how different two of the variables are. For jointly Gaussian (each mean zero), define the distance between index and index as the typical size of their gap:
This is called the canonical metric, and for mean-zero Gaussians it pays to write it out. With unit variances,
where is the correlation between and . Distance is just correlation in disguise, running the opposite way: identical variables () sit at distance , independent ones () at distance , and mirror images () at the largest distance of all, . So two indices are close only when they are near-copies of each other, and they grow farther apart as their correlation falls from down through to . What the maximum cares about is exactly this redundancy: two strongly correlated indices are essentially one variable counted twice, a wasted draw that adds no fresh chance at a large value, while far-apart indices are genuinely different shots. Independent points, at distance , are the clean case of two fresh tries; negatively correlated points are spread wider still, and the inequality below rewards that extra spread.
Sudakov’s minoration turns pairwise separation directly into a lower bound on the maximum:
Sudakov minoration. Let be jointly Gaussian with mean zero. If every pair is at least apart in the canonical metric, for all , then
where is a universal constant.
So if no two of the variables are nearly identical, their expected maximum is at least . The only thing we have to check is a pairwise separation, which is usually easy. We do not need the variables to actually be independent, and we do not need to compute anything about the joint distribution beyond the pairwise gaps.
Let us do a sanity-check on the case we already solved. For independent standard Gaussians, , so every pair is exactly apart. Sudakov then gives
and we know the true answer is . The inequality has exactly the right shape and recovers the true growth rate up to the constant . Sudakov has managed to get rid of the convenient independence requirement and work with only a pairwise-distance bookkeeping, yet, the result did not lose anything but a constant.
A few remarks for the curious:
- The general statement is phrased for a supremum over a possibly infinite index set , using covering numbers: where counts how many balls of radius are needed to cover .
- There is a twin inequality in the other direction. Dudley’s inequality bounds the same expected supremum from above by an integral of over all scales. Sudakov is a single scale of that integral, used as a lower bound. For many natural processes the two match up to constants, so Sudakov is not just a crude floor; it often pins down the right order.
Separation rules out near-duplicates, non-duplicate points are genuinely different tries, and enough genuinely different tries force the maximum to be large. Sudakov turns that chain into a quantitative exchange rate between how far apart the points are and how large the maximum must be.
Why the lower-bound direction matters
Why would we want a lower bound on a maximum? Note that upper bounds are reassuring: i.e., they say the worst case is not that bad, the estimator concentrates, the error is controlled. Lower bounds on the other hand are the source of impossibility results. If we can show some noise is quantitatively at least this large no matter what, then no method can drive the corresponding error below it. Minimax lower bounds, hardness-of-detection results, and non-separability results all rest on exactly this kind of statement.
And a supremum is exactly the right object for a worst case. “Can any direction separate these points?” is a statement about over directions. “Is there any test function that detects the signal?” is a over test functions. To prove such a worst case is unavoidably bad, we lower-bound a supremum, which is exactly what Sudakov does.
A toy example: when does a faint signal vanish into noise?
Here I try to model a problem where Sudakov’s application is observable.
We take noisy readings . We suspect that exactly one of them, we do not know which, carries a faint extra signal of size : that reading is , while all the others are pure noise, . The noises are jointly Gaussian, and they need not be independent. Readings taken close together, or, for example, sharing a sensor, are correlated, and that is allowed. The obvious way to guess which reading is the special one is to point at the largest. When does that work?
It works only if the planted reading really does rise above all the noise, that is, if its signal clears the loudest of the pure-noise readings,
So everything hinges on the size of the largest pure-noise reading, a maximum of a Gaussian process over the readings, and that is exactly what Sudakov lower-bounds. As long as the readings are genuinely different from one another, meaning their noises are pairwise at least apart in the canonical metric, then
on average, and (because a Gaussian maximum concentrates tightly around its mean, by Gaussian concentration) with high probability too. There is an unavoidable noise floor of height about . A signal fainter than the floor is buried: the loudest pure-noise reading beats it, we point at the wrong one, and no cleverness in how we inspect the data can recover a signal the noise has already swallowed. This is a real detection threshold: below , the faint signal is invisible.
Play with the trade-off. The blue curve is the noise floor ; the red line is the signal . Where the floor stays under the signal the planted reading is detectable; once the floor climbs past, it is lost:
Two things to read off the picture.
- More readings make detection harder. The floor rises with . Every extra reading is another chance for the noise to throw up a large value, and the planted signal has to clear all of them. A maximum taken over more readings can only grow; this is the cost of multiple comparisons, in one clean formula.
- Averaging lowers the floor. If we can repeat each reading times and average, the noise scale shrinks from to , and the floor drops with it. Slide up and the detectable region stretches to the right. We get the detection of fainter signals at the cost of making times as many measurements, and Sudakov is what certifies that the noise floor is where we would want, so the gain is a genuine threshold and not an artifact of a loose bound.
The shape of this argument, lower-bound a worst-case noise to show that nothing below some level can be detected, is why Sudakov is worth knowing. The same skeleton drives a non-separability threshold in an old paper that I co-authored (see Theorem 1 part 3), where the readings are data points and a smoothing graph operation plays the role of the averaging above. Different dressing, identical engine.
References
- Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. Section 7.4 states and proves Sudakov’s minoration inequality; the Dudley upper bound is in Section 8.1, in the chapter on chaining.
- Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019. Chapter 5 develops covering and packing numbers and the supremum bounds built on them.
- Michel Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014. The definitive account of how far the lower and upper bounds for suprema can be pushed, via generic chaining.
- Aseem Baranwal, Kimon Fountoulakis, Aukosh Jagannath. Graph Convolution for Semi-Supervised Classification: Improved Linear Separability and Out-of-Distribution Generalization. International Conference on Machine Learning (ICML), 2021. arXiv:2102.06966. The non-separability threshold (Theorem 1, part 3) is where the supremum lower bound above does its work.