Skip to content

Sudakov minoration, or how big a maximum must be

Posted on:

We have a good instinct for averages (or equivalently, sums) of random variables. Add up many (say NN) independent random numbers, divide by NN, and the result is pretty stable (barely moves): that is the law of large numbers. The central limit theorem is a second-order extension to this result that tells us exactly how small/large that deviation is.

In this post, we look at not the sums, but the maximum of a bunch of random variables. The maximum has a very different story, because taking the largest of a bunch of random numbers we observe that it keeps drifting upward as we add more numbers to the bunch. This is unlike averages, that are pretty stable as we add more random numbers to the bunch. The questions I try to address in this post is:

  1. By how much, does the maximum drift upward, as we add more numbers?
  2. When is the maximum guaranteed to be at least some value?

This is directly similar to how the CLT answers: by how much does the average deviate as we add more numbers?

The lower-bound direction (question 2) is the useful one and is also harder to answer. There is a mature toolkit for showing a maximum is not too big (upper bounds). Showing that a maximum is unavoidably big is needed for a variety of impossibility statements, for example:

For Gaussian variables, the simplest tool for that direction is Sudakov’s minoration inequality. It is rarely explained outside of graduate textbooks, so I will try my best to explain it intuitively. I first ran into it while working on an old paper I co-authored, where I needed exactly this kind of lower bound on a maximum.

The maximum of N Gaussians

Start with the simplest possible case. Let X1,,XNX_1, \dots, X_N be independent standard Gaussians, each N(0,1)\mathcal{N}(0,1). How large is maxiXi\max_i X_i?

A single standard Gaussian is almost never far from 00: the chance of exceeding a level tt falls off as (directly from the definition of Gaussian)

Pr[X>t]    et2/2\Pr[X > t] \;\approx\; e^{-t^2/2}

for large tt (there is a slowly-varying 1/t1/t factor I am dropping but it does not change the conclusion). Now ask, how far out the largest of NN of them reaches. The expected number of the NN variables that clear level tt is

NPr[X>t]    Net2/2.N \cdot \Pr[X > t] \;\approx\; N\, e^{-t^2/2}.

This count is huge when tt is small, and essentially zero when tt is large, and it passes through 11 precisely where Net2/2=1N e^{-t^2/2} = 1, that is, where t22=logN\tfrac{t^2}{2} = \log N. So the maximum of NN of these variables piles up around

t  =  2logN.t^\star \;=\; \sqrt{2 \log N}.

This means that below tt^\star, there are many candidates poking above the line, while above it there are essentially none. The maximum of NN independent standard Gaussians sits at 2logN\sqrt{2 \log N} to leading order. Notice the shape: the maximum grows without bound as NN \to \infty, but only at the leisurely rate of logN\sqrt{\log N}. To double the maximum we need to increase the number of variables NN to N4N^4.

Slide NN below and watch the threshold t=2logNt^\star = \sqrt{2 \log N} march to the right while the tail it cuts off keeps exactly the mass 1/N1/N:

N = 100t = √(2 ln N) = 3.03tail mass ≈ 1/N = 0.010

From independent to merely far apart

Independence made that argument easy: NN separate tries, each with its own fresh chance at a large value. But the maxima that matter in practice are usually over things that are correlated. The projections of one noise vector onto many directions, the errors of one model across many test points, the value of one random surface at many locations: all correlated, none independent. Can we still promise the maximum is large?

The key is to measure how different two of the variables are. For jointly Gaussian XiX_i (each mean zero), define the distance between index ii and index jj as the typical size of their gap:

d(i,j)  =  E[(XiXj)2].d(i, j) \;=\; \sqrt{\mathbb{E}\big[(X_i - X_j)^2\big]}.

This is called the canonical metric, and for mean-zero Gaussians it pays to write it out. With unit variances,

d(i,j)2  =  E[Xi2]2E[XiXj]+E[Xj2]  =  2(1ρij),d(i,j)^2 \;=\; \mathbb{E}[X_i^2] - 2\,\mathbb{E}[X_i X_j] + \mathbb{E}[X_j^2] \;=\; 2\,(1 - \rho_{ij}),

where ρij\rho_{ij} is the correlation between XiX_i and XjX_j. Distance is just correlation in disguise, running the opposite way: identical variables (ρ=1\rho = 1) sit at distance 00, independent ones (ρ=0\rho = 0) at distance 2\sqrt{2}, and mirror images (ρ=1\rho = -1) at the largest distance of all, 22. So two indices are close only when they are near-copies of each other, and they grow farther apart as their correlation falls from +1+1 down through 00 to 1-1. What the maximum cares about is exactly this redundancy: two strongly correlated indices are essentially one variable counted twice, a wasted draw that adds no fresh chance at a large value, while far-apart indices are genuinely different shots. Independent points, at distance 2\sqrt{2}, are the clean case of two fresh tries; negatively correlated points are spread wider still, and the inequality below rewards that extra spread.

Sudakov’s minoration turns pairwise separation directly into a lower bound on the maximum:

Sudakov minoration. Let X1,,XNX_1, \dots, X_N be jointly Gaussian with mean zero. If every pair is at least ε\varepsilon apart in the canonical metric, d(i,j)εd(i,j) \ge \varepsilon for all iji \ne j, then

E[maxiNXi]    cεlogN,\mathbb{E}\Big[\max_{i \le N} X_i\Big] \;\ge\; c\, \varepsilon \, \sqrt{\log N},

where c>0c > 0 is a universal constant.

So if no two of the NN variables are nearly identical, their expected maximum is at least εlogN\varepsilon \sqrt{\log N}. The only thing we have to check is a pairwise separation, which is usually easy. We do not need the variables to actually be independent, and we do not need to compute anything about the joint distribution beyond the pairwise gaps.

Let us do a sanity-check on the case we already solved. For independent standard Gaussians, E[(XiXj)2]=Var(Xi)+Var(Xj)=2\mathbb{E}[(X_i - X_j)^2] = \operatorname{Var}(X_i) + \operatorname{Var}(X_j) = 2, so every pair is exactly ε=2\varepsilon = \sqrt{2} apart. Sudakov then gives

E[maxiXi]    c2logN,\mathbb{E}\Big[\max_i X_i\Big] \;\ge\; c\sqrt{2}\,\sqrt{\log N},

and we know the true answer is 2logN\sqrt{2\log N}. The inequality has exactly the right shape and recovers the true growth rate up to the constant cc. Sudakov has managed to get rid of the convenient independence requirement and work with only a pairwise-distance bookkeeping, yet, the result did not lose anything but a constant.

A few remarks for the curious:

Separation rules out near-duplicates, non-duplicate points are genuinely different tries, and enough genuinely different tries force the maximum to be large. Sudakov turns that chain into a quantitative exchange rate between how far apart the points are and how large the maximum must be.

Why the lower-bound direction matters

Why would we want a lower bound on a maximum? Note that upper bounds are reassuring: i.e., they say the worst case is not that bad, the estimator concentrates, the error is controlled. Lower bounds on the other hand are the source of impossibility results. If we can show some noise is quantitatively at least this large no matter what, then no method can drive the corresponding error below it. Minimax lower bounds, hardness-of-detection results, and non-separability results all rest on exactly this kind of statement.

And a supremum is exactly the right object for a worst case. “Can any direction separate these points?” is a statement about sup\sup over directions. “Is there any test function that detects the signal?” is a sup\sup over test functions. To prove such a worst case is unavoidably bad, we lower-bound a supremum, which is exactly what Sudakov does.

A toy example: when does a faint signal vanish into noise?

Here I try to model a problem where Sudakov’s application is observable.

We take NN noisy readings Y1,,YNY_1, \dots, Y_N. We suspect that exactly one of them, we do not know which, carries a faint extra signal of size s>0s > 0: that reading is Yi0=s+Xi0Y_{i_0} = s + X_{i_0}, while all the others are pure noise, Yi=XiY_i = X_i. The noises XiX_i are jointly Gaussian, and they need not be independent. Readings taken close together, or, for example, sharing a sensor, are correlated, and that is allowed. The obvious way to guess which reading is the special one is to point at the largest. When does that work?

It works only if the planted reading really does rise above all the noise, that is, if its signal clears the loudest of the pure-noise readings,

s  >  maxii0Xi.s \;>\; \max_{i \ne i_0} X_i.

So everything hinges on the size of the largest pure-noise reading, a maximum of a Gaussian process over the NN readings, and that is exactly what Sudakov lower-bounds. As long as the readings are genuinely different from one another, meaning their noises are pairwise at least ε\varepsilon apart in the canonical metric, then

maxiXi    cεlogN\max_i \, X_i \;\ge\; c\,\varepsilon \sqrt{\log N}

on average, and (because a Gaussian maximum concentrates tightly around its mean, by Gaussian concentration) with high probability too. There is an unavoidable noise floor of height about εlogN\varepsilon\sqrt{\log N}. A signal fainter than the floor is buried: the loudest pure-noise reading beats it, we point at the wrong one, and no cleverness in how we inspect the data can recover a signal the noise has already swallowed. This is a real detection threshold: below εlogN\varepsilon\sqrt{\log N}, the faint signal is invisible.

Play with the trade-off. The blue curve is the noise floor ε2logN\varepsilon\sqrt{2\log N}; the red line is the signal ss. Where the floor stays under the signal the planted reading is detectable; once the floor climbs past, it is lost:

effective noise ε/√m = 1.00detectable while N <

Two things to read off the picture.

The shape of this argument, lower-bound a worst-case noise to show that nothing below some level can be detected, is why Sudakov is worth knowing. The same skeleton drives a non-separability threshold in an old paper that I co-authored (see Theorem 1 part 3), where the readings are data points and a smoothing graph operation plays the role of the averaging above. Different dressing, identical engine.

References



Next Post
Voronoi tessellations and Lloyd's algorithm