Nearest neighbor breaks in high dimensions

k-nearest neighbors is one of the most basic ML algorithms: classify a query by looking at its closest few training points. It works beautifully on low-dimensional data. On high-dimensional data, it just stops working.

The reason has a name (the curse of dimensionality) and a sharp geometric statement: in high dimensions, all pairwise distances become essentially the same. Nearest and farthest neighbor are no longer meaningfully different.

The phenomenon

Take $n$ points uniformly at random in the unit cube $[0, 1]^d$. Pick any pair $X, Y$ and look at their squared distance:

$$\|X - Y\|^2 \;=\; \sum_{i=1}^d (X_i - Y_i)^2.$$

Each term $(X_i - Y_i)^2$ is independent across coordinates with mean $\tfrac{1}{6}$ (a one-line computation: for $U, V \stackrel{\text{iid}}{\sim} \mathrm{Unif}[0,1]$, $\mathbb{E}[(U-V)^2] = \tfrac{1}{6}$) and finite variance $\tfrac{7}{180}$.

By the law of large numbers,

$$\frac{1}{d}\|X - Y\|^2 \;\longrightarrow\; \frac{1}{6} \quad \text{a.s. as } d \to \infty.$$

So $\|X - Y\| \approx \sqrt{d/6}$, and this radius is the same for every pair. The fluctuations around it are of order $1/\sqrt{d}$ by the central limit theorem. The histogram of pairwise distances among 50 random points pinches around $\sqrt{d/6}$ as $d$ grows; by $d = 100$, all pairs sit within a few percent of the same distance.

Concretely: the mean distance grows as $\sqrt{d/6}$, the ratio std/mean shrinks like $1/\sqrt{d}$, and by $d = 100$ the maximum and minimum pairwise distances differ by less than 30%; by $d = 500$, less than 12%. For comparison, in $d = 1$ the minimum pairwise distance among 50 points is nearly zero, so the maximum exceeds it by orders of magnitude.
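These numbers are easy to check by simulation. A minimal numpy sketch (the 50-point sample size matches the histogram described above; the seed and function name are arbitrary):

```python
import numpy as np

def pairwise_stats(d, n=50, seed=0):
    """Sample n points uniformly in [0,1]^d and summarize pairwise distances."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    # All pairwise Euclidean distances (upper triangle, no self-pairs).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    pairs = D[np.triu_indices(n, k=1)]
    return pairs.mean(), pairs.std(), pairs.min(), pairs.max()

for d in (1, 100, 500):
    mean, std, dmin, dmax = pairwise_stats(d)
    print(f"d={d:4d}  mean={mean:6.2f}  sqrt(d/6)={np.sqrt(d / 6):6.2f}  "
          f"max/min={dmax / dmin:8.2f}")
```

The printed mean tracks $\sqrt{d/6}$ closely, and the max/min ratio collapses toward 1 as $d$ grows.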

What this does to k-nearest neighbors

kNN works by ranking training points by distance to a query and using the closest few. If every training point is at essentially the same distance, the ranking is essentially random, driven by tiny fluctuations rather than signal. The nearest neighbor is no more informative than a random training point.
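The "no more informative than a random training point" claim can be seen directly by comparing the nearest neighbor's distance to the median distance over the training set (a sketch with uniform data; the function name and sizes are illustrative):

```python
import numpy as np

def nn_vs_random(d, n_train=200, seed=0):
    """Ratio of nearest-neighbor distance to median distance from a query
    to a uniform training set in [0,1]^d. A ratio near 1 means the nearest
    neighbor is barely closer than a typical (random) training point."""
    rng = np.random.default_rng(seed)
    train = rng.random((n_train, d))
    query = rng.random(d)
    dists = np.linalg.norm(train - query, axis=1)
    return dists.min() / np.median(dists)

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  nearest/median distance = {nn_vs_random(d):.3f}")
```

In low dimension the nearest neighbor is dramatically closer than a typical point; by $d = 1000$ the ratio is close to 1.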

Beyer, Goldstein, Ramakrishnan, and Shaft made this precise in 1999 in a paper titled exactly *When is "nearest neighbor" meaningful?* The answer, roughly: not when the dimension is large compared to the number of points and the data has no low-dimensional structure. Their main result:

$$\frac{\max_{i \ne j} \|X_i - X_j\| \;-\; \min_{i \ne j} \|X_i - X_j\|}{\min_{i \ne j} \|X_i - X_j\|} \;\xrightarrow{p}\; 0$$

as $d \to \infty$ for fixed $n$, under mild moment conditions on the coordinate distribution. Maximum and minimum pairwise distances are within an arbitrarily small relative gap.
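The statistic in the theorem can be computed directly. A sketch, assuming uniform coordinates as above (the function name is illustrative):

```python
import numpy as np

def relative_contrast(d, n=50, seed=0):
    """The Beyer et al. statistic: (max - min) / min over all pairwise
    distances of n uniform points in [0,1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    pairs = D[np.triu_indices(n, k=1)]
    return (pairs.max() - pairs.min()) / pairs.min()

for d in (10, 100, 1000, 10000):
    print(f"d={d:6d}  (max-min)/min = {relative_contrast(d):.3f}")
```

The relative contrast shrinks steadily with $d$, exactly as the theorem predicts.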

Why the math works

Let $D_d := \|X - Y\|^2 / d$ be the average squared coordinate difference between $X, Y \stackrel{\text{iid}}{\sim} \mathrm{Unif}[0,1]^d$. By construction, $D_d$ is the average of $d$ i.i.d. random variables with mean $\tfrac{1}{6}$ and variance $\sigma^2 = \tfrac{7}{180}$.

LLN: $D_d \to \tfrac{1}{6}$ a.s. CLT: $\sqrt{d}\,(D_d - \tfrac{1}{6}) \to \mathcal{N}(0, \sigma^2)$. So $D_d \approx \tfrac{1}{6} \pm \sigma/\sqrt{d}$, and

$$\frac{\mathrm{std}(\|X - Y\|)}{\mathbb{E}[\|X - Y\|]} \;=\; O\!\left(\frac{1}{\sqrt{d}}\right) \to 0.$$
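Both constants and the $\sigma/\sqrt{d}$ scaling are easy to verify by Monte Carlo (a sketch; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-coordinate check: for U, V ~ Unif[0,1], (U - V)^2 has
# mean 1/6 ~ 0.1667 and variance 7/180 ~ 0.0389.
u, v = rng.random(10**6), rng.random(10**6)
w = (u - v) ** 2
print(w.mean(), w.var())

# CLT scaling: std(D_d) for D_d = ||X - Y||^2 / d shrinks like sigma/sqrt(d).
sigma = np.sqrt(7 / 180)
for d in (100, 400):
    X, Y = rng.random((5000, d)), rng.random((5000, d))
    D_d = ((X - Y) ** 2).mean(axis=1)
    print(d, D_d.std(), sigma / np.sqrt(d))
```

The empirical std of $D_d$ matches $\sigma/\sqrt{d}$ to within sampling noise.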

The story is the same for Gaussian features (Vershynin, Ch. 5) and for any product distribution with finite second moment. It is fundamentally a concentration of measure phenomenon.

What to do instead

A few common tactics, none free:

  1. Reduce dimension first. PCA, autoencoders, or random projection (Johnson–Lindenstrauss) before applying distance-based methods.
  2. Learn a metric. A learned similarity function (e.g., from a deep network) typically beats raw Euclidean distance.
  3. Use cosine similarity. Direction in high dimensions is more stable than magnitude. Word and image embeddings essentially always use cosine.
  4. Exploit low-dimensional structure. Real high-dimensional data often lives near a low-dimensional manifold. Methods that adapt to that structure dodge the curse rather than fighting it.
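Tactic 1 is a few lines with a Gaussian random matrix. A minimal Johnson–Lindenstrauss-style sketch (the function name and sizes are illustrative, not from any particular library):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project rows of X from d dims down to k via a Gaussian random matrix,
    scaled so squared distances are preserved in expectation (JL-style)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

rng = np.random.default_rng(1)
X = rng.random((100, 10_000))        # 100 points in d = 10,000
Z = random_projection(X, k=1_000)    # down to k = 1,000

# Check distortion on one pair of points: the ratio should be close to 1.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Z[0] - Z[1])
print(proj / orig)
```

The JL guarantee says $k = O(\log n / \varepsilon^2)$ dimensions suffice to preserve all pairwise distances within a factor $1 \pm \varepsilon$, independent of the ambient $d$.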

The first principle: the raw Euclidean metric on a high-dimensional ambient space is rarely the metric you actually want.

References

  - Beyer, Goldstein, Ramakrishnan, Shaft. When is "nearest neighbor" meaningful? Proc. ICDT, 1999.
  - Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
