Skip to content

Shattering and the VC dimension

Posted on:

A classifier is a yes/no rule: feed it a point, it answers ++ or -. A hypothesis class H\mathcal{H} is a whole family of such rules, for instance all thresholds on the line, all half-planes in the plane, or all axis-aligned boxes. One question governs how well anything learned from such a class will generalize to unseen data: how expressive is H\mathcal{H}? A class rich enough to reproduce any pattern of labels on the training points has explained nothing about the data, it has only memorized; a class too poor cannot fit the signal at all. The Vapnik–Chervonenkis (VC) dimension is the single integer that measures this richness, and it measures it combinatorially, by counting the labelings a class can produce, with no reference to probability or to the geometry of a particular dataset.

A single integer can carry that weight for the same reason the ε-net argument works: a union bound does not run over the (usually infinite) class H\mathcal{H}, it runs over the finitely many distinct behaviors H\mathcal{H} exhibits on a fixed sample. The VC dimension is what controls that count.

Shattering

Fix a finite set of points x1,,xnx_1, \dots, x_n. Each rule hHh \in \mathcal{H} stamps them with a pattern of signs (h(x1),,h(xn)){+,}n(h(x_1), \dots, h(x_n)) \in \{+, -\}^n. There are 2n2^n conceivable patterns. As hh ranges over the class, we collect some subset of those 2n2^n patterns, the ones the class can actually produce on these particular points.

The set {x1,,xn}\{x_1, \dots, x_n\} is shattered by H\mathcal{H} if all 2n2^n labelings are produced: for every assignment of +/+/- to the points, some rule in the class realizes it.

Shattering is the strongest possible expressiveness on a fixed set, the class can split those points apart in every conceivable way, so the labels on them carry no constraint the class is forced to respect.

Take the simplest class, thresholds on the line: ha(x)=+h_a(x) = + exactly when xax \ge a, one rule per cutoff aa. A single point x1x_1 is shattered: a cutoff to its left labels it ++, a cutoff to its right labels it -, and both patterns appear. Now take two points x1<x2x_1 < x_2. A threshold labels everything to the right of its cutoff ++, so if x1x_1 is labeled ++ then x2x_2, being larger, is labeled ++ as well. The pattern (+,)(+, -) can never occur. Three of the four patterns appear, but not all four, so no pair of points is shattered by thresholds.

The VC dimension

The VC dimension of H\mathcal{H} is the size of the largest point set it can shatter:

VC(H)=max{n:some set of n points is shattered by H}.\mathrm{VC}(\mathcal{H}) = \max\{\, n : \text{some set of } n \text{ points is shattered by } \mathcal{H} \,\}.

The quantifiers matter: we need some set of nn points to be shatterable, not every one. Thresholds shatter one point and no pair, so VC=1\mathrm{VC} = 1. Walking up a few classes pins down the pattern.

Take half-planes in R2\mathbb{R}^2. Three points in general position, meaning not collinear, form a triangle, and a line can carve off any subset of its corners: every one of the 23=82^3 = 8 labelings has a separating line, so the triangle is shattered. Four points never are. There are two cases. If one point DD lies inside the triangle of the other three, then DD cannot be split off by itself: any line with DD alone on one side would have to leave A,B,CA, B, C on the other, impossible since DD sits in their convex hull (the triangle they span). If instead the four points are in convex position, forming a quadrilateral ABCDABCD, then the diagonal labeling that marks A,CA, C as ++ and B,DB, D as - has its two ++ points separated along the boundary by the two - points, and no single line splits them. Either way, one labeling escapes every half-plane.

The counting is exact, and the widget below makes it concrete. Click a point to flip its sign; a green line appears whenever the current labeling is one a half-plane can realize. Switch between three and four points and tally how many of the 2n2^n labelings get a line.

points:
Click a point to flip its sign.
linearly separable labelings: 8 of 8.

Three points give 88 of 88, the full set, so they are shattered. Four points give at most 1414 of 1616: the two diagonal labelings have no line. So some triple is shattered and no quadruple is, which is exactly VC(half-planes)=3\mathrm{VC}(\text{half-planes}) = 3. Those counts are not specific to the configuration on screen. For nn points in general position, the number of labelings a half-plane in R2\mathbb{R}^2 can realize is exactly 2k=02(n1k)2 \sum_{k=0}^{2} \binom{n-1}{k}, a result of Cover, and that equals 2n2^n only up to n=3n = 3.

The growth function and the Sauer–Shelah jump

To track expressiveness across all sample sizes at once, define the growth function

ΠH(n)=maxx1,,xn {(h(x1),,h(xn)):hH},\Pi_{\mathcal{H}}(n) = \max_{x_1, \dots, x_n} \ \big| \{ (h(x_1), \dots, h(x_n)) : h \in \mathcal{H} \} \big|,

the most labelings the class can produce on any nn points. By construction ΠH(n)2n\Pi_{\mathcal{H}}(n) \le 2^n, with equality exactly when some nn-set is shattered. So the VC dimension is the last nn at which ΠH(n)=2n\Pi_{\mathcal{H}}(n) = 2^n.

What happens after that is the structural fact at the center of the theory. Let d=VC(H)d = \mathrm{VC}(\mathcal{H}). Then for every nn,

ΠH(n)  i=0d(ni)  (end)d.\Pi_{\mathcal{H}}(n) \ \le\ \sum_{i=0}^{d} \binom{n}{i} \ \le\ \Big( \frac{en}{d} \Big)^{d}.

This is the Sauer–Shelah lemma. Up to n=dn = d the growth function is 2n2^n, doubling with each new point. The moment nn passes dd it can no longer double, it is trapped under a polynomial of degree dd. A quantity that grows exponentially up to a threshold and polynomially forever after, with the switch landing exactly at the VC dimension.

Slide dd below. The blue line is 2n2^n, what shattering would demand; the green curve is the Sauer–Shelah cap. They sit on top of each other until n=dn = d, then the cap peels away and falls far below the exponential. The vertical axis is logarithmic, so the straight blue line really is 2n2^n and the bending green curve really is polynomial.

d = 3at n = 22: 2n = , cap =

Why a single integer controls generalization

Suppose we want every rule in H\mathcal{H} to have its error on the training sample close to its true error on the full distribution, all rules at once. For one fixed rule this is a law of large numbers statement, with a Hoeffding tail bounding the gap. For all rules simultaneously we need a union bound, and when H\mathcal{H} is infinite, as it is for half-planes or any class with continuous parameters, a union bound over H|\mathcal{H}| is vacuous.

Vapnik and Chervonenkis observed that the count that belongs in the union bound is not the number of rules but the number of distinct labelings they produce on the nn samples, and that count is the growth function ΠH(n)\Pi_{\mathcal{H}}(n), which Sauer–Shelah caps at (en/d)d(en/d)^d. Two rules that label the sample identically are indistinguishable on the data, so they should count once. Carrying the polynomial growth function through the union bound gives the VC generalization bound: with probability at least 1δ1 - \delta, every hHh \in \mathcal{H} satisfies

true errortraining error  dlog(n/d)+log(1/δ)n.\big| \text{true error} - \text{training error} \big| \ \lesssim\ \sqrt{ \frac{d \log(n/d) + \log(1/\delta)}{n} }.

A finite VC dimension dd is exactly the condition for the right side to vanish as nn \to \infty. A class is learnable in this uniform sense precisely when d<d < \infty, and the sample size needed for a target accuracy grows linearly in dd. The full chain, from “can the class shatter this set?” to “how many samples until it generalizes?”, is carried by that one integer.

This is the same move as the ε-net bound on the operator norm: replace an unmanageable supremum over an infinite set by a maximum over a finite, controlled number of cases. There the finite object was a net of directions on the sphere; here it is the growth function of labelings. Infinitude is rarely the real obstacle in high-dimensional probability, the effective count is the thing to bound, and the VC dimension is that count for a class of classifiers.

References



Next Post
The Hanson–Wright inequality