Shattering and the VC dimension | Aseem Raj Baranwal

A classifier is a yes/no rule: feed it a point, it answers $+$ or $-$ . A hypothesis class $\mathcal{H}$ is a whole family of such rules, for instance all thresholds on the line, all half-planes in the plane, or all axis-aligned boxes. One question governs how well anything learned from such a class will generalize to unseen data: how expressive is $\mathcal{H}$ ? A class rich enough to reproduce any pattern of labels on the training points has explained nothing about the data, it has only memorized; a class too poor cannot fit the signal at all. The Vapnik–Chervonenkis (VC) dimension is the single integer that measures this richness, and it measures it combinatorially, by counting the labelings a class can produce, with no reference to probability or to the geometry of a particular dataset.

A single integer can carry that weight for the same reason the ε-net argument works: a union bound does not run over the (usually infinite) class $\mathcal{H}$ , it runs over the finitely many distinct behaviors $\mathcal{H}$ exhibits on a fixed sample. The VC dimension is what controls that count.

Shattering

Fix a finite set of points $x_1, \dots, x_n$ . Each rule $h \in \mathcal{H}$ stamps them with a pattern of signs $(h(x_1), \dots, h(x_n)) \in \{+, -\}^n$ . There are $2^n$ conceivable patterns. As $h$ ranges over the class, we collect some subset of those $2^n$ patterns, the ones the class can actually produce on these particular points.

The set $\{x_1, \dots, x_n\}$ is shattered by $\mathcal{H}$ if all $2^n$ labelings are produced: for every assignment of $+/-$ to the points, some rule in the class realizes it.

Shattering is the strongest possible expressiveness on a fixed set, the class can split those points apart in every conceivable way, so the labels on them carry no constraint the class is forced to respect.

Take the simplest class, thresholds on the line: $h_a(x) = +$ exactly when $x \ge a$ , one rule per cutoff $a$ . A single point $x_1$ is shattered: a cutoff to its left labels it $+$ , a cutoff to its right labels it $-$ , and both patterns appear. Now take two points $x_1 < x_2$ . A threshold labels everything to the right of its cutoff $+$ , so if $x_1$ is labeled $+$ then $x_2$ , being larger, is labeled $+$ as well. The pattern $(+, -)$ can never occur. Three of the four patterns appear, but not all four, so no pair of points is shattered by thresholds.

The VC dimension

The VC dimension of $\mathcal{H}$ is the size of the largest point set it can shatter:
$\mathrm{VC}(\mathcal{H}) = \max\{\, n : \text{some set of } n \text{ points is shattered by } \mathcal{H} \,\}.$

The quantifiers matter: we need some set of $n$ points to be shatterable, not every one. Thresholds shatter one point and no pair, so $\mathrm{VC} = 1$ . Walking up a few classes pins down the pattern.

Thresholds on $\mathbb{R}$ , $h_a = \mathbf{1}[x \ge a]$ : one point shattered, no pair. $\mathrm{VC} = 1$ .
Intervals on $\mathbb{R}$ , $h_{a,b} = \mathbf{1}[a \le x \le b]$ : two points get all four patterns, since an empty interval gives $(-,-)$ , a wide one gives $(+,+)$ , and a one-sided interval gives each of $(+,-)$ and $(-,+)$ . Three points $x_1 < x_2 < x_3$ cannot get $(+, -, +)$ , because an interval covering the two ends must cover the middle. $\mathrm{VC} = 2$ .
Half-planes in $\mathbb{R}^2$ (a straight line with one side labeled $+$ ): $\mathrm{VC} = 3$ , worked out below.
Half-spaces in $\mathbb{R}^d$ : $\mathrm{VC} = d + 1$ .
Axis-aligned rectangles in $\mathbb{R}^2$ (inside the box is $+$ ): $\mathrm{VC} = 4$ .

Take half-planes in $\mathbb{R}^2$ . Three points in general position, meaning not collinear, form a triangle, and a line can carve off any subset of its corners: every one of the $2^3 = 8$ labelings has a separating line, so the triangle is shattered. Four points never are. There are two cases. If one point $D$ lies inside the triangle of the other three, then $D$ cannot be split off by itself: any line with $D$ alone on one side would have to leave $A, B, C$ on the other, impossible since $D$ sits in their convex hull (the triangle they span). If instead the four points are in convex position, forming a quadrilateral $ABCD$ , then the diagonal labeling that marks $A, C$ as $+$ and $B, D$ as $-$ has its two $+$ points separated along the boundary by the two $-$ points, and no single line splits them. Either way, one labeling escapes every half-plane.

The counting is exact, and the widget below makes it concrete. Click a point to flip its sign; a green line appears whenever the current labeling is one a half-plane can realize. Switch between three and four points and tally how many of the $2^n$ labelings get a line.

Three points give $8$ of $8$ , the full set, so they are shattered. Four points give at most $14$ of $16$ : the two diagonal labelings have no line. So some triple is shattered and no quadruple is, which is exactly $\mathrm{VC}(\text{half-planes}) = 3$ . Those counts are not specific to the configuration on screen. For $n$ points in general position, the number of labelings a half-plane in $\mathbb{R}^2$ can realize is exactly $2 \sum_{k=0}^{2} \binom{n-1}{k}$ , a result of Cover, and that equals $2^n$ only up to $n = 3$ .

The growth function and the Sauer–Shelah jump

To track expressiveness across all sample sizes at once, define the growth function

\Pi_{\mathcal{H}}(n) = \max_{x_1, \dots, x_n} \ \big| \{ (h(x_1), \dots, h(x_n)) : h \in \mathcal{H} \} \big|,

the most labelings the class can produce on any $n$ points. By construction $\Pi_{\mathcal{H}}(n) \le 2^n$ , with equality exactly when some $n$ -set is shattered. So the VC dimension is the last $n$ at which $\Pi_{\mathcal{H}}(n) = 2^n$ .

What happens after that is the structural fact at the center of the theory. Let $d = \mathrm{VC}(\mathcal{H})$ . Then for every $n$ ,

\Pi_{\mathcal{H}}(n) \ \le\ \sum_{i=0}^{d} \binom{n}{i} \ \le\ \Big( \frac{en}{d} \Big)^{d}.

This is the Sauer–Shelah lemma. Up to $n = d$ the growth function is $2^n$ , doubling with each new point. The moment $n$ passes $d$ it can no longer double, it is trapped under a polynomial of degree $d$ . A quantity that grows exponentially up to a threshold and polynomially forever after, with the switch landing exactly at the VC dimension.

Slide $d$ below. The blue line is $2^n$ , what shattering would demand; the green curve is the Sauer–Shelah cap. They sit on top of each other until $n = d$ , then the cap peels away and falls far below the exponential. The vertical axis is logarithmic, so the straight blue line really is $2^n$ and the bending green curve really is polynomial.

A single integer controls generalization

Suppose we want every rule in $\mathcal{H}$ to have its error on the training sample close to its true error on the full distribution, all rules at once. For one fixed rule this is a law of large numbers statement, with a Hoeffding tail (an exponential bound on how far an average of bounded variables strays from its mean) bounding the gap. For all rules simultaneously we need a union bound, and when $\mathcal{H}$ is infinite, as it is for half-planes or any class with continuous parameters, a union bound over $|\mathcal{H}|$ is vacuous.

Vapnik and Chervonenkis observed that the count that belongs in the union bound is not the number of rules but the number of distinct labelings they produce on the $n$ samples, and that count is the growth function $\Pi_{\mathcal{H}}(n)$ , which Sauer–Shelah caps at $(en/d)^d$ . Two rules that label the sample identically are indistinguishable on the data, so they should count once. Carrying the polynomial growth function through the union bound gives the VC generalization bound: with probability at least $1 - \delta$ , every $h \in \mathcal{H}$ satisfies

\big| \text{true error} - \text{training error} \big| \ \lesssim\ \sqrt{ \frac{d \log(n/d) + \log(1/\delta)}{n} }.

A finite VC dimension $d$ is exactly the condition for the right side to vanish as $n \to \infty$ . A class is learnable in this uniform sense precisely when $d < \infty$ , and the sample size needed for a target accuracy grows linearly in $d$ . The full chain, from “can the class shatter this set?” to “how many samples until it generalizes?”, is carried by that one integer.

This is the same move as the ε-net bound on the operator norm: replace an unmanageable supremum over an infinite set by a maximum over a finite, controlled number of cases. There the finite object was a net of directions on the sphere; here it is the growth function of labelings. Infinitude is rarely the real obstacle in high-dimensional probability, the effective count is the thing to bound, and the VC dimension is that count for a class of classifiers.

References

V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971. The origin of the growth function and the dimension that bounds it.
Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972. One of the independent proofs of the exponential-to-polynomial bound.
Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972. The same combinatorial lemma, discovered independently in model theory.
Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14(3):326–334, 1965. Source of the count $2\sum_{k=0}^{d}\binom{n-1}{k}$ for linearly realizable labelings.
Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. Chapter 6 develops shattering, VC dimension, and the fundamental theorem of statistical learning.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning, 2nd edition. MIT Press, 2018. Chapter 3 covers the growth function, Sauer’s lemma, and VC generalization bounds.