
Effects of graph convolutions in multi-layer networks


A natural question for graph neural networks: when you add a “graph convolution” layer that aggregates features from neighbors, does it actually help, and by how much? In the right theoretical setting, the answer is sharp.

In our ICLR 2023 paper with Kimon Fountoulakis and Aukosh Jagannath, we study this for multi-layer networks classifying nodes in a contextual stochastic block model whose features are non-linearly separable, so a hidden layer is genuinely needed and a hyperplane cannot work. The main result: each graph convolution provably lowers the feature-signal threshold for successful classification, by a quantitative factor that depends on the graph density. Two convolutions help dramatically more than one. The placement of the convolution layers within the network does not matter asymptotically.

The setup

The data model has three pieces:

  1. Two balanced latent classes, $y \in \{+1, -1\}$.
  2. Non-linearly separable Gaussian features conditional on class. Concretely, features within a class are drawn from a 2-component Gaussian mixture, with components placed so that no hyperplane separates the classes (an XOR-style arrangement). This makes the problem genuinely require a multi-layer network.
  3. A stochastic block model on top, with intra-class edge probability $p$ and inter-class edge probability $q$, where $p > q$ (so neighbors are more likely to share a class).

We call the parameter controlling the within-class mixture separation the feature signal $\gamma$ (large $\gamma$: components well separated, classification is easy; small $\gamma$: components overlap, classification is hard). The expected node degree $\mathbb{E}[\deg]$ grows with $n$, so we are in the dense regime, in contrast to our later sparse-regime work.
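To make the setup concrete, here is a minimal sampling sketch. The specific parameterization (two orthogonal unit directions scaled by $\gamma$, unit-variance noise, and this particular mean assignment) is an illustrative assumption, not the paper's exact definition.

```python
import numpy as np

def sample_csbm_xor(n, gamma, p, q, d=2, seed=None):
    """Toy sampler for a contextual SBM with XOR-style Gaussian features.

    Classes y = ±1 are balanced; within each class, features come from a
    2-component Gaussian mixture arranged so that no single hyperplane
    separates the two classes.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                    # balanced latent classes

    # Class +1 mixes over ±u, class -1 over ±v, with u ⟂ v: the XOR arrangement.
    u = np.zeros(d); u[0] = 1.0
    v = np.zeros(d); v[1] = 1.0
    comp = rng.choice([-1.0, 1.0], size=n)             # mixture component per node
    means = np.where((y == 1)[:, None], comp[:, None] * u, comp[:, None] * v)
    X = gamma * means + rng.standard_normal((n, d))    # feature signal + noise

    # Stochastic block model: intra-class edges w.p. p, inter-class w.p. q, p > q.
    probs = np.where(y[:, None] == y[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)                # symmetric, no self-loops
    return X, A, y

# Dense-regime example: the expected degree grows with n.
X, A, y = sample_csbm_xor(n=2000, gamma=0.5, p=0.05, q=0.02, seed=0)
print("average degree:", A.sum(axis=1).mean())
```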

What “the architecture works” means

We say a network architecture succeeds on this distribution if it classifies a uniformly random node correctly with probability tending to $1$ as $n \to \infty$. The classification threshold is the smallest feature signal $\gamma$ for which the architecture succeeds. A smaller threshold means a more capable architecture: it tolerates noisier features.

Three architectures, three thresholds

We compare three architectures, each a multi-layer ReLU network:

  1. A bare MLP that ignores the graph.
  2. The same MLP with one graph convolution added.
  3. The same MLP with two graph convolutions added.

Letting $\gamma^{\star}_{\mathrm{MLP}}$ denote the classification threshold for the bare MLP, the main results (formal versions in the paper) read:

$$
\gamma^{\star}_{\mathrm{1\text{-}conv}} \;\lesssim\; \frac{\gamma^{\star}_{\mathrm{MLP}}}{(\mathbb{E}[\deg])^{1/4}},
\qquad
\gamma^{\star}_{\mathrm{2\text{-}conv}} \;\lesssim\; \frac{\gamma^{\star}_{\mathrm{MLP}}}{n^{1/4}}.
$$

In words:

  1. One graph convolution lowers the classification threshold by a factor of $(\mathbb{E}[\deg])^{1/4}$.
  2. Two graph convolutions lower it by a factor of $n^{1/4}$, which is at least as large, since the expected degree is at most $n$.

So two convolutions are much better than one in the dense regime. (Past two, oversmoothing is a well-known issue, so we do not recommend stacking further unless you have a specific reason.)
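To get a feel for the sizes involved, here is a back-of-the-envelope comparison of the two improvement factors on an illustrative dense-regime instance; the specific values of $n$, $p$, $q$ are made up for the example, and $\mathbb{E}[\deg] \approx n(p+q)/2$ uses the balanced-class structure of the model above.

```python
# Illustrative numbers only (not taken from the paper's experiments).
n, p, q = 10_000, 0.05, 0.02
expected_degree = n * (p + q) / 2           # ~n/2 same-class neighbors at rate p, ~n/2 at rate q

one_conv_factor = expected_degree ** 0.25   # threshold divided by (E[deg])^(1/4)
two_conv_factor = n ** 0.25                 # threshold divided by n^(1/4)

print(f"E[deg] ≈ {expected_degree:.0f}")                # ≈ 350
print(f"1 conv : threshold / {one_conv_factor:.1f}")    # ≈ / 4.3
print(f"2 convs: threshold / {two_conv_factor:.1f}")    # = / 10.0
```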

[Figure: the three architectures along the feature-signal axis $\gamma$. The MLP's "fails"/"succeeds" boundary sits at $\gamma^{\star}_{\mathrm{MLP}}$; adding one graph convolution moves it left by a factor of $(\mathbb{E}[\deg])^{1/4}$, and adding two moves it left by a factor of $n^{1/4}$.]

Each graph convolution shifts the classification threshold left along the $\gamma$ axis: one convolution by a factor of $(\mathbb{E}[\deg])^{1/4}$, two convolutions by a much bigger factor of $n^{1/4}$. The wider the green "succeeds" region for an architecture, the noisier the features it can still classify.

Placement does not matter

In an $L$-layer network you have $L$ slots for placing graph convolutions. Should they go at the input? In the middle? Right before the output? The paper shows that asymptotically the threshold result above holds for any placement of the convolutions among the layers. The conclusion: stop agonizing over conv placement; pick whatever is convenient (e.g., conv-then-MLP, or interleaved) and the asymptotic capability is unchanged.
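As a minimal illustration of what "placement" means, here is a sketch of two possible placements in a two-layer ReLU network, using mean aggregation over neighbors as the graph convolution. The normalization ($D^{-1}(A+I)$ with self-loops) and the layer sizes are assumptions made for the example, not necessarily the paper's exact operator.

```python
import numpy as np

def graph_conv(A, H):
    """Mean aggregation over neighbors (with self-loops): D^{-1} (A + I) H."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat @ H) / A_hat.sum(axis=1, keepdims=True)

def relu(Z):
    return np.maximum(Z, 0.0)

def conv_then_mlp(X, A, W1, W2):
    """Placement 1: convolve the raw features, then apply the ReLU network."""
    return relu(graph_conv(A, X) @ W1) @ W2

def mlp_then_conv(X, A, W1, W2):
    """Placement 2: apply the hidden layer first, convolve before the output layer."""
    return graph_conv(A, relu(X @ W1)) @ W2

# Tiny smoke test: both placements produce one output per node;
# asymptotically, the paper says their classification thresholds coincide.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
A = np.triu(rng.random((6, 6)) < 0.5, k=1).astype(float)
A = A + A.T
W1, W2 = rng.standard_normal((2, 8)), rng.standard_normal((8, 1))
print(conv_then_mlp(X, A, W1, W2).shape, mlp_then_conv(X, A, W1, W2).shape)
```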

Empirically

Synthetic CSBM experiments confirm the threshold predictions. As the feature signal $\gamma$ is lowered, the MLP loses accuracy first, then MLP + 1 conv, then MLP + 2 convs, at the predicted scaling factors. Real-data experiments on standard node-classification benchmarks show that the placement-invariance and the 2-conv-beats-1-conv conclusions hold qualitatively. Plots and dataset details are in Section 5 of the paper.

Takeaways

  1. Graph convolutions provably lower the feature-signal requirement for classification, by a quantifiable factor: $(\mathbb{E}[\deg])^{1/4}$ for a single convolution, $n^{1/4}$ for two.
  2. Two convolutions are dramatically better than one in the dense regime. Beyond two, watch out for oversmoothing.
  3. Placement is asymptotically a wash. Don’t agonize over which layer the convolution goes into.

The full statements (with constants), the precise form of the non-linear feature mixture, and the experimental details are in the paper.
