Optimal message passing on sparse graphs

Most theory on graph neural networks (GNNs) lives in the dense regime, usually on the dense stochastic block model, where each node’s degree grows with the size of the graph. That setting is mathematically convenient (as $n \to \infty$, neighborhoods concentrate), but it doesn’t match what we actually train on: the Open Graph Benchmark datasets have millions of nodes with average degree on the order of tens, and the per-node feature dimension stays in the low hundreds. The interesting regime is sparse: node degrees stay $O(1)$, the feature dimension stays fixed, and only $n$ grows.

In our NeurIPS 2023 paper with Kimon Fountoulakis and Aukosh Jagannath, we ask: what does the optimal node classifier look like in this regime, and is there an architecture that realizes it? The answer turns out to be cleaner than expected, with some surprising consequences for how to design GNNs in practice. This post sketches the main results.

The model

We work with the contextual stochastic block model (CSBM), a latent-class model with both graph and feature signal. Each of the $n$ nodes is assigned one of $C$ classes; conditional on class assignments, edges are placed independently with class-pair probabilities $q_{ij} = b_{ij}/n$, and node features are drawn i.i.d. from class-conditional distributions $P_1, \ldots, P_C$. Two SNR-like parameters drive everything: a feature signal $\gamma$, measuring how informative the features are about a node’s class, and a graph signal $\Gamma$, measuring how informative the graph structure is.

The $1/n$ scaling makes node degrees Poisson with constant mean, which is exactly what makes the regime sparse. The full setup, with the multi-class generalization, is in Section 2 of the paper.
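For concreteness, here is a minimal numpy sketch of sampling a two-class instance of this model with symmetric Gaussian features at $\pm\mu$ (the symmetric case analyzed later in the paper). The function name and the dense $n \times n$ sampling are my own simplifications, not code from the paper.

```python
import numpy as np

def sample_csbm(n, b_in, b_out, mu, sigma=1.0, seed=0):
    """Two-class CSBM: edge probability b_in/n within a class, b_out/n across
    classes; Gaussian features centered at +mu or -mu depending on the class."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                       # latent class labels
    same = y[:, None] == y[None, :]
    probs = np.where(same, b_in / n, b_out / n)          # q_ij = b_ij / n
    upper = np.triu(rng.random((n, n)) < probs, k=1)     # independent edges (O(n^2) memory:
    adj = upper | upper.T                                #  fine for a demo, sample sparsely at scale)
    signs = np.where(y == 1, 1.0, -1.0)[:, None]
    X = signs * mu + rng.normal(0.0, sigma, size=(n, mu.shape[0]))
    return adj, X, y

# Node degrees are roughly Poisson with mean (b_in + b_out) / 2, independent of n.
adj, X, y = sample_csbm(n=2_000, b_in=5.0, b_out=3.0, mu=np.array([1.0, 0.0]))
print(adj.sum(axis=1).mean())   # average degree stays O(1) as n grows
```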

What “optimal” means

The Bayes-optimal classifier, which minimizes 0-1 risk given the entire graph and all features, is intractable to write down: it depends on every edge and every feature vector. So we ask a slightly weaker question.

Fix a node $v$. In the sparse regime, $v$’s $\ell$-hop neighborhood is, with high probability, a tree (a classical fact about locally tree-like sparse random graphs). Define an $\ell$-locally Bayes optimal classifier as one that minimizes 0-1 risk among all classifiers depending only on the $\ell$-hop neighborhood of $v$. Then ask: as $\ell$ and $n$ grow, is there a clean form for the limit?

Yes, and it’s a message-passing architecture.

The optimal architecture (Theorem 1)

[Figure: messages flowing to node $v$ from its tree-like neighborhood; distance-$k$ messages shrink as $\Gamma^k$, aggregate into per-class scores, and the argmax gives $\hat{y}(v)$.]

Every neighbor $u$ at distance $k$ from $v$ contributes a per-distance, per-class log-likelihood message $m^{(k)}_c(\mathbf{X}_u)$. Far-away neighbors carry exponentially less weight ($\Gamma^k$ decay, shown as smaller, dimmer arrows). All messages aggregate into a score for each class; the prediction is the argmax.

The asymptotic $\ell$-locally Bayes optimal predictor at node $v$ is a maximum-a-posteriori classifier over its tree-like neighborhood, schematically:

$$\hat{y}(v) \;=\; \arg\max_{c}\; \sum_{k=0}^{\ell} \sum_{u \in \mathcal{N}_k(v)} m^{(k)}_{c}\!\left(\mathbf{X}_u\right)$$

where $\mathcal{N}_k(v)$ is the set of nodes at exactly distance $k$ from $v$. The message itself has a clean Bayesian shape. For a neighbor $u$, define the feature likelihood vector

$$\rho(\mathbf{X}_u) \;=\; \big(P_1(\mathbf{X}_u),\ P_2(\mathbf{X}_u),\ \ldots,\ P_C(\mathbf{X}_u)\big),$$

whose $j$-th entry is the density of $u$’s features under the hypothesis that $u$ is class $j$. This is what the feature alone says about $u$’s class. Define the class-coupling vector $Q^{(k)}_c \in \mathbb{R}^C$, whose $j$-th entry is the probability that a class-$c$ node has a $k$-step neighbor of class $j$ in the CSBM (computed from the edge probabilities $b_{ij}$ via a $k$-step random-walk transition). This is what the graph alone says about $u$’s class, conditional on $v$ being class $c$. The message is the log of their inner product,

$$m^{(k)}_c(\mathbf{X}_u) \;=\; \log \big\langle \rho(\mathbf{X}_u),\ Q^{(k)}_c \big\rangle \;=\; \log \sum_{j=1}^{C} P_j(\mathbf{X}_u)\, \big[Q^{(k)}_c\big]_j.$$

The dot product is exactly Bayesian marginalization over $u$’s unknown class. We don’t know which class $u$ belongs to, so we sum over all $C$ possibilities, weighting each by (i) the feature evidence that $u$ is class $j$ (that’s $\rho$), and (ii) the structural evidence that $v$ being class $c$ would put a class-$j$ node exactly $k$ steps away (that’s $Q^{(k)}_c$). The result is the likelihood of $u$’s features under the hypothesis “$v$ is class $c$”, with $u$’s own class integrated out. The logarithm then turns the product over independent neighbors (which is what the tree-like factorization gives in the limit) into the additive sum over neighbors that we see in the argmax expression. The full expression, including the multi-class generalization and the precise form of $Q^{(k)}$, is in Section 3.
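To make the aggregation concrete, here is a small numpy sketch of the scoring rule at a single node, assuming the likelihood vectors $\rho$, the hop sets $\mathcal{N}_k(v)$, and the coupling vectors $Q^{(k)}_c$ have already been computed; the names are illustrative, not from the paper’s code.

```python
import numpy as np

def classify_node(rho, hop_sets, Q):
    """rho:      (n, C) array, rho[u, j] = P_j(X_u), the feature likelihoods;
       hop_sets: list of index arrays, hop_sets[k] = nodes at distance exactly k
                 from v (hop_sets[0] = [v] itself, with Q[0] the identity);
       Q:        (L+1, C, C) array, Q[k, c, j] = coupling of class c to a
                 class-j node k steps away."""
    C = rho.shape[1]
    scores = np.zeros(C)
    for k, nodes in enumerate(hop_sets):
        if len(nodes) == 0:
            continue
        inner = rho[nodes] @ Q[k].T          # <rho(X_u), Q_c^{(k)}> for each u, c
        scores += np.log(inner).sum(axis=0)  # messages are additive over neighbors
    return int(np.argmax(scores))            # predicted class of v
```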

Three things matter about the shape of this result:

  1. It is message passing. Neighbors at different distances contribute additively, with their own per-distance parameters.
  2. The class-coupling matrix is learnable. We don’t assume access to the $b_{ij}$’s. In the architecture, we parameterize the coupling tensor via $\mathrm{sigmoid}(Z)$ for a learnable $Z$ and let it fit; a minimal sketch follows this list. A single architecture covers the whole spectrum from “graph is useless” to “graph is everything”.
  3. Distance-$k$ contributions decay as $\Gamma^k$. Far neighbors carry exponentially less information than near ones, a sharp justification for shallow GNNs in the sparse regime.
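Point 2 can be made concrete in a few lines of PyTorch. This is only a sketch of the parameterization described above ($\mathrm{sigmoid}$ of a learnable tensor), not the paper’s implementation; the module name and interface are mine.

```python
import torch
import torch.nn as nn

class LearnableCoupling(nn.Module):
    """Per-hop class-coupling tensor Q = sigmoid(Z); training Z lets the model
    move anywhere between 'ignore the graph' and 'trust the graph'."""
    def __init__(self, num_hops, num_classes):
        super().__init__()
        self.Z = nn.Parameter(torch.zeros(num_hops + 1, num_classes, num_classes))

    def forward(self, rho):
        # rho: (n, C) feature likelihoods; returns (L+1, n, C) per-hop log-messages
        Q = torch.sigmoid(self.Z)                  # entries in (0, 1)
        inner = rho @ Q.transpose(1, 2)            # <rho(X_u), Q_c^{(k)}> for all u, c, k
        return torch.log(inner.clamp_min(1e-12))   # m_c^{(k)}(X_u)
```

These per-hop log-messages would then be summed over each node’s hop sets and the class chosen by argmax, exactly as in the formula above.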

Theorem 1 gives an explicit message-passing architecture and proves it asymptotically realizes this optimum.

Closed-form generalization error (Theorem 2)

For the symmetric two-class Gaussian case, we compute the asymptotic generalization error in closed form, in terms of $(\gamma, \Gamma)$. The key observation is that, on the Galton–Watson tree that the CSBM neighborhood converges to, the distribution of the optimal message can be characterized by a fixed-point recursion on the branching process. This collapses what would otherwise be pages of integrals into a tractable distributional equation. The full statement (and the multi-class extension) is in Section 4.
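The paper’s exact recursion is in Section 4. Purely to illustrate what “a fixed-point recursion on the branching process” means computationally, here is a generic Monte Carlo (population-dynamics) loop for a symmetric two-class instance with scalar $\pm\mu$ Gaussian features. The update below is my own sketch in the style of standard belief-propagation density evolution on the sparse SBM tree, not the paper’s recursion.

```python
import numpy as np

def population_dynamics(b_in, b_out, mu, sigma=1.0, pool=20_000, iters=20, seed=0):
    """Track the distribution of the log-likelihood-ratio message at a node of
    the Galton-Watson tree, conditioned on its class being +. A class-+ node has
    Pois(b_in/2) same-class and Pois(b_out/2) other-class children; by symmetry,
    other-class messages are sign-flipped draws from the same pool."""
    rng = np.random.default_rng(seed)
    h = np.zeros(pool)                                   # current message pool
    for _ in range(iters):
        # feature log-likelihood ratio of a fresh class-+ node: N(2mu^2/s^2, (2mu/s)^2)
        new = rng.normal(2 * mu**2 / sigma**2, 2 * mu / sigma, size=pool)
        k_same = rng.poisson(b_in / 2, size=pool)
        k_diff = rng.poisson(b_out / 2, size=pool)
        for i in range(pool):                            # small pool: illustrative, not fast
            kids = np.concatenate([
                h[rng.integers(0, pool, size=k_same[i])],
                -h[rng.integers(0, pool, size=k_diff[i])],
            ])
            # each child passes log((b_in e^h + b_out) / (b_out e^h + b_in)) up the tree
            new[i] += np.log((b_in * np.exp(kids) + b_out)
                             / (b_out * np.exp(kids) + b_in)).sum()
        h = new
    return float((h < 0).mean())                         # error proxy at the fixed point

# print(population_dynamics(b_in=5.0, b_out=3.0, mu=1.0))
```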

The MLP–GCN interpolation (Theorem 3)

This is the cleanest consequence and, I think, the most useful one in practice.

[Plot: accuracy vs. graph signal $\Gamma$ for the Bayes-optimal architecture, an MLP, and a GCN; at $\Gamma = 0$ the optimum coincides with the MLP, at large $\Gamma$ with the GCN.]

Varying the graph signal $\Gamma$ at fixed feature signal $\gamma$: the Bayes-optimal architecture matches the feature-only MLP at $\Gamma = 0$, approaches the GCN as $\Gamma$ grows, and outperforms both in between.

In other words: the graph becomes useful for node classification exactly when weak community recovery becomes possible, i.e., above the Kesten–Stigum threshold. Below the threshold, no architecture that uses the graph can asymptotically beat one that doesn’t. The CSBM analysis sharpens this from “intuitively reasonable” to a theorem. Section 5 has the precise statement and the boundary cases.

From $n = \infty$ to large finite $n$ (Theorem 4)

Asymptotic statements are clean, but the operational question is “does this predict anything for the $n$ I actually have?” Theorem 4 says: yes. For graphs of size $n$, the $\ell$-hop neighborhoods of a $1 - o(1)$ fraction of nodes are tree-like for $\ell$ up to roughly $\log n / \log d$, where $d$ is the average degree. So the asymptotic architecture is approximately optimal at large finite $n$, with the receptive-field budget growing logarithmically. Another reason shallow networks are not just convenient on sparse graphs: they’re enough.
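For a rough sense of scale (my own back-of-envelope numbers, not from the paper): with $n = 10^6$ nodes and average degree $d = 10$, $\log n / \log d = 6$, so most neighborhoods stay tree-like out to roughly six hops, comfortably more than the two or three hops that the $\Gamma^k$ decay says carry most of the signal.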

Empirically

On synthetic CSBM graphs with $n = 10{,}000$ and $d = 4$, the architecture predicted by the theory smoothly interpolates between a feature-only MLP and a vanilla GCN as $\Gamma$ varies, and outperforms both at intermediate $\Gamma$. At $\Gamma = 0$ it matches the MLP; at $\Gamma = 1$ it matches the GCN. Exactly what Theorem 3 predicts. Plots in Section 6.

Takeaways for GNN design

Distilled to four:

  1. Shallow is usually right on sparse graphs. Distance-$k$ messages decay as $\Gamma^k$. Two or three hops capture most of the available information.
  2. Decouple depth from receptive field. The theory wants the architecture’s number of trainable layers and the number of hops aggregated to be independent knobs, not the same knob the way they are in vanilla GCNs; see the sketch after this list. Stacking layers should not be the only way to see further.
  3. Learn the class-coupling matrix; don’t assume it. A learnable coupling lets one architecture cover the full MLP↔GCN spectrum without per-dataset hand-tuning.
  4. Below Kesten–Stigum, ignore the graph. Not philosophically, but literally. Any classifier that lets the graph influence its prediction has higher asymptotic error than one that doesn’t, when $\Gamma$ is below threshold.
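One way to realize point 2, sketched here in the spirit of architectures that precompute hop aggregations (SIGN-style) rather than as the paper’s own model: the receptive field is fixed in a preprocessing step, and the trainable depth is chosen independently.

```python
import torch
import torch.nn as nn

def hop_features(adj, X, num_hops):
    """Precompute [X, AX, A^2 X, ...]: the receptive field (num_hops) is set
    here, independently of how many trainable layers come afterwards."""
    feats, h = [X], X
    for _ in range(num_hops):
        h = adj @ h                              # one more hop of aggregation (dense or sparse adj)
        feats.append(h)
    return torch.cat(feats, dim=1)               # (n, (num_hops + 1) * d)

class HopMLP(nn.Module):
    """Trainable depth is a separate knob from the number of hops aggregated."""
    def __init__(self, in_dim, hidden, num_classes, depth=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, hop_feats):                # hop_feats = hop_features(adj, X, L)
        return self.net(hop_feats)
```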

The full paper has the proofs, the formal asymptotic statements, the multi-class and general-feature versions, and the longer experimental section. The setup, theorems, and discussion are organized as: model in Section 2, optimal architecture in Section 3, generalization error in Section 4, MLP–GCN interpolation and threshold in Section 5, experiments in Section 6.
