Optimal message passing on sparse graphs

Most theory on graph neural networks (GNNs) lives in the dense regime, usually on the dense stochastic block model, where each node’s degree grows with the size of the graph. That setting is mathematically convenient (as $n \to \infty$, neighborhoods concentrate), but it doesn’t match what we actually train on: the Open Graph Benchmark datasets have millions of nodes with average degree on the order of tens, and the per-node feature dimension stays in the low hundreds. The interesting regime is sparse: node degrees stay $O(1)$, the feature dimension stays fixed, and only $n$ grows.

In our NeurIPS 2023 paper with Kimon Fountoulakis and Aukosh Jagannath, we ask: what does the optimal node classifier look like in this regime, and is there an architecture that realizes it? The answer turns out to be cleaner than expected, with some surprising consequences for how to design GNNs in practice. This post sketches the main results.

The model

We work with the contextual stochastic block model (CSBM), a latent-class model with both graph and feature signal. Each of the $n$ nodes is assigned one of $C$ classes; conditional on class assignments, edges are placed independently with class-pair probabilities $q_{ij} = b_{ij}/n$, and node features are drawn i.i.d. from class-conditional distributions $P_1, \ldots, P_C$. Two SNR-like parameters drive everything: a feature signal $\gamma$, measuring how informative the features are about a node’s class, and a graph signal $\Gamma$, measuring how informative the graph structure is.

The $1/n$ scaling makes node degrees Poisson with constant mean, which is exactly what makes the regime sparse. The full setup, with the multi-class generalization, is in Section 2 of the paper.
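For concreteness, here is a minimal numpy sketch of sampling a two-class instance of this model with symmetric Gaussian features at $\pm\mu$ (the symmetric case analyzed later in the paper). The function name and the dense $n \times n$ sampling are my own simplifications, not code from the paper.

```python
import numpy as np

def sample_csbm(n, b_in, b_out, mu, sigma=1.0, seed=0):
    """Two-class CSBM: edge probability b_in/n within a class, b_out/n across
    classes; Gaussian features centered at +mu or -mu depending on the class."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                       # latent class labels
    same = y[:, None] == y[None, :]
    probs = np.where(same, b_in / n, b_out / n)          # q_ij = b_ij / n
    upper = np.triu(rng.random((n, n)) < probs, k=1)     # independent edges (O(n^2) memory:
    adj = upper | upper.T                                #  fine for a demo, sample sparsely at scale)
    signs = np.where(y == 1, 1.0, -1.0)[:, None]
    X = signs * mu + rng.normal(0.0, sigma, size=(n, mu.shape[0]))
    return adj, X, y

# Node degrees are roughly Poisson with mean (b_in + b_out) / 2, independent of n.
adj, X, y = sample_csbm(n=2_000, b_in=5.0, b_out=3.0, mu=np.array([1.0, 0.0]))
print(adj.sum(axis=1).mean())   # average degree stays O(1) as n grows
```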

What “optimal” means

The Bayes-optimal classifier, which minimizes 0-1 risk given the entire graph and all features, is intractable to write down: it depends on every edge and every feature vector. So we ask a slightly weaker question.

Fix a node $v$. In the sparse regime, $v$’s $\ell$-hop neighborhood is, with high probability, a tree (a classical fact about locally tree-like sparse random graphs). Define an $\ell$-locally Bayes optimal classifier as one that minimizes 0-1 risk among all classifiers depending only on the $\ell$-hop neighborhood of $v$. Then ask: as $\ell$ and $n$ grow, is there a clean form for the limit?

Yes, and it’s a message-passing architecture.

The optimal architecture (Theorem 1)

[Figure: messages flowing to node $v$ from its tree-like neighborhood; distance-$k$ messages shrink as $\Gamma^k$, aggregate into per-class scores, and the argmax gives $\hat{y}(v)$.]

Every neighbor $u$ at distance $k$ from $v$ contributes a per-distance, per-class log-likelihood message $m^{(k)}_c(\mathbf{X}_u)$. Far-away neighbors carry exponentially less weight ($\Gamma^k$ decay, shown as smaller, dimmer arrows). All messages aggregate into a score for each class; the prediction is the argmax.

The asymptotic $\ell$-locally Bayes optimal predictor at node $v$ is a maximum-a-posteriori classifier over its tree-like neighborhood, schematically:

$$\hat{y}(v) \;=\; \arg\max_{c}\; \sum_{k=0}^{\ell} \sum_{u \in \mathcal{N}_k(v)} m^{(k)}_{c}\!\left(\mathbf{X}_u\right)$$

where $\mathcal{N}_k(v)$ is the set of nodes at exactly distance $k$ from $v$. The message itself has a clean Bayesian shape. For a neighbor $u$, define the feature likelihood vector

$$\rho(\mathbf{X}_u) \;=\; \big(P_1(\mathbf{X}_u),\ P_2(\mathbf{X}_u),\ \ldots,\ P_C(\mathbf{X}_u)\big),$$

whose $j$-th entry is the density of $u$’s features under the hypothesis that $u$ is class $j$. This is what the feature alone says about $u$’s class. Define the class-coupling vector $Q^{(k)}_c \in \mathbb{R}^C$, whose $j$-th entry is the probability that a class-$c$ node has a $k$-step neighbor of class $j$ in the CSBM (computed from the edge probabilities $b_{ij}$ via a $k$-step random-walk transition). This is what the graph alone says about $u$’s class, conditional on $v$ being class $c$. The message is the log of their inner product,

$$m^{(k)}_c(\mathbf{X}_u) \;=\; \log \big\langle \rho(\mathbf{X}_u),\ Q^{(k)}_c \big\rangle \;=\; \log \sum_{j=1}^{C} P_j(\mathbf{X}_u)\, \big[Q^{(k)}_c\big]_j.$$

The dot product is exactly Bayesian marginalization over $u$’s unknown class. We don’t know which class $u$ belongs to, so we sum over all $C$ possibilities, weighting each by (i) the feature evidence that $u$ is class $j$ (that’s $\rho$), and (ii) the structural evidence that $v$ being class $c$ would put a class-$j$ node exactly $k$ steps away (that’s $Q^{(k)}_c$). The result is the likelihood of $u$’s features under the hypothesis “$v$ is class $c$”, with $u$’s own class integrated out. The logarithm then turns the product over independent neighbors (which is what the tree-like factorization gives in the limit) into the additive sum over neighbors that we see in the argmax expression. The full expression, including the multi-class generalization and the precise form of $Q^{(k)}$, is in Section 3.
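To make the aggregation concrete, here is a small numpy sketch of the scoring rule at a single node, assuming the likelihood vectors $\rho$, the hop sets $\mathcal{N}_k(v)$, and the coupling vectors $Q^{(k)}_c$ have already been computed; the names are illustrative, not from the paper’s code.

```python
import numpy as np

def classify_node(rho, hop_sets, Q):
    """rho:      (n, C) array, rho[u, j] = P_j(X_u), the feature likelihoods;
       hop_sets: list of index arrays, hop_sets[k] = nodes at distance exactly k
                 from v (hop_sets[0] = [v] itself, with Q[0] the identity);
       Q:        (L+1, C, C) array, Q[k, c, j] = coupling of class c to a
                 class-j node k steps away."""
    C = rho.shape[1]
    scores = np.zeros(C)
    for k, nodes in enumerate(hop_sets):
        if len(nodes) == 0:
            continue
        inner = rho[nodes] @ Q[k].T          # <rho(X_u), Q_c^{(k)}> for each u, c
        scores += np.log(inner).sum(axis=0)  # messages are additive over neighbors
    return int(np.argmax(scores))            # predicted class of v
```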

Three things matter about the shape of this result:

  1. It is message passing. Neighbors at different distances contribute additively, with their own per-distance parameters.
  2. The class-coupling matrix is learnable. We don’t assume access to the $b_{ij}$’s. In the architecture, we parameterize the coupling tensor via $\mathrm{sigmoid}(Z)$ for a learnable $Z$ and let it fit; a minimal sketch follows this list. A single architecture covers the whole spectrum from “graph is useless” to “graph is everything”.
  3. Distance-$k$ contributions decay as $\Gamma^k$. Far neighbors carry exponentially less information than near ones, a sharp justification for shallow GNNs in the sparse regime.
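Point 2 can be made concrete in a few lines of PyTorch. This is only a sketch of the parameterization described above ($\mathrm{sigmoid}$ of a learnable tensor), not the paper’s implementation; the module name and interface are mine.

```python
import torch
import torch.nn as nn

class LearnableCoupling(nn.Module):
    """Per-hop class-coupling tensor Q = sigmoid(Z); training Z lets the model
    move anywhere between 'ignore the graph' and 'trust the graph'."""
    def __init__(self, num_hops, num_classes):
        super().__init__()
        self.Z = nn.Parameter(torch.zeros(num_hops + 1, num_classes, num_classes))

    def forward(self, rho):
        # rho: (n, C) feature likelihoods; returns (L+1, n, C) per-hop log-messages
        Q = torch.sigmoid(self.Z)                  # entries in (0, 1)
        inner = rho @ Q.transpose(1, 2)            # <rho(X_u), Q_c^{(k)}> for all u, c, k
        return torch.log(inner.clamp_min(1e-12))   # m_c^{(k)}(X_u)
```

These per-hop log-messages would then be summed over each node’s hop sets and the class chosen by argmax, exactly as in the formula above.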

Theorem 1 gives an explicit message-passing architecture and proves it asymptotically realizes this optimum.

Closed-form generalization error (Theorem 2)

For the symmetric two-class Gaussian case, we compute the asymptotic generalization error in closed form, in terms of $(\gamma, \Gamma)$. The key observation is that, on the Galton–Watson tree that the CSBM neighborhood converges to, the distribution of the optimal message can be characterized by a fixed-point recursion on the branching process. This collapses what would otherwise be pages of integrals into a tractable distributional equation. The full statement (and the multi-class extension) is in Section 4.
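The paper’s exact recursion is in Section 4. Purely to illustrate what “a fixed-point recursion on the branching process” means computationally, here is a generic Monte Carlo (population-dynamics) loop for a symmetric two-class instance with scalar $\pm\mu$ Gaussian features. The update below is my own sketch in the style of standard belief-propagation density evolution on the sparse SBM tree, not the paper’s recursion.

```python
import numpy as np

def population_dynamics(b_in, b_out, mu, sigma=1.0, pool=20_000, iters=20, seed=0):
    """Track the distribution of the log-likelihood-ratio message at a node of
    the Galton-Watson tree, conditioned on its class being +. A class-+ node has
    Pois(b_in/2) same-class and Pois(b_out/2) other-class children; by symmetry,
    other-class messages are sign-flipped draws from the same pool."""
    rng = np.random.default_rng(seed)
    h = np.zeros(pool)                                   # current message pool
    for _ in range(iters):
        # feature log-likelihood ratio of a fresh class-+ node: N(2mu^2/s^2, (2mu/s)^2)
        new = rng.normal(2 * mu**2 / sigma**2, 2 * mu / sigma, size=pool)
        k_same = rng.poisson(b_in / 2, size=pool)
        k_diff = rng.poisson(b_out / 2, size=pool)
        for i in range(pool):                            # small pool: illustrative, not fast
            kids = np.concatenate([
                h[rng.integers(0, pool, size=k_same[i])],
                -h[rng.integers(0, pool, size=k_diff[i])],
            ])
            # each child passes log((b_in e^h + b_out) / (b_out e^h + b_in)) up the tree
            new[i] += np.log((b_in * np.exp(kids) + b_out)
                             / (b_out * np.exp(kids) + b_in)).sum()
        h = new
    return float((h < 0).mean())                         # error proxy at the fixed point

# print(population_dynamics(b_in=5.0, b_out=3.0, mu=1.0))
```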

The MLP–GCN interpolation (Theorem 3)

This is the cleanest consequence and, I think, the most useful one in practice.

[Plot: accuracy vs. graph signal $\Gamma$ for the Bayes-optimal architecture, an MLP, and a GCN; at $\Gamma = 0$ the optimum coincides with the MLP, at large $\Gamma$ with the GCN.]

Varying the graph signal $\Gamma$ at fixed feature signal $\gamma$: the Bayes-optimal architecture matches the feature-only MLP at $\Gamma = 0$, approaches the GCN as $\Gamma$ grows, and outperforms both in between.

In other words: the graph becomes useful for node classification exactly when weak community recovery becomes possible, i.e., above the Kesten–Stigum threshold. Below the threshold, no architecture that uses the graph can asymptotically beat one that doesn’t. The CSBM analysis sharpens this from “intuitively reasonable” to a theorem. Section 5 has the precise statement and the boundary cases.

From $n = \infty$ to large finite $n$ (Theorem 4)

Asymptotic statements are clean, but the operational question is “does this predict anything for the $n$ I actually have?” Theorem 4 says: yes. For graphs of size $n$, the $\ell$-hop neighborhoods of a $1 - o(1)$ fraction of nodes are tree-like for $\ell$ up to roughly $\log n / \log d$, where $d$ is the average degree. So the asymptotic architecture is approximately optimal at large finite $n$, with the receptive-field budget growing logarithmically. Another reason shallow networks are not just convenient on sparse graphs: they’re enough.
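For a rough sense of scale (my own back-of-envelope numbers, not from the paper): with $n = 10^6$ nodes and average degree $d = 10$, $\log n / \log d = 6$, so most neighborhoods stay tree-like out to roughly six hops, comfortably more than the two or three hops that the $\Gamma^k$ decay says carry most of the signal.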

Empirically

On synthetic CSBM graphs with $n = 10{,}000$ and $d = 4$, the architecture predicted by the theory smoothly interpolates between a feature-only MLP and a vanilla GCN as $\Gamma$ varies, and outperforms both at intermediate $\Gamma$. At $\Gamma = 0$ it matches the MLP; at $\Gamma = 1$ it matches the GCN. Exactly what Theorem 3 predicts. Plots in Section 6.

Takeaways for GNN design

Distilled to four:

  1. Shallow is usually right on sparse graphs. Distance-$k$ messages decay as $\Gamma^k$. Two or three hops capture most of the available information.
  2. Decouple depth from receptive field. The theory wants the architecture’s number of trainable layers and the number of hops aggregated to be independent knobs, not the same knob the way they are in vanilla GCNs; see the sketch after this list. Stacking layers should not be the only way to see further.
  3. Learn the class-coupling matrix; don’t assume it. A learnable coupling lets one architecture cover the full MLP↔GCN spectrum without per-dataset hand-tuning.
  4. Below Kesten–Stigum, ignore the graph. Not philosophically, but literally. Any classifier that lets the graph influence its prediction has higher asymptotic error than one that doesn’t, when $\Gamma$ is below threshold.
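One way to realize point 2, sketched here in the spirit of architectures that precompute hop aggregations (SIGN-style) rather than as the paper’s own model: the receptive field is fixed in a preprocessing step, and the trainable depth is chosen independently.

```python
import torch
import torch.nn as nn

def hop_features(adj, X, num_hops):
    """Precompute [X, AX, A^2 X, ...]: the receptive field (num_hops) is set
    here, independently of how many trainable layers come afterwards."""
    feats, h = [X], X
    for _ in range(num_hops):
        h = adj @ h                              # one more hop of aggregation (dense or sparse adj)
        feats.append(h)
    return torch.cat(feats, dim=1)               # (n, (num_hops + 1) * d)

class HopMLP(nn.Module):
    """Trainable depth is a separate knob from the number of hops aggregated."""
    def __init__(self, in_dim, hidden, num_classes, depth=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, hop_feats):                # hop_feats = hop_features(adj, X, L)
        return self.net(hop_feats)
```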

The full paper has the proofs, the formal asymptotic statements, the multi-class and general-feature versions, and the longer experimental section. The setup, theorems, and discussion are organized as: model in Section 2, optimal architecture in Section 3, generalization error in Section 4, MLP–GCN interpolation and threshold in Section 5, experiments in Section 6.
