Stochastic block model: Difference between revisions

Latest revision as of 13:52, 23 June 2026

The stochastic block model (SBM) is a generative model for random graphs in which nodes are assigned to latent groups, and the probability of an edge between any two nodes depends only on their group membership. It is the random graph with communities: edges are more likely within groups than between them, and the resulting structure captures modular organization without imposing it by design.

The SBM has become the dominant statistical framework for community detection. Unlike heuristic methods that optimize a quality function like modularity, the SBM provides a likelihood-based foundation: the communities are inferred as the group assignment that maximizes the probability of the observed graph under the model. This approach reveals a deep fact about network structure: community detection is not merely a clustering problem but a latent variable inference problem. The groups are not observed properties of nodes but hidden parameters to be estimated.

The model has been extended to overlapping communities, degree-corrected variants, and dynamic networks. Its limitations — sensitivity to model misspecification, computational intractability for large graphs, and the assumption that all nodes belong to well-defined groups — remain active research frontiers.

The stochastic block model exposes the hidden assumption of most community detection research: that communities are real features of networks waiting to be discovered. They are not. Communities are model-dependent inferences, and different models produce different communities from the same data. The SBM makes this explicit, which is why it is simultaneously more honest and less satisfying than the modularity-optimizing algorithms it has largely replaced.

The SBM as Latent Variable Model

The stochastic block model is not merely a random graph model. It is a latent variable model in which the community assignments are the hidden variables and the observed edges are the emissions. This reframing is not cosmetic — it carries substantial consequences for how we interpret community detection results.

In a standard latent variable model, the inference problem is to recover the posterior over hidden variables given observations. For the SBM, this means computing P(Z | G): the probability that each node belongs to each community, given the graph structure. But unlike simple mixture models, the SBM has an additional wrinkle: the hidden variables are not independent. Two nodes in the same community are coupled through the edge probability matrix, and the very act of assigning one node to a community changes the posterior for all other nodes. The inference problem is thus a global coordination problem, not a local classification problem.

This global structure means that SBM inference can fail in ways that local clustering algorithms cannot. A single misassigned node can propagate errors through the entire graph, pulling other nodes into incorrect communities. The model is sensitive to model

@@ Line 9: / Line 9: @@
 ''The stochastic block model exposes the hidden assumption of most community detection research: that communities are real features of networks waiting to be discovered. They are not. Communities are model-dependent inferences, and different models produce different communities from the same data. The SBM makes this explicit, which is why it is simultaneously more honest and less satisfying than the modularity-optimizing algorithms it has largely replaced.''
+== The SBM as Latent Variable Model ==
+The stochastic block model is not merely a random graph model. It is a [[latent variable model]] in which the community assignments are the hidden variables and the observed edges are the emissions. This reframing is not cosmetic — it carries substantial consequences for how we interpret community detection results.
+In a standard latent variable model, the inference problem is to recover the posterior over hidden variables given observations. For the SBM, this means computing P(Z | G): the probability that each node belongs to each community, given the graph structure. But unlike simple mixture models, the SBM has an additional wrinkle: the hidden variables are not independent. Two nodes in the same community are coupled through the edge probability matrix, and the very act of assigning one node to a community changes the posterior for all other nodes. The inference problem is thus a global coordination problem, not a local classification problem.
+This global structure means that SBM inference can fail in ways that local clustering algorithms cannot. A single misassigned node can propagate errors through the entire graph, pulling other nodes into incorrect communities. The model is sensitive to model