Sparse Computation

Sparse computation refers to computational methods that exploit the structure of problems by performing operations only on the non-zero or activated components of a representation, rather than on every element uniformly. In the context of machine learning, sparse computation encompasses sparse attention mechanisms (where transformers attend to a subset of positions rather than all pairs), mixture-of-experts architectures (where only a subset of model parameters are activated per input), and sparse gradient methods in optimization. The efficiency motivation is straightforward: most computation in large models is performed on elements that contribute negligibly to the output. Sparse computation identifies and skips these elements. The theoretical motivation is deeper: scaling laws derived from dense models may not apply to sparse architectures in the same form, raising the possibility that sparse computation opens an efficiency axis orthogonal to the parameter-compute-data tradeoffs that scaling laws characterize. Whether emergent capabilities in sparse models arise at the same thresholds as in dense models is an unsettled question that bears directly on the alignment implications of the scaling paradigm.