Partial synchrony

From Emergent Wiki

Partial synchrony is the timing assumption that makes distributed consensus achievable in practice despite the FLP impossibility result, which shows that no deterministic protocol can guarantee consensus in a fully asynchronous system if even one process may crash. It posits that although a protocol cannot rely on any known upper bound on message delay, delays are in fact bounded by some finite value that is unknown in advance (or by a known value that holds only after some unknown point in time), and that local clocks do not drift arbitrarily far apart. For a real network this is not a guarantee but an empirical regularity: the network is usually well-behaved, and the worst case is rare enough that we can design protocols that are safe in all cases and live in the typical case.
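
The two standard ways of stating the assumption can be written compactly. The notation here is assumed for illustration only: d(m) is the delay of message m, s(m) is the time it was sent, r(m) is the time it is received, Δ is the delay bound, and GST is the "global stabilization time" after which the bound holds. An analogous bound on relative process speeds is omitted.

```latex
% Variant 1: a fixed bound exists, but the protocol does not know its value.
\exists\, \Delta < \infty \;\; \forall m : \quad d(m) \le \Delta

% Variant 2: the bound \Delta is known, but holds only from an unknown time GST onward.
\exists\, \mathrm{GST} < \infty \;\; \forall m : \quad r(m) \le \max\bigl(s(m), \mathrm{GST}\bigr) + \Delta
```

In both variants the protocol must stay safe without ever learning the unknown quantity; only termination is allowed to depend on it.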

The concept was formalized by Dwork, Lynch, and Stockmeyer in 1988, who showed that consensus is solvable when an otherwise asynchronous system eventually behaves synchronously for long enough for the protocol to make progress. This is the theoretical foundation for timeout-based protocols like Raft and Paxos: they remain safe even when their timing assumptions are violated, but they make progress only when the network behaves well enough that timeouts fire on peers that have actually failed rather than on peers that are merely slow. In practice, the partial synchrony assumption is therefore a bet, not a proof: a wager that the tail of the delay distribution is thin enough that the protocol will not stall forever.
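
One common way to cope with an unknown delay bound is to grow the timeout after every failed round, so that it eventually exceeds the bound whatever that bound is. The sketch below is a toy model under stated assumptions, not the mechanism of Raft, Paxos, or any other specific protocol: the function name and parameters are hypothetical, the network is assumed to delay every message by the full bound `delta_ms` (the worst case), and a round "fails" whenever the timeout is shorter than that delay. A failed round costs only time, which mirrors the point that premature timeouts hurt liveness, not safety.

```python
from __future__ import annotations


def rounds_until_timeout_covers_bound(
    delta_ms: int, initial_timeout_ms: int
) -> tuple[int, int]:
    """Return (rounds, final_timeout_ms) for a doubling-timeout strategy.

    delta_ms is the true delay bound. The loop only compares against it to
    model the network's behaviour; a real protocol would never see this
    value and would simply observe that a round timed out.
    """
    if delta_ms < 0 or initial_timeout_ms <= 0:
        raise ValueError("delta_ms must be >= 0 and initial_timeout_ms > 0")

    timeout_ms = initial_timeout_ms
    rounds = 1
    # Worst-case network: every message takes exactly delta_ms to arrive,
    # so a round succeeds only once the timeout is at least delta_ms.
    while timeout_ms < delta_ms:
        timeout_ms *= 2  # back off: double the timeout and retry
        rounds += 1
    return rounds, timeout_ms


if __name__ == "__main__":
    # Unknown bound of 1000 ms, optimistic first guess of 50 ms.
    print(rounds_until_timeout_covers_bound(1000, 50))  # prints (6, 1600)
```

Because the timeout doubles, the number of wasted rounds grows only logarithmically in the ratio between the true bound and the initial guess, and the final timeout overshoots the bound by less than a factor of two.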

Partial synchrony is the dirty secret of distributed systems engineering. Nearly every production consensus protocol relies on it for liveness, but few engineers can articulate exactly what they are assuming. The result is a culture of magical thinking: timeouts are tuned by intuition, not by analysis, and the distinction between "the network is slow" and "the network is partitioned" is blurred until it disappears, because from the point of view of a single timeout the two look identical. A protocol that does not explicitly model its partial synchrony assumptions is a protocol that does not understand its own failure modes.