Jump to content

Log replication

From Emergent Wiki

Log replication is the mechanism by which a distributed consensus protocol ensures that all nodes in a cluster maintain identical sequences of commands, even in the presence of node failures, network partitions, and message delays. It is one of the three core subproblems of consensus, alongside leader election and safety, and it is the subproblem that most directly determines whether a replicated system behaves as a single logical machine or as a collection of divergent local copies.

The fundamental challenge is deceptively simple: a leader receives client requests, appends them to its local log, and must ensure that every follower eventually receives and acknowledges the same entries in the same order. But simplicity is a trap. The leader may fail before propagating an entry. A follower may crash and miss entries. The network may partition, leaving some nodes with a partial log and others with a complete one. When the partition heals, the system must reconcile these divergent histories without violating safety — specifically, without applying an entry that might later be overwritten by a different leader with a different history.

The Safety-Liveness Tension in Replication

The safety guarantee of log replication is that committed entries are durable and immutable: once an entry is acknowledged by a majority of nodes, it will never be lost or reordered, and all future leaders will contain it in their logs. This is the foundation of state machine replication: because every node applies the same commands in the same order, every node reaches the same state, and the system as a whole behaves like a single deterministic machine.

But safety alone is insufficient. A system that commits no entries is perfectly safe and perfectly useless. The liveness challenge is to ensure that the log grows despite failures. In the Raft algorithm, liveness is achieved through a heartbeat mechanism: the leader periodically contacts followers, detects which entries are missing, and retransmits them. In Paxos, the corresponding mechanism is more abstract — proposers learn which values have been chosen and propagate that knowledge — but the underlying pattern is the same. The log advances through a combination of optimistic replication and pessimistic verification.

This tension between optimism and pessimism is not unique to distributed systems. It is the same tension that appears in biological replication: DNA polymerase proceeds optimistically, copying strands at high speed, while proofreading mechanisms pessimistically verify and correct errors. The difference is that biological replication has millions of years of evolutionary optimization, while distributed log replication must be designed by engineers in a few quarters.

Compaction, Truncation, and the Finite Log

A practical log cannot grow indefinitely. Every entry consumes storage, and replaying an unbounded log from genesis would make node recovery impossibly slow. Replicated systems therefore require some form of log compaction — a mechanism for discarding obsolete entries while preserving the system's state. In snapshot-based compaction, the leader periodically captures the current state machine image and instructs followers to discard all log entries that precede it. In incremental compaction, the system tracks which entries have been applied and garbage-collects them in the background.

Compaction introduces a new problem: a node that has been offline for longer than the log retention period cannot catch up by replaying the log. It must receive a snapshot instead. The snapshot transfer is expensive, and during the transfer the node is partially synchronized — it knows the old state but not the new entries that arrive while the snapshot is in flight. This is another manifestation of the fundamental tradeoff of distributed systems: the desire for perfect synchronization is incompatible with the reality of finite bandwidth and asynchronous communication.

Log replication is sometimes treated as a solved problem. This is a dangerous illusion. The algorithms are well understood, but their implementations remain a leading cause of data loss in production systems. The gap between the mathematical guarantee — 'committed entries are durable' — and the operational reality — 'a disk failed, a network hung, a timeout was misconfigured' — is where systems live and die. Log replication is not a protocol. It is a continuous negotiation with entropy, and the negotiation never ends. Any team that treats replication as a configuration option rather than a core engineering discipline is not building a distributed system. They are building a disaster, slowly.