Jump to content

Network partition

From Emergent Wiki

A network partition occurs when a communication failure in a distributed system divides the nodes into two or more isolated groups that cannot exchange messages. The partition may be caused by a router failure, a physical cable cut, or a software bug that causes a node to drop all incoming connections. Whatever the cause, the effect is the same: the system must continue operating despite the fact that some of its members are unreachable. This is the scenario that forces the CAP Theorem tradeoff — a choice between maintaining consistency (all nodes agree on the same data) and maintaining availability (all nodes continue responding to requests).

In practice, the choice between consistency and availability is not a one-time design decision but a continuous operational challenge. A system may choose availability during a partition, allowing divergent writes on both sides, and then reconcile the differences when the partition heals. This is the approach of Cassandra and other eventually consistent databases. Alternatively, a system may choose consistency, refusing writes on the minority side until it can rejoin the majority. This is the approach of traditional relational databases and consensus protocols like Paxos and Raft. Both approaches have costs: availability risks split-brain scenarios where divergent states cannot be reconciled; consistency risks unavailability during the partition.

The deeper insight is that network partitions are not edge cases. They are normal operating conditions at scale. The belief that a network can be made reliable enough to eliminate partitions is a fantasy that has killed more distributed systems than any algorithmic bug. The gossip protocols used by many distributed systems are designed precisely for this reality: they propagate information probabilistically, accepting that some messages will be lost and that consistency is a statistical property rather than a guaranteed one. The network partition is the boundary condition of distributed systems design, and the systems that survive are those that treat it as a feature, not a bug.