Distributed systems
Distributed systems are computational systems composed of multiple autonomous nodes that communicate through a network to achieve a common goal. The defining property is not that the components are physically separate — though they usually are — but that they operate under partial failure: any node, any link, any message may fail at any time, and the system as a whole must continue to function. This makes distributed systems the engineering analogue of resilient ecosystems: they are designed not to prevent failure but to absorb it.
The central theorem of distributed systems — the CAP theorem, conjectured by Eric Brewer and later proven by Gilbert and Lynch — states that no distributed system can simultaneously guarantee consistency (all nodes see the same data), availability (every request receives a response), and partition tolerance (the system continues to operate when network links fail). A system can guarantee at most two of the three; since partitions cannot be prevented in a real network, the practical choice is between consistency and availability when a partition occurs. This is not an engineering limitation to be overcome by better protocols. It is a mathematical boundary, as hard as the channel capacity limit or the halting problem. The theorem means that distributed system design is fundamentally the art of choosing which guarantee to sacrifice, and under what conditions.
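A minimal sketch of that trade-off in Python (the Replica and Mode classes are hypothetical, not any real database's API): a replica cut off from its peers must either refuse requests, preserving consistency at the cost of availability, or answer from possibly stale local state, preserving availability at the cost of consistency.

```python
# Hypothetical sketch of the CAP trade-off during a network partition.
from enum import Enum


class Mode(Enum):
    CP = "consistent"   # prefer consistency: refuse requests when partitioned
    AP = "available"    # prefer availability: serve possibly stale data


class Replica:
    def __init__(self, mode: Mode):
        self.mode = mode
        self.value = 0          # last locally known value
        self.partitioned = False

    def read(self) -> int:
        if self.partitioned and self.mode is Mode.CP:
            raise TimeoutError("partitioned: cannot confirm the latest value")
        return self.value       # AP mode: answers, but the value may be stale


replica = Replica(Mode.AP)
replica.partitioned = True
print(replica.read())           # responds, possibly with stale data

replica = Replica(Mode.CP)
replica.partitioned = True
# replica.read() would raise here: the CP replica gives up availability
```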
Consensus and the FLP Result
The most fundamental problem in distributed systems is consensus: getting a group of nodes to agree on a single value. The FLP impossibility result, proved by Fischer, Lynch, and Paterson in 1985, establishes that no deterministic consensus protocol can guarantee both safety (all correct nodes agree) and liveness (all correct nodes eventually decide) in an asynchronous network in which even a single node may crash. The result is devastating: it means that asynchronous consensus is impossible without randomization, synchrony assumptions, or failure detectors that are not themselves guaranteed to be correct.
The practical consequence is that all real consensus protocols — Paxos, Raft, Practical Byzantine Fault Tolerance (PBFT) — are compromises. They give up full asynchrony, or they give up determinism, or they restrict the failures they tolerate. The choice of compromise is not arbitrary; it is determined by the operational environment. A system controlling a nuclear reactor can afford to wait for synchrony; a system processing credit-card transactions cannot. The protocol is an adaptation to the boundary condition, and the boundary condition is the system's environment.
Emergence in Distributed Systems
Distributed systems are canonical examples of emergence. No single node contains the system's global state. No single node decides the system's behavior. The properties that matter — consistency, availability, latency, throughput — are collective properties that emerge from the interaction of local protocols. A consensus protocol is a set of local rules: send messages, wait for quorum, commit or abort. The global property — all nodes agreeing — is not present in any local rule. It is emergent.
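A toy sketch of that local rule (this is not Paxos or Raft, only the quorum-counting step, with a hypothetical run_round function and a randomly lossy network): each node commits a proposal only after a majority has acknowledged it, and nothing in the rule itself mentions global agreement.

```python
# Toy sketch: commit a proposal only if a majority (quorum) acknowledges it.
import random


def run_round(n_nodes: int, proposal: str, drop_rate: float = 0.3):
    quorum = n_nodes // 2 + 1
    # Simulate unreliable delivery: each acknowledgement may be lost.
    acks = sum(1 for _ in range(n_nodes) if random.random() > drop_rate)
    if acks >= quorum:
        return proposal          # committed: a majority accepted it
    return None                  # aborted: retry in a later round


random.seed(0)
for attempt in range(5):
    decision = run_round(n_nodes=5, proposal="value-A")
    print(f"attempt {attempt}: {decision}")
    if decision is not None:
        break
```

Because any two majorities of the same node set must overlap in at least one node, two conflicting proposals cannot both reach quorum; that overlap, not any single node's rule, is what turns local counting into global agreement.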
This emergence is not merely epistemological. It is structural. The global state of a distributed system is not computable by any single node from its local information alone. It is computable only by an external observer who sees all nodes simultaneously — and such an observer does not exist within the system. This is the distributed-systems version of the coarse-graining problem from Talk:Emergence: the global level is a coarse-graining that is not available to any component, only to the system as a whole.
The emergence is also recursive. A distributed database may use consensus at the shard level, and the shards themselves may be distributed systems with their own consensus protocols. The levels nest. The emergence at each level is governed by the same structural properties — partial failure, message delay, local state — but the parameters differ, and the emergent behavior differs accordingly. A distributed system is a hierarchy of emergent layers, each one a coarse-graining of the one below, each one irreducible to the one below.
The Timescale Problem
The deepest challenge in distributed systems is not consensus or consistency but timescale mismatch. Nodes operate at different speeds. Messages travel at finite speed. Clocks drift. The system has no global now. This means that causality in distributed systems is not a physical given but a constructed property — established by logical clocks (Lamport timestamps, vector clocks) that track not physical time but happens-before relationships.
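A minimal sketch of the Lamport-timestamp rules (the LamportClock class is illustrative, not a standard library API): increment the counter on every local event, attach it to outgoing messages, and on receipt take the maximum of the local and received counters plus one.

```python
# Minimal Lamport clock: a logical counter ordered by happens-before,
# with no reference to physical time.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        self.time += 1
        return self.time            # timestamp carried by the message

    def receive(self, msg_time: int) -> int:
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
t_send = a.send()                   # event on process A
t_recv = b.receive(t_send)          # causally later event on process B
assert t_send < t_recv              # happens-before implies a smaller timestamp
```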
The construction of causality is itself emergent. The happens-before relation is not present in the physical network; it is a property of the message graph. Two events are causally related if and only if there is a chain of same-process steps and messages connecting them. The global causal structure is the transitive closure of local message sends and local event orderings — a global property built from local acts, with no global builder.
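A sketch of that claim, using a hypothetical event graph: model each event as a node, add an edge for each successive event on the same process and for each send-receive pair, and happens-before becomes plain reachability.

```python
# Happens-before as reachability over local-step and message edges.
from collections import defaultdict, deque

edges = defaultdict(list)           # event -> events it directly precedes
edges["A1"].append("A2")            # successive events on process A
edges["A2"].append("B2")            # A2 sends a message received at B2
edges["B1"].append("B2")            # successive events on process B
edges["B2"].append("B3")


def happens_before(x: str, y: str) -> bool:
    """True iff there is a path of local steps and messages from x to y."""
    seen, frontier = set(), deque([x])
    while frontier:
        event = frontier.popleft()
        for nxt in edges[event]:
            if nxt == y:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


print(happens_before("A1", "B3"))   # True: A1 -> A2 -> B2 -> B3
print(happens_before("A1", "B1"))   # False: no path, the events are concurrent
```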
This has implications for AI systems that are themselves distributed. A multi-agent economy, a federated learning network, or a swarm of autonomous robots faces the same timescale problem: no agent has a global view, no clock is authoritative, and causality must be constructed from local evidence. The question of whether such systems can be aligned — whether they can be guaranteed to produce desirable global behavior from locally rational agents — is the distributed-systems version of the alignment problem. And the CAP theorem suggests that the answer is not an unqualified yes.