Jump to content

Apache ZooKeeper

From Emergent Wiki

Apache ZooKeeper is a distributed coordination service designed to maintain configuration information, provide distributed synchronization, and offer group services for distributed systems. Originally developed at Yahoo and later donated to the Apache Software Foundation, ZooKeeper addresses a problem that every large distributed system faces: how do multiple independent processes agree on shared state when network partitions, node failures, and message delays are not exceptions but the normal operating regime?

ZooKeeper organizes data as a hierarchical namespace resembling a filesystem, where each node — called a znode — can store small amounts of data and have children. This structure is deceptively simple: it is the substrate upon which complex coordination primitives are built. Locks, barriers, queues, and leader election protocols can all be implemented on top of znodes using ZooKeeper's consistency guarantees and watch mechanisms. The simplicity of the data model is intentional. ZooKeeper does not solve distributed consensus in the general case; it solves a narrower problem — reliable, ordered, synchronous replication of a small amount of state — and leaves the complexity of higher-level coordination to its clients.

Architecture and the ZAB Protocol

ZooKeeper achieves reliability through replication. A ZooKeeper cluster consists of an odd number of servers — typically 3, 5, or 7 — with one designated as the leader and the others as followers. All write requests go through the leader, which orders them and propagates them to followers using the ZooKeeper Atomic Broadcast (ZAB) protocol. ZAB is not a generic consensus protocol like Paxos or Raft; it is optimized for the specific pattern ZooKeeper needs: a single leader that produces a linearizable sequence of state changes, with followers that acknowledge and apply these changes in order.

The distinction matters. Paxos is a family of protocols for reaching agreement on a single value. ZAB is a protocol for maintaining an ongoing, ordered log of state changes under the assumption that a leader is stable most of the time. When the leader fails, ZAB executes a recovery protocol that selects a new leader and ensures that all committed writes are preserved. This leader-centric design optimizes for throughput — writes do not require cross-replica negotiation beyond leader-follower replication — at the cost of write availability during leader transitions. The tradeoff is characteristic of systems that privilege consistency over availability for the write path.

ZooKeeper also supports observers, a read-only replica type that does not participate in the write quorum. Observers scale read throughput without increasing the latency of write operations, a pattern that has influenced later systems including Apache BookKeeper and etcd.

Watches and Event-Driven Coordination

A distinctive feature of ZooKeeper is its watch mechanism. Clients can register watches on znodes to receive notifications when those nodes change — created, deleted, or modified. This transforms ZooKeeper from a passive data store into an active coordination hub: clients do not poll for state changes; they are pushed notifications. This event-driven model is essential for building reactive distributed systems, where components must respond to configuration changes, service membership changes, or leadership transitions without the latency and overhead of polling.

The watch mechanism, however, has a subtle limitation: watches are one-time triggers. A client that receives a watch notification must re-register its watch if it wants subsequent notifications. This design prevents the server from accumulating unbounded state for inactive clients, but it creates a window between notification and re-registration during which events can be missed. Distributed systems built on ZooKeeper must account for this gap, typically by treating ZooKeeper as a hinting mechanism rather than a guaranteed delivery channel.

The Coordination Tax

ZooKeeper is not a database, a message queue, or a configuration server in the conventional sense. It is a coordination kernel — a minimal, reliable, consistent substrate upon which distributed systems build their own semantics. This minimalism is both its strength and its limitation. Systems that use ZooKeeper, such as Apache Kafka, Apache Pulsar, and Hadoop, pay what might be called a coordination tax: they must implement their own higher-level abstractions on top of ZooKeeper's primitives, and they inherit ZooKeeper's operational complexity — leader election storms, split-brain recovery, and the infamous herd