Failure detector

Failure detection is the problem of determining whether a remote process in a distributed system has crashed or is merely slow. It is deceptively difficult because the ambiguity is structural: in an asynchronous network, where message delays have no upper bound, a delayed message from a healthy node is indistinguishable from a message that will never arrive from a crashed node. This is the same ambiguity that drives the FLP impossibility result (Fischer, Lynch, and Paterson), which shows that no deterministic protocol can guarantee consensus in an asynchronous system if even one process may crash.

Failure detectors are typically classified by two properties: completeness (every crashed process is eventually suspected by every correct process) and accuracy (in its strongest form, no correct process is ever suspected). A perfect failure detector, one that is both complete and strongly accurate, cannot be implemented in a purely asynchronous system. Practical failure detectors therefore keep completeness and give up some accuracy, and the timeout setting decides how much: a short timeout suspects crashed processes quickly but may falsely suspect slow ones, while a long timeout reduces false positives at the cost of slower crash detection.
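The trade-off between false suspicion and detection speed can be made concrete with a minimal sketch of a timeout-based detector. The names TimeoutDetector and run_scenario are hypothetical, and time is passed in explicitly rather than read from a clock so that the scenario is deterministic. This is an illustration under those assumptions, not a production design.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDetector:
    """Hypothetical detector: suspects a peer after `timeout` seconds of silence."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from this peer.
        self.last_seen[peer] = now

    def suspects(self, peer: str, now: float) -> bool:
        # A peer that has never been heard from is treated as suspected.
        last = self.last_seen.get(peer)
        return last is None or (now - last) > self.timeout


def run_scenario(timeout: float) -> dict:
    """Both peers send a heartbeat at t=0. "slow" is alive, but its next
    heartbeat is delayed until t=3. "crashed" never sends again."""
    detector = TimeoutDetector(timeout)
    detector.heartbeat("slow", 0.0)
    detector.heartbeat("crashed", 0.0)

    # Observations at t=2, before the delayed heartbeat from "slow" arrives.
    slow_at_2 = detector.suspects("slow", 2.0)
    crashed_at_2 = detector.suspects("crashed", 2.0)

    detector.heartbeat("slow", 3.0)

    # Observation at t=6, long after "crashed" stopped sending.
    crashed_at_6 = detector.suspects("crashed", 6.0)

    return {
        "slow_falsely_suspected_at_t2": slow_at_2,
        "crashed_suspected_at_t2": crashed_at_2,
        "crashed_suspected_at_t6": crashed_at_6,
    }
```

With a 1-second timeout, run_scenario(1.0) suspects the crashed peer by t=2 but also falsely suspects the slow one. With a 5-second timeout, run_scenario(5.0) never suspects the slow peer, yet the crashed peer goes unsuspected at t=2 and is only caught by t=6. Both settings are complete in this scenario. They differ in accuracy and in how long detection takes.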

In systems like Cassandra, failure detection is not a standalone monitoring service but is built on the gossip protocol itself. Each node builds a local suspicion model from heartbeat delays and gossip reports. There is no central authority that declares nodes dead; the cluster's understanding of its own health is distributed, probabilistic, and locally constructed. This design accepts the fundamental truth that a distributed system has no privileged observer. Every node is inside the system, looking at other nodes through the same fallible channels they all share.
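A local suspicion model of this kind can be sketched as an accrual-style detector, which outputs a continuous suspicion level instead of a binary verdict. The sketch below assumes heartbeat inter-arrival times are roughly exponentially distributed, so the probability of still waiting after elapsed seconds is exp(-elapsed / mean) and the suspicion level is phi = -log10 of that probability. The class name AccrualDetector, the window size, and the threshold used in the usage note are illustrative assumptions, not Cassandra's actual code or defaults.

```python
import math
from collections import deque


class AccrualDetector:
    """Hypothetical accrual-style detector for a single remote peer."""

    def __init__(self, window: int = 100):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        # Record the gap since the previous heartbeat, then update the timestamp.
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        # With no interval history there is no basis for suspicion yet.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0.0:
            return 0.0
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

If heartbeats arrive once per second, phi stays below 1 for the first couple of seconds of silence and grows linearly after that. A node might treat the peer as down once phi exceeds a locally chosen threshold, for example 8, which in this sketch corresponds to roughly 18 seconds of silence. Each node computes its own phi from its own observations, so two nodes can legitimately disagree about the same peer.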

The failure detector is often treated as a monitoring component — a sensor that observes the system from outside. This is wrong. A failure detector is a belief formation mechanism, and like all belief formation, it is subject to the constraints of the epistemic environment. In distributed systems, the environment is the network itself, and the network is not a transparent medium. It is an active, unreliable, partially observable system that shapes what any node can know about any other. The failure detector does not solve the ambiguity of asynchrony. It manages it. And the systems that fail are those that forget the difference.