Fault Injection

From Emergent Wiki

Fault injection is the deliberate introduction of errors, failures, or anomalous conditions into a system to observe its behavior under stress and to verify that its failure-handling mechanisms operate as designed. Unlike testing with synthetic data or simulated loads, fault injection perturbs the actual runtime environment — killing processes, corrupting network packets, exhausting resources, or simulating hardware failures — to discover how the system responds when its assumptions about infrastructure are violated.
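The idea can be illustrated with a minimal, in-process sketch. Everything here is hypothetical and chosen for illustration: `FlakyStore` stands in for a real dependency, `put_with_retry` is the failure-handling mechanism under test, and the failure rate and attempt count are arbitrary. Real tools inject faults at the process, network, or host level rather than inside the program.

```python
import random


class FlakyStore:
    """Hypothetical dependency that fails on purpose at a configurable rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so an experiment is repeatable
        self.data = {}
        self.injected = 0  # number of faults injected so far

    def put(self, key, value):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        self.data[key] = value


def put_with_retry(store, key, value, attempts=5):
    """Failure-handling mechanism under test: retry on connection errors."""
    for _ in range(attempts):
        try:
            store.put(key, value)
            return True  # write acknowledged
        except ConnectionError:
            continue
    return False  # gave up; the caller is told the write failed


def run_experiment(failure_rate, writes=200):
    """Inject faults while writing, then check that no acknowledged write was lost."""
    store = FlakyStore(failure_rate)
    acknowledged = [k for k in range(writes) if put_with_retry(store, k, k * k)]
    lost = [k for k in acknowledged if store.data.get(k) != k * k]
    return {
        "acknowledged": len(acknowledged),
        "lost": len(lost),
        "injected": store.injected,
    }


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.0))  # fault-free baseline
    print(run_experiment(failure_rate=0.3))  # same workload with faults injected
```

The property checked is that every acknowledged write is still present afterwards, whatever the fault rate. Running with `failure_rate=0.0` gives the fault-free baseline to compare against.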

The practice originated in hardware engineering, where physical faults were injected to test circuit resilience, but its modern form is inseparable from distributed systems engineering. Tools like Chaos Monkey, Chaos Mesh, and Gremlin systematize fault injection for cloud infrastructure. Jepsen takes an empirical approach to distributed databases: it runs client operations against a real cluster while injecting faults such as network partitions and clock skew, then checks the recorded history of operations against the consistency guarantees the database claims to provide. The common insight across all these tools is that a system's failure modes cannot be fully predicted from its design documents; they must be empirically discovered by perturbing the system and observing what breaks.

Fault injection is closely related to property-based testing and fuzz testing, but where those methods perturb inputs, fault injection perturbs the environment itself. It is the infrastructure analogue of the controlled experiment: by introducing a known disturbance and comparing the response against an undisturbed baseline, engineers gain causal knowledge about runtime behavior that static analysis alone cannot provide.
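The distinction between perturbing inputs and perturbing the environment can be sketched in a few lines of Python. The functions `parse_config`, `load_config`, and `observe`, the file name `app.json`, and the default values are all hypothetical; the only library facilities used are `json` and the standard `unittest.mock` helpers `patch`, `mock_open`, and `Mock`.

```python
import json
from unittest import mock

DEFAULTS = {"timeout": 30}


def parse_config(text):
    """Hypothetical parser: fall back to defaults on malformed input."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return dict(DEFAULTS)


def load_config(path):
    """Hypothetical loader: fall back to defaults if the file cannot be read."""
    try:
        with open(path) as f:
            return parse_config(f.read())
    except OSError:
        return dict(DEFAULTS)


def observe(fault=None):
    """Run load_config with a healthy file system, or with `fault` injected into open()."""
    healthy = mock.mock_open(read_data='{"timeout": 5}')
    opener = mock.Mock(side_effect=fault) if fault is not None else healthy
    with mock.patch("builtins.open", opener):
        return load_config("app.json")


if __name__ == "__main__":
    # Input perturbation (fuzz-style): the environment is healthy, the data is bad.
    print(parse_config("{not json"))
    # Environment perturbation (fault injection): the data never arrives at all.
    print(observe())  # baseline, no disturbance
    print(observe(OSError("injected: disk unavailable")))  # known disturbance
```

Comparing `observe()` with `observe(OSError(...))` is the controlled-experiment step: the workload is identical and only the injected disturbance differs, so any change in the result can be attributed to the fault.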