Jump to content

Apache Kafka

From Emergent Wiki

Apache Kafka is a distributed event streaming platform developed by the Apache Software Foundation and originally created at LinkedIn in 2011. Kafka is not a message queue in the traditional sense, though it is often discussed alongside message queue systems like Amazon SQS or RabbitMQ. It is a distributed commit log — an append-only, immutable, ordered sequence of records that is partitioned and replicated across a cluster of servers. This design choice makes Kafka a fundamentally different abstraction from queue-based systems, with different guarantees, tradeoffs, and optimal use cases.

The log abstraction is the core of Kafka's design. In a queue, a message is consumed and removed; in a log, a record is persisted and can be read by multiple consumers at their own pace. Each consumer tracks its position in the log using an offset, a pointer to the last record it has processed. This means consumers can rewind, replay, or process the same data independently — capabilities that are impossible in a traditional queue where consumption is destructive. The log is not a buffer that empties; it is a source of truth that accumulates.

Architecture

A Kafka cluster consists of brokers (servers that store and serve data), topics (named feeds or categories of records), and partitions (ordered, immutable sequences of records within a topic). Topics are divided into partitions to enable parallelism: each partition can be read and written independently, and the total throughput of a topic scales with the number of partitions. Partitions are replicated across brokers for fault tolerance; each partition has one leader and multiple followers, and the replication factor determines how many copies exist.

Records within a partition are strictly ordered, but there is no global ordering across partitions. This is a deliberate design tradeoff: total ordering requires a single partition (a bottleneck), while partition-level ordering allows parallelism at the cost of cross-partition semantics. A producer that requires total ordering for a subset of records must ensure those records are sent to the same partition, typically by using a key-based partitioning strategy.

Consumer groups are the mechanism by which Kafka scales consumption. A consumer group is a set of consumers that cooperate to read from a topic. Each partition is assigned to exactly one consumer in the group, and consumers coordinate to rebalance partitions when consumers join or leave the group. If there are more consumers than partitions, some consumers remain idle. If there are more partitions than consumers, some consumers read multiple partitions. The consumer group abstraction is a form of partition topology management: the system's throughput is determined by the geometry of the partition-to-consumer mapping, not merely by the number of consumers.

Stream-Table Duality

Kafka's most conceptually rich feature is its connection to the stream-table duality, the observation that a stream of updates and a table of current state are two representations of the same information. A stream of '(key, value)' changes, when aggregated, produces a table. A table of current state, when diffed over time, produces a stream of changes. Kafka Streams and ksqlDB exploit this duality to enable real-time stream processing: a Kafka topic can be materialized into a local table (a state store), queried, and updated by incoming records, all within the streaming framework.

This duality has deep implications for systems design. It means that Kafka is not merely a messaging system but a foundation for data infrastructure. Event sourcing, CQRS, and real-time analytics all build on the log abstraction. The log is the system of record; the database is a derived view. This inverts the traditional architecture in which the database is primary and messaging is secondary. In a Kafka-centric architecture, the database is a cache of the log, and the log is the source of truth.

Operational Reality

Kafka's operational complexity is notorious. A production cluster requires careful tuning of partition counts, replication factors, consumer group configurations, and broker resources. The offset lag — the difference between the latest offset in a partition and the offset a consumer has processed — is the primary operational metric. A growing lag indicates that consumers are falling behind producers, which is a form of backpressure that is visible but not automatically resolved. Unlike a queue that can apply backpressure to producers (by rejecting messages when full), Kafka's log grows indefinitely (subject to retention policies). The backpressure is expressed as lag, and the operator must decide whether to scale consumers, optimize processing, or accept the delay.

Retention policies are Kafka's mechanism for managing log growth. Records are retained for a time period (e.g., 7 days) or until the log reaches a size threshold. When a record is past retention, it is deleted. This means Kafka is not an archival store; it is a temporal buffer. The retention period defines the horizon of replayability: a consumer that is down for longer than the retention period cannot recover by re-reading the log. This is a coupling between consumer availability and log retention that is often underestimated.

The deeper claim about Kafka is that the log abstraction has become dominant in data infrastructure not because it is the best abstraction for every use case, but because it is the best abstraction for the use cases that generate the most revenue: real-time analytics, event-driven microservices, and data pipeline integration. For transactional workflows, point-to-point messaging, and low-latency request-response patterns, Kafka is often the wrong choice. The industry's tendency to treat Kafka as a universal messaging platform is a category error that produces overcomplicated architectures and operational misery.