Streaming data

Streaming data refers to data that is generated continuously and incrementally, rather than in discrete batches. A streaming data source produces records in real time or near-real time — stock prices, sensor readings, server logs, social media posts, click events — and the data never arrives as a complete, bounded dataset. The stream is conceptually infinite: there is a beginning (when the system starts), but no guaranteed end.

The distinction between streaming and batch data is not merely a technical one; it is an ontological difference in how we model the world. Batch processing assumes the data is complete before analysis begins. Stream processing assumes the data is always incomplete, and analysis must produce approximate or incremental results that improve as more data arrives. This shift from completeness to timeliness changes every downstream assumption: query semantics, error handling, state management, and even what it means for a result to be "correct."
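
As a rough illustration of an incremental result, the sketch below keeps a running mean that is updated once per record, so an estimate is available at every point in the stream without waiting for the data to be complete. The function name running_mean and the sample readings are made up for this example; a Python list stands in for a source that would normally never end.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one estimate per record."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier records are never stored or re-scanned.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings; a real stream would be unbounded.
readings = [10.0, 20.0, 30.0, 40.0]
estimates = list(running_mean(readings))
print(estimates)  # [10.0, 15.0, 20.0, 25.0]
```

A batch job would report only the final value, 25.0, after all four readings had arrived. The streaming version reports a usable estimate after each one, and only the count and the current mean are held in memory.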

Streaming data is the native data type of event-driven architecture, the fuel for real-time analytics, and the substrate of complex event processing. It is also the source of some of the hardest problems in distributed systems: exactly-once semantics, late-arriving data, clock skew across producers, and the reconciliation of streaming results with batch ground truth. That last problem is the one the Lambda architecture was designed to address, and platforms such as Apache Kafka, Apache Flink, and Amazon Kinesis each approach it in different ways.
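
To make the late-arriving data problem concrete, here is a simplified sketch of event-time windowing with a watermark. It is not the mechanism of any particular platform. The window size, the allowed lateness, the function name count_by_window, and the sample events are all assumptions chosen for illustration. Events are counted in fixed 10-second windows by their own timestamps; a window is finalized once the watermark passes its end, and any event that arrives for an already finalized window is set aside as late.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 10      # seconds per tumbling window (assumed value)
ALLOWED_LATENESS = 5  # seconds the watermark trails the newest event (assumed value)

Event = Tuple[int, str]  # (event time in seconds, payload)


def count_by_window(
    events: Iterable[Event],
) -> Tuple[Dict[int, int], Dict[int, int], List[Event]]:
    """Return (finalized counts, still-open counts, events dropped as too late)."""
    open_windows: Dict[int, int] = defaultdict(int)
    finalized: Dict[int, int] = {}
    dropped: List[Event] = []
    watermark = float("-inf")

    for event_time, payload in events:
        # The watermark only moves forward, trailing the newest event time seen.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE

        if window_start + WINDOW_SIZE <= watermark:
            # The window this event belongs to has already been finalized.
            dropped.append((event_time, payload))
        else:
            open_windows[window_start] += 1

        # Finalize every open window whose end the watermark has passed.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                finalized[start] = open_windows.pop(start)

    return finalized, dict(open_windows), dropped


# Hypothetical arrival order: (8, "d") is out of order but within the allowed
# lateness; (3, "f") arrives after its window has been finalized.
arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (17, "e"), (3, "f"), (21, "g")]
finalized, still_open, dropped = count_by_window(arrivals)
print(finalized)   # {0: 3}
print(still_open)  # {10: 2, 20: 1}
print(dropped)     # [(3, 'f')]
```

The trade-off sits in ALLOWED_LATENESS: a larger value admits more stragglers but delays every result, while a smaller value produces answers sooner and discards more data. Dropping late events is only one possible policy; a system could instead emit a correction to the earlier result.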

The streaming paradigm does not merely change how fast we process data. It changes what we believe data is: not a static resource to be mined, but a living process to be observed.