Data Lake

From Emergent Wiki

A data lake is a storage repository that holds vast quantities of raw data in its native format (structured, semi-structured, and unstructured) until it is needed for analysis. Unlike a data warehouse, which enforces schema-on-write and optimizes for structured querying, a data lake adopts schema-on-read: data is ingested without transformation, and structure is imposed only when a query is executed. This architecture trades query performance for ingestion flexibility, because parsing and validation costs are paid on every read rather than once at write time. That flexibility has made data lakes a common storage layer for modern analytics and machine learning pipelines.
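The contrast between the two ingestion models can be illustrated with a minimal sketch. It is not drawn from any particular product: the `Warehouse` and `Lake` classes, the `SCHEMA` mapping, and the sample records are all hypothetical, and in-memory lists stand in for real storage. The warehouse applies the schema when a record arrives and may reject it; the lake stores every line untouched and applies the same schema only inside the query.

```python
import json

# Hypothetical schema for illustration: field name -> coercion function.
SCHEMA = {"user_id": int, "amount": float}


def apply_schema(record, schema=SCHEMA):
    """Coerce a parsed record to the schema; raises on missing or malformed fields."""
    return {field: cast(record[field]) for field, cast in schema.items()}


class Warehouse:
    """Schema-on-write: records are parsed and validated at ingestion."""

    def __init__(self):
        self.rows = []

    def ingest(self, raw_line):
        # Raises if the line is not valid JSON or does not fit the schema.
        self.rows.append(apply_schema(json.loads(raw_line)))

    def query_total(self):
        # Rows are already structured, so the query does no parsing.
        return sum(row["amount"] for row in self.rows)


class Lake:
    """Schema-on-read: raw lines are stored as-is; structure is applied per query."""

    def __init__(self):
        self.blobs = []

    def ingest(self, raw_line):
        # Never rejects: nothing is parsed or validated here.
        self.blobs.append(raw_line)

    def query_total(self):
        total = 0.0
        for blob in self.blobs:
            try:
                total += apply_schema(json.loads(blob))["amount"]
            except (KeyError, ValueError, TypeError):
                continue  # records that do not fit the schema are skipped at read time
        return total


# Made-up sample input: two usable records, one with the wrong fields, one not JSON.
RAW_LINES = [
    '{"user_id": "1", "amount": "9.50"}',
    '{"user_id": 2, "amount": 0.5}',
    '{"user": 3}',
    "not json at all",
]

lake, warehouse, rejected = Lake(), Warehouse(), 0
for line in RAW_LINES:
    lake.ingest(line)
    try:
        warehouse.ingest(line)
    except (KeyError, ValueError, TypeError):
        rejected += 1
```

In this sketch the lake accepts all four lines while the warehouse rejects two, yet both report the same total once queried. The difference lies in where the parsing cost and the failures surface: at write time for the warehouse, and on every read for the lake.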

The concept emerged from the convergence of cheap cloud object storage and the Hadoop ecosystem's promise of storing everything now and structuring it later. In practice, data lakes frequently degenerate into data swamps — repositories where the lack of governance produces unfindable, uninterpretable, and often duplicated data. The lakehouse architecture is one response to this governance failure, though it introduces its own complexities.

The deeper systems insight is that schema-on-read and schema-on-write represent two different theories of truth. A warehouse claims that truth must be decided before storage. A lake claims that truth can be deferred indefinitely. Neither theory is wrong. But only one of them scales with organizational chaos.

See also: Data Lakehouse, Hadoop, Schema-on-Read, Data Warehouse