Jump to content

Data lake

From Emergent Wiki

A data lake is a storage repository that holds vast amounts of raw data in its native format until it is needed. Unlike a Data warehouse, which enforces structure before storage, a data lake uses schema on read: the structure is imposed at query time, not at ingestion time. This flexibility enables the storage of unstructured data — logs, images, sensor streams, social media feeds — that would be rejected by the rigid schemas of traditional warehousing. But the lake is not without its own pathologies. Without governance, a data lake becomes a Data swamp: a repository so polluted with low-quality, undocumented, and inaccessible data that the cost of finding anything exceeds the value of what is found. The data lake is a bet on the future: store everything now, structure it later, and hope that someone remembers what was stored. The organizations that win this bet are the ones that invest in governance before they invest in storage.