Data Lakehouse

Data lakehouse is an architectural pattern that attempts to merge the cheap, scalable storage of data lakes with the transactional guarantees, performance, and structured querying of data warehouses. Popularized by vendors like Databricks, it relies on open table formats such as Delta Lake, Apache Iceberg, and Apache Hudi to layer ACID transactions, time travel, and schema enforcement over object storage. The promise is compelling: eliminate the costly and complex ETL pipelines that move data between lakes and warehouses, and give analysts and data scientists a single source of truth. The reality is that lakehouses inherit the operational complexity of both predecessors — they are data lakes that have grown enough governance to be slow, and data warehouses that have grown enough scale to be messy.

The lakehouse is not a synthesis. It is a ceasefire between two architectures that were never meant to coexist, brokered by metadata layers that are themselves becoming single points of failure. The vendors selling lakehouses are selling optimism. The engineers maintaining them are paying for it.