Apache Spark

From Emergent Wiki

Apache Spark is an open-source unified analytics engine for large-scale data processing, originally developed at UC Berkeley's AMPLab and written primarily in Scala. It introduced the Resilient Distributed Dataset (RDD) abstraction, which enables fault-tolerant distributed computation by treating data as immutable, partitioned collections that can be transformed through functional operations like map, filter, and reduce.
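The RDD idea can be illustrated with a toy sketch in plain Python. This is not Spark's API or implementation: the names `ToyRDD` and `compute_partition` are hypothetical, and everything runs in a single process. The sketch assumes only what the paragraph above states, namely that the data is immutable and partitioned, that transformations produce new datasets, and that fault tolerance comes from being able to recompute a partition from the recorded chain of transformations (its lineage).

```python
from functools import reduce


class ToyRDD:
    """Toy model of an RDD: immutable partitions plus recorded lineage.

    Hypothetical illustration only; not Spark's implementation or API.
    """

    def __init__(self, partitions=None, parent=None, op=None):
        # Source data is frozen into tuples so partitions cannot be mutated.
        self._partitions = (
            None if partitions is None else tuple(tuple(p) for p in partitions)
        )
        self._parent = parent  # lineage: the dataset this one derives from
        self._op = op          # lineage: function applied to each parent partition

    @property
    def num_partitions(self):
        if self._parent is not None:
            return self._parent.num_partitions
        return len(self._partitions)

    def map(self, f):
        # Returns a new dataset; nothing is computed until an action runs.
        return ToyRDD(parent=self, op=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda part: [x for x in part if pred(x)])

    def compute_partition(self, i):
        """Compute partition i from lineage; safe to re-run if a result is lost."""
        if self._parent is None:
            return list(self._partitions[i])
        return self._op(self._parent.compute_partition(i))

    def collect(self):
        return [
            x for i in range(self.num_partitions) for x in self.compute_partition(i)
        ]

    def reduce(self, f):
        # Reduce each non-empty partition, then combine the partial results.
        # Raises TypeError if every partition is empty.
        parts = (self.compute_partition(i) for i in range(self.num_partitions))
        partials = [reduce(f, part) for part in parts if part]
        return reduce(f, partials)


numbers = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda a, b: a + b)
print(evens_squared.collect(), total)
```

Because `evens_squared` stores only its parent and the function to apply, any single partition can be rebuilt by calling `compute_partition` again, which is the intuition behind lineage-based recovery.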

Spark's design draws on Scala's functional collections and static type system to express distributed transformations concisely, with many errors caught at compile time. Where earlier frameworks such as Apache Hadoop MapReduce required programmers to structure each computation as explicit, low-level map and reduce jobs, Spark raised the abstraction to functional transformations on distributed datasets. This design choice, expressing distributed computation in a functional style, has become the dominant paradigm in modern data engineering.
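The difference in abstraction level can be sketched with a word count written two ways in plain Python. Both versions are in-memory illustrations under stated assumptions: the first spells out separate map, shuffle, and reduce phases in the spirit of a MapReduce job, and the second chains transformations on a hypothetical `Dataset` class. Neither uses the real Hadoop or Spark APIs, and the helper names (`mapper`, `shuffle`, `reducer`, `flat_map`, `reduce_by_key`) are chosen for illustration.

```python
from collections import defaultdict

lines = ["spark raises the abstraction", "spark builds on scala"]


# Style 1: the programmer writes and wires together each phase explicitly.
def mapper(line):
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    return key, sum(values)


mapped = [pair for line in lines for pair in mapper(line)]
counts_phases = dict(reducer(k, v) for k, v in shuffle(mapped).items())


# Style 2: the same computation as a chain of functional transformations.
class Dataset:
    """Hypothetical in-memory stand-in for a distributed dataset."""

    def __init__(self, items):
        self._items = tuple(items)  # immutable snapshot of the elements

    def flat_map(self, f):
        return Dataset(y for x in self._items for y in f(x))

    def map(self, f):
        return Dataset(f(x) for x in self._items)

    def reduce_by_key(self, f):
        # Merge the values of (key, value) pairs that share a key.
        merged = {}
        for key, value in self._items:
            merged[key] = f(merged[key], value) if key in merged else value
        return Dataset(merged.items())

    def collect(self):
        return list(self._items)


counts_chained = dict(
    Dataset(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)
```

Both versions produce the same counts; the chained form leaves the grouping step implicit in `reduce_by_key` instead of asking the programmer to manage it.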