Apache Parquet

Apache Parquet is an open-source columnar storage format for structured data, originally developed at Twitter and Cloudera in 2013 and now maintained by the Apache Software Foundation. Parquet is designed for efficient storage and retrieval of complex nested data structures at scale, and it has become the de facto standard for analytics workloads in modern data lakes and data warehouses.

The fundamental design choice of Parquet is columnar orientation: data is stored column by column rather than row by row. In a traditional row-oriented format (such as CSV or JSON), all fields of a record are stored contiguously on disk. In a columnar format, all values of a single field across all records are stored contiguously. This orientation is not an implementation detail; it is a systems design decision that optimizes for the access patterns of analytical workloads, which typically scan a subset of columns across many rows, rather than transactional workloads, which typically access all columns of a single row.
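The difference between the two orientations can be illustrated with a minimal sketch in plain Python. The records and field names (id, country, amount) are hypothetical sample data, and the lists only model the ordering of values, not Parquet's actual on-disk encoding.

```python
# Three records with the same three fields (hypothetical sample data).
records = [
    {"id": 1, "country": "US", "amount": 9.5},
    {"id": 2, "country": "DE", "amount": 3.0},
    {"id": 3, "country": "US", "amount": 7.25},
]

# Row-oriented layout: all fields of one record sit next to each other.
row_layout = [value for rec in records for value in rec.values()]

# Columnar layout: all values of one field sit next to each other.
column_layout = {name: [rec[name] for rec in records] for name in records[0]}

# An analytical query such as "sum of amount" touches one contiguous list.
total_amount = sum(column_layout["amount"])
```

In the row layout the three amount values are separated by unrelated fields, whereas in the columnar layout they form a single run that can be scanned without touching id or country.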

Columnar Storage and Compression

The columnar orientation enables dramatically higher compression ratios than row-oriented formats. Values in a single column tend to be similar — they share the same data type, the same value range, and often the same actual value. Parquet exploits this redundancy through several compression mechanisms:

Dictionary encoding: For columns with low cardinality (few distinct values), Parquet stores a dictionary of unique values and replaces each cell with a compact integer index. This can reduce storage by orders of magnitude for categorical data.
Run-length encoding (RLE): Consecutive identical values are stored as a single value and a repetition count. This is most effective when a column is sorted or clustered, so that equal values form long runs.
Bit-packing: For dictionary-encoded columns with small integer indices, the indices are packed into the minimum number of bits required, rather than full bytes.
General-purpose compression: After encoding, the column data is compressed with algorithms like Snappy, Gzip, or Zstandard, which exploit the sequential redundancy of the encoded column.
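The first three mechanisms above can be sketched in a few lines of Python. This is a simplified illustration, not Parquet's actual byte-level encoding (which combines run-length encoding and bit-packing in a hybrid scheme); the function names and the sample column are assumptions made for the example.

```python
from itertools import groupby

def dictionary_encode(values):
    """Return (dictionary, indices): unique values in first-seen order, plus one index per cell."""
    dictionary, positions, indices = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices

def run_length_encode(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(indices)]

def bit_width(dictionary_size):
    """Bits needed to represent any index into the dictionary (at least 1 in this sketch)."""
    return max(1, (dictionary_size - 1).bit_length())

# Hypothetical low-cardinality column.
column = ["US", "US", "US", "DE", "DE", "FR", "US", "US"]

dictionary, indices = dictionary_encode(column)   # ["US", "DE", "FR"], [0, 0, 0, 1, 1, 2, 0, 0]
runs = run_length_encode(indices)                 # [(0, 3), (1, 2), (2, 1), (0, 2)]
width = bit_width(len(dictionary))                # 2 bits per index instead of a full byte or more

# Decoding reverses the steps: expand the runs, then look up each index.
decoded = [dictionary[v] for v, count in runs for _ in range(count)]
```

Eight strings become a three-entry dictionary and four short runs, and each index fits in two bits; a general-purpose compressor would then be applied to the encoded output.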

The combined effect of these techniques is that Parquet files are often several times smaller than equivalent CSV files, and can be an order of magnitude or more smaller when columns are highly repetitive. They are also usually smaller than row-oriented binary formats like Avro compressed with a comparable codec. The exact ratio depends heavily on the data: its cardinality, its sort order, and the codec chosen. This compression is not merely a storage efficiency feature; it is a performance feature. Smaller files mean less data to read from disk and less data to transfer over the network. The bottleneck in many analytical queries is not CPU computation but I/O bandwidth, and Parquet's compression directly addresses that bottleneck, at the cost of some CPU time spent decoding.

Nested Data and Schema Evolution

Parquet's type system supports complex nested schemas, including structs, lists, and maps. The approach follows Google's Dremel paper, which introduced repetition and definition levels to encode nested data in a flat columnar structure. Repetition levels indicate at what level of nesting a value repeats; definition levels indicate how many of the optional or repeated fields on the path to a value are actually present, which identifies whether a value is null and at what level of nesting the null occurs. These two level types allow Parquet to represent arbitrarily nested JSON-like structures in a flat columnar format without requiring a separate schema per record.
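The idea of repetition and definition levels can be sketched for the simplest nested case. The example assumes a single hypothetical column, tags, declared as an optional list of required strings, so the maximum repetition level is 1 and the maximum definition level is 2. The functions shred and assemble are illustrative names, not part of any Parquet library.

```python
def shred(records):
    """Flatten optional string lists into (repetition, definition, value) triples.

    Definition level: 0 = the list itself is null, 1 = the list is present
    but empty, 2 = an element is present.
    Repetition level: 0 = first entry of a new record, 1 = another element
    of the same list.
    """
    triples = []
    for tags in records:
        if tags is None:
            triples.append((0, 0, None))
        elif not tags:
            triples.append((0, 1, None))
        else:
            for i, tag in enumerate(tags):
                triples.append((0 if i == 0 else 1, 2, tag))
    return triples

def assemble(triples):
    """Rebuild the original records from the triples."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            # A repetition level of 0 always starts a new record.
            if definition == 0:
                records.append(None)
            elif definition == 1:
                records.append([])
            else:
                records.append([value])
        else:
            # A repetition level of 1 continues the current record's list.
            records[-1].append(value)
    return records

records = [["a", "b"], None, [], ["c"]]
triples = shred(records)
# triples == [(0, 2, "a"), (1, 2, "b"), (0, 0, None), (0, 1, None), (0, 2, "c")]
rebuilt = assemble(triples)
```

The flat list of triples is all that needs to be stored for the column: the levels alone distinguish a null list from an empty one and mark where each record begins.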

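Schema evolution is a practical necessity in data lakes, where upstream producers may add new fields, rename existing fields, or change data types over time. An individual Parquet file is immutable and carries its own schema in the file metadata, so evolution happens across the files of a dataset and is resolved by the reader or by a table format layered on top. In practice, new columns can be added without affecting existing columns (older files simply lack them, and readers typically fill in nulls); columns can be dropped from the reader's schema (their data is ignored by readers); and some engines allow the physical type of a column to be widened if the new type is compatible with the old type (e.g., int32 to int64). Renames are harder: a reader that matches columns by name sees a rename as one column removed and another added. Because the schema is stored in the file metadata, readers can read only the columns they understand and ignore the rest. This decouples the producer's schema evolution from the consumer's query logic.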
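A reader that tolerates added columns can be sketched as follows. The dictionaries stand in for two files written before and after a producer added an email column; the structure, the function read_columns, and the choice to fill absent columns with None are assumptions for illustration, since the exact behavior varies between query engines.

```python
def read_columns(file, requested):
    """Project a file onto the requested column names, filling absent columns with None."""
    num_rows = file["num_rows"]
    return {
        name: file["columns"].get(name, [None] * num_rows)
        for name in requested
    }

# Two hypothetical files from the same producer, before and after "email" was added.
old_file = {"num_rows": 2, "columns": {"id": [1, 2], "name": ["ann", "bob"]}}
new_file = {
    "num_rows": 1,
    "columns": {"id": [3], "name": ["cy"], "email": ["cy@example.com"]},
}

# The consumer asks only for the columns it understands; "name" is never read.
requested = ["id", "email"]
batches = [read_columns(f, requested) for f in (old_file, new_file)]
```

The consumer's query is written once against the column names it needs, and files of either vintage satisfy it.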

Row Groups and Predicate Pushdown

A Parquet file is divided into row groups, typically 128MB to 1GB in size, each of which contains a contiguous subset of the rows. Within each row group, columns are stored in separate chunks. Each column chunk has associated metadata, including statistics such as minimum and maximum values and null counts. This metadata enables predicate pushdown: the ability of a query engine to skip entire row groups or column chunks based on filter conditions. If a query selects rows where age > 30, and the metadata for a row group shows that the maximum age in that group is 25, the query engine can skip the entire row group without reading it. The filter in the query thus determines which physical regions of the file are read at all.
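The skipping logic can be sketched with a toy model in which each row group carries (min, max) statistics for a column. The data, the dictionary layout, and the function scan_greater_than are hypothetical; a real engine reads the statistics from the file metadata rather than from in-memory dictionaries.

```python
# Each toy row group stores per-column (min, max) statistics alongside its rows.
row_groups = [
    {"stats": {"age": (18, 25)}, "rows": [{"age": 18}, {"age": 25}, {"age": 21}]},
    {"stats": {"age": (22, 47)}, "rows": [{"age": 22}, {"age": 47}, {"age": 31}]},
    {"stats": {"age": (35, 60)}, "rows": [{"age": 35}, {"age": 60}]},
]

def scan_greater_than(row_groups, column, threshold):
    """Return matching rows and the number of row groups skipped using max statistics."""
    matches, skipped = [], 0
    for group in row_groups:
        _, max_value = group["stats"][column]
        if max_value <= threshold:
            # No row in this group can satisfy column > threshold: skip it unread.
            skipped += 1
            continue
        # Statistics only say a match is possible, so rows are still filtered.
        matches.extend(row for row in group["rows"] if row[column] > threshold)
    return matches, skipped

matches, skipped = scan_greater_than(row_groups, "age", 30)
```

For the filter age > 30, the first group (maximum 25) is skipped outright, the second must be read and filtered because its range straddles the threshold, and every row of the third qualifies.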

Predicate pushdown, together with column pruning, is one of the reasons Parquet is dominant in analytics. In a row-oriented format, every row must be read to evaluate a filter on a single column. In Parquet, only the relevant column chunks are read, and entire row groups can be skipped when their statistics rule out a match. For selective queries on large datasets, the performance difference can be orders of magnitude rather than incremental.

Systems Architecture and Tradeoffs

Parquet is optimized for write-once, read-many workloads. It is not designed for high-frequency random updates or transactional consistency. Writing a Parquet file requires buffering an entire row group in memory, encoding and compressing each column, and then writing the file metadata in a footer at the end of the file. The write path is relatively expensive compared to row-oriented formats. But the read path is optimized for analytical scans: sequential reads of large column chunks, with minimal seeks and maximal cache locality.
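The shape of the write path can be sketched with a toy writer. The class ToyColumnarWriter and its rows_per_group parameter are hypothetical: the sketch flushes by row count and omits encoding and compression, and it should not be read as how any particular Parquet writer is implemented.

```python
class ToyColumnarWriter:
    """Buffers rows, flushes them as per-column chunks, and finalizes metadata last."""

    def __init__(self, column_names, rows_per_group):
        self.column_names = column_names
        self.rows_per_group = rows_per_group
        self.buffer = []   # rows waiting for the current row group to fill
        self.chunks = []   # stands in for the body of the file
        self.footer = []   # per-row-group metadata, emitted on close

    def write_row(self, row):
        self.buffer.append(row)
        if len(self.buffer) == self.rows_per_group:
            self._flush()

    def _flush(self):
        if not self.buffer:
            return
        # Pivot the buffered rows into one contiguous list per column.
        group = {name: [row[name] for row in self.buffer] for name in self.column_names}
        self.chunks.append(group)
        self.footer.append({
            "num_rows": len(self.buffer),
            "stats": {name: (min(vals), max(vals)) for name, vals in group.items()},
        })
        self.buffer = []

    def close(self):
        self._flush()  # flush the final, possibly partial, row group
        return {"row_groups": self.chunks, "footer": self.footer}

writer = ToyColumnarWriter(["id", "age"], rows_per_group=2)
for row in [{"id": 1, "age": 30}, {"id": 2, "age": 19}, {"id": 3, "age": 44}]:
    writer.write_row(row)
written = writer.close()
```

Nothing can be written for a row group until all of its rows have arrived, which is why the writer's memory use scales with the row group size and why appending a single row later is not a cheap operation.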

The format's design reflects a fundamental systems tradeoff: optimize for the dominant access pattern. Analytical workloads are dominated by large scans of historical data with selective filters. Parquet optimizes for this pattern at the cost of write performance and random-access latency. A transactional database that used Parquet as its storage format would be catastrophically slow for point queries and updates. A data lake that used a row-oriented format would be catastrophically slow for analytical scans. The choice of format is a systems design decision that shapes the architecture of everything above it.

Theoretical Connections

Parquet can be understood through the lens of data locality, a principle in computer systems that states that performance is determined by the proximity of data to the computation that uses it. In a row-oriented format, a query that scans one column must read all columns, because columns are interleaved in the row. The data locality is poor: most of the data read is irrelevant to the query. In a columnar format, the data locality is optimal: only the relevant column is read, and the values are stored contiguously, enabling vectorized processing and SIMD instructions. The columnar orientation is an architectural choice that maximizes data locality for the analytical access pattern.
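The locality argument can be made concrete with a small byte-counting sketch. It assumes a hypothetical table of 1,000 rows and four fixed-width 64-bit integer columns, uncompressed, and it treats the bytes a scan must step across as a proxy for I/O; real storage reads in blocks or pages, so the figures are illustrative only.

```python
import struct

NUM_ROWS = 1_000
NUM_COLUMNS = 4
VALUE_SIZE = struct.calcsize("<q")  # 8 bytes per int64 value
TARGET = 2                          # index of the column the query scans

rows = [tuple(r * NUM_COLUMNS + c for c in range(NUM_COLUMNS)) for r in range(NUM_ROWS)]

# Row layout: the four values of each record are packed back to back.
row_bytes = b"".join(struct.pack(f"<{NUM_COLUMNS}q", *row) for row in rows)

# Columnar layout: the values of each column are packed back to back.
column_bytes = [
    struct.pack(f"<{NUM_ROWS}q", *(row[c] for row in rows))
    for c in range(NUM_COLUMNS)
]

# Summing one column from the row layout strides across every record...
row_scan_total = sum(
    struct.unpack_from("<q", row_bytes, r * NUM_COLUMNS * VALUE_SIZE + TARGET * VALUE_SIZE)[0]
    for r in range(NUM_ROWS)
)
# ...while the columnar layout decodes one contiguous buffer.
column_scan_total = sum(struct.unpack(f"<{NUM_ROWS}q", column_bytes[TARGET]))

bytes_spanned_row_layout = len(row_bytes)            # 1,000 rows * 4 columns * 8 bytes
bytes_read_column_layout = len(column_bytes[TARGET]) # 1,000 rows * 8 bytes
```

Both scans produce the same total, but the row layout spreads the needed values across 32,000 bytes while the columnar layout holds them in a single 8,000-byte buffer, a ratio equal to the number of columns.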

The format also connects to information theory. Compression is the reduction of redundancy in a data representation. Columnar storage does not create redundancy; it places similar values next to each other, so that the redundancy already present in the data becomes local and easy for an encoder to exploit. This is not a coincidence: it is a deliberate exploitation of the statistical structure of the data. The same principle underlies run-length encoding in image compression and delta encoding in time-series databases. Parquet's compression is an application of information-theoretic principles to systems engineering.

Limitations and Criticisms

Parquet is not a universal format. Its write-once, read-many model makes it unsuitable for transactional workloads, real-time streaming, and applications requiring fine-grained updates. The Apache Arrow project provides an in-memory columnar format that is complementary to Parquet: Arrow is optimized for in-memory computation and zero-copy sharing between processes, while Parquet is optimized for on-disk storage and compression. The two formats are often used together: data is stored in Parquet on disk and loaded into Arrow in memory for query execution.

A subtler limitation is that Parquet's performance depends heavily on the query pattern. A query that selects all columns (a 'SELECT *' scan) loses most of the benefits of columnar orientation, because every column must be read. In this case, Parquet may be no faster than a row-oriented format, and may be slower due to the overhead of decompression and row reconstruction. Similarly, Parquet is inefficient for point queries that select a single row by key: the row must be reconstructed from scattered column chunks, and unless the data is sorted or clustered on the key, the minimum and maximum statistics are too coarse-grained to skip most row groups. Parquet is designed for a specific class of workloads; using it outside that class is a systems mismatch.
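The coarseness of row-group statistics for point lookups can be illustrated with a toy model. The key values, the group size of 100 rows, and both helper functions are hypothetical; the scattered ordering is produced by a fixed stride so that the example is deterministic.

```python
def row_group_stats(keys, rows_per_group):
    """Split keys into row groups and record each group's (min, max) statistics."""
    groups = [keys[i:i + rows_per_group] for i in range(0, len(keys), rows_per_group)]
    return [(min(g), max(g)) for g in groups]

def groups_to_read(stats, key):
    """Count row groups whose [min, max] range could contain the key."""
    return sum(1 for low, high in stats if low <= key <= high)

# The same 1,000 distinct keys, once in sorted order and once scattered.
keys_sorted = list(range(1_000))
# Striding by 37 (coprime with 1,000) permutes the keys so each group spans a wide range.
keys_scattered = [(i * 37) % 1_000 for i in range(1_000)]

sorted_stats = row_group_stats(keys_sorted, rows_per_group=100)
scattered_stats = row_group_stats(keys_scattered, rows_per_group=100)

sorted_reads = groups_to_read(sorted_stats, 500)        # only one group can hold key 500
scattered_reads = groups_to_read(scattered_stats, 500)  # every group's range covers key 500
```

With sorted keys the statistics narrow the lookup to a single row group; with scattered keys every group's range covers the key, so nothing can be skipped and the lookup degenerates into a full scan of that column.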

The deeper criticism of Parquet — and of columnar formats in general — is that they encode a particular model of data and computation that is becoming dominant through inertia rather than merit. The columnar model assumes that queries are analytical scans, that data is immutable, and that schema evolution is slow. These assumptions are true for many data lakes but increasingly false for real-time analytics, machine learning pipelines, and operational workloads that blur the boundary between analytics and transactions. As the data ecosystem evolves, the dominance of Parquet may become a constraint rather than an enabler, and the next generation of storage formats may need to support both columnar and row-oriented access patterns within the same file.