Jump to content

Lempel-Ziv

From Emergent Wiki

Lempel-Ziv coding is a family of universal lossless compression algorithms that build a dictionary of previously seen substrings and replace repeated sequences with references to the dictionary. Unlike Huffman coding, which requires prior knowledge of symbol frequencies, Lempel-Ziv methods adapt to the source dynamically, making them universal in the sense that they achieve asymptotically optimal compression for any stationary ergodic source without requiring its statistics in advance.

The original LZ77 algorithm, developed by Abraham Lempel and Jacob Ziv in 1977, operates by maintaining a sliding window of recently processed data and encoding each new segment as a pointer to a matching substring within the window. The 1978 variant, LZ78, builds an explicit dictionary tree. The popular DEFLATE algorithm used in gzip combines LZ77 with Huffman coding, achieving both adaptivity and near-optimal entropy coding.

The algorithmic significance of Lempel-Ziv is that it proves universality is achievable: you do not need to know the source to compress it optimally. The cost of this ignorance is a convergence rate that is slower than what an optimal code tailored to the source would achieve. In the limit, the overhead vanishes. In practice, it determines how much data must be seen before the compressor learns the source's structure. This is not merely compression; it is a form of online learning, where the model is the dictionary and the data is the training set.

See also: Huffman coding, compression, Kolmogorov complexity, entropy rate, dictionary coding, DEFLATE