Lossless compression

Lossless compression is a class of data compression algorithms that allow the original data to be perfectly reconstructed from the compressed representation. Unlike lossy compression, no information is discarded — every bit of the original can be recovered. This property makes lossless compression essential for text, executable code, financial records, and any domain where even a single altered bit could be catastrophic.

The theoretical foundation of lossless compression is the source coding theorem, which establishes that the entropy of a source sets a hard lower bound on the average number of bits needed per symbol. No lossless compression algorithm can compress a source below its Shannon entropy on average. This is not an engineering limitation but a mathematical law: entropy measures the irreducible uncertainty of the source, and compression below that uncertainty would require predicting the future.

Foundational Techniques

Lossless compression algorithms exploit two kinds of redundancy in data: statistical redundancy and dictionary redundancy.

Statistical redundancy occurs when some symbols occur more frequently than others. Huffman coding addresses this by assigning shorter codes to more probable symbols, producing a prefix-free code tree that approaches the entropy limit for sources with known symbol probabilities. For sources where symbol probabilities are not known in advance or vary over time, arithmetic coding achieves better compression by encoding the entire message as a single fractional number in the unit interval, allowing the code length to approach the self-information of each symbol with arbitrary precision. Both techniques fall under the umbrella of entropy coding, the practice of representing symbols with code lengths proportional to their negative log-probabilities.

Dictionary redundancy occurs when sequences of symbols repeat within a message. The Lempel-Ziv family of algorithms — including LZ77, LZ78, and Lempel-Ziv-Welch — build adaptive dictionaries of previously seen strings and replace recurring patterns with references to earlier occurrences. These methods require no prior knowledge of source statistics; they learn the structure of the data as they compress it. The DEFLATE algorithm, which powers the ZIP and gzip formats, combines LZ77 dictionary matching with Huffman coding of the resulting tokens, achieving practical compression ratios across a wide range of data types.

Run-length encoding represents another elementary approach: consecutive repeated symbols are replaced by a count-value pair. While trivial, it remains the optimal compression method for sources with long runs of identical symbols — such as black-and-white fax images — and serves as a preprocessing step in more sophisticated pipelines.

The Limits of Losslessness

The requirement of perfect reconstruction imposes strict limits. The Kolmogorov complexity of a string — the length of the shortest program that generates it — is the ultimate theoretical bound on lossless compression. A string is incompressible if its Kolmogorov complexity exceeds its length minus a constant. Most strings are incompressible; compression is only possible because real-world data is not random but structured — it contains patterns, redundancies, and regularities that algorithms can exploit.

This reveals something subtle: lossless compression is not merely a technology but a theory of structure. An algorithm that compresses a file successfully has, in a precise sense, discovered something about the file's internal organization. The compression ratio is a empirical measure of how much order a dataset contains. Random data cannot be compressed; ordered data can. The gap between the two is where lossless compression lives.

The insistence on perfect reconstruction in lossless compression is often treated as a conservative constraint — a limitation we accept for critical data. But this framing misses the deeper point: lossless compression is the only form of compression that can claim to understand its input. Lossy compression guesses at what you will not miss; lossless compression finds what is genuinely redundant. In that sense, lossless compression is closer to science, and lossy compression is closer to marketing.