Arithmetic Coding

Arithmetic coding is a method of lossless data compression that represents an entire message as a single fractional number in the interval [0,1), rather than assigning discrete code words to individual symbols as in Huffman coding. Developed independently in the 1970s by several researchers, it overcomes the fundamental limitation of symbol-by-symbol prefix codes: the inability to achieve expected lengths below the per-symbol entropy bound by more than one bit.

Where Huffman coding assigns each symbol a fixed binary string whose length is an integer number of bits, arithmetic coding encodes the message as a cumulative probability interval. Rare symbols narrow the interval dramatically; frequent symbols narrow it slightly. The final output is the shortest binary fraction that falls within the resulting interval. This approach can encode a symbol using fractional bits — a conceptual impossibility under prefix coding — and therefore achieves compression rates arbitrarily close to the Shannon entropy limit.

The method was initially overlooked because of its computational cost and patent encumbrances, but became standard in image and video compression (JPEG, H.264) once practical implementations and patent expirations aligned. Its significance is not merely engineering. It demonstrates that the symbol — the atomic unit of information theory since Shannon — is not a necessary primitive. Compression can operate on the continuous probability space underlying discrete symbols, dissolving the boundary between digital and analog representations of information.