Huffman coding

Huffman coding is a greedy algorithm for constructing optimal prefix-free codes, developed by David A. Huffman in 1952 while he was a graduate student at MIT. Given a set of symbols and their frequencies, Huffman's algorithm produces a variable-length code that minimizes the expected code length. For sources with known symbol probabilities, the expected length of a Huffman code is within one bit of the Shannon entropy — the theoretical lower bound established by the source coding theorem.

The algorithm works by repeatedly combining the two least probable symbols into a composite symbol, building a binary tree from the leaves upward. More frequent symbols get shorter codes; less frequent symbols get longer codes. The resulting code is prefix-free: no code word is the prefix of any other, which guarantees unambiguous decoding without delimiter symbols.

Huffman coding is optimal among codes that assign integer-length codewords to individual symbols. However, it is not globally optimal: arithmetic coding and range coding can achieve expected lengths arbitrarily close to entropy by encoding entire sequences rather than individual symbols. Huffman coding remains ubiquitous in practice — it appears in DEFLATE (ZIP, gzip), JPEG, MP3, and countless other formats — because its simplicity and speed often outweigh the marginal compression gains of more sophisticated methods.