Entropy coding

Entropy coding is the practice of assigning code words to symbols such that the length of each code word is approximately equal to the symbol's self-information — the negative logarithm of its probability. The goal is to minimize the expected code length, approaching the theoretical lower bound given by Shannon entropy.

Huffman coding and arithmetic coding are the two dominant entropy coding techniques. Huffman coding produces optimal prefix codes for symbol-by-symbol encoding, while arithmetic coding achieves finer granularity by encoding entire sequences. Both assume that symbol probabilities are known or can be estimated accurately; when probabilities are misestimated, the resulting code length exceeds the entropy, a gap known as the redundancy of the code.

Entropy coding is not merely a compression technique but a principle: it formalizes the intuition that rare events deserve longer descriptions than common ones. This principle appears not only in data compression but in biological coding, where the genetic code's degeneracy mirrors an entropy-coding structure.