A Huffman code is a prefix code to represent symbols given their frequencies. A canonical Huffman code is a Huffman code that allows a particularly compact way of storing the information needed to decode it because the codes are lexically ordered by length.
Given a multiset of the symbols in a text:
For example, a table for a subset of letters in English:
Index | Symbol | Code length | Code |
---|---|---|---|
0 | E | 2 | 00 |
1 | T | 3 | 010 |
2 | A | 3 | 011 |
3 | O | 3 | 100 |
4 | I | 4 | 1010 |
5 | N | 4 | 1011 |
6 | S | 4 | 1100 |
7 | H | 5 | 11010 |
8 | R | 5 | 11011 |
etc. |
And the number of codes of each length:
Length | Number of codes |
---|---|
1 | 0 |
2 | 1 |
3 | 3 |
4 | 3 |
5 | 2 |
etc. |
The decoding model for a canonical Huffman code consists of two parts: a table of the number of codes for each length, followed by a symbol table from each index to a symbol.
For example, the code above is specified as 0,1,3,3,2;ETAOINSHR
.
(Other symbols after R are omitted in this example.)
This second table can be omitted if the symbols were ordered by decreasing frequency to begin with, such as if the Huffman code replaces a Golomb code or universal code.
In the encoder, the code is first transformed back into a table from symbols to bit strings. This begins by computing a table of first_code_on_row and first_index_on_row for each code length, using an algorithm similar to the decoder below:
With the code 0,1,3,3,2;ETAOINSHR
:
First code on row | First index on row | Number of codes | Symbols |
---|---|---|---|
0 | 0 | 0 | |
00 | 0 | 1 | E |
010 | 1 | 3 | TAO |
1010 | 4 | 3 | INS |
11010 | 7 | 2 | HR |
111000 | 9 | ... |
The following pseudocode implements a decoder that works a bit at a time, which is suitable for a computer with small memory or one with a fast cache and slow main memory. The length table for even the biggest codes can easily fit in the L1 data cache, and the only main memory accesses are those to read the bitstream and to retrieve the symbol corresponding to each index.
Without step 5, accumulator represents the code bits that have been fetched; this may be easier to understand. With step 5, accumulator represents the fetched bits minus the first code on the row; this eliminates one variable at implementation time.
Example of decoding an S (1100) with the code 0,1,3,3,2;ETAOINSHR
:
Fetched bits | First code on row | Fetched - first | First index on row | Number of codes |
---|---|---|---|---|
1 | 0 | 1 - 0 = 1 | 0 | 0 |
11 | 00 | 3 - 0 = 3 | 0 | 1 |
110 | 010 | 6 - 2 = 4 | 1 | 3 |
1100 | 1010 | 12 - 10 = 2 | 4 | 3 |
After the fourth iteration, (fetched - first_code_on_row) < number_of_codes, so the index is (fetched - first_code_on_row + first_index_on_row) = 12 - 10 + 4 = 6, and the symbol at index 6 is S.