Canonical Huffman code
A Huffman code is a prefix code to represent symbols given their frequencies. A canonical Huffman code is a Huffman code that allows a particularly compact way of storing the information needed to decode it because the codes are lexically ordered by length.
Making the code
Given a multiset of the symbols in a text:
- Convert the multiset to a heap.
- Compute a Huffman tree of these words by repeatedly popping the two least frequent nodes from the heap, making a new node whose children are those nodes and whose frequency is the sum of their frequencies, and pushing that symbol back to the heap until only one symbol remains.
- Traverse the tree to find the length for each symbol. Discard the tree.
- Sort the symbols by the code lengths. (Symbols with the same code length may be ordered in any way.) This can be done in O(n) time with an American flag sort: one pass to determine how many codes of each length and one to place each symbol into the bucket for its length.
- Assign codes to symbols in lexical order, producing a canonical Huffman code. For example, code lengths beginning with 2, 3, 3, 3, 4, 4, etc. would get codes 00, 010, 011, 100, 1010, 1011, etc. (See "Encoding" below for how this is done step by step.)
- Store the sorted symbols and the first distinct symbol with each code length,^{[1]} or equivalently the sorted symbols and the number of symbols with each code length.
For example, a table for a subset of letters in English:
Index | Symbol | Code length | Code |
---|---|---|---|
0 | E | 2 | 00 |
1 | T | 3 | 010 |
2 | A | 3 | 011 |
3 | O | 3 | 100 |
4 | I | 4 | 1010 |
5 | N | 4 | 1011 |
6 | S | 4 | 1100 |
7 | H | 5 | 11010 |
8 | R | 5 | 11011 |
etc. |
And the number of codes of each length:
Length | Number of codes |
---|---|
1 | 0 |
2 | 1 |
3 | 3 |
4 | 3 |
5 | 2 |
etc. |
The decoding model for a canonical Huffman code consists of two parts: a table of the number of codes for each length, followed by a symbol table from each index to a symbol.
For example, the code above is specified as 0,1,3,3,2;ETAOINSHR
.
(Other symbols after R are omitted in this example.)
This second table can be omitted if the symbols were ordered by decreasing frequency to begin with, such as if the Huffman code replaces a Golomb code or universal code.
Encoding
In the encoder, the code is first transformed back into a table from symbols to bit strings. This begins by computing a table of first_code_on_row and first_index_on_row for each code length, using an algorithm similar to the decoder below:
- Start on row 0 of the length table, with first_code_on_row = 0.
- Shift first_code_on_row left by 1.
- Write out first_code_on_row and first_index_on_row, and assign sequential codes for the next number_of_codes symbols starting at first_code_on_row.
- Add number_of_codes to first_code_on_row and to first_index_on_row.
- If there are more rows, go to step 2 with the next row of the length table.
With the code 0,1,3,3,2;ETAOINSHR
:
First code on row | First index on row | Number of codes | Symbols |
---|---|---|---|
0 | 0 | 0 | |
00 | 0 | 1 | E |
010 | 1 | 3 | TAO |
1010 | 4 | 3 | INS |
11010 | 7 | 2 | HR |
111000 | 9 | ... |
Decoding
The following pseudocode implements a decoder that works a bit at a time, which is suitable for a computer with small memory or one with a fast cache and slow main memory. The length table for even the biggest codes can easily fit in the L1 data cache, and the only main memory accesses are those to read the bitstream and to retrieve the symbol corresponding to each index.
- Start on row 0 of the length table, with first_code_on_row = 0, accumulator = 0.
- Fetch one bit from the bitstream, rotate it left into accumulator, and shift first_code_on_row left by 1.
- If accumulator - first_code_on_row < number_of_codes, return accumulator - first_code_on_row + first_index_on_row.
- Add number_of_codes to first_code_on_row and to first_index_on_row.
- (Optional) Subtract first_code_on_row from both accumulator and first_code_on_row.
- Go to step 2 with the next row of the length table.
Without step 5, accumulator represents the code bits that have been fetched; this may be easier to understand. With step 5, accumulator represents the fetched bits minus the first code on the row; this eliminates one variable at implementation time.
Example of decoding an S (1100) with the code 0,1,3,3,2;ETAOINSHR
:
Fetched bits | First code on row | Fetched - first | First index on row | Number of codes |
---|---|---|---|---|
1 | 0 | 1 - 0 = 1 | 0 | 0 |
11 | 00 | 3 - 0 = 3 | 0 | 1 |
110 | 010 | 6 - 2 = 4 | 1 | 3 |
1100 | 1010 | 12 - 10 = 2 | 4 | 3 |
After the fourth iteration, (fetched - first_code_on_row) < number_of_codes, so the index is (fetched - first_code_on_row + first_index_on_row) = 12 - 10 + 4 = 6, and the symbol at index 6 is S.
References
- ↑ Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes. New York: Van Nostrand Reinhold, 1994. ISBN 9780442018634. pp. 33-35.