You can probably guess that I’m a fan of Bit Twiddling Hacks. I like to think of bitwise operations as a very limited parallel processor, with no branching. Or possibly as turning all machine words sideways.

**Huffman compression** is a way to encode fixed-size symbols (typically bytes) as variable-length bit strings, with shorter codes assigned to the more frequent symbols in your data stream; hence the compression.

A special flavour of Huffman code is an **ordered** one: codes are assigned such that encoded byte-strings are lexically ordered the same way as the unencoded sources. That’s very useful for compressed indexes of long strings, for example. There is a cost: an ordered Huffman code is usually (slightly) less efficient at compression than the optimal unordered one.
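To see the ordered property in action, here’s a small sketch of mine: the codes are the final ones from the worked example later in this article, kept as strings of '0'/'1' characters so that plain lexicographic comparison applies.

```c
#include <string.h>

/* Ordered codes for the letters a..j, taken from the worked example
 * later in the article; bits are kept as '0'/'1' characters. */
static const char *CODE[] = {
    "000", "00100", "00101", "0011", "01",
    "1000", "1001", "101", "110", "111"
};

/* Encode a string over a..j; out must be large enough. */
static void encode(const char *src, char *out)
{
    *out = '\0';
    for (; *src; src++)
        strcat(out, CODE[*src - 'a']);
}
```

Encoding "ace" and "add", for instance, gives "0000010101" and "00000110011": the encoded strings compare in the same order as the plaintext.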

The text-book method to compute the optimal ordered Huffman code, for a given symbol distribution, is the Hu-Tucker algorithm. Less well known, and more efficient, is the Garsia-Wachs algorithm, which traverses the “tree” much more quickly. It’s an efficient **O(n log n)** algorithm. But is there a faster way?

Huffman-code algorithms typically work from a binary tree structure, where each bit of each code represents a left or right branch in the tree. Consider, instead, the bit strings as paths in that tree. You can compute a reasonable set of paths without the tree: start with a good estimate of each path (code) length, compute what the corresponding bit strings must be, then eliminate bits that don’t distinguish a code from the adjacent ones.

For example, suppose the letters [a..j] have frequencies F = (61, 10, 23, 33, 126, 22, 20, 61, 70, 2). Ideally, the number of bits in the code for symbol i should be about log2(sum(F)/F[i]). The ordered Huffman code is generated in these steps:

- Calculate left-justified bit masks M[i] based on sum(F[0..i])/sum(F). For 32-bit masks, M[i] is the set of bits above the highest bit in 0x3FFFFFFF * sum(F[0..i])/sum(F). Yes, calculating that set of bits is in Bit Twiddling Hacks.
- Generate an initial bit string C[i] for each code, such that every code differs from C[i-1] and C[i+1] at its lowest M[i] bit, at the very least.
- C[0] = 0
- C[i] = (C[i-1] - (M[i] & M[i-1])) & M[i].

- Compress codes down to necessary bits. Unnecessary bits correspond to nodes with a single child, in the tree representation of the code. A bit is unnecessary if it is 0 in the current code, and was either unnecessary in, or beyond the end of, the next code: U[i] = (~M[i+1] | U[i+1]) & ~C[i] & M[i].

| Symbol | Length | M[i] | C[i] | M[i] & ~U[i] | Final C[i] |
|---|---|---|---|---|---|
| a | 3 | 11100000 | 000..... | 111 | 000 |
| b | 6 | 11111100 | 001000.. | 111110 | 00100 |
| c | 5 | 11111000 | 00101... | 11111 | 00101 |
| d | 4 | 11110000 | 0011.... | 1111 | 0011 |
| e | 2 | 11000000 | 01...... | 11 | 01 |
| f | 5 | 11111000 | 10000... | 11101 | 1000 |
| g | 5 | 11111000 | 10001... | 11101 | 1001 |
| h | 3 | 11100000 | 101..... | 111 | 101 |
| i | 3 | 11100000 | 110..... | 111 | 110 |
| j | 8 | 11111111 | 111..... | 111 | 111 |

Unlike Hu-Tucker or Garsia-Wachs, this algorithm doesn’t guarantee optimality; a little information is lost to rounding in the initial length estimate, ceil(log2(sum(F)/F[i])). However, for all but degenerate cases, this method produces a Huffman code within 1-5% of optimal, using three linear passes. Not bad, eh? You can reduce the percentage by using a slightly non-linear mapping from frequency to bit length, without adding measurably to the processing time.

If you are used to playing with bits, you might have noticed I glossed over one step: how exactly do you compress one set of bits with another? No computer, other than the Analogic APL machine, has a native instruction for this. It’s possible to use look-up tables; but it happens that a pair of tiny registers-only loops does this very efficiently. Since the loops are absolutely bounded by machine word size, this algorithm can happily claim **O(n)** efficiency, in solving an **O(n log n)** problem … and run faster as well.

For lovers of bit-twiddling “C”, here are the loops:

```c
for (x = U[i], m = M[i]; (int)x < 0; x <<= 1, m <<= 1)
    C[i] <<= 1;                     /* unnecessary top bit: shift it out */
for (; x; x &= x - 1, m <<= 1)      /* take U's set bits, lowest first  */
    C[i] += (x ^ (x - 1)) & C[i];   /* x ^ (x-1) masks that bit and all
                                       below; adding that slice of C to
                                       itself shifts the low bits up one,
                                       closing the gap at the zero bit  */
```
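To watch the loops at work on the 8-bit example above, here is a hypothetical wrapper of mine (the name squeeze is an assumption, and int8_t stands in for the sign test on a full machine word):

```c
#include <stdint.h>

/* Squeeze the unnecessary bits u out of the left-justified code c,
 * under mask m, using the two loops from the article on 8-bit words. */
static uint8_t squeeze(uint8_t c, uint8_t u, uint8_t m)
{
    uint8_t x;
    for (x = u; (int8_t)x < 0; x <<= 1, m <<= 1)
        c <<= 1;                    /* unnecessary top bit: shift it out */
    for (; x; x &= x - 1, m <<= 1)  /* take u's set bits, lowest first  */
        c += (x ^ (x - 1)) & c;     /* close the gap at that zero bit   */
    return c;                       /* m, shifted once per dropped bit,
                                       ends as the final code's mask    */
}
```

For the letter g in the table, squeeze(0x88 /* 10001... */, 0x10, 0xF8) yields 0x90, i.e. 1001 left-justified, as expected.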