## A bit sideways

You can probably guess that I’m a fan of Bit-twiddling Hacks. I like to think of bitwise operations as a very limited parallel processor, with no branching. Or possibly as turning all machine words sideways.

Huffman compression is a way to encode fixed-sized symbols (typically bytes) as variable-length bit strings, where shorter code values are assigned to frequent symbols in your data stream; hence the compression.

A special flavour of Huffman code is an ordered one: codes are assigned such that encoded byte-strings are lexically ordered the same way as the unencoded sources. That’s very useful for compressed indexes of long strings, for example. There is a cost: an ordered Huffman code is usually (slightly) less efficient at compression than the optimal unordered one.

The text-book method to compute the optimal ordered Huffman code, for a given symbol distribution, is the Hu-Tucker algorithm. Less well known, and more efficient, is the  Garsia-Wachs algorithm, which traverses the “tree” much more quickly. It’s an efficient O(n log n) algorithm. But is there be a faster way?

Huffman-code algorithms typically work from a binary tree structure, where each bit of each code represents a left- or right- branch in the tree. Consider, instead, the bit strings, as paths in that tree. You can compute a reasonable set of paths without the tree: start with a good estimate of each path (code) length, compute what the corresponding bit strings must be, then eliminate bits that don’t distinguish a code from the adjacent ones.

For example, suppose the letters [a..j] have frequencies F = (61, 10, 23, 33, 126, 22, 20, 61, 70, 2). Ideally, the number of bits in their Huffman codes should depend on base-2 log of F[i]. The ordered Huffman code is generated in these steps:

• Calculate left-justified bit masks M[i] based on sum(F[0..i]/sum(F)). For 32bit masks, M[i] is the set of bits above the highest bit in 0x3FFFFFFF * sum(F[0..i]/sum(F)). Yes, calculating that set of bits is in Bit Twiddling Hacks.
• Generate an initial bit string C[i] for code, such that every code differs from C[i-1] and C[i+1] on its lowest M[i] bit, at very least.
• C = 0
• C[i] = (C[i-1] – (M[i] & M[i-1]) & M[i].
• Compress codes down to necessary bits. Unnecessary bits correspond to nodes with a single child, in the tree representation of the code. A bit is unnecessary if is 0 in the current code, and was either unnecessary in, or beyond the end of, the previous code. U[i] = (~M[i+1] | U[i+1]) & ~C[i] & M[i].
```   Length M[i]     C[i]     U[i]  Final C[i]
a    3   11100000 000..... 111     000
b    6   11111100 001000.. 111110  00100
c    5   11111000 00101... 11111   00101
d    4   11110000 0011.... 1111    0011
e    2   11000000 01...... 11      01
f    5   11111000 10000... 11101   1000
g    5   11111000 10001... 11101   1001
h    3   11100000 101..... 111     101
i    3   11100000 110..... 111     110
j    8   11111111 111..... 111     111```

Unlike Hu-Tucker or Garsia-Wachs, this algorithm doesn’t guarantee optimality; a little information is lost by rounding, in the original length estimate based on ceil(log(F[i]). However, for all but degenerate cases, this method produces a Huffman code within 1-5% of optimal, using three linear passes. Not bad, eh? You can reduce the percentage by using a slightly non-linear mapping from frequency to bit length, without adding measurably to the processing time.

If you are used to playing with bits, you might have noticed I glossed over one step: how exactly do you compress one set of bits with another? No computer, other than the Analogic APL machine, has a native instruction for this. It’s possible to use look-up tables; but it happens that a pair of tiny registers-only loops does this very efficiently. Since the loops are absolutely bounded by machine word size, this algorithm can happily claim O(n) efficiency, in solving an O(n log n) problem … and run faster as well.

For lovers of bit-twiddling “C”, here are the loops:

```for (x = U[i], m = M[i]; (int)x < 0; x <<= 1, m <<= 1)
C[i] <<= 1;
for (; x; x &= x - 1, m <<= 1)
C[i] += (x ^ (x - 1)) & C[i];``` 