What is SSE !@#$% good for? #2: Bit vector operations

You’ll notice that my definition of “good for” does not include video codec operations or other very specialized computations. You will probably never write a video codec in your life (well, perhaps you will — you know whom I mean). But there are many high-volume bit-twiddling operations that may turn out to be useful in your code but don’t warrant offloading to a GPU. With that in mind …

An Aside
In some offline correspondence, the subject of different SSE versions came up. My personal bias is to avoid anything that only Intel or only AMD provides. In fact, since not many people have AMD Bobcat or Bulldozer CPUs, I stay away from SSE4.2, let alone the CLMUL instructions.

A Rant
Who comes up with (SSE) assembler mnemonics like PCLMULLQLQDQ? Shades of the PowerPC, with opcodes like BDNZFLRL, and “Enforce In-order Execution of I/O” (Old MacDonald had a farm …).

AND, OR, XOR … and “AND-NOT”? But no “NOT”?
The SSE bitwise operators sometimes surprise people. Usually, the first surprise is the lack of a NOT operator … you have to load 0xFFF…FFF into a register and XOR against it. The second surprise is the “AND-NOT” operator. WTF? Anyone who programmed PDP-11s will recognize AND-NOT as the BIC (bit-clear) operator; and yes, the PDP-11 bitwise-or instruction was BIS, not OR. AND-NOT lets you use the same mask to set bits (with OR) or clear bits (with AND-NOT). I guess the SSE architects wanted to save an opcode, and in practice, you’ll find AND-NOT more useful than NOT. BTW, the fastest way to set an XMM register to all ones (0xFFFF….FFFF) is to compare the register to itself; i.e.:

__m128i junk; return _mm_cmpeq_epi8(junk, junk);
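Since there is no PNOT, the usual idiom combines that all-ones trick with XOR. A minimal sketch (the helper name mm_not_si128 is mine, not a real intrinsic):

```c
#include <emmintrin.h>

/* Bitwise NOT of a 128-bit register: XOR with all-ones.
 * mm_not_si128 is a made-up helper name, not an Intel intrinsic. */
static inline __m128i mm_not_si128(__m128i x)
{
    __m128i ones = _mm_cmpeq_epi8(x, x);  /* compare-to-self: every lane equal => 0xFF..FF */
    return _mm_xor_si128(x, ones);
}
```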

SSE added the idea of explicit streamed input and output to RAM. Prefetch and output streaming can help, but in a limited way.

Modern CPUs are getting smarter and smarter at spotting patterns of data fetching, so explicit prefetching no longer matters when you are scanning long linear blocks of memory. If you are doing random fetches of blocks of memory, and can predict where those are at least ten instructions before you need them, prefetch is still worth doing — the CPU may be smarter, but it’s not psychic.
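Here is a sketch of that random-fetch case (sum_indexed and LOOKAHEAD are names I’m making up here; the payoff depends on the table being much larger than cache):

```c
#include <xmmintrin.h>      /* _mm_prefetch */

#define LOOKAHEAD 16        /* issue the prefetch well before the data is needed */

/* Gather-and-sum through an index array, prefetching future table entries.
 * The index array tells us where we will be reading 16 iterations from now,
 * so we can hint the fetch early. */
static long long sum_indexed(const long long *table, const int *index, int n)
{
    long long sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            _mm_prefetch((const char *)&table[index[i + LOOKAHEAD]], _MM_HINT_T0);
        sum += table[index[i]];
    }
    return sum;
}
```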

Output streaming doesn’t speed up your processing, but spares the cache for better uses, such as look-up tables for (say) an encryption or compression routine. For large bit vectors, the result of processing any 128-bit slice won’t be used soon — certainly not until the process is done. It’s fairly trivial to replace assignments with streaming:

__m128i src[N], dst[N];
int i;
for (i = 0; i < N; i++)
    dst[i] = _mm_xor_si128(src[i], dst[i]);

becomes:

for (i = 0; i < N; i++)
    _mm_stream_si128(&dst[i], _mm_xor_si128(src[i], dst[i]));

And now here comes the gotcha (you were expecting one, right?): if you stream the result back to memory that was recently read (i.e. is in L1 cache), it’s great. If not, your program will take roughly twice as long as the non-streamed version.

“It’s just a shift to the left …”
Shifting is not as easy as one would wish. First, all shift counts must be constants (but see how to implement an optimal variable-count shift). Secondly, there is no 128-bit bit-shift operator; you can only shift the 128-bit register by a whole number of bytes. There is an instruction to shift both 64-bit halves of a register — which is like a 128-bit shift if you don’t mind losing a few bits in the middle. Some wrapper code is inevitably required; you’ll end up writing switch statements a lot. In fact, I do it so often that I end up using preprocessor hacks like the #defines in the following example. General performance tip: move as much work as possible into each case statement, rather than writing lots of generic lower-level operators. For example, here’s the full source for a “left shift” operator — but if I had to combine a shift with other 128-bit operations, I would write a separate procedure:

#include <emmintrin.h>  // SSE2

#define Do_1_7  Do(1) Do(2) Do(3) Do(4) Do(5) Do(6) Do(7)
#define Do_1_15 Do(1) Do(2) Do(3) Do(4) Do(5) Do(6) Do(7) Do(8) \
                Do(9) Do(10) Do(11) Do(12) Do(13) Do(14) Do(15)
// ...and yes, I sometimes use Do_1_127, too.

static __m128i shl8(__m128i x, int nbytes);
static __m128i shr64x(__m128i x, int nbits);

__m128i shiftLeft(__m128i x, int nbits)
{
    nbits &= 127;
    if (nbits > 7)
        x = shl8(x, nbits >> 3), nbits &= 7;
    if (nbits)  // shift both 64-bit halves, then OR in the bits carried across the middle
        x = _mm_or_si128(_mm_slli_epi64(x, nbits),
                         _mm_slli_si128(shr64x(x, nbits), 8));
    return x;
}

// Shift left by a whole number of bytes (1..15). The byte count of
// _mm_slli_si128 must be an immediate; hence the switch.
static __m128i shl8(__m128i x, int nbytes)
{
    switch (nbytes) {
#   undef   Do
#   define  Do(z) case z: return _mm_slli_si128(x, z);
    Do_1_15
    }
    return x;
}

// Shift each 64-bit half right by (64 - nbits), for nbits in 1..7.
static __m128i shr64x(__m128i x, int nbits)
{
    switch (nbits) {
#   undef   Do
#   define  Do(z) case z: return _mm_srli_epi64(x, 64 - z);
    Do_1_7
    }
    return x;
}

Shift over some more

When you want to shift more than 128 bits, the code is very similar to doing the same operation in ordinary registers. For example, to left-shift an N-word vector by nbits (0 < nbits < 128):

void shiftLeftVec(__m128i Vec[], int N, int nbits)
{
    // Vec[N-1] holds the least-significant 128 bits; carry propagates toward Vec[0].
    __m128i carry = _mm_setzero_si128();
    int i;
    for (i = N; --i >= 0;) {
        __m128i x = Vec[i];
        Vec[i] = _mm_or_si128(carry, shiftLeft(x, nbits));
        carry = shiftRight(x, 128 - nbits);  // shiftRight: the mirror image of shiftLeft
    }
}

This can be optimized as described above — putting all the operations into the fewest possible case statements.

The SSSE3 instruction set extension also has the PALIGNR op, which shifts one register while filling in bytes from the bottom of another, leaving that other register unaffected. This is much like the SHLD/SHRD ops on pairs of CPU registers. Instructions like this make me wonder why Intel bothers: only one direction is supported, and only whole-byte shifts, so you would have to code this as a complex special case, for only a small performance improvement.

Setting or clearing a bit

You’re pretty much stuck with creating a 128-bit bitmask, and using OR/AND-NOT. You could use a complex shift operator to get the bit into the right position, but it’s faster to use a look-up table of 128 single-bit masks: an aligned load beats the computed branches needed to shift a bit into position.
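A sketch of the table approach (init_bitmasks, set_bit and clear_bit are illustrative names; the layout, bit b living in byte b>>3, is one reasonable convention, not something from the article):

```c
#include <emmintrin.h>
#include <string.h>

/* 128 single-bit masks, one per bit position.
 * Convention assumed here: bit b lives in byte b>>3, bit b&7 of that byte. */
static __m128i bitmask[128];

static void init_bitmasks(void)
{
    int b;
    for (b = 0; b < 128; b++) {
        unsigned char bytes[16];
        memset(bytes, 0, sizeof bytes);
        bytes[b >> 3] = (unsigned char)(1 << (b & 7));
        bitmask[b] = _mm_loadu_si128((const __m128i *)bytes);
    }
}

static __m128i set_bit(__m128i x, int b)    /* OR in the mask */
{ return _mm_or_si128(x, bitmask[b]); }

static __m128i clear_bit(__m128i x, int b)  /* AND-NOT: (~mask) & x -- mask is the FIRST arg */
{ return _mm_andnot_si128(bitmask[b], x); }
```

Note the argument order of _mm_andnot_si128: the operand that gets complemented comes first.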

Lateral ops: FindFirstBit and Counting Ones

SSE doesn’t “do” lateral operations, with the exception of the horizontal float[4] “add” operation. And finding the position of the lowest (or highest) set bit is, well, arduous. Even the arithmetic/boolean trick for getting the lowest set bit as a bit mask — (X & (X - 1)) ^ X — doesn’t work well: SSE has no 128-bit add or subtract operator.
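One workable fallback (not from the article) is to drop out of the XMM register and use the 64-bit scalar bit scan, here via GCC’s __builtin_ctzll:

```c
#include <emmintrin.h>

/* Index of the lowest set bit (0..127), or -1 if x is all zero.
 * Examines the two 64-bit halves with scalar count-trailing-zeros. */
static int find_first_bit(__m128i x)
{
    unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64(x);
    if (lo)
        return __builtin_ctzll(lo);
    unsigned long long hi = (unsigned long long)_mm_cvtsi128_si64(_mm_srli_si128(x, 8));
    if (hi)
        return 64 + __builtin_ctzll(hi);
    return -1;
}
```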

However, the fast way of counting bits (without hardware support) works for 128-bit words too, and in fact is 2-4 times faster than the 32-bit version. gcc only uses two XMM registers here, so it’s best to use ASM to extend this bitcount function to arrays of more than one 128-bit word — keeping the constant bitmasks below (m0..m4) in registers.

int sse_bitcount(__m128i x)
{
    // The classic parallel bit count, done on two 64-bit lanes at once.
    __m128i m0 = (__v2di) { 0x5555555555555555ULL, 0x5555555555555555ULL };
    x = _mm_add_epi64(_mm_and_si128(m0, x), _mm_and_si128(m0, _mm_srli_epi64(x, 1)));

    __m128i m1 = (__v2di) { 0x3333333333333333ULL, 0x3333333333333333ULL };
    x = _mm_add_epi64(_mm_and_si128(m1, x), _mm_and_si128(m1, _mm_srli_epi64(x, 2)));

    __m128i m2 = (__v2di) { 0x0F0F0F0F0F0F0F0FULL, 0x0F0F0F0F0F0F0F0FULL };
    x = _mm_add_epi64(_mm_and_si128(m2, x), _mm_and_si128(m2, _mm_srli_epi64(x, 4)));

    __m128i m3 = (__v2di) { 0x00FF00FF00FF00FFULL, 0x00FF00FF00FF00FFULL };
    x = _mm_add_epi64(_mm_and_si128(m3, x), _mm_and_si128(m3, _mm_srli_epi64(x, 8)));

    __m128i m4 = (__v2di) { 0x0000FFFF0000FFFFULL, 0x0000FFFF0000FFFFULL };
    x = _mm_add_epi64(_mm_and_si128(m4, x), _mm_and_si128(m4, _mm_srli_epi64(x, 16)));

    // Swap the two qwords and add, then sum the two 32-bit partial counts.
    x = _mm_add_epi32(_mm_shuffle_epi32(x, 0x4E), x);
    return __builtin_ia32_vec_ext_v4si((__v4si)x, 0)
         + __builtin_ia32_vec_ext_v4si((__v4si)x, 1);
}

Up Next: Transposing a bit array, and computing hash functions


About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte databases --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

6 Responses to What is SSE !@#$% good for? #2: Bit vector operations

  1. mischasan says:

    As a footnote: in the same way that a fast way to zero a general-purpose register is to xor it with itself, the fastest way to load FFFF….FFFF into an XMM register is to compare itself to itself; i.e.:
    __m128i junk; return _mm_cmpeq_epi8(junk, junk);

  2. Müller says:

    Useful information 🙂

  3. jedihacker says:

    Where’s the rest of the article? “First, all shift counts must be constants (but see”

  4. Nemo says:

    “BTW the fastest way to set an XMM register to all ones (0xFFFF….FFFF) is to compare the register to itself”

    Modern compilers (or at least, GCC 4.9 & 5.1, ICC 13 & 14, and Clang 3.6) emit the one-instruction comparison for the somewhat more readable _mm_set1_epi8(0xFF) and _mm_set1_epi32(-1).

    Thanks for the great blog.
