What is SSE !@#$% good for? #2: Bit vector operations

You’ll notice that my definition of “good for” does not include video codec operations or other very specialized computations. You will probably never write a video codec in your life (well, perhaps you will — you know whom I mean). But there are many high-volume bit-twiddling operations that may turn out to be useful in your code but don’t warrant offloading to a GPU. With that in mind …

An Aside
In some offline correspondence, the subject of different SSE versions came up. My personal bias is to avoid anything that only Intel or only AMD provides. In fact, since not many people have AMD Bobcat or Bulldozer CPUs, I stay away from SSE4.2, let alone the CLMUL instructions.

A Rant
Who comes up with (SSE) assembler mnemonics like PCLMULLQLQDQ? Shades of the PowerPC, with opcodes like BDNZFLRL, and “Enforce In-order Execution of I/O” (Old MacDonald had a farm …).

AND, OR, XOR … and “AND-NOT”? But no “NOT”?
The SSE bitwise operators sometimes surprise people. Usually, the first surprise is the lack of a NOT operator … you have to load 0xFFF…FFF into a register and XOR against it. The second surprise is the “AND-NOT” operator. WTF? Anyone who programmed PDP-11s will recognize AND-NOT as the BIC (bit-clear) operator; and yes, the PDP-11 bitwise-or instruction was BIS, not OR. AND-NOT lets you use the same mask to set bits (with OR) or clear bits (with AND-NOT). I guess the SSE architects wanted to save an opcode, and in practice, you’ll find AND-NOT more useful than NOT. BTW, the fastest way to set an XMM register to all ones (0xFFFF….FFFF) is to compare the register to itself; i.e.:

__m128i junk; return _mm_cmpeq_epi8(junk, junk);
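Since there is no PNOT, the usual idiom combines that all-ones trick with XOR. A minimal sketch (the helper name mm_not_si128 is mine, not a real intrinsic):

```c
#include <emmintrin.h>

/* Bitwise NOT of a 128-bit register: XOR with all-ones.
 * mm_not_si128 is a made-up helper name, not an Intel intrinsic. */
static inline __m128i mm_not_si128(__m128i x)
{
    __m128i ones = _mm_cmpeq_epi8(x, x);  /* compare-to-self: every lane equal => 0xFF..FF */
    return _mm_xor_si128(x, ones);
}
```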

SSE added the idea of explicit streamed input and output to RAM. Prefetch and output streaming can help, but in a limited way.

Modern CPUs are getting smarter and smarter at spotting patterns of data fetching, so explicit prefetching no longer matters when you are scanning long linear blocks of memory. If you are doing random fetches of blocks of memory, and can predict where those are at least ten instructions before you need them, prefetch is still worth doing — the CPU may be smarter, but it’s not psychic.
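Here is a sketch of that random-fetch case (sum_indexed and LOOKAHEAD are names I’m making up here; the payoff depends on the table being much larger than cache):

```c
#include <xmmintrin.h>      /* _mm_prefetch */

#define LOOKAHEAD 16        /* issue the prefetch well before the data is needed */

/* Gather-and-sum through an index array, prefetching future table entries.
 * The index array tells us where we will be reading 16 iterations from now,
 * so we can hint the fetch early. */
static long long sum_indexed(const long long *table, const int *index, int n)
{
    long long sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            _mm_prefetch((const char *)&table[index[i + LOOKAHEAD]], _MM_HINT_T0);
        sum += table[index[i]];
    }
    return sum;
}
```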

Output streaming doesn’t speed up your processing, but spares the cache for better uses, such as look-up tables for (say) an encryption or compression routine. For large bit vectors, the result of processing any 128-bit slice won’t be used soon — certainly not until the process is done. It’s fairly trivial to replace assignments with streaming:

__m128i src[N], dst[N];
int i;
for (i = 0; i < N; i++)
    dst[i] = _mm_xor_si128(src[i], dst[i]);

becomes:

for (i = 0; i < N; i++)
    _mm_stream_si128(&dst[i], _mm_xor_si128(src[i], dst[i]));

And now here comes the gotcha (you were expecting one, right?): if you stream the result back to memory that was recently read (i.e. is in L1 cache), it’s great. If not, your program will take roughly twice as long as the non-streamed version.

“It’s just a shift to the left …”
Shifting is not as easy as one would wish. First, all shift counts must be constants (but see how to implement an optimal variable-count shift). Secondly, there is no 128-bit bit-shift operator; you can only shift the 128-bit register by a whole number of bytes. There is an instruction to shift both 64-bit halves of a register — which is like a 128-bit shift if you don’t mind losing a few bits in the middle. Some wrapper code is inevitably required; you’ll end up writing switch statements a lot. In fact, I do it so often that I end up using preprocessor hacks like the #defines in the following example. General performance tip: move as much work as possible into each case statement, rather than writing lots of generic lower-level operators. For example, here’s the full source for a “left shift” operator — but if I had to combine a shift with other 128-bit operations, I would write a separate procedure:

#include <emmintrin.h>  // SSE2

#define Do_1_7  Do(1) Do(2) Do(3) Do(4) Do(5) Do(6) Do(7)
#define Do_1_15 Do(1) Do(2) Do(3) Do(4) Do(5) Do(6) Do(7) Do(8) \
                Do(9) Do(10) Do(11) Do(12) Do(13) Do(14) Do(15)
// ...and yes, I sometimes use Do_1_127, too.

static __m128i shl8(__m128i x, int nbytes);
static __m128i shr64x(__m128i x, int nbits);

__m128i shiftLeft(__m128i x, int nbits)
{
    nbits &= 127;
    if (nbits > 7)
        x = shl8(x, nbits >> 3), nbits &= 7;
    if (nbits)  // shift both 64-bit halves, then OR in the bits carried across the middle
        x = _mm_or_si128(_mm_slli_epi64(x, nbits),
                         _mm_slli_si128(shr64x(x, nbits), 8));
    return x;
}

// Shift left by a whole number of bytes (1..15). The byte count of
// _mm_slli_si128 must be an immediate; hence the switch.
static __m128i shl8(__m128i x, int nbytes)
{
    switch (nbytes) {
#   undef   Do
#   define  Do(z) case z: return _mm_slli_si128(x, z);
    Do_1_15
    }
    return x;
}

// Shift each 64-bit half right by (64 - nbits), for nbits in 1..7.
static __m128i shr64x(__m128i x, int nbits)
{
    switch (nbits) {
#   undef   Do
#   define  Do(z) case z: return _mm_srli_epi64(x, 64 - z);
    Do_1_7
    }
    return x;
}

Shift over some more

When you want to shift more than 128 bits, the code is very similar to doing the same operation in ordinary registers. For example, to left-shift an N-word vector by nbits (0 < nbits < 128):

void shiftLeftVec(__m128i Vec[], int N, int nbits)
{
    // Vec[N-1] holds the least-significant 128 bits; carry propagates toward Vec[0].
    __m128i carry = _mm_setzero_si128();
    int i;
    for (i = N; --i >= 0;) {
        __m128i x = Vec[i];
        Vec[i] = _mm_or_si128(carry, shiftLeft(x, nbits));
        carry = shiftRight(x, 128 - nbits);  // shiftRight: the mirror image of shiftLeft
    }
}

This can be optimized as described above — putting all the operations into the fewest possible case statements.

The SSSE3 instruction set extension also has the PALIGNR op, which shifts one register while filling in bytes from the bottom of another, leaving that other register unaffected. This is much like the SHLD/SHRD ops on pairs of CPU registers. Instructions like this make me wonder why Intel bothers: only one direction is supported, and only whole-byte shifts, so you would have to code this as a complex special case, for only a small performance improvement.

Setting or clearing a bit

You’re pretty much stuck with creating a 128-bit bitmask, and using OR/AND-NOT. You could use a complex shift operator to get the bit into the right position, but it’s faster to use a look-up table of 128 single-bit masks: an aligned load beats the computed branches needed to shift a bit into position.
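A sketch of the table approach (init_bitmasks, set_bit and clear_bit are illustrative names; the layout, bit b living in byte b>>3, is one reasonable convention, not something from the article):

```c
#include <emmintrin.h>
#include <string.h>

/* 128 single-bit masks, one per bit position.
 * Convention assumed here: bit b lives in byte b>>3, bit b&7 of that byte. */
static __m128i bitmask[128];

static void init_bitmasks(void)
{
    int b;
    for (b = 0; b < 128; b++) {
        unsigned char bytes[16];
        memset(bytes, 0, sizeof bytes);
        bytes[b >> 3] = (unsigned char)(1 << (b & 7));
        bitmask[b] = _mm_loadu_si128((const __m128i *)bytes);
    }
}

static __m128i set_bit(__m128i x, int b)    /* OR in the mask */
{ return _mm_or_si128(x, bitmask[b]); }

static __m128i clear_bit(__m128i x, int b)  /* AND-NOT: (~mask) & x -- mask is the FIRST arg */
{ return _mm_andnot_si128(bitmask[b], x); }
```

Note the argument order of _mm_andnot_si128: the operand that gets complemented comes first.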

Lateral ops: FindFirstBit and Counting Ones

SSE doesn’t “do” lateral operations, with the exception of the horizontal float[4] “add” operation. And finding the position of the lowest (or highest) set bit is, well, arduous. Even the arithmetic/boolean trick for getting the lowest set bit as a bit mask — (X & (X - 1)) ^ X — doesn’t work well: SSE has no 128-bit add or subtract operator.
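One workable fallback (not from the article) is to drop out of the XMM register and use the 64-bit scalar bit scan, here via GCC’s __builtin_ctzll:

```c
#include <emmintrin.h>

/* Index of the lowest set bit (0..127), or -1 if x is all zero.
 * Examines the two 64-bit halves with scalar count-trailing-zeros. */
static int find_first_bit(__m128i x)
{
    unsigned long long lo = (unsigned long long)_mm_cvtsi128_si64(x);
    if (lo)
        return __builtin_ctzll(lo);
    unsigned long long hi = (unsigned long long)_mm_cvtsi128_si64(_mm_srli_si128(x, 8));
    if (hi)
        return 64 + __builtin_ctzll(hi);
    return -1;
}
```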

However, the fast way of counting bits (without hardware support) works for 128-bit words too, and in fact is 2-4 times faster than the 32-bit version. gcc only uses two XMM registers here, so it’s best to use ASM to extend this bitcount function to arrays of more than one 128-bit word — keeping the constant bitmasks below (m0..m4) in registers.

int sse_bitcount(__m128i x)
{
    // The classic parallel bit count, done on two 64-bit lanes at once.
    __m128i m0 = (__v2di) { 0x5555555555555555ULL, 0x5555555555555555ULL };
    x = _mm_add_epi64(_mm_and_si128(m0, x), _mm_and_si128(m0, _mm_srli_epi64(x, 1)));

    __m128i m1 = (__v2di) { 0x3333333333333333ULL, 0x3333333333333333ULL };
    x = _mm_add_epi64(_mm_and_si128(m1, x), _mm_and_si128(m1, _mm_srli_epi64(x, 2)));

    __m128i m2 = (__v2di) { 0x0F0F0F0F0F0F0F0FULL, 0x0F0F0F0F0F0F0F0FULL };
    x = _mm_add_epi64(_mm_and_si128(m2, x), _mm_and_si128(m2, _mm_srli_epi64(x, 4)));

    __m128i m3 = (__v2di) { 0x00FF00FF00FF00FFULL, 0x00FF00FF00FF00FFULL };
    x = _mm_add_epi64(_mm_and_si128(m3, x), _mm_and_si128(m3, _mm_srli_epi64(x, 8)));

    __m128i m4 = (__v2di) { 0x0000FFFF0000FFFFULL, 0x0000FFFF0000FFFFULL };
    x = _mm_add_epi64(_mm_and_si128(m4, x), _mm_and_si128(m4, _mm_srli_epi64(x, 16)));

    // Swap the two qwords and add, then sum the two 32-bit partial counts.
    x = _mm_add_epi32(_mm_shuffle_epi32(x, 0x4E), x);
    return __builtin_ia32_vec_ext_v4si((__v4si)x, 0)
         + __builtin_ia32_vec_ext_v4si((__v4si)x, 1);
}

Up Next: Transposing a bit array, and computing hash functions


About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte databases --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

6 Responses to What is SSE !@#$% good for? #2: Bit vector operations

  1. mischasan says:

    As a footnote: in the same way that a fast way to zero a general-purpose register is to xor it with itself, the fastest way to load FFFF….FFFF into an XMM register is to compare itself to itself; i.e.:
    __m128i junk; return _mm_cmpeq_epi8(junk, junk);

  2. Müller says:

    Useful information 🙂

  3. jedihacker says:

    Where’s the rest of the article? “First, all shift counts must be constants (but see”

  4. Nemo says:

    “BTW the fastest way to set an XMM register to all ones (0xFFFF….FFFF) is to compare the register to itself”

    Modern compilers (or at least, GCC 4.9 & 5.1, ICC 13 & 14, and Clang 3.6) emit the one-instruction comparison for the somewhat more readable _mm_set1_epi8(0xFF) and _mm_set1_epi32(-1).

    Thanks for the great blog.
