I’ve been struggling for the last few weeks with a post on SSE and what it might be good for, other than simple operations on arrays of four floats or two doubles (I don’t do much of that). There are plenty of things it’s NOT good for, mostly because every use has so many edge cases. Memory accesses that must be aligned on 16-byte boundaries, and operations that take only constant arguments, account for most of those edge cases.
It’s possible to write some SSE code in gcc with intrinsic calls. However, I’ve yet to find a way to make gcc 4.4 use more than two registers. Anyone have a suggestion? Till then, that means writing a lot of inline asm.
So, what can an x86 program do better with SSE? In the following posts I’ll cover:
- adding up a vector of integers
- boolean operations on bitvectors
- transposing a bit array
- computing 128-bit and 256-bit hash functions
- sorting doubles faster than qsort
Adding up a vector of integers
Pardon my condensed style here, but here’s the no-brainer loop to add up the N elements of Vec[], N > 0.
for (i = 1, sum = Vec[0]; i < N; ++i) sum += Vec[i];
A bright compiler may unroll the loop a few times; and a modern CPU will intelligently prefetch for an ascending series of memory accesses; so there’s not much to improve here. Here’s what the basic SSE equivalent looks like, for N being a multiple of 4 greater than 4, and Vec aligned on a 16-byte boundary in memory:
#include <emmintrin.h>

__m128i const *mp = (void const *)Vec;
__m128i xsum = _mm_add_epi32(mp[0], mp[1]);
for (i = 8, mp += 2; i < N - 3; i += 4, ++mp)
    xsum = _mm_add_epi32(xsum, *mp);
// Swap the 64-bit halves of xsum and add, leaving lane sums
// {0+2} and {1+3} in the low two lanes:
xsum = _mm_add_epi32(xsum, _mm_shuffle_epi32(xsum, 0x4E));
sum = __builtin_ia32_vec_ext_v4si((__v4si)xsum, 0)
    + __builtin_ia32_vec_ext_v4si((__v4si)xsum, 1);
The peculiar shuffle routine permutes the four elements of an __m128i — and yes, it only takes a constant permutation. I use __builtin… routines because the _mm interface doesn’t support retrieving individual int fields from an __m128i … and yes, again, the “subscript” argument must be a constant. Starts to make old-school self-modifying code look good.
All that work gives you a routine that’s 40% faster than the naive loop, measured on an Athlon64 and a Pentium4. You can get exactly the same speed-up by unrolling the naive loop 8 times. Sigh.
So you can see that using SSE means using a quirky interface and weaving through peculiar constraints, with no guarantee of better performance. Stick with me. It does get more interesting.