What is SSE !@#$% good for?

I’ve been struggling for the last few weeks with a post on SSE and what it might be good for, other than simple operations on arrays of four floats or two doubles (I don’t do much of that). There are plenty of things it’s NOT good for, mostly because every use has so many edge cases. Memory access aligned on 16-byte boundaries, and operations that only take constant arguments, account for most of that.

It’s possible to write some SSE code in gcc with intrinsic calls. However, I’ve yet to find a way to make gcc 4.4 use more than two registers. Anyone have a suggestion? Till then, that means writing a lot of inline asm.

So, what can an x86 program do better with SSE? In the following posts I’ll cover:

  • adding up a vector of integers
  • boolean operations on bitvectors
  • transposing a bit array
  • computing 128-bit and 256-bit hash functions
  • sorting doubles faster than qsort

Adding up a vector integers

Pardon my condensed style here, but here’s the nobrainer loop to add up Vec[N], N > 0.

for (i = 1, sum = Vec[0]; i < N; ++i)
    sum += Vec[i];

A bright compiler may unroll the loop a few times; and a modern CPU will intelligently prefetch for an ascending series of memory accesses; so there’s not much to improve here. Here’s what the basic SSE equivalent looks like, for N being a multiple of 4 greater than 4, and Vec aligned on a 16-byte boundary in memory:

#include <xmmintrin.h>
__m128i const *mp = (void const*)vec;
__m128i xsum = _mm_add_epi32(mp[0], mp[1]);
for (i = 8, mp += 2; i < N-3; i += 4, ++mp) 
    xsum = _mm_add_epi32(xsum, *mp);

// xsum[0] += xsum[2], xsum[1] += xsum[3]:
xsum = _mm_add_epi32(xsum, _mm_shuffle_epi32(xsum, 0x4E));
sum = __builtin_ia32_vec_ext_v4si((__v4si)xsum, 0) 
    + __builtin_ia32_vec_ext_v4si((__v4si)xsum, 1);

The peculiar shuffle routine permutes the four elements of an __m128i — and yes, it only takes a constant permutation. I use  __builtin… routines because the _mm interface doesn’t support retrieving individual int fields from an __m128i … and yes, again, the “subscript” field must be a constant. Starts to make old-school self-modifying code look good.

All that work gives you a routine that’s 40% faster than the naive loop, measured on an Athlon64 and a Pentium4. You can get exactly the same speed-up by unrolling the naive loop 8 times. Sigh.

So you can see that using SSE means using a quirky interface, weaving through peculiar constraints, having no guarantee of better performance. Stick with me. It does get more interesting.

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm. Bookmark the permalink.

1 Response to What is SSE !@#$% good for?

  1. Pingback: The “C” preprocessor: not as cryptic as you’d think | Coding on the edges

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s