Okay, one more poke at SSE2: sorting doubles

Follow-up: source code is on github: ssesort.c.

Old-school (pre-CUDA) non-graphic programming of GPU’s dusted off a bunch of classic algorithms that did little or no branching, and no data sharing between processors, but allowed massive parallelism. One of those algorithms is bitonic sort, aka network sort, whose only operator is one that compares two values and swaps them if the first is greater than the second. Bitonic sort does no conditional branching at all — giving it some advantage where instructions are pipelined. In the abstract, it is an O(n log n log n) algorithm, so, for non-parallel sorts, it was left in the dust behind O(n log n) algorithms like quicksort.

SSE registers provide parallelism on a micro-scale: 2,4,8 or 16 values processed in parallel. For the fun of it, I put together an SSE equivalent of a GPU sort of 16 doubles (yes, a more specialized sort I cannot imagine). It only does 16 values, because SSE2 only supports 8 XMM registers (two doubles per register). The code turned into an exercise in gcc inline asm and cpp hackery, because gcc 4.4 just couldn’t handle keeping all 8 XMM registers in use, with operations ordered to pipeline loads and saves.

How fast is it? Eight times as fast as qsort with a callback function to compare doubles; five times as fast as a hand-written insertion sort.

What’s this good for? Well, remember that a fast “sorting” routine like quicksort is really a partitioning algorithm, that uses a simpler algorithm (like insertion sort) to sort small sets.

Furthermore, if you care to twiddle bits, there’s a lot you can do with a double. It can represent up to 62 bits of data. If you use the bottom 4 bits as an index (0..15) into the set of records, the remaining 58 bits can be used as a key.

Here’s the code, and I invite you to preprocess it (gcc -E) or look at the asm (gcc -S) to see exactly what’s being done. The first half (the asm code) sorts d[0,2,4,6] and d[1,3,5,7] independently. The second half does a linear merge of the two sorted sequences.

void ssesort16d(double keys[16])
    union { __m128i m[8]; double d[16]; } temp;
    // COMP(a,b,t): compute [a,b] = [min(a,b), max(a,b)], using temp (t)
#   define _ "\n"
#   define COMP(a,b,t)  _" movdqa %%xmm"#a",%%xmm"#t \
                        _" minpd  %%xmm"#b",%%xmm"#a \
                        _" maxpd  %%xmm"#t",%%xmm"#b
#   define LOAD(x)      _" movaps "#x"*16(%%"SRC"),%%xmm"#x
#   define SAVE(x)      _" movaps %%xmm"#x","#x"*16(%%edi)"
#   define SRC "esi"    // initial loads from keys[]:
        // Defer loading 7 (initially used as temp register):
        LOAD(0) LOAD(1) LOAD(2) LOAD(3) LOAD(4) LOAD(5) LOAD(6)
            // Bitonic step #1:
        COMP(0,1,7) SAVE(0) COMP(2,3,7)
        COMP(4,5,0) LOAD(7) COMP(6,7,0)
#   undef  SRC
#   define SRC "edi"    // Now do (re)loads from temp[]:
            // Bitonic step #2:
        COMP(4,7,0) SAVE(7) COMP(5,6,0)
        COMP(1,2,7) LOAD(0) COMP(0,3,7)
        COMP(0,1,7) SAVE(0) COMP(2,3,7)
        COMP(4,5,0) LOAD(7) COMP(6,7,0)
            // Bitonic step #3:
        COMP(3,4,0) SAVE(4) COMP(2,5,0)
        COMP(1,6,4) LOAD(0) COMP(0,7,4)
        COMP(0,2,4) SAVE(0) COMP(1,3,4)
        COMP(5,7,0) LOAD(4) COMP(4,6,0)
                 // Interleave last SAVES with last COMPs:
        COMP(4,5,0) SAVE(7) SAVE(6) LOAD(0)
        COMP(2,3,7) SAVE(5) SAVE(4) SAVE(3)
        COMP(0,1,7) SAVE(2) SAVE(1) SAVE(0)
    : : "S"(keys), "D"(temp.m) : "memory" //temp

    // Linear merge of temp.d[0,2,4,6] + temp.d[1,3,5,7] => keys:
    double  *zp = keys, *ap = temp.d, *bp = ap + 1, b = *bp;
    do {
        double  a = *ap;
        if (a > b) {
            double  *tp = ap; ap = bp; bp = tp;
            *zp++ = b;
            b = a;
        } else {
            *zp++ = a;
    } while ((ap += 2) < temp.d + 16);
    do *zp++ = *bp; while ((bp += 2) < temp.d + 16);

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm and tagged , . Bookmark the permalink.

One Response to Okay, one more poke at SSE2: sorting doubles

  1. Pingback: SSE2 odd-even merge (the last step in sorting) | Coding on the edges

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s