Follow-up: the source code is on GitHub: ssesort.c.
Old-school (pre-CUDA) non-graphics programming of GPUs dusted off a bunch of classic algorithms that did little or no branching and no data sharing between processors, but allowed massive parallelism. One of those algorithms is bitonic sort, aka network sort, whose only operator compares two values and swaps them if the first is greater than the second. Bitonic sort does no conditional branching at all, which gives it some advantage where instructions are pipelined. In the abstract it is an O(n log² n) algorithm, so for non-parallel sorts it was left in the dust behind O(n log n) algorithms like quicksort.
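That compare-exchange operator is the whole vocabulary of a sorting network. A minimal scalar sketch (the helper name is mine, not from the code below):

    #include <math.h>

    /* Compare-exchange: afterwards *a <= *b. Using fmin/fmax instead
       of an if keeps the data path branch-free when the compiler
       lowers them to minsd/maxsd. */
    static inline void comp_exch(double *a, double *b)
    {
        double lo = fmin(*a, *b);
        double hi = fmax(*a, *b);
        *a = lo;
        *b = hi;
    }

A sorting network is then just a fixed sequence of such exchanges on fixed index pairs, the same sequence regardless of the input.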
SSE registers provide parallelism on a micro scale: 2, 4, 8, or 16 values processed in parallel. For the fun of it, I put together an SSE equivalent of a GPU sort of 16 doubles (yes, a more specialized sort I cannot imagine). It only does 16 values because SSE2, in 32-bit mode, supports only 8 XMM registers (two doubles per register). The code turned into an exercise in gcc inline asm and cpp hackery, because gcc 4.4 just couldn't keep all 8 XMM registers in use with operations ordered to pipeline loads and saves.
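For readers who'd rather not decode AT&T asm, here is what one COMP step does, written with SSE2 intrinsics (a sketch of the operation, not the code the routine actually uses; the function name is mine):

    #include <emmintrin.h>

    /* One compare-exchange on whole XMM registers: a ends up holding
       the pairwise minima and b the pairwise maxima, two doubles at
       a time. */
    static inline void comp_pd(__m128d *a, __m128d *b)
    {
        __m128d t = *a;             /* movdqa: save a */
        *a = _mm_min_pd(*a, *b);    /* minpd */
        *b = _mm_max_pd(t, *b);     /* maxpd */
    }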
How fast is it? Eight times as fast as qsort with a callback function to compare doubles; five times as fast as a hand-written insertion sort.
What’s this good for? Well, remember that a fast “sorting” routine like quicksort is really a partitioning algorithm that uses a simpler algorithm (like insertion sort) to sort small sets; see the sketch below.
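For example, a hybrid quicksort could partition down to small runs, then hand exactly-16-element, 16-byte-aligned runs to ssesort16d (a sketch under those assumptions; quicksort16 and insertion_sort are my names, and nothing here is benchmarked):

    #include <stdint.h>

    void ssesort16d(double keys[16]);   /* the routine shown below */

    static void insertion_sort(double *a, int n)
    {
        for (int i = 1; i < n; i++) {
            double v = a[i];
            int j = i;
            while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
            a[j] = v;
        }
    }

    static void quicksort16(double *a, int n)
    {
        while (n > 16) {
            /* Hoare-style partition around the middle element. */
            double pivot = a[n / 2];
            int i = 0, j = n - 1;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) { double t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
            }
            quicksort16(a, j + 1);  /* recurse on the left run */
            a += i; n -= i;         /* loop on the right run */
        }
        if (n == 16 && ((uintptr_t)a & 15) == 0)
            ssesort16d(a);          /* aligned 16-wide leaf: network sort */
        else
            insertion_sort(a, n);   /* everything else */
    }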
Furthermore, if you care to twiddle bits, there’s a lot you can do with a double: it can represent up to 62 bits of data. If you use the bottom 4 bits as an index (0..15) into a set of 16 records, the remaining 58 bits can be used as a key.
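Why 62 bits? With the top two bits held zero, the bit pattern is a positive, finite double (the exponent field can never be all ones, so no NaNs), and positive IEEE-754 doubles order exactly like their bit patterns. A packing sketch (my illustration, not code from the post):

    #include <stdint.h>
    #include <string.h>

    /* Pack a 58-bit key and a 4-bit record index into one sortable
       double. Bits 62 and 63 stay zero, so the result is positive and
       non-NaN, and minpd/maxpd order these doubles exactly like the
       underlying integers. */
    static double pack_key(uint64_t key58, unsigned index4)
    {
        uint64_t bits = (key58 << 4) | (index4 & 0xF); /* key58 < 1ULL<<58 */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    static unsigned unpack_index(double d)
    {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);
        return (unsigned)(bits & 0xF);
    }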
Here’s the code, and I invite you to preprocess it (gcc -E) or look at the asm (gcc -S) to see exactly what’s being done. The first half (the asm code) sorts d[0,2,4,6] and d[1,3,5,7] independently. The second half does a linear merge of the two sorted sequences.
#include <emmintrin.h>  // __m128i

void ssesort16d(double keys[16])
{
    union { __m128i m[8]; double d[16]; } temp;

// COMP(a,b,t): compute [a,b] = [min(a,b), max(a,b)], using temp (t)
# define _ "\n"
# define COMP(a,b,t) _" movdqa %%xmm"#a",%%xmm"#t \
                     _" minpd  %%xmm"#b",%%xmm"#a \
                     _" maxpd  %%xmm"#t",%%xmm"#b
# define LOAD(x) _" movaps "#x"*16(%%"SRC"),%%xmm"#x
# define SAVE(x) _" movaps %%xmm"#x","#x"*16(%%edi)"
# define SRC "esi" // initial loads from keys[]:

    asm(
    // Defer loading 7 (initially used as temp register):
        LOAD(0) LOAD(1) LOAD(2) LOAD(3) LOAD(4) LOAD(5) LOAD(6)

    // Bitonic step #1:
        COMP(0,1,7) SAVE(0)
        COMP(2,3,7)
        COMP(4,5,0) LOAD(7)
        COMP(6,7,0)

# undef  SRC
# define SRC "edi" // Now do (re)loads from temp[]:

    // Bitonic step #2:
        COMP(4,7,0) SAVE(7)
        COMP(5,6,0)
        COMP(1,2,7) LOAD(0)
        COMP(0,3,7)

        COMP(0,1,7) SAVE(0)
        COMP(2,3,7)
        COMP(4,5,0) LOAD(7)
        COMP(6,7,0)

    // Bitonic step #3:
        COMP(3,4,0) SAVE(4)
        COMP(2,5,0)
        COMP(1,6,4) LOAD(0)
        COMP(0,7,4)

        COMP(0,2,4) SAVE(0)
        COMP(1,3,4)
        COMP(5,7,0) LOAD(4)
        COMP(4,6,0)

        COMP(6,7,0)
    // Interleave last SAVEs with last COMPs:
        COMP(4,5,0) SAVE(7) SAVE(6) LOAD(0)
        COMP(2,3,7) SAVE(5) SAVE(4) SAVE(3)
        COMP(0,1,7) SAVE(2) SAVE(1) SAVE(0)
    : : "S"(keys), "D"(temp.m) : "memory" //temp
    );

    // Linear merge of temp.d[0,2,4,6] + temp.d[1,3,5,7] => keys:
    double *zp = keys, *ap = temp.d, *bp = ap + 1, b = *bp;

    do {
        double a = *ap;
        if (a > b) {
            double *tp = ap; ap = bp; bp = tp;
            *zp++ = b; b = a;
        } else {
            *zp++ = a;
        }
    } while ((ap += 2) < temp.d + 16);

    do *zp++ = *bp; while ((bp += 2) < temp.d + 16);
}
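One caveat worth making explicit: LOAD and SAVE use movaps, which faults on addresses that aren't 16-byte aligned, so keys[] must be aligned by the caller (temp is aligned automatically via its __m128i member). A quick test harness (my addition):

    #include <stdio.h>

    int main(void)
    {
        /* movaps requires 16-byte alignment, so force it on the input. */
        double keys[16] __attribute__((aligned(16))) =
            { 9, 3, 14, 1, 12, 0, 7, 5, 15, 2, 11, 8, 6, 13, 4, 10 };

        ssesort16d(keys);

        for (int i = 0; i < 16; i++)
            printf("%g ", keys[i]);
        printf("\n");
        return 0;
    }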