*For the complete source code for both sorting and ranking functions using SSE2, check out ssesort.c in this github repo*

I originally used **asm** to generate the bitonic sorter. After doing a little more testing, I found that gcc 4.4 generates very good SSE2 code, and the asm is not necessary. Here’s the pure-C version:

__m128d t, temp[8];
# define MIX(a,b) \
temp[a] = _mm_min_pd(t = LOAD(a), LOAD(b)); \
temp[b] = _mm_max_pd(t, LOAD(b))
// Initially load (unaligned) from keys:
# define LOAD(i) _mm_loadu_pd(keys + 2*(i))
// Bitonic step #1:
MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
// Switch to loading (aligned) from temp:
# undef LOAD
# define LOAD(i) temp[i]
// Bitonic step #2:
MIX(0,3); MIX(1,2); MIX(4,7); MIX(5,6);
MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
// Bitonic step #3:
MIX(0,7); MIX(1,6); MIX(2,5); MIX(3,4);
MIX(0,2); MIX(1,3); MIX(4,6); MIX(5,7);
MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
// Linear merge of temp.d[0,2,4,6] + temp.d[1,3,5,7] => keys:
double *zp = keys, *ap = (double*)temp, *bp = ap + 1, b = *bp;
// ..... ETC .....

The binary merge code, of course, remains the same. It’s also possible to do an SSE2 odd-even merge — but it’s strictly of academic interest, being slower than the simple binary merge.

### Like this:

Like Loading...

*Related*

## About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com
My e-mail sig (since 1976):
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.

Pingback: Okay, one more poke at SSE2: sorting doubles | Coding on the edges

Hi, please could you post the full code, because I’ve tried to put it together with the code from the last post, but my version fails. I’m not a competent C programmer, only Java. Thanks

Here’s the code in one file (ssesort.c). Sorry about the crappy formatting:

Here’s a unit test (sortest.c):

Here’s the command to compile them:

HTH

Thanks, meanwhile I found this to work: http://www.copypastecode.com/62884/ which seems like a copy of your previous code, just without the assembler.

I really want to eventually make something similar to the method(s) shown in this Intel study, because the performance looks awesome: http://pcl.intel-research.net/publications/sorting_vldb08.pdf . (Preferably for ints though.)

Pingback: SSE2 odd-even merge (the last step in sorting) | Coding on the edges