## SSE2 odd-even merge (the last step in sorting)

If you’ve looked at my example of bitonic sort in SSE2 in ASM or in “C”, you’ll see that the clever stuff ends with two eight-element sorted sequences. The final step is a simple loop that merges the two sequences. A commenter (Thomas Knight) asked if there was something smarter you could do for the merge, using SSE2, and pointing out an interesting paper on using bitonic merge and odd-even merge, also in SSE2 registers.

It would be cool if the merge step could be sped up — in fact I tried and gave up on odd-even merge. I couldn’t find a series of SSE2 ops that came out faster than the simple loop. The bitonic sort is about 80 instructions, and the odd-even merge about the same. Bitonic merge is mildly interesting, and looks just like the code you generate for a GPU. Unfortunately it involves more operations than odd-even merge, that just drive up its cost (time) on a single-instruction stream CPU using SSE2.

A problem with the SSE2 merge is that, to do the zippy branch-free MIN/MAX operations, the pairs of doubles in XMM registers must be transposed, twice. The first transposition puts adjacent elements of each sequence ( [0,1], [2,3], [4,5] …) in the same registers. The second transposition is needed to compare adjacent elements. Add in 13 MINMAX operations (39 instructions) and odd-even merge just can’t compete.

For anyone who cares to see the optimal odd-even merge:

```#   define  LOADLO(x, d)    (x = _mm_loadl_pd(x, &(d)))
#   define  STORELO(d, x)   _mm_storel_pd(&(d), x)
#   define  STOREHI(d, x)   _mm_storeh_pd(&(d), x)
#   define  UNPACKHI(a, b)  (a = (__m128d)_mm_unpackhi_epi64((__m128i)(a), (__m128i)(b)))
#   define  UNPACKLO(a, b)  (a = (__m128d)_mm_unpacklo_epi64((__m128i)(a), (__m128i)(b)))
// At the end of the bitonic sort, xmm registers (X0..X7) hold
// two 8-element sorted sequences, [0..7] and [8..f].
// The first element of each sequence is in X0, etc.
// X0 X1 X2 X3 X4 X5 X6 X7 <<reg
// 08 19 2a 3b 4c 5d 6e 7f <<(lo,hi)
// For MINMAX ops, you need values to compare be in different regs.
//  For all but the last odd-even merge MINMAX steps, you need adjacent
//  pairs of each seq to be in the same reg (e.g X0=01, X1=23,..,89,ab,..)
// In the following, '_' stands for a don't-care position.
double  d[4];
STOREHI(d[0], temp[0]);     // 0_ 19 2a 3b 4c 5d 6e 7f
STOREHI(d[1], temp[2]);     // 0_ 19 2_ 3b 4c 5d 6e 7f
STOREHI(d[2], temp[4]);     // 0_ 19 2_ 3b 4_ 5d 6e 7f
STOREHI(d[3], temp[6]);     // 0_ 19 2_ 3b 4_ 5d 6_ 7f
UNPACKLO(temp[0], temp[1]); // 01 _9 2_ 3b 4_ 5d 6_ 7f
UNPACKLO(temp[2], temp[3]); // 01 _9 23 _b 4_ 5d 6_ 7f
UNPACKLO(temp[4], temp[5]); // 01 _9 23 _b 45 _d 6_ 7f
UNPACKLO(temp[6], temp[7]); // 01 _9 23 _b 45 _d 67 _f
LOADLO(temp[1], d[0]);      // 01 89 23 _b 45 _d 67 _f
LOADLO(temp[3], d[1]);      // 01 89 23 ab 45 _d 67 _f
LOADLO(temp[5], d[2]);      // 01 89 23 ab 45 cd 67 _f
LOADLO(temp[7], d[3]);      // 01 89 23 ab 45 cd 67 ef
MINMAX(0, 1);   // 01:89
MINMAX(2, 3);   // 23:ab
MINMAX(4, 5);   // 45:cd
MINMAX(6, 7);   // 67:ef
STORELO(keys[0],  temp[0]); // _1 89 23 ab 45 cd 67 ef
STOREHI(keys[15], temp[7]); // _1 89 23 ab 45 cd 67 e_
MINMAX(4, 1);   // 45:89
MINMAX(6, 3);   // 67:ab
MINMAX(2, 4);   // 23:45
MINMAX(6, 1);   // 67:89
MINMAX(3, 5);   // ab:cd
UNPACKHI(temp[0], temp[2]); // 13 89 2_ ab 45 cd 67 e_
UNPACKLO(temp[2], temp[4]); // 13 89 24 ab _5 cd 67 e_
UNPACKHI(temp[4], temp[6]); // 13 89 24 ab 57 cd 6_ e_
UNPACKLO(temp[6], temp[1]); // 13 _9 24 ab 57 cd 68 e_
UNPACKHI(temp[1], temp[3]); // 13 9b 24 a_ 57 cd 68 e_
UNPACKLO(temp[3], temp[5]); // 13 9b 24 ac 57 _d 68 e_
UNPACKHI(temp[5], temp[7]); // 13 9b 24 ac 57 d_ 68 e_
MINMAX(0, 2);   // 13:24
MINMAX(4, 6);   // 57:68
MINMAX(1, 3);   // 9b:ac
MINMAX(5, 7);   // dd:e_ only lo part used
STORELO(keys[ 1], temp[0]);
STOREHI(keys[ 3], temp[0]);
STORELO(keys[ 2], temp[2]);
STOREHI(keys[ 4], temp[2]);
STORELO(keys[ 5], temp[4]);
STOREHI(keys[ 7], temp[4]);
STORELO(keys[ 6], temp[6]);
STOREHI(keys[ 8], temp[6]);
STORELO(keys[ 9], temp[1]);
STOREHI(keys[11], temp[1]);
STORELO(keys[10], temp[3]);
STOREHI(keys[12], temp[3]);
STORELO(keys[13], temp[5]);
STORELO(keys[14], temp[7]);```