SSE2 odd-even merge (the last step in sorting)

If you’ve looked at my example of bitonic sort in SSE2 in ASM or in “C”, you’ll see that the clever stuff ends with two eight-element sorted sequences. The final step is a simple loop that merges the two sequences. A commenter (Thomas Knight) asked if there was something smarter you could do for the merge, using SSE2, and pointing out an interesting paper on using bitonic merge and odd-even merge, also in SSE2 registers.

It would be cool if the merge step could be sped up — in fact I tried and gave up on odd-even merge. I couldn’t find a series of SSE2 ops that came out faster than the simple loop. The bitonic sort is about 80 instructions, and the odd-even merge about the same. Bitonic merge is mildly interesting, and looks just like the code you generate for a GPU. Unfortunately it involves more operations than odd-even merge, that just drive up its cost (time) on a single-instruction stream CPU using SSE2.

A problem with the SSE2 merge is that, to do the zippy branch-free MIN/MAX operations, the pairs of doubles in XMM registers must be transposed, twice. The first transposition puts adjacent elements of each sequence ( [0,1], [2,3], [4,5] …) in the same registers. The second transposition is needed to compare adjacent elements. Add in 13 MINMAX operations (39 instructions) and odd-even merge just can’t compete.

For anyone who cares to see the optimal odd-even merge:

#   define  LOADLO(x, d)    (x = _mm_loadl_pd(x, &(d)))
#   define  STORELO(d, x)   _mm_storel_pd(&(d), x)
#   define  STOREHI(d, x)   _mm_storeh_pd(&(d), x)
#   define  UNPACKHI(a, b)  (a = (__m128d)_mm_unpackhi_epi64((__m128i)(a), (__m128i)(b)))
#   define  UNPACKLO(a, b)  (a = (__m128d)_mm_unpacklo_epi64((__m128i)(a), (__m128i)(b)))
    // At the end of the bitonic sort, xmm registers (X0..X7) hold
    // two 8-element sorted sequences, [0..7] and [8..f].
    // The first element of each sequence is in X0, etc.
    // X0 X1 X2 X3 X4 X5 X6 X7 <<reg
    // 08 19 2a 3b 4c 5d 6e 7f <<(lo,hi)
    // For MINMAX ops, you need values to compare be in different regs.
    //  For all but the last odd-even merge MINMAX steps, you need adjacent
    //  pairs of each seq to be in the same reg (e.g X0=01, X1=23,..,89,ab,..)
    // In the following, '_' stands for a don't-care position.
    double  d[4];
    STOREHI(d[0], temp[0]);     // 0_ 19 2a 3b 4c 5d 6e 7f
    STOREHI(d[1], temp[2]);     // 0_ 19 2_ 3b 4c 5d 6e 7f
    STOREHI(d[2], temp[4]);     // 0_ 19 2_ 3b 4_ 5d 6e 7f
    STOREHI(d[3], temp[6]);     // 0_ 19 2_ 3b 4_ 5d 6_ 7f
    UNPACKLO(temp[0], temp[1]); // 01 _9 2_ 3b 4_ 5d 6_ 7f
    UNPACKLO(temp[2], temp[3]); // 01 _9 23 _b 4_ 5d 6_ 7f
    UNPACKLO(temp[4], temp[5]); // 01 _9 23 _b 45 _d 6_ 7f
    UNPACKLO(temp[6], temp[7]); // 01 _9 23 _b 45 _d 67 _f
    LOADLO(temp[1], d[0]);      // 01 89 23 _b 45 _d 67 _f
    LOADLO(temp[3], d[1]);      // 01 89 23 ab 45 _d 67 _f
    LOADLO(temp[5], d[2]);      // 01 89 23 ab 45 cd 67 _f
    LOADLO(temp[7], d[3]);      // 01 89 23 ab 45 cd 67 ef
    MINMAX(0, 1);   // 01:89
    MINMAX(2, 3);   // 23:ab
    MINMAX(4, 5);   // 45:cd
    MINMAX(6, 7);   // 67:ef
    STORELO(keys[0],  temp[0]); // _1 89 23 ab 45 cd 67 ef
    STOREHI(keys[15], temp[7]); // _1 89 23 ab 45 cd 67 e_
    MINMAX(4, 1);   // 45:89
    MINMAX(6, 3);   // 67:ab
    MINMAX(2, 4);   // 23:45
    MINMAX(6, 1);   // 67:89
    MINMAX(3, 5);   // ab:cd
    UNPACKHI(temp[0], temp[2]); // 13 89 2_ ab 45 cd 67 e_
    UNPACKLO(temp[2], temp[4]); // 13 89 24 ab _5 cd 67 e_
    UNPACKHI(temp[4], temp[6]); // 13 89 24 ab 57 cd 6_ e_
    UNPACKLO(temp[6], temp[1]); // 13 _9 24 ab 57 cd 68 e_
    UNPACKHI(temp[1], temp[3]); // 13 9b 24 a_ 57 cd 68 e_
    UNPACKLO(temp[3], temp[5]); // 13 9b 24 ac 57 _d 68 e_
    UNPACKHI(temp[5], temp[7]); // 13 9b 24 ac 57 d_ 68 e_
    MINMAX(0, 2);   // 13:24
    MINMAX(4, 6);   // 57:68
    MINMAX(1, 3);   // 9b:ac
    MINMAX(5, 7);   // dd:e_ only lo part used
    STORELO(keys[ 1], temp[0]);
    STOREHI(keys[ 3], temp[0]);
    STORELO(keys[ 2], temp[2]);
    STOREHI(keys[ 4], temp[2]);
    STORELO(keys[ 5], temp[4]);
    STOREHI(keys[ 7], temp[4]);
    STORELO(keys[ 6], temp[6]);
    STOREHI(keys[ 8], temp[6]);
    STORELO(keys[ 9], temp[1]);
    STOREHI(keys[11], temp[1]);
    STORELO(keys[10], temp[3]);
    STOREHI(keys[12], temp[3]);
    STORELO(keys[13], temp[5]);
    STORELO(keys[14], temp[7]);
Advertisements

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

One Response to SSE2 odd-even merge (the last step in sorting)

  1. Pingback: Update on bitonic SSE2 sort of 16 doubles | Coding on the edges

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s