Update on bitonic SSE2 sort of 16 doubles

For the complete source code for both sorting and ranking functions using SSE2, check out ssesort.c in this github repo

I originally used asm to generate the bitonic sorter. After doing a little more testing, I found that gcc 4.4 generates very good SSE2 code, and the asm is not necessary. Here’s the pure-C version:

   __m128d     t, temp[8];
#   define MIX(a,b) \
        temp[a] = _mm_min_pd(t = LOAD(a), LOAD(b)); \
        temp[b] = _mm_max_pd(t, LOAD(b))

    // Initially load (unaligned) from keys:
#   define LOAD(i) _mm_loadu_pd(keys + 2*(i))

    // Bitonic step #1:
    MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);

    // Switch to loading (aligned) from temp:
#   undef  LOAD
#   define LOAD(i) temp[i]

        // Bitonic step #2:
    MIX(0,3); MIX(1,2); MIX(4,7); MIX(5,6);
    MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);

        // Bitonic step #3:
    MIX(0,7); MIX(1,6); MIX(2,5); MIX(3,4);
    MIX(0,2); MIX(1,3); MIX(4,6); MIX(5,7);
    MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);

    // Linear merge of temp.d[0,2,4,6] + temp.d[1,3,5,7] => keys:
    double  *zp = keys, *ap = (double*)temp, *bp = ap + 1, b = *bp;
    // ..... ETC .....

The binary merge code, of course, remains the same. It’s also possible to do an SSE2 odd-even merge — but it’s strictly of academic interest, being slower than the simple binary merge.

Advertisements

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm and tagged , , , . Bookmark the permalink.

5 Responses to Update on bitonic SSE2 sort of 16 doubles

  1. Pingback: Okay, one more poke at SSE2: sorting doubles | Coding on the edges

  2. Tom says:

    Hi, please could you post the full code, because I’ve tried to put it together with the code from the last post, but my version fails. I’m not a competent C programmer, only Java. Thanks

    • mischasan says:

      Here’s the code in one file (ssesort.c). Sorry about the crappy formatting:

      #include <xmmintrin.h>
      void ssesort16d(double keys[16]);
      void ssesort16d(double keys[16])
      {
          __m128d     t, temp[8];
          // MIX(a,b): [a,b] := [min(a,b), max(a,b)]
      #   define MIX(a,b) \
              temp[a] = _mm_min_pd(t = LOAD(a), LOAD(b)), \
              temp[b] = _mm_max_pd(t, LOAD(b))
          // Initially load from keys[]
      #   define LOAD(i) _mm_loadu_pd(keys + 2*(i))
          // Bitonic step #1:
          MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
          // Switch to loading (aligned) from temp[]
      #   undef  LOAD
      #   define LOAD(i) temp[i]
              // Bitonic step #2:
          MIX(0,3); MIX(1,2); MIX(4,7); MIX(5,6);
          MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
              // Bitonic step #3:
          MIX(0,7); MIX(1,6); MIX(2,5); MIX(3,4);
          MIX(0,2); MIX(1,3); MIX(4,6); MIX(5,7);
          MIX(0,1); MIX(2,3); MIX(4,5); MIX(6,7);
          // Linear merge of temp.d[0,2,4,6] + temp.d[1,3,5,7] => keys:
          double  *zp = keys, *ap = (double*)temp, *bp = ap + 1, b = *bp;
          do {
              double  a = *ap;
              if (a > b) {
                  double  *tp = ap; ap = bp; bp = tp;
                  *zp++ = b;
                  b = a;
              } else *zp++ = a;
          } while ((ap += 2) < (double*)temp + 16);
          do *zp++ = *bp; while ((bp += 2) < (double*)temp + 16);
      }
      

      Here’s a unit test (sortest.c):

      #include <stdio.h>
      #include <string.h>
      #include <stdlib.h>
      typedef double D16[16];// __attribute( (__aligned__(16)) );
      typedef int (*cmpfn_t)(const void *, const void *);
      static int  cmpd(double const*ap, double const *bp);
      static void dumpd(double v[16], const char *label);
      
      void ssesort16d(double keys[16]);
      
      int main(void)
      {
          int  i;
          D16 input, output;
      
          for (i = 0; i < 16; ++i) {
              input[i] = (i * 11 + 37) % 64;
              *(char*)&input[i] = i;
          }
      
          dumpd(input, "unsorted");
          memcpy(output, input, sizeof(input));
          qsort(output, 16, sizeof(double), (cmpfn_t)cmpd);
          dumpd(output, "qsort");
          memcpy(output, input, sizeof(input));
          ssesort16d(output);
          dumpd(output, "ssesort");
          return 0;
      }
      
      int cmpd(double const*ap, double const *bp)
      {
          double diff = *ap - *bp;
          return  diff < 0 -1 : diff > 0 ? 1 : 0;
      }
      
      void dumpd(double v[16], const char *label)
      {
          int         r, n0, n1;
          fprintf(stderr, "# %s:\n", label);
          for (r = 0; r < 16; r += 2) {
              n0 = 15 & *(char*)&v[r];
              n1 = 15 & *(char*)&v[r+1];
              fprintf(stderr, "# %2d:%4g %2d:%4g\n", n0, v[r], n1, v[r+1]);
          }
      }
      

      Here’s the command to compile them:

      gcc -g -O9 -msse3 -D_GNU_SOURCE -o sortest sortest.c ssesort.c
      

      HTH

  3. Pingback: SSE2 odd-even merge (the last step in sorting) | Coding on the edges

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s