The full SSE2 bit matrix transpose routine

Source code for this routine and many others using SSE2 in unusual ways is in this github repo.

Since there have been a large number of hits on the “SSE2 bit matrix transpose” post,
here’s the full deal: transpose of an arbitrary matrix of [nrows x ncols] bits,
where each row begins on a byte boundary. An inner block uses one XMM register
to transpose a [16 x 8]-bit block. This requires gathering 16 bytes from sparse locations,
and scattering 8 half-words to sparse locations. Since the gather is a blocking operation on RAM, the (“cc”) loops are organized to travel over 16 streams of input.

#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

void
sse_trans(uint8_t const *inp, uint8_t *out, int nrows, int ncols)
{
#   define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#   define OUT(x,y) out[(y)*nrows/8 + (x)/8]
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;
    assert(nrows % 8 == 0 && ncols % 8 == 0);

    // Do the main body in 16x8 blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16) {
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr,cc+i)= _mm_movemask_epi8(tmp.x);
        }
    }
    if (rr == nrows) return;

    // The remainder is a block of 8x(16n+8) bits (n may be 0).
    //  Do a PAIR of 8x8 blocks in each step:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 0; i < 8; ++i) {
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc);
            tmp.b[i + 8] = h >> 8;
        }
        for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) {
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x);
            OUT(rr, cc + i + 8) = h >> 8;
        }
    }
    if (cc == ncols) return;

    //  Do the remaining 8x8 block:
    for (i = 0; i < 8; ++i)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

6 Responses to The full SSE2 bit matrix transpose routine

  1. Pingback: What is SSE !@# good for? Transposing a bit matrix | Coding on the edges

  2. Pingback: SSE2 bit matrix transpose special case … 8 x 256 … for Marek | Coding on the edges

  3. Kyle Wheeler says:

    Most of your 6th for-loop appears to be missing!

    • mischasan says:

      SNARL Thanks! I’ve been bitten by the wordpress munching “C” code when it looks a bit like HTML tags. I’ll fix it. In the meantime, github.com/mischasan/sse2 has “ssebmx.c”, which is the full routine, including the optimization for Marek’s particular case (8 x 256).

      • Kyle Wheeler says:

        Thank you! I’m noticing that, bizarrely, the results are not exactly what I’m expecting. When I print the output as 64-bit numbers, it looks like the bits are in the reverse order I expected. Given that you’re doing it bitwise, I was expecting a little-endian shuffle by printing things as 64-bit numbers, but not a straight-up reversal. Does that make sense?

  4. mischasan says:

    Hmmm. Not sure what’s up. I posted the unit test for ssebmx to github (a bit slow on doing that, my bad) and if you run “gmake ssebmx_t.pass”, that file shows the test matrices and their transposes (the expected values which match the actual values byte-for-byte) in graphic form, i.e. using ‘-‘ and ‘#’ to represent ‘0’ and ‘1’ bits, printed LSB-first. Could you post/send me the input data you tried, and I’ll add it as a test?

Leave a comment