## The full SSE2 bit matrix transpose routine

Source code for this routine and many others using SSE2 in unusual ways is in this github repo.

Since there have been a large number of hits on the “SSE2 bit matrix transpose” post,
here’s the full deal: transpose of an arbitrary matrix of [nrows x ncols] bits,
where each row begins on a byte boundary. An inner block uses one XMM register
to transpose a [16 x 8]-bit block. This requires gathering 16 bytes from sparse locations,
and scattering 8 half-words to sparse locations. Since the gather is a blocking operation on RAM, the (“cc”) loops are organized to travel over 16 streams of input.

```#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

void
sse_trans(uint8_t const *inp, uint8_t *out, int nrows, int ncols)
{
#   define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#   define OUT(x,y) out[(y)*nrows/8 + (x)/8]
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
assert(nrows % 8 == 0 && ncols % 8 == 0);

// Do the main body in 16x8 blocks:
for (rr = 0; rr <= nrows - 16; rr += 16) {
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
}
}
if (rr == nrows) return;

// The remainder is a block of 8x(16n+8) bits (n may be 0).
//  Do a PAIR of 8x8 blocks in each step:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 0; i < 8; ++i) {
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc);
tmp.b[i + 8] = h >> 8;
}
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) {
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x);
OUT(rr, cc + i + 8) = h >> 8;
}
}
if (cc == ncols) return;

//  Do the remaining 8x8 block:
for (i = 0; i < 8; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}```

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

### 6 Responses to The full SSE2 bit matrix transpose routine

1. Kyle Wheeler says:

Most of your 6th for-loop appears to be missing!

• mischasan says:

SNARL Thanks! I’ve been bitten by the wordpress munching “C” code when it looks a bit like HTML tags. I’ll fix it. In the meantime, github.com/mischasan/sse2 has “ssebmx.c”, which is the full routine, including the optimization for Marek’s particular case (8 x 256).

• Kyle Wheeler says:

Thank you! I’m noticing that, bizarrely, the results are not exactly what I’m expecting. When I print the output as 64-bit numbers, it looks like the bits are in the reverse order I expected. Given that you’re doing it bitwise, I was expecting a little-endian shuffle by printing things as 64-bit numbers, but not a straight-up reversal. Does that make sense?

2. mischasan says:

Hmmm. Not sure what’s up. I posted the unit test for ssebmx to github (a bit slow on doing that, my bad) and if you run “gmake ssebmx_t.pass”, that file shows the test matrices and their transposes (the expected values which match the actual values byte-for-byte) in graphic form, i.e. using ‘-‘ and ‘#’ to represent ‘0’ and ‘1’ bits, printed LSB-first. Could you post/send me the input data you tried, and I’ll add it as a test?