*Source code for this routine and many others using SSE2 in unusual ways is in this github repo.*

Since there have been a large number of hits on the “SSE2 bit matrix transpose” post,

here’s the full deal: transpose of an arbitrary matrix of [nrows x ncols] bits,

where each row begins on a byte boundary. An inner block uses one XMM register

to transpose a [16 x 8]-bit block. This requires gathering 16 bytes from sparse locations,

and scattering 8 half-words to sparse locations. Since the gather is a blocking operation on RAM, the (“cc”) loops are organized to travel over 16 streams of input.

#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>
void
sse_trans(uint8_t const *inp, uint8_t *out, int nrows, int ncols)
{
# define INP(x,y) inp[(x)*ncols/8 + (y)/8]
# define OUT(x,y) out[(y)*nrows/8 + (x)/8]
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
assert(nrows % 8 == 0 && ncols % 8 == 0);
// Do the main body in 16x8 blocks:
for (rr = 0; rr <= nrows - 16; rr += 16) {
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr,cc+i)= _mm_movemask_epi8(tmp.x);
}
}
if (rr == nrows) return;
// The remainder is a block of 8x(16n+8) bits (n may be 0).
// Do a PAIR of 8x8 blocks in each step:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 0; i < 8; ++i) {
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc);
tmp.b[i + 8] = h >> 8;
}
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) {
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x);
OUT(rr, cc + i + 8) = h >> 8;
}
}
if (cc == ncols) return;
// Do the remaining 8x8 block:
for (i = 0; i < 8; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}

### Like this:

Like Loading...

*Related*

##
About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com
My e-mail sig (since 1976):
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.

Pingback: What is SSE !@# good for? Transposing a bit matrix | Coding on the edges

Pingback: SSE2 bit matrix transpose special case … 8 x 256 … for Marek | Coding on the edges

Most of your 6th for-loop appears to be missing!

SNARL Thanks! I’ve been bitten by the wordpress munching “C” code when it looks a bit like HTML tags. I’ll fix it. In the meantime, github.com/mischasan/sse2 has “ssebmx.c”, which is the full routine, including the optimization for Marek’s particular case (8 x 256).

Thank you! I’m noticing that, bizarrely, the results are not exactly what I’m expecting. When I print the output as 64-bit numbers, it looks like the bits are in the reverse order I expected. Given that you’re doing it bitwise, I was expecting a little-endian shuffle by printing things as 64-bit numbers, but not a straight-up reversal. Does that make sense?

Hmmm. Not sure what’s up. I posted the unit test for ssebmx to github (a bit slow on doing that, my bad) and if you run “gmake ssebmx_t.pass”, that file shows the test matrices and their transposes (the expected values which match the actual values byte-for-byte) in graphic form, i.e. using ‘-‘ and ‘#’ to represent ‘0’ and ‘1’ bits, printed LSB-first. Could you post/send me the input data you tried, and I’ll add it as a test?