Source code for this routine and many others using SSE2 in unusual ways is in this github repo.
Since there have been a large number of hits on the “SSE2 bit matrix transpose” post,
here’s the full deal: transpose of an arbitrary matrix of [nrows x ncols] bits,
where each row begins on a byte boundary. An inner block uses one XMM register
to transpose a [16 x 8]-bit block. This requires gathering 16 bytes from sparse locations,
and scattering 8 half-words to sparse locations. Since the gather is a blocking operation on RAM, the (“cc”) loops are organized to travel over 16 streams of input.
#include <assert.h> #include <emmintrin.h> #include <stdint.h> void sse_trans(uint8_t const *inp, uint8_t *out, int nrows, int ncols) { # define INP(x,y) inp[(x)*ncols/8 + (y)/8] # define OUT(x,y) out[(y)*nrows/8 + (x)/8] int rr, cc, i, h; union { __m128i x; uint8_t b[16]; } tmp; assert(nrows % 8 == 0 && ncols % 8 == 0); // Do the main body in 16x8 blocks: for (rr = 0; rr <= nrows - 16; rr += 16) { for (cc = 0; cc < ncols; cc += 8) { for (i = 0; i < 16; ++i) tmp.b[i] = INP(rr + i, cc); for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) *(uint16_t*)&OUT(rr,cc+i)= _mm_movemask_epi8(tmp.x); } } if (rr == nrows) return; // The remainder is a block of 8x(16n+8) bits (n may be 0). // Do a PAIR of 8x8 blocks in each step: for (cc = 0; cc <= ncols - 16; cc += 16) { for (i = 0; i < 8; ++i) { tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc); tmp.b[i + 8] = h >> 8; } for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) { OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x); OUT(rr, cc + i + 8) = h >> 8; } } if (cc == ncols) return; // Do the remaining 8x8 block: for (i = 0; i < 8; ++i) tmp.b[i] = INP(rr + i, cc); for (i = 8; --i >= 0; tmp.x = _mm_slli_epi64(tmp.x, 1)) OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x); }
Pingback: What is SSE !@# good for? Transposing a bit matrix | Coding on the edges
Pingback: SSE2 bit matrix transpose special case … 8 x 256 … for Marek | Coding on the edges
Most of your 6th for-loop appears to be missing!
SNARL Thanks! I’ve been bitten by the wordpress munching “C” code when it looks a bit like HTML tags. I’ll fix it. In the meantime, github.com/mischasan/sse2 has “ssebmx.c”, which is the full routine, including the optimization for Marek’s particular case (8 x 256).
Thank you! I’m noticing that, bizarrely, the results are not exactly what I’m expecting. When I print the output as 64-bit numbers, it looks like the bits are in the reverse order I expected. Given that you’re doing it bitwise, I was expecting a little-endian shuffle by printing things as 64-bit numbers, but not a straight-up reversal. Does that make sense?
Hmmm. Not sure what’s up. I posted the unit test for ssebmx to github (a bit slow on doing that, my bad) and if you run “gmake ssebmx_t.pass”, that file shows the test matrices and their transposes (the expected values which match the actual values byte-for-byte) in graphic form, i.e. using ‘-‘ and ‘#’ to represent ‘0’ and ‘1’ bits, printed LSB-first. Could you post/send me the input data you tried, and I’ll add it as a test?