It turns out that Marek’s Idea of the Day: Bitsliced SipHash used my SSE2 bit-matrix transpose routine, but it wasn’t fast enough. This is normally the case for SSE2: the more specific the problem, the better the code can be. I have no way of contacting Marek, but, if you see this link, here’s the optimal code for transposing an 8 x 256 bit matrix. It cuts the number of operations in half. The trick is, knowing that there are only 8 rows, the XMM operations can transpose two sets of 8 bytes simultaneously. The change is worth adding as a special case to the full transpose routine, but for simplicity’s sake, here’s the part Marek could use:

void bitmat_trans_8x256(char *inp, char *out)
{
uint16_t h, *hinp = (uint16_t*)inp;
union { __m128i x; uint8_t b[16]; } tmp;
int r, c;
for (c = 0; c < 16; ++c) {
for (r = 0; r < 7; ++r) {
tmp.b[r] = h = hinp[r * 16 + c];
tmp.b[r + 8] = h >> 8;
}
for (r = 0; r < 7; ++r) {
out[c * 16 + r] = h = _mm_movemask_epi8(tmp.x);
out[c * 16 + r + 8] = h >> 8;
tmp.x = _mm_slli_epi64(tmp.x, 1);
}
}
}

Compiled on a Core7 using gcc 4.4.6, it amounts to a 70-instruction loop repeated 16 times.

### Like this:

Like Loading...

*Related*

## About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com
My e-mail sig (since 1976):
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.