## What is SSE !@# good for? Transposing a bit matrix

If you find this interesting, check out my Oct post for the
full “C” routine
for transposing an arbitrary-sized bit matrix. On an AMD64, it runs about 10-15x the speed of an efficiently-coded non-SSE routine.

This is probably my last post on SSE2: a limited instruction set that spews edge cases all over, with a few interesting abuses that do some generic tasks efficiently.

Bit arrays (vectors and matrices) are an efficient way of storing and processing boolean information. They give you highly parallel processing (32 bits an operation). Bit matrices are less common, but are an efficient representation of a graph or network of connected nodes.

You may be familiar with matrix transpose and product operations:

```C = transpose(A): C[i,j] = A[j,i]
C = product(A, B): C[i,j] = sum(A[i,k] * B[k,j])```

The bitwise equivalent to product uses “|” and “&” instead of “+” and “*”. For example, if bit A[i,j] means that node[i] connects to node[j] in a network, then

`product(A, transpose(A))`

computes which nodes can be reached in 2 steps. or less Repeating this operation produces matrices indicating which nodes are 3,4,5… steps apart.

The only down side to bit arrays is that transpose(A) is an alternate way of indexing (A), and “indexing” a bit array is not nearly as efficient as indexing, say, a byte array. That problem for bit vectors is a problem-squared for bit matrices. Row-major traversal is fine … almost no loss of parallelism. Column-major traversal stinks. Every bit is stored in a different word, scattered across memory. If your bit array is bigger than your RAM cache, you will feel pain.

If it were cheap to transpose (flip) the matrix, so that rows became columns and vice versa, then you’d have your efficiency back. This makes a matrix dot-product into a stream of AND’s and OR’s that takes full advantage of bit parallelism. Unfortunately, there’s no support for transposing bit matrices in CPU-register operations; and the registers themselves are pretty small.

But … tada! … SSE operations to the rescue. The MOVEMASK instruction packs the top bit of each of 16 bytes, into a 16-bit word. Using this, sse_trans_slice transposes a bit[16][8] array into a bit[8][16] array, in 24 instructions without branching (assuming your compiler unrolls small constant loops):

```__m128i sse_trans_slice(__m128i)
{
union { unsigned short s[8]; __m128i m; } u;
int    i;
for (i = 0; i < 8; ++i) {
x = _mm_slli_epi64(x,1)
}
return  u.m;
}```

This can be easily extended to transpose an array of [16M,8N] bits. The fun, as usual, comes in handling the edge conditions.

```union {char b[16]; __m128i m; } s;
for (slice = 0; slice < N; slice += 16)
nrows = min(N - slice, 16);
for (col = 0; col < M; ++col)
for (row = 0; row < nrows; ++row)  s.b[row] = Inp[slice+row][col];
s.m = sse_trans_slice(s.m);
for (row = 0; row < nrows; ++row)  Out[col][slice+row] = s.b[row];```

Assembling an 16-byte value, with 16 wide-spread byte fetches in the innermost loop, is more efficient than it looks; the next iteration benefits from all 16 cache lines being in L1 cache.