*If you find this interesting, check out my Oct post for the
full “C” routine for transposing an arbitrary-sized bit matrix. On an AMD64, it runs about 10-15x the speed of an efficiently-coded non-SSE routine.*

This is probably my last post on SSE2: a limited instruction set that spews edge cases all over, with a few interesting abuses that do some generic tasks efficiently.

Bit arrays (vectors and matrices) are an efficient way of storing and processing boolean information. They give you highly parallel processing (32 bits an operation). Bit matrices are less common, but are an efficient representation of a graph or network of connected nodes.

You may be familiar with matrix transpose and product operations:

C = transpose(A): C[i,j] = A[j,i] C = product(A, B): C[i,j] = sum(A[i,k] * B[k,j])

The bitwise equivalent to product uses “|” and “&” instead of “+” and “*”. For example, if bit A[i,j] means that node[i] connects to node[j] in a network, then

product(A, transpose(A))

computes which nodes can be reached in 2 steps. or less Repeating this operation produces matrices indicating which nodes are 3,4,5… steps apart.

The only down side to bit arrays is that transpose(A) is an alternate way of indexing (A), and “indexing” a bit array is not nearly as efficient as indexing, say, a byte array. That problem for bit vectors is a problem-squared for bit matrices. Row-major traversal is fine … almost no loss of parallelism. Column-major traversal stinks. Every bit is stored in a different word, scattered across memory. If your bit array is bigger than your RAM cache, you will feel pain.

If it were cheap to transpose (flip) the matrix, so that rows became columns and vice versa, then you’d have your efficiency back. This makes a matrix dot-product into a stream of AND’s and OR’s that takes full advantage of bit parallelism. Unfortunately, there’s no support for transposing bit matrices in CPU-register operations; and the registers themselves are pretty small.

But … tada! … SSE operations to the rescue. The MOVEMASK instruction packs the top bit of each of 16 bytes, into a 16-bit word. Using this, **sse_trans_slice** transposes a bit[16][8] array into a bit[8][16] array, in 24 instructions without branching (assuming your compiler unrolls small constant loops):

__m128i sse_trans_slice(__m128i) { union { unsigned short s[8]; __m128i m; } u; int i; for (i = 0; i < 8; ++i) { u.s[7-i]=_ mm_movemask_epi8(x); x = _mm_slli_epi64(x,1) } return u.m; }

This can be easily extended to transpose an array of [16M,8N] bits. The fun, as usual, comes in handling the edge conditions.

union {char b[16]; __m128i m; } s; for (slice = 0; slice < N; slice += 16) nrows = min(N - slice, 16); for (col = 0; col < M; ++col) for (row = 0; row < nrows; ++row) s.b[row] = Inp[slice+row][col]; s.m = sse_trans_slice(s.m); for (row = 0; row < nrows; ++row) Out[col][slice+row] = s.b[row];

Assembling an 16-byte value, with 16 wide-spread byte fetches in the innermost loop, is more efficient than it looks; the next iteration benefits from all 16 cache lines being in L1 cache.

I thought that this would be a good place to say thank you for this algorithm. It is used extensively in a compression library I wrote (https://github.com/kiyo-masui/bitshuffle). Also, I managed to get the phrase “What is SSE !@# good for?” published in the references of a peer-reviewed academic journal (http://dx.doi.org/10.1016/j.ascom.2015.07.002), so well done!

Very interesting post indeed, but for matrix transpose operation that I was looking for with SSE instruction sets, I found another implementation using the function _mm_shuffle_epi8 (source: http://pzemtsov.github.io/2014/10/01/how-to-transpose-a-16×16-matrix.html) instead of your algorithm using _mm_movemask_epi8 function. Could you please comment on the differences and advise which one is better/faster?

Umm … shufps is for bytes, not bits. How does that apply to _bit_ matrix transpose?