What is SSE !@# good for? Transposing a bit matrix

If you find this interesting, check out my Oct post for the full C routine for transposing an arbitrary-sized bit matrix. On an AMD64, it runs at about 10-15x the speed of an efficiently coded non-SSE routine.

This is probably my last post on SSE2: a limited instruction set that spews edge cases all over, but with a few interesting abuses that handle some generic tasks efficiently.

Bit arrays (vectors and matrices) are an efficient way of storing and processing boolean information. They give you highly parallel processing (32 bits per operation). Bit matrices are less common, but are an efficient representation of a graph or network of connected nodes.

You may be familiar with matrix transpose and product operations:

C = transpose(A): C[i,j] = A[j,i]
C = product(A, B): C[i,j] = sum(A[i,k] * B[k,j])

The bitwise equivalent to product uses “|” and “&” instead of “+” and “*”. For example, if bit A[i,j] means that node[i] connects to node[j] in a network, then

product(A, transpose(A))

computes which nodes can be reached in 2 steps or less. Repeating this operation produces matrices indicating which nodes are 3, 4, 5… steps apart.
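In plain C, without SSE, one step of that boolean product can be sketched as follows. The 32-node cap and the one-uint32_t-row-per-node layout are my assumptions for illustration:

```c
#include <stdint.h>

enum { NODES = 32 };  /* illustrative size: one uint32_t row per node */

/* One step of the boolean product: c[i] gets bit j set iff some k has
   A[i][k] and B[k][j] -- OR in row k of B for every set bit k of a[i]. */
static void bool_product(const uint32_t a[NODES], const uint32_t b[NODES],
                         uint32_t c[NODES])
{
    for (int i = 0; i < NODES; ++i) {
        uint32_t acc = 0;
        for (int k = 0; k < NODES; ++k)
            if (a[i] >> k & 1)
                acc |= b[k];
        c[i] = acc;
    }
}
```

Each output row is built with pure word-wide OR operations, so the inner work stays fully bit-parallel.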

The only downside to bit arrays is that transpose(A) is just an alternate way of indexing A, and “indexing” a bit array is not nearly as efficient as indexing, say, a byte array. That problem for bit vectors is a problem-squared for bit matrices. Row-major traversal is fine … almost no loss of parallelism. Column-major traversal stinks. Every bit is stored in a different word, scattered across memory. If your bit array is bigger than your RAM cache, you will feel pain.
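For a concrete picture of that cost, here is what a single access looks like under a hypothetical row-major packing into 32-bit words (ncols a multiple of 32). Walking down a column advances ncols/32 words per bit, so a wide matrix touches a new cache line on nearly every step:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit (i,j) of a bit matrix packed row-major into uint32_t words,
   ncols a multiple of 32: a multiply, a load, a shift, and a mask
   just to read one bit. */
static int get_bit(const uint32_t *a, int ncols, int i, int j)
{
    return (a[((size_t)i * ncols + j) >> 5] >> (j & 31)) & 1;
}
```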

If it were cheap to transpose (flip) the matrix, so that rows became columns and vice versa, then you’d have your efficiency back. This makes a matrix dot-product into a stream of AND’s and OR’s that takes full advantage of bit parallelism. Unfortunately, there’s no support for transposing bit matrices in CPU-register operations; and the registers themselves are pretty small.

But … tada! … SSE operations to the rescue. The MOVEMASK instruction packs the top bit of each of 16 bytes into a 16-bit word. Using this, sse_trans_slice transposes a bit[16][8] array into a bit[8][16] array in 24 instructions, without branching (assuming your compiler unrolls small constant loops):

__m128i sse_trans_slice(__m128i x)
{
    union { unsigned short s[8]; __m128i m; } u;
    int    i;

    for (i = 0; i < 8; ++i) {
        u.s[7-i] = _mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1);
    }
    return u.m;
}
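As a quick sanity check (self-contained; assumes an SSE2-capable compiler and <emmintrin.h>), feed the slice a pattern whose transpose is obvious: if row k is 0xFF exactly when k is odd, every input bit-column reads 0101…, so all eight 16-bit output rows should be 0xAAAA:

```c
#include <emmintrin.h>

/* The routine from the post. */
static __m128i sse_trans_slice(__m128i x)
{
    union { unsigned short s[8]; __m128i m; } u;
    int i;

    for (i = 0; i < 8; ++i) {
        u.s[7 - i] = (unsigned short)_mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1);
    }
    return u.m;
}

/* Transpose the odd-rows-set pattern; return 1 iff every output
   row is 0xAAAA (bit k set exactly when input row k was odd). */
static int slice_check(void)
{
    union { unsigned char b[16]; __m128i m; } in;
    union { unsigned short s[8]; __m128i m; } out;
    int k;

    for (k = 0; k < 16; ++k) in.b[k] = (k & 1) ? 0xFF : 0x00;
    out.m = sse_trans_slice(in.m);
    for (k = 0; k < 8; ++k)
        if (out.s[k] != 0xAAAA) return 0;
    return 1;
}
```

Note that _mm_slli_epi64 lets bits cross byte boundaries, but a migrated bit enters at bit 0 of the next byte and cannot reach bit 7 within the remaining shifts, so MOVEMASK never sees it.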

This can be easily extended to transpose an [N x 8M]-bit array (N rows of M bytes each). The fun, as usual, comes in handling the edge conditions.

union { char b[16]; __m128i m; } s;

for (slice = 0; slice < N; slice += 16) {
    nrows = min(N - slice, 16);
    for (col = 0; col < M; ++col) {
        for (row = 0; row < nrows; ++row)  s.b[row] = Inp[slice+row][col];
        s.m = sse_trans_slice(s.m);
        for (row = 0; row < nrows; ++row)  Out[col][slice+row] = s.b[row];
    }
}
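For completeness, the loop above can be packaged as a self-contained, compilable routine with a scalar bit-for-bit cross-check. The fixed N and M, the static arrays, and the check formula are mine, added for illustration; N is kept a multiple of 16 to sidestep the partial-slice edge case:

```c
#include <emmintrin.h>

enum { N = 32, M = 3 };  /* N a multiple of 16 keeps the demo edge-free */

static unsigned char Inp[N][M], Out[M][N];

static __m128i sse_trans_slice(__m128i x)
{
    union { unsigned short s[8]; __m128i m; } u;
    int i;

    for (i = 0; i < 8; ++i) {
        u.s[7 - i] = (unsigned short)_mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1);
    }
    return u.m;
}

static void bit_transpose(void)
{
    union { unsigned char b[16]; __m128i m; } s;

    for (int slice = 0; slice < N; slice += 16) {
        int nrows = N - slice < 16 ? N - slice : 16;  /* inline min() */
        for (int col = 0; col < M; ++col) {
            for (int row = 0; row < nrows; ++row) s.b[row] = Inp[slice+row][col];
            s.m = sse_trans_slice(s.m);
            for (int row = 0; row < nrows; ++row) Out[col][slice+row] = s.b[row];
        }
    }
}

/* Scalar cross-check: bit j of Inp[r][col] must land in
   Out[col][16*(r/16) + 2*j + (r%16)/8], bit position r%8.
   Returns 1 iff every bit matches. */
static int check(void)
{
    for (int r = 0; r < N; ++r)
        for (int col = 0; col < M; ++col)
            for (int j = 0; j < 8; ++j) {
                int in  = (Inp[r][col] >> j) & 1;
                int out = (Out[col][16*(r/16) + 2*j + (r%16)/8] >> (r % 8)) & 1;
                if (in != out) return 0;
            }
    return 1;
}
```

The output layout is not a plain [8M][N] bit matrix: within each 16-row slice, every output row is a 16-bit word stored as two adjacent bytes, which is exactly what the check formula accounts for.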

Assembling a 16-byte value with 16 widely scattered byte fetches in the innermost loop is more efficient than it looks: the next iteration benefits from all 16 cache lines already being in L1 cache.

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level (optimizing actions on terabyte databases) or the micro level (fast parallel string searches in memory). You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers on my blog, https://mischasan.wordpress.com. My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

3 Responses to What is SSE !@# good for? Transposing a bit matrix

  1. I thought that this would be a good place to say thank you for this algorithm. It is used extensively in a compression library I wrote (https://github.com/kiyo-masui/bitshuffle). Also, I managed to get the phrase “What is SSE !@# good for?” published in the references of a peer-reviewed academic journal (http://dx.doi.org/10.1016/j.ascom.2015.07.002), so well done!

  2. Very interesting post indeed, but for matrix transpose operation that I was looking for with SSE instruction sets, I found another implementation using the function _mm_shuffle_epi8 (source: http://pzemtsov.github.io/2014/10/01/how-to-transpose-a-16×16-matrix.html) instead of your algorithm using _mm_movemask_epi8 function. Could you please comment on the differences and advise which one is better/faster?
