SSE2 bit trick: ffs/fls for XMM registers

For the full “C” code that uses this idea for an arbitrary-length byte vector, see
this later blog post

In a discussion about all the wonderful uses of the combination movemask(pcmpxx(a,b)), it occurred to me that this gives you a fast XMM version of find-first-set and find-last-set (bit) operations. Pardon my C-centrism in preferring bit positions 0..127 (with -1 for no-bit-set)  rather than the 1… convention of the x86 ops (bsfl,bsrl).

int xm_ffs(__m128i x) {
    int     pos = _mm_movemask_epi8(_mm_cmpeq_epi8(x, _mm_setzero_si128()));
    pos = ffs((uint16_t)~pos) - 1;
    return  pos < 0 ? -1
                       : (pos << 3) + ffs(((unsigned char const*)&x)[pos]) - 1;

Note the folderol needed because x86 has _mm_cmpeq_epi8, _mm_cmplt_epi8, _mm_cmpgt_epi8, but no _mm_cmpneq_epi8. Go figure.

And if anyone comes up with a cleverer way to index bytes of an XMM value, I’d love to see it.


About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s