In a discussion about all the wonderful uses of the combination movemask(pcmpxx(a,b)), it occurred to me that this gives you a fast XMM version of find-first-set and find-last-set (bit) operations. Pardon my C-centrism in preferring bit positions 0..127 (with -1 for no bit set) rather than the 1… convention of ffs(). (The x86 ops bsfl and bsrl are themselves 0-based, but leave the destination undefined when the source is zero.)

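    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <strings.h>    /* ffs() */

    /* Index (0..127) of the lowest set bit of x, or -1 if x is all zero. */
    int xm_ffs(__m128i x) {
        /* Bit i of the mask is set when byte i of x is zero. */
        int pos = _mm_movemask_epi8(_mm_cmpeq_epi8(x, _mm_setzero_si128()));
        pos = ffs((uint16_t)~pos) - 1;  /* first nonzero byte, or -1 */
        return pos < 0 ? -1
             : (pos << 3) + ffs(((unsigned char const*)&x)[pos]) - 1;
    }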

Note the folderol (comparing against zero, then inverting the mask with ~) needed because x86 has **_mm_cmpeq_epi8, _mm_cmplt_epi8, _mm_cmpgt_epi8**, but no **_mm_cmpneq_epi8**. Go figure.
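
The find-last-set counterpart mentioned at the top can be sketched along the same lines. This is only a sketch under assumptions: the name xm_fls is hypothetical, and it uses the GCC/Clang builtin __builtin_clz in place of ffs(), since there is no portable libc "find last set". The byte is read back through _mm_storeu_si128 into a local array rather than a pointer cast.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical companion to xm_ffs: index (0..127) of the highest
   set bit of x, or -1 if x is all zero. Assumes GCC or Clang. */
static int xm_fls(__m128i x) {
    /* Bit i of `zero` is set when byte i of x is zero. */
    int zero = _mm_movemask_epi8(_mm_cmpeq_epi8(x, _mm_setzero_si128()));
    unsigned nz = (uint16_t)~zero;      /* bit i set when byte i is nonzero */
    if (nz == 0)
        return -1;                      /* __builtin_clz(0) is undefined */
    int pos = 31 - __builtin_clz(nz);   /* index of the highest nonzero byte */
    unsigned char bytes[16];
    _mm_storeu_si128((__m128i *)bytes, x);  /* byte 0 is the low byte on x86 */
    return (pos << 3) + 31 - __builtin_clz(bytes[pos]);
}
```

For example, xm_fls(_mm_set_epi32(0, 0, 1, 0)) should give 32, since _mm_set_epi32 takes its arguments from the highest 32-bit lane down to the lowest.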

And if anyone comes up with a cleverer way to index bytes of an XMM value, I’d love to see it.
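
One possibility, offered as a sketch rather than a definitive answer: sidestep byte indexing altogether by pulling out the two 64-bit halves and scanning each with a 64-bit bit-scan. The name xm_ffs64 is hypothetical, and the code assumes x86-64 (for _mm_cvtsi128_si64) and GCC or Clang (for __builtin_ctzll). Whether it beats the movemask version is something to measure, not assume.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical variant with the same contract as xm_ffs: index (0..127)
   of the lowest set bit, or -1 if x is all zero. No byte indexing.
   Assumes x86-64 and GCC or Clang. */
static int xm_ffs64(__m128i x) {
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(x);       /* bits 0..63 */
    if (lo)
        return __builtin_ctzll(lo);
    /* Move the high quadword down into the low lane, then extract it. */
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(x, x));
    return hi ? 64 + __builtin_ctzll(hi) : -1;          /* bits 64..127 */
}
```

The trade-off is that this leaves the movemask(pcmpxx(a,b)) idiom behind entirely, so it is an alternative to the trick above, not a refinement of it.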

