In response to a couple of comments on my post about find-first-bit-set in SSE2 registers, amounting to “what use is a routine that only does 16-byte bitvecs?”, I thought I’d post the canonical, generic loop through memory using SSE2 ops. The basic principles are:
- valid memory segments on x86 always end on a multiple of 16 bytes (page granularity, in fact), so an aligned 16-byte load that overlaps any valid byte cannot fault
- aligned loads of XMM registers (i.e. on 16-byte boundaries) are always faster than unaligned loads.
The generic loop over unaligned memory amounts to:
- do an aligned load of the first part of a byte vector, possibly throwing away the low bytes of what’s loaded
- loop through aligned 16-byte chunks
- do an aligned load of the last part of a byte vector, possibly throwing away the high bytes of what’s loaded.
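The three steps above can be sketched as follows. This is not code from the post — it is a minimal illustrative helper (using the GCC/Clang `__builtin_ctz` builtin) that finds the first nonzero *byte* of an arbitrary buffer using only aligned loads, masking off the out-of-range bytes of the first and last chunks:

```c
#include <emmintrin.h>
#include <stdint.h>

// Index of the first nonzero byte in vec[0..len), or -1 if all zero.
// Only aligned 16-byte loads are issued; bytes outside vec[] are masked off.
int first_nonzero_byte(uint8_t const *vec, int len)
{
    unsigned off = 15 & (uintptr_t)vec;
    uint8_t const *bp = vec - off;      // align down to a 16-byte boundary
    int pos = -(int)off;                // index of bp[0], relative to vec

    for (; pos < len; pos += 16, bp += 16) {
        __m128i chunk = _mm_load_si128((__m128i const *)bp);    // aligned load
        unsigned m = 0xFFFF ^ _mm_movemask_epi8(
                         _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));
        if (pos < 0)        m &= 0xFFFFu << -pos;           // drop bytes before vec[0]
        if (len - pos < 16) m &= ~(0xFFFFu << (len - pos)); // drop bytes past vec[len-1]
        if (m) return pos + __builtin_ctz(m);
    }
    return -1;
}
```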
You’ll see this in my examples on SSE2-based strstr. Here’s “find-first-bit-set” for an arbitrary bit-vector:
// For a vector of 16 bytes, NZBYTES returns a 16-bit mask
// of the non-zero bytes.
#define NZBYTES(p) (0xFFFF ^ _mm_movemask_epi8(\
                        _mm_cmpeq_epi8(*(__m128i const*)(p),\
                                       _mm_setzero_si128())))

int bitvec_ffs(uint8_t const *vec, int len)
{
    unsigned i = 15 & (uintptr_t)vec;
    uint8_t const *bp = vec - i;        // back up to a 16b boundary
    uint16_t m;

    if (i) {                            // leading partial chunk
        if ((m = NZBYTES(bp) >> i))     // drop bytes from before vec[0]
            return i = FFS(m), i*8 + FFS(vec[i]);
        len -= 16 - i, bp += 16;        // first 16-i bytes of vec[] now done
    }
    for (; len >= 16; len -= 16, bp += 16)
        if ((m = NZBYTES(bp)))
            return i = FFS(m), (bp - vec + i)*8 + FFS(bp[i]);
    if (len > 0)                        // trailing partial chunk
        if ((m = NZBYTES(bp) & ~(-1 << len)))   // drop bytes past vec[len-1]
            return i = FFS(m), (bp - vec + i)*8 + FFS(bp[i]);
    return -1;
}
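The FFS() macro used above is never defined in the post; it evidently differs from POSIX ffs() in being 0-based. A plausible definition — my assumption, via the GCC/Clang count-trailing-zeros builtin — would be:

```c
// Hypothetical 0-based find-first-set, as the code above appears to assume:
// FFS(0x08) is 3, FFS(1) is 0. FFS(0) is undefined, but the callers
// always test the mask for zero before calling it.
#define FFS(x) __builtin_ctz((unsigned)(x))
```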
A more succinct way of solving the problem partitions it into the three parts implicitly, by careful arrangement of state changes. You be the judge of what’s more understandable 🙂
int bitvec_ffs(uint8_t const *vec, int len)
{
    int off = 15 & (intptr_t)vec;
    uint8_t const *bp = vec - off;      // back up to a 16b boundary
    uint16_t m = NZBYTES(bp);
    len += off;
    if (len < 16) m &= ~(-1 << len);    // mask off trailing bits (tiny vec[])
    for (m &= -1 << off;                // mask off leading bits (before vec[0])
         !m && len > 16; m = NZBYTES(bp), len -= 16)
        bp += 16;
    if (len < 16) m &= ~(-1 << len);    // mask off trailing bits (tail of vec[])
    if (!m) return -1;
    off = ffs(m) - 1;
    return 8 * (off + bp - vec) + ffs(bp[off]) - 1;
}
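Whichever version you prefer, a plain scalar reference is handy for checking either one. This byte-at-a-time sketch (mine, not from the post — again assuming GCC/Clang's `__builtin_ctz`) computes the same answer with no SSE2 at all:

```c
#include <stdint.h>

// Portable reference: index of the first set bit in vec[0..len), or -1.
// Bit i*8+j means bit j (LSB first) of byte vec[i], matching the SSE2 versions.
int bitvec_ffs_ref(uint8_t const *vec, int len)
{
    for (int i = 0; i < len; i++)
        if (vec[i])
            return i * 8 + __builtin_ctz(vec[i]);
    return -1;
}
```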
Tried this, here’s what I got ( 00 = 1 char ):
00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01, -> Results: bitvec_ffs=131
00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01, -> Results: bitvec_ffs=130
00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01, -> Results: bitvec_ffs=129
00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01, -> Results: bitvec_ffs=128
00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01, -> Results: bitvec_ffs=15
00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01, -> Results: bitvec_ffs=14
00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01, -> Results: bitvec_ffs=13
00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=12
00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=11
00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=10
00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=9
00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=8
00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=7
00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=6
00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=5
00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=4
00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=3
00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=2
00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=1
01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=0
Interesting mix of bit/byte indexes, but not entirely correct?
Try this version:
int bitvec_ffs2( uint8_t const*vec, int nbytes )
{
    unsigned pos, off = 15 & (uintptr_t) vec;
    __m128i const* mp = (__m128i const*)(vec - off);
    __m128i zero = {};
    long i;
    uint8_t const* mpb;
    if ( off ) {
        // Throw away (movemask) bits from before start of (vec)
        pos = ffs( (uint16_t) (~_mm_movemask_epi8(_mm_cmpeq_epi8(*mp, zero))) >> off );
        if (pos > 0) return pos - 1;
        nbytes -= 16 - off, ++mp;
    }
    i = -(nbytes & ~15);
    mpb = ((uint8_t*) mp) - i;
    if (i <= -16) do {
        pos = ffs( (uint16_t) ~_mm_movemask_epi8(_mm_cmpeq_epi8(*(__m128i*)(mpb+i), zero)));
        if (pos) return ((mpb+i - vec) << 3) + pos - 1;
    } while ((i += 16) < 0);
    if ((nbytes &= 15) > 0) {
        // fprintf( stderr, "\nnbytes=%d i=%ld mpb=%p vec=%p", nbytes, i, mpb, vec );
        // Throw away bits from beyond end of vec[len]
        pos = ffs( (uint16_t) (~_mm_movemask_epi8(_mm_cmpeq_epi8(*(__m128i*)mpb, zero))) & (~(-1 << nbytes)));
        if (pos) return ((mpb - vec) << 3) + pos - 1;
    }
    return -1;
}
Correction to the while loop above: it should read while ((i+=16) < 0)
Thanks, Jeroen! I should have revisited this post a long time ago. The basic breakdown of an SSE2 loop was right … but computing the first bit set was hopelessly off: it only computed which byte in the __m128i was the lowest nonzero byte! I’m reposting the source, now that I’ve (ahem) done a little TDD.