The Generic SSE2 Loop

In response to a couple of comments on my post about find-first-bit-set in SSE2 registers, amounting to “what use is a routine that only does 16-byte bitvecs”, I thought I’d post the canonic, generic loop through memory using SSE2 ops. The basic principles are:

  • valid memory segments in x86 always end on multiple of 16 bytes (or more)
  • aligned loads of XMM registers (i.e. on 16-byte boundaries) are always faster than unaligned loads.

The generic loop over unaligned memory amounts to:

  •  do an aligned load of the first part of a byte vector, possibly throwing away the low bytes of what’s loaded
  • loop through aligned 16-byte chunks
  • do an aligned load of the last part of a byte vector, possibly throwing away the high bytes of what’s loaded.

You’ll see this in my examples on SSE2-based strstr. Here’s “find-first-bit-set” for an arbitrary bit-vector:

 
// For a vector of 16 bytes, NZBYTES returns a 16-bit mask
//   of the non-zero bytes.
#define NZBYTES(p) (0xFFFF ^ _mm_movemask_epi8(\
                               _mm_cmpeq_epi8(*(__m128i const*)(p),\
                                              _mm_setzero_si128())))
int bitvec_ffs(uint8_t const*vec, int len)
{
    unsigned i = 15 & (uintptr_t)vec;
    uint8_t const *bp = vec - i;
    uint16_t m;
    if (i) {
        if ((m = NZBYTES(bp) >> i))
            return i = FFS(m), i*8 + FFS(vec[i]);
        len -= i, bp += 16;
    }

    for (; len >= 16; len -= 16, bp += 16)
        if ((m = NZBYTES(bp)))
            return i = FFS(m), (bp-vec+i)*8 + FFS(bp[i]);

    if (len > 0)
        if ((m = NZBYTES(bp) & ~(-1 << len)))
            return i = FFS(m), (bp-vec+i)*8 + FFS(bp[i]);

    return -1;
}

A more succinct way of solving the problem partitions it into the three parts implicitly, by careful arrangement of state changes. You be the judge of what’s more understandable 🙂

int bitvec_ffs(uint8_t const*vec, int len)
{
    int off = 15 & (intptr_t)vec;
    uint8_t const *bp = vec - off;  // back up to a 16b boundary
    uint16_t m = NZBYTES(bp);
    len += off;
    if (len < 16)
        m &= ~(-1 << len);  // mask off trailing bits (tiny vec[])
    for (m &= -1 <= 16; m = NZBYTES(bp), len -= 16)
        bp += 16;
    if (len < 16)
        m &= ~(-1 << len); // mask off trailing bits (tail of vec[])
    if (!m) return -1;
    off = ffs(m) - 1;
    return 8 * (off + bp - vec) + ffs(bp[off]) - 1;
}

Advertisements

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in ffs, SSE2, Uncategorized and tagged , , , , . Bookmark the permalink.

5 Responses to The Generic SSE2 Loop

  1. Pingback: SSE2 bit trick: ffs/fls for XMM registers | Coding on the edges

  2. Jeroen van Bemmel says:

    Tried this, here’s what I got ( 00 = 1 char ):

    00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01, -> Results: bitvec_ffs=131
    00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01, -> Results: bitvec_ffs=130
    00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01, -> Results: bitvec_ffs=129
    00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01, -> Results: bitvec_ffs=128
    00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01, -> Results: bitvec_ffs=15
    00,00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01, -> Results: bitvec_ffs=14
    00,00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01, -> Results: bitvec_ffs=13
    00,00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=12
    00,00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=11
    00,00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=10
    00,00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=9
    00,00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=8
    00,00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=7
    00,00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=6
    00,00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=5
    00,00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=4
    00,00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=3
    00,00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=2
    00,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=1
    01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01,01, -> Results: bitvec_ffs=0

    Interesting mix of bit/byte indexes, but not entirely correct?

  3. Jeroen van Bemmel says:

    Try this version:

    int bitvec_ffs2( uint8_t const*vec, int nbytes )
    {
    unsigned pos, off = 15 & (uintptr_t) vec;
    __m128i const* mp = (__m128i const*)(vec – off);
    __m128i zero = {};
    long i;
    uint8_t const* mpb;

    if ( off ) {
    // Throw away (movemask) bits from before start of (vec)
    pos = ffs( (uint16_t) (~_mm_movemask_epi8(_mm_cmpeq_epi8(*mp, zero))) >> off );
    if (pos > 0) return pos – 1;
    nbytes -= off, ++mp;
    }
    i = -(nbytes&~15);
    mpb = ((uint8_t*) mp) – i;
    if (i<=-16) do {
    pos = ffs( (uint16_t) ~_mm_movemask_epi8(_mm_cmpeq_epi8(*(__m128i*)(mpb+i), zero)));
    if (pos) return ((mpb+i – vec) << 3) + pos – 1;
    } while ((i+=16) 0) {
    // fprintf( stderr, “\nnbytes=%d i=%d mpb=%p vec=%p”, nbytes, i, mpb, vec );
    // Throw away bits from beyond end of vec[len]
    pos = ffs( (uint16_t) (~_mm_movemask_epi8(_mm_cmpeq_epi8(*(__m128i*)mpb, zero))) & (~(-1 << nbytes)));
    if (pos) return ((mpb – vec) << 0) + pos – 1;
    }
    return -1;
    }

  4. mischasan says:

    Thanks, Jeroen! I should have revisited this post a long time ago. The basic breakdown of an SSE2 loop was right … but computing the first bit set was hopelessly off: it only computed which byte in the __m128i was the lowest nonzero byte! I’m reposting the source, now that I’ve (ahem) done a little TDD.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s