Convergence: SSE2 and strstr

The original improved strstr routine split the problem up based on the pattern length: 2, 3 and 4+ bytes were separate cases. How about reimplementing the 2- and 3-byte cases using SSE2 functions?
The main change is to compare each 16-byte chunk against 2 (or 3) xmm registers, then shifting and and’ing the resulting bitflags of matches. In addition, the code must check for matches across a 16-byte boundary.

#define compxm(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8((a), (b)))
#define xmload(p)   _mm_load_si128((__m128i const *)(p))
#define load16(p)   (*(uint16_t const*)(p))
#define load32(p)   (*(uint32_t const*)(p))

char const*scanstr2(char const *tgt, char const pat[2])
{
    __m128i const   zero = _mm_setzero_si128();
    __m128i const   p0   = _mm_set1_epi8(pat[0]);
    __m128i const   p1   = _mm_set1_epi8(pat[1]);
    unsigned        f    = 15 & (intptr_t)tgt;
    uint16_t        pair = load16(pat);
    if (f) {
        __m128i  x = xmload(tgt - f);
        unsigned u = compxm(zero, x) >> f;
        unsigned v = ((compxm(p0, x) & (compxm(p1, x) >> 1)) >> f) & ~u & (u - 1);
        if (v) return tgt + ffs(v) - 1;
        if (u) return  NULL;
        tgt += 16 - f;
        if (load16(tgt - 1) == pair)
            return tgt -1;
    }
    while (1) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1) & ~u & (u - 1);
        if (v) return tgt + ffs(v) - 1;
        if (u) return  NULL;
        tgt += 16;
        if (load16(tgt - 1) == pair)
            return tgt -1;
    }
}

char const *scanstr3(char const *tgt, char const pat[3])
{
    __m128i const   zero = _mm_setzero_si128();
    __m128i const   p0   = _mm_set1_epi8(pat[0]);
    __m128i const   p1   = _mm_set1_epi8(pat[1]);
    __m128i const   p2   = _mm_set1_epi8(pat[2]);
    unsigned        trio = load32(pat) & 0x00FFFFFF;
    unsigned        f    = 15 & (uintptr_t)tgt;

    if (f) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1);
        v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f;
        if (v) return tgt + ffs(v) - 1;
        tgt += 16 - f;
        v = load32(tgt - 2);
        if (trio == (v & 0x00FFFFFF)) return tgt - 2;
        if (trio ==  v >> 8         ) return tgt - 1;
        if (u >> f) return  NULL;
    }

    while (1) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1);
        v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f;
        if (v) return tgt + ffs(v) - 1;
        tgt += 16;
        v = load32(tgt - 2);
        if (trio == (v & 0x00FFFFFF)) return tgt - 2;
        if (trio ==  v >> 8         ) return tgt - 1;
        if (u) return  NULL;
    }
}

And here it gets interesting: the above scanstr2 is 2-16 times as fast as the non-SSE version. We can simplify strstr back to something similar to the original, that used strchr and strncmp, with essentially that much of a speed-up for the 4+ case of pattern length. There’s a small performance win in using scanstr2 instead of scanstr3, because of the more complicated 16-byte boundary case for the latter.

char const *scanstrN(char const *tgt, char const *pat, int len)
{
    for (; (tgt = scanstr2(tgt, pat)); tgt++)
        if (!memcmp(tgt+2, pat+2, len-2))
            return tgt;
    return NULL;
}

So, there you have it: an order-of-magnitude speed-up for all but the shortest of target strings. And now, let’s look at some things SSE2 is unequivocally good for.

About these ads

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog http://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm and tagged , , . Bookmark the permalink.

6 Responses to Convergence: SSE2 and strstr

  1. mischasan says:

    Follow-up: a bit of reorganizing the main loop in scanchr/scanstr2/scanstr3 improves their performance (and hence scanstrN’s for all sizes of string) for strings over 32 bytes. The main inner loop repeats as long as no zero byte is detected (i.e as long as u == 0). This eliminates the
    … & ~u & (u – 1) …
    from the main loop. Surprisingly, that little change speeds search up by 30-40%.

  2. Hi mischasan,
    yesterday I was very excited seeing your SSE2 etudes, they are the boost I have been looking for.
    It turns out that the C without reinforcement of Intrinsics is so badly limited in performance that I would (and I will) trade its ‘compatibility’ right away for the AMD&Intel powerful XMM registers.
    I wish I knew more than nothing about SSE2 instructions, alas the ignorance is a never ending story. Somewhere in the future I hope that guys like you will do the remedy and boost the crawling strstr and hash-table functions.

    You are welcome to publish and comment your results at:
    http://www.codeproject.com/KB/cpp/Railgun_Quadruplet.aspx

    Showing how Railgun lags behind compared to your down-to-the-metal approach makes me eager and mostly greedy for more enhancements.

    Keep up the exciting work of yours.

  3. Any advise how to build this code for Windows? Also, in scanstr2 the “pair” variable is used before declaration point.

    • mischasan says:

      Ha! good catch.

      I don’t have Visual Studio at hand, but according to this MSDN SSE2 intrinsics page, emmintrin.h defines the same types (__m128i etc) and intrinsics (_mm_set1_epi8…). It’s been a while, but I think windows.h pulls in the header files that define uint16_t etc. That’s the best I can tell you; let me know if you have any such problems.

      • __m128i doesn’t make any problems; types such as uint16_t I’ve found in stdint.h
        I need to know what is function of ffs, and why usage of the variable “pair” appears in code before it’s declaration.

      • mischasan says:

        Ah. ffs means “find first (bit) set”. The MSVC equivalent is _BitScanForward.
        And I should have answered more directly: the declaration of “pair” was out of sequence — copy/paste error when I copied the text from source to the wordpress text editor.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s