Convergence: SSE2 and strstr

The original improved strstr routine split the problem up based on the pattern length: 2, 3 and 4+ bytes were separate cases. How about reimplementing the 2- and 3-byte cases using SSE2 functions?
The main change is to compare each 16-byte chunk against 2 (or 3) xmm registers, then shifting and and’ing the resulting bitflags of matches. In addition, the code must check for matches across a 16-byte boundary.

#define compxm(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8((a), (b)))
#define xmload(p)   _mm_load_si128((__m128i const *)(p))
#define load16(p)   (*(uint16_t const*)(p))
#define load32(p)   (*(uint32_t const*)(p))

char const*scanstr2(char const *tgt, char const pat[2])
{
    __m128i const   zero = _mm_setzero_si128();
    __m128i const   p0   = _mm_set1_epi8(pat[0]);
    __m128i const   p1   = _mm_set1_epi8(pat[1]);
    unsigned        f    = 15 & (intptr_t)tgt;
    uint16_t        pair = load16(pat);
    if (f) {
        __m128i  x = xmload(tgt - f);
        unsigned u = compxm(zero, x) >> f;
        unsigned v = ((compxm(p0, x) & (compxm(p1, x) >> 1)) >> f) & ~u & (u - 1);
        if (v) return tgt + ffs(v) - 1;
        if (u) return  NULL;
        tgt += 16 - f;
        if (load16(tgt - 1) == pair)
            return tgt -1;
    }
    while (1) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1) & ~u & (u - 1);
        if (v) return tgt + ffs(v) - 1;
        if (u) return  NULL;
        tgt += 16;
        if (load16(tgt - 1) == pair)
            return tgt -1;
    }
}

char const *scanstr3(char const *tgt, char const pat[3])
{
    __m128i const   zero = _mm_setzero_si128();
    __m128i const   p0   = _mm_set1_epi8(pat[0]);
    __m128i const   p1   = _mm_set1_epi8(pat[1]);
    __m128i const   p2   = _mm_set1_epi8(pat[2]);
    unsigned        trio = load32(pat) & 0x00FFFFFF;
    unsigned        f    = 15 & (uintptr_t)tgt;

    if (f) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1);
        v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f;
        if (v) return tgt + ffs(v) - 1;
        tgt += 16 - f;
        v = load32(tgt - 2);
        if (trio == (v & 0x00FFFFFF)) return tgt - 2;
        if (trio ==  v >> 8         ) return tgt - 1;
        if (u >> f) return  NULL;
    }

    while (1) {
        __m128i  x = xmload(tgt);
        unsigned u = compxm(zero, x);
        unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1);
        v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f;
        if (v) return tgt + ffs(v) - 1;
        tgt += 16;
        v = load32(tgt - 2);
        if (trio == (v & 0x00FFFFFF)) return tgt - 2;
        if (trio ==  v >> 8         ) return tgt - 1;
        if (u) return  NULL;
    }
}

And here it gets interesting: the above scanstr2 is 2-16 times as fast as the non-SSE version. We can simplify strstr back to something similar to the original, that used strchr and strncmp, with essentially that much of a speed-up for the 4+ case of pattern length. There’s a small performance win in using scanstr2 instead of scanstr3, because of the more complicated 16-byte boundary case for the latter.

char const *scanstrN(char const *tgt, char const *pat, int len)
{
    for (; (tgt = scanstr2(tgt, pat)); tgt++)
        if (!memcmp(tgt+2, pat+2, len-2))
            return tgt;
    return NULL;
}

So, there you have it: an order-of-magnitude speed-up for all but the shortest of target strings. And now, let’s look at some things SSE2 is unequivocally good for.

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm and tagged , , . Bookmark the permalink.

11 Responses to Convergence: SSE2 and strstr

  1. mischasan says:

    Follow-up: a bit of reorganizing the main loop in scanchr/scanstr2/scanstr3 improves their performance (and hence scanstrN’s for all sizes of string) for strings over 32 bytes. The main inner loop repeats as long as no zero byte is detected (i.e as long as u == 0). This eliminates the
    … & ~u & (u – 1) …
    from the main loop. Surprisingly, that little change speeds search up by 30-40%.

  2. Hi mischasan,
    yesterday I was very excited seeing your SSE2 etudes, they are the boost I have been looking for.
    It turns out that the C without reinforcement of Intrinsics is so badly limited in performance that I would (and I will) trade its ‘compatibility’ right away for the AMD&Intel powerful XMM registers.
    I wish I knew more than nothing about SSE2 instructions, alas the ignorance is a never ending story. Somewhere in the future I hope that guys like you will do the remedy and boost the crawling strstr and hash-table functions.

    You are welcome to publish and comment your results at:
    http://www.codeproject.com/KB/cpp/Railgun_Quadruplet.aspx

    Showing how Railgun lags behind compared to your down-to-the-metal approach makes me eager and mostly greedy for more enhancements.

    Keep up the exciting work of yours.

  3. Any advise how to build this code for Windows? Also, in scanstr2 the “pair” variable is used before declaration point.

    • mischasan says:

      Ha! good catch.

      I don’t have Visual Studio at hand, but according to this MSDN SSE2 intrinsics page, emmintrin.h defines the same types (__m128i etc) and intrinsics (_mm_set1_epi8…). It’s been a while, but I think windows.h pulls in the header files that define uint16_t etc. That’s the best I can tell you; let me know if you have any such problems.

      • __m128i doesn’t make any problems; types such as uint16_t I’ve found in stdint.h
        I need to know what is function of ffs, and why usage of the variable “pair” appears in code before it’s declaration.

      • mischasan says:

        Ah. ffs means “find first (bit) set”. The MSVC equivalent is _BitScanForward.
        And I should have answered more directly: the declaration of “pair” was out of sequence — copy/paste error when I copied the text from source to the wordpress text editor.

  4. M T says:

    Is this code “public domain” or under some other license? I may be interested in proposing it to the FreeBSD-project, if the license is BSD or better (GPL is not suitable). Thanks!

    • mischasan says:

      If you want to make it part of the FreeBSD project, I’m amenable.
      I post GPL so that people contact me, and I have a chance to find out what kinds of application this is getting.
      That’s not the GPL philosophy; but it’s worked for me. Will that work for you?

  5. Zian says:

    Great solid work, helped speedup my queries by quite a bit. Since I am new to SIMD, I would love to know your opinion: would it be possible/feasible to do something analogous to wide chars? If not the SIMD part, at least the rest of it? The wide char version of strstr, that is wcsstr, also happens to follow the simplistic brute force you detected at the beginning of this series of posts.

    • mischasan says:

      I use the bytewise algorithm for UTF16 … though it requires an added check that you haven’t matched at an odd offset (never ran across this in real data).

Leave a reply to Georgi 'Kaze' Cancel reply