What the !@# is SSE2 good for: char search in long strings

You don’t need SSE4.2 to do some neat string operations with XMM registers. Case in point: using 16-byte parallelism, searching for a character in a null-terminated character string — aka strchr.

Smart implementations of strchr don’t simply test each byte in turn for null or the target character. Instead, they load four bytes at a time, and use a bit/arithmetic trick that returns a non-zero value if a four-byte word x has a zero byte in any position:

(x - 0x01010101) & 0x80808080 & ~x

If there is no zero byte, strchr xor’s the word with the target character replicated in all four byte positions, and does the same test again. A tiny loop checks the last 1-3 characters in the string.

The SSE2 equivalent takes even fewer instructions. PCMPEQB compares two XMM registers byte-by-byte, setting corresponding bytes to FF or 00. PMOVMSK collects the high bit of each XMM byte into a single 16-bit word. BFSL (“find first set” bit instruction) does the rest.

Ignoring the niceties of aligned 16-byte loads and handling short strings:

char const *ssechr(char const *s, char ch)
    __m128i zero = _mm_setzero_si128();
    __m128i cx16 = _mm_set1_epi8(ch); // (ch) replicated 16 times.
    while (1) {
        __m128i  x = _mm_loadu_si128((__m128i const *)s);
        unsigned u = _mm_movemask_epi8(_mm_cmpeq_epi8(zero, x));
        unsigned v = _mm_movemask_epi8(_mm_cmpeq_epi8(cx16, x))
                        & ~u & (u - 1);
        if (v) return s + ffs(v) - 1;
        if (u) return  NULL;
        s += 16;

And how does ssechr compare with regular (glibc) strchr? With some TLC for the edge cases, ssechr is about the same as strchr when the target char is in the first 4 bytes, twice as fast for the next 32 bytes, and maxes out at about 16 times as fast as strchr when the char is 200 bytes (or more) into the string.


About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

10 Responses to What the !@# is SSE2 good for: char search in long strings

  1. Jeroen van Bemmel says:

    “ffs(v) – 1” can be replaced by __builtin_ctz(v) when using GCC – it generates a ‘bsf’ instruction

    • mischasan says:

      True. On my current platform (gcc 4.1, Linux, x86-64) seems to be smart enough to generate the bsfl without an extract subtraction — the compiler sees that ffs(x) – 1 = (bsfl(x) + 1) – 1 and that +1-1 is a wash. Have you found there are platforms where __builtin_ctz(x) generates better code than ffs(x)?

      • Jeroen van Bemmel says:

        I used __builtin_ffs(x) myself before, and it generated a cmov instruction.

        BTW, is the above code correct when there is a match on ch just beyond an end-of-string?
        I am changing the code to:

        if (c==0) return s + __strlen(s); // handle rare case inline

        __m128i zero = _mm_setzero_si128();
        __m128i cx16 = _mm_set1_epi8(c); // (ch) replicated 16 times.
        while (1) {
        __m128i x = _mm_loadu_si128( ( const __m128i *) s);
        unsigned u = _mm_movemask_epi8(_mm_cmpeq_epi8(zero, x));
        unsigned v = _mm_movemask_epi8(_mm_cmpeq_epi8(cx16, x)) & ~u & (u – 1);
        if ( (u+v)==0 ) {
        s += 16;

        // There could also be a newline => check if vv)
        return (v<(u-1)) ? (char*) s + __builtin_ctz(v) : 0;

        also created an SSE4.2 version (slightly faster):
        if (c==0) return s + __strlen(s);

        // SSE4 slightly faster than SSE2 version
        #if defined(__SSE4_2__)
        const __m128i chartofind = _mm_set1_epi8(c);
        const __m128i* str = (const __m128i*) (s-16); // unaligned

        __m128i strchars = _mm_loadu_si128( ++str );
        int offset = _mm_cmpistri( chartofind, strchars, _SIDD_CMP_EQUAL_EACH );
        __asm__ __volatile__ goto( "jz %l[eos] \n\t jnc %l[loop]" : : : : loop, eos );
        return ((char*) str) + offset;
        __asm__ __volatile__ goto( "jc %l[cintail]" : : : : cintail );
        return 0;
        int zofs = _mm_cmpistri( _mm_setzero_si128(), strchars,
        return zofs<offset? 0 : ((char*) str) + offset;

      • mischasan says:

        Yes, the ffs(x) for me generates:
        bsfl %rbx, %eax
        movl $-1, %ecx
        cmove %ecx, %eax
        which is two register instructions at the end of a whole lot of code, with a cmove that is (in this case) always a nop, and unrelated to the “- 1” in “ffs(x) – 1”. I never worried about optimizing that out.

        I guess it’s time for me to start playing with hand-tuned asm and pcmpistri 🙂

  2. Jeroen van Bemmel says:

    Never mind, the code works correctly in case ch is beyond end-of-string. However, it fails for strchr( s, 0 )
    This should return the end of the string, instead of NULL (see http://www.cplusplus.com/reference/clibrary/cstring/strchr/)

    • Jeroen van Bemmel says:

      Simple fix for this rare case: if (u) return (c!=0) ? NULL : (char*) s + __builtin_ctz(u);

      • mischasan says:

        Thanks again. It was most educational to learn that strchr(s,0) and strrchr(s,0) do what they do, especially since (s + strlen(s)) is the more obvious way to express it, and it seems inconsistent to treat NUL like any other non-zero character. Should strstr(s, “abc”) only return a pointer to “abc” if it is at the end of (s)? Why doesn’t the null terminator in “abc” count as a part of the string there, too? How about strpbrk and strspn? But standards are standards, I guess. I’d be really interested in anyone suggesting the rationale is.

        glibc 4.5 strlen uses SSE2 to find the trailing NUL, so a special up-front case {if (!c) return s + strlen(s)} would seem sensible.

        BTW after posting ssechr, I worked out that xmm works faster, and safer, using _mm_loada_si128(), albeit with more elaborate code for startup and stop conditions. It’s safer because it can never accidentally try to read beyond the end of a memory page. A rewrite of ssechr using this scheme is what so far outperforms the glibc 4.5 strchr routine with SSE4.2 enabled (and disassembled to prove it was using PCMPSTRI) — even for short strings.

  3. Jeroen van Bemmel says:

    I tried the upfront case, but gcc does not optimize it away unless I put it in a macro:
    #define strcmp(s,c) (((int _c = (c))!=0)?__strcmp(s,_c) : (char*) s + __strlen(s))

    I figured that the case is rare enough to let the extra (c!=0) check handle it; for SSE4 you can simply change “return zofs=offset) ? ((char*) str) + offset : 0;” because zofs==offset when c==0

    FYI – see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53712, am working to help gcc do the pstricmpi optimizations for you…

    • Jeroen van Bemmel says:

      SSE4 code got lost (XML); should be “return (zofs>=offset) ? ((char*) str) + offset : 0;”

  4. mischasan says:

    Yet another joke on me, relying on compiler optimization. gcc 4.1.2 -O9 produced the following instructions for “ffs(x)-1”:

            movl    $-1, %edx
            bsfl    %eax, %eax
            cmove   %edx, %eax
            addl    $1, %eax
            subl    $1, %eax


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s