Building a better strstr

NOTE: check my later posts for how to make strstr an order-of-magnitude faster, using SSE2 instructions.

It would be hard to find a “C” programmer who doesn’t know strchr(string,char): find the first occurrence of (char) in a null-terminated string. Slightly less well-known is strstr(string,substring): do the same with a null-terminated substring of chars.

While hunting for practical uses of the Intel SSE instructions, I came across an implementation of strstr that did indeed use SSE 4.2 extensions; but couldn’t bring myself to get into something that wouldn’t work on AMD as well. But it made me wonder just how complicated the standard strstr is. I dug out source for glibc’s strstr.  It almost seemed too simple. There had to be a way to improve it.

The basic strstr algorithm (per FreeBSD and Debian glibc) is:

#include <string.h>
char *strstr(char *s, char *find)
{
    char c, sc;
    size_t len;

    if ((c = *find++) != 0) {
        len = strlen(find);
        do {
            do {
                if ((sc = *s++) == 0)
                    return NULL;
            } while (sc != c);
        } while (strncmp(s, find, len) != 0);
        s--;
    }
    return s;
}

In the world of string searching, this is almost as simple as it gets. The problem of substring search has been studied for decades, and there are a lot of very clever algorithms. Surely there must be a better way.

An example of fast search is the Horspool algorithm. It uses a 256-element lookup table to skip ahead in the string by (up to) the length of (substring) on each iteration, rather than looking at each byte. Elegantly simple.

It turns out that a lot of intelligent study of algorithms has more or less been made irrelevant, by hardware improvements that stream memory to the CPU. Unless (substring)  is 32 bytes or more, the average Intel/AMD CPU makes all the skip-ahead computation pointless. In practice, most uses of strstr are indeed for substrings less than 32 bytes.

Even if that weren’t the case, intelligent “skip-ahead” algorithms are beaten by the nature of “C” strings themselves: null terminators. For an algorithm to not overrun the end of a null-terminated string, it must check every byte, sequentially to detect the null! So it must initially scan the entire string, or it must incorporate logic that tests every byte in sequence, anyway.

What about speeding up the algorithm using x86 string instructions (REPNE SCASB, REPE CMPSB)? I implemented strstr this way in assembler, just to check. Unfortunately, needing to find the trailing null first means scanning the string twice. That overhead wipes out any advantage of the hardware instructions. In fact, because of caching and load prediction, basic strstr is almost twice as fast as the string-op version.

Given up yet? I almost did. Then, playing with assembler, I realized that there is no difference in overhead between comparing two bytes, and comparing two 32-bit words. An algorithm can keep up to four leading bytes of the substring in a register, and compare that against a moving window of four bytes in (string). This means that the character scan strstr will fall into the string compare less often. If the first byte of (substring) is common, that can make a performance difference.

To make the most of x86 architecture required coding special cases for substrings of length 2, 3, 4 and 5+. The overhead of choosing a special case turns out to be insignificant. The essential change from the standard code is movzbl (%esi),%eax; test %eax,%eax becomes  shl $8,%eax; add (%esi),%al — a small slow-down. Adding a byte to zero combines “load” and “test” into one operation. The resulting procedure takes 55-105% of the time that the standard strstr does, and usually 75%, across a wide range of inputs. The 105% case is when (substring) begins with a character that isn’t in (string) at all. Here’s the special case for four-byte patterns.

# Inputs:   ESI=pattern EDI=target
# Return:   EDI=start of match in (target), or NULL
# Locals:   EDX=pattern[0..3] EAX=(target window)
        mov     (%esi),%edx
        bswap   %edx
        xor     %eax,%eax
LOOP:   add     0(%edi),%al
        jz      FAIL
        cmp     %edx,%eax
        je      3f
        shl     $8,%eax

        add     1(%edi),%al
        jz      FAIL
        cmp     %edx,%eax
        je      2f
        shl     $8,%%eax

        add     2(%edi),%al
        jz      FAIL
        cmp     %edx,%eax
        je      1f
        shl     $8,%eax

        add     3(%edi),%al
        jz      FAIL
        cmp     %edx,%eax
        je      DONE
        shl     $8,%eax

        lea     4(%edi),%edi
        jmp     LOOP

FAIL:   xor     %edi,%edi
        jmp     DONE

3:      dec     %%edi
2:      dec     %%edi
1:      dec     %%edi
DONE:

Up next: what about those SSE instructions, anyway?

Advertisements

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in algorithm, Uncategorized and tagged . Bookmark the permalink.

One Response to Building a better strstr

  1. mischasan says:

    Since running the initial tests on Intel Pentium4 and Core2 boxes, I’ve now had a chance to run the same tests on an AMD Athlon 3200+. The spread between standard and window-scan gets somewhat wider, with window-scan taking 18-140% of the time that standard strstr does. The best case for window-scan is a pattern rarely found in the target, but beginning with a character found frequently in the target. The best case for standard strstr is a string beginning with a character never found in the target.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s