A better strstr, with no asm code

For the ultimate in kickass strstr, see my later post about using SSE2 instructions to get a 10x speedup on non-SSE2 strstr. Unlike the Intel-only SSE4.2 ops, SSE2 is on every x86-compatible chip since the Pentium-4.

My original strstr post used asm. Naively, I though I could make the world a faster place by coding to the metal. After a request to make it portable, I found that the “C”-only version was almost identical in speed … and on an Intel Core7, faster. Here’s the source for a strstr that averaged twice as fast on strings under 16 bytes, five times as fast for strings over 64 bytes, compared to glibc’s strstr (YMMV):

#if defined(__FreeBSD__)
#   include <sys/endian.h>
#elif defined(__linux__) && defined(__USE_BSD)
#   include <byteswap.h>
#   define  bswap16 bswap_16
#   define  bswap32 bswap_32
#else
#   error "Need a bswap intrinsic!"
#endif

#if BYTE_ORDER == BIG_ENDIAN
#   define  MSBF16(x)   (*(uint16_t const*__attribute((aligned(1))))x)
#   define  MSBF32(x)   (*(uint32_t const*__attribute((aligned(1))))x)
#else
#   define  MSBF16(x)   bswap16(*(uint16_t const*__attribute((aligned(1))))x)
#   define  MSBF32(x)   bswap32(*(uint32_t const*__attribute((aligned(1))))x)
#endif

static char const *
scanstr2(char const *tgt, char const pat[2])
{
    uint16_t head = MSBF16(pat), wind = 0, next;

    while ((next = *(uint8_t const*)tgt++)) {
        wind = ( wind << 8 ) + next;
        if (wind == head)
            return tgt - 2;
    }
    return  NULL;
}

// NOTE: MSBF32(pat) will never read beyond pat[] in memory,
//          because pat has a null-terminator.
static char const *
scanstr3(char const *tgt, char const pat[3])
{
    uint32_t head = MSBF32(pat), wind = 0, next;

    while ((next = *(uint8_t const*)tgt++)) {
        wind = (wind + next) << 8;
        if (wind == head)
            return tgt - 3;
    }
    return  NULL;
}

static char const *
scanstrm(char const *tgt, char const *pat, int len)
{
    uint32_t head = MSBF32(pat), wind = 0, next;

    pat += 4, len -= 4;
    while ((next = *(uint8_t const*)tgt++)) {
        wind = ( wind << 8 ) + next;
        if (wind == head && !memcmp(tgt, pat, len))
            return tgt - 4;
    }
    return  NULL;
}

char const *scanstr(char const *tgt, char const *pat)
{
    unsigned     len = strlen(pat);
    switch (len) {
    case  0: return tgt;
    case  1: return strchr( tgt,*pat);
    case  2: return scanstr2(tgt, pat);
    case  3: return scanstr3(tgt, pat);
    default: return scanstrm(tgt, pat, len);
    }
}
About these ads

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog http://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized and tagged . Bookmark the permalink.

2 Responses to A better strstr, with no asm code

  1. joegroff says:

    Your code isn’t quite portable because `(*(uint16_t const*)x)` and `(*(uint32_t const*)x)` are undefined if `x` isn’t aligned properly for a uint16_t or uint32_t. You should use `(*(uint16_t const __attribute__((align(1)))*)x)` (or, in C++11/C11, `(*(uint16_t const alignas(char) *)x)`).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s