SSE2 and BNDM string search

For the past few weeks, I’ve been testing and experimenting with the Railgun string search function written by Sanmayce. Railgun is really a “memmem” function, where the target length is known in advance; and the cost of compiling the pattern is separate from the cost of execution. It’s an interesting example of carefully factoring several algorithms into one, based on pattern and target length.

The string search functions I’ve talked about here have all been intended as pure strstr replacements. That is a different kind of challenge: you don’t know the target length in advance; you have to be good for all kinds of character distributions; you can’t afford a significant start-up (compile) cost; and for current libc strstr on Intel platforms, you’re competing with SSE4.2 PCMPESTRI instructions.

For the longest time, a super-simple routine was hard to beat, and in fact was the platform-independent libc routine:

char *simplestr(char *tgt, char *pat)
{
    int patlen = strlen(pat);
    --tgt;
    do tgt = strchr(++tgt, *pat);
    while (tgt && memcmp(tgt + 1, pat + 1, patlen - 1));
    return tgt;
}

Of course, for some inputs this routine is horrible. But, for MANY inputs it’s not bad; and it’s certainly predictable.

Determining the length of the pattern argument is usually a negligible cost; determining the length of the target is far more variable. (More about solving that problem intelligently in the next post.) However, let’s assume you know the length of the target. This opens up the field to some smarter algorithms … the most common and well-known being Horspool. However, there’s a less well known algorithm called Backward Nondeterministic DAWG Matching (BNDM). “DAWG” stands for “directed acyclic word graph”; best to go look that up. This is a remarkably long and complicated name for a remarkably small and simple piece of code:

#include <assert.h>
#include <stdint.h>
// setbit: set a bit in an LSB-first 32bit word in memory.
static void setbit(void *v, int p) { ((uint32_t*)v)[p >> 5] |= 1 << (p & 31); }

char *bndm32(char *target, int tgtlen, char *pattern, int patlen)
{
    assert(patlen <= 32);
    uint8_t     *tgt = (uint8_t*)target, *pat = (uint8_t*)pattern;
    int         i, j;
    uint32_t    maskv[256] = {};
    for (i = 0; i < patlen; ++i)
        setbit(&maskv[pat[i]], patlen - 1 - i);

    for (i = 0; i <= tgtlen - patlen; i += j) {
        uint32_t    mask = maskv[tgt[i + patlen - 1]];
        for (j = patlen; mask;) {
            if (!--j) return target + i;
            mask = (mask << 1) & maskv[tgt[i + j - 1]];
        }
    }
    return 0;
}

This is hardly more complicated than Horspool, but outperforms Horspool for a wide range of inputs. The 32 bits of a word are an annoying limitation on pattern length. Can 128bit XMM registers help?

On the face of it, this seems unreasonable. The above routine has to start by zeroing out maskv, which is 256*4 bytes long; that’s significant overhead for our generic memmem routine. The equivalent for XMM registers means zeroing 256*16 = 4096 bytes!

However, a small optimization makes this as efficient as the above routine, if not more so. The trick is to keep a vector of (byte) flags indicating which maskv elements are initialized (nonzero). Here’s the code:

#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>
char *bndm128(char *target, int tgtlen, char *pattern, int patlen)
{
    assert(patlen <= 128);
    uint8_t     *tgt = (uint8_t*)target, *pat = (uint8_t*)pattern;
    int         i, j;
    __m128i     zero = {}, maskv[256], carry64 = (__v2di){0,1};
    uint8_t     used[256] = {};
    for (i = 0; i < patlen; ++i) {
        if (!used[j = pat[i]])
            used[j] = 1, maskv[j] = zero;
        setbit(&maskv[j], patlen - 1 - i);
    }

    for (i = 0; i <= tgtlen - patlen; i += j) {
        j = patlen;
        if (!used[tgt[patlen - 1 + i]]) continue;
        __m128i     mask = maskv[tgt[patlen - 1 + i]];
        while (0xFFFF != _mm_movemask_epi8(_mm_cmpeq_epi8(zero, mask))) {
            if (!--j)
                return target + i;
            if (!used[tgt[i + j - 1]])
                break;
            mask = _mm_and_si128(_mm_or_si128(_mm_slli_epi64(mask, 1),
                                              _mm_srli_epi64(_mm_slli_si128(mask, 8), 63)),
                                 maskv[tgt[i + j - 1]]);
        }
    }
    return 0;
}

Note the hackery to compensate for SSE2 not having a 128-bit bit-shift operator: you have to shift each 64-bit half independently (albeit in one operation), then manually “or” the bit shifted out of the low half into the low bit of the high half. Sigh.

How does this perform? For target strings of 256+ bytes, and patterns of 9+ bytes, BNDM using SSE2 pretty much dominates everything else, including BNDM with 32bit int masks, particularly for a wide range of pathological patterns and targets. It leads the pack for a wider range of “reasonable” smaller (100+ byte) targets. As always, YMMV … so, try it out!


8 Responses to SSE2 and BNDM string search

  1. Mischa, thanks a lot for your superb BNDM, it is an inspiring eye-wide-opening etude.

    I did some tests against my ‘Piggy-Houses’ at:

    Happy I am having it, the code is worthy to be put in a frame and beheld each day – a sight for gloomy/inspirationless eyes for sure.

  2. kornman00 says:

    To avoid all the '- 1' operations, why not do 'patlen -= 1;' before the first for loop? You would then need to adjust the for conditions obviously (first becomes 'i <= patlen', second 'i < tgtlen - patlen'), and the assert (0 < patlen <= 128), but you get to remove all five of the '- 1' operations (try looking at the resulting assembly for each case).

    Also, two things in regards to MSVC and this code.
    1) saying '__m128i … = {}' results in a 'rep stosb' instruction (Debug) or a memset call (Release). I would suggest initializing with _mm_setzero_si128() instead (worst case, results in a unique set of pxor and movdqa).
    2) MSVC doesn't define '__v2di'. Instead you would have to do something like 'carry64 = zero; carry64.m128i_u64[1]= 1;'. Though I'm sure you could also do some 'magic' with _mm_setr_epi32 too. There's also _mm_set_epi64x, but that's x64 only.

    One last thing: since maskv[] allocates 4096b of local memory, perhaps a better way would be to define that last so that all of the other variables are as close together as possible. Maybe not so much for Release (still talking in terms of MSVC) builds, but in Debug builds the order of variable declaration is the same order they appear as on the stack.

  3. mischasan says:

    Thanks, kornman00; I use MSVC once in a blue moon, and debug mode only for finding logic errors. Nice to know how the other half lives. My base is gcc 4.3 with -O3. With gcc, those (-1) “ops” are optimized out or at least take no more instructions; e.g. “leal -1(%eax,%esi),%ebx”. I was led to believe that MSVC was better than gcc at this.

    The (__m128i zero = {}) compiles as “pxor %xmm0,%xmm0”; no memory involved. The (carry = (__v2di){0,1}) is just gcc idiom, I guess; it compiles down to “movdqa .LC0,%xmm4”. What does the above MSVC code compile to? How would you get MSVC to generate a single aligned xmm load?

    Wups, just remembered, MSVC generates Intel-style asm; you need to read that gcc asm backwards from Intel; e.g. (OP src,dst).

    You have me curious as to why those other variables need to be close to one another in memory … aside from the expectation that they almost all fit in registers. L1 cache is pretty big and this code is small; after the first set or load, I’d expect access to all the vars to be equally fast, even if their RAM locations were 100K apart. Or am I missing something?

  4. mischasan says:

    After working on how to do 128bit-word shifts efficiently (see my later post) it’s worth noting that the “shift128(mask, 1)” done with a conditional, can be done faster and more simply with:

    mask = _mm_or_si128(_mm_slli_epi64(mask, 1),
                        _mm_srli_epi64(_mm_slli_si128(mask, 8), 63));

    “Simply” from the processor’s point of view, of course 🙂

  5. laci says:

    Hi, thanks for the post! I found it very helpful when extending my BNDM implementation to make use of SSE2.

    I just want to comment on the way you move the search window. You always shift it by the number of letters remaining in the pattern that could not have been matched in the current window (that’s your variable j). The longest possible safe jump though is “the length of the pattern minus the length of the longest pattern prefix that is also a suffix of the current window”. This means that you shift the window so that the longest possible prefix of the pattern is aligned to a matching sequence in the text – this part of the text was a suffix of the previous window. It may sound complicated but the code still remains simple. To update your bndm32 function, you would modify the for-loop something like this:

    for (i = 0; i <= tgtlen - patlen; i += last) {
        uint32_t mask = maskv[tgt[i + patlen - 1]];
        for (last = j = patlen; mask;) {
            --j;
            // if we have found a prefix of the pattern
            // that is also a suffix of the window
            if (mask & (1 << (patlen - 1))) {
                if (!j) return target + i;
                last = j;
            }
            mask = (mask << 1) & maskv[tgt[i + j - 1]];
        }
    }

    For bndm128 it is the same thing; to compute (mask & (1 << (patlen - 1))) you can precompute:

    uint32_t *p_mask = &((uint32_t*)&mask)[(patlen - 1) >> 5];
    uint32_t patlen_mask = 1 << ((patlen - 1) & 31);

    Then it is as simple as "*p_mask & patlen_mask".

    Now the jumps are never worse compared to what they were, and in fact often much better (depends on the pattern and alphabet).

    Btw, for more on BNDM, especially extensions to allow for classes of characters in the pattern (e.g. like in regex you can write "de[cs]k" to match both "desk" and "deck"), mismatches, wild cards and such, I suggest papers by Gonzalo Navarro. (I've used his book "Flexible Pattern Matching in Strings".)


    • laci says:

      Ups, something happend to the text after the for-loop code, this is what should have been there:

      For bndm128 the same thing, to compute (mask & (1 << (patlen – 1))) you can precompute:

      uint32_t *p_mask = &((uint32_t*) &mask)[(patt_length - 1) >> 5];
      uint32_t patlen_mask = 1 << ((patt_length - 1) & 31);

      • mischasan says:

        Thanks in turn for the missing piece from the BNDM code. For the 128-bit version, there’s a faster way: left justify every maskv entry (use the *high* patlen bits); then the high bit of mask shows up as the high bit of _mm_movemask_epi8(mask), giving you the high-bit test of BNDM without storing the XMM register back in memory to get at the high byte. Just cobbling together the code right now …

  6. mischasan says:

    Okay, laci (and thanks again for pointing me back at Navarro and the improvement in general …)
    Here’s an optimized version of the SSE2 (65-128-bit case) of BNDM:

    int8_t used[256] = {};
    __m128i zero = {}, maskv[256];
    for (i = 0; i < patlen; ++i) {
        if (!used[j = pat[i]]) used[j] = 1, maskv[j] = zero;
        setbit(&maskv[j], 128 - 1 - i);
    }
    for (i = 0; i <= tgtlen - patlen; i += skip) {
        j = skip = patlen;
        if (!used[tgt[--j + i]]) continue;
        __m128i mask = maskv[tgt[i + j]]; // guaranteed not zero.
        do {
            if (0 > (int16_t)_mm_movemask_epi8(mask)) {
                if (j) skip = j; else return target + i;
            }
            if (!used[tgt[--j + i]]) break;
            // xm_shl_01: 128-bit shift-left-by-1 (see the later post).
            mask = _mm_and_si128(xm_shl_01(mask), maskv[tgt[i + j]]);
        } while (0xFFFF != _mm_movemask_epi8(_mm_cmpeq_epi8(mask, zero)));
    }
    return 0;
