The original improved **strstr** routine split the problem up based on the pattern length: 2, 3 and 4+ bytes were separate cases. How about reimplementing the 2- and 3-byte cases using SSE2 functions?

The main change is to compare each 16-byte chunk against 2 (or 3) xmm registers, then shifting and and’ing the resulting bitflags of matches. In addition, the code must check for matches across a 16-byte boundary.

#define compxm(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8((a), (b))) #define xmload(p) _mm_load_si128((__m128i const *)(p)) #define load16(p) (*(uint16_t const*)(p)) #define load32(p) (*(uint32_t const*)(p)) char const*scanstr2(char const *tgt, char const pat[2]) { __m128i const zero = _mm_setzero_si128(); __m128i const p0 = _mm_set1_epi8(pat[0]); __m128i const p1 = _mm_set1_epi8(pat[1]); unsigned f = 15 & (intptr_t)tgt; uint16_t pair = load16(pat); if (f) { __m128i x = xmload(tgt - f); unsigned u = compxm(zero, x) >> f; unsigned v = ((compxm(p0, x) & (compxm(p1, x) >> 1)) >> f) & ~u & (u - 1); if (v) return tgt + ffs(v) - 1; if (u) return NULL; tgt += 16 - f; if (load16(tgt - 1) == pair) return tgt -1; } while (1) { __m128i x = xmload(tgt); unsigned u = compxm(zero, x); unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1) & ~u & (u - 1); if (v) return tgt + ffs(v) - 1; if (u) return NULL; tgt += 16; if (load16(tgt - 1) == pair) return tgt -1; } } char const *scanstr3(char const *tgt, char const pat[3]) { __m128i const zero = _mm_setzero_si128(); __m128i const p0 = _mm_set1_epi8(pat[0]); __m128i const p1 = _mm_set1_epi8(pat[1]); __m128i const p2 = _mm_set1_epi8(pat[2]); unsigned trio = load32(pat) & 0x00FFFFFF; unsigned f = 15 & (uintptr_t)tgt; if (f) { __m128i x = xmload(tgt); unsigned u = compxm(zero, x); unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1); v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f; if (v) return tgt + ffs(v) - 1; tgt += 16 - f; v = load32(tgt - 2); if (trio == (v & 0x00FFFFFF)) return tgt - 2; if (trio == v >> 8 ) return tgt - 1; if (u >> f) return NULL; } while (1) { __m128i x = xmload(tgt); unsigned u = compxm(zero, x); unsigned v = compxm(p0, x) & (compxm(p1, x) >> 1); v = (v & (compxm(p2, x) >> 2) & ~u & (u - 1)) >> f; if (v) return tgt + ffs(v) - 1; tgt += 16; v = load32(tgt - 2); if (trio == (v & 0x00FFFFFF)) return tgt - 2; if (trio == v >> 8 ) return tgt - 1; if (u) return NULL; } }

And here it gets interesting: the above **scanstr2** is 2-16 times as fast as the non-SSE version. We can simplify **strstr** back to something similar to the original, that used **strchr **and **strncmp, **with essentially that much of a speed-up for the 4+ case of pattern length. There’s a small performance win in using **scanstr2** instead of **scanstr3**, because of the more complicated 16-byte boundary case for the latter.

char const *scanstrN(char const *tgt, char const *pat, int len) { for (; (tgt = scanstr2(tgt, pat)); tgt++) if (!memcmp(tgt+2, pat+2, len-2)) return tgt; return NULL; }

So, there you have it: an order-of-magnitude speed-up for all but the shortest of target strings. And now, let’s look at some things SSE2 is unequivocally good for.

Follow-up: a bit of reorganizing the main loop in scanchr/scanstr2/scanstr3 improves their performance (and hence scanstrN’s for all sizes of string) for strings over 32 bytes. The main inner loop repeats as long as no zero byte is detected (i.e as long as u == 0). This eliminates the

… & ~u & (u – 1) …

from the main loop. Surprisingly, that little change speeds search up by 30-40%.

Hi mischasan,

yesterday I was very excited seeing your SSE2 etudes, they are the boost I have been looking for.

It turns out that the C without reinforcement of Intrinsics is so badly limited in performance that I would (and I will) trade its ‘compatibility’ right away for the AMD&Intel powerful XMM registers.

I wish I knew more than nothing about SSE2 instructions, alas the ignorance is a never ending story. Somewhere in the future I hope that guys like you will do the remedy and boost the crawling strstr and hash-table functions.

You are welcome to publish and comment your results at:

http://www.codeproject.com/KB/cpp/Railgun_Quadruplet.aspx

Showing how Railgun lags behind compared to your down-to-the-metal approach makes me eager and mostly greedy for more enhancements.

Keep up the exciting work of yours.

Any advise how to build this code for Windows? Also, in scanstr2 the “pair” variable is used before declaration point.

Ha! good catch.

I don’t have Visual Studio at hand, but according to this MSDN SSE2 intrinsics page,

emmintrin.hdefines the same types (__m128ietc) and intrinsics (_mm_set1_epi8…). It’s been a while, but I thinkwindows.hpulls in the header files that defineuint16_tetc. That’s the best I can tell you; let me know if you have any such problems.__m128i doesn’t make any problems; types such as uint16_t I’ve found in stdint.h

I need to know what is function of ffs, and why usage of the variable “pair” appears in code before it’s declaration.

Ah. ffs means “find first (bit) set”. The MSVC equivalent is _BitScanForward.

And I should have answered more directly: the declaration of “pair” was out of sequence — copy/paste error when I copied the text from source to the wordpress text editor.