You don’t need SSE4.2 to do some neat string operations with XMM registers. Case in point: using 16-byte parallelism, searching for a character in a null-terminated character string — aka strchr.
Smart implementations of strchr don’t simply test each byte in turn for null or the target character. Instead, they load four bytes at a time, and use a bit/arithmetic trick that returns a non-zero value if a four-byte word x has a zero byte in any position:
(x - 0x01010101) & 0x80808080 & ~x
If there is no zero byte, strchr xor’s the word with the target character replicated in all four byte positions, and does the same test again. A tiny loop checks the last 1-3 characters in the string.
The SSE2 equivalent takes even fewer instructions. PCMPEQB compares two XMM registers byte-by-byte, setting corresponding bytes to FF or 00. PMOVMSK collects the high bit of each XMM byte into a single 16-bit word. BFSL (“find first set” bit instruction) does the rest.
Ignoring the niceties of aligned 16-byte loads and handling short strings:
char const *ssechr(char const *s, char ch)
{
__m128i zero = _mm_setzero_si128();
__m128i cx16 = _mm_set1_epi8(ch); // (ch) replicated 16 times.
while (1) {
__m128i x = _mm_loadu_si128((__m128i const *)s);
unsigned u = _mm_movemask_epi8(_mm_cmpeq_epi8(zero, x));
unsigned v = _mm_movemask_epi8(_mm_cmpeq_epi8(cx16, x))
& ~u & (u - 1);
if (v) return s + ffs(v) - 1;
if (u) return NULL;
s += 16;
}
}
And how does ssechr compare with regular (glibc) strchr? With some TLC for the edge cases, ssechr is about the same as strchr when the target char is in the first 4 bytes, twice as fast for the next 32 bytes, and maxes out at about 16 times as fast as strchr when the char is 200 bytes (or more) into the string.
“ffs(v) – 1″ can be replaced by __builtin_ctz(v) when using GCC – it generates a ‘bsf’ instruction
True. On my current platform (gcc 4.1, Linux, x86-64) seems to be smart enough to generate the bsfl without an extract subtraction — the compiler sees that ffs(x) – 1 = (bsfl(x) + 1) – 1 and that +1-1 is a wash. Have you found there are platforms where __builtin_ctz(x) generates better code than ffs(x)?
I used __builtin_ffs(x) myself before, and it generated a cmov instruction.
BTW, is the above code correct when there is a match on ch just beyond an end-of-string?
I am changing the code to:
if (c==0) return s + __strlen(s); // handle rare case inline
__m128i zero = _mm_setzero_si128();
__m128i cx16 = _mm_set1_epi8(c); // (ch) replicated 16 times.
while (1) {
__m128i x = _mm_loadu_si128( ( const __m128i *) s);
unsigned u = _mm_movemask_epi8(_mm_cmpeq_epi8(zero, x));
unsigned v = _mm_movemask_epi8(_mm_cmpeq_epi8(cx16, x)) & ~u & (u – 1);
if ( (u+v)==0 ) {
s += 16;
continue;
}
// There could also be a newline => check if vv)
return (v<(u-1)) ? (char*) s + __builtin_ctz(v) : 0;
}
also created an SSE4.2 version (slightly faster):
if (c==0) return s + __strlen(s);
// SSE4 slightly faster than SSE2 version
#if defined(__SSE4_2__)
const __m128i chartofind = _mm_set1_epi8(c);
const __m128i* str = (const __m128i*) (s-16); // unaligned
loop:
__m128i strchars = _mm_loadu_si128( ++str );
int offset = _mm_cmpistri( chartofind, strchars, _SIDD_CMP_EQUAL_EACH );
__asm__ __volatile__ goto( "jz %l[eos] \n\t jnc %l[loop]" : : : : loop, eos );
return ((char*) str) + offset;
eos:
__asm__ __volatile__ goto( "jc %l[cintail]" : : : : cintail );
return 0;
cintail:
int zofs = _mm_cmpistri( _mm_setzero_si128(), strchars,
_SIDD_CMP_EQUAL_EACH );
return zofs<offset? 0 : ((char*) str) + offset;
Yes, the ffs(x) for me generates:
bsfl %rbx, %eax
movl $-1, %ecx
cmove %ecx, %eax
which is two register instructions at the end of a whole lot of code, with a cmove that is (in this case) always a nop, and unrelated to the “- 1″ in “ffs(x) – 1″. I never worried about optimizing that out.
I guess it’s time for me to start playing with hand-tuned asm and pcmpistri
Never mind, the code works correctly in case ch is beyond end-of-string. However, it fails for strchr( s, 0 )
This should return the end of the string, instead of NULL (see http://www.cplusplus.com/reference/clibrary/cstring/strchr/)
Simple fix for this rare case: if (u) return (c!=0) ? NULL : (char*) s + __builtin_ctz(u);
Thanks again. It was most educational to learn that strchr(s,0) and strrchr(s,0) do what they do, especially since (s + strlen(s)) is the more obvious way to express it, and it seems inconsistent to treat NUL like any other non-zero character. Should strstr(s, “abc”) only return a pointer to “abc” if it is at the end of (s)? Why doesn’t the null terminator in “abc” count as a part of the string there, too? How about strpbrk and strspn? But standards are standards, I guess. I’d be really interested in anyone suggesting the rationale is.
glibc 4.5 strlen uses SSE2 to find the trailing NUL, so a special up-front case {if (!c) return s + strlen(s)} would seem sensible.
BTW after posting ssechr, I worked out that xmm works faster, and safer, using _mm_loada_si128(), albeit with more elaborate code for startup and stop conditions. It’s safer because it can never accidentally try to read beyond the end of a memory page. A rewrite of ssechr using this scheme is what so far outperforms the glibc 4.5 strchr routine with SSE4.2 enabled (and disassembled to prove it was using PCMPSTRI) — even for short strings.
I tried the upfront case, but gcc does not optimize it away unless I put it in a macro:
#define strcmp(s,c) (((int _c = (c))!=0)?__strcmp(s,_c) : (char*) s + __strlen(s))
I figured that the case is rare enough to let the extra (c!=0) check handle it; for SSE4 you can simply change “return zofs=offset) ? ((char*) str) + offset : 0;” because zofs==offset when c==0
FYI – see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53712, am working to help gcc do the pstricmpi optimizations for you…
SSE4 code got lost (XML); should be “return (zofs>=offset) ? ((char*) str) + offset : 0;”
Yet another joke on me, relying on compiler optimization. gcc 4.1.2 -O9 produced the following instructions for “ffs(x)-1″:
movl $-1, %edx bsfl %eax, %eax cmove %edx, %eax addl $1, %eax subl $1, %eaxSheesh.