SSE2 beats SSE4.2 in memcmp?

I don’t currently have a machine on which I can test the latest GCC compilers with SSE4.2 support (pcmpestri etc.). So far, the following SSE2 code beats the memcmp that gcc 4.4 produces with -march=corei7 -msse4.2 (okay, perhaps that’s redundant :-). But gcc generates “repz cmpsb” inline for memcmp, which I find suspicious. In the benchmark code, prefixing the memcmp() with _mm_prefetch() calls makes no measurable difference. Any ideas?

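#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stdint.h>      /* uint8_t, intptr_t */
#include <strings.h>     /* ffs */

typedef __m128i XMM;

/* Aligned 16-byte load. */
static inline XMM xmload(void const*p)
{ return _mm_load_si128((XMM const*)p); }

/* Unaligned 16-byte load. */
static inline XMM xmloud(void const*p)
{ return (XMM)_mm_loadu_pd((double const*)p); }

/* Bit i of the result is set iff byte i of a and b differ. */
static inline unsigned xmdiff(XMM a, XMM b)
{ return 0xFFFF ^ _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)); }

/* memcmp-style result at the first differing byte in mask; 0 if mask is empty. */
static inline int cmp(int mask, uint8_t const*src, uint8_t const*dst)
{ return (mask = ffs(mask) - 1) < 0 ? 0 : (int)src[mask] - dst[mask]; }

int xmcmp(void const*_src, void const*_dst, int len)
{
    uint8_t const *src = (uint8_t const*)_src;
    uint8_t const *dst = (uint8_t const*)_dst;
    int ret, srcoff = 15 & (intptr_t)src;

    /* Unaligned head: compare 16 bytes, then advance src to a 16-byte boundary.
     * Skipped when len < 16, so that short inputs fall through to the masked
     * tail compare instead of reporting a difference beyond len. */
    if (srcoff && len > 15) {
        if ((ret = xmdiff(xmloud(src), xmloud(dst))))
            return cmp(ret, src, dst);

        src += 16 - srcoff;
        dst += 16 - srcoff;
        len -= 16 - srcoff;
    }

    /* Main loop: src is now aligned; dst may still be unaligned. */
    for (; len > 15; src += 16, dst += 16, len -= 16) {
        _mm_prefetch(src+512-64, _MM_HINT_NTA);
        _mm_prefetch(dst+512-64, _MM_HINT_NTA);
        if ((ret = xmdiff(xmload(src), xmloud(dst))))
            return cmp(ret, src, dst);
    }

    /* Tail: keep only the low len bits (0 <= len <= 15). This still loads a
     * full 16 bytes, so it may read past the end of both buffers. */
    ret = xmdiff(xmloud(src), xmloud(dst)) & ~(~0u << len);
    return cmp(ret, src, dst);
}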
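
The core of the routine is the compare/movemask/find-first-set step. As a minimal sketch of that step on a single 16-byte block: the names first_diff16 and cmp16 are hypothetical (they are not part of the code above), and __builtin_ctz assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>     /* uint8_t */

/* Hypothetical helper: index of the first byte at which two 16-byte
 * blocks differ, or -1 if all 16 bytes are equal. */
static int first_diff16(void const *a, void const *b)
{
    __m128i va = _mm_loadu_si128((__m128i const *)a);
    __m128i vb = _mm_loadu_si128((__m128i const *)b);

    /* cmpeq sets each equal byte to 0xFF; movemask packs the top bit of
     * every byte into a 16-bit integer. */
    unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
    unsigned diff = eq ^ 0xFFFFu;            /* 1 bits mark differing bytes */

    /* Lowest set bit = lowest address = first difference. */
    return diff ? __builtin_ctz(diff) : -1;
}

/* Hypothetical helper: memcmp-style result for one 16-byte block. */
static int cmp16(void const *a, void const *b)
{
    int i = first_diff16(a, b);
    return i < 0 ? 0
                 : (int)((uint8_t const *)a)[i] - (int)((uint8_t const *)b)[i];
}
```

Because x86 is little-endian and movemask puts byte 0 in bit 0, the lowest set bit of the difference mask corresponds to the first differing byte, which is why a single find-first-set is enough to locate it.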

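Since gcc expands memcmp inline here, a benchmark may be timing the compiler's "repz cmpsb" rather than the C library's memcmp. One way to time the library routine instead is to call it through a volatile function pointer, sketched below. The wrapper name memcmp_noinline is hypothetical; GCC's -fno-builtin-memcmp option is another route.

```c
#include <string.h>  /* memcmp, size_t */

typedef int (*memcmp_fn)(void const *, void const *, size_t);

/* volatile forces the pointer to be reloaded at each call, so the compiler
 * cannot prove the target is memcmp and substitute its inline expansion. */
static memcmp_fn volatile libc_memcmp = memcmp;

/* Hypothetical wrapper: always performs a real call into the C library. */
static int memcmp_noinline(void const *a, void const *b, size_t n)
{
    return libc_memcmp(a, b, n);
}
```

Benchmarking xmcmp against both the inline expansion and this wrapper would show whether the SSE2 code is beating the library memcmp or only the inline string instruction.
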
About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level (optimizing actions on terabyte databases) or the micro level (fast parallel string searches in memory). You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers on my blog. My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in ffs, SSE2, SSE4.2, string search.

2 Responses to SSE2 beats SSE4.2 in memcmp?

  1. (late) I’ve had gcc generate some _really_ bad intrinsics in cases like bcmp, where the glibc version was significantly faster. Don’t assume that gcc knows what it’s doing.

    • mischasan says:

      Agreed, in part: I was unhappily surprised that gcc 4.4 generates REP CMPSB. Out of curiosity, what versions of gcc and glibc are you referring to?
