The “C” preprocessor: not as cryptic as you’d think

The C preprocessor is a modest macro-expansion language (check out “m4” if you want to see an immodest one).  Basic symbols and function-macros are convenient for giving meaningful names to constants and tiny function calls, with the rewarding feeling that you’re helping the compiler do its optimizing best.

As an aside, you can try out these samples using your normal compile commands but substituting cc -E … for cc -c … . I am going to assume you can find out what #X and A##B mean to the precompiler.

An unfortunate truth about cpp coding is, it can quickly make it impossible to figure out what actually is happening (qv my post on identifying what is being #included, and how to find out which #if’s control a given line’s inclusion.

Macros can make life hell for people who heavily rely on gdb. Like programs built with C++ templates, there’s often a big disjunct between the source code, and the symbols the debugger displays. Also, you can’t set a breakpoint on a macro. For that reason, static inline functions are preferrable, even to tiny macro functions.

Badly written macros can create odd syntax errors, or silently destroy control flow in your program
(Please pardon the squished-together code layout)

#define foo(str) { puts("error:"; puts(str) }
    if (a != b) foo(a); else puts("okay");

The if-statement would be fine if foo were a real function; but braces mess up the (admittedly peculiar) requirements of if-statement syntax. The common defense is to use a peculiar idiom:

#define foo(str) do { puts("error:"; puts(str) } while(0)

The precompiler is useful when nothing but constants are allowed, and the code becomes extremely repetitive.

For example, the SSE2 shift operators only take constant shift-distance arguments. This results in repetitive case statements, particularly if you are trying get the most out of the compiler (fewest calls and branches).

The practical example here is an optimal shift-left function for 128-bit XMM registers.
This takes a 127-way case statement of the form:

__m128i xm_shift_left(__m128i x, int bits)
    switch (bits) {
    case 0: break;
    case 1: ... break;
    case 2: ... break;
    case 127: ... break;
    default: x = _mm_setzero_si128(); 
    return x;

Now the kicker: optimal instruction sequences for those cases break down into four different formulas:

  • for n=8,16, …120, a single instruction:
    _mm_slli_si128(x, n/8)
  • for n=1..7, a five-instruction sequence that I may dissect in another post:
    _mm_or_si128(_mm_slli_epi64(x, n), 
                 _mm_srli_epi64(_mm_slli_si128(x, 8), 64-n))
  • for n=9..15 17..23 … 57..63, a six-instruction sequence:
    _mm_or_si128(_mm_slli_epi64(_mm_slli_si128(x, n/8), n%8), 
                 _mm_srli_epi64(_mm_slli_si128(x, 8+n/8), 64-n%8))
  • and for n=65..71 73..79 … 121..127, a two-instruction sequence:
    _mm_slli_epi64(_mm_slli_si128(x, n/8), n%8)

You can spend some time generating the case statement by hand; or you can use a bit of automation.

Originally, I #define’d four macros (F1(n), F2(n), F5(n), F6(n)) then used this perl command

perl -e 'print "case $_: x=F"
               .($_ < 8 ? 5 : $_ % 8 ? $_ < 64 ? 6 : 2 : 1)
               ."($_); break;\n" for 1..127

Using perl to generate embedded “C” code has its own problems; but you can do the equivalent with macros.

The trick is, if you have a fixed (small) number of repetitions (e.g for i = 1 to 7), you can expand them in line.
The above ranges can be broken into single loops of 1..7 and double loops of 1..7 x 1..7. First, an inner- and outer-loop macro:

// for (i = 1; i < 8; i++) f(x + i);
#define BY_1(f,x) f((x)+1) f((x)+2) f((x)+3) \
                  f((x)+4) f((x)+5) f((x)+6) f((x)+7)

// for (i = 8; i < 64; i += 8)  f(y + i);
#define BY_8(f,y) BY_1(f,(y)+1*8) BY_1(f,(y)+2*8) BY_1(f,(y)+3*8) \
                  BY_1(f,(y)+4*8) BY_1(f,(y)+5*8) BY_1(f,(y)+6*8) \

Then the four functions, written as case statements:

#define F1(n) case (n)*8: x = _mm_slli_si128(x, n); break;
#define F2(n) case n: x = _mm_slli_epi64(_mm_slli_si128(x, n/8), \
                                         n%8); break;

And now the loops:

    BY_1(F5, 0) // 1..7
    BY_1(F1, 0) // 8,16, ... 56
    BY_1(F1, 7) // 64,72, ... 120
    BY_8(F6, 0) // 9..15, 17..23, ... 57..63
    BY_8(F2,56) // 65..71, 73..79, ... 121..127

Try this cc -E, and you will see output that — with a bit of whitespace fixups — looks like:

    case (0)+7: x = _mm_or_si128(_mm_slli_epi64(x, (0)+7), 
                                  _mm_srli_epi64(_mm_slli_si128(x, 8),
    case (0)+1*8: x = _mm_slli_si128(x, (0)+1); break;
    case ((56)+1*8)+1: x = _mm_slli_epi64(_mm_slli_si128(x, ((56)+1*8)+1/8),

Other kinds of lists

Another use of more-than-vanilla macros is to put definitions of a list of values in a single place.

For example, I have an enum for about 100 error return values: HXERR_BAD_REQUEST, HXERR_BAD_RECORD, HXERR_BAD_FILE … Among other things, I want string versions of those enum values, for logging. I could define the two lists separately, and risk a hard-to-spot mismatch. Or, I could do this:

               _E(BAD_FILE) _S _E(READ) ...
#define _S ,
#define _E(x) HXERR_##x = _HXERR_##x
enum { _HXERR_OK=0, HXERRS };

I can tweak that so the error enums are negative:

#undef _E
#define _E(x) HXERR_##x = -_HXERR_##x
enum { HXERR_OK=0, HXERRS };

What’s happening here? _S is a macro for the “separator”. _E(x) is a mnemonic for “each” or “element”. Setting _S and _E makes the HXERRS macro expand in different ways.

That defined the enum. Now I want the corresponding list of printable strings, probably in some private source file:

#define _S ,
#define _E(x) #x
    static char const *symv[] = { "okay", HXERRS };

Yes, this is NOT as obvious as hard-coding the enum and string list. But it’s a good solution when the problem is a list duplicated several times and/or in different files.

That’s it for now. Next up: convergence between preprocessor programming and function programming.


About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in bit shift, preprocessor and tagged , , , , , , . Bookmark the permalink.

6 Responses to The “C” preprocessor: not as cryptic as you’d think

  1. Geoff Langdale says:

    Nice article. None of the below necessarily invalidates your approach, although it would be somewhat interesting to compare the use of the below tricks:

    Oddly enough, there are some exceptions to the “can’t do SSE integer shifts by a variable quantity” – that is, you can use xmm/m128 registers to hold shift quantities (in their low order bits). This means that you have to have a nice big table of values to index into, although I suppose you could pack them. Packing them is a good idea, I suppose, as you can thus index them with mod r/m addressing without any trouble, which is otherwise awkward for SSE values (as standard Intel mod r/m can’t do a multiply by sizeof(m128) for free).

    This puts a couple loads onto your critical path (which can run in parallel) rather than a branch mispredict – hard to see whether that would be a better idea.

    Another bit of trickery (if you have SSSE3) is to use the PSHUFB instruction to do byte-level variable shifts. On a modern Intel architecture, PSHUFB is as fast as a shift, although I believe we lose one generalized shuffle unit with Haswell. Sandy Bridge and Ivy Bridge could do two full 128-bit shuffles per cycle.

    Sadly, even with AVX/AVX2, there is still no way to do shifts of a 128-bit value by a shift amount that doesn’t cleanly divide by 8, and AVX2 adds a new collection of fairly asinine restrictions.

    • mischasan says:

      Thanks. And you have me intrigued but puzzled … I’m having no luck finding a reference to XMM shifts with the count in another XMM register. Can you point me at an opcode name, or link? My Intel x86/64 docs only refer to “[V]PSLLDQ xmm,imm”.

      The extension to PSHUFB and the shift-trick using that is one to remember, especially now that I’m no longer need to stay AMD/Intel agnostic. It certainly makes _mm_set1_epi8 much faster! Weirdly, the SSE2 _mm_set1_epi8 multi-instruction sequence is a significant start-up overhead in the string-search functions using SSE2, for short targets.

      In the example given, handling all the other 120 cases seems to work best as a single computed jump for the case statement, with no further branches (and in fact, no further memory references). Admittedly, that’s a near-guaranteed missed branch prediction 🙂 Can’t see the gain, though, in having all the 8N-bit cases branch to the same small block of code, then doing an _mm_load_si128().

      Can’t argue about the amount of asinine restrictions in this instruction set. OTOH, they make it more fun when you find a way to make them useful somehow!

      • Geoff Langdale says:

        If you go to the current version of the Intel® 64 and IA-32 Architectures Software Developer Manuals you can look up (say) PSLLW/PSLLD/PSLLQ and there will be entries like:

        “66 0F F3 /r PSLLQ xmm1, xmm2/m128” which will be described as “Shift quadwords in xmm1 left by xmm2/m128 while shifting in 0s.”

        If you have the current all-singing, all-dancing, 9-volumes-in-1-huge-PDF form, then the page is 1195.

        I haven’t used this form so can’t vouch for it. Note that it is a fixed version (everyone moves the same distance) not a variable independent shift like in Tilera or AVX2 etc.

        Assume the quadword form works (I think all of these things below can be done with either the 16/32/64 bit forms).

        I haven’t really thought too carefully about this, but it does seem possible that one could have the required shift amounts (e.g. 63 and 1, 62 and 2, 61 and 3, …) in a single xmm register. High bits are safely ignored, I think. So we’d have something like this:

        so, if we’ve got the value in v, and shift value is in s, then
        load based on shift value s -> a m128 register with a pair such a [61,3] (call it t)
        then shift our original value by 64 bits into a temporary v2
        then shift v by our variable shift in t
        then shift t over to get at the other part of the pair (call it t2)
        then shift v2 by our variable shift in t2
        then put v and v2 back together with an or.

        That’s a load, 2 fixed shifts, 2 variable shifts, and an or. Given the furious rate of instruction issue, it seems this _could_ be faster than a branch mispredict. Then again I might be completely full of it.

  2. Pingback: What is SSE !@#$% good for? #2: Bit vector operations | Coding on the edges

  3. Sebastiano Vigna says:

    I tried to use the code associated with this post appearing on stackoverflow:

    However, it does not work for almost half of the shift values. A simple test program is available at

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s