Tag Archives: SSE2

The “C” preprocessor: not as cryptic as you’d think

The C preprocessor is a modest macro-expansion language (check out “m4” if you want to see an immodest one). Basic symbols and function-macros are convenient for giving meaningful names to constants and tiny function calls, with the rewarding feeling that … Continue reading

Posted in bit shift, preprocessor | 6 Comments
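
As a small illustration of the excerpt above (naming constants and tiny function calls with macros), here is a minimal sketch. The names `WORD_BITS`, `BIT`, `ROTL32` and `rotl32_checked` are hypothetical, made up for this example; they are not taken from the original post.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define WORD_BITS 32u                      /* object-like macro: a named constant */
#define BIT(n) (UINT32_C(1) << (n))        /* function-like macro: one-bit mask   */
/* Rotate left; every argument is parenthesized so that expressions
 * such as ROTL32(a | b, k + 1) expand the way the caller expects. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (WORD_BITS - (n))))

/* Thin wrapper that rejects shift counts the macro cannot handle
 * (a shift by 0 or by 32 would make one of the two shifts undefined). */
static uint32_t rotl32_checked(uint32_t x, unsigned n)
{
    assert(n > 0 && n < WORD_BITS);
    return ROTL32(x, n);
}
```

For instance, `BIT(3)` expands to a value of 8, and `rotl32_checked(0x80000001u, 1)` moves the top bit around to the bottom, giving 3.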

SSE2 odd-even merge (the last step in sorting)

If you’ve looked at my example of bitonic sort in SSE2 in ASM or in “C”, you’ll see that the clever stuff ends with two eight-element sorted sequences. The final step is a simple loop that merges the two sequences. … Continue reading

Posted in Uncategorized | 1 Comment
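
The "simple loop that merges the two sequences" mentioned in the excerpt above might look roughly like the following. This is a plain scalar two-way merge written under my own assumptions (the function name `merge_sorted` and its signature are hypothetical), not the SSE2 odd-even merge from the post itself.

```c
#include <assert.h>
#include <stddef.h>

/* Merge two ascending sequences a[0..na) and b[0..nb) into out[0..na+nb).
 * out must not overlap either input. */
static void merge_sorted(const double *a, size_t na,
                         const double *b, size_t nb,
                         double *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(a != NULL && b != NULL && out != NULL);

    /* Take the smaller head element each time; ties go to a. */
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];

    /* One input is exhausted; copy whatever remains of the other. */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}
```

Calling it with two sorted eight-element arrays and a sixteen-element output buffer gives the fully sorted sequence of sixteen doubles.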

SSE2 and BNDM string search

For the past few weeks, I’ve been testing and experimenting with the Railgun string search function written by Sanmayce. Railgun is really a “memmem” function, where the target length is known in advance; and the cost of compiling the pattern … Continue reading

Posted in algorithm, SSE2, string search | 8 Comments

The Generic SSE2 Loop

In response to a couple of comments on my post about find-first-bit-set in SSE2 registers, amounting to “what use is a routine that only does 16-byte bitvecs”, I thought I’d post the canonical, generic loop through memory using SSE2 ops. … Continue reading

Posted in ffs, SSE2, Uncategorized | 5 Comments
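
One common shape for a generic loop over memory with SSE2 is: process whole 16-byte blocks with vector ops, then finish the tail with scalar code. The sketch below applies that shape to finding the first nonzero byte in a buffer. It is my own example, not the loop from the post; the name `first_nonzero_byte` is hypothetical, and the code falls back to the scalar loop alone when the compiler does not define `__SSE2__`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Return the index of the first nonzero byte in buf[0..n), or n if none. */
static size_t first_nonzero_byte(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    assert(buf != NULL || n == 0);

#if defined(__SSE2__)
    {
        const __m128i zero = _mm_setzero_si128();

        /* Skip whole 16-byte blocks that are entirely zero. */
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
            /* Bit k of mask is set when byte k of the block is zero. */
            unsigned mask =
                (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (mask != 0xFFFFu)
                break;          /* this block holds a nonzero byte */
        }
    }
#endif

    /* Scalar pass: pins down the exact byte inside the block found
     * above, and also covers the final n % 16 bytes. */
    for (; i < n; i++)
        if (buf[i] != 0)
            return i;
    return n;
}
```

The unaligned load (`_mm_loadu_si128`) keeps the sketch short; aligning the pointer first and using aligned loads is a common refinement.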

The full SSE2 bit matrix transpose routine

Source code for this routine and many others using SSE2 in unusual ways is in this github repo. Since there have been a large number of hits on the “SSE2 bit matrix transpose” post, here’s the full deal: transpose of … Continue reading

Posted in Uncategorized | 6 Comments

Update on bitonic SSE2 sort of 16 doubles

For the complete source code for both sorting and ranking functions using SSE2, check out ssesort.c in this GitHub repo. I originally used asm to generate the bitonic sorter. After doing a little more testing, I found that gcc 4.4 … Continue reading

Posted in algorithm | 5 Comments

Okay, one more poke at SSE2: sorting doubles

Follow-up: source code is on GitHub: ssesort.c. Old-school (pre-CUDA) non-graphics programming of GPUs dusted off a bunch of classic algorithms that did little or no branching, and no data sharing between processors, but allowed massive parallelism. One of those algorithms … Continue reading

Posted in algorithm | 1 Comment

What the !@# is SSE2 good for: char search in long strings

You don’t need SSE4.2 to do some neat string operations with XMM registers. Case in point: using 16-byte parallelism to search for a character in a null-terminated character string, aka strchr. Smart implementations of strchr don’t simply test each byte … Continue reading

Posted in Uncategorized | 10 Comments
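
To make the strchr idea above concrete, here is a minimal sketch of a 16-bytes-at-a-time character search using SSE2 intrinsics. It is not the routine from the post: the name `sse2_strchr` is hypothetical, and the approach (scalar steps up to a 16-byte boundary, then aligned loads that test for the wanted character and the terminator together) is just one common way to do it. Without `__SSE2__` it degrades to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Like strchr: pointer to the first occurrence of c in s, or NULL. */
static const char *sse2_strchr(const char *s, int c)
{
    const char ch = (char)c;

    assert(s != NULL);

#if defined(__SSE2__)
    /* Scalar steps until s is 16-byte aligned. An aligned 16-byte
     * load never straddles a page, which is why reading a few bytes
     * past the terminator is tolerated in practice (strictly
     * speaking it is outside what ISO C guarantees). */
    while (((uintptr_t)s & 15) != 0) {
        if (*s == ch)
            return s;
        if (*s == '\0')
            return NULL;
        s++;
    }
    {
        const __m128i zero = _mm_setzero_si128();
        const __m128i want = _mm_set1_epi8(ch);

        for (;;) {
            __m128i v = _mm_load_si128((const __m128i *)s);
            /* One bit per byte: matches of ch, and NUL bytes. */
            unsigned hit =
                (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, want));
            unsigned end =
                (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (hit | end)
                break;          /* answer lies within this block */
            s += 16;
        }
    }
#endif

    /* Scalar finish: at most 16 steps after the vector loop. */
    for (;; s++) {
        if (*s == ch)
            return s;
        if (*s == '\0')
            return NULL;
    }
}
```

As with the standard function, searching for `'\0'` returns a pointer to the terminator, because the match test runs before the end-of-string test.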