Aho-Corasick code posted to github

You may be interested in the source code for my best-of-both-worlds Aho-Corasick implementation. It runs on a state-transition matrix (the fastest implementation) in an interleaved array (the smallest-data implementation). Typically 2.5 bytes of state machine for each byte of pattern text. The matrix is tuned for L1-cache-friendliness. The original google doc describing the technique is http://goo.gl/6decc. https://github.com/mischasan/aho-corasick

So what’s the big deal about this? Well, back in the late 70’s, Robert Tarjan tried this … but found that the “compile” step was ridiculously expensive, and unsuitable for large sets of patterns. He based his work on another researcher’s ideas about “interleaved arrays” which is a very cool idea (“ACISM” in the github source refers to “Aho-Corasick Interleaved State-transition Matrix”). Unfortunately, the idea of how best to do this was by a physical analogy: if you want to fill a bucket with rocks, gravel and sand, put the bigger (coarser) things in first, because the finer-grain things will fit in the gaps around them. Unfortunately, that produces the worst-case runtime behaviour.

Instead, I tried inserting the objects in random order, while using a “hint” array to skip searching in areas where a given object couldn’t possibly fit. Tarjan was right about interleaved arrays, just missed a trick in implementation. See the .pdf at github, or the google.doc article, to see how the compilation works. Or just have fun using the code.

About mischasan

I've had the privilege to work in a field where abstract thinking has concrete value. That applies at the macro level --- optimizing actions on terabyte database --- or the micro level --- fast parallel string searches in memory. You can find my documents on production-system radix sort (NOT just for academics!) and some neat little tricks for developers, on my blog https://mischasan.wordpress.com My e-mail sig (since 1976): Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.
This entry was posted in Uncategorized. Bookmark the permalink.

4 Responses to Aho-Corasick code posted to github

  1. sanmayce says:

    Hi Mischa,
    interesting and frighteningly effective are those automata etudes, way to go, yet I am still in the woods and the thing that interests me is to exhaust the hash-like approaches.

    To the point, do you have an idea of juxtaposing for example my latest best ‘Railgun_Sekireigan’ with some scary automaton-like beast of yours in particular? I am still ignorant of what exactly edge they can give.
    In my view the numbers (difference will say much if not scream) will open eyes of many people how powerful they are, of course each one ruling in its own realm. Simply put: a simple benchmark will be very informative.

  2. mischasan says:

    I’ve replied to this in detail via email, sanmayce. Railgun is better compared to BNDM or the SSE2 string search routines (where your routine . Comparing railgun and Aho-Corasick is sort of comparing a motorbike to a disk harrow for making ruts. Aho-Corasick is also not a universally-best multi-string search. When you have a lot of LONG patterns and substrings of the patterns tend to be rare in the target text, Wu-Manber or the routine I wrote under this patent ( http://www.faqs.org/patents/app/20090238474 ) are much better. My 2c.

  3. sanmayce says:

    Thanks, I see, for heavy loads (multi-pattern searches) heavy off-road machinery is needed.

  4. Pingback: Quick update on ACISM | Coding on the edges

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s