Faster than Fast String Search in Qt
Is your code burning a lot of cycles hunting for strings? Maybe you need to find the proper charset nestled in a chunk of HTML, or look for the dimensions in an XPM image file, or locate the email attachment boundaries in a block of MIME. If you string search a lot and performance is important to you, you’ll want to keep reading!
You may have encountered Boyer-Moore string searches if you have a formal education in Computer Science or if you’re an algorithm junkie. When it comes to finding strings within text, Boyer-Moore is the gold standard for speed. It works by searching from the back end of the sought-after string and by using intelligent comparisons to skip over chunks of the test pattern.
Why is Boyer-Moore so darn fast?
As a very quick example, let’s say we’re looking for the string “</body>” within the following text:
<body>This is a block of sample text.</body>
The search starts at the back of the search string. Since “</body>” is seven characters long, the search looks at the seventh character of the text. It’s a “T” which doesn’t match the pattern’s seventh character, so the string couldn’t possibly match. Because “T” doesn’t appear elsewhere in the pattern, the search skips another seven characters because that’s the earliest next possible match. The text has a space at the fourteenth character — again not a match — so the search skips another seven characters. It continues skipping seven characters at a time until it encounters the end where it needs to check for a possible match. Our example doesn’t hit any complicating cases but we’ll get to that.
The diagram below highlights where the search makes comparisons against the text: it sniffs at it every once in a while (yellow), until it fully checks it at the end (green) and finds a match.
Compared to a naïve implementation that would make 43 tests (one for each letter), the Boyer-Moore string search only needed six comparisons (length of text divided by length of pattern) before it found and verified the match. This algorithm performs very well if the pattern is long and doesn’t contain repeats. However it still performs acceptably — at least as well as a standard string search — even in degenerate cases when the pattern and text contain all the same letters.
Substring matches and skip tables
There’s just one problem — how does it know how far ahead to skip? This problem is apparent when we think about patterns containing duplicate letters, for example searching for “abba” within “aabacadabraabacabba”. Because there are substring matches within the pattern and text, the algorithm can’t just blaze forward by the length of the pattern or it may skip right over a match.
To solve this, we have to preprocess the test string to create a table that tells us how far ahead to skip. There’s plenty of literature on how to create a Boyer-Moore skip table because there are a few alternative methods. All of them mean that the actual cost of a search is the search time plus the skip table generation time. (At least for the first search — for multiple searches using the same string, you can amortize the cost of the preprocessing against the subsequent searches provided you save your skip table.) That additional preprocessing can make a Boyer-Moore search slower than a standard string search under the wrong circumstances.
But that’s where the power of C++ comes in. If you’re looking for static strings anyway, shouldn’t you be able to generate skip tables at compile time? Thanks to the extra power in the latest C++ standard — yes you can!
C++14 adds a crucial feature to implement compile-time Boyer-Moore string searching; namely a constexpr function is now allowed to contain loops and logic. With this additional functionality, we can build a skip table at compile time for any static search strings, eliminating the cost of the preprocessing step and making the already fast string searches even faster.
What’s Qt got to do with it?
We’ve extended Qt’s string searching by adding in compile-time Boyer-Moore string searches. This required a couple of constexpr functions and template changes to QByteArrayMatcher, giving us a new function qMakeStaticByteArrayMatcher(). Using this new string search capability is dead simple. Here’s how you’d implement it in our earlier search example:
static const auto seekBodyEnd = qMakeStaticByteArrayMatcher("</body>");
seekBodyEnd.indexIn(QByteArray("<body>This is a block of sample text.</body>"));
You’ll notice we use another handy “new C++” feature, the auto keyword. If you’re used to old-school C/C++, you might be fooled into thinking auto means “allocate this variable on the stack”. That use has been deprecated since C++11 and ever since auto has become a lot more useful. Here, auto intelligently assigns the type of seekBodyEnd so we don’t have to worry about the “under the covers” template magic that figures out the length of our search string.
Pretty cool, huh? Okay — now go out there and do some searching!
Sounds cool. A link to the commit(s) would be amazing.
Thanks for your interesting post,
Will it be available within Qt 5.7?
will this be available with Qt 5.8 right?
Is that only for QByteArray or QStrings too?
Also what if I’m parsing a stream of bytes can I assume the chunk size must be multiples of matcher size?
There seems to be a typo in your code example at the end – I suppose ” was meant to be ”.
Oops, the markup was swallowed in my last comment. I meant to suggest replacing .body (i.e. dot body) with /body (i.e. slash body).
Good catch–fixed!
Just a heads-up:
QStaticByteArrayMatcher
was merged for Qt 5.9.