Qt is a native library at heart. As a native (C++) library, it already outperforms most higher-level language libraries when it comes to startup performance. But if you're using native languages, you usually do so because you need to get the most out of the available hardware, and being just fast may not be fast enough. So it should come as no surprise that we at KDAB are looking into how to speed things up even more.
A Look At Dynamic Linking
One source of startup delays in native applications is the dynamic linker. You can read all about how it works on Unix in Ulrich Drepper's excellent article, How To Write Shared Libraries. For the purposes of this article, it is sufficient to understand that the final link step of a native application is performed at startup time. In this step, the application, as well as the libraries it uses, are loaded into memory and adapted to the specific memory location they have been loaded to. Memory locations may differ from application to application for security reasons (address space randomisation) or simply because one application loads more libraries than another, requiring a different memory layout, especially on 32-bit platforms.
With this — very simplified — view of things in mind, let's look at what "adapting to the specific memory location" actually involves.
Most of the library code is compiled as position-independent code, which means that jumps as well as data references are always expressed as an offset to the current execution position (commonly called the program counter, or PC). That offset doesn't change when the library is loaded at different memory locations, so code, for the most part, does not need to be adapted.
But a library does not only consist of code. It also contains data, and as soon as one piece of data points to another (say, a pointer variable which references a function), the content of that pointer suddenly becomes dependent on the actual position of the library in memory. Note that the trick used in code (offsetting from the PC) doesn't work here.
So the linker is forced to go in and patch the pointer variable to hold the actual memory location. This process is called relocation. By performing relocations, the dynamic linker changes the data from how it is stored on disk, which has several drawbacks: First, the data (actually, the whole memory page - usually 4KiB) is no longer backed on-disk, so if memory gets tight, it has to be copied to swap instead of just being dropped from memory, knowing that it can always be loaded back from disk. Second, while unmodified data is shared among processes, once the data is modified in one process, the data is copied and no longer shared (copy-on-write), and this can be a real memory waster on systems where many applications use the same library: All the library copies living in different application address spaces are duplicated instead of shared, increasing the total memory footprint of the system.
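To make the distinction concrete, here is a minimal sketch that contrasts data stored directly with data that stores an address. All names (`greet`, `message`, `callback`, `messagePtr`, `invoke`) are hypothetical. When such code is built into a position-independent shared library, only the address-holding variables would typically need a relocation:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical function, used only as something to point at.
int greet() { return 42; }

// Plain data: the bytes are fixed at build time and do not depend on
// where the library ends up in memory, so no relocation is needed.
const char message[] = "hello";

// Data holding addresses: the stored values depend on the load address,
// so the dynamic linker typically has to patch them at startup.
int (*const callback)() = &greet;
const char *const messagePtr = message;

// At run time both forms behave the same; only the load-time cost differs.
bool sameContent() { return std::strcmp(message, messagePtr) == 0; }

int invoke()
{
    assert(callback != nullptr);
    return callback();
}
```

On Linux, `readelf -r` on the resulting shared object lists the relocations the dynamic linker has to process, which is one way to check what a given declaration costs.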
V-Tables And String Tables
If all of the above was a bit abstract for you, let's look at some concrete examples:
In a C++ library, the virtual function call mechanism is a major source of relocations, because vtables are simply lists of function pointers, all entries of which require relocation. But short of reducing the number of virtual functions (something Trolltech originally did for Qt 4), there's not much one can do about those.
But there is a class of relocations that are 100% avoidable, with some work: string tables. In their simplest form, they come as an array of C strings:
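As a minimal sketch, such a table could look like this. The name `strings`, the entries, and the helper `stringAt()` are illustrative assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical string table: the array index is the lookup key.
static const char *const strings[] = {
    "red",
    "green",
    "blue",
};

static const std::size_t stringCount = sizeof strings / sizeof strings[0];

// Look up a string by its zero-based index.
const char *stringAt(std::size_t index)
{
    assert(index < stringCount);
    return strings[index];
}
```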
But the above is just a short-cut for the following:
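Spelled out, and again with hypothetical names, the long form could be sketched like this. The character data lives in arrays of its own, and the table merely stores their addresses:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The character data itself: plain bytes, no relocation needed.
static const char string0[] = "red";
static const char string1[] = "green";
static const char string2[] = "blue";

// The table only stores addresses of the arrays above.
// Every element is a pointer, so every element needs a relocation.
static const char *const strings[] = {
    &string0[0],
    &string1[0],
    &string2[0],
};

static const std::size_t stringCount = sizeof strings / sizeof strings[0];

// Look up a string by its zero-based index.
const char *stringAt(std::size_t index)
{
    assert(index < stringCount);
    return strings[index];
}
```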
You can view this as a mapping between a zero-based integer and a string, with the integer implicitly encoded in the string position in the array. In the more complex form, the string table maps something other than a zero-based integer:
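A sketch of such a key-value table, assuming a colour-to-name mapping called `colorMap`: `Rgb` stands in for Qt's `QRgb` so the example compiles without Qt, and `ColorEntry` and `findColor()` are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

struct ColorEntry {
    Rgb color;
    const char *name; // pointer member: one relocation per entry
};

static const ColorEntry colorMap[] = {
    { 0xFFFFFFFFu, "white" },
    { 0xFF000000u, "black" },
    { 0xFFFF0000u, "red"   },
    { 0xFF00FF00u, "green" },
    { 0xFF0000FFu, "blue"  },
};

// Linear by-name lookup; returns false if the name is unknown.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    for (const ColorEntry &entry : colorMap) {
        if (std::strcmp(entry.name, name) == 0) {
            *out = entry.color;
            return true;
        }
    }
    return false;
}
```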
Here, too, what we colloquially call a "string" is actually a pointer-to-const-char, and therefore in need of relocation at dynamic link time.
One Solution
So the underlying problem here is that strings are inherently reference types — they are only a pointer to the data stored elsewhere. And we learned that data referring to other data causes relocations. So it would seem that the easiest way to avoid relocations is to store the data directly, and not reference it. The two examples above could be rewritten as:
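A possible rewrite of both tables with in-line storage, sketched under the same assumptions as before (`Rgb` stands in for `QRgb`; all other names are hypothetical). The array bound 6 is the length of the longest entry plus its terminating NUL:

```cpp
#include <cassert>
#include <cstring>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

// Simple table: a two-dimensional char array instead of an array of pointers.
// 6 == strlen("green") + 1.
static const char strings[][6] = {
    "red",
    "green",
    "blue",
};

// Key-value table: the name is an embedded array, not a pointer.
struct ColorEntry {
    Rgb color;
    char name[6]; // "white", "black" and "green" need 5 characters + NUL
};

static const ColorEntry colorMap[] = {
    { 0xFFFFFFFFu, "white" },
    { 0xFF000000u, "black" },
    { 0xFFFF0000u, "red"   },
    { 0xFF00FF00u, "green" },
    { 0xFF0000FFu, "blue"  },
};

// Linear by-name lookup; returns false if the name is unknown.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    for (const ColorEntry &entry : colorMap) {
        if (std::strcmp(entry.name, name) == 0) {
            *out = entry.color;
            return true;
        }
    }
    return false;
}
```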
In both cases, the string data is now stored in-line, and no relocations are necessary anymore.
But this approach has several drawbacks. First, it wastes some space if the strings are not all of the same length. In the above examples, that waste is not very large, but consider what happens if the colorMap above gets a member whose name is "azure light blue ocean waves". Then the name member needs to be at least of size 29 (28 characters plus the terminating NUL). Consequently, fewer than two of those structs now fit into one cache line, reducing scanning performance significantly for both lookups, by-color as well as by-name, which is the second problem.
So this simple approach, which requires no changes to the code or data except for the declaration of the string member, works well only if the strings are all of essentially the same length. In particular, just one outlier pessimises the lookup performance of the whole table.
A Better Solution
Data-Oriented Design suggests that we should prefer to separate data of different types. We can apply this in the colorMap case and hold colors and names in different arrays:
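A sketch of the split layout, with the arrays named `colors` and `names` as in the text. `Rgb` (standing in for `QRgb`), `colorCount`, `findColor()`, and `findName()` are assumptions made for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

// Values and keys live in two parallel arrays; entry i of one belongs
// to entry i of the other.
static const Rgb colors[] = {
    0xFFFFFFFFu, 0xFF000000u, 0xFFFF0000u, 0xFF00FF00u, 0xFF0000FFu,
};

static const char names[][6] = {
    "white", "black", "red", "green", "blue",
};

static const std::size_t colorCount = sizeof colors / sizeof colors[0];
static_assert(sizeof names / sizeof names[0] == colorCount,
              "colors and names must have the same number of entries");

// By-name lookup only touches the names array.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    for (std::size_t i = 0; i < colorCount; ++i) {
        if (std::strcmp(names[i], name) == 0) {
            *out = colors[i];
            return true;
        }
    }
    return false;
}

// By-color lookup only scans the densely packed colors array.
const char *findName(Rgb color)
{
    for (std::size_t i = 0; i < colorCount; ++i) {
        if (colors[i] == color)
            return names[i];
    }
    return nullptr;
}
```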
We still have the gaps within the names array, but at least the colors are out of the way now. We can then compress the string data the way moc has been doing since Qt 4.0:
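A hand-written sketch of this layout for the colour example follows. It illustrates the technique and is not moc's actual output; `Rgb`, `nameOffsets`, `nameAt()`, and `findColor()` are assumed names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

static const Rgb colors[] = {
    0xFFFFFFFFu, 0xFF000000u, 0xFFFF0000u, 0xFF00FF00u, 0xFF0000FFu,
};

// All names in one char array, separated by NULs (the last NUL is implicit).
static const char names[] =
    "white\0"
    "black\0"
    "red\0"
    "green\0"
    "blue";

// Start of each name within the array above; maintained by hand here.
static const unsigned int nameOffsets[] = { 0, 6, 12, 16, 22 };

static const std::size_t colorCount = sizeof colors / sizeof colors[0];
static_assert(sizeof nameOffsets / sizeof nameOffsets[0] == colorCount,
              "one offset per color");

// Name of entry i: a pointer computed at run time, not stored in the data.
const char *nameAt(std::size_t i)
{
    assert(i < colorCount);
    return names + nameOffsets[i];
}

// Linear by-name lookup; returns false if the name is unknown.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    for (std::size_t i = 0; i < colorCount; ++i) {
        if (std::strcmp(nameAt(i), name) == 0) {
            *out = colors[i];
            return true;
        }
    }
    return false;
}
```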
We just concatenate all strings into one, with NULs as separators, and record the start of each one in an offset table. Please take a moment to digest this. We now have reached a point where there are no relocations, not more than sizeof(uint) bytes wasted per-entry (could be reduced to sizeof(ushort) or sizeof(uchar) for smaller tables, which is less than the sizeof(const char*) with which we started out), and nicely separated lookup keys and values.
But we have created an unmaintainable beast. The largest such table in Qt is ca. 650 entries in size. One problem is that key and value are now separated — those two arrays better stay in sync. The even larger problem is that no-one is calculating the offset table for us!
So, while this technique of avoiding relocations is pretty well-known, it is hardly ever applied in practice because it essentially forces you to write a code generator to create these intricately-connected sets of tables from a human-readable description.
Enter Q_STRINGTABLE
The key insight now is that C++ comes with powerful code generators built in: both template metaprogramming (TMP), at least in C++11, and the good ol' C preprocessor can be used here.
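To illustrate the principle only, here is a sketch that generates all three tables from one human-readable list, using an X-macro and a `constexpr` function. This is not how Q_STRINGTABLE is implemented. The macro `COLOR_TABLE`, the `Rgb` stand-in for `QRgb`, and all other names are assumptions, and the loop inside the `constexpr` function requires C++14.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

// The single human-readable description of the table (hypothetical macro).
#define COLOR_TABLE(X) \
    X(0xFFFFFFFFu, "white") \
    X(0xFF000000u, "black") \
    X(0xFFFF0000u, "red")

// Values: one array element per entry.
#define X_COLOR(color, name) color,
static constexpr Rgb colors[] = { COLOR_TABLE(X_COLOR) };
#undef X_COLOR

// Keys: adjacent string literals are concatenated by the compiler.
#define X_NAME(color, name) name "\0"
static constexpr char names[] = COLOR_TABLE(X_NAME);
#undef X_NAME

constexpr std::size_t entryCount = sizeof colors / sizeof colors[0];

struct OffsetTable { unsigned int at[entryCount]; };

// Scan the concatenated names once, at compile time, and record where
// each entry starts.
constexpr OffsetTable computeOffsets()
{
    OffsetTable table = {};
    std::size_t entry = 0;
    unsigned int start = 0;
    for (std::size_t i = 0; i < sizeof names && entry < entryCount; ++i) {
        if (names[i] == '\0') {
            table.at[entry++] = start;
            start = static_cast<unsigned int>(i + 1);
        }
    }
    return table;
}

static constexpr OffsetTable nameOffsets = computeOffsets();
static_assert(nameOffsets.at[1] == 6, "black starts right after white and its NUL");

// Linear by-name lookup; returns false if the name is unknown.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    for (std::size_t i = 0; i < entryCount; ++i) {
        if (std::strcmp(names + nameOffsets.at[i], name) == 0) {
            *out = colors[i];
            return true;
        }
    }
    return false;
}
```

Keys and values can no longer drift apart, because both arrays are expanded from the same list, and nobody has to maintain the offsets by hand.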
Using the preprocessor, the colorMap example can be written like this:
First, you describe the key-value pairs as a sequence of 2-tuples: ((.,.))(.,.))..., then you feed that into a magic macro (here, the one for when the strings are not sorted), and voilà, you get all three tables generated for you, including a nice find() function for looking up values by string. To use:
Obviously, if you sort the data (one of the things that's not done automatically for you, yet), you can use Q_STRINGTABLE_SORTED instead and get an O(log N) find() method.
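Why sorting helps can be sketched with a hand-written table and a binary search over the offset table. This is an illustration of the idea, not the code Q_STRINGTABLE_SORTED generates; `Rgb` (standing in for `QRgb`) and all other names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <iterator>

typedef unsigned int Rgb; // stand-in for Qt's QRgb, to keep the sketch Qt-free

// Entries sorted by name (in strcmp order): black, blue, green, red, white.
static const Rgb colors[] = {
    0xFF000000u, 0xFF0000FFu, 0xFF00FF00u, 0xFFFF0000u, 0xFFFFFFFFu,
};

static const char names[] =
    "black\0"
    "blue\0"
    "green\0"
    "red\0"
    "white";

static const unsigned int nameOffsets[] = { 0, 6, 11, 17, 21 };

// Binary search over the offsets, comparing the strings they refer to.
bool findColor(const char *name, Rgb *out)
{
    assert(name && out);
    const unsigned int *first = std::begin(nameOffsets);
    const unsigned int *last = std::end(nameOffsets);
    const unsigned int *it = std::lower_bound(first, last, name,
        [](unsigned int offset, const char *key) {
            return std::strcmp(names + offset, key) < 0;
        });
    if (it == last || std::strcmp(names + *it, name) != 0)
        return false;
    *out = colors[it - first];
    return true;
}
```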
Next week, we'll look at both the Q_STRINGTABLE API and implementation in more depth. This will also reveal why Q_STRINGTABLE, despite its usefulness, has not yet been accepted into Qt. If you can't wait to start playing with it, head over to the Qt-Project Gerrit: Long live Q_STRINGTABLE! The header file implementing all of this has minimal dependencies (Boost.PP and <QtGlobal>).
Stay tuned!