A few days ago Marc Mutz, a colleague of mine at KDAB and also an author on this blog, spotted this function in Qt's source code (documentation):
Apart from the mistake of considering empty strings not uppercase, which can be easily fixed, the loop in the body looks innocent enough. How would we figure out if a string only contains uppercase letters (as per the documentation in the snippet), anyhow?
- Look at the string character by character;
- If we see a non-uppercase character, the string is not uppercase;
- Otherwise, it is uppercase.
That's exactly what the for loop in the code above is doing, right?
Well, no.
The code above is broken.
It falls into the same trap as countless other pieces of similar code: it doesn't take into account that QString does not contain characters/code points, but rather UTF-16 code units.
All operations on a QString (getting the length, splitting, iterating, etc.) always work in terms of UTF-16 code units, not code points. The reality is: QString is Unicode-aware only in some of its algorithms; certainly not in its storage.
For instance, if a string contains simply the character "𝐀" -- that is, MATHEMATICAL BOLD CAPITAL A (U+1D400) -- then its QString storage would actually contain 2 "characters" reported by size() (again, really, not characters in the sense of code points but two UTF-16 code units): 0xD835 and 0xDC00.
The naïve iteration done above would then check whether those two code units are uppercase, and guess what, they're not; and therefore conclude that the string is not uppercase, while instead it is. (Those two code units are "special" and used to encode a character outside the BMP; they're called a surrogate pair. When taken alone, they're invalid.)
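To make the arithmetic concrete, here is a minimal standalone sketch of how a code point outside the BMP is split into a surrogate pair. It uses std::u16string rather than QString so that it compiles without Qt; the helper names encodeUtf16 and isSurrogate are made up for this illustration and are not Qt API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Code points below U+10000 take one code unit; the others take two.
std::u16string encodeUtf16(char32_t cp)
{
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000; // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}

// Hypothetical helper: true for any code unit in the surrogate range.
bool isSurrogate(char16_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}
```

For U+1D400 this produces exactly the two code units mentioned above, 0xD835 followed by 0xDC00, and both fall in the surrogate range, so neither can be classified as a letter on its own.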
Wherefore art thou, Unicode?
If you want to know more about what all of this Unicode story is about, please take a few minutes and read this and this. The resources linked are also good reads.
The problem of Unicode-aware iteration over string data is so common and frequent that back in 2014 I contributed a new class to Qt to solve it. The class is called, unsurprisingly, QStringIterator.
From its own documentation:
Any code that walks over the contents of a QString should consider using QStringIterator, therefore preventing all such possible mistakes as well as leaving the burden of decoding UTF-16 into a series of code points to Qt. Indeed, QStringIterator is now used in many critical places inside Qt (text encoding, font handling, text classes, etc.).
How do I use it?
For various reasons (see below) QStringIterator is private API at the moment. Code that wants to use it has to include its header and enable the usage of private Qt APIs; with qmake, for instance, that means adding QT += core-private to the project file.
Then we can include it, and use it to properly implement isUpper():
The call to next() will read as many code units as are necessary to fully decode the next code point, and it will also do error checking.
(In case of a decoding error it will return U+FFFD (REPLACEMENT CHARACTER), which has the nice property of not being uppercase, therefore making the function return false. But this is an implementation detail; calling QString algorithms on a string that contains illegal UTF-16 encoded data is unspecified behavior already, so don't do it.)
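The behavior described above (consume one or two code units per step, and fall back to U+FFFD on malformed input) can be sketched without Qt as follows. This is an illustration of the idea, not QStringIterator's actual implementation. All names are hypothetical, and toyIsUpper is a deliberately tiny stand-in that only knows ASCII A-Z and the mathematical bold capitals U+1D400..U+1D419; real code would query proper Unicode data, for example through QChar::isUpper() on the decoded code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char32_t kReplacement = 0xFFFD; // U+FFFD REPLACEMENT CHARACTER

// Decodes the code point starting at s[pos] and advances pos past it.
// A lone or misplaced surrogate consumes one unit and yields U+FFFD.
char32_t nextCodePoint(const std::u16string &s, std::size_t &pos)
{
    assert(pos < s.size());
    const char16_t u = s[pos++];
    if (u < 0xD800 || u > 0xDFFF)
        return u;                           // BMP code point: one code unit
    if (u <= 0xDBFF && pos < s.size()) {    // high surrogate: needs a low one
        const char16_t lo = s[pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++pos;
            return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return kReplacement;
}

// Toy classifier, NOT real Unicode data (see the text above).
bool toyIsUpper(char32_t cp)
{
    return (cp >= U'A' && cp <= U'Z') || (cp >= 0x1D400 && cp <= 0x1D419);
}

// The broken approach: classify each UTF-16 code unit separately.
bool naiveIsUpper(const std::u16string &s)
{
    for (char16_t u : s)
        if (!toyIsUpper(u))
            return false;
    return true;
}

// The fixed approach: decode to code points first, then classify.
bool codePointIsUpper(const std::u16string &s)
{
    std::size_t pos = 0;
    while (pos < s.size())
        if (!toyIsUpper(nextCodePoint(s, pos)))
            return false;
    return true;
}
```

With a string holding only U+1D400, naiveIsUpper() sees two surrogates and answers false, while codePointIsUpper() decodes a single code point and answers true. A lone surrogate decodes to U+FFFD, which the classifier rejects.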
QStringIterator's API is quite rich; it supports bidirectional iteration, some customization of what should happen in case of decoding failure, as well as unchecked iteration (iteration that assumes that the QString contents are valid UTF-16; this allows skipping some checks).
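As a rough illustration of what "unchecked" and "bidirectional" mean here, the following standalone sketch decodes forwards and backwards while assuming the input is valid UTF-16, which is what lets it drop the validation branches. The function names are invented for this example and operate on std::u16string, not on QString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Precondition for everything below: s holds valid UTF-16.

// Decodes the code point starting at s[pos]; advances pos past it.
char32_t uncheckedNext(const std::u16string &s, std::size_t &pos)
{
    assert(pos < s.size());
    const char16_t u = s[pos++];
    if (u < 0xD800 || u > 0xDBFF)
        return u;                    // not a high surrogate: one unit
    const char16_t lo = s[pos++];    // trusted to be a low surrogate
    return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (lo - 0xDC00);
}

// Decodes the code point ending just before s[pos]; moves pos back to its start.
char32_t uncheckedPrevious(const std::u16string &s, std::size_t &pos)
{
    assert(pos > 0 && pos <= s.size());
    const char16_t u = s[--pos];
    if (u < 0xDC00 || u > 0xDFFF)
        return u;                    // not a low surrogate: one unit
    const char16_t hi = s[--pos];    // trusted to be a high surrogate
    return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (u - 0xDC00);
}

// Counts code points (not code units) by walking forwards.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t pos = 0, n = 0;
    while (pos < s.size()) {
        uncheckedNext(s, pos);
        ++n;
    }
    return n;
}

// Returns the last code point of a non-empty string by stepping backwards once.
char32_t lastCodePoint(const std::u16string &s)
{
    std::size_t pos = s.size();
    return uncheckedPrevious(s, pos);
}
```

The trade-off is the usual one: fewer branches per step, in exchange for undefined results if the precondition is violated.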
That's it, no more excuses, start using QStringIterator today!
Regarding the QString::isUpper() function that we started this journey with: trying to fix it caused quite a discussion during code review, as you can see here and here.
Why isn't QStringIterator public API?
There are a few reasons why I am keeping QStringIterator as private API. It's not because its API is in constant evolution -- actually, it has not changed significantly in the past 6 years. QStringIterator even has complete documentation, tests and examples (the documentation is readable here).
From my personal point of view:
- The API would benefit from a serious uplifting, becoming more C++ oriented, and way less Java oriented.
Rather than writing this:
one should also be able to write something like this:
None of the required APIs to make this possible exist at the moment -- QStringIterator is neither a range nor an iterable type.
Making it so opens up many, many API problems: from minor things, like whether QStringIterator is still a good name given that it would hand out iterators, to huge design problems, like how to add customization points to decide how to handle strings containing malformed UTF-16 data (skip? replace? stop? throw an exception?).
- The implementation is optimized for clarity, not raw speed.
At the moment, it doesn't use SIMD or any similar intrinsics. I strongly feel that it may benefit from such improvements, if we redesign its API (e.g. making the failure mode a customization point).
- There is other, similar, more general purpose work happening elsewhere.
For instance, in the glorious ICU libraries, in the work happening in the SG16 WG21 study group, in the proposed Boost.Text, and so on. We may just decide to use the results of some of that work, rather than coming up with a Qt-specific way of using a particular algorithm (UTF-16 decoding).
- Unicode is complicated, and we may have forgotten to handle some corner case properly.
If we set QStringIterator's API/ABI in stone (by making it public), we risk ending up with our hands tied for future necessary expansion.
- Most of Qt assumes valid UTF-16 content in QStrings (see the comment above).
We need a project-wide decision on how to actually detect and tackle invalid UTF-16 content, and enforce it consistently. QStringIterator should therefore follow such a decision, and that becomes very hard if we're again constrained by the public API promise.
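To give an idea of what the first point in the list could look like in practice, here is a purely hypothetical sketch of a range type that yields code points and takes the failure mode as a parameter. Nothing like this exists in Qt today: CodePoints, OnError and collect are names invented for this example, it works on std::u16string, and it only implements two of the possible policies (replace and skip).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class OnError { Replace, Skip }; // hypothetical failure policies

// Hypothetical range over the code points of a UTF-16 string.
// The string must outlive the range: only a reference is stored.
class CodePoints
{
public:
    class iterator
    {
    public:
        iterator(const std::u16string *s, std::size_t pos, OnError e)
            : s_(s), next_(pos), policy_(e) { advance(); }

        char32_t operator*() const { return cp_; }
        iterator &operator++() { advance(); return *this; }
        bool operator!=(const iterator &o) const { return start_ != o.start_; }

    private:
        // Decodes the next code point, applying the error policy.
        void advance()
        {
            for (;;) {
                start_ = next_;
                if (start_ >= s_->size())
                    return;                          // reached the end
                const char16_t u = (*s_)[next_++];
                if (u < 0xD800 || u > 0xDFFF) { cp_ = u; return; }
                if (u <= 0xDBFF && next_ < s_->size()) {
                    const char16_t lo = (*s_)[next_];
                    if (lo >= 0xDC00 && lo <= 0xDFFF) {
                        ++next_;
                        cp_ = 0x10000 + ((char32_t(u) - 0xD800) << 10) + (lo - 0xDC00);
                        return;
                    }
                }
                if (policy_ == OnError::Replace) { cp_ = 0xFFFD; return; }
                // OnError::Skip: drop the bad unit and keep scanning.
            }
        }

        const std::u16string *s_;
        std::size_t start_ = 0; // first unit of the current code point
        std::size_t next_;      // first unit after the current code point
        char32_t cp_ = 0;
        OnError policy_;
    };

    explicit CodePoints(const std::u16string &s, OnError e = OnError::Replace)
        : s_(s), policy_(e) {}

    iterator begin() const { return iterator(&s_, 0, policy_); }
    iterator end() const { return iterator(&s_, s_.size(), policy_); }

private:
    const std::u16string &s_;
    OnError policy_;
};

// Usage: a range-based for loop over code points instead of code units.
std::vector<char32_t> collect(const std::u16string &s, OnError e)
{
    std::vector<char32_t> out;
    for (char32_t cp : CodePoints(s, e))
        out.push_back(cp);
    return out;
}
```

Even this small sketch runs into the design questions listed above: the policy is a runtime enum here, but it could just as well be a template parameter, and "stop" or "throw" policies would need different iterator semantics.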
With all of this in mind, I am not comfortable with committing QStringIterator as public API at the moment. But again, it doesn't mean that you can't use it in your code today, and maybe submit some feedback.
Happy hacking!
2 Comments
15 - Feb - 2022
Brian Warner
QStringIterator - private, undocumented, unstable. Using it is bad or even worse.
The fact is that nobody has a good solution. Does Qt6 solve this? I'm 99% sure it doesn't.
15 - Feb - 2022
Giuseppe D'Angelo
Hi,
Why the FUD about this? Yes, QStringIterator is a private class; the whole point of this blog post is to raise awareness about it.
"Undocumented" means no public documentation, because... it's a private class. It does not mean that comprehensive documentation about it does not exist; I wrote it with the idea that the class could become public some day: https://github.com/qt/qtbase/blob/dev/src/corelib/text/qstringiterator.qdoc
"Unstable", "using it is bad": where does that assertion come from? If anything, it's one of the most stable classes in Qt, having had maybe just one significant API change in the last 8 years. These are all the commits on it:
So while I perfectly understand the frustration at not having ready-made classes for Unicode iteration (... which is why I wrote QStringIterator in the first place, and why I then wrote this blog post), these aren't substantive criticisms :)