QStringView Diaries: The Eagle Has Landed
QStringView merged for Qt 5.10
After two months of intensive reviews, discussions, fixes, and stripping down the initial commit, feature by feature, to make it acceptable, I am happy to announce that the first QStringView commits have landed in what will eventually become Qt 5.10. Even the docs are already on-line.
This is a good time to briefly recapitulate what QStringView is all about.
QStringView: A std::string_view for QString
If you have never heard of std::string_view, you may want to learn about it in Marshall Clow’s CppCon 2015 presentation.
TL;DR: String-views reduce temporary allocations.
Yours truly is not generally known to support reimplementing std facilities in Qt. So you might legitimately ask: “Why QStringView? Why not just use std::basic_string_view<QChar>?” The answer is the same as for QString itself. QString simply has a lot going for it that std::string is lacking. First and foremost, it has excellent Unicode support. So reimplementing std::string_view for QString/QChar is really a no-brainer.
QStringView tries to solve the problem that functions outside the very core of QString only take QString. There are usually not even QLatin1String overloads, even though most users pass just US-ASCII string literals to these functions. Sure, if you compile without QT_NO_CAST_FROM_ASCII, then just passing "foo" to a function taking QString works just fine.
But the use of QString has a cost: it allocates dynamic memory, and that is comparatively slow. For a string class, it has also fallen a bit behind the state of the art. It uses copy-on-write/implicit sharing, which developers outside Qt no longer consider an optimisation. It also does not use the small-string optimisation, which stores small strings in the object itself instead of in dynamic memory. That makes QString("OK") or QString("Cancel") much more expensive than it should be.
Enter QStringView
This is where string-views come in. QStringView is designed as a read-only view on QStrings and QString-like objects. QString-like objects include QStringRef, std::u16string, and char16_t literals (u"Hello"). This is useful, since a lot of functions that take QString do not need an actual QString. That is, they do not need an owning reference to the characters. They only need a weak reference: a non-owning pointer and a size, say. Or a pointer pair acting as iterators. And indeed, a lot of low-level functions take (const QChar* data, int length). In doing so, they do not require the construction of a QString just to iterate over its characters.
    bool isValidIdentifier(const QChar *data, int len)
    {
        if (!data || len <= 0)
            return false;
        if (!data->isLetter())
            return false;
        --len;
        ++data;
        while (len) {
            if (!data->isLetterOrNumber())
                return false;
            ++data;
            --len;
        }
        return true;
    }
Using pointer-and-length APIs has a cost, too, though.
Towards wide contracts in low-level string APIs
Such functions have preconditions. We say they have a narrow contract. Only certain combinations of the two parameters are allowed: The length must be non-negative, and the pointer mustn’t be nullptr unless the length is zero, too.
If a function takes a QString instead, it has no preconditions. We say it has a wide contract: any QString is generally acceptable, and valid.
QStringView combines the efficiency and QString-independence of pointer-and-length APIs with the conceptual clarity of QString APIs. By passing an object of class type, we can (and do) enforce invariants between these parameters. Constructing a string-view with a negative length is undefined behaviour. And that is caught at string-view construction time (with an assertion in debug mode), before the function is entered. This way, we put the onus of checking for valid parameters on the caller. So far, nothing has changed compared to the pointer-and-size case. But the function can now assume that its QStringView argument references valid data.
Practically speaking, this means that functions taking QStringView can be marked as noexcept while functions that take pointer-and-size cannot. At least if you buy into the rule that narrow-contract functions mustn’t be noexcept (which both the standard and Qt libraries do).
    bool isValidIdentifier(QStringView id) noexcept
    {
        if (id.isEmpty())
            return false;
        if (!id.front().isLetter())
            return false;
        for (QChar ch : id.mid(1)) {
            if (!ch.isLetterOrNumber())
                return false;
        }
        return true;
    }
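To see the same split of responsibilities without depending on Qt, here is a minimal sketch in standard C++. U16View and isAsciiIdentifier() are hypothetical names made up for this illustration. This is not how QStringView is implemented, and the character tests are ASCII-only, unlike QChar's. The point is only that the narrow contract lives in the view's constructor, so the function taking the view can be noexcept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a string-view type (NOT QStringView's implementation).
class U16View {
public:
    constexpr U16View() noexcept = default;

    // Narrow contract: len must be non-negative, and data may only be
    // null if len is zero. Checked here, with assertions in debug builds.
    U16View(const char16_t *data, std::ptrdiff_t len)
        : m_data(data), m_size(len)
    {
        assert(len >= 0);
        assert(data || len == 0);
    }

    // Wide contract: every std::u16string is a valid source.
    U16View(const std::u16string &s) noexcept
        : m_data(s.data()), m_size(static_cast<std::ptrdiff_t>(s.size())) {}

    const char16_t *begin() const noexcept { return m_data; }
    const char16_t *end() const noexcept { return m_data + m_size; }
    bool empty() const noexcept { return m_size == 0; }

private:
    const char16_t *m_data = nullptr;
    std::ptrdiff_t m_size = 0;
};

// ASCII-only simplification of isValidIdentifier(). It can assume that
// its argument satisfies the class invariants, so it is marked noexcept.
bool isAsciiIdentifier(U16View id) noexcept
{
    auto isLetter = [](char16_t c) {
        return (c >= u'a' && c <= u'z') || (c >= u'A' && c <= u'Z');
    };
    auto isDigit = [](char16_t c) { return c >= u'0' && c <= u'9'; };

    if (id.empty())
        return false;
    const char16_t *it = id.begin();
    if (!isLetter(*it))
        return false;
    for (++it; it != id.end(); ++it) {
        if (!isLetter(*it) && !isDigit(*it))
            return false;
    }
    return true;
}
```

A call such as isAsciiIdentifier(U16View(ptr, len)) checks the pointer-and-length combination in the constructor, before isAsciiIdentifier() is entered.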
A (nearly) universal string-data sink
The most thrilling property of QStringView, however, is the wide variety of arguments with which you can construct one. Not only does it abstract away the container used to hold the character data: Whether your string data is stored in a QString, a QStringRef, a std::u16string or a std::u16string_view, QStringView won’t care. It also abstracts away the plethora of character types Qt uses. It does not distinguish between QChar, ushort, char16_t or (on platforms, such as Windows, where it is a 2-byte type) wchar_t. It swallows any of those without a cast:
    bool isValidIdentifier(QStringView id);

    isValidIdentifier(u"QString");                 // OK
    isValidIdentifier(L"QString");                 // OK (on Windows only)
    isValidIdentifier(QStringLiteral("QString"));  // OK
    QString fun = "QString::left()";
    isValidIdentifier(fun.leftRef(7));             // OK
    isValidIdentifier(u"QString"s);                // OK
    isValidIdentifier(L"QString"s);                // OK (on Windows only)
QStringView does not completely replace QString as an argument type, however. There are some (expensive-to-convert) argument types QString allows, but QStringView doesn’t. Your QString function will happily accept a QChar or a QLatin1String, too. QStringView doesn’t. If you use QStringBuilder (as you should), then your QString function can be called with a QStringBuilder expression. QStringView only accepts this with a manual cast to QString: f(QString(expr)).
Future
By Qt 5.10, we’d like a QStringView which has most if not all of the const QString API. There are some notable exceptions we already know about: we will not add a split() method. One of the reasons to use a string-view is to enable zero-allocation parsing. The split() function, however, returns a dynamically-sized container of substrings. We intend to replace this functionality with a QStringTokenizer class. Taking the same arguments as QString::split(), it will have a container interface that allows you to plug it into a ranged for-loop:
    QString s = ...;
    for (QStringView part : QStringTokenizer(s, u'\n'))
        use(part);
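How such a tokenizer can avoid allocations is perhaps easiest to see in a small sketch in standard C++17. U16Tokenizer is a hypothetical name, it works on std::u16string_view instead of QStringView, and it only supports a single-character separator. The interface QStringTokenizer ends up with may well look different. Each token is a view into the original data, produced on demand while iterating, so no container of substrings is ever built.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical lazy splitter (NOT the planned QStringTokenizer API).
// Empty parts are kept, so "a\nbb\n" yields "a", "bb" and "".
class U16Tokenizer {
public:
    class iterator {
    public:
        iterator() = default; // the end iterator
        iterator(std::u16string_view rest, char16_t sep)
            : m_rest(rest), m_sep(sep), m_atEnd(false)
        {
            advance(); // find the first token
        }

        std::u16string_view operator*() const { return m_current; }
        iterator &operator++() { advance(); return *this; }
        // Only comparison against end() is supported in this sketch.
        bool operator!=(const iterator &other) const
        { return m_atEnd != other.m_atEnd; }

    private:
        void advance()
        {
            if (m_done) { // last token was already handed out
                m_atEnd = true;
                return;
            }
            const auto pos = m_rest.find(m_sep);
            if (pos == std::u16string_view::npos) {
                m_current = m_rest; // final token: everything that is left
                m_done = true;
            } else {
                m_current = m_rest.substr(0, pos);
                m_rest.remove_prefix(pos + 1); // skip token and separator
            }
        }

        std::u16string_view m_rest;
        std::u16string_view m_current;
        char16_t m_sep = u'\0';
        bool m_done = false;
        bool m_atEnd = true;
    };

    U16Tokenizer(std::u16string_view s, char16_t sep) : m_s(s), m_sep(sep) {}
    iterator begin() const { return iterator(m_s, m_sep); }
    iterator end() const { return iterator(); }

private:
    std::u16string_view m_s;
    char16_t m_sep;
};
```

As with any view, the tokens are only valid for as long as the underlying string data is.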
We will also co-evolve QLatin1String together with QStringView, making QLatin1String as full-blown a view type for chars as QStringView is for QChars.
You can follow QStringView development on this blog and on Gerrit.
Stay tuned!
What’s the reason that makes passing a const QStringView& worse than passing it by value? Indirection?
See Chandler Carruth’s BoostCon 2013 presentation. Or any other Chandler Carruth presentation ever given 🙂
TL;DR: Pass by value takes memory out of the picture, simplifying the optimiser’s job considerably: values are not forced on the stack with most platform ABIs, and aliasing is not a problem.
Interesting. I thought that Qt’s equivalent of std::string_view was the QStringRef class. It would be nice if you’d explain the key differences between QStringRef and QStringView.
Indeed, thanks for the suggestion.
In all brevity:
QStringRef cannot reference non-QString-backed data, because it holds a const QString*, a position and a length inside that string. QStringView, otoh, is just a pointer to the character data and a size, and thus agnostic to the owning container. It may, but does not have to be a QString.
So extending QStringRef instead of introducing a new type would keep the Qt API cleaner. Have you considered it?
QStringRef has certain guarantees (it’s stable under reallocations of its string()) that were specifically designed into it. If I were to re-use QStringRef for what QStringView is designed to solve, I would have to do the whole work as an almost-atomic operation between Qt 5 and Qt 6. And I’d still break existing out-of-tree users in the process. I wanted something that was possible to implement here and now, and less disruptive.
Another question: can QStringView work with a kind of string where the data is not contiguous? Say that one needs to implement a text editor, and considers storing the edited text as a gap buffer, rope, sequence of lines, or whatever.
I also had the question of whether this could work with QStringIterator, but I saw one commit that made use of it, so it seems that yes.
Thanks.
QStringView, like std::string_view, expects characters to be contiguous. It cannot represent a rope. QStringIterator is already ported to QStringView, yes, but since it already sported a (QChar*, QChar*) constructor, you could’ve passed (and can still pass) begin() and end() of a QStringView even if it wasn’t.
Maybe it’s a stupid question, but… As far as I understand, the QStringView fixes performance issues with the QString. Then why just not fix the QString itself?
A string-view is conceptually similar, if not identical, to the STL design of separating algorithms from containers by having containers provide, and algorithms work with, iterators. A function taking a string-view is an algorithm on characters. The string-view is the iterator pair, and which container the algorithm works on is abstracted. Only, because we’re working with a rather restricted set of value types and only contiguous memory, we don’t need to write our algorithms as template functions. A normal function taking QStringView will do, because const QChar* is always the iterator.
As for fixing QString: There are many things that I’d like to see fixed in QString, and I’ve mentioned them in the article. But a string class needs to hold strings of arbitrary size. So it must (eventually) allocate memory, and own it. That makes QString a container and fundamentally different from a string-view.
I think I fail to appreciate your section regarding preconditions. You write
> Such functions have preconditions. We say they have a narrow contract. Only certain combinations of the two parameters are allowed: The length must be non-negative, and the pointer mustn’t be nullptr unless the length is zero, too.
However, the ‘isValidIdentifier’ function appears to be a total function. A null pointer and negative lengths are perfectly fine as it is, and the function (probably rightfully so) rejects them as valid identifiers.
You proceed to state that
> If a function takes a QString instead, it has no preconditions. We say it has a wide contract: any QString is generally acceptable, and valid.
However, you’d surely test for a QString to be non-empty (i.e. the equivalent of your previous `len <= 0` test) before proceeding, no?
I believe your point would be better made if you assert(!) that the `data` pointer is non-null. This would make it a partial function, and data being non-null would clearly be a precondition. This would also nicely lead to showing how a real `QString` does away with this precondition since you now pass a reference which cannot be null.
The traditional isValidIdentifier() is not a total function:
And neither is QStringView’s constructor taking the same arguments:
Consequently, that constructor is not noexcept.
This is subtle, I know: If isValidIdentifier() is ported to QStringView it becomes a total function:
Crucially, the UB now happens outside the function, in the QStringView constructor, just as in the second example.
If you think there’s no difference, consider this: If I have some sanitizer API that would allow me to assert that a [ptr, len) range is valid, I could report the error. In the traditional case, I’d need to detect and report it inside isValidIdentifier() (and in all other such functions). With QStringView, it’s detected and reported from the QStringView ctor.
https://codereview.qt-project.org/193707
Yes, I think I see what you’re getting at – it makes perfect sense.
I suspect it may just be my lack of experience with `QStringView` which keeps me from acknowledging that using `QStringView` actually makes `isValidIdentifier` a total function. Or maybe it’s because my idea of what constitutes a ‘precondition’ differs from yours (to me, it’s a pre-condition of a piece of code which is not currently expressed in the type system but which has to be asserted at runtime).
My understanding is that `QStringView` itself does not verify that the given range is valid. It also doesn’t create a copy of the data. Hence, the precondition on `isValidIdentifier` is still that a valid range is passed. It’s just that instead of passing a starting address and a length, a QStringView is passed – but there’s nothing in the type system or in the constructor of QStringView which can enforce that the given QStringView denotes a valid range. I.e. the QStringView constructor still permits constructing invalid ranges, so there is a (wide) range of invalid QStringView objects possible for which `isValidIdentifier` is not well-defined, and hence partial.
Unfortunately I cannot seem to figure out how to do syntax highlighting in this blog, but to give an example of what I mean: If you consider this function for getting the first character of a string to be partial:
…then this might be a way to make it a total function, by using the type system and enforcing the precondition in the constructor:
My impression is that QStringView does not give any such guarantees since the behaviour of QStringView’s constructor is undefined for empty ranges.
A pre-condition of a function is a condition that needs to be true on the arguments of the function in order for the function to realise its post-conditions. Calling a function without all pre-conditions met is undefined behaviour. You seem to want it to have defined behaviour. Here’s why that’s a fallacy:
Yes, ideally, a precondition would be a predicate in the same language as the function. But that is frequently not possible, or even if it is possible, it’s not desirable to check it.
E.g. std::lower_bound(first, last, value, cmp) has the following pre-conditions, which are usually not assertable:
- [first, last) is a valid range (… QList::iterator)
- [first, last) is sorted according to cmp (… cmp!) while the functionality is O(logN), so usually not asserted.
- cmp is a strict weak ordering (… cmp is calling strcmp(), then you’ve lost.)
But even though these are not checkable (or too expensive to check), they’re still pre-conditions of the function, and failing to meet them means the function will not reliably meet its post-conditions.
So, the QStringView ctor checks the cheap preconditions, but fails to check the expensive or uncheckable ones. That doesn’t mean you’re free to create a QStringView with too large size. You’re still violating preconditions, and you’re still invoking UB.
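The std::lower_bound case can be sketched in a few lines of standard C++. checkedLowerBound() is a hypothetical helper, not something the standard library or Qt provides. It asserts the one precondition that is checkable at all, and the comments point out why even that one is usually left unchecked.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical wrapper around std::lower_bound that asserts sortedness.
template <typename T, typename Cmp>
typename std::vector<T>::const_iterator
checkedLowerBound(const std::vector<T> &v, const T &value, Cmp cmp)
{
    // Checkable, but O(N) and it calls cmp itself, while the search
    // below is O(log N). That is why it is usually not asserted.
    assert(std::is_sorted(v.begin(), v.end(), cmp));

    // Not asserted at all: that cmp is a strict weak ordering. There is
    // no general way to verify that from inside this function.
    return std::lower_bound(v.begin(), v.end(), value, cmp);
}
```

Passing an unsorted vector is still a precondition violation in a release build, where the assertion is compiled out. The result is then simply unreliable.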
Did you see https://codereview.qt-project.org/193707 ?
I concur with every word you write, but I feel I’m drawing a different conclusion. You are of course perfectly right when you say that it’s frequently not possible (or practical) to assert preconditions in code. `std::lower_bound` is a good example.
In the same way, the traditional `isValidIdentifier` function had preconditions which code cannot easily verify (that the given range is valid). And indeed, `QStringView` inherits this behaviour in that the `QStringView` has no practical way to verify that the given range is valid. I believe that so far, we’re on the same page.
Now, the conclusion I draw from this (and I think this is where we diverge) is that the new, QStringView-based definition of `isValidIdentifier` is *still* as easy (or hard) to call correctly as before. The contract is as wide (or narrow) as before – since the function *still* asserts that the given view denotes a valid range. A `QStringView` is isomorphic to a `char */int` tuple, and the set of `QStringView` objects which violate the precondition is as large as the set of parameter combinations with which the traditional `isValidIdentifier` function must not be called.
In the same vein, in my understanding, the contract of `std::lower_bound` would not change if instead of
std::lower_bound(first, last, value, cmp)
it would be declared as e.g.
std::lower_bound(range, value, cmp)
With `range` being something like
The contract of `std::lower_bound` will still include that you have to pass a valid range. It just happens that the range is no longer expressed as two iterators but as a `Range` object (which does not enforce that the range is valid!).
You pointed out that with the traditional definition of `isValidIdentifier` “Only certain combinations of the two parameters are allowed”. My understanding is that the same holds true if `isValidIdentifier` is given a `QStringView` – only certain `QStringView` objects (those which denote a valid range) are allowed. It would be a very different story if `isValidIdentifier` was given a real `QString`, which took a copy of the data.
That’s what I meant to express with my `NonEmptyString` example: by raising an exception in the constructor (and copying the data), it’s actually impossible to construct objects which transport empty strings. So the set of possible `NonEmptyString` is actually smaller than the set of all possible `std::string` objects, and hence the contract of the `firstCharacter` function becomes, in your terms, wider when using `NonEmptyString`.
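A minimal sketch of the kind of type described here, in standard C++. The names NonEmptyString and firstCharacter follow the comment. Everything else, including the accessor get() and the helper firstCharacterUnchecked(), is an assumption made for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Owns a copy of the data and refuses to be constructed from an empty
// string, so every NonEmptyString object holds at least one character.
class NonEmptyString {
public:
    explicit NonEmptyString(std::string s)
        : value(std::move(s))
    {
        if (value.empty())
            throw std::invalid_argument("NonEmptyString: empty input");
    }

    const std::string &get() const noexcept { return value; }

private:
    std::string value;
};

// Partial: front() on an empty std::string is undefined behaviour, so
// non-emptiness is a precondition the caller has to meet.
char firstCharacterUnchecked(const std::string &s)
{
    assert(!s.empty());
    return s.front();
}

// The non-emptiness check has moved into the NonEmptyString constructor.
char firstCharacter(const NonEmptyString &s) noexcept
{
    return s.get().front();
}
```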
Regarding https://codereview.qt-project.org/193707 — yes, I did notice it, but I didn’t quite understand what’s going on. I’m not too familiar with Valgrind unfortunately.
This is where we disagree, indeed. A QStringView is a new type. It has (const QChar*, qqsize_t) members, yes, but it is not isomorphic to a tuple made out of its data members any more than a QString is isomorphic to a QString::Data*. That’s because a C++ class can, and usually does, introduce class invariants. The constructor is responsible for establishing the class invariant, but frequently, if it depends on user-provided arguments, the extent to which it can guarantee that all class invariants have been successfully established is limited. This is where the C++ standard resorts to the phrase “undefined behaviour, no diagnostic required”. And a normal C++ class can do the same.
That said, the change I linked attempts to do away with the “no diagnostic” part, by using a Valgrind hook to check, at QStringView construction time, whether the given range contains valid data.
Now, crucially, any function has an implicit pre-condition that its arguments are valid values of the arguments’ types. In your definition of a total function, that means such functions cannot have arguments of types that have class invariants. Ok, if that is the definition of a total function then functions taking QStringView are never total, indeed. Neither is a function taking NonEmptyString total, though, since NonEmptyString clearly has class invariants, too.
I just noticed that since the `NonEmptyString` constructor never actually initializes the `value` member, the comment in the second `firstCharacter` definition is very misleading; *every* `NonEmptyString` object will have an empty `value` member. Oops. 🙂