FMA Woes

Giuseppe D’Angelo

24 February 2023

What Did the Compiler Do to You?

In an least one interesting case (Clang 14, now shipped with the last XCode on Apple), started recognizing floating-point expressions, such as,

x * y + z

and started to automatically turn them into fused multiply-add (FMA) instructions:

The MAC operation modifies an accumulator

a: a ← a + ( b × c )

When done with floating point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).

An FMA instruction carries the two operations in one step, and does them in "infinite precision". Notably, an FMA does only one rounding at the end instead of the sequence expressed by the source code, which is: 1) multiplying, 2) rounding the result, 3) adding, 4) rounding the result. So, there are two steps of rounding.

So not only is it faster, but it's also more accurate.

However, one can easily encounter cases (like the two cases illustrated above) in which doing operations without the intermediate rounding step will give you trouble.

Let's look again at the first example:

const double scale = 1.0 / i;
const double r = 1.0 - i * scale;
assert(r >= 0);

If i = 5, then scale is
0.200000000000000011102230246251565404236316680908203125, and r is
negative (about -5.55e-17) when using a FMA. The point is that i * scale did not get rounded in an intermediate.

In the second example,

const double result = std::sqrt(a*a - b*b);

the argument to sqrt(a*a - b*b) can be turned into FMA(a, a, -b*b). If a == b. Then, this expression is equivalent to FMA(a, a, -a*a). The problem is that, if a*a done in "infinite precision" is strictly less than the rounded product of a*a, then the result will again be a negative number (not 0!) passed into sqrt. This is very easy to obtain (example on CE)!

Rounding in C++

For me, the interesting question is, "Is the compiler allowed to do these manipulations, since they affect the rounding as expressed by the source code?"

Within the context of one expression, compilers can use as much precision as they want. This is allowed by:
[expr.pre/6] and similar paragraphs:

The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.

with a footnote that says:

The cast and assignment operators must still perform their specific conversions as described in [expr.type.conv], [expr.cast], [expr.static.cast] and [expr.ass].

Now suppose that one turns

const double scale = 1.0 / i;
const double r = 1.0 - i * scale;

into separate expressions and statements:

const double scale = 1.0 / i;
const double tmp = i * scale;
const double r = 1.0 - tmp;

Here, in theory, the source code mandates that tmp is rounded; so a compiler cannot do a FMA when calculating r.

In practice, compilers violate the standard and apply FMA. :-)

In literature, these substitutions are called "floating point contractions." Let's read what the GCC manual has to say about them:

By default, -fexcess-precision=fast is in effect; this means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and it is unpredictable when rounding to the types specified in the source code takes place.

(Emph. mine)

Hence, you can turn these optimizations off by compiling under -std=c++XX, not -std=gnu++XX (the default). If you try to use -fexcess-precision=standard, then GCC lets you know that:

cc1plus: sorry, unimplemented: '-fexcess-precision=standard' for C++

The Origin

Where does all this nonsense come from? The first testcase is actually out of Qt Quick rendering code. It has been lingering around for a decade!

Qt Quick wants to give a unique Z value to each element of the scene. These Z values are then going to be used by the underlying graphics stack (GL, Vulkan, Metal) as the depth of the element. This allows Qt Quick to render a scene using the ordinary depth testing that a GPU provides.

The Z values themselves have no intrinsic meaning, as long as they establish an order. That's why they're simply picked to be equidistant in a given range (simplest strategy that maximizes the available resolution).

Now, the underlying 3D APIs want a depth coordinate precisely in [0.0, 1.0]. So that's picked as range and then inverted (going from 1.0 to 0.0) because, for various reasons, Qt Quick wants to render back-to-front (smaller depth means "closer to the camera", i.e. on top in the Qt Quick scene.).

When the bug above gets triggered the topmost element of the scene doesn't get rendered at all. That is because its calculated Z value is negative; instead of being the "closest to the camera" (it's the topmost element in the scene), the 3D API will think the object ended up being behind the camera and will cull it away.

So why didn't anyone notice so far in the last 10 years? On one hand, it's because no one seems to compile Qt with aggressive compiler optimizations enabled. For instance, on X86-64 one needs to opt-in to FMA instructions; on GCC, you need to pass -march=haswell or higher. On ARM(64), this manifests more "out of the box" since ARM7/8 have FMA instructions.

On the other hand, because by accident everything works fine on OpenGL. Unlike other 3D graphics APIs, on OpenGL the depth range in Normalized Device Coordinates is from -1 to +1, and not 0 to +1. So even a (slightly) negative value for the topmost element is fine. If one peeks at an OpenGL call trace (using apitrace or similar tools), one can clearly see the negative Z being set.

Only on a relatively more recent combination of components does the bug manifest itself, for instance, on Mac:

Qt 6 ⇒ Metal as graphics API (through RHI)
ARM8 ⇒ architecture with FMA
Clang 14 in the latest XCode ⇒ enables FP contractions by default

Windows and Direct3D (again through RHI) are also, in theory, affected, but MSVC does not generate FMA instructions at all. On Linux, including embedded Linux (e.g. running on ARM), most people still use OpenGL and not Vulkan. Therefore, although GCC has floating-point contractions enabled by default, the bug doesn't manifest itself.

Definitely an interesting one to research; many kudos to the original reporter. The proposed fix was simply to clamp the values to the wanted range. I'm not sure if one can find a numerical solution that works in all cases.

Tags:

3d c++performance qml qt

About KDAB

The KDAB Group is a globally recognized provider for software consulting, development and training, specializing in embedded devices and complex cross-platform desktop applications. In addition to being leading experts in Qt, C++ and 3D technologies for over two decades, KDAB provides deep expertise across the stack, including Linux, Rust and modern UI frameworks. With 100+ employees from 20 countries and offices in Sweden, Germany, USA, France and UK, we serve clients around the world.

4 Comments

You don't mention anything that requires the topmost layer to be a z = 0, so add one layer, and don't use the topmost one:

- const double scale = 1.0 / i;
+ const double scale = 1.0 / (i + 1);

Isn't this a general effect of floating point rounding and not particularly related to FMA? I mean, for:

const double scale = 1.0 / i;
const double tmp = i * scale;

Wasn't it always possible that scale ends up slightly higher than 'actual' 1/i value? In that case, i*tmp could end up being larger than 1.0. I had seen a similar effect in one of our applications several years back and had done some research. IIRC, there were some processor flags (on x86) that determined if a value could be rounded up/down. We had different behaviour between gcc/Linux and MSVC++6 on Windows. Even on Windows, the behaviour changed depending on whether we had a service pack installed or not.

Hi,

That's what makes the problem funny. You can try experimentally for any integer i, and you will never find one that makes the example fail... unless you enable FMA. https://gcc.godbolt.org/z/a7Wcq5qoq

So "wasn't it always possible", sure, except that FMA wasn't enabled by default. Is the compiler allowed to use it? The Standard seem to say no, the compilers say "we don't care" :-)

This post reminded me of two things:

First, didn't ARM have two multiply-accumulate instructions? One with an intermediate rounding step (perfect for compiler use without affecting precision) and one without. And yes, a quick scan of the documentation reveals that there is indeed "VMLA" and "VFMA". But unfortunately this one didn't make it through the ARMv7 to ARMv8 transition. On the ARM web site there's the following note with regards to ARMv8: "All floating-point Multiply-Add and Multiply-Subtract instructions are fused."(https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON/New-features-for-NEON-and-Floating-point-in-AArch64)). That's a pity!

Second, I came up with a way to prevent the compiler doing such optimizations in the past. The case was slightly different (the compiler constant folded the absolute address for some indexed operations and then put all the immediate load instructions into my tight loop where indexed load and store instructions would have been more effective). What is needed is some kind of barrier over which the optimizer can't jump to combine different expressions. The gcc and Clang "asm" statement can be used to this effect. The trick is to have no actual assembly code inside the statement but use the constraint modifiers to pretend that this statement modifies the variable where we want the barrier to be. As the compiler doesn't look into the assembly itself it can no longer rely on information specific to that variable that comes before that "asm" statement for optimization. Hence no fusion.

Something following should do in your case (x86):

const double scale = 1.0 / i; double tmp = i * scale; asm("" : "+x" (tmp)); // cloak the value of tmp const double r = 1.0 - tmp;

Of course this might have downsides. It might affect other optimizations (I saw clang no longer unrolling the loop, though this might have little effect), so one should at least inspect the generated assembly to gain some confidence.

With gcc there is also "__builtin_assoc_barrier()" and "const double r = 1.0 - __builtin_assoc_barrier(i * scale);" works too. Though as the name suggests this one has a different purpose (associativity) and hence it might be risky to rely on it here as MAC fusion isn't about associativity.

Ideally there would be some kind of helper function in the standard library for that purpose, but I couldn't find anything.

Giuseppe D’Angelo

Senior Software Engineer

Senior Software Engineer at KDAB. Giuseppe is a long-time contributor to Qt, having used Qt and C++ since 2000, and is an Approver in the Qt Project. His contributions in Qt range from containers and regular expressions to GUI, Widgets, and OpenGL. A free software passionate and UNIX specialist, before joining KDAB, he organized conferences on opensource around Italy. He holds a BSc in Computer Science.

FMA Woes

What Did the Compiler Do to You?

Rounding in C++

The Origin

4 Comments

Related Content

Join KDAB Training Day 2025 in Munich

Sign up for the KDAB Newsletter

FMA Woes

What Did the Compiler Do to You?

Rounding in C++

The Origin

4 Comments

Related Content

Qt World Summit 2019 talk videos are online

10 Tips to Make Your QML Code Faster and More Maintainable

Introducing the ConnectionEvaluator in KDBindings

Join KDAB Training Day 2025 in Munich

Sign up for the KDAB Newsletter