Projection Matrices with Vulkan – Part 1
How transformations differ from OpenGL to Vulkan
Introduction
When someone with an OpenGL background begins using Vulkan, one of the very common outcomes – beyond the initial one of “OMG how much code does it take to draw a triangle?” – is that the resulting image is upside down.
Searching the web for this turns up many discussions about coordinate systems being flipped, with suggested solutions such as:
- Invert all of your gl_Position.y coordinates in all of your vertex shaders.
- Provide a negative-height viewport to flip the viewport transformation applied by Vulkan.
- Perform some magic incantation on your transformation matrices such as negating the y-axis.
All of these approaches have downsides, such as: needing to touch all of your vertex shaders; requiring hardware and driver support for negative viewport heights; not really understanding the implications of randomly flipping an axis in a transformation matrix; and having to invert your geometry winding order.
This post will aim to explain what is different between OpenGL and Vulkan transformations and how we can adapt our code to get the desired results, with the bonus of actually understanding what is going on. This final point is crucial when it comes time to make changes later, so that you don’t end up in the common situation of randomly flipping axes until you get what you want, only to find that it breaks something else.
Left- vs Right-handed Coordinate Systems
As a quick aside, it is important in what follows to know if we are dealing with a left-handed or right-handed coordinate system at any given time. First of all what does it even mean for a coordinate system to be left-handed or right-handed?
Well, it’s just a way of defining the relative orientations of the coordinate axes. In the following pictures we can use our thumb, first finger, and middle finger to represent the x, y, and z axes (or basis vectors if you prefer).
In a right-handed coordinate system we use those digits on our right hand so that, say, the x-axis points to the right and the y-axis points up, leaving the z-axis (middle finger) pointing towards us.
Conversely, in a left-handed coordinate system we can still have the x-axis pointing to the right and the y-axis pointing up, but this time the z-axis increases away from us.
Converting from a right-handed coordinate system to a left-handed coordinate system or vice versa can be achieved by simply flipping the sign of a single axis (or any odd number of axes).
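For example, negating just the z-axis can be expressed as multiplying each coordinate by a simple scaling matrix (shown here as a plain 3×3 matrix); coordinates expressed in a right-handed system are thereby re-expressed in a left-handed one, and vice versa:
S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}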
As we shall see, different graphics APIs use left- or right-handed coordinate systems at various stages of processing. This stuff can be a major source of confusion for graphics developers if they do not keep track of coordinate systems and often results in “oh hey, it works if I flip the sign of this column but I have no idea why”.
Common Coordinate Systems in 3D Graphics
Let’s take a quick tour of the coordinate systems used in 3D graphics at various stages of the (extended) pipeline. We will begin with OpenGL and then go on to discuss Vulkan and its differences. Note that the uses of the coordinate systems are the same in both systems, but as we shall see, there are some small but important changes between the two APIs. It is these differences that we need to be aware of in order to make our applications behave the way we want them to.
Here is a quick summary of the coordinate systems, what they are used for and where they occur.
- Model Space or Object Space
- This is any coordinate system that a 3D artist chooses to use when creating a particular asset. If modelling a chair, they may use units of cm perhaps. If modelling a mountain range, a more suitable choice of unit may be km. Different tools also have different conventions for the orientation of axes. Blender, for example, uses a z-up convention whereas, as we shall see later, many real-time 3D applications use y-up as their chosen orientation. Ultimately, it does not matter just so long as we know which conventions are used. Objects in model space are also often located close to the origin for convenience when being modelled and for when we later wish to position them.
Model space is often right-handed but it is usually decided by the tool author or generative code author.
- World Space
- World space is what we are most familiar with and is typically what you create using game engine editors. World space is a coordinate system where everything is brought into consistent units whether the units we choose are microns, centimeters, meters, kilometers etc. How we define world space in our applications is up to us. It may well differ depending upon what it is we are trying to simulate. Cellular microscopy applications probably make more sense using suitable units such as microns or perhaps even nanometers. Whereas a space simulation is probably better off using kilometers or maybe something even larger – whatever allows you to make best use of the limited precision of floating point numbers.
World space is also where we would rotate objects coming from various definitions of model space so that they make sense in the larger scene. For example, if a chair was modeled with the z-up convention and it wasn’t rotated when it was exported, then when we place it into world space we would also apply the rotation here, so that it looks correct in a y-up convention.
To create a consistent scene, we scale, rotate and translate our various 3D assets so that they are positioned relative to each other as we wish. The way we do this is to pre-multiply the vertex positions of the 3D assets by a “Model Matrix” for that asset. The Model Matrix, or just M for short, is a 4×4 matrix that encodes the scaling, rotation and translation operations needed to correctly position the asset.
World space is often right-handed but it is up to the application developer to decide.
- Camera or View or Eye Space
- This next space goes by various names in the literature and online such as eye space, camera space or view space. Ultimately they all mean the same thing which is that the objects in our 3D world are transformed to be relative to our virtual camera. Wait, our what?
Well, in order to be able to visualize our virtual 3D worlds on a display device, we must choose a position and orientation from which to view it. This is typically achieved by placing a virtual camera into the world. Yes, the camera entity is also positioned in world space by way of a transformation just like the assets mentioned above. View space is often defined to be a right-handed coordinate system where:
- the x-axis points to the right;
- the y-axis points upwards;
- and the z-axis is such that we are looking down the negative z-axis.
Typically a camera is only rotated and translated to place it into world space and so the units of measurement are still whatever you decided upon for World space. Therefore, the transformation to get our 3D entities from world space and into view space consists only of a translation and rotation. The matrix for transforming from World space to View space is typically called the “View Matrix” or just V.
View space is often right-handed but it is up to the developer to decide.
- Clip Space
- In addition to a position and orientation, our virtual camera also needs to provide some additional information that helps us convert from a purely mathematical model of our 3D world to how it should appear on screen. We need a way to map points in View space onto specific pixel coordinates on the display.
The first step towards this is the conversion from View space to “Clip Space” which is achieved by multiplying the View space positions by a so-called “Projection Matrix” (abbreviated to P).
There are various ways to calculate a projection matrix, P, depending upon if you wish to use an orthographic projection or a perspective projection.
- Orthographic projection: Often used in CAD applications as parallel lines in the world remain parallel on screen and object sizes do not diminish with distance. The view volume (portion of scene that will appear on screen) is a cuboid.
- Perspective projection: Often used in games and other applications as this mimics the way our eyes work. Distant objects appear smaller. Angles are not preserved. The view volume for a perspective projection is a frustum (truncated rectangular pyramid).
Ultimately, the projection matrix transforms the view volume into a cuboid in clip space with a characteristic size of w. Thanks to the way that perspective projection matrices are constructed, the w component is equal to the z-depth in eye space of the point being transformed. This is so that we can later use this to perform the perspective divide operation and get perspective-correct interpolation of our geometry’s attributes (see below).
Don’t worry too much about the details of this. Conceptually it squashes things around so that anything that was inside the view volume (cuboid or frustum) ends up inside a cuboidal volume. The exact details of this depend upon which graphics API you are using (see even further below).
Why even do this? Well as the name suggests, clip space is used by the fixed function parts of the GPU to clip geometry so that it only has to rasterize parts that will actually be visible on the display. Any coordinate that has a magnitude exceeding the value w will be clipped.
- Normalised Device Coordinates
- The next step along our path to getting something to appear on screen involves the use of NDC space or Normalized Device Coordinates. This step is easy though. All we do to get from Clip space to NDC space is to divide the x, y, and z components of each vertex by the 4th (w) component and then discard the 4th component, which is now guaranteed to be exactly 1. This process is known as homogenization or the perspective divide.
It is this step that “bakes in” the perspective effect if using a perspective transformation.
The end result is that our visible part of the scene is now contained within a cuboid with characteristic length of 1. Again, see below for the differences between graphics APIs.
NDC space is a nice simple, normalized coordinate system to reason about. We’re now just a small step away from getting our 3D world to appear at the correct set of pixels on the display.
- Framebuffer or Window Space
- The final step of the process is to convert from NDC to Window Space or Framebuffer Space or Viewport Space. Again more names for the same thing. It’s basically the pixel coordinates in your application window.
The conversion from NDC to Framebuffer space is controlled by the viewport transformation that you can configure in your graphics API of choice. This transformation is just a bias (offset) and scaling operation. This makes intuitive sense when you consider that we are converting from the normalized coordinates in NDC space to pixels. The amounts of scale and bias are determined by which portion of the window you wish to display to, specifically its offset and dimensions. The details of how to set the viewport transformation vary between graphics APIs.
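To make this chain concrete, here is a minimal sketch using GLM (with OpenGL-style conventions for the projection and viewport; the Vulkan differences are the subject of the rest of this article) that traces a single model-space vertex through the Model, View and Projection matrices, the perspective divide, and an illustrative full-window viewport. All of the specific values are made up for illustration:

#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

int main()
{
    // Model Space -> World Space: scale, rotate and translate the asset (the Model matrix M).
    const glm::mat4 M = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, 1.0f, -5.0f));

    // World Space -> View Space: position and orient the virtual camera (the View matrix V).
    const glm::mat4 V = glm::lookAt(glm::vec3(0.0f, 2.0f, 3.0f),   // camera position
                                    glm::vec3(0.0f, 1.0f, -5.0f),  // point being looked at
                                    glm::vec3(0.0f, 1.0f, 0.0f));  // up direction

    // View Space -> Clip Space: the Projection matrix P (OpenGL-style perspective here).
    const glm::mat4 P = glm::perspective(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);

    // A vertex position in Model Space (w = 1 for positions).
    const glm::vec4 vertexModel(0.5f, 0.5f, 0.0f, 1.0f);

    // Clip Space position: what the vertex shader writes to gl_Position.
    const glm::vec4 clip = P * V * M * vertexModel;

    // Clip Space -> NDC: the perspective divide (performed by fixed function hardware).
    const glm::vec3 ndc = glm::vec3(clip) / clip.w;

    // NDC -> Framebuffer Space: the viewport scale and bias (OpenGL convention, full window).
    const float viewportWidth  = 1920.0f;
    const float viewportHeight = 1080.0f;
    const float xWindow = (ndc.x * 0.5f + 0.5f) * viewportWidth;
    const float yWindow = (ndc.y * 0.5f + 0.5f) * viewportHeight;

    (void)xWindow;
    (void)yWindow;
    return 0;
}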
Coordinate Systems in Practice
The above descriptions may sound scary and intimidating but in practice they are not so bad once we understand what is going on. Spending a little time to understand the sequence of operations is very worthwhile and infinitely better than randomly changing the signs of various elements to make something work in your one particular case. It’s only a matter of time until such a random tweak breaks something else.
Take a look at the following diagram that summarizes the path that data takes through the graphics pipeline and the transformations/operations at each stage:
A few things to note:
- The transformations from Model Space to Clip Space are performed in the programmable Vertex Shader stage of the graphics pipeline.
- Rather than doing 3 distinct matrix multiplications for every vertex, we often combine the M, V, and P matrices into a single matrix on the CPU and pass the result into the vertex shader. This allows a vertex to be transformed all the way to Clip Space with a single matrix multiplication.
- The clipping and perspective divide operations are fixed function (hardwired in silicon) operations. Each graphics API specifies the coordinate systems in which these happen.
- The scale and bias transformation to go from NDC to Framebuffer Space is fixed function too but is controlled via API calls such as glViewport() or vkCmdSetViewport().
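As a small illustration of that last point, a sketch of configuring the NDC-to-Framebuffer mapping in Vulkan might look like the following; the command buffer and window dimensions are assumed to come from elsewhere in your application, and the OpenGL equivalent would be a single glViewport(0, 0, windowWidth, windowHeight) call:

#include <vulkan/vulkan.h>

// Sketch: set the viewport (the NDC -> Framebuffer scale and bias) for the full window.
// 'cmdBuffer', 'windowWidth' and 'windowHeight' are assumed to be provided by the
// surrounding application code.
void setFullWindowViewport(VkCommandBuffer cmdBuffer, uint32_t windowWidth, uint32_t windowHeight)
{
    VkViewport viewport{};
    viewport.x        = 0.0f;
    viewport.y        = 0.0f;
    viewport.width    = static_cast<float>(windowWidth);
    viewport.height   = static_cast<float>(windowHeight);
    viewport.minDepth = 0.0f;
    viewport.maxDepth = 1.0f;
    vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);
}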
The upshot of all of this is that we need to create the Model, View and Projection matrices to get our vertex data correctly into Clip Space. How we do this differs subtly between the different graphics APIs such as OpenGL vs Vulkan, as we shall see now. These differences are what often lead to issues when migrating from OpenGL to Vulkan, especially when using helper libraries that were written with the expectation of only being used with OpenGL.
As stated above, the Model, World and View spaces are defined by the content creation tools (Model Space) or by us as application/library developers (World and View spaces). It is only when we get to clip space that we have to be concerned about what the graphics API we are using expects to receive.
OpenGL Coordinate Systems
With OpenGL, the fixed function parts of the pipeline all use left-handed coordinate systems as shown here:
If we stick with the common conventions of using a right-handed set of coordinate systems for Model Space, World Space and View Space, then the transformation from View Space to Clip Space must also flip the handedness of the coordinate system somehow.
Recall that to go from View Space to Clip Space, we multiply our View Space vertex by the projection matrix P. Usually we would use some library to create a projection matrix for us such as glm or even glFrustum, if you are still using OpenGL 1.x!
There are various ways to parameterize a perspective projection matrix but to keep it simple let’s stick with the left (l), right (r), top (t), bottom (b), near (n) and far (f) parameterisation as per the glFrustum specification. This assumes the virtual camera (or eye) is at the origin and that the near and far values are the distances to the near and far clip planes along the negative z-axis. The near plane is the plane to which our scene will be projected. The left, right, top and bottom values specify the positions on the near plane used to define the clip planes that form the view volume – a frustum in the case of a perspective transform.
With this parameterisation, the projection matrix for OpenGL looks like this:
P = \begin{pmatrix} \tfrac{2n}{r-l} & 0 & \tfrac{r+l}{r-l} & 0 \\ 0 & \tfrac{2n}{t-b} & \tfrac{t+b}{t-b} & 0 \\ 0 & 0 & -\tfrac{f+n}{f-n} & -\tfrac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{pmatrix}
Do not blindly use this as your projection matrix! It is specifically for OpenGL!
OK, that looks reasonable and matches various texts on OpenGL programming. It works perfectly well for OpenGL because it not only performs the perspective projection transform but also bakes in the flip from right-handed coordinates to left-handed coordinates. This last little fact seems to be something that many texts gloss over and so goes unnoticed by many graphics developers. So where does this happen? That pesky little -1 in the 3rd column of the 4th row is what does it. This has the effect of flipping the z-axis and using -z as the w component, causing the change in handedness.
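If you happen to be using GLM with its default OpenGL conventions, you can inspect this term directly. A minimal sketch (remembering that GLM stores matrices in column-major order, so P[2][3] is the 3rd column, 4th row element referred to above):

#include <cassert>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

int main()
{
    // The same l, r, b, t, n, f parameterisation as glFrustum().
    const glm::mat4 P = glm::frustum(-1.0f, 1.0f, -1.0f, 1.0f, 0.1f, 100.0f);

    // 3rd column, 4th row: the term that copies -z into w and flips the handedness.
    assert(P[2][3] == -1.0f);
    return 0;
}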
If we then blindly use the same matrix to calculate a perspective projection for use with Vulkan, which does not need the handedness flip, we end up in trouble. This is typically followed by google searches leading to one of the many hacks to provide a “fix”.
Instead, let’s use our understanding of the problem domain to now come up with a proper correction for use with Vulkan.
Vulkan Coordinate Systems
In contrast to OpenGL, the fixed function coordinate systems used in Vulkan remain right-handed, in keeping with the earlier coordinate systems, as shown here:
Notice that even though z increases into the distance and y increases downwards, it is still in fact a right-handed coordinate system. You can convince yourself of this with some flexible rotations of your right hand similar to the photographs above.
Let’s think about what we need conceptually without getting bogged down in the math – for now at least, we will save that for next time. With the OpenGL perspective projection matrix we have something that takes care of the transformation of the view frustum into a cube in clip space. The problem we have when using it with Vulkan is the flip in the handedness of the coordinate system, thanks to that -1 we mentioned in the previous section. Setting that perspective component to 1 instead of -1 prevents the flip – there’s a bit more to it, as we will see in part 2, but that takes care of the change in handedness.
We still need to reorient our coordinate axes from View Space (x-right, y-up, looking down the negative z-axis) to Vulkan’s Clip Space (x-right, y-down, looking down the positive z-axis). Since the start and end coordinate systems are both right-handed, this does not involve an axis flip as in the OpenGL case. Instead, all we need to do is to perform a rotation of 180 degrees about the x-axis. This gives us exactly the change in orientation that we need.
This means that before we see how to construct a projection matrix, we should reorient our coordinate axes so that they are already aligned with the desired clip space orientation. To do this, we inject a 180 degree rotation of the eye space coordinates around the x-axis before we apply the actual projection. This rotation is shown here:
Recall from high school maths that a rotation matrix of 180\degree (\pi radians) about the x-axis basis vector is easily constructed as:
X = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos{\pi} & \sin{\pi} & 0 \\ 0 & -\sin{\pi} & \cos{\pi} & 0 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix}
This also makes sense intuitively as the y and z components of any vector it multiplies will both be negated by the -1 elements. Note that we have two “axis flips” (an even number), so it still maintains the right-handedness of the coordinate system as desired.
So, in the end all we need to do is to include this “correction matrix”, X, into our usual chain of matrices when calculating the combined model-view-projection matrix that gets passed to the vertex shader. With the correction included, our combined matrix is calculated as A = (PX)VM = PXVM. That means the transforms applied in order (right to left) are:
- Model to World
- World to Eye/View
- Eye/View to Rotated Eye/View
- Rotated Eye/View to Clip
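In code, this might look like the following sketch (GLM again; here P stands for whatever Vulkan-appropriate projection matrix we derive in part 2). Building X directly from its known elements avoids the tiny floating point residue that a generic rotation function would produce when evaluating cos and sin of π:

#include <glm/glm.hpp>

// The "correction matrix" X: a 180 degree rotation about the x-axis,
// i.e. the diagonal (1, -1, -1, 1).
glm::mat4 makeXCorrection()
{
    glm::mat4 X(1.0f);
    X[1][1] = -1.0f; // negate y
    X[2][2] = -1.0f; // negate z
    return X;
}

// Combine everything into the single matrix handed to the vertex shader:
// A = P * X * V * M, applied right to left as listed above.
glm::mat4 makeModelViewProjection(const glm::mat4& P,
                                  const glm::mat4& V,
                                  const glm::mat4& M)
{
    return P * makeXCorrection() * V * M;
}

On the Vulkan side, makeModelViewProjection(P, V, M) then simply replaces the plain P * V * M product you would have used with OpenGL.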
With the above in place, we can transform vertices all the way from Model Space through to Vulkan’s Clip Space and beyond. All that remains for us next time is to see how to actually construct the perspective projection matrix. However, we are now in a good position (and orientation) to derive the perspective projection matrix as our source (rotated eye space) and destination (clip space) coordinate systems are now aligned. All we have to worry about is the actual projection of vertices onto the near plane.
Once we complete this next step, we will be able to avoid any of the ugly hacks mentioned at the start of this article and we will have a full understanding of how our vertices are transformed all the way from Blender through to appearing on our screens. Thanks for reading!
Part 2 is available here.