
Implementing an Audio Mixer, Part 1: Basic DSP with Qt Multimedia

Motivation

When using Qt Multimedia to play audio files, it’s common to use QMediaPlayer, as it supports a wider variety of formats than QSound and QSoundEffect. Consider a Qt application with several audio sources; for example, different notification sounds that may play simultaneously. We don’t want to cut a sound off when a new one is triggered, and we don’t want to queue the sounds either, as they would then play at the wrong time. We instead want these sounds to overlap and play simultaneously.

Ideally, an application with audio has a single output stream to the system mixer, so that in the mixer control each application can be set to its own volume level. However, a QMediaPlayer instance can only play one audio source at a time, so each notification would have to construct a new QMediaPlayer. Each player in turn opens its own stream to the system.

The result is a huge number of streams to the system mixer being opened and closed all the time, as well as QMediaPlayers constantly being constructed and destructed.

To resolve this, the application needs a mixer of its own. It will open a single stream to the system and combine all the audio into the one stream.

Before we can implement this, we first need to understand how PCM audio works.

PCM

As defined by Wikipedia:

Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the amplitude of the analog signal is sampled at uniform intervals, and each sample is quantized to the nearest value within a range of digital steps.

Here you can see how points are sampled at uniform intervals and quantized to the closest representable value.

[Figure: Sampling and quantization of a signal (red) for 4-bit LPCM. Image source: Wikipedia]

Think of a PCM stream as a humongous array of bytes. More specifically, it’s an array of samples, which are integer or float values of a certain size in bytes. The samples are discrete amplitude values taken from a waveform and stored contiguously. Think of each element as the y-value of a point along the wave, with its index representing a time offset from x=0 at a uniform interval.
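To make that concrete, here’s a toy buffer (the values are ours, chosen purely for illustration): one full cycle of a sine wave captured as eight 16-bit samples.

#include <QtGlobal>

// one cycle of a sine wave as eight 16-bit samples;
// index = time offset in sample periods, value = amplitude (y-value)
const qint16 sineCycle[] = {0, 23170, 32767, 23170, 0, -23170, -32767, -23170};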

Here is a graph of discretely sampled points along a sinusoidal waveform similar to the one above:

[Figure: A discrete time sinusoid. Image source: Wikimedia Commons]

Let’s say we have an audio waveform that is a simple sine wave, like the above examples. Each point taken at a discrete interval along the curve is a sample, and together they approximate a continuous waveform. The distance between samples along the x-axis is a time delta: the sample period. The sample rate is its inverse: the number of samples played in one second. The standard sample rate for CD audio is 44100 Hz, giving a sample period of 1/44100 ≈ 22.7 µs. At this rate we can’t really hear that the data is discrete (plus, the resulting sound wave from the moving air is in fact a continuous waveform).

We also have to consider the y-axis, which represents the amplitude of the waveform at each sampled point. In the image above, amplitude A is normalized such that A ∈ [−1, 1]. In digital audio, there are a few different ways to represent amplitude. We can’t represent all real numbers on a computer, so the precision of the representable range varies.

For example, let’s say we have two different representations of the wave above: 8-bit signed integer and 16-bit signed integer. The normalized value 1 from the image above maps to (2^8 / 2) − 1 = 127 in 8-bit representation and (2^16 / 2) − 1 = 32767 in 16-bit. Therefore, with 16-bit representation, we have 256 times as many possible values to represent the same range; it is more precise, but each 16-bit sample requires double the storage of an 8-bit sample.

We call the chosen representation, and thus the size of each sample, the bitdepth. Common bitdepths are 16-bit int, 24-bit int, and 32-bit float, but many others exist.
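As a minimal sketch of what that mapping looks like in code (the function names are ours, not part of any Qt API), quantizing a normalized amplitude to a given bitdepth is just scaling and rounding:

#include <QtGlobal>
#include <cmath>

// map a normalized amplitude a in [-1, 1] to a signed 16-bit sample
qint16 quantize16(double a)
{
    return static_cast<qint16>(std::lround(a * 32767.0));
}

// the same mapping at 8-bit depth offers only 256 discrete steps
qint8 quantize8(double a)
{
    return static_cast<qint8>(std::lround(a * 127.0));
}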

Let’s consider a huge stream of 16-bit samples and a sample rate of 44100 Hz. We write samples to the audio device periodically with a fixed-size buffer; let’s say it is 4096 bytes. The device will play each sample in the buffer at the aforementioned rate. Since each sample is a contiguous 2-byte short, we can fit 2048 samples into the buffer at once. We need to write 44100 samples in one second, so the whole buffer will be written around 21.5 times per second.
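As a quick sanity check of that arithmetic (the constants here are ours, matching the numbers above):

#include <QtGlobal>

constexpr int sampleRate = 44100;                  // samples per second
constexpr int bufferSize = 4096;                   // bytes
constexpr int sampleSize = sizeof(qint16);         // 2 bytes per sample
constexpr int samplesPerBuffer = bufferSize / sampleSize;      // 2048
constexpr double buffersPerSecond =
    static_cast<double>(sampleRate) / samplesPerBuffer;        // ≈ 21.5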

What if we have two different waveforms though, and what if one starts halfway through the other one? How do we mix them so that this buffer contains the data from both sources?

Waveform Superimposition

In the study of waves, you can superimpose two waves by adding them together. Let’s say we have two different discrete wave approximations, each represented by 20 signed 8-bit integer values. To superimpose them, for each index, add the values at that index. Some of these sums will exceed the limits of 8-bit representation, so we clamp them at the end to avoid signed integer overflow. This is known as hard clipping and is the phenomenon responsible for digital overdrive distortion.

 x | Wave 1 (y1) | Wave 2 (y2) | Sum (y1 + y2) | Clamped Sum
---+-------------+-------------+---------------+------------
 0 |     +60     |    −100     |      −40      |     −40
 1 |    −120     |     +80     |      −40      |     −40
 2 |     +40     |     +70     |     +110      |    +110
 3 |    −110     |    −100     |     −210      |    −128
 4 |     +50     |    −110     |      −60      |     −60
 5 |    −100     |     +60     |      −40      |     −40
 6 |     +70     |     +50     |     +120      |    +120
 7 |    −120     |    −120     |     −240      |    −128
 8 |     +80     |    −100     |      −20      |     −20
 9 |     −80     |     +40     |      −40      |     −40
10 |     +90     |     +80     |     +170      |    +127
11 |    −100     |     −90     |     −190      |    −128
12 |     +60     |    −120     |      −60      |     −60
13 |    −120     |     +70     |      −50      |     −50
14 |     +80     |    −120     |      −40      |     −40
15 |    −110     |     +80     |      −30      |     −30
16 |     +90     |    −100     |      −10      |     −10
17 |    −110     |     +90     |      −20      |     −20
18 |    +100     |    −110     |      −10      |     −10
19 |    −120     |    −120     |     −240      |    −128

Now let’s implement this in C++. We’ll start small, and just combine two samples.

Note: we will use qint types here, but qint16 will be the same as int16_t and short on most systems, and similarly qint32 will correspond to int32_t and int.

qint16 combineSamples(qint32 samp1, qint32 samp2)
{
    // sum in 32 bits so the intermediate result cannot overflow
    const auto sum = samp1 + samp2;

    // hard-clip to the limits of 16-bit representation
    if (std::numeric_limits<qint16>::max() < sum)
        return std::numeric_limits<qint16>::max();

    if (std::numeric_limits<qint16>::min() > sum)
        return std::numeric_limits<qint16>::min();

    return static_cast<qint16>(sum);
}

This is quite a simple implementation. We pass two 16-bit values into combineSamples; as arguments they are widened to 32-bit and summed, so the addition itself cannot overflow. The sum is clamped to the limits of 16-bit integer representation using std::numeric_limits from the standard library’s <limits> header, then cast back to a 16-bit value and returned.
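We can sanity-check the function against rows of the table above:

combineSamples(40, 70);     // returns +110: within range, no clipping (row x = 2)
combineSamples(90, 80);     // +170 exceeds +127, clamped to +127 (row x = 10)
combineSamples(-110, -100); // -210 falls below -128, clamped to -128 (row x = 3)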

Combining Samples for an Arbitrary Number of Audio Streams

Now consider an arbitrary number of audio streams n. For each sample position, we must sum the samples of all n streams.

Let’s assume we have some sort of audio stream type (we’ll implement it later), and a list called mStreams containing pointers to instances of this stream type. We need to implement a function that loops through mStreams and makes calls to our combineSamples function, accumulating a sum into a new buffer.

Assume each stream in mStreams has a member function read(char *, qint64). We can copy one sample into a char * by passing it to read, along with a qint64 representing the size of a sample (bitdepth). Remember that our bitdepth is 16-bit integer, so this size is just sizeof(qint16).

Using read on all the streams in mStreams and calling combineSamples to accumulate a sum might look something like this:

qint16 accumulatedSum = 0;

for (auto *stream : mStreams)
{
    // call stream->read(char *, qint64)
    // to read a sample from the stream into streamSample
    qint16 streamSample;
    stream->read(reinterpret_cast<char *>(&streamSample), sizeof(qint16));

    // accumulate
    accumulatedSum = combineSamples(streamSample, accumulatedSum);
}

The first pass adds samples from the first stream to zero, effectively copying them into accumulatedSum. The samples from the second stream are then added to those copied values, so a call to combineSamples for a third stream would combine the third stream’s sample with the sum of the first two. We keep accumulating this way until all the streams have been combined.

Combining All Samples for a Buffer

Now let’s use this concept to add all the samples for a buffer. We’ll make a function that takes a buffer char *data and its size qint64 maxSize. We’ll write our accumulated samples into this buffer, reading all samples from the streams and adding them using the method above.

The function signature looks like this:

void readData(char *data, qint64 maxSize);

Let’s give the sample size a name with a constexpr variable:

constexpr qint16 bitDepth = sizeof(qint16);

There’s no reason to spell out sizeof(qint16) at every use; it’s a compile-time constant, so naming it costs nothing and makes the code clearer.

With the size of each sample and the size of the buffer, we can get the total number of samples to write:

const qint64 numSamples = maxSize / bitDepth;

For each stream in mStreams, we need to read every sample up to numSamples. As the sample index increments, a pointer into the buffer also needs to be incremented, so that we write our results at the correct location in the buffer.

That looks like this:

void readData(char *data, qint64 maxSize)
{
    // start with silence (all zeroes) in the buffer
    memset(data, 0, maxSize);

    constexpr qint16 bitDepth = sizeof(qint16);
    const qint64 numSamples = maxSize / bitDepth;

    for (auto *stream : mStreams)
    {
        // this pointer will be incremented across the buffer
        auto *cursor = reinterpret_cast<qint16 *>(data);
        qint16 sample;

        for (qint64 i = 0; i < numSamples; ++i, ++cursor)
            if (stream->read(reinterpret_cast<char *>(&sample), bitDepth))
                *cursor = combineSamples(sample, *cursor);
    }
}

The idea here is that we can start playing new audio sources by adding new streams to mStreams. If we add a second stream halfway through a first stream playing, the next buffer for the first stream will be combined with the first buffer of this new stream. When we’re done playing a stream, we just drop it from the list.
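A sketch of that cleanup might look like this, assuming mStreams is a std::vector or QList of stream pointers and the stream type exposes something like an atEnd() query (the real interface comes in Part 2):

#include <algorithm>

// drop any streams that have no more samples to play;
// atEnd() is a hypothetical member of our stream type
mStreams.erase(
    std::remove_if(mStreams.begin(), mStreams.end(),
                   [](auto *stream) { return stream->atEnd(); }),
    mStreams.end());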


Next Steps

In Part 2, we’ll use Qt Multimedia to fully implement our mixer, connect to our audio device, and test it on some audio files.

About KDAB

If you like this article and want to read similar material, consider subscribing via our RSS feed.

Subscribe to KDAB TV for similar informative short video content.

KDAB provides market-leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.
