Implementing an Audio Mixer, Part 1: Basic DSP with Qt Multimedia
Motivation
When using Qt Multimedia to play audio files, it’s common to use QMediaPlayer, as it supports a larger variety of formats than QSound and QSoundEffect. Consider a Qt application with several audio sources; for example, different notification sounds that may play simultaneously. We want to avoid cutting notification sounds off when a new one is triggered, and we don’t want to queue the notification sounds, as queued sounds would play later than intended. We instead want these sounds to overlap and play simultaneously.
Ideally, an application with audio has one output stream to the system mixer; that way, each application can be set to its own volume level in the mixer control. However, a QMediaPlayer instance can only play one audio source at a time, so each notification would have to construct a new QMediaPlayer, and each player in turn opens its own stream to the system. The result is a huge number of streams to the system mixer being opened and closed all the time, as well as QMediaPlayers constantly being constructed and destructed.
To resolve this, the application needs a mixer of its own. It will open a single stream to the system and combine all the audio into the one stream.
Before we can implement this, we first need to understand how PCM audio works.
PCM
As defined by Wikipedia:
Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications. In a PCM stream, the amplitude of the analog signal is sampled at uniform intervals, and each sample is quantized to the nearest value within a range of digital steps.
Here you can see how points are sampled at uniform intervals and quantized to the closest value that can be represented.
Description from Wikipedia: Sampling and quantization of a signal (red) for 4-bit LPCM over a time domain at specific frequency.
Think of a PCM stream as a humongous array of bytes. More specifically, it’s an array of samples, which are either integer or float values and a certain number of bytes in size. The samples are these discrete amplitude values from a waveform, organized contiguously. Think of each element as a y-value of a point along the wave, with the index representing an offset from x=0 at a uniform time interval.
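For instance, a fragment of a 16-bit stream holding one half-cycle of a sine wave might look like this (the values are purely illustrative):

#include <cstdint>

// one half-cycle of a sine wave as 16-bit samples:
// 32767 * sin(k * pi / 8) for k = 0..8
const std::int16_t samples[] = {
    0, 12539, 23170, 30273, 32767, 30273, 23170, 12539, 0
};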
Here is a graph of discretely sampled points along a sinusoidal waveform similar to the one above:
Image Source: Wikimedia Commons
Description from Wikimedia Commons: Image of a discrete time sinusoid
Let’s say we have an audio waveform that is a simple sine wave, like the above examples. Each point taken at discrete intervals along the curve here is a sample, and together they approximate a continuous waveform. The distance between the samples along the x-axis is a time delta: the sample period. The sample rate is the inverse of this, the number of samples that are played in one second. The standard sample rate for CD audio is 44100 Hz, which puts the sample period at 1/44100 ≈ 22.7 microseconds; at that rate we can’t really hear that this data is discrete (plus, the resultant sound wave from air movement is in fact a continuous waveform).
We also have to consider the y-axis here, which represents the amplitude of the waveform at each sampled point. In the image above, amplitude A is normalized such that A ∈ [−1, 1]. In digital audio, there are a few different ways to represent amplitude. We can’t represent all real numbers on a computer, so the representation of the range of values varies in precision.
For example, let’s say we have two different representations of the wave above: 8-bit signed integer and 16-bit signed integer. The normalized value 1 from the image above maps to (2^8 / 2) − 1 = 127 in 8-bit representation and (2^16 / 2) − 1 = 32767 in 16-bit. Therefore, with 16-bit representation, we have 256 times as many possible values to represent the same range; it is more precise, but the required size to store each 16-bit sample is double that of 8-bit samples.
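As a quick sanity check of those limits, using the fixed-width standard types (these assertions are evaluated entirely at compile time):

#include <cstdint>
#include <limits>

// (2^8 / 2) - 1 = 127 and (2^16 / 2) - 1 = 32767
static_assert(std::numeric_limits<std::int8_t>::max() == 127);
static_assert(std::numeric_limits<std::int16_t>::max() == 32767);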
We call the chosen representation, and thus the size of each sample, the bit depth. Some common bit depths are 16-bit int, 24-bit int, and 32-bit float, but there are many others in existence.
Let’s consider a huge stream of 16-bit samples and a sample rate of 44100 Hz. We write samples to the audio device periodically with a fixed-size buffer; let’s say it is 4096 bytes. The device will play each sample in the buffer at the aforementioned rate. Since each sample is a contiguous 2-byte short, we can fit 2048 samples into the buffer at once. We need to write 44100 samples in one second, so the whole buffer will be written around 21.5 times per second.
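Here is a minimal sketch of that arithmetic; all the names are illustrative:

#include <cstdio>

int main()
{
    constexpr int sampleRate      = 44100; // samples per second
    constexpr int bufferSizeBytes = 4096;  // fixed-size device buffer
    constexpr int bytesPerSample  = 2;     // 16-bit samples

    constexpr int samplesPerBuffer = bufferSizeBytes / bytesPerSample; // 2048
    const double buffersPerSecond =
        static_cast<double>(sampleRate) / samplesPerBuffer;            // ~21.5

    std::printf("%d samples per buffer, %.1f buffer writes per second\n",
                samplesPerBuffer, buffersPerSecond);
}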
What if we have two different waveforms though, and what if one starts halfway through the other one? How do we mix them so that this buffer contains the data from both sources?
Waveform Superimposition
In the study of waves, you can superimpose two waves by adding them together. Let’s say we have two different discrete wave approximations, each represented by 20 signed 8-bit integer values. To superimpose them, for each index, add the values at that index. Some of these sums will exceed the limits of 8-bit representation, so we clamp them at the end to avoid signed integer overflow. This is known as hard clipping and is the phenomenon responsible for digital overdrive distortion.
x | Wave 1 (y_1) | Wave 2 (y_2) | Sum (y_1 + y_2) | Clamped Sum
---|---|---|---|---
0 | +60 | −100 | −40 | −40
1 | −120 | +80 | −40 | −40
2 | +40 | +70 | +110 | +110
3 | −110 | −100 | −210 | −128
4 | +50 | −110 | −60 | −60
5 | −100 | +60 | −40 | −40
6 | +70 | +50 | +120 | +120
7 | −120 | −120 | −240 | −128
8 | +80 | −100 | −20 | −20
9 | −80 | +40 | −40 | −40
10 | +90 | +80 | +170 | +127
11 | −100 | −90 | −190 | −128
12 | +60 | −120 | −60 | −60
13 | −120 | +70 | −50 | −50
14 | +80 | −120 | −40 | −40
15 | −110 | +80 | −30 | −30
16 | +90 | −100 | −10 | −10
17 | −110 | +90 | −20 | −20
18 | +100 | −110 | −10 | −10
19 | −120 | −120 | −240 | −128
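To make this concrete, here is a small sketch that reproduces the first few rows of the table; as in the table, each sum is computed in a wider type before being hard-clipped:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>

int main()
{
    // the first five samples of Wave 1 and Wave 2 from the table above
    const std::int8_t wave1[] = { 60, -120, 40, -110, 50 };
    const std::int8_t wave2[] = { -100, 80, 70, -100, -110 };

    for (std::size_t i = 0; i < std::size(wave1); ++i)
    {
        // sum in int so the addition itself cannot overflow, then hard-clip
        const int sum     = wave1[i] + wave2[i];
        const int clamped = std::clamp(sum, -128, 127);
        std::printf("x=%zu  sum=%+d  clamped=%+d\n", i, sum, clamped);
    }
}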
Now let’s implement this in C++. We’ll start small, and just combine two samples.
Note: we will use the qint types here, but qint16 will be the same as int16_t and short on most systems, and similarly qint32 will correspond to int32_t and int.
qint16 combineSamples(qint32 samp1, qint32 samp2)
{
    // sum in 32 bits so the addition itself cannot overflow
    const auto sum = samp1 + samp2;

    // hard-clip to the representable 16-bit range
    if (std::numeric_limits<qint16>::max() < sum)
        return std::numeric_limits<qint16>::max();
    if (std::numeric_limits<qint16>::min() > sum)
        return std::numeric_limits<qint16>::min();

    return static_cast<qint16>(sum);
}
This is quite a simple implementation. We use a function combineSamples and pass in two 16-bit values, but they will be converted to 32-bit as arguments and summed. This sum is clamped to the limits of 16-bit integer representation using std::numeric_limits in the <limits> header of the standard library. We then return the sum, at which point it is converted back to a 16-bit value.
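As an aside, the same clamping can be written more compactly with std::clamp from <algorithm>; this sketch should behave identically to the function above:

#include <algorithm>
#include <limits>
#include <QtGlobal>

qint16 combineSamples(qint32 samp1, qint32 samp2)
{
    // sum in 32 bits, then clamp to the representable 16-bit range
    return static_cast<qint16>(std::clamp<qint32>(
        samp1 + samp2,
        std::numeric_limits<qint16>::min(),
        std::numeric_limits<qint16>::max()));
}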
Combining Samples for an Arbitrary Number of Audio Streams
Now consider an arbitrary number of audio streams, n. For each sample position, we must sum the samples of all n streams.
Let’s assume we have some sort of audio stream type (we’ll implement it later), and a list called mStreams containing pointers to instances of this stream type. We need to implement a function that loops through mStreams and calls our combineSamples function, accumulating a sum into a new buffer.
Assume each stream in mStreams has a member function read(char *, qint64). We can copy one sample into a char * by passing it to read, along with a qint64 giving the size of a sample in bytes (as determined by the bit depth). Remember that our bit depth is 16-bit integer, so this size is just sizeof(qint16).
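We haven’t defined the stream type yet, so as a placeholder, assume an interface along these lines (a sketch only; the name AudioStream and the exact signature are assumptions, modeled on QIODevice::read):

#include <QtGlobal>

class AudioStream
{
public:
    // reads up to maxSize bytes into data; returns the number of bytes
    // actually read, 0 at the end of the stream, or -1 on error
    qint64 read(char *data, qint64 maxSize);
};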
Using read on all the streams in mStreams and calling combineSamples to accumulate a sum might look something like this:
qint16 accumulatedSum = 0;
for (auto *stream : mStreams)
{
    // read one sample from the stream into streamSample
    qint16 streamSample;
    stream->read(reinterpret_cast<char *>(&streamSample), sizeof(qint16));

    // accumulate
    accumulatedSum = combineSamples(streamSample, accumulatedSum);
}
The first pass will add samples from the first stream to zero, effectively copying them into accumulatedSum. When we move to another stream, the samples from the second stream will be added to those copied values from the first stream. This continues, so the call to combineSamples for a third stream would combine the third stream’s sample with the sum of the first two. We continue accumulating this way until all the streams have been combined.
Combining All Samples for a Buffer
Now let’s use this concept to add all the samples for a buffer. We’ll make a function that takes a buffer char *data and its size qint64 maxSize. We’ll write our accumulated samples into this buffer, reading all samples from the streams and adding them using the method above.
The function signature looks like this:
void readData(char *data, qint64 maxSize);
Let’s use a constexpr variable for the sample size, so we don’t spell out sizeof(qint16) everywhere:

constexpr qint16 bitDepth = sizeof(qint16);

sizeof(qint16) is already evaluated at compile time, so this costs nothing at runtime; naming it bitDepth just makes the intent clearer.
With the size of each sample and the size of the buffer, we can get the total number of samples to write:

const qint64 numSamples = maxSize / bitDepth;
For each stream in mStreams, we need to read each sample up to numSamples. As the sample index increments, a pointer into the buffer needs to be incremented too, so we can write our results at the correct location in the buffer. That looks like this:
void readData(char *data, qint64 maxSize)
{
    // start with silence (all zeroes) in the buffer
    memset(data, 0, maxSize);

    constexpr qint16 bitDepth = sizeof(qint16);
    const qint64 numSamples = maxSize / bitDepth;

    for (auto *stream : mStreams)
    {
        // this pointer will be incremented across the buffer
        auto *cursor = reinterpret_cast<qint16 *>(data);
        qint16 sample;

        for (qint64 i = 0; i < numSamples; ++i, ++cursor)
        {
            // read() returns the number of bytes read (or -1 on error),
            // so only mix when a full sample was actually read
            if (stream->read(reinterpret_cast<char *>(&sample), bitDepth) == bitDepth)
                *cursor = combineSamples(sample, *cursor);
        }
    }
}
The idea here is that we can start playing new audio sources by adding new streams to mStreams. If we add a second stream halfway through the first stream playing, the next buffer for the first stream will be combined with the first buffer of the new stream. When we’re done playing a stream, we just drop it from the list.
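For illustration, starting and stopping sources might then look like this, reusing the hypothetical AudioStream type sketched earlier (play and onStreamFinished are placeholder names, not Qt API):

#include <QList>

QList<AudioStream *> mStreams;

void play(AudioStream *stream)
{
    // the next readData() call will start mixing this stream in
    mStreams.append(stream);
}

void onStreamFinished(AudioStream *stream)
{
    // drop the stream from the mix once it has finished playing
    mStreams.removeOne(stream);
}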
Next Steps
In Part 2, we’ll use Qt Multimedia to fully implement our mixer, connect to our audio device, and test it on some audio files.
One final note: combineSamples clamps at every pairwise addition, so when clipping occurs the mixed result can depend on the order in which the streams are combined. A refinement is to accumulate the intermediate sums for each sample position in 32-bit integers and clamp to the 16-bit output range only once, after all the streams have been summed.