Little Trouble in Big Data – Part 1: how to use mmap() to load a large data set into RAM

A few months ago, we received a phone call from a bioinformatics group at a European university. The problem they were having appeared very simple. They wanted to know how to use mmap() to be able to load a large data set into RAM at once. OK, I thought, no problem, I can handle that one. Turns out this has grown into a complex and interesting exercise in profiling and threading.

The background is that they are performing Markov-Chain Monte Carlo simulations by sampling at random from data sets containing SNP (pronounced “snips”) genetic markers for a selection of people. It boils down to a large 2D matrix of floats where each column corresponds to an SNP and each row to a person. They provided some small and medium-sized data sets for me to test with, but their full data set consists of 500,000 people with 38 million SNP genetic markers!

The analysis involves selecting a column (SNP) at random in the data set, performing some computations on the data for all of the individuals, and collecting some summary statistics. Do that for all of the columns in the data set, and then repeat for a large number of iterations. This allows you to approximate the underlying true distribution from the discrete data that has been collected.

That’s the 10,000 ft view of the problem, so what was actually involved? Well, we undertook a bit of an adventure and learned some interesting stuff along the way, hence this blog series.

The stages we went through were:

  1. Preprocessing
  2. Loading the Data
  3. Fine-grained Threading
  4. Preprocessing Reprise
  5. Coarse Threading

In this blog, I’ll detail stages 1 and 2. The rest of the process will be revealed as the blog series unfolds, and I’ll include a final summary at the end.

1. Preprocessing

The first thing we noticed when looking at the code they already had is that quite a lot of work is being done when reading in the data for each column. They compute some summary statistics on the column, then scale and bias all the data points in that column such that the mean is zero. Bearing in mind that each column will be processed many times (typically 10k – 1 million iterations), it is wasteful to repeat this work every time the column is used.

So, reusing some general advice from 3D graphics, we moved this work further up the pipeline to a preprocessing step. The SNP data is actually stored in a compressed form, quantizing 4 SNP values into a single byte, which we then decompress when loading. So the preprocessing step does the decompression of the SNP data, calculates the summary statistics, adjusts the data and then writes the floats out to disk in the form of a ppbed file (preprocessed bed, where bed is a standard format used for this kind of data).
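As a rough illustration, the per-column work that moved into the preprocessing step looks something like the sketch below. This is only a sketch under assumptions, not the project’s actual code: the function name preprocessColumn is made up, decompression and missing-data handling are omitted, and the exact statistics may differ.

// Minimal sketch of the per-column preprocessing (hypothetical helper).
// Assumes the compressed SNP column has already been expanded into floats.
#include <Eigen/Dense>
#include <cmath>
#include <fstream>

using Eigen::VectorXf;

void preprocessColumn(VectorXf col, std::ofstream &ppBedOut)
{
    const auto n = static_cast<float>(col.size());

    // Summary statistics for this column (one value per person)
    const float mean = col.mean();
    const float sd   = std::sqrt((col.squaredNorm() - n * mean * mean) / (n - 1.0f));

    // Scale and bias the column so that it has zero mean (and unit variance)
    col.array() -= mean;
    col /= sd;

    // Append the raw floats; the ppbed file is just the adjusted columns back to back
    ppBedOut.write(reinterpret_cast<const char *>(col.data()),
                   static_cast<std::streamsize>(col.size()) * sizeof(float));
}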

The upside is that we avoid all of this work on every iteration of the Monte Carlo simulation at runtime. The downside is that 1 float per SNP per person adds up to a hell of a lot of data for the larger data sets! In fact, for the full data set it comes to roughly 69 TiB of floating point data! But to get things going, we were just worrying about smaller subsets. We will return to this later.

2. Loading the data

Even on moderately sized data sets, loading the entirety of the data set into physical RAM at once is a no-go, as it will soon exhaust even the beefiest of machines. They have a 40-core, many-many-GB-of-RAM machine which was still being exhausted. This is where the original enquiry was aimed – how to use mmap(). Turns out it’s pretty easy, as you’d expect. It’s just a case of setting the correct flags so that the kernel doesn’t actually take a copy of the data in the file – namely, PROT_READ and MAP_SHARED:

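// Excerpt from the Data class; open() and mmap() need the POSIX headers
// <fcntl.h> and <sys/mman.h>.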
void Data::mapPreprocessBedFile(const string &preprocessedBedFile)
{
    // Calculate the expected file sizes - cast to size_t so that we don't overflow the unsigned int's
    // that we would otherwise get as intermediate variables!
    const size_t ppBedSize = size_t(numInds) * size_t(numIncdSnps) * sizeof(float);
 
    // Open and mmap the preprocessed bed file
    ppBedFd = open(preprocessedBedFile.c_str(), O_RDONLY);
    if (ppBedFd == -1)
        throw("Error: Failed to open preprocessed bed file [" + preprocessedBedFile + "]");
 
    ppBedMap = reinterpret_cast<float *>(mmap(nullptr, ppBedSize, PROT_READ, MAP_SHARED, ppBedFd, 0));
    if (ppBedMap == MAP_FAILED)
        throw("Error: Failed to mmap preprocessed bed file");
 
    ...
}

When dealing with such large amounts of data, be careful of overflows in temporaries! We had a bug where ppBedSize was overflowing and later causing a segfault.
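To make the trap concrete, here is a small stand-alone illustration (the counts are made up; only the arithmetic pattern matters). If both counts are plain unsigned ints, their product is computed in 32 bits and wraps before it is ever widened to the 64-bit size_t (assuming a typical 64-bit platform):

#include <cstddef>
#include <cstdio>

int main()
{
    const unsigned int numInds = 500000;   // people
    const unsigned int numSnps = 38000000; // SNP markers

    // Wrong: numInds * numSnps is evaluated as unsigned int and wraps around,
    // so the final size is garbage even though it is stored in a size_t
    const size_t wrong = numInds * numSnps * sizeof(float);

    // Right: widen the operands first so the whole product is computed in 64 bits
    const size_t right = size_t(numInds) * size_t(numSnps) * sizeof(float);

    std::printf("wrong = %zu\nright = %zu\n", wrong, right);
    return 0;
}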

So, at this point we have a float *ppBedMap pointing at the start of the huge 2D matrix of floats. That’s all well and good, but not very convenient to work with. The code base already made use of Eigen for vector and matrix operations, so it would be nice if we could interface with the underlying data using that.

Turns out we can (otherwise I wouldn’t have mentioned it). Eigen provides VectorXf and MatrixXf types for vectors and matrices but these own the underlying data. Luckily Eigen also provides a wrapper around these in the form of Map. Given our pointer to the raw float data which is mmap()‘d, we can use the placement new operator to wrap it up for Eigen like so:

class Data
{
public:
    Data();
 
    // mmap related data
    int ppBedFd;
    float *ppBedMap;
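    // Note: Eigen's Map has no default constructor, so mappedZ must be
    // initialised in Data's constructor (e.g. with nullptr and zero dimensions)
    // before it can later be re-pointed at the mapped data with placement new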
    Map<MatrixXf> mappedZ;
};
 
 
void Data::mapPreprocessBedFile(const string &preprocessedBedFile)
{
    ...
 
    ppBedMap = reinterpret_cast<float *>(mmap(nullptr, ppBedSize, PROT_READ, MAP_SHARED, ppBedFd, 0));
    if (ppBedMap == MAP_FAILED)
        throw("Error: Failed to mmap preprocessed bed file");
 
    new (&mappedZ) Map<MatrixXf>(ppBedMap, numRows, numCols);
}

At this point we can now do operations on the mappedZ matrix and they will operate on the huge data file which will be paged in by the kernel as needed. We never need to write back to this data so we didn’t need the PROT_WRITE flag for mmap.
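As a hypothetical example of what the inner loop can now look like (the random engine rng, the distribution from <random> and the squared-norm computation are placeholders for the real sampling and per-SNP maths), picking a column only touches the pages that back that column:

// Hypothetical inner-loop fragment: draw an SNP (column) at random and work on it.
// Only the pages backing that column need to be resident; the kernel faults them in.
std::uniform_int_distribution<unsigned int> snpDist(0, numIncdSnps - 1);
const unsigned int snp = snpDist(rng);

const VectorXf column = mappedZ.col(snp); // copies one column out of the mapping
const float sumSq = column.squaredNorm(); // stand-in for the real per-SNP update

Since MatrixXf (and hence the Map) is column-major by default, mappedZ.col(snp) refers to a contiguous run of the underlying file, which keeps this access pattern friendly to the page cache.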

Yay! Original problem solved and we’ve saved a bunch of work at runtime by preprocessing. But there’s a catch! It’s still slow. See the next blog in the series for how we solved this.

Sean is a senior software engineer at KDAB where he heads up our UK office and also leads the 3D R&D team. He has been developing with C++ and Qt since 1998 and is Qt 3D Maintainer and lead developer in the Qt Project. Sean has broad experience and a keen interest in scientific visualization and animation in OpenGL and Qt. He holds a PhD in Astrophysics along with a Masters in Mathematics and Astrophysics.

1 thought on “Little Trouble in Big Data – Part 1”

  1. I would have done things slightly differently at a few points.

    First, your casting is inconsistent:

    const size_t ppBedSize = size_t(numInds) * size_t(numIncdSnps) * sizeof(float);
    ppBedMap = reinterpret_cast<float *>(mmap(…

    Since you already had problems with overflows, you may agree that finding casts easily is a good thing, which makes searching for “_cast<” handy. So this boils down to:
    const size_t ppBedSize = static_cast<size_t>(numInds) * static_cast<size_t>(numIncdSnps) * sizeof(float);
    ppBedMap = static_cast<float *>(mmap(…

    Yes, the latter one can also be a static_cast, as casting from void* is permitted.

    Next, given that you opened the file with O_RDONLY and mapped it with PROT_READ, I would expect ppBedMap to be “const float *” for all the good reasons that “const” has.

    And then there is ppBedFd, which I suspect you don’t need anymore. Once you have mmap()’d a file, you can close the file descriptor, unless you need it for other fancy things.
