I’ve recently been doing a lot of coding in C++ & CUDA for my PhD project, and I’ve needed to dig in and understand memory layouts in a way I never had to when working in Python. In this post, I’ll explain how I’ve learned to think about row-major and column-major memory ordering in C++, especially when dealing with more than 2 dimensions.
Imagine you have a 4-dimensional dataset, maybe something astronomy-related like telescope data, where your dimensions would be (Receiver, Channel, Polarization, Timestep). In Python, you’d create a numpy array with 4 dimensions, use square brackets to access particular elements, and life would be good!
For example, you could access receiver 2, channel 3, polarization 2 and timestep 10 using:
dataset[2, 3, 2, 10]
# 1 + 2j
If you needed to rearrange the axes, then np.transpose is there to help!
import numpy as np
data = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
data_transposed = np.transpose(data, (1,0,2))
data_transposed
# array([[[1, 2], [5, 6]], [[3, 4], [7, 8]]])
In C++, things are not as clear cut. While there are data structures that allow numpy-like indexing, much of the time you just have a long 1D array that you need to impose structure upon & find the right place for the data you’re looking for.
For example, using a vector structure:
#include <vector>
auto a = std::vector<int>{1, 2, 3, 4, 5, 6, 7, 8};
How can we interpret or unpack this 1 dimensional array of numbers into a matrix or tensor? Enter the idea of row-major and column-major layouts.
Let’s start with a 2D case where we have the 2x2 matrix [[1, 2], [3, 4]].
If this is in row-major format, then each time I increment the index by 1, I move along a row. When I reach the end of a row, I wrap around to the start of the next row. Column-major format is similar, except I move down each column instead of along a row.
data = [1, 2, 3, 4] ## row-major
data = [1, 3, 2, 4] ## column-major
OK – so this makes sense. But what happens when we go from having 2 dimensions to…more than 2? Well, firstly our matrix will become a tensor, a higher dimensional analogue of the matrix.
In this case, we will need a better definition of row-major and column-major formats. Let’s take an example with a 2x2x2 tensor that looks like this:
data[0] = [[1,2],[3,4]]
data[1] = [[5,6],[7,8]]
# A 1D representation in row-major order.
array_data = [1,2,3,4,5,6,7,8]
Note that each element of the tensor is a 2×2 matrix. We will use array_data to represent our 1D array which we are trying to structure correctly.
We can now define the row-major and column-major orderings more generally:
Definition (Row-major) – Given a set of dimensions, a tensor is stored in row-major order when the right-most dimension is the fastest-changing one: increasing the index of the 1D array by 1 increments the right-most tensor index, wrapping into the dimension to its left whenever it overflows.
So in this case we start at array_data[0] = data[0,0,0] = 1. The next element increments the right-most dimension – so we go to array_data[1] = data[0,0,1] = 2. The next element reaches the end of the row and wraps around to the next row: array_data[2] = data[0,1,0] = 3. Each time we increment the right-most index by 1, wrapping around into the next dimension when required. The full layout is [1,2,3,4,5,6,7,8].
Definition (Column-major) – Given a set of dimensions, a tensor is stored in column-major order when the left-most dimension is the fastest-changing one: increasing the index of the 1D array by 1 increments the left-most tensor index, wrapping into the dimension to its right whenever it overflows.
In this layout, our first element remains the same: array_data[0] = data[0,0,0] = 1. However, the next element updates the left-most dimension, so we get array_data[1] = data[1,0,0] = 5. Then we wrap around to the next row: array_data[2] = data[0,1,0] = 3, followed by array_data[3] = data[1,1,0] = 7. The full layout is [1,5,3,7,2,6,4,8].
Why bother with column-major layouts at all? Why not just make everything row-major? Having different memory layouts allows us to get speedups on common operations, for instance matrix or tensor multiplication. When you multiply matrices, you multiply a row of the first matrix with a column of the second – so if the first operand is stored row-major and the second column-major, the data that is read together is stored together, which on a GPU enables coalesced memory accesses. (As an aside, cuBLAS, the CUDA library for linear algebra, inherits the Fortran BLAS convention and assumes column-major storage for its matrices, with transpose flags to deal with other layouts.) Storing your data in the right format to allow coalesced memory accesses can result in huge performance increases in your code!
Thanks for reading, and I hope this has been helpful to you!