Representing Matrices in Memory: Row-Major and Column-Major Ordering
In most programming languages, an array of primitive values (like integers) usually implies that the values are stored adjacent to each other in memory. This is most evident in a language like C, where array indices are equivalent to pointer arithmetic:
#include <stdio.h>

int main() {
    int arr[5] = {1, 2, 3, 4, 5};

    // Example 1: Using typical array indices
    for (int i = 0; i < 5; i++) { // output: "1 2 3 4 5"
        printf("%d ", arr[i]);
    }
    printf("\n");

    // Example 2: Using explicit pointer arithmetic
    for (int i = 0; i < 5; i++) { // also outputs: "1 2 3 4 5"
        printf("%d ", *(arr + i));
    }
}
In the above example, if the first integer in `arr` is stored at address $a$, then each subsequent integer is stored at an offset from $a$, calculated using the index $i$. For example, if we assume byte-addressable memory, an integer size of 4 bytes (32 bits), and $a$ = 0x0000, then the first integer would be at 0x0000, the second at 0x0004, the third at 0x0008, the fourth at 0x000C, and the fifth at 0x0010 (using hexadecimal notation). This would result in a memory layout like this [1]:
Address Contents
---------
0x0000 | 1 |
0x0004 | 2 |
0x0008 | 3 |
0x000C | 4 |
0x0010 | 5 |
---------
In the second example above, we use the explicit pointer arithmetic `*(arr + i)` to access each element. This works because pointer arithmetic takes into account the size of the type pointed at, so adding `i` to `arr` actually increases the address by `i * 4`, assuming that the size of `int` on this system is 4 bytes.
An array is a one-dimensional structure. What about a matrix? A matrix is a two-dimensional structure with a number of rows and columns. For example, a matrix $A$ that has 2 rows and 3 columns might look like this: $$ A = \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6 \end{bmatrix} $$ When such a matrix [2] is stored in memory, the contents must be linearized, since memory is arranged as a sequence of addresses, usually one byte per address, as we saw above. How might this be done?
- Row-major order: The elements of each row are stored contiguously: the first row is stored first, then the second row, and so on.
- Column-major order: The elements of each column are stored contiguously: the first column first, then the second, and so on.
If the above matrix $A$ was stored in row-major order then it would look something like this in memory:
Address Contents
---------
0x0000 | 1 | \
0x0004 | 2 | |-- First row
0x0008 | 3 | /
0x000C | 4 | \
0x0010 | 5 | |-- Second row
0x0014 | 6 | /
---------
Conversely, if $A$ was stored in column-major order it would look like:
Address Contents
---------
0x0000 | 1 | \__ First column
0x0004 | 4 | /
0x0008 | 2 | \__ Second column
0x000C | 5 | /
0x0010 | 3 | \__ Third column
0x0014 | 6 | /
---------
A given programming language can be said to be either “row-major” or “column-major”. This refers to how the language chooses to lay out 2D arrays (matrices) in memory. For example, C is a well-known row-major language. That means if we explicitly define a 2D array, its contents will be laid out in row-major order. We can test this out by declaring a 2D array and then seeing what order the values are laid out in memory when we manually advance a pointer that starts at the first element:
// A 3x2 matrix; each row is an inner array
float a[3][2] = {
    {1.0, 2.0},
    {3.0, 4.0},
    {5.0, 6.0}
};

// Take the address of `a`, which is of type `float (*)[3][2]`, and (unsafely?) cast it to `float *`
float *aPtr = (float*) &a;

// Since C uses row-major order, the loop below outputs: `1.0 2.0 3.0 4.0 5.0 6.0 `
for (int i = 0; i < 3*2; i++) {
    printf("%.1f ", *aPtr);
    aPtr += 1;
}
However, you don’t need to declare a matrix as an explicit 2D array. You can declare it as a regular 1D array and then interpret it in either row-major or column-major order. In fact, doing it this way is more flexible, as the interpretation is then left up to the code and not the language.
For example, we could define the above $2 \times 3$ matrix as a single array `a` of length 6. We’d then access the element at $(i, j)$ using the indexing arithmetic `a[i * numCols + j]`. Here’s an example of that:
// NOTE: Usually we'd `malloc()` an array onto the heap dynamically and not define it statically.
// rows = 2, cols = 3
float a[2 * 3] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
float *aPtr = a; // The array decays to a pointer to its first element

for (int i = 0; i < 2; i++) {     // Iterate over rows
    for (int j = 0; j < 3; j++) { // Iterate over cols
        // Element at (i, j) is at index i * numCols + j
        printf("%.1f ", aPtr[i * 3 + j]);
    }
}
Why this matters
How the contents of the matrix are laid out in memory determines what an efficient access pattern is. Generally, when accessing data in memory, you want to access adjacent (or close) memory locations in sequence, and not jump back and forth across memory locations. The example matrix $A$ is small, but what if we had a matrix with a very large number of rows and columns?
In that case, the access pattern dictates which layout is more optimal. For example, if we are reading one row at a time, then row-major ordering makes sense, because adjacent memory addresses then store adjacent elements of the row. However, if we are reading one column at a time, then each element of the column is separated from the next by a number of elements equal to the length of a row (this is known as the stride). Reading a single column at a time from a row-major matrix would therefore be inefficient.
This has to do with how memory accesses are done in modern systems, which usually have a memory hierarchy [5]. For the purposes of this discussion, let’s assume we only have two levels: Main (system) memory, and a cache close to the CPU. In most systems, when we attempt to read an address from system memory, the CPU will first check if that address is already in the cache, since reading from there is much faster than reading from main memory.
If the address is not in the cache, it will trigger a read from main memory. But the read from main memory is not just for the given address, but instead for a “block” or range of memory addresses that covers the given address. That entire block is then copied into the CPU cache and read from there. For example, if we wanted to read the address 0x0003 and our block size was 16 bytes, this might result in the range 0x0000 - 0x000F inclusive being read and copied from main memory to the CPU cache.
Once this entire range is in the CPU cache, any subsequent accesses for addresses in that range will be very fast. This is why accessing contiguous data structures like arrays can be very fast - only the first (cold) access will be slow.
Converting to/from row/column-major order: The “transpose trick”
Suppose we have matrices in row-major order, and need to multiply them together, but the library [6] we’re using specifies that the input must be in column-major order. How can we do this without doing an expensive operation to rearrange/rewrite the underlying arrays in memory?
The first thing we can notice is that if we store a matrix in row-major order, and then interpret the stored data as column-major ordered, this results in a transpose of the matrix. Let’s look at an example of this, using the $2 \times 3$ matrix $A$ from before, which was defined as: $$ A = \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6 \end{bmatrix} $$ Storing this in row-major order yields the following memory layout: (As before)
Address Contents
---------
0x0000 | 1 | \
0x0004 | 2 | |-- First row
0x0008 | 3 | /
0x000C | 4 | \
0x0010 | 5 | |-- Second row
0x0014 | 6 | /
---------
We can now interpret this as a $3 \times 2$ matrix in column-major order:
Address Contents
---------
0x0000 | 1 | \
0x0004 | 2 | |-- First *column*
0x0008 | 3 | /
0x000C | 4 | \
0x0010 | 5 | |-- Second *column*
0x0014 | 6 | /
---------
This results in a matrix which is the transpose of $A$, or $A^T$: $$ A^T = \begin{bmatrix} 1 & 4\\ 2 & 5\\ 3 & 6 \end{bmatrix} $$ Note that we haven’t changed anything in memory: we’ve merely changed how we interpret the data stored there. This works the other way as well, so if you stored a matrix in column-major order and interpreted it as row-major order, it would also result in a transpose of the matrix.
The second thing we can take advantage of is the properties of the transpose operation when applied to matrix multiplication. In particular, if we have the matrix multiplication: $$ C = AB $$ Then the transpose of $C$ is equal to: $$ C^T = (AB)^T = B^T A^T $$ Putting this together: If we have $A$ and $B$ stored in row-major order, but need to perform a matrix multiplication $AB$ using a library that interprets inputs in column-major order, then we can:
- Pass in $B$ as the left side matrix. It will be interpreted as $B^T$.
- Pass in $A$ as the right side matrix. It will be interpreted as $A^T$.
- The result will be given as $C^T$ (column-major order), which when interpreted as row-major order will give the result $C$.
In this way, you avoid having to do any (potentially expensive) rewrite operations on the input matrices $A$ and $B$.
Footnotes
1. For the sake of brevity, I’m only using 16 bits for the memory address, when in most modern systems the addressable range is 64 bits. Also, I assume a single logical address space as seen by the program, i.e. virtual memory, and ignore the physical memory.
2. This article concerns dense matrices, where there are relatively few zero values. In sparse matrices the opposite is true, and there would be very few non-zero values. In such a case, there are more data-efficient formats that avoid storing every single value in memory, since most would be zeros.
3. Row-major and column-major orderings are analogous to row-oriented and column-oriented databases. Both concern how the data is laid out in memory or storage and whether values in a given row or a given column are “close” to one another.
4. A matrix can be considered a 2D tensor. For higher-dimensional tensors, the storage layout in memory would also be linearized, in a similar way. For more details about how this works in PyTorch, I recommend this article from Edward Z. Yang.
5. A full discussion of a computer’s memory hierarchy and caching techniques is outside the scope of this article, and I’ve intentionally discussed a simplified model.
6. The Nvidia cuBLAS library has many functions which operate on inputs in column-major order (apparently to maintain compatibility with Fortran!). If your matrices are in row-major order, you’ll have to be careful about how you invoke these functions.