August 29, 2024

The Naive Roofline Model in Performance Modeling

The Naive Roofline Model is a way to determine a theoretical upper bound on the performance of an algorithm based on the hardware it’s running on. In this case, “performance” means throughput in terms of operations per second (like FLOPs/second) and “hardware” means an execution unit coupled with a data source (like CPU and main/system memory)

This is called a naive model because it relies on surprisingly few details about the hardware, which are:

$P_{peak}$: Theoretical peak performance of the execution unit (like a processor) in operations per second. Usually the operations are FLOPs, so the units are FLOPs/second. ¹
$b_s$: Theoretical peak bandwidth between the data source and the execution unit (in bytes/second).
- This is the sustainable maximum bandwidth of the slowest data path used (e.g. usually from main memory to CPU, or memory bandwidth)

This simplified view of the hardware can be represented like so:

In other words, in order for an execution unit to do work on data, that data must be transported over a data path with a maximum bandwidth of $b_s$ to the execution unit, which can then perform work on that data at a maximum rate of $P_{peak}$ operations per second.

We also need to know a few details about the algorithm’s implementation, namely its arithmetic intensity: (Also known as computational intensity). This is defined as the ratio between operations performed per byte read of data:

$$ I = {W\over{D}} $$ Where:

$W$ is the number of operations (e.g. FLOPs) done, i.e. the work or compute done
$D$ is the number of bytes transferred (read and written) from the data source to the execution unit in order to do the work.
$I$ is the arithmetic intensity, or the amount of FLOPs done per byte transferred

The arithmetic intensity measures how much compute you do per byte accessed:

A “high intensity” algorithm does a lot of FLOPs per byte accessed
A “low intensity” algorithm doesn’t do many FLOPs per byte accessed

Here’s a simple example illustrating how you’d calculate the arithmetic intensity of a simple loop used to compute the dot product between two vectors:

float dot_product(float *a, float *b, int n) {
  float result = 0.0;
  for (int i = 0; i < n; i++) {
    result += a[i] * b[i];
  }
  return result;
}

Assumptions:

The array size n, loop counter i and result will get stored in registers or L1d cache, making their access times negligible.
n is sufficiently large such that the elements of a and b cannot all be cached and thus iterating over the loop results in all elements of a and b being loaded from main memory or our slowest data source.

Let’s determine the arithmetic intensity of this code under the above assumptions. During each loop iteration:

Work/Compute: There are 2 floating-point operations, a multiply and an add.²
Data Transfer: Two float values are read, one from a and one from b; this results in 2 floats x 4 bytes/float = 8 bytes. Since the intensity is work/data, this is an arithmetic intensity of: $$ {2 \text{ FLOPs}\over{8 \text{ bytes}}} = 0.25 {{\text{ FLOPs}\over{ \text{ byte}}}} $$

Finding the bottleneck

In the execution of an algorithm, like our code above, the roofline model assumes that the bottleneck will be either one of the following: (Both of which are measured in operations/second like FLOPs/second)

The execution of work: Determined by: $P_{peak}$
The data path (memory): Determined by: $I \cdot b_s$ (Computational intensity of the algorithm multiplied by memory bandwidth)

Whichever one results in the lowest value will be the ultimate limit; expressed more succinctly: $$ P = min(P_{peak}, I \cdot b_s) $$ When we graph $P$ as a function of $I$ (because we assume that $P_{peak}$ and $b_s$ are fixed and determined by the hardware) it looks like this:

The shape of this graph looks a bit like the outline of a roof, hence the name, “roofline model”: As intensity $I$ increases, the performance goes up at a rate of $b_s$ until we hit $P_{peak}$, and then it is capped at that afterward.

This results in two distinct regions of the graph:³

Low-intensity: When $I \cdot b_s$ is less than $P_{peak}$, then we are memory-bound. This is the “sloped” part of the graph.
- $P$ is limited by data transfer
- Not all the available FLOPS are used; FLOPS “waiting” on data.
High-intensity: When $I \cdot b_s$ is greater than $P_{peak}$ then we are compute-bound. This is the “flat” part of the graph.
- $P$ is limited by execution/compute
- Not all of the bandwidth is used; data “waiting” on FLOPs.

This value of $P$ we obtain from the naive roofline model is the upper limit on how fast we can make an algorithm (defined by its arithmetic intensity $I$) run on hardware with the given specifications of $P_{peak}$ and $b_s$.

What the roofline model tells us about performance

Assume that our objective is to get an algorithm running as fast as possible on our hardware, that is, come as close as possible to $P_{peak}$. If we are in the memory-bound region that means our observed performance/throughput $P$ is less than $P_{peak}$ and there is room for improvement.

The only way this can be done is by increasing $I$ or the arithmetic intensity of the algorithm’s implementation. The usual way this is done is by being more careful about how data is read to reduce accesses to main memory by altering the implementation to better utilize caches. This increases $I$ and moves us rightward along the roofline graph towards $P_{peak}$.

The particular details involved in such an optimization can be quite involved, and usually depend on the specifics of the hardware, the algorithm being implemented, and the input data characteristics. A great example is this article on CPU matrix multiplication optimization. The same author also wrote an equally (if not more) impressive article on CUDA matrix multiplication. The key takeaways for me were:

Even if an algorithm is conceptually simple (like matrix multiplication), an efficient implementation that comes as close as possible to $P_{peak}$ can be very involved.
An efficient implementation requires in-depth knowledge of how the hardware works, both in terms of its instruction set and how that maps to memory operations.
Even for the same hardware, an efficient implementation might differ based on the input sizes - you can actually see this with the cuBLAS implementations of SGEMM - at runtime, they will pick different kernels (implementations) based on the matrix dimensions.

Lastly, this article isn’t strictly about GPU performance, but I found this Nvidia article on “Understanding Performance” to also cover the main points of the Naive Roofline Model, even though it doesn’t call it out by name.

Usually $P_{peak}$ is obtained from reading the specifications for a processor, or if those are unavailable, running a synthetic workload which only accesses data from the fastest possible source, e.g. the L1d cache. ↩︎
The indexing into a and b using i would also incur an integer operation as we’d be adding i to some base pointer. However, I’ve ignored these operations for the sake of this simple example, and also because these operations aren’t intrinsic to the dot product operation but rather a result of how it’s implemented in the code here. ↩︎
There is also an “optimal” point on the graph, which is the “corner” at which $P_{peak} = I \cdot b_s$. This is optimal in the sense that an algorithm operating at this point would make full use of the hardware’s compute and memory bandwidth. But often you’ll only care about hitting $P_{peak}$ and less about how much memory bandwidth you’re utilizing to get there. ↩︎