Peter Chng

CPU vs. GPU for neural networks

GPUs are the predominant hardware on which neural networks (which I’ll refer to just as models or networks in this post) run today, both for training and inference. What makes them so much better than CPUs for this purpose?

In this post, we’ll first look at the characteristics of CPUs and GPUs, and then run some simple experiments to demonstrate where GPUs are better than CPUs.

CPU and GPU characteristics

CPUs and GPUs have different performance goals. In particular, CPUs tend to be optimized for low latency, while GPUs tend to be optimized for high throughput.1 What does this mean?

  • CPU: Low latency goal: Execute a sequence of instructions as fast as possible.
  • GPU: High throughput goal: Process the same set of instructions on as much data as possible.

CPUs achieve their low latency goal through design choices such as:

  • Faster clock speeds
  • Instruction-level parallelism techniques such as pipelining and superscalar execution to increase IPC (instructions per cycle)
  • Speculative execution techniques like branch prediction to complement pipelining

GPUs achieve their high throughput goal by:

  • Massive parallelism across thousands of cores (which are grouped into streaming multiprocessors or SMs)
  • High memory bandwidth to keep those cores fed
  • A large number of registers (on-chip) to allow for efficient execution

A simple analogy would be that the CPU is the one horse-sized duck and the GPU is the 100 duck-sized horses.

But it’s a little more complicated than that: For a less-imperfect analogy, consider a store where you have a single checkout counter. How can we process more people?2

  1. CPU approach: Make that single checkout counter faster by optimizing the checkout process: Automatic price scans using RFID, etc. This means each individual is processed faster.
  2. GPU approach: Add more checkout counters and have an efficient queue/lineup so that people get directed quickly to the next free checkout. Each individual is not processed faster, but many can be processed in parallel.

As a concrete example, let’s look at the specifications of the CPU and GPU in my system: (These are both high-end desktop components, but a similar comparison would apply for datacenter components)

  • CPU: AMD 7950X3D:
    • 16 cores
    • Clock Speed: 4.2 GHz/5.7 GHz base/boost
    • Uses DDR5 memory with 64 GB/s bandwidth
    • FP32 performance: 2.6 TFLOPS (benchmarked)
  • GPU: RTX 4090
    • 16,384 CUDA cores (across 128 SMs)
    • Clock Speed: 2.2 GHz/2.5 GHz base/boost
    • GDDR6X memory with 1.01 TB/s bandwidth
    • FP32 performance: 82.6 TFLOPS (rated)

The GPU absolutely crushes the CPU in terms of FLOPS (82.6 TFLOPS vs. 2.6 TFLOPS) and has roughly 1,000x as many cores as the CPU. Though each core is much less capable than a CPU core, this is more than made up for by the sheer number of them. Additionally, the memory bandwidth available to the GPU cores is much higher than the CPU’s.
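You can get a rough sense of this gap yourself by timing a large FP32 matrix multiply on each device. The sketch below is my own illustration (not from the notebook that follows); it assumes a CUDA-capable PyTorch install, and the numbers you measure will depend on clocks, BLAS libraries, and thermals, and won’t reach the rated peaks.

# Rough FP32 throughput check: time a large matrix multiply on CPU and GPU.
# (Illustrative sketch; won't hit rated peak TFLOPS.)
import time
import torch

N = 8192
flops = 2 * N**3  # an [N, N] @ [N, N] multiply costs 2*M*N*K = 2*N^3 FLOPs

for device in ["cpu", "cuda"]:
    a = torch.randn(N, N, device=device)
    b = torch.randn(N, N, device=device)
    torch.matmul(a, b)  # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{device}: {flops / elapsed / 1e12:.1f} TFLOPS")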

Knowing this, we can expect that:

  • Sequential work will perform better on CPU.
  • Work that is easily parallelizable will perform better on GPU.

Let’s run some experiments to confirm our understanding.

Experiments

The code used to generate these results is available as a Jupyter Notebook here.

In many types of large neural networks, including LLMs, the Feedforward or MLP layers tend to account for the majority of the FLOPs used3. These layers are essentially large matrix multiplications. With that in mind, we can create a toy model in PyTorch which is just a bunch of Feedforward layers that can be used to simulate a “compute intensive” neural network:

# Model Definition: A bunch of FeedForward layers with residual connections
import torch
import torch.nn as nn
class DumbBlock(nn.Module):

  def __init__(self, d_model):
    super().__init__()
    self.net = nn.Sequential(
      nn.Linear(d_model, d_model, bias=False),
      nn.ReLU(),
    )

  def forward(self, x):
    return self.net(x) + x # Residual connection

class DumbModel(nn.Module):

  def __init__(self, d_model, n_layers):
    super().__init__()
    self.blocks = nn.Sequential(*[DumbBlock(d_model) for _ in range(n_layers)])

  def forward(self, x):
    return self.blocks(x)

A few notes about the above network:

  • We don’t need to train the network, since we will be only measuring the inference (forward pass) runtime. The great thing about neural networks is that whether you train them or not, they will use the same amount of compute on the forward pass!
  • nn.Linear() is created with bias=False so there is only a matrix multiplication.
  • The residual connection was added to avoid the output decaying to zero when the network got very deep (i.e. n_layers being large), since I didn’t want to worry about a complicated weight initialization strategy.

When visualized using torchview, the model looks like this (using n_layers = 2):

[Figure: Toy model architecture]
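For reference, a diagram like this can be produced with torchview’s draw_graph function. The snippet below is a sketch of how I’d generate it; the exact call used for the figure may differ.

# Visualize the toy model with torchview (sketch; assumes torchview is installed).
from torchview import draw_graph

model = DumbModel(d_model=128, n_layers=2)
# draw_graph traces a forward pass to build the diagram; input_size here is
# (batch_size, d_model).
model_graph = draw_graph(model, input_size=(1, 128))
model_graph.visual_graph  # displays the rendered graph in a notebook cell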

We can now vary the following parameters and see how the inference runtime of the model compares on CPU vs. GPU:

  • d_model: Determines the size of the weight matrices in each layer; each layer has a nn.Linear layer of size (d_model, d_model), so the number of parameters in the model is proportional to the square of this.
  • n_layers: The number of layers in the model; the number of parameters in the model is linearly proportional to this.
  • batch_size: Not a model parameter, but a runtime parameter for how many independent inputs we’ll use the model to score at a time during inference.

We’ll use defaults of d_model = 128, n_layers = 128, batch_size = 1 if that parameter is not changing. The system I’ll be running this on is my own desktop, which consists of an AMD 7950X3D CPU and RTX 4090 GPU, whose specifications were given previously.
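Timing the forward pass follows the usual PyTorch benchmarking pattern; the sketch below is a simplified version (the function name and repeat count are mine, not necessarily what the notebook uses). The important detail is calling torch.cuda.synchronize() before stopping the clock, since CUDA kernels are launched asynchronously.

# Simplified inference-timing sketch (illustrative; not copied from the notebook).
import time
import torch

@torch.no_grad()
def time_inference(model, device, d_model, batch_size, n_repeats=10):
    model = model.to(device).eval()
    x = torch.randn(batch_size, d_model, device=device)
    model(x)  # warm-up (also triggers CUDA setup on the GPU)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_repeats):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_repeats

model = DumbModel(d_model=128, n_layers=128)
print(sum(p.numel() for p in model.parameters()))  # ~2.1M parameters at these defaults
for device in ["cpu", "cuda"]:
    print(device, time_inference(model, device, d_model=128, batch_size=1))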

Batch Size tests

For these tests, we’ll fix d_model = 128, n_layers = 128 and vary the batch_size of the input between [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048], and then measure the inference time:

[Figure: batch size tests, CPU and GPU]

For our small network (which at d_model = 128, n_layers = 128 has only 2.1M parameters), the model inference time on CPU is comparable to that on GPU up to a batch size of 64. After that, the inference time on CPU starts to grow. There are some slight deviations in this trend, most likely because the AMD 7950X3D has a varying base/boost clock speed, and I didn’t pin the clock speed for these tests. (Note that the x-axis is logarithmic; when plotted on a linear scale, the growth of CPU inference time appeared to be sub-linear.)

By contrast, the GPU inference time stays pretty constant as the batch size increases. This indicates two things:

  • The GPU is able to leverage its massive parallelism to handle the larger batch sizes without an increased runtime. (Increased throughput through parallelism)
  • We haven’t fully saturated the GPU even with a batch size of 2048 on this toy model of 2.1M parameters.

The ability of GPUs to handle large batch sizes is one reason why GPUs can benefit training before they benefit inference: At training time, it’s possible to increase the batch size, but at inference time your batch size might be limited by other factors, e.g. how many requests you can batch together while staying within a latency constraint.

d_model size tests

For these tests, we’ll fix n_layers = 128, batch_size = 1 and vary d_model between [64, 128, 256, 512, 1024, 2048, 4096]. This results in the model size varying between 530K and 2.1B parameters, as the model parameter count grows with the square of d_model.

[Figure: d_model tests]

Up to d_model = 512, the CPU and GPU inference times are roughly equal; for this toy model, that corresponds to a model size of 33.6M parameters. Beyond that, the CPU inference time appears to grow linearly with d_model. As with the batch size tests, the GPU inference time stays relatively constant throughout the range of d_model values tested.

In this test, we again see the GPU benefiting from its parallelism: As d_model increases, the size (number of elements) of the weight matrices in our nn.Linear layers grows quadratically. Since matrix multiplication is, at least conceptually, easily parallelizable4, the GPU can put its many cores to work on the increasing FLOPs required by the larger matrix multiplies.

By contrast, the CPU cores appear to be saturated: as d_model increases the size of the matrix multiplies, the CPU inference time grows with it. As in the batch size tests, this does not appear to happen on the GPU, where the inference time stays relatively constant. (Though we could be bottlenecked by something else, like reading the weights from global memory to on-chip memory.)
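To see why matrix multiplication parallelizes so naturally, note that each output element is an independent dot product, so in principle all of them can be computed at the same time. The naive version below is purely illustrative (real BLAS/cuBLAS kernels use tiling, vectorization, and shared memory rather than anything this simple):

# Naive matmul: C[i, j] depends only on row i of A and column j of B, so every
# (i, j) output element is independent work a GPU can spread across its cores.
import torch

def naive_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = torch.empty(M, N)
    for i in range(M):
        for j in range(N):
            C[i, j] = torch.dot(A[i, :], B[:, j])
    return C

A, B = torch.randn(4, 3), torch.randn(3, 5)
assert torch.allclose(naive_matmul(A, B), A @ B, atol=1e-5)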

n_layers tests

For these tests, we’ll fix d_model = 128, batch_size = 1 and vary n_layers from [128, ... 3840] in increments of 128. (That is, vary the depth of the network). Note that the upper bound here (3840) results in a ridiculously-deep network which probably has no practical analog. In a brief survey of the literature, the deepest network I could find of practical significance had 825 layers.

[Figure: n_layer tests]

As expected, inference time for both CPU and GPU increases as the number of layers increases. But interestingly, at least at d_model = 128 and batch_size = 1, the GPU is no faster than the CPU, and in fact is slower regardless of the number of layers.

This is because adding more layers represents more sequential work: The output of one layer is the input to the next, so there is a data dependency that prevents straightforward parallelization. As a result, the GPU’s resources cannot easily be brought to bear on an extremely deep network; its extra cores go to waste, since there isn’t enough parallelizable work to realize an advantage over the CPU.

Fix the model parameter count and trade off between d_model and n_layers

In this set of experiments, we’ll fix batch_size = 1 and set the number of model parameters to $2^{31}$, or about 2.1B parameters. The idea here is to fix the amount of FLOPs (compute) in a forward pass, but vary how many layers those FLOPs are spread across.

As an example to illustrate this point, suppose we have a single weight matrix of size $W = [N, N]$, so that the total number of parameters is $N^2$. Then multiplying $x = [1, N]$ with it requires $2 \cdot N \cdot N = 2N^2$ FLOPs.5 If we instead distribute the $N^2$ parameters equally over two matrices, each has dimensions $[\sqrt{N^2/2}, \sqrt{N^2/2}]$, i.e. $N^2/2$ parameters per matrix. Each matrix multiply then takes $2 \cdot N^2/2 = N^2$ FLOPs, so the combined FLOPs over the two layers is still $2N^2$.
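A few lines of arithmetic confirm this in general: for a fixed parameter budget P split evenly across L square layers, each layer is a [sqrt(P/L), sqrt(P/L)] matrix, and the total forward-pass FLOPs per input works out to 2*P regardless of L. (Quick illustrative check, not part of the notebook.)

# For a fixed parameter budget P split evenly over L square layers,
# total forward-pass FLOPs per input stays at 2 * P regardless of L.
P = 2**31  # ~2.1B parameters
for L in [1, 2, 128, 3840]:
    d = (P / L) ** 0.5               # each layer is a [d, d] weight matrix
    flops_per_layer = 2 * d * d      # [1, d] @ [d, d] costs 2*d^2 FLOPs
    print(L, (L * flops_per_layer) / (2 * P))  # always 1.0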

[Figure: fixed parameter count tests]

As you can see from the x-labels, this test again results in some ridiculously deep networks, but bear with me. Since we fixed the parameter count of the model, and since the model is basically a bunch of matrix multiplies, the total amount of compute is fixed; what is being varied is the number of layers that compute is spread across, i.e. how sequential the work is.

As expected, the less sequential the work (lower values of n_layers), the faster the inference time. At all but the highest layer counts tested, the GPU is faster than the CPU. This indicates that if the model’s parameter count (from “compute dense” layers like nn.Linear) is high enough, and the compute isn’t divided across a ridiculous number of layers, the GPU’s parallelism can likely help.

Notes on optimizations

I did not compile the toy PyTorch model used for testing here, nor did I do any other optimizations for either the CPU or GPU case. However, the main source of compute here is going to be the matrix multiply in the nn.Linear layer, and I believe that is well optimized. For the CPU, in some ad-hoc tests I ran, matrix multiplication in PyTorch was about as fast as in numpy, and numpy’s implementation is very efficient. (Numpy on my system uses OpenBLAS, while PyTorch appears to use MKL for its BLAS implementation). These implementations are optimized for the specific hardware (CPU architecture) they are running on.

The situation is similar on the GPU, as a ton of effort has gone into making matrix multiplication efficient there as well via cuBLAS. Profile a PyTorch program with ncu and you’ll find kernels optimized for the specific GPU architecture, e.g. ampere_sgemm_32x32_sliced1x4_tn. Compiling may help the GPU marginally here, as it may enable “kernel fusion”, where multiple operations are brought together into a single GPU kernel to reduce reads/writes to global memory.
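If you do want to try compilation, torch.compile is the standard entry point in PyTorch 2.x; a minimal sketch is below. Any gains on this toy model would mostly come from fusing the elementwise ops (ReLU, residual add) around the matmuls, and would need to be measured.

# Optional: compile the model so PyTorch can fuse the elementwise ops
# around the cuBLAS matmuls into fewer kernels. Gains here are likely marginal.
model = DumbModel(d_model=128, n_layers=128).to("cuda")
compiled_model = torch.compile(model)
with torch.no_grad():
    compiled_model(torch.randn(1, 128, device="cuda"))  # first call triggers compilation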

Conclusion

The mental model you should have is that, compared to a CPU, a GPU has a massive number of cores that lend themselves to high throughput on parallelizable work. In other words, the GPU is capable of far more FLOPS than a CPU, provided the work can be parallelized. What this means:

  1. You cannot expect a model that runs reasonably well on CPU to run even faster on GPU: If the model runs well on CPU, it probably is not very “compute dense”, and hence there may not be enough work that can be parallelized to bring down the runtime on a GPU (without additional changes).
  2. GPUs will help during training before they help during inference: During training, we can usually use much higher batch sizes than at inference. A higher batch size represents parallelizable work that a GPU will happily eat up to increase your throughput. If you cannot realize such a high batch size during inference (e.g. because latency constraints prevent batching many requests together), then the gains you see at inference may be reduced or even eliminated, depending on the circumstances. (Additionally, training involves a backward pass, which takes roughly 2x the compute of the forward pass, and also requires additional memory to store intermediate activation values.)
  3. Beyond a certain parameter size, GPUs are likely to help for inference: This is more of a rule of thumb, but as parameter counts increase, this tends to imply growth in the size of the weight matrices in “compute dense” layers like linear layers. This sort of computation can be easily parallelized and made to run efficiently on GPUs, hence the requirement of GPU inference for models like LLMs.
  4. GPUs do not help for sequential work: See the results of CPU vs. GPU varying n_layers. Thankfully there is a lot of parallelizable work in most practical models, and most practical models do not have thousands of sequential layers.

  1. I’ve greatly simplified the discussion of CPU vs. GPU performance here for brevity, and also because it has been covered in great detail in articles like this one from Abhinav Upadhyay. In reality, CPUs implement some techniques similar to GPUs, like SIMD/vector instructions, though their effectiveness is dwarfed by the sheer number of cores a GPU has. ↩︎

  2. This sort of “optimize for latency vs. optimize for throughput” exists in several other places in CS. For example, an online transaction processing system desires low-latency, while an offline batch processing system desires high-throughput. Similarly, different types of garbage collection algorithms can prioritize latency (keep pause durations to a minimum) or throughput (just get the dead objects cleaned up as fast as possible). ↩︎

  3. See this excellent article by Adam Casson, which summarizes several language model scaling law papers, and has graphs toward the end showing how FLOPs are distributed across different parts of a Transformer as the number of parameters in the model increases. ↩︎

  4. Matrix multiplication is “easily” parallelizable since each element in the output matrix depends only on a row of the left input matrix and a column of the right input matrix; it doesn’t depend on any other elements in the output. Having said that, optimizing matrix multiplication for a specific architecture, like CPU or GPU, is a very involved process, as Simon Boehm’s articles on the topic will indicate. Thankfully, PyTorch uses BLAS implementations for CPU and GPU that are already very efficient. ↩︎

  5. More generally, a matrix multiply of shapes [M, K] @ [K, N] = [M, N] takes 2*M*N*K FLOPs. This is because there are M*N elements in the output matrix, and each requires a dot product between two vectors of length K. Each of the K multiply-add steps in that dot product counts as 2 FLOPs (a multiply and an add), so each output element costs 2*K FLOPs, and we arrive at 2*M*N*K FLOPs for the entire matrix multiply. ↩︎