Peter Chng

Profiling CUDA programs on WSL 2

I’m a big fan of running WSL on my Windows 11 PC. I like having a Linux environment for development while still retaining a native Windows experience for gaming; I think it’s a perfect setup for a hobbyist. However, sometimes the WSL abstraction doesn’t behave exactly as a true native Linux environment would, and one example I ran into recently was profiling CUDA programs running on a WSL 2 Linux distro. Here’s the solution I arrived at.

System Setup

These are the prerequisites that will allow you to compile and run CUDA programs in your WSL 2 environment:

  1. Install (or ensure you are running) the latest Nvidia drivers for your GPU/video card. This ensures that the drivers support WSL 2. (Download Link)
  2. Install the Linux distro of your choice with WSL 2. (Guide)
    • Following that guide will download whichever distro you choose, and install it on your PC. You will then be able to open a shell into the running instance using Windows Terminal.
    • I chose Ubuntu 22.04.3 LTS, and I recommend you choose this one as well.
    • WSL 2 gives you access to the host (Windows) file system via the path /mnt/c (e.g. for your C:\ drive).

At this point, you should be able to run CUDA programs (e.g. those already compiled elsewhere) and you should be able to run nvidia-smi to confirm the CUDA version the GPU driver supports.

The reason is that the Nvidia driver installed in (1) already exposes a stubbed driver inside your WSL distro, making GPU functionality available within the Linux environment. This is related to the CUDA development model:

  • CUDA programs aren’t compiled directly to GPU machine code. Instead, they are compiled to an assembly-like intermediate representation called PTX.
  • The GPU driver is responsible for translating these PTX instructions into the actual executable binary code.
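To see this model in action, you can ask nvcc to stop at the PTX stage and inspect the intermediate representation directly. (This assumes you have already installed the CUDA Toolkit as described below; `add.cu` is a placeholder for any CUDA source file of yours.)

```shell
# Compile device code only as far as PTX, without producing a binary.
nvcc -ptx add.cu -o add.ptx

# PTX is human-readable text; at load time the GPU driver translates
# it into machine code for your specific GPU.
head -n 20 add.ptx
```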

So if all you want to do is run PyTorch code (which already contains pre-compiled CUDA code), then you are good to go now: set up Python, install PyTorch for Linux and your CUDA version, set up a virtualenv, install JupyterLab, etc., and start playing around with models from HuggingFace.
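If you go this route, a quick sanity check that PyTorch can see the GPU through WSL 2 is a one-liner (assuming PyTorch is already installed in your environment):

```shell
# Prints "True" if PyTorch's bundled CUDA runtime can talk to the driver stub
python -c "import torch; print(torch.cuda.is_available())"
```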

However, if you want to write and profile CUDA programs, there is additional work to do, outlined in the next sections.

Compiling CUDA programs on WSL 2

You will need to install the CUDA Toolkit in your WSL 2 distro. This will give you access to, among other things, the CUDA compiler (nvcc). The official instructions from Nvidia cover this, but are unfortunately a bit long-winded. Here is what I did:

  1. Verify that CUDA programs can run correctly. If you have installed PyTorch, this can be done by running python -m torch.utils.collect_env | grep CUDA.
    • You should see Is CUDA available: True
  2. Follow the instructions at the CUDA Toolkit download page. (Download link)
    • This is the most important step, as it installs a WSL-specific version of the CUDA toolkit that excludes the Linux GPU driver. The driver must be excluded because it would otherwise overwrite the special “stubbed” version already made available in WSL 2 by the Windows GPU driver.
    • As of 2024-03-02, the instructions were: (But please check the link for the current instructions)
    wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
    sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
    wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda-repo-wsl-ubuntu-12-3-local_12.3.2-1_amd64.deb
    sudo dpkg -i cuda-repo-wsl-ubuntu-12-3-local_12.3.2-1_amd64.deb
    sudo cp /var/cuda-repo-wsl-ubuntu-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-12-3
    
  3. You will need to update your $PATH (so that nvcc and other binaries are visible) and also set some environment variables. Here’s what worked for me after reading various Stack Overflow answers: (Ref 1, Ref 2, Ref 3)
    • If you’re using bash or similar:
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
    export PATH=$PATH:$CUDA_HOME/bin
    
    • If you’re using fish (as I am):
    set -Ux CUDA_HOME /usr/local/cuda
    set -Ux LD_LIBRARY_PATH $CUDA_HOME/lib64
    fish_add_path $CUDA_HOME/bin
    
  4. Verify things work by running: nvcc --version

You should now be able to compile CUDA programs using nvcc. Since this isn’t a tutorial on CUDA programming, I’ll skip over that here; if you’d like to learn more, the references at the end are a good starting point.
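For completeness, here is a minimal sketch of the kind of program the rest of this post profiles. The file and kernel names are made up, but the structure (allocate, launch, synchronize) is the standard one:

```cuda
// minimal_add.cu - a minimal CUDA program sketch (hypothetical file name)
#include <cstdio>

// Each thread adds one pair of elements
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Unified memory keeps the example short: accessible from host and device
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]); // expect 3.0

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile and run it with `nvcc minimal_add.cu -o minimal_add` followed by `./minimal_add`.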

Profiling CUDA programs on WSL 2

I’m not an expert on this, but here’s what I’ve found works for me. Before we get started, let’s cover the tools:

  1. Nsight Systems: This is used to profile an application’s overall performance. If a program is running slowly, the bottleneck might be something other than the CUDA kernels, such as CPU work or host-device memory transfers. Use this profiler first to determine where the overall issue lies. Hence, this profiler is general purpose, and NOT focused on CUDA kernel performance. (Documentation)
    • The associated CLI is nsys (typically invoked as nsys profile), which was already installed when you installed the CUDA Toolkit above.
  2. Nsight Compute: This is used to profile CUDA kernels. It will allow you to measure the efficiency of your CUDA kernels by reporting, among other things, metrics like effective compute utilization, and effective memory bandwidth utilization.
    • The associated CLI is ncu, and this was already installed when you installed the CUDA Toolkit above.
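To make the distinction concrete, a typical Nsight Systems run looks like this (the program name is a placeholder):

```shell
# Record an application-wide timeline: CPU activity, CUDA API calls,
# kernel launches, and memory transfers
nsys profile -o my_report ./matmul

# Print summary statistics from the recorded trace in the terminal
nsys stats my_report.nsys-rep
```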

We are going to focus on (2) here. The official documentation for the Nsight Compute CLI (ncu) covers how to use it, but there’s a bit of extra work to get it to function on WSL 2.

Getting ncu to work on WSL 2

When you first try to run ncu on your WSL 2 Linux distro, you may get an error like this:

ncu ./parallel_add
==PROF== Connected to process 19698 (/mnt/c/Files/development/git/c-learning/cuda/parallel_add)
Size 1234567099, Thread Blocks 1205632, Threads Per Block 1024
==ERROR== An error was reported by the driver
==ERROR== Profiling failed because a driver resource was unavailable or the user does not have permission to access NVIDIA GPU Performance Counters. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. For instructions on enabling permissions, see https://developer.nvidia.com/ERR_NVGPUCTRPERM. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.

The fix is hinted at in the error message, but it’s not immediately obvious. One of the links gives the solution; here are the exact steps to resolve this permission issue, which involve changing a setting in Windows.

  1. In Windows, open NVIDIA Control Panel
  2. In the menu, enable Desktop -> Enable Developer Settings. This will cause a “Developer” section to appear on the left side.
  3. On the left side, select Developer -> Manage GPU Performance Counters, then select “Allow access to the GPU performance counter to all users”

This will give permission to ncu to carry out operations needed for profiling.

Profiling workflow

Here’s my workflow for using ncu to profile CUDA programs:

  1. Install Nsight Compute on your Windows host machine. (Download)
    • This will give you a UI to visualize the reports generated by the CLI.
  2. By default ncu outputs the profiling results in text to standard out.
    • You can use the -o option to create a report file which can then be opened by the UI: ncu -o <report file> <CUDA program>. Example:
    # Creates a report file `./profiling/matmul.ncu-rep` from running the program `matmul`
    ncu -o ./profiling/matmul matmul
    
  3. The documentation goes over different modes of operation, and how to tell ncu which data/statistics to collect.
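For example, here are a couple of ncu invocations I’ve found useful (the program and kernel names are placeholders; check `ncu --help` for the full list of options):

```shell
# Collect the full metric set instead of the default basic one
ncu --set full -o ./profiling/matmul ./matmul

# Profile only kernels whose name matches "matmul", and only their first launch
ncu -k matmul --launch-count 1 ./matmul
```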

Here’s an example of the output from an ncu report file when viewed in the UI (from my exercise of writing a CUDA kernel to do matrix multiplication):

[Image: Nsight Compute UI]

Aside: Legacy Nvidia tools

CUDA has been around for a long time. It was originally released back in 2007. Since then, the ecosystem has evolved and changed many times, and the associated profiling tools have consequently changed. Here are some legacy tools you should not waste your time on, assuming you are running the latest CUDA Toolkit version and have a recent GPU/device supporting a recent Compute Capability version:

  1. nvprof: This is an older profiler that doesn’t work with GPUs/devices with compute capability 8.0 and higher. This was replaced by nsys and ncu.
  2. Nvidia Visual Profiler/NVVP: Replaced by Nsight Systems/Nsight Compute

References

  1. CUDA on WSL User Guide
  2. Cuda-installation-on-WSL2-Ubuntu-20.04-and-Windows11
  3. NVIDIA Profiling Tools
  4. Best CUDA profiler
  5. Get ncu to work with WSL 2