Compiling NVIDIA GPU Programs

The Anvil cluster contains GPU nodes that support CUDA and OpenCL. See the detailed hardware overview for the specifics on the GPUs in Anvil. This section focuses on using CUDA.

A simple CUDA program has a basic workflow:

  • Initialize an array on the host (CPU).
  • Copy the array from host memory to GPU memory.
  • Apply an operation to the array on the GPU.
  • Copy the array from GPU memory back to host memory.
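The four steps above can be sketched in a few lines of CUDA. This is a minimal illustration, not the sample program distributed with this guide: the kernel name scale, the helper run_workflow, and the array size are all hypothetical choices made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for this sketch: doubles each element in place.
__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

// Runs the four workflow steps on an n-element array.
// Returns the number of incorrect elements, or -1 on a CUDA error.
int run_workflow(int n) {
    size_t bytes = n * sizeof(float);

    // 1. Initialize an array on the host (CPU).
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // 2. Copy the array from host memory to GPU memory.
    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { delete[] h; return -1; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // 3. Apply an operation to the array on the GPU.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, n);

    // 4. Copy the array from GPU memory back to host memory.
    //    This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    if (err != cudaSuccess) {
        bad = -1;
    } else {
        for (int i = 0; i < n; ++i)
            if (h[i] != 2.0f * (float)i) ++bad;
    }
    delete[] h;
    return bad;
}

int main() {
    int bad = run_workflow(1 << 20);
    printf("incorrect elements: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The host-side check after step 4 is there only to confirm that the data made the round trip; a real program would use the result instead.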

The sample CUDA program compiled later in this section, gpu_hello.cu, follows this workflow.

ModuleTree (modtree) helps users navigate between the CPU stack and the GPU stack and sets up a default compiler and MPI environment. For the Anvil cluster, our team recommends a specific combination of CUDA version, compiler, and MPI library. This combination has proven stable and is recommended if you have no specific requirements. To load the recommended set:

$ module load modtree/gpu
$ module list
# the following modules are now loaded
Currently Loaded Modules:
  1) gcc/8.4.1   2) numactl/2.0.14   3) zlib/1.2.11   4) openmpi/4.0.6   5) cuda/11.2.2   6) modtree/gpu

Both login and GPU-enabled compute nodes have the CUDA tools and libraries available to compile CUDA programs. For complex compilations, submit an interactive job to get to the GPU-enabled compute nodes. The gpu-debug queue is ideal for this case. To compile a CUDA program, load modtree/gpu, and use nvcc to compile the program:

$ module load modtree/gpu
$ nvcc gpu_hello.cu -o gpu_hello
$ ./gpu_hello
No GPU specified, using first GPUhello, world

This example only illustrates how to copy an array between a CPU and its GPU; it does not perform any serious computation.

The following program times three square matrix multiplications: one on a CPU, one using the global memory of a GPU, and one using the shared memory of a GPU:

$ module load modtree/gpu
$ nvcc mm.cu -o mm
$ ./mm 0
                                                            speedup
                                                            -------
Elapsed time in CPU:                    7810.1 milliseconds
Elapsed time in GPU (global memory):      19.8 milliseconds  393.9
Elapsed time in GPU (shared memory):       9.2 milliseconds  846.8
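To show what separates the two GPU timings, here is a minimal sketch of a global-memory kernel next to a tiled shared-memory kernel. It is not the mm.cu program timed above: the names mm_global, mm_shared, compare_kernels, and the tile width TILE are assumptions made for this example, and the matrix size is assumed to be a multiple of TILE.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // tile width; this sketch assumes n is a multiple of TILE

// Global-memory version: every operand is read directly from global memory.
__global__ void mm_global(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
}

// Shared-memory version: each block stages TILE x TILE sub-matrices of A and B
// in shared memory, so each global element is loaded once per tile, not once per thread.
__global__ void mm_shared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k) sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tile is overwritten
    }
    C[row * n + col] = sum;
}

// Runs both kernels on the same n x n inputs and returns the largest absolute
// difference between their results, or -1 on a CUDA error.
float compare_kernels(int n) {
    size_t count = (size_t)n * n, bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hG(count), hS(count);
    for (size_t i = 0; i < count; ++i) {  // small integer values keep the sums exact
        hA[i] = (float)(i % 7);
        hB[i] = (float)(i % 5);
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess || cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return -1.0f;
    }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    mm_global<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t e1 = cudaMemcpy(hG.data(), dC, bytes, cudaMemcpyDeviceToHost);
    mm_shared<<<grid, block>>>(dA, dB, dC, n);
    cudaError_t e2 = cudaMemcpy(hS.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1.0f;

    float maxdiff = 0.0f;
    for (size_t i = 0; i < count; ++i) maxdiff = fmaxf(maxdiff, fabsf(hG[i] - hS[i]));
    return maxdiff;
}

int main() {
    float diff = compare_kernels(256);
    printf("max difference between kernels: %g\n", diff);
    return diff == 0.0f ? 0 : 1;
}
```

The shared-memory kernel reduces global memory traffic by roughly a factor of TILE, which is the usual reason a tiled version runs faster. The actual speedup depends on the GPU and the matrix size.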

For best performance, the input array or matrix must be sufficiently large to overcome the overhead in copying the input and output data to and from the GPU.
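One way to gauge this overhead for your own code is to time the transfers and the kernel separately with CUDA events. The sketch below does so for a trivial kernel; add_one and time_roundtrip are hypothetical names introduced for this example, and the sizes in main are arbitrary.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for this sketch: adds 1 to each element.
__global__ void add_one(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

// Times the two copies and the kernel for an n-element array.
// Writes the combined copy time and the kernel time in milliseconds.
// Returns false on a CUDA error or a wrong result.
bool time_roundtrip(int n, float *copy_ms, float *kernel_ms) {
    std::vector<float> h(n, 1.0f);
    size_t bytes = n * sizeof(float);
    float *d = nullptr;
    // The allocation also absorbs one-time CUDA context setup before timing starts.
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e1);
    add_one<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(e2);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);  // make sure all recorded events have completed

    float h2d = 0.0f, d2h = 0.0f;
    cudaEventElapsedTime(&h2d, e0, e1);
    cudaEventElapsedTime(kernel_ms, e1, e2);
    cudaEventElapsedTime(&d2h, e2, e3);
    *copy_ms = h2d + d2h;

    cudaEventDestroy(e0); cudaEventDestroy(e1);
    cudaEventDestroy(e2); cudaEventDestroy(e3);
    cudaFree(d);
    return cudaGetLastError() == cudaSuccess && h[0] == 2.0f;
}

int main() {
    for (int n = 1 << 10; n <= 1 << 24; n <<= 7) {
        float copy_ms = 0.0f, kernel_ms = 0.0f;
        if (!time_roundtrip(n, &copy_ms, &kernel_ms)) return 1;
        printf("n = %9d  copies: %8.3f ms  kernel: %8.3f ms\n", n, copy_ms, kernel_ms);
    }
    return 0;
}
```

A kernel this light does too little work per element to repay the copies at any size; the point of the sketch is the measurement pattern, which you can apply to a heavier kernel to find the input size at which the GPU starts to pay off.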

For more information about NVIDIA, CUDA, and GPUs:
