CUDA
Introduction
Scientific simulations can often be significantly accelerated by hardware accelerators such as Graphics Processing Units (GPUs). GPUs are available on several HPC nodes. The GPUs currently available are NVIDIA GeForce GTX 1080 Ti cards, which belong to the Pascal micro-architecture and have compute capability 6.1. The installed CUDA driver version is 10.1. The following table shows the key parameters of the GPUs at the RCC:
Brand Name | GTX 1080 Ti
Compute Capability | 6.1
Micro-Architecture | Pascal
Number of Streaming Multiprocessors | 28
Number of CUDA Cores | 3584
Boost Clock | 1600 MHz
Memory Capacity | 11 GB
Memory Bandwidth | ~484 GB/s
FP32 Performance | ~11.4 TFLOPS
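On a GPU node, you can also inspect the installed cards directly with the nvidia-smi utility that ships with the NVIDIA driver. The commands below are a quick check; the exact output depends on the driver version:
$ nvidia-smi
$ nvidia-smi --query-gpu=name,memory.total --format=csv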
Compile CUDA Code
To compile CUDA/C/C++ code, first load the cuda module:
$ module load cuda/10.1
The CUDA compiler nvcc should then be available:
$ which nvcc
/usr/local/cuda-10.1/bin/nvcc
and you can check the CUDA version via
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168
You can then compile your CUDA/C/C++ code with nvcc:
$ nvcc -O3 -arch sm_61 -o a.out a.cu
In the command above, the compiler option "-arch sm_61" specifies compute capability 6.1, which corresponds to the Pascal micro-architecture.
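If you do not yet have CUDA code of your own, the following minimal a.cu is an illustrative vector-add sketch (not site-provided code) that can be compiled with the nvcc command above:
#include <stdio.h>
#include <cuda_runtime.h>

// Element-wise vector addition: c[i] = a[i] + b[i]
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Managed (unified) memory keeps the example short; it is supported on Pascal GPUs
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
Running the resulting a.out on a GPU node should print c[0] = 3.000000 (expected 3.0).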
Submit a CUDA Job
To submit a GPU job to the HPC cluster, first create a SLURM submit script sub.sh similar to the following
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -J "cuda-job"
#SBATCH -t 4:00:00
#SBATCH -p backfill
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL
# load the cuda module to set up the environment
module load cuda/10.1
# the following line should provide the full path to the cuda compiler
which nvcc
# execute your cuda executable a.out
srun -n 1 ./a.out <input.dat >output.txt
Not all compute nodes have GPU cards, and a GPU node contains up to 4 of them. To request a compute node with GPUs, add the following line to your submit script:
#SBATCH --gres=gpu:<n>   # <-- replace <n> with the number of GPU cards to use (1-4)
Then submit the job via
$ sbatch sub.sh
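After submission, the job can be monitored with the standard SLURM commands, for example:
$ squeue -u $USER            # list your pending and running jobs
$ scontrol show job <jobid>  # detailed information for a specific job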
CUDA Sample Code
The following CUDA code example, deviceQuery.cu, can help new users get familiar with the GPUs available on the HPC cluster:
#include <stdio.h>
#include <cuda_runtime.h>
int main() {
    int dev = 0;
    cudaDeviceProp prop;

    // Query the properties of GPU device 0
    cudaGetDeviceProperties(&prop, dev);

    printf("device id %d, name %s\n", dev, prop.name);
    printf("number of multi-processors = %d\n",
           prop.multiProcessorCount);
    printf("Total constant memory: %4.2f KB\n",
           prop.totalConstMem / 1024.0);
    printf("Shared memory per block: %4.2f KB\n",
           prop.sharedMemPerBlock / 1024.0);
    printf("Total registers per block: %d\n",
           prop.regsPerBlock);
    printf("Maximum threads per block: %d\n",
           prop.maxThreadsPerBlock);
    printf("Maximum threads per multi-processor: %d\n",
           prop.maxThreadsPerMultiProcessor);
    // A warp holds 32 threads, so divide by 32 to get warps per multi-processor
    printf("Maximum number of warps per multi-processor %d\n",
           prop.maxThreadsPerMultiProcessor / 32);
    return 0;
}
Compile the code via
$ module load cuda
$ nvcc -o deviceQuery deviceQuery.cu
Upon a successful run, the output will be similar to the following:
device id 0, name GeForce GTX 1080 Ti
number of multi-processors = 28
Total constant memory: 64.00 KB
Shared memory per block: 48.00 KB
Total registers per block: 65536
Maximum threads per block: 1024
Maximum threads per multi-processor: 2048
Maximum number of warps per multi-processor 64
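Since a GPU node may host up to 4 cards (see the submit-script section above), a slightly extended sketch (illustrative, not site-provided) enumerates all visible devices with cudaGetDeviceCount:
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    // Number of CUDA devices visible to this process
    cudaGetDeviceCount(&count);
    printf("number of visible CUDA devices: %d\n", count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, compute capability %d.%d, %.2f GB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
Note that when a job is submitted with --gres=gpu:<n>, SLURM typically restricts the devices visible to the job, so the reported count reflects the GPUs allocated to your job rather than all cards on the node.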