From Montana Tech High Performance Computing

CUDA is NVIDIA's parallel computing platform and programming model for running code at higher performance on Graphics Processing Units (GPUs). CUDA currently supports C, C++, C#, Fortran, Java, and Python.

CUDA version 8.0 is installed in /opt/cuda-8.0/, and version 9.0 is installed in /opt/cuda-9.0/.

CUDA compiler: nvcc

CUDA file extension: .cu

CUDA Environment

CUDA binaries and libraries are installed in /opt/CUDA. To set the environment for using CUDA, use the module command:

module load cuda/8.0 (or simply module load cuda)

Example of compiling a CUDA file

Please do your editing and compiling on the management node, and execute the program on a GPU node. To simply compile a CUDA file:

nvcc -arch=sm_35 (followed by the name of your .cu source file)

This generates a standard "a.out" executable in the current working directory. The -arch=sm_35 option sets the target GPU architecture (compute capability 3.5). Alternatively:

nvcc -arch=sm_35 -O3 -o outCUDA (followed by the name of your .cu source file)

This applies level 3 optimization to the serial (host) part of the code and names the executable "outCUDA".
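If you do not yet have a .cu file to try the compiler on, a minimal sketch such as the following can serve as a test input. The file name hello.cu and the names fillIndex and sumOfIndices are made up for illustration and are not part of any installed sample.

```cuda
// hello.cu (hypothetical file name): a minimal program for trying out nvcc.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread writes its own global index into the output array.
__global__ void fillIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i;
    }
}

// Fills an array of n ints on the GPU and returns 0 + 1 + ... + (n - 1),
// or -1 if an allocation or CUDA call fails (for example, no GPU present).
long sumOfIndices(int n)
{
    int *hOut = (int *)malloc(n * sizeof(int));
    int *dOut = NULL;
    long sum = -1;

    if (hOut != NULL && cudaMalloc((void **)&dOut, n * sizeof(int)) == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up
        fillIndex<<<blocks, threads>>>(dOut, n);

        // cudaGetLastError reports launch failures; cudaMemcpy waits for the kernel.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess) {
            sum = 0;
            for (int i = 0; i < n; ++i) {
                sum += hOut[i];
            }
        }
    }
    cudaFree(dOut);
    free(hOut);
    return sum;
}

int main()
{
    long sum = sumOfIndices(1000);
    if (sum < 0) {
        printf("CUDA call failed (is this a GPU node?)\n");
        return 1;
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```

Compiled with nvcc -arch=sm_35 hello.cu and run on a GPU node, this should print sum = 499500. On a node without a GPU it reports the failure instead.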

GPU/CUDA Tesla K20 Architecture

Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
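Most of these limits can also be read at run time through the CUDA runtime API. The sketch below assumes the GPU of interest is device 0. The helper name printDeviceLimits is made up for illustration. The per-SM thread block limit is not printed because the cudaDeviceProp structure in the CUDA versions installed here does not appear to expose it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints selected limits of device `dev` (assumed to be 0 below).
// Returns 0 on success and 1 if the properties cannot be queried.
int printDeviceLimits(int dev)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device name:           %s\n", prop.name);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}

int main()
{
    return printDeviceLimits(0);
}
```

The compute capability reported here is the value to match in the -arch=sm_XX option (3.5 corresponds to sm_35).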

CUDA C-example program

The simple vector addition sample program located in /opt/cuda-8.0/samples/0_Simple/vectorAdd/ is one of the official CUDA samples shipped with the CUDA Toolkit. It randomly generates two vectors of type float and uses the GPU to compute their element-wise sum. At the end, the GPU result is compared with the CPU result to verify that it is correct.
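The flow described above can be sketched in condensed form as follows. This is not the shipped sample, only an illustration of the same steps (generate inputs, copy to the GPU, launch the kernel, copy back, compare with the CPU). The helper name addAndVerify and the 1e-5 tolerance are assumptions made for this sketch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Returns 0 if the GPU result matches the CPU result, 1 on a mismatch,
// and -1 if an allocation or CUDA call fails.
int addAndVerify(int n)
{
    size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    int status = -1;

    if (hA && hB && hC &&
        cudaMalloc((void **)&dA, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dB, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dC, bytes) == cudaSuccess) {
        // Random inputs in [0, 1].
        for (int i = 0; i < n; ++i) {
            hA[i] = rand() / (float)RAND_MAX;
            hB[i] = rand() / (float)RAND_MAX;
        }
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up
        vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);

        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            status = 0;
            // Compare every GPU element with the sum computed on the CPU.
            for (int i = 0; i < n; ++i) {
                if (fabs(hA[i] + hB[i] - hC[i]) > 1e-5) {
                    status = 1;
                    break;
                }
            }
        }
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return status;
}

int main()
{
    int status = addAndVerify(50000);
    printf(status == 0 ? "GPU and CPU results match\n"
                       : "Verification failed or CUDA error\n");
    return status == 0 ? 0 : 1;
}
```

The bounds check if (i < n) in the kernel matters because the number of launched threads is rounded up to a whole number of blocks and can exceed the vector length.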

More sample programs can be found at /opt/cuda-8.0/samples/

Compile the code

module load cuda
nvcc -arch=sm_35 (copy the program to your home directory or give the full path)

The above command will create an executable file named ‘a.out’. Alternatively, you may specify your executable filename:

nvcc -arch=sm_35 -o outCUDA

To run the CUDA program, you need to request a GPU node. A sample batch file for requesting the GPU node and running the above sample program is provided:

#PBS -l nodes=1:ppn=1
#PBS -l feature=gpunode
#PBS -l walltime=00:05:00


Expected output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
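The block count in this output follows from dividing the number of elements by the threads per block and rounding up. A small sketch of that arithmetic (the function name blocksNeeded is made up for illustration):

```cuda
#include <cstdio>

// Smallest number of blocks such that blocks * threadsPerBlock >= numElements.
int blocksNeeded(int numElements, int threadsPerBlock)
{
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    int n = 50000;
    int threads = 256;
    int blocks = blocksNeeded(n, threads);
    // prints: 196 blocks of 256 threads cover 50000 elements (176 threads idle)
    printf("%d blocks of %d threads cover %d elements (%d threads idle)\n",
           blocks, threads, n, blocks * threads - n);
    return 0;
}
```

With 50000 elements and 256 threads per block, 195 blocks would cover only 49920 elements, so 196 blocks are launched and the last 176 threads do no work.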

Running CUDA programs

Once you have compiled your CUDA code, you will need to run it on one of the GPU nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.