From Montana Tech High Performance Computing
CUDA is NVIDIA's parallel computing platform and programming model, which lets programs run at higher performance on Graphics Processing Units (GPUs). CUDA can currently be used from C, C++, C#, Fortran, Java, and Python.
CUDA version 8.0 is installed in /opt/cuda-8.0/, and version 9.0 is installed in /opt/cuda-9.0/.
CUDA compiler: nvcc
CUDA file extension: .cu
CUDA binaries and libraries are installed in /opt/CUDA. To set the environment for using CUDA, use the module command:
module load cuda/8.0
or simply:
module load cuda
Example of compiling a CUDA file
Please edit and compile your code on the management node, and execute the program on a GPU node.
To simply compile a CUDA file:
nvcc -arch=sm_35 filename.cu
This generates the default executable, "a.out", in the current working directory.
-arch=sm_35 selects the GPU architecture to compile for; sm_35 corresponds to compute capability 3.5, that of the Tesla K20 GPUs.
nvcc -arch=sm_35 -O3 filename.cu -o outCUDA
This applies level 3 optimization to the host (serial) part of the code and generates an executable named "outCUDA".
GPU/CUDA Tesla K20 Architecture
Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
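The limits above bound how many thread blocks can be resident on one streaming multiprocessor (SM) at a time. The following is a minimal sketch of that arithmetic in plain C++; the function names are hypothetical, and it assumes registers and shared memory are not the limiting factor, which real kernels may violate.

```cpp
#include <algorithm>
#include <cassert>

// Limits for compute capability 3.5 (Tesla K20), taken from the list above.
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxThreadsPerSM    = 2048;
constexpr int kMaxBlocksPerSM     = 16;

// Upper bound on the number of blocks resident on one SM for a given block size.
// Assumption: only the three limits above matter (no register or shared-memory pressure).
int residentBlocksPerSM(int threadsPerBlock) {
    assert(threadsPerBlock >= 1 && threadsPerBlock <= kMaxThreadsPerBlock);
    return std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / threadsPerBlock);
}

// Number of resident threads per SM implied by that block count.
int residentThreadsPerSM(int threadsPerBlock) {
    return residentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}
```

Under these assumptions, a block size of 256 allows 8 blocks (2048 threads) per SM, while a block size of 64 hits the 16-block limit first and reaches only 1024 threads per SM.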
CUDA C example program
The simple vector addition sample program, vectorAdd.cu, located in /opt/cuda-8.0/samples/0_Simple/vectorAdd/, is one of the official CUDA samples shipped with the CUDA Toolkit. It randomly generates two vectors of type float and uses the GPU to compute their sum. Finally, the GPU result is compared with the CPU result to verify that it is correct.
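To illustrate the flow just described without needing a GPU, here is a rough CPU-only emulation in plain C++. It is not the sample's source code: every function name is hypothetical, the nested loops stand in for the GPU's parallel threads, and the parameters blockIndex, blockSize, and threadIndex play the roles of CUDA's blockIdx.x, blockDim.x, and threadIdx.x. The 1e-5 tolerance is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// CPU stand-in for the GPU kernel: one call plays the role of one GPU thread.
void vectorAddThread(const float* a, const float* b, float* c, int n,
                     int blockIndex, int blockSize, int threadIndex) {
    int i = blockSize * blockIndex + threadIndex;  // global element index
    if (i < n) {                                   // surplus threads do nothing
        c[i] = a[i] + b[i];
    }
}

// Emulates launching `blocks` blocks of `threadsPerBlock` threads, one after another.
void emulateLaunch(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, int blocks, int threadsPerBlock) {
    assert(a.size() == b.size() && a.size() == c.size());
    const int n = static_cast<int>(a.size());
    for (int blk = 0; blk < blocks; ++blk)
        for (int t = 0; t < threadsPerBlock; ++t)
            vectorAddThread(a.data(), b.data(), c.data(), n, blk, threadsPerBlock, t);
}

// Generates two random float vectors, adds them with the emulated launch, and
// checks every element against a direct CPU sum. Returns true if all agree.
bool runVectorAddCheck(int n, int threadsPerBlock) {
    std::vector<float> a(n), b(n), c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i] = std::rand() / static_cast<float>(RAND_MAX);
        b[i] = std::rand() / static_cast<float>(RAND_MAX);
    }
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    emulateLaunch(a, b, c, blocks, threadsPerBlock);
    for (int i = 0; i < n; ++i)
        if (std::fabs(a[i] + b[i] - c[i]) > 1e-5f) return false;
    return true;
}
```

Calling runVectorAddCheck(50000, 256) exercises the same problem size as the sample. On a GPU, the body of vectorAddThread would be the kernel, and the loops in emulateLaunch would be replaced by a single kernel launch.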
More sample programs can be found at
Compile the code
module load cuda
nvcc -arch=sm_35 vectorAdd.cu
(Copy the program to your home directory first, or give the full path to vectorAdd.cu.)
The above command will create an executable file named ‘a.out’. Alternatively, you may specify your executable filename:
nvcc -arch=sm_35 vectorAdd.cu -o outCUDA
To run the CUDA program, you need to request a GPU node. A sample batch file for requesting the GPU node and running the above sample program is provided:
#PBS -l nodes=1:ppn=1
#PBS -l feature=gpunode
#PBS -N GPUJob
#PBS -l walltime=00:05:00
When the job runs successfully, the sample program prints:
- [Vector addition of 50000 elements]
- Copy input data from the host memory to the CUDA device
- CUDA kernel launch with 196 blocks of 256 threads
- Copy output data from the CUDA device to the host memory
- Test PASSED
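The figure of 196 blocks in the output follows from rounding the element count up to a whole number of 256-thread blocks. A short sketch of that arithmetic in plain C++ is given below; the struct and function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

struct LaunchConfig {
    int blocks;        // number of thread blocks in the grid
    int totalThreads;  // blocks * threadsPerBlock
    int idleThreads;   // threads whose index falls past the last element
};

// Computes the smallest grid of fixed-size blocks that covers numElements.
LaunchConfig launchConfigFor(int numElements, int threadsPerBlock) {
    assert(numElements >= 0 && threadsPerBlock > 0);
    LaunchConfig cfg;
    // Integer ceiling division: round up so the final partial block is included.
    cfg.blocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    cfg.totalThreads = cfg.blocks * threadsPerBlock;
    cfg.idleThreads = cfg.totalThreads - numElements;
    return cfg;
}
```

For 50000 elements and 256 threads per block this gives 196 blocks and 50176 threads, of which 176 have no element to process. This is why a kernel of this kind needs a bounds check on the element index.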
Running CUDA programs
Once you have compiled your CUDA code, you will need to run it on one of the GPU nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.