CUDA: An Introduction

CUDA – NVIDIA’s GPU architecture and programming API for harnessing the power of their massively parallel graphics processors – has been a de facto go-to for many performance-oriented developers since it was first released in 2006.



Today, GPUs are available in a wide variety of Compute Unified Device Architecture (CUDA) core counts and memory configurations – which together determine power consumption and compute performance – scaling from NVIDIA’s 165W, 2,048-core Maxwell GM204 all the way down to the 10W, 192-core CUDA-capable Tegra K1 ARM SoC, with many GPUs in between (Figure 1).


Figure 1
The NVIDIA Tegra K1 mobile processor is rapidly becoming the basis of rugged embedded computing solutions.

Many of the high-end Tesla cards are designed into HPC datacenters, powering the largest supercomputers; Quadro cards are found in professional workstations; and the GeForce cards are used heavily by gamers and consumers alike in desktops and laptops.  All are capable of running the same CUDA code.

Easily Scalable

Embedded systems – traditionally sensitive to size, weight, and power – have also capitalized on the performance and productivity gains of CUDA, with ruggedized NVIDIA modules available for deployment in harsh environments.  Applications built on the shared CUDA architecture can easily be scaled across platforms by selecting the right GPU for the deployment at hand; CUDA itself handles the scaling, automatically distributing an algorithm across however many GPU cores are available.

CUDA’s threading model provides an efficient way to process vast datasets in parallel, independent of the underlying GPU hardware.  The result is increased runtime performance and power efficiency over traditional serial CPUs, all the while remaining in a pure software environment (as opposed to firmware or ASIC) and reaping the productivity benefits of accelerated application development.

Applications in the fields of imaging, signal processing, and scientific computing are prime targets for GPU acceleration and have been the focus of many CUDA developers.  Users are able to execute a multidimensional grid of threads over a dataset. For example, when processing an image, one lightweight CUDA thread is typically launched for each pixel in the image.  Each thread runs a user-provided function, called a kernel, that is responsible for processing an element of the dataset (in this case, one pixel from the image).  Generally this entails reading the data element from GPU global memory (typically GDDR5 or DDR3), performing any initial calculations, communicating with neighboring data elements via on-chip shared memory, and storing any results back to global memory.
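As a concrete sketch of this thread-per-pixel pattern, consider a grayscale conversion (the kernel name, buffer layout, and RGBA channel order here are our own illustration, not from the article’s listings):

```cuda
#include <cuda_runtime.h>

// One thread per pixel: convert an RGBA image to grayscale.
// img and out are hypothetical buffers residing in GPU global memory.
__global__ void grayscale_kernel(const uchar4* img, unsigned char* out,
                                 int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    if (x >= width || y >= height) return;          // guard partial blocks

    uchar4 p = img[y * width + x];                  // read from global memory
    out[y * width + x] =                            // store result back
        (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
}
```

A 2-D grid of blocks would be launched so that the grid covers every pixel, with each thread guarding against running off the image edge.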

Fast Switching

What really makes CUDA fast is that the GPU cores are able to switch between the lightweight threads very quickly.  This means that when cores stall on memory accesses, they can quickly switch to different threads and continue making progress where they can, instead of wasting cycles waiting on memory or other hazards.  GPUs have large on-chip register files, allowing them to keep many thousands of threads in flight simultaneously. This provides a large pool of work from which to schedule without needing to go out to global memory. Serial processors, by contrast, traditionally incur large overhead when switching between relatively heavyweight, monolithic CPU threads.

Let’s take a look at the “hello world” of CUDA kernels – vector multiply/add.  A * B + C is performed element-wise over each sample, with the result stored in A.  First let us define the kernel function, which runs on the GPU and performs the vector operations (Source Listing 1).
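The exact listing may differ, but a minimal version of such a kernel – assuming single-precision float vectors and a 1-D launch – might look like this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Element-wise A = A * B + C.  One lightweight thread handles one sample.
__global__ void vma_kernel(float* A, const float* B, const float* C, int n)
{
    // Compute this thread's global index within the work grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against running past the vector
        A[i] = A[i] * B[i] + C[i];
}
```

Note the bounds check: the grid is usually rounded up to a whole number of blocks, so the last block may contain threads with nothing to do.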

CUDA kernels are written in a language very close to C/C++, with a few keywords and extensions added by NVIDIA.  For example, the __global__ qualifier enables the function to be invoked as a kernel that runs on the GPU. The blockDim, blockIdx, and threadIdx variables are built-in registers that contain the value of the current thread’s ID within the work grid.  Traditional facilities like printf are available to aid in development and debugging. The source listing is part of a .CU source file, which is similar to a .C or .CPP source file except that CUDA source files can contain mixed CPU/GPU code. Our vma_kernel has three arguments: the pointers A, B, and C reference global memory buffers on the GPU that will be allocated in the following code section, which runs on the CPU and invokes our CUDA kernel (Source Listing 2).

The main() function from the listing is the typical C/C++ application entry point.  First, it allocates GPU global memory using cudaMalloc(), a function provided by the CUDA runtime.  The runtime executes on the CPU, providing APIs for GPU management and appending operations to the GPU’s work queue.  After allocating GPU memory with cudaMalloc(), a grid of thread blocks is launched using the <<< … >>> construct, which invokes a CUDA kernel.  The arguments that we call the kernel with from the CPU are transferred to the GPU and appear as the kernel’s parameters (in this case, the pointers A, B, and C).  Source Listing 2 is appended to the same .CU file as Source Listing 1.  It contains mixed GPU/CPU code and will be compiled by NVIDIA’s NVCC compiler, part of the CUDA Toolkit.
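A sketch of what this host-side section might contain follows (the vector length and block size are our own choices; the code assumes it is appended to the same .CU file as the kernel above):

```cuda
int main()
{
    const int n = 1 << 20;                 // assumed number of samples
    const size_t bytes = n * sizeof(float);
    float *A, *B, *C;

    // Allocate GPU global memory for the three vectors.
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);

    // Launch a 1-D grid with 256 threads per block, rounded up to cover n.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    vma_kernel<<<blocks, threads>>>(A, B, C, n);

    cudaDeviceSynchronize();               // wait for the kernel to finish

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```

The kernel launch is asynchronous: control returns to the CPU immediately, so cudaDeviceSynchronize() is used to wait for completion before the buffers are freed.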

The Tool Chain

The CUDA Toolkit, freely available for Windows and Linux (including ARM-based Linux4Tegra), provides the tools, headers, libraries, and documentation necessary for development with CUDA.  Included are the compiler (NVCC), debugger (cuda-gdb), profiler (nvprof), and Nsight IDE plugins for Eclipse or Visual Studio, along with numerous samples that highlight the features of CUDA.  To compile our hello world example, run the command in Source Listing 3 after installing the CUDA Toolkit. It will compile our source file, containing the code from listings 1 & 2, into an executable called vma.out (or vma.exe on Windows).
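Assuming the source file is named vma.cu (a hypothetical name; the article does not specify one), the compile command would look something like:

```shell
nvcc vma.cu -o vma.out
```

NVCC splits the file, sends the GPU portions through its own compiler, hands the CPU portions to the host C/C++ compiler, and links the result into a single executable.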

NVCC automatically adds the CUDA Toolkit’s /include directory as an include path (-I), which contains the header referenced by the #include <cuda_runtime.h> statement from listing 1.  These headers define the CUDA extensions and constructs used by the GPU kernels.  They also declare the CPU-side runtime APIs that provide calls like cudaMalloc() and cudaMemcpy(), and the ability to launch kernels.  Shared libraries provide the backing implementation for the runtime APIs, in addition to many useful application libraries like cuFFT, cuBLAS, and NPP (NVIDIA Performance Primitives).

One great aspect of working with CUDA is that there are many existing libraries already available that are implemented with CUDA under the covers.  In fact, you can easily make GPU-accelerated applications without writing any CUDA kernels at all but instead calling into libraries like cuFFT, which in turn launch the CUDA kernels for you.

Let’s take a look at an example of using cuFFT to execute 1024-point FFTs over 2^20 samples.  cuFFT has an API reminiscent of FFTW and provides forward and inverse FFTs in multiple dimensions, real and complex transforms, and single and double precision. This sample performs real-to-complex forward 1D FFTs (Source Listing 4). To compile the FFT example, run the command line in Source Listing 5.
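A minimal sketch of such a cuFFT program might look like the following (buffer initialization is omitted, and the batch arithmetic is our own reading of “1024-point FFTs over 2^20 samples”):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int fft_size = 1024;
    const int batch    = (1 << 20) / fft_size;   // 2^20 samples total

    cufftReal*    input;    // real-valued input samples on the GPU
    cufftComplex* output;   // complex frequency-domain output

    cudaMalloc(&input,  sizeof(cufftReal)    * fft_size * batch);
    // A real-to-complex FFT produces fft_size/2 + 1 complex bins.
    cudaMalloc(&output, sizeof(cufftComplex) * (fft_size / 2 + 1) * batch);

    // Plan a batch of 1024-point real-to-complex 1-D FFTs.
    cufftHandle plan;
    cufftPlan1d(&plan, fft_size, CUFFT_R2C, batch);

    cufftExecR2C(plan, input, output);   // queue the transforms on the GPU
    cudaDeviceSynchronize();             // wait for completion

    cufftDestroy(plan);
    cudaFree(input);
    cudaFree(output);
    return 0;
}
```

Compiling requires linking against the cuFFT shared library – with a hypothetical filename fft.cu, something like `nvcc fft.cu -o fft.out -lcufft`.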

There are many useful libraries freely available for CUDA in addition to cuFFT, like the cuBLAS linear algebra library, which provides CUDA matrix multiplies, linear solvers, and more.  NPP provides hundreds of image processing functions.  NVIDIA’s new cuDNN library implements deep neural network primitives for GPU-accelerated machine learning.  You can do a lot before ever having to write your own parallelized CUDA kernels, although a little application-specific optimization never hurts.
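For instance, a single-precision matrix multiply via cuBLAS requires no user-written kernels at all (the wrapper function below is our own illustration, operating on square matrices already resident in GPU memory):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = A * B for n x n matrices in GPU global memory.
void gemm_example(const float* A, const float* B, float* C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // cuBLAS uses column-major storage, like the original Fortran BLAS.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
}
```

The library chooses the kernel launch configuration internally, tuned for the GPU it finds at runtime.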

Streaming Memory

Our examples so far have had a shortcoming for real-world use. Although we allocated GPU global memory before launching CUDA kernels, we never initialized the memory with data (from sensors or disk, for example).  Likewise, we never transferred the results of our CUDA computations back off the GPU.  There are multiple ways to efficiently stream memory to and from GPUs, depending on the kind of device that the data is coming from or going to.  Performing the memory I/O asynchronously is important for creating a continuously streaming CUDA processing pipeline, especially if the desired application has realtime characteristics.

The user should allocate pinned memory from system RAM using the cudaHostAlloc() API. One can also use cudaHostRegister() to pin existing memory that was previously allocated with malloc(), for example.  Users can then queue DMA transfers between the host CPU and GPU using the cudaMemcpyAsync() API.

Using pinned memory is important because pinned memory can be transferred to or from the GPU asynchronously with cudaMemcpyAsync() – i.e., the CPU returns immediately after posting the memcpy operation to the GPU’s work queue.  If you were to pass unpinned memory instead, the transfer would fall back to synchronous behavior and block the CPU until it had completed (Source Listing 7).
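A sketch of the pinned-memory streaming pattern follows (buffer size and contents are placeholders; the asynchronous copies require the pinned allocation shown):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float* host_buf;
    float* dev_buf;

    // Allocate pinned (page-locked) system RAM, then GPU global memory.
    cudaHostAlloc(&host_buf, bytes, cudaHostAllocDefault);
    cudaMalloc(&dev_buf, bytes);

    // ... fill host_buf with sensor or file data here ...

    // Because host_buf is pinned, these copies post asynchronously:
    // the CPU returns immediately after queuing the DMA.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that process dev_buf ...
    cudaMemcpyAsync(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();   // wait for all outstanding work

    cudaFreeHost(host_buf);
    cudaFree(dev_buf);
    return 0;
}
```

Note that pinned memory is freed with cudaFreeHost(), not free().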

ZeroCopy on Tegra K1

TK1’s integrated GPU and CPU physically share the same memory, so on TK1 we should use a CUDA feature called ZeroCopy to eliminate unnecessary copies between CPU and GPU – after all, both access the very same memory.  If we pass the cudaHostAllocMapped flag to cudaHostAlloc(), the allocation is eligible to be mapped into GPU global memory space, which is done with the cudaHostGetDevicePointer() function.  The CPU and GPU then each have their own pointer to the same shared memory (Source Listing 8).
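The ZeroCopy pattern might be sketched as follows (buffer size is a placeholder; kernel launches using the device pointer are elided):

```cuda
#include <cuda_runtime.h>

int main()
{
    // Enable mapped pinned allocations; must precede other CUDA calls.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t bytes = 1 << 20;
    float* host_ptr;
    float* dev_ptr;

    // Allocate mapped, pinned memory visible to both CPU and GPU.
    cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);

    // Retrieve the GPU-side address of the same physical memory.
    cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

    // The CPU reads/writes through host_ptr; kernels use dev_ptr.
    // On Tegra K1 no copy ever occurs – both point at the same DRAM.

    cudaFreeHost(host_ptr);
    return 0;
}
```

On discrete GPUs the same API works, but accesses through the mapped pointer travel over PCI Express, so ZeroCopy pays off most on integrated parts like TK1.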

GPUDirect – RDMA over PCI Express

It’s often the case that data comes in from an external source – for example, a video capture card or a network interface that speaks Ethernet or InfiniBand.  Many of these devices are PCI Express peripherals that can provide I/O tailored to the application.  Remote Direct Memory Access (RDMA) is a latency- and bandwidth-saving technique for transporting memory across devices with low overhead.  Using a CUDA feature called GPUDirect, third-party devices can stream data directly to or from the GPU.  Before GPUDirect, multiple copies were required: third-party devices first had to copy their data to system RAM, and the methods outlined above were then used to get the data into GPU memory.  Now memory can be streamed directly to the GPU, allowing for low-overhead intercommunication, reduced CPU usage, and low-latency asynchronous CUDA applications.

GPUDirect allows GPUs to exist in a mixed heterogeneous environment alongside FPGAs and other third-party devices such as network interfaces, solid-state drives, and storage RAIDs.  Many GPU-accelerated systems are deployed with customized FPGA interfaces on the front end, which acquire data in application-specific mediums and RDMA it to the GPU for the heavy floating-point math.  The results are then typically RDMA’d to other compute nodes over a network fabric like 10GbE or InfiniBand.  Following this blueprint, where GPUDirect is used to interconnect processor nodes, results in scalable system architectures.

Rendering to the Display with OpenGL or Direct3D

Many CUDA-accelerated applications, because of their rich multimedia nature, need to render video or visualizations of some kind.  If an NVIDIA GPU is connected to the display, CUDA applications can use a shortcut – OpenGL/Direct3D interoperability – to decrease the overhead and latency of transferring memory from CUDA global memory to OpenGL/Direct3D buffers or textures.  If the GPU doing the rendering is different from the one running CUDA, the interoperability layer automatically copies the buffer GPU↔GPU. If the same GPU is both driving the display and simultaneously running CUDA, the memory remains on-GPU and no external transfer is necessary.  Without the interoperability layer, CUDA memory would need to be copied back to CPU system RAM before being re-uploaded to the OpenGL/Direct3D buffer or texture.  The benefits are very similar to how GPUDirect RDMA avoids unnecessary copies to or from system RAM.
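On the OpenGL side, the interoperability flow might be sketched like this (the pbo argument is a hypothetical pixel buffer object created by the application’s rendering code; a GL context must be current):

```cuda
#include <cuda_gl_interop.h>

void process_frame(GLuint pbo)
{
    // Register the GL buffer with CUDA (in practice, register once
    // at startup rather than every frame).
    cudaGraphicsResource* resource;
    cudaGraphicsGLRegisterBuffer(&resource, pbo,
                                 cudaGraphicsRegisterFlagsNone);

    // Map the buffer and obtain a CUDA-visible device pointer.
    cudaGraphicsMapResources(1, &resource, 0);
    void*  dev_ptr;
    size_t size;
    cudaGraphicsResourceGetMappedPointer(&dev_ptr, &size, resource);

    // ... launch CUDA kernels that write pixels into dev_ptr ...

    // Unmap so OpenGL can draw from pbo – the data never leaves the GPU.
    cudaGraphicsUnmapResources(1, &resource, 0);
    cudaGraphicsUnregisterResource(resource);
}
```

While a resource is mapped, it belongs to CUDA; it must be unmapped before OpenGL touches it again.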

CUDA at Work and Home

Real-world processing pipelines can be built by chaining together multiple CUDA kernels (whether implemented yourself or invoked via a library like cuFFT or cuBLAS) along with the memory streaming operations for your particular dataflow.
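Such a chained pipeline is often expressed with a CUDA stream, which keeps the stages ordered while the CPU stays free (the stage kernels and buffers below are hypothetical placeholders):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Upload, two processing stages, and download are queued on one stream,
// so they execute in order without blocking the CPU between steps.
cudaMemcpyAsync(dev_in, host_in, bytes, cudaMemcpyHostToDevice, stream);
stage1_kernel<<<blocks, threads, 0, stream>>>(dev_in,  dev_tmp, n);
stage2_kernel<<<blocks, threads, 0, stream>>>(dev_tmp, dev_out, n);
cudaMemcpyAsync(host_out, dev_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // wait for the whole pipeline to drain
cudaStreamDestroy(stream);
```

Multiple streams can also be used to overlap transfers for one frame of data with computation on the previous frame.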

Due to convenient developer tools, parallelized libraries, and leading out-of-the-box performance, applications can be developed more quickly with CUDA than with other technologies that attempt similar compute density but require significant resources to program. CUDA’s flexible software programmability results in rapid application development and reduced project timelines. Each new GPU architecture launched by NVIDIA delivers a steady march of performance improvements, many of which existing CUDA code takes advantage of automatically, without requiring any updates.  Embedded devices and systems utilizing CUDA benefit not only from the ample compute horsepower at their disposal, but also from the shortened development cycles and constant infusion of new features.

CUDA gives everyone access to the vast possibilities offered by the processing power of GPUs, and it has a very low barrier to entry.  The CUDA Toolkit, NVIDIA drivers, and an ecosystem of CUDA libraries are all provided free by NVIDIA, with new and updated versions available for download every couple of months from their website. What’s more, millions of lines of open-source CUDA code are freely available online, covering everything from gene sequencing and protein folding to machine learning and computer vision.  Anybody with an NVIDIA GPU from the last eight years in their desktop or laptop can run CUDA-accelerated applications.  By using CUDA, anyone can build high-performance systems and applications that leverage the efficiency of GPUs.

What will you build today with CUDA?


General Electric Intelligent Platforms
Charlottesville, VA
(780) 401-7700