Today’s High-Powered GPUs—Strong for Graphics and for Math


Powerful multicore CPUs increasingly integrate high-performance graphics coprocessors on the same die, with fast, low-latency connections. The parallelism of these coprocessors makes them capable of numeric calculations that rival DSP cores.


Today, to build advanced embedded processing systems for applications such as image signal processing and computer vision, defense system designers use multicore processors, high-capacity memory chips, wide-bandwidth fabrics, and open software developed for commercial markets. More recently, system designers have increasingly turned to the graphics processing units (GPUs) integrated within a new generation of Intel x86 architecture chips.

To understand the advantages that an integrated GPU offers for aerospace and defense applications, one needs to understand how GPUs function. At a high level, they lend themselves to two broad classes of processing, both of which are needed by defense applications:

(1) Rendering images for display

(2) Accelerated floating point math operations

What is involved in rendering an image?

An image on a flat screen monitor can be derived from a 2D or 3D computer model. Complex objects are usually built from very simple shapes, such as triangles, that serve as fundamental graphic building blocks. These objects are then immersed in a scene that defines the overall lighting, surface textures, and point of view. One of the most popular techniques for immersing the objects within a given scene involves object shading followed by rasterization, which geometrically projects the 3D objects and the scene onto a 2D image plane, i.e. the monitor.
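
The projection step described above can be sketched very simply. The following is a minimal illustration of perspective projection of triangle vertices onto a 2D image plane; the focal length and vertex coordinates are illustrative assumptions, not values from the text.

```python
# Minimal sketch of rasterization's geometric core: perspective projection
# of 3D triangle vertices onto a 2D image plane at z = focal_length.
# Vertex values and focal length are illustrative assumptions.

def project(vertex, focal_length=1.0):
    """Project a 3D point (x, y, z) in camera space onto the image plane."""
    x, y, z = vertex
    if z <= 0:
        raise ValueError("vertex must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

# A triangle -- the fundamental graphic building block -- in camera space.
triangle = [(0.0, 1.0, 2.0), (-1.0, -1.0, 2.0), (1.0, -1.0, 4.0)]
projected = [project(v) for v in triangle]
print(projected)  # vertices farther from the camera land nearer the center
```

A real rasterizer repeats this per vertex for millions of triangles per frame, which is why the workload parallelizes so naturally across GPU hardware.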

In contrast to these computer-generated images, image signal processing and computer vision applications receive a continuous stream of imaging data from a video camera or electro-optic focal plane array. Two-dimensional signal processing techniques are then applied for edge detection and motion blur correction (in addition to myriad other application-specific algorithms). The enhanced image is then rendered to a display for use by the human operator.
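
As a concrete example of the 2D signal processing mentioned above, edge detection reduces to a 2D convolution of the image with a small kernel. This is a hedged, pure-Python sketch using a Sobel kernel; the tiny test image is an assumption for illustration.

```python
# Edge detection as 2D convolution with a Sobel kernel. Each output pixel
# is a weighted sum of a 3x3 neighborhood -- the kind of regular,
# data-parallel arithmetic that GPUs accelerate.

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as is
    conventional in image processing) of a grayscale image with a 3x3 kernel."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - 2):
        row = []
        for c in range(w - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# A vertical edge: dark on the left, bright on the right.
image = [[0, 0, 9, 9]] * 4
print(convolve2d(image, SOBEL_X))  # strong response along the edge
```

Every output pixel is independent of the others, so on a GPU the whole frame can be processed in parallel rather than with these nested loops.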

The original intent of GPUs, and what remains their core function in commercial applications today, is image rendering. Projecting 3D computer images in motion onto a 2D surface, convincingly to the human eye, is a commonplace but highly compute-intensive application. Rendering a moving image, for example, requires a tremendous number of calculations per frame as the GPU renders simple polygons to create more advanced objects, maps textures to simulated surfaces, and then rotates these shapes within dynamically changing coordinate systems.

Image signal processing and computer vision perform different operations, but are equally intensive in terms of computation throughput requirements and use math algorithms highly similar to those used by image rendering.

Rendering video frames requires the continuous streaming of data. By design, GPUs are optimized for streaming processing. To manage these data streams they first boost throughput by parallelizing their compute engines via a massive pipeline architecture. Then they add the high-bandwidth, high capacity memory that is central to this model given the relentless flow of high resolution imagery.

In recent years, the streaming processing capability of GPUs has been further enhanced by the incorporation of “programmable shaders.” These shaders perform highly sophisticated shading effects (e.g. volumetric lighting, normal mapping, and chroma keying); they are optimized at the silicon level to execute the algorithms associated with this type of processing. Powerful new GPUs include large numbers of these shaders to help accelerate the rendering of highly complex, life-like images (Figure 1).

Figure 1
Greater shading capability enhances image quality.
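
One of the shader effects named above, chroma keying, amounts to a per-pixel color test. The toy sketch below runs it serially on the CPU; on a GPU the same test executes in parallel across many shader units. The key color, tolerance, and frame data are illustrative assumptions.

```python
# Toy chroma-key ("green screen") test, applied per pixel. A programmable
# shader would evaluate this same predicate for every pixel in parallel.

def chroma_key(pixel, key=(0, 255, 0), tolerance=60):
    """Return True if an RGB pixel is close enough to the key color
    to be made transparent."""
    return all(abs(p - k) <= tolerance for p, k in zip(pixel, key))

frame = [(10, 240, 20), (200, 50, 60), (30, 255, 40)]
mask = [chroma_key(px) for px in frame]
print(mask)  # [True, False, True]
```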

How, and why, does a GPU accelerate floating point math?

With the advent of programmable shaders, which are inherently adept at vector and matrix math, the use of GPUs to work on problems outside of rendering graphics has become a reality. Vector and matrix data manipulation are fundamental to linear algebra, filtering, and transforms, which are all components of digital signal processing (DSP). While data formats were originally integer based, a growing need for high dynamic range imaging resulted in the development of floating point data formats. The use of floating point data set the stage for GPUs to be used as floating point math accelerators in a multitude of commercial and defense-related high performance compute applications. When used in this fashion, a GPU is often referred to as a general-purpose GPU (GPGPU).
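
To make the graphics-to-DSP connection concrete: an FIR filter, a staple DSP operation, is just a sliding dot product, the same multiply-accumulate pattern shaders execute on vectors. A minimal sketch, with moving-average taps chosen as an assumption for illustration:

```python
# An FIR filter expressed as repeated vector dot products -- the same
# multiply-accumulate kernel that maps onto GPGPU shader hardware.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fir_filter(signal, taps):
    """Each output sample is the dot product of the tap vector with a
    sliding window of the input signal."""
    n = len(taps)
    return [dot(taps, signal[i:i + n]) for i in range(len(signal) - n + 1)]

taps = [1/3, 1/3, 1/3]              # simple 3-tap moving average
signal = [0.0, 3.0, 6.0, 3.0, 0.0]
print(fir_filter(signal, taps))     # smoothed output
```

Because each output sample depends only on its own input window, all of the dot products can be computed concurrently, which is exactly what a GPGPU exploits.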

Thanks to the sheer number of shaders in GPGPUs, coupled with a high throughput streaming pipeline architecture, it’s now possible to perform scalable, massively parallel processing that far outpaces the floating point performance of the SIMD units found within standard CPUs. Discrete GPGPUs are often coupled with high throughput, high capacity GDDR5, a type of synchronous graphics RAM based on DDR3 SDRAM. This is required for buffering high frame rates of high resolution graphics data, but is also highly advantageous for buffering the ingest and egress of wideband sensor data.

Until very recently, GPUs have existed only as discrete chips. As discrete pieces of silicon, GPUs have traditionally relied on external host CPUs, typically Intel-based, since they themselves are not governed by an operating system. The host CPU directs the instructions to be executed by the GPU and identifies the input source and output destination for data streams.

GPUs are often deployed on full length PCI cards placed in Intel-based workstations. They are also incorporated directly onto CPU motherboards. MXM (Mobile PCI Express Module) is a modular format for GPUs that, in addition to use in high-end consumer products, is also used in high performance embedded computing (HPEC) applications. One example is a single MXM module on a 3U OpenVPX carrier board; in another, two MXM modules are hosted on a 6U OpenVPX carrier. These carriers are typically placed in adjunct slots alongside Intel SBC or Intel/FPGA DSP boards, with a multi-lane PCIe super highway providing the interconnect for streaming data ingest/egress. Rendered graphics destined for displays are output directly from the GPUs over display signal ports. More recently, ARM cores that do run operating systems are also finding their way into discrete GPUs. And, as will be discussed later, there are other architectural approaches that provide new ways to integrate GPUs into systems.

Programming a GPU

Within the software domain, the most popular and ubiquitous API for drawing and rendering graphics is undoubtedly OpenGL, which is maintained by a not-for-profit consortium called the Khronos Group. However, when it comes to math acceleration, there are two main choices. One option is Compute Unified Device Architecture (CUDA) from the GPU silicon manufacturer, NVIDIA. The other option, a bit newer and also from the Khronos Group, is Open Computing Language (OpenCL).

CUDA strictly targets NVIDIA GPUs for use as GPGPUs. CUDA is based on the C programming language and is predicated on a parallel computing programming model designed to gain massive throughput by executing many threads concurrently. The big advantage of CUDA is that it exposes optimized, intrinsic functionality buried within the NVIDIA GPU silicon. For example, CUDA enables threads to use shared memory regions, a capability earlier GPUs lacked. In general, reviews from CUDA developers within the High Performance Computing (HPC) community have been very favorable. On the other hand, a big disadvantage of CUDA is that it is proprietary and closed to silicon other than NVIDIA GPUs.
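
The heart of CUDA's many-threads model is that each thread computes a global index from its block and thread coordinates and handles one data element. Since actual CUDA requires NVIDIA hardware, the sketch below emulates that indexing scheme serially in Python; the SAXPY kernel body is the classic introductory GPGPU example, not something from this article.

```python
# Pure-Python emulation of CUDA's grid/block/thread indexing. Each
# "thread" computes global index i = blockIdx.x * blockDim.x + threadIdx.x
# and processes one element -- here a SAXPY: out[i] = a * x[i] + y[i].

def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Kernel body executed once per (block, thread) pair."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(x):                           # guard against array overrun
        out[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Serially emulate a kernel launch: run every (block, thread) pair."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
out = [0.0] * len(x)
launch(saxpy_kernel, 2, 4, 2.0, x, y, out)   # 2 blocks of 4 threads
print(out)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```

On a real GPU the two loops in `launch` disappear: all eight threads run concurrently across the hardware, which is where the throughput comes from.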

OpenCL, also based on C, provides parallel computing constructs. The main differentiation between CUDA and OpenCL lies in the fact that OpenCL targets and executes across a whole spectrum of heterogeneous platforms including GPUs, CPUs, DSPs, and even FPGAs. Because of its openness, OpenCL is portable and can bridge heterogeneous computing elements within the same system or even the same silicon. The disadvantage of OpenCL is its greater abstraction from the underlying silicon: optimization mileage varies with each silicon vendor’s OpenCL implementation. Great progress is being made in this regard by the various chip manufacturers, and the community is growing at a much faster rate thanks to the variety of target silicon available.

How GPUs have evolved

Like other silicon devices, GPUs have seen, and continue to see, a steady increase in core count within a fixed power footprint, which is driven by lithography advances that increase die density. This increased core count creates an ever greater performance-to-power (i.e. FLOPS to Watt) ratio in all device architectures, with GPUs currently in the lead. As CPU cores (e.g. ARM) find their way onto GPU dies, GPUs are achieving a new level of autonomy. This is enabling GPUs to assume functions within a system well beyond that of a mindless Giga/Tera FLOPS monster.

Historically, Intel CPUs have been coupled on motherboards with discrete GPUs from NVIDIA or AMD. While Intel’s integration of the GPU onto the Mobile Class die was partly spurred by marketplace competition, there are also definitive technical advantages to subsuming the GPU, such as removing PCB complexities and reducing thermal dissipation. The aerospace and defense industry is a direct beneficiary of this development, as the Core i7 now provides a very powerful heterogeneous CPU+GPU device in which the GPU can be used to drive monitors or to process extreme sensor feeds.

The Intel Core i7 Gen 4 (Haswell) contains four x86 cores, each with its own dedicated AVX2 SIMD unit and its own exclusive L1/L2 cache (Figure 2). A larger L3 cache (4-8 MB) is shared among all four cores and the internal graphics processor (integrated GPU). This sharing is accomplished via a high speed ring bus (>300 GBytes/sec) that serves as the data transport mechanism. Additionally, some Core i7 chip variants also have embedded DRAM (eDRAM) that effectively functions as a Level 4 cache (a victim cache to L3, up to 128 MB). Aside from external DDR3 memory, data moves into and out of the Core i7 through a large number of PCI Express Gen 3 lanes.

Figure 2
Core i7 Architecture Diagram

The embedded GPU within a Core i7 is available in several different sizes; one of the more popular is the GT2 (a.k.a. HD Graphics 4600), which has 20 shader processors, referred to as Execution Units (EUs). These EUs deliver >350 GFLOPS of single precision floating point processing. Additionally, the GT2 supports three display ports for rendering graphics (Figure 3).

Figure 3
A Silicon View of the Core i7 Architecture
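
The >350 GFLOPS figure can be sanity-checked from the EU count. The per-EU rate and clock used below are assumptions typical of this graphics generation (two SIMD-4 FPUs per EU with fused multiply-add counted as two operations, and a max dynamic graphics clock around 1.25 GHz), not numbers stated in the article.

```python
# Rough back-of-envelope check of the GT2's quoted >350 GFLOPS.
# Per-EU throughput and clock frequency are assumptions for illustration.

EUS = 20                 # Execution Units in the GT2 / HD Graphics 4600
FLOPS_PER_EU_CYCLE = 16  # 2 x SIMD-4 FPUs, FMA counted as 2 ops each
CLOCK_GHZ = 1.25         # assumed max dynamic graphics frequency

peak_gflops = EUS * FLOPS_PER_EU_CYCLE * CLOCK_GHZ
print(peak_gflops)  # 400.0 -- consistent with the ">350 GFLOPS" claim
```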

Advantages for aerospace & defense electronics

Not every aerospace and defense application is a good candidate for eschewing a discrete NVIDIA or AMD GPU in favor of the embedded Intel Mobile Class GPU. Applications needing TFLOPS of processing and/or GBytes of external GDDR5 DRAM (supported by MXM modules carrying the larger discrete GPUs) still require a discrete GPU. However, there are applications that can greatly benefit from the onboard GPU in the Core i7: for these, the GFLOPS performance, even excluding the CPU AVX2 units, meets the need and the memory capacity is sufficient.

For SWaP-constrained aerospace & defense applications, one obvious advantage of the Core i7 is the removal of an entire adjunct GPU board (or more), which frees an entire FRU slot in a deployed chassis and saves size, weight, power, and overall cost for the entire system.

Another key advantage provided by an onboard GPU is the extremely low latency made possible by its proximity to the CPU cores. With all the CPU cores and the GPU interconnected by a lightning-fast ring bus and passing data at the caching level, latency benchmarks are greatly improved when compared to data transport between an Intel CPU device on one board and a discrete GPU hosted on a second board, interconnected via PCI Express.

Example use cases

GPUs are increasingly being used in aerospace & defense applications, as image renderers (GPUs) and math accelerators (GPGPUs). Meanwhile, both the discrete GPUs and the integrated GPUs are growing in terms of performance and memory capacity. Which type makes the most sense is really a question that needs to be answered by the requirements of the particular program.

Today, we are seeing GPUs being used to render video displays for 360° situational awareness inside modern military platforms. This enables the pilot or driver to effectively “see through” the ceiling, walls and floor of a vehicle. Even more popular is the use of GPUs to capture image sensor data (e.g. via gigapixel cameras), followed by real-time image manipulation, optimization, and display. This latter application is becoming particularly popular with Unmanned Aerial Vehicles (UAVs) tasked with carrying multiple camera types (e.g. electro-optic and infrared). The images from these cameras need to be ortho-rectified and stitched together, just for starters.

Sensor data other than standard imagery is also a target for GPUs. STAP and SAR radar, which are hungry for FLOPS, are seeing pulse compression and Doppler processing performed on GPUs. SIGINT applications requiring high throughput, wideband frequency domain analysis are also being targeted by GPUs. Embedded GPUs are increasingly attractive for some applications because of their proximity to the CPU cores on the same die. Applications with stringent latency requirements, in addition to processing throughput, can greatly benefit from a heterogeneous processor such as the Intel Core i7, thanks to the tightly coupled ring bus infrastructure between the embedded GPU and the AVX2-enabled x86 cores.
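
At its core, the pulse compression mentioned above is a matched filter: the received signal is correlated against the known transmit pulse, and the correlation peak marks the target's range bin. A minimal sketch, where the Barker code and noise-free echo are illustrative assumptions:

```python
# Pulse compression as a matched filter: cross-correlate received samples
# with the transmitted pulse. Every lag is an independent dot product,
# which is why this workload parallelizes well onto GPUs.

def matched_filter(received, pulse):
    """Return the cross-correlation of the received signal with the pulse."""
    n = len(pulse)
    return [sum(received[i + j] * pulse[j] for j in range(n))
            for i in range(len(received) - n + 1)]

pulse = [1, 1, 1, -1, 1]                  # 5-element Barker code
received = [0, 0, 0] + pulse + [0, 0, 0]  # echo delayed by 3 samples
output = matched_filter(received, pulse)
peak_bin = max(range(len(output)), key=output.__getitem__)
print(peak_bin, output[peak_bin])  # peak (value 5) at delay 3
```

Production radar code performs this correlation in the frequency domain with FFTs for efficiency, but the dot-product structure, and its suitability for GPU parallelism, is the same.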

Electronic warfare (EW) applications, such as Cognitive EW, that had been dismissive of discrete GPUs due to high latency, are now considering the integrated device as it gives them the throughput of the GPU, the cognitive capabilities of the CPUs, and the low latency of the ring bus interconnected caching. OpenCL, as previously discussed, is the software enabler giving developers the access they need to realize the benefits of this heterogeneous device.

An example of a design implementation incorporating embedded GPUs is Curtiss-Wright’s CHAMP-AV9 Intel Core i7 Multiprocessor 6U OpenVPX DSP Board (Figure 4). This module brings two Core i7 processors onto a single PCB and is designed to withstand the severe environments typical of aerospace & defense applications. The combined metric for just the AVX2 SIMD units on the board’s two Haswell Core i7s equates to 614 GFLOPS.

Figure 4
The CHAMP-AV9 6U OpenVPX DSP Module

What’s more, if the Core i7s’ embedded GT2 GPUs are used as GPGPU math accelerators, we see an overall total of 1,318 GFLOPS of processing power. Alternatively, if the GPUs are leveraged to render displays, the CHAMP-AV9 provides a total of six individual display ports, three per Core i7 device.
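
The board-level totals quoted above can be reconstructed from per-device figures. The breakdown below (AVX2 GFLOPS per i7, GT2 GFLOPS per embedded GPU) is an assumption chosen to make the stated totals add up, not a vendor specification.

```python
# Reconstructing the CHAMP-AV9 numbers quoted above from assumed
# per-device contributions.

avx2_per_i7_gflops = 307   # 4 cores x 32 SP FLOPs/cycle x ~2.4 GHz (assumed)
gt2_per_gpu_gflops = 352   # embedded HD Graphics 4600 contribution (assumed)

cpu_total = 2 * avx2_per_i7_gflops            # AVX2 units only
board_total = cpu_total + 2 * gt2_per_gpu_gflops
print(cpu_total, board_total)  # 614 1318 -- matches the quoted figures
```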

One important consideration to take into account when designing with the Core i7 is the importance of providing ample memory throughput and capacity, as well as high speed I/O, and lots of it. Rarely are GPUs compute bound; they are much more likely to suffer from being memory bound or I/O bound. The CHAMP-AV9 employs up to 32 GBytes of DDR3 to tackle memory capacity in addition to the routing of all genres of high speed I/O on and off module.

Signal integrity over an extended temperature range is also of particular importance as the board routes 1,600 MHz memory lanes, PCIe Gen 3, Display Port, and 40GigE/ InfiniBand in a very tight 6U OpenVPX footprint. The card employs Curtiss-Wright Fabric40™ technology to address this issue and ensure high signal integrity at the extreme signaling rates used by high bandwidth fabrics.

System designers must regularly evaluate the many choices of processing elements available on the market, selecting the architecture that best meets the needs of a specific program. GPUs integrated within Intel’s Core i7 processors are a recently available processing variant with significant advantages for many types of aerospace and defense applications. SWaP savings, high performance, low latency and enhanced security are all characteristics of these integrated GPUs.

Curtiss-Wright Defense Solutions
Ashburn, VA

(661) 705-1142