Conference Highlights the Computational Power of Graphics Processors and their Move into Mobile and Embedded


The GPU Technology Conference hosted by NVIDIA focused on powerful new application areas made possible by advances in graphics processing, which can also be applied to a wide range of computationally intensive tasks, and on how these advances will increasingly influence mobile and embedded systems.


Humans are visual creatures, and interacting visually with computer systems large and small has long been a distant goal, frustrated largely by the sheer processing power needed to make realistic visual interaction possible. That era now appears to be coming to an end, as demonstrated by the recent GPU Technology Conference (GTC) held in San Jose by NVIDIA. Many of the solutions being developed with the underlying GPU technology address more than just interaction with systems. They provide meaningful analysis and presentation of huge amounts of data in visual form, and they provide a means for truly intelligent machine vision, machine learning, robotics and automotive applications. The interesting part is that a large number of these innovative solutions owe their underlying technology to the demands of the gaming industry.

The gaming industry constantly strives for more realism and for more, and faster, interactivity. Unlike the offerings of the early CAD industry, gaming cannot work with just drawing pictures and rendering surfaces. It must deal with motion, which means physics. Not so long ago it was difficult and time-consuming enough to render a realistic image of an object with textures, ray tracing, lighting models and all the other elements that go into graphical processing. Gaming demands that it happen at the highest resolution in real time. That means it must represent objects in motion, such as collapsing or exploding buildings, racing vehicles, fighting figures and more. This entails re-rendering surfaces and shapes for each frame to maintain the gaming experience.

Now that such processing has been achieved for gamers, it turns out that the ability to do physics processing along with high-speed, high-definition graphical rendering lends itself to many other “practical” applications in fields such as seismology, medical imaging, machine vision, data analysis and more. This year the percentage of gaming solutions shown at the conference was smaller than in previous years, but that was offset by a variety of creative new startups and applications. NVIDIA continues to feed this segment with a number of significant announcements.


Continuing the high-end graphics direction with at least a partial appeal to high-end gamers, NVIDIA announced the GeForce GTX Titan Z, which incorporates two Kepler GK110 GPU chips and 12 Gbytes of frame buffer memory. Each GPU has 2,880 CUDA cores for a total of 5,760 cores on the one card. The two GPUs are tuned to run at the same clock speeds with dynamic power balancing so that neither of them becomes a bottleneck (Figure 1).


Figure 1
The GeForce GTX Titan Z incorporates two Kepler GPUs running with clock and power balance, with 12 Gbytes of frame buffer memory.

While the Titan Z will plug into an x16 PCIe slot in a PC (it is triple-wide) and handle 5K multi-display gaming, its potential for other applications as an under-the-desk “supercomputer” at a price of roughly $3,000 is attractive. Of course, many of these advanced applications will turn to systems that contain multiple Titan Z modules and go well beyond the high-end PC.

New GPU Interconnect Technology


Parallelism is one key to supercomputing, embedded or otherwise. Fast data movement is definitely another. This requires not only increasing data bandwidth, but also minimizing the number of transfers needed. NVIDIA has developed an interconnect architecture it calls NVLink, which it will incorporate into future GPU architectures. NVLink is expected to deliver data rates 5 to 12 times those of current PCIe 3.0. Putting this fatter pipe between the CPU and GPU will thus allow data to flow at more than 80 Gbytes/s, compared to the 16 Gbytes/s available now. The GPU will be able to access memory at close to the memory’s own bandwidth, and the link between GPU and CPU will be correspondingly faster (Figure 2).


Figure 2
NVLink is a high-speed interconnect that enables tightly integrated GPUs and CPUs.
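
The bandwidth figures quoted for NVLink can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the function names and the 8 Gbyte working set are hypothetical, and the ideal transfer times ignore latency and protocol overhead.

```python
# Back-of-the-envelope comparison of the link bandwidths quoted in the article.
PCIE3_GBYTES_PER_S = 16.0        # figure quoted for the current PCIe 3.0 link
NVLINK_SPEEDUP_RANGE = (5, 12)   # "5 to 12 times" PCIe 3.0


def nvlink_bandwidth_range(pcie_gbytes_per_s=PCIE3_GBYTES_PER_S):
    """Return the (low, high) bandwidth in Gbytes/s implied by the speedup range."""
    low, high = NVLINK_SPEEDUP_RANGE
    return pcie_gbytes_per_s * low, pcie_gbytes_per_s * high


def transfer_seconds(data_gbytes, link_gbytes_per_s):
    """Ideal time to move a buffer across a link (no latency, no overhead)."""
    return data_gbytes / link_gbytes_per_s


if __name__ == "__main__":
    low, high = nvlink_bandwidth_range()
    print(f"NVLink range: {low:.0f} to {high:.0f} Gbytes/s")
    data = 8.0  # hypothetical working set in Gbytes
    print(f"PCIe 3.0:         {transfer_seconds(data, PCIE3_GBYTES_PER_S):.3f} s")
    print(f"NVLink (low end): {transfer_seconds(data, low):.3f} s")
```

The low end of the range, 5 times 16 Gbytes/s, is where the 80 Gbytes/s figure comes from.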

PCIe being four to five times slower than CPU memory systems makes it a definite bottleneck for the GPUs. This is even more significant for Power CPUs, which have a higher memory bandwidth than x86 processors. The NVLink technology was developed in cooperation with IBM, which will be incorporating it into its next-generation Power CPU technology.

Because the NVLink interface will match the bandwidth of typical CPU memory systems, it will enable GPUs to access CPU memory at its full bandwidth. When Power CPUs also have NVLink, the PCIe bandwidth bottleneck should disappear.

In addition to faster memory access, the NVLink model implements unified memory, in which the developer can treat GPU and CPU memory as a single block. This has significant advantages over partitioned memory, which requires data transfers between memory partitions for a different CPU or GPU to operate on a given set of data. With unified memory, GPU and CPU can both directly access the same memory.
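
The cost of partitioned memory can be illustrated with a toy counting model. This is a sketch under stated assumptions, not GPU code: it calls no GPU API, the function names are hypothetical, and it counts only the copies a developer would have to issue explicitly. Any data movement the hardware or runtime performs on its own under unified memory is not modeled.

```python
# Toy counting model of explicit copies; it does not call any GPU API.
PCIE3_GBYTES_PER_S = 16.0  # PCIe 3.0 link bandwidth quoted in the article


def explicit_copy_gbytes(buffer_gbytes, handoffs, unified):
    """Gbytes the developer must copy when CPU and GPU take turns on one buffer.

    Partitioned memory: every handoff copies the whole buffer to the other side.
    Unified memory: the developer issues no copies at all.
    """
    if unified:
        return 0.0
    return buffer_gbytes * handoffs


def explicit_copy_seconds(buffer_gbytes, handoffs, unified,
                          link_gbytes_per_s=PCIE3_GBYTES_PER_S):
    """Ideal time spent on those explicit copies over the given link."""
    return explicit_copy_gbytes(buffer_gbytes, handoffs, unified) / link_gbytes_per_s


if __name__ == "__main__":
    # Hypothetical workload: a 2 Gbyte buffer handed back and forth 10 times.
    print(explicit_copy_seconds(2.0, 10, unified=False))  # partitioned
    print(explicit_copy_seconds(2.0, 10, unified=True))   # unified
```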

Although future NVIDIA GPUs will continue to support PCIe, NVLink technology will be used for connecting GPUs to NVLink-enabled CPUs as well as providing high-bandwidth connections directly between multiple GPUs. Also, despite its very high bandwidth, NVLink is substantially more energy efficient per bit transferred than PCIe.

Pascal: The Next Step

NVIDIA has announced that NVLink technology, along with unified memory, will be implemented in its next-generation GPU architecture, called Pascal, which is scheduled for release in 2016. The new GPUs will also feature 3D memory, which stacks DRAM chips into dense modules with wide interfaces and brings them inside the same package as the GPU. The new memory is expected to offer several times the bandwidth of today’s memory, about 2.5 times the capacity and 4 times the energy efficiency. This makes possible more compact GPUs that put more power into smaller devices.


NVIDIA’s first module will be about 1/3 the size of a PCIe card (Figure 3), and judging from the appearance of the proposed module, it will not have the standard PCIe edge connector. This suggests that it is intended for use with CPUs that are also NVLink-enabled, which would initially mean Power Architecture CPUs. Despite the connectors at the ends of the board, there is also no indication that an NVLink interface connector has yet been defined.


Figure 3
The Pascal architecture, introduced here by NVIDIA CEO Jen-Hsun Huang, will bring 3D memory onto a single device with a GPU and an NVLink interface for maximum data processing.

Moving into Embedded and Mobile

As powerful as NVIDIA GPUs like the Tesla and the Titan Z may be, it is hard to imagine them being used in embedded and mobile systems due to their size and power consumption. However, as with other longstanding assumptions mentioned here, this is about to change as well. The announcements around the Pascal technology certainly point in that direction with a very powerful GPU in a small, low-power form factor. A more immediate development has arrived in the form of the Tegra K1, which was introduced last December (see “Mobile Graphics CPU Promises High Performance . . .” RTC, February 2014).

The K1 is the latest member of the Tegra family, which incorporates NVIDIA GPU technology on the same die with multicore ARM CPUs and is used in high-end mobile graphics applications such as tablets and automobiles like the Tesla (the car, not the GPU). The K1 is also an advance in that it now includes a 4-Plus-1 quad-core ARM Cortex-A15 and a 192-core GPU based on the Kepler architecture. It is also compatible with CUDA, the parallel computing platform and programming model designed to harness the parallel nature of the GPU not only for graphics applications but also for numerically intensive computation such as physics, vision and data analysis.


Now NVIDIA has released the Jetson TK1 Developer Kit, which consists of a board with the Tegra K1, a BSP and software stack, CUDA, OpenGL 4.4 and the NVIDIA VisionWorks toolkit (Figure 4). It also includes a suite of development and profiling tools plus support for cameras and other peripherals. The board has 2 Gbytes of memory and a 16 Gbyte eMMC, USB 3.0, HDMI 1.4, Gigabit Ethernet, camera interface and more. The best part is that it will sell for $192. No one has yet indicated whether it will be available via Amazon. The Jetson TK1 also supports Linux for Tegra.


Figure 4
The Jetson TK1 Developer Kit will put powerful GPU development tools into the hands of anyone from engineers to hobbyists.

The fact that the Tegra K1 supports CUDA is of high importance because the CUDA platform represents a software compatibility continuum from a device like the K1 all the way up to the largest Tesla- or Titan-based system. One example of how this is being exploited involves Audi’s efforts to develop a driverless car. Developing an intelligent vision system involves a large machine learning project using neural networks. Among other things, the system must be taught to recognize objects such as pedestrians, other cars, dogs, etc. While that can be done using something like 18,000 cores on a multiple-Tesla system, there is no way to install such a thing in an automobile.

Audi is taking the parameters produced by the machine learning project and placing them onto a Tegra K1 in a small box that fits in a corner of the trunk. This then forms the basis of the automotive vision system, thanks to the CUDA compatibility across the range of GPUs.
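
The split between learning on a large system and running on a small one can be sketched in a few lines. The example below is a toy illustration only and is not Audi’s method: it substitutes a single perceptron for a neural network, uses made-up two-feature data, and all function names are hypothetical. What it shows is that only the learned parameters need to travel to the embedded device, which then performs inference without any training code.

```python
# Toy illustration: learn parameters on a "large" system, then run inference
# on a "small" one that only needs the exported parameters.
import json


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Training side: fit weights and a bias with the classic perceptron rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = y - (1 if score > 0 else 0)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return {"weights": weights, "bias": bias}


def export_parameters(params):
    """Serialize the learned parameters; only this blob goes to the device."""
    return json.dumps(params)


def embedded_classify(blob, x):
    """Deployment side: load the parameters and run inference only."""
    params = json.loads(blob)
    score = sum(w * xi for w, xi in zip(params["weights"], x)) + params["bias"]
    return 1 if score > 0 else 0


if __name__ == "__main__":
    # Hypothetical data: label 1 only when both features are present.
    samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [0, 0, 0, 1]
    blob = export_parameters(train_perceptron(samples, labels))
    print([embedded_classify(blob, x) for x in samples])
```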

The possibilities of truly High Performance Embedded Computing (HPEC) are only now beginning to be realized. In the coming months and years we can expect to witness a range of yet-to-be-discovered embedded and mobile applications from NVIDIA and other companies pioneering this new and exciting area of intelligent devices.


Santa Clara, CA
(408) 486-2000