
TECHNOLOGY IN SYSTEMS

Developing Hybrid Code Using OpenCL

OpenCL Programming: Parallel Processing Made Faster and Easier than Ever

Newer processors may not only have multiple cores of the same architecture, they may also integrate heterogeneous computing elements. Programming such devices with a single code base has just gotten easier.

TODD ROBERTS, AMD


Parallel processing isn’t really new. It has been around in one form or another since the early days of computing. As traditional CPUs have become multicore parallel processors, with many cores in a socket, it has become more important for developers to embrace parallel processing architectures as a means to realize significant system performance improvements. This move toward parallel processing has been complicated by the diversity and heterogeneity of the various parallel architectures that are now available. A heterogeneous system is made up of different processors, each with specialized capabilities. Over the last several years GPUs have been targeted as yet another source of computing power in the system. GPUs, which have always been highly parallel, with hundreds of parallel execution units on a single die, have become increasingly programmable, to the point that it is now often useful to think of GPUs as many-core processors instead of special-purpose accelerators.

All of this diversity has been reflected in a wide array of tools and programming models required for programming these architectures. This has created a dilemma for developers. In order to write high-performance code they have had to write their code specifically for a particular architecture and give up the flexibility of being able to run on different platforms. In order for programs to take advantage of increases in parallel processing power, however, they must be written in a scalable fashion. Developers need the ability to write code that can be run on a wide range of systems without having to rewrite everything for each system.

OpenCL for Unified, Portable Source Code

OpenCL, the first open and royalty-free programming standard for general-purpose parallel computations on heterogeneous systems, is quickly growing in popularity as a means for developers to preserve their expensive source code investments and easily target multicore CPUs and GPUs.

OpenCL is maintained by the Khronos Group, a not-for-profit industry consortium that creates open standards for the authoring and acceleration of parallel computing, graphics, dynamic media, computer vision and sensor processing on a wide variety of platforms and devices. Developed in an open standards committee with representatives from major industry vendors, OpenCL affords users a cross-vendor, non-proprietary solution for accelerating their applications across mainstream processing platforms, and provides the means to tackle major development challenges, such as maximizing parallel compute utilization, efficiently handling data movement and minimizing dependencies across cores.

Ultimately, OpenCL enables developers to focus on applications, not just chip architectures, via a single, portable source code base. When using OpenCL, developers can use a unified tool chain and language to target all of the parallel processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.

Key Concepts and Workflows

OpenCL has a flexible execution model that incorporates both task and data parallelism (see sidebar “Task Parallelism vs. Data Parallelism”). Tasks themselves are comprised of data-parallel kernels, which apply a single function over a range of data elements in parallel. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Where the concept of a kernel usually refers to the fundamental level of an operating system, here the term identifies a piece of code that executes on a given processing element.

An OpenCL command queue is created by the developer through an API call, and associated with a specific compute device. To execute a kernel, the kernel is pushed onto a particular command queue. Enqueueing a kernel can be done asynchronously, so that the host program may enqueue many different kernels without waiting for any of them to complete. When enqueueing a kernel, the developer optionally specifies a list of events that must occur before the kernel executes. If a developer wishes to target multiple OpenCL compute devices simultaneously, the developer would create multiple command queues. 
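
To make this concrete, here is a minimal sketch of creating a command queue and enqueueing a kernel asynchronously. The ctx, device and kernel handles are assumed to have been created earlier, and error checks are omitted.

/* Assumes ctx, device and kernel were created earlier; error checks omitted. */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

/* Enqueue a 1-D kernel over 1024 work items; the call returns immediately. */
size_t global_size = 1024;
cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                       NULL,           /* let the runtime pick the local size */
                       0, NULL,        /* no prerequisite events */
                       &kernel_done);  /* event other commands can wait on */

/* The host is free to enqueue more work here without waiting;
   clFinish() blocks until everything in the queue has completed. */
clFinish(queue);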

Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model.

Events are generated by kernel completion, as well as memory read, write and copy commands.  This allows the developer to specify a dependence graph between kernel executions and memory transfers in a particular command queue or between command queues themselves, which the OpenCL runtime will traverse during execution. Figure 1 shows a task graph illustrating the power of this approach, where arrows indicate dependencies between tasks. For example, Kernel A will not execute until Write A and Write B have finished, and Kernel D will not execute until Kernel B and Kernel C have finished.

Figure 1
Task parallelism within a command queue.
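
As an illustration of how part of this graph might be expressed, the sketch below encodes the Write A / Write B / Kernel A dependency from Figure 1 using event wait lists. The queue, buffer, kernel and size names are hypothetical and error handling is omitted; for the runtime to actually overlap independent commands, the queue would be created with out-of-order execution enabled or multiple queues would be used.

cl_event write_a, write_b, kernel_a_done;

/* Two writes that may proceed independently. */
clEnqueueWriteBuffer(queue, buf_a, CL_FALSE, 0, size, host_a, 0, NULL, &write_a);
clEnqueueWriteBuffer(queue, buf_b, CL_FALSE, 0, size, host_b, 0, NULL, &write_b);

/* Kernel A waits on both writes before it may execute. */
cl_event deps[2] = { write_a, write_b };
clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global_size, NULL,
                       2, deps, &kernel_a_done);

/* Later kernels can list kernel_a_done in their own wait lists,
   building up the rest of the graph shown in Figure 1. */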

The ability to construct arbitrary task graphs is a powerful way of constructing task-parallel applications. The OpenCL runtime has the freedom to execute the task graph in parallel, as long as it respects the dependencies encoded in the task graph. Task graphs are general enough to represent the kinds of parallelism useful across the spectrum of hardware architectures, from CPUs to GPUs.

Besides the task parallel constructs provided in OpenCL, which allow synchronization and communication between kernels, OpenCL supports local barrier synchronizations within a work group. This mechanism allows work items to coordinate and share data in the local memory space using only very lightweight and efficient barriers. Work items in different work groups should never try to synchronize or share data, since the runtime provides no guarantee that all work items are concurrently executing, and such synchronization easily introduces deadlocks.
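
A small kernel sketch of this pattern (the kernel and argument names are illustrative): each work group stages data in local memory, synchronizes with a barrier, and then safely reads values written by other work items in the same group.

__kernel void reverse_within_group(__global const float *in,
                                   __global float *out,
                                   __local float *scratch)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = in[gid];          /* stage one element per work item */
    barrier(CLK_LOCAL_MEM_FENCE);    /* all work items in the group wait here */

    /* Safe to read data written by other work items in the same group. */
    out[gid] = scratch[lsz - 1 - lid];
}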

Developers are also free to construct multiple command queues, either for parallelizing an application across multiple compute devices, or for expressing more parallelism via completely independent streams of computation. OpenCL’s ability to use both data and task parallelism simultaneously is a great benefit to parallel application developers, regardless of their intended hardware target.

Kernels

As mentioned, OpenCL kernels provide data parallelism. The kernel execution model is based on a hierarchical abstraction of the computation being performed. OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional. In Figure 2, we see an example of a 2-dimensional index space, which has Gx * Gy elements. For every element of the kernel index space, a work item will be executed. All work items execute the same program, although their execution may differ due to branching based on data characteristics or the index assigned to each work item.

Figure 2
Executing kernels - work groups and work items.

The index space is regularly subdivided into work groups, which are tilings of the entire index space. In Figure 2, we see a work group of size Sx * Sy elements. Each work item in the work group receives a work group ID, labeled (wx, wy) in the figure, as well as a local ID, labeled (sx, sy). Each work item also receives a global ID, which can be derived from its work group and local IDs.
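
Inside a kernel these IDs are obtained through built-in functions. The following sketch (illustrative only) shows the relationship for dimension 0 of the 2-dimensional index space in Figure 2:

__kernel void show_ids(__global int *global_ids)
{
    size_t gx = get_global_id(0);    /* global ID in dimension 0 */
    size_t gy = get_global_id(1);    /* global ID in dimension 1 */
    size_t wx = get_group_id(0);     /* work group ID (wx, wy)   */
    size_t sx = get_local_id(0);     /* local ID (sx, sy)        */
    size_t Sx = get_local_size(0);   /* work group size Sx       */

    /* The global ID can be derived from the group and local IDs: */
    size_t derived_gx = wx * Sx + sx;   /* equals gx */

    global_ids[gy * get_global_size(0) + gx] = (int)derived_gx;
}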

Work items in different work groups may coordinate execution through the use of atomic memory transactions, which are an OpenCL extension supported by some OpenCL runtimes. For example, work items may append variable numbers of results to a shared queue in global memory. However, it is good practice for work items in different work groups not to communicate directly; without careful design, such communication can create scalability problems and deadlocks. The hierarchy of synchronization and communication provided by OpenCL is a good fit for many of today’s parallel architectures, while still providing developers the ability to write efficient code, even for parallel computations with non-trivial synchronization and communication patterns.
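
One hedged sketch of the shared-queue pattern mentioned above (names are illustrative), using 32-bit global atomics, which are core in OpenCL 1.1 and an extension in OpenCL 1.0:

__kernel void collect_matches(__global const float *data,
                              __global float *results,
                              __global int *result_count,
                              float threshold)
{
    int gid = get_global_id(0);
    float v = data[gid];

    if (v > threshold) {
        /* atomic_inc returns the old value, giving each producer a unique slot */
        int slot = atomic_inc(result_count);
        results[slot] = v;
    }
}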

The work items may only communicate and synchronize locally, within a work group, via a barrier mechanism. This provides scalability, traditionally the bane of parallel programming. Because communication and synchronization at the finest granularity are restricted in scope, the OpenCL runtime has great freedom in how work items are scheduled and executed.

A Typical OpenCL Kernel

As already discussed, the core programming goal of OpenCL is to provide programmers with a data-parallel execution model. In practical terms this means that programmers can define a set of instructions that will be executed on a large number of data items at the same time. The most obvious example is to replace loops with functions (kernels) executing at each point in a problem domain.

Referring to Figures 3 and 4, let’s say you wanted to process a 1024 x 1024 image (your global problem dimension). You would initiate one kernel execution per pixel (1024 x 1024 = 1,048,576 kernel executions).

Figure 3 shows sample scalar code for processing an image. If you were writing simple C code, you would write a for loop that iterates over all N pixels and performs the computation on each.

Figure 3
Example of traditional loop (scalar).
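
Since the figure itself is not reproduced here, the scalar version is likely along these lines: a representative sketch of an element-wise multiply over n elements.

/* Traditional scalar C: one loop visits every element in turn. */
void mul(int n, const float *a, const float *b, float *result)
{
    for (int i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}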

An alternate way to do this is in a data-parallel fashion (Figure 4). In this case you logically read one element of a (*a) in parallel, multiply it by the corresponding element of b in parallel, and write the result to your output. You’ll notice that in Figure 4 there is no for loop; you get an ID value, read a value from a, multiply it by a value from b, and then write the output.

Figure 4
Data parallel OpenCL.
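
The data-parallel OpenCL C equivalent would look roughly like this (again a sketch of what the figure shows; the kernel name is illustrative):

/* Data-parallel OpenCL C: no loop; each work item handles one element. */
__kernel void dp_mul(__global const float *a,
                     __global const float *b,
                     __global float *result)
{
    int id = get_global_id(0);        /* which element am I?        */
    result[id] = a[id] * b[id];       /* read, multiply, write once */
}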

As stated above, a properly written OpenCL application will operate correctly on a wide range of systems. While this is true, it should be noted that each system and compute device available to OpenCL may have different resources and characteristics that allow, and sometimes require, some level of tuning to achieve optimal performance. For example, OpenCL memory object types and sizes can impact performance. In most cases the key parameters can be queried from the OpenCL runtime and used to tune the operation of the application. In addition, each vendor may choose to provide extensions that offer more options for tuning an application. In most cases these are parameters passed through the OpenCL API and should not require extensive rewriting of the algorithms.
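
For example, a few of the device properties commonly consulted for tuning can be read with clGetDeviceInfo. A short sketch, assuming a device handle is already available and omitting error checks:

size_t   max_wg_size;     /* largest work group the device supports */
cl_ulong local_mem_size;  /* bytes of local memory per compute unit */
cl_ulong global_mem_size; /* total global memory on the device      */

clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem_size), &global_mem_size, NULL);

/* These values can then drive choices such as the work group size
   and how much data to stage in local memory. */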

Building an OpenCL Application

An OpenCL application is built by first querying the runtime to determine which platforms are present. There can be any number of different OpenCL implementations installed on a single system. The desired OpenCL platform can be selected by matching the platform vendor string to the desired vendor name, such as “Advanced Micro Devices, Inc.” The next step is to create a context. An OpenCL context has associated with it a number of compute devices (for example, CPU or GPU devices). Within a context, OpenCL guarantees a relaxed consistency between these devices. This means that memory objects, such as buffers or images, are allocated per context; but changes made by one device are only guaranteed to be visible to another device at well-defined synchronization points. For this, OpenCL provides events, with the ability to synchronize on a given event to enforce the correct order of execution.
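
A minimal sketch of these first steps, matching the platform vendor string and creating a context on a GPU device. The usual headers (CL/cl.h, string.h) are assumed and error handling is omitted:

cl_int err;
cl_uint num_platforms;
cl_platform_id platforms[8];
clGetPlatformIDs(8, platforms, &num_platforms);

/* Pick the platform whose vendor string matches the one we want. */
cl_platform_id platform = platforms[0];
for (cl_uint i = 0; i < num_platforms; i++) {
    char vendor[256];
    clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                      sizeof(vendor), vendor, NULL);
    if (strcmp(vendor, "Advanced Micro Devices, Inc.") == 0) {
        platform = platforms[i];
        break;
    }
}

/* Get a GPU device on that platform and create a context around it. */
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);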

Most OpenCL programs follow the same pattern: given a specific platform, select a device or devices, create a context, allocate memory, create device-specific command queues, and perform data transfers and computations. Generally, the platform is the gateway to accessing specific devices; given these devices and a corresponding context, the rest of the application is independent of the platform. Given a context, the application can:

• Create one or more command queues.

• Create programs to run on one or more associated devices.

• Create kernels within those programs.

• Allocate memory buffers or images, either on the host or on the device(s); memory can be copied between the host and the device.

• Write data to the device.

• Submit the kernel (with appropriate arguments) to the command queue for execution.

• Read data back to the host from the device.

The relationship between context(s), device(s), buffer(s), program(s), kernel(s) and command queue(s) is best seen by looking at sample code.

Example Program – Simple Buffer Write

Here is a basic programming example, a simple buffer write, with explanatory comments.

This code sample shows a minimalist OpenCL C program that sets a given buffer to some value. It illustrates the basic programming steps with a minimum amount of code. This sample contains no error checks and the code is not generalized. Yet, many simple test programs might look very similar. The entire code for this sample is provided in Code Block 1.

Code Block 1
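
The original listing is not reproduced here. As a rough sketch of what such a program looks like, the following follows the seven steps described below; the kernel and variable names are illustrative and error checks are omitted.

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

/* A trivial kernel: each work item writes its global ID into the buffer. */
const char *source =
    "__kernel void memset_kernel(__global uint *dst)   \n"
    "{                                                  \n"
    "    dst[get_global_id(0)] = get_global_id(0);      \n"
    "}                                                  \n";

int main(void)
{
    /* 1. Select the first available platform. */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* 2. Request a GPU device (use CL_DEVICE_TYPE_CPU for the CPU). */
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* 3. Create a context on that device, and a command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 4. Compile the program source and get a handle to the kernel. */
    cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "memset_kernel", NULL);

    /* 5. Allocate a buffer on the device and pass it as the kernel argument. */
    cl_mem buffer = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                   NWITEMS * sizeof(cl_uint), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);

    /* 6. Launch the kernel; the runtime chooses the local work size. */
    size_t global_work_size = NWITEMS;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size,
                           NULL, 0, NULL, NULL);
    clFinish(queue);

    /* 7. Map the buffer to the host and examine the results. */
    cl_uint *ptr = (cl_uint *)clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                                 CL_MAP_READ, 0,
                                                 NWITEMS * sizeof(cl_uint),
                                                 0, NULL, NULL, NULL);
    for (int i = 0; i < NWITEMS; i++)
        printf("%d %u\n", i, ptr[i]);

    clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);
    return 0;
}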

1. The host program must select a platform, which is an abstraction for a given OpenCL implementation. Implementations by multiple vendors can coexist on a host, and the sample uses the first one available.

2. A device ID for a GPU device is requested. A CPU device could be requested by using CL_DEVICE_TYPE_CPU instead. The device can be a physical device, such as a given GPU, or an abstracted device, such as the collection of all CPU cores on the host.

3. On the selected device, an OpenCL context is created. A context ties together a device, memory buffers related to that device, OpenCL programs and command queues. Note that buffers related to a device can reside on either the host or the device. Many OpenCL programs have only a single context, program and command queue.

4. Before an OpenCL kernel can be launched, its program source is compiled, and a handle to the kernel is created.

5. A memory buffer is allocated on the device.

6. The kernel is launched. While it is necessary to specify the global work size, OpenCL determines a good local work size for this device. Since the kernel was launched asynchronously, clFinish() is used to wait for completion.

7. The data is mapped to the host for examination. Calling clEnqueueMapBuffer ensures the visibility of the buffer on the host, which in this case probably includes a physical transfer. Alternatively, we could use clEnqueueReadBuffer(), which requires a pre-allocated host-side buffer.

OpenCL affords developers an elegant, non-proprietary programming platform to accelerate parallel processing performance for compute-intensive applications. With the ability to develop and maintain a single source code base that can be applied to CPUs, GPUs and APUs with equal ease, developers can achieve significant programming efficiency gains, reduce development costs, and speed their time-to-market. 

Advanced Micro Devices
Sunnyvale, CA.
(408) 749-4000
www.amd.com