INDUSTRY INSIGHT

From Multiprocessor to Multicore

Moving from Multiprocessors to Multiple Cores

The advent of powerful multicore architectures like the Cell Broadband Engine can significantly enhance applications that were already boosted by multiprocessor approaches. The trick lies in knowing how to optimize the newly available resources.

WILLIAM LUNDGREN, KERRY BARNES AND JAMES STEED, GEDAE

The use of multiple processing elements has become essential to software development. A variety of multicore and DSP processors are available, and while each processing core can perform a variety of tasks, some processing elements are better suited to some tasks than others. With traditional development methods, the processor for each task must be chosen at the beginning of development. Making this choice early allows the partitioning and mapping of work to processors to be planned before coding starts, which minimizes risk to the project. However, this preplanning requires considerable technical experience and insight into both the type of problem and the capabilities of the processors. The sense of experimentation that draws most engineers and programmers into science is restrained by the structure needed to improve the chances of bringing an expensive project to fruition.

Other options exist. Software development tools can automate the implementation of distributed software. Starting from a model of the software that can be constructed on a single workstation, the tool generates separate threads and executables to construct the parallel implementation, and many types of processors can be supported using the same infrastructure. With these tools, the distribution of work to processors, and even the choice of processors themselves, can be delayed until the final stages of software development. Through experimentation and analysis, engineers can find the optimum implementation. This not only enables the search for better software but also reduces risk to the project, because implementation parameters that used to be set in stone before coding can now be altered iteratively.

One example of the benefits of this approach is the recent work to move a synthetic aperture radar (SAR) benchmark from a quad PowerPC DSP system to the Cell Broadband Engine (Cell/B.E.) processor. The SAR algorithm consists of three main components: range processing, a matrix transpose and azimuth processing. The range and azimuth processing include many compute-intensive vector operations, including FFTs, inverse FFTs and vector multiplies. The work of the range and azimuth processing can be easily distributed to multiple processors, but doing so requires the matrix transpose to be distributed as well, an operation called a “corner turn.”
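The three-stage structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`dft`, `filter_rows`, `sar_pipeline`) are hypothetical, a naive DFT stands in for an optimized FFT library, the reference spectra are placeholders, and the FFT, multiply, inverse FFT ordering within each stage is assumed, since the article lists the operations without giving their sequence.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for an optimized FFT routine."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def filter_rows(image, reference):
    """Assumed stage shape: FFT each row, multiply by a reference, inverse FFT."""
    result = []
    for row in image:
        spectrum = dft(row)
        product = [s * r for s, r in zip(spectrum, reference)]
        result.append(dft(product, inverse=True))
    return result

def transpose(image):
    """Matrix transpose; the distributed version is the corner turn."""
    return [list(col) for col in zip(*image)]

def sar_pipeline(image, range_ref, azimuth_ref):
    ranged = filter_rows(image, range_ref)    # range processing (row-wise)
    turned = transpose(ranged)                # corner turn
    return filter_rows(turned, azimuth_ref)   # azimuth processing (column-wise)
```

The row-wise stages are independent per row, which is why they distribute easily; only the transpose in the middle needs data from every row at once.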

The existing SAR benchmark was implemented in Gedae, a programming language and multithreading compiler that enables experimentation with many different processors and processor topologies. Gedae was used to generate an implementation for the quad PowerPC system, as shown in Figure 1. Each PowerPC in the system runs at 500 MHz and has 256 Mbytes of memory. While the 500 MHz processors are several years old, the ample memory allows the large SAR images to be processed one at a time. In other words, once distributed, one SAR image easily fits in the four memories. Because of this ample memory, the corner turn operation is implemented easily by sending the i-th section of the subimage on the j-th processor to the j-th section of the subimage on the i-th processor, a trivial implementation of a distributed matrix transpose. The quad PowerPC implementation achieves a frame rate of 3 Hz.
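The section exchange behind the corner turn can be illustrated with a small simulation. This is a sketch, not Gedae's generated code: the processors are modeled as entries in a Python list, the names `split_rows` and `corner_turn` are hypothetical, and the image dimensions are assumed to divide evenly by the processor count.

```python
def split_rows(matrix, procs):
    """Give each simulated processor a contiguous block of rows."""
    rows_per = len(matrix) // procs
    return [matrix[p * rows_per:(p + 1) * rows_per] for p in range(procs)]

def corner_turn(subimages):
    """Send section i of processor j's subimage to section j on processor i."""
    procs = len(subimages)
    rows_per = len(subimages[0])                # rows held by each processor
    cols_per = len(subimages[0][0]) // procs    # columns in each section
    # After the turn, processor i holds cols_per rows of the transposed image.
    turned = [[[] for _ in range(cols_per)] for _ in range(procs)]
    for i in range(procs):        # destination processor
        for j in range(procs):    # source processor
            section = [row[i * cols_per:(i + 1) * cols_per]
                       for row in subimages[j]]
            # Transpose the small section and append it as section j.
            for r in range(cols_per):
                turned[i][r].extend(section[k][r] for k in range(rows_per))
    return turned
```

Concatenating the per-processor results in order yields the full transposed image. Every processor exchanges data with every other processor, which is simple only when each one can hold its whole subimage in memory.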

Using traditional development techniques, re-implementing this application on the Cell/B.E. processor would be a significant programming project. The Cell Broadband Engine Architecture is a heterogeneous multicore architecture developed through a collaboration between Sony, Toshiba and IBM. The current implementation of the Cell/B.E. processor combines one Power Processing Element (PPE) with eight identical Synergistic Processing Elements (SPEs), as shown in Figure 2. The PPE is a dual-threaded PowerPC core, and each SPE contains a high-speed processor with its own 256 Kbyte local store and DMA (Direct Memory Access) engine. Using the SPEs effectively is a key programming challenge when targeting the processor. While processing can be put on both PPE threads, the power of the processor is only unleashed when the SPEs are heavily utilized. Using the SPEs heavily means the software developer must overcome the hurdle of the SPE’s 256 Kbyte local storage.

The existing SAR implementation was not created to support processors with only 256 Kbytes of local memory. As we have discussed, a trivial implementation of the corner turn operation is possible on the quad PowerPC board because of the system’s large memory. On the Cell/B.E. processor, this trivial implementation is not possible. In addition, the range and azimuth processing in the previous implementation takes advantage of the ability to fit the entire SAR frame in memory. While this is also not possible on the Cell/B.E., the processor’s highly efficient vector processing on the SPE cores, along with its highly efficient interprocessor communication, means that a significant gain in frame rate is still possible if the developer can overcome the difficulties of programming the SPEs.

The first issue to tackle when moving this SAR application to the Cell/B.E. processor is fitting the range and azimuth processing onto the SPEs. In the process of stripmining, a large data buffer is processed chunk by chunk instead of en masse. The benefits of using vector-optimized routines are still realized because the chunk size is still large when compared to the size of the memory, but the data has been decomposed so that it actually fits in the SPE’s local storage. This technique is often used to reduce cache misses, and it is just as relevant to using the SPEs. Sophisticated programming tools can handle this stripmining of the data automatically without changing the code of the application. The user specifies that the data must be decomposed to fit on the target processors, and the tool analyzes the processing to determine how to do the decomposition and automatically adjusts the implementation accordingly.
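Stripmining by hand looks roughly like the following sketch. The helper names (`max_chunk_len`, `strip_mine`), the double-buffering factor and the amount of local store reserved for code and stack are all illustrative assumptions; only the 256 Kbyte local store size comes from the text. The sketch also assumes a kernel that treats each element independently, which is what makes chunk-by-chunk processing equivalent to processing the whole buffer.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local store size quoted in the article

def max_chunk_len(bytes_per_item, buffers=2, reserved_bytes=64 * 1024):
    """Largest chunk that fits, given hypothetical buffer and reserve counts."""
    usable = LOCAL_STORE_BYTES - reserved_bytes
    return usable // (bytes_per_item * buffers)

def strip_mine(data, chunk_len, kernel):
    """Apply kernel to data one chunk at a time instead of en masse."""
    out = []
    for start in range(0, len(data), chunk_len):
        chunk = data[start:start + chunk_len]  # copy a chunk into local storage
        out.extend(kernel(chunk))              # process it and copy results out
    return out

def scale(chunk):
    """Example element-wise kernel, standing in for a vector routine."""
    return [2 * v for v in chunk]
```

For example, `strip_mine(samples, max_chunk_len(8), scale)` would process a list of 8-byte samples in chunks sized to the assumed budget. An automated tool performs the equivalent decomposition without this loop appearing in the application code.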

The second issue to tackle is how to implement the matrix transpose. Because the data has been stripmined and is being processed chunk by chunk, the transpose must be altered to use the data chunk by chunk. While the trivial transpose implementation we used on the PowerPC is not possible, an equally trivial implementation is possible on the Cell/B.E. processor.

While the SPEs have 256 Kbytes of local storage, there is a large amount of unmapped memory available on the system that can be used to collect the chunks of data before performing the transpose. While the quad PowerPC board’s trivial implementation had no real drawbacks, collecting the data in unmapped memory forces the SPEs to remain idle while the PPE performs the transpose. This behavior can be seen in the Gedae Trace Table (Figure 3).

Each line at the top of the table represents a separate processing core in the Cell/B.E. Architecture. Four SPE cores are used to perform the range and azimuth processing; the gap in the middle is when the PPE is performing the data collection, transpose and data distribution required for this centralized implementation of a transpose, taking roughly 50% of the time. While this implementation of the SAR algorithm already outperforms the quad PowerPC implementation, reducing the centralization of the transpose operation will significantly increase performance, keeping the load on the highly efficient vector processing SPE cores and off the more general-purpose PPE core.

Because the data set is too large to be transposed entirely within the SPEs’ memory, and because collecting it all in unmapped memory has been shown to be inefficient, another approach is needed to perform the transpose. This decentralized implementation still uses unmapped memory, but the data is not collected there. The memory serves only as intermediate storage for a few chunks of data at a time as the chunks flow from one processor to another. Instead of requiring the PPE to transpose the data, we simply have the SPEs read each chunk from the unmapped memory in transposed order.
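The idea of reading in transposed order can be sketched as follows. This is a simplified model rather than the actual implementation: a flat Python list stands in for the intermediate memory, a strided slice stands in for the transposed-order read an SPE would issue, and the name `streamed_transpose` and its `tile_rows` parameter are hypothetical.

```python
def streamed_transpose(rows, tile_rows):
    """Transpose by staging a few rows at a time and reading them back strided."""
    rows = list(rows)
    width = len(rows[0])
    transposed = [[] for _ in range(width)]
    for start in range(0, len(rows), tile_rows):
        # Writer side: stage only a few rows in a flat, row-major buffer.
        tile = rows[start:start + tile_rows]
        staging = [value for row in tile for value in row]
        # Reader side: a strided read picks out one column of the staged
        # tile, so no separate transpose pass is ever run on the buffer.
        for c in range(width):
            transposed[c].extend(staging[c::width])
    return transposed
```

Only `tile_rows` rows occupy the staging buffer at any time, and the reordering happens as a side effect of how the readers address it, which is why no single core has to sit between the producers and consumers.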

This streaming of the chunks of data through the unmapped memory allows the matrix transpose to be performed in a more distributed manner. With this change, the entire SAR algorithm is performed on the eight SPEs of the Cell/B.E. processor, and each of the eight cores is very highly utilized. A Trace Table illustrating this dense execution is shown in Figure 4. With the decentralization of the matrix transpose, the frame rate of the Cell/B.E. implementation exceeds 135 Hz, approximately a 45x improvement over the quad PowerPC implementation.

While we have focused on the programming issues that were tackled to move this SAR benchmark from a DSP system to a radically different multicore processor, a wealth of other issues were addressed with full automation, requiring no work by the programmer. Threads were automatically created for the new target system, and the run-time thread scheduler was automatically launched on each core to manage those threads. The communication between cores was fully integrated into the threads, utilizing the DMA engine to provide efficient transfers, even when the implementation shifted from using four processing elements to nine. Optimized vector libraries were linked into the generated executables to provide the most efficient FFT and vector multiply implementations for the SPEs and PPE. Many of the mundane aspects of developing applications for the architecture were addressed by the tool, so the developer was free to concentrate on the interesting problems in getting the SAR algorithm onto the Cell/B.E. hardware.

As the porting of this SAR algorithm shows, automated implementation makes it easy to run the benchmark on new architectures, whether the change is in processor type, in the number of processors, or something as radical as moving to the Cell/B.E. processor. Because the partitioning and mapping can be done after coding is completed, a majority of the code is reused regardless of target processor, and the tool automates many of the optimizations needed to create an efficient implementation for that architecture. In addition, the analysis tools around the Gedae programming language let the developer easily identify issues in the implementation and quickly adapt it to the new topology. By being able to easily benchmark an application on radically different hardware, software developers will have much more freedom as to which processor is used for which task and can determine, through experimentation, which implementation provides the best performance.

Gedae
Moorestown, NJ.
(856) 231-4458.
[www.gedae.com].