SYSTEM INTEGRATION
Developing for Multicore Systems
The Multicore Developer's Toolbox
Multicore processors offer great potential for increased performance and power savings. Realizing that potential is a challenge, one that a new generation of enhanced development tools can help meet.
BY DAVID N. KLEIDERMACHER, GREEN HILLS SOFTWARE
All major embedded processor architectures (ARM, Intel, MIPS and Power) have spawned multicore processors, and designers across a range of industries are snapping them up for the promise of improved performance efficiency. Developers, however, face a new set of design, debug and run-time management challenges that are specific to multicore processors, starting with how to exploit their ability to execute multiple threads concurrently.
Taking advantage of multicore concurrency usually requires the programmer to determine which parts of the application can be parallelized or offloaded, and to write the code so that this parallelism is explicit. For example, the developer can place code into threads that a symmetric multiprocessing (SMP) operating system will then schedule to run concurrently. Using standards-based multithreading such as POSIX helps keep multicore software portable and reusable across projects.
POSIX is a collection of open standard APIs for operating system services, specified by the IEEE. POSIX threads, or Pthreads, is the part of the standard that deals with multithreading. The Pthreads APIs provide interfaces for run control of threads, synchronization primitives and interprocess communication (IPC) mechanisms. While other multithreading standards exist, Pthreads is the most generic and widely applicable. Pthreads is supported by Linux, UNIX, and a wide range of embedded operating systems such as Integrity, LynxOS and QNX. Even Windows has offered a POSIX interface.
Due to the ubiquity of POSIX, a very large base of application code can be reused for embedded designs. Another strong advantage of POSIX is that independent conformance validation is available from the Open Group, and many POSIX implementations have been certified conformant to the latest POSIX specification. By programming to the POSIX API, developers can write multithreaded applications that can be ported to any multicore platform running a POSIX-conformant operating system. POSIX is a natural fit for multicore embedded systems because embedded software is often written from scratch to be multithreaded. Furthermore, add-on software components can often be mapped easily to individual threads. For example, a TCP/IP network stack may execute within the context of a single POSIX thread, and the same goes for a file system server, an audio application and so on. POSIX conformance is a requirement for any operating system that expects to be used widely in multicore systems.
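The create, synchronize and join pattern that Pthreads provides can be sketched briefly. The example below is an illustration only, written in Python for brevity; CPython's threading module sits on top of Pthreads on POSIX systems, and the comments name the Pthreads calls that play the same role in C. The function name parallel_sum and its parameters are hypothetical. In the default CPython build the global interpreter lock keeps Python bytecode from running on several cores at once, so the sketch shows the structure of explicit parallelism, not the speedup.

```python
import threading

def parallel_sum(values, num_threads=4):
    """Sum `values` by giving each worker thread one slice of the input."""
    total = 0
    lock = threading.Lock()          # plays the role of a pthread_mutex_t

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)         # private work, done without holding the lock
        with lock:                   # serialize only the update of shared state
            total += partial

    # Slice i takes every num_threads-th element, starting at index i.
    threads = [
        threading.Thread(target=worker, args=(values[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()                    # comparable to pthread_create
    for t in threads:
        t.join()                     # comparable to pthread_join
    return total
```

Each worker touches shared state only inside the lock, which is the discipline an SMP operating system relies on when it runs the threads on different cores at the same time.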
Message passing has long been used to implement parallel computing, mainly because the multicomputers historically used to host massively parallel scientific computations lacked a shared memory subsystem. Instead, data for parallel computations is sent to the parallel cores using IPC, with the same IPC serving as a synchronization mechanism. Although embedded applications may differ from their scientific brethren, IPC is usually required to implement multiprocess systems.
IPC comes in many flavors. In the scientific community, MPI (Message Passing Interface) is a widely used standard. POSIX, of course, specifies a variety of mechanisms, including pipes, FIFOs and sockets, that were designed for loosely coupled IPC. The multicore developer’s toolbox must often include a variety of IPC choices. If the application has enough headroom to handle the overhead of an underlying network stack, then POSIX sockets is arguably the most ubiquitous of all basic IPC mechanisms.
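As a rough illustration of sockets carrying both data and synchronization, the sketch below passes work to a worker over a connected socket pair and blocks until each reply arrives. It is a minimal example under stated assumptions: the function name square_over_socket and the newline-delimited message format are hypothetical, and the worker is a thread here where a real multiprocess design would use a separate process. Python's socket.socketpair creates a local pair (AF_UNIX on POSIX systems), so no TCP/IP stack is involved in this particular case.

```python
import socket
import threading

def square_over_socket(numbers):
    """Send each number to a worker over a socket and collect the squared replies."""
    parent, child = socket.socketpair()      # two connected endpoints

    def worker():
        with child, child.makefile("r") as requests:
            for line in requests:            # blocks until a request arrives
                n = int(line)
                child.sendall(f"{n * n}\n".encode())

    t = threading.Thread(target=worker)
    t.start()
    results = []
    with parent, parent.makefile("r") as replies:
        for n in numbers:
            parent.sendall(f"{n}\n".encode())
            # Reading the reply doubles as synchronization: the sender
            # cannot proceed until the worker has finished this item.
            results.append(int(replies.readline()))
        parent.shutdown(socket.SHUT_WR)      # end of input; the worker sees EOF
    t.join()
    return results
```

The same request and reply code works unchanged if the two endpoints are later moved to separate processes or separate machines, which is much of the appeal of sockets as a basic IPC mechanism.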
Multicore Run-Time Management
Symmetric multiprocessing is one of the more promising multicore tools for embedded systems. If the application is already multithreaded, as is the case with most embedded designs, then an SMP operating system will simply schedule concurrent threads to run on the extra cores in the system. SMP can therefore provide dramatic performance increases without requiring software modifications. SMP is now quite common in embedded operating systems. For example, Linux, Green Hills’ Integrity and Wind River’s VxWorks all support SMP.
One of the challenges with SMP is the fundamental difference in scheduling behavior relative to unicore real-time systems. Because multiple processes may be running concurrently, the use of process priority to guarantee execution time may not be sufficient. Green Hills’ Integrity operating system provides an optional alternative approach in which application processes can be assigned guaranteed percentages of execution time across all cores in the system. Within an application, threads are prioritized using standard priorities and time slicing.
Maximizing cache usage is critical to the performance and power profile of most embedded systems. Most SMP systems consist of cores that have independent on-chip caches. When a thread migrates from one core to another, it loses its cache locality: its code and data must be reloaded into the new core’s cache. When a thread needs to run (such as when it is the highest-priority runnable thread) and more than one core is idle, the SMP operating system must intelligently choose which core to use. The operating system should keep track of a thread’s natural affinity, defined as the core on which the thread last executed. Assigning threads to the cores that match their natural affinity minimizes migrations and cache misses, so the embedded software exhibits better performance and power efficiency.
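The core-selection rule described above can be sketched in a few lines. This is a simplified illustration, not the policy of any particular operating system: the function name pick_core and its arguments are hypothetical, and a real scheduler would also weigh priorities, load and cache topology.

```python
def pick_core(idle_cores, last_core):
    """Choose a core for a runnable thread, preferring its natural affinity.

    idle_cores: collection of core IDs that are currently idle.
    last_core:  core the thread last ran on, or None if it has never run.
    Returns the chosen core ID, or None if no core is idle.
    """
    if not idle_cores:
        return None                 # nothing is free; the thread stays queued
    if last_core in idle_cores:
        return last_core            # the thread's code and data may still be cached here
    return min(idle_cores)          # arbitrary but deterministic fallback; forces a migration
```

For example, pick_core({0, 1, 2}, 2) returns 2 because the thread's last core is idle, while pick_core({0, 1}, 3) returns 0 and accepts a migration because core 3 is busy.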
User-defined affinity, which enables specific threads to be bound to specific cores, is another useful tool for developers. It can improve latency as well as overall system efficiency in certain scenarios. For example, a device driver thread can be bound to the core that handles the device’s interrupts, avoiding inter-processor interrupts (IPIs), which increase latency. In another scenario, multiple threads that cooperate on a particular job (e.g. by using shared data structures) are given the same core affinity, again to minimize IPIs and maximize cache utilization. The SMP operating system typically provides a system call to assign core affinity.
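As one concrete instance of such a call, Linux exposes sched_setaffinity, which Python wraps as os.sched_setaffinity. The sketch below binds the caller to a single core and returns the previous mask so it can be restored. The helper name pin_to_core is hypothetical, the call is not available on every platform, and other operating systems use different interfaces; in C on Linux, the non-portable pthread_setaffinity_np plays a comparable role for individual threads.

```python
import os

def pin_to_core(core):
    """Bind the caller to a single core and return the previous affinity mask.

    Relies on os.sched_setaffinity, which Python provides only on some
    platforms (notably Linux); elsewhere this raises NotImplementedError.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("no sched_setaffinity on this platform")
    previous = os.sched_getaffinity(0)       # 0 refers to the caller
    if core not in previous:
        raise ValueError(f"core {core} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {core})          # restrict the caller to one core
    return previous
```

A driver-style thread could call pin_to_core with the core that services its device's interrupts, then pass the returned mask back to os.sched_setaffinity when the binding is no longer needed.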
Multicore processors such as the Freescale P4080, Intel Core 2 and ARM Cortex-A9 provide some form of hardware virtualization acceleration. Multicore architectures can improve the usability of hypervisors. For example, on a dual-core system, a separate virtual machine can be bound to each core, enabling a guaranteed quality of service for each guest. Using an enhanced Type-1 hypervisor (Figure 1), real-time applications can be assured optimal response time by executing on a core independent of guest operating environments.
Figure 1: An enhanced Type-1 embedded multicore hypervisor
Another use case for multicore hypervisors is more flexible power management. Guest operating systems can execute on all system cores when applications require maximum CPU availability. However, because the hypervisor can map guests to virtual cores, the guests can be executed on fewer physical cores when there is less work to do, enabling the remaining cores to be fully powered off. The spread of multicore devices is likely to accelerate the adoption of hypervisors: symbiotic growth for two disruptive technologies.
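To make the consolidation idea concrete, the sketch below packs virtual cores onto as few physical cores as their current load allows and reports which physical cores are left empty and could be powered off. It is a hypothetical illustration, not a description of how any particular hypervisor works: the function name pack_virtual_cores, the load figures and the use of a first-fit decreasing heuristic are all assumptions, and a real hypervisor would also have to honor latency and quality-of-service guarantees.

```python
def pack_virtual_cores(loads, num_cores, capacity=100):
    """Place virtual cores on as few physical cores as possible.

    loads:     dict mapping a virtual-core name to its load, in percent
               of one physical core.
    num_cores: number of physical cores available.
    capacity:  maximum total load, in percent, allowed on one physical core.
    Returns (placement, idle): placement maps each physical core to the
    virtual cores assigned to it; idle lists physical cores left empty.
    Raises ValueError if a virtual core does not fit anywhere.
    """
    used = [0] * num_cores
    placement = {core: [] for core in range(num_cores)}
    # First-fit decreasing: consider the heaviest virtual cores first.
    for name, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        for core in range(num_cores):
            if used[core] + load <= capacity:
                used[core] += load
                placement[core].append(name)
                break
        else:
            raise ValueError(f"no physical core has room for {name}")
    idle = [core for core in range(num_cores) if not placement[core]]
    return placement, idle
```

With four lightly loaded virtual cores (say 40, 30, 20 and 10 percent) on a four-core system, everything fits on physical core 0 and cores 1 through 3 come back as idle. With four virtual cores at 90 percent each, every physical core is occupied and none can be powered off.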
Multicore Debugging Tools
Multicore processors often provide a single on-chip debug port such as JTAG that enables a host debugger, connected with a hardware probe device, to debug multiple cores simultaneously. With this capability, developers can perform low-level, synchronized run control of the multiple cores. Board bring-up and device driver development are two common uses of this type of solution. For efficient use of this multicore hardware facility, the development tool must enable the developer to visualize all the cores of the system and choose any combination of the cores to debug, each optionally in its own window. At the same time, the tool must provide controls for synchronized running and halting of the debugged cores.
With run-mode debugging, the cores are never stopped. Instead, the debugger controls application threads through a communications channel (usually Ethernet) between the host PC and a target-resident debug agent. For efficient use of this facility, the operating system must provide an integrated debug agent (and the associated communications device drivers) that is operating system aware and provides flexible options for interrogating the system. For example, the debug agent should let the debugger work with any combination of user threads on any core, whether the cores are homogeneous or heterogeneous.
The user needs to be able to set specialized breakpoints that enable user-defined groups of threads to be halted when another thread hits the breakpoint. Some classes of bugs require this fine-grained level of control. To be able to halt threads on a core separate from the core running the thread that hits the breakpoint, the operating system must handle all the behind-the-scenes communication that informs the appropriate core, with minimal latency, of the event.
Many operating system vendors provide an event analysis tool. The event analyzer is an indispensable tool for developers of multithreaded software because it makes it easy to understand system behavior and locate performance bottlenecks, livelocks, or other problems. A target-resident agent logs important operating system level events, such as service calls, interrupts, context switches and user-defined events. The tool uploads this event log (either during execution or post-mortem) and displays the events in a timeline. It lets the user zoom, select specific events for further information, and generate statistical reports on execution, among other functions.
The event analyzer is even more critical for a multicore design. The event analyzer must be able to show events for all threads on all the cores, with the event streams synchronized to the same time scale. The tool must be able to display IPC between the cores. Green Hills Software’s EventAnalyzer product is one example of a tool that meets these requirements.
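The requirement that event streams from all cores share one time scale can be illustrated with a small merge. The sketch below assumes each core keeps its own log, already ordered by a timestamp taken from a clock common to all cores, and combines the logs into a single timeline. The Event fields and the function name merge_timelines are hypothetical and do not reflect the log format of EventAnalyzer or any other product.

```python
import heapq
from typing import NamedTuple

class Event(NamedTuple):
    timestamp_us: int    # microseconds on a clock shared by all cores (assumed)
    core: int            # core on which the event was logged
    thread: str          # name of the thread involved
    kind: str            # e.g. "context_switch", "interrupt", "ipc_send"

def merge_timelines(per_core_logs):
    """Merge per-core logs, each already sorted by time, into one timeline."""
    # heapq.merge walks the sorted inputs lazily and yields events in
    # timestamp order; ties keep the order of the input logs.
    return list(heapq.merge(*per_core_logs, key=lambda e: e.timestamp_us))
```

Once the events are in one ordered list, an IPC send on one core and the matching receive on another appear in sequence, which is what lets a timeline display show communication between cores.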
There can be little argument that embedded designs are going multicore. These systems are more complex to develop, manage and debug. To meet these needs and help designers realize the performance benefits of multicore without losing time-to-market, systems software must meet a challenging set of requirements. Vendors must provide tools that aid in the portability of multithreaded code, operating system and virtualization features that make it easier for application software to be allocated efficiently across multiple cores, and development tools that provide complete visibility into and control over the distributed software.
Green Hills Software.
Santa Barbara, CA.
(805) 965-6044.
[www.ghs.com].

