SOFTWARE & DEVELOPMENT TOOLS
Multicore Software
Multicore – What’s the Big Deal?
Multicore is being talked about a lot and is getting plenty of attention in the press. Why do we need it, what is the state of multicore and how do we make it work?
SVEN BREHMER, POLYCORE SOFTWARE
Why do we need multicore? The simple answer is that with multicore we continue pushing the performance/power envelope through parallel processing rather than increasing the clock speed as we have done for many years. With multiple cores, the processors can run at lower frequencies, with lower supply voltages, and cores can be turned on and off based on the system load. This means that higher MIPS/watt can be achieved—that is, if the software executes in parallel.
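The trade-off can be sketched with back-of-the-envelope arithmetic. The sketch assumes the common first-order model in which dynamic CMOS power scales with switched capacitance times supply voltage squared times clock frequency. The capacitance, voltage and frequency figures are hypothetical values chosen for illustration, not measurements of any real part.

```python
def dynamic_power(cap_farads, volts, freq_hz):
    """First-order CMOS dynamic power model: P = C * V^2 * f."""
    return cap_farads * volts ** 2 * freq_hz


# Hypothetical figures, for illustration only.
CAP = 1e-9  # effective switched capacitance per core, in farads

# One core at 1 GHz and 1.2 V.
one_fast = dynamic_power(CAP, 1.2, 1.0e9)

# Two cores at 500 MHz each, which allows a lower supply voltage (0.9 V assumed).
two_slow = 2 * dynamic_power(CAP, 0.9, 0.5e9)

# Both configurations deliver the same total clock cycles per second,
# provided the software keeps both cores busy in parallel.
ratio = two_slow / one_fast
print(f"power of two slow cores relative to one fast core: {ratio:.2f}")
```

Under these assumed figures the two slower cores draw a little over half the dynamic power of the single fast core for the same aggregate cycle count. The saving disappears if the software cannot keep both cores busy.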
The availability of multicore-enabled applications will drive multicore silicon revenue, but existing applications are mostly single threaded. The current software infrastructure is almost entirely aimed at single-threaded, single-processor applications. Even though some embedded applications are multi-threaded, it is not easy to migrate them to multicore because the different threads, which executed in sequence on a single-processor system, may now execute at the same time, which requires precautions to avoid corrupting shared data. The lack of multicore software and standards presents a significant barrier to entry for multicore, which must be reduced to enable broad adoption and continued silicon revenue growth.
While parallel processing has been used in high-end applications (aerospace, defense, industrial and high-performance computing) for many years, it is a poorly understood concept in the broader markets now being “exposed” by the proliferation of multicore silicon. In PCs with two cores, going to four, it is currently relatively simple to take advantage of multicore by running multiple self-contained applications in parallel (SMP), though this may become more challenging as the number of cores increases.
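A minimal sketch of the shared-data precaution, using Python threads as stand-ins for tasks that may run at the same time on different cores. The class and function names are hypothetical; the point is only that the read-modify-write on the shared value is guarded by a lock.

```python
import threading


class SharedCounter:
    """A value updated by several threads; the lock guards each update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # The read-modify-write below must not interleave with another
        # thread's update, so it runs while holding the lock.
        with self._lock:
            self.value += amount


def run_workers(num_threads=4, iterations=10_000):
    counter = SharedCounter()

    def worker():
        for _ in range(iterations):
            counter.add(1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers()
print(total)
```

Code that was safe when its threads ran strictly one after another on a single processor may contain many such unguarded updates, which is why the migration is harder than it first appears.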
What Is the State of Multicore?
There is a variety of multicore silicon, from commercially available “standard” parts to custom SoCs. For simplicity, let’s look at standard parts, which offer:
• Homogeneous cores
• Heterogeneous cores
• A wide-ranging number of cores, from two to many
• Shared memory
• Local memory
• Different types of interconnect
Homogeneous is sometimes associated with Symmetric Multi-Processing (SMP) and heterogeneous with Asymmetric Multi-Processing (AMP). While there is some logic to that labeling, as SMP requires homogeneous cores and shared memory, I believe it is more relevant to look at application requirements when we talk about SMP and AMP as programming models; more on that below.
The number of cores on a die will influence the approach to parallel processing, as two cores can be quite powerful and have substantial memory per core, whereas many cores imply simpler cores and fewer resources per core.
Shared memory, commonly used in today’s multicore chips, has the advantage of being visible to all the cores, but the disadvantage of being shared between the cores, which creates contention for memory access, and limits scalability. Local memory has the opposite properties and will be more common as the number of cores per die goes up. A combination of shared and local memory can provide flexibility and scalability.
The bus is currently the most common interconnect; it is simple as all the cores are connected to the bus and can “see” one another. However, the bus doesn’t scale very well and will over time be combined with or replaced by other types of interconnects such as direct links between cores making up a network on chip (NoC) and others. A combination of a bus and other interconnect(s) can make migration from legacy systems easier and provide higher throughput and performance.
Software
Since the current software “infrastructure” is primarily targeted at single-processor systems, there is a lot of work to be done to make the software multicore “friendly.”
Some of the challenges are:
• Single-threaded applications – Lots of them!
• Choice of programming model
• Lack of multicore-enabled system software
• Lack of multicore software tools
• Lack of standards
Migrating single-threaded applications to multicore can be quite a challenge since C and C++, the most popular languages in embedded systems, are sequential (no parallel concepts), so the partitioning has to be done at the system level (not counting parallelizing compilers that operate primarily with data parallelism).
Which programming model to use, SMP or AMP, depends on several factors. These include the application’s requirements in terms of deadlines and throughput, and its characteristics: whether it offers opportunities for parallel execution, and whether it relies on global and static variables and pointers. One must consider the structure of the existing code: Is it modular or monolithic? Will two cores be enough for future scalability or will more be needed down the road? Then one must look at which hardware to select in terms of the types of cores, the memory architecture and interconnects and the available operating systems.
Most operating systems, except those that are SMP-enabled, control one processor. A multicore system may have multiple instantiations of one or even different types of OSs. This requires both efficient on-chip communication and dynamic management above the OS level. Commonly used communication stacks such as TCP/IP and UDP/IP are not well suited for communication in a closely distributed computing environment like multicore.
A few tools that would come in handy for multicore software development are:
• Communication topology generators/configurators
• Programming model tools
• Tools for application partitioning, a potentially very complex task
• Debuggers that can “focus” in and out to view information both at the detail and function levels and that can stop execution on selected cores, without “breaking” the application
• Simulators, which can focus in and out of scope
• Optimization tools
Standards that are specifically targeting the closely distributed computing environment are needed to enable the multicore ecosystem. Efforts are underway in the Multicore Association (http://www.multicore-association.org) to address this. The software is clearly lagging behind the hardware for multicore, presenting formidable challenges for developers and opportunity for vendors.
How Do We Make it Work?
The software approach depends on whether or not you can choose the hardware. Matching the processor (or hardware accelerator) to the task can make a significant difference in both performance and power consumption. Having multiple types of cores, for example a CPU and a DSP, gives you the flexibility to divide the application and match some portions to the CPU and some to the DSP. Heterogeneous cores also add complexity, as they may require different tools and/or operating environments. As a general guideline, a CPU is well suited to control tasks and a DSP to data processing tasks. In a mobile multimedia-enabled device like a smart phone, a simple first partitioning would be to assign control tasks such as keyboard control, the user interface and general office applications to the CPU, and signal processing and multimedia applications to the DSP.
You have to decide which programming model, SMP or AMP, to choose, which may be dictated by the hardware (SMP requires homogeneous cores and shared memory). Beyond the hardware, the choice depends on the application requirements and the structure of the existing code. If enough cores are available, a combination of the two models may be a good choice. Generally, an application with a larger number of functional modules or threads will be easier to scale to a larger number of cores.
Good performance on multiple cores requires efficient data movement between the cores to keep them working in parallel as much as possible (comparable to cache efficiency on a single processor). Equally important is that this communication can be done with a consistent API, such as the upcoming MCAPI from the Multicore Association, even if the underlying hardware and software change. Changes may include the number and types of cores, memory architectures, different physical (and logical) interconnects as well as operating systems. The Internet provides a macro scale analogy, with a variety of different types of computers (PCs, Macs, etc.), OSs (Windows, Linux, etc.) and interconnects, “speaking” TCP/IP.
Multicore communication is best handled by a lightweight interprocessor communication framework (IPCF) designed to address the requirements specific to a closely distributed multi-processing environment and with a consistent programming API. The programming API abstracts the underlying cores, memory architectures, interconnects and OSs. Key considerations are high throughput, low latency, predictability and small footprint. Because a multicore environment is static from a hardware perspective (the number of cores is fixed) and interconnects can be considered reliable, a multicore IPCF doesn’t require some of the dynamic functionality needed on the Internet. The static nature allows efficient lightweight IPCF implementations, such as PolyCore Software’s Poly-Messenger (Figure 1).
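To make the idea concrete, here is a minimal sketch of message passing over a statically sized channel. It is not the Poly-Messenger or MCAPI interface: the Channel class and its send and receive methods are hypothetical names, and Python threads stand in for two nodes. The channel capacity is fixed when the channel is created, reflecting the static nature described above.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way channel with a buffer count fixed at setup."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, msg):
        self._q.put(msg)  # blocks while the channel is full

    def receive(self):
        return self._q.get()  # blocks until a message arrives


def producer(ch, samples):
    for s in samples:
        ch.send(s)
    ch.send(None)  # end-of-stream marker (a convention of this sketch)


def consumer(ch, results):
    while True:
        msg = ch.receive()
        if msg is None:
            break
        results.append(msg * 2)  # stand-in for real processing


channel = Channel(capacity=4)
results = []
node_a = threading.Thread(target=producer, args=(channel, range(10)))
node_b = threading.Thread(target=consumer, args=(channel, results))
node_a.start()
node_b.start()
node_a.join()
node_b.join()
print(results)
```

The application code sees only send and receive. Whether the channel is backed by shared memory, a bus or a direct link between cores is hidden behind that interface, which is what allows the underlying hardware to change without touching the application.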

Additional functionality, such as dynamic system management functions or an object-oriented framework, can be layered on top of or use a multicore IPCF to take full advantage of a multicore system’s capabilities.
There is a need for more multicore-specific tools for tasks such as topology generation, debugging, partitioning, optimization and more. The “first-generation” multicore debuggers are available from a few different OS and development tools vendors. Which one to choose will probably depend on the silicon and/or OS and tools vendor(s) of choice.
It is important to be able to easily reconfigure the communication topology both for optimization purposes and to reuse application code on multiple and future hardware platforms. Separating the topology definition from the application makes reconfiguration easier, and an application that has been partitioned can potentially be remapped to different cores without modification.
A structured approach makes separating the topology configuration and application feasible, and a tool that generates the specified topology makes it easier and less error prone. The topology can be defined in terms of structure, such as the communication endpoints; the nodes, each of which is equivalent to a core (or an SMP cluster); the channels and priorities as well as the amount of resources assigned to the communication “components.” Defining the topology in a higher-level language such as XML makes it easy to reconfigure, reuse and integrate into higher-level tools.
An example of the flow from topology definition to application integration is outlined in Figure 2. The topology file is processed by the topology generator, which outputs the communication software topology in C. The C files are compiled and linked with the application and the IPCF run-time library, resulting in a communication-enabled application. Topology changes are handled by modifying the XML topology file. Because the topology is defined outside the application, it can be scaled up or down, from one core to even thousands, by regenerating it with the tool.
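The flow from topology file to generated C can be sketched as a toy generator. The XML element and attribute names, the C struct layout and the generate_c function are all assumptions made for illustration; they do not reproduce the schema or output of any actual topology tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical topology: two nodes and two channels between them.
TOPOLOGY_XML = """
<topology>
  <node id="0" name="cpu"/>
  <node id="1" name="dsp"/>
  <channel name="audio_in" src="0" dst="1" priority="1" buffers="8"/>
  <channel name="status" src="1" dst="0" priority="2" buffers="2"/>
</topology>
"""


def generate_c(xml_text):
    """Turn the topology description into a C configuration table."""
    root = ET.fromstring(xml_text.strip())
    nodes = root.findall("node")
    channels = root.findall("channel")
    lines = [
        "/* Generated from the topology file; do not edit by hand. */",
        f"#define NUM_NODES {len(nodes)}",
        f"#define NUM_CHANNELS {len(channels)}",
        "",
        "typedef struct {",
        "    const char *name;",
        "    int src, dst, priority, buffers;",
        "} channel_cfg_t;",
        "",
        "static const channel_cfg_t channels[NUM_CHANNELS] = {",
    ]
    for ch in channels:
        # int() rejects non-numeric attribute values early.
        fields = [int(ch.get(k)) for k in ("src", "dst", "priority", "buffers")]
        lines.append('    { "%s", %d, %d, %d, %d },' % (ch.get("name"), *fields))
    lines.append("};")
    return "\n".join(lines) + "\n"


c_source = generate_c(TOPOLOGY_XML)
print(c_source)
```

Moving a channel to a different node, or adding a third node, means editing the XML and regenerating. The application code that uses the channels by name does not change.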

Application partitioning is at this stage up to the developer, as there are no generally available tools. This can be a very challenging task, particularly if the existing application has many shared data structures, which will have to be protected with some form of lock to avoid accidental corruption, or pointers, which can create unexpected results in a multicore environment. A reasonable approach would be to use a stepwise procedure, starting with distributing existing tasks and/or “independent” functions among cores, as they are likely to be fairly well contained with defined inputs and outputs. If the first step is not sufficient, the next will be to break the application into smaller parts, distribute and evaluate, and so on.
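As one possible way to carry out the first "distribute and evaluate" step, the sketch below assigns existing tasks to cores with a simple greedy heuristic: largest task first, always onto the least loaded core. The heuristic is an assumption chosen for illustration, not a method prescribed here, and the task names and cost estimates are hypothetical.

```python
import heapq


def assign_tasks(task_costs, num_cores):
    """Greedy mapping: place each task, largest first, on the least loaded core."""
    heap = [(0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(heap)
    mapping = {core: [] for core in range(num_cores)}
    by_cost = sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True)
    for name, cost in by_cost:
        load, core = heapq.heappop(heap)  # least loaded core so far
        mapping[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    loads = {c: sum(task_costs[t] for t in ts) for c, ts in mapping.items()}
    return mapping, loads


# Hypothetical tasks with estimated load (for example, percent of one core).
tasks = {
    "ui": 10,
    "keypad": 5,
    "audio_decode": 40,
    "video_decode": 60,
    "network": 25,
}
mapping, loads = assign_tasks(tasks, num_cores=2)
print(mapping)
print(loads)
```

If the resulting loads are too uneven, or a single task exceeds what one core can deliver, that is the signal to move to the next step and break the application into smaller parts. The heuristic ignores communication cost between tasks, which a real partitioning would also have to weigh.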
The emergence of standards specifically targeting closely distributed computing will help to drive the availability of run-time software and tools, and I would encourage everyone who is or thinks they will be using multicore to participate in the Multicore Association’s efforts to enable the ecosystem!
We are still in the early stages of embedded multicore software, with some run-time software and development tools available, but much more is needed. If you are involved in some form of high-performance embedded computing, chances are that multicore is coming to a “project near you.” The “big deal” with multicore is not the concept of parallel processing, which has been around for many years; it’s that multicore is spreading parallel computing to the broader embedded market segments without a software infrastructure to support it.
PolyCore Software
Foster City, CA.
(650) 570-5942.
[www.polycoresoftware.com].

