Realizing the Potential of Multicore Architectures
Retooling Applications to Ride on Multiple Cores
Taking advantage of the real performance potential offered by multicore processors is an ongoing quest. A new tool can dissect programs into multiple threads that run across multiple cores—with significant performance gains.
TOM WILLIAMS, EDITOR-IN-CHIEF
It goes without saying that today multicore processors are all the rage. As silicon manufacturers attempted to obtain increased performance by simply flogging the clock faster, they ran up against the very real limits of power consumption and heat dissipation. But advances in process technology and reduced geometries have enabled us to put multiple cores on the same silicon die and—hopefully—parallelize our software so that it effectively runs faster. But how successful has that really been? And have the perceptible performance gains really been indicative of the potential that is there, if it could be fully exploited? We don’t really know.
Back in 2008, Steve Jobs, never one to mince words, was quoted by the New York Times as saying, “The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it.” Of course, Steve, that is not going to stop us from trying. So among the questions: how far have we come in exploiting this potential, and where can we still go?
Getting the Juice out of Multicore
First of all, the utilization of multicore architectures is taking place on several levels. Among these are the operating system and a variety of hypervisors, which are used to virtualize the multicore environment so that it can be used by more than one operating system. The next level, which is just starting to open up and of course offers the biggest potential, is the ability of individual applications to run smoothly across multiple processor cores.
At the operating system level, for example, Windows 7 running on a quad-core Intel Core processor does take advantage of the multiple cores to the extent it can. It can launch an application on one core, drivers on another, and it can assign a whole host of background processes, of which the user is mostly never aware, across other cores. The user has no knowledge of or control over where these processes all run, but the fact that the operating system can do this does significantly increase the performance of the overall system.
Hypervisors, which are basically software entities that manage virtual machines so that multiple guest operating systems can run on the same hardware platform, come in several forms. John Blevins, director of product marketing for LynuxWorks, offers a scheme for classifying hypervisors based on their functionality. Type 2 hypervisors are basically emulation programs that run under a host OS. This would allow the running of a guest operating system such as Windows under a host OS like Linux. Type 1 hypervisors are computer emulation software tightly integrated with an OS to form a single “self-hosted” environment that runs directly on the hardware. The lines between Type 1 and Type 2 hypervisors can become a bit blurred.
Therefore, Blevins offers a third designation he calls the Type Zero hypervisor. This type of hypervisor is built to contain only the absolute minimum amount of software components required to implement virtualization. These hypervisors are the most secure type of hypervisor as they are “un-hosted” and are dramatically smaller than Type 1 and 2 hypervisors. Since the Type Zero hypervisor runs directly on the hardware below the operating system, it can assign a given OS to a designated core or cores giving that OS complete protection from code running on other cores (Figure 1). Thus an OS that is multicore-aware could run on its assigned cores just as if it were on its own hardware platform. Windows 7, for example, would function in such a hypervisor environment exactly as it does on a multicore PC—within its assigned space. None of this, however, addresses the need to spread the actual applications across multiple cores to improve their performance.
The LynxSecure separation kernel/hypervisor from LynuxWorks is being called a Type Zero hypervisor because it runs directly on the hardware below the operating systems. It can assign a given OS to a core or cores and ensure that it runs exclusively on its assigned hardware.
Parallelizing the Application
In approaching the goal of actually getting everyday application programs to take advantage of multiple cores, the object will naturally be to first determine which parts of the code can be targeted to run on independent cores while maintaining their communication and synchronization with each other and with the application as a whole. Multithreaded code lends itself to analysis and parallelization, but the question of how to approach the task is nontrivial. Steve Jobs was almost certainly correct in believing that nobody knows how to sit down at a computer and start writing multicore code from scratch. You have to establish the structure of the program and then decide how to approach parallelization through analysis.
This is the approach of the Pareon tool recently announced by Vector Fabrics. A normal application that has not been parallelized will only run on one core. The approach is to take the complete—or fairly complete— written and debugged source code and to understand it well enough to pick out those elements that will actually provide enhanced performance if they are parallelized. That normally means that the performance increase should significantly exceed the amount of overhead involved in parallelizing and such things as delays for communication and synchronization.
This is fundamentally different from what a hypervisor does. A hypervisor does not touch the code but provides virtual contexts in which it can run. Currently, Pareon supports the x86 architecture and the ARM Cortex-A9, mostly in a Linux environment. It can work with C and C++ programs, including programs that use binary libraries or access databases. It can handle programs that include non-C/C++ code such as assembler, but can only parallelize those sections written in C/C++.
The process starts by building the program and running it on a model of the target hardware architecture. This allows insight into deep details of behavior including such things as bottlenecks, cache hits and misses, bus bandwidth and traffic, memory access times and more. After running this executable on the model, which is intended to give deep understanding of how the processor architecture and the code interact, Pareon performs dynamic analysis to gain information that will support deciding how best to go about an optimal parallelization. These are the initial steps of a three-stage process consisting of what Vector Fabrics describes as, “insight, investigation and implementation.”
Having done the basic analysis, Pareon moves the user to the investigation stage, which actually involves more deliberate analysis by the user of the data gathered during the insight phase. For example, coverage analysis is needed to ensure that the execution profile of the application is as complete as possible so as not to miss any parts that would gain significantly from parallelization. In addition, a profiling analysis keeps track of where the actual compute cycles go. This can reveal hotspots and provide a basis to explore and find the best parallelization strategy.
For example, the user can select a given loop and investigate the parallelization options. The tool will alert the user to possible dependencies that would restrict partitioning. Otherwise, it presents options for loop parallelization from which the user can choose, such as the number of threads. From this the tool can show the overall performance increase, taking into account overhead involved with such things as spawning and synchronizing tasks or the effects of possible memory contention and data communication (Figure 2).
This is a relatively complex loop with a number of dependencies, including communication, but would gain significantly from parallelization. In the bar above, the red shows the overhead that would be introduced compared to the execution time. The speedup is indicated on the left.
Of course, not all loops are good candidates for parallelization. If threads have to wait on each other so that one thread must complete before the next can start, parallelization is not going to be a great benefit. So a good deal of the developer’s effort is devoted to understanding the code and making informed choices about what to parallelize and what not to (Figure 3). To help with such decisions there is a performance prediction feature that immediately shows the impact of parallelization on program performance.
While this loop could be parallelized, the view shows that the threads mostly wait on each other and therefore would offer very little performance advantage if parallelized.
Once the developer has gone through and examined the parallelization options, made his or her choices and gotten a view of the overall improvement, it is time to refactor the code to implement the chosen improvements. Pareon keeps track of where the developer has chosen to add parallelism, and once the strategy has been selected presents the user with detailed step-by-step instructions on how to implement that parallelism. Thus, there may not be an automated button, but every step needed to properly modify the source code is given.
Once the code is refactored, it can be recompiled using a normal C or C++ compiler. In fact, Pareon integrates with standard development tools so that familiar debuggers, profilers, text editors and other tools are available for subsequent tweaks and examination of the code. Such steps will probably be needed if the code is intended for embedded applications where timing constraints, interrupt latencies and other issues affecting deterministic behavior are critical.
In fact, Vector Fabrics is not currently targeting the tool at hard real-time applications, but more at the tablet, handheld mobile and smartphone arena. The temptation to use it for embedded applications will no doubt be strong and it seems clear that this avenue will also be pursued by other players if not by Vector. The push to truly take advantage of the possibilities offered by multicore architectures is increasing—as is the number of cores per die—and we can expect further innovation in this dynamic arena.