TECHNOLOGY IN CONTEXT
Developing for Multi-Core Processors
Multi-Core Software: To Gain Speed, Eliminate Resource Contention
Even when applications are parallelized over multiple cores, performance can suffer from resource contention. Parallelizing the access to resources can eliminate much of the contention.
BY STEVE GRAVES, MCOBJECT
Deployment on multicore CPUs should make software faster by enabling tasks to run in parallel rather than waiting for a time slice on a single core. But in practice, parallel processing on multiple cores can introduce contention for system resources, making an application slower than when it is deployed on a single core. Achieving multicore’s hoped-for linear performance gains hinges on resolving this issue. By examining two examples of such a conflict and the proven solutions to them, we can get to the conceptual starting point for solutions to similar resource contention bottlenecks.
The first conflict is contention in updating a shared data store, and its solution is called multi-version concurrency control (MVCC). Until recently, nearly all database management solutions for embedded systems used a pessimistic concurrency model that locked all or portions of the database while a single task updated it, thereby blocking all other tasks from writing to the database (or portions of it) during the modification. In contrast, MVCC is an optimistic concurrency model in which no task or thread is ever blocked by another because each is given its own copy (version) of objects in the database to work with during a transaction. When a transaction is committed, its copy of the objects it has modified replaces the objects in the database. Because no explicit locks are ever required, and no task is ever blocked by another task with locks, MVCC can provide significantly faster performance in write-intensive applications where data must be changed often and where many tasks are running on multiple cores.
Under MVCC, when tasks want to update the same data at the same time, a conflict does arise and a retry will be required by one or more tasks. However, an occasional retry is far better, in performance terms, than the guaranteed blocking caused by pessimistic locking. Further, in most systems conflicts under MVCC are assumed to be infrequent because of the logically separate duties among tasks, that is, task A tends to work with a different set of data than tasks B, C and D, etc.
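The commit-or-retry step of an optimistic scheme can be sketched in a few lines of C. This is an illustrative sketch only (the record_t and try_commit names are invented, not part of any particular database API), and it compresses commit into a single version check: the task works on a private copy, and the commit succeeds only if the record's version number is unchanged since the task first read it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative record with a version counter; a real MVCC engine tracks
 * versions per object and installs whole object copies at commit. */
typedef struct {
    _Atomic unsigned version;   /* incremented on every committed update */
    int value;
} record_t;

/* Attempt to commit a new value computed from the version 'seen' at read
 * time. Succeeds only if no other task committed in the meantime; on
 * failure the caller re-reads and retries, as described in the text.
 * (For brevity, the version bump and the value store are not made atomic
 * together here; a real engine installs the new version as one unit.) */
bool try_commit(record_t *rec, unsigned seen, int new_value)
{
    if (atomic_compare_exchange_strong(&rec->version, &seen, seen + 1u)) {
        rec->value = new_value;   /* our private copy becomes current */
        return true;
    }
    return false;                 /* conflict: another task committed first */
}
```

A task that gets false back simply re-reads the record and tries again; that is the occasional retry described above, as opposed to the guaranteed blocking of pessimistic locking.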
Figure 1 compares MVCC to pessimistic locking, in operation. The diagram shows three tables in a database, each with five rows, and three tasks that are reading and/or modifying certain rows of certain tables. Task 1 is modifying Table 1’s row 3 (T1R3) and Table 2’s row 5 (T2R5). Task 2 is modifying T3R1, T3R3 and T3R5. Task 3 is reading T3R3 and modifying T1R5 and T3R2. Note that there are two copies (versions) of T3R3: one held by Task 2 and one held by Task 3.
Figure 1: Three tasks running in parallel have resulted in two versions of a database row (T3R3).
For purposes of this discussion, assume that all three tasks are started as close together in time as possible, but in the order Task 1, Task 2, Task 3. With MVCC, the three tasks run in parallel. With pessimistic locking, there are three possibilities: database locking, table locking and row locking. Each will exhibit different degrees of parallelism (but all less than MVCC).
Database locking: Task 1, Task 2 and Task 3 will be serialized. In other words, Task 1 will be granted access to the database while Task 2 and Task 3 are blocked as they “wait their turn.” When Task 1 completes its transaction, Task 2 will run while Task 3 continues to wait. Finally, Task 3 will run after Task 2 completes its transaction.
Table locking: Task 1 and Task 2 will run in parallel because Task 1 acquires locks on Table 1 and Table 2, while Task 2 acquires locks only on Table 3. Task 3 will block until Task 1 and Task 2 complete because it also needs a lock on Table 1 (which is locked by Task 1) and Table 3 (which is locked by Task 2). Task 3 will be blocked for the length of time required by Task 1 or Task 2, whichever is greater.
Row locking: Again, Task 1 and Task 2 will run in parallel because they operate on different tables (and hence on different rows). Task 3 will again block because Task 2 holds a write lock on T3R3, which Task 3 wants to read.
Any serialization effectively defeats the purpose of a multicore system: all but one core will be idle with respect to the utilization of the shared resource (in this instance, the database). However, strategies to maximize parallelism, such as MVCC or fine-grained locking, impose their own overhead. In the case of fine-grained (row) locking there is lock arbitration, which can be complex. In the case of MVCC, there is version management: creating object versions, merging them and discarding them. So for MVCC to be justified, the gain in parallelism has to outweigh the additional processing overhead. To illustrate, the graphs in Figures 2, 3 and 4 show the relative performance of McObject’s eXtremeDB in-memory database system on identical multithreaded tests executed on a multicore system, using its multiple-reader, single-writer (MURSIW, or database-locking) transaction manager and its multi-version concurrency control (MVCC) transaction manager.
In all three cases, the performance gains tail off. This is due to the increasing amount of housekeeping with each new thread—which becomes a greater percentage of the total processing time—and increased contention for data structures within the DBMS runtime. Systems less complex than a DBMS would probably see linear improvement extended farther to the right (i.e. to more cores). Conversely, there is virtually no processing overhead inherent in the MURSIW (database-locking) transaction manager—it is extremely lightweight, and performance remains nearly constant regardless of the number of concurrent threads.
A second example of resource contention in multicore, multithreaded systems involves the C/C++ heap. Normally, the C runtime library provides the malloc() and free() functions (or C++’s new and delete operators) that allow applications to allocate and release memory from/to the heap. Application developers tend to use these functions liberally, without thinking about how they actually work. What’s going on “under the hood” is that the heap is organized as a pool of free memory regions (often called free holes), each a contiguous range of addresses, referenced by a singly linked chain of pointers. The allocator services a request for memory allocation by traversing the chain looking for a large enough free hole.
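The traversal just described amounts to a first-fit search over the chain of free holes. The following is a simplified sketch (the free_hole_t and find_first_fit names are assumptions; real allocators also split blocks, coalesce neighbors and keep far more bookkeeping):

```c
#include <stddef.h>

/* One free hole in the allocator's singly linked chain. */
typedef struct free_hole {
    size_t size;             /* bytes available in this hole */
    struct free_hole *next;  /* next hole in the chain */
} free_hole_t;

/* Walk the chain and return the first hole large enough to satisfy
 * the request, or NULL if none is. */
free_hole_t *find_first_fit(free_hole_t *head, size_t request)
{
    for (free_hole_t *h = head; h != NULL; h = h->next)
        if (h->size >= request)
            return h;
    return NULL;
}
```

The cost of this walk is one reason allocation time varies; the next concern, though, is that the chain itself is shared state.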
In a multithreaded application, the allocator’s chain of pointers is a shared resource that must be protected. Chaos would ensue if a thread, while in the process of breaking the chain to insert a new link, was interrupted by a context switch and another thread tried to walk the (now broken) chain. To prevent this sort of interruption, the chain is protected by a mutex to prevent concurrent access and preserve the allocator’s consistency.
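Conceptually, that protection amounts to wrapping every heap operation in a single mutex, along these lines (a deliberately minimal sketch: production allocators take this lock internally, and some use finer-grained or lock-free schemes):

```c
#include <pthread.h>
#include <stdlib.h>

/* One global lock guarding the allocator's chain: while a thread holds it,
 * no other thread can walk or modify the chain. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_alloc(size_t n)
{
    pthread_mutex_lock(&heap_lock);    /* exclusive access to the chain */
    void *p = malloc(n);
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void locked_free(void *p)
{
    pthread_mutex_lock(&heap_lock);
    free(p);
    pthread_mutex_unlock(&heap_lock);
}
```

The consistency guarantee is exactly what causes the serialization discussed next: every allocating or freeing thread takes its turn on the lock.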
On single-core systems the latency caused by this locking is minimal, because the CPU is kept busy by processes unrelated to memory management. On a multicore system, though, a phenomenon arises that is similar to the one described above involving the shared data store: all but one core will be idle, with respect to the utilization of the shared resource. In other words, if the applications were doing nothing but allocating and freeing memory, the threads would be serialized due to the mutex. So, the extent of this problem for any particular application is directly proportional to the amount of dynamic memory management used.
In the data store example discussed previously, contention is solved by giving each process a copy of the relevant object to work with. Similarly, when threads “fight” over access to the memory allocator, an effective solution is a custom memory manager that provides a separate memory allocator for each thread, so that each allocator manages memory independently of the others. This approach is called a thread-local allocator.
The thread-local allocator avoids creating locking conflicts when objects are allocated and released locally, in one task/thread. When allocating memory in one task and de-allocating the same memory in another task, a lock is, of course, required. However, this allocator takes measures to minimize those locking conflicts.
To illustrate the benefit, we created a thread-local memory manager based on a block allocator. A block allocator imposes less metadata overhead than a standard list allocator. In use, it is given a quantity of memory and divides this chunk into equal-sized pieces, which it organizes into a linked list of free elements. Because blocks are guaranteed to be the same size, the block allocator, unlike the list allocator, does not need to keep track of the size of each block. A block allocator also eliminates fragmentation: because blocks are all of equal size, a freed block is not, and does not need to be, joined with adjacent free blocks.
The allocator creates and maintains a number of linked lists (called chains) of same-size “small” blocks that are made out of “large” pages. To allocate memory, the allocator simply “unlinks” the block from the appropriate chain and returns the pointer to the application. When a new large page is necessary, the allocator uses a general-purpose memory manager (standard malloc) to allocate the page.
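A minimal fixed-size block allocator along these lines might look as follows. All names are illustrative; for brevity the sketch manages a single block size, never returns pages to the system, and assumes block_size is at least sizeof(block_t):

```c
#include <stdlib.h>

/* A free block is reused to hold the chain link itself, so free blocks
 * cost no extra metadata. */
typedef struct block { struct block *next; } block_t;

typedef struct {
    block_t *free_chain;  /* chain of same-size free blocks */
    size_t block_size;    /* must be >= sizeof(block_t) */
} block_allocator_t;

/* Obtain a "large" page from the general-purpose allocator (standard
 * malloc) and carve it into equal-sized blocks linked into the chain. */
int grow(block_allocator_t *a, size_t page_bytes)
{
    char *page = malloc(page_bytes);
    if (page == NULL)
        return -1;
    for (size_t off = 0; off + a->block_size <= page_bytes; off += a->block_size) {
        block_t *b = (block_t *)(page + off);
        b->next = a->free_chain;
        a->free_chain = b;
    }
    return 0;
}

/* Allocation is just unlinking the head block; no size lookup needed. */
void *block_alloc(block_allocator_t *a)
{
    block_t *b = a->free_chain;
    if (b != NULL)
        a->free_chain = b->next;
    return b;
}

/* Freeing relinks the block; equal sizes mean no coalescing is required. */
void block_free(block_allocator_t *a, void *p)
{
    block_t *b = p;
    b->next = a->free_chain;
    a->free_chain = b;
}
```

Allocation and de-allocation are each a couple of pointer operations, which is what makes giving every thread its own instance so cheap.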
This algorithm does not require any synchronization as long as all objects are allocated and de-allocated locally by the same thread.
When objects are not de-allocated locally, the memory manager maintains a pending-free requests list (PRL) for each thread. When an object allocated in one thread is being de-allocated by another thread, the de-allocating thread simply links the object into its PRL list. Of course, PRL access is protected by a mutex. Each thread periodically de-allocates objects in its PRL. This can occur based on a timer, or when a certain number of requests are pending, or according to some other application-specific criteria.
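A PRL can be sketched as a mutex-protected singly linked list, roughly as follows (the names are assumptions; a real implementation would also record the owning thread and its drain policy):

```c
#include <pthread.h>

/* The freed object's own storage holds the link, so it must be at least
 * sizeof(pending_t) bytes. */
typedef struct pending { struct pending *next; } pending_t;

typedef struct {
    pthread_mutex_t lock;  /* taken only for cross-thread frees */
    pending_t *head;       /* pending-free requests for the owning thread */
    int count;
} prl_t;

/* Called by a *different* thread: hand the object back to its owner by
 * linking it into that owner's PRL. */
void prl_push(prl_t *prl, void *obj)
{
    pending_t *p = obj;
    pthread_mutex_lock(&prl->lock);
    p->next = prl->head;
    prl->head = p;
    prl->count++;
    pthread_mutex_unlock(&prl->lock);
}

/* Called periodically by the owning thread: detach the whole list in one
 * short critical section, then free each object locally via free_fn. */
void prl_drain(prl_t *prl, void (*free_fn)(void *))
{
    pthread_mutex_lock(&prl->lock);
    pending_t *p = prl->head;
    prl->head = NULL;
    prl->count = 0;
    pthread_mutex_unlock(&prl->lock);
    while (p != NULL) {
        pending_t *next = p->next;
        free_fn(p);
        p = next;
    }
}
```

Note that the owner detaches the entire list in one short critical section and then frees the objects without holding the lock, keeping contention on the mutex brief.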
It’s important to note that regardless of those criteria, the number of synchronization requests is reduced significantly by using a thread-local allocator. First, objects are often freed by the same thread that allocated them; second, even when an object is de-allocated by a different thread, that thread does not interfere with all other threads, but only with those that need the same PRL. The thread-local allocator is depicted in Figure 5. McObject performed two tests to examine the impact of the thread-local memory manager in allocating, and then freeing, memory for 10 million objects per thread.
Figure 5: A block allocator allocates and deallocates all objects locally and so needs no synchronization.
The first test compares performance of the thread-local allocator and the standard C runtime allocator (malloc) when the allocation pattern is local: all de-allocations are performed by the same thread as the original allocations.
The second test compares performance when objects are allocated by a “producer” thread and freed by a “consumer” thread, using both thread-local and standard (default) allocators.
Results of the first test, using 24 threads and running on a 24-core machine, are shown in Figure 6. Thread-local allocation results in a performance advantage of more than 600%. Note that the application’s 24 threads are all trying to allocate and then free 10,000,000 objects each, for a total of 240,000,000 alloc/free pairs (480,000,000 malloc/free calls). Because of the lock conflict, they all get serialized (run one at a time) when using the standard malloc allocator, which is why it takes more than 10 minutes.
Figure 6: Performance of thread-local allocator vs. standard C malloc.
In contrast, the thread-local allocator has no conflicts for “local” allocations, so all 24 threads run in parallel.
Figure 7 depicts the results of the second test, in which every object is freed by a thread other than the one that allocated it. This test used just two threads, in order to isolate the performance difference to the reduced synchronization requirements of the thread-local allocator when all allocations are “global.” Overall time to complete is less than in the first test because fewer threads are running (2 vs. 24). The performance gap between the thread-local and standard allocation mechanisms is also smaller than in the first test, due to the overhead associated with the “handoff” of responsibility for de-allocation from one thread to another. But the thread-local allocator, by reducing synchronization requests via the PRL, still turns in a 14% performance gain.
Figure 7: Global allocation/deallocation performance of TLA vs. malloc.
A real-world application would have a blend of “local” and “global” allocation patterns, and would exhibit a performance difference somewhere between the two charts depicted here. The results would be more like Figure 6 if the preponderance of the pattern is “local,” and more like Figure 7 if the pattern is more “global.”
Multicore CPUs add the potential for performance gains by allowing threads to execute in parallel. But software often lags in taking advantage of this capability. Even programs that use multitasking and multithreading can be hobbled when these processes are forced to contend for a system resource.
In such situations, overall application performance can depend heavily on its efficiency in granting access to the resource. The examples above should be recognizable to programmers as involving typical shared resources: application data, and the software’s memory allocation function. In both cases, significant performance gains are obtained through a solution that makes a copy of the resource available to each task. While processing is (sometimes) required to merge or pass results, this overhead is easily outweighed by the increase in efficiency that results from elimination of resource contention.