Hypervisors and Virtualization

Multicore Performance Requires NUMA-Aware Hypervisor Design

Smart management of memory in multicore systems boosts performance and improves determinism



Computing power keeps increasing year over year at an impressive rate. This growth was first achieved by running at ever higher clock frequencies, then by moving to smaller process structures, and now by steadily increasing the number of CPU cores in a package.

Looking back only about ten years, x86 single-core processors burning somewhere between 30 and 60 watts were top of the line in embedded system design. Now quad-core processors do the job in industrial PCs while requiring less than 10 watts. But while processors have grown rapidly in performance, computer memory has had a hard time keeping up. It is important to understand that, in addition to computing power, memory access time and throughput play an essential role in total performance. Because CPUs can process data much faster than they can access main memory, memory access often turns out to be the bottleneck in a system. When only a small amount of data is processed, and when multiple levels of cache are available, this performance-limiting effect is greatly reduced and therefore hardly noticeable. But when large amounts of data are processed, problems arise.

In contrast to the multiple gigabytes of main memory, caches can hold only some kilobytes (or at most a few megabytes) of data. When data is not found in the cache, a so-called "cache miss" occurs: the processor has to access main memory to load the data, and in doing so it pushes already-cached data out of its caches. The evicted data often includes data required by time-critical processes. Because this data has to be reloaded later, determinism suffers greatly, which in many embedded systems is unacceptable. In a real-time operating system this phenomenon also leads to increased latencies: processing of interrupt requests may be delayed if the code associated with an interrupt has to be reloaded from memory, and in addition to high interrupt latencies, prolonged task-execution times in general can be observed.

In a perfect world, time-critical real-time code would always be kept separate from non-deterministic code. A good example is the deployment of a non-real-time operating system like Microsoft Windows in parallel with a real-time operating system (RTOS). In this case a hypervisor can address jitter caused by competing cache and memory access by giving the RTOS priority over Windows. The RTS Hypervisor from Real-Time Systems, for example, has supported this functionality since early 2014, and other hypervisors on the market are sure to follow.

On high-end systems, where multiple real-time operating systems are run and multiple processor sockets are available on a board, latencies due to memory access can be even further optimized or eliminated by using a hypervisor optimized for Non-Uniform Memory Access (NUMA).

In a NUMA system, every processor (holding anywhere from four to 36 logical x86 CPUs) has its own on-chip memory controller for accessing main memory. All processors on a board are linked to each other via fast point-to-point interconnects. An individual processor together with its local memory bank(s) is called a NUMA node. Figure 1 illustrates an example of a NUMA system similar to the one used in the measurements below.

Figure 1
In NUMA systems, every processor has its own memory controller on chip.

In a NUMA system, memory access time depends on the memory modules involved, or more precisely, on which processor the physical memory bank in question is connected to. Each processor can access any memory module on the board via the processor-to-processor interconnects, but accessing local memory banks through the processor's own memory controller is of course faster. Thinking this through, the advantage seems obvious: if every CPU processed its data in its own local memory, no processor would impact any other, neither through cache misses nor through competing access to the same memory controller.

In order to optimize memory usage, some operating systems feature "NUMA support," allowing applications to be bound exclusively to particular NUMA nodes. For some applications this approach works well, but as always, the devil is in the details: when analyzing operating system functionality (such as scheduling, the page fault handler, and device drivers including interrupt processing), chances are that some operating system code is executed on a different node than the one requiring that functionality at the time. For number-crunching applications this usually does not cause much of an impact, but in real-time applications it is exactly these system calls that make up an essential part of the code. If execution of applications is really to remain independent, clean partitioning of resources is absolutely required. Consequently, this can only be achieved by running multiple operating systems in parallel.

Starting a separate operating system on each of the physical processors can be accomplished using a Type-1 hypervisor, one that runs directly on the hardware without a host operating system, also called a "bare-metal hypervisor." Still, this alone does not satisfy the above requirements. To provide determinism and maximum performance, the hypervisor in use has to be NUMA-aware, meaning that memory allotment, by physical address, has to be done correctly both for the hypervisor itself and for each guest operating system. This is the only way to really avoid the problems described above caused by parallel execution and concurrent memory access.

Because NUMA-aware hypervisors for real-time use were not available in the past, the RTS Hypervisor was extended with specific NUMA functionality and configuration capabilities. The previous RTS Hypervisor offered two modes for running guest operating systems (guest OS): a "fully virtualized" mode, typically used for running unmodified general-purpose operating systems (GPOS), and a so-called "privileged mode" for executing operating systems deterministically, with direct hardware access, in hard real time.

While both modes allowed direct assignment of available processor cores, logical CPUs and devices to operating systems, memory partitioning optimized for NUMA was not possible. Neither mode provided the desired behavior: in fully virtualized mode, the hypervisor adds a software layer between guest OS and hardware, potentially adding latencies, affecting determinism and introducing jitter. In privileged mode, on the other hand, hard real time and direct hardware access are guaranteed, but without virtualization the physical memory location of a guest OS cannot be chosen at will.

Typically, an operating system is linked to a specific base address and usually needs 32-bit addressable memory somewhere below 4 GB. To support the NUMA architecture, a third mode of the RTS Hypervisor was required: a privileged mode with a virtual memory management unit (MMU). This mode adds minimal virtualization, used only to virtualize the guest operating system's memory access. In virtual MMU mode a guest OS can be moved freely to any location in physical memory, even allowing, for example, a 32-bit operating system to run at a location far beyond its normally addressable range. The virtual MMU mode therefore makes it possible to execute multiple operating systems that require the same physical addresses for their kernels simultaneously, while using all of the available physical memory in the system.

The underlying technology is called "second-level address translation" and is available in current virtualization-capable processors from both Intel and AMD. Intel calls it Extended Page Tables (EPT); AMD calls it Rapid Virtualization Indexing (RVI), or nested paging. In virtual MMU mode the hypervisor guarantees that execution of a guest OS is not interrupted by hypervisor code, i.e., no additional latencies are introduced. Only if a guest OS were to access memory outside its pre-configured address ranges would the hypervisor step in. To avoid any influence on other NUMA nodes, all page tables and structures a guest OS requires are automatically loaded by the hypervisor into memory of the same NUMA node.

The Results

When designing real-time critical systems, assumptions and theoretical considerations never replace real-world measurements and long-term tests, which is why the following tests were performed on a real system:

The test platform carries four ten-core Intel Xeon E7 processors with Hyper-Threading, linked to each other via QuickPath Interconnect (QPI) as pictured above. Each QPI link offers a maximum transfer rate of 6.4 gigatransfers per second, which translates into 16 GBytes per second given the 20-bit width of a QPI link. As usual with Hyper-Threading, the two logical CPUs of a given processor core share its Level 1 caches (32 KBytes). Each processor core features its own Level 2 cache (256 KBytes), and all ten cores of a processor share the same Level 3 cache (24 MBytes). Each processor with its associated memory represents one NUMA node, each holding two registered DDR3 1066 MHz SDRAM modules. Two of the NUMA nodes (nodes 1 and 2) have a direct link to the chipset.

As guest operating system, a real-time Linux with kernel 3.12.42-rt58.x86_64 was run in multiple parallel instances. In the hypervisor configuration, a different NUMA node was selected for each guest OS instance to run on. Whether the memory directly attached to the assigned node is used, or memory of other NUMA nodes, can be configured as well.
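A configuration along these lines might look as follows. This is a hypothetical sketch of such settings, not the actual RTS Hypervisor configuration syntax:

```ini
; Hypothetical sketch only -- not the actual RTS Hypervisor syntax.
[guest.rtlinux1]
mode        = privileged-vmmu   ; privileged mode with virtual MMU
cpus        = node1:0-9         ; the ten cores of NUMA node 1
memory_node = 1                 ; back guest RAM (and its second-level
                                ; page tables) with node-local memory
memory_size = 4096M
```

The essential point is the last two lines: the guest's memory, including the hypervisor-managed page tables, is pinned to the same node its CPUs belong to.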

When measuring throughput, the difference between accessing local memory attached directly to an operating system's own processor and accessing memory on a different node via QPI can be seen clearly in Figure 2. The measurement shows bandwidth results using 128-bit read cycles (loading an MMX register from main memory, using a variety of block sizes). When reading blocks larger than the caches, the difference in bandwidth depending on memory location is clearly visible. The steps in the resulting diagram represent the different levels of cache: the 32 KB Level 1 and 256 KB Level 2 caches. Beyond a 24 MB block size (the extent of the Level 3 cache) the actual bandwidth of the path to main memory can be seen: roughly 4.7 GByte/s locally vs. a little over 3.6 GByte/s remotely.

Figure 2
For transactions exceeding the 24MB L3 cache size, bandwidth depends on memory location

While bandwidth is of course of interest, latencies, determinism and jitter are even more important in embedded and real-time applications. In order to measure latencies caused by simultaneous memory access from multiple operating systems, a special test program was used. It allocates a 64 MB block of memory, substantially larger than the 24 MB last-level cache, to minimize cache effects. The tool then writes the first DWORD (4 bytes) of each 4096-byte page in turn. In a second loop, all previously written data is read back.

This measurement is executed in kernel mode with all interrupts masked. The measurement illustrated in Figure 3 shows two operating systems running on separate cores but on the same NUMA node. Essentially, this scenario corresponds to a traditional non-NUMA system without the memory-access prioritization described at the beginning of this article. Operating system 1 executes the test program described above while the second OS is idle; this is illustrated in green. The duration of the memory loop stays at a constant value of about 1140 µs. If the second operating system now executes memory benchmark software, the red curve shows the resulting memory-loop duration. The load the second operating system puts on the memory bus has a massive impact: the loop duration rises to as much as 5800 µs, five times the original value.

Figure 3
Shared memory access leads to prolonged execution time and jitter

For the next examination of the system, the second operating system was started on a different NUMA node. The measurement (Figure 4) shows that even when a heavy load is applied to either operating system, there is clean separation and no influence between the individual nodes. The curve of the observed memory-loop durations closely matches the "no load" results shown in green above.

Figure 4
With memory assigned exclusively, maximum execution time is not affected under load

The measurements performed confirm that memory access can be optimized using a NUMA-aware bare-metal hypervisor combined with intelligent physical memory management. Access time to RAM can be reduced greatly if only memory modules connected to a processor's own memory controller, i.e., on the same local NUMA node, are used.

For memory-intensive applications the improvement can be drastic. When many memory transactions and parallel accesses are required, separating the traffic (i.e., keeping each NUMA node autonomous) has a huge impact on execution time. Without NUMA awareness in the hypervisor, the maximum execution time of the test program went up by a factor of five.

In conclusion: "making the most of multicore" involves smart handling of memory. If virtualization is used on a small embedded system, for example one with a single dual-core Atom processor and a single memory module, then prioritization of cache and memory access is a must-have hypervisor feature. On larger systems, the hardware is already laid out with separate caches and memory controllers, but only a NUMA-aware hypervisor can make effective use of this architecture.

Real-Time Systems
Ravensburg, Germany
+49 (0) 751 359 558 – 0