Tips and Tricks for Minimizing ARM Cortex CPU Power Consumption

Fractal Realms series. Backdrop of fractal elements, grids and symbols on the subject of education, science and technology

Minimizing ARM Cortex CPU Power By carefully considering and implementing combinations of these techniques, developers can realize substantial advantages.


September 2014

There are myriad CPU choices available to embedded developers and each choice has its own idiosyncrasies, architectural benefits and array of power-saving techniques. To provide a useful context for the wide range of embedded CPU options, it’s helpful to focus on one widely used architecture: the ARM Cortex-M series. This architectural choice provides several CPU variants to explore and many different MCU devices from silicon vendors. The diversity of options within the Cortex-M series is important, as minimizing the CPU power consumption is achieved not just through the CPU architecture itself but also by the supporting elements around it, such as interrupt controllers, memory interfaces, memory accelerators, low power modes and intelligent peripherals.

Everything You Need to Know about Thumb-2

Let’s begin our look at power-saving techniques in a place that might not be an obvious starting point—the instruction set. Cortex-M CPUs all use the Thumb-2 instruction set, which blends the 32-bit ARM instruction set with the 16-bit Thumb instruction set and provides a flexible solution for both raw performance and overall code size. A typical Thumb-2 application on Cortex-M will be in the range of 25 percent smaller and 90 percent as efficient (when optimized for time) compared to code written entirely in ARM instructions.

Thumb-2 includes a number of powerful instructions that reduce the cycle count for basic operations. Reducing the cycle count means you can do the job at hand with a significant reduction in CPU power consumption. Consider, for example, a 16-bit multiply (Figure 1). Performing this operation on an 8-bit 8051-based MCU would take 48 clock cycles and 48 bytes. Using a 16-bit core such as C166, this operation would be in the range of 8 clock cycles and 8 bytes. In contrast, on a Cortex-M3 core using Thumb-2, this is done in a single cycle and uses only two bytes of flash. The Cortex-M3 saves power by using fewer clock cycles to do the same job and by using much less flash memory and therefore far fewer accesses to flash to achieve the end result (in addition to potentially fitting the application into a smaller flash and reducing overall system power). We could spend a lot more time examining the Thumb-2 instruction set, but this example should give a sense of its potential for power saving.

Figure 1: Cycle Count Comparison for a 16-bit Multiply

Power-Saving Interrupt Controller Techniques

The interrupt controller (nested vectored interrupt controller or NVIC) in the Cortex-M architecture also plays a role in reducing CPU consumption. The Cortex-M3 NVIC takes just 12 cycles to get from the interrupt request to executing the interrupt code, compared to the “up to” 42 cycles that it took previously on the ARM7-TDMI, clearly bringing a benefit in efficiency and reducing wasted time for the CPU. In addition to the fast entry to the interrupt routine, the NVIC also man- ages moving from one interrupt routine to another much more efficiently.

On the ARM7-TDMI implementations, it was necessary to spend cycles moving from the interrupt routine back to the main application, and then back into the next interrupt routine, burning up to 42 cycles between interrupt service routines on just “push-and-pop” functions. The Cortex-M NVIC enables a much more efficient method for accomplishing this task called “tail-chaining.” This method uses a simple 6-cycle process for getting the necessary information to run the next service routine. With tail-chaining, there is no need to do a full push-and-pop cycle, resulting in an overall savings of 65 percent in clock cycles to manage interrupts (Figure 2).

Figure 2: Interrupt Response on ARM7 and Cortex-M3

Power-Saving Memory Considerations

Memory interfaces and memory accelerators can make a significant impact on how much power the CPU consumes. Branches and jumps in the code have the effect of flushing the pipeline feeding instructions to the CPU, and in that case the CPU will stall for a number of cycles while the pipeline is refilled. On a Cortex-M3 or Cortex-M4, the CPU is equipped with a 3-stage pipeline. Flushing this pipeline will cause the CPU to stall for three cycles—even longer if there are wait- states on the flash—while it is refilled. These stalls waste power because effectively nothing useful is happening. To help mitigate the delay, the Cortex-M3 and M4 cores include a function called Speculative Fetch, which upon fetching a branch in the pipeline will also fetch the likely branch target. If that branch is taken, then the Speculative Fetch will have reduced the stall by one cycle. While this feature is useful, it is clearly not enough, and many vendors of Cortex-M devices add their own IP to enhance this capability.

Consider, for example, different approaches to instruction caches used in popular ARM Cortex-M class MCUs. An MCU with a simple instruction cache, such as an EFM32 de- vice from Silicon Labs, may store 128 x 32 (512 bytes) of the most recently executed instructions with logic to determine if the requested address is in the cache or not. The EFM32 reference manuals suggest a typical application will get a >70 percent hit rate on this cache, meaning far fewer flash accesses and faster execution of the code, and an overall reduction in power consumption.

In contrast, an ARM-based MCU using a branch cache with 64 x 128-bit buffers can store the first few instructions a minimum of four, maximum of eight for each branch depending on the mix of 16- or 32-bit instructions. A branch cache implementation can therefore fill the pipeline in a single cycle for any branch or jump that hits the cache, eliminating any stalls or wasted cycles for the CPU. Both of these cache techniques offer considerable improvements in performance and reductions in CPU consumption compared to the same CPU without these cache features. Both cache types also minimize accesses to flash, enabling an overall reduction in MCU power consumption as well.

A Closer Look at the M0+ Core

For power-sensitive applications where every nano-watt counts, the Cortex-M0+ core is a good option. The M0+ is based on a von-Neumann architecture (whereas the Cortex-M3 and Cortex-M4 cores are Harvard architecture), meaning it has a smaller gate count for lower overall power consumption figures and only a minimal hit on performance (0.93 DMIPS/MHz vs. 1.25 DMIPS/MHz for its bigger ARM siblings). It also uses a smaller subset of the Thumb-2 instruction set (Figure 3). Nearly all of the instructions have 16-bit opcodes (50 x 16-bit opcodes and 6 x 32-bit opcodes; data operations are all 32-bit), enabling some interesting options for reducing CPU power consumption.


Figure 3: The Cortex-M0+ Instruction Set

The first of these power-saving options is to reduce the number of flash accesses. Having a predominantly 16-bit instruction set means you can access the flash on alternate cycles (Figure 4) and fetch two instructions for the pipeline with each access. This assumes that you have two instructions aligned in a 32-bit word in memory. In the case where the instructions are not aligned, then the Cortex-M0+ will disable half of the bus to conserve every last bit of energy.

Figure 4: Alternate Clock Flash Access on Cortex-M0+

Additionally, the Cortex-M0+ core can save power as a byproduct of the reduction to a 2-stage pipeline. In pipelined processors, subsequent instructions are fetched while the CPU executes the current instructions. If the program branch- es and does not use the subsequently fetched instructions, the energy used to fetch them (branch shadow) is effectively wasted. In a 2-stage pipeline this branch shadow is reduced so energy (admittedly a small amount) can be saved.

It also means there is one cycle less to refill the pipeline in the event it is flushed (Figure 5).

Figure 5: Pipeline and Branch Shadow

Another area where the Cortex-M0+ core offers some power savings is through its high-speed GPIO ports. On the Cortex-M3 and Cortex-M4 cores, the process of toggling a bit or GPIO port is to “read-modify-write” a 32-bit register. While the Cortex-M0+ can use this method too, it has a dedicated 32-bit-wide I/O port that gives single-cycle access to the GPIO, making it possible to toggle the bit/pin efficiently. Note: this is an optional feature on the Cortex-M0+, so not all vendors will implement this useful GPIO feature.

Putting the CPU to Sleep

One of the most effective ways to minimize CPU power consumption is to switch off the CPU itself. There are a number of different sleep modes in the Cortex-M architec- ture, each offering a tradeoff between power consumption in the chosen sleep mode and startup time to executing code again (Figure 6). It is also possible to have the CPU enter
a sleep mode automatically upon completing an interrupt service routine without the need for a line or two of code to do the job. This approach saves CPU cycles for a task that is likely to be very common in an ultra-low-power application.

Figure 6: Cortex-M Native Low-Power Modes

In deep sleep mode, it’s also possible to use the wakeup interrupt controller (WIC) to offload the NVIO. When using the WIC, there is no need for a clock to the NVIC in low-power modes to wake up the CPU using external interrupts.

Autonomous on-chip peripherals also can be a real ben- efit for power reduction. Most MCU vendors have their own architectures for enabling autonomous interaction between peripherals, such as the Peripheral Reflex System (PRS) used in Silicon Labs’ EFM32 MCU devices. Autonomous peripheral implementations enable fairly complex chains of peripheral actions (triggers rather than data transfers) to take place while the CPU remains asleep. For example, using the PRS capability of an EFM32 MCU, the application could be configured such that in a low-power mode with the CPU asleep, the onboard comparator detects a voltage that cross- es its threshold, triggering a timer to start counting down. When that timer hits zero, it then triggers the DAC to start its output—all while the CPU remains asleep.

,p>With such complex interactions occurring autonomously, it is easy to get a lot done without CPU intervention. Furthermore, peripherals that have some built-in intelligence such as sensor interfaces or pulse counters can be configured to wake the CPU on an interrupt with a pre-configured condition, such as having counted 10 pulses. In this example, when the CPU wakes to that specific interrupt, it knows exactly what needs to be done and doesn’t need to check counters or registers to see what information has come in, thereby saving some cycles and getting on with other important tasks.

These are some of the easier-to-access methods of reducing CPU consumption on the wide choice of Cortex-M devices. There are, of course, other considerations impacting power consumption, such as the process node used to manufacture the device or the memory technology used to store the application code. Process and memory technologies can have significant impacts on both run-time power consumption and the leakage in low-power modes, and thus should be factored into an embedded developer’s overall power-saving strategy.

Silicon Labs

Austin, TX.

(512) 416-8500