The choice of ARM cores can initially seem bewildering, but the range is carefully designed to help developers match performance and power consumption to the needs of their applications.
By Matt Sounders, Silicon Labs – September 2014
The ARM Cortex family of cores spans a very wide range of scalable performance options, giving designers the opportunity to use the best-fit core for their application rather than being forced into a one-size-fits-all solution. The Cortex portfolio is split broadly into three main series:
- Cortex-A – application processor cores for performance-intensive systems
- Cortex-R – high-performance cores for real-time applications
- Cortex-M – microcontroller cores for a wide range of embedded applications
Cortex-A for Rich Applications
Cortex-A processors provide a range of solutions for devices that make use of a rich operating system such as Linux or Android, and are used in a wide range of applications from low-cost handsets to smartphones, tablet computers, set-top boxes and also enterprise networking equipment. The first range of Cortex-A processors (A5, A7, A8, A9, A12, A15 and A17) is based on the ARMv7-A architecture. Each core shares a common feature set including items such as the NEON media processing engine, TrustZone security extensions, and single- and double-precision floating point support along with support for several instruction sets (ARM, Thumb-2, Thumb, Jazelle and DSP). Together this group of processors offers design flexibility by providing the required peak performance points while delivering the desired power efficiency.
While the Cortex-A5 core is the smallest and lowest-power member of the Cortex-A series, it offers the possibility of multicore performance and is compatible with the larger members of the series (A9 and A15). The A5 is a natural choice for designers who have previously worked with the ARM926EJ-S or ARM1176JZ-S processors as it enables higher performance and lower silicon cost.
The Cortex-A7 is similar in power consumption and area to the Cortex-A5, but brings a performance increase in the range of 20 percent as well as full architectural compatibility with the Cortex-A15 and Cortex-A17. The Cortex-A7 is a good choice for cost-sensitive smartphone and tablet implementations, and it can also be combined with a Cortex-A15 or Cortex-A17 in what ARM refers to as a “big.LITTLE” processing configuration.
The big.LITTLE configuration is essentially a power optimization technology. A high-performance CPU (e.g., Cortex-A17) and an ultra-efficient CPU (e.g., Cortex-A7) are combined to provide higher sustained performance while enabling significant overall power savings: when the application’s performance requirements are low to moderate, the system relies on the more efficient core, potentially saving 75 percent of CPU energy and thereby extending battery life (Figure 1). This configuration offers a significant advantage to the developer because the performance demands of smartphones and tablets are advancing much faster than battery capacity. Design methodologies such as big.LITTLE, as part of an overall system design strategy, can significantly help close this battery technology gap.
FIGURE 1: ARM big.LITTLE processing is a power-optimization technology where high-performance ARM CPUs are combined with the most efficient ARM CPUs to deliver peak performance capacity, higher sustained performance, and increased parallel processing performance, at significantly lower average power. The underlying big.LITTLE software automatically moves workloads to the appropriate CPU based on performance needs.
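As a rough illustration of the idea, the following Python sketch models a scheduler that runs each one-second slice of work on the most efficient core able to meet its demand. The `Core` class, the `pick_core` and `energy_used` functions, and all capacity and energy figures are hypothetical values chosen for illustration; they are not ARM data, and real big.LITTLE software is considerably more sophisticated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Core:
    name: str
    capacity: int            # peak work units per second (assumed figure)
    energy_per_unit: float   # energy per work unit, arbitrary units (assumed)


# Hypothetical figures for illustration only; not measured ARM data.
LITTLE = Core("LITTLE (e.g. Cortex-A7)", capacity=100, energy_per_unit=1.0)
BIG = Core("big (e.g. Cortex-A17)", capacity=300, energy_per_unit=4.0)


def pick_core(demand: int) -> Core:
    """Use the efficient core whenever it can meet the demand."""
    return LITTLE if demand <= LITTLE.capacity else BIG


def energy_used(demands: Sequence[int]) -> float:
    """Total energy when each slice runs on the core chosen for it."""
    return sum(d * pick_core(d).energy_per_unit for d in demands)


# A mostly light workload with a single burst of heavy activity.
workload = [20, 40, 250, 30, 10]
always_big = sum(d * BIG.energy_per_unit for d in workload)
mixed = energy_used(workload)
print(f"always big: {always_big:.0f} units, big.LITTLE: {mixed:.0f} units")
```

With these made-up numbers the saving is modest because the single burst dominates; the lighter the typical load, the larger the share of energy that the efficient core saves.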
Moving to the other end of the Cortex-A scale, let’s consider the Cortex-A15 and Cortex-A17 cores. These are both very high-performance processors and again are available in a variety of configurations. The Cortex-A17 is the most efficient “mid-range” processor, and it squarely targets premium smartphones and tablets. The Cortex-A9 has been widely deployed in that market, but the Cortex-A17 offers a performance increase of more than 60 percent (cycle for cycle) compared to the Cortex-A9, and it achieves this while also improving overall power efficiency. The Cortex-A17 can be configured with up to four cores, each of which contains a fully out-of-order pipeline. As mentioned previously, the Cortex-A17 can be combined with the Cortex-A7 for an effective big.LITTLE configuration, and it can also be combined with high-end mobile graphics processors (such as the Mali from ARM), resulting in a very efficient design overall.
The Cortex-A15 is the highest-performance member of this series, providing (in a mobile configuration) twice the performance you would get from a Cortex-A9. While it is well suited to applications such as high-end smartphones or tablets, a multicore Cortex-A15 processor running at 2.5 GHz opens up the possibility of using a Cortex-A processor in applications such as low-power servers or wireless infrastructure. The Cortex-A15 is the first processor from ARM to incorporate hardware support for data management and arbitration of virtualized software environments. Applications in those software environments are able to simultaneously access the system capabilities, making it possible to implement devices with virtual environments that are robust and isolated from each other (Figure 2).
The Cortex-A15 is the highest performance ARM processor to date, and is targeted at next-generation smartphones, tablets, large-screen mobile computing and high-end digital home entertainment devices through to wireless base stations and enterprise infrastructure products.
The latest additions, the Cortex-A50 series, extend the reach of the Cortex-A series into low-power servers. These processors are built on the ARMv8 architecture and bring with them support for AArch64, an energy-efficient 64-bit execution state that can operate alongside the existing 32-bit execution state. An obvious reason for a move to 64-bit would be support for more than 4 Gbytes of physical memory, but this is already achieved on the Cortex-A15 and Cortex-A7. In this case, the move to 64-bit is really about providing better support for server applications where a growing number of operating system and application implementations are using 64-bit, and the Cortex-A50 series delivers a power-optimized solution for this scenario. The same is largely true for the desktop market, and support for 64-bit will enable the Cortex-A50 series to be more broadly adopted into this segment and will provide some level of future-proofing for the eventual migration of 64-bit operating systems into mobile applications.
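The 4 Gbyte figure follows directly from the address width: 32 address bits can name 2^32 distinct bytes. The short Python sketch below works through that arithmetic. The 40-bit row is an assumption on our part: it is the physical address width of the ARMv7-A Large Physical Address Extension, which the text above does not name but which is how cores such as the Cortex-A15 and Cortex-A7 reach beyond 4 Gbytes. The 64-bit row is a theoretical upper bound, since real implementations expose fewer address bits.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


# 32: classic 32-bit addressing; 40: assumed LPAE width; 64: theoretical bound.
for bits in (32, 40, 64):
    print(f"{bits}-bit addresses: {addressable_bytes(bits) // GIB} GiB")
```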
Cortex-R for Real Time
Moving on from Cortex-A, the Cortex-R series is the smallest ARM processor offering in terms of derivatives and possibly the least well known. The Cortex-R processors target high-performance real-time applications such as hard disk controllers (or solid state drive controllers), networking equipment and printers in the enterprise segment, consumer devices such as Blu-ray players and media players, and also automotive applications such as airbags, braking systems and engine management.
The Cortex-R series is similar in some respects to a high-end microcontroller (MCU), but it targets larger systems than you would typically implement using a standard MCU. The Cortex-R4, for example, is well suited for automotive applications. It can be clocked up to 600 MHz (delivering 2.45 DMIPS/MHz), has an 8-stage pipeline with dual-issue, pre-fetch and branch prediction, and a low-latency interrupt system that can interrupt multi-cycle operations to quickly serve the incoming interrupt. It can also be implemented in a dual-core configuration with the second Cortex-R4 being in a redundant lock-step configuration with logic for fault detection, making it a good choice for safety-critical systems.
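The principle behind redundant lock-step can be sketched in a few lines of Python: the same computation is carried out twice and the result is accepted only if both copies agree. This is a software model of the idea, not of the Cortex-R4 hardware, which compares the two cores in logic. The names `run_lockstep`, `LockstepFault`, `scale_sensor` and the `corrupt_checker` fault-injection hook are all hypothetical.

```python
from typing import Callable, Optional


class LockstepFault(Exception):
    """Raised when the primary and checker results disagree."""


def run_lockstep(step: Callable[[int], int], value: int,
                 corrupt_checker: Optional[Callable[[int], int]] = None) -> int:
    """Run `step` twice and accept the result only if both runs agree."""
    primary = step(value)
    checker = step(value)
    if corrupt_checker is not None:
        # Hypothetical test hook: simulate a hardware fault in one core.
        checker = corrupt_checker(checker)
    if primary != checker:
        raise LockstepFault(f"mismatch: {primary} != {checker}")
    return primary


def scale_sensor(raw: int) -> int:
    """Made-up control computation used only for this demonstration."""
    return raw * 3 + 7


print(run_lockstep(scale_sensor, 10))
try:
    # Flip the lowest bit of the checker result to mimic a single-bit upset.
    run_lockstep(scale_sensor, 10, corrupt_checker=lambda r: r ^ 1)
except LockstepFault as fault:
    print("fault detected:", fault)
```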
Networking and data storage applications are well served by the Cortex-R5, which extends the feature set offered by the Cortex-R4 to offer increased efficiency and reliability and enhance error management in dependable real-time systems. One such system-level feature is the low-latency peripheral port (LLPP) to enable fast peripheral reads and writes instead of having to perform a read-modify-write on the entire port. The Cortex-R5 can also be implemented as a dual-core system, either in lock-step or with the processors running independently, each executing its own programs with its own bus interfaces and interrupts. This dual-core implementation makes it possible to build very powerful, flexible systems with real-time responses.
The Cortex-R7 significantly extends the performance reach of the series, with clock speeds in excess of 1 GHz and a performance of 3.77 DMIPS/MHz (Figure 4). The 11-stage pipeline on the Cortex-R7 now adds out-of-order execution along with improved branch prediction. There are several options for multicore implementations as well: lock-step, symmetric multiprocessing and asymmetric multiprocessing. The Cortex-R7 also has a fully integrated generic interrupt controller (GIC) supporting complex priority-based interrupt handling. It is worth noting, however, that despite its high performance levels, the Cortex-R7 is not suitable for running rich operating systems (such as Linux and Android), which remains the domain of the Cortex-A series.
FIGURE 4: The ARM Cortex-R7 processor brings new levels of extreme performance to the Cortex-R series of processors. The R7 offers more than 50 percent performance uplift through increased pipeline efficiency with enhanced branch prediction, advanced superscalar and out-of-order execution.
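The DMIPS/MHz figures quoted in this section translate into a peak Dhrystone rating by multiplying by the clock frequency. The Python sketch below does this for the two cores with quoted clock speeds. The helper name `peak_dmips` is our own, and the 1000 MHz used for the Cortex-R7 is an assumed lower bound taken from "in excess of 1 GHz".

```python
def peak_dmips(dmips_per_mhz: float, clock_mhz: float) -> float:
    """Peak Dhrystone MIPS = efficiency (DMIPS/MHz) x clock (MHz)."""
    return dmips_per_mhz * clock_mhz


# (DMIPS/MHz, clock in MHz) as quoted in the text; the R7 clock is a lower bound.
cores = {
    "Cortex-R4": (2.45, 600),
    "Cortex-R7": (3.77, 1000),
}

for name, (efficiency, clock) in cores.items():
    print(f"{name}: about {peak_dmips(efficiency, clock):.0f} DMIPS at {clock} MHz")
```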
Cortex-M for Embedded
Finally we come to the Cortex-M series, designed specifically to target the already very crowded MCU market. The Cortex-M series is built on the ARMv7-M architecture (used for Cortex-M3 and Cortex-M4), and the smaller Cortex-M0+ is built on the ARMv6-M architecture. The first Cortex-M processor was released in 2004, and it quickly gained popularity when a few mainstream MCU vendors picked up the core and started producing MCU devices.
It is safe to say that the Cortex-M has become for the 32-bit world what the 8051 is for the 8-bit world: an industry-standard core supplied by many vendors, each of which dips the core in its own special sauce to provide differentiation in the market. The Cortex-M series can be implemented as a soft core in an FPGA, for example, but it is much more common to find them implemented as an MCU with integrated memories, clocks and peripherals. Some are optimized for energy efficiency, some for high performance, and some are tailored to a specific market segment such as smart metering.
The Cortex-M3 and Cortex-M4 are very similar cores. Each offers a performance of 1.25 DMIPS/MHz with a 3-stage pipeline, multiple 32-bit buses, clock speeds up to 200 MHz and very efficient debug options. The significant difference is the Cortex-M4 core’s capability for DSP. The Cortex-M3 and Cortex-M4 share the same architecture and instruction set (Thumb-2). However, the Cortex-M4 adds a range of saturating and SIMD instructions specifically optimized to handle DSP algorithms. For example, consider the case of a 512-point FFT running every 0.5 seconds on equivalent off-the-shelf Cortex-M3 and Cortex-M4 MCUs. The Cortex-M3 would consume around three times the power that a Cortex-M4 would need for the same job.
There is also the option to get a single-precision floating point unit (FPU) on a Cortex-M4. If your application requires floating point math, you will get this done considerably faster on a Cortex-M4 than you will on a Cortex-M3. That said, for an application that is not using the DSP or FPU capabilities of the Cortex-M4, you will see the same level of performance and power consumption on a Cortex-M3. In other words, if you need DSP functionality, go with a Cortex-M4. Otherwise, the Cortex-M3 will do the job.
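To see why saturating arithmetic matters for DSP work, compare what happens when a 16-bit signal sample overflows. The Python model below contrasts ordinary wrap-around addition with a saturating add that clamps at the limits of the range. It illustrates the behaviour only; the function names are ours, and on a Cortex-M4 the clamping would be done by a single instruction rather than by code like this.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrapping_add16(a: int, b: int) -> int:
    """Plain two's-complement 16-bit addition: overflow wraps around."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total & 0x8000 else total


def saturating_add16(a: int, b: int) -> int:
    """16-bit addition that clamps to the representable range instead."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


# Two loud positive samples: wrapping flips the sign, saturation clips.
print(wrapping_add16(30000, 10000), saturating_add16(30000, 10000))
```

A wrapped result turns a loud positive peak into a large negative value, which is heard as a click or seen as a spike; a clipped result is merely distorted. Without hardware support, every addition needs an explicit range check like the one in `saturating_add16`.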
For applications that are particularly cost sensitive or are migrating from 8-bit to 32-bit, the smallest member of the Cortex-M series might be the best choice. The Cortex-M0+ performance sits a little below that of the Cortex-M3 and Cortex-M4 at 0.95 DMIPS/MHz, but it is still compatible with its bigger brothers. The Cortex-M0+ uses a subset of the Thumb-2 instruction set, and those instructions are predominantly 16 bits wide (although all data operations are 32-bit), which lends itself nicely to the 2-stage pipeline that the Cortex-M0+ offers. This brings some overall power saving to the system through reduced branch shadow, and the pipeline will in most cases hold the next four instructions. The Cortex-M0+ also has a dedicated bus for single-cycle GPIO, meaning you can implement certain interfaces with bit-bashed GPIO like you would on an 8-bit MCU but with the performance of a 32-bit core to process the data (Figure 5).
The ARM Cortex-M0+ processor is the most energy efficient ARM processor available, and is the only ARM processor to employ a two-stage pipeline. The M0+ builds on the Cortex-M0 processor while extending the choice of design options. It is intended for demanding entry-level microcontrollers and deeply embedded applications that require an area- and power-optimized processor.
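For readers unfamiliar with the term, "bit-bashed" (or bit-banged) GPIO means driving an interface by toggling pins directly from software. The Python sketch below simulates shifting one byte out on a data line and a clock line. The `FakeGpio` class and the `DATA` and `CLK` pin names are hypothetical stand-ins; on real hardware each `write` would be a register access, which is where single-cycle GPIO pays off.

```python
from typing import List, Tuple


class FakeGpio:
    """Stand-in for a GPIO port; records every (pin, level) write."""

    def __init__(self) -> None:
        self.writes: List[Tuple[str, int]] = []

    def write(self, pin: str, level: int) -> None:
        self.writes.append((pin, level))


def shift_out_byte(gpio: FakeGpio, value: int) -> None:
    """Bit-bang one byte, most significant bit first."""
    for bit_index in range(7, -1, -1):
        gpio.write("DATA", (value >> bit_index) & 1)  # present the bit
        gpio.write("CLK", 1)                          # rising edge: receiver samples
        gpio.write("CLK", 0)                          # return clock to idle


gpio = FakeGpio()
shift_out_byte(gpio, 0xA5)
sent_bits = [level for pin, level in gpio.writes if pin == "DATA"]
print(sent_bits)
```

Each bit costs three pin writes here, so the achievable bit rate is set almost entirely by how quickly the core can update a pin.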
Another key difference on the Cortex-M0+ is the addition of the micro trace buffer (MTB). This peripheral allows you to dedicate some of the on-chip RAM to store program branches while in debug. These branches can then be passed back up to the integrated development environment (IDE), and the program flow can be reconstructed. This capability provides a rudimentary form of instruction trace and compensates for not having the embedded trace macrocell (ETM) found on the Cortex-M3 and Cortex-M4. The level of debug information you can extract from a Cortex-M0+ is significantly higher than what you can get from an 8-bit MCU, meaning those hard-to-solve bugs just got easier to fix.
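The trace mechanism described above can be modeled as a small circular buffer of branch records from which the debugger infers the code that ran in between. The Python sketch below is a simplified model under our own assumptions: each record is a (source, destination) address pair, the buffer keeps only the most recent entries, and the class and method names are hypothetical rather than taken from any ARM tool.

```python
from collections import deque
from typing import Deque, List, Tuple


class BranchTraceBuffer:
    """Fixed-size buffer that keeps only the most recent branch records."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently discards the oldest record when full.
        self._records: Deque[Tuple[int, int]] = deque(maxlen=capacity)

    def record_branch(self, source: int, destination: int) -> None:
        self._records.append((source, destination))

    def reconstruct_flow(self) -> List[Tuple[int, int]]:
        """Return (start, end) address ranges executed between branches.

        Code runs linearly from one branch's destination up to the
        source address of the next recorded branch.
        """
        records = list(self._records)
        return [(prev_dst, src)
                for (_, prev_dst), (src, _) in zip(records, records[1:])]


trace = BranchTraceBuffer(capacity=3)
for src, dst in [(0x100, 0x200), (0x210, 0x300), (0x320, 0x100), (0x110, 0x400)]:
    trace.record_branch(src, dst)

# The oldest record has been overwritten, so only recent history survives.
print([(hex(start), hex(end)) for start, end in trace.reconstruct_flow()])
```

The trade-off is visible in the `capacity` parameter: the more RAM set aside for the buffer, the further back the reconstructed history reaches.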
So the Cortex processor family offers many options regardless of the performance level you need for your application. With a little bit of thought and investigation, you will be able to find the right processor that suits your application needs, whether it’s for a high-end tablet or an ultra-low-cost wireless sensor node for the Internet of Things.