INTERCONNECT STRATEGIES
Fault Tolerance Using RapidIO
To achieve five nines availability at the system level, a combined procedural, software and hardware approach is required. RapidIO was developed with the most demanding levels of availability in mind and offers a robust switched interconnect for high-performance embedded applications.
VICTOR MENASCE, TUNDRA SEMICONDUCTOR
System failure is an expensive business. If part of a mobile phone network fails for several hours during a peak period, for instance, the combined cost to the operator in lost revenue and damage to reputation can dwarf the cost of rectifying the actual fault. Yet the cost of failure isn't always just financial. The "cost" of a hardware or software failure on an aircraft, for example, can include the lives of its passengers and crew.
The problem is, however, that no electronics system in the real world can be made 100% reliable; it can only strive to approach the magic number. Moreover, the more complex (and invariably useful) the system, the more potential modes of failure exist, and the more failure prone it becomes.
"Availability" is a measure of system reliability in the broadest sense as it encompasses both planned and unplanned system failures and outages. Planned outages include necessary hardware and software upgrades and routine network maintenance. Unplanned outages primarily include hardware and software failures, and system failures due to operator error.
Economic viability and competitive markets demand system reliability of 99.999%, the so-called "five nines". This equates to roughly five minutes of downtime per year per system for all sources of outage, planned and unplanned. Unfortunately for hardware designers, five nines reliability at the system level demands seven nines reliability at the hardware level because system failures are also due to software and human faults. The good news is that seven nines capability is already here in the form of robust fault tolerance technology built into the latest generation of embedded interconnect architecture, called RapidIO.
System Failure Incident Rates
Recent research done by the author using data maintained by the U.S. Federal Communications Commission (FCC) in their ARMIS database revealed that the vast majority of complete electronics system level failures are caused by software-related errors (around 63%), followed by operator error (22%) with hardware failure third (15%).
The impact of these errors in terms of the actual mean outage downtime is 45% for software, 35% for operator and 20% for hardware failures. This outage downtime varies because the amount of time it takes a system to recover or be repaired is different for each type of error. These figures reflect the general fact that although software errors are more common, they are generally easier and quicker to fix than operator or hardware-induced ones.
If one estimates that planned outages on average account for three minutes per system per year, the inescapable conclusion is that only two minutes remain for unplanned outages if five nines reliability is to be maintained. Of this, hardware failure can account for no more than 24 seconds of outage per year (20% of 2 minutes), which equates to an availability of roughly 99.99992%, or six nines.
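The downtime budget above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 365-day year and uses the article's rounded figures (five minutes in total, three minutes planned, a 20% hardware share), and the function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, assuming a non-leap year


def downtime_seconds(availability_pct: float) -> float:
    """Annual downtime in seconds allowed by a given availability percentage."""
    return SECONDS_PER_YEAR * (1.0 - availability_pct / 100.0)


def availability_percent(downtime_s: float) -> float:
    """Availability percentage implied by a given annual downtime in seconds."""
    return 100.0 * (1.0 - downtime_s / SECONDS_PER_YEAR)


# Budget from the text, using the rounded "five minutes" figure.
total_budget_s = 5 * 60                    # all outages, planned and unplanned
planned_s = 3 * 60                         # estimated planned outages
unplanned_s = total_budget_s - planned_s   # what is left for unplanned outages
hardware_s = 0.20 * unplanned_s            # hardware share of unplanned downtime

print(f"five nines allows {downtime_seconds(99.999) / 60:.2f} min/year")
print(f"unplanned budget: {unplanned_s} s, hardware budget: {hardware_s:.0f} s")
print(f"required hardware availability: {availability_percent(hardware_s):.6f}%")
```

Note that an exact five nines figure allows slightly more than five minutes per year, so the rounded budget is a little conservative.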
In other words, five nines of system availability at the system level demands that the hardware be two orders of magnitude more reliable to compensate for software failures and human error. RapidIO was designed to meet this extreme reliability requirement by including built-in error detection and correction from the outset.
An Overview of RapidIO
RapidIO is a chip-to-chip (on-board) and board-to-board (backplane) packet-switched interconnect that is built into the hardware data path fabric of a system. It is based around a broad, open standards protocol that is backed by a dedicated standards committee, the RapidIO Trade Association. This presently comprises over 50 active member companies including Motorola, Alcatel, Ericsson, IBM, Cisco Systems, Lucent Technologies and Nortel Networks.
RapidIO was developed to eliminate in-system bottlenecks primarily in high-performance embedded, networking and communications applications by replacing existing bridged hierarchies of shared bus structures such as PCI and PCI-X with RapidIO (Figures 1 and 2). PCI and PCI-X have served the industry well, but are beginning to approach their ultimate migratory limits in terms of bandwidth, reliability and scalability.


For example, a typical 64-bit PCI-X bus running at 133 MHz can reach a peak data transfer rate of about 1 Gbyte/s. Contemporary demands, however, are for speeds well in excess of the 1 Gbyte/s barrier, together with the ability to handle more devices with fewer pins, and to work with existing PCI and CPU hardware and software.
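The 1 Gbyte/s figure follows directly from bus width and clock rate. As a rough sketch, assuming a parallel shared bus that moves one full-width word per clock cycle and ignoring arbitration and protocol overhead (the function name is illustrative):

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Theoretical peak throughput: one full-width transfer per clock cycle."""
    return (bus_width_bits / 8) * clock_hz


# 64-bit bus at 133 MHz, the case quoted in the text.
wide_fast = peak_bandwidth_bytes_per_s(64, 133e6)
# Classic 32-bit, 33 MHz PCI for comparison.
narrow_slow = peak_bandwidth_bytes_per_s(32, 33e6)

print(f"64-bit @ 133 MHz: {wide_fast / 1e9:.3f} Gbyte/s")
print(f"32-bit @ 33 MHz:  {narrow_slow / 1e6:.0f} Mbyte/s")
```

Sustained throughput on a shared bus is lower than this peak, since every device on the bus competes for the same cycles.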
Enter RapidIO. It supports maximum transmission rates up to 64 Gbit/s per device interface and offers significantly greater bandwidth, flexibility and reliability than currently used bus interconnects. It also comes in parallel and serial interface formats, the former typically being used on the edge of high-speed processors to provide a high bandwidth interface, and the latter for backplane connectivity.
RapidIO also offers a high degree of determinism because point-to-point transactions reduce the chances of data "walking" due to elimination of buses. In addition, it can be implemented in a small silicon footprint in ASICs and FPGAs without consuming all available gates. It also offers low transaction overhead (the number of sent bytes required to complete a transaction) and software transparency to existing application software and operating systems down to driver level.
System Fault Tolerance
There are six key elements to system level fault tolerance:
- No single point of failure
- No single point of repair
- Fault recovery
- 100% fault detection
- 100% fault isolation
- Fault containment
Many of these elements are not unique to RapidIO. They are required by the system. But most interconnect technologies, such as PCI, do not provide the required infrastructure to meet all of these requirements simultaneously. Building systems to achieve high reliability, however, requires a real focus on all these elements. RapidIO distinguishes itself by providing much of the infrastructure required to build these kinds of systems.
RapidIO offers extensive error detection, recovery and isolation mechanisms. The first three requirements are usually addressed by the system architecture. This is usually based on redundant hardware that can be independently maintained. Fault recovery is usually implemented in software with the assistance of hardware-based watchdog timers. The last three items in the list are a particular challenge and worthy of further explanation. It is only by addressing fault detection, isolation and containment that real system fault tolerance can be achieved.
RapidIO Support for Fault Tolerance
The RapidIO standard has three layers. The logical layer comprises the transaction types. A transport layer specifies how data gets routed through switch topologies and a physical layer defines the electrical interfacing (Figure 3). Transactions in RapidIO are split transactions with separate requests and responses. A master, or initiator, generates a request transaction, which is transmitted to a target via the data path fabric. The target then generates a response transaction back to the initiator to complete the operation (Figure 4).


The transactions are encapsulated into data packets defined by the logical layer. These packets, comprising header and payload, contain a rich range of data with all the necessary bit fields to ensure reliable delivery to the targeted end-point in the transport layer (Figure 5). RapidIO devices are typically connected to each other via switch fabric devices or directly on a point-to-point basis in smaller systems.

Control symbols embedded within the data packets are at the heart of RapidIO's hardware recovery mechanism and are used to manage the flow of transactions in the physical layer. Control symbols are used for packet acknowledgement (packet accepted, not accepted, retry, etc.), flow control information (transaction type and data payload size), and maintenance functions (Figure 6).

All packets are transmitted and acknowledged using a strict ordering that is enforced by a transaction tag field called ackID. This means that once a packet has been issued, it is either accepted or rejected by the receiving device. If it is accepted, the receiver then takes responsibility for sending the packet on to the next device in the path to the final destination, one link at a time. If the receiver rejects the packet for any reason, for instance because of a full buffer or a bad cyclic redundancy check (CRC), the sender retransmits it at a later time. This link-level reliability is implemented in hardware with no software intervention.
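The accept-or-retransmit behavior on a single link can be modeled in software for illustration. The sketch below is not the RapidIO hardware protocol: the 3-bit ackID space, the use of CRC-32 from Python's `zlib` as a stand-in checksum, and all class and function names are assumptions made for this example.

```python
import zlib
from dataclasses import dataclass

ACKID_MODULO = 8  # assumption: small ackID space that wraps around


@dataclass
class Packet:
    ack_id: int
    payload: bytes
    crc: int


class Receiver:
    """Accepts packets only if the CRC is good and the ackID is the next expected."""

    def __init__(self) -> None:
        self.expected_ack_id = 0
        self.delivered: list[bytes] = []

    def receive(self, pkt: Packet) -> str:
        bad_crc = zlib.crc32(pkt.payload) != pkt.crc
        out_of_order = pkt.ack_id != self.expected_ack_id
        if bad_crc or out_of_order:
            return "not-accepted"
        self.delivered.append(pkt.payload)
        self.expected_ack_id = (self.expected_ack_id + 1) % ACKID_MODULO
        return "accepted"


def send_all(payloads, receiver, corrupt_attempts=frozenset(), max_attempts=100):
    """Send non-empty payloads in order, retransmitting each until it is accepted.

    corrupt_attempts lists the transmission attempts (counted from 0) on which
    a single bit error is injected, to exercise the retry path.
    """
    log = []
    attempt = 0
    ack_id = 0
    for payload in payloads:
        pkt = Packet(ack_id, payload, zlib.crc32(payload))
        while True:
            if attempt >= max_attempts:
                raise RuntimeError("link failed: too many retransmissions")
            on_wire = pkt
            if attempt in corrupt_attempts:
                damaged = bytes([payload[0] ^ 0x01]) + payload[1:]
                on_wire = Packet(pkt.ack_id, damaged, pkt.crc)
            attempt += 1
            symbol = receiver.receive(on_wire)
            log.append((pkt.ack_id, symbol))
            if symbol == "accepted":
                break  # the sender may now release its copy of the packet
        ack_id = (ack_id + 1) % ACKID_MODULO
    return log


rx = Receiver()
history = send_all([b"a", b"b", b"c"], rx, corrupt_attempts={1})
print(history)
print(rx.delivered)
```

The sender holds on to each packet until it sees the acceptance, which is what allows a retransmission without involving software.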
All data transfer in RapidIO is, in fact, protected by redundancy in the form of either CRC codes or parity. The strength of the CRC code combined with the hardware-based retry mechanism provides an extremely low effective bit-error rate (on the order of 10^-19). Even at gigahertz frequencies, this is as good as guaranteed message delivery. If a link's bit-error rate degrades for whatever reason, hardware-controlled link retraining can restore link integrity.
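To show how a CRC exposes a corrupted packet, here is a minimal bitwise CRC-16 using the CCITT polynomial 0x1021 with an initial value of 0xFFFF. These parameters are assumptions chosen for illustration; the exact polynomial, seed and covered fields used on a real link should be taken from the RapidIO specification.

```python
def crc16_ccitt(data: bytes, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial 0x1021, processed most significant bit first."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


payload = b"example packet payload"
sent_crc = crc16_ccitt(payload)

# Simulate a single bit error in transit: flip one bit of the fourth byte.
damaged = payload[:3] + bytes([payload[3] ^ 0x10]) + payload[4:]

print(hex(crc16_ccitt(b"123456789")))    # standard check value: 0x29b1
print(crc16_ccitt(payload) == sent_crc)  # True: intact packet passes
print(crc16_ccitt(damaged) == sent_crc)  # False: receiver would reject it
```

A receiver that sees a mismatch simply rejects the packet, and the link-level retry takes care of the rest.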
The RapidIO error management specification provides a consistent mechanism for handling errors at the system level. It defines which errors are detected in a system, how they are logged by the hardware and reported to the control host in the system. The specification also provides multi-level thresholds for error reporting. This can be used to construct green, yellow and red light conditions in a system.
Yellow light conditions provide early visibility of degraded conditions to system software. This allows maintenance software to take proactive corrective action prior to a catastrophic failure. These mechanisms also employ a leaky bucket algorithm, which approximates time averaging. Decisions on whether to declare a piece of hardware faulty are based on error rates as a function of time.
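A leaky bucket of this kind is easy to sketch in software. In the example below, each error adds one unit to a counter that drains at a fixed rate, and two thresholds map the level to green, yellow or red. The class name, thresholds and leak rate are illustrative assumptions and are not values from the RapidIO error management specification.

```python
class LeakyBucketErrorCounter:
    """Error counter that drains over time, approximating an error-rate average."""

    def __init__(self, leak_per_second: float, yellow: float, red: float) -> None:
        self.leak_per_second = leak_per_second
        self.yellow = yellow      # degraded threshold
        self.red = red            # failed threshold
        self.level = 0.0
        self.last_time = 0.0

    def _leak(self, now: float) -> None:
        # Drain the bucket for the time elapsed since the last update.
        elapsed = max(0.0, now - self.last_time)
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_time = now

    def status(self, now: float) -> str:
        self._leak(now)
        if self.level >= self.red:
            return "red"
        if self.level >= self.yellow:
            return "yellow"
        return "green"

    def record_error(self, now: float) -> str:
        """Register one detected error at time `now` and return the new status."""
        self._leak(now)
        self.level += 1.0
        return self.status(now)


bucket = LeakyBucketErrorCounter(leak_per_second=1.0, yellow=3.0, red=6.0)

# A short burst of three errors pushes the link into the degraded state.
burst = [bucket.record_error(0.0) for _ in range(3)]
print(burst)

# With no further errors the bucket drains and the link returns to green.
print(bucket.status(10.0))

# A sustained burst crosses the red threshold.
heavy = [bucket.record_error(10.0) for _ in range(6)]
print(heavy[-1])
```

Because the bucket drains, occasional isolated errors never accumulate, while a rising error rate is flagged before it becomes a hard failure.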
In the event of more severe errors, RapidIO may not be able to make a graceful hardware recovery. In this extreme case, the hardware generates interrupts so that system software can invoke a higher-level error recovery protocol. This will usually involve software querying built-in maintenance registers to reconstruct the current status of a device and the sequence of events that led to the problem.
Containment, Detection and Isolation
One common failure mode in systems is the so-called rogue transmitter. In this scenario, the failed device resists any attempt to be reset, and is constantly spewing faulty data. The RapidIO architecture can allow for this device to be isolated in the system. This will effectively sever it from the system to prevent network congestion. This is a major advantage of this architecture.
One hundred percent fault detection means that all failures are detected. RapidIO achieves this by eliminating any form of "datagram" (so-called "send and pray") transmissions. All transfers are confirmed by handshakes. Each data path is protected by parity, error checking and correction (ECC) or CRC functions. This means every fault is traceable to the offending transaction so that the correct recovery action can be initiated.
In the event of a failure, RapidIO is designed to leave enough clues around the system to enable backtracking to find out what happened. It provides a series of status registers that software can use to quickly isolate a problem and figure out what went wrong, even after a system reset. These include holding registers for failed transactions, and reset shadow registers to preserve a status register's contents after a system reset. These attributes guard against the common failure scenario of a software process going "insane" and hanging the system, causing the watchdog timer to kick in and perform a reset, which would potentially wipe out all trace of the root cause.
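The idea behind holding and shadow registers can be illustrated with a small software model. Everything here is hypothetical: the register names, the error bit and the reset behavior are invented for the example and do not correspond to the actual RapidIO register map.

```python
from dataclasses import dataclass
from typing import Optional

ERR_BAD_CRC = 0x01  # hypothetical error bit, for illustration only


@dataclass
class PortRegisters:
    """Toy model of a port's error registers with reset shadow copies."""

    error_status: int = 0
    failed_transaction: Optional[bytes] = None   # holding register
    shadow_error_status: int = 0                 # survives a reset
    shadow_failed_transaction: Optional[bytes] = None

    def log_failure(self, error_bit: int, packet_header: bytes) -> None:
        self.error_status |= error_bit
        if self.failed_transaction is None:
            # Keep the first failure: it is usually closest to the root cause.
            self.failed_transaction = packet_header

    def reset(self) -> None:
        # Preserve the evidence in the shadow registers before clearing.
        self.shadow_error_status = self.error_status
        self.shadow_failed_transaction = self.failed_transaction
        self.error_status = 0
        self.failed_transaction = None


regs = PortRegisters()
regs.log_failure(ERR_BAD_CRC, b"\x12\x34")
regs.reset()  # e.g. triggered by a watchdog timer

# After the reset, recovery software can still see what went wrong.
print(regs.error_status, regs.shadow_error_status, regs.shadow_failed_transaction)
```

Without the shadow copies, the watchdog-triggered reset in this scenario would leave recovery software with clean registers and no lead on the original fault.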

