Architecting Highly Available CompactPCI Systems

Many of today’s telecommunications systems require five nines availability,
or 99.999% uptime. The downtime allowed in these systems is about 5.26 minutes
per year (525,600 minutes/year x (1 - 0.99999)). Those 5 minutes of downtime include
scheduled maintenance as well as any downtime that results from the failure
of any part of the system. Designing high-availability systems that are capable
of achieving five nines availability generally requires that every function
in the system be redundant; that is, there is no single point of failure. The road
to high-availability systems generally starts with redundant power supplies, fan
trays and mirrored hard drives. Adding these redundant components
decreases the probability that a component failure will cause a system failure,
and so increases the availability of the system. As you might expect, adding redundancy
to power supplies, fans and hard drives is relatively straightforward. Providing
redundant compute elements in a system is a more complicated challenge.
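
The downtime budget above, and the effect of adding a redundant component, can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are invented here, and the parallel formula assumes that redundant components fail independently and that failover is instantaneous and perfect, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability (0.0 to 1.0)."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def series_availability(*availabilities: float) -> float:
    """Availability of elements that must ALL work (single points of failure)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies where any one copy is sufficient.

    Assumes independent failures and perfect failover (an idealization).
    """
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    print(f"five nines budget: {downtime_minutes_per_year(0.99999):.2f} min/year")
    print(f"one 99.9% supply:  {parallel_availability(0.999, 1):.6f}")
    print(f"two 99.9% supplies: {parallel_availability(0.999, 2):.6f}")
```

Under those assumptions, two power supplies that are each 99.9% available give a combined 99.9999%, while any single non-redundant element placed in series caps the whole system at no more than its own availability.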

Application of CompactPCI to High Availability

Developers have been applying PICMG 2.0 CompactPCI Specification-compliant
systems to a variety of high-availability applications over the years. As the
market requirements for high availability have increased, CompactPCI systems
have had to evolve to meet the new challenges. The original CompactPCI systems
were simple bus-based architectures. Figure 1 shows a typical first-generation
CompactPCI architecture.

PICMG 2.0 CompactPCI-compliant systems are composed of one or more CompactPCI
bus segments. Each segment can contain up to eight CompactPCI board slots: one
system slot and up to seven peripheral slots. The PCI bus
is the primary communication path between the slots in each bus segment.
In this architecture the PCI bus and the system slot are single points of failure.
A misbehaving board in a peripheral slot can bring down the entire PCI bus segment, preventing
communication between any of the slots. This single point of failure was a significant
obstacle to the adoption of CompactPCI in high-availability applications. Early
architects of CompactPCI high-availability systems had to overcome the limitation
of the PCI bus as a single point of failure. The typical solution was to add a second
CompactPCI bus segment and duplicate the functionality in both bus segments, as
shown in Figure 2.

In Figure 2, dual bus segments and dual system slots are used to provide redundancy
for the single points of failure that exist in standard CompactPCI architectures.
In the Dual Segment architecture, each of the system slots can control either
of the two PCI bus segments. By providing redundant system slots, a failure of
either system slot can now be compensated for. This architecture also covers the
potential fault of a PCI bus. If a fault occurs in PCI Bus 1, then PCI Bus 2 is
available to handle the task.

The engineering challenges of this kind of architecture are considerable. The
system slot provides clocks, arbitration and interrupt servicing for a bus segment.
Failover of a system slot requires that the clock drivers, request/grant arbitration
and interrupt controllers also transfer to the newly active system slot. Knowing
when a bus has failed, and then bringing up the redundant system slot
without impacting total system availability, is difficult.
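
To make the two halves of that problem concrete (detecting the failure, then moving each bus resource), here is a minimal Python sketch. It is a toy model, not a description of any vendor's implementation or of the PICMG 2.13 work: the heartbeat mechanism, the 0.5-second timeout, the class name and the handover order are all assumptions made for illustration. Only the three resources being handed over come from the text.

```python
from dataclasses import dataclass, field
from typing import List

# The three resources named in the text; the ORDER is an assumption.
HANDOVER_STEPS = (
    "clock drivers",
    "request/grant arbitration",
    "interrupt controllers",
)


@dataclass
class StandbySystemSlot:
    """Hypothetical standby that watches heartbeats from the active slot."""

    heartbeat_timeout_s: float = 0.5  # assumed value, not from any spec
    last_heartbeat_s: float = 0.0
    active: bool = False
    log: List[str] = field(default_factory=list)

    def on_heartbeat(self, now_s: float) -> None:
        """Record that the active system slot was alive at time now_s."""
        self.last_heartbeat_s = now_s

    def poll(self, now_s: float) -> bool:
        """Take over if the peer has been silent longer than the timeout.

        Returns True only on the call that performs the takeover.
        """
        if self.active:
            return False
        if now_s - self.last_heartbeat_s <= self.heartbeat_timeout_s:
            return False
        for step in HANDOVER_STEPS:
            # A real system would reprogram hardware here; we only log.
            self.log.append(f"take over {step}")
        self.active = True
        return True
```

Even in this simplified form the difficulty is visible: a timeout short enough to protect availability risks false takeovers, and every step in the handover is a window in which the bus has no working owner.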

In 1999 PICMG formed a subcommittee to standardize an implementation of redundant
system slots. The PICMG 2.13 Redundant System Slot Specification was abandoned
three years later. PICMG 2.13 is the only subcommittee that was disbanded without
completing a specification. This is largely due to the complexity of the problem
and the proprietary solutions that already exist. It is clear that redundant system slots
in CompactPCI can be used to increase system availability, but at a cost and at
a level of complexity that are prohibitive. Vendors that provide this type of
architecture are selling proprietary solutions, not open architectures.

Adding IP Data Transport

In September 2001, PICMG approved the PICMG 2.16 Packet Switched Backplane
Specification. This specification defines 10/100/1000 Mbit/s Ethernet interconnects
between peripheral slots and fabric slots in a CompactPCI segment. The fabric
slots are redundant. PICMG 2.16-compliant systems have been deployed in a variety
of applications. The ubiquitous nature of the Ethernet interconnects and the
need for IP data transports have led to high levels of adoption among system
providers (Figure 3).

In PICMG 2.16-compliant systems, the IP data transport can be used as the primary
communication channel within the system. This communications path has redundant
links to redundant fabric slots. The PICMG 2.16 specification allows an architect
to avoid using the CompactPCI bus altogether, and provides a way of increasing
system availability without increasing the cost of the system. PICMG 2.16-compliant
systems are inherently redundant; there is no single point of failure. The
Ethernet fabric is a convenient way to handle the packet-based data transport
seen in next-generation applications.
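
How a board might use its redundant links can be sketched as a simple selection rule: prefer one fabric, fall back to the other when its link is down. The Python below is an illustration only. The names `fabric_a` and `fabric_b`, the function name and the prefer-then-fall-back policy are assumptions made here, not something taken from the PICMG 2.16 specification, and a deployed system would more likely rely on the operating system's link aggregation or routing features.

```python
from typing import Dict, Optional


def select_fabric_link(
    link_up: Dict[str, bool], preferred: str = "fabric_a"
) -> Optional[str]:
    """Pick the link to use given the current up/down state of each link.

    Returns the preferred link if it is up, otherwise the first other
    link (in name order) that is up, or None if every link is down.
    """
    if link_up.get(preferred, False):
        return preferred
    for name in sorted(link_up):
        if link_up[name]:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical link states for one peripheral slot.
    print(select_fabric_link({"fabric_a": True, "fabric_b": True}))
    print(select_fabric_link({"fabric_a": False, "fabric_b": True}))
    print(select_fabric_link({"fabric_a": False, "fabric_b": False}))
```

The board loses connectivity only when both links are down at once, which is the property that removes the single point of failure from the data path.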

The next step in the evolution of highly available CompactPCI systems is the
removal of the system slot. As applications take advantage of the IP interconnects
in today’s systems, the PCI bus is becoming an unused expense. PICMG is
working on a specification called CompactTCA. The CompactTCA specification is
expected to combine the system management capabilities defined in AdvancedTCA
(PICMG 3.0), the form factor defined in PICMG 2.0 and the data transport defined
in PICMG 2.16. This architecture will not contain a PCI bus. This kind of system
will be able to support 24 peripheral slots and two fabric slots. The elimination
of the PCI bus will reduce the cost of the boards used in CompactPCI systems,
reduce the complexities of providing redundant system slots and increase the
total slot count (Figure 4).

PICMG 2.16 Packet Switched Backplane is a viable way to improve the availability
of systems built today. The elimination of single points of failure found in first-generation
CompactPCI systems and the addition of redundant data transports provide the building
blocks necessary to achieve five nines availability. System designers should
beware of vendors providing products based on proprietary Redundant System Slot
architectures. These closed-architecture systems will not benefit from the CompactPCI
ecosystem that exists today. It is clear that CompactPCI systems using PICMG 2.16
Packet Switched Backplanes will provide the combination of point-to-point data
transports and redundancy necessary to achieve five nines availability as well
as providing a migration path to future technologies.

ADLink Technology
Irvine, CA.
(949) 727-2077.