SOLUTIONS ENGINEERING
PCI Express and InfiniBand
The PCI Express-InfiniBand Connection
PCI Express and InfiniBand initially competed for the same system space. Today, they fill two complementary roles, meeting and sometimes competing on the backplane.
JACK REGULA, PLX TECHNOLOGY
Those with a long memory will recall that InfiniBand (IB) was once touted as the replacement for PCI, destined to bring a new high-performance, high-availability I/O model to the enterprise. The advantages of a switched serial interconnect seemed overwhelming compared to the limitations of PCI’s parallel bus structure and the PCI tree topology. As it turned out, the weight of PCI’s infrastructure and its software legacy made it too heavy to unseat. The need for a completely new infrastructure, both software and hardware, proved to be an insurmountable barrier. After a long gestation period, the IB camp suffered attrition coincident with the bursting of the tech bubble. Just when we had written it off, signs began to appear that IB had secured a niche. Now the surviving IB players are enjoying its success as a cluster interconnect in high-performance computing.
When it became clear that IB would not replace PCI, the need remained to re-invent PCI. The lessons of the recent past were applied to this task (by the usual suspects), and out of the ashes of the original IB vision arose PCI Express (PCIe). PCIe uses a PHY very similar to IB but retains its PCI heritage. PCIe is configured just like PCI; its topology is a tree of PCI-to-PCI bridges in which routing is performed by address and device ID (bus, device, function number). With PCI-to-PCIe bridges, legacy PCI devices can be employed with good performance using unmodified legacy software. PCIe was driven to success by implementation on Northbridges and graphics processors, followed quickly by storage and network adapters for the enterprise. While systems still ship with a mix of PCI and PCIe slots, it is abundantly clear that PCIe is replacing PCI and AGP, and will soon be as ubiquitous as PCI once was.
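The ID routing mentioned above can be made concrete with a small sketch. It assumes the conventional PCI packing of a device ID into 16 bits (8-bit bus, 5-bit device, 3-bit function) and the usual bridge rule that an ID-routed packet is forwarded downstream when its bus number falls within the bridge’s secondary-to-subordinate bus range. The function names are illustrative only and do not come from any specification or product.

```python
def pack_bdf(bus, device, function):
    """Pack bus/device/function numbers into a 16-bit device ID."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus, device, or function number out of range")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(bdf):
    """Split a 16-bit device ID back into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


def bridge_forwards_downstream(bdf, secondary_bus, subordinate_bus):
    """True if a bridge owning this bus range passes the packet downstream."""
    bus, _, _ = unpack_bdf(bdf)
    return secondary_bus <= bus <= subordinate_bus


# Hypothetical target: bus 3, device 0, function 1.
target = pack_bdf(3, 0, 1)
print(hex(target), unpack_bdf(target))
print(bridge_forwards_downstream(target, secondary_bus=2, subordinate_bus=5))
```

Because every bridge in the tree applies the same range check, legacy enumeration software can assign bus numbers to a PCIe hierarchy exactly as it would to a PCI one.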
PCIe was initially set on a parallel technological track with IB, but it pursued an entirely different set of applications and market segments. PCIe provides cost-optimized but still high-performance I/O for the desktop and enterprise, limited to a single root complex, while IB provides performance-optimized cluster interconnects scaling beyond 10,000 processing nodes.
The embedded market has a history of taking enterprise and desktop technology and adapting it to its needs. The economies of scale thus leveraged are compelling. The storage and Ethernet controller chips developed for the enterprise and the desktop are used almost everywhere such interfaces are needed in the embedded space. x86 processors compete for tasks formerly performed by embedded and communications processors. When PLX applied non-transparent bridging, which is standard in PCI, to PCIe switches, it allowed PCIe to expand beyond its nominal single-host limit to support multiple hosts, failover and systems with redundant fabrics. In addition to non-transparent bridging, two PCIe specification development activities at the PCI-SIG, I/O Virtualization (IOV) and the PCIe Cable Standard, will allow PCIe to encroach further into IB’s interconnect space. However, it’s important to look at some of the key elements of these developing specifications to see just how far the new capabilities extend.
I/O Virtualization
For the past 18 months, the PCI-SIG’s IOV work group has been developing a standard that will allow the sharing of I/O adapters by multiple hosts, as well as by multiple system images running on a single host. IOV, whose specification the PCI-SIG expects to complete by the end of 2006, is part of a larger trend toward virtualization throughout the enterprise. Servers themselves are being virtualized to reduce maintenance costs, increase system resiliency and make better use of multi-core processors. When a server’s workload is divided among applications, each running under its own guest operating system, the side effects of faults and errors can be constrained to a single application. IOV supports that trend by allowing separate virtual I/O devices, within a single physical I/O component, to be assigned to each guest OS, thus limiting the scope of I/O errors to a single guest OS or system image.
I/O sharing and IOV are two sides of the same coin. I/O sharing is enabled when a single I/O device is made to look like many, primarily by giving each virtual instance of the I/O function its own set of control and status registers (CSRs). The ability to share I/O devices among blades provides a compelling advantage for PCIe as the backplane interconnect for blade servers. I/O can be removed from the blades and direct connections made from the root complexes to the backplane switch, saving both cost and latency. Throughput is increased by stepping up from 1 Gbit/s to x4 or x8 PCIe. The cost of 10 Gbit/s I/O adapters is amortized over all the compute blades in the backplane. The IOV standard minimizes the software impact of an otherwise revolutionary advance.
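As a rough check on the throughput claim, the arithmetic can be sketched as follows. The sketch assumes first-generation PCIe signaling, that is, 2.5 GT/s per lane with 8b/10b encoding, which the text above does not state explicitly. Packet and protocol overhead are ignored, so achievable throughput is lower than these figures.

```python
RAW_MBIT_PER_LANE = 2500          # assumed: first-generation PCIe, 2.5 GT/s per lane
DATA_BITS, ENCODED_BITS = 8, 10   # 8b/10b line code: 8 data bits per 10 bits sent


def pcie_data_rate_mbit(lanes):
    """Per-direction data rate of a link in Mbit/s, before packet overhead."""
    return lanes * RAW_MBIT_PER_LANE * DATA_BITS // ENCODED_BITS


for lanes in (1, 4, 8):
    print(f"x{lanes}: {pcie_data_rate_mbit(lanes)} Mbit/s per direction")
```

Under these assumptions an x4 link carries 8 Gbit/s and an x8 link 16 Gbit/s in each direction, compared with the 1 Gbit/s starting point.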
To share virtual instances of an I/O function among multiple hosts, one must create a multiple-host-aware fabric. The direction chosen by the PCI-SIG is to extend packet headers with a host ID field, which is added and removed at the host ports of switches to allow the use of legacy root complexes. Multi-host-aware switches implement a separate CSR space, including address and ID routing information, for each host, using the host ID to select the routing information for each packet as it passes through the switch. Multi-host-aware I/O controllers use the host ID as part of the address for incoming packets and attach it to outgoing packets to allow them to be routed upstream. The result is that each host sees a virtual PCI hierarchy fully compatible with that of a standard, single-host PCIe system. The multi-host fabric contains multiple virtual PCI hierarchies overlaid within a single physical switch or fabric, as illustrated via the color coding of Figure 1.
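The per-host routing described above can be pictured with a toy model. This is only a sketch of the idea, not the PCI-SIG mechanism: the class names, the table layout and the port numbers are all hypothetical, and only address routing is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualHierarchy:
    """Routing state kept for one host (hypothetical layout)."""
    # Each entry is (base, limit, egress_port); limits are inclusive.
    address_routes: list = field(default_factory=list)

    def route(self, address):
        for base, limit, port in self.address_routes:
            if base <= address <= limit:
                return port
        return None  # not claimed by any downstream port


class MultiHostSwitch:
    """Toy switch: one routing table per host, selected by host ID."""

    def __init__(self):
        self.host_of_port = {}   # host port -> host ID
        self.hierarchies = {}    # host ID -> VirtualHierarchy

    def attach_host(self, host_port, host_id):
        self.host_of_port[host_port] = host_id
        self.hierarchies[host_id] = VirtualHierarchy()

    def add_route(self, host_id, base, limit, egress_port):
        self.hierarchies[host_id].address_routes.append((base, limit, egress_port))

    def forward_from_host(self, host_port, address):
        # The host ID is attached at the host port, so the root complex
        # itself never needs to know that the ID exists.
        host_id = self.host_of_port[host_port]
        return host_id, self.hierarchies[host_id].route(address)


switch = MultiHostSwitch()
switch.attach_host(host_port=0, host_id=1)
switch.attach_host(host_port=1, host_id=2)
# Both hosts use the same address range yet reach different ports.
switch.add_route(1, 0x8000_0000, 0x8FFF_FFFF, egress_port=4)
switch.add_route(2, 0x8000_0000, 0x8FFF_FFFF, egress_port=5)
print(switch.forward_from_host(0, 0x8000_1000))
print(switch.forward_from_host(1, 0x8000_1000))
```

The point of the model is that identical addresses issued by different hosts resolve through different tables, which is what lets each host keep an ordinary single-host view of the fabric.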

A careful examination of Figure 1 will show no connections among the separate virtual hierarchies. The IOV specification is silent on host-to-host communications, neither standardizing it nor precluding it. It is a relatively simple matter for a switch vendor to add non-transparent bridges to allow hosts to open windows into each other’s domains. It is only slightly more complicated for the switch to include a DMA engine to speed the transfer of data through those windows. Such proprietary features fit the embedded and communications usage models better than enterprise blade servers, where software and security concerns argue against the use of PCIe as a host-to-host interconnect.
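The “window” that a non-transparent bridge opens into another host’s domain amounts to an address translation, which a few lines can illustrate. The sketch below assumes a simple base-plus-offset mapping with a single window; the class name, field names and addresses are hypothetical, and real devices add access controls, device ID translation and other details not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtWindow:
    """One window from the local host's address space into a remote host's."""
    local_base: int    # where the window appears to the local host
    size: int          # window length in bytes
    remote_base: int   # where the window lands in the remote host's space

    def translate(self, local_address):
        offset = local_address - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError("address falls outside the window")
        return self.remote_base + offset


# Hypothetical 1 MB window mapped to a buffer in the other host's memory.
window = NtWindow(local_base=0x9000_0000, size=0x10_0000, remote_base=0x2000_0000)
print(hex(window.translate(0x9000_0040)))
```

A DMA engine in the switch would perform the same translation on each transfer it issues, sparing the processors from moving the data themselves.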
Cable Standards
A cable specification is included in the IB specification. Because of the close match in PHY between the two interconnects, IB cables will work for PCIe. However, PCIe devices don’t drive the cable as hard as IB does, so the maximum length of passive cable usable for PCIe is lower than it is for IB. Preliminary results suggest that PCIe can drive passive cables of approximately seven to eight meters in length. Increased lengths are possible using techniques such as re-driving, re-timing and/or equalization using active devices in the connector shell, enabled by defined power pins.
The primary motivations for the PCIe cable specification were these:
• Support multiple link width options: x1, x4, x8, x16
• Reach cost targets suitable for high-volume markets
• Support the I/O usage model
However, unlike that of IB, the I/O usage model for PCIe resulted in its cable specification having defined upstream and downstream ends. This, together with the presence of sideband reset and clock signals, among other factors, complicates its use in multi-host systems. The cable specification doesn’t address such uses.
The PCIe cable drives a reference clock from the upstream end to the downstream end. If spread spectrum clocking (SSC) is used in the system, then the reference clock must be used by the PHY at the downstream end of the cable. This fits with the standard I/O usage model that employs a common reference clock throughout the system. Multi-host systems, whether stacked or bladed, are likely to have an independent SSC clock at each host. While it is possible to build switch fabrics that can handle multiple independent SSC links and clocks, this is not required for compliance with the PCIe specification. Leading PCIe switch vendors continue to develop solutions for this problem in response to growing market demand. PLX, for example, makes an eight-lane switch that can be used in a two-port configuration to drive a cable, while providing both non-transparent bridging and SSC clock domain isolation.
PCIe is an I/O interconnect that can be enhanced to connect multiple processors, while IB, originally intended for use as an I/O interconnect, instead has become a clustering interconnect. PCIe and IB meet and in some cases compete on the backplane. Both PCIe and IB are specified as optional backplane interconnect technologies for ATCA, for example, where strangely enough, the front-runner is Ethernet. PCIe is, after all, the chip-to-chip interconnect used between the processor chip sets and the IB or Ethernet controllers used to drive the backplane, and is used throughout such systems as a local bus and mezzanine interconnect.
While non-transparent bridging is a familiar and indeed de facto standard, the remaining specification infrastructure for PCIe-linked multi-host systems is still developing, and switches with multiple non-transparent bridges or IOV multi-host capabilities have yet to appear. As time passes and the PCIe infrastructure is enhanced by both standard and proprietary means, PCIe will see increasing use in multi-host systems. Nevertheless, IB will remain secure in its niche, both because it can scale well beyond the backplane and because of the barrier created by all the clustering software that continues to be developed around it.
PLX Technology
Sunnyvale, CA.
(408) 774-9064.
[www.plxtech.com].

