TECH FEATURE
PCI Express
PCI Express and InfiniBand Form a Fruitful Marriage
The emergence of PCI Express gives a new boost to the value of InfiniBand. The complementary technologies will enable greater overall performance and scalability.
KEVIN DEIERLING, MELLANOX TECHNOLOGIES
The development of PCI Express local interconnect technology has been perhaps more closely watched and anticipated than any other technology of the past year. Now that initial PCI Express platforms are beginning to appear, designers and system architects have the first real opportunity to understand the technology at a practical level. Two key questions that need to be addressed are what advantages PCI Express technology offers over conventional PCI bus architectures and what adapters are available that can benefit from these advantages.
The answer to the first is that PCI Express delivers many key advantages over conventional PCI and PCI-X bus architectures, including reduced system complexity, greater scalability and reliability, higher bandwidth and lower latency. The answer to the second is that InfiniBand channel adapters are available that can push PCI Express to its limits.
Blending Express and InfiniBand
PCI Express offers many advantages over conventional bus-based PCI technology, and bandwidth in particular stands out. InfiniBand channel adapters are the ideal complement to the PCI Express local interconnect, allowing scalable systems to be assembled. The reduced system complexity of PCI Express-based platforms can be seen in Figure 1. This reduced complexity also contributes to the significant improvements in data transfer latency achievable with PCI Express systems.

InfiniBand channel adapters can overcome the single physical address space limitation of PCI Express, making it possible to implement a distributed architecture of sophisticated subsystems, each with PCI Express at its core. Furthermore, InfiniBand can support a completely heterogeneous system consisting of both PCI-X bus and PCI Express-based subsystems. In these distributed architectures InfiniBand consolidates communications, storage and clustering traffic on a unified fabric, while PCI Express delivers this data efficiently to the CPU and memory subsystem. The consolidation of multiple traffic flows across the InfiniBand fabric can easily generate enough bandwidth to consume the performance offered by PCI Express.
The InfiniBand fabric can be switched across copper traces on a backplane, run through copper cables between subsystems, or even connect chassis over long distances using fiber links. PCI Express provides the tight coupling within the system that InfiniBand needs to deliver the data to the processor and memory subsystems.
Bandwidth Suits High-Performance Apps
For many high-performance systems the bandwidth advantage is even more compelling than the reduced system complexity. An 8X PCI Express link delivers up to 4 Gbytes/s of aggregate raw bandwidth, nearly four times the raw bandwidth of even the fastest PCI-X bus. This is truly a staggering level of performance. However, to be useful, this raw bandwidth needs to be coupled with a technology capable of taking advantage of it.
Fortunately InfiniBand components are available today that are perfectly matched to the bandwidth delivered by PCI Express (Figure 2). An InfiniBand 4X link can simultaneously transmit and receive 1 Gbyte/s of data, for 2 Gbytes/s in total. Two such links therefore offer the same 4 Gbytes/s as the raw bandwidth of an 8X PCI Express link. The InfiniHost III dual port channel adapter card features two 4X InfiniBand ports and plugs into an 8X PCI Express slot.
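The bandwidth match described above can be checked with a little arithmetic. The sketch below is illustrative only: the 2.5 Gbit/s per-lane signaling rate and the 8b/10b encoding overhead are assumptions about first-generation PCI Express and InfiniBand links that the article does not state, and the function names are made up for this example.

```python
# Back-of-the-envelope link bandwidth in Gbytes/s.
# Assumptions (not stated in the article): each lane signals at 2.5 Gbit/s
# per direction and uses 8b/10b encoding, for both PCI Express and InfiniBand.

LANE_SIGNAL_GBITS = 2.5        # assumed signaling rate per lane, per direction
ENCODING_EFFICIENCY = 8 / 10   # assumed 8b/10b: 8 data bits per 10 line bits


def serial_link_gbytes(lanes: int, directions: int = 2) -> float:
    """Data bandwidth of a serial link, summed over the given directions."""
    data_gbits_per_direction = lanes * LANE_SIGNAL_GBITS * ENCODING_EFFICIENCY
    return data_gbits_per_direction * directions / 8  # bits -> bytes


def parallel_bus_gbytes(clock_mhz: float, width_bits: int) -> float:
    """Peak bandwidth of a shared, half-duplex parallel bus."""
    return clock_mhz * 1e6 * width_bits / 8 / 1e9


pcie_x8 = serial_link_gbytes(lanes=8)              # one 8X PCI Express link
ib_4x_dual_port = 2 * serial_link_gbytes(lanes=4)  # two 4X InfiniBand ports
pcix_133 = parallel_bus_gbytes(clock_mhz=133, width_bits=64)

print(f"PCI Express 8X (both directions): {pcie_x8:.2f} Gbytes/s")
print(f"Two InfiniBand 4X ports:          {ib_4x_dual_port:.2f} Gbytes/s")
print(f"PCI-X 133 MHz, 64-bit:            {pcix_133:.2f} Gbytes/s")
print(f"PCI Express 8X vs. PCI-X:         {pcie_x8 / pcix_133:.1f}x")
```

Under these assumptions the 8X PCI Express link and the two 4X InfiniBand ports both come to 4 Gbytes/s, while the PCI-X bus peaks at a little over 1 Gbyte/s.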

In many cases theoretical bandwidth and usable bandwidth are two entirely different things, particularly as bandwidths increase above 1 Gbit/s. In the case of software transport stacks, the promise of theoretical bandwidth is not met because the performance is limited by CPU processing requirements rather than the raw bandwidth available.
As a rule of thumb, 1 Hz of CPU processing power is required for each bit per second of transport-level communications, in other words reliable, in-order, connected services. This means that a processor running at around 1 GHz is required to deliver 1 Gbit/s of useful data throughput. In fact, operating system overhead and memory bandwidth limitations make software transport stacks even less efficient at these high data rates. Thus conventional software transport stacks such as TCP/IP are simply not able to take advantage of the 4 Gbytes/s of aggregate bandwidth (16 Gbits/s in each direction) of an 8X PCI Express slot.
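The rule of thumb lends itself to a quick calculation. The following is a minimal sketch that simply applies the 1 Hz per bit/s figure quoted above; the 3 GHz processor is a hypothetical example, and the function names are invented for illustration.

```python
# Rule of thumb from the text: roughly 1 Hz of CPU per bit/s of transport traffic.
HZ_PER_BIT_PER_SECOND = 1.0


def cpu_ghz_needed(throughput_gbits: float) -> float:
    """CPU clock (GHz) a software transport stack would need for this rate."""
    return throughput_gbits * HZ_PER_BIT_PER_SECOND


def software_ceiling_gbits(cpu_ghz: float) -> float:
    """Best-case throughput (Gbit/s) a software stack gets from this CPU."""
    return cpu_ghz / HZ_PER_BIT_PER_SECOND


for rate_gbits in (1, 8, 16):
    print(f"{rate_gbits:>2} Gbit/s needs about {cpu_ghz_needed(rate_gbits):.0f} GHz of CPU")

# Hypothetical 3 GHz processor, fully dedicated to transport processing.
example_cpu_ghz = 3.0
print(f"A {example_cpu_ghz:.0f} GHz CPU tops out near "
      f"{software_ceiling_gbits(example_cpu_ghz):.0f} Gbit/s")
```

Even before operating system overhead and memory bandwidth are counted, the estimate leaves a software stack far short of what the slot can carry.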
Hardware Transport and Remote DMA
The only way to take advantage of the raw performance offered by PCI Express is to implement both transport processing and remote direct memory access in hardware. At 16 Gbits/s a software implementation is simply not feasible, since no available processor can touch every bit of a data flow and keep pace. Transport processing includes connection de-multiplexing, sequence number checking, ordering, dropped packet detection and re-transmission. Remote direct memory access (RDMA) provides the further benefit of kernel bypass by delivering data precisely to where the application needs it, thereby eliminating the typical operating system overhead.
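To make the transport processing steps listed above concrete, here is a simplified receive-side sketch. It is a generic illustration of de-multiplexing, sequence checking, in-order delivery and drop detection, not the InfiniBand wire protocol or any adapter's actual logic; the class names, reply tuples and connection ID are all hypothetical. A channel adapter performs this kind of per-packet work in hardware, which is exactly the work a software stack would otherwise spend CPU cycles on.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Per-connection receive state (hypothetical, simplified)."""
    expected_seq: int = 0
    delivered: list = field(default_factory=list)


class Receiver:
    """Toy reliable, in-order receiver covering the steps named in the text."""

    def __init__(self):
        self.connections = {}

    def on_packet(self, conn_id, seq, payload):
        # Connection de-multiplexing: find the state for this connection.
        conn = self.connections.setdefault(conn_id, Connection())

        # Sequence number checking and ordering.
        if seq == conn.expected_seq:
            conn.delivered.append(payload)
            conn.expected_seq += 1
            return ("ack", seq)

        if seq < conn.expected_seq:
            # Duplicate of a packet that was already delivered; do not deliver again.
            return ("dup", seq)

        # Dropped packet detection: a gap in the sequence. Ask the sender
        # to re-transmit starting from the packet that is missing.
        return ("nak", conn.expected_seq)


rx = Receiver()
replies = [
    rx.on_packet(7, 0, b"a"),  # in order
    rx.on_packet(7, 2, b"c"),  # packet 1 was lost on the way
    rx.on_packet(7, 1, b"b"),  # sender re-transmits 1
    rx.on_packet(7, 2, b"c"),  # and then 2
]
print(replies)
print(rx.connections[7].delivered)
```

In this sketch the out-of-order packet is discarded rather than buffered, so the sender must resend everything from the gap onward; that is one simple policy among several possible ones.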
That fact was clearly recognized by the architects who defined the InfiniBand specification and therefore channel adapters are required to implement both transport offload and RDMA in hardware. InfiniBand host channel adapters are able to achieve actual usable process-to-process bandwidth approaching the theoretical capabilities of the PCI Express local interface. As can be seen in the leftmost columns of Figure 3, a uni-directional data flow of a single InfiniBand 4X link is able to saturate a PCI-X 133 MHz 64-bit bus. In fact, the bi-directional bandwidth of PCI-X is no better than the uni-directional bandwidth simply because the half duplex PCI-X shared bus has run out of steam and limits the performance.

By contrast, the middle columns of Figure 3 show that with PCI Express this limitation is overcome and the bi-directional data flow achieves bandwidth approaching the theoretical 2 Gbytes/s of a single 4X link. The far right columns of the graph show that when both InfiniBand 4X ports are operating the bandwidth achieved is well over 2.5 Gbytes/s. InfiniBand’s hardware implementation of both transport offload and RDMA is critical to achieving this level of performance.
Sidebar: Binary Compatibility between PCI-X and PCI Express
Strictly speaking, it is not possible to achieve binary compatibility between a software device driver intended for PCI-X and one intended for PCI Express, even if they support devices that are fundamentally identical. The reason is that during the discovery process a sweep of the PCI bus (or the PCI Express hierarchy) occurs and all devices are discovered. Different devices are recognized by a unique PCI Device ID, and this is used to load the corresponding software device driver. A device with a PCI-X interface should have a different Device ID than a device with a PCI Express interface and would therefore require a separate, non-binary-compatible device driver. If, however, the devices are 100% compatible at the software level, then it is in fact possible to break this rule and report the exact same Device ID for two different devices. In this way, whether a device is plugged into the PCI-X bus or a different device is plugged into a PCI Express slot, the same device driver is loaded. Thus 100% binary software compatibility can be achieved.
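The driver-loading behavior described in the sidebar can be sketched in a few lines. This is a toy model, not real operating system code: the vendor and device IDs, the driver name and the device descriptions are all hypothetical, and matching on the vendor ID together with the Device ID is an assumption added here for realism.

```python
from dataclasses import dataclass

# Hypothetical driver table: (vendor ID, device ID) -> driver name.
DRIVER_TABLE = {
    (0x1234, 0xABCD): "example_hca_driver",
}


@dataclass(frozen=True)
class Device:
    name: str
    interface: str   # "PCI-X" or "PCI Express"; not used for driver matching
    vendor_id: int
    device_id: int


def discover_and_bind(devices):
    """Sweep all discovered devices and pick a driver purely by ID."""
    bound = {}
    for dev in devices:
        driver = DRIVER_TABLE.get((dev.vendor_id, dev.device_id))
        if driver is not None:
            bound[dev.name] = driver
    return bound


# Two different cards that deliberately report the same IDs.
pcix_card = Device("adapter on PCI-X", "PCI-X", 0x1234, 0xABCD)
pcie_card = Device("adapter on PCI Express", "PCI Express", 0x1234, 0xABCD)
other_card = Device("unrelated card", "PCI-X", 0x1234, 0x0001)

bound = discover_and_bind([pcix_card, pcie_card, other_card])
for name, driver in bound.items():
    print(f"{name}: {driver}")
```

Because the lookup never consults the interface type, both adapters end up bound to the same driver binary, which is the effect the sidebar describes.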
PCI Express and Software Compatibility
Normally, adopting new technologies and achieving new levels of performance come at the price of disrupting existing software models, but not so with PCI Express. At its core, PCI Express is a serial, point-to-point, local interconnect technology that offers complete backward compatibility with previous PCI local bus architectures, including PCI-X. In fact this backward compatibility is so comprehensive that it is possible to design adapters whose software drivers are 100% identical at the binary level and can operate with either a PCI-X or a PCI Express device (see sidebar).
Mellanox Technologies
Santa Clara, CA.
(408) 970-3400.
[www.mellanox.com].

