TECHNOLOGY IN CONTEXT
Choosing the Right Programmable Logic Solution for PCI Express Applications
FPGAs can provide fully integrated PCI Express solutions. However, designers must look for the right combination of power, performance and density to meet multilayer data path challenges. A set of selection criteria can help them uncover the right solution.
BOB BLAKE, ALTERA
PCI Express is rapidly establishing itself as the successor to PCI, providing higher performance, increased flexibility and scalability for next-generation systems, while maintaining software compatibility with existing PCI applications widely deployed in compute, storage, communications and general embedded systems. Compared to legacy PCI, the PCI Express protocol is considerably more complex, with three layers—the transaction, data link and physical layers. Consequently, designers must evaluate a more detailed set of selection criteria before choosing their programmable logic solutions for PCI Express applications.
Layered protocols have been used for years in data communications. They permit isolation between different functional areas in the protocol and allow updating or upgrading of one or more layers, often without requiring changes in the other layers. For example, new transaction types might be included in newer revisions of a protocol specification that don’t affect lower layers, or the physical media might be changed with no major effects on higher layers.
PCI Express employs dual-simplex point-to-point links to overcome the limitations of a shared bus. The links use high-speed serial transceivers with embedded clock and data differential signals operating at 2.5 Gbits/s with industry-standard 8b/10b encoding. A link can consist of a single (x1) lane providing peak bandwidth of 500 Mbytes/s (2 directions x 2.5 Gbits/s x 8/10 encoding). For scalability, multiple lanes can be used, including x2, x4, x8, x12, x16 and x32. The data bytes are striped across the lanes that make up the link, and the peak bandwidth increases accordingly (e.g., a x4 link provides peak bandwidth of 2 Gbytes/s). To further overcome the limitations of a multi-drop shared bus, non-blocking switches are used to connect multiple PCI Express devices together. Table 1 summarizes the performance advantages of the switch-based PCI Express architecture compared to legacy PCI.
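The peak bandwidth arithmetic above can be reproduced with a short Python sketch. The function and constant names are illustrative assumptions, and the figures are raw link rates before any packet or protocol overhead.

```python
SIGNALING_RATE_BITS_PER_S = 2_500_000_000  # per lane, per direction (from the text)

def peak_bandwidth_bytes_per_s(lanes: int, directions: int = 2) -> int:
    """Peak link bandwidth in bytes/s, before any packet overhead."""
    line_bits = directions * lanes * SIGNALING_RATE_BITS_PER_S
    data_bits = line_bits * 8 // 10  # 8b/10b: 8 data bits per 10 line bits
    return data_bits // 8            # bits to bytes

for lanes in (1, 2, 4, 8, 12, 16, 32):
    mbytes = peak_bandwidth_bytes_per_s(lanes) // 1_000_000
    print(f"x{lanes:<2} {mbytes:>6} Mbytes/s")
```

Passing `directions=1` gives the per-direction figure, which is the rate the logic behind each side of the link has to sustain.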
Unlike PCI, the PCI Express protocol structure does not use sideband signaling, further reducing the number of pins per port. From the transmitting side of a PCI Express transaction, transaction layer packets (TLPs) are formed at the transaction layer with protocol information (transaction type, recipient address, transfer size, etc.) inserted in header fields. These are then passed down to the data link layer to calculate and append a 32-bit cyclic redundancy check (CRC-32) and a sequence ID for link level data integrity management (packet acknowledgement and retry mechanisms).
The physical layer then frames the packet with start and end of packet bytes before lane striping (multi-lane links only), byte scrambling to reduce electromagnetic emissions (EMI), 8b/10b encoding to ensure sufficient transitions for clock and data recovery, and serialization of the 10-bit symbols before transmission across the link to the receiving device. The packet then traverses up the protocol stack at the receiving device until data is extracted and passed to the device core.
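As a rough illustration of the lane striping step only, the sketch below distributes consecutive bytes of an already framed packet round-robin across the lanes and reassembles them at the receiver. Framing, scrambling, 8b/10b encoding and serialization are deliberately omitted, and the function names are assumptions made for this example.

```python
def stripe_bytes(framed_packet: bytes, lanes: int) -> list[bytes]:
    """Transmit side: byte 0 goes to lane 0, byte 1 to lane 1, and so on."""
    return [framed_packet[lane::lanes] for lane in range(lanes)]

def unstripe_bytes(lane_data: list[bytes]) -> bytes:
    """Receive side: interleave the lanes back into the original byte order."""
    lanes = len(lane_data)
    out = bytearray(sum(len(data) for data in lane_data))
    for lane, data in enumerate(lane_data):
        out[lane::lanes] = data
    return bytes(out)

packet = bytes(range(8))           # stand-in for a framed packet
striped = stripe_bytes(packet, 4)  # x4 link
print([lane.hex() for lane in striped])
print(unstripe_bytes(striped) == packet)
```

On a x1 link the striping step is a pass-through, which matches the "multi-lane links only" qualifier above.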
Although PCI Express encapsulates transactions as packets, backward compatibility is maintained at the application layer. From a system perspective, PCI Express retains many similarities to PCI while offering significantly higher system performance and scalability (Figure 1).
In a PCI Express system, a root complex device connects the processor and memory subsystem to the PCI Express switch fabric comprised of one or more switch devices. Similar to a host bridge in a PCI system, the root complex generates transaction requests on behalf of the processor, which is interconnected through a local bus. Root complex functionality may be implemented as a discrete device, or may be integrated with the processor. A root complex may contain more than one PCI Express port and multiple switch devices can be connected to ports on the root complex or cascaded.
Forward bridges are used to connect legacy PCI or PCI-X subsystems to the PCI Express fabric, and reverse bridges can be used to connect a PCI Express fabric to a PCI host bridge. Endpoints complete (or request) PCI Express transactions (e.g., a Gigabit Ethernet controller with a PCI Express system interface). The physical and data link layers of each PCI Express device (root complex, switch, bridge and endpoint) provide the same functionality, but the transaction layers have some differences.
Transaction Layer Criteria
In addition to the formation of TLPs discussed earlier, the transaction layer prioritizes the various types of data transaction into traffic classes (TCs) and maps them into virtual channels (VCs), which are used to provide deterministic latencies as the packets traverse the fabric. Each TC may be mapped into a separate VC (e.g., TC0 to VC0, TC1 to VC1, etc.), or multiple TCs may be mapped into a single VC (e.g., TC[2:0] to VC0, TC[7:3] to VC1), depending on the requirements of the system and in particular, the requirements of the different endpoints.
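The two mapping styles described above can be expressed as a simple lookup table indexed by traffic class. This is a minimal sketch: the helper name and the dictionary layout are assumptions for illustration and do not represent any device's configuration format.

```python
NUM_TRAFFIC_CLASSES = 8  # TC0..TC7, as implied by the TC[7:3] example

def build_tc_to_vc_map(vc_groups: dict) -> list:
    """vc_groups maps a VC number to the TCs carried on it.

    Returns a list indexed by TC; an entry stays None if that TC is unmapped.
    """
    tc_to_vc = [None] * NUM_TRAFFIC_CLASSES
    for vc, tcs in vc_groups.items():
        for tc in tcs:
            if tc_to_vc[tc] is not None:
                raise ValueError(f"TC{tc} is mapped to more than one VC")
            tc_to_vc[tc] = vc
    return tc_to_vc

# One TC per VC: TC0 to VC0, TC1 to VC1, etc.
one_to_one = build_tc_to_vc_map({vc: [vc] for vc in range(8)})
# Two VCs: TC[2:0] to VC0, TC[7:3] to VC1.
two_vcs = build_tc_to_vc_map({0: range(0, 3), 1: range(3, 8)})
print(one_to_one)
print(two_vcs)
```

The second mapping needs only two VC buffers instead of eight, which is the resource trade-off behind the choice.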
Each VC requires a separate FIFO buffer and control logic, which reside in the transaction layer of the port. Switch devices support both port arbitration (packets on different ingress ports, same VC) and VC arbitration (packets in VC buffer of egress ports). Port arbitration may be based on round-robin, weighted round-robin or time-based round-robin schemes, and VC arbitration may be based on strict-priority, round-robin or weighted round-robin schemes.
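To make one of these schemes concrete, here is a minimal weighted round-robin arbiter in Python. It is a software sketch under simple assumptions (weights count whole packets, and an empty queue forfeits its turn), not a description of how any particular switch implements arbitration.

```python
from collections import deque

def weighted_round_robin(queues: dict, weights: dict) -> list:
    """Drain the queues, taking up to weights[name] packets from each per round."""
    order = []
    while any(queues.values()):  # an empty deque is falsy
        for name, queue in queues.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                order.append(queue.popleft())
    return order

queues = {
    "VC0": deque(["a0", "a1", "a2"]),
    "VC1": deque(["b0", "b1", "b2"]),
}
grant_order = weighted_round_robin(queues, {"VC0": 2, "VC1": 1})
print(grant_order)
```

Setting every weight to 1 reduces this to plain round-robin. A strict-priority arbiter would instead always serve the highest-priority non-empty queue.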
Each packet has a unique identifier that enables response packets to be directed to the correct originator. The packet format supports 32-bit memory addressing, or extended 64-bit memory addressing. Packet lengths of up to 4 Kbytes are supported. To accommodate this, an FPGA must provide enough logic elements (LEs) and embedded RAM to buffer and process packets of this size.
FPGAs can be used to implement any one of the PCI Express devices (root complex, switch, bridge or endpoint), and in many cases would be used to implement unique designs with multiple elements. For example, an FPGA could be configured with a host interface, memory controller, two PCI Express ports, one PCI port and a Gigabit Ethernet controller, providing a highly integrated device solution comprised of root complex, switch, bridge and endpoint functionality.
Accordingly, the selection criteria for the FPGA must include the aggregate performance and density requirements of the application (i.e., the frequency and width of FPGA core logic) and the performance requirements of each interface, including both the I/O signaling and throughput rates of each interface (PCI Express, memory, PCI and endpoint). For example, an FPGA might support a 2.5 Gbits/s I/O signal, but lack the performance for the PCI Express protocol layers and the device core logic, or perhaps have insufficient performance on the memory interface, or have inadequate signal integrity for a particular source-synchronous interface standard required by the application. Lastly, the combination of interfaces directly impacts the pin requirements and thus the FPGA package options.
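The relationship between link width, core data path width and core clock frequency can be estimated with a short calculation. The sketch below assumes the core logic must keep up with one direction of the link at the 8b/10b-decoded data rate and ignores packet overhead. The function name and the example widths are illustrative.

```python
def required_core_clock_hz(lanes: int, datapath_bits: int) -> float:
    """Core clock needed to sustain one direction of the link at full rate."""
    line_bits_per_s = lanes * 2_500_000_000   # 2.5 Gbits/s per lane
    data_bits_per_s = line_bits_per_s * 8 // 10  # after 8b/10b decoding
    return data_bits_per_s / datapath_bits

for lanes, width in ((1, 8), (1, 16), (4, 64), (8, 64)):
    mhz = required_core_clock_hz(lanes, width) / 1e6
    print(f"x{lanes} link, {width}-bit data path: {mhz:.1f} MHz")
```

Widening the data path lowers the required clock rate at the cost of more logic and routing, which is the frequency-versus-width trade-off noted above.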
System designers should consider an FPGA with an architecture or migration path that allows for scalability, including the number of virtual channels, lanes and maximum packet size to be supported now and in the foreseeable future, since all of these parameters impact the resource (logic and memory) requirements of the design.
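A first-order feel for how these parameters scale the memory requirement can be had from a simple product. The sketch below is an assumption-laden estimate: it counts only VC receive buffers, assumes each buffer holds a fixed number of maximum-size packets, and ignores headers, retry buffers and control logic. The example configurations are hypothetical.

```python
def vc_buffer_bytes(ports: int, vcs_per_port: int,
                    packets_per_vc: int, max_packet_bytes: int) -> int:
    """Rough embedded RAM needed for VC receive buffers, in bytes."""
    return ports * vcs_per_port * packets_per_vc * max_packet_bytes

# Hypothetical starting point: one port, one VC, 256-byte packets.
initial = vc_buffer_bytes(ports=1, vcs_per_port=1,
                          packets_per_vc=4, max_packet_bytes=256)
# Hypothetical growth path: two ports, two VCs, 4-Kbyte packets.
future = vc_buffer_bytes(ports=2, vcs_per_port=2,
                         packets_per_vc=4, max_packet_bytes=4096)
print(initial, future, future // initial)
```

Even this crude estimate shows the buffer requirement growing by well over an order of magnitude between the two configurations, which is why a migration path matters.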
Data Link Layer Criteria
As discussed earlier, the primary role of the PCI Express data link layer is to ensure reliable delivery of the data packets across the PCI Express link by way of a CRC-32. The data link layer tracks the state of the link, communicates link status with the transaction and physical layers and performs link management. The data link layer contains the data link control and management state machine (DLCMSM) to perform these tasks. The data link layer stores packets in a retry buffer until an acknowledgement (ACK) is received from the data link layer of the adjacent device based on the results of the CRC. If the CRC fails and a non-acknowledgement (NAK) is received, the stored packet is re-transmitted.
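The retry mechanism can be sketched in a few lines of Python. This is a simplified model, not the specification's algorithm: it acknowledges packets one at a time, uses an unbounded sequence counter, and substitutes `zlib.crc32` for the link CRC. The class and function names are assumptions for illustration.

```python
import zlib

def link_crc(seq: int, tlp: bytes) -> int:
    """Stand-in CRC-32 over the sequence ID and the packet (an assumption)."""
    return zlib.crc32(seq.to_bytes(4, "big") + tlp)

class RetryBuffer:
    """Transmit side: keeps each packet until the receiver acknowledges it."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._stored = {}  # sequence ID -> (seq, tlp, crc)

    def send(self, tlp: bytes) -> tuple:
        seq = self._next_seq
        self._next_seq += 1
        frame = (seq, tlp, link_crc(seq, tlp))
        self._stored[seq] = frame  # hold a copy in case of a NAK
        return frame

    def on_ack(self, seq: int) -> None:
        self._stored.pop(seq, None)  # delivered, so free the buffer entry

    def on_nak(self, seq: int) -> tuple:
        return self._stored[seq]     # re-transmit the stored copy

    def pending(self) -> int:
        return len(self._stored)

def receiver_accepts(frame: tuple) -> bool:
    """Receive side: recompute the CRC and compare; True means ACK."""
    seq, tlp, crc = frame
    return link_crc(seq, tlp) == crc

tx = RetryBuffer()
sent = tx.send(b"memory write")
damaged = (sent[0], b"memory wrlte", sent[2])  # simulate a corrupted payload
print(receiver_accepts(damaged))               # False, so the receiver NAKs
resent = tx.on_nak(sent[0])
print(receiver_accepts(resent))                # True, so the receiver ACKs
tx.on_ack(resent[0])
print(tx.pending())
```

The retry buffer is one reason link width and maximum packet size feed directly into the embedded RAM requirement: every unacknowledged packet occupies memory until its ACK arrives.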
To ensure maximum efficiency across the link, a credit-based flow control protocol ensures that packets are only transmitted when it is known that a corresponding VC buffer is available to receive the packet at the other end. If not, a packet of lower priority is transmitted while the target device completes the processing of the higher priority packets and clears its Rx VC buffer.
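The credit check described above can be modeled as a small gate in front of the transmitter. In this hedged sketch, credits are tracked per VC as plain integers and each packet carries a credit cost. The class name, the tuple layout and the priority ordering of the pending list are assumptions made for the example.

```python
class CreditGate:
    """Transmit side: send only when the receiver has advertised buffer space."""

    def __init__(self, initial_credits: dict) -> None:
        self.credits = dict(initial_credits)  # VC number -> available credits

    def pick_next(self, pending: list):
        """pending holds (vc, credits_needed), highest priority first.

        Returns the index of the first packet that fits, or None if none do.
        """
        for index, (vc, cost) in enumerate(pending):
            if self.credits.get(vc, 0) >= cost:
                self.credits[vc] -= cost
                return index
        return None

    def replenish(self, vc: int, count: int) -> None:
        """Called when the receiver reports that it has freed buffer space."""
        self.credits[vc] = self.credits.get(vc, 0) + count

gate = CreditGate({0: 1, 1: 0})        # VC1 buffer at the far end is full
pending = [(1, 1), (0, 1)]             # the VC1 packet has higher priority
print(gate.pick_next(pending))         # 1: the lower-priority VC0 packet goes
gate.replenish(1, 1)                   # receiver clears its VC1 buffer
print(gate.pick_next([(1, 1)]))        # 0: the VC1 packet can now be sent
```

Because a packet is never sent without a credit, the receiver never has to drop one for lack of buffer space.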
Data link layer packets (DLLPs) are used for a variety of link management functions, including transaction layer packet (TLP) acknowledgement, power management and exchange of flow control information. DLLPs are generated at the data link layer, transmitted through the physical layers of the link along with TLPs, and consumed by the data link layer of the receiving device. DLLPs are also protected from signal integrity errors, with a CRC-16 appended to a 4-byte packet (8-bit type field, 3 bytes of information).
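The DLLP layout described above (an 8-bit type field, 3 bytes of information and an appended 16-bit CRC) can be sketched as follows. Note the stand-in: the example uses Python's `binascii.crc_hqx` (CRC-CCITT), which is not claimed to match the CRC-16 polynomial or bit ordering of the PCI Express specification. The type value is arbitrary and the function names are assumptions.

```python
import binascii

def build_dllp(dllp_type: int, info: bytes) -> bytes:
    """Pack a 4-byte DLLP body and append a 16-bit CRC (6 bytes in total)."""
    if not 0 <= dllp_type <= 0xFF:
        raise ValueError("type field is 8 bits")
    if len(info) != 3:
        raise ValueError("information field is 3 bytes")
    body = bytes([dllp_type]) + info
    crc = binascii.crc_hqx(body, 0xFFFF)  # stand-in CRC-16, see lead-in
    return body + crc.to_bytes(2, "big")

def dllp_is_valid(packet: bytes) -> bool:
    """Receive side: recompute the CRC over the body and compare."""
    if len(packet) != 6:
        return False
    body, received = packet[:4], int.from_bytes(packet[4:], "big")
    return binascii.crc_hqx(body, 0xFFFF) == received

dllp = build_dllp(0x10, b"\x00\x00\x2a")   # arbitrary type and payload
flipped = bytes([dllp[0] ^ 0x01]) + dllp[1:]  # single-bit error on the link
print(len(dllp), dllp_is_valid(dllp), dllp_is_valid(flipped))
```

Unlike TLPs, nothing in this sketch stores the DLLP for replay. It only shows the receiver detecting and rejecting a damaged packet.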
Since data integrity is critical for successful packet transfer across PCI Express links, system designers must select an FPGA that provides the high performance (logic and memory) necessary to perform the data link layer processing at the peak bandwidth rate of the link.
Physical Layer Criteria
The system designer’s FPGA selection criteria for the PHY are distinctly different from those for the data link and transaction layers. The PHY criteria focus primarily on ensuring that PCI Express components interoperate electrically with each other, whereas the criteria for the transaction and data link layers concern proper data transfer services for the application.
The PCI Express physical layer is further broken down into three sub-layers, from top to bottom: a media access controller (MAC), a physical coding sub-layer (PCS) and a physical media attachment (PMA). The PMA specifies the analog domain of the physical interface. The PCS provides 8b/10b encoding in the transmit direction and decoding in the receive direction, as well as elastic buffers for rate matching.
The MAC sub-layer controls the physical interface with state machines for link training and initialization, including polarity inversion, signal rate negotiation and bit lock, symbol detection, link width negotiation and, for multi-lane links, lane reversal and lane-to-lane de-skew. A logical interface between the PCS and MAC sub-layers, the Physical Interface for PCI Express (PIPE), is defined as illustrated in Figure 2.
PIPE was initially defined to connect cells in an ASIC or embedded hard macros to programmable logic in an FPGA, and accordingly, I/O drivers were not specified. However, discrete transceiver devices incorporating PCI Express PMA and PCS sub-layers have emerged, and although I/O technologies on the PIPE may differ by vendor, they are all (to date) supported by FPGAs. Although FPGAs with integrated high-speed serial transceivers (PMA and PCS) compatible with PCI Express have been shipping for more than two years, these device families are not the only FPGAs that should be considered. For example, these devices are only available with specific numbers of transceiver channels (e.g., x4, x8, x16 and x20) and are provided in the more expensive, higher pin count packages required by the highest performance, large density designs. So, a design requiring a single x1 PCI Express port might be more cost-effectively implemented in a lower cost FPGA with a PIPE interface to an external x1 transceiver. This avoids the costs of the unused transceiver channels and more expensive packaging of the integrated FPGA.
FPGAs with integrated transceivers and intellectual property (IP) cores that implement the MAC sub-layer and the data link and transaction layers offer fully tested solutions that have been qualified for the PCI Express Integrator’s List (http://www.pcisig.com/developers/compliance_program/integrators_list/pcie). Development boards are available to evaluate the PCI Express IP cores, prototype the core logic designed for specific applications and take it through PCI-SIG compliance workshops, in parallel with the end-system architectural planning, component selection and board design.
The PCI Express specification also has requirements for electrical characteristics to ensure proper data transmission. It calls for a differential voltage of 1 V with -3.5 dB de-emphasis. Designers can meet this requirement by using devices whose differential voltage and pre-emphasis can be changed dynamically, which also assists with debugging boards in a lab environment. Receive channels should additionally have equalization that can be adjusted dynamically to compensate for high-frequency losses on the transmission line.
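For a sense of scale, the -3.5 dB figure can be converted to a voltage ratio. The sketch assumes the usual definition of decibels for voltage, 20 x log10 of the amplitude ratio, and takes the 1 V differential level from the text. The function name is illustrative.

```python
def deemphasized_amplitude(full_swing_volts: float,
                           deemphasis_db: float = -3.5) -> float:
    """Amplitude after de-emphasis, using dB = 20 * log10(V_out / V_in)."""
    return full_swing_volts * 10 ** (deemphasis_db / 20)

reduced = deemphasized_amplitude(1.0)
print(f"{reduced:.3f} V")  # roughly two-thirds of the full 1 V swing
```

A device with programmable pre-emphasis lets the designer move this ratio up or down to suit the actual losses of a given board trace.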
As PCI Express becomes the standard interconnect for next-generation embedded applications, system designers must address the challenges associated with more complex physical, data link and transaction layers, as well as the challenges associated with the higher device core performance requirements that result from the increased link bandwidth. While FPGAs can provide fully integrated PCI Express solutions, designers must realize that not every FPGA offers the right combination of power, performance and density to meet these challenges. However, if designers adhere to a certain set of selection criteria as a guide, they should be able to uncover the right FPGA solution for their specific PCI Express application.
San Jose, CA.