Choosing the Right Programmable Logic Solution for PCI Express Applications

PCI Express is rapidly establishing itself as the successor to PCI, providing
higher performance, increased flexibility and scalability for next-generation
systems, while maintaining software compatibility with existing PCI applications
widely deployed in compute, storage, communications and general embedded systems.
Compared to legacy PCI, the PCI Express protocol is considerably more complex,
with three layers—the transaction, data link and physical layers. Consequently,
designers must evaluate a more detailed set of selection criteria before choosing
their programmable logic solutions for PCI Express applications.

Layered protocols have been used for years in data communications. They permit
isolation between different functional areas in the protocol and allow updating
or upgrading of one or more layers, often without requiring changes in the other
layers. For example, new transaction types might be included in newer revisions
of a protocol specification that don’t affect lower layers, or the physical
media might be changed with no major effects on higher layers.

PCI Express employs dual-simplex point-to-point links to overcome the limitations
of a shared bus. The links use high-speed serial transceivers with embedded clock
and data differential signals operating at 2.5 Gbits/s with industry standard
8b/10b encoding. A link can consist of a single (x1) lane providing a peak bandwidth
of 500 Mbytes/s (2 directions x 2.5 Gbits/s x 8/10 encoding efficiency = 4 Gbits/s).
For scalability, multiple lanes can be used (x2, x4, x8, x12, x16 and x32), with the
data bytes striped across the lanes comprising the link, increasing the peak bandwidth
accordingly (e.g., a x4 link provides a peak bandwidth of 2 Gbytes/s). To further
overcome the limitations of a multi-drop shared bus, non-blocking switches are
used to connect multiple PCI Express devices together. Table 1 summarizes the
performance advantages of the switch-based PCI Express architecture compared to
legacy PCI.
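
The arithmetic behind these figures is simple enough to check in a few lines of
code. The following Python sketch (names and structure are illustrative, not taken
from any vendor tool) reproduces the peak bandwidth numbers quoted above:

    # Peak PCI Express 1.x bandwidth as a function of link width.
    LINE_RATE_GBPS = 2.5       # raw signaling rate per lane, per direction
    ENCODING_EFFICIENCY = 0.8  # 8b/10b: 8 data bits carried per 10 line bits

    def peak_bandwidth_mbytes(lanes: int, directions: int = 2) -> float:
        """Aggregate peak bandwidth in Mbytes/s for a given lane count."""
        gbits = lanes * directions * LINE_RATE_GBPS * ENCODING_EFFICIENCY
        return gbits * 1000 / 8  # Gbits/s -> Mbytes/s

    for width in (1, 2, 4, 8, 12, 16, 32):
        print(f"x{width:<2} link: {peak_bandwidth_mbytes(width):>6.0f} Mbytes/s")

A x1 link works out to 500 Mbytes/s and a x4 link to 2,000 Mbytes/s, matching the
figures above.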

Unlike PCI, the PCI Express protocol structure does not use sideband signaling,
further reducing the number of pins per port. On the transmitting side of a
PCI Express transaction, transaction layer packets (TLPs) are formed at the
transaction layer, with protocol information (transaction type, recipient address,
transfer size, etc.) inserted in header fields. These are then passed down to the
data link layer, which adds a sequence ID and calculates and appends a 32-bit
cyclic redundancy check (CRC-32) for link-level data integrity management (packet
acknowledgement and retry mechanisms).
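
As a rough illustration of this layering, the Python sketch below assembles an
abbreviated header and then applies the data link layer's additions. The field
layout is deliberately simplified (real headers are three or four 32-bit words),
and zlib.crc32 merely stands in for the link CRC, which shares the common CRC-32
generator polynomial but differs in calculation details:

    import struct
    import zlib

    def build_tlp(addr: int, length_dw: int, payload: bytes = b"") -> bytes:
        """Abbreviated transaction layer packet: type/length word plus address."""
        fmt_type = 0x00  # simplified stand-in for a 32-bit memory read
        dw0 = (fmt_type << 24) | (length_dw & 0x3FF)
        return struct.pack(">II", dw0, addr & 0xFFFFFFFF) + payload

    def add_link_layer(tlp: bytes, seq: int) -> bytes:
        """Data link layer: prepend a 12-bit sequence ID, append a CRC-32."""
        framed = struct.pack(">H", seq & 0xFFF) + tlp
        return framed + struct.pack(">I", zlib.crc32(framed))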

The physical layer then frames the packet with start and end of packet bytes before
lane striping (multi-lane links only), byte scrambling to reduce electromagnetic
interference (EMI), 8b/10b encoding to ensure sufficient transitions for clock and
data recovery, and serialization of the 10-bit symbols before transmission across
the link to the receiving device. The packet then traverses up the protocol stack
at the receiving device until data is extracted and passed to the device core.
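
The scrambling step, for instance, is a simple side-stream XOR against a 16-bit
linear feedback shift register (LFSR). The sketch below assumes the Gen1 polynomial
x^16 + x^5 + x^4 + x^3 + 1 with seed 0xFFFF; details such as control-symbol bypass
and the exact serial implementation are defined in the specification and glossed
over here:

    def scramble(data: bytes, lfsr: int = 0xFFFF) -> bytes:
        """Side-stream scrambler sketch: XOR each data bit with the LFSR output."""
        out = bytearray()
        for byte in data:
            scrambled = 0
            for bit in range(8):  # least significant bit of each byte first
                out_bit = (lfsr >> 15) & 1
                scrambled |= (((byte >> bit) & 1) ^ out_bit) << bit
                # Advance the LFSR: x^16 + x^5 + x^4 + x^3 + 1, Galois form
                lfsr = ((lfsr << 1) & 0xFFFF) ^ (0x39 if out_bit else 0)
            out.append(scrambled)
        return bytes(out)

Because the same LFSR runs in lockstep at both ends of the link, the receiver
descrambles by applying the identical operation.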

Although PCI Express encapsulates transactions as packets, backward compatibility
is maintained at the application layer. From a system perspective, PCI Express
retains many similarities to PCI while offering significantly higher system
performance and scalability (Figure 1).

In a PCI Express system, a root complex device connects the processor and memory
subsystem to the PCI Express switch fabric comprised of one or more switch devices.
Similar to a host bridge in a PCI system, the root complex generates transaction
requests on behalf of the processor, to which it is connected through a local bus.
Root complex functionality may be implemented as a discrete device, or may be
integrated with the processor. A root complex may contain more than one PCI Express
port and multiple switch devices can be connected to ports on the root complex
or cascaded.

Forward bridges are used to connect legacy PCI or PCI-X subsystems to the PCI
Express fabric, and reverse bridges can be used to connect a PCI Express fabric
to a PCI host bridge. Endpoints complete (or request) PCI Express transactions
(e.g., a Gigabit Ethernet controller with a PCI Express system interface). The
physical and data link layers of each PCI Express device (root complex, switch,
bridge and endpoint) provide the same functionality, but the transaction layers
have some differences.

Transaction Layer Criteria

In addition to the formation of TLPs discussed earlier, the transaction layer
prioritizes the various types of data transactions into traffic classes (TCs) and
maps them into virtual channels (VCs), which are used to provide deterministic
latencies as the packets traverse the fabric. Each TC may be mapped into a separate
VC (e.g., TC0 to VC0, TC1 to VC1, etc.), or multiple TCs may be mapped into a
single VC (e.g., TC[2:0] to VC0, TC[7:3] to VC1), depending on the requirements
of the system and in particular, the requirements of the different endpoints.
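
In implementation terms, a TC-to-VC map is just a small per-port lookup table. A
hypothetical sketch of the two mappings mentioned above:

    # One-to-one mapping: each traffic class gets its own virtual channel.
    tc_to_vc_one_to_one = {tc: tc for tc in range(8)}

    # Many-to-one mapping: TC[2:0] share VC0, TC[7:3] share VC1.
    tc_to_vc_grouped = {tc: 0 if tc <= 2 else 1 for tc in range(8)}

    def route(tc: int, mapping: dict) -> int:
        """Return the virtual channel that carries packets of this traffic class."""
        return mapping[tc]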

Each VC requires a separate FIFO buffer and control logic, which reside in the
transaction layer of the port. Switch devices support both port arbitration (packets
arriving on different ingress ports but targeting the same VC) and VC arbitration
(packets in the VC buffers of an egress port). Port arbitration may be based on
round-robin, weighted round-robin
or time-based round-robin schemes, and VC arbitration may be based on strict-priority,
round-robin or weighted round-robin schemes.
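
To give a flavor of what the arbitration logic decides, here is a minimal weighted
round-robin sketch. It is purely illustrative of the scheduling policy; a real
switch implements the equivalent in hardware at each egress port:

    from collections import deque

    def weighted_round_robin(queues: dict, weights: dict):
        """Drain per-VC queues, granting each VC `weight` slots per cycle."""
        while any(queues.values()):
            for vc, weight in sorted(weights.items()):
                for _ in range(weight):
                    if queues[vc]:
                        yield vc, queues[vc].popleft()

    # Example: VC1 receives three grants for every grant given to VC0.
    queues = {0: deque(["a0", "a1"]), 1: deque(["b0", "b1", "b2"])}
    for vc, pkt in weighted_round_robin(queues, {0: 1, 1: 3}):
        print(f"VC{vc}: {pkt}")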

Each packet has a unique identifier that enables response packets to be directed
to the correct originator. The packet format supports 32-bit memory addressing or
extended 64-bit memory addressing, and payloads of up to 4 Kbytes. To buffer packets
of this size, an FPGA must provide adequate logic density (logic elements, or LEs)
and sufficient embedded RAM.
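
A quick back-of-envelope estimate shows how these parameters translate into
embedded RAM. The figures below are illustrative assumptions, not requirements
from the specification:

    def vc_buffer_bits(num_vcs: int, max_payload: int = 4096,
                       packets_per_vc: int = 2, header_bytes: int = 16) -> int:
        """Rough receive-buffer estimate: each VC holds a few maximum-size TLPs."""
        return num_vcs * packets_per_vc * (max_payload + header_bytes) * 8

    # Two VCs, double-buffered: roughly 132 Kbits of embedded RAM per port.
    print(vc_buffer_bits(num_vcs=2))

Multiply by the number of ports in a switch design and the embedded RAM budget
grows quickly.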

FPGAs can be used to implement any one of the PCI Express devices (root complex,
switch, bridge or endpoint), and in many cases would be used to implement unique
designs with multiple elements. For example, an FPGA could be configured with
a host interface, memory controller, two PCI Express ports, one PCI port and a
Gigabit Ethernet controller, providing a highly integrated device solution comprised
of root complex, switch, bridge and endpoint functionality.

Accordingly, the selection criteria for the FPGA must include the aggregate performance
and density requirements of the application (i.e., the frequency and width of
FPGA core logic) and the performance requirements of each interface, including
both the I/O signaling and throughput rates of each interface (PCI Express, memory,
PCI and endpoint). In other words, an FPGA might support a 2.5 Gbit/s I/O signal,
but lack the performance for the PCI Express protocol layers and the device core
logic, or perhaps have insufficient performance on the memory interface, or have
inadequate signal integrity for a particular source-synchronous interface standard
required by the application. Lastly, the combination of interfaces directly impacts
the pin requirements and thus the FPGA package options.

System designers should consider an FPGA with an architecture or migration path
that allows for scalability, including the number of virtual channels, lanes and
maximum packet size to be supported now and in the foreseeable future, since all
of these parameters impact the resource (logic and memory) requirements of the
design.

Data Link Layer Criteria

As discussed earlier, the primary role of the PCI Express data link layer is to
ensure reliable delivery of the data packets across the PCI Express link by way
of a CRC-32. The data link layer tracks
the state of the link, communicates link status with the transaction and physical
layers and performs link management. The data link layer contains the data link
control and management state machine (DLCMSM) to perform these tasks. The data
link layer stores packets in a retry buffer until an acknowledgement (ACK) is
received from the data link layer of the adjacent device based on the results
of the CRC. If the CRC fails and a non-acknowledgement (NAK) is received, the
stored packet is re-transmitted.
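
In pseudocode terms, the replay mechanism behaves roughly as follows. This is a
simplified sketch; a real implementation must also handle replay timeouts and
sequence-number wraparound:

    from collections import OrderedDict

    class RetryBuffer:
        """Hold transmitted TLPs until the link partner acknowledges them."""
        def __init__(self):
            self.pending = OrderedDict()  # sequence ID -> packet, in order sent

        def on_transmit(self, seq: int, packet: bytes):
            self.pending[seq] = packet

        def on_ack(self, seq: int):
            # An ACK is cumulative: it releases everything up to `seq`.
            for s in [s for s in self.pending if s <= seq]:
                del self.pending[s]

        def on_nak(self, last_good_seq: int):
            # Release what arrived intact, then replay the rest in order.
            self.on_ack(last_good_seq)
            return list(self.pending.values())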

To maximize efficiency across the link, a credit-based flow control protocol
guarantees that packets are transmitted only when it is known that a corresponding
VC buffer is available to receive the packet at the other end. If not, a packet
of lower priority is transmitted while the target device completes the processing
of the higher priority packets and clears its Rx VC buffer.
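
Conceptually, the transmit decision is a per-VC credit check along these lines (an
illustrative sketch; the specification actually tracks header and data credits
separately, which is glossed over here):

    class CreditGate:
        """Transmit only when the receiver has advertised enough buffer credits."""
        def __init__(self, initial_credits: dict):
            self.credits = dict(initial_credits)  # per-VC credit counters

        def can_send(self, vc: int, cost: int) -> bool:
            return self.credits.get(vc, 0) >= cost

        def send(self, vc: int, cost: int):
            assert self.can_send(vc, cost), "would overrun the receiver's buffer"
            self.credits[vc] -= cost

        def on_credit_return(self, vc: int, returned: int):
            # The receiver returns credits (via DLLPs) as it drains its VC buffer.
            self.credits[vc] = self.credits.get(vc, 0) + returned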

Data link layer packets (DLLPs) are used for a variety of link management functions,
including transaction layer packet (TLP) acknowledgement, power management and
exchange of flow control information. DLLPs are generated at the data link layer,
transmitted through the physical layers of the link along with TLPs, and consumed
by the data link layer of the receiving device. DLLPs are also protected against
transmission errors, with a CRC-16 appended to a 4-byte packet (an 8-bit type
field plus 3 bytes of information).
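
The fixed 6-byte format is easy to picture as a packing exercise. In the sketch
below, the CRC-16 routine is a generic placeholder (the actual DLLP CRC-16
polynomial and calculation rules are defined in the specification), so only the
framing should be read literally:

    import struct

    def crc16_placeholder(data: bytes, poly: int = 0x1021, crc: int = 0xFFFF) -> int:
        """Generic bitwise CRC-16, used here only as a stand-in."""
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ poly if crc & 0x8000 else crc << 1) & 0xFFFF
        return crc

    def build_dllp(dllp_type: int, info: bytes) -> bytes:
        """1-byte type field + 3 bytes of information + CRC-16 = 6 bytes."""
        assert len(info) == 3
        body = bytes([dllp_type]) + info
        return body + struct.pack(">H", crc16_placeholder(body))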

Since data integrity is extremely critical for successful packet transfer across
PCI Express links, system designers must select an FPGA that provides the high
performance (logic and memory) necessary to perform the data link layer processing
at the peak bandwidth rate of the link.

Physical Layer Criteria

The system designer’s FPGA selection criteria for the PHY are distinctly
different from those of the data link and transaction layers: they focus primarily
on ensuring that PCI Express components electrically interoperate with each other,
whereas the criteria for the transaction and data link layers concern proper
data transfer services for the application.

The PCI Express physical layer is further broken down into three sub-layers
(from top to bottom): a media access controller (MAC), a physical coding sub-layer
(PCS) and a physical media attachment (PMA). The PMA specifies the analog domain
of the physical interface. The PCS provides 8b/10b encoding in the transmit direction
and decoding in the receive direction, as well as elastic buffers for rate matching.
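
The elastic buffers absorb the small frequency difference between the recovered
receive clock and the local clock. A back-of-envelope calculation shows the scale
of the problem, assuming the roughly ±300 ppm clock tolerance permitted at
2.5 Gbits/s; the protocol addresses it by scheduling SKP ordered sets that the
elastic buffer may lengthen or shorten:

    # Worst case: the two ends of the link sit at opposite 300 ppm extremes.
    ppm_total = 600
    symbols_per_slip = 1_000_000 / ppm_total  # ~1,667 symbols per symbol of drift

    # The transmitter must therefore offer a rate-matching opportunity (a SKP
    # ordered set) at least this often so the elastic buffer can re-center.
    print(f"one symbol of drift every ~{symbols_per_slip:.0f} symbols")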

The MAC sub-layer controls the physical interface with state machines for
link training and initialization, including polarity inversion, signal rate
negotiation and bit lock, symbol detection, link width negotiation—and
for multi-lane links—lane reversal and lane-to-lane de-skew. A logical
interface between the PCS and MAC sub-layers is defined—the Physical Interface
for PCI Express (PIPE)—as illustrated in Figure 2.

PIPE was initially defined to connect cells in an ASIC or embedded hard macros
to programmable logic in an FPGA, and accordingly, I/O drivers were not specified.
However, discrete transceiver devices incorporating PCI Express PMA and PCS sub-layers
have emerged, and although I/O technologies on the PIPE may differ by vendor,
they are all (to date) supported by FPGAs. Although FPGAs with integrated high-speed
serial transceivers (PMA and PCS) compatible with PCI Express have been shipping
for more than two years, these device families are not the only FPGAs that should
be considered. For example, these devices are only available with specific numbers
of transceiver channels (e.g., x4, x8, x16 and x20) and are provided in the more
expensive, higher pin count packages required by the highest performance, large
density designs. So, a design requiring a single x1 PCI Express port might be
more cost-effectively implemented in a lower cost FPGA with a PIPE interface to
an external x1 transceiver. This avoids the cost of unused transceiver channels
and the more expensive packaging of an FPGA with integrated transceivers.

FPGAs with integrated transceivers and intellectual property (IP) cores that
implement the MAC sub-layer and the data link and transaction layers offer
fully tested solutions that have been qualified for the PCI Express Integrator’s
List (http://www.pcisig.com/developers/compliance_program/integrators_list/pcie).
Development boards are available to evaluate the PCI Express IP cores and to
prototype application-specific core logic; these boards can be taken through
PCI-SIG compliance workshops in parallel with end-system architectural planning,
component selection and board design.

The PCI Express specification also has requirements for electrical characteristics
to ensure proper data transmission. The differential voltage is 1V, with -3.5 dB
of de-emphasis required. Designers can meet this specification by utilizing devices
in which the differential voltage and de-emphasis can be adjusted dynamically, which
also assists with board debugging in the lab. Receive channels should additionally
provide equalization that can be adjusted dynamically to compensate for
high-frequency losses on the transmission line.
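
The -3.5 dB figure translates directly into a voltage ratio, which is worth
checking when reviewing eye diagrams in the lab (plain arithmetic, no vendor
specifics):

    # -3.5 dB de-emphasis: repeated bits swing at roughly 67% of full amplitude.
    ratio = 10 ** (-3.5 / 20)
    print(f"de-emphasized swing: {ratio:.3f} of full swing "
          f"(about {ratio:.2f} V for a 1 V differential signal)")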

As PCI Express becomes the standard interconnect for next-generation embedded
applications, system designers must address the challenges associated with more
complex physical, data link and transaction layers, as well as the challenges
associated with the higher device core performance requirements that result from
the increased link bandwidth. While FPGAs can provide fully integrated PCI Express
solutions, designers must realize that not every FPGA offers the right combination
of power, performance and density to meet these challenges. However, if designers
use the selection criteria outlined here as a guide, they should be able
to uncover the right FPGA solution for their specific PCI Express application.

Altera
San Jose, CA.
(408) 544-7000.
[www.altera.com].