TECH INSIGHT

PCI Express & Advanced Switching

PCI Express Advanced Switching is About to Come into its Own

Coming fast on the heels of PCI Express, the Advanced Switching Interconnect is about to make its own debut.

CHUCK TREFTS, CATALYST ENTERPRISES

Advanced Switching is something of an extrapolation of PCI Express, borrowing its lower two architectural layers from the PCI Express specification, but diverging at the transaction layer and in the marketplaces it intends to serve. Whereas PCI Express has already begun to reshape a new generation of PCs and traditional servers, Advanced Switching is intended to proliferate in multiprocessor, peer-to-peer systems in the communications, storage, networking, servers and embedded platform environments (Figure 1).

The need for Advanced Switching (AS) arises as computing and communication platforms converge and the functions they serve increasingly overlap. While PCI Express is clearly the interconnect of choice for the computing industry, a common interconnect with the communications industry seems logical and necessary to keep development costs down and performance up, and to reduce time-to-market. By sharing the same physical and data link layers as PCI Express, AS leverages an in-place infrastructure and provides built-in cost savings by expanding a common ecosystem of IP, tools, services and foundries.

Starting with PCI Express

PCI Express is a layered, multi-lane, serialized evolution from the monolithic, parallel PCI/PCI-X system architectures. These layers include the physical layer, data link layer and transaction layer (Figure 2).

At the physical layer, an initial 2.5 Gbit/s dual simplex, point-to-point topology is specified, with future provisions to scale much faster. The PCI Express base specification calls for up to 32 lanes, although in current practice, developers are tending to build at no wider than 16 lanes, with most non-graphics development being designed at x1, x4 and x8 (pronounced “by one,” “by four,” etc.). Implementations at x2 and x12 are also specified.

All information moving across the link is carried as 10-bit symbols, and most of it is scrambled. Data scrambling is a process that transforms the data bytes such that repetitive patterns are eliminated from the bit stream. Repetitive patterns can concentrate energy at specific frequencies, which generates EMI. Scrambling spreads energy over a wider frequency range, flattening those spectral spikes.

Coming down through the stack, 8-bit data is scrambled first and then encoded to the 10-bit format before being serialized and sent out on the wire in LVDS form. On the other side of the link, the data stream is de-serialized, followed by a 10b/8b decoding and descrambling.
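The scrambling step can be illustrated with a short sketch. It assumes a 16-bit linear feedback shift register (LFSR) with the polynomial x^16 + x^5 + x^4 + x^3 + 1 and a seed of 0xFFFF; the function names are hypothetical, and the bit ordering, the LFSR reset rules and the exemption of control characters are deliberately left out, so the output is not bit-exact with real hardware. The point is only that XORing with a pseudo-random sequence breaks up repetitive data and that the same operation on the receiver restores it.

```python
def keystream(n_bytes, seed=0xFFFF):
    """Generate n_bytes of pseudo-random bytes from a 16-bit LFSR (illustrative)."""
    state = seed
    out = bytearray()
    for _ in range(n_bytes):
        byte = 0
        for _ in range(8):
            msb = (state >> 15) & 1
            state = (state << 1) & 0xFFFF
            if msb:
                state ^= 0x0039  # feedback taps for x^5 + x^4 + x^3 + 1
            byte = (byte << 1) | msb
        out.append(byte)
    return bytes(out)


def scramble(data, seed=0xFFFF):
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(d ^ k for d, k in zip(data, keystream(len(data), seed)))


repetitive = b"\x00" * 16          # worst case: one value repeated
scrambled = scramble(repetitive)   # transmitter side
restored = scramble(scrambled)     # receiver side runs the same LFSR
```

Because scrambling is a plain XOR, the receiver needs only an identically seeded LFSR that stays in step with the transmitter.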

This 8b/10b scheme does result in an inherent 25% overhead (10 bits on the wire for every 8 bits of data, which is 20% of the raw line rate), but considering the brute speed at which PCI Express operates, it’s essentially inconsequential. For example, at a x1 link width, aggregate PCI Express bandwidth (both directions combined) is 500 Mbytes/s. At x16, bandwidth scales to a blistering 8 Gbytes/s. By comparison, legacy PCI running at 66 MHz and 64 bits wide provides 533 Mbytes/s of raw bandwidth (Figure 3).
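The bandwidth figures quoted above follow from simple arithmetic, sketched here. The function and constant names are made up for illustration; the only inputs are the 2.5 Gbit/s lane rate, the 8b/10b encoding and the fact that each lane carries traffic in both directions at once.

```python
LANE_RATE_BITS = 2.5e9  # raw signaling rate per lane, per direction


def pcie_mbytes_per_s(lanes, both_directions=True):
    """Usable PCI Express bandwidth in Mbytes/s after 8b/10b encoding."""
    data_bits = LANE_RATE_BITS * 8 / 10      # 8 data bits per 10 line bits
    mbytes = data_bits / 8 / 1e6 * lanes     # bits -> bytes -> Mbytes
    return mbytes * 2 if both_directions else mbytes


def pci_mbytes_per_s(clock_hz, width_bits):
    """Raw bandwidth of a shared parallel PCI bus in Mbytes/s."""
    return clock_hz * width_bits / 8 / 1e6


x1 = pcie_mbytes_per_s(1)                # 250 each way, 500 aggregate
x16 = pcie_mbytes_per_s(16)              # 8000 Mbytes/s = 8 Gbytes/s
legacy = pci_mbytes_per_s(66.67e6, 64)   # about 533 Mbytes/s
```

Note that the PCI figure is shared by every device on the bus and is half duplex, whereas each PCI Express link has its bandwidth to itself.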

Beyond the obvious bandwidth advantages, it is important to note that PCI Express can move much more information around a system using much less real estate. A x8 PCI Express link provides roughly 100 Mbytes/s of performance per pin. This provides advantages in power requirements, size and, ultimately, cost.

At the data link layer, mechanisms are in place to ensure data integrity by way of CRC checks, transaction sequencing functions and ACK/NAK protocols.

The PCI Express transaction layer will look familiar to those who have worked with PCI or PCI-X architectures, with some obvious additions. The familiar configuration, memory and I/O transactions are still there, with a new message transaction type being defined. Message transactions transport interrupts, power management information, vendor messages and error reporting mechanisms.

Address space will again appear familiar, although configuration space is extended from the 256 bytes used by legacy PCI to 4 Kbytes, and message space is added. New software must take advantage of this extended configuration space, but legacy PCI software, including OS, drivers and applications, will operate just fine in a PCI Express environment.

A Typical PCI Express Transaction

Moving a transaction, such as a memory read, from one PCI Express device to another involves all three layers on both devices. As the application initiates a transaction in one device, it is passed down from the transaction layer, through the data link and physical layers, and across the link to the receiving device. At the transmitting device, the data link and physical layers append various fields to the transaction, such as framing characters, CRC values and sequence information. On the receiving device, these physical and data link appendages are monitored for various purposes and are stripped away as the transaction moves up the stack, leaving purely transactional information to be consumed at the highest layer.
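The wrap-and-strip behavior described above can be sketched in a few lines. This is a simplified model under stated assumptions: the function names are hypothetical, `zlib.crc32` stands in for the link CRC (the real calculation differs in detail), the framing symbols are shown as ordinary bytes even though on the wire they are special control characters, and the transaction itself is a placeholder string rather than a real packet encoding.

```python
import struct
import zlib

STP, END = 0xFB, 0xFD  # byte values of the K27.7 / K29.7 framing symbols


def dll_wrap(tlp: bytes, seq: int) -> bytes:
    """Data link layer: prepend a sequence number, append a CRC."""
    header = struct.pack(">H", seq & 0x0FFF)   # 12-bit sequence number
    crc = zlib.crc32(header + tlp)             # stand-in for the link CRC
    return header + tlp + struct.pack(">I", crc)


def phy_frame(packet: bytes) -> bytes:
    """Physical layer: add start and end framing characters."""
    return bytes([STP]) + packet + bytes([END])


def phy_unframe(frame: bytes) -> bytes:
    if frame[0] != STP or frame[-1] != END:
        raise ValueError("bad framing")
    return frame[1:-1]


def dll_unwrap(packet: bytes):
    """Check the CRC, then strip the link-level fields."""
    body, (crc,) = packet[:-4], struct.unpack(">I", packet[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: receiver would NAK")
    (seq,) = struct.unpack(">H", body[:2])
    return seq, body[2:]


tlp = b"MRd addr=0x1000 len=4"   # placeholder, not a real TLP encoding
on_wire = phy_frame(dll_wrap(tlp, seq=7))
seq, received = dll_unwrap(phy_unframe(on_wire))
```

Only `received`, the purely transactional content, reaches the top of the stack; the sequence number and CRC are consumed by the data link layer on the way up.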

Link communication can also originate at the physical or data link layer, and terminate at the same layer on the other end of the link. Communication that starts and ends between the physical layers takes place in the form of ordered sets (sometimes called primitives in other protocols). These ordered sets are exchanged between the two link devices to control link initialization and link recovery, perform clock tolerance compensation and control idle states. Unlike packets generated at the data link and transaction layers, ordered sets do not contain framing (delimiter) characters, although they are sometimes referred to as physical layer “packets.” Ordered sets begin with a comma (COM) character, followed by 3 or more other characters that define the specific ordered set. Ordered sets always consist of multiples of 4 characters.
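The two structural rules for ordered sets (a leading COM, and a length that is a multiple of 4) can be written as a small check. Symbols are represented here by name rather than by their 10-bit encodings, and the function name is hypothetical; the two sample sets follow the COM-plus-three-characters shape described above.

```python
def looks_like_ordered_set(symbols):
    """Structural check only: leading COM and a length that is a multiple of 4."""
    return (
        len(symbols) >= 4
        and len(symbols) % 4 == 0
        and symbols[0] == "COM"
    )


skip_set = ["COM", "SKP", "SKP", "SKP"]   # clock tolerance compensation
idle_set = ["COM", "IDL", "IDL", "IDL"]   # entry to an idle state
not_a_set = ["STP", "D0", "D1", "END"]    # framed packet, not an ordered set
```

A receiver performs this kind of recognition in hardware, before any data link layer processing takes place.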

Communication that starts and ends between the two data link functions takes place in the form of data link layer packets (DLLPs), which include start and end framing characters as well as a CRC. These packets include such functions as flow control updates and initialization packets for each virtual channel (VC), as well as power management packets. DLLPs are used only on the local link between the two communicating devices; they do not pass through switches. Other transaction layer mechanisms are in place to ensure data integrity across multiple links, including an end-to-end CRC check (ECRC).

Quality of Service

For Quality of Service (QoS) purposes, PCI Express employs a traffic class (TC) / virtual channel (VC) scheme, whereby different traffic flows are provided with varying prioritizations and deterministic latencies within the PCI Express fabric elements. For example, a SCSI hard drive may be assigned a lower priority through the fabric than is assigned to time-sensitive video applications. Device drivers or application software assign the traffic class, whereas virtual channels are hardware buffers on PCI Express fabric devices. The different TCs are mapped to specific VC buffers within these devices, where an internal arbitration and prioritization scheme takes place in order to determine which packets are forwarded first.
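A minimal sketch of the TC-to-VC idea follows. The class and its methods are hypothetical, and strict priority is assumed purely for simplicity; real fabric devices may arbitrate between VCs in other ways, such as weighted round-robin.

```python
from collections import deque


class EgressPort:
    """One port of a fabric device with per-VC buffers (illustrative model)."""

    def __init__(self, tc_to_vc, vc_count):
        self.tc_to_vc = tc_to_vc                      # set up by software
        self.vcs = [deque() for _ in range(vc_count)]  # hardware buffers

    def enqueue(self, tc, packet):
        self.vcs[self.tc_to_vc[tc]].append(packet)

    def next_packet(self):
        # Assumed arbitration: highest-numbered non-empty VC wins.
        for vc in reversed(self.vcs):
            if vc:
                return vc.popleft()
        return None


port = EgressPort(tc_to_vc={0: 0, 7: 1}, vc_count=2)
port.enqueue(0, "disk block")    # low-priority storage traffic
port.enqueue(7, "video frame")   # time-sensitive traffic, arrives later
order = [port.next_packet(), port.next_packet()]
```

Even though the disk block was queued first, the video frame leaves first because its traffic class maps to the higher-priority virtual channel.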

On the hardware side of things, PCI Express provides three distinct functional elements: a root complex, switches and endpoints. The root complex sits atop the hierarchy to interface the host processor and memory subsystems with subordinate endpoint devices and switches. Switches are used to expand the hierarchy to multiple endpoints and for connectivity between the root complex and subordinate functions. An endpoint is a peripheral device, other than root complexes or switches, which requests and completes transactions.

PCI Express was designed to maintain complete backward compatibility with legacy PCI/PCI-X software. As a system is booted, PCI Express-based systems perform the same hierarchical enumeration that a PCI or PCI-X system uses to determine the presence and characteristics of its devices and bridges. Register mapping that was done in PCI/PCI-X is also done for PCI Express. Legacy system interrupts are handled in-band by transaction layer messages, eliminating the need for sideband signals.

Advanced Switching

Where PCI Express employs a top-down tree hierarchy with a single host, Advanced Switching enables true peer-to-peer and multiprocessor environments in multiple topologies, including mesh, star and dual star, topologies typically employed in blade servers and telecom systems. Additionally, AS provides for the tunneling of nearly any other protocol through the fabric. AS diverges from PCI Express in that it provides a source-based path routing method, as opposed to the memory-mapped method employed in PCI Express. These new functions are evident on examination of a typical AS packet, which includes provisions for path routing and encapsulation (tunneling) of other protocols. The packet’s path through the fabric is defined by navigation information provided in the AS route header, essentially giving all AS nodes a host-like capability.

There are two basic packet types for AS: unicast packets and path-building packets. Path-building packets include variations for spanning tree and multicast functions. Path-building packets are constructed such that the receiving device is provided a path back through the fabric to the origin device, something like a trail of bread crumbs. As a path-building packet traverses the fabric, the method switches use to route it depends on whether its header further identifies it as a spanning tree packet or a multicast packet.

Essentially, every AS packet contains two headers: one for fabric navigation (the route header) and the other for content (the protocol interface, or PI header). This structure provides for great flexibility in data-carrying capabilities, both for existing protocols and for future protocols, especially since AS switches are concerned only with the routing information and, with some exceptions for path building and device management packets, don’t care about the content of the payload, i.e., they are agnostic to the encapsulated protocol (Figure 4).

Path-building packets are also used for a spanning tree process, initiated by a fabric manager (FM) elected from amongst various candidate devices. This process is part of an initial fabric discovery (FD) and involves a promiscuous generation, or blind broadcast, of packets, which are used to identify topology, node capabilities and paths between all communicating devices, including a path from each node to the FM for purposes of event/status notifications. Redundant paths are also identified during this process, but are placed into a blocked state unless needed in the case of a path failure or traffic congestion. Switches consume these packets, then regenerate them on every other port they have, except the port on which the packet arrived (the ingress port) or any explicitly masked port. As a result, all nodes on the fabric receive spanning tree packets and are identified to the FM.
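The flooding behavior described above can be modeled as a breadth-first walk of the fabric. This is a sketch under assumptions: the fabric is a plain adjacency dictionary, the node names are invented, every node forwards as if it were a switch, and a node that has already been discovered simply drops later copies (which is what stops the flood on redundant links).

```python
from collections import deque


def discover(fabric, fm):
    """Flood from the fabric manager; return back-paths and redundant links."""
    parent = {fm: None}      # bread-crumb trail: node -> next hop toward the FM
    redundant = set()        # links not needed by the tree (kept blocked)
    queue = deque([fm])
    while queue:
        node = queue.popleft()
        for neighbor in fabric[node]:
            if neighbor == parent[node]:
                continue                     # never send back out the ingress port
            if neighbor in parent:
                redundant.add(frozenset((node, neighbor)))
            else:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent, redundant


def path_to_fm(parent, node):
    """Follow the bread crumbs from a node back to the fabric manager."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


fabric = {
    "FM": ["S1"],
    "S1": ["FM", "S2", "E1"],
    "S2": ["S1", "E1"],
    "E1": ["S1", "S2"],      # E1 is reachable two ways
}
parent, redundant = discover(fabric, "FM")
e1_path = path_to_fm(parent, "E1")
```

Here the S2-to-E1 link is identified as redundant and would be held in reserve for failover.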

Provisions are made to route packets around fabric congestion or failure points within the fabric. Any failures are reported to the FM. The FM can in turn notify ingress endpoints within the fabric, all without software intervention, resulting in fast failover capabilities. As a result, traffic can be re-routed around a failed fabric port (port failover) or completely new paths can be provided between the source and terminus devices (path failover). This feature is especially critical in highly utilized, high-availability systems.

If the path-building packet is identified as multicast, it is forwarded by the switch according to a look-up table to multiple, specific endpoints. The switch may or may not replicate the packet, since the egress port may be the same even when the packet's ultimate destinations are different endpoints (in other words, the paths may not diverge until somewhere further down the fabric).
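The replicate-or-not decision comes down to how many distinct egress ports a multicast group maps to at a given switch. The table layout and names below are hypothetical; the sketch only shows that one copy per distinct port is enough.

```python
def multicast_egress_ports(table, group, ingress_port):
    """Return the distinct ports a switch must send a multicast packet out of."""
    ports = {port for port in table[group].values() if port != ingress_port}
    return sorted(ports)


# Hypothetical look-up table: group id -> {destination endpoint: egress port}
table = {
    5: {"A": 2, "B": 2, "C": 3},   # A and B share port 2, C is behind port 3
    6: {"A": 2, "B": 2},           # every destination is behind port 2
}

group5 = multicast_egress_ports(table, 5, ingress_port=0)   # two copies
group6 = multicast_egress_ports(table, 6, ingress_port=0)   # no replication here
```

For group 6 the switch forwards a single copy; any replication happens at a switch further along the path, where the routes to A and B split.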

Unicast routing is used for sending a packet from a single origin to a single terminus. As a unicast packet traverses through the fabric, there is no need for switches to use destination look-up tables to route the packet. This results in simpler switch design and negates latencies involved in such look-up schemes.

A unicast packet contains routing information in the form of a 31-bit turn pool, turn pointer and direction flag. This turn pool information is included in the route header. Switches forward the unicast packet based on this information. A turn pool contains, of all things, turns. A turn is variable in length, ranging from 1 to 8 bits, depending on the port count of the switch immediately in its path. The turn pointer indicates which turn is currently active, in that as a packet moves through the fabric, different turns through different switches are required to properly route the packet. For example, a packet traversing four 3-bit switches and one 4-bit switch would use 5 different turns (through 5 switches), consuming 16 of the available 31 bits in the turn pool (called the active portion of the turn pool). The remaining 15 bits would be unused.
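The bit accounting in that example can be checked with a short sketch. The packing order, the meaning given to the turn pointer (here, the number of active bits not yet consumed) and all names are assumptions chosen for clarity, not the layout defined by the AS specification.

```python
POOL_BITS = 31


def pack_turn_pool(turns):
    """Pack (turn_value, bit_width) pairs, first hop in the highest active bits."""
    pool, used = 0, 0
    for value, width in turns:
        if not 1 <= width <= 8:
            raise ValueError("a turn is 1 to 8 bits wide")
        if not 0 <= value < (1 << width):
            raise ValueError("turn value does not fit its width")
        if used + width > POOL_BITS:
            raise ValueError("turn pool overflow")
        pool = (pool << width) | value
        used += width
    return pool, used


def take_turn(pool, pointer, width):
    """What a switch does: peel off its own turn and advance the pointer."""
    pointer -= width
    return (pool >> pointer) & ((1 << width) - 1), pointer


# Four 3-bit switches followed by one 4-bit switch, with arbitrary turn values.
route = [(5, 3), (2, 3), (7, 3), (0, 3), (9, 4)]
pool, active_bits = pack_turn_pool(route)
unused_bits = POOL_BITS - active_bits

pointer, seen = active_bits, []
for _, width in route:
    turn, pointer = take_turn(pool, pointer, width)
    seen.append(turn)
```

Each switch needs to know only its own turn width, which is why no destination look-up table is involved.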

The turn value indicates the relative position of a switch’s egress port from the ingress port at which the packet arrives. The 8-bit maximum for a turn corresponds to the 256-port maximum port count on an AS switch. The direction flag is used to indicate whether the packet is being forward-routed, from origin to terminus, or backward-routed, from terminus to origin.
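One way to picture a relative turn is as a count of ports skipped from the arrival port, with the direction flag reversing the count. The convention below (a turn of 0 selects the port immediately after the arrival port) is an assumption made for this sketch, as is the function name.

```python
def egress_port(arrival_port, turn, port_count, backward=False):
    """Pick the egress port relative to the port the packet arrived on."""
    step = turn + 1                # assumed: turn 0 means the next port over
    if backward:
        step = -step               # backward routing undoes the forward turn
    return (arrival_port + step) % port_count


forward = egress_port(arrival_port=2, turn=3, port_count=8)              # 6
reverse = egress_port(arrival_port=6, turn=3, port_count=8, backward=True)  # 2
```

Because the same turn value works in both directions, a terminus can reply to an origin by reusing the turn pool it received and flipping the direction flag.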

Protocol Interfaces

The protocol interface (PI) field identifies the format of the encapsulated protocol (such as Fibre Channel, TDM, ATM, TCP/IP, InfiniBand, AS Fabric Services, etc.). A PI is also designed for encapsulating PCI Express (PI-8), which accommodates PCI Express edge devices acting as a portal into the AS fabric. Many protocols can be simultaneously tunneled around the fabric, giving AS enormous flexibility to support different modular applications. PI types are specified from PI-0 to PI-127, with PI-0 to PI-7 reserved for fabric services and PI-8 to PI-254 reserved for tunneling specific protocols.

Supporting extensions provided by AS include Simple Load/Store (SLS), Simple Queue (SQ) and Socket Data Transport (SDT). SLS (PI-10) is a read/write protocol that maintains the legacy PCI load/store model. SLS provides specific, protected memory apertures to fabric devices, including a base address and a range, or limit. It is ideal for RDMA operations, peer-to-peer communications between different processor systems and bridging between two different protocols, such as PCI Express to HyperTransport.
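The protection offered by an SLS aperture can be sketched as a bounds check followed by an address translation. The class, its field names and the choice to treat the limit as a size in bytes are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aperture:
    """A window of local memory exposed to a remote fabric device."""
    base: int    # local address where the window starts
    limit: int   # assumed here to be the window size in bytes

    def resolve(self, offset, length):
        """Translate a remote offset to a local address, or refuse the access."""
        if offset < 0 or length <= 0 or offset + length > self.limit:
            raise PermissionError("access outside aperture")
        return self.base + offset


window = Aperture(base=0x8000_0000, limit=0x1000)
local_address = window.resolve(offset=0x10, length=64)
```

A read or write that falls outside the window is rejected rather than reaching memory the remote device was never granted.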

SQ and SDT are native AS PIs that are optimized to support legacy socket applications. SQ (PI-11) is a simple messaging protocol for packet-based communications protocols, like UDP, although it can also support non-datagram services such as TCP. SQ employs a push-pull (write/read) queue model. SDT (PI-9) is a hardware implementation of streaming socket protocols and is somewhat more efficient than SQ for TCP-like semantics.

Catalyst Enterprises
San Jose, CA.
(408) 365-3846.
[www.getcatalyst.com].