SOLUTIONS ENGINEERING
Network Processors
Beyond the Network Layer: NP-Based TCP Offloading via TCP Splicing
TCP performance of low-end devices in today’s networks continues to be limited mainly by memory and CPU speeds. A TCP offload mechanism is proposed that takes advantage of TCP splicing and implements the offloaded functions in a remote network processor.
SUREKHA PERI AND PRAVIN PATHAK, AGERE SYSTEMS
The Transmission Control Protocol (TCP), a transport layer protocol, was originally designed for low-speed terrestrial links. With the proliferation of high-speed transmission media and sophisticated access mechanisms, however, software-only TCP implementations now require extensive computing power and memory.
In recent years, there has been an explosion in the speed of networks, CPUs and memories. Since the speed of Ethernet has increased much more rapidly than that of CPUs or memories, the performance of today’s networks continues to be limited mainly by memory and CPU speeds.
One method for overcoming these limitations and accelerating a TCP/IP connection is TCP offloading, in which TCP processing complexity is moved from the host CPU to specialized TCP accelerators. Usually, the accelerating TCP offload engine (TOE) is a dedicated subsystem co-located with the host CPU. An alternative, more scalable approach that achieves the same end and also works with legacy equipment is to move the complicated TCP/IP processing to a network processor (NP). An NP is a special-purpose programmable hardware device, connected in this case to the TCP server over a high-speed link.
The NP-based TOE combines the low cost and flexibility of a general-purpose processor with the speed and scalability of custom silicon solutions. Furthermore, the NP offloads both memory-intensive and CPU-intensive processing from the TCP server. In the NP-based TOE mechanism described here, TCP splicing is used to enhance TCP performance.
An Alternative TCP Offload Approach
For low-end enterprise products limited by cost and size, addressing the memory/CPU bottleneck by introducing a remotely located TOE on a central networking element is a scalable and cost-effective solution. This is especially true when contrasted with the alternative of increasing the capability of the host itself, for example a keyboard, video and mouse (KVM) switch.
In one sample deployment scenario (Figure 1), a remote keyboard, monitor and mouse control multiple remotely managed servers connected to a KVM-over-IP switch. Mouse and keyboard events from the remote controlling computers are transmitted over the Internet via the KVM switch to the server. Compressed monitor data is sent by the server to the remote computer via the same KVM switch. The KVM switch offloads TCP processing to the NP-based TOE, which simplifies the KVM switch hardware.

TCP Offloading Using TCP Splicing
TCP offloading can be performed across two network entities, an enterprise TCP host and a remote NP. This is achieved by using a split TCP connection, also known as TCP splicing, a well-known technique for enhancing TCP performance. In TCP splicing, two independent TCP connections span a session: one from the client to the NP, and the second from the NP to the server.
Data from the server is locally acknowledged by the NP, thereby reducing server buffering requirements and speeding up congestion window growth (Figure 2). The NP buffers the data until the acknowledgment (ACK) from the far end (client) arrives. In the absence of an acknowledgment, the NP performs timer management to retransmit TCP segments toward the far end. The usage of local acknowledgments also shields the TCP server from any network congestion and excessive delays between the NP and the client. This reduces the memory burden at the server.
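The local-acknowledgment behavior described above can be illustrated with a minimal sketch. It is not the APP340 implementation; the class and method names are hypothetical, segments are assumed to arrive in order, and sequence-number wraparound is ignored.

```python
from collections import OrderedDict


class SpliceRelay:
    """Toy model of the NP-side buffer for server-to-client data (hypothetical)."""

    def __init__(self):
        # seq -> payload, held until the far-end client acknowledges it
        self.unacked = OrderedDict()

    def on_server_segment(self, seq, payload):
        """Buffer the segment and return the local ACK number sent to the server."""
        self.unacked[seq] = payload
        return seq + len(payload)  # next byte expected from the server

    def on_client_ack(self, ack):
        """Free every buffered segment fully covered by the client's cumulative ACK."""
        done = [s for s, p in self.unacked.items() if s + len(p) <= ack]
        for seq in done:
            del self.unacked[seq]

    def buffered_bytes(self):
        """Bytes the NP is holding on the server's behalf."""
        return sum(len(p) for p in self.unacked.values())
```

In this model the server sees its data acknowledged as soon as on_server_segment returns, while the memory cost of waiting for the client is carried by the relay.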

The NP also performs optional network address translation (NAT) for data from the server to the client involving checksum computations. Although this functionality is computationally intensive and heavily loads a general-purpose processor, an NP is well suited for such operations.
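As an illustration of the checksum work that NAT involves, the sketch below updates a 16-bit one's-complement checksum incrementally when a single header word changes, using the formula HC' = ~(~HC + ~m + m') from IETF RFC 1624. The function names and sample header words are assumptions made for the example; the article does not state which method the NP uses.

```python
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(words):
    """Internet checksum over a list of 16-bit words."""
    return ~ones_complement_sum(words) & 0xFFFF


def incremental_update(old_csum, old_word, new_word):
    """RFC 1624 eq. 3: new checksum after one 16-bit word is rewritten."""
    return ~ones_complement_sum(
        [~old_csum & 0xFFFF, ~old_word & 0xFFFF, new_word]
    ) & 0xFFFF


# Example: rewrite one address word and compare with a full recomputation.
header = [0x4500, 0x0034, 0xC0A8, 0x0001]
translated = [0x4500, 0x0034, 0x0A00, 0x0001]
updated = incremental_update(checksum(header), 0xC0A8, 0x0A00)
```

The incremental form touches only the changed words, which is why address and port rewriting can be done per packet without re-reading the whole segment.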
In contrast to traditional NP applications such as routing, the TCP offload application requires inter-packet dependency and connection-based state machines. Furthermore, this application deals with transport layer functionality, which is beyond traditional NP link- and network-layer processing.
There are two fundamental complexities in TCP processing. Buffer management consists of storing unacknowledged segments, out-of-order segments and stalled segments in congested networks. Timer management involves Retransmission Timeout (RTO) estimation and triggering retransmission of unacknowledged segments. Both of these are offloaded to the NP.
The alternative TCP offload mechanism, using TCP splicing, is also useful when a high-performance server or network operates in tandem with a long-delay wireless and/or low-bandwidth network, as is predominant in defense and mobile networks. In this scenario, the TCP offload on the NP shortens the slow start phase as a result of the perceived shorter delay caused by the immediate acknowledgment. Note that the slow start phase dominates application performance when transaction sizes are small.
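A rough calculation shows why the server-side slow start phase shrinks. The sketch below assumes an idealized slow start in which the congestion window starts at one segment and doubles every round trip, with no loss; the round-trip times used in the example are illustrative values, not measurements.

```python
def slow_start_rounds(segments, initial_cwnd=1):
    """Round trips needed to send `segments` when cwnd doubles each round."""
    rounds, sent, cwnd = 0, 0, initial_cwnd
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rounds += 1
    return rounds


def server_send_time_ms(segments, rtt_ms):
    """Time for the server to hand off all segments, as seen by the server."""
    return slow_start_rounds(segments) * rtt_ms


# A 15-segment transaction needs 4 rounds (1 + 2 + 4 + 8 segments).
end_to_end = server_send_time_ms(15, rtt_ms=500)  # ACKs come from a distant client
spliced = server_send_time_ms(15, rtt_ms=2)       # ACKs come from the nearby NP
```

Under these assumptions the server finishes sending in 8 ms rather than 2,000 ms. The NP-to-client connection still runs its own slow start, so the figure describes how long the server is occupied, not end-to-end delivery time.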
TCP congestion control is typically tuned for congestion in wireline segments, and its performance degrades over error-prone wireless channels, where losses are misread as congestion. TCP offload onto the NP prevents window collapse on the server, since error recovery is restricted to the connection between the NP and the client. This improves overall throughput.
Implementation of TOE with TCP Splicing on an NP
As a proof of concept, the proposed alternative TCP offload mechanism was implemented on an Agere APP340 NP (Figure 3). The APP3xx series of NPs offers up to 2 Gbits/s throughput. The device contains a classifier; a traffic manager consisting of a scheduler, a buffer manager and a stream editor (SED); a state engine; and an on-chip control processor, i.e., an embedded host. Slow path processing, meaning control and management, is handled by the embedded host. Fast path, or data-plane, processing is performed by the remaining components.

The classifier is used to identify the connection and its state. The state engine maintains the TCP state variables accessible to various functional blocks on the NP. The traffic manager runs the TCP congestion control protocol and is used to make scheduling decisions. The SED is used to perform sequence number manipulation and network address translation traversal.
The APP340 NP allows the traffic shaper functional block to control the scheduling of packets in each queue. For every flow, the NP maintains multiple destination queues, including primary transmission to the client, retransmission of packets to the client and transmission to the server.
The NP also supports hierarchical scheduling. This enables the use of auxiliary queues that carry control information at the same level of the hierarchy as each destination queue. The control packets are generated internally by the packet generator engine (PGE) based on the state machine. They are used to trigger scheduling mechanisms such as holding packets in the scheduling queues and releasing them only when the TCP state machine requires it.
The ability to control scheduling of TCP segments using this hierarchical scheduling architecture, along with the ability to maintain inter-packet state dependencies, makes the architecture of this NP well suited to TCP processing.
TCP Connection Establishment
When packets arrive at the NP’s ports, the classifier block determines the flow of the packet, based on the TCP port numbers and the IP address. If the flow is not already present, the packet is forwarded to the embedded host for connection establishment.
The initial packets correspond to the TCP three-way handshake (SYN, SYN-ACK and ACK). Upon receiving these packets, the host parses the TCP options and obtains parameters such as sequence number (SN) and maximum segment size, and passes them to the state engine and the SED. It also updates lookup trees on the classifier that associate a flow identifier with the TCP connection.
The packets are then returned to the classifier for reinsertion to the packet stream. The outgoing TCP handshake packets are subject to NAT traversal and bypass the remaining TOE functionality outlined in the subsequent sections. Similarly, the host handles connection termination. The classifier parses the FIN field, which indicates termination of the connection, and forwards the packets to the host.
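The split between the classifier and the embedded host can be sketched as a lookup table keyed on addresses and ports. This is a simplified software model, not the NP's lookup-tree programming interface; the FlowTable class and its methods are hypothetical.

```python
SYN, FIN = 0x02, 0x01  # TCP header flag bits


class FlowTable:
    """Maps (src_ip, src_port, dst_ip, dst_port) to a flow identifier."""

    def __init__(self):
        self.flows = {}
        self.next_id = 1

    def classify(self, key, tcp_flags):
        """Return ('host', id) for slow-path packets, ('fast_path', id) otherwise."""
        flow_id = self.flows.get(key)
        # Unknown flows and connection setup/teardown go to the embedded host.
        if flow_id is None or tcp_flags & (SYN | FIN):
            return "host", flow_id
        return "fast_path", flow_id

    def install(self, key):
        """Called by the host once the handshake has been parsed."""
        flow_id = self.next_id
        self.flows[key] = flow_id
        self.next_id += 1
        return flow_id

    def remove(self, key):
        """Called by the host after connection termination."""
        self.flows.pop(key, None)
```

Only the first and last packets of a connection take the slow path in this model; everything in between is resolved by a single table lookup.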
TCP Data Flow from the Server
When TCP segments for established flows enter the classifier, the flow identifier is obtained using the lookup tree. The APP340 NP is a block-based processor that employs a two-phase classification. In the first phase, individual blocks are processed. The re-assembled protocol data units are handled in the second phase. If a segment arrives out of order, it is held in the first pass re-order buffer in the classifier until the in-order segment arrives.
For in-order segments, TCP states are examined to see whether the current congestion window (cwnd), the receiver’s advertised window (rwnd), the memory restrictions of the NP and the amount of data in flight allow transmission of this packet. If the packet is to be transmitted, it is forwarded to the SED for NAT and transmission to the far-end client. A copy is stored in the retransmit buffer on the traffic shaper (TS). Another copy is sent to the SED for generating an immediate local acknowledgment. The SED uses the incoming TCP/IP headers, the SN of the last ACK and the last received byte number to generate the local ACK.
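The transmission check described above reduces to a comparison against the smaller of the two windows plus a buffer-space test. The function below is a minimal sketch with hypothetical parameter names; all quantities are in bytes.

```python
def can_transmit(seg_len, cwnd, rwnd, in_flight, np_free_bytes):
    """Return True if the segment may be sent to the client now.

    The usable window is min(cwnd, rwnd). The segment must fit in what
    remains of that window and in the NP's free retransmit-buffer space.
    """
    usable_window = min(cwnd, rwnd)
    fits_window = in_flight + seg_len <= usable_window
    fits_memory = seg_len <= np_free_bytes
    return fits_window and fits_memory
```

When the function returns False, the flow is in the stalled condition and the segment waits in its queue.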
If the TCP sliding window is stalled, the TS stops scheduling packets. The packets are effectively stored in the TS until the stall is released (Figure 4).

TCP Data Flow from the Client
When an ACK arrives at the classifier, the classifier extracts the acknowledgment number and forwards it to the state engine. The state engine updates the cwnd, reflecting the slow start or the congestion-avoidance phase of the TCP flow. The engine also updates the rwnd, the size of unacknowledged data and the stall status of the sliding window.
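The cwnd update can be sketched with the standard byte-counting rules for slow start and congestion avoidance. The slow start threshold (ssthresh) and maximum segment size (mss) parameters are not named in the article and are introduced here as assumptions; the state engine's actual arithmetic may differ.

```python
def cwnd_after_ack(cwnd, ssthresh, mss):
    """Return the new congestion window (bytes) after one new ACK."""
    if cwnd < ssthresh:
        # Slow start: grow by one segment per ACK, doubling every round trip.
        return cwnd + mss
    # Congestion avoidance: grow by roughly one segment per round trip.
    return cwnd + max(1, mss * mss // cwnd)


def is_stalled(cwnd, rwnd, in_flight):
    """The sliding window is stalled when no further bytes may be sent."""
    return in_flight >= min(cwnd, rwnd)
```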
A trigger is generated to the TS to remove acknowledged packets from the retransmission queue. The TS schedules all acknowledged packets and associated packets from the partner control queue, which are then discarded at the SED. If the TCP sliding window was previously stalled, a trigger is generated to the TS to resume servicing the TCP segments (Figure 5). This acknowledgment is then terminated at the NP.

If the client packet carries piggybacked data, the state machines are updated as described previously. The acknowledgment number in the header is modified to reflect the most recently generated local acknowledgment number and forwarded to the server. Furthermore, the next expected SN from the client is updated in the state engine to be used as the SN for subsequent local acknowledgments.
Retransmissions and Retransmission Timeout Estimation
For every packet in the retransmission buffer, a partner control queue generated by the internal PGE holds the transmit time of the segment. When an acknowledgment arrives, the round-trip time is computed as the difference between the arrival time of the acknowledgment and the transmit time. The retransmission timeout (RTO) is then estimated as recommended in IETF RFC 2988.
When packets are placed in the retransmission queue, the PGE generates a control packet containing the current time and the retransmission time, i.e., the current time plus the RTO.
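The RFC 2988 estimator keeps a smoothed round-trip time (SRTT) and a round-trip time variation (RTTVAR). The sketch below follows the RFC's constants (alpha = 1/8, beta = 1/4, K = 4, a 3-second initial RTO and a 1-second minimum); the clock granularity value and the class name are assumptions, and times are in seconds.

```python
class RtoEstimator:
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4

    def __init__(self, granularity=0.1, min_rto=1.0):
        self.granularity = granularity  # clock granularity G (assumed value)
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 3.0  # initial RTO before any measurement

    def sample(self, rtt):
        """Fold one round-trip measurement into the estimate and return the RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.min_rto,
                       self.srtt + max(self.granularity, self.K * self.rttvar))
        return self.rto
```

Feeding the estimator two consecutive 0.5-second samples gives an RTO of 1.5 seconds and then 1.25 seconds, as the variation term decays.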
The PGE is programmed to generate a periodic trigger. Each time this occurs, the TS examines the retransmission time. Every packet with an expired timer is scheduled for delivery to the SED, and the corresponding control packet is flushed. This is repeated until all packets with expired timers have been serviced. A copy of every retransmitted segment is looped back and stored for possible further retransmissions. The fast retransmission algorithm specified in IETF RFC 2001 is achieved by having the PGE generate a retransmission trigger upon receipt of three duplicate acknowledgments from the client.
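The periodic scan and the duplicate-ACK trigger can be modeled together in a few lines. This is a software sketch with hypothetical names, not the TS or PGE programming model: it uses a fixed RTO, omits timer backoff and treats a segment as acknowledged once the cumulative ACK passes its starting sequence number.

```python
class RetransmitQueue:
    def __init__(self, rto):
        self.rto = rto
        self.deadlines = {}   # seq -> retransmission time (control-packet contents)
        self.last_ack = None
        self.dup_acks = 0

    def enqueue(self, seq, now):
        """Store a segment copy with deadline = current time + RTO."""
        self.deadlines[seq] = now + self.rto

    def on_tick(self, now):
        """Periodic trigger: return expired segments and re-arm their timers."""
        expired = sorted(s for s, t in self.deadlines.items() if t <= now)
        for seq in expired:
            self.deadlines[seq] = now + self.rto  # copy looped back for later
        return expired

    def on_ack(self, ack):
        """Return a sequence number to fast-retransmit, or None."""
        if ack == self.last_ack:
            self.dup_acks += 1
            if self.dup_acks == 3 and ack in self.deadlines:
                return ack  # third duplicate ACK: resend the missing segment
            return None
        self.last_ack, self.dup_acks = ack, 0
        for seq in [s for s in self.deadlines if s < ack]:
            del self.deadlines[seq]
        return None
```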
The NP-based TOE enables server performance to be independent of the characteristics of the network between the NP and the client. Instead, performance reflects the characteristics of the link between the NP and the server. In addition, memory requirements of the server remain independent of network congestion, and of the round-trip delays in the client network.
TCP performance of a low-end device can be substantially enhanced by using a TCP offload mechanism at a remote NP. Furthermore, the server is shielded from the resource requirement variability associated with various client environments. In addition, the congestion characteristics and link loss characteristics of the high-performance client network, along with long delays, are absorbed at the NP-based TOE, substantially enhancing the user experience.
Agere Systems
Allentown, PA.
(800) 372-2447.
[www.agere.com].

