Over the past several years, I/O subsystems have become a focal point within the
real-time computing segment because of the significant flexibility and diverse
functionality they offer to system architects. As applied to this market segment,
I/O subsystems are generally VME or CompactPCI in form-factor and usually have
two PMC (PCI Mezzanine Card) sites to support various interconnects.
Along with readily available APIs (Application Programming Interface) and development
environments, PMC sites should be largely credited for leading to the broader
acceptance of the I/O subsystem product category in embedded systems. As PMCs
are electrically equivalent to the ubiquitous PCI bus, the mezzanines can easily
accommodate any silicon with a PCI interface.
Since the PCI bus is the de facto standard for the manufacturers of I/O silicon,
this has led to an incredible breadth of product offerings. But more than just
choice, it has provided a platform for flexibility and future upgrades not previously
possible. As fabric protocols evolve and interconnects advance to higher speeds,
the I/O subsystem’s PMCs can be refreshed accordingly while preserving the
initial investment in the primary carrier board and much of the associated embedded
software without having a domino upgrade effect on other cards in the system.
From an embedded architectural perspective, these carriers have generally been
designed to optimize throughput from the interconnect, via the PMC site, to the
backplane or to the secondary PMC site. Acting much like a bridge or router, a
carrier configured this way is very effective for most applications. When used at either
end of the system, this strategy generally feeds or offloads internal system resources—the
multicomputer—at the fastest possible rate.
Beyond PMC-based I/O subsystems, even within the confines of VME and CompactPCI,
there are several other mezzanine implementations available. Some of these include
Industry Pack (IP), Pentek’s Velocity Interface Mezzanine (VIM), and Mercury
Computer’s Type B Daughtercards, all of which can be configured to create
I/O subsystems in much the same way as PMC-based carriers.
Memory Nodes and I/O in Multi-Processing
With the product category broadly defined, a possible implementation can be
examined. The system in Figure 1 involves a signal processing application where
remote converters sample and digitize signal data before transferring it to the
multicomputer. After processing by the multicomputer is completed, data
is uploaded to a video display and sent to persistent storage for archival and
additional non-real-time processing.
This system is VME in form-factor and utilizes a RACE++ backplane overlay across
the P2 connectors to provide a switch fabric interconnect to each board in the
system. The system can be divided into four primary subsections: the single board
computer (SBC), the front-end I/O subsystem, the multicomputer and the rear-end
I/O subsystems. The SBC maintains command and control over the VME bus on P1 while
primary data transfers take place over the RACE++ fabric on P2.
To acquire data from the remote converters, Serial FPDP has been chosen as the
input device in the front-end I/O subsystem because of its simple, low-overhead
protocol and the resilience of its Fibre Channel fabric. This is a real-time application
and as such, data input streams can be difficult to predict and schedule. Trace
analysis reveals that small, bursty transfers may be followed by long idle
periods or by long sustained streams.
This results in the multicomputer being underutilized, followed by periods of
being CPU-bound where data cannot be effectively processed in relation to the
long sustained input streams. Being CPU-bound may result not just from insufficient
aggregate MIPS, but from insufficient interprocessor communication bandwidth within
the multicomputer. In either case, combined processing resources are at times
idle and at other times saturated. This is a problem common to many computing
environments, but perhaps more critical in real-time applications.
Optimizing Embedded FIFOs
This type of bottleneck is often addressed by tuning internal FIFOs found within
the I/O controller. These FIFOs can range from a few hundred kilobytes to several
megabytes deep, employing novel tactics such as interleaving "swing buffers".
But relying on this resource to rate buffer the long sustained transfers common
to many real-time streaming applications can be a challenging proposition, especially
during periods of being CPU-bound.
Assuming the source code to the proprietary embedded firmware can be procured,
it is possible that adequate results can be achieved by delving deeply into optimizing
this code and the associated device driver. Effective tuning, however, can be
quite complicated, with results and a development cycle that are difficult to
predict in advance.
For more intensive streaming applications, even something on the order of a 4
Mbyte internal FIFO is an obviously insufficient resource to buffer the amounts
of incoming data when CPU-bound and attempting to prevent overflow. In these scenarios,
optimal results are achieved by introducing a large independent buffer memory
node into the system. As this introduction should occur somewhere between the
input device and multicomputer, a very effective point of insertion can be within
the I/O subsystem.
Buffer Memory Nodes Defined
In this context, buffer memory nodes can be defined as a memory resource distinct
from processors or I/O devices, with a memory controller that has its own distinct,
independent bus interface. These bus interfaces can be PCI, RACE, VME, etc. Independent
memory nodes are significantly different from what is commonly considered "system
memory" in several respects:
System Memory is generally defined as:
- A memory resource local to a specific device (processor, I/O controller, etc.).
- Memory that can only be accessed through the primary device interface of
which it is a subset.
- Intended for processor operations such as executing instructions, scratch
pad functions or running application code.
Buffer Memory Nodes are generally defined as:
- Not specific to any one device, but better viewed as an independent peripheral
device that resides on an end point of a bus segment.
- Memory that can be accessed independently by any other device with access
to that bus segment.
- Intended to be a low latency, temporary repository for data that can be
accessed directly by external devices.
Because buffer memory nodes are not local to a specific device or an extension
of a specific device, they can be equally accessed as an independent resource
from multiple destinations. Inefficient data paths and buses internal to a processor
or I/O controller are also avoided. Latency is minimized because buffer memory
nodes do not have the overhead that comes from performing other operations. This
is relevant within I/O subsystems and multi-processing applications because controllers
and processors are freed to perform tasks such as protocol translation or computation.
Memory nodes with Direct Memory Access (DMA) can be of particular value to these
applications because data movement can be accelerated between memory and other
devices such as I/O controllers or processor nodes. This takes place by having
the device driver provide the memory node with the location of data, or its destination,
and then triggering the DMA engine to commence the transfer without further involvement
from other devices. This enables the memory node to take control of the bus, saving
a bus transfer, and frees other devices to perform more specialized operations.
DMAs can also be chained together through the use of linked lists, so multiple
DMA operations can occur without intervention.
Buffer Memory Nodes in Front-End I/O Subsystems
Returning to the system application in Figure 1, it can be seen that the front-end
I/O subsystem is based on a MCJ6 VME carrier from Mercury Computers (Figure
2). The MCJ6 has two Type B Daughtercard sites, with the first site occupied
by a RINOJ-F mezzanine as the system’s input device. The RINOJ-F 2.5 has
two independent Serial FPDP channels, each capable of high-rate sustained data input.
In order to solve the previously described system-level bottleneck of being CPU-bound
to the multicomputer, the carrier’s second daughtercard site has been configured
with a large 4 Gbyte Type B daughtercard, the MM-6494D, from Micro Memory. This
card has two independent buffer memory nodes, each 2 Gbytes, which can be accessed
simultaneously to buffer and interleave the incoming Serial FPDP streams from
the RINOJ-F at RACE++ speeds of up to 267 Mbytes/s per port. Data can then be
accessed over the on-board crossbar and onto the switch fabric backplane overlay
via the carrier’s dual RACE++ connections on P2.
Inserting these large independent buffer memory nodes into the front-end I/O subsystem
alleviates the first bottleneck of being CPU-bound, as data overflow has been prevented
in essentially all cases. This was done by installing simple configuration files
and avoiding any need to tune the I/O controller’s embedded FIFOs.
Alternatively, it can be assumed that the multicomputer in this example application
is not CPU-bound. This means sampling rates are slow relative to the bandwidth
of the internal buses and processing capabilities, and there may be no need for
a buffer memory node to prevent overflow. But even among these cases, it is possible
that such a memory node can be of significant benefit.
Store and Forward
For many signal processing applications, algorithms are written around the distributed
architecture of the multicomputer to maximize CPU utilization and total throughput.
It is also possible that large cross sections of data (swaths) can be computed
more efficiently than a random series of smaller chunks. And in some applications,
these computations cannot commence until a complete and coherent data set has been received.
By inserting a large buffer memory node into the I/O subsystem, in front of the
multicomputer, data can be staged and irregular real-time streams can be scheduled
to properly match other system resources. This technique, commonly referred to
as "store and forward", enables pipelining data sets for parallel computation
and consequently optimizing system performance.
Another optimization technique relevant to alleviating this type of bottleneck
is known as interleaving. This strategy is similar to that of swing buffers found
in embedded FIFOs, where an input device fills one memory buffer while data is
emptied out of another memory buffer. Ping-ponging the independent buffers in
this fashion effectively creates a type of dual-ported memory, maximizing data
flow as reads and writes can occur simultaneously on separate buses.
Further downstream in Figure 1, it can be seen that there are additional bottlenecks
after data has left the multicomputer.
This application has two output requirements for the post-processed data. The
first is to upload results to the image display and the second is to send data
to persistent storage for archival and additional non-real-time processing. Two
I/O subsystems based on VME/RACE++ PMC carriers, such as those offered by VMetro
(PowerMidas), Mercury Computers (Myriad 4030) or TekMicro (PowerRACE) have been
selected to satisfy these output requirements. Chosen because they can easily
accommodate a wide range of storage interconnects and video cards, each of these
carriers has dual RACE++ connections to the system’s switch fabric.
Fibre Channel with SCSI protocol has been selected as the output device to an
external disk array. Whether the limiting factor is the transfer rate of the 2
Gbit Fibre Channel or the sustained throughput of the JBOD has not been determined,
but sequential writes are generally around 160 Mbytes/s. This is considerably
less than the data transfer rates coming from the multicomputer at up to 267 Mbytes/s,
and therefore another system bottleneck exists.
By configuring one of the VME/RACE++ carriers with a Fibre Channel PMC on
one site, a PMC buffer memory node such as the MM-6165D can be configured on
the same carrier’s adjacent mezzanine site (Figure 3). This 2 Gbyte 64-bit/66 MHz
PCI memory node can rate buffer data coming from the multicomputer before using
PCI memory node can rate buffer data coming from the multicomputer before using
its DMA to independently transfer the data to the Fibre Channel PMC for output
to the JBOD.
The second VME/RACE++ PMC carrier is configured with a video card on one of the
PMC sites. A bottleneck also exists here when transferring data from the video
card to the image display. Similarly, this problem is solved by configuring the
carrier’s second PMC site with a buffer memory node.