TECHNOLOGY IN CONTEXT
Embedded Memory Design
Algorithmic Memory Boosts Next Generation SoC Memory Performance
A new memory synthesis platform is able to leverage existing memory IP and combine it with new algorithms to create customized embedded memory solutions that can be highly optimized for specific applications.
SUNDAR IYER, MEMOIR SYSTEMS
System-on-chip (SoC) architectures are a popular choice for meeting the ever increasing performance needs of embedded and mobile systems. These designs generate a high volume of memory requests, including frequent back-to-back requests from multiple processor cores. Unable to keep up with processor demand, memory has become a bottleneck. SoC architects and designers are struggling to meet the performance requirements of today’s data-hungry applications. Faster processors can’t fix the problem, and can actually make it worse. With memory performance becoming the limiting factor in many SoC designs, the question arises: how can we make memory faster? Algorithmic memories, which use algorithms synthesized in hardware to speed up memory, offer SoC architects a new option for unlocking system performance.
When addressing SoC memory requirements, the first question to ask is: where is the bottleneck? In some cases, the bottleneck is obvious. For example, in network data path processing, a 4-port 100 Gbit/s Ethernet line card receives 600 million packets per second. For each packet, a unique set of jobs must be performed, some requiring four to six memory accesses per packet. Multiplying six accesses by 600 million packets yields 3600 million memory operations per second (MOPS). Since an embedded memory operating at 500 MHz can sustain only 500 million MOPS, system architects are challenged to bridge the performance gap. In other cases, the bottlenecks become evident only after the designer starts working out the micro-architecture and analyzing all pipeline stages, inputs, outputs and the logic in between.
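As a quick sanity check, the arithmetic behind that gap can be worked in a few lines of Python. The figures are the ones quoted above; one access per memory cycle is assumed.

# Back-of-the-envelope check of the line-card memory bandwidth gap.
PACKETS_PER_SEC  = 600e6    # 4-port 100 Gbit/s Ethernet, minimum-size packets
ACCESSES_PER_PKT = 6        # worst case quoted above (four to six per packet)
MEM_CLOCK_HZ     = 500e6    # embedded memory, assumed one access per cycle

required  = PACKETS_PER_SEC * ACCESSES_PER_PKT   # memory operations per second
available = MEM_CLOCK_HZ

print(f"required : {required / 1e6:.0f} million MOPS")    # 3600
print(f"available: {available / 1e6:.0f} million MOPS")   # 500
print(f"shortfall: {required / available:.1f}x")          # 7.2x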
Today, system designers use a wide array of ingenious system-level mechanisms, such as hierarchical caches, pipelined memory architectures, memory striding and static memory allocation, to avoid memory bottlenecks. Statistical solutions such as memory interleaving, which involves banking of memories, are also commonly used. Often, achieving performance goals requires compromises such as replicating memory hardware or increasing design complexity. Most importantly, though, system-level approaches are not always applicable, and do not always provide the necessary performance. What if these kinds of architectural mechanisms could be incorporated at a lower level and placed into the embedded memory core itself?
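To see why statistical interleaving stops short of a guarantee, consider a minimal Python model of a banked memory. Everything below is an illustrative sketch, not any vendor's design: addresses are striped across four banks, and two requests that land on the same bank in the same cycle cannot both be served.

# Toy model of memory interleaving: low-order address bits select the bank.
NUM_BANKS = 4

def bank_of(addr: int) -> int:
    return addr % NUM_BANKS

def stalled_requests(addrs: list[int]) -> int:
    """Count requests in one cycle that lose bank arbitration."""
    busy = set()
    stalled = 0
    for a in addrs:
        b = bank_of(a)
        if b in busy:
            stalled += 1      # same-bank collision: request must wait
        else:
            busy.add(b)
    return stalled

print(stalled_requests([0, 1, 2, 3]))     # 0 -- addresses spread across banks
print(stalled_requests([0, 4, 8, 12]))    # 3 -- all map to bank 0, worst case

Average throughput is good, but an unlucky address stream collapses onto one bank, which is exactly the kind of corner case a line card with hard real-time budgets cannot tolerate.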
A new approach called algorithmic memory technology does exactly that, incorporating such mechanisms into the embedded memory core itself. Algorithmic memories work by adding logic to existing embedded memory macros, enabling them to operate much more efficiently. Within the memories, algorithms intelligently read, write and manage data in parallel using a variety of techniques such as buffering, virtualization, pipelining and data encoding. These techniques are woven together to create a new memory that internally processes memory operations an order of magnitude faster and with guaranteed performance. This increased performance capability is made available to the system through additional memory ports so that many more requests can be processed in parallel (Figures 1 and 2).
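One simple way to see how wrapper logic can add a guaranteed port is sketched below in Python. It builds a two-read-port memory from two unmodified single-read-port macros by sending every write to both copies. This is deliberately the crudest possible technique, full replication, and the class names are invented; production algorithmic memories rely on more area-efficient methods such as the data encoding mentioned above, but the black-box principle is the same.

class OneReadMacro:
    """Stand-in for a stock 1R1W embedded memory macro."""
    def __init__(self, depth: int):
        self.cells = [0] * depth
    def read(self, addr: int) -> int:
        return self.cells[addr]
    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = data

class TwoReadMemory:
    """Presents a 2R1W interface; internally two 1R1W macros."""
    def __init__(self, depth: int):
        self.copy_a = OneReadMacro(depth)
        self.copy_b = OneReadMacro(depth)
    def write(self, addr: int, data: int) -> None:
        self.copy_a.write(addr, data)     # keep both copies coherent
        self.copy_b.write(addr, data)
    def read2(self, addr0: int, addr1: int) -> tuple[int, int]:
        # Two reads every cycle, guaranteed -- no bank-conflict stalls.
        return self.copy_a.read(addr0), self.copy_b.read(addr1)

mem = TwoReadMemory(depth=1024)
mem.write(7, 42)
print(mem.read2(7, 7))    # (42, 42)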
Perhaps you are wondering, where’s the catch? You can’t get something for nothing. As with all aspects of design, there are tradeoffs to be made. Algorithmic memories trade a small amount of area, about 15 percent, to double performance. Using this approach, performance improvements of up to tenfold are possible, although not always practical. Alternatively, if area or power is more critical, then performance can be traded to optimize these (Figure 3).
Figure 3: Algorithmic memory technology allows system designers to treat memory performance as a programmable characteristic with its own set of tradeoffs with respect to speed, area and power. For example, it is possible to trade a small amount of area, about 15 percent, to double performance.
In addition, every application has a different pain point. Do reads need to be faster, or writes, or both? Is power consumption an issue? Is die area a concern? How can we find the optimal balance of speed, area and power? Understanding whether a bottleneck is the result of reads, writes, updates or any combination of reads and writes is the basis for determining what kind of memory is required to solve the problem, and this is where algorithmic memory technology comes into play.
Algorithmic memory allows us to create customized memories that are highly optimized for specific applications. The more clearly and narrowly we define the performance requirement for a specific application, the better we can make the right tradeoffs in terms of speed, area and power. For example, if an application is mainly doing reads to a data structure, and that becomes a bottleneck, then perhaps the best solution would be a four-read-port memory with just one write port (4R1W). In another case, an architect may decide two read ports and two write ports are needed; this could be satisfied with a four-port algorithmic memory with two read and two write ports (2R2W), assuming the requirement is for equal amounts of read and write acceleration. Another designer might find that an application at times does mostly reads and at other times mostly writes. In this instance, a quad-port memory, meaning four bi-directional ports, would be preferable, as the simple sketch below illustrates.
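A hypothetical helper makes the decision procedure explicit. The function name and selection rules below are invented purely for illustration and are not part of any real tool.

def pick_ports(reads: int, writes: int, bursty: bool) -> str:
    """Map a per-cycle access profile to a port configuration."""
    if bursty:
        # Read-heavy and write-heavy phases alternate, so bi-directional
        # ports that serve either direction on demand are the better fit.
        return f"{max(reads, writes)} bi-directional ports"
    # Steady mix: dedicated uni-directional ports sized to the profile.
    return f"{reads}R{writes}W"

print(pick_ports(4, 1, bursty=False))   # 4R1W  -- read-dominated structure
print(pick_ports(2, 2, bursty=False))   # 2R2W  -- balanced acceleration
print(pick_ports(4, 4, bursty=True))    # 4 bi-directional ports (quad-port)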
By convention, if the ports are bi-directional, the memories are called dual-, tri-, quad-port and so on. A quad-port memory is a superset: it can be used as a four-port memory or with various ports used bi-directionally. Perhaps an application is doing only updates, a special case in which the application performs a read-modify-write. For example, to update a counter you would read from an address and write back to the same address a few cycles later. Can a memory be built to exploit this special case?
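It can, at least in principle. Because an update's write always returns to the address read a fixed number of cycles earlier, wrapper logic needs only a small forwarding buffer to hide the write latency while keeping reads coherent. The behavioral sketch below is illustrative only; the names are invented, and a real update-optimized memory involves far more machinery.

class UpdateMemory:
    """Toy model of a memory optimized for read-modify-write updates."""
    def __init__(self, depth: int):
        self.cells = [0] * depth
        self.in_flight = {}               # addr -> value awaiting write-back

    def read_for_update(self, addr: int) -> int:
        # Forward a pending value if the physical write has not landed yet.
        return self.in_flight.get(addr, self.cells[addr])

    def write_back(self, addr: int, value: int) -> None:
        self.in_flight[addr] = value      # visible to readers immediately

    def commit(self, addr: int) -> None:
        # Models the physical write completing a few cycles later.
        self.cells[addr] = self.in_flight.pop(addr)

mem = UpdateMemory(depth=256)
v = mem.read_for_update(5)                # read the counter
mem.write_back(5, v + 1)                  # increment it
print(mem.read_for_update(5))             # 1 -- correct before the commit
mem.commit(5)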
Until now, it has been impractical to use customized memories because both the cost and the time required to design, develop and verify a new memory were prohibitive. This is no longer the case with algorithmic memories, which can be created very rapidly by combining existing memory IP cores with previously verified algorithms. In principle, a memory synthesis tool could analyze memory IP from any vendor and select the right memory core and the right combination of algorithms for a particular set of memory requirements. Using this automated memory synthesis platform, a new custom memory could be created within a couple of days (Figure 4).
Figure 4: Custom embedded memories can be synthesized on many different embedded memory types including SRAMs, eDRAM, register files and more. Memories can be configured with any combination of read and write interfaces.
How would this new memory synthesis platform actually work? A system architect would specify the desired characteristics of the new memory, such as the number of read and write interfaces, the operating clock frequency, and any area and power requirements. The synthesis platform would then need to perform extremely rapid analysis and estimation of potential solutions, since it must sort through a large body of commercially available memory IP and determine the best match of physical memory and algorithms.
This phase of processing could be done using abstract models of the IP building blocks. For example, all memory IP might be characterized in a common format, such as an ASCII representation, to capture each memory’s data width, address depth, operating clock frequency and power consumption. Likewise, every available algorithm could be characterized and mapped into a database for selection based on whether it accelerates reads or writes, the number of ports it supports and so on. Working with this high-level information, the synthesis platform could rapidly analyze various combinations of memory IP and algorithms to find a set of potential algorithmic memory solutions and their estimated speed, area and power characteristics. An architect could then choose an algorithmic memory based on the preferred memory IP vendor or specific details of a particular configuration. Only in a final stage would the synthesis platform need to use detailed information for a specific vendor, process and node to synthesize the algorithmic memory and close timing (Figure 5).
Figure 5: An algorithmic memory synthesis platform can analyze different configurations of memory type, desired acceleration, number of ports and other parameters, and report the resulting density, speed, power consumption and die size within seconds.
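The selection phase can be mimicked in miniature. In the Python sketch below, every record is a hypothetical stand-in: the vendor names, frequencies, area numbers and the two algorithm entries are invented purely to show the shape of the search.

from dataclasses import dataclass
from itertools import product

@dataclass
class Macro:                  # abstract characterization of one memory core
    name: str
    mhz: int                  # maximum operating frequency
    rel_area: float           # area relative to a baseline 1R1W macro

@dataclass
class Algo:                   # abstract characterization of one algorithm
    name: str
    reads: int                # read ports of the resulting memory
    writes: int               # write ports of the resulting memory
    area_factor: float        # area overhead added to the macro

MACROS = [Macro("vendorA_1r1w", 600, 1.0), Macro("vendorB_1r1w", 500, 0.8)]
ALGOS  = [Algo("read_2x", 2, 1, 1.15), Algo("write_2x", 1, 2, 1.20)]

def candidates(reads: int, writes: int, mhz: int):
    """Yield (macro, algorithm, estimated area) combinations meeting a spec."""
    for m, a in product(MACROS, ALGOS):
        if m.mhz >= mhz and a.reads >= reads and a.writes >= writes:
            yield m.name, a.name, round(m.rel_area * a.area_factor, 2)

# Architect asks for a 2R1W memory at 500 MHz; rank the matches by area.
for c in sorted(candidates(2, 1, 500), key=lambda t: t[2]):
    print(c)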
Once a suitable combination of memory IP and algorithms has been identified, how is the algorithmic memory built? Building a complex memory with circuits is tedious, and there is essentially only one brute-force way to do it: building a 4R4W memory, for example, requires laying out enough transistors in each cell to support four read ports and four write ports. With algorithmic memory, however, there are many ways to build one, each with its own advantages and disadvantages. For example, to build a 4R memory, we might start by building a 2R memory. This 2R algorithmic memory could then be modified to create a 3R memory, and the 3R memory modified to form a 4R memory. Furthermore, the 4R memory could form the basis for a 4R1W memory, and additional write acceleration algorithms could be added to support more write ports to form a 4R4W memory. The underlying physical memory may be doing only 1R. The point is that algorithmic memory can be constructed hierarchically, and it is not necessary to build every algorithm a priori.
To further demonstrate this, consider building a 7R8W custom memory. Imagine there is an algorithm that can take a single-port physical memory and make it look like a 2R memory, giving 2x read acceleration. Because a 2R algorithmic memory functions just like a 2R physical memory, we can treat it as a black box and repeat the operation on each of its ports individually: re-instantiate the algorithm, hook the first instance to the first port and another instance to the second port, and we have 4x read acceleration, and so on. This operation can be applied recursively for both the reads and the writes.
The important point is that algorithmic memory can be built recursively. The synthesis platform tool can build an NR, NW memory by recursively invoking its core algorithms, which means only a few core algorithms are required. The core set could include 2x and 3x algorithms for both read and write acceleration; since two and three are prime, composing them covers a wide range of port combinations. In some cases, it may be preferable to do the acceleration directly. For example, rather than recursively applying 2x on top of 2x, a direct 4x algorithm might work better, so additional algorithms could be developed.
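The recursive composition argument fits in a few lines of Python. The builder below is a toy under explicit assumptions: it composes only multiplicative 2x and 3x steps, whereas a real tool would also have incremental steps (such as the 2R-to-3R modification described earlier) and direct algorithms for the remaining cases. All names are illustrative.

def build(ports: int, accel: str = "read") -> str:
    """Compose 2x/3x core algorithms to reach the requested port count."""
    if ports == 1:
        return "1-port physical macro"
    for factor in (2, 3):
        if ports % factor == 0:
            return f"{accel}_{factor}x( {build(ports // factor, accel)} )"
    raise ValueError(f"{ports} is not a product of 2s and 3s; "
                     "an incremental or direct algorithm is needed")

print(build(4))             # read_2x( read_2x( 1-port physical macro ) )
print(build(6, "write"))    # write_2x( write_3x( 1-port physical macro ) )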
At the end of the process, what would be the output of the synthesis platform, or more to the point, what would constitute an algorithmic memory? Recall that the circuit definitions for the algorithms are just register-transfer level (RTL) logic, like any other logic. A logic synthesis tool such as Design Compiler from Synopsys is then used to create an intermediate format, i.e., a gate-level netlist, that is specific to a foundry and technology node.
Algorithmic memories would be delivered as soft intellectual property (soft IP), because only intermediate formats are generated. Chip designers would then integrate the intermediate-format memory with the rest of the chip’s intermediate-format components. This has the advantage of allowing chip designers, who have the deepest system knowledge of their chips, to make decisions about how placement and routing are done.
In summary, algorithmic memory technology addresses the challenge of memory performance at a higher level and allows system designers to rapidly create customized memory solutions that are optimized for a specific application. Thus, algorithmic memories allow system architects to treat memory performance as a configurable characteristic with its own set of tradeoffs with respect to speed, area and power.
Memoir Systems, Santa Clara, CA.