Advances in Memory Technology
Closing the Performance Gap: Making Memory Faster with Algorithms and Logic
By taking memory access to a higher level of abstraction, a new interface approach is able to greatly enhance read/write access speed without new process technology or higher clock speeds.
TOM WILLIAMS, EDITOR-IN-CHIEF
It is an eternal truth of system design that something always winds up being the bottleneck. In the case of ASIC/SoC design, and presumably general system design as well, that bottleneck is becoming the performance gap between the processor and its memory. Part of this is due to raw processor speed, but a big part of it results from the requirements of networks and multicore processors. The question that must now be addressed is how to close this soon-to-be-yawning performance gap (Figure 1).
Figure 1: Due to advances in processor architectures, particularly the advent of multicore processors, there is a huge impending gap between processor and memory performance.
In fact, it was the advent of the multicore processor that proved there must be another approach to enhancing performance beyond simply flogging the silicon with higher clock speeds. The multicore approach yielded significantly higher throughput without hugely increasing power consumption, heat dissipation or cost. Such an alternative approach, based on creative architecture and IP, is now being brought to the memory performance issue by a new company named Memoir Systems. Memoir’s approach surrounds existing memory technology, whatever its process geometry or memory type (DRAM, SRAM, DDR2, DDR3, etc.), with a layer of IP logic that takes memory access to a higher level of abstraction via creative algorithms for encoding, compression, virtualization and management.
The idea is to add external access points in the surrounding IP, creating an interface that can perform multiple accesses to the memory array in a single clock cycle while still addressing data at the granularity of single addresses. For example, a given single-port memory process technology with a clock speed of 500 MHz could perform four memory operations (MOPS) per external cycle, for an effective external clock speed of 2 GHz. At the system level, that memory would effectively be running at 2 GHz, and it appears as a standard multiport memory (Figure 2). This is accomplished entirely through the addition of external IP logic and in no way involves a redesign of the memory itself.
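The bank-parallelism underlying this idea can be sketched in a few lines of Python. This is an illustrative model only, not Memoir's IP: the bank count, the low-order interleaving and all names are assumptions made for the example.

```python
# Illustrative model: a memory built from independent single-port banks
# can service several requests per internal cycle, as long as each
# request targets a different bank.

NUM_BANKS = 4
WORDS_PER_BANK = 1024

class BankedMemory:
    def __init__(self):
        self.banks = [[0] * WORDS_PER_BANK for _ in range(NUM_BANKS)]

    def bank_of(self, addr):
        # Low-order interleaving: consecutive addresses land in
        # different banks, which spreads out typical access patterns.
        return addr % NUM_BANKS

    def access_cycle(self, reads):
        """Service up to NUM_BANKS reads in one cycle.

        Returns (results, stalled): reads that collide on a busy bank
        cannot complete this cycle and are reported as stalled.
        """
        busy = set()
        results, stalled = {}, []
        for addr in reads:
            b = self.bank_of(addr)
            if b in busy:
                stalled.append(addr)          # bank conflict: must wait
            else:
                busy.add(b)
                results[addr] = self.banks[b][addr // NUM_BANKS]
        return results, stalled

mem = BankedMemory()
# Four reads to four different banks complete in a single cycle...
hits, stalls = mem.access_cycle([0, 1, 2, 3])
assert len(hits) == 4 and stalls == []
# ...but two reads to the same bank conflict.
hits, stalls = mem.access_cycle([0, 4])
assert stalls == [4]
```

The interesting engineering, described below, is in what the surrounding logic does so that the stalled case never becomes visible at the external interface.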
Figure 2: A given memory technology consisting of numerous banks can be connected to innovative IP logic to parallelize faster external accesses and keep track of simultaneous reads and writes to and from the array without changing the underlying memory technology.
According to Sundar Iyer, CTO of Memoir, the memory consists of a number of banks, and different banks can be accessed in parallel because each bank has its own physical address and data bus. Thus, four external accesses addressing four different banks can take place in parallel in a single clock cycle. The matter gets more complex when all four addresses fall within a single memory bank. In that case, one of the writes is posted to its intended address and the remaining three are cached or buffered, either in locations within other banks of the memory or in the surrounding Memoir IP. These alternate locations act as a form of virtual address: a scratchpad memory keeps track of them so that each one stays correlated with the address the application intended.
Since successive reads and writes can arrive fast and furious and in all kinds of combinations, the logic in the Memoir IP has to manage every pattern of hot spots and multiple accesses to the same bank, and cache them intelligently. When there is time, the IP can move and rearrange them. “Of course,” Iyer notes, “you could also have a worst case for life, so we have to intelligently rearrange things so that the operations continue to be posted.”
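A minimal sketch of this write-diversion idea, assuming a simple low-order bank interleave and a small pool of spare locations, might look like the following. All names and mechanics here are assumptions for illustration; the article does not disclose Memoir's actual buffering and rearrangement scheme.

```python
# Hedged sketch: when several writes hit the same bank in one cycle,
# one is posted directly and the rest are diverted to free locations
# in other banks. A scratchpad maps each intended address to the
# location actually holding its data.

NUM_BANKS = 4

class VirtualizedWriteMemory:
    def __init__(self, words_per_bank=16):
        self.banks = [[0] * words_per_bank for _ in range(NUM_BANKS)]
        self.scratchpad = {}                  # intended addr -> (bank, offset)
        # Reserve the last few words of each bank as spare slots.
        self.spare = [(b, o) for b in range(NUM_BANKS)
                      for o in range(words_per_bank - 4, words_per_bank)]

    def _home(self, addr):
        return addr % NUM_BANKS, addr // NUM_BANKS

    def write_cycle(self, writes):
        """Accept a batch of (addr, value) writes in one external cycle."""
        busy = set()
        for addr, value in writes:
            bank, off = self._home(addr)
            if bank not in busy:              # post one write per bank directly
                busy.add(bank)
                self.banks[bank][off] = value
                self.scratchpad.pop(addr, None)
            else:                             # conflict: divert to a spare slot
                alt_bank, alt_off = next(loc for loc in self.spare
                                         if loc[0] not in busy)
                busy.add(alt_bank)
                self.spare.remove((alt_bank, alt_off))
                self.banks[alt_bank][alt_off] = value
                self.scratchpad[addr] = (alt_bank, alt_off)

    def read(self, addr):
        if addr in self.scratchpad:           # data lives at a diverted location
            bank, off = self.scratchpad[addr]
        else:
            bank, off = self._home(addr)
        return self.banks[bank][off]

mem = VirtualizedWriteMemory()
mem.write_cycle([(0, 10), (4, 20)])           # both map to bank 0: one diverted
assert mem.read(0) == 10 and mem.read(4) == 20
```

In a real design the spare pool is finite, which is exactly why the IP must opportunistically move diverted data back to its home location during idle cycles, as Iyer describes.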
If the process of writing seems complex, the read process gets even hairier. In the case of a two-read access, for example, if the application wants two pieces of data from the same memory bank, they cannot be accessed in parallel, and accessing them sequentially would cost performance. Therefore, all data stored in the physical memory is encoded, using a variety of schemes such as lossless compression and virtualization, into coded metadata that is stored in a scratchpad area. Storing this metadata requires an overhead of about 15 percent of memory capacity for a 2x performance increase. While not explicitly stated, that would appear to imply that a 4x performance increase would require a correspondingly larger memory overhead. Memoir states that it can achieve up to a 10x increase in performance.
In the example of a two-address access to the same bank, the first data word can be retrieved directly, but the second cannot be and must instead be decoded. This is why all stored data undergoes the metadata encoding. Three locations participate in the encoding of the metadata: the actual data, plus the data in two other locations in two different memory banks. To retrieve the data at the second target address, the system reads those two other locations (which it can access in parallel with the first target address) and uses the metadata in the scratchpad memory to recreate the data stored at the second target address.
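One well-known way to make this kind of reconstruction work is XOR parity coding, sketched below assuming three data banks plus a parity area. The article does not specify Memoir's actual encoding schemes, so this stands in only as an illustration of the principle: a word can be recreated from the other banks plus stored metadata, all readable in parallel.

```python
# Illustrative XOR-coding sketch: the parity of a given offset across
# all data banks is stored as "metadata". If two reads collide on one
# bank, the second word is rebuilt from the OTHER banks plus parity.

NUM_BANKS = 3
WORDS = 8

banks = [[b * 100 + o for o in range(WORDS)] for b in range(NUM_BANKS)]
# Scratchpad metadata: XOR of the same offset across all three banks,
# so exactly three locations participate in each encoded value.
parity = [banks[0][o] ^ banks[1][o] ^ banks[2][o] for o in range(WORDS)]

def dual_read(addr_a, addr_b):
    """Two reads per external cycle, even when both hit one bank."""
    bank_a, off_a = addr_a % NUM_BANKS, addr_a // NUM_BANKS
    bank_b, off_b = addr_b % NUM_BANKS, addr_b // NUM_BANKS
    value_a = banks[bank_a][off_a]            # first read is always direct
    if bank_a != bank_b:
        value_b = banks[bank_b][off_b]        # no conflict: also direct
    else:
        # Conflict: recreate the second word from the two other banks
        # plus parity; none of these reads touches the busy bank.
        value_b = parity[off_b]
        for b in range(NUM_BANKS):
            if b != bank_b:
                value_b ^= banks[b][off_b]
    return value_a, value_b

# Addresses 1 and 4 both live in bank 1; the second is reconstructed.
assert dual_read(1, 4) == (banks[1][0], banks[1][1])
```

Note that writes now cost extra work in this scheme, since the parity metadata must be updated whenever any participating location changes, which is part of the capacity and logic overhead described above.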
These schemes required extensive simulation and analysis to be able to manage all possible patterns and contingencies, and involve a large number of techniques that must all work together for all instances. According to Iyer, “You have to do it in a way that is deterministic so that every location is covered so that if you need to recreate it, you can always recreate it using the right combination of other locations. And the IP knows when you are doing a straight read from a given location.”
The Memoir IP that connects to memory arrays is written at the register transfer level (RTL), is agnostic with respect to semiconductor process and geometry, and fits into any standard design flow. In addition, it can connect to any memory architecture that accepts addresses and data. That gives developers a wide range of choices when selecting the memory and the technology they want to use for an embedded ASIC or SoC design. The technology is also at least theoretically applicable to discrete memory chips, but the initial implementation is targeted at memory integrated into embedded devices.
Memoir also has developed an analysis tool that lets the developer select a combination of characteristics for the desired memory. Using this synthesis platform (Figure 3), the developer can select a memory type (DRAM, eDRAM, etc.) as well as a process geometry, such as 22nm. Then, by selecting the number of read and write ports and the acceleration (2x, 4x, etc.), the developer can almost instantly see the power consumed and the die area for that configuration and get a picture of the available tradeoffs for the desired design goals.
Figure 3: The Memoir synthesis platform can analyze different configurations of memory type, desired acceleration, number of ports and other parameters and give information about the resulting density, speed, power consumption and die size within seconds.
In fact, the tool can be set up to provide an analysis for multiple instances of memory on the same die (all would, of course, use the same process geometry). In many SoCs, such as network devices, there are numerous memory blocks that fill different roles and thus have different numbers of ports, different capacities and even different types. Such an analysis can provide a selection of options for the desired mix of characteristics for the target device die size and power requirements (Figure 4). Thus for a given physical memory type, there may be a number of combinations in terms of the number of ports, the memory density and the power efficiency. These could all be characterized for each given candidate geometry or only for the one that has been preselected.
Figure 4: Choices for interfacing to physical memory fall along three axes: performance, density and power efficiency. The Memoir IP and analysis tools allow the developer to evaluate a wide range of options.
Memory performance thus becomes a programmable characteristic with its own set of tradeoffs for the system developer. And this performance enhancement and flexibility comes as a result of additional logic, not of different, smaller or faster process technology.
Memoir Systems, Santa Clara, CA.