Increasing the life expectancy of the NAND array reduces the total cost of ownership of the system. Also, the ability to reconfigure the controller to implement improved algorithms, or to adapt to a new generation of flash chips, extends the system life.
by Robert Pierce, Altera, Conor Ryan and Joe Sullivan, NVMdurance
NAND Flash has revolutionized the computing markets by improving ease of access to and availability of data, from the data center to your mobile device. Although the rate of geometry shrink (reduction in the size of structures in silicon) has slowed, the industry continues to find innovative ways to increase capacity and reduce cost, but often at the expense of reliability, which is particularly problematic at the enterprise level. Solid State Disk (SSD) controller design has lagged behind and has been unable to solve the issue of reliability without introducing other limitations to the system as a whole.
Today’s FPGA-based SSD implementations can extend the life of NAND Flash and give SSDs new viability. They overcome the limitations of conventional controllers by permitting virtually all controller operations to be conducted in hardware. The reconfigurable nature of FPGAs enables manufacturers to change and tune their controllers on the fly in hardware.
SSDs are superior to traditional hard disk drives (HDDs) in almost every way. They are much faster and smaller, consume less power, and produce less heat and noise, as they have no moving parts. However, one area in which they are not superior to HDDs is average lifespan; flash memory is the cornerstone technology in SSDs, but it wears rapidly through usage.
This is known as the endurance problem; similarly, there is a retention issue, in that, although flash is non-volatile, it isn’t permanent storage, and data effectively “leaks out” over time.
At a high level, NAND is divided into three classes, in order of development: SLC, MLC and TLC. SLC stores a single bit of data per memory cell (one bit-per-cell) where the presence or absence of charge represents data.
MLC, on the other hand, stores two bits per cell, and rather than the presence or absence of charge, it is the quantity of charge stored that determines which of four states (00, 01, 10 and 11) is stored in the memory element.
The maximum data density is provided by TLC (three bits per cell), which differentiates between eight different voltage states. In a perfect world, the NAND vendors could just continue cramming more bits into each cell, but increasing the number of bits per cell increases the access time and reduces the data reliability. The constant pressure to produce flash with smaller and smaller geometries compounds the wear-out issue. Smaller geometries mean that more bits can be packed into the same area, but this leads to faster wear-out. Table 1 summarizes the characteristics of each.
Table 1: A comparison of the different classes of flash memory.
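The relationship between bits per cell and the number of charge states a cell must resolve can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the bit patterns are listed in plain binary order, which is not necessarily how a real device maps patterns to voltage levels.

```python
# Bits stored per cell for each NAND class named in the text.
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_states(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must be able to resolve."""
    return 2 ** bits_per_cell


def state_labels(bits_per_cell: int) -> list:
    """Bit patterns, one per charge level, in plain binary order (illustrative only)."""
    return [format(i, "0{}b".format(bits_per_cell))
            for i in range(charge_states(bits_per_cell))]


for name, bits in BITS_PER_CELL.items():
    print(name, charge_states(bits), state_labels(bits))
```

Each extra bit doubles the number of levels that must fit into the same voltage window, which is why the margin between neighboring states shrinks as density rises.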
The endurance and retention issues lead to errors in the data. The flash controller typically uses Error Correcting Codes (ECC) to identify and correct these errors. The number of errors that can be handled in this way is directly related to the level of extra data written, and the amount of processing time spent handling the errors. Until recently, particularly with SLC and MLC devices, the most common error correction used was the Bose-Chaudhuri-Hocquenghem (BCH) method. BCH error correction worked quite well in most cases, offering a predictable operating time, or latency, and is not particularly difficult to implement in hardware for solutions requiring up to 50 bits of error correction per chunk of data. However, it has scaling issues.
Above 50 bits of error, BCH begins to consume hardware resources at an alarming rate and, in many TLC implementations, BCH has been abandoned in favor of the more powerful Low Density Parity Check (LDPC) approach, which, while considerably more expensive in terms of resources, scales more gracefully than BCH as the level of ECC required increases.
LDPC operates using both hard and soft information decode. Hard decode is analogous to BCH, while soft decode can be used to add extra correction capacity in the event that hard decode cannot correct the errors. Soft decode pays for this, however, by using hints extracted by characterizing the degradation of the NAND chip at the foundry.
These hints are coded into a read-retry mechanism, where the new read is tailored to the degraded state of the NAND device. This is a neat trick, but these re-reads introduce several issues for the NAND channel as they occur randomly and increasingly with age. As each re-read causes a delay, it can be difficult to predict exactly how long a read will take; this is particularly problematic when the reads need to be re-inserted into a structured pipeline with many overlapping operations.
Given that in most cases a file is dispersed across many NAND devices, or even many SSDs, this can introduce unacceptable delays for file retrieval (known as tail latency). If a portion of a file is delayed, the whole file will be delayed. This is especially troublesome in striped applications and can hasten the end of life because the SSD has become too slow at delivering cleaned-up data.
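The effect of a single slow chunk on a striped read can be shown with a toy model. The timing figures (100 microseconds per read and per re-read) and the function names are assumptions chosen for illustration, not measurements from any device.

```python
def chunk_latency_us(base_us, retry_us, retries):
    """Latency of one chunk read: the first read plus any re-reads."""
    return base_us + retries * retry_us


def striped_read_latency_us(retries_per_device, base_us=100.0, retry_us=100.0):
    """Chunks are fetched in parallel, so the file is complete only
    when the slowest chunk has arrived."""
    return max(chunk_latency_us(base_us, retry_us, r) for r in retries_per_device)


# Eight devices, no re-reads anywhere.
fresh = striped_read_latency_us([0, 0, 0, 0, 0, 0, 0, 0])
# Same stripe, but one aged device needs three re-reads.
aged = striped_read_latency_us([0, 0, 0, 3, 0, 0, 0, 0])
print(fresh, aged)
```

Under these assumed numbers, one device needing three re-reads quadruples the latency of the whole stripe, even though the other seven devices responded on the first read.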
It would be much better if this characterization effort were used to reduce the error rate in the data stream in the first place, rather than to correct the errors after they have occurred.
To understand how the controller can minimize error creation, we should first understand what kinds of errors are generated.
- Read Disturb: These errors affect a single read and may clear when the same location is read again. The more reads that occur without refreshing the data, the more likely they are to occur.
- Program/Erase Cycling: In changing the state of a cell, electrons are forced through an insulator that keeps the charge on the memory element (the floating gate). The effect of changing state many times is akin to punching holes in the insulator and trapping electrons in the insulator media, like firing ball bearings through the side of a bucket in order to fill it up—the medium that needs to be penetrated is the same one that is required to keep the balls in.
- Retention: Over time, the electrons will drift off, and the stored voltage will change, especially once there are a lot of holes in the insulator.
The rate at which flash wears out is directly related to the stress of the writes, or from our earlier example, the number of ball bearings put into the bucket. Higher stress, in the form of higher voltages or longer write times, leads to more holes and implantation and faster wear-out. However, if not enough electrons are passed onto the gate, the flash will suffer from retention issues, and data will be lost. This implies that retention and endurance are somewhat interchangeable. By relaxing the retention requirements, higher endurance levels may be enjoyed, and vice versa.
It also implies that early on in life, when there is little damage, relatively low stress (fewer ball bearings) could be used to write the flash, while later in life, when the flash has endured thousands of cycles, considerably higher stress could be used to ensure the retention constraint. However, flash doesn’t have a way to actively respond like this, so the worst-case scenario, high-stress writes, must be used, hastening the chip’s demise.
Discovering the values for the internal control registers that give the best (or even just a tolerable) trade-off between retention and endurance is known as trimming. This involves discovering and setting key operational parameters such as voltages and write times, read thresholds, and so on. The factory usually trims the devices to meet the industry standard (JEDEC) specifications, which may or may not match requirements for a particular application.
In general, the trims are static, and never change during the lifetime of the flash/SSD, even though, particularly early in life, quite different sets can be used. Furthermore, trading endurance for retention may be useful in data centers, where data is rarely kept on SSDs for long periods of time, as so-called cold data (which is rarely accessed) is usually moved to HDDs. If the retention constraint were reduced, less stress could be used, and so, higher levels of endurance could be enjoyed. The solution to this must involve active management, based on the health of the memory elements using the SSD controller.
Active Management of Flash
Active management of flash takes the approach of dynamically varying the register values throughout the lifetime of the device. In particular, early on in life, when relatively low stress can safely be used, wear can be minimized so that the least possible amount of damage is being done, while still safely satisfying retention. Similarly, later in life, as the flash requires higher levels of stress to ensure reliability, the values of the registers can slowly be increased, but, because so much less wear was accumulated early on in life, the eventual end-of-life of the part occurs much later.
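The idea of stepping through register sets as wear accumulates might be sketched as a simple table lookup. Everything here is hypothetical: the stage boundaries, the register names and their values are invented for illustration, and a real system would likely switch on measured health (such as bit error rate) rather than on cycle count alone.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class TrimStage:
    start_cycle: int        # first P/E cycle at which this stage applies
    program_voltage: float  # hypothetical register value, in volts
    program_pulse_us: int   # hypothetical register value, in microseconds


# Hypothetical stages: write stress rises as the block accumulates wear.
STAGES = [
    TrimStage(0, 15.0, 200),
    TrimStage(1000, 15.5, 220),
    TrimStage(2500, 16.0, 250),
]
_STARTS = [s.start_cycle for s in STAGES]


def stage_for(pe_cycles: int) -> TrimStage:
    """Return the register set for a block with the given P/E cycle count."""
    if pe_cycles < 0:
        raise ValueError("pe_cycles must be non-negative")
    # bisect_right finds how many stage boundaries are <= pe_cycles.
    return STAGES[bisect_right(_STARTS, pe_cycles) - 1]


print(stage_for(0), stage_for(999), stage_for(1000), stage_for(5000))
```

A static factory trim corresponds to using only the last, highest-stress row for the entire life of the part.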
Once an SSD is capable of performing this sort of active management, there are all sorts of use cases available. In particular, SSD manufacturers can now tailor specific flash parts to certain use cases; for example, one SSD model might require twelve months’ retention, while a hyper-scale customer might require an SSD to be re-tasked to retention of just a single week on the fly. The SSD is capable of using the same flash in both configurations simply using different management techniques.
To treat the flash in this manner, the SSD controller needs to run special flash management software that monitors the degradation of the flash and decides at what point it is necessary to change parameter sets. One such method is NVMdurance Navigator, which constantly monitors flash wear.
Actively managing flash can be a delicate balancing act: too aggressive early on and the flash will wear more quickly than it needs to, but not aggressive enough and retention may be compromised. Furthermore, not all flash cells, even those contained on the same die, wear at the same rate. So it is entirely possible that a set of registers that will safely see one cell to 500 cycles and 12 months retention, for example, will only get a nearby cell to 400 cycles and 12 months retention. This sort of variation is usually dealt with by “guard-banding” the flash, that is, ensuring that even the weaker cells can attain the specified endurance, even if that means specifying a lower endurance than many blocks are capable of, e.g. 400 cycles instead of 500 cycles. This sort of derating of cells results in wasted cycles.
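The cost of guard-banding can be made concrete with the 400- and 500-cycle figures from the example above. The four-block population and the function names are assumptions for illustration only.

```python
def guard_banded_rating(block_endurance):
    """The rated endurance is set by the weakest block in the population."""
    return min(block_endurance)


def wasted_cycles(block_endurance):
    """Cycles the stronger blocks could have delivered beyond the rating."""
    rating = guard_banded_rating(block_endurance)
    return sum(e - rating for e in block_endurance)


# Hypothetical per-block endurance at 12 months retention.
blocks = [500, 500, 480, 400]
print(guard_banded_rating(blocks), wasted_cycles(blocks))
```

In this small population, one weak block pulls the rating down to 400 cycles and leaves 280 usable cycles on the table across the other three.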
Using sufficiently powerful active management techniques, however, one can exploit these spare cycles. By managing outlier blocks, that is, those that are less likely to make it to the target cycling level, one can temporarily rest them, and instead, spread the load across the rest of the blocks. Thus, there are two different ways in which extra cycles can be wrung from the flash; first, by using weaker stress early on in life, and second, by ensuring that any outlier blocks on the devices are identified and dealt with early in life.
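One way to picture resting outlier blocks is a block-selection policy that skips any block whose health looks much worse than its peers. This is a minimal sketch under stated assumptions, not the method used by any particular product: the `error_rate` metric, the median-based outlier test and the `rest_factor` threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Block:
    block_id: int
    pe_cycles: int
    error_rate: float  # hypothetical health metric, e.g. a measured raw bit error rate


def pick_block_for_write(blocks, rest_factor=2.0):
    """Pick the least-worn block, resting outliers whose error rate
    is more than rest_factor times the median of the population."""
    if not blocks:
        raise ValueError("no blocks available")
    threshold = rest_factor * median(b.error_rate for b in blocks)
    healthy = [b for b in blocks if b.error_rate <= threshold]
    return min(healthy, key=lambda b: b.pe_cycles)


pool = [
    Block(0, 120, 1.0e-4),
    Block(1, 100, 9.0e-4),  # least worn, but degrading far faster than its peers
    Block(2, 150, 1.2e-4),
    Block(3, 130, 1.1e-4),
]
print(pick_block_for_write(pool).block_id)
```

A plain wear-leveling policy would choose block 1 because it has the fewest cycles; the outlier check rests it and sends the write to block 0 instead.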
Key to the entire enterprise of actively managing parameters is to ensure that the parameters being used at any given time are either optimal, or near optimal. Current generation devices have anywhere from fifty to three hundred control registers, and this number is only likely to get bigger with time.
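A rough calculation shows why finding good values for this many registers is hard. The assumption that each register has only four candidate values is purely illustrative; it is chosen to keep the numbers small.

```python
def search_space(registers: int, values_per_register: int) -> int:
    """Number of distinct register settings if every register is set independently."""
    return values_per_register ** registers


# Register counts from the text; four candidate values each is an assumption.
for registers in (50, 300):
    combos = search_space(registers, 4)
    print(registers, "registers:", len(str(combos)), "decimal digits")
```

Even at the low end of fifty registers, the count is 2 to the power of 100, far beyond anything that could be tested exhaustively on hardware.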
In flash factories today, highly-skilled and experienced engineers are relied upon to produce these sets of register values, through a mixture of engineering experience and massive characterization efforts, often basing new sets on the most recent generation of device that was the most similar to the next generation. This is a slow and expensive process which requires months of testing, and it costs millions of dollars to produce a single set. The complexity of current and next generation devices means that this undertaking is becoming unmanageable, and manufacturers are looking for ways to formalize and simplify the process.
The problem becomes nearly impossible to solve manually for actively managed flash because the characterization load is five to ten times that of single-set factory methods and each change of trim settings results in a new characterization run. The enabling technology for actively managed flash is machine learning.
NVMdurance Pathfinder uses machine learning and model building to automate register value discovery and testing. The machine-learning engine experiments with variations in register values and monitors the effects in both hardware and in simulation models. In this way, hundreds of millions of permutations can be trialed, with the results of the hardware trials continually improving the software simulations.
Unlike the task faced by the manufacturers, however, NVMdurance isn’t trying to produce a single set of registers that can guarantee every cell of every die ever made; rather, the task is to produce a set of registers that, when actively managed by NVMdurance Navigator, will permit the flash to last substantially longer.
In particular, because this is done through different stages of life, each stage is relatively independent; thus, a register set that works well early in life would not be expected to perform well at the end of life (as its stress values would be too low). This is like moving to a higher ECC level in LDPC (with associated tail latency and re-reads) when the current noise floor makes correction impossible, but instead of correcting the errors, we are preventing them from occurring in the first place.
Run-Time Considerations
Armed with close-to-optimal register values for particular times of life, and a way to navigate between them, it is now possible to achieve close to as many Program / Erase cycles as the NAND chip can deliver for a particular retention scenario (i.e., a determination of how much retention will be required). Similarly, by monitoring and identifying outlier blocks and ensuring that none are over-worked, no data is ever prematurely lost. In fact, the parameters can be further tuned when more information is available about the use case, for example, information about how various operation times can be varied.
There are costs associated with the active management of flash. The greater the degree of management, the higher the cost, both in terms of activity monitoring (e.g. bit error rate, timing etc.) and controller resources.
Tight integration between the SSD controller and the active management system is essential to make sure that the host is never kept waiting while the flash is being managed. This is achieved by only involving the active management system when absolutely necessary, and by ensuring that any operations it does are carried out in the background. Conservative examples put the extra cost incurred by NVMdurance Navigator at substantially less than 1% of the total processing conducted by the controller. Thus, the system is virtually unnoticeable, and simply sits in the background observing and learning.
Memory, in particular RAM, is often a scarce resource on SSDs, and often the amount of RAM required for a process can be estimated in terms of the number of blocks it will be operating on. A full-featured NVMdurance Navigator implementation requires 300 bytes per block (typically a block is of the order of 4-8 megabytes), so the RAM overhead is less than 0.01% of the size of the SSD. Many of the benefits can still be gained using a more cut-down version, which doesn’t require RAM to store data about the flash.
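The overhead figure can be checked with a short calculation. The 300 bytes per block and the 4 to 8 megabyte block sizes come from the text; treating a megabyte as 1024 x 1024 bytes and the 1 TiB drive capacity are assumptions for illustration.

```python
BYTES_PER_BLOCK = 300      # per-block RAM cost quoted in the text
MIB = 1024 * 1024          # assumption: "megabyte" taken as 2**20 bytes
TIB = 1024 ** 4


def ram_overhead_fraction(block_size_bytes: int) -> float:
    """RAM needed per byte of flash, as a fraction."""
    return BYTES_PER_BLOCK / block_size_bytes


def ram_bytes_for_ssd(capacity_bytes: int, block_size_bytes: int) -> int:
    """Total RAM needed to track every block of a drive."""
    return (capacity_bytes // block_size_bytes) * BYTES_PER_BLOCK


for size_mib in (4, 8):
    pct = 100 * ram_overhead_fraction(size_mib * MIB)
    print(size_mib, "MiB blocks:", round(pct, 5), "% overhead")

# Hypothetical 1 TiB drive with 4 MiB blocks.
print(ram_bytes_for_ssd(TIB, 4 * MIB) / MIB, "MiB of RAM")
```

With 4 MiB blocks the overhead is roughly 0.007%, consistent with the "less than 0.01%" figure, and the assumed 1 TiB drive would need 75 MiB of RAM for a full-featured implementation.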
Altera Corporation has recently released a reference design that enables active management of flash in a highly configurable and upgradeable FPGA-based system. The growing complexity of flash memory management algorithms has made controller designs complex, and has impacted the performance and the diversity of NAND devices that a single controller can support.
The NAND array has become the dominating factor in the cost of the drive. Increasing the life expectancy of the NAND array obviously reduces the total cost of ownership of the system. Performance predictability and throughput are key concerns for data-center managers. Throughput must be predictable not only from transaction to transaction, but throughout the life of the drive. With conventional controllers, however, performance drops off with age. There are many reasons for this: tail latency, block reclaim, large writes, and multi-cycle error correction, for example. Conventional controllers cannot overcome these problems.
Altera, NVMdurance and Mobiveil have come together to create a new kind of flash memory controller on a single FPGA SoC chip. This device will be field reconfigurable and upgradeable; not only can the NVMdurance software extend the lifetime of the flash, but the design of the system paves the way for a new class of field-reconfigurable SSDs. Not only can the flash simply be removed when it has worn out, truly commoditizing the part, but the controller itself can also be reconfigured to deal with different use cases. For example, a highly write-intensive application could be treated differently from an application focused more on reads, because the hardware can change around the data. It is this ability to modify the hardware so easily that makes this an ideal approach for actively managing flash.