Solid state drives (SSDs) are faster than mechanical hard disk drives (HDDs) and well suited to rough and rugged applications, but one thing that deters those considering the switch is that SSDs can be written to only a limited number of times. At the end of an SSD's usable life, data may be corrupted or the device may become unusable unless measures are taken proactively to manage its life span.
How Flash Memory Wears Out
An SSD is made up of NAND flash memory cells. A cell is essentially a metal-oxide semiconductor (MOS) transistor with a floating gate that stores data persistently; the data stays there even when the SSD has no power. Each time data is written (programming), electrons are trapped in the floating gate. When data is removed (erasing), the electrons are drawn back out. Electrons move in and out through the cell's tunnel oxide. Each program/erase (P/E) operation is one cycle, and every cycle of electrons passing through wears the tunnel oxide a little more. This is how a flash memory cell wears out.
In this article, we will look at some factors affecting SSD life expectancy and how these can be addressed to manage SSD endurance.
Garbage Collection and Write Amplification
Unlike hard disk drives, SSDs have no mechanical parts and therefore read, write, and erase data differently. Flash cells are grouped into pages, and several pages make up a block. Data is written at the page level, but erased at the block level.
If the host wants to write new data to a used block, pages containing valid data have to be copied to an empty block, and the previous block has to be completely erased in order for that block to be usable again. This process is called garbage collection. It takes several steps:
- All pages with valid data are copied to an empty block.
- The flash controller updates the logical block address (LBA) mapping to point to the new physical location.
- Pages with stale data marked for deletion remain in the old block. The whole block is then erased and added to the free block pool.
- New data can now be written on an available empty block.
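The steps above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular controller firmware is written: the `Block` class, the four-page block size, and the `garbage_collect` function are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # assumed for illustration; real blocks hold far more pages


@dataclass
class Block:
    # Each page slot is None (empty), ("valid", lba) or ("stale", lba).
    pages: list = field(default_factory=lambda: [None] * PAGES_PER_BLOCK)
    erase_count: int = 0


def garbage_collect(victim, free_pool, mapping, blocks):
    """Reclaim block index `victim`; return how many pages had to be copied."""
    target = free_pool.pop(0)  # take an empty block from the free pool
    copied = 0
    for state in blocks[victim].pages:
        if state is not None and state[0] == "valid":
            lba = state[1]
            blocks[target].pages[copied] = ("valid", lba)  # copy the valid page
            mapping[lba] = (target, copied)                # remap the LBA
            copied += 1
    # Stale pages are not copied; they disappear when the block is erased.
    blocks[victim].pages = [None] * PAGES_PER_BLOCK
    blocks[victim].erase_count += 1
    free_pool.append(victim)  # the erased block rejoins the free pool
    return copied


# Block 0 holds two valid and two stale pages; block 1 is empty.
blocks = [Block(), Block()]
blocks[0].pages = [("valid", 10), ("stale", 11), ("valid", 12), ("stale", 13)]
mapping = {10: (0, 0), 12: (0, 2)}
free_pool = [1]

copied = garbage_collect(0, free_pool, mapping, blocks)
```

In this run, two pages are rewritten to flash even though the host requested neither write, which is the extra work behind write amplification.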
Because of the extra processes that the controller performs, a phenomenon called write amplification occurs, where the actual amount of physical data written to flash memory becomes larger (amplified) compared with the amount of logical data written by the host. The numerical value, expressed as write amplification index (WAI) or write amplification factor (WAF), is calculated as follows:
Data written to flash memory / Data from the host = WAI/WAF
Ideally, if the host wants to write 1 MB of data, the SSD should write 1 MB; in this case, WAI is 1. This rarely happens due to the nature of flash memory, and the SSD ends up writing more data than originally intended. Every extra write consumes P/E cycles, so moving data repeatedly causes the tunnel oxide insulator layer to degrade over time. As it degrades, erasing slows down. When a block fails to erase, a spare block takes its place. Eventually the spares run out, and the SSD fails.
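As a worked example of the formula above, the following sketch computes WAI/WAF for an ideal write and for an amplified one. The function name and the 3 MB figure are made up for illustration; real drives report these counters through vendor-specific attributes.

```python
def write_amplification(flash_bytes_written, host_bytes_written):
    """WAI/WAF = data written to flash memory / data from the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


MB = 1024 * 1024

# Ideal case: the host writes 1 MB and the SSD writes exactly 1 MB.
ideal = write_amplification(1 * MB, 1 * MB)

# Hypothetical case: garbage collection causes 3 MB of physical writes
# for the same 1 MB of host data.
amplified = write_amplification(3 * MB, 1 * MB)
```

Here `ideal` is 1.0 and `amplified` is 3.0, meaning the flash absorbs three times the wear that the host workload alone would suggest.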
- TRIM Command. On its own, an SSD typically cannot tell which pages hold data that the file system has already deleted, so garbage collection keeps copying that data as if it were still valid. The TRIM command allows the host operating system to inform the SSD about the location of stale data (marked for deletion). The controller can then treat those pages as invalid and skip them during garbage collection instead of moving them to a new block, thereby reducing WAI and increasing SSD endurance. ATP SSDs support the TRIM function to ensure optimum performance and better endurance. Additionally, ATP SSDs perform background garbage collection, executing the process from time to time even without new write commands to clean up the drive without slowing down write performance.
- Over-Provisioning. Over-provisioned (OP) space is space that cannot be used or accessed by the user. It is dedicated to controller functions such as garbage collection and TRIM. WAI generally falls as the OP percentage rises; as such, higher OP lowers WAI, decreases drive degradation, and extends the SSD life span. ATP's Dynamic OP solution gives users the freedom to configure the OP according to the actual workload. Using simple software, OP can be set to 7%, 14%, 28%, 50% or more in order to optimize performance, endurance and cost. Down the road, enterprises benefit from more efficient performance and less frequent disk replacements.
- Wear Leveling. Wear leveling involves distributing P/E cycles evenly across the available blocks to avoid overusing some of them. Frequently writing to or erasing the same blocks leads to more bad blocks, eventually wearing out the SSD. The ATP Advanced Wear Leveling technology combines both dynamic and static wear leveling techniques. A RAM register on the flash controller records the erase count of all blocks to identify which ones are frequently or seldom used. Data in frequently used blocks is swapped to the seldom-used blocks to even out the erase count and effectively wear level the entire SSD.
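The erase-count bookkeeping behind wear leveling can be illustrated with a small sketch. It is a generic simplification, not ATP's Advanced Wear Leveling implementation: the function names, the erase counts, and the threshold of 100 are assumptions made for the example.

```python
def pick_block_for_write(erase_counts, free_blocks):
    """Dynamic wear leveling: write to the free block erased the fewest times."""
    return min(free_blocks, key=lambda block: erase_counts[block])


def needs_static_leveling(erase_counts, threshold):
    """Static wear leveling trigger: the gap between the most-worn and
    least-worn blocks has grown past the (assumed) threshold."""
    return max(erase_counts.values()) - min(erase_counts.values()) > threshold


# Hypothetical erase counts per block index.
erase_counts = {0: 120, 1: 15, 2: 118, 3: 40}
free_blocks = [0, 2, 3]  # block 1 holds seldom-changed data and is not free

next_block = pick_block_for_write(erase_counts, free_blocks)
swap_needed = needs_static_leveling(erase_counts, threshold=100)
```

In this example `next_block` is 3, the least-worn free block, and `swap_needed` is true because block 1 has been erased far less often than block 0, so its contents are a candidate for relocation.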
Uncorrectable/Unrecoverable Bit Error Rate (UBER)
UBER is a measure of the data corruption rate: the number of bit errors that remain uncorrected after error correction, relative to the total number of bits read. Bit errors increase as P/E cycles increase. Error-correcting code (ECC) mechanisms in the controller typically detect and fix these errors automatically, but when the number of error bits exceeds what the ECC can correct, the data can no longer be recovered and the SSD is bound to fail.
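The following sketch shows the arithmetic, assuming UBER is taken as uncorrectable bit errors divided by total bits read. The function names and the 40-bit ECC limit are hypothetical; actual correction capability depends on the controller and the ECC scheme.

```python
def uber(uncorrectable_bit_errors, bits_read):
    """Uncorrectable bit errors per bit read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_bit_errors / bits_read


def is_correctable(error_bits, ecc_limit):
    """True while the error bits in a codeword stay within ECC capability."""
    return error_bits <= ecc_limit


# One uncorrectable bit error across 10**16 bits read.
rate = uber(1, 10**16)

# Assumed ECC capability of 40 bits per codeword, for illustration only.
within_limit = is_correctable(12, ecc_limit=40)
beyond_limit = is_correctable(41, ecc_limit=40)
```

A lower `rate` is better; once `is_correctable` returns false for a codeword, that error counts toward the numerator.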
Solutions
ATP AutoRefresh and Dynamic Data Refresh technologies check for both error bits and read counts in frequently read as well as seldom-read areas.
- AutoRefresh Technology improves the data integrity of read-only areas by monitoring the error bit level and read counts in every read operation. It detects when the read count is about to exceed the threshold and, before the limit is reached, copies the data in the affected block to a healthy block, thus preventing the controller from reading blocks with too many error bits and averting uncorrectable data damage.
- Dynamic Data Refresh Technology reduces the risk of read disturb and sustains data integrity in seldom-accessed areas. Read disturb happens when frequent reading of a cell causes adjacent cells to change or be programmed. Dynamic Data Refresh runs automatically in the background, sequentially scanning the user area flag record while the SSD is free from host commands, thus keeping data safely stored without affecting read/write operations.
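The idea of refreshing a block before a limit is reached can be sketched as a simple threshold check. This is not ATP's firmware logic; the function name, the limits, and the 90% safety margin are all assumptions made to illustrate the principle.

```python
def should_refresh(read_count, error_bits, read_limit, error_bit_limit, margin=0.9):
    """Return True when a block is close enough to either limit that its
    data should be copied to a healthy block before errors become uncorrectable."""
    near_read_limit = read_count >= margin * read_limit
    near_error_limit = error_bits >= margin * error_bit_limit
    return near_read_limit or near_error_limit


# Hypothetical limits: 100,000 reads per block and 40 correctable error bits.
READ_LIMIT = 100_000
ERROR_BIT_LIMIT = 40

heavily_read = should_refresh(95_000, 3, READ_LIMIT, ERROR_BIT_LIMIT)
healthy = should_refresh(10_000, 3, READ_LIMIT, ERROR_BIT_LIMIT)
error_prone = should_refresh(10_000, 38, READ_LIMIT, ERROR_BIT_LIMIT)
```

Checking both counters means a block is refreshed whether it is at risk from heavy reading or from accumulated error bits in rarely read data.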
Imminent SSD End of Life
Flash wear is a reality, and every SSD is bound to fail at some point. Being unprepared for this unavoidable event can lead to catastrophic data loss, expensive equipment replacements, and more. By effectively monitoring and managing the SSD life span, users can take contingency measures and make the necessary replacements before the SSD wears out.
Solutions
- ATP SD Life Monitor / S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) provides a user-friendly interface for monitoring various indicators of drive reliability and other attributes. Armed with this information, users can plan well ahead and replace SSDs before they wear out, saving data and precious financial resources.
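As a rough illustration of how monitored indicators can drive replacement planning, the sketch below estimates remaining life from an average erase count and a rated P/E cycle figure. It does not reflect how the ATP tool or any S.M.A.R.T. attribute is actually computed; the function names, the 3,000-cycle rating, and the 10% warning level are assumptions.

```python
def remaining_life_percent(average_erase_count, rated_pe_cycles):
    """Estimate remaining endurance as a percentage of rated P/E cycles."""
    if rated_pe_cycles <= 0:
        raise ValueError("rated_pe_cycles must be positive")
    used = min(average_erase_count / rated_pe_cycles, 1.0)
    return round((1.0 - used) * 100, 1)


def needs_replacement(average_erase_count, rated_pe_cycles, warn_below=10.0):
    """Flag the drive once estimated remaining life drops under the warning level."""
    return remaining_life_percent(average_erase_count, rated_pe_cycles) < warn_below


# Hypothetical drive rated for 3,000 P/E cycles.
life_now = remaining_life_percent(2_400, 3_000)
replace_now = needs_replacement(2_400, 3_000)
replace_later = needs_replacement(2_850, 3_000)
```

With these numbers the drive has an estimated 20% of its life left at 2,400 average erases, and it is flagged for replacement by the time the average reaches 2,850.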
Conclusion
Despite the reality of flash wear-out, technological advancements in firmware and design enable current flash storage products to last and perform reliably for longer periods of time. Through careful management using available tools and solutions, your SSD will serve you well for many years. For more information about ATP's high-endurance SSDs, visit the ATP website or contact an ATP Representative or Distributor in your area.