Based on a presentation at Flash Memory Summit 2023 by John Cronise, Technical Business Development Manager, ATP Electronics
Solid-state drives (SSDs) based on NAND flash memory offer significant advantages over traditional hard disk drives (HDDs) in speed, durability, and energy efficiency when handling large data pools and artificial intelligence (AI) models.
Unlike HDDs, SSDs have no mechanical parts, so they offer lower latency and faster random data access. This makes them crucial for high-speed AI processing, especially with large datasets and intensive workloads.
NAND flash storage devices, specifically those that comply with the NVMe protocol over the PCIe interface and support multiple namespaces, Compute Express Link™ (CXL™), and NVMe over Fabrics (NVMe-oF), provide the flexibility and utility to build AI system architectures that are robust and reliable and that unify scattered data efficiently.
SSDs Drive High-Performance Edge Computing
As billions of devices connect to the Internet of Things (IoT), huge quantities of data are being collected. Sending all of this data to the cloud creates bottlenecks, which is pushing computing to the edge of the network, close to where the data is generated, so that processing is fast enough to deliver near-real-time, actionable insights.
Gartner predicts that by 2025, 75% of enterprise-generated data will be processed outside a traditional data center or cloud.
AI applications running on edge computing systems are not only data-heavy and compute-intensive but also often mission- and safety-critical, requiring storage that can optimize the performance and responsiveness of edge servers.
NAND flash storage devices, particularly industrial-focused NVMe drives, are increasingly being adopted in AI and edge computing systems. They offer faster read/write performance, lower data access latency, lower power consumption, and high endurance. They are well suited to analytical models at the edge and to IoT applications, where data is acquired, cached, and lightly processed so that it can be used to make real-time decisions before possibly being moved into larger data pools for further processing.
High performance and low latency increase efficiency in edge applications where compute power may be limited, such as industrial applications, surveillance, and autonomous vehicles.
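The acquire, cache, lightly process, decide, and forward flow described for edge and IoT applications can be sketched in a few lines of Python. This is a minimal illustration, not ATP code: the `EdgeNode` class, the rolling-mean "light processing", the alert threshold, and the batch size are all hypothetical, and the in-memory lists stand in for data that a real device would cache on its SSD and later upload.

```python
from collections import deque


class EdgeNode:
    """Hypothetical edge node: acquire -> cache -> lightly process -> decide -> forward."""

    def __init__(self, threshold: float, window: int = 4, batch_size: int = 4):
        self.threshold = threshold          # assumed alert threshold for local decisions
        self.window = deque(maxlen=window)  # most recent readings used for light processing
        self.pending = []                   # cached readings not yet forwarded (SSD cache on a real device)
        self.batch_size = batch_size        # readings per upload to the larger data pool
        self.uploaded = []                  # stands in for batches sent to the cloud/data center
        self.alerts = []                    # real-time decisions taken at the edge

    def ingest(self, reading: float) -> None:
        self.window.append(reading)                 # acquire
        self.pending.append(reading)                # cache locally
        avg = sum(self.window) / len(self.window)   # light processing: rolling mean
        if avg > self.threshold:                    # decide without a cloud round trip
            self.alerts.append(avg)
        if len(self.pending) >= self.batch_size:    # possibly move data to a larger pool
            self.uploaded.append(self.pending)
            self.pending = []


node = EdgeNode(threshold=50.0)
for r in [40, 45, 60, 70, 80, 30]:
    node.ingest(r)
print(node.alerts)     # [53.75, 63.75, 60.0]
print(node.uploaded)   # [[40, 45, 60, 70]]
```

Batching forwarded data rather than sending every reading is one way to reduce the cloud-bound traffic that creates the bottlenecks described earlier, while decisions that must be fast are still taken locally.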
Key Considerations for SSDs at the Edge
AI system design with SSDs revolves around maximizing data transfer rates and storage capacity while understanding drive endurance.
There are three key considerations:
- Performance Optimization
- Endurance Optimization
- Thermal Management
Performance & Endurance Optimization
SSDs have a finite lifespan. Each program/erase (P/E) cycle wears the NAND flash cells and gradually reduces their ability to hold charge, so the SSD degrades over time. This is why data management plays a crucial role in maximizing SSD lifespan and performance.
SSDs store data in NAND flash memory cells arranged in pages and blocks. Data is written (programmed) one page at a time, but the smallest unit that can be erased is a whole block containing many pages. Because a page cannot be overwritten in place, the NAND flash controller must write updated data to new pages and later relocate any still-valid pages before erasing a block. As a result, the amount of physical data written to the flash memory becomes larger (amplified) than the amount of logical data written by the host. This phenomenon is called write amplification, and the ratio of the two is called the Write Amplification Factor (WAF) or Write Amplification Index (WAI).
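To see why erasing by blocks inflates physical writes, the following toy simulation models a page-mapped flash translation layer with greedy garbage collection under uniform random single-page overwrites. It is a simplified sketch under stated assumptions: the `SimpleFTL` class, the block geometry, the workload, and the greedy victim policy are illustrative choices, not ATP firmware, and the model ignores TRIM, wear leveling, and caching. WAF is measured as NAND page writes divided by host page writes.

```python
import random


class SimpleFTL:
    """Toy page-mapped flash translation layer (illustrative assumption, not real firmware).

    A page can only be programmed when erased, and only whole blocks can be
    erased. Updating a logical page therefore writes a new physical page and
    invalidates the old one; garbage collection (GC) must copy still-valid
    pages out of a victim block before erasing it. Those copies are the extra
    NAND writes that make WAF exceed 1.
    """

    def __init__(self, num_blocks: int, pages_per_block: int, logical_pages: int):
        assert logical_pages < (num_blocks - 1) * pages_per_block, "needs spare area"
        self.ppb = pages_per_block
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]  # LPN per slot
        self.valid = [0] * num_blocks               # valid-page count per block
        self.where = {}                             # LPN -> (block, slot) of the valid copy
        self.free = list(range(1, num_blocks))      # erased blocks
        self.active, self.ptr = 0, 0                # block being programmed, next free slot
        self.host_writes = self.nand_writes = 0

    def _program(self, lpn: int) -> None:
        self.blocks[self.active][self.ptr] = lpn
        self.where[lpn] = (self.active, self.ptr)
        self.valid[self.active] += 1
        self.ptr += 1
        self.nand_writes += 1

    def _open_block(self) -> None:
        if len(self.free) > 1:                      # erased blocks available: no GC yet
            self.active, self.ptr = self.free.pop(), 0
            return
        # Greedy GC: pick the full block with the fewest valid pages.
        closed = [b for b in range(len(self.blocks)) if b not in self.free]
        victim = min(closed, key=lambda b: self.valid[b])
        self.active, self.ptr = self.free.pop(), 0  # last erased block receives the copies
        for lpn in self.blocks[victim]:
            if lpn is not None:
                self._program(lpn)                  # relocation: NAND write, no host write
        self.blocks[victim] = [None] * self.ppb     # erase the whole block at once
        self.valid[victim] = 0
        self.free.append(victim)

    def write(self, lpn: int) -> None:
        if lpn in self.where:                       # out-of-place update: invalidate old copy
            b, p = self.where[lpn]
            self.blocks[b][p] = None
            self.valid[b] -= 1
        if self.ptr == self.ppb:                    # active block is full
            self._open_block()
        self._program(lpn)
        self.host_writes += 1


def measure_waf(op: float, num_blocks: int = 64, ppb: int = 32,
                rounds: int = 10, seed: int = 1) -> float:
    """Fill the drive once, then apply uniform random overwrites and return WAF."""
    logical = int(num_blocks * ppb / (1 + op))      # over-provisioning = spare / user capacity
    ftl = SimpleFTL(num_blocks, ppb, logical)
    for lpn in range(logical):                      # precondition: sequential fill (WAF = 1)
        ftl.write(lpn)
    ftl.host_writes = ftl.nand_writes = 0           # measure the random-write phase only
    rng = random.Random(seed)
    for _ in range(rounds * logical):
        ftl.write(rng.randrange(logical))
    return ftl.nand_writes / ftl.host_writes


for op in (0.07, 0.28):
    print(f"over-provisioning {op:.0%}: WAF ~ {measure_waf(op):.2f}")
```

Under these assumptions the configuration with more spare area typically shows a noticeably lower WAF, because greedy garbage collection then finds victim blocks with fewer valid pages to copy.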
This behavior requires a different approach to optimizing writes in order to improve the longevity and performance of the drive. In edge and IoT applications, file types and file sizes may be relatively well defined, but the workload may be elastic. It is critical to optimize how data is written as much as possible to avoid unnecessary write amplification and ensure long-term drive endurance.
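The link between WAF and endurance can be made concrete with a common back-of-the-envelope estimate: the total host data a drive can absorb is roughly its NAND capacity multiplied by the rated program/erase (P/E) cycles, divided by WAF. The sketch below assumes ideal wear leveling (every block wears evenly); the 1 TB capacity, 3,000 P/E cycles, and 0.5 TB/day workload are hypothetical values chosen for illustration, not ATP specifications.

```python
def estimated_tbw(nand_capacity_tb: float, rated_pe_cycles: int, waf: float) -> float:
    """Approximate terabytes the host can write before the rated P/E cycles are used up.

    Assumes ideal wear leveling, so every block reaches its rated cycle count together.
    """
    if waf <= 0:
        raise ValueError("WAF must be positive")
    return nand_capacity_tb * rated_pe_cycles / waf


def years_of_service(tbw: float, host_tb_per_day: float) -> float:
    """Years until the estimated TBW is consumed at a constant daily host write volume."""
    return tbw / (host_tb_per_day * 365)


# Hypothetical example: same drive and workload, two different write patterns.
for waf in (1.5, 4.0):
    tbw = estimated_tbw(nand_capacity_tb=1.0, rated_pe_cycles=3000, waf=waf)
    print(f"WAF {waf}: ~{tbw:.0f} TBW, ~{years_of_service(tbw, 0.5):.1f} years at 0.5 TB/day")
```

Under these assumptions, lowering WAF from 4.0 to 1.5 raises the estimate from about 750 TBW (roughly 4.1 years) to about 2,000 TBW (roughly 11.0 years), which is why avoiding unnecessary write amplification matters so much for remote edge deployments.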
At ATP, we pay close attention to write amplification and to understanding SSD write/erase behavior.
- Regular monitoring ensures that write amplification challenges are identified and can be addressed.
- Small edge and IoT applications, potentially in remote locations, need SSDs that provide long-term endurance and reliability.
- TRIM function support, over-provisioning, and wear leveling are some of the typical ways of managing SSD life expectancy and extending endurance.
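Regular monitoring of write amplification can be as simple as comparing periodic counter snapshots. In the sketch below, host writes come from the "Data Units Written" field of the standard NVMe SMART/Health Information log, which the NVMe specification reports in thousands of 512-byte units. NAND-side writes are not part of that standard log; the `nand_bytes` field is a hypothetical stand-in for a vendor-specific counter, and the alert limit of 3.0 is an arbitrary example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# NVMe SMART/Health log: one "Data Unit" = 1,000 x 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


@dataclass
class Snapshot:
    data_units_written: int   # NVMe SMART/Health "Data Units Written" (host side)
    nand_bytes: int           # hypothetical vendor-specific NAND bytes written


def interval_waf(prev: Snapshot, cur: Snapshot) -> Optional[float]:
    """WAF over one monitoring interval, or None if the host wrote nothing."""
    host = (cur.data_units_written - prev.data_units_written) * BYTES_PER_DATA_UNIT
    if host <= 0:
        return None
    return (cur.nand_bytes - prev.nand_bytes) / host


def waf_alerts(history: List[Snapshot], limit: float = 3.0) -> List[Tuple[int, float]]:
    """Return (interval index, WAF) for every interval whose WAF exceeds the limit."""
    alerts = []
    for i in range(1, len(history)):
        waf = interval_waf(history[i - 1], history[i])
        if waf is not None and waf > limit:
            alerts.append((i, round(waf, 2)))
    return alerts


history = [
    Snapshot(0, 0),
    Snapshot(1000, 768_000_000),      # 512 MB from host, 768 MB to NAND -> WAF 1.5
    Snapshot(2000, 2_816_000_000),    # 512 MB from host, 2,048 MB to NAND -> WAF 4.0
    Snapshot(2000, 2_816_000_000),    # idle interval: no host writes
]
print(waf_alerts(history))   # [(2, 4.0)]
```

A rising interval WAF under an unchanged workload can be an early hint to revisit TRIM support, over-provisioning, or how the application writes its data.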
Thermal Management
Overheating is a major challenge that can impact the performance of edge/IoT AI applications. High-performance, high-speed NVMe drives are particularly susceptible to thermal challenges. Common causes of overheating include stacking multiple dies per integrated circuit (IC); heat from the controller and other heat-intensive components packed into limited printed circuit board (PCB) space, especially in double-sided designs; and intense workloads.
Given these challenges, accurately modeling power dissipation is the first step in creating a thermal management strategy. Below we list important aspects of ATP’s thermal management strategy.
Determining the customer’s system/mechanical/performance criteria.
We assess user applications and system specifications to get an overview of any mechanical limitations and other factors that may cause overheating. Where space permits, a large heatsink is a simple solution.
We also consider operating and ambient temperatures, airflow within and outside the system, mechanical design, other heat-generating components that may contribute to the thermal challenges, as well as workload and performance enhancements.
We recommend interfacing the SSD with as much mass as possible. This includes using thermal interface material (TIM) to conduct excess heat to the enclosure and the environment. TIM acts as a “gap filler” that aids thermal conduction between the heatsink and the system enclosure.
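Why a gap filler matters can be shown with a simple steady-state thermal resistance network, where conduction resistance is R = L / (k·A) and the drive temperature is the ambient temperature plus power times the total resistance to ambient. This is a lumped, one-dimensional sketch with hypothetical values: the 6 W load, 40 °C ambient, 1 mm gap, the convection and enclosure resistances, and the TIM conductivity of 3 W/(m·K) are assumptions for illustration. Only the M.2 2280 footprint (22 mm x 80 mm) and the low conductivity of still air (about 0.026 W/(m·K)) are physical reference values.

```python
def conduction_resistance(thickness_m: float, conductivity_w_per_mk: float, area_m2: float) -> float:
    """One-dimensional conduction resistance R = L / (k * A), in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def parallel(*resistances: float) -> float:
    """Equivalent resistance of heat paths in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def drive_temperature_c(power_w: float, ambient_c: float, r_gap: float,
                        r_convection: float = 12.0, r_enclosure: float = 2.0) -> float:
    """Steady-state drive temperature with two heat paths to ambient (hypothetical values):
    convection from the module surface, and conduction through the gap into the
    enclosure, then from the enclosure to ambient."""
    r_total = parallel(r_convection, r_gap + r_enclosure)
    return ambient_c + power_w * r_total


AREA_M2 = 0.022 * 0.080   # M.2 2280 footprint: 22 mm x 80 mm
GAP_M = 0.001             # assumed 1 mm gap between drive and enclosure

r_air = conduction_resistance(GAP_M, 0.026, AREA_M2)   # still air in the gap
r_tim = conduction_resistance(GAP_M, 3.0, AREA_M2)     # assumed gap-filler TIM

for label, r_gap in (("air gap", r_air), ("TIM gap filler", r_tim)):
    print(f"{label}: ~{drive_temperature_c(6.0, 40.0, r_gap):.0f} °C")
```

Replacing the air gap with a conductive filler turns the enclosure into a usable second heat path, which is the same reasoning behind coupling the SSD to as much mass as possible.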
Simulation Tests from the Component Level.
ATP uses Cadence Simulation Software to run component-level thermal simulations. These are pure hardware simulations based on full-speed operation (the worst-case scenario).
These simulation models have led to expanded passive thermal management solutions for industrial NVMe drives, reducing temperatures by nearly 50% compared with a bare module.
At ATP, we have developed three unique passive solutions, including an ultra-thin graphene and copper foil solution for applications with weight budgets, designed to spread the heat load over a larger surface.
Testing the adhesive strength of the copper foil heatsink ensures reliability and excellent retention of the heatsink to the SSD. To evaluate the reliability of our copper foil solution, we perform resistance testing at high/low temperature on the adhesive layer to make sure that there is no deformation.
The following graphs compare performance with an 8 mm heatsink solution against a bare PCB.
ATP’s 8 mm heatsink, complemented by airflow, can dissipate heat and prevent the drive from throttling.
With the 8 mm heatsink, the maximum composite temperature of the NVMe SSD is reduced, and, with an optimized firmware (FW) algorithm, performance remains steady, as the Performance graph shows.
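The throttling behavior behind these graphs can be approximated with a first-order (lumped RC) thermal model, C·dT/dt = P - (T - T_ambient)/R, plus a simple firmware rule that halves power above a throttle temperature and restores it once the drive cools past a lower resume point. All parameters below are hypothetical, chosen only to contrast a bare module with a heatsinked, airflow-cooled one; they are not ATP measurements or ATP's firmware algorithm.

```python
def simulate(r_th: float, c_th: float, seconds: float = 1200.0, dt: float = 0.5,
             ambient_c: float = 40.0, p_full_w: float = 8.0, p_throttled_w: float = 4.0,
             throttle_c: float = 80.0, resume_c: float = 75.0):
    """Euler integration of C*dT/dt = P - (T - T_amb)/R with hysteresis throttling.

    Returns (peak temperature, fraction of time spent at full speed).
    """
    temp, peak, throttled = ambient_c, ambient_c, False
    steps = int(seconds / dt)
    full_speed_steps = 0
    for _ in range(steps):
        if not throttled and temp >= throttle_c:
            throttled = True                  # firmware cuts power to protect the drive
        elif throttled and temp <= resume_c:
            throttled = False                 # cooled enough: restore full performance
        power = p_throttled_w if throttled else p_full_w
        temp += dt * (power - (temp - ambient_c) / r_th) / c_th
        peak = max(peak, temp)
        if not throttled:
            full_speed_steps += 1
    return peak, full_speed_steps / steps


# Hypothetical thermal resistance (K/W) and capacitance (J/K) for each setup.
for label, r_th, c_th in (("bare PCB", 8.0, 10.0), ("heatsink + airflow", 4.0, 40.0)):
    peak, full = simulate(r_th, c_th)
    print(f"{label}: peak ~{peak:.0f} °C, full speed {full:.0%} of the time")
```

In this model the heatsink lowers the full-power steady-state temperature (ambient + P·R) below the throttle point, so the drive never throttles, while the bare module keeps cycling between full and reduced speed.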
Conclusion
AI applications and edge computing are data-intensive. NVMe drives provide greater bandwidth, faster read/write performance, lower data access latency, lower power consumption, and high endurance for AI and machine learning systems, making them well suited to analytical models at the edge and to Internet of Things (IoT) applications.
With proper system design, by understanding NAND flash memory behavior, and through regular performance checks, these drives can unlock the full potential of AI systems.
ATP addresses performance, endurance, and thermal challenges with unique, specialized solutions and technologies for industrial NVMe flash storage at the edge, so that these drives can effectively handle AI’s demanding workloads and operating environments.
For more information on ATP NVMe flash storage solutions optimized for AI and the edge, visit the ATP website or contact an ATP Representative.