How to Sustain Performance for NVMe Drives Under Thermal Stress Conditions

SSDs | 2022-03-29

NVMe drives are major disruptors in flash storage technology, offering unprecedented speed and performance in either the ultra-slim M.2 or the U.2 form factor. Where Serial ATA (SATA) transfer rates are capped at 6 Gb/s, NVMe drives use the PCI Express (PCIe) interface, which connects directly to the CPU, resulting in 4-6X the speed of SATA in random workloads.

This leap in speed reduces latency, enables faster access, and delivers higher input/output operations per second (IOPS) than interfaces designed for mechanical storage devices.

With the increase in speed came overheating issues, which are exacerbated because NVMe drives are typically installed in compact embedded systems that are often fanless or have minimal airflow for heat dissipation. Overheating adversely affects an NVMe drive’s data integrity, endurance, and retention capabilities. The drive degrades quickly as the tunnel oxide weakens, causing electrons to leak out. This, in turn, results in more bit errors and more uncorrectable errors.
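To give a rough sense of how strongly temperature can speed up charge loss, the sketch below uses the Arrhenius model, a common way to express temperature acceleration of thermally activated processes. It is not taken from ATP's material: the function name and the 1.1 eV activation energy are assumptions chosen for illustration, and real NAND flash behavior varies by technology and wear.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_use_c, t_stress_c, activation_energy_ev=1.1):
    """Return how many times faster a thermally activated process runs
    at t_stress_c than at t_use_c (both in degrees Celsius).

    activation_energy_ev is an assumed, illustrative value, not an ATP figure.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_use_k - 1.0 / t_stress_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare a 40°C baseline with the 85°C ambient from the Data Logger scenario.
    factor = arrhenius_acceleration(40, 85)
    print(f"Acceleration factor from 40°C to 85°C: {factor:.0f}x")
```

Under these assumptions the factor grows exponentially with temperature, which is why a few tens of degrees can matter so much for retention.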

This article explores the thermal management challenges for NVMe drives and presents ATP’s Customizable Thermal Management Solution based on different application needs, system mechanical design, and other important considerations.

Applications Requiring Thermal Management

 

Due to its fast transfer rates, NVMe storage is gaining adoption in applications where microseconds count, such as those involving real-time customer interactions, time-critical data analytics, and more. In many of these scenarios, the drives are installed in enclosures with little or no airflow and are constantly subjected to intense workloads under harsh conditions. Multiple die stacking per integrated circuit (IC) and densely packed components in the limited printed circuit board (PCB) space, especially in double-sided designs, also contribute to the overheating issue.

Thermal management is therefore critical to sustain performance stability during operation at high temperatures.

                                                   

Applications requiring thermal management

 

The following table shows possible scenarios with thermal and airflow conditions that need to be addressed.

 

Application | Ambient Temperature (°C) | Air Flow (LFM)* | Customer Criteria
Fanless Server System | ~60 | No air flow | Operate under 60°C without reliability concerns
Fanless Box PC | ~70 | No air flow | Stay operational without shutting down; low performance requirement
Data Logger | ~85 | 1200 ~ 1800 | Sustained Read/Write performance at Ta=85°C
IIoT Server | ~55 | Min. 2000 | Fan actions triggered by composite temperature must stay within a certain range

*LFM: Linear Feet per Minute
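For readers more used to metric units, the airflow figures above can be converted to meters per second. The short sketch below does this conversion; the function name and the way "No Air Flow" is represented as 0 LFM are choices made here for illustration.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def lfm_to_m_per_s(lfm):
    """Convert airflow velocity from linear feet per minute to meters per second."""
    return lfm * FEET_TO_METERS / 60.0


# Airflow figures from the scenarios above; "No Air Flow" is represented as 0 LFM.
scenarios = {
    "Fanless Server System": 0,
    "Data Logger (lower bound)": 1200,
    "Data Logger (upper bound)": 1800,
    "IIoT Server (minimum)": 2000,
}

if __name__ == "__main__":
    for name, lfm in scenarios.items():
        print(f"{name}: {lfm} LFM = {lfm_to_m_per_s(lfm):.2f} m/s")
```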

 

ATP’s Customizable Thermal Management Solution

 

ATP recognizes that thermal challenges are unique to different use cases and scenarios; hence, a “one-size-fits-all” approach may not be the most suitable. To meet a customer’s specific thermal requirements, ATP offers a holistic and customizable solution that combines firmware and hardware technologies.

The process hinges on extensive collaboration with customers and is summarized in these four steps:

                                                            

STEP 1 : ASSESSMENT 
 

ATP works with system developers to overcome the challenges unique to the specific case. By understanding the performance criteria, user application, and system specifications (including, but not limited to, temperature, workload, airflow, and mechanical design), ATP can customize an NVMe solution for the customer.

           

An important part of assessing heat dissipation is taking a close look at the mechanical design within the system. How much space is available for heatsink solutions? How can we make sure that no mechanical interference happens among all the components of the system printed circuit board?

                                    

The system’s mechanical design may not have accounted for a heatsink from the beginning. This is why it is important to examine the available space around the NVMe SSD as well as any possible mechanical interference.

 

STEP 2 : SIMULATION
Influence of Air Inlet/Outlet and SSD Location

Since air flow may vary depending on the fan and drive location, simulation tests are also performed using a proprietary ATP-built mini chamber to recreate the thermal environment in the customer’s profile as closely as possible. Air flow capability, SSD location, and the performance required of the SSD given its distance from the air inlet are among the factors considered. The necessary adjustments are then made to arrive at the solution that best meets the requirements.

                                                                                         

The proprietary ATP-built mini chamber (Generation 2) is used to simulate and adjust thermal environments based on the customer’s profile.

A pure hardware simulation test based on full-speed operation, which is the worst-case scenario, is conducted using the Cadence Simulation system. This gives hardware engineers insight into the heat distribution in each PCB layer, as well as the potential risk of heat accumulating in particular areas. Adjustments can then be made to the circuit layout, wire thickness, quantity and position of through-holes, and other parameters.

                                                                      

An example heat distribution simulation result for a PCB’s top layer
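To illustrate what a heat distribution map represents, here is a toy model, not the Cadence simulation described above. It relaxes the 2D steady-state heat equation on a small grid with a simple Jacobi iteration. The grid size, the dimensionless source values, and the function names are all hypothetical, and a real PCB simulation accounts for layer stack-up, copper distribution, and airflow that this sketch ignores.

```python
def steady_state_temperature(source, ambient=0.0, iterations=500):
    """Jacobi relaxation of the 2D steady-state heat equation on a grid.

    source[r][c] is a dimensionless heating term for each cell; border cells
    are held at `ambient`. Units and values are illustrative only.
    """
    rows, cols = len(source), len(source[0])
    temp = [[ambient] * cols for _ in range(rows)]
    for _ in range(iterations):
        new = [row[:] for row in temp]
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                neighbours = (
                    temp[r - 1][c] + temp[r + 1][c] + temp[r][c - 1] + temp[r][c + 1]
                )
                new[r][c] = (neighbours + source[r][c]) / 4.0
        temp = new
    return temp


def hottest_cell(temp):
    """Return the (row, col) index of the warmest cell."""
    return max(
        ((r, c) for r in range(len(temp)) for c in range(len(temp[0]))),
        key=lambda rc: temp[rc[0]][rc[1]],
    )


if __name__ == "__main__":
    # A 7x7 board with one hypothetical heat source (for example, a controller) in the middle.
    grid = [[0.0] * 7 for _ in range(7)]
    grid[3][3] = 1.0
    result = steady_state_temperature(grid)
    print("Hottest cell:", hottest_cell(result))
```

Even this simple model shows heat concentrating around the source and falling off toward the edges, which is the kind of hot spot a layout adjustment aims to reduce.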

STEP 3 : CUSTOMIZATION
Thermal Management Consideration: Which Heatsink Fits the Mechanical Design?

ATP’s customized thermal management solution consists of both firmware and hardware components:

Adaptive Thermal Control through the ATP Dynamic Thermal Throttling Mechanism

This strikes a balance between performance and temperature instead of reducing performance dramatically. Temperature sensors continuously monitor the device temperature. As the temperature rises, the firmware lowers performance gradually so that the temperature comes back within range.
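The difference between gradual and abrupt throttling can be sketched as two policies that map a temperature reading to an allowed performance level. This is only an illustration of the idea, not ATP's firmware: the function names, the temperature thresholds, and the 20% performance floor are all hypothetical.

```python
def gradual_throttle(temp_c, start_c=70.0, limit_c=85.0, floor=0.2):
    """Return the allowed performance fraction (floor..1.0) for a temperature.

    Full speed below start_c, then a linear ramp down to `floor` at limit_c.
    All thresholds are hypothetical, not ATP firmware values.
    """
    if temp_c <= start_c:
        return 1.0
    if temp_c >= limit_c:
        return floor
    span = (temp_c - start_c) / (limit_c - start_c)
    return 1.0 - span * (1.0 - floor)


def abrupt_throttle(temp_c, limit_c=80.0, floor=0.2):
    """A single-threshold policy for comparison: full speed or the floor."""
    return 1.0 if temp_c < limit_c else floor


if __name__ == "__main__":
    for t in (65, 72, 78, 84, 90):
        print(f"{t}°C  gradual={gradual_throttle(t):.2f}  abrupt={abrupt_throttle(t):.2f}")
```

With the gradual policy, performance steps down in proportion to how far the temperature has risen, whereas the single-threshold policy drops straight to the floor once the limit is crossed.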

H/W Heatsink, Thermal Pad Solutions

For NVMe M.2 2280 modules, a variety of HW heatsink options (materials, dimensions, types) are available to match the mechanical constraints of each system design. For high-density NVMe U.2 SSDs, a thermal pad covering the controller and NAND flash area dissipates heat through the U.2 aluminum housing.

 

NVMe M.2 2280 | High Density NVMe U.2 Thermal SSD
Recommended for applications requiring stable/sustained Read/Write performance at high temperatures
Various heatsink solutions (copper foil / 4 mm or 8 mm fin-type options) | Thermal pad covering the controller and NAND flash area to dissipate heat through the U.2 aluminum housing
Adaptive Thermal Control through Dynamic Thermal Throttling | Advanced Thermal Control (ATC) Technology ensuring data reliability
Power Loss Protection Design
LDPC (Low Density Parity Check) ECC algorithm
RAID Engine Support
End-to-End Data Path Protection

HW thermal management options for NVMe M.2 2280 modules and U.2 SSD

 

Garbage Collection F/W Tuning

A periodic background refresh offsets the significant performance drop caused by the long garbage collection process.

 

STEP 4: OPTIMIZATION

An optimized solution combines both HW and FW to meet the customer’s needs. As the graph below shows, performance can drop sharply when standard thermal throttling is used. ATP NVMe SSDs with the customized thermal management solution, on the other hand, deliver higher sustained write performance.

                                                               

The comparison graph shows that NVMe SSDs with ATP Thermal Management Solutions, which combine hardware and firmware, deliver better sustained write performance and avoid the drastic performance drops seen in SSDs that use standard heatsinks and a standard thermal throttling mechanism.
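One way to put numbers on "sustained performance" and "drastic drops" when reviewing a throughput log is to summarize it with a mean, a minimum, and the largest fall between consecutive samples. The sketch below is a hypothetical helper for that purpose, not an ATP tool, and the sample values in the usage lines are placeholders rather than measurements.

```python
def sustained_write_stats(samples_mb_s):
    """Summarise a series of write-throughput samples (MB/s).

    Returns the mean, the minimum, and the largest drop between
    consecutive samples (0.0 if throughput never falls).
    """
    if not samples_mb_s:
        raise ValueError("need at least one sample")
    mean = sum(samples_mb_s) / len(samples_mb_s)
    worst_drop = max(
        (prev - cur for prev, cur in zip(samples_mb_s, samples_mb_s[1:])),
        default=0.0,
    )
    return {
        "mean": mean,
        "min": min(samples_mb_s),
        "worst_drop": max(worst_drop, 0.0),
    }


if __name__ == "__main__":
    # Placeholder numbers, not measurements: one abrupt and one gradual profile.
    abrupt = [1000, 1000, 300, 300]
    gradual = [1000, 900, 800, 700]
    print("abrupt :", sustained_write_stats(abrupt))
    print("gradual:", sustained_write_stats(gradual))
```

A profile with a smaller worst drop and a higher mean corresponds to the steadier curve in a comparison like the one described above.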

 

 

Conclusion

Customization through ATP’s Joint Validation Service offers effective hardware and firmware thermal management solutions to overcome NVMe heating challenges and to deliver better sustained performance. By working closely together, ATP and its customers can arrive at the most optimized solution to meet thermal criteria and performance requirements.

ATP’s customizable Thermal Management Solutions use both hardware (heatsinks) and advanced firmware (Dynamic Thermal Throttling mechanism) to make sure that NVMe SSDs remain cool even when installed in spaces with insufficient airflow and under varied thermal conditions.

With their blistering-fast performance, NVMe SSDs race not only against time but also against speed. In upcoming blog articles, we will look at how thermal management resembles running a marathon and how we address the various factors affecting temperature.

Look out for the next series of articles to find out why STEADY WINS THE RACE.

For more information on ATP’s customizable thermal management solutions for M.2 and U.2 NVMe drives, visit the ATP website or contact an ATP Representative.

 

 
