A study conducted by IBM over 20 years ago showed that high-energy nuclear particles cause "soft" errors in electronic components. These particles come from two sources: the decay of radioactive atoms that exist in trace amounts in all materials, and cosmic rays from space that cascade through the Earth's atmosphere. When such a particle hits a silicon nucleus, the nucleus breaks into fragments that generate a burst of electrical charge, and that charge can upset a circuit. Any circuit is vulnerable to soft fails due to cosmic rays.
Soft Errors vs. Hard Errors
JEDEC defines a soft error as "An erroneous output signal from a latch or memory cell that can be corrected by performing one or more normal functions of the device containing the latch or memory cell. As commonly used, the term refers to an error caused by radiation or electromagnetic pulses and not to an error associated with a physical defect introduced during the manufacturing process." Soft errors occur randomly and cause no permanent damage to the memory device.
Hard errors, on the other hand, are those that "keep recurring as a result of hardware or physical defects on the memory or storage device. Hard memory errors are commonly caused by operating a system beyond the memory's speed capacity and subjecting the system to charges of static electricity. Other causes include environmental factors such as temperature, shock/vibration, electrical/voltage stress or physical stress. Mishandling, aging, or manufacturing defects can also affect the reliability of hardware components. Hard errors are usually permanent and require module replacement."
Causes and Impact of Soft Errors
A soft error, also known as a single-event upset (SEU), is caused by ionizing radiation from cosmic rays and alpha particles. Cosmic rays are high-energy particles from space that enter the Earth's atmosphere and interact with the air, producing secondary particles such as neutrons. Alpha particles come from trace radioactive contaminants in memory chip packaging materials. When these energetic particles penetrate a memory cell, the charge they deposit can change (flip) the state of a bit. If the deposited charge is large enough, it can upset multiple cells or bits.
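To make the bit flip concrete, the following is a minimal Python sketch that models a single-event upset as an XOR on one bit of a stored word, and shows how a simple parity check reacts. The function names and the 8-bit sample value are illustrative assumptions, not part of any real device firmware.

```python
def flip_bit(word: int, bit: int) -> int:
    """Return word with one bit position inverted, modeling a single-event upset."""
    return word ^ (1 << bit)


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


# Hypothetical 8-bit value held in a memory cell array.
stored = 0b10110010
stored_parity = parity(stored)

upset = flip_bit(stored, 3)         # one bit flipped by a particle strike
double_upset = flip_bit(upset, 5)   # a second bit flipped in the same word

# A single flipped bit changes the parity, so the error is detectable.
single_detected = parity(upset) != stored_parity
# Two flipped bits restore the original parity, so simple parity misses them.
double_detected = parity(double_upset) != stored_parity
```

The second case is why multi-bit upsets are a concern: a check that catches one flipped bit can miss two in the same word.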
In systems requiring the highest levels of reliability, exposure to alpha particles can be reduced by applying shielding materials or by using components that are insensitive to radiation. Neutrons, however, can't be shielded in practice: they can penetrate even 5 ft. of concrete!
Figure 1. When high-energy particles such as neutrons from cosmic rays strike a silicon nucleus in the substrate, the resulting nuclear reaction fragments the nucleus. The fragments (alpha particles) upset the sensitive region, causing the memory cell to "flip" or change value.
The impact of soft errors can include:
- Functional interruptions
- Read/write errors and data corruption
- Hanging, where the device stops working but resumes after a power cycle
- Bricking, where the device stops working even after a power cycle
- Data written to the wrong location
- Slower data writes
Figure 2. Soft errors, also called single-event upsets, occur when high-energy particles, such as neutrons from cosmic rays and alpha particles from radioactive materials in the environment, strike sensitive regions of an electronic device and disrupt its normal operation.
The study conducted by IBM showed that electronic failures from cosmic rays increase at higher altitudes. These days, however, even ground-based devices are at great risk, particularly SRAM-based devices, which are highly sensitive to radiation effects.
Static Random Access Memory (SRAM) is used as cache memory by the processor. Unlike Dynamic Random Access Memory (DRAM), which must be constantly refreshed (recharged with power) to keep its data, SRAM can store data without frequent refreshing. This means that the processor does not have to wait to access data in SRAM, resulting in speedier processing. SRAM is very fast, but it is also more expensive than DRAM. The following are common reasons why SRAM is susceptible to soft errors.
Lower Supply Voltage. SRAM supply voltage goes down with every process generation, and cell capacitance (the ability of a cell to store electric charge) shrinks as well. With less charge stored in each cell, a smaller disturbance is enough to flip it, which makes the memory cell more vulnerable to upset by an alpha particle or cosmic ray.
Scaling. The trend toward using more SRAM bits to cut latency makes SRAM arrays the densest memory on a chip, which increases their exposure to energetic particles.
Packaging. Integrated circuits are packaged using materials that contain small amounts of radioactive contaminants. Trace amounts of uranium and thorium, for example, are found in mold compounds and assembly materials. If material purity is not maintained, alpha particles emitted by these contaminants can cause soft errors.
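The supply-voltage point above can be illustrated with the basic relation between charge, capacitance, and voltage (Q = C × V). The sketch below is a rough illustration only: the function name is hypothetical and the capacitance and voltage figures are made-up values chosen to show the trend, not measured data for any real process.

```python
def stored_charge_fc(capacitance_ff: float, voltage_v: float) -> float:
    """Charge held on a cell node in femtocoulombs: Q = C * V (fF * V = fC)."""
    return capacitance_ff * voltage_v


# Hypothetical values for illustration only, not data from a real process node.
older_cell = stored_charge_fc(capacitance_ff=2.0, voltage_v=1.8)
newer_cell = stored_charge_fc(capacitance_ff=1.0, voltage_v=1.0)

# The newer cell holds less charge, so less deposited charge is needed to upset it.
newer_is_more_vulnerable = newer_cell < older_cell
```

Under these assumed numbers the newer cell stores well under half the charge of the older one, which is the sense in which lower voltage and smaller capacitance make a cell easier to flip.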
The ATP e.MMC SRAM Soft Error Detector and Recovery Mechanism
For some personal devices, the impact of soft errors may be insignificant. However, for mission-critical tasks like financial transactions, traffic control, and security/surveillance, even minor glitches can have disastrous effects. Left unattended, soft errors can lead to function loss, system failure, and other adverse effects.
The ATP e.MMC advanced SRAM Soft Error Detector and Recovery mechanism maximizes data integrity by providing timely error detection, logging, and a configurable action to address the error (the configuration is predetermined by the customer with ATP and cannot be changed in the field). If, after assessing the risk, the user opts to continue running the device, the error should be logged and the system rebooted to avoid unpredictable events that could damage the system or, worse, cause personal safety risks in critical autonomous applications.
The following figures show how the e.MMC SRAM Soft Error Detector and Recovery mechanism works.
(Note: Steps may vary depending on pre-determined configuration.)
Figure 3. As soon as an error is detected, the event is logged in the flash. The system can be alerted to handle the error and the firmware may be stopped. The ATP SRAM Soft Error Detector and Recovery Mechanism attempts to correct the error by initiating a system reboot.
Figure 4. If the system reboot fails, the error is considered a "hard error" and the e.MMC should be replaced. If the system reboot succeeds, the error is confirmed to be a "soft error." This means that the reboot has resolved the error and operation can continue with the correct data.
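The flow described in Figures 3 and 4 can be sketched as a small piece of decision logic. This is a simplified illustration and not ATP's firmware: the names `ErrorKind` and `handle_sram_error`, the log messages, and the use of a callable to stand in for the reboot are all assumptions made for the example.

```python
from enum import Enum
from typing import Callable, List


class ErrorKind(Enum):
    SOFT = "soft"
    HARD = "hard"


def handle_sram_error(reboot: Callable[[], bool], log: List[str]) -> ErrorKind:
    """Log the event, attempt a reboot, and classify the error by the outcome."""
    # Step 1: record the event (the real device logs to flash; a list stands in here).
    log.append("SRAM error detected")
    # Step 2: attempt recovery by rebooting; the callable returns True on success.
    if reboot():
        log.append("reboot succeeded: soft error")
        return ErrorKind.SOFT
    # Step 3: a failed reboot means the error persists, so treat it as a hard error.
    log.append("reboot failed: hard error, replace device")
    return ErrorKind.HARD


# Simulated outcomes: one reboot that succeeds and one that fails.
events: List[str] = []
recovered = handle_sram_error(lambda: True, events)
not_recovered = handle_sram_error(lambda: False, events)
```

Passing the reboot as a callable keeps the sketch testable without hardware; in a real system the actual steps depend on the predetermined configuration.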
Soft errors can corrupt data and cause systems to malfunction or fail. SRAM is particularly vulnerable; hence, it is important to be able to detect errors that built-in error-correcting codes may have missed. Data integrity and high reliability are of great importance to mission-critical applications in networking, military, health care, financial services, and more. It is therefore important to ensure that soft errors are not left unaddressed, as they could damage not only very important data but also physical assets.
To find out more about how ATP's industrial e.MMC prevents and mitigates the effects of soft errors using the SRAM Soft Error Detection and Recovery mechanism, visit the ATP website or contact an ATP Representative/Distributor.