Understanding and mitigating the effect of soft errors in semiconductor memory
18 May 2015
The past few decades have brought about unprecedented advancements in semiconductor technology.
However, each advance in semiconductor technology has brought up new obstacles to maintaining the exponential improvement of process technology. Today, CMOS technology has shrunk to such a size that extraterrestrial radiation and chip packaging cause failures at an increasing rate.
Since these errors are temporary, they are called soft errors. The first instance of soft errors was in 1978, when Intel was unable to deliver its chips to AT&T due to uranium-contaminated packaging modules. Intel, while coining the term ‘soft fail’, reported that radioactive contamination could cause not only flips in stored data but also microcontroller lock-up. Cypress Semiconductor came across its first instance of soft errors in 2001, when a large telecommunications client found that a single soft error in an SRAM was causing hundreds of computers in a system farm to crash.
As memory process technology scales for improved performance and power, the reduced voltage and shrinking node capacitance make these devices more susceptible to soft errors. Soft errors not only corrupt data, but can also lead to loss of function and system-critical failures. Industrial controllers, military equipment, networking systems, medical devices, automotive electronics, servers, handheld devices, and consumer applications are especially vulnerable to the adverse effects of soft errors. An uncorrected soft error can lead to system failures in mission-critical applications such as implantable medical devices and automotive engine control, as well as high-end security systems. Soft errors have the potential to cause elevator controllers to malfunction, while in a networking system they can cause traffic to go haywire. Such occurrences, though rare, have the potential to cause havoc on a massive scale.
A soft error is a change of state induced by an energetic particle. Unlike a hard error, however, the affected device’s normal operation can be restored by a simple reset or rewrite operation. Soft errors can occur in digital and analogue circuits, transmission lines, and magnetic storage. When a high-energy particle interacts with the semiconductor substrate, it generates many electron-hole pairs. The electric field in the depletion region causes these charges to drift, creating a current disturbance. If the charge displacement overcomes the critical charge stored in the memory cell, the stored data may flip, causing an error when it is next read. Soft errors manifest themselves as single-bit upsets (SBU) or multi-bit upsets (MBU), depending on the energy of the causative particle. An SBU occurs when a single energetic particle flips only one bit, while an MBU occurs when a single high-energy particle flips multiple bits in a word.
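The critical-charge condition can be illustrated with a first-order sketch. It assumes, as a simplification not stated in this article, that the critical charge is roughly the node capacitance multiplied by the supply voltage. Real cells also depend on feedback strength and the shape of the current pulse. The function names and all numbers are hypothetical and chosen only to show the trend.

```python
# First-order sketch: a cell is upset when the collected charge exceeds
# its critical charge. ASSUMPTION: Q_crit ~ C_node * V_dd, which ignores
# the restoring current of the cell and the shape of the current pulse.

def critical_charge_fc(node_capacitance_ff: float, supply_voltage_v: float) -> float:
    """Approximate critical charge in femtocoulombs (fF * V = fC)."""
    return node_capacitance_ff * supply_voltage_v


def is_upset(collected_charge_fc: float,
             node_capacitance_ff: float,
             supply_voltage_v: float) -> bool:
    """Return True if the collected charge is large enough to flip the cell."""
    q_crit = critical_charge_fc(node_capacitance_ff, supply_voltage_v)
    return collected_charge_fc > q_crit


# Hypothetical particle strike depositing 2 fC on the storage node.
strike_fc = 2.0

# Hypothetical "older" cell: larger capacitance, higher supply voltage.
print(is_upset(strike_fc, node_capacitance_ff=2.0, supply_voltage_v=1.8))

# Hypothetical "scaled" cell: smaller capacitance, lower supply voltage.
print(is_upset(strike_fc, node_capacitance_ff=1.0, supply_voltage_v=1.0))
```

Under these assumed values the same strike leaves the older cell intact but flips the scaled cell, which is the sensitivity trend that comes with lower voltage and smaller node capacitance.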
The Soft Error Rate (SER) is the rate at which soft errors occur, and it determines the probability of device failure due to energetic particles. Since soft errors are random, a single occurrence doesn’t define the reliability of a memory; its rate of failure does.
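As a rough illustration of how a rate turns into a failure probability, the sketch below uses the FIT unit (one failure per billion device-hours), which is commonly used to quote failure rates but is not mentioned in this article. It also assumes that upsets arrive as a Poisson process. The SER figure, memory size, and function names are hypothetical.

```python
import math

# 1 FIT = 1 failure per 10^9 device-hours.
DEVICE_HOURS_PER_FIT = 1e9


def expected_upsets(fit_per_mbit: float, capacity_mbit: float, hours: float) -> float:
    """Expected number of soft errors for a memory over a period of time."""
    return fit_per_mbit * capacity_mbit * hours / DEVICE_HOURS_PER_FIT


def probability_of_upset(fit_per_mbit: float, capacity_mbit: float, hours: float) -> float:
    """Probability of at least one soft error.

    ASSUMPTION: upsets are independent random events (Poisson process).
    """
    return 1.0 - math.exp(-expected_upsets(fit_per_mbit, capacity_mbit, hours))


# Hypothetical example: a 16 Mbit memory rated at 1000 FIT/Mbit, run for one year.
hours_per_year = 24 * 365
print(expected_upsets(1000, 16, hours_per_year))
print(probability_of_upset(1000, 16, hours_per_year))
```

With these assumed figures a single device sees about 0.14 upsets per year. That is small for one chip but significant across a large installed base.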
Alpha particles are emitted by radioactive nuclei in a process called alpha decay. Alpha particles have kinetic energies of a few MeV and are a direct cause of soft errors in semiconductor memories. They have a dense layer of charge and create electron-hole pairs as they pass through a substrate. If the disturbance is strong enough, a bit will flip. The disturbance lasts only for a fraction of a nanosecond, and hence is very hard to detect.
Low-energy alpha particles are generated by the radioactive decay of trace amounts of Uranium-238 and Thorium-232 present in mould compounds, packages, and other assembly materials. However, it’s nearly impossible to maintain the ideal material purity (less than 0.001 counts per hour per cm²) needed for reliable performance of most circuits. Small amounts of epoxy can reduce the incidence of soft errors by shielding the chip from alpha radiation.
Manufacturers have managed to control contaminants emitting alpha particles, but they have been unable to counter cosmic radiation. As a result, cosmic rays are now the likeliest cause of soft errors in modern semiconductors. The primary particles of cosmic rays don’t usually reach the earth’s surface. However, they do create a stream of energetic secondary particles, mostly energetic neutrons. While neutrons are uncharged and hence can’t cause soft errors directly, they can be captured by a nucleus in the chip, an event that can result in alpha particles.
Cosmic radiation increases with altitude due to a lower shielding effect of the atmosphere. In addition, modules used at the poles are also highly susceptible to soft errors for the same reason. To reduce soft errors, modules used in high-exposure applications undergo a special process called radiation hardening.
Thermal neutrons, which have very little kinetic energy, are an important source of soft errors due to neutron capture reactions. The capture of a thermal neutron by a boron-10 (¹⁰B) nucleus, found in large quantities in borophosphosilicate glass (BPSG) dielectric layers, emits an alpha particle, a lithium nucleus, and a gamma ray. Either the alpha particle or the lithium nucleus can cause a soft error.
Thermal neutrons are especially important for medical electronics used in cancer radiation therapy. The neutrons combined with the photon beam used in treatment result in a thermal neutron flux that generates a very high rate of soft errors. However, thermal neutrons aren’t a major cause of soft errors nowadays, since manufacturers eliminated borated dielectrics by the 150nm process node.
Soft errors can be mitigated in three ways: by improving process technology and memory cell layout, by making system-level changes, and by changing chip design and architecture.
The reliability of a memory device can be enhanced by increasing the critical charge stored in the memory cell. The resistance of a device to soft errors can also be increased by using a process technology that reduces the thickness of diffusion. This reduces the amount of time a charged particle spends in a memory cell. A triple-well architecture can also be used to drift charges away from the active region. This process creates an opposite electric field with respect to the NMOS depletion region and forces charges into the substrate. It only acts when a soft error occurs in the NMOS region.
At the system-level, designers can prevent the effect of soft errors by using external error correction code (ECC) logic. In this technique, the user employs additional memory chips with parity bits for error detection and correction. As expected, system-level mitigation is expensive and also adds more complexity to the system and its software.
Mitigation at the chip design level is the best way to combat soft errors. Chip designers can mitigate soft errors by using on-chip Error Correction Code (ECC). During a write operation, the ECC encoder algorithm includes parity bits with every addressable word of data stored in the memory. During a read operation, the ECC detection algorithm uses the parity bits to determine whether any of the data bits have changed. If there is a single-bit error, the ECC correction algorithm determines the location of the affected bit. It can then correct the error by flipping that bit back to its complementary value.
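The encode, detect, and correct steps can be sketched with a Hamming(7,4) code, a textbook single-error-correcting code. This is only an illustration: the article does not say which code any particular memory uses, and real devices protect wider words. The function names are hypothetical.

```python
# Hamming(7,4) sketch: 4 data bits are stored with 3 parity bits.
# Codeword layout by 1-based position: p1 p2 d1 p4 d2 d3 d4.

def hamming74_encode(data_bits):
    """Write path: add parity bits to a 4-bit data word."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Read path: detect and correct a single flipped bit.

    Returns (data_bits, syndrome). The syndrome is 0 when no error is
    detected, otherwise the 1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a soft error at position 5
data, syndrome = hamming74_decode(stored)
print(data, syndrome)  # [1, 0, 1, 1] 5
```

The non-zero syndrome is also the signal that tells the system an upset was detected and corrected. A code like this one corrects one flipped bit per word; two flips in the same word would be miscorrected.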
ECC alone, however, cannot address multi-bit upsets (MBUs). For these, designers have to implement bit interleaving. This technique arranges bit lines such that physically adjacent bits are mapped to different word registers. The bit-interleave distance is the physical separation between two consecutive bits mapped to the same word register. If the bit-interleave distance is greater than the spread of a multi-cell hit, the hit results in single-bit upsets (SBUs) in multiple words rather than a multi-bit upset (MBU) in a single word. The typical bit-interleave distance depends on the process technology. Neutron testing is performed with a subsequent physical MBU analysis to determine the safe interleaving distance for each process technology node. In a bit-interleaved memory, a single-bit error correction algorithm can be used to detect and correct all errors.
The ECC algorithm corrects only the copy of the affected word that is read out. The data as it resides in memory still contains the flipped bit. If this flipped bit in memory remains uncorrected, another bit flip in the same word of data can result in a multi-bit upset. It is important, therefore, that the ECC logic indicates the occurrence and correction of a single-bit upset. The system can then use this information to recognise the event and write back the corrected data. This technique is known as memory scrubbing.
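To make the bit-interleaving argument concrete, the sketch below maps physical columns in one memory row to logical words and counts how many bits of each word a multi-cell hit would flip. The modulo mapping and the function names are assumptions made for illustration; actual layouts vary by device and process.

```python
from collections import Counter

def cell_to_word_bit(column, interleave_distance):
    """Map a physical column to (word index, bit index).

    ASSUMED layout: adjacent columns belong to different words, so two
    consecutive bits of the same word sit `interleave_distance` columns apart.
    """
    return column % interleave_distance, column // interleave_distance


def flips_per_word(first_column, hit_width, interleave_distance):
    """Count flipped bits per word for a hit on adjacent physical cells."""
    flips = Counter()
    for column in range(first_column, first_column + hit_width):
        word, _bit = cell_to_word_bit(column, interleave_distance)
        flips[word] += 1
    return dict(flips)


# A hit that upsets three adjacent cells, starting at column 5.
print(flips_per_word(5, 3, interleave_distance=1))  # {0: 3}
print(flips_per_word(5, 3, interleave_distance=4))  # {1: 1, 2: 1, 3: 1}
```

With no interleaving, all three flips land in one word and defeat single-bit correction. With an interleave distance of four, each affected word contains a single flipped bit that single-bit ECC can repair.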
With semiconductor chips being manufactured on shrinking process nodes, the risk of soft errors is increasing. Hence, many experts expect soft errors to become a limiting factor for further scaling unless new technology is developed to overcome them. Furthermore, with technology entering more spheres of human life, the need for reliability is bound to increase. This trend increases the need for on-chip Error Correcting Code (ECC) in memory modules. All major memory manufacturers have started releasing chips with on-chip ECC to meet the demand for high-reliability memories. Given the high-performance applications that SRAM devices are used for, error-correcting capabilities are a must for SRAMs. Cypress has a family of ultra-reliable Asynchronous SRAMs with on-chip ECC and bit interleaving.