The importance of accurate temperature monitoring for SoC-based systems

Author : Kim Majkowski | Global Product Manager for Power Management ICs | Farnell

03 February 2020


The heat generated by advanced multicore system-on-chip (SoC) and programmable devices has become a major issue in electronics design. As Kim Majkowski, Global Product Manager for Power Management ICs at electronics distributor Farnell, tells us, although the voltages these integrated circuits require to run have fallen to below 1V, the devices still exhibit high peak current demands when the processors they contain need to run at full speed.

This article was originally featured in the February 2020 issue of EPDT magazine.

Power dissipation across an individual SoC can vary significantly, as different cores are activated or as the compute demand changes over time. An SoC die can shift from relatively cool to hot within a matter of seconds as the software load increases. If operated at peak current for long periods, the local die temperature can rise to a value that could cause a thermal shutdown – or affect the performance and reliability of nearby components.

By monitoring the die temperatures of sensitive components, a system can avoid the problems caused by heat: it can increase the rotation speed of cooling fans, or reduce clock speeds, to bring down the temperature of an overheating component. As a result, accurate temperature monitoring is essential in systems that use advanced SoC and field programmable gate array (FPGA) devices. On-die probes provide the most accurate way to determine the thermal conditions close to critical cores.
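As an illustration of the kind of response described above, the following is a minimal Python sketch of a thermal policy that ramps fan speed with die temperature and throttles the clock with hysteresis. The class name `ThermalPolicy` and every threshold in it are assumptions chosen for the example; they do not come from any data sheet or from a specific product.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ThermalPolicy:
    # All thresholds below are illustrative assumptions.
    fan_min_c: float = 50.0       # at or below this, the fan runs at minimum duty
    fan_max_c: float = 85.0       # at or above this, the fan runs at 100%
    throttle_on_c: float = 95.0   # start reducing clock speed at this temperature
    throttle_off_c: float = 88.0  # restore clock speed only once this cool again
    min_duty: float = 0.2         # lowest fan duty cycle (0..1)
    throttled: bool = False       # current throttle state

    def update(self, die_temp_c: float) -> Tuple[float, bool]:
        """Return (fan duty 0..1, throttle flag) for one temperature reading."""
        # Linear fan ramp between fan_min_c and fan_max_c, clamped to 0..1.
        frac = (die_temp_c - self.fan_min_c) / (self.fan_max_c - self.fan_min_c)
        frac = min(1.0, max(0.0, frac))
        duty = self.min_duty + (1.0 - self.min_duty) * frac

        # Hysteresis stops the clock toggling when the reading hovers
        # around a single threshold.
        if die_temp_c >= self.throttle_on_c:
            self.throttled = True
        elif die_temp_c <= self.throttle_off_c:
            self.throttled = False
        return duty, self.throttled
```

A system controller would call `update()` each time it reads a new die temperature. The gap between the two throttle thresholds is a design choice that trades responsiveness against stability.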

Figure 1. Simplified block diagram of a remote diode temperature sensor

On-die temperature sensors

On-die temperature sensors take advantage of a property of semiconductor PN junctions. For a PN junction of a given area, the voltage developed across the junction will have a characteristic value that is dependent on current flow and temperature. The temperature dependence is due to the presence of thermally generated carriers in the semiconductor. If current is held constant, any changes in voltage will be due to changes in temperature. Typically, in semiconductors, the voltage across a junction falls as temperature rises. However, if two different current levels are applied, one after the other, and the junction voltage is measured each time, there will be a small voltage delta between the two readings. A rise in absolute temperature leads to an increased delta in a near-linear relationship, providing a reliable basis for use in semiconductor temperature sensors.
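The relationship can be made concrete with a short Python sketch based on the standard diode equation, in which the delta between the two readings is n(kT/q)ln(I_high/I_low). The function names are hypothetical, and the model ignores series resistance and other non-ideal effects.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19  # electron charge, C


def delta_vbe(temp_k, i_high, i_low, ideality=1.0):
    """Voltage delta (V) between readings taken at two bias currents."""
    return ideality * (K_BOLTZMANN * temp_k / Q_ELECTRON) * math.log(i_high / i_low)


def temperature_from_delta_vbe(dv, i_high, i_low, ideality=1.0):
    """Invert delta_vbe(): absolute temperature (K) from a measured delta."""
    return dv * Q_ELECTRON / (ideality * K_BOLTZMANN * math.log(i_high / i_low))


# Example: a 10:1 current ratio at 300 K gives a delta of roughly 59.5 mV,
# which changes by roughly 0.2 mV for every kelvin.
dv = delta_vbe(300.0, 100e-6, 10e-6)
print(f"delta = {dv * 1e3:.2f} mV")
print(f"recovered T = {temperature_from_delta_vbe(dv, 100e-6, 10e-6):.1f} K")
```

Because the delta depends only on the ratio of the two currents and on physical constants, the saturation current of the junction cancels out. This is why the two-current method is preferred over a single voltage reading.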

On the modern CMOS processes used to build complex SoCs, suitable PN junctions are easy to construct. Typically, the thermal probe is a bipolar transistor, with the base-emitter junction forming the required diode, and the collector tied to the device’s substrate.

The need for system-level thermal management

Figure 2. 45nm CPU – BETA = 0.3 - MAX6692Y – (NON-BETA COMPENSATED)

Although many components, particularly programmable devices, can monitor their own temperature, thermal problems frequently need to be solved at a system level. For example, controlling the speed of enclosure fans will change the cooling of all components in the system. To achieve system-level control, the local die temperature for several devices must be monitored remotely.

In principle, it is simple to construct a full temperature sensor on the SoC for each area that requires thermal monitoring. The probe is formed close to the circuitry of interest and is then switched between two current sources of different magnitudes. The two voltage measurements that result from the two current sources are then passed to an analogue-to-digital converter (ADC) and associated logic that computes the estimated temperature.

In practice, many system designers choose to employ remote temperature sensors, as they provide greater levels of reliability and accuracy. If the sensor is implemented entirely on the SoC die, the two current sources per thermal probe call for the manufacturer to match the devices precisely, which is difficult in many commodity digital processes. By forming the current sources on a die made using a precision mixed-signal process, much greater measurement reliability is possible. Furthermore, fewer pins are needed on the SoC, as connections only need to be made to one transistor per monitored region, rather than two.


A secondary advantage of remote temperature sensors is that they allow you to monitor more than one hot spot with a single IC, and trigger alarms automatically. A basic single-remote sensor, such as the MAX6642, can monitor two temperatures: its own temperature, plus the temperature of a nearby SoC or FPGA. Other remote sensors monitor three or more external temperatures. The MAX31730, for example, can monitor its own internal temperature, and that of three external probes. If the temperature of any of the inputs passes a programmable threshold, the device sets a status flag and records the temperature of the hottest channel in a dedicated register. The MAX31730 uses the SMBus to relay this information to a system controller.
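The behaviour described here (per-channel thresholds, a status indication and a record of the hottest channel) can be modelled in a few lines of Python. This is a software model for illustration only: the function `evaluate_channels`, the channel names and the returned dictionary are assumptions, and they do not represent the register map or SMBus commands of the MAX31730 or any other device.

```python
def evaluate_channels(readings_c, thresholds_c):
    """Compare each channel against its threshold and find the hottest one.

    readings_c and thresholds_c are dicts keyed by channel name, in degrees C.
    """
    # A channel is flagged once its reading passes its programmed threshold.
    over = {name: readings_c[name] > thresholds_c[name] for name in readings_c}
    hottest = max(readings_c, key=readings_c.get)
    return {
        "over_threshold": over,
        "any_alarm": any(over.values()),
        "hottest_channel": hottest,
        "hottest_temp_c": readings_c[hottest],
    }


# Hypothetical readings: one local channel plus three remote diodes.
readings = {"local": 41.0, "remote1": 78.5, "remote2": 92.0, "remote3": 66.0}
limits = {"local": 70.0, "remote1": 90.0, "remote2": 90.0, "remote3": 90.0}
print(evaluate_channels(readings, limits))
```

In a real design this comparison is performed inside the sensor IC, so the system controller only needs to respond to the alarm and read back the result.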

To monitor more inputs, the designer can select a device such as the MAX6581, which has seven remote diode inputs. This could be used to monitor the temperatures of a pair of FPGAs with integrated thermal diodes, four board hotspots using discrete diode-connected transistors, and the temperature of the board at the MAX6581’s location. Another option is to deploy up to eight MAX31730 devices as slaves on the SMBus.

Ensuring accuracy when monitoring die temperature

Although the remote temperature sensor approach has a number of advantages in system design, there are sources of error and inaccuracies that need to be taken into account by the engineering team.

Figure 3. Example of a remote temperature sensor from the MAX31730 data sheet

Parasitic series resistance of some kind is inevitable in any circuit and will affect the temperature provided by the sensor device if compensation is not applied. Take a setup where the first bias current chosen is 100µA and the second 10µA. The voltage difference between the two readings will be proportional to the natural logarithm of the first current divided by the second. Its absolute value will be that log value multiplied by an ideality constant, which is normally close to 1, and by kT/q, where k is the Boltzmann constant, T is the absolute temperature and q is the charge of the electron. If the series resistance is 1Ω, the additional voltage drop will be 100µV at the higher current and 10µV at the lower, which adds 90µV to the measured difference. Since that difference changes by roughly 200µV per °C at a 10:1 current ratio, the resultant shift in measured temperature will be about 0.45°C.
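The arithmetic behind the 0.45°C figure can be checked with a short Python sketch. The function name is hypothetical, and the default ideality factor of 1.008 is an assumption for the example; with an ideality of exactly 1 the result differs only in the third decimal place.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant divided by electron charge, V/K


def series_resistance_error_c(r_series_ohm, i_high, i_low, ideality=1.008):
    """Temperature error (degrees C) caused by uncompensated series resistance."""
    # Extra voltage that the resistance adds to the measured delta.
    extra_v = r_series_ohm * (i_high - i_low)
    # Sensitivity of the delta to temperature, in volts per kelvin.
    volts_per_kelvin = ideality * K_OVER_Q * math.log(i_high / i_low)
    return extra_v / volts_per_kelvin


# 1 ohm with bias currents of 100 uA and 10 uA: roughly 0.45 degrees C.
print(f"{series_resistance_error_c(1.0, 100e-6, 10e-6):.2f} C")
```

The error scales linearly with resistance, so a few ohms of trace and contact resistance can add up to more than a degree.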

If the series resistance is known, and it can often be estimated from typical PCB trace resistances, then it is possible to correct for the temperature shift. Some sensors, such as the MAX31730 and others manufactured by Maxim, have automatic resistance cancellation, which avoids the need to compensate for this parasitic source of error.

Although the ideality factor is normally close to 1.01, its exact value will depend on the process and transistor design and is therefore a potential source of error. Most remote sensors will be optimised for a specific ideality factor. Maxim has several that are tuned for the value of around 1.008 typically found on advanced processes, such as those used in advanced FPGAs and SoCs. For a device with a different ideality factor, it is relatively straightforward to apply a correction in system-controller firmware.
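A firmware correction of this kind follows directly from the diode equation: the reported absolute temperature is scaled by the ratio of the actual ideality factor to the one the sensor assumes. The Python sketch below shows the inverse scaling; the function name and the example ideality value of 1.002 are assumptions for illustration.

```python
def correct_for_ideality(t_reported_c, n_sensor, n_actual):
    """Correct a reported temperature for a mismatch in ideality factor.

    n_sensor is the ideality factor the sensor IC is tuned for and
    n_actual is that of the thermal diode being measured.
    """
    # The scaling applies to absolute temperature, so work in kelvin.
    t_reported_k = t_reported_c + 273.15
    t_actual_k = t_reported_k * n_sensor / n_actual
    return t_actual_k - 273.15


# A sensor tuned for 1.008 reading a diode with a (hypothetical) ideality
# of 1.002 reports 85.0 C; the corrected value is about 87.1 C.
print(f"{correct_for_ideality(85.0, 1.008, 1.002):.1f} C")
```

Because the scaling acts on kelvin rather than degrees Celsius, even a mismatch of less than 1% shifts the reading by a couple of degrees at typical operating temperatures.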


A further source of error can come from SoC-based thermal-probe transistors that suffer from a low current gain or beta value. If the transistor’s current gain is very low, the ratio of collector currents may not match the ratio of emitter currents and so cause an error in the calculated temperature. A 10% change in the collector current ratio can cause an error in reported temperature of approximately 12°C.
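The size of this error can be estimated from the same logarithmic relationship used for the measurement itself. The Python sketch below assumes a nominal 10:1 current ratio and a die temperature of 25°C, neither of which is stated above for this figure; the function name is hypothetical.

```python
import math


def ratio_error_c(temp_c, nominal_ratio, ratio_change):
    """Temperature error when the real current ratio differs from nominal.

    ratio_change is fractional, e.g. 0.10 for a 10% increase in the ratio.
    """
    temp_k = temp_c + 273.15
    # The sensor divides the measured delta by ln(nominal_ratio), but the
    # delta was actually produced by ln(nominal_ratio * (1 + ratio_change)).
    return temp_k * math.log(1.0 + ratio_change) / math.log(nominal_ratio)


# A 10% change on a 10:1 ratio at 25 C gives an error of about 12.3 C.
print(f"{ratio_error_c(25.0, 10.0, 0.10):.1f} C")
```

The error grows with absolute temperature, so the same ratio change produces a somewhat larger error on a hot die.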

This is not normally an issue for dedicated remote temperature sensor ICs, as they employ transistors with high current gain. But SoC transistors are fabricated on processes optimised for MOS rather than bipolar transistors, and so cannot guarantee high gain in these devices. When such transistors are employed, it may be best to use a remote sensor IC with beta compensation, although it is not always required.

If the beta is relatively uniform over the expected range of currents and temperatures, the effect may be small enough to ignore. For example, in tests of three samples of a microprocessor built on a 45nm process that exhibits a beta for bipolar transistors of around 0.3, the resulting error was less than ±1°C. However, where low beta is likely to lead to larger errors, remote sensors with beta compensation such as Maxim’s MAX31730, MAX6693 or MAX6581 can be applied.
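The reason uniform beta causes little error can be seen in a simplified model, sketched in Python below. It assumes the sensor forces the emitter current and that the junction voltage follows the collector current, which is the emitter current multiplied by beta/(1 + beta). The function names and beta values are assumptions for illustration, and the model ignores the other non-ideal effects present in a real device.

```python
import math


def collector_ratio(ie_high, ie_low, beta_high, beta_low):
    """Ratio of collector currents for two forced emitter currents."""
    ic_high = ie_high * beta_high / (1.0 + beta_high)
    ic_low = ie_low * beta_low / (1.0 + beta_low)
    return ic_high / ic_low


def beta_error_c(temp_c, ie_high, ie_low, beta_high, beta_low):
    """Temperature error when the sensor assumes the emitter current ratio."""
    temp_k = temp_c + 273.15
    assumed = ie_high / ie_low
    actual = collector_ratio(ie_high, ie_low, beta_high, beta_low)
    return temp_k * math.log(actual / assumed) / math.log(assumed)


# Same beta at both currents: the beta terms cancel and the error vanishes.
print(beta_error_c(25.0, 100e-6, 10e-6, 0.3, 0.3))
# Beta falling slightly at the higher current: the error is several degrees.
print(beta_error_c(25.0, 100e-6, 10e-6, 0.28, 0.3))
```

In this model it is the variation of beta between the two bias currents, rather than its absolute value, that produces the error.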

Thanks to devices in Maxim’s range that are tuned to different scenarios for remote temperature measurements, system designers can implement thermal controls that react to the true thermal situation inside their products. The result is greater reliability, longer product lifetimes and less risk of disruptive thermal shutdowns.

