MCU differentiation by innovative peripherals

Author : Dirk Jansen, Texas Instruments

05 September 2016

UART, SPI and other peripherals can be found in every microcontroller (MCU) architecture. However, features of particular interest include those that enable real differentiation and provide entirely new solution approaches for designers.


The following article illustrates the use of integrated digital helpers including real-time CLA, arithmetic TMU, VCU or PRU (Programmable Real-time Unit) co-processors in MCUs and MPUs to implement hard real-time applications without loading the main processor. After all, increased clock frequencies are not always the only solution. 

Do you have to design a tricky application without knowing whether the performance of the central microcontroller will be sufficient? The answer hinges on common terms like ‘real time’ and ‘typical response time’, but also on streamlining computationally intensive algorithms that block the CPU. 

Let’s imagine the following motor-based application. Your company’s product marketing department wants the design department to implement a low-cost control solution capable of driving up to two asynchronous motors dynamically and smoothly using three-phase inverters. The solution is supposed to use a field-oriented control (FOC) approach without mechanical rotor position sensors. In addition to this sophisticated drive control, the system shall be parameterised and controlled by CAN bus and keyboard. 

Critical design paths

At this point, it is necessary to consider possible solutions within the project’s cost budget, which will quickly reveal any critical hardware and software paths. 

FOC algorithms are based on fixed-point arithmetic. In addition, the trigonometric functions (sine, cosine) required for the Clarke and Park transformations will lead to C library calls and extended runtimes. Smooth motor operation can be achieved by increasing the PWM resolution and the fundamental frequency, which will proportionally increase the call frequency of the FOC algorithm. There must be a fixed, deterministic relationship between the triggering of a PWM edge, the measurement of the motor current with the ADC and the triggering of the FOC algorithm. Otherwise, increased control deviations, bumpy motor operation and increased power consumption can result. Timing requirements get even tighter with the need to meet these demands for two motors at the same time. 

It is necessary to process a CAN protocol concurrently. Despite the computational load resulting from two FOC control loops, data losses must be avoided and a definite response time must be guaranteed. In other words, this must be achievable under worst-case conditions even at full CPU utilisation by the motor controllers. 

In most cases, the user interface (display and keyboard) will not be critical. Millisecond delays during input and output operations will be tolerable. 

The issues to be considered when looking for possible solutions are described in more detail below. 

Questions concerning the CPU system

Is it possible to implement a deterministic interrupt behaviour? Although system performance will benefit from CPU caches, it will be far from deterministic. The tricks used by CPU designers (including branch prediction) cannot fully eliminate the risk of cache flushes and a certain timing jitter. 

Is there an efficient instruction set? Can it be used for fixed-point and floating-point arithmetic (and maybe even trigonometric computations), and does it support uninterruptible (atomic) instructions? Atomic instructions avoid frequent interrupt-disable events when accessing hardware resources and global variables, thereby reducing interrupt latency. 
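The benefit of atomic instructions can be illustrated with standard C11 atomics. This is a portable host-side sketch, not vendor code: the counter and function names are hypothetical, and whether a given embedded toolchain provides stdatomic.h or maps these calls to single uninterruptible instructions is an assumption that has to be checked for the actual target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Event counter shared between an interrupt handler and the main loop
   (hypothetical name). Static storage, so it starts at zero. */
static atomic_uint_fast32_t can_rx_events;

/* Would be called from a (hypothetical) CAN receive interrupt. */
static void can_rx_isr(void)
{
    /* One indivisible read-modify-write, no interrupt lock required. */
    atomic_fetch_add_explicit(&can_rx_events, 1u, memory_order_relaxed);
}

/* Called from the main loop: reads and clears the counter in one step,
   so no event can slip in between the read and the reset. */
static uint_fast32_t take_can_rx_events(void)
{
    return atomic_exchange_explicit(&can_rx_events, 0u,
                                    memory_order_relaxed);
}
```

Without atomic operations, the main loop would have to disable interrupts around the read and the reset, and every such critical section adds to the worst-case interrupt latency.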

Questions concerning the compiler

Who has not experienced the following scenario: After years of field operation, a device exhibits unexplainable phenomena following a seemingly insignificant modification of the software. In systems with high computational load, this is sometimes caused by the compiler, which raises the question of how optimisation strategies and different compiler versions influence a system’s run-time behaviour. Is it necessary to question and repeat all previous software tests following a new compiler release? 

Three possible solutions to the real-time dilemma

Solution A: Two or three MCUs instead of one

Assigning different tasks (communication, HMI, motor control) to physically separate MCUs is a possible way to meet real-time requirements. Tasks will be assigned according to the respective computational loads. This can result in a system consisting of two or three processors (MCU1: communication; MCU2: motor 1; MCU3: motor 2). Challenges resulting from this approach include increased BOM and PCB costs and the need to synchronise the three MCUs via serial IPC (e.g. RS485). In addition, different versions must be maintained during software development. EMI can turn out to be an additional problem because three MCUs with asynchronous clocks will lead to increased emissions. 

As another interesting option, integrated dual-MCU SoC systems provide a communications MCU and a control MCU including their dedicated peripherals on a single chip. For instance, the F28M3 family, part of TI’s C2000 MCU family, enables an on-chip separation of time-critical control functions (C28 CPU) and communication-related tasks (ARM Cortex-M). 

Solution B: Increased computational performance

Instead of parallelising tasks at the hardware level, it is possible to raise the system’s clock frequency, which will proportionally reduce the timing jitter. However, instead of solving the problems, this approach will only reduce their impact. It must not be overlooked that MCUs with on-chip Flash memory are accelerated only to a limited extent when their clock frequency is raised. Physical limits including maximum Flash access times will lead to wait states that will flatten the DMIPS curve. In this context, it is often suggested to use a microprocessor unit (MPU) instead of an MCU. MPUs are often used with high-performance operating systems including Linux, Windows or Android that can then be executed from RAM or DRAM memory at reduced access times. However, finite access times to external memory will slow down even a high-speed 1GHz MPU. This problem can be solved by implementing data and code caches for the CPU, although this will result in the non-deterministic behaviour mentioned above. As you can see, solving the real-time dilemma is far from trivial. 
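The flattening effect of Flash wait states can be sketched with a deliberately simplified model. It assumes that every instruction fetch goes to Flash and ignores prefetch buffers, wide Flash lines and caches, which real devices use to soften the effect. The 25ns access time used in the usage note is a hypothetical figure, not a datasheet value.

```c
#include <assert.h>

/* Number of wait states needed so that one Flash access of t_acc_ns
   nanoseconds fits into whole CPU clock cycles at f_mhz megahertz. */
static unsigned flash_wait_states(unsigned f_mhz, unsigned t_acc_ns)
{
    assert(f_mhz > 0u);
    /* cycles per access = ceil(t_acc / t_clk) = ceil(t_acc_ns * f_mhz / 1000) */
    const unsigned cycles = (t_acc_ns * f_mhz + 999u) / 1000u;
    return cycles > 1u ? cycles - 1u : 0u;
}

/* Effective fetch rate in MHz under the simplified model: each fetch
   costs one cycle plus the wait states (integer division truncates). */
static unsigned effective_fetch_mhz(unsigned f_mhz, unsigned t_acc_ns)
{
    return f_mhz / (1u + flash_wait_states(f_mhz, t_acc_ns));
}
```

With the assumed 25ns access time, effective_fetch_mhz returns 40 at a 40MHz clock, 30 at 60MHz, 33 at 100MHz and 40 at 200MHz. In this model the raw fetch rate never exceeds the 40MHz dictated by the Flash, regardless of the CPU clock.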

Apart from the MPU itself, external boot memory (NAND, SD, MMC), DDR execution memory and power management also impact the cost of the full system. Increased clock frequency and power consumption may also lead to higher EMI levels. 

Possible solutions include TI’s Sitara AM437 processor, which provides four additional 32-bit PRU (Programmable Real-time Unit) co-processors to offload the main ARM Cortex-A9 processor. 

Typical tasks of the PRU subsystem (PRUSS) include:

• Motor control (FOC)

• Interfacing digital position encoders (BiSS-C, HIPERFACE DSL, EnDat 2.2)

• Implementation of SINC filters for external SD modulators

• Link layers for industrial fieldbuses (EtherCAT, PROFINET, EtherNet/IP, PROFIBUS etc.)

The so-called Single-Chip Drive based on the AM437 processor represents an interesting feasibility study. 

Solution C: On-board accelerator

This approach is embodied by TI’s C2000 real-time MCU family, which uses specific co-processors to overcome the aforementioned real-time challenges. In addition to a significant performance increase, the family enables real-time operation even on moderately clocked MCU systems. 

The most important hardware accelerators include:

• Floating Point Unit (FPU)

• Real-Time Control Co-Processor (CLA)

• Trigonometric Math Unit (TMU)

• Viterbi - Complex Math - CRC Unit (VCU)

At the heart of the C2000 MCUs, there is a C28 32-bit fixed-point DSP CPU whose instruction set and internal structure are tailor-made for any kind of DSP and control algorithms. Development work is facilitated by C programmability and a multitude of application-specific libraries. 

Floating Point Unit (FPU)

Control system design often begins with simulation tools using floating-point arithmetic. Porting these solutions (e.g. as C source code) is greatly facilitated by target MCUs providing native floating-point support. As another benefit of floating-point arithmetic compared to fixed-point solutions, the wide dynamic range requires no scaling and saturation. As underflow and overflow situations are practically eliminated, the resulting code is more robust. The FPU therefore extends the C28 instruction set by IEEE 754 single-precision floating-point arithmetic. Close integration into the C28’s pipeline enables concurrent execution of some instructions. Run times can be improved by a factor of up to 2.5 when using standard algorithms. 

Real-Time Control Co-Processor (CLA)

Provided as a supplement to the C28 core, this 32-bit floating-point CPU enables the implementation of time-critical, complex processes requiring low latency. The performance of these systems can thus be doubled by using the CLA. 

In the motor control example, the time between an ADC sample (current/voltage), the subsequent algorithm (e.g. PID, 2p2z, 3p3z) and the final update of the PWM register is critical. Other possible applications include digital power implementations like PFC stages (power factor correction), DC-DC converters or solar inverter controllers. 
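The algorithm step between the ADC sample and the PWM update can be sketched as a two-pole, two-zero (2p2z) difference equation. This is a generic host-side sketch, not TI library code. The structure and function names are illustrative, and the sign convention (feedback terms added) is an assumption that differs between libraries.

```c
#include <assert.h>

/* State and coefficients of a 2p2z controller (illustrative layout):
   u[n] = b0*e[n] + b1*e[n-1] + b2*e[n-2] + a1*u[n-1] + a2*u[n-2] */
typedef struct {
    float b0, b1, b2;     /* coefficients applied to the error history  */
    float a1, a2;         /* coefficients applied to the output history */
    float e1, e2;         /* previous errors  e[n-1], e[n-2]            */
    float u1, u2;         /* previous outputs u[n-1], u[n-2]            */
    float u_min, u_max;   /* output clamp, e.g. PWM duty-cycle limits   */
} Ctrl2p2z;

/* One control step: takes the reference and the fresh ADC feedback and
   returns the clamped actuating value to be written to the PWM. */
static float ctrl2p2z_step(Ctrl2p2z *c, float reference, float feedback)
{
    assert(c != 0);
    const float e = reference - feedback;
    float u = c->b0 * e + c->b1 * c->e1 + c->b2 * c->e2
            + c->a1 * c->u1 + c->a2 * c->u2;

    /* Clamp the output; the clamped value is stored as history, which
       also limits integrator wind-up. */
    if (u > c->u_max) u = c->u_max;
    if (u < c->u_min) u = c->u_min;

    c->e2 = c->e1;  c->e1 = e;
    c->u2 = c->u1;  c->u1 = u;
    return u;
}
```

A discrete PI controller is a special case: with b0 = Kp + Ki, b1 = -Kp, a1 = 1 and the remaining coefficients set to zero, the same routine computes u[n] = u[n-1] + (Kp + Ki)*e[n] - Kp*e[n-1]. The routine is short and branch-light, which is the kind of code that suits a run-to-completion co-processor task.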

For these and other applications, TI implemented the CLA, which is capable of processing up to eight tasks as an event-driven task machine. The tasks are either triggered by hardware events (interrupt trigger) from the peripherals (ADC, comparator, serial interfaces, PWM, capture, DAC, GPIO, timer) or by the C28 software. The universal accessibility of almost all peripherals introduced in the newer C2000 MCUs (TMS320F28x7x) enables designers to make virtually unlimited use of the CLA. 

Being unable to interrupt each other, CLA tasks are commonly implemented as short, linear programs. In addition, there is a fixed task priority scheme with task 1 having the highest priority and task 8 having the lowest. The CLA executes from a dedicated RAM in the C28 memory map, which is commonly used to store a CLA program image following the boot routine. Thus, the CLA firmware is part of the C28 project. 
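The scheduling behaviour described here, fixed priorities without pre-emption, can be modelled in a few lines of host-side C. This is a behavioural sketch only: the structure, the function names and the zero-based bit indices are assumptions made for illustration and do not mirror the CLA register set.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TASKS 8u

typedef struct {
    uint8_t pending;   /* one request flag per task; bit 0 = highest priority */
    int     running;   /* index of the task in execution, -1 when idle        */
} TaskMachine;

/* A hardware event or a software request flags a task as pending.
   A task that is already running is not disturbed by this. */
static void request_task(TaskMachine *m, unsigned task)
{
    assert(task < NUM_TASKS);
    m->pending |= (uint8_t)(1u << task);
}

/* Called when the machine is idle or the running task has completed:
   starts the highest-priority pending task and returns its index,
   or returns -1 if nothing is pending. */
static int dispatch_next(TaskMachine *m)
{
    m->running = -1;
    for (unsigned t = 0; t < NUM_TASKS; ++t) {
        if (m->pending & (1u << t)) {
            m->pending &= (uint8_t)~(1u << t);
            m->running = (int)t;
            break;
        }
    }
    return m->running;
}
```

Because a request arriving during execution only sets a flag, the worst-case start delay of the highest-priority task equals the longest run time of any other task. This is why the tasks are kept short and linear.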

The current development environment within Code Composer Studio 6.1 is an attractive feature: Although the CLA C compiler has some minor restrictions owing to the CLA hardware architecture, it makes porting existing algorithms to the CLA considerably easier. The JTAG adapter can be used to access the CLA CPU for high-level language dual-core debugging. 

A CLA program appears as an assigned sub-project in the Eclipse workspace (Code Composer Studio IDE) of the C28 project. For engineers wishing to familiarise themselves with the CLA, the C2000 MCU controlSUITE provides multiple examples for the C28 and the CLA. 

Trigonometric Math Unit (TMU)

The TMU can be considered as an IEEE 754 floating-point extension to the FPU, complementing the C28 and FPU instruction set by trigonometric and arithmetic functions including sin(b), cos(b), atan(b), div(a,b) and sqrt(b). As the TMU uses the same pipeline as the CPU and the FPU, no additional measures are required for saving and restoring the interrupt context. 

Instead of using C libraries, many mathematical functions can now be executed in hardware, which saves CPU cycles and enables high control performance even on systems with lower clock frequencies. 

TI’s C28 compiler automatically generates the necessary TMU instructions, making it fully transparent to the C programmer. 

This results in drastically shortened and accelerated routines especially for the following algorithms: 

• Park and inverse Park transforms

• Space vector generation

• DQ0 and inverse DQ0 transforms

• FFT amplitude and phase computations

For instance, a Park transform typically requires 80 to 100 cycles with the FPU. Using the TMU, only 13 cycles will be required, reducing computation time by roughly 85 percent (1 - 13/80 is about 84 percent, 1 - 13/100 is 87 percent). 

In typical applications including digital motor controls or multi-phase solar inverters, the TMU will yield a 1.4-fold performance increase. 

Viterbi, Complex Math and CRC Unit (VCU)

As a fixed-point accelerator, the VCU can provide additional computational power especially in communication-based systems. For instance, the coding and decoding of PLC signals (power line communication) can be accelerated, enabling the elimination of an additional processor system in the best case. In addition, the VCU supports generic signal-processing algorithms including digital filter computations and FFT spectral analysis. This facilitates the implementation of algorithms including motor vibration analysis for detecting bearing problems. 

The VCU consists of the following main elements: 

• Viterbi decoder (for baseband applications)

• Complex FFT accelerator

• Complex filter accelerator

• Background CRC computation and verification
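To give an impression of the work a background CRC engine takes off the CPU, the following is a plain software CRC in C. It is a generic sketch under assumptions: the widely used reflected CRC-32 polynomial 0xEDB88320 is chosen for illustration and is not stated in this article, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
   final XOR of 0xFFFFFFFF). Pass 0 as crc for the first block and the
   previous return value to continue over further blocks. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != 0 || len == 0u);
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            /* Shift out one bit; XOR in the polynomial if the bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The inner loop costs eight shift-and-XOR steps per byte. Verifying a large Flash image or a stream of communication frames this way ties up the CPU for a noticeable time, which is the load a hardware CRC unit removes.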

The C2000 MCU portfolio with on-chip hardware accelerators

TI’s portfolio of C2000 MCUs consists of the Delfino and Piccolo MCU families. As the high-performance derivatives, Delfino MCUs combine high system-level performance and high accelerator and peripheral integration. The latest TMS320F2837x MCU family provides pin-compatible scalability from quad core (2 x C28; 2 x CLA; 200MHz) down to the Piccolo world (TMS320F2807x MCU featuring 1 x C28; 1 x CLA; 120MHz). 

Typical applications of the Delfino MCU class include industrial drives, digital power, solar inverters and intelligent sensors. Targeting lower-cost applications, the C2000 Piccolo MCU class is ideally suited for white goods, motor control, digital power, hybrid and electric vehicles (HEV) and PLC applications. Apart from fixed-point versions supporting 40 - 60MHz clocks, floating-point derivatives are available with a clock frequency of 120MHz. The accelerators mentioned above are implemented here as well. 

As an example, the current superset C2000 Delfino derivative, the TMS320F2837xD MCU, provides a rich feature set. The MCU consists of two symmetric MCU systems, each with its own Flash and RAM memories, DMA, CLA, FPU and VCU. Both systems can be synchronised via IPC (Inter Processor Communication) using a shared memory block, resulting in an aggregate performance of 4 x 200MHz = 800MHz (two C28 cores plus two CLAs). 

If a single-core C2000 Delfino TMS320F2837xS MCU is used for the exemplary dual motor drive project, the following internal logical assignment can be used: 

• The C28 core is responsible for the controller of motor 1, the HMI and the CAN stacks. Ample processing power is available here thanks to the FPU, the TMU and a clock frequency of 200MHz. 

• The CLA co-processor is responsible for the controller of motor 2. It can access all key peripherals of the MCU. Synchronisation with the C28 core is achieved via IPC and shared memory. 
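The hand-over of a setpoint from the C28 core to the co-processor through shared memory could look roughly like the following host-side sketch. All names are hypothetical. The sketch assumes a single writer and a single reader, and it assumes that reads and writes of the individual fields are indivisible on the target, which has to be verified for the real device and its message RAMs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mailbox placed in memory visible to both cores (illustrative layout). */
typedef struct {
    volatile float    speed_ref;   /* payload: speed setpoint for motor 2 */
    volatile uint16_t seq;         /* incremented after each new payload  */
} SetpointMailbox;

/* Writer side (main core): payload first, sequence counter last. */
static void mailbox_post(SetpointMailbox *mb, float speed_ref)
{
    assert(mb != 0);
    mb->speed_ref = speed_ref;
    mb->seq = (uint16_t)(mb->seq + 1u);
}

/* Reader side (co-processor task): returns true and copies the payload
   if something new has been posted since the previous successful poll. */
static bool mailbox_poll(const SetpointMailbox *mb, uint16_t *last_seq,
                         float *speed_ref)
{
    const uint16_t seq = mb->seq;
    if (seq == *last_seq) {
        return false;
    }
    *speed_ref = mb->speed_ref;
    *last_seq = seq;
    return true;
}
```

The reader never blocks: a control task polls the mailbox once at its start and otherwise keeps using the previous setpoint, so the timing of the control loop does not depend on the communication side.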


In case of hard real-time requirements, it makes sense to offload critical software paths including digital controllers to additional co-processors available in the system (e.g. the CLA of the C2000 real-time MCU series or the PRU of the AM437 Sitara MPUs). As these co-processors feature low latency, highly deterministic behaviour and true parallel processing, complex systems can now be implemented in closed MCU systems. Based on IPC, the close internal coupling between the co-processor and the main processor enables a structured software development including a C compiler and multi-core debugging. 

Furthermore, dedicated co-processors like the TMU and the VCU can greatly accelerate the execution of various algorithms including trigonometric functions (Clarke and Park transformation, SinCos resolver) for reduced system-level latency. 

In many cases, this will obviate the need to implement costly multi-CPU systems which would result in increased software maintenance and testing overhead. 
