Scalable platforms for embedded AI
Author: Markus Levy | Vice President, Marketing & Business Development | Kinara
01 September 2022
Since its emergence little more than a decade ago, artificial intelligence (AI) based on deep learning has moved from academia into mainstream commercial applications. As Markus Levy, VP of Marketing & Business Development at edge AI solutions developer Kinara, tells us here, it is now reshaping embedded computing as users demand increased AI compute capabilities on edge devices – for applications across security, medical, retail, robotics & more...
This article was originally featured in the September 2022 issue of EPDT magazine.
However, the need to add AI to applications poses a problem for teams working with embedded devices, since it is always easier to extend an existing system than to move to an entirely new and different platform. Moving to a new host platform presents a time-to-market issue associated with porting, and it also runs the risk that the new platform lacks the long-term support available from existing hardware-platform suppliers such as NXP Semiconductors – support that is essential in many embedded computing markets.
Running AI models in real time requires a high level of energy-efficient processing. Though it is far less compute-intensive than the training phase – which requires multiple passes through the entire parameter set for each new training sample – the inferencing that each deployed model performs still requires millions of calculations to parse each successive video frame. In many current deployments of embedded AI systems, the inference is offloaded to the cloud – but the need to reduce latency brings the requirement that the data be held and inferenced locally, by the edge appliance or the sensor (for instance, the camera) itself.
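For a sense of scale: MobileNetV2, a comparatively lightweight vision model, needs roughly 300 million multiply-accumulate operations for a single 224 × 224 input – so at 30 frames per second, one video stream already demands on the order of nine billion multiply-accumulates every second, before any pre- or post-processing is counted.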
This requirement can be met by embedded AI platforms that couple one or more inference accelerator IP blocks with a general-purpose processor on a single SoC (system-on-chip) – for example, NXP’s i.MX 8M Plus or Qualcomm’s QCS610. However, such all-in-one implementations often lack the granularity and scalability to address real-world use cases, whose requirements expand towards faster frame rates and higher-resolution inputs.
Augmenting the system with AI accelerators
The best of all worlds can be accomplished by extending an existing system with the addition of a dedicated edge AI accelerator. This allows embedded developers to stay with the processor(s) of their choice – while incrementally adding the performance that is required to handle increasingly advanced AI workloads. One such edge AI accelerator is Kinara’s Ara-1. Each Ara-1 accelerator can run multiple AI models with zero overhead and no additional load on the host processor, providing the ability to perform parallel tasks.
Furthermore, if higher system performance or frame rates are needed, multiple Ara-1 accelerators can be used in parallel, communicating with the host processor over a standard high-speed PCIe interconnect or a USB 3.2 interface. In this scenario, incoming data to be inferenced can be distributed across the Ara-1 devices in the system and reassigned dynamically for load balancing.
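As an illustration of how such load balancing might be orchestrated on the host, the sketch below distributes frames across several accelerator handles via a shared work queue, so that whichever device finishes first picks up the next frame. The `AcceleratorHandle` class and its `infer` method are hypothetical stand-ins, not the actual Kinara runtime API.

```python
# Minimal host-side load-balancing sketch. 'AcceleratorHandle' and 'infer'
# are hypothetical placeholders for a vendor runtime API, not the Kinara SDK.
import queue
import threading

class AcceleratorHandle:
    """Stand-in for one attached accelerator (e.g. an Ara-1 on PCIe or USB)."""
    def __init__(self, device_id: int):
        self.device_id = device_id

    def infer(self, frame):
        # A real implementation would submit 'frame' to the device and
        # block until the inference result comes back.
        return (self.device_id, f"detections for {frame}")

def run_inference(frames, devices):
    """Spread frames over the devices via a shared queue: each worker thread
    pulls the next frame as soon as its device is free, which balances the
    load automatically even when some frames take longer than others."""
    work = queue.Queue()
    for frame in frames:
        work.put(frame)
    results, lock = [], threading.Lock()

    def worker(device):
        while True:
            try:
                frame = work.get_nowait()
            except queue.Empty:
                return                      # no frames left for this device
            result = device.infer(frame)
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    devices = [AcceleratorHandle(i) for i in range(4)]      # e.g. four Ara-1s
    frames = [f"frame-{n}" for n in range(16)]
    for device_id, detections in run_inference(frames, devices):
        print(device_id, detections)
```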
This dynamic assignment can be important for applications such as tracking people or objects. In these applications, a subject of interest identified by an earlier inferencing step may move out of frame. If another subject later moves in from that area, the system may need to determine whether this is the previously seen subject, performing a re-identification task to attempt a match.
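A common way to implement such re-identification is to store an appearance embedding for each tracked subject and compare new detections against that gallery by cosine similarity. The sketch below shows the matching logic only; in a real system the 128-dimensional vectors would come from a re-identification model running on the accelerator, and the threshold would be tuned on real data.

```python
# Re-identification by embedding similarity - a generic sketch, not tied to
# any particular model or SDK. Random vectors stand in for model outputs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reidentify(query, gallery, threshold=0.7):
    """Return the track ID of the best-matching previously seen subject,
    or None if nothing in the gallery clears the similarity threshold."""
    best_id, best_score = None, threshold
    for track_id, stored in gallery.items():
        score = cosine_similarity(query, stored)
        if score > best_score:
            best_id, best_score = track_id, score
    return best_id

rng = np.random.default_rng(0)
gallery = {1: rng.normal(size=128), 2: rng.normal(size=128)}
query = gallery[2] + 0.05 * rng.normal(size=128)   # subject 2, slightly changed
print(reidentify(query, gallery))                  # -> 2
```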
Application considerations
In applications such as crowd monitoring or security, smart cameras are often placed in less-than-ideal locations, such as a bank lobby where customers can access automated facilities outside of opening hours. The camera may be mounted such that it views the area at a highly acute angle, which is likely to increase the impact of lens distortion. The video may also be degraded by light streaming in through glass doors in the early morning or evening. To cope with these factors, it makes sense to perform extensive processing on the source images before they are passed to the neural network for inferencing.
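As a concrete (if simplified) example of such pre-processing, the sketch below uses standard OpenCV calls to correct lens distortion and even out harsh lighting before inference. It assumes a camera calibration (camera matrix and distortion coefficients) obtained offline, for instance with cv2.calibrateCamera; the numbers in the usage example are placeholders.

```python
# Pre-processing sketch using standard OpenCV calls: undistort the frame,
# then apply CLAHE to the luma channel to tame uneven lighting.
import cv2
import numpy as np

def preprocess(frame, camera_matrix, dist_coeffs):
    # Correct lens distortion caused by the acute camera placement.
    undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
    # Equalise contrast locally (e.g. against low sun through glass doors),
    # operating on luma only so colours are preserved.
    ycrcb = cv2.cvtColor(undistorted, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)              # dummy frame
    K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])  # placeholder
    dist = np.array([-0.25, 0.1, 0.0, 0.0, 0.0])                 # placeholder
    cleaned = preprocess(frame, K, dist)
```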
This additional processing, some of which may be needed before the inferencing phase and some after it, calls for an execution platform that delivers the right balance of performance, power and cost. It must be a platform that a developer can tune in terms of general-purpose, graphics-processing and neural-network performance, each of which may be needed at different points in the overall vision inferencing pipeline.
A flexible software framework for the inference pipeline is also important in supporting a balanced system. For example, the software infrastructure for Ara-1 accelerators, coupled with NXP’s i.MX processing platforms, fully supports the complex, dynamic structure of real-world AI applications. The Kinara software development kit (SDK) is a comprehensive environment for optimising inferencing performance, with a built-in profiler that exposes potential bottlenecks and performance statistics on a per-layer basis. It offers direct support for commonly used models, such as ResNet and MobileNet, as well as more recent, complex models, such as DenseNet, EfficientNet and transformers. Following compilation with the Kinara SDK, models can be deployed and managed on any NXP i.MX applications processor at runtime using C++ or Python application programming interfaces (APIs) or the GStreamer environment.

Figure 1. Basic differentiation between smart camera & edge AI appliance
GStreamer is a library designed for the construction of compute graphs of media-handling components, such as the inference pipeline. The left side of Figure 1 shows a high-level block diagram of the vision and inference pipeline for a smart camera example. Data and results pass between multiple processing elements (for instance, the CPUs or GPU of the NXP i.MX 8M Plus) and accelerators (such as Ara-1) in a compute graph. In a sense, GStreamer provides a plug-and-play approach, because it hides the underlying hardware details, making it easier to switch processors and associated processing elements. For example, a developer can switch from using the CPUs for scaling to using the GPU instead.
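To make the plug-and-play idea concrete, the sketch below builds a simple GStreamer pipeline from Python. All the elements used are standard GStreamer plugins: the identity element merely marks where a vendor-specific inference plugin would sit, and videotestsrc stands in for a real camera or RTSP source.

```python
# A minimal GStreamer pipeline sketch. 'identity' is a pass-through element
# marking where an inference plugin would go; swapping 'videoscale' for a
# GPU-backed scaler would be a one-element change in the description string.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

PIPELINE = (
    "videotestsrc ! videoconvert ! videoscale ! "
    "video/x-raw,width=640,height=480 ! "
    "identity name=inference-stage ! "   # placeholder for the AI element
    "videoconvert ! autovideosink"
)

pipeline = Gst.parse_launch(PIPELINE)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()                    # run until interrupted
```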
To take this example of flexibility a step further, compare the smart camera design to that of an edge AI appliance (see the right side of Figure 1), using the example of a smart retail store – say, a convenience store. Such a store may require 200 cameras, and it might not be practical to upgrade every one of them to ‘be smart’. So instead of the processing being done onboard each camera, the video feeds are routed into an edge AI appliance that could contain an NXP i.MX 8M and multiple Kinara Ara-1 accelerators. This appliance would be capable of processing up to eight video streams (coming in from eight cameras) at 1080p resolution and 30 frames per second. The key point is that the inference pipeline would be similar to that of the camera example, and the use of GStreamer would allow ‘easy’ drop-in of the different pre- and post-processing components (for example, ISP/image signal processing or frame decode).
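To put rough numbers on that workload: eight 1080p streams at 30 frames per second amount to up to 240 frames to be decoded, pre-processed and inferenced every second. In practice, each frame is typically scaled down to the model’s input resolution before inference, and – as described above – the aggregate load can be spread dynamically across the appliance’s multiple Ara-1 devices.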
Altogether, this demands both a hardware platform that delivers execution flexibility and support for software frameworks that allow the creation and manipulation of complex distributed compute graphs.
The combination of scalable hardware performance, model tuning and a flexible execution model provides a solid base for building embedded AI applications around host platforms such as NXP’s i.MX applications processors, with the addition of the Kinara Ara-1 accelerator. This means development teams do not need to switch to unfamiliar host processor architectures in order to leverage the benefits of AI.
Accelerators such as Kinara’s Ara-1 provide the means not only to integrate sophisticated AI models into systems easily, but also to deploy them in complex compute graphs that address customers’ real-world needs. The technology, and the need for it, is growing rapidly as new opportunities for development and deployment emerge daily.
The processing power of Kinara’s Ara-1 accelerator eliminates one of the last perceived time-to-market risks by making the transition to a newer, faster and ultimately more profitable host platform swift and painless, built on well-established technology that has already proven itself.