A new methodology for debugging fleets of deployed IoT devices
01 June 2021
The unhappy truth is that IoT devices are deployed with bugs. With thousands of lines of embedded code, and infinite ways to use them in the field, checking every possible combination of events before deployment is almost impossible.
This article was originally featured in EPDT's H1 2021 IoT & Industry 4.0 supplement, included in the June 2021 issue of EPDT magazine.
Analysing every edge case & corner case in verification is also costly & time-consuming. Here, Johan Kraft, CEO & founder at embedded software tools provider Percepio, outlines an alternative approach…
So, when a device is deployed in the field, sooner or later, problems are going to happen. To be fair, this may be rare for individual devices, but when tens of thousands, or even millions, of sensor nodes are deployed across the IoT, those problems will appear when least expected.
Research shows up to 100 bugs can be introduced per thousand lines of written code and about 5% of these remain in devices in deployment. While some bugs are harmless, about 20% are serious. Assuming a project has 50,000 lines of code, then up to 50 major defects can still remain in the node deployed in the field. When there are millions of devices in the field, this becomes a major challenge.
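The arithmetic behind that estimate can be sketched in a few lines of Python. The function name and parameter names are illustrative assumptions; the rates are the upper-bound figures quoted above, not measurements from any particular project.

```python
def estimate_defects(lines_of_code, bugs_per_kloc=100,
                     shipped_percent=5, serious_percent=20):
    """Return (introduced, shipped, serious) defect estimates.

    Integer arithmetic is used so the result is exact for round inputs.
    """
    introduced = lines_of_code * bugs_per_kloc // 1000  # bugs written into the code
    shipped = introduced * shipped_percent // 100       # bugs still present at deployment
    serious = shipped * serious_percent // 100          # shipped bugs that are serious
    return introduced, shipped, serious


introduced, shipped, serious = estimate_defects(50_000)
print(introduced, shipped, serious)  # 5000 250 50
```

For the 50,000-line project, that is up to 5,000 bugs introduced, about 250 still present at deployment and up to 50 of those serious.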
Part of this challenge is being able to monitor and track the performance of these nodes for every customer, especially when there is no indication when a problem is likely to occur. Relying on users to flag a problem is not a dependable option: they may not notice the issue or, instead of reporting it, may simply reboot the local system to ‘fix’ the glitch. Customer support tends not to be much help either. Even if a problem is reported, and even if the report is clear, the problem may not be identified as a software issue.
Another way to tackle this challenge is to add a small piece of software to the node firmware to monitor the performance and provide instant feedback when unexpected software behaviour occurs. This software agent needs to be tiny, but well connected both inside and out. The small size of the code matters in systems where every byte is vital. At the same time, the supporting communication infrastructure has to be able to cope with millions of nodes in an agnostic way, sending specific information about the issue and the events that led to the problem. This information – especially when presented in an intuitive, visual way – makes it much easier for developers to understand the root cause of the problem and solve it properly.
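To make the idea concrete, here is a minimal sketch of such an agent: it keeps only the most recent events in a small fixed-size buffer and, when a check fails, sends an alert that carries those events as context. It is written in Python purely for illustration; a real agent on a microcontroller would more likely be written in C. The class name, field names and the `send` callback are all hypothetical and do not represent any vendor's API.

```python
from collections import deque


class FirmwareMonitor:
    """Hypothetical monitoring agent with a bounded event history."""

    def __init__(self, device_id, capacity=8, send=print):
        self.device_id = device_id
        # deque(maxlen=...) drops the oldest entry automatically, so memory
        # use stays fixed no matter how long the device runs.
        self.events = deque(maxlen=capacity)
        self.send = send  # transport callback (assumed; e.g. a cloud upload)

    def record(self, tick, event):
        """Log one software event with its timestamp (tick count)."""
        self.events.append((tick, event))

    def check(self, tick, condition, symptom):
        """Send an alert with the recent event history if condition is false."""
        if condition:
            return None
        alert = {
            "device": self.device_id,
            "tick": tick,
            "symptom": symptom,
            "trace": list(self.events),  # events leading up to the problem
        }
        self.send(alert)
        return alert


# Usage: collect alerts in a list instead of sending them anywhere.
sent = []
monitor = FirmwareMonitor("node-0001", capacity=3, send=sent.append)
for tick, event in enumerate(["boot", "sensor_read", "radio_tx", "sensor_read"]):
    monitor.record(tick, event)
monitor.check(tick=4, condition=False, symptom="sensor_timeout")
print(sent[0]["trace"])
```

With a capacity of three, the oldest event ("boot") has already been discarded by the time the alert is raised, so the alert carries only the three most recent events. Bounding the history this way is what keeps the footprint small on a memory-constrained node.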
One of the advantages of the IoT and Industry 4.0 is that the sensors and nodes are connected. That infrastructure can be used to send the diagnostic data to the cloud. This can then notify developers of a problem and forward the details to them to provide the basis for continuous software improvement. The problem can be analysed quickly, and a patch developed. This patch can then be downloaded to the node to fix the problem without having to wait for a regular update – all potentially without the user, or the customer, even realising that there was a problem in the first place.
But the real power of such an architecture comes into play when dealing with the IoT at scale. In the cloud, every incoming alert can be compared with earlier alerts from the entire device fleet, grouping them into specific issues, and visualising trends on a dashboard. This provides a clear overview and avoids repeated notifications about the same issue. Any repetitions of previously detected issues are counted and shown in the dashboard, which allows developers to prioritise issues efficiently.
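The grouping step can be sketched as follows, assuming each alert carries a firmware version, a symptom and a code location that together identify an issue. Those field names and the fingerprint rule are assumptions made for this example; the article does not specify how a production service classifies alerts.

```python
from collections import Counter


class AlertGrouper:
    """Hypothetical cloud-side grouping of alerts into issues."""

    def __init__(self):
        self.counts = Counter()  # issue fingerprint -> number of alerts seen

    @staticmethod
    def fingerprint(alert):
        # Alerts with the same firmware, symptom and location are treated
        # as the same issue, whichever device reported them.
        return (alert["firmware"], alert["symptom"], alert["location"])

    def ingest(self, alert):
        """Count the alert; return True only the first time an issue is seen."""
        key = self.fingerprint(alert)
        self.counts[key] += 1
        return self.counts[key] == 1

    def dashboard(self):
        """Issues ordered by how often they have occurred, most frequent first."""
        return self.counts.most_common()


alerts = [
    {"device": "node-0001", "firmware": "1.4.2",
     "symptom": "sensor_timeout", "location": "sensor.c:88"},
    {"device": "node-0417", "firmware": "1.4.2",
     "symptom": "sensor_timeout", "location": "sensor.c:88"},
    {"device": "node-0093", "firmware": "1.4.2",
     "symptom": "stack_overflow", "location": "net.c:210"},
]

grouper = AlertGrouper()
notify = [grouper.ingest(alert) for alert in alerts]
print(notify)  # [True, False, True]
for issue, count in grouper.dashboard():
    print(count, issue)
```

Only the first and third alerts would trigger a developer notification; the second is a repeat of a known issue and simply raises its count, which is what moves that issue up the dashboard.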
This approach to finding and fixing bugs is invaluable as IoT applications roll out at scale, whether this is for smart bins collecting refuse or monitoring the cold chain delivering vaccines to tackle the COVID-19 pandemic.
For example, smart refuse systems use sensors and wireless links to determine when a bin is full and needs to be collected, rather than relying on a set schedule. Fleet management software and intelligent routing then determine the most cost-effective retrieval. This requires complex monitoring, and a bug can put a serious spanner in the works. While one rubbish bin not being collected may not sound like a disaster, angry calls from tens or even hundreds of customers with overflowing bins are not what an operator wants to deal with. It is also vital to know whether a sensor node has failed as a result of the sensor, the communications link or other parts of the system.
Similarly, monitoring the ambient and air temperature across an entire cold chain distribution highlights weak points that can be addressed, and ensures that products such as vital vaccines arrive effective at the point of use without having been compromised. This requires extensive monitoring and logging, with data uploaded to the cloud for analysis.
Capturing the errors as they occur, notifying developers immediately and fixing the bugs with an over-the-air (OTA) update has tremendous value for the operator, not least in saving hundreds of thousands of dollars in fixing the problems in the field.
The technology in detail
The techniques for monitoring, finding, relaying and fixing elusive bugs in field-deployed IoT devices are based on a DevOps-style philosophy for continuous software improvement. The integrated solution consists of the DevAlert Firmware Monitoring (DFM) client that makes it easy to report errors, along with any other condition that is observable in the device software, including delays or performance issues. Within seconds of an issue occurring, an alert is sent to the development team.
Tapping into Tracealyzer technology, the alert includes a compact software trace that provides a visual timeline of software events just before the issue was reported. The human brain is visual and excels at pattern recognition, so the visual presentation of trace diagnostics and their context is crucial to understanding and fixing the problem promptly.
A fully managed cloud service hosted by Percepio complements the DFM client. This cloud service is responsible for classification, statistics and sending out notifications to developers, and one of its main features is to detect duplicate alerts to make sure that only alerts not previously seen trigger a developer notification.
Based on Amazon Web Services (AWS) serverless technology, this cloud service scales to very large device fleets and the design has been reviewed by AWS to ensure that it is consistent with current best practices. Security and privacy are maintained at all times as sensitive data, such as the recorded software traces, are kept in the customer’s cloud account and can only be viewed by the device developers.