Use benchmarks to assess static analysis tools

Author: Paul Anderson, GrammaTech

05 October 2016

Researchers from Toyota published a paper entitled "Test Suites for Benchmarks of Static Analysis Tools" at the 26th IEEE International Symposium on Software Reliability Engineering (ISSRE).

This was a follow-up paper to one the same team had published in 2014 called "Quantitative Evaluation of Static Analysis Tools," which has since been withdrawn for IP reasons.

In this new paper, the authors use the same (although slightly restructured) benchmark suite as before to measure how well the tools find defects. CodeSonar was one of the three top-performing tools used in the study.

CodeSonar always does very well in these studies. It's designed for safety-critical systems and therefore typically finds the most software defects. The Toyota studies are no exception. CodeSonar performs measurably better overall than both of the other tools. As far as I am aware, CodeSonar has placed first (or tied for first) in every head-to-head comparison of this kind.

In the recent Toyota study, however, there was one category of bugs—Stack-Related Defects—in which CodeSonar scored zero. This didn’t seem right, so I decided to look more closely. What I found illustrates some of the disadvantages of these kinds of benchmark suites, reinforcing something I have been saying for years—that benchmarks can be misleading, and you should never make a decision on which tool to deploy based on benchmark results alone.

In the Stack-Related Defects category, I looked into one of the warning classes, Stack Overflow. The examples are all similar: an extremely large buffer is allocated on the stack. Here’s an example:

void st_overflow_003_func_001 (st_overflow_003_s_001 s)
{
        char buf[524288]; /* 512 Kbytes */ /*Tool should detect this line as error*/
                          /*ERROR:Stack overflow*/
        s.buf[0] = 1;
        buf[0] = 1;
        sink = buf[idx];
}

A buffer of that size is likely to exceed the stack capacity of most machines, so it's reasonable to expect a static analysis tool to report it as a potential problem.

In fact, CodeSonar can find stack overflows of this kind: it lets you specify the maximum stack size, and it finds call paths where that capacity is exceeded. So why weren't these examples detected? The answer is simple: that particular checker isn't enabled in CodeSonar by default, and the paper's authors presumably did not notice that they had to enable it explicitly. Once I turned it on, CodeSonar detected the overflow as expected.
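To make the idea of "call paths where that capacity is exceeded" concrete, here is a minimal sketch of a worst-case stack-depth check over a call graph. It is not CodeSonar's algorithm or configuration; the func_info type, the function names, and the assumption of a recursion-free call graph with known frame sizes are all hypothetical, chosen only for illustration.

```c
#include <stddef.h>

#define MAX_CALLEES 8

/* Hypothetical call-graph node: the function's own frame size in bytes
   and the table indices of the functions it calls. */
typedef struct {
    size_t frame_bytes;
    int callees[MAX_CALLEES];
    int num_callees;
} func_info;

/* Worst-case stack bytes used by any call path starting at function fn.
   Assumption: the call graph is acyclic (no recursion). */
size_t worst_case_stack(const func_info *funcs, int fn)
{
    size_t deepest = 0;
    for (int i = 0; i < funcs[fn].num_callees; i++) {
        size_t d = worst_case_stack(funcs, funcs[fn].callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return funcs[fn].frame_bytes + deepest;
}

/* Returns 1 if some call path from entry needs more than limit_bytes. */
int exceeds_stack_limit(const func_info *funcs, int entry, size_t limit_bytes)
{
    return worst_case_stack(funcs, entry) > limit_bytes;
}
```

With a caller whose callee declares a 524288-byte local buffer, a configured limit of, say, 64 KB would be exceeded on the path through that callee. The key point is that the check only means something once a limit has been supplied, which is one plausible reason for such a checker to be off by default.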

A second warning class in the study’s Stack-Related Defects category was Stack Underflow. Again, CodeSonar failed to detect any of those defects, which seemed unlikely to me because it has checkers for a few different kinds of underflows. All of the examples were similar: an index into a stack-allocated buffer was decremented in a loop whose termination condition didn't check whether the index could go negative. Here’s the code:

void st_underrun_001 ()
{
        char buf[10];
        strcpy(buf, "my string");
        int len = strlen(buf) - 1;
        while (buf[len] != 'Z')
        {
                len--; /*Tool should detect this line as error*/
                       /* Stack Under RUN error */
        }
}

The termination condition depends only on the character at the current index. This is an example of a sentinel-based search, which explains why CodeSonar didn't flag it. It's difficult for static analysis tools to reason precisely about the position of sentinels, and CodeSonar doesn't report defects if it can't do so with reasonable precision.
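For contrast, here is a sketch of the same backward search with the index bound made part of the termination condition. The function name find_last and its signature are hypothetical and do not come from the benchmark suite. Because the loop stops when the index reaches the start of the buffer, the underrun cannot occur regardless of the buffer's contents.

```c
#include <stddef.h>

/* Searches backward from the last element of buf for target.
   Returns the index of the match, or -1 if target does not occur.
   The loop is bounded by the index, not by a sentinel character. */
long find_last(const char *buf, size_t len, char target)
{
    for (size_t i = len; i > 0; i--) {
        if (buf[i - 1] == target)
            return (long)(i - 1);
    }
    return -1;
}
```

Called as find_last("my string", 9, 'Z'), this returns -1 rather than reading below the start of the buffer.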

When we develop checkers for CodeSonar, we try to balance recall (the ability to find a real bug) against precision (the proportion of results that are true positives). If the precision is too low, then the checker is typically useless because the noise drowns out the signal. Of course, we could adapt our underrun checker to report this flaw, but due to the inherent difficulty of the problem in general, the precision would be too low for it to be useful. Unfortunately, precision is one aspect of analysis that studies such as this are very poor at assessing.
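The two measures can be written down in a few lines. The sketch below is only an illustration of the definitions given above; the checker_tally type and its field names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the study prescribes.

```c
/* Hypothetical tally of one checker's results on code with known defects. */
typedef struct {
    unsigned true_positives;   /* warnings that point at real defects */
    unsigned false_positives;  /* warnings reported on correct code */
    unsigned false_negatives;  /* real defects that drew no warning */
} checker_tally;

/* Proportion of reported warnings that are real defects. */
double precision(checker_tally t)
{
    unsigned reported = t.true_positives + t.false_positives;
    return reported == 0 ? 0.0 : (double)t.true_positives / reported;
}

/* Proportion of real defects that were reported. */
double recall(checker_tally t)
{
    unsigned actual = t.true_positives + t.false_negatives;
    return actual == 0 ? 0.0 : (double)t.true_positives / actual;
}
```

As a made-up example, a checker that issues 100 warnings, 9 of them real, while missing 1 defect has a recall of 0.9 but a precision of only 0.09: it finds nearly every bug, yet more than nine warnings in ten are noise. A benchmark that only counts defects found rewards the first number and never sees the second.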

This example is highly misleading. A casual reader of the paper would incorrectly conclude that CodeSonar is completely incapable of finding stack underruns. In reality, the examples chosen aren't very representative of all possible stack underruns; examples like those are IMHO not even likely to occur in the wild. There are plenty of other underrun examples that CodeSonar (and other static analysis tools) would be good at finding.

Benchmarks such as these can be useful at times; they can help identify weak spots in tool coverage, and may be helpful in comparing how different tools report the defects. They are typically easy to use and to understand.

If benchmarks have been designed to target a particular application domain, they can yield helpful, specific insights. For the same reason, however, their results don't transfer well between domains: benchmarks targeted at reliability problems in safety-critical medical devices, for instance, are unlikely to have much in common with those targeted at security problems in server software.

In any domain, benchmarks are most useful when they are models of code found “in the wild,” but that's essentially impossible to do without sacrificing some of the positive aspects of benchmarks. Don't think of these kinds of micro-benchmarks as real bugs, but rather as caricatures of bugs. Although they can be useful to assess some limited aspects of static analysis tools, it would be unwise to make a purchase decision based solely on benchmark results. For example, a tool might do well on a micro-benchmark but be useless on real code because of weak precision, poor performance, or failure to integrate well into the workflow. The best way to assess a tool's effectiveness is to try it on real code, make sure it's configured correctly, and judge the results rationally.

Static analysis benchmarks

Many programmers would agree that static analysis is pretty awesome: it can find code defects that are very hard to find using testing and walkthroughs. On the other hand, some scientific validation of the effectiveness of static analysis would be useful. For example, this nice 2004 paper found that when five analysers were turned loose against some C programs containing buffer overflows, only one of them was able to beat random guessing in a statistically significant fashion.

More recently, in 2014, some researchers at the Toyota InfoTechnology Center published a paper evaluating six static analysers using a synthetic benchmark suite containing more than 400 pairs of C functions. Each pair contains one function exemplifying a defect that should be found using static analysis and another function that looks fairly similar but does not contain the defect. The idea is that a perfect tool will always find the bug and not find the not-bug. The results show that different tools have different strengths; for example, CodeSonar dominates in detection of concurrency bugs but PolySpace and QA-C do very well in finding numerical bugs. With a few exceptions, the tools do a good job in avoiding false positives.

Shin’ichi Shiraishi, the lead author of the Toyota paper, contacted me asking if I’d help them release their benchmarks as open source, and I have done this. I am very happy that Shin’ichi and his organisation were able to release these benchmarks and we hope that they will be useful to the community.
