1 Introduction

The \(\mathbb {K}\) semantic framework is a program analysis environment based on term rewriting [1]. Users define the formal semantics of a target programming language, and the \(\mathbb {K}\) framework provides a series of formal analysis tools specialized for that language, such as a symbolic execution engine, a semantic debugger, a systematic checker for undesired behaviors (model checker), and even a fully fledged deductive program verifier. Our tool, RV-Match, is based on the \(\mathbb {K}\) framework instantiated with the publicly available C11 semantics [6, 7], a rigorous formalization of the current ISO C11 standard [10]. We have specially optimized RV-Match for executing C programs and detecting errors in them.

Unlike modern optimizing compilers, whose goal is to produce binaries that are as small and as fast as possible, even at the expense of accepting programs that may be semantically incorrect, RV-Match aims at mathematically rigorous dynamic checking of programs for strict conformance with the ISO C11 standard. A strictly-conforming program is one that does not rely on implementation-specific behaviors and is free of the most notorious feature of the C language: undefined behavior. Undefined behaviors are semantic holes left by the standard for implementations to fill in; they are the source of many subtle bugs and security issues [9].
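To make the distinction concrete, the following snippet (our own illustration, not an example from the paper) contains one implementation-defined construct and one undefined one; a strictly-conforming program would avoid both:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
  int n = INT_MAX;

  /* Implementation-defined: the result of right-shifting a negative
     value depends on the implementation (C11 6.5.7). */
  int shifted = -8 >> 1;

  /* Undefined: signed integer overflow (C11 6.5). The standard places
     no requirements on the program's behavior from this point on. */
  int overflow = n + 1;

  printf("%d %d\n", shifted, overflow);
  return 0;
}
```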

Running RV-Match. Users interface with RV-Match through the kcc executable, which behaves as a drop-in replacement for compilers like gcc and clang. Consider a file undef.c with contents:

[Listing: contents of undef.c (not reproduced here)]
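As a stand-in for the missing listing, a minimal undef.c exhibiting undefined behavior might look like the following (a hypothetical example, not necessarily the one used in the paper):

```c
int main(void) {
  int a[4] = {0, 1, 2, 3};
  return a[4];  /* undefined: reads one element past the end of a */
}
```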

We compile the program with kcc just as we would with gcc or clang. This produces an executable named a.out by default, which, for strictly-conforming, valid programs, behaves just like an executable produced by any other compiler. For undefined or invalid programs, however, kcc reports errors and exits if it cannot recover:

[Listing: kcc error report for undef.c (not reproduced here)]

In addition to location information and a stack trace, kcc also cites relevant sections of the standard [10].

2 Practical Semantics-Based Program Analysis

Unlike similar tools, we do not instrument an executable produced by a separate compiler. Instead, RV-Match directly interprets programs according to a formal operational semantics. The semantics treats the three main phases of a C implementation separately: compilation, linking, and execution. The first two phases together form the “translation” semantics, which we extract into an OCaml program executed by the kcc tool. The kcc tool then translates C programs according to the semantics, producing an abstract syntax tree as the result of the compilation and linking phases. This AST becomes the input to another OCaml program extracted from the execution semantics.

The tool on which we have based our work was originally created as a way of testing the correctness of the operational semantics from which it was extracted [7], but the performance and scalability limitations of that original version made it impractical for analyzing real programs. To address this, we have improved the tool on several fronts:

  • OCaml-based execution engine. We implemented a new execution engine that interprets programs according to a language semantics three orders of magnitude faster than our previous Java-based version. For this improvement in performance, we take advantage of the optimized pattern matching implemented by the OCaml compiler, a natural fit for \(\mathbb {K}\) framework semantics. In the course of this work, we uncovered and fixed a few limitations of the OCaml compiler itself in dealing with very large pattern-match expressions.

  • Native libraries. Previous versions of our tool required all libraries to be given semantics (or their C source code) before they could be interpreted. We now support linking against and calling native libraries, automatically marshalling data to and from the representation used in the semantics.

  • Expanded translation phase. In our C semantics, we now calculate the type of all terms, the values of initializers, and generally do more evaluation of programs during the translation phase. Previously, much of this work was duplicated during execution.

  • Error recovery and implementation-defined behavior. We have implemented error recovery and expanded support for implementation-defined behavior. Programs generated by older versions of kcc would halt when encountering undefined or implementation-defined behavior. Our new version of kcc gives semantics for many common undefined behaviors so the interpreter can continue with what was likely the expected behavior after reporting the error. Similarly, we have added support for implementation profiles, giving users an easy way to parameterize the semantics over the behaviors of common C implementations.

  • Scope of errors. We have also expanded the breadth of the errors reported by kcc to include bad practices and errors involving standard library functions.

These improvements have allowed kcc to build and analyze programs in excess of 300k lines of code, including the BIND DNS server.

Performance evaluation. To give an idea of the extent of the performance enhancements over previous versions of our tool, consider this simple program that calculates the sum of the integers between 0 and 10000:

[Listing: program summing the integers between 0 and 10000 (not reproduced here)]
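A sketch of such a program, assuming it simply accumulates the integers from 0 to 10000 in a loop and prints the total, might be:

```c
#include <stdio.h>

int main(void) {
  long sum = 0;  /* long avoids any risk of int overflow on small targets */
  for (int i = 0; i <= 10000; i++) {
    sum += i;
  }
  printf("%ld\n", sum);  /* prints 50005000 */
  return 0;
}
```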

In the table below, we compare the time in seconds to compile and run this program five times with an old version of our tool [9] against our new version using the OCaml execution engine. The first and second rows report the average time for the five compilations and the five runs, respectively, and the third reports the sum of all runs plus the average compilation time, simulating the case of a compiled test being run on different inputs.

[Table: compilation and execution times, in seconds, for the old and new versions of the tool]
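In symbols (our notation, not the paper's), the quantity in the third row is

\[ \overline{T}_{\text{compile}} + \sum_{i=1}^{5} T_{\text{run},i}, \]

where \(\overline{T}_{\text{compile}}\) is the average compilation time and \(T_{\text{run},i}\) are the individual run times.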

3 Evaluation

Of course, many other tools exist for analyzing C programs. In this section, we compare RV-Match with some popular C analyzers on a benchmark from Toyota ITC. We also briefly report on our experience running our tool on the SV-COMP benchmark. The other tools we consider are:

  • GrammaTech CodeSonar is a static analysis tool for identifying “bugs that can result in system crashes, unexpected behavior, and security breaches” [8].

  • MathWorks Polyspace Bug Finder is a static analyzer for identifying “run-time errors, concurrency issues, security vulnerabilities, and other defects in C and C++ embedded software” [11].

  • MathWorks Polyspace Code Prover is a tool based on abstract interpretation that “proves the absence of overflow, divide-by-zero, out-of-bounds array access, and certain other run-time errors in C and C++ source code” [12].

  • Clang UBSan, TSan, MSan, and ASan (version 3.7.1) are clang modules that instrument compiled binaries with mechanisms for detecting, respectively, undefined behavior, data races, uninitialized reads, and various memory issues [5].

  • Valgrind Memcheck and Helgrind (version 3.10.1, GCC version 4.8.4) are tools for instrumenting binaries for the detection of several memory and thread-related issues (illegal reads/writes, use of uninitialized or unaddressable values, deadlocks, data races, etc.) [13].

  • The CompCert C interpreter (version 2.6) uses an approach similar to our own. It executes programs according to the semantics used by the CompCert compiler [3] and reports undefined behavior.

  • Frama-C Value Analysis (version sodium-20150201), like Code Prover, is a tool based on static analysis and abstract interpretation for catching several forms of undefinedness [4].

Fig. 1. Comparison of tools on the 1,276 tests of the ITC benchmark. The numbers for the GrammaTech and MathWorks tools come from [14]. (Color figure online)

  • A highlighted entry indicates the best score in a category for a particular metric.

  • DR, \(\overline{\textsf{\textit{FPR}}}\), and PM are, respectively, the detection rate, the complement of the false positive rate, and the productivity metric.

  • The final average is weighted by the number of tests in each category.

  • Italics and a dash indicate categories for which a tool has no support.

The Toyota ITC benchmark [14]. This publicly-available benchmark consists of 1,276 tests, half with planted defects, meant to evaluate the defect detection capability of analysis tools, and the other half without defects, meant to evaluate the false positive rate. The tests are grouped into nine categories: static memory, dynamic memory, stack-related, numerical, resource management, pointer-related, concurrency, inappropriate code, and miscellaneous.

We evaluated RV-Match along with the tools mentioned above on this benchmark. Our results appear in Fig. 1, and the tools we used for our evaluation are available online. Following the method of [14], we report the values of three metrics: DR is the detection rate, the percentage of tests containing errors in which the error was detected; \(\overline{\textsf{\textit{FPR}}} = 100 - \textsf{\textit{FPR}}\), where FPR is the false positive rate; and PM is a productivity metric, \(\textsf{\textit{PM}} = \sqrt{\textsf{\textit{DR}} \cdot \overline{\textsf{\textit{FPR}}}}\), the geometric mean of DR and \(\overline{\textsf{\textit{FPR}}}\).
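As a concrete reading of these definitions (with illustrative numbers, not results from Fig. 1), a tool that detects 64% of the planted defects while reporting no false positives would score

\[ \textsf{\textit{DR}} = 64, \qquad \overline{\textsf{\textit{FPR}}} = 100 - 0 = 100, \qquad \textsf{\textit{PM}} = \sqrt{64 \cdot 100} = 80. \]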

Interestingly, and similarly to our experience with the SV-COMP benchmark mentioned below, running RV-Match on the Toyota ITC benchmark uncovered a number of flaws in the benchmark itself, both in the form of unintended undefined behavior and in the form of tests that were meant to contain a defect but were actually correct. Our fixes for these issues were accepted by the Toyota ITC authors, and we used the fixed version of the benchmark in our experiments. Unfortunately, we do not have access to the MathWorks and GrammaTech static analysis tools, so in Fig. 1 we have reproduced the results reported in [14]. Thus, it is possible that the metrics reported for those tools are off by some amount.

The SV-COMP benchmark suite. This suite consists of a large number of C programs used as verification tasks during the International Competition on Software Verification (SV-COMP) [2]. We used RV-Match to analyze 1,346 programs classified as correct and observed that 188 (14%) of them exhibited undefined behavior. The issues ranged from uses of uninitialized values in expressions, potentially invalid conversions, and incompatible declarations to more subtle strict-aliasing violations. Our detailed results are available online.
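As an illustration of the last category (our own example, not one of the SV-COMP programs), the following snippet violates C11's effective-type (strict aliasing) rules by accessing a float object through an incompatible int lvalue:

```c
#include <stdio.h>

/* Reinterprets the bytes of a float through an int pointer. Accessing
   an object through an lvalue of an incompatible type violates C11's
   strict aliasing rules, even on platforms where int and float happen
   to have the same size. */
static int bits_of(float *f) {
  return *(int *)f;
}

int main(void) {
  float x = 1.0f;
  printf("%d\n", bits_of(&x));
  return 0;
}
```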

4 Conclusion

We have presented RV-Match, a semantics-based ISO C11 compliance checker. It achieves a better detection rate than the other tools we considered, and it reports no false positives. Moreover, our experience of finding undefined behavior even in the presumed-correct programs of the above benchmarks demonstrates the tool's usefulness.

We do not claim, however, that our approach is simply better than those represented by the other tools; we see our technology as complementary. Static analysis tools, for example, are more forgiving in that they can analyze code that does not even compile, so they can help find errors earlier, and they typically analyze all of the code in a single run. Our tool, on the other hand, like all dynamic analysis tools, generally requires the program to actually execute in order to detect most errors, and it only examines the code that is actually executed, so it is best combined with existing testing infrastructure (e.g., by running unit tests with kcc).