
Challenges of Using Sound and Complete Static Analysis Tools in Industrial Software

Wouter Stikkelorum

w.g.stikkelorum@gmail.com

30-06-2016

Supervisor: Dr. Magiel Bruntink

Host Organization: Software Improvement Group (SIG)

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
Preface
1 Introduction
  1.1 Research Questions
  1.2 Research Methods
  1.3 Goals
  1.4 Relevance
  1.5 Scope
  1.6 Thesis Outline
2 Background
  2.1 Formal Verification
  2.2 Static Analysis
  2.3 Axiomatic Reasoning and Hoare Logic
  2.4 Separation Logic
  2.5 Shape Analysis and Bi-abduction
  2.6 Infer
3 Research
  3.1 Juliet Test Suite
  3.2 Benchmarking Infer
  3.3 Selecting Open-Source Systems
  3.4 Analyzing Industrial Code
  3.5 Bugs in Infer
  3.6 Flaws in Juliet
4 Results
  4.1 Benchmark Results
    4.1.1 Infer Compared to Other Static Analysis Tools
  4.2 Open-Source Results
  4.3 Industrial Code Results
5 Discussion
  5.1 Improving Benchmark Scores
  5.2 Limitations of Infer
  5.3 Improvements for Infer
    5.3.1 Memory Leak Detection for Java
    5.3.2 Array and Loop Bounds
  5.4 Future Work
6 Conclusion
Bibliography


Abstract

A static analysis tool that is sound and complete will be able to find all the bugs present in software, without reporting false warnings. Knowing that a software system is completely bug free can be very valuable, especially for safety-critical systems. Static analysis tools that use formal verification techniques come closest to being sound and complete, because they can prove whether or not systems have bugs. Running these tools, however, can be very labor intensive, or they might only be applicable to simpler languages. For example, some tools do not support dynamic memory allocation or assume that there are no memory errors.

In the summer of 2015 Facebook open-sourced Infer, a new static analysis tool that uses formal verification techniques. Infer works on real-world software: it can run on large code bases and on languages that dynamically allocate memory, and it does not assume memory safety. To provide developers and companies with more information about Infer, and about what they can expect when running it, we performed an extensive evaluation of Infer, including benchmarking it on the Juliet test suite, running it on a large set of open-source projects and running it on industrial code at the Software Improvement Group (SIG).

As with other static analysis tools, the performance of Infer differs per bug type and programming language, but overall the results are promising and a precision of up to 100% can be seen on the Juliet test suite for multiple Common Weakness Enumerations (CWEs). Good scores were also found when running Infer on industrial software, with results similar to those on the Juliet suite. However, getting Infer to run (especially on Java projects) can be troublesome, as we found when running Infer on a large set of open-source projects.


Preface

This thesis is the final part of my Master programme Software Engineering at the University of Amsterdam. I did my internship at the Software Improvement Group (SIG), where I had the opportunity to be part of the research team for four months. I would like to express my gratitude to Dr. Magiel Bruntink for introducing me to SIG and for supervising this work. I would also like to thank Dr. Barbara Vieira and Dr. Haiyun Xu, my company supervisors, who met with me every week to discuss my progress. Understanding the basics of separation logic would not have been so easy without the help and work of Dr. Marina Stojanovski, who helped me get a head start on the material. A special thanks also goes to all the developers who responded to questions and issues I posted on the Infer GitHub page. Finally, I would like to thank Madzy; getting my Master's degree would have been much harder without your love and support.

Wouter Stikkelorum


Chapter 1

Introduction

Static analysis encompasses a family of techniques for automatically computing information about the behaviour of a program without executing it. However, most questions about the behaviour of a program are either undecidable or computationally infeasible to answer, which means that static analysis tools try to efficiently compute approximate but sound guarantees [25]. A perfect static analysis tool would be sound and complete, meaning that it detects all problems and raises no false alarms.

Formal verification aims to study mathematically based methodologies to validate whether a program rigorously satisfies a given property or a predefined formal specification of the system [27]. Formal verification has the potential to be sound and complete, and is thus used by some static analysis tools to analyze programs. An example is the static analysis tool ASTREE, which was used to prove the absence of run-time errors in Airbus code [10]. While other techniques that find flaws in code exist, such as unit testing and code reviews, they cannot prove that code is free of errors with respect to some specification, as formal verification techniques can. Proving the correctness of a program, however, is very expensive. It requires a lot of manual work from the programmers, who in most cases will have to annotate the source code with pre- and postconditions. Only for safety-critical systems, like the Airbus systems, will the effort therefore be worthwhile.

Building a verification tool which does not require a developer to write annotations is difficult; automation is usually achieved by limiting the input language. ASTREE, for example, works only on input programs that do not use dynamic allocation, and the formal verification tool SLAM, used for Microsoft device drivers, assumes memory safety [3].

In the summer of 2015 a new static analysis tool that uses formal verification, called Infer, was made open-source by Facebook. Infer can reason about programs that use dynamic allocation and about pointer-based data structures. Infer infers pre- and postconditions for functions, which frees developers from writing annotations. It also uses a compositional method, meaning Infer can analyze large and incomplete code bases. Infer allows developers to freely use and modify a tool which, for the first time, brings formal verification methods to industrial-size code. Because Infer was only recently released, no information is available on its performance, capabilities or limitations. Infer brings a new level of automation to sound and complete static analysis tools, which makes it the most suitable candidate for this research.

1.1

Research Questions

The research questions build upon existing work on analyzing and benchmarking static analysis tools, which explains how to benchmark static analysis tools [13], and on work that describes the disadvantages of current static analysis tools and to some extent provides benchmark scores for them [6, 16, 24, 26, 28]. We want to know how well Infer performs and how difficult it is to use. We formulated the following research questions:

1. How accurately can Infer identify flaws?

1.1 What are the benchmark scores for Infer?

1.2 How do the benchmark scores for Infer compare to other static analysis tools?

2. Can Infer be run easily on software systems?

2.1 What is the average bug density found by Infer?

1.2

Research Methods

The performance of a static analysis tool can be measured by categorizing its bug reports as true positive, false positive, true negative or false negative. Performance measures are precision and recall. By using the Juliet test suite we can automatically categorize each bug report from Infer. The test cases in the Juliet suite are grouped by their CWE (Common Weakness Enumeration), which is a standard for grouping bug types.1 This allows us to measure the performance for specific bug types.

Benchmarking Infer on the Juliet test suite will be done in the same way as was described in [13]. We will provide a description for each metric that we calculate and provide the results. This will allow other researchers to replicate and validate the results, or compare them to other tools or improvements in newer versions of Infer. The metrics that will be calculated with the Juliet test suite are: true positives, false positives, false negatives, precision, recall, discrimination rate and F-score.

The open-source projects that we will analyze with Infer will be cloned from GitHub. GitHub provides a large set of open-source projects for all the programming languages that Infer supports. We will clone the most popular projects on GitHub for each programming language supported by Infer, as well as projects used in earlier research done at SIG. To count the lines of source code for each project we will use Cloc.2 Cloc is an open-source command-line tool which reports the exact number of source code lines per programming language. Source code lines are lines of code, not counting comments and blank lines.

The analysis of industrial code will be done on internal SIG systems. The bug reports from Infer will be evaluated with a developer to determine whether the reports are correct and useful.

Because we also want to understand how Infer works and what its limitations are, we will perform two runs on the Juliet test suite. We will manually inspect test cases and reports to understand which constructs may still lead to false warnings from Infer. At the same time this allows us to validate whether Juliet is indeed a proper code base to benchmark static analysis tools on.

1.3

Goals

The goal of this research is to provide an extensive analysis of the performance of Infer. This encompasses benchmark scores on the Juliet test suite as well as running Infer on a large set of open-source and industrial projects. The analysis should provide companies and developers who are considering integrating Infer into their development cycle with enough information to assess whether they will benefit from it. By testing Infer we also aim to deliver feedback to the developers and help debug and improve the tool.

1.4

Relevance

For many organizations high-quality software is important, but the pace of change and the complexity of modern code make it difficult to produce error-free software. Available tooling often falls short in helping developers write more reliable code [8]. Despite their ability to automatically detect bugs in source code, static analysis tools produce too many warnings, and too many false positive warnings [16]. If Infer can indeed outperform other tools, it could be a great help for developers.

1.5

Scope

Besides poor performance and reporting too many false positive warnings, static analysis tools are also considered useless when their output is poorly presented, or when the tool does not present enough information for the developer to fix the problem [16]. Even though a lot can be said about the current interface of Infer (which currently just consists of looking at console output or looking through comma-separated, XML or JSON files), our research will not focus on these aspects. Despite their importance for usability, we will not conduct interviews with developers about the usability and workflow of the user interface of static analysis tools, as was done in [16]. SIG already has several monitors which can integrate the results from Infer, making research about the interface less useful for SIG than research about its performance.

By concentrating just on the performance of Infer and disregarding the current interface, this research is also more focused on the capabilities of the novel techniques used by Infer, which will be more useful for current or future tools using similar techniques.


1.6

Thesis Outline

Chapter 2 explains the techniques used by Infer. This includes the basics of separation logic and the bi-abduction techniques that Infer uses to build proofs, as well as more information about how to run Infer. Chapter 3 explains the research we performed. We explain how we benchmarked Infer on the Juliet test suite and how we selected the open-source projects. This chapter also provides detailed information about the bugs we found and reported in Infer and the flaws we found in the Juliet test suite. Chapter 4 contains all the results from benchmarking and from running open-source and industrial systems. In Chapter 5 we discuss how benchmark scores can be improved by adding models to Infer, as well as the limitations of Infer and suggestions for improvements. We describe how Infer could also detect memory leaks in Java and why Infer still has trouble with array and loop bounds; future work is also discussed. In Chapter 6 the research questions are answered.


Chapter 2

Background

This chapter presents the necessary background information to get a basic understanding of the techniques used by Infer.

2.1

Formal Verification

Like testing, formal verification is concerned with the correctness of a computer program. But while testing can only show the presence of bugs, formal verification can prove their absence. Formal verification exploits the advantages of mathematical logic and uses a program logic to understand the semantics of the language [29]. Common approaches to formal verification are theorem proving, various forms of static analysis and model checking [19]. Because Infer uses static analysis, we will focus in this chapter on formal verification with static analysis.

2.2

Static Analysis

Static analysis is often defined as being the opposite of dynamic analysis [19, 25, 29]. In contrast to dynamic analysis, which actually runs the program, static analysis is done on the source code without executing it. The most common approaches to static analysis are checking for error patterns, data flow analysis, constraint-based analysis, type-based analysis and abstract interpretation. Static analysis is not only used to check for correctness; it is also used in compilation, optimization and code generation [19].

Static analysis techniques can be categorized by the type of analyses they perform. Techniques that take the order of statements into account are called flow sensitive. If only the feasible paths through a program are considered, the analysis is called path sensitive. If method calls are analyzed differently based on their call site, the technique is called context sensitive. Finally, if the body of a method is analyzed in the context of each respective call site, the technique is interprocedural. If one of these conditions is ignored, the analysis technique is flow, path or context insensitive. A technique that does not distinguish the call sites of methods is called intraprocedural [25].
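As an illustration (not from the thesis; the function name flow_example is hypothetical), a minimal C sketch of why flow sensitivity matters:

#include <stddef.h>

/* A flow-sensitive analysis tracks the order of statements and sees that p
   has been reassigned before the dereference; a flow-insensitive analysis
   only sees that p is NULL somewhere and may warn about the dereference. */
int flow_example(void)
{
    int x = 42;
    int *p = NULL;   /* p is NULL here ...               */
    p = &x;          /* ... but points to x from here on */
    return *p;       /* safe for a flow-sensitive analysis */
}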

Static analysis is also used together with formal verification techniques. An example of this is static formal verification based on axiomatic (Hoare-style) reasoning. Such verification techniques rely on three basic components: the program implementation (source code), the program specification and a program logic [29]. The specification of the program describes its individual requirements, which can be generalized for all programs or be program specific. The specification is written in a special mathematical language called the specification language. The program logic is an extension of predicate logic with rules that describe the behavior of each construct of the programming language. The goal of the verification technique is to prove that the implementation matches the specification by deriving a proof in the given program logic [29]. Verifying a program can be done with assertions. Assertions are a basic component of the specification language and describe some functional properties of the program.
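To make the idea of assertions concrete, a minimal sketch (not from the thesis; abs_diff is a hypothetical example) of a pre- and postcondition expressed as a runtime assertion, which a verifier would instead prove statically:

#include <assert.h>

/* precondition: true (any two ints); postcondition: r >= 0
   (integer overflow is ignored for the sake of illustration) */
int abs_diff(int a, int b)
{
    int r = (a > b) ? a - b : b - a;
    assert(r >= 0);  /* the postcondition as an executable assertion */
    return r;
}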

2.3

Axiomatic Reasoning and Hoare Logic

Axiomatic reasoning about computer programs started with the work of Hoare who reasoned about simple sequential programs [14]. Axiomatic reasoning is often called Hoare-style reasoning and consists of specifying the program in terms of pre- and postconditions. This can be done by using a Hoare triple:

{P} S {Q}.

Verifying the program S with respect to its pre- and postcondition P and Q means proving that for any execution of S, if P holds in the prestate of the execution (and if the execution terminates), Q will hold in the poststate of the execution. The assumption that S terminates means that the Hoare triple only expresses partial correctness.

To derive proofs for Hoare triples, standard mathematical logic such as predicate logic is not sufficient, because it only allows proving properties over a (given) stable state. Programs alter state and thus require an extension of mathematical logic with additional rules that describe how the instructions in the program change the state. Hoare logic is a program logic and a deductive system that extends predicate logic with a set of axioms and inference rules [29].

Each of the additional Hoare logic rules describes the behavior of a specific programming language construct. Rules of inference decompose triples of composed statements into triples of their substatements. For each programming language construct such an inference rule (or axiom) has to be introduced. The soundness of these rules can be derived from the semantics of the programming language. The most basic example of such a rule is the assignment axiom:

[Assignment]

{P[v/x]} x = v {P}.

The assignment axiom states that to prove P in the poststate you must prove that in the prestate of the instruction the formula P[v/x] holds (the formula P in which every free occurrence of the variable x is replaced by v). Valid examples of Hoare triples using the assignment axiom are:

{true} x := 5 {x = 5}
{x = 43} y := x {y = 43}

A proof in Hoare logic is derived by applying rules from the logic to decompose the program into smaller subprograms until only axioms or provable expressions of predicate logic are left. In order to give a better understanding of how such a proof can be derived, we show a proof for a small program, proven in [29], using two more Hoare logic rules:

[Conditional]
{b ∧ P} S1 {Q}    {¬b ∧ P} S2 {Q}    (b has no side-effects)
------------------------------------------------------------
{P} if b then S1 else S2 {Q}

[Consequence]
P =⇒ P1    {P1} S {Q1}    Q1 =⇒ Q
----------------------------------
{P} S {Q}

With the assignment axiom, the conditional rule and rule of consequence, the pre- and postcondition can be proven for the following piece of code:

// precondition: true
if (a > b) {
    r = a - b;
} else {
    r = b - a;
}
// postcondition: r >= 0

Listing 2.1: Code snippet with pre- and postconditions

Proof for the code in Listing 2.1 with the precondition {true} and the postcondition {r ≥ 0}:

1. a > b ∧ true =⇒ a − b ≥ 0 and {a − b ≥ 0} r = a − b {r ≥ 0},
   hence {a > b ∧ true} r = a − b {r ≥ 0}.   [Consequence]

2. ¬(a > b) ∧ true =⇒ b − a ≥ 0 and {b − a ≥ 0} r = b − a {r ≥ 0},
   hence {¬(a > b) ∧ true} r = b − a {r ≥ 0}.   [Consequence]

3. From 1 and 2: {true} if (a > b) then r = a - b else r = b - a {r ≥ 0}.   [Conditional]

By means of backwards reasoning about the weakest precondition for each method in a program, which is called predicate transformer semantics, proofs in this style can be automated [29].
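As an illustration (not part of the original text), the weakest precondition can be computed mechanically for the program of Listing 2.1:

wp(r = a − b, r ≥ 0) = (a − b ≥ 0)
wp(r = b − a, r ≥ 0) = (b − a ≥ 0)
wp(if (a > b) then r = a − b else r = b − a, r ≥ 0)
    = (a > b =⇒ a − b ≥ 0) ∧ (¬(a > b) =⇒ b − a ≥ 0)

Since this last formula is valid, it is implied by the precondition true, which automates exactly the proof given above.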


2.4

Separation Logic

Separation logic was first developed by John Reynolds around 2000 and is an extension of Hoare logic for reasoning about programs that access and mutate data held in computer memory [23]. It is centered around the separation conjunction P ∗ Q, which asserts that P and Q hold for disjoint parts of the addressable memory, and on program proof rules that exploit separation to provide modular reasoning about programs. The problem that separation logic addresses is that previous approaches needed complex restrictions to reason about the correctness of programs that mutate data structures. Explicitly stating all restrictions for the sharing of resources scales poorly; even for small programs, proofs already grow large, as Reynolds shows in [23]. By using the separation conjunction, the prohibition of sharing is built into the operation, which results in shorter and more readable proofs, even for bigger programs.

Separation logic extends the simple imperative language originally axiomatized by Hoare. Semantically, two components are added to the computational states: the store and the heap. The store is a mapping between variables and values, like the contents of registers. The heap is a mapping between addresses and values, like addressable memory. Four heap-manipulating commands are specified: allocation, lookup, mutation and deallocation. A core ingredient in separation logic is the predicate x ↦ v, which is known as the points-to predicate. Its meaning is that a pointer x points to a location at which the value v is stored. The store s maps x to a memory address and the heap h maps this address to v: h(s(x)) = v [29].

In addition to the usual formulas of predicate calculus, four new forms of assertions are introduced which describe the heap:

• emp asserts that the heap is empty;

• e ↦ e′ asserts that the heap contains one cell, at address e with contents e′;

• p0 ∗ p1 asserts that the heap can be split into two disjoint parts in which p0 and p1 hold respectively;

• p0 −∗ p1 asserts that if the current heap is extended with a disjoint part in which p0 holds, then p1 will hold in the extended heap.

With these rules we can already give a proof outline for the procedure DispTree from Listing 2.2. This proof was given by O'Hearn in [22]. DispTree walks a tree recursively, disposing the left and right subtrees and finally the root pointer.

procedure DispTree(p)
  local i, j;
  if (!isAtom(p)) then
    i := p->l;
    j := p->r;
    DispTree(i);
    DispTree(j);
    free(p);

Listing 2.2: Pseudo code for disposing the elements in a tree

The specification of DispTree is {tree(p)} DispTree(p) {emp}, which states that in the prestate there is a tree structure on the heap and afterwards just the empty heap. Before we can reason about the procedure we also need a formal definition of trees, which is the program logic describing the tree data structure:

tree(E) ⇔ if isAtom(E) then emp
          else ∃x, y. E ↦ [l: x, r: y] ∗ tree(x) ∗ tree(y).

With this definition we can prove the specification for DispTree. The crucial part of the procedure is in the then branch. Here we know that p is not an atom and, looking at the inductive definition of the tree predicate, p then points to a left and a right subtree occupying separate storage. Then the roots of the two subtrees are loaded into i and j. The first recursive call operates in-place on the left subtree, removing it. The two assertions afterwards are equivalent because emp is the unit of ∗. The subtree j is removed in the same way as i, and afterwards free(p) frees the root pointer, leading to the final assertion {emp} [22].

{p ↦ [l: x, r: y] ∗ tree(x) ∗ tree(y)}
  i := p->l; j := p->r;
{p ↦ [l: i, r: j] ∗ tree(i) ∗ tree(j)}
  DispTree(i);
{p ↦ [l: i, r: j] ∗ emp ∗ tree(j)}
{p ↦ [l: i, r: j] ∗ tree(j)}
  DispTree(j);
{p ↦ [l: i, r: j] ∗ emp}
{p ↦ [l: i, r: j]}
  free(p);
{emp}

Just as Hoare logic, separation logic has axioms to which proofs are often reduced. The proof for DispTree was reduced to the axiom {E ↦ −} free(E) {emp}, which states that E points to something beforehand (so it is in the domain of the heap) and afterwards only the empty heap remains. The axiom corresponding to the operational idea of storing a value F at the address E is: {E ↦ −} [E] := F {E ↦ F}.

Separation logic also adds structural rules to those already defined in Hoare style logic. The most used rule that comes alongside the rules like the rule of consequence and the if-then-else rule is the frame rule:

{P} C {Q}
-----------------
{P ∗ R} C {Q ∗ R}

By using this rule we can extend a local specification, involving only variables and parts of the heap that are actually used by C, by adding arbitrary predicates about variables and parts of the heap that are not modified or mutated by C. Using the frame rule allows for local reasoning [23]. Because all other cells of the memory that a function does not access automatically remain unchanged, understanding a program should be possible by just looking at the memory it actually accesses [21].
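As a small illustration (not taken from the thesis), applying the frame rule to the mutation axiom given above: from the local specification {x ↦ −} [x] := 7 {x ↦ 7}, taking R = y ↦ 3 gives

{x ↦ − ∗ y ↦ 3} [x] := 7 {x ↦ 7 ∗ y ↦ 3},

i.e. the cell at y is not touched by the assignment, so the assertion about it carries over unchanged.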

Unlike in Hoare logic, the specification {P} C {Q} cannot be interpreted loosely, in the sense that C may cause state changes not described in the pre- and postcondition. The specification should instead be interpreted as tight, which guarantees that C only alters those resources mentioned in P and Q [21].

2.5

Shape Analysis and Bi-abduction

Shape analysis is a form of program analysis that tries to infer descriptions of data structures. It also tries to prove that these structures are not misused in the program. Accurately and efficiently handling mutable data structures was one of the outstanding problems in automatic verification, but with the work done by Calcagno, Distefano, O'Hearn and Yang on shape analysis by means of bi-abduction and separation logic, an automated process that can verify large programs is now possible [9].

Shape analysis is computationally expensive because of the complexity caused by aliasing and the need to look arbitrarily deeply into the program heap. The method described in [9] however boosts shape analysis by using a compositional method where each procedure is analyzed independently of its callers. The method assigns Hoare triples to each procedure which then provides an over-approximation of data structure usage. By being a compositional method this new technique has the advantage of being able to scale, deal with large programs and deal with imprecision.

The analysis uses a new method called bi-abduction, which is a generalized form of abduction. Bi-abduction can be seen as an inverse of the frame problem, jointly inferring frames and anti-frames. In principle such an analysis can prove that programs do not commit pointer-safety violations, without the need for a user to annotate the source code with loop invariants or pre- and postconditions. Frame inference is an extension of the entailment question. The task is to find a formula frame which makes the following entailment valid: A ⊢ B ∗ frame.

Frame inference is a way to find the leftover portions of the heap needed to automatically apply the frame rule as presented in Section 2.4. This extended entailment capability is used at procedure call sites, where A is an assertion at the call site and B a precondition from a procedure’s specification [22].

A first solution to frame inference was implemented in a tool called Smallfoot [5]. The approach worked by using information from failed proof attempts of the standard entailment question, for example:

F ⊢ emp
   ⋮
A ⊢ B

tells us that F is a frame. So, this failed proof will lead to the following successful entailment, obtained by adding ∗ F on the right-hand side everywhere in the failed proof:

F ⊢ F
F ⊢ emp ∗ F
   ⋮
A ⊢ B ∗ F

The proof procedure works upwards and crunches the proof until it can go no further; if the proof is then in the form indicated above, it indicates a frame [22].

Synthesis of frames allows the usage of small specifications of memory portions by slotting them into larger states found at call sites, but we need abduction of anti-frames to find the small specifications [9]. These specifications can be found by solving a more general problem than abduction, which is called bi-abduction:

A ∗ anti-frame ⊢ B ∗ frame.
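A small illustration (not from the thesis): suppose the state at a call site is A = x ↦ − and the callee's precondition is B = x ↦ − ∗ y ↦ −. Solving

x ↦ − ∗ anti-frame ⊢ (x ↦ − ∗ y ↦ −) ∗ frame

yields anti-frame = y ↦ − and frame = emp: the missing cell y ↦ − is added to the inferred precondition of the caller, and nothing is left over to frame off. If the call-site state had instead been A = x ↦ − ∗ z ↦ −, the anti-frame would still be y ↦ − and the frame would be z ↦ −.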

2.6

Infer

Infer is an automatic program verification tool that was originally aimed at proving memory safety of C programs [7]. Infer attempts to build a compositional proof, as described in Section 2.5, by composing proofs for each function or procedure. Bugs are extracted from failures of proof attempts. In 2013 Infer was brought from its verification startup Monoidics to Facebook [8]. Infer is continuously developed at Facebook, where it was made open-source in the summer of 2015.

Infer performs a deep-heap analysis, or shape analysis, in the presence of dynamic memory allocation. It has an abstract domain that can precisely reason about complex dynamically allocated data structures such as singly and doubly linked lists. Infer is sound with respect to its underlying model of separation logic. This means that Infer synthesizes sound Hoare triples, which imply memory safety, or the absence of null pointer dereferences, with respect to that model. Infer is also automatic, in the sense that it does not require user annotations to build proofs, and it can analyze incomplete code.

When attempting to build a proof, the output of Infer for each procedure can be: a Hoare triple, a failed proof attempt, or a failed proof attempt caused by internal limitations of Infer. When a Hoare triple is found, that triple is true for a particular mathematical model according to the fault-avoiding interpretation of triples used in Hoare logic [8]. Any execution starting from that procedure will not cause a prescribed collection of runtime errors. The list of prescribed errors or bug types which Infer can find is available from Infer's website.1 When the proof attempt fails, Infer extracts from the failed attempts the reasons that prevented it from establishing the proof. These findings are then returned as a bug report to the user [7]. When Infer fails due to its limitations, nothing can be concluded for those procedures. The results from Infer are sound, but soundness does not mean that no bugs are missed. This is only the case under conditions where the model's assumptions are met [8].


By looking at Infer's output for the code in Listing 2.3 it is possible to see how Infer performs inter-procedural reasoning. The usual notion for memory deallocation is that when a new object is allocated during the execution of a procedure, it is the responsibility of the procedure to either deallocate the object or make it available to its callers. To illustrate how this works, and how Infer reports these kinds of bugs, function example1 contains a memory leak and function example2 leaves the responsibility of deallocating the variable i to its caller main.

 1  #include "stdlib.h"
 2
 3  void example1() {
 4      int *i;
 5      i = NULL;
 6      i = malloc(sizeof(int));
 7  }
 8
 9  int example2() {
10      int *i;
11      i = NULL;
12      i = malloc(sizeof(int));
13      return i;
14  }
15
16  main() {
17      example1();
18      int i = example2();
19  }

Listing 2.3: Examples of procedure-local bugs

Infer reports two memory leaks for the code in Listing 2.3. The first memory leak is reported on line 6. Infer reported the following message: memory dynamically allocated to i by call to malloc() at line 6, column 6 is not reachable after line 6, column 2. The second memory leak is detected in the function main at line 18, because function example2 returned the variable i to its caller. To better understand this message we look at the pre- and postconditions that Infer found. (This is possible by running Infer in debug mode.) As expected, for the function example1 no pre- or postconditions are found, because the proof attempt failed. Because example2 is error free according to Infer, pre- and postconditions are generated. The specification Infer gives for example2() is the following:

{emp} example2() {return ↦ − ∨ return ↦ 0}

This means that the return value points to an allocated cell which did not exist in the (empty) precondition, or that the return value points to NULL. The or-clause comes from the function malloc which returns NULL if insufficient memory is available. This information is not inferred but prescribed as an axiom of the program logic. The specification for malloc() equals:

{emp} malloc() {(return ↦ −) ∨ (return = nil ∧ emp)}.
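For contrast, a minimal sketch (not taken from the thesis; names such as example1_fixed are hypothetical, and example2 is given a pointer return type here) of how the two reported leaks in Listing 2.3 could be resolved, following the notion of responsibility described above:

#include <stdlib.h>

void example1_fixed(void) {
    int *i = malloc(sizeof(int));
    free(i);                     /* deallocated before the procedure returns */
}

int *example2(void) {
    return malloc(sizeof(int));  /* responsibility passes to the caller */
}

int main(void) {
    example1_fixed();
    int *i = example2();
    free(i);                     /* the caller deallocates what example2 returned */
    return 0;
}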

As of this writing Infer can perform these kinds of proofs for Java, C and Objective-C. Table 2.1 shows which bugs Infer can detect per programming language. Because Infer is still being developed, and the current range of bugs is still rather limited, bug types are continuously added. Infer’s website contains information about the current list of bugs which it can detect.2

Bug type                                      Java  C  Objective-C
Resource leaks                                 X    X       X
Memory leaks                                        X       X
Context leaks                                  X
Null pointer dereferences                      X    X       X
Tainted values reaching sensitive functions         X
Premature nil terminating arguments                 X       X
10 Objective-C specific bugs                                X

Table 2.1: Bug types per programming language

Infer is run via the command line and uses Javac or Clang to compile the source code. Infer uses these compilers to translate the source code into its own intermediate language, on which the analysis is performed. To run larger projects Infer can be called with build commands from Gradle, Buck, Maven, Make or Xcodebuild. Infer runs are comprised of two phases: the capture phase and the analysis phase. During the capture phase compilation commands are captured and translated into the intermediate language. During the analysis phase the captured files are analyzed. Each function and method is analyzed separately using the compositional method described in Section 2.5. Infer's output can be found in the console or in the infer-out folder which Infer creates at the call site. Infer also has a debug mode which allows the user to see more specifications from the analyzed code.


Chapter 3

Research

Test suites such as OWASP's benchmark suite and the Juliet test suite provide the ability to automatically evaluate bug reports generated by static analysis tools.1 This enables developers to do more than just evaluate different tools by hand on small projects, since manually going through all the bug reports would be very time consuming. All the bugs in these test suites are known, which means that the recall of a tool can also be calculated (which would be very hard to do on normal projects). The best known test suites are OWASP and Juliet. Because Juliet contains test cases for bug types that Infer can detect, we benchmarked Infer on Juliet. This chapter explains how the Juliet test suite works and how static analysis tools can be benchmarked with it. It also explains how we ran Infer on open-source and industrial code.

3.1

Juliet Test Suite

The Juliet test suite was created by the National Security Agency's Center for Assured Software (CAS). The suite was created in 2005 and the latest additions were made in 2013. The test suite consists of Java, C and C++ code. It was made to address the growing lack of software assurance in the U.S. government by determining the capabilities of commercial and open-source static analysis tools. The NSA believed that static analysis tools could improve software quality and should be used more. By identifying the strengths of the tools, the NSA also tried to determine how to combine tools to provide a more thorough analysis of software, by using strong tools in each area analyzed.

The suite consists of a large number of test cases that contain intentional flaws. By having consistent naming conventions for functions or classes that contain these flaws the suite allows for automatic analysis. The problem when analyzing natural code (code which is written by programmers for purposes other than creating a benchmarking suite) is that it takes a lot of time to analyze each report that is generated by static analysis tools. To analyze whether a report is indeed valid, programmers need to identify if the report points to a bug or not, which is time consuming and very hard to automate.

The places that contain bugs are known for the Juliet suite, which means that simple scripts can be written that compare the location and type of bug that a tool reports with the location and type of bugs known in Juliet. Because not only all the bugs are known, but also all the functions that do not contain bugs, the recall can also be calculated. This would be very hard to do for natural code, unless all bugs in the code are known.

The test cases in Juliet are classified by a number that indicates the control- or data-flow complexity. To identify the more sophisticated tools, Juliet not only contains a large number of different bug types, but also, for each bug, variants with increasing complexity of the code that leads to the bug, for example a bug that only manifests itself in a specific branch of the program. This makes it possible to identify tools that are flow insensitive or path insensitive. Because the same complexity variants are repeated for each bug type, the tools being analyzed can also be compared for specific bug types.

Bug types are categorized by CWE entry. CWE stands for Common Weakness Enumeration.2 For the C and C++ test cases Juliet contains 118 CWE entries and for Java 112. If a static analysis tool does not categorize reports into CWE entries, a mapping needs to be made to automatically compare CWE entry and bug type. This is needed because the naming conventions in Juliet only state that the good and bad functions do or do not contain a bug for the CWE entry to which the test case belongs. This means, for example, that the suite guarantees that no null pointer dereferences are found in good functions for CWE entry 690, but the code might still have a resource leak.

For C and C++ Juliet contains more than 61,000 test cases and for Java more than 25,000. This results in around ten million lines of code.3 Most static analysis tools will not target all the CWEs present in Juliet. So, most static analysis tools will only be tested on a subset of the CWEs, because testing them on a bug type they cannot find would be pointless.

1. https://www.owasp.org/index.php/Benchmark
2. https://cwe.mitre.org/index.html

Artificial code also has disadvantages: it is much simpler than natural code and the frequency of bugs and non-flawed constructs is much higher. The latter means that even if a static analysis tool just reports every function in the Juliet suite as flawed, the precision for that tool would still be around 50% and the recall 100%, which would be decent scores for static analysis tools. To distinguish the more sophisticated tools from the tools that just report (almost) everything, another metric specific to the Juliet test suite is proposed: the discrimination rate. A discrimination equals a test case where the tool reported the flawed and non-flawed functions correctly. Thus the tool reported the correct bug in the flawed function or class, and reported no bugs of the targeted type in the non-flawed function(s) or class [13].
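A small worked example (not from the thesis) makes this concrete: suppose a CWE has 100 test cases, each with one flawed and one non-flawed function. A tool that flags every function produces 100 true positives and 100 false positives, so its precision is 100 / (100 + 100) = 50% and its recall is 100 / (100 + 0) = 100%, yet its discrimination rate is 0, because every test case with a true positive also has a false positive.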

3.2

Benchmarking Infer

We benchmarked Infer in the same way as was proposed in [13]. This means first creating a mapping between the CWE entries in Juliet and the bug types that Infer can report. Second, the true positives, false positives and false negatives need to be counted to calculate the precision, recall and F-score. All the values were calculated using a small script that analyzed the output from Infer and counted test cases in the Juliet test suite.

Precision is also known as positive predictive value and is the ratio of correctly reported bugs to the total number of reported bugs:

Precision = #TP / (#TP + #FP).

Precision describes how well a tool identifies flaws. A lack of precision is one of the major reasons why static analysis tools can be considered useless [16]. The higher the precision, the less time developers will waste going through false reports.

The script calculates the true positives with the following algorithm:

for (bug in InferOutput)
    if (isTarget(bug.type) && badFunction(bug.functionName))
        truePositives++;

Listing 3.1: Algorithm: Count True Positives

Here isTarget is a partial function between the CWE number from the test case and the bug type from Infer. The function is partial because not all CWEs map to a bug type (Infer is still rather limited and a large number of CWEs cannot be tested in Juliet). The function badFunction returns true if the function name is that of a flawed construct, or false when the function name is that of a non-flawed function. The function can do this check by comparing the function name with the following regular expression: ^bad$. To calculate false positives the same code can be used, with the alterations that the function name must be that of a non-flawed construct and that instead of increasing the number of truePositives the number of falsePositives is increased. Because no promises are made about bugs outside of the target type, bugs found in 'good' functions must still be of the same type as the target type of the test case.

Recall is also known as sensitivity [13]. It represents the fraction of real bugs that were reported by a tool. Recall is defined as:

Recall = #TP / (#TP + #FN).

A high recall indicates that the tool correctly identified a large number of flaws. A recall of 1 means that all bugs were identified correctly. Recall is a measure that is hard to calculate on anything other than benchmark suites, because all the bugs need to be known. To calculate the recall, the number of false negatives needs to be counted. The script calculates the false negatives with the following algorithm:

for (file in mainFiles)
    if (file not in filesTruePositive)
        filesFalseNegative.add(file);
return filesFalseNegative.size();

Listing 3.2: Algorithm: Count False Negatives

The input for this algorithm is the set of all main files. Main files in Juliet are files whose names end in a number or in a number followed by 'a'. So, 'filename1.c' and 'fileName2a.c' would be main files whereas 'filename2b.c' would not. Main files contain a main function and the good and bad test cases. There can also be helper functions, which are placed in helper files and are always used by the main files. filesTruePositive is the set of main files for which a correct bug report (true positive) was given. Because each main file always contains exactly one bad function, the number of false negatives is always |{mainFiles} − {filesTruePositive}|. By doing this subtraction with sets, duplicates are not counted. The function add() which is called on filesFalseNegative should therefore only add the file if it does not already exist in filesFalseNegative. In addition to the precision and recall we also calculate the F-score for Infer, as is done in [13]. The F-score is a harmonic mean of precision and recall, and will therefore always be a number between precision and recall. The F-score provides a weighted guidance in identifying a good static analysis tool. The F-score is calculated by the following formula:

F-score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall).

The last metric to be calculated is the discrimination rate. A discrimination is a test case for which the flawed and non-flawed constructs are identified correctly, thus having a true positive and no false positive bug report for the test case. The discrimination rate is defined as follows:

Discrimination Rate = #Discriminations / #Flaws.

To calculate the discrimination rate, the number of discriminations needs to be counted. The script will calculate the discrimination rate as follows:

for (file in filesTruePositive)
    if (file not in filesFalsePositives)
        discriminations++;
return (double) discriminations / mainFiles.size();

Listing 3.3: Algorithm: Calculate Discrimination Rate

Here filesTruePositive, filesFalsePositives and mainFiles are again sets. filesTruePositive contains all the files for which a bug report was categorized as true positive and filesFalsePositives contains all the files for which a false positive bug report was found. Because each main file contains exactly one bad test case, and thus one flaw, the discrimination rate can be calculated by dividing the number of discriminations by the number of main files.

3.3

Selecting Open-Source Systems

We will be running open-source systems because this allows us to analyze a large corpus of real-world code and to calculate metrics like the average bug density and the success rate of running systems. The average bug density provides insight into the likelihood of the reports being true, since a very high bug density probably means that some of the reports are false positives (rather than, less likely, that a particular system has many more bugs than all the other systems) [20]. From the success rate of systems that can be analyzed by Infer we will be able to better answer research question 2 about the ease of use. If a lot of systems cannot be analyzed by Infer out of the box, it is less easy to use.

By running industrial code (SIG internal systems) we can provide a more qualitative analysis by looking at the bugs Infer detects. With this information we can verify that the benchmark scores obtained by running Infer on the Juliet test suite also hold for industrial code.

We will select open-source systems from GitHub.4 GitHub provides a good way to search projects by programming language and popularity. All of Infer's output data from running the open-source projects will be aggregated and analyzed. We will look at the number of systems we tried to run and the number we could actually run with Infer, the reasons for failure, the size of the systems, and for each bug found the number of occurrences and the bug density. Because all the systems need to build before Infer can run them, we cannot automate this process, since automatically solving build problems is very hard.

We will try to run as large a set of systems as possible, constrained by the available time. We will select some of the more popular systems on GitHub as well as systems already analyzed in previous research done at SIG.

3.4

Analyzing Industrial Code

To analyze industrial code we will run Infer on SIG internal tooling and validate the reports with a SIG developer. We will run the projects currently in development at SIG, since Java is the standard language at SIG and one that Infer should be able to run. From the validation we can calculate the precision of Infer on SIG systems. For each true positive bug report we will also determine whether we received enough information from Infer to fix the bug, and thus whether the report is useful for SIG developers.

Before doing the final benchmark run on the Juliet test suite, we ran an earlier version of Infer (the most recent one at that time) on the same test suite to identify bugs in Infer and flaws in Juliet. The bugs in Infer were sent to the developers of Infer and the flawed test cases in Juliet were ignored in the final benchmark scores. In this chapter we give examples of the bugs and flaws that we encountered.

We ran the script for each CWE entry for Java and C for which Infer can find the targeted bug type. The first run was performed with Infer version v0.8.0, for which we analyzed all the test cases for which Infer reported a false positive or false negative warning. From this analysis we could discover bugs and limitations of Infer. In total, for Infer versions v0.8.0 and v0.8.1, we manually analyzed 2871 test cases from Juliet to discover why Infer reported false positives and false negatives. We analyzed 2313 test cases which led to false negative bug reports, and 504 test cases that led to false positive bug reports.

3.5

Bugs in Infer

According to Infer version v0.8.0, the code in Listing 3.4 does not contain a memory leak while the code in Listing 3.5 does. This was a result of a bug in Infer, and after our report it was resolved in Infer version v0.8.1. The problem occurred when the last statement of a function was a conditional statement. When any other statement followed the conditional, such as a return statement, the analysis was correct. Because the Juliet test suite contains a large number of functions that end with a conditional statement, the fix was also important for the final benchmark scores.

#include <wchar.h>

void memoryLeakingFunction()
{
    char *data;
    data = (char *)calloc(100, sizeof(char));

    if (0) // Never true
    {
        free(data);
    }
}

int main()
{
    memoryLeakingFunction();
    return 0;
}

Listing 3.4: Sample code issue: no memory leak reported

#include <wchar.h>

void memoryLeakingFunction()
{
    char *data;
    data = (char *)calloc(100, sizeof(char));

    if (0) // Never true
    {
        free(data);
    }

    return;
}

int main()
{
    memoryLeakingFunction();
    return 0;
}

Listing 3.5: Sample code issue: reported memory leak

After reporting the bug it was fixed by the developers of Infer. We reported other bugs as well which were mostly project specific and also solved by updates to newer versions. One of the bugs we reported was actually a limitation of Infer and is described in Section 5.2.

3.6

Flaws in Juliet

Benchmarking static analysis tools on the Juliet test suite can be done automatically, since the naming conventions of the classes and functions in the suite indicate whether they contain bugs. This means that there is no need to go through the source code manually to determine per bug report whether it correctly identified a bug. In other research that looks at the performance of static analysis tools, such as the research done in [30], test cases from the Juliet suite are not checked for correctness, but when we went through the source code to figure out why certain bug reports from Infer were incorrect, we discovered that the Juliet suite actually contains some serious flaws. Most test cases in Juliet contain a large number of classes and functions, which means that a few flawed cases will not significantly alter the scores. In order to be precise, however, it is necessary to manually inspect the source code of test cases to determine that they are correct.

Bugs in Juliet test cases become a problem when they are of the same type as the target type of the test case. Juliet test cases often contain incidental bugs, which do not obstruct the benchmark process. The code in Listing 3.6 is from the Juliet test suite. In the test cases for CWE 401, memory leak bugs are identified by the function and class names. The function in Listing 3.6 is a bad function, which means in this case that there is a memory leak bug in the code. The memory leak is however not the only bug in the code, since the function calloc can return null, and data is used without checking whether the memory allocation succeeded. This is not a problem, since the target type of the test cases for CWE 401 is memory leaks and not null pointer dereferences. Other bug types should be ignored anyway, since the Juliet test suite only makes promises about the target bug type.


void CWE401_Memory_Leak__char_calloc_01_bad()
{
    char *data;
    data = NULL;
    /* POTENTIAL FLAW: Allocate memory on the heap */
    data = (char *)calloc(100, sizeof(char));
    /* Initialize and make use of data */
    strcpy(data, "A String");
    printLine(data);
    /* POTENTIAL FLAW: No deallocation */
    ; /* empty statement needed for some flow variants */
}

Listing 3.6: CWE401_Memory_Leak__char_calloc_01.c

In the code in Listing 3.7 there is however a problem, since the incidental bug type and the target type are the same. The code in Listing 3.7 comes from the test cases for CWE 476, where function and class names indicate null pointer dereferences. All the function and class names should therefore indicate whether the code does or does not contain a null pointer dereference bug. The function good1 should not contain any null pointer dereference bugs; any bug report indicating a null pointer dereference in this function will (if you do not filter flawed test cases in Juliet) be categorized as a false positive. The problem is, however, that there is an incidental null pointer dereference bug.

The test cases which are named 'null check after deref' in CWE 476 for C and C++ all contain this flaw. These test cases are more specific than the target type to which they belong. The bugs these functions actually test for are null checks after a dereference. That is, if you allocate memory as is done on line 5 in Listing 3.7, you should check directly whether the memory allocation succeeded. If this check is performed after the variable or pointer is already used, there is no point to the check anymore, because if the memory allocation had failed, the program would have already crashed because of a null pointer dereference. The bad functions have an if-statement at line 8 that checks if intPointer != NULL. This results in a (possible) null pointer dereference at line 6, since the pointer is already dereferenced at that point. A fix would be to check whether the memory allocation succeeded before dereferencing the pointer, but the Juliet test suite just removes the check altogether. If no check is present, then no null check after a dereference is done. But this still results in a (possible) null pointer dereference at line 6, and the supposedly good function thus contains a bug of the same type as the target type.

 1  static void good1()
 2  {
 3      {
 4          int *intPointer = NULL;
 5          intPointer = (int *)malloc(sizeof(int));
 6          *intPointer = 5;
 7          printIntLine(*intPointer);
 8          /* FIX: Don't check for NULL since we wouldn't reach this line if the pointer was NULL */
 9          *intPointer = 10;
10          printIntLine(*intPointer);
11      }
12  }

Listing 3.7: CWE476_NULL_Pointer_Dereference__null_check_after_deref_01.c

A similar problem can be found in the test cases for the Java version of CWE 476 which are named NULL Pointer Dereference null check after deref. Again these test cases are actually more specific, they check if there is a null check after a dereference. For the Java version this means that in the bad test cases no null pointer dereference is present. These test cases should therefore be considered good functions.

Another example of a supposedly good function can be found in the test cases for CWE 401 for C and C++ code, where some of the good functions make use of the function ALLOCA to avoid the trouble of dynamically allocating memory. The function is similar to malloc and allocates memory, but instead of allocating memory on the heap it allocates the memory on the stack. ALLOCA does not return null when insufficient memory is available, and thus does not need a null check afterwards. No issues arise when automatically categorizing bug reports for these functions, since using ALLOCA indeed solves the null pointer dereference from the bad functions. The usage of ALLOCA is however considered bad practice, since it can crash the program and result in strange behavior when not enough memory is available on the stack.

Another exception should be made when dealing with helper classes and files. The naming conventions in Juliet cannot be trusted for helper files. The Juliet test suite contains main files, which contain the good and bad functions, and helper files, which contain helper functions used by the main files. These helper functions are also named good and bad, but most of these functions are either bad or good depending on the way they are used. The code in Listing 3.8 contains the bad and goodG2BSink functions from the helper file for test case 67 of CWE 401 for C and C++. Both functions are, however, exactly the same. Whether the functions in Listing 3.8 contain a bug thus depends on the way they are used in the main files.

void CWE401_Memory_Leak__char_calloc_67b_badSink(CWE401_Memory_Leak__char_calloc_67_structType myStruct)
{
    char *data = myStruct.structFirst;
    /* POTENTIAL FLAW: No deallocation */
    ; /* empty statement needed for some flow variants */
}

/* goodG2B uses the GoodSource with the BadSink */
void CWE401_Memory_Leak__char_calloc_67b_goodG2BSink(CWE401_Memory_Leak__char_calloc_67_structType myStruct)
{
    char *data = myStruct.structFirst;
    /* POTENTIAL FLAW: No deallocation */
    ; /* empty statement needed for some flow variants */
}

Listing 3.8: CWE401_Memory_Leak__char_calloc_67b.c

For the final benchmark scores we ignored the test cases mentioned in this chapter. We also ignored Windows-specific and C++ test cases: the Windows-specific test cases could not be compiled on a Unix machine (and Infer only runs on Unix machines and needs to compile the source code), and Infer does not perform C++ analysis correctly. What goes wrong with the C++ test cases is described in Section 5.2, where we discuss the limitations of Infer.


Chapter 4

Results

4.1 Benchmark Results

To automate the calculation of the benchmarking scores, a mapping needs to be made between the CWE target bug and the bug types of Infer. The target bug type for a CWE can be found by looking at the descriptions and examples on the CWE website. Some of Infer's bug types match directly with CWEs, but to be sure we ran Infer on all possible candidates and searched the results for the expected bug. If one or more could be found we assumed a mapping. To select all possible CWEs that might be detected we created an over-approximation from all the CWEs that had something to do with memory leaks, resource leaks, null pointer dereferences or with other bug types from Infer.

The mapping between CWEs and bug types is a partial function from CWE to bug type, because most CWEs will not map to any of Infer's bug types. This is partly because Infer's set of detectable bugs is still rather small compared to other static analysis tools, and partly because there are over a hundred different CWEs present in the Juliet test suite for both Java and C. The final mapping can be found in Table 4.1.

CWE Description Related Bug Related Warning Language

401 Memory leak Memory leak - C

415 Double Free - Use after free C

416 Use after free - Use after free C

476 Null pointer dereference Null pointer dereference - C, Java

570 Always false - Condition always false C

571 Always true - Condition always true C

690 Null pointer dereference from return Null pointer dereference - C, Java

761 Free pointer not at start of buffer Memory leak - C

Table 4.1: Mapping between CWE and bug and warning types from Infer
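The automated scoring treats this as a simple lookup in which most CWEs map to nothing. The sketch below is our own illustration of that idea, not the actual analysis scripts used in this research; the bug type strings are written as we expect them to appear in Infer's reports.

#include <stddef.h>

/* Illustrative partial mapping from CWE number to an Infer bug type;
   unmapped CWEs yield NULL. */
struct cwe_mapping {
    int cwe;
    const char *infer_bug_type;
};

static const struct cwe_mapping mapping[] = {
    { 401, "MEMORY_LEAK" },
    { 476, "NULL_DEREFERENCE" },
    { 690, "NULL_DEREFERENCE" },
    { 761, "MEMORY_LEAK" },
};

const char *infer_bug_for_cwe(int cwe)
{
    for (size_t i = 0; i < sizeof(mapping) / sizeof(mapping[0]); i++) {
        if (mapping[i].cwe == cwe) {
            return mapping[i].infer_bug_type;
        }
    }
    return NULL; /* partial function: most CWEs have no Infer counterpart */
}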

The final benchmark scores were obtained by running Infer version v0.8.1 on the Juliet test suite. The results were analyzed by a script that also filtered out the flawed test cases discussed in Chapter 3. The final benchmark scores can be found in Table 4.2.

CWE Language Precision Recall #TP #FP #FN Discr F-Score

401 C 0.83 0.91 636 134 62 0.74 0.87
476 C 1.00 0.87 204 0 30 0.87 0.93
690 C 1.00 0.74 447 0 161 0.74 0.85
761 C 0.59 0.63 144 102 84 0.24 0.61
476 Java 0.95 0.85 154 8 27 0.74 0.90
690 Java 0.87 0.36 108 16 188 0.34 0.51

Table 4.2: Benchmark Results Bugs
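For reference, the precision, recall and F-scores in Tables 4.2 and 4.3 appear to follow the standard definitions; as a check, the CWE 401 row works out as shown below (we do not spell out the discrimination calculation here).

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
\[
\text{CWE 401 (C):} \quad
\text{Precision} = \frac{636}{636 + 134} \approx 0.83, \quad
\text{Recall} = \frac{636}{636 + 62} \approx 0.91, \quad
F \approx 0.87
\]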

Infer reports warnings when it is run in debug mode. There is no documentation available about the specific warnings that Infer will report. Four warnings that Infer produces for C actually have test cases in Juliet with the same target bug, see Table 4.1. To provide more information about them we benchmarked these warnings exactly the same as we did with the bugs. The results can be found in Table 4.3.


CWE Language Precision Recall #TP #FP #FN Discr F-Score

415 C 1.00 0.94 179 0 11 0.94 0.97

416 C 1.00 0.51 60 0 58 0.51 0.68

570 C 1.00 0.13 2 0 14 0.13 0.23

571 C 1.00 0.13 2 0 14 0.13 0.23

Table 4.3: Benchmark Results Warnings

4.1.1 Infer Compared to Other Static Analysis Tools

Benchmark results from other tools are hard to find, especially for commercial tools. We found output data from Coverity on their website and compared it against the output data we obtained from Infer and against the output data from Findbugs that was present at SIG. The comparison is for the test cases in CWE 476 for Java code. Because Coverity and Findbugs did not find any flaws in the test cases for CWE 690 (which are similar to CWE 476) we cannot make the same comparison for CWE 690.

Figure 4.1: Comparison with Findbugs and Coverity for CWE 476 for Java

The results for Findbugs are very good because only the rules that performed very well were selected in the SIG research that the data originates from. If all the Findbugs rules were selected the precision would not even reach 0.30.

Comparisons like this can be very valuable for developers or companies looking to use static analysis tools in their development cycle. However, even though some of the tools are open-source, this data certainly is not. Even the original paper about the Juliet suite [13] states that the authors made a comparison, yet the data was never made public. Research projects or published papers discussing tool performance mostly consist of a small case study or a small benchmark test. A recent example can be found in [30], where static analysis tools are compared on the Juliet test suite. The authors benchmarked the tools as was done in the Juliet white paper and mention that they wrote scripts to automatically analyze the results. As seen in our analysis this may not be sufficient, since Juliet contains serious flaws and the results should also be scanned by hand. Despite these flaws, that research does at least provide numbers and names with the scores so developers can judge which tool suits them best. It would be useful to keep adding to this information: benchmarking all the current tools, publishing the data and merging the data from small case studies and benchmark tests. To give a head start, Figure 4.2 shows how our data overlaps.


Figure 4.2: Precision and Recall Scores for Different CWEs and Tools

4.2 Open-Source Results

We ran Infer successfully on 54 systems, together consisting of more than 4 million lines of code (excluding blank lines, comments and header files). Infer identified 2146 bugs in 46 systems; 8 systems were bug free according to Infer. Table 4.4 provides a summary of the results per programming language.

Language Lines of Code Number of Systems Total Bugs Bug Free Systems

C 1053251 20 327 5

Java 3012218 26 1670 2

Objective-C 27970 8 149 1

Table 4.4: Results of Running Open Source Systems

To better answer research question 2 we looked at the difference between the number of systems that we tried to analyze with Infer and the number of systems that could actually be analyzed. We did not modify Infer in any way or try to fix problems together with the developers, because we wanted to see whether the systems would run directly, without first providing and incorporating fixes. Figure 4.3 shows how many systems we tried to run on Infer and how many of those could actually be analyzed. All the systems we tried could be built on computers at SIG, and we provided the same commands to Infer (depending on the build system used by a project, the commands differ in calling for example Maven, Gradle or ANT). According to its documentation, Infer should have been able to run on all the systems we tried; that is, we did not try to run Infer on languages (or language versions) or build systems that Infer does not currently support.

Almost 50% of the Java systems we tried could not be analyzed by Infer. A single point of failure could not be identified from the data we gathered, but all systems that successfully went through the capture phase also resulted in successful analysis runs. The error messages Infer provides rarely gave a clue as to what went wrong. For now, the only viable way to still run Infer on such a system is to publish a bug report on GitHub and wait for a developer's response.

Figure 4.3: Success Rate Running Open-Source Systems on Infer

From the reports we also looked at the types of bugs that Infer could find in the open-source projects. This gives an indication of the presence and detectability of these bugs. Limited by time, we were not able to analyze all 2146 bug reports to check whether they were indeed correct, nor to go through 4 million lines of code to identify all the bugs present. This means that we do not know whether, for example, resource leaks in Objective-C are not detected properly or whether these bugs are simply rarely present in GitHub projects.

Table 4.5 gives an overview of the types of bugs we found. We concentrated especially on the bug types that we benchmarked, because the benchmarking scores provide some indication of the correctness of these findings.

Language Null Pointer Resource Leaks Memory Leaks Other

C 213 17 97 0

Java 878 790 not supported 2

Objective-C 9 0 1 139

Table 4.5: Bug Types in Open Source Systems

With the Cloc command line tool and the bug reports from Infer we calculated the bug density per project. On the Coverity website a report can be found of a large study from 2012 in which they analyzed 300 open-source projects and found an average bug density of 0.69 per 1000 lines of code. They do not mention whether they counted blank and comment lines or which types of bugs they looked at, but they do mention a threshold between good and bad quality code of 1 bug per 1000 lines of code. From Figure 4.4 it is clear that Infer finds a lower bug density for most open-source projects, but also many projects with a much higher bug density. This is especially the case for Objective-C systems because these projects are much smaller; see Table 6 in Appendix A for the complete list of projects, defects and lines of source code.
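As a concrete illustration (our own calculation from the totals in Table 4.4, with bug density taken as reported bugs per 1000 non-blank, non-comment lines as counted by Cloc), the aggregate density for the C systems is:

\[
\text{bug density} = \frac{\#\,\text{reported bugs}}{\text{LOC}/1000} = \frac{327}{1053251/1000} \approx 0.31 \text{ bugs per 1000 lines of code}
\]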

To get a better idea of the size of the systems and how many lines of source code on average exhibit a specific bug density, we plotted this data in Figures 4.5 and 4.6. From these figures it becomes clear that 90% of the lines of code that we analyzed have an average bug density lower than or equal to 1 per 1000 lines of code.



Figure 4.4: Bug density per project grouped by programming language and ordered on bug density

4.3 Industrial Code Results

Results from running Infer on SIG internal systems can be found in Table 4.6. We ran Infer on 6 systems and went through the results manually with a developer to determine the correctness of each report. In total 40 bugs were found: 26 null pointer dereferences and 14 resource leaks. Out of the 40 reports, 35 were correct. This means that Infer's precision was 0.88 overall, 0.92 for null pointer dereferences and 0.79 for resource leaks. To get Infer running we did need to do partial runs, because none of the systems could be run in its entirety by Infer.
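As a check on these numbers (the split of the 35 true positives into 24 null pointer dereferences and 11 resource leaks is inferred from the reported per-type precisions, not stated explicitly):

\[
\text{Precision}_{\text{overall}} = \frac{35}{40} \approx 0.88, \qquad
\text{Precision}_{\text{null pointer}} = \frac{24}{26} \approx 0.92, \qquad
\text{Precision}_{\text{resource leak}} = \frac{11}{14} \approx 0.79
\]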

System Total Bugs Null Pointer Resource Leak True Positives False Positives

java-sys A 0 0 0 0 0
java-sys B 6 6 0 4 2
java-sys C 12 11 1 11 1
java-sys D 20 7 13 18 2
java-sys E 2 2 0 2 0
java-sys F 0 0 0 0 0

Table 4.6: Infer results on SIG systems

With a precision of 0.92 for null pointer dereferences found in SIG tooling, the results lie between the benchmarking scores for Java CWE 476 and CWE 690. Although this is just a small case study, these are the numbers we expected based on the benchmarking scores, which supports the validity of those scores.


Figure 4.5: Bug density for lines of code with a logarithmic scale


Chapter 5

Discussion

5.1 Improving Benchmark Scores

The benchmarking scores for Java CWE 690 seem to be lacking. Especially the recall score, at only 0.36, is much lower than for the other CWEs. By examining the 188 false negative bug reports we found that 172 of them are caused by Infer not being able to find the source code for a function. Infer then skips the function and the analysis becomes less precise, which matters especially because the test cases in Juliet are so small and often only use one function.

Infer's underlying model is not yet complete and not all functions are modeled, but because Infer is an open-source project anyone can add models to Infer's domain. The Infer website contains a small tutorial on how to add models. These models are not the same as the models in Infer's abstract domain, but rather dummy code that mimics the behavior of the modeled function. Adding models can be beneficial when dealing with library functions for which the source code is not included in the project.
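As an illustration of the idea, a model is essentially stub code that Infer can analyze in place of a missing implementation. The sketch below is our own hypothetical example for a C library function, not actual Infer model code; the function name lib_open_handle is invented.

#include <stdlib.h>

/* Hypothetical model: dummy code mimicking a library function whose
   source is not available. It only encodes the behavior relevant to
   the analysis: the result is either freshly allocated memory or NULL. */
void *lib_open_handle(const char *name)
{
    if (name == NULL) {
        return NULL;
    }
    return malloc(16);
}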

By providing such code, Infer is able to base its analysis on these models and, in theory (when the modeled function is implemented correctly), produce a more precise analysis. For Java CWE 690 this means that the benchmarking scores can be improved to match those of the other CWEs, as depicted in Figure 5.1.

Figure 5.1: Possible Improvement by Adding Models to Infer for Code in Juliet CWE 690 for Java

5.2 Limitations of Infer

For the CWEs 476 and 690 for Java, 24 false positive warnings are produced by Infer. For these test cases Infer falsely reports bugs in bug-free functions because it does not perform a global analysis of Java classes. For Juliet this means that when a variable is used inside a conditional statement, Infer does not evaluate the variable and instead explores both paths through the program. To illustrate this we ran Infer on the code in Listings 5.1 and 5.2, which runs without any null pointer dereference because the conditional statement at line 6 will always evaluate to false. But because Infer analyzes all paths through the function test, it reports a null pointer dereference warning at line 11, which would only happen if the conditional statement were true.


public class Program {
    public static void main(String[] args) {
        NoNullDereference noNullDereference = new NoNullDereference();
        noNullDereference.test();
    }
}

Listing 5.1: Program class used to call functions

1  public class NoNullDereference {
2      private boolean alwaysFalse = false;
3
4      public void test() {
5          String data;
6          if (alwaysFalse) {
7              data = null;
8          } else {
9              data = "hello";
10         }
11         data.trim();
12     }
13 }

Listing 5.2: Class without a null dereference

The same goes wrong with C code. For CWE 401, 134 false positive bug reports were produced, 64 of them because variables in conditional statements were not evaluated. In the example code in Listing 5.3 a memory leak is reported by Infer at line 9. After this line Infer considers i no longer reachable, resulting in a wrong analysis, since b is set to true and i will always be freed. The analysis of Listing 5.3 becomes correct when the variable b is substituted with its value; in that case Infer reports both the true and the false case correctly, resulting in no memory leak and a single memory leak respectively. The problem is again the evaluation of variables in conditional statements.

1  #include "stdlib.h"
2
3  int b = 1;
4
5  void noMemoryleak() {
6      int *i;
7      i = NULL;
8      i = malloc(sizeof(int));
9      if (b) {
10         free(i);
11     }
12 }
13
14 main() {
15     noMemoryleak();
16 }

Listing 5.3: C code with no memory leak

Infer does not support C++ yet, but C++ files will still be analyzed. It can perform experimental C++ analysis, but unless specifically told to do so the results from Infer will be strange. As an example, consider the code in Listing 5.4, which contains C++ code with two memory leaks: one in function testA in class A after line 10 and one in function test after line 18. Both data variables are never freed, but when Infer analyzes this file it will only report one memory leak, in function test, even though the functions test and testA are exactly the same except that testA is contained in a class.
