Dependence analysis in PHP


Master’s thesis

Merijn Wijngaard

mwijngaard@gmail.com

August 15, 2016, 64 pages

Supervisor: M. Hills & V. Zaytsev

Host organisation: Moxio BV, http://www.moxio.com

Host supervisor: A. Boks

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Master Software Engineering


Abstract

Dependence analysis evaluates operations of a program to produce execution order constraints, which embody the semantics of the program. Constraints can be ‘control’ dependences, which imply that an operation controls whether another operation is executed, or ‘data’ dependences, which imply that two operations access the same resource. By combining all control and data dependences in a single procedure in a program into a graph we get a Program Dependence Graph (PDG). For multi-procedure programs we can combine the PDGs of the individual procedures with a call graph to generate a System Dependence Graph (SDG). Common use cases of the PDG and SDG include optimization, slicing, and semantic clone detection.

Dependence analysis has been extensively researched in simple procedural code, but there is currently no PDG and SDG library available for a dynamic language such as PHP. While dynamic features of PHP may cause a loss of precision in the generated dependences, we believe that partially correct results can also be of value. In this thesis we therefore study how we can implement dependence analysis for PHP, and how effectively we can apply it in practice.

We analysed the effects of PHP language features and their practical usage, developed strategies for handling some problematic features, implemented a PDG and SDG library, and analysed its effectiveness. We show that several of PHP’s language features are challenging to model in dependence analysis but rarely used. We demonstrated that in real-world PHP projects the vast majority of the control and data dependences that we studied can be supported, and that the accuracy of the dependences we evaluated is very high.

We believe we have shown that in general the partial accuracy of dependences that we are able to achieve in PHP is high enough for dependence analysis to be useful. To further support this claim we demonstrated the practical use of our library by using it to implement an interprocedural slicing tool. In doing so we believe we have shown that dependence analysis is applicable to PHP.


1 Introduction
  1.1 PHP
  1.2 Problem statement
    1.2.1 Research questions
    1.2.2 Scope
    1.2.3 Research method
  1.3 Corpus
  1.4 Contributions
  1.5 Outline
2 Background
  2.1 Dependence analysis
  2.2 PHP
3 Language analysis
  3.1 Control dependence
  3.2 Data dependence
    3.2.1 Aliasing
    3.2.2 Dynamic features
    3.2.3 Scoping
  3.3 Call dependence
    3.3.1 Object-orientation & type system
    3.3.2 Dynamic features
  3.4 Summary
4 PDG implementation
  4.1 Libraries
    4.1.1 PHP-CFG
    4.1.2 Graph
  4.2 Implementation
    4.2.1 Control dependences
    4.2.2 Data dependences
    4.2.3 Combined
    4.2.4 Extensibility
5 SDG implementation
  5.1 Libraries
    5.1.1 PHP-Types
  5.2 Implementation
    5.2.1 Function calls
    5.2.2 Method calls
    5.2.3 Overloading
    5.2.4 Extensibility
6 Validation
  6.1 Analysis application
  6.2 Results
    6.2.1 Control dependence
    6.2.2 Data dependence
    6.2.3 Call dependence
7 Use case
  7.1 Example
  7.2 Implementation
8 Evaluation
  8.1 Research questions
  8.2 Threats to validity
9 Conclusion
  9.1 Future work
Bibliography
A Graph library
B PDG library
C SDG library


Chapter 1

Introduction

Dependence analysis is a form of static analysis that evaluates operations of a program to produce execution order constraints. These constraints embody the semantics of a program. Two programs consisting of the same operands and operations and identical constraints can be considered semantically equivalent, regardless of the order of their operations in the source code. There are two kinds of constraints, namely ‘control’ and ‘data’ dependences. A control dependence implies that an operation controls whether or not another operation is executed. A data dependence implies that two operations access the same shared resource.

Figure 1.1 shows an example of a control dependence. The echo operation on line 4 has a control dependence on the if operation on line 3, because it controls whether or not the echo operation is executed. If we were to apply a code transformation that moves the execution of the echo operation before the if operation, then the semantics of the program would change because the echo operation would always be executed, instead of it being conditional on the value of $i.

1 <?php
2
3 if ($i > 1) {
4     echo "foo"; // control dependence on line 3
5 }

Figure 1.1: Control dependence

Figure 1.2 shows an example of a data dependence. The echo operation on line 4 has a data dependence on the assignment operation on line 3. If we were to reorder the two operations in execution, then the value of $a would most likely be undefined, thus changing the semantics of the program.

1 <?php
2
3 $a = "foo";
4 echo $a; // data dependence on line 3

Figure 1.2: Data dependence

Figure 1.3 shows an example of two programs that have different source code (lines 3 and 4 are switched) but the same dependence structure. These programs are semantically equivalent.
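The equivalence of the two programs in figure 1.3 can be checked mechanically with a small sketch. This is an illustration only: the statement encoding (a label, the variable defined, and the variables used) and the function name dataDependences are assumptions made for this example, and the sketch handles straight-line code only.

```php
<?php

/**
 * Hypothetical sketch: data dependences of straight-line code.
 * Each statement is [label, defined variable or null, used variables].
 */
function dataDependences(array $statements): array
{
    $lastDefinition = []; // variable name => label of the reaching definition
    $edges = [];

    foreach ($statements as [$label, $defines, $uses]) {
        foreach ($uses as $variable) {
            if (isset($lastDefinition[$variable])) {
                // The use depends on the most recent definition of the variable.
                $edges[] = $lastDefinition[$variable] . ' -> ' . $label;
            }
        }
        if ($defines !== null) {
            $lastDefinition[$defines] = $label;
        }
    }

    sort($edges); // make the result independent of source order
    return $edges;
}

$program1 = [
    ['assign foo', 'foo', []],
    ['assign bar', 'bar', []],
    ['echo', null, ['foo', 'bar']],
];
$program2 = [
    ['assign bar', 'bar', []],
    ['assign foo', 'foo', []],
    ['echo', null, ['foo', 'bar']],
];

$edges1 = dataDependences($program1);
$edges2 = dataDependences($program2);

var_dump($edges1 === $edges2); // bool(true)
```

Both programs yield the same two edges, from each assignment to the echo operation, even though the assignments appear in a different order.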

By combining all control and data dependences in a single procedure in a program into a graph we create a Program Dependence Graph (PDG). Many optimizing transformations can be efficiently applied using a PDG, as they essentially use the dependence structure of a program to determine which transformations retain the semantics [8]. Similarly, the PDG can also be used to determine


1 <?php
2
3 $foo = "foo";
4 $bar = "bar";
5 echo $foo . $bar; // 'foobar'

(a) Program 1

1 <?php
2
3 $bar = "bar";
4 $foo = "foo";
5 echo $foo . $bar; // 'foobar'

(b) Program 2

Figure 1.3: Programs with identical dependence structure

use case of the PDG is in program slicing, where the problem of determining which operations affect a slicing criterion can be reduced to a vertex reachability problem [8].
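The reduction of slicing to vertex reachability can be sketched as follows. This is an illustrative sketch only: the graph encoding (an array that maps each vertex to the vertices it depends on) and the function name backwardSlice are assumptions made for this example and are not taken from the library described in this thesis.

```php
<?php

/**
 * Hypothetical sketch: a backward slice as reachability over dependence edges.
 * $dependsOn maps a vertex (here a line number) to the vertices it has a
 * control or data dependence on.
 */
function backwardSlice(array $dependsOn, int $criterion): array
{
    $visited = [];
    $worklist = [$criterion];

    while ($worklist !== []) {
        $vertex = array_pop($worklist);
        if (isset($visited[$vertex])) {
            continue;
        }
        $visited[$vertex] = true;
        // Follow every dependence edge that leaves the current vertex.
        foreach ($dependsOn[$vertex] ?? [] as $dependency) {
            $worklist[] = $dependency;
        }
    }

    $slice = array_keys($visited);
    sort($slice);
    return $slice;
}

// Dependences of a small hypothetical program:
//   3: $a = 1;   4: $b = 2;   5: if ($a > 0) {   6: echo $a; }   7: echo $b;
$dependsOn = [
    5 => [3],    // the condition reads $a
    6 => [3, 5], // echo $a reads $a and is controlled by the if
    7 => [4],    // echo $b reads $b
];

$slice = backwardSlice($dependsOn, 6);
echo implode(', ', $slice), "\n"; // 3, 5, 6
```

The slice for line 6 omits lines 4 and 7, because no dependence path connects them to the slicing criterion.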

For multi-procedure programs we can combine the PDGs of individual procedures with a call graph and create a System Dependence Graph (SDG). An SDG contains ‘call’ and ‘param’ dependences between procedure calls and their definitions. Call dependences are a type of control dependence, and param dependences are a type of data dependence. The SDG can be used for interprocedural slicing [15] and clone detection [23].

Dependence analysis is common in optimizing compilers for statically compiled languages. It has been extensively studied in simple procedural code [8, 22, 15] and in static object-oriented languages like C++ [20] and Java [18, 28, 25, 5]. As far as we know, there is currently no PDG or SDG library available for a dynamic scripting language such as PHP. This is likely because dynamic languages contain features that are difficult to model in static analysis.

An example of a problematic feature of PHP is property overloading, shown in figure 1.4. By defining a __get method, classes are able to intercept property fetches to undefined properties and dynamically return a property value. To determine which operations have a call dependence on the __get method, we need to know if a property fetch refers to a property that has been defined on the object. This is complex because it requires the integration of type inference, class hierarchies, and polymorphism.

<?php

class Foo {
    public function __get($name) {
        return 'foo';
    }
}

$foo = new Foo();
echo $foo->bar; // foo

Figure 1.4: Overloading
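The following sketch illustrates why a property fetch can only be linked to __get when we know which properties are defined. The class and property names are hypothetical and chosen for this example.

```php
<?php

class Foo
{
    public $defined = 'declared value';

    // Invoked only when the fetched property is undefined or inaccessible.
    public function __get($name)
    {
        return "overloaded: $name";
    }
}

$foo = new Foo();

$a = $foo->defined; // regular property fetch, __get is not called
$b = $foo->bar;     // undefined property, so this fetch calls __get

echo $a, "\n"; // declared value
echo $b, "\n"; // overloaded: bar
```

The two fetches are syntactically identical, yet only the second one has a call dependence on __get.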

While PHP may contain dynamic features that can be difficult or impossible to support in all cases, we believe that common analyses using dependence graphs could also be valuable here. Example application areas we currently see include debugging, code comprehension, security, optimization, and semantic clone detection; but there are likely many more that we have not considered. Dynamic features may cause a loss of precision, but we are confident that partially correct results can also be very helpful in these areas. In this thesis we therefore study how we can implement dependence analysis for PHP, and how effectively we can apply it in practice. Our work is intended to be usable with real PHP code and by real PHP developers. For this reason all our development has been done in PHP itself, and we have extensively tested our implementation on a wide variety of publicly available open source PHP projects.


1.1 PHP

PHP is a popular general-purpose scripting language that is especially suited for web development. As of April 2016, it ranks 6th on the TIOBE programming community index¹, and powers 82% of all websites whose server-side language can be determined². PHP originated as a simple CGI scripting language, but has over time acquired several features suitable for more complex application development. Examples of these are a class-based object model, which enables object-oriented programming; namespacing, which reduces name collisions and improves library interoperability; and autoloading, which allows files containing class definitions to be automatically included when a class is first used.

In recent years PHP has developed a solid packaging ecosystem in the form of Composer³ and Packagist⁴, and through the PHP-FIG⁵ a number of community standards for e.g. code style, autoloading, and logging have been created. This all contributes to a general increase in the compatibility and quality of PHP libraries, and makes PHP increasingly suitable for the development of complex (web) applications. As more complex applications are being developed, PHP is also an increasingly interesting target for static analysis tools and techniques, such as the one we discuss in this thesis.

1.2 Problem statement

The problem we study regards the applicability of dependence analysis to PHP.

1.2.1 Research questions

We will investigate the applicability of dependence analysis using the following research questions:

1. How do PHP’s language features affect dependence analysis?
2. How are relevant language features used in practice and how does this impact dependence analysis?
3. Are there strategies that we can adopt to handle problematic features?
4. How can we apply dependence analysis to PHP?
5. How effective is dependence analysis for PHP?

1.2.2 Scope

Dependence analysis is a broad subject. Given the limited amount of time available for this thesis, we have restricted our scope to only certain kinds of dependences. We cover intraprocedural control dependence, intraprocedural data dependence excluding object properties (these would steer our research too much in the direction of alias analysis), and interprocedural call dependence. We consider this to be a good basic feature set for examining dependence analysis in PHP. For brevity we will, throughout this document, use the terms control, data, and call dependence.

To apply dependence analysis to a dynamic language several existing analysis techniques need to be combined. For our work we use simple type inference, class hierarchy construction, and polymorphism computations. The precision of our generated dependences could likely be improved by better implementations of these techniques and by the integration of other techniques such as alias analysis. The development of these techniques is however not the focus of this thesis. Our focus is on implementing and evaluating dependence analysis itself.

¹ http://www.tiobe.com/tiobe_index
² http://w3techs.com/technologies/details/pl-php/all/all
³ https://getcomposer.org/
⁴ https://packagist.org/
⁵ http://www.php-fig.org/psr/


1.2.3 Research method

We first performed a language meta-analysis in which we examined PHP language features and their practical usage in relation to dependence analysis. We then implemented our own PDG/SDG library and applied it to a large corpus of real-world PHP projects. To evaluate its effectiveness we collected metrics such as the number of operands that we were able to associate with data dependences in a PDG, the number of function and method calls with associated call dependences in the SDG, and the number of run-time calls in a function trace of a test suite covered by call dependences in our SDG. All our analyses are available as part of our analysis application described in section 6.1. This ensures that our results are reproducible and verifiable.

1.3 Corpus

To analyse the practical use of PHP language features and to verify our implementation we have collected a large corpus of popular open-source PHP projects. We based the composition of the corpus on the popularity rankings on Black Duck’s Open Hub site⁶. Most of the projects are among the most popular, but we also chose some of the less popular projects and even a project that has not seen updates since 2011. As constructing a dependence graph requires a lot of memory, and our test system had only 8 GB available, we did not include projects with more than around 400k lines of code. Lastly, we also added Fabric, a proprietary application framework created by our sponsor company, Moxio.

Table 1.1: Corpus of popular open-source PHP libraries

Project         Version  Date        PHP¹    Files    SLOC²  Description
CakePHP         3.2.8    24-04-2016  5.5.9     852  139,620  Application Framework
CodeIgniter     3.0.6    21-03-2016  5.2.4     199   29,770  Application Framework
Doctrine ORM    2.5.0    02-04-2016  5.4.0   1,048   85,591  ORM
Fabric          1.6.489  08-07-2016  5.5.0   1,306   57,573  Application Framework
Gallery         3.0.9    26-06-2013  5.2.3     568   44,758  Photo Management
Joomla          3.5.1    05-04-2016  5.3.10  3,221  347,423  CMS
Kohana          3.3.5    10-03-2016  5.3.3     490   30,738  Application Framework
MediaWiki       1.26.2   21-12-2015  5.3.3   2,891  400,416  Wiki
osCommerce      2.3.4    06-06-2014  4.0.0     702   60,003  Online Retail
PEAR            1.10.1   17-10-2015  5.4.0     199   48,872  Component Framework
phpBB           3.1.5    03-05-2015  5.3.3     745  183,293  Bulletin Board
phpDocumentor   2.9.0    22-05-2016  5.5       470   22,131  Documentation Generator
phpMyAdmin      4.6.0    17-05-2016  5.5.0     871  202,399  Database Administration
phpUnit         5.4.8    26-07-2016  5.6       321   72,731  Test Framework
Roundcube       1.2.1    24-07-2016  5.3.7     256  106,314  Webmail
SilverStripe    3.3.1    29-02-2016  5.3.3     769  183,293  CMS
Smarty          3.1.29   21-12-2015  5.2.0     204   17,282  Template Engine
Squirrel Mail   1.4.22   12-07-2011  4.1.0     293   26,045  Webmail
Symfony         3.0.4    30-03-2016  5.5.9   2,978  202,757  Application Framework
WordPress       4.5      12-04-2016  5.6.0     653  160,039  Blog
Zend Framework  2.5.3    27-01-2016  5.5.0   2,639  169,447  Application Framework

¹ Minimum required version. Determined by composer.json or mention on website.
² Computed using cloc, PHP only.


1.4 Contributions

Language feature analysis We analysed a significant number of PHP language features with regard to dependence analysis and looked into how problematic features are being used in practice. Our results offer insight into the effects and practical use of these language features. We present our language analysis in chapter 3.

PDG/SDG library We created a PDG and SDG library that supports a significant subset of the language. It can be used for building PDG/SDG-based static analysis tools and for further research into PHP. Its implementation is described in chapters 4 and 5.

Interprocedural slicing tool Using our PDG and SDG library we developed an interprocedural slicing tool that can create a backwards slice of a system using a file name and line number as a slicing criterion. It can be seen as a proof of concept of what is possible with dependence analysis in PHP. Its implementation is described in chapter 7.

PHP-CFG improvements As part of our work, the PHP-CFG library, which implements a control flow graph in static single assignment form, was augmented with support for additional PHP language features and many small bugfixes. These are described in section 4.1.1.

PHP-Types improvements As part of our work, the PHP-Types library, which implements type reconstruction for PHP-CFG, was augmented with an improved class hierarchy, improved interprocedural type inference, support for additional PHP language features, and many small bugfixes. These are described in section 5.1.1.

1.5 Outline

The rest of this thesis is structured as follows. Chapter 2 describes existing research and its relation to our work. Chapter 3 discusses the impact of PHP language features on dependence analysis, practical use of problematic features, and several strategies for dealing with problematic features. Chapters 4 and 5 present our implementation of a PDG and SDG. Chapter 6 presents our validation of the PDG and SDG implementation, and problematic feature handling strategies. In chapter 7 we demonstrate the practical use of our PDG and SDG implementation by using it to implement a slicing tool. Chapter 8 evaluates our work in relation to our research questions. We conclude in chapter 9.


Chapter 2

Background

In this chapter we present existing research and relate it to our work.

2.1 Dependence analysis

Weiser [26] introduced the concept of a program slice as the minimal form of a program that is still guaranteed to produce a certain behavior, and demonstrated its usefulness in the testing, automatic parallelization, maintenance, and debugging of programs. He computed program slices by using data and control flow analysis. We use his algorithm for interprocedural slicing in our slicing tool.

Ferrante et al. [8] combined the dependence graphs created by control and data flow analysis to form the PDG, and demonstrated its qualities in the optimization of imperative languages. They showed how several kinds of optimizations can be efficiently performed using the PDG. We use a variation of their original algorithm for determining control dependence in our work, but a different method for determining data dependence. They also briefly mention that the PDG is useful in the context of program slicing.

Ottenstein & Ottenstein [22] present the PDG as a suitable program representation for a software development environment, and cover program slicing using the PDG more extensively. They also state that the amount of space required for storing the PDG might be a practical consideration. This is something that we confirmed in our work, as we could not support projects of more than around 400k lines of code on our test system.

Horwitz et al. [15] combined the PDGs of all application procedures with a call graph to construct the SDG, and demonstrated its use in interprocedural slicing. They demonstrated the construction of call and param in/out dependence. We have integrated call dependence in our SDG, but left param dependences out of scope, and thus as future work. We also use interprocedural slicing as a use case to demonstrate the practical application of our work.

Larsen & Harrold [20] demonstrated how interprocedural slicing using the SDG could be applied to object-oriented software. They represent method polymorphism by linking a method call to a polymorphic choice vertex that is then linked to all reachable implementations of that method. We use a similar technique to support method polymorphism, but link a method call to all implementations directly, instead of through a polymorphic choice vertex.
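Linking a method call directly to all reachable implementations can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of our library: the class hierarchy encoding ($parents), the method table ($methods), the function name callTargets, and the example classes are all hypothetical, and interfaces, traits, and visibility are ignored.

```php
<?php

/**
 * Hypothetical sketch: resolve a method call to every implementation that
 * may be invoked at run time for a receiver of static type $type.
 * $parents maps a class to its parent (or null); $methods maps a class to
 * the methods it defines itself.
 */
function callTargets(array $parents, array $methods, string $type, string $method): array
{
    $targets = [];

    // The implementation used by the static type itself (own or inherited).
    for ($class = $type; $class !== null; $class = $parents[$class] ?? null) {
        if (in_array($method, $methods[$class] ?? [], true)) {
            $targets[] = "$class::$method";
            break;
        }
    }

    // Overriding implementations in any descendant of the static type.
    foreach (array_keys($parents) as $candidate) {
        for ($class = $parents[$candidate]; $class !== null; $class = $parents[$class] ?? null) {
            if ($class === $type && in_array($method, $methods[$candidate] ?? [], true)) {
                $targets[] = "$candidate::$method";
            }
        }
    }

    sort($targets);
    return $targets;
}

$parents = ['Shape' => null, 'Circle' => 'Shape', 'Square' => 'Shape', 'Dot' => 'Circle'];
$methods = ['Shape' => ['area'], 'Circle' => ['area'], 'Square' => [], 'Dot' => ['area']];

$targets = callTargets($parents, $methods, 'Shape', 'area');
echo implode(', ', $targets), "\n"; // Circle::area, Dot::area, Shape::area
```

In an SDG, each returned target would receive its own call dependence edge from the call site, with no intermediate vertex.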

Sinha & Harrold [24] and Jo & Chang [16] demonstrated ways in which exception-induced control flow could be integrated into control flow analysis. We initially investigated this for possible implementation in our work, before our language analysis showed that within our current scope modeling exception-induced control flow was not essential for accurate dependence analysis. This could be reevaluated when we add support for additional interprocedural control dependences in the future.

Komondoor & Horwitz [17] and Krinke [19] used the PDG for identifying non-contiguous semantic code clones by identifying isomorphic subgraphs. Shomrat & Feldman [23] extended this to interprocedural semantic clone detection in an attempt to detect refactored clones. Semantic clone detection is one of the main use cases we see for our library.


2.2 PHP

Hills et al. [12] performed an empirical analysis of PHP feature usage and looked specifically at usage of several dynamic features. They discovered that uses of most of these dynamic features were very rare. Hills et al. [13] showed that most instances of dynamic includes can be statically resolved, and Hills [10] showed that this is also true for many instances of variable variables. Their results inspired us to investigate the possibilities for dependence analysis in PHP. Our usage statistics are also collected in a similar fashion.

Hills & Klint [11] present PHP AiR, a framework for analysing PHP software implemented in Rascal. We analysed their implementation when determining how to implement control flow analysis in PHP. We settled on using PHP-CFG instead of porting their implementation, as this saved us considerable effort during implementation.

Van der Hoek & Hage [14] developed an object-sensitive type analysis for PHP that accommodates several of PHP’s dynamic aspects. We investigated their work when determining how to integrate type information into our call dependence determination. We settled on using PHP-Types instead of their implementation, even though its type analysis quality was likely worse, as it already integrated well with PHP-CFG. Implementing type analysis ourselves was beyond the scope of this thesis. We do see their analysis as a promising next step in improving the accuracy of our generated call dependences.


Chapter 3

Language analysis

In this chapter we present our analysis of a significant number of PHP language features in relation to dependence analysis. We look at several features that are problematic, how these are being used in practice, and what strategies we can implement for handling some of them. As per our scope statement of section 1.2.2, we only look at language features that affect control, data, and call dependences. This analysis is not intended to be exhaustive, but we consider it to be a good overview of relevant language features.

All practical usage data presented here has been gathered using our analysis application described in section 6.1. In many cases we show the number of procedures containing a certain feature, both as an absolute number and as a percentage of the total number of procedures. As every procedure is represented by a separate PDG, this indicates how many PDGs in a project could be affected by that feature. Besides functions, methods, and closures, we also consider the top-level scope of a script to be a procedure, as it can also contain logic. We call this the script’s ‘pseudo-main’ procedure. Figure 3.1 shows examples of the different kinds of procedure in PHP.

<?php

class Foo {
    public function foo() { // method
        echo 'foo';
    }
}

function bar() { // function
    echo 'bar';

    $baz = function () { // closure
        echo 'baz';
    };
}

echo 'quux'; // part of pseudo-main

Figure 3.1: Procedures in PHP

3.1 Control dependence

Most of PHP’s control structures have their counterparts in other imperative programming languages. If, else, elseif/else if, while, do-while, for, foreach, switch, break, continue, and return all work largely as expected. There are some minor differences, such as the fact that a switch statement is considered a looping structure for the purposes of continue, and that we can specify a number of levels for the break and continue statements. This means that if we use a switch in a foreach loop, we have to use ‘continue 2’ from inside the switch to jump to the next iteration of the foreach loop. An example of this is shown in figure 3.2.

<?php

foreach ($ns as $n) {
    switch ($n) {
        case 'foo':
            continue; // continues the switch (same as break)
        case 'bar':
            continue 2; // continues the foreach loop
        case 'baz':
            break; // breaks the switch
        case 'quux':
            break 2; // breaks the foreach loop
    }
}

Figure 3.2: Break and continue for switch in loop
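The effect of the multi-level forms on the statements that follow the switch can be observed in a small runnable sketch. The input values and the $seen array are hypothetical. The sketch avoids a bare continue inside the switch, because PHP 7.3 and later emit a warning for it.

```php
<?php

$seen = [];

foreach (['foo', 'bar', 'baz', 'quux', 'never'] as $n) {
    switch ($n) {
        case 'bar':
            continue 2; // skips the rest of this foreach iteration
        case 'quux':
            break 2;    // leaves the foreach loop entirely
        default:
            break;      // leaves only the switch
    }
    $seen[] = $n; // reached only when the switch exits normally
}

echo implode(', ', $seen), "\n"; // foo, baz
```

The assignment to $seen lies outside the switch, yet whether it executes is decided inside it, so it is control dependent on the switch.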

Recent additions to PHP are the goto statement and generators. The goto statement allows jumping to another section of the program. Jumps cannot create irreducible control flow, as jumping into another scope or looping construct is not allowed. While the target of a goto may require some bookkeeping to resolve, as jumps can be forward or backward, it is fundamentally similar to a break or continue statement and so should not pose any problems for determining control dependences. Figure 3.3 shows an example of goto.

<?php

$arr = ['foo', 'bar', 'baz'];

foreach ($arr as $value) {
    echo $value; // prints foo and bar
    if ($value === 'bar') {
        goto end;
    }
}
end:

Figure 3.3: Goto

Generators are special functions that use the yield expression to iteratively transfer control out of the function and return values. They can be used to create lightweight lazy iteration similar to iterator objects, but generators can also be passed values from outside the generator function that will be received as the result of a yield expression. Generators are challenging to model in control flow analysis because once they yield control, control does not necessarily have to return. If control returns, it returns at the last executed yield expression. If we ignore the fact that control might not be returned to the generator function, then from the point of view of the generator function the yield expression is similar to a function call. Figure 3.4 shows an example of a generator.
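Both properties, values passed in through yield and control that never returns, can be observed in a short sketch. The function name counter and the by-reference $log parameter are hypothetical and used only to make the generator's progress visible from outside.

```php
<?php

function counter(array &$log)
{
    // The value passed to send() becomes the result of this yield expression.
    $received = yield 'first';
    $log[] = "received: $received";

    yield 'second';

    // Reached only if the consumer resumes the generator once more.
    $log[] = 'finished';
}

$log = [];
$gen = counter($log);

$first = $gen->current();      // runs the generator up to the first yield
$second = $gen->send('hello'); // resumes at the first yield, runs to the second

unset($gen); // the consumer abandons the generator; control never returns

print_r($log);
```

After the generator is discarded, $log contains only the entry written between the two yield expressions, so the statement after the second yield never executes.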

To determine the effect that generators would have on our ability to determine control dependences, we looked at how common the yield expression was in our corpus. Table 3.1 shows the results. We can see that it is very rare, with only 2 projects using yield at all, and even those having very few instances. This suggests that accurate support for generators is not required for usefully determining control dependences. We therefore treat the yield expression as a function call for now, and ignore any other impact this feature has on control dependence.

Exceptions in PHP also work similarly to those in other languages. An exception can be thrown using a throw statement and caught, if thrown inside a try block, using a catch statement. A try block can also have a finally block, which contains code that is always executed, whether the try block exits normally or with an exception. Exceptions can be problematic for determining control dependences


<?php

function gen() {
    yield 'foo';
    yield 'bar';
    yield 'baz';
}

foreach (gen() as $value) {
    echo $value; // prints foo bar baz
}

Figure 3.4: Generator

Table 3.1: Yield usage

                        Yield
Project          Procs  Total  Procs  Proc %
CakePHP          9,943      0      0    0.00
CodeIgniter      1,897      0      0    0.00
Doctrine ORM     7,876      0      0    0.00
Fabric           7,407      0      0    0.00
Gallery          3,039      0      0    0.00
Joomla          18,922      0      0    0.00
Kohana           2,409      0      0    0.00
MediaWiki       26,594      0      0    0.00
osCommerce       3,290      0      0    0.00
PEAR             1,756      0      0    0.00
phpBB            5,000      0      0    0.00
phpDocumentor    2,682      0      0    0.00
phpMyAdmin       6,614      0      0    0.00
phpUnit          2,338      0      0    0.00
Roundcube        2,537      0      0    0.00
SilverStripe     9,610      0      0    0.00
Smarty           1,132      0      0    0.00
Squirrel Mail    1,093      0      0    0.00
Symfony         20,744      1      1    0.00
WordPress        6,904      0      0    0.00
Zend Framework  15,509      2      1    0.01


because they are a form of unstructured jumping. A thrown exception can either immediately exit the current procedure, or be caught by a surrounding catch statement and continue execution from there. However, catch statements can opt to only catch some types of exceptions, and almost any operation can throw any exception type by the automatic invocation of a destructor or error handler.

To determine the effect that exceptions would have on our ability to determine control dependences we looked at how they were used in our corpus. This is shown in table 3.2. We can see that the use of try and throw statements is reasonably common, but very few procedures contain both a try statement and a throw statement. This means that in most cases a throw statement immediately exits the procedure.

Table 3.2: Exception construct usage

                        Try                   Throw                 Try & Throw
Project          Procs  Total  Procs  Proc %  Total  Procs  Proc %  Procs  Proc %
CakePHP          9,943     78     72    0.72    344    267    2.69     19    0.19
CodeIgniter      1,897      8      8    0.42     13      5    0.26      0    0.00
Doctrine ORM     7,876    127    123    1.56    355    222    2.82     11    0.14
Fabric           7,407    154    122    1.65    926    506    6.83     60    0.81
Gallery          3,039    156    135    4.44    281    208    6.84     29    0.95
Joomla          18,922    576    457    2.42  1,208    766    4.05     87    0.46
Kohana           2,409     54     50    2.08    155    122    5.06     30    1.25
MediaWiki       26,594    495    409    1.54  1,830  1,245    4.68    109    0.41
osCommerce       3,290     10     10    0.30     57     40    1.22      7    0.21
PEAR             1,756      5      4    0.23     10      8    0.46      0    0.00
phpBB            5,000     37     29    0.58    120     73    1.46      6    0.12
phpDocumentor    2,682     10      9    0.34    101     88    3.28      2    0.07
phpMyAdmin       6,614     25     17    0.26     20     18    0.27      3    0.05
phpUnit          2,338    278    262   11.21    222    151    6.46     10    0.43
Roundcube        2,537     36     31    1.22      2      2    0.08      1    0.04
SilverStripe     9,610     91     72    0.75    569    369    3.84     32    0.33
Smarty           1,132     13     13    1.15    102     76    6.71      9    0.80
Squirrel Mail    1,093      0      0    0.00      0      0    0.00      0    0.00
Symfony         20,744    460    375    1.81  1,696  1,204    5.80    131    0.63
WordPress        6,904     48     42    0.61    107     49    0.71     21    0.30
Zend Framework  15,509    207    195    1.26  2,888  2,109   13.60    128    0.83

Several studies have attempted to integrate exception-induced control flow into a control flow graph in Java or C++ [24, 16, 2], but these studies generally cover explicitly thrown exceptions and their strategies are not immediately suitable for implementation in the dynamic context of PHP. As this problem is too complex to be included in our scope, we currently assume that all throw statements immediately exit the procedure.
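The two situations distinguished above, a throw that exits the procedure and a catch that handles only some exception types, can be illustrated with a small sketch. The function names parse and tryParse are hypothetical and chosen for this example.

```php
<?php

function parse($input)
{
    if ($input === '') {
        // No surrounding try block: the throw exits parse() immediately.
        throw new InvalidArgumentException('empty input');
    }
    if (!is_numeric($input)) {
        throw new UnexpectedValueException('not a number');
    }
    return (int) $input;
}

function tryParse($input)
{
    try {
        return parse($input);
    } catch (InvalidArgumentException $e) {
        // Only this exception type is caught; others propagate to the caller.
        return null;
    }
}

$propagated = false;
try {
    tryParse('abc');
} catch (UnexpectedValueException $e) {
    $propagated = true; // the catch block in tryParse() did not match
}

var_dump(tryParse('12'), tryParse(''), $propagated);
```

Under the assumption stated above, both throw statements in parse() are modeled correctly, while the control flow from the call in tryParse() into its catch block is interprocedural and therefore not modeled.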

3.2 Data dependence

There are many features in PHP that are relevant for determining data dependences. Some can be used to create non-obvious aliasing, some can create non-obvious or dynamic references, and some influence the determination of reaching definitions, such as the scoping mechanism.

3.2.1 Aliasing

The most common way of creating aliases in PHP is using object references. While the references themselves are copied on assignment or when passed as an argument, the underlying object remains the same. When considering data dependences to object properties it is important to take aliasing using object references into account. As this is not included in our current scope we will not cover it here, but there are several other language features of PHP that can be used to create aliasing which are currently relevant to us.


Predefined variables are provided by PHP to all scripts. They can contain internal or environment information, such as the last error message or the values of GET or POST parameters in the context of an HTTP request. Some of these predefined variables are available in all scopes; these are called superglobals. PHP also allows user-defined global variables, which need to be explicitly imported into procedures. All global variables, including superglobals, are also accessible through the superglobal $GLOBALS array. In the context of data dependences this is problematic because the same resource can be referenced in multiple ways. Figure 3.5a shows an example of how a variable defined in the global scope is automatically aliased using the $GLOBALS array.

Similarly, reference assignments also allow multiple variable names to reference the same resource. Correctly determining data dependence for this requires tracking which variable names correspond to the same resource. Further complicating matters is the fact that individual elements of an array can also be assigned by reference. Figure 3.5b shows an example of aliasing using reference assignment.

    <?php

    $foo = 'foo';
    $GLOBALS['foo'] = 'bar';

    echo $foo; // bar

(a) GLOBALS

    <?php

    $foo = 'foo';
    $bar = &$foo;
    $bar = 'bar';
    echo $foo; // bar

(b) Reference assignment

Figure 3.5: Aliasing using GLOBALS and reference assignments
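The array-element case mentioned above is not covered by figure 3.5. The following is a minimal sketch of it; the variable names are chosen purely for illustration.

```php
<?php

// Hypothetical variable names; illustrates aliasing through an array element.
$config = ['mode' => 'dev'];
$mode = &$config['mode'];   // $mode and $config['mode'] now share one resource

$mode = 'prod';             // a write through the alias...
echo $config['mode'], "\n"; // ...is visible through the array: prints "prod"

// Copying the array does not break the alias: the referenced element stays
// shared between the original and the copy.
$copy = $config;
$copy['mode'] = 'test';
echo $config['mode'], "\n"; // prints "test"
```

A write to $copy therefore reaches $config and $mode as well, which is the kind of hidden data dependence that an analysis without alias tracking would miss.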

Correctly modelling multiple ways of accessing the same resource requires alias analysis. This has been extensively researched, particularly in the context of Java [27, 4]. Existing techniques for Java could well be suitable for porting to PHP, but we consider this problem too complex to be included in our current scope. We therefore ignore the possibility that variables might reference the same resource, and leave the implementation of alias analysis in PHP as future work.

To determine the impact that this could have on our ability to correctly determine data dependences we looked at how common possible aliasing features were in our corpus. This is shown in table 3.3. As we can see, these features are rare in many libraries, but some do use them extensively. While many of these uses of possible aliasing features might not actually cause hidden data dependences, this definitely warrants further research into the impact of aliasing on dependence analysis in PHP. This is something that we currently leave as future work.

3.2.2 Dynamic features

The eval and dynamic include statements are PHP’s method of evaluating code at run-time, with eval evaluating a string and a dynamic include evaluating a file. In both cases the pseudo-main procedure of the evaluated code is evaluated in the same scope as the eval or dynamic include statement. This means it has access to any variables in the evaluating scope and can thus have hidden data dependences. Whether or not a dynamically included pseudo-main actually references the surrounding scope could in many cases be determined by static analysis if the path to the included file can be resolved. The path of the file to include, however, can itself be the result of an expression, meaning that it can also be determined at run-time. Figure 3.6 shows examples of eval and include statements using the evaluating scope.

Variable variables are variable references in which the name of the referenced variable is the result of an expression. Since this expression can even be dependent on user input, it could evaluate to any variable in scope, or even a variable that is currently undefined. Figure 3.7 shows an example of a variable variable.

Theoretically these dynamic features pose a significant problem for dependence analysis. However, Hills et al. [12] showed that in practice the use of eval and variable variables is extremely rare. As shown in table 3.4, we can confirm this for our corpus. Hills et al. [13] showed that most instances of dynamic includes can be statically resolved, and Hills [10] showed that this is also true for many instances of variable variables. Although we currently limit our scope to more traditional methods of generating data dependences, we see their methods as good candidates for integration into our library in the future. This could significantly improve the accuracy of dependence analysis for these features.

Table 3.3: Possible aliasing construct usage

| Project | Procs | Ref assign (total) | Ref assign (procs) | Global (total) | Global (procs) | Any (procs) | Any (proc %) |
|---|---|---|---|---|---|---|---|
| CakePHP | 9,943 | 49 | 32 | 0 | 0 | 32 | 0.32 |
| CodeIgniter | 1,897 | 126 | 93 | 0 | 0 | 93 | 4.90 |
| Doctrine ORM | 7,876 | 16 | 11 | 0 | 0 | 11 | 0.14 |
| Fabric | 7,407 | 39 | 25 | 0 | 0 | 25 | 0.34 |
| Gallery | 3,039 | 18 | 11 | 9 | 9 | 20 | 0.66 |
| Joomla | 18,922 | 644 | 331 | 3 | 3 | 333 | 1.76 |
| Kohana | 2,409 | 66 | 52 | 4 | 4 | 56 | 2.32 |
| MediaWiki | 26,594 | 1,021 | 553 | 1,578 | 1,479 | 1,884 | 7.08 |
| osCommerce | 3,290 | 23 | 9 | 522 | 511 | 519 | 15.78 |
| PEAR | 1,756 | 503 | 280 | 6 | 6 | 285 | 16.23 |
| phpBB | 5,000 | 142 | 93 | 974 | 758 | 832 | 16.64 |
| phpDocumentor | 2,682 | 5 | 4 | 0 | 0 | 4 | 0.15 |
| phpMyAdmin | 6,614 | 149 | 58 | 234 | 209 | 263 | 3.98 |
| phpUnit | 2,338 | 5 | 3 | 4 | 2 | 5 | 0.21 |
| Roundcube | 2,537 | 13 | 11 | 3 | 3 | 14 | 0.55 |
| SilverStripe | 9,610 | 103 | 62 | 39 | 35 | 97 | 1.01 |
| Smarty | 1,132 | 22 | 13 | 0 | 0 | 13 | 1.15 |
| Squirrel Mail | 1,093 | 23 | 18 | 500 | 414 | 427 | 39.07 |
| Symfony | 20,744 | 72 | 41 | 0 | 0 | 41 | 0.20 |
| WordPress | 6,904 | 299 | 139 | 911 | 902 | 1,004 | 14.54 |
| Zend Framework | 15,509 | 207 | 149 | 1 | 1 | 150 | 0.97 |

    <?php

    $foo = 'foo';

    eval('$foo = "bar";');
    echo $foo; // bar

    include(__DIR__ . '/baz.php');
    echo $foo; // baz

(a) foo.php

    <?php

    $foo = 'baz';

(b) baz.php

Figure 3.6: Eval and include using evaluating scope

    <?php

    $foo = 'Hello World';
    $bar = 'foo';
    echo $$bar; // Hello World

Figure 3.7: Variable variable

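Table 3.4: Eval, include, and variable variable usage

| Project | Procs | Eval (total) | Eval (procs) | Eval (proc %) | Include (total) | Include (procs) | Include (proc %) | Var var (total) | Var var (procs) | Var var (proc %) |
|---|---|---|---|---|---|---|---|---|---|---|
| CakePHP | 9,943 | 0 | 0 | 0.00 | 14 | 13 | 0.13 | 0 | 0 | 0.00 |
| CodeIgniter | 1,897 | 1 | 1 | 0.05 | 87 | 34 | 1.79 | 12 | 5 | 0.26 |
| Doctrine ORM | 7,876 | 0 | 0 | 0.00 | 91 | 29 | 0.37 | 0 | 0 | 0.00 |
| Fabric | 7,407 | 3 | 3 | 0.04 | 102 | 61 | 0.82 | 1 | 1 | 0.01 |
| Gallery | 3,039 | 1 | 1 | 0.03 | 48 | 29 | 0.95 | 8 | 4 | 0.13 |
| Joomla | 18,922 | 4 | 3 | 0.02 | 714 | 511 | 2.70 | 6 | 4 | 0.02 |
| Kohana | 2,409 | 1 | 1 | 0.04 | 50 | 47 | 1.95 | 7 | 3 | 0.12 |
| MediaWiki | 26,594 | 5 | 5 | 0.02 | 544 | 304 | 1.14 | 6 | 3 | 0.01 |
| osCommerce | 3,290 | 5 | 5 | 0.15 | 777 | 238 | 7.23 | 119 | 39 | 1.19 |
| PEAR | 1,756 | 2 | 2 | 0.11 | 302 | 203 | 11.56 | 3 | 3 | 0.17 |
| phpBB | 5,000 | 11 | 6 | 0.12 | 397 | 203 | 4.06 | 61 | 17 | 0.34 |
| phpDocumentor | 2,682 | 0 | 0 | 0.00 | 11 | 7 | 0.26 | 0 | 0 | 0.00 |
| phpMyAdmin | 6,614 | 2 | 2 | 0.03 | 1,217 | 474 | 7.17 | 33 | 9 | 0.14 |
| phpUnit | 2,338 | 3 | 3 | 0.13 | 34 | 15 | 0.64 | 0 | 0 | 0.00 |
| Roundcube | 2,537 | 0 | 0 | 0.00 | 87 | 74 | 2.92 | 3 | 2 | 0.08 |
| SilverStripe | 9,610 | 10 | 10 | 0.10 | 558 | 290 | 3.02 | 3 | 2 | 0.02 |
| Smarty | 1,132 | 9 | 8 | 0.71 | 44 | 25 | 2.21 | 40 | 11 | 0.97 |
| Squirrel Mail | 1,093 | 1 | 1 | 0.09 | 426 | 139 | 12.72 | 24 | 6 | 0.55 |
| Symfony | 20,744 | 23 | 15 | 0.07 | 139 | 99 | 0.48 | 0 | 0 | 0.00 |
| WordPress | 6,904 | 0 | 0 | 0.00 | 809 | 272 | 3.94 | 2 | 1 | 0.01 |
| Zend Framework | 15,509 | 1 | 1 | 0.01 | 58 | 42 | 0.27 | 2 | 2 | 0.01 |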

3.2.3 Scoping

In PHP there are only two scopes: the global scope and the function scope. Although many of PHP’s control structures contain a block of statements, blocks do not have an associated scope like in many other languages. Variables defined within the block are defined in the current global or function scope, and thus remain accessible after leaving the block. This can make determining reaching definitions, and by extension data dependences, easier. Figure 3.8 shows an example of the lack of block scoping in PHP.

    <?php

    function foo() {
        if (true) {
            $foo = 'foo';
        }
        echo $foo; // prints foo
    }

Figure 3.8: Lack of block scoping

3.3 Call dependence

The main challenge in PHP for determining call dependence is its object model and weak dynamic type system, but there are also a number of dynamic features that are relevant.

3.3.1 Object-orientation & type system

PHP has a single-inheritance class-based object model. Its large library of built-in functionality is, for historic reasons, mostly not object-oriented, but most projects in our corpus do use the object model extensively. This is shown in table 3.5.

Table 3.5: Call dependence relevant feature usage

| Project | Files | Function defs | Function calls | Classes | Method defs | Method calls |
|---|---|---|---|---|---|---|
| CakePHP | 852 | 21 | 6,995 | 843 | 8,483 | 48,370 |
| CodeIgniter | 199 | 167 | 5,087 | 138 | 1,531 | 2,275 |
| Doctrine ORM | 1,048 | 0 | 2,742 | 1,500 | 6,770 | 32,093 |
| Fabric | 1,306 | 15 | 3,768 | 1,133 | 6,028 | 18,241 |
| Gallery | 568 | 45 | 7,176 | 403 | 2,426 | 11,844 |
| Joomla | 3,221 | 258 | 25,989 | 2,402 | 15,383 | 98,070 |
| Kohana | 490 | 33 | 2,929 | 389 | 1,872 | 3,972 |
| MediaWiki | 2,891 | 251 | 28,033 | 2,745 | 22,943 | 85,115 |
| osCommerce | 702 | 404 | 18,293 | 273 | 2,184 | 2,685 |
| PEAR | 199 | 13 | 5,237 | 131 | 1,544 | 6,957 |
| phpBB | 745 | 569 | 18,428 | 599 | 3,674 | 14,568 |
| phpDocumentor | 470 | 11 | 767 | 432 | 2,086 | 7,975 |
| phpMyAdmin | 871 | 840 | 18,594 | 633 | 4,885 | 24,358 |
| phpUnit | 321 | 141 | 1,663 | 310 | 1,866 | 4,384 |
| Roundcube | 256 | 106 | 5,738 | 264 | 2,170 | 7,842 |
| SilverStripe | 769 | 40 | 10,648 | 1,190 | 8,657 | 33,470 |
| Smarty | 204 | 59 | 1,911 | 175 | 869 | 1,064 |
| Squirrel Mail | 293 | 604 | 10,750 | 22 | 196 | 651 |
| Symfony | 2,978 | 19 | 11,164 | 2,754 | 16,924 | 63,018 |
| WordPress | 653 | 2,935 | 52,263 | 275 | 3,315 | 10,058 |
| Zend Framework | 2,639 | 4 | 13,039 | 1,946 | 12,673 | 20,013 |

As PHP’s type system is weak and dynamic, determining the type of an operand can be quite difficult. This is problematic in combination with object-orientation, because dynamic dispatch can result in a method name referencing different definitions based on the type of the object. To be able to correctly resolve dependences to polymorphic properties and methods, we require operand type information.

Type analysis in PHP is still an active research topic. Van der Hoek & Hage [14] have recently developed an object-sensitive type analysis framework for PHP that could be suitable for integration in our work, but this would stretch our current scope too far beyond dependence analysis. We therefore opt for a simpler implementation that integrates well with other libraries we use. This implementation is described in section 5.1.1.

3.3.2 Dynamic features

PHP’s main mechanism for importing functionality is the include statement. This means that applications could theoretically contain multiple implementations of the same function, class, or method, and vary the run-time implementation by including different files. To determine the impact of this, we looked at how many duplicate class and function names there were in our corpus. This is shown in table 3.6. We can see that in general duplicate names are not very common, but they do exist. Our call dependence generation algorithms therefore all handle multiple definitions of a name by linking to all available implementations, even though this would not be possible at run-time. Our method for this is described in sections 5.2.1 and 5.2.2.
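The idea of linking a call to every definition that shares its name can be sketched as follows. This is an illustration only: the array shapes, the function name linkCalls, and the file paths are hypothetical and do not reflect the Php-Pdg API described in sections 5.2.1 and 5.2.2.

```php
<?php

/**
 * Sketch only: array shapes, names, and file paths are hypothetical.
 *
 * @param array<int, array{name: string, file: string}> $definitions
 * @param string[] $calledNames
 * @return array<string, string[]> called name => files defining that name
 */
function linkCalls(array $definitions, array $calledNames): array
{
    // Index every definition by its lower-cased name, keeping duplicates.
    // (PHP function and class names are case-insensitive.)
    $byName = [];
    foreach ($definitions as $definition) {
        $byName[strtolower($definition['name'])][] = $definition['file'];
    }

    // Link each call to all definitions that share its name.
    $links = [];
    foreach ($calledNames as $name) {
        $links[$name] = $byName[strtolower($name)] ?? [];
    }

    return $links;
}

$definitions = [
    ['name' => 'render', 'file' => 'catalog/functions.php'],
    ['name' => 'render', 'file' => 'admin/functions.php'],
    ['name' => 'escape', 'file' => 'shared/functions.php'],
];

$links = linkCalls($definitions, ['render', 'escape', 'missing']);
print_r($links);
```

The call to render is linked to both definitions, which over-approximates what can happen at run-time but never misses the definition that is actually included.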

Interesting outliers are Joomla and osCommerce, which in both functions and classes have the highest number of duplicate names. We examined their code to find the reasons for this. In Joomla some were conditionally defined functions that ensured compatibility in case certain extensions were not loaded, but most duplicate names were due to the fact that Joomla conditionally includes certain class definitions based on the type of output it is generating, e.g. html or json. In osCommerce we found several monkey-patched functions for compatibility with older PHP versions, but the majority of duplicate names seemed to be due to the fact that osCommerce duplicates a lot of functionality between its catalog (front end) and admin (back end) modes. We found several names with identical implementations, but also some for which the implementations varied.

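Table 3.6: Duplicate names

| Project | Files | Functions (total) | Functions (dup) | Functions (dup %) | Classes (total) | Classes (dup) | Classes (dup %) |
|---|---|---|---|---|---|---|---|
| CakePHP | 852 | 21 | 0 | 0.00 | 843 | 0 | 0.00 |
| CodeIgniter | 199 | 167 | 0 | 0.00 | 138 | 1 | 0.72 |
| Doctrine ORM | 1,048 | 0 | 0 | 0.00 | 1,500 | 0 | 0.00 |
| Fabric | 1,306 | 15 | 0 | 0.00 | 1,133 | 2 | 0.18 |
| Gallery | 568 | 45 | 1 | 2.22 | 403 | 0 | 0.00 |
| Joomla | 3,221 | 258 | 60 | 23.26 | 2,402 | 55 | 2.29 |
| Kohana | 490 | 33 | 0 | 0.00 | 389 | 0 | 0.00 |
| MediaWiki | 2,891 | 251 | 6 | 2.39 | 2,745 | 0 | 0.00 |
| osCommerce | 702 | 404 | 112 | 27.72 | 273 | 9 | 3.30 |
| PEAR | 199 | 13 | 1 | 7.69 | 131 | 2 | 1.53 |
| phpBB | 745 | 569 | 16 | 2.81 | 599 | 0 | 0.00 |
| phpDocumentor | 470 | 11 | 0 | 0.00 | 432 | 2 | 0.46 |
| phpMyAdmin | 871 | 840 | 4 | 0.48 | 633 | 0 | 0.00 |
| phpUnit | 321 | 141 | 0 | 0.00 | 310 | 1 | 0.32 |
| Roundcube | 256 | 106 | 0 | 0.00 | 264 | 0 | 0.00 |
| SilverStripe | 769 | 40 | 0 | 0.00 | 1,190 | 1 | 0.08 |
| Smarty | 204 | 59 | 1 | 1.69 | 175 | 0 | 0.00 |
| Squirrel Mail | 293 | 604 | 11 | 1.82 | 22 | 0 | 0.00 |
| Symfony | 2,978 | 19 | 0 | 0.00 | 2,754 | 5 | 0.18 |
| WordPress | 653 | 2,935 | 24 | 0.82 | 275 | 2 | 0.73 |
| Zend Framework | 2,639 | 4 | 0 | 0.00 | 1,946 | 0 | 0.00 |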

There are also a number of ways in which a call target can be determined using an expression. PHP provides variable function and method calls, and several built-in functions that can invoke other functions or methods. Hills et al. [12] showed that these features are rare but used. As shown in table 3.7, we can confirm this for our corpus. Hills et al. [13] showed that many instances of variable function and method calls cannot easily be resolved using static analysis. This means that these calls could be problematic for the correct determination of call dependences. Figure 3.9 shows examples of several forms of dynamic invocation.
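To illustrate why such calls resist static resolution, the sketch below selects the callee from run-time data. The function name encodeAs and the handler table are hypothetical; json_encode and serialize are PHP built-ins.

```php
<?php

/**
 * Illustrative only: the function name and handler table are hypothetical.
 * The callee is chosen at run time, so a static analysis can at best
 * narrow the call down to the candidate set {json_encode, serialize}.
 */
function encodeAs(string $format, array $data): string
{
    $handlers = ['json' => 'json_encode', 'php' => 'serialize'];
    $callee = $handlers[$format] ?? 'json_encode';

    return $callee($data); // variable function call
}

echo encodeAs('json', ['a' => 1]), "\n"; // {"a":1}
echo encodeAs('php', ['a' => 1]), "\n";  // a:1:{s:1:"a";i:1;}
```

Here the candidate set is still visible in the source. When the name comes from user input or configuration instead, no such set can be derived from the code alone.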

Lastly, overloading in PHP means dynamically creating methods and properties. Classes can define ‘magic’ methods that are called when certain operations attempt to access a method or property that is not defined. For methods these are __call and __callStatic. They are called when a non-existing method is called using an instance or static method call. For properties, these are __get, __set, __isset, and __unset. They are called when a non-existing property is retrieved, set, checked for existence, or deleted. All overloading methods can also be called explicitly, which can be used to enable overloading method inheritance.

An additional complication for properties is the fact that they can also be defined at run-time. As it is very difficult to determine whether or not a property is defined at run-time, we ignore this possibility and assume that all properties which we cannot resolve on objects that implement overloading hooks are instances of overloading. We accept the fact that this might generate some call dependences that would actually refer to dynamic properties at run-time. Figure 3.10 shows an example of overloading using the __get method, and a dynamic property assignment that overrides the overloading mechanism.
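The assumption described above can be sketched as a small decision procedure. The helper resolvePropertyRead and the two example classes are hypothetical; property_exists and method_exists are PHP built-ins that accept a class name.

```php
<?php

class Foo
{
    public $declared = 'x';

    public function __get($name)
    {
        return 'foo';
    }
}

class Plain
{
    public $declared = 'x';
}

/**
 * Hypothetical helper: returns the overloading hook that a property read
 * is assumed to call, or null if no call dependence is generated.
 */
function resolvePropertyRead(string $class, string $property): ?string
{
    if (property_exists($class, $property)) {
        return null; // declared property: ordinary access, no call dependence
    }
    if (method_exists($class, '__get')) {
        // Unresolved property on a class with a hook: assumed to be
        // overloading, even though a dynamic property could exist at run-time.
        return $class . '::__get';
    }

    return null; // no overloading hook available
}

var_dump(resolvePropertyRead('Foo', 'bar'));      // string(10) "Foo::__get"
var_dump(resolvePropertyRead('Foo', 'declared')); // NULL
var_dump(resolvePropertyRead('Plain', 'bar'));    // NULL
```

The first result is exactly the over-approximation the text accepts: after a dynamic assignment to the property, the read would no longer reach __get at run-time.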

Although Hills et al. [12] showed that overloading is not very common in practice, we consider it a good example of a dynamic feature in PHP that should be statically resolvable in most cases. Overloading is similar to other dynamic features in PHP that allow objects to interact with certain built-in features. Other examples include objects that implement the Countable and ArrayAccess interfaces, which can be used with the built-in count function or array syntax. If accurate type information is available, most instances of these features should be statically resolvable. By implementing support for overloading we also intend to demonstrate how these other behaviors could be integrated in the future.

Table 3.7: Dynamic invocation usage

| Project | Procs | Var func call (total) | Var func call (procs) | Var func call (proc %) | Var method call (total) | Var method call (procs) | Var method call (proc %) | call_user_func¹ (total) | call_user_func¹ (procs) | call_user_func¹ (proc %) |
|---|---|---|---|---|---|---|---|---|---|---|
| CakePHP | 9,943 | 140 | 105 | 1.06 | 54 | 37 | 0.37 | 30 | 28 | 0.28 |
| CodeIgniter | 1,897 | 17 | 8 | 0.42 | 23 | 17 | 0.90 | 9 | 7 | 0.37 |
| Doctrine ORM | 7,876 | 4 | 3 | 0.04 | 13 | 8 | 0.10 | 10 | 9 | 0.11 |
| Fabric | 7,407 | 4 | 2 | 0.03 | 0 | 0 | 0.00 | 29 | 29 | 0.39 |
| Gallery | 3,039 | 7 | 3 | 0.10 | 27 | 14 | 0.46 | 53 | 35 | 1.15 |
| Joomla | 18,922 | 14 | 12 | 0.06 | 52 | 37 | 0.20 | 94 | 76 | 0.40 |
| Kohana | 2,409 | 12 | 8 | 0.33 | 18 | 14 | 0.58 | 13 | 11 | 0.46 |
| MediaWiki | 26,594 | 224 | 76 | 0.29 | 68 | 45 | 0.17 | 271 | 232 | 0.87 |
| osCommerce | 3,290 | 2 | 1 | 0.03 | 2 | 2 | 0.06 | 5 | 4 | 0.12 |
| PEAR | 1,756 | 1 | 1 | 0.06 | 20 | 9 | 0.51 | 37 | 23 | 1.31 |
| phpBB | 5,000 | 27 | 12 | 0.24 | 6 | 5 | 0.10 | 35 | 27 | 0.54 |
| phpDocumentor | 2,682 | 16 | 16 | 0.60 | 10 | 8 | 0.30 | 1 | 1 | 0.04 |
| phpMyAdmin | 6,614 | 15 | 12 | 0.18 | 17 | 13 | 0.20 | 10 | 10 | 0.15 |
| phpUnit | 2,338 | 1 | 1 | 0.04 | 10 | 6 | 0.26 | 144 | 143 | 6.12 |
| Roundcube | 2,537 | 3 | 3 | 0.12 | 4 | 3 | 0.12 | 13 | 12 | 0.47 |
| SilverStripe | 9,610 | 46 | 28 | 0.29 | 242 | 115 | 1.20 | 92 | 69 | 0.72 |
| Smarty | 1,132 | 14 | 6 | 0.53 | 18 | 18 | 1.59 | 16 | 12 | 1.06 |
| Squirrel Mail | 1,093 | 24 | 24 | 2.20 | 0 | 0 | 0.00 | 4 | 4 | 0.37 |
| Symfony | 20,744 | 85 | 58 | 0.28 | 38 | 28 | 0.13 | 86 | 80 | 0.39 |
| WordPress | 6,904 | 23 | 9 | 0.13 | 7 | 7 | 0.10 | 127 | 101 | 1.46 |
| Zend Framework | 15,509 | 327 | 172 | 1.11 | 83 | 74 | 0.48 | 85 | 78 | 0.50 |

¹ Also includes call_user_func_array, call_user_method & call_user_method_array

    <?php

    function foo() {
        return 'foo';
    }

    class Bar {
        public function baz() {
            return 'baz';
        }
    }

    $foo = 'foo';
    $bar = new Bar();
    $baz = 'baz';

    echo $foo(); // foo
    echo $bar->$baz(); // baz
    echo call_user_func($foo); // foo
    echo call_user_func([$bar, $baz]); // baz

Figure 3.9: Dynamic invocation

    <?php

    class Foo {
        public function __get($name) {
            return 'foo';
        }
    }

    $foo = new Foo();
    echo $foo->bar; // foo

    $foo->bar = 'bar';
    echo $foo->bar; // bar

Figure 3.10: Overloading and dynamic properties

3.4 Summary

In this language analysis we have seen that with regard to control dependence PHP is mostly similar to many other imperative languages. Generators are an exception to this, but they are currently rarely used so we can safely ignore their effect and model them as function calls. Exceptions can be problematic, but within our current scope we can assume that they simply exit the procedure. With regard to data dependence, problems can come from aliasing, which is too complex to be included in our current scope, and dynamic features, which are rarely used in practice. PHP’s scoping mechanism can actually make determining data dependences easier, as there is only a single scope per procedure. Determining call dependences requires operand type information for method calls, supporting multiple definitions of the same name due to conditional inclusion, and knowledge of how objects can interact with certain built-in functionality such as overloading. Dynamic invocation is problematic to support, but rarely used in practice.


Chapter 4

PDG implementation

In this chapter we present our implementation of a PDG for PHP. It is part of our PDG and SDG library called Php-Pdg, and is publicly available¹. We first cover the libraries that we use for our implementation and then present our implementation of the PDG itself.

4.1 Libraries

Our PDG is based on PHP-CFG, originally created by Anthony Ferrara and Nikita Popov, and our own implementation of a graph library.

4.1.1 PHP-CFG

We have based our PDG implementation on PHP-CFG², a pure PHP implementation of a Control Flow Graph (CFG) in Static Single-Assignment (SSA) form [7]. PHP-CFG has its own representation of PHP operations, which is similar to the operations that the PHP engine executes when running a program. An AST ‘If’ node, for example, results in a ‘JumpIf’ operation with 2 possible target blocks and 2 ‘Jump’ operations at the end of each of those blocks to implement the joining of control flow after the ‘If’. This is shown in figure 4.1, along with the implicit return at the end of each script’s pseudo-main.

The operations used by PHP-CFG are more suitable for PDG construction than the original AST nodes. Using the original AST nodes would require us to introduce separate constructs for concepts not represented in the AST, such as the headers of while and for loops or implicit returns, which are already represented by operations in PHP-CFG.

PHP-CFG is a fairly new library, having been created in the summer of 2015, so not all features of PHP are currently supported. For example, exception handling is missing. The current implementation ignores try statements and considers a throw statement to be equivalent to an empty return. As we have seen in section 3.1 of our language analysis, most instances of a throw statement exit the procedure, so this should be satisfactory for our current purposes.

SSA

The SSA form of the CFG is advantageous to us because it makes the process of determining flow dependences trivial. SSA form was developed by Cytron et al. [7] and requires that each variable be assigned exactly once. This means that the use-def chains needed for determining flow data dependences are directly encoded into the CFG. The library constructs SSA form directly from the AST, without going through an intermediate representation. It uses an implementation of the algorithm by Braun et al. [3] for this.

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/ProgramDependence

² https://github.com/ircmaxell/php-cfg


    <?php

    if (1) {
    } else {
    }

(a) Source

    Function {main}():
      Stmt_JumpIf      cond: LITERAL<int>(1)
        if   -> Stmt_Jump  (target -> Terminal_Return)
        else -> Stmt_Jump  (target -> Terminal_Return)
      Terminal_Return  expr: LITERAL<int>(1)

(b) CFG

Figure 4.1: If-else program

Model

PHP-CFG parses files into a Script object, which contains one or more procedures represented by Func objects. These are the pseudo-main procedure and any other functions, methods, or closures in the file. A Func object contains a Block object representing the start of the CFG, which contains a list of operations represented by Op objects. The last operation of a block can link to another block. Multiple blocks can be linked in the case of a conditional jump operation. Most operations have operands. Operands can be literals, or be defined by other operations. Literals directly contain their value, and are used at most once. Due to the fact that the CFG is in SSA form, non-literals are always defined by exactly one operation, and can have any number of using operations. The phi operation is a special kind of operation that is used to construct SSA form. It is inserted at the beginning of a block that has multiple input paths to ‘select’ the value of an operand depending on which definition reaches the block. Phi operations are only inserted if their result operand is used. Figure 4.2 shows most of the mentioned objects, with operations printed by their name (e.g. Expr_Assign) and operands printed as Var#X, along with the name of their original variable.

Contributions

As part of our work for this thesis we made a number of contributions to PHP-CFG, which have been merged upstream. Most of these contributions concern small fixes to handling certain language features, so we do not explicitly describe them here.

Goto forward jumps The goto statement can be used to jump to a named label. Jumps cannot create irreducible control flow, such as when jumping into another scope or loop construct, but can go forward or backward. PHP-CFG’s implementation of goto did not support forward jumps, so we implemented them.


4.1.2 Graph

To facilitate easy use of the PDG for e.g. program slicing or clone detection, the PDG should be represented as a single graph. Some existing PDG libraries³ represent the PDG with separate graphs for control and data dependences, but this means standard graph algorithms, e.g. determining reachability, need to be applied over multiple graphs. We could not find any existing graph library in PHP that would allow us to do this easily, so we implemented our own. It offers the following key functionalities:

Custom nodes Any objects that implement the node interface can be added to the graph. This allows us to use the same graph library for all the different kinds of graphs in the library (e.g. PDG, SDG, etc.).

Edge attributes Different kinds of relations can be added to the graph using edge attributes. This is useful in e.g. the PDG, for distinguishing between control and data dependences. Two edges are considered equivalent if all their attributes match, and cannot be added to the graph multiple times. The library also allows us to retrieve edges by partial matching on attributes. This allows us to easily retrieve only some edge types.

Custom equality We consider different objects that represent the same node to be equivalent. This is useful for node objects that wrap other objects, e.g. PHP-CFG operation nodes in the PDG. When comparing a PDG to a CFG, the CFG operations can simply be wrapped in new operation node objects which will be considered equivalent to any node containing the same operation already in the PDG. Custom equality is implemented by requiring nodes to implement the getHash method, which should return an equivalent hash for equivalent nodes.

Fast edge lookup For complex programs, the PDG can grow quite large, so near constant lookup times are required. Our graph library offers amortized O(1) lookups by indexing all edges by the hashes of their nodes. Although the lack of indexing on attributes means that theoretically edge lookup is still O(n), in practice very few graphs have many edges between the same nodes.

The implementation of the main components of our graph library is shown in detail in appendix A.
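As a rough illustration of how custom equality, edge attributes with partial matching, and hash-indexed edge lookup fit together, consider the sketch below. It is not the code from appendix A: apart from getHash, all class and method names are assumptions made for this example.

```php
<?php

// Minimal sketch; class and method names other than getHash are assumptions.

interface Node
{
    public function getHash(): string;
}

final class OpNode implements Node
{
    private $label;

    public function __construct(string $label)
    {
        $this->label = $label;
    }

    public function getHash(): string
    {
        return 'op:' . $this->label; // equal labels give equal hashes
    }
}

final class Graph
{
    /** @var array<string, array<string, array[]>> attribute sets by [from hash][to hash] */
    private $edges = [];

    public function addEdge(Node $from, Node $to, array $attributes = []): void
    {
        ksort($attributes); // normalise key order so equal sets compare equal
        $existing = $this->edges[$from->getHash()][$to->getHash()] ?? [];
        if (!in_array($attributes, $existing, true)) {
            $this->edges[$from->getHash()][$to->getHash()][] = $attributes;
        }
    }

    /** Returns the attribute sets of all edges $from -> $to that contain $filter. */
    public function getEdges(Node $from, Node $to, array $filter = []): array
    {
        $matches = [];
        foreach ($this->edges[$from->getHash()][$to->getHash()] ?? [] as $attributes) {
            foreach ($filter as $key => $value) {
                if (!array_key_exists($key, $attributes) || $attributes[$key] !== $value) {
                    continue 2; // attribute missing or different: not a match
                }
            }
            $matches[] = $attributes;
        }

        return $matches;
    }
}

$graph = new Graph();
$assign = new OpNode('Expr_Assign@4');
$echo = new OpNode('Terminal_Echo@8');

$graph->addEdge($assign, $echo, ['type' => 'data', 'operand' => 'expr']);
// A different object wrapping the same operation, attributes in another order:
$graph->addEdge(new OpNode('Expr_Assign@4'), $echo, ['operand' => 'expr', 'type' => 'data']);

echo count($graph->getEdges($assign, $echo, ['type' => 'data'])), "\n";    // 1
echo count($graph->getEdges($assign, $echo, ['type' => 'control'])), "\n"; // 0
```

The lookup by node hashes is a constant-time array access, while the attribute filter scans only the edges between that one pair of nodes, which matches the trade-off described above.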

4.2 Implementation

We construct the PDG of a procedure using a Factory object, for which the input is a PHP-CFG Func object and the output is a Php-Pdg Func object. Like PHP-CFG, we do not distinguish between different kinds of procedures. We traverse the PHP-CFG Func object three times, first initializing the graph by adding all operation nodes, then adding control dependences and then adding data dependences. We demonstrate this using the program of figure 4.2 as a running example. The implementation of the outward facing components of our PDG library is shown in detail in appendix B.

4.2.1 Control dependences

To determine control dependences we implement the method of Ferrante et al. [8], with some adaptations to suit our case. We determine control dependences on the level of basic blocks first, before translating them to individual operations. A basic block is a linear sequence of program instructions having one entry point and one exit point [1]. By definition it can thus have at most one operation that is a control dependence of any other operation, and this is always the last operation in the block. This means that determining the control dependences of basic blocks is sufficient to be able to infer the control dependences of individual operations. We use this fact to significantly reduce the size of the graph that needs to be evaluated in the intermediate steps. Our algorithm involves the following high-level steps, which are explained further below:


    <?php

    if (1) {
        $a = 'foo';
    } else {
        $a = 'bar';
    }
    echo $a;

(a) Source

    Function {main}():
      Stmt_JumpIf        cond: LITERAL(1)
        if   -> Expr_Assign  var: Var#1<$a>  expr: LITERAL('foo')  result: Var#2
                Stmt_Jump    target -> joining block
        else -> Expr_Assign  var: Var#3<$a>  expr: LITERAL('bar')  result: Var#4
                Stmt_Jump    target -> joining block
      joining block:
        Var#5<$a> = Phi(Var#1<$a>, Var#3<$a>)
        Terminal_Echo    expr: Var#5<$a>
        Terminal_Return  expr: LITERAL(1)

(b) CFG

Figure 4.2: Example program


1. Construct a CFG for the basic blocks of the original CFG.

2. Construct a Post-Dominator Tree (PDT) for the basic block CFG.

3. Construct a Control Dependence Graph (CDG) using the basic block CFG and basic block PDT.

4. Add individual operation control dependences to the PDG using the basic block CDG.

Figure 4.3 shows all intermediate constructs for the program in figure 4.2. To enable us to identify individual blocks we annotate them with the minimum and maximum start line number of the AST nodes that the operations they contain were generated from. For instance, the block containing the Stmt_JumpIf operation that was generated from the if statement on line 3 is shown as Block@3:3. Note that we do not use the end line numbers of AST nodes. This is because the end line number of e.g. an if statement would be the line number of the last closing brace, meaning it would enclose any blocks generated from its ‘then’ or ‘else’ statements. We found this to be counterintuitive when reasoning about CFG operations, and that it made it more difficult to distinguish individual blocks.

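    ENTRY     -> Block@3:3
    ENTRY     -> STOP
    Block@3:3 -> Block@4:4  {"case":true}
    Block@3:3 -> Block@6:6  {"case":false}
    Block@4:4 -> Block@8:8
    Block@6:6 -> Block@8:8
    Block@8:8 -> STOP

(a) Block CFG

    STOP
      ENTRY
      Block@8:8
        Block@3:3
        Block@4:4
        Block@6:6

(b) Block PDT (each node is listed under its immediate post-dominator)

    ENTRY     -> Block@3:3
    ENTRY     -> Block@8:8
    Block@3:3 -> Block@4:4  {"case":true}
    Block@3:3 -> Block@6:6  {"case":false}

(c) Block CDG

Figure 4.3: Control dependence intermediate constructs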

Basic block CFG

The CFG created by PHP-CFG already contains Block objects, which are the basic blocks we need, so we need only add them to a new graph. As per the method of Ferrante et al. [8], we augment the new CFG with an entry and stop node, link them to the appropriate block nodes, and link the entry and stop nodes themselves. These steps are required to ensure that nodes which have no control dependences on other nodes become control dependent on the entry node. Figure 4.3a shows the resulting basic block CFG for the program in figure 4.2. The block ending in a conditional jump (the if statement) has its outgoing edges annotated to reflect the case that would result in that edge.

Basic block PDT

A PDT encodes post-domination relations between nodes. We use the following definitions [1]:

• A node A is post-dominated by a node B if all paths from A to STOP contain B.

• A node A is strictly post-dominated by a node B if A is post-dominated by B and A is not equal to B.

• A node A is immediately post-dominated by a node B if A is strictly post-dominated by B, but B does not strictly post-dominate any other node that strictly post-dominates A.


We construct the PDT by implementing the method for computing dominance presented by Cooper et al. [6], though without their optimizations. It is not the fastest available option, but its implementation is very simple and its performance is not a determining factor in the overall performance of the library. The algorithm iteratively computes the following equation until it has reached a fixpoint:

$$\mathrm{Dom}(n_0) = \{n_0\}$$

$$\mathrm{Dom}(n) = \Bigg( \bigcap_{p \in \mathrm{preds}(n)} \mathrm{Dom}(p) \Bigg) \cup \{n\}$$

By running this algorithm on an inverted CFG we can compute post-dominance relationships. We then proceed to eliminate non-strict and non-immediate relationships, and are left with a PDT. Figure 4.3b shows the basic block PDT for the example program in figure 4.2.
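The fixpoint computation and the reduction to immediate post-dominators can be sketched as follows for the running example. The array encoding of the CFG and both function names are assumptions made for illustration; this is not the library's implementation.

```php
<?php

/**
 * Sketch: post-dominator sets computed with the iterative equation,
 * run on the reversed CFG. Encoding and names are assumptions.
 *
 * @param array<string, string[]> $successors CFG successors of every node
 * @return array<string, string[]> node => sorted list of its post-dominators
 */
function postDominators(array $successors, string $stop): array
{
    $nodes = array_keys($successors);
    sort($nodes);

    // Start from "every node post-dominates every node", except for STOP.
    $pdom = [];
    foreach ($nodes as $node) {
        $pdom[$node] = ($node === $stop) ? [$stop] : $nodes;
    }

    do {
        $changed = false;
        foreach ($nodes as $node) {
            if ($node === $stop) {
                continue;
            }
            // Predecessors in the reversed CFG are successors in the CFG.
            $new = $nodes;
            foreach ($successors[$node] as $successor) {
                $new = array_intersect($new, $pdom[$successor]);
            }
            $new = array_values(array_unique(array_merge($new, [$node])));
            sort($new);
            if ($new !== $pdom[$node]) {
                $pdom[$node] = $new;
                $changed = true;
            }
        }
    } while ($changed);

    return $pdom;
}

/** Reduces post-dominator sets to a child => parent map (the PDT). */
function immediatePostDominators(array $pdom): array
{
    $parents = [];
    foreach ($pdom as $node => $set) {
        $strict = array_diff($set, [$node]);
        foreach ($strict as $candidate) {
            // Strict post-dominators form a chain. The immediate one is the
            // candidate whose own set equals the strict set of $node.
            if (count($pdom[$candidate]) === count($strict)) {
                $parents[$node] = $candidate;
            }
        }
    }

    return $parents;
}

// Basic block CFG of the running example, including ENTRY and STOP.
$cfg = [
    'ENTRY' => ['Block@3:3', 'STOP'],
    'Block@3:3' => ['Block@4:4', 'Block@6:6'],
    'Block@4:4' => ['Block@8:8'],
    'Block@6:6' => ['Block@8:8'],
    'Block@8:8' => ['STOP'],
    'STOP' => [],
];

$pdt = immediatePostDominators(postDominators($cfg, 'STOP'));
print_r($pdt);
```

For this input, Block@8:8 becomes the parent of the three blocks of the if statement, and STOP the parent of ENTRY and Block@8:8.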

Basic block CDG

Using the basic block CFG and basic block PDT, we can now determine the control dependences between basic blocks and create a basic block CDG. We evaluate all edges A-B in the CFG where B does not post-dominate A according to the PDT, and determine their least common ancestor L in the PDT. In this case there are 2 possible situations: [8]

1. L is the parent of A

2. L is equal to A (loop dependence)

In case 1, we make all nodes in the PDT between B and L excluding L control dependent on A. In case 2, we make all nodes in the PDT between B and L including L control dependent on A. Figure 4.3c shows the basic block CDG for the example program in figure 4.2. The block ending in a conditional jump again has its outgoing edges annotated to reflect the case that would result in that edge.

Operation control dependences

Control dependences on an individual operation level are directly inferred from those on the level of basic blocks. We do this by evaluating all edges A-B in the basic block CDG, and making all operations in B control dependent on the last operation of A. Figure 4.4 shows the control dependence subgraph of the PDG for the example program in figure 4.2. It contains operation nodes instead of block nodes. The operation nodes are annotated with the start line number of the AST node that they were constructed from. As the implicit return was not part of the AST, it shows up at line -1. The edges are annotated with the type of dependence, which is always control here, and the case that would result in that edge.

    ENTRY             -> Op[Stmt_JumpIf]@3       {"type":"control"}
    ENTRY             -> Op[Terminal_Echo]@8     {"type":"control"}
    ENTRY             -> Op[Terminal_Return]@-1  {"type":"control"}
    Op[Stmt_JumpIf]@3 -> Op[Expr_Assign]@4       {"case":true,"type":"control"}
    Op[Stmt_JumpIf]@3 -> Op[Stmt_Jump]@3         {"case":true,"type":"control"}
    Op[Stmt_JumpIf]@3 -> Op[Expr_Assign]@6       {"case":false,"type":"control"}
    Op[Stmt_JumpIf]@3 -> Op[Stmt_Jump]@3         {"case":false,"type":"control"}

Figure 4.4: PDG control dependence subgraph
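The construction of the basic block CDG and its expansion to operations can be sketched as follows for the running example. All names and array shapes are assumptions, the edge case annotations are left out, and the phi operation is omitted because it is not part of the PDG. Instead of computing the least common ancestor explicitly, the sketch walks from B up the PDT until it reaches the parent of A, which covers both cases listed above.

```php
<?php

/** @return array<string, string[]> controlling block => dependent blocks */
function blockControlDependences(array $cfgEdges, array $pdtParent): array
{
    $cdg = [];
    foreach ($cfgEdges as [$a, $b]) {
        // Skip the edge if B is a proper ancestor of A in the PDT,
        // i.e. B strictly post-dominates A.
        for ($n = $pdtParent[$a] ?? null; $n !== null; $n = $pdtParent[$n] ?? null) {
            if ($n === $b) {
                continue 2;
            }
        }
        // Walk from B up the PDT until the parent of A is reached. In the
        // loop case this marks A itself as well.
        $stopAt = $pdtParent[$a] ?? null;
        for ($n = $b; $n !== null && $n !== $stopAt; $n = $pdtParent[$n] ?? null) {
            $cdg[$a][$n] = true;
        }
    }

    return array_map('array_keys', $cdg);
}

/** @return array<int, string[]> list of [controlling op, dependent op] pairs */
function operationControlDependences(array $cdg, array $blockOps): array
{
    $edges = [];
    foreach ($cdg as $a => $dependents) {
        // ENTRY holds no operations and acts as the source itself.
        $ops = $blockOps[$a] ?? [];
        $source = $ops ? $ops[count($ops) - 1] : $a;
        foreach ($dependents as $b) {
            foreach ($blockOps[$b] ?? [] as $op) {
                $edges[] = [$source, $op];
            }
        }
    }

    return $edges;
}

$cfgEdges = [
    ['ENTRY', 'Block@3:3'], ['ENTRY', 'STOP'],
    ['Block@3:3', 'Block@4:4'], ['Block@3:3', 'Block@6:6'],
    ['Block@4:4', 'Block@8:8'], ['Block@6:6', 'Block@8:8'],
    ['Block@8:8', 'STOP'],
];
$pdtParent = [
    'ENTRY' => 'STOP', 'Block@8:8' => 'STOP', 'Block@3:3' => 'Block@8:8',
    'Block@4:4' => 'Block@8:8', 'Block@6:6' => 'Block@8:8',
];
$blockOps = [
    'Block@3:3' => ['Stmt_JumpIf@3'],
    'Block@4:4' => ['Expr_Assign@4', 'Stmt_Jump@3'],
    'Block@6:6' => ['Expr_Assign@6', 'Stmt_Jump@3'],
    'Block@8:8' => ['Terminal_Echo@8', 'Terminal_Return@-1'],
];

$cdg = blockControlDependences($cfgEdges, $pdtParent);
$opEdges = operationControlDependences($cdg, $blockOps);
foreach ($opEdges as [$from, $to]) {
    echo "$from -> $to\n";
}
```

The seven edges printed correspond to the operation-level control dependences of the example program.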


4.2.2 Data dependences

Ferrante et al. [8] list 4 types of data dependences:

• Flow dependence: read after write

• Antidependence: write after read

• Input dependence: read after read

• Output dependence: write after write

As noted in section 4.1.1, the SSA form of the CFG created by PHP-CFG allows us to directly infer flow dependences. PHP-CFG uses operand objects that are referenced from both defining and using operations. Operations using an operand should have data dependences on the operation defining that operand. As the CFG is in SSA form, some operands are defined by phi operations. These phi operations are not part of our PDG, so we recursively resolve them to the operations that define their operands. Figure 4.2 shows that the echo statement on line 8 references variable 5, which is defined by a phi operation. The phi operation itself uses variables 1 and 3, which are defined by the assignments on lines 4 and 6 respectively. The flow dependences of the echo operation on line 8 are therefore the two assignments on lines 4 and 6.
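The recursive resolution of phi operations can be sketched as below. The $definedBy map is a hypothetical stand-in for the operand objects of PHP-CFG, and the function name is an assumption; the guard against revisiting operands is included because phi operations in loops can refer to each other.

```php
<?php

/**
 * Sketch of recursive phi resolution. The $definedBy map is a hypothetical
 * stand-in for PHP-CFG operand objects: operand name => defining operation.
 *
 * @return string[] the non-phi operations that define $operand
 */
function resolveDefinitions(string $operand, array $definedBy, array &$seen = []): array
{
    if (isset($seen[$operand]) || !isset($definedBy[$operand])) {
        return []; // already visited (phi cycle) or no known definition
    }
    $seen[$operand] = true;

    $definition = $definedBy[$operand];
    if ($definition['op'] !== 'Phi') {
        return [$definition['op']];
    }

    // A phi is not a PDG node: resolve each of its inputs instead.
    $operations = [];
    foreach ($definition['operands'] as $input) {
        $operations = array_merge($operations, resolveDefinitions($input, $definedBy, $seen));
    }

    return array_values(array_unique($operations));
}

// Definitions of the running example (figure 4.2).
$definedBy = [
    'Var#1' => ['op' => 'Expr_Assign@4'],
    'Var#3' => ['op' => 'Expr_Assign@6'],
    'Var#5' => ['op' => 'Phi', 'operands' => ['Var#1', 'Var#3']],
];

print_r(resolveDefinitions('Var#5', $definedBy));
```

For the operand of the echo operation this returns the two assignments, which are the sources of its flow dependences.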

Figure 4.5 shows the data dependence subgraph of the PDG for the example program in figure 4.2. The operation nodes are again annotated with the starting line number of the AST node that they were generated from. The edges are annotated with the type of dependence, which is always data here, and the name of the operand that they were generated from. Adding the operand facilitates easier debugging and verification, but means that there can be multiple data dependence edges between two operations if an operation uses an operand multiple times.

[Graph with two data dependence edges:]
Op[Terminal_Echo]@8 depends on Op[Expr_Assign]@4 {"type":"data","operand":"expr"}
Op[Terminal_Echo]@8 depends on Op[Expr_Assign]@6 {"type":"data","operand":"expr"}

Figure 4.5: PDG data dependence subgraph

We currently support only flow dependences, because the other types of dependences do not really apply in the SSA context of PHP-CFG. Ferrante et al. [8] showed that they can be useful when applying certain optimizations to the original non-SSA source code, so we see them as a good optional add-on to Php-Pdg. Implementing this add-on remains future work.

4.2.3 Combined

By combining the control and data dependence subgraphs we get a full PDG. Figure 4.6 shows the PDG for the example program in figure 4.2. We now see the different dependences combined, and the need for annotating edges with their type.
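To illustrate why the type annotation matters once both subgraphs share one graph, the sketch below merges two edge lists and filters them by type. The tuple layout and the edge endpoints are assumptions, based on our reading of the node labels in the figures rather than on Php-Pdg's data structures.

```python
# Hypothetical edge tuples: (dependent operation, operation depended on, annotation).
control_edges = [
    ("Terminal_Echo@8", "ENTRY", {"type": "control"}),
    ("Expr_Assign@4", "Stmt_JumpIf@3", {"type": "control", "case": True}),
    ("Expr_Assign@6", "Stmt_JumpIf@3", {"type": "control", "case": False}),
]
data_edges = [
    ("Terminal_Echo@8", "Expr_Assign@4", {"type": "data", "operand": "expr"}),
    ("Terminal_Echo@8", "Expr_Assign@6", {"type": "data", "operand": "expr"}),
]

# The full PDG simply holds both kinds of edges side by side.
pdg_edges = control_edges + data_edges

def dependences_of(node, edge_type):
    """Return the operations that `node` depends on through edges of one type."""
    return [target for source, target, attrs in pdg_edges
            if source == node and attrs["type"] == edge_type]

echo_control = dependences_of("Terminal_Echo@8", "control")
echo_data = dependences_of("Terminal_Echo@8", "data")
```

Without the `type` annotation, the three edges leaving the echo operation would be indistinguishable in the combined graph.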

4.2.4 Extensibility

[Graph; node labels and edge annotations as they appear in the figure:]
ENTRY
Op[Stmt_JumpIf]@3 {"type":"control"}
Op[Terminal_Echo]@8 {"type":"control"}
Op[Terminal_Return]@-1 {"type":"control"}
Op[Expr_Assign]@4 {"case":true,"type":"control"}
Op[Stmt_Jump]@3 {"case":true,"type":"control"}
Op[Expr_Assign]@6 {"case":false,"type":"control"}
Op[Stmt_Jump]@3 {"case":false,"type":"control"}
{"type":"data","operand":"expr"}
{"type":"data","operand":"expr"}

Figure 4.6: Full PDG

Our PDG implementation is modular and can easily be extended with support for additional types of control and data dependences. To demonstrate this we have implemented a data dependence generator that adds ‘maybe data’ dependences for eval statements and variable variables⁴. As these language features can technically refer to any variable in the current scope, this generator makes these possible dependences explicit. We do not use this generator by default, but it can be passed to the factory to create a PDG with ‘maybe data’ dependences.

⁴ https://github.com/mwijngaard/php-pdg/blob/master/src/PhpPdg/ProgramDependence/DataDependence/
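A pluggable generator of this kind could look roughly like the following sketch. It is a simplified Python illustration, not the Php-Pdg factory or generator interface: the operation dictionaries, the `dynamic` flag, the function names, and the rule that a dynamic operation may depend on every defining operation in scope are all assumptions made for the example.

```python
def flow_generator(ops):
    """Hypothetical default generator: one edge per used, known definition."""
    definer_of = {op["defines"]: op for op in ops if op.get("defines")}
    return [(op["label"], definer_of[name]["label"], {"type": "data"})
            for op in ops
            for name in op.get("uses", [])
            if name in definer_of]

def maybe_data_generator(ops):
    """Hypothetical optional generator for eval and variable variables.

    A dynamic operation may refer to any variable in scope, so it gets a
    'maybe data' edge to every operation that defines a variable.
    """
    definers = [op for op in ops if op.get("defines")]
    return [(op["label"], definer["label"], {"type": "maybe data"})
            for op in ops if op.get("dynamic")
            for definer in definers if definer is not op]

def build_pdg_edges(ops, generators):
    """Run every configured generator and collect the edges they produce."""
    edges = []
    for generate in generators:
        edges.extend(generate(ops))
    return edges

ops = [
    {"label": "Expr_Assign@2", "defines": "a"},
    {"label": "Expr_Assign@3", "defines": "b"},
    {"label": "Expr_Eval@4", "dynamic": True},
    {"label": "Terminal_Echo@5", "uses": ["a"]},
]

default_edges = build_pdg_edges(ops, [flow_generator])
extended_edges = build_pdg_edges(ops, [flow_generator, maybe_data_generator])
```

Passing the extra generator is opt-in, mirroring the text: the default graph contains only definite dependences, and the ‘maybe data’ edges appear only when requested.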


Chapter 5

SDG implementation

In this chapter we present our implementation of an SDG for PHP. It is part of our PDG and SDG library called Php-Pdg, and is publicly available¹. We first cover the libraries that we use for our implementation and then present our implementation of the SDG itself.

5.1 Libraries

The SDG implementation is based on our own PDG implementation, described in chapter 4, and PHP-Types, originally created by Anthony Ferrara.

5.1.1 PHP-Types

We use our own fork of PHP-Types² for resolving calls. PHP-Types is a type reconstruction library for PHP-CFG. It supports parsing type information from type hints, literals and phpdoc³ annotations (@var, @param, @return). It also supports inferring result types of PHP expressions, and contains a database of type information for PHP’s built-in procedures, allowing type inference to also work for them.
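To give an impression of what parsing type information from phpdoc annotations involves, the sketch below extracts the three tags mentioned above from a docblock. It is a hedged Python illustration and not how PHP-Types does it: the regular expression, the function name, and the tuple layout are assumptions, and real docblocks allow more syntax than this pattern covers.

```python
import re

# Matches "@var Type", "@param Type $name" and "@return Type" on a single line.
TAG_PATTERN = re.compile(r"@(var|param|return)[ \t]+(\S+)(?:[ \t]+(\$\w+))?")

def parse_docblock_types(docblock):
    """Return a list of (tag, [type alternatives], variable name or None)."""
    parsed = []
    for tag, type_expr, variable in TAG_PATTERN.findall(docblock):
        # A type such as "int|string" lists several possible types.
        parsed.append((tag, type_expr.split("|"), variable or None))
    return parsed

docblock = """/**
 * @param int|string $id
 * @return User|null
 */"""
annotations = parse_docblock_types(docblock)
```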

Like PHP-CFG, PHP-Types was created in the summer of 2015, meaning it is also rather immature. This leaves a lot of room for improvement in the coverage and accuracy of type reconstruction. Since we use type information to generate method call dependences, any improvements to type reconstruction will automatically increase the quality of method call dependences in the SDG.

Model

PHP-Types operates on the CFG created by PHP-CFG (see section 4.1.1). It offers a TypeReconstructor object that takes as input the PHP-CFG Script objects of the entire application. It uses these scripts to construct a state of the application that contains a class hierarchy and indexes certain objects that are required for type reconstruction. Classes, for example, are indexed by their name, and methods by their class object and method name. It also retrieves all operands in the application. The type reconstructor first processes all operands that it can resolve immediately, such as literals and operands with type declarations, and then iteratively evaluates the operations defining any remaining operands. It attempts to infer the result types of these operations using the operands it has already resolved, its knowledge of language constructs, and the application state.
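The iterative resolution described above can be sketched as a simple fixed-point loop. This is a minimal Python illustration, not the TypeReconstructor implementation: the operation tuples, the toy inference rules in `infer`, and the type names are assumptions, and the class hierarchy and application state are left out.

```python
def infer(kind, input_types):
    """Toy inference rules for a few operation kinds (illustrative only)."""
    if kind == "concat":
        return "string"
    if kind == "plus":
        return "float" if "float" in input_types else "int"
    if kind == "assign":
        return input_types[0]
    return "mixed"

def reconstruct_types(known, operations):
    """Resolve operand types until no further progress is possible.

    known: operand -> type for immediately resolvable operands.
    operations: list of (result operand, operation kind, input operands).
    Returns the resolved types and the operands that stayed unresolved.
    """
    types = dict(known)
    pending = list(operations)
    progress = True
    while pending and progress:
        progress = False
        remaining = []
        for result, kind, inputs in pending:
            if all(operand in types for operand in inputs):
                types[result] = infer(kind, [types[o] for o in inputs])
                progress = True
            else:
                remaining.append((result, kind, inputs))
        pending = remaining
    return types, [result for result, _, _ in pending]

known = {"a": "int", "b": "float", "s": "string"}
operations = [
    ("d", "concat", ["c", "s"]),    # needs c, which is resolved in the first pass
    ("c", "plus", ["a", "b"]),
    ("e", "assign", ["missing"]),   # its input is never resolved
]
types, unresolved = reconstruct_types(known, operations)
```

The loop stops as soon as a full pass resolves nothing new, so operands whose inputs can never be resolved are reported rather than retried forever.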

The types of resolved operands are described using a Type object. PHP-Types follows the type model of phpDocumentor⁴, with some additions to accommodate practical usage. A Type object has a type property that contains a constant value representing the actual type, such as TYPE_STRING or TYPE_INT. Some types are compound types, such as TYPE_ARRAY, or represent multiple possible

¹ https://github.com/mwijngaard/php-pdg/tree/master/src/PhpPdg/SystemDependence
² https://github.com/mwijngaard/php-types
³ https://www.phpdoc.org/
