Academic year: 2021
Implementing a PDG Library in Rascal

Lulu Zhang (10630856)

Host organization: University of Amsterdam

Supervisor:

Dr. Vadim Zaytsev (vadim@grammarware.net)


Abstract

In this thesis, we present an implementation of, and tests for, a program dependence graph (PDG) library in Rascal, including descriptions of the core algorithms. Rascal is a meta-programming language used mostly for analyzing, transforming or generating source code, which fits the typical uses of a PDG. However, Rascal currently offers no library or convenient functions for PDGs, which makes a library implementation timely and meaningful.

Contents

1 Introduction
2 Program Dependence Graph
  2.1 Control Dependence
    2.1.1 Control Flow Graph
    2.1.2 Dominance
    2.1.3 Control Dependence
  2.2 Data Dependence
    2.2.1 Reaching Definitions
    2.2.2 Definition-Use Pairs and Data Dependence
  2.3 Program Dependence Graph
3 Conclusion and Future Work
Appendices: Algorithms and Tests

1 Introduction

This project implements a PDG library in Rascal to analyze Java programs. PDG is short for "Program Dependence Graph" [1], which is one type of plan calculus [2, 3]. The Plan Calculus aims to represent programs and algorithmic clichés. Building on this idea, a program dependence graph exhibits both the data and the control dependences of each statement or operation in a program. CodeSurfer is a deep semantic analysis tool that can compute control and data dependences for C and C++ programs. A significant disadvantage of this tool, however, is that it costs $6990, which is hardly affordable for programmers who have to pay for it themselves; moreover, anyone who uses programs developed on top of this tool must also pay. Rascal is a new meta-programming language mostly used for analyzing, transforming or generating source code [4]; however, the current Rascal libraries do not provide a convenient way to represent PDGs, which makes this library implementation meaningful and useful. Compared with CodeSurfer, this PDG library is more accessible as a tool to analyze programs, visualize source code and support further uses:

• Program slicing: Weiser pointed out that the process of stripping a program of statements that have no influence on a given variable at a given statement is called program slicing [5]. Experiments presented in [6] indicated that slices are used by programmers during debugging and maintenance, because a large program is easier to understand and maintain when it is decomposed into smaller pieces. Since programs are usually maintained by experienced programmers other than the original designers, maintainers benefit from automatic slicing: they obtain slice information without needing to understand the entire system. The computation of slices requires information about control and data dependences, which makes the PDG ideal as an intermediate representation.

• Program complexity metrics: Complexity measurements of software can provide important information and help reconstruct or maintain complex software. Much work has been done based on counts of physical attributes of the source code: for example, the cyclomatic complexity metric [7] and the reachability metric [8] are based on the control flow graph of the program. Karl J. Ottenstein and Linda M. Ottenstein [9] pointed out that the metrics compared in [10, 11, 12, 13, 14] consider only surface characteristics of the program, and that the nature of the flow must be considered to capture the "deep structure" of a program. Since the PDG provides explicit data and control information, it is claimed to offer a good basis for software complexity metrics. It is even possible to automatically transform the original source code to reduce psychological complexity [15] by combining a complexity metric with the PDG.

• Clone detection: All clones can be roughly categorized into two types: syntactic clones and semantic clones [16]. Syntactic clones mostly vary in comments, white space, identifiers, reordering of statements, etc. Semantic clones are more difficult to detect because they perform the same computation but are implemented with different syntactic variants or layouts [17]. According to Mati and Yishai [18], the PDG can also be used to detect clones: the semantics of the nodes along certain data or control flows are compared in order to find similar operations.

The remainder of this thesis addresses these issues by providing examples of PDGs, detailed algorithm descriptions for the main implementations, and test cases with test results for each step. Every main step is also accompanied by the results for the same example code (Code 1), extracted from [19], with specific explanations and evidence to validate this project. The appendix contains more detailed tests and results.

2 Program Dependence Graph

The PDG represents a program as a graph in which the nodes are statements and predicate expressions (or operators and operands), and the edges incident to a node represent both the data values on which the node's operations depend and the control conditions on which the execution of those operations depends [1]. In this project, nodes represent statements and predicates, and a dependency relation arises as the result of an irreversible effect between two statements, in terms of either control dependency or data dependency. For example:

Code 1: sum Example Code

0:  int n = 0;
1:  int i = 1;
2:  int sum = 0;
3:  while (i <= n) {
4:      sum = 0;
5:      int j = 1;
6:      while (j <= i) {
7:          sum = sum + j;
8:          j = j + 1;
        }
9:      System.out.println(sum + i);
10:     i = i + 1;
    }
11: System.out.println(sum + i);

First, statements 4–10 depend on the predicate i ≤ n, since the value of i ≤ n determines whether statements 4–10 will be executed. This type of dependence is control dependence: control dependence exists between two statements when one determines the execution of the other. Second, statement 3 depends on statements 0 and 1, because the execution of statement 3 needs the values of i and n; otherwise statement 3 would compute an incorrect condition result. Dependence of this type is data dependence: data dependence exists between two statements whenever reversing their order could give a variable in one of them an incorrect value.

2.1 Control Dependence

In order to get the control dependence graph, control flow and control dependence relations are needed beforehand.

2.1.1 Control Flow Graph

A control flow graph is a directed graph in which the nodes represent basic blocks and the edges represent control flow paths [20]. Figure 1 shows the CFG of the example sum code above (Code 1):

Figure 1: Control Flow Graph for Code 1

In this project, the basic block is a statement, and control flow is denoted by CF = (flow, first, [last]): flow is a list of statement relations that represents the execution sequence of each pair of statements; first is the statement executed first in the current scope; [last] is a list of the statements executed last in the current scope. There may be more than one last statement, because certain types of statements have two or more successors. The algorithms are described below.
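The CF triple described above can be modeled concretely. The following Python sketch is illustrative only: the actual library is written in Rascal, and these names are ours, not the library's API.

```python
# A minimal model of the CF triple (Python sketch; the library itself is
# written in Rascal, and these names are illustrative, not the library's API).
from dataclasses import dataclass

@dataclass
class CF:
    flow: list    # list of (pred, succ) statement-number pairs
    first: int    # statement executed first in the current scope
    last: list    # statements that may execute last in the current scope

# CF of a straight-line sequence of statements 0, 1, 2:
cf = CF(flow=[(0, 1), (1, 2)], first=0, last=[2])
print(cf.first, cf.last)   # → 0 [2]
```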

Algorithm 1: Analyze each statement and get CF (for the structure of CF, see Section 2.1.1)

Precondition: stat is the implementation statement of a method from the source code
Postcondition: all statements and their unique numbers (counting) are stored in the map statements; the flow relation among the statements, the first statement and the list of last statements are returned.

statements (num : statement) ← ()
counting

function GETCONTROLFLOW(stat)
    counting ← 0
    return STATEMENTCF(stat)

function STATEMENTCF(stat)
    switch stat do
        case blockStatement: return BLOCKCF(stat)
        case ifStatement: return IFCF(stat)
        case forStatement: return FORCF(stat)
        case whileStatement: return WHILECF(stat)
        case switchStatement: return SWITCHCF(stat)
        default:
            statements += (counting : stat)
            first ← counting
            last ← [counting]
            counting++
            return ([], first, last)
    end switch

A statement is the smallest computation unit. Algorithm 1 checks the structure of each statement; if it is of certain types, such as block, if, for, etc., the algorithm analyzes it more deeply and processes it accordingly. Otherwise, the statement is an assertion or declaration, and it is appended to a map with its unique number (the variable counting counts all the statements). Algorithms 2, 3 and 4 below, and Algorithms 9 and 10 in the Appendix, describe the processes for the different types of statement and how to concatenate separate control flows together (Algorithm 8).

The algorithm descriptions do not elaborate on the return statement, which is always the last statement of the current computation unit or function. The execution mechanism of break and continue (described in Section iii) is straightforward, so their implementation details are omitted from the algorithms; they can be implemented in different ways but under the same principle. The lack of these details does not affect the understandability of the design behind the algorithm descriptions. The tests presented in Appendix: Tests, §3 cover the most common situations, including break, continue and return inside a block or inside multiple loops, which validates the control flow part of this PDG library.

i Block

A block consists of multiple statements, or of nothing, in which case the block is empty. If the block is not empty, the first statement is computed. If this statement is not a minimum unit, Algorithm 1 recurses into it until all the inner minimum statements have been computed and concatenated. The control flow result of the first statement is then concatenated with the control flow result of the remaining statements. The concatenation algorithm is described in Algorithm 8 in the Appendix.
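This recursive concatenation can be sketched as follows (illustrative Python; the real BLOCKCF and CONCATCF are Rascal functions, and every statement is treated here as a minimal unit for brevity):

```python
# Illustrative Python sketch of the block analysis (the real BLOCKCF/CONCATCF
# are Rascal functions): the CF of the first statement is computed, then
# recursively concatenated with the CF of the remaining statements.
counting = 0
statements = {}

def statement_cf(stat):
    """CF of a minimal statement: no internal flow; it is its own first and last."""
    global counting
    num = counting
    statements[num] = stat
    counting += 1
    return ([], num, [num])

def concat_cf(cf, rest):
    """Connect every last statement of cf to the first statement of the next CF."""
    if not rest:
        return cf
    flow, first, last = cf
    nflow, nfirst, nlast = statement_cf(rest[0])   # the library recurses here
    flow = flow + [(l, nfirst) for l in last] + nflow
    return concat_cf((flow, first, nlast), rest[1:])

def block_cf(block):
    if not block:
        return ([], -1, [])                        # empty block
    return concat_cf(statement_cf(block[0]), block[1:])

flow, first, last = block_cf(["int n = 0;", "int i = 1;", "int sum = 0;"])
print(flow, first, last)   # → [(0, 1), (1, 2)] 0 [2]
```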

Algorithm 2: Analyze a block statement and get CF (for the structure of CF, see Section 2.1.1)

Precondition: stat is a block
Postcondition: all the statements inside the block and their corresponding unique numbers (counting) are stored in the map statements; the flow relation among the statements, the first statement and the list of last statements are returned.

function BLOCKCF(stat)
    block ← all the statements inside stat
    if block is not empty then
        firstCF ← STATEMENTCF(block[0])
        first ← first statement of firstCF
        blockCF ← CONCATCF(firstCF, the rest of block)
        flow, last ← flow, last of blockCF
        return (flow, first, last)
    else
        return ([], -1, [])    ▷ empty block

ii If

When there is an if statement, the condition is checked first and then a branch is taken: the first statements of the then branch and of the else branch (if there is one) are linked to the condition. If there is no else branch, the last statements are the same as those of the then branch; otherwise, the last statements of the else branch are appended as well.
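This branching logic can be sketched as a Python approximation of Algorithm 3 (names are ours; branch CFs are passed in already computed):

```python
# Python approximation of the if-statement case (compare Algorithm 3; the
# condition's statement number and the branch CFs are assumed precomputed).
def if_cf(condition, then_cf, else_cf=None):
    tflow, tfirst, tlast = then_cf
    flow = [(condition, tfirst)] + tflow
    last = list(tlast)            # no else: the lasts are the then-branch lasts
    if else_cf is not None:
        eflow, efirst, elast = else_cf
        flow += [(condition, efirst)] + eflow
        last += elast             # append the else-branch lasts as well
    return (flow, condition, last)

# if (cond@0) { 1; 2 } else { 3 }
cf = if_cf(0, ([(1, 2)], 1, [2]), ([], 3, [3]))
print(cf)   # → ([(0, 1), (1, 2), (0, 3)], 0, [2, 3])
```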

Algorithm 3: Analyze an if statement and get the control flow (for the structure of the control flow, see Section 2.1.1)

Precondition: stat is an if statement
Postcondition: all the elements/statements inside the if statement and their corresponding unique numbers (counting) are stored in the map statements; the flow relation among the statements, the first statement and the list of last statements are returned.

function IFCF(stat)
    condition ← counting
    statements += (condition : condition statement)
    counting++
    thenBranchCF ← STATEMENTCF(thenBranch)
    flow ← concatenate condition with thenBranchCF    ▷ concatenate a with B: [(a, first of B)] + flow inside B
    last ← last statements of thenBranchCF
    if there is an elseBranch then
        elseBranchCF ← STATEMENTCF(elseBranch)
        flow += concatenate condition with elseBranchCF
        last += last statements of elseBranchCF
    return (flow, condition, last)    ▷ condition is the first statement of the flow

iii For, While

For a for statement, the analyzed code fragment normally starts at the initializers and ends at the condition; if there is no condition, the loop is infinite unless there is a break inside. Before the program enters a for loop, all the initializers are executed and then the condition is checked. If the condition is not satisfied, the code inside the loop is not executed; otherwise, the program enters the loop, and the condition is checked again after all the updaters. The algorithm is described below.

Algorithm 4: Analyze a for statement and get the control flow (for the structure of the control flow, see Section 2.1.1)

Precondition: stat is a for statement
Postcondition: all the elements/statements inside the for statement and their corresponding unique numbers (counting) are stored in the map statements; the flow relation among the statements, the first statement and the list of last statements are returned.

function FORCF(stat)
    first ← counting
    for i ← all initialization statements do
        initializers += counting
        statements += (counting : i)
        counting++
    flow ← interconnect initializers    ▷ interconnect [1, 2, 3]: [(1, 2), (2, 3)]
    if there is a condition statement for this loop then
        condition, last ← counting
        statements += (condition : condition statement)
        flow += concatenate last initializer with condition
        counting++
    for u ← all update statements do
        updaters += counting
        statements += (counting : u)
        counting++
    forBodyCF ← STATEMENTCF(forBody)
    for bc ← break or continue statements of the current loop do
        if bc is a break then
            last += bc
        else    ▷ bc is a continue
            flow += concatenate bc with the first updater
    flow ← concatenate last initializer or condition, forBodyCF, updaters and the loop start    ▷ depends on whether there is a condition for the loop
    return (flow, first, last)

A while loop is similar to a for loop, but a while loop always has a condition, which is both the first and the last statement of this fragment. If the condition is satisfied, all the code inside is executed and the condition is checked again, until it is no longer satisfied; otherwise, the program skips the loop and executes the following code. The algorithm is described in Algorithm 9 in the Appendix.
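A sketch of the while case (illustrative Python approximating Algorithm 9 in the Appendix): the condition is linked to the body's first statement, and the body's last statements loop back to the condition, which remains both first and last.

```python
# Python sketch of the while-loop case (names are ours; the body CF is
# assumed precomputed): the condition is both first and last, and the body's
# last statements loop back to it.
def while_cf(condition, body_cf):
    bflow, bfirst, blast = body_cf
    flow = [(condition, bfirst)] + bflow + [(l, condition) for l in blast]
    return (flow, condition, [condition])

# while (cond@3) { 4; 5 }
cf = while_cf(3, ([(4, 5)], 4, [5]))
print(cf)   # → ([(3, 4), (4, 5), (5, 3)], 3, [3])
```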

In some cases there is a break or continue (or both) inside a loop. If there is a break, the loop is terminated and the code following the loop fragment is executed; if there is a continue, the current iteration is terminated and the next iteration starts. Here are examples and test results for break and continue (more tests are presented in Appendix: Tests, §3).

Code 2: break Example

public void testBreak() {
    for (int i = 0; i < 9; i++) {
        if (i == 6) {
            System.out.println("break");
            break;
        }
        System.out.println(i);
    }
    System.out.println("end");
}

Code 3: continue Example

public void testContinue() {
    for (int i = 0; i < 9; i++) {
        if (i == 6) {
            System.out.println("continue");
            continue;
        }
        System.out.println(i);
    }
    System.out.println("end");
}

(a) Control Flow for testBreak()  (b) Control Flow for testContinue()
Figure 2: Expected Control Flow for Break and Continue

(a) Test result for testBreak()  (b) Test result for testContinue()
Figure 3: Test results for Break and Continue

iv Switch

A switch statement starts by evaluating its expression and then executes all the statements inside any case whose constant matches the expression. If there is no match, the statements inside default are executed. The default keyword is not mandatory; without it, the program skips the switch block when no case matches. The algorithm is described in Algorithm 10 in the Appendix. In some cases there is a break at the end of the statements inside a case, which forces an exit from the current switch block after the previous statement has been executed; if there is a return, the program or function returns the value without executing the statements that follow. Here is an example (Code 4) of cases with break:

Code 4: switch-case Example with break

int i = 0;
switch (i) {
    case 0:
        System.out.println("0");
        break;
    case 1:
        System.out.println("1");
        break;
    default:
        System.out.println("default");
        break;
}

Code 5: switch-case Example without break

int i = 0;
switch (i) {
    case 0:
        System.out.println("0");
    case 1:
        System.out.println("1");
    case 2:
        System.out.println("2");
        break;
    default:
        System.out.println("default");
}

If there is no break or return keyword in a case, all the statements following the matched case are also executed until a break is found. In the example above (Code 5), after i is matched with 0, the statement inside case 0 is executed first. Since no break appears before case 1 and case 2, 1 and 2 are also printed, without executing the default statement.
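This fall-through behaviour can be illustrated with a tiny interpreter sketch (Python; illustrative only, not part of the library): cases run in order from the first match until a break is reached.

```python
# A tiny interpreter sketch of switch fall-through (illustrative only):
# cases run in order from the first match until a break is reached.
def run_switch(value, cases, default):
    """cases: list of (constant, output, has_break) triples."""
    out, matched = [], False
    for constant, stmts, has_break in cases:
        if matched or value == constant:
            matched = True
            out += stmts
            if has_break:
                return out
    if not matched:
        out += default
    return out

# Code 5: case 0 and case 1 fall through, case 2 breaks.
cases = [(0, ["0"], False), (1, ["1"], False), (2, ["2"], True)]
print(run_switch(0, cases, ["default"]))   # → ['0', '1', '2']
```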


v Test examples

Here are two test cases that cover most of the statement types. More tests, covering all types of statement, are presented in Appendix: Tests, §3.

Code 6: Test Case 1

public void test1() {
    int i = 0;
    if (i > 3) {
        for (int j = 0; j <= i; j++) {
            System.out.println(j);
        }
        System.out.println("end for");
    } else {
        while (i < 9) {
            System.out.println(i);
            i += 3;
        }
    }
    System.out.println("End");
}

Code 7: Test Case 2

public int test2() {
    int i = 3;
    int j = 4;
    switch (i + 1) {
        case 4: {
            if (j == 4)
                return 4;
            else {
                System.out.println("-4");
                return -4;
            }
        }
        case 5:
            return 5;
        default: {
            i++;
            return 6;
        }
    }
}

After analyzing both functions manually, the control flows we expected are:

(a) Control Flow for test1()  (b) Control Flow for test2()
Figure 4: Expected Control Flow

Rascal provides a testing framework that is used for all the unit tests in this project. Here are the results of testing the two functions above: we compared the control flow generated by the project with the expected control flow above. The last statements were also compared, because certain statement sequences can have more than one last statement, which may hide latent errors. The green highlight indicates that the selected tests passed, hence the algorithms work exactly as expected.

(a) Test results for test1()  (b) Test results for test2()
Figure 5: Test results

2.1.2 Dominance

Control dependences are determined based on the control flow graph generated as described above. The first step is to construct the post-dominator tree for this control flow graph. The algorithm to get post-dominators² [1] is equivalent to the algorithm that computes dominators³ [21] on the reversed control flow graph, as described in [22]:

Algorithm 5: Calculate the dominance relation based on the flow graph

Precondition: nodes are sorted in reverse postorder
Postcondition: doms holds all immediate dominators, indexed by node

function BUILDDOMINANCE(nodes)
    for b ← nodes do
        doms[b] ← Undefined
    doms[firstNode] ← firstNode
    Changed ← true
    while Changed do
        Changed ← false
        for b ← nodes (except firstNode) do
            newIdom ← first (processed) predecessor of b in the flow graph
            for p ← other predecessors of b do
                if doms[p] ≠ Undefined then
                    newIdom ← INTERSECT(p, newIdom)
            if doms[b] ≠ newIdom then
                doms[b] ← newIdom
                Changed ← true
    return doms

function INTERSECT(b1, b2)
    finger1 ← b1
    finger2 ← b2
    while finger1 ≠ finger2 do
        while finger1 < finger2 do
            finger1 ← doms[finger1]
        while finger2 < finger1 do
            finger2 ← doms[finger2]
    return finger1

For a node n, the entry doms[n] represents IDom(n), the immediate dominator of n; the entry doms[doms[n]] is then IDom(IDom(n)). By walking through doms, starting at n, the dominator tree and the dominator set of n can be reconstructed; the first node dominates itself. Figure 6 shows the dominator tree for the control flow graph (Figure 1) of the example sum code (Code 1). More tests are presented in Appendix: Tests, §3.

² A node V is post-dominated by a node W in G if every directed path from V to STOP (not including V) contains W.

³ Node d of a flow graph dominates node n if every path from the entry node of the flow graph to n goes through d.

Figure 6: Dominator tree for the control flow graph (Figure 1) of the example sum code (Code 1)
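Algorithm 5 can be transcribed into Python and checked against the CFG of Code 1. Two caveats on this sketch: the comparison direction in INTERSECT depends on the numbering convention (here each node is identified with its reverse-postorder index, entry = 0, so the larger index walks upward; with postorder numbering, as in [21], the direction reverses), and the edge list below is our reconstruction of Figure 1.

```python
# Illustrative Python transcription of Algorithm 5 (the library itself is
# Rascal). Nodes are their own reverse-postorder indices (entry = 0), so in
# intersect() the *larger* index walks upward.
def build_dominance(nodes, preds, entry):
    doms = {b: None for b in nodes}      # None plays the role of Undefined
    doms[entry] = entry

    def intersect(b1, b2):
        f1, f2 = b1, b2
        while f1 != f2:
            while f1 > f2:
                f1 = doms[f1]
            while f2 > f1:
                f2 = doms[f2]
        return f1

    changed = True
    while changed:
        changed = False
        for b in nodes:
            if b == entry:
                continue
            # start from a processed predecessor, then intersect with the others
            ps = [p for p in preds.get(b, []) if doms[p] is not None]
            new_idom = ps[0]
            for p in ps[1:]:
                new_idom = intersect(p, new_idom)
            if doms[b] != new_idom:
                doms[b] = new_idom
                changed = True
    return doms

# CFG of Code 1, reconstructed from Figure 1 (0..11 is a valid reverse postorder).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 11), (4, 5), (5, 6),
         (6, 7), (6, 9), (7, 8), (8, 6), (9, 10), (10, 3)]
preds = {}
for a, b in edges:
    preds.setdefault(b, []).append(a)
doms = build_dominance(list(range(12)), preds, 0)
print(doms[3], doms[6], doms[9], doms[11])   # → 2 5 6 3
```

The printed immediate dominators match Figure 6: for instance, node 3 is immediately dominated by node 2, and node 6 (the inner loop condition) by node 5.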

2.1.3 Control Dependence

Here we add START, EXIT and ENTRY nodes to the control flow graph. START is added before the first node and EXIT is added after the last nodes. ENTRY, representing the external condition, is a special predicate node with one edge labeled "T" going to START and another edge labeled "F" going to EXIT. The post-dominator tree is computed on the reversed control flow with these three common nodes added. Figure 7 shows the control flow graph of Code 1 with START, EXIT and ENTRY added, its post-dominator tree and its control dependence graph. Another test case, extracted from [1], is presented in Appendix 3.

(a) Control flow graph with START, EXIT and ENTRY  (b) Post-dominator tree of Figure 7a  (c) Control dependence graph
Figure 7: CFG, PDT and CDG of Code 1

Given the control flow graph and the post-dominator tree, control dependences can be determined in three main steps [1]:

i Determine and Label Control Dependence

Examined edges are extracted from the control flow graph, and nodes are annotated based on the corresponding tree paths. The examined edges S consist of all edges (A, B) of the control flow graph such that B is not an ancestor of A in the post-dominator tree. For example, according to Figures 7a and 7b, the examined edge set of the example sum program (Code 1) is S = {(ENTRY, START), (3, 4), (6, 7)}.
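S can be computed directly from the CFG edges and the post-dominator tree. In the Python sketch below, ipdom maps each node to its immediate post-dominator; its values are our reconstruction of Figure 7b, and the names are illustrative, not the library's.

```python
# ipdom maps each node to its immediate post-dominator (values reconstructed
# from Figure 7b; illustrative Python, not the library's code).
def is_ancestor(ipdom, anc, node):
    """True if anc lies on the path from node to the root of the tree."""
    while node != ipdom[node]:     # the root is its own immediate post-dominator
        if node == anc:
            return True
        node = ipdom[node]
    return node == anc

def examined_edges(cfg_edges, ipdom):
    return [(a, b) for (a, b) in cfg_edges if not is_ancestor(ipdom, b, a)]

ipdom = {"ENTRY": "EXIT", "START": 0, 0: 1, 1: 2, 2: 3, 3: 11, 4: 5, 5: 6,
         6: 9, 7: 8, 8: 6, 9: 10, 10: 3, 11: "EXIT", "EXIT": "EXIT"}
cfg = [("ENTRY", "START"), ("ENTRY", "EXIT"), ("START", 0), (0, 1), (1, 2),
       (2, 3), (3, 4), (3, 11), (4, 5), (5, 6), (6, 7), (6, 9), (7, 8),
       (8, 6), (9, 10), (10, 3), (11, "EXIT")]
S = examined_edges(cfg, ipdom)
print(S)   # → [('ENTRY', 'START'), (3, 4), (6, 7)]
```

The result agrees with the set S given in the text.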


For each edge (A, B) in S, the least common ancestor of A and B in the post-dominator tree is denoted as L. According to the proof in [1], L is either A or the parent of A in the post-dominator tree. Based on these two cases, control dependences are determined differently:

• If L is the parent of A: all nodes on the path from L (excluded) to B (included) in the post-dominator tree depend on A. In our example, the least common ancestor of ENTRY and START is EXIT, so the nodes {START, 0, 1, 2, 3, 11} depend on ENTRY with label "T".

• If L is A: all nodes on the path from L (included) to B (included) in the post-dominator tree depend on A. The dependence relation is labeled "T" or "F". In our example, the least common ancestor of 6 and 7 is 6, so all the nodes {6, 7, 8} along the path from 6 to 7 (both included) are dependent on 6 with label "T".

If two sets of nodes are control dependent on the same node, they are labeled differently (one as "T", the other as "F").
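Both cases above collapse into a single upward walk in the post-dominator tree from B, stopping at A's parent (when L = A the walk simply includes A itself). A Python sketch, with illustrative names and the ipdom values reconstructed from Figure 7b:

```python
# One upward walk covers both cases; for edges in S, A's parent always lies
# on the walk from B, so the loop terminates. (Illustrative Python sketch.)
def mark_control_dependence(ipdom, a, b, label):
    deps, node = [], b
    while node != ipdom[a]:      # stop at A's parent in the post-dominator tree
        deps.append((node, a, label))
        node = ipdom[node]
    return deps

ipdom = {"ENTRY": "EXIT", "START": 0, 0: 1, 1: 2, 2: 3, 3: 11, 4: 5, 5: 6,
         6: 9, 7: 8, 8: 6, 9: 10, 10: 3, 11: "EXIT", "EXIT": "EXIT"}
# Edge (6, 7): L = 6, so nodes 7, 8 and 6 itself depend on 6 with label "T".
deps = mark_control_dependence(ipdom, 6, 7, "T")
# Edge (ENTRY, START): L = EXIT, ENTRY's parent, so START, 0, 1, 2, 3 and 11
# depend on ENTRY.
ent = mark_control_dependence(ipdom, "ENTRY", "START", "T")
print([d[0] for d in deps], [d[0] for d in ent])
# → [7, 8, 6] ['START', 0, 1, 2, 3, 11]
```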

ii Region Node Insertion

Since several nodes may have the same dependence node under the same condition, region nodes are inserted to group those nodes. With region nodes, each predicate node has only two successors, as in the control flow graph. The control dependence predecessor set, denoted CD, is computed for each node that has anything other than a single unlabeled control dependence predecessor.

One region node R is created for each CD set. Every node whose set of control dependence predecessors is CD is then made control dependent on R, while R takes the CD set as its own control dependence predecessors. If the CD set of R is a subset of the CD set of another node, that node is made control dependent on R instead of on all the nodes in its CD set. By construction, if the CD set of one node contains that of another, the two nodes must be connected by a path in the post-dominator tree. The tree is therefore traversed in postorder, so that all children are visited before their parents.

Then we compare the CD set of the current node with that of its immediate child in the post-dominator tree. If the CD set of the current node contains the CD set of its immediate child, the corresponding dependences of R are replaced with the child's predecessor; in the opposite case, the corresponding dependences of the child are replaced with R.

In our example, the CD set of node 8 is {6T}, corresponding to region node R2, and the CD set of node 5 is {3T}, corresponding to region node R1. When node 6 is visited, a region node R3 is first created for its CD set {3T, 6T}; this CD set contains the CD sets of its children 5 and 8. Thus the region node R3 corresponding to node 6 will depend on R1 instead of node 3 with label "T", and on R2 instead of node 6 with label "T".

iii Last Modification

The last step is to scan over the graph again to make sure that every node has a unique successor for each predicate value ("T" or "F"). If a node N has multiple successors with the same label L, a new region node R is created and inserted as the predecessor of N's successors, and R is made control dependent on N with label L.


2.2 Data Dependence

For the data dependence analysis, we focus on where each variable is defined and where its value is passed. Every variable node in the AST of a program can be classified as a definition or a use. A definition of a variable happens whenever it is declared or assigned a value; a use occurs whenever its value is fetched. Since a variable may be assigned different values in multiple blocks, it is critical to analyze where the value of a given variable actually comes from. A definition-use pair can then represent data dependence: if DUPairs[U] = <D, v> is a definition-use pair, then U depends on the variable v computed by D.

Before computing data dependences or definition-use pairs, we need to compute reaching definitions as preparation. According to [21], a definition D reaches a point P if there is a path from the point immediately following D to P such that D is not killed along that path; a definition of a variable x is killed by any other definition of x along the path. For example, sum is defined in statement 2, and there is a path to statement 9 along 2, 3, 4, 5, 6, 9. But sum is redefined in statement 4, so the definition of sum in statement 2 is killed by the definition in statement 4. The definition of sum in statement 2 therefore cannot reach statement 9, which the definition in statement 4 can.

Obviously, we need to mark all the statements that define or use variables, together with the names of those variables. Since each statement is analyzed while the control flow graph is generated, we can mark definition and use statements at the same time, so that neither the control flow graph nor the individual statements need to be traversed again.

2.2.1 Reaching Definitions

Four sets of definition statements are computed here: GEN, KILL, IN and OUT [21]:

• The GEN set of a node is the set of definitions generated by this node. For example, in Code 1, the GEN set of statement 2 is {2 : "sum"} and the GEN set of statement 3 is empty.

• The KILL set of a node is the set of definitions of the variables that are redefined or reassigned (thus killed) by this node. For example, in Code 1, the KILL set of statement 7 is {2 : "sum", 4 : "sum", 7 : "sum"}.

• The IN set of a node is the set of definitions that reach the point immediately before the node. Since a definition cannot reach a statement unless there is a path reaching that statement, the IN set of a statement is no larger than the union of the OUT sets of its predecessors: IN[n] = ⋃_{p ∈ pred(n)} OUT[p], where pred(n) is the set of all predecessors of n in the control flow graph.

• The OUT set of a node is the set of definitions that reach the point immediately after the node: OUT[n] = GEN[n] ∪ (IN[n] − KILL[n]); that is, the OUT set of a node consists of all the definitions generated in this node plus all the definitions that reach it and are not killed within it.

Algorithm 6: Iterative algorithm to compute the IN and OUT sets for reaching definitions

Precondition: GEN and KILL have been computed for every statement
Postcondition: IN and OUT for every node

function GETREACHINGDEF(CF, GEN, KILL)
    nodes ← all the statements in CF
    for n ← nodes do
        OUT[n] ← GEN[n]
    Changed ← true
    while Changed do
        Changed ← false
        for n ← nodes do
            IN[n] ← ⋃_{p ∈ pred(n)} OUT[p]
            oldOUT ← OUT[n]
            OUT[n] ← GEN[n] ∪ (IN[n] − KILL[n])
            if OUT[n] ≠ oldOUT then
                Changed ← true
    return (IN, OUT)

An iterative approach [21, 19] is used to compute IN and OUT for reaching definitions (Algorithm 6). The inputs of this algorithm are the control flow generated in the steps described before, and the GEN and KILL sets of every statement (node) in this control flow. A boolean variable Changed records on each pass whether any OUT set has changed. In the end, IN contains all the definitions that flow into each node and OUT contains all the definitions that flow out of each node. Figure 8 shows the IN and OUT sets that Algorithm 6 computes for the example code given in Code 1. Tests for IN and OUT are presented in Appendix: Tests, §3.
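Algorithm 6 is straightforward to transcribe into Python (a sketch with our own names, not the library's: definitions are modeled as (statement, variable) pairs, and the GEN/KILL sets of Code 1 are derived from its definitions):

```python
# Python transcription of Algorithm 6; definitions are (stmt, var) pairs.
def reaching_definitions(nodes, preds, GEN, KILL):
    IN = {n: set() for n in nodes}
    OUT = {n: set(GEN[n]) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            IN[n] = set().union(*(OUT[p] for p in preds.get(n, [])))
            old = OUT[n]
            OUT[n] = GEN[n] | (IN[n] - KILL[n])
            if OUT[n] != old:
                changed = True
    return IN, OUT

# CFG edges of Code 1 and its definitions (statement -> variable defined).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 11), (4, 5), (5, 6),
         (6, 7), (6, 9), (7, 8), (8, 6), (9, 10), (10, 3)]
preds = {}
for a, b in edges:
    preds.setdefault(b, []).append(a)
defs = {0: "n", 1: "i", 2: "sum", 4: "sum", 5: "j", 7: "sum", 8: "j", 10: "i"}
GEN = {n: ({(n, defs[n])} if n in defs else set()) for n in range(12)}
KILL = {n: ({(m, v) for m, v in defs.items() if v == defs[n]} if n in defs
            else set()) for n in range(12)}

IN, OUT = reaching_definitions(range(12), preds, GEN, KILL)
# The definition of sum in statement 2 reaches 11 but is killed before 9:
print((2, "sum") in IN[11], (2, "sum") in IN[9])   # → True False
```

The output matches the kill example discussed in Section 2.2: the definition of sum in statement 2 reaches statement 11 but not statement 9.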


2.2.2 Definition-Use Pairs and Data Dependence

A definition-use pair for a variable v is an ordered pair DUPairs[U] = <D, v>: D is the statement where v is defined and U is a statement where the value of v is fetched, such that along the way from D to U in the CFG the value of v defined in D is not killed. Algorithm 7 describes the computation of definition-use pairs.

Algorithm 7: Compute definition-use pairs

Precondition: the IN set has been computed for every statement and the USE map has been extracted
Postcondition: a set of definition-use pairs

function COMPUTEDEFUSEPAIRS(IN, USE)
    DUPairs ← {}
    for n ← USE do
        for <U, variable> ← USE[n] do
            if there is a definition D of variable in IN[n] then
                DUPairs[U] ← <D, variable>
    return DUPairs

A definition-use pair represents a data interaction between nodes (statements) in a program. Table 1 shows all the definition-use pairs of Code 1, computed by Algorithm 7. For statement 3, the value of the variable n used in statement 3 comes from the definition in statement 0, and the value of the variable i comes from the definitions in statements 1 and 10. This implies that the computation in statement 3 depends on the data computed in statements 0, 1 and 10. This dependence is called data dependence. The data dependence test result for Code 1 is presented in Appendix: Tests, §3.

Uses\Defs |  0 |  1 |  2  |  3 |  4  |  5 |  6 |  7  |  8 |  9 | 10 | 11
    3     |  n |  i |     |    |     |    |    |     |    |    |  i |
    6     |    |  i |     |    |     |  j |    |     |  j |    |  i |
    7     |    |    |     |    | sum |  j |    | sum |  j |    |    |
    8     |    |    |     |    |     |  j |    |     |  j |    |    |
    9     |    |  i |     |    | sum |    |    | sum |    |    |  i |
   10     |    |  i |     |    |     |    |    |     |    |    |  i |
   11     |    |  i | sum |    | sum |    |    | sum |    |    |  i |

(Statements 0, 1, 2, 4 and 5 use no variables; their rows are empty.)

Table 1: Definition-Use pairs for the sum code (Code 1)
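Algorithm 7 can also be sketched in Python. To keep the example short, the IN map below is restricted to statement 3 of Code 1, with the values produced by the reaching-definitions pass; the names are ours, not the library's.

```python
# Python sketch of Algorithm 7: USE maps each statement to the variables it
# uses; IN comes from the reaching-definitions pass (here restricted to
# statement 3 of Code 1 for brevity).
def compute_def_use_pairs(IN, USE):
    du_pairs = {}
    for u, variables in USE.items():
        for var in variables:
            for d, v in IN[u]:
                if v == var:
                    du_pairs.setdefault(u, set()).add((d, var))
    return du_pairs

IN = {3: {(0, "n"), (1, "i"), (10, "i"), (2, "sum"), (4, "sum"),
          (7, "sum"), (5, "j"), (8, "j")}}
USE = {3: {"n", "i"}}
pairs = compute_def_use_pairs(IN, USE)
print(pairs[3] == {(0, "n"), (1, "i"), (10, "i")})   # → True
```

This reproduces the row for statement 3 in Table 1.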

2.3 Program Dependence Graph

A program dependence graph presents both the control dependences and the data dependences of a program [1]. The data dependences computed in Section 2.2 consist of definition and use nodes, the relations that represent the data dependences, and the names of the variables upon which those relations are built. We can therefore add the dependence relations to the control dependence graph as edges labeled with variable names. To do this, for each statement we look up all its definition-use pairs in DUPairs. For example, DUPairs[3] = {<0, n>, <1, i>, <10, i>}, so we add edges from nodes 0, 1 and 10 to node 3, labeled "n", "i" and "i" respectively.
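Turning the pairs into labeled edges is then mechanical (Python sketch; names are ours):

```python
# Each definition-use pair <D, v> of a statement U becomes an edge D -> U
# labeled v (illustrative sketch, not the library's code).
def data_dependence_edges(du_pairs):
    return sorted((d, u, var) for u, pairs in du_pairs.items()
                  for d, var in pairs)

du_pairs = {3: {(0, "n"), (1, "i"), (10, "i")}}
edges = data_dependence_edges(du_pairs)
print(edges)   # → [(0, 3, 'n'), (1, 3, 'i'), (10, 3, 'i')]
```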


Figure9ashows the dependence graph for example code1with control dependences and data dependences for node 3 is added (dotted line). Figure9b presents all the data dependences of sumfor example sum code1. The complete PDG is presented in Appendix3.

(a) Parts of the PDG for example sum Code 1 with control dependences and the data dependences of node 3

(b) Parts of the PDG for example sum Code 1 with control dependences and the data dependences of sum

Figure 9: Parts of the PDG for example sum Code 1


3 Conclusion and Future Work

We have successfully implemented a PDG library in Rascal; it can be accessed via http://github.com/lulu516/ProgramDependenceGraph. The examples presented in [1, 19] are used as test cases, and the fact that the computed outcomes match the expected ones is regarded as evidence of the correctness of this project. The tests are complemented with hand-written cases whose computation results are compared with our own analysis results.

However, the main threat to validity in this project is the correctness of the self-analysis results of the test cases. Since every step and statement type is accompanied by several self-written test cases that are regarded as evidence for the correctness of this project, it is important to have confidence in the self-analysis results: they serve as an oracle for testing the algorithms.

Some future work remains. Since testing plays a significant role here, this PDG library could be further validated by differential testing [23] against CodeSurfer, which would save the effort of evaluating test results and increase test coverage. Besides, this PDG library can be extended towards the applications described in section 1, such as slicing, and towards interprocedural analysis, which is typically overlooked in the literature on PDGs. Because this project focuses on well-structured Java programs, we did not consider features such as labels or goto, which could however easily be covered.


Appendix: Algorithms

1. Concatenation Algorithms

The concatCF function takes a CF value (mainCF) and a list of statements as arguments. The CF of the first statement is computed, after which the program recursively computes the concatenated control flow of that CF and the remaining statements. Because every return statement is also a last statement of the analyzed program, return statements are appended to the last-statements list of the final CF result.

Algorithm 8 Concatenate control flows (for the structure of a control flow, see: 2.1.1)

Postcondition: control flows will be linked together into one complete flow

1  function CONCATCF(mainCF, statements)
2      if statements is empty then
3          return mainCF;
4      else
5          firstCF ← STATEMENTCF(statements[0]);
6          if the first statement is not an empty block then
7              restCF ← CONCATCF(firstCF, rest of statements);    ▷ recursion
8              flow ← concatenate mainCF with restCF;
9              return (flow, first of mainCF, last of restCF) + all return statements;
10         else if the first statement is an empty block and there are two or more statements then
11             return CONCATCF(mainCF, rest of statements);
12         else
13             return mainCF;
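The recursion of Algorithm 8 can be sketched on a simplified flow representation. The Python below is an illustration under assumptions (flows as (edges, first, lasts) triples, plain statements only; empty-block and return-statement handling is omitted), not the Rascal implementation:

```python
# A single plain statement forms a one-node flow: no edges, itself as
# both first node and only last node. (Stub standing in for STATEMENTCF.)
def statement_cf(stmt):
    return (set(), stmt, [stmt])

# Recursively link the flow built so far to the flow of the next statement.
def concat_cf(main_cf, statements):
    if not statements:
        return main_cf
    edges, first, lasts = main_cf
    f_edges, f_first, f_lasts = statement_cf(statements[0])
    # every last node of the flow so far flows into the next statement
    new_edges = edges | f_edges | {(l, f_first) for l in lasts}
    return concat_cf((new_edges, first, f_lasts), statements[1:])

# Chaining statements 1, 2, 3 after node 0 yields edges 0->1->2->3,
# keeps 0 as the first node and makes [3] the last-statements list.
edges, first, lasts = concat_cf((set(), 0, [0]), [1, 2, 3])
```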


2. While

Algorithm 9 Analyze a While statement and get the control flow (for the structure of a control flow, see: 2.1.1)

Precondition: stat is a While statement

Postcondition: all the elements/statements inside the While statement and their corresponding unique numbers (counting) will be stored in the map statements; the relation among statements, the first statement and a list of last statements of the relation will be returned.

1  function WHILECF(stat)
2      condition ← counting;
3      statements += (condition : condition statement);
4      counting++;
5      last ← condition;
6      whileBodyCF ← STATEMENTCF(whileBody);
7      flow ← concatenate condition, whileBodyCF and condition;    ▷ a loop
8      for bc ← break or continue statements for the current loop do
9          if bc is a break then
10             last += bc;
11         else                                                    ▷ else bc is a continue
12             flow += concatenate bc with the condition;
13     return (flow, condition, last);
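The shape of the flow produced by Algorithm 9 can be sketched as follows (Python for illustration; the triple representation and parameter names are assumptions, not the Rascal implementation):

```python
# Build the control flow of a while loop from its condition node and the
# flow of its body: the condition guards the body, the body loops back to
# the condition, continues jump to the condition, breaks jump out.
def while_cf(cond, body_edges, body_first, body_lasts, breaks=(), continues=()):
    edges = set(body_edges)
    edges.add((cond, body_first))      # condition -> body (true branch)
    for l in body_lasts:
        edges.add((l, cond))           # end of body -> condition (loop)
    for c in continues:
        edges.add((c, cond))           # continue -> condition
    lasts = [cond] + list(breaks)      # loop exits: false condition and breaks
    return (edges, cond, lasts)

# Condition node 0, body 1->2, a break at node 3: the loop exits via 0 or 3.
edges, first, lasts = while_cf(0, {(1, 2)}, 1, [2], breaks=[3])
```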

3. Switch-Case

Algorithm 10 Analyze Switch and Case statements and get the control flow (for the structure of a control flow, see: 2.1.1)

Precondition: stat is a Switch statement

Postcondition: all the elements/statements inside the Switch statement and their corresponding unique numbers (counting) will be stored in the map statements; the relation among statements, the first statement and a list of last statements of the relation will be returned.

1  function SWITCHCF(stat)
2      expr ← counting;
3      statements += (expr : expression statement);
4      counting++;
5      cases ← divide switch statements by the case keyword;
6      groupedCases ← group cases by break or return;
7      for caseGroup ← groupedCases do
8          caseCFs ← BLOCKCF(caseGroup);
9      flow ← concatenate expr with the first statement and flow of each CF in caseCFs;
10     last ← last statements of each CF in caseCFs;
11     return (flow, expr, last);
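The fan-out built by Algorithm 10 can be sketched similarly (Python for illustration; this simplification assumes every case group ends in a break or return and omits the exit taken when no case matches):

```python
# Build the control flow of a switch from its expression node and the
# flows of its case groups (cases grouped up to a break/return).
def switch_cf(expr, case_groups):
    """case_groups: list of (edges, first, lasts) flow triples."""
    edges, lasts = set(), []
    for g_edges, g_first, g_lasts in case_groups:
        edges |= set(g_edges)
        edges.add((expr, g_first))  # expression -> first statement of the group
        lasts += g_lasts            # each group's exits are exits of the switch
    return (edges, expr, lasts)

# Expression node 0, one single-node group {1} and one two-node group 2->3.
edges, first, lasts = switch_cf(0, [(set(), 1, [1]), ({(2, 3)}, 2, [3])])
```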


Appendix: Tests

1. Control Flow Tests

Here are test cases for each statement type in Java. The tests, written in Java, compare the program results with the expected results; a green highlight means success.

1.1 Basic Statement

Test cases:

public void testBasic() {
    int i = 0;
    int t = 5;
    i = 3 + 1;
    System.out.println("end");
}


1.2 If Statement

Test cases:

public void testIf() {
    int i = 0;
    if (i > 0) {
        int j = 3;
        System.out.println("first" + i + j);
    } else if (i == -4) {
        System.out.println("second");
    } else {
        System.out.println("third");
    }
    System.out.println("End");
}

public void testIf2() {
    int i = 0;
    if (i > 0) {
        int j = 3;
        System.out.println("first" + i + j);
    } else if (i == -4) {
        System.out.println("second");
    } else {
        System.out.println("third");
    }
}

Expected results:

The figure above is the expected result of testIf(). The expected result of testIf2() is the same figure without Node 7.


1.3 For Statement

Test cases:

public void testFor() {
    int m = 2;
    for (int i = 0; i <= m; i++) {
        System.out.println("FOR");
    }
    System.out.println("END");
}

public void testFor2() {
    int m = 2;
    for (int i = 0, j = 7; i <= j; i++, j--) {
        System.out.println("For" + m);
    }
}

Expected results:


1.4 While Statement

Test cases:

public void testWhile() {
    int i = 3;
    while (i > 1) {
        System.out.println("While");
        i--;
    }
    System.out.println("End");
}

public void testWhile2() {
    int i = 3;
    while (i > 1) {
        System.out.println("While");
        i--;
    }
}

Expected results:


1.5 Switch Statement

Test cases:

public void testSwitch() {
    int i = 0;
    switch (i + 1) {
        case 0:
            System.out.println("0");
            break;
        case 1:
            System.out.println("1");
        case 2:
            System.out.println("2");
            break;
        default:
            System.out.println("default");
    }
}

Expected results:

Results:


1.6 Return Statement

Test cases:

public void testReturn1() {
    int i = 0;
    if (i == 2)
        return;
    System.out.println("i");
}

public void testReturn2() {
    for (int i = 0; i < 4; i++) {
        if (i == 2)
            return;
        else if (i == 3) {
            i += 5;
            System.out.println("elseif");
        }
        System.out.println("end");
    }
}


1.7 Break and Continue

Test cases:

public void testBreakContinue1() {
    int i = 0;
    while (i < 9) {
        if (i == 3) {
            i = 5;
            continue;
        } else if (i == 5)
            break;
        i++;
    }
    System.out.println("while");
}

public void testBreakContinue2() {
    for (int i = 0; i < 6; i++) {
        int j = 3;
        while (i < j) {
            if (i == j)
                break;
            j--;
        }
        if (i == 1) {
            System.out.println("continue");
            continue;
        } else if (i == 4)
            break;
        System.out.println("loop2");
    }
}

Results:


2. Dominance Tests

Here is a test from Lengauer and Tarjan's work [24] to validate the correctness of this part of the implementation.
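The dominance relation under test can be cross-checked with the simple iterative data-flow formulation (in the spirit of the algorithm in [22]); the sketch below is illustrative Python, not the Rascal implementation:

```python
# Iteratively compute Dom(n) = {n} ∪ ⋂ Dom(p) over all predecessors p of n,
# starting from "every node dominates every node" except the entry node.
def dominators(nodes, preds, entry):
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Diamond CFG 0 -> {1, 2} -> 3: node 3 is dominated only by 0 and itself,
# because neither branch node 1 nor 2 lies on every path from 0 to 3.
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
dom = dominators([0, 1, 2, 3], preds, 0)
```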

(a) Example flow graph

(b) Expected dominance relation

(c) Test result Figure 10: Dominance test 1

Here are two test cases from [22] which are used to test the correctness of the implementation.

(a) Example flow graph

(b) Expected dominance relation

(c) Test result Figure 11: Dominance test 2

(a) Example flow graph

(b) Expected dominance relation

(c) Test result Figure 12: Dominance test 3


3. Data Dependence Tests

3.1 IN and OUT Test

Here are the IN and OUT tests, in which we compare the results generated by the program with our analysis results in Figure 8:
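The IN and OUT sets under test follow the standard reaching-definitions equations: IN[n] is the union of OUT over the predecessors of n, and OUT[n] = GEN[n] ∪ (IN[n] − KILL[n]). A small illustrative Python sketch (the names, node numbering and definition labels are assumptions, not the Rascal implementation):

```python
# Iterative reaching-definitions analysis over a CFG given as predecessor
# lists, with per-node GEN and KILL sets of definition labels.
def reaching_definitions(nodes, preds, gen, kill):
    IN = {n: set() for n in nodes}
    OUT = {n: set(gen[n]) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            IN[n] = set().union(*(OUT[p] for p in preds[n])) if preds[n] else set()
            new_out = gen[n] | (IN[n] - kill[n])
            if new_out != OUT[n]:
                OUT[n] = new_out
                changed = True
    return IN, OUT

# d0 defines i before a loop, d2 redefines i inside it (node 2 loops back
# to node 1): both definitions reach the loop header 1, but only d2
# survives past node 2 because d2 kills d0.
nodes = [0, 1, 2]
preds = {0: [], 1: [0, 2], 2: [1]}
gen = {0: {"d0"}, 1: set(), 2: {"d2"}}
kill = {0: {"d2"}, 1: set(), 2: {"d0"}}
IN, OUT = reaching_definitions(nodes, preds, gen, kill)
```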

3.2 Definition-Use Pairs Test

Here is the data dependence test, in which we compare the results generated by the program with our analysis results in Table 1:


4. Control Dependence Graph

This is a test case extracted from [1]:

Figure 13: Control Dependence Graph example

5. Program Dependence Graph

This is the program dependence graph of Code 1 generated by this library.


We tested this library on a random Java program found online (https://github.com/hujiaweibujidao/JavaProjects/blob/a6bed02763079fa62d4ba3d3ede2fea4df3a415c/AlgorithmPractise/src/ACM/NY/NY006.java). The generated program dependence graph of this Java program is presented below:


Bibliography

[1] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems (TOPLAS), 9(3):319–349, July 1987.

[2] Charles Rich. A formal representation for plans in the programmer's apprentice. In Michael L. Brodie, John Mylopoulos, and Joachim W. Schmidt, editors, On Conceptual Modelling, Topics in Information Systems, pages 239–273. Springer New York, 1984.

[3] Charles Rich and Richard C. Waters. The Programmer’s Apprentice. ACM, New York, NY, USA, 1990.

[4] Paul Klint, Tijs van der Storm, and Jurgen Vinju. EASY meta-programming with Rascal. In Proceedings of the 3rd International Summer School Conference on Generative and Transformational Techniques in Software Engineering III, GTTSE'09, pages 222–289, Berlin, Heidelberg, 2011. Springer-Verlag.

[5] Mark Weiser. Programmers use slices when debugging. Commun. ACM, 25(7):446–452, July 1982.

[6] Mark Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, ICSE ’81, pages 439–449. IEEE Press, 1981.

[7] Thomas J. McCabe. A complexity measure. Software Engineering, IEEE Transactions on, (4):308–320, 1976.

[8] Norman F. Schneidewind and H.M. Hoffmann. An experiment in software error data collection and analysis. Software Engineering, IEEE Transactions on, (3):276–286, 1979.

[9] Karl J. Ottenstein and Linda M. Ottenstein. The program dependence graph in a software development environment. In Proceedings of the First ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, SDE 1, pages 177–184, New York, NY, USA, 1984. ACM.

[10] Albert L. Baker and Stuart H. Zweben. The use of software science in evaluating modularity concepts. Software Engineering, IEEE Transactions on, (2):110–120, 1979.

[11] Victor R. Basili and Tsai-Yun Phillips. Evaluating and comparing software metrics in the software engineering laboratory. ACM SIGMETRICS Performance Evaluation Review, 10(1):95–106, 1981.

[12] Bill Curtis, Sylvia B. Sheppard, and Phil Milliman. Third time charm: Stronger prediction of programmer performance by software complexity metrics. In Proceedings of the 4th International Conference on Software Engineering, pages 356–360. IEEE Press, 1979.


[13] Alan R. Feuer and Edward B. Fowlkes. Some results from an empirical study of computer software. In Proceedings of the 4th International Conference on Software Engineering, pages 351–355. IEEE Press, 1979.

[14] Sallie Henry and Dennis Kafura. Software structure metrics based on information flow. Software Engineering, IEEE Transactions on, SE-7(5):510–518, Sept 1981.

[15] Bill Curtis. Measurement and experimentation in software engineering. Proceedings of the IEEE, 68(9):1144–1157, 1980.

[16] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering, pages 96–105. IEEE Computer Society, 2007.

[17] Chanchal K. Roy, James R. Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, 2009.

[18] Mati Shomrat and Yishai A. Feldman. Detecting refactored clones. In Giuseppe Castagna, editor, ECOOP 2013 – Object-Oriented Programming, volume 7920 of Lecture Notes in Computer Science, pages 502–526. Springer Berlin Heidelberg, 2013.

[19] Mary Jean Harrold, Gregg Rothermel, and Alex Orso. Representation and analysis of software. http://www.ics.uci.edu/~lopes/teaching/inf212W12/readings/rep-analysis-soft.pdf.

[20] Frances E. Allen. Control flow analysis. In ACM Sigplan Notices, volume 5, pages 1–19. ACM, 1970.

[21] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.

[22] Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. A simple, fast dominance algorithm. Software — Practice & Experience, 4:1–10, 2001.

[23] William M. McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, 1998.

[24] Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems (TOPLAS), 1(1):121–141, 1979.
