Detecting Refactored Clones with Rascal
Master Thesis
René Bulsing
Faculty of Science
Supervisor: Dr. Vadim Zaytsev
Version 1.0
Amsterdam, August 2015
Abstract
Cloned code is very common in modern software, and its harmfulness has been researched often, with conflicting results. One area that does show consensus is that inconsistent changes to code clones, and the resulting deviations between them, are a source of software defects. Being able to detect such deviations is therefore potentially valuable. In 2013, Shomrat and Feldman developed a tool called Cider that is able to detect code clones that have deviated due to refactoring. Our research set out to reimplement their tool in the meta-programming language Rascal. Both tools target the analysis of Java projects exclusively.
Where Cider utilises an internal semantic program representation based on the Plan Calculus, we have opted for the system dependence graph. Building a library in Rascal that creates such graphs has been the major part of the project. The resulting library is able to create correct system dependence graphs that cover a wide range of Java statements. Unfortunately, not every statement has been covered, due to time constraints. Implementing data analysis on this and super statements remains future work, as does control analysis on labels and the associated labelled break and continue statements.
As can be seen from the results, we have succeeded in reimplementing Cider in Rascal. Our results overlap with the publicly available results of Cider, which unfortunately form a small set. Regardless, this shows the potential of our tool, and its precision and recall percentages are comparable. Some projects do show a low precision, as is the case for Cider too, indicating that much work remains to be done. The tool nevertheless has large potential and contributes to the research on refactored clone detection. Unlike Cider, it is also publicly available (https://github.com/grammarware/pdg), together with a detailed data set containing our results, so it can serve as a basis for future research.
Another contribution resulting from our research is the centralisation of fragmented information. Our work covers the creation of a system dependence graph for Java programs, including all the necessary intermediate representations and interesting edge-cases; information that is usually spread across multiple sources. Combining that with coverage of refactored clone detection, and the public availability of our tool and data, makes our work a solid centralised body of information for further research on the subject.
Contents

1 Introduction
   1.1 Relevance
   1.2 Document Structure
   1.3 Contributions

I Project Concept

2 Project Description
   2.1 Clone Detection
   2.2 Motivation
3 Research Questions
   3.1 Existing Library
   3.2 Modification
   3.3 Search-space
   3.4 Scalability
   3.5 Value
   3.6 Cider Comparison

II Program Representation

4 Overview
   4.1 Node Types
   4.2 Data Structures
   4.3 Validation
5 Control Flow Graph
   5.1 Graph Construction
   5.2 Special Constructs
      5.2.1 Try-Catch(-Finally)
      5.2.2 Method Calls
      5.2.3 Method Declaration & Return
   5.3 Considerations
6 Post-Dominator Tree
   6.1 Tree Construction
   6.2 Finding Immediate Post-Dominators
7 Control Dependence Graph
   7.1 Graph Construction
   7.2 Finding Immediate Control Dependence
   7.3 Simplifications
8 Data Dependence Graph
   8.1 Data Collections
      8.1.1 Variable Encoding
   8.2 Reaching Definitions
   8.3 Graph Construction
   8.4 Simplifications
9 Program Dependence Graph
   9.1 Graph Construction
10 System Dependence Graph
   10.1 Graph Construction
      10.1.1 Node Encoding
      10.1.2 Example
   10.2 Limitations
11 Call Graph
   11.1 Graph Construction

III Clone Detection

12 The Algorithm
   12.1 Flows
   12.2 Matching
   12.3 Filtering Clones
   12.4 Changes
      12.4.1 Naive
      12.4.2 Lockstep
   12.5 Limitations
13 Seeding
   13.1 Internal Seeds
   13.2 Filtering
   13.3 Input Restrictions
   13.4 Limitation

IV Evaluation

14 Results
   14.1 Refactored Clones
      14.1.1 Unfiltered Seeds
      14.1.2 Filtered Seeds
      14.1.3 Analysis
   14.2 Interprocedural Clones
   14.3 Performance
15 Cider Comparison
   15.1 Results
   15.2 Performance
16 Related Work
17 Threats to Validity
   17.1 Internal Validity
   17.2 External Validity
18 Future Work
   18.1 Graphs
   18.2 Clone Detection
   18.3 Verification
19 Conclusion

Bibliography

A Source Code
B Clone Analysis
Chapter 1
Introduction
In this thesis we present the research and development efforts for reimplementing a clone detection tool known as Cider [39] in Rascal [24]. The original tool developed by Shomrat and Feldman is quite unique as it focuses on the detection of interprocedural clones that have deviated by refactoring, mainly method extraction and inline method [11]. Even though Shomrat and Feldman explain their algorithm and tool, Cider itself does not seem to be published, making a reimplementation in Rascal interesting.
To facilitate a reimplementation in Rascal we had to develop a library to create a semantic representation of Java programs, known as the system dependence graph [16, 41]. Reports of the findings uncovered during its development span a large portion of this thesis.
1.1
Relevance
Recent research shows an open-minded attitude towards code clones, in the sense that they are not always considered a bad thing [17, 22]. However, research has also shown that separate evolution of code clones can introduce software defects [20]. When a change has been introduced in a cloned code fragment, it is important for that change to be applied to the other instances as well. This is a labour-intensive and error-prone task, because the developer can easily miss an instance. Clone detection techniques that do not use a semantic representation of the program will not spot this discrepancy, because the syntactical structure of the clone has changed [37]. It is not hard to imagine that such a forgotten clone can cause software defects as additional changes are applied to the properly maintained clone instances, leaving the forgotten ones to deviate further.
A clone detection tool that uses a semantic program representation is able to spot clones that have a different syntactic structure, but are semantically identical. For example, suppose a cloned code fragment with four instances is refactored by extracting a method, leaving the functionality unchanged, but the change is applied in only three of the four instances. A tool like Cider will still identify all four instances as clones. The developer will then be able to spot the mistake and easily correct it. It is not difficult to see the value in such a tool and how it may be relevant.
1.2
Document Structure
Our findings and development efforts are split into four major parts in this thesis. The order of the parts and chapters represents our progression path through the project. We enumerate these parts and briefly explain their contents below.
Part one presents the project concept, touching on the practice of clone detection and presenting a motivation for the choice of programming language and program representation. In this part we also present the research questions which have directed the project.
Part two presents the program representation of choice, together with all the intermediate representations that are necessary to generate it. Our work is based on that of a previous student, which is highlighted in the first overview chapter. All other chapters discuss a single program representation, explaining the concepts, algorithms, and modifications or extensions.
Part three discusses the clone detection algorithm and how it is seeded. Modifications have been necessary, because we do not use the same program representation as Cider. An explanation of said changes is an important subject in this part. Even though this part is relatively small, it is the most important one, as the concepts and algorithms behind the actual clone detection are presented here.
Part four is the final part of the thesis and contains the evaluation. First we present our experiment results, such as precision and recall on multiple configurations of the tool. Subsequently, the results are compared to those of Cider. The remaining chapters present related work, threats to validity, future work, and the conclusion, in which we shortly reiterate the research questions and answer them.
Algorithms that are listed in the chapters closely resemble their coded implementation in our tool. Generally we avoid direct explanation of code, because it is very volatile while the tool is a work in progress. Instead the concepts and algorithms are explained, which are not as likely to change.
1.3
Contributions
In the listing below we briefly discuss the contributions of this project. Naturally, we explain throughout this thesis how we achieved them, but we still wish to grant the reader a convenient overview below.
Detecting refactored clones. The tool we developed is able to detect refactored clones between two subsequent versions of software. In doing so we have shown the feasibility of using the system dependence graph to implement the algorithms and concepts that are presented by Shomrat and Feldman [39], who used the Plan Calculus [35, 36] instead. This also further proves that their algorithms and concepts function correctly and can be adapted to work with a different program representation.
Open source. Not every tool resulting from research is open source and available to the public, nor is its raw data. This is also the case for Cider, but not for our tool [5].
The public availability of our tool and the raw data on which the evaluation is based,
marks a contribution on its own. It grants future researchers a solid base of information
and an implementation example for the algorithms and concepts.
Centralised information. During development and research we noticed that information
on graph representations is fragmented. Most papers only cover a single graph type,
leaving out necessary preliminary information for an implementation. Our work covers
the creation of every graph and shows the dependencies between them. It may not
be the most thorough coverage available, but it includes the preliminary information
needed for an implementation.
Improved graph library in Rascal. A graph library in Rascal was already available, but it did not support interprocedural graph generation [43]. Some important statements, such as try-catch, were also not included in the generated graphs. We extended the library to cover more statements and support the generation of interprocedural graphs.
Contribution to Rascal. Rascal is still in active development and is considered alpha software. During this project we encountered multiple bugs and performance issues, allowing us to contribute to the development of Rascal by filing bug reports and proposing solutions. We do not cover this further in the thesis, as we avoid discussing the code directly, but it is an interesting contribution nevertheless.
Part I
Chapter 2
Project Description
In this project we aim to reimplement the Cider algorithm as described by Shomrat and Feldman [39]. Cider is able to detect refactored clones between two code bases. Such clones are semantic and otherwise known as type-4 clones [34]. The internal program representation they use to detect the clones is the Plan Calculus, which is comparable to the control flow graph [35, 36] and the program dependence graph [10]. An important characteristic of a plan is that its nodes represent expressions, instead of statements as in the other aforementioned representations. Edges in a plan indicate control and data flow, the latter making it more comparable to a program dependence graph.
Our reimplementation uses the Rascal programming language and the system dependence graph as program representation. The focus is on the detection of refactored clones between two successive versions of the same software. We focus on the detection of method extraction and inline method refactorings [11], because those are also the focus of Shomrat and Feldman. We only analyse Java projects, as is the case in the Cider paper as well. The source code of this project and the data on our results are available on GitHub [5].
2.1
Clone Detection
Other approaches exist to detect various types of clones, and almost all of them have their own pros and cons. Also, not every tool detects the same type of clones [34, 37]. A number of papers that discuss clones regard them as a "bad smell" and thus problematic [6, 19, 21, 26]. In contrast, there are papers that do not regard code clones as inherently bad [22]. However, there is one point on which most of the authors agree: separate evolution of code clones leads to faults and is a problem [3, 14, 20].
Clones start to evolve separately when a developer forgets to apply a change in all the clone instances. It is easy to see that this may lead to bug fixes that are not properly implemented in all the clone instances, causing defects [20, 29, 38]. At this point the clone evolution has become problematic and specialist tools are needed to detect the offenders.
The tools that are currently available for the detection of clones are developed with the intention of refactoring them or enabling consistent changes [18, 19, 21, 29, 42]. Specialist tools to detect clones that have already been inconsistently changed are rare and sometimes not publicly available (e.g. Cider). Inconsistent changes to clones have been identified as a source of faults in some studies [20, 29, 38], and even though the general harmfulness of clones is disputed, that of inconsistent changes is not. It is therefore interesting to note that specialist tools for detecting inconsistencies in clones are so uncommon, strengthening the case for our project.
2.2
Motivation
Many programming languages are available to program the envisioned tool, but we have chosen Rascal as it attempts to integrate multiple facets of meta-programming into a single language environment [23, 24]. Furthermore, it supports analysis of Java projects out of the box. A software analysis project such as this is the perfect field test for the language, which is – at the time of writing – still in active development. We may also contribute to the maturity of the language and display an example of its usefulness in practice.
Using the system dependence graph as a representation instead of the Plan Calculus is a choice based on the availability of information on the subjects. Both representations have sources available, but the accessibility of information on the system dependence graph was far superior [16, 31, 32, 33, 41, 42]. It was also mentioned by Shomrat and Feldman as a possible alternative to the Plan Calculus [39], which makes it an interesting and viable avenue of research. That viability is further strengthened by the fact that most of the available information on the creation of system dependence graphs has insufficient depth for a full implementation. Our research may uncover those implementation and practical details, granting a complete overview from start to finish.
Chapter 3
Research Questions
3.1
Existing Library
Developing a library to create graph representations of a Java program may not be necessary, as a preceding student created one for her master thesis in 2014 [43]. It allows the creation of a program dependence graph, which is the basis for the system dependence graph. Clearly the library had to be extended. Less clear were the quality and documentation of the code. A thorough inspection of the library was in order, as problems in it would have the potential to negatively impact the entire clone detection process.
Question: Can the reimplementation of Cider be built on top of the existing Rascal graph library?
3.2
Modification
A large change from the original Cider implementation is the use of the system dependence graph, as opposed to the Plan Calculus. Another change is the programming language used to realise the tool: Rascal. Such changes may have an effect on how the underlying algorithm has to be implemented. Investigating the effects of these changes is necessary, and motivations for potential modifications have to be documented.
Question: How does the use of system dependence graphs and Rascal affect the reimplementation of Cider?
3.3
Search-space
Generating a system dependence graph is quite an expensive process [26, 37], and doing so for a complete project is infeasible. Pruning the search-space is a necessity and can be achieved by seeding and filtering. Measuring the effectiveness of the seeding and filtering algorithm is needed to ensure that the search-space is effectively reduced.
Question: How do seeding and filtering reduce the search-space and impact performance?
3.4
Scalability
It is important to test the performance of the tool on multiple projects of different sizes. We want it to work on sizeable projects of 100,000 lines of code and more. It is also important to obtain an indication of scalability, as such a measurement may indicate the practical value of the tool.
Question: Does the reimplementation of Cider in Rascal run on open-source projects containing more than 100,000 lines of code? How does performance scale when the input increases?
3.5
Value
For a new tool it is important to have some added value when compared to its competitors.
We will therefore compare the resulting tool to the currently available ones to see how it adds
value. Value can be measured by quantifying functionality, performance, and other properties
of the tool.
Question:
What is the added value of the resulting tool when compared to other available
tools?
3.6
Cider Comparison
As the resulting tool will be a reimplementation of Cider, it is important to compare the
results with those of the original paper. Found clones should be comparable to those of the
original Cider tool. If there is any difference between the results it is important to investigate
the reasons behind it.
Question:
How do the results of the reimplemented tool compare to those of the original
Cider tool? If they differ, why?
Part II
Chapter 4
Overview
The graph library is the basis for the project, as all analysis will eventually be done on graph representations. It was already known that the existing library did not cover construction of system dependence graphs (SDGs) and had to be extended. When investigating the library it became quite clear that more functionality was missing. For example, try-catch constructs in the Java language were not covered, even though exception handling is a large concern in modern software [9]. Furthermore, it turned out that the documentation and quality of the code were sub-par. Variable names were cryptic and magic values were used, to name a few problems. These points have led to the decision to completely reconstruct the library and extend it afterwards.
Figure 4.1: Graph creation order
Figure 4.1 shows the dependencies between graphs for their creation. Naturally, the library is able to create those graphs, and also allows their visualisation. An important thing to note is that only the SDG spans method boundaries and is therefore interprocedural. All other graphs are intraprocedural and will be shown as separate graphs. The scope used for SDG creation is configurable to either file, method, or full scope. Using the full scope can lead to massive graphs, as it allows the graph to span methods and files, and is usually discouraged. We mostly utilise the library with file scope.
The other chapters in this part will show how every graph is constructed. Coverage includes the algorithm and usually an example. Some of the graph construction methods needed some extension for Java programs, which will also be presented in their respective chapters.
Simplifications have also been necessary, as will be described in the chapters. Our goal is to show a proper guide through the creation of a practical SDG while presenting all the intermediate representations as well, including documentation on edge-cases that we found to be missing in the literature [10, 26, 31, 32, 42, 43].
4.1
Node Types
Every node in the graph can have one of five types, enabling easy differentiation during analysis and facilitating certain extensions. The following list briefly describes the five node types and what they are used for.
Normal() Nodes that have a direct mapping to a statement in the source-code will be stored
with this type.
Entry() These nodes are used to indicate the entry point in a method. It is usually the top
node in a graph.
CallSite() A method call will get its own node of this type. This node will spawn parameter
nodes to indicate the transfer of data in the SDG.
Parameter() These are used to indicate data transfer into or from a method. They consist of
assignments from the arguments to the parameters of the called method, or from return
statements back to the caller. Parameter nodes are spawned by return statements and
method calls.
Global() A node of this type indicates a global variable or class field. They are used to
create an entry point for data analysis, so that data dependence edges can be created
for statements that use such variables.
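As an illustration only (this mirror is ours, not part of the library, which is written in Rascal), the five node types can be modelled as a small Python enumeration, with a node environment mapping identifiers to types:

```python
from enum import Enum

class NodeType(Enum):
    """The five node kinds described above (names mirror the text)."""
    NORMAL = "Normal"        # direct mapping to a source statement
    ENTRY = "Entry"          # entry point of a method
    CALL_SITE = "CallSite"   # a method call; spawns parameter nodes
    PARAMETER = "Parameter"  # data transfer into or out of a method
    GLOBAL = "Global"        # global variable or class field

# Differentiating during analysis: pick out the call sites.
node_environment = {0: NodeType.ENTRY, 1: NodeType.NORMAL, 2: NodeType.CALL_SITE}
call_sites = {n for n, t in node_environment.items() if t is NodeType.CALL_SITE}
print(call_sites)  # → {2}
```

Tagging nodes this way is what lets later analysis phases treat call sites and parameter nodes specially without re-inspecting the AST.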
4.2
Data Structures
Every graph is contained within its own data structure, but data on the analysed method is shared by all of them. That data is encoded in the MethodData structure, as shown in listing 4.1.
MethodData(
    str name,
    node abstractTree,
    map[int, node] nodeEnvironment,
    set[loc] calledMethods,
    set[int] callSites,
    map[int, int] parameterNodes
);
Listing 4.1: Method data
Name A self-explanatory field. It contains the name of the method to which the data belongs.
Abstract Tree The abstract syntax tree (AST) is stored in this field. It is provided by
Rascal.
Node Environment This mapping is filled during the construction of the Control Flow
Graph. Every statement in the AST receives an identifier and is stored in this mapping
for future reference. If additional nodes are generated they will also be stored in this
environment. As the graphs will have edges from and to identifiers, it is paramount for
every graph to have access to this environment.
Called Methods For easy future reference we store the called methods in this set of locations that point to the called methods. A location is a data structure that Rascal supplies and can be used to resolve to paths on the storage medium.
Call Sites Knowing what nodes in the AST have outgoing method calls is needed to create
interprocedural control edges in the SDG. We store the identifier of every statement
containing a call site in this set.
Parameter Nodes This mapping maps the identifier of a parameter node to its parent node
identifier. When the Control Dependence Graph is generated, this information is used
to properly connect the parameter nodes to their parent.
4.3
Validation
Correctness of the graphs has to be ensured, as the validity of the analysis depends on it. Even though we cannot hope to test every code construct, we can provide a set of unit tests to cover basic ones separately. For testing we used the mechanism that is provided by Rascal and extended it. Listing 4.2 shows example code of a unit test that covers control flow graph creation for for statements.
test bool testFor() {
    projectModel = createM3(|project://JavaTest|);
    map[loc, Graph[int]] assertions = (
        getMethodLocation("testFor1", projectModel) :
            { <0, 1>, <1, 2>, <2, 1> },
        getMethodLocation("testFor1Alternate", projectModel) :
            { <0, 1>, <1, 2>, <2, 1> },
        getMethodLocation("testFor2", projectModel) :
            { <0, 1>, <1, 2>, <2, 1>, <1, 3> },
        getMethodLocation("testFor2Alternate", projectModel) :
            { <0, 1>, <1, 2>, <2, 1>, <1, 3> }
    );
    return RTestFunction("TestFor", getMethodCFG, assertions);
}
Listing 4.2: Unit test example
Our testing extension is simple and can be observed in the RTestFunction method. It takes a name to identify potential error messages, the method that is to be tested, and a map of assertions. The map in listing 4.2 contains the method input as key and the expected output as value. RTestFunction will call the provided method with the keys of the map and compare the output with the value that is attached to the key. If any of the assertions fail, an error message is printed to identify the problem.
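In Python terms, the extension behaves roughly like the sketch below; `run_assertions` is a hypothetical stand-in for `RTestFunction`, whose real implementation is in Rascal:

```python
def run_assertions(name, fn, assertions):
    """Apply fn to every key and compare against the expected value.

    Prints a message naming the failing case (as described above) and
    returns whether all assertions held.
    """
    ok = True
    for inp, expected in assertions.items():
        actual = fn(inp)
        if actual != expected:
            print(f"{name}: input {inp!r} gave {actual!r}, expected {expected!r}")
            ok = False
    return ok

# A toy stand-in for getMethodCFG: double the input.
print(run_assertions("TestDouble", lambda x: x * 2, {1: 2, 3: 6}))  # → True
```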
As input for the tests we use very simple Java methods that each contain a specific language structure. Listing 4.3 shows one of the test programs for the unit test in listing 4.2.
public void testFor1() {
    int m = 2;                                   // 0
    for (int i = 0, j = 7; i <= j; i++, j--) {   // 1
        m = m + 4;                               // 2
    }
}
Listing 4.3: Java test code
The program is intentionally simple, as we focus the testing effort on small constructs so failures can be inspected quickly and effectively. It may not seem representative of real programs, which are much more complex. We reason that if we test every structure in multiple configurations, we can be reasonably sure that they function correctly when used in conjunction with other structures.
Thinking of every possible structure and usage scenario is impossible, causing the test suite to evolve during usage. More edge-cases are always being uncovered and consequently covered by a new test.
Chapter 5
Control Flow Graph
A control flow graph (CFG) models the flow of control between statements in a program [2]. An edge between two nodes in the graph indicates that the first node is executed before it transfers control to the second node. All the edges in a CFG are directed.
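Concretely, such a graph can be held as a set of directed edges over node identifiers, in the spirit of the `Graph[int]` relation used by the library. The sketch below is illustrative; the loop and its node numbering are invented:

```python
# CFG of a small loop: 0 = initialisation, 1 = loop header,
# 2 = loop body, 3 = statement after the loop.
cfg = {(0, 1), (1, 2), (2, 1), (1, 3)}

def successors(node, edges):
    """Nodes to which `node` transfers control directly."""
    return {dst for src, dst in edges if src == node}

def reachable(entry, edges):
    """All nodes reachable from the entry node (a basic sanity check)."""
    seen, work = set(), [entry]
    while work:
        n = work.pop()
        if n not in seen:
            seen.add(n)
            work.extend(successors(n, edges))
    return seen

print(sorted(successors(1, cfg)))  # → [2, 3]
print(sorted(reachable(0, cfg)))   # → [0, 1, 2, 3]
```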
Information on CFGs is plentiful, but a lot of it is incomplete in the sense that certain edge-cases or language-specific elements are omitted [2, 15, 44]. Omissions include the handling of exceptions, method calls, and threads. This does not mean that such information is unavailable, but it is fragmented. We have investigated these points, and reports of our findings can be found in the sections below.
Our library constructs CFGs that are intraprocedural, generating a separate CFG for every method that is within the analysis scope. Extensions to the generation of the CFGs ensure that it is possible to create interprocedural graphs later on. Spawning parameter nodes for a call site is a prime example of such an extension. Detailed descriptions of the extensions can be found in their respective sections below.
5.1
Graph Construction
A post-order traversal of the method AST facilitates the creation of the CFG. Every processed statement will have a small CFG created for it, which is connected to the other CFGs as the traversal goes back up the tree. This recursive method of construction eventually interconnects all sub-graphs to form the final CFG for a method. Listing 5.1 presents the data structure that is used to model a CFG; information necessary to understand how the sub-graphs are connected by algorithm 1. The algorithm is a refactored version of the one found in the old library [43].
ControlFlow(
    // Contains all edges going from and to a node identifier.
    Graph[int] graph,
    // The top node of a graph.
    int entryNode,
    // The nodes where control flow exits the graph.
    set[int] exitNodes
);
Listing 5.1: The data structure modelling control flow graphs.
Algorithm 1 Connecting multiple control flow graphs.
Require: flows ≠ ∅
function connectControlFlows(list[ControlFlow] flows)
    fstFlow ← pop(flows)
    sndFlow ← pop(flows)
    cFlow.graph ← fstFlow.graph + sndFlow.graph
    cFlow.graph ← cFlow.graph + fstFlow.exitNodes × {sndFlow.entryNode}
    if size(flows) ≥ 2 then
        succFlow ← connectControlFlows(flows)
        cFlow.graph ← cFlow.graph + succFlow.graph
        cFlow.graph ← cFlow.graph + cFlow.exitNodes × {succFlow.entryNode}
        cFlow.exitNodes ← succFlow.exitNodes
    else
        cFlow.exitNodes ← sndFlow.exitNodes
    end if
    return cFlow
end function
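The essence of the algorithm — merge the edge sets and connect each sub-flow's exit nodes to the entry node of its successor — can be sketched in Python as follows. The `ControlFlow` tuple mirrors listing 5.1, and the iterative chaining is our simplification of the recursion above, not the library's Rascal code:

```python
from collections import namedtuple

# graph: set of (src, dst) edges; entry: int; exits: set of ints.
ControlFlow = namedtuple("ControlFlow", ["graph", "entry", "exits"])

def connect_control_flows(flows):
    """Chain a non-empty list of sub-flows into one combined flow."""
    graph = set(flows[0].graph)
    for prev, nxt in zip(flows, flows[1:]):
        graph |= nxt.graph
        # Every exit of the previous flow transfers control to the
        # entry node of the next flow.
        graph |= {(e, nxt.entry) for e in prev.exits}
    return ControlFlow(graph, flows[0].entry, flows[-1].exits)

a = ControlFlow({(0, 1)}, 0, {1})
b = ControlFlow({(2, 3)}, 2, {3})
combined = connect_control_flows([a, b])
print(sorted(combined.graph))  # → [(0, 1), (1, 2), (2, 3)]
```

The combined flow keeps the first sub-flow's entry node and the last sub-flow's exit nodes, just as the algorithm returns `cFlow` with the exits of the final flow in the list.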
Creating the flows is a process that must be defined for every statement separately, as
every type of statement has its own control flow. How the exit points are defined is also
different between statements, because not every statement reacts in the same way to a break
or throw, for example.
Example
Listing 5.2 contains the code for processing while statements, showing a good example of the construction process. It first scans the condition of the statement for any method calls, storing their control flows in a list. After this the while statement is stored in the node environment and its identifier is returned. A flow is always initialised with the node identifier as entry and exit node.
Proper scoping is very important as jumps (e.g. return and break) may otherwise connect
to the wrong follow-up statement. Since a while statement contains a block with its own
scope, we call the scopeDown function to clear the environments that store jumps.
The next step is the processing of the statement's body. We do this by calling the process method on the body, leaving pattern-driven dispatch to handle it from there. The method will return a control flow for the body. After the body is processed, the continue nodes in the body's scope are retrieved and added as exit nodes. This is necessary so we can create edges from the continue statements to the while entry statement.
After creating the proper edges between the body and the while entry statement, break
nodes in the body scope are retrieved and added as exit nodes for the CFG. Doing so ensures
that all the relevant break nodes can get an edge to the follow-up statement of the while
block. As the internal block is now processed we scope back up, passing any unbound jumps
to the parent environment for later binding, ensuring proper scoping of jumps.
Finally, any method calls in the expression of the while statement will be connected to
the CFG before returning it, as those calls will be executed before the body of the while
statement.
private ControlFlow process(whileNode: \while(condition, body)) {
    list[ControlFlow] callSites = registerMethodCalls(condition);
    int identifier = storeNode(whileNode);

    ControlFlow whileFlow = ControlFlow({}, 0, {});
    whileFlow.entryNode = identifier;
    whileFlow.exitNodes += { identifier };

    scopeDown();

    ControlFlow bodyFlow = process(body);
    bodyFlow.exitNodes += getContinueNodes();

    whileFlow.graph += bodyFlow.graph;
    whileFlow.graph += createConnectionEdges(bodyFlow, whileFlow);
    whileFlow.graph += createConnectionEdges(whileFlow, bodyFlow);
    whileFlow.exitNodes += getBreakNodes();

    scopeUp();

    return connectControlFlows(callSites + whileFlow);
}
Listing 5.2: Creating a control flow graph for while statements.
5.2 Special Constructs
The available information on generating a CFG does a good job of covering the basic code
structures such as loops and jumps. Processing statements pertaining to exceptions or method
calls is usually not covered. For our analysis those structures have to be included in the CFG
as well. Therefore, we had to investigate those statements and find out how to properly
process them. Our findings are presented below.
5.2.1 Try-Catch(-Finally)
Exception handling code is commonplace in modern software, manifesting itself in Java programs as try-catch and try-catch-finally blocks. To illustrate, in the 597,450 lines of code used to test our tool, we found 5095 try, 4847 catch, and 1363 finally blocks.¹ Processing try-catch blocks is a quite straightforward affair, as every statement in the body may connect to the catch block, provided that statement is capable of throwing an exception. The try-catch-finally blocks are slightly more complex, as Java ensures that the finally block is always executed, even if there is a return or throw in a catch block's body. Listing 5.3 shows example Java code where such a construct exists.

The flow starting at line 11 runs through 12, 17, and 18. In this case the body of the catch block is executed before the finally block. For the flow starting at line 13 it is different, running through lines 14, 17, 18, and 15. The body of the finally block is essentially inlined before the throw statement. Figure 5.1 shows the CFG fragment with these flows.

Our first implementation for handling the try-catch-finally block created a single flow for the body of the finally block and inlined it in the larger CFG. It seemed like an easy solution, but the implementation caused invalid paths. For example, the CFG fragment contains a path running from line 12 through 17, 18, and 15. Obviously such a flow does not exist in the program; there is no way for the statement at line 12 to reach the throw at line 15. Furthermore, line 14 would be able to reach the exit node without crossing the throw at line 15, as it follows the path spawned by the catch block starting at line 11.

¹ How the lines of code metric is computed can be found in chapter 14. For counting the blocks we used Rascal, looping through the AST of every class in a project and incrementing a counter.
1  public void throwFunction() throws Exception {
2      int i = 0;
3
4      if (i > 1) {
5          throw new NullPointerException();
6      }
7
8      try {
9          i = i * 2;
10         throw new NullPointerException();
11     } catch (NoClassDefFoundError exception) {
12         i = 10;
13     } catch (Exception exception) {
14         i = 12;
15         throw exception;
16     } finally {
17         i = 11;
18         i = i * 3;
19     }
20 }
Listing 5.3: Throwing code.
Figure 5.1: CFG fragment.
For the final implementation we still utilise the idea of flow inlining, but instead apply it for every catch block separately. Figure 5.2 shows the generated CFG for the code in listing 5.3. In this implementation there are no invalid paths. A side effect is that the flow of the finally block can have many duplicates in the graph. This should not cause any problems, however, as the nodes in the duplicate flows still refer to the same statements in the code.
5.2.2 Method Calls
Processing method calls for interprocedural control flow analysis is a complex subject that has been researched before [7, 27]. These studies attempt to analyse the program by using the CFG directly, causing context-related issues where invalid paths may exist or flows become "tainted" by spawned paths from a different context. Our use of the CFG is purely as an intermediate representation and it is never used directly for analysis. This allows a simplification in processing method calls, as the control flow does not have to be interprocedural. However, to enable generation of an interprocedural SDG later on, we have to account for eventual graph linking.
The code in listing 5.4 yields three separate CFGs, wherein nodes are added for every call site. During SDG construction it is essential to know how control and data are transferred from caller to callee. In the case of control this is simple: a single edge from the call site to the called method's entry node will suffice. For data it is more complicated, as the algorithm will have to deal with data transfer from input arguments to parameters, and a potential
Figure 5.2: Throw code flow
public void Calling() {
    int i = 1;
    while (i < 11) {
        i = Increment(i);
    }
}

public int Increment(int z) {
    return Add(z, 1);
}

public int Add(int a, int b) {
    return a + b;
}
Listing 5.4: Unprocessed calls
return value to the caller. To provide the algorithm with that information we have extended the CFG generator to spawn parameter nodes that indicate this transfer of data.

Listing 5.5 shows how the code would look if the three output CFGs are transformed back into source code. Every method call is assigned its own node and spawns multiple assignments. The added assignment statements are mapped to the node that spawned them, to be used later during control and system dependence graph generation. All the names start with $ to easily differentiate normal variables from the ones created during CFG construction. Although Java technically permits $ in identifiers, the language specification reserves it for mechanically generated code, so conflicts between the original variables and the created ones are avoided in practice.
Special care must be taken when processing statements where a single method is called multiple times. The naming scheme employed in listing 5.5 causes, for example, $method_Add(int, int)_return to be overwritten if there happens to be a second call to the Add(int, int) method in the same statement. Adding the offset of the call in the file to the variable name (i.e. $method_Add(int, int)_return_<file offset>) solved this problem.
public void Calling() {
    int i = 1;
    while (i < 11) {
        call Increment();
        $method_Increment(int)_in1 = i;
        $method_Increment(int)_return = $Increment(int)_return;
        i = Increment(i);
    }
}

public int Increment(int z) {
    z = $method_Increment(int)_in1;
    call Add();
    $method_Add(int, int)_in1 = z;
    $method_Add(int, int)_in2 = 1;
    $method_Add(int, int)_return = $Add(int, int)_return;
    return Add(z, 1);
    $Increment(int)_result = Add(z, 1);
    $Increment(int)_return = $Increment(int)_result;
}

public int Add(int a, int b) {
    a = $method_Add(int, int)_in1;
    b = $method_Add(int, int)_in2;
    return a + b;
    $Add(int, int)_result = a + b;
    $Add(int, int)_return = $Add(int, int)_result;
}
Listing 5.5: Processed calls
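The offset-based naming rule described above can be sketched as a tiny helper. The class and method names here are hypothetical illustrations, not part of the actual tool, and we assume the offset is the byte position of the call in the source file:

```java
// Hypothetical helper illustrating the call-site naming scheme: the file
// offset of the call is appended, so two calls to the same method within
// a single statement receive distinct return variables.
class CallNaming {
    static String returnVariable(String methodSignature, int fileOffset) {
        return "$method_" + methodSignature + "_return_" + fileOffset;
    }
}
```

Because offsets are unique per call site, the generated names can never collide, even for repeated calls to the same method.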
5.2.3 Method Declaration & Return
Listing 5.5 contains parameter nodes that do not originate from the expansion of method calls; instead they are created when processing the method entry and return statements. Every input parameter receives an assignment statement, with the right-hand side being the value that is generated during method call expansion. Other nodes are assignments to $<name>_result and $<name>_return. Every return statement creates an assignment to $<name>_result, but there is only a single $<name>_return assignment for every method that returns a value. Much like the nodes that are created during the processing of a method call, these new nodes are used to create interprocedural data flow edges when constructing an SDG.
5.3 Considerations
Due to the extensions, reversing a CFG back to source code does not yield an executable
or even syntactically valid Java program. Some of the flows in the created CFG may not
even be realistic execution paths. For example, the created assignment statements after a
return would never be reachable during execution. To create valid intraprocedural CFGs
one would have to disable the generation of parameter nodes, and far more extensive changes would be necessary for valid interprocedural CFG generation.
In our tool the CFG is never used directly during clone detection; it only serves as the basis for creating other graphs, and we never analyse any control flow paths. Combined with changes made during the creation of said graphs, the invalidity of the CFGs does not pose a problem for clone detection analysis. So for our intents and purposes, the current implementation
suffices.
Chapter 6
Post-Dominator Tree
A post-dominator tree (PDT) determines the dominance of every node in a CFG, according to
three relation types: post-dominance, strict post-dominance, and immediate post-dominance.
Post-dominance A node Y is said to post-dominate a node X if every path from node X to the exit node runs through Y. From the definition it follows that every node post-dominates itself.

Strict post-dominance Node Y strictly post-dominates node X if Y post-dominates X and Y ≠ X.

Immediate post-dominance Even though a node may post-dominate multiple nodes, it only immediately post-dominates one. Node Y immediately post-dominates node X if none of the other nodes that strictly post-dominate X are strictly post-dominated by Y. One exception is the exit node: it post-dominates every other node, but no node strictly post-dominates it, so it has no immediate post-dominator.
6.1 Tree Construction
An algorithm for calculating normal dominators can be used to construct a PDT, since a PDT is simply a dominator tree of the inverted graph. Only the input CFG for the algorithm requires a change: it has to be inverted. A simple process of reversing every edge in the graph suffices, which in Rascal boils down to calling one of the standard library functions.
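As an illustration of this inversion step, here is a minimal Java sketch; the class name is hypothetical, and the graph is a plain integer-keyed adjacency map rather than a Rascal relation:

```java
import java.util.*;

// Reverse every edge of a graph, so that a dominator algorithm run on the
// result computes post-dominators of the original graph.
class GraphInverter {
    static Map<Integer, Set<Integer>> invert(Map<Integer, Set<Integer>> graph) {
        Map<Integer, Set<Integer>> inverted = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : graph.entrySet()) {
            for (int target : entry.getValue()) {
                // edge source -> target becomes target -> source
                inverted.computeIfAbsent(target, k -> new HashSet<>())
                        .add(entry.getKey());
            }
        }
        return inverted;
    }
}
```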
Multiple algorithms are available for calculating dominators, with differing computational complexity. A fast but complex implementation by Lengauer and Tarjan [28] has an almost linear runtime. Due to time limitations we have chosen a simple implementation by Aho and Ullman [1] with a complexity of O(mn), with m denoting the number of edges and n the number of nodes in the analysed graph.
Algorithm 2 shows how (post-)dominators are calculated. The basic concept behind the algorithm is that a node X (post-)dominates exactly the nodes that are no longer reachable when paths in the CFG containing X are excluded.
Algorithm 2 Calculating dominators
for all node ∈ graphNodes do
    reach ← reachable nodes from root, avoiding paths with node
    dominations[node] ← graphNodes − {node} − reach
    for all dominatedNode ∈ dominations[node] do
        dominators[dominatedNode] ← dominators[dominatedNode] + {node}
    end for
end for
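Algorithm 2 can be sketched in Java as follows. Class and helper names are hypothetical, and the graph is a plain adjacency map; run on the reversed CFG with the exit node as root, it yields the strict post-dominators:

```java
import java.util.*;

// Sketch of Algorithm 2: node X dominates exactly the nodes that become
// unreachable from the root once every path through X is excluded.
class Dominators {
    static Map<Integer, Set<Integer>> compute(
            Map<Integer, Set<Integer>> graph, Set<Integer> nodes, int root) {
        Map<Integer, Set<Integer>> dominators = new HashMap<>();
        for (int n : nodes) dominators.put(n, new HashSet<>());
        for (int node : nodes) {
            Set<Integer> reach = reachableAvoiding(graph, root, node);
            Set<Integer> dominated = new HashSet<>(nodes);
            dominated.remove(node);
            dominated.removeAll(reach);
            for (int d : dominated) dominators.get(d).add(node);
        }
        return dominators; // node -> the nodes that (strictly) dominate it
    }

    // depth-first reachability from root, skipping the avoided node entirely
    static Set<Integer> reachableAvoiding(
            Map<Integer, Set<Integer>> graph, int root, int avoided) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        if (root != avoided) { seen.add(root); work.push(root); }
        while (!work.isEmpty()) {
            for (int next : graph.getOrDefault(work.pop(), Set.of())) {
                if (next != avoided && seen.add(next)) work.push(next);
            }
        }
        return seen;
    }
}
```

On a simple chain 0 → 1 → 2 this reports that node 2 is dominated by both 0 and 1, matching the intuition that every path to 2 passes through them.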
6.2 Finding Immediate Post-Dominators
Every node has a single immediate post-dominator that must be extracted from the dominance sets. Algorithm 3 shows how this is done. Its implementation exactly mirrors the definition of an immediate post-dominator. For every node we look at the nodes that post-dominate it. When a post-dominator is found that does not strictly post-dominate any of the other post-dominators, an edge is added to the PDT signifying the immediate post-dominance relation.
Algorithm 3 Calculating immediate dominators
for all node ∈ graphNodes do
    for all dominator ∈ dominators[node] do
        if dominations[dominator] ∩ dominators[node] ≡ ∅ then
            Add an edge to the tree from dominator to node
            break
        end if
    end for
end for
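Algorithm 3 might be sketched in Java like this; the names are hypothetical, and the two input maps mirror the dominators and dominations collections from Algorithm 2:

```java
import java.util.*;

// Sketch of Algorithm 3: the immediate dominator of a node is the one
// dominator that dominates none of the node's other dominators.
class ImmediateDominators {
    static Map<Integer, Integer> compute(
            Map<Integer, Set<Integer>> dominators,    // node -> its dominators
            Map<Integer, Set<Integer>> dominations) { // node -> nodes it dominates
        Map<Integer, Integer> idom = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : dominators.entrySet()) {
            for (int dominator : e.getValue()) {
                Set<Integer> overlap = new HashSet<>(
                        dominations.getOrDefault(dominator, Set.of()));
                overlap.retainAll(e.getValue());
                if (overlap.isEmpty()) {   // dominates no other dominator
                    idom.put(e.getKey(), dominator);
                    break;
                }
            }
        }
        return idom; // each edge idom[n] -> n is an edge of the tree
    }
}
```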
Figure 6.1 shows a simplified input CFG without expanded call sites and with an added start node that denotes the dependence on the external signal causing the method to execute. This node has no analytical value, but clarifies the visualisation when the PDT is rendered. An example of such a PDT can be seen in figure 6.2.
Figure 6.1: Input CFG
Figure 6.2: Output PDT
Chapter 7
Control Dependence Graph
The control dependence graph (CDG) shows how nodes are dependent on each other for
execution. We say that a node X is control dependent on a node Y if one path from Y ensures that X is not executed, while another causes X to be executed. This principle is stated more formally in the following definition.

    Node Y is control dependent on node X if, and only if, there is a path in the CFG from X to Y that does not pass through the immediate post-dominator of X.
While it is possible for a node to be control dependent on multiple preceding nodes, there
is only a single one on which it directly depends, denoting the immediate control dependence
relationship. Edges in the output CDG model this immediate dependence exclusively.
7.1 Graph Construction
An algorithm to construct a CDG is presented by Ferrante et al. [10]. Initially it retrieves all edges in the CFG where the source node is not post-dominated by the target node. After this, a path in the PDT is constructed from the immediate post-dominator of the source node to the target node. Every node on that path that is not the source node or the immediate post-dominator is then marked as being control dependent on the source node. Algorithm 4 shows the implementation of this behaviour. It stores the dependencies in two directions, mapping a node both to all nodes that depend on it and to the nodes it depends on.
Algorithm 4 Calculating control dependence
inspectionEdges ← {X → Y ∈ graphEdges, Y does not post-dominate X}
for all X → Y ∈ inspectionEdges do
    parentIdom ← immediate post-dominator of X
    pathNodes ← nodes on the PDT path ⟨parentIdom → Y⟩
    pathNodes ← pathNodes − parentIdom
    for all pathNode ∈ pathNodes, pathNode ≠ X do
        dependencies[pathNode] ← dependencies[pathNode] + {X}
        controls[X] ← controls[X] + {pathNode}
    end for
end for
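A possible Java sketch of Algorithm 4, with hypothetical names: the PDT is represented as a parent map (node → immediate post-dominator), so the path ⟨parentIdom → Y⟩ is walked bottom-up from Y. We assume every inspected source node has an entry in that map:

```java
import java.util.*;

// Sketch of Algorithm 4 (Ferrante et al.): for every CFG edge X -> Y where
// Y does not post-dominate X, every node on the PDT path from ipdom(X)
// down to Y (excluding ipdom(X) and X itself) is control dependent on X.
class ControlDependence {
    static Map<Integer, Set<Integer>> compute(
            Set<int[]> cfgEdges,                       // edges as {source, target}
            Map<Integer, Set<Integer>> postDominators, // node -> its post-dominators
            Map<Integer, Integer> pdtParent) {         // node -> its ipdom (PDT parent)
        Map<Integer, Set<Integer>> dependencies = new HashMap<>();
        for (int[] edge : cfgEdges) {
            int x = edge[0], y = edge[1];
            if (postDominators.getOrDefault(x, Set.of()).contains(y)) continue;
            int stop = pdtParent.get(x);
            // walk up the PDT from Y until we reach ipdom(X)
            for (Integer n = y; n != null && n != stop; n = pdtParent.get(n)) {
                if (n != x) {
                    dependencies.computeIfAbsent(n, k -> new HashSet<>()).add(x);
                }
            }
        }
        return dependencies;
    }
}
```

On the PDT of figure 7.1 (Entry above 3, 3 above 2, with ipdom(1) = Entry) and the CFG edge ⟨1 → 2⟩, the walk marks nodes 2 and 3 as control dependent on node 1, matching the worked example in the text.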
Figure 7.1: Simple PDT
To show an example run of the algorithm we use figure 7.1. Let us assume that the related CFG of this PDT contains an edge ⟨1 → 2⟩. This is an example of an edge where the target node does not post-dominate the source node. Consequently, the algorithm retrieves the immediate post-dominator of 1 (i.e. Entry) and creates a path in the PDT from that node to the target node, being 2. On this path we find the nodes 3 and 2, which are then marked as being control dependent on node 1.
7.2 Finding Immediate Control Dependence
The definition of immediate control dependence is very similar to that of the immediate post-dominance relation in a PDT, and is stated as follows.

    Node X is immediately control dependent on Y if all other controllers of X are not control dependent on Y.

Not only is the definition very similar, the algorithm to calculate immediate control dependence also resembles the one for calculating immediate post-dominance. Algorithm 5 shows how this behaviour is implemented, with a small number of modifications.
Algorithm 5 Calculating immediate dependence
for all node ∈ graphNodes do
    if node is a parameter node then
        Add edge ⟨parameterParent → node⟩
        continue
    end if
    for all controller ∈ sort(dependencies[node]) do
        if size(dependencies[node]) ≡ 1 or controls[controller] ∩ dependencies[node] ≡ ∅ then
            Add edge ⟨controller → node⟩
            break
        end if
    end for
end for
The first modification is the addition of a conditional that checks whether a node is a Parameter() node. Such nodes are excluded from normal analysis and are directly connected to the node that spawned them. That parent node is retrieved from a map that has been constructed during CFG creation. We exclude Parameter() nodes from normal analysis because they are not represented in the original source code, making them dependent only on the statement that spawned them.
public void loopBreaking() {
    int i = 0;            // 0
    while (i < 10) {      // 1
        if (i == 6) {     // 2
            break;        // 3
        }
        i = 10;           // 4
    }
}
Listing 7.1: Problematic code
Figure 7.2: CFG of problem code
A second modification is the sorting of the collection that is iterated in the inner loop, which is necessary to avoid invalid edges when analysing loops. An example of problematic code is shown in listing 7.1 and its CFG in figure 7.2. The source of the problem is edge ⟨1 → 2⟩, where 2 does not post-dominate 1. When trying to determine the immediate control dependence of 1 we find that there are two options: it can be immediately control dependent on Entry or on node 2. Naturally, being control dependent on a node that executes later is impossible, but such a dependency will be created if the inner loop looks at node 2 first. Sorting avoids this problem, as the Entry node has a negative value as identifier. Figure 7.3 shows the PDT and the proper CDG for the code in listing 7.1.
(a) PDT (b) CDG
Figure 7.3: Problem code graphs
7.3 Simplifications
Region nodes are introduced by Ferrante et al. to enhance control dependence analysis [10]. Statements in the bodies of both branches of an if-else statement are currently dependent on a single node. When analysing the CDG directly, it may be important to know whether a statement resides in the true or false branch of a conditional block. Region nodes can be used to group statements that are dependent on a common expression, including branch information. We take no interest in these branches and only use the dependencies for clone detection, allowing us to exclude region nodes.
Chapter 8
Data Dependence Graph
A data dependence graph (DDG) is used to analyse data dependences between statements, granting insight into how a piece of code may be reorganised without affecting its semantics. This information helps a compiler in parallelising commands that have no constraints due to data dependence.
A = 0;
B = A + 1;
C = A + B;
D = 5;
In the trivially simple code above we see that B depends on A, and C on both of them.
This dependency makes it impossible to reorder the statements, while retaining the semantics.
Statement D is not dependent on anything and can be executed at any point.
8.1 Data Collections
Intuitively we can think of data dependence as existing between statements that define and use a variable. As shown later in this chapter, it is a bit more complicated than that, as there are multiple edge cases to consider. The idea does show the first two basic data mappings that have to be calculated. Definitions is the first one and maps a variable name to all the statements that define it. Uses is the second one and maps a statement's identifier to all the variables it uses. The example assignment below at line 4 is analysed as defining A and using B and C.
1  A = 0;
2  B = 1;
3  C = 2;
4  A = B + C;
Not only is every variable name mapped to all the statement identifiers that define it; every statement identifier also maps to the variables it defines, or generates. In the example code above the statement at line 4 is mapped to the variable A in a mapping called generators, as it generates variable A.
The final data collection we use is the kill mapping. It maps every statement identifier to the variables it kills or, in other words, redefines. This data is calculated by using the generators and definitions mappings. For every node the variables it generates are retrieved, and for every generated variable we say that the statement kills all other definitions of that variable. For example, line 4 in the code above kills the variable definition of A at line 1.
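The derivation of the kill mapping from the generators and definitions mappings can be sketched as follows; the names are hypothetical, with statements represented as integers and variables as strings:

```java
import java.util.*;

// Sketch of the kill mapping: a statement kills every other definition of
// each variable it generates, derived from generators and definitions.
class KillSets {
    static Map<Integer, Set<Integer>> compute(
            Map<Integer, Set<String>> generators,    // statement -> variables it defines
            Map<String, Set<Integer>> definitions) { // variable -> statements defining it
        Map<Integer, Set<Integer>> kills = new HashMap<>();
        for (Map.Entry<Integer, Set<String>> e : generators.entrySet()) {
            Set<Integer> killed = new HashSet<>();
            for (String variable : e.getValue()) {
                killed.addAll(definitions.getOrDefault(variable, Set.of()));
            }
            killed.remove(e.getKey()); // a statement does not kill its own definition
            kills.put(e.getKey(), killed);
        }
        return kills;
    }
}
```

On the four-line example above, statement 4 kills the definition of A at statement 1, and vice versa.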
8.1.1 Variable Encoding
Variables in the preceding examples simply encode into a string that is equal to their name. However, some variables and statements need a different encoding, such as array accesses and method calls.
Array accesses could be encoded by the name of the array directly, so that myArray[x] is said to use myArray and is encoded as such. It is a simple approach, but we found it to be too imprecise. The difference between accessing an array with a variable or with a fixed value is syntactically small, but may indicate a large semantic difference. Accessing the first element of an array explicitly may indicate a very specific purpose. We reason that accessing an array with an index in a loop is not as specific and may therefore be less meaningful semantically. In other words, we argue that for myArray[x], with x being 5 or 6 inside a loop where x is the loop index, the exact value makes no semantic difference. To separate the two ways of accessing an array we encode them differently. An access that uses a variable, such as myArray[x], is encoded as myArray[variable], whereas myArray[5] is simply encoded as myArray[5].
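This encoding rule can be sketched as a small helper; the class and method names are hypothetical, and we assume the index expression has already been extracted as a string:

```java
// Hypothetical sketch of the array-access encoding: a constant index is
// kept verbatim, while any variable index collapses to the marker "variable".
class VariableEncoder {
    static String encodeArrayAccess(String arrayName, String indexExpression) {
        boolean constantIndex = indexExpression.chars().allMatch(Character::isDigit);
        return arrayName + "[" + (constantIndex ? indexExpression : "variable") + "]";
    }
}
```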
During data analysis any method call that is encountered will be encoded to link to the previously created $method_<name>_return assignment statements. For example, the statement $Increment(int)_result = Add(z, 1) in listing 5.5 will be implicitly encoded as $Increment(int)_result = $method_Add(int, int)_return to allow for proper data analysis.
8.2 Reaching Definitions
An important step in the creation of a DDG is the calculation of the reaching definitions. There are two maps, in and out, that map every statement to the definitions that reach it and that flow out of it, respectively. For a statement S the in and out maps are calculated by the following equations.
• in[S] = ⋃_{p ∈ predecessors[S]} out[p]
• out[S] = gen[S] ∪ (in[S] − kill[S])
Implementing the equations is straightforward, as they can be directly translated into code. An important thing to consider is that a single pass will not be enough to calculate the maps; instead a fixed point must be found. Therefore, the equations are executed in a loop over every node in the graph until no further changes are made to in and out. Algorithm 6 shows how this behaviour is implemented.
An example input for calculating reaching definitions can be seen in figure 8.1, with the results presented in table 8.1. The data in the columns shows how variable data is stored in the maps, with the first member of a tuple holding the name of the variable and the second being the identifier of the statement for which the tuple has been generated.
Algorithm 6 Calculating reaching definitions
while in and out have changed do
    for all node ∈ graphNodes do
        in[node] ← ⋃_{p ∈ predecessors[node]} out[p]
        out[node] ← gen[node] ∪ (in[node] − kill[node])
    end for
end while
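A direct Java translation of Algorithm 6 might look like this; the names are hypothetical, and definitions are encoded as plain strings such as "a@1" (variable a defined at statement 1):

```java
import java.util.*;

// Fixed-point sketch of Algorithm 6: recompute in/out for every node
// until neither map changes anymore.
class ReachingDefinitions {
    static Map<Integer, Set<String>> compute(
            Map<Integer, Set<Integer>> predecessors,
            Map<Integer, Set<String>> gen,
            Map<Integer, Set<String>> kill,
            Set<Integer> nodes) {
        Map<Integer, Set<String>> in = new HashMap<>();
        Map<Integer, Set<String>> out = new HashMap<>();
        for (int n : nodes) { in.put(n, new HashSet<>()); out.put(n, new HashSet<>()); }
        boolean changed = true;
        while (changed) {                    // iterate until a fixed point
            changed = false;
            for (int node : nodes) {
                Set<String> newIn = new HashSet<>();
                for (int p : predecessors.getOrDefault(node, Set.of())) {
                    newIn.addAll(out.get(p));
                }
                // out = gen ∪ (in − kill)
                Set<String> newOut = new HashSet<>(newIn);
                newOut.removeAll(kill.getOrDefault(node, Set.of()));
                newOut.addAll(gen.getOrDefault(node, Set.of()));
                if (!newIn.equals(in.get(node)) || !newOut.equals(out.get(node))) {
                    in.put(node, newIn);
                    out.put(node, newOut);
                    changed = true;
                }
            }
        }
        return out;
    }
}
```

Because the sets only ever grow toward a finite universe of definitions, the loop is guaranteed to terminate.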
Figure 8.1: Example CFG
Statement | gen   | kill         | in                  | out
--------- | ----- | ------------ | ------------------- | -------------------
1         | ⟨a,1⟩ | ⟨a,5⟩        | ∅                   | ⟨a,1⟩
2         | ⟨c,2⟩ | ⟨c,4⟩, ⟨c,6⟩ | ⟨a,1⟩               | ⟨a,1⟩, ⟨c,2⟩
3         | ∅     | ∅            | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩
4         | ⟨c,4⟩ | ⟨c,2⟩, ⟨c,6⟩ | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨a,1⟩, ⟨c,4⟩
5         | ⟨a,5⟩ | ⟨a,1⟩        | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨c,2⟩, ⟨c,4⟩, ⟨a,5⟩
6         | ⟨c,6⟩ | ⟨c,2⟩, ⟨c,4⟩ | ⟨c,2⟩, ⟨c,4⟩, ⟨a,5⟩ | ⟨a,5⟩, ⟨c,6⟩

Table 8.1: Reaching definitions for figure 8.1
8.3 Graph Construction
With all the calculated mappings we are able to generate a DDG out of a CFG. Algorithm 7 shows the steps involved in creating a DDG, as it loops through all the used variables of every node in the domain of the uses mapping.
The first conditional statements filter out any global variables that are not being defined
in the current function scope. Those global variables are stored for later use during SDG
construction. It also accounts for array accesses that may not have been defined yet, even
though the array itself is defined in the function scope. In this case it falls back to the more
general array name and retrieves the statement nodes that define it.
Algorithm 7 Calculate the DDG
for all node ∈ domain(uses), usedVariable ∈ uses[node] do
    if usedVariable ∉ definitions then
        if usedVariable is an array and the name of the array ∈ definitions then
            variableDefinitions ← definitions[name of the array]
        else
            continue
        end if
    else
        variableDefinitions ← definitions[usedVariable]
    end if
    for all dependency ∈ in[node] ∩ variableDefinitions do
        Add edge ⟨dependency.identifier, node⟩
    end for
end for
After the proper variable definition nodes are retrieved, the algorithm loops through the variable definitions that can actually reach the currently analysed node. For every reaching definition an edge is added from its respective node to the current node. Figure 8.2 shows the DDG for the CFG in figure 8.1.
Figure 8.2: DDG for figure 8.1
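The core of Algorithm 7 can be sketched in Java as follows. The names are hypothetical, definitions are encoded as "variable@statement" strings to mirror the reaching-definitions sets, and the array fallback and global-variable filtering are omitted for brevity:

```java
import java.util.*;

// Simplified sketch of Algorithm 7: an edge runs from a definition to a
// use whenever that definition actually reaches the using node.
class DataDependence {
    static Set<List<Integer>> compute(
            Map<Integer, Set<String>> uses,        // node -> variables it uses
            Map<String, Set<Integer>> definitions, // variable -> defining nodes
            Map<Integer, Set<String>> in) {        // node -> reaching definitions
        Set<List<Integer>> edges = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> use : uses.entrySet()) {
            int node = use.getKey();
            for (String variable : use.getValue()) {
                for (int defNode : definitions.getOrDefault(variable, Set.of())) {
                    // only add an edge if this definition reaches the node
                    if (in.getOrDefault(node, Set.of()).contains(variable + "@" + defNode)) {
                        edges.add(List.of(defNode, node));  // edge defNode -> node
                    }
                }
            }
        }
        return edges;
    }
}
```

For the four-line example at the start of this chapter (with the use of B and C at statement 4), this yields exactly the edges 2 → 4 and 3 → 4.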
8.4 Simplifications
Our focus on detecting refactored clones in single files allowed us to simplify some factors in
the creation of a DDG.
Ignore Anonymous Subclasses Some constructs in Java allow anonymous subclasses, such as the creation of a new thread. Listing 8.1 shows how such an anonymous subclass may manifest in source code. In this case the run() method is part of the anonymous subclass that is spawned by creating the new Thread object. We ignore this subclass entirely, for simplicity and because such subclasses are relatively rare.
Thread thread = new Thread() {
    public void run() {
        System.out.println("Thread Running");
    }
};
Listing 8.1: Anonymous subclass
Loop Unrolling Loops are not unrolled, which may lead to imprecise DDGs when arrays are accessed with a loop index. In listing 8.2 we see a loop that can be unrolled relatively easily. Doing so would show that reversing the execution flow (i.e. starting at i = 4 and counting down to 1) does not retain the semantic meaning.
for (int i = 1; i < 5; i++) {
    myArray[i] = myArray[i + 1] + 5;
}