Detecting Refactored Clones with Rascal
Master Thesis
René Bulsing
Faculty of Science
Supervisor: Dr. Vadim Zaytsev
Version 1.0
Amsterdam, August 2015
Abstract
Cloned code is very common in modern software, and its harmfulness has been researched often, with conflicting results. One area that does show consensus is that inconsistent changes to code clones, and the resulting deviations between them, are a source of software defects. Being able to detect such deviations is therefore potentially valuable. In 2013, Shomrat and Feldman developed a tool called Cider that is able to detect code clones that have deviated due to refactoring. Our research set out to reimplement their tool in the meta-programming language Rascal. Both tools target the analysis of Java projects exclusively.
Where Cider utilises an internal semantic program representation based on the Plan Calculus, we have opted for the system dependence graph. Building a library in Rascal that creates such graphs has been the major part of the project. The resulting library is able to create correct system dependence graphs that cover a wide range of Java statements. Unfortunately, not every statement has been covered, due to time constraints. Implementing data analysis on this and super statements remains future work, as does control analysis on labels and the associated labelled break and continue statements.
As can be seen from the results, we have succeeded in reimplementing Cider in Rascal. Our results overlap with the publicly available results of Cider, which unfortunately form a small set. Regardless, this shows the potential of our tool, and its precision and recall percentages are comparable. Some projects do show a low precision, as is the case for Cider too, indicating that much work remains to be done. The tool nevertheless has large potential and contributes to the research on refactored clone detection. Unlike Cider, it is also publicly available (https://github.com/grammarware/pdg), together with a detailed data set containing our results, so it can serve as a basis for future research.
Another contribution resulting from our research is the centralisation of fragmented information. Our work covers the creation of a system dependence graph for Java programs, including all the necessary intermediate representations and interesting edge-cases; information that is usually spread across multiple sources. Combining that with coverage of refactored clone detection, and the public availability of our tool and data, makes our work a solid centralised body of information for further research on the subject.
Contents

1 Introduction
   1.1 Relevance
   1.2 Document Structure
   1.3 Contributions

I Project Concept

2 Project Description
   2.1 Clone Detection
   2.2 Motivation
3 Research Questions
   3.1 Existing Library
   3.2 Modification
   3.3 Search-space
   3.4 Scalability
   3.5 Value
   3.6 Cider Comparison

II Program Representation

4 Overview
   4.1 Node Types
   4.2 Data Structures
   4.3 Validation
5 Control Flow Graph
   5.1 Graph Construction
   5.2 Special Constructs
      5.2.1 Try-Catch(-Finally)
      5.2.2 Method Calls
      5.2.3 Method Declaration & Return
   5.3 Considerations
6 Post-Dominator Tree
   6.1 Tree Construction
   6.2 Finding Immediate Post-Dominators
7 Control Dependence Graph
   7.1 Graph Construction
   7.2 Finding Immediate Control Dependence
   7.3 Simplifications
8 Data Dependence Graph
   8.1 Data Collections
      8.1.1 Variable Encoding
   8.2 Reaching Definitions
   8.3 Graph Construction
   8.4 Simplifications
9 Program Dependence Graph
   9.1 Graph Construction
10 System Dependence Graph
   10.1 Graph Construction
      10.1.1 Node Encoding
      10.1.2 Example
   10.2 Limitations
11 Call Graph
   11.1 Graph Construction

III Clone Detection

12 The Algorithm
   12.1 Flows
   12.2 Matching
   12.3 Filtering Clones
   12.4 Changes
      12.4.1 Naive
      12.4.2 Lockstep
   12.5 Limitations
13 Seeding
   13.1 Internal Seeds
   13.2 Filtering
   13.3 Input Restrictions
   13.4 Limitation

IV Evaluation

14 Results
   14.1 Refactored Clones
      14.1.1 Unfiltered Seeds
      14.1.2 Filtered Seeds
      14.1.3 Analysis
   14.2 Interprocedural Clones
   14.3 Performance
15 Cider Comparison
   15.1 Results
   15.2 Performance
16 Related Work
17 Threats to Validity
   17.1 Internal Validity
   17.2 External Validity
18 Future Work
   18.1 Graphs
   18.2 Clone Detection
   18.3 Verification
19 Conclusion

Bibliography

A Source Code
B Clone Analysis
Chapter 1
Introduction
In this thesis we present the research and development efforts for reimplementing a clone detection tool known as Cider [39] in Rascal [24]. The original tool developed by Shomrat and Feldman is quite unique as it focuses on the detection of interprocedural clones that have deviated by refactoring, mainly method extraction and inline method [11]. Even though Shomrat and Feldman explain their algorithm and tool, Cider itself does not seem to be published, making a reimplementation in Rascal interesting.
To facilitate a reimplementation in Rascal we had to develop a library to create a semantic representation of Java programs, known as the system dependence graph [16, 41]. Reports of the findings uncovered during its development span a large portion of this thesis.
1.1
Relevance
Recent research shows an open-minded attitude towards code clones, in the sense that they are not always considered a bad thing [17, 22]. However, research has also shown that separate evolution of code clones can introduce software defects [20]. When a change has been introduced in a cloned code fragment, it is important for that change to be applied to the other instances as well. This is a labour-intensive and error-prone task, because the developer can easily miss an instance. Clone detection techniques that do not use a semantic representation of the program will not spot this discrepancy, because the syntactical structure of the clone has changed [37]. It is not hard to imagine that such a forgotten clone can cause software defects as additional changes are applied to the properly maintained clone instances, leaving the forgotten ones to deviate further.
A clone detection tool that uses a semantic program representation is able to spot clones that have a different syntactic structure, but are semantically identical. For example, suppose a cloned code fragment with four instances is refactored by extracting a method, leaving the functionality unchanged, but the change is applied in only three of the four instances. A tool like Cider will still identify all four instances as clones. The developer will then be able to spot the mistake and easily correct it. It is not difficult to see the value in such a tool and how it may be relevant.
1.2
Document Structure
Our findings and development efforts are split into four major parts in this thesis. The order of the parts and chapters represents our progression path through the project. We enumerate these parts and briefly explain their contents below.
Part one presents the project concept, touching on the practice of clone detection and presenting a motivation for the choice of programming language and program representation. In this part we also present the research questions which have directed the project.
Part two presents the program representation of choice, together with all the intermediate representations that are necessary to generate it. Our work is based on that of a previous student, which is highlighted in the first overview chapter. All other chapters discuss a single program representation, explaining the concepts, algorithms, and modifications or extensions.
Part three discusses the clone detection algorithm and how it is seeded. Modifications have been necessary, because we do not use the same program representation as Cider. An explanation of said changes is an important subject in this part. Even though this part is relatively small, it is the most important one, as the concepts and algorithms behind the actual clone detection are presented here.
Part four is the final part of the thesis and contains the evaluation. First we present our experiment results, such as precision and recall on multiple configurations of the tool. Subsequently, the results are compared to those of Cider. The remaining chapters present related work, threats to validity, future work, and the conclusion, in which we shortly reiterate the research questions and answer them.
Algorithms that are listed in the chapters closely resemble their coded implementation in our tool. Generally we avoid direct explanation of code, because it is very volatile while the tool is a work in progress. Instead the concepts and algorithms are explained, which are not as likely to change.
1.3
Contributions
In the listing below we briefly discuss the contributions of this project. Naturally, we explain throughout this thesis how we achieved them, but we still wish to grant the reader a convenient overview below.
Detecting refactored clones. The tool we developed is able to detect refactored clones between two subsequent versions of software. In doing so we have shown the feasibility of using the system dependence graph to implement the algorithms and concepts that are presented by Shomrat and Feldman [39], who used the Plan Calculus [35, 36] instead. This also further proves that their algorithms and concepts function correctly and can be adapted to work with a different program representation.
Open source. Not every tool resulting from research is open source and available to the public, nor is its raw data. This is also the case for Cider, but not for our tool [5].
The public availability of our tool and the raw data on which the evaluation is based,
marks a contribution on its own. It grants future researchers a solid base of information
and an implementation example for the algorithms and concepts.
Centralised information. During development and research we noticed that information
on graph representations is fragmented. Most papers only cover a single graph type,
leaving out necessary preliminary information for an implementation. Our work covers
the creation of every graph and shows the dependencies between them. It may not
be the most thorough coverage available, but it includes the preliminary information
needed for an implementation.
Improved graph library in Rascal. A graph library in Rascal was already available, but it did not support interprocedural graph generation [43]. Some important statements, such as try-catch, were also not included in the generated graphs. We extended the library to cover more statements and support the generation of interprocedural graphs.
Contribution to Rascal. Rascal is still in active development and is considered alpha software. During this project we encountered multiple bugs and performance issues, allowing us to contribute to the development of Rascal by filing bug reports and proposing solutions. We do not cover this further in the thesis, as we avoid discussing the code directly, but it is an interesting contribution nevertheless.
Part I
Chapter 2
Project Description
In this project we aim to reimplement the Cider algorithm as described by Shomrat and Feldman [39]. Cider is able to detect refactored clones between two code bases. Such clones are semantic and otherwise known as type-4 clones [34]. The internal program representation they use to detect the clones is the Plan Calculus, which is comparable to the control flow graph [35, 36] and the program dependence graph [10]. An important characteristic of a plan is that its nodes represent expressions, instead of statements as in the other aforementioned representations. Edges in a plan indicate control and data flow, the latter making it more comparable to a program dependence graph.
Our reimplementation uses the Rascal programming language and the system dependence graph as program representation. The focus is on the detection of refactored clones between two successive versions of the same software. We focus on the detection of method extraction and inline method refactorings [11], because those are also the focus of Shomrat and Feldman. We only analyse Java projects, as is the case in the Cider paper as well. The source code of this project and the data on our results are available on GitHub [5].
2.1
Clone Detection
Other approaches exist to detect various types of clones, and almost all of them have their own pros and cons. Also, not every tool detects the same type of clones [34, 37]. A number of papers that discuss clones regard them as a "bad smell" and thus problematic [6, 19, 21, 26]. In contrast, there are papers that do not regard code clones as inherently bad [22]. However, there is one point on which most of the authors agree: separate evolution of code clones leads to faults and is a problem [3, 14, 20].
Clones start to evolve separately when a developer forgets to apply a change in all the clone instances. It is easy to see that this may lead to bug fixes that are not properly implemented in all the clone instances, causing defects [20, 29, 38]. At this point the clone evolution has become problematic and specialist tools are needed to detect the offenders.
The tools that are currently available for the detection of clones are developed with the intention of refactoring them or enabling consistent changes [18, 19, 21, 29, 42]. Specialist tools to detect clones that have already been inconsistently changed are rare and sometimes not publicly available (e.g. Cider). Inconsistent changes to clones have been identified as a source of faults in some studies [20, 29, 38], and even though the general harmfulness of clones is disputed, that of inconsistent changes is not. It is therefore interesting to note that specialist tools for detecting inconsistencies in clones are so uncommon, strengthening the case for our project.
2.2
Motivation
Many programming languages are available to program the envisioned tool, but we have chosen Rascal as it attempts to integrate multiple facets of meta-programming into a single language environment [23, 24]. Furthermore, it supports analysis of Java projects out of the box. A software analysis project such as this is the perfect field test for the language, which is – at the time of writing – still in active development. We may also contribute to the maturity of the language and display an example of its usefulness in practice.
Using the system dependence graph as a representation instead of the Plan Calculus is a choice based on the availability of information on the subjects. Both representations have sources available, but the accessibility of information on the system dependence graph was far superior [16, 31, 32, 33, 41, 42]. It was also mentioned by Shomrat and Feldman as a possible alternative to the Plan Calculus [39], which makes it an interesting and viable avenue of research. That viability is further strengthened by the fact that most of the available information on the creation of system dependence graphs has insufficient depth for a full implementation. Our research may uncover those implementation and practical details, granting a complete overview from start to finish.
Chapter 3
Research Questions
3.1
Existing Library
Developing a library to create graph representations of a Java program may not be necessary, as a preceding student created one for her master thesis in 2014 [43]. It allows the creation of a program dependence graph, which is the basis for the system dependence graph. Clearly the library had to be extended. Less clear were the quality and documentation of the code. A thorough inspection of the library was in order, as problems in it would have the potential to negatively impact the entire clone detection process.
Question: Can the reimplementation of Cider be built on top of the existing Rascal graph library?
3.2
Modification
A large change from the original Cider implementation is the use of the system dependence graph, as opposed to the Plan Calculus. Another change is the programming language used to realise the tool: Rascal. Such changes may have an effect on how the underlying algorithm has to be implemented. Investigating the effects of these changes is necessary, and motivations for potential modifications have to be documented.
Question: How does the use of system dependence graphs and Rascal affect the reimplementation of Cider?
3.3
Search-space
Generating a system dependence graph is quite an expensive process [26, 37], and doing so for a complete project is infeasible. Pruning the search-space is a necessity and can be achieved by seeding and filtering. Measuring the effectiveness of the seeding and filtering algorithm is needed to ensure that the search-space is effectively reduced.
Question: How do seeding and filtering reduce the search-space and impact performance?
3.4
Scalability
It is important to test the performance of the tool on multiple projects of different sizes. We want it to work on sizeable projects of 100,000 lines of code and more. It is also important to obtain an indication of scalability, as such a measurement may indicate the practical value of the tool.
Question: Does the reimplementation of Cider in Rascal run on open-source projects containing more than 100,000 lines of code? How does performance scale when the input increases?
3.5
Value
For a new tool it is important to have some added value when compared to its competitors.
We will therefore compare the resulting tool to the currently available ones to see how it adds
value. Value can be measured by quantifying functionality, performance, and other properties
of the tool.
Question:
What is the added value of the resulting tool when compared to other available
tools?
3.6
Cider Comparison
As the resulting tool will be a reimplementation of Cider, it is important to compare the
results with those of the original paper. Found clones should be comparable to those of the
original Cider tool. If there is any difference between the results it is important to investigate
the reasons behind it.
Question:
How do the results of the reimplemented tool compare to those of the original
Cider tool? If they differ, why?
Part II
Chapter 4
Overview
The graph library is the basis for the project, as all analysis will eventually be done on graph representations. It was already known that the existing library did not cover construction of system dependence graphs (SDGs) and had to be extended. When investigating the library it became quite clear that more functionality was missing. For example, try-catch constructs in the Java language were not covered, even though exception handling is a large concern in modern software [9]. Furthermore, it turned out that the documentation and quality of the code were sub-par. Variable names were cryptic and magic values were used, to name a few problems. These points have led to the decision to completely reconstruct the library and extend it afterwards.
Figure 4.1: Graph creation order
Figure 4.1 shows the dependencies between graphs for their creation. Naturally, the library is able to create those graphs, and also allows their visualisation. An important thing to note is that only the SDG spans method boundaries and is therefore interprocedural. All other graphs are intraprocedural and will be shown as separate graphs. The scope used for SDG creation is configurable to either file, method, or full scope. Using the full scope can lead to massive graphs, as it allows the graph to span methods and files, and is usually discouraged. We mostly utilise the library with file scope.
The other chapters in this part will show how every graph is constructed. Coverage includes the algorithm and usually an example. Some of the graph construction methods needed some extension for Java programs, which will also be presented in their respective chapters.
Simplifications have also been necessary, as will be described in the chapters. Our goal is to show a proper guide through the creation of a practical SDG while presenting all the intermediate representations as well, including documentation on edge-cases that we found to be missing in the literature [10, 26, 31, 32, 42, 43].
4.1
Node Types
Every node in the graph can have one of five types, enabling easy differentiation during analysis and facilitating certain extensions. The following list briefly describes the five node types and what they are used for.
Normal() Nodes that have a direct mapping to a statement in the source-code will be stored
with this type.
Entry() These nodes are used to indicate the entry point in a method. It is usually the top
node in a graph.
CallSite() A method call will get its own node of this type. This node will spawn parameter
nodes to indicate the transfer of data in the SDG.
Parameter() These are used to indicate data transfer into or from a method. They consist of
assignments from the arguments to the parameters of the called method, or from return
statements back to the caller. Parameter nodes are spawned by return statements and
method calls.
Global() A node of this type indicates a global variable or class field. They are used to
create an entry point for data analysis, so that data dependence edges can be created
for statements that use such variables.
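As an illustration only (this mirror is ours, not part of the library, which is written in Rascal), the five node types can be modelled as a small Python enumeration, with a node environment mapping identifiers to types:

```python
from enum import Enum

class NodeType(Enum):
    """The five node kinds described above (names mirror the text)."""
    NORMAL = "Normal"        # direct mapping to a source statement
    ENTRY = "Entry"          # entry point of a method
    CALL_SITE = "CallSite"   # a method call; spawns parameter nodes
    PARAMETER = "Parameter"  # data transfer into or out of a method
    GLOBAL = "Global"        # global variable or class field

# Differentiating during analysis: pick out the call sites.
node_environment = {0: NodeType.ENTRY, 1: NodeType.NORMAL, 2: NodeType.CALL_SITE}
call_sites = {n for n, t in node_environment.items() if t is NodeType.CALL_SITE}
print(call_sites)  # → {2}
```

Tagging nodes this way is what lets later analysis phases treat call sites and parameter nodes specially without re-inspecting the AST.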
4.2
Data Structures
Every graph is contained within its own data structure, but data on the analysed method is shared by all of them. That data is encoded in the MethodData structure, as shown in listing 4.1.
MethodData(
    str name,
    node abstractTree,
    map[int, node] nodeEnvironment,
    set[loc] calledMethods,
    set[int] callSites,
    map[int, int] parameterNodes
);
Listing 4.1: Method data
Name A self-explanatory field. It contains the name of the method to which the data belongs.
Abstract Tree The abstract syntax tree (AST) is stored in this field. It is provided by
Rascal.
Node Environment This mapping is filled during the construction of the Control Flow
Graph. Every statement in the AST receives an identifier and is stored in this mapping
for future reference. If additional nodes are generated they will also be stored in this
environment. As the graphs will have edges from and to identifiers, it is paramount for
every graph to have access to this environment.
Called Methods For easy future reference we store the called methods in this set of locations that point to the called methods. A location is a data structure that Rascal supplies and can be used to resolve to paths on the storage medium.
Call Sites Knowing what nodes in the AST have outgoing method calls is needed to create
interprocedural control edges in the SDG. We store the identifier of every statement
containing a call site in this set.
Parameter Nodes This mapping maps the identifier of a parameter node to its parent node
identifier. When the Control Dependence Graph is generated, this information is used
to properly connect the parameter nodes to their parent.
4.3
Validation
Correctness of the graphs has to be ensured, as the validity of the analysis depends on it. Even though we cannot hope to test every code construct, we can provide a set of unit tests to cover basic ones separately. For testing we used the mechanism that is provided by Rascal and extended it. Listing 4.2 shows example code of a unit test that covers control flow graph creation for for statements.
test bool testFor() {
    projectModel = createM3(|project://JavaTest|);
    map[loc, Graph[int]] assertions = (
        getMethodLocation("testFor1", projectModel) :
            { <0, 1>, <1, 2>, <2, 1> },
        getMethodLocation("testFor1Alternate", projectModel) :
            { <0, 1>, <1, 2>, <2, 1> },
        getMethodLocation("testFor2", projectModel) :
            { <0, 1>, <1, 2>, <2, 1>, <1, 3> },
        getMethodLocation("testFor2Alternate", projectModel) :
            { <0, 1>, <1, 2>, <2, 1>, <1, 3> }
    );
    return RTestFunction("TestFor", getMethodCFG, assertions);
}
Listing 4.2: Unit test example
Our testing extension is simple and can be observed in the RTestFunction method. It takes a name to identify potential error messages, the method that is to be tested, and a map of assertions. The map in listing 4.2 contains the method input as key and the expected output as value. RTestFunction will call the provided method with the keys of the map and compare the output with the value that is attached to the key. If any of the assertions fail, an error message is printed to identify the problem.
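In Python terms, the extension behaves roughly like the sketch below; `run_assertions` is a hypothetical stand-in for `RTestFunction`, whose real implementation is in Rascal:

```python
def run_assertions(name, fn, assertions):
    """Apply fn to every key and compare against the expected value.

    Prints a message naming the failing case (as described above) and
    returns whether all assertions held.
    """
    ok = True
    for inp, expected in assertions.items():
        actual = fn(inp)
        if actual != expected:
            print(f"{name}: input {inp!r} gave {actual!r}, expected {expected!r}")
            ok = False
    return ok

# A toy stand-in for getMethodCFG: double the input.
print(run_assertions("TestDouble", lambda x: x * 2, {1: 2, 3: 6}))  # → True
```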
As input for the tests we use very simple Java methods that each contain a specific language structure. Listing 4.3 shows one of the test programs for the unit test in listing 4.2.
public void testFor1() {
    int m = 2;                                   // 0
    for (int i = 0, j = 7; i <= j; i++, j--) {   // 1
        m = m + 4;                               // 2
    }
}
Listing 4.3: Java test code
The program is intentionally simple, as we focus the testing effort on small constructs so failures can be inspected quickly and effectively. It may not seem representative of real programs, which are much more complex. We reason that if we test every structure in multiple configurations, we can be reasonably sure that they function correctly when used in conjunction with other structures.
Thinking of every possible structure and usage scenario is impossible, causing the test suite to evolve during usage. More edge-cases are always being uncovered and consequently covered by a new test.
Chapter 5
Control Flow Graph
A control flow graph (CFG) models the flow of control between statements in a program [2]. An edge between two nodes in the graph indicates that the first node is executed before it transfers control to the second node. All the edges in a CFG are directed.
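Concretely, such a graph can be held as a set of directed edges over node identifiers, in the spirit of the `Graph[int]` relation used by the library. The sketch below is illustrative; the loop and its node numbering are invented:

```python
# CFG of a small loop: 0 = initialisation, 1 = loop header,
# 2 = loop body, 3 = statement after the loop.
cfg = {(0, 1), (1, 2), (2, 1), (1, 3)}

def successors(node, edges):
    """Nodes to which `node` transfers control directly."""
    return {dst for src, dst in edges if src == node}

def reachable(entry, edges):
    """All nodes reachable from the entry node (a basic sanity check)."""
    seen, work = set(), [entry]
    while work:
        n = work.pop()
        if n not in seen:
            seen.add(n)
            work.extend(successors(n, edges))
    return seen

print(sorted(successors(1, cfg)))  # → [2, 3]
print(sorted(reachable(0, cfg)))   # → [0, 1, 2, 3]
```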
Information on CFGs is plentiful, but a lot of it is incomplete in the sense that certain edge-cases or language-specific elements are omitted [2, 15, 44]. Omissions include the handling of exceptions, method calls, and threads. This does not mean that such information is unavailable, but it is fragmented. We have investigated these points, and reports of our findings can be found in the sections below.
Our library constructs CFGs that are intraprocedural, generating a separate CFG for every method that is within the analysis scope. Extensions to the generation of the CFGs ensure that it is possible to create interprocedural graphs later on. Spawning parameter nodes for a call site is a prime example of such an extension. Detailed descriptions of the extensions can be found in their respective sections below.
5.1
Graph Construction
A post-order traversal of the method AST facilitates the creation of the CFG. Every processed statement will have a small CFG created for it, which is connected to the other CFGs as the traversal goes back up the tree. This recursive method of construction eventually interconnects all sub-graphs to form the final CFG for a method. Listing 5.1 presents the data structure that is used to model a CFG; information necessary to understand how the sub-graphs are connected by algorithm 1. The algorithm is a refactored version of the one found in the old library [43].
ControlFlow(
    // Contains all edges going from and to a node identifier.
    Graph[int] graph,
    // The top node of a graph.
    int entryNode,
    // The nodes where control flow exits the graph.
    set[int] exitNodes
);
Listing 5.1: The data structure modelling control flow graphs.
Algorithm 1 Connecting multiple control flow graphs.
Require: flows ≠ ∅
function connectControlFlows(list[ControlFlow] flows)
    fstFlow ← pop(flows)
    sndFlow ← pop(flows)
    cFlow.graph ← fstFlow.graph + sndFlow.graph
    cFlow.graph ← cFlow.graph + fstFlow.exitNodes × {sndFlow.entryNode}
    if size(flows) ≥ 2 then
        succFlow ← connectControlFlows(flows)
        cFlow.graph ← cFlow.graph + succFlow.graph
        cFlow.graph ← cFlow.graph + cFlow.exitNodes × {succFlow.entryNode}
        cFlow.exitNodes ← succFlow.exitNodes
    else
        cFlow.exitNodes ← sndFlow.exitNodes
    end if
    return cFlow
end function
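The essence of the algorithm — merge the edge sets and connect each sub-flow's exit nodes to the entry node of its successor — can be sketched in Python as follows. The `ControlFlow` tuple mirrors listing 5.1, and the iterative chaining is our simplification of the recursion above, not the library's Rascal code:

```python
from collections import namedtuple

# graph: set of (src, dst) edges; entry: int; exits: set of ints.
ControlFlow = namedtuple("ControlFlow", ["graph", "entry", "exits"])

def connect_control_flows(flows):
    """Chain a non-empty list of sub-flows into one combined flow."""
    graph = set(flows[0].graph)
    for prev, nxt in zip(flows, flows[1:]):
        graph |= nxt.graph
        # Every exit of the previous flow transfers control to the
        # entry node of the next flow.
        graph |= {(e, nxt.entry) for e in prev.exits}
    return ControlFlow(graph, flows[0].entry, flows[-1].exits)

a = ControlFlow({(0, 1)}, 0, {1})
b = ControlFlow({(2, 3)}, 2, {3})
combined = connect_control_flows([a, b])
print(sorted(combined.graph))  # → [(0, 1), (1, 2), (2, 3)]
```

The combined flow keeps the first sub-flow's entry node and the last sub-flow's exit nodes, just as the algorithm returns `cFlow` with the exits of the final flow in the list.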
Creating the flows is a process that must be defined for every statement separately, as
every type of statement has its own control flow. How the exit points are defined is also
different between statements, because not every statement reacts in the same way to a break
or throw, for example.
Example
Listing 5.2 contains the code for processing while statements, showing a good example of the construction process. It first scans the condition of the statement for any method calls, storing their control flows in a list. After this the while statement is stored in the node environment and its identifier is returned. A flow is always initialised with the node identifier as entry and exit node.
Proper scoping is very important as jumps (e.g. return and break) may otherwise connect
to the wrong follow-up statement. Since a while statement contains a block with its own
scope, we call the scopeDown function to clear the environments that store jumps.
The next step is the processing of the statement's body. We do this by calling the process method on the body, leaving pattern-driven dispatch to handle it from there. The method will return a control flow for the body. After the body is processed, the continue nodes in the body's scope are retrieved and added as exit nodes. This is necessary so we can create edges from the continue statements to the while entry statement.
After creating the proper edges between the body and the while entry statement, break
nodes in the body scope are retrieved and added as exit nodes for the CFG. Doing so ensures
that all the relevant break nodes can get an edge to the follow-up statement of the while
block. As the internal block is now processed we scope back up, passing any unbound jumps
to the parent environment for later binding, ensuring proper scoping of jumps.
Finally, any method calls in the expression of the while statement will be connected to
the CFG before returning it, as those calls will be executed before the body of the while
statement.
private ControlFlow process(whileNode: \while(condition, body)) {
    list[ControlFlow] callSites = registerMethodCalls(condition);
    int identifier = storeNode(whileNode);

    ControlFlow whileFlow = ControlFlow({}, 0, {});
    whileFlow.entryNode = identifier;
    whileFlow.exitNodes += { identifier };

    scopeDown();

    ControlFlow bodyFlow = process(body);
    bodyFlow.exitNodes += getContinueNodes();

    whileFlow.graph += bodyFlow.graph;
    whileFlow.graph += createConnectionEdges(bodyFlow, whileFlow);
    whileFlow.graph += createConnectionEdges(whileFlow, bodyFlow);
    whileFlow.exitNodes += getBreakNodes();

    scopeUp();

    return connectControlFlows(callSites + whileFlow);
}
Listing 5.2: Creating a control flow graph for while statements.
5.2 Special Constructs
The available information on generating a CFG does a good job of covering the basic code
structures such as loops and jumps. Processing statements pertaining to exceptions or method
calls is usually not covered. For our analysis those structures have to be included in the CFG
as well. Therefore, we had to investigate those statements and find out how to properly
process them. Our findings are presented below.
5.2.1 Try-Catch(-Finally)
Exception handling code is commonplace in modern software, manifesting itself in Java programs as try-catch and try-catch-finally blocks. To illustrate, in the 597,450 lines of code used to test our tool, we found 5095 try, 4847 catch, and 1363 finally blocks.¹ Processing try-catch blocks is a quite straightforward affair, as every statement in the body may connect to the catch block, provided that statement is capable of throwing an exception. The try-catch-finally blocks are slightly more complex, as Java ensures that the finally block is always executed, even if there is a return or throw in a catch block's body. Listing 5.3 shows example Java code where such a construct exists.

The flow starting at line 11 runs through 12, 17, and 18. In this case the body of the catch block is executed before the finally block. For the flow starting at line 13 it is different, running through lines 14, 17, 18, and 15. The body of the finally block is essentially inlined before the throw statement. Figure 5.1 shows the CFG fragment with these flows.

Our first implementation for handling the try-catch-finally block created a single flow for the body of the finally block and inlined it in the larger CFG. It seemed like an easy solution, but the implementation caused invalid paths. For example, the CFG fragment contains a path running from line 12 through 17, 18, and 15. Obviously such a flow does not exist in the program; there is no way for the statement at line 12 to reach the throw at line 15. Furthermore, line 14 would be able to reach the exit node without crossing the throw at line 15, as it follows the path spawned by the catch block starting at line 11.

¹ How the lines of code metric is computed can be found in chapter 14. For counting the blocks we used Rascal, looping through the AST of every class in a project and incrementing a counter.
1  public void throwFunction() throws Exception {
2      int i = 0;
3
4      if (i > 1) {
5          throw new NullPointerException();
6      }
7
8      try {
9          i = i * 2;
10         throw new NullPointerException();
11     } catch (NoClassDefFoundError exception) {
12         i = 10;
13     } catch (Exception exception) {
14         i = 12;
15         throw exception;
16     } finally {
17         i = 11;
18         i = i * 3;
19     }
20 }
Listing 5.3: Throwing code.
Figure 5.1: CFG fragment.
For the final implementation we still utilise the idea of flow inlining, but instead apply it for every catch block separately. Figure 5.2 shows the generated CFG for the code in listing 5.3. In this implementation there are no invalid paths. A side effect is that the flow of the finally block can have many duplicates in the graph. This should not cause any problems, however, as the nodes in the duplicate flows still refer to the same statements in the code.
5.2.2 Method Calls
Processing method calls for interprocedural control flow analysis is a complex subject that has been researched before [7, 27]. These studies attempt to analyse the program by using the CFG directly, causing context-related issues where invalid paths may exist or flows become "tainted" by spawned paths from a different context. Our use of the CFG is purely as an intermediate representation and it is never used directly for analysis. This allows a simplification in processing method calls, as the control flow does not have to be interprocedural. However, to enable generation of an interprocedural SDG later on, we have to account for eventual graph linking.
The code in listing 5.4 yields three separate CFGs, wherein nodes are added for every call site. During SDG construction it is essential to know how control and data are transferred from caller to callee. In the case of control this is simple: a single edge from the call site to the called method's entry node will suffice. For data it is more complicated, as the algorithm will have to deal with data transfer from input arguments to parameters, and a potential
Figure 5.2: Throw code flow
public void Calling() {
    int i = 1;
    while (i < 11) {
        i = Increment(i);
    }
}

public int Increment(int z) {
    return Add(z, 1);
}

public int Add(int a, int b) {
    return a + b;
}
Listing 5.4: Unprocessed calls
return value to the caller. To provide the algorithm with that information we have extended the CFG generator to spawn parameter nodes that indicate this transfer of data.

Listing 5.5 shows how the code would look if the three output CFGs are transformed back into source code. Every method call is assigned its own node and spawns multiple assignments. The added assignment statements are mapped to the node that spawned them, to be used later during control and system dependence graph generation. All the names start with $ to easily differentiate normal variables from the ones created during CFG construction. Although Java technically permits $ in identifiers, the language specification reserves it for mechanically generated code, so conflicts between the original variables and the created ones are avoided in practice.
Special care must be taken when processing statements where a single method is called multiple times. The naming scheme employed in listing 5.5 causes, for example, $method_Add(int, int)_return to be overwritten if there happens to be a second call to the Add(int, int) method in the same statement. Adding the offset of the call in the file to the variable name (i.e. $method_Add(int, int)_return_<file offset>) solved this problem.
public void Calling() {
    int i = 1;
    while (i < 11) {
        call Increment();
        $method_Increment(int)_in1 = i;
        $method_Increment(int)_return = $Increment(int)_return;
        i = Increment(i);
    }
}

public int Increment(int z) {
    z = $method_Increment(int)_in1;
    call Add();
    $method_Add(int, int)_in1 = z;
    $method_Add(int, int)_in2 = 1;
    $method_Add(int, int)_return = $Add(int, int)_return;
    return Add(z, 1);
    $Increment(int)_result = Add(z, 1);
    $Increment(int)_return = $Increment(int)_result;
}

public int Add(int a, int b) {
    a = $method_Add(int, int)_in1;
    b = $method_Add(int, int)_in2;
    return a + b;
    $Add(int, int)_result = a + b;
    $Add(int, int)_return = $Add(int, int)_result;
}
Listing 5.5: Processed calls
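The offset-based naming rule described above can be sketched as a tiny helper. The class and method names here are hypothetical illustrations, not part of the actual tool, and we assume the offset is the byte position of the call in the source file:

```java
// Hypothetical helper illustrating the call-site naming scheme: the file
// offset of the call is appended, so two calls to the same method within
// a single statement receive distinct return variables.
class CallNaming {
    static String returnVariable(String methodSignature, int fileOffset) {
        return "$method_" + methodSignature + "_return_" + fileOffset;
    }
}
```

Because offsets are unique per call site, the generated names can never collide, even for repeated calls to the same method.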
5.2.3 Method Declaration & Return
Listing 5.5 contains parameter nodes that do not originate from the expansion of method calls; instead they are created when processing the method entry and return statements. Every input parameter receives an assignment statement, with the right-hand side being the value that is generated during method call expansion. Other nodes are assignments to $<name>_result and $<name>_return. Every return statement creates an assignment to $<name>_result, but there is only a single $<name>_return assignment for every method that returns a value. Much like the nodes that are created during the processing of a method call, these new nodes are used to create interprocedural data flow edges when constructing an SDG.
5.3 Considerations
Due to the extensions, reversing a CFG back to source code does not yield an executable
or even syntactically valid Java program. Some of the flows in the created CFG may not
even be realistic execution paths. For example, the created assignment statements after a
return would never be reachable during execution. To create valid intraprocedural CFGs
one would have to disable the generation of parameter nodes, and far more extensive changes would be necessary for valid interprocedural CFG generation.
In our tool the CFG is never used directly during clone detection; it only serves as the basis for creating other graphs, and we never analyse any control flow paths. Combined with changes made during the creation of said graphs, the invalidity of the CFGs does not pose a problem for clone detection analysis. So for our intents and purposes, the current implementation
suffices.
Chapter 6
Post-Dominator Tree
A post-dominator tree (PDT) determines the dominance of every node in a CFG, according to
three relation types: post-dominance, strict post-dominance, and immediate post-dominance.
Post-dominance A node Y is said to post-dominate a node X if every path from node X to the exit node runs through Y. From the definition it follows that every node post-dominates itself.

Strict post-dominance Node Y strictly post-dominates node X if Y post-dominates X and Y ≠ X.

Immediate post-dominance Even though a node may post-dominate multiple nodes, it only immediately post-dominates one. Node Y immediately post-dominates node X if none of the other nodes that strictly post-dominate X are strictly post-dominated by Y. One exception is the exit node: it post-dominates every other node, but no node strictly post-dominates it, so it has no immediate post-dominator.
6.1 Tree Construction
An algorithm for calculating normal dominators can be used to construct a PDT, since a PDT is simply a dominator tree of the inverted graph. Only the input CFG for the algorithm requires a change: it has to be inverted. A simple process of reversing every edge in the graph suffices, which in Rascal boils down to calling one of the standard library functions.
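As an illustration of this inversion step, here is a minimal Java sketch; the class name is hypothetical, and the graph is a plain integer-keyed adjacency map rather than a Rascal relation:

```java
import java.util.*;

// Reverse every edge of a graph, so that a dominator algorithm run on the
// result computes post-dominators of the original graph.
class GraphInverter {
    static Map<Integer, Set<Integer>> invert(Map<Integer, Set<Integer>> graph) {
        Map<Integer, Set<Integer>> inverted = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : graph.entrySet()) {
            for (int target : entry.getValue()) {
                // edge source -> target becomes target -> source
                inverted.computeIfAbsent(target, k -> new HashSet<>())
                        .add(entry.getKey());
            }
        }
        return inverted;
    }
}
```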
Multiple algorithms are available for calculating dominators, with differing computational complexity. A fast but complex implementation by Lengauer and Tarjan [28] has an almost linear runtime. Due to time limitations we have chosen a simple implementation by Aho and Ullman [1] with a complexity of O(mn), with m denoting the number of edges and n the number of nodes in the analysed graph.
Algorithm 2 shows how (post-)dominators are calculated. The basic concept behind the algorithm is that a node X (post-)dominates exactly the nodes that are no longer reachable when paths in the CFG containing X are excluded.
Algorithm 2 Calculating dominators
for all node ∈ graphNodes do
    reach ← reachable nodes from root, avoiding paths with node
    dominations[node] ← graphNodes − {node} − reach
    for all dominatedNode ∈ dominations[node] do
        dominators[dominatedNode] ← dominators[dominatedNode] + {node}
    end for
end for
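Algorithm 2 can be sketched in Java as follows. Class and helper names are hypothetical, and the graph is a plain adjacency map; run on the reversed CFG with the exit node as root, it yields the strict post-dominators:

```java
import java.util.*;

// Sketch of Algorithm 2: node X dominates exactly the nodes that become
// unreachable from the root once every path through X is excluded.
class Dominators {
    static Map<Integer, Set<Integer>> compute(
            Map<Integer, Set<Integer>> graph, Set<Integer> nodes, int root) {
        Map<Integer, Set<Integer>> dominators = new HashMap<>();
        for (int n : nodes) dominators.put(n, new HashSet<>());
        for (int node : nodes) {
            Set<Integer> reach = reachableAvoiding(graph, root, node);
            Set<Integer> dominated = new HashSet<>(nodes);
            dominated.remove(node);
            dominated.removeAll(reach);
            for (int d : dominated) dominators.get(d).add(node);
        }
        return dominators; // node -> the nodes that (strictly) dominate it
    }

    // depth-first reachability from root, skipping the avoided node entirely
    static Set<Integer> reachableAvoiding(
            Map<Integer, Set<Integer>> graph, int root, int avoided) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        if (root != avoided) { seen.add(root); work.push(root); }
        while (!work.isEmpty()) {
            for (int next : graph.getOrDefault(work.pop(), Set.of())) {
                if (next != avoided && seen.add(next)) work.push(next);
            }
        }
        return seen;
    }
}
```

On a simple chain 0 → 1 → 2 this reports that node 2 is dominated by both 0 and 1, matching the intuition that every path to 2 passes through them.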
6.2 Finding Immediate Post-Dominators
Every node has a single immediate post-dominator that must be extracted from the dominance sets. Algorithm 3 shows how this is done. Its implementation exactly mirrors the definition of an immediate post-dominator. For every node we look at the nodes that post-dominate it. When a post-dominator is found that does not strictly post-dominate any of the other post-dominators, an edge is added to the PDT signifying the immediate post-dominance relation.
Algorithm 3 Calculating immediate dominators
for all node ∈ graphNodes do
    for all dominator ∈ dominators[node] do
        if dominations[dominator] ∩ dominators[node] ≡ ∅ then
            Add an edge to the tree from dominator to node
            break
        end if
    end for
end for
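Algorithm 3 might be sketched in Java like this; the names are hypothetical, and the two input maps mirror the dominators and dominations collections from Algorithm 2:

```java
import java.util.*;

// Sketch of Algorithm 3: the immediate dominator of a node is the one
// dominator that dominates none of the node's other dominators.
class ImmediateDominators {
    static Map<Integer, Integer> compute(
            Map<Integer, Set<Integer>> dominators,    // node -> its dominators
            Map<Integer, Set<Integer>> dominations) { // node -> nodes it dominates
        Map<Integer, Integer> idom = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : dominators.entrySet()) {
            for (int dominator : e.getValue()) {
                Set<Integer> overlap = new HashSet<>(
                        dominations.getOrDefault(dominator, Set.of()));
                overlap.retainAll(e.getValue());
                if (overlap.isEmpty()) {   // dominates no other dominator
                    idom.put(e.getKey(), dominator);
                    break;
                }
            }
        }
        return idom; // each edge idom[n] -> n is an edge of the tree
    }
}
```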
Figure 6.1 shows a simplified input CFG without expanded call sites and with an added start node that denotes the dependence on the external signal causing the method to execute. This node has no analytical value, but clarifies the visualisation when the PDT is rendered. An example of such a PDT can be seen in figure 6.2.
Figure 6.1: Input CFG
Figure 6.2: Output PDT
Chapter 7
Control Dependence Graph
The control dependence graph (CDG) shows how nodes are dependent on each other for
execution. We say that a node X is control dependent on a node Y if one path from Y ensures that X is not executed, while another causes X to be executed. This principle is stated more formally in the following definition.

    Node Y is control dependent on node X if, and only if, there is a path in the CFG from X to Y that does not pass through the immediate post-dominator of X.
While it is possible for a node to be control dependent on multiple preceding nodes, there
is only a single one on which it directly depends, denoting the immediate control dependence
relationship. Edges in the output CDG model this immediate dependence exclusively.
7.1 Graph Construction
An algorithm to construct a CDG is presented by Ferrante et al. [10]. Initially it retrieves all edges in the CFG where the source node is not post-dominated by the target node. After this, a path in the PDT is constructed from the immediate post-dominator of the source node to the target node. Every node on that path that is not the source node or the immediate post-dominator is then marked as being control dependent on the source node. Algorithm 4 shows the implementation of this behaviour. It stores the dependencies in two directions, mapping a node both to all nodes that depend on it and to the nodes it depends on.
Algorithm 4 Calculating control dependence
inspectionEdges ← {X → Y ∈ graphEdges, Y does not post-dominate X}
for all X → Y ∈ inspectionEdges do
    parentIdom ← immediate post-dominator of X
    pathNodes ← nodes on the PDT path ⟨parentIdom → Y⟩
    pathNodes ← pathNodes − parentIdom
    for all pathNode ∈ pathNodes, pathNode ≠ X do
        dependencies[pathNode] ← dependencies[pathNode] + {X}
        controls[X] ← controls[X] + {pathNode}
    end for
end for
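A possible Java sketch of Algorithm 4, with hypothetical names: the PDT is represented as a parent map (node → immediate post-dominator), so the path ⟨parentIdom → Y⟩ is walked bottom-up from Y. We assume every inspected source node has an entry in that map:

```java
import java.util.*;

// Sketch of Algorithm 4 (Ferrante et al.): for every CFG edge X -> Y where
// Y does not post-dominate X, every node on the PDT path from ipdom(X)
// down to Y (excluding ipdom(X) and X itself) is control dependent on X.
class ControlDependence {
    static Map<Integer, Set<Integer>> compute(
            Set<int[]> cfgEdges,                       // edges as {source, target}
            Map<Integer, Set<Integer>> postDominators, // node -> its post-dominators
            Map<Integer, Integer> pdtParent) {         // node -> its ipdom (PDT parent)
        Map<Integer, Set<Integer>> dependencies = new HashMap<>();
        for (int[] edge : cfgEdges) {
            int x = edge[0], y = edge[1];
            if (postDominators.getOrDefault(x, Set.of()).contains(y)) continue;
            int stop = pdtParent.get(x);
            // walk up the PDT from Y until we reach ipdom(X)
            for (Integer n = y; n != null && n != stop; n = pdtParent.get(n)) {
                if (n != x) {
                    dependencies.computeIfAbsent(n, k -> new HashSet<>()).add(x);
                }
            }
        }
        return dependencies;
    }
}
```

On the PDT of figure 7.1 (Entry above 3, 3 above 2, with ipdom(1) = Entry) and the CFG edge ⟨1 → 2⟩, the walk marks nodes 2 and 3 as control dependent on node 1, matching the worked example in the text.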
Figure 7.1: Simple PDT
To show an example run of the algorithm we use figure 7.1. Let us assume that the related CFG of this PDT contains an edge ⟨1 → 2⟩. This is an example of an edge where the target node does not post-dominate the source node. Consequently, the algorithm retrieves the immediate post-dominator of 1 (i.e. Entry) and creates a path in the PDT from that node to the target node, being 2. On this path we find the nodes 3 and 2, which are then marked as being control dependent on node 1.
7.2 Finding Immediate Control Dependence
The definition of immediate control dependence is very similar to that of the immediate post-dominance relation in a PDT, and is stated as follows.

    Node X is immediately control dependent on Y if all other controllers of X are not control dependent on Y.

Not only is the definition very similar, the algorithm to calculate immediate control dependence also resembles the one for calculating immediate post-dominance. Algorithm 5 shows how this behaviour is implemented, with a small number of modifications.
Algorithm 5 Calculating immediate dependence
for all node ∈ graphNodes do
    if node is a parameter node then
        Add edge ⟨parameterParent → node⟩
        continue
    end if
    for all controller ∈ sort(dependencies[node]) do
        if size(dependencies[node]) ≡ 1 or controls[controller] ∩ dependencies[node] ≡ ∅ then
            Add edge ⟨controller → node⟩
            break
        end if
    end for
end for
The first modification is the addition of a conditional that checks whether a node is a Parameter() node. Such nodes are excluded from normal analysis and are directly connected to the node that spawned them. That parent node is retrieved from a map that has been constructed during CFG creation. We exclude Parameter() nodes from normal analysis because they are not represented in the original source code, making them dependent only on the statement that spawned them.
public void loopBreaking() {
    int i = 0;            // 0
    while (i < 10) {      // 1
        if (i == 6) {     // 2
            break;        // 3
        }
        i = 10;           // 4
    }
}
Listing 7.1: Problematic code
Figure 7.2: CFG of problem code
A second modification is the sorting of the collection that is iterated in the inner loop, which is necessary to avoid invalid edges when analysing loops. An example of problematic code is shown in listing 7.1 and its CFG in figure 7.2. The source of the problem is edge ⟨1 → 2⟩, where 2 does not post-dominate 1. When trying to determine the immediate control dependence of 1 we find that there are two options: it can be immediately control dependent on Entry or on node 2. Naturally, being control dependent on a node that executes later is impossible, but such a dependency will be created if the inner loop looks at node 2 first. Sorting avoids this problem, as the Entry node has a negative value as identifier. Figure 7.3 shows the PDT and the proper CDG for the code in listing 7.1.
(a) PDT (b) CDG
Figure 7.3: Problem code graphs
7.3 Simplifications
Region nodes are introduced by Ferrante et al. to enhance control dependence analysis [10]. Statements in the bodies of both branches of an if-else statement are currently dependent on a single node. When analysing the CDG directly, it may be important to know whether a statement resides in the true or false branch of a conditional block. Region nodes can be used to group statements that are dependent on a common expression, including branch information. We take no interest in these branches and only use the dependencies for clone detection, allowing us to exclude region nodes.
Chapter 8
Data Dependence Graph
A data dependence graph (DDG) is used to analyse data dependences between statements, granting insight into how a piece of code may be reorganised without affecting its semantics. This information helps a compiler in parallelising commands that have no constraints due to data dependence.
A = 0;
B = A + 1;
C = A + B;
D = 5;
In the trivially simple code above we see that B depends on A, and C on both of them.
This dependency makes it impossible to reorder the statements, while retaining the semantics.
Statement D is not dependent on anything and can be executed at any point.
8.1 Data Collections
Intuitively we can think of data dependence as existing between statements that define and use a variable. As shown later in this chapter, it is a bit more complicated than that, as there are multiple edge cases to consider. The idea does show the first two basic data mappings that have to be calculated. Definitions is the first one and maps a variable name to all the statements that define it. Uses is the second one and maps a statement's identifier to all the variables it uses. The example assignment below at line 4 is analysed as defining A and using B and C.
1  A = 0;
2  B = 1;
3  C = 2;
4  A = B + C;
Not only is every variable name mapped to all the statement identifiers that define it; every statement identifier also maps to the variables it defines, or generates. In the example code above the statement at line 4 is mapped to the variable A in a mapping called generators, as it generates variable A.
The final data collection we use is the kill mapping. It maps every statement identifier to the variables it kills or, in other words, redefines. This data is calculated by using the generators and definitions mappings. For every node the variables it generates are retrieved, and for every generated variable we say that the statement kills all other definitions of that variable. For example, line 4 in the code above kills the variable definition of A at line 1.
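The derivation of the kill mapping from the generators and definitions mappings can be sketched as follows; the names are hypothetical, with statements represented as integers and variables as strings:

```java
import java.util.*;

// Sketch of the kill mapping: a statement kills every other definition of
// each variable it generates, derived from generators and definitions.
class KillSets {
    static Map<Integer, Set<Integer>> compute(
            Map<Integer, Set<String>> generators,    // statement -> variables it defines
            Map<String, Set<Integer>> definitions) { // variable -> statements defining it
        Map<Integer, Set<Integer>> kills = new HashMap<>();
        for (Map.Entry<Integer, Set<String>> e : generators.entrySet()) {
            Set<Integer> killed = new HashSet<>();
            for (String variable : e.getValue()) {
                killed.addAll(definitions.getOrDefault(variable, Set.of()));
            }
            killed.remove(e.getKey()); // a statement does not kill its own definition
            kills.put(e.getKey(), killed);
        }
        return kills;
    }
}
```

On the four-line example above, statement 4 kills the definition of A at statement 1, and vice versa.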
8.1.1 Variable Encoding
Variables in the preceding examples simply encode into a string that is equal to their name. However, some variables and statements need a different encoding, such as array accesses and method calls.
Array accesses could be encoded by the name of the array directly, so that myArray[x] is said to use myArray and is encoded as such. It is a simple approach, but we found it to be too imprecise. The difference between accessing an array with a variable or with a fixed value is syntactically small, but may indicate a large semantic difference. Accessing the first element of an array explicitly may indicate a very specific purpose. We reason that accessing an array with an index in a loop is not as specific and may therefore be less meaningful semantically. In other words, we argue that for myArray[x], with x being 5 or 6 inside a loop where x is the loop index, the exact value makes no semantic difference. To separate the two ways of accessing an array we encode them differently. An access that uses a variable, such as myArray[x], is encoded as myArray[variable], whereas myArray[5] is simply encoded as myArray[5].
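This encoding rule can be sketched as a small helper; the class and method names are hypothetical, and we assume the index expression has already been extracted as a string:

```java
// Hypothetical sketch of the array-access encoding: a constant index is
// kept verbatim, while any variable index collapses to the marker "variable".
class VariableEncoder {
    static String encodeArrayAccess(String arrayName, String indexExpression) {
        boolean constantIndex = indexExpression.chars().allMatch(Character::isDigit);
        return arrayName + "[" + (constantIndex ? indexExpression : "variable") + "]";
    }
}
```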
During data analysis any method call that is encountered will be encoded to link to the previously created $method_<name>_return assignment statements. For example, the statement $Increment(int)_result = Add(z, 1) in listing 5.5 will be implicitly encoded as $Increment(int)_result = $method_Add(int, int)_return to allow for proper data analysis.
8.2 Reaching Definitions
An important step in the creation of a DDG is the calculation of the reaching definitions. There are two maps, in and out, that map every statement to the definitions that reach it and that flow out of it, respectively. For a statement S the in and out maps are calculated by the following equations.
• in[S] = ⋃_{p ∈ predecessors[S]} out[p]
• out[S] = gen[S] ∪ (in[S] − kill[S])
Implementing the equations is straightforward, as they can be directly translated into code. An important thing to consider is that a single pass will not be enough to calculate the maps; instead a fixed point must be found. Therefore, the equations are executed in a loop over every node in the graph until no further changes are made to in and out. Algorithm 6 shows how this behaviour is implemented.
An example input for calculating reaching definitions can be seen in figure 8.1, with the results presented in table 8.1. The data in the columns shows how variable data is stored in the maps, with the first member of a tuple holding the name of the variable and the second being the identifier of the statement for which the tuple has been generated.
Algorithm 6 Calculating reaching definitions
while in and out have changed do
    for all node ∈ graphNodes do
        in[node] ← ⋃_{p ∈ predecessors[node]} out[p]
        out[node] ← gen[node] ∪ (in[node] − kill[node])
    end for
end while
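A direct Java translation of Algorithm 6 might look like this; the names are hypothetical, and definitions are encoded as plain strings such as "a@1" (variable a defined at statement 1):

```java
import java.util.*;

// Fixed-point sketch of Algorithm 6: recompute in/out for every node
// until neither map changes anymore.
class ReachingDefinitions {
    static Map<Integer, Set<String>> compute(
            Map<Integer, Set<Integer>> predecessors,
            Map<Integer, Set<String>> gen,
            Map<Integer, Set<String>> kill,
            Set<Integer> nodes) {
        Map<Integer, Set<String>> in = new HashMap<>();
        Map<Integer, Set<String>> out = new HashMap<>();
        for (int n : nodes) { in.put(n, new HashSet<>()); out.put(n, new HashSet<>()); }
        boolean changed = true;
        while (changed) {                    // iterate until a fixed point
            changed = false;
            for (int node : nodes) {
                Set<String> newIn = new HashSet<>();
                for (int p : predecessors.getOrDefault(node, Set.of())) {
                    newIn.addAll(out.get(p));
                }
                // out = gen ∪ (in − kill)
                Set<String> newOut = new HashSet<>(newIn);
                newOut.removeAll(kill.getOrDefault(node, Set.of()));
                newOut.addAll(gen.getOrDefault(node, Set.of()));
                if (!newIn.equals(in.get(node)) || !newOut.equals(out.get(node))) {
                    in.put(node, newIn);
                    out.put(node, newOut);
                    changed = true;
                }
            }
        }
        return out;
    }
}
```

Because the sets only ever grow toward a finite universe of definitions, the loop is guaranteed to terminate.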
Figure 8.1: Example CFG
Statement | gen   | kill         | in                  | out
--------- | ----- | ------------ | ------------------- | -------------------
1         | ⟨a,1⟩ | ⟨a,5⟩        | ∅                   | ⟨a,1⟩
2         | ⟨c,2⟩ | ⟨c,4⟩, ⟨c,6⟩ | ⟨a,1⟩               | ⟨a,1⟩, ⟨c,2⟩
3         | ∅     | ∅            | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩
4         | ⟨c,4⟩ | ⟨c,2⟩, ⟨c,6⟩ | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨a,1⟩, ⟨c,4⟩
5         | ⟨a,5⟩ | ⟨a,1⟩        | ⟨a,1⟩, ⟨c,2⟩, ⟨c,4⟩ | ⟨c,2⟩, ⟨c,4⟩, ⟨a,5⟩
6         | ⟨c,6⟩ | ⟨c,2⟩, ⟨c,4⟩ | ⟨c,2⟩, ⟨c,4⟩, ⟨a,5⟩ | ⟨a,5⟩, ⟨c,6⟩

Table 8.1: Reaching definitions for figure 8.1
8.3 Graph Construction
With all the calculated mappings we are able to generate a DDG out of a CFG. Algorithm 7 shows the steps involved in creating a DDG, as it loops through all the used variables of every node in the domain of the uses mapping.
The first conditional statements filter out any global variables that are not being defined
in the current function scope. Those global variables are stored for later use during SDG
construction. It also accounts for array accesses that may not have been defined yet, even
though the array itself is defined in the function scope. In this case it falls back to the more
general array name and retrieves the statement nodes that define it.
Algorithm 7 Calculate the DDG
for all node ∈ domain(uses), usedVariable ∈ uses[node] do
    if usedVariable ∉ definitions then
        if usedVariable is an array and the name of the array ∈ definitions then
            variableDefinitions ← definitions[name of the array]
        else
            continue
        end if
    else
        variableDefinitions ← definitions[usedVariable]
    end if
    for all dependency ∈ in[node] ∩ variableDefinitions do
        Add edge ⟨dependency.identifier, node⟩
    end for
end for
After the proper variable definition nodes are retrieved, the algorithm loops through the variable definitions that can actually reach the currently analysed node. For every reaching definition an edge is added from its respective node to the current node. Figure 8.2 shows the DDG for the CFG in figure 8.1.
Figure 8.2: DDG for figure 8.1
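The core of Algorithm 7 can be sketched in Java as follows. The names are hypothetical, definitions are encoded as "variable@statement" strings to mirror the reaching-definitions sets, and the array fallback and global-variable filtering are omitted for brevity:

```java
import java.util.*;

// Simplified sketch of Algorithm 7: an edge runs from a definition to a
// use whenever that definition actually reaches the using node.
class DataDependence {
    static Set<List<Integer>> compute(
            Map<Integer, Set<String>> uses,        // node -> variables it uses
            Map<String, Set<Integer>> definitions, // variable -> defining nodes
            Map<Integer, Set<String>> in) {        // node -> reaching definitions
        Set<List<Integer>> edges = new HashSet<>();
        for (Map.Entry<Integer, Set<String>> use : uses.entrySet()) {
            int node = use.getKey();
            for (String variable : use.getValue()) {
                for (int defNode : definitions.getOrDefault(variable, Set.of())) {
                    // only add an edge if this definition reaches the node
                    if (in.getOrDefault(node, Set.of()).contains(variable + "@" + defNode)) {
                        edges.add(List.of(defNode, node));  // edge defNode -> node
                    }
                }
            }
        }
        return edges;
    }
}
```

For the four-line example at the start of this chapter (with the use of B and C at statement 4), this yields exactly the edges 2 → 4 and 3 → 4.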
8.4 Simplifications
Our focus on detecting refactored clones in single files allowed us to simplify some factors in
the creation of a DDG.
Ignore Anonymous Subclasses Some constructs in Java allow anonymous subclasses, such as the creation of a new thread. Listing 8.1 shows how such an anonymous subclass may manifest in source code. In this case the run() method is part of the anonymous subclass that is spawned by creating the new Thread object. We ignore this subclass entirely, for simplicity and because such subclasses are relatively rare.
Thread thread = new Thread() {
    public void run() {
        System.out.println("Thread Running");
    }
};
Listing 8.1: Anonymous subclass
Loop Unrolling Loops are not unrolled, which may lead to imprecise DDGs when arrays are accessed with a loop index. In listing 8.2 we see a loop that can be unrolled relatively easily. Doing so would show that reversing the execution flow (i.e. starting at i = 4 and counting down to 1) does not retain the semantic meaning.
for (int i = 1; i < 5; i++) {
    myArray[i] = myArray[i + 1] + 5;
}