Is program slicing using Rascal MPL a practical approach in aiding database reverse engineering?


Master Thesis Software Engineering

Host organization: University of Amsterdam
Supervisor: dr. Magiel Bruntink

Martijn Barsingerhorn

m.barsingerhorn@ilsmoija.nl

November 15, 2015

A program's database documentation includes a conceptual schema, which describes organizational entities and their relations, and a logical schema, which describes data structures and their relations without yet making any database-technology-specific decisions. Evolution of database-backed programs relies on this documentation being complete and up to date. Often, however, this documentation is out of date, incomplete, or both, hindering evolution. The recovery of database documentation is called DRE, an abbreviation of Database Reverse Engineering.

For a database system used intensively in a hospital, the most up-to-date formal documentation is incomplete and over twenty years old. Effectively, the only up-to-date documentation is the source code of the programs using the database. The database system must now be adapted to use a new data environment. To do this intelligently, the documentation of the system must first be recovered.

To recover database information from a program's source code, program slicing is used. Program slicing [13, 33] transforms the source code of a program in such a way that only statements that influence a chosen statement-and-variable(s) tuple, called a slice criterion, remain. If the slice criterion points to a database operation, only statements that influence this database operation remain in the slice. Database information can then be deduced from these sliced statements, and subsequently this information can be used to rebuild the database's conceptual and logical schemas. For this purpose, a program slicer is constructed in Rascal MPL [40]. First, a context free grammar is developed to parse the source code of the programs using the database. Subsequently, the resulting parse trees are sliced using a backwards-tracing, static, inter-procedural, non-executable program slicer. The resulting program slices are visualized as highlighted text.

A total of 143 program slices are generated for a 55 kloc test program, taking 60 hours of computing time distributed over multiple computers. A program slice contains on average 67 statements which influence the slice criterion, located in on average 5 methods. These methods are included completely in the slice in a light grey color to provide context for the brightly colored statements which influence the slice criterion. Reading and interpreting a single slice takes on average 5 minutes for a person who knows the organizational context the test program operates in, but who has no in-depth knowledge of the test program's actual purpose. Database information describing foreign keys, structured data fields and detailed field semantics is deduced from the program slices. Most recovered information helps in rebuilding the logical database schema; information that helps in rebuilding the conceptual schema is scarcer. Rascal MPL proved to be a stable and well documented environment to build the program slicer in, although it is slow and memory demanding in its execution.

Program slices allow quick recovery of database information, though most of the recovered information helps in rebuilding the logical schema. To confidently rebuild the conceptual schema, other sources of information must be used in conjunction. Implementing a program slicer is a time consuming task, which makes this approach mainly viable for large, sufficiently important programs for which this setup cost is justified. Alternative uses of the slicer can also be considered to help justify this setup cost.

This thesis is written for the University of Amsterdam as a completion of my master project for the master program Software Engineering.

I would like to express my gratitude to my supervisor Magiel Bruntink for his useful comments, remarks and guidance through the learning process of this master project.

At the university, I would like to thank Hans Dekkers, Paul Griffioen, Paul Klint, Tijs Storm, Jurgen Vinju and Hans van Vliet for their enthusiasm and inspiring courses offered in the program. Further, I would like to thank the radiotherapy department of the NKI/AVL for the opportunity to follow this program.

Finally, I would like to thank the people in my surroundings for their patience and support. You know who you are :-)

Contents

1 Introduction
2 Motivation
   Problem statement
   Research question
   Research method
3 Background
   Context free grammars
   Program slicing
   Database reverse engineering
4 Grammar
   Introduction
   Raw grammar extraction
   Testing the grammar
   Lexical syntax
   Correction and completion of raw grammar
   Abstract syntax tree
   Beautification
   Expanding test cases
   Production usage
   Conclusion
5 Program Slicer
   Introduction
   Unique symbol name extraction and creation
   Control flow graph creation
   Variable usage extraction
   Symbol name resolving
   Method call contexts
   Design decision
   Traversal construction
   Visualization
   Tests
   Conclusion
6 Database Information Recovery
   Introduction
   Test program preparation
   Program slice generation
   Slice analysis
   Slice interpretation
   Recovered information assessment
   Conclusion
7 Conclusion
   Answering sub research questions
   Answering main research question
Bibliography
A Delphi Grammar
B Delphi Grammar Example Test Case
C Program Slices
D Program Slicer Example Test Case
E Program Slice Sizes

Introduction

The NKI/AVL1 is a combination of a research institution and a hospital housed in a single building complex, focused on the improvement and development of cancer treatment methods and on the treatment of cancer patients. Software aiding and implementing novel treatment techniques has been developed in-house since the early 1980s. Today, software used to treat patients still partly relies on components developed in this era. In some cases, these components have aged well and are still properly documented and maintainable. In other cases, however, they have not aged so well and have become poorly documented and nearly unmaintainable. Some databases belong to this last category.

Dutch government legislation mandates hospitals to modernize their information environment. Database systems must be well documented, secure and maintainable, and access to these systems must be controlled and accountable. In the NKI/AVL, some core clinical software programs which use old databases, based on now obsolete technology, cannot be modified anymore to satisfy these new mandated requirements.

To regain the ability to modify these programs and to replace these obsolete, nearly unmaintainable databases with modern, maintainable counterparts, knowledge of these old databases must first be recovered. A hindering factor in recovering this knowledge is the size of these systems, i.e. the programs together with their respective databases. They are also likely to be the product of many development iterations, performed by many different people, showing different levels of understanding and development styles while adding and deleting functionality and fields. Additionally, the people who originally developed these systems have meanwhile left the organization, so first-hand knowledge has faded. Written documentation is minimal and, if available, generally out of date by at least a decade. These factors make these systems harder to understand. The databases in use have effectively become an undocumented combination of still recognizable, unrecognizable, relevant and no longer relevant information.

To intelligently bring a system back to a maintainable state, correct and complete documentation of the system is needed. For a database this means that complete conceptual, logical and physical schemas must be produced. At this time, the only up-to-date sources of information are the source code of the programs using the databases and the actual layout of the databases as stored in the table headers2.

Program slicing is a technique in which a program's source code is transformed in such a way that only the statements that influence a location-variable tuple of interest remain [13, 33]. This reduces the size of the program and thus makes relevant program parts easier and faster to comprehend. When a program is sliced on database manipulations, only the statements related to these manipulations remain in the resulting program slice. These statements may contain information about the meaning of the database.

1 Nederlands Kanker Instituut / Antoni Van Leeuwenhoek ziekenhuis

Can a program slicer be used on this legacy system, and can it help in reverse engineering the database's conceptual, logical and physical schemas? To answer this question, a program slicer will be constructed and used to produce program slices of a database program's database manipulations. Then the usefulness of these slices in reverse engineering the database's schemas will be analyzed.

The thesis is organized as follows: In chapter 2, the problem statement, research question and research method are discussed in more detail. Chapter 3 provides background information on the main topics used in this thesis: context free grammars, program slicing and database reverse engineering. Chapter 4 describes the construction and testing of a context free grammar for the target programming language, followed by chapter 5, which discusses the construction of a program slicer in detail. In chapter 6, a test program is prepared, program slices are created, and database information is retrieved and analyzed. Finally, conclusions are drawn in chapter 7.

Motivation

Problem statement

Dutch government legislation mandates hospitals to modernize their information environment. Database systems must be well documented, secure and maintainable, and access to them must be controlled and accountable. Legacy systems in the NKI/AVL do not all comply yet with these new requirements, and some systems cannot simply be reconfigured to do so in the future. Additionally, some legacy systems are very poorly documented, which practically inhibits evolution.

One such system is known in the NKI/AVL simply as the QUIRT database. QUIRT, or Quality Insurance and Imaging in RadioTherapy [15], was a project carried out at the NKI/AVL, together with three other European cancer institutions, aimed at implementing a system to further improve radiation treatment accuracy. The resulting software system was successful and was eventually used for more and more radiotherapy-related tasks: image acquisition, image matching, image visualization, instrument calibration and patient treatment progress monitoring. Applications were implemented in an interpreted domain specific language, which provided access to a library of mainly image processing and data communication functionality, all written in C++. The system originally ran on top of MS-DOS.

Most of the original QUIRT applications have by now been retired and replaced by more modern Windows Delphi applications. Data storage for QUIRT applications was done in a single central database. This database is still operational today and is used by many applications, including the modern ones. It is a DB3 database, which uses plain files on a filesystem directly accessible by each client application. The latest official documentation of the system is more than 20 years old, which means that all modifications performed in the last 20 years are undocumented. This database must be migrated to a modern database environment, and the schemas of the database must be recovered and re-documented.

Research question

Software evolution is defined as “the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment” [26]. To adapt the QUIRT database to a new environment, and to regain the ability to modify it, it must first become clear what the actual functional and technical specification, or meaning and implementation, of the database is.

To get the technical specification of a database, or the physical schema, one can read the DDL1 code or, in the case of DB3, which is file based and has no distinct DDL, the database table headers in

the files2. To relearn the functional specification of the database, or, phrased differently, the meaning of the database, which was originally documented in a logical and a conceptual schema, reverse engineering can be used. According to [38], “Reverse engineering a piece of software consists, among others, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs”. A problem is that the source text of these legacy programs may be very large and may have become convoluted.

In Weiser's [33] original paper on program slicing, it is stated that: “A slice consists of any subset of program statements preserving the behavior of the original program with respect to a program point and a subset of the program variables (slicing criterion), for any execution path”. Put differently, only the statements in the program contributing to the state of that program at a certain program location for a certain set of program variables are to be included in the slice. Slicing could be used here to produce relatively small subsets of the original legacy code, with database related statements as the slice criteria of these subsets. Then, the information needed to reverse engineer the database could be deduced from these program slices. This idea is inspired by work done in [9].

Rascal MPL [40] is a meta-programming language developed at the CWI3 [21]. Rascal MPL can be used to create language parsers based on context free grammars, and it also includes a scripting language that can be used to develop programs that work directly on the parse trees of parsed programs. On its website, Rascal MPL claims to be “The one-stop shop for meta programming”.

This leads to the following research question:

• Is program slicing using Rascal MPL a practical approach in aiding database reverse engineering?

The question can be broken up into the following sub questions:

• To perform operations on code written in a certain language in a meta-programming environment, a parser based on a context free grammar for this particular language is needed. The legacy system under investigation has applications implemented mainly in Delphi, a development environment based on Object Pascal. The grammar must be able to serve as a foundation for a program slicer.

– Can a suitable grammar be developed for Delphi?

• There are multiple methods of program slicing, all with their own strengths and weaknesses and thus their own suitability for specific purposes. A method has to be selected and a prototype program slicer has to be implemented in the meta-programming environment so it can be used to create slices of Delphi code.

– Can a suitable program slicer be developed for Delphi?

• Finally, what database information can be recovered from the produced program slices? Will this convincingly help to rebuild the conceptual and logical database schema?

– Can database information be recovered from the resulting program slices?

Research method

Methodology

This research is conducted using the constructive research method [35]. Constructive research means applying the scientific method while building a construct, or an artifact. This artifact will

2 Implicit constraints on database fields, defined in source code but not enforced in the physical database schema while that could have been technically possible, also have to be reverse engineered. For clarity, this is omitted in this paragraph.

serve as the basis for all measurements, hypotheses, predictions and evaluations done. The constructive research method also implies several iterations of improvement. Hypotheses are continuously formed about how the current artifact can be improved.

Practical approach

To answer the research questions, three artifacts will be constructed: A context free grammar, a program slicer and a set of program slices.

A program slicer uses a parse tree as input. This parse tree can be generated by a parser, which in turn can be generated by a parser generator. This parser generator requires a context free grammar describing the language. The program slicer in turn analyzes the parse tree and computes a set of program slices described by the selected slice criteria.

The first two artifacts, the grammar and the program slicer, will be built using an iterative refining approach. First, a rudimentary artifact will be built; this serves as a starting point. Then:

1. The artifact will be tested against a set of inputs which serve as test cases that get more diverse and larger in size until the artifact demonstrates some sort of undesired behavior.

2. A hypothesis will be formed on how the artifact can be changed to successfully process this “offending” input.

3. A change will be made and a test will be performed.

4. If undesired behavior still exists, the hypothesis must be altered and more changes must be made, so one has to return to step 2. If all the test cases execute correctly, more complicated cases can be added to the test set.

This cycle is repeated until the actual source code of the legacy system is processed correctly and efficiently.

The third artifact, the set of program slices, will be generated from a set of slice criteria. These slice criteria correspond directly to database statements in the source code of the legacy system. An analysis of the source code of the legacy system must be made to determine which statements return the most useful information for reverse engineering the database. This also depends on the type of program slicer implemented.

The recovered database information will be analyzed to assess its value in the reverse engineering process.

Background

Context free grammars

This chapter is partly based on [25, 28, 36].

As stated in [12], a generative grammar is “a device, as a body of rules, whose output is all of the sentences that are permissible in a given language, while excluding those that are not permissible”. The grammar is used together with a vocabulary, also called an alphabet, which is a set of words, here called symbols, to form ordered sequences of symbols, akin to sentences, here called strings.

The rules producing a string are conveniently called productions, and in a context free grammar take the form described in equation 3.1:

V → w (3.1)

Here, V is a single non-terminal that will be replaced by w, which is a string of both non-terminals and terminals. Note that w can be empty. A grammar is context free if a production rule replacing a non-terminal can be applied regardless of the symbols surrounding the non-terminal. A language generated by a context free grammar is called a context free language.

A context free grammar G can be described by a four element tuple, as shown in equation 3.2:

G = (V, Σ, R, S) (3.2)

• Here, V is a finite set where each element v ∈ V is a non terminal. Each non terminal represents a different type of phrase in the string.

• Σ is the finite set of terminals, disjoint from V . A finished string consists of terminals only. The complete set of terminals equals the vocabulary of the grammar G.

• R is a set of relations, each describing how to transform an element of V to a string of elements of (V ∪ Σ). These are the production rules of the grammar G.

• S is the start symbol, which represents the whole string, or program. S must be an element of V .

Productions can be written as a pair (α, β) ∈ R, where α ∈ V is a non-terminal and β ∈ (V ∪ Σ)∗ is a string of non-terminals and terminals. In grammar tools, however, the notation α → β is usually used. If β is empty, it is denoted as ε.

Applying a rule can be formulated as follows: For two strings u, v ∈ (V ∪ Σ)∗, u results in v, written u ⇒ v, if ∃(α, β) ∈ R with α ∈ V, β ∈ (V ∪ Σ)∗ and u1, u2 ∈ (V ∪ Σ)∗ such that u = u1αu2 and v = u1βu2. Summarized, v is a result of applying rule (α, β) to u, disregarding the surrounding symbols u1 and u2, which form the context the rule is applied in.

A string v ∈ Σ∗ is part of the language if there is a derivation of one or more steps from the start symbol to v, i.e. ∃i ≥ 1 and u1, · · · , ui ∈ (V ∪ Σ)∗ such that S ⇒ u1 ⇒ · · · ⇒ ui = v, which is written as S ⇒+ v.

The complete language of G is the set L(G) = {w ∈ Σ∗ : S ⇒+ w}.

A grammar is ambiguous if multiple derivation paths exist from the start symbol to the same string.

A parser uses the context free grammar to analyse a piece of text and to express it in structures corresponding to the grammar. This creates a tree of productions starting at S and leading to the input string or program, called a parse tree.
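To make the definitions above concrete, the four-tuple G = (V, Σ, R, S) can be encoded directly. The toy grammar and code below are purely illustrative (they are not part of the thesis's Delphi grammar); the grammar generates the context free language { aⁿbⁿ : n ≥ 0 }:

```python
# Toy context free grammar G = (V, Sigma, R, S) for { a^n b^n : n >= 0 }.
# V = {"S"}, Sigma = {"a", "b"}, R = { S -> a S b, S -> epsilon }, start "S".
GRAMMAR = {
    "S": [["a", "S", "b"], []],  # [] encodes the empty production (epsilon)
}

def derives(symbols, target, depth=10):
    """True if the symbol string `symbols` can derive `target`.

    Naive depth-bounded search that repeatedly rewrites the leftmost
    non-terminal, i.e. performs leftmost derivation steps u => v.
    """
    if depth < 0:
        return False
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:  # leftmost non-terminal found
            return any(
                derives(symbols[:i] + prod + symbols[i + 1:], target, depth - 1)
                for prod in GRAMMAR[sym]
            )
    return "".join(symbols) == target  # only terminals left: compare

print(derives(["S"], "aabb"))  # True:  S => aSb => aaSbb => aabb
print(derives(["S"], "abb"))   # False: not in L(G)
```

Each recursive call performs one derivation step; the depth bound keeps the naive search finite, which a real parser avoids by analyzing the grammar instead of enumerating derivations.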

Program slicing

This chapter is partly based on [2, 41, 32, 29].

Program slicing is a technique first proposed by Mark Weiser. In his paper [33], he presents the idea that programmers debugging software use a mental abstraction, breaking the software down into smaller pieces. These pieces of code are sets of statements related to each other by their control or data flow. These statements are not necessarily a contiguous part of the program; they may very well be scattered. Weiser calls these sets of statements program slices. A program slice contains all the statements in a program affecting certain variables at a point of interest. This point of interest is defined by the slice criterion, which is a two-tuple containing a program location and a set of variables.

The program slicer computes the program slice using the specified slice criterion, resulting in just the subset of statements of the program directly or indirectly influencing this slice criterion. There are different methods to compute a slice, which can be thought of as a combination of several aspects, taken from, but not necessarily limited to, the following list:

• Static or dynamic. Static program slicing computes a program slice using only the program source code as input. Its opposite is dynamic program slicing, where an instance of the program's data is also used to compute the program slice. Dynamic program slicing generally produces more precise results, as only statements in relevant or taken branch paths are included in the program slice, as opposed to static program slices, which have to take every path of a branch into consideration.

• Forward or backward. This defines the direction in which statements that have influenced the slice criterion, or statements that will be influenced by the slice criterion, are sought.

• Intra- or inter-procedural. When a program slicer creates program slices only within a single procedure or method, it is called intra-procedural. If it follows method calls, it is called inter-procedural.

• Executable or non-executable. When the resulting program slices are working, compilable programs, they are called executable program slices.

One way to implement a program slicer is by building a dependency graph. This graph contains a node for each statement and edges to indicate data and control dependencies. Once the graph has been created for a program, a program slice can be derived from it by performing a reachability analysis starting at a slice criterion. Another way is on-the-fly program slice generation. In this method, statements are processed one by one starting at the slice criterion, and for each statement it is decided whether it belongs in the slice or not, considering data gathered during the code traversal.
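The dependency-graph method can be sketched as follows. The toy program, its statement numbering and its dependency edges are invented for illustration; a real slicer derives the edges from control flow and variable def-use analysis:

```python
# Backward static slicing by reachability over a dependency graph.
# Toy program (numbering and dependencies invented for illustration):
#   1: n = read()
#   2: i = 1
#   3: total = 0
#   4: prod = 1            # irrelevant to the criterion below
#   5: while i <= n:       # controls 6 and 7
#   6:     total = total + i
#   7:     i = i + 1
#   8: print(total)        # slice criterion: (line 8, {total})
DEPS = {
    1: set(), 2: set(), 3: set(), 4: set(),
    5: {1, 2, 7},       # loop condition reads n and i
    6: {3, 5, 6, 7},    # data deps on total and i, control dep on 5
    7: {2, 5, 7},
    8: {6},
}

def backward_slice(criterion):
    """All statements the criterion directly or indirectly depends on."""
    slice_, worklist = set(), [criterion]
    while worklist:
        stmt = worklist.pop()
        if stmt not in slice_:
            slice_.add(stmt)
            worklist.extend(DEPS[stmt])
    return slice_

print(sorted(backward_slice(8)))  # [1, 2, 3, 5, 6, 7, 8]; statement 4 is sliced away
```

The reach analysis is a plain worklist traversal; statement 4 does not influence the criterion and is excluded, which is exactly the reduction in reading effort slicing aims for.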

Program slicing has several applications, among others:

• Software debugging. Debugging a large program can be a time consuming task. Backwards slicing a program's source code on a statement computing an erroneous value for a variable yields only the statements that influence the construction of this value. This can significantly reduce the search space, and thus allows the programmer to solve the problem faster.

• Software comprehension. Slicing can be used to isolate certain aspects of a program so these can be understood more easily.

• Software maintenance. Forward slicing can be used to gain insight into how a variable is used after its construction or manipulation at the slice criterion. This shows which parts of the program are affected by making a certain change, and which parts thus require retesting.

Programming constructions that require special attention when using program slicing are, among others:

• Pointer and composite variable types. A pointer can dereference multiple variables during its lifetime, depending on the address value it contains at each moment of program execution. Composite variable types, like unions, can have multiple names for the same variable. To create accurate program slices, it must be known what each pointer address and composite variable name actually refers to.

• Multithreading. Threads are separate paths of execution in a program, and statements belonging to one thread may never be reached by following a control flow starting at a statement in another thread. However, one thread can influence variables also used by another thread, and thus can contain statements that influence the slice criterion and belong in the produced program slice. To create accurate program slices in the presence of multithreading, this behaviour has to be taken into account.

Database reverse engineering

This chapter is partly based on [14, 10, 9, 1, 42].

The creation of an information system's database can be characterized as passing through three phases. In the first phase, the functionality of the system is established. Together with this, a data model is created. This model describes objects of interest, the properties these objects must possess, the manipulations that will be performed on these objects and how data will flow between them. This model is called the conceptual schema. Objects described in this schema are organizational objects familiar to the application user. No database concepts or technology are used in this schema. An entity-relationship diagram is an example of a conceptual schema.

In the second phase, a logical schema is devised from the conceptual schema. This schema is a design of the database's data structures, expressing the organizational elements from the conceptual schema, without yet selecting a database technology and without yet making database implementation specific decisions. The schema may, however, be influenced by a class of database technology1.

Finally, in the third phase, the physical schema is developed, based on the logical schema. This schema implements the structures defined in the logical schema, with implementation specific choices made and optimizations performed for a particular, selected database system. This may mean that, for example, relationships are expressed using foreign keys, many-to-many relationships are implemented using intersection tables, data may be denormalized for better performance, and data fields may be converted or encoded into datatypes taking expected usage into account.

1 Today virtually all databases use the relational model, so the logical schema may be prepared to fit this

Figure 3.1: Conceptual (a), logical (b) and physical (c) schemas of a medical database.

An example of these three schemas is shown in figure 3.1. In the conceptual schema, two organizational entities and a relation are displayed. In the logical schema, the Patient entity is expressed in relevant data structures, and for clarity, the Treatment entity is omitted. Finally, the physical schema shows the database implementation of the logical data structures. Note that Patient Number is duplicated here as an optimization.

Note that the abstract conceptual schema gets encoded in the less abstract logical schema. After this, the logical schema gets encoded in the non-abstract, concrete, physical schema. In these translation steps, abstract and semantic information about the system is lost.

Reverse engineering an information system's database is the process of recovering the previously described schemas, starting from incomplete or obsolete descriptions of the system, or even no description at all.

First, the physical schema must be recovered. This can be done by analyzing the DDL2 code of the database or, if that is not available, the structure of the database's tables. Explicit data property definitions, such as primary and foreign keys, unique constraints and mandatory fields, can be recovered this way. Since the DDL code or table structure of the database is an exact representation of the physical schema, this conversion is lossless. Note, however, that this does not mean the resulting schema will always be complete. There may be data constraints enforced in the source code of an application that are not made explicit in the physical schema. Also, if the database technology used is old, it may not support all constraints required to completely implement the logical schema3.

The second step is recovering the logical schema. Partly, the logical schema can be generated by transforming the physical schema so that database specific constructions and optimizations are removed from it. Some parts of the schema may be difficult to recover. For example, parent-child relationships may exist in the logical schema but may never have been made explicit in the physical schema. Also, names may need to be converted from abbreviated machine names to longer, more descriptive logical names. Recovering these constructions may require analysis of, among others, the program code, database contents, screen layouts and report structures. Since abstract information is lost in the translation of the logical schema to the physical schema, the translation back may not be complete. Some constructions that do exist may not be found, resulting in a false negative or silence, and constructs believed to be found may actually not exist, resulting in a false positive or noise.
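The mechanical part of this second step, stripping database specific constructions from a recovered physical schema, can be sketched as below. The table names, fields and the "denormalized" flag are hypothetical and not taken from the QUIRT system:

```python
# Sketch: derive a logical-schema candidate from a recovered physical schema
# by dropping implementation details (storage types, denormalized duplicates).
# All names and flags below are invented for illustration.
physical = {
    "PATIENT": {
        "PATNR":   {"type": "CHAR(8)",  "denormalized": False},
        "NAME":    {"type": "CHAR(40)", "denormalized": False},
    },
    "IMAGE": {
        "IMGID":   {"type": "LONG",     "denormalized": False},
        "PATNR":   {"type": "CHAR(8)",  "denormalized": False},
        "PATNAME": {"type": "CHAR(40)", "denormalized": True},  # copy of PATIENT.NAME
    },
}

def to_logical(physical_schema):
    """Keep entities and attributes, drop storage types and denormalized copies.

    The harder part of this step -- recovering implicit foreign keys and
    descriptive names -- requires analysis of program code and data.
    """
    return {
        table: [col for col, props in cols.items() if not props["denormalized"]]
        for table, cols in physical_schema.items()
    }

print(to_logical(physical))
# {'PATIENT': ['PATNR', 'NAME'], 'IMAGE': ['IMGID', 'PATNR']}
```

As the docstring notes, only the lossless, rule-based direction is automated here; the lost abstract information must come from the other sources discussed in this chapter.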

Finally, the conceptual schema must be recovered. In this schema, the organizational elements and their intended meanings are expressed. Constructs that must be found include, among others, relationship types, multivalued attributes and optional attributes, which can then be described using the terminology of the organizational context the application resides in.

2 Data Definition Language

3 The DB3 database system used in the legacy system analyzed in this thesis does not support foreign keys for

Attributes to recover

Attributes not explicitly stated in the DDL code or table files are called implicit attributes. These implicit attributes must be uncovered during the reverse engineering effort. Some implicit attributes are, among others:

• Meaning of column structure. A single column can store multiple values in a structured way.

• Foreign keys. Each value of a column is interpreted as a reference to another record in the same or in another table.

• Functional dependencies. The meaning of the value of the column depends on the value of another column.

• Value domains. A limited set of values each have a specific defined meaning. An example is an enumeration.
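Detection of some of these implicit attributes can be mechanized with simple heuristics. As an illustrative sketch, not part of the tooling described in this thesis, candidate implicit foreign keys can be found by matching column names against the primary key columns of other tables; all table and column names below are invented:

```python
# Hypothetical sketch: find candidate implicit foreign keys by matching
# column names against other tables' primary key columns. Table and column
# names are invented for illustration.
def candidate_foreign_keys(tables):
    """tables: dict mapping table name -> {"pk": str, "columns": [str]}"""
    candidates = []
    for name, info in tables.items():
        for col in info["columns"]:
            for other, other_info in tables.items():
                if other != name and col == other_info["pk"]:
                    # col in `name` looks like a reference into `other`
                    candidates.append((name, col, other))
    return candidates

schema = {
    "patient": {"pk": "patient_id", "columns": ["patient_id", "name"]},
    "visit": {"pk": "visit_id", "columns": ["visit_id", "patient_id", "date"]},
}
print(candidate_foreign_keys(schema))  # → [('visit', 'patient_id', 'patient')]
```

Such name-based heuristics produce both noise and silence, which is exactly why the results must be cross-checked against the other information sources listed below.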

Information sources

To recover database information, the following data sources, among others, can be analyzed:

• Physical database schema. Names of columns in the schema may indicate relationships or organizational constructions.

• Database data. The database's content can be analyzed to find values that may indicate relationships or membership of certain value domains, expressed in lookup tables.

• Program source code. The program's source code can be analyzed to extract database attributes, field meanings and structure, relations and value domains.

• Program screens or reports. These are user oriented views on the data, with descriptions using terminology from the conceptual domain.

• User interviews. Users of the software may be able to recognize, explain and clarify certain constructions.


Grammar

Introduction

This chapter describes the development of a context free grammar for the Delphi language in Rascal MPL. This context free grammar is needed to generate parse trees of Delphi source code, and the program slicer will use these parse trees as input.

To build a grammar, Lämmel and Verhoef [24] propose the following approach:

1. Raw grammar extraction from a language reference, a compiler or another artifact.

2. Resolution of static errors such as unconnected non terminals, also called sort names, if the grammar is extracted from a non-executable source.

3. Extraction or definition of lexical syntax.

4. Test-driven correction and completion of the raw grammar if necessary.

5. Beautification.

6. Modularization.

7. Disambiguation if necessary.

8. Generation of a browsable version of the grammar if needed.

9. Adaption of the grammar for the intended purpose.

This generic approach seems well suited to our problem, and Rascal MPL supports test-driven development by offering test specific functionality. However, the approach as printed above does not have an iterative refinement cycle. Apparently, the steps are expected to be distinctly separated pieces of work that have no interaction with each other. When using an already perfect and complete grammar as source material, and well tested tools to perform a routine conversion, this may be the case. The conversion performed in this case, however, is expected to use less than perfect and complete source material, so the approach is modified as follows:


1. Raw grammar extraction from a language reference, a compiler or another artifact.

2. Resolution of static errors such as unconnected non terminals, also called sort names, if the grammar is extracted from a non-executable source.

3. Extraction, definition or correction of lexical syntax if necessary.

4. Correction and completion of the raw grammar if necessary.

5. Creation and correction of abstract syntax tree if necessary.

6. Beautification if necessary.

7. Disambiguation if necessary.

8. Execute continuously expanding test suite and return to step 3 as long as required.

9. Modularization.

10. Generation of a browsable version of the grammar if needed.

Step 8 now executes a test suite, and will return to step 3 as long as the tests do not execute perfectly yet. The test suite will start small, and will grow as the development effort on the grammar progresses, until the source code of the systems of interest will parse properly. Also, step number 9 in the first list, “Adaption of the grammar for the intended purpose”, is interpreted here as “Creation and correction of abstract syntax tree if necessary”. Additionally, this step is now placed as number 5, so it will be a part of the iterative development and test cycle. Finally note that the last two steps in the list are emphasized now. These steps are not essential to the project although they are nice to have. So, these steps have a lower priority in the project, and may be performed later if time permits.

Raw grammar extraction

The company now developing Delphi1 unfortunately does not ship a grammar description with their language documentation anymore. The last version they officially shipped is part of the Delphi Language Guide [5], which dates back to 2002 and describes Delphi version 5. Since then, versions 6, 7, 2007, 2009, 2010, XE1, XE2 . . . XE7, XE8 and most recently version 10 “Seattle” were released, which makes this version 5 documentation very outdated, especially because Delphi seems to regularly expand the grammar with new constructs [30].

Since Delphi does not seem to be a very popular language any more [8, 7, 31], third party support is expected to be relatively low. This expectation is reflected in the quality and quantity of third party grammars that can be found for the language. Rascal MPL comes out of the box with several grammars, but not one for Delphi. Rascal MPL's predecessor SDF has a grammar which seems to be strongly based on the original Delphi Language Guide [5]. Rascal MPL competitors like Semantic Designs offer a Delphi version 6 grammar for their DMS® Software Reengineering Toolkit™ [20], which is also very old, and ANTLR [18] does not offer a Delphi grammar at all, only a pure Pascal grammar [17], which is not object oriented. The open source project that develops an Object Pascal compiler called “Free Pascal” [19] does not include a grammar in their otherwise extensive and impressive documentation.

Grammars not associated with a project or tool on the internet seem to be mostly based on the Delphi Language Guide. One project called “DGrok Delphi Grammar” [34] however, documents a grammar developed for a personal project where a Delphi parser was written from scratch in C#. The grammar documented in this project seems to be based on the Delphi Language Guide with newer features reverse engineered into it. Since this appears to be the latest and most complete grammar available, with last updates performed on it in 2007, it is selected as a basis to start from.


Testing the grammar

When testing a grammar, the following questions can be asked:

• Is the grammar describing the intended language?

• Is the structure of the grammar, reflected in the resulting parse trees, suitable to perform the intended work?

The first question immediately results in a problem. Since there is practically no recent official description of the Delphi language available anymore, there is no conclusive way to test this. The only test available is the Delphi compiler itself, which can act as a “Boolean Oracle”, giving one of two results after compiling code: “Yes”, the program compiles and thus the language is proper Delphi, or “No”, something is wrong and maybe the language is not proper Delphi. When the code compiles properly, it obviously is acceptable Delphi language. When it does not compile, however, it does not always mean it isn't acceptable Delphi language. A compiler checks many other things than syntax alone and may fail on those: availability of used variables and type incompatibilities, for example. Practical tests are thus to be found in the size and grammar variation of existing, properly compiling code. If a piece of code is large and has a lot of variation, it's a more suitable test case. If it's small and not very varied, it's less suitable.

The second question has a more satisfying and conclusive answer. Since the raw grammar used is based on a grammar described in an old but official Delphi Language Guide, the grammar is proven to be good enough to be used to compile code. Compiling code needs all the details a program slicer also needs, namely program flow and variable names, types and locations. This provides confidence that the grammar's productions are well suited for building such functionality.

Test framework

To practically test the produced grammar constructions, a test framework is created. This test framework generates test methods that can be executed in Rascal MPL. The test framework's code generator searches in a specified base directory for subdirectories. These subdirectories have names that equal production names in the grammar. In these directories, there are zero or more files that contain test sentences for this particular production. The test sentences start with success: or fail:, to indicate that the sentence should succeed or fail respectively. This system makes it easy and quick to add and modify test sentences. An example generated test method together with accompanying test sentences is shown in appendix B. The actual test methods generated by the test framework can be altered easily as well and will perform three tasks:

• The first task is to parse the sentence and to check if this succeeds or fails. The result is then compared to the result requested by the test sentence. If the test sentence fails because the results are not equal, an error message is shown together with the offending sentence and its index number. If the test succeeds nothing is printed and the test just continues.

• The second task is to scan the resulting parse tree for ambiguous clauses. If these are present, the parsing of the sentence resulted in multiple parse trees, and this is a problem that has to be corrected. An error message will be printed informing the user that an ambiguously parsed sentence has been found, including the sentence and its index number.

• The third task is the generation of the parse tree for a particular sentence of interest. This is an interactive task. This provides an easy method for the user to quickly generate a parse tree for an interesting test sentence, so it can be inspected or visualized2.
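The directory and prefix convention described above can be sketched in a few lines. The following fragment is illustrative only; the actual framework generates Rascal MPL test methods rather than Python data, and the directory layout below is an assumption based on the description:

```python
import os

# Hypothetical sketch of the test-sentence convention: each subdirectory of
# base_dir is named after a grammar production and contains files whose lines
# start with "success:" or "fail:". We collect (production, expected, sentence)
# triples; the real framework would emit one Rascal MPL test method per triple.
def collect_test_sentences(base_dir):
    cases = []
    for production in sorted(os.listdir(base_dir)):
        prod_dir = os.path.join(base_dir, production)
        if not os.path.isdir(prod_dir):
            continue
        for fname in sorted(os.listdir(prod_dir)):
            with open(os.path.join(prod_dir, fname)) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith("success:"):
                        cases.append((production, True, line[len("success:"):].strip()))
                    elif line.startswith("fail:"):
                        cases.append((production, False, line[len("fail:"):].strip()))
    return cases
```

A directory `IfStatement/` containing the line `success: if a then b;` would thus yield one positive test case for the `IfStatement` production.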

2Note that this functionality is not shown in appendix B. The result of a parsing is a parse tree, while a test


Writeln(’The answer of the problem: 27/5 is: ’, 27/5:3:13);

Figure 4.1: Delphi number notation including precision descriptors.

Rascal MPL supports test driven development by offering the option to prefix a regular method with the keyword test. If the command :test is then issued from a Rascal MPL command line, all the loaded methods tagged with test are executed in a random order and the results of the methods are highlighted in the source code editor. The test method template used by the test framework includes this test tag in the generated method headers.

Lexical syntax

To start working with a grammar, the lexical syntax must first be known. Unfortunately, no grammar or document available at this point specifies the lexical details of Delphi, so these have to be devised based on descriptions in the Delphi Language Guide [5], the Free Pascal Reference Guide [6] and grammars of other languages which have similarities to Delphi at some points.

After adding test cases for the initial lexical syntax, the cases are run, problems are detected, and modifications are made to the syntax. This process is repeated until the syntax is satisfactory, and work on grammar rules can be started.

Later in the process, when parsing larger pieces of code, the lexical syntax is corrected a couple of times. Subtle mistakes in the recognition of whitespace in combination with comment parts render the grammar ambiguous. As a matter of fact, most of the ambiguity problems encountered in developing the grammar are caused by mistakes in the lexical syntax.

Also, Delphi is richer in notations for numbers and strings than expected beforehand. For example, when calling a method with a number, the precision of the number, when converted automatically to a string, can be notated directly after the number using a colon notation, as seen in figure 4.1. This required a small adaptation to the lexical definition of a number.
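As a rough illustration of this adaptation, the extended number notation could be approximated by a regular expression that allows up to two trailing `:width` specifiers. This regex is only a sketch written here for explanation; the actual definition lives in the Rascal MPL lexical syntax and covers more notations (hexadecimal literals, for instance, are not handled below):

```python
import re

# Illustrative approximation of a Delphi number lexeme, optionally followed
# by up to two ":width" format specifiers, as in the 27/5:3:13 example.
NUMBER = re.compile(r"\d+(\.\d+)?([eE][+-]?\d+)?(:\d+){0,2}")

def is_formatted_number(token):
    return NUMBER.fullmatch(token) is not None

print(is_formatted_number("27"))        # plain integer
print(is_formatted_number("3.14:8:2"))  # real with width and precision
print(is_formatted_number("3:"))        # dangling colon is rejected
```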

Correction and completion of raw grammar

To start using this grammar to parse some actual Delphi code, its productions must be translated to the Rascal MPL language. The DGrok website lists all its productions in an HTML format, of which a sample can be seen in figure 4.2.

Figure 4.2: A Delphi grammar production from the DGrok website.

It is chosen to translate all productions in the DGrok grammar to Rascal MPL syntax one by one by hand. First of all, there are just over 90 rules. That's not enough to justify writing a tool to extract the rules automatically. Secondly, it provides a perfect opportunity to learn the fine points of the grammar, inspect parse trees visually if required and to build an initial set of test cases for each production.


syntax Expr = ...
  > left unaryOps Expr
  > left Expr mulOps Expr
  > left Expr addOps Expr
  > left Expr relOps Expr

(a) Rascal MPL variant: Explicit priorities. A > B means B will not be nested under A.

Expr -> SimpleExp (RelOps SimpleExp)
SimpleExp -> Term (AddOps Term)
Term -> Factor (MulOps Factor)
Factor -> UnaryOps Factor

(b) Original variant: Priorities are expressed implicitly.

Figure 4.3: Rascal MPL expression grammar vs DGrok expression grammar.

if a = 10 then
  if b = 15 then
    c := c + 1
  else
    c := c - 1;

(a) “Else” may belong to if b = 15.

if a = 10 then
  if b = 15 then
    c := c + 1
else
  c := c - 1;

(b) “Else” may belong to if a = 10.

if a = 10 then begin
  if b = 15 then
    c := c + 1
  else
    c := c - 1;
end;

(c) Unambiguous version of figure 4.4a.

if a = 10 then begin
  if b = 15 then
    c := c + 1;
end else
  c := c - 1;

(d) Unambiguous version of figure 4.4b.

Figure 4.4: Fragments of Delphi code posing ambiguous else statements and their respective unambiguous versions.

For most productions, the translation process is straightforward. Some productions, however, can be implemented more cleanly using Rascal MPL language features. For example, the expression grammar can be implemented using priorities in a single production rule, instead of the “leveled” set of production rules found in the original grammar, as shown in figure 4.3.

Other productions have problems mainly because they are ambiguous. A well known ambiguity that may occur in a language, and does occur in parsing Delphi using the DGrok grammar, is called the “dangling else problem”3. Consider the fragments of code in figure 4.4a and 4.4b, which are identical except for the indentation of the else clause. To which if statement does the else clause belong? It can belong to the ’if a = 10’ statement, but it can also belong to the ’if b = 15’ statement. The semantics of the two alternatives are obviously not the same. When this code is parsed and the grammar does not take this problem into account, the parse tree will contain both these two alternatives as shown in figure 4.5. These ambiguities can be detected by visually inspecting a parse tree, or by searching programmatically for ambiguous clauses in the parse tree. Since the test method checks for this, an error is indeed produced on nested if statements.

Delphi was not always vulnerable to the “dangling else” problem, however. The original compiler mandated the use of begin and end keywords in the top clause of an if statement. The code from figures 4.4a and 4.4b would then have looked like the code in figures 4.4c and 4.4d respectively. This problem is the result of a relaxation of the rules of the language. It is speculated that this is a response to keep Delphi compatible with Lazarus code. Lazarus [11] is an open source Delphi alternative based on Free Pascal [19] that allows this type of if statement.

3The DGrok grammar was never intended to be used as a context free grammar. It's a document describing the constructions implemented in a parser written from scratch in C# for a personal project. Because of this, the fact that this grammar displays dangling else behaviour does not automatically mean that the DGrok parser itself has dangling else problems as well.


(a) Ambiguous parse tree of figure 4.4a, which contains two if statement branches.

(b) Non-ambiguous parse tree of figure 4.4c.

Figure 4.5: Two parse trees of code fragments from figure 4.4. The colored dot at the top in the left figure denotes an ambiguous tree node with two alternatives.

if (a = 10) then
  b := b + 1
else
  b := b - 1;

(a) A single Delphi statement.

if (a = 10) then
  c := c + 1;
else
  c := c - 1;

(b) A faulty Delphi statement.

case variable of
  '1': if (a = 10) then
         b := 10;
else
  c := 10;

(c) Case statement with default clause denoted by else.

case variable of
  '1': if (a = 10) then
         b := 10
       else
         c := 10;

(d) Case statement with clause that ends with if else statement.

Figure 4.6: Fragments of Delphi code posing various problems.

A final problem encountered in translating the grammar rules discussed here is a subtlety in the anatomy of if statements. In Delphi, the code shown in figure 4.6a is considered to be a single statement, from a grammar point of view. Note that there is only a single semicolon terminating the last assignment. More intuitive may be the code in figure 4.6b, which uses a semicolon to terminate each assignment. This code is not valid Delphi, however. The first version of the grammar rule describing the if statement did accept these incorrect statements though, as it seemed that accepting superfluous semicolons would provide a higher fault tolerance.

The consequence of making this choice becomes apparent when parsing the case statement, which is known as a “switch” statement in C/C++ and Java. The code for the distinct clauses is not fenced with begin and end keywords, and the “default” clause of this statement is denoted with an else keyword at the end of the statement, as shown in figure 4.6c. In figure 4.6d, a case statement without a “default” clause is shown. Note that the code fragment in figure 4.6c is identical to the code fragment in figure 4.6d, except for the semicolon terminating or not terminating the line b := 10. If every assignment could have its own semicolon here, however, these case statements would become ambiguous. It could not be determined anymore whether the else clause is the default clause of the case statement, or whether it belongs to the if statement instead.

Abstract syntax tree

The parse tree generated by the parser is a concrete syntax tree. This is a one-to-one application of the context free grammar rules on a fragment of source code. The concrete syntax tree contains, for example, all lexical and layout symbols specified in the context free grammar. If needed, the concrete parse tree can be unparsed to the original source text. This is an important feature when performing source code to source code transformations. Since this grammar is built to serve as the foundation of a program slicer, however, unparsing source code is not required. So, the concrete syntax tree can be converted to a simplified and easier to work with abstract syntax tree. This


abstract syntax tree has the layout symbols, such as spaces and comments, and the keywords removed from it. Remaining terminal symbols are converted to Rascal MPL string and number types, and non-terminals are converted to Rascal MPL patterns.

To generate the abstract syntax tree and to test its rules automatically when running the test suite, the test methods of the test framework are adapted to convert the concrete syntax tree to an abstract syntax tree after successful parsing of the test sentences and to report the result of the conversion to the user if it fails.
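The conversion can be illustrated with a small sketch. The node shapes below are invented for demonstration purposes only; Rascal MPL performs the equivalent operation on its real parse trees:

```python
# Illustrative concrete-to-abstract conversion: layout and keyword nodes are
# dropped, lexical leaves become plain strings, and remaining children are
# kept under the production's name. Node shapes are invented for this sketch.
def to_ast(node):
    kind, *children = node
    if kind in ("layout", "keyword"):
        return None  # dropped entirely from the abstract tree
    if kind == "lexical":
        return children[0]  # keep only the matched text
    kept = [to_ast(c) for c in children]
    return (kind, [c for c in kept if c is not None])

cst = ("IfStatement",
       ("keyword", "if"),
       ("layout", " "),
       ("lexical", "a"),
       ("keyword", "then"),
       ("lexical", "b"))
print(to_ast(cst))  # → ('IfStatement', ['a', 'b'])
```

The result keeps only the structure the slicer needs: the production name and the meaningful children, without whitespace, comments or keywords.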

Beautification

Rascal MPL uses user-definable pretty names for all productions, which are stored in the parse tree. These names can then be used by software working on the tree for pattern matching and visiting. Some time is spent giving clear names to productions and cleaning up temporary ones that make no sense. Also, the coding style is found to be inconsistent at this point. The coding style seen in the Java grammar of Rascal MPL is adopted.

Expanding test cases

To improve the quality of the grammar, tests on larger and more diverse fragments of code must be performed. For this purpose, a set of methods is extracted from a real program. These methods are small enough to debug quickly, and large enough to contain more interesting and diverse constructions.

A collection of slightly more than 420 methods, all the methods of a clinically used program, is extracted and converted to separate test sentences. A problem is that this production code, like a lot of larger programs, contains a lot of unprocessed conditional compiler switches. Some compilers, such as most C or C++ compilers, have an option to only preprocess source code and to output the result of this. The Delphi compiler unfortunately does not offer this functionality, and the Free Pascal compiler doesn't either. When searching for a third party Delphi preprocessor4, some tools are found [22, 3, 39], but none of these seem to actually succeed in preprocessing a large piece of code. After preprocessing the Delphi code with a self constructed preprocessor, and using a script to extract the methods, these are stored in the test framework sentence set. Then the test-develop iteration cycle is continued.
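To give an impression of what the self constructed preprocessor has to do, a minimal sketch handling only `{$IFDEF X}`, `{$ELSE}` and `{$ENDIF}` could look as follows. The real Delphi directive set is considerably richer (`{$IFNDEF}`, `{$DEFINE}`, nested expressions, and so on), so this is illustrative only:

```python
# Minimal sketch of a conditional-compilation preprocessor. It keeps a stack
# of "currently emitting?" flags, one per open {$IFDEF}; a line is emitted
# only when every enclosing conditional is active.
def preprocess(source, defined):
    out, stack = [], []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("{$IFDEF"):
            symbol = stripped[len("{$IFDEF"):].strip(" }")
            stack.append(symbol in defined)
        elif stripped.startswith("{$ELSE"):
            stack[-1] = not stack[-1]  # flip the innermost conditional
        elif stripped.startswith("{$ENDIF"):
            stack.pop()
        elif all(stack):
            out.append(line)
    return "\n".join(out)

code = """a := 1;
{$IFDEF DEBUG}
Log(a);
{$ELSE}
b := 2;
{$ENDIF}
c := 3;"""
print(preprocess(code, {"DEBUG"}))  # keeps the Log(a); branch
```

Running the same input with an empty symbol set keeps the `{$ELSE}` branch instead, which is exactly the behaviour needed to obtain parseable, switch-free test sentences.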

A lot of methods do not parse yet. Problems arise mainly from the diversity of the new test code and from problems in method headers. Some NKI/AVL code was maintained or developed for a while by people preferring Lazarus [11], and small syntactical differences between the Delphi grammar and the Free Pascal grammar used by Lazarus are noticeable. Most prominently, as stated earlier, Lazarus seems to be a bit more relaxed in how if statements can be constructed, allowing the programmer to omit most of the begin and end keywords, which once used to be mandatory in Delphi.

The Free Pascal compiler comes with its own set of test cases to test the compiler. It seems attractive to also use these test cases on the grammar. Unfortunately, these test cases seem to be aimed mainly at code generation and semantic tests, i.e. tests to check if certain optimizations are performed by the compiler and if private variables in classes are (un)reachable. From a grammar point of view these tests don’t seem to be very interesting. Since these tests also include a large amount of test cases aimed at Free Pascal specific language extensions, it is decided to not use these test cases.

When all methods from the test set parse correctly, a complete program can be tried. The first program selected is a data access module, amounting to 10 kloc. Since this is the first complete

4Delphi's preprocessor is limited to enabling or disabling compiler settings and conditionally compiling blocks of


program loaded, problems arise mainly in the framing5 of the program and in its class definitions. Each time a problem is found, the size of the program is reduced to a minimal fragment of code that still produces the problem. Then the grammar is corrected, and the code fragment is added to the collection of test cases. Larger and larger programs are tested until real programs up to 120 kloc parse correctly.

One particular problem encountered during the testing of these larger programs is that some include methods that use inline assembly instructions to speed up certain algorithms. Since this happens in a lot of places in the code, the syntax for inline assembly is added to the grammar6. Another problem encountered in these programs are generic datatypes, which are known as templates in C++. This is a relatively recent addition to Delphi, and it is not supported in the DGrok grammar. Because of the notation used, in which the datatype of the generic variable is placed between angular brackets, just as used in a boolean expression, they are by design ambiguous at certain places. One has to know the meaning of a variable to know if a variable is a generic or just part of a “normal” expression, as shown in figure 4.7.

result := method(a<b, c>(d));

Figure 4.7: Ambiguous generic. Does method() have two arguments a < b and c > (d), or a single one, which is a method call a() with generic types b, c and single argument d? Without knowing what a is, it cannot be known definitively.

Knowing what exactly a in figure 4.7 is adds context, and thus is beyond the context free grammar. Because this particular ambiguous generic case does not occur that often in the test code, support for generics is added without fixing this particular problem. Test programs using this are modified so they can be parsed.

Finally, the target programs of interest, approximately 60 kloc each, are parsed without problems.

Finding really large programs in Delphi turns out to be a problem. Not many open source projects use Delphi7, and if they do, the code bases are not really big, say nearing or surpassing 1 Mloc. Most programs found are smaller tools up to 10 kloc. A large project is the Free Pascal compiler, but that is written and compiled in Free Pascal, so it uses their specific language expansions, and thus does not parse with our grammar.

Production usage

To get an indication of the diversity of the test cases used, the percentage of used productions can be calculated. By adding the sort clause names8 used in the resulting parse tree to a set and subtracting this set from the set of all production names in the grammar, a set of unused production names and thus a usage ratio is created. In an ideal test, 100% of the productions in the grammar is used. Real target programs typically use 80% of the productions described, however. Unused productions are arguably mainly newer additions that may be just unpopular, like variant records, too modern for the older software that is parsed here9, like helper classes, arcane oldschool, like inline assembly language, or just not needed in this particular case, like constructions specifically aimed at creating packages or console applications. The large 120 kloc application defines a lot of interfaces but does relatively little diverse work, using only 60% of the available productions. Measuring on this diversity aspect, the smaller 60 kloc program is more interesting.
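The measurement described above amounts to a set difference. A small sketch, with parse trees modelled as nested tuples purely for illustration:

```python
# Sketch of the production-usage measurement: collect production names
# occurring in parse trees, subtract them from the set of all grammar
# production names, and report a usage ratio plus the unused set.
def used_productions(tree):
    name, *children = tree
    used = {name}
    for child in children:
        if isinstance(child, tuple):
            used |= used_productions(child)
    return used

def usage_ratio(grammar_productions, trees):
    used = set()
    for tree in trees:
        used |= used_productions(tree)
    unused = grammar_productions - used
    return len(used & grammar_productions) / len(grammar_productions), unused

grammar = {"Block", "IfStatement", "Assignment", "CaseStatement"}
tree = ("Block", ("IfStatement", ("Assignment",)))
print(usage_ratio(grammar, [tree]))  # → (0.75, {'CaseStatement'})
```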

Counting used productions does have its limits. Although it tells something about which production rules are used, it does not tell anything about how these rules are used. If a rule

5This includes for example the declaration of the module type and the interface and implementation section.

6Although the assembly language is technically parsed, the resulting parse tree is not very useful for analysis,

because the mnemonics and their arguments are just treated as strings, and not much structure is added.

7It is speculated that this is logical, because most open source projects use open source compilers and tools.

8In Rascal MPL, this equals the grammar production name

9In a hospital, the latest compilers are never used, because stable, proven versions are required. This particular


is composed so that it has multiple optional parts, and the same configuration of optional parts is constantly used, not everything in the rule is tested. Also, due to the type of counting used, productions with multiple alternatives having a single name are counted as “used” already if only one of these alternatives is used. Taking this into account, the actual usage number must be lower, and the number calculated here is nothing more than an indication that can be used to compare relative diversity among test cases.

When using this grammar on other code, however, there is a chance that problems appear and must be fixed, because the grammar is not tested with 100% coverage. In a way, these test cases with code from this hospital, produced by a limited set of programmers, introduce a certain amount of tunnel vision.

Conclusion

A grammar was developed to parse Delphi code in Rascal MPL to serve as a basis for a program slicer. An old grammar was used as a starting point. Testing was done mainly by parsing actual target code and artificial test cases. Test program size was 120 kloc max. Target test programs have sizes around 60 kloc each. Production usage is around 80% for 60 kloc test cases. The grammar is included in Appendix A. Supporting source code and its repository location is described in Appendix F.


Program Slicer

Introduction

This chapter will describe the construction of a program slicer, working on abstract syntax trees generated using the grammar developed earlier. A program slicer takes as input a source program, a location in that program and a set of variables. The last two items together are called the slice criterion. Using this information the program slicer will produce a program slice, or a subset of the program, which contains only the statements of the program which influence the variables specified in the slice criterion, observed from the location also specified in the slice criterion.

There are multiple types of program slicer algorithms available. According to [32, 2], program slicers can be classified into main categories, which, among others, are:

• Static or dynamic. A static program slicer uses only the source code of a program to base its slicing decisions upon, as opposed to a dynamic program slicer, which also uses a program's input data. An advantage of using input data is that the produced program slices usually become smaller, because statements in branch paths or loops not taken for the particular case using this input data can be left out of the slice.

• Forward or backward direction. Forward or backward describes in which direction the program slicer traces through the program to find statements that lead to influencing the slice criterion or that will be influenced later by the slice criterion. If a program slicer is aimed at debugging, for example, the user will want to know what statements led to a certain current situation in the program. Thus the slicer has to trace backwards to find these statements. If, in another scenario, a change is made to a statement in a program and the user wants to know which other statements of the program are going to be affected by that change, a forward slicer can be used to find these.

• Intra-procedural or inter-procedural. Intra-procedural slicing considers only statements contained within a single method, and will not trace outside this method. Inter-procedural slicing also takes method calls into consideration.

• Executable or non-executable. Executable program slices are, as the name implies, complete, compilable and runnable programs, isolating a subset of an original program. Note that this means that called methods in the program slice may have to be duplicated. In figure 5.1, a main method calls a sub method from two locations. The path traversed through the sub method may be different for each call, and combining these two paths in a single method may not necessarily result in an executable method. Non-executable program slices just combine these statements.

To use the program slicer in reverse engineering to retrieve database information, all code influencing a database statement must be considered. If code is not used for one particular set of



Figure 5.1: Main method calling a sub method from two locations. Each call may traverse different paths through the sub method, resulting in a different set of used statements.

input data, it may very well be important for another set of input data, i.e. this unused code may contain information on how a database field is built or used. Because of this, input data must not be considered, and the slicer must perform its analysis on static code alone.

From a database point of view, a backward traced program slice corresponds with a database write operation. A value is first constructed, then processed and finally it is stored in the database. A backward traced slice shows this value construction and processing, and the semantic information on the database field that it may contain. A forward slice corresponds with a database read operation. The value is loaded from the database and then handled further in code. A scan of the database usage in the test programs shows that results of database read operations are used without a lot of processing. Many values are just shown in a GUI with minimal processing, or they are used in preparation of a database write operation. Since program slicing these GUI associated operations won't reveal a lot more about the system beyond just opening the screens in the GUI designer, and since database reads performed in preparation of database writes are shown in the same program slice because their variables influence each other, it seems to be more interesting to perform backward tracing program slicing on database write operations.

Because the test programs are written in an object oriented language and consist of multiple methods, inter-procedural slicing must be used here.

Since the program slices produced here only need to be analyzed to recover database information, they do not have to be executable per se. The problem of single methods in a slice showing multiple merged code paths is estimated to be serious only for methods that have many callers. In a database application, these may be the methods that form a DAM, or Data Access Module. A scan of the test programs shows that the methods using the database mainly carry names indicating a specific functional task rather than a generic, supporting one. Based on this, it is estimated that this problem will be non-existent or small. If this turns out not to be the case, the program slicer can later be adapted to duplicate methods that are called from multiple sites.

In conclusion, the best choice seems to be to build a static, backward tracing, inter-procedural, non-executable program slicer to support this effort.

Program slicer construction

Surveying [32, 27, 16, 23], it is learned that a program slice can be computed by performing the following steps:

1. Create control flow graph.
2. Extract variable usage per statement.
3. Create method calling contexts.
4. Create system dependency graph.
5. Traverse system dependency graph using slice criterion and produce actual program slice.


Since Delphi is an object oriented language, an extra step will be added here. A problem is that symbols, i.e. variables or methods, with the same name can be referenced from multiple places, like for example class methods, and actually mean different things based on their location. To avoid naming problems, a symbol table will be built with unique symbol names. Symbols used in the slicer's data structures will be replaced by these unique names.

To assess the quality of the produced program slices, a visualization must be made of each slice, so it can be compared to the original source code to determine whether all the statements that need to be in the slice actually are there. Superfluous statements, which should not be in the slice but actually are, can also be detected this way. Following this visual inspection, hypotheses can be formed about adding or removing certain statements.

When inserting additional steps to address these problems, the list looks as follows:

1. Extract and create unique symbol names from abstract syntax tree.
2. Create control flow graph.
3. Extract variable usage per statement.
4. Resolve variable names per statement.
5. Create method calling contexts.
6. Create system dependency graph.
7. Traverse system dependency graph using slice criterion and produce actual program slice.
8. Visualize slice.

Step 1 is added to handle unique symbol name extraction and creation, step 4 is added to resolve non-unique names in the control flow graph to unique names, and finally step 8 is added to visualize the program slices.
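The eight steps can be pictured as a pipeline of transformations over the abstract syntax tree. The skeleton below is a hypothetical sketch, with each step reduced to a trivial stand-in on a dictionary-shaped "AST" so that only the ordering and data flow between steps is visible; none of these functions mirror the actual Rascal MPL code.

```python
# Hypothetical pipeline skeleton; every function is a stand-in, not the
# thesis implementation.

def build_symbol_table(ast):        # step 1: unique symbol names
    return {name: "Impl." + name for name in ast["declared"]}

def build_cfg(ast):                 # step 2: statement -> successor statements
    stmts = ast["statements"]
    return {s: stmts[i + 1:i + 2] for i, s in enumerate(stmts)}

def extract_usage(ast):             # step 3: statement -> variables used
    return ast["uses"]

def resolve_names(usage, symbols):  # step 4: rewrite to unique names
    return {s: [symbols.get(v, v) for v in vs] for s, vs in usage.items()}

# Steps 5-7 (calling contexts, system dependency graph, traversal) and
# step 8 (visualization) would follow the same pattern and are omitted.

ast = {"declared": ["x"], "statements": ["s1", "s2"], "uses": {"s1": ["x"]}}
symbols = build_symbol_table(ast)
resolved = resolve_names(extract_usage(ast), symbols)
print(resolved)  # {'s1': ['Impl.x']}
```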

Unique symbol name extraction and creation

In Delphi, as in virtually any programming language, a variable or method has a scope in which it is known. So, for example, a local variable is only known in the scope of the method in which it is defined, and a private member method of a class is only known in the scope of the class's other member methods. If this private member method shares its name with a global method, the actual method being called thus depends on the scope of the call site. If the call site is another class member method, the member method will be called. If the call site is a non-member method, the global method will be called. To ensure that the correct method will be evaluated by the program slicer in this type of situation, and to remove the need to keep track of scopes when slicing, the applicable scope for a particular method call or variable can be made explicit by replacing the original name of the method call or variable with a new unique name, taking scope priorities into account. The first step in this process is to build a symbol table in which every variable and method receives a unique name. This process is demonstrated in code in figure 5.2. Note that these names will only be applied in the slicer's data structures, and not in the actual source code.

Delphi scope names and their priorities are shown in figure 5.7. The global scope in Delphi actually has two sections: an interface section and an implementation section. These section names are also added to the unique variable names, although mainly for debugging purposes, as the Delphi compiler mandates that names used in the interface section cannot be used in the implementation section and vice versa.

To build a list of unique symbols, a Rascal MPL method is created to extract all declared method and variable names from the supplied abstract syntax tree. Care is taken to make sure the scope of each method and variable is tracked. Then this scope information is added to the name of each method and variable, thus making it unique. Finally the new and old name are added to a list. A few example name translations are shown in figure 5.3.

(a) Using regular method names:

    procedure ProcedureCall();
    procedure AClass.ProcedureCall();
    procedure AClass.Test();
    begin
      ProcedureCall();
    end;

(b) Using unique method names:

    procedure Impl.ProcedureCall();
    procedure Impl.AClass.ProcedureCall();
    procedure Impl.AClass.Test();
    begin
      Impl.AClass.ProcedureCall();
    end;

Figure 5.2: Example of uniquely named methods. In figure (a), ProcedureCall will resolve to AClass.ProcedureCall(), because the scope priority order mandates looking for class methods first. In figure (b), this scope priority is made explicit through a unique name.

    "getarrow"                -> ["Implementation","getarrow"]
    "checkoverlaymixmenu"     -> ["uwmatch","Implementation","checkoverlaymixmenu"]
    "tform1","axialviewclick" -> ["uwmatch","Implementation","tform1","axialviewclick"]

Figure 5.3: Name translation table. Original names on the left, unique names on the right.
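The construction of these unique names can be sketched as prepending the enclosing scope names to each declared symbol, mirroring the entries of figure 5.3. The scope names below ("uwmatch", "Implementation", "tform1") are taken from that figure as examples; the traversal itself is a simplification of the real Rascal MPL extractor.

```python
# Sketch: join the enclosing scopes and the symbol name into one unique
# key, and record old-name -> unique-name pairs in a translation table.

def unique_name(scopes, name):
    """Prefix a declared name with its enclosing scope names."""
    return ".".join(scopes + [name])

translations = {}

def declare(scopes, name):
    """Register a declaration found while walking the AST."""
    translations[name] = unique_name(scopes, name)

# Example declarations, after the entries of figure 5.3:
declare(["Implementation"], "getarrow")
declare(["uwmatch", "Implementation"], "checkoverlaymixmenu")
declare(["uwmatch", "Implementation", "tform1"], "axialviewclick")

print(translations["getarrow"])  # Implementation.getarrow
```

A real implementation would also have to handle collisions between identically named symbols in sibling scopes, which is exactly why the full scope path, rather than a counter, is used as the distinguishing prefix.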

Control flow graph creation

The CFG, or Control Flow Graph, is a base data structure that contains all statements of a program, together with information that specifies the statement(s) that follow it. A statement can have a single next statement, but if a statement describes an if, there may be two next statements. When using a case statement, for example, there may be an arbitrary number of next statements.

The CFG will have special nodes indicating method start and method end points, and statements are grouped by method. An example CFG can be seen in figure 5.4.

Figure 5.4: Simple Control Flow Graph. Nodes contain statements, edges contain control flow. The placeholder indicates the end of a branch statement.
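The shape of figure 5.4 can be sketched as a small successor map: a branch statement gets one successor per branch, and the branches rejoin at a placeholder node before control continues. The node names below are invented for illustration and do not correspond to the thesis programs.

```python
# Toy CFG builder for a body with one two-way branch: each node maps to
# the list of its successor nodes, and both branches rejoin at an
# "endif" placeholder, as in figure 5.4.

def build_branch_cfg(before, then_branch, else_branch, after):
    """Return successor edges for start -> before -> if -> branches -> end."""
    edges = {"start": [before], before: ["if"]}
    edges["if"] = [then_branch[0], else_branch[0]]
    for branch in (then_branch, else_branch):
        for cur, nxt in zip(branch, branch[1:]):
            edges[cur] = [nxt]           # straight-line flow inside a branch
        edges[branch[-1]] = ["endif"]    # both branches rejoin here
    edges["endif"] = [after]
    edges[after] = ["end"]
    return edges

cfg = build_branch_cfg("s1", ["a1", "a2"], ["b1", "b2", "b3"], "s2")
print(cfg["if"])  # ['a1', 'b1']  -- the branch node has two successors
print(cfg["a2"])  # ['endif']
```

A case statement would extend `edges["if"]` to an arbitrary number of successors; the rejoin placeholder works the same way.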
