
Mapping high-level concepts to the source code: a case study

24 August 2014

Student: Huy Hoang
Student number: 10290982
Supervisor: Paul Klint


Table of contents

Abstract
1 Introduction
   1.1 Problem statement
   1.2 Research question
   1.3 How this document is organized
2 Information Retrieval
   2.1 Introduction
   2.2 How it works
   2.3 Mapping documentation to the source code
   2.4 Alternative
   2.5 Motivation for automated mapping
   2.6 Prerequisites
      2.6.1 The thesis author as the human expert
   2.7 Measuring performance
3 IR application
   3.1 Introduction
   3.2 IR application
      3.2.1 IR engine
      3.2.2 Purpose
      3.2.3 Usage
      3.2.4 Implementation
      3.2.5 Pre-processing steps
   3.3 Validation
4 Datasets
   4.1 Introduction
   4.2 Training and test set
      4.2.1 Sample size
   4.3 Preliminary requirements document
5 Validation results
   5.1 Introduction
   5.2 How classification was performed
   5.3 Results
   5.4 Discussion
      5.4.1 Vocabulary mismatch
   5.5 Conclusion
6 Experiment #1: Classification model
   6.1 Introduction
   6.2 Classification model
      6.2.1 Background information
      6.2.2 Mitigating the single LPS value problem
      6.2.3 The model
   6.3 Results
      6.3.1 Training set
      6.3.2 Test set
   6.4 Discussion
   6.5 Conclusion
7 Experiment #2: Validate the classification model
   7.1 Introduction
   7.2 Results
      7.2.1 PRD A
      7.2.2 PRD B
   7.3 Discussion
   7.4 Conclusion
8 Conclusion
   8.1 Introduction
   8.2 Threats to internal validity
   8.3 Threats to external validity
   8.4 Conclusion
   8.5 Recommendation for future work
9 Appendix A: Statistics source control
   9.1 Source control statistics
   9.2 LOC
   9.3 Lucene corpus statistics


Abstract

According to a report by the Standish Group, incomplete requirements are the number one factor for impairing software projects. In the context of reengineering, determining the completeness of the requirements documentation of a redesigned system (RS) can therefore be beneficial. This can be done manually by human experts familiar with the Legacy Information System (LIS), but it is a time-consuming and costly task.

In this case study, we explored the feasibility of an automated approach for determining the completeness of the requirements documentation of a RS. The first step was to make sure that the requirements documentation of the RS can be mapped to the source code of a LIS: the more files of the LIS that can be mapped to existing requirements, the more complete the requirements documentation is. Various Information Retrieval (IR) studies have shown that mapping the documentation of a software system to the source code of that same system is feasible. This case study confirms that the open source IR library Apache Lucene can be used for this purpose, and that it can also be used to map the requirements documentation of the RS to the source code of the LIS. The second step was to create a classification model that can classify requirements as new or existing. For this purpose, we introduced a classification model that uses the ranking score metric generated by Lucene in the first step.

The results revealed that the classification model was able to correctly identify existing requirements with classification rates of 79% and 98%. New requirements, however, were classified correctly in only 21% and 33% of the cases. An automated approach would therefore be feasible only if the requirements documentation of the RS contains existing requirements alone.


1 Introduction

1.1 Problem statement

Our case study revolves around an e-commerce Legacy Information System (LIS) that has been in operation for a little over six years and is now part of an ongoing reengineering project. This master’s thesis originated from a simple question asked by project management at the host organization: how do we know that the redesigned system (RS) covers all the functionality of the LIS? In other words, how complete are the requirements of the RS?

This question matters because the completeness of the requirements documentation of a RS, or of any software system, has a significant impact on software development. In 1995, the Standish Group published a report [1] covering over 8,000 software projects at 365 US companies. 13% of the companies identified incomplete requirements as the number one factor impairing software development, ultimately leading to the cancellation of the software project.

There are three possibilities for determining the completeness of the requirements:

1. Comparing the two systems using two sets of requirements documentation
2. Comparing the functionalities of the two systems based on their implementations
3. Comparing the requirements documentation of the RS with the source code of the LIS

In the current situation, the documentation of both the LIS and the RS is incomplete, and the two sets of documentation do not share any commonalities. This excludes possibility 1. The implementation of the RS will happen in multiple phases, rather than applying a flash cutover approach; because of this, the RS will not be complete in the near future. This excludes possibility 2. There is a set of preliminary requirements documentation (PRD) that describes both new and existing requirements of the RS. Although the set of PRD is not complete, given the situation, possibility 3 is the most feasible. Therefore, this master’s thesis will focus on the third option: comparing the PRD with the source code of the LIS.

1.2 Research question

The structure of a PRD and that of the LIS are inherently different. At the highest level, a PRD consists of categories, titles, and sentences describing a particular high-level concept in natural language, whereas the LIS consists of files, classes, methods, variables, etc. Because of these differences, it is difficult to perform a 1:1 comparison and find out what the commonalities between the two are. Without this information, it is difficult to determine to what degree the LIS implementation is covered by a PRD. However, both the PRD and the LIS rely on words to tell their audience what the software system must do. This allows us to use descriptions found in the PRD to trace back, or map, to the part of the source code where the described high-level concept is implemented.


Human experts who have worked on, and are familiar with, the LIS can be employed to perform this mapping process manually. This is indeed a feasible strategy for small programs. In general, however, such experts may no longer be around, and for software systems with a large footprint the process becomes painstakingly tedious. An automated approach would most likely be more efficient. An example would be an application that is able to map each existing requirement described in the PRD to one or more files in the code base of the LIS. The more files that get mapped to existing requirements, the more complete the PRD should be. When all files are mapped, the set of PRDs should have described all the functionalities of the LIS. Of course, this approach assumes that one file contains the implementation of only one high-level concept.

However, for the above or any other automated approach to be feasible, the following fundamental question, which is also our research question (RQ), needs to be asked:

RQ: Can the requirements documentation of a redesigned system (RS) be mapped automatically to the source code of an existing system (LIS)?

To answer this main research question, we decompose it into smaller, manageable sub-questions:

1. What methods exist for mapping natural language text to the source code?
2. What are the prerequisites for the mapping method?
3. What is the accuracy of the mapping method?
4. Is it possible to use a metric for discriminating new and existing requirements that are expressed as natural language text?

1.3 How this document is organized

This chapter has described the problem statement, and the research questions we want to address. The rest of this document is organized as follows:

 Chapter 2 provides a study of the relevant literature on the two pillars of this master’s thesis, Information Retrieval (IR) and concept location, addressing sub-questions 1 and 2.

 Chapter 3 discusses the purpose, usage, implementation, and validation of the IR application that was developed for this case study. The IR application made it possible to create a mapping between the documentation and the source code.

 Chapter 4 presents and elaborates on the training, test, and PRD datasets. They were mostly used for validating experiments and are referenced extensively in the subsequent chapters.

 Chapter 5 discusses the effectiveness of the IR application using the recall and precision metrics, addressing sub-question 3.

 Chapter 6 presents a classification model and discusses its performance on the test set.

 Chapter 7 presents the validation results of the classification model using the PRD. The results of this and the previous chapter address sub-question 4.

 Chapter 8 discusses internal and external threats to the validity of the case study and provides an answer to our RQ.


2 Information Retrieval

2.1 Introduction

When surveying the literature, relevant work [2, 3] can be found with regard to mapping documentation (expressed in natural language) to the source code, and vice versa. The method described in the literature is Information Retrieval (IR). IR plays an important role in our everyday life, as Internet search engines are built using this technology.

2.2 How it works

An IR system consists of two subsystems: 1) an indexing subsystem that takes a corpus of documents as input for indexing, and 2) a search engine subsystem that takes a user query, expressed in (natural) text, as input and matches this against the corpus of indexed documents. Ranking functions are then employed “to order documents according to the documents’ estimated match with a user query” [4]. The output is a list of documents ordered by the highest relevance ranking score on top.

2.3 Mapping documentation to the source code

Antoniol et al. first proposed using IR to establish “traceability links between the source code and free text documents” [2]. They compared two retrieval and ranking models: the probabilistic model and the Vector Space Model (VSM). [3] followed up by doing the opposite, mapping documentation to the source code, and decided to use Latent Semantic Indexing (LSI). We will briefly go through these three models. In addition, we will briefly introduce the classical Standard Boolean Model (SBM) and the Extended Boolean Model (EBM), as these are part of our IR solution (to be discussed later in chapter 3).

1. Probabilistic model: documents are ranked according to the probability “of being relevant to a query computed on a statistical basis” [2].

2. VSM: the query and documents are expressed as vectors. The distance between the query vector and a document vector determines the semantic similarity, or relevance, of that document. According to [3], VSM’s shortcoming is that it cannot deal with words that are polysemous or synonymous in nature.

3. LSI: this model is an extension of VSM and was created to address the polysemy and synonymy problems found in VSM [5]. The polysemy problem occurs when a word has multiple meanings. For instance, in the field of computer science the word transaction refers to a unit of work within a database system, whereas in the financial domain it is an agreement of exchange between a buyer and a seller. The synonymy problem occurs when there are multiple words with the same or similar meaning, e.g. transaction, sale, purchase.

4. SBM: this model has its roots in Boolean algebra. For a document to be considered relevant, it must satisfy the query. A query consists of terms and may include the Boolean operators AND, OR, and NOT. For example, the query “master AND thesis” returns all documents that contain both the term “master” and the term “thesis”. According to [6], the SBM has various drawbacks:
   a. Unlike the previous three models, the SBM does not have a ranking function, so the output cannot be ordered from the most relevant to the least relevant document.
   b. “All terms included in the documents and queries are assumed to carry equal importance” [6].

5. EBM: as the name suggests, this is an extension of the SBM. First, the SBM is used to filter the corpus of indexed documents. The extension is the incorporation of weighted terms in the query and documents, which makes it possible to employ a ranking function to calculate the relevance ranking score for the remaining potentially relevant documents [6].
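To make the VSM idea concrete, the sketch below ranks a document against a query using raw term-frequency vectors and cosine similarity. It is an illustrative simplification, not Lucene's actual scoring function (which adds term weighting and normalization factors); the token lists are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class VsmSketch
{
    // Build a term-frequency vector from a pre-processed list of tokens.
    static Dictionary<string, int> TermFrequencies(IEnumerable<string> tokens) =>
        tokens.GroupBy(t => t).ToDictionary(g => g.Key, g => g.Count());

    // Cosine similarity between a query vector and a document vector.
    static double CosineSimilarity(Dictionary<string, int> query, Dictionary<string, int> doc)
    {
        double dot = query.Keys.Intersect(doc.Keys).Sum(t => (double)query[t] * doc[t]);
        double queryNorm = Math.Sqrt(query.Values.Sum(v => (double)v * v));
        double docNorm = Math.Sqrt(doc.Values.Sum(v => (double)v * v));
        return (queryNorm == 0 || docNorm == 0) ? 0.0 : dot / (queryNorm * docNorm);
    }

    static void Main()
    {
        var query = TermFrequencies(new[] { "order", "payment", "refund" });
        var document = TermFrequencies(new[] { "process", "payment", "refund", "order", "order" });
        Console.WriteLine(CosineSimilarity(query, document)); // higher value = more relevant
    }
}
```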

2.4 Alternative

An alternative to IR is the “grep” method, a static analysis method named after the UNIX regular expression search tool grep. The idea is to use grep or similar tools to “text search for keywords in comments and variables” [7]. There are several limitations to this method: 1) natural language input is difficult, if not impossible, 2) stemming needs to be performed manually, and 3) grep does not provide any way to rank the results [2].

2.5 Motivation for automated mapping

According to Nelson, the software maintenance phase consumes 50% to 80% “of the resources in the total software budget” [8]. Within software maintenance, 27% to 62% of the time is spent on program comprehension [8, 9, 10].

Program comprehension is the cognitive process of understanding the source code of a software system. In a review paper [11], Storey mentions four cognitive model theories that attempt to explain how programmers comprehend code. The four cognitive models are:

1. Top-down comprehension, in which programmers take domain knowledge and map it to the source code.

2. Bottom-up comprehension, in which programmers “read code statements and then mentally chunk or group these statements into higher level abstractions” [11].

3. Opportunistic and systematic strategies, in which programmers go through the source code and focus on the control and data flow to gain a global understanding of the system, or employ an as-needed approach “focusing only on the code related to a particular task” [11].

4. Integrated Metamodel, a hybrid model that combines the above three cognitive theories into a single model.

In both the top-down and the as-needed approach lies the fundamental “problem of discovering individual human oriented concepts and assigning them to their implementation oriented counterparts” [12]. Biggerstaff et al. dubbed this the concept assignment problem, or concept location problem.

(10)

2.6 Prerequisites

To enable a programmer to perform concept location, that is, to manually establish the mapping between a given natural language query and a specific portion of the source code, the code base of the LIS must contain useful informal semantics [13] such as identifiers and/or comments, and these must be descriptive and relevant. This premise is also shared by automated IR mapping approaches [2]. The following two factors determine how useful the source code is to a programmer:

Identifiers. Biggerstaff’s thought experiment [13, 14] illustrated the importance of meaningful identifiers: source code in which meaningful identifiers were replaced by “semantically empty symbols” [13] such as f0001, a0001, i0001, etc. was very difficult, if not impossible, to comprehend.

A comprehension test performed by [15] came to the same conclusion. The study revealed that the use of a descriptive naming style for identifiers, accompanied by documentation, resulted in easier code comprehension than when one of the two was missing. The second best situation was a descriptive naming style without any documentation. Developers had the hardest time comprehending code when a non-descriptive naming style was used and documentation was lacking. In a more recent study, observations of and interviews with 28 developers confirmed that low-quality informal semantics led to, among other problems, misunderstandings and cryptic names whose meaning the programmer could not work out [16].

Comments. As mentioned earlier, [15] has shown that comments do matter for comprehension. However, comments have limitations: 1) the comment is not a real comment but commented-out code, 2) comments are not updated to reflect code changes made by the programmer, or 3) the comment is simply useless, as Example 1 below illustrates when compared with the more informative Example 2:

(Example 1) i++; // increment I

(Example 2) i++; // examine the next customer


2.6.1 The thesis author as the human expert

The author of this document, hereafter referred to as the thesis author, can confirm that most identifiers in the code base followed the Microsoft General Naming Conventions [18], which advocates the use of casing styles and descriptive names, and discourages abbreviations and acronyms. This means that the LIS meets the prerequisites we described earlier.

The thesis author is in a position to judge the above because he has intimate knowledge of the LIS: he has been working as a programmer on the LIS since its inception and, according to the statistics (§9.1) found in the source code repository, was responsible for modifying 55% of the lines during its lifespan. For these two reasons, the thesis author served as the human expert for the entire research. The human expert had the following two tasks:

1. Perform a binary classification on the IR search results (§5.2)

2. Classify the requirements found in the PRD as new or existing (§4.3)

2.7 Measuring performance

The metrics recall and precision at k are commonly used to measure the performance of IR systems [5]. However, the IR system’s application and its end-users determine which metric is the most appropriate for measuring its effectiveness.

Precision at k. The precision at k (P@k) metric tells us what fraction of the documents retrieved up to cutoff point k are relevant. This metric reflects how most people use Internet search engines. According to a 2013 report [19], the first page of an average Google search result generated 92% of the traffic to other sites, compared to 5% for the second page. This implies that the average Google user finds the top k documents to be relevant and will look no further than k. When developing an Internet search engine, this metric provides a good indication of the performance of an IR system.

The P@k is defined as follows:

P@k = (# of relevant documents retrieved) / (# of retrieved documents)

Recall. This metric indicates the fraction of relevant documents that are successfully retrieved for a query. Measuring the performance of an Internet search engine using the recall metric does not make sense: the typical Google user does not seem to care about getting all relevant documents.

The recall is defined as follows:

Recall = (# of relevant documents retrieved) / (# of relevant documents)
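As an illustration, the sketch below computes P@k and recall for a single query, given the ranked list of retrieved documents and the set of documents judged relevant by the human expert. The file names and the judgments are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class IrMetrics
{
    // Precision at k: fraction of the top-k retrieved documents that are relevant.
    static double PrecisionAtK(IList<string> ranked, ISet<string> relevant, int k)
    {
        var topK = ranked.Take(k).ToList();
        return topK.Count == 0 ? 0.0 : (double)topK.Count(relevant.Contains) / topK.Count;
    }

    // Recall: fraction of all relevant documents that appear anywhere in the result list.
    static double Recall(IList<string> ranked, ISet<string> relevant)
    {
        return relevant.Count == 0 ? 0.0 : (double)ranked.Count(relevant.Contains) / relevant.Count;
    }

    static void Main()
    {
        var ranked = new List<string> { "Order.cs", "Payment.cs", "Helpers.cs", "Refund.cs" };
        var relevant = new HashSet<string> { "Order.cs", "Refund.cs", "Invoice.cs" };

        Console.WriteLine(PrecisionAtK(ranked, relevant, 2)); // 1 of the top 2 is relevant = 0.5
        Console.WriteLine(Recall(ranked, relevant));          // 2 of 3 relevant found = 0.667
    }
}
```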


3 IR application

3.1 Introduction

In this chapter, we will present a custom application, hereafter referred to as the IR application, which can map documentation to the code base of the LIS. This application played an important role in the research, as the classification model (chapter 6) used its output. The classification model allows us to answer the fourth sub-question.

This chapter consists of two parts: in the first part, we will discuss the purpose, usage, and implementation of the IR application. The second part explains what acceptance criteria were used for validating the IR application.

3.2 IR application

3.2.1 IR engine

Instead of writing our own VSM or LSI implementation, it was decided to use an existing open source library. The motivation was that this greatly mitigated the chance of a faulty IR implementation. The tradeoff was that the choice of retrieval method was restricted to what was available.

The chosen IR library was Lucene.Net 3.0.3 RC2 [20], a C# port of the open source Apache Lucene project. The motivation for Lucene was that it is well documented and stable: it has been in development since 1999 and is used as the search engine by many popular platforms, e.g. LinkedIn [21] and Stack Overflow [20]. The C# port was chosen because of the author’s familiarity with the .NET platform. Lucene uses the EBM, or more specifically a combination of the SBM and the VSM.

3.2.2 Purpose

The purpose of the IR application was to save the relevance ranking score metric produced by the ranking function of Lucene. The saved scores were then used in a subsequent step by the classification model.

In the Lucene community, the relevance ranking score is also called the Lucene Practical Score (LPS). The LPS is a score given to a source code document in the search results. The higher this score, the more relevant the document should be. The LPS for each document in the search result is calculated using the Lucene Practical Scoring Function (LPSF). According to [22] the LPSF contains a factor that normalizes the query so that scores from different queries are comparable.

3.2.3 Usage

The IR application was used in the following order:

1. Index the source code of the LIS to build the corpus (see §3.2.4).

2. Generate a set of LPS values by feeding the IR application with queries extracted from existing documentation¹. We will elaborate on this type of query in §4.2.

3. Generate a set of LPS values by feeding the IR application with queries extracted from requirements found in the PRD. We will elaborate on this type of query in §4.3.

3.2.4 Implementation

As mentioned in §2.2, an IR system consists of two subsystems. Our implementation followed this pattern and thus consisted of two Windows console programs written in C#: 1) the Indexer (IXR) and 2) the Search Engine (SE). The footprint of the code that needed to be written by the thesis author was very small, only 600 LOC.

The IXR is a static analysis tool. Its main purpose was to go through the relevant files in the entire code base of the LIS and extract informal semantics from each source code file. The informal semantics we aimed to collect were the names of classes, methods, and variables, and the comments. The next step was to pre-process them and use them as input for Lucene to build the corpus of indexed documents. More details on extraction and pre-processing can be found in §3.2.5.

The SE expected input in the form of a query expressed in plain English, and returned an ordered list of relevant source code files with the most relevant on top. The SE did not explicitly apply any Boolean operators between the terms; however, Lucene implicitly applied the OR operator between all terms. The SE then called Lucene to search through the corpus and saved the LPS results to a file.

¹ Existing documentation refers to a collection of documents that were used for implementing concepts in the LIS. They consisted of one functional design written by the thesis author three years ago, and SaaS integration guides written by different companies. The SaaS integration guides describe homogeneous web services.


The following diagram illustrates the aforementioned process:

Figure 1: a view of the IR application and its dependencies. The custom IR system consists of the Indexer (IXR), which indexes the source code into the corpus, and the Search Engine (SE), which takes a query formulated in natural language as input and uses the third-party Apache Lucene.Net library to perform the Boolean model and result ranking, producing the output.
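The sketch below shows roughly how such an indexer/search-engine pair can be wired together with the Lucene.Net 3.0 API. It is a simplified illustration rather than the actual IXR/SE source code; the field names, the pre-processing hook, and the way the IndexWriter and index Directory are supplied are assumptions.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

static class IrSketch
{
    // IXR side: add one corpus document per source code file.
    static void IndexFile(IndexWriter writer, string path, string preProcessedText)
    {
        var doc = new Document();
        doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("content", preProcessedText, Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }

    // SE side: run a natural-language query and print the LPS of the top hits.
    static void Search(Directory index, string naturalLanguageQuery, int topK)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        Query query = parser.Parse(naturalLanguageQuery); // terms are OR'ed by default

        using (var searcher = new IndexSearcher(index, true)) // open the index read-only
        {
            TopDocs hits = searcher.Search(query, topK);
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                // hit.Score is the Lucene Practical Score (LPS) used later by the classification model.
                Console.WriteLine("{0}\t{1}", hit.Score, searcher.Doc(hit.Doc).Get("path"));
            }
        }
    }
}
```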

3.2.5 Pre-processing steps

Each document in the corpus will be referred to as a corpus document. Similar to the approach of [5], each corpus document represents one source code file. Unlike Marcus et al.’s approach, however, large or “god” source code files were not split into multiple corpus documents.

The steps to extract and process the informal semantics are based on the work of [23, 24]. Similar to Kuhn et al. and Gay et al.’s approaches, the source code was pre-processed.

Step 1: tokenize. The source code and comments were tokenized. During this process, curly/square brackets, parentheses, etc. were removed. This left us with words that were then processed in the subsequent steps.

Step 2: remove .NET vocabulary. For each type of file, a specific stop-list consisting of C#/VB.NET, LINQ, and .NET XML documentation tags was employed to filter out tokens that were not semantically interesting for finding the traceability link. If they were not filtered, queries containing .NET vocabulary such as interface, class, public, protected, private, etc. would produce results with low precision.



Step 3: Pascal and camel casing. Compound words where Pascal or camel casing was applied were broken into separate words; thus SetProperty became set and property. The code base’s coding style largely follows the .NET naming convention, which proposes Pascal and camel casing.

Step 4: remove English stop words. English stop words are common words that do not add much semantic value, such as an, the, etc. Similar to step 2, if the English stop words were not removed, the SE tool could return more non-relevant results.

We employed an arbitrary, or non-tailored, stop-list by Fox. Fox’s stop-list is based on a corpus “of 1,014,000 words drawn from a broad range of literature in English” [25]. There is a possible limitation in the use of Fox’s stop-list: as method names in .NET are conventionally constructed as VerbNoun, method names consisting of verbs and/or nouns present in the stop-list may not have been indexed (correctly).

The alternative to an arbitrary stop-list is a tailored stop-list. Zaman et al. [26] described an approach by Manning et al. for constructing a tailored stop-list; however, the approach involves more manual work. According to Zaman et al.’s findings, an IR system with a tailored stop-list does produce better precision than IR systems using an arbitrary stop-list or no stop-list. However, “after the top 40% retrieval all the systems show almost the same retrieval performance” [26]. In addition, Zaman et al. also revealed that the precision difference between two arbitrary stop-lists is at most 2.92%.

Step 5: remove short tokens. We removed informal semantics shorter than 3 characters, since tokens consisting of one or two characters are most likely to be semantically insignificant.

Step 6: stemming. The final step (before adding a word to Lucene) was to reduce the word to its root form, e.g. “windows” becomes “window”. This was done using the widely adopted Porter stemming algorithm, for which the C# source code is publicly available at [27].
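A condensed sketch of these six steps is given below. The stop-lists are truncated to a handful of entries, and the Stem method is a crude placeholder for the Porter stemmer referred to above; both are assumptions made for the sake of the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PreProcessor
{
    // Steps 2 and 4: abbreviated stop-lists (the real lists are much longer).
    static readonly HashSet<string> DotNetStopList =
        new HashSet<string> { "public", "private", "protected", "class", "interface", "void" };
    static readonly HashSet<string> EnglishStopList =
        new HashSet<string> { "a", "an", "the", "of", "and", "to" };

    static IEnumerable<string> Process(string sourceText)
    {
        return Regex.Split(sourceText, @"[^A-Za-z]+")                     // step 1: tokenize, dropping brackets etc.
            .Where(t => t.Length > 0)
            .Where(t => !DotNetStopList.Contains(t.ToLowerInvariant()))   // step 2: remove .NET vocabulary
            .SelectMany(SplitCasing)                                      // step 3: split Pascal/camel casing
            .Select(t => t.ToLowerInvariant())
            .Where(t => !EnglishStopList.Contains(t))                     // step 4: remove English stop words
            .Where(t => t.Length >= 3)                                    // step 5: remove short tokens
            .Select(Stem);                                                // step 6: stemming
    }

    // "SetProperty" -> { "Set", "Property" }.
    static IEnumerable<string> SplitCasing(string token) =>
        Regex.Split(token, @"(?<!^)(?=[A-Z])");

    // Placeholder for the Porter stemmer; a real implementation would go here.
    static string Stem(string token) =>
        token.EndsWith("s") && token.Length > 3 ? token.Substring(0, token.Length - 1) : token;

    static void Main() =>
        Console.WriteLine(string.Join(" ", Process("public void SetShippingCosts(int orderId) { }")));
}
```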

3.3 Validation

Although the IR application uses the off-the-shelf IR library Lucene, custom code had to be written for the pre-processing steps. Validating that the IR application as a whole worked as intended was critical, as the IR application was used for subsequent steps in the research. An incorrectly performing application would have propagated its errors to the next steps, which would likely have introduced a threat to internal validity, such as an invalid classification model.

The following acceptance criteria for the IR application were used:

1. The application is able to establish a mapping² with the source code of the LIS using queries extracted from existing documentation

² A mapping is established when the top k results of a query produce relevant results according to the human expert.


2. The application is able to establish a mapping with the source code of the LIS using queries extracted from documentation that merely mentions implemented concepts but was not used for the implementation of the LIS

3. The application is unable to establish a mapping with the source code of the LIS using non-relevant queries

The second acceptance criterion was included to make sure that the IR application was also able to map relevant queries that were not used as a basis for the implementation of the LIS. This was done because queries extracted from existing requirements of the PRD are of this nature.


4 Datasets

4.1 Introduction

In this chapter, we will introduce and elaborate on the two main types of datasets: the training and test set, and the PRD dataset. Each dataset was created from one or more sources and served multiple purposes. This is illustrated in the figure below:

Figure 2: the source and purpose of the two datasets. The training and test set were built from existing documentation, Wikipedia articles, and the thesis author's knowledge; they were used to validate and measure the effectiveness of the IR application (§5.3), as the gold standard for the classification model (training set, §6.3.1), and to validate the classification model (test set, §6.3.2). The PRD dataset was built from the PRD and was used to calculate recall and precision for comparison with the test set (§5.3) and by the classification model to generate classification results (§7.2).

4.2 Training and test set

This dataset consisted of two subsets: the training set and the test set. When we validated and measured the effectiveness of the IR application, both subsets were used together; we fully admit that the name is a bit misleading in that case. When used by the classification model, the two subsets were indeed used separately, and their names do reflect their purpose.

Based on the acceptance criteria in §3.3, the thesis author created different types of samples for this dataset. Samples were of varying sizes to better reflect the samples of the PRD dataset. The following two tables summarize the characteristics of the samples.

Hypothesis / Sample group | Name | Description | Source
VH1 | Short relevant queries | A term that describes a concept that can be mapped to the source code | Thesis author’s domain and source code knowledge
VH2 | Short non-relevant queries | One or two terms that describe a concept that cannot be mapped to the source code | Thesis author’s domain and source code knowledge
VH3 | Large relevant queries | A paragraph that describes a concept that can be mapped to the source code | Wikipedia articles and book chapters
VH4 | Large non-relevant queries | Paragraphs that describe concepts that cannot be mapped to the source code | Wikipedia articles and book chapters
VH5 | Ambiguous queries | Short sentences containing important terms that can be mapped but are used in a different context. An example is the word ideal, which can mean perfect but is also the name of a Dutch payment method | Thesis author’s domain and source code knowledge
VH6 | Large relevant + implemented queries | A paragraph that was used for the implementation of a concept in the LIS and is therefore definitely mappable to the source code | Existing documentation

Table 1: the six different groups in the training and test set

The attributes of each hypothesis / sample group are expressed in the comparison matrix below:

Hypothesis / Sample group | Name | Number of terms | Describes concept related to the domain? * | Is mappable? **
VH1 | Short relevant queries | 1 | Yes | Yes
VH2 | Short non-relevant queries | 1 – 2 | No | No
VH3 | Large relevant queries | 31 – 410 | Yes | Yes
VH4 | Large non-relevant queries | 193 – 448 | Yes | No
VH5 | Ambiguous queries | 5 – 12 | Yes | No
VH6 | Large relevant + implemented queries | 83 – 918 | Yes | Yes

Table 2: characteristics of the hypotheses / sample groups

* Do the queries of this sample group describe concepts related to the domain of the source code? This was determined by the thesis author.

** Are the queries of this sample group relevant, or mappable to the source code? This was also determined by the thesis author.

4.2.1 Sample size

The thesis author created 20 queries for each sample group, resulting in a total of 120 queries. Each sample group was split using Excel’s randomization function (RAND) into two sets: 70% was assigned to the training set, and 30% was assigned to the test set.
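The thesis performed this split with Excel's RAND function; an equivalent, reproducible split could be scripted as sketched below, where the fixed seed is an assumption and only the 70/30 ratio and the 20 queries per group come from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SampleSplitter
{
    // Randomly assign 70% of a sample group to the training set and 30% to the test set.
    static (List<string> Training, List<string> Test) Split(IReadOnlyList<string> queries, int seed = 42)
    {
        var rng = new Random(seed);                               // fixed seed keeps the split reproducible
        var shuffled = queries.OrderBy(_ => rng.Next()).ToList(); // shuffle the sample group
        int trainingCount = (int)Math.Round(shuffled.Count * 0.7);
        return (shuffled.Take(trainingCount).ToList(), shuffled.Skip(trainingCount).ToList());
    }

    static void Main()
    {
        var vh1 = Enumerable.Range(1, 20).Select(i => $"VH1 query {i}").ToList();
        var (training, test) = Split(vh1);
        Console.WriteLine($"{training.Count} training, {test.Count} test"); // 14 training, 6 test
    }
}
```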


4.3 Preliminary requirements document

Between 8 and 10 employees with different expertise from different departments within the host organization contributed to the two PRDs; neither the thesis author nor the software development team contributed to these documents. The two PRDs did not describe requirements in full detail: each requirement in the document was at most a set of sentences describing a concept, and was used as a query for the SE to generate a list of LPS values.

The thesis author manually classified whether each requirement mentioned a new or an existing high-level concept. The results of the manual classification were used in a later step for comparison with the classification results (§7.1) produced by the classification model.

The following table summarizes the characteristics of the two PRD datasets:

Requirements document | No. of queries | No. of terms per query | Existing concepts
PRD A | 131 | 4 – 143 | 18.3%
PRD B | 95 | 1 – 28 | 54.7%

Table 3: characteristics of the two PRD datasets


5 Validation results

5.1 Introduction

In this chapter, we will measure the effectiveness of the IR application using the recall and precision metrics. We will do this for the training/test set as well as for the PRD datasets.

5.2 How classification was performed

The thesis author had to perform a binary classification on all documents of a search result, before it was possible to calculate the recall and precision values of a query. This needed to be done for all 60 and 76 relevant queries from the training/test set and the PRD dataset respectively. The thesis author attempted to abide by the following two guidelines when classifying source code documents as relevant:

1. The source code document is required in order for the high-level concept mentioned in the query to work.

2. Dependencies are not considered relevant.

The implementation of a high-level concept can exist in one or multiple source code documents because of the way the code is organized. These documents were considered relevant by the thesis author. However, generic dependencies such as helpers or database classes that are reusable for other purposes as well were not considered relevant unless the query specifically mentioned them.

5.3 Results

We used cutoff points of 10, 50, 100, and 200 to calculate the recall and precision values. It was observed that in all cases, the VH1, VH3, and VH6 sample groups of the training/test set achieved higher recall and precision values more frequently than the two PRD datasets.

The results of the recall performance for k = 10 are visualized in Figure 3. The reason for visualizing the results of this cutoff point is that we want to stress the difference in performance between using the training/test set and the PRD dataset. We noted that the higher the k, the smaller the difference in recall and precision percentages between the two types of datasets. This was especially noticeable at cutoff points 100 and 200, where the differences were minimal.

Figure 3 shows five histograms, one for every relevant dataset. Each histogram consists of four bins, and each bin represents a percentage range. So, if the recall percentage for a query is 20%, it falls in the bin ≥ 0 ≤ 0.24; 40% falls in the bin ≥ 0.25 ≤ 0.49, and so on.


Figure 3: the recall performance (k = 10) for 5 different datasets (VH1, VH3, VH6, PRD-A, PRD-B), with queries binned into four recall ranges between 0 and 1.0

Figure 4 visualizes the precision results in the same way as the recall results: using the same k, the same number of bins, and the same bin sizes.

Figure 4: the P@10 performance for 5 different datasets (VH1, VH3, VH6, PRD-A, PRD-B)


5.4 Discussion

In Figure 3, we can see that a large number of PRD queries have recall percentages that fall in the first bin, a marked difference from the training/test set. This means that the IR application achieved better recall, and thus could retrieve more relevant documents, using queries of the training/test set than using queries of the PRD datasets.

Figure 4 shows an even more obvious contrast between the PRD dataset results and the training/test set results. A high percentage of PRD queries end up in the first bin, indicating low precision. The VH6 histogram shows a somewhat similar result, but there is still a clear difference between the first bin of the VH6 histogram and those of the two PRD histograms. The precision figures indicate that the search results of PRD queries contained more non-relevant documents than the search results of training/test set queries.

Furthermore, the results revealed that the IR application was able to retrieve a significant number of relevant documents at higher cutoff points. However, as the cutoff point increased, the number of non-relevant documents also went up. At k = 200, 90% to 100% of the queries produced recall percentages that ended up in the ≥ 0.75 ≤ 1.0 bin, compared to 10% to 65% in Figure 3. The P@200 results revealed that 95% to 100% of the queries produced precision percentages that ended up in the first bin. This was not a surprise, considering that the average number of documents per search result (for both the training/test and PRD datasets) was 777, while the average number of relevant documents was 6. There is an explanation for this ratio: no Boolean operators were added explicitly to the queries, so Lucene applied the default OR operator between each term. This means that the query “Hello world” will return any document that contains the term “hello” or the term “world”. Although possibly relevant documents will therefore not be missed, the side effect is that the number of (non-relevant) documents per search result was high.
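The effect can be illustrated with Lucene's query syntax (a sketch against the Lucene.Net 3.0 API; the field name "content" is an assumption): without explicit operators the parsed query OR's the terms together, whereas an explicit AND only matches documents containing both terms, trading recall for precision.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Version = Lucene.Net.Util.Version;

static class QueryOperatorDemo
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);

        // Default behaviour: terms are combined with OR, so any document
        // containing either "hello" or "world" is returned.
        Console.WriteLine(parser.Parse("hello world"));      // content:hello content:world

        // Explicit AND: only documents containing both terms are returned.
        Console.WriteLine(parser.Parse("hello AND world"));  // +content:hello +content:world
    }
}
```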

5.4.1 Vocabulary mismatch

One possible explanation for the performance difference between using existing documentation and the PRD is the so-called vocabulary mismatch problem [28]. This phenomenon occurs when two people independently make different word choices to describe the same high-level concept. The results of an empirical study [29, 28] revealed that the variability in word choice was high among participants with different backgrounds; the probability that participants chose common words to describe concepts did not surpass 20% [28].

The significance of the vocabulary mismatch problem could further be attributed to outdated informal semantics, e.g. informal semantics used to describe high-level concepts in the source code have a different name than what the end-user is exposed to in the user interface. This was due to name changes happening after implementation. The changes were due to one of the following reasons:

1. The official marketing name was not yet available during the initial development phase, so programmers gave the high level concept a generic name that was commonly used during meetings.

2. Changing business requirements required implemented high-level concepts to evolve, so the stakeholders decided to rename them to better reflect the evolved functionality.


Since renaming high-level concepts on such an aesthetic level did not affect the implementation, the programmers did not update the names of related classes and methods. Renaming did result in changes in artifacts such as HTML templates, but these were not part of the corpus.

5.5 Conclusion

Judging from a user’s query-need perspective, the IR application performed poorly, as it retrieved more documents than needed. As pointed out in §2.7, in a regular search engine scenario the user looks no further than the top k documents of a search result. However, in our case study, the IR application was not created to satisfy any user query needs; the only important aspect was to validate that the IR application can retrieve relevant documents.

In the next chapter, we will introduce the classification model. The results in this chapter revealed that the use of the PRD datasets resulted in a weaker performance by the IR application. Since the classification model depends on the LPS values found in the search results and shares the results of the binary classification performed in §5.2, we expect that the classification model will again show a weaker performance when the PRD datasets are used.


6 Experiment #1: Classification model

6.1 Introduction

In this chapter, we will present the classification model for classifying queries, and reveal how effective it was for classifying the test set queries.

6.2 Classification model

6.2.1 Background information

The core of the classification model revolves around the LPS, a metric which should indicate the relevancy of a document relative to a query. The hypothesis is that the LPS metric can be used to establish a threshold value that can then be used to decide whether a query is relevant or not.

It was decided to keep the model simple by using the LPS values of the VH1, VH3, and VH6 training samples as the gold standard threshold values. These sample groups were chosen because they cover relevant high-level concepts that were also implemented in the LIS. The assumption was that the non-relevant training samples would result in LPS values lower than the gold standard threshold values. Although we realize this assumption may not hold in every case, opting for a more complex model would have raised more challenges and thus increased the scope of this project. One of the challenges, and a limiting factor, was the small sample size. Suppose we had incorporated more metrics into our model, such as the number of documents in a search result or the number of query terms. The problem is that overfitting would have occurred because of the sample size: the model would have been able to accurately classify the training samples, but would not have generalized well to unseen samples. Another alternative would be to extend the gold standard with LPS values of non-relevant samples. However, when a sample can be classified as both relevant and non-relevant, yet another challenge needs to be solved.

6.2.2 Mitigating the single LPS value problem

We refer to a document in a search result as a true positive, if the document has a relatively high LPS value and is classified as relevant by the thesis author. A non-relevant document with a high LPS value is referred to as a false positive. A relevant and non-relevant document with a low LPS value are referred to as a false negative and a true negative, respectively.

It was shown in §5.3 that the precision of the IR application was low. This means that false positives were placed at positions similar to those of true positives, which undermines the usefulness of a single LPS value. To mitigate this problem, we propose to use the sum of the top k LPS values, or SumLPS(@k). The motivation is based on the assumption that relevant queries produce a high number of true positives and a limited number of false positives, whereas non-relevant queries produce no documents at all, or mostly negatives with a limited number of false positives. As long as the search result fits this pattern, the SumLPS for relevant queries should be distinctively higher than the SumLPS for non-relevant queries.

6.2.3 The model

Let v be the LPS value of a source code document, Tv (the gold standard threshold values) be the set of LPS values v produced by the SE using the training data of VH1, VH3, and VH6 as input, and Qv be the set of LPS values v produced by the SE using an arbitrary query as input whose relevance we want to classify. The classification model is defined as follows:

SumLPS(k, L) = Σ_{i=1..k} v_i, where each v_i ∈ L

Relevant ⇔ SumLPS(k, Qv) ≥ SumLPS(k, Tv)
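Read literally, the model can be sketched as follows: SumLPS@k sums the k highest LPS values of a set, and a query is classified as relevant when its SumLPS@k reaches the gold standard value. How Tv is assembled from the LPS values of the VH1, VH3, and VH6 training queries is kept as in the definition above; the numeric values in the example are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ClassificationModel
{
    // SumLPS(k, L): the sum of the k highest LPS values in L.
    static double SumLps(int k, IEnumerable<double> lpsValues) =>
        lpsValues.OrderByDescending(v => v).Take(k).Sum();

    // Relevant <=> SumLPS(k, Qv) >= SumLPS(k, Tv)
    static bool IsRelevant(int k, IEnumerable<double> queryLps, IEnumerable<double> goldStandardLps) =>
        SumLps(k, queryLps) >= SumLps(k, goldStandardLps);

    static void Main()
    {
        // Illustrative LPS values only; real values come from the SE output files.
        var goldStandard = new[] { 1.10, 0.95, 0.80, 0.40, 0.20 };   // Tv
        var prdQuery     = new[] { 1.30, 1.05, 0.70, 0.10 };         // Qv

        const int k = 3;
        Console.WriteLine(IsRelevant(k, prdQuery, goldStandard));    // True: 3.05 >= 2.85
    }
}
```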

6.3 Results

6.3.1 Training set

We used the LPS values of the VH1 to VH6 queries for Qv. Since we did not know what the optimal cutoff point k was, we tried every k from 1 to 1385; this upper bound is the total number of indexed source code files (§9.3). The results in the table below reveal that it was not possible to achieve 100% correct classification even when using the training set, or gold standard, itself. As expected, this was because the SumLPS of some non-relevant samples ended up being higher than the gold standard threshold value.

k Correctly classified

1..12 95.2%

13..343 96.4%

344..1385 95.2%

Table 4: classification performance of the classification model using the training set

The k range 13 to 343 was optimal, with 96.4% correct classifications. The following table shows the classification results for this range; a single table suffices because the results were consistent across the entire range.

Sample group | Correctly classified (count / %) | Incorrectly classified (count / %)
(VH1) Small relevant | 14 / 100.0% | 0 / 0.0%
(VH2) Small non-relevant | 13 / 92.9% | 1 / 7.1%
(VH3) Large relevant queries | 14 / 100.0% | 0 / 0.0%
(VH4) Large non-relevant queries | 14 / 100.0% | 0 / 0.0%
(VH5) Ambiguous queries | 12 / 85.7% | 2 / 14.3%
(VH6) Large relevant + implemented queries | 14 / 100.0% | 0 / 0.0%
Total | 81 / 96.4% | 3 / 3.6%

Table 5: classification results for the optimal k range

The classification model used VH1, VH3, and VH6 as the gold standard, so 100% correct classification for those sample groups is only logical. The results reveal that some non-relevant samples were classified incorrectly due to the fact that their LPS values were similar to that of the gold standard.

Most errors were due to the ambiguous queries of the VH5 sample group. This sample group was carefully engineered by the thesis author; its purpose was to introduce noise, something that might also be present in unseen data. This was realized by introducing query terms that appear in many places in the code base of the LIS. One of the VH5 queries that was not classified correctly was “the sales department is a strong force within the company”. Here, the terms “sales” and “force” returned a lot of false positives, due to the SalesForce CRM integration in the LIS.

6.3.2 Test set

We used the unseen test set of VH1 to VH6 for Qv, and the range 1 to 1385 for k. The table below shows the results; we keep the k ranges of Table 4 to make comparison easier. The results are slightly different from the previous ones: in this case, the optimal k runs from 43 to 1385, with 97.2% correct classification.

k Correctly classified

1..12 80.6% – 86.1% 13..343 86.1% – 97.2%

344..1385 97.2%

Table 6: classification performance of the classification model using the test set

The classification results for the optimal k range, k = 43 to 1385, were consistent. The following table shows the classification results for this range:

Sample group | Correctly classified (count / %) | Incorrectly classified (count / %)
(VH1) Small relevant | 6 / 100.0% | 0 / 0.0%
(VH2) Small non-relevant | 6 / 100.0% | 0 / 0.0%
(VH3) Large relevant queries | 5 / 83.3% | 1 / 16.7%
(VH4) Large non-relevant queries | 6 / 100.0% | 0 / 0.0%
(VH5) Ambiguous queries | 6 / 100.0% | 0 / 0.0%
(VH6) Relevant + implemented queries | 6 / 100.0% | 0 / 0.0%
Total | 35 / 97.2% | 1 / 2.8%

Table 7: classification results for the optimal k range of the test set


6.4 Discussion

Using the test set for Qv made the classification model perform slightly better than using the training set for Qv; the difference is only 0.8%.

The results reveal that the classification model has a slight problem classifying a small number of non-relevant samples of the training set, and one relevant sample of the test set. A limitation of this experiment is the small sample size; with a larger sample size, it is not inconceivable that more errors would surface.

The k value has an impact on the classification error. Depending on the k used, the error percentage varied from 2.8% to 19.4%. The difference between the optimal k of the training set and that of the test set indicates that the optimal k is not a fixed value or range. This means that for unseen datasets, the optimal k needs to be determined anew every time. This is not a big problem, since the total number of indexed documents is only 1385.

6.5 Conclusion

The classification model was unable to achieve 100% correct classification using samples from the training and test set. Although the results do not disappoint, the limitations of this experiment must not be ignored: the small sample size could be a threat to internal validity, and it is quite possible that larger sample sizes would expose more errors.


7 Experiment #2: Validate the classification model

7.1 Introduction

In the previous chapter, the gold standard for the classification model was established using the training set prepared by the thesis author. Then, the classification model was validated using the test set, also prepared by the thesis author. In this chapter, we will validate the model by using unseen data: the queries of the PRD dataset.

7.2 Results

7.2.1 PRD A

We used the LPS values of the PRD A queries for Qv, and 1 to 1385 for k. The results are shown in the following table:

k Correctly classified

1..12 37.4% - 41.2% 13..343 26.7% - 37.4% 344..1385 26.0% - 27.5%

Table 8: Classification model results using the PRD A queries as input

The optimal k values in this case are 2 and 4, where 41.2% of the queries are classified correctly. The following table shows in detail what was classified correctly, and what was not, for both k = 2 and k = 4.

Classification model result | Relevant queries (count / %) | Non-relevant queries (count / %) | All queries (count / %)
Correctly classified | 19 / 79.2% | 35 / 32.7% | 54 / 41.2%
Incorrectly classified | 5 / 20.8% | 72 / 67.3% | 77 / 58.8%
Total | 24 / 100.0% | 107 / 100.0% | 131 / 100.0%

Table 9: Performance of SumLPS@2, and SumLPS@4

The results in Table 8 are in stark contrast with the test set results in §6.3.2, where the worst performance of 80.6% is almost double the best performance in this case. Table 9 shows that classifying relevant queries results in a performance that is similar to previous results using the test set (§6.3.2). However, this is not the case for non-relevant queries, which affects the results negatively.

7.2.2 PRD B

We used the LPS values of the PRD B queries for Qv, and 1 to 1385 for k. The results are shown in the following table:

k Correctly classified

1..12 58.9% - 61.1% 13..343 61.1% - 63.2%

344..1385 61.1%

Table 10: Classification model results using the PRD B queries as input

The optimal k runs from 40 to 59, where 63.2% of the queries are classified correctly. The following table shows in detail what was classified correctly, and what was not, for k = 40 to 59.

Classification model result | Relevant queries (count / %) | Non-relevant queries (count / %) | All queries (count / %)
Correctly classified | 51 / 98.1% | 9 / 20.9% | 60 / 63.2%
Incorrectly classified | 1 / 1.9% | 34 / 79.1% | 35 / 36.8%
Total | 52 / 100.0% | 43 / 100.0% | 95 / 100.0%

Table 11: Performance of SumLPS (k = 40 to 59)

At first sight, Table 10 seems to suggest that the classification model performs better using PRD B than using PRD A. However, this is only because the ratio of relevant to non-relevant queries in the PRD B dataset is more evenly distributed than in the PRD A dataset.

Table 11 reveals that the performance of the classification model is consistent for both the PRD A and PRD B datasets: a significant portion of the relevant queries was classified correctly, and a significant portion of the non-relevant queries was classified incorrectly.

7.3 Discussion

The results of this second experiment are in stark contrast with those of the first experiment (§6.3.2). Here, the results show that the poor performance of the classification model is caused by the set of non-relevant queries. The non-relevant queries are similar to the ambiguous VH5 queries in the sense that both contain domain terms but cannot be mapped to the source code. The difference is that the VH5 queries of the test set were all classified correctly, compared to 32.7% and 20.9% of the PRD A and PRD B queries respectively.

A possible explanation for such a big difference in performance can be found in the training set. When we looked at the training set, there were four queries with the following SumLPS@2 values: 0.21, 0.22, and 0.26, while the average was 1.02. Many of the non-relevant PRD A and PRD B queries had similar SumLPS@2 values. Removing these outliers from the training set did not have any effect on the test set results. It did, however, improve the classification of the PRD A dataset from 41% to 64%, and an improvement was also observed for the PRD B dataset: from 63% to 69%. The optimal k changed in all instances. This reveals a limitation of the classification model: outliers in the training set can reduce its performance on unseen data. More extensive future experiments need to be performed to find out how to best mitigate the effect of outliers, either by removing them or by using a different classification model.
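As a sketch of the outlier removal described above, gold standard SumLPS@k values that deviate strongly from the mean could be filtered before the threshold is computed. The z-score cutoff of 2 and the example values are assumptions, not the thesis' criterion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class OutlierFilter
{
    // Remove gold-standard SumLPS@k values that deviate more than maxZ standard
    // deviations from the mean before they are used as threshold input.
    static List<double> RemoveOutliers(IReadOnlyCollection<double> sumLpsValues, double maxZ = 2.0)
    {
        double mean = sumLpsValues.Average();
        double stdDev = Math.Sqrt(sumLpsValues.Average(v => (v - mean) * (v - mean)));
        if (stdDev == 0) return sumLpsValues.ToList();            // nothing to filter
        return sumLpsValues.Where(v => Math.Abs(v - mean) / stdDev <= maxZ).ToList();
    }

    static void Main()
    {
        // Illustrative SumLPS@2 values for the gold-standard queries (made up for the example).
        var goldStandard = new[] { 1.10, 0.95, 1.20, 0.98, 1.05, 0.21 };
        Console.WriteLine(string.Join(", ", RemoveOutliers(goldStandard))); // 0.21 is dropped
    }
}
```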


The thesis author could also have played a pivotal role in the big performance difference, simply because he could have made classification mistakes. The outliers may simply be the result of those four queries being misclassified as relevant, where someone other than the thesis author might have correctly classified them as non-relevant. The literature shows that this possibility is not unlikely: Robillard et al. performed an experiment in which 23 programmers were asked to map source code to high-level concepts, and the study revealed a high variability, ranging from 0% to 61% [30], in the agreement between two programmers on the same mapping.

Alternatively, the thesis author could have misclassified a fraction of the PRD datasets instead of the training/test set. According to the thesis author, classifying the PRD datasets was much tougher than classifying the training/test set. The thesis author needed to 1) determine whether the PRD queries were relevant or not, and 2) classify whether each document of a search result was relevant to the query. When a query was not to the point or contained mostly unknown vocabulary, he found both tasks difficult. This is understandable, since the thesis author was not involved in writing the PRDs. Both issues could have been mitigated by asking the original authors of the PRDs to:

1. Classify their own work. They must know whether their query describes an implemented concept or not.

2. Elaborate on their own queries. The thesis author found it difficult to determine the scope of a query, which meant that determining which documents of a search result were relevant to the query was difficult.

Unfortunately, the original authors of the PRDs were no longer available or it was impossible to get them on board with this project.

7.4 Conclusion

In this second experiment, the classification model misclassified a significant portion of the unseen data, unlike in the first experiment (§6.5), where almost the entire test set was classified correctly. This chapter reveals that the presented classification model is sensitive: even a single sample in the training set with a SumLPS@k value deviating from the average can affect the performance of the classification model on unseen data. Such outliers can be the result of errors in human judgment, which were not, and could not have been, mitigated due to the unavailability of multiple human experts.


8 Conclusion

8.1 Introduction

In this chapter, we will discuss threats to both internal and external validity, provide an overall conclusion, and present ideas for future work.

8.2 Threats to internal validity

In §7.3 we already discussed that human mistakes could affect the performance of the classification model. The threat to internal validity shares this viewpoint: when collecting training and test data, the thesis author might have suffered from selection bias. Because of the thesis author’s intimate knowledge of the LIS and the domain, specific paragraphs from existing documentation that he thought were highly mappable were chosen. This is especially the case for the two SaaS integration guides, since these documents contain method names exposed by external web services. These web service methods are consumed by the LIS and are thus present in the code base. This might have led to overly optimistic LPS values.

8.3 Threats to external validity

The way in which the PRD was constructed has an impact on the accuracy of the mapping between the requirements and the LIS. In this specific case, the requirements in the PRDs were compiled using data collected from group sessions. Each group session consisted of 8 to 10 participants from different departments (except software development) who had to anonymously enter requirements on a laptop computer. The entered requirements were visible to all participants for discussion. No restrictions were placed on the vocabulary or on the number of terms used to describe a requirement. In situations with fewer participants, participants with more intimate knowledge of the LIS, a more standardized vocabulary, and/or guidance on the minimum number of terms, the classification results would probably be different, if not more accurate, than what is presented in this case study.

8.4 Conclusion

In §1.2 we asked the following research question:

Can the requirements documentation of a redesigned system (RS) be mapped automatically to the source code of an existing system (LIS)?

The results of this case study suggest that this is possible using a custom IR application built with Lucene. A contrast in recall and precision was observed between queries taken from the requirements documentation of the RS and descriptions found in documentation used for the implementation of the LIS, with the latter outperforming the former.


The research question originated from another question: can the completeness of the requirements documentation of a RS be determined automatically? In §1.2 we described an automated approach to determine the degree to which the LIS is described in the requirements documentation of a RS. The classification model was created for this purpose, and the results suggest that it was able to correctly classify existing requirements with classification rates of 79% and 98%. However, it was less successful at classifying new requirements, achieving rates of 21% and 33%. Based on these findings, the automated approach is feasible if 1) the requirements documentation of the RS describes existing requirements only, or 2) existing and new requirements are labeled so that only the existing requirements are used.

8.5 Recommendation for future work

During the research, more questions were raised than could be answered. Here we propose some future work as an extension to this master's thesis:

Larger training and test set: using a larger training and test set would allow us to verify that the performance of the classification model in this case study was not merely due to chance.

Reducing the precision and recall performance gap: a significant gap in precision and recall was observed between the two types of documentation. Reducing this gap could lead to a classification model that achieves consistent classification rates for both existing and new requirements in the requirements documentation of a RS. One possible way is to reduce the high recall by applying the AND operator between certain combinations of terms in a query, as sketched below.
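As an illustration of this second direction, the sketch below combines query terms with the AND operator using Lucene's classic BooleanQuery API (pre-5.0, in line with the Lucene versions current at the time of writing). The field name "contents" and the example terms are hypothetical, and the IR application described in this thesis may construct its queries differently.

// Sketch only: requiring all terms to co-occur (AND) instead of the default OR behaviour.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AndQuerySketch {

    /** Every term becomes a MUST clause, i.e. the terms are combined with AND. */
    static Query allTermsRequired(String field, String... terms) {
        BooleanQuery query = new BooleanQuery();
        for (String term : terms) {
            query.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.MUST);
        }
        return query;
    }

    public static void main(String[] args) {
        // With SHOULD clauses (OR), a document containing only "invoice" would still match;
        // with MUST clauses, all three terms have to be present in a document.
        Query query = allTermsRequired("contents", "invoice", "export", "customer");
        System.out.println(query); // +contents:invoice +contents:export +contents:customer
    }
}

Requiring all terms to co-occur trades recall for precision, which is exactly the trade-off this future-work item proposes to investigate.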


9 Appendix A: Statistics source control

9.1 Source control statistics

Statistics extracted from the source control system. The column 'Years involved' was added by the thesis author based on the number of years the programmer was involved with the LIS, directly or indirectly. The names of the source code contributors have been anonymized, except for that of the thesis author.

Contributor   Lines added   Lines deleted   Code Churn Count   Lines modified   Total Churn   Total Lines   Years involved
Alpha         9.1k          8.1k            160                1                17.3k         974           4
Beta          125.8k        6.5k            594                3.1k             135.5k        119.3k        1
Gamma         361.2k        13.8k           2.5k               9.4k             384.6k        347.3k        4
Delta         9.5M          238.5k          24.2k              2k               9.8M          9.3M          < 1
Epsilon       3.3M          1.1M            14.5k              26.8k            4.4M          2.2M          1
Huy           31.1M         18.5M           173.6k             117k             49.8M         12.6M         6
Digamma       27.6k         3.7k            1.1k               4.6k             35.9k         23.9k         6
Zeta          27.9k         614             54                 39               28.5k         27.3k         0.5
Eta           4.7M          112k            18.4k              25.6k            4.8M          4.5M          1 – 2
Theta         14.1k         11.6k           338                634              26.5k         2.5k          4
Iota          670           74              48                 345              1k            596           < 1
Kappa         849.8k        1.3M            6.5k               22k              2.1M          -466.2k       2


9.2 LOC

The subject system has been measured with CLOC (Count Lines of Code) version 1.58, an open source command line tool [31].

Language Files Blank Comment Code

VB.NET 1530 63862 57619 283478

C# 434 7052 6092 41581

Total 1964 70914 63711 325059

Table 2: LOC of the LIS

Because CLOC doesn’t parse the code, it has problems processing comment lines properly in tricky situations such as this:

printf(" /* ");
for (i = 0; i < 100; i++) {
    a += i;
}
printf(" */ ");

Example taken from http://cloc.sourceforge.net/#Limitations

From the thesis author's experience, this type of tricky comment does not occur in the code base. CLOC doesn't include empty files; there were 44 empty .cs and .vb files in the code base.

9.3 Lucene corpus statistics

Description File count

Total C# and VB.NET files 2008

Empty source code files (ignored) 44

Auto-generated designer files (ignored) 483

AssemblyInfo files (ignored) 134

Total indexed source code files 1385
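As an illustration, the sketch below shows how such a corpus selection could be implemented: walk the code base, keep only .cs and .vb files, and skip the empty files, auto-generated designer files and AssemblyInfo files listed in the table above. The root path and helper names are hypothetical, and the actual IR application may select its corpus differently.

// Sketch only: selecting the source files to index, mirroring the exclusion
// rules in the table above. The root path is hypothetical.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CorpusSelectionSketch {

    static boolean isSourceFile(Path path) {
        String name = path.getFileName().toString().toLowerCase();
        return name.endsWith(".cs") || name.endsWith(".vb");
    }

    static boolean isExcluded(Path path) throws IOException {
        String name = path.getFileName().toString().toLowerCase();
        return Files.size(path) == 0                   // empty source code files
                || name.contains(".designer.")         // auto-generated designer files
                || name.startsWith("assemblyinfo");    // AssemblyInfo.cs / AssemblyInfo.vb
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get("path/to/lis");          // hypothetical location of the code base
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> corpus = paths
                    .filter(Files::isRegularFile)
                    .filter(CorpusSelectionSketch::isSourceFile)
                    .filter(path -> {
                        try {
                            return !isExcluded(path);
                        } catch (IOException e) {
                            return false;              // unreadable files are left out of the corpus
                        }
                    })
                    .collect(Collectors.toList());
            System.out.println("Files to index: " + corpus.size());
        }
    }
}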


10 Bibliography

[1] The Standish Group International, Inc., "The Chaos Report," The Standish Group International, Inc., 1995.

[2] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia and E. Merlo, "Recovering traceability links between code and documentation," IEEE Transactions on Software Engineering, pp. 970-983, 2002.

[3] A. Marcus and J. I. Maletic, "Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing," in Proceedings of the 25th International Conference on Software Engineering (ICSE '03), 2003.

[4] W. Fan, M. D. Gordon and P. Pathak, "A generic ranking function discovery framework by genetic programming for information retrieval," Information Processing & Management, vol. 40, no. 4, pp. 587-602, 2004.

[5] A. Marcus, A. Sergeyev, V. Rajlich and J. I. Maletic, "An information retrieval approach to concept location in source code," in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), 2004.

[6] G. Salton, E. A. Fox and H. Wu, "Extended Boolean Information Retrieval," Communications of the ACM, vol. 26, no. 11, pp. 1022-1036, 1983.

[7] N. Wilde, M. Buckellew, H. Page, V. Rajlich and L. Pounds, "A comparison of methods for locating features in legacy software," Journal of Systems and Software, 2003.

[8] M. L. Nelson, "A survey of reverse engineering and program comprehension," ODU CS 551 - Software Engineering Survey, 1996.

[9] R. Dupuis, Software Engineering Body of Knowledge, IEEE Press, Piscataway, NJ, USA, 2004, p. 93.

[10] A. R. Hevner, R. C. Linger, R. W. Collins, M. G. Pleszkoch and G. H. Walton, "The impact of function extraction technology on next-generation software engineering," Technical Report CMU/SEI-2005-TR-015, Software Engineering Institute, Carnegie Mellon University, 2005.

[11] M. Storey, "Theories, methods and tools in program comprehension: Past, present and future," in Proceedings of the 13th International Workshop on Program Comprehension (IWPC 2005), 2005.
