
Locating Features in Source Code

A Quantitative Evaluation

Vincent Jong

vincent.jong@student.uva.nl

August 29, 2018, 33 pages

Academic supervisor: Clemens Grelck

University: Universiteit van Amsterdam

Company supervisors: Kevin Bankersen

Host organisation: KPMG

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Master Software Engineering


Contents

Abstract

1 Introduction
  1.1 Context
  1.2 Problem Definition
  1.3 Research Questions
  1.4 Research Method
  1.5 Thesis structure

2 Background
  2.1 Feature
  2.2 Feature location
  2.3 Feature location techniques
    2.3.1 Dynamic Techniques
    2.3.2 Static Techniques
    2.3.3 Hybrid Techniques
    2.3.4 Other Techniques
  2.4 Comparing feature location techniques
    2.4.1 Benchmarking

3 Overview Feature Location Tools
  3.1 Existing tools
  3.2 Available tools
    3.2.1 FLAT3
    3.2.2 Featureous
    3.2.3 STRADA

4 Benchmarking Feature Location Tools
  4.1 Metrics
  4.2 Data analysis tooling
  4.3 Benchmarking the Feature Location Tools
    4.3.1 FLAT3
    4.3.2 Featureous
    4.3.3 STRADA
  4.4 Using tools in a combination
  4.5 Answering research questions

5 Conclusion
  5.1 Conclusions
  5.2 Threats to validity
  5.3 Future Work


Abstract

In this thesis, we evaluate several tools implementing feature location techniques and analyze their strengths and weaknesses. Our research is primarily based on a previous study by Dit et al., who have proposed a benchmark that can be used to evaluate feature location techniques. A feature location technique (FLT) is a technique which is able to pinpoint the parts of the source code in a software system that are responsible for a feature. Researchers have proposed many techniques based on static, dynamic or textual analysis and combinations thereof.

However, there has not been much investigation into the performance of these techniques relative to each other using a single set of data and a large number of features. The results of this research enable future studies to compare their techniques with existing ones and provide a basis for choosing a technique that better suits one's needs.

We tested the tools FLAT3, Featureous and STRADA and assessed their performance based on 333 features across the systems jEdit, ArgoUML, muCommander and JabRef. Our results show that these tools are good at finding 66% to 89% of the implementations of features, but they also find a large number of methods that are not part of a feature. Furthermore, small features (in this work defined as features with fewer than six methods) are easier to locate than larger features. Combining techniques helps in eliminating false positives from the results, and we believe future studies into various combinations could help improve feature location techniques even further.


Acknowledgement

I would first like to thank my supervisor from the University of Amsterdam, Clemens Grelck, for providing help and feedback over the course of this thesis. His door was always open whenever I had a question about my research or writing. Without his input, I would not have been able to write this thesis.

Second, I want to thank KPMG and the team in the Digital Enablement department. It was a pleasure to be able to work with everyone. And I would like to specifically thank Kevin Bankersen, who was always available whenever I was stuck or had questions and also provided feedback on my thesis.

I want to also thank Sangam Gupta who was also doing his thesis at KPMG. We often helped each other and it was very helpful to have someone to spar with about ideas.

And finally my gratitude goes to Andrzej Olszak and Alexander Egyed, the developers behind Featureous and STRADA respectively. They graciously gave me access to their tools and helped me whenever I had questions or ran into problems.


Chapter 1

Introduction

1.1 Context

This thesis is written at KPMG’s Digital Enablement department. One of the services that they provide is designing and developing software, mainly focusing on software that supports business processes. KPMG provides support for their software in terms of service-level agreements, which often involves software maintenance tasks.

Our work will focus on researching techniques that support developers in software maintenance tasks such as bug fixes. These techniques aid them by helping them find parts in their software that are relevant to a certain feature or bug. Several discussions with members of the development team at the Digital Enablement department led us to believe that these techniques could be quite useful in their daily work. It would be interesting to know which techniques work well.

There is a field of research that aims to develop such techniques, known as feature location techniques. In our literature study we find that a lot of feature location techniques exist, but research in this field has not made strides in providing evaluations of such techniques. In fact, a study done by Revelle [15] is the only one we found that attempts to do so. They tested ten feature location techniques that make use of various combinations of static, dynamic and textual analysis. Our work differs from theirs in the number of features that were tested. They tested eight issues, which limits the generalizations that can be drawn from their results. We examine a total of 333 issues, as can be seen in table 2.1.

1.2 Problem Definition

Dit et al. [4] have provided an overview of the field of feature location in the period of 1991 to 2013 by collecting the various work done and categorizing it. One of their observations was that this field suffers from a lack of evaluations among feature location techniques, which made it hard to assess their effectiveness. They argue that this is due to the different software or different versions of the same software researchers use to evaluate their techniques.

The problem we are studying is the lack of evaluation of feature location techniques using the same dataset. Many studies of feature location techniques use a case study on one particular system to get an idea of the performance of a technique, and many of these studies use different systems in their research. This makes it hard to suggest one technique over another because there is no common ground on which these techniques were tested and compared. This problem only grows as more techniques are developed, and new techniques are still being created, such as the work done by Moslehi et al. [12] and Chochlov et al. [8].


1.3 Research Questions

Our overarching question is as follows:

What are the strengths and weaknesses of current feature location techniques that are available as tools?

When we think about the strengths and weaknesses of techniques that can locate (part of) features, we wonder how complete the results are. It would be ideal if feature location techniques could locate complete implementations of features, or at least large portions of them. From the results it would be possible to see if this is the case and to identify aspects in which such techniques need to improve. We also want to know if feature location techniques are better at finding features that do not have large implementations than features that do take up a large part of the source code. We think that small features are easier to find because the amount of source code that a technique would need to find is smaller as well. And lastly, we want to know how easy it is to use the tools. Tools can be good, but if the effort required to use them is too great or if the requirements are too limiting, then they are unlikely to be used. Thus, to answer our main research question we focus on the following sub-questions:

RQ1 How do the available tools implementing a feature location technique perform using the data set by Dit et al.?

RQ2 Do small features have better precision and recall results than big features?

RQ3 What are the characteristics of the features that the tools are (un)able to find?

RQ4 What was the experience using the tools? Does it take a lot of effort to use the tools?

1.4 Research Method

Dit et al. have created a single set of data as a benchmark which could be used to evaluate a feature location technique. This benchmark includes a set of software packages and, for each package, a collection of issues with a set of methods that belong to each issue (gold set), descriptions (queries) and execution traces of those issues. Whereas Dit et al. did provide the benchmark, they did not continue to use it to perform an evaluation, as their research focused on identifying the work done in this field.

In this study we pick up where they left off by taking several tools implementing a feature location technique and evaluating them using the benchmark. We also focus on the programming language Java for two reasons: 1) most tools support Java and 2) the software packages in the benchmark are all developed in Java. Our method consists of establishing the metrics that are used in the field of feature location, collecting tools that are available and the benchmark that Dit et al. provide. The tools are used on the software systems provided in the benchmark and data on each tool's performance is collected in the form of these metrics.

These tools are then used on a number of features to test whether the implementations of these features can be found. To be able to do this we needed a list of features and, for each feature, a set of source code that implements it. This is where the data set provided by Dit et al. comes in. Using this, it is possible to compare for each feature the methods that the tool is able to find and the methods that it ideally should find. This performance can be shown in metrics such as precision and recall.


1.5 Thesis structure

The rest of this thesis is structured as follows: in chapter 2 we provide insight into the field of feature location. We begin by explaining what is meant when talking about a feature and the activity of locating features. Then we dive into feature location techniques and explain the different types that exist. Next we look at some tools that implement feature location techniques and how they aid a developer. After that we outline the benchmark that is used, which software systems it contains and how information about these systems is used.

In chapter 3 an overview of existing feature location tools is provided. These tools are able to analyze systems developed in the programming language Java. Not all known tools were used in this research, for reasons we also mention there. Then we take an in-depth look at the tools we were able to obtain (FLAT3, Featureous and STRADA) and outline their general workings and any other requirements to use these tools.

In chapter 4 all the necessary parts to be able to test feature location tools are explained. We go into the metrics we use and some additional tools that we use for data analysis. After that, the outcomes of the tests are provided and an analysis of the results is made.

We close in chapter 5 with our conclusions, in which we answer the research questions, discuss some threats to the validity of our work and give recommendations for future work.


Chapter 2

Background

Software maintenance activities can often comprise up to 80% of a software project development life cycle [9, 10]. These activities are usually corrective, adaptive or perfective [19] in nature, but all of them require locating the parts of the code that are relevant to the change that needs to be made. This can be a time-consuming and tedious task for a programmer if the source code has to be studied manually and checked over and over until it is clear what needs to be done. This is even more difficult if one is not familiar with the system and the code base of the system itself is large. Researchers have long recognized this and have suggested many techniques to assist a developer in understanding a system by providing a feature-oriented view of the system. These techniques are called feature location techniques (FLTs).

2.1 Feature

What constitutes a feature is not universally defined or agreed upon in the literature. Wilde et al. [20] mention that a user sees a system as a collection of functionalities which are the features. Other studies [7,11,3] consider features to be observable behavior that can be initiated by a user, i.e. functional requirements. Rohatgi et al. [17] have a similar notion in that they define a feature as a scenario that can be triggered by an external user, and Dit et al. [4] view features as functionalities that are defined by requirements and can be accessed by developers and users. In all of these definitions the common themes are functionality that a system offers to the outside world and a user who can initiate that functionality.

2.2 Feature location

Software maintenance tools such as data flow tracers, slicers and call graph analyzers help a programmer understand a system once he finds a relevant starting point. However, not many tools are able to aid the programmer in locating this starting point in the first place. Feature location is about finding this initial foothold in the source code.

2.3 Feature location techniques

Feature location techniques make use of the various sources of information about a software system (source code, documentation, execution, comments, issue trackers etc.) and analyze them to make the connection between features and source code. It should be noted that feature location techniques are based on the assumption that the existence of the features is known beforehand (e.g. I know system x has a login feature and I want to know where in the code this login feature is) and are not suited to discover features in an arbitrary system (e.g. I want to know what the features of system x are).


Researchers have proposed many techniques using various ideas such as capturing the behavior during run-time, creating dependency models by parsing source code and in more recent work looking at audio and video material (e.g. tutorials) of software. These techniques can be categorized as dynamic, static, textual and other types [4]. Dynamic feature location techniques perform analysis while a system is running. This is usually done by running test cases or scenarios of features, exemplified in the work done by Wilde et al. [20]. Static feature location techniques make use of static sources of information of a system (e.g. source code) and analyze them. Dependency graphs [2, 17, 7] and identification of relevant parts by looking at identifier names, comments etc. (textual analysis) fall within this category.

2.3.1 Dynamic Techniques

Wilde et al. [20] defined a method for tracing features of a system to the components that implement those features, which they called Software Reconnaissance. This method is a dynamic analysis technique which can aid program comprehension. It utilizes test cases to generate traces by executing features. Resulting traces are compared to identify components that are specific to a feature. This technique is the basis for many subsequent studies done using techniques developed for feature location. The limitation of this technique lies in having to carefully select test cases according to the functionality they exhibit and therefore requiring some knowledge of the system. And while it is good at finding components pertaining to a particular feature, it does not guarantee that it locates all components that are important for that feature [20]. Being a dynamic analysis technique it also suffers from the limitation that the coverage of the system is only as good as the code that is executed [5].
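To make the comparison of traces concrete, the following minimal Python sketch illustrates the set-difference idea behind comparing traces that do and do not exercise a feature. The trace data and component names are hypothetical, and the sketch is only an approximation of the approach described by Wilde et al., not their actual implementation.

# Minimal sketch of the trace-comparison idea behind Software Reconnaissance.
# The traces below are hypothetical; in practice they would come from
# instrumented runs of test cases with and without the feature of interest.

def feature_specific_components(traces_with_feature, traces_without_feature):
    """Return components that only appear when the feature is exercised."""
    executed_with = set().union(*traces_with_feature)
    executed_without = set().union(*traces_without_feature)
    return executed_with - executed_without

with_feature = [
    {"Parser.parse", "Sorter.sortByMonth", "Util.getMonthNumber"},
    {"Parser.parse", "Sorter.sortByMonth"},
]
without_feature = [
    {"Parser.parse", "Sorter.sortByYear"},
]

print(feature_specific_components(with_feature, without_feature))
# prints {'Sorter.sortByMonth', 'Util.getMonthNumber'}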

2.3.2 Static Techniques

Robillard et al. [16] proposed a structure-oriented representation of a program in a graph called a Concern Graph. This method consists of taking the classes, methods and fields of a program and their relationships and dividing them in terms of sets and their dependencies. Some limitations of this technique are related to not being able to deal with dynamic binding and calls that cannot be detected in a static manner.

2.3.3 Hybrid Techniques

Eisenbarth et al. [7] have proposed a technique to locate computational units related to features based on static and dynamic analysis. They use test cases (defined as scenarios) to identify computational units for relevant features using dynamic analysis. Using concept analysis, a mathematical method for describing binary relations, they derive the specific units for a feature and the jointly and distinctly required units for a set of features. This information is used in a subsequent static analysis phase to more accurately search a dependency graph for any computational units that are relevant for that feature. This technique is, similar to Software Reconnaissance, limited in needing a domain expert to be able to set up the scenarios to run.

2.3.4 Other Techniques

One of the most recent developments is the work done by Moslehi et al. [12]. They proposed a method to locate components that implement a feature using screencasts as input rather than the traditional input. They pull this information by recognizing keywords in the audio and video and matching those that correspond with identifiers, comments, string literals and file names used in the source code. Some limitations of this study lie in how much useful information can be found in screencasts, since the choice of words will largely affect the results. Furthermore, the presence of textual cues in the source code is necessary. However, this is the first study of its kind and, while there are limitations, the results are promising.


2.4 Comparing feature location techniques

Since feature location techniques can be based on very different approaches such as static or dynamic, it is quite hard to compare them. However, feature location techniques output sets of source code since that is their purpose. Using this as a basis, it is possible to compare how good these techniques are at locating features. Our approach is then to compare these techniques based on their output. To be able to determine how well they do this, we need data.

This data would have to be the following:

• A set of features that we want to locate
• A set of source code for each feature

Using this data, it is possible to benchmark feature location techniques.

2.4.1 Benchmarking

Dit et al. have made a data set publicly available for use. In this section we outline exactly what it contains.

Overview

The software packages we used as the data set are listed in table 2.1.

Name         Version  KLOC  No. of features
jEdit        4.3      104   150
muCommander  0.8.5    77    92
ArgoUML      0.24     155   52
JabRef       2.6      74    39

Table 2.1: Software packages in the benchmark

For each software package, Dit et al. have provided a collection of issues. To be able to collect information about these issues Dit et al. chose Java software systems with the following characteristics:

• uses SVN as the source code repository,

• has an ITS (issue tracking system) that keeps track of the change requests,
• has a subset of SVN log messages referencing IssueIDs,
• and optionally, the system allows collecting execution traces (i.e., the system is not a library that would make it difficult for a user to interact with it in order to collect execution traces).

For each issue the following information is given:

Short description: a textual description of the issue.
Gold set: a set of methods that is related to the issue.
Execution trace: a set of methods that was generated by executing a specific scenario.

An example of how this looks for a feature is shown in table 2.2.

To perform dynamic analysis we needed test scenarios. Test scenarios indicate the steps that need to be done while the technique traces what code was triggered. Dit et al. were able to generate execution traces by recreating the scenarios that were described in the issues. However, these scenarios are not included in the benchmark. We recreated these step-by-step instructions by also looking at the issue descriptions. To show an example we have added the scenarios for the system JabRef in appendix A.


This helps us ensure that performing dynamic analysis with this benchmark happens under the same circumstances as there is a chance that the issues could be interpreted differently, which can affect the results. This also takes the hassle of looking up the issues and devising scenarios away from future studies.

ID: 1535044
Short description: Month sorting
Query: month sorting
Gold set:
  net.sf.jabref.FieldComparator.FieldComparator(String,boolean)
  net.sf.jabref.FieldComparator.compare(Object,Object)
  net.sf.jabref.Util.toFourDigitYear(String)
  net.sf.jabref.Util.toFourDigitYear(String,int)
  net.sf.jabref.Util.getMonthNumber(String)
  tests.net.sf.jabref.UtilTest.test2to4DigitsYear()
  tests.net.sf.jabref.UtilTest.testToMonthNumber()

Table 2.2: Example issue from the benchmark (JabRef)


Chapter 3

Overview Feature Location Tools

A number of feature location tools exist, although not many. In this chapter we explain our search for available feature location tools, which tools we have found, which we have selected to use in this thesis and why the others were not suitable.

3.1 Existing tools

Although many feature location techniques exist, there are not many tools available. We primarily looked at the ones that Dit et al. identified, since they also listed feature location tools, but additionally searched for any tools that are popular or otherwise well known. As mentioned before, we also focused on tools that are suitable for Java. We identified eleven tools, compared to sixty existing techniques in the taxonomy and twenty-seven more after that [3].

3.2 Available tools

To be able to evaluate tools implementing feature location techniques, we needed to collect those that are available for use. Of the eleven tools mentioned in table 3.1, five were not available, one was not functional and four more did not fall within our definition of a feature location technique. These tools were developed to keep track of the learning process of a programmer while (s)he is studying a system and do not compute a set of source code that is relevant for a feature. Regarding the tools that were not publicly available, we attempted to obtain access by contacting the developers. One of the four responded that the tool was not available anymore [14] and the remaining three did not reply. The tools that we were able to obtain are STRADA, Featureous and FLAT3; they are listed in table 3.2.

This section goes through each tool and describes the general use case of the tool and requirements for using the tool.

3.2.1 FLAT3

FLAT3 [18] is an Eclipse plugin which implements a textual feature location technique based on the Lucene library [1]. A user provides the tool with a query, which is then used to locate source code that uses words similar to the query. The results of FLAT3 have an additional property, a probability score, and FLAT3 ranks its output by this score. To calculate the score, the source code of the software package is indexed. This means that for every method a set of the words used in the method is made. This collection is then converted into vectors along with the query. Then similarities between those vectors are determined and a score is assigned based on that similarity [18]. An example of the results can be seen in figure 3.1.
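To illustrate the general idea of such a textual technique, the following Python sketch ranks methods by a simple term-frequency cosine similarity to the query. FLAT3 itself relies on Lucene, so the actual scoring differs, and the method signatures and bodies used here are hypothetical.

# Simple illustration of textual feature location: each method is reduced to a
# bag of words, the query is compared to every method, and the methods are
# ranked by cosine similarity. This approximates the idea, not FLAT3's scoring.
import math
import re
from collections import Counter

def tokenize(text):
    # Split camelCase identifiers and punctuation into lowercase words.
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return [w.lower() for w in words]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, methods):
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(body))), name) for name, body in methods.items()]
    return sorted(scored, reverse=True)

# Hypothetical method bodies keyed by their signatures.
methods = {
    "Util.getMonthNumber(String)": "int getMonthNumber(String month) { return months.indexOf(month); }",
    "FieldComparator.compare(Object,Object)": "compare entries by field value for sorting",
    "Demo.demo1()": "void demo1() { print hello }",
}
for score, name in rank("month sorting", methods):
    print(f"{score:.2f}  {name}")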

To be able to use FLAT3 the source code of a target system is required, as the system needs to be indexed.


Tool name                     Analysis type                        Available  Link                                   Note
STRADA                        Dynamic                              Yes        –                                      Obtained through personal communication with the developer
Featureous                    Dynamic                              Yes        http://featureous.org/                 –
FLAT3                         Textual                              Yes        http://www.cs.wm.edu/semeru/flat3/     –
ConcernMapper                 Other                                Yes        https://www.cs.mcgill.ca/~martin/cm/   Does not output source code
TopicXP                       Other (Latent Dirichlet Allocation)  Yes        http://www.cs.wm.edu/semeru/TopicXP/   Not operational
JIRiSS                        Textual                              No         –                                      Attempted to obtain through personal communication with the developer
Suade                         Static                               No         –                                      Attempted to obtain through personal communication with the developer
Google Eclipse Search         Textual                              No         –                                      Google Desktop Search is discontinued
Cognitive assignment Eclipse  Textual                              No         –                                      Does not output source code
JRipples                      Static                               No         –                                      Does not output source code
FEAT                          Other                                Yes        https://www.cs.mcgill.ca/~swevo/feat/  Does not output source code

Table 3.1: Feature location tools identified


Name        Analysis type  Version  Link
Featureous  Dynamic        4.0      http://featureous.org
FLAT3       Textual        0.2.4    http://www.cs.wm.edu/semeru/flat3/
STRADA      Dynamic        –        Personal communication

Table 3.2: Feature location tools used in the benchmark

We also had to modify the source code slightly to export its findings to a file.

Figure 3.1: Screenshot output of FLAT3

3.2.2 Featureous

Featureous [13] is a dynamic feature location technique written in Java and available as a plugin for the Netbeans IDE. This tool provides several feature-centric views based on execution traces. Featureous creates these execution traces by following a program’s execution by having a programmer annotate parts of the source code. From their website: ”After a programmer recovers a list of a program’s features from its documentation, or its graphical interface, she has to annotate each feature’s ”entry points” in a program’s source code. Feature entry points are the methods through which a program’s control flow enters implementations of a feature. Annotated program is then transparently instrumented with a tracing aspect when a programmer executes ”Trace project” action provided by Featureous. After a user triggers features in the interface of an instrumented program, the information is saved in form of feature-traces, which serve as an input to further analysis.”.

Furthermore, a test case or scenario is necessary in which a particular feature is executed. Therefore, the code that is annotated must reflect the feature that is executed. To find these annotation points, they provide the following guideline: ”A good heuristic for starting to place the feature entry point annotations is to find the action listeners that get invoked when you activate a feature of your program by clicking a button somewhere in the main menu or press a key on the keyboard. When this is not enough, i.e. if some features get triggered somewhere else than by the GUI (e.g. some special emulation features like maybe ”render fonts”), you would have to go deeper in the code to identify the right places.”. While this works well for features that can be initiated from a user interface, for features that are initiated internally or otherwise not available through the interface it can be a bit of manual search work. In short, to be able to use this tool we primarily need the test cases and to annotate certain parts of the source code.

Featureous, however, also does not directly support exporting the execution traces. To achieve this, we created a small script that we could execute within the Beanshell console available in Featureous.


Figure 3.2: Screenshot output of Featureous

3.2.3 STRADA

STRADA [6] is a dynamic feature location technique which traces the execution of a program. STRADA provides several views while testing a feature scenario. The Trace Capture component observes the execution of a test scenario. The Trace Analysis component visualizes the current knowledge on feature-to-code mapping in form of a trace matrix. Unlike Featureous, this tool does not require modifying source code in any way and instead observes what code is accessed during execution. The test cases are necessary to perform specific execution.

We also had to modify STRADA slightly so that the output is exported to a file.


Chapter 4

Benchmarking Feature Location Tools

In this study, the performance of FLAT3, Featureous and STRADA has been established by examining 333 issues across the systems jEdit, muCommander, ArgoUML and JabRef. This performance is shown by looking at the precision and recall metrics. In this chapter the metrics used are explained in more detail by describing what they represent and how they are calculated. Then we go into additional tools we have developed to make the process of testing the tools easier, specifically automatically gathering the input for the tests and processing the output of the tools. Then for each tool, the results are provided and an analysis is made.

4.1 Metrics

To find out how we might compare feature location techniques, we conducted a literature study on similar work that aimed to compare feature location techniques. Some studies that proposed a new technique also compared their results to pre-existing feature location techniques; these were analyzed as well. However, the metrics that were used were often only suited to those particular types of techniques. We want metrics that are suited to comparing all types of feature location techniques. Two metrics proposed in, amongst others, the field of information retrieval are precision and recall.

Precision shows the number of correct results relative to the number of results found. This can be described as:

    Precision = no. of true positives / (no. of true positives + no. of false positives)

Recall shows the number of correct results found relative to the total number of correct results, i.e. how complete the found correct results are. This can be described as:

    Recall = no. of true positives / no. of correct results

To show how this works in an example, let us consider the feature 'Month sorting' shown in table 2.2. Now let us say that a feature location technique X was able to find the methods shown in table 4.1. Feature location technique X found 3 methods that are in the gold set. Precision in this case would be 3 / 6 (50%) and recall would be 3 / 7 (42.86%). Ideally both metrics are 100%: for recall this would indicate that the results are complete and for precision this would mean that there is no noise (false positives) in the results.


ID: 1535044
Short description: Month sorting
Methods found:
  net.sf.jabref.FieldComparator.FieldComparator(String,boolean)
  net.sf.jabref.FieldComparator.compare(Object,Object)
  net.sf.jabref.Util.toFourDigitYear(String)
  demo.Demo.demo1()
  demo.Demo.demo2()
  demo.Demo.demo3()

Table 4.1: Example data found with feature location technique X
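The calculation for this example can also be expressed directly in a short Python sketch, which recomputes the numbers above from the gold set of table 2.2 and the output of the hypothetical technique X.

# Recompute precision and recall for the 'Month sorting' example.
gold_set = {
    "net.sf.jabref.FieldComparator.FieldComparator(String,boolean)",
    "net.sf.jabref.FieldComparator.compare(Object,Object)",
    "net.sf.jabref.Util.toFourDigitYear(String)",
    "net.sf.jabref.Util.toFourDigitYear(String,int)",
    "net.sf.jabref.Util.getMonthNumber(String)",
    "tests.net.sf.jabref.UtilTest.test2to4DigitsYear()",
    "tests.net.sf.jabref.UtilTest.testToMonthNumber()",
}
found = {
    "net.sf.jabref.FieldComparator.FieldComparator(String,boolean)",
    "net.sf.jabref.FieldComparator.compare(Object,Object)",
    "net.sf.jabref.Util.toFourDigitYear(String)",
    "demo.Demo.demo1()",
    "demo.Demo.demo2()",
    "demo.Demo.demo3()",
}

true_positives = found & gold_set
precision = len(true_positives) / len(found)   # 3 / 6 = 50%
recall = len(true_positives) / len(gold_set)   # 3 / 7 = 42.86%
print(f"precision = {precision:.2%}, recall = {recall:.2%}")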

4.2 Data analysis tooling

The data for each software package in the benchmark is divided into separate files according to the issue ID. So for the short and long descriptions and gold set there are separate files with the issue ID included in their file name; this is done for all issues. We made a small tool which is able to read all IDs and search for their appropriate information in the separate files automatically. Furthermore, the tool is also capable of reading the files that are exported by the feature location tools (found methods) and calculate the metrics by comparing them to the gold sets.

Figure 4.1: Overview data analysis tooling

From the output of the tools we extract the methods found. Then we compare the methods found to the methods that are in the gold set and calculate properties such as:

• Precision and recall

• No. of false and true positives
• No. of methods found

• Methods missed

We then save this information in Microsoft Excel files. We used Python and the Pandas library to analyze the results. Pandas is an open-source Python package which provides flexible data structures and data analysis tools.
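A minimal sketch of this pipeline is shown below. The directory layout and file names are simplified assumptions rather than the exact benchmark format, and the metric computation follows the definitions from section 4.1.

# Sketch of the data analysis tooling: read per-issue gold sets and tool output,
# compute the metrics per issue and collect everything in a pandas DataFrame.
# File names and directory layout are simplified assumptions.
from pathlib import Path
import pandas as pd

def read_methods(path):
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}

def evaluate(gold_dir, output_dir):
    rows = []
    for gold_file in Path(gold_dir).glob("*.txt"):          # e.g. 1535044.txt
        issue_id = gold_file.stem
        gold = read_methods(gold_file)
        found = read_methods(Path(output_dir) / f"{issue_id}.txt")
        tp = gold & found
        rows.append({
            "issue": issue_id,
            "found": len(found),
            "true_positives": len(tp),
            "false_positives": len(found - gold),
            "precision": len(tp) / len(found) if found else 0.0,
            "recall": len(tp) / len(gold) if gold else 0.0,
            "missed": sorted(gold - found),
        })
    return pd.DataFrame(rows)

# Example usage (hypothetical paths):
# df = evaluate("benchmark/jabref/gold_sets", "results/flat3/jabref")
# df.to_excel("flat3_jabref.xlsx")
# print(df[["precision", "recall"]].describe())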


4.3 Benchmarking the Feature Location Tools

To answer the research questions, we benchmarked the tools FLAT3, Featureous and STRADA using the metrics from the previous section and the additional data analysis tools. We go through each tool and present our findings and provide our analysis. We do this by using the following structure for each tool:

• Performance tool - this refers to our first research question and pertains to the results in terms of precision and recall.

• Small features compared to large features - this section covers our second research question about whether feature location techniques are better at finding small features than big features. To be able to define what passes as a small feature, we took from the data set by Dit et al. the average number of methods per feature, which in this case is 6.

• Features found compared to features not found - this refers to our third research question about whether there are any similarities between the features that were found and not found.

• Usage evaluation - this covers our last research question and addresses our experience using the tools.

4.3.1 FLAT3

Performance tool

We show the results by providing the average, the minimum, maximum, middle (median) and most occurring (mode) values for this tool. As can be seen in table 4.2, FLAT3 does not seem to be able to reach high precision at all, scoring an average of 0.95%. The maximum precision that FLAT3 reaches is 66.67%, meaning that there has never been a case where only the methods in the gold set were found; in the best case scenario, two-thirds of the results are the right methods. The most recurring precision is 0%, showing that the tool fails to find the right results more often than it succeeds. FLAT3 does much better in finding complete results, as can be seen by the recall metric, reaching 71.14% complete results on average. And more often than not, it is capable of finding complete results, as both the maximum and the mode are 100%. When counting the occurrences of recall results, the top three results are 100% (158 times), 0% (31 times) and 50% (19 times).

On average, FLAT3 outputs 1143 methods of which three are also present in the gold set. This means that a lot of noise is also found when searching with this tool. This is not ideal, since it would mean that over 1100 methods would have to be examined to find the relevant three. However, the results in table 4.2 consider all methods that FLAT3 finds and do not take into account the ranking that FLAT3 performs. In FLAT3, each method in the result is assigned a relevancy score and the result is sorted in descending order. Search engines such as Google also operate in this manner, showing the most likely relevant results on top. Generally speaking, when using Google most people will not analyze every result found. In that case it would be ideal if the top items are the sought-after items. Applying the same idea to FLAT3, it would mean that the methods in the gold set ideally appear at the top of the list. So taking this ranking into account, we took another look at the results. We also varied the number of top methods considered, looking at the top five, ten and fifteen methods. These results can be seen in table 4.3.

Metric         Mean   Min   Max    Median  Mode
Precision (%)  0.99   0.00  66.67  0.21    0.00
Recall (%)     70.77  0.00  100    83.33   100

Table 4.2: Precision and recall data of FLAT3 on benchmark systems

Now considering the ranking, FLAT3 seems to do better when looking at precision, as it is around ten times the precision with all methods included: with the top five methods the precision is 10.77%, with the top ten methods 8.08% and with the top fifteen methods 6.71%.


Cutoff  Metric         Mean   Min   Max    Median  Mode
Top 5   Precision (%)  10.77  0.00  100    0.00    0.00
        Recall (%)     15.65  0.00  100    0.00    0.00
Top 10  Precision (%)  8.08   0.00  90     0.00    0.00
        Recall (%)     20.45  0.00  100    0.00    100
Top 15  Precision (%)  6.71   0.00  73.33  0.00    0.00
        Recall (%)     23.91  0.00  100    0.00    0.00

Table 4.3: Benchmark results for ranked output of FLAT3

There is a decline, however, when including more and more methods. So the highest average precision is achieved when looking only at the top methods, and then only at the top five. This is good news, as the amount of code that needs to be studied is less.

However, this is a different story for recall, as it plummets to as low as around 15%. Yet there is a rise when looking at the recall of the top ten and top fifteen methods: it goes up by 4.8% from the top five to the top ten, and by another 3.46% from the top ten to the top fifteen methods. The more methods we include, the lower the precision but the higher the recall. This indicates that beyond the top five methods, the next five methods contain fewer relevant methods than false positives, and the same is true if we include five more methods (the top fifteen methods). To check if this could be seen in the results, we took a look at the average ratio of true positives and false positives between the top five, ten and fifteen methods. For the top five the ratio is 0.5 : 4.45, for the top ten it is 0.78 : 9.16 and for the top fifteen it is 0.96 : 13.92. And indeed this seems to be the case: the average number of true positives rises, resulting in higher recall, and the average number of false positives rises as well, resulting in lower precision.
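The cutoff-based figures can be computed with a small helper like the following Python sketch; the ranked list and gold set shown here are hypothetical.

# Precision and recall when only the top-k ranked methods are considered.
# 'ranked' must be ordered from most to least relevant according to the tool.

def precision_recall_at_k(ranked, gold_set, k):
    top = ranked[:k]
    tp = set(top) & gold_set
    precision = len(tp) / len(top) if top else 0.0
    recall = len(tp) / len(gold_set) if gold_set else 0.0
    return precision, recall

ranked = ["m1", "m7", "m2", "m8", "m9", "m3", "m10", "m11", "m12", "m4"]
gold_set = {"m1", "m2", "m3", "m4", "m5", "m6"}

for k in (5, 10, 15):
    p, r = precision_recall_at_k(ranked, gold_set, k)
    print(f"top {k:2d}: precision = {p:.2%}, recall = {r:.2%}")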

Small features compared to large features

As previously mentioned, we took the average number of methods per feature as our threshold for a big feature; we consider small features to be features with fewer than 6 methods. Furthermore, we again looked at the top five, ten and fifteen methods for each feature.

In table 4.4 we can see the precision and recall for small and large features when we consider the top five, ten and fifteen methods found for each feature. In terms of precision, larger features consistently have the higher score compared to small features: 18.97%, 15.43% and 13.12% compared to 7.34%, 5.01% and 4.03%. Conversely, smaller features have higher recall compared to larger features: 18.67%, 23.31% and 26.81% compared to 8.41%, 13.58% and 23.91%. This indicates that it is easier to find more complete results for small features, but the results also contain more noise compared to large features.

Cutoff  Feature size  Metric         Mean   Min   Max    Median  Mode
Top 5   Small         Precision (%)  7.34   0.00  66.67  0.00    0.00
                      Recall (%)     18.67  0.00  100    0.00    0.00
        Large         Precision (%)  18.97  0.00  100    20      0.00
                      Recall (%)     8.41   0.00  50     1.10    0.00
Top 10  Small         Precision (%)  5.01   0.00  66.67  0.00    0.00
                      Recall (%)     23.31  0.00  100    0.00    0.00
        Large         Precision (%)  15.43  0.00  90     10      0.00
                      Recall (%)     13.58  0.00  83.33  6.25    0.00
Top 15  Small         Precision (%)  4.03   0.00  66.67  0.00    0.00
                      Recall (%)     26.81  0.00  100    0.00    0.00
        Large         Precision (%)  13.12  0.00  73.33  6.67    0.00
                      Recall (%)     23.91  0.00  100    0.00    0.00

Table 4.4: Benchmark results of FLAT3 for small and large features


Features found compared to features not found

We consider a feature found if its recall is 100%, and not found if its recall is 0%. We go through each system and present our findings.

For ArgoUML, FLAT3 was able to fully locate twenty-seven features. One notable result is a feature concerning the multiplicity of an association in a UML diagram; this feature has fourteen methods in the gold set. The gold sets of the remaining features had on average two methods. FLAT3 was not able to fully locate five features, but nothing about these stands out. ArgoUML contained a total of fifty-two features.

From a total of thirty-nine features in JabRef, FLAT3 was able to fully locate nineteen features. These features have a lot to do with opening files, saving and exporting; features that could be considered to be used quite often. Only one feature in JabRef had a recall of 0%, and this feature had one method that needed to be found. Notably, the keyword for the corresponding feature was used in the query, yet the method was not found even though it contains this keyword in its name.

Then we go on to jEdit, which has 150 features. FLAT3 found seventy features completely; fifteen features were not found. A lot of the found features were related to the search function within jEdit, but some features related to the search function also appear within the set of features not found. Both sets of features even contained the same keywords.

The last system, muCommander, has ninety-two features. FLAT3 found forty-four completely and did not find ten features. Of the ten features not found, one is a large feature containing 104 methods in the gold set.

Generally speaking, the features that FLAT3 is able to find do not seem to follow any sort of pattern. The sizes of the features are pretty much the same across the features found and not found.

Usage evaluation

Using FLAT3 does not require much from the user. Essentially, once the target system is imported into Eclipse and indexed, the only step to use FLAT3 is to fill in a query and start the search. FLAT3 automatically shows the results in a window. The results are shown as an ordered table, and each record can be clicked to instantly open the editor at the location of the chosen record.

4.3.2 Featureous

Performance tool

From table 4.5 we can see that Featureous is also not very good when it comes to achieving high precision, scoring an average of 4.78%. The highest that Featureous is able to reach is 33.33%; compared to the maximum precision that FLAT3 reaches, Featureous reaches half of that. The most recurring precision is 0%, showing that the tool fails to find the right results more often than it succeeds. One thing that is noticeable from the results is that Featureous does not always find a method when tracing an execution. One possible explanation is that the scenarios do not execute the methods. This seems unlikely, however, because we have based our scenarios on the descriptions of the features, which often included step-by-step instructions that we also used. Another explanation is that the annotations of the methods are placed at incorrect locations. This is quite possible and can certainly influence the results. Ultimately, we were unable to exactly pinpoint the reason for this result.

Metric         Mean   Min   Max    Median  Mode
Precision (%)  4.82   0.00  33.33  0.76    0.00
Recall (%)     66.49  0.00  100    66.67   100

Table 4.5: Precision and recall data of Featureous on benchmark systems

Small features compared to large features

In table 4.6 we can see the precision and recall for small and large features. The precision for small features is lower compared to large features, with a difference of almost 4%. This difference is most likely due to the difference in the number of methods found: for small features this is 318 methods on average and for large features 142 methods. While it may seem that too many methods are found for small features, it should be noted that the number of small features is three times the number of large features. This causes the average number of methods found to be high.

The recall for small features is much higher than it is for large features. We expected this to be the case. The average amount of methods to be found for small and large features is two and ten respectively. And Featureous is able to achieve 100% recall for small features most of the time. For large features Featureous usually finds about a third of the methods. From this we can see that Featureous is quite good at locating small features.

Feature size  Metric         Mean   Min   Max    Median  Mode
Small         Precision (%)  3.91   0.00  33.33  0.49    0.00
              Recall (%)     74.70  0.00  100    100     100
Large         Precision (%)  7.96   0.00  33.33  4.65    0.48, 19.15
              Recall (%)     38.72  0.00  60     33.33   33.33

Table 4.6: Benchmark results of Featureous for small and large features

Features found compared to features not found

For ArgoUML, Featureous was able to fully locate twenty-one features and was not able to fully locate five features. Quite some features that FLAT3 found completely were also found with Featureous, but the features not found were not the same between the two tools. One explanation for this could be that the methods that were found had more to do with the user interface of ArgoUML and so have a higher chance of being used or are easier to find. ArgoUML contained a total of fifty-two features.

From a total of thirty-nine features in JabRef, Featureous was able to fully locate six features. All of these features are small features. The number of methods found for these features was also relatively low (the lowest being four methods), which means their precision is not bad. Eleven features in JabRef had a recall of 0%. The one method that FLAT3 did not find was also not found with Featureous. Furthermore, four of these features are what we consider large features, peaking at twenty methods. For one feature the number of methods found was smaller than the number of methods in the gold set.

Then we go on to jEdit, which has 150 features. Featureous found seventy-five features completely, but ten features were not found. Similar to FLAT3, a lot of the found features were related to the search function within jEdit.

The last system, muCommander, has ninety-two features. Featureous found thirty-six completely and did not find thirteen features. A portion of the methods not found with FLAT3 is also present in this set.

Generally speaking, the features that Featureous is able to find are small features that are closely related to the user interface. It is also noticeable that the methods that are not found are methods that are in the test environment of the project. There are, for example, classes that implement unit tests and test the methods of a feature. FLAT3 was able to find these methods more reliably because their names are similar to those of the relevant methods.

Usage evaluation

Using Featureous requires quite some effort from the user. Parts of the code have to be annotated so that Featureous knows from where to trace. This means that the user still has to do some work in identifying the right locations. The heuristic provided by the developers of Featureous is indeed a helpful starting point, but for some features it was still not trivial to find the right locations. Testing this tool required the most time by far out of the three tools, as the other tools did not require diving into the source code. Another part that requires effort is setting up each target system to be able to build and run, since Featureous has to trace the execution of the systems. Our final remark is that tracing mechanisms have some influence on the performance of the systems. We did experience some slow-down, which means it took more time to test the features.

4.3.3 STRADA

We mentioned earlier that STRADA is quite old and requires Eclipse 3.1. Eclipse 3.1 supports Java only up to version 1.5. Therefore it was not possible to test the systems jEdit, JabRef and muCommander, which all require at least Java version 1.6 to build and run because they rely on some API calls that were introduced in version 1.6. So for STRADA we were only able to test ArgoUML, since this system is able to run completely on version 1.5. We thought it best to still include our findings in this thesis.

Performance tool

As can be seen in table 4.7, STRADA has some extreme results. Precision is the lowest of the three tools, but the recall is also the highest of all the tools. STRADA is a tool that traces every method from the start of the application to the end of the application. The low precision could be explained by the fact that on average 1577 methods were found during execution. STRADA achieving high recall means it finds near-complete features a lot of the time. Compared to the recall of Featureous, STRADA performs a lot better. However, Featureous does not trace entire executions of scenarios and instead only traces part of the execution. We also notice that STRADA always finds relevant methods, as recall never drops below 66.67%.

Metric         Mean   Min    Max   Median  Mode
Precision (%)  0.28   0.13   1.15  0.86    0.31
Recall (%)     89.12  66.67  100   83.33   100

Table 4.7: Precision and recall data of STRADA on ArgoUML

Small features compared to large features

In table 4.8 we can see the precision and recall for small and large features in ArgoUML. Here we see that the average precision for small features is lower than that of large features. Similarly to Featureous, we suspect that this is due to the higher number of small features compared to large features, which results in a lower precision because the average number of methods found is higher. Recall for small features is quite high: almost all methods in the gold set are found most of the time. This is not so much the case for large features, although the difference is not that large, at 6.74%.

Feature size  Metric         Mean   Min    Max   Median  Mode
Small         Precision (%)  0.23   0.13   1.15  0.31    0.31
              Recall (%)     93.50  87.50  100   100     100
Large         Precision (%)  0.55   0.21   1.15  0.75    0.68
              Recall (%)     86.76  75     100   100     100

Table 4.8: Benchmark results of STRADA for small and large features

Features found compared to features not found

For ArgoUML, STRADA was able to fully locate forty-three features. As seen by the minimal recall in table 4.7, STRADA always finds at least two-thirds of a feature's implementation, and so there are no features that STRADA was not able to find. There appears to be no trend in the features that could be found fully. With the tracer marking everything that is executed, the chance of missing methods was expected to be low, and this does seem to be the case a lot of the time.


Usage evaluation

In terms of effort, using STRADA falls in the middle of the three tools. Like with Featureous, the target systems have to be able to build and run, but that is the bulk of the effort. After that, it is a matter of executing the steps in the test scenarios and letting STRADA do the rest. It is unfortunate that the tool does not work with any version of Eclipse other than 3.1.

4.4 Using tools in a combination

So far we have given the results of each individual tool. However, we may wonder what the results would look like if we combined tools. We have therefore experimented with the combination of FLAT3 and Featureous, since they could fully test the systems in the benchmark. In this configuration, the idea is to use Featureous to find methods from the execution of a test scenario and filter them with the search capability of FLAT3. However, we were unable to directly feed the output of Featureous as input for FLAT3. So, to still get some idea of how this combination could work, we have taken the sets of methods of both tools and taken the intersection of these sets as the output of the combined tools, the idea being that any method found is present in both sets, since filtering from Featureous would only leave the methods that are also found by FLAT3. This is not ideal since we do not really know which methods FLAT3 would provide with its search, but it should give a glimpse of the performance.
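Concretely, the combination was approximated by intersecting the two result sets per feature and recomputing the metrics, as in the following sketch; the method sets shown here are hypothetical.

# Approximate the FLAT3 + Featureous combination by intersecting their result
# sets and recomputing the metrics.

def metrics(found, gold_set):
    tp = found & gold_set
    precision = len(tp) / len(found) if found else 0.0
    recall = len(tp) / len(gold_set) if gold_set else 0.0
    return precision, recall

featureous_found = {"m1", "m2", "m5", "m9", "m10", "m11"}
flat3_found = {"m1", "m2", "m3", "m9", "m12"}
gold_set = {"m1", "m2", "m3", "m4"}

combined = featureous_found & flat3_found   # only methods reported by both tools
print("combined:  ", metrics(combined, gold_set))
print("featureous:", metrics(featureous_found, gold_set))
print("flat3:     ", metrics(flat3_found, gold_set))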

Metric         Mean   Min   Max  Median  Mode
Precision (%)  15.37  0.00  100  3.45    0.00
Recall (%)     44.56  0.00  100  28.57   100

Table 4.9: Results when combining FLAT3 and Featureous

Precision is much higher, as we can see, but the average recall drops to an unfavorable 44.56%. Compared to the generally high recall of the individual tools (66% to 89%), the combination leaves something to be desired. We expected some increase in precision since we are cutting down the size of the results, but seeing recall become that much worse is a little surprising. It could be the result of the recall of Featureous being the lowest of the three tools; since we are using the data of Featureous as a starting point, filtering from there could make the overall recall much worse.

4.5 Answering research questions

Now with the data in hand, what do we learn? In this section we provide the answers to our research questions.

How do the available tools implementing a feature location technique perform using the dataset by Dit et al.?

A pattern in the results of the three tools is that precision is low while the recall is high. What this means is that the tools are successful in finding the implementations of features but also find a lot of extra baggage. This low precision could be due to several reasons.

For FLAT3, we think it is related to the way it searches for relevant methods. We know that FLAT3 creates a set of words for each method by indexing the source code, and that to calculate the relevancy it compares the query to the collection of sets of words. Methods usually use a number of words to express the intent behind them, and some features might use some of the same words in their methods, especially if the developers use certain conventions or a certain style in their code. Since the systems in the benchmark are not hobby projects, we believe that this is indeed the case. So although some methods are not related to the feature, they can still appear in the results because the words that are used are also present in the query.


For Featureous and STRADA, we think the low precision is caused by the tracing picking up many methods that are executed but are not specific to the feature. Some features include opening a window with various options while the focus is on one particular part of that window. One instance is the ”Preferences” option: in the application JabRef, instances for various types of options are instantiated, each with their own background operations. And since the test scenario includes opening this ”Preferences” option, all these methods appear in the trace even though the scenario continues with only one tab.

Do small features have better precision and recall results than big features?

Small features consistently have higher recall than large features, meaning that it is easier to find near-complete implementations of small features. This is what we expected, as we theorized that small features should be easier to locate. What we did not expect was that small features also have lower precision. This is in part related to the results overall: high recall and low precision is the theme, as also made clear in answering the first research question, so the same trend is apparent when splitting the same results into groups such as small and large features. However, we believe it is still interesting to analyze, since the difference in recall amounts to as much as 35% between small and large features. What we attribute the lower precision to is the number of small features compared to large features: small features outnumber large features almost three to one. This causes the average precision for small features to be worse when in fact it could be just as high as that of large features, especially since the median values are closer to each other.

What are the characteristics of the features that the tools are (un)able to find?

The number of features for which absolutely nothing was found is relatively small for each application tested, around 10% of the 333 features. Unfortunately, there is no real pattern in the features that were not found. Both small and large features appeared in this set, the number of methods found by a feature location tool varied, and the features were not overly complicated or simple. The same can be said for features that were found successfully (recall at 100%). One thing that was noticeable is that most of these features are small features with fewer than six methods in the gold set. This reinforces our conclusion that these tools are better at locating small features.

What was the experience using the tools? Does it take a lot of effort to use the tools?

In terms of setup, all three tools were no problem to install; the instructions included with the tools make this process easy. In terms of ease-of-use, FLAT3 requires the least amount of steps and effort. Opening the search window, typing a query and starting the search is all that needs to be done. It does require the source code of the application to be available, since it needs to create an index of the project. We do not think this is a big hurdle, since feature location tools are likely to be used in software maintenance tasks, which often include access to the source code.

Featureous can be quite complicated to use, since parts of the source code (feature entry points) need to be annotated before Featureous can trace executions. The time it takes to dive into the source code and find the right spots to annotate can vary from relatively short to pretty long. It sometimes seems counter-intuitive to study the source code to annotate it and then have the tool trace the execution of the code that was just studied. However, the time it takes to search for the starting points of features could be inconsequential compared to studying whole features, especially when the features are large. Another important factor to consider with Featureous is that the applications have to be able to run. That means considering the requirements of the applications in terms of the version of Java required and also the version of the IDE that is used.

STRADA has the same challenges as Featureous in terms of getting the systems to run. The challenge of finding an IDE that supports the Java version a system needs proved too much for a few systems in our benchmark, and since STRADA is quite old, it was not possible to use an IDE that both the tool and the systems supported. One thing that makes STRADA easier to use than Featureous is that the source code does not have to be inspected, which saved quite some time in testing this tool.


Chapter 5

Conclusion

In this chapter we present our conclusions, discuss some threats to the validity of this research and provide recommendations for future work.

5.1

Conclusions

In this study we have evaluated three tools implementing feature location techniques, assessing their performance using the precision and recall metrics and the dataset provided by Dit et al. [4].

Our work makes the following contributions:

• Evaluation of tools implementing feature location techniques using a single data set and a significant number of features,

• Extending the original benchmark of Dit et al. with test scenarios that are suited to test dynamic feature location techniques.

To the best of our knowledge, this is one of the first studies to evaluate feature location techniques using a single data set with a large number of features. We hope future studies can improve this benchmark or apply it to assess their own techniques.

In this thesis we have shown that the tools FLAT3, Featureous and STRADA are quite good at finding the methods that implement a feature. However, they struggle to exclude methods that are not relevant to a feature: for every tool we tested, hundreds to thousands of methods were reported, which makes it difficult to know which of them are the correct ones.

FLAT3 tries to solve this problem by also ranking the results, and we have tested this by looking at a selected number of the top methods that FLAT3 outputs. However, from our results we conclude that looking only at the top methods means that many of the relevant methods will be missed, as they appear lower on the ranking list. If this ranking could be improved, the results for the top methods would also improve. After all, it is not that the missed methods are not found at all; they are just not at the top of the list.
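As an illustration of what "looking only at the top methods" means for the metrics, the sketch below computes precision and recall when only the first k ranked results are inspected. The class and method names are ours and not part of FLAT3 or our benchmark tooling; methods are assumed to be identified by their fully qualified signatures.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: evaluate a ranked result list against a feature's
// gold set when only the top k entries are inspected.
public final class TopKMetrics {

    // Fraction of the inspected top-k results that are in the gold set.
    public static double precisionAtK(List<String> ranked, Set<String> gold, int k) {
        int limit = Math.min(k, ranked.size());
        long hits = ranked.subList(0, limit).stream().filter(gold::contains).count();
        return limit == 0 ? 0.0 : (double) hits / limit;
    }

    // Fraction of the gold set that appears within the inspected top-k results.
    public static double recallAtK(List<String> ranked, Set<String> gold, int k) {
        int limit = Math.min(k, ranked.size());
        Set<String> top = new HashSet<>(ranked.subList(0, limit));
        long found = gold.stream().filter(top::contains).count();
        return gold.isEmpty() ? 0.0 : (double) found / gold.size();
    }
}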

Some methods might appear at the top simply because the words they use are more similar to the query, so methods that are in fact more relevant to the feature appear lower on the list. This could be improved by including more factors when calculating relevancy, such as (combined) word occurrence or structural dependencies, where methods that lie in the flow of calls being made are taken into account. Dynamic feature location techniques such as those implemented in Featureous and STRADA should, in theory, find the relevant methods most of the time. They do show this capability, although we should not overlook the extra effort it takes to be able to perform dynamic feature location.
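A minimal sketch of the re-ranking direction suggested above: mixing the textual similarity score with a structural bonus for methods that are connected, through calls, to other highly ranked methods. This is not how FLAT3 works; the weight and the logarithmic dampening are our own assumptions, purely for illustration.

// Illustrative only: combine a textual similarity score with a structural
// signal (how many call-graph neighbours of the method are themselves
// ranked highly). The weight and the dampening are arbitrary choices.
public final class CombinedScore {

    public static double score(double textualSimilarity,
                               int highlyRankedNeighbours,
                               double structuralWeight) {
        return textualSimilarity
                + structuralWeight * Math.log1p(highlyRankedNeighbours);
    }
}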


The effort required for dynamic feature location grows further if the tool requires a certain IDE and that IDE does not support the version of the programming language that the system needs in order to run.

Combining tools to locate features has been shown to decrease the number of false positives, but also to decrease the number of relevant methods found. Although it seems like combining tools does not yield the best results, we believe it is definitely worth exploring various combinations. We have tested one specific combination, using a rather novel way to combine the tools instead of a true integration.
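As an example of the kind of combination meant here (the class and parameter names are illustrative and not taken from our tooling): keeping only the methods reported by both a textual search and a dynamic trace trades recall for precision, which matches the effect we observed.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: intersect the method sets reported by two tools.
// Methods reported by both are more likely to be true positives; methods
// reported by only one tool are dropped, which can also lose some relevant
// methods.
public final class ToolCombination {

    public static Set<String> agreedMethods(Set<String> textualResults,
                                            Set<String> dynamicTraceMethods) {
        Set<String> agreed = new HashSet<>(textualResults);
        agreed.retainAll(dynamicTraceMethods);
        return agreed;
    }
}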

5.2

Threats to validity

Threats to the validity of our work mainly concern the variation that is possible in the input for feature location. For FLAT3, for example, the input consists of a query, and there are many possible ways to construct one. We have tried to mitigate this by looking at previous work on textual feature location and formulating our queries in the same way. Variation is also possible in the test scenarios. Here we have likewise looked at previous work on how to construct them; our approach consisted of reading the descriptions of the issues and recreating the test scenarios from there.
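To make the possible query variation concrete, the sketch below shows one simple way to derive a query from an issue title by lower-casing, splitting on non-alphanumeric characters and dropping a few stop words. The stop-word list and method names are illustrative assumptions, not an exact reproduction of how our queries were built.

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: turn an issue title such as "Freezes when exiting
// Import dialog box by Cancel button" into a search query. The stop-word
// list is a small example, not the one used in our experiments.
public final class QueryBuilder {

    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "of", "to", "in", "by", "when", "not");

    public static String fromIssueTitle(String title) {
        return Arrays.stream(title.toLowerCase().split("[^a-z0-9]+"))
                .filter(word -> !word.isBlank() && !STOP_WORDS.contains(word))
                .collect(Collectors.joining(" "));
    }
}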

5.3

Future Work

One direction of future work is improving the benchmark, for which we see two possibilities. The first is adding metrics for different types of feature location techniques. In our results we can see the strengths and weaknesses of different types of techniques, but we cannot really say anything about techniques within the same category. For example, we could ask which dynamic feature location technique performs better. Although that might be read from precision or recall, having metrics that better characterize the performance of dynamic feature location techniques would make it possible to distinguish them more clearly.

The second improvement would be automating the process of benchmarking feature location techniques. We have taken a semi-automated approach in which the data is automatically gathered and prepared for use, the tools themselves are operated manually, and processing the output of the tools is again automated. The manual steps can be time-consuming, and automating them as well would make the benchmark easier to use.

Another direction for future work is studying how well such tools help a programmer in practice. Our study focuses on establishing how well the techniques work as feature location techniques, but do they actually help a programmer spend less time and effort on learning how a system works before changes can be made? Or do they perhaps even steer a programmer in the wrong direction? Some researchers have tried to establish an effort metric for textual feature location techniques, in which the ranking of the results indicates how much effort must be spent before a relevant piece of source code is studied.
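One common formulation of such an effort measure in the literature is simply the rank of the first relevant result. The sketch below is a minimal illustration of that idea, with class and method names of our own choosing.

import java.util.List;
import java.util.Set;

// Illustrative sketch: the number of ranked results a developer has to
// inspect before reaching the first method that is actually part of the
// feature.
public final class EffortMetric {

    public static int effort(List<String> ranked, Set<String> gold) {
        for (int i = 0; i < ranked.size(); i++) {
            if (gold.contains(ranked.get(i))) {
                return i + 1;          // 1-based rank of the first relevant hit
            }
        }
        return -1;                     // no relevant method was ranked at all
    }
}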


Bibliography

[1] Apache Lucene Core. http://lucene.apache.org/core/index.html. [Online accessed: 10-July-2018].

[2] Kunrong Chen and Václav Rajlich. Case study of feature location using dependence graph. In Program Comprehension, 2000. IWPC 2000. 8th International Workshop on, pages 241–247. IEEE, 2000.

[3] Muslim Chochlov, Michael English, and Jim Buckley. A historical, textual analysis approach to feature location. Information and Software Technology, 88:110–126, 2017.

[4] Bogdan Dit, Meghan Revelle, Malcom Gethers, and Denys Poshyvanyk. Feature location in source code: a taxonomy and survey. Journal of software: Evolution and Process, 25(1):53–95, 2013.

[5] Neil Dupaul. Static testing vs. dynamic analysis. https://www.veracode.com/blog/2013/12/static-testing-vs-dynamic-testing. [Online accessed: 05-April-2018].

[6] Alexander Egyed, Gernot Binder, and Paul Grünbacher. STRADA: A tool for scenario-based feature-to-code trace detection and analysis. In Companion to the Proceedings of the 29th International Conference on Software Engineering, pages 41–42. IEEE Computer Society, 2007.

[7] Thomas Eisenbarth, Rainer Koschke, and Daniel Simon. Locating features in source code. IEEE Transactions on Software Engineering, 29(3):210–224, 2003.

[8] Daiki Fujioka and Naoya Nitta. Constraints based approach to interactive feature location. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on, pages 499–503. IEEE, 2017.

[9] Robert L Glass. Frequently forgotten fundamental facts about software engineering. IEEE software, 18(3):112–111, 2001.

[10] Jeff Hanby. Software maintenance: Understanding and estimating costs. http://blog.lookfar.com/blog/2016/10/21/software-maintenance-understanding-and-estimating-costs/. [Online accessed: 19-April-2018].

[11] Andrew Le Gear, Jim Buckley, Brendan Cleary, JJ Collins, and Kieran O'Dea. Achieving a reuse perspective within a component recovery process: an industrial scale case study. In Program Comprehension, 2005. IWPC 2005. 13th International Workshop on, pages 279–288. IEEE, 2005.

[12] Parisa Moslehi, Bram Adams, and Juergen Rilling. Feature location using crowd-based screencasts. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR '18, pages 192–202, New York, NY, USA, 2018. ACM.

[13] Andrzej Olszak and Bo Nørregaard Jørgensen. Featureous: a tool for feature-centric analysis of Java software. In Program Comprehension (ICPC), 2010 IEEE 18th International Conference on, pages 44–45. IEEE, 2010.


[15] Meghan Revelle and Denys Poshyvanyk. An exploratory study on assessing feature location techniques. In Program Comprehension, 2009. ICPC’09. IEEE 17th International Conference on, pages 218–222. IEEE, 2009.

[16] Martin P Robillard and Gail C Murphy. Concern graphs: finding and describing concerns using structural program dependencies. In Proceedings of the 24th international conference on Software engineering, pages 406–416. ACM, 2002.

[17] Abhishek Rohatgi, Abdelwahab Hamou-Lhadj, and Juergen Rilling. An approach for mapping features to code based on static and dynamic analysis. In Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on, pages 236–241. IEEE, 2008.

[18] Trevor Savage, Meghan Revelle, and Denys Poshyvanyk. FLAT3: feature location and textual tracing tool. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, pages 255–258. ACM, 2010.

[19] E Burton Swanson. The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, pages 492–497. IEEE Computer Society Press, 1976.

[20] Norman Wilde and Michael C Scully. Software reconnaissance: Mapping program features to code. Journal of Software Maintenance: Research and Practice, 7(1):49–62, 1995.

Appendix A

Test scenarios for dynamic feature location

A.1

JabRef

ID | Short description | Scenario

1535044 | Month sorting | Start app. Click "Options"->"Preferences"->"Entry table". Start trace. In primary sort criterion select "year". In secondary sort criterion select "month". Click "Ok". Stop trace. Exit app.

1538769 | Freezes when exiting Import dialog box by Cancel button | Start app. Start trace. Click "File"->"Import into new database"->"Cancel". Stop trace. Exit app.

1540646 | default sort order: bibtexkey | Start app. Click "Options"->"Preferences"->"Entry table". Start trace. In primary sort criterion select "bibtexkey". Stop trace. Click "Cancel". Exit app.

1542552 | Wrong author import from Inspec ISI file | Start app. Start trace. Click "File"->"Import into new database". Select demo file. Stop trace. Exit app.

1545601 | downloading pdf corrupts pdf field text | Start app. Double click any record. Go to tab "General". Start trace. In File field, click "Download". Fill "http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf". Click "Ok"->"Ok". Stop trace. Exit app.

1548875 | download pdf produces unsupported filename | Start app. Double click any record. Go to tab "General". Start trace. In File field, click "Download". Fill "http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf". Click "Ok"->"Ok". Stop trace. Exit app.

1553552 | Not properly detecting changes to flag as changed | Start app. Double click any record. Start trace. Edit the "Author" field. Click "File"->"Save database". Stop trace. Exit app.

1219592 | Bibtex key generation: author or editor | Start app. Double click any record. Start trace. Click "Generate BibTeX key" button. Stop trace. Exit app.

1588028 | export HTML table doi url | Start app. Start trace. Click "File"->"Export". Fill name in and choose "HTML (*.html)" as filetype. Click "Save". Stop trace. Exit app.

1594123 | Failure to import big numbers | Start app. Start trace. Click "File"->"Open database". Choose "demodb2.bib". Stop trace. Exit app.

1594169 | Entry editor: navigation between panels | Start app. Double click any record. Start trace. Press Ctrl + Tab. Stop trace. Exit app.

1601651 | PDF subdirectory - missing first character | Start app. Click "Options"->"Preferences"->"External programs". In "Main file directory" browse for the directory where the provided sample data is. Click "Ok". Double click record "Feature location in source code: a taxonomy and survey". Go to tab "General". Start trace. In field "File" click button "Auto". Stop trace. Exit app.

1641247 | Minor–No update after generate bibtex key | Start app. Double click any record. Start trace. Click "Generate BibTeX key" button. Stop trace. Exit app.

1648789 | Problem on writing XMP | Start app. Click "Options"->"Preferences"->"XMP metadata". Start trace. Check the "Do not write the following fields to XMP Metadata" checkbox. Click "Ok". Stop trace. Exit app.

1709449 | Clicking a DOI from context menu fails | Start app. Start trace. Click the icon "Open DOI web link" of record "Software reconnaissance: Mapping program features to code". Stop trace. Exit app.

1711135 | BibTeX export error; missing space before line breaks | Start app. Click "File"->"Open database". Choose "demodb3.bib". Select any record. In field "Author" fill in "#JaneDoe# and #JohnDoe# and #JPMorgan# and #FooBar#" (without quotes). Start trace. Click "File"->"Export". Fill in filename and choose any filetype. Click "Ok". Stop trace. Exit app.

1285977 | Impossible to properly sort a numeric field | Start app. Reset the sorting if the records were sorted. You can see it next to the column names (arrow up, down and no arrow). Start trace. Click the column "Year". Stop trace. Exit app.

1749613 | About translation | Start app. Start trace. Click "Options"->"Preferences"->"External programs". Stop trace. Exit app.

1785416 | Linking with exact BibKey fails | Start app. Start trace. Click "Options"->"Preferences"->"External programs". Check radio "Autolink only files that match the BibTeX key". Click "Ok". Select record "Feature location in source code: a taxonomy and survey". Go to tab "General". In field "File" click "Auto". Stop trace. Exit app.

1827568 | "Save Database" Menu not working in Version 2.2 | Start app. Click any record and make any change. Start trace. Click "File"->"Save database". Stop trace. Exit app.

2027944 | Export filter: failed if directory changed | Start app. Start trace. Copy an export filter to any directory. Configure this export filter in JabRef (Options->Manage custom exports). Export a database using this filter. Rename the directory containing the filter. Reconfigure the export filter to account for the new directory name. Export the database. Stop trace. Exit app.

2105329 | New Entry: Main window doesn't get updated | Start app. Start trace. Click the + icon ("New BibTeX entry"). Choose any type. Stop trace. Exit app.

2119059 | First author in RIS format | Start app. Start trace. Click "File"->"Import into new database". Choose file "demodb4.ris". Stop trace. Exit app.

2904968 | Entry Editor does not open | Start app. Start trace. Double click on any record. Stop trace. Exit app.

2931293 | faill to generate Bibtexkeys | Start app. Start trace. Click "Options"->"Preferences"->"BibTeX key generator". Fill in Default pattern "[authorsAlpha][year]". Click "Ok". Double click any record. Click "Generate BibTeX key" button. Stop trace. Exit app.

1297576 | Printing of entry preview | Start app. Select any record. Start trace. Click "Options"->"Preferences"->"Entry preview". Click "Test". Right click and choose "Print Preview". Exit printing. Stop trace. Exit app.

1436014 | No comma added to separate keywords | Start app. Double click any record. Go to tab "General". Start trace. In field "Keywords" select the 2 keywords "lorem" and "ipsum". Stop trace. Exit app.
