
Master of Science: Software Engineering, Graduation Thesis

Graph-Based Defect Prediction of JavaScript Frameworks

Peter Bond

July 28, 2014

Supervisors: Magiel Bruntink (UvA), Oleksandr Volodymyrovych Murov (Avanade), Gijs Ramaker (Avanade)

MSc Software Engineering - Universiteit van Amsterdam


Abstract

Defect prediction may benefit software development by providing information on where defects likely reside. Such information can aid both managers and developers in allocating their resources to those modules or components which are most likely to contain defects. This potential has led to a considerable amount of research over the years, exploring various approaches to predicting defects. Some of these approaches have provided promising results.

One such approach is graph-based. This thesis explores the potential of some graph-based metrics for defect prediction applied to open-source JavaScript frameworks. Accordingly, this thesis aims to answer the following three research questions: i) can we predict defect-prone releases based on source code?, ii) can we predict defect-prone releases based on non-source code information?, and iii) can we predict defect-prone files based on source code information? The JavaScript frameworks Knockout, Angular and Ember were selected for analysis.

The results of this research do not conclusively support the use of the applied graph-based metrics to aid in defect prediction of JavaScript frameworks. However, both the quality and the quantity of the data may have influenced the results of this thesis, obscuring potentially valuable findings. Future research with better and more input data is warranted to validate these results. Some suggestions for future research are provided.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Thesis Overview
2 Background
  2.1 Defect Prediction
  2.2 JavaScript Frameworks
  2.3 GitHub Repositories
3 Research
  3.1 Framework Selection
  3.2 Metrics and Required Data
  3.3 Data Retrieval
  3.4 Data Storage
  3.5 Data Validation
  3.6 Data Processing
  3.7 Data Analysis
4 Results
  4.1 Knockout
  4.2 Angular
  4.3 Ember
5 Discussion
6 Conclusions
7 Future Research
Bibliography


1 Introduction

Considerable effort has been made by researchers worldwide to aid the software development process by means of code analysis. Various models have been proposed to support this process by predicting defects. Proper defect prediction models allow managers or software engineers to allocate their limited resources to those modules or components which are most likely to contain defects. As such, defect prediction models might reduce costs and increase the quality of software. However, defect prediction models have been thoroughly criticized throughout the years, with perhaps the most notable discussion on this topic fueled by Fenton and Neil in their paper published in 1999 [7]. Moreover, although improvements have been made in this field, defect prediction models still seem unreliable and, in some cases, hardly useful for developers, as demonstrated by a recent case study at Google [13]. As described in the paper by Fenton and Neil over a decade ago, researchers have come up with a great variety of approaches to predict defects, such as size and complexity metrics, testing metrics, process quality data and multivariate approaches. This trend has continued and recently led to the empirical finding that defect prediction models using static code attributes do not improve any further, which is indicative of a ceiling effect [18]. It is hypothesized that this ceiling effect is due to the limited information content of static code features; as such, the information content needs to be improved rather than the defect prediction models themselves.

A relatively new and promising approach to defect prediction is based on graph theory. Methods based on graph theory have been applied in a variety of disciplines, ranging from analyzing computer network vulnerabilities [19] to analyzing cellular networks in biology [2]. In essence, defect prediction models based on graph theory capture certain properties of code in a graph and subsequently apply metrics to it. A graph consists of a set of nodes which may be connected by edges. A graph is written as G(V, E), with V denoting the set of nodes and E the set of edges. Edges can be either directed or undirected, resulting in a directed or undirected graph, respectively.

One of the earlier efforts to apply graph theory with a focus on defect prediction is that of Turhan, Kocak and Bener [21]. In their research, they proposed a static call graph based ranking (CGBR) framework, which can be applied to any defect prediction model based on static code attributes. Their framework aims to increase the information content of static code attributes, as previous research by Menzies et al. has shown that defect predictors based on static code attributes have reached a performance ceiling [18]. Menzies et al. hypothesized that static code features have limited information content and provided empirical evidence supporting this notion; they concluded that further progress in defect predictors may not come from better algorithms, but rather from improving the information content of the training data. Turhan et al. strive to do just that with their CGBR framework. They tested the framework on three industrial software projects written in C, the biggest project encompassing 107 modules. On top of the framework, they applied a Naive Bayes predictor, which indeed performed better when fed the CGBR-adjusted data compared to the original data.

A later effort, published in 2012 and authored by Bhattacharya et al. [4], also applied graph-based metrics. They showed how graph-based analysis aided in estimating defect severity, prioritizing refactoring efforts and predicting defect-prone releases. They analyzed eleven large open-source software projects, including Firefox, Eclipse, MySQL and Samba. All these projects were written in either C or C++ (with the exception of Eclipse, which is written in Java) and the lifespan of these projects was typically a decade or more, thus providing a wealth of information. They concluded that graph topology analysis concepts can open many actionable avenues in software engineering research and practice.

This research builds further upon the work of Bhattacharya et al. by applying graph-based metrics to open-source JavaScript frameworks (JSFs) which are available on GitHub, such as the Knockout JSF. The objective of this thesis is to evaluate whether these graph-based metrics can predict defects in JSFs.

Further background information and discussion on the topics of defect prediction, JSFs and GitHub repositories is provided in the similarly named subsections in Section 2.


1.1 Research Questions

Questions were formed by loosely applying the goal question metric approach [5]. As laid out in Section 1, various models have been developed to aid the developer in tracking down defects. Such models allow for an improved allocation of resources and ideally lead to fewer defects in final releases of software, and consequently to more reliable software. However, research in the field of defect prediction for JSFs is severely lacking. This thesis therefore attempts to provide useful information to predict defects in JSFs. Ultimately, this might facilitate the defect fixing process of JSFs. Several questions address this goal, as listed below. With these questions in mind, previous work by Bhattacharya et al. [4] was chosen to provide suitable metrics to address them.

From the perspective of a manager, answering these questions would aid them to allocate resources, whereas for developers it would help them to track down defects faster.

• Goal: To predict defects in JSFs.

• Research question 1: Can we predict defect-prone releases based on source code?

• Research question 2: Can we predict defect-prone releases based on non-source code information?

• Research question 3: Can we predict defect-prone files based on source code information?

1.2 Thesis Overview

The reader is provided some background in the field of defect prediction, GitHub repositories and JavaScript frameworks (JSFs) in Section 2.

A list of JSFs which were selected for analysis can be found in Section 3.1. This section also includes the list of criteria they were matched on.

Section 3.2 describes what data was required from these JSFs for analysis, and which metrics were used.

How all the data was retrieved for this thesis is described in Section 3.3. Following this section, the storage of this data is described in Section 3.4 which includes a figure of the database schema. Measures undertaken by the author in order to validate the data are described in Section 3.5. Subsequently, the actual processing of the data (e.g. generation of call graphs) is described in Section 3.6. This section is followed by Section 3.7 which provides a quick overview of how the data analysis was performed.

Results are presented in Section 4 and are followed by a discussion in Section 5.

Finally, conclusions which can be drawn from this research are provided in Section 6 and some suggestions for further research in the field are made in Section 7.


2 Background

2.1 Defect Prediction

Quality assurance activities, such as tests or code reviews, are an expensive, but vital part of the software development process [15]. It is hypothesized that predicting defects assists in this process by reducing the effort necessary for searching for defects, as well as increasing reliability of the software, as more defects might be found (and fixed). However, defect prediction is not without its difficulties.

Over the past few decades, many efforts have been made to predict defects. This ongoing quest for better defect prediction models has not been without setbacks, as Fenton and Neil broadly discussed in their renowned publication "A Critique of Software Defect Prediction Models" [7]. Fenton and Neil showed that many methodological and theoretical mistakes were made by defect prediction researchers and critiqued state-of-the-art defect prediction models. Many of these models are based on size and complexity metrics, testing metrics or process quality data, or are multivariate in nature.

Many models relying on size or complexity metrics assume that defects are a function of size or are caused by program complexity. Indeed, much empirical research has shown correlations between these metrics and defects. However, Fenton and Neil stressed that correlation should not be confused with causation. Inherently, such models ignore the causal effect of programmers and designers, who, after all, are the ones who introduce defects in the first place. Moreover, high complexity might be a consequence of poor design ability or problem difficulty and subsequently lead to defects. As such, code complexity itself cannot be established as a causal factor based solely on correlations. Finally, defects can already be introduced before a single line of code is written, i.e. during the design stage.

Models relying on testing metrics might suffer from some additional drawbacks. These models rely on the code coverage of test cases, such as statement coverage. Highly complex software is also inherently difficult to test. It is therefore more than likely that defects will indeed correlate with testing metrics, yet this correlation again says nothing about causality.

Empirical evidence has revealed that multivariate approaches only require a handful of metrics in order to obtain the same accuracy as with more metrics. This can be explained by collinearity: the metrics capture the same underlying attribute, so adding more of them does not contribute to accuracy. Nevertheless, formulas capturing a 'handful of metrics' (e.g. 6) in order to predict defects provide almost no actionable information to developers. It would be hard for a developer to derive from such a formula what to alter in order to decrease the defect count.

Moreover, Fenton and Neil highlighted serious issues concerning the statistical methodology used by researchers in the field. Most notable is the use of linear regression models with variables that are clearly correlated, even though linear regression assumes zero correlation between the independent variables. Moreover, researchers have mostly attempted to fit their models to historical data, without further testing them on new data sets.

Not all defects are created equally

Fenton and Neil also underscored the unknown relationship between defects and failures. This difficult issue persists today and can be considered a twofold problem:

1. Defects differ in their severity.

2. Defects differ in their likeliness to manifest themselves.

In other words: not all defects are created equally. For the first point, it stands to reason that ten minor defects are far less important than one critical defect. However, it seems a daunting task to validate a model which can assess defect severity. In order to validate such a model, data is required which includes each defect's severity, and this relies on the judgment of the developer who reported it. There are valid grounds to doubt such data. In fact, Herzig et al. concluded that 33.8% of all defect reports were misclassified, based on manual examination of over 7,000 issue reports from five open-source projects [11]. If so many reports are misclassified as defects in the first place, how can we assume the classification of a defect's severity to be correct? As Menzies et al. rightfully mentioned, severity is too vague to investigate reliably [17]. Furthermore, a recent systematic review highlights the lack of studies incorporating defect severity into their measurements of predictive performance [10]: only 1 out of the 36 studies which met their review criteria applied defect severity in their model. Therefore, in order to take on the first part of the problem, i.e. defect severity, data needs to be manually examined following a systematic approach in order to accurately reflect actual defect severity. Nevertheless, some researchers have found that defect prediction models remain reliable at a noise level of 20% probability of incorrect links between defects and modules [20], so some noise in defect report data can be dealt with. Presumably, a similar threshold might hold for models coping with defect severity data.

The second point, a defect's likeliness to manifest itself, is put in perspective by a study performed at IBM. Adams examined nine large software projects in order to assess the relationship between defects and their manifestation as failures in terms of mean time to failure (MTTF) [1]. These projects each had many thousands of years of logged use worldwide and thus provided a wealth of data to clarify this relationship. The results demonstrated a clear example of the Pareto principle: only a fraction of the defects (around 2%) had an MTTF of less than 50 years, whilst 33% had an MTTF greater than 5,000 years. Moreover, these numbers can only be generated in retrospect, which makes evaluating a defect's likeliness to manifest itself a tedious task.

Defect prediction models relying on static code attributes do not improve

Nearly a decade after the publication by Fenton and Neil, a literature review by Menzies et al. demonstrated that defect prediction models relying on static code attributes have hit a "performance ceiling", i.e. some inherent upper bound on the amount of information offered by static code features when identifying modules which contain defects [18]. Menzies et al. hypothesized that static code features have limited information content and that this, in turn, leads to three predictions. The first two predictions stated that: i) information from static code features can be quickly and completely discovered by even simple learners, and ii) more complex learners will not find new information. For these predictions, Menzies et al. provided compelling empirical evidence supporting their hypothesis of limited information content. Their third prediction thus concludes that further progress in learning defect predictors will not come from better algorithms, but from improving the information content of the training data.

Graph-based defect prediction

Turhan, Kocak and Bener proposed a static call graph based ranking (CGBR) framework in order to increase the information content of static code attributes [21]. Their CGBR framework can be applied to any defect prediction model based on static code attributes. To increase the information content of static code attributes, the authors built static call graphs and subsequently assigned ranks to modules by means of their CGBR algorithm, which builds on the PageRank algorithm. They then used a Naive Bayes model for defect prediction. They evaluated their method on data from a large-scale white-goods manufacturer. Their results showed that the probability of false positives decreased significantly, while recall was retained. Moreover, they concluded that using their CGBR framework can improve testing efforts by 23%. Since their datasets came from projects implemented in a procedural language (C), the authors suspected that even better results might be obtained with object-oriented projects.

Further research which employed graph-based analysis was performed more recently by Bhattacharya et al. [4]. The authors used graphs to predict bug severity, maintenance effort and defect-prone releases. Additionally, these graph metrics capture significant events in the software lifecycle which sometimes go undetected by traditional metrics such as eLOC or McCabe's cyclomatic complexity [14]. For this purpose, the authors constructed graphs at two abstraction levels: the first at function level (call graph), and the second at module level (module collaboration graph). Furthermore, the authors constructed developer collaboration graphs.

2 Effective Lines of Code: all lines that are not comments, blanks, or standalone brackets.


These graphs lend themselves, as mentioned before, to predicting bug severity, maintenance effort and defect-prone releases, and are able to capture significant events in the software lifecycle that would otherwise go undetected.

Bhattacharya et al. applied their method to eleven open-source software programs, including Firefox, Eclipse, MySQL and Samba. This research, however, focuses on large open-source JavaScript frameworks (JSFs), with the Knockout JSF in particular. A complete list of these JSFs is provided in Section 3.1. The software analyzed by Bhattacharya et al. was all written in C or C++ (with the exception of Eclipse, which is written in Java). Contrary to these languages, JavaScript is dynamically typed, whereas C, C++ and Java are all statically typed (whether weakly or strongly). Additionally, literature on defect prediction models and their application to JavaScript is lacking. This thesis therefore attempts to break new ground in the field of defect prediction of JSFs.

2.2 JavaScript Frameworks

JavaScript is the most popular client-side scripting language for web applications [8, 9]. However, compared to languages such as Java and C#, relatively little tooling is available to assist with testing JavaScript code [3]. Static analysis of JavaScript code is difficult due to its dynamic semantics. As such, it is difficult to build call graphs for JavaScript code, yet call graphs are of pivotal importance for software maintenance tools and for integrated development environments (IDEs) with code analysis features. Fortunately, considerable progress has been made with on-the-fly call graph construction, which might be incorporated into IDE services. Feldthaus et al. proposed an approximate call graph constructor [6], which is also used to construct call graphs for the data processing in this thesis, as elaborated on in Section 3.6. The analysis of Feldthaus et al. is field-based, which means that it uses a single abstract location per property name and thus cannot distinguish between multiple functions that are assigned to properties of the same name. Moreover, it only tracks function objects and does not reason about non-function values. Additionally, it ignores dynamic property accesses, i.e. property reads and writes using JavaScript's bracket syntax. Although these three properties make the approach of Feldthaus et al. unsound in principle (it can miss calls), their evaluation showed it is quite accurate in practice.
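As an illustration of the field-based approximation (an invented example, not taken from the thesis or from Feldthaus et al.): because all properties named render share a single abstract location, a call through either object is attributed to both functions, while a call made through bracket syntax may be missed entirely.

// Illustrative only: how a field-based analysis treats same-named properties.
const chart = { render: function renderChart() { return 'chart'; } };
const table = { render: function renderTable() { return 'table'; } };

function draw(widget) {
  // Field-based analysis: 'render' has one abstract location, so this call
  // site is attributed to both renderChart and renderTable.
  return widget.render();
}

draw(chart);
draw(table);

// Bracket syntax is ignored by the analysis, so this call may be missed.
const prop = 'ren' + 'der';
chart[prop]();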

By virtue of JavaScript's popularity and the increasing demand for rich internet and single-page applications, JavaScript frameworks (JSFs) have been developed to fit the needs of the modern-day web developer. The JSFs examined for this thesis are listed in Section 3.1 and all follow the model-view* (MV*) design pattern. MV* enforces that the JSF has a data model, which holds certain information; in general these models do not handle interaction. Additionally, it enforces that the JSF has a view which can be manipulated in order to display information (from the model) and usually contains data bindings, behaviors and events. Details aside, it is the model-view separation which is of importance, as it aids developers in structuring their code, especially when it involves more than trivial document object model (DOM) manipulation, which is increasingly the case with the rise of single-page applications.

2.3 GitHub Repositories

The data required for analysis was retrieved from GitHub repositories. Git repositories keep track of changes made to the code (commits) and of who made those changes (authors). Additionally, GitHub provides the possibility to track issues. These issues can be labeled, for example as being a bug or a feature. When a developer resolves an issue, they can close it and optionally link it to a commit. Such a closure or merge is a form of event. Moreover, releases can be tagged by developers, so it is exactly clear which code belongs to which release.

Next to facilitating open-source development, GitHub thus provides a wealth of data about a wide variety of software projects. This enables researchers to test their hypotheses on large publicly available datasets which include defect tracking information, and enables other researchers to replicate experiments since the data is readily retrievable. As such, GitHub provides a big opportunity for empirical research in the field of software engineering. Recently, researchers have published a set of recommendations for software engineering researchers on how to approach the data in GitHub [12]. Their recommendations are based on their research examining the characteristics of GitHub repositories and how users take advantage of them.

Issues concerning software repository data for empirical research

In an ideal world, all data stored in these repositories would be flawless. In practice, however, this is not the case. There are several caveats to take into account when mining Git repositories. Knowing about these caveats allows researchers to take (partial or full) corrective measures, or at the very least to acknowledge the caveats when corrective measures cannot be taken at the time but might be in the future.

Andreas Zeller raised the question "can we trust software repositories?" in his identically titled paper, recently published in Perspectives on the Future of Software Engineering [22]. Zeller highlighted that these repositories are not meant for research: although this excludes the possibility of the Hawthorne effect, the data itself is not meant to be automatically retrieved and analyzed.

The first issue Zeller raised concerns the need to map defects to changes in the code. Commit messages can certainly be scanned automatically for words like 'bug', 'defect' or 'fix', and identifiers can refer to the corresponding defect database. However, this only works under the assumption that developers always link defects in their commit messages, an assumption which does not hold in practice. Luckily, defects can also be linked to commits when developers close (or merge) issues with corresponding commits, something which seems to be done more routinely in the data used for this thesis (author's observation).

Nevertheless, the second issue Zeller raised is that the data can be wrong to begin with, i.e. something reported as a defect might not be a defect at all (misclassification). This issue of misclassification was the subject of a recent publication by Herzig et al. [11]. The authors manually classified more than 7,000 issue reports from the defect databases of five large open-source projects. They found that 33.8% of all defect reports were misclassified and that, on average, 39% of the files marked as defective actually never had a defect. Furthermore, across these five projects, misclassification of defects varied from 24.9% to 40.8%, indicating that misclassification of defects can vary widely between projects. Although some researchers have found that defect prediction models remain reliable at a noise level of 20% probability of incorrect links between defects and modules [20], the results of Herzig et al. clearly show that misclassification can easily surpass that level [11]. Ideally, therefore, issues should be systematically classified manually to counter this problem. Nevertheless, such a practice is painstakingly time-consuming and, at the very least, the issue should be acknowledged and mentioned by authors relying on such data. Due to time constraints, this research relies on automatic defect classification.

Further misclassification can occur in terms of version misclassification when defects are automatically attributed to a specific version of the software. Automatic means might rely on the release dates of versions and attribute a defect to the then newest version. This, of course, can lead to misclassification as defects can be reported for older versions at any time, but nevertheless this practice is applied commonly in literature, also by Bhattacharya et al. [4]. Additionally, automatic version classification might rely on labels, but these might not always be used. For example, none of the JSFs analyzed in this research had labels corresponding to specific versions for the reported defects. Therefore, for this research defects were manually attributed to a specific version as further elaborated on in Section 3.5.


3 Research

The objective of this thesis is to evaluate whether graph-based metrics can aid in defect prediction of JSFs. Ultimately, this might facilitate the defect fixing process of JSFs. Several research questions have been proposed to achieve this goal, namely:

• Research question 1: Can we predict defect-prone releases based on source code?

• Research question 2: Can we predict defect-prone releases based on non-source code information?

• Research question 3: Can we predict defect-prone files based on source code information?

To answer these questions, first, several JSFs were selected for analysis, as elaborated on in Section 3.1. Second, the metrics of interest and the data they rely on are described in Section 3.2. Third, the data required for analysis was retrieved from the JSFs' repositories, as described in Section 3.3. Fourth, the data was stored in a relational database for subsequent processing, as depicted in Section 3.4. Fifth, adequate measures were taken to ensure the data was valid, as clarified in Section 3.5. The manner of data processing is described in Section 3.6. Finally, the data was analyzed by applying several metrics to it, as displayed in Section 3.7. The results of this research are shown in Section 4.

3.1 Framework Selection

The Knockout framework was included for analysis at the request of the host company Avanade, as it is rooted in Microsoft technology. In order to select additional JSFs for comparative analysis, a set of criteria was formulated. The criteria reflect accessibility of source code for analysis, a comparable design pattern, practical relevance, and ample data for analysis. Ultimately, this led to the following criteria:

• Open-source and available through GitHub
• Model-view* (MV*) design pattern
• >25 total contributors
• First commit >1 year ago
• >10,000 LOC
• Labeling of defects on the issue tracker

The author found two frameworks in addition to Knockout which met the above criteria, namely:

• Angular
• Ember

For convenience, no distinction was made between libraries (e.g. Knockout strongly leans towards being a library rather than an actual framework), frameworks and anything in between.

3.2 Metrics and Required Data

In order to answer the research questions as described earlier, several metrics were selected. These metrics are adopted from Bhattacharya et al. [4] and rely on graphs. A graph is given by G(V, E), with V denoting the set of nodes and E the set of edges. In order to generate these graphs, source code for all JSF versions was required for static code call graphs and all commit information was required for commit-based developer collaboration graphs.

4 GitHub repository can be found at: https://github.com/knockout/knockout

5 GitHub repository can be found at: https://github.com/angular/angular.js


Metrics

• Edit distance (ED) - This metric can be applied to directed graphs and its purpose is to uncover large-scale structural changes between releases or to predict defect counts for a release. It is applied to static call graphs constructed from the source code, as well as to commit-based developer collaboration graphs. It is used to answer the first two research questions.

Given two graphs G1(V1, E1) and G2(V2, E2), their edit distance is written as ED(G1, G2) and is calculated as follows:

ED(G1, G2) = |V1| + |V2| − 2|V1 ∩ V2| + |E1| + |E2| − 2|E1 ∩ E2|

Note that the edit distance simply counts the nodes and edges that are not shared by the two graphs. In pseudocode, the edit distance reads as follows:

function editDistance(G1, G2)
    V1 = getTotalNodes(G1)
    V2 = getTotalNodes(G2)
    E1 = getTotalEdges(G1)
    E2 = getTotalEdges(G2)
    IS = getIntersection(G1, G2)
    return V1 + V2 - (2 * getTotalNodes(IS)) + E1 + E2 - (2 * getTotalEdges(IS))
end function

Figure 1: An illustration of the ED between two graphs: G1 (left) and G2 (right). G2 has three additional nodes and three additional edges compared to G1 (highlighted in light blue). Thus the ED between the two graphs is six.
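To make the calculation concrete, the following is a minimal JavaScript sketch of the edit distance as defined above (illustrative only; it is not the tooling used in this thesis). Graphs are assumed to be represented as sets of node names and of directed-edge strings; the example graphs mirror the idea of Figure 1 but are invented for illustration.

// Minimal sketch: graph edit distance as defined above.
// A graph is represented as { nodes: Set<string>, edges: Set<string> },
// where an edge string like 'a->b' encodes a directed edge (assumed encoding).

function intersectionSize(a, b) {
  let count = 0;
  for (const item of a) {
    if (b.has(item)) count++;
  }
  return count;
}

function editDistance(g1, g2) {
  const sharedNodes = intersectionSize(g1.nodes, g2.nodes);
  const sharedEdges = intersectionSize(g1.edges, g2.edges);
  return (
    g1.nodes.size + g2.nodes.size - 2 * sharedNodes +
    g1.edges.size + g2.edges.size - 2 * sharedEdges
  );
}

// Invented example in the spirit of Figure 1: g2 adds three nodes and
// three edges to g1, so the edit distance is 3 + 3 = 6.
const g1 = {
  nodes: new Set(['a', 'b', 'c']),
  edges: new Set(['a->b', 'b->c']),
};
const g2 = {
  nodes: new Set(['a', 'b', 'c', 'd', 'e', 'f']),
  edges: new Set(['a->b', 'b->c', 'c->d', 'd->e', 'e->f']),
};
console.log(editDistance(g1, g2)); // 6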

• Noderank - This metric is similar to PageRank and is applied to directed graphs. It is used to answer the third research question by assessing the overlap between highly ranked nodes and files associated with defects. Noderank is calculated per node; given a node u, its noderank is written as NR(u). To calculate NR(u), we define a set IN_u containing all nodes v that have an outgoing edge to u. Initially, every node in V has an equal noderank. Iterating over IN_u, the noderank at iteration n+1 is calculated as follows until the noderank values converge:

NR_{n+1}(u) = Σ_{v ∈ IN_u} NR_n(v) / deg+(v)

where deg+(v) denotes the out-degree of node v. Also note that, to enable convergence, at the end of every iteration all values need to be normalized so that their sum is equal to one.
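The iteration above can be sketched in JavaScript as follows (an assumed implementation, not the thesis tooling). The graph is assumed to be an adjacency map from each node to the nodes it has outgoing edges to (every node must appear as a key); distributing each node's rank over its out-edges is equivalent to summing over IN_u as in the formula.

// Sketch of the noderank iteration described above.
function noderank(graph, iterations = 100) {
  const nodes = Object.keys(graph);
  let rank = {};
  // Initially every node has an equal noderank.
  for (const u of nodes) rank[u] = 1 / nodes.length;

  for (let i = 0; i < iterations; i++) {
    const next = {};
    for (const u of nodes) next[u] = 0;
    // Each node v distributes its rank evenly over its outgoing edges,
    // i.e. each target u receives NR(v) / deg+(v).
    for (const v of nodes) {
      const out = graph[v];
      if (out.length === 0) continue;
      const share = rank[v] / out.length;
      for (const u of out) next[u] += share;
    }
    // Normalize so that the ranks sum to one, as required for convergence.
    const total = Object.values(next).reduce((s, x) => s + x, 0) || 1;
    for (const u of nodes) next[u] = next[u] / total;
    rank = next;
  }
  return rank;
}

// Invented example: a tiny directed graph.
const ranks = noderank({ a: ['b', 'c'], b: ['c'], c: ['a'] });
console.log(ranks); // a and c converge to 0.4 each, b to 0.2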

Moreover, the following simple metrics, generated with escomplex, are also used:

• Logical lines of code (LLOC) - A count of the imperative statements.

• Cyclomatic complexity (CCM) - A count of the number of linearly independent paths through the program's control flow graph [14].

Statistics

Spearman's rank correlation coefficient (r_s) was applied to calculate correlations between the results of the various metrics (described above) and defect count. A two-tailed p value was calculated in order to determine statistical significance. Statistical significance was set at p ≤ 0.05 for all measurements (five in total) of a single dataset, i.e. a JSF, taken as a whole. A Bonferroni correction was then applied to determine the α for single measurements, resulting in α = 0.01.
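For illustration, a minimal JavaScript sketch of Spearman's r_s is given below (it is not the statistical tooling used for this thesis): both samples are converted to ranks, with average ranks for ties, and the Pearson correlation of the ranks is taken. The computation of the two-tailed p value is omitted, and the example input values are invented.

// Spearman's rank correlation: Pearson correlation of the rank vectors.
function ranks(values) {
  const sorted = values
    .map((v, i) => ({ v, i }))
    .sort((a, b) => a.v - b.v);
  const r = new Array(values.length);
  let k = 0;
  while (k < sorted.length) {
    let j = k;
    while (j + 1 < sorted.length && sorted[j + 1].v === sorted[k].v) j++;
    const avgRank = (k + j) / 2 + 1; // average 1-based rank for tied values
    for (let m = k; m <= j; m++) r[sorted[m].i] = avgRank;
    k = j + 1;
  }
  return r;
}

function pearson(x, y) {
  const n = x.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(x), my = mean(y);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

function spearman(x, y) {
  return pearson(ranks(x), ranks(y));
}

// Invented example: correlate per-version edit distances with defect counts.
console.log(spearman([0, 5, 2, 5, 29], [0, 0, 0, 1, 1]));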

3.3 Data Retrieval

All data required for analysis is publicly available through the GitHub API; accordingly, a tool (Gitomerase) was written in PHP to communicate with the GitHub API (version 3) and retrieve the data. The GitHub API works over HTTPS and sends and receives data in JavaScript Object Notation (JSON).

Gitomerase is named after the polymerase enzymes in biology, which transcribe pieces of DNA to RNA. Since the GitHub API returns data in JSON, Gitomerase decodes this data into associative arrays for easy subsequent handling, as illustrated in Figure 2.

Gitomerase can fetch issues (and their corresponding events), releases, and commits belonging to a given repository, as well as user data.

Figure 2: 1) Gitomerase sends a cURL request to the GitHub API, 2) the GitHub API answers with a JSON response, 3) Gitomerase decodes the JSON response to an associative array for easy handling.


The following repositories were used to retrieve the required data for each JSF:

• Knockout - https://api.github.com/repos/knockout/knockout
• Angular - https://api.github.com/repos/angular/angular.js
• Ember - https://api.github.com/repos/emberjs/ember.js

Since some frameworks (Angular and Ember) have multiple versions being developed in parallel, commits were retrieved and ordered per version in these cases. This was necessary in order to construct reliable developer collaboration graphs. The GitHub API allows filtering per branch or tag when retrieving data, and in both cases either separate branches or tags were used to allow this selective retrieval.
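Gitomerase itself was written in PHP; purely as an illustration, the following JavaScript (Node 18 or later) sketch shows the same idea of paging through the issues endpoint of one of the repositories listed above and decoding the JSON responses. The GITHUB_TOKEN environment variable and the label filter are assumptions made for this example.

// Illustrative sketch, not the thesis tool: fetch all issues for a repository
// through the GitHub REST API v3, page by page, and decode the JSON responses.
async function fetchIssues(owner, repo) {
  const issues = [];
  for (let page = 1; ; page++) {
    const url = `https://api.github.com/repos/${owner}/${repo}/issues` +
      `?state=all&per_page=100&page=${page}`;
    const res = await fetch(url, {
      headers: {
        Accept: 'application/vnd.github.v3+json',
        ...(process.env.GITHUB_TOKEN
          ? { Authorization: `token ${process.env.GITHUB_TOKEN}` }
          : {}),
      },
    });
    if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
    const batch = await res.json(); // JSON decoded into plain objects/arrays
    if (batch.length === 0) break;
    issues.push(...batch);
  }
  return issues;
}

// Usage example: count issues labeled as defects for the Knockout repository.
fetchIssues('knockout', 'knockout').then((issues) => {
  const defects = issues.filter((i) =>
    i.labels.some((l) => /bug|defect|fault/i.test(l.name)));
  console.log(`${issues.length} issues, ${defects.length} labeled as defects`);
});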

3.4 Data Storage

A second tool was written in PHP to communicate with the database used for storage and to handle the data provided by Gitomerase from the GitHub API. This tool is named Gitosome, after the ribosome in biology, which synthesizes protein (translation) as prescribed by messenger RNA. All data was stored in a MySQL database and the PHP Data Objects (PDO) extension was used as the interface for communicating with it.

Gitosome can store issues, releases, events and user data into the database, as well as some related relevant data such as labels. For example, an issue can be labeled as a defect (bug) and thus provides the classification required for analysis.

The database schema is illustrated in Figure 3.

Figure 3: Relational database schema for storage of the data by Gitosome.

The commits contain the changes made to the code of a project. Linking these commits to issues provides information about the nature of a code change (e.g. a defect fix); thus, linking commits to issues provides useful metadata about code changes. The relational database schema allows linking commits to issues: a link might be established directly through a mention of the issue number in the commit message, or indirectly through an issue's corresponding events. Events are inherently linked to a corresponding issue (through the foreign key issue_number) and are linked to a commit (through the foreign key commit_id) when they represent a merge or closure event involving that commit.


Figure 4: A commit can either be directly linked to an issue through a mention of the issue number in the commit message or indirectly through corresponding events (closure or merge) which are linked to a commit.
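The linking logic of Figure 4 can be sketched as follows. This is an illustrative JavaScript sketch with assumed record shapes (commit.id, commit.message, event.type, event.commitId, event.issueNumber); it does not mirror the actual Gitosome schema exactly.

// Link commits to issues either directly (issue number such as '#123' in the
// commit message) or indirectly (closure/merge events referencing a commit).
function linkCommitsToIssues(commits, events) {
  const links = []; // { issueNumber, commitId, via }

  // Direct links: issue numbers mentioned in commit messages.
  for (const commit of commits) {
    const mentioned = commit.message.match(/#(\d+)/g) || [];
    for (const m of mentioned) {
      links.push({
        issueNumber: Number(m.slice(1)),
        commitId: commit.id,
        via: 'message',
      });
    }
  }

  // Indirect links: closure or merge events carrying a commit id.
  for (const event of events) {
    if ((event.type === 'closed' || event.type === 'merged') && event.commitId) {
      links.push({
        issueNumber: event.issueNumber,
        commitId: event.commitId,
        via: event.type,
      });
    }
  }
  return links;
}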

3.5 Data Validation

Issue type classification

Reported issues were classified as defects when they were labeled as one, i.e. if they had an attached label named ’defect’, ’bug’, ’fault’ or similar. Although Herzig et al. report that 33.8% of all defect reports were misclassified as one, based on manual examination of over 7,000 issue reports from five open-source projects, this research relies on labeling rather than manual classification by the author. This choice was made due to time constraints.

Issue version classification

Initially, reported defects were attributed to the then newest version, i.e. a defect was attributed to the nearest preceding release. However, manual inspection of a subset of issues revealed that this could lead to substantial version misclassification. Therefore, issues were manually examined by the author to improve version classification. Manual examination relied on the following information to determine the correct version: i) the issuer explicitly stated a version name in either the title or description of the issue, ii) the version name could be derived from an attached minimal example demonstrating the defect (e.g. a JSFiddle), or iii) the version name was explicitly mentioned in one of the reactions to the issue as being the one in which it was introduced. If a version could not be derived from these three points, the defect was attributed to the then newest version (as determined by the date the issue was opened). Additionally, when multiple versions were being developed in parallel, the title or description of the issue was sometimes labeled by the issuer with 'BETA'; in that case it was attributed to the then newest beta version, which was not necessarily the then newest version. This approach prevented version misclassification for 24.2% of all reported defects across the three JSFs.

3.6 Data Processing

Static source code call graphs

Several of the metrics used to analyze the data rely on graphs constructed from both source code and non-source code data. For the former, static call graphs had to be generated. The Approximate Call Graph Builder for JavaScript (ACG) was chosen to generate these call graphs. ACG implements a field-based call graph construction algorithm for JavaScript as described by Feldthaus et al. [6]. Although this approach is unsound in principle (it can miss calls), Feldthaus et al. report that it produces highly accurate call graphs in practice. The author's own observations supported this conclusion, and it was concluded that this approach would not negatively affect the reliability of the results of this research. As such, ACG was chosen to construct static call graphs for the JavaScript source code.

By default, ACG represents nodes as file positions rather than function names. This behavior was chosen by its authors because not all functions have names and function names need not be unique. However, the author's own observations revealed that this was only the case for a small minority of calls. As such, ACG was adapted to represent nodes as function names (prefixed with their filename) and the exceptional cases were ignored. For example, a function foo calling a function bar, both within a file named test.js, was thus represented as follows:

test.js/foo -> test.js/bar

Subsequently, ACG was modified to write the call graphs to PHP files as associative PHP arrays, to simplify further processing of the graphs, i.e. applying the graph metrics described in Section 3.2.

The following format was chosen for the associative array:

$graph['filename/caller']['filename/callee'] = true;

The first index of the array, ’filename/caller’, represents the function which acts as the caller and the second index, ’filename/callee’, represents the function which is being called (the callee).
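For illustration, the equivalent structure can be built in JavaScript from a list of call edges, as sketched below (the input shape is an assumption; this is not the modified ACG code itself). The node and edge sets derived here are the inputs to the edit distance metric of Section 3.2.

// Build a caller -> callee map plus the node and edge sets of the call graph.
function buildCallGraph(callEdges) {
  const graph = {};
  const nodes = new Set();
  const edges = new Set();
  for (const { caller, callee } of callEdges) {
    graph[caller] = graph[caller] || {};
    graph[caller][callee] = true; // mirrors $graph['caller']['callee'] = true
    nodes.add(caller);
    nodes.add(callee);
    edges.add(`${caller} -> ${callee}`);
  }
  return { graph, nodes, edges };
}

// Example from the text: foo calling bar, both in test.js.
const cg = buildCallGraph([{ caller: 'test.js/foo', callee: 'test.js/bar' }]);
console.log(cg.edges); // Set { 'test.js/foo -> test.js/bar' }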

Commit-based developer collaboration graphs

The commit-based developer collaboration graphs were constructed similarly. All filecommits between the dates of two releases were fetched. When more than one filecommit corresponded to the same file (i.e. more than one developer worked on the file during the same period), the corresponding commits were fetched in order to determine the authors of these commits. An undirected edge between these authors (represented by nodes) was then added, signifying that they worked on the same file.
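A sketch of this construction in JavaScript is given below (the filecommit record shape is assumed; this is not the thesis tooling).

// Build an undirected developer collaboration graph from filecommits in a
// release interval: authors become nodes, and an edge is added whenever two
// authors touched the same file in that period.
function collaborationGraph(filecommits) {
  // filecommits: [{ file: 'src/utils.js', author: 'alice' }, ...]
  const byFile = new Map();
  for (const { file, author } of filecommits) {
    if (!byFile.has(file)) byFile.set(file, new Set());
    byFile.get(file).add(author);
  }

  const nodes = new Set();
  const edges = new Set(); // undirected edges stored with sorted endpoints
  for (const authors of byFile.values()) {
    const list = [...authors];
    list.forEach((a) => nodes.add(a));
    for (let i = 0; i < list.length; i++) {
      for (let j = i + 1; j < list.length; j++) {
        edges.add([list[i], list[j]].sort().join(' -- '));
      }
    }
  }
  return { nodes, edges };
}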

Defect sample selection

For Knockout, there was a total of 148 reported defects; all of these were manually classified and no sampling of reported defects took place for this JSF. For Angular and Ember, however, a selection from the population was required due to time constraints. For this purpose, a systematic sampling method was applied to extract a sample from the population of all reported defects. The sample size was aimed to be about the same as the total population of reported defects of Knockout. Therefore, for both JSFs, every nth reported defect (ordered by creation date) was selected for inclusion. For Angular n was set at 6 and for Ember n was set at 2. This led to a sample of 137 reported defects for Angular and 127 reported defects for Ember.
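The sampling step itself can be sketched as follows (illustrative JavaScript; the createdAt field name and the choice to start at the first element are assumptions).

// Systematic sampling: order defects by creation date and keep every n-th one
// (n = 6 for Angular, n = 2 for Ember).
function systematicSample(defects, n) {
  return defects
    .slice()
    .sort((a, b) => new Date(a.createdAt) - new Date(b.createdAt))
    .filter((_, index) => index % n === 0);
}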

3.7 Data Analysis

The edit distance was calculated for every version of each JSF based on the corresponding static source code call graphs, as described in Section 3.6. This was done for static source code call graphs both including and excluding native JavaScript function calls. Likewise, the edit distance was calculated for every version of each JSF based on the corresponding commit-based developer collaboration graphs. Additionally, the LLOC and CCM were calculated for every version of each JSF.

After these results were generated, r_s was calculated for the results of these metrics and defect count, as described in Section 3.2.

Furthermore, noderank was applied to the version of each JSF which had the highest defect count. The top 5 nodes, as ordered by their noderank, were then selected for comparison with filecommits: the files corresponding to these 5 nodes were determined and laid out against the files of the commits attributed to the selected version.
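This overlap computation can be sketched as follows (illustrative JavaScript with assumed input shapes: a noderank map keyed by 'path/file.js/function' names and a list of defect filecommits with a file property).

// Take the top 5 nodes by noderank, map them to their files, and express the
// overlap as a percentage of the filecommits attributed to the version's
// reported defects.
function topNodeOverlap(rank, defectFilecommits, top = 5) {
  if (defectFilecommits.length === 0) return 0;
  const topFiles = new Set(
    Object.entries(rank)
      .sort((a, b) => b[1] - a[1]) // highest noderank first
      .slice(0, top)
      .map(([node]) => node.split('/').slice(0, -1).join('/')) // strip function name
  );
  const overlapping = defectFilecommits.filter((fc) => topFiles.has(fc.file));
  return (100 * overlapping.length) / defectFilecommits.length;
}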


4 Results

Graphs, consisting of nodes and edges, were constructed with the method described in Section 3.6. Static source code call graphs were generated both without and with native function calls (e.g. array push); metrics applied to the latter graphs are marked with a prime (′) symbol.

Reported defects were attributed manually to a version by the author as described in Section 3.5.

The edit distance (ED) for a version n was calculated with respect to its immediate predecessor n − 1 by applying the formula described in Section 3.2. Additionally, the commit-based developer collaboration edit distance (DC ED) was calculated using the same formula.

The noderank was calculated for the version with the highest number of reported defects. Subsequently, the overlap between the files corresponding to the top 5 nodes (as ordered by their noderank) and the files corresponding to the reported defects was calculated and expressed as a percentage of the total number of files corresponding to defects.

4.1 Knockout

Version      Defects  DC ED  Nodes  Edges  ED   Nodes′  Edges′  ED′  LLOC  CCM
1.0.2        0        NA     6      4      NA   145     139     NA   943   228
1.0.3        0        0      6      4      0    145     139     0    965   237
1.0.4        0        0      6      4      0    145     139     0    1028  253
1.0.5        0        0      6      4      0    145     139     0    1084  266
1.1.0        0        0      6      4      0    145     139     0    1203  286
1.1.1        0        0      9      6      5    149     142     7    1286  303
1.1.2        0        9      10     7      2    150     144     3    1337  324
1.2.0        1        9      13     9      5    165     172     51   1530  377
1.2.1        1        0      13     9      0    169     177     9    1577  397
1.3.0 beta   0        21     22     17     29   188     200     64   2079  485
1.3.0 RC     0        21     23     20     4    191     206     19   2196  528
2.0.0 RC     1        0      23     20     0    193     207     3    2199  534
2.0.0 RC2    0        0      23     20     0    193     207     0    2199  534
2.0.0        8        21     23     20     0    193     207     0    2198  533
2.1.0 beta   0        21     30     26     15   200     214     22   2329  557
2.1.0 RC     0        0      31     27     2    201     215     2    2337  560
2.1.0 RC2    0        0      31     27     0    201     215     0    2340  560
2.1.0        16       9      31     27     0    201     215     0    2340  560
2.2.0 RC     0        9      38     32     18   205     224     81   2389  578
2.2.0        15       19     38     32     0    205     224     0    2392  579
2.2.1        29       40     38     32     0    205     224     0    2395  580
2.3.0 RC     2        35     40     33     3    207     236     14   2430  593
2.3.0        9        0      40     33     0    207     236     0    2430  593
3.0.0 beta   10       9      55     49     39   221     262     76   2567  639
3.0.0 RC     6        9      57     54     7    227     270     18   2681  679
3.0.0        31       16     57     54     0    227     270     0    2683  677
3.1.0 beta   19       7      62     62     13   228     277     26   2808  699
3.1.0        0        9      62     62     0    228     277     0    2808  699

Table 1: Metric values for several Knockout versions. The edit distance for a given version is the edit distance between that version and its immediate predecessor. The first columns display the values calculated from the graphs excluding native function calls, whereas the columns appended with the prime symbol take native function calls into account.

The correlation coefficient (r_s) was calculated between the defect count and the results of the various other metrics. Subsequently, the same was done starting from the first version of Knockout with more than one defect, i.e. versions ≥2.0.0. Finally, the data was further categorized by taking all beta and release candidate versions together with their corresponding stable version, i.e. by summing the defects and EDs of these versions. The results are provided in Table 3.


Version  Defects  DC ED  Nodes  Edges  ED  Nodes′  Edges′  ED′  LLOC  CCM
1.0.2    0        NA     6      4      NA  145     139     NA   943   228
1.0.3    0        0      6      4      0   145     139     0    965   237
1.0.4    0        0      6      4      0   145     139     0    1028  253
1.0.5    0        0      6      4      0   145     139     0    1084  266
1.1.0    0        0      6      4      0   145     139     0    1203  286
1.1.1    0        0      9      6      5   149     142     7    1286  303
1.1.2    0        9      10     7      2   150     144     3    1337  324
1.2.0    1        9      13     9      5   165     172     51   1530  377
1.2.1    1        0      13     9      0   169     177     9    1577  397
2.0.0    9        63     23     20     33  193     207     86   2198  533
2.1.0    16       30     31     27     17  201     215     24   2340  560
2.2.0    15       28     38     32     18  205     224     81   2392  579
2.2.1    29       40     38     32     0   205     224     0    2395  580
2.3.0    11       35     40     33     3   207     236     14   2430  593
3.0.0    17       38     57     54     46  227     270     94   2683  677
3.1.0    19       16     62     62     13  228     277     26   2808  699

Table 2: Metric values for categorized Knockout versions. All beta and release candidate versions corresponding to a stable version were taken together by summation of their defects and edit distances. The first columns display the values calculated from the graphs excluding native function calls, whereas the columns appended with the prime symbol take native function calls into account.

Versions  ED                   ED′                  DC ED                LLOC                 CCM
All       −0.103 (p = 0.608)   −0.061 (p = 0.763)   0.368 (p = 0.059)    0.623 (p < 0.001)†   0.613 (p < 0.001)†
≥2.0.0    −0.282 (p = 0.308)   −0.270 (p = 0.330)   0.259 (p = 0.352)    0.368 (p = 0.177)    0.324 (p = 0.239)
Stable*   0.589 (p = 0.021)    0.626 (p = 0.013)    0.802 (p < 0.001)†   0.923 (p < 0.001)†   0.923 (p < 0.001)†

Table 3: Correlation coefficients between defect count and results of various metrics for Knockout. *All beta and release candidate versions corresponding to a stable version were categorized as one. † denotes statistical significance.

Calculating the noderank of each node for version 3.0.0 yielded 11 nodes with a noderank >0 out of a total of 57 nodes. The top 5 nodes comprised 77% of the total noderank and were spread among 4 different files, namely: src/binding/bindingAttributeSyntax.js, src/utils.domNodeDisposal.js, src/subscribables/dependentObservable.js and src/memoization.js.

Version 3.0.0 totaled 31 reported defects, corresponding to 18 filecommits, spread among 8 unique files. Of these 18 filecommits, 2 (11.1%) corresponded to a file in which one of the top 5 nodes, as ordered by their noderank, was located (src/subscribables/dependentObservable.js).

Node                                                                Noderank  Overlap with defect filecommits
(...)/bindingAttributeSyntax.js/applyBindingsToNodeInternal         0.04933   0%
src/utils.domNodeDisposal.js/cleanImmediateCommentTypeChildren      0.04933   0%
(...)/bindingAttributeSyntax.js/applyBindingsToDescendantsInternal  0.04933   0%
(...)/dependentObservable.js/evaluateImmediate                      0.05892   11.1%
src/memoization.js/findMemoNodes                                    0.56187   0%

Table 4: Top 5 nodes ordered by their noderank for version 3.0.0 of Knockout. Note that the total noderank of all nodes is equal to 1 (by definition).


Figure 5: Scatterplots of defect count vs. ED (left) and defect count vs. DC ED (right) for all Knockout versions.

Figure 6: Scatterplots of defect count vs. LLOC (left) and defect count vs. CCM (right) for all Knockout versions.

Figure 7: Scatterplots of defect count vs. ED (left) and defect count vs. DC ED (right) for stable Knockout versions. *All stable, beta and release candidate versions of a specific version were categorized as one.

Figure 8: Scatterplots of defect count vs. LLOC (left) and defect count vs. CCM (right) for stable Knockout versions. *All stable, beta and release candidate versions of a specific version were categorized as one.


4.2 Angular

Version  Defects  DC ED  Nodes  Edges  ED   Nodes′  Edges′  ED′   LLOC  CCM
1.0.0    5        NA     217    483    NA   429     864     NA    4980  1089
1.0.1    3        0      217    483    0    429     864     0     4981  1091
1.0.2    3        34     219    485    4    432     871     10    5003  1096
1.0.3    1        68     242    743    313  456     1149    336   5060  1110
1.0.4    2        51     243    755    15   458     1168    25    5073  1116
1.0.5    1        52     243    742    17   458     1155    19    5119  1132
1.0.6    1        177    243    740    4    457     1153    9     5126  1135
1.0.7    3        460    244    744    5    458     1157    5     5153  1150
1.0.8    4        407    244    743    7    461     1280    136   5183  1170
1.1.0    3        34     219    485    333  432     871     502   5010  1098
1.1.1    2        68     243    746    317  457     1153    341   5103  1126
1.1.2    0        51     244    758    15   457     1171    30    5155  1142
1.1.3    2        52     244    770    42   457     1183    44    5267  1173
1.1.4    2        177    253    793    58   465     1209    82    5493  1222
1.1.5    4        812    262    818    68   476     1239    109   5692  1281
1.2.0    28       702    238    814    841  464     1818    1557  6165  1440
1.2.1    7        50     242    831    27   468     1837    33    6203  1448
1.2.2    2        16     240    828    7    466     1833    8     6221  1453
1.2.3    7        76     240    828    30   467     1834    30    6239  1453
1.2.4    3        76     242    830    4    469     1836    4     6233  1440
1.2.5    3        18     244    608    238  471     1603    261   6234  1432
1.2.6    0        28     246    615    13   474     1610    22    6270  1435
1.2.7    4        57     247    622    8    477     1618    11    6290  1443
1.2.8    4        97     248    622    3    478     1618    3     6256  1439
1.2.9    1        54     249    704    83   479     1704    87    6256  1440
1.2.10   4        30     249    704    0    479     1704    0     6258  1441
1.2.11   2        76     249    696    12   478     1697    12    6269  1443
1.2.12   3        111    249    696    0    478     1697    0     6275  1445
1.2.13   6        167    250    698    3    479     1703    7     6313  1454
1.2.14   8        139    251    701    4    480     1707    9     6393  1478
1.2.15   3        26     251    704    3    481     1712    8     6433  1494
1.2.16   7        241    259    721    41   489     1738    62    6514  1510
1.2.17   1        290    260    725    15   491     1742    20    6570  1528

Table 5: Metric values for several Angular versions. Beta and RC versions were attributed to the corresponding stable version. The edit distance for a given version is the edit distance between that version and its immediate predecessor, with the exception of version 1.1.0, for which version 1.0.1 was taken as its predecessor. The first columns display the values calculated from the graphs excluding native function calls, whereas the columns marked with the prime symbol take native function calls into account.

The correlation coefficient (r_s) was calculated between the defect count and the results of the various other metrics. The results are provided in Table 6.

Versions  ED                   ED′                  DC ED               LLOC                CCM
All       −0.148 (p = 0.419)   −0.073 (p = 0.693)   0.284 (p = 0.116)   0.227 (p = 0.203)   0.336 (p = 0.056)

Table 6: Correlation coefficients between defect count and results of various metrics for Angular.

Calculating the noderank of each node for version 1.2.0 yielded 66 nodes with a noderank >0 out of a total of 238 nodes. The top 5 nodes comprised 79% of the total noderank and were spread among three different files, namely: src/jqLite.js, src/ng/sce.js and src/Angular.js.

Version 1.2.0 totaled 28 reported defects, corresponding to 23 filecommits, spread among 20 unique files. Of these 23 filecommits, 1 filecommit (4.3%) corresponded to a file in which one of the top 5 nodes, as ordered by their noderank, was located (src/Angular.js).


Node                          Noderank  Overlap with defect filecommits
src/jqLite.js/jqLiteAddNodes  0.06237   0%
src/jqLite.js/JQLite          0.08585   0%
src/ng/sce.js/sceToString     0.12986   0%
src/Angular.js/isArrayLike    0.22885   4.3%
src/Angular.js/forEach        0.28167   4.3%

Table 7: Top 5 nodes ordered by their noderank for version 1.2.0 of Angular. Note that the total noderank of all nodes is equal to 1 (by definition).

Figure 9: Scatterplots of defect count vs. ED (left) and defect count vs. DC ED (right) for all Angular versions.

Figure 10: Scatterplots of defect count vs. LLOC (left) and defect count vs. CCM (right) for all Angular versions.


4.3 Ember

Version  Defects  DC ED  Nodes  Edges  ED   Nodes′  Edges′  ED′   LLOC   CCM
1.0.0    28       NA     190    441    NA   417     1554    NA    7944   1517
1.0.1    0        16     191    442    4    418     1556    5     7945   1520
1.1.0    1        199    191    432    16   421     1540    41    8028   1550
1.1.1    1        0      191    432    0    421     1540    0     8028   1550
1.1.2    15       0      191    432    0    421     1537    3     8027   1549
1.1.3    0        0      192    433    4    422     1539    5     8028   1550
1.2.0    16       9      208    486    89   439     1654    176   8368   1635
1.2.1    0        0      209    487    4    440     1656    5     8370   1636
1.2.2    0        0      208    485    3    439     1653    4     8370   1636
1.3.0    6        125    247    787    509  480     2137    827   8609   1686
1.3.1    12       0      248    788    4    481     2139    5     8610   1687
1.3.2    3        0      247    786    3    480     2136    4     8610   1687
1.4.0    23       0      244    957    466  478     2128    892   8905   1720
1.5.0    13       0      247    1037   247  482     2297    399   9174   1774
1.5.1    4        0      247    1037   0    482     2297    0     9195   1778
1.6.0    1        36     351    1159   798  586     2423    1220  10975  1917

Table 8: Metric values for several Ember versions. Beta and RC versions were attributed to the corresponding stable version. The edit distance for a given version is the edit distance between that version and its immediate predecessor, with the exceptions of versions 1.2.0 and 1.3.0, for which versions 1.1.2 and 1.2.0 were taken as their predecessors, respectively. The first columns display the values calculated from the graphs excluding native function calls, whereas the columns marked with the prime symbol take native function calls into account. The low number of defects for version 1.6.0, despite its large edit distance, is due to it being a four-day-old release at the time the data was fetched.

The correlation coefficient (r_s) was calculated between the defect count and the results of the various other metrics. Subsequently, the latest version was omitted and the values were recalculated. The results are provided in Table 9.

Versions  ED                  ED′                 DC ED                LLOC                CCM
All       0.238 (p = 0.392)   0.276 (p = 0.319)   −0.085 (p = 0.763)   0.088 (p = 0.746)   0.088 (p = 0.746)
≤1.6.0    0.342 (p = 0.232)   0.387 (p = 0.172)   −0.045 (p = 0.880)   0.140 (p = 0.620)   0.140 (p = 0.620)

Table 9: Correlation coefficients between defect counts and results of various metrics for Ember.

Calculating the noderank of each node for version 1.0.0 yielded 62 nodes with a noderank >0 out of a total of 190 nodes. The top 5 nodes comprised 42% of the total noderank and were spread among four different files, namely: packages/ember-metal/lib/property_get.js, packages/ember-runtime/lib/system/namespace.js, packages/ember-metal/lib/events.js and packages/ember-metal/lib/property_events.js.

Version 1.0.0 totaled 28 reported defects, corresponding to 13 filecommits, spread among 7 unique files. Of these 13 filecommits, 0 corresponded (0%) to a file in which one of the top 5 nodes, as ordered by their noderank, was located.

Node                                         Noderank  Overlap with defect filecommits
(...)/property_events.js/propertyDidChange   0.05954   0%
(...)/property_events.js/propertyWillChange  0.05954   0%
(...)/events.js/sendEvent                    0.08325   0%
(...)/namespace.js/superClassString          0.10684   0%
(...)/property_get.js/get                    0.11162   0%

Table 10: Top 5 nodes ordered by their noderank for version 1.0.0 of Ember. Note that the total noderank of all nodes is equal to 1 (by definition).


Figure 11: Scatterplots of defect count vs. ED (left) and defect count vs. DC ED (right) for Ember versions ≤1.6.0.

Figure 12: Scatterplots of defect count vs. LLOC (left) and defect count vs. CCM (right) for Ember versions ≤1.6.0.


5 Discussion

Research question 1: Can we predict defect-prone releases based on source code?

Static source code call graphs were generated for every version of each of the JSFs. Subsequently, the ED for every version was calculated; in the majority of cases there appeared to be no significant relation between this ED and defect count. The correlation coefficient varied widely between the three JSFs and was negative in several cases. For example, there was a negative correlation between defect count and ED for all versions of Knockout, and the same was the case for all stable versions of Angular. When we examine the first case, Knockout, the correlation becomes remarkably different when the versions are categorized to stable versions only, i.e. data attributed to beta and RC versions is attributed to the corresponding stable versions. A moderate positive correlation becomes apparent after this categorization and nearly reaches statistical significance (p = 0.021). A possible explanation for this is the mismatch between an increase in edit distance between the latest stable version and the newest beta or RC version, and an overattribution of defects to stable versions. Indeed, all RC and beta versions combined are responsible for only 38 out of 148 reported defects, whilst they consistently encompass the largest changes in ED. It seems unlikely that the majority of defects was actually introduced in the stable versions rather than in the beta and RC versions.

The overattribution of defects to stable versions might be due to the smaller popularity of beta and RC versions compared to stable versions. As such, defects introduced in a beta or RC version might only become apparent once it is released as a stable version. Moreover, the ED between a beta or RC version and its subsequent stable version is often 0.

Notably, in all cases for Knockout the correlations of LLOC and of CCM with defect count are stronger. It is unclear why these simple metrics appear to be such strong predictors.
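For reference, a rough and purely illustrative sketch of what these simple metrics measure, assuming LLOC counts non-blank, non-comment lines and CCM approximates cyclomatic complexity by counting branching constructs; real metric tools use proper parsing rather than the regular expression used here:

```python
# Hypothetical approximation of LLOC and CCM for a JavaScript source string.
import re

BRANCH = re.compile(r"\b(if|for|while|case|catch)\b|\?|&&|\|\|")

def lloc_and_ccm(source):
    lines = [l.strip() for l in source.splitlines()]
    lloc = sum(1 for l in lines if l and not l.startswith(("//", "/*", "*")))
    ccm = 1 + sum(len(BRANCH.findall(l)) for l in lines)
    return lloc, ccm

print(lloc_and_ccm("function f(x) {\n  // comment\n  return x ? 1 : 0;\n}"))  # (3, 2)
```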

In contrast to the considered versions of Knockout, for Angular beta and RC versions were not taken into account to begin with. This choice was made because including them would have led to a total of nearly 100 versions, which might blunt the outcomes of measurements, as only a sample of 129 defects was used. Yet the same negative correlation for ED and ED′ was found. A possible explanation in the case of Angular lies in the applied software versioning. Many versions are released, with some versions followed by the next within a couple of days. As such, the same effect as with Knockout might hold: defects introduced in a certain version might only become apparent in future versions, due to their relatively short lifespan.

Nevertheless, both LLOC and CCM exhibit a positive, albeit weak and insignificant, correlation. Noteworthy is that the correlation for CCM is trending towards significance (p = 0.056). In the case of Ember, beta and RC versions were also not taken into account, similar to Angular. Contrary to Angular, however, Ember applies software versioning with less frequent releases. Indeed, the results do indicate a weak positive correlation between ED and defect count, although not statistically significant. In contrast with Knockout and Angular, both LLOC and CCM display a weaker, almost non-existent, correlation with defect count.

Finally, it seems ED might provide information on defect-prone releases; however, misattribution of defects to the wrong versions blunts the results, making the measurement unreliable. Nevertheless, ED does appear to hold predictive value for defect-prone releases when provided with ample and correct data. Bhattacharya et al. [4] found that ED could detect major changes between releases, although the authors did not investigate whether this metric holds predictive value in terms of defect-prone releases. Considering that ED can indeed capture major changes between releases, it seems likely that such major changes correspond to an increase in defects for that release.

Research question 2: Can we predict defect-prone releases based on non-source code information?

Commit-based developer collaboration graphs were generated for every version of each of the JSFs. Subsequently, the DC ED for every version was calculated and the results varied among the JSFs. In the case of Knockout and Angular, a trend towards significance (p = 0.059 and p = 0.116, respectively) became apparent for a positive correlation between the DC ED and defect count, although there appeared to be no correlation whatsoever in the case of Ember. Interestingly, when the versions of Knockout are categorized into stable versions only, i.e. data from beta and RC versions is attributed to the corresponding stable version, a strong positive correlation (rs = 0.80) which reaches statistical significance (p < 0.001) becomes apparent. On all occasions for Knockout, however, the two simple metrics LLOC and CCM show stronger and more significant correlations; of course, they are based on source code information, contrary to DC ED. For Angular, LLOC has a slightly weaker correlation than DC ED, although for CCM it is slightly stronger and trending towards significance (p = 0.056).

In the case of Ember, the DC ED appears to exhibit an erratic variation compared to the other two JSFs. This variation might be explained by how the data was gathered. In the case of Knockout, no releases are developed in parallel, and as such commits can be attributed to a certain version based on their creation date. Although Angular has multiple releases being developed in parallel, commits can easily be retrieved for each version by selecting the respective branch. Ember, however, also has multiple releases being developed in parallel, but its repository is not divided into separate branches for these releases. Instead, it relies on tags, and it appeared that the data retrieval process relying on these tags introduced noise into the data.
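As an illustration of the date-based attribution mentioned above, a minimal sketch, under the assumption that a commit is attributed to the first release dated on or after the commit; the release dates and the commit below are placeholders, not the actual Knockout data:

```python
# Hypothetical sketch: attributing commits (or reported defects) to releases
# by creation date, which is workable when releases are not developed in parallel.
from datetime import date
from bisect import bisect_left

releases = [                      # (release date, version), sorted by date (placeholders)
    (date(2013, 1, 10), "2.2.0"),
    (date(2013, 7, 1),  "3.0.0"),
    (date(2014, 3, 4),  "3.1.0"),
]
release_dates = [d for d, _ in releases]

def version_for(commit_date):
    # attribute the commit to the first release dated on or after the commit
    i = bisect_left(release_dates, commit_date)
    return releases[i][1] if i < len(releases) else "unreleased"

print(version_for(date(2013, 5, 20)))  # -> "3.0.0"
```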

Concluding, it seems DC ED provides very useful information for predicting defect-prone releases based on non-source code information (i.e. user commits). The conflicting result of this metric applied to Ember can be attributed to noise in the data. These results seem to be in conflict with the data reported by Bhattacharya et al. [4], who did not find any correlation between DC ED and defect count and report that defect-tossing based collaboration is a better defect predictor than DC ED. Defect-tossing based collaboration could unfortunately not be examined, as none of the chosen JSFs appeared to make active use of attributing developers to reported defects.

Research question 3: Can we predict defect-prone files based on source code information?

Noderank was applied to the version with the highest defect count for every JSF. Subsequently, the top 5 nodes were laid out against the files in which defects had occurred, in order to see if there was any overlap between the two. The version with the highest defect count was chosen to have an ample amount of data for comparison and thereby prevent a type II error. Indeed, many versions corresponded to only a few, or even no, reported defects, which could mask any overlap.

Although the top 5 nodes comprised a high percentage of the total noderank for all three JSFs, there appears to be hardly any overlap between noderank and defects occurring in the corresponding files. For Knockout, only 2 out of 8 unique files with reported defects corresponded to a file in which one of the top 5 nodes, as ordered by their noderank, was located. Similar results were obtained for Angular, for which only 1 out of 20 unique files with reported defects corresponded to a file in which one of the top 5 nodes was located. In the case of Ember, there was no overlap at all. These results, however, should not come as a surprise. Since noderank is a measure of the relative importance of a node based on calls from other nodes, a higher noderank signifies a more important node. It is therefore likely that a defect in a node with a high noderank would have been resolved very early during development. Moreover, since many other nodes depend on such a node, a defect is likely to be detected early, most likely before the change to the code is committed to the repository; it is therefore less likely to survive until a commit and thus be reported as a defect by others. Additionally, it may well be that developers working on these files are aware of their importance for the software, which makes them cautious when changing these files and prompts extra scrutiny from others before changes are committed to the repository.

Bhattacharya et al. [4], however, found that noderank could be used to predict defect severity. The authors did not investigate whether the metric had any predictive value in terms of the defect count of the file corresponding to the respective node. This research did not evaluate whether noderank could predict defect severity, as it seems unlikely that reported defect severity is actually classified correctly, as discussed in Section 2.1. Measurements based on reported defect severity might therefore be too unreliable to draw conclusions from.

Concluding, noderank appears ill-suited to predict defect-prone files based on source code information of (mature) JSFs.


6 Conclusions

The results presented in this thesis do not conclusively support the usage of the applied graph-based metrics to predict defects in JSFs. The observed correlations between these metrics and defect count seem to rely heavily on the correctness of the input data, as witnessed by the large impact on these correlations when the input data is 'corrected' or categorized. Additionally, the correlations vary widely among the three tested JSFs, further putting their application in doubt. Moreover, although the ED does appear to reflect major changes in code, as witnessed by the large ED values for major versions, simple metrics such as LLOC and CCM appear at least as effective in reflecting these changes for the three tested JSFs. Additionally, the ED value did not reach statistical significance in any case, although ED′ came close (p = 0.013) on one set of input data. However, simple metrics such as LLOC and CCM reached statistical significance on the same input data, with a stronger correlation. Moreover, these simple metrics reached, or approached, statistical significance on other occasions as well, whereas the ED did not.

However, the DC ED metric seems more promising, as it reached, or approached, statistical significance on multiple occasions with a tendency towards a moderate to strong correlation. Yet on other occasions the correlation of DC ED with defect count did not approach statistical significance, highlighting its variability and also putting the application of this metric in doubt. Furthermore, noderank did not appear to provide any useful direction in terms of defect-prone files for any of the JSFs, although the small amount of data on which this metric was applied might have blunted the results. However, given the severe lack of overlap between the files in which the top 5 nodes by noderank were located and the files related to reported defects, it appears unlikely that this was the cause of the lack of results for this metric.

Concluding, the current results do not support the usage of the applied graph-based metrics for defect prediction in JSFs. However, both the quality and quantity of the data might have influenced the results of this research, obscuring potentially valuable results. Future research with better and more input data is warranted to validate these results. Suggestions for future research are provided in Section 7.

7 Future Research

Several issues with defect prediction and data mining from GitHub repositories have been raised in Section 2.1 and Section 2.3, respectively. It seems prudent that future research should focus on dealing with these issues, as noise in the input data might have severely impacted the results of this research.

Misclassification of defect data is an important issue highlighted by Herzig et al. [11]. The first type of misclassification involves an issue being classified as a defect while it is not, or the other way around. The second type of misclassification involves an issue being attributed to the wrong version: an issue can automatically be attributed to the then-newest version, while it should be attributed to an older version. The first type of misclassification can be neutralized by manually examining every issue, applying a systematic approach for classification. The second type of misclassification can also be dealt with by examining every issue. This might involve skimming the issue title, description or comments for hints of the corresponding version, or manually reconstructing the issue with incrementally older versions, although the latter may be impracticable as it is extremely time-consuming.

Since manual classification is such a time-consuming task, the field would benefit from the establishment of a dataset with manually verified data. Over time, researchers could continue to add new data to it and feed data from the dataset to their defect prediction models in order to assess their performance. Moreover, this would facilitate homogeneity across studies, allowing for reliable inter-study comparison, much like the goal of the PROMISE dataset [16].

Another potential issue, which has not been addressed in this research, is the impact of an open-source project's popularity. As the popularity of such a project increases over time, the probability of a defect being found increases in parallel. As a result, the number of defect reports will most likely increase as well. This effect would be independent of the actual number of defects in the software and could thus confound results. Corrective measures could be taken to correct for this confounding factor in future research. Readily available quantitative data from a project's GitHub repository, such as how the number of contributors evolves over time, might aid in this. Closely related to this issue is the 'exposure' time of a certain release. All things being equal, it is highly likely that the shorter the 'exposure' time before a new release, the fewer defects will be attributed to it, and the longer its 'exposure' time before a new release, the more defects will be attributed to it. This may also confound results.
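One possible, hypothetical correction is to normalise raw defect counts by each release's exposure time; the dates and counts below are placeholders:

```python
# Hypothetical sketch: defects per day of exposure, where exposure runs from a
# release's date until the next release (or until the data was fetched).
from datetime import date

releases = [("1.0", date(2020, 1, 1), 12),   # (version, release date, raw defect count)
            ("1.1", date(2020, 1, 15), 3),
            ("1.2", date(2020, 4, 1), 20)]
fetched = date(2020, 6, 1)                   # date the issue tracker was queried

for i, (version, released, defects) in enumerate(releases):
    next_date = releases[i + 1][1] if i + 1 < len(releases) else fetched
    exposure = (next_date - released).days
    print(f"{version}: {defects / exposure:.3f} defects per day of exposure")
```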

Moreover, although the DC ED appears to have some predictive value concerning defect-prone releases based on the current results, it remains questionable to what extent this information is actionable. The DC ED provides information on changes in collaboration on the same files between developers: if the same developers work on the same files, the DC ED remains 0; if changing developers work on the same files, the DC ED increases. In an industrial setting, such results might translate to managers allocating the same developers to the same modules in order to prevent defects. In an open-source project this remains hardly actionable, as there is no manager directing who works on what. Since the results presented here are based on open-source JSFs, future research should investigate whether the same results hold in an industrial setting. A minimal sketch of this mechanism is given below.
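The sketch assumes the collaboration graph links two developers whenever they committed to the same file within a release, and that DC ED is the node-and-edge edit distance between consecutive releases' collaboration graphs; the commit data is a placeholder.

```python
# Hypothetical sketch: build per-release developer collaboration graphs from
# (developer, file) commit pairs and compute their edit distance (DC ED).
from itertools import combinations
import networkx as nx

def collaboration_graph(commits):            # commits: iterable of (developer, file)
    by_file = {}
    for dev, path in commits:
        by_file.setdefault(path, set()).add(dev)
    g = nx.Graph()
    for devs in by_file.values():
        g.add_nodes_from(devs)
        g.add_edges_from(combinations(sorted(devs), 2))
    return g

def dc_ed(g_old, g_new):
    return (len(set(g_old.nodes()) ^ set(g_new.nodes()))
            + len(set(g_old.edges()) ^ set(g_new.edges())))

v1 = collaboration_graph([("alice", "a.js"), ("bob", "a.js")])
v2 = collaboration_graph([("alice", "a.js"), ("carol", "a.js")])
print(dc_ed(v1, v2))  # same file, different developers -> DC ED > 0
```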

Finally, future research applying graph-based metrics should ensure that correct static source code call graphs are generated. This research generated these graphs with ACG, which, in principle, generates unsound call graphs and can thus miss calls. Although it seems unlikely that this negatively affected the reliability of the results of this research, it might be addressed in future research.


Bibliography

[1] E. N. Adams. Optimizing preventive service of software products. IBM Journal of Research and Development, 28(1):2–14, 1984.

[2] T. Aittokallio and B. Schwikowski. Graph-based methods for analysing networks in cell biology. Briefings in bioinformatics, 7(3):243–255, 2006.

[3] S. Artzi, J. Dolby, S. H. Jensen, A. Møller, and F. Tip. A framework for automated testing of JavaScript web applications. In Software Engineering (ICSE), 2011 33rd International Conference on, pages 571–580. IEEE, 2011.

[4] P. Bhattacharya, M. Iliofotou, I. Neamtiu, and M. Faloutsos. Graph-based analysis and prediction for software evolution. In Proceedings of the 2012 International Conference on Software Engineering, pages 419–429. IEEE Press, 2012.

[5] V. Caldiera and H. D. Rombach. The goal question metric approach. Encyclopedia of software engineering, 2(1994):528–532, 1994.

[6] A. Feldthaus, M. Schäfer, M. Sridharan, J. Dolby, and F. Tip. Efficient construction of approximate call graphs for JavaScript IDE services. In Software Engineering (ICSE), 2013 35th International Conference on, pages 752–761. IEEE, 2013.

[7] N. E. Fenton and M. Neil. A critique of software defect prediction models. Software Engineering, IEEE Transactions on, 25(5):675–689, 1999.

[8] A. Gizas, S. Christodoulou, and T. Papatheodorou. Comparative evaluation of javascript frameworks. In Proceedings of the 21st international conference companion on World Wide Web, pages 513–514. ACM, 2012.

[9] S. Guarnieri, M. Pistoia, O. Tripp, J. Dolby, S. Teilhet, and R. Berg. Saving the world wide web from vulnerable JavaScript. In Proceedings of the 2011 International Symposium on Software Testing and Analysis, pages 177–187. ACM, 2011.

[10] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. Software Engineering, IEEE Transactions on, 38(6):1276–1304, 2012.

[11] K. Herzig, S. Just, and A. Zeller. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In Proceedings of the 2013 International Conference on Software Engineering, pages 392–401. IEEE Press, 2013.

[12] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian. The promises and perils of mining github. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 92–101. ACM, 2014.

[13] C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, and E. J. Whitehead. Does bug prediction support human developers? Findings from a Google case study. In Software Engineering (ICSE), 2013 35th International Conference on, pages 372–381. IEEE, 2013.
