
NATURALIZE

A replication study

Daniel Conde

danijose24@gmail.com

July 31, 2015; 30 pages

Host organisation: Eindhoven University of Technology

External supervisor: Dr. Alexander Serebrenik
Internal supervisor: Dr. Vadim Zaytsev

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
1 Introduction
1.1 Problem statement
2 The nature of NATURALIZE
3 The experiments
3.1 Project selection
3.2 The relevance of coding conventions
3.3 Evaluation of NATURALIZE’s suggestion accuracy
3.4 The robustness of suggestions
3.5 Manual examinations of suggestions
3.6 Suggestions accepted by projects
4 Discussion
4.1 The relevance of coding conventions
4.2 The accuracy of NATURALIZE suggestions
4.3 The robustness of NATURALIZE
4.4 Qualitative evaluation of acceptance of NATURALIZE suggestions
4.5 Threats to validity
5 Conclusions
6 Related Work
7 Acknowledgements


Abstract

In November 2014, Allamanis et al. presented a framework called NATURALIZE that learns the style of a codebase and suggests revisions to improve stylistic consistency by applying statistical natural language processing to source code. [1]

They applied NATURALIZE to ten open source Java projects to suggest natural identifier names and formatting conventions. In their paper they also presented four tools for ensuring natural code during development and release management. They concluded that NATURALIZE achieves 94% accuracy in its top five suggestions for identifier names. They also performed a qualitative evaluation by (1) submitting a selection of NATURALIZE’s suggestions to three human evaluators to assess the quality of the suggestions, and (2) submitting suggestions to active projects for acceptance. They found that 63% of the suggestions were considered useful by the evaluators, and that 14 of 15 suggestions submitted to the active projects were accepted.

In this thesis we replicate the experiments using the same tools on ten projects selected from the same open source Java project repository (GitHub), in the same manner as the original authors. Running the same experiments, we find that NATURALIZE achieves an accuracy of 94% in its top five suggestions for identifier names. We also performed the same qualitative evaluations and found that 70% of NATURALIZE’s suggestions were considered useful by the evaluators, and that 6 of 21 suggestions were accepted into active projects.


1 Introduction

Every programmer has their own style and preferences when it comes to comments, identifiers, white space, formatting, naming conventions, etc. In order to enhance the readability and maintainability of source code, a set of guidelines may be established for a specific programming language. These guidelines, or coding conventions, recommend programming styles, practices and methods for each aspect of the language. Some of the more common coding conventions include naming conventions, comment conventions, and indent style conventions. [16]

A coding convention may be formalized in a document, or may be the informal set of practices of individual programmers. When collaborating on a project, programmers may be unaware of the coding conventions in place. This could be because the conventions were formed implicitly as a result of local decisions, which over time emerged as a consensus convention. In addition, even if coding conventions have been explicitly documented, one does not usually have the means to verify whether the code adheres to those conventions or to help developers bring their code into adherence.

In their paper “Learning natural coding conventions”, Allamanis et al. presented a framework called NATURALIZE that “solves the coding convention inference problem for local conventions.” NATURALIZE does this by “offering suggestions to increase the stylistic consistency of a codebase.”

NATURALIZE uses a codebase to learn identifier conventions, and applies statistical natural language processing (NLP [13]) techniques to generate a list of suggestions to aid refactoring, specifically renaming. Refactoring is the modification of source code to improve its readability or structure, often to bring it into conformance with a stated coding standard. The renaming of variables and methods is a common refactoring activity.

The paper presents a series of experiments to show the relevance of coding conventions, and the accuracy, robustness, and acceptance of NATURALIZE. The results of the experiments show that NATURALIZE achieves 94% accuracy in its top suggestions for identifier names and that it never drops below a mean accuracy of 96% when making formatting suggestions.

The authors also demonstrate the relevance of coding conventions by showing empirically that programmers enforce conventions through code review feedback and corrective commits. In addition, a qualitative review resulted in 14 of 18 patches based on NATURALIZE suggestions being incorporated into five of the most popular open source Java projects.

1.1 Problem statement

The original work presents the framework NATURALIZE which the authors used to run a series of experiments in order to demonstrate its relevance, accuracy, effectiveness, quality and acceptance. This thesis is a replication of the original paper and attempts to validate the results by running the same experiments.

Replications have been gaining in popularity within the field of software engineering, as they help to confirm the original results by performing external validating experiments. They also help gain insight into the original results and build knowledge. [8]

There are two types of replications, conceptual and exact, where the latter type attempts to follow the original procedures as closely as possible in order to determine whether the same results can be obtained. [8] In this thesis we attempt an exact replication.

It is an exact replication in terms of methodology, the types of experiments performed, and the tools used to run the experiments. The projects used for the experiments are not the same, but they have the same characteristics, i.e., they are well-known projects with a considerable number of forks and watchers obtained from GitHub. [17]


The reason for using a different set of projects is to demonstrate that NATURALIZE achieves a level of accuracy and robustness similar to the original paper regardless of the choice of projects.

The execution of the experiments and the examination of the results will answer the following question:

Are the experiments and the results presented in the paper externally valid with respect to relevance, accuracy, robustness, and acceptance of the suggestions?

In order to replicate the results presented in the original paper we combined different research methods:

1. Empirical research method. This is used to demonstrate the relevance of coding conventions. [7]

2. Automatic evaluation. This is a standard research method used for natural language processing. Following the guidelines of the original paper, this method is used to evaluate the accuracy of the suggestions made by NATURALIZE. [2, 3]

3. Qualitative Evaluation. We use this method to evaluate the effectiveness of NATURALIZE by running two experiments with human evaluators. [4, 5]

This thesis is structured as follows:

- Section 1 introduces the subject and motivation
- Section 2 presents information about the main elements of NATURALIZE
- Section 3 contains the experiments performed using NATURALIZE
- Section 4 discusses the results obtained from the experiments
- Section 5 presents the conclusions of the thesis
- Section 6 provides a summary of related work


2 The nature of NATURALIZE

NATURALIZE is language independent and can be applied to a wide variety of coding conventions. It uses an existing codebase, called the training corpus, from which it learns the code convention. In our experiments the training corpus is the current codebase from which NATURALIZE can learn the coding convention used in the project selected for the experiment. The architecture of NATURALIZE is illustrated in Figure 1 below.

Figure 1 - The architecture of NATURALIZE. [1]

The architecture consists of two main elements: proposers and a scoring function. The input is a code snippet which is to be “naturalized”. The snippet is selected based on user input and the type of tool used.

The proposers modify the input snippet and generate a list of alternative suggestions (candidates) for the input snippet. The scoring function evaluates the suggestions for naturalness compared with the training corpus. The naturalness of a snippet is measured using statistical language modelling. The scoring function determines the alternative candidates preferred by the language model, and also the confidence in its selection. In order to obtain the most “natural” suggestion, the scoring function compares the scores of the alternative candidates, selecting the candidate with the highest score, i.e., the best suggestion. The output is a short list of suggestions representing alternative snippets which may replace the input snippet.

Another function is the suggestion function, which is responsible for determining what suggestions are sufficiently natural. In order to determine if a suggestion is sufficiently natural, the function applies two thresholds, k and t. The threshold k determines the rank, i.e., the maximum number of suggestions allowed. The threshold t is a minimum confidence value, which “controls the suggestion frequency, that is, how confident NATURALIZE needs to be before it presents any suggestions to the user.” [1]

The authors also propose a scoring function (binary snippet decision) to determine whether to accept or reject a snippet. This function measures the best suggestion that NATURALIZE is able to make, and compares it with a confidence value T. The function not only compares alternatives for renaming, but also for formatting. The function rejects a snippet when the score is higher than the threshold T, meaning that the snippet is unnatural according to the language model. As T increases, fewer input snippets will be rejected, resulting in more snippets with unnatural code being accepted. On the other hand, as T increases there will also be fewer natural snippets rejected.
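To make the role of the thresholds concrete, the following is a minimal sketch (not the authors' implementation) of a suggestion function driven by a language-model score; the function names, the parameters, and the convention that a lower score means a more natural snippet are assumptions for illustration.

# Minimal sketch (not the authors' implementation) of a suggestion function
# driven by a language-model score. A lower score is assumed to mean a more
# natural snippet; all names are hypothetical.
from typing import Callable, List, Tuple

def suggest(candidates: List[str],
            score: Callable[[str], float],
            k: int = 5,
            t: float = 2.0) -> List[Tuple[str, float]]:
    """Return at most k candidate snippets whose score passes the threshold t."""
    scored = sorted(((c, score(c)) for c in candidates), key=lambda cs: cs[1])
    # Threshold t controls suggestion frequency: candidates scoring worse than t
    # are discarded. Threshold k caps the number of suggestions shown to the user.
    return [(c, s) for c, s in scored if s <= t][:k]

# Toy usage with a dummy scoring function (lower = more natural).
dummy_scores = {"result": 0.8, "res": 1.2, "foo": 5.0}
print(suggest(list(dummy_scores), dummy_scores.get, k=5, t=2.0))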

The authors present four tools that use the core of NATURALIZE to improve developer productivity and code quality. These are:

Stylish? This tool rejects commits that introduce an excessive disruption of the conventions of a codebase. This is important to prevent developers from committing changes into the repository which do not follow the coding conventions of the project. The tool returns a binary value indicating whether the code is natural. When it finds that the code is natural, or if it cannot find a good alternative, it makes no suggestions.


Buildlm. This tool is able to build a language model, i.e., it creates a series of rules extracted from the inferred conventions of a project. This language model can be used by a code formatter such as styleprofile and stylish?.

Devstyle. This is an Eclipse plugin that allows a developer to check whether the changes made in the source code are unconventional or not. The results are presented in order of “naturalness”, where the result in the top position is the most natural.

Styleprofile. This tool allows one to analyse a set of files, which results in a series of renaming suggestions for identifiers.

These tools are available on the webpage of the authors.1

1http://groups.inf.ed.ac.uk/naturalize


3 The experiments

The experiments in this replication are the same as in the original paper and are primarily designed to evaluate the accuracy and robustness of NATURALIZE in making naming suggestions. In addition, experiments were performed to assess the relevance of coding conventions, and to evaluate the acceptance of specific NATURALIZE suggestions by human evaluators and current open source projects. The experiments fall into four categories as follows:

1. Two empirical experiments to assess the relevance of coding conventions:
   a. An evaluation of 1000 randomly selected commit messages (Subsection 3.2.1)
   b. An examination of code review discussions (Subsection 3.2.2)
2. Four experiments to evaluate the accuracy of NATURALIZE suggestions. Specifically, experiments were performed to assess:
   a. The accuracy of single point suggestions for identifiers (Subsection 3.3.1)
   b. The accuracy of multiple point suggestions for identifiers (Subsection 3.3.2)
   c. The accuracy of single point suggestions for formatting (Subsection 3.3.3)
   d. The effectiveness of the Stylish? tool at rejecting unnatural code (Subsection 3.3.4)
3. Two experiments to assess the robustness of NATURALIZE by evaluating:
   a. The effectiveness of NATURALIZE in not suggesting junk names (Subsection 3.4.1)
   b. Sympathetic uniqueness (Subsection 3.4.2)
4. Two qualitative evaluations of the acceptance of NATURALIZE suggestions by:
   a. Review of 20 suggestions by three human evaluators (Subsection 3.5)
   b. Submission of 21 suggestions to active projects (Subsection 3.6)

The tools for running the experiments were provided by the authors of the original paper. The specific experiments and their results are described in the following sections.

3.1 Project selection

The methodology for carrying out the experiments follows that of the original paper. For our training corpus we use well-known open source Java projects from GitHub [10]. As in the original paper, the selection is made from active projects that are not forks. Projects are selected based on the number of forks and watchers. Watchers are people who are notified of changes in a project, and forks are subprojects stemming from the main project. The number of watchers and forks are an indication of how relevant a project is because they show (1) the extent to which people are interested in following the development of the project (watchers), and (2) the interest in contributing to further and/or related development (forks).

As mentioned earlier, this thesis is an exact replication in terms of methodology, the types of experiments performed, and the tools used to run the experiments, but the projects used for the experiments are different. This was done in order to test NATURALIZE on a different set of projects. Unlike the original paper, we do not assume a normal distribution of forks and watchers on GitHub to calculate z-values and generate a list of ten projects for the experiments. Instead, we give each project a numerical value based on the sum of its forks and watchers, and order the projects by the value obtained.
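As an illustration of this ranking step, the following minimal sketch orders candidate projects by the sum of forks and watchers; the sample values are taken from Table 1 below.

# Minimal sketch of the popularity ranking used for project selection:
# order candidate projects by the sum of their forks and watchers.
# The sample values are taken from Table 1 below.
projects = [
    ("zxing", 3125, 650),
    ("slidingmenu", 4459, 986),
    ("android-universal-image-loader", 3937, 1135),
]

ranked = sorted(projects, key=lambda p: p[1] + p[2], reverse=True)
for name, forks, watchers in ranked:
    print(f"{name}: popularity = {forks + watchers}")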

The open source Java projects were obtained using Google BigQuery [11] and GHTorrent [12] in combination with GitHub. Both BigQuery and GHTorrent provide a platform to query massive datasets. GitHub provides a public dataset which is updated every hour and can be accessed using Google BigQuery. The combination of Google BigQuery and GitHub’s dataset makes it possible to process a considerable amount of data quickly. GHTorrent monitors the GitHub public event timeline and retrieves its contents. The data is extracted to a MySQL database which can be accessed using its own web-based database explorer.


Specifically, we use BigQuery to obtain the Java projects ordered by forks and watchers, which we convert into a list of candidate projects that we use as a starting point to select our test projects. We validate that each project is not a fork by using the GHTorrent web-based database explorer. We then validate the number of forks and watchers for the projects by using the GitHub search engine, manually checking and comparing the numbers obtained from GitHub with the list of projects obtained using BigQuery and GHTorrent.

Project Name                     Forks  Watchers  Commits  Description
slidingmenu                      4459   986       336      Android Slide Menu Library
android-universal-image-loader   3937   1135      970      Android Caching And Displaying Images
android                          3065   974       2715     Github’s Android App
zxing                            3125   650       3103     Java Barcode Image Processing Library
viewpagerindicator               3037   612       241      Android Paging Indicator Widgets
android-async-http               2895   717       777      Android Asynchronous HTTP Library
iosched                          2338   794       128      Google I/O 2014 Android App
bigbluebutton                    2563   225       10828    Web Conferencing System
springside4                      1851   688       858      JavaEE, Reference Architecture
picasso                          1839   632       885      Android image processing library

Table 1 - The ten selected open-source Java projects in order of popularity, defined as the sum of forks and watchers.

In order to comply with replication guidelines [8], we removed and replaced four projects that were also used in the original paper.2 Replication guidelines mandate that either all or none of the projects of the original paper should be used. We elected to select new projects in order to expand the number of projects used to evaluate NATURALIZE, and to increase our knowledge of the NATURALIZE framework. This resulted in a refined list of ten popular projects used in this thesis for the experiments. The resultant list spans a wide range of projects in terms of size and type (see Table 1).

3.2 The relevance of coding conventions

The original paper commences with two empirical studies to assess the relevance of following coding conventions, and in particular that of formatting and naming conventions. We replicate these experiments as described below.

3.2.1 Commit messages

Although programming languages have established conventions of their own, there are also internal conventions that arise as spontaneous agreements and must be followed by the team members.

As was done in the original paper, we examined 1000 commit messages extracted randomly from the ten selected open source projects. These messages were examined for mentions of renaming, changing formatting, and following other format conventions.

We follow the guidelines of the original paper in order to obtain the commit messages from the projects: we clone the master branch of each project and export the commit messages using the following git command:

git log --since='dd-mm-yyyy' --pretty=format:'"%h","%s","%b"' --no-merges > log.csv

As we can see, the command has different parameters such as “since”, “pretty” and “no-merges”. The parameter “since” represents the day when the repository was created.

2 The four replaced projects were: elasticsearch, libgdx, netty, and platform_frameworks_base. They were the four most popular projects in the original study, and not surprisingly, they were still among the most popular projects at the time of our selection.


This is important to ensure that every commit message is available for selection. The next parameter, “pretty”, defines the format of the exported file. In this case the format defines “%h” as the hash of the commit, “%s” as the subject of the commit, and “%b” as the body of the commit message, all separated by commas. Commits often have more than one line in the body of the commit message, which this one-line-per-commit format does not handle; every commit selected for the sample was therefore manually checked in order to ensure that all the lines of the commit message are included in the experiment. Last but not least, the parameter “no-merges” filters out commits that are the result of an automatic merge of repository branches, keeping the sample clean by only including original commits from the developers. The result is a comma-separated values file per project containing the commit messages. The following is an example of a commit message selected randomly from the project “bigbluebutton”.

"a182800","Fix video profiles init","TODO: Fix problems with webcam icon"

Once the commits are collected, we proceed to randomly select the sample for the experiment using a formula written in Excel VBA (Visual Basic for Applications).
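The sampling step can be illustrated with the following minimal sketch, written in Python rather than the Excel VBA formula actually used; the file names are hypothetical, and the sketch assumes the per-project CSV files produced by the git command above.

# Minimal sketch (Python, instead of the Excel VBA formula used in the thesis)
# of drawing a random sample of commit messages from the per-project CSV files
# produced by the git log command above. File names are hypothetical.
import csv
import random

def load_commits(paths):
    commits = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            # Each well-formed row is: hash, subject, body. Multi-line bodies
            # break this format and were checked manually in the thesis.
            commits.extend(tuple(row) for row in csv.reader(f))
    return commits

paths = [f"{name}_log.csv" for name in ("slidingmenu", "zxing", "picasso")]
all_commits = load_commits(paths)
random.seed(42)                                   # reproducible sample
sample = random.sample(all_commits, k=min(1000, len(all_commits)))
print(f"sampled {len(sample)} of {len(all_commits)} commit messages")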

The commit messages were drawn randomly from the ten projects. In order to analyse the commit messages, we read every commit message in the sample. The commit messages are classified as renaming, formatting, or general depending on their content, i.e., if a message states that it includes a name fix or a change in the formatting of the source code, it is classified accordingly and included in the results. Commit messages that relate to coding conventions in general were included in the category “general”. For example, messages with words such as “refactoring” are classified as “general”.

During the analysis of the commit messages we observed that, as was the case in the original paper, developers sometimes do not comment explicitly in the commit message that some changes in the source code are related to coding conventions.

Project Hash Message Coding Convention
sliding-menu cd1ed8a "Added ability to set behind width instead of offset. Basically just the inverse, but easier to manipulate for tablet layouts, etc." Renaming
android ea72e54 "Only update recent list when organization changes" Formatting

Table 2 - Sample of selected commit messages.

Commits often include more than one type of change. Developers may make explicit the changes related to fixes or new implementations, but may not make explicit the renaming and formatting changes in the source code. Table 2 above shows two commit messages drawn from two projects where coding convention changes are implicit in the source code. They were determined to include coding conventions by manually examining the differences in the source code. Figure 2 illustrates the first example from Table 2. Here we see, circled in white, the removed line in red:

“private int mTouchModeAbove = SlidingMenu.TOUCHMODE_MARGIN”,

which is replaced by the new line in green, in which not only the access level of the identifier was changed, but the variable was also renamed (see Figure 2).


Figure 2 - Example of implicit renaming (Hash cd1ed8a).

The results of our review of the 1000 commit messages are shown in Tables 3, 4, and 5. We found that 4.6% of commit messages contained changes related to coding conventions, which is similar to the 4% reported in the original work. Looking in more detail, we find that in our work 1.9% of the commit messages contained renamings, compared with 1% in the original paper. On the other hand, the original paper states that 2% of the commit messages contained formatting changes, whereas our experiment shows a result of 1.6%.

Project name                     Renaming  % Renaming  Formatting  % Formatting  Total*  % Total
slidingmenu                      3         3%          1           1%            5       0.5%
android-universal-image-loader   2         2%          3           3%            8       0.8%
android                          0         0%          3           3%            3       0.3%
zxing                            2         2%          2           2%            5       0.5%
viewpagerindicator               3         3%          0           0%            3       0.3%
android-async-http               1         1%          1           1%            5       0.5%
iosched                          2         2%          3           3%            6       0.6%
bigbluebutton                    0         0%          2           2%            2       0.2%
springside4                      3         3%          1           1%            6       0.6%
picasso                          3         3%          0           0%            3       0.3%
Total                            19        1.9%        16          1.6%          46      4.6%

* Includes other coding conventions

Table 3 - Result of review of 1000 randomly selected commit messages.

                            Renaming   Formatting  Other coding conventions  No coding conventions
Original Paper (Expected)   1% (10)    2% (20)     1% (10)                   96% (960)
Thesis (Observed)           1.9% (19)  1.6% (16)   1.1% (11)                 95.4% (954)

Table 4 - Contingency table.

Type         Commits (Thesis)     Commits (Original paper)  p-val
Conventions  4.6% (3.3% - 5.9%)   4% (3% - 6%)              p < 0.01
Renaming     1.9% (1.0% - 2.7%)   1% (1% - 2%)              p < 0.01
Formatting   1.6% (0.8% - 2.4%)   2% (1% - 3%)              p < 0.01

Table 5 - Percent of commit messages containing feedback related to coding conventions, renaming, and formatting.


Performing a chi-squared statistical validation demonstrates that there is no significant difference between the expected results (the original paper) and the results obtained in our experiment. The chi-squared statistic is 3.304 with a corresponding p-value of 0.34709. The result of the two reviews shows that between 4 and 5% of the commits were related to coding conventions, attesting to their relevance and confirming the results presented in the original paper.
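The comparison can be reproduced from the counts in Table 4; the sketch below assumes a chi-squared test of independence on the 2x4 contingency table (scipy is used purely for illustration), which yields the statistic reported above.

# Minimal sketch of the chi-squared comparison: a test of independence on the
# 2x4 contingency table of Table 4 (counts per 1000 commits) yields the
# reported statistic of about 3.304 with 3 degrees of freedom (p ~ 0.347).
from scipy.stats import chi2_contingency

#           renaming, formatting, other, none
original = [10, 20, 10, 960]   # expected (original paper)
thesis   = [19, 16, 11, 954]   # observed (this thesis)

chi2, p, dof, _ = chi2_contingency([original, thesis])
print(f"chi2 = {chi2:.3f}, p = {p:.5f}, dof = {dof}")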

3.2.2 Code review discussion

The second experiment to assess the relevance of coding conventions consists of examining discussions regarding reviews of source code changes. For this experiment we use a sample of GitHub code reviews. Code reviews on GitHub occur when a collaborator makes changes in a branch and wants to include those changes in the master branch (merge the branches). In order to ask the collaborators of the master branch to include the changes, the author of the changes creates a “Pull Request”.

When a pull request is created, a conversation starts. The collaborators of the master branch can review the changes by inspecting the source code. Additionally, the inspectors can make comments referencing the source code, ask questions, and request some other changes by the author of the pull request before making a decision. The collaborators of the master branch can either merge the pull request, i.e., accept the changes, or reject the pull request by closing the pull request.

The sample used in our experiment consists of 135 pull requests (1043 comments) selected from a large, popular, active project called JUnit.3 The examination seeks to determine to what extent the feedback from collaborators is related to identifier naming, code formatting, or code conventions in general.

We found that 24% of pull requests contained feedback regarding coding conventions in general, including 7% related to renaming and 10% to formatting.

In the original paper this examination was done on 169 code reviews (1093 threads) randomly selected across product groups performed at Microsoft Corporation during 2014. The authors found that in that case 38% of the code reviews included coding conventions in general, of which 24% were related to renaming and 9% to formatting.

                            Renaming  Formatting  Other coding conventions  No coding conventions
Original Paper (Expected)   9%        24%         5%                        62%
Thesis (Observed)           7%        10%         7%                        76%

Table 6 - Contingency table.

A comparison of the results is shown in Table 6. We performed a chi-squared validation of the results and determined a chi-squared statistic of 11.1697 with a corresponding p-value of 0.010843. This shows a significant difference between the results of the JUnit and Microsoft code reviews.

That the Microsoft code reviews show higher percentages can be attributed to a more formal well-established review process which includes strict standards, definitions, and coding conventions. What is significant in this experiment is that both the JUnit and Microsoft code reviews show that a significant proportion of code reviews (pull requests on GitHub) has content related to coding conventions, naming and formatting. This also attests to the relevance of coding conventions.


3.3 Evaluation of NATURALIZE’s suggestion accuracy

In this section we replicate the original paper’s evaluation of NATURALIZE's suggestion accuracy. Following the evaluation framework of the original paper, we first evaluate naming suggestions, focussing on suggesting new names for (1) variables (locals, arguments, and fields); (2) method calls; and (3) type names (class names, primitive types, and enums). These are the distinct types of identifiers that the Eclipse compiler recognizes. [14]

When NATURALIZE suggests a renaming, it renames all locations where that identifier is used at once. In addition, a leave-one-out cross validation is used in order to avoid training the language model on the files for which we are making suggestions. This is to ensure that NATURALIZE does not select the correct name for an identifier from other occurrences in the same file.

3.3.1 Single Point Suggestion

In this experiment we evaluate the accuracy of NATURALIZE in making single point suggestions. A single point suggestion is when NATURALIZE is asked for suggestions for one specific identifier.

Following the original experiment, we use an automated tool to collect all unique identifiers from the test files, and the locations where they occur and name the same entity. We then ask NATURALIZE to suggest a new name and to rename all occurrences of the identifiers at once. Next we ask NATURALIZE to score every alternative and to rank them from high to low in a list of suggestions. We search the list of suggestions for the original name of the identifier, and the position that it appears in. When the original name does not appear in the list of suggestions, the perturbation is excluded from the sample, as was done in the original paper. To measure the accuracy, we collect the number of times that NATURALIZE returns the original name in the top k = 1 and k = 5 positions in the list of suggestions.

Figures 3, 4, and 5 show the results of our experiments. Figures 6, 7, and 8 compare our graphs with those from the original paper.

Each point on the curves represents a value for a quality threshold t. This threshold controls the suggestions that NATURALIZE makes by discarding all suggestions with a score higher than t. As we increase the threshold, the quality of the suggestions decreases.

The x-axis shows the suggestion frequency, which is a measure of how often NATURALIZE makes a suggestion when it is in a position to do so. The y-axis shows suggestion accuracy. This measures the frequency of when the original name is found in the top k suggestions. The experiments were run using k = 1 and k = 5 as top suggestions following the guidelines of the original work. The shaded area in the graph represents the interquartile range at 25% and 75% for k =1 and k = 5.
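The following minimal sketch shows how one point on these curves can be computed for a given threshold t and rank k; the data structure holding NATURALIZE's scored suggestions is assumed for illustration, with a lower score taken to mean a better suggestion, as in Section 2.

# Minimal sketch of computing one (frequency, accuracy) point for a threshold t
# and rank k. Each case pairs the original identifier name with NATURALIZE's
# suggestions sorted by score (lower = better); the structure is hypothetical.
def curve_point(cases, t, k):
    """cases: list of (original_name, [(suggested_name, score), ...])."""
    made = correct = 0
    for original, ranked in cases:
        kept = [name for name, score in ranked if score <= t][:k]
        if not kept:
            continue                     # not confident enough: no suggestion made
        made += 1
        if original in kept:
            correct += 1                 # original name recovered in the top k
    frequency = made / len(cases)                  # x-axis: suggestion frequency
    accuracy = correct / made if made else 0.0     # y-axis: suggestion accuracy
    return frequency, accuracy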

We observe that NATURALIZE achieves high accuracy when the frequency is low. This is because there are fewer suggestions, but these are of a higher quality. Similarly, when the quality of the suggestions decreases, the frequency increases because more suggestions with a lower quality score are accepted. The result is that suggestion accuracy drops. This is the same result as found in the original paper.

However, as the original authors pointed out, NATURALIZE does reject sufficient suggestions with a low quality score to avoid the “Clippy effect”.4 We note that at a 60% suggestion frequency, the lowest accuracy (for typenames) is still above 70%.

As was also the case in the original work, the frequency and accuracy of NATURALIZE vary by project and type of identifier.

4 The “Clippy effect” occurs when the user ignores a system because too many of the suggestions that it makes are not useful. These systems are often disabled by the user. http://bit.ly/1OOxAkI


Figure 3 - Accuracy of Suggestions for Variables.

Figure 4 - Accuracy of Suggestions for Method Calls.

Figure 5 - Accuracy of Suggestions for Typenames.


Figure 6 - Accuracy of Suggestions for Variables. Our results left, graph from the original paper right.

Figure 7 - Accuracy of Suggestions for Method Calls. Our results left, graph from the original paper right.

Figure 8 - Accuracy of Suggestions for Typenames. Our results left, graph from the original paper right.

We note that our results for suggestion accuracy in the top positions for suggestion frequencies up to 50% are comparable to those presented in the original paper. Specifically, we obtained the same mean accuracy of 94% across all identifier types at k = 5. The mean accuracy for all identifiers at k =1 at the 50% suggestion level was 82%.


A more detailed comparison of the results is shown below.

Variables. In Figure 6, the graph on the left shows the results we obtained for the accuracy of variables; the graph on the right is the one presented in the original work.

For k = 5, we obtain an accuracy of 99% to 100% for frequencies from 10% to 40%, which is approximately the same accuracy presented in the original work for the same frequency range. As the frequency increases the accuracy drops, with the exact point varying by project; we note that it starts to drop between frequencies of 40% and 50%. At the 50% suggestion frequency, we recorded an accuracy of 94%, compared with 97% accuracy in the original paper at the same frequency. The accuracy continues to decrease to a level of 67% at a suggestion frequency of 80%. In the original paper the accuracy at this level was still at 80%.

As is to be expected, NATURALIZE shows lower accuracy for suggestions at k = 1, because in this case NATURALIZE needs to return the original identifier name in the first position of the suggestion list. We obtained an accuracy of 79% at a frequency of 50%, comparable with the approximately 89% accuracy presented in the original work at the same frequency level. Nevertheless, 79% accuracy for k = 1 is notable for a system that learns conventions from the codebase. The results for the entire range are shown in Figure 3.

Method calls. Figure 7 shows the NATURALIZE suggestion accuracy for method calls. In general, the accuracy we obtained across the entire frequency range (x-axis) is higher than the accuracy presented in the original work for both k = 1 and k = 5. At the 50% frequency level, we obtained 99% accuracy for k = 5 and 95% accuracy for k = 1, which exceeds the accuracy obtained in the original work, approximately 89% for k = 5 and 73% for k = 1. As we observed with variables, the accuracy of NATURALIZE decreases as the frequency increases, and this case is no exception. However, the accuracy obtained at 80% frequency is still quite high, and also higher than that presented in the original work for both k = 1 and k = 5. We obtained accuracies of 81% (k = 5) and 70% (k = 1), compared to approximately 70% and 53% in the original work.

Typenames. Figure 8 shows the accuracy of NATURALIZE suggestions for typenames. We note that the accuracy of the suggestions drops considerably faster than for variables and method calls, and also in comparison with the accuracy for typenames obtained in the original work. Specifically, from a frequency of 30% at k = 5 and 20% at k = 1, we observe that the accuracy drops very quickly. Although the accuracy at k = 5 is still at the 90% level at a frequency of 40%, it drops to 82% at the 50% level. In the original paper the accuracy for k = 5 at that frequency was still approximately 97%. We do observe in Figure 5 that the overall shape of the curves is similar.

3.3.2 Multiple Point Selection

NATURALIZE is able to make suggestions for identifiers by using the tools devstyle and/or styleprofile. In this section we evaluate the accuracy of NATURALIZE in performing this task. In order to accomplish this we follow the guidelines of the original paper by mimicking code snippets in which one identifier name violates the project’s code convention. For this we use as our test snippets the methods of all project files.

For each test snippet we randomly select and perturb one identifier name to one that does not occur in the project. Then we run NATURALIZE to determine whether it correctly shows the perturbed name in a list of suggested changes. We evaluate the accuracy of NATURALIZE by determining how often it correctly identifies the perturbed snippet and at what position it occurs in the list of suggestions.

As was done in the original work, we set the recall at rank k = 7 because humans can take in seven items at a glance. [18, 19] We compute the results by using an automated tool that helps us to randomly obtain and perturb the test snippets.


We observe that NATURALIZE correctly identifies the perturbed name 75% of the time, and that it shows the perturbed name on average in position two in the list of suggestions (mean reciprocal rank 0.57). In the original paper, NATURALIZE correctly identified 64% of the perturbed names, and also showed them on average in position two of the suggestion list (mean reciprocal rank 0.47).
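The two metrics can be computed as in the following minimal sketch, where the structure `ranks` is hypothetical: each entry is the 1-based position of the perturbed name in the suggestion list, or None when it does not appear.

# Minimal sketch of recall at rank k = 7 and mean reciprocal rank.
# `ranks` holds, per perturbed snippet, the 1-based position of the perturbed
# name in the suggestion list, or None when it is absent.
def recall_at_k(ranks, k=7):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy example: found at ranks 1 and 2, missed once.
print(recall_at_k([1, 2, None]), mean_reciprocal_rank([1, 2, None]))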

The following graph (Figure 9) shows the mean accuracy of NATURALIZE by project in the multiple point selection experiment. The accuracy is measured by the number of times that NATURALIZE correctly shows the perturbed name in a list of k = 7 suggestions. The accuracy ranges from 69% to 90%, and we observe that the variance from the mean is 0.42%. The line labelled “Our Results” corresponds to the mean accuracy achieved by NATURALIZE across the ten projects in our experiment. The dotted line labelled “Original work” corresponds to the mean accuracy of NATURALIZE presented in the original work.

Figure 9 - NATURALIZE accuracy in the Multiple Point Selection experiment.

This experiment is complemented by a manual examination using devstyle. We randomly select 13 methods from each of the ten projects, and manually perturb one random identifier to a random name (junk999) which we assume is not used in the codebase. The results of this manual perturbation are similar to those obtained when using the automated tool to select and perturb identifiers. NATURALIZE returned the perturbed name 74% of the time at rank k = 7, and on average in position 2 of the list (mean reciprocal rank of 0.53).

3.3.3 Single point suggestion for formatting

NATURALIZE also makes formatting suggestions. To test the accuracy of its suggestions we run an experiment where formatting is changed, and then test if NATURALIZE is able to recover the original formatting from the context.

Replicating the experiment in the original paper, we use a 5-gram language model using a modified token stream (q = 20). We use an automated tool to obtain all the files from the projects in our experiment. The tool obtains the content of the file in tokens and searches for a whitespace token to be perturbed. The whitespace token is stored to be used later.

The tool proceeds to obtain the n-gram tokens that are around the whitespace token, and perturbs only the whitespace token. The language model of the original n-gram without the perturbation is then used to ask NATURALIZE for suggestions for the perturbed n-gram.


Only the first suggestion is considered for measuring the accuracy of NATURALIZE. To determine whether the suggestion is correct, the tool compares the name of the token that was perturbed with the name of the first suggestion. The procedure is repeated by the tool for all the whitespace tokens in a file across all the files in a project.

The tool also applies thresholds, where a lower threshold means more “quality” restrictions resulting in fewer suggestions being accepted. Conversely, the higher the threshold, the more suggestions will be accepted. For each threshold, the tool calculates (a) the number of correct suggestions, (b) the number of given suggestions, and (c) the total number of suggestions. The latter is the total number of perturbations analysed. The accuracy of NATURALIZE is calculated by dividing the number of correct suggestions by the number of given suggestions.

Following is an example of the output of this experiment for the project named “SlidingMenu”. Each threshold value determines what “quality” score a suggestion must attain to be considered. The given suggestions are the number of suggestions that meet the threshold criteria, and the correct suggestions indicate how many perturbations are correctly detected by NATURALIZE.

Thresholds
[0.1, 0.2, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0, 50.0, 100.0]

Number of Correct Suggestions
[453, 512, 1326, 1886, 2581, 3224, 5162, 6827, 8244, 9891, 12625, 14608, 15912, 16954, 18189, 18827, 19331, 20069, 21248, 21256]

Number of Given Suggestions
[460, 519, 1345, 1925, 2639, 3288, 5232, 6911, 8345, 10011, 12804, 14848, 16199, 17285, 18593, 19275, 19846, 20680, 22119, 22130]

Precision
[0.9848, 0.9865, 0.9859, 0.9797, 0.9780, 0.9805, 0.9866, 0.9878, 0.9879, 0.9880, 0.9860, 0.9838, 0.9823, 0.9809, 0.9783, 0.9768, 0.9741, 0.9705, 0.9606, 0.9605]

Total perturbations = 22140

Precision = Correct Suggestions / Given Suggestions
Suggestion Frequency = Given Suggestions / Total Perturbations

Table 7 - Example of results, single point suggestion for formatting.
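A minimal sketch of applying the two formulas above to the first few threshold entries of Table 7:

# Minimal sketch applying the precision and suggestion-frequency formulas to
# the first few threshold entries of Table 7 (SlidingMenu project).
thresholds = [0.1, 0.2, 0.5, 1.0]
correct    = [453, 512, 1326, 1886]
given      = [460, 519, 1345, 1925]
total_perturbations = 22140

for t, c, g in zip(thresholds, correct, given):
    precision = c / g                         # correct suggestions / given suggestions
    frequency = g / total_perturbations       # given suggestions / total perturbations
    print(f"t = {t}: precision = {precision:.4f}, frequency = {frequency:.4f}")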

Figure 10 - Accuracy of single point evaluation for formatting across ten projects (k=1).


As shown in Figure 11, NATURALIZE is extremely effective in formatting suggestions. On our ten test projects NATURALIZE achieved 97.7% mean accuracy, essentially the same as the 98% suggestion accuracy presented in the original paper (Figure 12).

Figure 11 - Accuracy of single point evaluation for formatting across ten projects (k=1).

The box plots show the variance in performance across the ten projects.

Figure 12 - Accuracy of single point evaluation for formatting (original paper).

The lowest accuracy attained (shown in Figure 10 in the form of blue crosses) is 86%. This score results from one specific project (bigbluebutton) which had a mean accuracy of 92%, varying from 86% to 95%.

As also noted in the original paper, the high accuracy is still attained when NATURALIZE is required to reformat half of the whitespace in the test file (0.5 suggestion frequency in the graphs). As noted by the authors in the original paper, this is remarkable for a system that is not provided with any hand-designed rules about what formatting is desired.

3.3.4 Binary snippet decisions

This experiment is designed to determine how effective the tool stylish? is at rejecting unnatural code in terms of formatting or naming. This was done by selecting at random 500 snippets from each project, perturbing the identifiers in one third of the snippets and the whitespace tokens in another third, and leaving the remaining third of the snippets unperturbed. Note that the tool does not know which snippets are perturbed or how they were perturbed. NATURALIZE was then run to assess to what extent it correctly recognises the perturbed (unnatural) code.
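A minimal sketch of how the ROC points in the next figure can be derived from the binary snippet decision: every snippet receives a naturalness score and is rejected when the score exceeds the threshold T. The scores, labels, and helper function are assumptions for illustration.

# Minimal sketch of deriving ROC points from the binary snippet decision.
# Each snippet carries a score (higher = more unnatural, as described in
# Section 2) and a flag saying whether it was perturbed; values are hypothetical.
def roc_points(scored_snippets, thresholds):
    """scored_snippets: list of (score, was_perturbed); assumes both classes occur."""
    n_perturbed = sum(1 for _, perturbed in scored_snippets if perturbed)
    n_clean = len(scored_snippets) - n_perturbed
    points = []
    for T in thresholds:
        rejected_perturbed = sum(1 for s, p in scored_snippets if p and s > T)
        rejected_clean = sum(1 for s, p in scored_snippets if not p and s > T)
        tpr = rejected_perturbed / n_perturbed   # perturbed snippets correctly rejected
        fpr = rejected_clean / n_clean           # unperturbed snippets wrongly rejected
        points.append((fpr, tpr))
    return points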


Figure 13 - Accuracy of rejecting unnatural changes. Our results left, graph from the original paper right.

Figure 13 shows the results of our experiment and those of the original paper as ROC curves. Each point on the curves represents a different threshold, where the x-axis shows the False Positive Rate and the y-axis the True Positive Rate, the latter being the proportion of perturbed snippets that were correctly rejected.

As shown in the graph, NATURALIZE correctly rejects 36% of perturbed snippets when the FPR is at 20%. In the original study NATURALIZE correctly rejected 40% of the perturbed snippets at an FPR of just 5%. As was the case in the original study, NATURALIZE is somewhat worse at rejecting snippets whose variable names have been perturbed. It was suggested that it may be more difficult to predict identifier names than formatting.

3.4 The robustness of suggestions

An important aspect of the original paper was to test the robustness of the suggestions. One aspect is to verify that NATURALIZE does not simply rename all tokens to common “junk” names that appear in many contexts. Another aspect is that NATURALIZE should retain unusual names that signify unusual functionality. This is referred to in the original paper as the Sympathetic Uniqueness Principle or “SUP”.

3.4.1 Junk Names

This experiment evaluates the effectiveness of NATURALIZE in not suggesting junk names. A junk name is defined as a semantically uninformative name that is used in disparate contexts. For example, “foo” and “bar” are junk names because they do not provide any information about what they represent. On the other hand, names like “i” and “j” are not considered junk names when they are used as loop counters. It is, therefore, difficult to formalise the concept of junk, and detecting junk is not a straightforward task. But as the original paper explains, “most developers know it when they see it.”

Since NATURALIZE learns from the codebase, we might expect it to often suggest junk names because they appear in many different n-grams. The original work states that the opposite is true: junk names that appear in too many contexts receive a lower probability and a worse score, and suggestions with a low score are rejected.

In our experiment we simulate a low quality project by randomly renaming variables to junk names in each project, where all the occurrences of the same variable receive the same junk name. The purpose is to simulate a low quality corpus which NATURALIZE will use to make suggestions. We run the experiment and measure how NATURALIZE’s suggestions are affected by the proportion of junk names present in the training corpus. The percentages of junk names introduced in every project are 0.5%, 1%, 2%, 3%, 5%, 7%, 10% and 15%. As in the original paper, we generate the junk variables using a discrete Zipf distribution with a slope of s = 1.08, where the slope characterizes the distribution of the data. Zipf’s law describes how frequent and rare words are distributed in a collection when the words are ranked by frequency: the frequency of a word multiplied by its rank in the list is approximately constant. The constant is set by the word at rank 1, since multiplying its frequency by its rank (one) simply yields its frequency; for every other word the product is therefore close to the frequency of the rank 1 word. This has been verified in previous work. [15]
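The junk names can be drawn as in the following minimal sketch, which approximates the discrete Zipf distribution with s = 1.08 over a bounded pool of names; the pool size and naming scheme are hypothetical.

# Minimal sketch of drawing junk names from a (bounded) discrete Zipf
# distribution with slope s = 1.08. The pool size and name scheme are
# hypothetical.
import numpy as np

s, pool_size = 1.08, 500
weights = 1.0 / np.arange(1, pool_size + 1) ** s
weights /= weights.sum()                 # P(rank r) proportional to 1 / r^s

rng = np.random.default_rng(0)

def junk_name():
    rank = rng.choice(pool_size, p=weights) + 1      # 1-based rank
    return f"junk{rank}"                             # e.g. "junk1", "junk42"

print([junk_name() for _ in range(5)])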

Figures 14 and 15 below show the effect on NATURALIZE’s suggestions in our and the original experiment, respectively, as the evaluation projects are gradually infected with more junk names. By comparing the boxplots with the dotted-line which corresponds to y=x, we observe that initially NATURALIZE suggests junk names with a lower frequency than they exist in the perturbed project. However, as the level of perturbation to the codebase increases, so does the percent of identifiers renamed to junk names. We also found that as the percent of junk increases, the number of projects affected also increases. Specifically, with perturbation at 7% there were 3 projects with a major impact, at 10% there were four, and at 15% six of the ten projects showed a major impact. The graph of the results of the experiment in the original paper shows the same tendency.

Figure 14 - The impact of adding junk to the training corpus, our experiment.

Figure 15 - The impact of adding junk to the training corpus, original paper. [1]


3.4.2 Sympathetic uniqueness

The original work states that NATURALIZE substantially preserves surprising identifier names when they appear in unusual functionality, avoiding the “heat death of a codebase”.5 This experiment measures the percentage of identifiers for which NATURALIZE does not suggest an alternative name, in relation to a threshold t. The threshold controls how strict the quality requirement on suggestions is: only suggestions, in this case for unknown names, that pass the threshold t are accepted and considered part of the result. We follow the methodology of the original work by finding all identifiers in each test file that are unknown (UNK) to the language model. We then ask NATURALIZE for suggestions for every unknown identifier in the project and check whether NATURALIZE suggests keeping the unknown identifier.
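A minimal sketch of this measurement, under the assumed convention that an alternative rename is only offered when the model's confidence in it reaches the threshold t (so a larger t means fewer, more confident alternatives); the data and helper function are hypothetical.

# Minimal sketch of the sympathetic-uniqueness measurement, assuming an
# alternative rename is only suggested when its confidence reaches the
# threshold t. An UNK identifier counts as preserved when no alternative
# clears the threshold. Values and names are hypothetical.
def percent_preserved(best_alternative_confidence, t):
    """best_alternative_confidence: per UNK identifier, the confidence of the
    best non-UNK alternative, or None when no alternative is offered at all."""
    preserved = sum(1 for c in best_alternative_confidence if c is None or c < t)
    return 100.0 * preserved / len(best_alternative_confidence)

# Toy example over the threshold range shown on the x-axis of Figure 16.
confidences = [0.5, 2.3, None, 4.1, 7.0]
print([round(percent_preserved(confidences, t), 1) for t in range(1, 7)])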

The results of our experiment are shown in Figure 16 and are not dissimilar to those of the original work, in the sense that they show that NATURALIZE does substantially preserve surprising identifier names. However, there are some noteworthy differences between the types of identifiers. The original paper found that for reasonable threshold values (between 3 and 6, varying by type of identifier), NATURALIZE suggests non-unknown identifiers for only a small proportion of the UNK identifiers (about 20%). We found the same for variables and methods, but at lower threshold levels between 2 and 4. With respect to types, we did not find a substantial improvement beyond t = 4, and the percent of preserved surprising types levelled off at approximately 56%, well below the 80% of variables and methods.

Figure 16 - Percent of preserved surprising identifiers. Our results left, graph from the original paper right.

3.5 Manual examinations of suggestions

We evaluate NATURALIZE’s suggestions by running a qualitative experiment. This experiment gives us information about the quality of the suggestions of NATURALIZE by using three human evaluators as was done in the original work. The human evaluators are computer scientists with at least 3 years’ experience.

We use two projects (bigbluebutton and android) from the corpus and run styleprofile on 30 randomly selected methods, obtaining a list of suggestions for all the identifiers in each method. As in the original work, we assigned 20 test snippets per evaluator. In this way every snippet is assessed by two human evaluators.

The task of each evaluator is to determine whether the suggestion in a snippet is reasonable, based not only on their personal opinion, but also on the coding conventions of the project from which the snippet was selected. Every evaluator works independently and has access to the source code of the project.

5 The original work states that the “heat death of a codebase” occurs when semantically rich, surprising names are progressively replaced by common, semantically uninformative names.


In the experiment every evaluator has 30 minutes to read the source code of the project to obtain insight into the coding conventions used in the project. Once the evaluators have explored the project, they have 15 minutes per code snippet to evaluate the suggestions. We also ask the evaluators to note how much time they took to make a decision whether to accept or reject the suggestions. As in the original paper, we noticed that the evaluators needed 5 minutes on average per code snippet. Additionally we asked the evaluators for comments about the experiment, and one common remark was that 30 minutes is not enough to really learn about the conventions applied in the project.

Classification       Total  Percentage
Both “Yes”           9      30%
Both “No”            9      30%
Yes/No – No/Yes      12     40%
At least one “Yes”   21     70%

Table 8 - Results of manual examination.

As in the original paper, our results provide evidence that NATURALIZE’s suggestions are qualitatively reasonable. We found that 70% of the suggestions were accepted by at least one evaluator, compared with 63% acceptance in the original paper. Acceptance by both evaluators was 30%, compared with 50% in the original work.

Figure 17 - Code snippet where last alternative was chosen.

In addition, we asked the evaluators to explain why they accepted or rejected specific suggestions. Interestingly, it was sometimes difficult for them to give a reason for their choice. In this connection, we find that although the top k = 1 suggestions were selected for almost all snippets, we also note an occasion when this was not the case. In this instance the evaluator preferred the last alternative in the list (Figure 17).


Another curious occurrence is the snippet shown in Figure 18, where one evaluator accepted a suggestion stating that the current name is too generic, whereas another evaluator rejected the same snippet because it is common to represent that kind of variable in that way.

3.6 Suggestions accepted by projects

We also evaluate the acceptance of NATURALIZE’s suggestions by submitting high confidence renamings to seven of our evaluation projects. This follows the guidelines of the original work: we run styleprofile to select three high confidence suggestions for each project (21 commits in total), and then create a pull request per project to see whether the suggestions are accepted and merged into the project.

Three projects accepted the suggestions (6 of 9 commits in total), two projects did not answer, and two projects rejected the requests. Regarding the latter two, one commented that there was ‘No need to use common used names in those places.’ The other project which rejected the pull request stated “I'd generally say this is too trivial to bother with, but it's really minor anyway”.

Project                          Pull Request  Status
android                          835           Merged (1/3)
android-async-http               898           Merged (3/3)
android-universal-image-loader   1027          Closed
bigbluebutton                    677           Open
picasso                          1080          Merged (2/3)
springside4                      468           Open
zxing                            407           Closed


4 Discussion

Allamanis et al. presented NATURALIZE as the first tool that learns style from a local codebase and provides suggestions to improve stylistic consistency. Their conclusion was that “NATURALIZE effectively makes natural suggestions, achieving 94% accuracy in its top suggestions for identifier names, and even suggests useful revisions to mature, high quality, open source projects”. In our thesis we replicate the original experiments with the following results.

4.1 The relevance of coding conventions

Our review of 1000 randomly selected commit messages (experiment 3.2.1) shows that a non-trivial 4.6% of the commits contained revisions to code conventions. This compares with 4% in the original paper.

In addition, we performed an examination of 135 pull requests (1043 comments) from a large, popular, active project called JUnit (experiment 3.2.2). We found that 24% of pull requests contained comments regarding coding conventions in general, including 7% related to renaming and 10% to formatting. In the original paper a similar examination was done for code reviews performed at Microsoft Corporation during 2014, with the finding that 38% of the code reviews included coding conventions in general, of which 24% were related to renaming and 9% to formatting.

Chi-squared validation done on the commit messages showed that the differences were not significant. On the other hand, the JUnit code review did show a significant difference. The higher percentages for the Microsoft review in the original paper may be attributable to a more rigorous approach to code reviews at Microsoft. The results of both experiments attest to the relevance of coding conventions.

4.2 The accuracy of NATURALIZE suggestions

In the experiment on the accuracy of a single point suggestion for identifiers (experiment 3.3.1), we found that NATURALIZE attains high levels of accuracy in its top suggestions for all identifiers. Specifically, with k = 5, we note accuracies of 99% and 94% for variables; 99% and 98% for method calls; and 91% and 82% for typenames, at frequency levels of 40% and 50%, respectively. We calculate a mean accuracy of 93.8% over all identifier types, compared with 94% presented by Allamanis et al. in their conclusion.

We found the accuracy of NATURALIZE’s multiple point suggestions for identifiers (experiment 3.3.2) to be 75%, compared with 64.3% in the original work. We also found that NATURALIZE returned the bad name on average in position 2 of the list, the same as in the original experiment.

NATURALIZE accuracy for single point suggestions for formatting (experiment 3.3.3) was 97.7%, nearly the same as the 98% accuracy reported in the original paper.

We found the effectiveness of NATURALIZE at rejecting unnatural code (experiment 3.3.4) to be less than reported in the original study. Whereas in the original paper the tool was seen to correctly reject 40% of the perturbed snippets at a False Positive Rate of 5%, this level was not reached in our experiment until an FPR of 22%. Nevertheless, even at this rate NATURALIZE would still be useful for filtering commits in a pre-commit script.

4.3 The robustness of NATURALIZE

With respect to junk names, we found that NATURALIZE initially suggests junk names at a lower frequency than they exist in the perturbed project (experiment 3.4.1). However, when the level of junk in the perturbed project progressively increases, we find that the suggestion frequency of junk increases faster. We found that when junk reaches a level of 10% in the perturbed project, the junk suggestion frequency is also 10%. A progressive increase is also seen in the results of the original paper, but at a lower rate of increase.


As in the original paper, we found that NATURALIZE substantially preserves unique variables and methods (experiment 3.4.2, Sympathetic Uniqueness), but is less effective in doing so with types. We found that NATURALIZE suggests non-unknown identifiers for only 20% of the unique identifiers at relatively low thresholds, which is slightly better than in the original paper.

4.4 Qualitative evaluation of acceptance of NATURALIZE suggestions

The qualitative evaluation shows that NATURALIZE does suggest useful revisions to mature, high quality, open source projects.

In the review of 20 commits by three evaluators, 70% of the commits were accepted by at least one evaluator. This compares with 63% acceptance reported in the original paper.

With respect to commits suggested to projects, we found that 6 of 21 commits (29%) were accepted compared with 14 of 18 (78%) in the original paper.

4.5 Threats to validity

In an exact replication there is always the risk of replicating the threats to validity present in the original experiments. Some of the principal threats to validity are discussed below.

External validity

As was done in the original paper, we selected ten of the most popular active Java projects on GitHub for our experiments. However, GitHub hosts 98,000 Java repositories, covering projects of many sizes (e.g. varying numbers of identifiers), with different maturities of coding conventions, and in different stages of development. A different selection of projects might have yielded different results.

Experiments that involve the work of individual developers and internal or external evaluators are also subject to external threats. In addition, the experiments were conducted in an artificial setting, with the standard external threat that results in a real-life situation may differ. For example, as in the original paper, evaluators were given 30 minutes to familiarise themselves with the coding conventions used in the projects. In real life, the evaluator of a renaming suggestion would already be familiar with the coding conventions.

Construct validity

All experiments were designed as they were in the original work. The experiments related to the accuracy of NATURALIZE were run using the same tools with the same source code as in the original paper. These tools were provided by the original authors. The tools used for the graphical comparison and presentation of the results were also provided by the original authors.

This means that as a replication study, all the threats of construct validity in this thesis are derived from the original work.

Internal validity

Internal validity might have been affected by our understanding of the original paper. However, we have had close contact with the authors of the original paper to ensure that our understanding is correct.

5 Conclusions

In our experiments we were able to replicate the results presented by Allamanis et al. Specifically, with respect to the principal claim, we found that NATURALIZE effectively makes natural suggestions, achieving 93.8% accuracy (94% in the original paper) in its top suggestions for identifier names and 97.7% accuracy (98% in the original paper) for formatting suggestions.

We also found that NATURALIZE suggests useful revisions to mature, high quality, open source projects, but to a lesser degree than was found in the original paper.

6 Related Work

During this thesis we reviewed a number of papers specifically related to identifier naming, coding conventions, and machine learning to deduce coding style from a code base. Principal works reviewed include the following:

Butler et al. presented in 2010 the paper Exploring the Influence of Identifier Names on Code Quality: An empirical study [20], in which they show that flawed identifiers and methods in Java classes are associated with low quality source code.

Lawrie et al. presented in 2006 the paper What’s in a Name? A Study of Identifiers [21], in which they reported on the results of their experiments on comprehension using different alternatives for naming identifiers. Specifically, they tested the use of abbreviations, single letters, and complete words, and found that complete words improve source code comprehension.

Sharif et al. wrote about source code comprehension using identifier names in a 2010 paper entitled An Eye Tracking Study on camelCase and under_score Identifier Styles [22]. They present a study to determine whether identifier naming conventions affect code comprehension, using eye tracking to compare the underscore and camel case styles for identifier naming. Although the results do not show a difference between the two in terms of accuracy, the subjects recognised identifier names written with underscores faster.

Reiss presented in 2007 the paper Automatic code stylizing [25], which introduces a tool that uses machine learning to deduce coding style from a codebase. The knowledge obtained can then be used to transform source code to that coding style. This approach derives a coding style from the codebase itself, rather than following a rule-based approach in which all parameters must be defined explicitly. The study covered different elements of coding style such as indentation, naming, spacing, and ordering.

7 Acknowledgements

I would like to thank the following for their constructive criticism, guidance, and insights:

Miltiadis Allamanis, co-author of the original paper on NATURALIZE.
Dr. Alexander Serebrenik, Eindhoven University of Technology.
Dr. Vadim Zaytsev, University of Amsterdam.

Bibliography

[1] Allamanis, M., Barr, E.T., Bird, C., Sutton, C., Learning natural coding conventions, Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, ACM, pp. 281-293.

[2] Lin, C.-Y., ROUGE: A package for automatic evaluation of summaries, Proceedings of the ACL-04 Workshop, 2004, ACL, pp. 74–81.

[3] Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, ACL, pp. 311–318.

[4] Seaman, C.B., Qualitative methods in empirical studies of software engineering, IEEE Transactions on Software Engineering, 1999, IEEE, pp. 557-572.

[5] Kaplan, B., Maxwell, J., Qualitative Research Methods for Evaluating Computer Information Systems, Evaluating the Organizational Impact of Healthcare Information Systems, Health Informatics, 2005, Springer New York, pp. 30-55.

[6] Tichy, W.F., Padberg, F., Empirical Methods in Software Engineering Research, In Companion to the proceedings of the 29th International Conference on Software Engineering, ICSE COMPANION 2007, IEEE, pp. 163-164.

[7] Shull, F., Singer, J., Sjøberg, D. Guide to Advanced Empirical Software Engineering, 2008, London: Springer-Verlag.

[8] Shull, F., Carver, J., Vegas, S., Juristo, N., The role of replications in Empirical Software Engineering, Empirical Software Engineering Journal, 2008, Kluwer Academic Publishers Hingham, 13(2), pp. 211-218.

[9] Bird, C., Rigby, P. C., Barr, E. T., Hamilton, D. J., German, D. M., Devanbu, P., The promises and perils of mining git, IEEE International Working Conference on Mining Software Repositories, 2009, IEEE Computer Society Washington, pp. 1-10.

[10] Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D. M., Damian, D., The promises and perils of mining GitHub, Working Conference on Mining Software Repositories, 2014, ACM, pp. 92-101.

[11] Google Inc., An Inside View At Google BigQuery, https://cloud.google.com/files/BigQueryTechnicalWP.pdf, 2012. Visited June 12, 2015.

[12] Gousios, G., The GHTorrent dataset and tool suite. Working Conference on Mining Software Repositories, 2013, IEEE Press Piscataway, pp. 233-236

[13] Jurafsky, D., Martin, J. H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, 2nd edition, 2009.

[14] Eclipse-Contributors, Eclipse JDT, http://www.eclipse.org/jdt/, 2015. Visited July 10, 2015.

[15] Allamanis, M., Sutton, C., Mining source code repositories at massive scale using language modelling, Proceedings of the 10th Working Conference on Mining Software Repositories, 2013, IEEE Press Piscataway, pp. 207-216.

[16] Oracle, http://www.oracle.com/technetwork/java/codeconvtoc-136057.html, Visited July 10, 2015.

[18] Cowan, N., The magical number 4 in short-term memory: A reconsideration of mental storage capacity, Behavioral and Brain Sciences, 2001, pp. 87-114.

[19] Miller, G., The magical number seven, plus or minus two: some limits on our capacity for processing information, Psychological Review, 1956, pp. 81-97.

[20] Butler, S., Wermelinger, M., Yu, Y., Sharp, H., Exploring the Influence of Identifier Names on Code Quality: An empirical study, 14th European Conference on Software Maintenance and Reengineering, 2010, pp. 156-165.

[21] Lawrie, D., Morell, C., Feild, H., Binkley, D., What’s in a Name? A Study of Identifiers, ICPC '06 Proceedings of the 14th IEEE International Conference on Program Comprehension, 2006, IEEE, pp. 3-12.

[22] Sharif, B., Maletic, J., An Eye Tracking Study on camelCase and under_score Identifier Styles, ICPC '10 Proceedings of the IEEE 18th International Conference on Program Comprehension, 2010, pp. 196-205.

[23] Drupal, https://www.drupal.org/coding-standards, Visited July 15, 2015.

[24] MSDN, http://bit.ly/1JnVW2v, Visited July 16, 2015.

[25] Reiss, S., Automatic code stylizing, ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering 2007, pp. 74-83
