Detecting Rework in Software: A Machine Learning Technique

(1)

by

Minh Phuc Nguyen

B.Sc., Ho Chi Minh University of Science, Vietnam, 2015

A Project Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

c

Minh Phuc Nguyen, 2020 University of Victoria

(2)

Detecting Rework in Software: A Machine Learning Technique

by

Minh Phuc Nguyen

B.Sc., Ho Chi Minh University of Science, Vietnam, 2015

Supervisory Committee

Dr. Daniela Damian, Supervisor (Department of Computer Science)

Dr. Neil Ernst, Committee member (Department of Computer Science)

(3)

iii

ABSTRACT

Rework, which is a major software development activity, is the additional effort needed to correct or improve completed work. Software organizations generally spend 40 to 50 % of their effort on avoidable rework, which could have been foreseen. The cost of rework could exceed 50% of the project cost if it is identified late in devel-opment cycle. Therefore, rework detection is key to efficient software develdevel-opment and to reduce project cost. However, little research has been conducted to study about rework identification and categorization. In this project, we proposed a ma-chine learning based rework detection model that can classify a development task as rework or not based on its description. With the help of data augmentation, we achieved an F1-score of 0.72 and an accuracy of 0.74. We designed a flexible rework detection service architecture, which could be integrated with collaborative develop-ment platforms. Based on the trained model, we impledevelop-mented a proof of concept service in Python, and integrated it into Jira.

(4)

List of Tables

Table 2.1 Summary of the data set . . . 4

Table 2.2 Performance comparisons of Multinomial Naive Bayes . . . 11

Table 2.3 Performance comparisons of SVC . . . 11

Table 2.4 Performance comparisons of XGBoost . . . 12

(6)

List of Figures

Figure 2.1 Frequency distribution of words . . . 5

Figure 2.2 Sample length distribution . . . 6

Figure 3.1 Rework detection service architecture . . . 15

Figure 3.2 A Jira task and rework label . . . 16

Figure 3.3 Webhook settings in Jira . . . 19

Figure 3.4 Create a new automation rule trigger in Jira . . . 20

(7)

Chapter 1 Introduction

Rework in software development is the additional effort needed to repair or improve completed work. Rework in a software project includes rewriting code that is inef-ficient or has bugs, and rewriting code due to environment changes or requirement changes. It is a major software development activity [1], indeed software organiza-tions generally spend 40 to 50 % of their effort on avoidable rework [2]. The cost of rework is extremely high that it could exceed 50% of the project cost [3]. In addition, rework is expensive proportionally to how late it is discovered in development cycle [4, 5]. Many studies focused on reducing the amount of rework as early as possible to avoid expensive downstream fixes [6, 7, 8].

In fact, not all rework should be eliminated completely, and a certain amount of rework is expected. Apart from corrective maintenance and new features, enhance-ments are also made iteratively to improve system maintainability, performance and design implementation [9, 10]. In this project, we aim to study about rework which includes both unavoidable and avoidable rework. Essential rework refers to the amentioned enhancements, and is unavoidable because developers could not have fore-seen the changes needed in the next version of software systems. On the other hand, accidental rework refers to development effort needed to correct previous work, which is also known as defects. These defects are avoidable and they should be discovered early and fixed less expensively.

Cass et al. have attempted to formalize and explain the nature of rework [1, 11]. The authors proposed a method to describe requirement specifications visually using a processing programming language named Little-JIL. The visual diagrams are expected to aid organizations in detecting rework. However, we find it difficult to build an automatic rework detection method based on this work. Wahono et al.

(8)

reviewed many automatic approaches using machine learning models [12]. The data set used in these automatic approaches, PROMISE [13], contains source code metrics (e.g. line count of code, number of the flow graph, count of lines of comments, etc) which can only be calculated after code changes are made.

As mentioned before, rework should be identified as soon as possible to reduce the cost of software development. However, there is little known research on rework identification and categorization, which are important subjects. Having a good under-standing of rework is key to efficient software development. For example, developers and project managers can estimate development tasks better if they are aware of rework and which rework is unavoidable or avoidable.

In this project, we propose a machine learning based rework detection model. Our goal is that the model can predict a development task is a rework or not based on its description. A task description typically describes detailed information about the task, such as the motivation of the task and pieces of work that must be completed. This project is guided by the following research questions:

RQ1 Is it possible to identify rework using machine learning techniques?

RQ2 Is it possible to build a rework detection service in order to aid software devel-opment?

Our work has two main contributions:

1. We propose a machine learning model which is able to predict a development task as a rework or not with cross-validated accuracy of 75%, and an F1-score of 0.73.

2. We design an extensible rework detection service that is integrated into collab-orative development platforms.

The rest of this project report is organized as follows. Chapter 2 introduces the methodology, experimental results and discussions. Chapter 3 describes a proposed architecture for the rework detection service and a proof-of-concept service that is integrated with Jira platforms. Chapter 4 presents our conclusions.

(9)

3

Chapter 2 Experimental Evaluation

In this chapter, we describe how we develop machine learning classifiers to categorize a development task as rework. Based on these predictive models, we select the best performing classifier to design and build a proof-of-concept automatic rework detec-tion service to integrate with collaborative software development platforms, such as JIRA. We propose rework detection as a binary text classification problem i.e. given a document, the model outputs class 0 indicating not rework or class 1 indicating a rework.

2.1 Methodology

2.1.1 Data set

To the best of our knowledge, there is no public data set for rework classification of development tasks. We collaborated with an organization that develops cloud-based e-commerce platforms for worldwide customers. The organization granted us access to their issue tracking system, Jira, under a non-disclosure agreement. To track an issue in software development, a development task of a platform like Jira has many attributes such as summary, task description, labels, assignee, priority, comments, etc. We used task description to build our data set and ignored other attributes. More specifically, each sample in our data set is a text file, which contains a task description. We argue that this attribute is the most crucial of a development task and suitable to feed into a document classifier. Moreover, using task description fits our goal to detect a rework as early as possible since task descriptions usually exist in task creation. We collected a new private data set, which is only accessed

(10)

Not Rework Rework Total

Initial data set 343 81 424

Manual data set 83 212 295

Final data set 426 293 719

Table 2.1: Summary of the data set

by us, the organization and people with appropriate permissions, based on software development tasks we retrieved and parsed by a Python script. This Python script fetched Jira development tasks into csv files using Jira REST API, each line in a csv file contains attributes of a task separated by commas. After that, we extracted ID and description attributes of the tasks to generate our data set where file name is a task ID and a file contains a task description. We discarded tasks with no description or tasks that are older than 4 years.

The organization had classified 424 development tasks, of which 81 tasks were considered as rework as the initial data set for the project. To determine if a de-velopment task represented rework, the organization analyzed key words relevant to rework in task description such as “refactor”, “rework” and examined references to previous completed tasks. The initial data set was small and imbalanced where only 20% of the tasks were rework. To add more data samples to the data set and increase our confidence of rework classification, we manually labeled 295 unclassified tasks in 6 rounds of independent classification. After each round, we evaluated our inter-agreement score through Cohen’s kappa coefficient. In the last round, we reached a substantial inter-rater reliability agreement at 0.6 [14]. These classified tasks were then verified by the aforementioned organization that 212 tasks were rework. In total, our data set contains 719 tasks from 5 software projects of the organization, of which 293 tasks were labeled as rework. We denote Rework as positive class (class 1) and Not Rework as negative class (class 0) because our goal is to detect rework. The final data set is fairly balanced as the rework class represents 40.75% of tasks (293 tasks) in our data set. Table 2.1 summarizes the number of tasks of each class in the data set we collected.

We did an explanatory data analysis to learn more about the data set character-istics. Figure 2.2 shows the sample length (in characters) distribution of samples in the data set. The shortest task description is 27 characters. The longest task descrip-tion have a length of 16,548 characters. Most of task descripdescrip-tions are less than 2,500 characters. More specifically, 637 tasks (88.59%) are shorter than 2,500 characters.

(11)

5

Figure 2.1: Frequency distribution of words

Figure 2.1 shows top 40 frequent words in the data set. We observe some of these words are common words in English and might be uninformative, or stop words. We filtered them out using the stop word set provided in the NLTK library [15].

2.1.2 Data cleansing and preparation

Cleaning data is a crucial step of machine learning model training, and can improve the results significantly. Technical documents, like Jira development tasks, are likely to contain extra uninformative or unwanted words, such as source code and HTML tags. In this project, we performed a variety of techniques to reduce noise in our data set: removing template directives, punctuation, non-English words and stop words, and lemmatization.

Template directives are texts that are inserted automatically or manually into task description when a user creates a new development task. The purpose of template directives is to insert common texts into and recommend a structure for task descrip-tions. For example, in our raw data set, template directives in a bug report could be ”** Quick Summary **”, ” ** Acceptance Criteria **”, ”*Steps to reproduce*”, etc. They are used to recommend task creators what information should be included in task descriptions, and obviously they do not affect the semantics of task descriptions.

(12)

Figure 2.2: Sample length distribution

These template directives usually appear between one or more asterisk symbols, so we remove them using regular expressions.

For punctuation removal, we relied on the built in punctuation string in Python, specifically

string.punctuation = !"# \$ %&’()*+, -./:;<=>?@[\]^_‘{|}~. We simply removed a character in a document if it is a punctuation.

To remove stop words, we used the English stop word set provided by the NLTK library as we mentioned above.

Lemmatization is the process that groups inflected forms of a word as a single word. In other words, lemmatization converts a word to its base form. For example, “watched”, “watches”, and “watching” are different forms of the base word “watch”. By performing lemmatization, we can reduce the dimensions of the vector that rep-resents a document, thus speeding up model training. We used the WordNet [16] Lemmatizer provided by the NLTK for this data cleaning technique.

(13)

7

2.1.3 Downsampling

Downsampling or subsampling is a technique to deal with imbalanced data sets. We first started with an imbalanced data set, where only 20% of the samples were labeled as rework. We did several experiments to verify effectiveness of the technique. Specifically, downsampling is the process of removing samples of majority class to balance the ratio of the classes in the data set. Through experiments, the classification results with downsampling applied were not better than not doing downsampling. A possible reason is because the data set was small, removing data samples from the data set also cause loss of information for the model to learn. After we added more samples to the data set, we decided not to use this technique in our final experiments.

2.1.4 Data augmentation

Due to the fact that the data set is fairly small, we experimentally applied data augmentation to explore whether it actually improves training results. Data aug-mentation is a technique that enlarges our data set by adding artificial data samples. Additionally, because the model has the possibility to train with more diverse data, it is expected to reduce over-fitting, and increases the overall the reliability of the mod-els. Augmentation is used abundantly and improves results of many tasks in computer vision [17, 18]. Indeed, one can easily apply image manipulation operations, such as cropping, rotating, noise adding, scaling, etc, to create many more images from a data set. Augmentation does not change the semantics of images whereas in natural language processing, the semantics of a document might be changed by augmentation. Therefore, we need to keep the semantics of documents as much as possible.

In this project, we applied two techniques: named synonym replacement and word swapping from Easy Data Augmentation (EDA) techniques [19]. The authors of the EDA research reported that for the most part, augmented samples generated from their technique conserved the labels of original samples. For synonym replacement, we iterated through a document, selected a random number of words and replaced each selected word with a random synonym. For example, synonyms of send are “deliver”, “forward”, “issue”, “post”, “ship”, etc and we can choose a random word in this set to replace the word send. The number of replaced words is controlled by a parameter α. We used α = 0.3 throughout our experiments. Synonym replacement was implemented as in the following Python code:

(14)

words = text.split(’ ’)

count = int(alpha * len(words))

indexes = random.sample(range(len(words)), count) for idx in indices:

ss = syns_set(words[idx]) if len(ss) > 0:

words[idx] = random.choice(tuple(ss)) return ’ ’.join(words)

Where syns set is the set of synonyms of a word excluding itself. The set is obtained by the following helper method:

def syns_set(word): ss = set()

for syn in wordnet.synsets(word): for l in syn.lemma_names():

ss.add(l.replace(’-’, ’ ’)) if word in ss:

ss.remove(word) return ss

For word swapping, we randomly selected two words in the document and swap them together. We do this for a number of times n which is controlled by a parameter α like synonym replacement. It was implemented in the following code:

def word_swap(text, alpha = 0.3): words = text.split(’ ’)

count = int(alpha * len(words)) for i in range(count):

indexes = random.sample(range(len(words)), 2)

words[indexes[0]], words[indexes[1]] = words[indexes[1]], words[indexes[0]] return ’ ’.join(words)

Through experiments, we observed that word swapping did not help in improv-ing the classification performance, but synonym replacement positively impacted the performance. In the results section, we will focus on describing performance gains from synonym replacement.

(15)

9

We denoted the augmented data set as DA data set to differentiate from the baseline data set.

2.1.5 Feature engineering

Textual documents must be converted some numeric representations, which can be understood by machine learning algorithms. We used two feature extracting methods in this project: Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF).

Bags of Words : This is a simple but very common method to represent a text document. A bag of words model of a document is a numeric vector where each dimension is a word in the document from the text and its value is the frequency of the word. Here is an example of how to convert a document to a bag of words model on a simple text: “I like playing sports. I like watching movies too”. We split into words (or tokens), then count the frequencies of the words. The resulting vector of the text is [2, 2, 1, 1, 1, 1, 1], corresponding to the sorted sequence of words [“I”, “like”, “movies”, “playing”, “sports”, “too”, “watching”]. One might want to take account of not only single words (uni-grams) but phrases (bi-grams or tri-grams) which contain more than one word. We used CountVectorizer (CV), a bag of words implementation provided by scikit − learn, with ngram range parameter = (1, 2) (both uni-grams and bi-grams are selected.) to transform our data set into count vectors.

TF-IDF : TF-IDF is another metric that is used to transform a document into a numerical vector. While word count vectorizer only considers word frequencies, TF-IDF considers how relevant a word is to a document. Technically, TF-IDF is computed by multiplying TF metric and IDF metric. TF denotes the contribution of a term to a document:

tf (t, d)=. count of t in d number of terms in d

IDF indicates how informative a term is across a collection of documents. It is calculated by the following formula:

idf (t, d)= log(. number of documents

(16)

TF-IDF indicates how relevant a word is to a document. A very common word results in a low or zero value, and a rare word has high value indicating that the word can be useful in distinguishing the document. We used T f Idf V ectorizer provided by scikit − learn to extract TF-IDF vectors.

2.1.6 Model Development

We experimented with four common algorithms for text classification: Multinomial Naive Bayes, SVM, Random Forest, and XGBoost Logistic Regression. Multinominal Naive Bayes and SVM are known for their simplicity and performance. Random For-est and XGBoost follow the ensemble learning method, which creates a multitude of decision trees and averaging the output from the trees. We tested if Random Forest and XGBoost could achieve good classification results on our data sets. For each algorithm, we performed an extensive model training using three feature extractors: CV, TF (without multiplying the IDF feature) and TF-IDF. We first ran these al-gorithms on our original data set, resulting in baseline classifiers. Then, we ran the algorithms on the DA data set which is enlarged by data augmentation, with the same settings as the previous run. The main purpose of the experiment was to verify whether data augmentation is helpful to small data sets. After extracting features, we used Chi-squared test to calculate the dependency between each feature and the label of the corresponding document. We selected top-1200 features which have high dependency, indicating that they are important features.

We used 10-fold cross validation to evaluate models to reduce over-fitting. It is worth noting that on the DA data set, we only applied augmentation on the training set after the data was split into the training set and test sets because test sets should only contain real data samples. Another reason is that if we perform data augmenta-tion before data splitting, there is a chance an augmented data sample is in training set and its original version is in test set. The models then have knowledge about data in test set and the results will be unreliable. Therefore, we used KF old class in sklearn to randomly generate training and test indices of a data set. Parameter random state is also used to make sure the results in this project are reproducible.

For evaluation metrics, we gathered accuracy, precision, recall and F1-score. We searched for the best F1-score among classifiers.

(17)

11

2.2 Results

Data set Model F1 Accuracy Precision Recall

Baseline CV 0.53 0.651 0.584 0.491 DA CV 0.677 0.636 0.593 0.698 Baseline TF 0.429 0.684 0.824 0.297 DA TF 0.675 0.638 0.533 0.93 Baseline TFIDF 0.343 0.664 0.858 0.22 DA TFIDF 0.653 0.593 0.5 0.946

Table 2.2: Performance comparisons of Multinomial Naive Bayes

CV, TF and TFIDF are abbreviations for feature extractors that we discussed in the previous section: Count Vectorizer, Term Frequency and Term Frequency-Inverse Document Frequency.

Table 2.2 shows performance metrics of the classifier Multinominal Naive Bayes on the baseline and DA data set. With CV, accuracy decreases by 0.015. Precision is slightly improved while recall significantly jumps from 0.491 to 0.698, thus increasing F1-score by 0.147 from 0.53 to 0.677. With TF, accuracy also drops by 0.046 to 0.638, and precision sees a decline from 0.824 to 0.533. But there is a huge improvement on recall from 0.297 to 0.93. As a result, F1-score is improved by 0.246 to 0.675. With TFIDF, the same pattern happens similar as with TF, and F1-score rises from 0.343 to 0.653. In general, models on the DA data set are more reliable and accurate.

Table 2.3: Performance comparisons of SVC

Table 2.3 shows performance metrics of the classifier Support Vector Classifier on the baseline and DA data set. On every feature extractor, the accuracy score on the DA data set is worse than its counterpart on the baseline data set. However, the F1-score is improved on every case, specifically F1-score improved by 0.059 to 0.59

(18)

(CV), 0.035 to 0.666 (TF), 0.08 to 0.683 (TFIDF). These results indicate that data augmentation improves the overall performance.

Table 2.4: Performance comparisons of XGBoost

Table 2.4 shows performance metrics of the classifier Linear Regression with XG-Boost. The accuracy score of the CV Linear Regression on the DA data set is slightly improved by 0.002 to 0.676. The accuracy scores of XGBoost with TF and TFIDF on the DA data set are lower than on the baseline data set. However, there is a big improvement on the F1-score. In detail, F1-score increases by 0.059 to 0.5 (CV), 0.117 to 0.672 (TF), and 0.18 to 0.666 (TFIDF).

Table 2.5: Performance comparisons of Random Forest

Table 2.5 shows performance metrics of the classifier Random Forest. We observe that there is a decline in the accuracy score of TF Random Forest on the DA data set from 0.744 to 0.728, but the F1-score is improved from 0.621 to 0.668. The F1-score and accuracy of CV Random Forest and TFIDF Random Forest on the DA data set are both improved. The F1-score of CV Random Forest on the DA data set has improved significantly. More specifically, the F1-score increases by 0.105 to 0.72 and the accuracy increase by 0.024 to 0.747. The CV Random Forest on the DA data set achieves the highest performance among the classifiers.

(19)

13

2.3 Discussions

The best classifier is Random Forest with feature extractor Count Vectorizer on the DA data set. In detail, compared to the baseline CV Random Forest, the CV Random Forest classifier on the DA data set gains a huge improvement in F1-score of 0.105, resulting in 0.72 while accuracy also increases slightly from 0.723 to 0.747. Although the precision and recall metrics of this classifier are not the highest metrics among the classifiers, they are highly balanced, resulting in the highest F1-score.

Each other classifier on the DA data set also performs better than its counterpart on F1-score metric. For instance, the baseline TF-IDF Naive Bayes achieves very skewed precision and recall of 0.858 and 0.22 respectively, which result in low F1-score of 0.343. Whereas, the TF-IDF Naive Bayes on the DA data set achieves better metrics of 0.5 and 0.946 respectively, resulting in an improved F1-score of 0.653.

These results shows the effectiveness of data augmentation in balancing precision and recall, and increase the overall performance. However, we observe that the results are sensitive to the training set and may not be widely applicable to other data sets. This issue would be mitigated when we have a larger data set by collecting and labeling more data.

We also observe that among the algorithms, Random Forest show its good and reliable performance on both data sets, where as the rest 3 algorithms: Multinomial Naive Bayes, SVM and XGBoost Logistic Regression only perform well on the DA data set. We choose the CV Random Forest classifier to build our rework detection service in the next chapter.

(20)

Chapter 3 Design and Implementation

3.1 Design

In this section, we propose an architecture for a rework detection service with exten-sibility in mind. The service utilizes the predictive model we trained in the previous section and aids collaborative development platforms, such as Jira. We architect the service to be independent of the target platform infrastructure. In reality, there is an approach to build such the proposed service is that is to create a plugin and integrate it into the platform. For example, one can create a plugin for Jira with Atlassian SDK [20]. We find our service would be more flexible if it is deployed on a separate cloud or host and integrated into the platform through APIs provided by the plat-form. The service persists classified tasks into a database for future model training to adapt to production data distribution. Users also should able to adjust the rework classification of a development task if they feel the prediction is incorrect, and the service should be aware of such changes to reflect into its database. Otherwise if users do not change the rework classification, it is implied that the service predicted correctly. As the data set grows over time, we can develop better models.

Figure 3.1 shows the overall architecture of our proposed service. There are two main factors of the architecture: collaborative development platforms and our rework detection service. We will describe each factor of the architecture in details.

Platform Users use the platform as usual and do not need to interact to our service. When a new task is created or a task is changed, the platform notifies our service about the changes. There is no need to wait for responses from service since service

(21)

15

Figure 3.1: Rework detection service architecture

can update tasks on platform from its side. Two basic events that the platform must notify our service are task creation and task change.

Service For task creation events, the service predicts reworks from tasks and up-date their label as rework on platforms if needed. For task upup-date events, the service simply persists the rework label of the changed task. We assume task change events come from user actions, thus the labels in these events are ground-truth for con-tinuous training. We used labels to reflect rework classification results. Labels on collaborative development platforms are keywords that users can add to development tasks to indicate whether they possess certain characteristics. Labels are encouraged to use because of their filterability. For example, Jira and GitHub both provide a built-in feature to filter tasks by their labels. Rework is a characteristic of a develop-ment task, thus labels are suitable to indicate if a task is a rework. Figure 3.2 shows the detailed view of a Jira task and its associated label rework indicates that it is a rework.

Now we deep dive into each component of the rework detection service. First of all, we introduce API gateway - the entrance of the service. The API gateway is be able to observe and react to changes on the platform. Changes will be transformed into jobs, which are stored in a job queue. The job queue works as a message broker to dispatch messages to other systems. In this architecture, the job queue dispatches jobs to the

(22)

detector where predictions are made. The detector is backed by machine learning model, which receives a document and returns rework classification result. Tasks received by detection service will be persisted into a database. At the same time, prediction results and relevant information will be logged into log files. Performance metrics can also be logged. These log files are displayed by a GUI dashboard service for analytic purpose.

Figure 3.2: A Jira task and rework label

As soon as the API gateway receives a task payload, it should not keep connec-tion to platforms to reduce latency and avoid any performance issues, but it pushes requests into the job queue instead. In fact, network failures can happen at any time and these communications should be asynchronous. Another important trait of the job queue is that it should be able to recover pending tasks in case the service is failed so that the service does not miss any tasks. On the platform end, some platforms provide retry policy which attempts to resend a request for a number of times if it is failed to reach the endpoint. In reality, Jira provides two mechanisms to deal with failures: first, retry policy where it will retry to resend a web hook up to 5 times and second, a special API for clients to retrieve the list of failed web hooks in the last 72 hours. The job queue is critical to maintain this service resiliency. Moreover, having a separate job queue eases system testing as developers can simulate web hook requests by sending mocked web hook payloads to the job queue instead of making changes on production environment to trigger the web hooks.

(23)

17

In the next step, job queue dispatches received messages in form of jobs to the detector to classify tasks. If the job is for task creation event, the detector evaluates the trained model and update the corresponding rework classification result on the platform. If the event is task description change, the detector classifies the task with its new task description. If the event is task label change, the detector simply persists the task labels to the database. In any cases, the detector persists the classified task into the database for future model retraining. Simultaneously, the classified results are logged into log files, which can be visualized on a GUI dashboard for analytic purpose. On model deployment, we find it simple to embed the trained model to our service. This way the model can be tracked by a source version control such as git along with application code. The whole application is easy to manage as the model is just a file in source code. There is an alternative source version control Data Version Control (DVC) to manage machine learning models, data sets in an efficient way [21]. However, digging into details of DVC usage is out of scope of this project.

Logging files in our service should be easily fetched by a logging dashboard visual-izer service, either locally or remotely. By having interactive and real-time knowledge about the service, developers have insights about how the trained model works with production data. This log visualization component is crucial especially in long term because we may want to deploy a trial model and track whether it works better than the previous one on production data.

Model retraining is about continuous delivery of machine learning models. Tech-nically, data engineers regularly experiment with new data collected over time to im-prove the model or to adapt to new data distribution. When a better model during development is trained, we may want to evaluate it on the production data carefully before replacing the current one in production. Indeed, a model performs well on a data set during development phase does not guarantee it will work well against unseen data on production environment. To reduce the possibility of classification perfor-mance degradation, we suggest a strategy called shadow deployment [22]. Shadow deployment is a way to deploy a new version B of a system alongside its current pro-duction version A and replay A’s incoming requests on version B without impacting production environment. Version B replaces version A when its performance meet the requirements.

(24)

3.2 Implementation

In this section, we present our implementation of a proof-of-concept service written in Python for Jira. The predictive model is trained with scikit-learn in Python, so we decide to build a Flask web application, which is a Python web framework for easy deployment. The application declares three RESTful POST APIs to receive Jira webhooks, one for issue creation event issue created and two for description and label attribute update. These API callback URLs are configured in Jira settings. More specifically, the webhook for issue created event is configured in menu Jira Setings, System, WebHooks. Figure 3.3 shows a configured webhook for issue created event. This means that Jira will notify every new issue to this callback URL. As discussed in the previous section, for issue creation events, we directly pass the payload to our predictor to classify the newly created task. For issue update events, we need to distinguish between a task description update and a label update because our service reacts differently depending on what attribute of the task is changed. If its description is changed, we pass the payload to our predictor to update the rework prediction. To be aware of what attribute of a task is changed, we make use of automation rules in Jira to create custom automation rules to notify our service about customized events. An automation rule in Jira consists of three parts: trigger, condition and action. Automation rule settings of Jira is accessed at menu Jira Settings, System, Automation Rules. For issue description changes, we define the corresponding rule as follows:

• Trigger: When description of an issue is changed by editing. • Condition: None

• Action: Send a web request to a given webhook callback URL.

Another similar rule is created for label change event. Figure 3.4 and 3.5 show detailed steps of how to setup automation rules in Jira.

The following skeleton code snippet shows how we declare Jira webhooks callback: [samepage=true]

@app.route("/issue_create", methods=["POST"]) def issue_create():

# Predict rework

(25)

19

Figure 3.3: Webhook settings in Jira

job = q.enqueue_call(func=predict_issue, args=(data[’issue’],))

@app.route("/issue_update", methods=["POST"]) def issue_update(): data = request.get_json() is_update_label = False is_update_desc = False if ’changelog’ in data: changelog = data[’changelog’] for item in changelog[’items’]:

if item[’field’] == ’labels’: is_update_label = True

if item[’field’] == ’description’: is_update_desc = True

if is_update_label:

# Persist the issue to DB elif is_update_desc:

(26)

Figure 3.4: Create a new automation rule trigger in Jira

# Predict rework else:

# Ignore this event

When the predictor returns a result, we trigger a Jira REStful API to update labels, then persist the task into database. To realize a job queue, we use RQ which is a lightweight library for queuing and processing job in background backed by Redis [23]. We implement the detector as an object that loads sklearn model from file and keep the model in memory for fast evaluation. Detector predicts new issues from Jira, add rework label if needed and update the task on Jira via Jira REST API. Obviously, at application launch, we load Jira API key from environment so that detector can authenticate with Jira.

Classification results output from the detector are persisted into a Sqlite database. In addition, classification results are also logged into log files using the standard Python logging module. We implement a simple logging dashboard service to display logs from log files for analytic purpose.

(27)

21

(28)

Chapter 4 Conclusions

In this project, based on our created data set, we built a new rework classifier to classify rework from development tasks on collaborative development platforms, such as Jira and GitHub. By combining a variety of techniques from machine learning, such as data cleaning, feature selection and augmentation, we improved the models compared to their performance on the baseline data set. We designed an architecture for rework detection service to integrate with collaborative development platforms. Based on the best performing model, we implemented a proof-of-concept service.

In the future, this project can be extended in the following ways:

• Use deep neural networks and modern natural language processing techniques to enhance the prediction performance.

• Apart from task description, other attributes of a development task such as pull requests, comments can be considered as features.

• Fully implement the described prototype and evaluate its usefulness with our industrial collaborator.

(29)

23

Bibliography

[1] Aaron G. Cass, Stanley M. Sutton, and Leon J. Osterweil. Formalizing rework in software processes. In EWSPT, 2003.

[2] Barry Boehm and Victor R. Basili. Software defect reduction top 10 list. Computer, 34(1):135–137, January 2001. https://doi.org/10.1109/2.962984 doi:10.1109/2.962984.

[3] Kelley Butler and Walter Lipke. Software process achievement at tinker air force base. Technical Report CMU/SEI-2000-TR-014, Software Engi-neering Institute, Carnegie Mellon University, Pittsburgh, PA, 2000. URL: http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=5253.

[4] R. N. Charette. Why software fails [software failure]. IEEE Spectrum, 42(9):42– 49, 2005.

[5] J. Westland. The cost behavior of software defects. Decision Support Sys-tems, 37:229–238, 05 2004. https://doi.org/10.1016/S0167-9236(03)00020-4 doi:10.1016/S0167-9236(03)00020-4.

[6] Michael Diaz and Jeff H King. How cmm impacts quality, productivity, rework, and the bottom line. 2002.

[7] F. Shull, V. Basili, B. Boehm, A. W. Brown, P. Costa, M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M. Zelkowitz. What we have learned about fighting defects. In Proceedings Eighth IEEE Symposium on Software Metrics, pages 249–258, 2002.

[8] Vimla Devi Ramdoo and Geshwaree Huzooree. Strategies to reduce rework in software development on an organisation in mauritius. International Journal of Software Engineering & Applications, 6:9–20, 2015.

(30)

[9] E. Burton Swanson. The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, ICSE ’76, page 492–497, Washington, DC, USA, 1976. IEEE Computer Society Press.

[10] R. E. Fairley and M. J. Willshire. Iterative rework: the good, the bad, and the ugly. Computer, 38(9):34–41, 2005.

[11] Aaron Cass, Leon Osterweil, and Alexander Wise. A pattern for mod-eling rework in software development processes. pages 305–316, 05 2009. https://doi.org/10.1007/978-3-642-01680-6 28 doi:10.1007/978-3-642-01680-6 28.

[12] Romi Satria Wahono. A systematic literature review of software defect predic-tion. Journal of Software Engineering, 1(1):1–16, 2015.

[13] J Shirabad and Tim Menzies. The {PROMISE} repository of software engineer-ing databases. 01 2005.

[14] M. McHugh. Interrater reliability: the kappa statistic. Biochemia Medica, 22:276 – 282, 2012.

[15] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009.

[16] George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, November 1995. https://doi.org/10.1145/219717.219748 doi:10.1145/219717.219748.

[17] Amy Zhao, Guha Balakrishnan, Fredo Durand, John V Guttag, and Adrian V Dalca. Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8543–8553, 2019.

[18] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision, 126(9):961–972, 2018.

(31)

25

[19] Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks, 2019. http://arxiv.org/abs/1901.11196 arXiv:1901.11196.

[20] Create a HelloWorld plugin project. Accessed on 2020-07-26. URL: https://developer.atlassian.com/server/framework/atlassian-sdk/create-a-helloworld-plugin-project/.

[21] Open-source Version Control System for Machine Learning projects. Accessed on 2020-07-26. URL: https://dvc.org/.

[22] Suhrid Satyal, Ingo Weber, Hye-young Paik, Claudio Di Ciccio, and Jan Mendling. Shadow Testing for Business Process Improvement, pages 153–171. 10 2018. https://doi.org/10.1007/978-3-030-02610-3 9 doi:10.1007/978-3-030-02610-3 9.

[23] RQ: Simple job queues for Python. Accessed on 2020-07-26. URL: https://python-rq.org/.

Detecting Rework in Software: A Machine Learning Technique

Contents

List of Tables

List of Figures

Chapter 1

Introduction

Chapter 2

Experimental Evaluation

2.1

Methodology

2.1.1

Data set

2.1.2

Data cleansing and preparation

2.1.3

Downsampling

2.1.4

Data augmentation

2.1.5

Feature engineering

2.1.6

Model Development

2.2

Results

2.3

Discussions

Chapter 3

Design and Implementation

3.1

Design

3.2

Implementation

Chapter 4

Conclusions

Bibliography