Improving Alerts in Software Quality Monitoring


Improving Alerts in Software Quality Monitoring

A domain specific language for software quality alerts

Cornelius Ries

cornelius.ries@googlemail.com

July 25, 2018, 60 pages

Supervisor: Magiel Bruntink, m.bruntink@sig.eu

Host supervisor: Bugra M. Yildiz, b.yildiz@sig.eu

Host organisation: Software Improvement Group, https://www.sig.eu

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Abstract

This thesis provides a detailed analysis of the current state of software quality monitoring, which is defined as measuring a software system's maintainability over time and triggering alerts for detected and important events. During the analysis it became apparent that the current fully automatic solution suffers from both false positives and false negatives.

A prototype was designed and implemented to evaluate whether considering the context information of a project would solve these problems. The prototype consists of a domain specific language, which we called Software Monitoring Alert Language (SMAL). It allows domain experts to manually specify alerts and their appropriate thresholds in order to trigger alerts on metric changes that are relevant for the project in question. The prototype was developed using Xtext and integrated into the existing software landscape of the Software Improvement Group (SIG), using a web editor for the manipulation of the language file as well as a REST interface to retrieve the parsed configuration. The evaluation was conducted on two customer projects with their respective SIG consultants. Based on the experiment results we discuss the effectiveness of our approach and compare it to the current solution. For the individual projects we were able to reduce the number of false positives and false negatives. Since there is no established definition of what constitutes a false positive or a false negative, we cannot quantify by how much we improved the alert generation. Overall we conclude that the prototype shows promising results but needs further validation in a real-world environment.


Acknowledgments

I would like to express my sincere gratitude to both my supervisors, Magiel Bruntink and Bugra M. Yildiz, as their constant feedback and advice helped me approach this project with an open mind and complete it successfully. Furthermore, I would like to thank everybody else at the Software Improvement Group (SIG) for providing me with the opportunity to write my thesis there and for the help they offered me in every possible way.

Cornelius Ries
Amsterdam, The Netherlands
June 2018


Contents

Acronyms 5
Glossary 6
1 Introduction 7
1.1 Problem analysis 7
1.2 Research method 8
1.3 Research questions 8
1.4 Software monitoring alert language 8
1.5 Outline 9
2 Background 10
2.1 Software quality monitoring 10
2.1.1 SIG maintainability model 10
2.1.2 Monitoring 11
2.1.3 Alerts 12
2.2 Project context information 12
2.2.1 Software evolution lifecycle 12
2.2.2 Project size 13
2.2.3 Release frequency 13
2.2.4 Others 13
2.3 Domain specific languages 13
3 Research method 15
3.1 Technical action research 15
3.2 Implementation in this thesis 15
3.3 Threats to validity 16
3.3.1 Conclusion validity 16
3.3.2 Configuration validity 16
3.3.3 External validity 16
4 Problem investigation 17
4.1 Software analysis process 17
4.2 Monitor alert system 18
4.3 Previous alerts 18
4.3.1 False positives 19
4.3.2 False negatives 19
4.4 Survey 20
4.4.1 Lifecycle 20
4.4.2 Alerts 21
4.5 Research questions & problem definition 22
5 Artifact design 23
5.1 Design 23
5.1.2 Non-Functional requirements 23
5.1.3 Configurable software quality alerts 24
5.1.4 Decisions 25
5.1.5 Solution 25
5.2 Prototype implementation 26
5.2.1 Technologies 27
5.2.2 Architecture 28
5.2.3 Changes to MAS 28
6 Design validation 30
6.1 Setup 30
6.2 Execution 31
6.2.1 Project A 31
6.2.2 Project B 32
6.3 Results 34
6.3.1 Project A 34
6.3.2 Project B 36
7 Discussion 38
7.1 Discussion per project 38
7.1.1 Project A 38
7.1.2 Project B 38
7.2 General discussion 39
7.2.1 Problem improvement 39
7.2.2 Design 40
7.2.3 Xtext 41
8 Conclusion and future work 42
8.1 Answers to research questions 42
8.1.1 RQ1: Will the number of relevant alerts increase if we introduce context information into the alert generation process? 42
8.1.2 RQ2: How can we design a solution to introduce context information into the alert generation process? 43
8.2 Future work 43
8.2.1 Test solution with other tools 43
8.2.2 Improve trend analysis 43
8.2.3 Establish a common definition for a false positive 43
8.2.4 Automatic generation of thresholds 43
8.2.5 Alert generation on delta changes 44
Bibliography 45
A Investigation 47
A.1 MCC database 47
A.1.1 Structure 47
A.1.2 SQL statements 47
A.2 Lifecycle survey 48
A.2.1 Example 48
A.2.2 Evaluation statement 48
A.3 Alerts survey 48
A.3.1 Layout and questions 48
A.3.2 Results 50
B Validation 54
B.1 Project B Configured alerts 54


Acronyms

ANTLR Another Tool For Language Recognition. 27

AST Abstract Syntax Tree. 14

DSL Domain Specific Language. 8, 13, 14, 26, 41, 43

GPL General Purpose Language. 13, 14

IDE Integrated Development Environment. 8, 27

LSP Language Server Protocol. 41

MAS Monitor Alert System. 12, 17, 18, 22, 23, 28, 30, 34, 36, 37, 43, Glossary: Monitor Alert System

MCC Monitor Control Center. 17, 18, 28, 40, Glossary: Monitor Control Center

REGEX Regular Expression. 27

REST Representational State Transfer. 28, 41

SAT Software Analysis Toolkit. 17, 24, 28, Glossary: Software Analysis Toolkit

SMAL Software Monitoring Alert Language. 26–28, 30

TAR Technical Action Research. 8, 15–17, 23, 30

XML Extensible Markup Language. 24


Glossary

client also known as customer; contracted SIG to analyze one or several of their software projects to gain insight into the software quality and to gain advice on how to improve that quality. 11, 12, 15, 17, 18, 21, 22, 24, 27, 30–32, 35, 36, 38

consultant advises a client on the quality of one or multiple software projects. The advice includes providing the results of code analysis as well as suggesting possible ways for improvement. 15, 16

Monitor Alert System is responsible for generating the alerts based on software quality in the process of SIG. 12

Monitor Control Center is the web-interface where all alerts of the analysis process are displayed.

17

snapshot is defined as the state of a software project's source code at a specific point in time; typically provided by the customer to SIG to be analyzed. 11, 12, 18, 25, 28, 30, 31, 35, 36

Software Analysis Toolkit is the software in the SIG software stack that is responsible for performing the static code analysis, calculating the different metrics and aggregating the metrics into the maintainability score as well as persisting the results into a database. 17

Xtext Xtext is a framework for development of programming languages and domain-specific languages. With Xtext you define your language using a powerful grammar language. As a result you get a full infrastructure, including parser, linker, typechecker, compiler as well as editing support for Eclipse, any editor that supports the Language Server Protocol and your favorite web browser. 8, 14, 27, 28, 43


Chapter 1

Introduction

Maintenance is one of the most important aspects of software development, since it amounts to nearly 90% of the overall costs [MHDH13]. Maintenance comprises, among other tasks, fixing bugs, adding new features and improving the code. The ease of performing these tasks is also known as maintainability [VRvdL+16]. Ease here means the time spent on understanding the code and changing the code: the more complex the code, the longer a developer needs to understand and change it. This can go so far that developers are afraid to change the code, because it has become so complex that they no longer understand what it is doing.

For the purpose of expressing the maintainability of a system, SIG developed a model [HKV07] which is based on the maintainability part of the ISO/IEC 25010:2011 [ISO10] standard. To use the model for a particular system, various metrics like the unit size have to be calculated from the source code. For Java, this means the lines of code per method or constructor. Another metric, the percentage of duplications, relates the number of duplicated lines against the overall line count of the whole system. By comparing the system in question against a benchmark that was established by analyzing a large number of systems, it is possible to express how maintainable a system is. This ranking of maintainability is expressed in terms of a star rating (1 to 5) in the SIG maintainability model. For example, if a system is in the top 5% of systems it receives a rating of 5 stars.

Continuously analyzing the source code while a system is being developed provides chronological measurements and shows the changes in maintainability. Deliberately looking for changes in the different quality characteristics is the process that we will define as "software quality monitoring". The task of monitoring is almost always done automatically by using a tool. Such a tool often includes the possibility to generate an alert if monitored values change, either compared against a threshold or compared to previous values. In the context of software monitoring, one way to approach this is to generate an alert if the quality of a system drops a star in a maintainability category.

1.1 Problem analysis

In the general context of alert generation, there are essentially two problems with alerts. The first problem is that there are too many alerts, which in turn desensitizes the people responsible for handling them. The fact that there are too many alerts also implies that there are alerts that are not relevant. We will call this type of alert a false positive. Reasons for this might be that the used thresholds are too loose or that the process that triggers the alerts runs too frequently. The second problem is that important changes are not detected and the corresponding alerts are not triggered. We will call this type of alert a false negative. This can be the case if the thresholds are too strict or the data points are too far apart.

For the domain of software monitoring in the context of SIG, we discovered that both these problems are currently present. For systems with a large or relatively old code base, a lower number of alerts is generated than for newer systems or systems with a small code base. The main reason for this is that the current implementation of alert generation in this domain, described by Bijlsma et al. [BCV12] in 2012, uses the same parameters across all systems. But there are also other reasons that we identified during our analysis, which add to the complexity and prompted us not to pursue a tuning of those parameters based on the size of the code base and/or the age of the system. When other context information, like team size or release frequency, is added as parameters into the calculation, the problem becomes even more complex. Another problem is that the historical data on alerts proved to be incomplete regarding what a relevant alert is, since there is no clear definition of that either.

1.2 Research method

The research in this thesis was executed using a prototype approach that is based on Technical Action Research (TAR) [Wie12]. The research was conducted at SIG. Our research cycle was set up as a three-step process:

Problem investigation At the start of our research cycle, we first investigated the problem domain, namely software quality monitoring. The results of this step were the problems that we wanted to solve and the research questions we wanted to answer.

Artifact design Based on the results of our analysis we designed and implemented a prototype solution that is supposed to solve the problem and help us to answer our research questions.

Design validation To complete our research cycle we evaluated our solution with the help of domain experts at SIG and report on our findings.

1.3 Research questions

During our analysis we discovered that the alerts, and the definition of what makes an alert relevant, depend on the external context information of a project. While some types of context information can be measured, there are others that cannot be measured or are not available during the alert generation. Based on these observations, this thesis aims to find a solution for this problem and answer the following research questions:

Research Question 1: Will the number of relevant alerts increase if we introduce context information into the alert generation process?

Research Question 2: How can we design a solution to introduce context information into the alert generation process?

1.4 Software monitoring alert language

The solution we designed to introduce context information into the alert generation process features a Domain Specific Language (DSL), implemented with Xtext, for the configuration of alert thresholds. The main motivation behind using Xtext is that it provides an easy way to implement a DSL with enhanced features like a web editor, content assist and other Integrated Development Environment (IDE)-like features. We extended the Java web project generated by Xtext to feature a REST API, which can be used to query the configuration while parsing the underlying language file on-the-fly. While our solution applies foremost to SIG, there are other tools that measure source code properties. One example is Sonarqube1. Using a REST API it should be possible to integrate a DSL for configuration there as well, but we have not tested this.
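To give a first impression of this approach before Chapter 5 introduces the language properly, an alert configuration in a SMAL-like DSL might look roughly as follows. This is a purely hypothetical sketch: all keywords, metric names and threshold values are invented for illustration and do not reproduce the actual SMAL grammar.

```
// Hypothetical sketch only -- not the actual SMAL syntax.
project "Example System" {
    context {
        lifecycle servicing   // set by the consultant, not measurable from code
        size large
    }
    alert on duplication {
        sudden-change threshold 0.3
    }
    alert on unit-size {
        sudden-change threshold 0.5
    }
}
```

The idea such a file is meant to convey: the consultant captures non-measurable context information and chooses per-project thresholds, instead of relying on one hard-coded parameter set for all systems.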


1.5 Outline

In this section, we outline the general structure and idea of this thesis. In Chapter 2, we provide background information on the relevant topics and the domain in which we conduct this thesis. In Chapter 3, we describe our research method in more detail, outline how we intend to validate our research and present the threats to the validity of our research approach. In Chapter 4, we take a detailed look at the current state of alert generation in software quality monitoring and give a rationale for our research direction and research questions. In Chapter 5, we give insight into how we designed and implemented a solution to introduce context information into the domain. In Chapter 6, we describe the experiment (setup, execution, and results) that we used to validate our solution. In Chapter 7, we discuss the results and give answers to our research questions. We conclude the thesis in Chapter 8.


Chapter 2

Background

For a better understanding of the research conducted in this project, we provide some background information. In the first section, we describe the domain our project is rooted in, namely software quality monitoring. In the second section, we provide information regarding project context information. In the third and final section, we provide a brief description of domain specific languages.

2.1 Software quality monitoring

There are various ways of defining the term software quality [ISO10] [BKCF14]. It can be defined from an end-user perspective and express how well the software satisfies the requirements. It can also be defined from a security point of view so that it expresses how the software handles privacy-related issues. In the context of this project, however, we look at software quality from the source code perspective. In this context, software quality monitoring refers to the process of periodically analyzing source code, measuring various metrics and looking for changes in those over time. The metrics can be aggregated and expressed using a model like the SIG maintainability model [VRvdL+16].

2.1.1 SIG maintainability model

The SIG maintainability model aggregates the measured metrics of the source code into an overall maintainability score of the system. The score is expressed as a scale from 0.5 to 5.5 with 5.5 being the best possible score. The scale and especially the thresholds of the scale were established by benchmarking 400 systems. This ensures comparability of one individual system in the context of the model. The model is based on the maintainability characteristic of the ISO/IEC 25010:2011 [ISO10] standard. The standard and the SIG model are composed of five sub-characteristics: Modularity, Reusability, Analysability, Modifiability and Testability. The sub-characteristics themselves are also assigned a rating based on the previous score scale. The sub-characteristics are derived from various metrics of the source code:

Volume The amount of code of the entire system. Comments are not counted as code. The score is derived from an estimation of the man-years needed to rebuild the system.

Duplication The number of duplicated lines of code.

Unit size The lines of one unit. The exact definition of a unit depends on the programming language. For Java, for example, a unit is defined as a method. As with volume, comments are not counted.

Unit complexity The decision points (if, while, etc.) inside a unit. Also known as cyclomatic complexity based on McCabe [McC76].


Unit interfacing The parameters in a method’s signature.

Module coupling How often a method within a module is called from other methods.

Component balance The number of system components and their uniformity in size.

Component independence Same as module coupling, but on a component level.

The scores of some of the metrics, like duplication or unit size, are based on a risk profile [HKV07], which is relative to the overall number of lines of code. For example, to receive a rating of 4 for unit size, there can be at most 6.9% of the lines of code in units with more than 60 lines of code, and so on. For maintainability and the sub-characteristics, the relevant metrics are then aggregated into the score.
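The unit-size check just described can be sketched in code. Only the 6.9% / 60-line boundary is taken from the text above; the class and method names, and the idea of a single boolean check, are illustrative simplifications of the full risk-profile tables in [HKV07].

```java
import java.util.List;

// Sketch of a risk-profile check for the unit-size metric.
// Only the 6.9% / 60-line boundary comes from the text; everything
// else is an illustrative simplification.
public class UnitSizeRiskProfile {

    /** Fraction of the system's lines of code that live in units
     *  longer than the given size threshold. */
    public static double fractionInLargeUnits(List<Integer> unitSizes, int sizeThreshold) {
        long total = 0;
        long inLargeUnits = 0;
        for (int size : unitSizes) {
            total += size;
            if (size > sizeThreshold) {
                inLargeUnits += size;
            }
        }
        return total == 0 ? 0.0 : (double) inLargeUnits / total;
    }

    /** A unit-size profile qualifies for a rating of 4 if at most 6.9%
     *  of all lines of code are in units of more than 60 lines. */
    public static boolean qualifiesForRatingFour(List<Integer> unitSizes) {
        return fractionInLargeUnits(unitSizes, 60) <= 0.069;
    }

    public static void main(String[] args) {
        // Nine 10-line units and one 70-line unit: 70 of 160 lines
        // (43.75%) are in large units, so no rating of 4.
        List<Integer> sizes = List.of(10, 10, 10, 10, 10, 10, 10, 10, 10, 70);
        System.out.println(fractionInLargeUnits(sizes, 60)); // 0.4375
        System.out.println(qualifiesForRatingFour(sizes));   // false
    }
}
```

Note that the check is relative, not absolute: a single long unit weighs less in a large system than in a small one, which is exactly the "relative to the overall number of lines of code" property of the risk profile.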

2.1.2 Monitoring

SIG offers a monitoring service to some clients. For those clients, SIG periodically receives a snapshot of a system. The results for the current snapshot, as well as all the historical results, are stored in a database. Looking at the collected historical data, it is possible to draw conclusions about the current state of the system compared to the past. It is also possible to make estimations about trends and, based on those trends, about the future of the maintainability of a system.

Figure 2.1 shows several of the mentioned metrics over time. The date range of the available snapshots is transformed into date values on the x-axis, with 0 being the date of the very first snapshot. For every consecutive snapshot, the date difference in days is calculated and used as the x-axis value. The y-axis is the score value in the maintainability model. The system depicted in the diagram has been monitored by SIG since the beginning of the project. This is represented by the high score in Volume (black), which shows an almost linear decrease (5.5 to 4) in the first quarter of the diagram and stabilizes towards the end. Another indication of the novelty of the project is that the remaining metrics (Duplication - red, Unit size - blue, Module coupling - yellow) behave irregularly in the beginning and also stabilize as the maturity increases.


2.1.3 Alerts

Alerts make interesting events visible in case no one actively observes the environment; such events need to be detected and brought to someone's attention. Alerts can also act as support if an active observation is being performed, but additional safeguards are necessary. To detect interesting changes in the software quality of a client's system, SIG implemented alert generation for the metrics mentioned above. The generation of those alerts is handled by a dedicated system known as the Monitor Alert System (MAS). The MAS generates three different types of alerts [BCV12]:

Sudden change A sudden change is a drop or rise in the rating of one of the source code metrics. For the calculation, the analysis results of the past are collected, weighed and compared against the current snapshot value. If the change exceeds a certain threshold, an alert is triggered.

Sustained change A sustained change is generated for breaks in trends in one of the source code metrics. For example, if the duplication rating was on an upward trend and experiences a change in direction that happened over a longer period of time and was too subtle for the sudden change detection, then the sustained change detection is supposed to catch this kind of alert.

Component change A component change is generated if a new component is added to the system or an existing component is deleted. It also detects new and removed dependency relationships between components.
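The sudden change detection can be sketched as follows. The exponential weighting and the 0.5 threshold here are hypothetical placeholders, not the calibrated MAS parameters from Bijlsma et al. [BCV12]; the sketch only illustrates the shape of the computation (weigh past ratings, compare against the current snapshot, trigger on a threshold).

```java
import java.util.List;

// Illustrative sketch of "sudden change" detection: past ratings are
// combined into a weighted reference value and compared against the
// current snapshot. Weighting scheme and threshold are hypothetical.
public class SuddenChangeDetector {

    /** Weighted average of past ratings; ratings at the end of the
     *  list (more recent snapshots) weigh more. */
    public static double weightedReference(List<Double> pastRatings) {
        double sum = 0.0, weightSum = 0.0;
        double weight = 1.0;
        for (double rating : pastRatings) {
            sum += weight * rating;
            weightSum += weight;
            weight *= 2.0; // each newer snapshot counts twice as much
        }
        return sum / weightSum;
    }

    /** Trigger an alert if the current rating deviates from the
     *  weighted reference by more than the threshold. */
    public static boolean isSuddenChange(List<Double> pastRatings,
                                         double current, double threshold) {
        return Math.abs(current - weightedReference(pastRatings)) > threshold;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(4.0, 4.1, 4.0);
        System.out.println(isSuddenChange(history, 3.2, 0.5)); // true: clear drop
        System.out.println(isSuddenChange(history, 3.9, 0.5)); // false: within tolerance
    }
}
```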

2.2 Project context information

Since our research is partly focused on the fact that project context information influences the alerts being generated, we want to provide a definition of the term as well as some examples. Project context information describes characteristics of projects that depend on the individual project itself. For our research, it is important to differentiate between project context information that can be measured automatically from the source code and the type that cannot be measured automatically. A good example of the first type is the project size: we can measure the size of a project with the volume metric described above. An example of the second type would be the current project lifecycle phase.

2.2.1 Software evolution lifecycle

Since there are multiple definitions of the term project lifecycle, we decided to use the staged model proposed by Rajlich et al. [RB00], to which we will refer as the software evolution lifecycle. We selected this model because we are looking at the code of a system and how it evolves over time. The software evolution lifecycle defines five phases, which are described below and are further visualized in Figure 2.2.

Initial development The project is new and not in production yet. Developers are working to implement the initial set of requirements.

Evolution The project is in production and the developers work on the functionality of a system to satisfy changing requirements.

Servicing The project is in production, but only bug fixes and other maintenance work is being performed. Changes are usually costly and take a lot of time.

Phase out The project is in production, but is considered or scheduled to be removed or replaced. It is only kept alive to maximize generated revenue.

Closedown The project is shut down and remaining users are, where possible, directed to a replacement system.


Figure 2.2: Visualization of the lifecycle model. [RB00]

2.2.2 Project size

The size of a project is also a type of project context information and, like the project lifecycle, it can be defined in several ways. It is possible to just look at the size of all the code in the project, but we could also look at team size, number of teams, number of involved parties and so on. As with the previous type, we decided to define the project size as the size of the code, since SIG focuses on the source code.

2.2.3 Release frequency

The release frequency is the last example of context information. It describes how often the system is deployed into production. While this information could be collected from project management systems like JIRA or continuous integration servers like Jenkins, the information provided by these systems is typically not available during the analysis of the code.

2.2.4 Others

There are plenty more types of context information. Since we cannot go into detail for every one of them, we will conclude this section with a list of several more that can be explored if so inclined:

• developer skill
• developer experience
• team size
• development practices
• development process

2.3 Domain specific languages

Since the topic of DSLs is very broad, we only give a short introduction to what a DSL is and how they are usually implemented, and provide an example. A DSL is, like the name suggests, a language specific to a domain [vDKV00]. The direct opposites are General Purpose Languages (GPLs), like Java or C#. Typical examples of a DSL are SQL or HTML. HTML is specifically designed to be interpreted by web browsers and to describe the layout of a website. To achieve a proper separation of concerns, multiple DSLs are often used together, like HTML and CSS.

DSLs are usually implemented by specifying a grammar and using a parser generator to generate a parser that is able to read a piece of text written according to that grammar [Bet16]. The parser splits the text based on tokens (defined keywords and symbols in the grammar) and builds an Abstract Syntax Tree (AST). In the end, a generator is used to translate the AST into a GPL source code file so that it can be compiled and executed. There are a lot of tools and frameworks to support the implementation of a DSL. Besides many others [EVV+], we have decided to use Xtext in this thesis.
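To make the grammar-to-parser pipeline concrete, a minimal Xtext grammar for a toy alert language could look like the following sketch. The grammar name, rules and toy syntax are invented for illustration; this is not the SMAL grammar developed in this thesis.

```xtext
// Illustrative toy grammar, not the actual SMAL grammar.
grammar org.example.alerts.Alerts with org.eclipse.xtext.common.Terminals

generate alerts "http://www.example.org/alerts"

// A configuration is a list of alert rules, e.g.:
//   alert duplication threshold 5
Model:
    rules+=AlertRule*;

AlertRule:
    'alert' metric=ID 'threshold' value=INT;
```

From such a grammar, Xtext generates the tokenizer, parser, AST classes and editing infrastructure mentioned in the glossary, which is the main reason we selected it.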


Chapter 3

Research method

This chapter describes our research method. In the first section we describe technical action research. In the second section we describe how we implemented technical action research in this thesis. In the third and final section we point out the threats to the validity of our research.

3.1 Technical action research

For the selection of our research method, we looked into the available paradigms [ESSD08] and decided to base our research method on action research, in particular Technical Action Research (TAR). As per the definition given by Wieringa [Wie12], TAR is a practical design research method. Using this approach, the researcher designs an artifact that tries to help a client while answering a set of knowledge or improvement questions. TAR defines five individual phases: Problem Investigation, Artifact Design, Design Validation, Implementation and Implementation Evaluation:

1. Problem Investigation During the initial phase, the researcher assumes the role of client-helper and analyses the problem domain to define a problem worth solving.

2. Artifact Design After selecting a problem to solve the researcher assumes the role of designer and designs an artifact that will be introduced into the problem domain.

3. Design Validation In this phase the researcher assumes the role of artifact investigator. The design is validated by analyzing the effects that the artifact has in the domain: What will be the effects of the artifact in a problem context? How well will these effects satisfy the criteria?

4./5. Implementation and implementation evaluation The last two phases then transfer the design to a real implementation to further test and validate the design in a real-world setting.

3.2 Implementation in this thesis

In our case, we decided to focus on the first three phases of TAR. This is also reflected in the structure of this work: Chapter 4 focuses on investigating the problem (Phase 1). Chapter 5 focuses on the artifact design (Phase 2). Chapter 6 focuses on the design validation (Phase 3).

Because of the time constraints of this project, we decided to test the design by implementing a prototype solution. Using this prototype-based approach, we expect to provide an indication of the practical implementation and its implications for SIG. Note: this is the only section in which we refer to SIG as the client. For the rest of this thesis, the term client is defined as a customer of SIG.

To validate our design and research, we tested our solution on two selected projects and their historical data available at SIG with the help of consultants, by executing the following steps:


• Translate the context information and thresholds into an alert configuration file.
• Generate the alerts using the historical data of the system.
• Categorize the alerts with the help of the consultant into desirable/undesirable.
• Compare the differences between our generated alerts and the alert history of the system.
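The comparison in the last step can be thought of as a set difference between the generated alerts and the historical ones. A schematic sketch, under the simplifying assumption that alerts can be identified by a string key (all identifiers hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

// Schematic sketch of the final validation step: comparing the alerts
// the prototype generates against the alert history of the system.
// Alerts are simplified to string identifiers for illustration.
public class AlertComparison {

    /** Elements of a that do not occur in b. */
    public static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> generated = Set.of("duplication-drop", "unit-size-drop");
        Set<String> historical = Set.of("duplication-drop", "volume-spike");

        // Alerts only the prototype raises (candidate false negatives of
        // the old system) and alerts only the old system raised (candidate
        // false positives of the old system).
        System.out.println(onlyIn(generated, historical)); // [unit-size-drop]
        System.out.println(onlyIn(historical, generated)); // [volume-spike]
    }
}
```

Whether an alert in either difference is actually desirable is decided by the consultant in the categorization step, not by the comparison itself.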

3.3 Threats to validity

We now present the threats to the validity of the research validation outlined in the previous section.

3.3.1 Conclusion validity

We have validated our approach with only two systems and their respective consultant and domain expert. Although the systems come from completely different clients and domains, a more extensive test should be executed to strengthen our conclusions. While we do not see any reason why our approach should not work for other clients and their projects, we have not tested this. Additionally, we only validated the results with one expert per system. A second opinion per system should be considered before generalizing the conclusions on a system level.

3.3.2 Configuration validity

The configurations used in our experiment were constructed with the help of the consultant responsible for the project. The resulting alerts were evaluated with the same person. This might affect the judgment of alerts, since the consultant set the thresholds himself and thus is sure that if an alert is triggered, the resulting alert must be desirable. To make sure that our conclusions are sound on a project level, the results should be validated using a second opinion.

3.3.3 External validity

Originating from our use of TAR within the context of SIG, we cannot make any concrete generalizations. This is further affected by the fact that we only validate our approach with two systems. Although the approach we have developed should apply to other models and tools as well, we have not verified this. The validation experiment we conducted also featured the opinion and judgment of the responsible domain expert, which can be prone to subjectivity. If there had been another consultant for one of the projects, he might have been interested in other alerts, which in turn would have affected our results.


Chapter 4

Problem investigation

This chapter represents the first phase of TAR. It gives insight into the steps of analysis performed during the initial period of the project. It can be seen as another part of the background information, gathered not by literature research but by investigation and observation. The first section focuses on the process within SIG in which the alerts are generated. The second section gives more insight into the workings of the MAS. The third section takes a detailed look at the historical alert data. The fourth section describes the survey we conducted to confirm existing suspicions and the findings from the previous sections. The fifth and final section defines the problem and research questions that we want to solve and answer in this project.

4.1 Software analysis process

We first focused on analyzing the process, since the MAS is integrated into SIG's software analysis process. Figure 4.1 shows the process with all the systems and applications involved, as well as the different alerts that are being generated. The MAS is triggered as the final step of the process, after the analysis results have been persisted into the database. The analysis step is performed by a system known as the Software Analysis Toolkit (SAT). The analysis results (metrics and ratings) are displayed in a web interface (monitor), which is accessible to consultants and the client of the system in question. The alerts are displayed in another system called the Monitor Control Center (MCC). The MCC also shows alerts from other parts of the process, which are raised by a process watchdog that is responsible for keeping track of the status of each individual analysis step.

Every day an employee is assigned to observe the alerts displayed in the MCC. This person is responsible for checking the alerts and deciding whether they are important or not. If an alert is deemed unimportant, it is closed with a note. If it is important, or simply unclear to the person on duty, the alert is forwarded with a message to the responsible consultant of the project. After deciding on a course of action, such as notifying the client, the consultant resolves the alert with another note in the MCC. During this process an alert can have one of three states: "open", "in progress" and "resolved". If the consultant is contacted, the alert passes through the intermediate state "in progress". Otherwise, the status switches directly from "open" to "resolved".
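The triage workflow described above can be sketched as a small state machine. The class and method names below are illustrative; they do not come from the MCC codebase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the MCC alert lifecycle: "open" -> "in progress" -> "resolved",
// where "in progress" only occurs when the alert is forwarded to a consultant.
enum AlertStatus { OPEN, IN_PROGRESS, RESOLVED }

class MccAlert {
    AlertStatus status = AlertStatus.OPEN;
    final List<String> notes = new ArrayList<>();

    // The person on duty closes an unimportant alert directly with a note.
    void close(String note) {
        notes.add(note);
        status = AlertStatus.RESOLVED;
    }

    // Unclear or important alerts are forwarded to the consultant first.
    void forwardToConsultant(String message) {
        notes.add(message);
        status = AlertStatus.IN_PROGRESS;
    }

    // The consultant resolves the alert after deciding on a course of action.
    void resolve(String note) {
        notes.add(note);
        status = AlertStatus.RESOLVED;
    }
}
```

Note that under this model an alert closed by the person on duty carries exactly one note, a property we exploit later when analyzing the historical data.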

The MCC is also the point in the process where possible problems become visible. Alerts are either relevant or not. Relevant alerts are also referred to as 'interesting' or 'positive', while irrelevant alerts are known as 'not interesting' or 'false positives'. A false positive is an alert that was generated but added no value or new knowledge for the consultant. During observation, it became apparent that a lot of alerts are false positives. However, these alerts are not clearly flagged as such, which makes further investigation hard. Furthermore, there is no clear definition of what constitutes a false positive. During our analysis we found that this heavily depends on the client and/or system. Some consultants prefer to be kept up to date through the alerts, while others monitor the system more closely themselves and only want to be notified of a large change (+/- 0.5) in a metric rating.


Figure 4.1: Analysis Process

4.2 Monitor alert system

The MAS has been around for almost ten years, but there have been no functional changes to the system in the last five years. After inspecting the source code, we can report that the system still uses the same approach to generate alerts as reported by Bijlsma et al. [BCV12] in 2012. One problem with the approach taken in that paper only shows after quite some time: the authors calibrated the parameters for the alert generation using a benchmark of one representative system. In daily use, the parameters are the same for every system and are hard-coded into the source code, so they cannot be changed depending on the system being analyzed. This is, in our opinion, one of the reasons why some systems, especially young or small ones, generate a lot of alerts, whereas older and bigger systems do not generate alerts for changes that would have been interesting. As a result, the mechanism behaves more sensitively for smaller systems than for bigger ones.

4.3 Previous alerts

For more information on past generated alerts, we obtained a copy of the MCC database. From this, we hoped to gain more insight into the generated alerts and the problems with them. The database layout and some of the SQL statements used for analysis can be found in appendix A.1. The first alert recorded in the MCC dates from September 2012. Our copy contained the alerts up to January 2018. During this timespan, 30840 alerts were generated, of which 9403 (30%) were generated by the MAS. The remaining alerts were generated because of schedule violations (the client forgot to upload the snapshot) or because the analysis process reported another error. Since we decided to focus on the MAS alerts for this project, we did not look further into those alerts.

Looking at the types of MAS alerts in more detail, we can see that 6754 alerts (70%) were rating changes, with the rest being component change alerts. The rating changes can be further divided into sustained changes, 1274 (19%), and sudden changes, 5480 (81%). From this, we can see that most alerts generated by the MAS are alerts for sudden rating changes. As previously mentioned, there is no clear indication of what a false positive is. For the purpose of this research, we therefore defined a false positive as an alert that was closed without forwarding it to a consultant. Our rationale was that the person on MCC duty decided to close it immediately, without further communication. With this definition it was possible to divide the alerts, because the messages for each alert were logged in the MCC database: a false positive was an alert with exactly one message; everything else was a valid alert. Using this approach we tried to find patterns in the threshold for sudden rating changes, but this proved to be a dead end because of inconsistency in the data. The way MCC alerts are handled depends on the person on duty: some people tend to confirm every alert, whereas others forward every alert, these of course being the two extreme examples.
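Our working definition of a false positive (an alert whose MCC message thread contains exactly one message) can be expressed as a simple classifier. This is a sketch of the heuristic we applied to the database copy, not code from the MCC itself.

```java
import java.util.List;

// Illustrative heuristic from our analysis: an alert whose MCC thread contains
// exactly one message was closed immediately by the person on duty, so we
// count it as a false positive; anything with more messages was forwarded
// to a consultant and counts as a valid alert.
class AlertClassifier {
    static boolean isFalsePositive(List<String> messages) {
        return messages.size() == 1;
    }

    static long countFalsePositives(List<List<String>> alertThreads) {
        return alertThreads.stream().filter(AlertClassifier::isFalsePositive).count();
    }
}
```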


After this, we decided to focus on the alerts generated for a few systems, with the criteria being that they were analyzed on a regular basis and had generated a significant number of alerts in the past. If we add the alerts to the same system displayed in Figure 2.1, the resulting view, which is depicted in Figure 4.2, allowed us to make further observations.

Figure 4.2: Metrics and generated alerts (red vertical bars) for a system.

4.3.1 False positives

The large number of alerts in the first quarter of the analysis results raised our suspicion that the project lifecycle plays a significant role in the alerts being generated. A good indication is the high score for the volume metric (meaning a small codebase), which indicates the initial development phase. The metric drops over time as more code is added, and stabilizes towards the end of the diagram, meaning the project matures and moves into the evolving/servicing phases. The other metrics also fluctuate during the initial development phase and contribute to the number of triggered alerts. But since this behavior is expected during initial development, the alerts triggered become irrelevant, otherwise known as false positives.

4.3.2 False negatives

Another interesting observation was that changes to metrics calculated using a risk profile are not detected properly, or at least that some interesting changes are missed, since these metrics are affected by system size. We will call these missed alerts false negatives. To give an example: if the code size increases in proportion to the number of duplicates, no alert is triggered for the duplication metric, as it stays the same or varies only very little. Figure 4.3 shows the raw number of duplication occurrences for the system also visible in Figure 4.2. In the diagram, we can see that the number of duplications steadily increases following the sudden surge right after the 1000-day mark. If we compare this with the raised alerts visible in Figure 4.2, we can see that, following the alert for the previously mentioned surge, no further alerts are raised for this system. From this finding, we conclude that the current solution suffers from false negatives.
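The arithmetic behind this false negative can be made concrete with hypothetical numbers: the duplication rating is driven by the percentage of duplicated code, so proportional growth keeps the percentage, and thus the rating, flat while the absolute amount of duplication rises.

```java
// Hypothetical numbers illustrating why proportional growth hides duplication:
// if duplicated lines grow at the same rate as the codebase, the percentage
// (and hence the rating derived from it) barely moves and no alert fires,
// even though the absolute number of duplicated lines keeps rising.
class DuplicationExample {
    static double duplicationPercentage(long duplicatedLines, long totalLines) {
        return 100.0 * duplicatedLines / totalLines;
    }
}
```

For instance, 5,000 duplicated lines in a 100,000-line system and 10,000 duplicated lines in a 200,000-line system both yield 5%: the rating is unchanged although the duplication doubled in absolute terms.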


Figure 4.3: Number of duplications for the system also visible in Figure 4.2.

4.4 Survey

To confirm our suspicions, we decided to enlist the help of SIG’s consultants using a survey. For the design of the survey we consulted relevant literature [Bur01]. The survey was divided into two parts, which are described in the following subsections.

4.4.1 Lifecycle

The first part of the survey was designed as a spreadsheet and conducted using Google Spreadsheets. A screenshot of the survey, picturing the layout and instructions with some example open source systems, can be seen in Figure A.2.

Design

The real survey consisted of a preselected list of systems. The systems were selected based on three criteria: a system had to be analyzed frequently (once a week), over a longer period of time (> 6 months), and had to have raised alerts in the past. The selection was done partly automatically, using SQL and scraping of web systems; other parts were done manually. The steps are outlined in the list below.

• We selected the initial set of systems that triggered alerts in the past (SQL query using the MCC database).

• For the resulting systems, we scraped the relevant information (responsible consultant, industry) from the process watchdog web interface. The scraping was done by writing a helper application using Java and Jsoup.

• We checked the snapshot availability manually.
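The three selection criteria can be sketched as a filtering pipeline. The SystemRecord fields below are assumptions for illustration, not SIG's actual schema.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative filter for the survey pre-selection: analyzed roughly weekly,
// monitored for more than six months, and at least one historic alert.
class SurveySelection {
    record SystemRecord(String name, LocalDate firstSnapshot, LocalDate lastSnapshot,
                        int snapshotsPerMonth, int alertCount) {}

    static List<SystemRecord> eligible(List<SystemRecord> systems) {
        return systems.stream()
                .filter(s -> s.snapshotsPerMonth() >= 4)  // roughly one snapshot per week
                .filter(s -> ChronoUnit.MONTHS.between(s.firstSnapshot(), s.lastSnapshot()) > 6)
                .filter(s -> s.alertCount() > 0)          // raised alerts in the past
                .collect(Collectors.toList());
    }
}
```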

In the resulting survey, it was the task of the project's consultant to assign the phases of the software evolution lifecycle where possible (some projects go back as far as 2009). To minimize possible data shortage in the feedback, in case some consultants would not answer the survey, all eligible systems were put in the survey. In the end, we wanted to randomly select projects from the phases, spanning the same amount of time, count the number of alerts and compare them. The survey contained 147 systems spanning 40 clients.

Project | Available Data Dates      | Selected Date Range       | Years | #Alerts
P1      | 2017-12-01 - 2018-04-09   | 2017-06-01 - 2018-01-01   | 0.5   | 22
P2      | 2011-12-01 - 2013-12-01   | 2012-12-01 - 2013-12-01   | 1     | 11
P3      | 2015-01-01 - 2016-01-01   | 2015-01-01 - 2016-01-01   | 1     | 14
P4      | 2017-02-01 - 2017-12-31   | 2017-01-01 - 2018-01-01   | 1     | 15
P5      | 2017-06-01 - 2017-12-31   | 2017-06-01 - 2018-01-01   | 0.5   | 1
Total   |                           |                           | 4     | 63

Table 4.1: Number of alerts from initial development systems.

Project | Available Data Dates      | Selected Date Range       | Years | #Alerts
P2      | 2013-12-01 - 2018-01-01   | 2016-01-01 - 2017-01-01   | 1     | 2
P3      | 2016-01-01 - 2018-01-01   | 2017-01-01 - 2018-01-01   | 1     | 10
P6      | 2017-06-01 - 2018-01-01   | 2017-06-01 - 2018-01-01   | 0.5   | 1
P7      | 2017-02-01 - 2018-01-01   | 2017-01-01 - 2018-01-01   | 1     | 4
P8      | 2015-01-01 - 2018-01-01   | 2015-01-01 - 2016-01-01   | 1     | 11
Total   |                           |                           | 4.5   | 28

Table 4.2: Number of alerts from evolving systems.

Results

After two weeks of data gathering, we received responses from 10 consultants, covering 40 systems from 14 clients. After checking and discarding the responses where a concrete date assignment of the lifecycle phases was missing, we selected 5 systems for the initial development phase and 5 systems for the evolving phase. The remaining phases lacked a significant number of responses. To make the comparison more sound, the time the selected systems spent in their respective phase was kept between half a year and one year. The selection was a manual process, while the counting of alerts was done using a SQL statement (Listing A.5). The raw available date ranges, the selected date ranges for analysis, as well as the resulting number of alerts can be seen in Table 4.1 for initial development systems and Table 4.2 for evolving systems. The timespan is given in years.

These results confirm our suspicion: the initial development systems trigger more than twice the number of alerts despite half a year less in total timespan. This means that the number of alerts is influenced by the lifecycle.
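Normalizing the counts by the observed timespan makes the comparison explicit. The numbers below are taken from Tables 4.1 and 4.2.

```java
// Alerts per system-year, using the survey totals: initial-development
// systems raised 63 alerts over 4 system-years, evolving systems 28 alerts
// over 4.5 system-years.
class AlertRate {
    static double perYear(int alerts, double years) {
        return alerts / years;
    }
}
```

This works out to 15.75 alerts per year for initial development systems versus roughly 6.2 for evolving systems, a ratio of more than 2.5.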

4.4.2 Alerts

The second part consisted of a set of questions that were supposed to elicit further indications about the historic alerts and possible requirements for a solution.

Design

The structure and questions of the survey can be found in appendix A.3.1. The alert survey was divided into two sections. The first section was designed to gather more information about alerts; the second to elicit information about requirements for generated alerts. The participant was only able to see and answer one section at a time. Questions ending with a full stop are answered on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree). Questions ending with a question mark are open-ended questions that allow the participant to type in an answer. For the open-ended questions, an answer was not required.


Results

Since the second part was answered after completing the first part, we received 10 responses for this survey as well. The results can be found in appendix A.3.2. We give our interpretation of the results in the following paragraphs. Because of the unstructured nature of the open-ended questions, we summarize those answers.

Question 1 Figure A.3. The alerts generated by the MAS are not seen as completely useless, although a small majority finds them of limited use.

Question 2 Figure A.4. Here we find a more distinct distribution: the sustained alerts (long-term trend analysis) are not seen as useful.

Question 3 Figure A.5. The results for this question are inconclusive, since there is no majority on either side of the scale.

Question 4 Figure A.6. Here we see a more distinct distribution: the component alerts (architecture changes) are seen as useful.

Question 5 Listing A.6. The overall problem statement can be summarized by the following answer given in the survey: "Doesn't work well for all system sizes (too sensitive for small, new systems, too insensitive for large ones)".

Question 6 Listing A.7. The major requirement can be summarized by the following answer given in the survey: "Needs to take project context into account: what types of alert are relevant is determined by the client."

Question 7 Listing A.8. This question was not answered as straightforwardly as the previous two. From the answers, we were able to deduce a relation with the answers to question six, since each consultant has their own definition of what an interesting change is.

4.5 Research questions & problem definition

From our findings, we had a good indication that the definition of what makes an alert interesting depends on the project context, the client and the consultant, together referred to as the context information of the project. Since we only had a limited amount of time available to conduct this project, we decided to focus on the false negatives. This leads us to our first research question (RQ1): Will the number of relevant alerts increase if we introduce context information into the alert generation process?

The problem we see here is that there is too much information and there are too many interrelated variables to consider for a completely autonomous solution to generate exactly the alerts we want it to generate. During this analysis, we discovered two types of context information, namely amount of code (volume) and the lifecycle phase. Adding other context information, like team size or release frequency, as parameters into the calculation makes the problem even more complex. This results in our second research question (RQ2): How can we design a solution to introduce context information into the alert generation process?

With our research questions established, the following chapter describes how we designed and implemented a solution that introduces the context information into the MAS.


Chapter 5

Artifact design

This chapter represents the second phase of TAR. It introduces the artifact that was designed and implemented during this project. In the first section, we provide an overview of the steps taken to design the solution. In the second and final section, we provide more detail on the implementation part of the prototype.

5.1 Design

To design the solution, we first gathered the functional and non-functional requirements. Based on the requirements and domain concepts, we then describe the design decisions we took. We conclude this section with an example of our solution.

5.1.1 Functional requirements

The functional requirements were derived from the theory we developed during the problem investigation phase (see the previous chapter).

FR1: Context information In the previous chapter, we found that context information influences the number of generated alerts. This means that we need a solution that enables the consultants to specify context information.

FR2: Configuration Since the interesting changes are project dependent, the alerts need to be configured individually for each project.

FR3: Categorization If an alert is triggered, it should be possible to tag it as a relevant alert or a false positive. A more detailed categorization would be welcome here. This additional insight will be useful in the future and provide possibilities for improvement.

FR4: Improve current functionality The current functionality should be kept, but also improved so that it takes project size into account. In particular, the number of false positives should be reduced.

5.1.2 Non-functional requirements

Designing a solution usually includes non-functional requirements that may affect the behavior of the resulting system. The overall non-functional requirement we need to satisfy for this project is usability. According to a consultant at SIG, the MAS already offered configuration possibilities in the past. However, we were not able to find out the level of detail of that configuration or how it was carried out. From this perspective, talking with consultants and taking the gathered information into account, we came up with the following non-functional requirements.

(25)

NFR1: Usability If the solution is not easy to understand and use, it will not be effective and may, like the previous configuration capabilities, not be used at all. This is why we need to implement the solution in a way that supports the user as well as possible.

NFR2: No complex & hard to read configuration files Configuration files, especially in XML, tend to be hard to maintain by human beings. They are cluttered with tokens that ensure machine-to-machine communication and schema validity, but negatively influence fast understanding by humans. Another reason for this requirement is that SIG recently converted the XML configuration for the SAT to YAML. We wanted to take this a step further and design a solution that is as simple as possible.

NFR3: On-the-fly The alert configuration for a system has to be available within seconds. If it is not possible to change the configuration easily, the capability to configure is meaningless. If a consultant receives notice from a client that there will be a major change, he can adjust the alerts beforehand and avoid receiving an alert for an event he already knows about.

5.1.3 Configurable software quality alerts

Another aspect we need to consider in designing the solution is the existing solution and its domain, namely software quality monitoring and alerts. In Section 2.1.3, we already described the existing types of alerts. The alerts can be categorized into trend analysis (sudden changes correspond to short-term trend changes, sustained changes to long-term trend changes) and static analysis (component changes). The new solution introduces a new type of alert: configurable alerts.

Figure 5.1 gives an overview of the domain. This model is not meant to be exhaustive; it shows the elements within the domain and their relations as encountered during our time at SIG.


5.1.4 Decisions

Requirements selection

Given the limited time available to conduct this project, we decided to focus on the first two functional requirements, FR1 (context information) and FR2 (configuration), since they are related. There are types of context information that cannot be measured automatically, which is why we need to provide a configurable solution that takes the context information into account.

Besides the time constraints, there is another reason why we decided not to focus on FR3 (categorization) and FR4 (improve current functionality): they are related as well and need a longer period of data collection to improve. We suggest introducing functionality for the consultants to categorize generated alerts, since the current solution does not provide a clear distinction between a false positive and a valid alert. After a reasonable timespan, this data can then be used to tune the existing thresholds.

Design decisions

In the following paragraphs, we explain the design decisions we have taken and how they map to the functional and non-functional requirements as well as the domain.

DD1: Domain specific language We suggest building the solution to configure alerts using a DSL specific to the domain. This has the benefit of a clear view and instant recognition of the data, in contrast to configuration files such as XML, which are hard to read and edit since they have to represent the domain as well as contain the data. We also decided to reuse common terms from the domain and from SIG (see Figure 5.1) as the keywords for the grammar. With this, we intend to further improve the readability and the ease of recognizing what the language is supposed to do. This first design decision already satisfies FR2 and NFR2.

DD2: Alert thresholds on raw metric data Going back to the origins of alert generation [Org18], a configured alert can be triggered if a metric rises above or drops below a set threshold. Furthermore, an alert can be generated if the difference between the current snapshot and the previous one exceeds a certain threshold. For example: if the number of duplications in the source code rises above 250, we generate an alert; for another project, a different threshold is chosen. Since it is possible to configure alerts on raw metrics that are not dependent on project size or the lifecycle, this design decision also satisfies FR1.

DD3: No configuration for existing alert types We intentionally left out any configuration for the existing types of alerts, because their parameters are, at least in our opinion, hard to grasp and explain. Exposing them would be in direct conflict with NFR1. We also suspect that the previous configuration capabilities only allowed adjusting these parameters; since it is hard to judge the impact of the parameters on a project and its alerts, they were probably never used.

5.1.5 Solution

The result of our analysis and decisions is a domain specific language that introduces the capability to configure alerts based on the available context information. The new elements introduced into the domain are highlighted (red) in an updated version of Figure 5.1, shown in Figure 5.2.

The syntax of our language is modeled after the JSON standard [IETF14]. The result is a simple declarative language that enables users to specify alerts and specific thresholds for every available metric. As per DD1, we also reused most of the keywords already encountered in the domain as tokens for our language. An example defining several alerts can be seen in Listing 5.1. In the example, we first define the metadata for our configuration: the customer, the project, and the email addresses the alerts should be sent to. The second part of the configuration defines two alerts: alert1 and alert2. alert1 is defined only for the language Java and triggers if the duplication rating drops below 4.0. alert2, on the other hand, triggers if the overall number of duplications in the source code rises above 750.0. The configuration possibilities will be explained in more detail


Figure 5.2: Software quality monitoring & alerts domain adjustments.

in Section 5.2.1, where we will also explain the grammar of our language. Since it is customary to give a DSL a name, we decided to call our language the Software Monitoring Alert Language (SMAL).

customer customerName
project projectName
emails ["email1@example.com" "email2@example.com"]
alertDefinitions [
    {
        name alert1
        lang java
        metrics [DUPLICATION]
        threshold min 4.0
    }
    {
        name alert2
        metrics [DUPLICATION_VIOLATION]
        threshold max 750.0
    }
]

Listing 5.1: Configuration example specified in SMAL

5.2 Prototype implementation

This section gives more insight into the prototype implementation. It also provides information on further technical and architectural decisions that were taken during the implementation phase.


5.2.1 Technologies

We first had to select a technology with which to implement SMAL. One option would have been Another Tool For Language Recognition (ANTLR), but that was not feasible, since we also needed to provide a comfortable way to edit the SMAL source file. This is why we looked at several language workbenches [EVV+] and decided to use Xtext. This decision was taken for multiple reasons. Xtext is rooted in the Eclipse and Java community, and since SIG also uses Java for internal development, this seemed like a perfect fit. Furthermore, we have prior experience with using Xtext, which, considering the time constraints of this project, is another reason. But the most important reasons for choosing Xtext are the easily extensible support for IDE features like content assist and formatting, as well as the availability of these features in the provided web editor. The web editor was chosen to satisfy NFR3. Since it is nowadays common to implement applications that need to be available from anywhere as web applications, we decided to use this approach as well.

After setting up the project with web editor support, we implemented SMAL in an Xtext grammar, which can be seen in Listing 5.2. Xtext grammars are implemented top-to-bottom using rules and keywords. Our grammar defines six rules in total. The first rule, "AlertConfiguration", defines a keyword (also called a literal) 'customer', which corresponds to the client of the project. After the keyword, we assign a value that is parsed as an "ID" (a predefined Xtext terminal rule) to a field "customer" of the "AlertConfiguration". The structure created after parsing will then begin with a Java interface called AlertConfiguration containing the respective field. In a similar fashion, Xtext creates a structure for the rest of the grammar. Notable features are the construction of enum types (the "ThresholdType" rule) and the terminal rule type (the last rule in our grammar). The former creates an enum type in the resulting structure; terminal rules are used to terminate parsing at a location with regular expressions. For more information on Xtext grammars, we suggest having a look at [Bet16] and [Xte18].

AlertConfiguration:
    'customer' customer=ID
    'project' project=ID
    emails=Emails
    'alertDefinitions' '[' alertDefinitions+=AlertDefinition* ']';

AlertDefinition:
    '{'
    'name' name=ID
    ('lang' lang=ID)?
    ('type' type=ID)?
    'metrics' '[' metrics+=ID+ ']'
    'threshold' thresholdType=ThresholdType threshold=DOUBLE
    ('interval' interval=IntervalType)?
    (emails=Emails)?
    '}';

Emails:
    {Emails} 'emails' '[' addresses+=STRING* ']';

enum IntervalType:
    ONCE='once' | WEEKLY='weekly' | MONTHLY='monthly';

enum ThresholdType:
    MIN='min' | MAX='max' | DIFF_UP='diff up' | DIFF_DOWN='diff down';

terminal DOUBLE returns ecore::EDouble:
    INT '.' INT;

Listing 5.2: Xtext grammar of SMAL
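The rule-to-structure mapping described above can be sketched with plain Java interfaces. Note that the code Xtext actually generates is EMF-based (EObject subtypes with feature lists) and differs from this simplification; the sketch only illustrates how a grammar assignment like 'customer' customer=ID becomes a getCustomer() accessor.

```java
import java.util.List;

// Simplified, hand-written sketch of the model types Xtext derives from the
// grammar rules; the real generated interfaces are EMF-based and richer.
interface AlertConfiguration {
    String getCustomer();
    String getProject();
    List<String> getEmails();
    List<AlertDefinition> getAlertDefinitions();
}

interface AlertDefinition {
    String getName();
    String getLang();          // optional rule part, so this may be null
    List<String> getMetrics();
    String getThresholdType(); // enum rule: 'min' | 'max' | 'diff up' | 'diff down'
    double getThreshold();
}
```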

We will now explain the capabilities of our language in more detail. For references to the keywords in our grammar, we use the single quote (') character; for references to resulting structure types, we use the double quote ("). The grammar allows the editor to specify the alerts based on the context information for a specific 'customer' and 'project'. It also expects a list of 'emails' where the alerts should be sent to. After that, it expects a list of 'alertDefinitions'. One "AlertDefinition" consists of a 'name' chosen by the editor, a list of metric names ('metrics') and the 'threshold'. A threshold can have a lower bound ('min'), an upper bound ('max'), as well as a positive/negative difference in values ('diff up'/'diff down') from the current snapshot to the previous one. It is also possible to specify some optional parameters for the alert, namely the language ('lang'), the 'type' (a SIG internal definition), the 'interval' in which e-mails should be sent, and an additional set of e-mail addresses.
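The semantics of the four threshold kinds, as we implemented them, can be sketched as follows; the enum and method names are illustrative, not the prototype's actual code.

```java
// Illustrative evaluation of the four SMAL threshold kinds against the
// current and previous snapshot values of a metric.
enum ThresholdKind {
    MIN, MAX, DIFF_UP, DIFF_DOWN;

    boolean triggers(double threshold, double current, double previous) {
        return switch (this) {
            case MIN -> current < threshold;                  // lower bound violated
            case MAX -> current > threshold;                  // upper bound violated
            case DIFF_UP -> current - previous > threshold;   // rose too fast
            case DIFF_DOWN -> previous - current > threshold; // dropped too fast
        };
    }
}
```

For example, alert1 from Listing 5.1 corresponds to MIN with threshold 4.0 on the duplication rating, and alert2 to MAX with threshold 750.0 on the raw duplication count.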

After implementing the grammar, we also implemented some of the additional features that Xtext provides. This includes the formatting feature, which allows formatting of the text in the web editor so that all source files end up with the same layout. We furthermore introduced a content assist helper that automatically queries the SAT database for the system in question and gives suggestions for the current item. For example, for the keyword "metrics", all the possible metrics are queried from the database and suggested to the editor. This takes context information into account: the suggestions are queried based on the system the editor is working on, and if the editor is configuring an alert specifically for Java, only metrics that are available for this language are suggested.

5.2.2 Architecture

After implementing the grammar, we set up the web part. Xtext automatically generates a web project with a sample web editor. Since the language features e-mail notification, but we had no e-mail server to connect to, we decided to model the prototype after the MCC, which displays the generated alerts in a simple list. The result is a web interface with three features: the first is the selection of the desired system, with two input fields for the customer and project; the second displays the alerts generated by the MAS; the third is the web editor for the SMAL language file.

The language file is stored in a local GIT repository with automatic push and pull functionality, keeping the file up to date with the central repository. This enables the consultants not only to use the web editor but also to edit the file with a normal text editor.

To reduce dependencies and conform with SIG's clean code principles [VRvdL+16], the web project also contains a Representational State Transfer (REST) interface that returns the configuration based on a customer and project query. Using this approach, we avoid the huge dependency tree that would be introduced into the target application if we included the parser directly. Another benefit is that the solution can be introduced into any existing software landscape. The final version of the prototype architecture can be seen in Figure 5.3. A consultant can edit the language file for a system using the web interface provided by the SMAL web project, which then stores the file in GIT. These two steps form the configuration part of the process. When the MAS is triggered to analyze the latest snapshot for alerts, the parsing part of the process starts: the MAS obtains the configuration from the SMAL web project, which in turn parses the corresponding language file stored in GIT. This interpreter approach [MHSS05], which can also be described as on-the-fly parsing, ensures that the correct configuration file is always used.

5.2.3 Changes to MAS

The MAS defines a Detector interface that can be implemented to add a new type of alert detection. We implemented a ConfigurableChangeDetector that queries the configuration from the REST interface of the web project, which is deployed on an application server. After obtaining the configuration, the detector analyzes the current SAT snapshot data and adds an alert to the list of alerts for the current snapshot if it detects any violations.
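A simplified sketch of this extension point is given below. The real Detector interface and the MAS snapshot types differ; this only illustrates how a detector turns configured thresholds into alerts, using a single upper-bound rule for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of the MAS extension point described above; names and
// signatures are approximations, not the MAS's actual API.
interface Detector {
    List<String> detect(Map<String, Double> snapshotMetrics);
}

class ConfigurableChangeDetector implements Detector {
    // One configured upper-bound rule; in the prototype these rules come
    // from the SMAL REST interface rather than being passed in directly.
    record Rule(String name, String metric, double maxThreshold) {}

    private final List<Rule> rules;

    ConfigurableChangeDetector(List<Rule> rules) { this.rules = rules; }

    @Override
    public List<String> detect(Map<String, Double> snapshotMetrics) {
        List<String> alerts = new ArrayList<>();
        for (Rule rule : rules) {
            Double value = snapshotMetrics.get(rule.metric());
            if (value != null && value > rule.maxThreshold()) {
                alerts.add(rule.name() + ": " + rule.metric() + " = " + value);
            }
        }
        return alerts;
    }
}
```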


Chapter 6

Design validation

This chapter represents the third phase of TAR. It details our approach to verify the design using the prototype implemented in the previous chapter and to answer RQ1 (Will the number of relevant alerts increase if we introduce context information into the alert generation process?) and RQ2 (How can we design a solution to introduce context information into the alert generation process?). We first describe the general setup of our experiment. In the following sections, we then describe the execution and results of our experiment on concrete projects.

6.1 Setup

To answer RQ1, we again enlisted the help of SIG consultants. In brief, we wanted to create a configuration of alerts and thresholds for a selected project together with them, generate the alerts for the last year, and compare them with the alerts raised by the original MAS. At the end of the experiment, we asked the consultants how useful the designed solution would be for them and their respective clients. With the answer to this question, based on expert opinion, we wanted to answer RQ2. The individual steps of the experiment are described in more detail in the following paragraphs.

Project selection To run our experiment, we had to select a project. The only requirement from our side was a current project with a regular (once a week) analysis history covering at least the last 6 months. Otherwise, we relied on the expertise of the consultants to select the project.

Extract context information After selecting the project, we sat together with the consultant to elicit the context information, thus making the context in which we ran the experiment concrete. For this purpose, we first asked them why the client has a monitoring contract with SIG. The second step during elicitation was to go through the historic metric changes by looking at the respective metric diagram in the monitor (see Section 4.1), asking them where they would have wanted to be alerted, and discussing possible thresholds with them. The last question tried to elicit further information about interesting events they reported to the client. During the elicitation, we also presented the prototype.

Creating the configuration The next step consisted of turning the elicited information into a concrete SMAL file.

Generate alerts In the following step, we ran the alert generation using a modified MAS that was able to rerun historic snapshot data.

Categorize alerts To compare the raised alerts, we decided to use the distinction of desirable/undesirable defined by Versnel in his master's thesis in 2010 [Ver10]. With the help of the consultant, the new alerts and old alerts were categorized based on this definition. We also looked at each raised alert in detail and asked the consultant whether he would investigate the change and, if applicable, whether he had reported this change to the client.

6.2

Execution

The setup explained in the previous section was executed with two different consultants and one project for each of them. The elicitation of context information and judgment of the alerts was separated into two different sessions of approximately one hour each. After gathering the relevant context information, the configuration file was created by us and validated with the consultant at the start of the second session. We then generated the alerts based on historical data and categorized them with the domain expert. For data privacy and data protection reasons, we will call the projects ’Project A’ and ’Project B’ respectively.

For each project, we will now go into more detail on the elicited context information, the resulting configuration file and the thresholds. The actual results of the experiments will then follow in Section 6.3.

6.2.1

Project A

Context information

The project is a case management system in the public sector, written in Java and JavaScript. The project is not developed by the client organization but is outsourced to another company. The project had historic problems: it was unstable, suffered regression failures because it had few unit tests, and had long deployment times. Overall, the system has a low maintainability rating, and this is one of the reasons why the client has a monitoring contract. The second major reason is that the client contact person (controller) for SIG uses the changes in metric values to exert pressure on the development team. While it is debatable whether this is good practice, we will set that aside and focus on the fact that he wants to be alerted if the source code quality drops. Having no technical background, he wants to see the results as some sort of KPI values. The overall goal of the project was to improve the maintainability of the system as observable in the model.

Alert configuration

After establishing the context of the project, we looked through the overall metrics that SIG reports on (Section 2.1.1). The resulting configuration file can be seen in Listing 6.1.

Maintainability Starting with maintainability, the consultant selected a value of -0.05 between snapshots as a guard to be alerted on, and a maximum threshold of 3.1 so that reaching the next improvement target can be reported as a success to the client.

Volume Volume was not of particular interest to the consultant since it does not say anything about the project except the amount of source code. Rather than looking at the overall size of the project, he was more interested in the productivity between snapshots. This is why we used historical data of the project to establish a threshold for the 'NR OF MAN MONTHS' metric, an SAT-internal metric that expresses the amount of time it would take to re-implement the system from scratch. The threshold itself was established by going through all the historic snapshots and comparing the value of that metric from one snapshot to the preceding snapshot. After dividing the changes into negative and positive datasets, we took the 95th percentile value of each set and used those as the thresholds.
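The threshold derivation described above can be sketched as follows. The snapshot values below are hypothetical, and the nearest-rank percentile variant is an assumption for illustration, as the exact percentile method is not specified here:

```java
import java.util.ArrayList;
import java.util.List;

// Snapshot-to-snapshot differences of 'NR OF MAN MONTHS' are split into
// rises and drops, and the 95th percentile of each set becomes the
// corresponding alert threshold.
public class PercentileThresholds {

    // Nearest-rank percentile on a sorted copy of the values.
    static double percentile(List<Double> values, double p) {
        double[] sorted = values.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Hypothetical NR_OF_MAN_MONTHS values for consecutive snapshots.
        double[] manMonths = {100, 102, 101, 105, 104, 108, 107, 111};

        List<Double> rises = new ArrayList<>();
        List<Double> drops = new ArrayList<>();
        for (int i = 1; i < manMonths.length; i++) {
            double diff = manMonths[i] - manMonths[i - 1];
            if (diff > 0) rises.add(diff);
            else if (diff < 0) drops.add(-diff);
        }
        System.out.println("up threshold: " + percentile(rises, 95)
                + ", down threshold: " + percentile(drops, 95));
    }
}
```

Applied to a year of weekly snapshots, such per-direction percentiles yield thresholds that flag only the unusually large productivity changes for that specific project.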

Unit metrics safeguard Since the consultant was not particularly interested in specific thresholds for the UNIT-related metrics, we decided to establish a general alert to guard against major changes in these metrics. The thresholds for the upward and downward difference were both set to 0.1.


Module Coupling This metric was not that interesting for the consultant. He was more interested in the number of unit calls increasing. That is why we decided to establish a threshold on a rise of 250 in the internal metric 'java unit Fan in', after looking at the historical graph of the metric.

Component metrics These metrics were not that important for the project, so we did not establish any alerts here.

alertDefinitions [
  {
    name maintainability_upper_limit
    metrics [MAINTAINABILITY]
    threshold max 3.1
  }
  {
    name maintainability_drop
    metrics [MAINTAINABILITY]
    threshold diff down 0.05
  }
  {
    name duplication_rise
    metrics [DUPLICATION]
    threshold diff up 0.05
  }
  {
    name duplication_drop
    metrics [DUPLICATION]
    threshold diff down 0.05
  }
  {
    name metric_safe_guard_rise
    metrics [UNIT_COMPLEXITY UNIT_INTERFACING UNIT_SIZE]
    threshold diff up 0.1
  }
  {
    name metric_safe_guard_drop
    metrics [UNIT_COMPLEXITY UNIT_INTERFACING UNIT_SIZE]
    threshold diff down 0.1
  }
  {
    name java_unit_Fan_in_rise
    metrics [java_unit_Fan_in]
    threshold diff up 250.0
  }
  {
    name productivity_check_up
    metrics [NR_OF_MAN_MONTHS]
    threshold diff up 4.5
  }
  {
    name productivity_check_down
    metrics [NR_OF_MAN_MONTHS]
    threshold diff down 3.5
  }
]

Listing 6.1: Configuration for Project A

6.2.2

Project B

Context information

The project belongs to a different client than Project A and is set up as a microservices architecture developed in Java. As with Project A, the project is not developed by the client itself, but the client uses the services of SIG to keep an eye on the code quality. They understand the need for maintainable code and apply a coding guideline (set up based on advice by SIG), in case a switch of the development
