
Master's Thesis Project

The Effect of Sprawl Maintenance on Software Maintainability

Submitted in partial fulfillment of the requirements for the award of the degree of
Master of Science in Software Engineering

Submitted by
Rosemary Moerbeek

University Supervisor: Jan van Eijck
Company Supervisor: Lodewijk Bergmans
Company Supervisor: Joost Visser

ABSTRACT

The aim of this study is to propose and test a new software metric that captures the degree to which a software system is refactored during its maintenance phase. Assuming that refactoring leads to increased software maintainability, we test the metric's effectiveness by correlating it with a maintainability rating provided by the Software Improvement Group (SIG). We propose Change Ratio, a simple and easily computable metric representing the normalised ratio between changed lines of code and new lines of code. A selection of 135 audited systems varying in size, function, used technologies, maturity and quality provides the data set with which the relationship between Change Ratio and maintainability is investigated. Amongst the findings of this study are: 1) highly maintainable software systems do not show different average Change Ratio values over long periods of time in comparison to less maintainable systems; 2) a Pearson's product-moment correlation of -0.168 has been found between maintainability and average Change Ratio values over periods of 5% system size growth; 3) Change Ratio sequences of length 3 show a Spearman's rank correlation of -0.71 with descending software maintainability when ordered reverse lexicographically; 4) similar sequences which combine Change Ratio values with the absolute size of code churn model the probability of descending maintainability with an accuracy of 84.8% over a range of 15%.

Furthermore, a higher level metric is suggested that focuses on new and changed system files rather than new and changed lines of code.

CONTENTS

1 Introduction
  1.1 Sprawl Maintenance
  1.2 A Software Quality Model
  1.3 Relevant Research
2 Research Approach
  2.1 Problem Identification
  2.2 Metric Proposal
  2.3 System Selection
  2.4 Metric Evaluation
  2.5 Data
3 Change Ratio
  3.1 Change Ratio Characteristics
  3.2 Direct Correlation
  3.3 Change Correlation
  3.4 Sequence Patterns
4 Changed File Ratio
  4.1 Data Distortions
5 Conclusion
  5.1 Threats to Validity
  5.2 Future Research
Appendices
  A Appendix 3
  B Appendix 4

1 INTRODUCTION

During the deployment phase of a software system in a real-world environment, considerable effort is devoted to enhancing, modifying and adapting the system to new or changed requirements. As an effect of these changes, the code becomes more complex and drifts away from its original design, thereby lowering the quality of the system [Mens and Tourwé, 2004]. The degree to which certain system qualities suffer can be managed by putting continuous effort into maintaining or improving the software's original design: refactoring. Effort and time put into these practices is paid back by preserving the system's level of maintainability.

While the importance of refactoring practices is acknowledged by most software developers working on an evolving code base, pressing deadlines, management decisions or ignorance can lead them not to act accordingly. As a result, a software system's maintainability usually decreases more than necessary over time. In this study, an effort is made to identify the effects of different maintenance attitudes on a system's maintainability. One extreme attitude could be continuous refactoring during maintenance tasks, while another extreme could be no refactoring at all. We will name the latter attitude sprawl maintenance for convenience. The attitude of most maintenance projects will lie somewhere in between these extreme cases.

Intuitively, performing sprawl maintenance has a negative impact on the maintainability of the software: when no refactoring activities are performed and only new code is added during maintenance, code complexity inevitably increases.

Modelling the relationship between how a software project grows and how its maintainability reacts could bring more clarity to the development of technical debt in maintenance systems. For this purpose, we will propose a simple method to model this relationship and evaluate the model using benchmarking techniques applied to 135 real-life software systems varying in size, age, functionality and quality.

We choose to express software growth from one version to another in terms of code churn. Code churn is a process measure that captures the amount of code change taking place within a software unit over time [Nagappan and Ball, 2005]. In the scope of this study, code churn will be represented by three components: new lines of code, changed lines of code and deleted lines of code. Code churn is easily computed with conventional tools like git diff and does not require any technical knowledge to understand. This approach differs from the more conventional growth measure, where only the system size from one version to another is taken into account. In this study, we will regard a changed line of code as equally important as a new line of code, because both represent a unit of effort put into the system that affects its internal or external behaviour.
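To illustrate how little tooling the raw counts require, the sketch below (in R, the language used for our experiments) collects churn between two hypothetical tags, "v1.0" and "v2.0", of a git repository. Note that git itself only reports added and deleted lines; splitting additions into genuinely new versus changed lines requires a similarity analysis such as the Myers-diff-based step described in section 2.5.2.

```r
# Sketch only, not the SIG tooling: `git diff --numstat` prints
# "<added>\t<deleted>\t<path>" per file; binary files show "-".
numstat <- system2("git", c("diff", "--numstat", "v1.0", "v2.0"),
                   stdout = TRUE)
fields  <- strsplit(numstat, "\t")
added   <- suppressWarnings(as.numeric(vapply(fields, `[`, "", 1)))
deleted <- suppressWarnings(as.numeric(vapply(fields, `[`, "", 2)))
cat("added:",   sum(added,   na.rm = TRUE),
    "deleted:", sum(deleted, na.rm = TRUE), "\n")
```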

Expressing a software unit’s maintainability is more complex and includes many facets. The model we choose to use in this study has already proven itself to be useful and accurate in real-world software systems many times. We will further elaborate on this choice in section 1.2.

1.1 Sprawl Maintenance

Software refactoring can be defined as a sequence of code changes that improves the quality of design without changing the behavior of the software [M. Fowler, 1999]. Regularly performing refactoring activities on a deployed software system reduces the cost and effort of software maintenance in the long run by keeping the software complexity within acceptable levels [Shatnawi, 2010]. An empirical study on the effects of refactoring activities found that software reusability, flexibility, extendability and effectiveness improved in 65%, 53%, 42% and 42% of the cases respectively.

This study proposes a software metric that represents the degree to which a software system is refactored over a period of time. While continuous refactoring is a known term in the field of software engineering, its counterpart is still unnamed. We will therefore refer to the absence of refactoring as sprawl maintenance.

1.2 A Software Quality Model

As software quality is a concept that can be defined in many ways, we want to use a software quality model that has proven itself useful and accurate in real-world applications. The SIG Monitor quality model fulfils this criterion, considering it has been applied to more than 1000 software systems varying in functionality, size, architecture and used technology, receiving mostly positive client feedback. The model is based on the SIG/TÜViT Evaluation Criteria, providing standardised evaluation and certification of the technical quality of the source code of software products.

While the ISO/IEC 25010 international standard on software product quality provides a general taxonomy of quality aspects, as shown in figure 2, the SIG Monitor quality model focuses on the internal quality characteristic of maintainability and its sub-characteristics as defined by the ISO/IEC 25010 standard: analysability, modifiability, testability, modularity and reusability. These (sub-)characteristics are defined as follows:

Figure 2: The ISO/IEC 25010 standard classifies software quality in a structured set of characteristics and sub-characteristics.

Maintainability: The degree of effectiveness and efficiency with which a product or system can be modified by the intended maintainers, where modifications can include corrections, improvements or adaptation of the software to changes in environment, and in requirements and functional specifications. It also includes installation of updates and upgrades. It can be interpreted as either an inherent capability of the product or system to facilitate maintenance activities, or the quality in use experienced by the maintainers for the goal of maintaining the product or system.

Analysability: The degree of effectiveness and efficiency with which it is possible to assess the impact on a product or system of an intended change to one or more of its parts, or to diagnose a product for deficiencies or causes of failures, or to identify parts to be modified.

Testability: The degree of effectiveness and efficiency with which test criteria can be established for a system, product or component and tests can be performed to determine whether those criteria have been met.

Modularity: The degree to which a system or computer program is composed of discrete components such that a change to one component has minimal impact on other components.

Reusability: The degree to which an asset can be used in more than one system, or in building other assets.

Modifiability: The degree to which a product or system can be effectively and efficiently modified without introducing defects or degrading existing product quality.


The SIG Monitor quality model aims to evaluate these maintainability characteristics by mapping a set of well-chosen source code properties onto each of them [Heitlager et al., 2007]. For instance, the source code property 'volume' negatively affects the quality characteristics 'analysability' and 'testability', following the simple assumption that a large code base is harder to analyse and test. Figure 3 presents an overview of source code properties impacting software quality characteristics.

Figure 3: Maintainability characteristics mapped to software properties.

After mapping software properties onto maintainability characteristics, a benchmarking step is carried out by the SIG Maintainability Model to compare the current system's results to the results of all past system evaluations. SIG maintains a benchmark repository holding the results of hundreds of standard system evaluations. The great advantage of this benchmarking step is that it makes the results of an evaluation meaningful: telling clients that their software system is bad is usually no news to them; telling them how their system scores compared to other real-life software systems is much more informative.

1.2.1 Product measures vs. process measures

Note that the code churn data are not involved in computing the software quality ratings. Their independence is a critical factor for the validity of any potential results.

There is, however, a small theoretical dependence between the maintainability rating and the churn data: since volume is one of the source code properties that negatively influences a software system's analysability and testability, a large number of new lines added to the system can cause a decrease in its maintainability rating. We believe this dependency exists only sporadically and will not affect any results, because the majority of our churn data concerns only fractions of the total system size. The change in maintainability rating based solely on volume change is insignificant for the great majority of our data.

1.3 Relevant Research

One of the first studies that tried to associate a system's evolution metric with its quality was conducted by [Munson and Elbaum, 1998]. They measured the effectiveness of code complexity churn as a fault surrogate and validated their results with testing reports. Their results included a 0.949 Pearson product-moment correlation between new lines of code added to a software system and its complexity. In other words, a software system's complexity grows with its size. This is an important statement for our study, as the SIG maintainability model assumes the same.

A recent study by [Karus and Dumas, 2012a] used data extracted from revision control systems to unveil predictors of the yearly cumulative code churn of software projects. They compared results using organisational metrics with results using code metrics and found the former superior. Organisational metrics included: (1) the total size of the project team; (2) previous activity of the committing developer; (3) the project revision size in terms of number of files. Predicting yearly cumulative code churn, however, intuitively depends on different factors than software quality does. It is therefore not likely that the results of this study would improve using purely organisational metrics. Still, this could be an option for future research.

Much research is being done on finding error-prone code blocks in a software system, to help developers save time and effort. [Shin et al., 2011], [Giger et al., 2011] and [Nagappan and Ball, 2005] all hypothesised that code blocks that are changed a lot compared to other blocks are more likely to contain defects. In these works, code churn is considered an important metric that is able to represent local developer activity, and no distinction is made between new lines of code, changed lines of code or deleted lines of code. [Shin et al., 2011] investigated whether software metrics obtained from source code and development history are discriminative and predictive of vulnerable code locations. Code churn was one of their software metrics, together with code complexity and developer activity metrics. The models using all three types of metrics together predicted over 80 percent of the known vulnerable files with less than 25 percent false positives. [Giger et al., 2011] explore the advantage of using fine-grained source code changes (SCC) for bug prediction. SCC captures the exact code changes and their semantics down to statement level. A series of experiments is presented using different machine learning algorithms with a data set from the Eclipse platform to empirically evaluate the performance of SCC and code churn in terms of modified lines of code (LM). The results show that SCC outperforms LM for learning bug prediction models.

[Nagappan and Ball, 2005] showed that relative code churn measures are good predictors of defect density while absolute code churn measures are not. Even though we do not aim to assess the defect density of software in this study, we likewise prefer relative code churn measures over absolute ones. A more detailed elaboration of this choice can be found in section 2.2.

In summary, prior research on code churn focused mostly on finding ’weak spots’ in software systems with varying results. The approach taken in this study will differ from prior research in the following ways:

1. We will investigate the relationships that exist between code churn components (i.e. new lines of code, changed lines of code and deleted lines of code). In existing literature, code churn is computed as the sum of added, changed and optionally deleted lines of code since the last revision. By summing the components, potentially valuable information is lost that could have been captured by the proportions of the code churn components instead.

2. Instead of finding vulnerable locations inside a software project as done by [Shin et al., 2011], [Giger et al., 2011] and [Nagappan and Ball, 2005], we aim to detect vulnerable software projects amongst a large set of random software projects using benchmark techniques.

3. We will test our hypotheses using high-level software quality metrics developed and implemented by SIG, while most existing studies used bug reports as a proxy of software quality.

2 RESEARCH APPROACH

This chapter elaborates on the steps we took to reveal a relationship between maintenance attitudes (continuous refactoring vs. sprawl maintenance) and maintainability. We have the privilege to conduct our research in the professional environment of the Software Improvement Group (SIG) in Amsterdam, where static source code metrics are used to provide actionable advice for improving quality aspects of clients' software. While SIG has developed multiple models for assessing separate software quality aspects as defined by ISO 25010, we will only make use of their maintainability model (see section 1.2). One reason is that the maintainability of a software unit relates directly to its technical debt, which is often considered a heavy burden for software departments. Another, more practical reason is that the maintainability model has the highest maturity and has been applied to the largest number of real-life software systems over long periods of time. A more thorough elaboration of our data set can be found in section 2.5.

2.1 Problem Identification

This study aims to detect whether certain maintenance attitudes (continuous refactoring vs. sprawl maintenance) affect overall software maintainability in any way. Investigating whether such a relationship exists includes at least two facets:

From a static perspective, we want to know whether systems where sprawl maintenance has been performed in the past show lower overall maintainability.

Dynamically speaking, we want to know how sprawl maintenance affects maintainability per unit of time.

2.2 Metric Proposal

A metric is proposed that represents the maintenance attitude: continuous refactoring on the one hand, sprawl maintenance on the other. For its calculation, we will rely solely on code churn data from version to version, for two main reasons: (1) we believe code churn data is sufficient for this purpose; (2) code churn data is easy to collect for almost any software developing organisation, which makes the proposed metric broadly employable.


To exclude any biases towards certain software attributes, the metric has to meet at least the following requirements:

• Independence of system size
• Independence of programming language
• Independence of team size
• Easy to calculate and understand

By calculating a ratio of changed lines of code and new lines of code per unit of time, we manage to satisfy the aforementioned requirements. We name this proposed metric Change Ratio and calculate its value as follows:

Change Ratio = ChangedLOC / (ChangedLOC + NewLOC)

Any Change Ratio value ranges between 0 and 1. This normalisation is especially important to avoid outliers due to data disruptions. If both ChangedLOC and NewLOC are zero, no Change Ratio value exists. In other words: if no lines were added or changed since the last snapshot, no Change Ratio can be computed. Semantically, this makes sense: if no churn activity was recorded, we do not expect a change in any (sub-)quality rating. We can simply skip these events without data disruptions.
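A direct transcription of this definition in R could look as follows; the function name and the NA convention for churn-free snapshots are our own choices, not part of the SIG tooling:

```r
# Change Ratio for one snapshot; returns NA when no churn was recorded,
# so such snapshots can be skipped downstream.
change_ratio <- function(changed_loc, new_loc) {
  total <- changed_loc + new_loc
  if (total == 0) return(NA_real_)
  changed_loc / total
}

change_ratio(40, 160)  # 0.2: one changed line for every four new lines
change_ratio(0, 0)     # NA: no Change Ratio exists for this snapshot
```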

2.3 System Selection

Together, the two data sources described in section 2.5 contain a total of 36470 records of 1106 distinct software systems. However, not all available systems fit our study, due to data incompleteness, inconsistency or other causes that make them unreliable. We have therefore excluded all systems that do not satisfy all of the following criteria:

System is in maintenance phase
This information is not directly available from our data source. Therefore, we have manually selected projects that are considered maintenance systems by the SIG consultants.

System is active
We consider projects to be inactive if more than 50% of their snapshots do not show any activity in terms of code churn. (517 projects meet this requirement)

System is of significant size
The system has to be a mature, complete and self-contained software unit. This information is not directly available from our data, but we will use system size in terms of invested man years as a proxy for completeness. We use man years instead of lines of code to avoid biases towards any programming language.


System has a minimum of 10 snapshots recorded
Our data source contains many Software Risk Assessments (SRAs). These assessments focus on measuring software quality at a certain moment in time, rather than monitoring changes in software quality over a longer period. To calculate a Change Ratio, at least two snapshots are required, but more are always preferred to detect trends or patterns. We therefore filter out the SRAs by selecting projects with a minimum of 10 recorded snapshots. (543 projects meet this requirement)

System has been recorded at regular time intervals (± 1 week)
If time intervals between successive records are not constant, or if the time intervals are simply too big, the churn records become less reliable. Ideally, a Change Ratio is calculated after every commit to a system. The data used in this study, however, consist of weekly records and are inherently less precise than daily records would be. For instance, if the same line of code is changed twice in one week, this would be captured in daily but not in weekly records. An exploratory study on code churn found that the difference in recorded churn between commit-based and week-based intervals is in the range of 4-20%, whereas the difference between commit-based and month-based intervals ranges from 6-22% [Kraaijeveld]. Guided by these numbers, we will only consider projects where no snapshot interval is longer than two months.

This selection leaves us with 135 representative systems that will be included in our exploratory study on Change Ratio. Due to the confidentiality of SIG's analyses, we cannot share any information that could identify any of the selected systems. The set of selected systems, however, shows a broad variation across system size, functionality and quality, assuring generality of any results that may be found.

2.4 Metric Evaluation

The proposed software metric Change Ratio is not yet documented in the literature, so thorough evaluation is required. According to [Fenton and Pfleeger, 1998], conducting a formal experiment seems most appropriate in our context. In contrast to other investigation techniques like a case study or a survey, a formal experiment allows us to identify the relationship between two data sets in a way that is objective and replicable while isolating controlled variables.

All of the experiments will be implemented and executed in the programming language R, a free software environment for statistical computing and graphics [R Development Core Team, 2008].

2.4.1 Exploring Metric Characteristics

The main characteristics of our proposed metric will be explored using statistical and graphical methods. This exploration step is needed to guide our quantitative study on Change Ratio later on and will help us interpret future findings. It may also reveal patterns and inspire future steps. Questions that will be addressed in this section include:

• How are Change Ratio values distributed over our data set of 135 different software projects?
• What is a low Change Ratio value?
• What is a high Change Ratio value?
• Do different software systems show similar Change Ratio patterns?
• Is the Change Ratio stable or volatile over time?

2.4.2 Direct Correlation

The static relationship between maintenance attitude and maintainability will be addressed in this section by computing the Pearson's product-moment correlation between each system's average maintainability rating and the average of its recorded Change Ratio values. Note that this does not address the question whether Change Ratio values relate to the degree of decreasing software maintainability, since the collection of data started at different levels of maintainability ratings.

A significant positive correlation found here would suggest that maintenance tasks in highly maintainable software systems are performed with fewer new lines of code relative to changed lines of code. It would not establish, however, that a series of high Change Ratio values causes high maintainability.
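In R, this step amounts to a single correlation test; the sketch below assumes a hypothetical data frame `systems` with one row per system, holding its mean maintainability rating and mean Change Ratio:

```r
# Pearson's product-moment correlation between per-system averages;
# cor.test also reports the p-value used to judge significance.
ct <- cor.test(systems$mean_maintainability,
               systems$mean_change_ratio,
               method = "pearson")
ct$estimate  # Pearson's r
ct$p.value   # significance level
```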

Questions that will be addressed in this section include:

• Do highly maintainable systems show relatively high Change Ratios compared to less maintainable systems?
• Which maintainability aspect is most associated with Change Ratio values?

2.4.3 Change Correlation

To answer the question whether a particular maintenance attitude has predictive power towards the degree to which a maintainability rating decreases (or improves), we compute a Pearson's product-moment correlation between Change Ratio and differences in maintainability rating per unit of system growth. Two important factors should be considered here:

1. The degree to which a system’s maintainability score changes per unit of time is influenced by the total size of the system. Improving 1000 lines of code in a big system does not necessarily lead to the same change in maintainability rating as it would in a small system. This means that we should not use absolute amounts of code churn.


2. The size of the team that performs the maintenance tasks influences the time it takes to perform a certain task. This means that we should not use absolute time intervals.

We exclude both factors by dividing our data into chunks of comparable units of growth. Each chunk represents a sequence of records in which the system has grown by a certain percentage of its own size. By doing so, we exclude biases regarding system size and team size.

After dividing the data into comparable chunks, we will conduct an inter-system and intra-system experiment calculating the Pearson’s product-moment correlation between Change Ratio values and differences in maintainability ratings.

For the inter-system correlation we use chunks of 5% system growth, while we use 1% system growth for the intra-system correlation. Data exploration showed us that an increase of 5% of the total system size (measured in LOC) is large enough to produce a significant change in the system's maintainability rating, but still small enough to capture different Change Ratio periods within systems.

For the intra-system correlation we use a more fine-grained approach, dividing the data into smaller chunks of 1% system growth. This leads to more data points per project and thus more reliable outcomes, but also introduces more volatility in periods of Change Ratio records.
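A sketch of this chunking step in R, assuming a hypothetical data frame `snap` holding one system's snapshots in chronological order, with columns `loc` (system size) and `change_ratio`; the rule for resetting the growth baseline is our own choice:

```r
# Assign each snapshot to a chunk; a new chunk starts once the system has
# grown by `growth` (e.g. 5%) relative to the size at the chunk's start.
make_chunks <- function(snap, growth = 0.05) {
  chunk <- integer(nrow(snap))
  id <- 1L
  base <- snap$loc[1]
  for (i in seq_len(nrow(snap))) {
    if (snap$loc[i] >= base * (1 + growth)) {
      id <- id + 1L          # growth threshold reached: open a new chunk
      base <- snap$loc[i]
    }
    chunk[i] <- id
  }
  chunk
}

snap$chunk <- make_chunks(snap, growth = 0.05)
# Average Change Ratio per chunk, to be correlated with the change in
# maintainability rating over the same chunk.
aggregate(change_ratio ~ chunk, data = snap, FUN = mean)
```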

Questions addressed in this step include:

• Does a software system’s maintainability decrease more per unit of growth if there was a low average Change Ratio measured in that period?

2.4.4 Change Ratio Sequences

This step focuses on the behavior of the maintainability rating of a software system during certain Change Ratio sequences of length 3, rather than on average Change Ratio values. To represent Change Ratio values, discrete values (ranging from 1 to 4) are used rather than continuous ones, to generate more comparable sequences and to flatten out any data disruptions. Similarly, the behavior of the maintainability rating is represented by the discrete values descending, unchanged and ascending. As an example, typical data points could be:

sequence maintainability behavior

{1, 4, 2} descending

{2, 3, 2} unchanged

{3, 3, 2} ascending

The reason to choose sequences of length 3 is twofold. Firstly, a sequence of length 3 represents a period of time (3 weeks) in which a significant change in software quality is possible and also expected; two-week data would probably be more volatile. Secondly, choosing sequences of 4 or more Change Ratios would result in 256 or more very similar sequences.

In addition, we choose to apply a reverse lexicographical ordering to the 64 possible sequences, to be able to express the correlation between certain sequences and certain maintainability behaviors with a Spearman's rank correlation coefficient. Such an ordering gives more weight to recent Change Ratio values than to older ones.
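The sketch below illustrates the discretisation and ordering in R for a single system; `cr` is a hypothetical vector of that system's Change Ratio values, and the base-4 encoding of the rank is our own way of realising the reverse lexicographical ordering:

```r
# Map Change Ratio values to per-system quartile classes 1..4
# (1 = very low, ..., 4 = very high).
q   <- quantile(cr, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
lvl <- findInterval(cr, q) + 1L

# Overlapping sequences of length 3: each row holds three consecutive
# quartile classes.
n    <- length(lvl)
seqs <- cbind(lvl[1:(n - 2)], lvl[2:(n - 1)], lvl[3:n])

# Reverse lexicographical rank: the most recent value is the most
# significant base-4 digit, so it carries the most weight.
rank_rl <- (seqs[, 3] - 1L) * 16L + (seqs[, 2] - 1L) * 4L + (seqs[, 1] - 1L)
```

Ranks like these, aggregated per sequence type, are what the Spearman's rank correlation in section 3.4 is computed on.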

When assigning discrete values to both Change Ratio values and maintainability changes, it is important to use only the current system as a reference. For example, a change in maintainability rating of 0.01 might be considered large in system A, while it would be considered small in system B.

2.4.5 Churn Sized Classification

Similar to the preceding section, Change Ratio sequences are tested here for correlation with descending, unchanging or ascending maintainability ratings. However, the way we discretise the continuous Change Ratio values is slightly different: we take into account not only the size of the Change Ratio, but also the size of the total churn over the same period of time.

This approach aims to partially eliminate the noise caused by periods of little activity that almost certainly do not lead to any significant change in maintainability rating at all. In other words, this approach focuses on periods of significant activity.

2.5 Data

In this study, two sets of data will be used. The first is a set of churn data, representing how many lines of code were added, changed, deleted and moved from one version to another, which will be used to calculate Change Ratio values. The second is a set of maintainability ratings as computed by SIG's Maintainability Model for the same systems in the same periods as the collected churn data. All subjacent maintainability ratings (i.e. analysability, testability, modularity, reusability and modifiability) are also included in this data set.

2.5.1 SAW

The Software Improvement Group has been monitoring more than 500 unique software systems for 3 or more weeks. Since 2007, SIG has analysed numerous aspects of software systems of all kinds and collected these data in the Software Analysis Warehouse (SAW).

The content of this warehouse can roughly be divided into three groups:

Snapshot identification: Name of system, component or file, and date of recording.

Source code properties: Results of static source code analysis, including total lines, lines of code, lines of comments, programming language, approximation of invested man years, percentage of duplicated code, fan-in, fan-out, and churn.

(Sub-)quality ratings: Ratings ranging between 0.5 and 5.5 for maintainability and its characteristics according to ISO standard 9126: analysability, testability, modularity, reusability and modifiability.

Note that the system analysis is done on several different levels of granularity. Identification, source code properties and quality ratings are recorded for the system as a whole, but also for each of its components, sub-components and even files.

2.5.2 Computing Churn

In this section, we elaborate further on how churn is actually computed. The reliability of any results in this study depends on the churn computation process. This process includes three steps: 1) a pre-processing step which makes the source code ready for analysis; 2) an implementation of the Myers diff algorithm computing new, changed and deleted lines of code; and 3) an additional step to find source code files that were moved into other directories. These three steps are described in greater detail below:

Pre-processing step: When a new version of a client's software system reaches the analysis tool, a few pre-processing steps are taken before the churn analysis begins. This includes removing blank lines, comments and redundant white spaces and tabs. This step assures that only 'real' differences in code get recorded: adding or removing comments in a file has no influence on code churn.

Myers diff: Next, the new, changed and deleted lines of code are computed following the Myers diff algorithm. This algorithm [Myers, 1986] aims to solve the problem of determining the differences between two sequences of symbols. It has various applications, for example in spelling correction systems or in the study of genetic evolution, but it is most widely used in file comparison tools. The algorithm performs well when differences are small and is consequently fast in typical applications, like comparing different versions of a file. When two source code files A and B are input to the algorithm, it finds the shortest edit script (SES) that transforms A into B using only insertions and deletions. The problem of finding the SES is equivalent to finding the Longest Common Subsequence (LCS) of the two files. In [Bergroth et al., 2000], different LCS algorithms are compared based on runtime and memory usage; the implementation of the Myers diff algorithm scored above average on both tests. Because of its simplicity and speed, it is the default algorithm for the GNU command diff.

If the algorithm encounters a line of code in the latest version that does not exist in the first version, it is considered a changed line if a similar line can be found in the first version; otherwise, it is counted as a new line. Note that the similarity threshold has great impact on the final result of the algorithm: how similar to its previous version must a line of code be, to be considered changed instead of new? In the context of this study, we use a 50% threshold, which is also the default value used by the git diff command. Upon manually evaluating a sample of churn results, this seems an accurate threshold.
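For illustration only, the R sketch below classifies a line from the new version as "changed" or "new" with a crude edit-distance similarity test in the spirit of the 50% threshold; the function is hypothetical, and the real tooling applies the threshold inside its Myers-diff post-processing rather than with a plain Levenshtein distance:

```r
# Classify a line from the new version: "changed" if any old line is at
# least `threshold` similar to it, "new" otherwise.
classify_line <- function(new_line, old_lines, threshold = 0.5) {
  if (length(old_lines) == 0) return("new")
  d   <- adist(new_line, old_lines)  # Levenshtein distances (base R, utils)
  sim <- 1 - d / pmax(nchar(new_line), nchar(old_lines))
  if (max(sim) >= threshold) "changed" else "new"
}

classify_line("total <- total + 1",
              c("total <- total + x", "print(total)"))
# -> "changed": the closest old line is well over 50% similar
```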

Moved files heuristic: One of the limitations of the Myers diff algorithm is that it does not take moved lines of code or moved files into account. This implies that renaming a file or its path is regarded as a deletion plus an addition to the system. The same applies when code is moved from one file to another. This is a problem that concerns the calculation of Change Ratio: if moved lines cannot be detected, they are counted as deletions followed by additions, causing incorrect records of new lines of code in the SAW, which are used to calculate the Change Ratio values.

To partly overcome this limitation, SIG has developed its own heuristic for detecting moved files. The heuristic takes two code snapshots as input (two versions of the same system) and works as follows:

1. Construct a set of all paths to files and directories separately for both snapshots: F_A and F_B, where A is the original snapshot and B is the new snapshot.


2. Divide all paths into one of the following categories: added, deleted and common.

   • F_added ← F_B − F_A
   • F_deleted ← F_A − F_B
   • F_common ← F_A ∩ F_B

3. From F_added and F_deleted, remove all partial paths that are in F_common.

4. The intersection of the new F_added and F_deleted represents the set of moved files: F_moved ← F_added ∩ F_deleted
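A hedged transcription of this heuristic in R is sketched below. The function name is ours, and step 3 is interpreted loosely: stripping the path components shared by both snapshots is approximated here by comparing bare file names, which matches the heuristic's stated limitation that a moved file is only found if it keeps its name:

```r
# paths_a / paths_b: character vectors of file paths in the old and new
# snapshot. Returns the names of files that appear to have moved.
moved_files <- function(paths_a, paths_b) {
  f_added   <- setdiff(paths_b, paths_a)   # F_added   = F_B - F_A
  f_deleted <- setdiff(paths_a, paths_b)   # F_deleted = F_A - F_B
  # A moved-but-not-renamed file shows up with the same file name in both
  # the added and the deleted set, under different directories.
  intersect(basename(f_added), basename(f_deleted))
}

moved_files(c("src/a.c", "src/b.c"),
            c("lib/a.c", "src/b.c"))
# -> "a.c": moved from src/ to lib/, detected because its name is intact
```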

Unfortunately, this heuristic has two major drawbacks. One is its incompleteness: it is only capable of finding moved files that are not renamed after being moved to a different directory, as it does not consider the file's content. A second drawback is the fact that SIG started recording moved lines only recently, meaning that this data set is much smaller than the data sets of new and changed lines of code. Considering these drawbacks, we will not make use of this data set.

3 CHANGE RATIO

Code churn is a measure of the amount of code change taking place within a software unit over time [Nagappan and Ball, 2005]. It captures the final result of effort put into the system in four distinct values: new lines of code, deleted lines of code, changed lines of code and moved lines of code, which can be written as the quadruple {NLOC, DLOC, CLOC, MLOC}. In most studies on code churn [Shihab et al., 2013], [Karus and Dumas, 2012b], churn is calculated as the sum of added, changed and optionally deleted lines of code. However, in this study, we are not interested in the sum of the components, but rather in the relationships between them.¹

Different types of maintenance activities result in different code churn distributions. For instance, changing the name of a variable within a software unit is likely to result in a code churn distribution of {0, 0, n, 0}, where n is the number of lines where the variable occurs. Adding a new feature will result in a distribution similar to {n, 0, 0, 0}.

We believe that these distributions reveal general information about the maintenance process as a whole, but they are especially of interest in finding clues for sprawl maintenance.

An important assumption we will make here is the following: software projects that are regularly refactored have relatively higher amounts of CLOC than projects showing sprawl maintenance (e.g. adding features without refactoring the original system design). To measure the degree of sprawl maintenance, we introduce the Change Ratio metric: a normalised ratio between CLOC and NLOC. It is simply computed as follows:

Change Ratio = CLOC / (CLOC + NLOC)

3.1 Change Ratio Characteristics

To be able to detect abnormalities in the behaviour of a system's Change Ratio, there should first be an understanding of a normally behaving Change Ratio. In the following section, we therefore conduct an exploratory study of the concept of Change Ratio, as this is, to the best of our knowledge, not covered in existing literature.

¹ Another fact to consider when calculating churn is that summing the components greatly neglects the impact of CLOC, since it is usually much lower than NLOC and DLOC.

3.1.1 Properties

The Change Ratio is meant to measure a certain attitude towards maintaining a software system. To be able to distinguish between the different attitudes of software maintainers, we believe the most important code churn components are NLOC and CLOC. We intentionally do not include DLOC and MLOC in the calculation of the Change Ratio: deleting and moving lines of code mostly occurs as part of structural changes and does not add functionality to the system.

The reason for choosing a normalised ratio is to exclude dependency on team size and system size. In other words, using a normalised ratio, we measure the attitude independently of absolute code churn. However, by calculating a normalised ratio, we encounter a problem with data disruptions caused by extremely small code churn records. This problem is further explained in section 3.2.

Figure 4 shows the behavior of the ratio over a period of 100 (weekly) snapshots for four randomly selected projects; no obvious trends, patterns or similarities appear to stand out. This is further investigated in section 3.2.

Figure 4: Examples of the Change Ratio of 4 different systems over 100 weekly snapshots (x-axis: snapshot; y-axis: Change Ratio).

To be able to distinguish between high and low Change Ratio values, the data distribution is of interest. Figure 5 shows the density histogram of all recorded Change Ratios of all selected projects, revealing a strongly skewed distribution with most of its mass near zero.

Figure 5: Density histogram of all recorded Change Ratio values (x-axis: Change Ratio; y-axis: frequency).

The skewness means that NLOC is usually bigger than CLOC. New functionality is impossible to implement without adding lines of code, so it seems reasonable that NLOC is usually much bigger than CLOC.

Dividing the data into quartiles provides four Change Ratio categories: very low [0.0-0.018], low [0.018-0.049], high [0.049-0.113] and very high [0.113-1.0]. In section 3.2, these categories will be used to find certain patterns in Change Ratio behaviour.

A few other observations can be made from figure 5: Change Ratios accumulate at 0, 1/4, 1/3, 1/2 and 1. Data inspection showed that these disruptions are caused by extremely small code churns. For instance, {NLOC = 3, CLOC = 1} gives a Change Ratio of 1/4, {NLOC = 1, CLOC = 1} gives a Change Ratio of 1/2, and all snapshots where CLOC = 0 give a Change Ratio of zero.

These data disruptions will be solved in section 3.3 by dividing Change Ratio records into comparable units of growth such that each chunk represents the same percentage of growth relative to its system size.

3.2 Direct Correlation

In this section, we focus on how a project's Change Ratio relates to its average maintainability rating. Note that this will not necessarily reveal any clues on how a Change Ratio influences software maintainability: the recordings of the systems in the data set started at different levels of maintainability. If a system started at a very low maintainability rating but managed to increase it substantially during the recording period (for example by refactoring), this will not be represented in the mean maintainability, which will just show a mediocre rating. However, we can argue the other way around: intuitively, we expect a positive correlation between average maintainability score and average Change Ratio, considering the following statements:

• In a system with high modularity, existing code can be changed with less risk of breaking other parts of the system than in a system with low modularity.
• In a system with high analysability, the computation process is better understood and changes can be made without adding redundancy.
• In a system with a low duplication rating (implying a high percentage of duplicated code), changes involving one of the clones must be implemented at several locations, with extra/redundant code as a consequence.

High modularity, high analysability and low duplication all increase a system's maintainability rating following the SIG Maintainability Model. In other words, to implement a new feature or improvement in a highly maintainable software system, fewer new lines of code are needed. What is more, simple system modifications can be achieved by just changing lines of code.

So, the question we’re addressing here is:

Do highly maintainable software systems show larger Change Ratio’s in comparison to less maintainable software systems?

The answer can be derived from figure 6, where the average maintainability characteristic ratings (ranging from 0.5 to 5.5) are plotted against the average Change Ratio values of all selected software projects. The red lines, representing the least-squares fits to the data points of each panel, are all more or less horizontal, implying no significant correlation between the two axes.

Therefore we conclude that, according to our data, highly maintainable software systems do not show larger Change Ratios in comparison to less maintainable software systems.

Figure 6: Average maintainability characteristic rating (range: 0.5-5.5) plotted against average Change Ratio for all 135 selected projects, with one panel per characteristic (Volume, Unit Complexity, Duplication, Unit Size, Unit Interfacing, Module Coupling, Reusability, Modifiability, Analysability, Testability, Modularity, Maintainability) and a least-squares fit line per panel.

3.3 Change Correlation

In this section, we compute a direct Pearson's product-moment correlation between the Change Ratio and the difference in software maintainability. We use the difference in software maintainability to make sure that the maintainability data represents a change over a certain period of time, just as the Change Ratio does. It is computed by subtracting the maintainability rating at time t−1 from the rating at time t, such that a negative result represents a decreased rating and a positive one an increased rating.
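In R this is a one-liner; assuming `m` is a hypothetical vector of one system's maintainability ratings in chronological order:

```r
# delta_m[i] = m[i+1] - m[i]: negative values mark decreases in the rating.
delta_m <- diff(m)
```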

Next, all data per system is divided into chunks, where each chunk represents a 5% volume growth in terms of lines of code. This means that some chunks span multiple weeks of data, while other chunks contain only one week of records; the average length of a chunk is approximately 13.5 weeks. The reason why we divide our data into chunks of similar relative growth is that it makes the difference in quality rating comparable over differently sized software projects. A quality rating is more stable for large projects than for smaller ones. By looking at relative growth, projects of all sizes get a "fair" chance of increasing or decreasing their quality rating.

Note that this division into chunks also eliminates the fact that some systems are maintained by more people than others, or at different speeds, because the effort to reach a 5% volume growth is independent of team size and programming speed. For each chunk, the corresponding averaged Change Ratio and difference in maintainability rating are calculated. With these numbers, the following questions can be answered:

1. During a period of 5% volume growth, how many CLOC were recorded for every NLOC?

2. What is the change in maintainability (characteristic) rating of system X in reaction to a 5% volume growth?

Intuitively, we hypothesise: if the answer to the first question is relatively low (a low Change Ratio), the answer to the second question is also relatively low (a significant decrease in software quality). However, figure 7 suggests otherwise. It shows how the changes in maintainability ratings are distributed for both high Change Ratios (in green) and low Change Ratios (in red). There does not seem to be a big difference at first glance. Our hypothesis would have been confirmed only if the distribution for low Change Ratios was shifted to the left and/or the distribution for high Change Ratios was shifted to the right. Instead, both distributions are centred around zero, implying no predictive power of Change Ratio values towards maintainability ratings.

The distribution for high Change Ratios is somewhat wider, suggesting more volatility in the maintainability rating during periods where relatively more lines of code are changed for every added line of code. However, this observation is not significant enough to draw any conclusions from.

A more in-depth evaluation of the Pearson's product-moment correlation between the average Change Ratio value over periods of 5% system growth and the change in maintainability characteristic ratings over the same periods can be found in table 1. For none of the software qualities and properties has a significant correlation been found (p > 0.05). We can conclude that the average Change Ratio value over a period of 5% volume growth does not correlate significantly with the change in maintainability characteristic rating over the same period, according to our data. It is known in software engineering research that software systems do not lend themselves to easy comparison. Even while we controlled the most critical variables like system size, programming language and team size, the Change Ratio might be too fine-grained to model complex software engineering processes. Therefore, we repeat the last experiment considering only intra-system data. This means a correlation coefficient is calculated between maintainability differences and Change Ratios for each project separately. This choice greatly shrinks the size of our data set; we therefore take chunks of 1% volume growth instead of 5%.

Figure 7: Distributions of changes in maintainability rating for low and high Change Ratios (x-axis: difference in maintainability score; y-axis: frequency).

This results in more data points per project, but also more volatility of the Change Ratio. On top of that, only projects with more than ten data points are considered, to assure significance of potential results. Of the 135 pre-selected systems, only eighteen fulfilled this new constraint, and of these eighteen projects, twelve showed no significant correlation. The results of the other six projects are summarised in table 2. Surprisingly, three of them showed a significant positive correlation while the other three showed a significant negative correlation. Upon manual investigation, no obvious reason for this could be found; different functionalities, programming languages and system sizes are represented in both groups. Due to the confidential character of our data set, we cannot provide any further information that could lead to the identification of the systems under investigation. We conclude that Change Ratio values and maintainability ratings do relate in certain software systems, both positively and negatively.

Quality                  Correlation with Change Ratio
Volume                   -0.232
Duplication              -0.019
Unit Size                -0.019
Unit Complexity          -0.051
Unit Interfacing         -0.077
Module Coupling          -0.153
Component Balance         0.159
Component Independence    0.130
Analysability            -0.078
Modifiability             0.220
Testability              -0.172
Modularity               -0.029
Reusability               0.096
Maintainability          -0.168

Table 1: Pearson's product-moment correlation of Change Ratio and quality characteristics.

Quality          System 1  System 2  System 3  System 4  System 5  System 6
Maintainability    -0.453    -0.788     0.514     0.841    -0.951     0.693

Table 2: Pearson's product-moment correlation of intra-system Change Ratio and maintainability rating differences.

3.4 Sequence Patterns

While previous sections mainly studied the effect of averaged Change Ratio values, the focus of this section lies in finding abnormal changes in maintainability ratings after certain sequences of Change Ratio values occur.

The following steps are taken to represent sequences in our Change Ratio data. Discrete values are used rather than continuous ones in order to generate more comparable sequences and to flatten out data disruptions.

1. Map every Change Ratio data point to its corresponding quartile. All data points get assigned an integer value ranging from 1 to 4, which can be interpreted as 1 = very low Change Ratio, 2 = low Change Ratio, 3 = high Change Ratio and 4 = very high Change Ratio. Quartiles are recalculated for every software system separately.

2. Combine every mapped data point with its two chronological predecessors (also mapped) into a sequence. This implies a set of 4³ = 64 possible sequences. We have chosen sequences of length 3 for two reasons. Firstly, such a sequence represents a period of time (3 weeks) in which a significant change in software quality is possible and also expected; two-week data would probably be more volatile. Secondly, choosing sequences of 4 or more Change Ratios would result in 256 or more very similar sequences.

3. For every sequence of three data points, find the average of the change in its three corresponding maintainability ratings. This means that if a certain project successively scores 4.0, 4.2, 4.3 and 4.6 points for its software maintainability as measured by SIG, a quality change of (0.2 + 0.1 + 0.3)/3 = 0.2 is recorded for the corresponding sequence of the three last mapped Change Ratio data points.²

4. These maintainability rating changes are mapped to their corresponding tertile. All points get assigned an integer value ranging from 1 to 3, which can be interpreted as 1 = descending maintainability, 2 = unchanged maintainability and 3 = ascending maintainability. Tertiles are calculated for every software system separately to exclude potentially confounding variables like system size and team size. This also ensures three similarly sized sets of data.

The steps above ultimately lead to the data gathered in table 7 (appendix A). A couple of statements can be made on closer observation:

• Only one of the twelve sequences with the highest ratio of descending software maintainability contains more than one high or very high Change Ratio.
• Only one of the fourteen sequences with the lowest ratio of descending software maintainability rating contains more than one low or very low Change Ratio.

² Here again, it is important not to choose a sequence much longer than 3 maintainability ratings, as the changes could compensate each other.


• The sequence {1, 1, 1} has the second highest ratio of descending software maintainability rating.

• The sequence {4, 4, 4} has the lowest ratio of descending software maintainability rating.

• Of all 16 sequences that end with a very low Change Ratio, 12 are more likely to result in a descending software maintainability rating than an ascending software maintainability rating.

• All sequences that contain more than one very low Change Ratio have a higher than average ratio of descending software quality.

• Three out of four sequences that end with two subsequent very low Change Ratios have more than a 50% chance of a descending software maintainability rating, whereas sequences that end with two very high Change Ratios have on average a 31.7% chance of a descending software maintainability rating.

These observations suggest the correctness of our hypothesis: sequences of low Change Ratio values lead to descending maintainability more often than sequences of high Change Ratio values.

Even though the above observations are convincing, a more reliable result is reached via a statistical approach: ranking the sequences in reverse lexicographical order³ and measuring the correlation of these rankings with the chance of a descending software maintainability rating. By using a reverse lexicographical ordering, we intentionally give more weight to more recent Change Ratio values. Indeed, this approach shows a Spearman's rank correlation of −0.71 with a p-value of 1.122 × 10⁻¹⁰. The correlation coefficient is negative, meaning that sequences with relatively low Change Ratios tend to have a high chance of a descending maintainability rating.

This, however, is not the whole story. The fact that low Change Ratios correlate with a descending maintainability rating does not mean that high Change Ratio values correlate with ascending maintainability: only a −0.07 Spearman's rank correlation is found here, with a p-value as large as 0.58. Furthermore, the Change Ratio sequences do show a Spearman's rank correlation of 0.62 with the chance of an unchanged maintainability rating. These and more Spearman's rank correlations can be found in table 3.⁴

Some conclusions that can be drawn from table 3 are:

1. There is a significantly smaller chance of a decreasing maintainability rating for sequences of higher Change Ratio values. This chance ranges from 26.0% for sequence {4, 4, 4} to 54.7% for sequence {3, 1, 1}.

³ As in table 7 (appendix A).


2. An increasing software maintainability rating is not predictable using Change Ratio sequences.

Quality                   Down    Unchanged     Up
Volume                   -0.734     0.707     0.611
Duplication              -0.725     0.685    -0.069
Unit Size                -0.311     0.619    -0.466
Unit Complexity          -0.183     0.605    -0.463
Unit Interfacing          0.022     0.635    -0.602
Module Coupling          -0.137     0.624    -0.433
Component Balance        -0.382     0.585    -0.540
Component Independence   -0.099     0.465    -0.398
Analysability            -0.679     0.617     0.246
Modifiability            -0.682     0.642     0.021
Testability              -0.513     0.640    -0.281
Modularity               -0.150     0.545    -0.473
Reusability              -0.01      0.537    -0.497
Maintainability          -0.708     0.645    -0.070

Table 3: Change Ratio correlation with maintainability characteristics, per direction of the difference (Down, Unchanged, Up).



3.4.1 An alternative classification approach

As mentioned before, using a ratio as metric has an important consequence: the absolute churn numbers of CLOC and NLOC are lost. This is both good and bad. Being independent of system size, team size and programming languages is good, but not being able to differentiate between {NLOC = 1, CLOC = 1} and {NLOC = 100, CLOC = 100}, both occurring in the same system, can be a loss of crucial information. Together with the fact that we have not yet found a symmetric relationship between Change Ratio sequences and the software maintainability rating (i.e. only decreasing and unchanging maintainability correlates with Change Ratio sequences; increasing maintainability ratings do not), this leads us to investigate how exactly the absolute size of CLOC and NLOC influences the maintainability rating. We have therefore substituted Change Ratio for new lines of code (table 5) and changed lines of code (table 4), which show the following:

1. Both NLOC and CLOC are much better predictors of an increasing software maintainability rating than Change Ratio.

2. Change Ratio predicts significantly better than CLOC for a decreasing duplication rating, analysability, modifiability, testability and maintainability, and for an increasing volume rating and component balance.

3. Change Ratio predicts significantly better than NLOC for a decreasing maintainability rating, an unchanged volume rating and increasing analysability.

4. The only maintainability characteristic that is better predicted by Change Ratio than by CLOC and NLOC together is a decreasing maintainability rating.

Figure 8 shows an alternative classification of Change Ratios to the one we used in the beginning of this section: one that combines the value of Change Ratio (high or low) with the absolute size of churn relative to system size (big or small). Again, we have chosen to use four classes, where 1 = low Change Ratio/big churn, 2 = high Change Ratio/big churn, 3 = low Change Ratio/small churn and 4 = high Change Ratio/small churn. We regard the amount of churn as big when NLOC + CLOC > x, where x is the median of all calculated NLOC + CLOC values, and as small otherwise. Note that we had to combine classes 1 and 2 from our first classification to represent a low Change Ratio, and classes 3 and 4 for a high Change Ratio. More than four classes would imply a minimum of 256 possible sequences to look at, which would probably diffuse our data too much to draw reliable conclusions. A sketch of this classification is given below.
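A minimal sketch of this two-dimensional classification, assuming Change Ratio is computed as CLOC/(NLOC + CLOC) and assuming 0.5 as the low/high boundary (the text fixes only the median churn split; the other details are illustrative):

```python
from statistics import median

def classify(snapshots, cr_threshold=0.5):
    """Assign each snapshot interval one of the four churn classes:
    1 = low Change Ratio / big churn,   2 = high Change Ratio / big churn,
    3 = low Change Ratio / small churn, 4 = high Change Ratio / small churn.

    snapshots:    list of (nloc, cloc) tuples, one per measurement interval.
    cr_threshold: assumed boundary between low and high Change Ratio; the
                  original class boundaries are not given in the text.
    """
    churn_median = median(n + c for n, c in snapshots)  # the x in the text
    classes = []
    for nloc, cloc in snapshots:
        # Change Ratio assumed to be CLOC / (NLOC + CLOC), as in figure 8.
        cr = cloc / (nloc + cloc) if (nloc + cloc) > 0 else 0.0
        big_churn = (nloc + cloc) > churn_median
        if big_churn:
            classes.append(1 if cr < cr_threshold else 2)
        else:
            classes.append(3 if cr < cr_threshold else 4)
    return classes
```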

With only the classification approach of Change Ratios changed, we performed the same experiment as before. The results are shown in table 8 on page 46.

Interpreting the results in table 8 cannot be done in the same fashion as with table 7, for one simple reason: there is no single correct ranking of sequences. So, let us look at classes 1–4 separately to find out what they mean and what influence on the software quality we expect them to have:



                          Difference Direction
Quality                  Down     Unchanged     Up
Volume                  -0.736     0.616      0.223
Duplication              0.287    -0.763      0.710
Unit Size                0.347    -0.632      0.510
Unit Complexity          0.328    -0.579      0.473
Unit Interfacing         0.235    -0.599      0.507
Module Coupling          0.494    -0.604      0.347
Component Balance        0.239    -0.491      0.336
Component Independence   0.071    -0.549      0.530
Analysability            0.392    -0.618      0.382
Modifiability            0.285    -0.539      0.337
Testability              0.300    -0.659      0.636
Modularity               0.219    -0.561      0.448
Reusability              0.145    -0.416      0.402
Maintainability          0.315    -0.725      0.520

Table 4.: Changed lines correlation with maintainability characteristics

                          Difference Direction
Quality                  Down     Unchanged     Up
Volume                   0.723    -0.582     -0.607
Duplication              0.695    -0.753      0.535
Unit Size                0.451    -0.658      0.571
Unit Complexity          0.291    -0.686      0.625
Unit Interfacing         0.061    -0.671      0.633
Module Coupling          0.467    -0.668      0.467
Component Balance        0.482    -0.604      0.504
Component Independence   0.146    -0.576      0.478
Analysability            0.671    -0.662      0.088
Modifiability            0.641    -0.686      0.177
Testability              0.549    -0.711      0.596
Modularity               0.326    -0.717      0.634
Reusability              0.123    -0.519      0.506
Maintainability          0.639    -0.713      0.471

Table 5.: New lines correlation with maintainability characteristics

[Figure 8.: Two dimensions of churn: Change Ratio and size. Scatter plot of the percentage of the system added or changed against Change Ratio, with each observation classified as high/low churn and high/low Change Ratio.]

Class 1: low Change Ratio / big churn
Class 1 implies a lot of new code added to the system, without touching existing code much. This class is typical when the maintenance team is developing new functionality. Adding new code to a system without changing existing code inevitably increases the volume and complexity of the system, which both have a negative influence on software quality. Therefore, we expect to see a period of decreasing software quality in a system when class 1 is the dominant class.

Class 2: high Change Ratio / big churn
Class 2 indicates lots of changes to a system, without many new lines of code being added. This would be a typical situation when the maintenance team is refactoring the system. As refactoring activities are meant to increase software quality, we expect to see a period of increasing software quality in a system when class 2 is the dominant class.

Class 3: low Change Ratio / small churn
Class 3 occurs when just a few lines of code are added to the system. As this cannot have a large impact on the system, we do not expect large changes to the software quality in a period in which class 3 is dominant.

Class 4: high Change Ratio / small churn
Similar to class 3, we do not expect large changes in software quality when just a few lines of code are changed.

Now, with these expected behaviours of software quality in mind, single expectation values can be extrapolated to whole sequences. An example: for sequence {1, 1, 1}, a maintainability change of (−1) + (−1) + (−1) = −3 is expected; for sequence {1, 2, 3}, a maintainability change of (−1) + (+1) + (0) = 0 is expected. Note that these expected values are dimensionless quantities. Following this approach, each sequence of three Change Ratio values is assigned its expected value of change in maintainability rating, computed as in the sketch below.
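The extrapolation itself is a simple lookup and sum; a sketch under the per-class expectations stated above (class 1 → −1, class 2 → +1, classes 3 and 4 → 0):

```python
# Expected contribution of each churn class to the maintainability rating,
# taken directly from the class descriptions above.
EXPECTED = {1: -1, 2: +1, 3: 0, 4: 0}

def expected_change(sequence):
    """Sum the per-class expectations over a sequence of churn classes."""
    return sum(EXPECTED[c] for c in sequence)

assert expected_change((1, 1, 1)) == -3
assert expected_change((1, 2, 3)) == 0
```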



Dividing these sequences into seven groups of possible expected values (ranging from −3 to +3) and averaging the results leads to the graph in figure 9. The red and green dashed lines represent linear fits to the measured data, of downwards and upwards changes in maintainability ratings respectively.

[Figure 9.: Expected and measured units of change in maintainability rating. Percentage of observations (green: up, red: down) against the expected change in maintainability rating, from −3 to +3. Up coefficient = −0.0008, down coefficient = −0.025.]

The fact that the coefficient for downwards maintainability ratings is more than 30 times as big as the coefficient for upwards maintainability ratings is interesting: indeed, it is 15% more likely to have a descending maintainability rating after a Change Ratio sequence with an expected value of −3 than after a sequence with an expected value of +3. The model p = 0.39156667 + expected value × (−0.02439048) predicts the probability of a descending maintainability rating with an accuracy of 84.8% on our data set. Ascending maintainability ratings are distributed similarly across the seven groups of expected values.
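For reference, the fitted model can be written as a one-line function (coefficients copied from the text); its endpoints reproduce the 15% spread mentioned above:

```python
def p_descending(expected_value):
    """Fitted linear model (coefficients from the text) for the probability
    of a descending maintainability rating, given a sequence's expected
    value in [-3, +3]."""
    return 0.39156667 + expected_value * (-0.02439048)

# The endpoints span roughly 15 percentage points:
# p_descending(-3) ~= 0.465, p_descending(+3) ~= 0.318.
```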

Concluding, this alternative classification approach led to more reliable results, despite the fact that the data got very generalised during the classification steps. Still, assigning more than 4 classes would likely lead to even better results; we therefore encourage future researchers to do so.


4

CHANGED FILE RATIO

During this study on Change Ratio and maintainability, we came to believe that using lines of code might be a too low-level way of expressing the complex processes of software development. Therefore, this chapter suggests a more high-level approach while still addressing the question: how does churn affect a software system's maintainability?

The main difference is that we will look at the level of files (e.g. classes in the case of object-oriented programming) instead of lines of code. That is, CLOC will now represent the number of affected (new and changed) lines of code in already existing files since the last recording. Similarly, NLOC now represents the number of new lines of code written in new files since the last recording. For this reason, this metric will be named Changed File Ratio. A sketch of this file-level bookkeeping is given below.
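The sketch below assumes per-file LOC counts per snapshot and assumes the ratio is normalised like Change Ratio (CLOC/(CLOC + NLOC)); since per-line diffs are not available here, churn in surviving files is approximated by the absolute LOC difference:

```python
def changed_file_ratio(prev_files, curr_files):
    """Changed File Ratio between two snapshots.

    prev_files, curr_files: dicts mapping file name -> lines of code.
    CLOC counts churn in files that already existed; since per-line diffs
    are unavailable here, it is approximated by the absolute LOC delta.
    NLOC counts the lines of files that are new in the current snapshot.
    The normalised ratio is assumed to mirror Change Ratio.
    """
    common = curr_files.keys() & prev_files.keys()
    cloc = sum(abs(curr_files[f] - prev_files[f]) for f in common)
    nloc = sum(loc for f, loc in curr_files.items() if f not in prev_files)
    total = cloc + nloc
    return cloc / total if total else 0.0
```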

4.1 data distortions

Our data source, the SAW, does not explicitly record which code files were added to or removed from the system since the last known snapshot. We can, however, simply deduce this information by comparing the sets of file names of two consecutive snapshots: file names present in the first set but not in the second were removed, while file names absent in the first set but present in the second were added to the system.
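In code, this deduction amounts to two set differences (a trivial sketch):

```python
def added_and_removed(prev_names, curr_names):
    """Deduce file additions and removals between consecutive snapshots
    from their sets of file names alone."""
    prev_names, curr_names = set(prev_names), set(curr_names)
    removed = prev_names - curr_names   # present before, absent now
    added = curr_names - prev_names     # absent before, present now
    return added, removed
```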

The great drawback of this approach is that the renaming of files and folders causes disruptions in the data. For example, when a folder is renamed between two snapshots, it appears as if all files in the folder were removed and a new folder was added. The severity of this renaming problem can be seen in figure 10, where the black lines represent the existence of a certain file in the system: it looks as if the total system was deleted and a similarly sized system was added between two snapshots. To overcome this problem, we propose a heuristic to reconstruct the life cycle of renamed components.


[Figure 10.: Life cycle of files in a system. Black lines mark the existence of each file (file number, 0–5000) over time (2011–2013).]

Different Version Control Systems solve the problem of tracking renamed files over successive versions of a software system in different ways. A renaming must be stated explicitly in Subversion (SVN), while Git keeps track of the content rather than the name of a file [Loeliger, 2009]. Another approach is taken by Mayrand et al. [1996], where the contents of two source code files are compared by properties of their Abstract Syntax Trees.

Our data source, the SAW, does not include the original source code or any commit comments or annotations, so none of these approaches is suitable for the data we have. However, the SAW contains more than 25 calculated properties for every file in the system for every recorded date. Examples include lines of code (LOC), the number of lines of the biggest clone, lines of comments and programming language. The backtracking algorithm 1 builds upon the assumption that certain selected properties together can distinguish a file even after a renaming.

Note that large, complex files will be easier to backtrack than small, simple files¹, because their properties are more distinctive. This is illustrated by figures 11 and 12, showing the distributions of, respectively, the lines of code and the size of the biggest code clone per file for one of the selected systems of our data set. It can be seen that the majority of files contain fewer than 50 lines of code and no code clones at all.

1 For example, a Java class containing only a constructor and functions to get and set its property values.



Algorithm 1: Reconstruct Component Life Cycle after Renaming

procedure Reconstruct(S)
    S ← set of all snapshots of one project
    for i = firstDate → lastDate − 1 do
        E0 ← file names at date i
        E1 ← file names at date i + 1
        G ← E0 − (E0 ∩ E1)        ▷ names that disappeared
        N ← E1 − (E0 ∩ E1)        ▷ names that appeared
        for fg ∈ G do
            for fn ∈ N do
                m ← match(fg, fn)
                if m > threshold then
                    map(fg, fn)
                    remove(fn, N)
                end if
            end for
        end for
    end for
end procedure

[Figure 11.: Distribution of the average size of files in a system. Histogram of average LOC against frequency.]

[Figure 12.: Distribution of the average biggest clone of files in a system. Histogram of average clone size against frequency.]

The backtracking heuristic is tested on a random selection of 10 systems. Test results are summarised in table 6. The following matching parameters are used (a sketch of the combined match predicate follows the list):

• PROGRAMMING_LANGUAGE(fileA) = PROGRAMMING_LANGUAGE(fileB)

• |LOC(fileA) − LOC(fileB)| < LOC(fileA) ∗ 0.05

• |LINES_OF_COMMENTS(fileA) − LINES_OF_COMMENTS(fileB)| < LINES_OF_COMMENTS(fileA) ∗ 0.05

• |BIGGEST_CLONE(fileA) − BIGGEST_CLONE(fileB)| < BIGGEST_CLONE(fileA) ∗ 0.05

• |FAN_OUT(fileA) − FAN_OUT(fileB)| < FAN_OUT(fileA) + 2
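Put together, the match predicate of algorithm 1 can be sketched as below (representing a file as a dict of SAW properties is an assumption; the bullets above are read as criteria that must all hold):

```python
def match(file_a, file_b):
    """Return True when file_b's properties are close enough to file_a's
    for file_b to be a plausible rename of file_a (tolerances as above)."""
    def within(key, tolerance):
        return abs(file_a[key] - file_b[key]) < tolerance

    return (file_a["language"] == file_b["language"]
            and within("loc", file_a["loc"] * 0.05)
            and within("comment_lines", file_a["comment_lines"] * 0.05)
            and within("biggest_clone", file_a["biggest_clone"] * 0.05)
            and within("fan_out", file_a["fan_out"] + 2))
```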

System   Potentially renamed files   Single matches   Multiple matches
  1                 78                     16                 38
  2                279                     26                173
  3                951                     31                646
  4                104                      8                 76
  5                 28                      5                  9
  6                684                     47                387
  7                154                     13                 57
  8                 49                      8                 10
  9                351                     24                 98
 10                586                     34                353

Table 6.: Test results of the proposed backtracking algorithm for 10 randomly selected systems

Table 6 summarises the number of matches for each potentially renamed file. Potentially renamed files are all files whose name has been removed from one snapshot to the next. We tried to match certain properties of these files to the properties of seemingly new files in the next snapshot. Manual inspection showed that 72% of the single matches were indeed accurate. Unfortunately, the majority of the potentially renamed files could be matched against multiple files (i.e. two or more matches exist), suggesting our matching threshold is set too low. However, running the same test with a stricter threshold, where every property must be exactly the same, led to almost no single matches at all.

Discouraged by these test results, we have chosen not to use the heuristic to recover the distorted data. Instead, we leave the idea of the Changed File Ratio to future research where more suitable data is available. We encourage future researchers to repeat or improve our study on Change Ratio in chapter 3 using the proposed Changed File Ratio.
