Mining Productivity-Influencing Factors

Explore and Visualise Software Repositories

Simon Schneider

simon.schneider@student.uva.nl

2018-08-30, 57 pages

University supervisor: dr. Magiel Bruntink, m.bruntink@sig.eu

Host supervisor: dr. Marina Stojanovski, m.stojanovski@sig.eu

Host organisation: Software Improvement Group, https://www.sig.eu

Student ID: 11679719

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering

Contents

Abstract
1 Introduction
  1.1 Research questions
  1.2 Solution outline
  1.3 Limitations
2 Background
  2.1 Version control systems
    2.1.1 Mining version control systems
  2.2 Related work
  2.3 Productivity-influencing factors in literature
    2.3.1 Staff turnover
    2.3.2 Software complexity
    2.3.3 Team cohesion & communication
    2.3.4 Team capability & experience
3 Mapping productivity-influencing factors
  3.1 Mapping of staff turnover
  3.2 Mapping of work fragmentation
    3.2.1 Mapping of temporal coupling
    3.2.2 Mapping of work fragmentation between and inside change-sets
  3.3 Mapping summary
4 Empirical study
  4.1 Dataset
  4.2 Prototype development
  4.3 Data selection
  4.4 Data cleaning
  4.5 Data transformation
    4.5.1 Enriching commit data
    4.5.2 Change-set approximation
  4.6 Data mining
5 Results
  5.1 Staff turnover measurements
    5.1.1 Relation between time tracking and author activity
  5.2 Work fragmentation measurements
    5.2.1 Temporal coupling measurements
    5.2.2 Work fragmentation between and inside change-sets
  5.3 Results summary
6 Discussion
  6.1 Insights from staff turnover measurements
  6.3 Insights summary
  6.4 Research questions and answers
  6.5 Relevance to practice
  6.6 Threats to validity
  6.7 Lessons learned
7 Conclusion
  7.1 Future work
8 Acknowledgements
Appendices
  A Hourly commit activity in project C2
  B Monthly hour reports for project C2
  C Commit message keyword table


Abstract

Continuously improving productivity is necessary to stay competitive in fast-moving markets such as software development. As companies grow, it becomes harder to understand the causes of efficiency barriers. Multitudes of empirical studies were conducted in the past to find out which factors influence the productivity of developers. We propose the use of version control mining to elicit these productivity factors automatically. Our goal is to increase the actionability of existing productivity metrics and reveal novel insights into the software development process. In this study, we deliver a repeatable approach that connects productivity-influencing factors to code history. We show the application of this process with two measurable and impactful productivity-influencing factors: staff turnover and work fragmentation. To further understand work fragmentation we also analyse the temporal coupling between modules. After mapping these phenomena to a numerical representation, we design and implement a data mining prototype. This tool allows us to extract and visualise measurements from a multitude of version control system logs in an empirical study.

The analysis of six proprietary projects shows compelling results. We were able to detect different development phases and causes of work fragmentation, and to deliver refactoring advice on a conceptual level. Furthermore, a strong positive correlation between tracked working hours and mined code repository activity was found. Correlations between static software metrics, such as lines of code, and the measured factors are however weak. The results presented in this study refer to the analysed projects and may not generalise to other environments. We conclude that measuring productivity-influencing factors is feasible and advisable to control the software development process on an empirically informed basis.


Chapter 1

Introduction

The need for a steady increase in productivity to stay competitive is a challenge that every company faces, especially in fast-moving markets such as software development. A demand for a shorter time-to-market while maintaining product quality forces vendors to constantly find new opportunities to optimise productivity [Car06, AH96]. Measuring the efficiency of the development process is seen as an essential basis for raising productivity and therefore also competitiveness. This measurement of productivity is not trivial since there seems to be no general definition of software development productivity and the perception of productivity differs across domains [Car06]. Economists define productivity as “the effectiveness of productive effort, especially in industry, as measured in terms of the rate of output per unit of input” [Oxf18]. An important, and still open, issue is the definition of unit of work in the context of software engineering [GKS08]. A widely used and well-understood metric to measure productivity is to use lines of code as the output of the development process and person hours as input [GE06]. This approach was criticised for not only penalising high-level languages [Jon94] but also desirable abstraction in general [Car06]. Since metrics also guide future development as an indicator of performance [GE06], this metric could have an adverse effect on actual productivity. Metrics that deteriorate as real economic productivity increases should naturally be avoided.

For this reason, it is not advisable to use size estimations to compare productivity across projects, even if they use the same language. We can, however, still use the lines of code metric to analyse the evolution of a specific project [HM00, KSMH06]: one could, for example, detect software changes that decrease maintainability and are followed by a period of high daily code churn and a higher bug frequency, and argue that stability and productivity were affected negatively [HKV07].

Instead of directly measuring productivity via the rate of output and input, we can also take a look at what influences productivity in software development. Multitudes of studies used empirical approaches over the last decades to discover what makes some projects more efficient than others. It became clear that certain factors influence productivity positively while others impede development activities. Meta-reviews carried out by both Trendowicz and Münch [TM09] and Wagner and Ruhe [WR18] analysed this landscape of productivity-influencing factors to create a basis for new productivity models. This aggregation of empirical studies results in a list of explicitly named factors that affect every software developing company. The authors established a taxonomy which helps us to understand more about the composition of productivity-influencing factors and their relation to each other. Team capability, for example, is seen as a parent of numerous sub-factors such as training level, teamwork capabilities or application familiarity.

This approach of splitting productivity into factors and sub-factors reduces the complexity of creating a single holistic measure of productivity by breaking it down into subproblems. By measuring a recurring phenomenon that has an impact on productivity we can understand the root cause of productivity drops.

Previous research often focused on the manual elicitation of productivity-influencing factors. Researchers analyse the behaviour of developers throughout the work day on-site or through software that monitors the various activities of developers. The problem with this approach is that projects cannot be analysed in retrospect if the software did not collect activity data. Systems like version control and associated issue tracking software are already used in many software projects and allow new forms of uncovering compelling and actionable information about software systems [Has08]. Instead of measuring a system only once to find out more about the current state, this evolutionary data enables us to study historical aspects such as the productivity of a team over time.

We want to address this problem of finding the root causes of productivity bottlenecks in an automated manner. Practitioners state that existing volume metrics do not give sufficient insights into the development process itself [Car06]. They encounter the limited actionability of single-valued metrics like lines of code per hour [FB14, p. 448]. Actionable metrics would allow a software manager to make an empirically informed decision based on the software product's status [MSW12]. Code churn may tell us when the project shrinks or increases in size but does not explain why we see sudden or steady changes. Actionability is however only one of many criteria, and un-actionable metrics are not necessarily useless, as we can still use them as the starting point for a detailed investigation. Instead of only utilising volume measurements, a stakeholder could also gain an understanding of the factors that influence the productivity of his project. The issue here is that the manual elicitation is a time-consuming process and there are no set rules on how to gather and compare them.

This is why we propose the measurement of productivity-influencing factors through data stored in version control. Metrics that guide corrective action are highly relevant for practitioners and consultants as they constantly search for new ways to improve productivity. Our goal is to increase the actionability of productivity measurements for practitioners by using multiple proxies. These proxies highlight diverse aspects and create new possibilities to find concrete efficiency bottlenecks. To limit the need for available data sources and make the underlying approach applicable across diverse project structures and companies, we restrict the required data sources to the version control systems alone. This paper focusses on a small set of exemplary factors to show the application of this approach. A prototype that automatically executes the extraction of the selected factors from version control systems will be designed and implemented. We want to show novel insights that we retrieve from a collection of real industry projects through this prototype.

The study is hosted by the Software Improvement Group (SIG), which is a management consultancy firm that focuses on software-related challenges. This environment allows us to work on commercial and proprietary projects by SIG and its clients using methods that build on top of existing static code analysis tools.

1.1 Research questions

The impact of productivity-influencing factors has been shown repeatedly [WR18]. We must assume that these factors are applicable and relevant for all software projects. Hence, we need to identify questions that test our proposed approach rather than reproducing the impact of productivity factors.

RQ1 Accuracy. How accurately can we measure productivity-influencing factors using version control attributes only?

It is naturally interesting to learn how accurately our indirect measurements through version control systems represent the real development process. We can only test the accuracy in the context of our available projects. A degree of accuracy is needed that allows us to take the correct actions based on the results. We expect to see a strong positive correlation between tracked working hours and author activity, since the effort of developers should be represented in the history of a codebase.

RQ2 Relation to traditional measurements. To what extent can we observe a relation between static code analysis metrics, such as lines of code written, and the selected productivity-influencing factors?

The lines of code metric and other static code measurements are already used to judge effort and code quality. We assume that factors that impede productivity correlate negatively with the lines of code written and other metrics that already judge overall efficiency. Finding relations between traditional code quality measurements and productivity-influencing factors would allow new insights into the available projects. These relationships would tell us more about the influence of code quality on productivity.


1.2 Solution outline

Answering the research questions above leaves us with a set of sub-problems that we have to solve:

• Selecting factors: To select productivity-influencing factors that we further analyse in this study, we have to examine existing literature. We have to filter out factors that are hard to measure and not impactful enough to show a perceivable effect on productivity.

• Mapping: To measure productivity-influencing factors we have to relate them, or more precisely map them, to a numeric range. We need to set up fixed rules to make the measurement of factors through version control attributes repeatable.

• Empirical study: We need to apply the measurements to a set of real projects to explore visualisations and the characteristics of our results. The mined results will be used to answer the research questions in the scope of the available dataset.

Figure 1.1: High-level depiction of research approach

The mapping requires a careful selection of measurable factors. Accordingly, the data mining prototype that is used for the empirical study requires a set of rules to operate on. For these reasons, we identified the approach depicted in figure 1.1. The high-level depiction was chosen deliberately; details will be described in the subsequent chapters, which follow the structure of this process.

1.3 Limitations

The described approach of measuring productivity-influencing factors from a version control mining perspective has several limitations that can be addressed without examining data.

Cannot measure economic success: Our approach is limited to data that may include information on the development process and the resulting product. When we are measuring the efficiency of developers we are not necessarily measuring the success of a business. By just looking at the history of a project, we cannot determine which value a work unit has to customers. It cannot be assumed that efficient development structures automatically result in a successful product that meets the expectations of a client.

Limited measurability: As mentioned, it is not possible to measure every influencing factor that was described in literature, and we have to filter them. Furthermore, every measurement of a chosen factor is an estimation of a real phenomenon. Developers may cluster many changes into one commit, multiple developers could share the same author name, or developers may use different pseudonyms. We describe methods to counteract varying commit strategies but have to accept that the activity during some time periods cannot be reconstructed from version control.

Limited comparability: At this stage, the described approach can also not be used to automatically compare the overall productivity between different projects, teams or individual developers. The impact of each factor has to be judged in its context, and an extensive data acquisition of many factors would be necessary to get a holistic view of productivity. This paper shows the mapping and measurement of exemplary factors that are not necessarily the root cause of productivity bottlenecks in a specific project.

Focus on proprietary projects: We are using commercial projects to show the application of our approach. The described solution is generalisable since productivity-influencing factors were studied in a wide range of projects, including open source environments. The statements based on our results and the chosen parameters are however specific to the analysed projects. These parameters include threshold values and other variables that have to be introduced to make the measurements work across different environments. Some properties of an organisation differ strongly between open source and proprietary projects. A large team size, for example, is not desirable in commercial companies [RGBC+14, CB97] but does not cause problems in large open source environments with different organisational structures [RGBC+14].


Chapter 2

Background

2.1 Version control systems

Version control systems (VCS) were developed to coordinate concurrent code changes by multiple developers and to satisfy the need to revert changes when functionality breaks. VCSs make the development process simpler by allowing new workflows, like code merges, feature branches and continuous delivery [Ott09].

Figure 2.1: Centralized and distributed version control systems [Ern14]

There are two different models that manage concurrent access: centralised models and distributed models (see figure 2.1). Both models are represented in a great variety of version control implementations in one form or the other. Central version control systems have only one main repository that stores all previous revisions and the current revision of the code base. A user checks out files, performs the necessary changes and commits them back into the repository together with a commit message that describes the adjustments made. Distributed version control systems are a newer approach that shares many principles with centralised systems. A distributed system does, however, not require a central location and stores the complete repository on every local computer [Ott09, DAS09, Ern14]. In the past years, we saw a shift from centralised systems, such as SVN, CVS or Perforce, to modern distributed systems like Git or Mercurial. Benefits of distributed systems that most likely caused this shift are: no central point of failure, no need for constant network connectivity, support for experimental changes, and automatic merges, which simplified conflicts between two changes at similar locations [DAS09].

Git became the dominant version control system in open source environments, whereas SVN remains highly relevant in commercial projects [Bla18]. When we create universal tools for analysing version control systems it is, consequently, necessary to find concepts and metadata that all relevant systems have in common. Both models allow us to investigate file changes and associated metadata such as developer, commit message and timestamp [Ott09]. For this paper, we define commits as the most granular set of changes that a developer records together with a textual description of his work. Other version control systems may call this atomic operation check-in, submit or record and use different technical implementations, but in the context of mining, these terms all represent the same concept.

2.1.1 Mining version control systems

Mining software repositories to predict bugs, find refactoring candidates and understand the evolution of a software repository are established practices [DRKK+15, Tor15, BOW11]. Researchers and practitioners use different attributes of code repositories to form judgements about the quality of the software process.

Code churn is the sum of lines added, modified or deleted between two builds [ME98]. Previous studies argue that code churn can be used to analyse the code quality of developers. They state that developers who continuously rewrite their own code are less efficient and probably created unstable constructs in the first place [Tor15, BOW11]. In this study, we do not use code churn to argue about code quality but instead use it to judge the contribution during different time periods. Periods of high activity are marked by an increased level of code churn, whereas periods with little activity have lower levels of churn [ME98].

Change-sets are groups of commits over a certain time period by the same author. In version control mining this aggregation is necessary to normalise the commit strategy across multiple developers [KSMH06]. One developer may create a small functionality in one hour using 10 commits, whereas another developer may take the same time but group the changes in one commit. Both commit strategies would result in one change-set each if we collect every hour as one change-set. Another benefit of using change-sets instead of commits is the decreased number of data sets that have to be analysed [KSMH06]. It is faster to compare a reduced number of change-sets with each other than doing a full comparison over atomic commits.
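To make this aggregation concrete, the following minimal sketch groups commits into hourly change-sets per author with Pandas. The input format, the column names (Author, Date, File, Churn) and the file name are illustrative assumptions and not the exact format used later in this study.

import pandas as pd

def aggregate_change_sets(commits: pd.DataFrame, period: str = "H") -> pd.DataFrame:
    # Group all commits of the same author that fall into the same time period into one change-set
    commits = commits.copy()
    commits["Date"] = pd.to_datetime(commits["Date"])
    change_sets = (commits
                   .groupby(["Author", pd.Grouper(key="Date", freq=period)])
                   .agg(Churn=("Churn", "sum"),              # total churn of the change-set
                        Files=("File", lambda f: set(f))))   # union of touched files
    return change_sets.reset_index()

# Assumed input: one row per changed file with Author, Date, File and Churn columns
commits = pd.read_csv("commits.csv")
hourly_change_sets = aggregate_change_sets(commits, "H")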

Coupling is the degree of interdependence between software modules, whereas cohesion refers to the degree to which the elements inside a module belong together [CY79]. A majority of experts [HM95] agrees that designing systems following a pattern of low coupling and high cohesion leads to products that are both more reliable and maintainable. This concept of code coupling can be lifted from static analysis to a measurement of software evolution. Instead of analysing code constructs that formally couple code pieces together, we can analyse which units of code recurringly change together. We call this form of historical analysis temporal coupling or co-evolution, since some pieces of code seem to evolve together [Tor15]. Two files have a shared change history if they are changed by the same commit or change-set. Research that predicts the propagation of change through temporal coupling reached a success rate of up to 64% [Has08].

2.2 Related work

Related work is primarily made up of empirical research that analyses the typical work days of developers in realistic environments, and of studies that mine version control systems to gain insight into the evolution of a software system.

Vanya et al. describe an approach that extends established practices for statically analysing software system clusters into a historically-based approach. They use repository mining to extract changes over time and construct a network of cohesive groups on a file level. In an industry example, they showed that visualisations could be used to understand the system better and improve refactoring [VHK+08]. Robillard and Dagenais created a set of tools that had the aim of improving the efficiency of developers as they navigate through source code [RD10]. The authors wanted to recommend useful information to developers about past changes. They chose an approach that calculates change clusters that overlap with the developer's current elements of interest. Qualitative analysis showed that only 13% of the recommended clusters were useful to developers. The authors concluded that this success rate does not justify the effort developers have to invest in order to examine the recommendations. Our results are targeted towards a business perspective and project management. We can, however, reuse the heuristics and mining techniques Robillard and Dagenais described. The recommendations may not be useful on a granular level but can be valuable to understand the structure of the overall system. A similar approach of using co-evolution of system components to discover change hotspots in software repositories is described by Tornhill [Tor15]. Instead of using traditional clustering techniques, he uses circle packing to visualise areas that required an unusual amount of effort based on code churn. The practically relevant steps of mining a multitude of version control systems mentioned by the author can be extended in the context of productivity mining.

Gousios et al. suggest a model that tries to solve the issue of measuring units of output in the context of software engineering [GKS08]. They argue that traditional contribution metrics should be combined with data mined from software repositories. According to their proposed model, contribution can be measured by counting commits, churn, activity in forums, wiki contributions and other development activities that can have a positive or a negative impact on their effort metric. The approach of mapping effort to version control attributes can be used to connect influencing factors to similar code base characteristics. Instead of focussing on many data sources that are available for open source projects, we concentrate purely on mining software repositories, which are available for most commercial projects. We furthermore do not estimate or predict effort on an author level but want to learn more about the efficiency of the overall development process.

In his study on “The Road Ahead for Mining Software Repositories”, Hassan details the history of version control mining and its open possibilities [Has08]. The author describes a transition from static record-keeping version control systems towards active repositories. His description of open challenges inspired the approach of this paper but also demonstrates limitations that we cannot circumvent with version control mining alone. These limitations include gaps or noise in data and the claim that repositories can only be used to show correlation instead of causation. Version control systems were not designed for data mining, but hide information about software structure, temporal coupling through change propagation, team dynamics, and code reuse. In alignment with our research goals, the author also argues for the automation of empirical studies to verify common wisdom and increase the accessibility and usability of measurements.

The related work shows that mining software repositories is a highly relevant field for understanding and predicting the evolution of software. In contrast to most studies, we do not analyse the efforts of specific authors or search for granular refactoring candidates. We do, however, reuse approaches such as the contribution model, mining practices and visualisation techniques proposed by previous studies and adapt them in the context of productivity measurement.

2.3 Productivity-influencing factors in literature

The productivity of software production processes depends on numerous influencing factors. In a systematic literature survey of 126 publications, Trendowicz and Münch set out to identify common factors that affect productivity positively or negatively [TM09].

Wagner and Ruhe split productivity-influencing factors into soft and hard factors and took a historical perspective [WR18]. They conclude that soft factors are becoming more significant over the years, but hard factors are still preferred for reasons of technical measurability. Naturally, both literature reviews have a substantial overlap of factors and merely categorise them differently.

The meta-review of Trendowicz and Münch alone already identified 246 factors that are commonly analysed in the literature [TM09]. To build a model of productivity, it is necessary to select relevant factors to mine from version control.

The criteria used to decide whether to include or exclude factors for the analysis are as follows:

1. Impact: The effort of mapping and measuring a factor should be justified by the effect the factor has on productivity. We use the frequency of studies that consider an influencing factor as impactful and their description of relevancy for the overall evaluation.

2. Measurability: We cannot consider factors that do not leave evident traces in version control systems. It must be feasible to connect the factor to attributes of the version control system and vice versa. This restriction, unfortunately, excludes a selection of impactful soft factors but is necessary to reach the goal of a limited need for accessible data sources.

3. Appropriate side effects: A factor that is believed to have a positive impact on productivity is not included if negative side effects would make it an inadequate measure of productivity. For example, studies suggest that high code ownership reduces faults and increases productivity [BNM+11, TM09], but it is profoundly negative if employees become unavailable together with their key knowledge [RL02].

4. Independence: Some factors may overlap and represent a similar viewpoint on productivity. A factor that contains information which is already covered by another factor, and is therefore redundant, will not be included [Roc94]. Examples of redundant factors are “programming language experience”, “training level” and “tooling experience” [TM09], where all factors essentially estimate the experience of a developer in general or in a specific context.

5. Nonexploitability: It should not be possible for developers to influence the measured factor easily without increasing real economic productivity [MSW12]. For example, if we measure the activity and effort of developers by the number of commits in a certain period, developers can easily exploit this metric by committing inappropriately often.

Factor                          Impact   Measurability   Side effects   Independence   Nonexploitability
Staff turnover                  +        +               ◦              +              +
Software complexity             +        +               ◦              ◦              +
Team cohesion & communication   +        −−              ++             ◦              −
Team capability & experience    ++       −               ++             −              −

Table 2.1: Overview of productivity-influencing factors criteria (−− very negative, − negative, ◦ neutral, + positive, ++ very positive)

The following sections examine the criteria stated above for the productivity-influencing factors most mentioned in the literature. Table 2.1 gives an overview of qualitative ratings for each criterion based on the rationale given below.

2.3.1 Staff turnover

Staff turnover is described as a team-related project characteristic that has a high impact on productivity [TM09]. This factor describes how members leave or enter a development team over the course of a project. One of the main issues of high staff turnover is “a loss of key knowledge” [RL02] and an increase of the “team exhaustion rate” [CHR+98].

Since the average team size can be static while developers are constantly exchanged, it is not enough to just measure the number of developers. Newly entering developers are often not experienced in the domain or the structure of a project and typically spend time on an onboarding process before they can add value to a software product [CHR+98]. Mining the team size from a software repository has been done before, and the developed approaches can be used to measure the change in a team between two time intervals [MML15, JSS11].

To exploit this metric and pretend that the composition of a team is stable, developers would need to switch between version control authors and distribute their changes over inactive authors, or multiple users would need to commit under the same name. These practices would not only be impractical but also detectable, since the volume measurements would drop for the first approach and the team size would shrink for the latter one.

2.3.2 Software complexity

The complexity of software refers to the degree to which a program is difficult to understand by human developers [MD08, p. 308] and is considered to influence productivity by many authors [TM09].

Figure 2.2: Subfactors of software complexity that influence productivity

There has been some disagreement concerning the effect of system structure on productivity. While some sources claim that a system structure with high cohesion and low coupling reduces the maintenance effort and therefore eventually also increases productivity [HM95, KKKM00], other sources argue that too much structure decreases productivity during the initial development [FB14].

Since this study is primarily concerned with commercial projects, and several studies show that maintenance accounts for 55% to 90% of the overall cost of software in the industry [DH13, Gla01, YL94, MML15], it is unavoidable to measure and guide maintainability using static [HKV07] and change-cluster analysis.

For reasons of measurability we do not focus on “Database complexity” and “Interface to external systems”, since these sub-factors leave little or no traces in an isolated code repository.

The complexity of the architecture and code is often measured as a single metric. Just like this study tries to split unified productivity metrics up into multiple factors, Fenton and Bieman state that there is a multitude of attributes that make up complexity (see figure 2.2) and argue that there is a need to examine each attribute separately [FB14]. Using temporal coupling to measure the structure of a system is a common practice to predict change [RD10, ZZWD05], catch architectural decay [Tor15], detect cross-cutting concerns [BZ06], find bugs caused by incomplete changes [DLR09] and see where functionality deteriorates in a legacy system [KSMH06].

Another effect of temporal coupling is work fragmentation, which is the need for developers to switch between tasks and contexts. Empirical research on the behaviour of developers showed that perceived [MBM+17] and observed [MML15, SRG15] productivity decreases in periods of high work fragmentation. The researchers draw our attention to developers that switch between projects but also towards interruptions. In the projects available for this research we do not have the possibility to measure the same developer across multiple software systems.

Figure 2.3: Difference between low work fragmentation (cohesive tasks) and high work fragmentation (distributed tasks)

Figure 2.3 illustrates how work fragmentation is characterised by the changes that developers make to a system. Project A shows low work fragmentation, where developers have few context switches. Project B, on the other hand, shows how high work fragmentation is reflected by temporal coupling between modules. In the latter project, developers have to make changes across the system to accomplish their goals while keeping a large number of dependencies in mind. A system architecture that creates temporal coupling between modules forces engineers to spread their development activities.

At this point, it is hard to determine what qualifies as high or low temporal coupling. We can, however, use the data from the analysed projects to compare the overall coupling and investigate components that show high coupling. While doing so, it would also be interesting to compare the temporal coupling of components with static coupling.

To pretend high cohesion, developers would have to unite modules with high temporal coupling or change their commit strategy. Since we are using change-sets instead of pure commit data, this would mean that developers would have to postpone changes for multiple hours if they wanted to touch various modules at the same time. This behaviour would have a negative impact on static measurements of module volume and component balance.

2.3.3 Team cohesion & communication

Even before agile software development became popular, a small team size was equated with high productivity by many authors [TM09]. This impact of team size has much to do with the team structure and the burden of additional “communication links” as team size grows [CB97].

To measure the modularity of developers in a project, researchers have used social network analysis techniques to mine collaborating developers from version control [JSS11]. They used the similarity of commits, based on overlapping file paths, to compute the similarity of developers via Jaccard and cosine similarity, and validated the approach together with the respective team leaders.

Improving previous results would require access to more data that allows a better judgement about the actual communication between developers. Two developers may work closely together, but if their parts of the system communicate via a strict interface, we are not able to draw a connection based on the version control history. Since we only plan to mine version control systems and do not propose to include data from ticket systems or other collaboration tools, we will not cover this factor in this study.

2.3.4 Team capability & experience

A majority of the literature considers “team capability and experience” to be the most influential productivity factor [TM09]. Subfactors of this are “programming language experience”, “application familiarity”, and “project manager experience”. It is clear that the skill of individual team members has a substantial impact on the overall team productivity and no strong side effects. Measuring proficiency in a programming language is usually done with manual tests [FKL+12] and is hard to gather from version control information only.

Project managers are typically not directly concerned with development in commercial organisations, and relevant tasks like planning and monitoring are not directly reflected in source code. We could, however, measure how experienced a developer is in the current application by analysing his contributions to the system. This factor overlaps with almost all other factors: a good team structure usually assumes highly skilled technical people, and experienced developers are probably more likely to create a maintainable system structure.

It could be possible to judge developer proficiency by combining static analysis with the mining of historical data, but this author-focused measurement would be outside the scope of this study.


Chapter 3

Mapping productivity-influencing factors

Work fragmentation and staff turnover are real phenomena that we want to approximate by measuring information that is stored in version control systems. This process of converting characteristics of the real-world domain to a numeric range is called mapping [FB14, p. 52]. By setting the rules for this mapping, we create a repeatable function that assigns numeric values to productivity-influencing factors. When we see a change in staff turnover empirically, we expect this to be reflected in its formally defined metric.

Figure 3.1: Measurement mapping from real phenomena to numeric representations via available data

The example illustrated in figure 3.1 shows how real phenomena are mapped to numeric representations. As we can see, the binary relation “is more than” is preserved in the numeric representation. Real staff turnover was larger in May 2018 than in April of the same year, which is reproduced in the numerical representation (10 > 4). In measurement theory, this important preservation rule is called the representation condition, which asserts that a measurement mapping M must map entities into numbers and empirical relations into numerical relations in such a way that the empirical relations are preserved by the numerical relations [FB14, p. 55]. Adhering to the representation condition drives the decisions we make while defining the rules for the mapping. Traditional metrics for lines of code, for example, specify rules for excluding comments or duplicated code to approximate spent effort. After studying the characteristics of productivity-influencing factors, we make similar decisions to motivate rules for the mapping.
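Stated compactly (a restatement of the description above, not a formula taken from [FB14]), the representation condition requires for every empirical relation R and its numerical counterpart R':

\[
  x \; R \; y \iff M(x) \; R' \; M(y) \quad \text{for all measured entities } x, y
\]

In the staff turnover example of figure 3.1, “May 2018 had more turnover than April 2018” must hold exactly when M(May 2018) > M(April 2018), which the mapping satisfies with 10 > 4.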

3.1 Mapping of staff turnover

By analysing the activity of individual authors over time, based on the commits made to the software repository, we can estimate the actual composition of a team during a certain time period.

Figure 3.2 shows that we have four categories of developers inside a code base when we divide the members of a team into time periods such as months. There are authors that are entirely new to a project, authors that stayed from the previous month, leaving authors, and authors re-entering the code base after one or multiple time periods in which they did not touch the version control. Our simplified model of staff turnover is limited to full-time developers and treats developers that left a project and re-entered after an extended period of time as new developers.

Figure 3.2: Example of calculating staff turnover in a software repository

We therefore define the staff turnover ST during a certain time period p as the size of the union of the set of team members that left the team between the previous and the current time period (T_leave) and the set of those who entered the team (T_enter):

\[
ST(p) = |(T_p \setminus T_{p-1}) \cup (T_{p-1} \setminus T_p)| = |T_{enter} \cup T_{leave}| = |T_p \,\triangle\, T_{p-1}| \tag{3.1}
\]

The average staff turnover ST_mean is a decimal number that can be used to compare the staff turnover between projects over a similar time span and team size:

\[
ST_{mean} = \frac{1}{p_n - p_1} \sum_{p=p_1}^{p_n} ST(p) \tag{3.2}
\]


import pandas as pd

def active_team_members(commits: pd.DataFrame, period="M", minimum_work_days=7):
    activity = commits[["Author"]].copy()  # Only select the author column for activity
    activity.index = pd.to_datetime(commits['Date'])  # Parse the date column to date objects
    activity["Days"] = 1  # Create a Days column and fill with ones for later usage

    # Put every author in a separate column and fill cells with active days
    activity_daily = activity.pivot(columns="Author", values="Days").fillna(0).resample("D").sum()

    # Aggregate the daily activity into the given time period and sum up active days
    activity_period = activity_daily.clip(0, 1).resample(period).sum()

    # Only return activity data above the work days threshold
    return activity_period > minimum_work_days

commits: pd.DataFrame = pd.read_csv("commits.csv")
team_members_monthly = active_team_members(commits, "M", 7)
team_members_weekly = active_team_members(commits, "W", 2)

Listing 3.1: Retrieving active authors per month or week in Python using Pandas

A difficulty that we encounter is the standardised measurement of team size using version control only. The function active_team_members in listing 3.1 transforms a list of commits with dates and author names into a matrix of active time periods per author. Depending on the minimum_work_days parameter we can change the threshold of activity that an author has to fulfil before we count him as a team member.
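Building on the boolean activity matrix returned by active_team_members, the following sketch shows how ST(p) from formula 3.1 and the average from formula 3.2 could be computed; the helper name and the calling code are illustrative assumptions, not part of the prototype.

import pandas as pd

def staff_turnover(team_members: pd.DataFrame) -> pd.Series:
    # team_members: boolean matrix (rows = time periods, columns = authors), e.g. from listing 3.1
    current = team_members
    previous = team_members.shift(1, fill_value=False)
    entered = current & ~previous   # authors active now but not in the previous period
    left = ~current & previous      # authors active in the previous period but not now
    return (entered | left).sum(axis=1)  # ST(p) per period, cf. formula 3.1

# Illustrative usage with the function and commit data from listing 3.1
st_monthly = staff_turnover(active_team_members(commits, "M", 7))
st_mean = st_monthly.mean()  # average staff turnover, cf. formula 3.2 and table 3.1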

Month     Author A (d)   Author B (d)   Author C (d)   Team (mwd=10d)    Team (mwd=4d)
2016-06   0              0              15             {C}, ST = 1       {C}, ST = 1
2016-07   0              8              19             {C}, ST = 0       {B,C}, ST = 1
2016-08   13             9              2              {A}, ST = 1       {A,B}, ST = 2
                                                       ST_mean = 0.67    ST_mean = 2
                                                       T_mean = 1        T_mean = 1.67

Table 3.1: Example of staff turnover over the course of three months with varying minimum working days (mwd)

Table 3.1 compares two arbitrary values for minimum working days per month (10 days and 4 days). In our example, we already have a list of days a developer actively spent in a software repository, aggregated by month. Author A touched the repository on 13 days during August 2016, Author B on nine days and Author C only spent two days coding. We can see that a higher threshold does not only correspond with a smaller team size but also influences staff turnover immensely.

The authors of a case study on the open-source project OpenStack found that “the best effort estimation can be obtained with threshold values in the range from 9 to 12” [RGBC+14]. During the execution of our empirical research we use tracked working hours to calibrate this parameter for the selected projects.
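As a sketch of what such a calibration could look like (the data sources, the hour-report series and the range of candidate thresholds are assumptions for illustration), one could compare the mined monthly team size against the reported hours for every candidate value of minimum_work_days and keep the value with the strongest correlation:

import pandas as pd

def calibrate_minimum_work_days(commits: pd.DataFrame, monthly_hours: pd.Series,
                                candidates=range(1, 15)) -> int:
    best_threshold, best_correlation = None, -1.0
    for mwd in candidates:
        # Monthly team size mined from version control for this threshold (listing 3.1)
        team_size = active_team_members(commits, "M", mwd).sum(axis=1)
        # Pearson correlation between mined team size and tracked working hours per month
        correlation = team_size.corr(monthly_hours)
        if correlation > best_correlation:
            best_threshold, best_correlation = mwd, correlation
    return best_threshold

# monthly_hours is assumed to be a Series of reported working hours indexed by month
# best_mwd = calibrate_minimum_work_days(commits, monthly_hours)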

3.2 Mapping of work fragmentation

Business needs that require changing many subsystems at once cause work fragmentation. These requirements are hard to mine from version control only. Since we are focussing on measurable attributes, we can take a look at what causes work fragmentation inside a software system. When we measure the number of times system components were changed together, we can see which areas of a system caused work fragmentation and which parts are cohesive. For this reason, we create two measurements: temporal coupling between components and historical work fragmentation. The first measurement allows us to learn more about the system structure and find refactoring candidates, whereas the second measurement gives us an indication of the actual work fragmentation that occurred over time.

3.2.1 Mapping of temporal coupling

We measure temporal coupling, which occurs when software components change together, by counting change-sets that affect at least two components:

\[
TC_{abs}(COMP_x, COMP_y) = |C(COMP_x) \cap C(COMP_y)| \tag{3.3}
\]

where:

COMP_x is a component, such as a module or unit of the system.
C(COMP_x) is the set of change-sets that include a change to component COMP_x.
TC_abs(COMP_x, COMP_y) represents the absolute temporal coupling between components COMP_x and COMP_y.

The resulting number represents absolute temporal coupling and will grow together with the number of change-sets that touch either component COMP_x or component COMP_y. To compare module pairs that have widely differing change frequencies, we need a relative distance measure. A commonly used method to calculate the relative similarity of two sets X and Y is the intuitive Jaccard coefficient J [VHK+08]:

\[
J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{3.4}
\]

The second measurement, TC_rel (see formula 3.5), uses the Jaccard distance to bring the temporal coupling into a relative perspective and helps us to find components that may be small or receive little change but are still strongly coupled:

\[
TC_{rel}(COMP_x, COMP_y) = \frac{|C(COMP_x) \cap C(COMP_y)|}{|C(COMP_x) \cup C(COMP_y)|} \in [0, 1] \tag{3.5}
\]

Description                          Formula               Example value
Sets touching module a               |C(M_a)|              60
Sets touching module b               |C(M_b)|              40
Sets touching module a or b          |C(M_a) ∪ C(M_b)|     80
Sets touching module s               |C(M_s)|              10
Sets touching module t               |C(M_t)|              15
Sets touching module s or t          |C(M_s) ∪ C(M_t)|     15
Sets touching module a and b         TC_abs(M_a, M_b)      20
Sets touching module s and t         TC_abs(M_s, M_t)      10
Relative coupling module a and b     TC_rel(M_a, M_b)      25%
Relative coupling module s and t     TC_rel(M_s, M_t)      67%

Table 3.2: Example of absolute and relative temporal coupling for two module pairs


The example given in table 3.2 explains how greatly absolute and relative coupling measurements can diverge when used to compare module pairs. Modules a and b show a stronger absolute coupling but a much weaker relative temporal coupling than modules s and t.

To find components that cause the highest work fragmentation, as well as components that cause fragmentation whenever they are touched, we must measure both values.

The similarity for each pair is then computed to obtain a similarity matrix [KSMH06]. For n components, this matrix comprises (n² − n) / 2 temporal coupling values that are of interest to us. We only have to calculate one half of the matrix, since temporal coupling values are symmetric: TC(COMP_x, COMP_y) = TC(COMP_y, COMP_x). Reflexive values are also not relevant to judge temporal coupling between components: TC_rel(COMP_x, COMP_x) = 1.

3.2.2 Mapping of work fragmentation between and inside change-sets

When a developer has to adjust many modules to make a single change, we speak of high work fragmentation. Additionally, we also speak of work fragmentation when a developer has to switch contexts between two separate changes. Since we are focussing on work fragmentation between modules, context switches are the number of new modules a developer has to touch when jumping from one change to another. We call these two measurements WF_intra for work fragmentation within a change-set c1 and WF_inter for work fragmentation between two change-sets c1 and c2.

Since we do not want to measure context switches between multiple days, we have to introduce a threshold value t. We can influence the result of WF_inter by adjusting the time threshold t between two change-sets. If two change-sets are too far apart, WF_inter will be 0 to reflect that work fragmentation only occurs if developers switch between contexts without a long break in between [MBM+17].

Just as for temporal coupling, we can adjust the component level from unit level towards module level.

\[
WF_{intra}(c_1) = |\{ COMP_i \mid COMP_i \text{ touched by } c_1 \}| \tag{3.6}
\]

\[
WF_{inter}(c_1, c_2, t) = |\{ COMP_i \mid COMP_i \text{ touched by } c_1 \,\wedge\, \neg(COMP_i \text{ touched by } c_2) \,\wedge\, c_2.date - c_1.date \le t \}| \tag{3.7}
\]

We again illustrate this measurement using an exemplary collection of change-sets created by the same author on one day in figure 3.3.


Figure 3.3: Example of calculating inter work fragmentation and intra work fragmentation

As we can see, the time threshold is necessary to calculate the work fragmentation between two change-sets. In our example, we chose a threshold of 3 hours, which explains why we have work fragmentation between change-sets 1 and 2 but no fragmentation between sets 2 and 3.
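Both measures can be expressed directly in code. The following sketch assumes each change-set is a simple record with a timestamp and the set of components it touched; the component names, timestamps and the 3-hour threshold follow the style of the example above but are otherwise illustrative assumptions.

from datetime import datetime, timedelta

def wf_intra(change_set) -> int:
    # Formula 3.6: number of components touched within a single change-set
    return len(change_set["components"])

def wf_inter(c1, c2, threshold=timedelta(hours=3)) -> int:
    # Formula 3.7: components touched by c1 but not by c2, counted only if the
    # two change-sets are close enough in time
    if c2["date"] - c1["date"] > threshold:
        return 0
    return len(c1["components"] - c2["components"])

# Three change-sets of one author on the same day (illustrative values)
c1 = {"date": datetime(2018, 5, 1, 9),  "components": {"ui", "core"}}
c2 = {"date": datetime(2018, 5, 1, 11), "components": {"core"}}
c3 = {"date": datetime(2018, 5, 1, 16), "components": {"db"}}

print(wf_intra(c1))      # 2 components touched in one change-set
print(wf_inter(c1, c2))  # 1, the change-sets are within the threshold
print(wf_inter(c2, c3))  # 0, more than 3 hours apart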

3.3 Mapping summary

We mapped two productivity-influencing factors to a numerical representation: staff turnover and work fragmentation. In order to better understand work fragmentation, we decided to also measure the temporal coupling of high-level components. When two modules are often changed together, they are a cause of work fragmentation. The mapping rules were formally described and can now be implemented in any programming language of choice.


Chapter 4

Empirical study

The empirical study is the final step of the research process described in figure 1.1. In the following, we do not only describe the actual data mining in more detail, but also the tasks that precede and follow it. Without correct selection, cleaning and transformation it is unlikely that we find valid patterns that characterise productivity, and we risk acting on false statements. A commonly used method to explore datasets is the Knowledge Discovery in Databases (KDD) process described by Fayyad [FPSS96]. He describes a set of pragmatic aspects of how business-oriented knowledge demands can be put into practice:

1. Data Selection: Determine the appropriate data type and source, as well as suitable instruments to collect data.

2. Data Cleaning: Define and remove actual noise, gathering the necessary information to model or account for noise, deciding on strategies for handling missing data fields.

3. Transformation: Finding useful features to represent the data depending on the goal of the task.

4. Data Mining: Searching for patterns in a particular representational form or a set of such representations.

5. Interpretation: Evaluate mined patterns and possibly return to one of the previous steps to adjust parameters or strategies.

6. Acting: Using the required knowledge to improve the development process or learn for future projects.

The individual steps provide only a rather superficial description of the technology to be used. This allows us to adopt the elementary structure and choose technologies and methods as necessary.


Figure 4.1: Knowledge discovery process for productivity-influencing factors in version control systems

Figure 4.1 shows how we tailored the knowledge discovery process to our needs. It describes the flow of information from version control repositories as input to a set of visualizations and metrics. The last steps of the knowledge discovery process, interpretation and acting, will be described in the subsequent discussion chapter.

4.1 Dataset

For our empirical research, we have to select projects that provide a sufficiently long history, use version control systems and have background information available. We need data on the context of a project to argue about causes for correlations that we find during the knowledge discovery process. To test our prototype in different environments, we also choose projects from different vendors. Table 4.1 lists the projects that were selected for the analysis. The historical data, which is necessary to observe factors across different development stages, ranges between 6 months and more than 3 years, including the initial creation and maintenance phases.


Project name   Origin     Volume (KLOC)   Commits   Duration
S1             SIG        38              1596      2017-05 - 2017-11
S2             SIG        563             5928      2015-03 - 2018-05
C1             Customer   66              1276      2016-03 - 2018-07
C2             Customer   153             2138      2017-01 - 2018-02
C3             Customer   146             1737      2017-05 - 2018-03
C4             Customer   143             2162      2014-10 - 2018-01
Total                     1109            14 837    134 months

Table 4.1: Analyzed projects provided by SIG and customers

As indicated in the table, projects either stem from SIG or from anonymised customers of SIG that were willing to provide version control data together with source code. We decided to forgo the analysis of projects that only had weekly snapshot data available, since the historical information was not fine-grained enough for our approach and authorship is unclear.

4.2 Prototype development

For the development of a prototype, we have opted to use Python [VRD11]. Next to its simplicity, Python is often used for data science because of powerful statistical and numerical packages such as NumPy [1] and Pandas [2]. Pandas lends itself to our use case since the commit history of a version control system can be seen as a data frame that is indexed by date and contains multiple columns such as author, churn, commit message or touched file paths. Various relational algebra operations such as projection, selection and renaming can then be used to reach the specified result and prepare the data for human evaluation.

We make extensive use of Matplotlib [3] and Seaborn [4] to render and explore different forms of data visualisations. The created software evolution tool will not only generate our metrics and diagrams but perform multiple steps to prepare the raw input data. Our final prototype will work together with internal tooling of the hosting company but does not require data sources other than the version control system and can therefore be published.

[1] NumPy package: http://www.numpy.org (accessed 2018-06-12)
[2] Pandas data analysis library: https://pandas.pydata.org (accessed 2018-06-12)
[3] Matplotlib - Python 2D plotting library: https://matplotlib.org (accessed 2018-06-12)
[4] seaborn - statistical data visualisation: https://seaborn.pydata.org (accessed 2018-06-12)


Figure 4.2: Commit mining prototype class diagram

Figure 4.2 shows an overview of the classes that make up the mining prototype. The actual implementation details are described in the individual sections on the knowledge discovery process below. As we can see, a central VcsMiningPipeline class orchestrates the execution. The whole application is structured in three layers that only communicate through the single direction of the pipeline: parsing (purple), data mining (blue) and visualisation (yellow). Our VcsLogParser uses the correct parsing strategy for the supported version control systems. Adding a new version control system can be done by adding a strategy to the parser. Similarly, the VisualizationService reads the intermediate output of the ChangeSetMiner and triggers the visualisations specified in the visualization modes argument. The ChangeSet, FileMove and FileChange classes model the basic concepts of version control systems. ChangeSets are the most granular work units of developers that we consider for mining. Every ChangeSet is made up of multiple instances where files were moved and file content was changed. Since we usually have to analyse different lists of ChangeSets, such as the non-empty sets of a particular developer over the course of a specific month, we also create a ChangeSetCollection class. These collections contain basic operations for filtering sets that return new immutable data structures. This immutability allows us to store calculations, like the total amount of churn, in an internal cache.


class VcsMiningPipeline:
    def __init__(self, input: str, output_directory: str, time_threshold_minutes: int,
                 visualization_arguments: VisualizationArguments):
        self.input = input
        self.output_directory = output_directory
        self.visualization_arguments = visualization_arguments
        self.time_threshold_minutes = time_threshold_minutes

    def run(self):
        # Retrieve raw change sets from the version control log and aggregate them into change sets
        raw_change_sets: ChangeSetCollection = VcsLogParser(self.input).run()
        change_sets: ChangeSetCollection = ChangeSetAggregator(raw_change_sets,
                                                               self.time_threshold_minutes).run()

        # Run ChangeSetMiner on effort-relevant sets and write intermediate results to output
        change_set_miner = ChangeSetMiner(change_sets.only_effort_relevant(),
                                          raw_change_sets,
                                          self.output_directory)
        change_set_miner.run()

        # Create visualizations from intermediate results and save figures to diagrams subfolder
        visualizer = VisualizationService(input_directory=self.output_directory,
                                          output_directory=self.output_directory + '/diagrams',
                                          visualization_arguments=self.visualization_arguments)
        visualizer.run()

# Run pipeline with exemplary parameters
vcs_miner = VcsMiningPipeline("projectA/commits_2017.git.log", "results/projectA", 120,
                              VisualizationArguments(
                                  modes=["staff_turnover", "author_activity"],
                                  min_active_days_per_month=5,
                                  padding=[10, 10, 20, 10],
                                  titles=True,
                                  anonymize=True
                              ))
vcs_miner.run()

Listing 4.2: VcsMiningPipeline class that orchestrates prototype components

In listing 4.2 we can see how the pipeline is implemented. The class diagram and the provided source code snippet illustrate that the individual layers are strictly separated and that we can run each step individually. In lines 29-37 we start the prototype with exemplary parameters. In the actual application, the tool retrieves its parameters from the command line or a prospective user interface. We were able to keep the prototype at a small size of 2259 LoC by leveraging third-party libraries. Pandas allowed us to map change-sets to the desired output format using idiomatic constructs.

1 # Read data frame from CSV input, parse Date column and select wanted columns

2 df: pd.DataFrame = pd.read_csv(input, index_col=0, parse_dates=[’Date’])

3 df = df[["Date", "Category", "Churn", "Size"]]

4

5 # Use each distinct value of category as columns and have weekly churn per change category in rows

6 df = df.pivot(index="Date", columns="Category", values="Churn").fillna(0).resample("W").sum()

7

8 # Show line chart plot in IDE

9 pyplot.show()

Listing 4.3: Exemplary usage of Pandas to transform a list of change-sets into a line diagram

The example given in listing 4.3 highlights the benefits of this technology stack for experimental data mining. Once fluent in the data analysis API, a developer can write short transformation functions that would require substantial implementation effort to write from scratch. This way of writing code is not only more concise but also considerably more performant. By deliberately avoiding for-loops and making use of NumPy vectorisation, we can leverage the highly performant numerical routines provided by SciPy. In the example, we parse a list of change-sets from a CSV file into a data frame, change the layout of the table and aggregate the rows into weekly sums in three basic steps. As an interpreted language, Python allows us to write interactive scripts that show preliminary visualisations directly in the IDE. This way we can try out different prototypical solutions with short feedback loops before implementing the final functionality in code.
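As a toy illustration of this difference, the following sketch sums daily churn into weekly buckets once with an explicit Python loop and once with the vectorised Pandas equivalent; the data is synthetic and not taken from the prototype.

import numpy as np
import pandas as pd

# Synthetic data: daily churn values for one year.
rng = np.random.default_rng(42)
daily_churn = pd.Series(rng.integers(0, 500, size=365),
                        index=pd.date_range("2017-01-01", periods=365, freq="D"))

# Explicit Python loop over rows (slow for large frames).
weekly_loop = {}
for day, churn in daily_churn.items():
    week = day.to_period("W")
    weekly_loop[week] = weekly_loop.get(week, 0) + churn

# Vectorised equivalent backed by NumPy routines.
weekly_vectorised = daily_churn.resample("W").sum()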

4.3 Data selection

It was necessary to restrict the required data sources to a bare minimum for two primary reasons: limited data availability and universality.

As the starting point of the process, we decided to use simple log outputs instead of operating directly on the complex data structures of highly diverse version control systems [Has08].

# Git log for changes in the year 2018
git log --pretty=format:'[%h] %aN %ad %s' --date=short --numstat --after=2018-01-01

# SVN log for changes in the year 2018
svn log --verbose --xml > logfile.log -r {2018-01-01}:HEAD

# Mercurial log for changes in the year 2018
hg log --template "rev: {rev} author: {author} date: {date|shortdate} files:\n{files %'{file}\n'}\n" --date ">2018-01-01"

Listing 4.4: Generating similar version control log files for Git, SVN and Mercurial via the command line

Listing 4.4 shows the exemplary creation of log files with a similar structure using three of the most popular version control systems: Git, SVN and Mercurial.

This approach allows us to add a new version control system by writing a small parser strategy (see 4.2) as long as the new system adheres to the basic concept of recording which author changed which files at a particular time. To filter and categorise data, it is also necessary to provide a project-specific configuration. This configuration includes a whitelist for relevant file path patterns and a blacklist for irrelevant file path patterns. Most analysed projects contain generated code or documentation, external libraries or configuration folders that would otherwise be incorporated into the evaluation.

Furthermore, we need a mapping from file path to a module identifier. This mapping is necessary since some projects do not use the root folder or top-level packages to structure their modules. Some PHP content management systems like TYPO3 or Drupal, for example, require the user to put functionality in specific folders depending on the file type or the phase of the application lifecycle.
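Such a project-specific configuration could, for example, look like the following sketch; the keys and patterns are illustrative assumptions rather than the prototype's actual configuration format.

# Hypothetical project configuration; the keys and patterns are examples only.
PROJECT_CONFIG = {
    # Whitelist: only paths matching these patterns are analysed.
    "include_patterns": [r"^src/.*\.(java|py|php)$"],
    # Blacklist: generated code, third-party libraries and configuration folders.
    "exclude_patterns": [r"^vendor/", r"^build/", r".*\.min\.js$", r"^docs/generated/"],
    # Mapping from path prefix to a logical module identifier.
    "module_mapping": {
        "src/billing/": "billing",
        "src/accounts/": "accounts",
        "typo3conf/ext/shop/": "shop-extension",
    },
}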

5 SciPy - ecosystem for mathematics, science, and engineering: https://www.scipy.org/getting-started.html (accessed 2018-08-23)
6 Git log documentation: https://git-scm.com/docs/git-log (accessed 2018-06-05)
7 SVN log documentation: http://svnbook.red-bean.com/en/1.7/svn.ref.svn.c.log.html (accessed 2018-06-05)
8 Mercurial SCM log documentation: https://www.mercurial-scm.org/repo/hg/help/log (accessed 2018-06-05)
9 TYPO3 directory structure: https://docs.typo3.org/typo3cms/CoreApiReference/ApiOverview/DirectoryStructure/Index.html (accessed 2018-06-06)

4.4 Data cleaning

Data cleaning is typically concerned with missing attributes or invalid data [FPSS96]. The well-structured data that we extracted from the client projects did not contain anomalies like missing authors or invalid dates, but it needs to be classified to be useful for productivity measurements. It was, however, necessary to perform rudimentary record linkage to identify authors that use different pseudonyms or change the spelling of their name over time. The name of an author is transformed into alphanumerical ASCII characters, lowercased and stripped of its whitespace. With this rudimentary sanitation function we create a more generic label for developers. Examples of possible duplicates that would be joined are authors with diacritical marks, like the first name René that often occurs as Rene, and names like Sam van Dijk that can occur as SamVanDijk or sam van dijk.
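A minimal sketch of such a sanitation function is given below; the exact transliteration rules used by the prototype may differ.

import re
import unicodedata


def sanitize_author(name: str) -> str:
    """Create a generic, comparable label for an author name."""
    # Replace diacritical marks with their ASCII base characters (René -> Rene).
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Lowercase and drop everything that is not alphanumeric, which also removes whitespace.
    return re.sub(r"[^a-z0-9]", "", ascii_name.lower())


assert sanitize_author("René") == sanitize_author("Rene")
assert sanitize_author("Sam van Dijk") == sanitize_author("SamVanDijk")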

For this study, we propose a simplified version of change categories. Instead of splitting changes into corrective, adaptive, perfective and implementation changes [HGH08], we are interested in the following classifications:

• Effort relevant changes: Changes that add features (forward engineering), make adjustments (reengineering) or correct mistakes (corrective) and require implementation effort.

• Non-effort relevant changes: Changes that do not require any significant amount of effort or can easily be automated. Examples of this are automated code clean-ups or the inclusion of generated and third party files that were not detected by the previous scoping. This often includes code-managerial tasks such as the yearly update of legal information or changing code formatting guidelines.

Polluting change-set data with non-effort relevant information would result in metrics that unfairly favour or penalise projects that contain unusually many code-managerial tasks, such as reformatting, or that include automatically generated code or documentation. We can still use non-effort relevant activities to estimate periods in which a developer was active, but we filter them out for other metrics such as the coupling of modules. Previous research approached this problem by merely removing large outliers [ZZWD05, VHK+08, Tor15]. The threshold was either defined by developers, represented by a certain percentage, or was even chosen arbitrarily. Studies by Hattori and Hindle show that the vast majority of commits touch fewer than five files and only change a small portion of a system [HL08, HGH08].

Figure 4.3: Distribution of commit size in analyzed projects (n=14 837, mean=65 LOC, median=9 LOC, maximum=41 158 LOC, standard deviation=427)

We used the classification of commit sizes proposed by Hattori on our dataset and see in figure 4.3 that our results correspond with previous statements made on the basis of open source projects. Analysing only the majority of commits by solely taking tiny and small commits into account would not be a sound strategy for this study, since the mentioned research also suggests that large commits do not only contain code management activities but also regular development activities that were explicitly grouped by developers for various reasons [HL08].

As figure 4.1 shows, we flag non-effort relevant commits automatically but give users of the tool the possibility to find falsely identified commits.

We accelerate the manual labelling process by classifying commits into "Forward Engineering", "Reengineering", "Corrective Engineering" and "Management" based on the keyword list given in Hattori's study on the nature of commits, which is compared against the commit messages (please see appendix C).
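The keyword matching itself is straightforward; a minimal sketch is shown below. The keywords are only examples, while the prototype uses the full list from appendix C.

# Example keywords only; the prototype relies on the complete list from appendix C.
CATEGORY_KEYWORDS = {
    "Forward Engineering": ["add", "implement", "feature", "initial"],
    "Reengineering": ["refactor", "rename", "cleanup", "move"],
    "Corrective Engineering": ["fix", "bug", "issue", "fail"],
    "Management": ["license", "copyright", "formatting", "merge"],
}


def classify_commit_message(message: str) -> str:
    """Return the first change category whose keywords occur in the commit message."""
    lowered = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Unclassified"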

Figure 4.4: Commit change categories per commit size based on keyword list (n=14 837)

Our analysis of commercial projects again confirms the statements on the nature of large commits made by previous studies. Figure 4.4 shows the change categories stacked by the commit sizes given by Hattori [HL08]. We can see that large-scale commits and even outliers in the 99th percentile (very large) should not always be ignored, since they contain a significant amount of forward engineering activity.

4.5 Data transformation

This step enriches the previously cleaned commit data with information necessary for further analysis and groups raw version control commits into larger change-sets.

4.5.1 Enriching commit data

We transform commits by enriching and aggregating them. Our prototype adds the following attributes to every commit:

• Effort relevant churn: Instead of only adding deleted and added lines to calculate the churn, we use type 1 clone detection [McC04] to get a better image of the effort that went into writing the code. Only the adjusted pieces of cloned code are relevant for the churn calculation.

• Touched modules: With the project-specific rules that match paths and module names we can add a list of affected modules and the amount of churn per module to each commit.

• File move detection: Some version control systems like Git consider file movement as churn but record the path before and after the action. Our tool keeps track of moved files, flags them and only counts the lines that were changed after moving the file, instead of counting the whole content of a moved file as churn.
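As an illustration of the touched-modules enrichment, the following sketch attributes churn to modules via path prefixes; the mapping and helper names are hypothetical and the prototype additionally applies clone and file-move detection.

from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative mapping from path prefix to module identifier.
MODULE_MAPPING = {"src/billing/": "billing", "src/accounts/": "accounts"}


def path_to_module(path: str) -> str:
    """Return the module identifier for a file path, or 'other' if it is unmapped."""
    for prefix, module in MODULE_MAPPING.items():
        if path.startswith(prefix):
            return module
    return "other"


def churn_per_module(file_changes: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Aggregate added plus deleted lines per module for a single commit."""
    churn: Dict[str, int] = defaultdict(int)
    for path, added, deleted in file_changes:
        churn[path_to_module(path)] += added + deleted
    return dict(churn)


print(churn_per_module([("src/billing/invoice.py", 30, 5), ("src/accounts/user.py", 2, 2)]))
# {'billing': 35, 'accounts': 4}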


4.5.2 Change-set approximation

In order to analyse and compare repositories across different projects and commit strategies, it is useful to group commits by the same author into change-sets as long as their commit dates do not differ by more than a given δ [KSMH06, VHK+08]. These change-sets are assumed to have a common motivation but are only an approximation of truly interrelated changes on a system. Figure 4.5 shows how the choice of δ can influence the temporal coupling between modules.

Figure 4.5: Consolidate commits into change-sets using different time delta

The smaller δ1 suggests that only modules A and B as well as B and C are temporally coupled. Choosing the larger δ2 could distort the analysis by bringing module A closer to C, while a delta smaller than δ1 would not show an interdependence between modules A and B. This step does not only group commits that presumably belong to the same intent, but also reduces the runtime of our tool by reducing the number of changes that are compared to each other (see listing 4.5).

for (i = 0; i != change_sets.length; i++)
    for (j = i + 1; j < change_sets.length; j++)
        compare_change_sets(change_sets[i], change_sets[j])

Listing 4.5: Pseudo code to compare a list of change-sets with each other in O(½(n² − n)) = O(n²)

Neither Vanya [VHK+08] nor Kothari [KSMH06] mention which time delta they concretely used for their change-set approximation. Kothari talks about a "short time frame", whereas Vanya assumes a delta that yields the smallest amount of falsely classified outcomes. In the case of change-set approximation, we do not know whether a developer finished a task before taking a break or grabbing a cup of coffee, or continues the same task afterwards. A developer can also work on different tasks in parallel and check in files at the same point in time without a common motivation. In the case of this research, the approximation step is primarily necessary to determine work fragmentation. We manually analysed the change-sets that result from a δ of 1, 2 and 4 hours. For practical purposes, we chose 2 hours, since these change-sets were large enough to show explicable results for temporal coupling. This means that we could find reasonable sets of changes which involved similar modules with a common intention in the samples we inspected. This δ may only be valid for the projects we analysed in this research. Developers in commercial projects usually have strictly assigned tasks that they follow throughout a full workday. For open source projects, in which most developers typically make infrequent changes [RGBC+14], this delta needs to be shorter.
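The grouping step itself is not shown as code in this section. A minimal sketch of how commits could be consolidated into change-sets with a 2-hour δ is given below; it assumes a Pandas data frame with Author and Date (datetime) columns and is an illustration, not the prototype's ChangeSetAggregator.

from datetime import timedelta
from typing import List

import pandas as pd


def approximate_change_sets(commits: pd.DataFrame, delta_minutes: int = 120) -> List[pd.DataFrame]:
    """Group commits of the same author whose dates differ by at most delta minutes."""
    delta = timedelta(minutes=delta_minutes)
    change_sets: List[pd.DataFrame] = []
    for _, author_commits in commits.sort_values("Date").groupby("Author"):
        current: List = []
        previous_date = None
        for index, row in author_commits.iterrows():
            # Start a new change-set when the gap to the previous commit exceeds delta.
            if previous_date is not None and row["Date"] - previous_date > delta:
                change_sets.append(author_commits.loc[current])
                current = []
            current.append(index)
            previous_date = row["Date"]
        if current:
            change_sets.append(author_commits.loc[current])
    return change_sets


# Usage example: change_sets = approximate_change_sets(commit_frame, delta_minutes=120)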


Project name   Commits   Change-sets (δ = 2h)   Commits per change-set   Runtime improvement
S1             1596      612                    2.61                     681%
S2             5928      2945                   2.01                     405%
C1             1276      923                    1.39                     191%
C2             2138      960                    2.23                     496%
C3             1737      764                    2.27                     517%
C4             2162      880                    2.47                     604%
Mean           -         -                      2.16                     482%

Table 4.2: Results of change-set approximation for analyzed projects

Table 4.2 clearly shows the benefits of the change-set approximation for the overall runtime. Instead of running the distance algorithm from listing 4.5 over all commits, we only have to compare the reduced number of change-sets. The asymptotic growth rate stays the same, but reducing the number of elements improves the mining performance by almost a factor of five on average. Even though we chose a relatively high value for δ, we only group 2.16 commits into a change-set on average.
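The reported runtime improvements are consistent with the quadratic number of pairwise comparisons from listing 4.5; the following calculation reproduces the value for project S1 from the numbers in table 4.2.

# Pairwise comparisons before and after change-set approximation for project S1.
commits, change_sets = 1596, 612
comparisons_before = commits * (commits - 1) // 2          # 1 272 810 comparisons
comparisons_after = change_sets * (change_sets - 1) // 2   # 186 966 comparisons
print(round(100 * comparisons_before / comparisons_after))  # ~681 (percent)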

4.6 Data mining

The data mining step applies simple techniques to extract numbers based on the rules created in the mapping of section 3. We retrieve the activity data per developer by passing a data frame of unfiltered commits to the function described in listing 3.1. When we iterate over each month, we can keep track of developers that were not present in the previous step or that left the team in the current step. This simple algorithm is displayed in listing 4.6. As we can see in lines 15 and 16, we stay close to the formal mapping described in equation 3.1.

 1 def get_staff_turnover(commits: pd.DataFrame, minimum_work_days=7):
 2     # Get active days per author indexed by month
 3     monthly_team_members = active_team_members(commits, 'M', minimum_work_days)
 4
 5     # Initialize previous month authors with empty set
 6     previous_month_authors = set()
 7     # Create empty dictionary that will contain staff turnover for all months
 8     staff_turnover = {}
 9
10     for month, author_days in monthly_team_members.iterrows():
11         # Set of author names that were active in current month
12         authors = set(author_days[author_days >= 0].index.values)
13
14         # Calculate changes in authors between previous and current month
15         new_authors = authors - previous_month_authors
16         leaving_authors = previous_month_authors - authors
17         unchanged_authors = authors.intersection(previous_month_authors)
18
19         # Add current month to staff turnover result
20         staff_turnover[month] = {
21             "new_authors": new_authors,
22             "leaving_authors": leaving_authors,
23             "unchanged_authors": unchanged_authors,
24             "staff_turnover": len(new_authors) + len(leaving_authors)
25         }
26
27         previous_month_authors = authors  # remember the current team for the next month
28
29     return staff_turnover

Listing 4.6: Calculation of the monthly staff turnover based on author activity
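Listing 4.6 depends on the active_team_members function introduced in listing 3.1, which is not repeated here. A minimal sketch that is consistent with how listing 4.6 consumes its result (a frame of active days per author, indexed by month) could look as follows; it is an assumption for illustration, not the actual implementation.

import pandas as pd


def active_team_members(commits: pd.DataFrame, frequency: str = "M",
                        minimum_work_days: int = 7) -> pd.DataFrame:
    """Active days per author per period; authors below the threshold become NaN.

    Assumes a Date column of datetime type and an Author column in the commit frame.
    """
    # Count the distinct calendar days on which each author committed, per period.
    active_days = (commits
                   .assign(Day=commits["Date"].dt.date)
                   .groupby([pd.Grouper(key="Date", freq=frequency), "Author"])["Day"]
                   .nunique()
                   .unstack("Author"))
    # Mask authors that did not reach the minimum number of active days in a period.
    return active_days.where(active_days >= minimum_work_days)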
