
Preserving and reusing architectural design decisions
van der Ven, Jan

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2019

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

van der Ven, J. (2019). Preserving and reusing architectural design decisions. University of Groningen.

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.


Chapter 8

Towards Reusing Decisions by Mining Open Source Repositories

“The code is the architecture.”

- Hohpe et al. [92]

This chapter is based on:

• Jan Salvador van der Ven and Jan Bosch. “Making the Right Decision: Supporting Architects with Design Decision Data”. In: Proceedings of the 7th European Conference on Software Architecture (ECSA 2013). Ed. by Khalil Drira. Vol. 7957. Lecture Notes in Computer Science. Springer, 2013, pp. 176–183

• Jan Salvador van der Ven and Jan Bosch, "Towards Reusing Decisions by Mining Open Source Repositories". Submitted to an international Software Engineering Journal, 2018

The structure of this chapter follows the latter. The description of the levels of architectural decisions is taken from the ECSA 2013 publication.

Abstract

Frameworks and reusable components caused a rapid increase in software development speed. Frameworks offer a proven architecture to work on, and enable easy integration of components. Components provide specific functionality: calculations, APIs, connectors, GUIs, etc. Selecting the right components is one of the most important decisions in software development projects, but also surprisingly hard to get right. These decisions are getting increasingly difficult as the availability of alternative components for similar functionality is growing. Luckily, this increased availability also holds for the number of projects that use these components, so an increasing set of example projects is available online. We show that data about previously made decisions in these projects can be made available to decision-makers, so they can learn from others making similar decisions. We provide a detailed analysis of the suitability of our approach for popular programming languages. We describe how these decisions can be identified in the version history of open source projects. The resulting data contains statistics about the used components and can be used to base decisions on. Also, we show that decision-makers can be easily contacted for specific decision rationale. Our approach is exemplified by an implementation that mines decisions from Ruby projects.


8.1 Introduction

In the past years, the productivity of software development has increased. Recently developed component frameworks [148] for a range of programming languages help quick development and component integration. In basically all of the major development paradigms, frameworks exist that give developers a basis where they can start development immediately. These frameworks actually implement a set of architectural decisions, so the developer does not need to make these decisions over and over again. For example, all the modern web frameworks (Rails, Django, etc.) implement the MVC (Model View Controller) pattern [66], and base their APIs on SOAP [194] or REST [64]. In addition, these frameworks provide the 'plumbing' for the integration of 3rd party Commercial-Off-The-Shelf (COTS) [129] or Open Source (OS) components [10], making it easier to combine and reuse these components. The main challenge for developers and architects shifts from the choice for specific architectural patterns or styles to the selection of the right frameworks and component sets. The choice for the framework is often driven by non-project-specific requirements like contracts with suppliers, experience of the team, or preference of the customer organization. However, components come and go much quicker than frameworks. So, the challenge for developers and architects shifts from choosing the right architecture once, to making component-related decisions continuously. This research focuses on the design decisions that involve the selection of these components: Component Selection Decisions. In this work, we define Component Selection Decisions as architectural decisions that involve the selection and evaluation of components. The abbreviation CSD will be used for the rest of this manuscript.

In software ecosystems, information on selecting the right components is critical for success [77]. Many of the decisions on component use have been made earlier by others working on similar systems. It would be great if decision-makers could access the decisions made by others, preferably in a data-driven fashion that would allow them to determine what selections were made from a set of alternatives and with what frequency. That would give decision-makers hard, quantified data to base their own decisions on. The question is: how can we access these decisions and the people that made them?

Interestingly, over the last decade or more, several open-source software repositories have achieved broad adoption and host hundreds of thousands of projects in virtually any programming language and application domain imaginable. Examples include SourceForge1 with 430K projects and 3.7 million developers, and GitHub2 with more than 57 million repositories and more than 8.5 million developers. On the one hand, this is part of the identified problem; how to locate the correct component in this huge set? But, as many of the projects in these repositories are public, there is a huge amount of data available about the structure of these projects as well as the evolution of these structures over time.

The version control systems (e.g. Git, SVN, Mercurial) of these projects contain exactly this evolution data. In order to provide decision-makers with data about the CSDs that others made, these version control systems provide an excellent source of data. However, considering the sheer volume of data, this requires an automated, rather than manual, approach to derive the information. We propose an approach that harvests this big software data and makes it available to decision makers in a statistical way. With this, decision makers can avoid making bad decisions while finding relevant alternatives for the decision at hand.

1 http://sourceforge.net/
2 https://github.com/

The contribution of this chapter is threefold:

• First, we analyze the possibility of mining CSDs from the history of version control systems. We show the suitability of this approach for the set of most-used programming languages.

• Second, we show that data from open source systems can be mined and made available to base CSDs on by describing our implementation for open source Ruby projects.

• Third, we demonstrate the applicability of the approach by showing how decision makers benefit from the acquired data.

This manuscript is organized as follows. First, the context of making decisions based on big software data is explained, ending with a vision on how to base decisions on data. Then, in the research approach, the theoretical background on mining repositories for decisions is described, resulting in three research questions. These questions are addressed in the subsequent sections. This work ends with a discussion and a conclusion.

8.2 Context

8.2.1 Components in Software: Libraries and Frameworks

As software development is maturing, strategic reuse of software can be the difference between success and failure [166] [157]. On the architecture level, patterns or styles, or COTS or OS components are the typical reusable artifacts. This means software engineering is changing for a large part from system development to component integration, where software developers are becoming component integrators [10]. There are several reasons why components are gaining attention. First, as hardware is getting cheaper, there is no need to create tailored components for specific situations that skimp on (hardware) resources. Instead, reusing existing components and gluing them together is a very good (and typically faster) alternative. Additionally, tools for sharing sources and reusing components have matured rapidly. It is nowadays easy to find numerous components that satisfy specific functionality. Examples of places where you can find components are Rubygems3 or Cocoapods4. Finally, frameworks have been developed that enable easy integration of these components into projects. These frameworks satisfy a tailored set of architectural decisions and provide plumbing to tie components together.

This increased possibility for component integration creates a shift in the work of software architects and developers. Modern frameworks implement a set of specific architectural styles and patterns; therefore, these decisions are no longer up for debate. Also, the integration of components is simplified, as the 'plumbing' is typically taken care of by these frameworks. The most important question to solve becomes what components to use in a situation. This places CSDs on a critical path in many software development projects. However, since the potential number of components is immense, the knowledge about these components is extremely difficult to acquire and it is hard to keep this knowledge up-to-date. Architects and developers seldom use a structured process when selecting components [130], and normative methods provided by the research community are rarely used [82]. Most component selection is done based on the experience of internal or external experts [10]. However, the context and implementation details are typically very different. So, the challenge is how to access the right knowledge, and how to contact decision-makers that faced similar challenges for sharing rationale on the decision.

3 https://rubygems.org/
4 http://cocoapods.org/

8.2.2 Available Decision Data

There are various sources available to base component selection decisions on. Often, the initial search starts with the experience of the architect or their colleagues (tacit knowledge). In addition, most components have online documentation that is consulted as an extra source of information. However, as the authors typically write this documentation themselves, it is questionable how independent this data is. A third source of data for selection consists of anecdotal experience reports in the form of tutorials or blog posts. These documents often describe a simplified use case of the component, making it hard to judge if the proposed solution will work in complex (real-life) situations. As a last source, many hosts of components provide metadata on the component: number of downloads, dependencies, versions, last modify date, etc. In Figure 8.1, an example of this data is given for the 'Rest' Ruby Gem5.

Architects have to base their decisions on these sources, but to get real knowledge about the usage one has to experiment with a sample implementation, which is very time-consuming and expensive. Our work fills this gap between the available documentation and the expensive experiments with a data-driven solution that helps architects to acquire data about the usage of components without having to make the investment for the experimentation. The data is based on the actual implementations that have been done previously with specific components. Statistics on the usage can be used to find the possible alternatives, while the commit data can provide rationale or access to decision makers that made a similar decision before. It helps decision-makers to avoid the 'first fit' trap [82] by making knowledge about alternatives easily accessible.

FIGURE 8.1: Publicly available Component Data for the Rest Ruby Gem

8.2.3 Vision: Decision Support based on Big Software Data

The previous subsection described how decision makers are forced to make decisions based on limited decision data. The needed data on made decisions is hidden in project histories. We propose a methodology that enables better decision making based on big data that is mined from real-world repositories. Because the data needs processing before it can be used for decision support, two parts need to exist: the acquisition and processing of the available data (Data Mining), and making the data accessible for architects (Decision Support). In Figure 8.2 these two aspects are visualized. When the Decision Maker faces a problem that involves a component selection, they can use the Decision Explorer to find similar decisions made by others previously (stored in the Decision Database). This database is filled based on the data from Project Repositories. From these repositories, decisions have been extracted by identifying relevant Deltas.

FIGURE 8.2: The Envisioned System: Decision Support and Data Mining

With this envisioned system, the decision maker can base their decisions on data from real-life projects. To the knowledge of the authors of this manuscript, a system like this does not currently exist. In this work, we investigate how to develop such a system.
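The chapter does not prescribe a concrete data model for the Decision Database. Purely as an illustration, a minimal sketch of the kind of records such a database could store is given below; the type and field names are assumptions, not the authors' actual implementation.

```python
# Hypothetical record types for the envisioned Decision Database; names and
# fields are illustrative assumptions, not an existing implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Delta:
    """One candidate component replacement observed in a single commit."""
    project: str        # repository the commit belongs to
    commit_id: str      # identifier of the commit in the version management system
    removed: str        # component removed in the commit
    added: str          # component added in the same commit
    message: str        # commit message, a possible source of rationale
    author_email: str   # contact point for additional rationale


@dataclass
class CandidateDecision:
    """Aggregated view over all deltas that replace the same pair of components."""
    removed: str
    added: str
    occurrences: int = 0
    deltas: List[Delta] = field(default_factory=list)
```

The Decision Explorer would then essentially query such aggregated records ranked by the number of occurrences.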

8.3 Research Approach: Mining Decisions

8.3.1 Architectural Decisions

In research about architectural design decisions [23] [180], typically four aspects of decisions are considered: the decision topic, the choice, the alternatives that are considered and the rationale (sometimes formalized as ranking) of the decision. We discuss these four aspects of architectural decisions, and describe how we use these aspects to identify decisions in repository data of open source projects.

• Decision Topic. The decision topic is the actual problem that needs to be solved. Often, these topics arise from previous decisions (e.g. we decided to base our application on NoSql technology, which specific database product are we going to use?), or from non-functional requirements (e.g. how are we going to ensure our up-time is high enough?)

• Choice. The choice, or decision, is the result of the decision process. Often, this is the only part that is communicated (discussed or documented).

• Alternatives. A typical decision has more than one alternative to choose from. Alternatives can be just named (e.g. different component names), or sometimes architecture parts are considered as alternatives (different styles or patterns, or comparing specific implementations of components). In rare cases, the alternatives are realized and compared as a Proof of Concept or Proof of Technology.

• Rationale. The rationale of a decision describes, often in plain text, why the chosen alternative(s) solve(s) the problem at hand, and why the chosen decision is the best solution.

The focus of this chapter is on design decisions that change after the initial implementation of the system, during development or maintenance. These decisions express themselves through changes in the version management system, i.e. commits of new and changed code. All of the previously mentioned aspects of a design decision can be located in the version history or implementation of the system. The decision topic and the choice have a reflection in the (architecturally relevant) commits. The rationale for the decision is ideally reflected in the commit message, and the author of the commit can be contacted for additional rationale. Alternatives can be found in the history of the architecturally relevant commits. Table 8.1 summarizes this.

TABLE 8.1: Reflection of Decisions in Version Management

Decision Concept | Version Management Concept
Topic and Decision | Commit
Rationale | Commit message and author information
Alternatives | Structure of commits

There are different abstraction levels of architectural decisions. As described by de Boer et al. [23], decisions are often related to each other, and this relationship typically forms a tree structure down from more abstract to more concrete (decisions cause new decision topics). Figure 8.3 symbolically visualizes such a graph. Generally speaking, three levels of decisions can be distinguished:

FIGURE 8.3: Relationships between Decisions

1. High-level decisions. High-level decisions affect the whole product, although they are not necessarily always the decisions that are debated or thought through the most. Often, people that are not involved in the realization of the project (e.g. management or enterprise architects) heavily affect these decisions. Typical examples of high-level decisions are the choice to adopt an architectural style (e.g. service-oriented), use a programming language, use high-level systems (e.g. a service bus implementation) or a specific application server. Changing these decisions typically has a huge impact on the architecture of the system.

2. Medium level decisions. Medium level decisions involve the selection of specific components or frameworks, or describe how specific components map to each other according to specific architectural patterns. These decisions are often debated in the architecture and development teams and are evaluated, changed and discarded during development and maintenance of the system. They have a high impact on the (non-functional) properties of the product and are relatively expensive to change.

3. Realization level decisions. Realization level decisions involve the structure of the code, the location of specific responsibilities (e.g. design patterns), or the usage of specific APIs. These decisions are relatively easy to change, and have a relatively low impact on the properties of the system.

As we have experienced in our industrial cases [184], the architectural decisions that are hardest to make are the medium level decisions, for the following reasons:

• These decisions have a high impact on the functional and non-functional properties of the system

• they change constantly, especially compared to high-level decisions that only change when remaking the system

• they are costly to change because of the impact on the system

• because new components and versions are created constantly, it is hard to stay knowledgeable about relevant alternatives

• they have unpredictable results until they are implemented in the system.

The focus of this chapter is on medium level design decisions that change during development or maintenance.

8.3.2 Mining Big Software Data

Big Software Data [143] enables researchers to learn from the history of projects at a large scale. To understand component evolution based on project history, Figure 8.4 shows component changes in three projects schematically. This figure visualizes component changes in different projects over time, where a '+' means the addition of a component and a '-' means a component has been removed. This historical data provides a great source for finding CSDs. Component changes are reflected on several levels of abstraction in projects.

FIGURE 8.4: Project History concerning Component Change

• File - Level. When a project evolves, the source files of the system change. These changes are tracked in the version management system, often as changes per line (added, removed or modified lines). These changes can imply a change in component use.

• Commit - Level. This means a set of file changes (a commit) reflect a change, for example the change of a component. In earlier version management systems, this was the default unit of change. Changes were pushed to a server containing the latest version of the total system.

• Pull Request - Level. In modern version management systems like Git or Mercurial, commits are grouped together as pull requests. This makes it possible to create small increments of changes by doing commits locally, while being able to deliver a set of related changes at once to the master server. However, as Kalliamvakou et al. point out [105], pull requests are not used with enough discipline to know if this abstraction level is usable.

• Release - Level. A set of commits or pull requests can be bundled as a release. When releasing to a production environment, the right component configuration needs to be in place. Like the pull request - level, this level can be seen as a set of component - level changes.

Component changes can be found on all levels. For the remainder of this chapter, we consider commit - level CSDs, even though we do analyze single lines in the detection process. The commit is used because this is a level that is available as a first-class entity in all version management systems, and it always has a date, a commit message and an author: information that is essential when looking for the decisions and rationale of decisions. We consider the changes based on pull requests and releases as potential future work.

The evolution of source code is captured in the version management systems of projects, and the decisions are also traceable in these systems. However, it depends on the version management system and the programming language structure how they are represented, and how they can be extracted. When looking at the basic operations on component changes in projects, we encounter three distinct situations [186], illustrated with a small sketch after the list below:

1. A component was added in a commit. This means that somewhere in the project, at least one line was added with a name of, or reference to, the component. A typical example of this is the C #include statement. Component addition is an example of the 'Existence decision' of Kruchten [114].

2. A component was removed in a commit. Removing a component is reflected in the version management system by modifying import statements, configuration files, and / or removing code and files. Kruchten calls this type of decision the 'Ban or non-existence' [114].

3. A component was replaced in a commit. In the third situation, an explicit decision was made to remove one component and introduce another one. If this happens in the same commit, this is a strong indication that one component was replaced by another, hence an alternative with potentially similar functionality.
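How these three situations could be told apart for a single commit is sketched below. The sketch assumes that the sets of component names added and removed in the commit have already been extracted; the function and its inputs are illustrative, not the chapter's actual tooling.

```python
# Illustrative sketch: classify the component changes of one commit into the
# three situations described above. How the added/removed names are obtained
# is language-specific (see Section 8.4).

def classify_commit(added: set, removed: set) -> dict:
    additions = added - removed      # situation 1: component added
    removals = removed - added       # situation 2: component removed
    # situation 3: each (removed, added) pair is a candidate replacement (delta)
    deltas = [(old, new) for old in removals for new in additions]
    return {"additions": additions, "removals": removals, "deltas": deltas}

# A commit that removes mysql and adds pg yields the candidate delta (mysql, pg).
print(classify_commit(added={"pg"}, removed={"mysql"}))
```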

Adding components to a project happens most often (in our initial study, 62% of the found changes were additions versus 38% removals [186]). However, the only information that one can get from this is that the component was introduced into a project. So, this data does not add very much to existing data sources like the total downloads shown in Figure 8.1. Sometimes, it is not even possible to check if the component is actually still used, even though it is still included in the project (e.g. a library is included but it is never used in the source code).

The second situation (component removal) contains more valuable information for CSD detection: someone made an explicit choice to stop using a component. However, it is difficult to know if this component is removed in this specific commit, or whether the project stopped using the component some time ago and the explicit removal happens during a cleanup of the code, long after the decision was made.

The third situation is the most promising, where it is known that a component was added and removed in the same commit. This increases the probability of the two components being related and perhaps even being alternatives for each other. We use the delta (∆) as a concept to describe replacements of components (adding one component and removing another in the same commit). In Figure 8.5 this concept is visualized. On the left side, the processed commits are shown for two distinct projects. The first commit (10defd0...) removed two components (Rest and jQuery UI), while adding one (Bootstrap). This leads to two deltas (and hence candidate CSDs): Rest→Bootstrap and jQuery UI→Bootstrap. The deltas can be calculated for all the commits in a set of projects. Then, the found candidate decisions can be summed across all projects to see what changes happened at what frequency. An example of a way to visualize this graph is shown in Figure 8.6, where only the interaction with Bootstrap is visualized for the example above. Associated with such an identified decision, the commit can provide additional data. For example, the commit message can contain rationale on the decision. Also, the time stamp on the commit can help to search for trends in component replacement. Last, the author information provided by the commit can help to contact the decision maker for additional rationale on the decision.

FIGURE 8.5: The Design Decision Extraction Process

FIGURE 8.6: Component Replacements

A single found deletion and addition is no evidence that the components were actually replacements for each other. For example, in the scenario used above, the Bootstrap component is no replacement for the Rest component. If deltas occur often (e.g. in several unrelated commits from different projects), the chances decrease that the found replacements are incidental. If this mechanism is applied to a large number of projects, a weighted graph can be created that provides insight into CSDs that have actually been applied across many projects.
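A minimal sketch of this aggregation step is shown below: a plain counter stands in for the weighted graph, and a threshold filters out replacements that were only seen incidentally. The function and its inputs are illustrative assumptions, not the chapter's actual implementation.

```python
# Illustrative sketch: aggregate candidate deltas from many commits across many
# projects into a weighted replacement graph, keeping only edges that occur at
# least `threshold` times.
from collections import Counter

def replacement_graph(deltas, threshold=2):
    """deltas: iterable of (removed_component, added_component) pairs, one per
    delta found in any commit of any analyzed project."""
    weights = Counter(deltas)
    return {edge: count for edge, count in weights.items() if count >= threshold}

observed = [("rest", "bootstrap"), ("jquery-ui", "bootstrap"),
            ("jquery-ui", "bootstrap"), ("mysql", "pg")]
print(replacement_graph(observed, threshold=2))
# {('jquery-ui', 'bootstrap'): 2}
```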

8.3.3 Research Questions

In order to assess if the envisioned system can be constructed, several questions arise. First, the suitability of programming languages for mining decisions should be explored. Second, the extraction process should be tested; is it possible to identify the decisions and to relate them to each other so statistical data can be acquired on them? Based on our previous work with a small set of Ruby projects, we have the indication that this is feasible. However, is this approach effective and can it be scaled to a larger number of projects? This leads to the following research questions for this research.

• RQ1: How do different languages compare in their suitability for mining CSDs?

• RQ2: How effective can CSDs be mined in a scalable way?

• RQ3: Do the identified CSDs provide sufficient information to base the decision process on?

Because each of these questions required a different approach, we have included the experimental setup in the corresponding sections. In the following, the research questions are addressed sequentially.

8.4 Programming Language Comparison

8.4.1 Experimental Setup

In this section, the first research question is assessed: RQ1: How do different languages compare in their suitability for mining CSDs? First, the criteria for the programming languages are described. Then, the selection of the languages is described, followed by an analysis of the applicability of our approach to each of these languages. This section concludes with a comparison of the programming languages based on the described criteria.

8.4.2 Criteria for Programming Languages on Suitability for Mining Decisions

Based on the theoretical description of the mining process in the previous section, we describe the suitability of programming languages in this section. Three criteria are used to assess the suitability of a language:

1. There has to be a large number of available projects that use the language. These projects can contain the data for CSD mining systems. We investigated several hosting services for open source projects (e.g. GitHub, Sourceforge, BitBucket). However, as GitHub hosts 50 times more projects compared to the largest competitor, we decided to base our analysis on GitHub projects. We counted the total number of projects (from GitHub6) and the number of active projects (from Githut7) as a measure for the available data.

2. The structure of the language (and supporting tools) needs to enable mining of data based on the version history of projects. For this, we identify what component management systems (usually in the form of dependency management systems) are commonly used, and the suitability of these systems for discovering CSDs.

3. A language has to have a solid component ecosystem that facilitates intensive component (re)use. For this criterion, we analyze if there is a generally accepted location for finding components, and check how many components these locations host.

When these three criteria are met, the programming language can potentially be used to create a decision database on which decision support can be based.

6 Checked on 18-4-2016 from http://github.com
7 Based on Q4 of 2014 from http://githut.info/


8.4.3 Programming Languages

We have analyzed a variety of different programming languages. The programming languages with the most active repositories on GitHub8 were analyzed. Shell, R and VimL were skipped since they are not programming languages that are used extensively for consumer products. In addition, we also skipped CSS as this language is always integrated with other languages (e.g. HTML, JavaScript) and it does not use components in the language itself. Next, we discuss all of these programming languages, and describe where data on component selection is located. Per language (group), the involved files are described, and how they could be parsed to obtain relevant CSD data. Also, we describe the three criteria per language.

JavaScript was introduced in 1995 as a functional programming language that originated as a client-side scripting language for managing dynamics and interaction for HTML websites. Lately, JavaScript got growing attention because of asynchronous communication as a server-side implementation with Node.js9. A huge number of projects is available in JavaScript: 324K active projects and 2034K total repositories with JavaScript as their main language. Even though libraries and components are used in client-side JavaScript, the inclusion of them is usually done across different files of different types (e.g. JavaScript, static HTML or dynamic server-generated HTML). This makes it very difficult to process component decisions in client-side JavaScript. There are however tools that enable dependency management in one location for client-side JavaScript: Browserify10 or Bower11. With Bower, the components that a project uses are defined in a single JSON file: bower.json. This file would be perfect to mine for CSDs. The server-side Node.js framework uses npm12 to manage dependencies, which are defined in one file: package.json. Here, the 'dependencies' section describes the components that the project uses. This data can be used to trace the CSDs. There is an active community around the Bower and npm tools that provides a large set of components to the community.

Java is used extensively in the open source world. With 223K active and 1890K available projects, a large dataset is available for mining CSDs. The Java programming language has a rich history in dependency and build management. Currently, three options are most often used: Maven, Ivy and Gradle. In all of these solutions, a structured file is used (Maven and Ivy use XML and Gradle uses JSON) to describe the components that are used. The used files have a specific name and location (pom.xml, ivy.xml and build.json), so they are relatively easy to find and process. However, because there is no one accepted way to define components (many frameworks use different definitions of components), there is no single point where the components can be discovered with ease.

Python is a popular language for open source development, industry, as well as academia. It has more than 100K available projects just on GitHub. The pip package manager is commonly used to manage the required dependencies for projects. The requirements.txt and setup.py files define, per project, what libraries and components are used. This file can be used similarly to the Ruby Gemfile as described in the previous section to mine for CSDs. Components can be found easily on the pypi website13, where 56K components are hosted.

8 Based on data from http://githut.info
9 http://nodejs.org/
10 http://browserify.org/
11 http://bower.io/

PHP is a rich language that has many extensions, components and frameworks. With 848K projects, it has a sufficient amount of project data available. Dependency management is not part of the language itself; packages can be included in basically any file in the project. However, the composer library manager is used in many (professional) projects for managing the components in the project. It uses a composer.json file that defines in JSON what specific components are used in the project. This file could be used with our method to mine CSDs. Composer components are listed at packagist14, where around 52K components are available.

C and C++ are less popular in the open source community, with 87K and 73K active projects (536K and 481K total) available. Because the implementation of C / C++ parsers and tools varies across the different platforms, no real dependency manager exists. This makes it difficult to identify component changes across projects. The data exists, but it is scattered across potentially any file in the project with an #include statement. These #include statements can include external components, but they can also refer to other components within the same project, which can pollute the data. There is no single location available for discovering or downloading components.

C# has around 56K active projects on GitHub, and around 467K GitHub repositories are tagged as using C#. In .NET projects (C#), the typical structure is to have a solution file (.sln), which counts as the single point of entry for the project. This file refers to potentially multiple .csproj files (C# project files). These XML files define, among other things, the components that are used. The challenge with C# projects is that the data is scattered across the project, so it is hard to identify a single point where the changes can be found. Components for C# (.NET) projects are available at NuGet15, where around 35K unique components are available.

Objective-C is developed mainly for creating mobile apps for the iOS platform. A small base of 37K active projects is available on GitHub from the 301K total. Components are defined and distributed as Cocoapods16. The used components are summarized in one file per project: the Podfile. This file has a similar structure to the Ruby Gemfile as it is an unstructured summary of used components: a line containing 'pod x' means that component x is used in the project. Hence, Objective-C projects could be mined perfectly for CSDs. Components are available on cocoapods.org (around 7K).

Go, even though it is a very new language, already has 22K active projects available on GitHub (119K total). In Go, dependency management was implemented from the beginning. In the Godep JSON file, all the used components are defined. With the Godep tooling, this file is used to update the environment with the right components and versions of these components. The Godep file would be a perfect source for the component decisions. On Godoc17, already 60K components are available.

13 https://pypi.python.org/pypi
14 https://packagist.org/
15 http://www.nuget.org/
16 http://cocoapods.org/

TABLE 8.2: Project Availability of Programming Languages

Language | Active Projects | Total Repos | Conclusion
JavaScript / Node.js | 324 K | 2,034 K | ++
Java | 223 K | 1,890 K | ++
Python | 165 K | 1,041 K | ++
PHP | 139 K | 848 K | +
Ruby | 133 K | 1,001 K | ++
C++ | 87 K | 536 K | +
C | 73 K | 481 K | +
C# | 56 K | 476 K | +
Objective-C | 37 K | 301 K | +
Go | 22 K | 119 K | +

8.4.4 Results

This subsection provides a summary of the languages and their suitability for mining CSDs, as described in the previous section. All languages have an enormous number of projects available. Even the language with the least active projects (Go) has 22K projects available.

All of the modern heavily used programming languages have the capability to provide the decision data as described earlier in this chapter. In some languages, the extraction process is fairly simple, as there is one file that describes the used components with one component per line (Python, Ruby, Objective-C). Other languages use a structured JSON or XML file that contains the same expressive power (JavaScript Node.js, Java, PHP, Go). However, in order to acquire this data, parsing the changes is more complicated, as the specific context should be taken into account. Some other languages have a more complex structure to mine, where the component dependencies are scattered across the project files (JavaScript, C++, C), or scattered across specific files (C#). With these languages, changes to the whole project should be considered instead of commits on a specific file. However, the definitions of the component usages have a simple structure (one line per component), so parsing them once they are found is possible.
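To illustrate the difference in effort, the sketch below contrasts the two easier cases: a line-oriented manifest (such as a Gemfile or requirements.txt), where one line corresponds to one component, and a structured manifest (such as package.json), where the JSON context has to be parsed first. Both parsers are deliberately simplified assumptions, not complete implementations; real manifests carry more syntax (versions, groups, comments) than handled here.

```python
# Simplified illustration of the two manifest styles discussed above.
import json
import re

def components_from_gemfile(text):
    """Line-oriented manifest: one gem line per component."""
    return re.findall(r'^\s*gem\s+["\']([^"\']+)["\']', text, flags=re.MULTILINE)

def components_from_package_json(text):
    """Structured manifest: the component list sits inside a JSON object."""
    return sorted(json.loads(text).get("dependencies", {}).keys())

print(components_from_gemfile('source "https://rubygems.org"\ngem "rails"\ngem "pg"\n'))
print(components_from_package_json('{"name": "demo", "dependencies": {"express": "^4.0.0"}}'))
```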

Most languages provide a specific location where data about components can be found. Typically, open source components are shared there. With Python, Ruby and Go, the use of the component mechanism and ecosystem is part of the best practice of the language and therefore these provide the most interesting cases for our approach.

Table 8.5 shows the results of the three assessed criteria for all programming languages. To conclude, seven of the ten assessed programming languages are suitable for CSD extraction. For the three C variants (C, C++ and C#), this is somewhat more difficult, mostly because the references to components are potentially scattered through the projects. So, concerning RQ1: How do different languages compare in their suitability for mining CSDs?, it can be concluded that the methodology is suitable for most of the assessed languages, but some languages will be easier to implement than others.

TABLE 8.3: Ease of Mining of Programming Languages

Language | Tooling | Data Location | File Type | Conclusion
JavaScript / Node.js | N/A | Node.js: package.json or bower.json | JSON | +/-
Java | Maven, Ant/Ivy, Gradle | pom.xml | XML | +
Python | Pip | requirements.txt | Text | ++
PHP | Composer | composer.json | JSON | +
Ruby | N/A | Gemfile | Text | ++
C++ | MS studio | #include statements | N/A | --
C | MS studio | #include, makefile, configure.in | N/A | --
C# | MS studio | .sln / .csproject & .vbproject | Text / XML | -
Objective-C | cocoapods.org | Podfile | Text | ++
Go | Godeps | Godeps.json | JSON | +

TABLE 8.4: Programming Language Component Ecosystem

Language | Component location | # Available Components | Conclusion
JavaScript / Node.js | https://www.npmjs.com/, http://bower.io/search/ | 475K / 22K | +
Java | https://search.maven.org | 216K | +
Python | https://pypi.python.org/pypi | 126K | ++
PHP | https://packagist.org/ | 165K | +
Ruby | https://rubygems.org/ | 139K | ++
C++ | N/A | N/A | -
C | N/A | N/A | -
C# | https://www.nuget.org/ | 102K | +
Objective-C | http://cocoapods.org/ | 7K | +
Go | http://godoc.org/ | 60K | ++

TABLE 8.5: Summary of Programming Language Suitability

Language | Project Availability | Ease of Mining | Ecosystem
JavaScript / Node.js | ++ | +/- | +/-
Java | ++ | + | +
Python | ++ | ++ | ++
PHP | + | + | +
Ruby | ++ | ++ | ++
C++ | + | -- | -
C | + | -- | -
C# | + | - | +
Objective-C | + | ++ | +
Go | + | + | ++

8.5 Decision Mining

8.5.1 Introduction

This section addresses the second research question: RQ2: How effective can CSDs be mined in a scalable way? We have extended our work on the proof-of-concept on mining open source Ruby projects [186] in order to show this.

8.5.2 Experimental Setup

For the proof-of-concept implementation, a set of Ruby repositories from GitHub was cloned and the relevant history was processed into a database. As a start, all the projects were cloned (creating a copy of the whole project history) to a local computer. From these repositories, all the changes (commits) on the Gemfile18 were collected. The Gemfile contains a list of all the components that are used in a Ruby project, so the history of this Gemfile reflects the component usage in a project.

We looked at the lines that changed between commits on the Gemfiles of all analyzed projects. Since we were looking for decisions that involve component selection, we focused on the changed lines of the commits representing the change of components (in this case, the lines that started with "gem").

Every commit on the Gemfile is taken, and every line that changed in the Gemfile within the commit is processed. To do this, we have automatically processed the output of the git log command, which outputs the history of a file. As an example, a fragment of the git log output of one of the selected projects (factory_girl) is presented in Figure 8.7. The following data can be extracted from this fragment: the commit-id, the author of the commit (anonymized for privacy reasons), the date, and the commit message (in this case, "rr ⇒ mocha"). After that, the lines that changed are displayed subsequently (added lines with a plus (+) sign and removed lines with a minus (-) sign). Git log offers opportunities to customize the output, which we used to process the data.

commit 554e6ab378a3c10a28d9...
Author: ##### <###@###.com>
Date: Fri Aug 12 22:06:10 2011

    rr => mocha
...
-gem "rr"
+gem "mocha"
+gem "bourne"
...

FIGURE 8.7: A Fragment of a Git Log Output Example

TABLE 8.6: Acquired Data

Parameter | Initial Dataset | Final Dataset
# projects imported | 620 | 1,318
# commits | 12,413 | 22,270
# changed lines | 43,053 | 71,745
# added lines | 26,665 | 44,662
# removed lines | 16,388 | 27,083

The described process was fully automated, so no human interpretation was necessary for acquiring the data sets. From the local project repositories, we created insert scripts for all the commits and all the lines that were changed in each commit. This was inserted into our database for further analysis. Our previous work describes the extraction process in more detail [186]. We used the data from our previous work as the initial dataset, which we extended with a new set of projects. A summary of the numbers of used projects and commits is presented in Table 8.6. The software for mining the repositories, as well as the insert scripts for replicating the used dataset, can be found in the replication package of this research19.

Components often have dependencies on other components [1]. This can be a problem in our approach, as the dependency of a component can also become an identified change. However, when working with Ruby projects, the Gemfile identifies the required components, and the software that installs the components (Bundler) handles the dependencies. So, the transitive dependencies are not explicit in the Gemfile and therefore do not corrupt the data.
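A minimal sketch of the described extraction step is shown below. It assumes a local clone and the git command line tool being available; the actual Gitminer implementation is in the replication package and differs in detail, so this is only an illustration of the idea.

```python
# Illustrative sketch of extracting Gemfile changes per commit from a local clone.
import re
import subprocess

def gemfile_changes(repo_path):
    """Yield (commit_id, author_email, message, added_gems, removed_gems) for
    every commit that touched the Gemfile."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--follow",
         "--format=@@@%H|%ae|%s", "--", "Gemfile"],
        capture_output=True, text=True, check=True).stdout
    for block in log.split("@@@")[1:]:
        header, _, diff = block.partition("\n")
        commit_id, author_email, message = header.split("|", 2)
        added = re.findall(r'^\+\s*gem\s+["\']([^"\']+)', diff, re.MULTILINE)
        removed = re.findall(r'^-\s*gem\s+["\']([^"\']+)', diff, re.MULTILINE)
        yield commit_id, author_email, message, added, removed
```

For the factory_girl fragment in Figure 8.7, such a pass would report mocha and bourne as added gems and rr as a removed gem for commit 554e6ab...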


8.5.3 Results

This section addresses RQ2: How effective can CSDs be mined in a scalable way? To analyze the results, we have selected a subset of the components that affect the selection of database technology with the Relationship Visualizer [186]. Appendix B shows four graphs in two dimensions. The first dimension is the threshold for how often a specific change (e.g. component A is replaced by component B) occurred (N) at least. So, only if more than 9 commits were found with this specific change are they included in the upper part of the appendix. The other dimension is the number of projects used (and hence, the number of commits the graph is based on). This is divided into our initial and our final data set.

Even though the four graphs consider the same subject, depending on the dataset and threshold they look quite different. If the threshold is lower, more alternatives are shown (e.g. pg -> sqlite3 is shown in the bottom graphs but not in the graphs on top). If more projects are considered, more alternatives are found. If the number associated with an arrow is higher, more decisions have been found for this specific CSD. The lower threshold views can be used as an exploration strategy to find non-trivial alternatives.

For the shown dataset, we used a threshold of >5 to be safe and avoid incidental commits being counted as CSDs. As we showed in our previous research, 62% of the identified commits were CSDs [186]. So, the chance that 6 commits describe the same change that is not a CSD is very low (0.38^6 ≈ 0.003). Increasing the dataset should be weighed against the time it takes to collect and analyze the data (in the current implementation it takes about twice the time to generate the views if the data set doubles).

In order to validate that the data becomes more meaningful when the number of projects increases, the data from our previous work [186] was extended with a new set of projects. We counted the total number of candidate CSDs we found. Table 8.7 summarizes the results. In this table, the first column shows how often a certain CSD was found at least (N) in the data set. We wanted to make sure the proposed methodology scales. We checked if the number of decisions increases when data from more projects is added. This means that the decisions found are not just coincidental. Second, we assume that there is a significant overlap between decisions made in different projects. This would imply that more common decisions have a higher number of occurrences if the total dataset size is increased. The validity of these assumptions was checked with the data in Table 8.7. It shows that the number of deltas grows when the number of commits grows.

Based on the previous research where 62% of the found commits were decisions, we converted the deltas to (potential) CSDs. In Table 8.8, we show how much the number of CSDs grew and how much it grew compared to the total number of commits.

The number of CSDs that occur more often (higher N) grows if more projects are considered. This confirms that there is a significant overlap between decisions made in different projects. To conclude, we can say that by increasing the number of projects, the strength of the decisions increases (more occurrences of the same decision are located where rationale can be found), and the total dataset with decisions is larger, so more decisions can be located. So, concerning RQ2, we have shown that scaling the CSD mining process to larger quantities is feasible.


TABLE 8.7: Number of Deltas Identified

N | Initial Dataset | Final Dataset
2 | 4,865 | 6,852
3 | 770 | 1,231
4 | 188 | 431
5 | 80 | 201
6 | 50 | 124
10 | 18 | 51

TABLE 8.8: Summary of CSD Growth

N | Initial: #CSD / #commits | Extended: #CSD / #commits | Absolute CSD Growth | CSD Growth per commit
2 | 0.33549 | 0.26337 | 141% | 79%
3 | 0.05862 | 0.05224 | 160% | 89%
4 | 0.01483 | 0.01895 | 229% | 128%
5 | 0.00639 | 0.00895 | 251% | 140%
6 | 0.00402 | 0.00555 | 248% | 138%
10 | 0.00145 | 0.00229 | 283% | 158%

8.6 Accessing Decision Rationale

8.6.1 Experimental Setup

In the previous section, the extraction of decisions is confirmed. However, the applicability of these decisions is not assessed. Often, the rationale of the decision [189] is more important than the decision itself. This section describes how the found decisions can be used to access relevant rationale.

The first data source for the decisions is the statistical data from other people making similar decisions; how often did others make decisions, how often did they choose alternatives. As a second source, the commit messages can be used to acquire rationale for the decision. In the case a decision is made more often, this dataset contains a lot of decision rationale, as every identified commit has a commit message describing why the changes were made.

In some situations, it can be necessary to have additional knowledge on the made decision. This can partially be found in the additional data from the mined CSDs. Every commit in a repository has an author. From this author, the name and email address are known. So, contacting a decision maker for additional rationale on the decision can be fairly easy. However, as this is data from open source projects, it is unknown if the decision makers will respond to a call for help. Also, if they respond, it is unknown if their answers contain useful rationale. In order to address the usefulness of the methodology, this section analyses the last research question: RQ3: Do the identified CSDs provide sufficient information to base the decision process on?

We have conducted an experiment where we contacted decision makers for additional rationale. This experiment is not intended to enrich the current dataset with rationale data, but to see if the commit meta-data can be used to contact relevant peers for decision rationale.

8.6.2 Analysis of Rationale from Commit Messages

In order to identify whether commit messages and the information about removed and added components are good indicators of design decisions, we have presented 100 different commits to six subject matter experts. In order to get these commits, we randomly picked 100 commits from the Gitminer database that had commit messages of more than 30 characters (and therefore had a solid chance of containing rationale). We distributed the commits among our subject matter experts. Half of the experts got the first 50 of these commits, the other half got the last 50. Two researchers (one contributing to this chapter and one external) judged all 100 commits. The participants that conducted the research were experienced Ruby software developers, experienced software architects, and researchers with a software engineering background. We have asked them to answer, per commit, the following questions:

• Does this commit involve a design decision?

• Does this fragment contain rationale for a decision?

• Does this fragment give relevant information about alternatives for a deci-sion?

The core results of the validation are presented in Table 8.9. As shown in this table, according to our experts more than half (61,75%) of the selected messages contained decisions. It is interesting to note that there was a significant difference in the recognition of decisions between the researchers and the experts for the same data set. The researchers found decisions in 66% of the commits, and the experts in 57,75% of the commits. The existence of rationale in the data was discovered in more than a quarter of the messages (25,5%). So, about 56% of the identified architectural design decisions lacked any rationale. The subject matter experts discovered rationale slightly more often than the researchers (27,5% vs. 23,5%). The alternatives were much harder to find. Alternatives were only found in 4,75% of the commits. The distribution between researchers and experts was almost the same.

TABLE 8.9: Quantitative Results

Group | Decision? | Rationale? | Alternatives?
All % Yes | 61,75% | 25,50% | 4,75%
All % No | 38,00% | 68,75% | 84,00%
All % Empty | 0,25% | 5,75% | 11,25%
Researchers % Yes | 66,00% | 23,50% | 5,00%
Subject Matter Experts % Yes | 57,75% | 27,50% | 4,50%

The participants were given the opportunity to describe their experiences. One software engineer was surprised by the succinctness of the commit messages and the lack of information in them. His experience in company projects was that commit messages were used much more to communicate decisions. He was able to show examples of this to the researchers easily from a company repo. The lack of good information in all the commits could be caused by our selection of open source projects. One architect was really enthusiastic about the possibility to contact others that have made the same decision (the author information provided with the commit).

During our analysis of the data collected with the Gitminer tool, we found qualitative results in addition to the quantitative results. We identified different aspects related to design decisions that we used as expert validation:

• There were commit messages that indicated changes of components and rationale about them. E.g. "Bundler and Jeweler not playing well. Removing Jeweler" (jeweler), or "use mysql2 instead of mysql because of shit encoding" (mysql).

• Commit messages where a decision is made, but the rationale was clearly missing: "Changed to jeweler2" (jeweler), or "remove thin" (thin)

• Some commit messages indicated cleanup: "Don't need json gem dependency." (json), or "Do not depend on rack directly" (rack).

• Several messages described configuration issues: "Make compatible with ruby 1.9" (ruby-debug), or "Unfortunately, we can’t put ruby-debug in the gemfile because it breaks 1.9.2 compatibility. Just put it back in locally when you want to use it, or figure out how to do a switch by ruby verison in the Gemfile" (ruby-debug)

• Some commits were done because of certain non-functional requirements: "rcov/coverage makes the specs take a) 2x as long to boot and b) slows down actual specs by about 25%" (rcov), or "Use 1.9.3-p194; replace rcov with simplecov. Future commits will turn simplecov on in all situations. (According to its documentation, simplecov is pretty fast.)" (rcov)

8.6.3 Acquiring Tacit Knowledge from Decision-Makers

Besides the found rationale in the commit messages, we were interested to see if decision-makers were willing to share additional tacit knowledge. For this, we used semi-automated emails to contact the authors of commits. Semi-automated means that we have created templates that have been automatically filled and sent to the authors of the commits based on the commit content. The emails were sent for a random set of commits from the dataset. This enabled us to see if the email addresses were usable (and real), and to see if the authors are willing to provide rationale for the decision. We constrained the data set in the following way:


• To make sure we use diverse decisions, we have randomly selected the decisions from our database.

• We have taken decisions that were involved in at least two different commits (N>=2) to avoid coincidental component replacements as much as possible.

• The emails were sent to unique persons, so no author received more than one email, to avoid being reported as spam.

Two different templates were used to base our emails on (see Appendix A). In one of the emails we explicitly stated we were conducting research and needed some help; in the other email we asked the author about rationale for the decision we found. The two versions were used as we did not know how many reactions we would get. If the reply rate had been low, we could have sent more emails of the most successful version. We took the decisions that occurred most often in our data set, and prepared emails for them. We sent out a total of 100 emails to different authors, 50 of each version. The number 100 was based on a balance between several things. On the one hand, a larger number would give a better statistical result. On the other hand, as the intention of this experiment was not to create a large database of replies, it seems unethical to let a very large number of people write a serious answer without having a serious question. So, it was decided to send out 100 emails in order to validate the questions about the quality of the author data.
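The actual templates are in Appendix A and are not reproduced here. Purely as an illustration of the semi-automated filling step, a sketch with placeholder wording could look as follows; the template text and field names are assumptions, not the templates that were actually sent.

```python
# Illustrative sketch of filling an email template from mined commit metadata;
# the wording is a placeholder, not one of the templates from Appendix A.
HELP_TEMPLATE = (
    "Hi {author_name},\n\n"
    "I noticed that in commit {commit_id} you replaced {removed} with {added}.\n"
    "I am facing a similar choice. Could you share why you made this change?\n"
)

def fill_template(template, delta):
    """delta: mapping with the metadata mined for one candidate CSD."""
    return template.format(**delta)

email_body = fill_template(HELP_TEMPLATE, {
    "author_name": "Jane", "commit_id": "554e6ab",
    "removed": "rr", "added": "mocha",
})
```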

8.6.4 Results

As shown in the previous section, roughly 60% of the commits on Gemfiles were considered as concerning a design decision. For our whole dataset, this would mean that 60% of the 7527 commit messages contain decisions (about 4500). Of course, the other commit messages (with <30 characters) could also contain decision information, so this number could very well be higher. Calculated in the same way, about 1900 commit messages contain rationale about made decisions. When relating this to the number of projects, on average every open source project we used contained 6 decisions in commits and 3 commit messages with relevant rationale. However, this data is much more useful when looking at all the projects in the Gitminer system. When applying a threshold for a certain number of projects involved (say, 10 projects), the chances of having relevant decisions or rationale are significant. Architects can use this information to make much better-informed decisions, because they are based on more projects that made a similar decision, thus having more statistical relevance.

From the 100 emails we sent out, 10 turned out to have invalid email addresses. From the remaining 90 emails, 32 received a reply (response rate 35.6%, N=90). The vast majority of these reactions (two thirds, N=32) came in within 24 hours. Other studies that contacted Github users by email to fill in surveys got lower response rates, even though they sent the emails to active users: 23% (N=1,160) and 14.1% (N=10,000) in the work of Singer et al. [167], and 19% (N=4,500) for the research conducted by Vasilescu et al. [182]. Compared to these studies, the response rate to our questions was significantly higher (Fisher's exact test for binomial distribution: p=3e-2, p=5e-6, p=2e-3, respectively).


TABLE 8.10: Results of Email Experiment

                                                  Type 1 (Please help me)   Type 2 (Research)   Total
    # total emails sent                                     50                      50            100
    # delivered emails                                      43                      47             90
    % delivered emails                                      86%                     94%            90%
    # replied emails                                        21                      11             32
    % replied emails                                        42%                     22%            32%
    % replied emails of the delivered                       49%                     23%            36%
    # replied emails with rationale                         20                       8             28
    % replied emails with rationale                         40%                     16%            28%
    % replied emails with rationale of delivered            47%                     17%            31%

From the replies, the researchers judged the answers by assessing whether they contained rationale for the decision, in order to say something about the usefulness of the replies. About 28% of the sent emails (31% of the actually delivered, N=90) received an answer containing rationale for the decision. Table 8.10 summarizes the results of the email experiment.
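For reference, the following is a sketch of how such a comparison of response rates can be computed with a one-sided Fisher's exact test in plain Ruby. The counts for the comparison study are reconstructed from the reported percentage (23% of 1,160, roughly 267 replies), so the resulting p-value is an approximation and need not match the values reported above exactly.

    def log_factorial(n)
      (1..n).sum { |i| Math.log(i) }
    end

    # Probability of one 2x2 table [[a, b], [c, d]] under the hypergeometric distribution
    def table_log_prob(a, b, c, d)
      n = a + b + c + d
      log_factorial(a + b) + log_factorial(c + d) +
        log_factorial(a + c) + log_factorial(b + d) -
        (log_factorial(a) + log_factorial(b) + log_factorial(c) +
         log_factorial(d) + log_factorial(n))
    end

    # One-sided Fisher's exact test: probability of a reply count at least as
    # large as `a`, with all row and column totals held fixed
    def fisher_one_sided(a, b, c, d)
      max_a = [a + b, a + c].min
      (a..max_a).sum do |a2|
        b2 = a + b - a2
        c2 = a + c - a2
        d2 = c + d - c2
        next 0.0 if b2.negative? || c2.negative? || d2.negative?
        Math.exp(table_log_prob(a2, b2, c2, d2))
      end
    end

    # Our replies vs. non-replies (32 of 90) against ~267 of 1,160 (assumed from 23%)
    puts fisher_one_sided(32, 58, 267, 893)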

There was a difference between the reactions to the two types of questions. When asking the subject for help, we got far more replies than when asking them to assist in scientific research (40%, N=50 vs. 16%, N=50). This means that in the real situation of a decision-maker needing rationale, 2 out of 5 emails will get a reply. We can imagine this percentage would be even higher if the email contained more details about the actual problem at hand (something we could not do automatically). This clear distinction in reply rate implies that, when asking the right questions, one has a reasonable chance of getting rationale data from decision-makers. Researchers should be aware that mentioning research can influence the results significantly.

As a conclusion, the question posed at the beginning of this section can be answered as follows: RQ3: Do the identified CSDs provide sufficient information to base the decision process on? 10% of the email addresses turned out to be incorrect, so the vast majority of the email addresses can be used to contact decision-makers. The decision-makers replied to 32% of the sent emails, most of them within 24 hours. So decision-makers do want to reply, but it might be necessary to send a question to multiple commit authors in order to get a reply. Phrasing the question right does influence the reply rate (40%, N=50 when asking for help vs. 16%, N=50 when mentioning research). Most of the answers contained relevant rationale about the decision at hand. Even though the numbers are small (28%, N=100), the results are very promising and indicate that accessing rationale in this way is indeed plausible.


8.7 Discussion

8.7.1 Results

In the previous sections, we have shown that for many programming languages it is possible to create a decision support system that assists decision-makers in making their CSDs. We have seen that it is possible to scale the approach to increase the dataset as well as the accuracy of the data. Even though the work described in this chapter is based on a proof-of-concept for one programming language, the approach is promising for the software development industry. Being able to make better-founded decisions leads to more successful projects. Also, we have seen that the software engineering community is very willing to help, providing rationale on the decisions in roughly one third of the enquiries for rationale. People are willing to help others, independent of what their own benefit might be.

8.7.2 Threats to Validity

As a threat to external validity, the selection of the projects is important. For this research, we focused on decisions concerning component selection. These decisions have a high impact on the system, yet they are made constantly during development and maintenance of systems [186]. We then further narrowed our focus to specifically open source components, because of the availability of these components in open source projects. Last, during our proof of concept, we scoped to specifically Ruby components. We chose Ruby projects because Ruby is used extensively in both the open source and the industrial world, making it well suited for real-world research. As using company source code often involves legal issues, we focused on open source projects. Our tooling was run on a repository of the company where one of the researchers was working at the time, to validate that we saw similar patterns. This showed some component replacements that were immediately recognized by the architect of the company as having been a debate in the past. As this was only one project, no real usable data was extracted, but the identification of decisions was confirmed.

Because cloning and mining for CSDs takes (processing) time, a limited set of open source projects was processed. To confirm the validity of our approach, we added an extra set of projects to see whether our results are repeatable and improve with more data. We have shown that by extending the dataset the precision as well as the number of potential alternatives increased. However, in order to be usable on a large scale, it would be necessary to have a system that is updated regularly with new projects and with commits to existing projects. This is out of scope for the current research.

Concerning construct validity, several concerns can be pointed out. First, the research is based on a limited selection of projects. We selected the projects on statistical properties, not on specific knowledge about the projects, to make sure the projects were diverse and representative. We tried to make the dataset as representative as possible for generalizing to Ruby projects by selecting all active projects at a specific point in time. However, it is possible that, due to the time of cloning, this dataset was accidentally biased (e.g. due to holidays there


was more or less activity in certain types of projects), but we have not found any patterns in this direction.

The data on Github can be used for empirical research, but one has to be cautious in many aspects, as the data is not always what it seems, as Kalliamvakou et al. point out [105]. We minimized these threats in the following way. We selected projects that had some community (> 1 watcher and > 1 fork) and that changed at least once in the month before we extracted the data. We did not look at the pull requests, but at the sequence of commits that were accepted in the project history. We did not verify whether the users were real persons, but judging from the answers to our emails, many commits were actually made by real humans.
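As an illustration only (this is not the tooling that was actually used), selection criteria of this kind could be expressed as a GitHub repository search, for example with the Octokit Ruby client. The query below uses stars as a proxy for watchers and a fixed placeholder date standing in for "changed in the last month".

    require "octokit"

    # Hypothetical selection query; adjust the pushed date to "one month ago".
    client = Octokit::Client.new(access_token: ENV["GITHUB_TOKEN"])
    results = client.search_repositories(
      "language:ruby forks:>1 stars:>1 pushed:>2018-05-01",
      sort: "updated",
      per_page: 100
    )
    results.items.each { |repo| puts repo.full_name }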

When we contacted architects, we only emailed people once, even though they might be responsible for multiple CSDs. The selection of the people we emailed was done at random, based on the decisions we mined. One threat to our research is that we based our emails on old data (the initial data set), which was also mentioned in the replies by some of the decision-makers. However, we did get enough replies to validate our assumption that authors are willing to provide tacit knowledge. One of the major drawbacks of this approach is that the rationale from the decision-maker is not part of the actual mined data. Acquiring this data costs time for the person that needs the rationale. However, if this data is not available elsewhere, it is still better to spend some time obtaining it than not to have the data at all. Some rationale is available instantly, in the form of the previous commits and the commit messages accompanying them.

As a threat to internal validity, the following can be pointed out. Our definition of decisions, based on the adding and removing of components, has some constraints. First of all, we do not know whether we found all CSDs that were actually made in the projects, because they might be reflected in different (sequential) commits. Additionally, German et al. [70] describe that developers sometimes change the history of git repositories, which can cause missing decisions. We acknowledge this, but to reach our research goals it is sufficient to work with the decisions that we actually identified. As future work, one could look at these changes in sets of commits (e.g. pull requests or commits within a certain time span). Secondly, there is no guarantee that the found commits are CSDs; they could be based on coincidental addition and removal of components. First of all, this would imply that the commit contained different (unrelated) functionality, which is considered a bad practice in software development. Second, to minimize this effect we counted how often a replacement occurred. The chance that unrelated replacements happen often is very small, so these changes will have a lower relative occurrence score when the number of projects increases. In our previous work with subject matter experts, we found that 62% of the identified commits concerned actual decisions [186].
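To make this definition concrete, the following is a minimal Ruby sketch, assuming a locally cloned project and gems declared with plain gem "..." lines, of how candidate CSDs can be derived from a project's Gemfile history: consecutive Gemfile-touching commits are compared, and a commit that both removes and adds gems is flagged as a candidate replacement.

    require "set"

    # Gems declared in the Gemfile as it existed at a given commit
    def gems_at(repo_dir, sha)
      gemfile = `git -C #{repo_dir} show #{sha}:Gemfile 2>/dev/null`
      gemfile.scan(/^\s*gem\s+['"]([^'"]+)['"]/).flatten.to_set
    end

    # Candidate component-selection decisions: commits that both remove and add gems
    def candidate_csds(repo_dir)
      shas = `git -C #{repo_dir} log --reverse --format=%H -- Gemfile`.split
      shas.each_cons(2).filter_map do |previous, current|
        removed = gems_at(repo_dir, previous) - gems_at(repo_dir, current)
        added   = gems_at(repo_dir, current) - gems_at(repo_dir, previous)
        { sha: current, removed: removed.to_a, added: added.to_a } if removed.any? && added.any?
      end
    end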

8.7.3 Related Work

In architecture design decision research, hierarchical structures are used to model architectural knowledge [23] or design decisions [100] [189]. This research often emphasizes the recording of decisions, and the extraction of made decisions


later in the development process. There is a growing base of evidence that explicitly managing architecture decisions is effective [162] [111]. Traditionally, documenting software architectures [42], as well as documentation templates [117] and computational modeling [146], have been extensively used and researched. Due to the sometimes poor quality of the documentation, Lopez et al. [133] present an ontology-based mining solution to structure and extract relevant decision data from a set of components. Van Heesch et al. [85] describe a tool that assists the architect in the decision and documentation process by providing different viewpoints on the architecture. Another research tool proposed to assist the architect is the Decision Buddy [69]. One of the fundamentals of this tool is the Solution Repository, where known solutions to decision problems are stored. This repository could very well be seeded with relevant component replacement data from our research, so the decision-maker has access to the data at the right moment. Soliman and Riebisch [169] describe the reuse of decision data from a different angle, with a focus on the sharing of architecture knowledge. A topic that is being discussed heavily is the role of the architect [62] [184] and the role of 'the architecture document' in the design process [184]. Often the architect is responsible for creating and maintaining the architecture documentation. However, the decision-maker is not supported in making these decisions based on statistical data.

Van Vliet and Tang [191] describe the rationality of the decision-making process. They show that this process is not always as rational as commonly assumed, and use the term bounded rationality to describe decision making based on a finite set of options and limited information. Decision-makers can benefit from our research by basing their decisions on an extended dataset built from historical usage data of components.

The increasing amount of available data in online repositories is getting more attention in research. Kagdi et al. [103] provide a very extensive literature survey with a supporting taxonomy for mining software repositories. We have used this taxonomy to classify our research in Section 3. The popularity of the Working Conference on Mining Software Repositories [141] and the recent attention in the Empirical Software Engineering journal [151] are good examples of this. The work of Le et al. [125] assesses architectural changes from software repositories based on calculations of changes that are also seeded by the addition and removal of elements in the projects. For this, they also use different versions of open source projects as the source. Similarly, Kouroshfar et al. [110] use calculations on the model changes to identify co-changes in module views to mine for architectural elements. However, the goal of these studies was not to assist architects in making decisions, but to identify what types of decisions are made in these projects. Voinea and Telea developed a generic framework for mining software repositories [192]. Our work could benefit from this research if we want to extend it to larger sets of repositories.

Commercial-Off-The-Shelf or Open Source Software components are typical assets for software reuse [139]. For the discovery and selection of components, Ayala et al. [11] describe the high dependence on experience, either personal or from others. Most research investigates component selection by interviewing decision-makers [130] [82], instead of basing it on statistical data from open source projects.
