
University of Amsterdam

Faculty of Economics and Business

The Code Components Library

Standardized documentation for efficient reuse of code components

Author: Lisette Mientje Anne Tiedemann

Student number: 10452583

MSc Thesis Business Administration

Track: Digital Business

23rd of June 2017


Statement of Originality

This document is written by student Lisette Tiedemann, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


“Good programmers know what to write. Great ones know what to rewrite and reuse.”
Eric S. Raymond (1999)


Abstract

Many organizations are currently building data mining tools to optimize products, identify problem areas and support business decisions. With data mining tools, data scientists transform raw data into valuable information, which can lead to strategic advantages and hence a competitive edge over rivals. The current unstructured method of code documentation within data mining projects is the main motivation for this study. This study uses the design study methodology to address the main research question: how should a library of code components be designed to simplify reuse of pre-existing code components in data mining projects? With sufficient and standardized code documentation, data scientists are able to reuse pre-existing code written by themselves or others. This not only enables them to work faster and hence outperform the competition, but also reduces cost, as data scientists are scarce and expensive resources. Therefore, this study introduces the first standardized documentation platform for data mining projects: the Code Components Library (CCL). The CCL is based on field-tested library solutions from software engineering. The design and proof-of-concept phases were conducted at the Data Science & Business Consulting department of the Rabobank, Utrecht, The Netherlands. The proof of concept discusses the positive validation results and the expected advantages and challenges of CCL adoption. Finally, the motivational aspects are discussed to analyse how data scientists can be motivated to contribute to the CCL.

Index Terms – Design Study, Data Mining, Software Engineering, Code Components Reuse, Standardized Documentation, Reuse Libraries, Reuse Maturity Level


Acknowledgement

I would like to express my gratitude to all the people who shared their knowledge with me while I was writing my master thesis. In particular, I would like to thank my supervisor, Prof. Dr. H. Borgman, for his input and feedback. Moreover, I want to give great thanks to the data science team of the Rabobank: Arjan van den Heuvel, Joris Penders, Stefan Radnev, Chris van den Berg, Edwin Thoen, Gertjan Bronsema, Yannick Janssen, Stefan de Moor, Mario Liebreks, Martin Leijen, Robert Rodger, Joost Carpaij and Jeroen Zonneveld. They participated in my test trial, provided me with very useful feedback and continuously helped me develop myself in the world of programming. Furthermore, I want to thank my supervisor at the Rabobank, Ruben van Loosbroek, who also provided me with useful feedback. Your guidance and advice helped me during the journey of this research. I do hope that this report will be of good use and interesting to read.

Enjoy.


Table of contents

Abstract
Acknowledgement
List of Figures
List of Tables
List of Abbreviations
1. Introduction
   1.1 Problem Statement
   1.2 Design Study
   1.3 Contribution
   1.4 Outline
2. Related Work
   2.1 Theoretical Basis
       2.1.1 Data Mining
       2.1.2 The CRISP-DM Methodology
       2.1.3 Data Mining Application in the Banking Industry
       2.1.4 Reuse Maturity Model
   2.2 Existing Solutions in Similar Fields
       2.2.1 Why Software Reuse?
       2.2.2 Comparison Software Engineering and Data Mining
       2.2.3 Reuse of Components in Software Engineering
       2.2.4 Existing Solutions in Software Engineering
3. Methodology
   3.1 Research Question
   3.2 Research Design
       3.2.1 Design Study
       3.2.2 Multi-Method Qualitative Research
       3.2.3 Data Collection
       3.2.4 Data Analysis
4. Design Phase
   4.1 The Design of the CCL
       4.1.1 Markdown Template
       4.1.2 Semantic Level
       4.1.3 Search Optimization
       4.1.4 Uniform Code Titles
   4.2 Input of the CCL
   4.3 Upload-process of Code Components
   4.4 Workflow Diagram
5. Proof of Concept
   5.1 Validation Phase
   5.2 Reuse Maturity Level
   5.3 Advantages and Challenges of the CCL
       5.3.1 Advantages of the CCL
       5.3.2 Challenges of the CCL
   5.4 Motivation Perspectives to Document Code
6. Discussion
   6.1 Academic Contributions
   6.2 Implications for Management
   6.3 Limitations
7. Future Research
8. References
9. Appendices
   Appendix 1. Overview of interviews conducted
   Appendix 2. Visualization of the CCL in BitBucket
   Appendix 3. The Markdown Template


List of Figures

Figure 1. Overview of the CRISP-DM tasks and their outputs.

Figure 2. The five levels of the RiSE Maturity Model.

Figure 3. Reuse Library Environment (Frakes & Pole, 1994).

Figure 4. The workflow diagram of the CCL.

Figure 5. Visualization of the CCL in BitBucket.

List of Tables

Table 1. Formulation of the main research goal of this design study.

Table 2. Semi-structured interview questions.

Table 3. Search operators to optimize the search for code components in the CCL.

Table 4. Overview of the code components categories and their abbreviations.

Table 5. Overview of the operators necessary to upload code components to the CCL.

Table 6. The five expected advantages after adoption of the CCL.

Table 7. The four expected challenges after adoption of the CCL.

Table 8. Overview of interviewees, job roles and the dates the interviews were conducted.


List of Abbreviations

CCL Code Components Library

CRISP-DM The Cross Industry Standard Process for Data Mining

DS & BC Data Science & Business Consulting

DSR Design Science Research

NA Not Applicable

RiSE The Reuse in Software Engineering

RSL Reusable Software Library


1. Introduction

Progress in digital data acquisition and storage technology has resulted in the growth of database sizes. To extract meaningful patterns from the raw data stored in these large databases, data mining tools are used. Data mining is the science of extracting useful information from large datasets or databases (Hand, 2001). Data mining is broadly used in a variety of business sectors, including credit card companies, retailers, financial services, banks, telemarketing, airlines, manufacturers, telephone companies and insurance companies (Chen, Sakaguchi, & Frolick, 2000). These sectors are currently the major users of data mining technology. Banks are at an earlier stage than the other sectors, although data mining has shown great potential in banking: by using suitable data mining tools, banks can offer “tailor-made” products and services to their customers and make useful financial predictions.

Today, data scientists produce vast quantities of predictive models, visualisations, dashboards and other code components to make financial predictions. Data scientists are a scarce and expensive resource, so it is highly important to increase the effectiveness and efficiency of their work: the data mining projects (Igual & Segui, 2017). According to the CRISP-DM methodology, the lifecycle of a data mining project can be broken down into six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment (Wirth & Hipp, 2000). As mentioned, the efficiency of data mining projects must increase, and this can be realized through the creation of a documentation platform for code written in the data preparation and modelling phases. Such a documentation platform enables the reuse of code components and speeds up the data preparation and modelling phases of the data mining process. Therefore, this study presents a design for a reusable code components library. The standardized documentation platform is named the Code Components Library (CCL).


Data mining projects are quickly becoming established engineering projects (Marban, 2009). Therefore, to develop the CCL, a comparison was made with library management in software engineering. Research in software engineering has shown that software reuse positively affects the efficiency of software engineering processes and thereby maximizes the productivity of software engineers. Software libraries make the reuse of coding script components possible. The main motivation of this study is that experience gained from the software engineering process could be reused and integrated to improve the data mining process. Moreover, interviews were conducted at the Data Science & Business Consulting (DS & BC) department of the Rabobank to analyse the requirements of the data scientists.

The design study methodology is used in this study to address the main research question: how should a library of code components be designed to simplify reuse in data mining projects? The design study methodology includes a learning phase, in which related work on existing library solutions in software engineering, a field comparable to data mining, is discussed. Then, in the design phase, the design of the CCL is introduced. The design phase is followed by a proof of concept, which includes the validation of the CCL and investigates the Reuse Maturity Level to emphasise the importance of adopting the CCL. Furthermore, the expected advantages in the adoption phase of the CCL are discussed, such as faster experimentation and higher productivity of data mining processes, along with the expected challenges that implementation of the CCL brings. Finally, the motivational aspects of data scientists contributing to the CCL with sufficient documentation are considered.

This study contributes to the data mining literature by designing the first standardized documentation platform for data mining projects: the Code Components Library. It enables efficient reuse of code components and improves the data mining process. This design study answers the following research question: how should a library of code components be designed to simplify reuse of pre-existing code components in data mining projects?


1.1 Problem Statement

Data scientists are deploying data mining technologies to acquire and improve decision-making capabilities. The complexity and the rapidly increasing number of code components developed within the DS & BC Department of the Rabobank demand a new strategy for managing these code components. Currently, there is no existing solution for library management of code components in the data mining process. Data scientists are a scarce and expensive resource, so their output should be utilized efficiently. Reuse opportunities are lost due to the lack of a documentation platform where developed code can be stored. Therefore, this study designs a new approach for documenting code components: the Code Components Library (CCL).

1.2 Design Study

The design study methodology is used to address the central research question and support the design phase of the CCL. A design study is a project in which visualization researchers analyse a specific real-world problem faced by domain experts, design a visualization system that supports solving the problem, validate the design, and reflect about lessons learned to refine visualization design guidelines (Sedlmair, Meyer, & Munzner, 2012). The real-world problem in this specific case is the lack of documentation and tooling standards for code components. The design study methodology includes a learning phase and design phase. The design phase is followed by the proof of concept to validate the design of the CCL. The proof of concept is addressed by a different research method: multi-method qualitative research.


1.3 Contribution

As information systems become more complex, manual development of models and transformations is no longer viable. Model management aims to solve this problem by providing techniques and tools for dealing with models and model transformations in more automated ways. Several research communities, such as databases, document management and software engineering, have studied it independently for years, but there is no research on library management for data mining. Data mining has a broader output of code categories and therefore needs a more sophisticated and broader library design than existing library solutions for software engineering. This prior literature will be applied to the data mining process and adjusted to the broader output of code categories and the data scientists’ user preferences. Since data mining is of great importance for the success of organizations, it is vital that the process is well managed and documented.

1.4 Outline

The remainder of this document is structured as follows: Chapter 2 presents the theoretical foundations for this thesis and a discussion of the existing library solutions in comparable fields. The methodology is discussed in Chapter 3. The design of the CCL is presented in Chapter 4. After an introduction of the design, the proof of concept of the CCL is discussed in Chapter 5. To contribute to efficient adoption of the CCL, the expected advantages and challenges for the adoption of the CCL are discussed in this chapter. Chapter 6 will elaborate on the contributions and limitations of this research, followed by the future research areas in Chapter 7. Finally, Chapter 8 lists the references and Chapter 9 contains the appendices.


2. Related Work

2.1 Theoretical Basis

2.1.1 Data Mining

Today, banks are realizing the various advantages of data mining. Data mining in general refers to extracting knowledge from large amounts of data. Data mining is defined by Bose and Mahapatra (2001) as a process of identifying meaningful patterns in databases that can then be used in decision making. Turban et al. (2007) define data mining as a process that uses statistical, mathematical, artificial intelligence and machine learning techniques to extract and identify useful information. A consortium of leading data mining users and suppliers has developed a comprehensive process model for carrying out data mining projects: the CRISP-DM methodology. The CRISP-DM methodology will be the leading framework in establishing the design requirements.

2.1.2 The CRISP-DM Methodology

Data mining projects are currently carried out without a standard framework. The Cross Industry Standard Process for Data Mining (CRISP-DM) process model addresses this problem by defining a process model that provides a framework for carrying out data mining projects. The CRISP-DM process model aims to make large data mining projects more reliable, less costly, faster, more repeatable and more manageable (Wirth & Hipp, 2000).

The CRISP-DM methodology consists of two components: the Reference Model and the User Guide. The User Guide provides detailed instructions and hints for each phase, whereas the Reference Model presents a quick overview of phases, tasks and their outputs. For this study, the Reference Model is used to describe the life cycle of a data mining project. This life cycle is broken down into the six phases shown in Figure 1. In the following section, each phase is discussed briefly.


The first phase, Business Understanding, focuses on understanding the project requirements from a business perspective and translating them into a data mining problem definition. Accordingly, a preliminary project plan is designed in this phase to achieve the objectives. The data understanding phase starts with initial data collection and continues with activities to become familiar with the data. The next phase, data preparation, covers all activities needed to construct the final dataset from the initial raw data; this dataset is then fed into the modelling tools.

Figure 1. Overview of the CRISP-DM tasks and their outputs.

In the fourth phase, the modelling phase, various modelling techniques are selected and applied; several modelling techniques can address the same data mining problem type. The next phase is the evaluation phase, in which the models created are evaluated before the final deployment of the model. The final phase is the deployment phase. The activities of the deployment phase depend largely on the requirements of the final report and can be as simple as generating a report or as complex as implementing a repeating data mining process. The CCL solution designed in this study for the reuse of code components aims to make the modelling task more efficient: it eliminates time-consuming code component development tasks by effectively managing previously developed code components.


2.1.3 Data Mining Application in the Banking Industry

Data mining is used to create analytical models for solving business problems in banking and finance. There are numerous areas in which data mining can be used in the banking industry. This indicates the importance of efficient deployment of analytical models to utilize the outcome of the data mining projects. The following section will elaborate on four examples of how the banking industry has been effectively utilizing data mining (Moin & Ahmed, 2012):

(1) Marketing

Marketing is one of the most widely used areas of data mining in the banking industry. Because of the high level of competition in the industry, intelligent business decisions in marketing are more important than ever. A bank’s marketing department can use data mining to analyse customer databases. The main objective of data mining in marketing is to determine customer behaviour regarding product, price and distribution channel. Moreover, it is useful to analyse past trends, determine present demand and forecast customer behaviour for various products and services.

(2) Risk Management

The management and measurement of risk is at the core of every financial institution. Using data mining techniques, banks analyse whether the customers they are dealing with are reliable. Banks need to be cautious when providing loans due to the risk of loan defaults by customers. Data mining techniques help to distinguish borrowers who repay loans promptly from those who do not.

(3) Fraud Detection

Financial fraud has recently attracted a great deal of concern and attention (Ngai, Hu, Wong, & Sun, 2011). Financial fraud detection is vital for the prevention of the often devastating consequences of financial fraud. With the help of data mining, more fraudulent actions are being detected and reported. Data mining is applied to extract and uncover the hidden truths behind very large quantities of data. Kou et al. (2004) highlight an important advantage of data mining: it can be used to develop a new class of models to identify new attacks before they can be detected by human experts.

(4) Customer Relationship Management

Banks maintain many huge databases containing transactional data and other details of their customers, from which valuable business information can be extracted. To retain their customers, banks must cater to customers’ needs and put the customer first. They can do so by understanding customers’ interests and extracting their typical transactional behaviour.

2.1.4 Reuse Maturity Model

Maturity models are used to guide the transformation of a company from an initial to a target stage (Paulk & Curtis, 1996). A maturity model can help to define and categorize the state of organizational capabilities. For this study, the Reuse in Software Engineering (RiSE) Maturity Model will be used to measure the Reuse Maturity Level of code components (Garcia & Lucrédio, 2007). The five Reuse Maturity Levels are indicated in Figure 2.


Level 1, the Ad Hoc Reuse level, is the lowest maturity level. This level is the baseline for all organizations and is characterized as ad hoc, and occasionally even chaotic. Reuse practices are rarely used, or not used at all, and reuse is discouraged by management at this level. The next level, the Basic Reuse level, is characterized by basic usage of potentially reusable assets. Code and documentation are generated by reuse-based tools, but not developed by the organization; the assets are modified manually by the developers. At the third level, the Initial Reuse level, reuse practices are standardized and deployed across the whole organization. The main difference compared to level 2 is that in level 3 projects receive reuse guidelines and assets from the organization. The Organized Reuse level is the fourth level: domain engineering is performed and reuse-based processes are in place to support and encourage reuse. The final level, the Systematic Reuse level, is the highest level in the RiSE Maturity Model. At this level, all major obstacles to reuse have been removed, all definitions, guidelines and standards are in place, and the whole organization’s knowledge is planned, stored and maintained in a reuse inventory. The main purpose of the RiSE Maturity Model is to serve as a roadmap for software reuse adoption and implementation (Garcia & Lucrédio, 2007).

This study will indicate the current Reuse Maturity Level of code components to identify any room for improvement. The CCL that will be designed during this study aims to increase the current Reuse Maturity Level of code components.


2.2 Existing Solutions in Similar Fields

2.2.1 Why Software Reuse?

Systematically reusing existing code for the development of new software provides several benefits (Boehm, 1981). First, it has a positive effect on productivity: because existing code is reused, less code has to be developed from scratch, which increases the efficiency of the development process. A second benefit is that reusing well-tested software elements increases the overall quality of the resulting software system. Reusable software elements are used multiple times, so they benefit from the collective experience of previous users, which results in the detection and correction of bugs and deficiencies in the documentation. The third and final benefit is a reduced time-to-market. This study aims to bring these benefits to the data mining process by developing a library management platform for code components.

2.2.2 Comparison Software Engineering and Data mining

As previously described, data mining projects are an important tool for the banking industry, but research on the documentation of these projects is not readily available. In software engineering, however, the research is more mature. Software engineering refers to the disciplined application of engineering, scientific and mathematical principles and methods to the economical production of quality software (Humphrey, 1989). By using established disciplines such as configuration management, coding standards or naming conventions, re-solving previously solved problems can be avoided. Comparable problem mitigation is just as valuable for the data mining process: in data mining, the value also lies in avoiding redundant work, meaning avoiding the creation of code components that have already been developed in previous projects. A software engineering project’s goal is to reach one single outcome from a large array of data; in other words, software engineering delivers a conclusion, whereas data mining supports decision making. A well-known example used in data mining is the Monte Carlo simulation, which shows a range of outcomes together with the most likely possibility. A Monte Carlo simulation is simply a looped run with variables set within a certain range. The resulting code outcome could then be further analysed with software engineering. Reuse in data mining serves exploration and experimentation purposes, whereas software engineering’s main goal is product development. Although an overlap in the code can be found, the reuse library for software engineering mainly requires transformations. A transformation generally means modifying the shape or appearance of objects while preserving the content. Section 4.2 goes into more detail on the different code categories in data mining projects. Existing library solutions in software engineering consider every software component as equal, and this is where a new approach is required. The new library design for data mining projects will address the difference in code input, because data mining projects consist of a wider array of code categories. Therefore, data mining needs a more sophisticated and broader library design.
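To make the Monte Carlo description above concrete, the following is a minimal sketch in Python (one of the languages used by the department). The interest-rate range, starting balance and number of periods are illustrative assumptions, not figures from this thesis.

    import random

    # A looped run with a variable drawn from a set range: each run compounds
    # a starting balance at a randomly sampled rate, so the collected outcomes
    # form a range rather than a single answer.
    def simulate_balance(runs: int = 10_000) -> list:
        outcomes = []
        for _ in range(runs):
            rate = random.uniform(0.01, 0.05)  # variable set within a certain range
            balance = 1_000.0
            for _ in range(10):                # compound over ten periods
                balance *= 1 + rate
            outcomes.append(balance)
        return outcomes

    results = simulate_balance()
    # The spread of outcomes indicates the range and the most likely possibility.
    print(min(results), sum(results) / len(results), max(results))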

The broader range of code components is not the only reason a new approach is needed. As mentioned, data mining projects are mainly focused on experimentation and exploration. This demands short project durations, and reuse is required more often due to the similarity of code across experimentation projects.

2.2.3 Reuse of Components in Software Engineering

Several different approaches to achieving reuse of software components in software projects have been proposed in the literature. Systematic software reuse is a key business strategy that software managers can employ to improve their software engineering processes, improve product quality, decrease time-to-market and decrease costs (Visconti & C, 1993). The key aspect of successful reuse-based software engineering is a well-structured software library.


In computer science, a software library is a collection of data and programming code that is used to develop software programs and applications (Burton & Aragon, 1987). A software library generally consists of pre-written code, procedures, scripts or configuration data, typically with well-defined and documented interfaces, intended to be reused. The functions of library management are to extract reuse information from design or source code files and to assure the quality of candidate components.

2.2.4 Existing Solutions in Software Engineering

The main foundation of the Reusable Software Library (RSL) is the database of attributes of every reusable software component in the RSL. The library management subsystem provides a set of tools to maintain the software library. Figure 3 is a schematic of the Reuse Library Environment for software component reuse. Library management in software engineering includes tools to ensure efficient operation, including (Burton & Aragon, 1987):

- Automated data collection
- Standardized data entries
- Continuity and consistency of reuse information across the life cycle
- Completeness and reasonableness of reuse information
- Examination of reuse information

From this overview it becomes clear that continuity and consistency of reuse information across the life cycle, and completeness and reasonableness of reuse information, are important requirements for a library solution. These requirements are adopted in the design of the CCL.

3. Methodology

This chapter presents the research questions and the research design of this design study. The main goal of this design study is formulated in the table below to indicate the core attributes and the main context.

I design                  a code components library
To                        enable code components reuse
From the viewpoint of     the developers and maintainers (data scientists)
In the context of         data models

Table 1. Formulation of the main research goal of this design study.

3.1 Research Question

First, the attributes necessary for sufficient documentation are identified; then a design of the CCL is developed. Furthermore, the current Reuse Maturity Level is identified to emphasize the urgency of adopting the CCL. To increase code component reuse and management, a library model is developed. Besides the development of the documentation standards and a library platform, important managerial aspects are discussed, such as the adoption and feasibility of the CCL. The main research question addressed in this study is: how should a library of reusable code components be designed to simplify reuse of pre-existing code in data mining projects?

To answer the main research question, four sub-questions are defined. The first two sub-questions (SQ) assess the design phase of the CCL. The aim is to identify the current Reuse Maturity Level by analysing how often code components are deployed in projects with similar characteristics. Secondly, the design requirements are analysed. The first two sub-questions are addressed with the design study methodology described in section 3.2.1.


SQ1: What is the current Reuse Maturity Level of code components? In other words, how extensively are code components reused in similar projects?

SQ2: What standardized documentation is needed for sufficient documentation of code components and how should a library of code components be designed to simplify reuse?

The second two sub-questions of this study assess the proof of concept. They analyse the impact of the CCL and the expected advantages and challenges in its adoption phase. These sub-questions are addressed with the multi-method qualitative research described in section 3.2.2.

SQ3: Is it feasible to develop a generic infrastructure (library) for managing code components? What are the expected advantages and challenges for the adoption of the CCL?

SQ4: What can the organization do to motivate data scientists to document code components?

3.2 Research Design

3.2.1 Design Study

The first two sub-questions are assessed by a design study, which implies that a new approach is proposed on the basis of the collected data. A design study is a project in which visualization researchers analyse a specific real-world problem faced by domain experts, design a visualization system that supports solving the problem, validate the design and reflect about lessons learned to refine visualization design guidelines (Sedlmair, Meyer, & Munzner, 2012). The research strategy used in this study is the design science research (DSR) strategy. DSR is a research strategy aimed at knowledge that can be used in an instrumental way to design and implement actions, processes or systems to achieve desired outcomes in practice (van Aken, Chandrasekaran, & Halman, 2016). DSR is used to address field problems or to exploit promising opportunities (Winter, 2008).


3.2.2 Multi-Method Qualitative Research

To assess the second part of this study, the proof of concept, multi-method qualitative research is used. This means that several data collection methods are combined (Saunders, Lewis, & Thornhill, 2012). Data is collected through observations, semi-structured interviews and an extensive literature review. By combining these data collection methods, this study gives an overview of the expected adoption advantages and challenges.

3.2.3 Data Collection

The data collection was conducted in April 2017 and realized through semi-structured interviews. Interviews are a resourceful tool for analysing a person’s thoughts, paradigms, feelings and intentions about the research subject. Semi-structured interviews were chosen over structured interviews to allow the flexibility of asking additional questions during interviews, as new themes might emerge (Saunders, Lewis, & Thornhill, 2012). Two groups with an interest in the topic were interviewed to design the CCL: first the data scientists, and then the business consultants of the DS & BC Department. A total of thirteen interviews were conducted, all face to face.

The data scientists were selected with a minimum experience requirement of five years in data mining projects. This requirement was set to assure expertise in the writing and documentation of code components. These respondents were selected to assess the first two sub-questions and to set the requirements for the design phase. The difference in input data between data mining projects and software engineering projects demands new insights into the characteristics of the input data. The insights of the data scientists were used to structure the input data: the code component categories. Moreover, they were questioned on the current reuse of code components, to indicate the current Reuse Maturity Level, and on the documentation attributes needed for sufficient documentation. The data manager was interviewed to verify the selected documentation attributes.

The interviews with the business consultants had a slightly different approach: their questions focused more on the management and implementation aspects of the issue. The business consultants were included as respondents next to the data scientists to gather insights into the management process of adopting new practices. Their input is used to address the second two sub-questions and to provide the proof of concept with a managerial view on the adoption process.

All interviewees were Dutch native speakers; therefore, the interviews were conducted in Dutch. This allowed the interviewees to speak more freely and prevented a language barrier. All interview answers were translated to English before the analysis started. The interview questions generated primary qualitative data. A table of the interviews conducted at the DS & BC Department of the Rabobank is attached in Appendix 1. The semi-structured interview questions are shown in Table 2.

3.2.4 Data Analysis

All the interviews were recorded with permission and summarized after the interviews were conducted. The transcripts were systematically analysed and summarized. Common themes and emerging patterns were identified to give a complete view of the insights gathered from the data scientists, business consultants and the data manager.

The questions are grouped below by the section of this thesis they inform, together with the theoretical models and literature consulted.

Reuse Maturity Level of Code Components (2.1.4 and 6.1)
Consulted: (Pierce, 2002); (Stackoverflow, 2017); interviews with data scientists
- What programming languages do you use?
- Which of the following code components do you write?

Reuse Maturity Level of Code Components (2.1.4 and 6.1)
Consulted: interviews with data scientists; Reuse in Software Engineering (RiSE) Maturity Model (Garcia & Lucrédio, 2007)
- How are code components currently documented? And where? (a. within project scope; b. outside project scope)
- Are code components currently reused? Are the reused code components written by yourself or by other data scientists? How do you become aware of what code your colleagues are writing?

Documentation Attributes for the Markdown, Platform Analysis and Search Methods (4.1 – 4.3)
Consulted: interviews with data scientists and the data manager
- What attributes do you think are necessary for sufficient documentation of a code component in the data lab?
- How relevant are the mentioned documentation attributes (rate 1-5) for enabling reuse?
- Which platform do you consult to find available code related to your data mining project?
- Definitions (metadata) should be well defined to correctly reuse code components. Is there currently a database of definitions? If not, can you list all the keywords you have used in prior projects?

Privacy Analysis included in Markdown (4.1)
Consulted: (Clifton, 2004); Rabobank Privacy Regulations; interviews with data scientists, business consultants and the data manager
- Is there a clear overview of the privacy regulations on different source data?
- What levels of privacy/sensitivity can you identify?
- Will editing or combining code components create privacy issues?

Advantages and Challenges of Adoption (6.2)
Consulted: interviews with data scientists and business consultants
- What advantages do you see in the adoption of a reuse component library?
- What obstacles do you see in the adoption of a reuse component library?

Motivation Analysis (6.3)
- Will you be willing to document your code components? If not, what would be an incentive for you to become more willing? (interviews with data scientists; Shmerlin & Hader, 2015)
- Would gamification have a positive effect on your motivation to document your code components? (interviews with data scientists; Pedreira, 2015)
- Would you document because you want to do it or because you are asked to do it? (interviews with data scientists and business consultants; Shmerlin & Hader, 2015)
- What do you think is the preferable time to document your code components? (interviews with data scientists and business consultants; Shmerlin & Hader, 2015)

Table 2. Semi-structured interview questions.

4. Design Phase

This section will elaborate on the design phase of the CCL. First, the design of the CCL is introduced. Then, the input of the CCL is structured by identifying the reusable code components. Finally, a workflow diagram is developed to illustrate the workflow of the CCL.

4.1 The Design of the CCL

An effective library platform must meet high functional requirements. An important functionality of the library is the implementation of change to code components: adding new code components and changing existing code components in the library must be easy to do. Besides the functional part, integration with currently used documentation tooling is almost equally important. Data scientists are already familiar with documentation tooling such as Jupyter Notebook and R Notebook. The Jupyter Notebook is an open-source web application that allows the creation and sharing of documents that contain live code, equations, visualizations and explanatory text. R Notebook covers the same functionality, but is easier to use for R programming. Both notebooks can therefore be used as tools to visualize Markdown files. Markdown is a plain-text formatting syntax designed so that it can be converted to HTML using a tool of the same name; it is often used to format readme files. The CCL consists of a standardized Markdown design that will be deployed in a BitBucket repository. BitBucket is a web-based service for hosting distributed version control (Git) repositories that is in use by the DS & BC department to deploy projects to project folders. It enables reuse, and the DS & BC department is already familiar with the tooling. The next section elaborates on the Markdown template designed to give the library a common documentation strategy. A visualization of the CCL in BitBucket is attached in Appendix 2.


4.1.1 Markdown Template

Before code components are suitable to enter the library, they must pass the quality assurance of sufficient documentation. To ensure the code is sufficiently documented, a documentation template for code components was created, so that anyone who is not familiar with the code can get an understanding of the code and the goal of the specific code component. The template includes all the documentation attributes, with descriptions, that are necessary for sufficient documentation. The Markdown template is disclosed as a README.md file in the master folder of the CCL. Appendix 3 contains the code used to build the Markdown, which consists of Markdown code and HTML code: the Markdown code is used to include comments for the user, and the HTML code is used to visualize the documentation attributes in a table format in Markdown. Appendix 4 provides a table with a description of each documentation attribute.
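The authoritative template is the one in Appendix 3. As a hedged illustration only, the Python sketch below renders a README in the spirit of that template; the attribute names and tag values shown here are assumptions, a Markdown pipe table stands in for the HTML table of the actual template, and the three compulsory content tags on the last line anticipate section 4.1.3.

    # Illustrative sketch: render a README for a code component.
    # The attribute set is assumed; the authoritative list is in Appendix 3/4.
    TEMPLATE = """\
    # {title}

    | Attribute            | Value         |
    |----------------------|---------------|
    | Author               | {author}      |
    | Code category        | {category}    |
    | Programming language | {language}    |
    | Description          | {description} |

    CCLtagI{category_tag} CCLtagI{functionality_tag} CCLtagI{language_tag}
    """

    readme = TEMPLATE.format(
        title="BR_LifeEventsBedrijfStarten.pig",
        author="a data scientist",
        category="Business rule (BR)",
        language="Pig",
        description="Analyses life events for starting businesses.",
        category_tag="BR",
        functionality_tag="lifeevents",
        language_tag="pig",
    )
    print(readme)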

4.1.2 Semantic Level

While discussing data quality, it is important to define the semantic level of the code components. Semantics concerns the meaning of symbols. Meanings are assigned to data by people, depending on their prior knowledge and experience; meanings are intersubjective and continuously constructed. The mapping of a symbol to a real-world state is possible, but may differ between people (Shanks & Darke, 1998). For example, in mathematics and physics, height can be expressed as either an “h” or a “z”. When deploying a code component to the library, it is very important that the definitions used are well defined. Moreover, it is important that the data scientists are aware of these definitions and document their code components with the correct definitions. To ensure the correct definitions are used, the data scientists are supplied with a data definitions list, which describes the business definitions of roughly 3500 business terms. This not only ensures that the data scientists use the correct definitions in their description of the code component, but also that they reuse the code components in the proper way by correctly interpreting the definitions.

4.1.3 Search Optimization

To optimize the search within the CCL, the table below shows all the search operators that funnel the search results to the exact code component of interest.

Search operator | Example query                    | Description
project:        | project:ARCH                     | Matches files within the “Architecture” project
repo:           | repo:CCL                         | Matches files within the “CCL” repository
lang:           | lang:python                      | Matches files that are written in a specific language
AND             | interest AND mortgage            | Matches files that contain both “interest” and “mortgage”
OR              | interest OR mortgage             | Matches files that contain either “interest” or “mortgage”
NOT             | interest NOT mortgage            | Matches files that contain “interest” but don’t contain “mortgage”
( )             | interest AND (mortgage OR loan)  | Matches files that contain “interest” and either “mortgage” or “loan”

Table 3. Search operators to optimize the search for code components in the CCL.

To optimize the ability to search within the CCL even more, tags should be attached to each code component that is deployed to the CCL. A tag is a marker or semantic descriptor. BitBucket, however, only offers the feature to create tags that indicate a point in history and does not include a semantic descriptor. To solve this, an alternative tagging method was developed to provide the possibility of adding content tags to the code components. The alternative tagging method is described below.

The tags are noted in the last section of the Markdown. The Markdown demands three compulsory content tags: code category, keywords of the functionality and programming language. The notation for a tag is the following:

CCLtagIsearchtag

A code search on such a tag will show all the code components in the CCL that are tagged with this specific search tag, leading directly to the Markdown of the code components of interest. The first line in the Markdown shows the title of the code component. Through the tagging alternative it is also possible to look for multiple search terms within one search. The notation is the following:

CCLtagIsearchtag1 AND CCLtagIsearchtag2

This code search will result in the Markdown files that contain both search tags.
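In BitBucket the tags are found with the code search shown above. As an illustration of the same AND semantics, the Python sketch below filters the Markdown files in a local clone of the CCL; the function name, directory path and tag values are assumptions for illustration.

    from pathlib import Path

    # Keep only the Markdown files that contain every requested content tag,
    # i.e. the equivalent of: CCLtagIsearchtag1 AND CCLtagIsearchtag2.
    def find_components(library: Path, *tags: str) -> list:
        matches = []
        for markdown in library.rglob("*.md"):
            text = markdown.read_text(encoding="utf-8")
            if all(f"CCLtagI{tag}" in text for tag in tags):
                matches.append(markdown)
        return matches

    print(find_components(Path("code_components_library"), "interest", "mortgage"))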

4.1.4 Uniform Code Titles

The Markdown template gives a structured information layer describing the code components. Besides the structured description, it is also important for the clarity of the CCL that the code components are named with the same structure. The code components should be named as follows:

Codecategory_Functionality.Programminglanguage

The code category must be written in its abbreviated form shown in Table 4, the functionality of the code component is described in the title, and the programming language is indicated by the file extension. An example of a business rule analysing life events for starting businesses, written in Pig, is:

BR_LifeEventsBedrijfStarten.pig

Using uniform code titles increases the ability to search within the CCL. Moreover, the code components can easily be categorized according to their code category, functionality or programming language, which enables data scientists to quickly screen what code has already been written before they start programming for a new project. This also enables the data scientists to better estimate the time they will spend on programming.

4.2 Input of the CCL

It is important that the input of the CCL is structured properly, because the number of possible code components is endless. Through observations, interviews and literature, the most relevant code components suitable for frequent reuse were identified. The code components are categorized into the four categories shown in the table below. The reason for categorizing the code components is to give a clear structure to the input of the code components library. Every category consists of at least one type of code component, each with an abbreviation for quick notation.

Reporting code components   Visualisations (Visual); Dashboards (DB)
Model components            Complete analytical models (CAP); Business rules (BR); Model components (MC)
Transformations             Transformations (Trans); Features (Feat); Functions (Func)
Docker containers           Docker containers (DC)

Table 4. Overview of the code components categories and their abbreviations.

This section gives the definitions of the code components identified as suitable for the code components library. The first category contains the reporting code components. Visualization describes any technique used to create images, diagrams or animations that communicate a message. A dashboard is a user interface that organizes and presents information in a way that is easy for the user to read.

Second is the model components category. A complete analytical model uses mathematical or symbolic relationships to provide a formal description of the system; the model is then used to derive an explicit expression of a performance measure or, in most cases, to define an algorithm or computation procedure able to calculate the performance indicators (Terkaj & Urgo, 2012). A business rule is a statement that describes a business policy or procedure; business logic describes the sequence of operations associated with data in a database to carry out the rule. The last group of code components within this category are the model components: fragments of the complete model that can be used separately to train a specific model.

The third category contains all code components that cause a transformation of the dataset. A transformation generally means modifying the shape or appearance of objects while preserving the content; common applications are transformations of vectors, quaternions or matrices. A feature is defined in this study as a code component written to analyse a part of the dataset; in machine learning, a feature is a piece of measurable information about something. If you store the age, annual income and weight of a set of people, you are storing three features about them. Furthermore, a function is a portion of code intended to carry out a single, specific task.

The last category of code components contains the Docker containers. Docker containers are the core of the Docker program, in which programs and applications can run in isolated environments (Stackoverflow, 2017).


4.3 Upload-process of Code Components

The table below summarizes the operators that are necessary to upload (push) the code components developed to the CCL.

To clone the CCL repository:

    git clone https://tiedeml@git@rabobank.nl/acm/arch/code_components_library.git

When code is ready to be pushed to the CCL repository:

    cd existing-project
    git init
    git add --all
    git commit -m "Initial Commit"
    git remote add origin https://tiedeml@git@rabobank.nl/acm/arch/code_components_library.git
    git push -u origin master

When the code is already tracked by Git, set this CCL repository as the “origin” to push to:

    cd existing-project
    git remote set-url origin https://tiedeml@git@rabobank.nl/acm/arch/code_components_library.git
    git push -u origin master

Table 5. Overview of the operators necessary to upload code components to the CCL.

4.4 Workflow Diagram

A workflow diagram defines a series of steps that a process must execute consistently and shows how tasks flow between resources (Minor, Mirjam, & Schmalen, 2008). Figure 4 illustrates the workflow of the CCL. Two types of users are able to enter the CCL: data scientists and the administrator. The administrator is responsible for the user administration of the data scientists and, more importantly, checks the library input for documentation completeness.
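A minimal sketch of the administrator’s completeness check is given below: before a code component enters the CCL, its README must mention every required documentation attribute. The attribute names used here are illustrative assumptions; the authoritative set is defined by the Markdown template in Appendix 3.

    # Reject a code component whose README omits a required attribute.
    REQUIRED_ATTRIBUTES = ("Author", "Code category", "Programming language",
                           "Description", "CCLtagI")

    def review_component(readme_text: str) -> str:
        missing = [a for a in REQUIRED_ATTRIBUTES if a not in readme_text]
        if missing:
            return "Rejected, missing: " + ", ".join(missing)
        return "Accepted into the CCL"

    print(review_component("# BR_LifeEventsBedrijfStarten.pig\nAuthor: ..."))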

5. Proof of Concept

The proof of concept first elaborates on the validation phase of the CCL. Second, the current Reuse Maturity Level of code components is indicated, which emphasizes the urgent need to adopt the CCL in order to increase the reuse of code. Furthermore, the expected advantages and challenges in the adoption of the CCL are discussed. Finally, the motivational aspects of documenting code in the CCL are described.

5.1 Validation phase

The validation phase examines whether the CCL satisfies the intended use and meets the user requirements. Furthermore, validation gives management continuous and comprehensive information about the quality and progress of the development or design (Wallace & Fujii, 1989). The key aspects when validating design quality are correctness, consistency, completeness, accuracy, readability and testability. The design quality of the CCL was measured through a test period of one month, during which five data scientists were asked to document their code components according to the CCL documentation requirements. After they had documented their code components according to the CCL standards, they were asked for their opinion on the key aspects mentioned above. The feedback was provided in the form of an evaluation of the documentation requirements. The general feedback was positive and did not include any missing requirements: the data scientists had a positive attitude towards the use of a Markdown documentation method, the structure of the documentation attributes and the tooling selected. However, there were some comments on the extensiveness of the documentation; some documentation attributes were indicated as unnecessary and were deleted from the design. The design was finalized based on the feedback to meet the requirements mentioned by the data scientists. Then, ten code components were deployed to the CCL with the required documentation, code titles and sufficient content tagging. After this, other data scientists tested the ability to search the CCL by looking for the right code components. This test was successful: due to the content tags, the code components were easily found based on their functionality. Moreover, the uniform code titles gave a clear visual overview of the available code components.

5.2 Reuse Maturity Level

The current Reuse Maturity Level of code components was measured through observations and interviews at the DS & BC Department. Code components are very rarely reused, which indicates the Ad Hoc Reuse Maturity Level. The small number of code components that are currently reused were developed by the data scientists themselves in previous projects. This means reuse is individualized, which is also a characteristic of Ad Hoc Reuse Maturity. Occasionally, code components written by others are reused, but this is an uncoordinated process: the data scientists only become aware of code components written by others through verbal communication, which, in all fairness, is not their strongest attribute. Furthermore, it is time consuming to start looking for data scientists who might have worked with similar code components.

5.3 Advantages and Challenges of the CCL

5.3.1 Advantages of the CCL

The biggest expected advantage of adopting the CCL is faster experimentation, due to the avoidance of double production of code components by data scientists. The data scientists often stated that the wheel is currently being reinvented repeatedly. This double production of code components will be greatly reduced by adoption of the CCL, with complete elimination of double production as the goal. A second advantage is higher reproducibility and auditability, achieved by standardizing the meanings and documentation of code components.

Furthermore, the quality of code will improve because of the open-source structure: every data scientist can make changes and additions to a code component in case of inaccuracies that went unnoticed by the author of the code. The CCL will function as a discussion platform where data scientists work together to find the best solution, and this ease of collaboration will stimulate data scientists to work together on creating a sustainable CCL. Finally, reuse opportunities are easily recognized, which creates more time for innovative data science. The five expected advantages are summarized in the table below.

Faster Experimentation: It avoids production of similar code components that are already documented.
Higher Reproducibility and Auditing: By standardizing the meanings and documentation of code components it is possible to define standard terms and ease communication problems.
Improve the Quality of Code: Authors are obligated to invest time in the documentation of their work; afterwards, others can improve the work when they notice inaccuracies.
More Collaboration: It will function as a discussion platform where the data scientists work together to find the best solution.
More Time for Innovative Data Science: Reuse opportunities are easily recognized, which creates more time for data science.

Table 6. The five expected advantages after adoption of the CCL.

5.3.2 Challenges of the CCL

This paragraph elaborates on the expected challenges for the adoption of the CCL, identified through observations and interviews at the DS & BC Department. The first and most impactful challenge, repeatedly remarked upon during the interviews, is a lack of participation from the data scientists. The second challenge arises when code is not written by oneself: those who have never seen the source data must be able to understand the code components through the documentation alone. If the documentation is not clear, it can be time-consuming to understand code someone else has written. This was emphasized multiple times by the data scientists.

Furthermore, privacy issues can arise due to the easy access to code components developed on different types of source data. The users of the CCL should be made aware that, even when a specific code component is not based on privacy-sensitive source data, a combination with another code component may still create a privacy issue. The final challenge that may arise is a poor ability to search; using a common tagging method should counter this issue. The four expected challenges are summarized in the table below.

No participation of data scientists: Lack of motivation and interest to document will lead to too little participation.
Not written by me: It can be time-consuming to understand code that others wrote.
Privacy issues: Legal issues can arise when code components are reused in a project where this is not allowed due to the source data used to build the code component.
Poor ability to search: Problems with finding the right solution in the CCL can be time-consuming.

Table 7. The four expected challenges after adoption of the CCL.


5.4 Motivation Perspectives to Document Code

Most data scientists are aware of the importance of documentation of code. However, code documentation is often overlooked in practice. Documentation of code components enables data scientists to understand code faster, improve the quality of code components and reuse code components efficiently. A major challenge in documentation of code components involves motivating people to invest time in documentation. For successful reuse, the data scientists must (1) document their own developed code components (2) actively use the library to search for reusable components. If the data scientists are not willing to commit to these two actions, the reuse opportunities are missed, and thus the benefits of reuse are lost. Documenting is perceived as tedious, difficult, time consuming and distracting from the main task of coding. To increase data scientists’ motivation to document, a proposed solution for encouraging documentation should:

- Increase motivation aspects
- Mitigate the hindering aspects

5.4.1 Increase Motivation Aspects

An analysis of the semi-structured interviews indicated the four motivational aspects listed below:

1. Documentation promotes code comprehensibility
2. Increase of overall code quality
3. Documentation assigns credits to the developer of the code component
4. Gamification

The first motivational aspect emphasizes the importance of code comprehensibility (Shmerlin & Hadar, 2015). Data scientists are motivated by the readability of code, which contributes to a better understanding of it. During the test phase the data scientists were asked to evaluate code components that were documented in the CCL Markdown template. The code was easier to understand than with previous documentation methods, which increased the data scientists' motivation to contribute to the CCL. A second motivational aspect is the increase of overall code quality. Through documentation in the CCL, code components are shared with the other data scientists, who can then improve and expand on the components entered into the CCL. This will lead to increased quality of the code components. Another motivational aspect is that documenting code components assigns the credits to the developer. Data scientists should be evaluated not only on the code they develop but, more importantly, on the completeness of the documentation they provide. When documentation completeness is reviewed in their performance measurements, data scientists will be more motivated to contribute to code documentation. Moreover, data scientists are motivated by the fact that they contribute to the team by documenting their own code components for others to reuse.

Finally, gamification is often used to increase engagement and involvement and to motivate users to perform certain behaviour. Prior academic literature has focused on the application of gamification in software engineering to increase the engagement and results of developers (Pedreira, 2015). Gamification increases motivation by making an unpleasant task more attractive. It was also indicated by the data scientists as a possible incentive to start documenting code components, but only by a minority of 30% of the interviewed data scientists. This minority is attracted by gamification in the form of up-voting code components. The other 70% did not experience gamification as a stimulus for documenting code: they describe it as childish and unprofessional, and they feel that motivation should be intrinsic and not depend on a game score.


5.4.2 Mitigate the Hindering Aspects

Besides the motivational aspects, the analysis of the semi-structured interviews with the data scientists also indicated hindering aspects. The hindering aspects identified by the data scientists fall into three categories:

1. Difficult and tedious task
2. Interruption of coding
3. Time consuming

The hindering aspect indicated as most disturbing is the difficulty of documentation. To keep documentation as simple as possible, data scientists prefer tooling they already use themselves; this decreases difficulty and does not require extra time to master new tooling. Another hindering aspect is the interruption of coding. Documentation is most effective when it is done during coding itself, as that is when the data scientists' knowledge is most accurate. However, this interrupts the coding process and is therefore unattractive to data scientists, because it distracts them from their core task: writing code. Finally, data scientists experience documentation as a time-consuming task. To increase the data scientists' motivation to contribute their documentation to the CCL, it is important that these hindering aspects are mitigated. Therefore, a tool that is already in use is adopted as the platform for the CCL, which reduces the first hindering aspect of task difficulty. This also minimizes the time spent on documentation, since the data scientists are already familiar with the tooling. The interruption of coding remains difficult to tackle. Emphasizing the self-learning aspect of documentation might make it feel less interrupting when it is positioned as a self-development task, and the documentation attributes can already be captured in the code while writing it, as sketched below.
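As an illustration, the following minimal sketch shows how a data scientist could record the core documentation attributes in a docstring while coding, so that moving them into the CCL Markdown template afterwards takes little extra effort. The function, its attributes and the attribute names are hypothetical examples, not taken from the CCL itself.

    import pandas as pd

    def remove_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
        """Remove exact duplicate rows from a data frame.

        Documentation attributes (to be copied into the CCL template later):
        - Description: drops rows that are exact duplicates, keeping the first.
        - Input: any pandas DataFrame.
        - Output: the same DataFrame without the duplicate rows.
        - Author / date: filled in by the developer.
        """
        return df.drop_duplicates(keep="first")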


6 Discussion

6.1 Academic Contributions

The rapid evolution of the data mining industry demands a strong basis of code documentation, but its importance is currently overlooked in practice (Garcia & Lucrédio, 2007). The main motivation of this study is that experience gained with library management in the software engineering process can be reused and integrated to improve the data mining process. Prior research on library management in software engineering is an important basis for discussing library requirements for the data mining process. Requirements such as the continuity and consistency of reuse information across the life cycle, and the completeness and reasonableness of reuse information, are included in the design phase of the CCL (Frakes & Pole, 1994). Current library solutions for software engineering would fall short if used for data mining storage, due to their limited ability to structure several categories of input. The different code component categories in data mining projects mentioned in section 4.2 demand a different structuring method, whereas in existing software engineering library solutions every input is treated as comparable to any other and no real distinction exists.

The design study methodology is used to address the main research question: how should a library of code components be designed to simplify reuse of pre-existing code components in data mining projects? To create new insights in the unexplored field of library management in the data mining process, data scientists of the Rabobank have been interviewed. These respondents provide important insights on the library requirements and user interface, as they are actively involved in the data mining process on a daily basis.

The current Reuse Maturity Level of code components is Ad Hoc Reuse, meaning that code components are very rarely reused. The small number of code components that are currently reused were developed by the data scientists themselves in previous projects or shared via verbal communication. This means the efficiency of the data scientists is relatively low and can be raised to a higher level by introducing documentation standards. Consequently, this study introduces the first standardized documentation platform for data mining projects, the Code Components Library, to enable reuse of code components in data mining projects and to improve collaboration between data scientists.

The reuse opportunities that arise from the adoption of the CCL will mainly have a positive impact on the data preparation and modelling phases of the CRISP-DM model, because both phases can be completed more quickly when code components are reused rather than rewritten every time. Several important characteristics and functionalities of the design of the CCL have been introduced:

- The CCL is deployed as a repository in BitBucket
- The CCL provides a standardized Markdown design containing all documentation attributes necessary for sufficient reuse of code components (an illustrative sketch follows this list)
- The uniform code titles give a clear structure to the CCL
- The CCL is equipped with a search function that operates on the Markdown files, so that searching is not a time-consuming task
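To make these characteristics concrete, the sketch below illustrates what a documented component in such a standardized Markdown template could look like. The attribute names, the uniform title "DP-03 Remove Duplicate Rows" and the tag values are assumptions made for illustration; the actual template is the one validated in this study.

    # DP-03 Remove Duplicate Rows

    Tags: data-preparation, cleaning, duplicates
    Author: <name>
    Date: <date>

    ## Description
    Removes exact duplicate rows from a data set, keeping the first occurrence.

    ## Input
    A data frame with arbitrary columns.

    ## Output
    The same data frame without the duplicate rows.

    ## Code
    (the reusable code component is inserted here)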

In the proof of concept, these characteristics and functionalities have been validated. Besides the characteristics and functionalities, the CCL was also validated on its user interface and ease of use. The test results were positive and showed that the CCL is an easy-to-adopt documentation platform that enables and stimulates code reuse.
