UNIVERSITEIT VAN AMSTERDAM

THESIS

Mature Bottom-up Data Analysis

Author: Jori (JJM) van Schijndel, MSc

Supervisor: Han Boer RE RA

A thesis submitted in fulfillment of the requirements for the degree of AITAP
in the AITAP
Amsterdam Business School


Declaration of Authorship

I, Jori (JJM) van Schijndel, MSc, declare that this thesis titled, “Mature Bottom-up Data Analysis” and the work presented in it are my own¹. I confirm that:

• This work was done wholly or mainly while in candidature for a degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: Date:

¹ This LaTeX template, including this Declaration of Authorship, has been downloaded from http://www.LaTeXTemplates.com.


“If you torture the data long enough, it will confess.”


UNIVERSITEIT VAN AMSTERDAM

Abstract

IT Audit

Amsterdam Business School AITAP

Mature Bottom-up Data Analysis

by Jori (JJM) van Schijndel, MSc

This thesis starts by stating that merely placing trust in or relying on bottom-up data analytics without gaining the required assurance is unacceptable. It further states that regular (process-focused) auditing techniques and guidance are not suitable for gaining sufficient assurance over bottom-up data analytics, and that this creates an impasse. By investigating literature from fields like data analytics, software development, software security, model validation and code review, we created a very broad and comprehensive set of procedures, processes, controls, techniques and measures that can positively influence trust in bottom-up data analytics. With the use of a questionnaire, respondents from different fields (e.g. IT auditors and data analysts) provided input on, among other things, their preferences for these controls. This allowed us to select, sort and structure the different controls into an operational maturity model that is geared towards bottom-up data analytics. This maturity model is presented as a guidance document that facilitates both data analysts and auditors: using the maturity model, the analyst can create more auditable data analysis results and can better balance costs versus benefits, while the (IT) auditor can use the maturity model as a basis for the norms used during audits of bottom-up data analyses or as best practices when giving advice.


Acknowledgements

I wish to thank my thesis supervisor Han Boer for his supervision during this study. This thesis could not have succeeded were it not for the effort he put into it.

My sincere thanks go to all respondents to the questionnaire I distributed during this study. Their input was invaluable and vital to the study.

I am also thankful to my colleagues at KPMG for their discussions, and especially to Dennis Tesselaar because of his input to this project.

Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction
  1.1 Introduction
  1.2 Background and the role of the IT-auditor
  1.3 Problem, problem statement and research question
    1.3.1 Problem
    1.3.2 Problem statement and research questions
  1.4 Subsidiary research questions
  1.5 Research Design
  1.6 Organization of study
  1.7 Scope, Assumptions and Limitations

2 Background and theory
  2.1 Data analyses background
    2.1.1 Data
    2.1.2 Modelling
    2.1.3 Validation
    2.1.4 Deployment
    2.1.5 Maintain
  2.2 Theory
    2.2.1 IT audit and quality criteria
    2.2.2 Controls, techniques and measures
      Governance
      Construction
      Testing
      Verification
  2.3 Answers to Subsidiary research questions 1A and 1B
    2.3.1 Subsidiary research question 1A
    2.3.2 Subsidiary research question 1B
  2.4 Summary

3 Questionnaire
  3.1 Questionnaire design
    3.1.1 Population and sample size
    3.1.2 Pilot survey
    3.1.3 Questions and question motivation
      Introduction
      About this questionnaire
      Substantive questions and closing
  3.2 Response
    3.2.1 Summary of responses

4 Proposed solution
  4.1 Maturity model
    4.1.1 SAMM
    4.1.2 Scoring model
    4.1.3 Resulting solution
  4.2 Summary

5 Conclusion
  5.1 Personal reflection
  5.2 Research questions, problem statement and conclusion

6 Evaluation
  6.1 Limitations
  6.2 Future Research

A NOREA quality criteria
B Controls, techniques and measures
  B.1 Governance
  B.2 Construction
  B.3 Testing
  B.4 Verification
C Initial questionnaire
D Final questionnaire
E Summary of responses
F Questionnaire analyses
G Proposed solution
H Respondents
I Analyses code

Bibliography

Chapter 1

Introduction

1.1 Introduction

A perfect storm of factors has come together to make data analytics¹ (DA) a mainstream technology in use by companies today (Halper and Stodder, 2014). This process of data analytics holds enormous business potential for companies to create data-driven processes, to get customer insights, to innovate, to support critical business decision-making etc.: research indicates that companies that utilize data analytics are five percent more productive and six percent more profitable than other companies (McAfee and Brynjolfsson, 2012). Companies are therefore realizing that data analytics can provide valuable insight to help them better compete. At the same time, software vendors are making analytics software easier to use and to consume. Because of that, the adoption of data analytics is increasing. With this, not only statisticians, data scientists etc. are using data analytics, but also business analysts and casual users are (increasingly) making use of the different technologies available (Halper and Stodder, 2014).

Alongside these (potential) benefits for companies, there are also a lot of pitfalls, risks and problems that arise when using data analytics. For example, in a survey by Bain & Company only four percent of companies say that they have the right people, tools and data to get meaningful insights or advances out of their data analytics (Wegener and Sinha, 2013). Also important in order to benefit from the use of data analytics is to (be able to) trust the outcomes of the analytics. Unfortunately, a KPMG report shows that organizations do not fully trust their analytics and that trust in data analytics is still a significant challenge for organizations. Just thirty-eight percent of respondents indicated that they have a high level of confidence in the customer insights they receive from the use of data analytics. Also, only a third seem to trust the analytics they generate from their business operations. Yet the vast majority of respondents say these insights are critical to their business decision-making (KPMG, 2016). At the same time, companies are more and more trying to take advantage of data analytics (Deloitte, 2013). This gap between the lack of trust on the one hand and the increasing use of and reliance on data analytics on the other is an important reason for writing this thesis.

¹ Within this thesis the term data analytics is considered broad and includes business intelligence (BI), business analytics (BA), big data etc.

1.2 Background and the role of the IT-auditor

In the referenced KPMG report it is stated that there currently is a 'trust gap' in analytics and that trust in analytics can be underpinned with four 'anchors of trust': quality, effectiveness, integrity and resilience (KPMG, 2016). There are many approaches that try to limit this trust gap, such as having good (IT & data) governance in place, monitoring of outcomes, having cross-functional DA teams etc., as discussed in that report. This is for example the situation when BI reports are created and adjusted using the in-place change-management processes, including design, testing and verification of the reports and the underlying code. In these cases, those processes try to control or influence certain quality aspects² of the BI reports to ensure that these quality aspects are in line with e.g. the (perceived) risks and the risk appetite of the company.

Meanwhile, on the other side of the aisle are the well-known Excel spreadsheets or the data analytics tools meant for self-service data discovery by individual business units (e.g. Tableau, SPSS, Qlikview, R). These types of analytics normally don't have a strong plan for governance, are built to enable fast ad-hoc analyses and are sometimes designed, created, maintained and used by the same end user or the business unit needing the analytics (Singh, 2016). These types of analytics can pose serious business risks and there are therefore (good) reasons for companies to have a lack of trust in (specifically) these types of data analytics (Olshan, 2013; Rittweger and Langan, 2010).

These diametrically opposed approaches to data analytics are coined by the Eckerson group as top-down versus bottom-up data analytics³ (Eckerson, 2016). The top-down governance approach is the IT-driven data analytics using sanitized data, having a standardized approach and following change management procedures. The bottom-up data analytics are the analytics performed by users within a business unit who query, analyze, explore and mine data for the questions they have for their own tasks, by using tools like Excel. This bottom-up approach is sometimes also called 'data discovery' or ad-hoc data analysis⁴.

Actually, this top-down governance approach and the bottom-up data discovery approach are two sides of the same coin: both are focused on satisfying the needs of the business, but the difference lies in the freedom and agility users have with the bottom-up data discovery approach versus the controls that higher management needs to impose with the top-down approach to be able to be in control.

There are then generally two ways in which the IT-auditor and the IT-audit profession becomes involved with the above-mentioned data analytics:

² With quality aspects we refer to the NOREA quality aspects of effectiveness, efficiency, exclusivity, integrity, verifiability, continuity and manageability (NOREA, 2002). More information on this is given in chapter two.

³ The Eckerson group uses the term BI instead of DA, but data analytics, as used in this thesis, encompasses business intelligence.

• the IT-auditor is using data analytics to perform audit tasks (e.g. testing logical access segregation of duties) (Singleton, 2013); or

• the IT-auditor is auditing the data analytics itself.

The use of data analytics by the auditor, be it IT, operational or financial, is something we already see within literature (Heijden and Bajnath, 2015; Boerkpam and Soerjoesing, 2010). Also the auditing of analytics like a computer-generated report is something that generally belongs to the tasks of the IT-auditor (Singleton, 2014). Nevertheless, the Public Company Accounting Oversight Board (PCAOB) reported that a major area of deficiency in (financial) audits is relying on these types of computer-generated reports without gaining the required assurance regarding the accuracy and completeness of the report's information (Tysiac, 2012). Given the 'fallacy' of merely relying on a computer-generated report, standard setters have therefore issued guidance to (IT) auditors regarding the auditing of computer-generated reports or computer-processed data (Office, 2009).

When we look at this guidance given for the testing of computer-generated reports or computer-processed data, a lot of emphasis is placed on process-related testing, e.g. application controls, security controls, change management etc. Although this can work for things like a computer-generated list of accounts receivable, for an ad-hoc query or a random Excel spreadsheet this will generally not be the case. Here we come back to one of the main differences between the top-down approach and the bottom-up approach to data analytics: the top-down approach is generally governed by, or should be governed by, various (IT) controls and will therefore lend itself to being audited. The bottom-up data discovery approach will generally lack these types of controls and will also not have design documentation, test reports, code reviews, logging etc. Nevertheless, if the data analytics are important or material to the user, who can be an auditor or any other stakeholder, then some sort of comfort is required to be able to rely on (that is, to trust!) the data analytics. This can be reasonable assurance, limited assurance or any other form. In this situation, there is yet again a role for the IT-auditor.

1.3 Problem, problem statement and research question

1.3.1 Problem

With this we come to the main problem that is the focus of this thesis. As stated, there generally is a trust gap in analytics. Also, the PCAOB states that taking a "leap of faith" (Singleton, 2014) and just trusting analytics like a computer-generated report isn't acceptable to auditors. This means that to be able to close this apparent trust gap with data analytics, at least from the point of view of an auditor, certain activities need to be performed.

With regards to top-down data analytics, the prerequisites for these activities generally are, or at least should be, in place. This means that an auditor can look at the analytics from a process perspective (e.g. the change management process) and combine this with some substantive testing to come to a conclusion or an attestation about the trustworthiness of the analytics. With bottom-up data analytics these prerequisites are generally not available (or very limited) and, given the nature of this form of data analytics or data discovery (e.g. the degree of freedom, agility, cost and overhead versus top-down), it is also not realistic and desirable to expect the same processes, recordings etc. as with the top-down approach, as this would negate various positive aspects of bottom-up data analytics.

This means that when bottom-up data analytics become relevant for an auditor or another stakeholder, the activities required to be able to 'trust' the data analytics are generally not possible to execute using the same approach that is suitable for top-down data analytics. The problem then is that, from an audit perspective, reliance on bottom-up data analytics should be substantiated by activities, but the current guidance for this is only effective for top-down data analytics and not for bottom-up data analytics.

1.3.2 Problem statement and research questions

Given this observation that the current guidance on auditing analytics is not geared towards bottom-up data analytics, but instead towards top-down data analytics, a logical question is then what activities can be executed with regards to these bottom-up data analytics (either prior to, during or after the execution of the analytics) to ensure that it is possible to attain comfort towards these data analytics and the accompanying outcomes. The formal problem statement is then:

Placing trust in or relying on bottom-up data analytics without gaining the required assurance is unacceptable. Also, the (current) guidance towards attaining the required assurance for data analytics is not suitable for bottom-up data analytics, as the prerequisites for this are generally not available with bottom-up data analytics (e.g. processes like change management) and, given the nature of bottom-up data analytics, it is also not desirable nor economically feasible to handle and expect the same from bottom-up data analytics as from top-down data analytics. This results in an impasse when bottom-up data analytics or the accompanying outcomes become material to either an auditor or any other stakeholder.

The primary research question is then:

What activities⁵ can be performed prior to, during or after the execution of bottom-up data analytics by analysts, auditors and others that can facilitate and/or ensure trust in or reliance on these bottom-up data analytics and the accompanying outcomes, in a way that is acceptable within both the IT-audit profession and the data analysis profession?

⁵ With activities is meant any procedure, process, control, technique or measure in the broad sense that can be performed to positively influence certain quality criteria of the analytics.

1.4 Subsidiary research questions

In the above section the primary research question was given. In this section this primary research question will be split into separate subsidiary research questions that will be answered throughout this thesis.

Literature review:

1A What quality criteria of bottom-up data analytics are (the most) relevant for trusting the analyses and the accompanying outcomes; and

1B What procedures, processes, controls, techniques or measures are available in current literature to positively influence the quality of bottom-up data analytics, with regards to data analysis in the broad sense and within data analysis related fields such as software development, software security, model validation, code review, software testing, data conversion, and software assurance.

The goal of this literature review is to come to a very broad and comprehensive set of procedures, processes, controls, techniques and measures that already exist within the field of data analysis and related fields. These are then not yet specifically geared or balanced towards bottom-up data analyses.

Field study:

2A What do analysts and IT auditors consider, from a personal perspective and given the outcome of research question one, as practical and viable procedures, processes, controls, techniques and measures in the context of bottom-up data analyses; and

2B What is the interrelation between these different procedures, processes, controls, techniques and measures and the (required) maturity⁶ of an organization implementing them?

The goal of this field study is to come to a reduced set of practical and viable activities within a maturity model⁷ that can facilitate and/or ensure trust in or reliance on bottom-up data analyses.

1.5 Research Design

The methodology for this study is that, in line with the structure of the research questions, it starts with a literature review and is followed by a field study. Within the literature review, already available guidance for data analysis and related fields has been collected to provide a more detailed analysis of the existing literature; it shows how this literature interrelates with the problem statement and how deficiencies in the current literature provide a motivation and justification for the research undertaken within this thesis. This literature review answers research questions 1A and 1B, and these answers are then used as input for the field study. Within the field study, professionals from the field of data analysis, audit and related fields are asked to give input towards the practicality and viability of the output of research questions 1A and 1B:

⁶ Mature with regards to the use of bottom-up data analytics within the organization. More information on this term is given in the following chapters.

⁷ More information on e.g. maturity models and maturity levels is given in chapters three and four.

• A questionnaire was distributed among a multidisciplinary group of people to receive their input on the practical applicability (i.e. particularly cost versus benefit of the control and the maturity of the control) of the derived activities from research questions 1A and 1B over multiple axes. This allows us to answer research question 2A. Based on their input, the highest-ranking activities are selected as input to answer question 2B; and

• The activities taken from research question 2A are selected, ordered and combined based on e.g. the required maturity to implement these activities. This results in a maturity model geared towards bottom-up data analyses, which allows us to answer research question 2B.

The resulting maturity model or guidance document is advantageous for both the analyst performing the bottom-up data analyses as well as for the (IT) auditor. Using the maturity model, the analyst can create more auditable data analysis results and can better balance cost versus benefits. Also, the (IT) auditor can use the maturity model as a basis for the norms used during audits or reviews of bottom-up data analyses.

1.6 Organization of study

The above methodology resulted in this thesis, which is structured as follows:

• Chapter 1 introduced the background to this thesis by explaining the risks associated with bottom-up data analytics. It explained and described how we tried to determine whether we can reduce these risks, using specific research questions and a certain research methodology.

• Chapter 2 gives a general overview of (bottom-up) data analysis techniques and creates a link between data analysis and applicable IT audit theory. This literature study results in answers to research questions 1A and 1B.

• Chapter 3 explains the questionnaire that was used to receive input on the practical applicability of the derived activities from research questions 1A and 1B. The input received resulted in an answer to research question 2A.

• Chapter 4 provides a scoring model and a maturity model to structure the output of research question 2A based on the maturity of organizations and the different business functions the controls relate to. This allows us to create an operational model and to answer research question 2B.

• Chapter 5 provides a complete answer to the different subsidiary research questions in this thesis, formulates an answer to the original research question and formulates a solution to the problem statement. In this chapter we also summarize our project and conclude on our findings and the added value of this work.

• Chapter 6 is the final chapter of this project. This is also the point where we can be critical of ourselves and what we did, so we use it to point out certain limitations of our project. To end this thesis, we also give recommendations for possible future research.

1.7 Scope, Assumptions and Limitations

The full field of data analytics, big data etc. is too vast for a single thesis. For this reason a brief introduction to data analytics is given in chapter two, but it is explicitly stated that this is in no way meant as a full or balanced description of the field. As this is also an explorative study, certain scoping is applied to the thesis. For this, the following choices are made upfront:

• data governance, data quality etc., although important for data analytics, are not part of the proposed control measures, with the possible exception of data validation throughout the analytics;

• the study is focused on (informal) acceptance by IT-auditors and other professionals. There is no investigation on how this relates to specific audit standards or assurance levels; and

• an operational model is the output of research question 2B. The resulting model itself is not validated by e.g. case studies. Given the required depth of such a validation, that would best be suited as a possible follow-up study to this thesis.

An assumption is made that the reader knows certain technical aspects upfront. Although effort is made to create an accessible study, it cannot be prevented that certain techniques or (technical) jargon will require the reader to consult other literature.

As stated in the research questions, already available literature is consulted during the literature review. It is unfeasible to review all possibly relevant literature. It is therefore left to professional judgement which literature is consulted.

Chapter 2

Background and theory

This chapter presents a general overview of (bottom-up) data analysis techniques and an overview of literature related to the control and review of data analyses and related fields. To start, a general overview of (bottom-up) data analysis techniques is given in section 2.1 to introduce common techniques and terminology used within data analyses. Section 2.2 makes the link between data analyses, applicable IT audit theory and other related fields like software development, and presents an overview of the possible controls, techniques and measures that are available in literature to ensure (adequate) control over (bottom-up) data analyses. An answer to the first research questions is then given in section 2.3.

Note that some basic familiarity with both performing and auditing data analytics is assumed within this chapter.

2.1 Data analyses background

General remark: this section is based on several data analysis books, some from the public domain. We encourage the reader to read, if necessary, the books 'Data Mining and Analysis - Fundamental Concepts and Algorithms' (Meira Jr., 2014) and/or 'Introduction to Data Science' (Stanton, 2012).

Although the approach to data analytics can differ between data scientists, applications, etc., there are generally several overarching steps within the process that are (almost) always applicable. These combined steps can be considered the data analytics lifecycle (Nashawaty, 2016): data, modelling, validation, deployment and maintenance.

The following subsections will give an introduction per step in this data analytics lifecycle to familiarize the reader with certain general data analysis terms, tasks and challenges.

2.1.1 Data

Data analysis is, among other things, the process of discovering insightful, interesting, and novel patterns and facts, as well as descriptive and understandable models, from data. A prerequisite for this is the availability of data. Therefore we look in this section at some of the basic properties and requirements of data when being used for data analytics within a business environment¹.

The quote "data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom" (Clifford Stoll) illustrates how data is, as if it were a pyramid structure, the foundation of knowledge, understanding and wisdom. With data analytics one of the main tasks is then turning that data into information and onwards up the pyramid. This means that data is the basic ingredient for data analytics. The data a company collects can be used for data analytics, but disparate data sources are often a barrier to effectively performing analytics (Nashawaty, 2016). Managing, utilizing and integrating data sources are a challenge for organizations, and organizations have taken to storing data in data lakes, which, put simply, are archives that store a lot of data in its raw or native format instead of structuring, cleaning, consolidating etc. A data lake like that can easily turn into a data swamp if the data continues to amass. For effective analytics, organizations need to collect, store and make data accessible in an easily decipherable form in which completeness and accuracy of the data can be assumed or validated.

If this integration of data is not already the case, data analytics will start with identifying and integrating systems or data. These steps are generally considered to be pre-processing tasks such as data extraction, data aggregation, data cleaning, data fusion, data reduction and feature construction. In certain situations, pre-processing tasks will include the creation of (separate) data sets for testing, training, and production purposes.

As the quality of the data will generally affect the quality of the data analytics outcomes, we encourage the reader to read relevant literature or handbooks on data quality (Eurostat, 2007).

¹ Although from an academic viewpoint no introduction to data could forgo basics like the information theory of Claude E. Shannon (Shannon, 1948), this thesis limits itself to the basics from an organizational standpoint.

2.1.2 Modelling

Modelling (or mining or programming) is essentially the core part of analytics for transforming data into knowledge. Although the following list is not meant to be exhaustive, it presents an overview of general tasks within modelling². In certain situations these different tasks can be combined within a single analysis:

Exploratory data analysis: exploratory data analysis aims to explore the numeric and categorical attributes of the data individually or jointly to extract key characteristics of the data sample via statistics that give information about the centrality, dispersion, and so on.

Extraction: during extraction the task is to extract a subset of the data either as a finalized result or as input for further analyses. This can be a simple filtering of the data, but in e.g. the case of unstructured data the extraction could consist of text analysis to extract certain linguistic information.

Combining: although the combination of data is also a pre-processing step (see above), the combining of data within the analysis step joins information on the basis of matching characteristics (e.g. matching patients to doctors based on the provided care).

Aggregation: within aggregate modelling different data is combined to present an overarching description of that combined data (e.g. determining the number of cars produced per day based on the production database).

Rule-based systems: rule-based analytics is a way to use rules (or a rule base) to manipulate data. This can be used to e.g. validate compliance in a data-driven way.

Frequent pattern mining: within frequent pattern mining the task is to extract informative and useful patterns out of large datasets. Patterns comprise sets of co-occurring attribute values, called itemsets, or more complex patterns, such as sequences, which consider explicit precedence relationships (either positional or temporal), and graphs, which consider arbitrary relationships between points. The key goal is to discover hidden trends and behaviors in the data (Meira Jr., 2014).

Clustering (including anomaly detection): clustering is the partitioning of data records into natural groups called clusters, such that data records within a group are very similar, whereas data records across clusters are as dissimilar as possible. This also allows for the detection of anomalous records within the data.

Classification: with classification the goal is to assign a label or class to yet unlabeled records. The classification task is to predict the label or class for a given unlabeled point. Many different classification models exist, such as decision trees, probabilistic classifiers, support vector machines, and so on.

² This listing is based on the above-mentioned book 'Data Mining and Analysis - Fundamental Concepts and Algorithms'.
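To illustrate a few of these tasks, the sketch below performs exploratory statistics, a simple extraction, the per-day aggregation example from the text and a rule-based check in Python with pandas; the production data is invented for the example.

    import pandas as pd

    production = pd.DataFrame({
        "day": pd.to_datetime(["2017-01-02", "2017-01-02", "2017-01-03"]),
        "plant": ["A", "B", "A"],
        "cars_built": [120, 95, 130],
    })

    # Exploratory data analysis: centrality and dispersion of an attribute.
    print(production["cars_built"].describe())

    # Extraction: a simple filter yielding a subset for further analysis.
    plant_a = production[production["plant"] == "A"]

    # Aggregation: the number of cars produced per day.
    cars_per_day = production.groupby("day")["cars_built"].sum()

    # Rule-based check: flag records that violate a simple business rule.
    violations = production[production["cars_built"] < 0]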

2.1.3 Validation

Broadly speaking, (model) validation may be defined as supporting all the other steps of the data analysis lifecycle in order to improve the quality of the outcomes. It is designed to check the plausibility of the outcomes and to correct possible errors, and it is one of the most complex operations in the life cycle of data analytics (Eurostat, 2006). Validation should ideally be performed according to a set of common and specific rules, depending on the type of model, the stage and the level of data aggregation. Nevertheless, in most situations the method of validation should also be tailored towards the specific analysis, as each analysis has its own particular characteristics, risks and problems.
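As an illustration of such tailored validation rules, the sketch below checks an aggregated outcome for completeness and plausibility; the control total and the daily maximum are assumed values for the example.

    import pandas as pd

    def validate_outcomes(per_day, source_total, daily_maximum=10_000):
        """Return a list of plausibility issues found in a per-day aggregate."""
        issues = []
        # Completeness: the aggregate should reconcile with the source total.
        if per_day.sum() != source_total:
            issues.append("aggregate does not reconcile with the source total")
        # Plausibility: no negative or implausibly large daily values.
        if (per_day < 0).any() or (per_day > daily_maximum).any():
            issues.append("implausible daily value encountered")
        return issues

    # Example: validate a per-day aggregation against a control total.
    per_day = pd.Series({"2017-01-02": 215, "2017-01-03": 130})
    print(validate_outcomes(per_day, source_total=345))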

2.1.4 Deployment

Although with bottom-up data analytics the deployment stage will not always come into play, as the analytics generally have more of a 'one-off' character, deployment of analytics can still be of vital importance if the analytics are provided to a production system to be accessible to a larger group of users or to be run over production data. Within IT, the ITIL Release and Deployment Management process has the primary goal to ensure both that the integrity of the production environment is protected and that the correct components are released. With regards to data analytics this means that with proper deployment it should be ensured that the new analytics do not negatively affect the deployment environment (e.g. by presenting too high a load or by deleting data) and that analytics are only promoted to the production environment if sufficient testing and validation have been performed.

2.1.5 Maintain

Similar to the deployment phase, the maintenance phase will not always be applicable to bottom-up data analytics. Nevertheless, production models can be altered or updated due to changes in the format of source data, new requirements, changes in relevant rules, bug fixes etc. In these situations, the level of modelling and validation is generally less than is the case with the original deployment. Still, it should be ensured that the risk to the production environment is kept to a minimum. Although a general process like IT change management could be used for this, this would generally negate various positive aspects of bottom-up data analytics, like the flexibility and the short(er) lead times.

2.2 Theory

Within this section the link is made between (bottom-up) data analyses, applicable IT audit theory and other related fields like software development. Given the scoping as discussed in chapter 1, data (including data quality, data validation and data governance) is not part of this section. Because of the generally shorter lifespan and smaller distribution circuit (in comparison to top-down analytics), the focus will also be limited to modelling and model validation, as these will generally have the biggest effect on the outcomes of the analytics. This means that deployment and maintenance of bottom-up analytics will only be touched upon briefly.

Also, a characteristic of bottom-up (self-service) analytics is that a large part of the validation will be performed by the same analyst or team creating the analysis or model, whereas with e.g. software development this is more separated, with end-user testing, formal acceptance etc. This means that with bottom-up data analyses there is less of a strict separation between modelling, testing and validation and that it is more of a circular approach.

Therefore, some of the techniques that could be considered a separate validation step could by others be considered a logical part of or extension to the modelling step, and vice versa, depending on the situation, execution and personal preference. For this reason, this distinction or separation isn't strictly placed on the controls, techniques and measures presented in the following sections.

2.2.1 IT audit and quality criteria

Within the ISO/IEC norm 8402, quality is defined as "the totality of features and characteristics of a product or a service that bear on its ability to satisfy stated or implied needs" (ISO, 1994). This definition shows that quality is not universal: e.g. one user of data analyses can have other needs than someone else, and needs can even change over time depending on the situation (Tian, 2005). Because the IT-audit profession still wants to be able to give an attestation on the use and quality of information technology, a more granular definition or division of quality is required³.

Similar to the ISO/IEC norm 8402, the International Organization for Standardization (ISO) also issued models related specifically to software (product) quality, initially with the ISO norm 9126 and later with the updated software product quality norm ISO/IEC 25010:2011 (Standardization, 2011). Although ISO 25010 defines different software product quality characteristics, the model itself does not provide concrete models, methods or measurements to evaluate an individual instance of a software product or a data analysis outcome. The lack of these concrete measures means that the ISO 25010 quality model on its own isn't an operational model to evaluate bottom-up data analyses. A somewhat similar issue arises when we look at other well-known IT related models like COSO, COBIT, ITIL, ASL and BiSL: some models more than others are high-level, focused on (overarching) processes, full-blown and require a high investment (with regards to time, money and professional skills). These models can be valuable for top-down data analytics, because these are part of the general information technology within an organization. But as discussed in chapter one, these full-blown models are both unrealistic and undesirable for bottom-up data analytics.

Also the Dutch professional body for IT-auditing (i.e. NOREA), like other professional bodies, has given a description of quality by dividing it into different quality criteria. Although there isn't an unambiguous terminology for the different relevant quality criteria for IT auditing, both outside and even within NOREA (Goor RE CISA, 2004; NOREA, 2007), the following division will be used for this study⁴: effectiveness, efficiency, exclusivity, integrity, verifiability, continuity and manageability.

The division into different quality criteria allows us to place (more or less) emphasis on certain criteria that are, given the specific object, situation and needs, deemed (more or less) important. For example, ISACA states that reasonable assurance of the quality criteria integrity, reliability, usefulness and security of Computer Assisted Audit Techniques (CAATs) needs to be obtained prior to placing reliance on CAATs and that these same criteria hold when an IT-auditor makes use of customized queries or scripts (ISACA, 2010). This means that e.g. the manageability of such scripts is less important for trusting CAATs given the audit objective.

³ E.g. in a paper from '76, 23 different software quality criteria are defined (al, 1976). In McCall's quality factors there is a division of 11 quality criteria divided over 3 categories of factors (Kevitt, 2008).

When we apply this same reasoning specifically to bottom-up data analyses (instead of CAATs and scripts in general) and to the problem statement of this thesis, which is about ensuring trust in bottom-up data analytics and the accompanying outcomes, the following statements can be made regarding the different quality criteria from NOREA:

effectiveness Users of analytical results will always need to determine whether the analytics itself corresponds to their specific requirements, but a potential gap between what is provided by the analytics and what is required or requested by a user does not impact the trust in the actual outcomes, only the usability for that user. This is in line with the ISACA statement on the quality criterion usefulness of CAATs: the IS auditor should assess the usefulness before reliance is placed on the object, but this quality criterion is not about trusting the CAATs; it is about being able to use them for a specific audit goal. Therefore the NOREA quality criterion effectiveness is less relevant for trusting (or promoting trust in) the analyses and the accompanying outcomes;

efficiency Although budget overruns are undesirable from a financial perspective, they do not influence the actual (numerical) results. Therefore this NOREA quality criterion doesn't directly influence trust in analytics. But as will be explained below, this quality criterion is in scope for this thesis;

exclusivity Perhaps from a privacy or intellectual property (IP) perspective you want to limit the users having access to the outcomes of analytics, but this doesn't influence the trust you can place in the analytics itself. Exclusivity does become important when it has a negative effect on the integrity (see below) of e.g. the data, when unauthorized users can make adjustments. This is in line with the ISACA statement on the quality criterion security of CAATs: the IS auditor should both verify the integrity of the data and ensure that production data is sufficiently safeguarded. But as already mentioned, the (source) data used within data analyses is out of scope for this thesis. Also, compliance with data privacy or data protection regulations isn't in scope for this thesis. Therefore the NOREA quality criterion exclusivity is less relevant for trusting the analyses and the accompanying outcomes;

integrity Integrity is essentially about the trustworthiness of analytics and about how well the analytics, given the design, depicts reality. In line with the ISACA statement that integrity and reliability are relevant for CAATs and customized queries or scripts, the NOREA quality criterion integrity is very relevant for trusting the analyses and the accompanying outcomes;

verifiability As trust is an issue with (bottom-up) data analytics, the possibility to understand how an analysis worked, performed and behaved is of crucial importance if you want to be able to provide comfort to users of the analytical outcomes. ISACA states that an IS auditor should verify e.g. the integrity of an object, but this implies that the object itself should lend itself to being verifiable. Not only does the outcome of an analysis need to be correct to be able to trust it, it should also be possible to know or to investigate whether this is indeed the case. Therefore the NOREA quality criterion verifiability is very relevant for trusting the analyses and the accompanying outcomes;

continuity Bottom-up data analyses tend to have a more limited scope, are ad-hoc and are generally not part of the general processes within a company. Therefore it is less likely that problems with bottom-up data analyses will affect the general information processing within an organization. Also, the (numerical) results, if indeed available, are not influenced by continuity problems. Therefore the NOREA quality criterion continuity is less relevant for trusting the analyses and the accompanying outcomes;

manageability Bottom-up data analyses are meant to give answers to the questions users have 'in the here and now'. This also means that these scripts, queries and analyses have a shorter lifespan than e.g. production code, and that the NOREA quality criterion manageability (and maintainability) is less relevant for trusting the analyses and the accompanying outcomes.

Although, as stated above, the two quality criteria integrity and verifiability are deemed the most important for bottom-up data analyses within this study, it was stated earlier that it is not economically feasible to expect the same from bottom-up data analytics as from top-down data analytics. This means that when using certain techniques or controls, the proportionality of that technique or control (in respect to time, money and effort) needs to be taken into consideration. Therefore efficiency is also considered an important criterion within this study. In summary, this means that the three quality criteria integrity, verifiability and efficiency are the main quality criteria that will be used in the remainder of this thesis when discussing bottom-up data analytics and the controls and techniques around bottom-up data analytics.

2.2.2 Controls, techniques and measures

In the previous section we concluded that there are three main quality criteria for the topic of this thesis (i.e. integrity, verifiability and efficiency). Controls, techniques, procedures, processes and measures (hereinafter: controls) that (positively) influence the criteria integrity and verifiability are required to allow for (greater) trust in analytics, while the criterion efficiency should ensure an adequate balance between the time, costs and effort of these controls and the added value of the additional comfort given.

Because of this difference between integrity and verifiability on the one hand and efficiency on the other, a tiered approach is used in this study: in the remainder of this chapter we will look at controls that influence integrity and verifiability, without already looking at efficiency. In the next chapters we will add efficiency to the mix, when we ask professionals in the field to not only look at the added value of a control but to balance their (perceived) added value against the expected costs.

As stated in the first chapter, it is unfeasible to review all possibly relevant literature to identify controls that (can) positively influence integrity and verifiability. Therefore a best-effort approach is used to review a broad range of literature, including software development, software security, model validation, code review, software testing, data conversion, and software assurance. Based on this review, controls were selected that can (positively) influence the criteria integrity and verifiability, resulting in the following selection of techniques⁵.

Note: for readability reasons, the remainder of this section will only introduce the different (business) functions (i.e. governance, construction, testing and verification) over which the different controls are divided. In the referenced appendices the selected controls are displayed and explained.

⁵ The choice is made to divide the different controls over the following (business) functions: governance, construction, testing and verification. This division is based on the division used within the OWASP Software Assurance Maturity Model (Singleton, 2014), but the function verification is split into separate testing and verification functions, and the operations function is not included given the generally shorter lifespan of bottom-up data analytics. The OWASP model, and the reason for using it as a basis, is discussed further in chapter 4.

Governance

Governance is centered on the controls related to how an organization as a whole can manage bottom-up data analytics activities. More specifically, this includes concerns that cross-cut the groups involved in the use and development of bottom-up data analytics (software), to provide a breeding ground for quality analytics. This business function therefore relates to controls that aren't related to an individual project or team, but that should be provided and facilitated by the organization to improve quality.

See section B.1 for relevant controls related to the business function governance.

Construction

Construction concerns the actual processes and activities related to how an individual, team or department defines and creates bottom-up data analyses. In e.g. software development this will include product management, requirements gathering, high-level architecture specification, detailed design, and implementation. With bottom-up data analyses this is generally more limited to basic requirements design, functional design and the actual programming of the analytics. Relevant controls can support the actual analyst in increasing the quality criteria integrity and verifiability during the construction phase.

See section B.2 for relevant controls related to the business function construction.

Testing

Testing is an activity performed for evaluating the quality of the analysis and for subsequent improvement if required. Hence, the goal of testing (or debugging) is the systematic detection of different classes of errors⁶ in a minimum amount of time and with a minimum amount of effort (Jovanović, 2008). With regards to testing and the detection of errors it must be stated that, by a reduction to the halting problem, it is possible to prove that finding all possible (run-time) errors in an arbitrary program is undecidable (Rice, 1953). Nevertheless, relevant controls are intended for the analyst to ensure within reason that the analysis does what it is intended to do⁷ (i.e. the quality criterion integrity).

See section B.3 for relevant controls related to the business function testing.
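As a minimal illustration of such a testing control, the sketch below applies a known-answer test and two sanity checks to an invented aggregation function; the function and the expected values are assumptions for the example, not controls taken from appendix B.3.

    def total_per_day(records):
        """Invented analysis step: sum the amounts per day."""
        totals = {}
        for day, amount in records:
            totals[day] = totals.get(day, 0) + amount
        return totals

    def test_total_per_day():
        records = [("2017-01-02", 120), ("2017-01-02", 95), ("2017-01-03", 130)]
        totals = total_per_day(records)
        assert totals["2017-01-02"] == 215           # known-answer test
        assert sum(totals.values()) == 345           # completeness: totals reconcile
        assert all(v >= 0 for v in totals.values())  # sanity: no negative totals

    test_total_per_day()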

Verification

Verification is generally focused on the processes and activities related to how an analyst or organization checks and tests artifacts produced throughout e.g. software development. This typically includes quality assurance work such as testing, but it can also include other review and evaluation activities (Chandra, 2016). For the scope of this thesis we take a more limited view on verification by limiting it to the quality criterion verifiability. Therefore relevant controls should relate to how bottom-up data analyses can be performed in such a way that it is possible to obtain an understanding of the structure and operation of the analyses and the outcomes, even after a longer period of time, by both the analyst and other stakeholders like third parties.

See section B.4 for relevant controls related to the business function verification.
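As one possible illustration of a verifiability-supporting measure, the sketch below records basic provenance of an analysis run (a hash of the input data, the script used and the parameters), so the run can be re-examined later; which details should actually be recorded follows from the controls in appendix B.4, so the fields here are assumptions.

    import datetime
    import hashlib
    import json

    def log_run(script_path, input_path, parameters, log_path="analysis_log.jsonl"):
        """Append a provenance record for one execution of an analysis."""
        with open(input_path, "rb") as f:
            input_sha256 = hashlib.sha256(f.read()).hexdigest()
        record = {
            "timestamp": datetime.datetime.now().isoformat(),
            "script": script_path,
            "input_file": input_path,
            "input_sha256": input_sha256,  # ties the outcome to the exact data used
            "parameters": parameters,
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")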

2.3 Answers to Subsidiary research questions 1A and 1B

Arguing about which techniques are the best to use is like arguing whether a hammer or a saw is more valuable when building a house. If you try to build a house with just a hammer, you'll probably do a terrible job. More important than the tool is probably the person holding the hammer anyway (Foundation, 2008).

⁶ An error can be defined here as a human action that produces an incorrect result.

⁷ Testing techniques can generally be described hierarchically, e.g. sanity testing is a subset of functional testing, or testing can be divided between black box, white box and gray box testing (Jovanović, 2008). In this thesis we present those techniques that are not too general (e.g. functional testing), but that can still be applied to a general set of analytics, based on professional judgement.

Nevertheless, it is useful for IT auditors, analysts and other interested parties to have some guidance towards available techniques that are specifically useful for bottom-up data analyses. As there is considerable controversy between e.g. software builders, testers, consultants etc. about what is important in software testing and what constitutes responsible software testing (Jovanović, 2008), it is unlikely that there is going to be a universal truth for bottom-up data analyses, and all techniques are going to have their own strengths, weaknesses, sweet spots, and blind spots. Therefore this thesis states the following regarding research questions 1A and 1B:

2.3.1 Subsidiary research question 1A

As discussed in section 2.2.1, for bottom-up data analytics the quality criteria integrity and verifiability are (the most) relevant to enable trust in the analytics. Additionally, the quality criterion efficiency needs to be taken into consideration when determining the proportionality (in respect to time, money and effort) of the controls that relate to the aforementioned quality criteria. Therefore the answer to research question 1A is: integrity, verifiability and efficiency.

2.3.2 Subsidiary research question 1B

In section 2.2.2 we presented four different business functions for which controls are identified within the field of data analytics and/or within related fields like software development. In appendix B these controls are shown, and that appendix therefore presents an answer to research question 1B.

2.4 Summary

This chapter presented both a general overview of (bottom-up) data analysis techniques as well as more details regarding the quality criteria and controls related to enabling trust in bottom-up data analyses. In this chapter we concluded that there are three main quality criteria relevant for bottom-up data analyses (i.e. integrity, verifiability and efficiency) and that there are controls available to (positively) influence these criteria.

The following chapter will build on this by showing how input is gathered from both the IT-audit field and related fields via a questionnaire. This input is used to rank, structure and order the derived controls mentioned in this chapter.

Chapter 3

Questionnaire

Following chapter 2, this chapter presents both the questionnaire used within this thesis to investigate the controls, techniques and measures (hereinafter: controls) discussed in the previous chapter, as well as the detailed and aggregate results based on the input received from respondents.

3.1 Questionnaire design

The main goal of our questionnaire is to receive input from professionals in the field¹ on how they (personally) perceive the added value versus the expected costs of using the controls mentioned in chapter 2 when either creating, using and/or reviewing bottom-up data analytics. Although certain controls might very well suit a mature organization², they might be considered too complex or difficult for certain less mature organizations. Therefore the questionnaire will also ask the respondents at which maturity level of an organization the control might be useful to implement. The respondents will also be asked more generic questions with regards to their own background and experience with data analytics and their general level of trust in (bottom-up) data analytics.

The questionnaire will therefore have the following structure:

• Introduction: a short introduction to the background and structure of the questionnaire;

• General questions: questions with regards to the respondents' background etc.;

• Background information: more in-depth information on how the questionnaire relates to the thesis topic and instructions on how to interpret the questions, including several terms and definitions used;

• Questions on controls, techniques and measures: per business function, respondents are asked to indicate how they perceive each individual control; and

• Closing: several open-ended questions are asked to gather input on how respondents experience bottom-up data analytics.

¹ The focus in this thesis is on (IT) auditors and data analysts, but non-exhaustively this also includes software developers, software security professionals, code reviewers, software testers and data scientists.

With the exception of the general questions and the closing section, each question from the questionnaire will consist of three sub questions, related to either the perceived added value, the expected cost or the required maturity of an organization implementing the control. Each sub question will be answered in the form of a Likert scale: the sub questions related to the added value or the cost will consist of five labeled options and the sub question related to the required maturity of an organization will consist of three labeled options. Although care is taken in describing the controls to prevent misunderstanding of what a control entails, respondents have the option to indicate per question that they don't (fully) understand the control and that that question can therefore be skipped.

Based on the collective input received from the survey, each control will receive a score for each of the three sub questions by computing the average value of the responses. Based on the computed scores for the added value and for the expected cost, a derivative score will be computed to give an overall quality score to each control. A derivative maturity value is also calculated. Details on this are given in chapter four.
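As an illustration only, the sketch below computes such average scores for a single control in Python. The exact derivation of the quality score is given in chapter four, so the simple value-minus-cost calculation here is an assumed placeholder.

    def average(answers):
        return sum(answers) / len(answers)

    def score_control(value_answers, cost_answers, maturity_answers):
        """Average the Likert responses for one control and derive its scores."""
        added_value = average(value_answers)         # five-point Likert scale
        expected_cost = average(cost_answers)        # five-point Likert scale
        maturity = average(maturity_answers)         # three-point Likert scale
        quality_score = added_value - expected_cost  # assumed derivation
        return quality_score, maturity

    # Example: responses of three respondents for a single control.
    print(score_control([4, 5, 3], [2, 3, 2], [1, 2, 2]))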

Using the division of controls as discussed in chapter two³, the highest scoring controls (i.e. based on the quality score) per required maturity level will be selected for the maturity model: each selected control will therefore be mapped to both a maturity axis and a life-cycle axis in the model. Filling in the questionnaire will take approximately between 45 and 75 minutes.

³ Controls are mapped to either governance, construction, testing or verification.

3.1.1 Population and sample size

As already stated in chapter one, this thesis is about controls that, when used, could increase trust in bottom-up data analytics in a way that is acceptable within both the IT-audit profession and the data analysis profession. It is therefore logical to focus the questionnaire on these two groups of experts, as they are likely to offer insights into research questions 2A and 2B that would not be attained by a literature study alone. As the techniques mentioned here are also extracted from the broader profession of software professionals, and because the IT-audit profession generally interacts with both the operational and the financial auditor, the questionnaire is also sent out to other software professionals and audit professionals in general. These professionals are targeted via direct contact to promote a high response rate.

With regards to the sample size required for this thesis, the choice is made to aim for a minimal response of 15 respondents, with a minimal response of 5 respondents within the two main target groups⁴. This choice is made as there generally is no objective minimal required number of respondents and it will always depend on the situation (Baker and Edwards, 2012). By limiting the required sample size, the questionnaire allows for more questions than would usually be appropriate for e.g. a questionnaire targeted at a large and unknown audience, given the time required to fill in such a questionnaire. This increase in the scope of the questionnaire will likely make up for the possible lack of variety in respondents.

³ Controls are mapped to either governance, construction, testing or verification.

⁴ The target groups are IT auditors, data analysts, software professionals without a focus on data analysis, and other auditors (i.e. financial and operational). The IT auditor and the data analyst are, within these separate target groups, the two most important groups for this thesis, as the IT-audit profession and the data analysis profession are explicitly mentioned in the research questions.

3.1.2 Pilot survey

After the first version of the questionnaire was completed, the questionnaire was reviewed by both a data analysis professional with a background in mathematics and the thesis supervisor, with the aim to detect any flaws in the questioning and to correct these prior to the main survey. Their feedback resulted in a final version of the questionnaire. See appendix C for the initial version of the questionnaire prior to review and appendix D for the final version as used in the main survey.

3.1.3 Questions and question motivation

The version of the questionnaire as used in the main survey is presented in appendix D. This section describes the questions and other elements in the questionnaire in a high-level fashion, including the motivation behind certain elements.

Introduction

The introduction is meant to introduce the topic of the questionnaire and to describe to the respondent the high-level structure of the questionnaire. More detailed information to guide the respondent is given in a following section.

General questions

The respondent is asked some administrative questions relevant for the scoping of the population. The respondent is also asked to elaborate on the relation between their function and the use of data analysis.

About this questionnaire

In this section it is explained to the respondent how the questionnaire relates to this thesis study, and that the topic is the risk of bottom-up data analysis and the means to control this risk. It is explained how each question consists of three sub questions that relate to 1) the (NOREA) quality criteria integrity and verifiability, 2) the quality criterion efficiency, and 3) the maturity of an organization.


To ensure a correct understanding of the questionnaire and the individual questions, certain terms and definitions related to (bottom-up) data analysis are also given and an example question is explained.

Substantive questions and closing

This section contains 39 questions divided over the four business functions described in chapter 2. In closing, 5 open-ended questions are also asked.

3.2 Response

Due to the time required to fill in the questionnaire, as well as the specific target groups, 31 potential respondents were first asked personally if they would be willing to fill in the questionnaire. 26 respondents replied positively and to them the questionnaire was sent out. Of these potential respondents, 17 replied before 18-08-2017 and their results were incorporated into this research⁵.

The majority of the questions are closed-ended questions. The responses to these have a major influence on the proposed solution discussed in chapter four. These will therefore only be touched upon briefly in the remainder of this chapter. As the open-ended questions have less of a direct influence on our proposed solution, the responses to these are discussed in more detail in the following section.

3.2.1 Summary of responses

For the closed-ended questions, an overview of the responses is displayed in appendix E. These results are used in chapter four for creating an operational maturity model, and are therefore not discussed any further in this chapter.

With regards to the open-ended questions, respondents were asked about their personal (business) experience with data analysis and their general level of trust in analytics. They were also asked to indicate if they agree with our original statement, or observation, that the application of self-service data analysis poses business risks when business decisions are made based on the analyses without proper controls to ensure the integrity and verifiability of the analyses. Based on their answers, the following observations⁶ are made:

• Each respondent indicated that they agree with our original statement that there are business risks with bottom-up data analyses.

⁵ Please see appendix H for details on the respondents.

⁶ Due to the unstructured nature of the responses, in combination with the (limited) number of respondents, no statistically significant statements can be made. Therefore the term observation is used.


• 8 respondents also made additional comments to the question whether they agree with our original statement: 5 respondents indicated that, although they agree, the cost-benefit trade-off or proportionality of controls is really important, and that the potential impact of the analyses (e.g. which decisions are made based on them) is important to take into consideration. Other remarks indicated that there is a lack of proper education and training, and that (master) data management and verification of premises is (also) important; one respondent also indicated that there are a few 'basic controls' that could be considered quick wins.

• With regards to the level of trust respondents place in analytics, their responses vary from very positive to quite negative. Factors that influence this trust are: the availability of good controls, the complexity of the analyses and whether there has been a review. A major contributing factor to the level of trust, mentioned by more than half of the respondents, is the human factor: sufficient knowledge, skill, expertise, experience and common sense are required to trust the analyst and, with that, the analyses and their outcomes.

• Some respondents also added suggestions for other possibly relevant controls: the use of a standard library of code for recurring tasks, staff training within the organization and, when using test automation software, also including controls on the test automation itself. References were also made to data privacy, (master) data management etc., but as already discussed, this is deliberately taken out of scope for this thesis.

• When it comes to their personal experience, some respondents described positive, negative or neutral examples they encountered during reviews: hard-coded values can have a very negative effect, clean code can be bad, and messy code can work perfectly. One respondent indicated that when reviewing code there are almost always mistakes to be found. Another respondent indicated that details like 'larger than' versus 'larger than or equal' occur a lot, as illustrated by the sketch after this list. Also, uncommented and undocumented code is difficult to review, and having a good connection between an analyst and an SME can make or break an analysis.
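To make the last two review findings concrete, the sketch below shows how a single comparison operator, combined with a hard-coded value, can silently change an analysis outcome; the rule, amounts and threshold are invented for illustration.

    # Hypothetical example of two review findings mentioned above: a boundary
    # mistake ('larger than' vs 'larger than or equal') next to a hard-coded
    # threshold. All values are invented for illustration.
    invoices = [9999, 10000, 10001]  # invoice amounts in euros

    # Intended rule: report every invoice of 10,000 euros or more.
    flagged_wrong = [a for a in invoices if a > 10000]   # drops the 10,000 case
    flagged_right = [a for a in invoices if a >= 10000]  # matches the intention

    print(flagged_wrong)  # [10001]  -> a boundary record is silently missing
    print(flagged_right)  # [10000, 10001]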

In the closing section, respondents also indicated via a closed-ended question how they perceived the relative importance of the four business functions, using a four-point Likert scale. Using a basic calculation⁷, the relative ordering and preferences between the business functions were investigated. This resulted in the following observations:

• Looking at the entire population, there is the following preference between the four business functions in relative importance (from high to low): construction, testing, verification and governance;

• If we look intergroup⁸, the IT auditors are the only population group that indicated that the testing phase is more important than the construction phase;

• Again intergroup, the group of (non-IT) auditors indicates that governance is more important than both testing and validation; and

• The group of analysts sees the validation phase as less important than the other groups do.

⁷ Using the four-point Likert scale, each business function received between 1 and 4 points per respondent, with the most 'negative' answer given a score of 1 and the most 'positive' answer a score of 4. The average score per business function per (sub)group indicated the relative ordering and preference.
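As a minimal sketch of the basic calculation described in footnote 7, the snippet below averages the 1-to-4 scores per business function and sorts the functions accordingly; the response values are invented for illustration.

    # Minimal sketch of the footnote 7 calculation: each respondent scores every
    # business function from 1 (least important) to 4 (most important); the
    # average per function gives the relative ordering. Values are invented.
    responses = [
        {"governance": 1, "construction": 4, "testing": 3, "verification": 2},
        {"governance": 2, "construction": 4, "testing": 3, "verification": 1},
        {"governance": 1, "construction": 3, "testing": 4, "verification": 2},
    ]

    averages = {
        function: sum(r[function] for r in responses) / len(responses)
        for function in responses[0]
    }

    # Print the highest average first, e.g. construction before testing.
    for name, score in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {score:.2f}")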

The results from this questionnaire indicate how different professionals view the controls discussed in this thesis and whether these can be considered practical and viable controls in the context of bottom-up data analysis. So although further analysis is performed in the following chapter, these responses to both the open-ended and closed-ended questions are themselves an answer to sub question 2A.

The answers to the open-ended questions also substantiate that, as discussed in chapter one, there are (business) risks with bottom-up data analysis, or at least that this is the view of the respondents, and that controls can help with these risks, but that this needs to be proportional. Furthermore, (bottom-up) data analytics isn't always trusted, and respondents indicate that trust can be increased via the appropriate controls (e.g. reviews), but that there is also a large human factor involved with regards to someone's skills, knowledge and expertise. These observations indicate that the premises underlying part of the original research questions are indeed correct.

⁸ When looking between the population groups, the individual group sizes are too small to consider these observations as very substantive.



Chapter 4

Proposed solution

In chapter three we presented a questionnaire that was used to receive input from professionals in different fields. In this questionnaire, respondents were requested to provide their personal view on a set of different controls, techniques and measures (hereinafter: controls) that were taken from the literature study discussed in chapter two. The input received from the respondents shows how certain controls are generally seen as more preferred (e.g. a better ratio between the gains and the costs) than others. However, this on its own does not provide sufficient guidance on when to use these controls in practice. Therefore, in this chapter we present a basic maturity model together with a scoring model that can be applied to the output of the questionnaire to place (a selection of) the controls within that maturity model. This results in an operational (maturity) model that is presented as an answer to the last subsidiary question.
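As a minimal sketch of how the resulting operational model can be represented, the snippet below places controls on the two axes described earlier, a maturity axis and a life-cycle axis (the four business functions); the control names, levels and quality scores are invented for illustration.

    # Minimal sketch of the operational model: each selected control is placed
    # on a maturity axis and a life-cycle axis (the four business functions).
    # The control names, levels and quality scores are invented for illustration.
    from collections import defaultdict

    controls = [
        # (name, business function, required maturity level, quality score)
        ("version control of analysis scripts", "construction", 1, 2.4),
        ("peer review of analysis code",        "verification", 1, 2.1),
        ("automated regression tests",          "testing",      2, 1.8),
        ("analytics policy and roles",          "governance",   3, 1.5),
    ]

    model = defaultdict(list)
    for name, function, level, quality in controls:
        model[(level, function)].append((name, quality))

    # Within each cell of the model, list the highest-scoring controls first.
    for (level, function), cell in sorted(model.items()):
        cell.sort(key=lambda c: c[1], reverse=True)
        print(f"level {level} / {function}: {[name for name, _ in cell]}")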

4.1 Maturity model

As already mentioned in chapter one, part of the goal of this field study is to create a (maturity) model that could facilitate and/or ensure trust in or reliance on bottom-up data analytics. In chapter two the focus was on which controls could support this, and in chapter three the resulting set of controls was rated using input from experts. These same experts also provided input on the required maturity of an organization that wants to implement these controls.

As this study is a starting point for controls specifically relating to bottom-up data analytics, we propose a relatively simple maturity model. This model and the maturity levels used in the questionnaire are based on the OWASP Software Assurance Maturity Model¹ (hereinafter: SAMM) (Chandra, 2016). This means that we have three maturity levels and an implicit starting point at zero:

0 Implicit starting point representing the situation that there is no use of bottom-up data analytics;

1 Initial understanding and use of bottom-up data analytics;

¹ The main functional difference between our proposed model and SAMM is that SAMM focuses solely on security in software, while our model focuses solely on bottom-up data analytics.
