Data quality improvement: procedure to improve data quality issues within and between sources

(1)

DATA QUALITY IMPROVEMENT

Procedure to improve data quality issues within and

between sources

Universiteit Twente

PDEng Design Report

Bas Turpijn - s0012122

(2)

CONTENT

MANAGEMENT SUMMARY ... 5

1. INTRODUCTION AND RESEARCH METHODOLOGY... 6

1.1 Background ... 6 1.2 Objectives ... 8 1.3 Design Approach ... 8 1.4 Research Questions ... 9 1.5 Research Methods ...10 1.5 Conclusion ...12

PART I PROBLEM FORMULATION...13

2. CONTENT OF DATA QUALITY ...14

2.1 Research Opportunities on Data Quality ...14

2.1.1 Phases and steps in data quality research ...14

2.1.2 Dimensions and metrics to assess data quality ...15

2.1.3 Methods and techniques to improve data quality ...17

2.1.4 Types of data ...18

2.1.5 Research Opportunities on Data Quality Issues Within and Between Sources 18 2.2 Data Quality Framework ...19

2.3 Conclusion ...20

3. SITUATIONAL CONTEXT...21

3.1 RWS Situational Context ...21

3.1.1 Scope ...21

3.2 Data Quality Requirements in Situational Context ...22

3.2.1 AM Business Requirements ...22

3.2.2 Data Quality Requirements ...23

3.3 Conclusion ...24

4. DATA QUALITY ASSESSMENT IN SITUATIONAL CONTEXT ...25

4.1 Introduction Case Study ...26

4.2 Nature and extent of the Data Quality Issues in Situational Context ...29

4.2.1 Quality measurements within sources ...29

4.2.2 Quality measurements between sources ...31

(3)

PART II BUILDING, INTERVENTION AND EVALUATION ...34

5. PRODUCT PROTOTYPE ...35

5.1 Data Linking Methods ...36

5.2 Data Normalization Methods ...38

5.3 Data Integration Methods ...39

5.4 Process Control Methods ...40

5.5 Process Redesign Methods ...42

5.6 Conclusion ...43

6. PRODUCT SPECIFICATION ...44

6.1 Overview ...45

6.2 Detailed Description ...48

6.2.1 Situational Method Fragment 1: Combination of Linkage by Key and Linkage by Similarity ...49

6.2.2 Situational Method Fragment 2: Text Normalization by Morphological, Syntactical and Semantical Analysis ...54

6.2.3 Situational Method Fragment 3: Data Integration by Local-as-View ...57

6.2.4 Situational Method Fragment 4: Automated Feedback Control ...61

6.2.5 Situational Method Fragment 5: Process Redesign by Operational Method Study ...65

6.2.6 Assembly strategy ...66

6.3 Use ...67

6.3.1 Situational method base to improve data quality within sources ...67

6.3.2 Situational method base to improve data quality between sources ...69

7. PRODUCT EVALUATION ...71

7.1 Selection of Prototype Method Fragments ...72

7.2 Situational Method Fragments ...74

7.2.1 Case Study SMF 1 ...77

7.3 Situational Method Base ...86

7.3.1 Situational Method Base for DQ Issues within Sources ...86

7.3.2 Situational Method Base for Redundancy and External Inconsistencies between Sources ...88

(4)

PART III REFLECTIONS AND FORMALIZATION LESSONS LEARNED ...92

8. REFLECTIONS...93

8.1 Design Principle 1 - Practice-Inspired Research ...93

8.2 Design Principle 2 – Theory-Ingrained Artefact ...93

8.3 Design Principle 3 – Reciprocal Shaping ...94

8.4 Design Principle 4 – Mutually Influential Roles ...94

8.5 Design Principle 5 – Authentic and Concurrent Evaluation ...95

8.6 Design Principle 6 – Guided Emergence ...96

8.7 Design Principle 7 – Generalized Outcomes ...96

8.8 Conclusions ...97

9. FORMALIZATION LESSONS LEARNED ...98

9.1 Lessons learned for proper use of the design product in current form ...98

9.2 Recommendations for improvement of the design product for use in broader context ...99

9.3 Conclusions ... 100

10. CONCLUSIONS ... 101

(5)

MANAGEMENT SUMMARY

Poor data quality is a barrier for enterprises to benefit from the growing amount of data available. Although data quality has been a research topic for many years already, there is a lack of experiments to validate different data quality methodologies in practice. Therefore, there is a call, both from research and practice, to develop improvement methodologies in practical – situational - contexts.

This PDEng project aimed to develop a procedure to reduce data quality issues. We distinguished data quality issues within and between sources in order to cover the main data quality issues faced by enterprises. In order to design an improvement procedure, we followed the Action Design Research Framework (ADR). ADR is typically conducted when both practical relevance and academic rigor are required. The building blocks of our improvement procedure are situational method fragments (SMF’s). SMF’s are adjusted and/or refined existing methods in order to be applied in a situational context. We assembled these SMF’s to a situational method bases (SMB). The SMB specifies our improvement procedure.

We developed Situational Method Bases (SMB) for three situational contexts: 1) invalid, incomplete and internal inconsistent data within a source, 2) redundant and external inconsistent data between two or more sources and 3) poor interoperable and linkable data between two or more sources. The building blocks of the SMB’s are our five SMF’s:

1. Combination of linking by key and linking by similarity.

2. Text Normalization by morphological, syntactical and semantical analysis 3. Data Integration by Local-as-View

4. Automated feedback control 5. Operational Method Study

These SMF’s have been allocated to the SMB’s, each dealing with a situational context: 1. Invalidity, incompleteness and internal inconsistency: SMF 1 + SMF 2 + SMF 4 2. Redundancy and external inconsistency: SMF 3

3. Poor Interoperability & Linkability: SMF 2 + SMF 5

These SMB’s have been evaluated on effectiveness to reduce poor data quality, observed in case studies regarding our situational context: Asset Management and Civil

Engineering Objects. The main issues in this context concerned too many invalid and incomplete data instances within sources, redundant and external inconsistent data between sources and also poor interoperability & linkability. Although some

incompleteness issues left over, most data quality issues were solved. The improvement procedure is thus effective to reduce poor data quality.

Before using the improvement procedure, the user should always compare the

characteristics of the own situational context with the situational features of the SMF’s. Also involvement of practitioners working in the context should be secured.

Some further developments are recommended for application in broader context: - Conduct more advanced data mining and/or machine learning experiments 1) to

improve effectiveness of the procedure in order to deal with the issues left over and 2) to deal with inaccurate data. We were not able to develop metrics to assess data accuracy, which is however an important quality dimension; - Conduct more advanced experiments on semantical analysis to improve

effectiveness of text normalization (SMF 2). Due to time limitations, we could not incorporate semantical analysis, making text normalization less effective.

(6)

1. INTRODUCTION AND RESEARCH METHODOLOGY

1.1 Background

Since last years, we can observe an enormous growth of data (figure 1). Data is not only growing in volumes, measured in bytes. It is also growing in its velocity and variety (Steenbruggen, 2015). Velocity refers to the frequency data is processed through an organization, variety refers to the great variety of data sources several data types

(Alharti et al, 2017; Sagiroglu & Sinanc, 2013). These 3 V’s really make data “Big Data”.

Figure 1 Data is growing (Reinsel, 2018)

To take advantage of the benefits from these data, organizations need to implement data centric working approaches. Data centric working approaches like Business intelligence and analytics (BI&A) and Data Analytics (D&A) have become increasingly important in both the academic and the business communities over the past two decades (Hsinchun Chen et al., 2012).

Data is the base for understanding and understanding is the base for decision making (figure 2). Data are represented with symbols, e.g. numbers, words, signals or images. Data processed to have meaning is called information and provides answers to “who”, “what”, “where” and “when” questions. When applying information, “how” questions can be answered and knowledge is gained (Bellinger, 2004). A data centric working approach comprises the techniques, technologies, systems, practices, methodologies and

applications that analyzes data to better understand its business and benefit from a better decision making process (Hsinchun Chen et al, 2012).

(7)

Figure 2 From data to wisdom: a data centric working approach (Ackhof, 1989)

Many organizations however are facing some barriers to work more data centric. Most frequent mentioned barriers by data management professionals: inadequate staffing and skills, lack of business support, unready IT infrastructure, fragmented data landscape and poor data quality (Fernandez, 2018; Berndtsson et al, 2018; Alharti et al, 2017; McComb, 2016; Cai & Zhu, 2015).

The board of Rijkswaterstaat (RWS) is very aware on the benefits of data centric working (see figure 2). But, like other organization, RWS is also facing the challenges to

overcome the potential barriers to work more data centric. Especially, a fragmented data landscape and poor data quality are the major problems in de current information

systems (Steenbruggen, 2015). Practice at RWS and other organization (Redman, 1998) show that a fragmented data landscape often go hand in hand with redundancies and inconsistencies between the data sources. Therefore, we summarized the issues of poor data quality and a fragmented data landscape by data quality problems within and between sources, respectively.

Data quality issues within and between sources problems manifests itself in (Sandkuhl, 2011; Haug, 2011; Redman, 1998):

· Increasing lead times in information retrieval;

· Increasing running costs, due to data repair activities (see figure 3); · Irrelevant results from queries;

· Redundant data capture and storage.

All these problems are noticeable at RWS, hindering the organization to fully understands its business, work more data centric and improve its decision making process. Therefore, RWS initiated this PDEng project to formulate the data quality issues within and between sources from scientific view and design a solution for these issues for practical use.

(8)

Figure 3 Cost of Data Quality (Haug, 2011)

1.2 Objectives

The PDEng project aims to develop a procedure to reduce the observed data quality issues within and between sources.

1.3 Design Approach

In order to achieve the design objective, we adopted the Action Design Framework (ADR, Sein et al, 2011, see figure 3). ADR is a combination of action research and design science research and comprises four stages. These stages are executed cyclical and reiterative and not sequential, like in more traditional design approaches. When the problem is clear and/or the design environment is relatively stable, a sequential method could be quite appropriate (Wieringa, 2014). ADR is typically conducted in a dynamic organizational context when the problem and goals are still open to debate (Sein et al, 2011) and also when practical relevance and academic rigor are required (Rogerson & Scott, 2014). Our project environment is an example of a context where the design environment is not stable yet, but dynamic. This argues in favour of ADR:

Ø The RWS vision on Information Services and Data Architecture Principles are still subject to organizational developments and there are also many knowledge questions on this field. The nature and extent of the data quality problem and the data quality improvement goals are open to debate and the working processes are dynamic.

Ø The RWS data are captured, stored and used in these working processes. Therefore, the data quality problem is particularly perceived in operational practice and has thus practical relevance.

Ø RWS is also a very large organization, doing many kind of working processes. There is a call for uniform procedures to deal with similar kinds of practical problems throughout the organization, based on state-of-the-art scientific knowledge and academic rigor.

Therefore, we regarded ADR as an appropriate design approach for our project. The four ADR stages are (Cronholm et al, 2016; McCurdy et al, 2016; Sein et al, 2011):

(9)

1. Problem formulation

One of the main tasks in this stage is to describe the problem content in more detail by scanning the research opportunities related to the problem content and related work already been done. Another task is identifying the project

requirements and the practical issues in the project context related to the problem.

2. Building, Intervention and Evaluation (BIE)

This stage develops the solution, evaluates its performance and intervenes into the design process when further refinement and/or adjustment is needed. We included design principles from Situational Method Engineering (SME). SME is an approach, aiming to reshape existing (prototype) methods and make them applicable in a specific situation. Since SME emphasizes on practical relevance (Harmsen et al, 1994), it suits well with ADR. We included principles from the field of System Engineering (SE) to evaluate our solution (Blanchard & Fabrycky, 2014).

3. Reflect and learn

This stage encourage designers to reflect on the design process and the use of the design product in order to identify gained knowledge related to the research opportunities.

4. Formalize the lessons learned

The main task in this stage is to generalize the gained knowledge in order to apply the solution in broader context and formulate recommendations for further

developments and improvements.

1.4 Research Questions

Based on the four ADR stages and related tasks, we formulated four research questions and several sub questions.

1. How can we formulate the data quality problem?

a) What are the research opportunities on data quality?

b) Which theoretical framework can serve as base to conduct research on data quality?

c) What are the data quality requirements in our situational context at RWS? d) What is the nature and extend of the data quality in our situational

(10)

b) How can we intervene into these methods to improve data quality in our situational context?

c) How can we assemble these refined methods to an effective improvement procedure?

d) What are the effects of the refined methods and the final improvement procedure on the observed data quality problems in our situational context?

3. Which knowledge has been gained from the design process and the use of the design product?

4. How can we use this knowledge to apply the design product in broader context?

1.5 Research Methods

For every ADR stage, we elaborated the research methods to execute the stages and answer the research questions:

o Problem formulation

o Q1: How can we formulate the data quality problem? a) What are the research opportunities on data quality

This step aims to better understand the theory behind data quality research and confine the research area further. We chose for

literature review and explored materials that have been provided by the courses followed in the PDEng trajectory and from scientific journals about data quality.

b) Which theoretical framework can serve as base to conduct research on data quality?

Idem 1a).

c) What are the data quality requirements in our situational context at RWS?

This step aims to better understand the context of data quality at RWS and discover the requirements for our solution. Therefore, a review on scientific literature and RWS enterprise documents, combined with stakeholder interviews to map the theory to the real world, has been considered most appropriate to achieve these goals. We explored materials from scientific journals regarding requirement analysis and written sources from the organization and conducted stakeholder interviews to do the mapping and discovery of the data requirements.

d) What is the nature and extend of these data quality in our situational context?

(11)

Here, we transferred the data quality theory to the practice of our context at RWS. Case studies are most appropriate to map

theoretical concepts to a real world entities. The disadvantage of case studies is the difficulty to generalize the results. By selecting representative samples of the data landscape, we were able to deal partly with this deficiency. RWS has a strong geographical oriented organization structure, but the data sources have national data models. Therefore, confinements to region and certain asset types have been considered appropriate samples to set the scope for the case studies. Chapter 3 addresses the considerations behind the sample choice in more detail. On the sample extraction, we

assessed the data quality conform our data quality framework. With practitioners working in region on the selected assets, we validated the assessment.

o Building, Intervention and Evaluation (BIE)

o Q2: What is an adequate improvement procedure to decrease the quality problems?

a) What are appropriate methods to improve data quality?

The selection of prototype methods have been done by literature review to discover the state-of-the-art techniques available. The selection has been validated by interviews with data quality experts. b) How can we intervene into the prototype methods to improve data

quality in our situational context?

To design appropriate and practical methods, we used concepts of Situational Method Engineering (SME - Harmsen et al, 1994; Ralyte & Rolland, 2001). SME aims at defining situational method

fragments by selecting, refining and assembling different existing prototype methods.

c) How can we assemble these refined methods to an effective improvement procedure?

Based on the outcomes of the previous step, we assembled the effective SMF’s to an improvement procedure: the Situational

Method Base (SMB). To realize this product, we also use concepts of SME to assemble methods (Hamsen et al, 1994; Ralyte & Rolland, 2001).

d) What are the effects of the refined methods and the final

improvement procedure on the observed data quality problems in our situational context?

(12)

of the results of the refined methods. It has been conducted with help of practitioners who were able to give expert opinion about the correctness. The validation aims to examine the effectiveness of the refined methods and the final improvement procedure in the context of the design objectives: the reduction of observed poor data quality instances.

o Reflect and learn

o Q3: Which knowledge has been gained from the design process?

This product is the outcome of reflections on the knowledge that has been gained form the design process. We used the 7 ADR design principles to discuss the adherence to these principles during the design process. In addition, we reflected whether the use of the design product contributed to the research opportunities. This has been done by a desk study.

o Formalize the lessons learned

o Q4: How can we use this knowledge to apply the design product in broader context?

We also attempted to formulate general statements about the SMF’s and SMB and new research opportunities. This has been done by a desk study.

1.5 Conclusion

Figure 4 summarizes the main stages from the ADR approach and presents also the structure of the design report. In part I, the reader can find more detail on the problem formulation. The results from the design stage (Building, Intervention and Evaluation) have been reported in part II. We merged the third and fourth stage of the ADR approach to form part III of the design report, since they are closely intertwined. Part III addresses the reflections on and lessons learned from the design process.

(13)

PART I PROBLEM FORMULATION

The Problem Formulation explains the content and context of the problem in more details.

When formulating a problem, we need to develop a theoretical base to define structures behind the problem and the context. We started with the research opportunities and related research on data quality within and between sources in chapter 2, resulting into a data quality framework. Chapter 3 puts focus on the situational context, wherein our solution has to perform and solve the problem. This situational context also gave the main input for the data quality requirements. We concluded the problem formulation with a case study on the nature and extent of the problem in the situational context of our project. This is topic of chapter 3.

After reading part I, the reader should have understanding about the research

opportunities in the field of data quality, the content of the data quality issues within and between sources and also the nature and extent of it in our situational context.

(14)

2. CONTENT OF DATA QUALITY

This chapter describes the content of the problem more in detail: data quality. The quality of data plays an important role in enterprise applications, having large impact on operating processes as we could read in the previous chapter. Over the time, researchers worldwide have produced a vast amount on methodologies and techniques to describe and deal with data quality issues.

This chapter starts with a review on data quality issues from literature. From this review, we derived the research opportunities for our PDEng project and a conceptual framework how to deal with data quality.

This chapter refers to the research question 1a and 1b and addresses the data quality framework:

2.1 Research Opportunities on Data Quality

In the 90’s data quality started to become a serious subject of research, leading to several definitions of and perspectives on data quality (Cai and Zhu, 2015). The Total Data Quality Management Group of the Massachusetts Institute of Technology (MIT) proposes a definition from a user perspective and defines data quality as “the fitness for use” (Wang and Strong, 1996). In addition, Batini et al (2009) provide a survey on several common perspectives to look at data quality from several views:

1. phases and steps in data quality research;

2. dimensions and metrics for data quality assessment;

3. strategies and techniques for data quality improvement and; 4. the types of data to be considered.

We used this survey as a base to explore data quality issues and extended it with insights from other authors to discover our research opportunities.

2.1.1 Phases and steps in data quality research

From this perspective, data quality is regarded as a sequence of activities needed to deliver information products of good quality. Several researchers compare product manufacturing with information manufacturing (Wang et al, 1995; Shankaranayan, 2000; Ackhof, 1989):

Product Manufacturing Information Manufacturing

Input Raw materials Raw data

Process Materials processing Data processing

Output Physical products Information products

Table 1 Information Manufacturing Process

Analog to assure product quality, data quality handlings are needed in order to satisfy the user information demand (Wang et al, 1995). Several authors mention the main handlings, or phases, which should be part of data quality research (Farid et al., 2016; Debatista et al., 2015; Batini et al., 2009):

1a. What are the research opportunities on data quality?

(15)

1) Preparation;

Preparative steps are mainly about gathering the relevant information and give structure to it. The preparation phase does not require much effort if all these information is already available (Batini et al., 2009). Examples about preparative steps are:

· collecting metadata and information on the content and context of data (like data models, data definitions);

· setting up a data management plan; · collecting business rules.

2) Quality Assessment;

This handling comprises several steps to measure the data quality and quantify the amount of data records which violate the constraints or quality rules. Examples from literature regarding assessment activities are:

· information system analysis;

· determining the area of interest (scope); · data quality requirement analysis;

· measurement of the data quality in the area of interest. 3) Quality Improvement;

Logically, after the assessment follow the steps to repair the as erroneous classified data and improve the data quality. Examples from literature regarding improvement activities:

· determining the impact on working processes; · identification the error causes;

· selection, refinement and evaluation of methods to repair poor data quality;

· assembly of methods to design improvement solutions.

Preparation, assessment and improvement are thus the most important phases in data quality research, each having some logical activities or steps.

2.1.2 Dimensions and metrics to assess data quality

From the perspective of data quality assessment, definitions of data quality and units of quality measurement are needed. In other words: dimensions and metrics. Literature provides us with all kinds examples of dimensions, but there exists a general basic set of data quality dimensions referring to the fitness for use of data values (Cai and Zhu, 2015; DAMA, 2013; Magnani and Montese, 2010; Batini et al., 2009; Wang et al, 1995):

Ø Validity

Validity has to do with the closeness of a data value (or instance) to the elements of the corresponding definition domain. By example, the definition of tallness is that it should be a represented by an integer data type and in cm as unity. So any data instance not corresponding to this definition is not valid.

Ø Completeness

Completeness is defined as the degree to which the database includes the data describing the real world objects. Often, completeness is related to the amount of NULL values, blanks or instances providing non useful information like “not

known”.

Ø Consistency (Internal)

Consistency refers to the violation of semantic rules defined over a set of

instances. Well known semantic rules are integrity constraints. Two fundamental categories are distinguished: intra-relation constraints and inter-relation

(16)

Ø Accuracy

Accuracy refers to the extent whether the data instance value represents the real world object and its features correctly. In the example of a person age: the data might refer to the wrong person or his/her tallness is not correct.

Ø Timeliness

Timeliness refers to the data update over time. There is no common definition of timeliness, but often it is determined by comparing the creation or mutation data with the frequency of change. During childhood, it is expected that a person’s tallness will change, but as an adult, it should be rather stable.

These dimensions apply all to quality issues within a source. In addition, we found also quality dimensions in data quality frameworks, concerning issues between data sources. Based on examples from literature, we summarized the basic dimension on data quality between sources:

Ø Interoperability & Linkability

Interoperability between data sources is an important dimension to ease data exchange (Howell et all; 2016). Interoperability addresses the ability of

information systems to have clear and shared expectations regarding the content, context and meaning of data.

Linkability has to do with the ability to retrieve matches between real word objects in different sources, because they are somehow related to each other (Daas et al, 2009).

Ø Consistency (External)

Consistency can also be referred to the similarity of values on common or related properties if a real-world object has been represented in two or more data

sources. The values on the common properties for the same object should be similar between the sources. In this report, we refer to external consistency, when it comes to consistencies between two sources, whereas internal consistency refers to the adherence of integrity constraints within a source.

Ø Redundancy

Redundancy has to do with multiple representations of one real world object. Storing data redundant goes against the basic data modeling principles, but it is common practice for reasons like data access from several user perspectives and improve efficiency (Maydanchik, 2007). Risks of redundantly stored data are slowing down system performance, giving invalid data output and affecting data integrity negatively.

To every dimension, a metric can be attached: the metric is a translation of the rule or constraint into measurable units (Batini et al., 2009). The measurements of these metrics should be compared to data quality requirements in order to assess the data quality problem. The data quality requirements depend on the situational context and are topic of next chapter.

Table 2 summarizes the mentioned quality dimensions, both within and between sources, and their metrics.

(17)

Dimension Metric Quali ty withi n sourc es

Validity Amount of values conform data definition Completeness Amount of filled values

Internal Consistency Amount of values conform constraints Accuracy Amount of values conform reality Timeliness Date of last modification

Quali ty betw ee n sourc es

Interoperability/Linkability Amount of data from source I that can be linked to source II

External Consistency Amount of value in field A of source I that equals the value in field B of source II

Redundancy Amount of duplicates between the sources

Table 2 Data quality dimensions and metrics for issues within and between sources (Cai and Zhu, 2015; DAMA, 2013; Magnani and Montese, 2010; Batini et al., 2009; Daas et al, 2009; Maydanchik, 2007; Wang et al, 1995)

2.1.3 Methods and techniques to improve data quality

From the perspective of data quality improvement within and between sources, two general types of methods can be adopted (Isele and Bizer, 2013; Batini et al., 2009; Elmagarmid et al, 2007):

o Data driven methods o Process driven methods

The data driven methods improves the data quality modifying the raw data instances directly, whereas the process driven methods is doing that by redesign the data processing.

Data linking, data normalization and data integration methods are widely most adopted data driven approaches (Batini et al., 2009).

Data Linking (also mentioned record linkage, data deduplication or entity matching) aims to find similar real world objects in other data sources to deal with defragmentation and deficiency issues (Batini et al, 2009; Elmagarmid et al, 2007; Isele and Bizer, 2013). This a popular topic of research in the field of linked data (Bizer and Health, 2011). Data linking methods only apply to the data instance level, since changes at schema level or higher is not needed.

Data Normalization (also mentioned standardization) aims to use computer memory more efficient and prevent data heterogeneity. A well-known application of normalization is in relational database design (Codd, 1970), standardizing representations of real world objects in the database. Over the years, specializations in particular branches have evolved, like textual normalization (Jurafsky & Martin, 2016; Mohbey & Tiwari, 2011), which is very handy if representation are in textual characters. Normalization applies typically to the schema and instance levels.

Data Integration aims to simplify the information retrieval by defining a unified view of different data sources (Batini et al., 2009; Lenzerini, 2002). A user will be able then to access the data via that view. To create that unified view, heterogeneity issues on both data instance level and on schema level has to be solved (Batini et al, 2009). A possible integration approach is then to design a uniform schema or ontology in order to solve these issues (Shvaiko and Euzenat, 2012; Howell et al, 2016). Integration methods have impact on all layers in the data architecture. Data integration requires therefore more resources than data linking and data normalization.

(18)

information manufacturing process, causing poor data quality. Process driven strategies are often more expensive than data driven approaches, but in the long term are found to outperform the latter (Batini et al., 2009).

The main improvement methods can thus be categorized into data oriented methods or process oriented methods.

2.1.4 Types of data

Another way to look at data quality is from perspective of data types. In literature, several perspectives can be found to categorize data instances. We distinguish between the technical view from computer science and a more practical user view: From

Computer Science view data is regarded in its elementary form and how it should be represented in a database (Meinsma, 2016):

o primitive types

§ Examples are booleans (yes/no; 0/1), strings (free texts), dates, integers, floating (or doubles).

o complex types

§ Examples are arrays, functions and records (or tuples). · From User view data is regarded as a product and how it should look like:

(Shankaranayan, 2000; Batini et al, 2009): o Structured

§ Representation of real world objects or phenomena, mostly in tabular form, according to a specific definition.

o Semi structured

§ Representation of real world objects or phenomena, in more flexible formats, like XML, JSON and RDF.

o Unstructured

§ Representation of real world objects or phenomena in a natural language.

In our literature search, we found many data quality research from the field of statistics, mostly referring to structured data. Statisticians indeed have provided many methods to identify poor data instances and repair them (Davis, 2002). But these methods perform only on numerical primitive datatypes (integers and floating). Scientists dispute on the distribution between the data types categories, but we cannot neglect the amount of non-numerical data types and their related data quality issues (Russom, 2008). The way data are categorized by types, depends thus on the context: from technical point of view, data is categorized different than from user perspective.

2.1.5 Research Opportunities on Data Quality Issues Within and Between Sources

The previous sections described related research already been done on data quality recently and also in a more distant past. We could read about a set of quality dimensions to investigate the nature and extend of the data quality and a survey on state of the art methods to improve poor data quality. Several authors however concluded that data quality research has still not reached a good maturity level yet (Taleb et al., 2018; Batini et al., 2009). Relating data quality issues more closely to business processes is one of the main challenges to put scientific knowledge into practice. Other related work on this topic also concluded that there is no “one size fits all” solution to deal with data quality issues. Thinking about data quality assessment and determining the necessary actions for improvement is an on-going effort for organizations (DAMA, 2013; Pipino et al., 2002). The opportunity for this design project is to relate data issues in terms of quality

(19)

dimensions to appropriate methods to improve data quality on the dimensions, which are important in a situational context.

Another bridge to gap is about the presented methods themselves (see 2.1.3). Some of them are rather abstract, so refinement is needed to apply them in a situational context. Others are only applicable to one specific data types (see 2.1.4). In addition, there is a lack of experiments to validate different methodological approaches in practice (Batini et al., 2009). Additional knowledge how to refine and evaluate these prototype methods and apply them on all common data types is thus needed. Our PDEng project, aiming to develop a procedure to improve data quality issues within and between sources,

regardless their datatypes, is therefore an attempt to gap these knowledge bridges.

2.2 Data Quality Framework

The previous section clarified the common issues around data quality from scientific view. We identified three main phases in data quality research: 1) preparation, 2) assessment and 3) improvement.

Preparation Steps Assessment Steps Improvement Steps

• Collect Data Models, Databases and Business Rules regarding the Data Sources;

• Investigate working processes related to the data sources.

• Information System Analysis; • Area of Interest; • Data Quality Requirement Analysis; • Data Quality Measurements. • Impact Quality Problems on Working Processes; • Identification of Error causes; • Selection, Refinement & Evaluation of Methods; • Assembly & Implementation of Methods;

Although a necessity to start with, the preparation does not require much effort if data models and business rules are available and the relevant databases accessible (Batini et al., 2009). From perspective of the design objective to improve data quality, we are therefore mainly interested in knowledge on data quality assessment and data quality improvement. Regarding data quality assessment, we need a set of basic data quality dimensions in order to investigate the nature and extend of the data quality in our current situational context. With this information we can express the largeness of the problem and what we have to improve (see research question 1d). Regarding data quality improvement, we need to reduce the observed poor data instances and solve the problem in our situational context (see research question 2). This calls for a set of initial (prototype) methods, which will be refined further in order to reduce poor data quality within and between sources. The main entities in our data quality framework comprises then:

1) the mentioned phases (preparation, assessment and improvement); 2) the data quality dimensions to conduct an assessment and;

(20)

2.3 Conclusion

In response to research question 1a, the main research opportunity's on data quality are on:

1) How to refine the found improvement methods to concrete situational methods; 2) How to relate these methods to the quality dimensions, measuring the nature and

extent of data quality on several kinds of data types in a situational context; 3) How to evaluate these methods in a situational context.

Figures 5a and 5b respond to research question 1b and depict our theoretical

framework. Firstly the phases and steps in data quality research: the preparation phase, the assessment phase and the improvement phase. The assessment coincides with steps in the ADR problem formulation stage , whereas the improvement coincides with steps from the ADR Building, Intervention and Evaluation (BIE) stage.

We remind that the last to steps of the improvement phase have been implemented with concepts of Situational Method Engineering (SME). These steps were also our main design steps and correspond to research opportunities 1 and 3. Research opportunity 2 corresponds to the allocation of the improvement methods to the quality dimensions, as presented in figure 5b.

Figure 5a Phases and Steps in Data Quality Research

Figure 6b Quality dimensions for the assessment (left) and (prototype) methods for the improvement (right)

(21)

3. SITUATIONAL CONTEXT

This chapter elaborates our situational context more in detail and aims to discover the main data quality requirements to be met.

The first section defines our situational context at RWS. This context gives input for several sub products in our design process, like the data quality requirements. The second section addresses these data quality requirements, related to our situational context.

This chapter thus responds to research question 1c eventually:

3.1 RWS Situational Context

This section focusses on the practical relevance of data quality issues. We transfer the data quality issues as described theoretically to the practice at Rijkswaterstaat (RWS). To get this focus on practical relevance (Rogerson & Scott, 2014) and relate the data quality issues more closely to business processes (Batini et al., 2009), we chose to confine to the practice of one business process at RWS: the Asset Management (AM) process.

3.1.1 Scope

Our first confinement is the Asset Management business process at RWS. RWS defines the Asset Management (AM) process as follows (RWS Service Level Agreement, 2017):

• “Assure reliability, availability, maintainability and safety of the physical assets, which realize the infrastructural networks, with balance between performance, risks and costs”

The AM process at RWS is an appropriate situational context to investigate the data quality issues further for several reasons:

1 An internal audit on several working processes at RWS shows that transfer of

information within and between processes often causes stagnation in the production chain of a certain working process. This stagnation often manifests in the Asset Management (AM) process. The AM process is characterized with several

information transfer moments between organizational units and also other business processes depend on information from the AM process (Interne Auditdienst Rijk, 2016). As a consequence, the board announced to head for data quality, with special focus on the AM process (RWS Board Focus Letter, 2017);

(22)

mention inconsistencies between the several AM data sources, so it is hard for them to find unequivocal information about assets.

The Asset Management business process is still rather wide scope: RWS maintains

hundred thousands of objects, categorized in hundreds of object types, all with their own characteristics. We therefore confined the scope also by object types and focus on civil engineering object types. The reason to select these types lie in the fact that these objects are very critical in network performance and the highest percentage of planned maintenance activities (measures) at RWS are performed on civil engineering objects (see figure 6).

Object Type Category Percentage of Measures

Civil Engineering Object Types 46%

Road Pavement Object Types 33%

Traffic Management Object Types 2%

Soil Object Types 1%

Others 18%

Figure 7 object types with highest percentage of planned measures (Source RWS Uniform Programming System – RUPS, 2018)

3.2 Data Quality Requirements in Situational Context

This section defines the main data quality requirements, related to the RWS context of Asset Management and Civil Engineering Objects. The research field of System

Engineering provide us with several methods and techniques to discover requirements (Hull et al, 2011; Blanchard and Fabrycky, 2014):

Ø Interactive methods Ø Descriptive methods Ø Comparative methods

In general, IT projects prefer interactive methods to discover requirements to ensure user involvement and reduce the probability of a mismatch (Sein et al, 2011; Blanchard and Fabrycky, 2014; Yourdon, 1990). An interactive method needs however a clear information demand from customer side. The RWS AM process is however a very dynamic context (see also 1.3). There is a lot of discussion about the information demand, even among the practitioners. This situation makes interactive methods very risky to become never ending stories. Descriptive methods are an alternative if at the higher level of the enterprise, the business goals have been documented clearly. Comparative methods are an alternative if there are analogous or existing systems covering the information demand to a large extent.

At RWS, business goals at enterprise level have been clear defined. Therefore, we adopted a descriptive method, scenario analysis, to discover requirements. Concisely stated, scenario analysis translates the higher level requirements to requirements, related to a situational context. In our case, data quality requirements.

Appendix B describes the scenario analysis in more detail. This section confines to the presentation of the results.

3.2.1 AM Business Requirements

The annual Service Level Agreement (SLA) between the Minister of Infrastructure and RWS mentions the main requirements for every business process. The requirements

(23)

· Availability road network: 97% · Availability waterway network: 99% · Availability water barriers: 100% · Availability water agreed level: 90% · Safety road network: 99%

· Safety waterway network: 95%

· Reliability travel times road: no quantitative requirements formulated · Reliability travel times water ways: 90% waiting times at locks < 20 min · Reliability data on asset quantity: 95%

· Reliability data on asset quality: 95% · Reliability data on accidents: 100% · Reliability data on traffic intensities: 90% · Reliability data on water quantity: 95%

· Reliability data on water quality: no quantitative requirements formulated · Maintainability coast line: 90%

3.2.2 Data Quality Requirements

At business level, some clear statements have been given on the reliability of asset management data. In section 1.1, we defined good data quality as data which is “fit for use”. We applied the reliability requirements regarding asset management from section 3.2.1 on also the “fitness for use” data requirements. This choice has been verified by practitioners1_{, working in our situational context. During the design process, we}

interviewed them and asked their opinion about several design choices like these. Requirements from the AM process

In our situational context, we confined to the civil engineering objects maintained by RWS (see section 3.1.2). The main information on civil engineering objects can be found in data related to asset quantity and asset quality. Then, we can state that at least 95% of the data instances in the AM data sources should be “fit for use”.

We related this “fitness for use” to the data quality dimensions within sources to express “fitness for use” and attach metrics to it. So, the data about civil engineering objects are not allowed to have more than 5% invalid, incomplete and internal inconsistent values per property (see 2.3) in order to be fit for use.

Requirements from other business processes

Section 3.1.1 explained the need for data transfer between the AM process and other RWS business processes. By example: asset maintenance is financed by public funds. RWS therefore has to account for the expenses due to the AM process annually. This requires data transfer between the AM process and the financial management (FM) process. We related this need for data transfer to the interoperability & linkability data quality dimension in section 2.3.

In addition to the requirements related to the AM processes, we also searched for requirements regarding related business processes, like the Information Management process (IM). The IM process developed some data architecture principles for the RWS enterprise. A relevant principle regarding data quality is the one-off data capture, one-off data storage. We related this principle to two data quality dimensions between sources: redundancy and external inconsistency. So, the civil engineering objects may not have duplicates representation in more AM data sources and also external inconsistent

(24)

The data quality requirements within sources have thus been related to the business requirement that all data should not have more than 5% invalid, incomplete and internal inconsistent values per property. The data quality requirements between sources have been related to the business requirements of data transfer abilities and the absence of redundant and external inconstant data between sources.

3.3 Conclusion

Our situational context refers to a part of the asset management (AM) process, related to civil engineering objects, maintained by RWS. We confined thus to data sources in the AM process, containing representations of civil engineering objects and their properties. Based on business requirements regarding the AM process, we were able to formulate the main data quality requirements within sources:

Ø Quantitative and qualitative data about civil engineering objects may not have more than 5% invalid, incomplete and internal inconsistent data instance values; Based on business requirements regarding the related information management process, we were able to formulate the main data quality requirements between sources:

Ø Civil engineering objects may not have duplicate and external inconsistent representations in two or more AM data sources;

Ø Civil engineering object data should be interoperable & linkable with data instances from related business process, like the FM process.

The conclusion of this chapter corresponds also to the data requirement analysis step in the assessment phase in our framework (figure 5a).

(25)

4. DATA QUALITY ASSESSMENT IN SITUATIONAL CONTEXT

The fourth chapter presents the transfer of the research opportunities on data quality to the situational context of Rijkswaterstaat by means of a case study on the civil

engineering object data. This case study aimed to assess the nature and extent of the data quality problems based on the data quality requirements.

We conducted the preparation as assessment steps as defined in the data quality framework (see 2.3). Appendix A describes these steps in more detail. This chapter confines to a brief introduction to the case study and its main results. The conclusion in the end, expresses the main data quality problems observed in our situational context, based on the data dimensions (Ch. 2) and the data quality requirements (Ch. 3). This chapter eventually respond to research question 1d.

(26)

4.1 Introduction Case Study

Important topics to introduce the case study are its scope and the data quality metrics, needed to measure the violations per dimension.

Further Scope

Due to time limitations, we confined the scope of our situational context further regionally. All RWS regions execute AM conform uniform business processes. This allowed us to confine our research also to one region. We selected the eastern part of the Netherland (see figure right), partly for practical reasons (access to data and practitioners), but also because all kind of civil engineering object types, maintained by RWS, occur the eastern part of the Netherlands. We measured thus the violations within and between AM sources regarding representations of civil engineering objects in the eastern part of The Netherlands.

In addition, we included only properties of civil engineering objects,

which can be found in two or more sources. Reason behind this further confinement is that we also had to investigate the quality issues between sources. Then, overlapping fields are most interested to put emphasize on.

Figure 7 summarizes then the fields (i.e. properties) included in the scope per asset management source.

Figure 8 Properties in scope regarding data sources in scope (source: own information analysis)

We categorized these fields in three functional groups: Ø Key properties: pk, fk

(27)

These are crucial to identify an object uniquely in a database. A primary key (pk) is the main database identifier of an object. Foreign key’s (fk) are used to

establish a relation between two data instances, stored in different tables. These tables can reside within the same source, but in different sources.

Ø Similarity properties: sp

These are characteristics, that identify an object uniquely in the real world, like its name and/or its location.

Ø Other common properties: cp

These are characteristics represented in two or more databases, but do not identify an object uniquely. Examples from figure 7:

o type represents a categorization of an object. Real world instances are mostly specified or classified. Based on this specification or classification, an object is assigned to a table. This kind of properties are therefore very important when it comes to give structure to a database and design a schema. The type property also determines which other properties are important to represent an object in the database.

o year of foundation is considered important to determine the life cycle endurance. It is also used in planning of regular inspections.

o material refers to the composite of the main civil engineering structure of an object in KernGIS and BKN. In DISK, material refers to the composite of a smaller object part (bouwdelen).

o monumental status is an important feature of an object, generating extra requirements and constraints on the maintenance of an object.

o measures advices are formulated after a risk assessment and programmed if they are prioritized. Measures are also found in two sources: DISK and RUPS. To link measures to the civil engineering objects, we need to enclose the full object decomposition, containing elements (elementen) and building parts (bouwdelen).

Specification Data Quality Metrics

Table 3 specifies the data quality metrics per dimension in more detail and expresses whether a metric should be applied on one of the sources. These metrics have been based on the basic metrics found in literature (table 2). It was needed to refine them further to apply them in our situational context. The refinement has been done by examining the data models of the sources and with help of information from

practitioners at RWS Eastern Netherlands. We asked these practitioners to provide us with concrete quality rules regarding these sources, which can serve as metric to do the assessment.

Table 3 also shows that the dimension timeliness is not considered to be relevant. No quality metric needed to be formulated. The dimension accuracy is considered to be

(28)

The redundancy between sources refers to the amount of duplicate data instances. Data instances are duplicates if:

- the object in the sources are all representations of the same real world; - the common properties all represent the same object characteristic.

To determine the redundant objects and discover inconsistent property values between these objects, we have to link the data first. Part II of the report comes back to this, since we considered data linking as a prototype improvement method (see 2.1.3). The amount of common properties can be considered as an indication for potential redundancy and external inconsistencies between sources (Elmagarmid et al, 2007). Therefore, the amount of common properties has been chosen as metric to measure the redundancy and external inconsistency between sources. External inconsistency can be an indication for inaccurate data (Redman, 2005): if two data instances, representing the same real world object, but having two different values on the same property, it can be stated that at least one of the values is wrong and thus inaccurate (Olsen, 2003). But we cannot state that external inconsistency can measure inaccuracy fully.

A metric for interoperability & linkability is the availability of properties, which can serve as linkage attribute to transfer data between sources and information between business processes (Howell et al, 2016; Daas et al, 2009). We adopted this metric to measure the quality on interoperability & linkability.

Appendix A.2.3 describes the allocation of these metrics to the properties per source more in detail.

Dimension Metric Source

BKN DISK KERNGIS ULTIMO RUPS SAP Internal

inconsistency Amount of values, which do notoccur in a list of allowed values ⱱ ⱱ ⱱ ⱱ ⱱ Internal

inconsistency Amount of values, which are notin line with values of related properties

ⱱ ⱱ

Invalidity Amount of values, which have

not the required length ⱱ ⱱ ⱱ

Invalidity Amount of values, which do not contain certain required

characters

ⱱ ⱱ ⱱ

Invalidity Amount of values, which are

outside a predefined range ⱱ ⱱ ⱱ ⱱ

Invalidity Amount of values, which do not match a pattern or regular expression

ⱱ ⱱ ⱱ ⱱ

Incompleteness Amount of empty values ⱱ ⱱ ⱱ ⱱ ⱱ ⱱ

Inaccuracy -Timeliness

-Redundancy Amount of common properties, representing the same

information

ⱱ ⱱ ⱱ ⱱ ⱱ ⱱ

External

inconsistency Amount of common properties,representing the same information

ⱱ ⱱ ⱱ ⱱ ⱱ ⱱ

Interoperability

& Linkability Amount of properties to serve aslinkage attribute ⱱ ⱱ ⱱ ⱱ ⱱ ⱱ

(29)

4.2 Nature and extent of the Data Quality Issues in Situational Context

The measurements revealed the nature and extent of the data quality issues within and between sources. The measurements have been done by applying the metrics on the fields as defined in the previous section.

4.2.1 Quality measurements within sources

Table 4 summarizes all measurements and presents the percentage of violated records per field and quality dimension. We recall from the data quality requirements that at most 5% of the data instances are allowed to violate the rules in order to be fit for use. If percentage of violated records exceed this norm, the cell has been colored red. The N is the total amount of measured instances per source.

Internal inconsistency and invalidity are complementary: if a field is instantiated by a values from a domain table, the internal inconsistency between the data instance value and the domain table values has been checked. If a field is filled by a primitive data type (a string, date or integer) or complex datatype (geometry), invalidity metrics have been applied. Incompleteness metrics have always been checked.

Table 4 shows that most rejected values are invalid or incomplete. Internal

inconsistencies are not quite an issue: this has to do with integrity constraints, an important functionality in relational databases. These integrity constraints force practitioners to enter consistent values. There is however a trade-off between internal consistency and completeness (Wang et al, 2006): if practitioners do not know which consistent value to enter, the field often remains empty, causing completeness to increase. Table 4 indeed shows that several of the incomplete properties are related to domain datatypes. Apparently, the practitioner did know what to enter and left the field empty. Another reason can also be that the property is not relevant to the real world object to be represented. In Appendix I elaborates on the error causes in more detail: for some civil engineering object types, it indeed is likely to assume that some incomplete properties are not relevant. Further requirement analysis in our situational context is needed to confirm this assumption. For the design of the improvement procedure, we continued with the assessment outcome of table 4.

(30)

Table 4 Data Quality Measurements on dimensions within sources in Eastern Netherlands: %violations TOTAL VALUES REJECTED VALUES

KERNGIS Kunstwerken DATATYPE N

Internal

Inconsistent Invalid Incomplete

pk ObjectID primitive-integer 1592

fk Kunstwerk Code primitive-string 1592 8% 0%

sp Geometry geometry 1592 0% 0%

cp Type domain 1592 2% 1%

cp Foundation Year primitive-date 1592 2% 24%

BKN Kunstwerken

pk ObjectID primitive-integer 572

sp Geometry geometry 572 0% 0%

cp Material domain 572 9% 20%

cp Monumental Status domain 572 0% 0%

ULTIMO ProcessFunction

pk ID primitive-integer 591

sp Coordinates primitive-integer 591 0% 3%

cp Monumental Status domain 591 0% 39%

DISK Beheerobjecten

DISK Elementen pk ID primitive-integer 2117 fk Beheerobject ID primitive-integer 2117 0% 0% cp Type domain 2117 0% 0% DISK Bouwdelen pk ID primitive-integer fk Element ID primitive-integer 3170 0% 0% cp Type domain 3170 0% 0% cp Material domain 3170 0% 21% DISK Maatregelen pk ID primitive-integer fk Beheerobject ID primitive-integer 865 0% 0% fk Element ID primitive-integer 865 0% 0%

cp Measure Advice primitive-string 865 7% 0%

RUPS Maatregelen

fk SAP-WBS Element primitive-string 13675 11% 1%

fk Inkoopdossier Code primitive-string 13675 52% 0%

fk Element Code primitive-integer 13675 0% 1%

cp Measure Advice primitive-string 13675 29%

cp Unity prices primitive-integer 13675 11%

SAP SAP Finance - Dossier

pk Dossiernummer (WBS Element) primitive-string 2332 0%

cp Cost primitive-integer 2332 0%

SAP SAP Finance - Zaak

pk Zaaknummer primitive-integer 3372 0%

fk WBS Code primitive-string 3372 0%

cp Cost primitive-integer 3372 0%

SAP SAP Finance - Bestelling

pk Bestelnummer primitive-integer 4751 0%

fk WBS Code primitive-string 4751 0% 0%

fk Zaaknummer primitive-integer 4751 0% 0%

(31)

4.2.2 Quality measurements between sources

Table 5 presents the common properties between the AM data sources, indicating the potential redundancy and external inconsistency between sources. Per source pair has been indicated which properties are overlapping. The tables shows quite some

redundancy between the asset quantity data sources, due to the occurrence of several common properties between the sources. The GIS data sources (KernGIS, BKN) and the EAM2_{data sources (DISK, Ultimo) have separate data capture processes. Although there}

exists some cooperation among these capture processes, the users are working in different working processes within the AM business process. GIS users mainly work on short term operational AM working processes, whereas EAM users mainly work on long term strategical AM tasks. This difference also caused a separate data capture process regarding the same real world entities.

Table 6 presents availability of properties, which can serve as linkage attribute to transfer data between sources, measuring the interoperability & linkability between sources. Per source pair has been indicated which property categories can be used to link the source pair. Also the results of the quality measurement based on the previous section have been included: if the key or similarity property of both pairs has good quality, the interoperability & linkability on this property has been assessed sufficient. It can be observed then that interoperability & linkability between the asset quantity sources is good via the spatial properties. The interoperability & linkability between the asset maintenance source (RUPS) and the financial management source (SAP) is rather poor. Although a common property has been established (see figure 7), the data quality on this common property is rather poor in the asset maintenance source (see table 4). This confirms the observation from an internal audit that transfer of information between processes indeed is an issue between the AM and the FM process (Interne Auditdienst Rijk, 2016).

(32)

Table 6 Available properties between sources, which can serve as linkage attribute and their data quality

4.3 Conclusion

Based on the measurements, general remarks are:

• Internal consistency is not quite an issue, due to adherence to business rules. • Most quality issues within sources are related to validity problems and

incompleteness on primitive data types.

• The interoperability & linkability between civil engineering objects in the 4 asset quantity information systems is quite good: there are sufficient properties to serve as linkage attribute and because the geometry quality is also good, the civil engineering objects mostly meet the requirement to be linkable to spatial data. Data transfer between the asset quantity sources is therefore not prone to error.

The main data quality problems within sources are:

• Key properties of civil engineering objects do often not meet the data quality requirements (>95%, see 3.3), due to invalidity issues. As a consequence, objects cannot be retrieved easily on these (foreign) key properties.

• Key properties of measures in RUPS do often not meet the data quality

requirements, due to invalidity. As a consequence, measures cannot be retrieved easily on these key properties.

(33)

see 3.3). As a consequence, analysts cannot retrieve often the useful information regarding type or foundation year of an civil engineering object.

The main data quality problems between sources are:

• There is quite some redundancy on properties between the 4 main asset quantity information systems. Especially geometry, type and foundation year are

represented in all systems. When there is also overlap between objects (records), this is redundant data, not meeting the requirement “one-off capture, one-off storage”.

• The interoperability & linkability between the asset management systems and the financial system is rather poor: the only foreign keys has poor quality. Another possibility is to link on textual similarity, but the asset management process and the financial management process use different vocabularies to describe the similarity properties. So, the quality on this dimension fully depends on the possibility of textual linking. The chance to meet the requirement of reliable data transfer between different the AM and FM process is rather low.

The conclusion of this chapter corresponds to the data quality measurement step of the assessment phase in our framework (figure 5a). We found that the main quality

problems within sources are related to the validity and completeness dimension. We only found a few internal consistency issues. The main quality problems between sources manifest in all basic quality dimensions in our framework (figure 5b).

Timeliness and accuracy of the data were not measured: timeliness was not considered to be important in our scope. For accuracy, we could not find an objective metric. External consistency can be seen as an indication that there is inaccurate data, but we cannot measure the full nature and extent of the data accuracy issues.

So, in our situational context, the improvement procedure must at least be able to deal with validity, completeness and internal consistency issues within sources and

redundancy, external consistency and interoperability & linkability issues between sources.

(34)

PART II BUILDING, INTERVENTION AND EVALUATION

This part of the report contains the chapters about the design of the improvement procedure. In our data quality framework, we defined some improvement steps (see 2.3 and figure below).

The last two improvement steps coincides with our main design activities. These have been concretized with concepts of Situational Method Engineering (SME), since our design is an improvement procedure or Situational Method Base (SMB) in SME terminology. The SMB is an assembly of Situational Method Fragments (SMF’s). The SMF’s are refinements of existing prototype methods.

Chapter 5 discusses the selection and refinement of the initial existing prototype

methods, which were the building blocks of our SMF’s and SMB’s. Chapter 6 specifies the SMF and SMB in more detail from a user perspective. Chapter 7 addresses the evaluation of the design products .

After reading part II, the reader should have understanding about the realization of the improvement procedure, how to apply the improvement procedure and its performance to reduce the data quality issues in our situational context (see Ch. 4).

(35)

5. PRODUCT PROTOTYPE

This chapter discusses the initial selection of prototype methods and its refinements to situational method fragments. The selection has been based on literature review (see chapter 2):

Ø Data Oriented Methods o Data Linking o Data Normalization o Data Integration Ø Process Oriented Methods

o Process Control o Process Redesign

Harmsen et al. (1994) provide a framework to intervene into existing prototype methods and refine them in order to apply the methods in a situational context. Depending on the objectives, requirements and situational features, a certain degree of flexibility is needed to do the refinement (see figure 8). When a prototype method is already very concrete and ready for application, no further refinement is needed: low flexibility suffices then. If a method fragment is rather abstract and/or not yet ready for application, more flexibility is needed then to specify the prototype method more in detail (Harmsen et al, 1994).

Figure 9 Degree of Flexibility for Method Fragment Refinement (Harmsen et al, 1994)

The following sections elaborate the prototype methods and its refinement on the scale of flexibility in order to develop situational method fragment. In the end, this chapter

answers research questions 2a and 2b.

2a. What are appropriate methods to improve data quality?

2b. How can we intervene into the prototype methods to improve data quality in our situational context?