ONTOLOGY (RE)USE IN DATA WAREHOUSING


Master Thesis

Research Master for Economics and Business

Author: Sander Zwanenburg (s1469703)

First supervisor and assessor: Prof. Dr. Ir. J.C. Wortmann

Co-assessor: Prof. Dr. E.O. de Brock

Second supervisor: R.A. Ittoo, MSc


TABLE OF CONTENTS

Table of contents

Abstract

1 Introduction

1.1 Introduction to the data integration problem

1.2 Introduction to data warehousing

1.3 Limitations of data warehouses

1.4 Introduction to ontologies

1.5 Ontologies as a remedy

1.6 Problem & aim

2 Introduction to the virtual approach

2.1 Development and types of the virtual approach

2.2 The virtual versus the materialized approach

3 Methodology

3.1 Data warehousing

3.2 Virtual approach

4 Results

4.1 Data warehousing

4.2 The virtual approach

5 Conclusion

6 References

7 Appendix A: Data models of Data Warehouses

ABSTRACT

1 INTRODUCTION

1.1 INTRODUCTION TO THE DATA INTEGRATION PROBLEM

There is a vast and growing body of literature dealing with the data integration problem. The data integration problem refers to “combining data residing in different sources and providing the user with a unified view of these data” (Lenzerini, 2002). The goal in this problem is to fulfil the information needs of the user – or the knowledge worker – preferably without violating user sovereignty, i.e. without preventing users from expressing their own view when requesting integrated information (Ziegler & Dittrich, 2004). These needs are assumed to be read-only¹. Maintaining data source autonomy is required in the data integration problem, as the primary functions of the sources, such as transaction processing, must not be affected. In addition, the sources are often assumed to be heterogeneous; semantic heterogeneity in particular has been a challenge to cope with. Sometimes, less structured data sources such as the World Wide Web or emails are explicitly included. Because the process of structuring data is dealt with by other literature (e.g. literature on natural language processing techniques), this paper considers only structured data as direct source data, although the ultimate sources might be unstructured.

There are two broad approaches to the data integration problem: data warehousing and the virtual approach, the latter of which will be introduced in section 2.

1.2 INTRODUCTION TO DATA WAREHOUSING

Data warehousing is seen as an important approach to the data integration problem (Haas, Schwartz, & Kodali, 2001), and is applied in many organizations. In data warehousing, the integrated data is materialized in a data warehouse before a query is posed, and therefore this approach has also been called the materialized approach (Hull, 1997). According to Inmon (1995), a data warehouse (DWH) is “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process,” and can therefore be regarded as a special type of database (DB), or, conforming to Rahm and Bernstein (2001), as a decision support database². Unlike most DBs employed in organizations, which are often transaction processing DBs, the primary goal of a DWH is to enable the analysis of data, which is sourced from other DBs (hereafter: the source DBs). Analyzing data in the source DBs themselves is often hampered by the limitations of these DBs. Table 1 provides an overview of these limitations of most DBs for analysis, along with how data warehouses overcome them.

¹ This assumption conflicts with a small minority that also considers read and write access to integrated information, e.g. Hull (1997).

² In some data warehousing implementations, multiple DBs exist, e.g. an operational data store or separate data

| Aspect | Using operational DBs for analysis | Using a DWH for analysis |
| --- | --- | --- |
| System load | High – Analytical use of the databases may hinder operational use or vice versa. | Low – A separate system, the DWH, is employed. The source systems are only used periodically, usually when there is no operational activity. |
| Domain | Restricted – Operational DBs are often specialized in a specific function such as sales. | Wide – Since data is pulled from multiple source DBs, transformed, and loaded into the DWH – a process referred to as ETL (Extract Transform and Load) – the domain of the DWH is wider than that of any individual source DB. |
| Data quality | Poor – This is due to e.g. misspellings, abbreviations or various syntaxes (also known as ‘dirty data’). | Good – While integrating, the data is cleansed, removing all syntactic heterogeneity. |
| Data structure | Often normalized – Normalized data structures require many join operations for analysis. | Often hierarchical – The corresponding star or snowflake schemas facilitate analysis. See Appendix A for a discussion on data models of DWHs. |
| Time scope | Limited – Transaction processing DBs seldom keep historic, obsolete data. | Large – Historic data is kept in the data warehouse, and updates in source data are often accommodated in DWHs through responses to slowly changing dimensions (see e.g. Frank, 2008). |

Table 1. Why analysis is done using a DWH instead of in the source DBs themselves.

The analysis is triggered by users, who can pose queries to the DWH, requesting data that is stored in the DWH in accordance with its schema. The users can directly address their query to the database management system (DBMS) of the DWH or via other applications. This functionality of the DWH is called query processing (QP). This paper focuses on ETL, schema design and QP, as these areas correspond with important limitations of DWHs, which will be discussed next. The areas are depicted in Figure 1. In this figure, for the sake of simplicity, the cylindrical symbol is used for representing the DB as well as its DBMS and other applications that use the DB. Analogously, the cube represents the DWH and the applications that use it.

[Figure 1 diagram; labels: DB sources, ETL, DWH, schema (re)design, QP, users; areas numbered 1 to 3]

Figure 1. Three important areas in data warehousing

1.3 LIMITATIONS OF DATA WAREHOUSES

There are various limitations of data warehousing, corresponding to these three areas of interest.

1. ETL limitations – The design of the ETL process is often considered as a very tedious and error-prone task and hence as an expensive task; the ETL process is often quite complex (Skoutas & Simitsis, 2007a), mainly due to the high heterogeneity of source DB systems (DBS) (Trujillo & Lujan-Mora, 2003). Furthermore, the ETL process is, when it is designed in traditional hard code, very rigid; it is difficult and time consuming to incorporate new data sources or to accommodate changes in current sources, which require programming changes (Widom, 1995). Many commercial solutions for ETL design do not facilitate automatic identification of transformations (Skoutas & Simitsis, 2007b).

2. Schema design limitation – The schema design of a DWH is a challenge; conventional DB schema design techniques cannot be applied here, for example because the DWH schema design is dependent on source DBSs. Also, the information about the semantics of the data and the constraints and requirements of the DWH is often incomplete, or even inconsistent, and at times only provided in natural language after oral communication with the involved parties (Skoutas & Simitsis, 2007a). More specifically, determining measures, facts or events is often considered as the most difficult part of the schema design, and is usually done manually (Phipps & Davis, 2002). An indication of the difficulties is given by the great number of very different approaches to DWH design (see Sen & Sinha, 2005, for a comprehensive overview).


An issue fundamental to these three limitations is the alignment of world views: in ETL the source DBs’ world views are to be mapped to the DWH world view; in schema design, a new world view is to be formalized from existing world views; and in QP the world views of the user and the DWH are to be bridged. Formalizing these different world views into explicit specifications of conceptualizations can help solve these problems. These specifications are referred to as ontologies. Before explaining how ontologies can be used for addressing data warehouse limitations, the notion of ontologies is introduced next.

1.4 INTRODUCTION TO ONTOLOGIES

The definition of ontology has been subject to a lot of debate (Guarino & Giaretta, 1995). A widely used definition is: “an explicit specification of a conceptualization” (Gruber, 1995). “A conceptualization, in this definition, refers to an abstract model of how people commonly think about a real thing in the world; and explicit specification means that concepts and relationships of an abstract model receive explicit names and definitions” (Buccella, Chechich, & Brisaboa, 2005). Although some ontologies can be as ‘simple’ as DB schemas, their actual purpose is usually to represent the complexity of the real world more precisely, by including semantics, taxonomies, non-taxonomic relationships and constraints. Ontologies can be expressed in different ways. They can be expressed graphically in UML (Unified Modelling Language), in natural language, or in formal languages that may be computer readable (see Uschold and Gruninger, 1996, for a typology of ontologies according to their formality). Although manually using ontologies has been popular in system engineering (Guarino, 1998), computer readable ontologies have recently gained much attention within the computer science community, for example to support the advent of the semantic web.

Just as ontologies can be expressed in different ways, the universe of discourse they represent can be expressed in different ways as well, or, more specifically, at various levels of detail. Generic concepts like time, space, object, activity etc. are independent of a particular domain, but can be specialized into domain-specific concepts, e.g. object and activity can be specialized into automobile and selling respectively. Guarino (1997, 1998) refers to the set of these general concepts, including the corresponding definitions, interrelations etc., as a top-level ontology, whereas more specific concepts are represented in other ontologies such as domain ontologies or task ontologies. Bellatreche, Pierra, Xuan, Hondjack, & Ameur (2004) classify ontologies differently, by identifying their relationship to DBs in a data integration environment. They distinguish between (1) local ontologies, relating to individual DBs, (2) global ontologies, relating to all DBs, and (3) shared ontologies, which are connected to the local ontologies, as illustrated in Figure 2.

Figure 2. Different ontology architectures in a multiple database environment (from Bellatreche, Pierra, Xuan, Hondjack, & Ameur, 2004)


What can be inferred from Guarino (1998) and Bellatreche, Pierra, Xuan, Hondjack, & Ameur (2004) is that the abstract model of how people commonly think about a real thing in the world, i.e. the ontology, can also refer to multiple groups of people or databases, describing how people can think about an entity in the world. Thus, according to Guarino (1997, 1998), ontologies can be used in a wide variety of contexts. By providing a shared understanding in various contexts, they can enhance (1) communication, (2) inter-operability and (3) system engineering (Uschold & Gruninger, 1996). These three uses correspond with how ontologies are used to overcome the described limitations of DWHs: communication corresponds with QP, inter-operability corresponds with ETL and system engineering corresponds with schema design. The use of ontologies for dealing with the limitations is briefly described next.

1.5 ONTOLOGIES AS A REMEDY

ETL limitations – To tackle the ETL limitation, Critchlow, Ganesh and Musick (1998) use ontologies for automatic generation of data warehouse mediators. These authors assume that schemas of the source DBs and of the data warehouse are given. The mediators fulfil a core role in their ETL solution, as they resolve semantic and syntactic conflicts between the schema of the source DBs and the schema of the DWH. The solution reduces the maintenance costs associated with accommodating changes in source DBs or incorporating new source DBs. Furthermore, Rahm and Bernstein (2001) have reviewed approaches to automatic schema matching. In these approaches, the input consists of multiple schemas (of source DBs and the DWH), and the output consists of mappings from a DB to a DWH. However, a limitation of these approaches so far is that ontologies are seen as equivalent to schemas.

Schema design limitations – Guarino (1998) discusses how ontologies can be used in information systems (ISs) during the development time of an IS; an ontology – if available a priori – can be “transformed and translated into an IS component, reducing the costs of conceptual analysis and assuring (...) the ontological adequacy⁴ of the IS.” When the necessary ontology is not complete – which is often the case – the existing ontological knowledge can instead be used by designers as a powerful tool in designing the IS. These manual or automatic designs can refer to the database component, the user interface as well as the application constituting the IS. Specifically for DWHs, a commitment to an ontology, according to Nimmagadda (2005), will ensure consistency in the DWH design, which is otherwise compromised if the semantics are distorted or not properly reflected. Such a commitment is demonstrated by Romero and Abello (2007), who use an ontology as a first step in designing a DWH schema semi-automatically. Similarly, Phipps and Davis (2002) use an ontology describing a source DB as a starting point for automatic source-driven DWH schema design.

QP limitations – Reducing the intrusiveness of DWHs with ontologies can be achieved in various ways and to various extents. Ontologies can assist users in formulating their queries by enabling them to browse through the metadata of the DWH, as demonstrated by Kamal, Borlawsky and Payne (2007). Browsing metadata and other query support functionalities such as graphical drag-and-drop querying are provided by many commercial DBMSs of DWHs, such as Business Objects. A further reduction of the intrusiveness of DWHs can be achieved by using ontologies for translating queries and/or reports between natural and formal languages. In a rather unconventional DWH solution – in which data marts are created based on individual user requests – Xie, Yang, Liu, Qiu, Pan, & Zhou (2007) propose using ontologies to map a business world view to the DWH world view, enabling the users to express the query in familiar business terms. Furthermore, Pan and Heflin (2003) introduce an ontology-based mechanism enabling DB users to express their queries in their natural language; this refers to a functionality commonly called semantic querying. In semantic querying, the user query is analyzed with natural language processing techniques and transformed into a machine understandable query. The mechanism is designed for any DB, and can hence be applied in data warehousing as well.

⁴ The ontological adequacy refers to the correctness in the representation of ontologies, which may be endangered

Figure 3 summarizes where ontologies have been used in data warehousing.

[Figure 3 diagram; labels: DB sources, ETL, DWH, schema (re)design, QP, users; areas numbered 1 to 3]

Figure 3. Current uses of ontologies in supporting data warehousing

1.6 PROBLEM & AIM

The use of ontologies in data warehousing has mainly been studied in isolation, focusing on either one of the three identified DWH limitation areas. Separate implementations of ontologies for overcoming the mentioned DWH limitations are expensive because the knowledge acquisition bottleneck applies to every single implementation. This bottleneck refers to the difficulties of capturing and encoding knowledge for use in the system (Rubin, Smith, & Trajkovic, 1999), which include low speed of acquisition, knowledge latency, knowledge inaccuracy, and difficult maintenance (Wagner, 2007). Considering the use of ontologies in data warehouses, it seems likely that the knowledge acquisition bottleneck can be alleviated by reusing similarities in these ontologies and their applications, thereby reducing costs. However, there has been no research conducted on using ontologies in data warehousing for addressing multiple limitations in a single integral framework.


Figure 4. Supported developments with regard to ontological support in data warehousing (dotted arrows)

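| Node | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Ontology available | Yes* | Yes | Yes | Yes | Yes |
| DWH employed | No | Yes | Yes | Yes | Yes |
| ETL support | No | No | Yes | No | Yes |
| QP support | No | No | No | Yes | Yes |

* Or easily obtainable

Table 2. Explanation of nodes in Figure 4.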

The arrows in Figure 4 represent developments from an as-is configuration to a to-be configuration. Solid arrows represent well-studied developments, whereas dotted arrows represent developments which could benefit from reusing similarities in ontological use in data warehousing. For example, the development depicted by the arrow from node 2 to node 3 (hereafter: development [2,3]) is well studied; here, the support for ETL by ontologies is to be developed, and one can refer to, for example, Critchlow, Ganesh and Musick (1998). In contrast, development [3,5] is an example for which there has been no research at all. Here, the use of ontologies for ETL is to be extended to QP support. In this case, reusing ontologies that were in place for ETL support might be very cost efficient. Figure 4 will be referred to in the results section of this paper to indicate which developments can benefit.
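The node configurations of Table 2 can be written down as data, which makes explicit what a given development adds. The following is only a minimal sketch: the names `Config`, `NODES` and `added_support` are illustrative assumptions and not part of the framework described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One node of Figure 4, with the properties listed in Table 2."""
    ontology: bool
    dwh: bool
    etl_support: bool
    qp_support: bool


# Node properties copied from Table 2
# (node 1: ontology available or easily obtainable).
NODES = {
    1: Config(True, False, False, False),
    2: Config(True, True, False, False),
    3: Config(True, True, True, False),
    4: Config(True, True, False, True),
    5: Config(True, True, True, True),
}


def added_support(src: int, dst: int) -> list:
    """Return the properties that development [src,dst] switches from No to Yes."""
    a, b = NODES[src], NODES[dst]
    return [name for name in ("dwh", "etl_support", "qp_support")
            if not getattr(a, name) and getattr(b, name)]


print(added_support(2, 3))  # ['etl_support']
print(added_support(3, 5))  # ['qp_support']
```

Development [3,5] adds only QP support to a configuration that already has ontology-based ETL support, which is the situation in which reuse is expected to pay off.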


The remainder of the paper is organized as follows. In section 2, the virtual approach is introduced. This is the alternative approach to the data integration problem. This approach is also considered in this study as will be explained in the methodology section (section 3). Section 4 presents the results of the literature studies. Section 5, finally, concludes the paper.

2 INTRODUCTION TO THE VIRTUAL APPROACH

In this section, an understanding of the basic characteristics and developments of the virtual approach is provided first. Next, the virtual approach is compared with the materialized approach, or data warehousing.

2.1 DEVELOPMENT AND TYPES OF THE VIRTUAL APPROACH

In this paper, the virtual approach is seen as the logical complement of data warehousing in the data integration problem. In the virtual approach there is – in contrast to data warehousing or the materialized approach⁵ – no central data repository with integrated data. This dichotomy can also be found in Haas, Schwartz and Kodali (2001) and Hull (1997). Unfortunately, there is no naming convention for the virtual approach. According to the above description of data integration, the virtual approach is here synonymous with the federated approach, in which federated DB systems are used. These systems are an important part of the virtual systems.

A federated database system (FDBS) is a collection of cooperating but autonomous component DBSs, and a distinct, central component which does not store any content data (Heimbigner & McLeod, 1985). Hence, there is no central data repository like in data warehousing. Whereas in data warehousing data integration and QP are decoupled activities, i.e. they occur independently, in FDBSs they are coupled activities; data is extracted from the appropriate source DBs, integrated, transformed and presented to the user based on a specific query. This approach to the data integration problem has also been called integration on-the-fly.

FDBSs have developed considerably over the last three decades. Federation for the data integration problem was first proposed by Heimbigner and McLeod (1985), although it evolved from the idea of a single federated database (Hammer & McLeod, 1980). FDBSs were proposed as a response to the shortcomings of database integration⁶, viz. violation of source DBS autonomy. The design of the proposed FDBS is highly decentralized; source DBS administrators have to determine which local data is shared – this is formulated in export schemas – and which non-local data is allowed to be imported, formulated in import schemas. The data integration activity can be executed by each of the component DBSs, triggered by the query it received from the user. The central component, called the federal dictionary, has only administrative functions: supporting the establishment, maintenance and termination of the federation (Heimbigner & McLeod, 1985). In this paper this type of system is called a loosely coupled FDBS, following the terminology of Hammer and McLeod (1993) and in line with most literature, although some believe all FDBSs are loosely coupled systems (e.g. Bright, Hurson and Pakzad, 1992).

In many later approaches the data integration function is allocated to the central system, for example by Sheth and Larson (1990), who call it the federated database management system (FDBMS). All users of this centralized federation address the central component for their global queries (i.e. queries that cannot be answered by a single source DB), which in turn retrieves the data from the sources, integrates the data and presents the results to the user. Some of these more centralized FDBSs have been called virtual data warehouses (Hull & Zhou, 1996), global schema multidatabases (Bright, Hurson, & Pakzad, 1992) and federated data warehouses (Kerschberg, 2001), as the system can be queried like a DWH. In this paper, all federated systems in which a central system is addressed in QP are called tightly coupled FDBSs, again conforming to Hammer and McLeod (1993).

⁵ The materialized approach and data warehousing are used as synonyms. In the remainder of the paper, the term data warehousing is mostly used, conforming to most literature.

⁶ Note that database integration is different from data integration, since the DBs are ‘merged’ into one new DB and

There are many differences among data integration systems that can be classified as tightly coupled FDBSs. Firstly, the architecture of the central system can be mediator-based, agent-based or wrapper-based (Buccella, Chechich, & Brisaboa, 2005), and can involve many components. Secondly, tightly coupled FDBSs can use their own global schema, which is expressed in terms of the local schemas (global as view – GAV) or vice versa (local as view – LAV) (Rousset & Reynaud, 2004), although a global schema is not necessarily required (see e.g. Firat, Madnick, and Grosof, 2007). Thirdly, the identification of source DBs for a specific query is usually executed by the FDBMS, but in multidatabase language systems, for example, in which no global schema is used, the sources are to be indicated by the user in the query (see Bright, Hurson and Pakzad, 1992).
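To make the GAV idea concrete, the sketch below defines one global relation as a view over two local sources and answers a global query by evaluating that view at query time. It is an illustration under stated assumptions only: the sources `SALES_DB` and `CRM_DB`, their attribute names and the global relation `customer_sales` are hypothetical. Under LAV, each source would instead be described as a view over the global schema, and answering a query would require rewriting it in terms of those source descriptions; that case is not shown.

```python
# Two hypothetical local sources with heterogeneous schemas (assumed for illustration).
SALES_DB = [
    {"cust": "Acme", "amount": 120.0},
    {"cust": "Bolt", "amount": 80.0},
]
CRM_DB = [
    {"customer_name": "Acme", "country": "NL"},
    {"customer_name": "Bolt", "country": "DE"},
]


def global_customer_sales():
    """GAV: the global relation customer_sales(name, country, amount)
    is defined as a view (here a join) over the local relations."""
    country_of = {row["customer_name"]: row["country"] for row in CRM_DB}
    for row in SALES_DB:
        yield {
            "name": row["cust"],
            "country": country_of.get(row["cust"]),
            "amount": row["amount"],
        }


def total_by_country(country):
    """A global query is answered by evaluating the view over the sources."""
    return sum(r["amount"] for r in global_customer_sales()
               if r["country"] == country)


print(total_by_country("NL"))  # 120.0
```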

An important difference between loosely and tightly coupled FDBS is the query plan. Figure 5 illustrates this difference and also compares the query plans to the query plan in data warehousing. The arrows represent data streams, corresponding to a query or to ETL.

[Figure 5 diagram; labels: DBS sources; Extract, Transform, Load; local QP; global QP; integration ‘on the fly’; users]

Figure 5. Query plans for data integration systems

This figure shows that users of both DWHs and tightly coupled FDBS address global queries to a centralized system, whereas users of loosely coupled FDBS address all their queries to the component DBSs.
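The difference between integrating before a query is posed and integrating on the fly can be sketched in a few lines. The example below is a simplified illustration, not a description of any cited system; the `sources` dictionary and the functions `integrate`, `query_dwh` and `query_fdbs` are hypothetical names introduced here.

```python
# Hypothetical component DBSs, each a list of records (assumed for illustration).
sources = {
    "sales": [{"id": 1, "amount": 100}],
    "web": [{"id": 2, "amount": 50}],
}


def integrate():
    """Extract records from every source and merge them into one unified view."""
    return [dict(rec, source=name)
            for name, recs in sources.items()
            for rec in recs]


# Materialized approach: integration (ETL) runs before any query is posed.
warehouse = integrate()


def query_dwh():
    """Global query answered from the previously loaded warehouse."""
    return sum(rec["amount"] for rec in warehouse)


def query_fdbs():
    """Global query in a tightly coupled FDBS: integration runs per query."""
    return sum(rec["amount"] for rec in integrate())


# A source changes after the last ETL run.
sources["web"].append({"id": 3, "amount": 25})

print(query_dwh())   # 150: reflects the state at the last load
print(query_fdbs())  # 175: reflects the current state of the sources
```

In both cases the user addresses a single central entry point; only the moment of integration differs.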

2.2 THE VIRTUAL VERSUS THE MATERIALIZED APPROACH

There has been a great deal of debate on whether the virtual approach or the materialized approach is superior for solving the data integration problem. Claims of victory can be found on both the DWH side (Inmon, 2004) and the virtual side (e.g. Kerschberg, 2001; Firestone, 1999). Advantages of the virtual approach over data warehousing include:

- Real time QP – The answers to queries can describe fresh data, whereas in data warehousing there is a time lag as ETL is scheduled periodically, although there is a growing body of literature dealing with a relatively new concept: real time data warehousing.

- Less maintenance – No centralized content data exists that needs to be appended, updated or deleted.

- Less intrusiveness – There need not be a single world view that is imposed on the users (Castillo, Silvescu, Caragea, Pathak, & Honavar, 2003).

The disadvantages of federation compared with data warehousing include:

- Hindrance of source system use – Extraction of data from component DBSs cannot be scheduled, and might therefore hinder local DBS use because of high system load.

- Longer query resolution time – This is due to additional query processing activities – viz. extracting and integrating – and due to schema design; usually, more join operations are needed as component DBS schemas are normalized.

- Inability to trace history – Component DBSs such as transaction processing systems often delete old, obsolete records for performance reasons. These records cannot be queried in an FDBS, whereas they are often maintained in a DWH.

- No support for traceability – In federation, the same query posed at different moments can yield different and possibly conflicting answers, in contrast to many data warehousing solutions (although dependent on the responses to slowly changing dimensions).


3 METHODOLOGY

The identification of the ontology reuse opportunities in data warehousing is executed by conducting a literature analysis such that various ways of using ontologies can be analyzed. The analysis consists of two parts, corresponding to two bodies of literature: data warehousing and the virtual approach to the data integration problem.

3.1 DATA WAREHOUSING

Firstly, literature on current uses of ontologies in data warehousing is analyzed. This is a natural place to start given the aim of the research, but unfortunately this body of literature, specifically aiming at data warehousing, is very limited in size. The aim of the analysis is to find similarities which can be exploited for reuse. These similarities can refer to (parts of) ontologies themselves, or to applications which use them. In this study, the literature is grouped according to the function of the ontology in the context of data warehousing: supporting ETL, supporting schema design or supporting QP. In this paper, the set of literature that deals with ontology-based QP in data warehousing is supplemented with general DB literature dealing with ontology-based QP support. This is possible because the corresponding solutions for DBs can usually also be applied in data warehousing; querying a DWH is conceptually not different from querying a regular DB.

3.2 VIRTUAL APPROACH

Since the first body of literature is limited in size, another body of literature is considered which deals with the virtual approach. The reason to investigate the use of ontologies in this approach as well is that the function of systems in this approach – in principle the same as the function of DWHs – is more often supported by ontologies. In fact, most work on ontologies in the field of the data integration problem only considers the virtual approach; see e.g. Wache et al. (2001). In order to identify ontology reuse possibilities in data warehousing, the focus in this literature is on systems in which ontologies support both the data integration and the QP activities. Among these systems, the ontological architecture is analyzed in this paper. Possibly, in the literature, the same (parts of) ontologies are used for both the data integration activity and QP. Because data is integrated based on a query, these activities (viz. data integration and QP) may be executed as one process. However, data integration and QP can be distinguishable sub-activities of this process, and these activities might be decoupled. In that case, ontology support for these activities may be well suited for data warehousing as well, since these activities have the same purpose in data warehousing as in the virtual approach, apart from loading data in a DWH. Schema design in the virtual approach is – when a global schema exists – very different from schema design in data warehousing, as many other factors play a role in the materialized approach. For example, data redundancy does not play a role in the virtual approach. Therefore, possible ontological support for schema design of virtual systems is not considered.

4 RESULTS

This section presents the findings of (1) the investigation into the use of ontologies in data warehousing and (2) the study on the use of ontologies in the virtual approach and how they can be transferred to a DWH environment.

4.1 DATA WAREHOUSING

4.1.1 ADDRESSING THE ETL LIMITATION

Critchlow, Ganesh, & Musick (1998) use formal ontologies for automatic mediator generation, automating an essential part of the ETL (re)design process, given the relevant DB descriptions of both the sources and the DWH. A single ontology is used, consisting of four concepts:

- abstractions: domain specific concepts

- databases: DB descriptions including attributes (both sources and DWH)

- mappings: mappings between a DB and an abstraction

- transformations: transformation functions to resolve differences in representation

This type of ontology can be qualified as unconventional as mappings and transformations do not actually represent a real thing in the world.
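How the four concept types could work together when a mediator is generated can be illustrated as follows. This is a sketch under stated assumptions, not the mechanism of Critchlow, Ganesh and Musick (1998): the `ontology` dictionary, its attribute names and the function `generate_mediator` are hypothetical.

```python
# Hypothetical ontology instance using the four concept types listed above.
ontology = {
    "abstractions": ["customer_name", "revenue_eur"],
    "databases": {
        "source": ["cust", "rev_cents"],
        "dwh": ["customer", "revenue"],
    },
    # mappings: (database, attribute) -> abstraction
    "mappings": {
        ("source", "cust"): "customer_name",
        ("source", "rev_cents"): "revenue_eur",
        ("dwh", "customer"): "customer_name",
        ("dwh", "revenue"): "revenue_eur",
    },
    # transformations: convert a source value to the abstraction's representation
    "transformations": {
        ("source", "cust"): str.strip,
        ("source", "rev_cents"): lambda cents: cents / 100,
    },
}


def generate_mediator(onto, src, dst):
    """Derive a record converter from src to dst by matching
    attributes of both databases through their shared abstractions."""
    target_attr = {abstraction: attr
                   for (db, attr), abstraction in onto["mappings"].items()
                   if db == dst}
    plan = []
    for (db, attr), abstraction in onto["mappings"].items():
        if db == src and abstraction in target_attr:
            convert = onto["transformations"].get((db, attr), lambda v: v)
            plan.append((attr, target_attr[abstraction], convert))

    def mediator(record):
        return {target: convert(record[attr]) for attr, target, convert in plan}

    return mediator


mediate = generate_mediator(ontology, "source", "dwh")
print(mediate({"cust": " Acme ", "rev_cents": 12345}))
# {'customer': 'Acme', 'revenue': 123.45}
```

In this sketch, adding a new source only requires new entries under databases, mappings and transformations; the generation step itself stays unchanged, which is the kind of maintenance saving the approach aims at.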

Skoutas and Simitsis (2007a) also use ontologies for the design of ETL processes. They distinguish between the application ontology and the ontologies which semantically annotate DBSs, the latter of which is described in OWL-DL. Mappings between these ontologies are inferred, which are initiated manually and resolved automatically. Xuan, Bellatreche and Pierra (2006) propose the floating version model that fully automates ETL redesign by accommodating schema changes in source DBSs. However, it is assumed that a hybrid ontological architecture is employed, in which local ontologies are referenced to the shared ontology, as discussed in Bellatreche, Pierra, Xuan, Hondjack, & Ameur (2004). In other words, they assume the existence of both ontology-based data sources and ontology-based DWHs (where local ontologies are richer than just schemas).

Baumbach, Brinkrolf, Czaja and Rahmann (2006) use ontology-based data structures to parse and convert data from source DBSs to the DWH. First, data are extracted from the sources and put together in an ontology-based data structure. This structure is transformed into a relational structure, which in turn is transformed to a flat file and in the end to the DWH. Unfortunately, it is not clear why ontology-based data structures are used in their approach, and what benefits are achieved.

4.1.2 ADDRESSING THE SCHEMA DESIGN LIMITATION

Romero and Abello (2007) claim to be the first ones who propose a semi-automatic approach to generating multidimensional schemas – schemas which are frequently used in data warehousing – from ontologies. Here only a single domain ontology is used, describing the data sources, which can be relational DBs but also text or web files. This ontology need not describe all relationships between its concepts, as the method is able to infer them through reasoning.

Phipps and Davis (2002) automate the development and evaluation of the conceptual schema of a DWH. A starting point here is an enterprise-wide entity-relationship (ER) schema of a source DB. User input is used for further refinements of the DWH schema. The output consists of multidimensional ER schemas, which can be evaluated by analyzing the feasibility to answer specific user queries.

ambiguous semantics in conceptual models. For example, ontology theory can help testing whether a conceptual model is clear or complete.

4.1.3 ADDRESSING THE QP LIMITATION

Pan and Heflin (2003) extend DBSs to support semantic queries by using ontologies. Their system, DLDB, is based on a semantic mark-up language called DAML+OIL (see Connolly, van Harmelen, Horrocks, McGuinness, Patel-Schneider, & Stein, 2001). In this language, the relational database is represented in an ontology using a combination of the property table approach and the horizontal class approach. The query Application Programming Interface (API) component of the DLDB system receives the queries from the users, translates them using this ontology and poses them to the DB.
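The translation step can be illustrated with a minimal sketch. It is not DLDB's actual implementation: the table layout (one table per class, one table per property), the class hierarchy and all names such as `SUBCLASSES` and `translate` are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical class hierarchy taken from an ontology (assumption).
SUBCLASSES = {"Person": ["Student", "Professor"]}

def classes_under(cls):
    """Return cls together with all of its transitive subclasses."""
    result = [cls]
    for sub in SUBCLASSES.get(cls, []):
        result.extend(classes_under(sub))
    return result

def translate(cls, prop):
    """Translate 'value of prop for every instance of cls' into SQL.

    Assumes one table per class (column id) and one table per
    property (columns subject, value).
    """
    class_union = " UNION ".join(
        f'SELECT id FROM "{c}"' for c in classes_under(cls)
    )
    return (
        f'SELECT p.subject, p.value FROM "{prop}" AS p '
        f"WHERE p.subject IN ({class_union}) ORDER BY p.subject"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id TEXT);
CREATE TABLE Student (id TEXT);
CREATE TABLE Professor (id TEXT);
CREATE TABLE name (subject TEXT, value TEXT);
INSERT INTO Student VALUES ('s1');
INSERT INTO Professor VALUES ('p1');
INSERT INTO name VALUES ('s1', 'Ann'), ('p1', 'Bob'), ('x9', 'Eve');
""")

# A query on the general class also returns instances of its subclasses.
rows = conn.execute(translate("Person", "name")).fetchall()
```

The user asks in ontology terms ("names of all persons"); the subclass knowledge in the ontology is what lets the generated SQL also reach the `Student` and `Professor` tables.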

Kamal, Borlawsky, Dhaval and Payne (2007) develop a DWH metamodel for QP. This metamodel can be seen as an ontology which consists of the DWH data descriptions and a knowledge base. This knowledge base is created from various source system vocabularies, which are linked to the DWH ontology. These vocabularies might be better aligned with the user view than the DWH vocabulary. Therefore, the metamodel is able to assist the user in formulating queries and understanding DWH data.

4.1.4 OVERVIEW AND CONCLUSIONS

The results are summarized in Table 3.

| Reference | Use | One or more ontologies | Ontologies describing |
|---|---|---|---|
| Critchlow, Ganesh and Musick (1998) | ETL | One | Abstractions (domain concepts); databases (sources & DWH); mappings; transformations |
| Xuan, Bellatreche and Pierra (2006) | ETL | Multiple (local and shared) | Local ontologies: each local data source; shared ontology: DWH |
| Skoutas and Simitsis (2007a) | ETL | Multiple | The application; other ontologies: source DBSs |
| Romero and Abelló (2007) | Schema design | One | All sources (of potentially different structures) |
| Phipps and Davis (2002) | Schema design | One | Source DB (preferably enterprise-wide) |
| Pan and Heflin (2003) | QP | One | Relational database (DWH) |
| Kamal, Borlawsky, Dhaval and Payne (2007) | QP | Multiple | DWH; source vocabulary |

Table 3. Current uses of ontologies in data warehousing

What can be concluded from the previous discussion is that in both ETL and QP support, the DWH data characteristics (such as syntax, semantics and schematics) are captured in either a separate ontology or a part of an ontology. Although the detailed implementations differ, this indicates that it is in principle possible to use the same ontology, or the same part of an ontology, in a system supporting both QP and ETL. This solution would be cheaper and easier to build than separate systems, as one component is completely reused. Maintenance would also be cheaper and easier, as updates have to be accommodated only once. Furthermore, the management system, used for e.g. altering the ontology, can be reused as well, although this advantage depends on the other components of the supporting systems. Developments [2,5], [3,5] and [4,5] in Figure 4 can benefit from this reuse.

Not surprisingly, DWH data characteristics are not described in ontologies used for DWH schema design, as those should be the end result of schema design. However, Romero and Abelló (2007) use an ontology describing DBs, and such an ontology is also used in systems which support ETL. This indicates that ontology reuse is possible here as well: the ontology used in Romero and Abelló's method for designing the DWH schema can also be used for ETL support. Not all advantages of the previous result apply; since the ontology of Romero and Abelló (2007) is used only once, at design time, it does not need to be maintained at runtime, so the advantage of reduced maintenance does not hold. Nevertheless, reuse can be beneficial for development [1,3] and, given that the current DWH is developed using a method like that of Romero and Abelló (2007), also for development [2,3].

4.2 THE VIRTUAL APPROACH

As already indicated, ontologies are used much more extensively in the virtual approach than in data warehousing. Buccella, Cechich and Brisaboa (2005) have reviewed nine federated systems on aspects including architecture, ontological use, and query plan. As in data warehousing, ontologies can support various activities or functionalities in FDBSs. However, there is no uniform way in which ontologies are used, as FDBSs can be very different. In this section, the results of the analysis of tightly coupled FDBSs are presented first, together with the corresponding conclusions. These systems most resemble DWHs, as a central system is addressed for global queries and executes the integration. Subsequently, loosely coupled FDBSs are discussed with regard to the use of ontologies, although these systems have less in common with data warehousing and are nowadays less popular.

4.2.1 TIGHTLY COUPLED FDBS

Arruda, Baptista, and Lima (2002) propose a system for integrating structured or semi-structured data sources on the web. The semi-structured data sources are assumed to be exported into XML documents. The system uses an ontology as a common schema, and consists of a search engine, a mediator, an XML query engine and wrappers. The query plan is as follows. First, the user composes a query using a chosen user ontology and submits it to the search engine. The search engine identifies the ontology and interacts with the mediator for query rewriting. The mediator rewrites the query in accordance with the common conceptual ontological terms. The search engine then routes the rewritten query to either the XML Query Engine, if XML documents are queried, or to the wrappers, if structured DBs are queried. The wrappers rewrite the query again, according to a previously established correspondence between ontology and schema, such that it can be read by the DBMS of the local repository. The wrappers and the XML Query Engine retrieve the data from the data sources, and the wrappers translate these into XML format. The mediator receives and integrates the XML documents and presents the result to the user.

Hence, ontologies are used to translate a user query to a global query for QP, and to translate a global query to a local query for the data integration activity. These translations are clearly separated and both are supported by a global ontology describing a general vocabulary. This already indicates that this part of the ontology can be reused. First however, another system is discussed.
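The two separate translations can be sketched as follows. This is a simplified illustration under stated assumptions, not the MEDIWEB implementation: the dictionaries, source names and functions (`USER_TO_COMMON`, `SOURCE_SCHEMAS`, `mediator_rewrite`, `wrapper_query`) are hypothetical, and in-memory lists stand in for the XML documents and local DBs.

```python
# User ontology term -> common ontology term (used by the mediator).
USER_TO_COMMON = {"item": "product", "cost": "price"}

# Common ontology term -> local schema name, per source (used by wrappers).
SOURCE_SCHEMAS = {
    "shop_db": {"product": "prod_name", "price": "unit_price"},
    "catalog_xml": {"product": "Title", "price": "Price"},
}

# Stand-ins for the actual data sources.
SOURCE_DATA = {
    "shop_db": [{"prod_name": "pen", "unit_price": 2.0}],
    "catalog_xml": [{"Title": "ink", "Price": 5.5}],
}

def mediator_rewrite(user_terms):
    """First translation: user query terms -> common ontology terms."""
    return [USER_TO_COMMON.get(term, term) for term in user_terms]

def wrapper_query(source, common_terms):
    """Second translation: common terms -> local names, then fetch rows.

    Results are returned in common terms so that they can be integrated.
    """
    mapping = SOURCE_SCHEMAS[source]
    return [
        {term: record[mapping[term]] for term in common_terms}
        for record in SOURCE_DATA[source]
    ]

def answer(user_terms):
    """Run the whole query plan and integrate the per-source results."""
    common_terms = mediator_rewrite(user_terms)
    merged = []
    for source in SOURCE_SCHEMAS:
        merged.extend(wrapper_query(source, common_terms))
    return merged

result = answer(["item", "cost"])
```

Both translation steps depend only on the common vocabulary, which is why that part of the ontology is a candidate for reuse.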

example, in a user view, ‘price’ might include tax, whereas a definition of price in a DB might exclude tax, in which case the answer will be interpreted incorrectly. Therefore, the mediator translates this query according to both the user context, in which his preferences or expectations are stored, and the source contexts, describing the source data characteristics. Both these contexts are described with respect to the ontology. This ontology includes both a domain and context model. The domain model is used to define a common type system, thereby describing only generic concepts (e.g. price). These generic concepts are sliced and diced by modifiers which define the context model. This context model describes how users or DBs can interpret the generic concepts (e.g. price is either including or excluding tax). Hence, the ontology can be seen as a top level ontology that is prepared for mapping to more specified models, relating to either information sources or receivers.
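The price example can be made concrete with a small sketch of a modifier-based context conversion. The class `PriceContext`, the function `convert_price` and the tax rate are assumptions for illustration; they are not taken from the eCOIN framework itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PriceContext:
    """One modifier of the generic concept 'price' (hypothetical)."""
    includes_tax: bool

def convert_price(value, source, receiver, tax_rate):
    """Convert a price from the source context to the receiver context."""
    if source.includes_tax == receiver.includes_tax:
        return value  # same interpretation, nothing to convert
    if receiver.includes_tax:
        return value * (1 + tax_rate)  # add tax for the receiver
    return value / (1 + tax_rate)  # strip tax for the receiver

db_context = PriceContext(includes_tax=False)   # the DB stores net prices
user_context = PriceContext(includes_tax=True)  # the user expects gross prices

# The tax rate of 25% is an arbitrary assumption.
user_price = convert_price(100.0, db_context, user_context, tax_rate=0.25)
```

The generic concept stays the same for every party; only the modifier values differ per context, so adding a new source or user means adding a context rather than changing the domain model.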

Proposed application in data warehousing

Since user queries are directly translated to DB queries, there is no clear distinction between QP and data integration activities. However, a top-level ontology that is mapped to user contexts and DB contexts can in principle serve as a shared component of an ontological system supporting both ETL and QP in data warehousing. In such a system, the DWH schema can be seen as another context, such that translating a user query to a DWH query can be supported, as well as translating source data to the DWH view in the ETL process.

Because this top-level ontology is reused, cost savings can be achieved in developments [1,5], [2,5], [3,5] and [4,5], with regard to Figure 4.

More specifically, the system proposed by Critchlow, Ganesh, & Musick (1998) for ETL support already uses abstractions, to which detailed descriptions of the DBs and the DWH are mapped. This system could be extended to support QP as well by also storing specific descriptions of user contexts and mapping them to the same abstractions. In this way, the abstractions are reused for QP support, which can save costs in a [3,5] development.
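A minimal sketch of such a shared component is given below, assuming that every context (source, DWH, user) is described only by a mapping from its own names to generic concepts. The context names, field names and functions are hypothetical and do not come from the cited systems.

```python
# Shared component: each context maps its own names to generic concepts.
CONTEXTS = {
    "source_crm": {"cust_nm": "customer_name", "amt": "price"},
    "dwh": {"customer": "customer_name", "sales_price": "price"},
    "user_sales": {"client": "customer_name", "cost": "price"},
}

def rename(name, from_ctx, to_ctx):
    """Translate one name between two contexts via its generic concept."""
    concept = CONTEXTS[from_ctx][name]
    for target_name, target_concept in CONTEXTS[to_ctx].items():
        if target_concept == concept:
            return target_name
    raise KeyError(f"{to_ctx} has no name for concept {concept!r}")

def etl_translate(record, source_ctx):
    """ETL support: express a source record in the DWH view."""
    return {rename(key, source_ctx, "dwh"): value for key, value in record.items()}

def qp_translate(fields, user_ctx):
    """QP support: express the fields of a user query in the DWH view."""
    return [rename(field, user_ctx, "dwh") for field in fields]

dwh_record = etl_translate({"cust_nm": "Acme", "amt": 10}, "source_crm")
dwh_fields = qp_translate(["client", "cost"], "user_sales")
```

Both translators call the same `rename` function over the same mappings, which is the reuse argued for above: adding a user context requires no change to the ETL side.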

A general architecture of data warehousing with such ontology-based support is illustrated in Figure 6.

[Figure 6: architecture diagram. Recoverable labels: Users; DBS sources; DWH; Ontology-based Translator for ETL; Ontology-based Translator for QP; Top level ontology; arrows numbered 1 to 4.]

4.2.2 LOOSELY COUPLED FDBS

A common characteristic of loosely coupled FDBSs is that the local component DBSs are addressed for global queries. Creating a federation here corresponds to extending the scope of data available at each participating DBS. This works well when local DBS users are also the users of the FDBS; new users, however, may need to address multiple DBSs. They therefore need to learn about the various DBSs, such as their data characteristics and the corresponding user interfaces, in order to fulfil their information needs. In contrast, current users only need to learn about the import schema of their DBS; the loosely coupled FDBS thereby largely avoids the QP limitation of other integration systems.

Although many new schemas have to be created when developing loosely coupled FDBSs, the corresponding literature does not consider creating richer ontologies. Analogously to other data integration systems, ontologies could help to map export schemas to import schemas for data integration. In QP, however, ontologies are not needed, as the local users already possess knowledge of the data characteristics and the DBMS of the local DBS. Despite the lack of ontologies in the literature on loosely coupled FDBSs, the way in which these systems avoid the QP limitation can be applied in data warehousing as well.

Proposed application in data warehousing

The same benefits can be achieved in data warehousing, provided that some source DB users are also users of the DWH. In the same way, a (virtual) import schema, which is mapped to the DWH, could be developed for each local DBS such that QP is facilitated in two ways:

1. Expressing the query – The (extended) DB descriptions can be used in an ontology-based QP support system such that the users of the DWH can formulate their query based on the (extended) view of their own DBS; the support system can translate the locally expressed query into a DWH readable query. This would reduce the need to learn a new DB: only learning the import schema is required.

2. Addressing the query – The local DBMS can be enhanced with functionality that allows its users to pose global queries. The local DBMS could submit these queries to an ontology-based QP support system. This would eliminate the need to learn a new user interface. This benefit only holds under the assumption that the users already address their local queries directly to the local DBMS. If this is not the case, i.e. when they use other applications for their local queries, these applications might be considered for an enhancement that allows posing global queries.

Obviously, the same can be applied in reverse, by translating a query answer expressed by the DWH into the view of the (extended) local DB. Furthermore, the local DBMS interface can be used to display the query answer.
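The round trip described above (a query expressed in the local view, posed to the DWH, and answered in the local view) can be sketched as follows. The import schema, the table `fact_orders` and the function `to_dwh_query` are hypothetical; an in-memory SQLite database stands in for the DWH.

```python
import sqlite3

# Hypothetical import schema: name in the extended local view -> DWH column.
IMPORT_SCHEMA = {"order_no": "order_id", "total": "order_amount"}

def to_dwh_query(local_fields, dwh_table="fact_orders"):
    """Translate a locally expressed field list into a DWH readable query.

    The column aliases translate the answer back into the local view.
    """
    columns = ", ".join(f"{IMPORT_SCHEMA[f]} AS {f}" for f in local_fields)
    return f"SELECT {columns} FROM {dwh_table} ORDER BY 1"

# Stand-in for the DWH.
dwh = sqlite3.connect(":memory:")
dwh.row_factory = sqlite3.Row
dwh.executescript("""
CREATE TABLE fact_orders (order_id INTEGER, order_amount REAL);
INSERT INTO fact_orders VALUES (7, 19.5), (8, 4.0);
""")

query = to_dwh_query(["order_no", "total"])
answer = [dict(row) for row in dwh.execute(query)]
```

In this sketch the user only ever sees `order_no` and `total`; the DWH names appear solely in the generated query, which is the intended reduction in learning effort.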

This architecture can be more beneficial when it is combined with an ontology-based ETL support system, such as the system proposed by Critchlow, Ganesh, & Musick (1998), since DB descriptions can be reused. Figure 7 illustrates this combined architecture. In this figure, the numbers 1 to 8 represent the data stream of a global query. The letters a to d represent the data stream of the ETL process. The remaining arrows represent (1) local DBS use (at the local DB); (2) what is described by the central ontology (dashed) and (3) that the depicted ontology is used by both support systems.

In this architecture, not only are changes in source DBs accommodated more efficiently, as shown by Critchlow, Ganesh, & Musick (1998), but the development and maintenance costs for such an ETL support system are reduced as well, as the ontology describing import schemas, local DBs and the DWH is reused. This finding can be applied in developments [1,5], [2,5], [3,5], and [4,5]. In addition to this proposed architecture, the ontology-based QP support system can be enhanced with user ontologies or user contexts, analogous to eCOIN, that enable knowledge workers who are not users of enhanced local DBSs to address the DWH with a familiar world view as well.

[Figure 7: architecture diagram. Recoverable labels: Users; Local DBS (Local DBMS, Local DB, Import schema); Ontology-based QP support system; Ontology-based ETL support system (mediator); DWH (DBS) (DWH-DBMS, DWH (DB)); QP arrows numbered 1 to 8; ETL arrows lettered a to d.]

Figure 7. A proposed architecture for reusing DB descriptions in ontology-based data warehousing

5 CONCLUSION

By analyzing tightly coupled FDBSs, a top-level ontology was identified as a component of the data integration system. This component can be used in data warehousing for translating user queries to DWH readable queries, and for translating source data to DWH data, again alleviating limitations of ETL and QP using one component. Based on an assumption in loosely coupled FDBSs, i.e. that some local DBS users are also the users of the integration system, source DB ontologies can also be reused in data warehousing, as these users are already familiar with the source DB. This reuse can reduce the intrusiveness of the DWH for these users.

Combinations of results can be used for many DWH improvement programs, and can reduce development and maintenance costs. The combination of the latter two results is very useful, since it is likely that some, but not all, of the DWH users are also source system users. A corresponding architecture is proposed, which deals with ETL and QP limitations efficiently as DB descriptions are reused for QP and ETL. In addition to this architecture, user ontologies or 'user contexts' may be designed for DWH users who do not use any of the source systems (intensively), reducing the DWH intrusiveness for all DWH users and making the DWH even more user friendly. Future work could shift the focus to ontologies that are used in organizations for other purposes. For example, they may be used in structuring unstructured texts. These ontologies might also be useful in developing data integration systems, in which case ontologies can play an even more central role in information management. More practical future work could implement the proposed architectures and provide empirical support for the claimed advantages.

6 REFERENCES

Alasoud, A., Haarslev, V., & Shiri, N. (2005). A hybrid approach for ontology integration. Proc. VLDB Workshop on Ontologies-based techniques for DataBases and Information Systems (ODBIS). Trondheim, Norway.

Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., & Tolle, K. (2001). The ICS-FORTH RDFSuite: Managing voluminous RDF description bases. 2nd International Workshop on the Semantic Web.

Arruda, L., Baptista, C., & Lima, C. (2002). MEDIWEB: A Mediator-based environment for data integration on the web. Databases and Information Systems Integration. ICEIS, 34-41.

Baumbach, J., Brinkrolf, K., Czaja, L. F., Rahmann, S., & Tauch, A. (2006). CoryneRegNet: An ontology-based data warehouse of corynebacterial transcription factors and regulatory networks. BMC Genomics, 7, 24.

Bellatreche, L., Pierra, G., Xuan, D. N., Hondjack, D., & Ameur, Y. A. (2004). An a priori approach for automatic integration of heterogeneous and autonomous databases. Lecture Notes in Computer Science, 475-485.

Bright, M., Hurson, A., & Pakzad, S. (1992). A taxonomy and current issues in multidatabase systems. Computer, 25, 50-60.

Buccella, A., Cechich, A., & Brisaboa, N. (2005). Ontology-Based Data Integration Methods: A Framework for Comparison. Colombian Journal of Computation, 6 (1).

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26 (1), 65-74.

Connolly, D., van Harmelen, F., Horrocks, I., McGuinness, D. L., Patel-Schneider, P. F., & Stein, L. A. (2001). DAML+OIL (March 2001) reference description. W3C Note, 18.

Critchlow, T., Ganesh, M., & Musick, R. (1998). Automatic Generation of Warehouse Mediators Using an Ontology Engine. 5th International Workshop on Knowledge Representation Meets Databases (KRDB '98), 10, pp. 8-1 - 8-8. Seattle.

Firat, A., Madnick, S., & Grosof, B. (2007). Contextual alignment of ontologies in the eCOIN semantic interoperability framework. Information Technology and Management, 8 (1), 47-63.

Firat, A., Madnick, S., & Siegel, M. (2000). The Caméléon Web Wrapper Engine. (pp. 14-15).

Firestone, J. (1999). DKMS Brief No. Nine: Enterprise Integration, Data Federation, and DKMS: A Commentary. Executive Information Systems.

Frank, L. (2008). Databases and Applications with Relaxed ACID Properties. Doctoral dissertation, Copenhagen Business School, Denmark.

Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human Computer Studies, 43, 907-928.

Guarino, N. (1997). Understanding, building and using ontologies. International Journal of Human Computer Studies, 46, 293-310.

Guarino, N. (1998). Formal Ontology and Information Systems. In N. Guarino, Formal Ontology in Information Systems (pp. 3-15). Amsterdam: IOS Press.

Guarino, N., & Giaretta, P. (1995). Ontologies and knowledge bases: Towards a terminological clarification. Towards very large knowledge bases, 25-32.

Haas, L., Schwartz, P., & Kodali, P. (2001). DiscoveryLink: a system for integrated access to life sciences discovery. IBM Systems Journal, 40, 489-511.

Halevy, A., Rajaraman, A., & Ordille, J. (2006). Data integration: the teenage years. Proceedings of the 32nd international conference on Very large data bases, (pp. 9-16).

Hammer, J., & McLeod, D. (1993). An approach to resolving semantic heterogeneity in a federation of autonomous, heterogeneous database systems. International Journal on Cooperative Information Systems, 2, 51-83.

Hammer, M., & McLeod, D. (1980). On database management system architecture. Infotech State of the Art Report: Data Design, 177-202.

Heimbigner, D., & McLeod, D. (1985). A federated architecture for information management. ACM Transactions on Information Systems (TOIS), 3, 253-278.

Hull, R. (1997). Managing semantic heterogeneity in databases: a theoretical prospective. Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems (pp. 51-61). ACM New

Hull, R., & Zhou, G. (1996). A framework for supporting data integration using the materialized and virtual approaches. ACM SIGMOD Record, 25, 481-492.

Inmon, B. (2004, March 1). The Virtual Data Warehouse – Transparent and Superficial. Information Management Magazine.

Inmon, W. H. (2005). Building the data warehouse. New York, NY, USA: John Wiley & Sons, Inc.

Inmon, W. H. (1995). What is a data warehouse? Prism Tech Topic, 1 (1).

Inmon, W., Imhoff, C., & Battas, G. (1995). Building the operational data store. New York, NY, USA: John Wiley & Sons, Inc.

Jukic, N. (2006). Modelling Strategies and Alternatives for Data Warehousing Projects. Communications of the ACM, 49 (4), 83-88.

Kamal, J., Borlawsky, T., Dhaval, R., & Payne, P. R. (2007). Development of an Ontology-Anchored Data Warehouse Meta-Model. AMIA... Annual Symposium proceedings, (p. 1001).

Kerschberg, L. (2001). Knowledge management in heterogeneous data warehouse environments. Lecture Notes in Computer Science, 1-10.

Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The Data Warehouse Lifecycle Toolkit: Practical Techniques for Building Data Warehouse and Business Intelligence Systems (2nd ed.). New York, NY, USA: John Wiley & Sons.

Lenzerini, M. (2002). Data integration: A theoretical perspective. Symposium on Principles of Database Systems (PODS), (pp. 233-246).

Nimmagadda, S. L., Dreher, H., & Rudra, A. (2005). Ontology of Western Australia Petroleum Data for Effective Warehouse Design and Data Mining. IEEE International Conference on Industrial Informatics (INDIN). Perth, Australia.

Pan, Z., & Heflin, J. (2003). DLDB: Extending relational databases to support semantic web queries. Workshop on Practical and Scalable Semantic Systems, ISWC2003.

Phipps, C., & Davis, K. (2002). Automating data warehouse conceptual schema design and evaluation. Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW'2002), (pp. 23-32).

Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.

Romero, O., & Abelló, A. (2007). Automating multidimensional design from ontologies. Proceedings of the ACM tenth international workshop on Data warehousing and OLAP, (pp. 1-8).

Rousset, M. C., & Reynaud, C. (2004). Knowledge representation for information integration. Information Systems, 29, 3-22.

Rubin, S., Smith, M., & Trajkovic, L. (1999). Randomizing the knowledge acquisition bottleneck. IEEE International

Samos, J., Saltor, F., Sistac, J., & Bardes, A. (1998). Database architecture for data warehousing: an evolutionary approach. Lecture Notes in Computer Science, 746-756.

Sen, A., & Sinha, A. P. (2005). A comparison of data warehousing methodologies. Communications of the ACM, 48 (3), 79.

Shanks, G., Tansley, E., & Weber, R. (2003). Using ontology to validate conceptual models. Communications of the ACM, 46 (10).

Sheth, A. P., & Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys (CSUR), 22, 183-236.

Skoutas, D., & Simitsis, A. (2007a). Flexible and Customizable NL Representation of Requirements for ETL processes. Lecture Notes in Computer Science, 4592, 433.

Skoutas, D., & Simitsis, A. (2007b). Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data. International Journal on Semantic Web & Information Systems, 3, 1-24.

Trujillo, J., & Lujan-Mora, S. (2003). A UML based approach for modeling ETL processes in data warehouses. Lecture Notes in Computer Science, 307-320.

Uschold, M., & Gruninger, M. (1996). Ontologies: Principles, Methods and Applications. Knowledge Engineering Review, 11 (2).

Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., et al. (2001). Ontology-based Integration of Information - A Survey of Existing Approaches. IJCAI-01 Workshop: Ontologies and Information Sharing, (pp. 108-117). Seattle WA.

Wagner, C. (2007). Breaking the knowledge acquisition bottleneck through conversational knowledge management. Innovative Technologies for Information Resources Management, 200.

Widom, J. (1995). Research problems in data warehousing. Proceedings of the fourth international conference on Information and knowledge management, (pp. 25-30).

Xie, G., Yang, Y., Liu, S., Qiu, Z., Pan, Y., & Zhou, X. (2007). EIAW: Towards a Business-friendly Data Warehouse Using Semantic Web Technologies. Lecture Notes in Computer Science, 4825, 857.

Xuan, D. N., Bellatreche, L., & Pierra, G. (2006). Ontology Evolution and Source Autonomy in Ontology-based Data Warehouses. Revue des Nouvelles Technologies de l'Information (EDA 2006), 55-76.

Ziegler, P., & Dittrich, K. (2004). User-Specific Semantic Integration of Heterogeneous Data: The SIRUP Approach.

7 APPENDIX A: DATA MODELS OF DATA WAREHOUSES

In the literature, DWHs are usually modelled with a dimensional data structure. A dimensionally modelled schema consists of a fact table and dimension tables. Records in the fact table are called facts. A popular example of such a fact is a transaction. For each fact, measures are often stored, such as costs, price, or quantity. Measures play an important role in analysis. Furthermore, foreign keys are stored for each fact, referring to data in the dimension tables. When every dimension consists of one dimension table, the schema is called a star schema. In contrast, snowflake schemas have dimensions which are normalized and hence consist of multiple tables. Every dimension table is connected with one-to-many relationships in the direction of the fact table. Figure 8 illustrates examples of the two schema types.

Figure 8. A star (left) and a snowflake (right) schema (from Chaudhuri and Dayal, 1997)
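As a concrete illustration of the star schema described above, the following sketch builds a small fact table with two dimension tables in an in-memory SQLite database and aggregates a measure by a dimension attribute. All table names, column names and data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension (star schema).
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    year INTEGER
);

-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    price REAL
);

INSERT INTO dim_product VALUES
    (1, 'pen', 'office'), (2, 'ink', 'office'), (3, 'mug', 'kitchen');
INSERT INTO dim_date VALUES (10, 2008), (11, 2009);
INSERT INTO fact_sales VALUES
    (1, 10, 3, 2.0), (2, 10, 1, 5.0), (3, 11, 2, 4.0);
""")

# Typical analysis query: aggregate a measure per dimension attribute.
totals = conn.execute("""
SELECT p.category, SUM(f.quantity * f.price)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.category
ORDER BY p.category
""").fetchall()
```

A snowflake variant of this sketch would move `category` out of `dim_product` into its own table, referenced from `dim_product` by a foreign key, at the cost of an extra join in the query.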

Each star or snowflake is associated with a data mart. In Kimball’s approach to data warehousing, the data warehouse consists of these data marts (Kimball, Ross, Thornthwaite, Mundy, & Becker, 2008). However, there is an important alternative for modelling the data warehouse, championed by Inmon (2005). He uses the entity relationship model (ER model) for the central data warehouse. His data warehouse is not directly used for analysis; instead, additional data marts are designed for that purpose. These data marts are modelled dimensionally, and consist of one star or snowflake schema. These data marts are designed for a specific department or user group. Both approaches are illustrated in Figure 9. There has been a debate on these two approaches to data warehousing. Analogous to the debate on the virtual versus the materialized approach, the decision should be based on which approach better fits, rather than which methodology is better (Jukic, 2006).