Terry
A Tool for Assessing Database Performance with
Existing Workloads
Jaap Koetsier
mail@jkoetsier.nl
August 22, 2018, 106 pages
Supervisor:
Prof. dr. S. Manegold
Host organisation:
ShareCompany B.V. (http://www.sharecompany.nl), supervised by M. Bracher
Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
Abstract
This thesis introduces Terry, a tool for assessing query performance in terms of execution times on
a range of database systems. Terry is able to transform and run an existing SQL workload on any
database system. Terry is useful for testing query performance on different database types, so that one
can make an informed decision about which database type to use. The basis of this work originates from
a real-world scenario, where the question arose whether the current database system was the best
option for handling the current workload, if it was better to switch to another database system, or
even use multiple database systems in parallel. To test this we investigated whether it was possible
to classify queries in such a way that classes comprised queries with similar query performance on
a certain database system. A useful query classification turned out to be too dependent on a solid
database system configuration, for which one needs to have a thorough understanding of the databases
included in the classification.
During our research we could not reproduce, in our experiments, the problems that existed in our
real-world scenario. Query execution times in our experiments did not reflect the latencies experienced
in the production environment. However, the provided workload and data were not a perfect mirror of
the production environment. We developed the tools that make the experiments we performed easily
repeatable in the future, on an environment that better reflects the production environment.
To further assess the usefulness of Terry, we went on to use it to assess and optimise the query
performance of the TPC-H query set.
Contents

Abstract
1 Introduction
2 Background and context
  2.1 Database Systems
    2.1.1 Relational SQL-Databases
    2.1.2 Graph Databases
    2.1.3 Key-value Stores
    2.1.4 Document Stores
  2.2 Existing Benchmarks
    2.2.1 TPC Benchmarks
    2.2.2 YCSB
  2.3 Relational Algebra
3 Terry
  3.1 Overview
  3.2 Usage
    3.2.1 Command line arguments
    3.2.2 Input files
    3.2.3 Output files
  3.3 Design
    3.3.1 Input
    3.3.2 Internal Representation
    3.3.3 Database Connectors
    3.3.4 BenchRunner
    3.3.5 Docker
    3.3.6 Config
  3.4 Terry Tools
  3.5 Conclusion
4 Workload Analysis
  4.1 ShareCompany workload
    4.1.1 Data, workload and environment
    4.1.2 Experiments
  4.2 TPC-H workload
    4.2.1 Data, workload and environment
    4.2.2 Experiments
  4.3 Conclusions
5 Analysis and conclusions
Appendices
  A.1 Configuration
  A.2 Application Entry
  A.3 Extending Terry with a new Database System
    A.3.1 Extending the JdbcDatabaseConnector
    A.3.2 Limitations
B Terry Tools
  B.1 Graph Plot Tool
  B.2 TSQL scripts
  B.3 CSV scripts
  B.4 Other scripts
C Terry Output
  C.1 Response and result times (responsetimes.csv and resulttimes.csv)
  C.2 Query statistics (querystats.csv)
  C.3 System statistics (systemstats.csv)
  C.4 Queries (queries.txt)
D ShareCompany workload and schema
  D.1 Workload
  D.2 Schema
Chapter 1
Introduction
In greenfield projects, we as software engineers often have to decide on one or more database systems
to use. The large number of database systems available can make this a difficult choice. Because of a
lack of in-depth knowledge of database systems and the difficulty of the choice, we often go for
database systems we, or the organisation, are familiar with, or we may choose a fashionable new
technology we have heard good things about. Of course, this dilemma presents itself on many occasions
during the software process, not only when choosing a database system. While choosing for familiarity
has its advantages to the organisation, choosing a fancy new technology asks for good argumentation.
The difficulty with database systems is that the design choices made during the development of the
database systems are not always visible to the user. Database systems have properties such as ACID
compliance, security, scalability, durability, etc. These properties are generally well documented and
aid us in choosing a database system. Database system performance, however, is often harder to
predict. The trade-offs made by the database system designers within the database software are opaque
to the user. There are as many ways of optimising and executing queries as there are database systems.
These differences inevitably involve internal trade-offs, favouring one type of workload over
another. This may mean that simple queries pay a time penalty in query optimisation on a certain
database system, in order to win time on more complex queries.
When choosing a database system these differences are not always apparent. Standard benchmarks
are available, but they hardly ever mirror our exact workload, are run on different hardware, are not
always completely unbiased or are not configured optimally for the given workload - the latter being
a challenge in itself. Even with all the information and benchmarks available, choosing a database
system is not a straightforward task.
As time passes and a software project develops, the database grows and the variety of queries we
execute against it increases. We may then find that average query execution times go up and the
database gets blamed for the latency in our application. When this happens, we are eager to find
solutions that bring the performance back to an acceptable level. Perhaps upgrading the hardware will
provide an adequate solution. Maybe it is time to switch to another type of database system, or the
performance on different groups of queries may be so diverse that it is worth considering using
multiple database systems alongside each other.
For example, when the software has developed in such a way that a large part of the queries executed
are mainly key lookups, while another subset of the queries comprises highly analytical queries, it
may be more efficient resource-wise to use different database systems for these queries. Such insights
may drive future architectural decisions, modularising the software in such a way that responsibilities
are in line with the database systems used in that part of the software.
In a perfect scenario we can spread the data of one database over multiple databases without having
to duplicate any data. This situation would be a good indicator of an opportunity to
modularise our architecture. In practice, however, we will not always find this ideal situation. In
the case that the use of multiple database systems in parallel is justified, but we are unable to do this
without duplicating data, we would have to synchronise data between different systems, which can be
a challenge in itself.
The research in this thesis is inspired by the situation at ShareCompany, an Amersfoort-based company
that provides stock market data to their customers. All data that is provided to their customers
flows through a central SQL Server database. The data within this database is updated mainly in regular
daily batches, in which the data is fetched from data providers. ShareCompany's customers access
APIs in front of a thin application layer that queries the database on behalf of the API user. The
execution times of queries on the database have a direct effect on the response times of the external
APIs.
In the current situation, some queries take several seconds to return. This not only directly affects the
usability of the external APIs, it also keeps ShareCompany from expanding their business. The question
within ShareCompany is whether their current database system suits their needs for expanding
their business, or if it may be better to migrate to a different database system, or even use different
database systems for different types of queries in parallel.
Using different database systems in parallel only pays off when queries that perform poorly on one
database system perform significantly better on a different database system. For a small improvement
in the execution time of a query, the administrative overhead of possible data synchronisation and
administration within the application layer may not pay off. In other words, we would like to know if
there are groups of queries that perform noticeably different on different database systems. If those
groups exist and they cover all queries with, to ShareCompany, reasonable execution times, we may
have found a combination of database systems that will increase overall performance. This brings us
to our first research question:
RQ 1: How can we classify application level queries in such a way that we observe classes with
similar execution time on a particular database?
RQ 1a: How can we find those classes?
Perhaps we find while answering this RQ that one database system performs acceptably on all
queries, which would mean that migrating to a different database system would solve the problem
altogether.
To answer RQ 1, we will need to experiment with the current workload and database of ShareCompany
on their current SQL Server database and other database systems, perhaps database systems other
than relational SQL database systems. We want to make our experiments repeatable and easy to
replicate on different database systems that may use a different SQL dialect than the current system,
or even a different query language altogether.
To run our experiments we want to be able to take the current workload, data and schema from
ShareCompany and run this workload on different database systems. This is captured in RQ 2:
RQ 2: How can we construct a tool to automate our tests?
If we find a query classification in RQ 1 we will try to generalise it, so that we can use it to
predict the best suitable database system for future queries.
RQ 3: How can we generalise the query classification?
To answer RQ 1 and RQ 3, we will use the tool constructed based on RQ 2. This tool will help us
in automating the execution and timing of queries on a range of database systems and allows us to
perform repeated experiments. Based on the output of this tool we will try to come up with a query
classification, answering RQ 1 and RQ 3.
This thesis is further divided into chapters. In Chapter 2 we present some background on the
database landscape. We discuss some popular types of database systems that we considered including
in our experiments. We will also discuss some popular database benchmarks and end with a short
introduction to relational algebra, which is necessary as a foundation later in this thesis.
In Chapter 3 we discuss Terry, the tool we have developed to help us perform our experiments, to
answer RQ 2. Chapter 4 discusses the experiments we performed on the ShareCompany situation
and how we used Terry to improve the performance of the TPC-H query set on a MySQL database,
answering RQ 1 and RQ 3. Chapter 5 ends with our analysis and conclusions.
Chapter 2
Background and context
In this chapter we will discuss some popular types of contemporary database systems. We will then
discuss some existing database benchmarks. We will conclude this chapter with an introduction
to relational algebra, which will act as a foundation for understanding the internal model of Terry in
Chapter 3.
2.1 Database Systems
In this section, we will discuss some popular types of database systems, and how they fit into our
experiments.
2.1.1 Relational SQL-Databases
Relational databases are the traditional SQL-databases. These are visualised as tables with columns
and possible relations between columns. Relational databases store tuples (rows in a table) that are
queried using SQL. Traditionally these tuples are stored together on a per-row basis, the so-called
row-stores. Column-stores, on the other hand, store one or more columns together and are often found
in OLAP-focused database systems. OLAP (Online Analytical Processing) workloads differ from OLTP
(Online Transaction Processing) workloads in that OLAP workloads generally consist of queries that
perform calculations on a large amount of data, whereas OLTP workloads consist of relatively simple
lookup queries. The schemas of OLAP databases often contain wide tables, generally much wider than
those found in an OLTP database. Column-stores are therefore the preferred choice for OLAP
workloads, as large amounts of data have to be analysed. A traditional row-store database will have
to fetch all rows to analyse them, inducing a high cost in data transfer. Column-stores can fetch only
the partitions of the table that are required to execute a query [BKM08, MKB09].
This generalisation is confirmed by Holloway et al. and Harizopoulos et al. Holloway et al. [HD08]
found that narrow row-stores outperform column-stores, but that column-stores outperform row-stores
in case the workload is I/O bound, because less bandwidth is needed to transfer a few columns instead
of the complete row. Holloway et al. also found that adding predicates to queries negatively impacts
column-stores more than row-stores. Harizopoulos et al. [HLAM06] came to similar conclusions
two years earlier, but they also added that in CPU-bound workloads, where I/O is less dominant,
column-stores are not necessarily faster than row-stores.
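The bandwidth argument can be made concrete with a toy example: to scan a single column, a row-store must touch every attribute of every row, while a column-store touches only the values of that column. The sketch below counts touched values under both layouts; it illustrates the principle only, not any real storage engine:

```python
# Ten rows with five attributes each.
rows = [[i, f"name{i}", i * 2, i * 3, i * 4] for i in range(10)]

# Row-store layout: to scan column 2 we still read every stored value of
# every row, because the attributes of a row are stored together.
row_store_values_read = sum(len(row) for row in rows)  # 10 rows * 5 attributes

# Column-store layout: the same scan reads only the values of column 2.
columns = list(zip(*rows))
column_store_values_read = len(columns[2])  # 10 values

print(row_store_values_read, column_store_values_read)  # → 50 10
```

In an I/O-bound workload this difference in touched data translates directly into the bandwidth savings described above.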
HTAP (Hybrid Transactional/Analytical Processing) databases offer a hybrid solution for handling
both OLTP and OLAP workloads on a single database system, without the need to duplicate data from
an OLTP database to an OLAP database for analysis. This removes the naturally occurring data
duplication latency between the two database systems and provides the end-user with a system that
supports real-time data analysis. The underlying storage model of HTAP database systems is often
a derivative of the Decomposition Storage Model as found in OLAP databases, such as the Flexible
Storage Model, which combines properties of row-stores and column-stores [APM16]. However, because
HTAP databases are hybrid by definition,
storage models differ between HTAP databases and trade-offs will be made favouring one workload
over the other.
Finally there is the term NewSQL, which denotes a new generation of SQL stores that are horizontally
scalable while keeping ACID properties and fault tolerance [GHTC13]. Pavlo et al. ask the question
What's Really New with NewSQL? in their paper on NewSQL [PA16] and conclude that NewSQL
is not a completely new type of database system, but a system in which existing techniques are
combined to achieve a horizontally scalable relational database. NewSQL is a buzzword for a type
of database that was inevitably going to arrive. Relational databases used to be difficult to scale
horizontally without an extra middleware layer. NoSQL stores such as document-stores tried to fill
the gap that opened up when the internet brought us large amounts of data. NewSQL data stores
combine this scalability with the traditional SQL data store.

The current ShareCompany database is a traditional single-node row-store, so it only makes sense to
include relational databases in our experiments.
2.1.2 Graph Databases
Graph databases are based on graph theory. The focus in graph databases is on the relations between
nodes. Data can be stored both as labels on edges (the relations) and on nodes. Graph databases are
efficient in traversing relationships, but are not that efficient in simple data lookups. As Grolinger et al.
[GHTC13] put it: [Graph databases] are suitable in scenarios such as social networking applications,
pattern recognition, dependency analysis, recommendation systems and solving path finding problems
in navigation systems.

This is why we exclude graph databases from our experiments. It would be interesting to see to what
extent a graph database such as Neo4J handles a regular SQL workload such as the one from
ShareCompany, but we do not expect it to outperform an SQL database on the ShareCompany workload.
2.1.3 Key-value Stores
Key-value stores are, interface-wise, the most basic database systems around. A key-value store allows
the user to store, update and delete data by key. It provides no means of querying or filtering
on the values; the database is unaware of the structure of the underlying value. Because of this
simplicity, key-value stores are very efficient at doing just that: fetching values by key. This is why
key-value stores such as Memcached or Redis are often used as caching mechanisms.

There is no way to capture an ordinary SQL workload or schema in a key-value store, because it
would mean we could only query by key, without filtering on other columns. This is why we will not
include key-value stores in our experiments.
2.1.4 Document Stores
Document stores are essentially key-value stores, but with one major difference. In contrast with
key-value stores, document stores allow the user to query the values as well. Document stores are
generally schema-less, just as key-value stores, but the values are often structured in a format such
as JSON (JavaScript Object Notation). The database engine is able to analyse the structured value
and filter on attributes of the value. It is often able to generate indices on these attributes as well.
Some document-store databases such as MongoDB are now also able to join documents from different
collections. In MongoDB, collections are the equivalent of tables, and documents are the equivalent
of table rows. This means that it is possible to query related objects, instead of one object at a time.

Querying relations in MongoDB is still very limited compared to SQL, but we will investigate whether
it is possible at all to capture part of the ShareCompany workload in a document store.
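Joining collections in MongoDB is expressed with the $lookup stage of the aggregation pipeline. As a sketch (the collection and field names below are invented for illustration, not taken from the ShareCompany schema):

```python
# Build an aggregation pipeline that joins each 'orders' document with the
# matching 'customers' document, roughly like a SQL LEFT OUTER JOIN.
# Collection and field names here are hypothetical.
pipeline = [
    {"$lookup": {
        "from": "customers",         # the collection to join with
        "localField": "customerId",  # field in the 'orders' documents
        "foreignField": "_id",       # field in the 'customers' documents
        "as": "customer",            # name of the output array field
    }},
    {"$unwind": "$customer"},        # flatten the one-element result array
]

# With a running MongoDB instance this would be executed as:
#   db.orders.aggregate(pipeline)
```

Note that such a join is resolved per document and lacks most of the optimisation machinery of a relational query planner, which is part of why we call MongoDB's relational querying limited.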
2.2 Existing Benchmarks
In this section we will discuss some existing benchmarks.
2.2.1 TPC Benchmarks
The Transaction Processing Performance Council, or TPC [tpca], is a non-profit organisation founded
to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC
performance data to the industry. The TPC currently lists 14 active benchmarks on their website, for
different industry applications. There are 6 application categories: Transaction Processing - OLTP,
Decision Support, Virtualization, Big Data, IoT and Common Specifications. The benchmarks are
offered in different SQL dialects and include scalable synthetic data. Examples of TPC benchmarks
are TPC-C, an OLTP workload, TPC-H (decision support), TPCx-BB for Big Data and TPCx-IoT
for IoT applications. TPC benchmarks try to mimic a real-world scenario from industry and their
benchmarks are well-known in the database community. Results from TPC benchmarks on different
kinds of hardware are posted on the TPC website, along with the benchmark-specific performance
metric (such as the number of queries per hour or transactions per hour), the scale factor (the size
of the data set used) and the Price/Performance ratio. Through this, TPC achieves a high level
of standardisation of database benchmarks. However, as the TPC highlights in their benchmark
documentation, these benchmarks are no substitute for benchmarks with a real-world workload.
2.2.2 YCSB
The Yahoo Cloud Serving Benchmark [CST+10] is a benchmark aimed at databases operating in a
cloud environment. Key points in this benchmark are latency vs. throughput and scaling/elasticity.
YCSB contains a core set of workloads that can be used to assess the performance of database systems
along these lines. The YCSB benchmark uses artificial data and is not aimed at a specific workload;
it merely tries to give a global insight into how databases perform along the axes of latency, throughput
and scalability with preset workloads: update heavy, read heavy, read only, read latest and short
ranges. YCSB provides a number of parameters that can be used to change the distribution of the
data, for example. YCSB is mainly used to assess NoSQL datastore performance, but it supports SQL
databases as well. The downside of YCSB in light of our experiments is that it is hard to provide
your own queries. YCSB does provide an option to generate your own workload, but that either comes
down to writing a parameter file with workload properties, which YCSB then uses to generate queries,
or writing Java classes that capture the workload. The first is not an option, as we have a specific
workload and data set of which we want to assess the performance. The latter would be a possibility,
but it is cumbersome to implement and hard to change when our workload or data is updated.
2.3 Relational Algebra
The tool we are going to develop should be able to take an SQL workload in any dialect and output
this workload in any other SQL dialect, and possibly in other query languages. To do this we will need
a generic internal representation that can represent the queries that make up the original workload.
SQL is based on relational algebra, and all relational databases use some kind of relational algebra as
their internal query representation. Relational algebra is based on the 1970 paper by Codd [Cod70]
and comprises the following basic operations: projection (π), selection (σ) and Cartesian product (×),
plus the set union (∪) and difference (\) operators. We expect the reader to be familiar with
the set operations, but we will explain projection and selection with an example. Here is an example
of a simple SQL query:

SELECT charCode FROM AsciiCharacters WHERE intCode = 65

The equivalent in relational algebra could be:

π_charCode(σ_intCode=65(AsciiCharacters))

From right to left: we select all tuples from AsciiCharacters where intCode = 65, and we project
the charCode column from the resulting tuples. A selection selects rows, whereas a projection projects
columns of the input relation.
This is a very basic example, but it gives an idea of how a query may be represented as a tree
structure within the database software. With these basic operators we could in theory express all
joins. However, in practice this algebra is often extended with easier-to-understand operators for
joins. Also, actions such as grouping, aggregation and ordering are not expressible in pure relational
algebra. These operations are often added as new nodes (operations) in the tree, or added as metadata
to other nodes.
The formalisation of relational algebra makes it possible to reason about queries and formalise query
optimisations in database systems.
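The tree-structure idea can be made tangible with a minimal sketch: operator nodes for selection and projection, evaluated over relations modelled as lists of dicts. This is an illustration of the concept only, not Terry's actual internal representation:

```python
# Minimal relational-algebra query tree: each node exposes rows().

class Relation:
    """A base relation: a named bag of tuples (modelled as dicts)."""
    def __init__(self, tuples):
        self.tuples = tuples

    def rows(self):
        return list(self.tuples)

class Selection:
    """σ_predicate(child): keep only the rows satisfying the predicate."""
    def __init__(self, predicate, child):
        self.predicate = predicate  # function: row -> bool
        self.child = child

    def rows(self):
        return [row for row in self.child.rows() if self.predicate(row)]

class Projection:
    """π_columns(child): keep only the named columns of each row."""
    def __init__(self, columns, child):
        self.columns = columns
        self.child = child

    def rows(self):
        return [{c: row[c] for c in self.columns} for row in self.child.rows()]

# π_charCode(σ_intCode=65(AsciiCharacters))
ascii_characters = Relation([{"intCode": 65, "charCode": "A"},
                             {"intCode": 66, "charCode": "B"}])
query = Projection(["charCode"],
                   Selection(lambda r: r["intCode"] == 65, ascii_characters))
print(query.rows())  # → [{'charCode': 'A'}]
```

Evaluating the tree bottom-up mirrors the right-to-left reading of the algebraic expression, which is exactly how a database engine walks its internal query plan.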
Chapter 3
Terry
To answer RQ 2, we developed a tool we called Terry, to assist us in performing our experiments.
Terry needs to take any existing SQL workload and database schema as input. From this input
Terry must be able to generate a workload and schema in different SQL dialects and query languages
other than SQL. Terry should be easily extensible with new types of database systems, so that the
experiments can be extended to, and repeated on, new database types in the future. Terry should be able to set
up a new database from scratch with the provided workload and schema along with the data, but it
must be able to run experiments on an existing database as well. This means that we should be able
to provide Terry the authentication credentials and address of an existing database.
Terry should output the query response times per query and system statistics such as CPU, memory
and I/O usage.
3.1 Overview
Terry takes an existing workload and database schema in SQL format and executes the queries in the
workload against the database of choice. Optionally, Terry takes the CSV data export of the existing
database to import in the database we are running our experiments on. The only precondition in this
case for Terry to be able to run the experiments is a running database server with an empty database.
With the provided workload, schema and data Terry has everything to initialise our experimental
database and run the experiments.
Before running the experiments, Terry extracts subqueries to run these separately, providing the user
with more fine-grained information about query performance. Optionally, Terry can generate indices
on the columns used for filtering, ordering and grouping, to save the user from manually creating
these indices.
Terry times the queries, monitors the system's resources and writes the query timings, query statistics
and system statistics to a number of CSV files. A Python script (see Appendix B) visualises this
data with graphs.
We have added Docker support to make experiments portable and repeatable.
The development of Terry was an iterative process. Results of experiments called for new features,
improving either the process or the results. Examples of these are the automatic creation of indices
and the extraction of subqueries (see Section 3.3.4).
In this chapter we will describe Terry and the design decisions made during its development in
Section 3.3, after we have explained the usage of Terry in Section 3.2. We will decompose Terry
into modules and handle each module separately. In Appendix A, Terry is explained at a lower level,
for future development.
3.2 Usage
Terry is developed in Java Spring and is run as a typical Java command line application:

java -jar terry.jar [command line arguments]

Here terry.jar is the JAR file resulting from the build. See README.md in the Terry repository
for the build command and the resulting JAR file. From here on, terry.jar will be used to refer
to the compiled Java code. Replace this as necessary.
3.2.1 Command line arguments
Terry has three mandatory command line arguments:
--workload=[workloadfile.sql|workload.yml] --schema=[schemafile.sql]
--dbtype=(mssql|mysql|monetdb|postgres|..)
Here we define the mandatory workload and schema in SQL format and the database type to run
the experiments on. The format of these files is further explained in Section 3.2.2.
One other commonly used, but not mandatory argument is:
--data-directory=[data_dir]
If this argument is provided, the application looks for DSV (delimiter-separated values) files within
this directory. It matches the file names with table names and imports the data into the corresponding
tables. By default, Terry looks for files with the *.csv extension and has default values for the field
separators, line delimiters and quotation marks, but this is configurable; see Section 3.2.2.
Besides these arguments the configuration for the database to run experiments on (argument
dbtype) needs to be set. The currently implemented database types each have their own default
configuration in the resources directory (named db.[dbtype].properties), but these configuration
settings can be overridden on the command line:
--username=[database username] --password=[database password]
--database=[database name] --host=[database host] --port=[database port]
In the resources directory we also find the application.properties file with the default application
configuration. These settings can all be overridden on the command line as well, a helpful feature of
Java Spring. We will explain these settings here:
--output_directory=[output_dir]
By default Terry will write the results to the /tmp/terry_out directory. This setting can be used
to write to a different location. See Section 3.2.3 for more information about the generated output.
--no_runs=[m] --skip_first=[n]
The default settings for these are m = 1 and n = 0. Terry will perform m + n runs sequentially,
omitting the first n runs from the results. For the last m runs Terry will capture the output as discussed
in Section 3.2.3. Skipping the first n runs can be used to warm up the database. Databases may
need a couple of runs to cache results, create indices or perform other optimisations so that subsequent
queries perform better. Terry does not support running multiple cold runs, as this is less interesting
for our experiments. To execute multiple cold runs, it is up to the user to 'cool' the system after each
run.
--skip_result_validation=(true|false) --skip_create_schema=(true|false) --create_indices=(true|false)
These three arguments control Terry's behaviour. The argument skip_result_validation can
be used to skip the query result validation, see Section 3.2.2. skip_create_schema can be used when
repeating experiments on an earlier set-up database, to avoid having to re-initialise the database again.
create_indices (false by default) creates indices on all fields that are used in the query filters
(the 'where' clauses, but also the 'on' expressions of joins) or in the 'group by' or 'order by' clauses
of the queries.
--rename_imported_csv_files=(true|false)
The default value of this argument is 'false'. This argument comes in handy during (further)
development of Terry, when problems arise while importing CSV files. Files that are correctly imported
will be renamed with the postfix .done, so that Terry ignores these files on subsequent runs.
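Putting the arguments together, a full invocation could look like the following. The file names, credentials and host are examples chosen for illustration, not defaults shipped with Terry:

```shell
java -jar terry.jar \
  --workload=workload.sql \
  --schema=schema.sql \
  --dbtype=postgres \
  --data-directory=./data \
  --username=terry --password=secret \
  --database=experiments --host=localhost --port=5432 \
  --no_runs=5 --skip_first=2 \
  --create_indices=true
```

This would warm the database up with two unrecorded runs, then record five runs with automatically generated indices.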
3.2.2 Input files
This section describes the format of the input files.
Schema
The schema should be an *.sql file that contains a series of 'create table' SQL statements, separated
by semicolons, as follows:

CREATE TABLE asciivalues (
    intVal int NOT NULL,
    charVal char NOT NULL
);

CREATE TABLE users (
    id int NOT NULL,
    firstname varchar(100),
    lastname varchar(100) NOT NULL,
    PRIMARY KEY (id)
);
Workload
The workload file can be formatted in two ways. The first is similar to the schema definition: an
*.sql file with a list of 'select' statements, separated by semicolons. Terry will assign each query an
integer value as identifier.

It is also possible to supply the workload in a YAML file with a *.yml extension, with the following
structure:

workload:
  - query: "SELECT intVal, charVal FROM asciivalues WHERE charVal = 'A'"
    identifier: queryAsciiByCharA
    results:
      - 65, A
  - query: "SELECT intVal, charVal FROM asciivalues WHERE intVal > 65 AND intVal < 69"
    identifier: queryAsciiByIntRange
    results:
      - 66, B
      - 67, C
      - 68, D
When the workload is provided with the expected results, Terry will validate the query results of
the experiments against the results provided in the YAML file, and it outputs a warning message if
the results do not match the expected results, unless skip_result_validation is set (see Section 3.2.1).
2YAML Ain’t Markup Language, a structured file format that separates lists, objects and attributes using newlines, identation and hyphens. Much as JSON, but using whitespace instead of brackets.
DSV Data
When the data_directory command line argument is provided, Terry reads and imports the data as
DSV files into the database. By default, Terry scans for *.csv files, but by setting the dsv_format
parameter to tbl, Terry scans for *.tbl files and treats them as TPC data files.

*.csv files are expected to have one header line, double quotes as quotation marks, commas as
field separators, NULL (unquoted) as null value and a newline character as record terminator.

*.tbl files should have no header lines, use no quotation marks, use pipe symbols as field separators,
empty strings as null values and a newline character as record terminator. This is the way
TPC provides the data with their benchmarks.

A custom DSV format can be provided as well, with the following arguments:

--dsv-extension= --dsv-field-separator= --dsv-quotes=
--dsv-header-lines= --dsv-record-terminator= --dsv-null-value=

These parameters override the defaults for the selected dsv_format, or csv if no format is explicitly
selected. Keep in mind that not all database systems support having more than one header line.
Trying to run experiments on PostgreSQL, for example, with dsv_header_lines > 1 will throw an
UnsupportedConfigurationException.
File Naming
When Terry reads the DSV files, everything before the first dot in the file name will be matched
with the table name. For example, customers.csv will be imported into customers, but so will
customers.002.csv.
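This naming rule can be sketched as follows (a minimal Python illustration, not Terry's actual Java code):

```python
# Everything before the first dot in the file name determines the
# destination table; this allows data to be split over multiple files.
def table_name_for(filename):
    return filename.split(".", 1)[0]

print(table_name_for("customers.csv"))      # customers
print(table_name_for("customers.002.csv"))  # customers
```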
3.2.3 Output files
Terry creates output files in the output directory provided as an argument, or in /tmp/terry_out by
default. The output files for a particular run are prefixed with the current timestamp, an underscore
and the database type to distinguish between runs, for example 20180701120503_mysql_systemstats.csv.
We chose to output the raw data in CSV format here for two reasons. The first is that there are better
tools and languages for analysing data and drawing graphs than Java. We decided to do further
calculations on the raw data and graph drawing with Python (see Section 3.4). The second reason
is that by splitting the data generation from the representation we were able to refine our analysis
and graph drawing without having to run Terry every time. Appendix C contains examples of the
generated files.
The following files will be generated:
responsetimes.csv
contains the response times of all queries in nanoseconds per run. If no_runs
is set to 5, there will be five columns with response times per query.
resulttimes.csv
contains the result times of all queries in nanoseconds. The layout is the same as
for responsetimes.csv. While the response time is the time measured between sending the request
to the database and the time of return of the database cursor, the result time also includes the time
needed to fetch the data from the database. The result time of a query execution is by definition
larger than the response time.
querystats.csv
contains the statistics per query. Per query we see the average response and result
times, the number of rows returned, the result width (the number of columns returned), the number
of tables that are touched in this query, the total width of all tables that are touched by the query, the
weighted table widths ($\sum_{k=0}^{len(touched\_tables)} table\_width(k)$), plus the statistics per table in the query:
how often a table is touched (the touched_cnt column), the number of columns per table and the
table sizes.
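The weighted table widths statistic can be illustrated with a small Python sketch (function and table names are hypothetical, not part of Terry):

```python
# Sum of the widths of all touched tables; a table that is touched
# twice (e.g. in a self-join) contributes its width twice.
def weighted_table_widths(touched_tables, table_width):
    return sum(table_width[t] for t in touched_tables)

widths = {"persons": 5, "orders": 8}
# A query touching persons twice and orders once:
print(weighted_table_widths(["persons", "persons", "orders"], widths))  # 18
```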
systemstats.csv
contains the system statistics during the run. Approximately every 500 ms the
system statistics are captured and ultimately written to this file. This file shows the CPU usage
(ticks, loads and 1, 5 and 15-minute load averages), memory usage, swap usage and cumulative I/O
reads and writes.
queries.txt
contains the queries as executed on the database, in the destination query language.
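As an example of the kind of post-processing we do in Python (Section 3.4), the following sketch computes average response times per query from a responsetimes.csv-like input; the exact column layout shown here is an assumption for illustration purposes:

```python
import csv
import io

# A fabricated responsetimes.csv: one identifier column followed by
# one response time in nanoseconds per run (here: no_runs = 3).
sample = io.StringIO(
    "query,run1,run2,run3\n"
    "listing0,1200,1100,1000\n"
    "listing1,500,700,600\n"
)

averages = {}
for row in csv.DictReader(sample):
    times = [int(v) for key, v in row.items() if key != "query"]
    averages[row["query"]] = sum(times) / len(times)

print(averages)  # {'listing0': 1100.0, 'listing1': 600.0}
```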
3.3 Design
This section provides a high-level overview of Terry's design. For lower-level documentation of Terry
for future development, see Appendix A.
Figure 3.1: Terry package structure.
Figure 3.2: Terry program flow. The dotted squares correspond with the packages in Figure 3.1.
The dashed arrows are optional flows. Although the arrows in the Database Connector contain
SQL-like terms (SELECT, CREATE, IMPORT), the Database Connector could implement any type of
database.
Terry is divided into six packages, as can be seen in Figure 3.1. Figure 3.2 shows the program flow
through the corresponding packages.
On the left-hand side of Figure 3.2 we find the program input: the workload, schema and optionally
the directory containing CSV data to import. The input package contains the logic of parsing the
SQL into an internal relational algebra-like representation.
Control then flows to the Database Connector, which takes the internal representation to generate
queries in the query language of the chosen database. The BenchRunner then takes control, taking
the generated queries from the Database Connector to generate the final output files.
In the following sections we will discuss the design of the six packages further.
3.3.1 Input
The input package transforms the provided workload and schema into the internal representation.
The schema and workload files contain SQL statements that are parsed using the JSQLParser
package [jsq]. JSQLParser generates parse trees that are transformed into the internal representations for
the workload and schema.
JSQLParser supports a variety of SQL dialects. It is, however, unclear from the JSQLParser
documentation which SQL is exactly supported, and whether it implements any ANSI SQL standard.
According to the JSQLParser GitHub page it extends the "core JavaCC grammer for SQL [..] taken
from Guido Draheim's site". However, the page contains no link to Guido Draheim's website, nor is
the website findable on Google. We have integrated JSQLParser and did not encounter any construct
that it failed to parse in our SQL.
We have considered other options, but none of them were preferable over JSQLParser. Apache
Calcite [cal] is a framework for developing databases and comes with a built-in SQL parser that
generates an internal representation in relational algebra, which could have saved us two steps. Calcite
could have acted as the foundation of Terry. However, the SQL that Calcite supports is very basic;
we would have needed to extend it to suit our needs. Secondly, Apache Calcite's source is large.
It would have taken considerable time to get used to it, and it would have been hard to clean up the
unneeded code.
Another option was writing or generating a parser, by either writing our own grammar or using an
existing SQL grammar. Writing our own grammar would have taken too much time, as SQL is so large
and the dialects so diverse that we would have kept extending our grammar and parser constantly.
Existing SQL grammars are available [ant], but the ANTLR4 grammar for SQL Server alone is already
over 5000 lines. We have tried to implement only the part of the grammar we actually needed, but
this was impossible to do in reasonable time, because of the large number of possible nodes in the
resulting parse tree. Besides that, implementing an SQL Server grammar would also have limited the
use of Terry to the SQL Server dialect. We have further searched the internet for more generic and
smaller grammars, but these came with the same kinds of problems.
Although the exact SQL that is supported by JSQLParser is not clear, it appears to support
everything we need and JSQLParser saved us valuable time compared with the other options we had.
3.3.2 Internal Representation
Terry creates a generic internal representation of the SQL workload and schema input files. This
internal representation makes it possible to adapt the workload and schema to any other database, as
long as the target database is able to express the semantics of the original queries.
Figure 3.3: The internal representation should make translation from any input to any output possible
Workload
The workload is internally represented in a relational algebra-like form (see the background in
Section 2.3 for a short introduction). Most, if not all, relational databases represent queries internally in a
structure that is related to relational algebra. SQL SELECT-queries and relational algebra are closely
related, but expressing SELECT-queries in terms of pure relational algebra alone is not possible.
To come up with the internal representation for Terry we have studied the documentation and source
code of several database systems. Most useful for this were PostgreSQL, MonetDB and Apache
Calcite. The common approach in these database systems is to create a tree structure where
each node represents either an operator from relational algebra, or a specific SQL
expression that is not representable with relational algebra, such as grouping or ordering. Sometimes
SQL expressions that are not representable with relational algebra are expressed as part of another
node, for example by including ordering in a projection node. At their core all these systems use
relational algebra, but the approach to expressing the auxiliary expressions differs.
We have based our internal representation of the workload on relational algebra as well. This
internal representation is more abstract than the original SQL, yet it still contains all information needed to
generate semantically equivalent SQL in any dialect. This internal representation is also suitable for
generating queries for a database that uses a query language other than SQL, provided it supports all
operations that are present in the original query.
Figure 3.4 shows an example of the internal representation of the simple query SELECT lastName
FROM persons WHERE firstName = 'Alan' AND lastName IS NOT NULL LIMIT 1 OFFSET 0. On top
is the Workload object that contains a Query object, which takes as input an object of type Relation.
The blue entities in Figure 3.4 are all children of the super type Relation. Relational algebra operators
take one or more relations as input and produce a relation. For this simple query all relations take a
single input relation, but the join-operators for example take two relations as input.
Left of the blue relation squares are the parts of the query that are handled within that operation.
Schema
The schema is internally represented by a tree structure with the Schema class as root. This class links
to one or more Table objects. Each Table object contains a list of child objects of the Column class,
Figure 3.4: The internal representation of a workload containing the query SELECT lastName FROM
persons WHERE firstName = ’Alan’ AND lastName IS NOT NULL LIMIT 1 OFFSET 0. Note that
a Workload can contain multiple queries, and a Projection can contain multiple selectExpressions.
such as BooleanColumn or IntegerColumn, which may contain additional properties. A DecimalColumn
for example has a precision and a scale property. The parent Column class holds generic information
such as the column name and whether a NULL value is allowed or not.
Figure 3.5 contains an example of how a CREATE-statement is transformed into an internal
representation.
The intrep package implements the Visitor-pattern on the Workload and Schema trees and provides
the WorkloadVisitor and SchemaVisitor classes to be extended. These visitors can be used by the
Database Connectors to generate queries in the target query language.
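The role of these visitors can be sketched as follows. This is a simplified Python illustration of the pattern; Terry itself is written in Java and its class and method names differ:

```python
# Two simplified node types of the internal workload tree. Each node
# accepts a visitor, which produces query text in the target dialect.
class Projection:
    def __init__(self, child, columns):
        self.child, self.columns = child, columns
    def accept(self, visitor):
        return visitor.visit_projection(self)

class InputRelation:
    def __init__(self, table):
        self.table = table
    def accept(self, visitor):
        return visitor.visit_input_relation(self)

class SqlWorkloadVisitor:
    """Generates default SQL; dialects override only what deviates."""
    def quote(self, identifier):
        return '"%s"' % identifier          # default identifier quoting
    def visit_projection(self, node):
        cols = ", ".join(self.quote(c) for c in node.columns)
        return "SELECT %s FROM %s" % (cols, node.child.accept(self))
    def visit_input_relation(self, node):
        return self.quote(node.table)

class MySqlWorkloadVisitor(SqlWorkloadVisitor):
    def quote(self, identifier):
        return "`%s`" % identifier          # MySQL quotes with backticks

tree = Projection(InputRelation("persons"), ["lastName"])
print(tree.accept(SqlWorkloadVisitor()))    # SELECT "lastName" FROM "persons"
print(tree.accept(MySqlWorkloadVisitor()))  # SELECT `lastName` FROM `persons`
```

Only the quoting method is overridden in the dialect subclass; the tree traversal itself stays in the shared base class.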
3.3.3 Database Connectors
To make Terry easily extendable with other database systems, we introduced the concept of Database
Connectors. Adding a new type of database can be done either by extending the abstract
DatabaseConnector class, or by extending the more specific JdbcDatabaseConnector class for JDBC
connections. See Appendix A for a more in-depth explanation of extending one of the two Database
Connector classes.
The Database Connectors take care of everything that is specific to a database system, from
connection handling to query generation in the target query language. The connector package also includes
abstract default implementations of an SqlSchemaVisitor and an SqlWorkloadVisitor. The only
abstract method in these classes is the getIdentifierQuoter() method. When this method is
overridden by providing an SqlIdentifierQuoter, default SQL will be generated for both the workload
and schema. Only parts of the query that deviate from the default have to be overridden to implement
a new SQL dialect. In our PostgreSQL implementation for example, the PostgresSchemaVisitor
overrides the visit(DecimalColumn decimalColumn) method to use the numeric column type
instead of the default decimal SQL type. The PostgresSchemaVisitor overrides another three methods
and the PostgresWorkloadVisitor only overrides one method; with minimal extra code we are able to
Figure 3.5: The internal representation of a schema with one Table, with the CREATE query in SQL
Server dialect. Note that a Schema can contain multiple Table objects, and that the primaryKey of
the Table can consist of multiple columns.
generate queries in a new SQL dialect.
Non-SQL databases can be implemented as well. We have implemented the start of a
MongoDbDatabaseConnector that successfully generates MongoDB queries from simple SQL
input. The support for SQL-like constructs in MongoDB is limited, however, so MongoDB's
functionality was not sufficient for our experimental workloads, but for simpler workloads the application is
able to generate correct MongoDB queries.
3.3.4 BenchRunner
The benchrunner package takes care of running the queries and anything that is necessary to generate
the desired output. Before the BenchRunner class starts running the experiments, the BenchPreparer
in the same package extracts subqueries, as inspired by Kersten [KKZ18]. Subqueries are extracted
from the query and added to the workload with a postfix to the query identifier. If query listing0
contains two subqueries, listing0-0 and listing0-1 will be added to the workload containing the
subqueries. Note that not all subqueries can be executed as separate queries, because they refer to
the parent query or other information from outside the subquery. The resulting timings will show up
as 0 in the output for these queries.
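The identifier postfix scheme can be sketched as follows (an illustrative Python fragment, not Terry's code):

```python
# Each extracted subquery inherits the parent identifier plus a
# zero-based numeric postfix, e.g. listing0 -> listing0-0, listing0-1.
def subquery_identifiers(parent_id, subqueries):
    return ["%s-%d" % (parent_id, i) for i in range(len(subqueries))]

print(subquery_identifiers("listing0", ["SELECT 1", "SELECT 2"]))
# ['listing0-0', 'listing0-1']
```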
The BenchPreparer also takes care of creating indices when this is desired by the end-user.
After the BenchPreparer has finished with the preparations, the BenchRunner starts running the
experiments. A separate MonitoringThread is started to collect the system statistics every 500 ms while
the queries are executed. Finally, the BenchRunner writes the results to files using the ResultsWriter
class.
3.3.5 Docker
The docker package adds support for Docker to Terry. When the type parameter in the database
configuration file is set to docker, Terry tries to spin up a Docker container with a database instance.
This feature makes it possible to capture database configuration and state, which makes experiments
portable and repeatable.
When using a Docker container, some additional configuration is necessary:
docker_image=[the docker image to use]
default_db_port=[internal database port]
docker_ready_log_line=[line in the Docker logs that shows that Docker is up and running]
stop_container=[true|false]
The first three settings are mandatory. The default_db_port is the database port within the
Docker container. The docker_ready_log_line is the log line that shows that the Docker container
is ready for use; Terry waits up to 30 seconds for this line to appear. It differs per Docker image but
will be something like Server is listening.
Terry will automatically destroy the Docker container on application exit. If the user would like to keep
the Docker container running for future use, stop_container can be set to false, or similarly on the
command line with --stop_container=false.
Sometimes environment variables need to be passed to the Docker container; this can be done with
the following configuration settings:
docker_envvars[0]=ENVVAR1=value1
docker_envvars[1]=ENVVAR2=value2
Other configuration settings are available, but need no changing in normal situations:

docker_address=[docker address for Terry to connect with, see Docker configuration]
default_port=[default set to 43210, the public container port to connect to]
docker_data_directory=[directory to bind with the data directory]
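Putting the settings together, a configuration for a PostgreSQL container could look as follows. The image name, ready log line and environment variable are assumptions based on the official postgres Docker image, not values prescribed by Terry:

```
type=docker
docker_image=postgres:10
default_db_port=5432
docker_ready_log_line=database system is ready to accept connections
docker_envvars[0]=POSTGRES_PASSWORD=secret
```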