Terry
A Tool for Assessing Database Performance with
Existing Workloads
Jaap Koetsier
mail@jkoetsier.nl
August 22, 2018, 106 pages
Supervisor:
Prof. dr. S. Manegold
Host organisation:
ShareCompany B.V. (http://www.sharecompany.nl), supervised by M. Bracher
Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
Abstract
This thesis introduces Terry, a tool for assessing query performance in terms of execution times on
a range of database systems. Terry is able to transform and run an existing SQL workload on any
database system. Terry is useful for testing query performance on different database types, so that one
can make an informed decision about which database type to use. The basis of this work originates from
a real-world scenario, where the question arose whether the current database system was the best
option for handling the current workload, if it was better to switch to another database system, or
even use multiple database systems in parallel. To test this we investigated whether it was possible
to classify queries in such a way that classes comprised queries with similar query performance on
a certain database system. A useful query classification turned out to be too dependent on a solid
database system configuration, for which one needs to have a thorough understanding of the databases
included in the classification.
During our research we could not reproduce, in our experiments, the problems that existed in our
real-world scenario. Query execution times in our experiments did not reflect the latencies experienced
in the production environment. However, the provided workload and data were not a perfect mirror of
the production environment. We developed the tools that make the experiments we performed easily
repeatable in the future, on an environment that better reflects the production environment.
To further assess the usefulness of Terry, we went on to use it to assess and optimise the query
performance of the TPC-H query set.
Contents

Abstract
1 Introduction
2 Background and context
  2.1 Database Systems
    2.1.1 Relational SQL-Databases
    2.1.2 Graph Databases
    2.1.3 Key-value Stores
    2.1.4 Document Stores
  2.2 Existing Benchmarks
    2.2.1 TPC Benchmarks
    2.2.2 YCSB
  2.3 Relational Algebra
3 Terry
  3.1 Overview
  3.2 Usage
    3.2.1 Command line arguments
    3.2.2 Input files
    3.2.3 Output files
  3.3 Design
    3.3.1 Input
    3.3.2 Internal Representation
    3.3.3 Database Connectors
    3.3.4 BenchRunner
    3.3.5 Docker
    3.3.6 Config
  3.4 Terry Tools
  3.5 Conclusion
4 Workload Analysis
  4.1 ShareCompany workload
    4.1.1 Data, workload and environment
    4.1.2 Experiments
  4.2 TPC-H workload
    4.2.1 Data, workload and environment
    4.2.2 Experiments
  4.3 Conclusions
5 Analysis and conclusions
Appendices
  A.1 Configuration
  A.2 Application Entry
  A.3 Extending Terry with a new Database System
    A.3.1 Extending the JdbcDatabaseConnector
    A.3.2 Limitations
B Terry Tools
  B.1 Graph Plot Tool
  B.2 TSQL scripts
  B.3 CSV scripts
  B.4 Other scripts
C Terry Output
  C.1 Response and result times (responsetimes.csv and resulttimes.csv)
  C.2 Query statistics (querystats.csv)
  C.3 System statistics (systemstats.csv)
  C.4 Queries (queries.txt)
D ShareCompany workload and schema
  D.1 Workload
  D.2 Schema
Chapter 1
Introduction
In greenfield projects, we as software engineers often have to decide on one or more database systems
to use. The large number of database systems available can make this a difficult choice. Because of a
lack of in-depth knowledge of database systems and the difficulty of the choice, we often go for
database systems we, or the organisation, are familiar with, or we may choose a fashionable new
technology we have heard good things about. Of course, this dilemma presents itself on many occasions
during the software process, not only when choosing a database system. While choosing for familiarity
has its advantages to the organisation, choosing a fancy new technology asks for good argumentation.
The difficulty with database systems is that the design choices made during the development of the
database systems are not always visible to the user. Database systems have properties such as ACID
compliance, security, scalability, durability, etc. These properties are generally well documented and
aid us in choosing a database system. Database system performance, however, is often harder to
predict. The trade-offs made by the database system designers within the database software are opaque
to the user. There are as many ways of optimising and executing queries as there are database systems.
These differences inevitably involve internal trade-offs, favouring one type of workload over
another. This may mean that simple queries pay a time penalty in query optimisation on a certain
database system, in order to win time on more complex queries.
When choosing a database system these differences are not always apparent. Standard benchmarks
are available, but they hardly ever mirror our exact workload, are run on different hardware, are not
always completely unbiased or are not configured optimally for the given workload - the latter being
a challenge in itself. Even with all the information and benchmarks available, choosing a database
system is not a straightforward task.
As time passes and a software project develops, the database grows and the variety of queries we
execute against it increases. We may then find that average query execution times go up and the
database gets blamed for the latency in our application. When this happens, we are eager to find
solutions that bring the performance back to an acceptable level. Perhaps upgrading the hardware will
provide an adequate solution. Maybe it is time to switch to another type of database system, or the
performance on different groups of queries may be so diverse that it is worth considering using
multiple database systems alongside each other.
For example, when the software has developed in such a way that a large part of the queries executed
are mainly key lookups, while another subset of the queries comprises highly analytical queries, it
may be more efficient resource-wise to use different database systems for these queries. Such insights
may drive future architectural decisions, modularising the software in such a way that responsibilities
are in line with the database systems used in that part of the software.
In a perfect scenario we can spread the data of one database over multiple databases without having
to duplicate any data. This situation would be a good indicator of an opportunity to
modularise our architecture. In practice, however, we will not always find this ideal situation. In
the case that the use of multiple database systems in parallel is justified, but we are unable to do this
without duplicating data, we would have to synchronise data between different systems, which can be
a challenge in itself.
The research in this thesis is inspired by the situation at ShareCompany, an Amersfoort-based company
that provides stock market data to their customers. All data that is provided to their customers
flows through a central SQL Server database. The data within this database is updated mainly in regular
daily batches, in which the data is fetched from data providers. ShareCompany's customers access
APIs in front of a thin application layer that queries the database on behalf of the API user. The
execution times of queries on the database have a direct effect on the response times of the external
APIs.
In the current situation, some queries take several seconds to return. This not only directly affects the
usability of the external APIs, it also keeps ShareCompany from expanding their business. The question
within ShareCompany is whether their current database system suits their needs for expanding
their business, or if it may be better to migrate to a different database system, or even use different
database systems for different types of queries in parallel.
Using different database systems in parallel only pays off when queries that perform poorly on one
database system perform significantly better on a different database system. For a small improvement
in the execution time of a query, the administrative overhead of possible data synchronisation and
administration within the application layer may not pay off. In other words, we would like to know if
there are groups of queries that perform noticeably different on different database systems. If those
groups exist and they cover all queries with, to ShareCompany, reasonable execution times, we may
have found a combination of database systems that will increase overall performance. This brings us
to our first research question:
RQ 1: How can we classify application level queries in such a way that we observe classes with
similar execution time on a particular database?
RQ 1a: How can we find those classes?
Perhaps we find while answering this RQ that one database system performs acceptably on all
queries, which would mean that migrating to a different database system would solve the problem
altogether.
To answer RQ 1, we will need to experiment with the current workload and database of ShareCompany
on their current SQL Server database and other database systems, perhaps database systems other
than relational SQL database systems. We want to make our experiments repeatable and easy to
replicate on different database systems that may use a different SQL dialect than the current system,
or even a different query language altogether.
To run our experiments we want to be able to take the current workload, data and schema from
ShareCompany and run this workload on different database systems. This is captured in RQ 2:
RQ 2: How can we construct a tool to automate our tests?
If we find a query classification in RQ 1 we will try to generalise it, so that we can use it to
predict the best suitable database system for future queries.
RQ 3: How can we generalise the query classification?
To answer RQ 1 and RQ 3, we will use the tool constructed based on RQ 2. This tool will help us
in automating the execution and timing of queries on a range of database systems and allows us to
perform repeated experiments. Based on the output of this tool we will try to come up with a query
classification, answering RQ 1 and RQ 3.
This thesis is further divided into chapters. In Chapter 2 we present some background on the
database landscape. We discuss some popular types of database systems that we considered including
in our experiments. We will also discuss some popular database benchmarks and end with a short
introduction to relational algebra, which is necessary as a foundation later in this thesis.
In Chapter 3 we discuss Terry, the tool we have developed to help us perform our experiments, to
answer RQ 2. Chapter 4 discusses the experiments we performed on the ShareCompany situation
and how we used Terry to improve the performance of the TPC-H query set on a MySQL database,
answering RQ 1 and RQ 3. Chapter 5 ends with our analysis and conclusions.
Chapter 2
Background and context
In this chapter we will discuss some popular types of contemporary database systems. We will then
discuss some existing database benchmarks. We will conclude this chapter with an introduction
to relational algebra, which will act as a foundation for understanding the internal model of Terry in
Chapter 3.
2.1 Database Systems
In this section, we will discuss some popular types of database systems, and how they fit into our
experiments.
2.1.1 Relational SQL-Databases
Relational databases are the traditional SQL-databases. These are visualised as tables with columns
and possible relations between columns. Relational databases store tuples (rows in a table) that are
queried using SQL. Traditionally these tuples are stored together on a per-row basis, the so-called
row-stores. Column-stores, on the other hand, store one or more columns together and are often found
in OLAP-focused database systems. OLAP (Online Analytical Processing) workloads differ from OLTP
(Online Transaction Processing) workloads in that OLAP workloads generally consist of queries that
perform calculations on a large amount of data, whereas OLTP workloads consist of relatively simple
lookup queries. The schemas of OLAP databases often contain wide tables, generally much wider than
those found in an OLTP database. Column-stores are therefore the preferred choice for OLAP
workloads, as large amounts of data have to be analysed. A traditional row-store database will have
to fetch all rows to analyse them, inducing a high cost in data transfer. Column-stores can fetch only
the partitions of the table that are required to execute a query [BKM08, MKB09].
This generalisation is confirmed by Holloway et al. and Harizopoulos et al. Holloway et al. [HD08]
found that narrow row-stores outperform column-stores, but that column-stores outperform row-stores
in case the workload is I/O bound, because less bandwidth is needed to transfer a few columns instead
of the complete row. Holloway et al. also found that adding predicates to queries negatively impacts
column-stores more than row-stores. Harizopoulos et al. [HLAM06] came to similar conclusions
two years earlier, but they also added that in CPU-bound workloads, where I/O is less dominant,
column-stores are not necessarily faster than row-stores.
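The bandwidth argument can be made concrete with a toy example: to scan a single column, a row-store must touch every attribute of every row, while a column-store touches only the values of that column. The sketch below counts touched values under both layouts; it illustrates the principle only, not any real storage engine:

```python
# Ten rows with five attributes each.
rows = [[i, f"name{i}", i * 2, i * 3, i * 4] for i in range(10)]

# Row-store layout: to scan column 2 we still read every stored value of
# every row, because the attributes of a row are stored together.
row_store_values_read = sum(len(row) for row in rows)  # 10 rows * 5 attributes

# Column-store layout: the same scan reads only the values of column 2.
columns = list(zip(*rows))
column_store_values_read = len(columns[2])  # 10 values

print(row_store_values_read, column_store_values_read)  # → 50 10
```

In an I/O-bound workload this difference in touched data translates directly into the bandwidth savings described above.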
HTAP (Hybrid Transactional/Analytical Processing) databases offer a hybrid solution for handling
both OLTP and OLAP workloads on a single database system, without the need to duplicate data from
an OLTP database to an OLAP database for analysis. This removes the naturally occurring data
duplication latency between the two database systems and provides the end-user with a system that
supports real-time data analysis. The underlying storage model of HTAP database systems is often
a derivative of the Decomposition Storage Model as found in OLAP databases, such as the Flexible
Storage Model, which combines properties of row-stores and column-stores [APM16]. However, because
HTAP databases are hybrid by definition,
storage models differ between HTAP databases and trade-offs will be made favouring one workload
over the other.
Finally there is the term NewSQL, which denotes a new generation of SQL stores that are horizontally
scalable while keeping ACID properties and fault tolerance [GHTC13]. Pavlo et al. ask the question
What's Really New with NewSQL? in their paper on NewSQL [PA16] and conclude that NewSQL
is not a completely new type of database system, but a system in which existing techniques are
combined to achieve a horizontally scalable relational database. NewSQL is a buzzword for a type
of database that was inevitably going to arrive. Relational databases used to be difficult to scale
horizontally without an extra middleware layer. NoSQL stores such as document-stores tried to fill
the gap that opened up when the internet brought us large amounts of data. NewSQL data stores
combine this scalability with the traditional SQL data store.

The current ShareCompany database is a traditional single-node row-store, so it only makes sense to
include relational databases in our experiments.
2.1.2 Graph Databases
Graph databases are based on graph theory. The focus in graph databases is on the relations between
nodes. Data can be stored both as labels on edges (the relations) and on nodes. Graph databases are
efficient in traversing relationships, but are not that efficient in simple data lookups. As Grolinger et al.
[GHTC13] put it: [Graph databases] are suitable in scenarios such as social networking applications,
pattern recognition, dependency analysis, recommendation systems and solving path finding problems
in navigation systems.

This is why we exclude graph databases from our experiments. It would be interesting to see to what
extent a graph database such as Neo4J handles a regular SQL workload such as the one from
ShareCompany, but we do not expect it to outperform an SQL database on the ShareCompany workload.
2.1.3 Key-value Stores
Key-value stores are, interface-wise, the most basic database systems around. A key-value store allows
the user to store, update and delete data by key. It provides no means of querying or filtering
on the values; the database is unaware of the structure of the underlying value. Because of this
simplicity, key-value stores are very efficient at doing just that: fetching values by key. This is why
key-value stores such as Memcached or Redis are often used as caching mechanisms.

There is no way to capture an ordinary SQL workload or schema in a key-value store, because it
would mean we could only query by key, without filtering on other columns. This is why we will not
include key-value stores in our experiments.
2.1.4 Document Stores
Document stores are essentially key-value stores, but with one major difference. In contrast with
key-value stores, document stores allow the user to query the values as well. Document stores are
generally schema-less, just as key-value stores, but the values are often structured in a format such
as JSON (JavaScript Object Notation). The database engine is able to analyse the structured value
and filter on attributes of the value. It is often able to generate indices on these attributes as well.
Some document-store databases such as MongoDB are now also able to join documents from different
collections. In MongoDB, collections are the equivalent of tables, and documents are the equivalent
of table rows. This means that it is possible to query related objects, instead of one object at a time.

Querying relations in MongoDB is still very limited compared to SQL, but we will investigate whether
it is possible at all to capture part of the ShareCompany workload in a document store.
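Joining collections in MongoDB is expressed with the $lookup stage of the aggregation pipeline. As a sketch (the collection and field names below are invented for illustration, not taken from the ShareCompany schema):

```python
# Build an aggregation pipeline that joins each 'orders' document with the
# matching 'customers' document, roughly like a SQL LEFT OUTER JOIN.
# Collection and field names here are hypothetical.
pipeline = [
    {"$lookup": {
        "from": "customers",         # the collection to join with
        "localField": "customerId",  # field in the 'orders' documents
        "foreignField": "_id",       # field in the 'customers' documents
        "as": "customer",            # name of the output array field
    }},
    {"$unwind": "$customer"},        # flatten the one-element result array
]

# With a running MongoDB instance this would be executed as:
#   db.orders.aggregate(pipeline)
```

Note that such a join is resolved per document and lacks most of the optimisation machinery of a relational query planner, which is part of why we call MongoDB's relational querying limited.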
2.2 Existing Benchmarks
In this section we will discuss some existing benchmarks.
2.2.1 TPC Benchmarks
The Transaction Processing Performance Council, or TPC [tpca], is a non-profit organisation founded
to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC
performance data to the industry. The TPC currently lists 14 active benchmarks on their website, for
different industry applications. There are 6 application categories: Transaction Processing - OLTP,
Decision Support, Virtualization, Big Data, IoT and Common Specifications. The benchmarks are
offered in different SQL dialects and include scalable synthetic data. Examples of TPC benchmarks
are TPC-C, an OLTP workload, TPC-H (decision support), TPCx-BB for Big Data and TPCx-IoT
for IoT applications. TPC benchmarks try to mimic a real-world scenario from industry and their
benchmarks are well-known in the database community. Results from TPC benchmarks on different
kinds of hardware are posted on the TPC website, along with the benchmark-specific performance
metric (such as the number of queries per hour or transactions per hour), the scale factor (the size
of the data set used) and the Price/Performance ratio. Through this, TPC achieves a high level
of standardisation of database benchmarks. However, as the TPC highlights in their benchmark
documentation, these benchmarks are no substitute for benchmarks with a real-world workload.
2.2.2 YCSB
The Yahoo Cloud Serving Benchmark [CST+10] is a benchmark aimed at databases operating in a
cloud environment. Key points in this benchmark are latency vs. throughput and scaling/elasticity.
YCSB contains a core set of workloads that can be used to assess the performance of database systems
along these lines. The YCSB benchmark uses artificial data and is not aimed at a specific workload;
it merely tries to give a global insight into how databases perform along the axes of latency, throughput
and scalability with preset workloads: update heavy, read heavy, read only, read latest and short
ranges. YCSB provides a number of parameters that can be used to change the distribution of the
data, for example. YCSB is mainly used to assess NoSQL datastore performance, but it supports SQL
databases as well. The downside of YCSB in light of our experiments is that it is hard to provide
your own queries. YCSB does provide an option to generate your own workload, but that either comes
down to writing a parameter file with workload properties, which YCSB then uses to generate queries,
or writing Java classes that capture the workload. The first is not an option, as we have a specific
workload and data set of which we want to assess the performance. The latter would be a possibility,
but it is cumbersome to implement and hard to change when our workload or data is updated.
2.3 Relational Algebra
The tool we are going to develop should be able to take an SQL workload in any dialect and output
this workload in any other SQL dialect, and possibly in other query languages. To do this we will need
a generic internal representation that can represent the queries that make up the original workload.
SQL is based on relational algebra, and all relational databases use some kind of relational algebra as
their internal query representation. Relational algebra is based on the 1970 paper by Codd [Cod70]
and comprises the following basic operations: projection (π), selection (σ) and Cartesian product (×),
plus the set union (∪) and difference (\) operators. We expect the reader to be familiar with
the set operations, but we will explain projection and selection with an example. Here is an example
of a simple SQL query:

SELECT charCode FROM AsciiCharacters WHERE intCode = 65

The equivalent in relational algebra could be:

π_charCode(σ_intCode=65(AsciiCharacters))

From right to left: we select all tuples from AsciiCharacters where intCode = 65, and we project
the charCode column from the resulting tuples. A selection selects rows, whereas a projection projects
columns of the input relation.
This is a very basic example, but it gives an idea of how a query may be represented as a tree
structure within the database software. With these basic operators we could in theory express all
joins. However, in practice this algebra is often extended with easier-to-understand operators for
joins. Also, actions such as grouping, aggregation and ordering are not expressible in pure relational
algebra. These operations are often added as new nodes (operations) in the tree, or added as metadata
to other nodes.
The formalisation of relational algebra makes it possible to reason about queries and formalise query
optimisations in database systems.
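The tree-structure idea can be made tangible with a minimal sketch: operator nodes for selection and projection, evaluated over relations modelled as lists of dicts. This is an illustration of the concept only, not Terry's actual internal representation:

```python
# Minimal relational-algebra query tree: each node exposes rows().

class Relation:
    """A base relation: a named bag of tuples (modelled as dicts)."""
    def __init__(self, tuples):
        self.tuples = tuples

    def rows(self):
        return list(self.tuples)

class Selection:
    """σ_predicate(child): keep only the rows satisfying the predicate."""
    def __init__(self, predicate, child):
        self.predicate = predicate  # function: row -> bool
        self.child = child

    def rows(self):
        return [row for row in self.child.rows() if self.predicate(row)]

class Projection:
    """π_columns(child): keep only the named columns of each row."""
    def __init__(self, columns, child):
        self.columns = columns
        self.child = child

    def rows(self):
        return [{c: row[c] for c in self.columns} for row in self.child.rows()]

# π_charCode(σ_intCode=65(AsciiCharacters))
ascii_characters = Relation([{"intCode": 65, "charCode": "A"},
                             {"intCode": 66, "charCode": "B"}])
query = Projection(["charCode"],
                   Selection(lambda r: r["intCode"] == 65, ascii_characters))
print(query.rows())  # → [{'charCode': 'A'}]
```

Evaluating the tree bottom-up mirrors the right-to-left reading of the algebraic expression, which is exactly how a database engine walks its internal query plan.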
Chapter 3
Terry
To answer RQ 2, we developed a tool we called Terry, to assist us in performing our experiments.
Terry needs to take any existing SQL workload and database schema as input. From this input
Terry must be able to generate a workload and schema in different SQL dialects and query languages
other than SQL. Terry should be easily extensible with new types of database systems, so that the
experiments can be extended to, and repeated on, new database types in the future. Terry should be able to set
up a new database from scratch with the provided workload and schema along with the data, but it
must be able to run experiments on an existing database as well. This means that we should be able
to provide Terry the authentication credentials and address of an existing database.
Terry should output the query response times per query and system statistics such as CPU, memory
and I/O usage.
3.1 Overview
Terry takes an existing workload and database schema in SQL format and executes the queries in the
workload against the database of choice. Optionally, Terry takes the CSV data export of the existing
database to import in the database we are running our experiments on. The only precondition in this
case for Terry to be able to run the experiments is a running database server with an empty database.
With the provided workload, schema and data Terry has everything to initialise our experimental
database and run the experiments.
Before running the experiments, Terry extracts subqueries to run these separately, providing the user
with more fine-grained information about query performance. Optionally, Terry can generate indices
on the columns used for filtering, ordering and grouping, to save the user from manually creating
these indices.
Terry times the queries, monitors the system's resources and writes the query timings, query statistics
and system statistics to a number of CSV files. A Python script (see Appendix B) visualises this
data with graphs.
We have added Docker support to make experiments portable and repeatable.
The development of Terry was an iterative process. Results of experiments called for new features,
improving either the process or the results. Examples of these are the automatic creation of indices
and the extraction of subqueries (see Section 3.3.4).
In this chapter we will describe Terry and the design decisions made during its development in
Section 3.3, after we have explained the usage of Terry in Section 3.2. We will decompose Terry
into modules and handle each module separately. In Appendix A, Terry is explained at a lower level,
for future development.
3.2 Usage
Terry is developed in Java Spring and is run as a typical Java command line application:

java -jar terry.jar [command line arguments]

Here terry.jar is the JAR file resulting from the build. See README.md in the Terry repository
for the build command and the resulting JAR file. From here on, terry.jar will be used to refer
to the compiled Java code. Replace this as necessary.
3.2.1 Command line arguments
Terry has three mandatory command line arguments:
--workload=[workloadfile.sql|workload.yml] --schema=[schemafile.sql]
--dbtype=(mssql|mysql|monetdb|postgres|..)
Here we define the mandatory workload and schema in SQL format and the database type to run
the experiments on. The format of these files is further explained in Section 3.2.2.
One other commonly used, but not mandatory argument is:
--data-directory=[data_dir]
If this argument is provided, the application looks for DSV (delimiter-separated values) files within
this directory. It matches the file names with table names and imports the data into the corresponding
tables. By default, Terry looks for files with the *.csv extension and has default values for the field
separators, line delimiters and quotation marks, but this is configurable; see Section 3.2.2.
Besides these arguments the configuration for the database to run experiments on (argument
dbtype) needs to be set. The currently implemented database types each have their own default
configuration in the resources directory (named db.[dbtype].properties), but these configuration
settings can be overridden on the command line:
--username=[database username] --password=[database password]
--database=[database name] --host=[database host] --port=[database port]
In the resources directory we also find the application.properties file with the default application
configuration. These settings can all be overridden on the command line as well, a helpful feature of
Java Spring. We will explain these settings here:
--output_directory=[output_dir]
By default Terry will write the results to the /tmp/terry_out directory. This setting can be used
to write to a different location. See Section 3.2.3 for more information about the generated output.
--no_runs=[m] --skip_first=[n]
The default settings for these are m = 1 and n = 0. Terry will perform m + n runs sequentially,
omitting the first n runs from the results. For the last m runs Terry will capture the output as discussed
in Section 3.2.3. Skipping the first n runs can be used to warm up the database. Databases may
need a couple of runs to cache results, create indices or perform other optimisations so that subsequent
queries perform better. Terry does not support running multiple cold runs, as this is less interesting
for our experiments. To execute multiple cold runs, it is up to the user to 'cool' the system after each
run.
--skip_result_validation=(true|false) --skip_create_schema=(true|false) --create_indices=(true|false)
These three arguments control Terry's behaviour. The argument skip_result_validation can
be used to skip the query result validation, see Section 3.2.2. skip_create_schema can be used when
repeating experiments on an earlier set-up database, to avoid having to re-initialise the database again.
create_indices (false by default) creates indices on all fields that are used in the query filters
(the 'where' clauses, but also the 'on' expressions of joins) or in the 'group by' or 'order by' clauses
of the queries.
--rename_imported_csv_files=(true|false)
The default value of this argument is 'false'. This argument comes in handy during (further)
development of Terry, when problems arise while importing CSV files. Files that are correctly imported
will be renamed with the postfix .done, so that Terry ignores these files on subsequent runs.
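Putting the arguments together, a full invocation could look like the following. The file names, credentials and host are examples chosen for illustration, not defaults shipped with Terry:

```shell
java -jar terry.jar \
  --workload=workload.sql \
  --schema=schema.sql \
  --dbtype=postgres \
  --data-directory=./data \
  --username=terry --password=secret \
  --database=experiments --host=localhost --port=5432 \
  --no_runs=5 --skip_first=2 \
  --create_indices=true
```

This would warm the database up with two unrecorded runs, then record five runs with automatically generated indices.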
3.2.2 Input files
This section describes the format of the input files.
Schema
The schema should be an *.sql file that contains a series of 'create table' SQL statements, separated
by semicolons, as follows:

CREATE TABLE asciivalues (
    intVal int NOT NULL,
    charVal char NOT NULL
);

CREATE TABLE users (
    id int NOT NULL,
    firstname varchar(100),
    lastname varchar(100) NOT NULL,
    PRIMARY KEY (id)
);
Workload
The workload file can be formatted in two ways. The first is similar to the schema definition: an
*.sql file with a list of 'select' statements, separated by semicolons. Terry will assign each query an
integer value as identifier.

It is also possible to supply the workload in a YAML file with a *.yml extension, with the following
structure:

workload:
  - query: "SELECT intVal, charVal FROM asciivalues WHERE charVal = 'A'"
    identifier: queryAsciiByCharA
    results:
      - 65, A
  - query: "SELECT intVal, charVal FROM asciivalues WHERE intVal > 65 AND intVal < 69"
    identifier: queryAsciiByIntRange
    results:
      - 66, B
      - 67, C
      - 68, D
When the workload is provided with the expected results, Terry will validate the query results of
the experiments against the results provided in the YAML file, and it outputs a warning message if
the results do not match the expected results, unless skip_result_validation is set (see Section 3.2.1).
2YAML Ain’t Markup Language, a structured file format that separates lists, objects and attributes using newlines, identation and hyphens. Much as JSON, but using whitespace instead of brackets.
DSV Data
When the data_directory command line argument is provided, Terry reads and imports the data as
DSV files into the database. By default, Terry scans for *.csv files, but by setting the dsv_format
parameter to tbl, Terry scans for *.tbl files and treats them as TPC data files.

*.csv files are expected to have one header line, double quotes as quotation marks, commas as
field separators, NULL (unquoted) as null value and a newline character as record terminator.

*.tbl files should have no header lines, use no quotation marks, use pipe symbols as field separators,
empty strings as null values and a newline character as record terminator. This is the way
TPC provides the data with their benchmarks.

A custom DSV format can be provided as well, with the following arguments:

--dsv-extension= --dsv-field-separator= --dsv-quotes=
--dsv-header-lines= --dsv-record-terminator= --dsv-null-value=

These parameters override the defaults for the selected dsv_format, or csv if no format is explicitly
selected. Keep in mind that not all database systems support having more than one header line.
Trying to run experiments on PostgreSQL, for example, with dsv_header_lines > 1 will throw an
UnsupportedConfigurationException.
File Naming
When Terry reads the DSV files, everything before the first dot in the file name will be matched
with the table name. For example, customers.csv will be imported into customers, but so will
customers.002.csv.
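This naming rule can be sketched as follows (a minimal Python illustration, not Terry's actual Java code):

```python
# Everything before the first dot in the file name determines the
# destination table; this allows data to be split over multiple files.
def table_name_for(filename):
    return filename.split(".", 1)[0]

print(table_name_for("customers.csv"))      # customers
print(table_name_for("customers.002.csv"))  # customers
```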
3.2.3 Output files
Terry creates output files in the output directory provided as an argument, or in /tmp/terry_out by
default. The output files for a particular run are prefixed with the current timestamp, an underscore
and the database type to distinguish between runs, for example 20180701120503_mysql_systemstats.csv.
We chose to output the raw data in CSV format here for two reasons. The first is that there are better
tools and languages for analysing data and drawing graphs than Java. We decided to do further
calculations on the raw data and graph drawing with Python (see Section 3.4). The second reason
is that by splitting the data generation from the representation we were able to refine our analysis
and graph drawing without having to run Terry every time. Appendix C contains examples of the
generated files.
The following files will be generated:
responsetimes.csv
contains the response times of all queries in nanoseconds per run. If no_runs
is set to 5, there will be five columns with response times per query.
resulttimes.csv
contains the result times of all queries in nanoseconds. The layout is the same as
for responsetimes.csv. While the response time is the time measured between sending the request
to the database and the time of return of the database cursor, the result time also includes the time
needed to fetch the data from the database. The result time of a query execution is by definition
larger than the response time.
querystats.csv
contains the statistics per query. Per query we see the average response and result
times, the number of rows returned, the result width (the number of columns returned), the number
of tables that are touched in this query, the total width of all tables that are touched by the query, the
weighted table widths ($\sum_{k=0}^{len(touched\_tables)} table\_width(k)$), plus the statistics per table in the query:
how often a table is touched (the touched_cnt column), the number of columns per table and the
table sizes.
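The weighted table widths statistic can be illustrated with a small Python sketch (function and table names are hypothetical, not part of Terry):

```python
# Sum of the widths of all touched tables; a table that is touched
# twice (e.g. in a self-join) contributes its width twice.
def weighted_table_widths(touched_tables, table_width):
    return sum(table_width[t] for t in touched_tables)

widths = {"persons": 5, "orders": 8}
# A query touching persons twice and orders once:
print(weighted_table_widths(["persons", "persons", "orders"], widths))  # 18
```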
systemstats.csv
contains the system statistics during the run. Approximately every 500 ms the
system statistics are captured and ultimately written to this file. This file shows the CPU usage
(ticks, loads and 1, 5 and 15-minute load averages), memory usage, swap usage and cumulative I/O
reads and writes.
queries.txt
contains the queries as executed on the database, in the destination query language.
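As an example of the kind of post-processing we do in Python (Section 3.4), the following sketch computes average response times per query from a responsetimes.csv-like input; the exact column layout shown here is an assumption for illustration purposes:

```python
import csv
import io

# A fabricated responsetimes.csv: one identifier column followed by
# one response time in nanoseconds per run (here: no_runs = 3).
sample = io.StringIO(
    "query,run1,run2,run3\n"
    "listing0,1200,1100,1000\n"
    "listing1,500,700,600\n"
)

averages = {}
for row in csv.DictReader(sample):
    times = [int(v) for key, v in row.items() if key != "query"]
    averages[row["query"]] = sum(times) / len(times)

print(averages)  # {'listing0': 1100.0, 'listing1': 600.0}
```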
3.3 Design
This section provides a high-level overview of Terry's design. For lower-level documentation of Terry
for future development, see Appendix A.
Figure 3.1: Terry package structure.
Figure 3.2: Terry program flow. The dotted squares correspond with the packages in Figure 3.1.
The dashed arrows are optional flows. Although the arrows in the Database Connector contain
SQL-like terms (SELECT, CREATE, IMPORT), the Database Connector could implement any type of
database.
Terry is divided into six packages, as can be seen in Figure 3.1. Figure 3.2 shows the program flow
through the corresponding packages.
On the left-hand side of Figure 3.2 we find the program input: the workload, schema and optionally
the directory containing CSV data to import. The input package contains the logic of parsing the
SQL into an internal relational algebra-like representation.
Control then flows to the Database Connector, which takes the internal representation to generate
queries in the query language of the chosen database. The BenchRunner then takes control, taking
the generated queries from the Database Connector to generate the final output files.
In the following sections we will discuss the design of the six packages further.
3.3.1 Input
The input package transforms the provided workload and schema into the internal representation.
The schema and workload files contain SQL statements that are parsed using the JSQLParser
package [jsq]. JSQLParser generates parse trees that are transformed into the internal representations for
the workload and schema.
JSQLParser supports a variety of SQL dialects. It is, however, unclear from the JSQLParser
documentation which SQL is exactly supported, and whether it implements any ANSI SQL standard.
According to the JSQLParser GitHub page it extends the "core JavaCC grammer for SQL [..] taken
from Guido Draheim's site". However, the page contains no link to Guido Draheim's website, nor is
the website findable on Google. We have integrated JSQLParser and did not encounter any construct
that it failed to parse in our SQL.
We have considered other options, but none of them were preferable over JSQLParser. Apache
Calcite [cal] is a framework for developing databases and comes with a built-in SQL parser that
generates an internal representation in relational algebra, which could have saved us two steps. Calcite
could have acted as the foundation of Terry. However, the SQL that Calcite supports is very basic;
we would have needed to extend it to suit our needs. Secondly, Apache Calcite's source is large.
It would have taken considerable time to get used to it, and it would have been hard to clean up the
unneeded code.
Another option was writing or generating a parser, by either writing our own grammar or using an
existing SQL grammar. Writing our own grammar would have taken too much time, as SQL is so large
and the dialects so diverse that we would have kept extending our grammar and parser constantly.
Existing SQL grammars are available [ant], but the ANTLR4 grammar for SQL Server alone is already
over 5000 lines. We have tried to implement only the part of the grammar we actually needed, but
this was impossible to do in reasonable time, because of the large number of possible nodes in the
resulting parse tree. Besides that, implementing an SQL Server grammar would also have limited the
use of Terry to the SQL Server dialect. We have further searched the internet for more generic and
smaller grammars, but these came with the same kinds of problems.
Although the exact SQL that is supported by JSQLParser is not clear, it appears to support
everything we need and JSQLParser saved us valuable time compared with the other options we had.
3.3.2 Internal Representation
Terry creates a generic internal representation of the SQL workload and schema input files. This
internal representation makes it possible to adapt the workload and schema to any other database, as
long as the target database is able to express the semantics of the original queries.
Figure 3.3: The internal representation should make translation from any input to any output possible
Workload
The workload is internally represented in a relational algebra-like form (see the background in
Section 2.3 for a short introduction). Most, if not all, relational databases represent queries internally in a
structure that is related to relational algebra. SQL SELECT-queries and relational algebra are closely
related, but expressing SELECT-queries in terms of pure relational algebra alone is not possible.
To come up with the internal representation for Terry we have studied the documentation and source
code of several database systems. Most useful for this were PostgreSQL, MonetDB and Apache
Calcite. The common approach in these database systems is to create a tree structure where
each node represents either an operator from relational algebra, or a specific SQL
expression that is not representable with relational algebra, such as grouping or ordering. Sometimes
SQL expressions that are not representable with relational algebra are expressed as part of another
node, for example by including ordering in a projection node. At their core all these systems use
relational algebra, but the approach to expressing the auxiliary expressions differs.
We have based our internal representation of the workload on relational algebra as well. This
internal representation is more abstract than the original SQL, yet it still contains all information needed to
generate semantically equivalent SQL in any dialect. This internal representation is also suitable for
generating queries for a database that uses a query language other than SQL, provided it supports all
operations that are present in the original query.
Figure 3.4 shows an example of the internal representation of the simple query SELECT lastName
FROM persons WHERE firstName = 'Alan' AND lastName IS NOT NULL LIMIT 1 OFFSET 0. On top
is the Workload object that contains a Query object, which takes as input an object of type Relation.
The blue entities in Figure 3.4 are all children of the super type Relation. Relational algebra operators
take one or more relations as input and produce a relation. For this simple query all relations take a
single input relation, but the join-operators for example take two relations as input.
Left of the blue relation squares are the parts of the query that are handled within that operation.
Schema
The schema is internally represented by a tree structure with the Schema class as root. This class links
to one or more Table objects. Each Table object contains a list of child objects of the Column class,
Figure 3.4: The internal representation of a workload containing the query SELECT lastName FROM
persons WHERE firstName = ’Alan’ AND lastName IS NOT NULL LIMIT 1 OFFSET 0. Note that
a Workload can contain multiple queries, and a Projection can contain multiple selectExpressions.
such as BooleanColumn or IntegerColumn, which may contain additional properties. A DecimalColumn
for example has a precision and a scale property. The parent Column class holds generic information
such as the column name and whether a NULL value is allowed or not.
Figure 3.5 contains an example of how a CREATE-statement is transformed into an internal
representation.
The intrep package implements the Visitor-pattern on the Workload and Schema trees and provides
the WorkloadVisitor and SchemaVisitor classes to be extended. These visitors can be used by the
Database Connectors to generate queries in the target query language.
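The role of these visitors can be sketched as follows. This is a simplified Python illustration of the pattern; Terry itself is written in Java and its class and method names differ:

```python
# Two simplified node types of the internal workload tree. Each node
# accepts a visitor, which produces query text in the target dialect.
class Projection:
    def __init__(self, child, columns):
        self.child, self.columns = child, columns
    def accept(self, visitor):
        return visitor.visit_projection(self)

class InputRelation:
    def __init__(self, table):
        self.table = table
    def accept(self, visitor):
        return visitor.visit_input_relation(self)

class SqlWorkloadVisitor:
    """Generates default SQL; dialects override only what deviates."""
    def quote(self, identifier):
        return '"%s"' % identifier          # default identifier quoting
    def visit_projection(self, node):
        cols = ", ".join(self.quote(c) for c in node.columns)
        return "SELECT %s FROM %s" % (cols, node.child.accept(self))
    def visit_input_relation(self, node):
        return self.quote(node.table)

class MySqlWorkloadVisitor(SqlWorkloadVisitor):
    def quote(self, identifier):
        return "`%s`" % identifier          # MySQL quotes with backticks

tree = Projection(InputRelation("persons"), ["lastName"])
print(tree.accept(SqlWorkloadVisitor()))    # SELECT "lastName" FROM "persons"
print(tree.accept(MySqlWorkloadVisitor()))  # SELECT `lastName` FROM `persons`
```

Only the quoting method is overridden in the dialect subclass; the tree traversal itself stays in the shared base class.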
3.3.3 Database Connectors
To make Terry easily extendable with other database systems, we introduced the concept of Database
Connectors. Adding a new type of database can be done either by extending the abstract
DatabaseConnector class, or by extending the more specific JdbcDatabaseConnector class for JDBC
connections. See Appendix A for a more in-depth explanation of extending one of the two Database
Connector classes.
The Database Connectors take care of everything that is specific to a database system, from
connection handling to query generation in the target query language. The connector package also includes
abstract default implementations of an SqlSchemaVisitor and an SqlWorkloadVisitor. The only
abstract method in these classes is the getIdentifierQuoter() method. When this method is
overridden by providing an SqlIdentifierQuoter, default SQL will be generated for both the workload
and schema. Only parts of the query that deviate from the default have to be overridden to implement
a new SQL dialect. In our PostgreSQL implementation for example, the PostgresSchemaVisitor
overrides the visit(DecimalColumn decimalColumn) method to use the numeric column type
instead of the default decimal SQL type. The PostgresSchemaVisitor overrides another three methods
and the PostgresWorkloadVisitor only overrides one method; with minimal extra code we are able to
Figure 3.5: The internal representation of a schema with one Table, with the CREATE query in SQL
Server dialect. Note that a Schema can contain multiple Table objects, and that the primaryKey of
the Table can consist of multiple columns.
generate queries in a new SQL dialect.
Non-SQL databases can be implemented as well. We have implemented the start of a
MongoDbDatabaseConnector that successfully generates MongoDB queries from simple SQL
input. The support for SQL-like constructs in MongoDB is limited, however, so MongoDB's
functionality was not sufficient for our experimental workloads, but for simpler workloads the application is
able to generate correct MongoDB queries.
3.3.4 BenchRunner
The benchrunner package takes care of running the queries and anything that is necessary to generate
the desired output. Before the BenchRunner class starts running the experiments, the BenchPreparer
in the same package extracts subqueries, as inspired by Kersten [KKZ18]. Subqueries are extracted
from the query and added to the workload with a postfix to the query identifier. If query listing0
contains two subqueries, listing0-0 and listing0-1 will be added to the workload containing the
subqueries. Note that not all subqueries can be executed as separate queries, because they refer to
the parent query or other information from outside the subquery. The resulting timings will show up
as 0 in the output for these queries.
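The identifier postfix scheme can be sketched as follows (an illustrative Python fragment, not Terry's code):

```python
# Each extracted subquery inherits the parent identifier plus a
# zero-based numeric postfix, e.g. listing0 -> listing0-0, listing0-1.
def subquery_identifiers(parent_id, subqueries):
    return ["%s-%d" % (parent_id, i) for i in range(len(subqueries))]

print(subquery_identifiers("listing0", ["SELECT 1", "SELECT 2"]))
# ['listing0-0', 'listing0-1']
```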
The BenchPreparer also takes care of creating indices when this is desired by the end-user.
After the BenchPreparer has finished with the preparations, the BenchRunner starts running the
experiments. A separate MonitoringThread is started to collect the system statistics every 500 ms while
the queries are executed. Finally, the BenchRunner writes the results to files using the ResultsWriter
class.
3.3.5 Docker
The docker package adds support for Docker to Terry. When the type parameter in the database
configuration file is set to docker, Terry tries to spin up a Docker container with a database instance.
This feature makes it possible to capture database configuration and state, which makes experiments
portable and repeatable.
When using a Docker container, some additional configuration is necessary:
docker_image=[the docker image to use]
default_db_port=[internal database port]
docker_ready_log_line=[line in the Docker logs that shows that Docker is up and running]
stop_container=[true|false]
The first three settings are mandatory. The default_db_port is the database port within the
Docker container. The docker_ready_log_line is the log line that shows that the Docker container
is ready for use; Terry waits up to 30 seconds for this line to appear. It differs per Docker image but
will be something like Server is listening.
Terry will automatically destroy the Docker container on application exit. If the user would like to keep
the Docker container running for future use, stop_container can be set to false, or similarly on the
command line with --stop_container=false.
Sometimes environment variables need to be passed to the Docker container; this can be done with
the following configuration settings:
docker_envvars[0]=ENVVAR1=value1
docker_envvars[1]=ENVVAR2=value2
Other configuration settings are available, but need no changing in normal situations:

docker_address=[docker address for Terry to connect with, see Docker configuration]
default_port=[default set to 43210, the public container port to connect to]
docker_data_directory=[directory to bind with the data directory]
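Putting the settings together, a configuration for a PostgreSQL container could look as follows. The image name, ready log line and environment variable are assumptions based on the official postgres Docker image, not values prescribed by Terry:

```
type=docker
docker_image=postgres:10
default_db_port=5432
docker_ready_log_line=database system is ready to accept connections
docker_envvars[0]=POSTGRES_PASSWORD=secret
```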