
Master Graduation Thesis

Comparing MongoDB to SQL Server, Replicated

Khosrow Azizi

University of Amsterdam

Part-time two-year programme, Master's Software Engineering

Status: Final

University of Amsterdam Supervisor:

Dr. Tijs van der Storm

Host Organization:

BELLT-GCA

Host Organization Supervisor:

Ir. Jan Brandsma


Contents

Summary
Preface
1. Introduction
2. Motivation
   2.1. Project hypothesis
3. Background
   3.1. Relational databases
   3.2. Non-relational databases
   3.3. Used database variants
      3.3.1. Microsoft SQL Server
      3.3.2. MongoDB
4. Related work
5. Experiment
   5.1. Setup
   5.2. Data
   5.3. Data modeling
      5.3.1. SQL Server data model
      5.3.2. MongoDB data model
   5.4. Test Cases
   5.5. Application design
   5.6. Application execution
6. Results and analysis
   6.1. Insert results
      6.1.1. Comparison to original paper
   6.2. Update results
      6.2.1. Updating users last name by querying their first names
      6.2.2. Updating departments name by querying their ids
      6.2.3. Updating projects name by querying their ids
   6.3. Select results
      6.3.1. Selecting Departments by querying their ids
      6.3.2. Selecting Departments by querying a random name
      6.3.3. Selecting Users by querying their ids
      6.3.4. Selecting Users by querying a random first name
      6.3.5. Selecting Departments by querying a random User’s first name
      6.3.7. Selecting Users average age by querying random Project ids
7. Discussion
   7.1. Threats to validity
8. Conclusion
9. Future work
References
Appendix
   I. Database performance monitor interface source code
   II. Result graph of insert Project test
   III. SQL Server Results Data


“It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change.”


Summary

Parker et al. [1] studied the performance difference between MongoDB and Microsoft SQL Server on the basis of a number of insert, update and select scenarios. They conclude that MongoDB performs better for inserts, updates and simple queries, whereas Microsoft SQL Server performs better when updating and querying non-key fields (attributes) and for aggregate queries. In this Master thesis we replicate that study in order to validate its results. Moreover, we extend the study with more data in order to monitor how both databases perform as the data grows. Although the exact values in the results of the two studies differ, we also conclude that MongoDB performs better for inserts, updates and key-field queries, while SQL Server performs better for queries on non-key fields. For aggregate queries we see similar results as in [1], i.e. SQL Server performs better than MongoDB. However, the new aggregation framework of MongoDB provides a significant improvement over the method used in [1], resulting in slightly better performance than SQL Server. Furthermore, we show how a different implementation of the experiment can significantly affect the results. The results of this study help to answer the question whether or not to choose a NoSQL database, from the performance point of view, for a modest-sized database with structured data.


Preface

The two years of the Master Software Engineering programme next to my job have been a really challenging time for me. Yet, I am glad that I had the opportunity to do this Master study. I strongly believe that it has helped me to develop myself at the professional as well as the personal level. Therefore, I would like to thank my teachers Hans van Vliet, Hans Dekkers, Jan van Eijk, Vadim Zaytsev, Jurgen Vinju and Magiel Bruntink for their effort. I especially want to thank Tijs van der Storm for his good feedback and guidance as my supervisor for this Master thesis and as my teacher in the Software Construction course.

I also would like to thank my employer who made it all possible. I thank my managers Ir. Kees Buring and Ing. John van Leeuwen for their cooperation during my graduation project. I thank my colleagues at BELLT-GCA, especially my supervisor Ir. Jan Brandsma for his time and his useful feedback.

Last but definitely not least, I would like to take this opportunity to thank my parents for their love and their immense support. They have given much of their life to make it possible for me to follow my dreams.


1. Introduction

Parker et al. [1] conducted a study comparing a SQL database to a NoSQL database, focusing on the performance of processing a modest amount of structured data. They also try to address whether it is meaningful to use a NoSQL database instead of a SQL database. To this end, they performed an experiment comparing the performance of Microsoft SQL Server, as SQL database, to MongoDB, as NoSQL database, on the basis of a number of insert, update and select queries over varying amounts of data: three tables with a minimum of 4 and a maximum of 8192 records. The results of their study show that MongoDB performs better for inserts, updates and simple queries, whereas the SQL database performs better when updating and querying non-key attributes1 as well as for aggregate queries.

In this Master graduation thesis we replicate the mentioned study in order to validate its results. Moreover, we extend the study with more data to see how both databases perform when the data grows. In order to do so, we set up an experiment that matches the experiment from [1] as closely as possible. We note here that unfortunately we do not have the source code used in [1]. Consequently, this forms a challenge for us, since we have to set up an experiment according to what we can infer from the text of [1]. Furthermore, the hardware that we use in our experiment is not exactly the same as in the original paper. Therefore, we may obtain different result values than the original paper. However, we believe this should not really affect the overall picture of how these two databases perform with respect to each other.

The remainder of this paper is organized as follows. In chapter 2 we introduce the motivation and research question of this Master thesis. In chapter 3 we provide the reader with concise background information about SQL and NoSQL databases so that it is easier to understand the work in this paper. In chapter 4 we discuss work related to this thesis. We present our experiment in chapter 5. The results of our experiment are illustrated and explained in chapter 6. In chapter 7 we discuss our findings. We close this thesis with the conclusion and future work in chapters 8 and 9 respectively.

1 Non-key attributes are attributes that are not part of a key. Generally, most attributes are simply descriptive, and fall into this category.


2. Motivation

The relational database model has been the dominant model for data store systems in the last decades [1, 4, 5], and for good reason. Relational databases provide a robust and reliable solution for storing and managing persistent data, they are backed by extensive data management tools and there is a large pool of labor qualified to implement and maintain these systems [26]. Relational databases have been used for nearly every storage problem, even if a data model did not match the relational model well [7]. However, in the past few years, the “one size fits all” thinking concerning data store systems has been questioned by both science and web-affine companies, which has led to the emergence of a great variety of alternative non-relational databases [28]. These non-relational databases, commonly known as NoSQL databases [2], are becoming more and more prevalent [1, 2]. According to [26, 27], the NoSQL movement is motivated by various aspects like:

o Working with a lot of various data types, e.g. a mixture of structured, semi-structured, unstructured and polymorphic data
o Handling massive volumes of data
o Agility or speed of development
o Need for a data model that aligns with the Object Oriented paradigm
o High performance and scalability
o The desire to identify viable alternatives to expensive proprietary database software

Moreover, there are a number of domains (like web 2.0 applications, data warehousing, stream processing etc.) where relational database solutions are not well suited [4, 12]. NoSQL solutions try to get rid of the limitations that relational databases impose in such domains by, for instance, relaxing some database constraints like data consistency, incorporating in-memory data processing and simplifying the data models. Although there are various types of NoSQL database systems (like key-value data stores, document oriented data stores etc.), NoSQL databases share several key characteristics. NoSQL databases are designed to cope with large volumes of structured, semi-structured, unstructured and polymorphic data [26]. Compared to relational databases, NoSQL database systems claim to be more scalable and to provide superior performance [26].

Currently, there is much discussion in the database world about using NoSQL databases instead of relational databases. As a result, a natural question for organizations and application designers is whether or not to choose a NoSQL database [1]. From the theory [1, 2, 4, 6, 7, 8, 10, 13, 14] it is known that if the data is not structured and extremely large, then a NoSQL database is a good choice. However, it is not clear if using a NoSQL database instead of a relational database is a viable solution for a modest-sized database with structured data [1]. Obviously, answering such a question is related to answering some other sub-questions, such as:

o How does the NoSQL database under consideration perform with respect to its relational counterpart?
o What about the available database management tools?
o How is security managed in the NoSQL database under consideration?
o Is it possible to migrate from one NoSQL database to another one?
o What can we say about return on investment?

In this Master thesis we address the performance aspect of the above question. We do this by replicating the experiment done by Parker et al. [1], in which we compare the performance of MongoDB, as NoSQL database, to Microsoft SQL Server as its relational counterpart. In their study they conclude that MongoDB is definitely the choice for users who need a less rigid database structure. Moreover, MongoDB could be a good solution for larger data sets in which the schema is constantly changing, or in case the queries performed will be less complex. For those users that have a strict schema defined and a modest amount of structured data, they also found MongoDB to perform better than SQL in general. However, they add, there are downsides to MongoDB, such as its poor performance for aggregate functions and querying based on non-key values.

We hope with this Master thesis to validate the results from [1] in order to take a step forward in answering the question “whether or not to choose a NoSQL database in case of a modest-sized database with structured data, from the performance point of view”.


2.1. Project hypothesis

As we will see in the chapter “Background”, there are two different methods for modeling data in MongoDB, namely the normalized model and the denormalized model. Since the normalized data model is used in the original paper [1], we will also use it for modeling the data. Using the normalized data model makes MongoDB similar in its behavior to the SQL Server variant, since every object has its own collection and relations are implemented by a reference to the other objects (documents). However, a major difference between these two database systems is that SQL Server implements the full ACID properties while MongoDB eliminates some of them, like consistency. Although these rules are important in order to have a robust and correct database [11], maintaining the ACID properties becomes harder and requires a certain amount of overhead as the data grows. MongoDB avoids this overhead by eliminating some of the ACID properties, like consistency, which should result in better performance. For instance, when an object with a lot of references is inserted into the database, MongoDB does not need to check the existence of the referenced objects. Furthermore, SQL Server makes use of join operations for fetching and combining information from the different tables as needed. As the data gets bigger, such a join operation also becomes a time-consuming task, especially when it comes to joining many-to-many2 relations. MongoDB, on the other hand, does not need such a join operation, as a many-to-many relation between two documents (tables) can for instance be solved by putting references in an array in one of the documents in question.
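For illustration, such a relation could be captured by a hypothetical Project document that simply holds an array of User ids (the field names here are illustrative and not taken from [1]):

{ _id: 7, Name: "ProjectX", Users: [ 1, 2, 3 ] }

A query for the Users of this Project then only needs to read this array and fetch the referenced User documents, instead of joining through a separate link table.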

Therefore, in the context of this Master thesis we hypothesize that “in general MongoDB will perform better than or at least equal to SQL Server and therefore it could be used instead of SQL Server”.

2 In a many-to-many relationship, a record in one table relates to multiple records in a second table, and a record in the second table relates to multiple records in the first table. Source: http://office.microsoft.com/en-us/training/many-to-many-relationship-RZ006149432.aspx?section=26 [28-07-2014]


3. Background

Before we dive into the details of the experiment it is appropriate to have a general understanding of data modeling in relational databases and NoSQL databases. Moreover, in this chapter we also describe the database variants used in this Master graduation project, namely Microsoft SQL Server and MongoDB.

3.1. Relational databases

Relational database management systems (RDBMS) have been the predominant technology for storing structured data since the 1970s [4]. These database systems are widely known as SQL databases [2], named after the language they are queried by. The data in the relational model is usually represented by a database table (schema). Objects of the same type (i.e. with the same number of attributes of the same type and format) are grouped together in a table, making it structured data [1]. Relational databases store data in rows, whereby each row has the same number and type of data columns. Moreover, the data in relational databases is typically normalized, which results in the creation of multiple tables. Querying data which is spread over multiple tables requires fetching and combining information from one or more different tables. This process of combining information based on matching values of primary key and foreign key3 across multiple tables in relational databases is called joining tables.

Furthermore, an important property of relational databases is that they are based on the ACID principle, which defines a set of rules for transactions to ensure the integrity of data in the database [6]. In summary, ACID stands for:

o Atomic: a transaction is complete only when all of its operations are complete. For example, the insertion of an object is completed if and only if the insertion of all the related objects is completed too; otherwise a rollback is performed.

o Consistent: a transaction cannot leave the database in an inconsistent state. This means, for instance, that if during the insertion of an object one or more of its related objects cannot be found in the database, the operation is seen as illegal and a rollback is performed.

o Isolated: all transactions are independent and cannot affect each other. An example of this is that updating a row in a certain table will not affect the insertion of any other row in the same table.

o Durable: once a commit has taken place, the transaction cannot be undone. For instance, if an object is deleted this action cannot be undone.
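To make the atomicity and consistency rules concrete, here is a minimal C# sketch, not taken from the thesis experiment, that reuses the Person/Address example appearing later in this chapter: a Person and their Address are inserted inside one transaction, so either both rows are committed or both are rolled back.

using System.Data.SqlClient;

static class TransactionSketch
{
    // Inserts a Person and their Address as one atomic unit.
    public static void InsertPersonWithAddress(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction transaction = connection.BeginTransaction())
            {
                try
                {
                    using (var command = connection.CreateCommand())
                    {
                        command.Transaction = transaction;
                        command.CommandText = "INSERT INTO Person (Id, FirstName, LastName, Age) VALUES (1, 'X', 'Y', 24)";
                        command.ExecuteNonQuery();

                        // A foreign-key constraint on PersonId enforces consistency:
                        // this statement fails if the referenced Person does not exist.
                        command.CommandText = "INSERT INTO Address (Id, PersonId, StreetName, ZipCode, State) VALUES (200, 1, 'StreetX', 'XYZ', 'ABC')";
                        command.ExecuteNonQuery();
                    }
                    transaction.Commit();
                }
                catch (SqlException)
                {
                    transaction.Rollback(); // atomicity: undo everything done so far
                }
            }
        }
    }
}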

3.2. Non-relational databases

Non-relational databases have recently been getting a lot of attention as an alternative to relational databases. These non-relational databases are generally called NoSQL. It is commonly agreed that NoSQL stands for Not Only SQL [4]. Therefore, according to [4], the goal of NoSQL is not to reject SQL, but to be used as an alternative data model for applications that do not work well with the relational database model. NoSQL databases mostly differ from relational databases in their data model [6]. Many NoSQL databases organize the data into key-value pairs [1]. According to [1], the key is used to uniquely identify a particular data item and the value can be a simple word, number or a complex structure with unique semantics. Such a property can help in dealing with data that is not structured or is semi-structured, so that similar data objects can be grouped together even if the objects have different characteristics [1].

3 A table typically has a column or combination of columns that contain values that uniquely identify each row in the table. This column, or columns, is called the primary key of the table and enforces the entity integrity of the table. Because primary key constraints guarantee unique data, they are frequently defined on an identity column. A foreign key is a column or combination of columns that is used to establish and enforce a link between the data in two tables to control the data that can be stored in the foreign key table. In a foreign key reference, a link is created between two tables when the column or columns that hold the primary key value for one table are referenced by the column or columns in another table. This column becomes a foreign key in the second table.


In contrast to SQL databases, NoSQL databases relax or eliminate some of the ACID rules in order to achieve better performance and more flexibility. NoSQL databases consider the requirement of ACID properties to be too restrictive and impossible to achieve in distributed environments, as suggested initially by Brewer’s theorem on CAP [6]. Therefore, most NoSQL databases use an alternative principle called BASE (Basically Available, Soft state and Eventually consistent), which is based on the CAP theorem. In summary, CAP stands for:

o Consistency: it means that whenever data is written, everyone who reads from the database will always see the latest version of the data.

o Availability: it means that we always can expect that each operation terminates in an intended response.

o Partition tolerance: it means that the database still can be read from and written to when parts of it are completely inaccessible.

According to the CAP theorem it is impossible for any network shared-data system to guarantee all three of these properties simultaneously [6]. Such a system can provide only two out of the three properties. Figure 3.1 illustrates the possible choices between the CAP theorem properties. For example, a CA-based (Consistency and Availability based) system would mean that consistency and availability are paramount, which is achieved by not introducing partitions.


3.3. Used database variants

In this Master graduation project we use SQL Server 2014 Express Edition as the relational (SQL) database and MongoDB version 2.6.1 as the non-relational (NoSQL) database. In the following subsections we concisely describe these two database management systems in the context of this project.

3.3.1. Microsoft SQL Server

Microsoft SQL Server is a commercial relational database management system designed by Microsoft. In order to capture the semantics of the database, the data (objects) in SQL Server is represented by tables (schemas). Objects with the same number of properties, of the same type and format, are grouped together in a table with a column for each property, making it structured data. Each row in such a table represents a different object. As is generally the case with relational databases, tables in SQL Server are typically normalized, which results in the creation of multiple tables. The relations between tables are realized by use of the primary and foreign key concept. Furthermore, queries involving more than one table require a join operation. To make this more tangible, consider the following example. Assume we have a table named Person which defines and contains various person data like first name, last name etc., and a table named Address which defines and contains the addresses of different persons. Now, if we want to get the last name and the address information of all persons whose first name matches a random first name, we use a query that would look like this:

SELECT Person.LastName, Address.StreetName, Address.ZipCode, Address.State
FROM Person INNER JOIN Address ON Person.Id = Address.PersonId

Example 3.1: query snippet to combine data from different tables in SQL Server

3.3.2. MongoDB

MongoDB is an open source non-relational, document-oriented (NoSQL) database management system designed by MongoDB4. In MongoDB the objects are represented by documents (with a JSON-like format). Documents with the same characteristics/properties are grouped together in a collection. Data in MongoDB has a flexible schema. Unlike relational databases, MongoDB’s collections do not enforce document structure. According to [15], this flexibility facilitates the mapping of documents to an entity or an object. Each document can match the data fields of the represented entity, even if the data has substantial variation. In practice, however, the documents in a collection share a similar structure. In MongoDB there are two ways of representing the relations between data: referenced and embedded documents.

3.3.2.1. Referenced document

As stated in [15], references store the relationships between data by including links or references from one document to another. Applications can resolve these references to access the related data. Broadly, these are normalized data models. Figure 3.2 shows the normalized data model of the Person and Address in MongoDB.

4 Previously the company name was 10gen but it has now been changed to MongoDB. Source: http://www.mongodb.com/press/10gen-announces-company-name-change-mongodb-inc [28-07-2014]


Person document:
{ _id: 1, FirstName: "X", LastName: "Y", Age: 24 }

Address document:
{ _id: 200, PersonId: 1, Street: "StreetX", ZipCode: "XYZ", State: "ABC" }

Figure 3.2: Normalized MongoDB Data Model

3.3.2.2. Embedded document

According to [15], embedded documents capture relationships between data by storing related data in a single document structure. MongoDB documents make it possible to embed document structures as sub-documents in a field or an array within a document. These denormalized data models allow applications to retrieve and manipulate related data in a single database operation. Figure 3.3 shows the denormalized data model of the Person and Address in MongoDB.

Person document:
{ _id: 1, FirstName: "X", LastName: "Y", Age: 24,
  Address: { Street: "StreetX", ZipCode: "XYZ", State: "ABC" } }

Figure 3.3: Denormalized MongoDB Data Model

In this project, as we will see in the chapter Experiment, we use the normalized (referenced) data model in our experiment.


4. Related work

There are few academic papers which focus on the performance comparison of NoSQL to SQL databases. Therefore, we also consider other studies which focus on NoSQL and SQL concepts in general as related work to our paper, as they help us build a good basis for our research.

Li and Manoharan [2] compare read, write, delete, and instantiate operations on key-value stores implemented by NoSQL and SQL databases. Their results suggest that, although NoSQL databases are generally designed as optimized key-value stores, not all NoSQL databases perform better than SQL databases.

Leavitt [3] discusses some drawbacks of NoSQL databases. He argues that NoSQL databases face several challenges. According to him, since NoSQL databases don’t work with SQL queries, they require manual query programming, which can be fast for simple tasks but time-consuming for others. Besides, SQL databases natively support ACID, while NoSQL databases don’t, so NoSQL databases don’t natively offer the degree of reliability that ACID provides. He adds that if users want NoSQL databases to apply ACID constraints to a data set, they must perform additional programming. Since NoSQL databases don’t natively support ACID transactions, they could compromise consistency unless manual support is provided. Not providing consistency enables better performance and scalability, but is a problem for certain types of applications and transactions, such as those involved in banking. Furthermore, he believes most organizations are unfamiliar with NoSQL databases and thus may not feel knowledgeable enough to choose one or even to determine that the approach might be better for their purposes. Besides, unlike commercial relational databases, many open source NoSQL applications don’t yet come with customer support or management tools.

Nance et al [4] discuss the pros and cons of NoSQL databases. They note that NoSQL and SQL databases do not rule out one another but rather complement each other. They argue that the SQL database model and the NoSQL database model are each good for specific applications; the problem the organization is trying to solve will determine whether a NoSQL model or a SQL model should be used.

Mohan [5] reminds us to first look around and see what has been done in the past in the domain of database management systems. He criticizes NoSQL systems for lacking a common language, which causes vendor lock-in. Another criticism is the lack of a sophisticated query optimizer, which requires application developers to have more in-depth knowledge of the underlying database system. Moreover, system migration is considered by him to be a nightmare, as different advanced NoSQL systems provide varying data modeling constructs. In order to avoid these kinds of pitfalls, he believes that more of the work of the past decades in the database area should be fully leveraged to build more industrial-strength and easier to use NoSQL systems.

Indrawan-Santiago [6] compares ten NoSQL databases, including MongoDB, HBase and Cassandra, on the basis of their data model, transaction model, license, indexes, and sharding. She notes that limited business analytics capability is one of the major limitations of NoSQL databases. Furthermore, she suggests that many relational database end-users will find NoSQL databases difficult to use, as building queries in NoSQL requires more sophisticated programming skills. Therefore, providing a high-level query language will be important for the acceptance of this technology by end-users.

As choosing a suitable database seems to be very difficult, Hecht and Jablonski [7] evaluate the underlying techniques of NoSQL databases considering their applicability for certain requirements. They compare fourteen NoSQL databases, including MongoDB, CouchDB, Cassandra and HBase, on the basis of the data model, support for queries, partitioning, replication, and concurrency controls.

Barmpis and Kolovos [8] compare the conventional and commonly used persistence mechanisms in Model-Driven Engineering with novel approaches such as the use of graph-based NoSQL databases.

Cattell [9] gives a concise description of some important database characteristics like ACID in SQL databases and BASE in NoSQL databases. He also discusses different types of NoSQL databases like key-value stores, document stores and extensible record stores. Furthermore, he contrasts some NoSQL databases like Riak, MongoDB, SimpleDB, HBase and CouchDB on the basis of data model, consistency mechanisms, storage mechanisms, durability guarantees, availability and query support. He argues that some of these NoSQL databases sacrifice some of these dimensions, e.g. database-wide transaction consistency, in order to achieve others, e.g. higher availability and scalability.

Boicea et al [10] study the differences between Oracle Database (a SQL database management system) and MongoDB (a NoSQL, document-oriented database management system). The two databases are compared on the basis of theoretical differences, features, restrictions, integrity, distribution, system requirements, architecture, and query and insertion times. They conclude that if you want a fast, flexible database, you can rely on MongoDB; otherwise, if the speed of the database isn’t the main concern and you need relations between the tables and the collections, you can rely on the classic solution, the Oracle database.

Abramova and Bernardino [11] describe the characteristics and operational principles of NoSQL databases. However, the main focus of their paper is to compare two NoSQL databases, namely MongoDB and Cassandra. This comparison is done according to the Yahoo! Cloud Serving Benchmark (YCSB), which is one of the most used benchmarks for testing NoSQL databases [11]. They found that as the data size increased, MongoDB started to lose performance, while Cassandra got faster when working with larger amounts of data. They conclude that Cassandra showed the best results for almost all test scenarios they performed.

Stonebraker and Çetintemel [12] sum up the last decades of database management system (DBMS) development in one phrase: “One size fits all”. They argue that such a concept is no longer applicable to many of today’s database markets. To support their claim, they use examples from the stream-processing market and the warehouse market. In summary, they try to show that ACID properties are not required in most of these markets and that simpler, specialized performance constructs provide a better solution.

Pokorny [13] discusses NoSQL databases in the context of cloud computing. He also describes the basic and general properties of NoSQL databases.

Pritchett [14] provides a nice and practical paper. He gives quick background information which helps us to better understand some of the NoSQL and SQL principles. He concisely explains the CAP theorem and introduces the ACID and BASE properties and the differences between them. Furthermore, by means of a practical example, he explains how to incorporate the BASE property in a distributed system in order to achieve higher scalability.


5. Experiment

As mentioned before, we replicate the work of [1]. What makes this a challenging job is that we could not find the original experiment used in [1] or its source code, which forces us to design our own experiment from the text and explanations in [1]. In this chapter we explain the experiment design, the rationale behind it and how it is executed.

5.1. Setup

The experiment setup (deployment view) is depicted in figure 5.1. The experiment consists of four main parts, namely:

o Experiment Application: it runs on the “Experiment App Machine”
o SQL Server 2014 Express Edition: it is hosted on the “Database Server”
o MongoDB version 2.6.1: it is hosted on the “Database Server”
o Test results storage (a SQL Server database): it is hosted on the “SQL Server Database” machine

These parts will be explained in turn in a moment, but for now it is sufficient to know that these are the four main parts of the experiment.

For this experiment we use a client-server architecture. The client machine, running an Intel i5 Dual Core 2.3 GHz processor with 8 GB of DDR3 RAM and the Windows 7 64-bit operating system, hosts the experiment application. The server machine, running an Intel Pentium Dual Core 2.0 GHz processor with 4 GB of DDR3 RAM and the Windows 8.1 64-bit operating system, hosts both the SQL Server database and the MongoDB database. Moreover, a SQL Server database hosted on another physical machine serves as storage for the experiment results (this will be explained in more detail later on). The reason behind choosing such an architecture is twofold. First, we want to minimize the effect that any other application would have on the performance of the database server. Second, a client-server architecture is the most common solution (when it comes to database-related applications) within the context of the host company where this Master graduation project is taking place.

Furthermore, these three machines are connected to each other through an Ethernet switch, forming a network of three machines. We deliberately isolate the experiment network from the network of the host company in order to reduce the network latency and noise effects that we might otherwise have.

Figure 5.1: Experiment deployment view


5.2. Data

The experiment data consists of three objects representing Department, Project and User (see figure 5.2). There are relations between these objects, as a User has a particular Department and an array of Projects. Similarly, a Project is associated with a particular Department, a manager that is a User and also an array of Users of the Project.

Figure 5.2: Objects schema. Source: [1]

5.3. Data modeling

As these two databases use significantly different techniques for modeling the data, the objects schema in figure 5.2 will be transformed into a SQL Server data model and a MongoDB data model separately.

5.3.1. SQL Server data model

As mentioned before, in order to capture the semantics of the database, the data in the relational model (SQL) is usually represented by a table or schema. Objects with the same number of properties, of the same type and format, are grouped together in a table with a column for each property. Furthermore, the relations between the objects are realized by use of the primary and foreign key concept. Therefore, the implementation of the objects schema of figure 5.2 in SQL Server requires one table for each of the data objects, as well as an additional table representing the many-to-many relationship between the User and Project objects. This additional table replaces the array of Projects in the User table and the array of Users in the Project table. See figure 5.3.


5.3.2. MongoDB data model

As explained earlier, in MongoDB the objects are represented by documents. Documents with the same characteristics/properties are grouped together in a collection. Unlike relational databases, MongoDB’s collections do not enforce document structure. Each document can match the data fields of the represented entity, even if the data has substantial variation. Furthermore, as mentioned before, in MongoDB there are two ways of representing the relations between data: references and embedded documents. However, as we are replicating the work in [1], we will be using references (the normalized data model). References store the relationships between data by including links or references from one document to another. Applications can resolve these references to access the related data.

The MongoDB implementation of the objects schema in figure 5.2 requires one collection for each of the data objects: User, Department and Project. A relation between objects (such as the Department of a User) will be implemented by storing a reference to the other document. The downside of this approach is that MongoDB has no built-in way to retrieve an object based on a reference [1], which means that we must define our own method for retrieving the referenced data. Figure 5.4 depicts the MongoDB data model.
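As a hypothetical illustration of such a manual lookup (this is a sketch using the current MongoDB C# driver API and assumed collection and field names, not the thesis's actual code), resolving a User's Department reference takes two separate queries:

using MongoDB.Bson;
using MongoDB.Driver;

static class ReferenceLookupSketch
{
    // Fetch the User first, then issue a second query for the Department
    // it references (no join is performed by the database).
    public static BsonDocument GetDepartmentOfUser(IMongoDatabase db, int userId)
    {
        var users = db.GetCollection<BsonDocument>("User");
        var departments = db.GetCollection<BsonDocument>("Department");

        var user = users.Find(Builders<BsonDocument>.Filter.Eq("_id", userId)).FirstOrDefault();
        if (user == null) return null;

        return departments
            .Find(Builders<BsonDocument>.Filter.Eq("_id", user["DepartmentId"]))
            .FirstOrDefault();
    }
}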


5.4. Test Cases

In this experiment we will perform seven test cases with the same insert, update and select scenarios. Each successive test case has more data. Of the seven test cases, the first four are taken from the original work in [1] and the remaining three are extensions, since we would like to monitor how the performance of both databases is further affected as the data grows. Table 5.1 shows the number of Departments, Users and Projects for each of the seven test cases.

Test Case   Departments   Users   Projects
1           4             128     16
2           8             256     64
3           16            1024    512
4           128           4096    8192
5           128           4096    8192
6           1128          5096    9192
7           2128          6096    10192

Table 5.1: Test Cases

For the Insert portion of the test, each of the data items in table 5.1 will be inserted in the database under consideration.

For the Update portion of the test, three different types of updates will be used. The first type involves updating the name of one-tenth of the departments based on their primary key. The second type updates the last name of any user whose first name matches a chosen name. The third type updates the name of one-fourth of the projects based on their key. Each of these three types of updates will be tested on all test cases.

The Selects will be divided into two categories: simple and complex. The simple queries involve selecting data of only one object type. The complex queries involve multiple object types, nested queries, and aggregate functions.

There will be four different simple select queries. The first select retrieves a department by its primary key. The second retrieves a department by a randomly chosen name. All departments will have unique names, and the name will be guaranteed to match one result in the department’s collection. The third select retrieves a user by its primary key. The fourth select retrieves all users whose first name matches a randomly chosen first name; this randomly chosen name may not match any users in the collection.

Furthermore, there will be three different complex select queries which involve a join operation. The first complex select retrieves all departments that contain one or more users with a randomly chosen first name. Since first names are randomly assigned, this may or may not return any matches. The second complex select retrieves all users who work on any of three randomly chosen projects. Since users are randomly assigned to projects, this query may also not return any valid matches. The final complex select returns the average age of the users who work on any of three randomly chosen projects and thus contains the aggregate function average(). This final query may return zero if no users are assigned to the projects.
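As an illustration of this final complex select, the sketch below shows how the average-age query might be issued against SQL Server from C#. The table and column names are assumptions based on the data model of section 5.3 and are not taken from the original source code; a roughly equivalent MongoDB aggregation pipeline is indicated in the comment.

using System.Data.SqlClient;

static class AverageAgeSketch
{
    // Average age of users working on any of three randomly chosen projects.
    // A roughly equivalent MongoDB aggregation (shell syntax) would be:
    //   db.User.aggregate([
    //     { $match: { _id: { $in: userIdsOfChosenProjects } } },
    //     { $group: { _id: null, avgAge: { $avg: "$Age" } } }
    //   ])
    public static object GetAverageAge(string connectionString, int p1, int p2, int p3)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = connection.CreateCommand())
        {
            command.CommandText =
                "SELECT AVG(CAST(u.Age AS FLOAT)) " +
                "FROM [dbo].[User] u " +
                "INNER JOIN [dbo].[ProjectUser] pu ON pu.UserId = u.Id " +
                "WHERE pu.ProjectId IN (@p1, @p2, @p3)";
            command.Parameters.AddWithValue("@p1", p1);
            command.Parameters.AddWithValue("@p2", p2);
            command.Parameters.AddWithValue("@p3", p3);
            connection.Open();
            return command.ExecuteScalar(); // null/DBNull when no users match
        }
    }
}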


5.5. Application design

In order to test the performance of both databases we develop a Windows console application in C#. Since we want to perform the tests on both databases in exactly the same way, we make use of the interface concept from the object-oriented paradigm (a sketch of such an interface follows the list below). The application has five main parts:

o Database performance monitor interface: an interface that defines which test scenarios (methods) we are about to perform.

o SQL Server performance monitor: the concrete class that implements the above interface and uses Entity Framework and straight SQL commands to perform the needed operations in SQL Server.

o MongoDB performance monitor: the concrete class that implements the above interface and uses the MongoDB query writer to perform the needed operations in MongoDB.

o Test results recorder: functions as a layer for storing the test results and random test data in a SQL Server database.

o Experiment runner: the main program which defines the logic of performing the experiment. It uses the above interface to do the performance tests and stores the results in a separate database by use of another object called “Test results recorder”.
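The actual interface is part of the published source code (see the link below and appendix I); as a hypothetical illustration of the idea, it could look roughly like this (only the UpdateUserLastName signature is taken from code snippet 5.3, the other method names are assumptions):

using System.Diagnostics;

public interface IDatabasePerformanceMonitor
{
    double InsertDepartments(int count, Stopwatch stopwatch);
    double InsertUsers(int count, Stopwatch stopwatch);
    double InsertProjects(int count, Stopwatch stopwatch);
    double UpdateUserLastName(string firstNamePattern, string newLastName, Stopwatch stopwatch);
    double SelectDepartmentById(int id, Stopwatch stopwatch);
    double SelectAverageAgeByProjectIds(int[] projectIds, Stopwatch stopwatch);
}

Both the SQL Server and the MongoDB performance monitors implement this one interface, so the experiment runner can execute identical test scenarios against either database.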

The complete experiment application source code can be found (or cloned) at https://github.com/Khosrow-Azizi/MasterGraduationExperiment.git. In the following subsection we focus in more detail on how the experiment is executed.

5.6. Application execution

We consecutively carry out all test cases in one run with only one database at a time. Since we express the performance as average execution time, we run each test scenario in each test case 100 times, except for the select scenarios in test cases 2 and 4 where the number of runs is 200, in accordance with [1]. The reason for performing so many runs is to reduce the skewing effect that any outliers would have on the average execution time [1]. The results (i.e. the execution times) are subsequently stored per test case and test scenario in a database for further analysis.

Since timing plays the most crucial role in this experiment, it is very important to choose the appropriate tool. The default tool available in the C# library for timing purposes is the “Stopwatch” class in the “System.Diagnostics” namespace, which provides a set of methods and properties that can be used to accurately measure elapsed time [16]. The Stopwatch property we might use is “ElapsedMilliseconds”, which gets the total elapsed time measured by the current instance of the Stopwatch, in milliseconds [16]. However, the type of this property is long5, which means that we can only measure in terms of integer values. For instance, if an operation takes 0.4 milliseconds, ElapsedMilliseconds would indicate 0 milliseconds. This is obviously not acceptable in our experiment, since we are interested in the performance ratio between both databases and we would like to monitor the difference as accurately as possible. Fortunately, there is another way to solve this issue. We can use another property of Stopwatch called “ElapsedTicks”, which represents the number of elapsed ticks in the underlying timer mechanism. According to [17], a tick is the smallest unit of time that the Stopwatch timer can measure. In addition to ElapsedTicks, Stopwatch provides a property called “Frequency”, which indicates the timer precision and resolution. For example, a timer frequency of 2 million ticks per second equals a timer resolution of 500 nanoseconds per tick [18]. By using the Frequency property we can convert the ElapsedTicks value into a number of milliseconds, as illustrated in code snippet 5.1.

private double GetTimeElapsed(Stopwatch stopwatch)
{
    return ((double)stopwatch.ElapsedTicks / (double)Stopwatch.Frequency) * 1000;
}

Code snippet 5.1: Calculating elapsed time by Stopwatch. This code snippet is part of a larger program.

5 The type long represents a very big (8 byte) integer value. Source: http://msdn.microsoft.com/en-us/library/ctetwysk.aspx [28-07-2014]


Another important timing-related issue arises when we run the experiment on a multiprocessor computer, as is the case here. According to Microsoft [16], on a multiprocessor computer it does not matter which processor the application thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), we can get different timing results on different processors. This means that if we run the entire experiment repeatedly, there is a high possibility that we get different results each time. More importantly, as we run the experiment separately with only one database at a time, we could not justify the difference between the results, as it might be caused by the difference between the assigned processors. In order to solve this issue we should make sure that the application always runs on the same processor. In C# we can realize this by specifying processor affinity for the application thread. According to Microsoft [19, 20], we can use “ProcessThread.ProcessorAffinity”, which sets the processors on which the associated thread can run. An example of how to use this is illustrated in code snippet 5.2.

Process proc = Process.GetCurrentProcess();
ProcessThread thread = proc.Threads[0];
thread.ProcessorAffinity = (IntPtr)1;

Code snippet 5.2: Example of setting the processor affinity to the first processor. This code snippet is part of a larger program.

Moreover, since we are interested in the execution time spent by the databases to perform a certain operation, we try as much as possible to avoid the effect that other parts of the application could have on the measured time. To be more concrete, we time the execution of a certain test scenario at the moment that the call to the database in question actually takes place. Code snippet 5.3 shows an example of how such a timing is done in our experiment: the stopwatch is restarted and stopped right around the command execution.

public double UpdateUserLastName(string firstNamePattern, string newLastName, Stopwatch stopwatch)
{
    using (SqlConnection sqlConnection = new SqlConnection(connectionString))
    {
        sqlConnection.Open();
        using (SqlCommand command = new SqlCommand())
        {
            command.Connection = sqlConnection;
            command.CommandType = CommandType.Text;
            command.CommandText = "UPDATE [dbo].[User] SET Lastname = @newLastName WHERE FirstName LIKE @firstNamePattern";
            command.Parameters.AddWithValue("@newLastName", newLastName);
            command.Parameters.AddWithValue("@firstNamePattern", firstNamePattern);
            stopwatch.Restart();
            command.ExecuteNonQuery();
            stopwatch.Stop();
        }
        sqlConnection.Close();
    }
    return GetElapsedTime(stopwatch);
}

Code snippet 5.3: Example of how we measure the execution time in SQL Server. This code snippet is part of a larger program.

As stated before, some test scenarios use random data to perform a specific operation. In order to make sure that the experiment with both databases uses exactly the same random data, we store these data in a separate database (the test results database). Each time that we run the experiment these random data will be loaded into memory for further use in the application.


Moreover, according to [25], in MongoDB every document is stored in a record that contains the document itself and extra space, or padding, which allows the document to grow as the result of updates. All these records are located contiguously on disk, and when a document becomes larger than its allocated record, MongoDB must allocate a new record. New allocations require MongoDB to move the document and update all indexes6 that refer to it, which takes more time than in-place updates and leads to storage fragmentation [25]. Therefore, we try to avoid document growth where possible. For instance, we choose to have a static number of Users (three) per Project, as the User ids are embedded in the Project document in an array and growth of this array results in growth of the Project document as a whole. Such a decision should help us to narrow down the effects on the performance, so that we can justify our results more confidently. Furthermore, it is noteworthy that each time we run the experiment we first restart the database server machine, which hosts both SQL Server and MongoDB, so that we start with a clean state (i.e. both RAM and the databases’ execution plans7 are cleaned up).
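For illustration, a Project document with a fixed-size Users array might be constructed as follows. This is a hypothetical sketch using the C# driver's BsonDocument type; the field names are assumptions based on the data model, not taken from the original source code.

using MongoDB.Bson;

// Hypothetical Project document: the Users array always holds exactly three
// user ids, so later updates do not grow the document and force a record move.
var project = new BsonDocument
{
    { "_id", 1 },
    { "Name", "Project 1" },
    { "DepartmentId", 4 },
    { "ManagerId", 12 },
    { "Users", new BsonArray { 12, 13, 14 } }
};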

6 Indexes in databases are similar to indexes in books. In a book, an index allows you to find information quickly without reading the entire book. In a database, an index allows the database program to find data in a table without scanning the entire table. An index in a book is a list of words with the page numbers that contain each word. An index in a database is a list of values in a table with the storage locations of rows in the table that contain each value. Source: http://technet.microsoft.com/en-us/library/aa933129%28v=sql.80%29.aspx [28-07-2014]


6. Results and analysis

As mentioned before, we run each test scenario in each test case 100 times except for select scenarios in test cases 2 and 4 where the number of runs is 200 in accordance with [1]. After having run the experiment, the results are explained in this chapter by use of two-column graphs.

In each graph the Y-axis shows the execution time in milliseconds and the X-axis shows the test case number. Moreover, for each test case the first (blue) column shows SQL Server’s performance and the second (red) column shows MongoDB’s performance.

6.1. Insert results

Since the insert results in the original paper are combined in one graph, we also combine them in graph 6.1, which shows the overall insert performance, i.e. the average execution time of inserting Department, User and Project all together per test case. In accordance with [1], the inserts in SQL Server are performed by use of Entity Framework, and the inserts in MongoDB are performed by use of the MongoDB query writer (C# driver).
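As an illustration of how such inserts might be issued from C# (a hypothetical sketch only; the entity class, context and collection names are assumptions and the thesis's actual code is in the repository mentioned in section 5.5):

using System.Data.Entity;
using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical EF model and context.
public class Department { public int Id { get; set; } public string Name { get; set; } }
public class ExperimentContext : DbContext { public DbSet<Department> Departments { get; set; } }

public static class InsertSketch
{
    public static void InsertDepartment()
    {
        // SQL Server: Entity Framework issues the INSERT when SaveChanges() runs.
        using (var context = new ExperimentContext())
        {
            context.Departments.Add(new Department { Name = "Department 1" });
            context.SaveChanges();
        }

        // MongoDB: the C# driver writes the document directly into the collection.
        var departments = new MongoClient("mongodb://databaseserver:27017")
            .GetDatabase("experiment")
            .GetCollection<BsonDocument>("Department");
        departments.InsertOne(new BsonDocument { { "_id", 1 }, { "Name", "Department 1" } });
    }
}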

Graph 6.1: Overall insert results

As we can see, both databases perform worse as the database size gets bigger. However, there is a big difference in the execution times of the two databases: MongoDB is much faster than SQL Server, approximately 3.7 times (75:20) faster in test case 1 and approximately 4 times faster in test case 7. So, why is SQL Server that much slower than MongoDB?

We believe this performance difference is caused by three main factors. The first is the ACID property of SQL Server. For SQL Server this means that it needs to check the existence of the related data in the other table(s) on each data insert. For instance, when we insert a row in the User table, SQL Server needs to check whether the related DepartmentId already exists in the Department table, which adds extra overhead to the insert execution time. In order to justify this we use graph 6.2 and graph 6.3, where the individual insert execution times for Departments and Users are monitored (the results of inserting Projects can be found in appendix II).



Graph 6.2: Insert Department results

Graph 6.3: Insert User results

As we can see, in the case of inserting Departments SQL Server performs almost equally to MongoDB in most of the cases. However, when it comes to inserting Users we see that MongoDB performs better in all cases (except in test case 4), as inserting a User requires checking the existence of the DepartmentId in the Department table.



The second factor, and also the most influential one, behind the poorer performance of SQL Server is the use of Entity Framework for inserts. Using Entity Framework has a huge impact on performance when it comes to inserts of objects with many-to-many relations. As can be verified with SQL Server Profiler8 (see figure 6.1), when we insert a Project, for instance, Entity Framework creates an insert command for the Project under consideration and also creates a separate insert command per related User. This means that when we insert a Project with three Users there will be four separate insert commands, as follows:

 Insert Project
 Insert First ProjectUser
 Insert Second ProjectUser
 Insert Third ProjectUser

Figure 6.1 shows how the insert commands are received by SQL Server.

Figure 6.1: SQL Server Profiler showing the insert commands sequence

As a result, there will be four round trips9 to SQL Server. This obviously adds extra overhead to SQL Server’s execution time, since such an insert, as we will see in the next paragraph, could be achieved in just one round trip.

So, an appropriate question here would be whether Entity Framework is the right tool for this part of the experiment. We believe it is not, as there is a more efficient way of doing this, namely using a straight SQL command. With a straight SQL command we can insert projects more efficiently by doing all necessary work in just one command (and implicitly in one round trip), as follows:

 Insert Project and insert all ProjectUsers

Figure 6.2 shows how such a command is received by SQL Server.
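A minimal sketch of what such a single-round-trip insert might look like from C# (the table and column names are assumptions based on the data model described in section 5.3, not the thesis's actual code):

using System.Data.SqlClient;

static class SingleRoundTripInsertSketch
{
    // Inserts a Project and its three ProjectUser rows in one command (one round trip).
    public static void InsertProjectWithUsers(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = connection.CreateCommand())
        {
            command.CommandText =
                "INSERT INTO [dbo].[Project] (Id, Name, DepartmentId, ManagerId) VALUES (@id, @name, @depId, @mgrId); " +
                "INSERT INTO [dbo].[ProjectUser] (ProjectId, UserId) VALUES (@id, @u1), (@id, @u2), (@id, @u3);";
            command.Parameters.AddWithValue("@id", 1);
            command.Parameters.AddWithValue("@name", "Project 1");
            command.Parameters.AddWithValue("@depId", 4);
            command.Parameters.AddWithValue("@mgrId", 12);
            command.Parameters.AddWithValue("@u1", 12);
            command.Parameters.AddWithValue("@u2", 13);
            command.Parameters.AddWithValue("@u3", 14);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}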

8 The SQL Profiler tool can be used to debug, troubleshoot, monitor, and measure the application’s SQL statements and stored procedures. SQL Profiler captures activity occurring in SQL Server, driven by requests from the client application. Source: http://msdn.microsoft.com/en-us/library/ff650699.aspx [28-07-2014]

9 A round trip in this context means a cycle of a request from the client application to SQL Server and a response from SQL Server back to the client application.


Figure 6.2: SQL Server Profiler showing the insert command

As we can see in graph 6.4, this significantly improves SQL Server’s execution time.

Graph 6.4: Overall insert results, including inserts by straight SQL command

The third factor causing MongoDB to perform better than SQL Server is the way MongoDB handles insert or write operations. According to [15], in order to keep the database durable (i.e. to make it possible to recover from a crash), MongoDB copies the write operations to an on-disk file called the journal. However, writing to disk brings overhead with it. So, in order to minimize this overhead, MongoDB periodically (by default every 100 milliseconds) copies the write operations to the journal in batches called group commits. SQL Server, on the other hand, maintains the durability of the database by immediately logging the write operations to disk as the operations are received. This method is the default behavior of SQL Server and is called fully durable transaction commit [21]. Although this method guarantees a higher degree of database durability than the method used by MongoDB, we believe it also adds more overhead to the execution of the write operations.



6.1.1. Comparison to original paper

When we compare the insert results from graph 6.1 to those of the original paper (see graph 6.5), we see that in both studies the execution time of SQL Server as well as MongoDB increases as the data gets bigger. However, in our results MongoDB performs better than SQL Server in all test cases, whereas in the results of the original paper neither database performs consistently better. So, the obvious question is why there is such a difference in the results.

Graph 6.5: Overall insert results. Source: [1]

We believe there are several factors which could play a role here. Aside from the differences in hardware, the first obvious factor is the difference between the database versions used in our experiment and the ones used in [1]. We use MongoDB version 2.6.1 whereas they used version 1.7. As we found in the release notes of MongoDB, from version 1.7 onwards there have been a lot of improvements with regard to performance. For instance, according to [22], index performance enhancements have made the indexes 25% smaller and faster. However, it is not explicitly clear from the notes whether this also improves the insert/write operations. We are inclined to assume that the improvements to indexing could result in improvements for insertions, since every insert into a collection also involves modification of the “key field” index. Another important factor that adds differences to the results is that we do not have the source code used in [1]. This means that we are not sure how the execution timing is done there exactly. Therefore there is a possibility that our method of timing slightly differs from that of [1].


6.2. Update results

As mentioned before, all updates in SQL Server are performed using straight SQL commands and updates in MongoDB are performed using the MongoDB query writer. The update tests contain three scenarios, as follows:

o Updating Users’ last name by querying their first names
o Updating Departments’ name by querying their ids (keys)
o Updating Projects’ name by querying their ids (keys)

We run each update scenario 100 times per test case in order to reduce the skewing effect any outliers would have on the average execution time. The results of these test scenarios are discussed in the next subsections.

6.2.1. Updating users last name by querying their first names

In this test scenario the last name of any user whose first name matches a random name is updated. The results are shown in graph 6.6 and indicate that as the data gets bigger it takes longer for both databases to perform this update operation. However, as we can see, SQL Server consistently shows better performance than MongoDB in all test cases. We believe the performance difference here is due to how the data is defined in these databases. As stated before, SQL Server uses fixed/static schemas to define the structure of the data, whereas MongoDB has a flexible/dynamic schema which does not enforce the data structure. Having a static schema ensures that all items (rows) in the same table in SQL Server have the same structure, i.e. the same type and same number of fields. Having a dynamic schema, on the other hand, means that the items (documents) in the same collection in MongoDB do not need to have the same set of fields, and even the common fields of items in the same collection may hold different types of data. Therefore, when we query a non-indexed field, as is the case here, MongoDB has to perform more complex lookups on each data item, which results in a higher execution time.
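For comparison with code snippet 5.3 (the SQL Server side of this scenario), a hypothetical sketch of the MongoDB side using the current C# driver API might look as follows; the collection and field names are assumptions and the thesis's actual code is in the repository mentioned in section 5.5.

using MongoDB.Bson;
using MongoDB.Driver;
using System.Diagnostics;

static class MongoUpdateSketch
{
    // Updates the last name of every user whose first name matches,
    // timing only the actual database call (as in code snippet 5.3).
    public static double UpdateUserLastName(IMongoDatabase db, string firstName, string newLastName, Stopwatch stopwatch)
    {
        var users = db.GetCollection<BsonDocument>("User");
        var filter = Builders<BsonDocument>.Filter.Eq("FirstName", firstName);
        var update = Builders<BsonDocument>.Update.Set("LastName", newLastName);

        stopwatch.Restart();
        users.UpdateMany(filter, update);
        stopwatch.Stop();

        return ((double)stopwatch.ElapsedTicks / (double)Stopwatch.Frequency) * 1000;
    }
}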

Graph 6.6: Results of updating Users last name by a random first name

[Chart: UpdateUserLastNameByFirstName; time (milliseconds) per test case; series: SQL Server, MongoDB]


6.2.1.1. Comparison to original paper

When we compare the results from graph 6.6 to those of the original paper (see graph 6.7), we see that in both results the execution time in SQL Server as well as in MongoDB increases as the data gets bigger. Moreover, SQL Server performs better than MongoDB in all test cases in both results.


6.2.2. Updating departments name by querying their ids

In this test scenario we update the department name using its key (Id). Graph 6.8 shows the execution time per test case. As we can see, MongoDB outperforms SQL Server in all test cases. Moreover, SQL Server seems to have more trouble as the database size increases. It is noteworthy that SQL Server shows only a small variation in execution times as long as the data size does not make a huge jump. This can be seen in test cases 1 to 3, where the number of departments in the database doubles per test case (4, 8 and 16 respectively). However, the number of departments in test case 4 increases by 8 times the number of departments in its previous test case. This is where we see a significant increase in the execution time in SQL Server; from there the number of departments increases by 1000 per test case, which again results in only a small variation in the execution times. MongoDB, on the other hand, shows a small variation in execution time across all test cases. So why does MongoDB perform better than SQL Server?
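As an illustration of the operation being measured here, a keyed update could look as follows in both systems. This is a hedged sketch; the connection string, table/collection names, field names and the id value are assumptions.

```python
# Illustrative sketch of a keyed update (names and id value are assumptions).
import pyodbc
from pymongo import MongoClient

dept_id = 7  # some existing department key

# SQL Server: update through the primary key column.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Benchmark;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("UPDATE Departments SET Name = ? WHERE Id = ?", "Research", dept_id)
conn.commit()

# MongoDB: update through the document key (_id), which is always indexed.
departments = MongoClient("mongodb://localhost:27017")["Benchmark"]["Departments"]
departments.update_one({"_id": dept_id}, {"$set": {"Name": "Research"}})
```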

Graph 6.8: Results of updating Departments by random keys

We believe this performance gap is caused by the way both databases work. MongoDB stores and applies write operations in memory and in the on-disk journal before the changes are present in the data files on disk [23]. In order to do this it uses different in-memory views. One of these in-memory views is the private view, which stores data for use with read operations [23]. The private view is also the first place where MongoDB applies new write operations (such as updates). Moreover, when a write operation is submitted, MongoDB periodically copies the write operations to the journal in batches, called group commits, in order to overcome the disk commit overhead. SQL Server, on the other hand, handles this differently. When a query is submitted to SQL Server, it first loads the needed data from disk into memory (if it is not already there), then logs the modifications to disk, and after that applies the modifications to the data in memory [24]. Although this method guarantees a higher level of database durability, it adds more overhead to the execution time compared to how MongoDB does it.
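The sketch below illustrates how this trade-off surfaces in the MongoDB driver: by default a write is acknowledged once it has been applied in memory, while a journaled write concern makes the driver wait for the on-disk journal. This is generic driver behaviour shown for illustration; the connection string, collection and field names are assumptions, and this is not necessarily the configuration used in either study.

```python
# Illustrative sketch of MongoDB write acknowledgement levels (names are assumptions).
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

departments = MongoClient("mongodb://localhost:27017")["Benchmark"]["Departments"]

# Default write concern: the update is acknowledged once it is applied in
# memory; flushing to the on-disk journal happens later in a group commit.
departments.update_one({"_id": 7}, {"$set": {"Name": "Research"}})

# With j=True the driver waits until the write has reached the journal,
# trading part of MongoDB's speed advantage for stronger durability,
# closer to SQL Server's write-ahead logging behaviour.
journaled = departments.with_options(write_concern=WriteConcern(j=True))
journaled.update_one({"_id": 7}, {"$set": {"Name": "Research"}})
```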

[Chart: UpdateDepartmentNameByKeys; time (milliseconds) per test case; series: SQL Server, MongoDB]


6.2.2.1. Comparison to the original paper

When we look at the results of the original paper in graph 6.9, we see that here too MongoDB outperforms SQL Server in all cases. Furthermore, the variation in execution time is small across all test cases, particularly for MongoDB.

Graph 6.9: Results of updating Departments by random keys. Source: [1]

6.2.3. Updating projects name by querying their ids

In this test scenario we update the project name using its key (Id). The results are depicted in graph 6.10. As we can see, MongoDB outperforms SQL Server in all test cases. Although the execution time does not vary much from test case 4 to test case 7, it increases in both databases as the database size increases. We believe the same factors as in the previous scenario are involved here, resulting in a better performance of MongoDB compared to SQL Server, as both scenarios are much alike (see section 6.2.2). The difference is that here we are updating the projects, which also contain more data than the departments.

Graph 6.10: Results of updating Projects name by random keys

[Chart: UpdateProjectNameByKeys; time (milliseconds) per test case; series: SQL Server, MongoDB]


6.2.3.1. Comparison to the original paper

The results from the original paper are shown in graph 6.11. These results also show that MongoDB outperforms SQL Server in all cases. Furthermore, we also see that the execution time increases in both databases as the database size increases.


6.3. Select results

As stated earlier, we perform all select tests by using straight SQL commands in SQL Server and by using the MongoDB query writer in MongoDB. These select tests are divided into two categories, namely simple and complex.

The simple select category covers the select tests which involve only one table. These test scenarios are as follows:

o Selecting Departments by querying their ids
o Selecting Departments by querying a random name
o Selecting Users by querying their ids
o Selecting Users by querying a random first name

The complex select category contains join and aggregate queries. These test scenarios are as follows:

o Selecting Departments by querying a random User's first name
o Selecting Users by querying random Project ids
o Selecting Users average age by querying random Project ids (a sketch of how such an aggregate query might look is given below)
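The sketch below gives an impression of how the aggregate scenario in the last item could be expressed; it is an assumption-laden illustration, not the code used in this experiment. In particular, the connection string, the Users collection, the ProjectId and Age field names and the simplified data model (a single ProjectId field per user) are assumptions.

```python
# Illustrative aggregate sketch (collection, field names and ids are assumptions).
from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["Benchmark"]["Users"]
project_ids = [1, 2, 3]  # randomly chosen project keys in the real scenario

# MongoDB aggregation pipeline: filter users on their project ids,
# then compute the average age over the matching documents.
pipeline = [
    {"$match": {"ProjectId": {"$in": project_ids}}},
    {"$group": {"_id": None, "averageAge": {"$avg": "$Age"}}},
]
print(list(users.aggregate(pipeline)))

# The SQL Server counterpart would be a single aggregate statement, e.g.:
#   SELECT AVG(Age) FROM Users WHERE ProjectId IN (1, 2, 3)
```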

As mentioned before, in order to reduce the skewing effect any outliers would have on the average execution time, we run each test scenario multiple times. However, we note that the number of selects varies per test case: in test cases 1, 3, 5, 6 and 7 it is 100, and in test cases 2 and 4 it is 200, in order to be consistent with [1]. Hence it is not always useful to compare the execution times of the test cases to each other, but rather to compare the execution times of SQL Server to those of MongoDB. The results of these test scenarios are discussed in the next subsections.

6.3.1. Selecting Departments by querying their ids

In this test scenario we select a department by querying its key (Id). Graph 6.12 shows the execution time per test case. Although the size of the data increases from test case 1 to test case 7, the execution time of the select tests is also affected by the number of selects, as stated earlier. That is why the execution time in test cases 3 and 5 decreases with respect to test cases 2 and 4. Overall, MongoDB performs better than SQL Server in all test cases. We believe this is due to the way these two databases differ in handling the queries. The biggest difference between them here is that MongoDB uses an in-memory view (the private view) [23] to answer the queries, whereas SQL Server reads the queried data from disk. Hence MongoDB is faster than SQL Server in this case, as reading from memory is faster than reading from disk.
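For illustration, a keyed lookup could be expressed as follows in both systems. This is a sketch only; the connection strings, the Departments table/collection and the Id/Name field names are assumptions.

```python
# Illustrative sketch of a keyed lookup (names and id value are assumptions).
import pyodbc
from pymongo import MongoClient

dept_id = 42  # some existing department key

# SQL Server: primary-key lookup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Benchmark;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("SELECT Id, Name FROM Departments WHERE Id = ?", dept_id)
row = cursor.fetchone()

# MongoDB: lookup on the always-indexed _id field.
departments = MongoClient("mongodb://localhost:27017")["Benchmark"]["Departments"]
doc = departments.find_one({"_id": dept_id})
```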


6.3.1.1. Comparison to the original paper

The results from the original paper are shown in graph 6.13. These results also illustrate that MongoDB performs better than SQL Server in all cases.

Graph 6.13: Results of selecting Departments by keys. Source: [1]

6.3.2. Selecting Departments by querying a random name

In this test scenario we select a department by querying a random name. All departments have a unique name, which means that this test scenario always returns only one department. Graph 6.14 illustrates the results of this test scenario. As we can observe, SQL Server outperforms MongoDB in all test cases. Moreover, we see that as the data size grows the performance decreases in both databases. However, MongoDB has more trouble than SQL Server as the data grows. As was the case with the update scenario "Updating users last name by querying their first names" in section 6.2.1, we are here querying a non-indexed field. Since SQL Server uses a static tabular schema to define the data, it can locate the individual fields more easily than MongoDB can locate them in its dynamic schema (documents). This is why SQL Server shows a better performance in this test scenario.
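The sketch below shows how one could inspect what MongoDB does for this non-indexed lookup; on a collection without an index on the name field, the query plan reveals a full collection scan. The connection string, collection and field names and the queried value are assumptions.

```python
# Illustrative sketch: inspecting the plan of a non-indexed query (names are assumptions).
from pymongo import MongoClient

departments = MongoClient("mongodb://localhost:27017")["Benchmark"]["Departments"]

# Without an index on Name, MongoDB has to scan every document.
cursor = departments.find({"Name": "Research"})
print(cursor.explain())  # the plan shows a full collection scan in that case

# The SQL Server counterpart would be:
#   SELECT * FROM Departments WHERE Name = 'Research'
# which, without an index on Name, also scans the table, but over rows with
# a fixed, known layout rather than over schemaless documents.
```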


6.3.2.1. Comparison to the original paper

The results from the original paper are shown in graph 6.15.

Graph 6.15: Results of selecting Departments by a random name. Source: [1]

What is striking here is that these results suggest the opposite of our results. As can be seen, it is MongoDB that outperforms SQL Server here. This is remarkable, since this test scenario concerns in essence the same type of query as the test scenarios "Updating users last name by querying their first names" in section 6.2.1 and "Selecting Users by querying a random first name" in section 6.3.4. In all three test scenarios the operations involve querying a non-indexed field. In our paper the results of all three (the results of the latter scenario will be presented in a moment) are consistent with each other. In the original paper, however, the result of this test scenario is the one that is not consistent with the other two. So, what is happening here?

Unfortunately we could not find any justification for this deviation between the results, neither in the theory nor in the original paper itself. Could it be that the results of SQL Server in this test scenario were accidentally swapped with those of MongoDB?


6.3.3. Selecting Users by querying their ids

In this test scenario we select a user by querying its key (Id). Graph 6.16 shows the execution time per test case. We note that although the size of the data increases from test case 1 to test case 7, the execution time of the select tests is also affected by the number of selects, as stated earlier. Overall, MongoDB performs better than SQL Server in all test cases. As was the case in the test scenario "Selecting Departments by querying their ids" in section 6.3.1, this is because MongoDB uses an in-memory view to store and read its data, while SQL Server has to retrieve its data from disk. Since a disk fetch is slower than a memory fetch, MongoDB performs better than SQL Server.

Graph 6.16: Results of selecting Users by key

6.3.3.1. Comparison to the original paper

The results from the original paper are shown in graph 6.17. As we can see, these results also illustrate that MongoDB performs better than SQL Server in all cases.


6.3.4. Selecting Users by querying a random first name

In this test scenario we select users whose first name matches a randomly chosen name. This test scenario concerns in essence the same type of query as the scenarios in section 6.2.1 and section 6.3.2: all of these scenarios involve querying a non-indexed field. The results of this test scenario are illustrated by graph 6.18. As we can see, SQL Server outperforms MongoDB in all test cases. These results are consistent with those from section 6.2.1 and section 6.3.2. As stated in those sections, since SQL Server uses a static tabular schema to define the data, it can locate the individual fields more easily than MongoDB can locate them in its dynamic schema (documents), which results in a better performance.
