
High performance distributed storage and querying on microblogs

Tamasja Sastradiwiria

University of Amsterdam, Master Software Engineering


Table of Contents

1 Introduction
1.1 Theoretical background
1.2 Technical details of investigated search engines
1.2.1 Underlying data structure
1.2.2 Solr and ElasticSearch
1.3 Background
2 Related Work
3 Research Method
3.1 Design
3.2 Cluster setup
3.3 Configuration
3.4 Measurements
4 Results
4.1 Indexing
4.2 Full-text search
4.3 Full-text search with conditions
4.4 Full-text search with indexing simultaneously
4.5 Full-text search with conditions and indexing simultaneously
4.6 Memory Usage
4.7 Disk Usage
4.8 Response Time
4.8.1 Full-text search
4.8.2 Full-text search with conditions
5 Discussion
5.1 Underlying data structure
5.2 Solr and ElasticSearch
6 Limitations
6.1 Measurements
6.2 Virtual Environment
6.3 Choice of evaluated systems
6.4 Differences with benchmark
7 Conclusion
8 Bibliography
9 Appendixes
9.1 Result Tables
9.2 Configuration
9.3 Queries


1 Introduction

Social media such as Twitter are an interesting source of information. Tweets are typically short, but there is a large volume of them. This provides a challenge when a system is needed to obtain meaningful results from a set of Tweets in near real-time. This thesis project compares three existing, open-source search engines that are capable of full-text keyword search and a distributed setup: Solr, ElasticSearch and MongoDB. We will look at how they are structured, how they differ, and what influence on performance we expect from these differences. To test this in practice, a benchmark of these systems has been done and each system has been tested using a set of 300M Tweets for its ability to index and query results. This document contains the results of this comparison, describes how the benchmark was executed and presents the results obtained from it.

1.1 Theoretical background

Social media offer great research opportunities for social sciences, political sciences and policy makers. Microblogging services such as Twitter allow users to make short, frequent posts. Trends and interests can be found by looking for certain keywords, phrases and hash tags within the short messages published on Twitter, called Tweets. Trends and interests can be analyzed in various ways, such as comparing between different locations, at different times of day or between certain user groups.

Tweets are typically high in volume but short in length, as a Tweet is limited to a maximum of 140 characters. According to Twitter, around 500 million new Tweets are published per day. Due to the high number of Tweets that are generated, the collection of Tweets available to be analyzed will grow to a considerable size in both bytes and number of documents.

The Information and Language Processing Systems group (ILPS) is currently gathering Tweets on a daily basis and is interested in methods to store and analyze the data obtained this way. The ILPS is interested in making use of horizontally scaling distributed solutions to do this in order to better provide for the constantly growing amount of Tweets in the collection. A horizontally scaled system allows for adding more computer nodes to the cluster to expand available capacity.

Data should be accessible quickly; preferably, query results should be returned in near real-time, meaning a response reaches the requester in no more than several hundred milliseconds. Therefore the search provider should be capable of high query throughput and low query response times.

Due to the large amount of existing and newly generated data, indexing performance is also of concern. The search provider should be able to index new Tweets quickly and efficiently, especially since a stream of new Tweets is generated daily that has to be included. Although it is acceptable that there is a delay before Tweets become available for querying, the provider must be able to do both tasks at once, as the ability to query should exist at any time regardless of whether the index is being updated.

There are a number of search engine solutions available that can scale horizontally, such as ElasticSearch, Solr (SolrCloud) and MongoDB. The capability of these systems can be evaluated by the speed of processing operations and the resources they claim, such as RAM and hard disk space. These systems take different approaches while all providing full-text search capabilities. Solr and ElasticSearch both make use of Lucene as their core program for information retrieval, while MongoDB's search functionality is entirely integrated into the database itself. SolrCloud evolved distributed capabilities on top of Solr and is composed of various components, while ElasticSearch was purposely created for this from the ground up. It is an open question whether these systems should be based on a core information retrieval program, and whether a purposely built system provides advantages over an evolved and composed system.


1.2 Technical details of investigated search engines

1.2.1 Underlying data structure

ElasticSearch and Solr make use of Lucene as their core indexing program for full-text search, while MongoDB uses different algorithms for the same purpose and structures its database differently.

1 https://www.elastic.co/products/elasticsearch
2 http://lucene.apache.org/solr/
3 https://www.mongodb.org/
4 https://lucene.apache.org/core/

Lucene is a text search-engine library written in Java. It creates its index by extracting a list of keywords, called terms, from a document, together with the term frequency and a reference to the containing document. Lucene uses this dictionary of lexicographically sorted terms to look up terms and their associated properties. The resulting data structure is referred to as an inverted index.

Searching through the sorted terms dictionary is done using binary search, which has a complexity of O(log n), where n is the number of terms in the dictionary. In addition to binary search, recent versions of Lucene can make use of a finite state transducer (FST) for term prefix lookup to reduce the memory footprint of large term dictionaries.
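To make this structure concrete, the following is a minimal sketch, not Lucene's actual implementation, of an inverted index backed by a lexicographically sorted terms dictionary in which lookups are done with binary search as described above; the class and field names are illustrative only.

import java.util.*;

// Minimal sketch of an inverted index: a lexicographically sorted terms
// dictionary where each term maps to its postings (the ids of documents
// containing the term). Not Lucene's real data layout.
public class InvertedIndexSketch {
    private final String[] terms;    // sorted lexicographically
    private final int[][] postings;  // postings[i] = doc ids containing terms[i]

    public InvertedIndexSketch(SortedMap<String, List<Integer>> dictionary) {
        terms = dictionary.keySet().toArray(new String[0]);
        postings = new int[terms.length][];
        int i = 0;
        for (List<Integer> docs : dictionary.values()) {
            postings[i++] = docs.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    // Binary search over the sorted dictionary: O(log n) in the number of terms.
    public int[] lookup(String term) {
        int pos = Arrays.binarySearch(terms, term);
        return pos >= 0 ? postings[pos] : new int[0];
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> dict = new TreeMap<>();
        dict.put("solar", List.of(1, 4));
        dict.put("flare", List.of(4));
        dict.put("twitter", List.of(2, 3, 4));
        InvertedIndexSketch index = new InvertedIndexSketch(dict);
        System.out.println(Arrays.toString(index.lookup("solar"))); // [1, 4]
    }
}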

A Lucene index consists of several smaller immutable indexes called segments. Each segment is an index on its own and contains a terms dictionary. When entries are created, updated or deleted in the index, a new immutable segment file is created reflecting the new data. A bitmask is used on old, out-of-date data to exclude it from further queries. A drawback is that every commit to the index requires a new segment file to be created, and every update requires all document data to be rewritten entirely, even for small modifications.

This makes Lucene especially bad at small atomic changes, as they significantly increase the amount of segments and data to write. Additionally, performing an update essentially involves the same steps as creating a new entry and deleting the old one.

When performing searches against the index, having too many segments can negatively impact search performance, since many files have to be opened and considered before the complete result can be formed. To minimize this impact Lucene periodically merges segments into larger ones. This merging optimizes the index by reducing the number of files the index consists of and by excluding bitmasked old data, removing irrelevant segment files and old data inside segment files. The merging function of Lucene combines small segments into larger ones and keeps on combining until there is only a single segment or the segment is at its maximum size. The terms dictionary is recreated using merge sort. During this process all or a large part of the index will be recreated and rewritten, and this step-wise process is possibly executed multiple times. This makes optimizing very heavy on CPU and I/O, and therefore a relatively expensive operation. As work accumulates, optimizing also becomes more expensive the longer no merge operation has been performed.
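The merge step itself amounts to a merge sort over the segments' sorted term dictionaries, dropping postings that have been bitmasked as deleted. The following two-segment sketch illustrates the idea under simplified assumptions; real Lucene merges are considerably more involved.

import java.util.*;

// Simplified sketch of merging two segments' sorted term dictionaries into a
// single larger dictionary, as in the merge step of merge sort. Postings that
// belong to bitmasked (deleted) documents are dropped during the merge.
public class SegmentMergeSketch {

    public static List<Map.Entry<String, List<Integer>>> merge(
            List<Map.Entry<String, List<Integer>>> segA,   // sorted by term
            List<Map.Entry<String, List<Integer>>> segB,   // sorted by term
            Set<Integer> deletedDocs) {

        List<Map.Entry<String, List<Integer>>> merged = new ArrayList<>();
        int i = 0, j = 0;
        while (i < segA.size() || j < segB.size()) {
            int cmp;
            if (i >= segA.size()) cmp = 1;
            else if (j >= segB.size()) cmp = -1;
            else cmp = segA.get(i).getKey().compareTo(segB.get(j).getKey());

            String term;
            List<Integer> docs = new ArrayList<>();
            if (cmp < 0) {
                term = segA.get(i).getKey();
                docs.addAll(segA.get(i++).getValue());
            } else if (cmp > 0) {
                term = segB.get(j).getKey();
                docs.addAll(segB.get(j++).getValue());
            } else { // same term in both segments: combine postings
                term = segA.get(i).getKey();
                docs.addAll(segA.get(i++).getValue());
                docs.addAll(segB.get(j++).getValue());
            }
            docs.removeIf(deletedDocs::contains);  // exclude bitmasked old data
            if (!docs.isEmpty()) merged.add(Map.entry(term, docs));
        }
        return merged; // one larger segment replaces the two smaller ones
    }
}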

In MongoDB every database has a namespace file and several files containing collections and raw data. Every data file consists of multiple extents, which are essentially headers in a linked list. A collection can consist of one or more extents. Each extent has a pointer to the previous and next extents that make up part of the database. Each extent also references the first and last record of the document data it contains. Documents themselves are linked lists as well, linking to their next and previous document. This linked-list structure allows iteration over the extents and documents. Documents in MongoDB can increase and decrease in size. When the extent does not have space to fit the new data in a document, the document has to be moved to another extent. To counter this, MongoDB uses padding, which allocates extra empty space to allow documents to grow. When old documents are deleted, MongoDB will try to reuse the space by creating new documents in it, provided the deleted document left enough space. This means that as changes are made to the database, fragmentation occurs inside the data segments. As write operations take place, file locking is needed to ensure consistency.

In addition, to prevent loss of data due to partial writes, a transaction log is necessary to guarantee durability. MongoDB uses a journal to offer this guarantee to transactions. All operations are written to the journal before being applied to the data files. The journal is memory-buffered but flushed at frequent intervals, limiting the data that can be lost. A journal can be replayed, applying all recorded transactions to the database and bringing it back to a consistent state. The write guarantee can be altered with the driver's write concern; however, as the guarantee increases, so does latency.
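As a simplified illustration of the journaling idea (not MongoDB's actual on-disk journal format), every operation is first appended to a log and only then applied to the data files, so that after a crash the journal can be replayed to bring the database back to a consistent state:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified illustration of journaling: operations are appended to a log
// before being applied to the data files, so replaying the journal after a
// crash can restore a consistent state. The real journal is a buffered,
// on-disk file with its own format.
public class JournalSketch {
    private final List<String> journal = new ArrayList<>();

    public void write(String operation, Consumer<String> applyToDataFiles) {
        journal.add(operation);             // 1. record the operation first
        applyToDataFiles.accept(operation); // 2. then apply it to the data files
    }

    // After a crash, replaying the journal re-applies every recorded operation.
    public void replay(Consumer<String> applyToDataFiles) {
        for (String operation : journal) {
            applyToDataFiles.accept(operation);
        }
    }
}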

1.2.2 Solr and ElasticSearch

Operating on top of Lucene, Solr and ElasticSearch each further enhance Lucene's features and add various methods to optimize performance. Solr was created in 2004 as a web application wrapper around Lucene, making its functions more accessible through a cross-platform API without the need to integrate the Lucene library within an application. At first Solr was only capable of master-slave replication, and much of the management and load-balancing had to be done manually. It was only later, in 2012, that SolrCloud was introduced in Solr, adding many distributed capabilities. To help Solr with distributed cluster management, another program, ZooKeeper, is utilized. In this way Solr evolved towards distributed capabilities by adding new functionality to the old Solr and using external programs such as ZooKeeper to aid it.

ElasticSearch was created in 2010 and was built with distributed capabilities and cloud in mind from the start. It is specifically designed for distributed purpose and aims to make it easy to set up and maintain a cluster of ElasticSearch servers by automating many of its tasks.

Scaling

The primary way through which scaling is done in these systems is sharding. Sharding is the concept of splitting a database collection into multiple smaller collections. As Solr and ElasticSearch make use of Lucene, each shard consists of a smaller, separate Lucene index. Horizontal scaling is achieved by dividing these shards over multiple nodes in a cluster.

A limitation of this method is that, once shards are created, increasing the number of shards would require re-evaluating the data in the collection and can lead to data being unevenly distributed across nodes. Due to this, the ability to change the number of shards after the collection has been created is limited, or in the case of ElasticSearch, not supported. To simplify this scaling issue, a single node can hold multiple shards when the cluster is created, a practice called 'oversharding'. When new nodes are added to the cluster at a later time, shards can be moved to the new nodes with relative ease. ElasticSearch has automatic shard re-balancing, making it capable of automating the process of moving shards across nodes in the cluster. Solr requires manual action to move shards across the cluster, but unlike ElasticSearch provides methods to split existing shards. In both systems the number of shards chosen when the database is created is an important decision and can become a limiting factor in how much scaling can be done in the cluster without relatively expensive actions such as shard-splitting or recreating a large portion of, or even the entire, collection.
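The following sketch illustrates the hash-based routing idea behind sharding and oversharding; the hash function, shard counts and method names are illustrative and do not correspond to Solr's or ElasticSearch's actual routing formulas. Documents are assigned to one of a fixed number of shards chosen at creation time, and it is these shards, not individual documents, that are redistributed when nodes are added.

import java.util.*;

// Sketch of hash-based shard routing with oversharding: the number of shards
// is fixed when the collection is created, so scaling out later means moving
// whole shards to new nodes rather than re-hashing documents.
public class ShardRoutingSketch {
    private final int numShards;                        // fixed at collection creation
    private final Map<Integer, String> shardToNode = new HashMap<>();

    public ShardRoutingSketch(int numShards, List<String> nodes) {
        this.numShards = numShards;
        for (int shard = 0; shard < numShards; shard++) {
            // Oversharding: more shards than nodes, so several shards per node initially.
            shardToNode.put(shard, nodes.get(shard % nodes.size()));
        }
    }

    public int shardFor(String documentId) {
        return Math.floorMod(documentId.hashCode(), numShards);
    }

    public String nodeFor(String documentId) {
        return shardToNode.get(shardFor(documentId));
    }

    // Adding a node only requires relocating some shards, not re-splitting them.
    public void moveShard(int shard, String newNode) {
        shardToNode.put(shard, newNode);
    }
}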

To improve reliability, each shard can have replicas that can be promoted to master/primary shard in case the primary shard fails. Replication can also have a positive effect on performance, as operations that only read data can be delegated to a replica shard.

1 https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html
2 https://cwiki.apache.org/confluence/display/solr/Collections+API


Both Solr and ElasticSearch delegate work to the Lucene shards and, when needed, transparently combine the results of multiple shards for output. In this regard the way they work is similar and the architecture they use to achieve scaling is comparable.

Request Handling

When a request comes in, the receiving node has to know which node can fulfil the request and which shards need to be involved in it. This is done by internally routing the request to the correct nodes. Because of this, a client can make requests to any node in the cluster and get an appropriate response without needing knowledge about the cluster layout and responsibilities. To achieve this, the node handling the request needs to know about the cluster state. The implementation differs slightly between Solr and ElasticSearch: Solr mostly delegates this process to ZooKeeper, while ElasticSearch by default uses an internal system called Zen discovery to find other nodes within the cluster. This makes ZooKeeper play an important role within SolrCloud, as it is needed for proper load-balancing and keeping track of the status of the cluster. As ZooKeeper is an external process, it can be run on a different system than the Solr process, while ElasticSearch's cluster management is integrated into its own processes.

When messages from one part of the cluster are unable to reach another part, the cluster becomes divided and a so-called split-brain can occur. In this case each side thinks it is the complete cluster and accepts operations as such. This can lead to inconsistent views when clients unknowingly make requests to multiple parts of the cluster. When the cluster is rejoined while data in it has been changed, inconsistencies can occur that require conflict resolution for the inconsistent data. ZooKeeper avoids this split-brain scenario by rejecting requests on nodes that are split off from the main cluster, causing Solr to reject requests that would modify data when the cluster becomes split. ElasticSearch under most circumstances will automatically try to elect a new leader in order to continue operating, which can cause the split-brain scenario.

Under normal operation all nodes will have a consistent and up-to-date view through these mechanisms, so they should not affect the performance of the search engine. Although ZooKeeper is another external component, it does not directly influence Solr operations during a client request; it is only responsible for the data it uses in its own process.

We can state that Solr favors consistency over write availability and can be considered consistent and partition-tolerant. ElasticSearch attempts to maintain operation and favors availability while being partition-tolerant. However configuration options in ElasticSearch allow for different trade-offs between availability and consistency, making it possible to favor consistency and partition-tolerance.

When making changes to a SolrCloud cluster or modifying the ZooKeeper instances, the cluster, or part of it, must be taken down. ElasticSearch allows changes to be set dynamically and can automatically detect changes in the cluster. In practice ElasticSearch requires fewer manual actions on the cluster, making operating an ElasticSearch cluster less work due to more automated processes. Although this makes maintaining the cluster less work, it does not affect how requests are handled when the cluster operates as intended.

Transactions

Solr and ElasticSearch both make use of a transaction log. The transaction log makes sure operations are applied atomically. Each node has a transaction log (also called a write-ahead log) that atomically records operations. These operations are not committed to the Lucene index immediately but are processed asynchronously. In this way the transaction log is used as a mechanism to buffer operations that still have to be applied to Lucene. As single-document updates in Lucene are expensive, since each requires writing a new segment, bulk operations reduce the number of segments that are created. By sending the operations out at once, the log further causes the Lucene state to be updated less frequently, decreasing the frequency of cache invalidation. This comes with a trade-off, as changes to the data cannot be observed immediately, but only after a delay once the transaction log is applied. Both Solr and ElasticSearch have methods to allow results to include data from the transaction log, allowing some degree of real-time get, but these each come with limitations. Using the transaction log, Solr and ElasticSearch increase both durability and performance, at the cost of visibility.
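The following sketch shows the buffering role of such a transaction log in simplified form; the flush mechanism and the index interface are assumptions for illustration, not Solr's or ElasticSearch's actual internals. Operations are appended and acknowledged immediately, then applied to the Lucene-like index in bulk at a later commit.

import java.util.*;

// Simplified sketch of a write-ahead / transaction log used as a buffer:
// operations are appended (and acknowledged) immediately, but only applied to
// the underlying index in bulk, so far fewer segments are created and caches
// are invalidated less often. Changes are not searchable until the flush.
public class TransactionLogSketch {
    public interface LuceneLikeIndex {
        void bulkApply(List<String> operations); // one commit for many operations
    }

    private final List<String> pendingOperations = new ArrayList<>();
    private final LuceneLikeIndex index;

    public TransactionLogSketch(LuceneLikeIndex index) {
        this.index = index;
    }

    // Called per document operation: a cheap append, durable once written to disk.
    public synchronized void append(String operation) {
        pendingOperations.add(operation); // in a real system: also written to a log file
    }

    // Called periodically (e.g. every refresh interval): apply buffered work at once.
    public synchronized void flush() {
        if (pendingOperations.isEmpty()) return;
        index.bulkApply(new ArrayList<>(pendingOperations));
        pendingOperations.clear();
    }
}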

Caching

There are several differences in how Solr and ElasticSearch make use of caching. The most important Solr caches are as follows:

• Filter Cache: Stores unordered sets of documents. Each filter is executed and cached separately; results are returned by set intersection.

• Field Value Cache: Used for faceting and sorting.

• Query Result Cache: Stores the top N results of a query as an ordered set of document IDs. Since the full document is not cached, less memory is required.

• Document Cache: Stores Lucene document objects read from disk.

ElasticSearch makes use of the following cache mechanisms:

• Filter Cache: Caches filter results as a bitset. Multiple queries using the same filter can make use of the same filter cache.

• Shard Query Cache: When a request is handled, each shard executes the query individually, and the results are combined by the coordinator into a global result set. The shard query cache caches results from each shard locally. This cache only has to be invalidated when changes to the index affect the particular shard.

Many Solr caches apply to the entire collection and have to be invalidated when changes are made to the index. ElasticSearch only has to invalidate the caches affecting particular shards, allowing some caches to persist even while the index is being modified. This difference in behavior can affect how the performance of each implementation develops as the index changes.
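A deliberately simplified model of this difference, not either engine's actual cache implementation, is sketched below: with a per-shard cache only the shard that changed loses its cached results, whereas a collection-wide cache is dropped entirely on any change.

import java.util.*;

// Simplified model of per-shard result caching: when documents in one shard
// change, only that shard's cached query results are invalidated, while the
// caches of untouched shards keep serving hits.
public class ShardQueryCacheSketch {
    private final Map<Integer, Map<String, List<Integer>>> cachePerShard = new HashMap<>();

    public List<Integer> get(int shard, String query) {
        return cachePerShard.getOrDefault(shard, Map.of()).get(query);
    }

    public void put(int shard, String query, List<Integer> results) {
        cachePerShard.computeIfAbsent(shard, s -> new HashMap<>()).put(query, results);
    }

    // Index changed in one shard: only that shard's entries are dropped.
    public void invalidateShard(int shard) {
        cachePerShard.remove(shard);
    }

    // Collection-wide cache behaviour: every cached result is dropped on any change.
    public void invalidateAll() {
        cachePerShard.clear();
    }
}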

1.3 Background

Previous research on a collection of Tweets gathered by the ILPS has shown that using ElasticSearch in a distributed setup as both search provider and data store achieves high performance in indexing and querying, compared to using ElasticSearch as a search engine combined with a separate NoSQL solution for storage [1].

As the amount of needed computation power grows, vertical scaling becomes more and more expensive. Using horizontal scaling, the computation power can be divided over a larger number of relatively inexpensive nodes. Additionally, horizontal scaling has the benefit of being easier to scale, as one can add additional nodes to the system as the need grows, while replacing hardware in vertical scaling often creates down-time. This does not mean horizontal scaling comes without downsides: horizontally scalable solutions typically do not provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees, instead providing BASE properties (Basically Available, Soft state, Eventual consistency).

CAP Theorem

Conventional Databases generally adhere to the ACID (Atomicity, Consistency, Isolation, Durability) principle. The CAP theorem [2] states that due to the nature of distributed services, providing a reliable ACID guarantee while maintaining the desired performance is impossible.


For distributed services, the CAP theorem describes three guarantees for a service which are desirable:

• Consistency (the same data is always observed)

• Availability (operations never fail)

• Partition Tolerance (nodes can continue to operate even when messages between peers are lost)

Although all three are desirable, only two can be provided at a time; in essence one guarantee has to be sacrificed [3][4]. In general, systems built with a distributed setup in mind will guarantee partition tolerance and choose between consistency and availability.

• Consistent, Available (CA): systems that can only operate in the absence of a network partition. Examples of CA systems include: RDBMSs.

• Consistent, Partition-Tolerant (CP): systems that may forfeit availability by blocking certain transactions in the event of a partition to prevent inconsistency. Examples of CP systems: BigTable, HyperTable, HBase, Redis, MongoDB, SolrCloud.

• Available, Partition-Tolerant (AP): systems that will become inconsistent in the event of a partition in order to maintain availability. These systems usually provide “eventual consistency” (BASE). Examples of AP systems: Dynamo, Tokyo Cabinet, Voldemort, Cassandra, CouchDB, ElasticSearch.

The Inverted Index Data Structure

Lucene is used by both Solr and ElasticSearch and provides much of the search capabilities for both. Wrapping around Lucene, they provide their own APIs, add additional features, and make use of multiple Lucene indexes for their distributed capabilities. As such, an index in these search providers does not directly map to a Lucene index, but can be composed of a number of indexes. For these indexes, Apache Lucene makes use of an inverted index to map terms to documents. An inverted index is created by extracting a list of keywords or terms from a document, optionally with relevance weights [5]. The inverted file (index) consists of a sorted list of the extracted attributes, each containing a link to the document. Usage of an inverted index improves search efficiency, at the cost of additional storage for the inverted index and the need to update the index whenever a document changes.

An inverted file containing all terms of a document can potentially double the amount of required storage as the whole document has to be contained in the inverted file. Usually certain restrictions are applied when creating the inverted file, such as stop words (excluded words) or rules regarding punctuation marks.

Several structures can be used for storing an inverted file:

• Sorted Array: Array structure storing the list of keywords, together with the number of documents and links to the documents containing the keyword. Commonly binary search is used to search the array. A main disadvantage of this method is that updating the sorted array is an expensive operation.

• B-Tree: Each node in the tree contains the shortest word that distinguishes words in the next level, creating a tree structure where the keywords themselves are stored in the leaves of the tree. Compared to sorted arrays, B-trees require more space.

• Tries: A trie structure uses the digital decomposition of the set of keywords to represent those keywords.

1 http://blog.mongodb.org/post/475279604/on-distributed-consistency-part-1
2 https://cwiki.apache.org/confluence/display/solr/SolrCloud


2 Related Work

Twitter is increasingly being used to share critical information with a substantive, meaningful impact on society. Since Twitter's own search shows results in chronological order, it is only useful for finding what people are Tweeting about at the moment itself. A useful variation of this task would be to provide only the essential Tweets related to a topic, presenting just the most important Tweets needed to get a good overall view of the topic [6].

This requires two steps: the ability to select interesting topics, and the ability to select and present high-quality Tweets related to each topic.

Topic Detection

A method for finding these topics is identifying emerging topics (trends). An example of such a system is TwitterMonitor [7]. TwitterMonitor is a system that performs trend detection over the Twitter stream and is able to detect trends in real time. As every Tweet is associated with a timestamp and a profile, which often contains metadata such as name, location and a biographical sketch, this provides a wealth of information. Because trends are typically driven by emerging events, breaking news and topics that attract a great deal of attention, this information can be very valuable. The method used by TwitterMonitor first identifies keywords that suddenly emerge at an unusually high rate compared to their previous rate. After such a bursty keyword has been detected, keywords can be grouped in order to identify a set of keywords related to the same topic. To do so, for each keyword a short history is evaluated to see if it co-occurs with other keywords. Combined with the geo-location present in Tweets, local and global trends can be identified.

Search and scaling

Q. Liping et al. compared Lucene with Oracle using a data set of 700M tuples consisting of a timestamp, an id and a short text [8]. When querying Lucene and Oracle for keywords, both show increasingly larger response times as the data set grows, but at 100M tuples Lucene is considerably faster than Oracle by a very large factor. At 700M tuples Lucene is about twice as fast as Oracle. From this one can conclude that Lucene is superior to Oracle Text in this use case, although the difference is much larger on a smaller data set and Oracle steadily closes the gap until the maximum amount of data is used. When creating Oracle Text indexes on the tuples, the size increases by about three times, while the total size in Lucene increases by 1.8 times. Querying full-text searches in Lucene is, however, highly affected by available memory: the more memory is available, the more data can be loaded into memory, avoiding expensive hard disk operations. Additionally, keyword search is not a very CPU-demanding task, making I/O bandwidth the performance bottleneck, whereas index building is a CPU-demanding task. They also found there is no close connection between index building performance and the number of indexing threads.

Lucene also lies at the core of production search services at Twitter and Spotify. Spotify uses a Lucene-based search service for its full-text music search [9]. Some efforts have been made to increase the performance of these solutions. In an attempt to optimize search performance at Spotify, a new search method called 'swarm' was introduced. For reference, a vanilla ElasticSearch setup was used to compare benchmarks against, using 16 shards on 8 machines with a replication factor of 2. Using this setup, ElasticSearch's response time was able to outperform Spotify's original solution and swarm search on many queries. However, adding new documents to ElasticSearch using the HTTP API was very slow in comparison. While this paper gives good insights into the capabilities of ElasticSearch regarding querying, it does not look at indexing performance, memory usage or storage size of the index, even though these can restrict the maximum number of items that can potentially be indexed.


At Twitter, a Lucene-based solution called Earlybird is the core retrieval engine that powers Twitter's real-time search engine, specifically designed to handle Tweets [10]. To meet the demands of real-time search, two types of indexes are used: an optimized read-only index and an active writable index used to index new Tweets. Once the active index stops accepting new Tweets, it is converted to a read-only structure. While the described method of Earlybird uses Lucene, it is mentioned that a MySQL-based solution using inverted indexes was used before, but no mention is made of at which point this proved insufficient. It is important to note that the Twitter Earlybird method is specifically optimized for real-time search over a chronological listing of Tweets only. A distributed method using partitioning is used to reduce query and indexing latency, and replication is used to achieve robustness. While replication has some impact on latency, it increases throughput in a linearly scalable fashion.

Similarly, this latency impact is also found and reported in more detail in Solr and the Cloud, where it is mentioned that using Solr in a distributed environment through SolrCloud impacts performance negatively, and that using vertical scaling more cost-efficient results can be achieved [11]. According to Nillson, using a distributed setup has a very strong negative impact on indexing speed compared to a single-machine configuration. While this impact is initially very large, the performance difference does not significantly decrease any further as more shards are added. This effect is caused by the overhead involved in dividing the work among the shards in the cluster. When comparing the cost of virtual machines on the Windows Azure platform, it is more cost-efficient to first scale vertically by increasing the VM size, compared to scaling horizontally by adding more (small) virtual machines. However, horizontal scaling adds more robustness due to automated fail-over and increased replication. This suggests that usage of a distributed setup would be underwhelming for smaller data sets, but will be less expensive on very large data sets such as those in use at Twitter.

Further exploring the capabilities of ElasticSearch, Kerim Meijer conducted an experiment using ElasticSearch as a search engine in combination with a NoSQL storage solution, for the purpose of both storing and querying large amounts of data in a scalable, distributed way [1]. Benchmarks were performed with ElasticSearch in combination with Voldemort, Cassandra and MongoDB, as well as using ElasticSearch's internal storage method, with Tweets as data. Using 16M Tweets, these benchmarks reveal that using ElasticSearch as both search and storage engine performs very well. Compared to solutions combining ElasticSearch with a NoSQL storage solution, ElasticSearch shows higher throughput in both storing new Tweets and querying the index, especially when using no shard replication. However, ElasticSearch showed instability on multiple occasions, leading to questions regarding stability, especially as the data set grows larger.


3 Research Method

3.1 Design

In this experiment we will create a distributed setup for each search provider and run several benchmarks against these systems. From these benchmarks we want to be able to answer our sub-questions:

• Which search engine has the highest performance and efficiency when storing a large number of documents?

• At what speed can we obtain results from the search engine for our queries and what effects does a large number of documents have on this performance?

• Which search engine takes the least time to return a result after submitting a query, in particular when querying and returning many documents?

• What effects does simultaneous indexing have on querying performance?

• Which search engine makes the most efficient use of disk space and memory to maintain its index?

3.2 Cluster setup

As the experiment should resemble a distributed environment, each cluster will consist of five nodes. While a distributed setup comes in many different sizes and can grow much larger, five nodes provided a good balance between the cost of operating each node in this benchmark and having a fair number of nodes in the cluster to form a small-scale distributed system. A sixth machine running a custom-made Java application will perform the benchmarks by running several tests against the cluster in order and storing the results.

3.3 Configuration

When configuring the cluster, the default configuration was used as much as possible. This is expected to have effects on performance as most providers provide several settings which can be tweaked to increase performance, although often with a trade-off. Additionally, the operating system used a default configuration as well. It might be possible to increase performance by changing system or kernel parameters such as limits and swappiness, but these were not evaluated, even when recommended by the search provider. As previous research concluded that for ElasticSearch, highest performance can be obtained by using no replication [1], this experiment also uses no replication.

ElasticSearch

While ElasticSearch is based on Lucene, it uses its own query syntax. While it is possible to use the Lucene syntax, the ElasticSearch-specific syntax was used in our tests. ElasticSearch also uses analyzers to enhance the way results are stored and queries are evaluated, for which the defaults were used. Additionally, ElasticSearch has a delay before committing indexing operations, which by default is one second, providing near-real-time search.

Solr

In the case of Solr, the default Lucene query syntax is used, as well as periodic index commits similar to ElasticSearch, but with a default delay of 15 seconds.

(13)

MongoDB

To speed up indexing, MongoDB provides ways to trade off reliability by changing the request's WriteConcern. In our benchmarks the default write concern (Acknowledged) is used.

While evaluating MongoDB, performance increased significantly by changing this to the weakest write concern (Unacknowledged). This write concern will return success even if no attempt at storing the document is made. Similarly a higher write concern is expected to decrease performance while increasing reliability.
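For illustration, the sketch below shows how a write concern can be set with the MongoDB Java driver; the connection string, database and collection names are placeholders, and the exact driver version used in the benchmark may expose this slightly differently.

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: inserting a tweet with an explicit write concern using the MongoDB
// Java driver. ACKNOWLEDGED (the default used in the benchmark) waits for the
// server to confirm the write; UNACKNOWLEDGED returns without confirmation and
// is faster but gives no guarantee the document was stored.
public class WriteConcernSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://node1:27017")) {
            MongoCollection<Document> tweets = client
                    .getDatabase("benchmark")
                    .getCollection("tweets")
                    .withWriteConcern(WriteConcern.ACKNOWLEDGED); // or WriteConcern.UNACKNOWLEDGED

            tweets.insertOne(new Document("id", 12345L).append("text", "solar flare spotted"));
        }
    }
}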

Specific to MongoDB, its full-text index by default does not rank the results by relevance, which is different from the way Lucene returns results. While MongoDB does provide functionality to order these results, the large memory consumption of this method prevented us from using it in this experiment as the query would fail due to the system running out of resources. Therefore MongoDB results are not ordered by relevance.

3.4 Measurements

The following measurements will be gathered:

• Indexing performance by measuring the throughput while building the index.

• Query throughput: Measuring the throughput of full-text phrase searches, as well as full-text searches in combination with a filter, such as a date range or queries that only apply to a certain user. We will also measure query performance while simultaneously performing index building operations. This method will resemble a busy set-up where new content is continuously added while searches are being performed at the same time.

• Query latency: The amount of time it takes for a query to return a result.

• Resource usage: We expect disk space and memory to be consumed by the search provider as the search index grows.

To obtain these measurements, a number of tests were created. The benchmark program will perform each test, after which 5 million new Tweets are stored in the index and the tests will run again. This will continue until the source of new tweets is exhausted or the benchmark is stopped. The tests are as follows:

• Indexing, sending documents to the search engine to be indexed. Performance measured in number of documents indexed per second.

• Phrase Query, running full-text searches using a predefined list of keywords, which were picked at random. Only the top 100 results were requested. Measured in number of successful search query responses.

• Filtered Phrase Query, similar to the phrase query, except the results were filtered with an additional condition, such as a date range or selecting tweets from a selected user or with certain hashtags. Only the top 100 results were requested. Measured in number of successful search query responses per second.

• Loaded Phrase Query, measured performance of the phrase query test, while sending new documents to index to the search engine at the same time.

• Loaded Filtered Phrase Query, similar to the loaded phrase query, except using a filtered phrase query while indexing at the same time.

• Phrase Query Response time, a full-text query without result limit. Each keyword is selected in order to construct the query, which runs ten times, of which the average response time and number of returned results is reported. Before the test, each query runs 5 times to warm caches.

• Filtered Query Response time, similar to the phrase query response time, except that every combination of a keyword and condition is constructed.

• Resource Usage, finally, after each iteration we gather disk, memory and swap usage for each machine in the cluster and store the average. Disk usage was obtained by gathering the space used by the specific engine in the data storage folder of the used index only, while memory and swap are the total usage of the machine.

With the exception of gathering resource usage and response times, each test runs for a period of 60 seconds, after which the number of successful operations during this period is returned as the measurement. By using a fixed time period, the tests remain scalable, since the test run time stays the same no matter how the benchmark performs, and short-term fluctuations in the benchmark results are somewhat smoothed out.
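A minimal sketch of such a fixed-duration measurement loop is given below; the query-execution call is a placeholder for the engine-specific client code used by the actual benchmark program.

import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Sketch of a fixed-duration test: run queries from a preloaded in-memory
// queue for 60 seconds and report the number of successful responses.
public class TimedTestSketch {
    public static long runForOneMinute(Queue<String> preloadedQueries,
                                       Predicate<String> executeQuery /* placeholder */) {
        long successful = 0;
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
        while (System.nanoTime() < deadline && !preloadedQueries.isEmpty()) {
            if (executeQuery.test(preloadedQueries.poll())) {
                successful++;
            }
        }
        return successful; // operations completed in the 60-second window
    }
}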

The collection of Tweets that will be used consists of about 400M Tweets which were obtained from the Twitter API during a period of 4 months, from 31 January 2013 to 2 May 2013. This amounts to 162GB of compressed data and about 1TB of uncompressed data. The data is in JSON format and contains entries for newly created Tweets.

Queries are created from either a phrase or a combination of a phrase and a filter. Phrases are obtained using a precompiled list from the ILPS TREC 2013 micro-blog search, selected by crawling two months of Tweets in 2013 and generating a ranking from which phrases can be selected [12]. To create the additional filters, the first 6 weeks of Tweets were analyzed and filters were selected by popularity. Using this method, filters will return a higher number of results, ensuring the phrase part still has a significant impact on the query in addition to the filter.

Using these selection methods, 60 phrases and 35 filters were collected, resulting in 60 total queries for use in the phrase query test and 2100 queries for the filtered phrase query test, which combines each phrase with each filter.

Test Type               Phrases   Filters   Total Queries
Phrase Query            60        –         60
Filtered Phrase Query   60        35        2100

Table 1: Number of available queries.
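For illustration, the filtered query set can be derived as the cartesian product of phrases and filters (60 × 35 = 2100); the combined-string format in this sketch is only illustrative, since each engine uses its own query syntax as described in Chapter 4.

import java.util.ArrayList;
import java.util.List;

// Sketch: building the filtered query set as every combination of a phrase and
// a filter, yielding 60 * 35 = 2100 queries. The combined-string format is only
// illustrative; each engine uses its own syntax.
public class QuerySetSketch {
    public static List<String> combine(List<String> phrases, List<String> filters) {
        List<String> filteredQueries = new ArrayList<>(phrases.size() * filters.size());
        for (String phrase : phrases) {
            for (String filter : filters) {
                filteredQueries.add(phrase + " | " + filter);
            }
        }
        return filteredQueries;
    }
}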

Before each test, queries and documents used for indexing are constructed and loaded into memory on the system performing the benchmark. This minimizes the amount of CPU and disk access of this system, preventing a hardware bottleneck influencing the results. A queue of 200000 queries is created and up to 500000 new documents are loaded. This amount was tested to be sufficient to run the benchmarks using only documents and queries available in memory.


4 Results

Tests were performed using a cluster consisting of five virtual machines running CentOS 6.5 Server Edition 64-bit. Each virtual machine had 4 VCPUs at 2.7GHz and 32GB of memory. These five machines were configured as a cluster for the respective search engine being tested.

A single sixth machine was used to run the benchmarking software and orchestrate the tests. To obtain results, each test was run for 60 seconds at a certain index size, after which the number of documents was increased to the next target amount, in steps of 5 million.

The benchmarking software consisted of a program built in Java which was specifically created to execute the benchmarks in this project. To make running the benchmarks as easy as possible, the program was designed to automate many tasks. Once the desired number of virtual machines was set up, the program could be run and would install the needed software, configure the virtual machines into a cluster, process the source files containing Tweets and execute all desired benchmarks for the selected search engines. After the benchmarks were run, files containing the collected benchmark data could be fetched from the system. This method also ensures that every time the benchmark is executed, the same environment is recreated on all virtual machines, as it is embedded inside the program. The program was designed to allow for modification of the benchmarked systems and to be extendable with more search engines as needed. This proved useful during the project, as it made it possible to benchmark a single search engine while others were being developed, and allowed multiple search engine benchmarks to run in sequence, letting the program run during the night, which led to less idle time on the virtual machines. The full source code of the benchmark program is available on the project's GitHub repository.

When querying data there are often multiple methods available within each search engine. In this benchmark the following were used:

• ElasticSearch

◦ Phrase queries were constructed using a match query.

◦ Filtered phrase queries were constructed using a filtered query containing a match query and a bool filter using a term or range filter to create the conditions.

• Solr

◦ Queries in Solr were constructed using the standard Lucene syntax, e.g. 'text:solar flare'.

◦ Additional conditions were appended using the same syntax, e.g. 'hashtag:news'.

• MongoDB

◦ MongoDB full-text searches are done by creating a full-text index and using a $text query.

◦ Additional conditions are combined with this query.

More detailed technical information about this, and how the respective systems were configured can be found in Chapter 9.
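For illustration, the sketch below builds one filtered phrase query in each of the three styles described above and sends the ElasticSearch variant over HTTP. Host names, index and field names are placeholders, and the exact request bodies used by the benchmark program may differ; see Chapter 9 for the configuration that was actually used.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch of the three query styles used in the benchmark. Host names, index
// and field names are placeholders; consult Chapter 9 for the real setup.
public class QueryExamples {
    public static void main(String[] args) throws Exception {
        // ElasticSearch: match query wrapped in a filtered query with a bool/term filter.
        String esFiltered = """
                {"query": {"filtered": {
                    "query":  {"match": {"text": "solar flare"}},
                    "filter": {"bool": {"must": [{"term": {"hashtag": "news"}}]}}
                }}, "size": 100}""";

        // Solr: standard Lucene syntax, with the condition appended in the same syntax.
        String solrFiltered = "/solr/tweets/select?q=" +
                URLEncoder.encode("text:\"solar flare\" AND hashtag:news", StandardCharsets.UTF_8) +
                "&rows=100";

        // MongoDB: $text query combined with an additional condition.
        String mongoFiltered = "{ \"$text\": { \"$search\": \"solar flare\" }, \"hashtag\": \"news\" }";

        // Sending the ElasticSearch query over HTTP (Java 11+ HttpClient).
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://node1:9200/tweets/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(esFiltered))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(solrFiltered);
        System.out.println(mongoFiltered);
    }
}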


4.1 Indexing

Across the whole result set, ElasticSearch shows an average of 6864 inserts per second, making it the fastest search provider in this benchmark. The results show large swings in indexing speed between measurements, resulting in a high standard deviation of 754. Despite this, performance does not notably degrade as the index size increases.

Solr shows an average of 1011 inserts per second across the whole result set. Similar to ElasticSearch, Solr does not show a notable decrease in performance as the index size increases, although indexing rates before the first 50 million documents are slightly higher.

MongoDB starts with 1333 inserts per second but drops considerably after that, reaching its lowest point of 143 inserts per second at 15 million documents. At 35 million a small peak of 722 inserts per second can be seen, but most measurements do not show more than a few hundred indexed documents per second, making MongoDB the slowest search provider in this benchmark.

Fig. 1: Indexing (inserts per second vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.2 Full-text search

ElasticSearch starts at around 50 queries per second and ends at ~32 queries per second at an index size of 300 million documents, showing a slight downward trend in performance. It is noticeable that until around 200 million documents performance is fairly similar, with larger gaps between individual measurements displaying sudden peaks and lows in the results. After this point the decrease in performance becomes steeper and the gaps become much smaller, resulting in a flatter line.

Solr starts with ~38 queries per second and ends at 175 million documents with ~12 queries per second, showing a noticeable drop in performance right from the start. At 50 million documents Solr reaches around ~15 queries per second, which is a significant drop. After this mark the decrease in performance becomes much less steep, with results close to this number for the majority of the other measured points and only 3 queries per second lost compared to the largest index size measured. In all cases Solr has the lowest number of queries per second, making it the slowest search provider in this benchmark.

MongoDB starts at 482 queries per second at five million documents but drops very quickly, showing a very steep decrease in performance, especially before the first 30 million documents. At the last measured point, with 40 million documents, MongoDB shows ~64 queries per second and comes very close to ElasticSearch. MongoDB was only measured up to 40 million documents, but shows the highest numbers within its measured points.

Not shown in the graph but displayed in the results table is performance with an empty index, which is 2363 queries per second for ElasticSearch, 843 for Solr and more than 3333 for MongoDB. Because of the way the benchmark was performed, values above 3333 could not be measured, meaning MongoDB exceeds this value. This shows that when performing queries on an empty index, MongoDB has a very high throughput.

Fig. 2: Phrase Search (queries per second, log scale, vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.3 Full-text search with conditions

ElasticSearch shows a noticeable downward trend in performance. The decline is especially steep before the first 20 million documents, but query throughput steadily decreases at larger index sizes throughout the whole benchmark.

Similarly, Solr shows a noticeable decrease in performance as the index size increases. However, while Solr has a lower number of queries per second compared to ElasticSearch, at the 170 million mark performance is roughly equal, and at the 175 million mark Solr is slightly faster than ElasticSearch. Unfortunately, since no further data was gathered for Solr, we cannot see whether this trend continues. Similar to ElasticSearch, a large drop in performance can be seen at the start, but after 60 million documents the decrease becomes much less steep, which leads to the eventual crossing of the ElasticSearch line.

MongoDB starts highest with 468 queries per second, but also shows the steepest decrease in performance. At the next data point of 10 million documents, only 199 queries per second are measured, a considerable drop. Only ~57 queries per second are seen at its last measured point of 40 million documents, which is the lowest of all search providers.

Fig. 3: Filtered Phrase Search (queries per second, log scale, vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.4 Full-text search with indexing simultaneously

ElasticSearch starts at 37 queries per second, which is lower than the initial 50 queries per second seen when querying without indexing. Besides the lower performance the line looks similar, with a slight degradation as the index size increases, which becomes more noticeable at the 200 million documents mark.

Solr initially displays results similar to querying without indexing. The data points are more erratic than before, but the decrease in performance is still noticeable and the line is comparable to querying without indexing. Compared to querying without indexing, Solr even shows a slightly higher query throughput, which brings Solr performance much closer to ElasticSearch.

MongoDB displays a considerable drop in performance compared to querying without indexing. While at 40 million documents 64 queries per second could be achieved before, now only 7 queries per second are performed. Although the first two measurements show a throughput of 31 and 38 queries per second respectively, all other data points display a throughput around or below 10 queries per second, making MongoDB the lowest performing search engine in this benchmark.

Fig. 4: Loaded Phrase Search (queries per second vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.5 Full-text search with conditions and indexing simultaneously

Here ElasticSearch starts at 264 queries per second, compared to 402 queries per second when no indexing operations are being done. Solr starts at 117 queries per second compared to 265. Besides the lower throughput, both ElasticSearch and Solr display the same behavior as when querying without indexing operations. After an index size of 100 million documents, Solr performance starts to become roughly equal to ElasticSearch. Because ElasticSearch performance continues to drop at a higher rate than Solr, at 150 million documents we can clearly see the lines have crossed and Solr performance is highest, while ElasticSearch is higher at the lower index sizes.

MongoDB starts at 11 queries per second, with a sudden peak of 38 queries per second at 10 million documents. A slight downward trend can also be seen. At the last measured point of 40 million documents, 7 queries per second are achieved, which is considerably lower than the 57 queries per second achieved when not indexing.

Fig. 5: Loaded Filtered Phrase Search (queries per second vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.6 Memory Usage

A clear upward trend in memory usage can be seen for each search provider. Solr uses the most memory at every point compared to the other providers; especially at the start of the graph a large increase in memory can be seen. MongoDB shows the least amount of memory used, but displays a line going upwards at a slightly sharper angle than ElasticSearch.

Comparing at the last measured point of all providers (40 million documents), Solr uses 2642MB of memory, ElasticSearch 942MB and MongoDB 867MB.

Fig. 6: Memory Usage (average memory per node in kB vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.7 Disk Usage

MongoDB uses the most disk space of all providers, having both the highest disk usage at the start and the most sharply increasing usage. In the results of ElasticSearch a few peaks can be seen, after which the line continues in the same linear fashion.

Of all the benchmarked search providers, Solr uses the least amount of disk space.

Comparing at the last measured point of all providers (40 million documents), MongoDB uses 14.4GB of space, ElasticSearch uses 3.4GB, and Solr uses 1.9GB. Swap usage was also tracked during the benchmarks, but is not reported because none of the providers used any swap during the benchmark.

Fig. 7: Disk Usage (average disk usage per node in kB vs. number of documents in index; ElasticSearch, Solr, MongoDB)


4.8 Response Time

Unlike the previous benchmarks, queries were run with no limit on the number of results returned, and response time rather than throughput was measured. These results represent the time taken when N results are returned using an index size of 170M Tweets, which is the largest index size achieved with Solr and thus the largest index size at which a comparison can be made between Solr and ElasticSearch. As in the other benchmarks, data was gathered for every index size in steps of 5M. The raw data represented in these graphs is omitted from this document due to the large number of data points, but can be obtained from the repository of raw results on GitHub1. The 170M mark is arguably the most interesting, as this is the largest number of documents the benchmark was performed on. MongoDB is not represented as insufficient data was obtained.

4.8.1 Full-text search

While ElasticSearch starts below Solr, its response time ramps up at higher result counts, whereas Solr manages a steadier line at an average of 14.5ms. While ElasticSearch remains comparable for the majority of the results, around 125k returned results its response times get progressively worse. At around 12M returned results Solr remains at its average, while ElasticSearch takes 572ms to return the results.

1 https://github.com/twitm/TwitterTester2Results

Fig. 8: Phrase search response time (response time in ms, log scale, vs. number of results; Solr, ES)


4.8.2 Full-text search with conditions

ElasticSearch has a slightly lower response time than Solr. Despite its higher peaks, ElasticSearch shows a smoother increase as the number of results grows, while Solr shows very steep performance degradation after around the 1.5M results mark.

Fig. 9: Filtered phrase search response time (response time in ms, log scale, vs. number of results; Solr, ES)


5 Discussion

5.1 Underlying data structure

There is barely enough data on MongoDB to compare it with the other providers, but its indexing performance appears to be too low to handle these amounts of data. It is indeed very slow compared to both Solr and especially ElasticSearch. As MongoDB shows a very steep increase in disk usage (see Fig. 7), it is possible its performance was more significantly limited by I/O. To put the low observed performance into perspective, it should be noted that MongoDB is in essence a database engine rather than a search engine, and provides stronger guarantees regarding data loss and consistency.

When querying without any filter (Fig. 2), MongoDB initially starts highest, but shows consistently worse performance as the index size increases, although there is not enough data available to reliably identify a trend. It might seem at first that MongoDB would be preferable at lower index sizes (5 to 10 million), since it shows higher performance there.

It should be noted, however, that query results in MongoDB were not ordered by a relevance algorithm, unlike the Lucene implementation used by Solr and ElasticSearch. Although a relevance algorithm is present in MongoDB, applying it caused queries to time out due to high resource usage. As one of our conditions was that we wanted to obtain high-quality results, MongoDB is unable to satisfy this criterion here. When a filter is applied, the extra condition appears to have no or very little negative effect on querying performance (Fig. 3).

A very significant impact on performance can also be seen when simultaneously querying and updating the index (see Fig. 4 and 5), leaving MongoDB able to run only a few queries per second. The significant amount of disk space MongoDB uses could indicate that the low performance in other benchmarks was due to high disk usage and I/O blocking, suggesting MongoDB relies more on disk access, but as this was not measured directly, additional measurements would be required to clarify this.

5.2 Solr and ElasticSearch

When indexing new documents, ElasticSearch performs very well in comparison, reaching almost seven times the indexing speed of Solr (see Fig. 1).

Despite the differences in indexing speed, neither ElasticSearch nor Solr degrades notably as the index size increases, indicating that for this measurement we can expect both to keep scaling well to a larger number of documents than was used in this experiment.

As the number of documents increases, both Solr and ElasticSearch show only a slight degradation in query performance; ElasticSearch shows the highest throughput here. When a filter is included (Fig. 3), Solr appears to overtake ElasticSearch as the index size increases. Unfortunately, not enough results from the Solr benchmark were gathered to see whether the trend continues, but it appears that when additional filters are used, Solr will eventually outperform ElasticSearch at large index sizes. The notable difference is that ElasticSearch uses a dedicated filter syntax that is claimed to have performance benefits over the regular query construct, whereas Solr uses a more direct Lucene syntax. It appears that although ElasticSearch initially benefits from its filter construct, it becomes less efficient when a large number of documents must be examined. It should also be noted that using a filter yielded much higher throughput in this benchmark than full-text search only, with the exception of ElasticSearch above 270M documents, where throughput decreases more steeply. Solr in particular benefits from using a filter, as it shows much better performance than with full-text search alone.
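
To illustrate the two constructs being compared, the sketch below builds a Solr query with an additional filter via SolrJ, and shows (in comments) the JSON shape of an equivalent ElasticSearch request using its filter clause. Field names and values are illustrative, and the exact ElasticSearch syntax differs between versions.

import org.apache.solr.client.solrj.SolrQuery;

public class FilterQuerySketch {
    public static void main(String[] args) {
        // Solr: the main query (q) is scored using the Lucene syntax, while the
        // filter query (fq) restricts the result set without scoring and is cached.
        SolrQuery query = new SolrQuery();
        query.setQuery("text:concert");       // scored full-text clause
        query.addFilterQuery("lang:en");      // illustrative extra condition
        query.setRows(1000);

        // ElasticSearch expresses the same intent with its dedicated filter
        // construct; in current versions the request body looks roughly like:
        //
        //   { "size": 1000,
        //     "query": { "bool": {
        //         "must":   { "match": { "text": "concert" } },
        //         "filter": { "term":  { "lang": "en" } } } } }
    }
}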

When indexing and querying simultaneously, ElasticSearch shows a considerable drop in performance (see Fig. 4 and 5). The Solr results for phrase queries actually show a slight increase in throughput (Fig. 4); most likely this is just the result of small variations in the benchmark, as filtered querying also shows very similar results. Solr appears to be much less affected.

When measuring response times, ElasticSearch responds faster for full-text searches, but when a large number of documents must be returned, Solr has shorter response times (Fig. 8): ElasticSearch degrades as more results need to be returned, while Solr remains linear. When querying with a condition, Solr shows the best response time in all circumstances (Fig. 9). For real-time search, Solr therefore shows the best results when returning many results, or in all cases where a condition is used. If the required number of returned results is limited, say 1000 or fewer, both ElasticSearch and Solr can be considered viable solutions with acceptable response times for real-time search.

From Fig. 6 and 7 we can observe that Solr trades higher memory utilization for lower disk usage. As memory can be considered more expensive than disk space, ElasticSearch can provide a better scaling solution in this regard, although this is situational.

6 Limitations

6.1 Measurements

Resource usage during the execution of the tests was not measured; for example, I/O wait time, CPU load and network usage were not recorded. While the measured results are a consequence of these factors, no specific bottleneck can be identified because of this. Since these metrics would have to be monitored continuously for the duration of the benchmark, implementing them would have added considerable complexity.
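
As an indication of what such monitoring could have looked like, the sketch below samples aggregate CPU utilisation (including I/O wait) once per second by reading /proc/stat on Linux. It was not part of the benchmark and is only a minimal illustration.

import java.nio.file.Files;
import java.nio.file.Paths;

public class CpuSampleSketch {

    // Reads the aggregate "cpu" line from /proc/stat and returns
    // { idle + iowait ticks, total ticks } accumulated since boot.
    private static long[] readCpuTicks() throws Exception {
        String line = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] f = line.trim().split("\\s+");
        // f[1..] = user nice system idle iowait irq softirq steal ...
        long notBusy = Long.parseLong(f[4]) + Long.parseLong(f[5]);
        long total = 0;
        for (int i = 1; i < f.length; i++) {
            total += Long.parseLong(f[i]);
        }
        return new long[] { notBusy, total };
    }

    public static void main(String[] args) throws Exception {
        long[] prev = readCpuTicks();
        while (true) {                    // sample once per second during a test run
            Thread.sleep(1000);
            long[] cur = readCpuTicks();
            double busy = 1.0 - (double) (cur[0] - prev[0]) / (cur[1] - prev[1]);
            System.out.printf("cpu busy: %.1f%%%n", busy * 100);
            prev = cur;
        }
    }
}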

Due to the limited number of CPU hours allotted on the cluster, it was not viable to re-run the benchmark multiple times in order to increase the accuracy of the results. Each test in the benchmark runs for one minute, during which values are measured. This averages out short-term fluctuations and makes trends in the results more visible; however, since the duration is only one minute, the increase in accuracy is still limited.

6.2 Virtual Environment

Tests ran on virtual machines inside the SURFSara cluster. Load caused by other virtual machines can therefore influence the tests, which could introduce some inaccuracy into the benchmarks. Since we do not control these machines, their usage of the cluster is not visible and it is not possible to tell how much impact they had on the tests. As benchmarks take a considerable amount of time to complete, sometimes several days, it is not unthinkable that the usage of the cluster varied during a test. Additionally, since the virtual environment consists of several nodes, albeit at the same location, network load and speed can differ between machines. As with local resource usage, these variables were not measured.


6.3 Choice of evaluated systems

In our choice of systems, we looked at open-source solutions, how well they would be able to fulfil our requirements, and the popularity of the products. Other distributed systems that can provide the same services exist, but they were not evaluated in this benchmark. The results from these tests are therefore not conclusive, as systems that were not evaluated could turn out to produce more desirable results.

6.4 Differences with benchmark

During the benchmarks we only gathered the data needed to answer our questions. The obtained results sometimes raise questions about their cause. For example, the results show a significant gap between the speed at which Solr and ElasticSearch index documents, but it is unclear what caused this and which part of the system was the bottleneck. Since Solr and ElasticSearch both have Lucene at their core, somewhat similar results would have been expected. This would imply that the additional layer Solr and ElasticSearch provide has a severe impact on the final performance of the system. Another explanation could lie in the Java clients used to make requests, as the way requests are made could influence the benchmark results. While officially supported clients were used, a different client was used for each search provider. It would have been possible to eliminate this factor by using a common HTTP library to make requests to each engine, and it is unclear what impact the clients had on the results. Nevertheless, one could argue that using the official client to communicate with the search provider is the preferred method and part of the setup. The results obtained in this way do provide a means to compare the measured systems, but it is not clear what causes the bottleneck that leads to these results. To clarify this, additional measurements should have been taken, such as I/O wait time and actual CPU usage, so that more background information would have been available to explain the obtained results.
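
As an illustration of the common HTTP library alternative mentioned above, the sketch below sends a comparable full-text query to both engines through the same standard java.net.http client (Java 11+). Host names, index and collection names and field names are hypothetical; the point is only that both engines are reached through an identical code path.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CommonHttpClientSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Solr: full-text query expressed as URL parameters against the select handler.
        HttpRequest solrRequest = HttpRequest.newBuilder()
                .uri(URI.create("http://solr-node:8983/solr/tweets/select?q=text:concert&rows=1000&wt=json"))
                .GET()
                .build();

        // ElasticSearch: equivalent query expressed as a JSON body against _search.
        String esBody = "{ \"size\": 1000, \"query\": { \"match\": { \"text\": \"concert\" } } }";
        HttpRequest esRequest = HttpRequest.newBuilder()
                .uri(URI.create("http://es-node:9200/tweets/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(esBody))
                .build();

        // Sending both through the same code path removes the client library as a
        // variable, leaving only the engines themselves to compare.
        for (HttpRequest request : new HttpRequest[] { solrRequest, esRequest }) {
            long start = System.nanoTime();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.printf("%s -> %d in %.1f ms%n",
                    request.uri(), response.statusCode(), (System.nanoTime() - start) / 1e6);
        }
    }
}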

We initially hoped to capture more results using additional setups, varying application options and cluster configuration, using different replication options, and obtaining benchmarks for the full set of 400M tweets we had available. However, due to difficulties it was decided to simplify the experiment to a single setup with a maximum of 300M tweets. Ultimately only ElasticSearch managed to run the benchmark on all 300M tweets. The primary reason for this is that we experienced a lot of instability on the nodes of the SURFSara cluster due to a problem with the networking interface that caused kernel panics in the virtual machines at unpredictable times. Since a test setup consisted of six machines running at the same time, this occurred frequently. It caused a lot of issues, as the benchmark could not continue when a node went down and often had to be restarted.

Since there was a limited number of CPU hours available to run the experiment, we had to compromise. Furthermore, the tests took longer to complete than predicted, which was especially true for Solr and MongoDB as their indexing speed was significantly lower than that of ElasticSearch. Because the tests in the benchmark are run at different index sizes, indexing towards the next target index size must take place before the next benchmark can start, which takes time directly related to the indexing performance of the search engine. Since this directly influences the total time the benchmark has to run, it also increased the chance of one of the nodes experiencing a kernel panic within that time span, making the benchmark even more difficult to complete. For this reason we were only able to get Solr results up to 175M tweets, while ElasticSearch experienced almost no issues. MongoDB indexing was so slow that the benchmark was manually stopped at 40M tweets after running for several days; it would most likely have taken weeks to finish the benchmarks on the complete set of tweets.

The instability of the nodes we experienced makes the risk of using a computer cluster in this way clear. Since no replication was used, if a single node went down the search engine could no longer function and no service was provided. In essence, the availability of our search providers was extremely low, and it would only have decreased further had more nodes been added to the cluster. For the purpose of the benchmark, results would not have been valid if the cluster had partially gone down, even if it continued to operate, because the conditions would no longer be comparable. However, if this cluster were expected to operate with any guarantees on availability, running without replication would be inadequate in most situations.

Each search provider tested offers replication mechanisms to ensure that availability can be maintained when the cluster partially goes down. While this experiment was primarily focused on performance rather than availability, it would be valuable to obtain benchmarks while using replication, allowing for a more reliable system, since availability only decreases further as more nodes are added, which essentially limits the amount of scaling that can be done while still maintaining an operable state.

It would have given additional valuable insights to run these benchmarks using additional setups, in particular with different replication options. The results obtained during this experiment only give insights into the setup and configuration options that were used.
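
To give an impression of the replication options referred to above, the sketch below enables one replica per shard for an ElasticSearch index and creates a replicated SolrCloud collection through their HTTP APIs. Host names, index and collection names, shard counts and replication factors are hypothetical; MongoDB would instead be configured as a replica set (for example via rs.initiate() in the mongo shell).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReplicationConfigSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // ElasticSearch: add one replica per shard on an existing index.
        HttpRequest esReplicas = HttpRequest.newBuilder()
                .uri(URI.create("http://es-node:9200/tweets/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{ \"index\": { \"number_of_replicas\": 1 } }"))
                .build();

        // SolrCloud: choose the replication factor when creating the collection.
        HttpRequest solrCreate = HttpRequest.newBuilder()
                .uri(URI.create("http://solr-node:8983/solr/admin/collections"
                        + "?action=CREATE&name=tweets&numShards=3&replicationFactor=2"))
                .GET()
                .build();

        for (HttpRequest request : new HttpRequest[] { esReplicas, solrCreate }) {
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(request.uri() + " -> " + response.statusCode());
        }
    }
}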


7 Conclusion

In this thesis we compared three search providers and performed an experiment in order to find out which distributed search setup can achieve the best performance for indexing and full-text queries on a very large number of documents. At most, 300M tweets were indexed and used in these benchmarks. Meaningful results were only obtained for Solr and ElasticSearch; the results for MongoDB were limited. MongoDB showed disappointing performance compared with the other engines, which consequently hindered our ability to obtain more results for it. From our experiment we conclude that MongoDB is not suitable for full-text search in a distributed setup on such a large number of documents.

When adding new documents to the index, ElasticSearch is able to index the most documents per second; compared with the other search providers, its indexing performance advantage is significant. The number of documents already in the index does not notably impact indexing performance for any provider.

We looked at querying the search providers with full-text queries, and with full-text queries plus an additional condition, in order to find which provider delivers the highest throughput.

For full-text queries, ElasticSearch obtains the highest throughput. When a condition is applied, however, Solr can be expected to achieve the best performance when a very large number of documents is queried.

For full-text queries in ElasticSearch and Solr, the number of documents only affects performance slightly. With an additional condition the number of documents queried has a larger negative impact on performance.

Both ElasticSearch and Solr can be considered suitable for real-time search when the number of returned results is limited. At higher numbers of results, Solr has the lowest response times and performs best.

Using the search engines for querying and indexing simultaneously impacts performance, but does not change any of the previous conclusions.

ElasticSearch and MongoDB are lighter on memory, while Solr trades higher memory utilization for lower disk usage. As memory can be considered more expensive than disk space, Solr is not preferred here.

Between ElasticSearch and Solr, no search provider comes out ahead in all areas; the search engine should be chosen depending on which tasks are most important. When fast indexing is required and only full-text queries are performed, ElasticSearch appeared most suitable. When queries require conditions, or when real-time results are required over a large number of documents, Solr performed best in this setup.

From the results we can see that Solr and ElasticSearch outperform MongoDB, and conclude that basing the system on a dedicated information retrieval core, in this case Lucene, is advantageous.

In the technical comparison between Solr and ElasticSearch we found, at a high level, similarities in scaling, request handling and transactions. The differences between the two were not expected to influence performance; this turned out to be incorrect, as the results vary greatly between them. We can, however, identify similar trend lines in the results, indicating that both are capable of scaling equally well. As we cannot say that one should be preferred in every case, we conclude that a purpose-built system does not provide a performance advantage over an evolved and composed system.



9 Appendixes

9.1 Result Tables

Indexing throughput (documents per second) at each index size

Index Size     ElasticSearch   Solr      MongoDB
0              6752,98         984,40    1333,75
5000000        7318,23         1256,33   260,22
10000000       6746,35         1064,42   147,63
15000000       7387,08         1054,45   143,30
20000000       6351,53         1058,60   348,32
25000000       6893,13         1194,80   273,95
30000000       6147,72         1297,67   166,15
35000000       7492,35         1105,03   722,25
40000000       5452,08         1087,67   231,67
45000000       6933,50         1142,42
50000000       5973,17         919,87
55000000       7277,53         919,98
60000000       6166,67         945,30
65000000       6789,02         953,97
70000000       5363,05         924,93
75000000       7002,45         911,17
80000000       7337,55         995,12
85000000       7654,33         947,33
90000000       7510,98         1034,88
95000000       7924,63         964,37
100000000      7575,08         952,88
105000000      7531,57         963,35
110000000      8003,45         987,48
115000000      7536,65         1014,98
120000000      7499,30         847,23
125000000      7948,58         991,23
130000000      7719,75         1002,02
135000000      7696,05         1021,22
140000000      8011,22         983,72
145000000      7930,07         966,42
150000000      6159,98         1013,78
155000000      7818,28         982,02
160000000      7653,27         992,08
165000000      8333,33         998,78
170000000      7957,18         971,60
175000000      6782,15         974,92
180000000      6506,27
185000000      6466,32
190000000      6628,15
195000000      5779,63
200000000      6258,67
205000000      6677,60
210000000      6611,72
215000000      7015,35
220000000      6128,17
225000000      5415,57
230000000      6528,03
235000000      6732,40
240000000      6481,50
245000000      6364,58
250000000      6233,43
255000000      7256,23
260000000      6136,93
265000000      6324,55
270000000      6295,20
275000000      6018,78
280000000      6528,05
285000000      5515,88
290000000      6914,55
295000000      5867,28
300000000      7388,67
