
Master of Science in Software Engineering

Data replication algorithms in distributed databases

Author:

George Pachitariu

Supervisor:

Vadim Zaytsev

September 28, 2014

University of Amsterdam


Contents

1 Summary 4

2 Introduction 5

I Background and context 6

3 Paxos Algorithm 7

3.1 Description . . . 7

3.2 Example . . . 7

3.3 Algorithm analysis . . . 9

3.4 Usages of the algorithm . . . 10

3.5 Summary . . . 10

4 Megastore 11
4.1 Introduction . . . 11

4.2 Architecture . . . 12

4.2.1 Entities . . . 12

4.2.2 Partitioning data in entities . . . 13

4.2.3 Write-ahead log . . . 13

4.3 Optimisation to the Paxos algorithm . . . 14

4.3.1 Master based approach . . . 14

4.3.2 Megastore approach . . . 14

4.3.3 Read operation algorithm . . . 15

4.3.4 Write operation algorithm . . . 15

4.4 Summary . . . 16

5 Simulating Megastore context 17
5.1 Special terms used in experiments . . . 18

6 Study of the Megastore optimisation 19
6.1 Introduction . . . 19

6.2 Expectations . . . 19

6.3 Experiment . . . 20

6.4 Results . . . 20

6.5 Comparison of the success of the write algorithms . . . 20

II Adapting the scheme of Megastore for higher loads of write operations per second per entity 23

7 Introduction and experiment set-up configurations 24

8 Optimisation: Restarting operations when nodes receive signs that they might fail 26

8.1 Introduction . . . 26

8.2 Optimisation . . . 29

8.2.1 A problem with the optimisation . . . 30

8.2.2 Solution to the problem . . . 30

8.3 Experiment . . . 30

8.4 Related work . . . 31

8.5 Summary . . . 31

9 Improving the time for consecutive operations on the same node 33
9.1 Introduction . . . 33

9.2 Solution 1 . . . 33

9.2.1 The success point from the system perspective versus from author node perspective . . . 33

9.2.2 Experiment . . . 36

9.3 Solution 2 . . . 36

9.3.1 Introduction . . . 36

9.3.2 Optimisation . . . 37

9.3.3 Experiment . . . 37

9.4 Solution 1 versus solution 2 . . . 37

9.4.1 Discussion on the difference between optimisations . . . . 38

9.4.2 Why can I not use both optimisations? . . . 39

9.5 The difference between the two optimisations and the one from the Megastore paper . . . 39

9.6 Related work . . . 40

9.7 Summary . . . 40

10 Reducing the number of failed operations for non-primary nodes 42
10.1 Introduction . . . 42

10.2 The problem . . . 43

10.3 The solution and the optimisation . . . 43

10.4 Rationale on verification of correctness . . . 44

10.5 Experiment . . . 44

10.6 Related work . . . 45

10.7 Summary . . . 45

11 Reducing the number of times a node is not up-to-date 47
11.1 Introduction . . . 47

11.2 Problem . . . 47

11.3 Tentative solution . . . 48

11.4 Final solution . . . 48

11.5 Rationale on verification of correctness . . . 49


11.7 Downside . . . 49
11.8 Related work . . . 50
11.9 Summary . . . 50

12 Summary 52


Chapter 1

Summary

This thesis belongs to the databases domain. There are two big classes of databases: relational and non-relational. Relational databases are known for delivering the ACID (Atomicity, Consistency, Isolation, Durability) properties, but they are not easily scalable, so they cannot be used in a system with a large number of concurrent users. This is why some companies stopped using relational databases and switched to a type of database that scales better [4]: NoSQL databases. The problem with this type is that most of them do not offer strong consistency (the C in ACID) the way relational databases do.

Research question: How can a scheme of replicating data be designed in a distributed database in order for it to offer strong consistency and high availability? The scheme also has to offer high performance since it has to support interactive online services with a large number of users.

Strong consistency means that once an operation has been completed successfully, all the nodes (servers) see the same value.

High availability means that at any moment in time, the database is able to serve requests.

High performance means that the database is able to serve a high load of operations per second, with a small writing time per operation.

My research question is especially relevant because it asks whether you can combine the best of both worlds. Having the scalability of NoSQL databases and the consistency of relational databases in the same solution would provide a better choice for applications that need scalability.

The results and conclusions

I successfully created a database that offers strong consistency, high availability and high performance. I also increased performance further by designing my own optimisations for that database (which are described in the second part of this thesis).

The source code for the system that I developed as part of this thesis project is publicly available at:


Chapter 2

Introduction

Megastore is a software system developed by Google. The design of this system is described in [1]. This paper also explains how the system achieves high availability and strong consistency. The focus of this thesis is only on the data replication scheme from Megastore.

In Megastore every node can perform a write operation. To achieve consensus when there are multiple nodes trying to complete different operations in parallel, Megastore uses two mechanisms. The first one is similar to a master approach (explained in detail in section 4.3.2), but this approach does not work when nodes (such as the master node) fail. For those cases Megastore has another mechanism implemented, which is an implementation of the classical Paxos algorithm [12].

The Paxos algorithm is explained in detail in chapter 3 and the design of Megastore is presented in chapter 4.

After studying the Paxos algorithm and the Megastore software system, I implemented my own version of Megastore according to the design from the paper.

Then I went beyond the research paper and started making my own optimisations to the system, which are described in the second part of this thesis (from page 24 onwards). I selected four optimisations after carefully analysing several alternatives.


Part I

Background and context


Chapter 3

Paxos Algorithm

3.1 Description

The Paxos consensus algorithm [11, 12] is used in a system with a group of processes that can propose values. The algorithm ensures that the processes reach consensus on a single value.

The processes can have one or more of the following agent roles: proposer and acceptor. A proposer issues proposals and an acceptor accepts or rejects proposals. Agents communicate with each other using asynchronous messages. They can also fail or work at different speeds. Messages can take arbitrarily long to be delivered, can be lost, can be duplicated, but they are never corrupted.

If there are no proposals, then there will be no accepted values. If there is one value accepted, it has to be learnt by every process. Also, a process can not learn values that were not chosen.

The algorithm (according to [12]):

Phase 1. (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2. (a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.

(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
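To make the two phases above concrete, the following is a minimal sketch, in Java, of the acceptor-side rules from phases 1(b) and 2(b). The class and method names (Acceptor, onPrepare, onAccept) are my own illustrative choices and are not taken from the Paxos papers or from the system built for this thesis.

// Minimal sketch of the acceptor rules from phases 1(b) and 2(b).
// The names (Acceptor, Proposal, PrepareResponse) are illustrative only.
public class Acceptor {

    public static final class Proposal {
        final long number;      // unique, totally ordered proposal number
        final String value;     // proposed value
        Proposal(long number, String value) { this.number = number; this.value = value; }
    }

    public static final class PrepareResponse {
        final boolean promised;         // promise not to accept proposals numbered less than n
        final Proposal alreadyAccepted; // highest-numbered proposal accepted so far, or null
        PrepareResponse(boolean promised, Proposal alreadyAccepted) {
            this.promised = promised;
            this.alreadyAccepted = alreadyAccepted;
        }
    }

    private long highestPromised = -1;  // number of the highest prepare request answered
    private Proposal accepted = null;   // highest-numbered proposal accepted so far

    // Phase 1(b): answer a prepare request numbered n.
    public synchronized PrepareResponse onPrepare(long n) {
        if (n > highestPromised) {
            highestPromised = n;
            return new PrepareResponse(true, accepted);
        }
        return new PrepareResponse(false, null);
    }

    // Phase 2(b): accept the proposal unless a higher-numbered prepare was already promised.
    public synchronized boolean onAccept(Proposal p) {
        if (p.number >= highestPromised) {
            highestPromised = p.number;
            accepted = p;
            return true;
        }
        return false;
    }
}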

3.2 Example

Let us suppose that there is a system that is deployed on 3 servers, each server can receive values, and they all should reach consensus on a single value of x (a number).


The three servers are S1, S2 and S3. Prepare proposals look like P(proposal number); accept proposals look like A(proposal number, value).

Phase 1 (Fig. 1)

Let us say that a user writes 'x = 3' on S1 (represented as a grey arrow in the figure).

Then S1 sends a prepare request P(1) to S2 and S3, asking them to respond with a promise to never accept a proposal numbered less than n (n = 1), and the proposal with the highest number less than n that it has accepted, if any (the blue arrows in the figure).

S2 and S3 both respond in the same way, with acceptance and with no previous accepted proposals (the green arrows in the figure).

Phase 2 (Fig. 2)

S1 receives responses from a majority of the acceptors, so it can issue a proposal. Since it did not receive any previously accepted proposals, it can issue the value it originally received from the user (x = 3). So it sends the accept proposal A(1, 3) to S2 and S3 (the dark purple arrows in the figure).

S2 and S3 accept the proposal, and so consensus is achieved (Fig. 2). In the end they send acceptance messages back to the initial proposer, confirming that they accepted the new value (the pink arrows in the figure).


What if the acceptor nodes had previously accepted another proposal?

If S2 or S3 had already accepted a proposal before, they would return it along with the message accepting the prepare request. Also, when receiving such a message containing a proposal, the receiver (proposer) has the obligation of dropping its initial proposal and continuing the Paxos rounds with the newly received proposal (if there is more than one, it chooses to continue with the highest-numbered one). This brings us to an important conclusion:

Conclusion: The Paxos algorithm is not about choosing the latest value proposed in the system (in the case with previously accepted proposal, the latest proposal gets dropped). It is about all the nodes reaching consensus on only one value.

3.3 Algorithm analysis

To assure progress once a value is chosen, that value has to be written on disk. Otherwise, if all processes choose a value and then they all fail and restart, the chosen value is lost.

Since in the end the goal is to have a single value chosen, the number of processes that accept a proposal must be a majority (n + 1 out of 2n processes), because any two majorities will have at least one member in common, and that member will not be able to accept both of the two proposals.

The reason for which the proposals are numbered is the following. What if more values are proposed at the same time and all of them reach all nodes, but none of them reach a majority of nodes? Then it must be allowed for a process to accept more than one proposal. The acceptors keep track of proposals by assigning them a number. In this way a proposal will be composed from a value (that may repeat) and a proposal number (that is unique). A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. When this happens it is said that the proposal has been chosen.

The algorithm can allow for multiple proposals to be chosen if they have the same value. Note that the proposal numbers are totally ordered.


An optimisation could be defined for the algorithm in which an acceptor can ignore a prepare request with number n if it has already responded to a prepare request with a number higher than n. Now, an acceptor needs to remember only the highest-numbered proposal that it has accepted and the number of the highest-numbered prepare request to which it has responded.

It is easy to think of a case where two proposers are "fighting" each other in order to get their values chosen: one process issues a proposal which passes phase 1 but does not get accepted, because the other process issues a proposal with a higher number; the same then happens to the second proposal. Each proposal passes the first phase but fails the second one. A modification to the algorithm can solve this scenario by making a proposer back off for some time when its proposal gets rejected. The main drawback of this solution is that, by backing off, the time it takes for the proposer to complete successfully increases.
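As an illustration of this modification, the sketch below shows a proposer that backs off for a random amount of time after each rejected proposal before retrying. The PaxosRound interface, the retry bound and the doubling back-off window are assumptions made only for this sketch.

import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: a proposer that backs off for a random time after a
// rejected proposal, to avoid two proposers repeatedly overriding each other.
public class BackoffProposer {

    interface PaxosRound {
        // Runs phase 1 and phase 2 for one proposal; returns true if the value was chosen.
        boolean tryPropose(String value);
    }

    public static boolean proposeWithBackoff(PaxosRound round, String value, int maxAttempts)
            throws InterruptedException {
        long backoffMillis = 1;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (round.tryPropose(value)) {
                return true; // the proposal was accepted by a majority
            }
            // Back off for a random time before retrying, doubling the upper bound each time.
            Thread.sleep(ThreadLocalRandom.current().nextLong(1, backoffMillis + 1));
            backoffMillis *= 2;
        }
        return false; // gave up after maxAttempts rejected proposals
    }
}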

3.4 Usages of the algorithm

Paxos is used in many distributed systems, for a number of different purposes. Some of the most important systems where it is implemented are the following. Like Megastore, Spanner [6] uses a modified version of Paxos to replicate data. Windows Azure Storage [3] which is also a database uses Paxos to elect leaders (masters).

Chubby [2] uses Paxos to provide locking services. ZooKeeper [9] uses a modified version of Paxos to provide a service for coordinating processes in distributed applications.

Scalaris is a distributed hash table and it uses a modified version of Paxos to obtain strong consistency for operations [14].

Paxos achieves consensus on a single value when there are multiple proposers. Other systems use a master approach as an alternative to Paxos (Megastore [1] uses both: the Paxos and a master approach).

3.5 Summary

In this chapter I presented the Paxos algorithm and its underlying logic, with the help of an example. Then I analysed the constraints that this algorithm must satisfy in order for it to be successful.

Finally, I described several practical applications for the algorithm, within various distributed systems.


Chapter 4

Megastore

4.1 Introduction

Megastore is a database system used in interactive online services [1]. This chapter describes the replication scheme behind Megastore.

Megastore was designed in an environment which had the following needs:

1. It had to be highly scalable, because it was created for applications on the Internet, which have a potentially huge number of users.

2. It had to be easy to use. Online application developers compete for users and they want rapid development of features (fast time to market).

3. The service always had to be responsive, which implies a system with low latency.

4. It had to have strong consistency and durability. User operations should be consistent and not be lost.

5. Since users expect applications to be always available, the system was required to have high availability.

According to [1], Megastore is the largest system that uses Paxos to replicate primary user data across datacenters on every write. Systems usually use Paxos only for locking services.

Megastore has been widely deployed within Google for several years [7]. It handles more than three billion write and 20 billion read transactions daily and stores nearly a petabyte of primary data across many global datacenters. [1]

Megastore achieves strong consistency within small partitions of data like relational databases, and a high degree of availability like NoSQL data stores. Strong consistency is achieved by synchronously replicating data across a wide area network. It also provides Atomicity, Isolation and Durability, like normal relational databases.

The focus of this thesis is only on the data replication scheme from Megastore. From the perspective of a layered architecture there is the data replication module as a layer, then there is a layer called ”API” above it and a layer called ”Bigtable” below the data replication layer.


The API layer offers more types of operations on the database. While the replication layer by itself only provides the operations put() and get() (because the database is a big hash table), the API layer offers more ways to manipulate data (e.g. it offers an operation to select a range of keys from the table).

The layer below, Bigtable, is responsible for storing the data and writing it on disk.

The replication module is only responsible for propagating (and achieving consensus on) the write operations and writing them to the log. Also, the performance times I measure in this thesis are only related to the replication module. The performance of the layers above and below is not discussed here.

4.2 Architecture

4.2.1 Entities

The database is split into small partitions called entities. Megastore offers strong consistency for operations inside entities but only eventual consistency for operations across different entities.

To achieve a higher degree of availability, each entity is replicated on two or three servers. To keep a high degree of availability even when a datacenter goes down, each replica is stored in a different datacenter. Examples of events that can take down a datacenter include power outages and problems with the cooling system.

It is important to note that the asynchronous transactions are between different entity groups that are in the same datacenter. The network communication between datacenters consists only of synchronous, consistently replicated operations.


4.2.2 Partitioning data in entities

The partitioning of data into entities is done by the developers. They have to partition the data in such a smart way that the entities are big enough for the operations that need strong consistency to operate on data that is in the same entity. But also, if an entity contains a lot of unrelated data, then there will be many operations that are serialized, despite the fact that they could be performed in parallel without affecting the correctness of the system. Serializing many operations decreases the number of operations per second that the system is able to complete. The goal is to have only a few operations per second per entity.

An example of good partitioning is in an e-mail application. Each e-mail account forms a natural entity. Operations within an account are consistent (like labeling an email as important). The user will see the change despite failures on some replicas. Also, sending messages between different accounts (entities) is not synchronous, but it does not affect the user experience. Users expect that it will take a few seconds for the message to arrive at the recipient.

4.2.3 Write-ahead log

The system uses a write-ahead log for managing the database operations. Using a single log for all the entities would limit performance because it would be used at every write of the database system. Megastore uses multiple logs, one for each entity. Since each entity is only on some (two or three) of the nodes, network communication is not very high.

Making a write operation on the database consists of encapsulating the operation in a log cell and then propagating (replicating) the cell synchronously to all of the replicas of the entity. After each write operation, the logs on all of the replicas of the entity contain the same cells. Any node can initiate reads and writes.
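As an illustration of this structure, here is a minimal sketch, in Java, of a replica that keeps one write-ahead log per entity, where each log position holds a cell with one or more operations. The names (ReplicaState, EntityLog, LogCell) are mine and do not come from the Megastore paper or from my implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a per-entity write-ahead log: one log per entity,
// where each log position holds a cell containing one or more write operations.
public class ReplicaState {

    public static final class LogCell {
        final List<String> operations = new ArrayList<>(); // serialized write operations
    }

    public static final class EntityLog {
        private final List<LogCell> cells = new ArrayList<>();

        public int nextPosition() { return cells.size(); }

        // Stores a cell at the given position once consensus has been reached on it.
        public void commit(int position, LogCell cell) {
            while (cells.size() <= position) {
                cells.add(null); // placeholder for positions this replica has not caught up on yet
            }
            cells.set(position, cell);
        }

        public boolean isUpToDate(int lastKnownPosition) {
            return cells.size() > lastKnownPosition && cells.get(lastKnownPosition) != null;
        }
    }

    // One write-ahead log per entity, keyed by the entity identifier.
    private final Map<String, EntityLog> logs = new HashMap<>();

    public EntityLog logFor(String entityKey) {
        return logs.computeIfAbsent(entityKey, k -> new EntityLog());
    }
}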

4.3 Optimisation to the Paxos algorithm

The Paxos algorithm is used in the system to replicate the transaction log (which contains the write operations), such that a separate instance of Paxos is used for each position in the log.

The optimisation was inspired by the master-based approach (pattern). The optimisation helps decrease the network communication required to make write operations by one network trip in most cases.

I first present what the master-based approach looks like, and then the optimisation.

4.3.1 Master based approach

The master based approach proceeds as follows. One of the nodes is chosen to be the master and all the reads and writes have to get its approval. By participating in all the writes, its state is always up-to-date. This means that it can serve write operations without any network communication.

The problems with this approach are:

1. The master will become a bottleneck for the system. It will have to participate in every event of the system and the events cannot be completed in parallel (otherwise some events might get into conflict).

2. The master is also required to have enough resources to serve even when the system is at full workload. Meanwhile, the other replicas (slaves) will waste resources until they become the master.

3. If the master crashes, the system needs an algorithm to re-elect a master which implies a series of high-latency communications between replicas in different datacenters. This communication often causes user-visible outage.

4.3.2 Megastore approach

The Megastore approach involves employing a frequently reassigned master (which is called leader ).

Using a master improves the performance of the system in the following way. The Paxos algorithm requires two rounds to complete successfully (achieving consensus of a write operation on all the nodes): a round of prepares, and a round of accepts. The Megastore approach allows for the first round to be completed also by getting the acceptance of the master (and so there are two ways now: getting accepted by the master or completing a Paxos round of prepares).

Also, a node will try to get the master approval first and if it gets rejected or if it does not get a response, it will try the Paxos prepare round.


Enabling a node to continue if the master rejected the request or did not respond is for protecting the system against the case when the master crashes at the wrong time or becomes too slow.

The master is changed after every write operation, the new master being the node that completed the last write operation. They chose this policy because they observed that most users submit writes from the same region repeatedly. Using this policy most writes will be made on the master. This means that getting the approval of the master will be a local operation.

In conclusion, replacing a Paxos round (the prepare phase) with the acceptance of the master achieves one less network-roundtrip most of the time because getting the acceptance of the master is a local operation most of the time, while the prepare phase from Paxos will always include a network roundtrip of messages.

4.3.3 Read operation algorithm

A read operation can be made on any node. The algorithm for a read operation is the following. First it asks the coordinator if the local replica is up-to-date.

If it is, then the algorithm will simply return the value from local database and stop there.

If not, it has to find the last position on the write-ahead log (from a majority of replicas), and fill in its own log with all the operations that it has missed by requesting them from other nodes. This step is called catch-up. Then it is up-to-date and can serve the read operation.
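A minimal sketch of this read path is shown below. The Coordinator, RemoteReplicas and LocalReplica interfaces are assumptions I introduce only to make the steps explicit; they do not correspond to concrete classes in Megastore or in my implementation.

// Illustrative sketch of the read algorithm described above: check the
// coordinator, and if the local replica is behind, catch up from the other
// replicas before serving the read locally.
public class ReadPath {

    interface Coordinator {
        boolean isLocalReplicaUpToDate(String entityKey);
    }

    interface RemoteReplicas {
        long highestLogPosition(String entityKey);          // obtained from a majority of replicas
        String logCellAt(String entityKey, long position);  // fetch a missed log cell
    }

    interface LocalReplica {
        long lastAppliedPosition(String entityKey);
        void applyLogCell(String entityKey, long position, String cell);
        String read(String entityKey);
    }

    public static String read(String entityKey, Coordinator coordinator,
                              RemoteReplicas remote, LocalReplica local) {
        if (!coordinator.isLocalReplicaUpToDate(entityKey)) {
            // Catch-up: fill in every log position this replica has missed.
            long last = remote.highestLogPosition(entityKey);
            for (long pos = local.lastAppliedPosition(entityKey) + 1; pos <= last; pos++) {
                local.applyLogCell(entityKey, pos, remote.logCellAt(entityKey, pos));
            }
        }
        // The local replica is now up-to-date and the value can be returned locally.
        return local.read(entityKey);
    }
}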

4.3.4 Write operation algorithm

The write operation can also be made on any node. First it asks the leader to accept it.

If the leader accepts it, the proposer node can push the value to be accepted by the remaining replicas.

If the leader does not accept it, it means that another replica has already proposed another value. Being rejected by the leader, the node starts the two phase Paxos algorithm, to complete its write operation. The algorithm is exactly the one from chapter 3.

Then the proposer node sends a message to all of the non-acceptors of its value to tell them that they are no longer up-to-date.

In the end, if the proposer node writes the original value in all the replicas (in Paxos in phase 1, along with the accepted proposal message, a node can also receive a new value with which the node has to continue), it returns that the operation succeeded and the proposer node becomes the new leader. Otherwise it returns that it failed.

Note: The write operation algorithm will always finish successfully. However, from the perspective of the user of the database, if the node finishes with the original value it is called a successful write, and if it finishes with another value it is called an unsuccessful (failed) write. This is also the perspective used in this thesis starting with this section.
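The sketch below summarises this write path under the same caveat: the Leader and Replicas interfaces and their method names are assumptions made for illustration, not the actual Megastore interfaces.

// Illustrative sketch of the write path described above: try the leader first,
// fall back to the two-phase Paxos if the leader rejects, and report success
// only if the original value ends up being the one written.
public class WritePath {

    interface Leader {
        boolean accept(long logPosition, String value); // rejects if it already accepted another value
    }

    interface Replicas {
        // Pushes an accept request to all replicas; returns true if a majority accepted.
        boolean pushAccept(long logPosition, String value);
        // Runs both Paxos phases; returns the value that was finally chosen for the position.
        String runTwoPhasePaxos(long logPosition, String value);
        // Tells replicas that did not accept the chosen value that they are no longer up-to-date.
        void invalidateNonAcceptors(long logPosition, String chosenValue);
    }

    /** Returns true for a successful write (the original value was chosen), false otherwise. */
    public static boolean write(long logPosition, String value, Leader leader, Replicas replicas) {
        if (leader.accept(logPosition, value)) {
            // Fast path: the leader accepted, push the value to the remaining replicas.
            boolean majorityAccepted = replicas.pushAccept(logPosition, value);
            if (majorityAccepted) {
                replicas.invalidateNonAcceptors(logPosition, value);
            }
            return majorityAccepted;
        }
        // Slow path: another proposal reached the leader first, run the two-phase Paxos.
        String chosen = replicas.runTwoPhasePaxos(logPosition, value);
        replicas.invalidateNonAcceptors(logPosition, chosen);
        // Successful only if the original value is the one that was written everywhere.
        return value.equals(chosen);
    }
}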


4.4 Summary

In this chapter I present the design of the database software system Megastore, as described in [1].

I start by stating the requirements for the system. Then I present the manner in which the data is structured inside the database (as entities). Next I write about the write-ahead log, a structure used for storing the write operations that are inserted on the database.

Then I describe some algorithms. I show how the two-phase Paxos algorithm (which follows a no-master approach) is implemented in Megastore and how it is combined with an algorithm that follows the master approach.

Finally, I explain in detail the database write algorithm and the read algorithm.


Chapter 5

Simulating Megastore context

After implementing the system, I wanted to create (simulate) the conditions under which the real system from the paper was built. To accomplish this I adjusted some configurations in my system.

In the beginning I had to decide on how the requests on the database are distributed on the nodes of the system. I knew from [1] that the same users will make requests from the same place most of the time (the word most is used in the paper).

This phenomenon is also mentioned in a paper describing a globally distributed database system used by Yahoo! [5]: "For example, a one week trace of updates to 9.8 million user ids in Yahoo!'s user database showed that on average, 85 percent of the writes to a given record originated in the same datacenter."

Since Yahoo and Google have roughly the same kind of applications (like e-mail or news), I can assume that similar figures apply in the case of Google. All the experiments in this thesis had the same distribution of the load of requests: 85% on the first node, 10% on the second node and 5% on the last one.

In the paper it says that each entity is replicated two or three times. I used three times always. This means that I always simulate three nodes in my experiments.

I chose for all entities to be on the same machine. Despite that, I simulated network latency. The simulation consists in the server thread waiting for a random time interval of either 0, 1, or 2 milliseconds between opening a connection and starting to read data from the client. This way both sides wait at the same time for the same period, thus simulating network latency.

The latency (the random value) is changed at the start of each session, and throughout the session all the communication between all the nodes will have the same value, which means that we are simulating equal distances between nodes.
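A minimal sketch of this latency simulation, under the assumptions above, could look as follows (the class and method names are my own):

import java.util.Random;

// Illustrative sketch of how a constant per-session network latency can be
// simulated on a single machine: a delay of 0, 1 or 2 milliseconds is drawn at
// the start of a session and applied to every message in that session.
public class SimulatedNetwork {

    private final long sessionLatencyMillis;

    public SimulatedNetwork(Random random) {
        // Chosen once per session, so all communication in the session uses the same value,
        // which corresponds to simulating equal distances between the nodes.
        this.sessionLatencyMillis = random.nextInt(3); // 0, 1 or 2 ms
    }

    // Called by the server thread between accepting a connection and reading the request.
    public void delayBeforeReading() throws InterruptedException {
        Thread.sleep(sessionLatencyMillis);
    }

    public long latencyMillis() {
        return sessionLatencyMillis;
    }
}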

In the next part the experiments consist in obtaining multiple measurements of the performance of the system when we use an optimisation compared to when we do not use it. The latency in both measurements (let us call it a pair) will always be the same. So, when analysing results, it only matters how the values from the same pair differ, and not how they differ between different pairs. This also means that the values from different pairs are in no particular relation to each other (values from one pair being much bigger than those from another can be because in one pair the latency was 2 ms and in the other one it was 0 ms).

The rationale behind choosing a maximum latency of 2 milliseconds started from the following quote [1]: Most of our target customers scale write throughput by sharding entity groups more finely or by ensuring replicas are placed in the same region, decreasing both latency and conflict rate. At the moment of writing this thesis, on the website of the Google company related to their datacenters, they say that they have three datacenters for the Europe region: one in Dublin, Ireland, one in St Ghislain, Belgium and one in Hamina, Finland. As an example, I took the distance between the closest two (from Ireland to Belgium) and found an approximate geographical distance between them of ∼750 km. It is known that information is transferred through fiber optic cable, so a bit of information will travel with at most the speed of light (there are also network switches and routers in the way). At the speed of light (299.792 km/ms), 750 km are covered in around 2 milliseconds.

Other ways of deploying this system include the following. The first one is using an actual network of three computers that are at large distances from one another (like in the infrastructure of Google). The second one is using a network simulator, like ns-2 for example [13]. I have built my own platform for simulating a network in Java, because it was easier to integrate it with the database system which was also written in Java.

5.1 Special terms used in experiments

Session - A session consists of starting an instance of the Megastore database, making read and write requests to it (on all the nodes) and then shutting it down. Also, in each session the database is under a constant number of write requests per second. Each session is executed separately from any other session.

Waiting time (for an operation) - represents how much time has passed between the moment of arrival of the operation in the database (its creation) and the moment when it is completed. This means that if there is an operation in the system and it fails to complete twice (each time takes 10 ms) and then successfully completes (taking 6 ms), the waiting time will be 10 ∗ 2 + 6 = 26 ms.


Chapter 6

Study of the Megastore optimisation

6.1 Introduction

In this chapter I will call the approach presented in section 4.3.2 the Megastore optimisation (or simply the optimisation). In that section I present two ways to accomplish a step in the Paxos algorithm, namely without the optimisation (when I am always using the Prepare phase) or using the optimisation (when the system can choose to use the Prepare phase or to get the acceptance of the leader).

I wanted to know the effects of using the optimisation, as compared to not using it. I chose to study this comparison since at the moment of writing this thesis there were no studies on this subject in the literature. Therefore, I claim this study as a contribution.

In this experiment I measure the performance of the database in terms of average waiting time per operation (see 5.1). The waiting time is probably the most important measurement since it reflects how long the user has to wait on average for an operation to complete.

6.2 Expectations

From the perspective of the system I expected to have better performance in the version which uses the optimisation. This is simply because most operations use one less network roundtrip in the optimised version than in the unoptimised one.

When computing the number of failed operations, it is important to know how many chances one operation has of succeeding against other concurrent operations.

In the optimised version the operation accepted by the leader has to make one less network roundtrip than other concurrent write operations. This advantage will give it more chances of succeeding than the rest.

In the unoptimised version, each operation starts with the same chances of succeeding, which makes it harder for each of them to compete against all the other concurrent operations (and to finish successfully). This will create a larger number of failed operations.

6.3 Experiment

In the experiment I first measured the performance when I did not use the Megastore optimisation and then when I did use it. The experiment was set up as described in chapter 5. Each session lasted 30 seconds.

Megastore was designed so that it supports few (this is the word used in [1]) write operations per second. I assumed this means a number smaller than 30, so the experiment tests the database when the load is equal to or smaller than 30.

Then I made a graph showing the average waiting time per operation when the database was using the optimisation and when it was not, with the system under different loads.

6.4 Results

This is the image:

It can be noticed that the performance of the database is higher when the database uses the optimisation.

6.5 Comparison of the success of the write algorithms

In this section I have applied the Megastore optimisation. Since there are two ways now for performing a write operation on the database (using the master approach or using the prepare phase from Paxos), I wanted to know how many times each of them gets called and how many times it succeeds. I first describe what happens behind the scenes for each method to fail or succeed from an algorithm perspective and then present my findings from an experiment.


Using the master approach

Assuming the master is local (this happens most of the time), an operation starts by getting the approval of the master (leader). To succeed it has to finish propagating the operation to a majority of nodes (one network round-trip).

It will only be unsuccessful when a majority does not reply.

Using the classic approach (the prepare phase from Paxos)

If a node uses the prepare phase from Paxos for a write operation (let us call it X), it means that it did not get the approval of the leader. Knowing that the leader accepts the first proposal it receives and rejects all the following ones, it means that there was another operation before (named Y) which did get the approval.

Since X does not know about Y (otherwise X would have chosen the next log position in order not to get into conflict with an existing operation), it means that Y did not finish yet.

For operation X to succeed, it has to finish two network round-trips before the other operation (Y) finishes one network round-trip.

Experiment and Results

I ran an experiment with the same set up as above. The experiment had 10 sessions of 30 seconds, each with a different load. The loads were 3, 6, ..., 30.

What I discovered is that for each session the number of times an operation would succeed using the classic approach was 0 (zero), and it would fail around 370 times in total.

Another thing I found is that only one of the operations that used the master approach failed. The rest of them (around 5200) succeeded.

Conclusion

It could be hypothesised that in the real system at Google the classic approach was mainly implemented to treat the case of nodes failing (e.g. one node starts an operation with the master approach, i.e. gets the leader approval, but then crashes, before propagating the operation to a majority).

6.6 Summary

In this chapter I studied the effects of the Megastore optimisation. I introduced the topic by stating what I want to measure to determine the effects of the optimisation on the performance of the system. Then I stated some expectations of what I thought would happen. Next, I ran an experiment (having the optimisation on and off) and discussed the results.

Having applied the optimisation, there are two ways for an operation to complete successfully (using the master approach or using the two phase Paxos approach). I wanted to know how many times each of these methods succeeds or fails, so I ran an experiment counting exactly that. Since there was a big difference in results, I stated a hypothesis in a small conclusion section about the reasons why a system might rely on both methods, even though one of them is clearly inferior to the other.


Part II

Adapting the scheme of Megastore for higher loads of write operations per second per entity


Chapter 7

Introduction and experiment set-up configurations

In the previous chapter I compared the performance of the two-phase Paxos with that of the real system, which uses the optimisation made by the Google engineers.

It is known that Megastore was designed to serve only a few operations per second [1]:

Limiting that rate to a few writes per second per entity group yields insignificant conflict rates.

In this part of the thesis I intend to study how the Megastore scheme can be adapted for another scenario. The scenario consists of the database having to serve more operations per second with less time per operation, even though the conflict rate will increase (because the number of operations per second coming from different nodes will increase). If more operations per second can be achieved with the same writing time per operation, it means that the developers do not have to split their data into such small pieces. Having bigger pieces implies having a smaller number of pieces, which can mean that it will be easier to develop applications, because there are fewer things to worry about.

While applying the optimisation from the previous chapter I analysed four optimisations. All the optimisations are studied independently (that is, when doing experiments, I only have one optimisation applied to the system).

Set-up for all the experiments in this part

On the basis of the database simulating Megastore (the Megastore constraints are described in chapter 5), I designed the following set-up for all the experiments. For each experiment I measure the results when using the optimisation versus when not using it. There are 10 sessions, each takes 30 seconds, and the load of incoming write operations per second is constant per session. Each session has a load from the list: 10, 20, 30, ... 100.

When viewing the results (the waiting times from the diagrams), I present the exact times if they are smaller than 100 ms, or indicate that they are bigger than 100 ms when that is the case. I focus only on times that are smaller than 100 milliseconds because in the Megastore paper [1] it is specified that most users see average write latencies of 100-400 milliseconds, depending on the distance between datacenters, the size of the data being written, and the number of full replicas. In my case the distance is similar and all three nodes are full replicas, but in all of my experiments the size of the data I am writing is minimal. Also, I only simulate the network layer. I do not simulate the layer below, which is responsible for writing the data on disk (disk seeking alone takes on the order of 10 milliseconds). Since I write a minimal amount of data and eliminate the time it takes to write to disk, it seemed reasonable to aim for the smallest time from their paper: 100 milliseconds.


Chapter 8

Optimisation: Restarting operations when nodes receive signs that they might fail

If a request fails in a distributed system, the request is usually restarted from the client side in order to redistribute the request to another node. In this chapter I wanted to see if failing operations can be retried from the server side. Retrying from the server side can be more effective because it can be done at the first sign that the operation might fail, instead of waiting until we are sure that the operation failed and we have to send back the failure message.

8.1 Introduction

From experiment 6.5 I noticed two things:

• A good proportion of the write operations use the Prepare Phase from the two phase Paxos.

• All operations that use the Prepare Phase fail to write the value of the user in the database.

Since all operations that use the Prepare Phase step fail, this means that the waiting time for the operations that choose that step is larger than three network round-trips (one is represented by the request to the leader and getting rejected, and the next two are the two phases from Paxos). With the goal of avoiding that waiting time, and without breaking the correctness of the system, I started analysing the writing algorithm.

First I wanted to know the circumstances in which the prepare phase from Paxos gets chosen. The below image is a snapshot of the writing algorithm from Megastore [1]:


The phases of the two phase Paxos are given by Step 2 and Step 3. From the description of Step 1 it can be seen that the system only chooses the prepare phase if the leader rejects the proposal. The leader rejects a proposal only if there was another proposal on the same log position before (the log structure used is explained in detail in section 4.2.3).

I drew a model representing this situation. To help specify the situation, the first operation is made on the leader (this is what happens in most cases) and the second one is on another node. This is the model:


Info 1 It can be seen in the image how the second node receives a write operation, asks the leader to accept it but the leader does not do it (because there was already an operation before which got accepted), and then it starts the two phases Paxos algorithm.

Info 2 In the first phase of the two phases Paxos algorithm, when a node receives a prepare request, the node can respond with acceptance and with any other accepted proposal. The algorithm can be found in section 3.

From Info 1 it is known that the leader already has an accepted proposal. This means that when node 2 sends a prepare request to it (this moment is marked in the image with a red star) it will usually return the accepted proposal to node 2 who now has the obligation of spreading the received value to the rest of the nodes.

The word ”usually” was used because it can happen that the leader node has lost the value by crashing or restarting in the time interval represented as a grey line on the leader timeline. I discuss how I solved this case in the next section.

But, if no node breaks, what will happen is that the second node will continue to finish the writing operation with another value, which, from the perspective of the user, means that his/her operation fails.


As a conclusion for this section, I showed how most of the time the two phase Paxos will basically help propagate another value to the rest of the nodes instead of completing with the original value.

8.2 Optimisation

Knowing that the two phase Paxos is unsuccessful most of the time, I started looking for any signs a node might receive from other nodes, telling the node that its requests to the leader will fail and that it will have to do the slow two phase Paxos.

One good sign a node can receive is the network message Enforced Accept Request (EAR). This message is found in the following scenario. After being accepted by the leader, a node has to spread the write operation to the other nodes, to reach a majority. The message used for spreading is the "Enforced Accept Request" (EAR). In the snapshot with the writing algorithm from Megastore, this is described as Step 3, after Step 1 was successful. Receiving this type of message on a specific log position is a good sign for a node because now it knows that if it sends or has sent a proposal request to the leader for that position, it will fail and that it will have to continue (with the two phase Paxos) propagating the same value that it received through the EAR (because this is the proposal the leader has already accepted).

I added to the previous diagram the Enforced Accept Request messages (shown with black arrows) that node 1 will send after it gets accepted by itself (node 1 is also the leader). This is the new model:

If the system assumes that the leader does not break, it means that if a node receives the EAR, then it has to do the two phase Paxos, and it cannot propose its value. In short, if a node receives the EAR it means that its operation does not succeed.


Optimisation: Once a node receives an EAR message for a log position and it has also started an older operation on that position, it can automatically initiate a new write operation with the same value (represented as the green dotted arrow in the above model). But if the new operation does not get accepted by the leader, it does not continue with the two phase Paxos, and instead terminates.
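A minimal sketch of this optimisation is given below. The PendingOperations and WriteStarter interfaces are assumptions introduced only to show where the restart happens; they are not the actual interfaces of my implementation.

// Illustrative sketch of the optimisation: when a node receives an Enforced
// Accept Request (EAR) for a log position on which it has an operation of its
// own in progress, it restarts that operation as a new write; the restarted
// write gives up instead of falling back to the two-phase Paxos if the leader
// rejects it.
public class RestartOnEar {

    interface PendingOperations {
        boolean hasOwnOperationAt(long logPosition);
        String valueOfOwnOperationAt(long logPosition);
    }

    interface WriteStarter {
        // Starts a brand-new write operation with the given value; the flag controls
        // whether it may fall back to the two-phase Paxos when the leader rejects it.
        void startNewWrite(String value, boolean allowPaxosFallback);
    }

    // Called when this node receives an EAR for the given log position.
    public static void onEnforcedAcceptRequest(long logPosition,
                                               PendingOperations pending,
                                               WriteStarter writer) {
        if (pending.hasOwnOperationAt(logPosition)) {
            // The EAR is a sign that the pending operation will be rejected by the
            // leader, so restart it right away as a new operation.
            String value = pending.valueOfOwnOperationAt(logPosition);
            writer.startNewWrite(value, /* allowPaxosFallback= */ false);
        }
    }
}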

8.2.1 A problem with the optimisation

Starting a new operation if the node has a sign that the previous one might fail leads to incorrectness when the previous operation actually succeeds, because the node then has two completed operations with the same value. First I describe when this happens, and then I present my solution to keep the system working correctly.

The case happens when the leader breaks and forgets its accepted proposal after sending the EAR message and after denying the leader-acceptance to other nodes.

On the one hand this means that the nodes which get their proposals denied by the leader will go into the two phase Paxos, and in the first phase they will not get any already accepted proposals, which means that they can propose their own values (and thus successfully complete the write operation).

On the other hand, this means the nodes which receive the EAR message will start a new operation with the same value.

8.2.2 Solution to the problem

The solution to the above problem, in which the system might end up having two completed operations in the database for the same value, is to put a lock at the points where each thread starts the process of trying to complete the write operation.

There is one lock for each value being written, such that before a thread starts the process, it has to acquire the lock (after which no other thread will be able to start writing the same value) and then check if the value was already written (if it had to wait for another thread and that one completed it, the job is done and it can return). The lock is released when the operation fails or gets completed (at the end of phase two in Paxos, or when being rejected by the leader, or after it gets approved by the leader and the write was accepted by a majority of nodes). The places where this lock is deployed are represented in the above image as red rectangles.

This solution will keep the correctness of the algorithm in all the possible cases. Also, as can be seen in the model, the two threads will usually operate at different times, such that deploying a lock on them will not create a bottleneck for the system.
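The following sketch illustrates one possible shape of this per-value lock, assuming Java's ReentrantLock and a set of already completed values; the names and structure are my own and only illustrative.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the per-value lock described above: only one thread at
// a time may try to complete the write of a given value, and a thread that
// acquires the lock first checks whether the value has already been written.
public class PerValueWriteGuard {

    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Set<String> completedValues = ConcurrentHashMap.newKeySet();

    interface WriteAttempt {
        boolean tryComplete(String value); // one attempt to complete the write (leader path or Paxos)
    }

    public boolean write(String value, WriteAttempt attempt) {
        ReentrantLock lock = locks.computeIfAbsent(value, v -> new ReentrantLock());
        lock.lock(); // no other thread can start writing the same value now
        try {
            if (completedValues.contains(value)) {
                return true; // another thread (e.g. the restarted operation) already wrote it
            }
            boolean success = attempt.tryComplete(value);
            if (success) {
                completedValues.add(value);
            }
            return success;
        } finally {
            lock.unlock(); // released when the operation fails or gets completed
        }
    }
}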

8.3 Experiment

I wanted to see the effects of my optimisation (when the node starts a new operation if it receives signs that an older one might not succeed) on database performance. For this I ran an experiment where the system either had the optimisation applied to it, or it did not. The experiment was set up as described in the introduction to Part II of the thesis (page 24). In it I measured the average waiting time per operation for all the write operations. This is the diagram:

It can be seen in the image that the performance of the database improves when it uses the optimisation in the cases when the load is 40 and 50 operations per second.

8.4 Related work

In the design of Megastore, any node can propose write operations and so it is possible to have concurrent write operations. Since only one can succeed at a time (to keep the strong consistency), this creates the problem of having the other operations fail. This optimisation helps nodes with failing operations predict that their operation proposals might fail, so they can restart them.

There are two other major databases with a design very similar to that of Megastore: PNUTS [5] (from Yahoo) and Microsoft Azure Storage [3]. All three split the data in small groups (in Megastore, the group is known as an entity), replicate each group around three times and provide strong consistency for operations inside the group and eventual consistency for operations across groups.

Unlike Megastore, PNUTS and Microsoft Azure solved the problem of concurrent writes by having only one node (the master) make all the write operations; in this way, all operations succeed.

8.5 Summary

In this chapter I started by analysing the write algorithm in order to improve the performance of the system. I found a scenario in which the system can be improved. The scenario is that a node has to continue running the Paxos algorithm after being rejected by the leader, even though the Paxos algorithm almost never succeeds. I designed an optimisation where I still employ the Paxos algorithm, but I also restart the operation (as a new one).

Having in parallel two methods of completing an operation can create corrupt data in the database when they both work, so I applied a lock to have only one method operating at once (and to complete the operation only once), in order to keep the write operation correct.

Finally I created an experiment to see if I improved the performance of the system and by how much. My expectations were confirmed, as this optimisation improves the performance of the system.


Chapter 9

Improving the time for consecutive operations on the same node

9.1 Introduction

I selected the problem of how to improve the time for consecutive operations on the same node. I found two possible solutions for this problem. The solutions differ in the idea behind them.

The idea behind the first solution is to reduce the duration between consecutive operations: if I make the second operation start earlier, it will finish earlier.

The idea behind the second solution is to group all the consecutive operations that are waiting to be executed on the same node. In this way I can treat a list of operations as a single one.

Firstly, I will present both solutions and show how they improve the system (in independent experiments).

Secondly, I will show which solution is more effective.

Finally, I will discuss why they cannot be used together (or why the performance of the system decreases if I use both of them).

9.2 Solution 1

Since the priorities for this system are mainly represented by the waiting time for each operation and the number of writing operations per second that the database can achieve, I analysed the writing operation algorithm further.

9.2.1 The success point from the system perspective versus from author node perspective

Let us look at how a write operation is carried out, when the node responsible for it is also the leader:


I consider the moment in time marked with a blue line. At this moment, the system has the new value saved in the log on two nodes, which is a majority (since there are three in total), so the operation is completed successfully. The problem is just that it will take another network trip from node 2 to node 1, for node 1 to realise that it was successful (the moment marked with a red line).

The success point from the system perspective is when the system has successfully completed an operation and can start the next one, without breaking its correctness. In my figure it is marked with a blue line. The success point from the author node perspective is when the node realises that it was successful and can start the next operation. In my figure it is marked with a red line.

In my current algorithm, if the system has another operation waiting on the same node, the operation will start too late. It will start at the moment with the red line instead of starting at the moment with the blue line. Having this information, I started looking for ways to exploit it.

Note: Exploiting this presents some difficulties because at the moment marked with the blue line only the second node knows that the operation was successful (because it was enforced, meaning that the leader accepted it, and because the second node accepted it, and two nodes out of three make a majority). Since only the second node knows that the operation was successful, only it can start accepting the next operation. To do that, the node will mark the next position in its log as opened to Weaker Accept Requests (I explain this concept in the next paragraph).

This brings us to my optimisation:

Optimisation: Once a node (other than the leader) accepts an enforced accept request, it marks the next cell in its log as a cell opened to Weaker Accept Requests. The leader, after sending an enforced accept request, will also start the next write operation. But after the new write operation gets accepted by the leader (as the first one the system encountered), the leader node will not send Enforced Accept Requests as usual (I discuss in the next section why this is not the case), but will send Weaker Accept Requests. This optimisation is exemplified in the following paragraph.

A Weaker Accept Request will only get accepted if it is for a log position marked as ”Opened to Weaker Accept Request”. Let us see this in a model:


What is new in this model is that the leader starts the second write operation immediately after the first one got accepted by the leader. A different characteristic of the new operation is also noticeable: it can be seen that the leader then sends "Weaker Accept Requests". These get accepted by the second node, because when it accepted the previous message (an enforced accept request), it also enabled the next log cell for the current one (see moment T2). The same requests get rejected by the third node, because that node was not expecting them.

Also, at the moment marked as T4, the leader receives an accepted weaker request, which means that another node accepted it; this is the second acceptance, so the operation is now declared successful.
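The acceptor-side bookkeeping of this optimisation can be sketched as follows; the class name and the use of a plain set of open positions are my own assumptions, and the chaining of open cells is the one described further below.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of solution 1 on the acceptor side: accepting an enforced
// accept request for position p opens position p+1 to "weaker accept requests",
// and accepting a weaker accept request opens the position after it in turn.
public class WeakerAcceptState {

    private final Set<Long> openToWeakerAccept = new HashSet<>();

    // Called when this node accepts an Enforced Accept Request for a log position.
    public synchronized void onEnforcedAcceptAccepted(long logPosition) {
        openToWeakerAccept.add(logPosition + 1);
    }

    // Called when this node receives a Weaker Accept Request for a log position.
    public synchronized boolean onWeakerAcceptRequest(long logPosition) {
        if (!openToWeakerAccept.remove(logPosition)) {
            return false; // the node was not expecting a weaker request here: reject it
        }
        openToWeakerAccept.add(logPosition + 1); // chaining: the next cell is opened as well
        return true;
    }
}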

Rationale on verification of correctness

This optimisation is made under the following assumption: I assume that the previous operation succeeds and that the same node will continue to be the leader. When this does not happen, it means that no other node will enable the next log cell for the weaker accept request (WAR), and so all of the WARs will fail. When a node receives rejections on all of its WARs, it will return "write unsuccessful". But since in the first step the old leader node saves the value in the log (see T1 in the above model), the system allows an Enforced Proposal (coming from a real leader) to overwrite the log value made with a Weaker Proposal (coming from a potential future leader).

How often does the optimisation take effect

The optimisation will occur whenever the system has an operation in progress that just received the leader acceptance and it also has a second operation (one or more) that is waiting to be processed. What the optimisation brings is that now the waiting time for the second operation is smaller by at most one network round-trip (the round-trip it takes to spread the "Enforced Accept Request"; see the pictures above).

It is known from the requirements of the system that 85% of the requests are processed by the same node (see chapter 5). This means that the probability of having 2 consecutive operations on that node is 85% ∗ 85% = 72% (because they are independent events). In 72% of all cases my optimisation will reduce the duration of the operations by at most one network roundtrip, if they have to wait for another operation to finish first.

Also, the weaker cell log enabling is chained. If a node receives a weaker accept request and accepts it, the next log cell also becomes opened for weaker accept requests.

9.2.2 Experiment

I wanted to evaluate the effects of my optimisation in practice, so I ran an experiment to see the effects that the newly described optimisation has on performance, measured as the average waiting time per operation.

I noted the results and put them in a diagram. This is the diagram:

As can be seen in the diagram, my optimisation offers the same or better results than when I am not using it (best seen at 70 operations per second), and it also allows 80 operations per second to be served in less than 100 ms (in 94 ms).

9.3 Solution 2

9.3.1 Introduction

It should be kept in mind that the database is a distributed hash-table, and it has a write-ahead log to remember write operations. A write operation in the log has the form <key, value>; to delete a record in the database, the operation will be of the form <key, null>. Each cell in the log can contain multiple write operations. As long as the operations are applied in the right order (by applying the cells of the log from left to right, and within each cell the write operations from left to right), the size of a log cell does not matter: any two consecutive cells can be concatenated into one.

9.3.2 Optimisation

If the system has more than one operation in the queue of the same node, waiting to be executed by the database, the node joins all operations from the queue together (so that they will act as a single cell in the log) and achieves consensus on all of them at once, on all the nodes.
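A minimal sketch of this batching, under my own naming assumptions (OperationBatcher, Consensus), is shown below; it simply drains the queue and replicates the drained operations as one log cell.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative sketch of solution 2: all write operations waiting in the queue
// of a node are drained and joined into a single log cell, so that one round of
// consensus covers all of them.
public class OperationBatcher {

    interface Consensus {
        // Achieves consensus on one log cell (a list of operations) across all nodes.
        boolean replicateAsOneCell(List<String> operations);
    }

    private final Queue<String> pendingWrites = new ArrayDeque<>();

    public synchronized void enqueue(String operation) {
        pendingWrites.add(operation);
    }

    // Drains everything currently waiting and replicates it as a single cell.
    public boolean flush(Consensus consensus) {
        List<String> batch = new ArrayList<>();
        synchronized (this) {
            while (!pendingWrites.isEmpty()) {
                batch.add(pendingWrites.poll());
            }
        }
        if (batch.isEmpty()) {
            return true; // nothing to do
        }
        // The order inside the cell preserves the order in which operations arrived.
        return consensus.replicateAsOneCell(batch);
    }
}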

9.3.3 Experiment

I wanted to see the effects of this optimisation so I ran an experiment with the same set up as the experiment for solution 1. I put the results in a diagram and these are the results.

It can be seen in the diagram that using the optimisation offers much better results than when I am not using it, except at 40 operations per second, when using the optimisation is worse than not using it.

9.4 Solution 1 versus solution 2


As can be seen, there is a big difference between these two optimisations. In the following I discuss why this happens and what can be learned from it.

9.4.1 Discussion on the difference between optimisations

Solution 1 promises to help an operation start earlier by at most one network round-trip when it has to wait for another operation in progress on the same node to finish first. But remember that for 72% of the writes the leader is local, so the write operation only takes one round. Since solution 1 helps an operation start earlier by one network round-trip, for 72% of the writes the operations will happen (start) at the same time.

Solution 2 promises to concatenate consecutive operations on the same node, resulting in only one operation. Since consecutive operations on the same node also occur 72% of the time, in solution 2 for 72% of the writes the operations will happen (start) at the same time.

So I have two optimisations that promise the same thing, but in practice one of them is much more powerful than the other. Why?

To answer that, I looked at how many operations succeed and how many fail in both optimisations when the difference in results is high, and for that I selected 70 operations per second to be the database load. The session took 30 seconds and these are the results.

• Solution 1: Succeeded: 1689. Failed: 1725
• Solution 2: Succeeded: 2180. Failed: 354

Two observations can be made on these results. The first one is that with solution 2 there were far fewer operations in the system during those 30 seconds: (1689 + 1725) − (2180 + 354) = 880 fewer operations (so two or more operations were turned into a single one). With fewer operations in the system, there are smaller chances of nodes getting into conflict (two nodes trying to complete different write operations at the same time, making one of them fail the writing process).

This brings me to the second observation. Because the chances of nodes getting into conflict (and failing to complete operations) are smaller in the second solution, the number of failed operations is much lower (354) compared to the first solution (1725).

9.4.2 Why can I not use both optimisations?

The strength of the second solution lies in keeping the number of operations small. If I used both optimisations, a write operation that has just arrived would not wait for the previous operation to finish, but would instead start on its own (in 72% of the cases). This is what the first optimisation does: it helps operations start earlier, and so many of them start on their own, creating a larger number of write operations in the database. Creating a larger number of operations cancels the strength of solution 2, and so using them together is much worse than using only solution 2.

9.5 The difference between the two optimisations and the one from the Megastore paper

In the Megastore paper [1] there is the following paragraph:

For groups that must regularly exceed a few writes per second, applications can use the fine-grained advisory locks dispensed by coordinator servers. Sequencing transactions back-to-back avoids the delays associated with retries and the reversion to two-phase Paxos when a conflict is detected.

The quote suggests that Megastore has implemented an optimisation (in this section I will call it the Megastore optimisation) to improve the time for consecutive operations on the same node in the following way. A node A can request the coordinator to block the requests for the next write operation coming from other nodes, so that the next operation proposal from node A will definitely succeed.

The first solution is similar to the Megastore optimisation, because the effect of both is to create sequences of operations. However, in the Megastore optimisation, if the node that requested the lock for the next operation (to create a sequence) crashes (or suddenly becomes slower), the write operations coming from other nodes will get rejected. In the first solution, if the node that wants to propose the next operation in the sequence crashes, the operations coming from other nodes will succeed, and so the system will continue to serve write operations.

The second solution is different from the Megastore optimisation in the following way. In the Megastore optimisation, for the second operation to start, the first one has to finish propagating to the rest of the nodes (which takes one network round-trip); from what is described in the paper, the Megastore optimisation only helps with getting the approval of the leader. In the second solution all operations are bundled together in the same packet. Then, for this packet, the node gets the approval of the leader and propagates it to the rest of the nodes. This means one network round-trip for n write operations instead of n network round-trips for n write operations (as in the Megastore optimisation).

9.6 Related work

This optimisation (grouping write operations together) was also implemented in other software systems, such as Dynamo, a highly available key-value storage system developed by Amazon [8]. The motivation in the case of Dynamo is the same as in my system: a time-expensive operation has to be performed, and by grouping, the system can treat all the operations as one. Whereas in my case the expensive operation is sending proposals over the network, in the case of Dynamo it is writing the operations to disk. In [8] it is described how an object buffer is kept in memory and periodically written to disk.

In their case, using an object buffer lowered the 99.9th percentile write latency by a factor of 5 during peak traffic, even with a small buffer. The optimisation also reduced the largest latency values (the outliers).

The authors of [8] note how this optimisation trades durability for performance. In case of a server crash, all the values in the buffer are lost. This is a problem because Dynamo reports the write operations as completed successfully (when it has only stored them in the buffer), but if the values (the write operations) in the buffer are lost, the data is not durable.

My optimisation for Megastore does not have this problem. If multiple write operations are bundled together on a node and that node crashes, all those operations will fail; the users will see that and will repeat the operations.

9.7 Summary

In this chapter I wanted to optimise the case when the system has consecutive operations waiting on the same node to be completed. I tried two solutions that promise the same thing but are based on different underlying ideas. They both promise that when there are two consecutive operations, the second operation will start faster (more precisely, faster by the duration of one network round-trip). The idea behind the first solution is to minimise the time interval between the starting points of two operations (they will run partly in parallel). The idea behind the second solution is to group consecutive operations into a single operation.

Then I expressed my rationale for why the write operation will continue to work correctly and the database will not have corrupt data if I apply either of the solutions.

Then I checked how the solutions improve performance by running experiments: the first experiment had the first optimisation applied and the second experiment had the second optimisation applied.

Then I analysed the difference in performance between choosing one optimisation over the other. Since the difference was big, I wanted to know what causes it. In this way, I found another beneficial effect that the second optimisation has on my system (it reduces the number of operations in the system).

Finally, I discussed why I cannot, or, more precisely, should not, use both optimisations at the same time.


Chapter 10

Reducing the number of failed operations for non-primary nodes

10.1 Introduction

It is known that 85% of the write operations land on the same node (see chapter 5). I will call this node the primary node. I will call the other two nodes, whose performance is measured combined, the non-primary nodes.

Then I wanted to know how many operations were successful on the primary node and how many on the non-primary ones when the database was under a medium load (since the target is between 10 and 100 operations per second, I chose 50 as the load). The experiment took 30 seconds and these were the results.

• Write operations on the primary node: Succeeded: 1031. Failed: 32
• Write operations on the non-primary nodes: Succeeded: 218. Failed: 406

Let us look at the results for the non-primary nodes. It can be seen that, on average, for every successful operation there was at least one operation that failed. Remembering that the database keeps restarting failed operations until they succeed, it can be said that on average an operation only succeeds after at least one failed attempt. Waiting for every operation to be tried twice leads to large waiting times. This is also shown in the next diagram, which includes the waiting times for each operation on the non-primary nodes (with the same database load, in a session that took 30 seconds).


10.2 The problem

It can be seen in the above diagram that many write operations take more than 100 milliseconds. One cause of the high waiting times is the following. It is known that in order for an operation to succeed, it has to get the acceptance of the leader. Also, the leader will usually be the primary node (because the leader is the node that won the last operation, and the primary node is responsible for 85% of the operations).

The problem with this is that if a new operation "lands" on the primary node, the time it takes to be accepted by the leader is almost zero (because the leader is local). But if it lands on a non-primary node, it takes a network trip (0-3 ms) to be accepted by the leader, during which another operation on the primary node can take its place.

In summary, the problem is that it is difficult for the operations on the non-primary nodes to succeed (these account for 15% of the total number of operations on the database), and so the waiting time for them is high (it can take up to 0.5 seconds, see the above image).

10.3 The solution and the optimisation

Solution: From time to time, the system favours only the operations on non-primary nodes.

The alternatives considered while searching for the most effective solution varied greatly.

First I considered creating windows (intervals) of time when the primary node would pause its operations to make way for operations from non-primary nodes. I tried with small windows of time for each operation (around 7ms), then I tried with larger windows of time for a fraction of the operations.


The problem with this approach was that it is hard to know how long the node should wait and how often (considering that the load on the database changes). If the node waited too little, the approach would be ineffective, because the system would not be helping the operations from non-primary nodes. If the node waited too long, there would be intervals of time with no operations on non-primary nodes left to execute, while the primary node was still blocking its own operations. Since it was hard to find a good combination, I started looking for other ways of dealing with the operations from non-primary nodes.

Then I considered the following approach:

Optimisation: For every couple of operations from non-primary nodes that failed in the last period, I assign the next operation slot (position in the log) only to operations from non-primary nodes. Trying to write from the primary node to a slot assigned only to non-primary nodes results in a rejection (by the leader). This approach had a larger effect than the waiting approach, so I chose it.

10.4 Rationale on verification of correctness

A problem that might occur is that a slot is assigned to non-primary nodes, but all the non-primary nodes stop making write operations, and so the primary node is blocked from making its writes. The solution to this is to make the slot assignment temporary (only for 200 ms). After that, the special slot becomes a normal one again, accepting writes from any node (primary or not). A small sketch of this bookkeeping is given below.
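The following sketch combines, on the leader side, the reservation rule from section 10.3 with the 200 ms expiry. All names (SlotReservations, RESERVE_THRESHOLD, is_write_allowed) and the exact threshold of failed operations are my own illustrative assumptions.

```python
import time

# Sketch of the slot-reservation rule: after a couple of failed operations
# from non-primary nodes, the next log slot only accepts writes from
# non-primary nodes, and the reservation expires after 200 ms so the primary
# node cannot be blocked forever.

RESERVATION_TTL = 0.200   # seconds, the expiry mentioned in the text
RESERVE_THRESHOLD = 2     # assumed value for "every couple of" failed operations

class SlotReservations:
    def __init__(self):
        self.failed_non_primary = 0
        self.reserved = {}    # slot index -> time the reservation was made

    def record_non_primary_failure(self, next_slot):
        """Called when a write from a non-primary node fails."""
        self.failed_non_primary += 1
        if self.failed_non_primary >= RESERVE_THRESHOLD:
            self.reserved[next_slot] = time.monotonic()
            self.failed_non_primary = 0

    def is_write_allowed(self, slot, from_primary):
        """The leader consults this before accepting a proposal for a slot."""
        reserved_at = self.reserved.get(slot)
        if reserved_at is None:
            return True                                  # a normal slot
        if time.monotonic() - reserved_at > RESERVATION_TTL:
            del self.reserved[slot]                      # reservation expired
            return True
        return not from_primary                          # reserved slot: reject the primary node
```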

10.5 Experiment

First I wanted to determine the effects of my optimisation on the number of operations that succeed/fail on the primary and non-primary nodes. With the same set-up (a load of 50 operations per second in a session that took 30 seconds) these are the results:

• Write operations on the primary node: Succeeded: 1234 (before: 1031). Failed: 123 (before: 32)

• Write operations on the non-primary nodes: Succeeded: 223 (before: 218). Failed: 193 (before: 406)

The optimisation had the following effects on the database.

First, it increased the number of failed operations on the primary node by 91 (from 32 to 123), which was expected, because there are now slots in the log where the requests coming from the primary node will fail.

Second, it decreased the number of failed operations on the non-primary nodes by 213 (from 406 to 193). This is what I was looking for (what this optimisation was designed for): to help the operations from non-primary nodes succeed more easily.

Third, it can also be seen that the number of operations that succeeded on the primary node increased (from 1031 to 1234). It could be hypothesised
