
Distributed Monitoring

Edwin Westerhoud

edwinwesterhoud@voormedia.com

July 2014, 50 pages

Supervisors: Magiel Bruntink (UvA), Mattijs van Druenen (Voormedia), Michiel Verkoijen (Voormedia)

Host organisation: Voormedia B.V., http://voormedia.com

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica


Contents

Abstract

Preface

1 Introduction
  1.1 Research questions

2 Background and Context
  2.1 Problem analysis
  2.2 Monitoring approaches
    2.2.1 Centralised monitoring
    2.2.2 Distributed monitoring
    2.2.3 Internal and external monitoring
  2.3 Pingdom
    2.3.1 Configuration
    2.3.2 Monitoring
    2.3.3 Performance analysis
    2.3.4 Semi-centralised external monitoring
    2.3.5 Scenarios
  2.4 GEMS
    2.4.1 Parameters
    2.4.2 Data structures
    2.4.3 Gossipping
    2.4.4 Node insertion
    2.4.5 Scenarios
  2.5 Summary

3 GEMS Implementation
  3.1 Gossip agent
    3.1.1 Settings
  3.2 Resource monitoring
    3.2.1 Visualiser
    3.2.2 Expanding the resource monitor
  3.3 Network simulator
  3.4 Additions and adjustments
    3.4.1 Alerting
    3.4.2 Gossip node selection
    3.4.3 Layering
    3.4.4 Locking
    3.4.5 TCP and UDP

4 Research Method
  4.1 Performance benchmarking
    4.1.1 Test setups
    4.1.2 Number of nodes and geolocation
    4.1.3 Parameter settings
    4.1.4 Benchmarking script
  4.2 Resource usage
    4.2.1 Resource metrics
    4.2.2 Test setup
  4.3 Data analysis

5 Results
  5.1 Performance benchmarks
    5.1.1 Number of nodes and geolocation
    5.1.2 Parameter settings
  5.2 Resource usage

6 Discussion
  6.1 Performance benchmarks
    6.1.1 Number of nodes and geolocation
    6.1.2 Parameter settings
    6.1.3 Detection time model
  6.2 Resource usage
  6.3 Validation
    6.3.1 Performance benchmarks
    6.3.2 Resource usage
  6.4 Main research question

7 Conclusion
  7.1 Future work

Bibliography

A Ping Matrix

B Measurements
  B.1 Test setup: All
  B.2 Test setup: US + EU
  B.3 Test setup: EU
  B.4 Resource usage

C Literature Survey
  C.1 Gossip protocols
    C.1.1 GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems
    C.1.2 Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters
    C.1.3 Experimental Analysis of a Gossip-Based Service for Scalable, Distributed Failure Detection and Consensus
    C.1.4 Simulative performance analysis of gossip failure detection for scalable distributed systems
    C.1.5 A Gossip-Style Failure Detection Service
    C.1.6 The promise, and limitations, of gossip protocols
  C.2 Other approaches
    C.2.1 On scalable and efficient distributed failure detectors
    C.2.2 An adaptive failure detection protocol
    C.2.3 On the Quality of Service of Failure Detectors
    C.2.4 Failure detectors for large-scale distributed systems


Abstract

Most servers need to be available 24 hours a day, 7 days a week. Therefore, a failure of such a server needs to be detected as soon as possible. Distributed monitoring can be an alternative to conventional, centralised monitoring, promising robustness and lower failure detection times. This thesis researches in which cases a distributed monitoring solution is a better alternative to centralised monitoring. We made a proof-of-concept implementation of a distributed monitoring algorithm and compared it with centralised monitoring. We found that the distributed monitor can provide more information about the monitored servers, such as server load, and performed better than current external centralised monitoring solutions. This comes at the trade-off of increased running costs due to development and maintenance, and higher resource usage on the servers running the monitor.


Preface

This thesis concludes the one-year Master's programme in Software Engineering at the University of Amsterdam. After an initial month of planning and performing a literature survey, I spent the months of April to July at Voormedia researching distributed monitoring.

The results of this project would not have been possible without the help of several people. First of all, I would like to thank my university supervisor, Magiel Bruntink, for his feedback during the project and helpful tips when writing this thesis.

I would like to thank Voormedia for hosting this thesis project: Mattijs van Druenen as my supervisor, and everyone else who gave valuable feedback during the project and when reviewing previous versions of this thesis.

Finally, I would like to thank everybody involved in the master programme for their lectures, guidance and enthusiasm. The past year, I learned and experienced a lot about software engineering.


Chapter 1

Introduction

Online services should be available all the time; a failed website, for example, can cost a business many potential customers. To ensure high availability of a server, failures must be detected quickly. This allows system administrators to respond and make sure the service is back online as quickly as possible. Moreover, by monitoring system health, possible failure causes can be detected early, helping to prevent downtime.

We research in which cases a distributed monitoring solution is a better alternative to centralised monitoring solutions. To compare the two approaches, existing monitoring algorithms are studied and a proof-of-concept implementation of a distributed monitoring algorithm is made; its performance is tested using benchmarking. The distributed monitor uses a gossip protocol, which is a simple, yet flexible and robust approach to communication between the servers in the network.

This thesis is hosted by Voormedia, a company located in Amsterdam creating media solutions like websites and (mobile) applications. Voormedia currently uses a centralised monitoring solution and would like to know in which cases distributed monitoring is a proper alternative.

This document is structured as follows. The remainder of this chapter will state the research questions. Chapter 2 will discuss the background and context by introducing a set of scenarios and analysing the different monitoring approaches from the perspective of these scenarios. Next, chapter 3 will describe the implementation of the GEMS algorithm, used for distributed monitoring. Chapters 4 to 6 discuss the benchmarking method and present and discuss the results of these benchmarks. Finally, chapter 7 will conclude this thesis and identify future work.

1.1 Research questions

Main question. In which cases is a distributed monitoring solution a better alternative to centralised monitoring?

Sub questions:

1. What scenarios does Voormedia have with respect to monitoring servers?

2. What advantages and disadvantages does distributed monitoring have in terms of metrics, such as the failure detection time and resource usage?

(a) What is the influence of the geolocation of the nodes?

(b) What is the influence of the algorithm’s parameters, such as the gossip interval?

(c) How many resources does the implementation use and how is this affected by the algorithm’s parameters?


To answer the research questions, we will implement the gossip-based, distributed monitoring algorithm described by Subramaniyan et al. [SRGR06]. The implementation will be used to compare distributed monitoring with the centralised monitoring currently in use.

The first sub question identifies a set of scenarios, which can be important factors when choosing between distributed and centralised monitoring. These scenarios are formulated from the perspective of Voormedia and will provide an overview of the functional trade-offs between the two monitoring approaches.

The second sub question focuses on the technical differences, which can be measured. Examples of such measures are detection time and resource utilisation like processor and network usage. Using benchmarking, we will determine the performance and resource usage of the distributed monitoring algorithm and compare this to the centralised solution.


Chapter 2

Background and Context

This chapter starts by introducing the problem environment using a set of scenarios. Next, different monitoring approaches will be discussed and analysed from the perspective of these scenarios. Also, this chapter incorporates the findings of the literature study done for this thesis, found in appendix C.

2.1 Problem analysis

To analyse the functional differences between centralised and distributed monitoring, we listed possible scenarios. Each of these will be discussed from the perspective of both monitoring approaches in this chapter. This set is compiled based on the situation at Voormedia; however, all scenarios should be applicable to other situations as well.

Add server. This scenario describes how to add a new server to the monitor and how the system reacts to this.

Remove server. This scenario describes how a server can be removed from the monitor and how the system reacts to this.

Check status report. This scenario describes how system administrators can check status reports. These status reports can contain (historic) monitoring information, for example about the uptime, response time and server resource usage (for example CPU load).

Add sensor. This scenario describes how a new sensor can be added to the monitoring network. Sensors are used to monitor server resource information (for example CPU load).

Server failure. When a server fails, this needs to be detected by the monitoring system. Furthermore, an alert needs to be sent to notify the system administrators.

Server recovery. When a server recovers after it has failed for a period of time, the monitoring system has to detect this. It needs to continue monitoring the recovered server and a recovery message needs to be sent.

Network partition. In case a subset of the nodes is separated from the other nodes in the network, a network partition is created. The nodes in one of the partitions cannot reach the nodes in the other partitions.

2.2 Monitoring approaches

2.2.1 Centralised monitoring

Servers can be monitored using different approaches. Centralised approaches use a single server dedicated to monitoring. This central server collects all data about the status of the monitored servers and reports it to the administrator. It also sends an alert when a monitored server fails.


Semi-centralised monitoring

Some centralised monitoring solutions provide monitoring from different servers. These servers can be geographically distributed around the world, providing more information about the status of the monitored server. For instance, a server could be offline for a specific part of the world. This cannot be detected when the only monitoring server is in the same part as the server that is being monitored.

Monitoring models

Felber et al. [FDGO99] describe two models for monitoring by a (centralised) server. In the push model, the servers periodically send a heartbeat message to the monitor. The monitor knows a server has failed when the heartbeat messages from that server stop. In that case, it sends an alert to the system administrator about the failed server.

An alternative is the pull model. In this model, the monitored servers are passive and respond to liveness requests. The monitor will periodically send a liveness request to the servers and the servers will respond to these requests. When the monitor does not receive a response to a liveness request within a certain timeout, it knows that server has failed. This model is less efficient compared to the push model, because it uses two-way messages to monitor a server. However, the advantage is that the monitored servers are very simple: they only need to respond to the liveness requests and do not need to know the timing of the messages. The central monitor will control the frequency of the liveness requests.

The paper also introduces a third model: the dual scheme. This is a combination of the push and the pull model. Initially, the monitor assumes all servers will push new heartbeat messages. If the monitor stops receiving heartbeat messages, it will switch to the pull model and send a liveness request. If the server does not respond to that request either, a failure is detected.
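To make the pull model concrete, the sketch below shows a minimal liveness check in Ruby, roughly as a centralised monitor could run it. The host names, port, PING/PONG exchange and timeout are illustrative assumptions, not details taken from Felber et al. or from any particular monitoring service.

    require 'socket'
    require 'timeout'

    # Minimal pull-model check (illustrative): send a liveness request and
    # treat a missing or late reply as a suspected failure.
    def alive?(host, port, timeout_s = 2)
      Timeout.timeout(timeout_s) do
        socket = TCPSocket.new(host, port)
        socket.puts('PING')              # liveness request
        response = socket.gets&.strip    # the monitored server is expected to reply
        socket.close
        response == 'PONG'
      end
    rescue Timeout::Error, SystemCallError
      false
    end

    # The monitor polls each server at a fixed interval and alerts on failure.
    servers = [['web1.example.com', 4000], ['db1.example.com', 4000]]
    servers.each do |host, port|
      puts "#{host} appears to be down" unless alive?(host, port)
    end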

2.2.2 Distributed monitoring

A different approach is distributing the monitoring. This approach does not rely on a single monitoring server, but uses the servers to monitor themselves and each other. The advantage of this approach is robustness: there is no single point of failure in the form of a central monitor. The servers can use different protocols to communicate their status. A naive approach is for each server to send its status to all other servers in the network. However, this leads to a flood of messages when scaling up.

Gossip protocols

Gossip protocols are a more efficient way of communicating. In gossip protocols, each server maintains a list of time values. For every server in the network, the corresponding value indicates the number of rounds passed since the last heartbeat that has been received by any server in the network. Every set interval, each server sends its list (included in a gossip message) to another server in the network. This server can be chosen randomly, but Ranganathan et al. [RGTC01] also propose three variations: round-robin, binary round-robin and round-robin with sequence checking. When a server receives a gossip message, it updates its own knowledge about the network with the information it received.

To detect if a server failed, the servers in the network will detect if there is consensus about the failed server among the rest of the network. If all working servers suspect another server to have failed, it is considered down and a system administrator can be alerted.

2.2.3 Internal and external monitoring

Monitoring solutions can be internal or external. Internal monitoring requires a software client to be installed on the server that needs to be monitored. This client typically provides system information to the monitoring system. External monitoring does not require additional software to be installed; the monitoring is done independently of the server that is being monitored. Most of the time, external monitoring uses a service the monitored server already provides. For example, monitoring a web server will use HTTP requests to check if the server is still available.


An advantage of internal monitoring is the extra information it can provide. External monitoring cannot check the processor utilisation, for example. This does come, however, with the disadvantage of having to install software on all servers that need to be monitored.

Another advantage is the location within the network of the monitoring client. The client will be installed on the server, placing it within the local network. This way, services that are normally blocked by a firewall can be monitored. For example, the database server that a website relies on can be monitored independently of the website. Even though a failure of the database server can be detected externally by a non-functioning website (using HTTP status codes), the root cause of the error cannot be detected using external monitoring, because the erroneous status code can be caused by other faults as well.

2.3 Pingdom

The following subsections analyse Pingdom, a centralised monitoring solution, in terms of performance and capabilities. We chose Pingdom because it is currently in use by Voormedia; however, the analysis generalises to other external, centralised monitoring services as well. We compared Pingdom to several other services and found only minor differences (for example in the available alert methods).

2.3.1 Configuration

Using Pingdom, servers can be monitored by adding them through the web interface. Several types of monitors are available. Servers can be checked using an HTTP(S) monitor, but monitors for Ping (ICMP), DNS and mail servers can also be set up. Besides the monitoring settings, the checking interval and alerting settings are also configured here. The checking interval can be configured from every minute to every hour. Alerts can be sent using different media (for example email and SMS) and can be delayed. Delayed alerts are used to reduce the number of alerts by ignoring failures that are resolved within the delay threshold.

2.3.2 Monitoring

Once the monitor has been created, Pingdom will monitor the server. It runs the check at the chosen interval and measures the response time. If the server responds to the configured request with a successful HTTP response, Pingdom knows the server is still online. As soon as the server responds too slowly, with an erroneous HTTP response code or not at all, it is considered offline and an alert is sent to the administrators. This downtime is also registered by the Pingdom service in order to provide historic reports and uptime information about the monitored server. In order to detect when the server is restored, Pingdom will continue monitoring, even when the server has failed earlier checks.

2.3.3 Performance analysis

In order to compare the distributed monitoring algorithm to Pingdom, we need to determine the detection time of Pingdom. The deterministic monitoring algorithm checks the server at a set interval, for example every minute. As soon as the server has failed, the next check will detect this failure. An alert can be sent immediately after the failure has been detected, when Pingdom is configured to do this. This does not add extra time to the detection. The time between the server failure and the detection by Pingdom therefore is less than or equal to the checking interval. This means that Pingdom can detect a failure within one minute when checking the server every minute.

2.3.4 Semi-centralised external monitoring

Pingdom uses multiple external servers to run the checks. These servers are geographically separated, allowing for monitoring from different locations. Currently, a total of 61 servers in the US and Europe are used for monitoring.


With these, Pingdom can offer more robust monitoring and detect network partitions when at least one of their servers is outside the partition.

This also means that Pingdom provides external monitoring. As discussed in section 2.2.3, this limits the data that can be collected. Therefore, Pingdom provides no additional resource monitoring other than the response time of the checks and the status of the server.

2.3.5 Scenarios

Add server. When adding a server, basic information like the name and (ip) address of the server is filled in, as well as the monitoring frequency and alert threshold. Also, the type of monitor is configured; examples are Ping and HTTP. Pingdom will start monitoring this server by checking it at the specified frequency.

Remove server. Removing a server is done through the web interface. When removing a server from the monitor, Pingdom will stop checking it and delete the monitored data.

Check status report. Historic reports that contain monitoring data can be generated. These reports contain information about the server uptime and about the response time.

Add sensor. Pingdom does not provide functionality to monitor custom sensors. Because of the limitations of an external monitor, this data cannot be accessed.

Server failure. When a server has failed, it will stop responding to the Pingdom checks. When the set threshold of subsequent failed checks has been reached, Pingdom will send an alert informing the system administrators.

Server recovery. During downtime, Pingdom continues to monitor the failed server. Doing this, the downtime can be measured and when the server is recovered a notification is sent.

Network partition. In case of a network partition, Pingdom will detect a failure. Because the service monitors the servers from different (geographically separated) locations it is unlikely that all monitoring servers are contained in the same partition.

2.4 GEMS

Gossip-Enabled Monitoring Service (GEMS) is a monitoring algorithm introduced by Subramaniyan et al. [SRGR06]. The algorithm works by creating a network of nodes. Each server that is monitored represents a node in this network and also participates in monitoring the other nodes. This makes it a distributed monitoring solution. The following subsections discuss the algorithm as described in the original paper.

2.4.1 Parameters

The algorithm uses two parameters, which have the same value for all nodes:

tgossip: time between two gossip messages sent by a node.

tcleanup: threshold value for the heartbeat value of a node before it will be suspected to have failed.

Every tgossip, each node in the network sends a gossip message to another node in the network. The node that will be gossipped to is picked at random. On average, each node will send and receive one gossip message each round. Typically, the value of tgossip will be between 0.1 and 1 second. Depending on the desired failure detection time, this can be increased or decreased outside this range. When decreasing the value of tgossip (resulting in more gossipping each second), the detection time decreases, at the cost of a higher load on the servers and the network.


2.4.2 Data structures

Each node keeps four data structures that are used for monitoring the other nodes:

Nodes list: contains information (name, ip and port) about all nodes currently in the network, including failed nodes.

Gossip list: contains, for each node, the number of gossip rounds passed since the last heartbeat.

Suspect list: contains, for each node, whether the node is suspected to have failed.

Suspect matrix: two-dimensional matrix where each row is the suspect list of a node in the network.

When a node has not heard from another node, either directly or through gossipping from other nodes, the heartbeat value increases each round (every tgossip). When this heartbeat value exceeds the threshold value, tcleanup, the node is considered to have failed. This threshold exists to avoid false positive alerts, which can occur due to the random node selection of the algorithm and network latency.

If any node in the network suspects another node, it will also check whether consensus about the suspected node has been reached among the rest of the network. To do so, it checks the suspected node's column in the suspect matrix. This column contains the suspect value of every node about the suspected node. If all other live nodes also suspect the node, consensus is reached. A node is ignored (not live) when at least half of the network suspects the node. The suspect value of the suspected node itself is ignored as well.
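To illustrate the consensus check, the Ruby sketch below walks a column of the suspect matrix. The data layout (hashes keyed by node identifier) and the method names are assumptions made for this sketch, not the structures of the actual implementation.

    # Illustrative consensus check over a suspect matrix.
    # suspect_matrix[i][j] is true when node i suspects node j.
    class FailureDetector
      def initialize(suspect_matrix)
        @suspect_matrix = suspect_matrix
        @node_ids = suspect_matrix.keys
      end

      # A node counts as live when fewer than half of the nodes suspect it.
      def live?(node_id)
        suspecting = @node_ids.count { |i| @suspect_matrix[i][node_id] }
        suspecting < @node_ids.size / 2.0
      end

      # Consensus: all live nodes, excluding the suspected node itself, agree.
      def consensus_on?(suspected_id)
        voters = @node_ids.select { |i| i != suspected_id && live?(i) }
        voters.all? { |i| @suspect_matrix[i][suspected_id] }
      end
    end

    matrix = {
      1 => { 1 => false, 2 => false, 3 => true },
      2 => { 1 => false, 2 => false, 3 => true },
      3 => { 1 => false, 2 => false, 3 => false }  # the suspected node's own row is ignored
    }
    puts FailureDetector.new(matrix).consensus_on?(3)  # => true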

When a node detects that consensus about a failure has been reached, it broadcasts this to all other nodes. Now, an alert about the failure can be sent; however, this was not included in the original algorithm described in [SRGR06]. Therefore, an addition has been made to the algorithm, which is described in section 3.4.1.

2.4.3 Gossipping

Each round (every tgossip seconds), each node sends a gossip message to a random other node in the network. This message contains the gossip list and the suspect matrix of that node. When receiving a gossip message, a node first updates its local gossip list by taking, for each node, the minimum of the heartbeat value in its current gossip list and in the received gossip list. Following that, the suspect list gets updated based on the new heartbeat values. Finally, the suspect matrix is updated based on the modifications made to the gossip list. When the heartbeat value for a node is replaced by the received value, the received message has more recent information about that node; therefore, the corresponding row in the suspect matrix is replaced by the row from the received suspect matrix.
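As an illustration of this merge step, consider the Ruby sketch below; the data layout (hashes keyed by node identifier) and names are assumptions made for the sketch and do not mirror the thesis code.

    # Illustrative merge of an incoming gossip message.
    # gossip_list[id]    -> rounds since the last heartbeat of node id (lower is fresher)
    # suspect_matrix[id] -> the suspect list of node id (a hash of id => true/false)
    class GossipState
      def initialize(own_id, gossip_list, suspect_matrix, cleanup)
        @own_id = own_id
        @gossip_list = gossip_list
        @suspect_matrix = suspect_matrix
        @cleanup = cleanup
      end

      def merge(remote_gossip_list, remote_suspect_matrix)
        remote_gossip_list.each do |id, remote_heartbeat|
          next if id == @own_id
          next unless remote_heartbeat < @gossip_list.fetch(id, Float::INFINITY)

          # The sender has fresher information about this node: take its
          # heartbeat value and replace the corresponding suspect-matrix row.
          @gossip_list[id] = remote_heartbeat
          @suspect_matrix[id] = remote_suspect_matrix[id].dup
        end
        refresh_own_suspect_list
      end

      private

      # Our own suspect list follows from the updated heartbeat values.
      def refresh_own_suspect_list
        @gossip_list.each do |id, heartbeat|
          @suspect_matrix[@own_id][id] = heartbeat > @cleanup
        end
      end
    end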

An example of a node processing an incoming gossip message can be found in fig. 2.1. Here, the status of node 1 is shown, before and after processing the gossip message from node 2. Node 1 first updates its heartbeat values for nodes 2, 3 and 4, because the received heartbeat values are lower. Second, node 1 updates its suspect list based on the new gossip list. The result is that node 3 is no longer suspected. Finally, node 1 updates its suspect matrix by replacing all rows of which the heartbeat value was updated with the rows from the received gossip message (nodes 2, 3 and 4 in this example).

After these steps, node 1 will check for consensus about all suspected nodes. In this case, node 1 checks the column of node 4. Because all live nodes suspect node 4, consensus is reached and a failure has been detected.


Figure 2.1: Example of node 1 processing an incoming gossip message from node 2. This illustration has been recreated from [SRGR06, Figure 1].


2.4.4 Node insertion

In order to add a new node to the network, the node is required to connect to any of the nodes currently in the network. The node that introduces the new node to the network is called the sponsor node. If the new node tries to join the network through a failed node, it will report an error after a certain timeout period. In that case, the node can be restarted using a different sponsor node to join the network.

To join the network, the new node sends a join request to the sponsor. When the sponsor node receives such a request, it sends the new node two messages: its gossip list and the list of failed nodes. The gossip list contains all current nodes in the network, with the heartbeat values of the sponsor node. The new node uses this list to fill its own node list and copies the corresponding heartbeat values to its own gossip list. This is needed, since initialising all heartbeat values to 0 suggests that the new node recently gossipped with all nodes. However, this is not the case and could lead to an increased failure detection time.

The list of failed nodes is saved in the failure detector of the new node. This list contains the nodes that are currently failing and for which an alert has been sent. The new node requires this list to prevent it from sending new alerts for failures that have already been detected and reported.

After the sponsor node has sent the gossip list and failed nodes, it broadcasts the information about the insertion to all nodes in the network. Each node receiving this message will add the new node to its data structures and will start gossipping to it.
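A sponsor node's handling of a join request could be sketched as follows; the message format, the transport helper and the method names are assumptions made for illustration rather than the actual implementation.

    # Illustrative sponsor-side handling of a join request.
    class Sponsor
      def initialize(nodes, gossip_list, failed_nodes)
        @nodes = nodes              # id => address of each node currently in the network
        @gossip_list = gossip_list  # id => heartbeat value as seen by the sponsor
        @failed_nodes = failed_nodes
      end

      def handle_join(new_node_id, new_node_address)
        # 1. Give the new node the sponsor's view of the network, so it does not
        #    start with heartbeat values of 0 for nodes it never gossipped with.
        send_to(new_node_address, type: :gossip_list, payload: @gossip_list)

        # 2. Tell it which failures have already been detected and alerted on,
        #    so it will not raise duplicate alerts.
        send_to(new_node_address, type: :failed_nodes, payload: @failed_nodes)

        # 3. Announce the insertion so every existing node adds the newcomer
        #    and starts gossipping to it.
        @nodes.each_value { |address| send_to(address, type: :node_inserted, payload: new_node_id) }
        @nodes[new_node_id] = new_node_address
      end

      private

      def send_to(address, message)
        # Placeholder for the TCP transport used for critical messages.
        puts "send #{message[:type]} to #{address}"
      end
    end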

2.4.5 Scenarios

Add server. Adding a server starts by deploying the software to the new server. The server requires a configuration file (when not using the default configuration), which specifies the name of the server and the ip and port of the sponsor node. Finally, the server can be started, after which it will try to connect to the sponsor node and join the network.

Remove server. When a server needs to be removed from the network, it can be stopped by executing the stop command. This will notify the network, so that the server is removed from the data structures of all other servers and does not get suspected.

Check status report. The gossip algorithm itself does not support status reports. However, section 3.2 will discuss the resource monitor, which supports adding custom monitors. To demonstrate status reporting, we implemented a monitor that offers a web interface with a real-time graph of the different sensors in the network.

Add sensor. Adding a sensor to a node is done by deploying the sensor script to the node, updating the configuration file of the node and restarting the individual node. The rest of the network will be able to access this data (when needed) without a restart.

Server failure. In case of a failure, the other nodes will no longer receive gossip messages from the failed server. As a result, the heartbeat values at the other nodes will increase each gossip round. As soon as the heartbeat values at all live servers exceed the threshold value (tcleanup), consensus is reached and an alert will be sent.

Server recovery. When a server recovers, it will either be started again (in case of a forced reboot, for example) or continue gossipping without a restart (in case of a network failure). In both cases, the other nodes in the network will detect that the crashed node is back and notify it that it was offline. When receiving such a notice, the failed node will send a restore notification.

Network partition. The gossip protocol as described in [SRGR06] is robust to network partitions. Both partitions will detect the nodes in the other partition to have failed and will send alerts accordingly.


2.5 Summary

The table below provides an overview of the different scenarios discussed in the previous sections. Here, + indicates a scenario is supported, – that it is not supported and +/– that it is partially supported.

Scenario              Pingdom   GEMS
Add server            +         +
Remove server         +         +
Check status report   +         +/–
Add sensor            –         +
Server failure        +         +
Server recovery       +         +
Network partition     +         +

Pingdom supports all scenarios fully except adding custom sensors, which requires internal server access that an external monitor does not provide. GEMS supports all scenarios; however, the implementation made for this thesis is limited to real-time status reporting through the resource monitor. In order to provide full historic reports, a new resource monitor that stores and aggregates this data can be developed.


Chapter 3

GEMS Implementation

This chapter discusses the implementation of the GEMS algorithm and the additions and adjustments made for this thesis. Further, it gives a technical overview and an indication of the implementation and deployment effort.

The GEMS algorithm is implemented in Ruby. This choice was made mainly because Voormedia works with this programming language. Further, Ruby is an easy prototyping language because it is a scripting language that allows for rapid development and has an extensive standard library.

The system is separated into two main components: the gossip agent and the resource monitor. In addition to this, a local network simulator and a benchmarking script have been developed. All components are discussed in the following sections, except for the benchmarking script (which is discussed in chapter 4).

3.1 Gossip agent

The gossip agent is the main component of the software. It manages the state of the network by keeping a list of nodes, a gossip list containing all heartbeat values and the suspect matrix. These are implemented as discussed in section 2.4.2. Since the suspect list of a node is stored in the suspect matrix as well, the suspect list is not stored separately; the suspect matrix provides the suspect list to other classes by fetching the correct row. Since the gossip agent is the main component of the software, it is also the largest. It consists of 11 classes, each providing functionality (e.g. failure detection) or storing data (e.g. the suspect matrix). In total, these classes contain approximately 950 lines of Ruby source code.

3.1.1 Settings

The gossip agent is also responsible for loading the settings and parameters. Two settings files are being used: a global configuration file containing system-wide settings, such as the algorithm's parameters, and a node-specific file which contains information about the node, such as its name and sponsor node. Upon start-up of the node, both files are loaded from a local or remote location, which is specified as a start-up parameter. The settings are stored as YAML files, a readable data serialisation format that is supported by the Ruby standard library.
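To make the settings handling concrete, the sketch below shows how a global settings file could be loaded from a local path or a remote URL; the YAML keys and the load_settings helper are illustrative assumptions, not the actual schema or code of the implementation (the node-specific file is shown in section 3.2.2).

    require 'yaml'
    require 'open-uri'

    # Hypothetical global settings file (key names are illustrative):
    #
    #   gossip_interval: 0.5      # t_gossip in seconds
    #   cleanup_time: 40          # t_cleanup in gossip rounds
    #   network_simulator:
    #     enabled: false
    #     ip: 127.0.0.1
    #     port: 9000

    # The location is given as a start-up parameter and may be a local path
    # or a remote URL.
    def load_settings(location)
      raw = location =~ %r{\Ahttps?://} ? URI.open(location).read : File.read(location)
      YAML.safe_load(raw)
    end

    settings = load_settings(ARGV[0] || 'config/global.yml')
    puts "gossip interval: #{settings['gossip_interval']}s"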

3.2 Resource monitoring

One advantage of internal monitoring is the additional data that can be collected from the nodes. We support this using resource monitoring. Subramaniyan et al. [SRGR06] give an architectural overview of resource monitoring, on which this implementation is based.

Resource monitoring is done by the resource monitoring agent. It manages two types of resource monitors: sensors and monitors. Sensors are data providers, collecting system information.


Figure 3.1: Screenshot of the resource monitor, visualising the ping value (measured in ms) from each node to the sponsor node (AWS-EU-1).

Each sensor registers itself at the resource monitoring agent, specifying a tag. Monitors are data processors, using the system information collected by the sensors. They subscribe to updates from a specific sensor by specifying the tag, or subscribe to all sensors by using the reserved all tag.

The resource data is piggybacked on the existing gossip messages. Each time the gossip algorithm requests the resource data, the resource monitoring agent collects the data from all sensors. Additionally, when a gossip message is received, the resource data gets delivered to the monitors that have subscribed to the sensors.

3.2.1 Visualiser

To test and demonstrate the resource monitoring, we have implemented several sensors and a monitor. Four hardware sensors have been implemented (collecting CPU, memory, disk and network usage) as well as a sensor to measure the ping time to another (specified) server. To monitor these sensors, a graphing monitor has been developed. Using the Rickshaw2 javascript library, we can visualise the sensor data that is being monitored. A screenshot of the visualiser can be found in fig. 3.1. Here, the ping from each node to the sponsor node (which is node AWS-EU-1) is being monitored. In this example, the network consists of eight nodes, one in each Amazon Region3.

3.2.2 Expanding the resource monitor

Creating and adding new sensors and monitors are simple tasks that do not require changes in existing code. The resource monitoring agent is set up to allow new sensors and monitors without requiring a full restart of the network. Furthermore, the sensor and monitoring scripts are simple. In total, eight sensors and monitors have been developed, averaging 22 lines of code for each sensor and monitor.

To illustrate the process of adding new sensors and monitors, we will demonstrate the addition of a new sensor. Since the process of adding a new monitor is essentially the same, we will limit the example to adding a sensor. The new sensor will “measure” the local wall clock time of the server. The process consists of the following steps: creating the sensor script, deploying it, updating the configuration file and restarting the node. The steps are explained below.

2 http://code.shutterstock.com/rickshaw/
3 Amazon's cloud computing service has been used to test the software, in which instances can be launched within eight different regions.


Create script

The script for this sensor can be as follows, written in Ruby:

    1  class TimeSensor
    2
    3    def initialize(monitoring_agent, settings)
    4      monitoring_agent.register_sensor(:time, method(:poll))
    5    end
    6
    7    def poll()
    8      Time.now
    9    end
    10
    11 end

It consists of two methods: the initialize method and the poll method. The first method registers the sensor with the resource monitoring agent, specifying the tag and the method that supplies the monitoring data. Here, the poll method is registered as a sensor with the time tag. The second method will be called whenever the gossip agent requests the resource data. Here, the method just returns the current time.

The initialisation method also takes the settings object as a parameter (line 3). These settings can be configured in the settings files. In this sensor there are no settings used, but other sensors can have configurable parameters, such as the refresh rate.

Deploy script

Once the script has been developed, it needs to be deployed to the nodes that will be running the sensor. Nodes that will not run this sensor do not need the script. The script needs to be placed in the resource monitoring folder, since this is the place where all sensors and monitors are loaded from.

Change configuration

The settings files of the nodes that will be running the sensor need to be updated. The new sensor can be added to the list of resource monitors as follows:

1 name: "AWS-EU-1" 2 port: 1111 3 sponsor_ip: 0 4 sponsor_port: 0 5 6 RMA: 7 - CPUSensor: 8 refresh_rate: 2 9 - TimeSensor: 10 nil

After line 6, the resource monitors and sensors are registered. In this example, lines 9 and 10 have been added to use the TimeSensor on node AWS-EU-1. The TimeSensor has no parameters, therefore nil is specified. The name of the sensor in the configuration file is equal to the class name of the sensor implementation.

Restart nodes

Finally, the nodes that have an updated settings file and will be running the new sensor need to be restarted. This is done using the restart command on the node, or just by stopping and starting the service manually. The nodes will read the updated settings and start the TimeSensor.
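For completeness, a monitor counterpart to the TimeSensor might look like the sketch below. It assumes the resource monitoring agent exposes a register_monitor method symmetric to register_sensor, calling the handler with the reporting node's name and the sensor value; that interface is an assumption, since only the sensor side is shown in this thesis.

    # Hypothetical monitor that subscribes to the 'time' sensor and logs updates.
    # The register_monitor interface is assumed, mirroring register_sensor.
    class TimeLogMonitor

      def initialize(monitoring_agent, settings)
        monitoring_agent.register_monitor(:time, method(:update))
      end

      # Called whenever gossip delivers new data from a subscribed sensor.
      def update(node_name, value)
        puts "#{node_name} reports local time #{value}"
      end

    end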


3.3 Network simulator

In order to speed up the development and testing process, we have developed a network simulator. The network simulator has also been implemented in Ruby and acts as a proxy between the nodes running locally. It allows for introducing (randomised) network delay and packet loss, as well as simulating network partitions. Also, it provides some helper functions that speed up debugging the algorithm, for example to stop all nodes. The simulator has been used during development, allowing us to test the algorithm locally in different network environments.

The network simulator can be enabled through the global settings file. Here, the ip address and port of the simulator are also specified.

3.4 Additions and adjustments

Because of the scope of this thesis and the different application environment of the monitoring algorithm in the original paper, we made some additions and adjustments. These are described in the following subsections.

3.4.1 Alerting

Subramaniyan et al. [SRGR06] do not describe an alerting mechanism. When a node has detected consensus, this is broadcast across the network. However, this can happen at two nodes simultaneously, leading to multiple broadcasts. The nodes can handle this, but when using this broadcast as a failure alert as well, duplicate alerts can be sent. To solve this issue, only the live node with the lowest identifier will send the consensus broadcast and alert. When this node fails, the next live node will take over this role. The other nodes know an alert has been sent when they receive the consensus broadcast.
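Expressed in code, the rule can be as small as the Ruby sketch below; the method and variable names are illustrative only.

    # Illustrative alert-responsibility rule: only the live node with the lowest
    # identifier broadcasts consensus and sends the alert.
    def alert_responsibility?(own_id, all_ids, live)
      own_id == all_ids.select { |id| live.call(id) }.min
    end

    # Example: nodes 1..4, where node 2 has failed.
    live = ->(id) { id != 2 }
    puts alert_responsibility?(1, [1, 2, 3, 4], live)  # => true
    puts alert_responsibility?(3, [1, 2, 3, 4], live)  # => false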

3.4.2 Gossip node selection

During testing of the algorithm, we got a lot of false alerts in case of a network partition. Especially with a skewed distribution of the nodes, for example in a 30% partition, many gossip messages from the smaller partition are sent to unreachable nodes. This results in too little gossipping within the partition, leading to false positives. Ignoring failed nodes altogether when choosing a node to gossip to would completely solve this problem; however, in order to detect that the partition has been resolved, the nodes need to try gossipping to the other partition occasionally. Therefore, we made it less likely to gossip to failed nodes. This increases the gossipping within a network partition, making false positives less likely, while keeping the ability to detect that the partition has been resolved.
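One way to implement this bias is to give suspected nodes a lower selection weight, as in the sketch below; the 10:1 weighting is an assumption, since the thesis does not state the exact probabilities used.

    # Illustrative weighted gossip-target selection: failed (suspected) nodes
    # can still be picked, but far less often than live nodes.
    def pick_gossip_target(own_id, nodes, suspected)
      candidates = nodes - [own_id]
      weighted = candidates.flat_map do |id|
        weight = suspected.include?(id) ? 1 : 10   # assumed 10:1 bias towards live nodes
        [id] * weight
      end
      weighted.sample
    end

    nodes = (1..6).to_a
    suspected = [5, 6]   # e.g. the unreachable side of a partition
    puts pick_gossip_target(1, nodes, suspected)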

3.4.3 Layering

Layering of nodes allows for improved scalability by grouping nodes together. Nodes within one group communicate using the gossip protocol as described and implemented. In addition to that, the groups of nodes form higher level layers, which gossip less frequently with each other. Gossipping between groups is done by a single node every round. Each round, one node from each group gossips to a node in another group. This way, the groups get liveness information about each other and failures of whole groups are detected.

We decided that layering is beyond the scope of this project. Implementing support for layering would increase the complexity of the software, while the scalability advantage is not needed: the network size Voormedia is interested in can work within one group, making layering unnecessary. This simplification also benefits the resource monitor. Subramaniyan et al. [SRGR06] describe a resource monitor that uses aggregation functions to aggregate the resource data from different groups across layers. When using a single group, this functionality is not needed and therefore also not implemented for this thesis project.


3.4.4 Locking

The algorithm as described by Subramaniyan et al. [SRGR06] acquires a system-wide lock before inserting a new node. The lock manager, a single node that gets assigned by the network, manages the lock. This can, for example, be the node with the lowest identifier.

We decided to leave out the locking step when inserting a node, as its usefulness is limited. It would protect the network against concurrent insertions, which could lead to nodes having an inconsistent state. However, this is unlikely, as node insertions are not done regularly. Moreover, in our implementation, this problem can only occur when using two different sponsor nodes. When two nodes join the network simultaneously through the same sponsor node, the insertions are processed sequentially: the first node has already been added to the list of nodes that will be sent to the second node, and the first node will be informed about the insertion of the second node.

This adjustment made node insertions simpler and quicker, as no locking mechanism is required.

3.4.5 TCP and UDP

Both TCP and UDP data transfer are used by the implementation. All gossip messages are sent using UDP, because a single gossip message is not critical to the operation of the network. In case a gossip message gets dropped, the cleanup time ensures no false positive failures will be detected. Apart from dropped messages, UDP also does not guarantee the order of delivery of the gossip messages. Therefore, we add the current time to all gossip messages. Using this, the receiving node ignores gossip messages that are older than the latest gossip message received from that node.

In addition to UDP gossip messages, we use TCP data transfer to send critical messages. Critical messages are messages used for node insertions, failure and restore alerts and for node removals. These messages do not occur regularly, making the overhead of using TCP not an issue. The advantage of using TCP is that it guarantees correct delivery of the messages. This prevents from having inconsistent states among different nodes in the network that can occur when some nodes do not receive these messages.
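The staleness check on UDP gossip messages can be illustrated as follows; the wire format (a timestamped hash serialised with Marshal) is an assumption made for the sketch, not the actual message format.

    require 'socket'

    # Illustrative UDP gossip send/receive with a timestamp used to discard
    # out-of-order (stale) messages.
    def send_gossip(host, port, gossip_list, suspect_matrix)
      payload = Marshal.dump(sent_at: Time.now.to_f,
                             gossip_list: gossip_list,
                             suspect_matrix: suspect_matrix)
      UDPSocket.new.send(payload, 0, host, port)
    end

    # latest_seen[sender] holds the timestamp of the newest message processed
    # from that sender; anything older is ignored.
    def process_gossip(raw, sender, latest_seen)
      message = Marshal.load(raw)
      return :stale if latest_seen[sender].to_f >= message[:sent_at]

      latest_seen[sender] = message[:sent_at]
      message   # hand the fresh gossip list and suspect matrix to the merge step
    end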


Chapter 4

Research Method

This chapter describes the research method. We will use benchmarking to analyse the performance of the distributed monitoring algorithm. These benchmarks will help answer sub question 2.

4.1 Performance benchmarking

In order to test the performance of the distributed monitor, we use performance benchmarking. This is a structured method for experimental analysis of the implementation of an algorithm, used to compare the performance with other approaches. We will follow principles described by Moret [Mor02], Johnson [Joh02] and Barr et al. [BGK+95] to ensure comparability, reproducibility and proper reporting of the results.

We developed a benchmarking script that automates the process of starting and stopping the monitors on different servers. It measures the time it takes to do certain tasks, for example inserting a new node. By running this script with different parameters for the algorithm, we can observe the effect of these parameters on the different metrics. The measured time is wall clock time, since this is the most relevant to monitoring.

4.1.1 Test setups

The nodes running the monitoring algorithm will be hosted on Amazon EC2 instances. Amazon instances are virtual machines that run within a specific region. There are eight regions available, distributed around the world. To test the algorithm, we will run three different test setups that cover three different usage scenarios:

All: North Virginia, Oregon, North California, Ireland, Singapore, Tokyo, Sydney, São Paulo*

US + EU: North Virginia, Oregon, North California*, Ireland

EU: Ireland*

*: sponsor node in this setup.

Test setup All uses all eight regions available. This is the worst-case scenario, due to the latency between the different regions around the world. Test setup US + EU uses all four regions in the US and Europe. This is the most interesting setup for Voormedia, as they mostly use these regions. The average latency between these regions is less than half of the latency of the regions in test setup All. Finally, test setup EU uses nodes within one single region. This test setup is used to test performance on internal networks, as traffic between nodes in the same region stays within the internal network. We chose Ireland (EU); however, any region would give similar results, because the latency within a region is very similar for all regions. This is shown in appendix A, where we have tested the latency between all regions.


The sponsor node for all test runs within one test setup does not change. To avoid overly optimistic results because of the choice of sponsor node, we tried to pick the worst-case sponsor for each test setup. This node was found by testing the ping values between all regions. The region that has the highest ping to the other regions was picked as the sponsor node. The results of these latency tests can be found in appendix A.

4.1.2 Number of nodes and geolocation

To help answer research question 2a, we will run benchmarks varying two parameters: the number of nodes and the geolocation of the nodes. Running the benchmarking script, we will collect four performance metrics:

Initialisation time (tinit): time between starting all nodes and the moment that all nodes added each other for monitoring.

Insertion time (tinsert): time between starting one new node and the moment that the new node has added all existing nodes and all nodes added the new node.

Detection time (talert): time between the failure of one node and the alert message for that node.

Restoration time (trestore): time between the restart of the crashed node and the restore message sent by the restarted node.

We will run benchmarks using all three test setups (described in the previous section). For each of these setups, we will run tests for different network sizes.

4.1.3 Parameter settings

In addition to the number of nodes and their geolocation, we will test the influence of the two algorithm parameters. Using this, we can answer research question 2b. The following parameters will be varied in these test runs:

Gossip interval. The gossip interval determines how often the nodes gossip. A higher gossip interval implies fewer gossip messages are sent each second. Therefore, we expect a lower server load as it increases, but also an increase in detection time.

Cleanup time. The cleanup time determines how many gossip rounds the nodes will wait until another node is considered to have failed. We expect this parameter to affect the detection time as well, however, it should have no influence on the other performance metrics and the resource usage.

In order to minimise the effect of latency, we will run these benchmarks within one region, using test setup EU. The same performance metrics as described in the previous section are collected. We will run these tests using different network sizes, as this influences the performance metrics as well.

4.1.4 Benchmarking script

All performance benchmarks will be done using a script. This will run the tests efficiently and eliminate possible human errors. The script runs on a local machine and requires a list of n nodes. Running it with n nodes, it first connects to all nodes using SSH. When all connections are made, it starts the monitor on n − 1 nodes. One random node (except the sponsor node) is not yet started in order to measure the insertion time in the next stage. The script waits until the nodes are started and registers tinit. After this, the nth node is inserted and tinsert is measured. When all nodes have added the last node, one random node is crashed by killing the Ruby process. An alert about the failure of the crashed node is awaited and the time is reported as talert. Finally, the crashed node is restarted and the time it takes until the restore message is sent is measured as trestore. After the script has finished, all nodes are stopped and the results are saved in a CSV file. Along with the four metrics, the number of nodes, the inserted node and the crashed node are also logged.
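A heavily simplified version of such a script is sketched below. The host names, remote commands and the wait_for checks are placeholders (the real script keeps persistent SSH sessions and watches the nodes' output for the relevant events), so this only illustrates the overall flow.

    require 'csv'

    # Simplified benchmarking sketch: start n-1 nodes, time the initialisation,
    # insert the held-back node, crash a random node, restart it, and record
    # the wall-clock durations measured on the local machine.
    HOSTS = %w[node1.example.com node2.example.com node3.example.com]

    def remote(host, command)
      system('ssh', host, command)       # placeholder for a persistent SSH session
    end

    # Placeholder: the real script waits for specific output from the nodes and
    # returns the elapsed wall-clock time.
    def wait_for(description)
      start = Time.now
      puts "waiting for: #{description}"
      Time.now - start
    end

    sponsor, *others = HOSTS
    held_back = others.sample            # inserted later to measure t_insert

    (HOSTS - [held_back]).each { |host| remote(host, 'monitor start') }
    t_init = wait_for('all nodes added each other')

    remote(held_back, 'monitor start')
    t_insert = wait_for('the network added the new node')

    crashed = (HOSTS - [sponsor]).sample
    remote(crashed, 'pkill -f gossip_agent')   # placeholder crash command
    t_alert = wait_for('failure alert for the crashed node')

    remote(crashed, 'monitor start')
    t_restore = wait_for('restore message from the recovered node')

    CSV.open('results.csv', 'a') do |csv|
      csv << [HOSTS.size, held_back, crashed, t_init, t_insert, t_alert, t_restore]
    end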


All time measurements are made on the local machine running the script. This prevents measurement errors from unsynchronised clocks between nodes in the network. It does add the delay of the SSH connection between the local machine and the nodes in the network; however, this influence was measured to be small. Appendix A shows the latency between any of the regions and the Voormedia office (which is where all benchmarks are done). The maximum latency is 525 milliseconds, measured between Sydney and Voormedia. Because there will also be a delay when sending real alerts (for example when using email), this influence is accepted.

4.2 Resource usage

Besides the performance, we are also interested in the resource usage of the implementation. Because the performance benchmarking script does not replicate a real-world scenario where the algorithm is running for a longer period of time, we cannot measure this during the performance benchmarking. Therefore, we performed separate tests to measure the resource usage. Using these measurements, we can answer research question 2c.

Testing the resource usage is done by running n nodes for a longer period of time. During this time, the network is not disturbed by node additions or removals, allowing for the network to stabilise. After the network has stabilised, the measurements are taken. This will be discussed in more detail in the following subsection.

The resource usage benchmarks will vary the number of nodes and the gossip interval parameter. We will keep the cleanup time constant for all test runs, as this value does not affect the resource usage of the algorithm. The cleanup time is set to 40. This is a realistic value, also used in the performance benchmarks, and large enough to avoid false positive alerts.

4.2.1 Resource metrics

During the resource tests, the following metrics are collected:

• CPU usage
• Network in
• Network out

The measurements are made using Amazon CloudWatch. The nodes run on Amazon virtual instances and CloudWatch externally monitors the CPU and network usage of these. Since the monitoring algorithm is the only software running on these instances, the external metrics that are collected are equal to the resource usage of the monitor. An example of CloudWatch can be found in fig. 4.1.

4.2.2 Test setup

To measure the resource usage, we will use a setup similar to test setup EU (defined in section 4.1.1). To minimise the potential influence of latency on the resource usage, we will run all nodes within one region. The measurements will be made at a single node in the network, but not at the sponsor node. Both the influence of the number of nodes and the gossip interval parameter will be tested using this setup.

The node at which the measurements are made is scaled up to a medium instance (instead of the micro instances used before). This instance type provides 3 ECUs where 1 ECU provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor, according to the definition by Amazon.


Figure 4.1: An example of the CPU monitor from Amazon CloudWatch, which is used to measure the resource usage of the algorithm.

4.3 Data analysis

All measurements made will be exported to a CSV file. These can be imported into R, which we will use to process the measurements and plot graphs to visualise the results. Furthermore, we can use it to fit a linear model that can be used to estimate the performance of the algorithm at different parameter settings.

We will validate the number of benchmark runs by generating box plots and scatter plots. These will illustrate the variance among the results, on which we can base the interpretation of the data. This validation can be found in section 6.3.


Chapter 5

Results

This chapter presents the results collected from the benchmarking. The benchmarks are discussed in separate sections, where the first section discusses the performance benchmarks and the second section the resource usage of the algorithm. In addition to the figures in this chapter, the measurement data can also be found in appendix B.

5.1 Performance benchmarks

5.1.1 Number of nodes and geolocation

The first experiment compares the influence of the number of nodes and the geolocation of the nodes. The other parameters were fixed at tgossip = 0.1 and tcleanup = 20. We ran 25 iterations of the benchmarking script using from 2 up to a maximum of 30 nodes. The results are plotted in fig. 5.1, where each point represents the mean value of the 25 iterations.

5.1.2 Parameter settings

In order to test the influence of the algorithm's parameters, we ran four series of benchmarks with different parameter settings. All benchmarks used the EU test setup, having all nodes within one region. The benchmarks were done from 5 to 30 nodes, with a 5-node interval. We switched to this interval because previous measurements showed that the intermediate measurements are redundant; we do not need such a fine granularity, and this allows testing more parameter settings. The results can be found in fig. 5.2, where each point represents the mean value of 25 iterations of the benchmarking script.

5.2 Resource usage

Finally, we ran benchmarks to compare the influence of the number of nodes and the parameter settings on the resource usage. We used the EU test setup again and did benchmarks from 5 to 30 nodes with a 5-node interval. The nodes were left running for at least 15 minutes before taking the measurements, allowing the network to stabilise. During the testing, no false positive alerts were generated. To get a realistic view of the resource usage, the benchmarks were made with the nodes running all resource monitors described in section 3.2.1.

The results are plotted in fig. 5.3. Each point represents a measurement extracted from CloudWatch. Amazon calculates these values from 5 consecutive test samples (made every minute) and reports the average every 5 minutes. Because of this, we did one test run for each of the different parameter settings. Furthermore, since the algorithm was left running for a longer period of time, we could verify that the network had stabilised.


Figure 5.1: Mean values for the four performance metrics for the three test setups (tgossip = 0.1, tcleanup = 20). Panels: (a) Initialisation time, (b) Insertion time, (c) Detection time, (d) Restoration time; each plots time (s) against the number of nodes for the All, US + EU and EU setups.


(a) Initialisation time (b) Insertion time (c) Detection time (d) Restoration time

Figure 5.2: Mean values for the four performance metrics for different settings of the algorithm parameters. The graph legends represent: (t_gossip, t_cleanup). The data for these plots can be found in appendix B.


(a) CPU load (b) Network in (c) Network out

Figure 5.3: Measured resource usage for different settings of the algorithm parameters. The graph legends represent: (t_gossip, t_cleanup). The data for these plots can be found in appendix B.4.


Chapter 6

Discussion

This chapter will discuss the results from the perspective of the research questions (stated in section 1.1) and analyse the validity of the results.

6.1 Performance benchmarks

6.1.1 Number of nodes and geolocation

Research Question 2a: What is the influence of the geolocation of the nodes?

The first set of benchmarks tests the influence of the number of nodes and their geolocation. This will help answer research question 2a. The graphs in fig. 5.1 show the mean values for the performance metrics up to a maximum of 30 nodes. The different lines are the mean values for the different test setups.

Initialisation time. During the initialisation of the network, the sponsor node starts a new network and all other nodes join it through the sponsor node. Since all nodes need to be added by the sponsor node, the initialisation time is expected to grow linearly with the number of nodes. Figure 5.1a shows this linear growth for all three test setups. The US + EU and EU test setups both have similar initialisation times. The test setup containing all nodes, however, grows faster with the number of nodes than the other test setups. This increase can be explained by the higher latency between the nodes: when the network delay to the sponsor node is higher, the overall time to start the network increases, as all nodes need to communicate with the sponsor. This determines the slope of the lines shown in the graph. Because three out of four nodes in the US + EU setup are located in the US, the slope for this setup is similar to the EU setup.

Insertion time. Inserting a node into the network starts by sending a join request from the new node to its sponsor. The sponsor node responds by sending the gossip list and the list of failed nodes, after which it notifies all other nodes about the insertion. This last step is the bottleneck in the algorithm, having a complexity of O(n). The insertion times (fig. 5.1b) for all test setups are very similar for smaller networks, all growing linearly with the number of nodes. However, when the network size exceeds n = 11 (All) or n = 19 (US + EU), the insertion time no longer grows gradually. We observed a large number of false positive alert messages, which flooded the network with alert and restore messages. This made it impossible for the gossip messages to come through and recover the network. The cause of this problem is that the nodes are too slow to process the network traffic: we are using Amazon's micro instances, which only provide a small amount of CPU resources. This, combined with the higher latency and the extra traffic that is generated when failures are detected, caused a bottleneck at the sponsor node that shows in the insertion time.
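To clarify where the O(n) cost comes from, the sketch below outlines the sponsor-side handling of a join request. The class, field and message names are illustrative assumptions and do not correspond to the actual implementation.

    # Illustrative sketch of the sponsor-side insertion step described above;
    # class, field and message names are assumptions, not the thesis implementation.
    class SponsorNode:
        def __init__(self, node_id, gossip_list, failed_nodes, transport):
            self.node_id = node_id
            self.gossip_list = gossip_list    # {node_id: heartbeat value}
            self.failed_nodes = failed_nodes  # set of node ids that reached consensus as failed
            self.transport = transport        # object with a send(node_id, message) method

        def handle_join(self, new_node_id):
            # Register the new node locally with a fresh heartbeat value.
            self.gossip_list[new_node_id] = 0

            # Reply with the current gossip list and the list of failed nodes.
            self.transport.send(new_node_id, {
                "type": "join_reply",
                "gossip_list": dict(self.gossip_list),
                "failed_nodes": list(self.failed_nodes),
            })

            # Notify every other node about the insertion: the O(n) bottleneck discussed above.
            for node_id in self.gossip_list:
                if node_id not in (self.node_id, new_node_id):
                    self.transport.send(node_id, {"type": "node_inserted", "node": new_node_id})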


Figure 6.1: Mean values for the insertion time (t_insert) and restoration time (t_restore) for the US + EU test setup (t_gossip = 0.1, t_cleanup = 20). The lines are plotted from the same data as in figs. 5.1b and 5.1d.

Detection time. Figure 5.1c shows the detection time: the time between the failure of a node and the alert message for that failure. The failure detection time depends on the network reaching consensus about a failed node. It grows linearly with the number of nodes; however, the CPU bottleneck causes the detection time to spike when the network size exceeds 11 (All) or 22 (US + EU). The bottleneck causes a lot of extra network traffic (due to false positive alerts), so it takes more time for gossip messages to be processed and consensus to be reached.

Restoration time. The restoration time (fig. 5.1d) is determined by the insertion of the failed node. As soon as another node informs the failed node that it has been down, it will send its recovery message. This happens after the failed node has been re-inserted; therefore the restoration time should be similar to the insertion time, also growing linearly with the number of nodes. This indeed turns out to be the case up to the point where the insertion time spikes, as shown in fig. 6.1. Therefore, we can safely assume the insertion time will be similar to the restoration time, even for larger networks. The restoration time thus indicates what the insertion time will be when the processing power of the nodes is no longer a bottleneck.

From the previous paragraphs, we can conclude that the geolocation of the nodes does have an effect on all four metrics. For the initialisation and insertion time, the influence can only be observed for the All test setup. The difference here is explained by the latency to the sponsor node: the average latency to the sponsor node in this setup is 270 milliseconds, compared to 89 milliseconds in the US + EU test setup (averages calculated from appendix A). The influence on the detection and restoration time is smaller, since these two metrics do not just depend on communication with the sponsor node. The detection time does require consensus (including the sponsor node), but the status of the sponsor node can be communicated through the other nodes in the network, avoiding the high latency to the sponsor node.



6.1.2 Parameter settings

Research Question 2b: What is the influence of the algorithm’s parameters, such as the gossip interval?

The second set of benchmarks is done to determine the influence of the t_gossip and t_cleanup parameters. The results of these benchmarks are shown in fig. 5.2 and will be used to answer research question 2b.

Initialisation and insertion time. The initialisation and insertion times are not influenced by the two parameters. This result is expected, because these actions do not require the algorithm to wait for a number of gossip rounds to be completed. For the insertion time at network sizes 25 and 30, the results do show some difference; however, this is caused by the high variance found in these test runs, which will be discussed in section 6.3.

Detection time. The detection time is the most important metric. Naturally, it is influenced by both parameters. First of all, the detection time follows a linear trend in all four test cases. Further, the slope of these lines is not affected by the two parameters: all test cases have a similar slope. The parameters determine the y-intercept (the “height”) of the lines. From the lines plotted in the graph we can determine the influence of both parameters. This can be expressed in a model that can be used to predict the detection time for different, untested parameter settings and network sizes. We will elaborate on this in section 6.1.3.

Restoration time. The restoration time is determined by two factors: the insertion time and the t_gossip parameter. The first is obvious, as a node needs to be re-inserted into the network. The restoring node follows the same procedure as a newly inserted node, contacting its sponsor node. The gossip interval parameter has an influence on the restoration time as well, since the restored node waits a number of gossip rounds until it sends a restore message. This number of rounds is fixed at 10 in all benchmarks that have been done. At a smaller gossip interval, these rounds will be completed in less time and therefore the total restoration time will be smaller.
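Under these assumptions, the restoration time can be approximated as the insertion time plus a fixed number of gossip rounds. The sketch below is a back-of-the-envelope calculation with illustrative numbers, not a measured result.

    # Back-of-the-envelope approximation of the restoration time described above.
    # t_insert stands for the (network-size dependent) insertion time; the values
    # used here are illustrative, not measurements.
    def estimated_restore_time(t_insert, t_gossip, rounds_before_restore=10):
        # Re-insertion through the sponsor, then a fixed number of gossip rounds
        # before the restore message is sent.
        return t_insert + rounds_before_restore * t_gossip

    print(estimated_restore_time(t_insert=5.0, t_gossip=0.1))  # 6.0 seconds
    print(estimated_restore_time(t_insert=5.0, t_gossip=0.5))  # 10.0 seconds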

From the previous paragraphs, we can conclude that the parameter settings do have an influence on the failure detection time and restoration time. The initialisation and insertion time do not depend on gossipping; therefore the parameters do not affect these metrics. The gossip interval (t_gossip) has the biggest influence, on both the detection time and restoration time. The cleanup time (t_cleanup) only affects the detection time and its influence is also determined by the gossip interval: increasing the cleanup time will have a greater effect when using a higher value for the gossip interval, since the cleanup time is expressed in gossip rounds.

6.1.3 Detection time model

The detection time follows a linear trend and depends on the number of nodes (n), the gossip interval (t_gossip) and the cleanup time (t_cleanup). With this, we will create a linear model that can be used to predict the detection time for other (untested) parameter settings and network sizes. By analysing the measurements made with the EU test setup (fig. 5.2c), we created the following model:

t_alert = 0.3 · n + 16 · t_gossip + t_gossip · t_cleanup    (6.1)

This equation consists of three components, one for each of the relevant parameters. The first component represents the factor of the number of nodes; we found this factor by calculating the slope of the plotted lines in fig. 5.2c. The second component represents the effect of the gossip interval. The factor 16 cannot be calculated or determined by analysing the algorithm, but depends on external factors such as the network latency. Therefore, we fitted the model on the measurements taken during benchmarking by adapting this value. The final component is explained by analysing the algorithm: nodes wait t_cleanup gossip rounds before another node is marked as failed. Since the cleanup time is expressed in gossip rounds, this waiting time amounts to t_gossip · t_cleanup seconds.
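For reference, the model translates directly into a small helper function; the example values below are illustrative.

    # Detection time model from Equation (6.1).
    def predicted_detection_time(n, t_gossip, t_cleanup):
        """Predicted failure detection time in seconds for a network of n nodes."""
        return 0.3 * n + 16 * t_gossip + t_gossip * t_cleanup

    # Example: 20 nodes, a gossip interval of 0.1 s and a cleanup time of 20 gossip rounds.
    print(predicted_detection_time(20, 0.1, 20))  # 0.3*20 + 16*0.1 + 0.1*20 = 9.6 seconds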


Figure 6.2: Mean values for the detection time (t_alert) from fig. 5.2c, along with the prediction according to the model in Equation (6.1) and two additional measurements with a gossip interval of 1 second. Each point represents a mean value of 25 iterations and the legend represents: (t_gossip, t_cleanup).

To verify this model, we plotted the predictions along with the results of the benchmarks shown in fig. 5.2c. In addition, we did two new benchmark runs with a gossip interval of 1 second. As fig. 6.2 shows, the model’s predictions are close to the measured detection time. This is confirmed by testing the fit of the model using R: we found an R² value of 0.967 and a residual standard error of 2.186 seconds. This was calculated from a total of 1300 observations taken using the EU test setup.
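The same fit statistics can be recomputed from the raw observations. The sketch below shows one way to do so in Python (the thesis uses R); the file name and column names are assumptions.

    # Hedged sketch of recomputing the goodness-of-fit statistics; the thesis does this
    # in R, and the file name and column names used here are assumptions.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("benchmarks_eu.csv")  # the EU test setup observations

    predicted = 0.3 * df["nodes"] + 16 * df["t_gossip"] + df["t_gossip"] * df["t_cleanup"]
    residuals = df["t_alert"] - predicted

    ss_res = float((residuals ** 2).sum())
    ss_tot = float(((df["t_alert"] - df["t_alert"].mean()) ** 2).sum())

    r_squared = 1 - ss_res / ss_tot
    # Residual standard error with three fitted model coefficients.
    rse = float(np.sqrt(ss_res / (len(df) - 3)))
    print(f"R^2 = {r_squared:.3f}, residual standard error = {rse:.3f} s")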

6.2 Resource usage

Research Question 2c: How much resources does the implementation use and how is this affected by the algorithm’s parameters?

Figure 5.3 shows the resource usage measured by CloudWatch, at three different settings for t_gossip. These results will be used to answer research question 2c.

CPU load. Most CPU resources are used by the gossipping. Each node sends one gossip message each round and, on average, receives one message as well. Sending a gossip message consists of collecting the gossip list and suspect matrix, compressing the data packet and finally sending it to a (randomly) chosen node. Processing a received gossip message is more complex. First, the gossip list is updated, which has a complexity of O(n), since the heartbeat value for each node is compared to the received heartbeat value. Second, the suspect matrix is updated according to the updated heartbeat values and the new matrix is checked for consensus. The consensus check is the most complex step, since the suspect matrix is two-dimensional in the number of nodes. This results in a complexity of O(n²). This is best shown in fig. 5.3a, at a gossip interval of 0.1 seconds. Further, this figure shows the growth factor depends on the gossip interval (t_gossip): a lower gossip interval (resulting in more gossipping per second) increases the growth factor.
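The quadratic cost comes from the consensus check over the suspect matrix. A minimal sketch of such a check is shown below; the data layout and names are assumptions, not the thesis implementation.

    # Minimal sketch of an O(n^2) consensus check over the suspect matrix;
    # the data layout and names are assumptions, not the thesis implementation.
    def check_consensus(suspect_matrix, live_nodes):
        """Return the ids of nodes that all other live nodes currently suspect.

        suspect_matrix[i][j] is True when node i suspects that node j has failed.
        """
        failed = []
        for j in live_nodes:  # candidate failed node (column)
            suspected_by_all = all(
                suspect_matrix[i][j]  # opinion of each other live node (row)
                for i in live_nodes
                if i != j
            )
            if suspected_by_all:
                failed.append(j)
        return failed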

Network usage. The network usage for incoming and outgoing network traffic is the same, which is expected because all nodes send and (on average) receive one gossip message every gossip interval. Therefore we will discuss figs. 5.3b and 5.3c in one paragraph. The network usage follows a linear trend. Even though the space complexity of the suspect matrix is O(n²), this does not show in the measured network usage.
