Benchmarking Akka

Academic year: 2021

Bachelor Informatica

Benchmarking Akka

Dennis Kroeb

June 15, 2020

Supervisor(s): Ana-Lucia Varbanescu

Informatica
Universiteit van Amsterdam


Abstract

In modern-day computing, concurrent programming is essential in high-performance systems. The Akka platform provides a programming model to meet this need. Systematic performance analysis studies for Akka do not exist. Therefore, this thesis proposes such a study. To this end, the performance of the Akka actor model is assessed by first comparing it to Java in a microbenchmarking experiment, which illustrates the overhead and different threading models in Akka. Furthermore, to compare Akka with other models, we ported two applications from the Computer Language Benchmarks Game (CLBG) to Akka, and compared their performance against the original CLBG models, using the CPU metrics, compressed code size, and sampled memory usage. Akka performed similarly to Java. Based on this analysis, we conclude that Akka can achieve performance similar to Java in non-blocking concurrent applications, but Akka has a larger code size in general.


Contents

1 Introduction 7

1.1 Research question and approach . . . 7

1.2 Ethical aspects . . . 8

2 Background and related work 9

2.1 Benchmarking and CLBG . . . 9

2.2 The Akka platform . . . 10

2.2.1 The Akka actor model . . . 10

2.2.2 Java, Akka and multithreading . . . 11

2.3 Related work . . . 12

3 Akka vs Java: a microbenchmark 15

3.1 Experiments setup . . . 15

3.2 The Counter program . . . 15

3.3 Scalability and overhead . . . 17

3.4 Measuring different phases . . . 18

3.5 Initialisation of the Akka actor model . . . 21

3.6 ForkJoinPools and ThreadPoolExecutors . . . 21

4 CLBG for Akka 25

4.1 Selecting relevant CLBG programs . . . 25

4.2 Program 1: Binary trees . . . 26

4.2.1 Porting to Akka from pseudocode . . . 26

4.2.2 CLBG results - Binary Trees . . . 29

4.3 Program 2: Reverse Complement . . . 31

4.3.1 Porting to Akka from Java . . . 32

4.3.2 CLBG results - Reverse Complement . . . 35

5 Conclusion and future work 39

5.1 Main findings . . . 39

5.2 Future work . . . 40


CHAPTER 1

Introduction

The world we live in values (real-time) connectivity more with each day, and the demand for high-performance distributed applications keeps rising. Proper design and implementation are important to ensure good performance, and choosing a suitable programming model is a key part of that. Concurrency plays an important role in these distributed applications.

The Akka platform claims to perform well in distributed applications, by providing native concurrency through an actor-based model. The platform incorporates the Akka toolkit, which is a collection of modules built by Lightbend, a company specialised in real-time cloud-based services [1]. The toolkit offers support for two common programming languages: Scala and Java. In this project, we only evaluate the Java implementations of Akka, because of our familiarity with the language.

The problem is that no systematic performance analysis comparing Akka to other models currently exists. Such an analysis can aid in choosing the right model when designing new applications. This thesis aims to fill that gap.

1.1

Research question and approach

To provide a systematic performance analysis of Akka compared to other models, we answer the following research question in this thesis:

How does the Akka actor model perform compared to other models?

This thesis focuses on assessing the performance of the Akka actor model, which is the core module of Akka. Akka uses the actor model programming principle, and our performance analysis gives insight into the performance impact of this abstraction compared to other programming principles, such as regular Object-oriented programming (OOP) in Java.

To answer our research question we propose two different types of comparison, specifically addressing two sub-questions:

SQ1: How does Akka compare against Java for a basic application?

SQ2: How does Akka compare against the multiple models in the Computer Language Bench-marks Game?

To answer SQ1, we microbenchmark Java multithreading versus Akka actors through a simple synthetic program (Chapter 3). For the second sub-question (SQ2), we benchmark Akka versus many other models, using the Computer Language Benchmarks Game (CLBG). We further report on porting existing input programs included by CLBG to Akka, because CLBG does not yet support Akka natively. This porting process and the CLBG results can be found in Chapter 4. In Chapter 5, we conclude our report and discuss our findings.


If a new model like Akka does not perform much better than other models on existing programs, rewriting an existing program might not be worth the work. It should not be forgotten that there is (currently) no such thing as a 'best' model, because all programming models have different use cases and features which vary in their relevance with respect to a given program. A fitting analogy: it is very hard to be an excellent sprinter and marathon runner at the same time.

1.2

Ethical aspects

This project stems from an application where a distributed system is used to supervise illegal activities (e.g. poaching) in national parks [2]. Because this surveillance software uses Akka, a proper performance analysis could help with catching poachers if our findings contribute to a better code base.

Furthermore, our work can enable users to make informed choices about the tools they use to program their applications, which is beneficial to the efficient use of (computational) resources. This could then in turn lead to less power consumption, which is better for the environment and also offers financial benefits, as long as the functionality is not compromised by the reduced power consumption.

The work in this thesis is open-source, which offers transparency and reproducibility. This allows others to perform additional research based on our findings. This project does not touch on controversial topics (like artificial intelligence, for example), and we see this work as ethically responsible.


CHAPTER 2

Background and related work

In this chapter, we aim to provide the basic terms and notions required to understand the research done in this thesis. Thus, we discuss the Akka platform, benchmarking, CLBG, and we briefly present related work.

2.1

Benchmarking and CLBG

In the context of this thesis, we define a benchmark as one program that measures the execution of an input program to quantify performance; a benchmarking suite is a set of such programs, typically representative of real-life applications, whose combined performance measurements give a better understanding of the performance across different types of applications. The main challenges for any benchmark are the selection of the applications and the selection of the representative metrics. Both selections depend on the goal of the benchmark. In this work we focus on using the Computer Language Benchmarks Game (CLBG) as our benchmark suite.

CLBG is a benchmark suite that tests many programming models using various implementations of algorithms (programs). The suite is well-documented [3] and open-source [4]. The benchmark results are posted on the CLBG website, but the suite also provides the option to display your own benchmark results on a (local) webpage (Figure 2.1).

To assess the performance of a model, CLBG uses the following metrics: execution time, memory usage, compressed code size, total CPU time over all threads, and individual thread usage. An overview of these metrics is given in Table 2.1; example measurements can be seen in Figure 2.1. Compressed code size is measured using the GZIP tool [5]. CPU information is gathered using the GTOP library [6]. Memory is measured by GTOP as well, and is sampled every 200ms. Programs that run for less than 1 second may therefore not have accurate memory measurements [3].

Metric               | Unit | Tool     | Details
Execution time       | Sec. | GTOP 2.0 | Measures the whole execution time, from start to finish
Total CPU time       | Sec. | GTOP 2.0 | Total non-idle CPU time for all cores combined
CPU load per core    | %    | GTOP 2.0 | Amount of non-idle CPU work performed per core, relative to the total time
Average peak memory  | MB   | GTOP 2.0 | Peak RAM usage (sampled every 200ms)
Compressed code size | B    | GZIP 1.6 | Compressed using minimal GZIP compression

Table 2.1: Metrics used by the CLBG suite
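CLBG measures compressed code size with the external GZIP tool at its minimal compression level. As a rough illustration of this metric only (not the suite's actual tooling), the following Java sketch uses the standard library's Deflater at its fastest level; the class and method names are our own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class GzipSizeDemo {
    // Returns the deflate-compressed size of a source string, using the
    // fastest (minimal) compression level, mirroring the idea of CLBG's
    // "minimal GZIP compression" metric. Names are our own.
    static int compressedSize(String source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out, deflater)) {
            dos.write(source.getBytes(StandardCharsets.UTF_8));
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        String code = "public class Hello { public static void main(String[] a) {} }";
        System.out.println(compressedSize(code));
    }
}
```

Note that the real suite compresses the whole source file with the gzip command; this sketch only conveys how the metric rewards concise source code.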

CLBG is transparent - due to its clear specifications and metrics - and offers a wide range of programs implemented using many models. This enables users to get performance insights while not overfitting to certain models, a mistake made by other Akka benchmarks [7][8].


Figure 2.1: The CLBG WebUI for displaying benchmark results. The number behind some source programs identifies the implementation of the algorithm for that model, as some models have multiple implementations [4].

CLBG obtains its measurements about models in a generic way and does not require much special action for a new model to be supported, assuming the implementation of the input program is valid with respect to the program's specifications. Because of this, CLBG is useful for benchmarking Akka as a new model for the suite. Input programs for CLBG can be extended to support other models, but one should be careful to follow the specifications listed by the suite, to ensure a fair comparison between models.

2.2

The Akka platform

The Akka platform is a toolkit which provides concurrency and distributivity in message-driven applications using the actor model programming principle [1].

2.2.1

The Akka actor model

The Akka actor model is the core of the Akka platform, and incorporates the actor model programming principle. The actor model allows concurrent programming with the advantage of enforcing encapsulation without locks, boosting concurrent performance [9]. This is very beneficial in Object-oriented programming (OOP), because encapsulation plays a major role there. The actor model uses a sender/receiver type model, where the receiver performs an action based on a message. Akka does this in such a way that the internals of the receiver are not compromised, removing the need for locks [1].


Figure 2.2: How Akka actors process messages, and how locks are avoided [1].

The following step-by-step description (from the Akka documentation) demonstrates how the actor model works, as displayed in Figure 2.2 [1]:

1. Actor2 receives a message from Actor1 and puts it in its queue.

2. If Actor2 was not scheduled for execution, it is marked as ready to execute (by the dispatcher).

3. The dispatcher starts the execution of Actor2.

4. Actor2 also receives a message from Actor3, which is then queued.

5. Actor2 processes the message from the front of the queue. The internal state of Actor2 is modified. Actor2 is now able to send messages to other actors.

6. Once the queue is empty, or other actors are awaiting execution, Actor2 is unscheduled.

The dispatcher in the Akka actor model is a scheduler that assigns actors waiting in memory to a thread. It is possible to have multiple dispatchers in a program, and they are highly configurable [10].

2.2.2

Java, Akka and multithreading

Multi-threading is the execution of programs making use of multiple threads. In general, we distinguish between logical (hardware) threads and virtual (software) threads [11]. Running multiple virtual threads on the same processor is achieved by each thread getting a time slice in which it may work using one hardware thread; when the time has elapsed, a next thread can start or continue its work. A context switch occurs during this transition [11]. A context switch encapsulates all the (time-consuming) operations required for a logical thread to switch state and work on a different task. Having too many virtual threads running on a processor with too few hardware threads often leads to very bad performance, even worse than sequential execution, because of the overhead introduced by excessive context switches.

In Java, threads can be spawned for a given task to be executed on-demand [12]. It is also possible to use a thread-pool : this method does not create a new thread for each new task, but instead re-uses threads efficiently. The benefit of a thread-pool is that you can have many more tasks than threads, reducing the overhead of thread creation [13].
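As a minimal sketch of the thread-pool idea (the class name and the task counts are our own), the following program submits many small tasks to a fixed pool of four threads, which are re-used rather than created per task:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPoolDemo {
    // Submits 'tasks' trivial jobs to a pool of 'poolSize' threads and
    // returns how many completed; the pool re-uses its threads for all jobs.
    static int runTasks(int poolSize, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> completed.incrementAndGet());
        }
        pool.shutdown();                        // accept no new tasks; queued ones still run
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(4, 100));   // 100 tasks served by only 4 threads
    }
}
```

Creating 100 raw Thread objects for the same work would pay the thread-creation cost 100 times; the pool pays it only four times.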

While concurrency and parallelism through multithreading can offer great performance benefits, it is not always easy to apply multithreading properly. When programming concurrently, tasks should be non-blocking: in concurrent applications, blocking operations are detrimental to performance, because the execution of a thread can be extensively delayed by the execution of other threads. This negative performance impact applies to Akka too [1].

Efficiently re-using threads can be done in Java by using the ThreadPoolExecutor (TPE) class [14][15]. Another option in Java is the ForkJoinPool (FJP) approach using the ForkJoinPool class [16]. This gives options to manually fork and join in Java code, using threads which get their tasks from a double-ended queue (a deque). A deque is a combination of a stack and a queue, where items can be inserted and removed from both the head and the tail [17].
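The deque structure can be illustrated with Java's standard ArrayDeque (the task names below are invented); items can be inserted and removed at both the head and the tail:

```java
import java.util.ArrayDeque;

public class DequeDemo {
    // Fills a deque at the tail, then removes one item from each end,
    // showing that both the head and the tail are usable access points.
    static String[] bothEnds() {
        ArrayDeque<String> deque = new ArrayDeque<>();
        deque.addLast("task1");
        deque.addLast("task2");
        deque.addLast("task3");
        String fromTail = deque.pollLast();   // one thread could work here...
        String fromHead = deque.pollFirst();  // ...while another takes from the other end
        return new String[] { fromTail, fromHead };
    }

    public static void main(String[] args) {
        String[] ends = bothEnds();
        System.out.println(ends[0] + " " + ends[1]);
    }
}
```

Note that ArrayDeque itself is not thread-safe; ForkJoinPool uses its own internal work-stealing deques, so this is only a sketch of the data structure, not of the pool's implementation.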

The benefit of using a deque is faster distribution of tasks, because tasks can be taken from both ends of the queue, whereas in a TPE all threads take their tasks from a single (regular) queue. If tasks are small, having a deque can thus increase performance and, as such, using an FJP is preferable [16]. This performance increase partially comes from the work-stealing algorithm, which acts as a load balancer that lets threads 'steal' tasks assigned to a different thread from its deque if the stealing thread has little or no work left to do. While stealing of work happens infrequently, it still improves performance in certain cases [16].
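The fork/join pattern described above can be sketched with Java's ForkJoinPool and RecursiveTask; the threshold value and the class name below are our own choices, not taken from the thesis code:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum extends RecursiveTask<Long> {
    // Sums the range [lo, hi) by recursively forking subtasks; ranges at
    // or below THRESHOLD are summed directly instead of being split.
    static final int THRESHOLD = 1_000;
    final long lo, hi;

    ForkJoinSum(long lo, long hi) { this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (long i = lo; i < hi; i++) sum += i;
            return sum;
        }
        long mid = (lo + hi) / 2;
        ForkJoinSum left = new ForkJoinSum(lo, mid);
        left.fork();                          // pushed onto this worker's deque
        long right = new ForkJoinSum(mid, hi).compute();
        return left.join() + right;           // an idle worker may have stolen 'left'
    }

    public static void main(String[] args) {
        long total = ForkJoinPool.commonPool().invoke(new ForkJoinSum(0, 1_000_000));
        System.out.println(total);
    }
}
```

The forked left half sits on the current worker's deque while the worker computes the right half itself; other workers with empty deques can steal the forked task from the other end, which is exactly the load-balancing behaviour described above.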

Because Akka is built on the JVM, Akka uses similar multithreading techniques to Java in its dispatchers. This means that each dispatcher can be configured to use either a TPE or an FJP, with the latter being the default option [10]. The performance analysis of TPE vs FJP still holds: when Akka actors have small tasks to perform, the ForkJoin dispatchers should perform better.

2.3

Related work

In this section we summarize relevant related work on benchmarking in general, comparative studies on programming models, and CLBG.

According to [18], a few important steps should be taken into account when designing a benchmark. The first step is deciding what exactly you want to measure: finding out which performance criteria are important and defining the metrics of the benchmark accordingly. The next step is defining procedures for how the benchmark should be run - for example, whether testing should be done under a (too) high workload (stress testing), how many repetitions of execution there should be, and how distinct measurements should be aggregated (e.g., by taking the average value). The third step is finding or creating input programs that are representative of what the benchmark is trying to measure. Programs for benchmarking Akka in Java should make good use of Akka, for example, because otherwise the benchmark is just quantifying Java's performance instead. Creating programs specifically for the benchmark forces designers to properly think about what the program should do, making it easier to analyse the measurements later on. Good benchmark programs should be simple, representative of the metrics, (trans)portable, system independent, and preferably written in higher-level languages [18]. Finally, the benchmark should be run on the programs and the results should then be validated and analysed. Validation involves checking whether the benchmark results are reproducible. The benchmark should also be run on different systems to check for transportability (compatibility). Properly documenting the benchmark also contributes to its quality. The CLBG suite meets these requirements, making it a suitable suite for use in this thesis.
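To make the "repetitions and aggregation" step concrete, here is a deliberately naive timing harness (no JIT warm-up or outlier handling; all names are our own):

```java
public class MicroBench {
    // Runs 'task' a number of times and returns the mean wall-clock
    // duration in nanoseconds, aggregating the repeated measurements
    // by taking the average, as described in the benchmarking steps.
    static double meanNanos(Runnable task, int reps) {
        long total = 0;
        for (int i = 0; i < reps; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return (double) total / reps;
    }

    public static void main(String[] args) {
        double mean = meanNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 10);
        System.out.printf("mean: %.0f ns%n", mean);
    }
}
```

A production-quality harness (such as JMH on the JVM) additionally handles warm-up, dead-code elimination and statistical reporting, which is why suites like CLBG define their measurement procedures explicitly.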

In the context of programming languages/models, benchmarking is often related to deciding which programming model to use for a given type of problem. A comparative study has been done using Rosetta Code programs [19], which focuses on comparing models in terms of their performance. The authors state that it is very important to precisely know the features of models to be able to compare them properly. Scripting, procedural, object-oriented and functional programming languages all have their pros and cons. The article used the Rosetta Code repository [20] to compare these paradigms using a data set large enough to provide high statistical significance. The article also covers differences in features of languages, like conciseness (to-the-point), performance and failure-proneness. The latter applies to weakly-typed languages, for example. Moreover, functional and scripting languages are more concise than procedural and object-oriented languages.

The authors signal a significant problem in comparisons of programming models: they can suffer from overfitting to specific problems, while not providing proper statistical significance or generalisability [19]. The results of the article are well-founded and statistically very significant, making it a useful source for model comparison and thus for this thesis. The CLBG suite provides many programs covering multiple problems in many models [4], avoiding the overfitting pitfall mentioned by the article.

A lot of work comparing specific pairs of programming models also exists. For example, research that compares Go and Java, specifically analysing their concurrency performance and the compile times of both models, is presented in [21]. The former metric is relevant for this thesis, since Akka focuses on concurrency as well. The work focuses on matrix multiplication using both models, and compares the execution time of this application. The authors conclude that Go is faster when using concurrency, but is outperformed by Java as the problem size increases in the sequential implementation of the matrix multiplication.

The article clearly describes the implementations of the used programs. The results are based on the mean of three simple benchmarks run for each experiment, and no statistical analysis is performed on these results. The performance comparison only focuses on sequential and concurrent matrix multiplication, and does not use any other programs to benchmark. This work, therefore, does not provide a thorough performance analysis of Go and Java, because it overfits to these specific cases [21][19].

Another performance comparison between Java and Go was done in a PhD thesis project [22]. This work focuses on the parallel performance of the two models, and concludes that Go performs matrix multiplication faster than Java, but that the speedup was relatively higher for a lower number of threads; the author claims that the observed differences in execution time were not a direct consequence of parallelism, but a result of differences in the models.

This work [22] also focuses only on implementations of matrix multiplication. Again, we see that performance analysis research is fit to one single problem, and not applied to a broader spectrum of problems where different features related to parallelism are tested properly. The project referenced the concurrency performance comparison article [21], and stated that Go (version 1.2) compiles three times faster than Java (version 1.7.0 45). While both studies use matrix multiplication to measure execution time, one cannot generalise this compile-time difference between the models, which is bad for the project's integrity. Nevertheless, these sources still contain relevant information for performance analysis, if we combine it with a broad set of programs from the CLBG suite to get an idea of the performance of the Akka actor model.

As for CLBG, it is referenced in existing research as a benchmark for an adaptation of C which allows dynamic code execution on the Java Virtual Machine (JVM) [23]. This adaptation is called TruffleC, and uses an efficient implementation of the C interpreter to dynamically compile code, while most traditional compilers produce static code. CLBG was used as a benchmark to test the performance of TruffleC versus static C compilers [23].

The results of the CLBG benchmark in this study are not explained in sufficient detail. The study does not use CLBG to its full potential, because not all metrics are shown in the results. This study is relevant to this thesis nonetheless, because the Akka platform is also built on the JVM. In this thesis, the CLBG metrics will be put to better use when benchmarking Akka.

To summarise, correct benchmarking requires proper selection of applications and metrics, including a thorough understanding of the application behaviour and of the meaning of the metrics. The benchmark should also give transparency about the environment in which it runs, and the used parameters should be reported. Programs that are benchmarked should be simple, to aid in the analysis of the results. Well-selected simple programs lower the amount of reasoning required to understand why certain behaviour is observed. Knowledge of the used input programs, and of the models in which these are implemented, is also important to properly analyse results. Previous studies show that performance analysis should not be generalised, because overfitting to a specific case is common. Another previous study used CLBG, but did not use all the metrics available in CLBG. Not using all of the metrics provided makes it harder to give a broader comparison of a model's performance.

Based on this related work analysis, and to the best of our knowledge, we assess that the work presented in this thesis is the first that attempts to extend CLBG with Akka, while using and reporting all metrics of the benchmark for a subset of its applications ported to Akka.


CHAPTER 3

Akka vs Java: a microbenchmark

In this chapter we study how Akka and Java compare to each other, using a simple synthetic input program. This analysis will provide an answer to our first research sub-question (SQ1, see Section 1.1).

3.1

Experiments setup

The experiments done in this thesis were performed on an HP Omen 15 (ax010nd) laptop (2016). This model has an Intel i5 6300HQ CPU, with a base clock of 2.3 GHz and the ability to boost to 3.2 GHz. The CPU has 4 cores and 4 hardware threads. The laptop has 8GB DDR4 RAM operating at 2700MHz. The OS is Ubuntu 16.04 LTS, installed on an M.2 NVMe SSD (256GB). The laptop was connected to the power grid at all times during experimentation, and any unnecessary user applications were closed before running experiments. For model version details, see Table 4.2. All experiments were automated using bash, and all plots were created using matplotlib in Python [24].

3.2

The Counter program

To get an idea of how Akka actors perform against native Java multithreading, we propose a comparison using a synthetic input program and measuring execution time. This program performs integer addition using the unary increment operator (++), where each thread (or actor) performs an equal share of the total iterations. The thread count and the total iterations can be given as arguments to the input program.

For our first experiment, the timer is started when the threads or actors are being created. The timer stops when the result of the program is computed and all threads or actors are fully done. Results are aggregated and displayed using bash and Python. The core of the Java implementation is presented in Algorithm 1, and the Akka implementation can be seen in Algorithms 2, 3 and 4.


Algorithm 1 Java multi-threaded counter. The time is measured with System.nanoTime(). Threads are stored so they can be joined later on when all work is done. Threads are created and then immediately started.

1: startTime = System.nanoTime();
2: for (int i = 0; i < totalThreads; i++) do
3:     Thread t = new Thread(() -> {
4:         NonAkkaCounter counter = new NonAkkaCounter();
5:         for (int j = 0; j < iterLimit; j++) do
6:             counter.count++;
7:         end for
8:     });
9:     t.start();
10:    threadArray.add(t);    ▷ Store the created threads to be able to check for completion
11: end for
12: EndTimerWhenDone();    ▷ Joins the threads from the threadArray and stops the timer
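The pseudocode of Algorithm 1 can be fleshed out into a runnable (untimed) Java sketch; the names mirror the pseudocode, but this file is our own reconstruction, not the thesis source:

```java
import java.util.ArrayList;
import java.util.List;

public class CounterBench {
    // Each thread increments its own counter iterLimit times; the partial
    // counts are summed after joining, as in Algorithm 1.
    static class NonAkkaCounter { long count; }

    static long run(int totalThreads, long iterLimit) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        List<NonAkkaCounter> counters = new ArrayList<>();
        for (int i = 0; i < totalThreads; i++) {
            NonAkkaCounter counter = new NonAkkaCounter();
            counters.add(counter);
            Thread t = new Thread(() -> {
                for (long j = 0; j < iterLimit; j++) counter.count++;
            });
            t.start();
            threads.add(t);                       // stored to check for completion
        }
        for (Thread t : threads) t.join();        // the EndTimerWhenDone() step
        long total = 0;
        for (NonAkkaCounter c : counters) total += c.count;
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 1_000_000));
    }
}
```

Each counter is written by exactly one thread, and join() establishes a happens-before edge before the partial counts are read, so no locks are needed.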

Algorithm 2 Akka code snippet of the main (parent) actor which creates counter actors and tells them to start counting. The time is measured with System.nanoTime().

1: startTime = System.nanoTime();
2: for (int i = 0; i < totalActors; i++) do
3:     final ActorRef<Counter.Command> counterActor =
4:         getContext().spawn(Counter.create(), "subCounter" + i);
5:     counterActor.tell(new Counter.Loop(iterLimit, getContext().getSelf()));
6: end for

Algorithm 3 Akka code snippet of a child counter actor which performs the counting. It tells the main actor it is done when the for-loop is done. See Alg. 4 to see how this is handled by the main actor.

1: procedure private Behavior<Command> onLoop(Loop loopInfo)
2:     for (int i = 0; i < loopInfo.limit; i++) do
3:         count++;
4:     end for
5:     loopInfo.parentActor.tell(new AkkaMainActor.CounterFinished(count));
6:     return Behaviors.same();
7: end procedure

Algorithm 4 Akka main actor which stops the timer if all children are done.

1: procedure private Behavior<Main> onCounterFinished(CounterFinished info)
2:     actorsFinished++;
3:     result.addAndGet(info.count);    ▷ Add the subcount to the atomic integer result
4:     if (actorsFinished >= totalActors) then
5:         EndTimer();    ▷ Stops the timer
6:         return Behaviors.stopped();
7:     end if
8:     return Behaviors.same();
9: end procedure


Figure 3.1: The performance of the counter program, using different amounts of threads (Java) and actors (Akka): (a) counting to 1024; (b) counting to INTMAX. The execution time is the average of 10 runs. Note that the x-axis is in log scale. The initialisation of the Akka ActorSystem is not measured.

3.3

Scalability and overhead

The counter program performs two additions per iteration: one for the for-loop counter and one for the increment of the local counter variable. Each thread or actor computes a part of the total result, and these intermediate results are added together when the threads or actors finish computing. The time is measured by the JVM from within the code. The initialisation of the program and its variables is not relevant for this experiment, because we are interested in the overhead and scalability of our concurrent counter program. We therefore do not include these (near-constant) initialisation costs in the measurements for our first experiment.

Experiment 1a: low count, strong scaling

In the first experiment, we only counted to 1024 (a very low number) using a varying number of threads, up to 1024. In this way, high overhead is expected for both implementations of the counter program. Moreover, as the number of threads/actors increases, the workload per thread decreases to the point where each thread performs a single iteration of the counter loop. As the problem size is fixed, and we only increase the number of threads/actors, this is a strong scalability test for the two models.

The results of experiment 1 are presented in Figure 3.1a. We make the following observations:

• The Akka actors are faster until the number of actors grows beyond 512.

• When we have too many counters, the overhead for both Java and Akka outweighs the speedup of additional threads/actors. Before this point, Akka performs much faster.

• The fastest time was observed when using a single actor. This behaviour confirms our expectation: the problem size was so small that any additional threads/actors caused extra overhead, and no visible performance gain.


Experiment 1b: high count, strong scaling

The experiment was repeated for a much larger problem size (i.e., equal to the maximum 32-bit signed integer INTMAX). The results are displayed in Figure 3.1b. We make the following observations:

• For a lower thread/actor count, Akka is faster in this experiment as well. However, the Java threading implementation outperforms Akka sooner than in Figure 3.1a (i.e., at 48 threads).

• Excessively large numbers of threads lead to the same performance behaviour as that observed in Experiment 1a (Figure 3.1a): too many threads/actors cause too much overhead, and slow the execution down.

3.4

Measuring different phases

As seen in Figures 3.1a and 3.1b, Akka is faster for a lower number of actors. In Figure 3.1a, the execution time initially does not change much when the number of threads is changed. To get an idea of what causes Java to be slower than Akka in these cases, the previous tests were extended to measure thread/actor creation and assignment of work, as well as the time to complete, which we already measured.

Experiment 2: Initialisation cost

The previous experiments did not measure the creation of the Akka actor system, where the main actor which spawns the child counter actors is initialised. Since these actors use the same (default) dispatcher, it is possible that the threads on which active Akka actors are processed are created outside of our measured time frame, explaining the difference between Java and Akka in terms of overhead.

The thread/actor creation time is measured once all threads/actors have been created. Time is measured from the moment both implementations are started (in the main functions), but we also measure from the original starting point; the latter is meant to capture the overhead of creating threads, among other initialisation factors in Akka. Once initialisation is done, all threads/actors are told to start working, and when this is done the next time interval is measured. When the threads/actors are all done working, the last interval is measured.

Algorithm 5 Code snippet of the Java counter program with the measurement of the thread creation and starting.

1: startTime = System.nanoTime();
2: for (int i = 0; i < totalThreads; i++) do
3:     Thread t = new Thread(() -> {
4:         NonAkkaCounter counter = new NonAkkaCounter();
5:         for (int j = 0; j < iterLimit; j++) do
6:             counter.count++;
7:         end for
8:     });
9:     threadArray.add(t);    ▷ Store the created threads to be able to check for completion
10: end for
11: timeInterval(System.nanoTime(), "INIT");    ▷ Measures the initialisation time and resets startTime
12: for (int i = 0; i < totalThreads; i++) do
13:     threadArray.get(i).start();
14: end for
15: timeInterval(System.nanoTime(), "WORK");    ▷ Measures the time for starting all tasks and resets startTime
16: EndTimerWhenDone("DONE");    ▷ Joins the threads from the threadArray and stops the timer
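A runnable reconstruction of this phased measurement could look as follows; the names are our own, and phase durations are returned instead of printed:

```java
import java.util.ArrayList;
import java.util.List;

public class PhasedCounter {
    // Splits the run into INIT (thread creation), WORK (starting threads)
    // and DONE (joining) phases, mirroring Algorithm 5. Returns the three
    // phase durations in nanoseconds.
    static long[] runPhased(int totalThreads, long iterLimit) throws InterruptedException {
        long[] phases = new long[3];
        List<Thread> threads = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < totalThreads; i++) {
            Thread t = new Thread(() -> {
                long count = 0;
                for (long j = 0; j < iterLimit; j++) count++;
            });
            threads.add(t);                         // created but not yet started
        }
        phases[0] = System.nanoTime() - start;      // INIT
        start = System.nanoTime();
        for (Thread t : threads) t.start();
        phases[1] = System.nanoTime() - start;      // WORK: starting all tasks
        start = System.nanoTime();
        for (Thread t : threads) t.join();
        phases[2] = System.nanoTime() - start;      // DONE: waiting for completion
        return phases;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] p = runPhased(8, 1_000_000);
        System.out.printf("INIT %d ns, WORK %d ns, DONE %d ns%n", p[0], p[1], p[2]);
    }
}
```

Separating creation from starting keeps the INIT phase free of scheduling effects, matching the reasoning given for splitting the loops in Algorithm 5.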


In Algorithm 5, we see how these intervals were measured. Note that the code in this algorithm differs from the code in Algorithm 1, because the creation and the starting of the threads have been split into different loops. Creating and starting a thread right away, while new threads are still being created, results in an early start of these threads; if the spawned threads exceed the logical threads of the run-time environment, the scheduling overhead of creation and starting could interfere with one another. Since the program only finishes when all threads are done, starting the first thread before the other threads have been created can then increase the execution time due to the mentioned overhead.

The results of this experiment can be seen in Figure 3.2, for the low-count and high-count versions of the experiment, respectively.

We make the following observations:

• Java thread initialisation appears constant regardless of the number of threads, and it is a significant penalty to the performance of the Java implementation. In Figure 3.2a, we can see that the work after thread creation takes far less time than the thread creation itself. This overhead does not scale with the workload size, because the line sits around 45 milliseconds in both Figures 3.2a and 3.2b.

• In Figures 3.2c and 3.2d we can see that the overhead of starting up an ActorSystem and some actors is considerable, and that it scales with the number of actors that need to be created. This scaling can be seen more clearly in Figures 3.2a and 3.2b. The initialisation of the ActorSystem itself takes roughly 840 ms, and this overhead increases slightly as more actors are created.


(a) 1024: The initialisation is measured from the start of creating threads/child actors.

(b) INTMAX: The initialisation is measured from the start of creating threads/child actors.

(c) 1024: Full initialisation measurements (d) INTMAX: Full initialisation measurements

Figure 3.2: Counting to 1024 and INTMAX using different amounts of threads (Java) and actors (Akka). Elapsed time is displayed for the three phases of the counter program. Average of 10 runs. Notice that the x-axis is in log scale. Code in Alg. 5


3.5 Initialisation of the Akka actor model

The Akka actor initialisation overhead mainly comes from the ActorSystem, which also handles the creation of system actors and the multithreading, either via a TPE or an FJP. The time it takes for an actor to receive its first message is negligible compared to the creation of the ActorSystem or the construction of an actor. This means that once an actor has finished executing its constructor, it is ready to process messages.

Experiment 3: Akka initialisation

To learn more about the initialisation of Akka, we timed the creation of the ActorSystem, the construction of the actors themselves, and the time until the first message was received, in the case of our counter program. The results are in Table 3.1, and they explain the large overhead visible in Figures 3.2c and 3.2d: the creation of the ActorSystem was included in those measurements. An ActorSystem cannot exist without any actors, so one could argue that the ActorSystem creation time is actually the sum of the actor construction and the ActorSystem creation. We note that the execution of the code in an actor's constructor never took more than 40 milliseconds; this is not shown separately in Table 3.1 because it falls under the Actor Construction phase. Isolating actor creation from ActorSystem creation is hard, because the two happen hand-in-hand.

Phase                     Mean (n=10) [ms]   Std. Dev. [ms]
Actor Construction        69.15              2.47
ActorSystem Creation      775.18             9.35
Receiving first message   3.33               0.37
Total time                847.66             8.51

Table 3.1: Average elapsed time of Akka initialisation, measured using the counter program. The number of actors does not affect these results, because the spawning of the child actors is not measured here; only the main actor is examined.

3.6 ForkJoinPools and ThreadPoolExecutors

So far, we have observed that Java threading scales differently from Akka actors as the number of threads/actors increases; this might be due to the Akka (default) dispatcher, as opposed to the many individual threads spawned in the Java implementation.

Akka dispatchers use either a ForkJoinPool (FJP) or a ThreadPoolExecutor (TPE). The former is the default setting, but both options use the number of threads of the common threadpool unless Akka is instructed otherwise. As such, using 1024 actors in Akka does not mean that 1024 threads are utilised.
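For reference, the dispatcher and its pool size can be overridden in Akka's `application.conf`. The following is a sketch using the standard `akka.actor.default-dispatcher` keys from the Akka 2.6 configuration reference; the pool sizes shown are illustrative, not the values used in our experiments:

```hocon
akka.actor.default-dispatcher {
  executor = "fork-join-executor"   # or "thread-pool-executor"
  fork-join-executor {
    parallelism-min = 3             # lower bound on pool size
    parallelism-factor = 1.0        # threads = ceil(cores * factor), clamped below
    parallelism-max = 3             # upper bound on pool size
  }
  thread-pool-executor {
    fixed-pool-size = 3             # used when executor = "thread-pool-executor"
  }
}
```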

The FJP and TPE options are both native to Java. To make a fair, and thus better, comparison between Java and Akka, we should compare the performance when both implementations use the FJP to achieve multithreading. For this to be done correctly, both implementations should use the same number of threads.

We will also compare Java and Akka using the TPE with these pool sizes, because we want to know what the difference between FJP and TPE is. We expect the FJP implementations to be faster, because this executor is claimed to be fast when the task size is small [16], which is the case for our counter program.
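On the Java side, the two executor variants compared in these experiments can be set up as sketched below. The pool size and task counts are example values, not the experiment's exact configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.LongAdder;

// Sketch: run the same small counting tasks on a ForkJoinPool and on a
// fixed-size ThreadPoolExecutor, as in Experiments 4 and 5.
public class PoolComparison {
    static long runOn(ExecutorService pool, int tasks, long iterLimit) throws Exception {
        LongAdder total = new LongAdder();
        List<Future<?>> futures = new ArrayList<>();
        long t0 = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            futures.add(pool.submit(() -> {
                long local = 0;
                for (long j = 0; j < iterLimit; j++) local++;
                total.add(local); // accumulate per-task counts
            }));
        }
        for (Future<?> f : futures) f.get(); // wait for all tasks to finish
        long elapsed = System.nanoTime() - t0;
        pool.shutdown();
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        int poolSize = 3, tasks = 64;
        long fjp = runOn(new ForkJoinPool(poolSize), tasks, 1024);
        // newFixedThreadPool is backed by a ThreadPoolExecutor
        long tpe = runOn(Executors.newFixedThreadPool(poolSize), tasks, 1024);
        System.out.printf("FJP=%dns TPE=%dns%n", fjp, tpe);
    }
}
```

The same submit/get workflow is used for both executors, so only the pool implementation differs between the two measurements.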

Experiment 4: ForkJoinPool comparison

To enable a fair Akka vs Java comparison when using the FJP, we must use FJPs of matching sizes in Akka and Java.

Because the common threadpool in our case (refer to Section 3.1) uses 3 active threads (determined by Java), we first examine the completion time of our counter experiments using a pool size of 3. We also experiment with pool sizes of 4 and 8, to see the effect of larger pools.


Figures 3.3a and 3.4a show Experiment 4’s results. We make the following observations:

• Initially, the performance with different pool sizes is roughly the same. However, as the number of tasks grows, differences in elapsed time can be observed between the pool sizes for both Java and Akka, although in Akka the difference is more noticeable.

• For the small problem size (Figure 3.3a), the pool size of 3 performs best. Also, for this case, using too many threads adds more overhead than speedup in Akka.

• For the large problem size (Figure 3.4a), the pool size of 8 is faster for both Java and Akka.

• The variability of the results (pictured as standard deviation in the graphs) grows as the pool size increases, for both the small and large problem sizes. A possible explanation is that the test environment has 8 hardware threads, and other (system) processes were still running during the tests (only necessary user programs were running); the resulting unrelated context switches can affect the measured time and show up as extra overhead in this experiment.

Experiment 5: ThreadPoolExecutor comparison

The results for both cases (low- and high-count) are presented in Figures 3.3b and 3.4b, respectively. We make the following observations:

• The performance trends for the TPE are similar to those measured for FJP.

• For TPE, the different pool sizes do not affect the elapsed time as much as for FJP, with the exception of p=8 for Java in Figure 3.4a. In that case, the increased number of tasks (and thus a smaller task size) lowered the elapsed time, but the elapsed time increases again beyond 256 tasks.

• The standard deviation areas are smaller for the TPE implementations in both Java and Akka. Pool sizes seem to affect the performance of a TPE implementation less than that of an FJP implementation in Java when the problem size is larger. For the small problem size in Figure 3.3b, a larger pool size caused more overhead.


(a) 1024: ForkJoinPool (b) 1024: ThreadPoolExecutor

Figure 3.3: Counting to 1024 using FJP and TPE implementations in Java and Akka. Average of 10 runs. The figures show the standard deviation as well. The chosen poolsizes are 3, 4 and 8.

(a) INTMAX: ForkJoinPool (b) INTMAX: ThreadPoolExecutor

Figure 3.4: Counting to INTMAX using FJP and TPE implementations in Java and Akka. Average of 10 runs. The figures show the standard deviation as well.


CHAPTER 4

CLBG for Akka

In this chapter, we discuss the porting of the CLBG programs to Akka and analyse their performance in order to answer sub-question SQ2.

4.1 Selecting relevant CLBG programs

Because we cannot port every input program available in CLBG due to time constraints, we need to select programs based on their relevance to Akka. This means we select programs which might give interesting insights into the performance of Akka in relation to other models. We prefer programs where concurrency is used, because they exercise the Akka actor model better. We identify these candidates by analysing whether existing implementations (in other models) show concurrent behaviour.

Because our Akka implementations are written in Java, looking at the Java implementations makes it easier to determine whether a program is suited for porting to Akka. Specifically, we looked at the following aspects of the source code:

• The use of multithreading and its configuration (e.g. ThreadPoolExecutors in Java);
• The size of individual tasks (compared to each other);
• The number of tasks;
• Blocking operations (e.g. reading/writing to a file).

On top of these criteria, we also consider the (code) complexity of the program: smaller and simpler programs are easier to port, and less complex code allows stronger assumptions when analysing why we observe certain behaviour in the results.

We have identified five interesting candidate programs, listed in Table 4.1; Binary Trees and Reverse Complement are the two we selected for porting. Both selected programs are concurrent, meaning that we can make an Akka port which utilises what the Akka actor model has to offer. The Binary Trees program is oriented towards memory allocation, and the memory usage of Akka seems important for us to investigate; the program is non-blocking. The Reverse Complement program, however, is blocking. This matters for the Akka implementation, and we are curious to see how the performance of Akka relates to other models in a blocking program. These programs have relatively low code complexity, making them favourable over the other candidates. Spectral Norm and Fannkuch-redux are more complex, but their listed properties are similar to those of Binary Trees, while Binary Trees also stresses memory; choosing Binary Trees therefore saves time and can provide information about Akka's performance in a memory-intensive concurrent program. The Pidigits program is sequential, and we assume that Akka would not perform better than Java in this case.


Program          Description                                           Conc.  Task size    Blocking
Binary Trees     (De)allocating many perfect binary trees and
                 checking validity. Memory intensive.                  Yes    Balanced     No
Rev. Complement  Constructing the reverse complement of a DNA
                 string by using a look-up table and a buffered
                 file read                                             Yes    Balanced     Yes
Spectral Norm    Calculate the Spectral Norm [25] [26]                 Yes    Balanced     No
Pidigits         Determine numbers of Pi, sequentially. Uses
                 the unbounded Spigot algorithm [27]                   No     Single task  No
Fannkuch-redux   Fannkuch (pancake) flipping algorithm which
                 works on permutations [28]                            Yes    Imbalanced   No

Table 4.1: Overview of the 5 (unordered) candidate CLBG programs and their features. Binary Trees and Reverse Complement were chosen to benchmark Akka. Task size refers to the balance among task sizes, where "Balanced" means that each task performs roughly the same amount of work. The right-most column specifies whether the program has any blocking operations.

4.2 Program 1: Binary trees

The Binary Trees program is focused on binary tree creation, and thus on memory allocation. The program takes one parameter, N, which is the tree depth; the fixed value N = 21 is used. This means the program works with very large binary trees, which we expect to take up a lot of memory. The memory efficiency of a model is important, and with so many tree nodes, differences in memory efficiency, if any, should become apparent in the results.

The program must consist of a Tree class or structure, which has pointers to its left and right child nodes, and methods that must allow for [29]:

• Tree allocation (constructor)

• Tree deallocation (destructor, implicit in most models)

• Tree traversal with a node count, meant to validate the tree's node count. Validation is done by printing the number of nodes; this output should match the reference output provided by CLBG.

The program itself creates the following trees, in order:

1. A stretch tree with depth N + 1, to verify there is enough memory available.

2. A long-lived tree allocated with depth N , which must remain in memory until all the other trees are (de)allocated.

3. Several trees created with varying depth, starting at 4 and increasing up to N with a step size of 2. Each of these trees is created multiple times.

4. Finally deallocate the long-lived tree.
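The tree structure and the check traversal from this specification can be sketched in plain Java. This is a minimal, hypothetical version: the CLBG implementations additionally wrap step 3 in the iteration loop and handle the stretch and long-lived trees.

```java
// Sketch of the CLBG Binary Trees kernel: allocate a perfect tree of a
// given depth bottom-up, then count its nodes by traversal.
public class TreeSketch {
    static final class Tree {
        final Tree left, right;
        Tree(Tree left, Tree right) { this.left = left; this.right = right; }

        // The tree is perfect, so checking one child for NULL suffices.
        int check() {
            return left == null ? 1 : 1 + left.check() + right.check();
        }
    }

    static Tree createTree(int depth) {
        return depth > 0
            ? new Tree(createTree(depth - 1), createTree(depth - 1))
            : new Tree(null, null); // leaf node
    }

    public static void main(String[] args) {
        // A perfect tree of depth d has 2^(d+1) - 1 nodes.
        System.out.println(createTree(4).check()); // prints 31
    }
}
```

With N = 21 the checked trees hold up to 2^22 - 1 nodes each, which is why allocation cost dominates this program.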

4.2.1 Porting to Akka from pseudocode

The Binary Trees program pseudocode can be seen in Algorithm 6. This code is based on the existing implementations and the program specifications. The pseudocode is meant to give a high-level abstraction of the program, making the Akka implementation (see Algorithm 7) easier to write.

There are important differences between using the Akka actor model and the more traditional programming principles one would use in Java. In Algorithm 6, a MAIN procedure (line 31) handles most of the program's work by using the TREE class; data flows directly from the helper function CREATE-TREE (line 23) into the MAIN procedure's variable checkSum (line 39). In Algorithm 7, however, the MAIN procedure (Alg. 7, line 50) is very short, because the MAINACTOR handles what the original MAIN did.


This difference exists in Akka because data flow between actors should always happen via messages. These messages are typically sent using the TELL command. A non-actor cannot be "told" something by an actor, and we therefore need a MAINACTOR to be able to receive (and send) data.

In the Akka pseudocode in Algorithm 7, the checkSum is sent (line 46) using the TELL command rather than via a return statement (Alg. 6, lines 17 and 19). The MAINACTOR then receives the checkSum (line 20) and orders the output so that it can be printed in the correct order. In Algorithm 6, a similar approach is used to ensure the order of the output, but this is not shown in the pseudocode for the sake of simplicity.

We also need to keep track of how many actors are done, because Akka otherwise does not terminate (by default). This requires analysing the conditions for program termination in a more involved way than when waiting for an FJP to finish. Specifically, we must explicitly count the tasks started and stopped, as seen in Algorithm 7 in the form of actorsBusy (lines 17 and 22).
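This termination bookkeeping can be illustrated without the Akka API: in the plain-Java sketch below, worker results arrive as messages on a queue (playing the role of a mailbox), and a busy-counter decides when the "system" may stop. All names here (StopMessage, run) are invented for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the actorsBusy termination pattern: the "main actor" counts
// outstanding workers and stops only once every result has arrived.
public class TerminationSketch {
    static final class StopMessage {           // result "message" from a worker
        final int depth; final long checkSum;
        StopMessage(int depth, long checkSum) { this.depth = depth; this.checkSum = checkSum; }
    }

    public static long run(int workers) throws InterruptedException {
        BlockingQueue<StopMessage> mailbox = new LinkedBlockingQueue<>();
        for (int i = 0; i < workers; i++) {
            final int depth = i;
            // each worker "tells" its result by enqueueing a message
            new Thread(() -> mailbox.add(new StopMessage(depth, depth * 10L))).start();
        }
        int actorsBusy = workers;              // number of outstanding results
        long total = 0;
        while (actorsBusy > 0) {               // terminate only when all reported back
            total += mailbox.take().checkSum;
            actorsBusy--;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4)); // 0 + 10 + 20 + 30 = 60
    }
}
```

In Akka the same counting happens inside the MAINACTOR's message handler, and STOP-AKKA() replaces the loop exit.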


Algorithm 6 Binary Tree program: Tree class, helper function, and Main program

1: class TREE
2:     left_child : TREE
3:     right_child : TREE
4:     ▷ Constructor: Allocate the node and set its children
5:     procedure TREE(TREE left, TREE right)
6:         left_child = left
7:         right_child = right
8:     end procedure
9:     ▷ Constructor: Allocate the node without children
10:    procedure TREE
11:        left_child = NULL
12:        right_child = NULL
13:    end procedure
14:    ▷ Method: Traverse the tree and count nodes recursively
15:    procedure CHECK-TREE
16:        if left_child == NULL then    ▷ The tree is perfect, no need to check both children
17:            return 1
18:        end if
19:        return 1 + left_child.CHECK-TREE() + right_child.CHECK-TREE()
20:    end procedure
21: end class
22: ▷ Helper function to create trees given depth N, from the bottom up
23: procedure CREATE-TREE(Integer N)
24:     if 0 < N then
25:         return TREE(CREATE-TREE(N-1), CREATE-TREE(N-1))
26:     end if
27:     return TREE()
28: end procedure
29: ▷ Main code of the program. N is the maximum tree depth.
30: ▷ De-allocation must happen after each PRINT, which happens implicitly in most models.
31: procedure MAIN(Integer N)
32:     stretchTree = CREATE-TREE(N+1)
33:     PRINT(stretchTree.CHECK-TREE())
34:     longLivedTree = CREATE-TREE(N)
35:     for (depth = 4; depth ≤ N; depth += 2) do
36:         checkSum = 0
37:         iters = 1 << (N - depth + 4)
38:         for (i = 1; i ≤ iters; i++) do
39:             checkSum += CREATE-TREE(depth).CHECK-TREE()
40:         end for
41:         ▷ Ensure that least-depth trees are printed first to match the reference output
42:         PRINT(checkSum)
43:     end for
44:     PRINT(longLivedTree.CHECK-TREE())
45: end procedure


4.2.2 CLBG results - Binary Trees

The CLBG suite was extended with an Akka implementation based on the pseudocode in Algorithm 7. The tree depth was chosen to be N = 21, as specified in the program's description in the documentation [29]. The benchmark was configured to run each test 10 times. Unfortunately, the suite does not provide a way to obtain the standard deviation of the measurements, but based on some preliminary testing we assume the standard deviation is not significant. The tests were run with the same experimental setup as in the previous experiments (Section 3.1). While CLBG provides implementations in many models, only a handful were benchmarked, because many models failed to build or run correctly (e.g. compilation problems), even after some tweaking.

The chosen models are listed in Table 4.2. Most models in CLBG have multiple implementations, so the implementation ID is suffixed to the program name. The source code for these implementations can be found in the CLBG documentation for the Binary Trees program [29]. The Akka implementation will be referred to as akka-1, but it can of course not be found in the CLBG documentation.

Model    Version     Features
Akka     2.6.4       Concurrency, Actor model, Object-oriented, used within Java (or Scala), JVM
Dart     2.8.2       Concurrency, For user applications, Object-oriented
Go       1.6.2       Concurrency, Go-routines
Python   3.6.8       Concurrency, Scripting language, supports OOP
Java     1.8.0_201   Concurrency, Object-oriented, JVM
JRuby    1.7.22      Concurrency, Java implementation of Ruby, Object-oriented, JVM
Julia    1.4.1       Concurrency, Numerical applications

Table 4.2: Models used in our CLBG experiments. All models support concurrency.

Figure 4.1a presents the total execution time of the Binary Trees program for the tested models (which are colour-coded to make the models more distinctive). The Java and Akka implementations executed the fastest, with java-7 and akka-1 having comparable execution times. In Figure 4.1b we can see that Java and Akka are similar in terms of CPU time as well. Figure 4.1 shows that the models are mostly grouped in their CPU time and elapsed time measurements; this grouping is especially visible in Figure 4.1b. jruby-5 has a higher CPU time than jruby-4, but a lower elapsed time than the other JRuby implementations; this illustrates that a lower CPU time does not necessarily mean faster execution. Factors contributing to this difference are the level of parallelism and CPU idle times. CPU time tells us how much load is put on the CPU, and Figure 4.1 shows that both Java and Akka achieve fast execution times while putting relatively little stress on the CPU, which indicates their CPU efficiency when executing this program.

Between Java and Akka, akka-1 has a similar elapsed time but a slightly lower CPU time than java-7. Furthermore, java-2, java-3 and java-6 all have a lower CPU time than akka-1, yet their elapsed time is 50% longer than akka-1's. Akka can thus be as fast as Java while having a 9% lower CPU time for this program. The orderings by CPU time and by CPU load do not match exactly, because CPU time is not determined solely by CPU load; julia-2 has the lowest total CPU load, for example, but not the lowest CPU time. Within the same model, however, implementations with a lower CPU time generally also have a lower CPU load. In Figure 4.2 we can see that akka-1's CPU load is lower than java-7's, which is further evidence that Akka is more CPU efficient in this case, since java-7 has an execution time similar to akka-1's.

For the tested program, Akka seems to be more CPU efficient than Java. If we compare java-7 and akka-1 in terms of memory, Figure 4.3a shows that Akka uses about the same amount of memory as java-7 (about 5% less, to be precise). Other Java implementations use less memory than akka-1, but these also had longer execution times. Similar to CPU time, the models seem to be grouped in their memory usage, which can easily be observed in the colours of the bar chart in Figure 4.3a. The results for Go were left out of this analysis because the measurements were invalid (CLBG did not properly measure the memory of the child process used by Go).


Algorithm 7 Implementation in Akka (pseudocode) of the Binary Trees program

1: class MAINACTOR
2:     minDepth : Integer
3:     maxDepth : Integer
4:     actorsBusy : Integer
5:     longLivedTree : TREE
6:     OutputArray : Array
7:     procedure CREATE(Integer minDepth, Integer maxDepth)
8:         minDepth = minDepth
9:         maxDepth = maxDepth
10:        actorsBusy = 0
11:        stretchTree = CREATE-TREE(N+1)
12:        PRINT(stretchTree.CHECK-TREE())
13:        longLivedTree = CREATE-TREE(N)
14:        for (depth = minDepth; depth ≤ maxDepth; depth += 2) do
15:            newActor = TREEMAKEACTOR.CREATE(minDepth, maxDepth, GET-SELF())
16:            newActor.TELL(MakeTreeCommand(depth))
17:            actorsBusy++
18:        end for
19:    end procedure
20:    procedure OnStopCommandReceived(Integer depth, Integer checkSum)
21:        OutputArray[(depth - minDepth) / 2] = checkSum
22:        if (--actorsBusy ≤ 0) then
23:            for String s in OutputArray do
24:                PRINT(s)
25:            end for
26:            PRINT(longLivedTree.CHECK-TREE())
27:            STOP-AKKA()
28:        end if
29:    end procedure
30: end class
31: class TREEMAKEACTOR
32:    minDepth : Integer
33:    maxDepth : Integer
34:    parentActor : MAINACTOR
35:    procedure CREATE(Integer minDepth, Integer maxDepth, MAINACTOR replyTo)
36:        minDepth = minDepth
37:        maxDepth = maxDepth
38:        parentActor = replyTo
39:    end procedure
40:    procedure OnMakeTreeCommandReceived(Integer depth)
41:        checkSum = 0
42:        iters = 1 << (maxDepth - depth + minDepth)
43:        for (i = 1; i ≤ iters; i++) do
44:            checkSum += CREATE-TREE(depth).CHECK-TREE()
45:        end for
46:        parentActor.TELL(StopCommand(depth, checkSum))
47:    end procedure
48: end class
49: ▷ Creates the MAINACTOR, which automatically starts the creation of trees.
50: procedure MAIN(Integer N)
51:     MAINACTOR.CREATE(4, N)
52: end procedure


(a) Elapsed time (seconds) (b) Total CPU time (seconds)

Figure 4.1: Binary Trees: CLBG measurements of elapsed time (seconds) and total CPU time (seconds). Averages of 10 runs. The number suffixed to model names indicates the implementation number within CLBG.

Figure 4.2: Binary Trees: CLBG measurements of CPU core load. Average of 10 runs. The number suffixed to model names indicates the implementation number within CLBG.

Finally, we also compare the compressed code size for Binary Trees. The size is measured using the GZIP tool, and the results are presented in Figure 4.3b. We observe that the Java implementations require less code than Akka, which is expected given that the Akka actor model requires programmers to write more code. This is visible in Algorithm 7, which is more verbose (while still being simplified pseudocode) than Algorithm 6 (note that Algorithm 6 also includes the TREE class, which Algorithm 7 uses as well). The results indicate that Akka is one of the most verbose models among those tested, surpassed only by Dart in two cases. However, dart-1 is much shorter than the other Dart implementations (Figure 4.3b), while also performing well in terms of elapsed time, CPU time and memory usage (Figures 4.1 and 4.3a). It would be difficult to make akka-1 much shorter while keeping the same functionality.

4.3 Program 2: Reverse Complement

The Reverse Complement CLBG program was selected to be ported and benchmarked because of its use of blocking operations. As blocking is detrimental to performance in concurrent applications, this program should provide insight into Akka's performance when handling blocking operations.


(a) Memory used (MB). Average of 10 runs. (b) Compressed code size (Bytes).

Figure 4.3: Binary Trees: CLBG measurements of memory used (MB) and compressed code size (Bytes). The number suffixed to model names indicates the implementation number within CLBG.

The program uses a mapping over nucleotides (the building blocks of DNA) to create the reverse complement of a DNA string, i.e. the complementary DNA string [30]. The nucleotide mapping is one-to-one and can be seen in Table 4.3. The input DNA string is read from the standard input stream, and the reverse complement is printed to the standard output stream. The input and output are both in FASTA format, a standard format for DNA string files [31]. The read and write operations use buffers, meaning only a (small) part of the input is processed at a time rather than reading the input stream as a whole. For our benchmark, we use the 250 MB input file from CLBG [32].
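As an illustration of the kernel being benchmarked, the hypothetical Java sketch below reverses a sequence and applies a subset of the complement mapping from Table 4.3 (plain A/C/G/T/U/N only); FASTA parsing and buffered IO are omitted.

```java
// Sketch of the reverse-complement kernel: reverse the sequence and map
// each nucleotide to its complement (subset of the Table 4.3 mapping).
public class RevCompSketch {
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T':
            case 'U': return 'A';
            case 'N': return 'N'; // wildcard maps to itself
            default: throw new IllegalArgumentException("unmapped code: " + c);
        }
    }

    static String reverseComplement(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) // walking backwards reverses
            out.append(complement(seq.charAt(i)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AAACGT")); // prints ACGTTT
    }
}
```

The CLBG implementations apply the same per-byte mapping, but in place on buffered chunks and per FASTA record.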

4.3.1 Porting to Akka from Java

We created our Akka implementation based on one of the Java implementations available in CLBG, namely java-4 [32]. We ran the benchmark before implementing the program in Akka, to see which Java program performed well while having relatively low code complexity and size. We chose java-4 because porting it to Akka seemed easier than porting the other implementations, and it scored well in the preliminary benchmark despite its short and simple code. Pseudocode for this Java implementation can be seen in Algorithm 8.

The java-4 code was used to create the Akka implementation (akka-1). The reading of input from Algorithm 8 (line 31) is done in the ReaderActor (Alg. 9, line 1), while computing the reverse complement is done by the ReverseActor (Alg. 9, line 16). Once the ReverseActor receives new valid input, the ReaderActor starts reading more input (concurrently); this makes the program concurrent. The buffer size (BUFSIZE) was increased from 82 (in java-4) to 1024 in akka-1. We expect this to reduce the overhead of sending and receiving messages, because a larger buffer means fewer tasks and thus less frequent IO operations. The buffer size differed between the implementations in CLBG, and no fixed value was specified.


Algorithm 8 The pseudocode for java-4 from CLBG

1: class REVERSIBLEBYTEARRAY extends java.io.ByteArrayOutputStream
2:     procedure REVERSE()
3:         ▷ count is the number of valid bytes in the ByteArrayOutputStream buffer [33]
4:         if count > 0 then
5:             INTEGER begin = 0
6:             INTEGER end = count - 1
7:             while (buf[begin++] ≠ newline) do
8:                 pass    ▷ Reversal should happen line-by-line, so we search for the next newline
9:             end while
10:            while (begin ≤ end) do
11:                if (buf[begin] == newline) then
12:                    begin++
13:                end if
14:                if (buf[end] == newline) then
15:                    end--
16:                end if
17:                if (begin ≤ end) then
18:                    BYTE tmp = buf[begin]
19:                    buf[begin++] = COMP-MAP(buf[end])    ▷ Apply the mapping (Table 4.3)
20:                    buf[end--] = COMP-MAP(tmp)
21:                end if
22:            end while
23:            WRITE(buf, 0, count)    ▷ Write to Std.Out
24:        end if
25:    end procedure
26: end class
27: procedure MAIN()
28:     BYTE[] line = new BYTE[BUFSIZE]
29:     INTEGER read
30:     REVERSIBLEBYTEARRAY buf = new REVERSIBLEBYTEARRAY()
31:     while ((read = READ(line)) ≠ -1) do    ▷ READ fills line and returns the number of bytes read
32:         INTEGER i = 0
33:         INTEGER last = 0
34:         while i < read do
35:             if (line[i] == '>') then
36:                 buf.WRITE(line, last, i - last)
37:                 buf.REVERSE()
38:                 buf.RESET()
39:                 last = i
40:             end if
41:             i++
42:         end while
43:         buf.WRITE(line, last, read - last)
44:     end while
45:     buf.REVERSE()
46: end procedure


Algorithm 9 Akka pseudocode for the Reverse Complement CLBG program (akka-1). It uses the ReversibleByteArray class from Alg. 8 (line 1).

1: class READERACTOR
2:     reverseActor : REVERSEACTOR
3:     line : BYTE[]
4:     procedure CREATE()
5:         line = new BYTE[BUFSIZE]
6:         reverseActor = REVERSEACTOR.CREATE(GET-SELF())
7:     end procedure
8:     procedure OnReadCommandReceived()
9:         INTEGER read = READ(line)    ▷ Stores values in the buffer
10:        reverseActor.TELL(ReverseCommand(read, line))
11:        if (read == -1) then
12:            STOP-AKKA()    ▷ Awaits completion of running tasks
13:        end if
14:    end procedure
15: end class
16: class REVERSEACTOR
17:    readerActor : READERACTOR
18:    line : BYTE[]
19:    buf : REVERSIBLEBYTEARRAY
20:    procedure CREATE(READERACTOR reader)
21:        line = new BYTE[BUFSIZE]
22:        buf = new REVERSIBLEBYTEARRAY()
23:        readerActor = reader
24:    end procedure
25:    procedure OnReverseCommandReceived(INTEGER read, BYTE[] line)
26:        if (read ≠ -1) then
27:            readerActor.TELL(ReadCommand())    ▷ If there is more to read, start a new read
28:            INTEGER i = 0
29:            INTEGER last = 0
30:            while i < read do
31:                if (line[i] == '>') then
32:                    buf.WRITE(line, last, i - last)
33:                    buf.REVERSE()
34:                    buf.RESET()
35:                    last = i
36:                end if
37:                i++
38:            end while
39:            buf.WRITE(line, last, read - last)
40:        else
41:            buf.REVERSE()
42:        end if
43:    end procedure
44: end class
45: procedure MAIN(Integer N)
46:     READERACTOR.CREATE().TELL(ReadCommand())
47: end procedure


Code   Meaning      Complement
A      A            T
C      C            G
G      G            C
T/U    T            A
M      A, C         K
R      A, G         Y
W      A, T         W
S      C, G         S
Y      C, T         R
K      G, T         M
V      A, C, G      B
H      A, C, T      D
D      A, G, T      H
B      C, G, T      V
N      G, A, T, C   N

Table 4.3: The mapping used to obtain the reverse complement. The code column is a representation of nucleotides; N can be seen as a wildcard, for example, because it can be any of the 4 nucleotides. The complement column denotes the output of the mapping for the given input code [32].

4.3.2 CLBG results - Reverse Complement

Figure 4.4a presents the results of running the CLBG benchmark 10 times for all the models selected in Table 4.2. We observe that Akka had a higher execution time than most models, and Java was always faster in this experiment. The akka-1 implementation was based on java-4, which executed faster than akka-1: while akka-1 was able to read input concurrently, the overhead of communication between actors outweighed this benefit.

Preliminary testing of the Akka port during implementation showed that a lower buffer size slowed Akka down significantly: a smaller buffer resulted in more overhead, because communication between the ReaderActor and the ReverseActor happened more frequently.

In Figure 4.4b, depicting CPU time, we also see a relatively high CPU time for Akka compared to other models; again, java-4 outperformed akka-1. Furthermore, Figure 4.5 shows that akka-1 has a higher CPU load than most models, although some Java implementations had an even higher CPU load. However, those implementations (java-3 and java-8) were among the fastest in terms of elapsed time (Figure 4.4a). We can clearly state that the akka-1 implementation is less efficient than the Java implementations for this program, as it has a higher CPU time, CPU core load, and execution time than most models, including Java.


(a) Elapsed time (seconds) (b) Total CPU time (seconds)

Figure 4.4: Reverse Complement: CLBG measurements of elapsed time (seconds) and total CPU time (seconds). Averages of 10 runs.

Figure 4.5: Reverse Complement: CLBG measurements of CPU core load. Average of 10 runs.


(a) Memory used (MB). Average of 10 runs. (b) Compressed code size (Bytes).

Figure 4.6: Reverse Complement: CLBG measurements of memory used (MB) and com-pressed code size (Bytes).

While Akka’s CPU performance is poor for this program, the amount of memory used is only slightly more than Java’s, and models like Dart and JRuby use much more than Akka and Java (Figure 4.6a). We do see that implementations in a same model can vary in memory usage. This behaviour is observed for all models, as there is not much grouping of the same models. For the Binary Trees program, the memory usage was more grouped (Figure 4.3a). Overall, the results seem to indicate that memory usage is more implementation specific than model dependent, for this program, making the data in Figure 4.6a less suited to make generalised statements about Akka’s memory usage.

In terms of code size, the results are not what we expected, because Akka tends to be more verbose than Java. However, Figure 4.6b shows that Akka is more concise than several Java implementations. Nevertheless, it is still more verbose than most implementations, and java-4 is shorter than akka-1. This difference in length is not visible in the pseudocode in Alg. 8 and Alg. 9, because some of Akka's verbosity was abstracted away to keep the pseudocode brief.


CHAPTER 5

Conclusion and future work

Concurrency is important, and it is more easily achievable if a programming model has it embedded. Because Akka has concurrency deeply embedded in its design, we compared Akka against other models to see how it performs.

5.1

Main findings

The main goal of this project was to answer the following research question: How does the Akka actor model perform compared to other models? To this end, we also formulated two sub-questions:

SQ1: How does Akka compare against Java for a basic application?

SQ2: How does Akka compare against the models in the Computer Language Benchmarks Game?

As we have seen in Chapter 3, where we experimented with a simple synthetic program, using more actors or threads creates more overhead than speedup for our tested program. Having many actors in Akka is more detrimental to the execution time than having many individual threads in Java, although Java also suffers when too many threads are used. Using ForkJoinPools (FJPs) and ThreadPoolExecutors (TPEs) reduces the overhead caused by having too many threads in Java. Akka also uses FJPs or TPEs to work concurrently, but these configure the dispatcher that governs which actors get to work on a thread. Having many Akka actors still causes overhead, because each actor needs to be instantiated and kept in memory. While initialisation in Akka is slower than in Java when executing a simple, short program, the distribution of messages to Akka actors is faster than the starting of tasks in Java. Java was always faster in terms of total execution time, because of Akka’s initialisation overhead. Using FJPs or TPEs in Akka resulted in similar performance for our test program.
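The thread-reuse effect that FJPs and TPEs provide can be sketched with a small, self-contained Java example (the class and method names are ours, not taken from the benchmark code): spawning one OS thread per tiny task pays a thread-creation cost per task, while a ForkJoinPool pays that cost only once per worker and turns each subsequent task into a cheap queue submission.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PoolDemo {
    // Run `tasks` tiny jobs on a ForkJoinPool. The pool keeps a small
    // fixed set of worker threads alive and reuses them, so the
    // per-task cost is a queue submission rather than an OS thread spawn.
    static long runPooled(int tasks) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        ForkJoinPool pool = new ForkJoinPool();
        for (int i = 0; i < tasks; i++) {
            pool.execute(counter::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return counter.get();
    }

    // Same work, but spawning one OS thread per task: each Thread costs
    // a stack allocation and kernel bookkeeping, which is exactly the
    // overhead the pooled variant avoids.
    static long runUnpooled(int tasks) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        Thread[] threads = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(counter::incrementAndGet);
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPooled(10_000));   // prints 10000
        System.out.println(runUnpooled(10_000)); // prints 10000, far more slowly
    }
}
```

Both variants compute the same result; only the scheduling overhead differs, which mirrors the difference our microbenchmark measured between raw Java threads and pool-backed execution.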

An answer to SQ2 can be given based on the results in Chapter 4. We have seen that for the non-blocking Binary Trees program, Akka was more CPU efficient than most other models, while for the blocking Reverse Complement program, Akka had worse CPU performance than most other models. In neither program did Akka have significantly higher memory usage than Java. Akka has a larger code size in general, as the actor model makes it quite verbose. However, a more verbose model does not always produce much larger code than less verbose models, because the difference depends on how well the model fits the application being implemented.
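Why blocking operations hurt a pool-backed model such as Akka's default dispatcher can be illustrated with a plain Java sketch (the pool sizes, task counts, and sleep times here are illustrative, not values from our benchmarks): a blocked worker thread cannot be reused, so blocking tasks serialise into "waves" whose count depends on the pool size.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BlockingDemo {
    // Submit `tasks` jobs that each block (sleep) for `blockMillis` to a
    // fixed pool of `poolSize` threads, and return the wall time in ms.
    // Blocked workers cannot take new tasks, so total time grows as
    // roughly ceil(tasks / poolSize) * blockMillis instead of staying
    // near a single blockMillis.
    static long timeBlockingTasks(int poolSize, int tasks, long blockMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(blockMillis); // stands in for blocking I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // 8 blocking tasks on 2 threads: ~4 waves of 100 ms each.
        System.out.println(timeBlockingTasks(2, 8, 100));
        // Same tasks on 8 threads: all block in parallel, ~100 ms total.
        System.out.println(timeBlockingTasks(8, 8, 100));
    }
}
```

This is the effect we suspect behind Reverse Complement's poor Akka numbers: actors that block while reading input occupy dispatcher threads that could otherwise run ready actors.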

Concluding with an answer to our research question: Akka has the potential to outperform other models if it is applied to programs where multiple actors are useful, and blocking operations should be avoided to aid Akka’s performance. Akka can perform better than other models in terms of CPU and memory performance, at the cost of code size, but this requires that the actor model is applicable to the program. Java was comparable to or better than Akka in our CLBG experiments. If the initialisation of a program is not that important, Akka can distribute work more efficiently than Java can using multi-threading. However, having too many actors working at the same time can create significantly more overhead than, for example, a Java ThreadPool would.

5.2

Future work

While our work has provided interesting insights into the performance of Akka versus other models, many additional experiments could be done to make the differences between Akka and other models clearer.

First, all experiments were performed on one machine, and results can vary on other systems (one with more CPU cores, for example). Thus, repeating these experiments on more systems would be useful for generalisation.

Second, our microbenchmark from Chapter 3 could be extended to also investigate the memory usage of Akka compared to Java, because it would be interesting to see the effects of many actors being held in memory while they are waiting to occupy a thread.

Furthermore, additional research should be done using CLBG, because the suite offers many more input programs. An input program that uses more actors, suitable for checking the effects on memory, could be Spectral Norm [25], as high-dimensional matrix multiplication is easy to distribute across tasks. New programs could also be created in both Java and Akka to compare just these two models, as they are closely related to each other.

Finally, it would also be very interesting to see how Akka performs in a large-scale distributed application, because it is advertised as having great performance in those types of applications.


Acronyms

CLBG Computer Language Benchmarks Game. 3, 5, 7, 9, 10, 12, 13, 25, 26, 29, 31, 32, 33, 34, 35, 36, 37, 39, 40

FJP ForkJoinPool. 5, 11, 12, 21, 22, 23, 27, 39

INTMAX the maximum value of a 32-bit signed integer, equal to 2147483647. 17, 18, 20, 23

JVM Java Virtual Machine. 12, 13, 17, 29

OOP Object-oriented programming. 7, 10, 29


Bibliography

[1] Lightbend, “How the actor model meets the needs of modern, distributed systems.” https://doc.akka.io/docs/akka/current/typed/guide/actors-intro.html, Mar. 2020.

[2] H. Koen, J. De Villiers, G. Pavlin, A. De Waal, P. De Oude, and F. Mignet, “A framework for inferring predictive distributions of rhino poaching events through causal modelling,” in FUSION 2014 - 17th International Conference on Information Fusion, Institute of Electrical and Electronics Engineers Inc., 2014.

[3] I. Gouy, “Details about computer language benchmarks game (clbg).” http://benchmarksgame.wildervanck.eu/how-programs-are-measured.html, 2020.

[4] I. Gouy, “Github repository for the computer language benchmarks game.” https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/tree/master, 2020.

[5] J.-l. Gailly, “Gzip documentation.” https://www.gnu.org/software/gzip/manual/gzip.html, 2011.

[6] M. Baulig, “Libgtop documentation.” https://developer.gnome.org/libgtop/stable/, 2020.

[7] A. Hasija, “Performance benchmarking akka actors vs java threads.” https://blog.knoldus.com/performance-benchmarking-akka-actors-vs-java-threads/, Oct 2019.

[8] P. Nordwall, “Blog: Yet another akka benchmark.” https://blog.jayway.com/2010/08/10/yet-another-akka-benchmark/, Aug 2010.

[9] S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich, “Non-scalable locks are dangerous,” MIT CSAIL, 2011.

[10] Lightbend, “Akka dispatchers documentation.” https://doc.akka.io/docs/akka/current/typed/dispatchers.html, Apr. 2020.

[11] A. Tanenbaum and H. Bos, Modern Operating Systems. Pearson, 2015.

[12] “Oracle Java tutorials: Defining and starting a thread.” https://docs.oracle.com/javase/tutorial/essential/concurrency/runthread.html, 2020.

[13] J. Friesen, “Java 101: Understanding Java threads, Part 3: Thread scheduling and wait/notify.” https://www.javaworld.com/article/2071214/java-101--understanding-java-threads--part-3--thread-scheduling-and-wait-notify.html, Jul 2002.

[14] “Oracle Java 8 documentation: ThreadPoolExecutor.” https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html, 2020.

[15] Baeldung, “Introduction to thread pools in Java.” https://www.baeldung.com/thread-pool-java-and-guava, Apr. 2020.
