
Master Computer Science

Parallel and Persistent Adaptive Asynchronous Optimizations for Array Programming Languages

Heinrich Michael Wiesinger

6181104

H.M.Wiesinger@student.uva.nl

August 16, 2016

Supervisors:
dr. Clemens Grelck

Computer Science
University of Amsterdam


Parallel and Persistent Adaptive Asynchronous Optimizations for Array Programming Languages

Master’s Thesis

written by

Heinrich Michael Wiesinger

under the supervision of

dr. Clemens Grelck,

and submitted in partial fulfilment of the requirements for the degree of

M.Sc. in Computational Science

at the University of Amsterdam

Date of public defense:

23/08/2016

Members of the Thesis Committee:

dr. Clemens Grelck

dr. Inge Bethke

dr. ir. Raphael Poss


Abstract

In an effort to increase the optimization potential of an application, compilers more and more opt to offer optimization techniques that affect an application at runtime. Some techniques simply serve as developer support tools that offer increased protection against programming errors or extended debugging information in case of bugs. But some techniques also aim to analyze application performance and adjust applications at runtime to perform better.

In this thesis we present the work on an adaptive asynchronous optimization framework for the functional programming language SaC. Using this framework, shape/rank-generic applications can generate specializations at runtime, link them back into the running application and immediately make use of them. We analyze several performance enhancing additions to the base architecture of the framework, such as a persistence layer and multiple specialization controllers. Through various experiments and simulations we show the capabilities and possibilities of the framework, and verify the performance enhancing effect of the new features.


Acknowledgements

I’d like to start off with my deepest gratitude to my supervisor, Dr. Clemens Grelck. I was relatively new to the field of compilers when I started with my research, and he proposed a topic that immediately piqued my curiosity and fueled my motivation to explore this field. He offered me help every step of the way, not only on technical questions, but also with organizational matters around my studies. He is an excellent teacher and I can say without doubt that everything I know about compilers today I learned from him.

I would also like to thank all the SaC project’s members who helped me whenever I got stuck in my research, specifically Raphael Poss, who helped me debug some very tough problems.

I’d also like to thank all of my programme’s teachers, from whom I learned so much. Your support was crucial in getting me this far in my studies. Special thanks go to professor Alban Ponse, who helped me find my way when I started with the Grid Computing programme and struggled to find an optimal study/work balance.

Particular thanks go to M2mobi, the company I have had the pleasure to work for over the past seven years. They were always very accommodating when it came to my sometimes rather flexible scheduling requirements. Working for them allowed me to follow my studies without having to worry about financial support and housing, and the fun projects I worked on provided a good change from the sometimes rather theoretical projects of my studies. Special thanks go to Michiel Baneke and Michiel Munneke, the owners of M2mobi. It was their initial suggestion that eventually led me to start my master studies at the University of Amsterdam.

Ever since I came to Amsterdam I have met many people and made many friends, all of whom made this whole experience much more than just studying abroad. I can happily say that a whole new life started for me the moment I came to Amsterdam. Two people who deserve special mention are my long-time flatmate Felipe, who constantly motivated me to keep going with my studies, even if he is probably not aware of that fact, and my friend and previous colleague Leo, with whom I had great conversations about my studies.

Last but not least, I’d like to thank my family for their support, financial and otherwise, without which I would certainly not be here today. My father, for being a pillar providing stability in the background, always ready in case he is needed. My mother, for constantly reminding me to focus on reaching my goal. My sister, for always being there, eager to talk to me. And my brother, for all the hugs, fun and laughs we have shared since I moved abroad. I know that me moving to another country was hard for everyone in my family. Their continued support over all these years means so much more to me because of that.


Contents

1 Introduction
   1.1 Overview
   1.2 Research Questions
      1.2.1 Research Question 1: Can we make runtime generated specializations available to the running application sooner?
      1.2.2 Research Question 2: How does a persistence layer affect adaptive asynchronous optimizations?
      1.2.3 Research Question 3: What is the impact of increasing the number of specialization controllers?
      1.2.4 Research Question 4: How do adaptive asynchronous optimizations perform on a broader set of applications?
   1.3 Contributions
   1.4 Structure

2 SaC
   2.1 Types
      2.1.1 Data Types
      2.1.2 Function Types
      2.1.3 User-defined Types
   2.2 With-Loops
   2.3 Function overloading
   2.4 Modules and Namespaces

3 Background: Adaptive Asynchronous Optimizations (Proof-of-concept)
   3.1 Fundamental Specification
   3.2 Architecture
      3.2.1 Detection and Trigger
      3.2.2 Generation
      3.2.3 Linking and Using

4 Persistent Adaptive Asynchronous Optimizations
   4.1 Introduction
   4.2 Data Store
   4.3 Identification
   4.4 Filesystem Layout
   4.5 Specialization Generation and Loading

5 Evaluation of Persistent Adaptive Asynchronous Optimizations
   5.1 Introduction
   5.2 Repeated Matrix Multiplication
   5.3 Rank-generic Multi-dimensional Convolution
   5.4 n-Body Simulation
   5.5 Data Stream Simulator
   5.6 Summary

6 Parallel Adaptive Asynchronous Optimizations
   6.1 Introduction
   6.2 Architecture
   6.3 Implementation

7 Evaluation of Parallel Adaptive Asynchronous Optimizations
   7.1 Rank-generic Multi-dimensional Convolution
   7.2 n-Body Simulation
   7.3 Data Stream Simulator
   7.4 Summary

8 Discussion
   8.1 Overview
   8.2 Persistence Layer
   8.3 Identification
   8.4 Garbage Collection
   8.5 Security
   8.6 Alternative Implementations
   8.7 Parallel Specialization

9 Related Work
   9.1 JIT: Just-In-Time Compilation
   9.2 Sambamba
   9.3 ADORE/COBRA
   9.4 Jikes RVM

10 Conclusion
   10.1 Research Questions
      10.1.1 Research Question 1: Can we make runtime generated specializations available to the running application sooner?
      10.1.2 Research Question 2: How does a persistence layer affect adaptive asynchronous optimizations?
      10.1.3 Research Question 3: What is the impact of increasing the number of specialization controllers?
      10.1.4 Research Question 4: How do adaptive asynchronous optimizations perform on a broader set of applications?


CHAPTER 1

Introduction

1.1 Overview

A common mantra among software engineers is to create reusable code. This keeps an application smaller, reduces the risk of bugs, reduces the risk of bugfixes being incomplete and helps keep the application maintainable for a longer period of time. It does not take long before these reusable pieces of code end up being grouped into reusable components, so as to not only have those benefits in one application, but to share them among multiple applications. A side-effect of creating these shared components, however, is that they need to cover generic ground. Not every application’s use case is the same. Corners need to be cut in order to make the shared code compatible with all potential use cases. Edge cases which are only needed in very rare situations, perhaps even for only one application, enter the shared code and potentially make the generic shared code more complicated again, reversing the initial motivation for having it in the first place.

In today’s development world software engineers walk a thin line between complexity, maintainability and performance. And more often than not, performance draws the short straw. But it does not have to be this way. Compilers add more and more features to make a software engineer’s life easier. Recent examples of this include various sanitizers that were added to LLVM [21]: an AddressSanitizer [25], a MemorySanitizer [27] and a ThreadSanitizer [26]. Particularly the AddressSanitizer is rather interesting, as it became so popular after its introduction to LLVM that it later also got implemented in the GNU Compiler Collection.

The AddressSanitizer is also a good example of a further direction compiler optimizations can take. Traditional optimization techniques target compile time, which generally yields good performance results for cases where the needed optimization criteria can be determined at compile time as well. But sometimes situations are encountered where it simply cannot be determined at compile time what is going to happen. The AddressSanitizer is thus integrated into the application at compile time, but actually runs at application runtime, detects memory errors when they happen and alerts the software engineer with detailed error messages about the problem. At this point the automatism stops and the software engineer has to go back, change the code and recompile in order to benefit from a better application.

This approach can, however, also be applied to other kinds of optimizations. Let us shift our focus from C/C++ and take a look at functional array languages, using SaC [24] as an example. SaC (Single Assignment C) is a language with a focus on numerical algorithms on big datasets. These algorithms can be implemented at a very high level of abstraction, compared to more traditional languages like C or Fortran, without having to make compromises in runtime performance [20][30]. The syntax of SaC is modelled closely on the language C, in order to make it more attractive to developers coming from an imperative language background by presenting them with a familiar picture. Contrary to C, however, the core of SaC is functional and side-effect free. Additionally, arrays are first-class citizens in SaC. Any SaC expression evaluates to an array, and arrays may be passed between functions without restrictions. The runtime efficiency, in spite of the high abstraction of the implementation, is achieved through an extensive set of optimizations in the SaC compiler sac2c.

Figure 1.1: Comparing shape/rank-generic and shape/rank-specific performance on the example of three algorithmically complex applications (matrix multiplication, convolution and n-body simulation). Shape/rank-generic performance was normalized to 100%, showing how big a lead a shape/rank-specific version can have. More information on these applications can be found in chapter 5 and chapter 7.

The SaC compiler, however, suffers from the same problem as described before. The most effective optimizations in the compiler depend on array size information being available or determinable at compile time. If the information only becomes available at runtime, for example when the input data is read from a file, a lot of the optimization potential is lost, which can lead to much worse performance. Figure 1.1 shows how large this performance deficit can be. If array size information is not available at compile time, the computation becomes a factor of three slower. That is a huge amount of optimization potential lost. If it were possible to still harness this potential at runtime, several use cases could be sped up substantially.

This is exactly what an adaptive optimization framework aims to do. The idea is to monitor function calls at runtime, detect calls with further optimization potential and generate more optimized versions directly at application runtime, which can be used immediately by the application once generated. A proof-of-concept presented in [29] has already shown the basic potential of such an architecture but, being a proof-of-concept, did not aim for immediate practical applicability. Figure 1.2 shows an example of the results obtained. The chart displays measurements for a convolution experiment using a matrix of size 1000x1000. This convolution experiment had two functions of interest for runtime specialization. The performance increases at iteration 9 and iteration 35 correspond to the specializations for those functions becoming available.

Figure 1.2: Example performance of the adaptive optimization framework as presented in [14]. The chart plots the time in seconds per convolution step against the convolution steps, for a 1000x1000 matrix, with runtime specialisation disabled and enabled.

This example shows the potential performance gains that can be achieved, but also some of the improvements still possible. One can see that the potential performance gains are impressive. However, one also sees that it takes quite long until the full optimization potential is reached. Furthermore, the generated specializations are of a temporary nature. Persistence was not necessary to show the potential performance of an adaptive asynchronous optimization framework, so once applications are run again, those specializations need to be regenerated.

With this thesis we research potential improvements to the concept. Aiming to reduce the time until specializations become available to the running application, we present two architecture extensions that achieve this in very different ways and for different use cases. On the one hand, we look at how adding a persistence layer affects the performance of applications with adaptive asynchronous optimizations enabled. On the other hand, we analyze the effect that running multiple specialization controllers has on the applications used in our experiments. To get a broad picture of how adaptive asynchronous optimizations behave in practice, especially in combination with the new architecture extensions, we employ experiments using repeated matrix multiplication, convolution and n-body simulation. Additionally, we created a small data stream simulation application to specifically show the effect adaptive asynchronous optimizations have on long-term scenarios in combination with these new improvements.

The results obtained from these experiments show that adaptive asynchronous optimizations bring great performance improvements in all tested scenarios. Results for runtime optimized applications are in most cases so close to a shape/rank-specific version that the remaining difference is insignificant. Only for the n-body simulation is the shape/rank-specific version still twice as fast as the runtime optimized version, which, however, is still a factor of 2.5 faster than the shape/rank-generic version. The persistence implementation proved to bring great improvements as well, since all the benefits of runtime generated specializations can be applied directly from the beginning of an application run. Long-term effects of the persistence layer were especially apparent in simulations performed with the data stream simulator. Creating multiple specializations in parallel showed promise in some experiments as well, outlining how multiple specialization controllers can speed up specialization generation in situations where many specializations are requested in a short period of time.


1.2 Research Questions

Looking at the results obtained in the experiments for the proof-of-concept implementation of adaptive asynchronous optimizations, it becomes very interesting to think about how to take this further. The concept shows great promise and the optimization potential is definitely tremendous. Inspired by the original work, this thesis sets out to answer the questions of whether the performance of adaptive asynchronous optimizations can be improved even further, whether we can reduce the time until specializations become available at runtime, and how adaptive asynchronous optimizations behave in a broader set of applications.

1.2.1 Research Question 1: Can we make runtime generated specializations available to the running application sooner?

The proof-of-concept implementation showed, through results obtained in [29] and [14], that once runtime generated specializations become available to the running application, performance improves dramatically. Generating specializations, however, takes a good amount of time. Specializations are generated sequentially, one after the other, in the order in which the shape/rank-generic functions they correspond to were called in the application. All of these factors contribute to the fact that runtime specializations have a delay until they become available to the running application. This thesis thus analyzes whether this delay can be reduced.

1.2.2 Research Question 2: How does a persistence layer affect adaptive asynchronous optimizations?

One potential option to reduce the delay until specializations become available to the running application is by storing them in a persistence layer. This does not reduce the initial delay when specializations are first generated, but it makes the generated specialization available to future application runs, or even other applications. In this thesis we run a set of experiments to analyze the exact effects such a persistence layer has on repeated application runs.

1.2.3 Research Question 3: What is the impact of increasing the number of special-ization controllers?

Another potential option to reduce the delay until specializations become available to the running application is to increase the number of specialization controllers. These controllers are responsible for generating the specializations at runtime. The controllers run in threads separate from the main application, which allows parallel generation of specializations in case there is more than one controller. Through a set of experiments we look at the impact running more than one specialization controller has on both application performance and the time delay until specializations are available.

1.2.4 Research Question 4: How do adaptive asynchronous optimizations perform on a broader set of applications?

Previous experiments aimed at showing the principal performance improvements that adaptive asynchronous optimizations can bring. In this thesis we extend the range of experiments performed so we can draw conclusions on how useful adaptive asynchronous optimizations can be in actual applications. To do this we employ three focused experiments, each illustrating a different behavior, and measure how adaptive asynchronous optimizations impact the performance. Additionally, using a data stream simulator we can take a look at how adaptive asynchronous optimizations affect the performance of long running applications.

1.3 Contributions

The research performed for this thesis directly resulted in a number of contributions to the SaC compiler. At the top stands a vastly improved adaptive asynchronous optimization framework, capable of optimizing applications in many more situations than was possible before with the proof-of-concept implementation. The addition of a persistence layer and the option to run multiple specialization controllers further improved the usability of the framework.

Another product of the research in this thesis is the data stream simulator, used for experiments in chapter 5 and chapter 7. This simulator arose from the need to have an easy way to construct test scenarios for adaptive asynchronous optimizations. Based on its flexibility this simulator will continue to prove useful in future research on adaptive asynchronous optimizations.

1.4 Structure

This thesis starts by giving an introduction to the background of adaptive asynchronous optimizations. In chapter 2 we give a general introduction to the programming language SaC. In chapter 3 we briefly take a look at the proof-of-concept implementation of adaptive asynchronous optimizations. Following this, we present the persistence layer in chapter 4 and continue with its evaluation in chapter 5. In chapter 6 we discuss multiple parallel specialization controllers, with their subsequent evaluation in chapter 7. We round out the thesis with a discussion of the architectural choices taken and the results we obtained in our experiments in chapter 8, followed by an overview of related work in chapter 9, before presenting our conclusions in chapter 10.


CHAPTER 2

SaC

The principle of the work presented in this thesis is applicable to any language that implements function overloading with respect to subtyping of the function arguments. Adaptive asynchronous optimizations hook into the function selection mechanism inherent to this kind of overloading, both to trigger the generation of new specializations and to load and call existing ones. To evaluate the potential benefits of the proposed feature set on actual example problems I chose to base my research on SaC (Single Assignment C).

The introduction already gave a short explanation of the language itself. The following sections go into more detail on the features of SaC that serve as the basis for, or interact with, the adaptive asynchronous specialization framework. Specifically, this involves a description of SaC’s type system, an introduction to With-Loops, the handling of functions, modules and namespaces within SaC, as well as an explanation of how function overloading is implemented. The remaining language features of SaC are of lesser importance for this thesis. A full description of the language can be found in [24].

2.1 Types

First, we briefly describe SaC’s type system. There is a general distinction between data types, function types and user-defined types.

2.1.1 Data Types

As previously mentioned, arrays are first-class citizens in SaC. This consequently also affects the available types in SaC. Generally, every type within SaC is an array type. Arrays are defined by a rank scalar, which describes the dimensionality of the array, a shape vector, which defines the number of elements in each dimension, and a data vector, which holds the actual data stored in the array. Figure 2.1 gives a short overview of what arrays can look like, using integer arrays as an example. The type in this case does not matter, as arrays of all base types look the same. In addition to integers (int), SaC offers all other known base types from C (double, char, etc.).


rank: 0   shape: [ ]       data: [42]
rank: 1   shape: [6]       data: [1,2,3,4,5,6]
rank: 2   shape: [3,3]     data: [1,2,3,4,5,6,7,8,9]
rank: 3   shape: [2,2,3]   data: [1,2,3,4,5,6,7,8,9,10,11,12]

Figure 2.1: Truly multidimensional arrays in SaC (a scalar, a vector, a 3x3 matrix and a 2x2x3 array) and their representation by data vector, shape vector and rank scalar. Reproduced from [14]

Based on the available shape information, array types are internally classified into four categories:

AUD Array of unknown dimension denominates arrays of any, further unknown, dimensionality. This includes scalars (arrays of dimensionality 0).

AUDGZ Array of unknown dimension greater zero denominates arrays of any, further unknown dimensionality, except scalars (arrays of dimensionality 0).

AKD Array of known dimension denominates arrays of known dimensionality. However, the exact shape of the array is unknown.

AKS Array of known shape denominates arrays where both dimensionality and shape are known.

Generally, the shape vector of a type is displayed as a comma-separated list of the number of elements of the array in each dimension. To indicate that it is indeed a vector, the list is enclosed in square brackets ([]). For scalars an empty vector [] is used, which in this case is optional and may also simply be omitted.

To declare the type of an array of known dimensionality but unknown shape (AKD), the . (dot) symbol is used as a placeholder. The shape vector of a three-dimensional array therefore is a vector with three placeholders, [.,.,.]. For arrays of unknown dimension one uses the * (asterisk) symbol for AUD or the + (plus) symbol for AUDGZ as placeholder.
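As a small illustration (a sketch, not taken from the thesis), the four categories translate into the following type notations for integer arrays:

    int[3,3]   /* AKS: rank 2 and shape [3,3] known            */
    int[.,.]   /* AKD: rank 2 known, shape unknown             */
    int[+]     /* AUDGZ: rank unknown, but at least 1          */
    int[*]     /* AUD: rank unknown, scalars included          */
    int        /* scalar: the empty shape vector [] is omitted */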

Furthermore, there is a natural subtype relationship between the individual categories of shape vectors, of which Figure 2.2 provides an overview. The category AUD includes all arrays of a given element type. An array with element type α is therefore a subtype of the AUD type α[*].

Figure 2.2: Subtype hierarchy in SaC: the AUD type α[*] is a supertype of the AUDGZ type α[+], which in turn is a supertype of the AKD types α[.], α[.,.], α[.,.,.], ...; each AKD type is a supertype of the corresponding AKS types, e.g. α[1], α[3], ... below α[.] and α[2,8], α[7,6], ... below α[.,.], while the scalar type α[] sits directly below α[*]. Reproduced from [16]

One level below in the hierarchy are arrays of the category AUDGZ. Those include all arrays with the exception of arrays with dimensionality 0 (scalars). Therefore all arrays with element type α and a dimension greater than zero are subtypes of the AUDGZ type α[+]. Another level lower are arrays of the category AKD. Every AKD type includes all arrays of a given dimensionality and thus becomes a supertype for all corresponding AKS types in the lowest level of the hierarchy.

2.1.2 Function Types

Function types in SaC are implicitly defined through the function signatures. One differentiating factor for SaC, however, is that functions are allowed to have more than one return type.
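For example (an illustrative sketch, not taken from the thesis), a function can return quotient and remainder at once by declaring two return types:

    int, int divmod( int a, int b)
    {
        /* both values are returned to the caller in one go */
        return( a / b, a % b);
    }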

2.1.3 User-defined Types

User-defined types are a feature found in many different languages and can generally be categorized into two implementations. The most common one is a simple alias, where the developer can define an alternate name for an already existing type. In C, for example, this can be accomplished using the typedef construct. The other is actual new types, independent from the basic types already available in the language. In C this would for example be the case for structs.
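In SaC, a user-defined type is likewise declared with a typedef construct. A one-line sketch (the name mat3 is made up, and the exact syntax is assumed from sac2c’s C-like conventions) introducing a name for 3x3 matrices of doubles:

    typedef double[3,3] mat3;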

2.2 With-Loops

The key language construct of SaC is the With-Loop. With-Loops enable developers to design algorithms using shape-invariant programming, meaning they can implement an algorithm for arrays of any size and shape, without having to explicitly handle all possible array sizes the application might work with at runtime. Optimizations within the SaC compiler ensure that no performance is lost due to this. In total there are three main optimization techniques employed in the compiler to attain these performance benefits. The first is called With-Loop Folding. This considers situations where the output of one With-Loop is used as input for a following With-Loop [22][23]. Next is With-Loop Fusion. This considers With-Loops that operate on the same array bounds, but have an independent data flow [10]. Last there is With-Loop Scalarization. This considers situations where one With-Loop is nested in another With-Loop [13]. A general overview of With-Loop optimization techniques can be found in [12].

Through these techniques, the intermediate results of composing value-transforming or (re-)structuring operations can in almost all cases be completely eliminated. Several of the optimizations in the SaC compiler, however, depend on having exact array size information available at compile time in order to reach optimal performance. If that information only becomes available at runtime, a lot of the optimization potential is lost. Adaptive asynchronous optimizations fill this gap and bring these optimizations also to applications missing concrete array size information at compile time. More information about With-Loops can be found in [24].
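As a small illustration of shape-invariant programming (a sketch, not taken from the thesis), the following function increments every element of an integer array of arbitrary rank and shape with a single With-Loop:

    int[*] inc( int[*] a)
    {
        /* the partition ( . <= iv <= . ) covers the full index range of a */
        res = with {
                ( . <= iv <= . ) : a[iv] + 1;
              } : modarray( a);

        return( res);
    }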

2.3 Function overloading

SaC implements function overloading with respect to subtyping of the function arguments. Function overloading describes the ability to reuse an identifier for multiple function definitions. This allows functions with similar functionality to carry the same identifier, independent of the types of their arguments. To avoid confusion, individual declarations of functions are henceforth referred to as instances, whereas the group of all instances is referred to as an overloaded function, or function for short.

Examples of function overloading can be found in many languages. In C the basic operators are overloaded. This allows, for example, arithmetic operations (addition, subtraction, etc.) independent of the argument types, as long as they are numeric. C, however, does not support further overloading of those operators or overloading of built-in functions. One cannot overload user-defined functions, nor can one add new instances to overloaded functions. Java, on the other hand, does support this kind of overloading, as one can declare functions carrying the same identifier but having different types and/or a different number of arguments.

In SaC every function can potentially be overloaded. The individual instances can be declared by reusing the same identifier in the function declaration. A separate declaration of which instances together comprise an overloaded function, like Haskell uses in the form of type classes [15], is not necessary.

Figure 2.3 shows the specification of an overloaded function using the example of the function add. There are two instances of the function. Lines 1-6 show the declaration of an instance for integer arrays of unknown dimension. The actual computation is not important for the example and was thus left out. Lines 8-13 show the declaration of an instance for scalars. Which instance to use is decided at runtime based on the subtypes of the passed arguments, a process also known as dispatching. As the example also reveals, it is possible to declare different function instances for arrays of different dimensionality and shape. An application of a function will always get the function instance with the lowest matching subtype in the subtype hierarchy dispatched. In the case of the function add from Figure 2.3 this means that for arguments of, for example, type int[.] the instance of type int[*] is used, yet for scalar arguments the instance of type int is used. The only restriction to keep in mind when declaring these functions is that they need to be uniquely identifiable according to all argument types at the time of the dispatch. A description of the resulting conditions and the formal semantics can be found in [24].

 1  int[*] add( int[*] A, int[*] B)
 2  {
 3      result = ...
 4
 5      return( result);
 6  }
 7
 8  int add( int a, int b)
 9  {
10      result = a + b;
11
12      return( result);
13  }

Figure 2.3: Example of function overloading in SaC

Dispatching in SaC is done through so-called wrapper functions. The compiler generates a wrapper function for every function, and by default every function call points to the wrapper function instead of the actual instance, even if there is only one instance to choose from. However, there is an optimization in the compiler that detects such cases and removes the wrapper call in favor of calling the one existing instance directly, if additional safeguards are met. The wrapper function itself can be (very much simplified) viewed as a long if-elseif-else statement, where according to the number and types of the arguments the best matching function instance is selected, called, and the return values are handed back to the caller of the wrapper function. A more detailed description of the dispatch mechanism in SaC can be found in [19].
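Conceptually, the wrapper generated for add behaves like the following hand-written sketch (illustrative pseudo-SaC only, not actual generated code; the instance names add_scalar_instance and add_generic_instance are made up, and dim() queries the rank of its argument):

    int[*] add_wrapper( int[*] a, int[*] b)
    {
        if ((dim( a) == 0) && (dim( b) == 0)) {
            /* both arguments are scalars: dispatch to the int instance */
            result = add_scalar_instance( a, b);
        } else {
            /* otherwise fall back to the most general instance int[*] */
            result = add_generic_instance( a, b);
        }

        return( result);
    }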

2.4 Modules and Namespaces

Namespaces are more and more common in programming languages. One can find examples in languages such as C++, Java or Haskell. The core concept of namespaces is to shield identifier usage from name clashes. This means, for example, that two functions which are not related to each other (so they are not instances of an overloaded function) can carry the same identifier within one application, provided both are defined in different namespaces. The benefit here lies mostly in software engineering aspects. It is quite common that more than one developer works on a single application. Integrating the work of multiple developers into a working product can be challenging when one has to be constantly aware not to reuse an identifier another developer has already used elsewhere. The larger the application and the more developers work on it, the more difficult the situation becomes. With namespaces functionality is nicely contained and it becomes markedly easier to integrate multiple components with each other.

Modules share a similar concept, but are in essence less formal. While namespaces provide a semantic separation of code, modules provide a physical separation of code. They allow an application to be built from multiple source files. Abstract concepts of this can be found in pretty much any language today, as it is an integral usability feature for developers and allows for a more structured development approach and organization. In C, for example, one can split up the code into multiple .c files, which are compiled separately into .o files and later linked together to form the final binary.

SaC has support for both concepts. However, namespaces are hidden away from the developer and implicitly defined for every module and program. Modules need to be explicitly named, and the module name is used as the namespace identifier. Programs use the namespace main.

By default none of the identifiers used within a module can be referenced from another module or program. They are all private. To allow for an identifier to be referenced SaC provides two options: the provide and export directives. Using provide one alters the visibility of the specified identifiers such that they can be seen from other modules and programs. However, this does not allow functions to be overloaded across module boundaries. In fact, SaC only allows overloading of functions within one module. To still be able to make use of function overloading, the entire function needs to be cloned


into the new module. export declares identifiers which are available for such cloning. On the side of the consumer, identifiers made public through provide can be referenced by using their fully qualified name, consisting of the module name followed by a double colon followed by the identifier (e.g. Module_1::function_1). To simplify this, SaC has the concept of a search space, where all identifiers within this search space can be referenced in an unqualified manner. By default all identifiers of the current module are within that search space. Using the use directive one can add identifiers to that search space. To clone an identifier one can use the import directive. All of these directives provide functionality to either affect all identifiers, a specific list of identifiers, or all except a specific list of identifiers within a module. More information about SaC’s module system can be found in [16]. A small example follows below.
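As a sketch (module and function names are made up, not taken from the thesis), a module exposing two functions could look like this:

    module Numbers;

    export  {add};        /* may be cloned and overloaded further */
    provide {double_it};  /* visible, but not overloadable        */

    int add( int a, int b)
    {
        return( a + b);
    }

    int double_it( int a)
    {
        return( add( a, a));
    }

A consumer can then call Numbers::double_it( 21) by its fully qualified name, add double_it to its search space with use Numbers: {double_it};, or clone add into its own namespace with import Numbers: {add}; in order to overload it.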


CHAPTER 3

Background: Adaptive Asynchronous Optimizations (Proof-of-concept)

3.1 Fundamental Specification

As mentioned in section 2.2, the core concept of adaptive asynchronous optimizations is to fill the performance gap for applications that work with array rank and shape only known at runtime. Multiple vital components, inserted into key areas of the compiler and the built application, add a sophisticated architecture that allows the full potential of the compiler’s optimization phases to be used at application runtime.

Reducing the architecture requirements to a minimal functional specification, we end up with three main components. First, there needs to be a way to detect shape/rank-generic functions and trigger the generation of shape/rank-specific versions. Then there needs to be a way to actually generate those shape/rank-specific versions. Lastly, those generated shape/rank-specific versions need to be made available to the application.

A proof-of-concept implementation of such an architecture is presented in [29], and further detailed in [14]. The work done as part of that effort showed that the fundamental specification described above is enough to verify the possible performance gain from such an architecture. In this chapter we thus provide an overview of the architectural approach taken for the proof-of-concept.

3.2 Architecture

Figure 3.1 shows the architecture of the proof-of-concept as detailed in [14]. Judging from this illustration one might think that the proof-of-concept implementation already has a more complex design than the basic fundamental specification. But taking a closer look, we can classify the individual parts to match it exactly.

We have an executable program that during runtime detects the use of shape/rank-generic functions and files specialization requests, which end up in a queue. A dynamic specialization controller takes requests from the queue and triggers the generation of shape/rank-specific versions, further referred to as specializations. This sequence matches the first component of the fundamental specification. The specialization controller then invokes the SaC compiler in a special way to load the pre-generated intermediate code, create a new SaC module from it containing the shape/rank-specific version of the requested function, and generate the binary code. This sequence matches the second component

Figure 3.1: Architecture of the proof-of-concept implementation as detailed in [14], showing the executable program, the specialisation request queue, the dynamic specialisation controller, the function dispatch registry, the SaC compiler, the intermediate code and the generated binary code.

of the fundamental specification. Once the binary code is available, the generated library is linked back into the running application by the dynamic specialization controller. The information about the specialization being available is written into a dispatch function registry, so the running application knows it can now call a shape/rank-specific version instead of the generic one. This sequence matches the third component of the fundamental specification.

3.2.1 Detection and Trigger

The way to determine in SaC which functions can be specialized to attain performance benefits is to take a look at the arguments passed to the function call. The only possible place to detect potential optimization cases is thus always at or around the function call itself. From a software engineering perspective the precise place the detection is included is of more critical importance, but overall the interaction flow does not change. When a potential optimization case is detected, a request is put into a queue. The specialization controller continuously checks the queue for new requests and works on each one of them in order. Specialization requests enter the queue in the order the respective functions are called. There is only one controller, so the specialization requests are also worked on, and subsequently resolved, in the order of the function calls.

3.2.2 Generation

Generation of the specializations happens through invoking the SaC compiler with custom private command-line arguments to communicate the necessary information for the optimizations. The SaC compiler loads the intermediate code for the module containing the function that needs to be specialized. After the intermediate code is available, some filtering happens to exclude functions present in the module that are independent from the function getting specialized. This saves processing time for the compiler. As a follow-up, some adaptations are performed on the code to make sure the shape/rank-specific code is


conforming to the structure built to load it back into the application. While all of these actions are far from trivial, they again center around software engineering problems. At this point we skip over the details of the problems encountered by the proof-of-concept implementation.

3.2.3 Linking and Using

The specializations are generated as shared libraries in a temporary location. The specialization controller links the libraries directly from there and stores the pointer to the wrapper of the generated shape/rank-specific function in a registry. The registry is built to be specific to one function. Every overloaded function has its own registry. Having the pointer to the wrapper function in the registry, rather than directly to the shape/rank-specific function, means that when calling the function nothing special needs to be done to inform the running application about the new specialization. We simply replaced the original call to a wrapper with a call to a different wrapper, which knows about the newly created specialization and can direct the call to that one instead of the generic function.


CHAPTER 4

Persistent Adaptive Asynchronous Optimizations

4.1 Introduction

The results of the experiments performed with the proof-of-concept implementation of an adaptive asynchronous optimization framework, presented in [29] and [14], clearly display the optimization potential of the concept. In this chapter we take a look at improving the usability of adaptive asynchronous optimizations by extending the architecture with a persistence layer. This avoids having to regenerate specializations on every application run and reduces the effort of generating specializations to a one-time overhead.

Depending on the architectural choices made for the persistence layer, it might even be possible to reuse generated specializations across different applications using the same functions. This would be useful for creating specializations of functions in the standard library, for example.

This chapter explains the architectural choices made during the research, culminating in a production-ready implementation of an adaptive asynchronous optimization framework for SaC based on filesystem persistence of specializations. Experiments detailing the performance benefits to be had from this solution, and how it compares to various other scenarios, can be found in chapter 5.

4.2 Data Store

There are multiple ways to implement a persistence layer, but to determine an apt solution one has to look at the requirements from a user and from an application perspective. We can assume that multiple users will run the same application on one machine, so the persistence layer should be user aware. Specifically on a development machine, generated specializations should probably not be shared. However, it makes sense that two users might want to share a pool of generated specializations on production systems, so it should be possible to either combine or link multiple persistence layers together. Furthermore, the persistence layer needs to be able to cope with more than one concurrently running application requiring access. This is true for the case when a persistence layer is shared between multiple users, but of course a single user can also choose to run two or more applications at the same time. On top of concurrent access to different specializations in the persistence layer, it also needs to be able to handle concurrent read


and write access to the same specialization. This would be the case for example when two separate processes generate the same specialization at the same time.

From the application perspective, the persistence layer should primarily allow for fast loading and fast storing of generated specializations. Furthermore specializations need to be uniquely identifiable, since the persistence layer can hold specializations for many functions of different applications.

Considering these requirements, a persistence layer could be implemented on top of different data store technologies. Examples could be a relational database system, a filesystem directory structure or even cloud storage. From a practical perspective, however, a filesystem data store makes the most sense. Concurrency is available by default, and thread-safety is only an issue when two specializations for the same function are generated at the same time. But even that is straightforward to sort out. Loading and storing speed is mostly dependent on the hardware used for storage. Any reasonably modern system is already equipped with SSD storage, which provides plenty of performance for this use case. The choice of a particular filesystem may also have an impact on loading and storing speed, but the file sizes of the generated specializations, usually below 1 MiB, diminish that impact to the point where it would barely be measurable in the first place. The application-side requirements are thus perfectly handled by a filesystem based data store.

From a user perspective, the common way to make the data store user specific is to define its location within a user’s home folder, for example $HOME/.sac2c/rtspec/. Generated specializations can directly be copied or moved to the data store in the format they are generated in, and as a consequence also simply be loaded as such. Sharing a data store can be accomplished through various means. Given that every generated specialization is stored in a separate shared library file, they could simply be copied, moved or linked to another user’s data store. The data store could also be placed at a shared filesystem location through means of an environment variable, for example $SAC2C_RTSPEC_REPO. Another approach would be combining different user repositories through means of a union mount system like overlayfs. This additionally would bring some safety features to the table, since only one of the mount points that get combined can be selected as writable; the others are simply read-only. Every user could then, for example, have read-only access to every other user’s data store, while still being able to write into his own. With this, the requirements from the user’s side are also handled by a filesystem based data store.

4.3 Identification

For storing runtime-generated specializations of functions in a persistence layer it is paramount to be able to uniquely identify a certain function, as one does not want to end up loading a shape/rank-specific version of a different function into a running application. Simple name-based identification is not sufficient, as it is conceivable that multiple applications may use either a different implementation or a different evolution of the same function, bearing the same name. Additional metadata about a function needs to be used in order to identify a function, and specifically a function instance, correctly. The implementation presented in this thesis uses the module name, the argument count, types and shapes, as well as a unique identification string, in addition to the function name. For the unique identification string we take a look at two possible options, each of which provides a sufficiently unique identification of functions.


The first option is based on the intermediate compiler representation of a function, more accurately its definition and body. It is reasonable to assume that two functions sharing the same intermediate compiler representation have the same behaviour to the outside and can thus share generated specializations. Taking the intermediate compiler representation of the code rather than the actual code simply abstracts from formatting issues. Obviously, in most cases the intermediate compiler representation of a function is way too long to use as an identifier, especially on a filesystem level. As such we would generate a hash of the intermediate compiler representation and use that generated hash as the identifier.

One could assume that using a hash like this as unique identification string is already sufficient, especially in combination with argument types, count and shape information. Alas, relying on a perfect hashing algorithm is not possible in this case, since we do not know the entire set of possible functions. Any hashing algorithm we choose is thus prone to collisions. Relying on the hash alone is therefore not sufficient, and we need the other previously mentioned extra information to be sure to identify the right function. The hash would be generated at compile time of the generic implementation. With this in mind, the speed of the hashing algorithm is not really important. We can therefore look for one that fits easily into our codebase (language, coding style, etc.) or sacrifices speed for a reduced risk of collisions.

Alternatively to hashing the function definition and body, a second option is to use a compile-time generated UUID, conforming to the standard described in OSF DCE 1.1. Compared to the hash of the intermediate compiler representation it has the benefit of being easier and faster to attain. The downside is that any recompilation of the generic function, even if the intermediate compiler representation did not change, yields a different identifier.

Either option serves well as a unique identification string, and as it is only a simple string from the perspective of any persistence implementation, there is no further impact whether one or the other is used. The only difference lies in the implementation that generates the unique identification string itself. Hashing would require the generic function definitions and bodies to be written out into files, or stored in separate string values, so that a hash could easily be generated from a combination of both definition and body. The UUID, on the other hand, can be generated using function calls available in common system libraries. For the purpose of this thesis we chose a UUID based unique identification string for simplicity reasons.

To generate the unique identification string, a new subphase was added to the SaC compiler to attach the string as an attribute to the function definition node in the intermediate compiler representation. Compiler phases describe the order in which certain operations on the source code have to be performed. In simple terms, one could have a phase in the beginning responsible for parsing, then another one for optimizing, and a final one for generating the binary code. A subphase describes a smaller, ideally but not necessarily self-contained, step that performs a certain action on the source code. In our case the subphase was added to the phase for exporting symbols, which culminates in the serialization of the syntax tree. It is this serialized version of the syntax tree that is later loaded by the sac2c compiler to generate a shape/rank-specific version of a function, a process that uses the unique identification string. To have that string available in the serialized version of the syntax tree, we therefore need a subphase adding that information before the serialization happens.

In this subphase we iterate over the function definitions available in the module that is currently compiled and look for functions we can specialize at runtime. There is a wide variety of function definition options and situations in SaC. Not all situations would benefit from adaptive asynchronous optimizations, like, for example, a function without any arguments. Some options require edge-case handling in various places, for example functions that take a variable number of arguments or have a variable number of return values. To limit implementation complexity we aimed at the most common cases of function definitions that would benefit from adaptive asynchronous optimizations. In the subphase we thus look for functions taking arguments of generic rank and shape, and filter out functions that we either know won’t benefit from adaptive asynchronous optimizations or that we consider edge cases and leave for a later implementation. Every remaining function gets a unique identification string attached to its node in the intermediate compiler representation. A small illustration of this filtering follows after Figure 4.1 below.

$HOME/.sac2c/
  host/
    seq-rtspec/
      ConvolutionAuxiliaries/
        convolution_step/
          82d08472-8775-47d9-b677-a6a5a6d6b800/
            double/
              1-2-1000-1000.so
              1-3-100-100-100.so
          b7936376-8aef-4317-8ff9-a2fc0cd5d5da/
            double/
              1-2-1000-1000.so
              1-3-100-100-100.so
        is_convergent/
          7293f74c-c57f-46c4-9972-4be23b612888/
            double-double-double/
              3-2-1000-1000-2-1000-1000-0.so
              3-3-100-100-100-3-100-100-100-0.so
          feed5c9c-127d-499a-b120-ba7e111e7648/
            double-double-double/
              3-2-1000-1000-2-1000-1000-0.so
              3-3-100-100-100-3-100-100-100-0.so

Figure 4.1: Example directory layout in the filesystem data store of the specialization persistence
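To illustrate the filtering (a sketch with made-up functions, not taken from the thesis), the first function below takes an argument of generic rank and shape and therefore receives a unique identification string, while the second has no arguments to specialize on and is filtered out:

    /* candidate for runtime specialization: generic argument */
    double[*] relax( double[*] grid)
    {
        return( 0.5 * grid);
    }

    /* filtered out: no arguments to specialize on */
    double pi()
    {
        return( 3.14159265358979);
    }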


4.4 Filesystem Layout

As noted before, to uniquely identify a specialization in the persistence layer we need module name, function name, argument types, argument count and shapes as well as the UUID identifier serving as unique identification string. All of these need to be represented in the data store. Assuming one folder hierarchy level for each of those individual pieces, any combination of them is sufficient to meet the unique identification requirement. The actual combination used can thus be picked based on other requirements, for example ease of use for developers, or ease of synchronization between repositories.

Figure 4.1 shows what the final directory layout for the persistence implementation of the adaptive asynchronous optimization framework looks like. $HOME/.sac2c is the top level directory. host and seq-rtspec are SaC internal identifiers for the host environment and the SBI (SaC binary interface), respectively. In this case they refer to the local host environment and the sequential version of the adaptive asynchronous optimizations. This is followed by the module names, then the function names, the UUID and the argument types, and closes off using the argument count and argument shapes as filenames. Having the UUID as a folder in this hierarchy allows for easy synchronization between repositories, as all specializations for one variant of a function are under a common subfolder.

4.5 Specialization Generation and Loading

Having the specialization persistence available radically changes the communication patterns within the adaptive asynchronous optimization framework architecture. Figure 4.2 outlines the major steps happening in the architecture now. Comparing with Figure 3.1, we can already determine many changes. The most visible one is probably the addition of the persistence layer. However, the most interesting one is the new role and position of the specialization registry.

Figure 4.2: Architecture of the persistent adaptive asynchronous optimization framework. The executable program files specialization requests into a queue that is inspected by one or more dynamic specialization controllers. A controller checks the persistence layer for an existing specialization and loads it if present; otherwise it invokes the SaC compiler to generate the specialization, stores it in the persistence layer and then loads it. In both cases it updates the specialization registry, in which the executable program looks up the function pointers of specializations as they become available.

Figure 4.3: Runtime specialization wrapper placement in the proof-of-concept implementation (top) and the new implementation (bottom)

In the proof-of-concept architecture without persistence, there can only be two cases: either there already is a specialization loaded, or a specialization still needs to be generated. To check for either of these cases, a runtime specialization wrapper is inserted into the code. In Figure 4.3 we can see a simplified representation of what a call to a shape/rank-generic function in SaC looks like.

As already briefly mentioned in section 2.3, a call to a shape/rank-generic function is typically guarded by a call to the function wrapper. Specifically in situations where a function already has some specializations available, the call always goes through the wrapper. The wrapper decides according to shape and rank of the arguments which instance of the function is called, or, if none of them match, is responsible for throwing an error and potentially aborting the application. However, if the compiler can determine that there is only one matching instance, it can optimize the call to the wrapper function

(33)

away, and call this single instance directly.

In Figure 4.3 we thus see these two situations, once with and once without the wrapper. For the proof-of-concept code it was decided to insert the runtime specialization wrapper code in front of a function wrapper call. This allowed for an implementation that was still able to give accurate insight into the potential performance gains of adaptive asynchronous optimizations, yet kept the implementation complexity at a minimum. As mentioned in subsection 3.2.3, every function has a separate registry holding a pointer to a function wrapper. This wrapper would either be the original function wrapper, or a new one generated together with the requested specialization. The generated wrapper function knows about all functions available in the original module, plus the newly generated specialization. Whenever a new specialization is generated, this wrapper is simply switched out and, in the same fashion as before, again knows about all functions available in the previous module, plus the newly generated specializations. With every new specialization being generated, the wrapper thus knows about one more function instance.

The runtime specialization wrapper would fetch the pointer to the current function wrapper from the registry and call it. It therefore has very limited knowledge about the specializations that are available. It only knows whether a specialization was generated or not, but cannot determine whether it is the right one. For this reason, a specialization request is issued on every function call, leaving the responsibility to check whether it is actually necessary to generate this specialization up to the controller.
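
In pseudo-C, the proof-of-concept runtime specialization wrapper thus amounted to little more than the sketch below; the reg_t structure and the helper names are illustrative stand-ins, not the actual generated code.

/* Hypothetical helper filing a request into the request queue. */
extern void file_specialization_request(const char *module,
                                        const char *fun, void *args);

/* Per-function registry of the proof-of-concept design: it holds a
 * single pointer, either to the original function wrapper or to a
 * regenerated one that knows about more specializations. */
typedef struct {
    void *(*wrapper)(void *args);
} reg_t;

/* The wrapper cannot tell whether a matching specialization exists,
 * so it requests one on every call and leaves the dispatch decision
 * to whatever function wrapper the registry currently holds. */
void *rtspec_wrapper(reg_t *reg, const char *module,
                     const char *fun, void *args)
{
    file_specialization_request(module, fun, args);
    return reg->wrapper(args);
}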

When integrating the persistence layer it became apparent that this approach was no longer viable. When loading specializations from the persistence layer, we could no longer call the function wrapper. Specializations could be generated from multiple different applications, in any order. The function wrappers included with the specializations would know about the specialization itself, but not necessarily about other specializations that we could potentially also be interested in. And since the registry only supports holding a pointer to one function wrapper, we could potentially lose references to previously loaded specializations when loading a new one.

We needed to be able to call shape/rank-specific functions directly, and as a consequence also shape/rank-generic functions. This also meant that the runtime specialization wrapper needed to be inserted at a different location. Calls to shape/rank-specific functions work slightly differently from calls to shape/rank-generic functions and function wrappers. The values returned from a shape/rank-specific function have more array information available than values returned otherwise, which can cause side-effects in code that does not expect that information to be there. A function wrapper thus includes safety guards to hide array information from values returned from shape/rank-specific functions to avoid these side-effects. With the runtime specialization wrapper now required to call either shape/rank-specific or shape/rank-generic functions, it had the same responsibility as the function wrapper to implement these safety guards.

In the bottom part of Figure 4.3 we can see a simplified representation of how a function call to a shape/rank-generic function works after these changes. The runtime specialization wrapper needed to be inserted in front of every shape/rank-generic function call, instead of in front of every function wrapper call. This had the beneficial side-effect that we could now also trigger the generation of specializations on calls where the function wrapper was optimized out.

Since the runtime specialization wrapper now needed to be able to call specific specializations, the registry needed to be able to keep track of those specific specializations as well. Holding a single pointer was no longer enough, since we needed a pointer for every loaded specialization, on top of a pointer to the shape/rank-generic function, which still needed to be called in case no matching specialization was available. All those function pointers also needed to be clearly identifiable to be able to select the right specialization. In order to meet those requirements and still be able to select the best available function pointer quickly, we decided to store the function pointers in a hashmap. A string concatenation of UUID, argument types, count and shapes serves as lookup key for the hashmap. We found a readily available and easy to use hashmap implementation in uthash, which happened to be already used in other areas of the compiler as well.
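
To illustrate, the sketch below shows what such a hashmap-backed registry could look like using uthash; the structure layout, key format and helper names are illustrative assumptions rather than the actual registry code.

#include <stdlib.h>
#include <string.h>
#include "uthash.h"

/* Illustrative registry entry: the key concatenates UUID, argument
 * types, count and shapes; the value is the loaded function pointer. */
typedef struct reg_entry {
    char key[256];         /* e.g. "<uuid>-double-2-[1000,1000]" */
    void *fun;             /* pointer to the loaded specialization */
    UT_hash_handle hh;     /* makes this structure hashable */
} reg_entry_t;

static reg_entry_t *registry = NULL;

/* Register a freshly loaded or freshly generated specialization. */
static void registry_add(const char *key, void *fun)
{
    reg_entry_t *e = malloc(sizeof(reg_entry_t));
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->key[sizeof(e->key) - 1] = '\0';
    e->fun = fun;
    HASH_ADD_STR(registry, key, e);
}

/* Look up the function pointer for a call signature, NULL if absent. */
static void *registry_find(const char *key)
{
    reg_entry_t *e;
    HASH_FIND_STR(registry, key, e);
    return e != NULL ? e->fun : NULL;
}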

The specialization registry already handled checking loaded specializations for a match and, if none was found, returning a pointer to the shape/rank-generic function instead. It therefore presented itself as a natural fit when we had to find a place to integrate the actual loading of specializations from the persistence. Loading from the persistence should happen in such a way that, in case a specialization is available, no call to the shape/rank-generic function needs to be made. But we also do not want to check the persistence if a matching specialization is already loaded. The moment for loading the specialization therefore needed to be in between checking the loaded specializations and calling the shape/rank-generic function, both of which happen in the specialization registry. Consequently, the registry now also knew when best to issue a specialization request. No request is needed when a specialization is already loaded or available in the persistence; a specialization request is only issued when the shape/rank-generic function is called. The decision tree for the specialization registry therefore looks as follows (a code sketch follows the list):

• Check whether an already loaded specialization matches the argument types, count and shapes of the current call. If there is one, return the respective function pointer. These can either be specializations loaded from persistence or specializations generated within the current application run; once loaded, there is no difference between the two anymore.

• If no match has been found in the already loaded specializations, check whether there is a matching specialization available in the persistence layer. If there is one, load it and return the respective function pointer.

• If no match can be found in either the already loaded specializations or the persistence layer, issue a specialization request and return a pointer to the shape/rank-generic function.
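
In code, this decision tree boils down to something like the following sketch. It reuses the hypothetical registry_add and registry_find helpers from the sketch above; persistence_load and enqueue_request stand in for the persistence lookup and the request queue described earlier and are equally illustrative.

/* Hypothetical helpers: load a specialization from the persistence
 * layer (NULL if none exists) and file a specialization request. */
extern void *persistence_load(const char *key);
extern void enqueue_request(const char *key);

/* Sketch of the registry's decision tree for one function call. */
void *registry_dispatch(const char *key, void *generic_fun)
{
    /* 1. An already loaded specialization always wins, regardless of
     *    whether it was generated in this run or loaded earlier. */
    void *fun = registry_find(key);
    if (fun != NULL)
        return fun;

    /* 2. Otherwise try the persistence layer; a hit is cached in the
     *    registry so that later calls take the fast path above. */
    fun = persistence_load(key);
    if (fun != NULL) {
        registry_add(key, fun);
        return fun;
    }

    /* 3. Nothing available: request a specialization and fall back
     *    to the shape/rank-generic function for this call. */
    enqueue_request(key);
    return generic_fun;
}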

Once the specialization request has been issued, the process is again similar to the proof-of-concept implementation. The controller fetches specialization requests from the queue and checks whether the specialization has already been generated. While specialization requests are no longer issued when a specialization is already loaded, it could still be the case that the shape/rank-generic function was called multiple times before a specialization became available. Each of those calls would have triggered a separate specialization request. The controller therefore still needs to verify that generating a new specialization is actually needed.

Once verified, the controller invokes the SaC compiler, which generates a specialization at a temporary location. In the proof-of-concept implementation the controller would then continue to load the specialization and use it immediately. With a persistence layer in the picture it makes sense to first move the specialization from the temporary location into the persistence directory and load it from there. This unifies the loading of specializations and reduces code complexity.
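
A minimal sketch of this unified store-then-load step, assuming the specialization is compiled into a shared library and using POSIX rename and dlopen; both path arguments would come from helpers like the path construction sketched earlier, and the function name is illustrative.

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: move a freshly generated specialization library from its
 * temporary location into the persistence directory, then load it via
 * the same code path used for specializations found in persistence. */
void *store_and_load(const char *tmp_path, const char *persist_path)
{
    if (rename(tmp_path, persist_path) != 0) {
        perror("moving specialization into persistence failed");
        return NULL;
    }
    /* Resolve all symbols immediately; the caller would then fetch
     * the specialization's entry point with dlsym. */
    return dlopen(persist_path, RTLD_NOW | RTLD_LOCAL);
}

Note that rename only works within one file system; if the temporary directory resides on a different mount point than $HOME, an explicit copy would be needed instead.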


CHAPTER 5

Evaluation of Persistent Adaptive Asynchronous Optimizations

5.1 Introduction

The addition of a persistence layer to adaptive asynchronous optimizations resulted in major architectural changes in critical locations. When evaluating the performance of adaptive asynchronous optimizations and the effect the persistence layer has on it, it thus also becomes interesting to compare with the proof-of-concept implementation and see whether any of the architectural changes had an impact on performance as well.

To evaluate the effects of the persistence layer for adaptive asynchronous optimizations on application performance we utilize four experiments. We use repeated matrix multiplication, convolution and n-body simulation to take a look at specific characteristics of adaptive asynchronous optimizations, as each of these experiments has a slightly different behavior. The source code used for these experiments is based on existing demonstration applications for SaC. Using a data stream simulator specifically built for this thesis to evaluate adaptive asynchronous optimizations, we then take a look at some long-term scenarios to get insight into how well the concept fares in such situations.

All experiments are performed on an Intel NUC6i7KYK system, using an Intel Core i7-6770HQ, which is a quad core CPU with hyper-threading and turbo boost capabilities, both of which are enabled by default. The system is equipped with 32GB of DDR4 RAM running at 2400MHz, and a Samsung 950 Pro 512GB NVMe PCIe SSD as hard disk. Slackware Linux 14.2 served as the operating system.

5.2 Repeated Matrix Multiplication

The first experiment repeatedly performs matrix multiplications on two input matrices. One iteration in the application is thus equivalent to one multiplication performed on the matrix. Figure 5.2 shows the module containing the multiplication code. There are two instances of the function matmul, one for two-dimensional arrays of any size, and one for two-dimensional arrays of size 2x2. The latter implementation is irrelevant for the computation itself, but is necessary for the main application to use the function wrapper generated for the function matmul. Without the second instance, the SaC compiler would see that there is only one instance and point to it directly instead of to the wrapper function. This behavior is needed for the proof-of-concept implementation of the adaptive asynchronous optimizations.


 1  use Array:all;
 2  use Hiding:all;
 3  use Matrix:all;
 4  use StdIO:all;
 5  use RTimer: all;
 6
 7  int main()
 8  {
 9      A = genarray( [SIZE1,SIZE2], 1.0);
10      B = genarray( [SIZE2,SIZE3], 1.0);
11
12      A_b = hideShape(A);
13      B_b = hideShape(B);
14
15      timer = createRTimer();
16
17      for (i = 0; i < 40; i++)
18      {
19          startRTimer(timer);
20          A_b = matmul( A_b, B_b);
21          stopRTimer(timer);
22          print(getRTimerDbl(timer));
23          resetRTimer(timer);
24      }
25
26      destroyRTimer(timer);
27
28      print(A_b[0,0]);
29
30      return(0);
31  }

Figure 5.1: Main program for the matrix multiplication experiment

module Matrix;

use Array:all;

export { matmul };

double[.,.] matmul( double[.,.] A, double[.,.] B)
{
    BT = transpose( B);

    C = { [i,j] -> sum(A[i,.] * BT[.,j]) };

    return( C);
}

double[2,2] matmul( double[2,2] A, double[2,2] B)
{
    ...
}

Figure 5.2: Performance-critical functions for the matrix multiplication experiment

The new implementation presented in this thesis also works without the second instance.

Figure 5.1 shows the main application for the matrix multiplication experiment. The function calls relating to RTimer show how the iteration measurements are obtained. In line 15 we create a timer using the createRTimer function. This is done outside the loop since we can simply reset this one timer after every iteration rather than constantly creating new timers. In line 19 we start the timer, then call the matrix multiplication function, and in line 21 we stop the timer again. In line 22 we output the measurement taken to stdout, and in line 23 consequently reset the timer before entering the next iteration of the for-loop. For a more theoretical and semantic look at how timers work in SaC we refer to [11] and [9].

The final print statement in line 28 ensures that the matrix multiplication actually takes place. Without that statement the result of the function matmul would not be used anywhere, and the compiler could therefore remove the function call entirely.


The hideShape calls in lines 12 and 13 hide the exact array shapes from the compiler. They prevent optimizations that would otherwise be applied because array shape and rank are hardcoded in the application.

use Array:all;
use Hiding:all;
use StdIO:all;
use RTimer: all;

int main()
{
    A = genarray( [SIZE1,SIZE2], 1.0);
    B = genarray( [SIZE2,SIZE3], 1.0);

    timer = createRTimer();

    for (i = 0; i < 40; i++)
    {
        startRTimer(timer);
        A = matmul( A, B);
        stopRTimer(timer);
        print(getRTimerDbl(timer));
        resetRTimer(timer);
    }

    destroyRTimer(timer);

    print(A[0,0]);

    return(0);
}

double[.,.] matmul( double[SIZE1,SIZE2] A, double[SIZE2,SIZE3] B)
{
    BT = transpose( B);

    C = { [i,j] -> sum(A[i,.] * BT[.,j]) };

    return( C);
}

Figure 5.3: Optimized version of the matrix multiplication experiment

Figure 5.3 shows what an optimized version of the matrix multiplication experiment looks like. This example establishes the performance baseline that we aim to achieve with adaptive asynchronous optimizations. Notable here is that the matmul function is available directly in the same file, so the compiler does not need to bother with module imports. The hideShape calls have been removed, meaning the compiler now recognizes the hardcoded array sizes and will directly optimize for them. Optimized versions of the later experiments follow the same pattern; we thus refrain from listing them.

When collecting results for this first experiment, we found that performance measurements for individual iterations had a rather wide margin of error. Figure 5.4 shows an example comparing the shape/rank-generic version with the shape/rank-specific version, illustrating how large the performance deviations between iterations can sometimes be. The workload does not differ between iterations, so in an optimal case one would expect all iterations to show the same performance. The test system was used exclusively for running these experiments, and any system services were reduced to a minimum. What remains can merely be categorized as background noise: potentially subtle scheduling differences in the CPU, or other environmental factors that are hard to control. With such a high margin of error, however, using the performance data of complete scenario runs would present a distorted picture when comparing one scenario with another. To present meaningful performance comparisons we therefore chose to use the minimum values obtained from measurements over multiple scenario runs instead.
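
Computing these per-iteration minima over several runs is straightforward; the sketch below assumes the timer output of all runs has already been collected into one array, and the run and iteration counts are illustrative.

#define RUNS 5
#define ITERATIONS 40

/* Reduce the per-iteration timings of several scenario runs to their
 * minimum. Background noise only ever adds to the measured time, so
 * the minimum approximates the undisturbed iteration cost. */
void per_iteration_minimum(double measured[RUNS][ITERATIONS],
                           double minimum[ITERATIONS])
{
    for (int i = 0; i < ITERATIONS; i++) {
        minimum[i] = measured[0][i];
        for (int r = 1; r < RUNS; r++) {
            if (measured[r][i] < minimum[i])
                minimum[i] = measured[r][i];
        }
    }
}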

The purpose of this simple experiment is to show the basic capabilities of adaptive asynchronous optimizations. There is one function of interest, Matrix::matmul, which
