• No results found

Execution Time Estimation in the Work ow Scheduling Problem

N/A
N/A
Protected

Academic year: 2021

Share "Execution Time Estimation in the Work ow Scheduling Problem"

Copied!
60
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Execution Time Estimation in the Workflow

Scheduling Problem

Master’s Thesis

written by

Artem Chirkin

under the supervision of

Adam Belloum and Sergey Kovalchuk

June 17

th

, 2014

(2)

Abstract

Research, business, and production computational problems con-stantly increase their requirements for computing resources, because they involve building complex models and processing of large amounts of data. This leads to the creation of heterogeneous computing systems. Scheduling software is developed to manage compound applications -workflows - that consist of a number of sub-tasks. The development of workflows and heterogeneous systems raises the scheduling problem: how to map the tasks of a workflow onto the available resources in the most efficient way (minimizing makespan or cost, etc.).

An important part of the scheduling problem is the estimation of the workflow execution time. The quality of scheduling strongly depends on the runtime estimate. The aim of this work is to highlight common problems in estimating the workflow execution time and propose a solu-tion that takes into account the complexity and the randomness of the workflow components and their runtime. The solution addresses the problems at different levels from task to workflow, including the error measurement and the algorithm analysis.

The basis of the developed approach is a stochastic model of the task runtime that consists of the mean and the standard deviation estimates as the functions of the task execution arguments. The models and the execution data then are used to create the estimates of runtime distributions. I use a dual representation, characteristic / distribution functions, in order to combine tasks’ estimates into the overall workflow makespan. Additionally, I propose the workflow reductions - the operations on a workflow graph that do not decrease the accuracy of the estimates, but simplify the graph structure, hence increasing the performance of the algorithm.

Keywords: workflow, DAG, makespan, execution time, runtime, estimate, scheduling

(3)

Acknowledgments

I would like to thank the coordinators of our double-degree Mas-ter’s programme, who provided me with the outstanding opportunity of participating in this international project. This program allowed me to gain the unique experience of studying in the University of Amster-dam while continuing my education in the ITMO University.

My thanks are to Adam Belloum, my supervisor in UvA, for his advice in my master project and the patience in correcting the thesis. Without his help this work would definitively not have been possible. I would also like to thank Sergey Kovalchuk, my supervisor in ITMO, who guided my research for almost two years.

Finally, my thanks go to Valeria Krzhizhanovskaya for her constant help in solving the organizational questions regarding our educational program, and for the possibility of participating in a project aimed at the development of an urban flood decision support system; the runtime models I created in scope of this project were the starting point for the thesis.

(4)

Contents

1 Introduction 1

2 Related work 4

2.1 Execution time in workflow scheduling . . . 4 2.2 Execution time and pricing models . . . 8 2.3 Single node and performance issues . . . 9

3 Runtime Estimation 11

3.1 Composed time . . . 12 3.2 Task execution time . . . 12 3.3 Workflow execution time . . . 14

4 Mathematical background 17

4.1 Approaches to estimate runtime . . . 17 4.2 Estimating the workflow makespan CDF . . . 20 4.3 Transform between CDF and CF . . . 21

5 Algorithm description 25

5.1 Workflow reductions . . . 26 5.2 Algorithm . . . 30

6 Implementation and results 32

6.1 Architecture . . . 32 6.2 Workflow runtime estimation benchmarks . . . 37 6.3 Performance notes . . . 40

7 Conclusions and future research 42

A List of Acronyms 49

B List of Definitions 50

(5)

Chapter 1

Introduction

Research, business, and production computational problems constantly increase their requirements for computing resources, because they involve building complex models and processing of large amounts of data. Engineers have to deal with physics of composite structures, astrophysicists make use of observational data, etc. These problems are often decomposed into smaller tasks. In order to run small sub-tasks and aggregate them to solve large-scale problems, scientists develop scheduling and management software. Therefore heterogeneous systems (HSs) (such as Grids or Clouds) become more popular, as they provide high flexibility in resource scaling, processing of large data sets, the possibility to run single services on different computers as a part of a complex application [1][2]. Additionally, they provide the significant boost in performance compared to single computers due to various kinds of parallelizing: large data amounts can be processed in parallel, or different sub-tasks of a complex problem can be executed concurrently on the separate computing nodes (CNs) [3]. On the one hand, these systems allow various organizations to share their hardware and software resources among them and collaborate better. On the other hand, cloud services allow users to access computing resources and software without the need of purchasing them. This is especially relevant to the scientific area, where the tasks require high performance computing and the software is expensive. Development of an HS involves creating special software to run in this environment. The software should contain tools to schedule and execute the computational tasks. Com-plex tasks executed in an HS usually consist of several separate tasks and form a workflow. I refer to the workflow as a collection of tasks and informational dependencies between them; these dependencies impose some restrictions on the execution order of the tasks. One popular definition of the workflow is given by Workflow Management Coalition [4]: “The computerised facilitation or automation of a business process, in whole or part”. They also give a definition for Workflow Management System: “A system that completely defines, manages and executes workflows through the execution of software whose order of execution is driven by a computer representation of the workflow logic”. A number of papers are written to describe various implementations of such systems (e.g. [5][6]).

The main question researchers address when developing a workflow management sys-tem is how to schedule a workflow in the most efficient way: minimizing a makespan or a cost, meeting a deadline, or other constraints [7][8]. I assume that an HS contains a number of packages (applications) which provide services; a task of the workflow can be regarded as the execution of a single service. An HS contains a number of nodes (com-puting units). The scheduling problem consists in mapping tasks to nodes optimally.

(6)

A popular approach to model a workflow is to represent it as a directed acyclic graph (DAG) [7][9][10]. Each node of the DAG represents an atomic task, and edge represents a dependency of two tasks. The tasks are mapped to the CNs, hence the corresponding vertices of the DAG store all information related to the execution of the tasks. It is well-known that the complexity of a general scheduling problem on a DAG is the NP-Complete [11], thus many heuristics exist that provide a nearly-optimal solution to the scheduling problem. In addition, the execution time of the tasks depends on many complex factors, such as the performance and the workload of CNs, algorithms’ structure, on the execution parameters - various papers differ in the way they represent these factors.

A significant part of the scheduling problem is the estimation of the workflow execution time. In this work it is also called the “runtime” or “makespan” estimation. In order to obtain the workflow runtime estimation, one needs to combine the execution time estimations of the tasks forming the workflow graph. The quality of scheduling strongly depends on the runtime estimate, because it is usually either a target of the optimization or a part of it.

The runtime estimation problem is not so well-developed as the scheduling problem because several straightforward techniques exist providing performance that is acceptable in many scheduling cases. However, these techniques may fail in some cases, especially in a deadline constrained setting. For example, consider a simple approach - to take the average along previous executions as the only value representing the runtime; it underestimates the total time in case of a parallel parameter-sweep1 workflow, because it estimates the workflow makespan as the maximum of the average tasks runtime values; whereas in a real world it is likely that at least some of the tasks take more time to execute than usual. Moreover, when estimating the workflow execution time, one should take into account the dependency between the executed tasks, the heterogeneity of the tasks and the computing resources, the variety of workflow forms. One important feature of an HS, from the estimation perspective, is that some components of the runtime are stochastic in their nature (data transfer time, overheads, etc.). These facets are not considered together in modern researches, therefore the problem needs further research.

Besides scheduling, the runtime estimation is used by the clients of an HS. Cloud computing providers use pricing models to show their clients the billing information[12]; they have to calculate precisely the execution time estimates to give the expected price of the workflow execution to a client. Given the estimated time bounds, one can use a model described in [13] to provide a user with the pre-billing information.

The aim of this work is to highlight common problems in estimating the workflow execution time and propose a solution that takes into account the complexity and the randomness of the workflow components and their runtime. The solution addresses the problems at different levels from task to workflow, including the error measurement and the algorithm analysis.

(7)

Outline

Chapter 2 gives an overview of related work explaining the use cases of the runtime estimation and its common issues. Various schedulers differ in the way they represent the execution time - that affects their performance. Confident estimates also can be used in the pricing models. Additionally, this chapter describes the problem of the performance estimation, which may seriously affect the quality of the runtime estimation.

Chapter 3 presents the proposed solution in general. It describes the stochastic ex-ecution time model and its issues; gives a notion of the problem at the workflow level; explains the ideas of the workflow reductions and the dual CDF-CF representation of the runtime. Finally, it presents an algorithm to estimate the workflow runtime based on the available information about the tasks.

Chapter 4 gives the mathematical background for the solution. The results presented in this chapter provide the basis for the project.

Chapter 5 describes the algorithm to estimate the workflow makespan proposed in chapter 3 in details; explains the workflow reductions and their impact on the accuracy and the performance of the solution.

Chapter 6 presents the implementation details and the benchmarks.

(8)

Chapter 2

Related work

The runtime estimation has two applications: user’s forecasting and monitoring of the workflow execution, and the workflow scheduling. The first one mostly plays the role of the usability improvement, hence there were not so much research done in that area; some information about pricing models can be found - it is presented in section 2.2. The second was studied very well: due to the complexity of the scheduling problem, a variety of articles were published presenting diverse heuristics. Surprisingly, most of them did not pay much attention to the runtime estimation problem. In section 2.1 I give an overview of the estimation approaches used in these articles.

2.1

Execution time in workflow scheduling

The workflow scheduling is known to be a complex problem; it requires consider-ing various aspects of application’s execution (e.g. resource model, workflow model, scheduling criteria, etc.)[7]. Therefore, many authors describing architecture of their scheduling systems do not focus on the runtime estimation. I separate the estimation from the scheduling so that the results obtained in my research (the workflow makespan estimation) can be used in complex scheduling systems which already exist or are being developed. In such setting an estimator provide values of a target function (i.e. execution time) to the scheduler in a similar way as if it has calculated them internally. Addition-ally, in order to monitor the workflow execution process or re-schedule the remaining tasks, one can use the estimation system to get remaining execution time at the moment when the workflow is being executed. Some examples of scheduling systems that can use any implementations of the runtime estimation module are CLAVIRE[5][14], a schedul-ing framework proposed by Lschedul-ing Yuan at al.[6], and CLOWDRB by Somasundaram and Govindarajan[15].

2.1.1

The need of estimating workflow execution time

To this date, a large variety of scheduling heuristics have been studied. Some of them do not require estimating the workflow runtime: for instance, if the workflow is fully parallel (consists only of independent tasks, e.g. parameter-sweep), minimizing the difference between the tasks’ execution time (i.e. working at a task level) minimizes the overall makespan. However, from my point of view, this decreases the flexibility and

(9)

\scheduling level

time representation\ task workflow

unspecified task ranking

and makespans[8][10] architecture[6][15][20][21]

random - Chebyshev inequalities[22]

Composing quantiles[23]

fixed NP parallelizing models

[24][25][26]

optimization algorithms [9][10][18][27]

ordinal Round-Robin,

greedy etc.[19][18]

-Table 2.1: Use of the runtime estimation in scheduling - classification

lowers the maximum of possible effectiveness of a scheduler. Additionally, in many cases a user needs to know the workflow execution time: to estimate the execution cost, or to check whether the execution time meets a given deadline.

One example of a scheduling algorithm, which does not depend on the workflow run-time, is presented in [16]. R. Cushing et al. in their work on auto-scaling scientific work-flows did not predict the execution time, instead they managed an optimal resource load by scaling tasks independently. This approach is efficient, but considers only task-level optimizations. The same optimization model were presented in [17], which considered only a parameter-sweep workflow. V. Korkhov [18] proposed multi-level scheduling and load-balancing heuristics; they considered several algorithms, some of which used the to-tal execution time (Simulated annealing), and some did not (greedy algorithms). By the means of simulations, they showed that most of the time the Simulated annealing algo-rithm preformed better than the simple heuristics independent of the workflow runtime (task-level algorithms), but took more time to compute (though 10 seconds vs 1 second, as the authors claimed, is rather small comparing to scientific tasks which need much longer time).

Although the optimization algorithms based on the workflow execution time are usu-ally more effective, the simple heuristics approach is still popular, and many papers based on this approach were published. For instance, a number of such heuristics were presented by Gutierrez-Garcia et al.[19]. Other examples can be found in [8][10]. Some examples of the workflow-level (which estimate the workflow runtime) scheduling heuristics can be found in [9] and in [20]; these algorithms were shown to perform better than the task level heuristics.

Finally, one more scheduling approach considers the data-centric workflows: Cushing et al. in their paper [16] presented a workflow auto-scaling technique, which is based on the assumption that tasks are data-intensive, therefore performance is measured more like the data processing rate than the execution time.

2.1.2

Representation of execution time

As I mentioned above, many ways to solve the scheduling problem exist nowadays. I decided to classify the schedulers by the way they represent the execution time. Table 2.1 shows my classification of the papers presented in the review; the section below discusses

(10)

T1 T2 T3 T3 T4 T2 T1 T2 T3 a) d) c) b) T1 T4 T5 T2 T1 T3

Figure 2.1: Primitive workflow types; (a) sequential workflow - the workflow makespan is the sum of variables; (b) parallel workflow - the makespan is the maximum of variables; (c) complex workflow that can be split into parallel/sequential parts; (d) complex irreducible workflow.

the classes in details.

Unspecified time

First of all, since the scheduling problem is a complex one, sometimes researchers tend to focus more on scheduling itself omitting the question of the execution time rep-resentation in their articles. In this case their frameworks can take advantages of the best available runtime estimator directly, without making any changes to the underlying theory. Mainly, this is more valid for papers describing the architecture of scheduling software than the algorithms [6][15][21]. Another example is [20], a paper by Falzon and Li who proposed the heuristics that require a matrix of expected completion time (task-on-machine execution time values) as an input; the runtime estimator can pass the values, for instance, in the form of the mean execution time, or the quantiles1 with a

certain confidence level2. Generally, for any algorithm, which refer to the execution time

as a single value (e.g. [8][10]), one can use the mean of a random variable, until there is no need to calculate workflow time - average time of executing all parallel branches is larger than the maximum of branches’ averages.

Stochastic time

The second type of schedulers makes assumptions that has a potential to give the most accurate runtime predictions. Knowledge of the expected value and the variance allows measuring the uncertainty of the workflow runtime, making the rough bounds on time using Chebyshev’s inequality3. For instance, Afzal et al. [22] used Vysochanski-Petunin 1The quantiles are the points taken at regular intervals from a cumulative distribution function (CDF)

of a random variable; e.g. 95%-quantile (percentile) is such t that F (t) = 0.95

2 Confidence level is a measure of the reliability of a result. A confidence level of 0.95 means that

there is a probability of at least 95% that the result is reliable, e.g. the execution time does not exceed a given deadline t.

3Chebyshev’s inequality use the variance and the mean of a random variable to bound the probability

of deviating from the expected value: P|X − E[X]| ≥ kpVar(X) ≤ 1

k2, k > 0. I also refer to the Chebyshev-like inequalities as the other inequalities that use the mean and the variance to bound the

(11)

inequality4 to estimate the runtime of a program.

If provided by the quantiles or the distribution approximation, one can make the rather good confidence intervals5 on the execution time that may be used in a

deadline-constrained setting. N. Trebon [23] presented several scheduling policies to support urgent computing. To find out if some configuration of a workflow makes the execution time to meet a given deadline, they used quantiles of the tasks’ execution time. For example, it easily can be shown that if two independent concurrent tasks T2 and T3 in Figure 2.1.c

finish in 20 and 30 minutes with probabilities 0.9 and 0.95 accordingly, then one can set an upper bound on the total execution time max(T2, T3) to max(20, 30) = 30 minutes

with probability 0.9 · 0.95 = 0.855. If the tasks are connected sequentially (Figure 2.1.a), the total execution time is 20 + 30 = 50, again with probability 0.9 · 0.95 = 0.855. This approach is quite straightforward and very useful to calculate the bounds on the quantiles, but does not allow evaluating the expected value in the same way. Additionally, it is highly inaccurate for a large number n of tasks: let say, one has quantile p for each task, then the total probability for any workflow topology is pn at least. For example, in

Figure 2.1 workflows (a) and (b) give p3, workflow (c) gives p4 (if calculate max(T 2, T3)

firstly) or p5 (if calculate tasks sequentially), workflow (d) gives p7 or more depending on the way the steps are composed.

The main drawback of using random execution time is that one needs much more information about the algorithm of the task than just a mean runtime; in order to com-pute the quantiles one has to obtain full information about distribution of a random variable. Some authors do not take into account the problem of obtaining such informa-tion assuming it is given as an input of the algorithm. However, in real applicainforma-tions the execution data may be inconsistent with the assumptions of the distribution estimation method: if a program takes several arguments, and these arguments change from run to run, it might be difficult to create a proper regression model. Additionally, even if one has created good estimations of the mean and the standard deviation of a distribution as the functions of program arguments, these two statistics do not catch the changes in the higher-order moments6 like the asymmetry. As a result, under some circumstances

(for some program arguments), distribution of the runtime may have a larger “tail” un-expected by the constructed empirical distribution function, hence the program has a larger chance to miss the deadline.

Fixed time

The third class of schedulers considers the task runtime as a constant value. On the one hand, this assumption simplifies the estimation of the workflow execution time, because in this case there is no need to take into account a chance that a task with the longer expected runtime finishes its execution faster than a task with the shorter time.

probability.

4 Vysochanski-Petunin inequality is one of the Chebyshev-like inequalities; it is constrained to

uni-modal distributions, but improves the bound: P|X − E[X]| ≥ kpVar(X)≤ 4

9k2, k >p8/3.

5 Confidence interval is a measure of the reliability of an estimate. In a context of the random

execution time, it is an interval (constructed using an observed data sample) of the time which contains the actual time with a given confidence level (probability).

6A moment is a statistic (a function of a sample) that measures the shape of a data set. The example

is an ordered moment E[Xn] or central moment E[(X − EX)n], so the expected value is the first moment, the variance is the second central moment.

(12)

Especially, this approach simplifies the calculation of the workflow execution time in case of parallel branches (Figure 2.1.b): if the execution time of branches is constant, then the expected execution time of the whole workflow is equal to the maximum of the branches’ execution time. On the other hand, this assumption may lead to significant errors in the estimations and, consequently, to inefficient scheduling. In the case of a large number of parallel tasks, even a small uncertainty in the task execution time makes the expected makespan of the workflow longer. However, in some cases a random part of the program execution time is small enough, thus this approach is often used. One example of using fixed time is the work of M. Malawski[28] et al. who introduced a cost optimization model of workflows in IaaS (Infrastructure as a Service) clouds in a deadline-constrained setting.

Other examples of schedulers, which do not exploit the stochastic nature of the run-time, are [18][10][9][27]. In these cases one cannot insert the expected value of the runtime into their algorithms directly because they calculate the workflow execution time, which may consist of parallel branches. The main problem, again, is that the average of the maximum of multiple random variables is usually larger than the maximum of their av-erages.

Several papers are difficult to be classified into this or previous section, because they estimate the average using statistical data, but do not use the variance. For instance, Ichikawa and Kishimoto in [24][25][26] fit polynomial models to the data in order to estimate the mean execution time.

Ordinal time

The last class of schedulers use task-level scheduling heuristics (which do not take into account the execution time of the workflow). An algorithm of such scheduler uses a list of tasks ordered by their execution time - this ordering is the only information used to map the tasks onto the resources. First of all, this class is based on various greedy algorithms; two of them are described in [18]. Comparing to the workflow-level algorithms, they are very fast, but shown to be less effective [18]. A number of scheduling heuristics of this class is described in a paper by Gutierrez-Garcia and Sim[19].

2.2

Execution time and pricing models

The estimation of the workflow execution time is important not only for scheduling. In many cases the HS users need to estimate the time and the cost of the execution before deciding whether to use a resource. Along with increasing popularity of the Cloud computing, this leads to the need of creating the pricing models [12].

Sharma et al. [13] proposed a novel economic model for pricing the cloud resources (IaaS). The scientists combined Moore’s law of increasing computing power of resources with financial instruments model (Black-Scholes) to derive the pricing policy. They con-sidered computing resources as a financial asset, and lease of the resources as an option (i.e. the right to buy underlying asset at a certain price). The price of the resources decreases due to deprecation (from the Moore’s Law) and aging of the resources; the service costs affect the rent too. The clients use the service (buy an option) if it is more

(13)

profitable than buying the computing resources (underlying asset).

In the case of the AaaS paradigm (Application as a Service), which provides an application(s) together with the (choice of) computing resources to run it, there is a need to estimate the price of executing the application (or a workflow) on a chosen resource instead of simply estimating the price of a fixed time of the resource. Therefore one needs to compute the runtime estimation and then compute price of the execution using the time bounds. Given the estimated time bounds, one can use a model described in [13] to provide a user with the pre-billing information.

2.3

Single node and performance issues

To the best of my knowledge, all recent researches dedicated to task scheduling prlem avoided estimating performance of a particular task on a given machine before ob-taining the runtime data. For instance, Falzon and Li [20] assumed that the execution time of each task on each resource is given as an input in matrix form. Example of an-other approach is [17]: Ana-Maria Oprescu created a scheduling system, “BaTS”, which estimated the runtime of a parameter-sweep application in a cluster containing several types of machines. The BaTS had no prior information about the performance of various machine types in the cluster; instead, it executed the small sets of tasks on a machine of each type to obtain initial rough estimations, which then were refined. This resulted in additional costs caused by the execution on the inefficient expensive computers.

Some researches do not measure the performance of the nodes explicitly. In [24][25] Ichikawa et al. defined a parametric model of the algorithm execution time dependent on the problem size and the number of processors in a cluster. Their NP-T model targeted parallel algorithms with complexity up to O(N3) (N is an abstract “problem size”) and

used nine coefficients to be obtained after the first ten test runs. Thus, the performance parameters of the computing nodes in that model were used implicitly and obtained after the benchmark runs on a target system. In [29] Rechistov et al. presented several Machine Learning approaches to model the runtime of a task (problem instance) based on its features, and gave an example of retrieving the features from a problem. They showed that the Random Forest algorithm was the most efficient in many task cases. One can include into their model several performance parameters (such as CPU speed, memory clock etc.) as the instance features in order to measure the performance of a resource for a particular task.

The problem of measuring the performance when predicting the execution time may be addressed by analyzing the structure of the source code of a program. In [30] Wilhelm et al. investigated the problem of estimating the upper bound on the worst-case execution time. They presented two methods (static and measurement-based) and an overview of timing-analysis tools able to precisely compute the upper bounds of the runtime. These tools deeply analyzed an execution code and a processor structure, therefore they were bounded to the programming languages, the compilers, or the processor architectures they are dedicated to.

Another approach, which is very useful in the case of distributed systems, is to use the cumulative performance parameters for computing resources, and then create the parametric performance models on top of the execution time models to take performance

(14)

parameters into account. Dominguez and Alba[31] presented a methodology to compare the execution time of the algorithms on different hardware. They used a set of well-known CPU benchmarks to estimate the performance of a given machine (Whetstone, Dhrystone, Livermore Loops, Linpack and CPUBenchmark), and found one benchmark, which predicts the runtime better than others (in case of the program they analyzed). Then they constructed the execution time model as a function of two arguments: the computer performance and the baseline execution time. They proposed a way to construct such models as an alternative to the basic intuitive inverse cross multiplication, which performs poorly (if the algorithm A takes n seconds on processor P1 which has a score of S1 MFLOPS in a given benchmark program, in the processor P2 with score S2, the time will be n * S1/S2 seconds).

Summary

This chapter highlighted the importance of using high quality runtime estimations. The workflow-level algorithms were shown to perform better than the task-level heuristics [18]. Several papers I reviewed describe the advantages of the stochastic approach [22][23]. These works motivated the current research to use the stochastic model as the runtime representation. The existing methods to address the issues of this approach require a number of improvements; the following chapters cover these improvements and present an efficient algorithm of estimating the workflow execution time.

(15)

Chapter 3

Runtime Estimation

Chapter 2 described the ways scheduling systems use the runtime estimations. Sev-eral examples showed that the stochastic representation gave the most accurate results [18][22]. The most significant advantage of the stochastic approach is the ability to mea-sure the error of the estimation in a simple way. Additionally, it allows to represent the effect of the components that are stochastic in their nature, e.g. the data transfer time, the execution of randomized algorithms. However, this approach has its own problems, specific to the time representation. The first problem is the creation of the task execution time models: in order to efficiently perform the regression analysis one needs a sufficient dataset; estimating the full probability distribution as a function of the task input usu-ally is a complicated task. The second problem is related to combining random variables (tasks’ execution time) into a workflow makespan because of the complex dependencies in the workflow graph.

In the most simple form the proposed solution to the problem is straightforward: 1. Create the execution time models for the tasks in a scheduling system. One can

assume that the arguments off all tasks are known before the execution; this is not always true - an input of one program may contain an output of another one and be unknown before the predecessor finishes its execution. However it is possible to create a collection of models for various sets of parameters; for example, if only three of five arguments of a certain task are known, a three-parameter model is used (which gives the worse estimations than the full model, but nevertheless is useful). 2. Collect the execution data and fit the models; the executions are usually given for a varying set of parameters (even different machines), thus the data should be normalized. The models can be adjusted using the optimization and machine learning methods.

3. Combine the execution time models of the tasks into the workflow execution time model; one may calculate the exact values or the upper bounds on the final time. Assuming the mentioned approach, noticeable features of the problem model are listed below:

• The workflow is represented as a DAG; four examples of workflows are presented in Figure 3.1

• The nodes of the workflow are the tasks and are represented by random variables (execution time) Ti > 0; the random variables are mutually independent; I assume

(16)

variables Ti to be continuous on the experimental basis (multiple equal measures of

execution time are hardly possible).

• The workflow execution time (runtime, makespan) is the time needed to execute all tasks in the workflow. An equivalent definition is the maximum along the time of all paths throughout the workflow.

• The path throughout the workflow graph is a sequence of the dependent nodes which starts at any starting node (node that does not have the predecessors) and finishes at any of the last nodes (nodes that are not followed by any other nodes). The number of paths increases with each fork in the graph; for instance, in Figure 3.1 (a) has the only one path throughout the workflow, (b) has three paths, (c) has two paths, (d) has three paths.

3.1

Composed time

T1 T2 T3 T3 T4 T2 T1 T2 T3 a) d) c) b) T1 T4 T5 T2 T1 T3

Figure 3.1: Workflow examples Besides the calculation time of the

pro-grams, the workflow execution time includes the additional costs for the data transfer, the resource allocation, and the others. Therefore, single task processing can be considered as the call of a computing service, and its execution time can be represented in a following way:

T = TR+ TQ+ TD + TE + TO (3.1)

Here, TR is the resource preparation time, TQ - queuing time, TD - data transfer time, TE

- program execution time, TO - system overhead time (analyzing the task structure,

se-lecting the resource, etc.). For detailed description of the component model (3.1) see [14]. Assuming the components to be stochastic variables, one should analyze their distribu-tions and dependencies. In the case if all components are independent, one can easily find the mean and the variance of T :

E[T ] = E[TR] + E[TQ] + E[TD] + E[TE] + E[TO]

Var(T ) = Var(TR) + Var(TQ) + Var(TD) + Var(TE) + Var(TO)

(3.2) However, in the case of dependent variables the variance equality is wrong - one should add the covariance terms to compute Var(T ). Especially this is related to dependence between program execution time TE and data transfer time TD: usually execution time

does depend on data size significantly, thus TE and TD are correlated if an input data size

is not fixed by the runtime model of a task. The variables can be treated as independent if their shape does not change with the model arguments and the mean is fixed by the model arguments (see section 3.2). Here and later for simplicity I assume all components are included into the overall task execution time T .

3.2

Task execution time

The first problem of the stochastic approach is related to the way one represents the execution time when fitting it to the observed data. The most general way to represent

(17)

the execution time T is a regression model:

T = f (X, β)d (3.3)

Here X is the task execution arguments, represented by a k(≥ 0)-dimensional vector in an arbitrary space; X is controlled by an application executor, but can be treated as a random variable from the regression analysis point of view. T ∈ R is a dependent variable (i.e. execution time); f (X, β) is a regression (random) function of a specified form, so that T |X is a random variable; β are the unknown parameters of the function f , their form depends on the representation of the regression function. This formula allows to represent any kind of the runtime dependency on the task arguments, but is difficult to work with. It can be rewritten in the following form:

T = µ(X, β) + σ(X, β)ξ(X, β)d (3.4)

Where X - the execution arguments, β - unknown regression parameters, µ = E[T |X], σ = pVar(T |X), ξ(X, β) - random variable. The regression problem consists in estimating unknown parameters β, thus estimating the mean, the variance, and the conditional distribution of the runtime. Some authors used only the mean µ (e.g. [18][24]), as a consequence they could not create the confidence intervals. Others computed the mean and the variance, but did not estimate ξ(X, β) (e.g. [22]); they used Chebyshev-like inequalities, which are based on the knowledge of the mean and the variance only. Trebon [23] used the full information about the distribution, but assumed it to be obtained by an external module. As I explained in section 2.1.2 (Stochastic time), it is not seemingly possible to estimate the conditional distribution of ξ for any input X, therefore one need to simplify the problem. As an example, one may assume that ξ and X are independent. However, in general case ξ depends on X - this means that the distribution of T not only shifts or scales with X, but also has the changing shape. Consequently, for some values of X, the time T may have a significantly larger right “tail” than for others. Given a small size of a training sample (observed executions) this results in the unexpectedly high probability of missing the deadline.

Figure 3.2 presents the described phenomena by the example of a synthetic test. I generated two small samples from the Weibull distribution to simulate the execution of a program with the varying parameters X: one “good” sample (ξ does not depend on X) generated using a constant value of the shape parameter k, and one “bad” sample (ξ does depend on X) for which k = 3 − X/55. Scale parameter λ was always adjusted so that Var(ξ) = 1. In this setting for the high values of X the “bad” distribution of “ξ” has the large right tail. This effect is not noticeable if one observes the whole execution sample (Fig. 3.2.a), because only few runs were observed with the parameter values X > 80. At the same time, if looking at a narrowed region of X, it is obvious that the “bad” distribution does not fit the theoretical quantiles completely (Fig. 3.2.b).

In order to check if the sample (i.e. execution logs of a program) satisfies the inde-pendence condition, one should use a statistical hypothesis test: use the hypothesis that two samples X and ξ are independent as the null hypothesis (the basis for a test), and check whether it is accepted or can be rejected with a given confidence level. In case of continuous data, the smoothed bootstrap tests are shown to perform best [32]. By cat-egorizing the data, such tests as Fisher’s exact [33] or asymptotic Pearson’s chi-squared (only if there is enough data) tests may be used. However this approach does not solve

(18)

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 a) Whole sample (X = 0..100)

Weibull(λ,k) theoretical quintiles

Data quintiles independent X and ξ dependent X and ξ 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 b) Only high X (X > 80)

Weibull(λ,k) theoretical quintiles

Data quintiles

independent X and ξ dependent X and ξ

Figure 3.2: Q-Q plots - Weibull distribution; sample size N = 60; k = 3 for independency case; k = 1.1..3 for dependency case

the problem if the independence hypothesis is rejected thus the question needs further research.

Above I explained why the straightforward approach of fitting the mean and the variance to estimate the distribution of the program execution time should be used with caution and might give wrong results in some cases. If one cannot statistically accept that the distribution of ξ does not depend on X, quantiles cannot be calculated directly; in this case the more robust inequalities like Chebyshev’s inequality may be used to estimate the runtime bounds. In my research I propose to combine the distribution-dependent quantiles and the Chebyshev-like inequalities using p-values from the independence tests. More formally, the confidence level of the runtime bounds (3.4) may be written in the following form:

P (T < Tu) = P (T < Tu|H0, {Xn, ξn})fn(p) + P (T < Tu|¬H0, {Xn, ξn})(1 − fn(p)) (3.5)

Here p stands for p-value of hypothesis test; H0 - hypothesis of independence (null

hy-pothesis); fn(p) - some non-decreasing function of p, which depends also on the sample

and represents the probability of the sample being taken from the independent distri-butions X and ξ. In other words, I compute the probability P (T < Tu) of meeting the

deadline Tu using the Law of total probability1.

3.3

Workflow execution time

The workflow makespan is a combination of the sums and the maximums of random variables. Despite the variables are assumed to be independent, they are connected in the workflow graph, what introduces the complex dependencies between the paths. For instance, in workflow (c) in Figure 3.1 paths 1 − 2 − 4 and 1 − 3 − 4 are dependent via inclusion of nodes T1 and T4, but one can calculate T23= max(T2, T3) hence all max and

1 The Law of total probability expresses the total probability of an outcome which can be realized

via several distinct events: P (A) =P

(19)

sum operations are applied to the independent variables. Another example is workflow (d): all paths throughout this workflow are mutually dependent, and one cannot simplify out groups of independent tasks (T3 and T4 start at different time; T1 and T2 affect the

start time of the sets of nodes).

If the only information about the execution time available is the mean and the vari-ance, then the root of the problem is that max(ET1, ET2) ≤ E[max(T1, T2)]. This means

that the expected execution time of parallel branches cannot be calculated directly: the best one can do is to calculate the mean time of each path (by taking the sum of the averages along a path), and then estimate an upper bound on the maximum along all paths (as in [22]). However, in case of large workflows this approach may be too computa-tionally expensive, because in the worst case a number of paths grows exponentially with a number of tasks. Another approach is to compute the confidence intervals separately for each task, and then combine not the variables, but the probabilities. Unfortunately, sometimes it cannot be used because it leads to applying Chebyshev’s inequality many times and increasing the size of the confidence bounds.

If one decides to use full information about the variables (including estimation of error distribution ξ), then the workflow execution time problem is reduced to the problem of evaluating the maximum of multiple random variables. Note that this question is usually solved using kernel density estimations (e.g. [34]). In this case bounds far more precise than those provided by Chebyshev’s inequality, but are computationally expensive. Addi-tionally, the issue of dependent variables makes the problem more complicated. Formally, it is known that the maximum of independent variables can be found as the multiplica-tion of their CDFs, but generally this is not true for dependent variables. Due to specific of the problem, this issue can be addressed as follows: all nodes (as random variables) are mutually independent, and the paths depend on each other only via inclusion of the same nodes; this raises only positive correlation(which means higher values of one of the variables are less common), hence one can estimate an upper bound on the maximum of paths by assuming them independent (see chapter 4). Another problem of this approach is computing the execution time of a single path - the sum of the random variables. If one decides to represent the distribution of a random variable as a CDF, the estimation of the sum requires the evaluation of the convolution of two CDFs, what is computationally expensive to perform on each step. One workaround is to use two interpretations of ran-dom variables: a CDF and a characteristic function (CF). It is known that a CF of the sum of multiple random variables is the multiplication of their CFs. Thus, the solution is to use both representations and convert one representation into another when it is needed. The conversation can be performed using the Fourier transform; the speedup is then achieved because the number of required transforms is lower than the number of the sum and max operations. The last issue is that this approach assumes the calculation of all paths throughout the workflow and then taking the maximum of them. It is possible to deal with this issue by simplifying the workflow: to reduce the number of paths one can combine the groups of nodes in a workflow graph according to the pre-defined patterns that do not affect (or improve) the precision of the method. This operation results in nesting of the workflows: a lower-level workflow is processed as a single node (task); its runtime model does not depend on other tasks in the parent workflow, thus its execution time is estimated independently. For example, nodes T2 and T3 in Figure 3.1(c) may be

reduced into a parallel sub-workflow, efficiently reducing the number of paths to one. This project is assumed to use all possible information about random variables (task

(20)

execution time) - a CDF. Taking into account the issues described in this chapter, below I present a general structure of the algorithm that aggregates tasks’ runtime into the workflow makespan:

1. Apply a workflow reduction (simplification).

2. In a remaining workflow calculate a CF for each node. 3. Calculate a CF of each path in the workflow.

4. Convert output set of the CFs into the CDFs via the Fast Fourier Transform (FFT). 5. Bound the workflow runtime CDF assuming the paths to be independent.

(21)

Chapter 4

Mathematical background

The stochastic approach to model the execution time means performing operations on random variables. The fact, that the workflow makespan is a combination of the sums and the maximums, raises the question of performing these operations on the random variables, possibly in efficient way. Besides the representation issues (in comparison to fixed-value approach), the extra complexity is introduced here by the possibility of correlation among the variables during the computations.

4.1

Approaches to estimate runtime

I consider three ways to construct the workflow runtime estimation:

1. Estimate the execution time moments (e.g. mean, variance) of all tasks and combine them into the workflow runtime moments1(then use Chebyshev’s inequality to get

the confidence intervals);

2. Estimate the quantiles of the execution time of the tasks, combine them to get the confidence bounds on the total time;

3. Estimate the runtime CDF of the tasks, combine them to get the CDF of the total time.

In order to proceed with the calculations, I use the following notation. Denote s -a tot-al number of nodes in the gr-aph. Tk, k ∈ 1..s - the independent random variables

representing execution time of each node; due to the nature of execution time, one can safely assume they are non-negative, continuous, and at least first two their moments are finite E[Xl] < ∞, l ∈ {1, 2}. Fk is a CDF of the variable Tk. Let r be the total

number of different paths throughout the workflow; sj is the number of steps in the path

j ∈ 1..r. Then kji, i ∈ 1..sj, j ∈ 1..r, kj i < k

j

i+1 is an index of node on the j-th path of a

workflow, each path contains a number of nodes which is less than or equals to sj ≤ s. The execution time of a path throughout a workflow is denoted as follows:

Tpathj = sj X i=1 Tkj i. (4.1)

1A moment is a statistic (a function of a sample) that measures the shape of a data set. The example

is an ordered moment E[Xn] or central moment E[(X − EX)n], so the expected value is the first moment, the variance is the second central moment.

(22)

T1 T5 T4 T1 T2 T3 T3 T4 T2 a) b)

Figure 4.1: Example workflows; the makespan of a workflow (a) can be decomposed into the sums and the maximums of the independent parts T1, max(T2, T3), and T4; a

work-flow (b) cannot be decomposed into such independent parts due to complex connections T1, T2− T3, T4.

The execution time (makespan, runtime) of a workflow is the time required to execute all paths of the workflow in parallel:

Twf = max j∈1..r T j path = maxj∈1..r   sj X i=1 Tkj i  . (4.2)

It is known that the maximum number of edges in a DAG is s(s − 1)/2 (i.e. the sum of the algebraic progression; the first node is connected to (s − 1) others, the second to (s − 2) others, and so on). Then it is possible to bound the number of paths throughout the workflow (the proof is given in appendix C):

Lemma 1. DAG with number of nodes s ≥ 4 contains 1 ≤ r ≤ 2s−2 paths throughout it.

Lemma 1 shows that the number of paths in some cases grows exponentially with the number of nodes, which may lead to the performance issues. These issues are ad-dressed in chapter 5; in this chapter I estimate the workflow execution runtime by taking the maximum along all paths throughout it, because it simplifies the overall procedure (execution time of the path is the sum of independent random variables).

Estimating moments

This approach seems to be the easiest one, because the moments are represented by the scalars (i.e. no vectors or functions). Obviously, the expected value (and the variance) of the sum of (independent in case of the variance) random variables equals to the sum of their expected values (variances). The problem is to bound the mean (and the variance) of the maximum time of parallel branches, since the exact solution does not exist in general (max(E[T1], E[T2]) ≤ E[max(T1, T2)]). Moreover, in the case shown

in Figure 4.1.b tasks T3 and T4 may start at the same time or at different time, therefore

such cases cannot be decomposed into the sums or the maximums of the single tasks (in this case one needs to estimate the maximum of two paths that are correlated).

(23)

moments of path’s runtime can be computed directly: E[Tpathj ] = sj X i=1 E[Tkj i] Var(Tpathj ) = sj X i=1 Var(Tkj i) (4.3)

Expectation of the maximum cannot be calculated directly, but one can get its lower bound: E[Twf] ≥ max j∈1..r E[T j path]  (4.4) Although the moments for a workflow cannot be calculated directly, it is possible to use the simple maximums as the initial approximations for the moments in order to use these values where the precision is not mandatory:

E[Twf] ∼ max j∈1..r E[T j path]  Var(Twf) ∼ max j∈1..r Var(T j path)  (4.5)

Estimating quantiles

The strongest side of this approach is that it allows to compute the bounds on the workflow execution time easily if provided by the quantiles of the execution time of each task (the proof is given in appendix C):

P (T1+ T2 ≤ t1+ t2) ≥ P (T1 ≤ t1)P (T2 ≤ t2)

P (max(T1, T2) ≤ max(t1, t2)) ≥ P (T1 ≤ t1)P (T2 ≤ t2)

(4.6) The main problem here again is the heterogeneity of the workflows; it is not clear how to select the quantiles to get the best result (shortest time, highest probability). Moreover, due to the inequalities, the overall time depends on the order of combining the quantiles. Consider Figure 4.1.a; let all quantiles be equal to √4p, where p is desired target confi-dence level. If one computes in the following order: T1+T2, T1+T3, max(T1+T2, T1+T3),

max(T1+ T2, T1+ T3) + T4- the final probability is p1.25. If one computes in another order:

max(T2, T3), T1+ max(T2, T3), T1max(T2, T3) + T4 - the final probability is p.

Unfortu-nately, it is not possible to use exactly the same algorithm in the case in Figure 4.1.b, because tasks T3 and T4 there depend on different sets of tasks and may start at different

time. All these problems force me to refuse using the quantile method and come up with approach discussed in the following section.

Estimating probability functions

If the CDFs are given for each task, then it is clear how to compute the overall probability. The sum of random variables turns into the convolution of two CDFs, the maximum of random variables turns into the multiplication of two CDFs. The only problem is to address the correlation of two variables: for example, in Figure 4.1.b random variable T1+ T3 partially depends on max(T1, T2) + T4.

(24)

Define Zsum = T1+ T2. Both random variables T1,2 are positive, have finite mean and

variance. Then the CDF is given by the following expression:

FZsum(z) =

z

Z

0

FT2(z − x1|T1 = x1)dFT1(x1).

Similarly for the maximum: Zmax = max(T1, T2), then:

FZmax(z) = FT1(z)FT2(z|T1 ≤ z) =

z

Z

0

FT2(z|T1 = x1)dFT1(x1).

Measuring the dependence (i.e. joint CDF) of two variables as well as the integration are the difficult procedures, hence there is a need to find a workaround. On the one hand, it is known that in case of independent variables X and Y , CDF of their maximum can be found simply as the multiplication of their CDFs:

Fmax(X,Y )(t) = FX(t)FY(t). (4.7)

On the other hand, one can use a CF to calculate the sum, because the CF of the sum of independent random variables equals to multiplication of their CFs:

φX+Y(ω) = φX(ω)φY(ω). (4.8)

Since all the tasks in each path throughout a workflow are considered to be independent, the easiest way to compute distribution of the runtime of one path is to use the CF. Thus, one can use (4.7) for the maximum operation and (4.8) for the sum operation. However, this raises the question of the conversion between CDF and CF - which is addressed in section 4.3.

4.2

Estimating the workflow makespan CDF

Schedulers usually have to satisfy a constraint on the maximum workflow makespan (“deadline”) when mapping tasks on CNs. Thus, a common request of the scheduler to an estimator is to get the probability that a workflow in a given configuration executes not later than the defined deadline. I call this the probability of meeting deadline t. Note, this definition is equal to the definition of the CDF - the probability that a random variable is lower than fixed value t (argument of the CDF).

Assume the CDFs of all paths in the workflow are already estimated. As it was mentioned, the paths as random variables may correlate, therefore it is difficult and computationally expensive to compute the workflow CDF directly. This thesis, instead, proposes to make a lower bound on the CDF. This allows to compute the lower bound on confidence level of meeting deadline - the most important application of the runtime estimation.

Similarly to the notation presented in the beginning of this chapter, I denote the CDFs as follows: Fk(t) is a CDF of the variable Tk; Fpathj (t) is a CDF of the variable

Tpathj ; Fwf(t) is a CDF of the workflow runtime. The CF is denoted as φ(ω) and uses the

(25)

The formula for the probability of executing within a deadline t comes directly from (4.2) and can be written as follows:

Fwf(t) = P (Twf ≤ t) = P  max j∈1..r   sj X i=1 Tkj i  ≤ t   (4.9)

The main problem is that, due to occurrence of the same random variables in different paths, the probability of the maximum cannot be decomposed into the multiplication of the probabilities directly. What one can do instead, is to compute the distribution of each path Tpathj separately and then bound the total distribution of Twf (the proof is given in

appendix C): P (Twf ≤ t) ≥ r Y j=1 P Tpathj ≤ t = r Y j=1 P   sj X i=1 Tkj i ≤ t   (4.10)

The inequality above allows to say that the probability of meeting a deadline is higher than the calculated value, i.e. to ensure that the workflow will execute not later than it is required.

4.3

Transform between CDF and CF

Section 4.2 explained how to bound the CDF of a workflow given the CDFs of all paths through it. However, it is easier to use the CF than the CDF for the paths, because it does not require the convolution (integration) when summing the random variables (tasks). Since the model of a task runtime is non-parametric, the only way to represent the CDF or the CF is to store their values on a grid (a set of points), in contrast to parametric distributions, which allow storing a couple of parameters instead of a vector of values. Denote the size of the grid as n. The operations (4.7) and (4.8) are linear in the number of grid points, O(n), whereas the convolution requires O(n^2) operations (or O(n log n) with more sophisticated algorithms).

If one stores both the CDF and the CF, then the sum and the maximum operations can be computed in linear time; the CDF and the CF have the important property that they can be converted into one another using the Fourier transform; converting the functions with the FFT requires O(n log n) operations, resulting in an overall complexity of O(n log n). The FFT is required only once per path, which significantly reduces the number of "intensive" O(n log n) operations. As a result, the overall complexity of the method is O(rn log n). In this section I derive the conversion formulae for these two representations of the random variable distribution.
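For instance (with purely illustrative numbers), on a grid of n = 2^{12} points a path of s tasks costs s pointwise CF products, i.e. O(sn) operations, plus a single O(n log n) FFT to recover the path CDF; evaluating each of the s - 1 sums by direct convolution on the CDF grid would instead cost O(sn^2) operations.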

4.3.1 Estimating CDF and CF

Assume a random variable X with CDF F_X(t) and CF φ_X(ω). Here the parameter t denotes the time domain, and ω denotes the frequency domain. According to the definitions,

\phi_X(\omega) = E\left[e^{i\omega X}\right], \qquad F_X(t) = P(X \le t).

These functions can be estimated using an i.i.d. sample X_1, \ldots, X_m:

\hat{\phi}_m(\omega) = \frac{1}{m}\sum_{j=1}^{m} e^{i\omega X_j}, \qquad \hat{F}_m(t) = \frac{1}{m}\sum_{j=1}^{m} \mathbf{1}\{X_j \le t\}.

Suppose one would like to calculate the distribution on a grid containing n points. Since the sample is finite, m < ∞, one can choose an interval (a, b) according to the sample such that a < min_{j∈1..m} X_j and b > max_{j∈1..m} X_j. Then \hat{F}_m(a) = 0 and \hat{F}_m(b) = 1. The parameters a, b, and n completely define the grids that are used to store the values of the CDF and the CF:

t_k = \frac{k}{n}(b - a) + a, \quad k = 0, 1, \ldots, n - 1; \qquad \omega_j = \frac{2\pi j}{b - a}, \quad j = 0, 1, \ldots, \frac{n}{2}. \quad (4.11)

Note that the ω-grid in (4.11) is almost half-sized (n/2 + 1 elements instead of n). The CF is a Hermitian² function, \phi_X(-\omega) = \overline{\phi_X(\omega)} (and \hat{\phi}_m(-\omega) = \overline{\hat{\phi}_m(\omega)}), hence one can restore the values of the CF for negative arguments to get a full-sized grid.
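A minimal sketch of these empirical estimates evaluated on the grids (4.11) could look as follows; the function names and the direct (non-vectorised) evaluation are illustrative only, and the grid bounds a, b are assumed to enclose the whole sample as described above.

import Data.Complex (Complex, cis)

-- Empirical CDF of a sample on the n-point t-grid t_k = (k/n)(b-a) + a.
empiricalCDF :: Double -> Double -> Int -> [Double] -> [Double]
empiricalCDF a b n sample =
  [ fromIntegral (length (filter (<= t) sample)) / m
  | k <- [0 .. n - 1]
  , let t = fromIntegral k / fromIntegral n * (b - a) + a ]
  where m = fromIntegral (length sample)

-- Empirical CF on the half-sized omega-grid omega_j = 2*pi*j/(b-a), j = 0..n/2.
empiricalCF :: Double -> Double -> Int -> [Double] -> [Complex Double]
empiricalCF a b n sample =
  [ sum [ cis (w * x) | x <- sample ] / fromIntegral (length sample)
  | j <- [0 .. n `div` 2]
  , let w = 2 * pi * fromIntegral j / (b - a) ]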

4.3.2 Fourier Transform

After calculating the estimates of the CF and the CDF on the grids as shown above, one can transform between the representations using the FFT in the following way:

\hat{\phi}^*_m(\omega_j) = \begin{cases} \dfrac{i\omega_j}{n}(a - b)\, e^{i\omega_j a} \displaystyle\sum_{k=0}^{n-1} \left(\hat{F}_m(t_k) - \frac{k}{n}\right) e^{-\frac{2\pi i k}{n} j} & \text{if } j \ne 0,\\[2mm] 1 & \text{if } j = 0. \end{cases} \quad (4.12)

\psi_j = \begin{cases} \dfrac{n}{a - b}\left(\dfrac{b + a}{2} - E[X]\right) & \text{if } j = 0,\\[2mm] \dfrac{i n\, e^{i\omega_j a}}{(a - b)\,\omega_j}\, \hat{\phi}_m(\omega_j) & \text{if } j \in \{1 .. \tfrac{n}{2} - 1\},\\[2mm] \dfrac{i n\, e^{i\omega_{n-j} a}}{(a - b)\,\omega_{n-j}}\, \overline{\hat{\phi}_m(\omega_{n-j})} & \text{if } j \in \{\tfrac{n}{2} .. n - 1\}. \end{cases} \quad (4.13)

\hat{F}^*_m(t_k) = \frac{1}{n} \sum_{j=0}^{n-1} \psi_j\, e^{\frac{2\pi i j}{n} k} + \frac{k}{n}. \quad (4.14)

The sums in (4.12) and (4.14) have exactly the form of the forward and backward Discrete Fourier Transforms (DFTs), respectively. Therefore, the FFT, which has complexity O(n log n), can be applied directly to these equations. Consider setting n to a power of two, because a grid of such size is processed by the FFT most efficiently. The following lemma shows the precision of the transforms (the proof is given in appendix C):

² In mathematical analysis, a Hermitian function is a complex function whose complex conjugate equals the original function with the variable changed in sign, f(-x) = \overline{f(x)}; for the CF this property is obvious: \phi(-\omega) = E e^{-iX\omega} = \overline{E e^{iX\omega}} = \overline{\phi(\omega)}.


Lemma 2. Equations (4.12) and (4.14) have the following error:

\hat{\phi}^*_m(\omega_j) - \phi_X(\omega_j) \approx b_{n/2} + \frac{c}{\sqrt{m}}\xi, \qquad \hat{F}^*_m(t_k) - F_X(t_k) \approx b_{n/2} + \frac{c}{\sqrt{m}}\xi,

where ξ is a standard normal random variable and b_{n/2} is a bias equal to the error of the finite sum of the corresponding Fourier series.

If one assumes that the CDF is an analytic function, then the bias b_{n/2} decays exponentially with the grid size n, hence the significant part of the error is the variance of the initial estimates (the statistical estimates of the CF and the CDF are random functions). The variance of the maximum is close to the variances of the variables, and the variance of the sum is the sum of the variances (for independent variables); using these properties one can roughly estimate the order of the error of the workflow makespan estimate:

\hat{F}^*_{WF}(t) - F_{WF}(t) = O\left(r s\, b^*_{n/2}\right) + O\left(r\sqrt{\frac{s}{m^*}}\right) \cdot \xi \quad (4.15)

Here b^*_{n/2} denotes only the asymptotics of the Fourier series error - the convergence may be polynomial or exponential, depending on the type of the CDFs; m^* may be taken as the size of the smallest sample. That is, the variance of the estimate depends on the sample size and the number of nodes. The bias also depends on the number of paths, but it can easily be controlled by varying the grid size.

One more thing that affects the bias is the size of the interval b - a. The interval influences the bias through the shape of the CDF: it should be large enough to contain all the points of the underlying sample (recall the rules for the choice of a and b in subsection 4.3.1), but if the interval is too large, the grid becomes sparse. If the grid is sparse and the interval is large, most of the CDF's values equal zero (on the left) or one (on the right), and only a few values represent the shape of the CDF, so the difference between neighbouring values increases; this enlarges the estimate of the modulus of continuity³ of the CDF, increasing the error of the Fourier series approximation. Therefore, one should keep the interval (a, b) as small as possible while still containing all the values of the sample. However, when estimating a workflow makespan, the random variables have varying ranges; one also needs to consider their sums and maximums; if a single grid covering all the variables is used, the interval is too large. A better approach is to use a smaller grid but shift it to fit each variable; subsection 4.3.3 describes this procedure. Thus, I assume a grid per variable; the most convenient way to choose the grid size is to measure it in terms of standard deviations: one should estimate the mean and the variance of the random variable (accuracy here is not important, hence one may use (4.5)), then define a and b as the mean plus/minus several standard deviations of the random variable (consider values 3-5).

Another problem of this approach is that it might be difficult to estimate the value of E[X]. If the estimate is poor, then the final CDF shifts towards higher or lower values, depending on the sign of the bias of E[X] (because ψ_0 contributes a constant to every value produced by the inverse Fourier transform). One workaround is to add a constant to all values \hat{F}_m(t_k), so that \hat{F}_m(b) = 1 after the transform.
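For illustration, a direct (non-FFT) evaluation of the forward transform (4.12) might look like the sketch below. The function name cdfToCF is hypothetical; the O(n^2) inner sum stands in for the forward DFT that a real implementation performs with an FFT, and only the first n/2 + 1 frequencies of the half-sized ω-grid are produced.

import Data.Complex (Complex(..), cis)

-- Naive O(n^2) evaluation of (4.12): values of the empirical CDF on the t-grid
-- are turned into values of the CF on the omega-grid.
cdfToCF :: Double -> Double -> [Double] -> [Complex Double]
cdfToCF a b fs = [ if j == 0 then 1 else coef j * dftSum j | j <- [0 .. n `div` 2] ]
  where
    n  = length fs
    nD = fromIntegral n :: Double
    omega j = 2 * pi * fromIntegral j / (b - a)
    -- prefactor (i * omega_j / n) * (a - b) * exp(i * omega_j * a)
    coef j = ((0 :+ omega j) / (nD :+ 0)) * ((a - b) :+ 0) * cis (omega j * a)
    -- sum_k (F(t_k) - k/n) * exp(-2 pi i k j / n): exactly a forward DFT term
    dftSum j = sum [ ((f - fromIntegral k / nD) :+ 0)
                       * cis (-2 * pi * fromIntegral k * fromIntegral j / nD)
                   | (k, f) <- zip [0 :: Int ..] fs ]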

³ The modulus of continuity is a function that quantitatively measures the uniform continuity of a function; in the context of the discrete Fourier transform it gives the maximal difference of the function's values at two neighbouring grid points.


4.3.3 Combining shifted t-grids

As mentioned in subsection 4.3.2, using a single grid to compute the estimates of all tasks gives a poor result. This is clear because the execution time of a workflow is usually far larger than the execution time of a single task. For example, if a workflow consists of ten similar tasks in a sequence, then each of the tasks has an average time ten times smaller than the overall time; but if they use the same grid for the CDF, then most of the grid's points are wasted - filled with ones or zeros. It is easy to show that, in general, the average time grows linearly with the number of tasks, whilst the standard deviation grows as the square root of the number of tasks. Therefore, in order to improve the overall accuracy of the method, I use formulae (4.5) to obtain initial rough estimates of the moments: the size of the grid is always the same, proportional to the standard deviation of the workflow makespan, while the center of the grid is shifted with the average execution time of the workflow's component (task or path).
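For instance (with purely illustrative numbers), a chain of s = 100 independent tasks, each with a mean of 10 s and a standard deviation of 2 s, has a makespan with mean 100 · 10 = 1000 s but standard deviation of only 2 · \sqrt{100} = 20 s; a single grid stretched over the whole makespan range would therefore leave each individual task's CDF with almost no usable resolution.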

Assume two random variables X_1, X_2 with shifted t-grids:

t^1_k = t_k + e_1\frac{b - a}{n}, \qquad t^2_k = t_k + e_2\frac{b - a}{n}, \qquad e_1, e_2 \in \mathbb{Z}.

Then they have the same ω-grid, because it is not affected by the translation in the t domain:

\omega^1_j = \omega^2_j = \omega_j = \frac{2\pi j}{b - a}, \qquad j = -\frac{n}{2}, \ldots, -1, 0, 1, \ldots, \frac{n}{2} - 1.

This means that the estimate of the distribution of X_1 + X_2 stays the same:

\phi^{\Sigma}_j = \phi^1_j\, \phi^2_j.

However, the estimation of max(X_1, X_2) becomes a bit more complicated. I assume that the number of points in the grid stays the same. Therefore, if the grids are not the same, the grid of the variables' maximum must be truncated. The aim of the method is to provide a lower bound on the CDF, thus one must preserve the higher grid. Assume, for simplicity, that e_2 \ge e_1. Then rewrite the grids' formulae:

t_k = t^2_k, \qquad t^1_k = t_k + (e_1 - e_2)\frac{b - a}{n}.

As a result, one gets the following formula:

F^{\max}_k = \begin{cases} F^2_k\, F^1_{k + e_2 - e_1} & \text{if } k \le n - 1 - e_2 + e_1,\\ F^2_k & \text{if } k > n - 1 - e_2 + e_1. \end{cases} \quad (4.16)

It is always possible to choose e_1, e_2 such that the center of the grid is close to the mean value of the random variable. This ensures that F(a) is close to zero and F(b) is close to one for all tasks during the execution of the algorithm.
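A minimal Haskell sketch of (4.16) could look as follows; the name maxShifted is illustrative, and both CDFs are assumed to be stored on grids of the same size n whose offsets e_1 \le e_2 are expressed in whole grid steps.

-- Maximum of two variables on shifted t-grids, following (4.16).
-- f1, f2 hold CDF values on n-point grids shifted by e1 and e2 steps (e2 >= e1);
-- the result lives on the higher grid (that of the second variable).
maxShifted :: Int -> Int -> [Double] -> [Double] -> [Double]
maxShifted e1 e2 f1 f2 =
  [ if k <= n - 1 - d
      then f2k * (f1 !! (k + d))  -- both grids overlap at this point
      else f2k                    -- beyond the first grid, F1 is treated as 1
  | (k, f2k) <- zip [0 ..] f2 ]
  where
    n = length f2
    d = e2 - e1  -- relative shift in grid steps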

The formulae of this chapter dictate the constraints on a possible algorithm for obtaining the workflow runtime estimate. Firstly, one needs to obtain the estimates of the moments for tasks and paths using (4.5). Using these estimates, one should then calculate the CFs on a constructed grid (the ω-grid stays the same during the computations). Finally, using (4.12)-(4.14) and (4.16), one can convert the CFs into CDFs and combine them on the shifted grids into the overall estimate.


Chapter 5

Algorithm description

Because the runtime estimator is supposed to work with a scheduler, the workflow graph is assumed to come from it. However, the estimator does not use all the information provided by the scheduler: for instance, edges in the scheduler's graph represent the informational dependencies between the tasks (i.e. the input of one task depends on the output of others), whereas the estimator is only interested in the execution order constraints. As a result, a graph from the scheduler is not optimal for the estimator, as it might contain redundant edges (causing extra paths throughout the workflow). Additionally, a large number of parallel sections in the graph increases the number of paths. It is possible to encapsulate such sections into sub-workflows and compute their execution time independently, hence increasing both the accuracy and the performance (by decreasing the number of highly correlated paths).

Firstly, this part of the thesis presents the so-called workflow reductions - operations that simplify the workflow structure, reducing the number of variables and Fourier transforms and improving the overall accuracy. Then, using the mathematical background from chapter 4, it presents a step-by-step explanation of the runtime estimation algorithm proposed in chapter 3.

Workflow representation

In this project I assume the following representation of the workflow (simplified Haskell syntax):

data WorkflowTask = Task {
    task      :: TaskKind,         -- task (or sub-workflow) for execution at this node
    prev      :: Indices,          -- previous tasks
    quantiles :: [Double],         -- values of the CDF on the grid
    waves     :: [Complex Double], -- values of the CF on the grid
    mean      :: Double,           -- mean runtime estimation
    var       :: Double            -- variance runtime estimation
  }

data TaskKind = AtomicTask { ... } | Workflow {
    nodes :: Map Index WorkflowTask, -- indexed tasks
    leafs :: Indices                 -- finish tasks in the workflow
  }

The basis of the representation is the Workflow type, which contains a list of indexed tasks (WorkflowTask) and a list of indices of the finish tasks (tasks that have no tasks to execute after them). Each task has a list of its predecessors (prev), the information about the contained random variable, and a task kind (TaskKind). The task kind can be either a simple task (AtomicTask) or another workflow (Workflow), which plays the role of an encapsulated sub-workflow.

This chapter uses Haskell syntax for the listings, because this language is used for the implementation. The functional programming paradigm suits the considered kinds of tasks well: a functional-style program uses functions to apply a single transformation to a graph, and the graph's self-repeating structure encourages the recurrent definitions that are so common in functional programming. The most important Haskell feature for understanding the presented listings is the symbol "." that denotes the composition of two functions:

(.) :: (b -> c) -> (a -> b) -> a -> c
(.) f g x = f (g x)

Listing 5.2: Haskell function "." - composition of functions

The function "." allows the infix form: (.) f g x = f . g $ x = f (g x). The dollar sign "$" is used as an alternative to parentheses and separates the function (left side) from the argument (right side). The meaning of other functions presented in the listings can be guessed from their names; for instance, "filter" uses a boolean predicate to filter a list of values, and "concat" concatenates a list of lists into a single list.
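As a small, purely illustrative example of this notation (not taken from the thesis code):

-- Keep only positive runtimes, then flatten a list of lists, using (.), ($),
-- filter, and concat in the same style as the later listings.
positiveOnly :: [Double] -> [Double]
positiveOnly = filter (> 0)

flattenPositives :: [[Double]] -> [Double]
flattenPositives = positiveOnly . concat

example :: [Double]
example = flattenPositives $ [[1, -2], [3]]  -- evaluates to [1.0, 3.0]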

5.1 Workflow reductions

Figure 5.1: Workflow examples

Figure 5.2 presents a workflow that contains

parts that can be simplified. One can count the number of paths throughout it: it equals 16. Obviously, some of the paths in this graph are guaranteed to have a lower value (execution time) than others. For example, path 1-4-9-11 is always shorter than 1-4-5-6-9-11, because the former sequence is a subsequence of the latter; hence it does not contribute to the overall runtime distribution and only wastes extra computing time. Therefore one should modify the graph so that it does not contain such shorter paths. Another possible improvement is to compute the execution time of nodes 7 and 8 as a single random variable, which reduces the number of paths throughout the workflow by 3. Now the question is how to formalize these simplifications, implement them algorithmically, and prove that they help. The following three types of reductions partially answer these questions (other types of reductions may be introduced in the future):

• Delete Edge (DE) - delete all edges that introduce paths fully contained in some other paths ("shorter" paths); e.g. edge 4-9 in Figure 5.2;

• Reduce Sequence (RS) - combine all fully sequential parts into single meta-tasks; if the workflow contains sequences of tasks like in Figure 5.1.a, these
