Streaming workflow transformation


Master’s Thesis in Computer Science

Tjalling van der Wal

February

Supervisor:

Andreas Wombacher


Abstract

This thesis presents (1) a formal model for streaming workflows adapted for transformation and (2) transformation rules for streaming workflows defined according to that formal model. The validity of the transformation rules is demonstrated by formally proving equivalence. The validity of the formal model is demonstrated by the fact that valid transformation rules can be defined.

Transformation of streaming workflows is the first step towards automatic optimization of streaming workflows. By providing a formal model and transformation rules, this thesis demonstrates that it is possible to build a self-optimizing Streaming Workflow System.

List of Figures

• Optimization
• Monitoring a River to Detect Groundwater Influxes
• Types of Window Sequences
• Regular Window Definition
• Fresh Tuples
• Alignment of Window Sequences
• Subalignment of Window Sequences
• Position of the Formal Model
• Zip Rule
• Copycat Rule
• Copycat Rule LHS and RHS Combined
• Partitioned Timeline
• Shared Loop Rule
• Copy Elimination Rule
• Copy Bypass Rule (Righthand Side)
• Example of a Multi-step Workflow Transformation
• Union Associativity Rule
• Selection Pushdown Rule
• Masquerade Rule
• Single Input Union Rule
• Idempotent Activities Rule
• Collapse/Split Select Rule
• Intermediate Project Rule
• Aggregate Split Rule
• Computational Costs in the Aggregate Split Rule
• Shared Join Rule
• The Ultimate Workflow

Chapter One

Introduction

This thesis presents a formal model and transformation rules for streaming workflows. In this chapter the motivation, objectives and research questions are introduced. First, this chapter introduces an example of a streaming workflow that will be used throughout this thesis.

Monitoring a River

An environmental scientist wants to devise a system to detect groundwater influxes that indicate weaknesses in dikes along a river. Part of his research involves deploying sensors to measure water temperature, flow and saline levels. The equipment used is advanced and capable of transmitting the measurements wirelessly to the research centre, sometimes as frequently as every second when required.

To detect potential natural hazards as soon as possible, a computer system at the research centre continuously processes the fresh data and compares it to configured thresholds and predicted values. To reduce false positives, the system utilizes sliding averages and historic data.

The processing done by the system has been defined and configured by means of a workflow specification. This workflow is called a streaming workflow, because it is used to process streams. In a stream of data, the individual data items are related and the order of the data has semantic significance (as opposed to business process workflows, in which the work items are similar but independent).

The system also processes a streaming workflow defined by a second scientist. By comparing each week of data with the same week in preceding years, this scientist hopes to prove his own hypothesis on the effects of climate change on the observed river. Although this workflow is executed less frequently, it is nonetheless streaming.

Both workflows independently calculate averages over the same measurements. Luckily the system is smart enough not to calculate the averages twice.

Sensor Data Problems and Challenges

The system from the example does not exist (yet). The section ‘Streaming Workflow System’ proposes such a system. This section provides the motivation for building it.

Problems. The utilization of measurements is often limited to the person or team that has collected them. That is a shame because collecting measurements is expensive. The low utilization of measurements is caused by the following problems:

• Data is not stored centrally. Measurements are stored on personal desktops, where they are not available to other researchers or teams, who may not even know the measurements exist. After a researcher leaves the project, the measurements are no longer accessible.

• Data integration is hard. Conventions differ and detail is missing. The quantity T could be either ground, soil or air temperature. This makes it hard to combine and compare measurements from different projects.

• Data governance is difficult. After a researcher leaves or after a project is finished, nobody knows how measurements were collected and how they were processed. How frequently did the sensor ‘sense’? Is the data raw or has it been processed?

• The processing is manual and ad hoc. Everyone uses their own favorite tool, like Matlab, Excel, SPSS or custom scripts. This exacerbates the other problems.

Each of these problems makes it harder to collect and process sensor data in such a way that measurements can be easily shared and reused.

Challenges. Apart from the problems above, there are recent developments that could provide new opportunities but could also aggravate the existing problems in the future.

• Although sensor deployments are expensive, individual sensors have become cheaper. This means more sensors can be deployed at the same cost and more data can be collected. But this also means that the amount of data to manage becomes larger.

• Another challenge is included in the example in the previous section: streaming data. Instead of arriving periodically in batches, the data arrives as a continuous stream. This means real-time monitoring is possible, provided a more robust way of managing data is developed.

Streaming Workflow System

The solution to these problems and challenges is to build a system such as the one already described in the section ‘Monitoring a River’. The system will process incoming sensor data according to processing instructions specified by the user by means of streaming workflows: it is a Streaming Workflow System.

In ‘Composable data processing in environmental science — a process view.’ [ ] the same problems have been described and a similar system is suggested.

Scope of the System. Although the example in the section ‘Monitoring a River’ and some of the problems in the section ‘Sensor Data Problems and Challenges’ are specific to environmental sciences, the ambition is to build a generic system that can be used in many more situations. Examples include administrative systems, monitoring systems and control systems, in accounting and security domains. Considering that a cash register can be seen as a sensor that observes sales, it is logical not to limit the Streaming Workflow System to environmental science applications.

Functions and Requirements for the System. A Streaming Workflow System must perform the following functions:

1. The system enforces formalization of data processing by means of streaming workflow specifications. Formalization is needed for the system to be able to perform optimization. Additionally, formalization provides governance.

2. The system shall manage data storage and distribution. If the data is centrally available, it can be shared and reused more easily.

3. The system shall optimize the workflow specifications submitted by users. Optimization shall be done both within and between workflows. Criteria for optimization include computational and storage cost. Optimization is needed to let the system scale in terms of users, data volume and data rate.

These three requirements are sufficient to tackle the problems and challenges from the section ‘Sensor Data Problems and Challenges’ and to make the example from the section ‘Monitoring a River’ reality. This thesis contributes to the third requirement: optimization of streaming workflows.

[Figure: Optimization. Diagram labels: Original, Alternatives, Best Alternative, with ‘search’ and ‘choose’ steps and a part marked ‘this thesis’.]

Optimization

Optimization is a requirement for the Streaming Workflow System. But what is optimization? And what is needed to build an optimizer?

Optimization. Optimization can be described as searching for an alternative that is better than the original. The figure ‘Optimization’ shows the basic process. Starting from an original, an optimizer searches for alternatives. The optimizer continues by searching for more alternatives based on alternatives that have been found earlier. In the end, from all the alternatives found, the best one is chosen.

As an example, consider the algebraic formula x = (a + b) ∗ (a + c). By applying algebraic rules the following alternatives can be found: x = (a + c) ∗ (a + b) and x = a ∗ (b + c). The second alternative is better because it requires less computation to find x.

Searching for Alternatives. The search for alternatives has two prerequisites: (1) a formal way to describe both the original and the alternatives and (2) a formal definition to tell whether an alternative is valid. In the algebraic example the formula itself is a formal description and any formula that evaluates to the same value is a valid alternative.

In the workflow example from the section ‘Monitoring a River’, workflows are described by a specification. Searching for alternatives is done by applying transformation rules to a streaming workflow specification to create alternatives, just like algebraic rules are used to create alternative formulae. The contribution of this thesis is defining transformation rules for streaming workflows.

Choosing an Alternative. Which alternative is best depends on the context. In the case of the algebraic example, the original formula is the best when trying to solve for x = 0, but the alternative x = a ∗ (b + c) is the best when computing the value. The selection of the ‘best’ alternative streaming workflow specification is outside the scope of this thesis.
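The search-and-choose process described in this section can be sketched in code. The following is a minimal illustration only, not the optimizer proposed in this thesis: the representation of a workflow as a tuple of step names, the rule `collapse_duplicates`, the `limit` parameter and the use of pipeline length as a cost are all hypothetical and chosen purely to make the sketch runnable.

```python
from collections import deque


def search_alternatives(original, rules, limit=1000):
    """Breadth-first search: apply every rule to every alternative found so far."""
    seen = {original}
    queue = deque([original])
    while queue and len(seen) < limit:
        current = queue.popleft()
        for rule in rules:
            for alternative in rule(current):
                if alternative not in seen:
                    seen.add(alternative)
                    queue.append(alternative)
    return seen


def collapse_duplicates(pipeline):
    """Hypothetical rule: two identical adjacent steps collapse into one."""
    for i in range(len(pipeline) - 1):
        if pipeline[i] == pipeline[i + 1]:
            yield pipeline[:i] + pipeline[i + 1:]


# Hypothetical original: a pipeline with a redundant, repeated filter step.
original = ("filter", "filter", "filter", "sum")
alternatives = search_alternatives(original, [collapse_duplicates])

# Placeholder cost function: fewer steps is assumed to be cheaper.
best = min(alternatives, key=len)
print(best)  # ('filter', 'sum')
```

The search step only needs rules that map one description to valid alternatives; the choice of `best` is deliberately a separate step with its own cost function.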

Objectives

This thesis tries to contribute to the idea of a Streaming Workflow System by showing the feasibility of building such a system, in particular the feasibility of creating an optimizer for streaming workflows. The objectives of this thesis are:

1. Validate the idea of a Streaming Workflow System by showing it is theoretically possible to perform automatic optimization of user-defined streaming workflow specifications. It is assumed that optimization is possible if transformation is possible. Therefore this thesis is limited to streaming workflow transformation.

2. Define a formal model for streaming workflows with transformation in mind. An existing formal model for streaming workflows is used as a starting point.

3. Define transformation rules for both generic and specific activities, and for a reasonable set of common activities (including relational).

Research Questions

To fulfill the objectives, this thesis will answer the following research questions:

1. How must the (existing) formal model be adapted to allow for transformation?

2. What is a suitable definition of equivalence between workflows?

3. Can transformation rules be defined for common types of activities?

a) Can rules be defined for generic situations?

b) Can rules be defined for common (relational) activities?

c) Can rules be defined for aggregates and join activities?

By answering these research questions, the second and third objective from the section ‘Objectives’ are fulfilled. The first objective is indirectly fulfilled by fulfillment of the second and third objective.

Thesis Structure

The structure of this thesis follows the research questions. In Chapter Two the formal model is described and in Chapter Three equivalence of workflows is defined. These two chapters are prerequisites for the next chapters.

In Chapters , and transformation rules valid within the formal model are defined. These chapters can be read selectively, but in particular the Copycat Rule (Section . ) should be read, because for this rule a full proof is provided that uses and illustrates most concepts from the formal model.

Section . presents an example that shows how two independent but similar streaming workflows can share computation through the application of multiple transformation rules. Section . shows how transformation rules can be used to optimize the ‘Monitoring a River’ example as the workflow evolves during a research project.

In Chapter related work with regard to streaming workflows and continuous querying is discussed. In Chapter the results of this research are evaluated against the objectives and in Section . a suggested approach for actually building an optimizer for a Streaming Workflow System is presented.

Chapter Two

Formal Model

Before a streaming workflow can be transformed, a formal description of the workflow is required. The description must be formal because (1) a computer system must be able to manipulate the description and (2) it must be possible to prove the equivalence between workflows. This chapter introduces the formal model used to describe streaming workflows.

In the first section the ‘Monitoring a River’ case from the introduction is used as an example to introduce the core concepts of the formal model. In the section ‘Behavior of Activities’ the behavior of a streaming workflow is formalized. In the section ‘Assumptions on the System Runtime’ assumptions on the system’s runtime are listed. These are needed to make the model work and to complete the proofs of transformation rules. The section ‘Design Choices’ discusses design choices that have influenced the formal model.

Origins. The formal model presented in this chapter is based on the formal model described by Wombacher in ‘Data Workflow — A Unified Time Controlled E-Science Processing Model’ [ ]. Compared to the original model, this new model trades expressive power for the predictability needed to do transformations.

It is important to note that although the formal model is less expressive, this does not mean that a Streaming Workflow System would be less powerful. More powerful constructs can be offered by the system, with the limitation that not all of those can be transformed and optimized using the formal model presented here.

Monitoring a River with a Streaming Workflow

The section ‘Monitoring a River’ introduced the example of a scientist who tries to protect dikes by detecting groundwater influxes. The data processing needed to detect influxes has been specified by the scientist by means of a streaming workflow. This section explains how a streaming workflow works using this example.

[Figure: Monitoring a River to Detect Groundwater Influxes. Activities shown: Temperature Sensor 1 to n, Flow Sensor 1 to n, a filter activity, Avg, Sum, Influx Detector, Data Entry and Subtract. Views shown include V_T, V_m3/minute, V_m3/hour, V_influxes, V_known influx locations and V_alarms. The workflow is divided into a ‘sensors’ part and a ‘filtering and aggregation’ part.]

| Activity | Output (content of view) | Purpose |
|---|---|---|
| Temperature Sensor | T in °C at location x at time t | Observe river |
| Flow Sensor | Flow in m³/s at location x in time interval t | Observe river |
| (filter activity) | T in °C at location x at time t | Eliminate values outside the percentile range to address sensor errors |
| Avg(T) | Average T in °C at time t over last hour | Calculate average temperature in river |
| Avg(Flow) | Average flow in m³/s in time interval t | Use average as ‘weighted majority vote’ to reduce impact of sensor error |
| Sum(Flow) | Total flow in m³/hour at time t | Calculate total flow in last hour |
| Influx Detector | Detected influxes at location x during time interval t | Does what this workflow was created for; all the other activities exist to feed data to this activity. |
| Data Entry | List of known influx locations | Manually record locations with known issues or special circumstances. |
| Subtract | List of alarms without false positives | Eliminate known influx locations from list of influxes. |

Table: Description of Workflow Components in the figure ‘Monitoring a River to Detect Groundwater Influxes’

To detect influxes, the scientist has written an algorithm that looks for anomalies in the combination of water temperature and water volume. Before sensor data can be fed to this algorithm, the data has to be filtered to address sensor errors and has to be aggregated. After the influx detection, known locations have to be filtered out to produce a list of locations on the dike that need inspection. So the workflow has the following steps: (1) record the data, (2) filter the data, (3) aggregate the data, (4) execute the detection algorithm, (5) record known influx locations and (6) eliminate false alarms.

The figure ‘Monitoring a River to Detect Groundwater Influxes’ shows the streaming workflow created by the scientist. The table ‘Description of Workflow Components’ details the purpose of each individual component in the workflow. Together these components perform the six required steps.

The workflow starts with the sensors deployed in the river; periodically they send data to the research centre. In the formal model everything that produces or processes data is an activity. The activities representing the sensors are shown on the left of the workflow as blue boxes. Associated with each activity is a view that stores all data produced by that activity. In the workflow, views are depicted as small green circles.

The recording of sensor data is represented by the sensor activities and their corresponding views. The second step is to filter the data to reduce the impact of sensor error. For temperature data, a filter activity periodically reads the data from all the sensor views and produces a new stream from which outliers have been removed. For the flow data, the average over all sensors is calculated by an Avg activity. The output from both activities is again stored in views.

The third step, aggregation, is similar to the filtering: an activity reads the data from a view, applies the required computation and produces a stream which is stored in a new view. This results in two views: one with a sliding average temperature and one with a sliding total flow volume.

The custom influx detection algorithm is the fourth step and is also an activity. It reads the aggregated data and produces a stream of locations with an anomaly in the combination of water temperature and water volume.

As a sixth and final step the list of locations is filtered by means of a manually maintained list that is produced by step five.

Two phrases in the description above have been deliberately left vague: ‘periodically’ and ‘reads the data’. This raises two questions: when does an activity execute, and which data is used for an execution? The next section answers those questions.

Behavior of Activities

This section describes the behavior of activities. Activities are the active components of a streaming workflow, as opposed to views, which just store data and thus have no behavior. The behavior of activities is defined by two components: (1) when to execute and (2) what input data to use. Both are specified using window sequences.

Definition: Window Sequence. A window sequence is a regular and predictable sequence of windows on a view. Each window represents a single activation of an activity and the data selected from the view to use for that activation. A window sequence is defined either in terms of time or in terms of arriving tuples.

When an activity must be activated (or executed) is determined by an arithmetic sequence of timestamps or tuple-ids. Elements of such a sequence are called reference points.

Definition: Reference Points. The reference points of an activity are an arithmetic sequence generated by $[\,\mathit{epoch} + n \cdot \mathit{delta} \mid n \in \mathbb{N}_0\,]$. For time-based activities epoch and delta must both be valid time units. For tuple-based activities both must be tuple counts.
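The reference point sequence can be illustrated with a short generator. This is a minimal sketch under stated assumptions: the function name `reference_points` is hypothetical, epoch and delta are represented as plain integers (for example seconds or tuple counts), and delta is assumed to be positive, which the definition does not state explicitly.

```python
from itertools import count, islice


def reference_points(epoch, delta):
    """Yield epoch + n * delta for n = 0, 1, 2, ... (an infinite sequence)."""
    if delta <= 0:
        # Assumption of this sketch, not a requirement stated in the definition.
        raise ValueError("delta must be positive")
    for n in count():
        yield epoch + n * delta


# The sequence is infinite, so only a prefix is materialized here.
first_four = list(islice(reference_points(epoch=100, delta=15), 4))
print(first_four)  # [100, 115, 130, 145]
```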

An activity can have a different window sequence for each input relation, but they all share the same reference points.

For each reference point the workflow system activates the activity. The system feeds the activity with data selected by the current window and the activity appends zero or more output tuples to the output view.

[Figure: Types of Window Sequences (fixed, landmark, partitioning, sliding, hopping). The horizontal axis represents the data in the view and the bars are window instances, indicating the selected data. The vertical axis displays the n used to generate the reference point sequence $[\,\mathit{epoch} + n \cdot \mathit{delta} \mid n \in \mathbb{N}_0\,]$.]

What data must be selected from input views is defined relative to the reference point of the current activation. Three basic types of window sequences are supported.

Definition: Regular Window Sequence. For regular windows, the lower and upper bound of the data interval selected from a view are defined relative to the reference point. The interval definition is $\langle\, \text{reference point} - (\mathit{ws} + \mathit{offset}) \;\ldots\; \text{reference point} - \mathit{offset}\,]$. Here ws is the window size and offset specifies the distance between the window and the reference point. The figure ‘Regular Window Definition’ visualizes this.

There are exactly three subtypes of regular window sequences, based on the relation between the window size (ws) and the distance that the reference point shifts between each window (delta). These three subtypes are partitioning (ws = delta), sliding (ws > delta) and hopping (ws < delta). The figure ‘Types of Window Sequences’ shows them.

A partitioning window sequence has two specific properties: all windows are pairwise disjoint, $\forall i \neq j : w_i \cap w_j = \emptyset$ (mutually exclusive), and all windows together cover the entire view, $\bigcup_i w_i = V$ (collectively exhaustive). A hopping window sequence ignores part of the data in the input view.
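The regular window definition and its three subtypes can be made concrete with a small sketch. The class `RegularWindowSequence`, the helper `select` and the representation of a view as a list of integer timestamps are assumptions made for illustration; only the interval formula and the subtype conditions come from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularWindowSequence:
    epoch: int       # first reference point
    delta: int       # shift of the reference point between activations
    ws: int          # window size
    offset: int = 0  # distance between the window and the reference point

    def window(self, n):
        """Return (lower, upper) for activation n; lower is exclusive, upper inclusive."""
        reference_point = self.epoch + n * self.delta
        lower = reference_point - (self.ws + self.offset)
        upper = reference_point - self.offset
        return lower, upper

    def subtype(self):
        if self.ws == self.delta:
            return "partitioning"
        return "sliding" if self.ws > self.delta else "hopping"


def select(view, window):
    """Select the timestamps of a view that fall inside <lower ... upper]."""
    lower, upper = window
    return [t for t in view if lower < t <= upper]


view = list(range(1, 21))  # hypothetical view: tuples with timestamps 1..20

partitioning = RegularWindowSequence(epoch=5, delta=5, ws=5)
parts = [select(view, partitioning.window(n)) for n in range(4)]
print(parts[0], parts[1])  # [1, 2, 3, 4, 5] [6, 7, 8, 9, 10]

hopping = RegularWindowSequence(epoch=5, delta=5, ws=2)
print(hopping.subtype(), select(view, hopping.window(0)))  # hopping [4, 5]
```

In this example the four partitioning windows are disjoint and together select every tuple in the view, while the hopping sequence skips the tuples with timestamps 1 to 3.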

[Figure: Regular Window Definition. Labels: ws, offset, lower bound, upper bound, reference point.]

Definition: Landmark Window Sequence. For landmark windows the lower bound is fixed and the upper bound is defined in the same way as for regular windows. The interval definition is $\langle\, \mathit{landmark} \;\ldots\; \text{reference point} - \mathit{offset}\,]$.

Definition: Fixed Windows. For fixed windows the interval is defined with fixed lower and upper bounds. Fixed windows are used for one-time queries and to incorporate historic data.

Definition: Window (∼Instance). A window is a single element of a window sequence, obtained by substituting the reference point of the current activation into the window sequence definition. It serves as the data selection interval for the current activation. The lower bound of a window is exclusive and the upper bound is inclusive.

Definition: Standard Tuple-based Window Sequence. For tuple-based activities a special window sequence is defined. The standard tuple-based window sequence $w_e$ for event-based activities is defined as delta = 1, ws = 1, offset = 0.

When activated, an activity is not just called with a bag of input tuples. The workflow system also reports the interval used and all the reference point sequence and window sequence information. Union and Join activities need this information to distinguish between fresh tuples and tuples they ‘have seen before’.

Knowing the specification of the window of the current activation, an activity can calculate the bounds of the previous window. The latter can be used to determine which tuples ‘have been seen before’ and which tuples are fresh.

Definition: Fresh Tuples. A fresh tuple is a tuple that was not included in an earlier window instance. The figure ‘Fresh Tuples in a Sliding Window Sequence’ shows windows from a sliding window sequence with fresh tuples marked orange. In partitioning and hopping windows, all tuples are fresh.

[Figure: Fresh Tuples in a Sliding Window Sequence. Fresh tuples shown in orange.]
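How an activity could derive the fresh part of its current window from the bounds of the previous window is sketched below. The function names `window` and `fresh_interval` are hypothetical, timestamps are plain integers, and delta is assumed to be positive so that upper bounds increase with every activation.

```python
def window(epoch, delta, ws, offset, n):
    """Bounds (lower exclusive, upper inclusive) of a regular window for activation n."""
    reference_point = epoch + n * delta
    return reference_point - (ws + offset), reference_point - offset


def fresh_interval(epoch, delta, ws, offset, n):
    """Part of window n that was not covered by any earlier window."""
    lower, upper = window(epoch, delta, ws, offset, n)
    if n == 0:
        return lower, upper  # no earlier window exists, so everything is fresh
    _, previous_upper = window(epoch, delta, ws, offset, n - 1)
    # Everything up to the previous upper bound has been seen before.
    return max(lower, previous_upper), upper


sliding = dict(epoch=10, delta=5, ws=10, offset=0)
print(window(n=1, **sliding))          # (5, 15)
print(fresh_interval(n=1, **sliding))  # (10, 15)
```

For partitioning and hopping sequences the lower bound is never below the previous upper bound, so the fresh interval equals the whole window, in line with the definition.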

[Figure: Alignment of Window Sequences. Shows two window sequences, w1 and w2.]

[Figure: Subalignment of Window Sequences. Shows two window sequences, w1 and w2.]

An important property of reference points and window sequences is alignment. This property is used in proofs in Chapter .

Definition: Window Sequence Alignment. Two window sequences $w_1$ and $w_2$ align iff $\mathit{delta}(w_1) = \mathit{delta}(w_2)$ and $\exists n : \mathit{epoch}(w_2) = \mathit{epoch}(w_1) + n \cdot \mathit{delta}(w_1)$.

Because n is allowed to be negative, alignment is an equivalence relation: it is reflexive, symmetric and transitive. A reference to alignment of activities is shorthand for alignment of their respective window sequences.

Definition: Window Sequence Subalignment. Two window sequences $w_1$ and $w_2$ subalign iff $\exists k \in \mathbb{N}_1 : k \cdot \mathit{delta}(w_1) = \mathit{delta}(w_2)$ and $\exists n \in \mathbb{N} : \mathit{epoch}(w_2) = \mathit{epoch}(w_1) + n \cdot \mathit{delta}(w_1)$.

Window sequence subalignment captures the case where one window sequence is a subsequence of another window sequence. For example, an average over a period of a week that is calculated each week is a subsequence of an average over a seven-day period that is calculated each day.
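Both definitions reduce to simple divisibility checks, as the following sketch shows. The function names and the representation of a window sequence as an `(epoch, delta)` pair of integers are assumptions; deltas are assumed to be positive, and n in the subalignment definition is assumed to range over the non-negative integers.

```python
def align(w1, w2):
    """w1 and w2 align iff the deltas are equal and the epochs differ by a multiple of delta."""
    (epoch1, delta1), (epoch2, delta2) = w1, w2
    return delta1 == delta2 and (epoch2 - epoch1) % delta1 == 0


def subalign(w1, w2):
    """w1 and w2 subalign iff delta(w2) is a positive multiple of delta(w1) and
    epoch(w2) is reachable from epoch(w1) in n >= 0 steps of delta(w1)."""
    (epoch1, delta1), (epoch2, delta2) = w1, w2
    return (
        delta2 % delta1 == 0
        and epoch2 >= epoch1
        and (epoch2 - epoch1) % delta1 == 0
    )


daily = (0, 1)    # hypothetical: epoch at day 0, one activation per day
weekly = (7, 7)   # hypothetical: epoch at day 7, one activation per week

print(align(daily, weekly))     # False: the deltas differ
print(subalign(daily, weekly))  # True: every weekly reference point is also a daily one
print(align((0, 5), (-10, 5)))  # True: n may be negative for alignment
```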

Assumptions on the System Runtime

To make the formal model function, several assumptions must be made about the Streaming Workflow System runtime.

Assumption I: The System is Infinitely Fast. The model has a formal and mathematical nature. Execution time is not something that concerns the model. Therefore the model assumes that execution of activities is so fast that it takes zero time. This assumption has a useful effect on the notion of equivalence: since execution is infinitely fast, intermediate states of the system are not observable and thus do not affect equivalence.

Assumption II: The System Ensures Correct Execution Order. When more than one activity is enabled for the same reference point, the runtime of the Streaming Workflow System ensures that an execution order is chosen that respects the input-output dependencies between the activities. When an activity is executed it is guaranteed that all input views are complete for all reference points less than or equal to the current reference point.

The system may not be able to fulfill this assumption when the workflow specification contains cycles. This thesis ignores cyclic subgraphs. Cyclic subgraphs must be treated as a single non-transformable activity.
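An execution order that respects input-output dependencies is a topological order of the workflow graph, and such an order does not exist when the graph contains a cycle. The sketch below uses Python's standard `graphlib` module to illustrate this; it is one possible way a runtime could pick an order, not a description of an existing system, and the activity names in `dependencies` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Maps each activity to the activities whose output views it reads.
dependencies = {
    "avg_flow": {"flow_sensor_1", "flow_sensor_2"},
    "sum_flow": {"avg_flow"},
    "avg_temperature": {"temperature_sensor_1"},
    "influx_detector": {"sum_flow", "avg_temperature"},
}

# static_order() yields every activity after all of its inputs.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# A cyclic specification has no valid execution order.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cyclic = False
except CycleError:
    cyclic = True
print(cyclic)  # True
```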

Assumption III: The Initial System is Empty. The workflow model does not allow orphaned views. As a result any data in the system must have been added to the system through an activity. This includes historic data and (configuration) data that is added manually. This assumption is needed to prove the correctness of transformations.

Design Choices

This section details a number of choices made while designing the formal model. Unlike the assumptions from the previous section, these choices are not needed to make the formal model function. They have, however, influenced the formal model.

Design Choice I: Streaming and Querying are the Same. Wombacher [ ] makes a distinction between a streaming mode of operation and a special query mode. This model chooses to treat them the same. To illustrate this in a more technical manner: suppose the system is like a functional programming language and the sensor data is stored in lists. To process the data, you need mapping and folding functions. But these functions can be applied to both finite and infinite lists and also to lists that are still being generated. For the functions there is no difference between stream processing fresh data and batch processing historic data. Likewise the streaming workflow model does not distinguish streaming and querying modes.
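The analogy with lazy lists can be illustrated in Python, where generators play the role of lists that are still being produced. This is only a sketch of the idea: `sliding_average` is a hypothetical processing function, and `itertools.count` stands in for a live sensor stream.

```python
from collections import deque
from itertools import count, islice


def sliding_average(stream, size):
    """Yield the average of the last `size` values, once that many have arrived."""
    recent = deque(maxlen=size)
    for value in stream:
        recent.append(value)
        if len(recent) == size:
            yield sum(recent) / size


# Batch processing of historic data: a finite list.
historic = [1, 2, 3, 4, 5]
print(list(sliding_average(historic, 3)))  # [2.0, 3.0, 4.0]

# Stream processing: an unbounded source, consumed lazily.
live = count(1)
print(list(islice(sliding_average(live, 3), 3)))  # [2.0, 3.0, 4.0]
```

The same function handles both cases unchanged; only the caller decides how much of the output to consume.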

The reason for this choice is that it allows the formal model to be agnostic of the current time; epoch and other time variables are not limited to Now and the future but can also be in the past.

It is the author’s personal belief that this makes the model cleaner and less complex without needlessly complicating the implementation of the system runtime.

Design Choice II: Triggers are Ugly. In [ ], triggers are used to determine when activities should be executed. Triggers are not declarative and their semantics are hard to capture, especially the fact that ‘1 week’ does not mean the same as ‘7 days’. Also, a trigger alone cannot be manipulated by transformations because it does not describe what data an activity needs for an activation. Therefore in this model triggers have been replaced by window sequences.

Design Choice III: Workflows are Immutable. Once submitted to the system by the user, a workflow is immutable. When a workflow needs to be updated, the user must submit a new workflow to the system. Whenever it is said that a workflow is updated, it means that a new workflow derived from the original is added to the system.

The rationale for immutable workflows is data governance: because workflows cannot change, it is always possible to tell how a certain tuple was generated.

The implication of immutable workflows is that many workflows exist in the system that are only slightly different from each other, often in trivial ways. Therefore there is much more opportunity for shared computation than is apparent.

Design Choice IV: All Activities are Stream Deterministic. Wombacher [ ] distinguishes deterministic and non-deterministic activities. In this thesis, this distinction is ignored and all activities are considered to be stream deterministic: given the same data stream, a freshly initialized instance of an activity always produces the same output stream. Whether an activity carries internal state across subsequent activations is irrelevant for this type of determinism.

The kind of determinism described in [ ] is much stronger as it is defined in terms of individual activations, theoretically allowing out-of-order execution. This sort of optimization is outside the scope of this thesis.

Stream determinism does require that the output of activities depends solely on the current and previous inputs. The results should not depend on, in particular, the actual execution time. This gives the system the flexibility to delay execution, but also means that transformations can cause earlier execution of activities without breaking equivalence.
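A stateful activity can still be stream deterministic, as the following sketch illustrates. The `RunningTotal` class, its `activate` method and the `run` helper are hypothetical and only serve to show the property: two freshly initialized instances that receive the same stream of activations produce the same output stream, even though each activation depends on internal state.

```python
class RunningTotal:
    """Hypothetical activity that carries state across activations."""

    def __init__(self):
        self.total = 0

    def activate(self, tuples):
        # The output depends only on the current and previous inputs,
        # never on the wall-clock time at which the activation runs.
        self.total += sum(tuples)
        return [self.total]


def run(activity, stream):
    """Feed each activation's input tuples to the activity and collect its output stream."""
    output = []
    for tuples in stream:
        output.extend(activity.activate(tuples))
    return output


stream = [[1, 2], [], [3]]
first = run(RunningTotal(), stream)
second = run(RunningTotal(), stream)
print(first, first == second)  # [3, 3, 6] True
```

An activity that mixed something like the current system time into its output would break this property, because two runs over the same stream could then differ.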

Sensors and data entry activities in a workflow are not stream deterministic. This is not a problem because these activities are the leaves of the workflow graph and for that reason cannot be eliminated anyway.

Position in the System

This section describes the position of the formal model within a Streaming Workflow System. Describing what the model is and, in particular, what it is not, has helped me considerably to think about transformation rules in isolation.

The Model is Not a User Interface. The formal model is not intended as a user interface. A real system will provide a higher level of abstraction. Because the model is not a user interface, seemingly arbitrary and counter-intuitive limitations and assumptions can be made about workflows.

The Model is Not an Execution Plan. The Streaming Workflow System does not actually execute a streaming workflow according to the specification. Different algorithms and data structures may be used depending on activity configuration, expected or observed data rates and workflow structure. Also, common subgraphs may be implemented by a single algorithm. Because the model is not an execution plan, it is no problem when a transformation rule produces a workflow that seems less efficient.

[Figure: Position of the Formal Model in a Streaming Workflow System. Streaming Workflow System: User Interface, Formal Model, Implementation. Relational Database: SQL, Relational Algebra, Physical Operators.]

So What is the Model? The formal model is an abstraction layer that enables the system to transform streaming workflow specifications into any shape. The desirability of a shape is no concern of the model.

The figure ‘Position of the Formal Model in a Streaming Workflow System’ compares the Streaming Workflow System with a relational database management system (RDBMS): the model sits between the user interface and the actual implementation of activities, similar to the function of relational algebra.

Chapter Three

Equivalence

In the section ‘Optimization’, optimization was summarized as ‘searching for an alternative that is better than the original’. One of the questions that arises from that description is: what defines a valid alternative? When is a streaming workflow a valid alternative for another streaming workflow?

A workflow is a valid alternative iff it can be interchanged with the original without changing the output of the system. The formal concept for ‘valid alternative’ or ‘interchangeable’ is equivalence. Two workflows are valid alternatives for each other when they are equivalent.

Equivalence is denoted by the $\equiv$ operator. In this thesis the notations $=_t$ and $\equiv_t$ are used to state that an equality or equivalence is valid at a certain point in time.

. Monitoring a River the Same Way, but Different

Previous chapters have used the example of a scientist who wants to observe dikes along a river to detect groundwater influxes and his colleague who studies the effects of climate change on the river. What does equivalence mean to them?

To the user of a Streaming Workflow System equivalence means that he gets the results he expected; that the data apparently has been processed in the way he has specified the data to be processed. The system must deliver predictable, reproducible and traceable results to guarantee the scientific validity of his research.

Because the user has such specific expectations and noble intentions with the results, the system cannot compromise the quality of the results in any way. The only thing the system is allowed to do is replace a workflow by another workflow that is equivalent, but possibly more efficient.

Equivalence simply means that two different workflows deliver the same results. This chapter formalizes this intuitive definition to make it possible to prove that transformation rules preserve equivalence in Chapter .

. Equivalence of Activities

This chapter describes equivalence bottom-up, so this section starts with describing equivalence of activities. Intuitively, activities are equivalent when they produce identical output from the same input.

Definition : Equivalent Activities Two activities are defined to be equivalent iff 1. they run the same code, 2. this code is stream deterministic and 3. they have the same configuration. According to Assumption IV, the second condition is always true.

Note that sensors and other activities at the leaves of the workflow are never equivalent. One could either argue that they do not run code in the sense intended in the definition of equivalent activities or that the location of their deployment is part of their code. Either way, the first condition is false.
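The definition of equivalent activities can be illustrated with a minimal Python sketch. The `Activity` class, its field names and the convention of encoding the deployment location of a sensor in its `code` identifier are assumptions made for illustration only; they are not part of the formal model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Activity:
    code: str                                    # hypothetical identifier of the code that is run
    config: Dict[str, object] = field(default_factory=dict)
    stream_deterministic: bool = True            # always true under Assumption IV


def activities_equivalent(a: Activity, b: Activity) -> bool:
    """Check the three conditions of the definition of equivalent activities."""
    same_code = a.code == b.code
    deterministic = a.stream_deterministic and b.stream_deterministic
    same_config = a.config == b.config
    return same_code and deterministic and same_config


# Two averaging activities with the same configuration are equivalent.
avg_1 = Activity("avg", {"attribute": "temperature"})
avg_2 = Activity("avg", {"attribute": "temperature"})
print(activities_equivalent(avg_1, avg_2))       # True

# Assumption: the deployment location of a sensor is treated as part of its code.
sensor_1 = Activity("sensor@dike-1")
sensor_2 = Activity("sensor@dike-2")
print(activities_equivalent(sensor_1, sensor_2)) # False
```

In this sketch two sensors deployed at different locations fail the first condition, which matches the second argument given above.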

In transformation rules, activities are considered equivalent iff they have the same label. If the labels are different, then the activities are not equivalent. This does not prevent transformation rules from stating that two activities are interchangeable because workflow equivalence will be defined in terms of view equivalence.

. Equivalence of Views

With equivalence defined for activities, the equivalence of views can be defined.

Definition : Equivalent Views Two views are defined to be equivalent iff at all points in time their content is identical. Content includes all the attributes in the user defined schema plus a subset of the metadata managed by the Streaming Workflow System.

The content of a view for the purpose of equivalence consists of all records (including empty records), with all the attributes in the user defined schema. In addition the idealized transaction time (part of the system’s metadata) attribute is included; this is the timestamp used by interval predicates to select data from views.
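As a hedged illustration, view equivalence can be sketched in Python as follows. The `Record` layout, the integer timestamps, the finite `horizon` and the choice to compare content as a multiset are all assumptions of this sketch; the definition itself quantifies over all points in time and does not prescribe a representation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    tuple_id: int                 # system metadata, excluded from the content
    transaction_time: int         # idealized transaction time, included in the content
    attributes: Dict[str, object] = field(default_factory=dict)  # {} models an empty record


def content(view: List[Record], t: int) -> Counter:
    """Content of a view at time t: schema attributes plus transaction time."""
    return Counter(
        (r.transaction_time, tuple(sorted(r.attributes.items())))
        for r in view
        if r.transaction_time <= t
    )


def views_equivalent(v1: List[Record], v2: List[Record], horizon: int) -> bool:
    """Compare the content at every point in time up to a finite horizon."""
    return all(content(v1, t) == content(v2, t) for t in range(horizon + 1))


v_a = [Record(1, 1, {"temp": 12.5}), Record(2, 2)]
v_b = [Record(7, 1, {"temp": 12.5}), Record(9, 2)]   # other tuple ids, same content
print(views_equivalent(v_a, v_b, horizon=5))          # True
```

Because `tuple_id` is left out of the content, the two views above are equivalent even though their tuple ids differ.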

Adding tuple id to the content would make equivalence much harder, but not adding it restricts the kinds of tuple-based triggers that can be transformed. This choice requires further consideration.

Equivalence of views can be very simple. The following lemma formalizes a trivial case of equivalence between views. It is a special case of the definition above.

Lemma : Trivial Equivalence If two views are 1. produced by equivalent activities, 2. from equivalent input views and 3. with the same window sequence, then they are equivalent according to Definition .

The first and second conditions must be (indirectly) true for sharing to be possible at all: when doing something different with the data, nothing can be shared and likewise when the data is not from a common source. A lot of transformation rules target situations where the third condition is not true, but where there is some relation between the window sequences that causes a subsumption relation between views.

A simple proof of Lemma can be given by induction over time.

Given two views for which all three conditions hold: V_A, which holds the output of activity A, and V_B, which holds the output of activity B.

At t = 0 both views are empty according to Assumption III, so their content is the same. As the induction hypothesis, assume that at time t the content of both views is still the same.

At time t + 1 both A and B are activated and 1. read the same data interval from equivalent views, 2. perform the same operation on the same data and 3. produce the same output tuples, which are appended to V_A and V_B. So at time t + 1 the contents of V_A and V_B are still the same. Thus V_A and V_B are equivalent by Definition .

The first rule discussed in Chapter is proved by showing trivial equivalence. The second rule shows a proof for a situation where the third condition of Lemma is not met.
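The induction argument can be mirrored by a small simulation. The following Python sketch is an illustration under assumptions, not the execution model of the thesis: the `run` function, the `(time, value)` row layout and the window function are hypothetical. It activates an activity once per time step and records the content of its output view after every step.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, float]  # (idealized transaction time, value); hypothetical layout


def run(activity: Callable[[List[float]], List[float]],
        input_view: List[Row],
        window: Callable[[int], Tuple[int, int]],
        horizon: int) -> List[List[Row]]:
    """Return the content of the output view at t = 0, 1, ..., horizon."""
    output: List[Row] = []
    history = [list(output)]              # t = 0: the view is empty (Assumption III)
    for t in range(1, horizon + 1):
        start, end = window(t)            # interval read at activation time t
        data = [v for ts, v in input_view if start <= ts < end]
        output.extend((t, result) for result in activity(data))
        history.append(list(output))      # snapshot of the content at time t
    return history


def total(data: List[float]) -> List[float]:
    """Stream deterministic toy activity: emit the sum of the window, if any."""
    return [sum(data)] if data else []


def last_two(t: int) -> Tuple[int, int]:
    """Window that reads the timestamps t - 1 and t."""
    return (t - 1, t + 1)


source = [(1, 2.0), (2, 3.0), (3, 5.0)]
history_a = run(total, source, last_two, horizon=3)   # plays the role of V_A
history_b = run(total, source, last_two, horizon=3)   # plays the role of V_B
print(all(a == b for a, b in zip(history_a, history_b)))
```

With the same activity, the same input and the same window sequence, every snapshot of the two output views is identical, which is what the lemma states. Changing only the window function breaks the third condition and the snapshots diverge.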


Strict Equivalence Weak Equivalence

. Zip . Copycat

. Shared Loop

. Copy Bypass . Copy Elimination

. Union Associativity . Selection Pushdown . Masquerade . Single Input Union

. Idempotent Activities

. Collapse and Split Select . Intermediate Project . Aggregate Split . Shared Join

Table . : Transformation Rules by Equivalence Type


. Equivalence of Streaming Workflows

Based on the definition of equivalent views, the definition of equivalent workflows can be given.

Definition : Equivalent Workflows (Strict) Two streaming workflows are equivalent iff for each view in either workflow, an equivalent view exists in the other.

Many transformation rules introduce views to the workflow to store intermediate results that can be shared and reused. As a result these rules are not valid under the strict definition of workflow equivalence.

From a practical (user) viewpoint, however, additional views are no issue. Therefore a weaker definition of workflow equivalence is added.

Definition : Equivalent Workflows (Weak) Given a subset of the views, two streaming workflows are equivalent iff for each view in the set, equivalent views exist in both workflows.
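The difference between the strict and the weak definition can be sketched in Python. In this sketch a workflow is reduced to a mapping from view names to view content and view equivalence is reduced to plain equality of that content; both are simplifying assumptions, and the view names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

View = List[Tuple[int, float]]          # simplified content of a view
Workflow = Dict[str, View]              # view name -> content (names are hypothetical)


def has_equivalent(view: View, workflow: Workflow) -> bool:
    """True if the workflow contains a view equivalent to the given one."""
    return any(view == other for other in workflow.values())


def strictly_equivalent(w1: Workflow, w2: Workflow) -> bool:
    """Every view in either workflow has an equivalent view in the other."""
    return (all(has_equivalent(v, w2) for v in w1.values())
            and all(has_equivalent(v, w1) for v in w2.values()))


def weakly_equivalent(w1: Workflow, w2: Workflow, required: Iterable[View]) -> bool:
    """Every view in the given subset has an equivalent view in both workflows."""
    return all(has_equivalent(v, w1) and has_equivalent(v, w2) for v in required)


v_avg = [(1, 4.0), (2, 5.0)]
v_sum = [(1, 8.0), (2, 10.0)]

specification = {"V_avg": v_avg}                       # written by the user
internal = {"V_sum": v_sum, "V_avg": list(v_avg)}      # adds an intermediate view

print(strictly_equivalent(specification, internal))       # False
print(weakly_equivalent(specification, internal, [v_avg]))  # True
```

The additional intermediate view breaks strict equivalence, while weak equivalence over the views of the user specification still holds.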

This definition weakens the equivalence by restricting the number of views that must have an equivalent. In practice it will be no problem that the set of equivalent views is restricted because there is an asymmetry as described next.

Definitions and are symmetrical because an equivalence relation has to be symmetrical. In reality there is asymmetry: one workflow will be the specification written by the user and the second workflow is the internal workflow used by the system runtime. The subset used for equivalence is the set of views in the user specification. Additionally, many users are only interested in the end results produced by their workflow. Intermediate views can therefore also be removed from the subset used for equivalence.

Finally, any transformation rule can be rewritten to preserve strict equivalence. This is shown by the Copy Elimination/Bypass Rule in Section . . By performing only the additions specified by the rule and not the deletions, a workflow is created with a maximum set of equivalent views. It is just a little odd to call this a ‘transformation’. By extension this means that a Streaming Workflow System can always recreate views when the subset of views used for equivalence (i.e. user interests) changes.

Comparing Strict and Weak Equivalence Table . lists the transformation rules presented in the next chapters classified by the type of equivalence. Rules with the same type of equivalence also tend to share other characteristics.

An interesting observation is the fact that rules that adhere to strict equivalence identify duplicate views in a workflow or identify a subsumption relation between views. Application of these rules is always beneficial from a computational or storage perspective. Rules that only preserve weak equivalence try to make duplicate work explicit, allowing the results to be shared and the duplication to be eliminated. Additionally these rules can be used to trade computational cost for storage cost and vice versa.

Independence of Equivalence The definition of equivalence given is independent of both the workflow structure and the execution model. This has the following advantages:

• Transformation rules can structurally change the workflows, as long as the required views are still present.

• The formal model can be changed and implementations can diverge from it. A realistic example would be a filtering activity or a join activity that produces two output views based on two independent predicates.

• Part of a workflow definition can use other semantics and a different execution model. The definition of equivalence allows the formal model to be treated as a subset of the original model defined by [ ].

. Local and Global Equivalence

Streaming workflows are composed from multiple activities, each of which has its own semantics. It would be practical if transformation rules could be defined just for basic situations, focussing on a single semantic aspect of a specific type of activity. This is possible because global equivalence is preserved by preserving local equivalence.
