BatchJS: Implementing Batches in JavaScript

Master’s Thesis

Master Software Engineering, University of Amsterdam

Host organization and group: CWI

Contact person and supervisor: Tijs van der Storm

Douwe Kasemier

Abstract

None of our popular programming languages handles distribution well. Yet our programs interact more and more with each other, and our data resides in databases and web services. Batches are a new language feature that can finally bring native support for distribution to our favourite programming languages. This research proposes BatchJS as a JavaScript implementation of batches. Since server-side JavaScript has become a popular solution for data-heavy web applications, batches are a powerful addition to the language. To demonstrate this, an experiment is carried out which shows how BatchJS can handle access to an SQL database. BatchJS gives a programming model as clean as object–relational mapping can offer, while providing much better performance.

Preface

Dear reader, this document is my final thesis, which concludes the master’s program Software Engineering at the University of Amsterdam. I started this program in September 2010 after signing up in early 2008. Back then, while following a bachelor in applied information technology, I decided to attend this program. My main motivation was my belief that academic skills are very valuable in one’s (professional) life. Hopefully, it helped me to acquire some of these skills, along with a lot of practical knowledge in the field of Software Engineering (SE).

I believe and hope that the present study on batches for JavaScript will be of good value to the field of SE. While writing this thesis I came to realise more and more that this subject is something that SE has missed out on. I may only contribute a little, but it may be part of a solution to one of the biggest issues in SE.

Unfortunately it took me a little over a year to write this thesis. While I was warned that full focus is necessary to write a thesis in only six months, I found it difficult to let go of other responsibilities. In the end it worked out, partly due to the following people, whom I would like to thank for their help.

Acknowledgements

I am grateful for the help and feedback of my supervisor, Tijs van der Storm, during the period in which I have worked on this thesis. Also, I would like to thank the SWAT group at the CWI and the teachers of the Software Engineering master for their stimulating courses. One of the most valuable things in the SE program is how one learns to appreciate the value of a scientific approach. Furthermore, I would like to thank Arthur Koedam as well as my mother for their feedback on writing a scientific publication in English.

I would also like to thank my girlfriend Symone; without her I probably wouldn’t have been able to do this. She simply knows better than anyone how to get things done. Finally, I am thankful to my parents, grandparents and parents-in-law for the various ways in which they helped me to follow this master’s program.

Douwe Kasemier

3. Experiment and results
3.1. Experiment
3.2. Test suite
3.3. Benchmark
3.4. Results

4. Analysis
4.1. Implementation of BatchJS
4.2. Benchmark
4.3. BatchJS code
4.4. Research question

5. Conclusions
5.1. Summary
5.2. Related work
5.3. Future work

References

Appendices
A. BatchJS
B. Benchmark code and results - Variable size sets

1. Introduction

1.1. Distribution

Distribution is a long-standing cause of software development problems that is usually taken for granted. In its broadest sense, a distributed system is a system in which multiple programs execute in different address spaces. Often a narrower definition is used, in which the programs must also reside at different locations in a network. A few examples of distributed systems are:

• Web services

• SQL database management systems

• Remote Method Invocation

• Cloud storage

Because the components of a distributed system execute in different address spaces, they do not share the exact same view of the system as a whole. To communicate, the components of a distributed system need an external channel. This can range from something like a Unix socket to an HTTP connection over the internet.

How our programming languages handle distribution

The typical approach to distribution in programming languages is not to support it in the language itself. Distributed tasks therefore depend on a library. Good examples of this are Remote Method Invocation in Java, LINQ in the .NET framework, or an RDBMS interface. These libraries often make a program more complex, or they lack performance.

Why distribution is a problem

The example in Figure 1.1 shows a loop that checks the publication date of articles against a variable date. If an article was published before this date, its title and author are printed and the status of the article is changed to offline. As long as articles is a local object this is an efficient solution. But what if articles is a remote collection? In that case the code introduces unnecessary latency. First, the client has to fetch all articles from the server. Secondly, for every article that has to be put offline an extra call has to be made to send this status back to the server. Also, the getAuthor() method becomes a remote method that performs another call. Two calls for every expired article introduce unnecessary latency, because these actions could have been performed in just one round trip to the server. This problem is called the n+1 query problem.

Figure 1.1: Iterating over a collection

    foreach (var article in articles) {
        if (article.PublishDate < expirationDate) {
            Console.WriteLine(article.Title + " by " + article.getAuthor().Name);
            article.Offline = true;
        }
    }

n+1 query problem

The n+1 query problem means that for every row in a result set another query has to be performed. The problem surfaces when data is fetched from a remote source and used in an iterator that performs another query based on each result row. Common examples are fetching all n rows of a 1-n relation for the result row, or performing an update to the row. As a result, for a loop with n iterations, n + 1 queries are performed: one for every iteration (= n), and one to fetch the result that is iterated over.
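The effect can be made visible by counting round trips. The following is a minimal sketch that mirrors the loop of Figure 1.1 against a simulated server; the server object, its methods and the sample data are hypothetical and only serve to count calls. Because the loop performs two remote calls per expired article, it needs 1 + 2n round trips for n expired articles.

```javascript
// Hypothetical in-memory "server"; every method call stands for one round trip.
function createServer() {
  const articles = [
    { id: 1, title: "Old news", publishDate: new Date("2010-01-01"), authorId: 7, offline: false },
    { id: 2, title: "Older news", publishDate: new Date("2009-06-01"), authorId: 8, offline: false },
    { id: 3, title: "Fresh news", publishDate: new Date("2012-03-01"), authorId: 7, offline: false }
  ];
  const authors = { 7: { name: "Ann" }, 8: { name: "Bob" } };
  const server = {
    roundTrips: 0,
    getArticles() { server.roundTrips++; return articles.map(a => ({ ...a })); },
    getAuthor(id) { server.roundTrips++; return authors[id]; },
    setOffline(id) { server.roundTrips++; articles.find(a => a.id === id).offline = true; }
  };
  return server;
}

// The loop of Figure 1.1, written against the remote collection.
function expireNaively(server, expirationDate) {
  const printed = [];
  for (const article of server.getArticles()) {            // 1 round trip
    if (article.publishDate < expirationDate) {
      const author = server.getAuthor(article.authorId);   // +1 per expired article
      printed.push(article.title + " by " + author.name);
      server.setOffline(article.id);                       // +1 per expired article
    }
  }
  return printed;
}

const server = createServer();
const printed = expireNaively(server, new Date("2011-01-01"));
console.log(printed, "round trips:", server.roundTrips);
```

With two expired articles in the sample data, the counter ends at five round trips for work that could conceptually be done in one.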

In some cases, solving the n+1 query problem is not intuitive. Table 1.1 shows two relational database tables that are related through ManufacturerID. What if someone would like to fetch all manufacturers and then fetch all products for every manufacturer? As this is a relational database, only flat, tabular results are allowed. There is no query that would return a nice set of manufacturers with a subset of products for each manufacturer.

Table 1.1: Related manufacturer and product tables

ManufacturerID   Name
3                Chairs & co
4                Food & co

ProductID   ProductName    ManufacturerID
10          Sofa           3
11          Dining chair   3
12          Barcalounger   3
15          Cheese         4
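One way to avoid a query per manufacturer is to fetch a single flat join result and rebuild the nesting on the client. The sketch below assumes the rows have already been returned by such a join over the two tables of Table 1.1; the property names and the nest function are illustrative only.

```javascript
// Rows as a join of the two tables in Table 1.1 would return them: flat,
// with the manufacturer columns repeated for every product.
const rows = [
  { manufacturerId: 3, name: "Chairs & co", productId: 10, productName: "Sofa" },
  { manufacturerId: 3, name: "Chairs & co", productId: 11, productName: "Dining chair" },
  { manufacturerId: 3, name: "Chairs & co", productId: 12, productName: "Barcalounger" },
  { manufacturerId: 4, name: "Food & co", productId: 15, productName: "Cheese" }
];

// Rebuild the nested structure (manufacturer -> products) on the client.
function nest(flatRows) {
  const byId = new Map();
  for (const row of flatRows) {
    if (!byId.has(row.manufacturerId)) {
      byId.set(row.manufacturerId, { id: row.manufacturerId, name: row.name, products: [] });
    }
    byId.get(row.manufacturerId).products.push({ id: row.productId, name: row.productName });
  }
  return Array.from(byId.values());
}

const manufacturers = nest(rows);
console.log(JSON.stringify(manufacturers, null, 2));
```

The regrouping code is exactly the kind of overhead that a developer has to write by hand when the query result can only be tabular.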

Common solutions to the problems of distribution

Some libraries that deal with distribution leave it to the developer to solve the n+1 query problem. Other libraries offer solutions that a developer can use. There are a number of traditional solutions to the n+1 query problem. The following are some of the most common ones:

Server façade A server façade moves some responsibility and logic to the server so that we can execute the remote code in one round trip. Figure 1.2 is an example of a server façade for the example in Figure 1.1. While good for performance, a server façade moves maintenance and possible points of failure from only the client to both client and server [13].

Figure 1.2: Server façade

    // Client:
    var articles =
        server.getExpiredArticlesAndSetOffline(expirationDate); // façade

    foreach (var article in articles) {
        Console.WriteLine(article.Title);
    }

    // Server:
    public List<Article> getExpiredArticlesAndSetOffline(DateTime expirationDate) {
        // Collect the expired articles so they can be returned to the client
        var expired = new List<Article>();
        foreach (var article in this.getArticles()) {
            if (article.PublishDate < expirationDate) {
                article.Offline = true;
                expired.Add(article);
            }
        }
        return expired;
    }

Data Transfer Objects Data transfer objects or DTOs are objects with the sole purpose of storing and retrieving data that has to be sent from client to server. In the case of the example in Figure 1.1 a DTO could be used to hold the article information, including author, in order to lower the number of requests. However, updating the article would still require an extra request. Like a façade, DTOs move some responsibility to the server.

Figure 1.3 shows a possible solution using a DTO. In this case the server creates a collection object that contains data for every article and sends it to the client. The DTO provides a method to persist the complete collection on the server. While not much changes in the client code, the server has to create the DTO as well.

Do nothing about it Instead of changing the programming model it is possible to “accept the loss” and prefer a natural programming model over performance. While this approach may seem odd, it is used in practice. Some very popular RESTful web services have adopted it.

Figure 1.3: Data transfer object

    // Client:
    var articles =
        server.getArticlesDTO(); // DTO of a collection of articles

    foreach (var article in articles) {
        if (article.PublishDate < expirationDate) {
            Console.WriteLine(article.Title + " by " + article.getAuthor().Name);
            article.Offline = true;
        }
    }

    articles.save(); // save the DTO

Figure 1.4 shows an example of how a RESTful web service can be very inefficient. The code is part of a fictitious program that uses the Facebook Graph API. A Facebook post has an ID and n comments. A comment has an ID and an integer with the number of likes. This code removes all comments on a post that have zero likes. In the worst-case scenario, where no comment has any likes, there is a call to the service for every deletion. As a result the number of round trips to the server will be n + 1.

Figure 1.4: Facebook API

    // Create a new Facebook HTTP client
    FbClient client = new FbClient();

    // Set the url to the comments of some post
    client->url = "/" + postId + "/comments";
    var comments = client->get()->body;

    foreach (var comment in comments) {
        if (comment.likes == 0) {
            client->url = "/" + postId + "_" + comment.id;
            client->delete();
        }
    }

1.2. Object–relational impedance mismatch

SQL relational databases are known to cause the described problems with distribution. Most of our programming languages are very different from the declarative SQL language family. The most popular programming languages are object-oriented. In combination with a relational database this creates the problem that the object model differs from the relational model. For this reason developers usually dislike the usage of SQL in such languages. It has a negative impact on flexibility [36]. This difference in programming models is often called the object–relational impedance mismatch (an analogy to impedance mismatch in electrical engineering).

A common approach to this mismatch is the use of object–relational mapping (ORM) [7], [27]. ORM often takes the form of an automated framework like Hibernate, NHibernate or Doctrine. ORM does, however, come with its own problems, which make it a topic of discussion; Ted Neward went as far as arguing that ORM is the Vietnam of computer science [31].

The problems of ORM

The n+1 query problem (revisited) ORM frameworks are naturally prone to the n+1 query problem. Because they aim to provide a more natural programming model, a developer can often use remote collections as if they were local collections. This is very similar to the general problem with distribution as described before. Figure 1.5 is an excerpt of example code that is part of the tutorial of Doctrine, a popular ORM framework for PHP. This example is a very concrete case of how easily the n+1 query problem may surface with the usage of ORM.

In every iteration of the first loop in this example, another loop over the children of $bug is performed. As a result, the number of SQL queries is one for each bug to select the products, plus the original SQL query to select all the bugs (n+1). Another thing that can be observed in this code is that the reporter and engineer tables are joined in order to prevent two more selects in every iteration. If the writer of this code had omitted these joins, the number of queries per iteration would increase to three. The fact that the developer has to make decisions about this is a serious disadvantage of remote collections. If a developer misses an n+1 query problem, the performance penalty is potentially very large.

Persistence When using an ORM framework there has to be a moment where the objects are populated from the database. A classic approach to ORM is to do this as soon as the object is created. This creates a synchronization problem when persistent objects are stored for some time, e.g., when user input is required. Thereafter, the object may very well be out of sync with the corresponding row in the database [32]. Another approach to this problem is the use of object faulting. In this case, objects are populated from the database at the moment they are first used. A disadvantage of this solution is that it essentially hides information about the state of objects from the developer. Even with object faulting the program may stay idle after an object has been populated, causing the object to stray from its corresponding database values.
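Object faulting can be illustrated with property getters that populate an object on first access. This is a minimal sketch, not how any particular ORM framework implements it; the database object, its load counter and faultingEmployee are hypothetical names. It also shows the staleness issue described above: once populated, the object no longer follows changes in the database.

```javascript
// Hypothetical data source; `loads` counts how often the "database" is hit.
const database = {
  loads: 0,
  rows: { 42: { name: "Alice", phone: "555-0100" } },
  load(id) { this.loads++; return { ...this.rows[id] }; }
};

// Returns an object whose fields are fetched on first access (object faulting).
function faultingEmployee(id) {
  let data = null; // not populated yet
  const fault = () => (data === null ? (data = database.load(id)) : data);
  return {
    get name() { return fault().name; },
    get phone() { return fault().phone; }
  };
}

const employee = faultingEmployee(42);
const loadsBeforeUse = database.loads;   // nothing is fetched at creation
const name = employee.name;              // first use triggers the load
database.rows[42].phone = "555-0199";    // the database changes afterwards
const stalePhone = employee.phone;       // still the value loaded earlier
console.log(loadsBeforeUse, name, stalePhone, database.loads);
```

Nothing in the calling code reveals at which point the load happened, which is the hidden state the text refers to.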

Figure 1.5: n+1 query problem as seen in the tutorial of a popular ORM framework for PHP

    $dql = "SELECT b, e, r FROM Bug b JOIN b.engineer e JOIN b.reporter r ORDER BY b.created DESC";

    $query = $entityManager->createQuery($dql);
    $query->setMaxResults(30);
    $bugs = $query->getResult();

    foreach ($bugs AS $bug) {
        echo $bug->getDescription()." - ".$bug->getCreated()->format('d.m.Y')."\n";
        echo "Reported by: ".$bug->getReporter()->name."\n";
        echo "Assigned to: ".$bug->getEngineer()->name."\n";
        foreach ($bug->getProducts() AS $product) {
            echo "Platform: ".$product->name."\n";
        }
        echo "\n";
    }

Unused data Objects are usually populated using all data in the database, whether or not it is used. For example, when an employee from a fictional Employee table is loaded using an ORM framework and only a method that prints his or her name and phone number is called, it may very well fetch 20 columns of unnecessary data, and transport this to the Employee object.

Bulk remote operations ORM does not generally support bulk remote operations well. Manipulating sets of objects using one statement, like bulk inserting or updating in a database, is not how we naturally program in imperative languages. The tempting solution for a bulk operation is a loop structure which adds bulk data. Again, this results in n + 1 SQL queries. To execute these operations in a single query, most ORM frameworks require the developer to define explicit initiation and commit points using some form of a transaction helper.
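To make the contrast with a loop of single-row inserts concrete, the sketch below builds one parameterized multi-row INSERT statement for a whole set of objects. It is an illustration only: bulkInsert is a hypothetical helper, the `?` placeholder style depends on the database driver in use, and table and column names are assumed to come from trusted code rather than user input.

```javascript
// Builds one parameterized multi-row INSERT instead of one statement per row.
function bulkInsert(table, columns, rows) {
  const placeholders = rows
    .map(() => "(" + columns.map(() => "?").join(", ") + ")")
    .join(", ");
  const sql =
    "INSERT INTO " + table + " (" + columns.join(", ") + ") VALUES " + placeholders;
  // Flatten the row values in the same order as the placeholders.
  const params = rows.flatMap(row => columns.map(column => row[column]));
  return { sql, params };
}

const statement = bulkInsert("Products", ["ProductName", "ManufacturerID"], [
  { ProductName: "Stool", ManufacturerID: 3 },
  { ProductName: "Butter", ManufacturerID: 4 }
]);
console.log(statement.sql);
console.log(statement.params);
```

Writing such a helper by hand is a departure from the loop a developer would naturally write, which is the trade-off this paragraph describes.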

Workarounds Many ORM frameworks try to work around these problems by bridging the gap to SQL. ORM frameworks usually have a query builder or a DSL to retrieve data. This offers more control for the developer to improve performance. On the downside these query builders and DSLs can be very verbose, like SQL.

The workarounds in ORM frameworks always present a trade-off between the object-oriented programming model and the relational model. The natural model that ORM frameworks try to achieve is partially lost and the impedance mismatch still exists. Figure 1.5 uses the DQL query language. This is a good example of a workaround offered by an ORM framework.

1.3. Batches

Batches are a recent approach to interaction with remote objects as if they were local objects. Batches are powerful because they do not suffer from the described problems with distribution. The batch statement is a language feature designed to support batches [26].

How batches solve the problems caused by distribution

Looking back at Figure 1.1, you could ask yourself the question: “Why can’t we just execute this code in one round trip to the server?” This would require our programming languages to understand that a block of code sometimes does not need to be executed imperatively. Instead, it should be recognized as a block with remote code which may be optimized. The addition of the batch statement to a language can provide this.

Statements and expressions inside a batch statement combine local and remote operations. The separation of local and remote execution is done automatically; the developer does not necessarily need to be aware of it. Remote code is always a single block that is executed first and returns a response that the local code uses. As a result, a limit of one round trip from client to server for each batch statement is guaranteed. Because the developer interacts with local objects, the programming model is very natural and understandable [17].
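The split into one remote block followed by local code can be sketched by hand for the example of Figure 1.1. The code below is not the script format used by batches or by BatchJS; the server object, its execute method and the shape of remoteScript are assumptions made purely to illustrate the idea that the remote part travels as data in a single round trip, and the local part only consumes the response.

```javascript
// Hypothetical server: `execute` stands for the single round trip of a batch.
function createServer() {
  const authors = { 7: { name: "Ann" }, 8: { name: "Bob" } };
  const articles = [
    { title: "Old news", publishDate: "2010-01-01", authorId: 7, offline: false },
    { title: "Older news", publishDate: "2009-06-01", authorId: 8, offline: false },
    { title: "Fresh news", publishDate: "2012-03-01", authorId: 7, offline: false }
  ];
  const server = {
    roundTrips: 0,
    articles,
    // Runs the remote part: select, update and collect results in one go.
    execute(script) {
      server.roundTrips++;
      return articles
        .filter(a => a.publishDate < script.publishedBefore) // ISO dates sort as strings
        .map(a => {
          Object.assign(a, script.update);
          return { title: a.title, authorName: authors[a.authorId].name };
        });
    }
  };
  return server;
}

const server = createServer();

// Remote part: a description of the work, sent to the server as data.
const remoteScript = { publishedBefore: "2011-01-01", update: { offline: true } };
const response = server.execute(remoteScript);

// Local part: runs on the client and uses only the response.
const lines = response.map(r => r.title + " by " + r.authorName);
console.log(lines, "round trips:", server.roundTrips);
```

In this sketch the developer performs the split manually; the point of the batch statement is that this separation is derived from ordinary-looking code instead.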

Batches as an improvement of ORM

Batches are able to provide a natural programming model and optimized execution for database access. One of the possibilities of batches is to build an object-oriented interface to a relational database that treats rows as persistent objects. This is similar to the way we are used to working with ORM frameworks. The key architectural difference is that the batch statement intermingles local and remote code, whereas ORM frameworks offer functions that execute this remote code. By using batches, this database interface overcomes the traditional performance problems of ORM frameworks.

1.4. Implementation of batches in JavaScript

The batch statement is a language level feature and it can be implemented as a dialect of many programming languages (although it clearly would not make sense in every language). Previously it has been implemented as a Java dialect called Jaba (Java with batches) [26]. The idea of adding batches with a dedicated batch statement leads to very similar implementations in other imperative languages.

In this study, batches are implemented in JavaScript. JavaScript is an emerging language, driven by the growing demand for more and more complex web applications. The growth of interest in the language is further accelerated by the rising popularity of server-side JavaScript. Server-side JavaScript was first introduced with Netscape Enterprise Server in the mid-1990s. In 1996, Microsoft started to offer server-side JScript in their IIS webserver. More recently, Mozilla’s open-source Rhino engine, which brings JavaScript support to Java applications, has become increasingly popular.

One of the accelerants of the increase in popularity of server-side JavaScript is node.js. Node.js is a JavaScript platform that uses the Google V8 engine. At the time node.js was first released, V8 was very fast in comparison to competing engines. Since then, a performance battle between JavaScript engines has started, and node.js quickly gained popularity due to its performance [39], [40].

Controversy on JavaScript

There is some controversy regarding JavaScript. The language receives criticism for being ambiguous, full of bad features, and generally not developer-friendly [37]. On the other side it is often praised for its flexibility and functional programming features [19].

Unlike most other languages, JavaScript does not offer any built-in features for file system IO. A big disadvantage of this is that there is no language-defined system for building modular code spread across modules and files. This is a frequently heard point of criticism of JavaScript from developers. In 2009, a project called CommonJS was started to define a standard that makes JavaScript work outside a browser (for example server-side). The CommonJS standard is currently at version 1.1.1 and defines a standard for modules, packages, promises and system IO. Because server-side frameworks like node.js have started to implement CommonJS, one of the main acceptance problems of server-side JavaScript is disappearing [28].

JavaScript and non-blocking IO

JavaScript does not offer support for blocking IO operations. As a result, any IO operation has to be done asynchronously. In JavaScript, IO is done using what is called “the event loop”. The JavaScript engine maintains a message queue in which every message has an associated function. Whenever a message is taken off the queue, the engine calls the associated function and runs it to completion. Because of this architecture, IO is mostly handled using events and callbacks in a JavaScript program. The code in Figure 1.6 shows an example of this event loop based code versus blocking code. This event-driven nature of JavaScript has proven to be powerful for large scale web applications.

Figure 1.6: The event loop

    // Blocking code
    // This comment is reached first
    var products = db.query("SELECT * FROM Products AS product");

    // This comment is reached second, here we have products

    // In JavaScript, this example is only possible with a lot of effort, or some library to introduce blocking.

    // Event loop based code
    // This comment is reached first
    var products;
    db.query("SELECT * FROM Products AS product", function(products) {
        // This comment is reached third, here we have products
    });
    // This comment is reached second, the operation above is non-blocking, products is undefined
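The ordering described in Figure 1.6 can be observed in a runnable form. The sketch below replaces the database with a hypothetical db object whose query method hands its result to a callback through setTimeout, which is enough to make the call non-blocking; the order array records when each point in the code is reached.

```javascript
// Hypothetical non-blocking query: the callback is queued and runs later.
const db = {
  query(sql, callback) {
    setTimeout(() => callback([{ id: 1 }, { id: 2 }]), 0);
  }
};

const order = [];

// The promise only exists so that the final order can be inspected afterwards.
const done = new Promise(resolve => {
  order.push("first: before the query");
  db.query("SELECT * FROM Products AS product", function (products) {
    order.push("third: callback with " + products.length + " products");
    resolve(order);
  });
  order.push("second: after the query call");
});

done.then(result => console.log(result.join("\n")));
```

The statement after the query call runs before the callback, because the callback is only invoked once the current code has run to completion and the event loop picks up the queued message.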

JavaScript as a transport for batches

Batches require an underlying interface to send the remote code to the server and receive the results. A flexible scripting language can represent code understandable to clients and servers that are written in different languages. JavaScript might seem a good fit for this purpose[13]. For this reason an implementation of batches that is built in JavaScript has been chosen for this study.

1.5. Research question and goals

Research question How can we implement batches for SQL databases in JavaScript and how does it perform compared to object–relational mapping?

The main goal of this study is to demonstrate how batches can be implemented in JavaScript. To prove the added value of the implementation, the secondary goal is making a comparison between relational database access with batches and traditional ORM.

This study is limited to the use of batches for SQL databases. The reason for this is that batches are very promising as a solution to the problems of ORM. Batches support both a natural programming model as well as optimized execution. ORM frameworks always have to make a choice between those qualities.

Hypotheses

At the start of this study, four hypotheses were postulated about the answers to the research question. In this study, these hypotheses will be validated in order to evaluate the results and answer the research question.

Hypothesis I The batch statement can be implemented successfully in JavaScript.

Based on the theory, there is no reason to believe that the batch statement cannot be ported to JavaScript. JavaScript’s non-blocking IO and the general controversy on the language may however lead to some interesting insights in the possibility to implement a new statement to the language.

Hypothesis II It is possible to implement the batch statement in JavaScript in such a way that developing does not require additional compilation/translation steps by the developer.

In addition to hypothesis I, batches for JavaScript can be implemented in a developer-friendly way. In this case (based on the research question) this means that there is no additional manual compilation/translation step.

Hypothesis III Performance of the batch statement will be at least as good as the performance of an ORM framework.

The hypothesis is that the batch statement performs as well as or better than an ORM framework. The problem with performance is that it depends heavily on implementation quality. If this research implementation outperforms a frequently used and actively developed ORM framework, it is plausible that batches are generally faster. If this hypothesis turns out to be incorrect, however, it would mean that the theory on batches in this study is not valid.

Hypothesis IV JavaScript code that handles persistent data will be at least as natural as when using an ORM framework.

Batch statements will provide the same ability to read and manipulate persistent data as an ORM framework. The natural programming model that ORM frameworks offer will be possible by using batches.

“A natural programming model” is a subjective quality. This investigation does not define what exactly makes a natural programming model, nor will it be an empirical study on the programming model of batches versus ORM. It can however offer a crude comparison of batched versus ORM code. Using lines of code and a common-sense view on complexity, any deviations from this hypothesis can be easily recognized.

Using batches there will be less necessity for overhead code needed to select data or perform bulk operations. As a result, code that uses the batch statement will probably be more compact and less dependent on non-language behaviour for these actions than comparable ORM code.

1.6. Validation

Quality of the implementation

It is not possible to prove that software is of good quality. Usually comparisons between software are made to decide which software is of the best quality. Because implementing batches for JavaScript is a novel concept, there is nothing to compare it with. As a result, if any of the hypotheses fails, the conclusions drawn may raise the question whether the failure is caused by a problem with the implementation. This is not necessarily a problem, but it is important to keep in mind when interpreting the research results.

Validity of benchmarks

In this study, the validity of the benchmarks is preserved by carrying out a small benchmark thoroughly. Benchmarking can be a study of its own, so in this investigation it is kept as simple as possible. The benchmarks are thorough enough to confirm or reject the hypotheses but small enough to be achievable in a study that mainly aims at implementation of a concept.

The conclusions based on these benchmarks are not drawn from exact numbers. Instead, they are based on obvious performance differences. The tests are mostly taken from a known and used set of simple tests that can validate the behaviour of batches for JavaScript. More thorough research on the performance of batches is outside the scope of the research goals.

Research categorisation

The lack of evaluation is a recurring critique on research in the Software Engineering field [24], [34]. Understanding research and the level of evaluation that can be expected is easier if topic, approach and research method are classified. For this reason this study is classified using a framework for classifying SE research [22].

Topic The topic of this study is programming languages, on the edge with methods and techniques. This investigation focuses on a programming language solution to take an existing technique (local–remote interaction between a program and a database) to a higher level.

Research approach The research approach of this study can be categorized as formulative. This involves the development of a theory with the aim of scientific progress [30]. According to literature research [22] this is the most frequently used approach in the (young) field of software engineering. There is a clear difference between formulative research and evaluative research; the latter is not very common in the software engineering field. Formulative research usually does not employ the scientific method and focuses more on progress. Evaluative research on batches for JavaScript would be possible after formulative groundwork is done.

Research method This study uses the method of concept implementation. Moreover, a software laboratory experiment is used, but this is not the main method. The laboratory experiment is part of the concept implementation as a way of validation. This is a clear distinction because the level of detail of result interpretation can be lower when validating an implementation. If research is based on an experiment alone, one should expect more detail in order to add any value to the SE field.

Research boundaries

The implementation of batches for JavaScript is a proof of concept; it does not necessarily have to be complete. The proof of concept provides a limited set of control structures and SQL actions that the batch statement can interpret and generate. This makes it sufficiently complete to answer the research question.

1.7. Overview of this thesis

This chapter explained the motivation behind this study and provided an overview of the research question and goals.

Chapter 2 explains details on BatchJS, the implementation of batches for JavaScript.

Chapter 3 describes the research and gives an overview of the results.

Chapter 4 analyses the results.

Finally, Chapter 5 provides an overview of related work and contains the conclusion of this study and recommendations for future work.

2. BatchJS

In order to use batches in JavaScript, a language called BatchJS is proposed. The grammar of BatchJS is a superset of that of JavaScript; the only difference is the addition of the batch statement. BatchJS is compiled to regular JavaScript code and batch scripts. These batch scripts represent server-side operations and are embedded as regular strings in the compiled JavaScript code. Because every JavaScript program is also a valid BatchJS program, a BatchJS compiler should be able to compile regular JavaScript files as well. The BatchJS compiler is available on GitHub (Appendix A).

2.1. The BatchJS language

2.1.1. Batch statement syntax

The basic syntax of a batch statement in BatchJS is batch (declaration : expression) block. Figure 2.1 shows a code example of a very minimal batch statement.

Batch statement initialization A batch statement always starts with the batch keyword, followed by a declaration and an expression. The expression should evaluate to a BatchJS compatible runtime. The declaration will be used as the remote root object inside the batch statement. Operations on this object and objects derived from it should perform a remote operation. An initialized batch statement is always followed by a block which contains statements and/or expressions.

Figure 2.1: Batch statement example

    batch (let db : connection) {
      db.Products.forEach(function(product) {
        print(product.id);
      });
    }


Let declaration. The declared remote root of the batch statement cannot be used outside the batch block. JavaScript has a scoping system where var declarations are always scoped to the function that they are declared in. Statements do not have their own scope for regular variables (Figure 2.2 demonstrates this behaviour). The let declaration, as defined in JavaScript 1.7 [5], does allow for local scoping in statements.

To conform to the JavaScript scoping system, only a let declaration is supported in the batch statement initialization. Let declarations are currently not widely supported across different JavaScript interpreters. Because the let declaration is parsed by BatchJS, this should not be considered a problem.

Figure 2.2: JavaScript var scoping

    var x = 5;
    var y = 20;
    if (y > 10) {
      var x = 10;
    }
    console.log(x); // outputs '10'
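For comparison, the difference between var and let scoping can be sketched as follows. This is a minimal illustration that assumes an engine with support for block-scoped let declarations; the function names withVar and withLet are hypothetical.

```javascript
// Sketch: function-scoped var versus block-scoped let.
// Assumes a JavaScript engine that supports let declarations.
function withVar() {
  var x = 5;
  if (true) {
    var x = 10; // same variable as the outer x
  }
  return x;
}

function withLet() {
  var x = 5;
  if (true) {
    let x = 10; // new variable, visible only inside this block
    x += 1;     // changes the inner x only
  }
  return x;
}

console.log(withVar()); // 10
console.log(withLet()); // 5
```

The second function shows why a let declaration suits the batch statement: the declared name does not leak out of the block it belongs to.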

2.1.2. The batch block

The block inside the batch statement always contains JavaScript code. This code is compiled to implement the desired behaviour of the batch statement. The current version of BatchJS does not support the complete JavaScript syntax in the batch block. A subset that is mandatory to perform all basic Create, Read, Update, Delete (or CRUD) actions on remote data is supported. The following paragraphs provide an overview of the allowed statements and expressions in a batch block and their effects on the batch scripts.

For...each loop. Iterating is necessary to support operations on lists of data, like database rows. The for...each statement that is commonly used to achieve this does not exist in JavaScript. Arrays can be iterated using a for statement in combination with Array.length and direct access to indices. In JavaScript, objects are often used instead of arrays when a unique key is available. This makes it possible to iterate over these keys using the for...in loop (which iterates over object properties).

BatchJS uses the forEach function, which takes a function argument as the iterator implementation. This is the standard method of array iteration defined in ECMAScript 5 and JavaScript 1.6, which is not yet widely accepted. Figure 2.3 shows how the JavaScript forEach function is used with the batch statement.


Figure 2.3: Batch statement with a forEach loop

    batch (let db : connection) {
      db.Products.forEach(function(product) {
        console.log(product.UnitPrice);
      });
    }

If–else statement. In batches, the if statement is used inside an iterator to filter records. This makes the statement essential for the implementation of basic CRUD operations with BatchJS. Figure 2.4 shows how the if statement can be used inside a batch.

Figure 2.4: Batch statement with an if

    batch (let db : connection) {
      db.Products.forEach(function(product) {
        if (product.UnitsInStock == 0) {
          console.log(product.Name + ' sold ' + product.Orders.count()
            + ' times ' + product.UnitPrice + ' / unit');
        } else {
          console.log(product.Name + ' is currently in stock');
        }
      });
    }

Create, Update & Delete. New objects are created using a regular constructor. Updating is an implicit operation, done by simply changing the object inside the batch statement. Deletion is done using a delete() method on the object itself. Figure 2.5 shows examples of these operations.


Figure 2.5: JavaScript batch CRUD operations

    // creation
    batch (let db : connection) {
      var p1 = new db.Products({ Name: 'Foo bar', UnitsInStock: 42 });
      var p2 = new db.Products();
      p2.Name = 'Foo bar';
      p2.UnitsInStock = 42;
    }

    // updating
    batch (let db : connection) {
      db.Products.forEach(function(product) {
        if (product.UnitsInStock == 0) {
          product.Status = "Sold out";
        }
      });
    }

    // deletion
    batch (let db : connection) {
      db.Products.forEach(function(product) {
        if (product.UnitsInStock == 0) {
          product.delete();
        }
      });
    }

2.2. Implementation details

2.2.1. Architecture

BatchJS consists of a compiler and a number of runtimes to handle batches in the compiled JavaScript code. The BatchJS compiler is used to translate BatchJS source code to regular JavaScript and batch scripts. The runtimes can execute batch scripts and handle the result set of batched operations. BatchJS requires a server that understands batch scripts. This server receives a batch script and responds with a result set. This architecture is visualised in Figure 2.6.

BatchJS is implemented entirely in JavaScript as a command-line node.js module. It currently works in a Linux environment but can easily be ported to any system that is supported by node.js. Table 2.1 shows the source lines of code of BatchJS v0.1, calculated using CLOC [4].


Figure 2.6: Architecture of BatchJS

2.2.2. Compiler

What defines a batch statement is that its body is a combination of local and remote code. The main task of a batch statement compiler like BatchJS is to perform the separation of this code.

The current version of BatchJS uses a two-step compilation method. First, all batch statements are identified by a parser. Then, for every batch block an abstract syntax tree is generated. From this tree, BatchJS generates JavaScript code and a batch script. The compiled code of all batch statements is merged into the compiled JavaScript output. The architecture of this compiler can be seen in Figure 2.7.

The lexical analyser and parser are a modified version of UglifyJS. UglifyJS is an open-source JavaScript parser, compressor and beautifier. BatchJS adds the ability to parse batch statements to UglifyJS.

Abstract syntax tree

BatchJS uses an abstract syntax tree (or AST) that represents the body for every batch statement. To build and traverse this tree the treehugger.js library is used [23]. BatchJS uses the visitor pattern to traverse the AST using a pre-order, depth-first method and generate the output JavaScript code.


Table 2.1: Lines of code in BatchJS v0.1 (excluding tests and benchmarks)

    Part of BatchJS                      LOC
    Compiler                            1062
    Modifications to UglifyJS parser      71
    Runtimes                             124
    BatchJS total                       1256
    UglifyJS library                    4721
    Treehugger library                  1531
    Total SLOC                          7508 *

Figure 2.7: Architecture of the BatchJS compiler

2.2.3. Separation of local and remote code

The main responsibility of the BatchJS compiler is the separation of local and remote code. This separation is based on the concept of identification by reachability [9]. By remembering which variables are connected to some root, and which variables are connected to those variables in turn, it can be determined whether a variable is reachable from this root. This technique is widely used by garbage collectors to determine whether a variable is still in use. It has also been used before to identify persistent objects in a transaction [8], [41], which is somewhat similar to batches.
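The idea of identification by reachability can be illustrated with a small sketch. This is not the BatchJS implementation: the function reachableFrom and the shape of the bindings list are assumptions made for illustration only. Each binding records which variables a new variable was derived from, and the function repeats a pass over the bindings until no further variable can be connected to the root.

```javascript
// Sketch of identification by reachability (not the BatchJS implementation).
// Each entry says: variable `name` was derived from the variables in `from`.
function reachableFrom(root, bindings) {
  var reachable = Object.create(null); // no inherited keys such as 'constructor'
  reachable[root] = true;
  var changed = true;
  // Repeat until no new variable can be connected to the root.
  while (changed) {
    changed = false;
    bindings.forEach(function (binding) {
      if (reachable[binding.name]) { return; }
      var connected = binding.from.some(function (source) {
        return reachable[source] === true;
      });
      if (connected) {
        reachable[binding.name] = true;
        changed = true;
      }
    });
  }
  return Object.keys(reachable);
}

var bindings = [
  { name: 'orders', from: ['customer'] }, // var orders = customer.Orders
  { name: 'customer', from: ['db'] },     // var customer = db.Customers
  { name: 'label', from: ['prefix'] }     // var label = prefix + '!'
];

console.log(reachableFrom('db', bindings)); // [ 'db', 'customer', 'orders' ]
```

In this example label is never connected to db, so code that only uses label would stay local.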

The remote root

To use identification by reachability, a batch statement needs to have an identifier for the remote root. A batch statement always starts with this identifier, which is a let declaration as seen in Figure 2.8. This declaration uses a colon, which can be read as “in”. This means that the declared root runs in a certain runtime, which is passed after the colon. At the beginning of a batch statement body, the root is always the only remote object in a batch statement. If it is left unused in the batch statement, all code


Figure 2.8: Remote root

    batch (let db : connection) {
      // 'db' is the remote root
    }
    batch (let service : connection) {
      // 'service' is the remote root
    }

State

As the compiler traverses the AST of the batch statement body it uses a state object to keep track of remote objects. At the start of the traversal, the compiler assumes that code is local. This means that at this point, all BatchJS code will be compiled directly to equal JavaScript code.

For every node that it traverses, the compiler checks whether the node is remote. A variable node is checked against the list of remote objects kept by the state object (State). For property access or indexed access, a remote node might be deeper in the AST (Figure 2.9), so reachability is detected recursively through these nodes, as seen in Figure 2.10. These node types are passive: they can be connected to the remote root node but they do not change the state of the compiler. Other nodes, namely new, if, call, forEach, variable assignment and variable declaration, do change the state of the compiler. When one of these nodes can be connected to the remote root, it is added to a list of reachable remote objects in State. At this point, the compiler changes the compilation mode to remote, which means that the BatchJS code is compiled to remote code. When traversal meets another node that cannot be connected to the root, the mode changes back to local.

In the example of Figure 2.9, the ForEach node asks its left-hand side child if it is remote. Because this node is a property access node, this check is performed recursively until a variable node that represents root is found. The ForEach node is then marked as a remote object and the ForEach, the left-hand side and the right-hand side function are compiled as remote code. Inside this function, the mode will be remote, until a node changes this.
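The recursive check through property access nodes can be sketched on a hand-written tree. The node shapes below (plain objects with a type field) are simplified assumptions and do not match the treehugger.js nodes used by BatchJS; the function isRemote is hypothetical.

```javascript
// Sketch: recursive remoteness check on a hand-written AST.
// The node shapes (Var, PropAccess) are simplified assumptions.
function isRemote(node, remoteNames) {
  switch (node.type) {
    case 'Var':
      return remoteNames.indexOf(node.name) !== -1;
    case 'PropAccess':
      // Passive node: remote only if the object it is accessed on is remote.
      return isRemote(node.object, remoteNames);
    default:
      return false;
  }
}

// db.Products : PropAccess(Var("db"), "Products")
var productsAccess = {
  type: 'PropAccess', property: 'Products',
  object: { type: 'Var', name: 'db' }
};
// console.log : PropAccess(Var("console"), "log")
var logAccess = {
  type: 'PropAccess', property: 'log',
  object: { type: 'Var', name: 'console' }
};

console.log(isRemote(productsAccess, ['db'])); // true
console.log(isRemote(logAccess, ['db']));      // false
```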

Passing state

State is dependent on the node that is visited. The visitor calls always pass state as an argument so State traverses through the abstract-syntax tree, as seen in Figure 2.11. As remoteness does not propagate through every node, nodes reset the remoteness of State at the end of their traversal. In the example in Figure 2.9, ForEach resets state after Function has been traversed, so the If node is not affected by the remote ForEach.


Figure 2.9: Recursive traversal through property access

Iterators

Iterators are a special case in batches. When an iterator (currently always a forEach) is remote, every iteration creates a different value for a remote variable with the same name. This value in a remote iterator is a cursor to an iteration, and every action performed on it will be executed on every element of the array [38]. To support both local as well as remote actions, a remote iterator always creates a local iterator as well. The generated local code fetches a list of results from the result-set and iterates over these results. This set is limited to results that are used by the local code. To keep track of nested loops and their iteration variables, State keeps a list of which loop is at a certain depth. Figure 2.12 shows how nested iterators are handled by BatchJS.


Figure 2.10: Recursive checking of remoteness

    State.prototype.traverseIsRemote = function(node)
    {
      var state = this;
      var remote = false;

      node.traverseTopDown('Var(x)', function(b) {
        if (state.isRemote(b.x.value)) {
          remote = true;

          // Exit the loop
          return;
        }
      });

      return remote;
    }

Figure 2.11: Passing of state in BatchJS

    Visitor.prototype.visitConsPostfixOp = function(node, args) {
      var state = args[1];
      ...
      var target = node[1];
      target.accept(this, state);
      ...

Figure 2.12: Iterators in BatchJS

    if (state.traverseIsRemote(node[0])) {
      state.remote = true;
      loopMode = 'both';
      cursorName = callback[1][0][0].value;
    }

    state.loop[state.loopdepth+1] = {
      inLoop: true,
      initLoop: true,
      mode: loopMode,
      item: cursorName
    };

    state.loopdepth++;

    if (state.remote) {
      ...


Output variables

Whenever a remote variable is used from a local mode, an output variable is added to the script. An output variable receives a unique incrementing positive integer number. Remote code is generated that assigns the output to this unique variable. Local code is generated that fetches the variable from the result set. The owner of the variable is either the remote root node, or a variable that holds an iteration. For example in Figure 2.14, product.getString("g0") means that the output variable g0 from the iteration variable product is used in the local code.
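The bookkeeping of output variables can be sketched as follows. The constructor OutputAllocator and its allocate method are hypothetical names, not part of BatchJS; only the g0 naming and the getString call are taken from the generated code in Figure 2.14.

```javascript
// Sketch of output-variable bookkeeping (names are illustrative).
function OutputAllocator() {
  this.counter = 0;
}

// Registers a remote expression that local code needs and returns
// the remote fragment (batch script) and the local fragment (JavaScript).
OutputAllocator.prototype.allocate = function (owner, remoteExpr) {
  var name = 'g' + this.counter++;
  return {
    name: name,
    remote: name + ' : ' + remoteExpr,
    local: owner + '.getString("' + name + '")'
  };
};

var outputs = new OutputAllocator();
var first = outputs.allocate('product', 'product.ProductName');
var second = outputs.allocate('product', 'product.UnitPrice');

console.log(first.remote); // g0 : product.ProductName
console.log(first.local);  // product.getString("g0")
console.log(second.name);  // g1
```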

Generation of code

The state object keeps a buffer of compiled local and remote code. The visitor methods update the mode of State and add code according to the behaviour of the visited statement or expression. As State traverses through the AST, this buffer is filled. Local code is a string of JavaScript code, remote code is a string of batch script. As mentioned before, State is always initialized with a remote root.

Figure 2.13 shows how BatchJS splits the code of one of the tests in BatchJS. In this code, db.Products is recognized as a property access on a remote property. As a result, the ForEach is registered as a remote loop where the iteration variable product is registered as a remote variable. The following if statement uses this variable product. The if statement is a node that changes the mode if it depends on remote variables. However, State was already in remote mode, therefore it does not change.

The console.log in the body of the if statement is local because console.log is a function call and console is a local variable. The arguments of this local function use the remote variable product. This creates an output variable for every iteration of products.

Generated code

Figure 2.14 shows the compiled result of the code from Figure 2.13. As can be seen, the batch statement has been compiled to an anonymous function which takes a variable $service$ as a parameter and passes the connector. In code, batch(let x : y) { z } gets substituted with (function($service$){ z })( y ). The let declared variable is only used by the compiler to determine which code is local and which code is remote. It does not show up in the generated code.

The body of the batch statement is compiled to 3 parts:

1. An initialization of any input variables for this batch statement in $input$.
2. The declaration and execution of the remote code.


Figure 2.13: Code splitting and AST

    batch(let db : connector) {
      // MODE: LOCAL

      console.log("In-stock products that cost more than $3.00:");
      // MODE: LOCAL

      db.Products.forEach(function(product) {
        /* MODE: REMOTE
         * AST:
         * ForEach(PropAccess(Var("db"), "Products"),
             [Function("", [FArg("product")], ...)]
         */
        if (product.UnitsInStock > 0 && product.UnitPrice > 3.00) {
          /* MODE: REMOTE
           * AST:
             If(Op("&&",
                  Op(">", PropAccess(Var("product"), "UnitsInStock"), Num("0")),
                  Op(">", PropAccess(Var("product"), "UnitPrice"), Num("3"))
                ), Block(...)
           */
          console.log(product.ProductName + " is in stock and costs more than $3.00.");
          /* MODE: LOCAL
           * AST:
             Call(PropAccess(Var("console"), "log"),
                  [Op("+", PropAccess(Var("product"), "ProductName"),
                      String(" is in stock and costs more than $3.00."))
                  ])
           */
        }
      });
    };


Figure 2.14: Compiled BatchJS code

    var connector = require("../../../src/runtime/jaba");

    (function($service$) {
      var $input$ = {};

      var $script$ = 'for product in root.Products do ( if (((product.UnitsInStock > 0) && (product.UnitPrice > 3))) then ( g0 : product.ProductName ) end; ) end;';

      var $batch$ = $service$.execute($script$, $input$, function($result$) {
        console.log("In-stock products that cost more than $3.00:");
        $result$.getIteration("product").forEach(function(product) {
          console.log((product.getString("g0") + " is in stock and costs more than $3.00."));
        });
      });
    })(connector);

The callback with the local code has a result parameter which contains the response from the service. How this service executes the script and how this result set works is up to the runtime. In this case the Jaba runtime is used but other runtimes, e.g., a REST, RMI or SQL implementation, may vary in the way they handle scripts and result sets.

2.2.4. Batch script syntax

BatchJS uses intermediate batch scripts in order to send batched operations to the server. The abstract syntax of the batch scripts generated by BatchJS is given in Figure 2.15. This language is sufficiently expressive to support the most common operations in many languages.

The language that is used in this study has been used in earlier implementations of batches in Java [17]. Recent seminar presentations by Cook [12], [13] suggest that batch scripts should be written in a very generic programming language whose representation may differ based on the client and server that are used. The test server and runtime created in the present study use a List of Expression objects [16] that corresponds to the defined syntax.

2.2.5. Runtime

The batch script is evaluated by the runtime. An example of such a runtime would be an HTTP connection to a web service that understands batch scripts or a receiver of remote procedure calls. In this study, the service will be an SQL connector. The service


Figure 2.15: Batch script syntax

    s  ::= s ; s | for id in e do s end; | if e then s | var id = e | e
    e  ::= !e | e op e | p
    p  ::= c | id | p.id | out : e | (e)
    op ::= == | != | || | && | > | >= | < | <= | + | - | * | /

s and e are respectively statements and expressions.

The service and the result set do have to follow an interface. The service needs a method execute which takes a script, an input object and a function. The result set needs a method getIteration to retrieve iterators from the result and a method getString to access output variables. Calls to these methods are performed from the compiled code.
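The interface described above can be illustrated with a stand-in runtime. The sketch below is not one of the BatchJS runtimes: mockService, ResultSet and ResultRow are hypothetical names, the product names are made-up sample data, and the script string is a shortened illustration. The stand-in ignores the script and invokes the callback immediately, whereas a real runtime would send the script to a server and call back when the response arrives.

```javascript
// Sketch of a stand-in runtime that follows the interface described above.
// It ignores the script and returns canned data.
function ResultRow(values) {
  this.values = values;
}
ResultRow.prototype.getString = function (name) {
  return String(this.values[name]);
};

function ResultSet(iterations) {
  this.iterations = iterations;
}
ResultSet.prototype.getIteration = function (name) {
  return this.iterations[name] || [];
};

var mockService = {
  lastScript: null,
  execute: function (script, input, callback) {
    this.lastScript = script;
    // Canned response; a real runtime would call back asynchronously.
    callback(new ResultSet({
      product: [new ResultRow({ g0: 'Chai' }), new ResultRow({ g0: 'Tofu' })]
    }));
  }
};

// Usage in the shape of the compiled code.
var printed = [];
(function ($service$) {
  var $input$ = {};
  var $script$ = 'for product in root.Products do ( g0 : product.ProductName ) end;';
  $service$.execute($script$, $input$, function ($result$) {
    $result$.getIteration("product").forEach(function (product) {
      printed.push(product.getString("g0") + ' is in stock');
    });
  });
})(mockService);

console.log(printed); // [ 'Chai is in stock', 'Tofu is in stock' ]
```

Because the compiled code only depends on execute, getIteration and getString, a stand-in like this could be swapped for another runtime without changing the compiled output.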

2.2.6. Limitations

Asynchronous batches

JavaScript is a language that does not support blocking operations. IO is handled using the event loop and this translates to events and callbacks in code. Batches are always used to handle remote operations. As a result, the body of a batch statement is actually a callback on the event that a result of a batch operation has arrived.

Limited operations in batches

BatchJS currently supports a limited set of operations in the body of batch statements. This limitation exists because BatchJS is an experimental research application. JavaScript is a dynamic, weakly typed language with first class function support and a lot of statements and expressions that influence the behaviour of batches. Implementing all this in a research phase is not necessary. Currently, the batch statement supports the following operations:


Iteration. In JavaScript there are a number of iteration methods. BatchJS only supports the forEach method to iterate over an array. This choice has been made because forEach is the accepted method of iterating over arrays in ECMAScript 5. As a result, the following forms of iteration are not supported by BatchJS:

• A for loop. This statement offers more flexibility than just iterating over a collection. This would make implementation more complicated on both the client side as well as the server side of batches.

• A while loop, or do...while loop. Like a for loop, its flexibility would make implementation too complicated for the goals of this study.

• A for...in loop. This statement loops over objects and returns only the keys in the object. To access the field of an iteration, an extra assignment is necessary.

• The for...each...in statement. This is a statement to iterate over arrays in JavaScript 1.6. It is not in an ECMAScript standard and the only widespread JavaScript engine that supports it is SpiderMonkey.
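The difference between forEach and for...in that is mentioned in the list above can be seen in a short sketch. The array contents are made-up sample data.

```javascript
// Sketch: forEach hands over each element, for...in only the keys.
var products = ['Chai', 'Tofu'];

var viaForEach = [];
products.forEach(function (product) {
  viaForEach.push(product);
});

var viaForIn = [];
for (var key in products) {
  if (Object.prototype.hasOwnProperty.call(products, key)) {
    viaForIn.push(key); // key is the string '0' or '1'
    // The element itself needs an extra lookup: products[key]
  }
}

console.log(viaForEach); // [ 'Chai', 'Tofu' ]
console.log(viaForIn);   // [ '0', '1' ]
```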

Local iteration. BatchJS currently does not support iteration over local variables, because this is outside the scope of this study. It would be fairly easy to implement because the compiler can already distinguish between local and remote iteration.

Index access. In JavaScript, properties of objects can be accessed using either a dot notation or an indexed notation with brackets. The indexed notation adds the option to access a variable index and can be used to access an offset of an array. Because it is not necessary for the present study, this feature is not implemented in BatchJS at this point.

Uninitialized variables. Variables currently have to be initialized immediately. The current batch script grammar does not support the declaration of variables without a value.

Nested batches. The batch statement cannot be placed inside the body of another batch statement at this point. For the present study this is not necessary. However, because batches are asynchronous in JavaScript, you would often want to execute a second batch statement in the callback event of the first. At this point, this is only possible by placing the second batch in a function and calling that function from the first batch.


3. Experiment and results

3.1. Experiment

In order to make a comparison between BatchJS for SQL databases and traditional ORM, a test suite was implemented using both technologies. This experiment uses the Northwind database that comes with SQL Server and Visual Basic [1], converted to the MySQL format [3] for use with the MySQL relational database management system (RDBMS).

The Northwind database was chosen for the following reasons:

• The database has a very common structure of related tables that you would often find in company databases.

• A test set on this database, called 101 LINQ samples, was used for the Batch2SQL proof of concept [18].

The test suite created in the experiment was used both to check the validity of BatchJS and to perform a benchmark of BatchJS against ORM and pure SQL.

SQL translations

This study focuses on the creation of the BatchJS compiler, the separation of local and remote code, and the performance advantage BatchJS will give over ORM. Building the translation from batch scripts to SQL queries is not within the scope of this investigation. Instead, BatchJS was tested using Batch2SQL. Batch2SQL is a Java translator of batch scripts to SQL queries [14] which was used in previous research using Jaba [17].

Environment

All experiments in this study were performed in the following environment:

BatchJS runtime environment. BatchJS runs as a node.js module. Because Linux is the recommended operating system for node.js, it is used as a host system for BatchJS. All experiments have been performed on a Debian virtual machine that runs on a Windows 7 host.


Batch2SQL. In order to use Batch2SQL, a test server was created. This server is able to process batch scripts as well as SQL queries. It returns a result tree for batch scripts or a JSON representation of the returned rows for SQL queries. The server uses Batch2SQL to handle batch scripts, and it uses the Batch2SQL database connection to perform pure SQL queries received from non-BatchJS benchmarks. In this way, any overhead created by the extra connection that BatchJS needs to Batch2SQL is cancelled out, because all benchmarks take the same route. In short, each test was executed in the following way:

node.js ↔ Batch2SQL test server ↔ mysql

Node.js uses an HTTP connection to connect to Batch2SQL. Batch2SQL uses a socket to interact with the MySQL database. The test server is available in the BatchJS GitHub repository (Appendix A).

RDBMS. In this experiment MariaDB 5.5 was used as the RDBMS. MariaDB is a community-developed, fully compatible, drop-in replacement for MySQL. To connect to MariaDB in node.js, the node-mysql plugin is used. This plugin is listed as the recommended RDBMS connector in the node.js documentation. MySQL is a popular system in node.js as well as in Java development. It is also used by the Jaba test suite for the Northwind database.

3.2. Test suite

The test suite consists of a subset of 101 LINQ samples from Microsoft [2]. This test suite was chosen because the most relevant related work, the Batch2SQL proof of concept with batch statements written in Jaba [17], uses this set as well.

Tests translated to BatchJS

Sixty-three of the LINQ samples operate solely on local sets, making them irrelevant for batches. Because there are actually only 100 samples in the set used in this study (as downloaded on April 26, 2012), this leaves 37 tests that operate on the Northwind database.

Tests 2&3, 15&19, 30&33 and 42&43 are, as pairs, structural duplicates with different atomic values. Due to the difference between LINQ and batches, their BatchJS versions would be duplicates in terms of structure and operations. When removing those duplicates we are left with 32 samples.

Twenty-two of these 32 samples are currently working in Batch2SQL for Jaba. Others do not work because of unimplemented features in the SQL translations or because of bugs in the translations that the authors of Batch2SQL did not explain and/or fix.


BatchJS does not support group by or order by operations. Batch2SQL does not allow these functions to be used from a batch script. Unfortunately, in the study that created Batch2SQL, these functions were implemented using reflection on the classes that parse Jaba code in Java. As a result, group by and order by clauses will not work from a script. This leaves 13 usable tests from the LINQ samples.

The LINQ samples do not include updates and deletes. The experiment uses the three insert, update and delete samples from previous research on Batch2SQL in Jaba, as well as four samples of our own that use bulk inserts, deletes and updates. In total, 15 tests were used to perform the experiment on BatchJS.

Node-orm and SQL-objects

The ORM framework that BatchJS was compared against is node-orm [33]. The creator of another ORM framework, called sequelize.js, performed a benchmark of different JavaScript ORM frameworks and generated some interesting results [20]. According to these results, node-orm performs better than sequelize.js as well as the popular ORM framework persistence.js.

To verify the results, a simple, naive SQL fetcher was created. This fetcher can do everything except join data, which ensures that it suffers from the n+1 query problem. In this study it is called “SQL-objects”.
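The n+1 query problem can be made concrete with a small sketch. It does not reproduce the SQL-objects fetcher: the in-memory tables, the query helper and the functions ordersNaive and ordersJoined are assumptions for illustration, and every call that increments queryCount stands for one round trip to the database.

```javascript
// Sketch of the n+1 query problem with an in-memory stand-in for a database.
// Table contents and helper names are made up for illustration.
var tables = {
  customers: [{ id: 1, region: 'NL' }, { id: 2, region: 'NL' }, { id: 3, region: 'UK' }],
  orders: [{ id: 10, customerId: 1 }, { id: 11, customerId: 2 }, { id: 12, customerId: 3 }]
};
var queryCount = 0;

// Every call stands for one round trip to the database.
function query(table, predicate) {
  queryCount++;
  return tables[table].filter(predicate);
}

// Naive fetcher: one query for the customers, then one query per customer.
function ordersNaive(region) {
  var result = [];
  query('customers', function (c) { return c.region === region; })
    .forEach(function (customer) {
      query('orders', function (o) { return o.customerId === customer.id; })
        .forEach(function (order) { result.push(order.id); });
    });
  return result;
}

// Joined variant: both tables are combined in a single round trip.
function ordersJoined(region) {
  queryCount++;
  return tables.orders.filter(function (o) {
    return tables.customers.some(function (c) {
      return c.id === o.customerId && c.region === region;
    });
  }).map(function (o) { return o.id; });
}

queryCount = 0;
var naiveOrders = ordersNaive('NL');
var naiveQueries = queryCount;

queryCount = 0;
var joinedOrders = ordersJoined('NL');
var joinedQueries = queryCount;

console.log(naiveOrders, naiveQueries);   // [ 10, 11 ] 3
console.log(joinedOrders, joinedQueries); // [ 10, 11 ] 1
```

With two matching customers the naive variant needs 1 + 2 queries, and the number of queries grows with every additional matching customer, while the joined variant stays at one.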

Native Query

In terms of performance, using SQL code to directly interact with the MySQL database should be expected to be the fastest. The “Native Query” solution in this experiment just uses a simple database connector and SQL queries. The queries are written to provide the most efficient solution for each test.

3.3. Benchmark

The goal of the benchmark in this experiment was to find out whether the performance is equal or clearly different. The benchmark consists of different scenarios, based on an altered and simplified version of the Northwind database. The database was simplified to keep the ORM models simple. The data was altered to give full control over the number of matching results in each test. The total size of the experiment in source lines of code is displayed in Table 3.1.


Table 3.1: Lines of code in the benchmarks and the experiment

    Type                                           LOC
    Benchmarking program                           821
    Batch2SQL test server (excluding libraries)    435
    Tests and benchmarks                          3600
    Experiment total SLOC                         4856

Validity of the results

In order to check the validity of the benchmarks, the output of all test scripts was saved to text files in a test run. All tests generate the same results although in some cases ordering of results may vary because it is not defined in the test.

Controlled result size and number of executions

Some of the benchmarks have a condition that limits the number of matching rows in a result set. They were executed with the result limited to 1, 4, 16, 64 and 256 rows. All benchmarks were executed 100 times in order to flatten out any anomalies in the results. To ensure that any serious issues with the benchmarks were noticed, a warning was printed in the results when a single execution took more than four times the average. This occurred only once, namely in the “bulk delete” test with 64 result rows, where the BatchJS version took 648ms once against an average of 139ms. As one result out of 100, this did not have a significant effect on the “bulk delete - 64 results” test.
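The outlier check described above can be sketched as follows. This is not the benchmarking program used in the experiment: the function summarize and the timings are made up, and the sketch assumes that the average includes the outlier itself.

```javascript
// Sketch of the outlier check: flag runs slower than four times the average.
// Assumption: the average is taken over all runs, including the outlier.
function summarize(timesMs) {
  var total = timesMs.reduce(function (sum, t) { return sum + t; }, 0);
  var average = total / timesMs.length;
  var outliers = timesMs.filter(function (t) { return t > 4 * average; });
  return { average: average, outliers: outliers };
}

// Made-up timings: 99 runs of 100 ms and one run of 500 ms.
var timings = [];
for (var i = 0; i < 99; i++) { timings.push(100); }
timings.push(500);

var summary = summarize(timings);
console.log(summary.average);  // 104
console.log(summary.outliers); // [ 500 ]
```

The reported case fits the same rule: 648ms is more than four times the average of 139ms (556ms).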

3.4. Results

All benchmark results are presented as bar charts in Appendix B and C. These appendices also provide the BatchJS code as well as the Native Query SQL of these tests. All results show the average execution time of a test in milliseconds, based on 100 executions. The following can be seen from the results:

• In LINQ sample #3 (Appendix B, Figure B.1) there is only a simple where clause with two conditions, as shown in the code example (Appendix B, Figure B.2). Because where clauses can be added to SQL-objects and because node-orm accepts conditions, all three tests should execute only one query. Node-orm has a slight performance disadvantage in larger result sets, but overall there is no big difference between the results.


Figure 3.1: LINQ sample #4 – “Where drill down”. Bar chart of execution time (ms) for 1, 4, 16, 64 and 256 results, comparing BatchJS, SQL-objects, node-orm and Native Query.

• LINQ sample #4 (Figure 3.1) is a situation where a drill down is performed on related tables. All customers with the region code NL are selected and for these customers the order numbers and dates are printed. In this case BatchJS gains a clear advantage over SQL-objects, node-orm as well as Native Query as result sets grow.

• LINQ sample #17 (Figure 3.2) shows a significant difference in performance between Native Query, BatchJS, SQL-objects and node-orm. In this test a nested loop is combined with an assignment of a new variable and a conditional statement inside the second loop (Appendix B, Figure B.8).


Figure 3.2: LINQ sample #17 – “Selectmany from assignment”. Bar chart of execution time (ms) for 1, 4, 16, 64 and 256 results, comparing BatchJS, SQL-objects, node-orm and Native Query.


Figure 3.3: LINQ sample #21 – “Take nested”. Bar chart of execution time (ms) for 1, 4, 16, 64 and 256 results, comparing BatchJS, SQL-objects, node-orm and Native Query.

• LINQ sample #21, as seen in Figure 3.3, shows another situation where BatchJS is clearly faster than SQL-objects and node-orm. In this test a nested loop is used with a conditional statement around the inner loop. The result of this test is interesting because the chart shows that the performance of the latter two increases as the result set grows. The performance of BatchJS stays about the same, decreasing the difference.

Figure 3.4: LINQ sample #23 – “Skip nested” [Chart: time (ms) for result sets of 1, 4, 16, 64 and 256 results; series: BatchJS, SQL-objects, node-orm, Native Query]

• LINQ sample #23 is a test that shows all results for which a condition on a related table is true, except the first three. Figure 3.4 shows that BatchJS gains an advantage over SQL-objects and node-orm as the result set grows.

Figure 3.5: Bulk delete [Chart: time (ms) for result sets of 1, 4, 16, 64 and 256 results; series: BatchJS, SQL-objects, node-orm, Native Query]

• Figure 3.5 shows the results for the bulk delete operation. In this test a category and all the products in this category are deleted. BatchJS, Native Query and SQL-objects give the same performance numbers but node-orm is slower. The difference grows as the number of rows that are deleted increases. This is an interesting result because this test, as can be seen in Appendix B, Figure B.17, is a good example of the n+1 query problem.

• Appendix B, Figure B.19 shows the results of a bulk update operation. In this sample all products that satisfy a condition are updated. As with the bulk delete, node-orm is outperformed by both BatchJS and SQL-objects. The difference grows as the number of updated products increases.

• Appendix C, Figures C.1 and C.10 show the results of benchmarks where the result set is of a fixed size. This means that there is no where condition that restricts the size or that this size cannot be varied within the test. Of the eight tests that run on fixed size sets, only LINQ samples #11 and #19 show a clear difference between BatchJS and SQL-objects or node-orm in terms of performance. These samples run nested for loops over the entire result set. “Insert 10 – select – delete” (Appendix C, Figure C.10) shows a result where the Native Query solution is faster than BatchJS.

4. Analysis

This study provides novel insights into the performance of batches, as well as into their implementation in JavaScript. The analysis of these results is based on the previously postulated hypotheses and the research question.

4.1. Implementation of BatchJS

Hypothesis I The batch statement can be implemented successfully in JavaScript.

BatchJS is a fully working implementation of batches in JavaScript. It shows that batches can be implemented in JavaScript.

BatchJS also gives insight into the functional differences between JavaScript and Java, the host language used in earlier work on batches. JavaScript only supports non-blocking code and therefore a batch may behave differently than one might expect. Most importantly, the local code part of a batch statement is a callback. It is not executed directly after the statement initialization (the execution of the remote code); instead it is put on the event queue. The body is executed when the event loop reaches the event signalling that the response to the batch script has been returned to the client. This is normal JavaScript behaviour, but it is not normally found in the form of a statement body. Ordinary JavaScript statements operate on local code only; the power of the batch statement is that its body is a combination of local and remote operations.
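The ordering described above can be illustrated with a minimal sketch in plain JavaScript. It is not BatchJS syntax: `fakeBatch` is a hypothetical stand-in in which `setTimeout` models the round trip to the server, `remote` represents the remote part and `body` the local part of a batch. The point is only that code written after the batch runs before the body does.

```javascript
// Hypothetical stand-in for a batch statement; none of this is BatchJS API.
// `remote` models the remote part, `body` the local part (a callback).
const log = [];

function fakeBatch(remote, body) {
  log.push('batch issued');
  // setTimeout models the network round trip: the body is only queued,
  // and runs after the currently executing code has finished.
  setTimeout(() => {
    const result = remote(); // pretend this ran on the server
    body(result);            // local part runs when the response event is handled
  }, 10);
}

fakeBatch(
  () => ['order 1', 'order 2'],
  (orders) => {
    log.push(`body ran with ${orders.length} orders`);
    console.log(log.join(' -> '));
  }
);

// This line appears after the batch in the source but executes before its body.
log.push('statement after the batch');
```

In this sketch the log ends up in the order "batch issued", "statement after the batch", "body ran with 2 orders", which is the behaviour a developer should expect from the local part of a batch.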

The asynchronous local code in a batch statement should not be considered a problem. There have been attempts to introduce blocking behaviour to JavaScript but this defies the architecture of a JavaScript engine. Non-blocking code is one of the main reasons server-side JavaScript is gaining popularity, so changing this behaviour is not recommended.

Hypothesis II It is possible to implement the batch statement in JavaScript in such a way that developing does not require additional compilation/translation steps by the developer.

BatchJS supports both compilation to a file and direct execution of the compiled code. JavaScript is a scripting language, so the ability to run a script directly means that a JavaScript developer does not have to change their workflow to adopt BatchJS.

Because BatchJS is a node.js module and uses the CommonJS module system, BatchJS modules and plain JavaScript modules can be intermingled in one JavaScript program.

4.2. Benchmark

Hypothesis III Performance of the batch statement will be at least as good as the performance of an ORM framework.

The n+1 query problem

LINQ sample #23 and the bulk update and delete samples (Appendix B, Figure B.13, Figure B.16 and Figure B.19, resp.) are clear examples of the n+1 query problem. In these examples, node-orm performs more subqueries as the result set grows. In the case of LINQ sample #23, both node-orm and the SQL-objects solution built for this experiment have to perform a subquery for every customer that matches a condition. The number of performed queries is n + 1, so for 256 results this means 257 queries. BatchJS performs a number of queries equal to [number of batch statements] ∗ [iterations]. In this case this translates to two queries. This test can be written in one query, as used by the Native Query solution in Figure 4.1.

Figure 4.1: LINQ example #23 in one query

    SELECT c.CustomerID, o.id, o.OrderDate
    FROM Orders o
    INNER JOIN Customers c ON c.id = o.CustomerID
    WHERE c.Region = "NL" # value depends on number of results
    ORDER BY o.CustomerID
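The query counts discussed above can be made concrete with a small sketch. It uses neither BatchJS, node-orm nor a real database: `makeDb`, its `query` method and the in-memory tables are hypothetical stand-ins that only count round trips. The first function follows the n+1 pattern (one query for the customers, one more per customer); the second fetches the same data in two set-based queries, which is the count the text reports for BatchJS in this test.

```javascript
// Hypothetical in-memory "database" that counts round trips.
// Nothing here is BatchJS or node-orm API; it only models query counts.
function makeDb(customerCount) {
  const customers = [];
  const orders = [];
  for (let i = 1; i <= customerCount; i++) {
    customers.push({ id: i, region: 'NL' });
    orders.push({ id: 100 + i, customerId: i });
  }
  return {
    queries: 0,
    // Each call stands for one round trip to the database.
    query(table, predicate) {
      this.queries++;
      const rows = table === 'customers' ? customers : orders;
      return rows.filter(predicate);
    },
  };
}

// n+1 pattern: one query for the customers, then one query per customer.
function ordersPerCustomerNPlusOne(db) {
  const result = [];
  for (const c of db.query('customers', (row) => row.region === 'NL')) {
    const own = db.query('orders', (o) => o.customerId === c.id);
    result.push({ customer: c.id, orders: own.map((o) => o.id) });
  }
  return result;
}

// Set-based pattern: two queries in total, combined on the client.
function ordersPerCustomerTwoQueries(db) {
  const selected = db.query('customers', (row) => row.region === 'NL');
  const ids = new Set(selected.map((c) => c.id));
  const allOrders = db.query('orders', (o) => ids.has(o.customerId));
  return selected.map((c) => ({
    customer: c.id,
    orders: allOrders.filter((o) => o.customerId === c.id).map((o) => o.id),
  }));
}

// Compare the number of round trips for the result sizes used in the benchmark.
for (const n of [1, 4, 16, 64, 256]) {
  const a = makeDb(n);
  const b = makeDb(n);
  ordersPerCustomerNPlusOne(a);
  ordersPerCustomerTwoQueries(b);
  console.log(`${n} results: n+1 pattern ${a.queries} queries, set-based ${b.queries} queries`);
}
```

For 256 matching customers the first function issues 257 queries and the second issues 2, while both return the same data. The sketch says nothing about absolute timings; it only shows why the number of round trips grows with the result set in one case and stays constant in the other.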

The results of the bulk update and delete statements are a bit different. SQL-objects does not suffer from the n+1 query problem in these cases because it can delete or update with a where clause. This is the effect of an implementation choice to let SQL-objects take an optional where clause with any query. Node-orm, however, has to perform a delete query for each matching row. In that case, n + 1 queries are performed.
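The difference between the two delete strategies can be sketched as follows. The table object and its `select`, `deleteWhere` and `size` methods are hypothetical and live in memory; they are not the API of SQL-objects or node-orm and only count how many statements would be sent to the database.

```javascript
// Hypothetical in-memory product table that counts the statements it receives.
function makeProductTable(n) {
  let rows = [];
  for (let i = 1; i <= n; i++) rows.push({ id: i, categoryId: 7 });
  rows.push({ id: n + 1, categoryId: 8 }); // one product in another category
  return {
    statements: 0,
    select(predicate) {
      this.statements++;
      return rows.filter(predicate);
    },
    deleteWhere(predicate) {
      this.statements++;
      const before = rows.length;
      rows = rows.filter((r) => !predicate(r));
      return before - rows.length; // number of deleted rows
    },
    size() {
      return rows.length;
    },
  };
}

// Row-by-row: select the matching rows, then delete each one separately.
function deletePerRow(table, categoryId) {
  const matches = table.select((r) => r.categoryId === categoryId);
  for (const m of matches) {
    table.deleteWhere((r) => r.id === m.id);
  }
}

// Set-based: a single delete statement with a where condition.
function deleteWithWhere(table, categoryId) {
  table.deleteWhere((r) => r.categoryId === categoryId);
}

const perRow = makeProductTable(256);
deletePerRow(perRow, 7);
const setBased = makeProductTable(256);
deleteWithWhere(setBased, 7);
console.log(`per row: ${perRow.statements} statements, with where: ${setBased.statements} statement`);
```

With 256 matching products the row-by-row variant sends 257 statements (one select plus 256 deletes), whereas the variant with a where condition sends one. Both leave the same single unrelated product in the table.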

The results from these examples match the hypothesis based on the theory of batches. The batch statement with two forEach loops performs two queries regardless of the number of matching results, so it does not suffer from the n+1 query problem. Nonetheless, one could write SQL code to do this in one query. This is visible in a small but consistent performance advantage for the pure query solution in these tests.
