
Master Thesis

Crowdsourced software development with microtasks

Job Witteman
job witteman@outlook.com

October 31, 2016, 161 pages

Supervisor: Hans Dekkers
Host organisation: UvA


Contents

Abstract
Preface
1 Introduction
   1.1 Initial study
   1.2 Problem statement
   1.3 Research question
   1.4 Aims & objectives
       1.4.1 Relevance
       1.4.2 Contributions
   1.5 Outline
   1.6 Conventions
2 Background
   2.1 Crowdsourcing software development
       2.1.1 Programming with microtasks
   2.2 Platforms
       2.2.1 Amazon Mechanical Turk
       2.2.2 Topcoder
   2.3 CrowdCode
       2.3.1 Key properties
       2.3.2 Platform infrastructure and programming languages
       2.3.3 Why CrowdCode
   2.4 Learning about solutions in programming
3 Theory - Learning in CrowdCode
   3.1 Lemma
   3.2 Sense making
   3.3 Finding errors in task understanding & execution
   3.4 Problems in crowdsourcing with microtasks
   3.5 Hypotheses
       3.5.1 Hypothesis 1
       3.5.2 Hypothesis 2
4 Research setup
   4.1 Method
   4.2 Experiment variations
       4.2.1 Assignments
       4.2.2 Crowd sizes
       4.2.3 Experiment durations
   4.3 Variables
       4.3.1 Measurements
       4.3.2 Quality of the solution
   4.4 Other variables
   4.5 Overview experiments
   4.6 Data processing
   4.7 Recruiting participants
   4.8 Informing the workers
5 Research execution
   5.1 Recruiting workers
   5.2 Individual experiments
       5.2.1 Results
       5.2.2 Worker experiences
   5.3 Small crowd experiments
       5.3.1 Results
       5.3.2 Worker experiences
   5.4 Big crowd experiments
       5.4.1 Results
       5.4.2 Worker experiences
6 Data analysis and results
   6.1 Metrics
       6.1.1 Microtask durations
       6.1.2 Quality of the solution
       6.1.3 Submission sizes
   6.2 Observations
       6.2.1 Divergent solutions
       6.2.2 Applied course corrections
   6.3 Other indications
       6.3.1 Worker coordination
       6.3.2 Microtask execution with a big crowd
7 Evaluation
   7.1 Hypotheses validation
       7.1.1 Hypothesis 1
       7.1.2 Hypothesis 2
   7.2 Answering research question
   7.3 Limitations
   7.4 Claims
8 Reflection
   8.1 Challenges
   8.2 Future work
       8.2.1 System improvement
       8.2.2 Research directions
9 Conclusion
Bibliography
Glossary
Appendices
A Assignments
   A.1 Todolist assignment
   A.2 Checkers assignment
       A.2.1 Definition difficult assignment
       A.2.2 Data Types
       A.2.3 Functions
B Promotion
   B.1 Flyer Experiments
C CrowdCode
   C.1 CrowdCode 1 - Initial version
   C.2 CrowdCode 2 - A modified workflow
D Measurements
   D.1 Microtask durations
       D.1.1 Theory
       D.1.2 Calculation
       D.1.3 Expectation 1
   D.2 Quality of the solution
       D.2.1 Theory
       D.2.2 Calculation
       D.2.3 Expectation 2
       D.2.4 Expectation 4
   D.3 Submission sizes
       D.3.1 Theory
       D.3.2 Calculation
       D.3.3 Expectation 3
   D.4 Divergent solutions
       D.4.1 Theory
       D.4.2 Observe
       D.4.3 Expectation 5
   D.5 Course corrections
       D.5.1 Theory
       D.5.2 Observe
       D.5.3 Expectation 6
E Gathered data
   E.1 Study replication
       E.1.1 Experiment setup
       E.1.2 Results
       E.1.3 Worker experiences
   E.2 Surveys workers
       E.2.1 Pre-study survey results
       E.2.2 Post-study survey results
       E.2.3 Big crowd survey results
   E.3 Main variables
       E.3.1 Durations
       E.3.2 Quality
       E.3.3 Submission sizes
   E.4 Other variables
       E.4.1 Quality with supporting functions
       E.4.2 Reviews
       E.4.3 Skipping behavior
       E.4.4 Durations per microtask type
F Test quality
   F.1 Expert specifications
       F.1.1 TodoList
       F.1.2 Checkers
G Observations


Abstract

This research evaluates the concept of crowdsourced software development, supported by the tool CrowdCode. In this tool, microtasks are used to crowdsource programming work. We hypothesize that context information is required in order to execute microtasks well. We argue that to get the required context information, workers need to do bigger tasks, which invalidates the concept of microtasks.

The different approaches and findings in the software development field with regard to task decomposition show that researchers have not found an optimal granularity yet. This study gives more insight into whether working with a very small task granularity can work. If it can, potential benefits such as higher scalability, faster time-to-market and lower contribution barriers may come within reach.

Several other tools for crowdsourced software development exist, but none of the existing tools use microtasks. Most of the tools use a larger granularity for the task decomposition. Workers in these tools have more freedom in their work than when working with microtasks, and must have more knowledge about the problem domain. Existing solutions that do use microtasks are aimed at simpler work than programming. Despite the common belief that microtasks can only work for simple work, several researchers share the belief that it is possible to do more complicated work with microtasks. The design of the microtasks is an open problem.

In our experiments we varied the number of participants and gave the participants one of two assignments of different complexity.

Of the five experiments in which participants received an easy assignment, three were completed and two were almost completed. For the latter, one or two functions were not finished yet (out of five requested functions).

None of the five experiments in which participants received a difficult assignment were completed. All presented rough versions of the requested functions. In several experiments supporting functions were created which improved the possibility for working in parallel.

The quality of the work that individual workers achieved was not clearly better than the quality of the work from the crowds. The work of the crowds showed attempts to coordinate tasks, but there was not a specific coordination method that was used more often than other methods.

The data gathered in this study does not invalidate the concept of microtasks. We have conducted qualitative research with human participants. The amount of collected data is not sufficient for statistical significance, so the findings in this study should be considered with caution. The quality and submission sizes in the crowd experiments were higher than expected. The number of divergent solutions detected in the crowd experiments was low.


Preface

I’ve always been interested in how people work together, especially in making software. Both the technical side and the psychological side are extremely interesting and challenging. For outsiders, it is a magic process. For insiders, it includes a tremendous amount of energy, coordination, knowledge, skills and dedication. There are numerous methods for making a team work together, but a holy grail does not seem to exist.

My interest extends to the possibilities of crowdsourcing. This is a way of working that has gained extraordinary popularity, and not without reason. The boundaries of what is possible are yet to be found. Triggered by James Surowiecki's book The Wisdom of Crowds, I think we can do more with crowdsourcing than what we currently believe is possible.

In this thesis we try to find out whether it is feasible to work together in a crowdsourcing environment with microtasks. It is a very different way of working, compared to what most software engineers are used to. The possible benefits are extremely interesting, but are they within reach? Equally important, what are the tradeoffs?

I want to give a special thanks to Thomas LaToza, for letting me conduct experiments with his tool CrowdCode and for making the time to give me advice. This thesis would not have been possible without his support.

Also I want to thank my supervisor Hans Dekkers, for guiding me through the hazards of academic research and for giving me valuable advice, when I was confronted with an infinite set of icebergs.

I want to thank my friends, family and colleagues at Heerema Marine Contractors, for their support and for dealing with someone that suffers from a chronic lack of time. You have my gratitude.


Chapter 1

Introduction

Crowdsourcing has received considerable attention in recent years. The meaning and definition of crowdsourcing have been widely debated. In this study the definition from Howe is used [3]:

”The act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.”

Crowdsourcing is used in software engineering as well. For crowdsourced software development, the definition from Stol and Fitzgerald is used [27, p.3]:

”The accomplishment of specified software development tasks on behalf of an organization by a large and typically undefined group of external people with the requisite specialist knowledge through an open call.”

Several platforms crowdsource work in the form of microtasks. An example of such a platform is Amazon Mechanical Turk (AMT). Microtasks must be short, independent and self-contained. Currently, they are used mostly for simple work. Microtasks are executed by persons in the crowd; throughout this study we call these persons workers.

Researchers have tried to execute more complex work activities with microtasks as well. Several researchers believe that programming tasks can be decomposed into microtasks. The benefits of such a decomposition are that microtasks have a low entry barrier and can be executed in parallel. Thus, a major advantage of microtasks is their potential scalability. Because they are small and self-contained, they can be distributed over an arbitrarily large crowd [19, p. 3].

In this study, it is verified whether programming work can be executed well with a microtask system. A major challenge is that programming is complicated work. Developers need to learn what is expected and build a mental model of the situation. They need to take into account what has already been built and work together with other developers to develop new solutions. Work needs to be decomposed into tasks and these tasks need to be coordinated between developers. Often, developers need to have an understanding of a large portion of the system and the underlying processes.

The statements above have implications for executing programming activities with microtasks. Developers still need access to a certain amount of information to learn about the solution. Even harder, their small solutions for the microtasks need to converge into one solution for a more complex problem. So far, there is no concrete evidence that this is possible.

1.1

Initial study

LaToza et al. have developed an approach for decomposing programming work into microtasks [15]. In their publication they introduce a cloud IDE for crowd development: CrowdCode. In the user study that LaToza et al. conducted with CrowdCode, it appeared to be possible to write code and unit tests with a small crowd. One of the key properties of CrowdCode is that it generates microtasks dynamically. The work is coordinated through a graph of artifacts. Based on the events that happen on an artifact, new microtasks are spawned. Artifacts can have multiple dependencies and generate new microtasks. A microtask queue is in place for handling tasks sequentially.

Despite the fact that LaToza et al. were able to perform programming work with their system, they’ve also described considerable challenges. The lack of task information in microtasks will make it harder for workers to understand what to do. The researchers experienced an overhead in the development process. The belief is that the code produced per man-hour is likely to be lower in this kind of environment. Ensuring good quality of the work is another major challenge [15, p.11].

In this study we refer to this version of the system as CrowdCode 1. After some experimentation with CrowdCode 1, Lecce and Ricci made several changes to the system [24]. In their version more task information is given to the workers.

The intention of this choice was to improve the quality of the submissions by reducing the number of mistakes made by the workers. We refer to this version of the system as CrowdCode 2.

1.2

Problem statement

The problem that we study is whether a crowd is able to build a good solution by means of programming microtasks. The work from LaToza et al. has shown that it is possible to produce tests and code. No claims are made about the quality of the work and the differences between microtasked development and traditional development. It is unknown whether a solution was devised by one or two individuals in the crowd, or by the crowd as a group. The latter would mean that a crowd is able to learn about the solution. This contradicts our belief that learning is an individual activity. By sharing pieces of knowledge and information, other individuals may learn as well. A problem is that not all individual workers in the crowd are aware of the solution direction of other workers, and may attempt to build their own solution.

1.3

Research question

Our research is guided by the following question:

Is the crowd able to learn how to fix a complex software construction problem in a microtasking environment?

1.4

Aims & objectives

According to our literature study it is not yet known whether we can use microtasks for crowdsourced software development. There is no empirical evidence that this is possible. Based on the literature that was found, we have formed our own theory and defined hypotheses to validate this theory.

The CrowdCode system has been used in this study because earlier user studies showed promising results [12] [15]. There are no other examples of concrete systems that crowdsource programming tasks through microtasks.


1.4.1

Relevance

• Software development with high scalability: due to the smallness and independence of microtasks, many developers can work on distinct microtasks in parallel. Work can be distributed to an arbitrarily large crowd. Software may be delivered sooner, because a crowd is not restricted to office hours [19, p.3].

• Lower development costs: by paying per performed microtask, companies may lower their software development costs. Internal development teams are expensive and highly specialized. The costs for performing microtasks at a platform like AMT are low compared to the costs for traditional software development work. For the execution of programming microtasks, more specialized workers may be required [17, p.2].

For companies this way of working might provide a viable alternative to their expensive software development. Through this approach, a company may need fewer in-house developers and can crowdsource the parts of the work that lend themselves well to development through microtasks.

1.4.2

Contributions

This study makes the following contributions in the area of crowdsourced software development with microtasks:

1. Replication of previous user study [12]: in order to understand the current status of the CrowdCode tool, we have replicated one of the previous user studies. The replication process and results are discussed in appendix E.1.

2. Established a learning theory for crowd programming with microtasks: we've defined a theory that gives support for reasoning about programming work with microtasks. Our hypotheses are based on this theory. The theory is explained in chapter 3.

3. Gathered empirical data: through user studies we've collected considerable information from workers performing real work with a microtask tool for software development. The analysis of this data is presented in chapter 6.

4. A vision on programming work with microtasks: we have explored the possibilities by conducting a series of user studies with the CrowdCode tool and we have validated whether it presents a plausible way of executing programming work. This is discussed in chapter 7.

1.5

Outline

The outline for this thesis is as follows. In chapter 2 we elaborate on the literature about crowdsourced software development and microtasks. Also, we explain the CrowdCode tool and discuss our interest in it.

Based on our literature study and knowledge about CrowdCode we come to a theory and hypotheses in chapter 3.

In order to validate our theory and hypotheses, experimentation is done with CrowdCode 2. No earlier user studies have been done with this version. Before the experiments some minor corrections were applied. 10 experiments have been performed in which we varied three different crowd sizes (individual, small crowd and big crowd) and two different assignments (todolist and checkers). The setup and structure of the experiments are explained in chapter 4. The execution is described in chapter 5. The data gathered in the experiments has been analyzed. The results of the analysis are described in chapter 6.

In chapter 7 the results from the data analysis are evaluated and the hypotheses are validated. We have drawn some lessons learned from this research, which are explained in chapter 8. We'll conclude in chapter 9.


1.6

Conventions

In this thesis, the following conventions are used.

• Quote: "This is a quote."
• Keyword/variable: keyword
• Key sentence/question: This is an important sentence.
• Reference to theoretical definition: hypothesis 1
• Outcome value: false
• Name of program construct (such as a function): doSomething
• Reference to chapter: chapter 1
• Reference to (sub)section: section 1.1
• Reference to appendix: appendix A
• Reference to bibliography: [1]
• Reference to glossary: AMT
• Formulas are displayed in italic and are numbered:

a² + b² = c²   (1.1)

In the appendix code fragments (JavaScript, JSON) are displayed in the following way:

/**
 ** Some comment
 **/

function doSomething(inputParam) {
    var someVariable = inputParam;
    return someVariable;
}


Chapter 2

Background

In this chapter the findings from the literature study are discussed. Our focus is on identifying how existing crowdsourcing research and crowdsourcing platforms address programming tasks, and in particular how existing solutions decompose work into smaller tasks. The tool CrowdCode, which we have used for our experiments in this study, is discussed in more detail. The last part of this chapter discusses the existing perspective on learning in programming. This forms the basis for our constructed theory, which is introduced in chapter 3.

2.1

Crowdsourcing software development

The attention towards crowdsourced software engineering in general has grown considerably over the last years. Mao et al. published a survey which discussed the publications from different software engineering perspectives in a crowdsourcing context [22]. Various platforms are discussed in which the primary concern is crowdsourcing of programming work. The programming platforms offer a wide range of purposes. In this chapter relevant platforms are discussed.

Fitzgerald and Stol indicate that there are potential benefits in crowdsourcing software development [2]. They indicate that it can lead to cost reduction, faster time-to-market and higher quality. This belief is shared by LaToza et al. [17]. Fitzgerald and Stol emphasized the necessity for more research in the area of crowdsourced software development [26]. In a case study six key concerns in crowdsourced software development have been identified [2]:

• Task decomposition
• Coordination
• Planning and scheduling
• Intellectual property
• Motivation
• Quality assurance

Our focus is on the task decomposition concern. Researchers consider this to be a challenge and an open problem in crowdsourced software development [22, p.24] [16] [26]. In most platforms the granularity of tasks is large. This means that developers need to learn about the entire task and the underlying processes. In this learning process, tacit knowledge is developed and a mental picture about the context is formed [20]. In crowdsourcing platforms, tasks are often kept small and isolated. For programming tasks, this is very hard to do. LaToza et al. proposed to do this by means of a microtasking platform [15].

2.1.1

Programming with microtasks

LaToza et al. published their ideas for programming with microtasks [17]. By having small and independent tasks, important benefits could be achieved. For example, realizing a shorter time-to-market is a potential benefit. This vision assumes that it is possible to decompose the work into microtasks and execute these microtasks in parallel.

Furthermore, microtasks can lower the barriers for (non-)developers to contribute to a software development project. As a microtask can be performed within a couple of minutes, it is easier for a worker to start contributing. This is different in Open-source software (OSS) development. In OSS barriers to participate exist, especially for new developers. This form of development requires the developer to learn about the codebase, learn about the associated tools and socialize with the open-source community. In a microtasking environment, the worker can learn about the platform, while working on microtasks [15, p.11].

The literature does not yet show a microtask design specific to programming. Currently, there are no commercial platforms that execute programming work with microtasks. The biggest example of a platform for crowdsourced software development, Topcoder, uses a waterfall model with larger development tasks.

In a research agenda proposed by LaToza et al., several challenges are mentioned in the areas of decomposition, coordination and quality [18, p.2-3]. These areas have been heavily investigated in traditional software development, but this is not the case in a microtasking environment. It is unknown what the granularity of a task should be. Smaller tasks can increase parallelism, but may also cause an increase in communication overhead.

Intertwined with task decomposition is coordination. Challenges lie in matching work to workers, while taking into account the specialism of a worker. Tracking which work is done and automatically generating new microtasks are also complicated.

Managing quality is another challenge. A microtask may fail due to malicious work, disappearing workers or legitimate error [17, p.1].

2.2

Platforms

Both microtasks and crowdsourced programming are concepts that have been used for years by several commercial platforms. In this section two of the biggest platforms are discussed.

2.2.1

Amazon Mechanical Turk

Amazon Mechanical Turk (AMT) is one of the oldest existing platforms for crowdsourcing by means of microtasks. It was introduced in 2005. The platform is used for several software engineering activities as well. Microtasks on AMT are called Human Intelligence Tasks (HIT). The nature of this work can vary greatly. In this chapter some examples are shown that use AMT.

AMT is one of the few commercial public platforms that uses microtasks to decompose work. It is basically a marketplace for microtasks. Microtasks are presented as HITs. A worker is called a turker at AMT. HITs can be created through a web interface, the command line, or through an API [28, p.3]. For researchers, this has been an interesting opportunity for experimentation. It is positioned as a general-purpose crowdsourcing platform [25].

Kittur et al. have experimented with AMT as a user study platform [6]. Workers had to rate the quality of articles on Wikipedia. They’ve found out that small changes in the design of the task had a high impact on the quality. It is indicated that tasks should contain explicitly verifiable questions as a part of the task. A mechanism should be in place to detect malicious work. Next, several platforms that use AMT are discussed.


Turkomatic

Turkomatic, developed by Kulkarni et al. [10], lets the crowd itself decompose a larger request into smaller tasks. The initial request is done in natural language. In their experiments, for example creating and populating a blog, intervention from the requester was often required in order to successfully complete tasks. It was noticed that expert workers provided better instructions in their subdivisions. Some of the tasks starved or derailed because of bad instructions. This means respectively that no workers attempted to perform the task anymore, or that workers could not perform the task due to poor quality of the task. In a supervised form, it was possible to come to an acceptable result. The experiments showed that it is difficult for average workers to come up with good, self-contained task descriptions. This insight has impact on how programming microtasks should be designed.

TurKit

TurKit, developed by Little et al., is another example that makes use of AMT. It tries to take the microtask concept a step further [21]. It is a toolkit for prototyping and exploring algorithmic human computation. In the toolkit a crash-and-rerun programming model is introduced. The idea that Little et al. wanted to test is whether it is possible to execute iterative, sequential tasks through AMT. Through an imperative programming style, called TurKit Script, it is possible to make calls to AMT. As these calls are dependent on human input at AMT, it may take a considerable time for the script to complete. It is possible to let the script rerun until the input is in place. TurKit shows how microtasks can be incorporated in a bigger task.
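To make the crash-and-rerun model more concrete, the fragment below sketches the idea in plain JavaScript. This is not TurKit Script itself: the helpers once, postHit and getHitResult are hypothetical stand-ins, and the real toolkit persists its memoization in a database and posts actual HITs to AMT.

// A minimal sketch of crash-and-rerun, assuming hypothetical AMT helpers.
// Completed steps are memoized; the script simply crashes and is rerun
// until all human input is available (TurKit persists the memo table itself).
var memo = {};

function once(key, action) {
    if (key in memo) return memo[key];      // step already completed in an earlier run
    var result = action();                  // may throw to "crash" the script
    memo[key] = result;
    return result;
}

var pendingHits = {};                       // stub standing in for AMT
function postHit(question) {
    var id = "hit-" + Object.keys(pendingHits).length;
    pendingHits[id] = null;                 // no worker answer yet
    return id;
}
function getHitResult(id) { return pendingHits[id]; }

function script() {
    var hitId = once("post", function () { return postHit("Describe this image"); });
    return once("answer", function () {
        var result = getHitResult(hitId);
        if (result === null) throw new Error("crash: waiting for a worker");
        return result;
    });
}

try { script(); } catch (e) { /* no answer yet: rerun later */ }
pendingHits["hit-0"] = "a cat on a couch";  // a worker answers in the meantime
console.log(script());                      // the post step is skipped, the answer is returned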

CrowdForge

Another prototype that makes use of the features of AMT is CrowdForge [7]. In this case a variant on the MapReduce pattern is used. Tasks are first partitioned by the crowd and broken down into discrete subtasks. Next, the subtasks are executed by one or more workers. Finally, in the reduce step the results of multiple workers are merged into a single output.
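The flow can be sketched as follows. This is a schematic illustration of the partition/map/reduce idea and not CrowdForge's actual API; the three crowd steps are simulated here with plain JavaScript functions, whereas CrowdForge posts each step as a HIT.

// Schematic partition/map/reduce flow; in CrowdForge each step is crowd work.
function partitionTask(request) {
    // Partition step: workers break the request into discrete subtasks.
    return ["Write an introduction for: " + request,
            "Collect key facts for: " + request,
            "Write a conclusion for: " + request];
}

function mapSubtask(subtask) {
    // Map step: one or more workers complete a single subtask.
    return "[worker result for '" + subtask + "']";
}

function reduceResults(partialResults) {
    // Reduce step: a worker merges the partial results into a single output.
    return partialResults.join("\n");
}

var subtasks = partitionTask("a short article about a city");
var partials = subtasks.map(mapSubtask);
console.log(reduceResults(partials));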

2.2.2

Topcoder

Topcoder is the best known example of a commercial platform that crowdsources software development. It does so by organizing design competitions and by offering financial awards for the winners, which is described in detail by Lakhani et al. [11]. Two kinds of competitions exist: algorithm development and client software development.

In the latter competition type, the client plays an active role in the development process. The client provides a specification, or may gather ideas through a conceptualization competition. Platform managers from Topcoder help the client through the development process and with finding a proper task decomposition.

The client picks the best offered solution or solutions. The task descriptions are not generated, they are specified by the client with help from the platform managers. Stol and Fitzgerald indicate in a case study that Topcoder uses a waterfall approach [27, p.1]. In the case study the company uses a large task granularity for the distinct development phases. 1031 pages of technical documentation had to be written, distributed over the five development phases [27, p.6]. More documentation was needed than when the development was done internally. The case study shows a complication that can also be expected in crowdsourced software development with microtasks. The crowd is lacking domain knowledge. In the case study, the company also indicates this as a problem [27, p.5]. A challenge in a microtasking platform is to provide just enough domain knowledge with which the worker can complete the microtask.

2.3

CrowdCode

LaToza et al. introduced the CrowdCode tool to evaluate the feasibility of microtask programming [15, p.1]. With CrowdCode we have a concrete prototype in which we can experiment and explore microtasked programming further. The tool provides us with the means of verifying whether programming with microtasks is a valid approach for creating solutions of sufficient quality within an acceptable amount of time.

The first publication from LaToza et al. about CrowdCode was published in 2014 [15]. We refer to this version as CrowdCode 1. The system was modified radically by Ricci and di Lecce [24]. They’ve made several important changes to the microtask workflow. We refer to this version as CrowdCode 2.

The differences between CrowdCode 1 and CrowdCode 2 are explained in detail in appendix C.

2.3.1

Key properties

The platform has the following key properties [24]:

• An initial task description is supplied, along with the necessary Abstract Data Types (ADTs).
• The platform uses characteristics from Test Driven Development and Behavior Driven Development in the workflow.
• The workflow consists of microtask types for writing unit tests, writing code for functions, doing reviews and correcting work.
• Subsequent microtasks can be generated dynamically, depending on changes that the crowd makes. New microtasks are spawned as long as this is necessary.
• A worker can work for at most 10 minutes on a microtask and write at most 10 lines of code.
• The platform automatically assigns microtasks to workers.
• Workers can gather points by completing microtasks and by achievements, which is reflected by a leaderboard.
• The platform offers a Questions & Answers system through which workers can communicate with others in the crowd.

2.3.2

Platform infrastructure and programming languages

The system frontend is written in Angular and the backend in Java. It makes use of the realtime NoSQL database Firebase for storing data in JSON format and synchronizing data across clients in realtime. For the backend services the Google Datastore is used, part of the Google Cloud Platform. The artifact and microtask states are stored here. The connected clients get their data from Firebase. The system runs in Google App Engine.

2.3.3

Why CrowdCode

So far, no other tools have been developed for doing programming work with microtasks. The research from LaToza et al. has already shown that it is possible, to some extent, to perform programming work [15]. Although not proven yet, the potential benefits of this development method can be of significant importance. Learning how to work with a dynamic workforce such as the crowd is a big challenge, but its importance has grown over the last years.

In this study CrowdCode is used because it is the only concrete system for doing programming work in this way. With it, it is possible to reason about a very important challenge in crowdsourced software development: task decomposition. The dynamic generation of microtasks through function artifacts is a good idea and the tool may provide valuable insight into the underlying processes. By presenting the same client request to various crowds, we can gather useful empirical data.

In our research we have used CrowdCode 2. During the research, small adjustments were made to fix the most critical bugs.


2.4

Learning about solutions in programming

In order to understand whether microtasks can work well in a programming environment, we need to know how individuals come to a solution in programming. How do they learn about a problem domain and how do they come up with a solution for such a problem? Psychologists have researched problem solving behavior extensively and throughout this study their theories are used.

The biggest difference between a novice and an expert programmer is that the latter has a large set of patterns available for numerous kinds of programming problems. The expert can access and use this information intuitively. This has been described in the Naturalistic Decision Making (NDM) model, which has been introduced by Klein [8]. When people need to make a decision or solve a problem, they'll first try to solve it intuitively. Kahneman refers to this as system 1 [5]. This system acts fast and unconsciously. Based on the information that is available, people try to intuitively match it against a situation from earlier experiences. If the situation is recognized, people already have a solution or a strategy available for that pattern.

Always trying to use the same pattern solution is obviously not sufficient. Situations might share cues, but have different cues as well. Earlier solutions might not work. For this, people use their analytical system, also called system 2 by Kahneman. This is the slow and deliberate part, which requires much more cognitive effort. In NDM, system 2 is used to do a form of mental simulation of a particular action for a situation. People mentally assess whether an action works out for a situation. How thoroughly this is done depends on other constraints, such as the time available. The combination of recognition and mental simulation increases the chance of having a good solution for the situation. If people do not recognize a situation at all, or their expectancies are violated, they will reassess the situation and look for more information.

The way people come to their judgments and decisions is part of bounded rationality theory, as explained by Kahneman [4]. People use their intuition and deliberate reasoning to come to solutions. In their reasoning, people need to deal with their cognitive limitations, the limited information that is available and time constraints. Often, people are not aware of their own intuitive cognitive biases and limitations.

If confronted with problems for which no pattern exists yet, people often try to use more general problem solving techniques. These techniques can help to understand the problem better and to look at the problem from new angles. In mathematics these techniques are well established. Polya described an extensive range of supporting techniques that problem solvers may use [23]. According to Polya, it is important to first fully understand the problem, before trying to think of a solution.

These theories have important consequences for reasoning about programming with microtasks. Programmers are familiar with seeking more information in order to construct a mental model of the situation [20]. They try to build a high-level perspective on the situation. In a microtask environment, where the tasks are small and self-contained, it is expected that workers do not have a high-level perspective. A microtasking system must deal with this limitation. Another challenge is that each individual matches situations against his/her own set of known patterns. Individual patterns from different workers may be radically different. As workers do not know each other, it may be much more difficult to align different solution directions. The diversity of workers can generate more solution directions, but it is not clear whether these can converge into a single good solution. A system should facilitate this alignment process.

As the case study with Topcoder showed, the requesting party must give a much more detailed specification to external workers than would be the case if the system were to be developed internally [27].

Also, as Kulkarni et al. indicated in their experiments with Turkomatic, workers have difficulty with instructional writing [10, p.6]. As programming work consists of many interdependencies, it is important that instructions are written clearly in order to provide clear subsequent work.

A problem arises when all or most of the workers in a microtasking system are confronted with unfamiliar problems. If no worker is able to match an existing pattern, it might not be possible to come up with a solution at all. Due to a lack of information workers might start with a wrong solution direction. Whether workers are able to recognize these kinds of mistakes is unclear at the moment. Similar to Polya's techniques for solving mathematical problems, programmers often have a suite of general purpose patterns with which they attack problems. These patterns may require a degree of freedom in a working environment, which may not be possible in a microtasking environment. Another issue is that other workers need to understand that a general purpose solution is tried.

In the next chapter, a theory is introduced with which we reason about learning in a programming environment with microtasks.


Chapter 3

Theory - Learning in CrowdCode

Based on our knowledge from the literature study and experience with building software we have constructed a new theory that we will use within our research. In this chapter we’ll illustrate the theory and formulate our hypotheses.

3.1

Lemma

We base our theory on the following lemma:

Lemma 1 Building software is a learning process.

• To be able to build good software, we need to learn what to build, how to build it and how to assess the quality of what we’ve built.

• Software is built by several people over time. We assign tasks to different people in different time slots.

• The software that we develop is too big to develop individually or monolithically.

• When starting development we identify subsystems, modules, components, algorithms and/or data structures. We expect that these are part of the final solution.

3.2

Sense making

When we give a task to a person, we need to supply a task description. This description can be communicated verbally, textually or through a combination of both. It is more or less formalized and more or less complete.

Sense making is the process of getting from a task description to task understanding.

A person will match a task with knowledge and skills acquired from previous experiences. As indicated in the previous chapter, Klein explained this process in his theory about Naturalistic Decision Making (NDM) [8]. When the task is received, the person will execute the task based on how he/she has learned to execute similar tasks in the past. Based on the knowledge and skills of the person, the following scenarios can occur:

• The specific problem is recognized: the person understands it concerns a problem of type x and knows that the solution for problems of type x has a number of properties P.

• General problem solving is applied: the person does not have specific experience with a problem of type y, but does have knowledge of the general topic (e.g. programming tasks). A known approach from programming tasks is applied with properties R.


• The problem is unfamiliar: the person has not seen a problem of type z before, or a problem with similar characteristics. No cues are recognized and no pattern is matched. The person has no idea about the solution. In this scenario a person still has to develop a pattern for this kind of problem. Developing these patterns is a slow process. This is what we call learning.

We can say that the more general the problem is, the more dependent the person is on the task description.

3.3

Finding errors in task understanding & execution

When a person receives a task, he/she first has to get an understanding of the task. The person needs to know whether the understanding of the task is correct. For a part this knowledge is acquired through the task description. If there are errors, ambiguity or missing information in the task description, the person needs to consider the task in the bigger picture.

The person can validate whether the task description is correct by comparing the properties of the task with innate ideas.

3.4

Problems in crowdsourcing with microtasks

To get a good understanding about what to do, a person has to form a mental model of the whole [20]. Through several interactions, persons can create an understanding [13] [14].

In crowdsourced software development with microtasks, the number of interactions of a person with the platform and other persons is limited. This makes understanding the overall task more difficult. To perform well in executing microtasks, the person already has to have a strong innate idea of what needs to be done. That means that he/she should already have a solution for this specific problem type. He/she has to have a pattern for the type of problem.

If a person knows the solution, we argue that it is best that he/she completes the entire task. The context of a microtask gives too few options to:

• Use the knowledge to create a good solution

• Use the knowledge to find and correct previous errors

If a person doesn’t know the solution, it will be very hard to come up with a solution at all. The combination of the microtask description and the limited time to understand the problem as a whole will make it very hard to use general knowledge of software development.

3.5

Hypotheses

In the next sections we formulate two hypotheses and specify what we must measure in order to validate them. If the hypotheses hold, then this gives a strong indication for our theory. If the hypotheses do not hold, then we have a strong indication for rejecting our theory. In the latter case software development with microtasks should be considered as a serious option for software development tasks. The theory, calculations and expectations for the measured variables are explained in detail in appendix D.

3.5.1

Hypothesis 1


Experimentation is done with three different crowd sizes (specified in more detail in section 4.2.2). It is expected that creating a good solution becomes harder when the amount of individual learning is limited.

For the big crowd experiments it is expected that the number of recruited participants is not sufficient to complete the assignment. Thus, our measurements need to consider partly completed solutions as well.

In order to validate this hypothesis the quality of the crowd solution, the duration to create the crowd solution and the submission sizes are measured.

Validation

The following variables are measured/observed per experiment:

• Total effective amount of time worked on microtasks per experiment (equation D.1 to be used to verify expectation 1, see appendix D.1.3)

• Quality of the solution: to be determined by measuring the quality of tests (equations D.2 and D.3 to be used to verify expectation 2, see appendix D.2.3)

• Average submission size per microtask (equation D.4 to be used to verify expectation 3, see appendix D.3.3)

In the expectations the patterns are described that we expect to see in the data.

3.5.2

Hypothesis 2

The success of providing a solution for a task description depends on the experience of the crowd with solving similar problems. If the crowd does not have experience with the task description, then the crowd solution does not converge into a single solution.

Experiments are conducted using two assignments (specified in more detail in section 4.2.1). Each individual or crowd will execute one assignment. The first assignment contains common programming tasks which should be familiar to most programmers. The problem pattern should be recognized and a solution should immediately come to mind. Limited problem solving skills are required to satisfy the specification.

In the second assignment the crowd is confronted with tasks that are less familiar. The solution for the specification is not known upfront and the crowd needs multiple iterations to learn how to come to a correct solution.

Validation

The following variables are measured/observed per experiment:

• Quality of the solution: to be determined by measuring the quality of tests (equations D.2 and D.3 to be used to verify expectation 4, see appendix D.2.4)

• Presence of divergent solutions (explained in appendix D.4.2, to be used to verify expectation 5, see appendix D.4.3)

• Applied course corrections (explained in appendix D.5.2, to be used to verify expectation 6, see appendix D.5.3)


Chapter 4

Research setup

In the previous chapter we have presented our theory and we have introduced two hypotheses. In this chapter the research method is discussed. Furthermore, it will be discussed which variables are considered in this study. We will explain how the data will be processed. Finally, the recruitment process is discussed.

4.1

Method

In this study, qualitative research is performed. We have performed a number of user studies, which have generated a small dataset that we can analyze. Furthermore, data from surveys and observations are used in our analysis.

Conducting controlled experiments with human participants using tools is considered to be difficult [9]. The recruitment of participants and the task design are often hard. The work environment of a worker in the crowd is uncontrolled. Workers use their own workstations with varying characteristics, the work location can be anywhere and the working hours can vary. These uncontrolled settings cause extraneous variation if we want to measure activity within a crowd system. Nevertheless, it is important to take the extraneous variation into account in research about crowdsourced software development, as there may be fundamental differences compared to experiments in a lab setting. Besides this, we wanted to increase the number of participants for our experiments. Being able to participate from a location of their own choice helped in the recruitment process.

4.2

Experiment variations

The following sections describe the variables that are varied during the experiments.

4.2.1

Assignments

In chapter 2 we have described several examples of complex work with microtasks. In this study, the complex work is software development. The complexity within software development differs as well. Software engineers deal with simple assignments in which limited sense making is required and where modules/functions are independent, but also with highly complex assignments that require considerable problem solving skills and insight into dependencies between modules/functions, or other aspects of the work.

We have defined two assignments in order to test how difficult we can make assignments, when working in a microtasking environment.


In the client request, each function is specified with one or more input parameters and one output parameter. By default the crowd can use the String, Number and Boolean data types. In the client request other data types are defined as well. For the custom data types a JSON structure is used.

The first assignment concerns the functions for a simple todolist web application. The requested functionality is of trivial complexity, for which software developers normally would require limited problem solving skills. The following functions are requested. The italic names are custom data types.

• addItem: add a TodoItem to a TodoList

• deleteItem: remove a TodoItem from a TodoList

• updateItem: update an existing TodoItem with another TodoItem in a TodoList

• todoListToHTML: convert a TodoList to a String that contains the HTML code that represents the TodoList

• todoListToCSV: convert a TodoList to an array of String that contains a comma-separated values representation of the TodoList

In this assignment limited dependencies exist. Problems encountered should be of trivial complexity. The todoListToHTML and todoListToCSV functions are not fully specified and some decisions/assumptions need to be made by the crowd, which requires at least a little alignment between workers. The intention of this assignment is to see whether crowds are capable of completing relatively simple programming work.
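To give an impression of what the crowd works with, the fragment below sketches a possible JSON shape for the data types and a straightforward implementation of addItem. The field names (name, items, id, text, done) are assumptions for illustration only; the actual data types are specified in the client request in appendix A.

// Hypothetical shapes for the custom data types (the real ones are in appendix A).
var exampleItem = { id: 1, text: "Buy milk", done: false };        // a TodoItem
var exampleList = { name: "Groceries", items: [exampleItem] };     // a TodoList

// addItem: add a TodoItem to a TodoList, returning a new TodoList.
function addItem(todoList, todoItem) {
    return {
        name: todoList.name,
        items: todoList.items.concat([todoItem])
    };
}

console.log(addItem(exampleList, { id: 2, text: "Walk the dog", done: false }));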

The second assignment concerns the functions for a checkers game, which is of higher complexity. More intertwined logic and dependencies are expected in the implementations. This requires more interaction between workers in the crowd.

• doMoves: given an input Board and an array of Move execute the moves and give back a new Board

• isAvailableMove: given an input Board and a player (represented by a string), determine whether there is at least one available move. Return a Boolean.

In both assignments the crowd may specify new supporting functions. It is not possible to change data types or request new data types. The specified functions leave considerable decision space to the crowd. The number of rules that can be implemented forces developers to think about reusable solutions for checking input and for processing moves. This adds to the possibilities and complexity. The intention of this assignment is that we can observe under which circumstances the crowd will have a hard time creating solutions.
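As an illustration of how supporting functions can open up parallel work, the fragment below shows a hypothetical Board shape and a small helper that could be written, tested and reviewed independently of doMoves and isAvailableMove. The board representation and the function name are assumptions; the real data types are fixed in the client request (appendix A).

// Hypothetical board shape: a grid of strings such as "", "white" or "black".
var board = { squares: [["", "black", "", "black"],
                        ["white", "", "white", ""]] };

// Example of a supporting function the crowd might introduce.
function piecesForPlayer(board, player) {
    var positions = [];
    board.squares.forEach(function (rowSquares, row) {
        rowSquares.forEach(function (square, column) {
            if (square === player) positions.push({ row: row, column: column });
        });
    });
    return positions;
}

console.log(piecesForPlayer(board, "white"));   // [{row: 1, column: 0}, {row: 1, column: 2}]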

For the full specifications of the assignments we refer to appendix A.

4.2.2

Crowd sizes

The crowd sizes with which we vary are described in the following sections.

Individual

In this variant one worker will do all microtasks. In this case the worker is able to learn about the entire task and the platform. We conduct four individual experiments. Two for the todolist assignment and two for the checkers assignment. The worker uses two CrowdCode sessions, as submitted microtasks cannot be reviewed by the same worker session. The setup is the same as the small crowd experiment. The experiments start with an instruction mail.


Small crowd

Five to six workers work together on one assignment, completing microtasks. This crowd size should resemble a small, autonomous development team. In this case coordination with other workers is necessary. Workers can still learn individually and align their knowledge about the solution with other workers. Problems can be recognized and solved through earlier acquired knowledge. Four small crowd experiments are conducted: two for the todolist assignment and two for the checkers assignment. Prior to these experiments a time window of four hours is planned together with the workers. The experiments start with an instruction mail. The setup is the same as for the individual experiments.

Big crowd

n workers work together where every microtask is executed by a new worker. Individual learning is reduced to a minimum. The worker cannot learn from earlier experience with the platform. Only the task information in the microtask can be used.

It is a challenge to find a new worker for every microtask that meets the inclusion criteria. The experiments are either done on location with the participant and experimenter in the same room, or remotely through a Google Hangouts session where the experimenter has access to the screen of the participant.

A worker in the big crowd receives one trial microtask, one todolist microtask and one checkers microtask. The trial microtask is meant to become acquainted with the platform. From earlier studies we know that workers need some time to familiarize themselves with the system and to understand what is expected in a microtask. During the trial microtask questions about the platform are answered by the experimenter.

If possible, the three microtasks should all be of the same microtask type. If this is not possible because a particular microtask type is not available in one of the assignments, an additional trial microtask is given to the worker.

The session is structured as follows:

• Short explanation about the experiment (three minutes)
• Choose a microtask type and pick a trial task (five minutes)
• Execute microtask todolist assignment (ten minutes)
• Execute microtask checkers assignment (ten minutes)
• Short survey (two minutes)

The participants are asked to think aloud about their reasoning process during the microtasks. The experimenter takes notes of this reasoning process.

4.2.3

Experiment durations

Initially, we estimated that the necessary working time per worker is between 4 and 8 hours. We took into account that this time was distributed over an experiment period of 24 hours. During the first experiment with a small crowd we noticed that the period of 24 hours was not working out well. Details about this are explained in section 5.3.1. Because of this issue, the condition was changed to a duration of 4 hours per worker within a fixed time window. This means that there were always multiple workers concurrently active and work could always be reviewed by other workers. A side effect was that all workers stepped in at the same point. Workers did not have to cope with stepping into an assignment where the solution was already partly completed.


The total working time is therefore lower in the individual experiments than in the small crowd experiments. Thus, it is to be expected that the individual assignments have progressed less far than the small crowd assignments.

During the individual and small crowd experiments we do not have access to the screens of the participants. Thus, it is not possible to check whether the participants are actually working. However, we know the effective time per worker by summing the effective duration of work on microtasks per worker.

4.3

Variables

In the following sections the main variables that are measured and observed in order to validate the hypotheses are explained. It concerns the variables and measurements displayed in figure 4.1.

Figure 4.1: Involved variables and measurements

The combination of the measurements and observations of these variables will provide a perspective on how well the individuals and crowds have performed.

4.3.1

Measurements

To verify whether the patterns as described in our hypotheses are met, we’ll measure and observe the following variables:

• Microtask durations: the total effective microtask durations (details in appendix D.1).

• Quality of the solution: a rating of the quality of the work, measured through the defined tests (details in appendix D.2).

• Submission sizes: the average submission size per time unit (details in appendix D.3).
• Divergent solutions: the number of divergent solutions present (details in appendix D.4).
• Course corrections: the number of course corrections made (details in appendix D.5).


4.3.2

Quality of the solution

In order to perform our measurement, a verification project is created in CrowdCode for every experiment. The testrunner feature of CrowdCode is used to verify whether a test passes or fails. For every experiment the last function submissions and the last test submissions are taken into account. The intermediate solutions are not used.

In figure 4.2 a simple example of a verification project is displayed for the function addItem in the todolist assignment. In the first example three tests fail due to a line which is commented out.

Figure 4.2: The expert and small crowd defined tests for the function addItem

Next, we make a small change to the code to make the remaining tests pass. The result is shown in figure 4.3.


Figure 4.3: After a small code change, the remaining tests pass as well

The verification project provides us with (a) a method to perform the measurements and (b) an additional check on the correctness of the expert tests. If the expert tests contain flaws, we expect to see inconsistencies in the outcome of the tests.

Due to the fact that in some experiments more time is spent than in others, the worker durations should be taken into account in determining the quality of the work. To deal with this, a relative quality is calculated in which all experiment qualities for a particular assignment are extrapolated to the maximum duration for that assignment.

The quality totals do not take into account the supporting functions which have been defined by the crowd. These are not taken into account in the hypotheses validation either. The main goal of the assignment is to complete the requested functions. Quality is measured over these functions alone. There is a risk that particular relevant tests may have been defined on a supporting function and not on a requested function. However, the supporting functions would skew the data too much and make the data more difficult to interpret.

The total amount of time spent per experiment differs. Especially in the checkers assignment, where it is unlikely that the solution is finished, it is difficult to interpret the quality score. To deal with this problem, the duration is taken into account to calculate a relative quality score.
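As an illustration of the kind of correction meant here, the sketch below assumes a simple linear extrapolation over the effective duration; the exact definition of the quality metrics is given by equations D.2 and D.3 in appendix D.2.

// Illustrative only: assumes quality grows roughly linearly with effective working time.
function relativeQuality(passedExpertTests, totalExpertTests, effectiveHours, maxHours) {
    var rawQuality = passedExpertTests / totalExpertTests;    // fraction of expert tests passed
    var extrapolated = rawQuality * (maxHours / effectiveHours);
    return Math.min(extrapolated, 1);                         // cap at a perfect score
}

console.log(relativeQuality(12, 40, 2, 4));   // 0.3 measured in 2h extrapolates to 0.6 over 4h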

4.4

Other variables

As indicated at the start of this chapter, the workers participated in the experiment working from their own workstations. This means that the work environment was uncontrolled. We cannot provide an extensive list of variables. In this section we list the variables that we expected to be of influence:

• Developer experience: the skills and experience of the developers differ. Information about worker characteristics was collected through the pre-study survey.


• Individual knowledge about solution for the particular assignment: participants may already have experience with the assignment. For example they may have built functions for a checkers board game for an earlier assignment in school.

• Possibility of direct contact between workers: workers may talk directly to each other through other tools or, when working from the same location, verbal communication.

• CrowdCode version: between the two CrowdCode versions there are fundamental differences in the workflow. The task decomposition and overall balance are different.

• Ambiguous specifications through client request and other workers: unclear or incomplete specifications make it hard for workers to understand what to do.

• Motivation of workers: workers may have different motives to participate in the experiments. Some participated out of interest, some for the financial reward and some to help out. These drivers and other personal characteristics may influence the motivation and thus, the effort that workers put in the work.

• Deliberate bad work: workers may not be interested in creating a good solution. They may want to probe the boundaries of the system and submit bad work to see the effect. They may also just want to receive points for the leaderboard and fool the system with bad submissions.
• Bugs in the system: both CrowdCode versions contain several unresolved issues. For some bugs workarounds have been identified and documented, but there are also bugs that prevent the worker(s) from continuing.

• Dynamics in the crowd: the workers do not know each other. Different kinds of workers (personality, skill, background, etc.) may respond in different ways.

• Information from experimenter: the experimenter provides information before the experiment and answers questions about the platform during the experiment. As the experimenter learns about the platform, the quality of information towards the crowd may change.

• Skipping behavior: workers have the possibility to skip a microtask. Hard microtasks or microtasks with a bad or ambiguous task description may be skipped more often. If certain microtasks are repeatedly skipped, this may prevent the creation of subsequent work and it may not be possible to create a good solution.

• Review quality: workers may be less or more experienced with doing code reviews or other forms of review. The quality of reviews influences the motivation of workers and may influence the quality of their work.

4.5 Overview experiments

Ten experiments are conducted in order to observe and validate the hypotheses. An overview of the experiments is displayed in the following table, where n is the total number of participants (in these experiments each microtask is executed by a new participant):

ID   Crowd size    Assignment        n     Worktime
1    individual    Easy - todolist   1     4h
2    individual    Easy - todolist   1     4h
3    small crowd   Easy - todolist   5-6   4h
4    small crowd   Easy - todolist   5-6   4h
5    big crowd     Easy - todolist   n     10 minutes per worker
6    individual    Hard - checkers   1     4h
7    individual    Hard - checkers   1     4h
8    small crowd   Hard - checkers   5-6   4h
9    small crowd   Hard - checkers   5-6   4h


4.6 Data processing

As a basis for the analysis, the stored JSON data from Firebase is used. All of the test/code submissions and event data is stored in this data storage.

In order to make sense of the data, several functions have been written in JavaScript and have been executed within the JavaScript runtime environment NodeJS. The functions contain the building blocks for presenting detailed information split per experiment. The submissions are extracted from the related microtask data and are enriched with event information (for example the assignment/pick and submission timestamps), review information and several other attributes of interest.

Furthermore, functions have been developed for aggregating the data. The data is aggregated by experiment and function. These functions construct new JSON objects which we can further use and question depending on our interests. The following high-level functions have been developed:

• displayAssignmentData(functionNames, microtaskType): collect the artifact submissions, microtasks and events. Enrich the microtask object with detailed information about events of interest with the extendMicrotask(microtask, artifactFunction) function (e.g. submit events). A report is displayed per function submission.

• getAssignmentStatistics(functionNames, microtaskType, withCode): collect the artifact submissions, microtasks and events. Create an aggregation object in which properties of interest are stored for each microtask, such as submission durations and the submitted LOC. Finally, enrich this object with totals (e.g. total duration). The data is added for each artifact function.

• getReviewScores(functionNames): collect the microtasks, extract the scores information and store this in an object per function name. Split the scores by microtask type.

• getReviewAverages(reviewScores): based on the gathered review scores, create an object that calculates the average scores per function name and microtask type. Furthermore, determine the percentage of microtasks that is rejected through review.

• displaySubmittedTests(functionNames): takes all function artifacts and extracts the active tests from every function. Displays the test case description along with their inputs/output or assertion code.

• getAllDurations(functionObj, testObj): uses the result from the getAssignmentStatistics function to determine the total duration per function. The durations for each microtask type per experiment are aggregated. The function will return a new object with the duration aggregations.

The functions starting with get return a new JSON object which can be processed further. The functions starting with display will output the results on the screen or to a file.
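To indicate the style of these scripts, the sketch below aggregates effective microtask durations per function and microtask type. The data shape is simplified and the field names (functionName, type, pickedAt, submittedAt) as well as the example file name are assumptions for illustration, not the exact CrowdCode/Firebase schema.

// Simplified NodeJS sketch in the spirit of getAllDurations; field names are
// assumptions and do not reflect the exact CrowdCode/Firebase schema.
var fs = require('fs');

function aggregateDurations(microtasks) {
  return microtasks.reduce(function (totals, task) {
    var key = task.functionName + ' / ' + task.type;
    // effective duration from pick to submit, with timestamps in milliseconds
    var minutes = (task.submittedAt - task.pickedAt) / 60000;
    totals[key] = (totals[key] || 0) + minutes;
    return totals;
  }, {});
}

// Example usage: read an exported JSON dump and print the aggregation.
// var data = JSON.parse(fs.readFileSync('uvaStudy4-export.json', 'utf8'));
// console.log(aggregateDurations(data.microtasks));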

4.7 Recruiting participants

A major challenge in this study is gathering sufficient participants. This is mentioned by Ko et al. in their publication about controlled experiments with human participants [9, p.6-11]. Participants must meet the inclusion criteria and have to be willing to invest a considerable amount of time. The inclusion criteria are the following:

• Either an educational background in computer science, or a job in software development
• Basic knowledge of JavaScript syntax


By restricting participation in this way, we attempt to limit the amount of malformed data due to possible incompetence of workers. In a real-world scenario, a qualification mechanism can be implemented to test whether workers have sufficient knowledge to perform microtasks. Participants of the experiments receive a financial compensation of €30. Participants of the big crowd experiments do not receive a financial compensation. The intention of the financial reward is to stimulate more people to participate.

Furthermore, to persuade more people to participate, a flyer (appendix B) has been created, which is distributed through the following channels:

• Facebook
• LinkedIn
• Google+
• Student groups
• (Assistant) Professors at GMU and UCI
• Colleague group
• Private contacts at development companies
• Face-to-face communication
• JavaScript event 'Coding Hour' in Amsterdam

Through face-to-face explanations about the experiments, we attempted to motivate more people to participate.

4.8 Informing the workers

If a person agrees to participate he/she will receive a pre-study survey with some questions about his/her background and experience. The participant has the opportunity to select a date on which he/she can participate. Shortly before the experiment the participant is requested to confirm his/her participation.

At the start of the individual and small crowd experiments all workers receive an instruction mail with basic information to get started. The mail contains the following pieces of information:

• Link to the platform and information on how to log in
• Statement about the topic of the assignment (todolist/checkers)
• List of platform restrictions

• Tips for working with the platform

• An attached buglist with the known bugs and workarounds

During the experiments, questions about the platform are answered by the experimenter, either by mail or through the Q&A system in CrowdCode. Substantive questions about the assignment itself are answered by indicating that this should be determined by the crowd.

After the experiment the participant receives a post-study survey with questions about their experiences with CrowdCode and microtasks.


Chapter 5

Research execution

This chapter describes the execution of the experiments as defined in the previous chapter.

5.1 Recruiting workers

Despite the considerable attention towards recruiting, it has proven hard to get sufficient registrations. For the individual and small crowd experiments 46 persons responded, of whom 36 filled in the pre-study survey. The amount of responses through social networks was very low (14.7%). Most participants responded to a call from their teacher or (assistant) professor (38.2%). Of the 36 persons that committed to participate, 11 persons dropped out before or during the experiments (30.6%). The dropouts did not respond to further communication and did not provide feedback about why they did not continue their participation. The experiments in which workers did not put in the agreed amount of time have not been taken into account in our data analysis, because there was not sufficient data to analyze.

Figure 5.1 displays how the 25 workers that eventually participated came to know about the experiment. Further information about the workers is presented in section E.2.

Figure 5.1: How the experiment came to the attention of the workers (number of workers: 25)

Of the 25 persons that did participate in the individual and small crowd experiments, 7 rejected the financial compensation because they felt this was not necessary or appropriate.

The 39 big crowd participants were gathered through face-to-face communication at the UvA and personal e-mails to students and colleagues.


This gives the following totals for the experiments:
• 21 persons for the small crowd experiments
• 4 persons for the individual experiments
• 39 persons for the big crowd experiments

Some of the experiments were preceded by a video conference call with Google Hangouts. This was primarily done to stress the importance of participating seriously in the experiment. Information similar to that in the instruction mail and the CrowdCode tutorials was supplied to the workers in this call.

The workers were asked to fill in a survey before the study and after the study. The results of these surveys can be found in appendixE.2.

5.2 Individual experiments

5.2.1 Results

In the table below the number of microtasks and the total work time are shown. The work time is calculated by taking the effective duration of every microtask.

Experiment Assignment Amount of microtasks Total work in hours

uvaStudy3 todolist 154 7.2

uvaStudy12 todolist 99 3.6

uvaStudy10 checkers 32 3.7

uvaStudy11 checkers 44 3.4

Table 5.1: Total durations for individual experiments

Both todolist assignments were completed; all requested functions were implemented. The worker in uvaStudy12 added three supporting functions. The worker in uvaStudy3 did not complete the assignment during the first try and was asked to complete it in return for a small additional financial compensation.

The checkers assignments were not completed. The total number of microtasks is lower than for the todolist assignment, while the amount of time spent was comparable with the time spent on the todolist assignments. Some supporting functions were created to check the consistency of the function input parameters. The programming work in uvaStudy10 did not progress far; in uvaStudy11 the work progressed slightly further.

Besides the individual experiments shown in this section, three other attempts at this crowd size have been made, which failed (uvaStudy1, uvaStudy5 and uvaStudy9). The worker in the first failed attempt did half of the work and then quit. The second failed attempt concerned a worker who logged in only once and then quit directly. No feedback was received from either of these workers. The last failed attempt was a worker who could not make it due to planning issues. The data for the failed attempts has not been taken into account in our analysis, as there was not sufficient data to analyze.

5.2.2 Worker experiences

In the surveys, most of the individual workers indicated that the assignments were clear. The same holds for the clarity of the microtasks. The main information used to solve problems was by using


The workers indicated that at the start it was hard to understand what was expected. Most confusion occurred when the specification within a task was not completely specified. It is not clear to workers how to gather information about specifications. Furthermore, one worker indicated that working individually with two sessions was inconvenient. Also, one worker indicated that it would be convenient if workers had more time for coding microtasks.

Most workers indicated that they appreciated the smallness of the microtasks. The way tests are incorporated in the system was also appreciated by most of the workers.

5.3 Small crowd experiments

5.3.1 Results

In the table below the number of microtasks and the total effective durations are shown.

Experiment Assignment Amount of microtasks Total work in hours

uvaStudy4 todolist 120 8.5

uvaStudy8 todolist 104 10

uvaStudy6 checkers 74 8

uvaStudy7 checkers 179 9

Table 5.2: Total effective durations for small crowd experiments

The effective working time per worker is lower than in the individual experiments. This indicates that during the set time window, not all workers spent all of their time working on microtasks. Furthermore, in uvaStudy7 many more microtasks were spawned than in the other checkers assignments. It appeared that this crowd split up the work much faster and more frequently, generating more work for other workers.

The todolist assignments were completed by both crowds as far as possible. Due to a bug, neither crowd was able to implement all functions: tests were created for all five functions, but only four were implemented. For the remaining function, no further microtasks were spawned by the platform and it was not possible to repair this during the experiment. Furthermore, the crowds spent their time working on the other functions.

The checkers assignments were not completed, but the crowds did outline their solution direction. This was done in multiple ways. In uvaStudy6 there were traces of pseudo code to bring structure and instruct other workers. An example of the use of pseudo code (lines starting with //#) is displayed in figure 5.2. In uvaStudy7 a structure was forced by splitting up functions into several supporting functions. In total, besides the two originally requested functions, 16 supporting functions were defined. For 9 of these functions tests and/or code were written; 39 tests were defined and 131 LOC were written. Of the unimplemented functions, some seemed to be obsolete or redundant.
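For illustration, a function outlined with this pseudo-code convention could look like the fragment below. This is a hypothetical reconstruction (including the function name isValidMove), not an actual submission from uvaStudy6.

// Hypothetical reconstruction of the //# pseudo-code convention; not an
// actual crowd submission from uvaStudy6.
function isValidMove(board, move) {
  //# check that the start square holds a piece of the player whose turn it is
  //# check that the target square is empty and diagonally reachable
  //# if the move is a jump, check that an opponent piece is being captured
  //# return true only when all checks pass
  return false; // placeholder until another worker implements the steps above
}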
