
Gamification of crowdsourcing tasks:

what motivates a medical expert?

Master’s Thesis


Gamification of crowdsourcing tasks:

what motivates a medical expert?

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

INFORMATION STUDIES

specializing in

HUMAN CENTERED MULTIMEDIA

by

Rens van Honschooten

born in Amsterdam, The Netherlands

Department of Artificial Intelligence
Faculty FEW, VU University Amsterdam
Amsterdam, Netherlands

http://www.few.vu.nl/

Center for Advanced Studies
IBM Netherlands
Amsterdam, Netherlands
http://www.ibm.nl


Gamification of crowdsourcing tasks:

what motivates a medical expert?

Author: Rens van Honschooten
Student ID: 10069313

Email: rens.van.honschooten@student.vu.nl

Abstract

In this document incentives are discussed to explore whether or not it is possible to motivate medical experts to perform crowdsourcing annotation tasks that require medical expert knowledge and cannot be performed by a lay crowd. To find out what these incentives are, we first identified incentives from literature and then surveyed 24 medical experts. The most important incentives identified from the survey were personal growth, competition and fun. We explored whether or not we could combine these incentives in a gamified crowdsourcing application called Dr. Watson. In Dr. Watson medical experts compete by playing annotation games and can unlock medical articles to maintain their knowledge. A mockup version of Dr. Watson was created and evaluated with six medical experts. Three out of six medical experts stated they would play Dr. Watson again, because they liked playing the game, wanted to learn more or wanted to improve their score. This indicates that the personal growth, fun and competition incentives were incorporated successfully and that it is possible to combine incentives effectively. To make Dr. Watson more efficient for the medical experts, however, it is required that they can select a topic they want to learn about.

Thesis Committee:

University supervisor: Dr. Lora Aroyo, Faculty FEW, VU University Amsterdam
Company supervisor: Dr. Chris Welty, IBM Watson Research Center, New York
Company supervisor: Robert-Jan Sips, CAS Benelux, IBM Netherlands


Acknowledgments

I would like to thank my supervisors Dr. Lora Aroyo, Dr. Chris Welty and Robert-Jan Sips for their feedback and assistance during the development of this thesis. In addition, I would like to thank my colleagues on the CrowdTruth team, Anca, Benjamin, Harriëtte, Khalid, Lukasz, Oana, Tatiana and Manfred, who also provided me with great support during the project. I also want to wish Carlos good luck with the development of Dr. Watson, and good luck to the others on the team with their future endeavors.

Rens van Honschooten
Amsterdam, The Netherlands
August 17, 2014


Contents

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
1.1 Research Questions
1.2 Glossary and Definitions

2 Crowdsourcing incentives
2.1 Crowd incentives to perform work
2.2 Incorporating incentives using gamification
2.3 Overview of general crowdsourcing incentives
2.4 Conclusion

3 Medical crowdsourcing incentives
3.1 Medical crowdsourcing incentives literature
3.2 Crowdsourcing incentives survey for medical experts
3.3 Requirements for medical crowdsourcing incentives
3.4 Conclusion

4 Design of Dr. Watson
4.1 Dr. Watson design rationale
4.2 Gathering CrowdTruth data using Dr. Watson
4.3 Medical literature
4.4 Scoring
4.5 Challenges
4.6 Levels
4.7 The win and lose screen
4.8 Difficulty
4.10 Conclusion

5 Evaluation of Dr. Watson
5.1 Experimental Setup
5.2 Result analysis and discussion

6 Future Work

7 Conclusions

Bibliography

A Survey on incentives for crowdsourcing medical text annotations

B Interview Analysis Tables

C Interview introduction, tasks and questions
C.1 Interview introduction
C.2 Usability tasks
C.3 Interview questions

D Interview Transcripts
D.1 Interview 1
D.2 Interview 2
D.3 Interview 3
D.4 Interview 4
D.5 Interview 5
D.6 Interview 6


List of Figures

2.1 Overview of the approach used to extract and evaluate crowdsourcing incentives for medical experts
3.1 A screenshot of the Dr. Detective game
3.2 The influence of different crowdsourcing incentives according to the 24 survey respondents in percentages. The neutral responses are not shown
4.1 The homepage of Dr. Watson with the four content boxes
4.2 An example of the user vectors and crowd vectors for each type of subtask
4.3 The games homepage with two normal difficulty game modes unlocked
4.4 The win screen that is displayed after winning a Dr. Watson game
4.5 The performance page with the Dr. Watson Ranking
5.1 The ingame page for a question-answer task
5.2 The ingame page for a question-answer task
B.1 The influence of different crowdsourcing incentives according to the 24 survey respondents


List of Tables

B.1 Demographic characteristics of 24 survey respondents
B.2 The types of games game-playing survey respondents (n=20) play
B.3 The responses to the statements posed for the question: “what would or usually motivates you to play a game frequently”
B.4 The responses to the statements posed for the question: “What type of game would you prefer to play? A game that:”
B.5 The time the 24 survey respondents reported to spend contributing to a medical crowdsourcing activity
B.6 The responses to the statements posed for the question: “What is your most preferred setting for performing crowdsourcing activities”
B.7 The responses to the statements posed for the question: “What is for you the most enjoyable (preferred) way to study/learn?”
B.8 The way the respondents perceive the length of the medical text in the New England Journal of Medicine interactive Use Case
B.9 The responses to the statements posed for the question: “How do you perceive the current feedback?”
B.10 The responses to the statements posed for the question: “What would be the most preferred domain for crowdsourcing activities, or interactive medical use cases?”
B.11 The responses to the question: “Do you prefer to play first person, third person, or games not involving a digital character?”
B.12 The responses to the question: “What would be the main reason you stop playing a game?”
B.13 Clustering of the answers to the question “What is your medical domain specialization?”
E.1 The enjoyment of playing the game on a 1-5 scale and feedback on what would make it more enjoyable
E.2 The feedback on the scoring of the game
E.3 The feedback on how challenging the game was and how to make it more challenging
E.4 The feedback on whether or not the option to unlock more game modes
E.5 The feedback on how the experts like the idea of unlocking articles using reputation points
E.6 The feedback on alternatives to unlocking articles using reputation points
E.7 The feedback on alternative ways to unlock articles
E.8 The feedback on what would motivate the experts to keep using the application
E.9 The feedback on what annotation tasks a medical expert can perform while reading an article
E.10 The feedback on whether or not the experts would play the game again and why
E.11 The feedback on what the experts would change about Dr. Watson in general
E.12 The feedback on what the experts would change about the tasks and ingame part of Dr. Watson
E.13 Final suggestions from the experts to improve Dr. Watson further


Chapter 1

Introduction

Human annotated data is required to train a cognitive system and evaluate the performance of these systems, especially when adapting to a new domain. An example of a cognitive system that needs human annotated data is Watson QA [8] developed by IBM. Watson defeated the best players on the Jeopardy game show and was trained on multiple databases, taxonomies and ontologies with publicly available human annotated data [12]. Currently, IBM Research aims at adapting the Watson technology for question-answering in the medical domain, which requires large amounts of new training and evaluation data in the form of human annotations of medical text [6].

For the training of specific components in the NLP pipeline, which is part of the Watson computer, it is important to collect a new type of ground truth data. Typically, disagreement is avoided when creating ground truth data, because one assumes that there is only one right answer for each annotated instance. This assumption is challenged in [1], because disagreement in certain annotation tasks reflects semantic ambiguity of an instance and provides useful information. This indicates that there is no universal ground truth and that disagreement is fundamental for cognitive computing tasks. By understanding and harnessing disagreement, ground truth data can be acquired that is richer in diversity of perspectives, opinions and interpretations, which reflects more realistic human knowledge [1]. This new type of ground truth is called the CrowdTruth.
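To make the idea of harnessing disagreement concrete, the sketch below aggregates the annotations that several workers give for one sentence into vectors and uses the cosine similarity between a worker's vector and the rest of the crowd as a simple agreement signal. This is only an illustration in the spirit of CrowdTruth-style metrics, not the project's actual implementation; the relation labels and worker answers are made-up examples.

```python
import math
from collections import Counter

# Hypothetical relation labels for a medical relation annotation task.
LABELS = ["treats", "causes", "prevents", "none"]

def to_vector(chosen_labels):
    """Turn one worker's chosen labels for a sentence into a count vector."""
    counts = Counter(chosen_labels)
    return [counts.get(label, 0) for label in LABELS]

def add_vectors(vectors):
    """Element-wise sum of a collection of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Example: three workers annotate the same sentence.
workers = {
    "w1": to_vector(["treats"]),
    "w2": to_vector(["treats", "prevents"]),
    "w3": to_vector(["causes"]),
}

# Per-worker agreement with the rest of the crowd: low values can signal an
# ambiguous sentence or a low-quality worker, which is the disagreement
# information that CrowdTruth-style metrics try to capture instead of discard.
for worker_id, vector in workers.items():
    rest = add_vectors(v for w, v in workers.items() if w != worker_id)
    print(worker_id, round(cosine(vector, rest), 2))
```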

To collect CrowdTruth data, contracting expert annotators is possible. This guarantees that annotated data is acquired, but the problem with this approach is that it is slow, expensive and generates relatively small amounts of data. An approach to collect larger amounts of annotated data is crowdsourcing. Crowdsourcing is the concept where small tasks called human intelligence tasks (HITs) are distributed to a crowd. The distributor of these tasks usually pays a couple of cents for each HIT that is completed by a person in the crowd. The idea behind crowdsourcing is that HITs can be performed quickly and cheaply by soliciting contributions from a large crowd of people [9]. Two popular crowdsourcing platforms for distributing these tasks to a crowd online are Amazon Mechanical Turk (AMT) and CrowdFlower (CF).

1 https://www.mturk.com/
2 http://www.crowdflower.com/
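As a rough, purely illustrative sense of scale for this pay-per-HIT model, a back-of-the-envelope estimate looks like the snippet below; the numbers are assumptions, not figures from the CrowdTruth project or from AMT/CrowdFlower pricing.

```python
# Back-of-the-envelope cost of collecting annotations through paid microtasks.
# All numbers are illustrative assumptions.
sentences = 10_000            # units to annotate
judgments_per_sentence = 15   # several workers per unit, to capture disagreement
cost_per_judgment = 0.03      # "a couple of cents" per HIT, in dollars

total_hits = sentences * judgments_per_sentence
total_cost = total_hits * cost_per_judgment
print(f"{total_hits} HITs, approx. ${total_cost:,.2f}")  # 150000 HITs, approx. $4,500.00
```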


In the CrowdTruth project AMT and CF are used to collect CrowdTruth data, because medical knowledge is not always required to annotate medical text. An example is a term identification task, where a worker does not have to fully understand the text to find terms that can be relevant for a diagnosis. For tasks that require medical knowledge AMT and CF are not used, since most of the crowd present on AMT and CF does not have medical expert knowledge. Last year, a crowdsourcing application called Dr. Detective was created for tasks that require medical knowledge [6]. This application incorporated game design techniques and mechanics, such as high scores and levels, to motivate medical experts to use the application.

The use of game design techniques and mechanics to enhance non-game contexts is called gamification and is a way to motivate a crowd to perform annotation tasks [15]. Instead of monetary incentives, gamification uses incentives such as the desire to be entertained [22]. An example where gamification is used to motivate an expert crowd to acquire annotation data is Spotvogel, where one has to label different birds found in video fragments to gather points. Currently, over 100,000 labels have been collected, showing that gamified crowdsourcing tasks can be used to collect a large amount of expert annotated data.

In this thesis we continue the work of [6] and focus on the incentives that can be used to motivate different types of medical experts to perform crowdsourcing tasks, allowing us to acquire annotation data to train IBM Watson in the medical domain. We first extract incentives from the work of [6] and from literature. Thereafter we survey medical experts about the incentives found and about their gaming preferences. The outcome of the survey helps us create a requirements list for the design of a gamified crowdsourcing application like Dr. Detective, called Dr. Watson. We evaluate this design by creating a prototype version of Dr. Watson and testing the prototype with users, by letting them perform user tasks and interviewing them afterwards. This helps us find the optimal combination of incentives and evaluate whether or not we achieve the personal targets of the medical experts as well as our own targets.

In chapter 2 other crowdsourcing games and incentives are discussed. Medical crowdsourcing games and incentives for medical experts are discussed in chapter 3, as well as the survey used to gain insight into the incentives and game preferences of medical experts. I would like to thank Lora Aroyo for helping me create the survey and Robert-Jan Sips for helping to send the survey to multiple medical experts. The design of the crowdsourcing application and a detailed explanation of each of the elements of the Dr. Watson game are discussed in chapter 4. The experimental setup of the experiment used to evaluate the game, which consisted of user tasks and interviews, as well as the analysis of the experiment, is discussed in chapter 5. Future work is discussed in chapter 6 and in chapter 7 we answer our research questions.

1.1 Research Questions

The research that will help us discover the incentives required to motivate medical experts to perform crowdsourcing tasks and acquire annotation data to train IBM Watson in the medical domain can be summarized by the following research questions:

1. What are incentives for different types of medical experts to contribute to medical crowdsourcing tasks?

2. Can we combine these incentives, so that we achieve optimal efficiency and effectiveness in terms of:

a) Achieving their personal targets in the medical crowdsourcing task

b) Achieving maximum quality and quantity of the crowdsourcing result

To answer these research questions a gamified crowdsourcing application, designed for medical experts, is proposed. The application incorporates incentives to motivate the experts to perform crowdsourcing tasks and is created with the goal of generating gold standard data for the training and evaluation of cognitive systems such as IBM Watson. By finding incentives for different types of medical experts and combining these incentives in a crowdsourcing application, the efficiency and effectiveness for our annotation tasks can be evaluated. Since two parties are involved, i.e. the IBM Watson team and the medical experts, we will measure the effectiveness and efficiency that can be achieved using these incentives for both parties. The effectiveness and efficiency for IBM Watson depend on the quality and quantity of the crowdsourcing result. The effectiveness and efficiency in terms of the medical crowd depend on the outcome of the survey.

1.2 Glossary and Definitions

• crowd incentive: motivating factor for the crowd to engage in a crowdsourcing activity.

• crowdsourcing: the practice of obtaining content by soliciting contributions from a (virtual) community.

• gold standard: in NLP evaluation, the set of annotations that is considered definitive.

• medical expert: a person with above-average medical knowledge, e.g. primary, secondary and tertiary care providers, medical students and persons involved in the medical domain.

• NLP: natural language processing.

• relation: an annotation object consisting of two terms that are tied together by a medical property.

• term: an annotation object consisting of a set of words that together form a medical concept.


Chapter 2

Crowdsourcing incentives

The approach we will use to answer our first and second research questions consists of three steps, as can be seen in Figure 2.1. First a preliminary requirements analysis is performed, to find out what the crowdsourcing incentives for medical experts are, by extracting incentives from literature and surveying the medical experts about crowdsourcing incentives. This will be discussed in this chapter and in chapter 3 respectively. The results of the crowdsourcing incentives survey are analyzed in chapter 3 as well, to create a list of requirements for the design of the Dr. Watson game for medical experts. Lastly, the prototype version of Dr. Watson will be evaluated in chapter 5, to find out whether or not the incentives incorporated in the game were effective.

In this chapter we first discuss incentives for a general crowd to perform work, such as crowdsourcing tasks. Thereafter gamification within the crowdsourcing domain as a way to incorporate incentives is discussed. Lastly, an overview of the incentives discussed in this chapter is provided.

Figure 2.1: Overview of the approach used to extract and evaluate crowdsourcing incentives for medical experts

2.1 Crowd incentives to perform work

The use of crowdsourcing may be a cheap way to gather a large amount of data, but crowdsourcing also has limitations. Two of those limitations are a lack of motivation and cognitive limitations [2]. Usually, two types of motivation are distinguished: intrinsic motivation and extrinsic motivation. Intrinsic motivation moves a person to perform an activity for the sake of the activity itself, while extrinsic motivation is activated by external factors such as monetary rewards and recognition [15]. Intrinsic motivation can be generated by creating challenges for the players of a game, stimulating their curiosity, providing choice autonomy and creating fantasy to allow people to have experiences unavailable in real life [17]. According to [24] the most important extrinsic motivator for employees to perform work is good wages. The number one intrinsic motivator for employees is full appreciation for the work done, followed by job security and the performance of interesting work. For a medical expert crowd, fantasy most likely will not motivate them, since they are more serious than a general crowd. We do expect that medical experts are curious, like challenges and want to be recognized. These factors can be combined and incorporated into Dr. Watson, by creating challenges that test their medical knowledge and rewarding them with a title, which directly shows their increased status. Their curiosity can be triggered by providing them with new literature in their domain.

Wikipedia is an example where a large crowd of experts is motivated to perform work. Here the crowd executes tasks without money as an extrinsic motivator. According to [13], people are motivated to contribute to Wikipedia because they want to contribute to the community, improve their own reputation, believe in the principle of reciprocity and have the freedom to pick tasks they like and complete these tasks at any pace. The medical experts will most likely want to contribute to the medical community, be recognized for their work and do tasks they want to do. This indicates that a Wikipedia-like application may motivate medical experts to perform crowdsourcing tasks.

A platform where crowdsourcing tasks are performed by an intrinsically motivated crowd is Zooniverse. Zooniverse is a platform with projects created and maintained by the Citizen Science Alliance, where volunteers can help scientists to analyze very large and/or complex data sets. Currently there are five project categories: space, climate, humanities, nature and biology. The citizen scientists can freely select which projects they are interested in and perform micro tasks defined by the scientist. According to [18], a great number of the crowd workers in the Galaxy Zoo project are motivated to work because they identify with the project's goals, are contributing to science and have an interest in astronomy. Even though we do not have multiple projects, it is possible to give the medical experts the freedom to choose tasks they find interesting and want to perform.

1 https://www.zooniverse.org/
2 http://www.galaxyzoo.org/


A final example where a crowd of people is intrinsically motivated to execute tasks is open source projects. Motivational factors to participate in open source software projects that may also be relevant motivational factors for the performance of crowdsourcing tasks are: an increase in status and recognition, learning, personal enjoyment, reciprocity and having a sense of ownership and control [7]. For sustained participation in open source projects, however, these motivational factors are not enough. To achieve sustained participation, positive reinforcement of situated learning, identity construction through community recognition and self-perception is required [7]. We can help the medical experts to learn new things by providing them with new medical literature within their domain. To help the medical experts be recognized by the community, we can create a ranking page where the biggest contributors have a high ranking.

2.2 Incorporating incentives using gamification

One way to combine and incorporate incentives for a crowd of people is by using gamification. The ESP Game [22] (renamed Google Image Labeler in 2006) was created with the goal of labeling the majority of images on the World Wide Web and was the first to generate thousands of image labels by using a gamified crowdsourcing approach.

Currently this gamified crowdsourcing approach is emerging in different domains. An example of this is Peekaboom [23]. Peekaboom is a web-based crowdsourcing game, designed with the purpose of creating an image database with fully annotated images. By playing Peekaboom, annotations are obtained about the objects in an image, the location of these objects within the image and how much of the image is necessary to recognize the object. In each game one player takes on the role of “Peek” and one player takes on the role of “Boom”. Peek starts with a blank screen, while Boom starts with an image and a word related to it. Boom will click on a part of the image, revealing a small area of the image. Then, Peek has to guess what Boom's word is, while Boom can indicate whether the guess is hot or cold. When Peek guesses the word, both players obtain points and switch roles. Each round motivates both players to work together in order to obtain a high score. The cumulative points of a player determine their rank, which is a concept we could use for our crowdsourcing game, since an expert that performs more tasks and performs them well should have a high rank. Teamwork may also be an incentive for medical experts, but it requires two medical experts to perform the same task at the same time. Since we do not want an expert to wait endlessly until a teammate appears and to be demotivated by the fact that ‘no one’ plays it, we created a single player crowdsourcing game. Once medical experts regularly play our crowdsourcing game, tasks that can be performed as a team can be added to make the game more enjoyable.

To let a crowd create sentiment lexicons, the Sentiment Quiz was created [21]. The Sentiment Quiz is a Web-based crowdsourcing game on Facebook, where the players are presented with sentences and have to indicate whether positive or negative language is used. The Sentiment Quiz makes use of leaderboards (they call them score boards) and game levels. Since the game is played on Facebook, players can also recommend the game to others, attracting new players, and earn some points for doing so. For many participants the fact that people are contributing to science is also a strong motivational factor. Even though placing our medical crowdsourcing game on Facebook could be a good way to increase the number of medical experts, it may prevent the medical expert crowd from taking our medical crowdsourcing game seriously. We have not placed our game on Facebook, but we do allow the experts to share and invite other experts to play our medical crowdsourcing game to still benefit from social media.

Duolingo is a crowdsourcing game created by von Ahn with the purpose of translating text on the web [20]. Duolingo makes use of the motivation of people to learn languages and their ability to understand context, which is hard to understand for a computer. After creating a Duolingo account, a player can choose which language they want to learn and which topics they want to learn first. If a player wants to unlock the harder topics, the player must first successfully complete a test about the basic topics. By playing the game players can level up in a language, and their level is visible next to their name in the discussion forums and on their profile page. On the profile page the players can track the progress of friends and view their achievements. The concept of having to unlock harder tasks by playing games is used in our crowdsourcing game, since this requires the medical experts to perform tasks to progress and prevents them from performing tasks that are too difficult for them, which would result in low quality annotations.
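As an illustration of this unlock mechanic, the sketch below gates each harder difficulty tier behind performance on the tier below it. The tier names, the minimum number of tasks and the quality threshold are assumptions made for the example, not the actual Dr. Watson rules.

```python
# Hypothetical unlock rule: a harder tier opens only after the expert has
# completed enough tasks in the tier below it with sufficient average quality.
# Thresholds and tier names are illustrative assumptions.
TIERS = ["easy", "normal", "hard"]
MIN_TASKS = 10
MIN_QUALITY = 0.7  # e.g. average agreement with the crowd, in [0, 1]

def unlocked_tiers(history):
    """history maps tier name -> list of quality scores for completed tasks."""
    unlocked = ["easy"]  # the easiest tier is always available
    for previous, current in zip(TIERS, TIERS[1:]):
        scores = history.get(previous, [])
        if len(scores) >= MIN_TASKS and sum(scores) / len(scores) >= MIN_QUALITY:
            unlocked.append(current)
        else:
            break  # later tiers stay locked until the earlier one is passed
    return unlocked

print(unlocked_tiers({"easy": [0.8] * 12}))  # ['easy', 'normal']
print(unlocked_tiers({"easy": [0.5] * 12}))  # ['easy']
```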

2.3 Overview of general crowdsourcing incentives

Based on the incentives discussed in literature and the incentives that can be extracted from the crowdsourcing games, we created a list of incentives that can motivate a crowd to perform work:

• personal growth

• being involved in something interesting

• being in contact with and contributing to the community

• personal enjoyment

• choice autonomy

• status and recognition

• a monetary reward

Each of these incentives is mentioned in at least two separate crowdsourcing initiatives we discussed. Combined with the fact that we did not limit ourselves to incentives for gamified crowdsourcing work, but also included incentives in related domains such as open source projects, we assume this is a complete aggregation of incentives for a general crowd to perform work.


2.4 Conclusion

Personal growth is the most important incentive, because it leads to sustained participation [7]. Enjoyment and being involved in something interesting are also important incentives, since no person is going to voluntarily do things they do not enjoy or are not interested in. Being in contact with a community can also be an incentive for medical experts, since most experts are already in contact with and contributing to the community on a daily basis. The fact that the experts already contribute to the community on a daily basis, however, can also mean their need to have contact with a community is already satisfied, causing this incentive to be less motivating. Status and recognition can be incorporated within our crowdsourcing game by rewarding the experts with prestigious titles and creating a ranking of the experts playing the game. Choice autonomy can provide medical experts with the freedom to choose what they want to do and when, but can also have a negative effect, since too much choice can have demotivating consequences [11]. Lastly, a monetary reward is considered a weak motivator, because it is known to reduce intrinsic motivation [5] and because most medical experts already earn a more than average amount of money.


Chapter 3

Medical crowdsourcing incentives

In chapter 2 motivation and incentives to perform crowdsourcing tasks were discussed for a general crowd, as well as gamification of crowdsourcing tasks in other crowdsourcing initiatives. In this chapter we focus on the incentives that can be used to motivate a medical expert crowd to perform crowdsourcing tasks, in order to answer our first research question. First we discuss literature on crowdsourcing incentives in the medical domain in section 3.1. Thereafter the survey about incentives to participate in crowdsourcing activities and gaming is discussed in section 3.2. Lastly, we discuss the requirements list we created for the design of Dr. Watson in section 3.3.

3.1 Medical crowdsourcing incentives literature

In this section we discuss gamification and serious games in the medical domain, to extract incentives and gaming elements that can be used to motivate medical experts to perform crowdsourcing tasks. First gamified medical crowdsourcing games and medical games with a purpose are discussed. Thereafter the incentives used in the Dr. Detective game to motivate medical experts are discussed.

3.1.1 Gamified medical tasks

Gamification and crowdsourcing are also used in the medical domain. Foldit [4] is an example of a multiplayer online game targeted at a non-expert crowd. The crowd helps to solve hard protein structure prediction problems. The players can interact and ‘fold’ protein structures to discover the structure the proteins would most likely take in nature. Like the Sentiment Quiz, a motivational factor for the crowd is the fact that they contribute to science. Players receive a score based on how compact they make the protein structure, whether hydrophobics are kept away from the external environment and whether protein chains do not intersect. The players with a high score are visible on the leaderboard. Players can also form teams that solve protein structures and appear on the team leaderboard. In both cases the player or players compete in Foldit to gain the highest score on the leaderboard. This indicates that our crowdsourcing game should enable the medical experts to be competitive.

In [14] a malaria diagnosis game was created to test how well a non-expert crowd can perform binary medical diagnostic decisions (e.g. infected versus uninfected). The crowd was presented with a tutorial before starting the game to train the crowd to perform the diagnostic task, and to pass the tutorial an accuracy of 99% was required on the training examples. The crowd was within an accuracy of 1.25% of the diagnostic decisions made by medical experts. In addition to helping with the diagnostic process, players would receive a score after each level. This indicates that it is possible to train non-medical experts to become a medical expert for a specific task. We could use this training concept in the future to reduce the number of medical experts required and train our non-expert crowd on AMT and CF to perform specific tasks in our crowdsourcing game. We do think it is necessary to explore this option more thoroughly first, due to the fact that our tasks are text based and not as easy as recognizing a single type of image.
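A hedged sketch of how such a tutorial-based qualification gate could look for a text annotation task is shown below. The pass threshold, the tutorial items and the label set are assumptions made for the illustration, not the setup of [14] or of our own game.

```python
# Hypothetical qualification gate: a worker may enter the real annotation task
# only after reaching a minimum accuracy on tutorial items with known answers.
# The threshold and the tutorial items are illustrative assumptions.
PASS_THRESHOLD = 0.9

tutorial_items = [
    {"sentence": "ANTIBIOTIC X treats INFECTION Y.", "gold": "treats"},
    {"sentence": "DRUG Z causes SIDE EFFECT W.",     "gold": "causes"},
]

def passes_tutorial(worker_answers):
    """worker_answers is a list of labels, aligned with tutorial_items."""
    correct = sum(
        1 for item, answer in zip(tutorial_items, worker_answers)
        if answer == item["gold"]
    )
    return correct / len(tutorial_items) >= PASS_THRESHOLD

print(passes_tutorial(["treats", "causes"]))  # True: 2/2 correct
print(passes_tutorial(["treats", "none"]))    # False: 1/2 correct
```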

To crowdsource high quality health information online and quickly answer patient questions, the gamified crowdsourcing application HealthTap was created. HealthTap allows patients to ask health related questions to the crowd of doctors or to a specific doctor. After the question is answered by one or more doctors, the patient can give a ‘Thank You’, which impacts the DocScore. By answering health questions a doctor can win virtual awards, receive medals and build a strong referral network. This indicates that medals and virtual rewards may be a way to keep the medical expert motivated to play our crowdsourcing game. In addition, these rewards allow us to stimulate good behavior and performance of crowdsourcing tasks. HealthTap also allows medical students to learn and build a resume by answering health questions. A student can answer these questions, but before the patient can see the student's answer, another medical expert first reviews and edits it. If the quality of the annotations of medical students performing crowdsourcing tasks in our crowdsourcing game is poor, this concept can be used by letting medical experts review the work of medical students before they submit their annotation task.

3.1.2 Medical games with a purpose

In addition to gamified medical crowdsourcing tasks, there are medical games with a purpose that use gamification, also called serious games. These games mainly aim to achieve learning and behaviour change, instead of entertainment [3]. Serious games such as 3DiTeams, CliniSpace, HumanSim and Virtual ED all provide a realistic virtual learning environment, where multiple medical experts can enter the virtual environment at the same time [3]. This allows them to learn and collaborate with each other as a team, coordinate their actions and test their medical knowledge. Since collaborating to perform a crowdsourcing task in our case would reduce the odds of capturing disagreement, we do not incorporate teamwork in our crowdsourcing application. Focusing on learning, however, may be a good way to make our crowdsourcing game useful for the medical experts. Games that focus on learning instead of teamwork are the Off-pump Coronary Artery Bypass game and the Total Knee Arthroplasty game, which both focus on helping the experts learn specific surgical procedures. Games such as Code Orange, Nuclear Event Triage Challenge, Peninsula City, Triage Trainer and Burn Center also focus on learning, but help the medical experts learn triage for specific events and incidents [3]. Most of these games are as realistic as possible and do not use gamifying features such as leaderboards and high scores, with the exception of Code Orange and The Burn Center. Code Orange integrates a form of scorecards showing the player key tasks that should have been performed. In our game we allow the medical expert to view the scores of others in addition to their own score, so that they can improve their score in the future and learn what we expect from them. The Burn Center uses a scoring system based on the performance of the player, but also sets a time limit to treat the patient with a burn. In addition to the time limit and score, the Burn Center program is approved for Continuing Nursing Education Credit and Continuing Medical Education, meaning the medical experts playing the game still use their time effectively. We do not use a time limit or reward medical experts with a higher score for performing a task fast, however, because we do not want medical experts to rush our annotation tasks. The credits for continued learning would most likely motivate medical experts to play our crowdsourcing game, but this requires that our crowdsourcing game is evaluated and approved by the American Medical Association as an educational game for medical experts.

Even though none of the games discussed are exactly the same, in each of them the crowd is motivated to perform the gamified crowdsourcing tasks. Peekaboom incorporates game features such as ranks, time limits and experience points to keep players motivated to play the game. With the exception of time limits, these elements are used in our crowdsourcing game. HealthTap incorporates virtual awards and medals, which are also incorporated in our own crowdsourcing game, to make the game more fun and allow us to stimulate good behavior and performance of crowdsourcing tasks. The medical games with a purpose focus on helping the medical expert learn new things or maintain their knowledge, which may be a good motivator for medical experts, since medical experts have to continue learning to keep their license as a medical expert.

3.1.3 Incentives in the Dr. Detective game

To find the motivators necessary for engaging medical experts in contributing, medical experts were interviewed in [6]. The questions in these interviews were related to gaming, participating in crowdsourcing activities, medical competitions, learning and reading medical literature. The interviewees expressed an interest in reading medical case reports and wanted the crowdsourcing tasks to challenge their problem-solving skills. The incentives to participate in crowdsourcing activities that were identified from the interviews of [6] were:

• learning

• competition

• entertainment

Figure 3.1: A screenshot of the Dr. Detective game

To meet these requirements the Dr. Detective game (Figure 3.1) was designed as a clue-finding game, where a medical expert has to select different types of terms that lead to a given diagnosis, such as medication, allergies, age and location, in a paragraph of a medical case report. Once the medical expert was finished with a game, they would receive points for agreeing with the answers of others, for suggesting new answers and for the number of consecutive tasks they performed. If a suggested answer was not selected by others, the medical expert would lose points. The game had different difficulty levels, a leaderboard and allowed the medical expert to choose a medical domain for the task. The medical experts that played the game stated that the scoring was too difficult to understand, which can be explained by the fact that the game did not show how the score was calculated. The scoring itself does reward good annotation behavior, however, which is why we address the problem that medical experts do not know what they get points for, and base our scoring on the scoring in Dr. Detective. The leaderboard is also an element we used in our own crowdsourcing game, since it allows medical experts to be competitive. What is lacking in the Dr. Detective game is a screen that is shown after a medical expert completes a task, to show their score for that task and motivate them to perform more tasks. Another downside of Dr. Detective is the fact that a user has to submit terms for each type of term and cannot easily change their answers. In addition, the game does not have elements that make the game entertaining or give the expert a goal to keep on annotating. An expert may be motivated to be the best in the leaderboard, but after reaching a high place in the leaderboard there is nothing left to do.
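To make this kind of scoring concrete, the sketch below rewards answers that agree with other annotators and streaks of consecutive tasks, and penalises suggestions nobody else selects. The point values are assumptions for illustration only; they are not the actual Dr. Detective or Dr. Watson scoring rules, and the bonus for new answers that others later confirm is left out for brevity.

```python
# Illustrative scoring in the spirit of Dr. Detective: reward answers that
# agree with other annotators and streaks of consecutive tasks, penalise
# suggestions nobody else selects. Point values are assumptions.
POINTS_AGREEMENT = 10  # per answer also chosen by other experts
PENALTY_REJECTED = 5   # per suggestion nobody else selects
STREAK_BONUS = 2       # per consecutive task completed in one session

def task_score(own_answers, crowd_answers, streak_length):
    agreed = own_answers & crowd_answers
    rejected = own_answers - crowd_answers
    return (POINTS_AGREEMENT * len(agreed)
            - PENALTY_REJECTED * len(rejected)
            + STREAK_BONUS * streak_length)

# The expert selected three terms, two of which the crowd also selected,
# on a streak of four consecutive tasks.
print(task_score({"fever", "rash", "aspirin"}, {"fever", "rash"}, streak_length=4))
# 10*2 - 5*1 + 2*4 = 23
```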

The incentives that can be extracted from the literature in this section in addition to the ones already mentioned are:

• working in a team

• virtual ranks and awards


Even though working in a team does not have to be competitive, as in the serious games that focus on learning, Foldit is an example where players can compete as a team. Teamwork may motivate medical experts, but we did not think teamwork should be incorporated in the first version of our crowdsourcing game. We expected that the first version would not have a large number of users, which would result in long waiting times to find a team member. If a medical expert has to wait long to find a team member, the expert may be demotivated to play the game or instantly quit playing, which is what we wanted to prevent. To reward the work and achievements of the medical experts, virtual ranks and awards were used to motivate them. It is also possible that the medical experts would be motivated to contribute to crowdsourcing tasks if they receive direct benefits that are relevant for their work, as can be seen in HealthTap.

3.2 Crowdsourcing incentives survey for medical experts

The questions in the interviews performed in [6] were related to gaming, participating in crowdsourcing activities, medical competitions, learning and reading medical literature. Even though the interviews had the goal of finding necessary motivators for engaging medical experts in contributing, no questions focused directly on incentives. To gain more insight into the incentives necessary to motivate medical experts, the questions and outcome of these interviews were used as a starting point for the creation of an online survey, in addition to the incentives we extracted from literature. The reason to use an online survey instead of interviews is that it can be sent to multiple medical experts in a short amount of time and allows medical experts with an irregular schedule to fill in the survey whenever they have time to do so. We used the gaming and the participating in crowdsourcing activities parts of the interview performed in [6] as a basis. The gaming part was used because we needed more insight into what type of games and game elements the medical experts would like. The participating in crowdsourcing part was used to gain insight into the incentives that could motivate them to participate in crowdsourcing activities and answer our first research question.

The online survey we created consisted of three parts that will be discussed individually in sections 3.2.1, 3.2.2 and 3.2.3. The survey was created using Google Forms. Each participant would receive an email invite with a link to the survey and was kindly asked to send the survey invite to other colleagues if they could. The first part of the survey focused on incentives to participate in crowdsourcing activities, followed by a part with questions related to gaming. The third and final part consisted of demographic questions, to find out whether or not we surveyed different types of experts and to gain insight into possible differences between the experts. Even though the survey consisted of three separate parts, the gaming part and the incentives to participate in crowdsourcing activities part complement each other, since incentives can be extracted from gaming preferences and vice versa. The full survey can be found in Appendix A.

3.2.1 Incentives to participate in crowdsourcing activities

In the first part of the survey we focused on the incentives to participate in crowdsourcing activities. To find out what may be the best incentive or incentives to motivate the medical experts, we combined the incentives of [6] with the incentives that can be extracted from chapter 2 and the literature discussed in this section in a Likert question with a five-point scale (question 1.2). This question can be viewed as one of the most important ones, since it would enable us to answer our first research question. Other questions in the incentives part were used to gain insight into how many medical experts had already performed crowdsourcing tasks and how much time they would spend contributing to a crowdsourcing task. To gain insight into the preferred length of a text and the kind of feedback they would prefer after completing a problem solving task, such as the diagnosing task in the Dr. Detective game, we used screenshots of an interactive use case from the New England Journal of Medicine as an example.

3.2.2 Questions related to gaming

In the second part of the survey we asked questions related to gaming. The first question was whether or not the medical experts play games. Medical experts that did not play games could skip the other questions in this part and continue with answering the demographic questions.

The other questions in the gaming part were used to find out what type of games the medical experts play, what would motivate them to play a game frequently and what would be a reason to stop playing. To gain more insight into the gaming preferences of the medical experts, they were required to rate the importance of various game elements and other gaming related statements on a five-point Likert scale. Examples of such statements were “Do you prefer games that allow you to cooperate with other players” and “Do you prefer games that are action packed”. These questions would help us find out what game elements and incentives could be used and incorporated in Dr. Watson, which would help us answer our first and second research questions.

3.2.3 General demographic questions

The third and final part of the survey contained demographic questions. These questions were used to gain insight into the population that answered the survey. We asked for their age, gender and the number of years they had been involved in the medical domain. Another reason to add demographic questions to the survey was to find out whether or not different types of medical experts were represented. The types of health care provider we distinguished were based on the types of medical experts that are distinguished in MedlinePlus. This resulted in six categories:

1. Primary care provider, e.g. medical doctor, nurse practitioner, physician assistant

2. Nursing care provider, e.g. registered nurse, licensed practical nurse, advanced practice nurse

3. Specialty care provider, e.g. cardiologist, oncologist, physical therapist, etc.

4. Person involved in the medical domain, e.g. lecturer, researcher, etc.

5. Medical student

6. Other

2 http://www.nejm.org/multimedia/interactive-medical-case
3 http://www.nlm.nih.gov/medlineplus/ency/article/001933.htm

Since our goal was to find out what the incentives are for different types of medical experts that are able to perform crowdsourcing tasks that cannot be performed by a general crowd, we altered the categorization in order to also include medical experts such as students and people working in the medical domain. We replaced the drug therapy category from MedlinePlus with ‘Person involved in the medical domain’, so that researchers and medical lecturers involved in the medical domain would also be included in the categorization. In addition, the medical student category was added, because medical students are not a specific type of care provider yet. Lastly, the other category was added in case a medical expert did not fit in any of the categories.

3.2.4 Result Analysis of the crowdsourcing incentives survey

This section describes the results of the online survey discussed in section 3.2. First the demographic characteristics of the survey respondents are discussed. Thereafter the answers to the questions related to crowdsourcing activities and gaming are discussed. The tables and figures with the results from the online survey can be found in Appendix B. We chose to report the results in tables, because our survey results consisted for the most part of exact values and tables are excellent for providing exact values [19].

Demographic characteristics of the survey respondents

An overview of the demographic characteristics can be found in table B.1 in the appendix. The survey was answered by a total of 24 medical experts, with an average age of 29. There were more female than male respondents. Each type of medical expert was represented in the survey, with the exception of nursing care providers. One person stated to be a different type of medical expert, because they were a medical informatics student instead of a medical student. Most medical experts had been involved in the medical domain between two and five years or for more than ten years. This means that the incentives we can extract from the survey should work for almost all types of medical experts around the age of 29.

The medical domain specializations of the experts can be found in table B.13 in the appendix. If a medical expert stated to have two domain specializations, e.g. nephrology and epidemiology, their specialization is reported as a single specialization to prevent counting these medical experts twice. All medical experts with no medical domain specialization stated to be studying. There are more than ten different specializations within the population, consisting of caretakers working in primary, secondary and tertiary care. This diversity in specializations means that the incentives we can identify are indeed based on the input of different types of medical experts. In addition, there will be minimal bias towards one type of medical expert, since no type of expert is over-represented.


Incentives to crowdsource

Six medical experts stated to contribute to crowdsourcing activities on the web. Four of these experts stated to have contributed to a medical crowdsourcing activity. The amount of time medical experts would spend contributing to a medical crowdsourcing activity can be found in table B.5 in the appendix. Most medical experts stated to contribute to crowdsourcing once or twice a week with small contributions of less than five minutes. Other medical experts stated to contribute weekly with larger contributions between ten and twenty minutes, monthly with small contributions or only when there is a specific related event. This indicates that a crowdsourcing task in Dr. Watson should not take an expert more than five minutes to complete.

The response to the question “What do you think influences most the motivation of medical workers to participate in crowdsourcing tasks?” is shown in figure B.1 in the appendix. The Likert-scale question is plotted as a diverging stacked bar chart, as recommended by [19]. The statements on which the medical experts agreed the most were:

• The task leads to personal growth, e.g. learning more about their medical field and new discoveries made within that field

• The task is interesting for the medical expert to perform, e.g. allows the experts to achieve personal growth, contribute to the medical field or help patients

• The task is fun, e.g. cognitively challenges the medical expert and requires them to use their problem solving skills

These results correspond to the results of [6], where learning, entertainment and competition were identified as important crowdsourcing incentives for a medical expert crowd. This indicates that these incentives will have the strongest motivating effect for a medical expert crowd. Even though we distinguished different types of medical experts, we found that each type was motivated the most by personal growth, tasks that are interesting and tasks that are fun. Because the experts are motivated by the same things, we were able to create a single requirements list for the design of our crowdsourcing game in section 3.3.
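The diverging stacked bar chart mentioned above can be produced with a few lines of matplotlib. The sketch below is only an illustration of the chart type: the statement labels and response counts are made-up placeholders, not the actual survey results shown in figures 3.2 and B.1.

```python
# Minimal sketch of a diverging stacked bar chart for five-point Likert items,
# with the neutral category left out. Counts and labels are placeholders.
import matplotlib.pyplot as plt
import numpy as np

statements = ["Personal growth", "Interesting task", "Fun task", "Monetary reward"]
# Columns: strongly disagree, disagree, agree, strongly agree (neutral omitted).
counts = np.array([
    [0, 2, 12, 8],
    [1, 3, 11, 7],
    [1, 4, 10, 6],
    [5, 9, 4, 1],
])
percent = counts / 24 * 100  # 24 respondents

fig, ax = plt.subplots()
# Negative categories are drawn to the left of zero, positive to the right.
left_neg = -(percent[:, 0] + percent[:, 1])
ax.barh(statements, percent[:, 0], left=left_neg, color="#b2182b", label="strongly disagree")
ax.barh(statements, percent[:, 1], left=-percent[:, 1], color="#ef8a62", label="disagree")
ax.barh(statements, percent[:, 2], left=0, color="#67a9cf", label="agree")
ax.barh(statements, percent[:, 3], left=percent[:, 2], color="#2166ac", label="strongly agree")
ax.axvline(0, color="black", linewidth=0.8)
ax.set_xlabel("percentage of respondents")
ax.legend(loc="lower right", fontsize="small")
plt.tight_layout()
plt.show()
```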

The preferences of the medical experts regarding the setting in which to perform crowdsourcing activities can be found in table B.6 in the appendix. There was no clear preference regarding the setting, indicating that medical experts prefer a combination of a serious Wikipedia-type application, with a serious community and shared knowledge building, and a real competitive game that mainly serves entertainment purposes. An educational game was the least preferred setting. Since each of the three settings is almost equally preferred, Dr. Watson incorporates serious as well as entertaining elements.

Table B.10 in the appendix shows that the most preferred domain to perform a medical use case was any medical domain. A slight preference existed for other specific domains such as general medicine. The least preferred domain was their own medical domain, which indicated that it would be advantageous to incorporate a larger variety of medical topics in Dr. Watson.


Figure 3.2: The influence of different crowdsourcing incentives according to the 24 survey respondents in percentages. The neutral responses are not shown

Lastly, the preferences of the medical experts regarding learning new things and the text length of a medical use case can be found in tables B.7 and B.8 in the appendix, respectively. Regularly reading literature and attending relevant domain presentations were the most preferred ways to learn, while listening to podcasts, watching educational videos and playing educational games were less preferred. This indicated that incorporating literature could be a good way to achieve the learning and knowledge building goals of the medical expert crowd in a crowdsourcing application. The text length of the medical case that was used as an example was of normal, acceptable length according to half of the population, which indicated that an introductory text for an annotation task of around 150 words will be read by most experts.

Gaming preferences

Out of the 24 medical experts, only four stated that they do not play games. These medical experts did not answer the gaming related questions and are not present in this part of the analysis.

The types of games the medical experts play and their preferences can be found in table B.2 and table B.4 in the appendix, respectively. Half of the population stated to play puzzle games, while eight out of twenty medical experts stated to play family games and action & adventure games. Based on the responses to the statements regarding the type of games medical experts prefer to play, we found that they prefer games that:

• Are visually appealing, e.g. do not resemble the grey and overly cluttered look of the information systems most experts are using, but rather have a clean user interface

• Have a clear goal, so that right from the start the expert can think of how to achieve this goal, instead of wasting time figuring out what to do

• Are fast paced, meaning the tasks they are performing or the topics can change every 5 to 10 minutes, which resembles the concept of treating a patient and quickly moving on to the next

• Allow them to compete with others, in order to prove that they are better than some of their peers

• Have a clear winner and loser, so that the expert knows whether or not they did a good job, allowing them to adjust tactics in the future if necessary

• Allow the expert to freely explore the content of the game, e.g. do not restrict the expert to exploring only one thing at a time and give them the freedom to choose what to do and when

Games that are turn based, have a time limit, allow you to communicate with others or share things with friends through social media were not preferred. This indicated that Dr. Watson should contain competitive elements, have a clear purpose and not restrict the freedom of an expert, allowing the expert to freely explore the application. Table B.3 contains aspects that would or would not motivate a medical expert to play a game frequently. The only aspect that motivates the medical experts to play a game frequently was the competitive aspect. This is in accordance with the rest of our results and with the results of [6], where competition was also identified as a motivational aspect.

Lastly, reasons to stop playing a game can be found in table B.12. The main reason to stop playing a game would be a lack of time to play it, followed by having explored the main content of the game, having completed the game 100 percent and finding the game no longer challenging. This indicated that in order to motivate medical experts to keep playing a gamified medical crowdsourcing application, the game must remain challenging and have an endless supply of new content.


3.3 Requirements for medical crowdsourcing incentives

Based on the incentives discussed and the result analysis of the online survey, we were able to gain more insight into the incentives and game features that can be used to motivate a medical expert crowd to perform crowdsourcing tasks. These incentives are an aggregate of the incentives that can be extracted from the gaming preferences of the medical experts and the statements on which the medical experts agreed the most, e.g. a task must lead to personal growth, be interesting and be fun. To reduce the size of the list we combined the incentive of competing with others and the preference for games with a clear winner and loser into a single incentive. This aggregation resulted in the following list of requirements for the Dr. Watson game:

• crowdsourcing tasks must lead to personal growth, e.g. the medical expert must be able to learn new things about their medical field and new discoveries made within that field, enabling them to be the best medical expert they can be

• crowdsourcing tasks must be interesting for the medical expert to perform, e.g. allow the experts to achieve personal growth, contribute to the medical field or help patients

• crowdsourcing tasks must be fun to perform, e.g. cognitively challenge the medical expert and require them to use their problem solving skills

• crowdsourcing tasks must remain challenging for the medical expert over time and keep challenging their knowledge

• the application should enable the medical experts to be competitive and win, to prove that they are better experts than their peers

• the application should enable the medical experts to freely explore the content of the application, not restricting them too much in what they want to do and when

• the application should incorporate content from various medical domains, making the application interesting for experts from every medical domain

• the game should be visually appealing, e.g. not resemble the grey and overly cluttered look of the information systems most experts are using, but rather have a clean user interface

• the game should be fast paced, meaning the tasks they are performing or the topics can change every 5 to 10 minutes, which resembles the concept of treating a patient and quickly moving on to the next

3.4 Conclusion

Our goal is to collect CrowdTruth data to train Watson in the medical domain, and we need medical experts to perform annotation tasks that are too difficult for a non-medical expert crowd to perform. Since medical experts want to achieve


personal growth by learning and like to perform challenging tasks, the corresponding requirements match our goals and were considered the most important requirements. Even though a general crowd would also prefer a game to be fun and interesting, the medical experts are focused on their career and want to spend their time learning to increase their medical knowledge and become the best medical expert they can be, while a general crowd is more focused on being entertained. We also considered the requirement that a medical expert should be able to be competitive and win an important requirement, since this requirement does not always apply to a general crowd. Incorporating it therefore helps us tailor the application to the medical expert crowd. The requirements that the application should allow one to freely explore the content and that a crowdsourcing task must be visually appealing as well as remain challenging also apply to a general crowd and were considered less important, because they do not help to tailor the crowdsourcing game to the medical experts. Based on this list of requirements we created a design for Dr. Watson that should motivate medical experts to perform crowdsourcing tasks, which will be discussed in chapter 4.


Chapter 4

Design of Dr. Watson

In this chapter we describe the design of Dr. Watson - a gamified medical crowdsourcing application designed with the purpose of motivating medical experts to perform medical crowdsourcing tasks for Watson. In chapter 3 we discussed the online survey we used to gain insight into the incentives and game features that can be used to motivate a medical expert crowd to perform crowdsourcing tasks. Based on the requirements discussed in section 3.3 we designed multiple components for Dr. Watson, such as a scoring system and difficulty levels. In this chapter we first explain the design rationale of Dr. Watson, followed by an explanation of how we collect the CrowdTruth using Dr. Watson. Lastly, we go over each of the components and explain why each element should motivate a medical expert to annotate.

4.1 Dr. Watson design rationale

In the design of Dr. Watson we focused on three things:

• Helping the medical expert crowd learn new things about their medical field and new discoveries made within that field, by providing them with medical literature

• Making Dr. Watson competitive, so the medical experts can compete and win, by using an in-game scoring system

• Keeping the annotation tasks interesting and challenging, by using multiple difficulty levels

Because the medical expert crowd wants to learn new things about their medical field and new discoveries made within that field, we focused in Dr. Watson on providing the medical experts with medical literature. The reason we focused on literature was that most medical experts stated that they prefer to learn by reading articles. We hypothesized that not all experts would have easy access to literature or time to search for literature themselves, and thought a homepage where experts can view and easily access literature would motivate the experts to use Dr. Watson. To access the literature, however, the medical experts have to perform annotation tasks on that article or


perform another annotation task to acquire reputation points. The reputation points can be exchanged for an article of their choice and are used to determine the Dr. Watson rank of each expert on the leaderboard. The leaderboard should motivate the experts to collect more reputation points even if they can already unlock the literature they want, since the medical experts are competitive and want to be on top of the leaderboard.

Since the medical experts also wanted to compete and win, we created an in-game scoring system, which enables us to make each task competitive and determine whether a medical expert wins after completing a task. With the scoring we also motivate the medical expert to agree with others by giving agreement points. We reward experts that agree on an answer because we assume that an annotation is correct if multiple experts agree. We have to make this assumption because we do not know what the correct answers are in the crowdsourcing result. We also use scoring to motivate experts to suggest new annotations, since this allows us to capture annotations that may also be correct. During each task the expert competes with the crowd that already performed the task and has to beat or match their annotation score to win. After the task is performed we show a win or lose screen with a challenge they almost completed, to motivate the expert to continue playing. To motivate the experts further, they can view the annotations of other experts, so that they realize they are part of a team working towards a goal and can improve their score in the future.
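
To make the agreement-based scoring idea concrete, the sketch below shows one possible way to compute the in-game score for a single task. The point values, function names and the win condition of matching or beating the best crowd score so far are illustrative assumptions, not the actual scoring rules of Dr. Watson, which are described in section 4.4.

```python
# Minimal sketch of agreement-based scoring for one annotation task.
# Point values and the win condition are illustrative assumptions.

AGREEMENT_POINTS = 10      # reward for matching an annotation given by another expert
NEW_ANNOTATION_POINTS = 5  # reward for suggesting an annotation nobody gave before


def score_task(expert_annotations, crowd_annotations):
    """Score one task: agreement with the crowd pays more than a new suggestion."""
    score = 0
    for annotation in expert_annotations:
        if annotation in crowd_annotations:
            score += AGREEMENT_POINTS
        else:
            score += NEW_ANNOTATION_POINTS
    return score


def expert_wins(expert_score, previous_scores):
    """The expert wins by matching or beating the best score of the crowd so far."""
    return expert_score >= max(previous_scores, default=0)
```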

Lastly, to keep the tasks interesting and challenging we created multiple difficulty levels for the annotation tasks in Dr. Watson. The more difficult tasks contain more medical terms per sentence, which makes them more complex and harder to complete. These tasks also require the expert to gather more points to win, but reward more reputation points once completed. To motivate the medical expert to keep returning to the game, other and more difficult tasks have a level requirement. This level requirement can be met by leveling up and prevents new experts from exploring all content and then quitting Dr. Watson because they ‘completed’ the game. To level up and unlock the other tasks, the medical experts have to complete challenges. We created challenges because they make the game more challenging, but also allow us to motivate medical experts to perform the less popular annotation tasks and to annotate articles without annotations. To motivate the medical expert to perform the challenges, each challenge also rewards the medical expert with reputation points or a title. When a medical expert unlocks a title, the expert can select it as the title shown next to their name on the leaderboard, allowing the expert to display their most prestigious achievement and impress their peers.

Dr. Watson combines learning and competitive elements: it is a serious crowdsourcing game that challenges the knowledge of the medical experts, helps them learn new things and lets them compete with their peers. In section 4.2 we first discuss how CrowdTruth data is gathered using Dr. Watson. Thereafter, each element of Dr. Watson is discussed:

• The medical literature and the four content boxes that are used to provide the experts with literature are discussed in section 4.3

• The rationale and explanation of the scoring is given in section 4.4

• The challenges in Dr. Watson are discussed in section 4.5


• Levels and leveling up in Dr. Watson are discussed in section 4.6

• The win and lose screen used to motivate the expert to continue playing is discussed in section 4.7

• The difficulty levels to keep Dr. Watson challenging are discussed in section 4.8

• The performance page and Dr. Watson ranking are discussed in 4.9

4.2 Gathering CrowdTruth data using Dr. Watson

Since disagreement in certain annotation tasks reflects the semantic ambiguity of an instance and provides useful information, we want to harness disagreement [1]. The CrowdTruth is a type of ground truth that is richer in diversity of perspectives, opinions, and interpretations, which reflects more realistic human knowledge. To acquire the CrowdTruth using Dr. Watson, we let multiple medical experts perform the same annotation task. Ideally, if we ask the medical experts to answer an ambiguous question that has two correct answers, half of the medical experts give the first correct answer and the other half give the second correct answer. The number of medical experts required to perform each unique task to capture the CrowdTruth in Dr. Watson depends on the complexity of the task. If a task has a limited number of possible answers, e.g. a relation direction task that consists of a single multiple choice question, about five medical experts performing this task will be enough to capture the CrowdTruth using Dr. Watson. For tasks that are more complex a larger number of medical experts is required to capture the CrowdTruth, since a larger number of answers is possible.
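
To illustrate how the collected answers for a single task could be inspected for ambiguity, the sketch below computes the relative frequency of each answer and flags a task as ambiguous when no single answer dominates. The 0.7 dominance threshold and the function names are assumptions made for illustration only; CrowdTruth defines its own disagreement metrics [1].

```python
from collections import Counter


def answer_distribution(answers):
    """Relative frequency of each answer the experts gave for one task."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def looks_ambiguous(answers, dominance_threshold=0.7):
    """Flag a task as ambiguous when no single answer reaches the threshold.
    The 0.7 threshold is an illustrative assumption, not a CrowdTruth value."""
    return max(answer_distribution(answers).values()) < dominance_threshold


# Example: a relation direction task answered by five experts.
# Three say 'treats', two say 'causes'; no answer dominates, so the
# disagreement signals an ambiguous sentence.
print(looks_ambiguous(["treats", "causes", "treats", "causes", "treats"]))  # True
```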

To prevent too many or too few medical experts from performing the same annotation task in Dr. Watson, we control the available tasks, the challenges and the cost of each article. For each type of task there is a list of available tasks the medical expert can perform. Every time the medical expert starts a task of a certain type, one of the available tasks is randomly selected. Tasks that reach a threshold value, e.g. a maximum of 10 annotations, are removed from the list, preventing too many experts from performing that task. We motivate the medical expert to perform tasks with too few annotations by generating challenges for that task type. Since medical experts can also perform an annotation task on a specific article, we reduce or increase its cost to make it more or less attractive for an expert to annotate that article.
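
The sketch below illustrates this task-control logic: a task of the requested type is drawn at random from the list of available tasks, and a task is retired once it reaches the annotation threshold. The threshold of 10 annotations is taken from the example above; the data structures and function names are assumptions, not the actual implementation.

```python
import random

ANNOTATION_THRESHOLD = 10  # example threshold from the text; may differ per task type


def pick_task(available_tasks, task_type):
    """Randomly select one of the available tasks of the requested type."""
    candidates = [task for task in available_tasks if task["type"] == task_type]
    return random.choice(candidates) if candidates else None


def register_annotation(task, available_tasks):
    """Count a new annotation and retire the task once it has enough of them."""
    task["annotation_count"] += 1
    if task["annotation_count"] >= ANNOTATION_THRESHOLD:
        available_tasks.remove(task)
```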

4.3 Medical literature

The homepage of Dr. Watson is centered around four content boxes that are used to provide the medical expert with interesting literature that enables them to learn new things about their medical field and new discoveries made within that field. We hypothesized that not every expert would have easy access to all medical literature or time to search for literature regularly themselves, and thought that a place where a medical expert can view and easily access the literature would motivate medical experts to use Dr. Watson. A screenshot of the homepage can be seen in Figure 4.1.


Figure 4.1: The homepage of Dr. Watson with the four content boxes.

Each content box contains different types of literature that may be interesting for the expert to read. The first content box displays new literature that was recently published in one of the top 5 journals with a high impact score. We only display articles from the top 5 journals because displaying all new articles may result in low quality literature that is not interesting to read, as well as an excessively fast feed of articles that is hard to oversee for an expert who uses the application twice a week. The second content box contains high urgency literature that is in ‘dire need’ of annotations. This content box is important from the perspective of CrowdTruth, because it is necessary to gather CrowdTruth data from all medical sub domains. The articles in this content box can be articles that either have no annotations at all or need more annotations. The third content box contains the most popular literature, i.e. the articles that were bought with reputation points the most that week and have a large number of annotations. This content box is not very effective from the perspective of CrowdTruth, but will most likely contain very interesting literature for the medical experts. Lastly, we have a personal feed content box, where a user can customize the content by entering keywords and publishers of which they would like to receive articles.
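
As an illustration of how the four content boxes could be filled, the sketch below selects articles for each box from a list of article records. The field names (journal_rank, published, annotation_count, purchases_this_week, keywords) and the selection rules are assumptions that mirror the description above, not the actual implementation of Dr. Watson.

```python
from datetime import datetime, timedelta


def fill_content_boxes(articles, profile_keywords, box_size=5):
    """Fill the four homepage content boxes from a list of article dictionaries.
    All field names are illustrative assumptions."""
    one_week_ago = datetime.now() - timedelta(days=7)

    # New literature from the top 5 journals, published in the last week.
    new_box = [a for a in articles
               if a["journal_rank"] <= 5 and a["published"] >= one_week_ago]

    # High urgency: articles with the fewest annotations come first.
    urgent_box = sorted(articles, key=lambda a: a["annotation_count"])

    # Most popular: articles bought with reputation points the most this week.
    popular_box = sorted(articles, key=lambda a: a["purchases_this_week"], reverse=True)

    # Personal feed: articles matching the keywords the expert entered.
    personal_box = [a for a in articles
                    if set(a["keywords"]) & set(profile_keywords)]

    return {
        "new": new_box[:box_size],
        "urgent": urgent_box[:box_size],
        "popular": popular_box[:box_size],
        "personal": personal_box[:box_size],
    }
```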

4.4 Scoring

In this section we first discuss the scoring rationale and our scoring components. Thereafter, we discuss the different tasks we have and the scoring for these tasks.

4.4.1 Scoring rationale

Since medical experts are competitive and want to win, we hypothesized that making the experts compete while performing annotation tasks would make them enjoy the annotation tasks more. To make the tasks competitive we used the concept of the crowdsourcing game Waisda, where players receive a score for tags entered while watching a video and win the game if they have the highest score at the end of the game [10]. In Dr. Watson the medical experts receive points for every annotation that is beneficial for
