UNIVERSITY OF AMSTERDAM, FACULTY OF SCIENCE

Thesis Master Information Studies, track Human Centered Multimedia

Designing a system for Triplet Verification and Extraction by Human Computation through a User-Centered Approach

Author:

Bastian Geneugelijk (11428988) bastian.geneugelijk@student.uva.nl

Supervisor: Dr. Abderrahmane Khiat

Freie Universität Berlin

Second assessor: Dr. Frank Nack University of Amsterdam


Designing a system for Triplet Verification and Extraction by Human Computation through a User-Centered Approach

Bastian Geneugelijk

University of Amsterdam, The Netherlands

Student ID: 11428988 bastian.geneugelijk@student.uva.nl

ABSTRACT

Triplets (subject -> predicate/relation -> object) often define a common understanding of the meaning of information. This enables sharing and reuse of data. Triplet extraction by human computation approaches frequently requires domain experts and can be considered a tedious and repetitive task. To overcome these limitations, we develop a game with a purpose to make an attractive and easy-to-use system and to make the task of triplet verification and extraction entertaining. The developed workflow supports high-quality triplets through the verification of extracted triplets by other users. Furthermore, we validated our approach by a qualitative user test that consisted of a user experience questionnaire, followed by a semi-structured interview. Based on the obtained test results we argue that our system provides a convenient and engaging way for triplet verification and extraction through human computation.

1 INTRODUCTION

Over the recent years, digitization has led to exponential growth of digital information, mostly made available through the world wide web. Digital information is, in the absence of structure, very heterogeneous, which hinders the sharing and re-use of data. Humans can interpret this information, but machines cannot capture the semantics of this heterogeneous information since it is not represented. A common way to represent the semantics of information is by using triplets: subject -> predicate/relation -> object (e.g., the Brandenburger Tor <is located in> Berlin).

Existing approaches for triplet extraction either use fully automatic or human-computer approaches. The first category does not perform well on text that does not necessarily adhere to any specific structure (e.g. Facebook and Twitter content), making these systems fail in identifying all the forms in which a relation can be expressed. The second category, on the other hand, relies on experts combined with machine computation to extract triplets. Since triplet extraction by human computation (manual triplet extraction) does not scale to the world wide web [7] and often requires domain experts, this task can be assigned to crowds using platforms such as Amazon Mechanical Turk1. However, this approach has limits: (1) employing a large number of workers quickly becomes expensive, (2) some data requires domain experts and (3) the task of triplet extraction can be considered tedious and repetitive by users. This research attempts to address the limitations of current human-computer approaches by answering the following research question: How can we create an engaging and convenient triplet extraction system for human computation that supports high-quality triplets? To answer this question, we first analyze the current state of triplet extraction. We investigate human-computer approaches to identify which strategies have already been tested with the crowd. By reviewing triplet extraction approaches, we aim to tackle the limitations of automated and manual triplet extraction by using gamification methods, which result in a system that is attractive and easy to use. Furthermore, we look for approaches that extract triplets automatically to examine whether we can automate elements in our system within the scope of our research.

1 http://mturk.com

Based on a literature study, we developed a workflow that supports high-quality triplets, as described in section 3. Furthermore, we established a set of design principles that reflect our problem statement. These design principles are applied to design our prototype, as presented in section 4. To validate our approach, we conducted a semi-structured interview along with a user experience questionnaire with 16 participants, as described in section 5. In section 6, we examine the results of the qualitative user test and propose potential future work.

2 RELATED WORK

In this research, we aim to come up with a system that engages humans in extracting triplets (subject, predicate/relation, object). For this, we look at current efforts in automatic and crowdsourcing triplet extraction approaches. Furthermore, we examine usability studies that are related to the task of triplet extraction. The analysis of this section will be part of the baseline for the design of our triplet extraction tool.

2.1 Automatic Triplet Extraction

Extracting triplets from text requires Named Entity Recognition (NER) and Relation Extraction (RE). NER aims at identifying and disambiguating names of entities within text, usually constrained to seven categories: Location, Person, Organization, Money, Percent, Date and Time [13]. However, triplet extraction does not rely only on the recognition of entities from text but also on the relationship between entities; thus the need for RE was established. In this research, we paid special attention to RE approaches, since RE is the key feature of triplet extraction.

Various automatic approaches have been developed to reduce the dependency on workers and domain experts in RE. We can divide automatic approaches into two categories: methods that require a defined set of relations (closed RE) and methods that do not need a defined set of relations (open RE). One example of a closed RE approach is supervised RE, which attempts to identify entity-pairs (subject, object) for the given relations by using statistical methods, such as kernel methods2 [25]. The main problem of supervised RE is that the development of a sufficient corpus may cost a lot of effort. Another method of closed RE is bootstrapping, which starts with an initial model that consists of a few examples. The model gradually expands itself since the model is retrained after it has extracted some unknown relations. KnowItAll [8] extends the approach of bootstrapping to an unsupervised system that does not require manually extracted relations at all. However, due to many iterations, bootstrapping suffers from semantic drift, which appears when errors in classification accumulate. In addition to bootstrapping, Mintz et al. [17] proposed a different method, called distant supervision. Distant supervision uses entity-pairs from a semantic database (e.g., Knowledge Graph3) and attempts to find sentences holding these entities in unlabeled text. Based on entity-pair matches, Mintz et al. extract textual features using syntactic and lexical features to train an RE algorithm. However, distant supervision is initially limited to the schema as imposed by a semantic database, which makes it challenging to manage relations that do not exist in the schema. The concept of open RE was proposed by Banko et al. [3] as Open Information Extraction (Open IE) and does not require existing entity-pairs or a set of extracted relations. Currently, Open IE can extract the four most commonly detected relation patterns, namely verb, noun + prep, verb + prep and infinitive. This predefined pattern of how a relation should look is also a limitation: relations have to occur between entity names and in the same sentence. Furthermore, since extracted relations are not specified, it is difficult to use these relations in other systems. The outcome of the system indicates that there is some relation between two entities, but there is no generalization between these relations.

2.2 Human-Computer Triplet Extraction

Human extracted relations can boost the performance of triplet extraction, as demonstrated for distant supervision by Liu et al. [16] and unsupervised systems by Zouaq et al. [26]. A conventional approach for obtaining human-extracted triplets or relations is by using crowdsourcing. Crowdsourcing can be defined as "a distributed problem-solving and production model" ([4], p. 75). Because a participating crowd is of importance for the success of crowdsourcing, we describe genres that have distinctive incentives for motivating workers to take part in the activity of triplet extraction.

2.2.1 Mechanized Labor. Researchers can use online labor markets such as Amazon Mechanical Turk4 or CrowdFlower5, where the motivation for workers to complete tasks (e.g., triplet extraction) is to get paid. This approach, however, limits the scalability of manual triplet extraction since employing large numbers of workers becomes expensive. Furthermore, the usage of untrained people for triplet extraction can have downsides. Siangliulue et al. [21] reported that usage of crowds might be challenging, especially in communities where a level of domain knowledge is required. Therefore, Siangliulue et al. [21] suggest that the establishment of a semantic model (triplets or relationships, etc.) should come from within such a community itself.

2 Collection of algorithms for pattern analysis that is used to study general types of relations in datasets.
3 https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html
4 http://mturk.com
5 https://www.crowdflower.com

2.2.2 Games With a Purpose. A game with a purpose (GWAP) is a technique to outsource computational steps that are difficult to execute by a machine. A GWAP thereby differentiates itself from mechanized labor since the primary motivation to take part is to be entertained. We therefore see a GWAP as a compromise between the user, who wants to be entertained, and the machine, whose limitations the user helps to cover. In the domain of triplet extraction, we identified Higgins [14], a system that combines automatic information extraction with a game-based human-computing engine. Higgins uses multiple semantic resources as well as a statistical language model to indicate potential subjects, relations and objects between phrases, and generates questions as a game to fill in missing attributes for triplets. However, the limitation of these closed-form questions is that they leave the challenge of discovering new triplets unsolved.

3 APPROACH

The system is divided into two parts: an Information Extraction (IE) engine and a Human Computing (HC) engine, as shown in figure 1. The IE engine deals with the automatic extraction of nouns, verbs, etc. and automatic triplet extraction. We put the HC engine in place to identify relationships between entities and verify triplets by human computation.

3.1 IE Engine

Once the user has uploaded text to our system, the text is processed by a pipeline of Natural Language Processing (NLP) tools. Our method transforms the text into sentences, each provided with a list of tokens. Each token is tagged with part of speech (POS) and dependencies. The POS tags provided along with the words are derived from the Penn Treebank Project [19]. A grammatical analysis of the sentence is conducted to obtain the POS category for each token, such as a verb or noun. In addition, we obtain the dependencies of tokens within the sentence, to gain insight into the directed links between words so that they can be used to, for example, separate the main clause from a possible sub-clause. In our research, we store the dependencies to provide context for extracted triplets. Furthermore, we employ Stanford's Open Information Extraction (Open IE) system [1] in our pipeline to make sure we obtain triplets that can be extracted automatically. Stanford's Open IE can extract triplets from plain text so that the schema for these triplets does not need to be declared in advance. The extracted triplets are saved to a triplet store so that they are accessible by the HC engine for verification and ready for export for further usage outside our system.
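As an illustration only, the following is a minimal sketch of such a preprocessing step, assuming spaCy as the NLP toolkit (the paper does not name the libraries used): it splits uploaded text into sentences and attaches Penn Treebank POS tags, dependency labels and stopword flags to each token. The automatic triplet extraction with Stanford's Open IE [1] is only indicated by a comment and not reproduced here.

```python
# Minimal sketch of the IE-engine preprocessing step, assuming spaCy.
# The prototype's actual pipeline (and Stanford's Open IE step) may differ.
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, POS tagger, dependency parser

def preprocess(text: str) -> list[dict]:
    """Split text into sentences with per-token POS tags and dependencies."""
    doc = nlp(text)
    sentences = []
    for sent in doc.sents:
        tokens = []
        for token in sent:
            tokens.append({
                "text": token.text,
                "pos": token.tag_,        # Penn Treebank tag, e.g. NN, VBZ
                "dep": token.dep_,        # dependency label, e.g. nsubj, dobj
                "head": token.head.text,  # governing word, provides context
                "is_stopword": token.is_stop,
            })
        sentences.append({"sentence": sent.text, "tokens": tokens})
    return sentences

if __name__ == "__main__":
    for s in preprocess("The Brandenburger Tor is located in Berlin."):
        print(s["sentence"])
        for t in s["tokens"]:
            print(" ", t)
    # Automatic triplets would additionally be obtained from Stanford's
    # Open IE [1] and written to the triplet store (not reproduced here).
```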

3.2 HC Engine

The HC engine is responsible for two different tasks, namely triplet verification and extraction. An extracted relation results in a triplet and is verified based on the sentence from which it was extracted. A decision whether to include a triplet in our IE triplet store is made based on the distribution of votes and majority voting. Triplet extraction is done by selecting a subject, predicate and object from a presented sentence. An overview of the HC engine is given in figure 2 and discussed in more detail below.


Figure 1: System Overview

Figure 2: HC Engine overview

3.2.1 Pick sentence. To extract a relation, the user picks a document with sentences from the system. Therefore, the user is focused on one sentence at a time. By splitting documents up into sentences, we attempt to create a more organized and clear interface. For the task of manual triplet extraction, we do not put strict lexical constraints in place, in contrast to multiple automatic triplet extraction approaches [9][10]. Since the imported text does not necessarily adhere to any specific structure, we are cautious to implement such strict constraints. However, a sentence may have many words, which leads to a lot of options for a possible relation. To reduce the number of options and to attempt to steer the focus of the user slightly, we implement the following lexical guidelines:

• When the predicate in our system is active, we highlight all the verbs in the given sentence;

• When the subject or object in our system is active, we highlight all the nouns in the given sentence.

Words get visually highlighted in the sentence as soon as they meet these guidelines. Furthermore, we reduce the number of options for a possible relation by disabling every word that is defined as a stopword6 by the IE engine.
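To illustrate these guidelines, the hypothetical helper below (names are ours, not taken from the prototype) builds on the token records sketched in section 3.1: verbs are highlighted when the predicate attribute is active, nouns when the subject or object attribute is active, and stopwords are disabled throughout.

```python
# Sketch of the lexical guidelines from section 3.2.1; names are illustrative.
def annotate_words(tokens: list[dict], active_attribute: str) -> list[dict]:
    """Mark each token as highlighted and/or disabled for the extraction UI.

    tokens: per-token records with a Penn Treebank 'pos' tag and an
    'is_stopword' flag.  active_attribute: 'subject', 'predicate' or 'object'.
    """
    annotated = []
    for token in tokens:
        if active_attribute == "predicate":
            highlight = token["pos"].startswith("VB")   # highlight verbs
        else:  # subject or object attribute is active
            highlight = token["pos"].startswith("NN")   # highlight nouns
        annotated.append({
            "text": token["text"],
            "highlighted": highlight,
            "disabled": token["is_stopword"],  # stopwords cannot be selected
        })
    return annotated
```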

3.2.2 Return extracted triplet. Since one user has now extracted a triplet, we count one vote for this triplet (one vote for correct, zero votes for incorrect). The triplet is therefore saved to the IE triplet store, until majority voting indicates that the triplet is incorrect.

3.2.3 Receive points. After the triplet is saved, the user receives points for his effort. These points are added to the total amount of points the user has collected.

3.2.4 Triplet verification. Triplet verification is conducted by either agreeing or disagreeing with a given triplet. The triplet verification task contributes to the following goals:

• Cheating detection: Since our system deals with crowdsourced users and tasks that cannot be verified automatically, it becomes difficult to detect cheating users. To overcome this issue, we implement triplet verification by using a control group approach, as proposed by Hirth et al. [11]. By allowing users to verify triplets from other users, the system can use majority voting to make a distinction between correctly and incorrectly extracted triplets and can take action against cheating users.

• Precision: An extracted triplet is verified by multiple users. James Surowiecki [22] proposes the idea of the wisdom of crowds and argues that a combination of decisions from groups is often considered better than the decision of one individual. Therefore, triplet verification by multiple users contributes to a high precision of extracted triplets.

• Cheap task: Compared to the task of triplet verification, the task of manual triplet extraction requires more interaction with the system, and therefore we consider it an expensive task. Triplet verification shows the user already extracted triplets, to minimize the need for the expensive task of triplet extraction.

6 Stopwords are words that do not carry important significance towards a relation, for example the, and or a.

3.2.5 Crowd verification. After a user has verified a triplet, the vote is submitted to the HC datastore after which we calculate with majority voting whether the triplet is correct or incorrect.

3.2.6 Majority voting. There are multiple ways to conclude whether a triplet is correct or incorrect based on letting users verify triplets. These methods include, but are not limited to, expectation maximization [5] or a multinomial naive Bayes algorithm [23] which takes the level of disagreement for a given triplet into account. However, since the scope of our research is limited to designing an engaging manual triplet extraction system, we decided to implement the common method of majority voting to identify accurate triplets, leaving research into a decision-making algorithm for future work. When majority voting finds a triplet incorrect, the triplet is removed from our IE datastore. When the triplet is found correct, the triplet is added to our IE datastore. In case the outcome of majority voting has not changed compared to the previous calculation, only the vote is saved.
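A minimal sketch of this decision rule (function and store names are illustrative, not taken from the prototype): votes are tallied, a strict majority decides whether the triplet belongs in the IE datastore, and nothing changes when the verdict equals the previous one.

```python
# Sketch of the majority-voting step from section 3.2.6; names are illustrative.
def apply_majority_vote(triplet_id: str, votes: list[bool],
                        ie_store: set, previous_verdicts: dict) -> bool:
    """Decide whether a triplet is correct based on accept/reject votes.

    votes: True = correct, False = incorrect (including the implicit vote
    of the user who extracted the triplet).
    """
    accepts = sum(votes)
    correct = accepts * 2 > len(votes)          # strict majority says 'correct'
    if correct != previous_verdicts.get(triplet_id):
        if correct:
            ie_store.add(triplet_id)            # triplet (re)enters the IE datastore
        else:
            ie_store.discard(triplet_id)        # triplet is removed from the IE datastore
        previous_verdicts[triplet_id] = correct
    # otherwise only the vote itself is persisted (not shown here)
    return correct
```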

3.2.7 Update points. When the status of the triplet has either changed from correct to incorrect or from incorrect to correct, points are withdrawn from or added to the user who initially extracted the triplet. By withdrawing points, we use the principle of loss-aversion7 as a game element in our system, as described in section 4.3.

Next, we describe our design that incorporates the described human computation approach to extract and verify triplets.

4 PROTOTYPE DESIGN

To come up with a prototype for our system, we first describe a set of guidelines that reflects our problem statement.

4.1 Design principles

Based on the defined problem and the knowledge gained during our literature study, we derived two design principles. The first principle, attracting, attempts to make the task of manual triplet extraction less repetitive. The second principle, assisting, attempts to make the task of manual triplet extraction less tedious and make the system available for non-domain experts.

4.1.1 Attracting. To make manual triplet extraction feel less repetitive, we attempt to make our system more attractive to use by implementing the following aspects:

• A sense of achievement: we link the task of triplet extraction and verification to achievements that have personal and social significance within the system, such as points, rewards or a leaderboard [6].

• A sense of competition: by using the crowd as an element in our system, we implement a sense of competition that can be an incentive for self-improvement with regard to the task of manual triplet extraction [24].

7 Loss-aversion refers to people's inclination to prefer avoiding losses over obtaining similar gains [12].

4.1.2 Assisting. To make the task of manual triplet extraction less tedious we assist the user based on the following aspects:

• Progressive disclosure: we only show the minimum amount of data required for a task. Furthermore, less frequently used and complex information will be left out of the main interface. By only presenting essential information we can reduce complexity and show options gradually [18].

• Collective intelligence: because a triplet is verified by multiple users, we can combine their decisions and ultimately reach a higher quality of triplets [11][2]. Also, during triplet verification, users get an understanding of the representation of a relation, which can help when they start to extract triplets themselves.

4.2 Interaction design

Based on the design principles, the goal is to prototype a system that both assists the user and is attractive to the user. Once a user enters our system for the first time, the user is presented with a graphical representation of text documents that are available for play. The top of the interface in figure 3 and figure 4 shows how well the user is performing relative to the crowd. Accuracy indicates what percentage of the verified or extracted triplets is considered to be valid by the community. By showing accuracy, we intend to motivate the user to make more thoughtful decisions since a lower accuracy reflects incorrect decisions. To calculate the accuracy we look at the total number of verifications (V), the number of correct verifications (Vc), the total number of extractions (E) and the number of correct extractions (Ec):

\[ \text{accuracy} = \frac{\frac{V_c}{V} \cdot V + \frac{E_c}{E} \cdot E}{E + V} = \frac{V_c + E_c}{E + V} \]
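A minimal sketch of this calculation in code (the function name is illustrative, not taken from the prototype):

```python
# Sketch of the accuracy indicator from section 4.2; naming is illustrative.
def accuracy(v_correct: int, v_total: int, e_correct: int, e_total: int) -> float:
    """Share of a user's verifications and extractions judged correct by the crowd."""
    if v_total + e_total == 0:
        return 0.0
    return (v_correct + e_correct) / (v_total + e_total)

# Example: 8 of 10 verifications and 3 of 5 extractions accepted -> 11/15 ≈ 0.73
print(accuracy(8, 10, 3, 5))
```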

4.2.1 Triplet verification. Figure 3 shows the screen for the task of triplet verification. The screen shows the current amount of points the user has, the position of the user relative to the crowd based on points, a highlighted triplet and the sentence to give context to the provided triplet. The user has the choice to either accept or reject the presented triplet. When the user clicks on his answer of choice, feedback about given points is shown and the next triplet is loaded. In case all triplets for the given sentence are verified, the context of the game changes to the task of manual triplet extraction.

Figure 3: Triplet Verification

Figure 4: Relation Extraction

4.2.2 Triplet extraction. To design the triplet extraction interface, we have to decide on the balance of precision versus recall. By focusing on precision, we concentrate on relations where there is little doubt about whether they are correct or not. We can also choose to focus on the recall of manual triplet extraction. By focusing on recall, we concentrate on as many relations as possible, attempting to extract all possible triplets for a given sentence. When focusing on recall, users need to have a degree of freedom, so that it is possible to extract as many triplets as possible. When we focus on precision, we give our users less freedom and let them play according to rules, to ensure that triplets are of high quality. Since the uploaded text in our system does not necessarily follow a grammatical structure, we decided to focus on recall in our triplet extraction interface. This focus is expressed by giving the user the freedom to apply any word from a sentence to one of the attributes of a triplet. In this way, we offer the possibility to extract all possible relations with corresponding subject and object from a given sentence. To reduce the number of choices for the user, we have implemented the lexical guidelines as described in section 3.2.1. Figure 4 shows that words are visually highlighted when they meet these guidelines. However, words that are not highlighted are still represented by a button, indicating that these words can still be selected. Figure 4 shows the interface for manual triplet extraction. The top of the screen shows information about the accuracy of the performed tasks, the position of the user relative to the crowd and the obtained points. To extract triplets, we divided the task of triplet extraction into three micro-tasks: filling the subject, predicate and object attributes. The user can add words to one of these attributes by clicking on a word in the sentence. By creating a toggle button for each word in the sentence, the user can assign and remove a word from a triplet attribute (subject, predicate, or object). The toggle button follows existing conventions of a checkbox8, making it recognizable to the users. After the user has finished a triplet, the checkmark icon can be clicked, after which the triplet is saved and the points and accuracy are recalculated. After saving a triplet, the user can start to extract another triplet. When the user is done extracting triplets for the given sentence, the next sentence button can be clicked, which is a system trigger to load the next sentence in the document.
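To illustrate the three micro-tasks, the sketch below models the extraction state as a word list per attribute; clicking a toggle button corresponds to calling toggle(). Class and method names are hypothetical, not taken from the prototype.

```python
# Illustrative model of the triplet-extraction micro-tasks (section 4.2.2).
class TripletDraft:
    """Words assigned to the subject, predicate and object attributes."""

    def __init__(self):
        self.attributes = {"subject": [], "predicate": [], "object": []}

    def toggle(self, attribute: str, word: str) -> None:
        """Assign a word to an attribute, or remove it if it was already assigned."""
        words = self.attributes[attribute]
        if word in words:
            words.remove(word)
        else:
            words.append(word)

    def as_triplet(self) -> tuple[str, str, str]:
        """Return the draft as a (subject, predicate, object) triplet."""
        return tuple(" ".join(self.attributes[a])
                     for a in ("subject", "predicate", "object"))

draft = TripletDraft()
draft.toggle("subject", "Brandenburger")
draft.toggle("subject", "Tor")
draft.toggle("predicate", "is")
draft.toggle("predicate", "located")
draft.toggle("object", "Berlin")
print(draft.as_triplet())   # ('Brandenburger Tor', 'is located', 'Berlin')
```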

4.3 Game elements

We implemented a rule system that defines the constraints and consequences of actions performed by users. For every extracted triplet, the user receives five points. However, these five points can partly be withdrawn from the user as soon as the crowd rejects the extracted triplet. In that case, the user will only receive one point for effort instead of the original five points. Another scenario would be to only give the user points if n users from the crowd accepted the triplet. However, the number of verifications necessary for an accurate extraction may differ among user groups, hence we do not know how many users need to verify the triplet. Furthermore, by issuing points right after the extraction has been done instead of after n verifications, we intend to encourage the user to keep playing. In addition to the given arguments, by communicating that the user will lose four points after triplet rejection, we intend to call upon the principle of loss-aversion. Loss-aversion refers to people's inclination to prefer avoiding losses over obtaining similar gains [12], and therefore this principle may decrease the risk of deliberately false triplets in our system. Furthermore, we set up a mechanism so that users cannot extract the same triplet twice. In this case, a notification is shown to the user that the triplet has already been extracted. For the task of triplet verification, we put a time limit in place. For each triplet, the user gets 15 seconds to decide between acceptance and rejection. An answer is considered correct if it matches the answer of the crowd. The answer of the crowd is calculated through majority voting. If the user answers correctly, one point is issued to his account. When the user answers incorrectly, zero points are issued to his account. By rewarding correct verification answers and correctly extracted triplets, users can distinguish themselves from the crowd by the number of issued points.
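The rule system described above can be summarised as a small scoring policy. The constants below mirror the rules stated in this section; the function names are illustrative, not taken from the prototype.

```python
# Sketch of the rule system in section 4.3; function names are illustrative.
EXTRACTION_REWARD = 5         # points issued immediately after an extraction
EFFORT_REWARD = 1             # points kept when the crowd later rejects the triplet
VERIFICATION_REWARD = 1       # points for a verification matching the crowd's verdict
VERIFICATION_TIME_LIMIT = 15  # seconds to accept or reject a presented triplet

def extraction_points(rejected_by_crowd: bool) -> int:
    """5 points on extraction; 4 of them are withdrawn on rejection (loss aversion)."""
    return EFFORT_REWARD if rejected_by_crowd else EXTRACTION_REWARD

def verification_points(user_answer: bool, crowd_answer: bool) -> int:
    """1 point when the user's accept/reject choice matches the majority vote."""
    return VERIFICATION_REWARD if user_answer == crowd_answer else 0
```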

5 USER TEST

To validate our approach, we conducted a user test with 16 participants. Participants were asked to perform tasks in the system, followed by a user experience questionnaire and a semi-structured interview.

8 https://www.w3.org/wiki/Html/Elements/input/checkbox

5.1 Participants

We recruited a total of 16 participants as the target group for an empirical study. The participants consisted of 4 women and 12 men, with ages ranging from 21 to 28 years (M=24.9, SD=2.4). All participants speak English at an advanced level but have not previously extracted relations from text for NLP.

5.2 Questionnaire survey

Besides a semi-structured interview, we used the User Experience Questionnaire (UEQ) as developed by Laugwitz et al. [15] to measure the user experience (UX) of our product. The UEQ measures three different qualities of the product:

• Pragmatic quality: provides a representation of the basic usability, e.g., is the product considered attractive, efficient and reliable?

• Hedonic quality: provides a representation of aspects that do not have a clear connection to the task-related goals, e.g., is the product considered stimulating and innovative?

• Attractiveness: presents a combined representation of the general impression the product makes on users.

To specify these three qualities further, the qualities are divided into six scales, as presented in table 1. To measure these scales, the UEQ contains a set of item pairs with opposite meanings. The order of the terms is random, which means that half of the items start with the positive term and the other half start with the negative term. A 7-point system is used to reduce central tendency bias. An example of an item is:

demotivating o o o o o o o motivating

The items are scored between -3 and +3, where -3 represents the most negative answer and +3 represents the most positive response. To validate the design of the UEQ, Laugwitz et al. measured the validity of the items by conducting 11 user tests with a total of 144 participants, along with an online questionnaire with 722 participants [15]. The results of this study show for each scale a reliability, as measured by Cronbach's Alpha, varying from 0.69 to 0.86. Cronbach's Alpha provides an estimate of the internal consistency among test scores, and a consistency of α ≥ 0.7 is commonly acknowledged as acceptable. Therefore, we conclude that the results of Laugwitz et al. are sufficient for us to use the UEQ in our user test.
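For reference, Cronbach's Alpha for a scale with k items can be computed from the item variances and the variance of the summed scores. The sketch below is a standard textbook formulation of the coefficient and not part of the UEQ tooling.

```python
# Standard Cronbach's Alpha computation for one scale (illustrative only).
import statistics

def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """item_scores: one row per participant, one column per item (answers -3..+3)."""
    k = len(item_scores[0])                     # number of items in the scale
    # sample variance of each item across participants
    item_vars = [statistics.variance([row[i] for row in item_scores])
                 for i in range(k)]
    # sample variance of the participants' summed scale scores
    total_var = statistics.variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```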

5.3 Semi-structured interview

To explain the results of the UEQ, we conducted a semi-structured interview after the UEQ with each participant. The semi-structured interview consisted mostly of follow-up questions based on what the interviewer had observed while the participant performed his task. Besides, the participant was given the possibility to explain answers from the UEQ.

5.4 Procedure

Test sessions were conducted in person or via a video connection that showed the face of the participant as well as the screen. In all cases, participants were in (semi-)private spaces, such as at home or a quiet workspace.

| Quality | Scale | Description |
|---|---|---|
| Attractiveness | Attractiveness | General impression of the product. Do users appreciate the product? |
| Pragmatic | Perspicuity | Is the product easy to use? Is it easy to learn to use the product? |
| Pragmatic | Efficiency | Can users perform tasks efficiently? |
| Pragmatic | Dependability | Do users have the feeling to have control over the product? |
| Hedonic | Stimulation | Is the product appealing and motivating to use? |
| Hedonic | Novelty | Is the product innovative? Do users have an interest in the product? |

Table 1: UEQ measurement scales

First, participants were given a general introduction into the task of triplet extraction from text, without them seeing the system. After they obtained a basic level of knowledge of triplet extraction, participants were given two tasks. The first task was to verify whether an already extracted triplet was correct for a given sentence. After this task, participants immediately went to the second task, triplet extraction. The task of triplet extraction was to extract as many triplets as they could find. If they either ran out of possible triplets or got bored with the sentence, they had permission to go to the next sentence. All participants were shown the same sentences to minimize a possible difference in the difficulty of a sentence.

After playing three sentences, the observer asked the participants to fill in the UEQ. The UEQ was conducted immediately after the performed tasks and participants were asked to answer the questions according to their experience and perception. After the UEQ, the semi-structured interview was conducted.

6 RESULTS

Based on the obtained results from the user test, we analyze the strength of results and report the outcomes of the UEQ as well as the semi-structured interviews.

6.1 Data analysis

The UEQ is analyzed with an Excel-based analysis tool, developed by the UEQ community9. To detect potential random answers given to the UEQ, we checked how much the best and worst response contributing to a UEQ scale differed. When there was a big discrepancy in answers (>3), we examined this as an indicator for ambiguous data. Based on this heuristic, we removed one response from a total of 16 responses, leaving 15 responses for further analysis. To determine the precision of the UEQ answers relative to our sample size, we examined the confidence of the answers, as presented in table 2.

9 www.ueq-online.org


| UEQ Scale | Mean | Standard deviation | Confidence (p=0.05) | Comparison to UEQ benchmark | Interpretation |
|---|---|---|---|---|---|
| Attractiveness | 1.44 | 0.57 | 0.29 | Above average | 25% of results better, 50% of results worse |
| Perspicuity | 0.62 | 0.79 | 0.40 | Bad | In the selection of the 25% worst results |
| Efficiency | 1.58 | 0.65 | 0.33 | Good | 10% of results better, 75% of results worse |
| Dependability | 0.78 | 0.80 | 0.41 | Below average | 50% of results better, 25% of results worse |
| Stimulation | 1.32 | 0.69 | 0.35 | Good | 10% of results better, 75% of results worse |
| Novelty | 1.28 | 0.78 | 0.39 | Good | 10% of results better, 75% of results worse |

Table 2: UEQ Results for N=15 compared to UEQ benchmark

Figure 5: UEQ Benchmark

We report that the 95% confidence interval for the mean (µ) of each scale ranges from about µ ± 0.29 to about µ ± 0.41. Therefore, we conclude that our current sample size gives an acceptable indication of the UEQ scale scores.
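As an approximation of these two checks (the UEQ analysis tool itself is Excel-based and its exact procedure is not reproduced here), the sketch below flags a response when the best and worst answer within a scale differ by more than 3, and reports the half-width of a scale mean's 95% confidence interval.

```python
# Sketch of the data-analysis checks in section 6.1; an approximation of the
# Excel-based UEQ analysis tool, not a reimplementation of it.
import math
import statistics

def is_suspicious(scale_answers: list[float], threshold: float = 3.0) -> bool:
    """Flag a response when its answers within one scale differ by more than the threshold."""
    return max(scale_answers) - min(scale_answers) > threshold

def mean_with_confidence(scale_means: list[float]) -> tuple[float, float]:
    """Mean of a scale over all participants and the half-width of its 95% CI
    (normal approximation: 1.96 * standard error of the mean)."""
    mean = statistics.mean(scale_means)
    sem = statistics.stdev(scale_means) / math.sqrt(len(scale_means))
    return mean, 1.96 * sem
```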

Furthermore, table 2 shows the mean and the standard deviation (SD) for each scale. According to the UEQ, a mean between -0.8 and 0.8 represents a neutral evaluation. Scores > 0.8 describe a positive evaluation and scores < -0.8 represent a negative evaluation. The range of the different scales varies from -3 (extremely bad) to +3 (extremely good). Because the mean is calculated from the results of all participants with divergent opinions, it is unlikely to get a score above +2 or below -2.

To interpret these results, we use the UEQ benchmark, as developed by Schrepp et al. [20]. We consider this benchmark useful since this is our first product evaluation and we therefore do not have comparison material. The UEQ benchmark consists of data from 246 product evaluations that used the UEQ in a wide range of applications, such as, but not limited to, business applications (100) and web services or shops (64). In total, the UEQ benchmark consists of 9905 responses. The sample size differs per evaluation, from 3 to 1,390 participants. The average number of participants is 40.26. The feedback of the UEQ benchmark is limited to five categories: excellent, good, above average, below average and bad. Figure 5 shows a visual overview of how our product performs relative to the UEQ benchmark and table 2 shows the exact numbers.

6.2 Semi-structured interview results

The interface of the triplet verification was considered clear by the participants. The timer on top of the interface ensured that participants did not think too long about the task, which is in line with our design intention that users should not dwell on the presented question. Most participants started with reading the sentence, after which they tried to answer the question. Furthermore, we have not observed confusion about the meaning of the agree and disagree buttons that are used to answer a question. Also, the process of issuing points based on answers in the triplet verification screen was considered clear. Some participants were confused when they did not receive points for their decision, as the majority of other users did not support their answer. This confusion sometimes led to the question of how the system could know whether a given triplet was correct or incorrect. After participants had entered the triplet extraction interface, they first needed to orient themselves on its functioning. Orientation often happened by participants clicking around through the interface to see what would happen. The response of the system based on the actions of the participant ensured that almost all participants understood the interface during the first sentence. Feedback about this observation emerged during the semi-structured interview and was mainly about adding a possible introduction for the interface and how it should be used. Based on this feedback and the observation we conclude that the interface can be improved for first-time users while being considered efficient for existing users.

6.3 UEQ Scales

6.3.1 Attractiveness. Based on the UEQ, we identify attractiveness as one of the highest scoring scales. We report a mean of 1.44 and an SD of 0.56. The UEQ benchmark interprets this score as above average. The item with the highest score for the attractiveness scale was attractive/unattractive with a reported mean of 1.9. This result is supported by the outcome of the semi-structured interview, where multiple participants argued they saw themselves using our system as a recreational product in unoccupied moments, such as while traveling or waiting in a line.

6.3.2 Perspicuity. Perspicuity is the lowest scoring scale in our test results, with a mean of 0.62 and an SD of 0.79. Looking at the items contributing to this scale, we see complicated/easy scoring the lowest (µ=-0.2). This is also something we identified during the semi-structured interviews and while observing the performed tasks: participants first needed time to get used to the system and the task. When participants understood the system, they experienced it as an organized and fast tool, hence the relatively high scores in the UEQ. Furthermore, easy to learn/difficult to learn obtained the highest score for the perspicuity scale.

6.3.3 Efficiency. The scale efficiency has the highest score of the scales in our test results (µ=1.58). The highest score contributing to this scale is associated with the organized/cluttered item. With a mean of 1.4, impractical/practical resulted in the lowest score for this scale. Furthermore, the efficiency score was identified as good by the UEQ benchmark.

6.3.4 Dependability. The dependability scale measures the level to which the product is predictable and meets the expectations of the user. Based on our UEQ results, we report a dependability mean of 0.78 and an SD of 0.80. Compared to the benchmark, this mean is assessed as below average and is therefore a scale we have to take into consideration in future work. The item with the lowest score for this scale is unpredictable/predictable. During the semi-structured interview, we noticed that some participants were not familiar with the three triplet attributes (subject, predicate, object) we used, which might have influenced this score.

6.3.5 Stimulation. The stimulation scale assesses how exciting and motivating the product is. We report a stimulation mean of 1.32 and an SD of 0.69, which is assessed as good within the UEQ benchmark. The item with the highest score for the stimulation scale was motivating/demotivating with a mean of 1.5. While observing participants performing the task, we noticed that participants showed excitement for the task of RE after they familiarised themselves with the system.

6.3.6 Novelty. Novelty measures how inventive and creative the product is. Based on the conducted UEQ, we reported a mean of 1.28 and an SD of 0.77 for the novelty scale. The UEQ benchmark assesses this score as good. One of the highest scores for novelty was reported by the item inventive/conventional with a mean of 1.4.

6.4 Extracted triplets

During the user test, a total of 57 triplets were extracted from the sentences. All triplets were examined by the researcher to determine whether the triplets were correct. Based on this analysis, we report a recall of 96% and a precision of 41%. The high level of recall can be interpreted to mean that participants extracted all the triplets they could find, resulting in a high number of extracted triplets. Based on our approach of using the wisdom of the crowd, as described in section 3.2.4, and a precision that can be considered low, we suggest a larger user group is required, especially for the triplet verification task. Therefore, we leave the calculation of the number of users needed to reach a particular level of precision for future work.

7 DISCUSSION

From the results of the UEQ, we can conclude that our system scores well on the scales attractiveness (µ=1.44), efficiency (µ=1.58), stimulation (µ=1.32) and novelty (µ=1.28). Furthermore, our system scored less well on the scales perspicuity (µ=0.62) and dependability (µ=0.78). To make a comparison with the UEQ benchmark, we have to keep in mind that the benchmark does not make a distinction between different product categories. We did not divide the UEQ benchmark into specific product categories due to resource limitations. For this reason, we should only use the UEQ benchmark as an indication of scales that require attention in a future version of our system. We see that the scales perspicuity and dependability score relatively low in the UEQ benchmark. Especially the items understandable/not understandable and complicated/easy contribute with a low score to these scales.

Based on the results from the UEQ and the semi-structured interviews we can derive that our system offers an engaging and convenient way to extract triplets. However, beginning users often need to orient themselves to the task of triplet extraction and the interface. Therefore, one method to improve this might be to investigate the possibility of an introduction. An alternative approach would be to increase the difficulty of the triplet extraction tasks more gradually in proportion to the time the user has spent in the system, e.g., letting starting users verify triplets and more advanced users extract triplets. Besides, a follow-up user test can be done that makes a distinction between users that are new to the task of triplet extraction and users that already have experience with this task. Furthermore, we leave a more detailed analysis of the quality of extracted triplets for future work. We did not implement many lexical or syntactical constraints, so that the user has the freedom to create many possible relations. Future work can investigate whether the task of triplet verification is sufficient to eliminate deliberately incorrect triplets. Also, our test results offer a snapshot of the user experience. Future work is necessary to point out whether users stay engaged over a more extended period.

8 CONCLUSION

In this research, we aimed to come up with an engaging and convenient way for users to extract triplets from text. Based on existing user interface patterns and human computation methods, we developed a system that offers an interface to extract and verify triplets, supported by gamification methods. To validate our approach, we conducted a qualitative user test that consisted of a semi-structured interview and a UEQ. Based on the obtained test results we can confirm that our system provides an engaging and convenient way to extract relations, especially in the areas of attractiveness and efficiency. However, the introduction for first-time users can still be improved, as is shown by the UEQ and identified during the semi-structured interviews. We leave an in-depth analysis of the need for an introduction for future work, as well as an analysis of how users can stay engaged with our system over a more extended period.


REFERENCES

[1] Angeli, G., Johnson Premkumar, M. J., and Manning, C. D. (2015). Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

[2] Aroyo, L. and Welty, C. (2015). Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation. AI Magazine.

[3] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open Information Extraction from the Web. IJCAI, 7.

[4] Brabham, D. C. (2008). Crowdsourcing as a model for problem solving: An introduction and cases. Convergence.

[5] Dawid, A. P. and Skene, A. M. (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28.

[6] Deterding, S. (2012). Gamification: Designing for Motivation. Interactions.

[7] Downey, D., Broadhead, M., and Etzioni, O. (2007). Locating Complex Named Entities in Web Text.

[8] Etzioni, O., Cafarella, M., Downey, D., Popescu, A. M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2005). Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence.

[9] Etzioni, O., Fader, A., Christensen, J., and Soderland, S. (2011). Open Information Extraction: The Second Generation.

[10] Fader, A., Soderland, S., and Etzioni, O. (2011). Identifying Relations for Open Information Extraction. Proceedings of the conference on empirical methods in natural language processing, pages 1535–1545.

[11] Hirth, M., Hoßfeld, T., and Tran-Gia, P. (2013). Cost-Optimal Validation Mechanisms and Cheat-Detection for Crowdsourcing Platforms. Mathematical and Computer Modelling.

[12] Kahneman, D., Knetsch, J. L., and Thaler, R. H. (1991). Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias. Journal of Economic Perspectives, 5(1):193–206.

[13] Klein, D., Manning, C., and Finkel, J. (2018). The Stanford Natural Language Processing Group.

[14] Kondreddi, S. K., Triantafillou, P., and Weikum, G. (2014). Combining information extraction and human computing for crowdsourced knowledge acquisition. In Proceedings - International Conference on Data Engineering.

[15] Laugwitz, B., Held, T., and Schrepp, M. (2008). Construction and Evaluation of a User Experience Questionnaire. LNCS, 5298:63–76.

[16] Liu, A., Soderland, S., Bragg, J., Lin, C. H., Ling, X., and Weld, D. S. (2016). Effective Crowd Annotation for Relation Extraction. pages 897–906.

[17] Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09.

[18] Nielsen, J. (2006). Progressive Disclosure.

[19] Santorini, B. (1990). Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Technical Reports (CIS).

[20] Schrepp, M., Hinderks, A., and Thomaschewski, J. (2017). Construction of a Benchmark for the User Experience Questionnaire (UEQ). International Journal of Interactive Multimedia and Artificial Intelligence.

[21] Siangliulue, P., Chan, J., Dow, S. P., and Gajos, K. Z. (2016). IdeaHound: Improving Large-scale Collaborative Ideation with Crowd-powered Real-time Semantic Modeling.

[22] Surowiecki, J. (2004). The wisdom of crowds: why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations. Choice Reviews Online.

[23] van Bellen, M. (2016). Harnessing disagreement in event text classification using CrowdTruth annotation. PhD thesis, University of Amsterdam.

[24] Weiksner, G. M., Fogg, B. J., and Liu, X. (2008). Six patterns for persuasion in online social networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

[25] Zelenko, D., Aone, C., and Richardella, A. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3:1083–1106.

[26] Zouaq, A., Gagnon, M., and Jean-Louis, L. (2017). An assessment of open relation extraction systems for the semantic web. Information Systems.
