
Linking Cultural Heritage Collections by Modeling Personal Events in Historical Context

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Jessie Both
11210419

MASTER INFORMATION STUDIES: HUMAN-CENTERED MULTIMEDIA
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

July 3, 2017

1st Supervisor: Dr. Lora Aroyo
2nd Supervisor: Dr. Victor de Boer


Linking Cultural Heritage Collections by Modeling Personal Events in Historical Context

Jessie Both

University of Amsterdam

Faculty of Science

Amsterdam, The Netherlands

jessieboth@gmail.com

ABSTRACT

By digitizing collections, cultural heritage institutions aim to fulfill their new role as information providers. However, the majority of the digitized collections lack sufficient metadata and provide no context. This makes the collections hard to access and to search through. Prior research shows that enriching the metadata of digitized collection items with events helps provide context, structure collections, and describe relationships between collection items. This research presents a hybrid machine-crowd pipeline for detecting, extracting, and modeling personal events (e.g. marriage, birth) in historical context to enrich digitized World War II collection items of the NIOD1. We combine automatic information extraction with human-in-the-loop crowdsourcing for curation and extension of the machine output. The Simple Event Model (SEM) is used to model the enriched metadata, which serves as input for DIVE+2.

Author Keywords

Cultural heritage; event modeling; event extraction; machine extraction; crowdsourcing; simple event model.

INTRODUCTION

The growing influence of the Internet has led cultural heritage institutions to open up their collections to the World Wide Web. The expectation was that digitized items would be more easily accessible, as the Internet is an open space, and that users would be able to find and access information without the mediation of institutes.

However, since the early 2000s Dutch cultural heritage institutions have taken part in a so-called ‘mass digitization’. During this phase, the focus was on digitizing at a high pace and at low cost, and important metadata got lost [4, 16]. This is not optimal, as users need rich metadata for the analysis, interpretation, efficient accessibility, and re-use of digitized collection items [23]. Another problem is that many digitization initiatives took place in-house, so many local solutions are used. The use of local solutions results in inconsistent metadata across collections and a wide range of scattered places where digitized information can be found [4].

1 http://www.niod.nl/

2 http://diveplus.frontwise.com/

Currently available annotations can be enriched by adding events. Events help situate collection items in context by creating structure and by linking related items. This improves the accessibility of the items and helps users understand their context. Personal events are interesting for this purpose, especially for digital humanities scholars [6]. Personal events can give an understanding of the day-to-day life of people involved in historical events, and they enable large-scale data research. However, detecting and extracting events is hard, as events can be very ambiguous: they lack a conclusive definition and are open to personal interpretation. Personal events especially are a challenge to extract, as they are not named. Personal events are micro-historical events that revolve around an individual’s life [12, 22], for example life cycle events. These events differ in granularity [2] from higher-level historical events and are therefore harder to extract. In this paper, we discuss a pipeline that detects, extracts, and represents personal events from historical context to improve the digital accessibility of cultural heritage collections. First, we gather and prepare the data that is the input for the pipeline. Second, we use and evaluate multiple Named Entity Recognition (NER) tools to extract entities from those texts. Third, we test whether crowdsourcing can be used to enhance the results of the NER tools and to create events out of extracted entities, by conducting two crowdsourcing experiments. Fourth, the events are modeled in the Simple Event Model (SEM). Finally, the created SEM and digitized collection items are implemented in DIVE+, an explorative search browser aimed at providing deep access to online cultural heritage collections, to visualize the final results.

This research is structured as follows: first, the use case of this research is explained, then we discuss related work and our research approach. After that, two crowdsourcing experiments are discussed. The next section describes our findings. In the discussion that follows, we reflect on the steps of the pipeline and the final results. The last section summarizes our findings and discusses possible directions for future research.

This research is conducted in cooperation with Didi de Hooge, who wrote a Master’s thesis at the UvA focusing on historical events within the same context. Both theses use the same use case and dataset.


RESEARCH QUESTION

In this study, we aim to answer the following research question:

Can personal events help interlink cultural heritage collections to improve the accessibility of digitized WWII collection items of the Netwerk Oorlogsbronnen?

To answer this research question, two subquestions are formulated:

1. What is an efficient way (through machine extraction and crowdsourcing) of detecting and extracting personal events from historical texts?

2. How can extracted personal events be represented in an event model?

USE CASE: NETWERK OORLOGSBRONNEN

This research is done on behalf of the Institute for War, Holocaust and Genocide Studies (NIOD). For this research, we focus on Netwerk Oorlogsbronnen (NOB)3, a portal facilitated by NIOD that provides access to 9.819.375 digitized collection items from collections related to World War II (WWII). These collections are provided by 40 partner institutions. The goal of the NOB is to improve the digital access to the collections. With improved, central, digital access, the NOB wants to help people collect, connect, and provide context to their collections.

Context: At the moment, the NOB is not yet able to fulfill the information need described above. The nearly 10 million accessible digitized items are only an estimated 7 percent of all Dutch war sources and documentation. In this relatively small subset, some problematic issues occur. Firstly, data is scattered, as objects and their metadata are stored in external online databases. Moreover, not all available metadata is passed through to the NOB portal by the partner institutions. Therefore, getting access to the data is still difficult. Secondly, the available metadata is scarce and unstructured. Objects vary from articles (9.044.788) and photos (280.530) to digitized metadata for physical objects (± 216.071). These objects are often described in an oversimplified manner. For example, a photograph with portraits is titled ‘Photographs. By resistance identified as possible collaborators’. Further information, such as the identities of the displayed people, is missing.

These two issues lead to an unstructured collection, which causes underutilization of resources and results in insufficient digital access. As a result, it is hard for end-users to access the vast amount of collection items of the WWII collection.

Illustrating example: When a digital humanities scholar searches for ‘Jan Bonekamp’, a man who was part of the resistance, they will face two problems. First, objects are not semantically modeled, which makes it hard to find all information about one person if that person has more than just a first and last name. Second, the NOB portal shows data entries of multiple partner institutions that are stored on different websites with a minimal amount of metadata.

3 http://www.netwerkoorlogsbronnen.nl

Jan Bonekamp was known as Jan, but his full name was Johannes Lambertus. It is not unlikely that this name is written in different variations across different sources, e.g. Jan Bonekamp, J. L. Bonekamp, or Johannes Lambertus Bonekamp, etc. Furthermore, as most documentation was done by German officials, spelling mistakes in records cannot be ruled out.

In the case of this example, the results of queries of variations of Jan’s name on the NOB portal indeed give different results:

• ‘Jan Bonekamp’ returns 24 items.

• ‘Johannes Lambertus Bonekamp’ returns 2 items.

• ‘J Bonekamp’ returns 28 items.

Ideally, all relevant information is returned with one of the queries used above. Currently, this is not possible as metadata is unstructured.

Besides digitized collection items, humanities scholars are interested in context. To provide this context, meaningful relationships have to be made between objects. For example, from the current metadata it is not possible to trace to whom Jan was married, or that his daughter was born on the 30th of April 1940. Besides providing context to an individual’s life, personal events also enable large-scale data research. For example, a mass execution that took place in a concentration camp is not necessarily a named event; the data, however, can indicate such an event. Meaningful relationships can also be made on another, more historically significant, level, such as linking objects describing his companion from the resistance, Hannie Schaft.

In conclusion, it is currently hard for digital humanities scholars to get a complete overview of all available information about an individual. By adding personal events to the metadata, the metadata becomes more structured, which means that items can be linked through meaningful relationships.

Dataset: Out of all data provided through the NOB by the partner institutions, a dataset is selected based on the following three themes: deportation, arrest, and resistance. The dataset consists of historical texts, a vocabulary, and a thesaurus.

Textual sources: In essence, every historical text inside the scope of the research was a potential text for the dataset. However, due to technical issues, as well as copyright and time constraints, the only available sources were Wikipedia pages of the categories ‘Lijsten over de Tweede Wereldoorlog’ and ‘Tweede Wereldoorlog in Nederland’.


Figure 1 shows the distribution of the subjects of the Wikipedia pages. The vast majority of the pages (1.724 pages) describe people. Other categories are monuments (graves, memorial monuments); media (resistance papers); locations (concentration camps); events (Kristallnacht); lists (lists of events or names); and groups (resistance groups).

Figure 1. Overview of Wikipedia pages per category

Vocabulary: As ground truth, a people vocabulary is created from data from Online Begraafplaatsen4, Erelijst5, and the Oranjehotel Kartotheek. Online Begraafplaatsen is a database with records of all monuments in the Netherlands. The Erelijst is an honour roll for people who served the Netherlands. The Oranjehotel Kartotheek is a selection of registration cards of people who were imprisoned in the Oranjehotel prison during WWII.

Combined, the three sources detail 35.739 people by name, date of birth, place of birth, date of death, place of death, place of residence, age, occupation, relatives, cell number, and monument number. See Appendix A for an overview of the saturation per metadata field.

Thesaurus: The NIOD developed its own thesaurus to map concepts throughout their collection. In this research, the thesaurus is used to link our extracted events to already defined named events in the collection.

All data and code used during this research can be found on GitHub6.

RELATED WORK

The following section gives an overview of related work regarding event detection and extraction with Natural Language Processing (NLP) tools, crowdsourcing, and event modeling with the Simple Event Model.

Event detection and extraction

In this research, events need to be detected and extracted from historical texts.

4 http://www.online-begraafplaatsen.nl/
5 http://www.erelijst.nl/

6 https://github.com/Dididh/Thesis-Jessie-Didi

Events cannot be extracted as a whole. Traditionally, historical events are named and consist of a standard grammar: an agent performing an action on an object using an instrument [20]. To extract these entities from historical texts, Natural Language Processing (NLP) can be used. Named-entity recognition (NER), a task of NLP, is well suited for this purpose. Originally, NER is an information extraction subtask from computational linguistics. Later, however, the task was adopted by researchers in other fields [1, 19, 25], including event extraction. Van Hooland et al. [13] discuss the quality of NER output for cultural heritage data. NER seems more suited for extracting names and locations and performed poorly on extracting dates and events.

The research by Fokkens et al. [9] presents an NLP pipeline for extracting personal events. In their research, personal events are extracted from biographies, a source that is certain to contain personal events. For this study, that is not a given, as the dataset consists of Wikipedia pages of varying natures.

The NER results for extracting events were found to be insufficient. Therefore, other research in the field of event extraction explores the use of crowdsourcing for a more precise approach.

Crowdsourcing is defined as the outsourcing of tasks to a crowd of unknown people [5]. Well-known crowdsourcing platforms are Amazon Mechanical Turk7 and CrowdFlower8. Both are platforms on which non-expert workers get paid a small amount of money to complete small tasks. Research by Snow [24] shows that this methodology is useful for extracting events. On the one hand, advantages of crowdsourcing are that the workers work at a high pace and the work is relatively inexpensive. On the other hand, there is a risk that people misuse the task and spam [28]. Research by Theodosiou [26] counters this risk: by assigning multiple workers to annotate the same object, quality can be guaranteed through the use of multiple perspectives.

Crowdsourcing can be used for all kinds of information extraction tasks in the cultural heritage domain. Oomen and Aroyo [21] propose a typology of crowdsourcing tasks. In the cultural heritage domain, these tasks vary from correction and transcription to co-curation. For this research, we make use of both a correction and tagging task and a contextualization task.

An example of a crowdsourcing task to enhance the metadata of cultural heritage collections is the VeleHanden project, an initiative of the Amsterdam City Archives to digitize their many archives. Their dataset is similar to our people vocabulary and shows good results [8]. The social tagging game Waisda?, proposed by [11], also showed that crowdsourcing not only helps in completing a task, it also increases engagement with the cultural heritage institute.

7 https://www.mturk.com/

For extracting events, the use of both machine and crowd can be useful, which has led researchers to explore a hybrid approach.

Inel [14] uses a machine-crowd approach to extract events by first using the machine to extract and then the crowd to enrich. De Boer et al. [3] also use a hybrid approach to enrich metadata. Noteworthy is their choice of presentation: the results are shown in DIVE+, the same linked media browser we use to present the results of this study. Both studies described above extract events from video. We use a similar hybrid approach, with the difference that in this study events are extracted from textual sources.

Modeling personal events

Multiple event models have been designed to model longitudinal data in different domains, for example The Event Model (TEM), proposed by [7] to model situation awareness, and the Relational Event Model (REM), proposed by [27] to model animal behavior. However, no model has yet been designed specifically for (personal) historical events.

For this research, we use the Simple Event Model (SEM) to model the extracted events. This model is well suited for modeling personal events because of its flexible structure. SEM can model events from multiple domains without making assumptions about domain-specific vocabularies. This is achieved by making the model work with a minimum of semantic commitment [10]. This is beneficial for personal events, because personal events are not named in the same manner as historical events.

PERSONAL EVENTS EXTRACTION METHODOLOGY

To answer our research question, we created a proof-of-concept pipeline that detects, extracts, and represents personal events. To achieve this, we took the following steps:

• We gathered data out of which we could extract events and create a people vocabulary as ground truth.

• The textual data was processed by NER to extract entities.

• To enhance the results, these entities served as input for two crowdsourcing experiments. The aim of Experiment 1 was to extract more entities. Experiment 2 was designed to link entities to their corresponding verbs to eventually make events.

• The aggregated results from experiment 2, the people vocabulary, and already available other named graphs were then used as input for the SEM.

• Finally, the SEM was imported into DIVE+ together with metadata from the NOB portal API to illustrate our final result.

This section describes our approach for each part of the pipeline.

Data Processing

Vocabulary

The people vocabulary consists of data from Online Begraafplaatsen, Erelijst, and the Oranjehotel Kartotheek. Before cleaning, the Online Begraafplaatsen and Erelijst are filtered based on the three themes of the research. After that, both lists are merged. Combined, these two lists underwent the following cleaning steps (sketched in code after this list):

• Firstly, the rows with names that we could not fully confirm were removed. To do so, we used the guidelines of the Basisregistratie Personen9, which state that there is only one person with the same first name, last name, place of birth, and date of birth.

• Secondly, literal duplicates were removed. For literal duplicates, all fields per row had to be identical.

• Lastly, other duplicates that were not removed in the second step, for example names with spelling errors, were removed manually.
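A minimal sketch of these cleaning steps, assuming the merged lists are available as a single CSV file with hypothetical column names (the actual code is in the GitHub repository mentioned above):

```python
# Sketch of the vocabulary cleaning steps. "merged_vocabulary.csv" and the
# column names are hypothetical, not the files used in the thesis code.
import pandas as pd

df = pd.read_csv("merged_vocabulary.csv")

# Step 1: drop rows whose identifying fields cannot be fully confirmed.
# Following the Basisregistratie Personen guideline, a person is identified by
# first name, last name, place of birth, and date of birth.
id_fields = ["first_name", "last_name", "birth_place", "birth_date"]
df = df.dropna(subset=id_fields)

# Step 2: remove literal duplicates (all fields identical).
df = df.drop_duplicates()

# Step 3 (done manually in the research): flag remaining rows that share the
# identifying fields but differ elsewhere, so they can be reviewed by hand.
suspects = df[df.duplicated(subset=id_fields, keep=False)]
df = df.drop_duplicates(subset=id_fields, keep="first")

print(len(df), "people kept;", len(suspects), "rows flagged for manual review")
```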

Based on the people vocabulary, we could match new rows from the Oranjehotel Kartotheek. As it is not certain whether all people on the list are deceased, we could not obtain a copy of the list with names. Therefore, we matched the date of birth and place of birth of the Oranjehotel Kartotheek with our people vocabulary and printed the ID numbers and potential names from the Oranjehotel Kartotheek. This resulted in 335 potential matches. The ID numbers and names were checked by the owner of the Oranjehotel Kartotheek, and we were given a list of 233 confirmed names. See Table 1 for the final people vocabulary.

Online Begraafplaatsen       30.685
Erelijst                      4.588
Oranjehotel Kartotheek          233
Final people vocabulary      35.506

Table 1. People vocabulary

Textual sources

As input for the NLP task NER, 2.655 pages were parsed from Wikipedia. The text of each page was split into paragraphs. Redundant information, such as headers and footers, was removed in order to speed up the process. Every paragraph was saved as a new CSV file with a page index number, the URL of the page, the text of the paragraph, and a section index per sentence. This resulted in 155.880 documents.
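A minimal sketch of this preparation step, assuming the page text is already available as plain text; the file layout and column names are illustrative, not necessarily those of the thesis code:

```python
# Split a page's plain text into paragraphs and write one CSV file per
# paragraph, with a page index, the page URL, the paragraph text, and a
# section index per sentence. Paths and column names are illustrative.
import csv
import os
import re

def save_paragraph_docs(page_index, url, text, out_dir="docs"):
    os.makedirs(out_dir, exist_ok=True)
    # Split on blank lines; very short fragments (headers, footers) are dropped.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if len(p.strip()) > 40]
    for par_index, paragraph in enumerate(paragraphs):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        path = os.path.join(out_dir, f"page{page_index}_par{par_index}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["page_index", "url", "paragraph", "sentence_index", "sentence"])
            for sent_index, sentence in enumerate(sentences):
                writer.writerow([page_index, url, paragraph, sent_index, sentence])
    return len(paragraphs)
```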

Automatic Event Extraction

The second step in the pipeline is to detect and extract events with machine extraction. The input for this part of the pipeline are the 155.880 documents containing paragraphs of Wikipedia pages.

The input data was processed by two pipelines. The first pipeline uses four free NER tools: DBpedia Spotlight10, Targeted Hypernym Discovery (THD)11, SemiTags12, and TextRazor13. Each extractor has its own focus. DBpedia Spotlight and TextRazor extract DBpedia mentions in texts [18]; these extractors are likely to extract place names, names of groups, and other named entities that can be linked to the many classes of DBpedia. The THD extractor uses lexico-syntactic patterns to find hypernyms in the text [15], so the results of this extractor are more likely to be on a conceptual level. The SemiTags extractor is an online tool that recognizes named entities and their meaning in a particular context [17]; because of SemiTags’ given entity types, we expect the results of this tool to be more accurate than those of the other tools.
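As an illustration of how such an extractor can be queried, the following sketch calls DBpedia Spotlight over HTTP, assuming its public REST endpoint; the exact endpoints, parameters, and wrappers used in this research may have differed:

```python
# Query DBpedia Spotlight for entity mentions in a sentence. The public
# endpoint and its parameters are assumptions; other tools in the first
# pipeline expose comparable HTTP interfaces.
import requests

def spotlight_entities(text, lang="en", confidence=0.5):
    response = requests.get(
        f"https://api.dbpedia-spotlight.org/{lang}/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    # Keep the surface form (the text span) and the linked DBpedia URI.
    return [{"surface": r["@surfaceForm"], "uri": r["@URI"]} for r in resources]

# The paper's own example sentence.
print(spotlight_entities("Jan de Groot was born on 14 June 1945 and lived in Amsterdam."))
```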

The second pipeline that is used is the pipeline of BiographyNet14, a project that uses NLP and Semantic Web technology to support digital humanities scholars with historical research. The pipeline is a supervised machine learning system trained to extract biographical metadata from text based on the following categories: date and place of birth and death; education; occupation; religion; and parents [9]. As this pipeline is specialized in extracting entities that indicate personal events, it is likely that the second pipeline will show better results than the first.

For crowdsourcing task 1, 100 sentences were selected as a test set. The input for the task was a CSV file with the test set and the extracted entities per sentence.

Crowdsourcing Experiments

To improve the NER results, we created two crowdsourcing experiments. The first experiment, task 1: validating and extracting entities, focuses on letting the workers evaluate automatically extracted entities and annotate new entities. The second experiment, task 2: verify statements, aims to connect extracted entities with extracted verbs to create events by letting workers verify statements made with extracted entities. This section discusses the goal and design per task.

Task 1: validating and selecting entities

The goal of the first task is to extract entities that were not extracted by the NER tools, and to let the crowd confirm and/or correct the extracted entities.

The test set for this task consists of 100 sentences. The sentences vary in difficulty and structure. Out of these sentences, we want to extract four types of entities: dates; locations; verbs; and names of people, organizations, and groups. To keep the task as narrow and concise as possible, the task is divided into four categories, one job per entity type.

10 http://dbpedia-spotlight.github.io/demo/
11 http://entityclassifier.eu/thd/
12 http://nlp.vse.cz/SemiTags/
13 https://www.textrazor.com
14 http://biographynet.nl/

The task is designed as follows:

• The worker is given a sentence and is asked to read the sentence carefully.

• Then, the worker is asked to remove entities extracted by NER that are incomplete or irrelevant. The entities extracted by NER are already highlighted in green and the text is colored orange (see Figure 2). For example, one job is to look at dates: if ‘Jan de Groot’ is highlighted in the sentence, the entity should be removed. Also, ‘June 1945’ is considered incomplete when ‘14 June 1945’ is in the same sentence.

• The next step is to select the correct entity, which in this case is a date, by highlighting all words that describe that entity (see Figure 3).

• If the worker selects less than three entities, he or she is asked to explain why there are no more relevant entities in the sentence.

Figure 2. Screenshot crowdsourcing task 1; step 2: remove partially extracted entities.

Figure 3. Screenshot crowdsourcing task 1; step 3 select correct entity.

The task is distributed through CrowdFlower. Per sentence, the worker is paid 2 Eurocents. To prevent fraudulent use, every worker is allowed to judge up to 15 sentences per job. To build confidence, every sentence is judged by 15 people. Judgements are accepted at a confidence level of 66.66 percent. To speed up the process, the 100 sentences are split into four sets of 25 sentences. Therefore, 16 jobs in total were conducted.
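The acceptance rule can be illustrated with a small sketch. It simplifies CrowdFlower’s trust-weighted confidence to a plain vote fraction, which is how the 10-out-of-15 threshold is treated in the discussion later in this paper:

```python
# A selected entity is accepted when at least two thirds (66.66 percent,
# i.e. 10 of 15 workers) agree on exactly the same text span. This ignores
# CrowdFlower's internal trust weighting and only illustrates the threshold.
from collections import Counter

def accepted_spans(judgements, threshold=2 / 3):
    """judgements: one list of selected spans per worker, for a single sentence."""
    n_workers = len(judgements)
    votes = Counter(span for worker in judgements for span in set(worker))
    return [span for span, count in votes.items() if count / n_workers >= threshold]

# 15 workers: 11 selected the full date, 4 selected only part of it.
example = [["14 June 1945"]] * 11 + [["June 1945"]] * 4
print(accepted_spans(example))  # ['14 June 1945']
```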

Task 2: verify statements

The goal of the second task is to connect the extracted entities, by machine and crowd, with a verb to create events.

For this task, the results of task 1, entities extracted by machine and crowd, and their corresponding sentences are used as input. Not all sentences of the test set of 100 sentences are used for this task. For example, if a sentence does not contain extracted dates, the sentence is not used for task 2 to link verbs with dates.

To keep the task as concise as possible, the task is divided into three different jobs:

• Linking verbs with dates

• Linking verbs with locations

• Linking verbs with names of people, organizations, and groups

Task 2 is designed as follows:

• The worker is given a sentence and is asked to read the sentence carefully.

• Next, the worker is asked to select all statements that are a correct representation of the relationship between the entities in the sentence. Depending on the job, the extracted verbs and/or dates, locations, and names are highlighted (see Figure 9). Below the sentence, all possible permutations of the extracted entities are used to create statements (a sketch of this pairing step follows this list). ‘Correct’ means that the entities are linked to the verb that is part of the same event. For example, the following sentence contains two events:

• ‘Jan de Groot was born on 14th of June 1945 and he lived in Amsterdam’

• The two events in this sentence are ‘Jan de Groot was born on the 14th of June’ and ‘Jan de Groot lived in Amsterdam’. For the first event, we want to link ‘Jan de Groot’ and ‘14th of June’ with the verb ‘born’, and for the second event ‘Jan de Groot’ and ‘Amsterdam’ with ‘lived’.

• If the worker believes there are no correct statements, he or she can select the option ‘There is no relevant statement’. The worker is asked to explain why there is no relevant statement. This is done to avert spammers.
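A minimal sketch of how the candidate statements shown to the workers can be generated from the extracted verbs and entities; the statement wording and function names are illustrative only:

```python
# Every pairing of an extracted verb with an extracted entity of the job's
# type becomes one candidate statement for the workers to verify.
from itertools import product

def candidate_statements(actor, verbs, entities):
    return [f"'{actor}' is linked to '{entity}' through the verb '{verb}'"
            for verb, entity in product(verbs, entities)]

# The example sentence from the task description.
statements = candidate_statements(
    actor="Jan de Groot",
    verbs=["born", "lived"],
    entities=["14th of June 1945", "Amsterdam"],
)
for s in statements:
    print(s)
# Workers would accept born/'14th of June 1945' and lived/'Amsterdam',
# and reject the two crossed combinations.
```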

Figure 9. Screenshot crowdsourcing task 2

The task is distributed through CrowdFlower. Per sentence, the worker is paid 4 Eurocents. To prevent spam, every worker is allowed to judge 20 sentences per job, and if a worker believes none of the statements are correct, he or she has to give a reason for this judgement in order to proceed. To build confidence, every sentence is judged by 15 people. Judgements are accepted at a confidence level of 66.66 percent. To speed up the process, the sentences were divided into four sets. Therefore, 12 jobs in total were conducted.

Modeling Events with Simple Event Model

The created SEM consists of four different collections of Turtle triples, named graphs. All graphs were brought together in ClioPatria15. To make sure the collections can connect, the following steps were undertaken:

• The output of crowdsourcing experiment 2 is converted to RDF by a JavaScript program and later to Turtle with SEM relations (a sketch of the resulting triples follows this list).

• The people vocabulary is converted to RDF with the LevelUp16 converter and later transferred to the SEM format.
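To illustrate what the resulting SEM triples look like, the following sketch builds one event in Python with rdflib (the research itself used a JavaScript converter); the base URI and the example data are illustrative:

```python
# One verified statement from task 2 expressed as SEM triples.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

SEM = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
EX = Namespace("http://example.org/ww2/")  # hypothetical base URI

g = Graph()
g.bind("sem", SEM)

event = EX["event/birth-jan-de-groot"]
actor = EX["actor/jan-de-groot"]
place = EX["place/amsterdam"]

g.add((event, RDF.type, SEM.Event))
g.add((event, RDFS.label, Literal("born")))
g.add((event, SEM.hasActor, actor))
g.add((event, SEM.hasPlace, place))
g.add((event, SEM.hasTimeStamp, Literal("1945-06-14", datatype=XSD.date)))
g.add((actor, RDF.type, SEM.Actor))
g.add((actor, RDFS.label, Literal("Jan de Groot")))
g.add((place, RDF.type, SEM.Place))
g.add((place, RDFS.label, Literal("Amsterdam")))

print(g.serialize(format="turtle"))
```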

The WWII thesaurus was already available in RDF, in SKOS17. This format is also suitable for SEM and therefore does not have to be converted. Only concepts classified as ‘gebeurtenis’ (event) in the thesaurus are used, plus one manually added named event.

To enhance the model, we conducted some alignment procedures. SEM contains four core classes. For the core class place, we automatically generated sameAs relationships between identical places in the different graphs. For the core class actor, this could not be done automatically, as an identical name does not necessarily refer to the same person. Therefore, these sameAs relationships were added manually.

15 http://cliopatria.swi-prolog.org/
16 http://levelup.networkedplanet.com/
17 https://www.w3.org/2004/02/skos/
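A minimal sketch of the automatic place alignment, assuming the named graphs are available as rdflib graphs; in the research this was done over the graphs loaded in ClioPatria, so the details below are illustrative:

```python
# Places that share an identical (case-insensitive) label across the named
# graphs are linked with owl:sameAs. Label property and graph handling are
# assumptions for the sake of the sketch.
from collections import defaultdict
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

SEM = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")

def align_places(graphs):
    """graphs: dict of graph name -> rdflib.Graph; returns a graph of sameAs links."""
    by_label = defaultdict(list)
    for name, g in graphs.items():
        for place in g.subjects(RDF.type, SEM.Place):
            for label in g.objects(place, RDFS.label):
                by_label[str(label).strip().lower()].append(place)

    alignment = Graph()
    for label, places in by_label.items():
        first = places[0]
        for other in places[1:]:
            if other != first:
                alignment.add((first, OWL.sameAs, other))
    return alignment
```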

Lastly, from the metadata of the people vocabulary we created four events: birth; death; arrival at the Oranjehotel; and departure from the Oranjehotel.

RESULTS

This section describes the results per step of the pipeline. First, the NER results are discussed per used tool. Then, the results for both crowdsourcing experiments are reported. Thirdly, the outcomes of the data model, SEM, are presented. Lastly, the final result of the pipeline is presented by giving examples of representations of personal events in DIVE+.

Automatic Event Extraction

The results of the first NER pipeline (see Table 2) show that the THD extractor was the most effective extractor, with 145.617 extracted entities. Of those extracted entities, 107.024 are unique. An analysis of those entities shows an overrepresentation of named entities, and an overrepresentation of non-named entities recognized as named entities. DBpedia Spotlight and SemiTags were both less successful in recognizing entities in comparison to THD. What stands out is that their number of unique extracted entities is significantly lower than for THD. A more detailed look at the unique entities shows that the entities extracted by SemiTags consist only of named entities, whereas DBpedia Spotlight also extracted non-named entities and some dates without years. TextRazor was used on a small set of articles (14) and garnered no useful results; its extracted entities consist mostly of numbers, e.g. ‘600’, and some useful verbs and nouns.

As expected, the BiographyNet pipeline performed better than the other pipeline. The Wikipedia articles resulted in a set of 260.244 entities. The quality of these entities is more relevant for this research than the results from the first pipeline: for example, full dates are extracted, plus more verbs and named entities in general.

Table 2. Overview of results: extracted entities, NER pipeline 1

Crowdsourcing results

Results Task 1: Selecting Entities

In 11 days, all 16 jobs were judged. This resulted in 6.000 judgements made by 74 unique workers. Of those workers, 49 performed the tasks via an external website; the other 25 accessed the tasks via an internal link.

The task resulted in 167 extra extracted entities. Of the automatically extracted entities, 34 were removed by the crowd. See Table 3 for the results per entity type.

           Before  Removed  After
Dates          74        3    102
Verbs         132       12    228
Locations     128       12    125
Names          97        7    109

Table 3. Overview of results, crowdsourcing experiment 1

Results Task 2: Verify statements

In two days, all 12 jobs were judged. This resulted in 2.919 judgements, made by 29 unique workers, of whom four accessed the task via an internal link.

As mentioned earlier, only sentences containing a certain entity type are used in the corresponding job. Eventually, 71 sentences were used in the job connecting verbs with locations, 72 sentences in the job connecting verbs with names of people, organizations, or groups, and 77 sentences in the job linking verbs with dates. From these sentences, a total of 94 verbs are linked with 58 locations, 64 dates, and 61 names of people, organizations, and groups.

Simple Event Model

The complete knowledge graph contains four named graphs, which together hold 209.183 triples. After some alignment procedures, such as generating sameAs relationships and manually adding relationships, our knowledge graph consists of 209.245 triples in total. See Table 4 for the results per SEM core class.

Illustrating examples

To illustrate the effectiveness of the SEM, this section describes an illustrating example. The SEM is implemented in the DIVE+ environment for a visual representation. The given example revolves around Eduard Popko van Groningen. Without the model, information about Eduard is fragmented over the collections of Erelijst and Online Begraafplaatsen. All this information is now available in DIVE+ through one query. Figure 4 shows that the query leads to an actor, which is related to five matching entities.


Figure 4. Screenshot DIVE+ query Eduard Popko van Groningen

Figure 5 shows that the five matching entities are events. The first four are the events of birth and death as described in their original sources. The fifth is an event extracted from a Wikipedia page.

Figure 5. Screenshot DIVE+ matching entities

The related entities are, in turn, also related to other entities. In the case of the extracted event, to the actor, Eduard, and a location, Sachsenhausen.

New opportunities for digital humanities scholars appear through the entity ‘Sachsenhausen’. The location Sachsenhausen is linked to 21 events, all connected with actors other than Eduard. This means that potential fellow prisoners of Eduard are known. Another storyline could be that the actors of the other 21 events were also shot on the 3rd of May 1942, which would indicate that a bigger event took place.

DISCUSSION

Our results suggest that adding personal events to the metadata of cultural heritage institutions helps improve the accessibility of digitized WWII collection items in the NOB portal. With our machine-crowd hybrid pipeline we could model personal events in the SEM and illustrate our results in DIVE+. Through this explorative search tool, we showed that we have created meaningful connections between different cultural heritage collections. These relationships address the two main problems stated earlier: fragmented data and unstructured metadata.

To reach our goal, we propose a proof-of-concept pipeline. We believe our pipeline can be considered efficient: all steps need to be conducted in order to come to the final result as presented in this paper. One could argue that machine extraction could be left out, as entities are also extracted by the crowd. However, task 1 turned out to be time-intensive; removing automatic extraction from the pipeline would only lengthen the process.

The pipeline has some limitations. As has been shown in prior studies, machine extraction proved to be only partially useful for this pipeline. The entities extracted by the BiographyNet pipeline, which is tailored to extracting (personal) events, provided the best results. However, the extracted entities were not precise enough to create events from them. For example, from a sentence containing an event, only a date was extracted and the actor and place were missing.

To enhance the machine extraction, two crowdsourcing experiments were conducted. The first task delivered good results: a reasonable number of extra entities was extracted. However, a closer look at the output shows that quite a few entities were not accepted by the majority of workers, because the entities were selected in different ways. For example, in its most complete version a date consists of three words, which allows six different ways of selecting parts of that date. As only 15 people looked at each sentence, most selections get 5 - 7 votes and do not pass the 10/15 confidence level. If this task is repeated, we recommend allowing more judgements per sentence and giving clearer instructions to avoid this problem.
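The count of six can be checked with a few lines of code: a three-token date has six contiguous sub-spans, so the 15 votes can spread over up to six variants:

```python
# Enumerate the contiguous sub-spans of a three-word date such as '14 June 1945'.
from itertools import combinations

tokens = ["14", "June", "1945"]
spans = [" ".join(tokens[i:j]) for i, j in combinations(range(len(tokens) + 1), 2)]
print(spans)       # ['14', '14 June', '14 June 1945', 'June', 'June 1945', '1945']
print(len(spans))  # 6
```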

Another problem occurred with part 2 of the first task. The automatically extracted entities that were pre-highlighted were not counted as accepted when a worker simply kept the word highlighted; words only counted as selected once they were deselected and selected again. Therefore, if the crowdsourcing task is repeated, the results for this part will be different.

Crowdsourcing task 2 garnered good results and was finished in a short time period compared to task 1. However, due to time constraints, we eased the spam rules and paid the workers 4 cents per sentence instead of two. This sped up the process immensely. However, the quality of the results of task 2 is not optimal. As both the rules and the reward changed, we cannot say which of the two potentially has a side effect besides speeding up the process. A possibility is that, because of the easier rules, more spam is accepted. Moreover, because the payment went up, it is not clear whether workers with different motives performed the task.

Other limitations occurred regarding data gathering. For example, to get an optimal representation of the sources available through the NOB portal, we initially wanted to extract events from different textual sources; newspapers published by the resistance, for instance, are available in OCR. However, technical issues made these files unusable. Furthermore, during the selection of sources for the people vocabulary, we were not able to use all the desired sources. It appears that the cultural heritage domain is still protective of already publicly available data: think of incomplete APIs and difficulty in cooperating.

We think this pipeline is scalable to bigger textual datasets. The results from the crowd were satisfying, and we think that with a bigger set of sentences, more time, and a larger number of judgements per sentence, the quality could increase. However, we do think that this pipeline is too specific to be scalable to other types of sources. Nevertheless, the global steps can be considered when creating a similar pipeline for other types of media, such as video or sound.

CONCLUSION AND FUTURE WORK

Digitization projects of the past decades have made more and more collection items in the cultural heritage domain digitally available. Nonetheless, due to unstructured metadata and the distribution of the objects over different databases, purely digitizing objects does not make them more accessible. Personal events can play a part in solving this problem by creating meaningful relationships between objects. These relationships link related objects to give a complete overview of all objects related to an individual, but also provide context by linking them with related information about the individual that can be found in vocabularies. For digital humanities scholars this is particularly interesting, as it helps them tell stories. Moreover, combining all these personal events enables large-scale data research.

To answer our research question, a proof-of-concept pipeline that extracts personal events from historical texts was developed. In the pipeline, we used a machine-crowd hybrid approach to automatically extract entities from historical text and enrich these entities with crowdsourcing. The events are semantically modeled and represented in an explorative search tool. We found that the pipeline is able to successfully create meaningful relationships between collection items of the NOB portal and vocabularies, and can therefore improve the accessibility of the digitized items.

Future work can improve the pipeline. For example, the entity extraction could be improved. Currently, NLP tools work best on English texts. As the field of NLP is ever evolving, improvement and development of new tools is likely. Therefore, we recommend experimenting with other NER tools to gain better results. Another way to improve entity extraction is by tweaking the crowdsourcing tasks. In our tasks, we extracted entities from single sentences. Referencing words, such as ‘he’ and ‘that’, were not extracted, as the context around the sentence is missing. As a consequence, a lot of verbs could not be linked with an actor. We recommend a pilot that extracts entities from paragraphs to improve results.

Lastly, we suggest enriching the SEM by adding subtypes to the core classes actor, event, place, and time. For example, nationalities can be added to names of people. Adding these subtypes will give an even richer context.

ACKNOWLEDGMENTS

I would like to thank Lora Aroyo, Lizzy Jongma, Oana Inel, Victor de Boer, Diana Helmich, Antske de Vries and Ramses Ijff for their guidance and/or help with this research.

REFERENCES

1. Ananiadou, S., & McNaught, J. (2006). Text mining for biology and biomedicine (pp. 1-12). London: Artech House.

2. Aroyo, L., & Welty, C. (2012). Harnessing disagreement for event semantics. Detection, Representation, and Exploitation of Events in the Semantic Web, 31.

3. de Boer, V., Oomen, J., Inel, O., Aroyo, L., van Staveren, E., Helmich, W., & de Beurs, D. (2015). DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web, 35, 152-158. http://dx.doi.org/10.1016/j.websem.2015.06.003

4. Dahlström, M., Hansson, J., & Kjellman, U. (2012). ‘As We May Digitize’ — Institutions and Documents Reconfigured. LIBER Quarterly, 21(3-4), 455-474.

5. Drapeau, R., Chilton, L. B., Bragg, J., & Weld, D. S. (2016). MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP).

6. Dunn, S., & Schumacher, M. (2016). Explaining Events to Computers: Critical Quantification, Multiplicity and Narratives in Cultural Heritage. DHQ: Digital Humanities Quarterly.

7. Etzion, O., Fournier, F., & von Halle, B. (2015). “The Event Model” for Situation Awareness. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 38(4), 105-115.

8. Fleurbaay, E., & Eveleigh, A. (2012). Crowdsourcing: Prone to Error? International Council on Archives Conference 2012, 20-24.

9. Fokkens, A., Ter Braake, S., Ockeloen, N., Vossen, P., Legêne, S., & Schreiber, G. (2014). BiographyNet: Methodological Issues when NLP supports historical research. In LREC, 3728-3735.

10. van Hage, W., Malaisé, V., Segers, R., Hollink, L., & Schreiber, G. (2011). Design and use of the Simple Event Model (SEM). Web Semantics: Science, Services and Agents on the World Wide Web, 9(2), 128-136. http://dx.doi.org/10.1016/j.websem.2011.03.003

11. Hildebrand, M., Brinkerink, M., Gligorov, R., van Steenbergen, M., Huijkman, J., & Oomen, J. (2013). Waisda? Proceedings of the 21st ACM International Conference on Multimedia - MM '13. http://dx.doi.org/10.1145/2502081.2502221

12. Holmes, T., & Rahe, R. (1967). The social readjustment rating scale. Journal of Psychosomatic Research, 11(2), 213-218. http://dx.doi.org/10.1016/0022-3999(67)90010-4

13. van Hooland, S., De Wilde, M., Verborgh, R., Steiner, T., & Van de Walle, R. (2015). Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities, 30(2), 262-279. http://dx.doi.org/10.1093/llc/fqt067

14. Inel, O. (2016). Machine-Crowd Annotation Workflow for Event Understanding Across Collections and Domains. The Semantic Web: Latest Advances and New Domains, 813-823. http://dx.doi.org/10.1007/978-3-319-34129-3_50

15. Kliegr, T., Svátek, V., Chandramouli, K., Nemrava, J., & Izquierdo, E. (2008). Wikipedia as the Premiere Source for Targeted Hypernym Discovery. Wikis, Blogs, Bookmarking Tools: Mining the Web 2.0, 38-45.

16. Klijn, E. (2015). Van 'oud' geheugen naar digitaal brein. Massadigitalisering in praktijk. Tijdschrift voor Mediageschiedenis, 14(2), 56-68.

17. Lasek, I., & Vojtás, P. (2013). Various approaches to text representation for named entity disambiguation. International Journal of Web Information Systems, 9(3), 242-259.

18. Mendes, P., Jakob, M., García-Silva, A., & Bizer, C. (2011). DBpedia Spotlight: shedding light on the web of documents. Proceedings of the 7th International Conference on Semantic Systems, 1-8.

19. Moens, M. F. (2006). Information extraction: algorithms and prospects in a retrieval context (Vol. 21). Springer Science & Business Media.

20. Mostern, R., & Johnson, I. (2008). From named place to naming event: creating gazetteers for history. International Journal of Geographical Information Science, 22(10), 1091-1108. http://dx.doi.org/10.1080/13658810701851438

21. Oomen, J., & Aroyo, L. (2011). Crowdsourcing in the cultural heritage domain. Proceedings of the 5th International Conference on Communities and Technologies - C&T '11. http://dx.doi.org/10.1145/2103354.2103373

22. Paykel, E., Prusoff, B., & Uhlenhuth, E. (1971). Scaling of Life Events. Archives of General Psychiatry, 25(4), 340-347. http://dx.doi.org/10.1001/archpsyc.1971.01750160052010

23. Renteria-Agualimpia, W., Lopez-Pellicer, F., Lacasta, J., Zarazaga-Soria, F., & Muro-Medrano, P. (2016). Improving the geospatial consistency of digital libraries metadata. Journal of Information Science, 42(4), 507-523. http://dx.doi.org/10.1177/0165551515597364

24. Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. (2008). Cheap and Fast - But is it Good? Evaluating non-expert annotations for natural language tasks. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 254-263.

25. Tamilin, A., Magnini, B., Serafini, L., Girardi, C., Joseph, M., & Zanoli, R. (2010). Context-driven semantic enrichment of Italian news archive. The Semantic Web: Research and Applications, 364-378.

26. Theodosiou, Z., & Tsapatsoulis, N. (2011). Crowdsourcing annotation: Modelling keywords using low level features. 2011 IEEE 5th International Conference on Internet Multimedia Systems Architecture and Application. http://dx.doi.org/10.1109/imsaa.2011.6156351

27. Tranmer, M., Marcum, C. S., Morton, F. B., Croft, D. P., & de Kort, S. R. (2015). Using the relational event model (REM) to investigate the temporal dynamics of animal social networks. Animal Behaviour, 101, 99-105.

28. Vuurens, J., de Vries, A. P., & Eickhoff, C. (2011). How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR'11).


APPENDIX
