
Semantic Annotation of Natural History Collections

Accepted Manuscript

Semantic annotation of natural history collections

Lise Stork, Andreas Weber, Eulàlia Gassó Miracle, Fons Verbeek, Aske Plaat, Jaap van den Herik, Katherine Wolstencroft

PII: S1570-8268(18)30028-3

DOI: https://doi.org/10.1016/j.websem.2018.06.002

Reference: WEBSEM 462

To appear in: Web Semantics: Science, Services and Agents on the World Wide Web

Received date : 2 October 2017 Revised date : 30 March 2018 Accepted date : 18 June 2018

Please cite this article as: L. Stork, A. Weber, E.G. Miracle, F. Verbeek, A. Plaat, J. van den Herik, K. Wolstencroft, Semantic annotation of natural history collections, Web Semantics: Science, Services and Agents on the World Wide Web (2018), https://doi.org/10.1016/j.websem.2018.06.002 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Semantic Annotation of Natural History Collections

Lise Stork a,*, Andreas Weber b, Eulàlia Gassó Miracle c, Fons Verbeek a, Aske Plaat a, Jaap van den Herik a,d, Katherine Wolstencroft a

a Leiden Institute of Advanced Computer Science, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
b University of Twente, Twente, the Netherlands
c Naturalis Biodiversity Center, Leiden, the Netherlands
d The Leiden Centre for Data Science, Leiden, the Netherlands

Abstract

Large collections of historical biodiversity expeditions are housed in natural history museums throughout the world. Potentially they can serve as rich sources of data for cultural historical and biodiversity research. However, they exist as only partially catalogued specimen repositories and images of unstructured, non-standardised, hand-written text and drawings. Although many archival collections have been digitised, disclosing their content is challenging. They refer to historical place names and outdated taxonomic classifications and are written in multiple languages. Efforts to transcribe the hand-written text can make the content accessible, but semantically describing and interlinking the content would further facilitate research. We propose a semantic model that serves to structure the named entities in natural history archival collections. In addition, we present an approach for the semantic annotation of these collections whilst documenting their provenance. This approach serves as an initial step for an adaptive learning approach for semi-automated extraction of named entities from natural history archival collections. The applicability of the semantic model and the annotation approach is demonstrated using image scans from a collection of 8,000 field book pages gathered by the Committee for Natural History of the Netherlands Indies between 1820 and 1850, and evaluated together with domain experts from the field of natural and cultural history.

Keywords: Linked Data, Biodiversity, Natural History Collections, Ontologies, Semantic Annotation, History of Science

1. Introduction

Within the field of biodiversity, species research includes the observation and recording of species occurrences in particular geographical areas. Naturalists have been collecting such data for several hundred years and early records are typically housed in natural history museums as hand-written field books, drawings and specimens. However, due to a lack of standardised classification practices during historical biodiversity expeditions, multilingualism and historical terms, the disclosure of such collections proves challenging and time-consuming [23]. Ideas should be developed for the use of semi-automated processes to disclose these collections in order to make them accessible to biodiversity researchers as well as those studying natural and cultural history. The tower of the Naturalis Biodiversity Center in Leiden houses a collection that includes the archives of all expeditions undertaken by the Committee for Natural History of the Netherlands Indies (Natuurkundige Commissie voor Nederlandsch-Indië), recorded in Indonesia between 1820 and 1850. It contains roughly 8,000 field book pages and about 10,000 specimens. Such a collection would shed light upon the development and evolution of biodiversity research concerning insular Southeast Asia in the first half of the nineteenth century. But, as few methods exist to disclose such collections, they remain hidden from the general public as well as researchers.

* Corresponding author

Email addresses: l.stork@liacs.leidenuniv.nl (Lise Stork), a.weber@utwente.nl (Andreas Weber), eulalia.gassomiracle@naturalis.nl (Eulàlia Gassó Miracle), f.j.verbeek@liacs.leidenuniv.nl (Fons Verbeek), a.plaat@liacs.leidenuniv.nl (Aske Plaat), h.j.vandenherik@law.leidenuniv.nl (Jaap van den Herik), k.j.wolstencroft@liacs.leidenuniv.nl (Katherine Wolstencroft)

Through the emergence of digitisation projects [5, 36], new possibilities arise to disclose hand-written manuscript collections with digital tools. Initiatives such as the Field Book Project [36], for example, use manual full-text transcription to make their collections available to the general public. In this paper we propose to disclose natural history archival collections through semantic annotation of the archive content. Many definitions of semantic annotation exist, but we take it to be the process of producing structured annotations from the named entities in texts. These named entities form the general semantics of these texts. Coupling them with background knowledge, and linking them through formal descriptions, provides connectivity throughout the documents [21]. Work has already been done on linking collections and items using the principles of linked data, not only regarding biodiversity [15, 28], but also regarding cultural heritage collections in general [9, 7, 8, 6, 10, 11]. Fewer examples exist where the content of items in such collections is semantically linked [7]. Such an approach would serve to facilitate the use of structured queries and reasoning over the data, data aggregation and, through the use of Internationalised Resource Identifiers (IRIs), disambiguation of entities. This paper makes the following contributions:

1. We provide a semantic model, an application ontology written in OWL¹, to structure drawing captions and historical occurrence records in field books. For this we integrate ontologies describing biodiversity, geographic locations and annotation provenance.

2. We present a semantic annotation tool, the Semantic Field Book Annotator, which uses the application ontology to enable domain experts to produce structured annotations from digitised natural history archival collections. In addition, the tool documents the provenance of annotations.

3. We provide the results of a qualitative evaluation of the proposed model and annotation process. These results will inform the development of an adaptive learning approach leading to semi-automated annotation.

We show the applicability of the ontology and annotation workflow on a use-case of roughly 8,000 image scans from a collection of field notes and drawings and about 10,000 specimens, gathered by the Committee for Natural History of the Netherlands Indies. This work is part of the Making Sense project.²

The paper is structured as follows: in section 2 we provide some background information regarding natural history research and outline the requirements for the development of the semantic model. In section 3 we discuss the development method and process: we discuss requirements in section 3.1, the related work regarding semantics for biodiversity in section 3.2, elucidation of the content of natural history collections by domain experts in section 3.3 and description of the design choices and the final semantic model for the description of natural history archival collections in section 3.4. Section 4 describes the annotation approach, a workflow and tool to produce structured annotations from natural history archival collections using the semantic model. In section 5 we evaluate the semantic annotation approach qualitatively and discuss the data acquired from the semantic annotation of a field book from our use-case. Lastly we discuss our results, describe limitations and outline future work in section 6.

2. Background

Biodiversity research aims to understand the whole of life on earth, its evolution and the various factors that generate its diversity. The field is usually subdivided into research regarding species, genetics and ecology. Inherent to species research is the comparison and classification of the various plants and animals that inhabit our world. In order to realise this, naturalists in the field are challenged to classify and order observations of organisms and develop methods that moderate systematic descriptions. Expeditions to biodiverse areas allow naturalists to record organism observations and classifications. Field books are the containers that preserve these observation records. They provide rich descriptions of species-specific traits such as measurements of specific organs or other body parts, the environmental conditions in which organisms are discovered and information about how organisms were collected, classified and described. Because of this, field books provide rich insight into the daily practices, methods, and results of the research field [23]. Besides field books, visual material is assembled during expeditions. Historically, collectors were accompanied by professional illustrators, who produced detailed drawings of organisms, as shown in figure 1.

¹ https://www.w3.org/OWL/
² makingsenseproject.org

During the development of biodiversity research, methods of species classification were continuously subject to intense discussion [26]. Multiple theories emerged regarding collection practices and species classifications. In particular in the early nineteenth century and before, naturalists were struggling to find and agree upon one ‘true’ natural system [26].

Figure 1: A manuscript taken from the collection of the Committee for Natural History of the Netherlands Indies. Collection Naturalis Biodiversity Center, MM-NAT01 AF NNM001000415. Captions say: Fig.1-2 et 3. Molosse mégère e le crane. Fig.4-5 et 6. Molosse grêle et details de la tête. Pl.68. Illustrator unknown. Image free of known restrictions under copyright law (Public Domain Mark 1.0).

Natural history collections embody this search for a terminological structure which could be used to order, describe and classify nature. The lack of consensus during historical biodiversity expeditions resulted in species descriptions that are challenging to analyse within the present scientific paradigm, and also within collections themselves: (i) biological classification systems implied in field books cannot be directly mapped to present taxonomies, (ii) taxa have synonyms within collections and (iii) scientific names shift between genera and species [26, 20, 3], as shown in figure 2. Matching organisms based on metadata recorded in field books can potentially remove ambiguity concerning classifications. Manually structuring and comparing the data would, however, be a time-consuming process, as natural history collections often contain thousands of manuscripts and specimens. Moreover, records are written in hard-to-read handwriting and multiple languages interspersed with historical terms. Making sense of the data without the use of automated processes becomes an intractable problem.

Scotophilus kuhlii temminckii (Horsfield, 1824) [current name]
Vespertilio temminckii Horsfield, 1824 [synonym]
Vespertilio fulvus Kuhl & Van Hasselt [synonym]


3. Development of a semantic model

Although data standards, such as the Darwin Core [39], exist for present-day biodiversity research, it became clear through interviews with cultural and natural historians that some tailoring would be required for the semantic annotation of historical biodiversity collections. The development process was set up taking into account the ontology development process described by Fernández et al. [12]. The emphasis in the development process of our model is on the re-use and re-engineering of existing semantic models. We thus follow the ontology development process as outlined in scenario 4 of the NeOn methodology for ontology engineering [33]. Furthermore, we support a user-centered design, where the focus is on the needs of the end user, similar to a method for database design described by Gray [14], where questions of domain experts become requirements for the design and evaluation of the system.

3.1. Requirements for a semantic model

The requirements for the semantic model describe user requirements for elucidating content, and requirements for adhering to the principles of sharing data in the semantic web.

1. Elucidating Content

R1 The model should formalise the general semantics of species observations described in field books and drawings.

(a) The model should include the named entities that domain experts use when constructing queries in order to answer their research questions.

(b) The model should reveal relations between the named entities and their characteristics, for instance, hierarchical or transitive relations, so that these can be exploited in rich content queries. The model should thus be written in an ontology language such as the recommended W3C standard language, OWL.

R2 The model should be able to deal with name variants, such as historical terms, abbreviations, scientific and vernacular terms, and their context.

(a) Standardised terms for resources, such as IRIs, should be used to represent named entities so that name variants can be linked and dissimilar entities with a similar name can be disambiguated.

(b) The context of name variants should be made explicit so that it can be used by domain experts as well as automated reasoners.

2. Serving Structured Annotations to the Semantic Web

R3 The model should re-use existing ontologies and vocabularies to facilitate data aggregation on the web.

R4 The model should store annotation provenance to enable the sources of annotations to be traced and to facilitate scientific discourse over the content.

(a) The annotations should store metadata regarding the annotation process (annotator, date/time, interpretation) to track the provenance of an interpretation.

(b) The annotations should store metadata regarding their span in the image collection: multiple pages, single pages or fragments from pages, to keep track of the provenance of annotations in relation to the collection. As we will use these fragments in further research for named entity extraction, linking the annotations and their metadata to these fragments facilitates repetition of experiments by other researchers.
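To make requirement R4 concrete, the following is a minimal Python sketch of the metadata that a single annotation would need to carry. It is an illustration of the requirement only, not the provenance model adopted in this paper: the class names, field names and the pixel-region convention are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class PageFragment:
    """One image scan, optionally narrowed to a rectangular region (hypothetical)."""
    page_id: str
    # (x, y, width, height) in pixels; None means the whole page is meant.
    region: Optional[Tuple[int, int, int, int]] = None


@dataclass(frozen=True)
class Annotation:
    """Metadata asked for by R4(a) (process) and R4(b) (span in the collection)."""
    entity_iri: str        # IRI of the annotated named entity (R2a)
    interpretation: str    # the annotator's reading of the text
    annotator: str         # who made the annotation
    created: datetime      # when it was made
    span: Tuple[PageFragment, ...]

    def __post_init__(self) -> None:
        if not self.span:
            raise ValueError("an annotation must point at least at one page")

    def span_kind(self) -> str:
        """Classify the span as in R4(b): multiple pages, single page or fragment."""
        pages = {fragment.page_id for fragment in self.span}
        if len(pages) > 1:
            return "multiple pages"
        if any(fragment.region is not None for fragment in self.span):
            return "fragment"
        return "single page"


# Hypothetical usage: one annotated fragment on one page.
example = Annotation(
    entity_iri="http://sws.geonames.org/1648473",
    interpretation="'Buitenzorg' read as a place name",
    annotator="annotator-1",
    created=datetime(2018, 1, 1, tzinfo=timezone.utc),
    span=(PageFragment("page-1", (10, 20, 300, 80)),),
)
```

Keeping the span as explicit page fragments is what would allow the same fragments to be re-used later as input for named entity extraction experiments.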

3.2. Semantics for biodiversity

Below we discuss available state-of-the-art standards and ontologies regarding semantics for biodiversity.

3.2.1. The Darwin Core.

The biodiversity data standard that is most commonly used to model species occurrences is the Darwin Core standard (DwC) [39]. It has been developed through community consensus and thus describes which concepts in observation records are most important to the community. The DwC describes these key concepts with standardised terms. Its main classes are: dwc:Organism, dwc:Taxon, dwc:Identification, dwc:Occurrence and dwc:Event. The standard therefore satisfies R1a, and thus proves to be a suitable baseline for our model.

For the purpose of semantically annotating natural history archival collections, however, the DwC alone does not suffice. Firstly, the DwC does not satisfy R1b. Although the terms from the DwC were converted to be used with RDF [2] in 2012, the standard does not allow all properties to be used within its dwciri: namespace, adopted to refer to IRIs [2]. This means that not all relations can be used to point to IRIs, hindering the linking of entities from handwritten observation records during an annotation effort. The current standard lacks properties to interconnect its main classes and does not exceed the semantics of RDF Schema. This means it does not include types of properties and property axioms that we require, such as equivalence and transitivity.

Moreover, the DwC does not model taxonomies explicitly, so reasoning algorithms cannot benefit from their inherently hierarchical nature. It models classification systems by connecting a taxon identifier to a literal through a rank property, e.g., <taxon1> dwc:order "Chiroptera". Finally, the DwC use of literals for named entities does not fulfil our requirements. As literals are multi-interpretable, they do not serve as unique identifiers within RDF. In the field of biological taxonomy, and especially historical taxonomy, where multiple interpretations of species and naming conventions exist, being able to disambiguate between terms with the same name is crucial [20]. In these respects, the DwC does not satisfy R2a and R2b.

3.2.2. The Darwin Core Semantic Web.

The Darwin Core Semantic Web (Darwin-SW)³ ontology extends the DwC by providing properties to link the main classes of the DwC [1]. It hereby addresses the limitations of the DwC regarding R1b. The Darwin-SW also introduces a new class, the dsw:Token class, to link the graphical model to evidence in the form of a dwc:Specimen, dwc:HumanObservation or other class on which the identification of an organism during an occurrence event is based. This creates the possibility to match observation records to specimens and drawings, based on their metadata. However, the ontology still does not allow biological taxonomies to be graphically modelled, something that is also included in R1b. Finally, to the best of our knowledge, the applicability of the Darwin-SW ontology has not yet been demonstrated on large datasets.

3.2.3. TaxMeOn.

The TaxMeOn⁴ Meta-Ontology of Biological Names is an ontology that models biological taxonomies [35]. The ontology uses IRIs for taxa and introduces hierarchy by connecting the taxa to each other using the transitive isPartOfHigherTaxon property. This property is made transitive so that it can be logically inferred that a scientific name is part not only of its own higher taxon, but of all higher taxa. This way of modelling classification systems is suitable for our purpose: taxa can be linked during the annotation process, recreating the historical taxonomy and allowing subsequent querying of the archive for all species from a certain class or order. Moreover, the instances are modelled as IRIs, avoiding name ambiguity. Its conceptualisation, however, is subtly different from that of the Darwin-SW ontology: TaxMeOn models taxa as instances of a rank class such as genus, whereas the Darwin-SW vocabulary only models taxa as instances of the class dwc:Taxon.

In summary, present-day biodiversity records can be described using terms from the DwC and the Darwin-SW, but some additions need to be considered for the description of natural history collections. Domain experts' interests were explored to complement the existing vocabularies to satisfy R1a. To address R1b, the Darwin-SW ontology was re-structured so that the biological taxonomies can be modelled based on the structure of the TaxMeOn ontology. Furthermore, the terms in the field books were linked to standardised terms from other datasets. This accommodates the linking of different spellings and abbreviations (R2a), the inclusion of context metadata (R2b) and enables data aggregation on the web (R3). Finally, the storage of provenance metadata of annotations (R4) was addressed. The process is explained in the coming subsections.

3.3. Data elucidation by domain experts

To inform the design process, the interests of domain experts were assessed via qualitative interviews and a test annotation procedure, addressing R1a. Seven domain experts participated in the interviews that were set up to acquire knowledge about interesting concepts in field books: two cultural historians, two information specialists handling collection queries from within the Naturalis Biodiversity Center (NBC) and three biologists interested in taxonomy and the history of biodiversity. A subset of 59 pages from our use-case was selected for inspection. These pages contained all species descriptions within the collection belonging to the order Chiroptera, an order of mammals that consists of the bats. The subset consisted of 40 pages of observation descriptions and 19 drawings.

⁴ http://schema.onki.fi/taxmeon/

3.3.1. Knowledge acquisition

First, participants were asked to describe their research interests and denote research questions they would like to address with access to a natural history archive. Examples included ‘Are the species named directly in the field or do they receive a number or a temporary name?’ and ‘Did specific naturalists have a specialisation, such as the description of plants?’. Subsequently, they were asked to note down conceptual elements they would expect to find in historical observation records that would help them answer their research questions. Being primed thus to think in concepts, they were asked to use these concepts to annotate the field book pages and drawings, allowing the addition of other concepts discovered during the annotation process.

Table 1: Observation record elements organised by topic. Similar concepts were merged, e.g., Linnean Name and Species Name. Each annotated concept is followed by c, (n-7).

Topic: Classification
  1. Linnean Name: 30, (7-7)
  2. Vernacular Name: 2, (2-7)
  3. Literature used: 2, (2-7)
  4. Synonyms: 6, (4-7)
  5. New namings: 3, (2-7)
  6. Additional class.: 6, (4-7)

Topic: Species
  1. Rarity: 5, (2-7)
  2. Use by Locals: 0
  3. Range: 5, (2-7)

Topic: Expedition
  1. Person: 23, (7-7)
     (a) Collector: 2, (1-7)
     (b) Author: 6, (2-7)
     (c) Companion: 0
     (d) Local person: 0
     (e) Illustrator: 5, (3-7)
  2. Role of Indigenous Population in Knowledge Retrieval: 0
  3. Collection Practice: 2, (2-7)
  4. Drawing property: 5, (3-7)
  5. Language peculiarity: 0
  6. Date of Observation: 10, (7-7)
  7. Place of Observation: 22, (7-7)
  8. Publication field book: 0

Topic: Organism
  1. Corresponding specimen: 1, (1-7)
  2. Corresponding drawing: 2, (1-7)
  3. Condition: 0
     (a) Living: 0
     (b) Dead: 0
  4. Quality: 14, (7-7)
     (a) Morphology: 5, (5-7)
     (b) Colour: 2, (2-7)
     (c) Behaviour: 8, (2-7)
  5. Preservation: 0
  6. Drawing: 17, (7-7)
     (a) parts: 7, (2-7)
     (b) views: 4, (3-7)
  7. Anatomy: 40, (7-7)
  8. Measurement: 5, (5-7)
  9. Count: 1, (1-7)
     (a) Specimen: 0
     (b) Anatomical entity: 1, (1-7)
  10. Gender: 1, (1-7)

3.3.2. Results

Table 1 lists the concepts that were identified by the domain experts, followed by a number c indicating how often the concept was used for annotation of the subset, accumulated for all participants, and a number n-7 indicating how many of the 7 participants used the concept for annotation. If a more specific subclass was used for annotation, it was included in the count for both the general class and the more specific class. The concepts can be broadly divided into those relating to species classifications, their abundance and use, expedition details and characteristics of the observed organism.

Within our experiment, cultural historians appeared most interested in expedition practices, more than in the specimens or species described. During the annotation process, they were searching for clues in the text as to why certain languages were used interchangeably, in what ways knowledge was recorded, which indigenous people were helping to find new species, what methods naturalists used to find and gather the specimens or what adjectives were used to describe the behaviour or appearance of organisms. The biologists appeared to be more interested in classification systems, naming conventions, species characteristics and literature used for classification. The output from the interviews and annotation procedure was used to aid the design process of the NHC-Ontology. The questions from domain experts were used to test the output of the annotated field book in section 5.

The most important named entities from table 1 that were extensively annotated by the experts in the field books, but are not included in the Darwin-SW model, are dates, additional classifications (synonyms and later classifications), additional occurrences (species range and rarity) and structured organism descriptions such as anatomical parts, qualities and measurements. We thus adopt these in the final model.

3.4. The core model: the NHC-Ontology

In this section we explain further design choices for the Natural History Collection-Ontology (NHC-Ontology) and describe the adoption and application of the classes and properties. The ontology extends the Darwin-SW ontology with two classes and seven properties in order to address the remaining limitations mentioned in section 3.2. Figure 3 provides a graphical overview of the model. The two classes and all new properties are added within our own namespace, indicated by the dashed lines and the nhc: namespace.

3.4.1. Classifications and taxonomies.

The class nhc:TaxonRank connects to the Darwin-SW model. All taxa are modelled as instances of the class dwc:Taxon and all taxon ranks as instances of the class nhc:TaxonRank. We adopt a derivative of the DwC property dwc:taxonRank, see figure 3. As the DwC standard does not have an analogous property in the dwciri: namespace, we adopt it in our namespace. To represent hierarchy in the classification system we created the transitive property nhc:belongsToTaxon to link a taxon to a taxon higher in rank. Because of this transitive property we can, for example, query a collection for all families belonging to a specific order, e.g., ‘Show me all families that belong to the order Chiroptera’.
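The effect of the transitive nhc:belongsToTaxon property on such a query can be sketched in plain Python. This is an illustrative toy only, not the implementation used for the NHC-Ontology: the ex: identifiers stand in for real IRIs, the two dictionaries are hypothetical sample data, and the sketch assumes each taxon has at most one direct higher taxon.

```python
from typing import Dict, Set

# Direct nhc:belongsToTaxon links (taxon -> next higher taxon).
# Hypothetical sample data; "ex:" identifiers are placeholders for IRIs.
BELONGS_TO: Dict[str, str] = {
    "ex:Pteropus_minimus": "ex:Pteropus",
    "ex:Pteropus": "ex:Pteropodidae",
    "ex:Pteropodidae": "ex:Chiroptera",
    "ex:Vespertilio": "ex:Vespertilionidae",
    "ex:Vespertilionidae": "ex:Chiroptera",
    "ex:Chiroptera": "ex:Mammalia",
}

# nhc:taxonRank of each taxon (again hypothetical sample data).
TAXON_RANK: Dict[str, str] = {
    "ex:Pteropus_minimus": "species",
    "ex:Pteropus": "genus",
    "ex:Vespertilio": "genus",
    "ex:Pteropodidae": "family",
    "ex:Vespertilionidae": "family",
    "ex:Chiroptera": "order",
    "ex:Mammalia": "class",
}


def higher_taxa(taxon: str) -> Set[str]:
    """All taxa reachable by following belongsToTaxon, i.e. its transitive closure."""
    found: Set[str] = set()
    current = BELONGS_TO.get(taxon)
    while current is not None and current not in found:  # guard against cycles
        found.add(current)
        current = BELONGS_TO.get(current)
    return found


def taxa_of_rank_within(rank: str, ancestor: str) -> Set[str]:
    """E.g. 'show me all families that belong to the order Chiroptera'."""
    return {
        taxon
        for taxon, taxon_rank in TAXON_RANK.items()
        if taxon_rank == rank and ancestor in higher_taxa(taxon)
    }


families = taxa_of_rank_within("family", "ex:Chiroptera")
```

With an OWL reasoner the closure would be inferred from the property's transitivity axiom rather than computed by hand; the loop above only makes that inference explicit.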

In binomial nomenclature, species are named using two names: a genus and a specific epithet or species name. Furthermore, an abbreviated publisher name is included to avoid name ambiguity, e.g., Pteropus minimus Geoff, where Geoff refers to Étienne Geoffroy-Saint-Hilaire, a French zoologist. Similarly, in our model Genus+species is seen as a unit representing a species.⁵ The name of the publisher is linked separately, as domain experts indicated that they have a special interest in some authors and would like to be able to retrieve all taxonomical names from a specific scientific author, for instance to obtain knowledge concerning which species they named and their naming conventions. When a species is newly discovered and thus unpublished, authors sometimes use ‘Nobis’, Latin for ‘by us’, or some other placeholder for the name of the scientific publisher. ‘Nobis’ in this case still refers to a scientific author name, namely the writers of the field book. Annotating the term as the scientific author of the scientific name is useful as, in combination with the author name of the field book, the taxonomical names can be resolved. To link the publisher to the scientific name, we use the DwC term scientificNameAuthorship, which we also adopt in our namespace as it does not yet have an equivalent in the dwciri: namespace.

3.4.2. Evidence for identification.

In the Darwin-SW model, the class dsw:Token is used to link an identification to the resource on which the identification was based. This class can be replaced with the more specific dwc:PreservedSpecimen or dwc:HumanObservation class. The human observation represents a single observation record from a field book or a drawing. To achieve this granularity, we let an instance of the dwc:HumanObservation class point to multiple field book pages describing one record. This way, users can retrieve observation records, drawings and specimens relating to their research interests, e.g., ‘show me all observations recorded on Java’.

As domain experts were interested in the measurements used for classification of an organism, as is visible in table 1, we adopt the dwc:MeasurementOrFact class in the ontology, a class taken from the DwC standard. The dwc:MeasurementOrFact class is connected to the dsw:Token class with the dsw:derivedFrom property or its inverse dsw:hasDerivative to indicate that it is derived from, or a part of, the observation record, see figure 3. As the dsw:derivedFrom property is transitive, the measurement is also derived from the specific organism, which is beneficial for querying and reasoning. We use this measurement class to span measurement tables. Organism fact descriptions, however, cover full paragraphs. We adopt the property nhc:measuresOrDescribes in our model to link an instance of the class dwc:MeasurementOrFact to a term relating to an anatomical entity or property of the organism, such as liver or colour. This way, we can point to a free text description of an organism characteristic, by annotating the anatomical entity or property initiating the description. One cultural historian was, for instance, interested in the adjectives used when describing the colour and morphology of anatomical entities. Pages describing a specific anatomical entity could be retrieved in one query, e.g., ‘Show me all observation records from person X that measure a liver’.

5Exceptions where a genus is modelled individually are field book pages that


Figure 3: The NHC-Ontology, an extension of the Darwin-SW graph model for annotating natural history collections. Classes shown in the diagram: dwc:Organism, dwc:Identification, dwc:Event, dwc:Location, dwc:Occurrence, dsw:Token, dwc:Taxon, nhc:TaxonRank, dwc:MeasurementOrFact, foaf:Person, gn:Feature, nhc:Date and uberon:0001062 ∨ ncit:C20189. Properties shown in the diagram: dsw:hasOccurrence, dsw:hasIdentification, dsw:isBasedOn, dsw:evidenceFor, dsw:derivedFrom, dsw:atEvent, dsw:locatedAt, dwciri:identifiedBy, dwciri:recordedBy, dwciri:inDescribedPlace, dwciri:toTaxon, skos:narrower, nhc:belongsToTaxon, nhc:taxonRank, nhc:scientificNameAuthorship, nhc:measuresOrDescribes, nhc:verbatimDate, nhc:additionalIdentification and nhc:additionalOccurrence.

3.4.3. Verbatim date.

A further addition is the class nhc:Date. This class is used to annotate verbatim dates: an instance of the class, e.g., nc:date1, is given a label such as 10 Apr. 1821 or Sept. It is connected to the dwc:Event class using the dwc:verbatimEventDate property to indicate this. The verbatim date will be converted to a standard format and linked to the dwc:Event class using the dwc:year, dwc:month and dwc:day properties. This way, dates can be used for querying using filters. Dates are an important part of species descriptions and are easily annotated as they are formally formatted and have a prominent position on the page.
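The conversion step from a verbatim label to separate year, month and day values could look roughly like the Python sketch below. It is a simplified illustration rather than the conversion used in this work: the function name is hypothetical, only the layout "day month year" with optional day and year is handled, and only English month abbreviations are recognised, whereas the field books are multilingual.

```python
import re
from typing import Dict, Optional

# English month abbreviations only; an assumption made for this sketch.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Optional day, a month word with an optional trailing full stop, optional year.
_VERBATIM = re.compile(
    r"^\s*(?:(?P<day>\d{1,2})\s+)?(?P<month>[A-Za-z]+)\.?(?:\s+(?P<year>\d{4}))?\s*$"
)


def parse_verbatim_date(label: str) -> Optional[Dict[str, Optional[int]]]:
    """Split a verbatim date label into values for dwc:year, dwc:month and dwc:day.

    Parts that are absent from the label are returned as None; labels that do
    not match the assumed layout yield None as a whole.
    """
    match = _VERBATIM.match(label)
    if match is None:
        return None
    month = MONTHS.get(match.group("month")[:3].lower())
    if month is None:
        return None
    day = match.group("day")
    year = match.group("year")
    return {
        "year": int(year) if year else None,
        "month": month,
        "day": int(day) if day else None,
    }


full_date = parse_verbatim_date("10 Apr. 1821")
month_only = parse_verbatim_date("Sept.")
```

Keeping the verbatim label next to the parsed values means a partial date such as a bare month is still queryable, while the original reading stays available for inspection.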

3.4.4. Written annotations.

In field books, we often see manual annotations or revisions written above or adjacent to the original text. Types of annotations that occur frequently in our use-case relate to the classification of an observed organism or an additional observation. A naturalist, for instance, classified an observed organism as a different taxon at a later date, based on further research of the described traits and anatomical parts or based on other literature. Whether this represents a shift in naming conventions, a new interpretation of the metadata or merely additional information or synonymy is unclear. Additionally, naturalists made side notes of observations of the same species by different naturalists at different locations, such as ‘In Batavia according to Diard’.

In our qualitative analysis, biologists indicated that they were interested in exploring these annotations. It has to be transparent for them and other researchers which text was written at the time of the original observation, belonging to the original record, and which was added later. To emphasise these structures we added two properties: nhc:additionalIdentification and nhc:additionalOccurrence. Both are sub-properties of the property nhc:additional, such that all additional annotations can be accentuated or queried using this property.

3.4.5. Linking to external ontologies and datasets

The ontology connects to classes from other ontologies and thesauri such as Uberon⁶ for anatomical entities [27] and NCIT⁷ for species attributes [13], both used for the identification of a taxon, the Geonames Database⁸ for geographical locations [38] and VIAF⁹ for referring to persons [24] as instances of the class foaf:Person. These classes are indicated by a striped fill in figure 3. Linking to these vocabularies provides us with three benefits. First, the entities can be resolved. Second, queries can utilise the structures of these ontologies, when available, for querying and reasoning purposes. Third, these ontologies provide extra metadata. Instances from the Geonames Database, for instance, are mapped to different historical name variants, abbreviations and modern names. As an example, the entity <http://sws.geonames.org/1648473> is linked to the modern name Bogor and simultaneously to the historical name Buitenzorg, a term used in the field books. Geonames distinguishes a gn:alternateName with a language tag, such as <gn:alternateName xml:lang="id">Kota Bogor</gn:alternateName>, from a gn:name, revealing indigenous namings. Further, the property gn:shortName is used for abbreviations and gn:officialName for official names.
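To illustrate how name variants help to resolve a verbatim place name, the following sketch looks up a field book term among the names of a Geonames entity. The dictionary layout and the function resolve_place are assumptions made for this example; they stand in for a lookup against the actual Geonames data, and only the Bogor record mentioned in the text is included.

```python
# One record modelled on the Geonames entity cited in the text. The layout is
# an illustrative assumption, not the Geonames data format.
PLACES = {
    "http://sws.geonames.org/1648473": {
        "gn:name": "Bogor",
        # (name, language tag); None means no language tag is assumed here.
        "gn:alternateName": [("Buitenzorg", None), ("Kota Bogor", "id")],
    },
}


def resolve_place(verbatim_name):
    """Return the IRIs of all places whose name or alternate name matches."""
    wanted = verbatim_name.strip().lower()
    matches = []
    for iri, record in PLACES.items():
        names = [record["gn:name"]]
        names += [name for name, _lang in record["gn:alternateName"]]
        if wanted in (name.lower() for name in names):
            matches.append(iri)
    return matches
```

A historical term such as Buitenzorg and the modern name Bogor then resolve to the same IRI, which is what allows historical and present-day records to be interlinked.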

We choose not to link the ontology to biological taxon IRIs from different namespaces. As mentioned in section 3.2.1, the same species name can sometimes refer to different organisms.

⁶ http://purl.obolibrary.org/obo/
⁷ https://ncit.nci.nih.gov
⁸ http://sws.geonames.org/
⁹ http://viaf.org/viaf/


Disambiguation of species names requires metadata such as place of observation, date and the biologist who performed the classification. We propose to create unique identifiers for each taxon within the namespace of the collection. Once the annotation process is complete and the annotation data have been carefully analysed, these taxa can be resolved and linked to each other and to taxa from external datasets. This preserves the verbatim content of the field books and records the provenance of multiple mappings to present taxonomies, should this be required to represent different theories.

3.4.6. Documenting provenance of annotations.

Provenance is crucial in the disclosure of archival collections. The provenance of data extracted from collections contributes to their interpretation and value, and allows researchers to repeat experiments. To link semantic annotations to digital objects on the web, the Web Annotation Data Model,¹⁰ initially the Open Annotation Model (OA) [17], was used.¹¹ Reasons for its adoption in our model are the use of the principles of linked data, its ability to address segments or fragments of media sources, and the fact that it is well established in the linked data community. Using this data model and its ontology, we link instances of the classes from the ontology depicted in figure 3 to the image scans. Figure 4 shows an example annotation. The instance node of the class oa:Annotation refers to the annotation object itself, to which metadata relating to the annotation process is added. The instances of the classes oa:TextualTag and oa:SemanticTag are the bodies of the annotation. They indicate the verbatim transcription and the semantic interpretation of the annotation, respectively. A semantic body is always an instance of the class oa:SemanticTag, but it is also an instance of a class from the NHC-Ontology, in this case dwc:Taxon. Each annotation always has a textual body, containing its verbatim transcription. This way, the text is transcribed and semantically annotated simultaneously. At the same time, this allows for different name variants of entities that exist within the field books. When an annotation is linked to the IRI of a naturalist such as <http://viaf.org/viaf/69703180/>, which refers to the Dutch naturalist Coenraad Jacob Temminck, the textual body will contain the verbatim label that is used in the field book, such as the abbreviation Tem. Both the full name and the abbreviation from the field book will point to the part of the field book page where Temminck is referenced. The instance of the class dcmitype:StillImage from figure 4 refers to the annotated field book page and the instance of the class oa:Target to the selected fragment within the page.

[Graph figure: the annotation nc:anno (oa:Annotation) points to the target nc:image.tif#xywh=0.556,0.401,0.026,0.045 (oa:Target), a fragment of nc:image.tif (dcmitype:StillImage), and to two bodies: nc:textualBody (oa:TextualTag, "Mammalien"@de, text/plain) and nc:taxon1 (oa:SemanticTag, dwc:Taxon).]

Figure 4: Example of an annotation of the taxon Mammals written in a field book, using the Web Annotation Data Model. This annotation contains both a textual and a semantic body. The namespace nc: refers to the collection from the Committee for Natural History of the Netherlands Indies.

¹⁰ https://www.w3.org/TR/annotation-model/
¹¹ https://www.w3.org/annotation/
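The structure of the example annotation in figure 4 can be approximated as a small set of triples. The sketch below is an illustration only: the function build_annotation, the naming scheme for the textual body and the exact direction of each edge are assumptions read from the figure, and plain tuples stand in for an RDF library.

```python
def build_annotation(anno, image, xywh, transcription, lang, semantic_body, body_class):
    """Return (subject, predicate, object) triples for one annotation."""
    target = f"{image}#xywh={xywh}"
    textual_body = f"{anno}_textualBody"  # hypothetical naming scheme
    return [
        (anno, "rdf:type", "oa:Annotation"),
        (anno, "oa:hasTarget", target),
        (anno, "oa:hasBody", textual_body),
        (anno, "oa:hasBody", semantic_body),
        # The target is a fragment of the scanned page.
        (target, "oa:hasSource", image),
        (image, "rdf:type", "dcmitype:StillImage"),
        # Textual body: the verbatim transcription with its language tag.
        (textual_body, "rdf:type", "oa:TextualTag"),
        (textual_body, "rdfs:label", f'"{transcription}"@{lang}'),
        # Semantic body: typed as a tag and as an NHC-Ontology class.
        (semantic_body, "rdf:type", "oa:SemanticTag"),
        (semantic_body, "rdf:type", body_class),
    ]


triples = build_annotation("nc:anno", "nc:image.tif", "0.556,0.401,0.026,0.045",
                           "Mammalien", "de", "nc:taxon1", "dwc:Taxon")
```

Keeping the two bodies as separate nodes is what lets the same region carry both the verbatim spelling and its semantic interpretation.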

The resulting application ontology, a combination of the NHC-Ontology and the Web Annotation Data Model, provides a framework for annotating important named entities in the data. It is made accessible to users through a semantic annotation tool, the Semantic Field Book Annotator (SFB-Annotator), that enables the semantic annotation of digitised images of hand-written text and illustrations. The tool is discussed in the next section.

4. Semantic annotation of natural history collections

In recent years, projects that create platforms for the storage, transcription and annotation of digitised historical documents on the web have begun to emerge. The Field Book Project [36], for instance, was formed in 2010 as a joint initiative between the Smithsonian National Museum of Natural History (NMNH) and the Smithsonian Institution Archives (SIA). The project was set up to bring together field books from multiple natural history collections and make them available for the general public.

[Workflow diagram: an image scan collection (TIFF) is fully transcribed by hand (TXT), annotated through templates in a MediaWiki backend, e.g. {{taxon|Hirundo rustica|barn swallows}}, taxonomically referenced, and then extracted and converted (XML, CSV) to DwC-A data for publication to GBIF.]

Figure 5: From Documents to Datasets [34] workflow

The Field Book Project makes use of the Natural Collections Description (NCD)¹² standard for storing metadata on a collection level. Further, the project uses the Metadata Object Description Schema (MODS)¹³ to create item level metadata [28]. The Biodiversity Heritage Library (BHL)¹⁴ describes its data using XML and MODS or Dublin Core (DC).¹⁵ None of the above mentioned projects, however, aims to annotate the content of items within natural history collections. Responding to this need, the project From Documents to Datasets [34] provides a workflow for the conversion from digitised handwritten field books to flat data files, see figure 5, structured according to the terms from the Darwin Core standard. They propose first to fully transcribe the texts together with experts, then upload those texts together with the image scans to a MediaWiki¹⁶ server. Via templates, the taxa, locations and dates are annotated by researchers through a crowd-sourcing initiative. Taxonomic referencing, the process of resolving a historical taxon to a current one, occurs within the semantic annotation process through interpretation by the annotators. The annotations are then extracted and converted manually to Darwin Core terms, in order to publish them in the Global Biodiversity Information Facility (GBIF)¹⁷ data server [30]. This project provides an excellent methodology to structure named entities from field books. We thus build upon this methodology and extend it to fit our needs.

¹² http://rs.tdwg.org/ontology/voc/
¹³ http://www.loc.gov/standards/mods/
¹⁴ http://www.biodiversitylibrary.org/
¹⁵ http://dublincore.org/
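The extraction step of the template-based workflow can be pictured with a short sketch. It assumes that a taxon template has the form shown in figure 5, with the scientific name first and the vernacular name second; this parameter order, the function extract_taxa and the mapping to dwc:scientificName and dwc:vernacularName are assumptions for illustration, not a description of the From Documents to Datasets tooling.

```python
import re

# Matches templates of the form {{taxon|scientific name|vernacular name}}.
TAXON_TEMPLATE = re.compile(r"\{\{taxon\|([^|{}]+)\|([^|{}]+)\}\}")


def extract_taxa(wikitext):
    """Return one Darwin Core style record per taxon template in the text."""
    return [
        {"dwc:scientificName": sci.strip(), "dwc:vernacularName": vern.strip()}
        for sci, vern in TAXON_TEMPLATE.findall(wikitext)
    ]
```

Each record is a flat set of key-value pairs, which matches the flat data files that this workflow targets and contrasts with the linked graph proposed in this paper.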

4.1. Workflow

Similar to the projects mentioned at the beginning of section 4, we use the Natural Collection Description standard and the Dublin Core to enrich natural history collections on a collection and item level. On an item level, the methodological workflow approach in this project differs from the approach in figure 5 as it does not merely structure the entities semantically, it also links all the entities to form a connected graph. The data become readable and interpretable by machines and can be interlinked and aggregated with other biodiversity data on the web. To link the named entities together we use the NHC-Ontology, which also enables rich querying and reasoning. Our workflow is shown in figure 6. In our approach, we omit full-text transcription. Annotation of the most important entities from the field books already allows biodiversity researchers to create models and search the texts, simultaneously minimising annotation efforts. We also suggest that the process of taxonomic referencing of species and genera should occur after all named entities from a field book or collection are annotated and linked. As mentioned earlier, fully linked field books allow for a thorough comparison between different taxonomies and naming conventions. After a careful analysis, these taxa can be resolved and linked to other taxa, but we argue that this should be decoupled from the annotation process itself. We furthermore argue that, especially with historical biodiversity data, multiple interpretations of the data should be able to exist in parallel. We therefore choose to annotate classification hierarchies in the collection verbatim, to facilitate multiple researchers adding their own layers of interpretation. If necessary, researchers can attach free-text metadata to classes from the application ontology, using properties from the DwC standard such as dwc:habitat or dwc:samplingProtocol, which can be attached to the dwc:Event instance, dwc:organismRemarks, which can be attached to an instance of the class dwc:Organism, or dwc:identificationReferences, which adds literature referenced in the manuscripts to the dwc:Identification class.

[Workflow diagram: regions of interest (ROIs) on the image scans (TIFF) are annotated through the ROI tool interface with classes from the application ontology, e.g. <viaf:45106482> rdf:type <foaf:Person>; the resulting triples are stored in a triple store for SPARQL querying and OWL reasoning and, after taxonomic referencing, can be converted to DwC-A data for publication to GBIF.]

Figure 6: The proposed workflow for semantically annotating natural history collections.

¹⁶ https://wikisource.org/
¹⁷ http://www.gbif.org/

4.2. The Semantic Fieldbook Annotator

The Semantic Fieldbook Annotator is a web application, developed for domain experts, to harvest structured annotations from field books using the NHC-Ontology and proposed workflow. With some practice, the tool can also be used to crowd-source annotations, as long as these are validated by an expert curator.

    Anno = {"src": "http://domain/image1.tif",
            "type": "Taxon",
            "shapes": {"type": "rect",
                       "geometry": {"x": 0.546, "y": 0.031, "width": 0.065, "height": 0.018}},
            "date": "2017-04-16",
            "annotator": "https://orcid.org/0000-0002-2146-4803",
            "target": "image1.tif#xywh=0.546,0.031,0.065,0.018",
            "textualbody": "Vivera genetta"@la,
            "semanticbody": "http://makingsense.liacs.nl/rdf/nc#taxon53",
            "belongstotaxon": "http://makingsense.liacs.nl/rdf/nc#taxon45",
            "taxonrank": "http://makingsense.liacs.nl/rdf/nc#species",
            "identifiedby": "http://viaf.org/viaf/45106482/",
            "organismID": "35"}

Figure 7: The annotation process using the Semantic Field Book Annotator

As shown in figures 6 and 7, users can draw bounding boxes, or Regions Of Interest (ROIs), over the image scans to which annotations can be attached. The ROI tool makes use of the Annotorious annotation API¹⁸ to select a ROI and create an annotation object, see figure 7. The annotation object is connected with its metadata and with a target (a page or a ROI), a textual body and a semantic body. The shapes variable is used to store the geometry of the ROI relative to the image borders. In RDF, these coordinates are stored with the oa:Selector class to specify part of the source image, see figure 4. In order to make the manuscript images zoomable, Annotorious is used together with the OpenSeadragon API.¹⁹

For storage, we use a servlet that pushes the annotation to an annotation server. In the servlet, annotation objects written in JSON are converted to RDF triples using the RDF4J API, an open source Java framework for processing RDF data. The annotations are stored in the Virtuoso quad store, as it is a well evaluated store for data-intensive server applications [16]. Moreover, it can be accessed via the RDF4J API.

¹⁸ https://annotorious.github.io/
¹⁹ https://openseadragon.github.io/
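The relation between the shapes geometry and the target string of the annotation object in figure 7 can be sketched as follows. The function to_fragment_target is a hypothetical helper written in Python for illustration; the project's own conversion runs in a Java servlet and is not reproduced here.

```python
def to_fragment_target(annotation):
    """Build the '#xywh=' target string from an annotation object's geometry."""
    geometry = annotation["shapes"]["geometry"]
    filename = annotation["src"].rsplit("/", 1)[-1]
    values = (geometry["x"], geometry["y"], geometry["width"], geometry["height"])
    # Coordinates are fractions of the image size, written with three decimals.
    return f"{filename}#xywh=" + ",".join(f"{value:.3f}" for value in values)


anno = {
    "src": "http://domain/image1.tif",
    "shapes": {"type": "rect",
               "geometry": {"x": 0.546, "y": 0.031, "width": 0.065, "height": 0.018}},
}
```

Because the coordinates are relative to the image borders, the same fragment identifier remains valid when the scan is displayed at a different zoom level or resolution.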

In the annotation process, a distinction is made between explicit and implicit classes. Explicit classes, in contrast to implicit classes, refer to the group of named entities that are easily observed in the field books. These are: the taxonomical name, location, date, scientific publisher, writer, anatomical entities, properties and tables. The implicit classes serve to connect the explicit classes. However, they can also be used to link to class-specific metadata encountered in the field books. The Darwin Core’s dwc:organismRemarks can for instance be used to store free text descriptions from the field book about the organism under observation, as is also mentioned at the end of section 4.1. Another reason for this distinction is that salient named entities can be pulled out of the text more easily by annotators, and eventually by automated processes.

During the annotation process, a user first links a ROI to a class c from the set of explicit classes C = {c_1, c_2, ..., c_n} of the application ontology. In figure 7 this is the ncit:C20189, or property or attribute, class. The user then specifies a predicate p from the set of predicates P = {p_1, p_2, ..., p_n}, although this is only required in the case where multiple predicates are possible, such as with the class foaf:Person. We argue, however, that it makes the annotation process more transparent and thus less error-prone. The predicates are displayed in a readable way, e.g., Measures or describes: property or attribute, as visible in figure 7, or for instance Additional occurrence recorded at: location. When a class and predicate are specified, optional metadata fields appear, such as the uberon: IRI in the case of an anatomical entity.
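The rule that a predicate only needs to be chosen when a class admits more than one can be expressed as a small lookup. The dictionary below lists only predicates that appear in the text and in table 2; its layout and the function choose_predicate are assumptions for illustration and do not reflect the tool's internal data structures.

```python
# Predicates offered per explicit class (partial, illustrative mapping).
PREDICATES = {
    "foaf:Person": ["nhc:scientificNameAuthorship", "dwciri:recordedBy",
                    "dwciri:identifiedBy"],
    "ncit:C20189": ["nhc:measuresOrDescribes"],
}


def choose_predicate(explicit_class, chosen=None):
    """Return the predicate for a ROI, requiring a choice only when ambiguous."""
    options = PREDICATES[explicit_class]
    if len(options) == 1:
        # Only one predicate is possible, so no user input is needed.
        return options[0]
    if chosen not in options:
        raise ValueError(f"{explicit_class} needs one of {options}")
    return chosen
```

For a person, the annotator therefore has to state whether the name denotes the author of a scientific name, the recorder of the occurrence or the identifier of the organism.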

To create connections between all entities from the model that belong to one occurrence record, every time an instance with a dwc:Taxon type is annotated, the entire base model, excluding the measurements, is instantiated together with their semantic connections as visible in figure 3. As instances of these classes, unique identifiers are created such as nc:identification1 or nc:date1. Even if entities are missing, IRIs exist but remain without a label until they are annotated by the user. More information about the SFB-Annotator and the annotation procedure can be found online.²⁰
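The instantiation of the base model for one occurrence record can be sketched as below. The property names are those that appear in figure 3, but the direction of each edge, the numbering scheme and the function instantiate_record are assumptions made for this illustration.

```python
def instantiate_record(n):
    """Create the linked IRIs of one occurrence record, numbered n."""
    taxon = f"nc:taxon{n}"
    identification = f"nc:identification{n}"
    organism = f"nc:organism{n}"
    occurrence = f"nc:occurrence{n}"
    event = f"nc:event{n}"
    location = f"nc:location{n}"
    date = f"nc:date{n}"
    # No rdfs:label triples are created: labels follow when the user annotates.
    return [
        (identification, "dwciri:toTaxon", taxon),
        (organism, "dsw:hasIdentification", identification),
        (organism, "dsw:hasOccurrence", occurrence),
        (occurrence, "dsw:atEvent", event),
        (event, "dsw:locatedAt", location),
        (event, "nhc:verbatimDate", date),
    ]
```

Since every node exists from the start, a later annotation of, say, the location only has to attach a label and an annotation object to an IRI that is already connected to the rest of the record.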

4.3. Towards semi-automated annotation

As a first step towards semi-automated annotation, we pre-populated the triple store with domain knowledge concerning the collection, such as locations and names of researchers that participated in the expeditions. This contextual knowledge can aid annotators with the annotation process, using autocomplete to retrieve candidate instances such as <http://viaf.org/viaf/69703180/>, the VIAF record for Coenraad Jacob Temminck. The user can choose a candidate instance d ∈ X, where X is the instance space. If no instance exists yet, or if it is an implicit instance such as one from the organism class, a random IRI is created.
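A minimal sketch of such candidate retrieval is given below. The two labels and VIAF IRIs are taken from the text; the flat dictionary, the word-prefix matching in candidates and the UUID-based scheme in new_instance_iri are assumptions that stand in for a query against the pre-populated triple store.

```python
import uuid

# Instances assumed to be pre-loaded into the store.
INSTANCES = {
    "Coenraad Jacob Temminck": "http://viaf.org/viaf/69703180/",
    "Heinrich Kuhl": "http://viaf.org/viaf/45106482/",
}


def candidates(typed):
    """Return (label, IRI) pairs with a word in the label starting with the typed text."""
    typed = typed.lower()
    return sorted(
        (label, iri) for label, iri in INSTANCES.items()
        if any(word.lower().startswith(typed) for word in label.split())
    )


def new_instance_iri(namespace="http://makingsense.liacs.nl/rdf/nc#"):
    """Mint a random IRI for an entity that has no candidate instance yet."""
    return namespace + uuid.uuid4().hex
```

Typing an abbreviation from the field book, such as Tem, is then enough to offer the matching VIAF record to the annotator.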

²⁰ https://github.com/lisestork/SFB-Annotator

5. Qualitative Evaluation

Together with a domain expert from the field of natural history, one of the field books from the collection of the Committee for Natural History, named ‘Manuscripten van de leden der Natuurkundige commissie: Mammalien, van Kuhl’, was semantically annotated using the Semantic Field Book (SFB) Annotator. This book contains observation records of species from three different orders: the order Chiropterae, or bats; the order Quadrumana, Latin for the four-handed ones and referring to the apes; and lastly the order Falculatae, a historical order referring to a collection of mammals such as the shrew, the badger and the bear. The coming sections qualitatively evaluate the annotation process, the resulting data and the possibilities for querying, using the concepts and questions composed by the domain experts mentioned in section 3.3.

Figure 8: A page from the annotated field book describing the species Titthaecheilos javanicus Nobis. Pteropus titthaecheilus Tem (upper right corner) is believed to have been added later in Leiden by Coenraad Jacob Temminck, <http://viaf.org/viaf/69703180>, a Dutch zoologist and museum director. The written annotation is thus an additional identification of the observed organism, resulting in the triple: nc:organism1 nhc:additionalIdentification nc:taxon2. Collection Naturalis Biodiversity Center, MMNAT01 AF NNM001001033 013. Image free of known restrictions under copyright law (Public Domain Mark 1.0).

5.1. The annotation process

Annotating a page from the field book using the Semantic Field Book Annotator took approximately 1 to 10 minutes, depending upon the number of named entities on the page and the difficulty of interpreting a named entity. Taxonomical names such as the one in figure 8, Titthaecheilos javanicus, can be difficult to read, and sometimes the order of pages is shuffled, hampering the correct interpretation of links between entities. At other times, however, a page contains only one or two easy to read named entities of which the relation is clearly defined. Also, the layout of the document hints at the location of the named entities. Taxonomical names, scientific publishers of names and locations are likely to appear at the top of the page.

As the time spent annotating a named entity largely depends upon its readability and interpretability, we argue that the biggest difference between our approach and the one in figure 5 is the omission of one processing step. Where other approaches first transcribe the entire text and then look for named entities to be semantically enriched, we omit the first step and directly search for named entities to be enriched. Consequently, this results in faster processing of the field books into a knowledge base.


5.2. The data

From the annotated field book, 98 single pages²¹ were semantically annotated and their annotations validated by a natural history expert. Table 2 shows the number of named entities that were extracted from the field book pages, the size of the triple store, and the per-page, per-class and notable per-predicate statistics.

Table 2: Annotation specifications

Total annotations
Pages   Size (MB)   Observ. records   NEs   Triples   NEs per page (µ)   NEs per page (σ)
98      1.5         34                371   9921      5                  2.8

Annotations per class
Class                    n
dwc:Taxon                52
foaf:Person              47
dcterms:Location         15
dwc:MeasurementOrFact    13
nhc:Date                 6
uberon:0001062           160
ncit:C20189              28
Total                    371

Predicate specifics
Object          Predicate                       n
foaf:Person     nhc:scientificNameAuthorship    41
foaf:Person     dwciri:recordedBy               35
foaf:Person     dwciri:identifiedBy             39
dwc:Organism    nhc:additionalOccurrence        3
dwc:Organism    nhc:additionalIdentification    15

In the case that a named entity is absent in a linked observation record, for instance if an annotator omitted the annotation of a named entity, querying the data is not hampered and can even, together with graphic visualisations of the data, help control the data quality. When a named entity is not annotated, for instance the location where the organism was spotted, the IRI exists, as mentioned at the end of section 4.2, but remains without a label and without a link to an annotation object and a ROI. Observation records of which the location is absent or not yet annotated can be found by querying the knowledge base for locations without a label or annotation.
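The quality check described above amounts to a set difference between instantiated and labelled instances. The sketch below shows the idea on an in-memory list of triples; the function unlabelled and the sample data are hypothetical, and in practice the same check would be run as a query against the knowledge base.

```python
def unlabelled(triples, rdf_class):
    """Return instances of rdf_class that have no rdfs:label triple."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == rdf_class}
    labelled = {s for s, p, o in triples if p == "rdfs:label"}
    return sorted(instances - labelled)


triples = [
    ("nc:location1", "rdf:type", "dwc:Location"),
    ("nc:location1", "rdfs:label", "Buitenzorg"),
    ("nc:location2", "rdf:type", "dwc:Location"),  # instantiated, never annotated
]
```

The resulting list points a curator directly to the observation records that still lack a location.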

5.3. Semantic Queries

The evaluation in section 3.3 resulted in a list containing 53 research questions. Of these, 18 were from biologists, 28 from cultural historians and 7 from information specialists. Here we evaluate, using the annotated data, which questions are common in terms of search requirements, determine if and how the questions can be answered using the NHC-Ontology and demonstrate the gain in comparison to full-text search.

²¹ During the digitisation process, the field notes were scanned two pages at a time. One page here represents one physical page containing text, rather than one digital image.

5.3.1. Domain experts’ questions

To estimate the nature of common research questions, the questions were grouped together on the basis of types of named entities. The most common questions were: a question combining a type of resource and a person name, e.g., ‘Show me all field notes from person X’, and a question combining the person class and a taxon name, e.g., ‘Did specific naturalists have a specialisation such as plants or animals?’. The entities used in the queries were all covered by the model, except for some more specific person classes such as local helpers or illustrators. Of the 53 questions, 7 did not relate to the content of the field books and were therefore excluded from the question set. They could potentially be addressed with other parts of the archive. For instance, ‘How was a day organised’ relates to the field observation practices, something that is more likely to be found in the diaries within the archive. Another example is ‘are there letters from person X to person Y in the collection?’. Such a question could be answered by querying the collection for both person X and Y, making use of their IRIs to overcome name ambiguity. Both diaries and letters are however beyond the scope of this paper.

Four of the questions related specifically to specimens and their preservation. Although we did not annotate specimens, the semantic model does allow this type of query. The label of a physical specimen or its digital image can also be used for semantic annotation, as mentioned in 3.4.2. The class dwc:PreservedSpecimen is then used instead of dwc:HumanObservation.

For clarification, a distinction is made between six types of queries, see table 3. The table includes a count of how often each type of question occurred in the question set. ‘Which’ and ‘Where’ questions were often seen as entity retrieval tasks, except in the case of ‘which page’ or ‘where in the archive’, and open questions were seen as document retrieval tasks. Closed questions that can be answered with a ‘yes’ or ‘no’ were also seen as document retrieval tasks, as these are usually questions that require further inspection of a document. For both query variants, queries were evaluated with regard to the relevance of the search results and whether extra effort is required from the user after retrieval.

Table 3: Types of expert queries

Query type                                                    Count
T1: “All documents containing keyword k.”                     1
T2: “All documents matching structure s.”                     18
T3: “All documents matching structure s and keyword k.”       7
T4: “All entities containing keyword k.”                      0
T5: “All entities matching structure s.”                      7
T6: “All entities matching structure s and keyword k.”        13

5.3.2. Structured vs. full-text queries

Where structured query languages such as SPARQL are better at querying the structure of the data, full-text queries are better at querying the content [25]. Here, we demonstrate that in the case of field books, structured or hybrid queries [4] using the NHC-Ontology are able to provide more relevant query results than full-text queries.

Table 4: Example queries for cultural history and biology research

Cultural History

[Q1] How were species collected by Heinrich Kuhl, viaf:45106482?

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/>
    PREFIX dsw: <http://purl.org/dsw/>
    PREFIX viaf: <http://viaf.org/viaf/>
    PREFIX oa: <http://www.w3.org/ns/oa#>
    SELECT ?label ?page WHERE {
      ?identification dwciri:toTaxon ?taxon .
      ?taxon rdfs:label ?label .
      ?organism dsw:hasIdentification ?identification .
      ?occurrence dwciri:recordedBy viaf:45106482 .
      ?occurrence dsw:hasEvidence ?observationRecord .
      ?anno oa:hasBody ?observationRecord .
      ?anno oa:hasTarget ?page }

[Q2] How were habitats described in the collection between 1820 and 1821?

    PREFIX nhc: <http://makingsense.liacs.nl/rdf/nhc/>
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    PREFIX dsw: <http://purl.org/dsw/>
    PREFIX oa: <http://www.w3.org/ns/oa#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?page ?label WHERE {
      ?event dwc:year ?year
      FILTER (?year >= 1820) .
      FILTER (?year <= 1821) .
      ?event nhc:verbatimEventDate ?date .
      ?date rdfs:label ?label .
      ?event dsw:eventOf ?occurrence .
      ?occurrence dsw:hasEvidence ?observationRecord .
      ?anno oa:hasBody ?observationRecord .
      ?anno oa:hasTarget ?page }

Biology

[Q3] Which chiroptera species were collected by Heinrich Kuhl, viaf:45106482, on Java?

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX nhc: <http://makingsense.liacs.nl/rdf/nhc/>
    PREFIX nc: <http://makingsense.liacs.nl/rdf/nc#>
    PREFIX dwc: <http://rs.tdwg.org/dwc/>
    PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/>
    PREFIX dsw: <http://purl.org/dsw/>
    PREFIX viaf: <http://viaf.org/viaf/>
    PREFIX oa: <http://www.w3.org/ns/oa#>
    PREFIX gn: <http://www.geonames.org/ontology#>
    SELECT DISTINCT ?label WHERE {
      ?taxon rdfs:label ?label .
      ?taxon nhc:taxonRank nc:species .
      ?taxon nhc:belongsToTaxon ?order .
      ?order rdfs:label ?Chiropterae .
      FILTER regex(?Chiropterae, "Chiropterae") .
      ?identification dwciri:toTaxon ?taxon .
      ?organism dsw:hasIdentification ?identification .
      ?occurrence dsw:occurrenceOf ?organism .
      ?occurrence dwciri:recordedBy viaf:45106482 .
      ?occurrence dsw:atEvent ?event .
      ?event dsw:locatedAt ?location .
      ?location dwciri:inDescribedPlace ?place .
      ?place gn:parentFeature ?parent .
      ?parent gn:alternateName ?name
      FILTER regex(str(?name), "Java", "i") }

[Q4] Which anatomical entities were used for the classification of the genus Pteropus?

    PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/>
    PREFIX dsw: <http://purl.org/dsw/>
    PREFIX uberon: <http://purl.obolibrary.org/obo/>
    PREFIX ncit: <http://identifiers.org/ncit/>
    PREFIX nhc: <http://makingsense.liacs.nl/rdf/nhc/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT DISTINCT ?label2 ?uberon
    WHERE {
      ?identification dwciri:toTaxon ?taxon .
      ?taxon rdfs:label ?label
      FILTER regex(?label, "Pteropus")
      ?identification dsw:isBasedOn ?token .
      ?token dsw:hasDerivative ?measurement .
      ?measurement nhc:measuresOrDescribes ?anatomy .
      ?anatomy rdfs:label ?label2 .
      ?anatomy rdf:type ?uberon .
      ?uberon rdfs:subClassOf uberon:UBERON_0001062 }

It is notable from table 3 that few questions involved simple keyword searches. The only question that can be answered directly using a keyword is: ‘show me all resources (lists, drawings and observations) concerning a specific species k’, k being the keyword, as no limit is imposed on the type of resource that should be retrieved. For 5 of the questions of type T3, full-text search can also provide an answer, although not directly. Examples are the following questions: ‘What did person k find?’ or ‘Which drawings were made by person k’. However, all resources that in any way relate to person k would be retrieved, thus retrieving irrelevant documents alongside relevant ones.

The most common queries are structured queries retrieving specific documents (T2), such as ‘Show me all drawings with a head of a fish’, and hybrid queries retrieving named entities (T6), such as ‘Which anatomical entities were used for the classification of the family Pteropodidae’. When transformed to hybrid queries, 25 out of 46 queries will provide a direct answer to the original question. For the remaining 21 of 46 queries, document pages are presented to the user that will likely contain an answer to their question, an example being: ‘How were habitats described in the collection between dd-mm-yyyy and dd-mm-yyyy?’. The semantic query can point a user to the pages that adhere to these date restrictions, but the user will have to inspect them to answer his or her question.

Table 4 presents 4 of the 46 questions in SPARQL form. Q1 and Q2 are examples of SPARQL queries that provide an indirect answer to the question, whereas Q3 and Q4 provide a direct answer. More example queries can be found online.²²

We finally argue that, as Virtuoso is equipped with full-text indices that can be queried via SPARQL [16], queries can be formulated as full-text, semantic or hybrid queries. However, as most queries make use of the structure of the data in combination with keywords, making use of semantic queries is beneficial for the retrieval process.

We note that the average user should not be required to write complex SPARQL queries. To take on this problem, methods have been developed that bridge the gap between the Semantic Web and the domain expert users [18, 19, 22]. In our specific case, a query engine will be developed by partners at Brill publishers, collaborators within the Making Sense project.

Although beneficial, the formulation of rich semantic queries is not the main reason for using a semantic model to annotate natural history collections. Of greater interest is the semantic linking of named entities within and between resources, as well as within and across collections. The ontology can be found online, together with the domain experts’ questions, the questions transformed to queries, and a visualisation of one fully linked observation record.23 The semantic annotations can be accessed through a SPARQL endpoint,24 which can be queried using a SPARQL query editor.25 The code for the SFB-Annotator and the annotation guidelines can also be found online,26 and will be updated once newer versions are available.
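Besides a query editor, the endpoint can be queried programmatically. The following is a minimal sketch using only the Python standard library and the standard SPARQL protocol (a GET request with a `query` parameter and a JSON results format). The endpoint URL is the one given in footnote 24; whether it is still reachable is not guaranteed, and the function names are illustrative assumptions rather than part of the SFB-Annotator code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as listed in footnote 24; availability is not guaranteed.
ENDPOINT = "http://makingsense.liacs.nl/rdf4j-server/repositories/NC"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Create a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON result document into one dict per result row."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run_select(query: str, endpoint: str = ENDPOINT, timeout: float = 30.0):
    """Send a SELECT query to the endpoint and return the parsed rows."""
    with urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires network access and a live endpoint.
    for row in run_select("SELECT * WHERE { ?s ?p ?o } LIMIT 5"):
        print(row)
```

Separating request construction and result parsing from the network call keeps both steps testable without a live endpoint.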

6. Discussion and Future Work

In this paper, we presented a semantic model and tool for the semantic annotation of field books. Through the semantic annotation of one field book, we evaluated the model and demonstrated the annotation approach. This approach will eventually lead to a structured dataset constructed from the collection of the Committee for Natural History of the Netherlands Indies, available through a SPARQL endpoint. It is an example of how the content of historical collections in general could be disclosed using semantic annotation.

The qualitative evaluations demonstrated that the application ontology adheres to our requirements and is usable by domain experts both for the process of creating structured annotations as well as answering common research questions. Answers to structured queries will either point users to specific pages, to enable closer inspection of the original text, or provide them with lists or graphical output. However, as the model we propose is centered around the observation and collection of organisms from field books, it currently serves the requirements of the biologists and taxonomists better than the cultural historians. We anticipate that extensions to the model will be required when annotating other artifacts in the collection. Letters and diaries from the collection, for example, describe the economy, villages, cultures and inhabitants of colonial Indonesia, and accompanying drawings depict environmental conditions. A base model for these resources would provide a useful addition to the semantic model we propose.

23 https://github.com/lisestork/NHC-Ontology

24 http://makingsense.liacs.nl/rdf4j-server/repositories/NC

25 An example query editor is the Yasgui editor: http://yasgui.org/, accessed: 30-03-2018

26 https://github.com/lisestork/SFB-Annotator

In our next steps, the usability of the SFB-Annotator will be further improved; we will continue to evaluate the model with a small expert crowd to assess whether the annotation task is well defined and to obtain more accurate annotation time estimates. After that, we will develop methods for semi-automated semantic annotation of field book records. When fully transcribed texts are available, language processing can be used for semi-automated semantic annotation. As we use pixel data instead of text, we require alternative, image-processing methods for salient named entity extraction. Using the output of the annotation process, the system can learn which information is important and where this important information resides in the images [29, 32].

Our final goal within the Making Sense project [37] is to assist the handwriting recognition system MONK [31] with the enrichment of natural history collections. MONK is an adaptive learning system achieving good results on the recognition of text from handwritten collections. Exploiting domain knowledge and the structure of text in natural history collections can potentially aid the recognition process, especially when words have few instances in the archives.

Using automated processes will facilitate efficient enrichment of natural history collections and provide a framework to make sense of complex data, which would aid researchers in natural and cultural history.

Acknowledgement

This work is supported by the Netherlands Organisation for Scientific Research (NWO), grant 652.001.001, and Brill publishers.

Literature

[1] S. J. Baskauf and C. O. Webb. Darwin-sw: Darwin core-based terms for expressing biodiversity data as rdf. Semantic Web, 7(6):629–643, October 2016.

[2] S. J. Baskauf, J. Wieczorek, J. Deck, and C. O. Webb. Lessons learned from adapting the darwin core vocabulary standard for use in rdf. Semantic Web, 7(6):617–627, October 2016.

[3] W. G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44(2):207–212, May 1995.

[4] R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli. Hybrid search: Effectively combining keywords and semantic searches. In S. Bechhofer, M. Hauswirth, J. Hoffmann, and M. Koubarakis, editors, The Semantic Web: Research and Applications, volume 5021 of Lecture Notes in Computer Science, pages 554–568, Berlin, Heidelberg, 2008. Springer.

[5] V. Blagoderov, I. J. Kitching, L. Livermore, T. J. Simonsen, and V. S. Smith. No specimen left behind: industrial scale digitization of natural history collections. ZooKeys, 209:133–146, July 2012.

[6] V. de Boer, M. van Rossum, J. Leinenga, and R. Hoekstra. Dutch ships and sailors linked data. In P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. Knoblock, D. Vrandečić, P. Groth, N. Noy, K. Janowicz, and C. Goble, editors, International Semantic Web Conference (ISWC 2014), volume 8796 of Lecture Notes in Computer Science, pages 229–244, Cham, October 2014. Springer International Publishing.

[7] V. De Boer, J. Wielemaker, J. Van Gent, M. Hildebrand, A. Isaac, J. Van Ossenbruggen, and G. Schreiber. Supporting linked data production for cultural heritage institutes: The amsterdam museum case study. In E. Simperl,
