Academic year: 2021


Semantic Enhancer And Ranking Component Helpers (SEARCH) for data discovery - 1

SEARCH-Enhancer: Ontological query enhancement


Layout: typeset by the author using LaTeX.

Cover illustration: https://commons.wikimedia.org/wiki/File:Globe_icon.svg (globe icon) & https://www.pinclipart.com/pindetail/iiwhxxT_png-file-svg-query-icon-png-clipart/ (top icon). The rest of the illustration is handmade.


Semantic Enhancer And Ranking Component Helpers (SEARCH) for data discovery - 1

SEARCH-Enhancer: Ontological query enhancement

Mitchell Verhaar
11239069

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors: Dr. X. Liao, Dr. Z. Zhao
Multiscale Networked Systems
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

This research project aims to explore the support for cross domain search amongst different data infrastructures. By conducting a gap analysis on the state-of-the-art tool set, a new SEARCH framework is provided to extend its functionality. This framework, which focuses on generating queries with ontological support, is capable of improving cross domain search capabilities. This improvement is demonstrated in the conducted experiment, where the query generation showed significant increases in search efficiency.


Contents

1 Introduction
  1.1 Research Problem

2 State of the Art
  2.1 Related Work
    2.1.1 FAIR principle
    2.1.2 ENVRI: a community for environmental data infrastructures
  2.2 Search tools within catalogues
    2.2.1 CKAN
    2.2.2 GeoNetwork
  2.3 Stand-alone search tools
    2.3.1 Open Semantic Search
    2.3.2 Google Dataset Search
    2.3.3 Gap Analysis

3 Semantic Enhancement and Ranking Component Helpers (SEARCH)
  3.1 Requirements
  3.2 Design Considerations
    3.2.1 Search tool interface Analysis
  3.3 Architecture
    3.3.1 Workflow
  3.4 Technical Considerations
    3.4.1 Current State

4 Implementation
  4.0.1 Usability
  4.0.2 Experiments
  4.0.3 Results

5 Discussion
  5.1 Contributions
  5.2 Limitations

6 Conclusion

7 Future work
  7.0.1 SEARCH
  7.0.2 Tools & Data

8 Appendix


Chapter 1

Introduction

Research activities in many scientific domains are data centric, e.g. in environment and earth sciences. Studies on societal challenges such as climate change or food production require long-term observation data or historical information to model the behavior of the environmental system. In order to conduct such research, an effective data platform is crucial. This data platform should possess data sets from different sources so that scientists can discover and access data from those sources and process them using data applications. In many cases those data are provided by different scientific domains; for instance, to investigate food production, one may need data from agriculture, ecosystems and the atmosphere. Nowadays, those data are often collected and managed by different organizations via certain data infrastructures, such as ICOS [1] for greenhouse gases, ANAEE [2] for agriculture, SeaDataNet [3] for the ocean and EPOS [4] for solid earth.

ENVRI (Environmental Research Infrastructure) is a European community of more than 20 different Research Infrastructures in environmental and Earth sciences. ENVRI-FAIR [5] is an EU project which specifically focuses on data management and interoperability across different infrastructures from the environmental and earth sciences. Within the ENVRI community numerous metadata standards, vocabularies and other uniquely defining traits exist. No common standards or tools that agree with ENVRI are available to unify these different Research Infrastructures and their data catalogues. Finding a generic tool or technique to accompany the platform has been a subject of debate in the ENVRI community in the recent past.

[1] http://icos-ri.eu/
[2] https://www.anaee.com/
[3] https://www.seadatanet.org/
[4] https://www.epos-ip.org/
[5] http://www.envri-fair.eu/


To enable a scientist to effectively discover and access data from different sources, different utilities, such as databases and search tools, form vital components. The data have to be well curated based on certain metadata standards and made findable via services like data catalogues. In many cases, those metadata standards are domain specific and often diverse across different domains. The search tool is often provided by a data infrastructure as part of the catalogue service, and is not designed for searching across different infrastructures. In this project, we review the current state-of-the-art tool set and techniques in data infrastructures, identify the technical requirements for cross domain data discovery and propose a novel tool that tries to bridge the gap.

1.1 Research Problem

Discovering data records that span multiple scientific domains is a difficult process, as is visible in the ENVRI community, where each repository has its own customised tools or methods for archiving and accessing data using specific standards and interfaces. Some of them share certain similarities, e.g. using catalogues for data and services. However, due to the diversity of their domains, including the standards and vocabularies used for metadata, it is not feasible to directly apply a tool from one infrastructure in another. Therefore, providing a data search environment that allows cross domain search could be an extremely useful addition to the available tool set.

The main research question that this project focuses on is: “How can data, provided by different data infrastructures, be discovered based on their current data management status?” We highlight the current data management status as the condition, because it is not feasible to redesign their current catalogues. In order to answer this question, we decompose it into several sub research questions. First, how do the current state-of-the-art catalogue technologies deal with cross domain search? Secondly, what gaps can we identify when using current catalogue technologies to support cross infrastructure search? And thirdly, how can we bridge the identified gaps with a solution? Answering these sub questions should provide a helpful indication of how to implement cross domain search. We expect data search between infrastructures to require novel extensions to the current state of technologies.


This thesis contains 7 chapters following the introduction. The first of these describes the survey that was conducted on the state-of-the-art technologies available right now; in addition to the survey, a gap analysis is described as well. The second chapter describes a potential solution that bridges these gaps. This solution is built in such a way that searching between scientific domains becomes easier. The third chapter describes an experiment with the aforementioned implemented solution. The fourth chapter provides discussion on the results and the contributions this research has made. The fifth chapter contains answers to the research questions. The sixth chapter houses suggestions for future work that this research can inspire. And finally, the seventh chapter contains the appendix, which holds an example of the output of the tool.


Chapter 2

State of the Art

The current field of data management and infrastructure offers a wide range of tools and techniques to enable data to be discovered. The available tool set can be roughly grouped into two categories: building catalogues and searching data. A catalogue manages metadata records of data, services and other assets, and provides interfaces for both users and software to interact with it (Guptill (1999)). Catalogues often provide built-in search functionality, but limited to the content within the catalogue. Dedicated search tools which provide more semantically enriched search capabilities also exist. Most of these tools are open source and provide access to their code on GitHub. In this project, we choose four examples and review their functionality based on several criteria:

• The tool must be used in a variety of existing projects or data environments.

• The tool must possess traits corresponding with either the catalogue set or the semantic set definitions.

• The tool must be available for installation in a local or server dependent version.

CKAN, GeoNetwork, Open Semantic Search and Google Dataset Search are four tools that have been widely used by data infrastructures in environmental and Earth sciences, for example by the government of Australia (CKAN (2020)) or within the ENVRI platform use case. Out of these tools, GeoNetwork and Open Semantic Search were selected as providers for the data sets. GeoNetwork was selected due to the popularity of the tool amongst numerous scientific infrastructures, such as ENVRI or the National Georegister (Government (2020)). Open Semantic Search was selected in the alternative project of this research, particularly because of the semantic search that the tool incorporates.


2.1 Related Work

2.1.1 FAIR principle

In order for (meta)data to be of high quality, pre-defined standards have been made with which it should comply. These pre-defined qualities determine the quality of the data itself. According to the research provided by Mark D. Wilkinson et al., these qualities are defined as the FAIR principle. The FAIR principle consists of four key components: Findability, Accessibility, Interoperability and Reusability. Each of these concepts is defined as follows:

• Findability, which implies that:

  1. The (meta)data should be assigned a globally unique and persistent identifier.
  2. Data should be described with rich metadata.
  3. The metadata should clearly and explicitly include the data identifier.
  4. The (meta)data are registered or indexed in a searchable resource.

• Accessibility, which implies that:

  1. (Meta)data are retrievable by their identifier using a standardized communications protocol.
  2. The protocol is open, free and universally implementable.
  3. The protocol allows for an authentication and authorization procedure, where necessary.
  4. The metadata is accessible, even when the data is no longer accessible.

• Interoperability, which implies that:

  1. (Meta)data use a formal, accessible, shared and broadly applicable language for knowledge representation.
  2. (Meta)data use vocabularies that follow FAIR principles.
  3. (Meta)data include qualified references to other (meta)data.

• Reusability, which implies that:

  1. (Meta)data are richly described with a plurality of accurate and relevant attributes.
  2. (Meta)data are released with a clear and accessible data usage license.
  3. (Meta)data are associated with detailed provenance.
  4. (Meta)data meet domain-relevant community standards.

If the above standards are upheld, the data should consist of high-quality assets only. This research into the FAIR principle is extremely relevant for the current research project, because data quality is vital to achieving the research goal. In order for cross domain search to exist or be improved upon, relevant data should exist and conform to the aforementioned principle. If the data provided by a third party does not conform to the standards in the FAIR principle, it would no longer be safe to assume that cross domain search can be achieved with the given data at all, since properties such as the identifiers and open, accessible protocols can no longer be guaranteed. Therefore, this research project makes the assumption that the data it operates on adheres to the FAIR principle. (Nature (2016))

2.1.2 ENVRI: a community for environmental data infrastructures

In the ENVRI community, research infrastructures (RIs) from different environmental and earth science domains aim to provide the best quality of data possible. In the ongoing ENVRI-FAIR project, RIs closely collaborate and work on common standards to improve the interoperability among them. These data standards stem from the FAIR principle, as discussed in the above research.

In a recent survey in ENVRI-FAIR, the discovery was made that most of the RIs only partially complied with the FAIR principles. Especially in terms of findability, accessibility and interoperability, large room for improvement was found. The RIs planned to increase their compliance with the FAIR principles through numerous methods, such as conforming and publishing their metadata to one single catalogue, DataCite, or paying more attention to machine-to-machine interfaces. The findings of the aforementioned study are relevant for the current research project because the data needs to be FAIR compliant in order for cross domain search to be properly achieved. Without a general consensus on data quality, increasing the efficiency of cross domain search would be a daunting task. (Magagna (2020))


2.2 Search tools within catalogues

Catalogues are important services in a data infrastructure to make data and software assets findable by users. A catalogue often provides interfaces for both human users and machines to search content and manage metadata records.

2.2.1 CKAN

CKAN is a catalogue and provides a search tool for data items based on CKAN’s metadata profile. It provides a web interface as well as Application Programming Interface (API) support, to serve both users and applications. Furthermore, CKAN is easily expandable and flexible in sourcing data (CKAN (2020)). The search technology mainly used is SOLR, which is a common search engine for these tools. As visible in the structure chart (Figure 2.1), CKAN has endpoints defined for both normal users and applications. These endpoints allow for communication with the tool using queries or interfaces.

Furthermore, CKAN can create a custom metadata format for storing the data it harvests, allowing a common data format to exist. Another core focus of CKAN is flexibility. CKAN is highly modular and allows extensions to easily be developed and added to it. These extensions range from handling a custom database to integrating geospatial data into the tool. CKAN poses a viable selection as a data provider, primarily because of its efficient data harvesting methods and optional geospatial support. (Apache (2020))
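As a rough illustration of this kind of endpoint, the snippet below builds a request URL for CKAN's Action API `package_search` call using only the Python standard library. The host name is a placeholder assumption, not an endpoint used in this project.

```python
from urllib.parse import urlencode, urljoin

def ckan_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API package_search URL for a keyword query."""
    # CKAN's Action API lives under /api/3/action/; package_search takes
    # a SOLR-style 'q' parameter and a 'rows' result limit.
    params = urlencode({"q": query, "rows": rows})
    return urljoin(base_url, "/api/3/action/package_search") + "?" + params

# Hypothetical host, for illustration only:
url = ckan_search_url("https://demo.ckan.org", "ocean temperature")
print(url)
# https://demo.ckan.org/api/3/action/package_search?q=ocean+temperature&rows=10
```

A client would fetch this URL (e.g. with the requests module) and read the matching metadata records from the JSON response.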


2.2.2 GeoNetwork

GeoNetwork is another catalogue technology which provides search functionality. It handles spatially referenced resources and provides an interactive web map viewer for this data. It supports a wide range of standards and file formats, including the metadata standards from the International Organization for Standardization (OSGeo (2020)). GeoNetwork is powered by Elasticsearch, which is another commonly used search technology for these tools (Elastic (2020)). The structure diagram 2.2 highlights the search technology integration and the endpoints that both users and APIs connect to. The reason for choosing GeoNetwork as a viable tool is that GeoNetwork natively supports geospatial data using the ISO 19139 format. This trait forms a vital component in most of the data sets within ENVRI, since nearly all of them contain geospatial data.


2.3 Stand-alone search tools

2.3.1 Open Semantic Search

Open Semantic Search (OSS) is a semantic search tool. It is a flexible tool and supports different standards and file formats (Search (2020)). Furthermore, OSS supports two different search engines, SOLR and Elasticsearch, as visible in the structure diagram 2.3. The use of both engines allows the tool to incorporate semantic search as well as regular search into one platform. For this reason, using OSS to test the ranking evaluation done in the side part of the project is a feasible proposal. The caveat of OSS is that it does not support any data management: because OSS does not store any data itself, it is purely an independent search tool.


2.3.2 Google Dataset Search

Google Dataset Search is a search engine operated by Google. This tool uses and advocates the schema.org metadata formats (Google (2015)). This metadata format is adopted by several corporations such as Google, Microsoft and Yahoo. The giant pool of available resources and customers makes using this format beneficial. Nevertheless, the lack of flexibility as well as adaptability makes the tool itself irrelevant for this project. The structure diagram 2.4 only provides marginal information about the backend of the tool, which is not enough to recommend the engine for further selection. The search engine is, however, powered by the same technology as the Google Search Engine, which relies on the PageRank algorithm to index and rank the resulting webpages. This algorithm is known for its effectiveness in providing search functionality. (Google (2020))


2.3.3 Gap Analysis

Based on the analysis of the different tools, it is clear that cross domain search still forms a complex problem. Nearly every individual tool has its own implementation for dealing with data, and they all differ from one another. Examples of this are GeoNetwork’s Q Search API, which provides an endpoint for automated querying, and CKAN’s Action API. Both these APIs provide endpoints for accessing the data sets that the tools house within them. However, these tools use different approaches for enabling these interactions, which do not have immediate interoperability.

Moreover, we can clearly see the lack of compliance with the FAIR guiding principles. If the FAIR principles are not adhered to correctly, numerous features that would otherwise support cross domain search cannot be guaranteed. This has been mentioned in the related research by Magagna (Magagna (2020)) in ENVRI-FAIR: repositories such as EISCAT and their sub domains were unreachable, LifeWatch and SeaDataNet provide only limited file formats, etc. Consequently, because these principles serve as relevant conditions, cross domain search becomes limited to data sets that form exceptions to this issue.

By analysing the possible gaps and limitations in the current state-of-the-art environment in data searching, it became apparent that reforms or additions to this environment are needed in order to support cross domain search. However, due to the diverse implementations from the different data infrastructures and their usage by current communities, advocating change to all these entities poses a complex problem. Additionally, trying to alter an existing search tool to extend its functionality is not an option either, because the analysed search tools do not provide easy access to their source code; it is often complex and difficult to gain a foothold when developing new solutions for them. Therefore, a feasible approach is to develop a technique that functions on a higher level, evading the gaps that currently exist. By proposing a novel technique based on pre-search processing, the search tools as well as the data providers themselves remain flexible.


Chapter 3

Semantic Enhancement and Ranking Component Helpers (SEARCH)

We propose a novel Semantic Enhancement and Ranking Component Helpers (SEARCH) solution to the current flow of search queries. This solution consists of three components: a novel Ontological Query Enhancer (OQE) component, a Re-Ranking component and an interface to a search tool (either from a catalogue or self-contained). The search tool component slot can be filled by any generic search tool. This specific research focuses on the proposal and implementation of the Ontological Query Enhancer.

The Re-Ranker is a proposed tool that ranks the output from the search tool component. This ranking is enhanced through a machine-learning based approach that incorporates user feedback. Using this approach enables the search extension to learn from user feedback on the results of the general pipeline (Gonggrijp (2020)).

The Ontological Query Enhancer (OQE) is proposed to enhance user queries. The core goal of the OQE is to extend user-provided input queries by using domain ontologies. By integrating and processing ontologies, keywords provided by the user can be enhanced so that the results returned by the connected search tool may have increased relevancy and accuracy. Nonetheless, the proposed tool comes with certain dependencies that need to be in place in order to provide the correct functionality. Therefore, a number of requirements are imposed on using this tool.
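The core idea of the OQE, expanding user keywords with ontologically related terms before the query reaches the search tool, can be sketched with a toy example. The relation table below is invented for illustration; the actual tool derives related concepts from loaded domain ontologies.

```python
# Toy stand-in for an ontology: maps a concept to related concepts.
# In the real OQE these relations come from parsed domain ontologies.
RELATED = {
    "ocean": ["sea", "marine"],
    "land": ["soil", "terrain"],
}

def enhance(keywords):
    """Expand each user keyword with its ontologically related concepts."""
    enhanced = []
    for kw in keywords:
        enhanced.append(kw)
        enhanced.extend(RELATED.get(kw, []))  # unknown keywords pass through
    return enhanced

print(enhance(["ocean", "land"]))
# ['ocean', 'sea', 'marine', 'land', 'soil', 'terrain']
```

The enhanced keyword list is then handed to the search tool, which can match records that mention "sea" or "marine" even though the user only typed "ocean".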


3.1 Requirements

To enable cross domain search, the SEARCH tool and its Ontological Query Enhancer have to meet a number of requirements:

• The infrastructures should provide endpoints for a component at a higher hierarchical level to work with.

• The data hosted by these infrastructures should adhere to the FAIR principles as closely as possible.

• The technique used should be flexible in its communication with infrastructures.

In addition to these requirements for searching across infrastructures, the proposed OQE comes with a list of requirements as well. These constraints are mainly imposed on the working environment of the tool to ensure effectiveness.

• API endpoints are needed for the generator to connect to. Without this connection, the tool cannot send the query off to the next component in the search extension pipeline.

• At least one ontology should be accessible to the generator. The ontologies provided to the generator can be retrieved from a hyperlink, from a remote storage location, or from a local file, in RDF, XML or NTriples format.

• The generator requires at least Python 3 to be installed on the machine it runs on. Some parts of the code depend on Python 3 packages.
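These environment constraints lend themselves to a simple startup check. The helper below is a hypothetical sketch, not part of the actual tool: it verifies the Python version and that each ontology source is either a URL or an existing local file with one of the assumed accepted extensions.

```python
import sys
from pathlib import Path

# Assumed file extensions for the accepted RDF/XML/NTriples formats.
ACCEPTED_SUFFIXES = {".rdf", ".xml", ".nt", ".owl"}

def check_environment(ontology_sources):
    """Return a list of problems with the runtime environment; empty if OK."""
    problems = []
    if sys.version_info < (3,):
        problems.append("Python 3 is required")
    if not ontology_sources:
        problems.append("at least one ontology must be provided")
    for src in ontology_sources:
        if src.startswith(("http://", "https://")):
            continue  # remote ontology, resolved at load time
        path = Path(src)
        if not path.exists() or path.suffix not in ACCEPTED_SUFFIXES:
            problems.append(f"cannot use ontology source: {src}")
    return problems

print(check_environment([]))  # ['at least one ontology must be provided']
```

Running such a check before the enhancer starts avoids failing halfway through a query because an ontology was missing.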


3.2 Design Considerations

The OQE is specifically designed to bridge the gap that other conventional search tools leave open. During the survey of the newest tool set, we analysed the workflows of these tools as well. Visualising these workflows provides insight into how the enhancer should be designed. We provide an overview of the workflows of each of the selected tools below.

3.2.1 Search tool interface Analysis

CKAN

CKAN’s core functionality stems from its capability to harvest and convert the metadata it finds to a custom format. By employing its Remote Procedure Call style API, it can source data sets from the web and add them to the internal catalogue. This catalogue is periodically indexed by the search engine SOLR. When the user queries the tool for data, the query is referred through CKAN’s internal flow straight to the SOLR search engine to find results. This flow of information is visible in the workflow diagram 3.1. The main problem with this design is that every metadata record has to be converted to a custom format, leaving metadata that cannot be converted out of the indexed resources. Avoiding this problem would require a common standard to be adopted by all infrastructures, which is not a feasible approach to the issue.


GeoNetwork

GeoNetwork’s flexible design allows geospatial resources to be easily accessible from its internal catalogue. The tool scrapes foreign data sets for records matching internally stored file formats and collects the metadata into its own catalogue. When the user queries for certain data, the search engine Lucene allows results based on geolocation to be returned as well, as is visible in workflow chart 3.2. The problem when dealing with multiple domains is that metadata is only accepted in specific file formats, making widespread adoption of data impossible. Avoiding this problem would require a vast collection of internal file formats to be stored.


Open Semantic Search

Open Semantic Search (OSS) is designed not to store any content but rather to be employed as a search tool capable of handling semantics. As workflow chart 3.3 reflects, the tool’s internal environment lacks any data management components. The advantage of not being dependent on internal storage is that the tool can be employed on pre-existing data infrastructures to provide semantic search. However, the downside is that linking the tool to an infrastructure requires quite some work to encode the connection properly. The fact that OSS lacks a proper API to handle these endpoints makes this task even more arduous.

Figure 3.3: Open Semantic Search’s Workflow

The analysis of these tool designs provided substantial insight into what the architecture of the proposed search extension should look like, and has thus been taken into consideration during the design of the OQE. The key insights extracted from this analysis are listed below:

• The tool should be independent of any infrastructure-bound restrictions.

• The tool should be implemented as an addition to the search system already in place.

• The tool should be capable of communicating the results forward to the endpoints offered by its environment.


3.3 Architecture

The OQE is written in Python and designed as four separate components, each with its own functionality and flow of information. With the aid of the structure diagram in Figure 3.4, we highlight and explain these individual components in the order of their respective operations below:

1. The ontology loader, which is the first active component. It loads the given ontologies and stores them in self-contained variables for later use. If no ontology is provided, the program displays an info message stating that they need to be provided correctly.

2. The input processor, which is the component responsible for requesting and managing the user input. This input is processed and, if the optional flag is set to true, punctuation is removed from the input as well.

3. The ontology parser, which is the component responsible for processing the loaded ontologies and finding the concepts. It searches through each ontology’s classes and individuals to find all concepts that relate to the given concept through any arbitrary relation.

4. The query manager, which is the component responsible for creating and managing the queries that are sent off to external search tools, the results of which are directed forward in the pipeline.
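A minimal sketch of how these four components could fit together in a single class is given below. The triple-based ontology stand-in and all names are illustrative assumptions; the actual implementation relies on OwlReady2 and real ontology files.

```python
class OntologicalQueryEnhancer:
    """Illustrative skeleton of the four OQE components."""

    def __init__(self, ontology_sources):
        self.sources = ontology_sources
        self.triples = []  # (subject, relation, object) tuples

    def load_ontologies(self):
        # 1. Ontology loader: here the 'ontologies' are given directly as
        # lists of triples; the real tool loads RDF/XML/NTriples files.
        for source in self.sources:
            self.triples.extend(source)

    def process_input(self, raw):
        # 2. Input processor: comma-separated keywords, lowercased.
        return [kw.strip().lower() for kw in raw.split(",") if kw.strip()]

    def related_concepts(self, concept):
        # 3. Ontology parser: every concept linked to the given one
        # through any relation, in either direction.
        related = [o for s, _, o in self.triples if s == concept]
        related += [s for s, _, o in self.triples if o == concept]
        return related

    def build_query(self, raw):
        # 4. Query manager: original keywords plus related concepts,
        # joined into one query string for the external search tool.
        terms = []
        for kw in self.process_input(raw):
            terms.append(kw)
            terms.extend(self.related_concepts(kw))
        return " OR ".join(terms)

toy_ontology = [("ocean", "relatedTo", "sea")]
oqe = OntologicalQueryEnhancer([toy_ontology])
oqe.load_ontologies()
print(oqe.build_query("Ocean"))  # ocean OR sea
```

The "OR" joining is one possible choice of query syntax; what the query manager actually emits depends on the search tool slotted into the pipeline.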


3.3.1 Workflow

The workflow of the entire SEARCH framework is visible in Figure 3.5. Colored in green is a visualisation of the conventional user query flow. In this flow, the user-generated input is immediately fed to the search tool through the web interface, and the results from the search tool are directed back to the user through the same interface. Colored in red is the proposed SEARCH framework’s flow of user queries. The main difference between the two flows is the incorporation of the OQE, which enhances the input before it is given to the search tool. This addition to the pipeline provides more accurate queries than the user would normally think of. Because of this, more accurate results can be retrieved from the data, which in turn increases user satisfaction. The Re-Ranker is capable of fine-tuning the results even further, so that conventional search and retrieval methods are significantly outperformed.

Figure 3.5: Visual comparison between standard workflows and the workflow for the SEARCH framework


3.4 Technical Considerations

These design considerations and the architecture of the system itself come bundled with some technical considerations as well. Each of these technical considerations is listed below:

1. It is recommended to host the SEARCH framework on a server in order for it to run efficiently. The reason for this is that most search tools that can fill the component slot in the system require a server installation to work properly.

2. The search tool itself needs to follow the guidelines that have been mentioned earlier in the thesis (FAIR principles, selection criteria etc.). By following these imposed constraints the search component will function as desired.

3. The considerations taken for the OQE are listed in the subset below:

• The tool relies on a Python 3 installation in order to run. It is written in a Jupyter Notebook, a Python tool that executes Python code on a local web server (Jupyter (2014)).

• The tool requires the following Python packages to be available in the runtime environment:

  – OwlReady2, for accessing and managing the ontologies. It is used to query the ontologies for related concepts and extract them.
  – Natural Language Toolkit (NLTK), specifically the stopwords functionality. This is used to clean the user input of common stopwords.
  – The regex module (re), to perform regex operations on the user input.
  – The string package, for cleaning the user input of punctuation as well.
  – The requests module, to send the URLs containing the queries to the external sources and retrieve the results.


• The tool should have write access to its environment. This permission is needed to store the XML output from the search components, so that the output can be forwarded to any other external source.

These are some of the technical considerations that should be kept in mind. With the aid of these considerations and the earlier mentioned guidelines, the next section of this thesis describes an implementation of the aforementioned OQE.

3.4.1 Current State

The current implementation of the SEARCH framework comes with a running version of the OQE in addition to a selected search tool for the actual search. The search tool component in this research is GeoNetwork. GeoNetwork is implemented on a virtual server running Intel Xeon E312xx processing units bundled with 16 GB of RAM. GeoNetwork is installed using the source documentation and set up to harvest data from other catalogues. This harvesting process, though limited by the constraints of the data platforms, incorporates third-party data into the tool so that GeoNetwork can search through it.

GeoNetwork is set up to accept third-party querying of the data sets that the tool manages. This querying is done through the ’Q Search’ API that resides within GeoNetwork. This API accepts queries written as parameters of the URL in order to find the data. The query is accepted and processed with the help of the search engine Lucene, which GeoNetwork uses within the repository, so that the correct metadata records corresponding to the user input are returned in either XML or JSON format. These results are retrieved and further processed locally to return to the user as output.
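To illustrate this interaction, the snippet below constructs a Q Search style URL and parses a small XML fragment shaped like such a response, using only the standard library. The host, path and response layout are placeholder assumptions; the real endpoint and fields depend on the GeoNetwork installation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def q_search_url(base, keywords):
    # The query terms are passed as URL parameters; '/srv/eng/q' is a
    # commonly seen path, but it may differ per installation.
    return f"{base}/srv/eng/q?{urlencode({'any': ' or '.join(keywords)})}"

url = q_search_url("http://localhost:8080/geonetwork", ["ocean", "sea"])
print(url)
# http://localhost:8080/geonetwork/srv/eng/q?any=ocean+or+sea

# A made-up fragment in the general shape of a Q Search XML response:
response = """<response>
  <metadata><title>Sea surface temperature</title></metadata>
  <metadata><title>Ocean salinity</title></metadata>
</response>"""

titles = [m.findtext("title") for m in ET.fromstring(response).findall("metadata")]
print(titles)  # ['Sea surface temperature', 'Ocean salinity']
```

In the actual pipeline the URL would be fetched with the requests module and the returned XML or JSON processed in the same spirit.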

The OQE is implemented on a local system consisting of an Intel Core i7 2.9 GHz processing unit paired with 16 GB of 2133 MHz LPDDR3 RAM. The Jupyter Notebook file that contains the code runs on a locally instantiated Jupyter Notebook web server. The full source code is available in the OQE GitHub repository (Verhaar (2020)).


Chapter 4

Implementation

4.0.1 Usability

This section contains a guide on how to use the OQE tool. It contains instructions on how to operate the tool, what kind of data is transferred between the internal components and what the input and output contain. A full pseudo-code implementation is provided in algorithm 1 below.

To start off, the Jupyter Notebook file 'Ont_proc.ipynb' should be executed on a Jupyter Notebook web server. This opens the file and makes the cells of code visible. Before the code can be run, the class has to be loaded into memory.

The OQE can be started by creating an instance of the class and providing a list of ontologies as input arguments. The ontologies can be given either by hyperlink or as a locally stored file. Once the class instance has been created, the tool can be operated through its class functions.

The load ontologies function takes no arguments and simply loads the ontologies given to the tool at startup into memory. This is not done at initialization in order to give users the freedom to choose which ontologies are imported into the tool. The pseudo code implementation below shows how the function is implemented.

The search function can be called either with no arguments or with a string as input. The process input function is nested within the search function itself and behaves as follows: with no arguments, the processing function asks for user input to process; with a string argument, it processes the given string.
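The startup sequence and calling conventions described above can be illustrated with a minimal sketch. The class and method names mirror the description in the text, but every signature and body here is an assumption rather than the actual OQE implementation:

```python
class OQE:
    """Minimal interface sketch of the enhancer described above;
    all signatures and bodies are assumptions, not the real tool."""

    def __init__(self, ontologies, strip_punctuation=True):
        self.ontologies = ontologies          # hyperlinks or local file paths
        self.strip_punctuation = strip_punctuation
        self.loaded = []

    def load_ontologies(self):
        # The real tool parses each ontology here; this sketch only
        # records which sources were requested.
        self.loaded = list(self.ontologies)

    def search(self, user_input=None):
        if user_input is None:                # no argument: prompt the user
            user_input = input("Query: ")
        # Real concept extraction happens here; this sketch just splits
        # the comma-separated keywords.
        return [term.strip() for term in user_input.split(",")]

oqe = OQE(["http://purl.obolibrary.org/obo/envo.owl", "geo.owl"])
oqe.load_ontologies()
terms = oqe.search("Ocean,Land")  # → ['Ocean', 'Land']
```

Calling `search()` without an argument would prompt for input instead, matching the two calling modes described above.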


Either way, the user input is processed so that stop words, and optionally punctuation (enabled by setting the corresponding flag to True at initialization), are removed. Do note that the processed string is split on commas. Therefore, keywords should be separated by a single comma, with no spacing between the letters and the comma itself. An example query would be: Ocean,Land

The output of the process function is used to query the loaded ontologies and find related concepts to return. The pseudo code implementation below shows how the function is implemented.
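The cleaning step above can be sketched as follows. The stop word list is a small stand-in for the NLTK list the tool actually uses, and the function name is taken from the description rather than the source code:

```python
import string

# Small stand-in for NLTK's English stop word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or"}
# Remove all punctuation except the comma, which delimits keywords.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace(",", ""))

def process_input(raw, strip_punctuation=True):
    """Clean a comma-separated keyword query as described above."""
    if strip_punctuation:
        raw = raw.translate(PUNCT_TABLE)
    keywords = []
    for part in raw.split(","):            # one keyword phrase per comma
        words = [w for w in part.split() if w.lower() not in STOP_WORDS]
        if words:
            keywords.append(" ".join(words))
    return keywords

process_input("the Ocean,body of Water!")  # → ['Ocean', 'body Water']
```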

The process query function can be executed in three different ways:

• The function receives a list of ontological keywords that have been added by the search function. The list is sorted by occurrence.

• The function receives a query string as input. The query is sent to the external search tool immediately.

• The function receives neither a list of ontological keywords nor a query string as input. In this case, if the user previously searched the ontologies with a given input, that input is used to retrieve the results; the query is then only cleaned before use.

All three executions yield a final query. This query is given as input to the send query function. This function establishes a connection with an external search tool to search for data sets. The connection to the search tool is dependent on the tool that is being used. In the current state, the tool being used is GeoNetwork. If a desire to change the tool arises in the future, one could simply edit the class URL variable to a different link and the connection could be established, given that the endpoint of the tool is available. The pseudo code implementation below shows how the function is implemented.
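The send-and-store step, including the swappable class URL variable mentioned above, can be sketched as follows. The endpoint here is an assumed GeoNetwork-style URL, not the actual deployment address:

```python
from urllib.parse import quote_plus
import urllib.request

class QuerySender:
    """Sketch of the send query step; endpoint and storage details
    are assumptions, not the tool's actual implementation."""

    # Swapping the search tool only requires editing this class variable,
    # provided the new tool exposes a comparable endpoint.
    SEARCH_TOOL_URL = "http://example.org/geonetwork/srv/eng/q?any="

    def build_request(self, query):
        return self.SEARCH_TOOL_URL + quote_plus(query)

    def send_query(self, query, outfile="response.xml"):
        with urllib.request.urlopen(self.build_request(query)) as resp:
            data = resp.read()
        with open(outfile, "wb") as f:    # store the raw response locally
            f.write(data)
        return outfile

sender = QuerySender()
request_url = sender.build_request("marine water body ocean")
```

Changing `SEARCH_TOOL_URL` (or subclassing and overriding it) is all that is needed to point the sender at a different tool, mirroring the flexibility described above.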


Algorithm 1 Semantic Query Generator

procedure Initialization                     ▷ This function contains the SQG
    get_Stopwords(NLTK_Package)
    punctuation_Table(NLTK_Package)
    Initialize class variables
    Set search tool URL

procedure load_ontologies                    ▷ This function loads in the ontologies
    for each ontology do
        Retrieve the ontology
        try loading the ontology
        error upon failure

procedure process_input(user_input)          ▷ This function cleans user input of irrelevant words
    for words in user input do
        if punctuation filter enabled then
            Clean input of all punctuation and stop words
        else
            Clean input of stop words

procedure search(user_input)                 ▷ This function extracts concepts from ontologies
    goto process_input(user_input)
    for Term in cleaned user input do
        for Each ontology loaded in do
            Search ontology for term
            if Term exists then
                goto add_concepts(term)
                goto add_count(term)
            else
                Continue with the next term

procedure add_concepts(term)                 ▷ This function extracts related concepts
    for each related concept do
        if related concept is directly related then
            Add concept to storage
            goto add_count(concept)
        if related concept is related through any other relation then
            Add related target to storage
            goto add_count(related target)

procedure process_concepts                   ▷ This function sorts the ontology output
    Sort the ontology concepts by count
    Return the names of these concepts

procedure add_count(term)                    ▷ This function counts and stores concept occurrences
    if Term has already been found before then
        Increment the count
    else
        Add it to the counter with value of 1

procedure send_query(query)                  ▷ This function sends the output of the ontologies to the search tool
    Create URL request
    Connect to server and receive response
    Store the response as an XML file

procedure process_query(keywords)
    if keywords are provided then
        goto send_query(keywords)
    if input has been processed by the search then
        goto send_query(search result)
    else
        goto send_query(cleaned input)

4.0.2 Experiments

This experiment was designed to demonstrate the purpose and effectiveness of the OQE. By designing specialized queries that each retrieve unique sets of results, a proof of concept can be established. The experiment is run on the aforementioned current state of the tool, with the following additions:

• The following queries were used during the experiment (note that the available queries are dependent on the available ontologies and the available data in the search tool):

1. Input query: Ocean
2. Input query: Ocean,Water body
3. Input query: Continent
4. Input query: Ocean,World Ocean,Marine water body,cave
5. Input query: Texel,Island,Sea
6. Input query: Earth,Ocean

• The ontologies imported into the tool during the experiment were selected for their relevance to the data sets. The GeoNetwork implementation contains data sets sourced from the ENVRI infrastructures; these ontologies were therefore deemed to fit the data:

– ENVO, the environment ontology. This ontology provides a vast collection of concepts that revolve around the natural environment (PL (2016); Buttigieg (2013)).

– GEO, the geographical entity ontology. This ontology provides a collection of concepts that revolve around the geosphere (University of Florida (2020)).


The results of those queries are individually stored as XML files, with the file title containing the keywords used in the search. These XML files are structured differently depending on the tool used to perform the search.
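The storage convention above can be sketched as a small helper. The exact naming scheme here is illustrative, not necessarily the one the tool uses:

```python
def result_filename(keywords):
    """Name the stored XML response after the keywords used in the search.

    Assumed scheme: keywords joined by underscores, spaces inside a
    keyword replaced by hyphens.
    """
    return "_".join(k.replace(" ", "-") for k in keywords) + ".xml"

result_filename(["ocean", "water body"])  # → 'ocean_water-body.xml'
```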

4.0.3 Results

Each of the aforementioned input queries was fed to the generator after it had initialized itself and loaded the ontologies. The queries were processed and cleaned, forwarded to the ontologies for concept extraction, and further processed into a query to send to GeoNetwork. Each entry in Table 4.1 below consists of the following:

• The original, unmodified user input query.

• The query as enhanced by the tool.

• The number of concepts added by the tool.

Table 4.1: Query Enhancement Results

1. Original query: Ocean
   Enhanced query: marine water body, saline water body, sea water, ocean
   # of concepts added: 3

2. Original query: Ocean, Water body
   Enhanced query: marine water body, saline water body, sea water, water mass, hydrosphere, water, ocean, water body
   # of concepts added: 6

3. Original query: Continent
   Enhanced query: land mass, continent
   # of concepts added: 1

4. Original query: Ocean, World Ocean, Marine water body, Cave
   Enhanced query: saline water body, sea water, World Ocean, marine pelagic feature, lentic water body, hydroform, marine biome, solid astronomical body part, cave wall, cave floor, closure incomplete, cave, ocean, marine water body
   # of concepts added: 10

5. Original query: Texel, Island, Sea
   Enhanced query: texel, geographic feature, land, sea, marine water body, saline water body, sea water, subcontinental land mass, island
   # of concepts added: 6

6. Original query: Earth, Ocean
   Enhanced query: Earth, terrestrial planet, ocean, marine water body, saline water body, sea water
   # of concepts added: 4

The next set of results, listed in Table 4.2, describes the difference between using the original user query and the enhanced user query. Each entry in this table consists of the following:

• The query set, which holds both the original query and the enhanced query.
• The query responsible for the results.
• The number of results from GeoNetwork.
• The run time in milliseconds for the tool.


Table 4.2: Comparison between the original and the enhanced queries.

Query set 1
  "ocean": 0 results, 134 ms
  "ocean marine water body saline water body sea water": 36 results, 525 ms

Query set 2
  "ocean water body": 36 results, 135 ms
  "ocean marine water body saline water body sea water water body water mass hydrosphere water": 38 results, 507 ms

Query set 3
  "continent": 1 result, 130 ms
  "land mass continent": 24 results, 420 ms

Query set 4
  "ocean world ocean marine water body cave": 36 results, 518 ms
  "saline water body sea water World Ocean marine pelagic feature lentic water body hydroform marine biome cave solid astronomical body part cave wall cave floor closure incomplete ocean marine water body": 41 results, 552 ms

Query set 5
  "texel island sea": 4 results, 130 ms
  "texel geographic feature land sea marine water body saline water body sea water subcontinental land mass island": 42 results, 590 ms

Query set 6
  "earth ocean": 0 results, 140 ms
  "earth terrestrial planet ocean marine water body saline water body sea water"

Figure 4.1 shows the run time compared to the query length. Each set of queries consists of the two queries listed in Table 4.2.

Figure 4.1: Runtime in milliseconds of each set of queries

Another result worth noting is that some concepts are found in multiple ontologies. An example of this phenomenon is the concept 'Ocean', which is found in both the ENVO ontology and the GEO ontology. Hence, the concept is counted twice and, in some cases, moved to the front of the query, because the query is sorted by keyword occurrence.
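This occurrence-based ordering can be reproduced in a few lines. The concept lists here are illustrative values, not actual ontology output:

```python
from collections import Counter

# Concepts extracted per loaded ontology (illustrative values).
envo_concepts = ["ocean", "marine water body", "sea water"]
geo_concepts = ["ocean", "continent"]

# Counters add per-concept occurrences across ontologies.
counts = Counter(envo_concepts) + Counter(geo_concepts)

# 'ocean' occurs in both ontologies, so its count is 2 and it is
# sorted to the front of the enhanced query.
enhanced_query = [concept for concept, _ in counts.most_common()]
```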


Chapter 5

Discussion

The results obtained from the conducted experiment are promising. In all cases, the OQE was able to extract and append new keywords to the original user input. Furthermore, each query that was padded with ontology keywords performed better than the original query did. In some cases, the enhanced query provided search results where the original query returned none. The experiment thus shows that adding an implementation of the OQE to the search environment significantly improves the result yield.

In terms of cross domain search, the tool is capable of crossing the gap between scientific domains. This is especially shown by the incorporation of the two ontologies used in the experiment. They each support a different domain, yet in, for example, query set five the enhanced query contains keywords from both domains and therefore returns significantly more results.

5.1 Contributions

The contributions made to the technological environment become clearly visible when the SEARCH framework and its functionality are compared to the state of the art technologies. This research contributed to the field of data search and retrieval in the following ways:


• The proposed SEARCH framework, which consists of the following components:

– The Ontological Query Enhancer, which is a tool capable of padding a user query with keywords related to the input. These keywords have been extracted from a pre-loaded set of ontologies.

– A search tool component slot, which is a flexible part of the framework that allows search tools to be incorporated into SEARCH.

– The Re-ranker, which is a tool capable of re-ranking the search tool output to better fit the user’s request.

• Extensive research into the state of the art technologies. This research examined the inner structure of these technologies and performed a gap analysis on them to find out what kinds of extensions could be made to support cross domain search.

5.2 Limitations

Some components used during the experiment displayed clear limitations on their efficiency. The main limitations that could have influenced the results are listed below:

• The experiment could have used more data. Up until now, only a select number of data sets was harvested by the search tool. Adding more data sets to the mix would substantiate the results even further.

• The experiment could have used more ontologies. The collection of knowledge bases was not expanded during the experiment due to time and complexity constraints. No further ontologies accepted by the tool were found within the allotted time.

Improving the quantity of the data and knowledge bases is expected to improve the efficiency of the OQE even further. Therefore, expanding these forms of experimentation might be beneficial in future work.


Chapter 6

Conclusion

Improving search across numerous data infrastructures and scientific domains proved to be a challenge. The analysis of the existing tools and techniques revealed that advocating change across all infrastructures would be a daunting task. For this reason, the SEARCH framework is proposed. The framework operates beyond the current data management systems and provides flexibility where needed. The experiment showed that the OQE tool is capable of expanding the user search into multiple domains by enhancing the query. The results demonstrated an increased efficiency of the search altogether; therefore, this framework is recommended for improving search across both scientific and infrastructure domains.

In short, the survey conducted on the state of the art tool set and infrastructures led to the answer to the first sub question defined in this research: the state of the art tool set and infrastructures barely, if at all, support cross domain searching for data.

The second sub question is answered by the gap analysis, which revealed that search across infrastructures is supported only on a limited basis. Catalogue tools use their internal database to store data from multiple infrastructures and thus multiple domains. Standalone search tools use techniques like semantic search to achieve the same cross infrastructural search. However, both options have their shortcomings and do not provide promising results when queried for intersectional search.

A new solution had to be designed to bridge these gaps, which provided the answer to the third sub question. The SEARCH framework is proposed as that solution; it improves search across scientific infrastructures by enhancing the user query before using it for the actual search. On top of that, the results are re-ranked before being displayed to the user, to guarantee accuracy. This framework has been demonstrated to show promising results when used to query for intersectional data.

To sum up, the answers to these sub questions together clarify the answer to the main question. Data provided by different data infrastructures can, given their current data management status, be discovered by allowing the proposed SEARCH framework to enhance the discovery with specialized queries that improve the results. The hypothesis proved correct: a novel extension was needed to achieve the desired result.


Chapter 7

Future work

Because this research provides an initial working implementation of the proposed SEARCH framework, and because none of the analysed tools provides this kind of functionality, conducting this research opened up several possibilities for future work. Each of the following sections describes an area of research that could benefit from further attention.

7.0.1 SEARCH

The SEARCH framework, as proposed, already provides a multitude of functionality when it comes to searching across scientific domains and infrastructures. However, opportunities for further research arise when working with the framework. Some examples of future research are mentioned below.

The functionality of the OQE could be expanded even further. In the current implementation covered in this research, the tool searches for concepts with given arbitrary relations to the input. However, inter-related concepts whose relations consist of multiple sub-concepts in the graph are not taken into the search. Further research could look into providing a separate search algorithm that can navigate the knowledge graph, looking for concepts that are linked to the input through more distant relations.

New research could also dive into the expansion of the tool itself. With the current implementation, the user gets the results returned after the process finishes. However, some form of user feedback would significantly aid in fine-tuning the tool. Having a method of communicating user feedback back to the individual components could improve the tool in general.


7.0.2 Tools & Data

Due to time constraints, harvesting more data sets and incorporating more ontologies into the OQE was deemed infeasible. However, having more data could lead to interesting alterations of the results. Future work could provide experiments where the available pool of data sets and ontologies has been significantly enlarged. Based on an enlarged collection, the tool is expected to perform even better in searching across scientific domains.

Next to sourcing more data and ontologies, analysing more search tools could lead to different future implementations of the framework. In this research, we analysed a set of tools that were deemed useful to the framework. However, an exhaustive search of the entire available tool set has not been conducted. Consequently, other search tools that complement the algorithm even better could exist.


References

Apache (2020). SOLR Search Engine Website. [Online; accessed June 26, 2020]. URL: https://lucene.apache.org/solr/.

Buttigieg (2013). "The environment ontology: contextualising biological and biomedical entities." In: J Biomed Semant 4(1), p. 43.

CKAN (2020). CKAN Website. [Online; accessed June 26, 2020]. URL: https://ckan.org/.

Elastic (2020). Elastic Search Engine Website. [Online; accessed June 26, 2020]. URL: https://www.elastic.co/.

Gonggrijp, Mats (2020). "Learning to Rank for Dataset Search across Catalogues". In: Thesis publications UvA, 2020.

Google (2015). Schema Organization. [Online; accessed June 26, 2020]. URL: https://schema.org/.

— (2020). Google Dataset Search Website. [Online; accessed June 26, 2020]. URL: https://datasetsearch.research.google.com/.

Government, Dutch (2020). National Geo-Register, Netherlands. [Online; accessed June 26, 2020]. URL: https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/home.

Guptill, Stephen C (1999). "Metadata and data catalogues". In: Geographical information systems 2, pp. 677–692.

Jupyter (2014). Jupyter Notebook. [Online; accessed June 26, 2020]. URL: https://jupyter.org/index.html.

Magagna, Barbara (2020). "Requirement Analysis, Technology Review and Gap Analysis of Environmental RIs". In:

Nature (2016). The FAIR Guiding Principles for scientific data management and stewardship. [Online; accessed June 26, 2020]. URL: https://rdcu.be/b3w6k.

OSGeo (2020). GeoNetwork Website. [Online; accessed June 26, 2020]. URL: https://geonetwork-opensource.org/.

PL, Buttigieg (2016). "The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation". In: J Biomed Semant 7(1), p. 57.

Search, Open Semantic (2020). Open Semantic Search Website. [Online; accessed June 26, 2020]. URL: https://www.opensemanticsearch.org/.

University of Florida, UoF (2020). Geographical Entity Ontology. [Online; accessed June 26, 2020]. URL: https://github.com/ufbmi/geographical-entity-ontology/.

Verhaar, Mitchell (2020). Github OQE Repository. [Online; accessed June 26, 2020]. URL: https://github.com/Mitchell-V/OQE/tree/master.


Chapter 8

Appendix

8.1 Example GeoNetwork output

Input query: saline water body, sea water, World Ocean, marine pelagic feature, lentic water body, hydroform, marine biome, cave, solid astronomical body part, cave wall, cave floor, closure incomplete, ocean, marine water body

Listing 8.1: XML Output

<?xml version="1.0" encoding="UTF-8"?>
<response from="1" to="41" selected="0" maxPageSize="100">
  <summary count="41" type="local">
    <dimension name="type" label="types">
      <category value="dataset" label="Dataset" count="38" />
      <category value="series" label="Series" count="3" />
    </dimension>
    <dimension name="mdActions" label="mdActions">
      <category value="mdActions-download" label="mdActions-download" count="2" />
      <category value="mdActions-view" label="mdActions-view" count="2" />
    </dimension>
    <dimension name="topicCat" label="topicCats">
      <category value="structure" label="Structure" count="37" />
      <category value="inlandWaters" label="Inland waters" count="1" />
      <category value="boundaries" label="Boundaries" count="1" />
      <category value="geoscientificInformation" label="Geoscientific information" count="1" />
    </dimension>
    <dimension name="inspireThemeURI" label="inspireThemesURI" />
    <dimension name="keyword" label="keywords">
      <category value="Environmental monitoring facilities" label="Environmental monitoring facilities" count="38" />
      <category value="Structure" label="Structure" count="37" />
      <category value="AnaEE-France" label="AnaEE-France" count="31" />
      <category value="air humidity" label="air humidity" count="22" />
      <category value="air temperature" label="air temperature" count="21" />
      <category value="pluviometry" label="pluviometry" count="19" />
      <category value="soil temperature" label="soil temperature" count="19" />
      <category value="long term monitoring" label="long term monitoring" count="16" />
      <category value="evapotranspiration" label="evapotranspiration" count="16" />
      <category value="CO2 flux" label="CO2 flux" count="15" />
      <category value="global radiation" label="global radiation" count="15" />
      <category value="soil moisture" label="soil moisture" count="15" />
      <category value="wind speed" label="wind speed" count="14" />
      <category value="mass flux in soil vegetation atmosphere system" label="mass flux in soil vegetation atmosphere system" count="13" />
      <category value="soil respiration" label="soil respiration" count="13" />
    </dimension>
    <dimension name="orgName" label="orgNames">
      <category value="INRA" label="INRA" count="24" />
      <category value="CNRS" label="CNRS" count="7" />
      <category value="CIRAD" label="CIRAD" count="3" />
      <category value="FAO - NRCW" label="FAO - NRCW" count="1" />
      <category value="Department of Sustainability and Environment (DSE)" label="Department of Sustainability and Environment (DSE)" count="1" />
      <category value="FAO - Land and Water Development Division" label="FAO - Land and Water Development Division" count="1" />
      <category value="Cirad" label="Cirad" count="1" />
      <category value="INRA CARRTEL SOERE OLA" label="INRA CARRTEL SOERE OLA" count="1" />
      <category value="19e4395c-b2eb-4592-b490-22689dc7090d" label="AnaEE-France Metadata Catalog" count="38" />
      <category value="83177e44-fcc4-4fcd-890e-b48bc9d097ff" label="My GeoNetwork catalogue" count="3" />
    </dimension>
    <dimension name="createDateYear" label="createDateYears">
      <category value="2020" label="2020" count="3" />
      <category value="2017" label="2017" count="2" />
      <category value="2016" label="2016" count="6" />
      <category value="2015" label="2015" count="1" />
      <category value="2013" label="2013" count="1" />
      <category value="2012" label="2012" count="1" />
      <category value="2011" label="2011" count="3" />
      <category value="2010" label="2010" count="2" />
      <category value="2009" label="2009" count="2" />
      <category value="2008" label="2008" count="3" />
      <category value="2007" label="2007" count="2" />
      <category value="2006" label="2006" count="1" />
      <category value="2005" label="2005" count="4" />
      <category value="2003" label="2003" count="1" />
      <category value="2001" label="2001" count="1" />
      <category value="2000" label="2000" count="2" />
      <category value="1998" label="1998" count="1" />
      <category value="1997" label="1997" count="1" />
      <category value="1996" label="1996" count="1" />
      <category value="1984" label="1984" count="1" />
    </dimension>
    <dimension name="format" label="formats">
      <category value="GML" label="GML" count="1" />
      <category value="Most popular formats including ESRI shape, MapInfo Tab and Oracle Spatial" label="Most popular formats including ESRI shape, MapInfo Tab and Oracle Spatial" count="1" />
      <category value="ShapeFile" label="ShapeFile" count="1" />
      <category value="inapplicable" label="inapplicable" count="32" />
      <category value="unapplicable" label="unapplicable" count="2" />
      <category value="unrelevant" label="unrelevant" count="1" />
    </dimension>
    <dimension name="spatialRepresentationType" label="spatialRepresentationTypes">
      <category value="vector" label="Vector" count="7" />
    </dimension>
    <dimension name="maintenanceAndUpdateFrequency" label="maintenanceAndUpdateFrequencies">
      <category value="asNeeded" label="As needed" count="40" />
      <category value="fortnightly" label="Fortnightly" count="1" />
    </dimension>
    <dimension name="status" label="status">
      <category value="onGoing" label="On going" count="5" />
      <category value="completed" label="Completed" count="2" />
    </dimension>
    <dimension name="serviceType" label="serviceTypes" />
    <dimension name="denominator" label="denominators">
      <category value="5000000" label="5000000" count="2" />
      <category value="50000" label="50000" count="34" />
      <category value="5000" label="5000" count="4" />
    </dimension>
    <dimension name="resolution" label="resolutions">
      <category value="25 meters" label="25 meters" count="1" />
    </dimension>
  </summary>
  <metadata>
    <abstract>PLANAQUA is the National experimental platform in Aquatic Ecology.

It consists of 3 levels of research platforms:

Microcosms, from one to several litres of volume, can be used in the plankton-dedicated laboratory to set up continuous cultures experiments. They can also be used in the climatic chambers of the Ecotron IleDeFrance, a device that allows the precise conditioning of the environment and the detailed monitoring of states and activities of organisms and ecosystems. Microcosms allow studying plankton communities in marine or freshwater ecosystems under highly controlled environmental conditions such as temperature, irradiance, nutrients, and gas concentrations. A series of dedicated sensors enables monitoring of gas exchange (O2 and CO2) between the air and the water in the experimental system and to follow the related plankton metabolic activity.

Mesocosms, with a volume of several cubic metres, have a high degree of replication. They are installed outdoors and equipped with devices for the experimental control of thermal gradients and water mixing. For example, twelve mesocosms are equipped with beaters that generate waves, making it possible to control the physical structure of the water column. These tools have been developed for studying the link between physical constraints and the functioning of aquatic systems. The large volume of these mesocosms (15 m3) makes it possible to house complex communities of organisms.

Sixteen artificial lakes of 650 m3 have been conceived for incorporating the natural complexity of the environment and the spatially heterogeneous nature of ecological processes in natural ecosystems. These very large experimental systems, shaped with littoral, benthic and pelagic zones, will be inter-connected to each other by dispersal channels and equipped with automated sensors and data loggers. The artificial lakes will facilitate the studies on the functioning of complex communities with heterogeneous spatial distributions, and will allow understanding and managing the consequences of anthropogenic pressures on biodiversity, up to the species at the top of the food chains.</abstract>
    <lineage>PLANAQUA, implemented within the CEREEP-Ecotron IleDeFrance, is an experimental high-level infrastructure oriented towards the study of aquatic systems and competitive vis-à-vis the highest international standards in terms of experimental equipment. This infrastructure provides the scientific community with a set of highly instrumented experimental equipment.
The chemostats are functional since 2013 and microcosms in confined environment (Ecotron IDF) are being validated in 2014. Mesocosms exist since 2009 and the Experimental Lakes will be filled with water in the summer of 2014.

PLANAQUA is a service of AnaEE-France Infrastructure.</lineage>
    <responsibleParty>Point of contact|metadata|CNRS||planaqua@biologie.ens.fr|Gérard LACROIX||78, rue du Château, Saint-Pierre-Lès-Nemours, 77140, France|+33 1 64 28 35 33||1|</responsibleParty>
    <responsibleParty>Point of contact|resource|CNRS||planaqua@biologie.ens.fr|Gérard LACROIX||78, rue du Château, Saint-Pierre-Lès-Nemours, 77140, France|+33 1 64 28 35 33||1|</responsibleParty>
    <type>dataset</type>
    <legalConstraints>In case of participation of technical staff of the station during the experiments they must be co-authors of publications.
The CEREEP-Ecotron IleDeFrance should be cited in the

acknowledgments and / o r t h e m a t e r i a l s and methods .</ l e g a l C o n s t r a i n t s> 121 <i s H a r v e s t e d>y</ i s H a r v e s t e d> 122 <d i s p l a y O r d e r>0</ d i s p l a y O r d e r> 123 <d o c L o c a l e>eng</ d o c L o c a l e> 124 <p o p u l a r i t y>162</ p o p u l a r i t y> 125 <keyword>AnaEE−France</ keyword> 126 <keyword>E n v i r o n m e n t a l m o n i t o r i n g f a c i l i t i e s </ keyword> 127 <keyword> F o l j u i f</ keyword> 128 <keyword>Guyane</ keyword> 129 <keyword>S t r u c t u r e</ keyword>

130 <keyword>a n t h r o p o g e n i c e n v i r o n m e n t</ keyword> 131 <keyword>a q u a t i c b i o l o g y</ keyword>
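A record such as the one above can be mined for candidate terms to feed the ontological query enhancement: the <keyword> elements carry the controlled vocabulary of the catalogue entry. The sketch below is a minimal illustration, not part of the SEARCH framework itself; it assumes the record fragment is wrapped in a hypothetical <metadata> root so that it parses as well-formed XML, and uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring a few fields of the harvested record above,
# wrapped in a <metadata> root element so it is well-formed on its own.
fragment = """<metadata>
  <type>dataset</type>
  <docLocale>eng</docLocale>
  <keyword>AnaEE-France</keyword>
  <keyword>Environmental monitoring facilities</keyword>
  <keyword>aquatic biology</keyword>
</metadata>"""

def extract_keywords(xml_text):
    """Return the text of every <keyword> element found in the record."""
    root = ET.fromstring(xml_text)
    return [k.text for k in root.iter("keyword")]

print(extract_keywords(fragment))
# → ['AnaEE-France', 'Environmental monitoring facilities', 'aquatic biology']
```

The extracted terms could then be matched against an ontology to expand a user query with related concepts.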
