

Layout: typeset by the author using LaTeX.


Semantic Enhancer And Ranking Component Helpers (SEARCH) for data discovery -2

SEARCH-Ranker: Re-ranking component for search results

Mats Gonggrijp (11220600)

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie (Artificial Intelligence)
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Dr. Xiaofeng Liao, Postdoctoral Researcher
Dr. Zhiming Zhao, Assistant Professor
Multiscale Networked Systems group
University of Amsterdam, Faculty of Science
Science Park 907
1098 XG Amsterdam


Abstract

Through the proposed ENVRI-FAIR project, the ENVRI community aims for all participating Research Infrastructures to build a set of FAIR data services (Findable, Accessible, Interoperable and Reusable by machines and by people). As specified in the Findable principle, (meta)data are registered or indexed in a searchable resource. However, the data providers in ENVRI still use many different (meta)data standards, and there exists no unified search tool that connects all of them. Here we construct a starting point for a method that can aid in unifying data search across different (meta)data standards. To this end we evaluate several search tools and, given their lack of flexibility and their complexity, propose an extension to them. Several search tools for discovering data sets exist, such as Fairsharing.org or the recently released dataset search service by Google, which allows users to discover data stored in various online repositories via keywords. Since these tools provide no state-of-the-art search methods and are labour-intensive to modify, we build a prototype that re-ranks their search results, using semantic enhancement via an integrating domain ontology. Experimental results show that the query enhancement prototype can significantly improve the recall of existing search tools. The ranking component can successfully re-rank ranking output, but no quality testing could be done. Further work is needed to include more ranking methods and compare them.

Contents

1 Introduction
1.1 Context: environmental data search
1.2 Problem Definition
1.3 Use Case and the ENVRI Community
1.4 Contributions

2 State of the Art
2.1 Dataset Search Products/Services
2.2 Results Ranking Methodologies

3 Proposed solution: a Re-ranking tool
3.1 Semantic Enhancer And Ranking Component Helpers (SEARCH) framework
3.2 Re-ranking System Requirements
3.3 SEARCH Architecture
3.4 Re-ranking Architecture
3.5 Ranking Model Choices and Learning to Rank

4 Code Implementation
4.1 Coding Language, Environment and Libraries
4.2 Functions

5 Conclusions
5.1 Final implementation results
5.2 Limitations of data accessibility
5.3 Search tool architectures
5.4 Limitations of the re-ranking system

6 Discussion
6.1 Future work
6.2 Generality of the re-ranking system
6.3 Relevance of the evaluated search methods
6.4 Using the Ontology For Entity Extraction

Chapter 1

Introduction

For researchers in many scientific fields, using large amounts of data is a key component of their work. Therefore, searching for the right data is essential. For many researchers, data is provided by separate organisations that collect it. These organisations choose for themselves how they want to organise, store and present the data. They also provide the search tools, such as a web interface, that allow people to find the data. Data providers choose which metadata standard, if any, they use. Even within the same field, there can be great disparity between these metadata standards. Despite this, increasingly complex research topics are driving a growing demand for varied data, and combining different datasets can provide novel insights. From the end-user's perspective, a single unified retrieval tool that can search and present data from many different providers and metadata standards would save time and effort in the retrieval process. In this research we present a method for increasing data retrieval generality.

1.1 Context: environmental data search

In the environmental sciences there is an increasing demand for finding data sets that come from different providers. The reason is growing pressure from global problems such as climate change, which require a large variety of variables to create accurate models. Researchers are therefore forced to navigate many different data search infrastructures. Unfortunately, there exists great disparity in the way data is organised by these providers. There is no global consensus on metadata standards, even within the same domain. This means that researchers have to use the search tools that these infrastructures provide, and are therefore limited in how efficiently they can gather data that spans different providers.

1.2 Problem Definition

We present the following question: how can we increase the efficiency of data set search across different catalogues in the environmental sciences? Some potential candidates for this are data set catalogue tools such as CKAN, GeoNetwork, DataCite, Open Semantic Search and Google Dataset Search. The idea is then to compile data from many different providers in one such tool. Compiling the data in one tool requires adjusting the tool to incorporate metadata from many different schemas. However, these tools have limits. First of all, they are either black boxes, offering no insight into their internal workings, or they are very complicated, requiring extensive effort to adjust internal parameters and code to increase flexibility. Besides this, most of these catalogue tools do not report using advanced ranking algorithms, missing out on state-of-the-art ranking methods that would provide increased search efficiency. This leads to a sub-question: how can we enhance the functionality of existing search tools without requiring changes to their source code?

For this sub-question we propose the following: a search tool add-on that can be connected to these search tools and enhance their functionality. This add-on is called SEARCH: 'Semantic Enhancer And Ranking Component Helpers'. It has two main functions. First, it enhances user queries using ontological semantics to increase search result generality. Second, the search system output is re-ranked using recently developed ranking methods to decrease user effort. In this part of the research the focus is on the re-ranking component of SEARCH.

1.3 Use Case and the ENVRI Community

In this research we focus on testing on data provided by infrastructures that are part of the ENVRI community. The ENVRI community brings together 26 European Research Infrastructures, organised in four Earth system domains (atmosphere, marine, biodiversity, and solid earth); some of them have a multi-domain character. The term 'Research Infrastructures' refers to facilities, resources and related services used by the scientific community to conduct top-level research in their respective fields, ranging from social sciences to astronomy, genomics to environmental sciences. ENVRI-FAIR brings together 14 environmental research infrastructures spanning multiple domains, including atmosphere, marine, solid earth, and terrestrial ecosystem/biodiversity. The overarching goal is that at the end of the proposed project, all participating Research Infrastructures have built a set of FAIR data services. The FAIR principles [11] articulate the behaviors expected from digital artifacts that are Findable, Accessible, Interoperable and Reusable by machines and by people [10]. Such development will enhance the efficiency and productivity of researchers, support innovation and connect the ENVRI Cluster to the EOSC. Last but not least, it will enable data- and knowledge-based decisions to answer the challenges the Earth and our society are facing. The ambition of ENVRI-FAIR is to establish the technical preconditions for the successful implementation of a virtual, federated machine-to-machine interface. This interface is called the ENVRI-hub: a one-stop shop for access to environmental data and services provided by the contributing research infrastructures. We showcase ENVRI as an example of a step in the right direction, but one still lacking (meta)data consistency. Therefore, making a proof of concept using ENVRI-based data would show the more general solution path. We load data from one ENVRI infrastructure, ANAEE France, into an instance of GeoNetwork. Then the Ontological Query Enhancer passes a query to GeoNetwork. The resulting rankings are stored and used to make preliminary tests for the re-ranker.

1.4 Contributions

1. Provide a starting point for data search capabilities across data infrastructures with different metadata standards in environmental sciences.

2. Present a method to increase these capabilities by using existing search tools.

3. Present a preliminary framework that enhances the limited tools that are currently available.


Chapter 2

State of the Art

2.1 Dataset Search Products/Services

There are different approaches underway to encourage data sharing, reuse, and search. These include data marketplaces [4] and open data portals, either based on voluntary publication, like Fairsharing (https://fairsharing.org) and Zenodo (https://zenodo.org), or based on crawling and indexing, like the recently released Google Dataset Search (https://datasetsearch.research.google.com) [1]. These open data portals often provide data search capability.

Google Dataset Search [1] is a vertical web search engine tailored to discovering datasets on the web. This system uses either schema.org Dataset markup or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format. The metadata is reconciled to the Google knowledge graph, and search capabilities are built on top of this metadata.

FAIRsharing is part of the ELIXIR infrastructure, an intergovernmental organisation that brings together life science resources from across Europe. FAIRsharing ensures that standards, databases, repositories and policies are Findable by providing DOIs and marking up records in schema.org, allowing users to register, claim, maintain, interlink, classify, search and discover them [8].

Datacite.org/re3data.org is a repository finder, a pilot of the "Enabling FAIR Data" project led by the American Geophysical Union (AGU) in partnership with DataCite and the Earth, space and environment sciences community. It can help find an appropriate repository to deposit research data. The tool is hosted by DataCite and queries the re3data registry of research data repositories.

B2FIND (http://b2find.eudat.eu) is a discovery service based on metadata steadily harvested from research data collections at EUDAT data centres and other repositories. The service offers faceted browsing and in particular allows discovering data that is stored through the B2SAFE and B2SHARE services. The B2FIND service includes metadata harvested from many different community repositories; each community decides for itself which metadata is made available to EUDAT. A harvesting framework ensures that metadata providers are harvested regularly so that complete and up-to-date information is displayed. EUDAT provides an optimised translation from community metadata schemas to standard facets in the B2FIND metadata catalogue.

Like conventional search engines, these dataset search services mainly provide keyword-based search, with users typing in keywords as input. Unlike conventional search engines, however, a dataset search service calculates similarity between the keywords and the published metadata of candidate datasets, rather than their content as in document search.

2.2 Results Ranking Methodologies

Regarding the ranking of results, one of the traditional ranking methods is PageRank [2], which utilizes the links between web pages. However, as pointed out in [1], links between datasets are still rare, which makes traditional web-based ranking difficult.

As summarized in a recent survey [3], work on ranking datasets still requires substantial effort. Among the existing works: after performing a keyword query over tables, [14] attempts a ranking of the returned tables; in a more advanced method, [5] uses an unsupervised learning approach to identify topics of a database that can then be used in ranking; and finally, [6] ranks datasets containing continuous information.

Some other ranking methods that we consider are:

SetRank [9] was inspired by observations about query logs from semanticscholar.org. The authors observed that a large portion of scientific literature search queries were in fact so-called entity-set queries. These entities can have different types and can be ambiguous. They reason that graph-based knowledge bases with triple-store-like structures containing semantic information can help increase query resolution, and they provide a novel ranking approach by representing queries as heterogeneous graphs and finding 'query graph coverings'. Similarly to word matching, a higher covering yields a higher ranking score for the query-document combination.


Similar to SetRank, ESR (Explicit Semantic Ranking), proposed by [13], also uses entity-type information, but in a different way. The authors build their own knowledge graph using Semantic Scholar's (S2) corpus and Freebase, containing concept entities, context correlations, relationships with authors and venues, and embeddings trained from the graph structure. Semantic relatedness between query and document entities is computed in the embedding space. ESR uses two-stage pooling to generalise the entity-based matches to query-document ranking features, and uses a learning-to-rank model to combine them.

A geospatial learning-to-rank method proposed in [7] improves the search performance of geospatial data. Using a log processor that provides a user feedback stream to a deep learning model, they train a deep neural net to adjust ranking parameters. They provide a log parser that ingests and parses web logs recording user behavior, propose their own set of hypotheses to model the relevance of data, and build a new deep-learning-based ranking model that can learn from user feedback in real time. The key difference with the other ranking techniques is that they do not explicitly use knowledge graph embeddings, and that they focus on geospatial data instead of literature.

Finally, the word-entity duet representation for document ranking proposed by [12] builds further on top of ESR. The novelty of their method is a four-way connection model that uses matching features between query words and document words, query words and document entities, query entities and document words, and query entities and document entities. Besides this, they developed an attention mechanism that emphasises certain components in these connections to deal with the uncertainty introduced by constructing entity representations. Both the matching model and the attention model learn through a one-dimensional convolutional neural network to produce convolved vectors. The attention and feature-match vectors are then combined in a dot product to produce the model output: a query-document matching score.
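To make this combination step concrete, the following minimal Python sketch shows how matching features and attention weights could be combined by a dot product into a single score. The array shapes, names and toy inputs are illustrative assumptions; in the actual AttR-Duet model both inputs are produced by learned convolutional networks.

import numpy as np

def duet_score(match_features: np.ndarray, attention: np.ndarray) -> float:
    """Combine matching features with attention weights via a dot product
    to obtain a single query-document matching score."""
    return float(np.dot(match_features.ravel(), attention.ravel()))

# Toy example: 3 query terms x 4 duet channels (word-word, word-entity,
# entity-word, entity-entity). In the real model these matrices come from
# the learned matching and attention models.
rng = np.random.default_rng(0)
match = rng.random((3, 4))
attn = rng.random((3, 4))
print(duet_score(match, attn))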

Chapter 3

Proposed solution: a Re-ranking tool

3.1 Semantic Enhancer And Ranking Component Helpers (SEARCH) framework

The main goal of this research is to propose a flexible add-on framework that can improve the search results of basic search systems. This thesis focuses on the re-ranking component of SEARCH; however, for context it is necessary to show SEARCH as a whole. Figure 3.1 shows the general pipeline of user interaction with SEARCH. The green arrows indicate standard search tool use and the red arrows show the extended version. The query is passed through an Ontological Query Enhancer that feeds an enhanced query into the search tool via the API that the tool provides. This API returns a ranking, which is passed to the re-ranking system. This system changes the ordering of the ranking based on a selected algorithm and presents the result to the user. For simplicity the user-feedback arrows are not drawn in the figure, but an important part is that the re-ranking system learns from user feedback by adjusting model parameters on user feedback data.

Figure 3.1: SEARCH framework. Users send queries, in the form of a set of keywords, to the Ontological Query Enhancer, which generalises the query. The enhanced query is sent to the API of the search tool, which produces search results. These results are re-ordered by the RRS, which presents them to the user. User behavior is taken as feedback into the RRS to learn better rankings.

3.2 Re-ranking System Requirements

Following are two sets of requirements. The conceptual requirements list the general goals of the re-ranking system. The practical requirements specify the input/output, coding language and code structure.

Re-ranking conceptual requirements


1. It can learn from user feedback such as clickstream data or search ratings.

2. The learning from user feedback provides added benefit in ranking and increases user satisfaction with the end results.

3. The ranking model (ranking score computation) should not be too complicated or computationally expensive. The model should be explainable, so as to allow making arguments for its use, explaining its behavior, and proposing sensible modifications.

4. The system should be set up in such a way that it can be easily modified, and different algorithms can be implemented to allow comparing them.

Re-ranking practical requirements The RRS must satisfy a number of practical requirements in order to connect to the other components of the system pipeline. The practical requirements can be extended later, but to limit the scope of this research we define the following set for now.

1. The code is written in the latest version of Python 3 (3.8.2 at the time of writing), both for readability and to be able to use the many Python libraries that may provide useful functionality.

2. It accepts a ranking input in the form of XML and outputs a new ranking as an ordered list of tuples containing a score and a URL.

3. The main re-ranking functionality is organised in a Python class so that it has a comprehensible structure and is easily extendable.

4. The code is implemented in a Jupyter notebook to allow easy testing and sharing.

5. The source code is uploaded to GitHub as open source so it can be reviewed, cloned and modified.

3.3 SEARCH Architecture

This section presents the general architecture for such a system. This architecture is in a sense an abstraction and does not show exact implementation details, but provides an overview of data flow and the concepts.

Figure 3.2: The Re-Ranking System architecture. The purpose of this figure is to show an abstract system that visualises the idea of an RRS. It is divided into four colour-coded flows: blue for the query flow, red for the API output documents flow, purple for the learning-from-user-feedback mechanism, and green for the algorithm output.

In figure 3.2 the abstract system architecture is shown. It represents a general set of pipelines that the system I/O must follow. It does not show exact design and implementation details, which are discussed in chapter 4. The data story is as follows:

• A user sends a query to the Ontological Query Enhancer which produces an enhanced query that is sent to the API.

• The API produces a ranking that is sent to the Ranking Method.

• The Ranking Method produces a ranking which is sent to the User.

• The User interacts with the ranking to produce feedback that is collected by the Learning Mechanism.

• The Learning Mechanism dynamically updates Ranking Method parameters using this feedback.

3.4 Re-ranking Architecture

The general system architecture takes the form of a Python class called RRS. The idea is to have two main functionalities (a minimal skeleton is sketched after this list):

• rank an input ranking made by some search tool.

• accept user feedback to adjust ranking parameters of the instance ranking method.
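The sketch below illustrates this structure, matching figure 3.3: the ranking method is chosen at initialisation, and rank and getFeedback dispatch to it. The score() and update() hooks are hypothetical names, and the 'format' argument and learning internals described in section 4.2 are omitted; this is a skeleton under those assumptions, not the repository's actual code.

class RRS:
    """Re-Ranking System: wraps one interchangeable ranking method."""

    def __init__(self, method):
        # The ranking method is fixed at construction time; rank() and
        # getFeedback() dispatch to whatever method the instance is set to.
        self.method = method

    def rank(self, query, documents):
        """Re-rank the documents returned by the search tool using the
        current parameter state of the selected method."""
        scored = [(self.method.score(query, doc), doc["url"]) for doc in documents]
        # Output format per the practical requirements: an ordered list
        # of (score, URL) tuples, best first.
        return sorted(scored, reverse=True)

    def getFeedback(self, feedback):
        """Feed user feedback data points into the learning mechanism of
        the selected ranking method."""
        self.method.update(feedback)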

3.5 Ranking Model Choices and Learning to Rank

Below are the considerations used when evaluating different ranking models:

1. How does it represent documents?

2. How does it represent queries?

3. What features are used to represent query-document matching?

4. What are the model parameters?

5. How are the model parameters and features combined into a formula that calculates a ranking score?

6. What feedback features does the model use?

7. How are the feedback features calculated, and how are they used to update the model parameters?

Figure 3.3: RRS class architecture. At initialisation a ranking method is chosen; calls to rank and getFeedback then use the method that the RRS instance is set to.


Chapter 4

Code Implementation

As an end goal we should be able to use multiple different ranking methods side by side and easily compare them. For this research, the scope is limited to a starting implementation of the AttR-Duet model proposed by [12]. Besides this, the code also includes auxiliary methods, such as cleaning and processing XML ranking output from a search tool.

4.1 Coding Language, Environment and Libraries

Language For the coding language, Python (version 3.8.2) is used due to its simplicity and its large selection of libraries and support.

Environment The environment used is a local Jupyter notebook instance, uploaded to a GitHub repository (https://github.com/mgonggrijp/Ranking) as an .ipynb file in the folder "RRS final". There, the "RRS Showcase" file can be run as a notebook demonstrating the functionality of the RRS. The reason for choosing this environment is its ability to easily add markup annotations and to segment the code into clear blocks that can be individually run with direct output. This project can act as a starter for a more advanced RRS implementation, and the Jupyter notebook format is good for a narrative-driven explanation of the ideas, which will aid communication to the next developers.

4.2 Functions

Main Functions The following functions are implemented in the RRS class.

1. RRS.rank(q, R_old, 'format'): re-ranks only documents existing in the set R_old, using the query and the current parameter state of the method.

2. RRS.getFeedback(F_user): takes a set of user feedback data points and feeds them into the learning mechanism belonging to the ranking method of the current RRS instance.

Auxiliary Functions For pre-processing GeoNetwork XML output (a sketch of this chain follows the list):

1. makeDoclist: takes a GeoNetwork ranking XML file and turns it into a list of dictionaries, one per document, containing all information returned by GeoNetwork.

2. makeCleanDocument: cleans a single raw GeoNetwork document dictionary. It keeps only the abstract, keywords and URL, since this is the only information in the ranking output relevant to re-ranking.

3. cleanDoclist: applies makeCleanDocument to all documents in the list produced by makeDoclist. Returns a cleaned list of document dictionaries in the same order as GeoNetwork's ranking.

4. handleGeonetworkXML: combines functions 1 and 3 to produce a cleaned list of documents.

5. makeBM250Corpus: takes a cleaned list of documents produced by cleanDoclist and turns it into a representation that can be used by the BM25Okapi module.
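A minimal sketch of this pre-processing chain is shown below. GeoNetwork's actual response schema depends on the endpoint and version, so the tag and field names used here ('metadata', 'abstract', 'keyword', 'link') are assumptions that would need to be adjusted to the real output.

import xml.etree.ElementTree as ET

def makeDoclist(xml_path):
    """Parse a GeoNetwork ranking XML file into a list of dictionaries,
    one per returned document, in ranking order."""
    root = ET.parse(xml_path).getroot()
    # 'metadata' as the per-record tag is an assumption about the schema.
    return [{child.tag: (child.text or "") for child in record}
            for record in root.iter("metadata")]

def makeCleanDocument(raw):
    """Keep only the fields the re-ranker needs."""
    return {"abstract": raw.get("abstract", ""),
            "keywords": raw.get("keyword", ""),
            "url": raw.get("link", "")}

def cleanDoclist(doclist):
    """Clean every raw document, preserving GeoNetwork's ordering."""
    return [makeCleanDocument(d) for d in doclist]

def handleGeonetworkXML(xml_path):
    """Full chain: parse the XML, then clean the document list."""
    return cleanDoclist(makeDoclist(xml_path))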

Making the ranking features for AttR-Duet AttR-Duet lists the following features as necessary for its ranking model. These features are all ranking model outputs and are used as matching signals.

1. TF-IDF

2. Boolean OR

3. Boolean AND

4. BM25

5. Language model with various smoothings

6. Coordinate matching

7. Binned translation scores, 1 exact-match bin

8. 5 soft-match bins in [0, 1)

These matching signals are used in different combinations of query word/entity and document word/entity. For example, query-word to document-word matching uses all of the above signals except 7 and 8.

To produce these features, the ranking models need a working implementation. They are organised in the following files:

1. boolean matching

2. vsm and bow: functions for the vector space model (coordinate matching), making bag-of-words representations and computing tf-idf statistics.

3. language models: functions for making a language model ranking with the ability to use different smoothing techniques.

4. translation score bins and soft match bins

All of these are implemented such that they use the standard document/vector representation: a list of strings.
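As an illustration of this representation, the sketch below implements two of the simpler matching signals (Boolean AND/OR and a summed tf-idf weight) directly over token lists. It sketches the idea only and is not the repository's exact code.

import math
from collections import Counter

def boolean_and(query, doc):
    """1 if every query term occurs in the document, else 0."""
    return int(all(term in doc for term in query))

def boolean_or(query, doc):
    """1 if any query term occurs in the document, else 0."""
    return int(any(term in doc for term in query))

def tf_idf(query, doc, corpus):
    """Sum of tf-idf weights of the query terms in the document."""
    n = len(corpus)
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df:
            score += tf[term] * math.log(n / df)
    return score

# Documents and queries are plain lists of strings (tokens).
corpus = [["ocean", "temperature", "data"], ["soil", "moisture", "data"]]
print(tf_idf(["ocean", "data"], corpus[0], corpus))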

Making the entity extraction For the AttR-Duet ranking, entities need to be extracted from a knowledge base to allow the four-way entity-word matching between documents and queries. Due to time constraints I could not build a complete knowledge base, but below I show the necessary functions to extract entities from a knowledge base using query or document words.

1. spot surface form: using the link probability, identify surface forms in a document.

2. disambiguation: link the most probable entity from the candidates of the surface form, using commonness and context.
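The toy sketch below illustrates both steps over an assumed knowledge-base index keyed by surface form. The index layout, the threshold value, and the restriction to single-token spans are simplifying assumptions; real spotting also considers multi-word spans, and real disambiguation also scores context similarity.

def spot_surface_forms(tokens, kb, min_link_prob=0.1):
    """Step 1: keep spans whose link probability (how often the span is
    used as anchor text for some entity) exceeds a threshold."""
    return [t for t in tokens
            if kb.get(t, {}).get("link_prob", 0.0) >= min_link_prob]

def disambiguate(surface_form, kb):
    """Step 2: pick the candidate entity with the highest commonness,
    i.e. the fraction of the form's links pointing to that entity."""
    candidates = kb.get(surface_form, {}).get("candidates", {})
    return max(candidates, key=candidates.get) if candidates else None

# Toy knowledge-base index: surface form -> link statistics.
kb = {"amazon": {"link_prob": 0.4,
                 "candidates": {"Amazon_River": 0.7, "Amazon_Inc": 0.3}}}
print(spot_surface_forms(["the", "amazon", "basin"], kb))  # ['amazon']
print(disambiguate("amazon", kb))                          # Amazon_River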


Chapter 5

Conclusions

5.1 Final implementation results

I was able to do the following.

1. Construct the beginning of an RRS framework.

2. Use the BM25 ranking method to re-rank a ranking produced by GeoNetwork after receiving a query sent by the Ontological Query Enhancer.

3. Make a variety of pre-processing functions that handle XML from GeoNetwork.

4. Make a variety of basic ranking methods that can serve as features for an AttR-Duet implementation.

5. Make a Jupyter notebook that demonstrates the basic idea and functionality of the RRS.

This shows that even a basic ranking method produces different rankings given only the set of documents in the output. Due to time constraints it was impossible to extensively test performance improvements, but the result serves as a proof of concept. The current implementation is uploaded to GitHub so further work can be done.
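For illustration, the BM25 re-ranking step could look roughly like the sketch below, using the BM25Okapi class (from the rank_bm25 library) mentioned in section 4.2. Scoring on the abstract field alone and tokenising on whitespace are simplifying assumptions.

from rank_bm25 import BM25Okapi

def bm25_rerank(query_tokens, docs):
    """Re-rank cleaned GeoNetwork documents with BM25. Each doc is assumed
    to be a dict with 'abstract' and 'url' fields (cf. section 4.2)."""
    corpus = [doc["abstract"].split() for doc in docs]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query_tokens)
    # Ordered list of (score, URL) tuples, best first.
    return sorted(zip(scores, (doc["url"] for doc in docs)), reverse=True)

docs = [{"abstract": "soil carbon flux measurements", "url": "u1"},
        {"abstract": "ocean temperature time series", "url": "u2"}]
print(bm25_rerank(["ocean", "temperature"], docs))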

What still needs to be done to finish the AttR-Duet method (a minimal sketch for item 1 follows this list):

1. Implementing the language model for ranking with various smoothing techniques.

2. Making an entity extraction tool that can find surface forms in a text and extract the most likely corresponding entity from a knowledge base.

3. Making the two binning functions: binned translation scores and soft-match bins.

4. Making the scoring calculation functions that combine the different ranking model outputs into the features that go into a matrix.

5. Making the matching model by connecting a 1-D convolutional neural network to the matching features.

6. Implementing the attention model pipeline: transforming words and entities, based on feedback, into attention features.

7. Combining the matching model and attention model with a dot product to produce the AttR-Duet function output.
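For item 1, a minimal query-likelihood scorer with Jelinek-Mercer smoothing could serve as a starting point; other smoothings (e.g. Dirichlet) would slot in by replacing the interpolation line. This is a sketch under the list-of-strings representation, not the planned implementation.

import math
from collections import Counter

def lm_jelinek_mercer(query, doc, corpus, lam=0.5):
    """Query-likelihood score with Jelinek-Mercer smoothing:
    P(t|d) = lam * tf(t, d)/|d| + (1 - lam) * cf(t)/|C|.
    Assumes non-empty documents and corpus."""
    tf = Counter(doc)
    collection = Counter(t for d in corpus for t in d)
    collection_len = sum(collection.values())
    score = 0.0
    for t in query:
        p = lam * tf[t] / len(doc) + (1 - lam) * collection[t] / collection_len
        if p > 0:
            score += math.log(p)
    return score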

In figure 5.1 the difference in ranking output is shown. The ranking was produced by re-ranking the result set returned by GeoNetwork for the same query.

Figure 5.1: Screenshot from the code. The returned items are URLs. The top URL is the same for both rankings, but the second and third entries differ.


5.2 Limitations of data accessibility

Due to time constraints we only had access to a limited set of data and had to use what was available. ANAEE France already uses GeoNetwork-compatible standards, hence the choice of infrastructure and search tool. We ran an instance of GeoNetwork on a server.

5.3 Search tool architectures

Figure 5.2 shows the general architectural make-up of the search tools we considered. This overview gives a general sense of how they work and helps to reason about the tool-SEARCH interaction. It is our own interpretation based on the information the tool developers provide.

5.4 Limitations of the re-ranking system

The system is designed to extend the functionality of an existing catalogue tool. This inherently gives it a few limitations.

Ranking set size Because the re-ranking system only re-ranks using the ranking set returned by the catalogue tool, it is inherently limited by the recall of this tool. The idea is that the Ontological Query Enhancer component increases query generality and therefore tool recall, but this has to be extensively tested for many different tools.

Efficiency Most ranking methods require their own indexing. This means each ranking request requires an extra linear scan of the returned documents to build this index. However, this is inherent to the design, since the search tool is treated as a black box and its internal index is not used.

Generality Because the system currently uses pre-processing functions that work specifically on the XML ranking output of GeoNetwork, other search tools that produce, for example, JSON or simply differently structured XML will need their own pre-processing functions.

Figure 5.2: The catalogue and search tool architectures included in this research, with panels including (a) CKAN, (c) DataCite and (d) GeoNetwork. These are representations based on our own reading of the documentation; the tools do not provide such architectural schemas themselves.


Chapter 6

Discussion

6.1 Future work

This work shows a proof of concept: even when re-ranking with a basic method such as BM25, we get very different ranking results than the search tool produces by itself. The following still needs to be done:

1. Finish the AttR-Duet implementation to have one working ranking model that learns from feedback.

2. Make a user feedback parser that feeds this feedback to a learning ranking model.

3. Implement several other state-of-the-art ranking methods inside the RRS class.

4. Make a testing mechanism that models user satisfaction, and compare different ranking method-search tool combinations.

5. Connect the functionality of the Ontological Query Enhancer to the Re-Ranking System.

Besides this, more search tools like GeoNetwork should be found and included for consideration. A trade-off analysis must be made to find the pros and cons of using SEARCH compared to modifying and using an existing search tool.

6.2 Generality of the re-ranking system

Since the system currently only works on XML output by GeoNetwork, an investigation needs to be done into the output formats of all search tools considered for connection testing. If the output is different, separate pre-processing scripts will need to be made to transform it into the input the RRS requires. Besides this, since the re-ranking is done on the set of documents returned by a search tool, methods that use statistics over the entire corpus, such as TF-IDF, may function sub-optimally. To increase their relevance, it may be necessary to retrieve these statistics from the entire corpus stored in the search tool. Of course, this makes the SEARCH system more reliant on pre-processing and therefore less flexible.

6.3 Relevance of the evaluated search methods

An important note is that three of the discussed papers address literature search, which may pose slightly different challenges than metadata search. Also, even though a manually engineered system that exploits the structure of the metadata schemas will always perform better, the novelty of this idea is that it can be used as an add-on that treats basic search tools as black boxes, making it very adaptable and easy to implement.

6.4 Using the Ontology For Entity Extraction

Since we use an ontology to create the semantic queries, it is natural to also use it to create entities. Several state-of-the-art ranking methods use entity-type information in various ways. Further research needs to be done to find relevant ontologies and to create methods to parse them. I predict that using rich ontological information, not just for query enhancement but also for ranking, will provide added benefit. This is therefore a strong argument for using the SEARCH system as well, since no catalogue tool incorporates such semantics. SEARCH would provide both a research framework and a working tool.


Bibliography

[1] Dan Brickley, Matthew Burgess, and Natasha Noy. Google dataset search: Building a search engine for datasets in an open web ecosystem. In The World Wide Web Conference, WWW '19, pages 1365–1375, New York, NY, USA, 2019. ACM.

[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998. Proceedings of the Seventh International World Wide Web Conference.

[3] Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and Paul Groth. Dataset search: a survey. The VLDB Journal, 29(1):251–272, 2020.

[4] Yuri Demchenko, Wouter Los, and Cees de Laat. Data as economic goods: Definitions, properties, challenges, enabling technologies for future data markets. 2018.

[5] Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. Neural vector spaces for unsupervised information retrieval. ACM Transactions on Information Systems (TOIS), 36(4):1–25, 2018.

[6] Jian Li and Amol Deshpande. Ranking continuous probabilistic datasets. Proceedings of the VLDB Endowment, 3(1-2):638–649, 2010.

[7] Yun Li, Yongyao Jiang, Chaowei Yang, Manzhu Yu, Lara Kamal, Edward M. Armstrong, Thomas Huang, David Moroni, and Lewis J. McGibbney. Improving search ranking of geospatial data based on deep learning using user behavior data. Computers & Geosciences, 142:104520, 2020.

[8] Peter McQuilton. How FAIRsharing can help FAIRify your standards, databases and data policies. Technical report, Dec 2019.

[9] Jiaming Shen, Jinfeng Xiao, Xinwei He, Jingbo Shang, Saurabh Sinha, and Jiawei Han. Entity set search of scientific literature: An unsupervised ranking approach. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, pages 565–574, New York, NY, USA, 2018. Association for Computing Machinery.

[10] Hana Pergl Sustkova, Kristina Maria Hettne, Peter Wittenburg, Annika Jacobsen, Tobias Kuhn, Robert Pergl, Jan Slifka, Peter McQuilton, Barbara Magagna, Susanna-Assunta Sansone, Markus Stocker, Melanie Imming, Larry Lannom, Mark Musen, and Erik Schultes. FAIR convergence matrix: Optimizing the reuse of existing FAIR-related resources. Data Intelligence, 2(1-2):158–170, 2020.

[11] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J.G. Gray, P.T. Groth, Carole Goble, Jeffrey S. Grethe, J. Heringa, Peter A.C. 't Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.

[12] Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document ranking. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pages 763–772, New York, NY, USA, 2017. Association for Computing Machinery.

[13] Chenyan Xiong, Russell Power, and Jamie Callan. Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1271–1279, Republic and Canton of Geneva, CHE, 2017. International World Wide Web Conferences Steering Committee.

[14] Shuo Zhang and Krisztian Balog. Ad hoc table retrieval using semantic similarity. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 1553–1562, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee.
