
Applying Question Answering for the Mendeley Search Engine to Improve User Search Experience

Master in Interaction Technology, University of Twente

Author: Kehan Zhang

Supervisors: Mariët Theune, Shenghui Wang
External Supervisor: Till Bey

Aug 14, 2020


Abstract

Mendeley, one of Elsevier's products, is an academic social network and service provider for researchers. It offers a keyword-based search function designed for finding publications. Some users, however, type entire questions into the search box, and such non-keyword queries raise a search problem.

In this project, we investigated how question answering (QA) techniques embedded in the search experience can address this problem and improve the user search experience. The proposed QA system consists of several components, including a query classifier and two QA models. The query classifier is based on TF-IDF and SVM. The two QA models are a community-based QA model using TF-IDF and cosine similarity, and an information-retrieval-based QA model using FastText and BERT. Each component was evaluated using data from the Mendeley search log, and a user test was conducted to evaluate the complete model. The results show that applying question answering techniques can improve the user search experience. The dataset used in the project came from Elsevier, but the approach is generalizable to other search engines in a similar setting.


Acknowledgement

Conducting this project has been a long and enriching journey. It guided me to the NLP domain and gave me hands-on experience as a beginner in this field.

I would like to express my great appreciation to my supervisors Mariët Theune and Shenghui Wang for their guidance over the past several months. Thank you Mariët, for guiding me onto the path of NLP, for improving my understanding of academic writing, and for handling all the operational issues to ensure the progress stayed on track. Thank you Shenghui, for sharing your knowledge of machine learning and NLP, for teaching me many research skills, and for giving me academic and constructive ideas.

Thank you both for helping me correct my thesis and for the regular meetings we had; after each meeting I always had a more concrete plan for the next step and higher motivation to carry it out.

At the same time, special thanks to my external supervisor, Till Bey, for giving me so much helpful information, for helping me transform from a student into an industry professional, and for all the guidance in every aspect. I appreciate that you kept the project on track at all times, followed up closely, and inspired me when I lost direction. Thanks to Elsevier, for offering me the opportunity to carry out this research and for sharing all the resources with me. Thanks to all my colleagues, for making me a part of the team and assisting me with the project.

Finally, I wish to extend my thanks to my family and my friends, for always supporting me and encouraging me throughout my study.

This project was conducted during a special time, when the pandemic slowed everything down and isolated people. But with support from all of you, I am glad that I could finish this project in time and stay healthy.


Abbreviations and Acronyms

QA     Question Answering
QC     Question Classification
KG     Knowledge Graph
5W1H   What, when, where, who, why, how
SVM    Support Vector Machine
DCG    Discounted Cumulative Gain
DOI    Digital Object Identifier
CTR    Click-Through Rate
BERT   Bidirectional Encoder Representations from Transformers
SQuAD  Stanford Question Answering Dataset
QALD   Question Answering over Linked Data
BoW    Bag of Words
TF-IDF Term Frequency-Inverse Document Frequency
IR     Information Retrieval
NER    Named Entity Recognition
RDF    Resource Description Framework


List of Figures

2.1 Architecture of QA procedure
4.1 Overview of the project framework
4.2 Details of the Query Classifier Module
4.3 Details of the QA Module
5.1 Proportion of different types of query
6.1 Overview of the Product-related QA model
6.2 Example of Mendeley support center
6.3 Example of Mendeley support center forum
6.4 Output for product-related queries that don't get answers from the model
6.5 Validation result of the product-related QA model's performance
6.6 Recall and precision curve for threshold of the product-related QA model
6.7 Different types of answer curve for threshold of the product-related QA model
6.8 Product-related QA test result
7.1 Overview of the Open Domain QA model
7.2 Example of the Topic Page
7.3 Machine Learning's Topic Page
7.4 Answer for "what is machine learning" from the model
7.5 Alcoholism's topic page
7.6 Output of "what is lean six sigma" from the model
7.7 Candidates of "how dogs learn"
7.8 Answers of "can we define sleep health, does it matter"
7.9 Proportion of answers on each position
7.10 Proportion of answers on each position for queries having more than five search results
7.11 Number of answers with BERT threshold increasing of the Open Domain QA model
7.12 Valid answers recall and precision with BERT threshold increasing of the Open Domain QA model
7.13 Number of the top answers with BERT threshold increasing of the Open Domain QA model
7.14 Recall and precision of top valid answer with BERT threshold increasing of the Open Domain QA model
7.15 Number of answers with cosine similarity threshold increasing
7.16 Valid answers recall and precision with cosine similarity threshold increasing
8.1 User Test interface of the project
8.2 Output example of the user test


List of Tables

2.1 Question Taxonomy introduced by Li and Roth (2002)
3.1 Question Classification for this project
3.2 Preliminary analysis of the search log based on keywords
3.3 Preliminary SVM classification result
5.1 Composition of the training data set
5.2 The training set of queries not starting with question words
5.3 The training set of queries starting with question words
5.4 Confusion matrix of initial result
5.5 Performance scores of initial result
5.6 Confusion matrix of hybrid model's result
5.7 Performance scores of hybrid model's result
5.8 Prediction of Hybrid Model on half year's data
6.1 Corpus composition for product-related QA
6.2 Numbers of good/bad answers on different thresholds
7.1 Validation result from the Open Domain QA model
7.2 Number of answers on different positions
7.3 Number of answers on different positions for queries having more than five search results
7.4 Number of answers of different BERT threshold of the Open Domain QA model
7.5 Number of queries of different BERT similarity threshold of the Open Domain QA model
7.6 Number of queries at different cosine similarity threshold
7.7 Test result for the Open Domain QA model
8.1 Criteria of the user test
8.2 Result for the user test
8.3 Result for product-related queries of the user test
8.4 Result for open domain queries of the user test


Contents

1 Introduction
1.1 Background
1.2 Research Question
1.3 Overview of the project
1.4 Outline

2 Related Work
2.1 Question Classification
2.1.1 QC taxonomy
2.1.2 SVM classifier
2.2 Question Answering
2.2.1 QA taxonomy
2.2.2 Procedure of QA
2.2.3 QA evaluation
2.3 Language Models
2.3.1 N-gram model
2.3.2 BERT model
2.3.3 Other models based on BERT
2.3.4 FastText
2.4 Summary

3 Preliminary Research
3.1 Query Data set
3.2 Question Classification
3.3 Search Log Analysis
3.4 Answer sources analysis
3.5 Summary

4 Framework
4.1 Framework
4.2 Classifier Module
4.3 QA Module
4.4 Summary

5 Query Classification
5.1 Data set for training the classifier
5.1.1 Initial data set
5.1.2 Separated Data set
5.1.3 Test set
5.2 Classifiers
5.2.1 Initial classifier
5.2.2 Hybrid model
5.3 Prediction on Search logs
5.4 Summary

6 Product-related Questions
6.1 Introduction
6.2 Corpus for product-related QA
6.3 Methodology
6.3.1 Text representation
6.3.2 Possible answer retrieval
6.3.3 Performance Validation
6.3.4 Answers proportion
6.3.5 DCG score
6.3.6 Threshold
6.3.7 Evaluation of the Product-related QA model
6.4 Summary

7 Open Domain Questions
7.1 Introduction
7.2 Data for Open Domain QA
7.3 Methodology
7.3.1 Definition query
7.3.2 Candidate retrieval
7.3.3 Candidates Selection
7.3.4 Answer generation
7.3.5 Performance validation
7.3.6 Relation between answers and positions
7.3.7 Threshold
7.3.8 Evaluation of the Open Domain QA model
7.4 Summary

8 User Test
8.1 Test Setting
8.2 Result
8.2.1 General result
8.2.2 Product-related queries result
8.2.3 Open domain queries result
8.3 Result analysis
8.3.1 Product-related
8.3.2 Open domain
8.3.3 Discussion
8.4 Summary

9 Conclusions and discussion
9.1 Summary
9.2 Conclusion
9.3 Limitations
9.3.1 About the classifier
9.3.2 About product-related QA
9.3.3 About Open domain QA
9.4 Future work
9.4.1 Query classifier improvement
9.4.2 Product-related QA improvement
9.4.3 Open domain QA improvement
9.4.4 Metadata query by using Knowledge graph

A User Test Result


Chapter 1

Introduction

1.1 Background

Elsevier is a global information analytics business specializing in science, technology and health. As an academic publishing and analytics company, Elsevier hosts a massive amount of academic publications.

One of Elsevier's products, Mendeley, is a reference manager and academic social network that provides resources for researchers. To access these resources, Mendeley provides a keyword-based search engine.

Nevertheless, some users type in entire questions rather than keywords (e.g. "what is an open system in organization"). Besides, some users use the search function to look for instructions about Mendeley (e.g. "how to download Mendeley desktop"). The keyword-based search engine only retrieves information by matching the query with titles and abstracts of publications and some metadata fields such as publication years and authors. Moreover, since it assumes that users only look for academic publications, the questions that users ask usually do not get sufficient results. According to a preliminary analysis of Mendeley's search log, more than 3% of queries are not keyword-based queries. This amount is too big to ignore. Those queries can be categorized into three classes: 1) metadata related questions, 2) product instruction related questions and 3) open domain questions. For the first two classes, the search engine does not return any sufficient results, and for the third class, the result depends on whether there is sufficient overlap with the words in titles and abstracts. But even if the search engine finds results, it still does not answer the query directly. These unsatisfactory results hamper the user search experience.


1.2 Research Question

With the context above, a way needs to be found to handle those queries which cannot get a sufficient result from the search engine. In the Natural Language Processing (NLP) domain, questions are handled by Question Answering (QA) techniques. This leads to the research question of the project:

Can Question Answering techniques improve user search experience for Mendeley users?

QA is a well-studied field with a lot of potential to handle questions and find corresponding answers. QA can process queries with different algorithms and understand the goal of a query. It does not require the input to be keywords but can handle natural language input, so it can cover the queries that do not meet the original assumption, namely that users only look for academic publications and type in keywords. Besides, there are different types of QA models which use diverse sources to handle the task, such as text-based QA and community QA. Each type has its advantages.

With QA techniques added to the search function, when users input a question they should be able to get sufficient results, which improves the user search experience.

1.3 Overview of the project

In this project, different QA models were investigated and combined to construct a hybrid QA model which can handle diverse types of questions.

The solution consists of two steps. The first is a query classifier which determines whether a query is a keyword-based query or a question; if it is a question, the classifier also indicates which type of question it is.

The query is then sent to the second component, a QA model, to find the answer. To handle different types of question, the QA model has two parts, one per question type. The solution is expected to handle the search problem mentioned above and improve the user search experience for Mendeley users. While it is currently an extra component of the search engine, it will be the foundation for an independent QA function in the future. At the same time it is extensible to other products and uses.

1.4 Outline

The thesis structure is as follows: Chapter 2 introduces related work. Chapter 3 gives results from the preliminary research. Chapter 4 gives an overview of the model. The first component of the model, the query classifier, is introduced in Chapter 5. Product-related QA and open domain QA are presented in Chapters 6 and 7. The result from the user test is in Chapter 8. In Chapter 9, conclusions and a discussion are presented.


Chapter 2

Related Work

This chapter provides a literature review of related work. Some topics relevant to Question Answering, such as question classification, are also helpful to the project. Because Question Answering is a well-studied topic, there are many works in the domain which provide much inspiration for the development of the project.


2.1 Question Classification

2.1.1 QC taxonomy

Question classification (QC) is normally a part of a QA system. One of the important question taxonomies was proposed by Li & Roth (2002) [14]. The classification introduced by their taxonomy is listed in Table 2.1. Their classification is a two-level system that contains a coarse (general) class and a fine (specific) class. Their approach made use of machine learning. The SNoW learning architecture they used is a sparse network of linear units which can be used as a general purpose multi-class classifier.

Table 2.1: Question Taxonomy introduced by Li and Roth (2002)

Coarse       Fine
Abbrev       abbreviation, expansion
Description  definition, description, manner, reason
Entity       animal, body, color, creation, currency, disease, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word
Human        description, group, individual, title
Location     city, country, mountain, other, state
Numeric      code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight

In 2016, Van-Tu and Anh-Cuong applied a linear SVM model using a combination of lexical, syntactic and semantic features to classify questions based on Li & Roth's work [21]. The accuracy reached 91.6%. Besides, a purely rule-based system for QC proposed by Madabushi & Lee [20] achieved 97.2% accuracy using the taxonomy proposed by Li & Roth. In their method, relevant words were extracted from a question based on its structure, which was defined as a syntactic map, and after that the rule-based classification was carried out.

The question classification applied in this research is novel, since former works on question classification normally focus on classifying questions by their corresponding answer type following Li & Roth's question taxonomy, which is not applicable in this project. The question classification in this project does not apply Li & Roth's taxonomy but instead classifies queries into product-related questions, metadata-related questions and open domain questions. This does not require identifying a specific answer type for a query.

Li & Roth's taxonomy is more suited to open domain QA. For the problem in this project, classifying queries according to Li & Roth's taxonomy would not help to narrow down the problem. In other words, treating all queries in question form with one method makes the problem more difficult to handle. Besides, the domain this project focuses on is the academic domain, because Mendeley provides a service for academic researchers. Meanwhile, the data used by the system consists of academic publications and related documents within Elsevier's resources. Because of this, the questions that the system needs to answer are mainly within the academic domain or Mendeley related.

2.1.2 SVM classifier

Support Vector Machine (SVM) is a sophisticated idea with a simple implementation [3]. SVM is robust and fast for handling classification tasks. It is a supervised learning approach that requires labeled data to train the model. The aim of SVM is to find a separating hyperplane which maximizes the margin between different classes. Support vectors are the data points that lie closest to the separating hyperplane.

The core of the query classifier in this project is a linear SVM trained on labeled search log data. It is quite efficient for a limited amount of training data and achieves acceptable performance.
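As a minimal, hypothetical sketch of this kind of classifier (not the exact configuration used in the project), a TF-IDF representation can be combined with a linear SVM in scikit-learn as follows; the example queries, labels and n-gram range are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder labelled queries; the real training data is labelled
# manually from the Mendeley search log.
queries = [
    "behavioral plasticity",              # normal query
    "how to download mendeley desktop",   # product-related
    "what is machine learning",           # open domain
    "the most cited paper in biology",    # metadata
]
labels = ["normal", "product-related", "open domain", "metadata"]

# TF-IDF features feeding a linear SVM; the exact n-gram range used
# in the project may differ.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(queries, labels)

print(classifier.predict(["how do I install the word plugin"]))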

2.2 Question Answering

Question answering (QA) is a discipline within the Information Retrieval (IR) and Natural Language Processing (NLP) domains. It aims at building systems that can answer questions posed in natural language.

2.2.1 QA taxonomy

Question Answering is a well-studied topic nowadays. There are different taxonomies for QA from diverse aspects.

According to the required knowledge, QA can generally be categorized into open domain QA and closed domain QA [16]. For open domain QA, questions can be in any field. Closed domain QA has a specific domain of knowledge to which the question and answer belong.

Another way to categorize QA is by method: 1) knowledge graph based QA, 2) text-based QA and 3) community QA [16]. This divides QA systems according to how they extract answers. For knowledge graph based QA, the model extracts answers from an existing knowledge graph, while text-based QA retrieves answers from its corpus, which is a collection of texts [5]. Community question answering is based on corpus data from a Q&A community. The similarity between the given question and existing questions is calculated to find a compatible answer [17].

QA can also be categorized into two paradigms: 1) Information Retrieval (IR) based QA and 2) knowledge-based QA [12]. IR based QA seeks answers to questions based on a massive amount of documents; using information retrieval, it looks for relevant information in the documents. Knowledge-based QA is similar to knowledge graph based QA. It requires a structured database in which the knowledge is kept. The question is represented semantically and mapped to a query over that structured database.

2.2.2 Procedure of QA

Mostly, the structure of a QA procedure is defined as three parts [1]: 1) question processing, 2) document processing, 3) answer processing, see Figure 2.1.

Figure 2.1: Architecture of QA procedure

The question processing part is the first step of the QA procedure. The natural language questions input by users go into this part, where they are classified and processed to obtain the necessary information for the next step. The QC mentioned above belongs to the question processing step. The input is also transformed into a question format compatible with the QA task.

The document processing part selects a set of relevant documents and extracts related paragraphs which may contain the answers. This step generates a data set for the answer processing step. The data retrieved in this step can be sorted by relevance.


The last step is answer processing, which extracts answers from the result of the document processing part. Depending on the type of QA, there are diverse approaches for this step. The answer should be a concise answer to the question, but sometimes it requires merging information from different documents, which is the most challenging case.

2.2.3 QA evaluation

The performance of QA is normally evaluated by conducting experimental evaluations on real data sets. There are many different data sets for different types of QA. The most famous one is SQuAD, and even today the performance of models on SQuAD is still improving.

SQuAD (Stanford Question Answering Dataset) [19] is a data set providing reading comprehension texts, including questions and the related answer texts. It is mainly used for text-based QA. A training set and an evaluation set are provided, and there is a leaderboard ranking the performance of different models on SQuAD [8]. According to the leaderboard, the state-of-the-art model at the moment is ALBERT, which stands for "A Lite BERT". Its F1 score reached 92, higher than human performance of around 89. Meanwhile, BERT, the foundation of ALBERT which came out in 2018, had an F1 score of 83.

A data set for evaluating knowledge graph based QA is provided by the QALD (Question Answering over Linked Data) challenge [15]. The general task of the QALD challenge is: given one or several Resource Description Framework (RDF) dataset(s) and natural language questions, return the correct answers or a SPARQL query that retrieves these answers.

In this project, the framework of the QA model basically matches the architecture mentioned in Section 2.2.2, and different types of QA are applied targeting different types of queries. For product-related queries, community QA is applied because there are answer sources in Q&A form. Open domain queries are handled by text-based QA because publications' abstracts are used as a corpus to generate answers. To some degree, both of these are IR-based QA, but in this project the open domain QA is seen as IR-based because it builds on the search engine, while the product-related QA is based on a static database. The SQuAD data was tried for evaluating the model, but according to the analysis elaborated in Chapter 7, it is not suitable for testing the model in this project.

2.3 Language Models

Language models are a crucial part of NLP. A language model is able to predict the probability of a sequence of words, and it represents text in a form understandable to the machine [9].

For QA tasks, language models also play an important role. In this project, several language models are used for classification, text selection, and text generation tasks.

2.3.1 N-gram model

The N-gram model is a simple probabilistic language model [12]. It is defined over word sequences of length N and can be used to predict the next item. The N-gram model is based on the assumption that the occurrence of the N-th word only depends on the N − 1 preceding words. Uni-gram and bi-gram are the cases where N is set to 1 and 2. For the uni-gram, each single word is seen as a unit. For the bi-gram, every two consecutive words are taken as one unit, so word order is taken into consideration.
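As a small illustration of the idea (a toy example, not part of the project's pipeline), a maximum-likelihood bi-gram probability can be estimated simply by counting:

from collections import Counter

# Toy corpus; a real model would be estimated on a large text collection.
tokens = "the user searches the mendeley catalog and the user reads".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_probability(prev_word, word):
    # Maximum-likelihood estimate P(word | prev_word)
    #   = count(prev_word, word) / count(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("the", "user"))  # 2/3: "the user" occurs twice, "the" three times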

2.3.2 BERT model

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model based on a deep bidirectional architecture. It is important for the NLP field, and many later NLP models are based on BERT [7]. Pre-trained on a huge corpus including the whole of English Wikipedia and the BooksCorpus, BERT has a strong ability for reading comprehension. Besides, unlike most language models, which read text sequentially from left to right, the core innovation of BERT is bidirectionality: it allows the model to understand the context of a word based on the words on both sides of it. This good language understanding ability makes BERT a good option for retrieving answers from documents.

BERT Architecture

The BERT model architecture is a multi-layer bidirectional Transformer encoder. There are two versions of the BERT model. One is BERT base, which is composed of 12 bidirectional Transformer encoder blocks, a hidden size of 768, and 110 million parameters. The other is BERT large, composed of 24 encoder blocks, a hidden size of 1024 and 340 million parameters.

Pre-training BERT

The pre-training of BERT used an English corpus that consists of the BooksCorpus (800 million words) and English Wikipedia (2,500 million words). Two unsupervised tasks were used to pre-train BERT, elaborated as follows.


The first task is the Masked Language Model. To enable the BERT model to take into consideration the surrounding context of a word rather than just the preceding words, bidirectional training is crucial. 15% of the input tokens in each sequence are masked randomly, and the model predicts only the masked tokens. By doing this, a bidirectional pre-trained model is obtained.

The other task is Next Sentence Prediction, which is carried out in order to train a model that understands sentence relationships. During training, a certain sentence A is given; 50% of the time the following sentence B is the actual next sentence after A, and 50% of the time it is a random sentence from the corpus. By training in this way, the model learns the correlation between sentences.

2.3.3 Other models based on BERT

SciBERT

SciBERT [2], based on BERT, is a model pre-trained on a large multi-domain corpus of scientific publications. It addresses the lack of high-quality, large-scale labeled scientific data when applying BERT to scientific NLP tasks. It is trained on papers from the corpus of semanticscholar.org. The corpus size is 1.14 million papers, 3.1 billion tokens. The full text of the papers is used for training.

Considering that most Mendeley users are researchers and most publications on Mendeley are academic, SciBERT might seem more suitable for the project. Nevertheless, in this project BERT is used as the core of the open domain QA, because it gives the model more ability to generalize and answer questions from more domains. The corpus used for training SciBERT is 18% from the computer science domain and 82% from the broad biomedical domain. For this project, this range is too limited, because the open domain QA may receive questions beyond these two domains.

ALBERT

ALBERT is A Lite BERT [13]. It mainly proposed two parameter-reduction techniques to decrease the resources needed for training BERT and to increase the training speed. Compared with BERT, ALBERT reduced the number of parameters by more than 10 times. Cross-layer parameter sharing is one of the techniques resulting in the decrease in parameters. What's more, sentence order prediction is a new task that makes the model better able to learn the relation between sentences. For BERT, next sentence prediction is influenced by topic prediction, because the positive sample has two sentences coming from the same document while the negative sample has two sentences coming from different documents, which may be on different topics. To avoid this effect, ALBERT uses a different method: 50% of the time, the next sentence is no longer chosen randomly, but the given sentence swaps position with the following sentence. The decrease in parameters makes the performance of ALBERT drop slightly, but sentence order prediction increases the performance. Overall, based on these updates, ALBERT has more room for expansion and achieves better performance than BERT.

ALBERT is not used in this project because there is no fine-tuned ALBERT for QA yet. Fine-tuning requires a GPU or TPU, which is more expensive. Since a fine-tuned BERT model for QA is already available, BERT is the first choice in this project.

Huggingface Transformers

Huggingface Transformers [22] provides many pre-trained models in different architectures for NLP tasks, including BERT and ALBERT. They provide a BERT model for QA which is already fine-tuned on SQuAD. Applying the Huggingface BERT model for question answering is very inexpensive, because there is no need to train or fine-tune the model. The low computational costs and strong ability make it the first choice in this project.
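As a minimal sketch of how such a ready-made model can be used (the checkpoint name below is one of the publicly available SQuAD-fine-tuned BERT models and is an assumption, not necessarily the exact model used in this project):

from transformers import pipeline

# A BERT checkpoint fine-tuned on SQuAD; assumed here for illustration.
qa_model = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = (
    "Machine learning is the study of computer algorithms that improve "
    "automatically through experience."
)
result = qa_model(question="What is machine learning?", context=context)

# The pipeline returns the answer span and a confidence score.
print(result["answer"], result["score"])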

cdQA-suite

cdQA-suite [18], targeting closed domain QA, is a text-based QA system based on BERT. It consists of two components: the Retriever and the Reader. The core of the Retriever is creating TF-IDF features based on uni-grams and bi-grams of a group of texts and the input question; the system then computes the cosine similarity between their vectors. The text with the highest cosine similarity is the most probable text where the answer might be found. This text and the corresponding question are sent to the Reader, a pre-trained deep learning model based on BERT, which generates the answer.
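The retrieval step can be sketched as follows (an illustrative re-implementation of the idea with scikit-learn, not the cdQA-suite API itself; the documents are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder document collection standing in for the texts the Reader
# would later extract an answer from.
documents = [
    "Mendeley is a reference manager and academic social network.",
    "The word plugin lets you insert citations while writing.",
    "You can create private groups to share references with colleagues.",
]
question = "how do I insert a citation in word"

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
document_vectors = vectorizer.fit_transform(documents)
question_vector = vectorizer.transform([question])

# The document with the highest cosine similarity is the most probable
# place to find the answer.
scores = cosine_similarity(question_vector, document_vectors).ravel()
best_index = scores.argmax()
print(documents[best_index], scores[best_index])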

The Reader in the cdQA-suite system requires a specific input format, and the whole QA procedure is slightly slower than using the Huggingface BERT model. For the QA system in this project, the open domain QA therefore uses the Huggingface BERT model.

2.3.4 FastText

FastText [11] is a library providing models for word representation and text classification. It is based on n-gram features and dimensionality reduction. It embeds each word of a query and then calculates the mean of the word vectors inside the query to obtain a vector for the query. The classifier in FastText is a linear classifier that trains quickly. Meanwhile, n-gram features make the model consider the impact of partial word order inside the input, which helps the model understand the language better.

FastText provides different kinds of pre-trained word vectors. One million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news data set are used in this project for the open domain QA model. Because there is no fixed corpus for the open domain QA model, it is not realistic to use TF-IDF vectors. In this case, pre-trained FastText word vectors provide an efficient way to perform the text representation task.
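A minimal sketch of this averaging-based query representation, assuming the publicly released wiki-news-300d-1M.vec file and loading it with gensim for convenience (the project's exact loading code may differ):

import numpy as np
from gensim.models import KeyedVectors

# Pre-trained FastText vectors in word2vec text format; the file name
# of the public release is an assumption.
word_vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

def embed_query(query):
    # Represent a query as the mean of the vectors of its known words.
    words = [w for w in query.lower().split() if w in word_vectors]
    if not words:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[w] for w in words], axis=0)

q1 = embed_query("what is machine learning")
q2 = embed_query("definition of machine learning")

# Cosine similarity between the two query vectors.
print(np.dot(q1, q2) / (np.linalg.norm(q1) * np.linalg.norm(q2)))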

2.4 Summary

The model proposed in this project basically follows the common architecture of the QA procedure. But due to the special circumstances of the project, the taxonomy used here is different from Li & Roth's widely used taxonomy. It was created based on the Mendeley search log, which makes it convenient for designing the details of the model. Considering the advantages of different types of QA, community QA is applied for the product-related queries and IR-based, text-based QA is applied for the open domain queries. Considering efficiency and performance, the fine-tuned BERT model for question answering from Huggingface is chosen as the core language model of the open domain QA, while FastText is chosen as part of the encoder of the open domain QA. In summary, the proposed model uses both traditional but strong techniques such as SVM and state-of-the-art techniques such as BERT.


Chapter 3

Preliminary Research

To have a better grasp of the situation and prepare for the project, preliminary research was carried out. In the preliminary research, data from the Mendeley search log was roughly analyzed to have a basic perception of question types, which was the foundation of the methodology designed in this project. This chapter describes the preliminary research result based on three months' data.


3.1 Query Data set

The data set used for the preliminary research comes from the search log of Mendeley. The original search log contains a lot of data that is not useful for this project, such as the search time, so the raw data was cleaned and only the text content of each query is used. The amount of search queries per month is around 400 thousand. At this stage, a rough analysis of the log was conducted based on data from October 2019 to December 2019. There are 1.1 million data points in this three-month data set, which is enough for a general picture of the search log.

3.2 Question Classification

From observation of the search log, questions as queries are not rare. Going through the log, many typical questions starting with question words such as the 5W1H words (what, when, who, where, why, how) are found among the queries. Besides, it is quite obvious that queries including "Mendeley" in the text are not normal keyword queries: users aim to find product instructions when they type in those queries. Additionally, some queries are not in question form, but aim for a fact rather than publications, such as "the most cited paper in biology".

Based on these observations and the different sources of answers, all queries in this project are classified into four classes, see Table 3.1.

Table 3.1: Question Classification for this project

Query Type                Example
Normal Query              behavioral plasticity
Product Related Question  how to download Mendeley desktop
Open Domain Question      where is epe, Nigeria
Metadata Question         the most cited paper in biology

First of all, normal queries. A normal query in this project is defined as a keyword query whose goal is to find publications. These normal queries can be handled by the existing search function.

Second, product-related questions. This type of query might consist of keywords or be in question form, while the core of the query is always related to Mendeley. Those queries expect user instructions, which are not covered by Mendeley's search function, so the search results for this type are always unsatisfying.


The third type of query is open domain questions. This type has a very broad range; answering the question is not domain-specific. It could be related to computer science, economics, or anything else. Those queries expect facts as answers rather than publications. Most of them are in question form starting with question words; some might be incomplete, only keeping the core phrase of the question and omitting the question words. The answer sources might be included in the publications on Mendeley; however, because most publications on Mendeley are scientific, there may be no answers within them. The tricky part of this type is that some publications have a title in question form, which means the result for this type is not always insufficient. For example, "Is a tax amnesty a good fiscal policy" is one query from the search log, but at the same time it matches a publication whose title is "Is a tax amnesty a good fiscal policy? A review of state experience in the USA". In this case, it is ambiguous whether the user really wants to ask the question or wants the publication, which makes it difficult to evaluate the answer.

The last type of query also expects facts as answers, but those facts are not included in any publications; they are about metadata. Because the answer source is not covered by the search index, the results are also very poor. Named entities such as university names also fall into this type, since they are not included in the search index either.

This query classification is used throughout the whole project, although most of the time the normal query type is not included when query types are mentioned. The classification is not only the basis for analyzing the search logs, but also the foundation for designing the QA solution.

3.3 Search Log Analysis

After the taxonomy of queries was established, the search log analysis was taken a step further. A more specific, rule-based classification was conducted. Keywords for the different types of queries were listed, and by filtering the search log on those keywords, rough amounts of each type were estimated, as shown in Table 3.2. The numbers shown there are out of 1.1 million queries. This first classification was aimed at giving an approximate impression of the composition of the search log; an empirical method was used, categorizing the 1.1 million data points by keyword filtering (see the sketch below).
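A rough sketch of this keyword filtering (the keyword lists follow Table 3.2; the matching order and exact matching rules are assumptions for illustration):

# Keyword lists following Table 3.2.
PRODUCT_KEYWORDS = {"mendeley", "cite", "citation"}
QUESTION_WORDS = {"what", "when", "where", "who", "why", "how"}   # 5W1H
METADATA_KEYWORDS = {"cited", "most", "top"}

def rough_category(query):
    # Assign a query to the first matching category; unmatched queries
    # are treated as normal keyword queries.
    tokens = query.lower().split()
    if any(t in PRODUCT_KEYWORDS for t in tokens):
        return "product-related"
    if tokens and tokens[0] in QUESTION_WORDS:
        return "open domain"
    if any(t in METADATA_KEYWORDS for t in tokens):
        return "metadata"
    return "normal"

print(rough_category("how to download mendeley desktop"))  # product-related
print(rough_category("the most cited paper in biology"))   # metadata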

This elementary result provides an initial impression of the proportion of each type of query, but it is unreliable due to incomplete keywords. For example, for the open domain questions, some questions start with question words other than the 5W1H words. What's more, the product-related question result shows that there are more diverse questions about the product besides inquiries about the citation function.


Table 3.2: Preliminary analysis of the search log based on keywords

Query Type                Keywords                        Amount
Product Related Question  "mendeley", "cite", "citation"  6,662
Open Domain Question      5W1H                            10,028
Metadata Question         "cited", "most", "top"          54

Nevertheless, this first classification was still valuable, because from its result a data set for training machine learning models to classify the search log was generated. Partial data points from those classified results were randomly chosen to form the training set.

The training set has 11,079 data points and all of them are labeled accordingly. Most of them came from the search logs used before: 5,560 data points are normal queries, 2,250 are product-related queries, 1,991 are open domain queries and 1,196 are metadata related queries. For product-related queries and open domain queries, generating training data was not a problem. But because there were too few metadata related queries, some data points were created manually, and the amount of data points in this class was expanded by oversampling.

Because of the limited training data, SVM was chosen as the initial classifier model. The features used for representing text are uni-gram and bi-gram TF-IDF. 80% of the data set was randomly split off as the training set and the remaining 20% was the test set. Then a linear SVM was trained on this training set. The accuracy on the test set was 89%, which makes it valid for predicting on the search log at this preliminary stage. The classification result of this SVM on the 1.1 million data points of three months' search logs is shown in Table 3.3.

Table 3.3: Preliminary SVM classification result

Query type       Amount     Approximate proportion
normal query     1,061,053  ≈ 96.3%
product-related  34,967     ≈ 3%
open domain      6,953      ≈ 0.6%
metadata         724        < 0.1%

The result was manually checked. Due to the huge amount of data points, it was impossible to go through all of them one by one. From a glance at the result, some frequent keywords representing each class stand out. For product-related queries, around 3.9 thousand queries contain "Mendeley" in the text, 3.1 thousand queries ask about the group function of Mendeley, 2 thousand queries are about the plugin function, and 1.7 thousand queries contain "how to" and ask for instructions about Mendeley. Besides, "login" and "account" showed up in 441 queries, and 420 queries ask about the citation function. These results show that there are at least 11 thousand queries that surely are product-related. In the open domain question result, some product-related queries were mixed in, because they are also in question form starting with question words. For the open domain result, most queries are valid, but there are some invalid ones such as the single word "are" and the blurry question "what is the theory". However, the result for metadata queries was poor. Out of 724 data points, around 150 do not make any sense: they are a single word or a short phrase such as "who", "top view", "articles". But it also showed that users might search for a domain, such as "articles on nursing" or "biotech articles". From this classification result, an approximate proportion of each type is obtained, as shown in Table 3.3.

As the result shows, more than 3% of queries are not normal queries, and product-related queries have the highest percentage among the three non-normal query types, which makes them the primary target and starting point.

3.4 Answer sources analysis

Different types of question answering systems might be applicable for different questions, depending on the answer sources.

For the primary target, product-related queries, there is a user guide section on the Mendeley website which provides a series of instructions covering citation, the reference manager, the word plugin, and so on. Besides, on the Elsevier service center there is also a Mendeley support center that provides user guidance in Q&A form, similar to Quora, plus a forum where users can raise questions and get solutions. All of those documents can be used as the answer source corpus for product-related queries.

Open domain queries are secondary. The required knowledge for those queries is not domain-specific. Since the project is for Mendeley, the answer sources for this class are the publications on Mendeley, which cover a broad range of knowledge.

As for metadata related queries, there is a C-graph (citation graph) including many facts such as authorship, citation counts, and concepts of publications. All of that information is stored as a knowledge graph. The ideal situation would be to use this knowledge graph to find answers for metadata related queries. But because metadata queries are rare, and the more accurate classification result in Chapter 5 shows that most of them are university names, metadata related queries are not handled in this project.

3.5 Summary

In the preliminary research, the query taxonomy for this project was established. It categorizes queries into four classes: normal queries, product-related queries, open domain queries, and metadata queries. The approximate proportion of each class was also obtained: around 2% of queries are not normal queries and cannot be handled well by the Mendeley search engine, which also motivated this project. Besides, the answer sources for each class of queries were selected to form the basis of the project.


Chapter 4

Framework

According to the preliminary research, it is clear that the three types of questions have different answer sources, and those sources are in different forms, which makes a one-for-all QA model infeasible. Therefore, diverse QA models targeting the different types of questions are implemented in this project. This chapter explains the overall approach used in the project.


4.1 Framework

The whole QA system consists of two parts: a classifier module and a QA module, see Figure 4.1.

Figure 4.1: Overview of the project framework

First of all, because the QA solution aims at improving the user experience, it is an extra component alongside the search engine, which means all queries still go into the search engine as shown in Figure 4.1. This ensures that the results users get are not narrowed down. For example, if users want a publication whose title happens to be a question, they can still get it via the search engine. It also minimizes the disadvantage that misclassification may bring. In this setup, when users input a query, they get results from the search engine as always, and QA results in addition if the query is classified as a question.

Figure 4.2: Details of the Query Classifier Module

4.2 Classifier Module

At the same time, all queries also go into the query classifier module, see Figure 4.2. This module consists of two parts and three models. The first part is a rule-based classifier, which checks whether a query starts with a question word. The query then goes into the second part, which consists of two classifiers; according to the result of the rule-based check, queries are sent to different classifiers. If a query starts with a question word, an SVM trained on a data set of questions starting with question words classifies it. Otherwise, another SVM, trained on a data set of questions that do not start with question words, does so. The output of this module is a classified query, and it also decides whether a query continues into the QA module or not. Details of the classifier module are presented in Chapter 5.

Figure 4.3: Details of the QA Module

4.3 QA Module

If a query is classified as a product-related query or an open domain query, it continues into the QA module. The QA module's structure is shown in Figure 4.3. There are two components in this module and each of them is a QA model. As mentioned in the preliminary research, different types of queries use different data for answers, so the type of each QA model is customized accordingly.

For product-related queries, a community QA model returns the answers; the corpus it uses consists of documents from the Mendeley support center and the user guide pages. This product-related QA model is elaborated in Chapter 6.


Open domain queries go into an information retrieval and text-based QA model. It mainly uses publications on Mendeley as the corpus; however, due to the huge amount of data, only the titles and abstracts of publications are used. In addition, data from the Topic Pages, another Elsevier product, is used to make the model more robust. The open domain QA is presented in Chapter 7.

The research in this project shows that more than 90% of metadata related queries are university names; metadata queries such as "the most cited paper" are quite rare. Handling them would not add much benefit to the QA model, and in the future entity recognition will be covered by the search engine. So in this project, metadata related queries are not considered and the model does not handle them.

4.4 Summary

The classifier module and the QA module form the QA system proposed in this project. To avoid degrading the user search experience, the QA system is an add-on component of the search engine: all queries still go to the search engine, but queries that are classified as non-normal queries (except metadata queries) also go to the QA module. According to their type, they are sent to different QA models. The following chapters elaborate on each part of the model.


Chapter 5

Query Classification

As shown in the framework, the first module that queries go into is the classifier module. Although a simple classifier was already trained during the preliminary research, and it did classify the queries into four different classes, its results are not precise enough to support the QA system. A more accurate classifier is required for further development of the QA system, to distinguish normal queries from the three types of questions mentioned above. This chapter elaborates on the methodology used for the query classifier module and the result of the more precise classification.


5.1 Data set for training the classifier

The data set used here is still from Mendeley's search logs. One important aspect is the quality of the training data: the more correctly labeled data points are used for training, the better the model can perform. At this stage, the same limitation as in the preliminary research applies: there is no ready-to-use labeled data.

5.1.1 Initial data set

A new data set for training the classifier was generated and labeled manually. It is based on the old data set used in the preliminary research, but more accurate data points have been added. The whole data set has 13,284 data points in total, and the composition of this new set is shown in Table 5.1.

Table 5.1: Composition of the training data set

Type             Amount
normal query     7,414
product-related  2,243
open domain      3,247
metadata         380

Normal queries in this set come from the classification result of the preliminary research. Data points were randomly extracted from that result and put into the new data set; while doing so, they were also reviewed to ensure the quality of the labeled data.

Product-related queries in the old data set were reviewed carefully to correct their labels, since their amount is not large. Besides label correction, some new product-related queries discovered during the preliminary research were imported into the data set to make the range of the training data as broad as possible. For example, during the previous analysis it was found that users use the search function to look for error instructions, such as "ActiveX component can't create object 429". This is an error description that users see when the error happens, and it showed up frequently in the result.

Open domain queries have the highest quality in the new set, because there are many open domain question data sets available. About 3 thousand open domain questions from one data set used for question classification were imported into the new set, and some incomplete questions were also added, such as "most productivity countries in the world" and "the most important ways of processing air from the particles". Importing from an external data set makes it easy to ensure the high quality of this part of the labeled data.

As for metadata-related queries, the old data set included many names. But in the future, Mendeley's search index will include names, which makes it unnecessary to classify name queries among the queries the QA system should handle. Moreover, names have the same format as keywords, which confuses a machine learning classifier. Therefore, the name queries were removed from the new data set. Due to the small amount of remaining metadata queries, it was easy to review all data points and ensure they are labeled correctly.

5.1.2 Separated Data set

The initial data set above was used for training the initial classifier. After the initial classifier was trained, a rule-based classifier was added to the classification module to improve the accuracy of the result. According to the rule (whether the query starts with a question word), queries are sent to different classifiers, which required re-training. The data set was also divided into two sets accordingly.

The data set for training the non-question-word starting model has 9,701 data points, see Table 5.2.

Table 5.2: The training set of queries not starting with question words

Type             Amount
normal query     7,414
product-related  1,825
open domain      203
metadata         259

As shown, none of the normal queries from the previous set start with question words. For open domain questions, the incomplete questions belong to this set, and there are some questions in a special format such as "Name a golf course in Myrtle Beach." and "In what Olympic Games did Nadia Comaneci become popular".

The other data set includes all data points starting with question words. It has 3,744 data points in total, see Table 5.3. Most of the product-related questions in this set are "how" questions, such as "how to create a group" and "how do I register and use Mendeley desktop". As expected, most of the open domain questions in the previous data set belong to this set. Most of the metadata-related questions contain "who", such as "who report on global tobacco epidemic 2019".


Table 5.3: The training set of queries starting with question words

Type             Amount
product-related  399
open domain      3,097
metadata         248

Here, "who" may mean "WHO" (the World Health Organization), in which case the query should not be a metadata-related question. Apart from this kind, the number of remaining metadata-related queries is very small. Due to the small amount of metadata-related questions in both sets, more data points of this class were added.

All of the above data sets were divided into a training set (80% data points) and a validation set (20%).

5.1.3 Test set

A test set was generated to evaluate the performance of the model. Because the data set used for training was generated from Oct-Dec search logs, this test set was generated from July-Sep search logs to prevent over-fitting.

The test set has 300 data points, including 201 normal queries, 28 product-related queries, 46 open domain queries, and 25 metadata-related queries.

5.2 Classifiers

The classifier module in the project is a hybrid of rule-based and machine learning methods. A rule-based model was added to make up for the shortcomings of the machine learning models, and it was therefore created after the machine learning model. The core is still the machine learning model, since it is very efficient and has solid accuracy.

5.2.1 Initial classifier

The high quality of the data set ensures that a machine learning model can be trained and make valid predictions. Because the data set has only about 13 thousand data points, which might not be enough for training a neural network, SVM remains the choice for the model.

To represent the query texts, TF-IDF features were applied. Considering the influence of word order, the TF-IDF features used here were based on uni-grams, bi-grams and tri-grams. The features were generated based on the whole initial data set. Before this step, stop words were removed from each text, with one exception: many question words, such as "is", "has" and "do", are in the standard stop word list, but these words play an important role in the classification, so they were kept during preprocessing.

After all data points were preprocessed and vectorized, they were used to train an SVM model. A linear SVM was used as the classifier. The trained model was validated on the validation set and the accuracy was 97.2%.

The performance on the test set is shown in Table 5.4 and Table 5.5. There is some misclassification, but an accuracy of 0.95 makes the model valid. One shortcoming of the model was discovered when reviewing the result: the model sometimes has trouble classifying queries starting with question words, such as "has the genome granted our wish yet" and "how open is innovation". Since the features are uni-, bi- and tri-grams, the weight of the starting word is not very large, which may cause this problem. Therefore, a rule needs to be added to make up for this shortcoming.

Table 5.4: Confusion matrix of initial result

normal query metadata product-related open domain

normal query 199 2 0 0

metadata 4 20 0 1

product-related 1 1 26 1

open domain 3 3 0 39

Table 5.5: Performance scores of initial result

precision recall f1-score amount

normal query 0.96 0.99 0.98 201

metadata 0.77 0.80 0.78 25

product-related 1.00 0.90 0.95 29

open domain 0.95 0.87 0.91 45

accuracy 0.95 300

macro avg 0.92 0.89 0.90 300

weighted avg 0.95 0.95 0.95 300


5.2.2 Hybrid model

After analysis of the SVM classification result, a rule was established to make up for the shortcoming that the classifier sometimes misclassified queries starting with question words. This rule-based classifier was com- bined with machine learning classifiers to form a hybrid model as shown in Figure 4.2 .

The first part of the model is a rule-based classifier. The rule used here checks whether a query starts with a question word. This check gives the starting word a larger influence on the classification result; its effect is stronger than adding a special start token in front of each query.

The second part of the model consists of two machine learning classifiers. Due to the good performance of SVM, SVM is still applied in this hybrid model, and the features used in this stage are the same as for the initial classifier.

For queries that do not start with a question word, the training procedure is the same as for the initial classifier, except for the data set. On the validation set, this classifier reaches 96.7% accuracy.

Table 5.6: Confusion matrix of hybrid model’s result

                   normal query  metadata  product-related  open domain
  normal query          196          4            0              1
  metadata                3         19            0              3
  product-related         0          1           27              0
  open domain             2          0            0             44

Table 5.7: Performance scores of hybrid model’s result

                   precision  recall  f1-score  support
  normal query        0.98     0.98     0.98      201
  metadata            0.79     0.76     0.78       25
  product-related     1.00     0.96     0.98       28
  open domain         0.92     0.96     0.94       46
  accuracy                             0.95       300
  macro avg           0.92     0.91     0.92       300
  weighted avg        0.95     0.95     0.95       300

To train the classifier for queries starting with question words, the data set needs one extra processing step: stop words, including the question words, were removed from the data set. As mentioned before, this question-word-starting set contains all classes of queries except normal queries. Since no query routed to this classifier can be classified as a normal query, keeping the question words would only add distraction for the classifier. After this processing, the SVM classifier was trained as before. Its accuracy on the validation set reaches 98.2%.
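Putting the pieces together, the routing logic of the hybrid model can be sketched as follows. This is a minimal illustration under assumptions: the exact question word list is not specified in the text, and question_start_clf and general_clf stand for the two trained SVM pipelines (trained without and with normal queries, respectively).

```python
# Minimal sketch of the hybrid model's routing step; the question word list and
# the two pre-trained classifier objects are assumptions, not the original code.
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how",
                  "is", "are", "was", "were", "do", "does", "did",
                  "has", "have", "can", "could", "should", "will"}

def classify_query(query, question_start_clf, general_clf):
    """Route a query to one of the two SVM classifiers based on its first word."""
    tokens = query.strip().lower().split()
    if tokens and tokens[0] in QUESTION_WORDS:
        # Classifier trained without normal queries (question words removed).
        return question_start_clf.predict([query])[0]
    # Classifier trained on all four classes.
    return general_clf.predict([query])[0]
```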

The performance of the hybrid model on the test set improved slightly, see Table 5.6 and Table 5.7. From the results, we can see that the f1-scores of product-related and open domain queries improved and that the number of misclassifications decreased.

5.3 Prediction on Search logs

With the improved classifier module, the search logs can be classified again to check whether the proportion of each class changes. Not only the Oct-Dec search logs but also the July-Sep search logs were classified. This half-year data set contains 1,873,325 queries in total. The classification result is shown in Table 5.8.

Table 5.8: Prediction of Hybrid Model on half year’s data

  query type       amount
  normal query     1811341
  product-related    28078
  open domain        16777
  metadata           17129

For these three types of questions, as shown in Figure 5.1, the highest proportion is taken by product instruction questions, which make up 1.55% of all data. Open domain questions make up 0.93%, while metadata questions make up 0.95%.

Figure 5.1: Proportion of different types of query

The result was manually reviewed. Due to the huge size of the data set, some results were randomly sampled for checking. However, considering that the recall of each type on the test set is not 100%, and that some kinds of questions in each class might not be included in the training data, the real proportion of each type might be even higher. Among all non-normal queries, product-related questions make up 45%, open domain questions 27%, and metadata-related questions 28%. 1000 data points were randomly sampled from the classification results of the product-related questions and of the open domain questions, and each data point was checked for whether it was correctly classified. For the product-related results, the accuracy is 95%: only 50 out of 1000 data points were wrongly classified. For the open domain results, the accuracy is 91.7%, with 83 data points wrongly classified. The metadata results contain many data points with the word “university” in them, which means they are likely university names. After filtering out the data points that include “university”, only 1408 data points are left, more than 300 of which are duplicates. The remaining data points are too few, and a manual check shows many invalid search queries such as “top” and “who 2016”. Considering this special situation, where more than 90% of the results are university names, it does not make sense to evaluate the rest of the metadata results.

This result shows that the classifier is capable of classifying queries, especially product-related and open domain queries. As for metadata queries, since most of the results are university names and the other types of metadata queries occur in very small numbers, they are not handled in this project, as explained in Chapter 4.

5.4 Summary

Based on the training data, the SVM classifier is able to handle the classification task to a reasonable degree, and applying the rule check makes up for its shortcoming. Judging from the test results on 300 data points and the classification results on half a year of data, the hybrid classifier is good enough to support the following QA models.


Chapter 6

Product-related Questions

Product-related queries have the highest proportion among all classes except for normal queries. But none of them can get an appropriate result, because the relevant documents are not in the Mendeley search index. The QA model for product-related questions uses documents outside of the search index to ensure that users can get the expected instructions.


6.1 Introduction

The QA model handling product-related queries is a community QA model that searches for answers in a corpus. The procedure is shown in Figure 6.1.

Figure 6.1: Overview of the Product-related QA model
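The core matching step of such a community QA model can be sketched as follows. This is a minimal sketch under assumptions: it matches the user query against the corpus questions with TF-IDF vectors and cosine similarity, and the corpus entries, threshold value and variable names are illustrative only.

```python
# Minimal sketch of matching a product-related query against the QA corpus;
# the corpus entries, similarity threshold and variable names are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus_questions = [
    "How do I install the citation plugin?",      # hypothetical corpus entries
    "How can I create a private group?",
]
corpus_answers = [
    "Answer text explaining the citation plugin installation ...",
    "Answer text explaining how to create a private group ...",
]

vectorizer = TfidfVectorizer(lowercase=True)
question_matrix = vectorizer.fit_transform(corpus_questions)

def retrieve_answer(query, threshold=0.3):
    """Return the answer of the most similar corpus question, or None if nothing is similar enough."""
    scores = cosine_similarity(vectorizer.transform([query]), question_matrix)[0]
    best = scores.argmax()
    return corpus_answers[best] if scores[best] >= threshold else None

print(retrieve_answer("how to install citation plugin"))
```

In the actual model, the corpus is the product-related QA corpus described in the next section.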

6.2 Corpus for product-related QA

As mentioned before, product-related queries get the worst results from the search function because no relevant documents are in the search index. However, on the Mendeley website there is a guide page providing instruction and guidance on the use of the different Mendeley services and functions, including the Mendeley reference manager, the citation guide, the desktop application, and so on. In addition, on the Elsevier website there are a Journal Article Publishing support center and a Mendeley support center providing more user instruction related to the product. All data on the websites mentioned above was collected to form the product-related QA corpus used for the community QA model in this part. The corpus has 1144 data points in total. Its composition is shown in Table 6.1.

Table 6.1: Corpus composition for product-related QA

  Source                                      amount
  Mendeley guide page                             22
  Journal Article Publishing support center      280
  Mendeley Support Center                        230
  Mendeley support center forum                  612

The data points from the Mendeley guide page are the titles of the sections on the page, such as “Citation Plugin”, “Groups” and “Mendeley Reference Manager”. These sections offer general information about each function or service, and inside each section there are subsections with details.

The data points from the Mendeley support center and the Journal Article Publishing support center are also titles, but the information on those pages is in Q&A form, as shown in Figure 6.2. To decrease the influence of noise, only titles that are questions were taken. The Q&A form is also very suitable for the community QA model, and the answers to the questions on these two pages are officially given by Mendeley and Elsevier.

Figure 6.2: Example of Mendeley support center

The forum on the Mendeley support center is a platform where users can raise and answer questions, see Figure 6.3. Each question on the forum might have several answers, some of which come from other users.

Figure 6.3: Example of Mendeley support center forum

As shown, the title of a question raised by a user can be very short and implicit: it can be a single word or a short phrase, and it is sometimes even identical to titles from the Mendeley guide page. The information in the titles alone is not enough. So for data points from the
