
Artificial Intelligence

Bachelor Thesis

On the tip of my tongue: A potential

dataset for Question Answering systems

by

Buck Boon

10549110, 18 EC, November 2019 - March 2020

Supervisor:

S. Bhargav, MSc

Examiners:

S. van Splunter

S. Bhargav

University of Amsterdam

Faculty of Science


Abstract

This paper is concerned with the task of building a dataset for multi hop reasoning using a Subreddit on Reddit called "Tip Of My Tongue". This Subreddit produces a question-answering format which could potentially be used for further research into multi hop reasoning and information retrieval. We build the dataset using PRAW (a Reddit API, https://praw.readthedocs.io/en/latest/), pushshift (an API to scrape Reddit, https://pushshift.io) and an Imdb API (https://imdbpy.github.io/) to scrape Imdb, and connect the information from both sources to form a clean dataset after filtering. Once the dataset is produced, TF-IDF and bm25 are implemented as baselines and Pytrec Eval is used to evaluate their results. Creating a dataset from this kind of Subreddit has the advantage that the data is organic and complex; a downside, however, is that it is very noisy. This dataset is a step towards further research into improving datasets for multi hop reasoning and conversational search, and towards improving the quality of search engines where longer queries are used.


Acknowledgements

I would like to thank Samarth for helping me and enabling me to do this thesis, and Sander for giving me the chance to pick a different subject.


Contents

1 Introduction

2 Background
   2.1 Reddit
      2.1.1 Subreddit: TipOfMyTongue
   2.2 Information Retrieval (IR)
   2.3 TF-IDF and bm25
   2.4 Cosine Similarity
   2.5 Pytrec Eval
   2.6 Multi Hop Reasoning

3 Related Work
   3.1 HOTPOTQA
   3.2 DuoRC
   3.3 NarrativeQA

4 Methodology
   4.1 Scraping the dataset
   4.2 Text Preprocessing
   4.3 QA pair examples
      4.3.1 Example 1
      4.3.2 Example 2
      4.3.3 Example 3
      4.3.4 Example 4
      4.3.5 Example 5
      4.3.6 Tokenizing
   4.4 Data format
   4.5 Retrieval Baselines
   4.6 Evaluation
   4.7 Evaluation metrics
      4.7.1 Recall
      4.7.2 Mean average precision (MAP)
      4.7.3 Normalized discounted cumulative gain (NDCG)

5 Results
   5.1 Scraping Data
   5.2 TF-IDF
   5.3 BM25
   5.4 Pytrec Eval results
   5.5 Attributes of the dataset

6 Conclusion
   6.1 Future Work
      6.1.1 Earlier Examples
      6.1.2 Extracting multi-turn conversations from TOMT
      6.1.3 Other categories on TOMT
      6.1.4 Building a Knowledge Graph


1 Introduction

In everyday life, the internet is one of our main sources of information for answering questions. This need for information is expressed as a query: a string in the form of a question or statement. For simple queries, a search engine will usually find the correct answer. However, when a longer or more complex query is asked, faulty or unrelated answers appear (Kotov and Zhai, 2010). Information Retrieval (IR) and Question Answering (QA) methods often do not give the right answers to lengthy or complex natural language queries (Kotov and Zhai, 2010). For such complex queries, multi hop reasoning is a way of answering the question: one has to look at multiple contexts to derive an answer, taking entities from multiple paragraphs and linking them together.

For example, a link can be formed between information piece 1 and information piece 2 to answer the complex query. Question Answering has been a trending topic in the past few years because of the increasing need for information: a natural language question gets answered by looking at contexts to infer an answer. Two well-known examples are Siri and Google Now, which both process a natural language question and find an answer to it. Several studies using Deep Learning algorithms (Saha et al., 2018) have shown great promise for the future of Information Retrieval (Yang et al., 2018).

To give high quality answers to natural language questions, a good dataset is necessary to train a model on. However, such datasets are hard to find and most of them are not public (Saha et al., 2018). In response, we looked for a way to build a dataset from a natural language QA type of forum.

A specific subreddit on Reddit called "On The Tip Of My Tongue" has a natural language oriented QA format, where the original poster (OP) asks about a song/movie/game and redditors try to resolve the question by posting their answers, which the OP often evaluates until the case is solved. If a poster answers correctly, the OP replies "Solved!" and the case is closed. Because of this, our reasoning was that this subreddit could potentially be a good source for a multi hop reasoning information retrieval dataset. At first we wanted to build a conversational search dataset, but this proved too challenging in the short amount of time allowed for the thesis; we therefore build a single turn multi hop reasoning dataset instead.

The dataset will be published publicly for future research, on which, for example, natural language models can be trained.

We discuss background information on the methods used in Section 2 and work related to ours in Section 3, to give a general idea of the thesis. In Section 4 we show which methods are used and how the dataset has been created, including the baselines and metrics.

In Section 5 the results from Section 4 are discussed, after which we conclude the thesis in Section 6, including Section 6.1 on future work.

The public GitHub repository for this thesis, including all necessary code, is: https://github.com/BuckBucket/publicdatasetTOMT

2 Background

2.1 Reddit

Reddit is an online forum-like website where user-generated content is posted, including text posts, photos, videos and links. Reddit is divided into subcategories, also known as "subreddits". Reddit uses a voting system in which redditors can upvote or downvote a post, which increases or decreases the visibility of that post. Besides upvotes and downvotes, the age of a post is also taken into account to determine its ranking.

Another aspect of reddit is the use of ”Karma”, which is a points-based system for the users. These Karma points can be earned by various means and show the contribution of a user to the community on reddit.

2.1.1 Subreddit: TipOfMyTongue

The subreddit "TipOfMyTongue" is a subreddit where people can ask questions about topics including games, movies and music. From the subreddit's own description:

”Can’t remember the name of that movie you saw when you were a kid? Or the name of that video game you had for Game Gear? This is the place to get help. Read the rules and suggestions of this subreddit for tips on how to get the most out of TOMT.”.

Because of the way this subreddit works, it is an interesting topic for information retrieval and also the main information source for this thesis. Besides the question-answer form of the posts and their comments, which is what we are looking for, it also has strict rules for posting a submission:

• Reply “Solved!” to the correct answer

• Title format: You must have [TOMT] at the beginning of your post title. Add the type of media to the beginning of your post and a time period, e.g. “[TOMT][MOVIE][2000s]”, so that it’s easier to scan.

• Make both your title and body as detailed as possible to help the solvers.

• Include the timeframe/year for older items, we don’t know timeframes of your lives and when you were a kid. Only ask one question per post.

This makes the subreddit even more interesting, since a post can be solved by replying "Solved!" to the post, which in turn makes sure the flair (a label which shows in what stage the post is: unsolved or solved) is set to Solved.

A full submission would have a title, flair and a post body. In the post body the question will be asked about the movie that the original poster is looking for:

Figure 2.2: Box 1: the title of a post, box 2: the solved/unsolved flair, box 3: the body of the post

The second point makes it easier to filter on movies, which is the category we will use for the thesis. We discuss this in Section 4.2.

2.2 Information Retrieval (IR)

Information retrieval systems have been around since as early as the 1930s (Sanderson and Croft, 2012). An IR system is typically a system in which a query posed by a user gets answered by searching through the documents in which the answer to the query resides. With the growth of data and of users issuing queries (for example on Google), the need for information retrieval systems increased (Sanderson and Croft, 2012).

In the early stages of IR, a numerical indexing method was applied to collections, where items with the same numbers share the same topic, so it would be easy to track which collections are related. Another method was introduced by Taube et al. (Taube et al., 1952; Sanderson and Croft, 2012), which used a list of keywords for indexing. Following this came a ranking system, where each document in the collection is given a score for its relevance to the query, which would be the start of term frequency (Luhn, 1957).

In the 1970s, Luhn's Term Frequency (TF) score and Spärck Jones's Inverse Document Frequency (IDF) were combined to form the TF-IDF score as we know it. Since then, multiple weighting schemes have been built upon TF-IDF, including bm25 (Sanderson and Croft, 2012).

One of the problems that occurs in IR systems is lengthy queries, resulting in incorrect or low quality information being returned (Sanderson and Croft, 2012). Apple's Siri, Google, IBM's Watson and Yahoo! have come far in improving this (Sanderson and Croft, 2012) but have not been able to fully solve it (Tulshan and Dhage, 2018), which motivated this thesis.

2.3 TF-IDF and bm25

The TF-IDF (term frequency - inverse document frequency) algorithm is commonly used as it provides excellent results on query-document information retrieval (Ramos et al., 2003). The TF part of TF-IDF, as mentioned earlier (Luhn, 1958), can be explained as:

“The weight of a term that occurs in a document is simply proportional to the term frequency.” https://en.wikipedia.org/wiki/Tf-idf

The mathematical formula:

\[ \mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2.1} \]

where t is a term of the query and d a document. The numerator is the number of times term t occurs in document d, denoted by f_{t,d}. The denominator is the sum of the frequencies of all terms t' in the same document.

The IDF part of TF-IDF was introduced by Jones (1972) as a measure of term specificity. IDF weighs how many documents a term occurs in and uses this to improve information retrieval results.

The IDF algorithm can be worded by:

“The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.”

https://en.wikipedia.org/wiki/Tf-idf

IDF measures how informative a specific word is for relevance: stopwords, for example, will have a lower value than other words despite their high frequency in texts. The mathematical formula for IDF:

\[ \mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2.2} \]

where D is the collection of N documents, d a document and t a term.

After this, the TF-IDF score can be calculated as TF-IDF = TF * IDF, which gives the weight of each term (word) in a document.
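As a concrete illustration, the sketch below computes these weights by hand for a toy tokenized corpus; the documents, terms and helper names are made up for the example and are not the thesis' code.

import math
from collections import Counter

# Toy tokenized corpus (hypothetical example documents).
docs = [
    ["the", "movie", "about", "a", "spy", "in", "russia"],
    ["a", "movie", "with", "a", "dog", "and", "a", "boy"],
    ["russia", "spy", "thriller", "from", "the", "nineties"],
]

def tf(term, doc):
    # Eq. 2.1: term count divided by the total number of terms in the document.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Eq. 2.2: log of (number of documents / number of documents containing the term).
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("spy", docs[0], docs))    # relatively high: "spy" is rare in the corpus
print(tf_idf("movie", docs[0], docs))  # lower: "movie" occurs in two of the three docs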

TF-IDF formed a basis for the Okapi bm25 algorithm (Robertson et al., 1995), which produced great results as a ranking algorithm (Sanderson and Croft, 2012). The bm25 algorithm differs a bit from the TF-IDF algorithm:

\[ \mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{fieldLen}}{\mathrm{avgFieldLen}}\right)} \tag{2.3} \]

where q_i is the i-th query term and IDF(q_i) is the IDF score of that term.

The fieldLen/avgFieldLen part of the bm25 formula decreases the score if the document is longer than the average document and increases it if the document is shorter. The variable b, set to a standard value of 0.75, weights this length normalisation in the denominator.

The parameter k_1 controls the influence of the term frequency: below k_1 an extra occurrence of a term still has a large impact on the score, while above k_1 additional occurrences become less influential.

It uses a different way of calculating the IDF score. The formula used is:

\[ \mathrm{IDF}(q_i) = \ln\left(1 + \frac{\mathrm{docCount} - f(q_i) + 0.5}{f(q_i) + 0.5}\right) \tag{2.4} \]

where q_i is the i-th query term, docCount the total number of documents and f(q_i) the number of documents that contain q_i.
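To show how Equations 2.3 and 2.4 combine in practice, the following is a hedged sketch of bm25 scoring over a toy tokenized corpus; the documents, the query and the parameter values k1 = 1.2 and b = 0.75 are assumptions for the illustration, not values from the thesis.

import math
from collections import Counter

# Toy tokenized corpus (hypothetical example documents).
docs = [
    ["blonde", "double", "agent", "operating", "in", "russia"],
    ["two", "kids", "find", "a", "treasure", "on", "an", "island"],
    ["rock", "band", "singer", "dies", "before", "album", "release"],
]
k1, b = 1.2, 0.75                                   # commonly used default parameters
avgdl = sum(len(d) for d in docs) / len(docs)       # average document (field) length

def bm25_idf(term):
    # Eq. 2.4: docCount is the corpus size, n the number of documents containing the term.
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

def bm25_score(query, doc):
    # Eq. 2.3: sum the weighted, length-normalised term frequencies over the query terms.
    freqs = Counter(doc)
    score = 0.0
    for term in query:
        f = freqs[term]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += bm25_idf(term) * f * (k1 + 1) / denom
    return score

query = ["double", "agent", "russia"]
print([bm25_score(query, d) for d in docs])         # the first document scores highest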

2.4 Cosine Similarity

To measure the similarity between two vectors, in this case, the TF-IDF vectors of a query and the documents, the cosine similarity between these vectors can be used. If the vectors are the same they will have a cosine similarity of 1, since the cosine of 0 is 1. If they are completely different, it will result in a 90 degrees angle between the vectors, and the cosine similarity value will be 0.

The formula for the cosine similarity is:

\[ \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{2.5} \]

where A is the TF-IDF vector of a query and B the TF-IDF vector of a document. The denominator divides by the length (norm) of vector A multiplied by the length of vector B.
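A small numerical check of Equation 2.5, using two made-up vectors in place of real TF-IDF vectors:

import numpy as np

# Hypothetical TF-IDF vectors for a query (a) and a document (b).
a = np.array([0.3, 0.0, 0.8, 0.1])
b = np.array([0.2, 0.5, 0.6, 0.0])

cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)   # 1.0 for identical directions, 0.0 for orthogonal vectors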

2.5 Pytrec Eval

The evaluation method for TF-IDF and bm25 on the data is Pytrec Eval (Van Gysel and de Rijke, 2018), a Python interface that is easier to use than TREC's Trec Eval (https://github.com/usnistgov/trec_eval), on which it builds.

Trec Eval's explanation from GitHub: "trec_eval is the standard tool used by the TREC community for evaluating an ad hoc retrieval run, given the results file and a standard set of judged results."

2.6 Multi Hop Reasoning

When a query requires an answer that does not have a direct link to the query (i.e. it requires multiple steps over multiple entities to arrive at the required answer), this is called multi hop reasoning. IR systems perform poorly at this because of the extra steps needed (Yang et al., 2018). (Zhang et al., 2018) have researched this topic and show that traditional ways of question answering use keyword matching or frequency-based methods, just like TF-IDF or bm25.

An example from (Yang et al., 2018), where the facts taken for the reasoning are numbered, would be:

• Paragraph A: [1] Return to Olympus is the only album by the alternative rock band Malfunkshun. [2] It was released after the band had broken up and after lead singer Andrew Wood (later of Mother Love Bone) had died of a drug overdose in 1990. [3] Stone Gossard, of Pearl Jam, compiled the songs and released the album on his label, Loosegroove Records.

• Paragraph B: [4] Mother Love Bone was an American rock band that formed in Seattle, Washington in 1987. [5] The band was active from 1987 to 1990. [6] Frontman Andrew Wood's personality and compositions helped to catapult the group to the top of the burgeoning late 1980s/early 1990s Seattle music scene. [7] Wood died only days before the scheduled release of the band's debut album, "Apple", thus ending the group's hopes of success. [8] The album was finally released a few months later.

• Q: What was the former band of the member of Mother Love Bone who died just before the release of "Apple"?

• A: Malfunkshun

• Supporting facts: 1, 2, 4, 6, 7

• Non-supporting facts: 3, 5, 8

The supporting facts show how the inference of the answer to the query was taken from the paragraphs.


3 Related Work

This thesis uses a multitude of algorithms and techniques to build the dataset and to compute statistics on the quality of that dataset. This section focuses on the general idea of building the dataset and on related work.

3.1 HOTPOTQA

(Yang et al., 2018) use crowdsourcing to create a dataset called HOTPOTQA. This dataset was built by letting annotators see two paragraphs and having them build questions based on these paragraphs. This way, a dataset was built which requires reasoning. In this thesis, the dataset is similarly built by collecting connected pieces of information, in our case about movies.

An interesting part of the paper is that they have used hyperlinks on Wikipedia, as they found that they entail a connection between the question and the answer. They reason that this could be used to naturally reason over the question and answer, similar to this thesis.

3.2 DuoRC

A dataset which has been built on a concept similar to ours is DuoRC (Saha et al., 2018). Its authors wanted to address four shortcomings of existing datasets (Saha et al., 2018):

• To have a large number of questions with low lexical overlap between the questions and their corresponding passages

• To use common-sense and background information beyond the passage itself to arrive at the answer

• Narrative passages are included in the movie plots, which require complex reasoning across multiple sentences to infer the answer

• Several questions in the dataset cannot be answered from a given passage (which is similar to our case), therefore, the machine has to detect unanswerability.

The dataset was made by taking one short and one long version of a summary of a movie, where the short one is taken from Wikipedia and the long one from Imdb. They first showed crowdworkers from Amazon Mechanical Turk (AMT) the Wikipedia summary of a movie and asked them to create QA pairs, after which they showed other AMT workers the Imdb version of the summary and the same questions. After this they had two versions of each QA pair, differing in plot detail, narration style and vocabulary. Eventually they created a dataset from 7680 pairs of parallel movie plots, containing 186k human-generated QA pairs.

While DuoRC has human-generated question pairs, of which some are unanswerable, it is still very similar to this thesis. The idea of creating question pairs from movie plots, and especially from Imdb, makes it a very interesting paper as a reference for future work.


3.3 NarrativeQA

Another paper with a similar approach to building a dataset as DuoRC (Saha et al., 2018) and this thesis is the NarrativeQA paper (Hermann et al., 2015). They created a dataset using plot summaries and full scripts of movies, where the average length of a full script is 60k words. The questions made from the short summaries are similar to those of other datasets; the questions over the full scripts, however, differ from, for example, the DuoRC (Saha et al., 2018) dataset, as this setting is more ambitious. As with the previous datasets, they trained a model on the data and reported their results.


4 Methodology

For this thesis, we looked at a specific Subreddit “On The Tip Of My Tongue” on Reddit. On this Subreddit there is a way of question answering which could be used for multi hop reasoning information retrieval, because the queries are phrased in natural language.

For example, when a person asks about a movie with Angelina Jolie in it, the readers might connect Angelina Jolie to the movies she has played in and connect those movies to specific details to eventually find the movie the person is looking for. We stick to the (solved) 'movie' category of questions in this thesis, since it has plenty of threads and we can collect information about movies from Imdb. Furthermore, we limit ourselves to single-reply threads in which the reply contains an Imdb link.

The format in which the questions are asked is, for example: "I am looking for a movie in which a blonde woman is a double agent for America while operating in Russia." Multiple people will then try to find the movie in the comments, until eventually the movie is found and the original poster (OP) replies to the correct answer with "Solved!" (in various forms).

This allows us to filter the answers on specific factors and build a dataset out of them, using an Imdb API to find the context of the movie, so that we can use information retrieval algorithms and benchmarks to look at the link between the data of the Subreddit and the data of Imdb. In Python, we used PRAW to scrape Reddit and applied a filtering system to the comments, removing blacklisted user comments (such as mods or bots) and comments made by the original poster which add no value. This eventually leads to a clean dataset of solved movies which are answers to the original questions.

After acquiring the dataset, we wanted to try and build a knowledge graph to visualize the way the multi hop reasoning works. However, due to time constraints, we leave this to future work (see Section 6.1). We used two benchmarks/baselines, TF-IDF and bm25, to find out how many errors there are in the dataset and whether the connection between the Subreddit data and the Imdb data makes sense. We expect a lot of errors, since natural language normally receives poor evaluations from such benchmarks (Tulshan and Dhage, 2018). This is the motivation for this thesis, since it shows that further research is necessary on this subject.

4.1 Scraping the dataset

To scrape the dataset, we used an API called PRAW (https://github.com/praw-dev/praw) to scrape the Subreddit "TipOfMyTongue". PRAW has a class called Subreddit with which a Subreddit can be selected to work on, and within it a Submission class with which all submissions from the Subreddit can be loaded. We then take a look at all categories on the Subreddit to find our subject to analyse:

Note that there are several uncategorized posts, which we do not consider. The next most frequent type of post, with around 40k posts, are 'movie/tv' posts, which we consider in our work. The data from the Subreddit between 1 December 2018 and 1 December 2019 consists of 211,078 posts and a total of 142,619 replies. For this thesis, we want posts without an excessive number of comments, since we use single turn answers and only need one reply containing the answer. After looking at the total number of replies of each post and plotting it, we can conclude that this was indeed the case:


Figure 4.1: All categories on the Subreddit On The Tip Of My Tongue

Figure 4.2: Number of replies per post, showing that a low number of replies is the average.

This benefits us because for our dataset we would prefer a low number of comments, seeing that we are building a single turn dataset and would like the noise to be as low as possible.

We only want to focus on one category of the Subreddit in this thesis, since focusing on all of them would be too complex within the estimated time for this thesis. We will focus on the category 'Movies'. We first took a look at how much usable data there was and soon found that there were enough movies to work with.

We came to use the solved movies category, which allowed us to use Imdb as a reference point for the information given for the solved movies. Using the Submission class, we found that we could filter on flair, a small label that shows whether a submission has been solved or not.

Firstly, we could filter all submissions in the category movies on solved, which is exactly what we needed. This gives a dataset of solved submissions in the category movies; however, the comments leading to the solved flair were challenging to interpret, since natural language on forums can be chaotic to a certain extent and contains a lot of uninformative text. To prevent this from making the data less precise, we filtered all comments on these submissions for Imdb links and discarded the excess data.
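A rough sketch of this collection step is given below; the credentials, the submission limit and the exact filtering rules are placeholders rather than the thesis' actual configuration.

import re
import praw

# Placeholder credentials; PRAW needs a registered Reddit application to run.
reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="tomt-dataset-sketch")

IMDB_RE = re.compile(r"imdb\.com/title/(tt\d+)")
pairs = []

for submission in reddit.subreddit("tipofmytongue").new(limit=1000):
    flair = (submission.link_flair_text or "").lower()
    if "solved" not in flair or "[movie]" not in submission.title.lower():
        continue                                     # keep only solved movie posts
    submission.comments.replace_more(limit=0)        # flatten the comment tree
    imdb_ids = [m.group(1) for c in submission.comments.list()
                for m in [IMDB_RE.search(c.body)] if m]
    if len(imdb_ids) == 1:                           # single reply with an Imdb link
        pairs.append({"submission_id": submission.id,
                      "imdb_id": imdb_ids[0],
                      "title": submission.title,
                      "body": submission.selftext})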


Secondly, we now had a dataset of solved movies in which the 'winning' comment had an Imdb link to the actual movie that was the solution to the question in the submission. We also needed to take into account how many comments were posted on a submission. At first, more comments seemed beneficial, but eventually the best approach was to look for a minimal number of comments leading to the solved flair. The number of replies on a submission has a mean of 7.015 and a median of 6, whereas the mean number of top comments is 2.905 with a median of 2 top comments per post.

Figure 4.3: Number of top comments per post, showing that the average is around 2, which is what we want because it needs to be as low as possible for the dataset

The next step was to get all the information from Imdb for the movie given by the winning comment. We did this by using an Imdb API, which takes an Imdb id (which is in the URL given by the winning comment) and uses it to scrape the Imdb page for title, category, summary, synopsis and cast. This information was then stored and linked to the dataset with the submission information and comments.

Figure 4.4: An example of a submission with 1 comment reply with an imdb link

In Figure 4.4 we see that there are two replies; however, one of them is from a bot. This bot comment has been removed for the dataset, so the thread is effectively single turn.
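The Imdb side of this step could look roughly like the sketch below, using the IMDbPY package mentioned in the abstract; the availability of keys such as 'plot', 'synopsis', 'cast' and 'genres' varies per title, so this is an illustrative assumption rather than the exact code used.

from imdb import IMDb

ia = IMDb()

def fetch_movie(imdb_id):
    # IMDbPY expects the numeric part of an id such as "tt0097419".
    movie = ia.get_movie(imdb_id.lstrip("t"))
    return {
        "imdb_id": imdb_id,
        "title": movie.get("title", ""),
        "genres": movie.get("genres", []),
        "cast": [person["name"] for person in movie.get("cast", [])[:10]],
        "plot": (movie.get("plot") or [""])[0],
        "synopsis": (movie.get("synopsis") or [""])[0],
    }

print(fetch_movie("tt0097419"))   # the Imdb id from Example 5 in Section 4.3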

4.2 Text Preprocessing

Before encoding the text for the TF-IDF or BM25 algorithm, it is necessary to process the text and clean it up. This is called Text Preprocessing.

When the submissions are downloaded and categorized as solved movies, we look at the original question and context of the original poster (OP) and at the comments of the submission. The part that needs to be filtered and cleaned are the comments, as they produce the most noise. As mentioned earlier in the rules of the Subreddit, the OP has to reply "Solved!" to the comment that answers the original question. However, there are some other comments that need to be removed to reduce the noise.

There is a habit in the comment section of a submission where the original poster sometimes posts a "mandatory comment", which is just the first comment on the post, perhaps to raise its visibility. We removed these by using PRAW to check whether a commenter is the OP and removing the comment if it is not "Solved!". There are also numerous comments made by bots or moderators, which are luckily on a visible list on the Subreddit. We used this list as a "blacklist" to remove all comments made by these authors and further clean the data.
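A minimal sketch of this cleaning step is shown below; the blacklist entries and the "Solved" pattern are illustrative placeholders, not the actual lists used for the dataset.

import re

BLACKLIST = {"AutoModerator", "SomeTOMTBot"}          # hypothetical bot/moderator names
SOLVED_RE = re.compile(r"\bsolved\b", re.IGNORECASE)

def clean_comments(submission):
    kept = []
    for comment in submission.comments.list():
        author = str(comment.author)
        if author in BLACKLIST:
            continue                                  # drop bot and moderator comments
        if author == str(submission.author) and not SOLVED_RE.search(comment.body):
            continue                                  # drop the OP's "mandatory comment" noise
        kept.append(comment)
    return kept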

4.3 QA pair examples

We have taken five samples from the QA pairs which show the natural language relation between query and answer. The texts overlap minimally, but the context shows the link between the pair. In the query and answer, text that matches in context is colored the same:

4.3.1 Example 1

Query: 1794147 [TOMT] [TV Show] Show about people moving to Hollywood to try and make it. And possible remake of it. It was an old show about some apartment complex where it was a bunch of people trying to make it in Hollywood. I believe it was a Canadian made show. I need the name of that and also I remember reading 6 months to a year ago they were remaking it. Anyone know if that happened and if so what that one is called?

Answer: 1794147 Full of new relationships, salacious temptations and make-it-or-break-it decisions, the series' coming-of-age dwellers grapple with life's defining moments as they deal with the unreality of the show-biz industry.::Bell Media Michael Kash The L.A. Complex Drama Highland Gardens follows the ins and outs of an apartment-style motel in LA which draws aspiring Canadian hopefuls looking for a place to rest their heads while they chase their Hollywood dreams.

4.3.2 Example 2

Query: 0107002 [TOMT][MOVIE][90s] - Looking for Title of French Movie Looking for the title of a French movie I bought on VHS way back in the mid-90s. The movie took place in France, during the 18th or 19th century. All I remember is the cover of the VHS box. To the left it showed a middle aged townsman, with a beard, pulling open his shirt to expose his chest. There were other townsfolk behind him, suggesting this was a mob or some sort of protest. To the right was a police officer or soldier holding his rifle with bayonet affixed. The message was that the man was willing to have himself bayoneted for whatever the cause of the protest was. This has been driving me crazy for years. Please help me find the movie. Thanks in advance!

Answer: 0107002 In mid-nineteenth-century northern France, a coal mining town's workers are exploited by the mine's owner. One day, they decide to go on strike, and the authorities repress them.::Michel Rudoy <mdrc@hp9000a1.uam.mx> and Brian McInnis It's mid 19th century, north of France. The story of a coal miner's town. They are exploited by the mine's owner. One day they decide to go on strike, and then the authorities repress them.::Michel Rudoy Corinne Masiero Germinal Drama Romance

4.3.3 Example 3

Query: 0470761 [TOMT][MOVIE] Plot with a baby doll or something the mom mistakes for a baby, and ends up burying her own child instead of the doll? Vivid memory of renting this from a blockbuster or something but can't find it in google searches. The part of the plot I remember is the plot twist at the end where the mother thinks she's burying a cursed baby doll that's alive or something, but ends up burying her own baby. It was a horror movie and I don't remember it being a super old movie. If we rented it sometime in the 2000's at blockbuster I assume it was relatively recent.

Answer: 0470761 Laura's expecting. Her husband, Steven's a loving guy but has little time for her. Her mom lives thousands of miles away. Forced to give up on her dreams, she's always been a bit edgy. A C-section drives her over the edge, making her see things in a different light. A creepy babysitter doesn't make things any better. She begins seeing things, trusts no one, as she goes into self-destruct mode.::BryanD EQT When the dancer Laura feels sick after a presentation, she finds that she is pregnant. Her husband and successful executive Steven decides to buy a huge house in the suburb to raise their baby daughter Jessica in an adequate environment. Laura changes her lifestyle, feels nervous alone with the mice in the house and accidentally kills their dog with poison. Later Steven hires Mrs. Kasperian to help Laura in the housework and with the baby, but Laura believes the old woman wishes to harm Jessica using witchcraft. Laura has a breakdown, and when she recovers, her mother moves to her house to help her with the baby. During the night, Laura decides to get rid of a doll she found in the house, changing their lives forever.::Claudio Carvalho, Rio de Janeiro, Brazil Mary DeBellis First Born Drama Horror Mystery Thriller

Note on annotation:

The reason Laura is marked blue as well is the inference that Laura is the mom, since she has a child named Jessica and a husband named Steven.

4.3.4 Example 4

Query: 2822400 [TOMT] [FILM] [2000s] German language film that streamed on Shudder. A Christian homeless teen is taken in by a family. Over time they become abusive. The teen has seizures that make him hallucinate Jesus and believes the abuse is a test of his faith. He stays with them to be the whipping boy to protect their children. Teen is beaten to death and the children run away. Was on Shudder until recently. I'm going crazy trying to find it. It was one of the most disturbing films I've ever watched.

Answer: 2822400 The young Tore seeks in Hamburg a new life among

4.3.5 Example 5

Query: 0097419 [TOMT][MOVIE][80s/90s] ... in which a few kids go trick or treating with an old wheelchair-bound dude dressed in a giant eyeball costume. I think the kids were scared of him first... but went on to befriend the disabled fellow. Goonies vibes.

Answer: 0097419 This is a tale about two children who are put in a foster home, and then on Halloween they break out and are rescued by their eccentric grandfather who is in a Halloween costume of an eyeball. They then go to Georges island to try and find the treasure of Captain Kidd.::Andrew Hazeden <Dover99@mailexcite.com> Tina Cross George's Island Adventure Drama Family

4.3.6 Tokenizing

To make sure TF-IDF and BM25 handle the data correctly, the documents and queries have to be tokenized, meaning that they are chopped up into single words. To tokenize a string (or a document in this case) one can use a premade tokenizer, for example the NLTK tokenizer, or write a custom one. While tokenizing, it is useful to also handle punctuation or remove certain characters, as they do not add any value. Certain words may also need special handling, for example the name Henry O'Donal, where the O'Donal part needs to be handled to tokenize the word as desired.

In this thesis, we used the Sklearn tokenizer (Pedregosa et al., 2011) and adjusted it so that it removes punctuation from the text, which in our case mostly concerns the title of a submission, which includes brackets.
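A small sketch of such a tokenizer is shown below; the token pattern is sklearn's default, which already drops punctuation such as the brackets in submission titles, and may differ from the exact adjustment used for the dataset.

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the analyzer that sklearn uses internally to tokenize text.
vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r"(?u)\b\w\w+\b")
analyzer = vectorizer.build_analyzer()

print(analyzer("[TOMT][MOVIE][2000s] Who played O'Donal's partner?"))
# ['tomt', 'movie', '2000s', 'who', 'played', 'donal', 'partner']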

4.4 Data format

To form a good basis for TF-IDF and BM25, the data has to be put together correctly from the scraped submissions, their comments and the Imdb information related to those submissions. For the submissions and comments, we put together a query as a dict where:

The submission id and Imdb id are at the start, each with their own key for easy access. Following that, the submission title and the context of the question are placed. The format before transforming the data is thus a dict with a submission id, an Imdb id (which is the answer to the query) and the title and context of the submission.

For the documents, another combination of information, provided by the Imdb API, is used: the Imdb id and the information of the movie. Not all information about the movie is taken; instead we only focus on the plot, synopsis, cast and category of the movie.

The idea of this data is that when a question gets asked about a movie, some data is more valuable than other data. If one were to look at the general line of questions asked on Tip Of My Tongue about movies, there would be questions about the plot or synopsis of a movie as well as cast and category. For this reason this combination has been taken from the Imdb API to form the documents for the baselines as this could closely resemble the questions on the Subreddit.
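For illustration, a query record and a document record could look like the sketch below; the field names are the sketch's own rather than a fixed schema, and the example values are taken from Example 5 in Section 4.3.

# Query side: one record per solved submission.
query = {
    "submission_id": "abc123",                # hypothetical Reddit submission id
    "imdb_id": "0097419",                     # ground-truth answer for this query
    "title": "[TOMT][MOVIE][80s/90s] ... kids go trick or treating with an old wheelchair-bound dude ...",
    "body": "I think the kids were scared of him first... but went on to befriend the disabled fellow.",
}

# Document side: one record per Imdb movie, built from plot, synopsis, cast and category.
document = {
    "imdb_id": "0097419",
    "plot": "This is a tale about two children who are put in a foster home ...",
    "synopsis": "",
    "cast": ["Tina Cross", "..."],
    "genres": ["Adventure", "Drama", "Family"],
}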

4.5 Retrieval Baselines

After the queries and documents have been prepared by preprocessing and filtering the data, we used several modules from sklearn.feature_extraction, such as TfidfTransformer, CountVectorizer and TfidfVectorizer. These help in setting up the data and getting the TF-IDF scores for the queries and the documents. After the TF-IDF scores are calculated, the main thing we want are the cosine similarities between the TF-IDF vectors of the queries and the documents. This is done using the cosine_similarity function of Sklearn (Pedregosa et al., 2011), which accepts the TF-IDF vector of a query and the TF-IDF vectors of the documents against which it needs to calculate the cosine similarity. The result is a dict with an Imdb id, which is the answer to the query, and the other documents with their cosine similarity scores.
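A hedged sketch of this TF-IDF baseline is given below, assuming `queries` and `documents` are lists of dicts in the format sketched in Section 4.4; the variable names are the sketch's own.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Concatenate the Imdb fields into one document string per movie.
doc_ids = [d["imdb_id"] for d in documents]
doc_texts = [" ".join([d["plot"], d["synopsis"], " ".join(d["cast"]), " ".join(d["genres"])])
             for d in documents]
query_texts = [q["title"] + " " + q["body"] for q in queries]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(doc_texts)      # fit the vocabulary on the documents
query_matrix = vectorizer.transform(query_texts)      # project the queries into the same space

scores = cosine_similarity(query_matrix, doc_matrix)  # shape: (n_queries, n_documents)

# Pytrec Eval style run: query id -> {document id: score}.
run = {q["submission_id"]: dict(zip(doc_ids, row.tolist()))
       for q, row in zip(queries, scores)}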


4.6 Evaluation

To evaluate the scores from bm25 and TF-IDF, a Python module named Pytrec Eval is used. This module provides an evaluator which uses a ground truth: the relevance score a query-document pair should have, which is 1 for the Imdb document actually linked to the submission id. With this, we can give Pytrec Eval the results of the baselines and evaluate them against the ground truth.

After evaluating the results of TF-IDF and bm25, the results are saved as a JSON file. From these results, the mean of multiple metrics is taken: map, ndcg, recall@5, recall@10, recall@20, recall@30 and recall@100. These are shown in the results section.
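A hedged sketch of this evaluation step, reusing the `queries` list and the `run` dict from the previous sketches; the measure strings follow trec_eval's naming conventions and may need adjusting for a particular Pytrec Eval version.

import json
import pytrec_eval

# Ground truth: exactly one relevant Imdb document per submission.
qrel = {q["submission_id"]: {q["imdb_id"]: 1} for q in queries}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {"map", "ndcg", "recall_5", "recall_10", "recall_100"})
per_query = evaluator.evaluate(run)                   # query id -> {measure: value}

# Average every reported measure over all queries and save the result.
measures = sorted(next(iter(per_query.values())).keys())
averaged = {m: sum(res[m] for res in per_query.values()) / len(per_query) for m in measures}

with open("tfidf_eval.json", "w") as f:
    json.dump(averaged, f, indent=2)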

4.7 Evaluation metrics

The metrics we used with Pytrec Eval are explained here:

4.7.1 Recall

When retrieving information from documents, there are four categories for the retrieved results:

• True Positive (TP), which is what we would like to retrieve
• False Positive (FP), which looks like a TP but is actually false
• True Negative (TN), which correctly gets discarded
• False Negative (FN), which gets discarded but should not be

To see how many of the relevant documents are successfully retrieved, we can look at the recall. The mathematical formula for recall is:

\[ \mathrm{recall} = \frac{TP}{TP + FN} \tag{4.1} \]

In this thesis we use multiple recall cutoffs: recall@5, recall@10, recall@20, recall@30 and recall@100. The number after the @ is how many documents are retrieved, after which the recall is calculated on those documents. If, for example, the result of recall@5 is 0.01, it means that on average only a fraction of 0.01 of the relevant documents appears in the top 5.

4.7.2 Mean average precision (MAP)

The mean average precision is a metric which tells how well our model performs over the queries. To know exactly how it works, we take a look at the formula:

\[ \mathrm{MAP} = \frac{\sum_{q=1}^{Q} \mathrm{AP}(q)}{Q} \tag{4.2} \]

where Q is the number of queries in the set and AP(q) is the average precision for a given query q. The formula shows that for every query the average precision (AP) is calculated, after which the mean over all queries is taken. This mean is the MAP, which shows how well our model performs the queries.

4.7.3 Normalized discounted cumulative gain (NDCG)

To know what the normalized discounted cumulative gain is, firstly the discounted cumulative gain (DCG) has to be explained.

Mathematical formula for DCG:

\[ \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)} = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2(i+1)} \tag{4.3} \]

DCG accumulates the gain up to a particular rank position p, where rel_i is the graded relevance of the result at position i.

Formula for NDCG:

\[ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} \tag{4.4} \]

where IDCG is the ideal discounted cumulative gain:

\[ \mathrm{IDCG}_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i} - 1}{\log_2(i+1)} \tag{4.5} \]

where REL_p is the list of relevant documents in the corpus, ordered by relevance, up to position p.

Averaging the NDCG values over all queries shows the average performance of a ranking algorithm.
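As a worked illustration, below is a minimal sketch of the DCG/nDCG computation for a single query with binary relevance (1 for the correct Imdb document, 0 otherwise); for binary relevance the linear and exponential gain variants coincide.

import math

def dcg(relevances):
    # Eq. 4.3: gain of each result discounted by the log of its rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Eq. 4.4: DCG divided by the DCG of the ideal (sorted) ranking.
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal) if any(relevances) else 0.0

print(ndcg([0, 0, 1, 0, 0]))   # correct document at rank 3 -> 0.5
print(ndcg([1, 0, 0, 0, 0]))   # correct document at rank 1 -> 1.0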


5 Results

The results of this thesis will be divided into four parts: Scraping data, TF-IDF, BM25 and Pytrec Eval evaluations.

5.1 Scraping Data

The collection from the Subreddit led to the following numbers of submissions for the dataset:

                                                   Amount
Total submissions scraped                          211,078
Solved movie submissions                           20,330
Solved movies with Imdb link replies               4,596
Solved movies with one reply with an Imdb link     793

Table 5.1: Tip Of My Tongue scraping data

The number of movies scraped from Imdb using the Imdb API (783) is less than the number of submissions (793). This is because Imdb sometimes blocked us from scraping, which resulted in us losing 10 movies and their information.

Analyzing the title and context of the submissions together, we see that the average submission length is 577 characters and 106 words.

If we look at the movie information (plot + synopsis), the average length is 2792 characters and 487 words.

5.2 TF-IDF

Figure 5.1: TF-IDF results; the average NDCG value lies around 0.1 and the average MAP value around 0.01

5.3 BM25

Figure 5.2: BM25 results for the NDCG and MAP values

From the NDCG values we can see that the average value lies around 0.1, and the average MAP lies around 0.01. From Table 5.2 in the Pytrec Eval results section we can see that bm25 performed slightly better than TF-IDF.

5.4 Pytrec Eval results

Mean         TF-IDF    bm25
map          0.0094    0.0096
ndcg         0.1284    0.1287
recall@5     0.0063    0.0063
recall@10    0.0126    0.0151
recall@20    0.0264    0.0277
recall@30    0.0403    0.0416
recall@100   0.1324    0.1337

Table 5.2: Averaged results from Pytrec Eval on TF-IDF and bm25

These results from Pytrec Eval support our hypothesis that the average scores would be low. By comparison, we can see that the bm25 algorithm performed marginally better than TF-IDF. The recall@5 results show that on average only a fraction of 0.006 of the correct documents appears in the top five. Compared to other datasets this is low, since recall is normally around 0.01 to 0.1.

The results thus show that TF-IDF and bm25 performed rather poorly compared to related work such as Monz (2003), which reported an average recall@5 of 0.5-0.6. This is mainly because the textual differences between query and answer are large and contextual understanding is necessary to find the overlapping entities. The examples in Section 4.3 show this.

5.5 Attributes of the dataset

The idea of creating this dataset was to introduce a new kind of dataset in which natural language understanding is the main attribute. Analyzing the dataset reveals multiple attributes:

Diversity
This is a diverse dataset with a large number of different natural language questions. The positive side is that the questions and answers are not uniform, which is good for IR systems that must handle diverse questions and answers. The downside is that, because of this diversity, the dataset will require complex models to handle all those different questions and answers.

Organic
Because the questions are asked by regular people, the dataset is highly organic, meaning that all questions emphasize natural language. This is both a positive and a negative attribute: positive because it requires contextual understanding of the question, which is necessary for modern natural language solutions for information retrieval. Other datasets such as DuoRC (Saha et al., 2018) used Amazon Mechanical Turk to build the dataset in a very structured setting; there is no such constraint here.

Automatic and scalable
This dataset can be updated automatically because of the nature of Reddit and its subreddits. Redditors will keep asking questions in the Subreddit "Tip Of My Tongue", which increases the data that can be appended to the existing dataset.

Single-turn
Because the dataset is single-turn, it potentially has all the information needed for the answer in the context of the question. If we had made it multi-turn, there would also be a need for clarifying questions (CQ). Because of this, the dataset might be easier to work with in future work. A disadvantage of the single-turn constraint is that we had to discard a lot of submissions which did have an Imdb link reply, since we had no guaranteed way to automatically extract them because of Reddit's complex structure for posts.

Noisy
A downside mentioned under Diversity and Organic is that the dataset is very noisy, mostly because of its high diversity and organic nature. For future work this might pose a problem, or it will at least take a complex model to train on the data.

6 Conclusion

In this thesis we provide a dataset of 783 QA pairs created from 211,078 submissions scraped from the Subreddit "Tip Of My Tongue". Of these submissions, 4,596 solved movie submissions contained an Imdb link reply, and 793 contained a single reply with an Imdb link, of which 783 could be matched with Imdb information.

The result of this thesis is a dataset which focuses on natural language and the complexity of a QA system. The use of a single turn question answering format provides a direct link between question and answer. Due to the single turn format and the complexity of the questions, however, this link will be hard to find.

As a result, the TF-IDF and BM25 evaluations receive low scores. This low score, however, does not mean that the results are bad; it shows instead that this is an interesting dataset for future research, as discussed in the future work section. We publish this dataset publicly for research on semantic reasoning in texts and Deep Learning models.

The examples section, annotated by hand, shows that it would be possible to train a model on the dataset, since it clearly shows the link between question and answer; it may, however, be highly complex to do this.

6.1 Future Work

For future work, models could be trained on the dataset, as is shown in multiple papers from the related work section. This dataset provides a natural language focused dataset which could potentially be used for supervised training, as the answers to the queries are already provided in the form of Imdb information. Our dataset follows a structure similar to that of DuoRC (Saha et al., 2018), and similar methods to those in the DuoRC paper could be used to gain valuable insights into QA systems.

The HOTPOTQA dataset (Yang et al., 2018) uses an interesting method: comparing the questions. By doing so, they can categorize the questions and see whether similar questions lead to the same kind of answer. This technique could also be used on this dataset, as the questions follow a similar pattern in each submission.

6.1.1 Earlier Examples

The earlier examples mentioned in Section 4.3 give a good idea of how a model might be trained on the dataset. The semantic reasoning behind the links between entities is complex, so one option is to crowdsource these annotations, just as other papers from the related work section have done. This would make it easier to train a model on the data for further use.

6.1.2 Extracting multi-turn conversations from TOMT

Because we only extracted single-turn conversations, the next step could be to extract multi-turn conversations. The 4,596 submissions with Imdb replies mostly consist of such conversations, so it would be valuable to use these as well when building the dataset.

6.1.3 Other categories on TOMT

The Subreddit TOMT has multiple other categories from which it would also be interesting to build a dataset. For example, the category books could be used: with an API from Goodreads (a book database, including reviews and information on the books), a similar project could be done.

6.1.4 Building a Knowledge Graph

By using the cast of a movie taken from Imdb, a knowledge graph could be built to model the connections between entities and show which problems could be solved for the dataset.

References

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701, 2015.

K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972.

A. Kotov and C. Zhai. Towards natural question guided search. In Proceedings of the 19th international conference on World wide web, pages 541–550, 2010.

H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4):309–317, 1957.

H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165, 1958.

C. Monz. Document retrieval in the context of question answering. In European Conference on Information Retrieval, pages 571–579. Springer, 2003.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142. Piscataway, NJ, 2003.

S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.

A. Saha, R. Aralikatte, M. M. Khapra, and K. Sankaranarayanan. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Meeting of the Association for Computational Linguistics (ACL), 2018.

M. Sanderson and W. B. Croft. The history of information retrieval research. Proceedings of the IEEE, 100(Special Centennial Issue):1444–1451, 2012.

M. Taube, C. Gull, and I. S. Wachtel. Unit terms in coordinate indexing. American Documentation (pre-1986), 3(4):213, 1952.

A. S. Tulshan and S. N. Dhage. Survey on virtual assistant: Google assistant, siri, cortana, alexa. In International Symposium on Signal Processing and Intelligent Recognition Systems, pages 190–201. Springer, 2018.

C. Van Gysel and M. de Rijke. Pytrec eval: An extremely fast python interface to trec eval. In SIGIR. ACM, 2018.

Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.


Y. Zhang, H. Dai, Z. Kozareva, A. J. Smola, and L. Song. Variational reasoning for question answering with knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
