
Using Sparse Latent Representations and Fake News Classification for Document Ranking

Felix M. Rustemeyer



Layout: typeset by the author using LaTeX.

Cover illustration: ’A modern interpretation of using SLR and FNC for document ranking’ by Matt Bryan and Felix Rustemeyer


Using Sparse Latent Representations and Fake News Classification for Document Ranking

Felix M. Rustemeyer
11868325

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. Ir. J. Kamps
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Improving document ranking is one of the main goals of Information Retrieval, and is often pursued by implementing additional features. This thesis has integrated sparse latent representations (SLRs) and a fake news classifier into a standard ranker, and has examined the impact of these integrations. The performance of the standard ranker and the SLR re-ranker has been evaluated to examine the impact of sparse latent representations on document ranking. The SLR re-ranker achieved higher precision scores than the standard ranker, suggesting that sparse latent representations improve document ranking. Inspired by [Cormack et al., 2011], where document ranking was improved by implementing spam filtering, the impact of fake news classification on document ranking has also been investigated. Several fake news classifiers have been constructed and most classifiers demonstrated the ability to classify fake and real news on two datasets. The best performing classifier, an SVM classifier using Doc2Vec representations, has been implemented in the re-ranking phase. This implementation did not lead to an increased document ranking precision score. Nonetheless, it provides a good starting point for further research.


Contents

1 Introduction
   1.1 Research Questions
   1.2 Thesis Overview

2 Theoretical Background
   2.1 Document Ranking
       2.1.1 Indexing and Searching
       2.1.2 Standard Ranking Functions
       2.1.3 Neural Ranking
       2.1.4 Standalone Neural Ranking Model
   2.2 Document Classification
       2.2.1 TF-IDF
       2.2.2 Document Embeddings
       2.2.3 Sparse Latent Representations
       2.2.4 Support Vector Machine
   2.3 Summary

3 Anserini: The (Neural) Search Engine
   3.1 Standard Ranking
       3.1.1 Standard Indexing
       3.1.2 Standard Querying
   3.2 SLR Re-Ranking
       3.2.1 SLR Re-Ranking Indexing
       3.2.2 SLR Re-Ranking Querying
   3.3 Testing performance of SLR Re-Ranking
       3.3.1 Method
       3.3.2 Data
       3.3.3 Evaluation
       3.3.4 Results
   3.4 Conclusion and Discussion

4 Fake News Classification
   4.1 Data
       4.1.1 ISOT
       4.1.2 McIntire
   4.2 Creation of document representations
       4.2.1 TF-IDF
       4.2.2 Doc2Vec
       4.2.3 BERT
       4.2.4 SLR
   4.3 Experiments
       4.3.1 Training and testing with McIntire
       4.3.2 Training and testing with ISOT
       4.3.3 Training with McIntire and testing on ISOT
       4.3.4 Training with ISOT and testing on McIntire
   4.4 Conclusion and Discussion

5 Re-Ranking using Fake News Classification
   5.1 Implementation
   5.2 Analysis
       5.2.1 Real news probabilities
       5.2.2 Qualitative Analysis
       5.2.3 Re-ranking with a linear combination
   5.3 Conclusion and Discussion


Chapter 1

Introduction

Since the rise of the internet and its impact on modern society, document ranking can be seen as one of the most far-reaching disciplines of Computer Science. The field is aimed at improving document ranking, and wide-ranging studies have been conducted to accomplish this goal, such as the introduction of a standalone neural ranking model and content-based re-ranking using spam filtering [Zamani et al., 2018] [Cormack et al., 2011]. A standalone neural ranking model (SNRM) is an efficient neural model that ranks documents using sparse latent representations (SLRs). This thesis examines how SLRs can be implemented in a document ranker. Furthermore, by examining the performance of this document ranker, the impact of sparse latent representations on document ranking is investigated.

Content-based re-ranking is another strategy for improving document ranking and has been applied in [Cormack et al., 2011]. First, a spam classifier was trained on the ClueWeb dataset using logistic regression. When the classifier was implemented in the re-ranking stage of a document ranker, the effectiveness of the ranker improved substantially. Inspired by this research, the second part of the thesis will examine the implementation of fake news classification in document ranking and its impact. First, the construction of a fake news classifier will be examined and several fake news classifiers will be constructed. Thereafter, these classifiers will be evaluated and the best performing classifier will be selected. Then, the impact of fake news classification on document ranking can be examined by implementing the selected classifier in the document ranker.

The SLRs and the fake news classifier will both be implemented in the re-ranking phase of a standard ranker that will be constructed using Anserini. In addition to examining the implementation and impact of SLRs and fake news classification on document ranking, the impact of conducting fake news classification with SLRs will be investigated. Sparse latent representations have never been used before for document classification. Therefore, it is an interesting first step for examining the impact of SLRs on document classification. This will be conducted by utilizing the SLRs — used for the implementation in document ranking — and the fake news classification framework — used for constructing the fake news classifier that will be implemented in document ranking.


1.1 Research Questions

This thesis will examine the implementation of sparse latent representations and fake news classification in document ranking and the impact of these additions, and aims to answer the following research question:

How can sparse latent representations and a fake news classifier be implemented in document ranking, and what is the impact of these implementations?

This research question can be divided into six sub research questions:

Q1: How can sparse latent representations be implemented in document ranking?

Q2: What is the impact of sparse latent representations on document ranking?

Q3: What is the best performing approach of conducting fake news classification?

Q4: What is the impact of sparse latent representations on fake news classification?

Q5: How can a fake news classifier be implemented in document ranking?

Q6: What is the impact of fake news classification on document ranking?

1.2 Thesis Overview

Chapter 2 introduces the topics of this thesis and provides a theoretical background. Section 2.1 is focused on document ranking while section 2.2 is focused on document classification.

In section 3.1, a standard ranker will be created that will be used as the basis for the implementation of SLRs and the fake news classifier. This standard ranker will also be used as a comparison for testing the performance of the SLR re-ranker and the FNC re-ranker. Section 3.2 will implement SLRs in the re-ranking phase using the standard ranker as foundation. The subsequent section will use the standard ranker and the SLR re-ranker to examine the performance of the SLR re-ranker in comparison with the standard ranker. The last section of chapter 3 will analyze the results from the previous section and will answer the first and second sub research questions. The code developed for the search engine can be found in [van Keizerswaard et al., 2020].

After implementing SLRs in a standard ranker and examining their impact, the thesis will focus on implementing a fake news classifier in the standard ranker and investigating its impact. Before a fake news classifier can be implemented, one has to be constructed. Chapter 4 will aim at finding a well performing fake news classifier by constructing several fake news classifiers that use different document representations. One of the utilized document representations will be SLR, and therefore the fourth sub research question — examining the impact of SLRs on fake news classification — can be answered. The first section of chapter 4 will describe the datasets that are used for training and testing the classifiers. Thereafter, the generation processes of the document representations are discussed. The conducted experiments and the corresponding results are displayed in section 4.3. From these results the most accurate approach of conducting fake news classification can be deduced, and the third sub research question can be answered (section 4.4).

In chapter 5 the best performing classifier will be implemented in the standard ranker that is constructed in section 3.1. Like the SLRs, the fake news classifier will be implemented in the re-ranking phase. Thereafter, the impact of this addition will be examined in the subsequent sections. The code produced for chapters 4 and 5 can be found in [Rustemeyer, 2020].

The last chapter will give an overview of the conducted experiments, conclusions, answers to the sub research questions and the most important discussion points. Thereafter, the main research question will be answered.

Section 2.1 — discussing the theoretical background of document ranking — and sections 3.1-3.2 — the implementation of the standard ranker and the SLR re-ranker — were co-authored in collaboration with fellow students Lodewijk van Keijzerswaard, Tijmen van Etten and Roel Klein, who use this information for their research on the impact of sparse latent representations on downstream natural language processing tasks and on the implementation of a neural ranker using an SNRM. Each of us contributed equally to the writing of these sections. After creating a working implementation of the SLR re-ranker, described in section 3.2, our paths split. The other sections and chapters have been written individually.


Chapter 2

Theoretical Background

This chapter will introduce the main concepts of document ranking and document classification. The first section addresses several approaches to document ranking. Section 2.2 will explore the required components for conducting document classification and will focus on different document representations in particular.

2.1 Document Ranking

Information Retrieval (IR) is the process of retrieving information from a resource given an information need. Areas of IR research include text based retrieval, image or video retrieval, audio retrieval and some other specialized areas, but text based retrieval is by far the most active, driven by the rise of the World Wide Web. Today, many systems are able to retrieve documents from enormous collections consisting of multiple terabytes or even petabytes of data. These systems have existed for some time, and continue to improve on traditional IR methods. In this section, the basis of text based IR will be established, to be able to make a comparison between different IR methods and to motivate the design choices of the neural search engine.

This section will first introduce the concepts of the inverted index and ranking functions, after which two traditional ranking functions will be introduced as a point of reference. The remainder of this chapter will discuss the use of neural ranking functions in IR, their current computational challenges, and the way that sparse latent representations (SLRs) promise to deal with those problems.

2.1.1 Indexing and Searching

In IR, a distinction is generally made between indexing time and query time. The query time is defined as the time it takes the algorithm to retrieve a result (in real time). This time depends on the algorithm that is used as well as the hardware that it is running on. The goal of IR is then to minimize the query time, which is achieved by moving as much computation as possible from query time to indexing time. What to compute at indexing time and how to search this computed data is then the question that IR research is concerned with. A general description of these two stages is given below. The following information is heavily based on [Manning et al., 2008, Ch. 2-4, 6].

Indexing

At indexing time an inverted index is computed from a collection of documents. An inverted index is a data structure that maps content such as words and numbers, sometimes also called tokens, to the locations of these tokens in that document collection. Inverted indexes are widely used in information retrieval systems, for example on a large scale in search engines [Zobel et al., 1998]. The computation of an inverted index can be split into three different parts: the tokenization of the documents in the collection, the stemming of the extracted tokens and the mapping of each token to each document. The inverted index can optionally be appended with extra data, which can be attached to either tokens or documents. These stages will be discussed below.

Tokenization Tokenization is the extraction of tokens from a document. In traditional IR the extracted tokens are usually words. However, it is also possible to use other tokens such as word sequences, also called n-grams, vector elements or labels. There is a wide range of tokenization algorithms since different applications and languages ask for specialized algorithms.

A rough token extraction algorithm would split a document on whitespace and punctuation. For example, when using words as tokens, the sentence

"He goes to the store everyday." (2.1)

would be split into {"He", "goes", "to", "the", "store", "everyday"}. When using n-grams with n = 2, the result of token extraction is: {"He goes", "goes to", "to the", "the store", "store everyday"}.

Stemming These words will next be stemmed. Stemming is the process of reducing each word to its stem. In the example given above the word "goes" would be changed into its stem: "go". The remaining words are already in their stemmed form. An example of a very popular stemming algorithm is Porter's Algorithm [Porter et al., 1980]. This step usually also includes case-folding, the lowering of (all) cases of each word. For more information on stemming algorithms and case-folding, see [Manning et al., 2008, Ch. 2]. The stemmed token set of Sentence 2.1 would be {"he", "go", "to", "the", "store", "everyday"}.

Token Mapping Given the set of stemmed tokens for a document, the tokens can now be mapped to each document. A simple algorithm would simply indicate whether a token is present in the document. This mapping can be represented by a dictionary with the token as key and the set of documents containing the token as value: ["token" → {docID, docID, ..., docID}]. Using Sentence 2.1 as content for document 1, and

"We go to the store today." (2.2)

as content for document 2, the following mapping would be acquired after word tokenization and stemming:

Token Mapping: (2.3)
"he" → {1}
"we" → {2}
"go" → {1, 2}
"to" → {1, 2}
"the" → {1, 2}
"store" → {1, 2}
"everyday" → {1}
"today" → {2}
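To make the steps above concrete, the following minimal Python sketch builds such a token-to-document mapping for the two example documents. The punctuation stripping and the tiny STEMS table are crude stand-ins for real tokenization and stemming (e.g. Porter's algorithm), not a description of any particular system.

```python
from collections import defaultdict

# Toy stemming table; a real system would use e.g. the Porter stemmer.
STEMS = {"goes": "go"}

def tokenize(text):
    # Split on whitespace, strip punctuation, lower-case and "stem" each token.
    tokens = [w.strip('.,!?";:').lower() for w in text.split()]
    return [STEMS.get(t, t) for t in tokens if t]

def build_inverted_index(docs):
    # docs maps document ID -> text; the index maps token -> set of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {1: "He goes to the store everyday.", 2: "We go to the store today."}
index = build_inverted_index(docs)
print(sorted(index["store"]))     # [1, 2]
print(sorted(index["everyday"]))  # [1]
```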

Inverted Index Construction After the token mapping is complete, the inverted index can be constructed. In the example above the inverted index only indicates the presence of a word in a document by saving the document ID. For small collections with simple search algorithms this can be sufficient, but with bigger collection sizes this quickly becomes insufficient. To improve the search functionality, several other statistics can be included in or appended to the inverted index. For example, the location of each term in a document can be saved by storing a list of positions for each mapping from a term to a document. A mapping of token t to a collection of n documents d ∈ D would then be of the form "t" → {..., d_i : [p_1, p_2, ..., p_k], ...}, where p_1, ..., p_k are the positions of t in document d_i and k is equal to the number of occurrences of t in d_i, for any i. Other examples of data that can be included are the token frequency per document (storing the value of k explicitly for each token) and the number of tokens in each document. These are commonly used in traditional search algorithms such as TF-IDF and BM25, which will be introduced in Section 2.1.2.

Searching

Given an inverted index, it is possible to perform search operations using queries. There are two categories of search algorithms: unranked search algorithms and ranked search algorithms. The first category is a simple retrieval of all document IDs that contain (a part of) the query. This is, as mentioned in the previous paragraph, often insufficient, so this category will not be discussed any further. Ranked search algorithms will rank each document for a query. A document with a high score should have a high relevance for the given query. Section 2.1.2 will discuss two efficient traditional ranking functions.

2.1.2 Standard Ranking Functions

As mentioned earlier, ranking is the task of sorting given documents in order of relevance to a given query. Ranking can be conducted with the use of standard ranking functions like TF-IDF and BM25. The information about these ranking functions is from [Christopher et al., 2008] and [Robertson and Zaragoza, 2009].


TF-IDF

The TF-IDF ranking approach consists of two phases. Firstly, the given query and documents are converted to TF-IDF representations. Then, a similarity score is calculated between these query and document vector representations using a scoring function like cosine similarity. After the documents are ranked according to this similarity score, the ranking task is completed.

Term frequency - inverse document frequency (TF-IDF) is a vector based model: for each document a vector representation can be built. This vector consists of the TF-IDF values of all the terms in the collection. TF-IDF is a factor weighting the importance of a term to a document in a collection of documents. For an important term in a document — a term that occurs frequently in the document and rarely in other documents — a high TF-IDF value is calculated. A low TF-IDF value is calculated for unimportant terms, such as terms with a very high document frequency.

Given a term t, a document d and a collection of documents D, the TF-IDF value of term t in document d can be calculated by multiplying two statistics of term t: the term frequency and the inverse document frequency.

TFIDF(t,d,D) = tf(t,d) · idf(t,D) (2.4)

Term Frequency The term frequency of a term is calculated by dividing the frequency of term t in document d by the number of occurrences of the most frequent term in document d (max_k f(k, d)):

TF(t, d) = f(t, d) / max_k f(k, d)    (2.5)

Hence, a TF of 1 is calculated for the most frequent term in a document, and fractions are calculated for the other terms in the document.

Inverse Document Frequency Solely using TF as a ranking function is not an accurate weighting scheme, since TF does not account for the fact that the most frequent words, "stopwords", are not important, because these words occur in a lot of documents. Therefore, documents are not distinguishable by solely using TF.

Hence, TF is multiplied with IDF, decreasing the weight of very frequent terms and increasing the weight of rarely occurring terms. The IDF of a term t is calculated as the binary logarithm of the number of documents N divided by the number of documents containing the term, N_t:

IDF(t) = log2(N / N_t)    (2.6)

By multiplying a term’s TF and IDF for all the terms in a document, the TF-IDF representation of a document is constructed.


Cosine similarity After TF-IDF vector representations are built, the similarity between documents and queries can be calculated with a scoring function. A standard scoring function, often used for calculating similarity scores between TF-IDF vector representations, is cosine similarity. The cosine similarity between two TF-IDF vectors ranges between zero and one: because TF-IDF values cannot be negative, the angle between two TF-IDF vectors cannot be greater than 90°. The cosine similarity between a query Q and a document D, both vectors of length N, is derived by the following formula:

cos(θ) = Σ_{i=1}^{N} Q_i·D_i / ( √(Σ_{i=1}^{N} Q_i²) · √(Σ_{i=1}^{N} D_i²) )    (2.7)

Now the documents can be ranked based on the cosine similarity score between the documents and the query.
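To illustrate the two phases of the TF-IDF ranking approach, the sketch below builds TF-IDF vectors with scikit-learn and ranks a made-up collection by cosine similarity. Note that scikit-learn's TF-IDF variant differs slightly from Equations 2.5 and 2.6 (it uses raw term counts and a smoothed IDF), so the exact scores are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the lion is the king of the savanna",
    "stock markets fell sharply today",
    "a lion was spotted near the village",
]
query = ["lion near the village"]

# Phase 1: convert documents and query to TF-IDF vectors in the same term space.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(query)

# Phase 2: rank documents by cosine similarity to the query (Equation 2.7).
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for rank, doc_id in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[doc_id]), 3), documents[doc_id])
```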

BM25

Another widely used ranking function is BM25, which ranks documents on the basis of the occurrences of query terms in the documents. In contrast with the TF-IDF approach, where a separate scoring function is employed for calculating the similarity scores between the document vectors and the query vector, BM25 is a single function that calculates a BM25 score between document D and query Q. Lucene has switched from TF-IDF to BM25 as its ranking function because of several advantages that are incorporated in BM25.

Similar to TF-IDF, the formula of BM25 contains TF and IDF. However, the versions of these statistics used in BM25 differ from the traditional forms.

Asymptotic TF With the standard TF, a term occurring 1000 times is twice as important as a term with 500 occurrences. Due to this type of TF, TF-IDF values sometimes do not match the expectations of users. For example, according to TF-IDF an article containing the word "lion" 8 times is twice as relevant as an article containing it 4 times. Users do not agree with this: to them the first article is more relevant, but not twice as relevant. BM25 handles this problem by adding a maximum TF value, the variable k. Due to this maximum, the TF curve of BM25 approaches an asymptote and never exceeds k. In Equation 2.8, the term frequency is indicated with f(q_i, D).

Smoothed IDF The only difference between the standard IDF — used in the calculation of TF-IDF — and the BM25 IDF is the addition of one before taking the logarithm. This is done because the standard IDF of a term with a very high document frequency can be negative. This is not the case with the IDF in the BM25 formula, due to the addition of one.

Document length Another distinction between TF-IDF and BM25 is the fact that BM25 includes document length. TF-IDF does not take document length into account, which leads to unexpected TF-IDF values. For example, the TF-IDF value of the term "lion" that appears two times in a document of 500 pages is equal to the TF-IDF value of the same term appearing two times in a document of one page. Thus, according to TF-IDF the term "lion" is equally relevant in a document of 500 pages and in a document consisting of one page. This is not intuitive, and BM25 handles this by normalizing TF with the document length, using the factor (1 − b + b·|D|/avgdl), where |D| is the length of document D and avgdl is the average document length. The influence of document length on the BM25 calculation is controlled by the variable b.

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i) · f(q_i, D)·(k + 1) / ( f(q_i, D) + k·(1 − b + b·|D|/avgdl) )    (2.8)

After the calculation of the BM25 scores — for all the documents in the collection — the documents can immediately be ranked based on these scores.
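A direct, simplified implementation of Equation 2.8 is sketched below. The parameter values k = 1.2 and b = 0.75 are common defaults rather than the ones used by Lucene or in this thesis, and the IDF follows the "add one before the logarithm" smoothing described above.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k=1.2, b=0.75):
    # doc_freqs maps a term to the number of documents containing it (N_t).
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log2(num_docs / doc_freqs[term] + 1)           # smoothed IDF
        norm = 1 - b + b * (len(doc_terms) / avgdl)               # length normalization
        score += idf * (tf[term] * (k + 1)) / (tf[term] + k * norm)
    return score

docs = [["lion", "king", "savanna"],
        ["stock", "markets", "fell"],
        ["lion", "near", "village", "lion"]]
doc_freqs = Counter(t for d in docs for t in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
for i, d in enumerate(docs):
    print(i, round(bm25_score(["lion", "village"], d, doc_freqs, len(docs), avgdl), 3))
```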

2.1.3 Neural Ranking

In recent years, neural networks have been successfully applied to a variety of information retrieval tasks, including ad-hoc document ranking [Dehghani et al., 2017a] [Guo et al., 2017] [Xiong et al., 2017]. Neural ranking can in general be understood by dividing the ranking process into three steps: generating a document representation, generating a query representation, and relevance estimation. Neural ranking models can be classified based on the step where a neural network is applied.

In general, there are two types of neural ranking models: early combination models and late combination models [Dehghani et al., 2017a]. Early combination models often use text-based representations of the document and the query, after which interactions between the document and the query are captured in a single representation. This interaction can for example consist of manually created features or exact query term matches. The document-query representation is then given as input to a neural model to compute a similarity score. Examples of early combination models include DRMM, DeepMatch and ARC-II.

On the other hand, the late combination models use neural networks to create separate query and document embeddings. The document relevance is then computed using a simple similarity function, such as the cosine similarity or a dot product. Examples include DSSM, C-DSSM and ARC-I.

One advantage of late combination is that all document embeddings are query independent, which means that they can be precomputed. This way only the query embedding and a simple similarity function have to be calculated at query time, whereas early combination models have a more complex neural network for relevance estimation.

However, most of the time late combination models are still not efficient enough to provide close to real-time retrieval. This is because the document and query embeddings lack the sparsity of traditional word-based representations. Creating an inverted index using document embeddings is still technically possible, but will not give a speedup unless they are sparse vectors. Document embeddings are usually so dense that they contain almost no zero values. As a result, the posting lists for each latent term would contain all documents and no efficiency is gained. This is one of the main reasons that standalone neural ranking models are not used in practice. However, there are methods to still include neural ranking models in the ranking process.

The most popular way is to use a multi-stage ranker. Here a traditional first stage ranker passes the top documents to a second stage neural ranking model. These stacked ranking models seem to take advantage of both the efficiency of traditional sparse term based models and the effectiveness of dense neural models. However, this first-stage ranker acts as a gatekeeper, possibly reducing the recall of relevant documents.

2.1.4 Standalone Neural Ranking Model

In a recent paper, [Zamani et al., 2018] took a different approach to neural ranking in IR and proposed the Standalone Neural Ranking Model (SNRM). According to the paper, the original multistage approach leads to inefficiency due to the stacking of multiple rankers — that are all working at query time — and also to a loss in retrieval quality due to the limited set of documents continuing to the neural stage. They propose to use a sparse latent representation (SLR) for documents and queries instead of a dense representation, to achieve similar levels of efficiency and quality as a multistage ranker. These representations consist of significantly fewer non-zero values than the dense representations, but are still able to capture meaningful semantic relations due to the learned latency in the sparse representations. Figure 2.1 shows the distributions of different representations and illustrates the Zipfian nature of the SLR, matching far fewer documents than the dense representation that is used for multistage rankers. This approach with sparse latent representations allows for a fully neural standalone ranker, since a first stage ranker is not required for efficiency reasons. In the following sections we will provide an overview of how the SNRM works and how it is trained, as explained in [Zamani et al., 2018].

Model

The efficiency of the SNRM lies in the ability to represent the query and documents using a sparse vector, rather than a dense vector. For ranking purposes, these sparse representation vectors for documents and queries should be in the same semantic space. To achieve this desideratum, the parameters of the representation model remain the same for both the document and the query. Simply using a fully-connected feed-forward network for the implementation of this model would result in similar sparsity for both query and document representations. Instead, since queries are intuitively shorter than the actual documents, a desirable characteristic for efficiency reasons is that queries have a relatively sparser representation. To achieve this, a function based on the input length is used for the sparse latent representation. This function, used to obtain the representation for a document (and similarly for a query), is provided below.


Figure 2.1: The document frequency for the top 1000 dimensions in the actual term space (blue), the latent dense space (red), and the latent sparse space (green) for a random sample of 10k documents from the Robust collection.

φ_D(d) = (1 / (|d| − n + 1)) · Σ_{i=1}^{|d|−n+1} φ_ngram(w_i, w_{i+1}, ..., w_{i+n−1})    (2.9)

As can be seen, the final representation is averaged over a number of n-grams that is directly controlled by the length of the input. The larger the input, the more n-grams the representation is averaged over and the more likely it is that different activated dimensions are present. Another advantage of this function is the capturing of term dependencies using the sliding window over a set of n-grams. These dependencies have been proven to be helpful for information retrieval [Metzler and Croft, 2005].

The φ_ngram function is modelled by a fully-connected feed-forward network; this does not lead to density problems because the input is of fixed size. Using a sliding window, embedding vectors are first collected for the input using a relevance-based embedding matrix. After embedding the input, the vectors are then fed into a stack of fully-connected layers with an hourglass structure. The middle layers of the network have a small number of units that are meant to learn the low-dimensional manifold of the data. In the upper layers the number of hidden units increases, leading to a high-dimensional output.

Training

Figure 2.2: A general overview of the model learning a sparse latent representation for a document.

The SNRM is trained with a set of training instances consisting of a query, two document candidates and a label indicating which document is more relevant to the query. In the training phase there are two main objectives: the Retrieval Objective and the Sparsity Objective. In order to achieve the Retrieval Objective, hinge loss is employed as the loss function during the training phase, since it has been widely used in the ranking literature for pairwise models.

L = max(0, ε − y_i·[ψ(φ_Q(q_i), φ_D(d_{i1})) − ψ(φ_Q(q_i), φ_D(d_{i2}))])    (2.10)

For the Sparsity Objective, the minimization of the L1 norm is added to this loss function, since minimizing L1 has a long history in promoting sparsity [Kutyniok, 2013].

The final Loss Function can be computed as follows:

L(q_i, d_{i1}, d_{i2}, y_i) + λ·L1(φ_Q(q_i) ‖ φ_D(d_{i1}) ‖ φ_D(d_{i2}))    (2.11)

where q_i is the query, d_{i1} document 1, d_{i2} document 2, y_i indicates the relevant document using the set {1, -1}, and ε in Equation 2.10 is the hinge loss margin. Furthermore, φ_Q indicates the SLR function for a query and φ_D for a document. Lastly, ψ() is the matching function between query and document SLRs, explained in the next section.

An additional method called 'weak supervision' is used to improve the model during training. This unsupervised learning approach works by obtaining 'pseudo-labels' from a different retrieval model (a 'weak labeler'), and then uses the obtained labels to create new training instances. For each instance (q_i, d_{i1}, d_{i2}, y_i), the two candidate documents are either both sampled from the result list, or one is sampled from the result list along with one random negative sample from the collection. y_i is defined as:


y_i = sign(p_QL(q_i | d_{i1}) − p_QL(q_i | d_{i2}))    (2.12)

where p_QL denotes the query likelihood probability, since in this case a query likelihood retrieval model [Ponte and Croft, 1998] with Dirichlet prior smoothing [Zhai and Lafferty, 2017] is used as the weak labeler.

This method has been shown to be an effective approach to train neural models for a set of IR and NLP tasks, including ranking [Dehghani et al., 2017b], learning relevance-based word embeddings [Zamani and Croft, 2017] and sentiment classification [Deriu et al., 2017].

Retrieval

The matching function between a document and a query is the dot product over the intersection of the non-zero elements of their sparse latent representations. The score can be computed with:

retrieval score(q, d) = Σ_{i : q_i > 0} q_i · d_i    (2.13)

The simplicity of this function is essential to the efficiency of the model, since the matching function is used frequently at inference time. The matching function will calculate a retrieval score for all the documents that have a non-zero value for at least one of the non-zero latent elements in the sparse latent representation of the query.

Finally, using the new document representations, an inverted index can be constructed. In this case, each dimension of an SLR can be seen as a "latent term", capturing meaningful semantics about the document. The index for a latent term contains the IDs of the documents that have a non-zero value for that specific latent term. So, during construction, if the i-th dimension of a document representation is non-zero, this document is added to the inverted index of latent term i.
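The sketch below illustrates, with made-up toy vectors, how non-zero SLR dimensions can serve as latent terms in an inverted index and how Equation 2.13 scores documents against a query over that index.

```python
import numpy as np
from collections import defaultdict

def build_latent_index(doc_slrs):
    # doc_slrs maps document ID -> SLR vector (mostly zeros). Each non-zero
    # dimension i is a "latent term" whose posting list stores (doc ID, weight).
    index = defaultdict(list)
    for doc_id, vec in doc_slrs.items():
        for i in np.nonzero(vec)[0]:
            index[int(i)].append((doc_id, float(vec[i])))
    return index

def retrieval_scores(query_slr, index):
    # Equation 2.13: sum q_i * d_i over the non-zero latent dimensions of the query.
    scores = defaultdict(float)
    for i in np.nonzero(query_slr)[0]:
        for doc_id, weight in index.get(int(i), []):
            scores[doc_id] += float(query_slr[i]) * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

doc_slrs = {"d1": np.array([0, 0, 1.2, 0, 0, 0.4, 0, 0]),
            "d2": np.array([0.7, 0, 0, 0, 0, 0.9, 0, 0])}
query_slr = np.array([0, 0, 0.5, 0, 0, 1.0, 0, 0])
print(retrieval_scores(query_slr, build_latent_index(doc_slrs)))  # d1 before d2
```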

2.2 Document Classification

Document classification is the task of assigning labels to a document. Usually, document classification is conducted by generating a model using manually classified documents. First, these documents have to be converted into document representations [Mladeni et al., 2010]. There exist several classification models and vector representations; the following subsections will discuss the most eminent ones, which are also the ones used in this research. The first three subsections will introduce the following document representations: TF-IDF, Doc2Vec, BERT, and sparse latent representations. The last subsection will describe the support vector machine, a machine learning algorithm.


2.2.1 TF-IDF

In addition to being a ranking function, TF-IDF values can also be used for building document representations. As mentioned in 2.1.2, TF-IDF values weight the importance of a term to a document in a collection of documents. A TF-IDF document representation consists of the TF-IDF values of all the terms in the document.

Due to the fact that each vector consists of the TF-IDF values for all the terms in the collection, TF-IDF vector representations can be substantial when conducting document classification on a large collection. By reducing the vector dimensions, the performance and efficiency of the classifier can be improved. This can be done through the removal of stopwords, stemming and tokenization, as discussed in 2.1.1.

2.2.2 Document Embeddings

Word embeddings are vector representations that capture the meaning of a word. These representations aim to have the property that the relationship between two vectors mirrors the linguistic relationship between the two words. As illustrated in Figure 2.3, a word embedding model aims to make the following hold in vector space: king − man + woman ≈ queen [Mikolov et al., 2013].

Figure 2.3: Word embeddings aim at mirroring the linguistic relationship between ’king’ and ’queen’.

The principal idea is the distributional hypothesis, which assumes that a word is characterized by its neighboring words. According to this hypothesis, words appearing in the same context have the same meaning. The learning of word embeddings can be conducted without labels and manual annotation, and can therefore be seen as an unsupervised task.

An embedding that aims to capture the meaning of a document is called a document embedding and can be utilized for a large number of NLP tasks, like document classification. In contrast with TF-IDF representations, document embeddings are small dense vectors; a vector is called dense when it mainly consists of non-zero values. Another distinction between these representations is the interpretability of the meaning of each dimension. For each dimension in a TF-IDF representation, it is clear what the dimension stands for — the importance of a word to the document. This is not the case with a document embedding, because a document embedding is generated with the use of a neural network — a black box.

Doc2Vec

Doc2Vec is an extension built on top of the popular word embedding model Word2Vec that is capable of constructing representations regardless of the document's length, and was introduced in [Le and Mikolov, 2014].

Figure 2.4: A framework for learning a document vector.

In the Word2Vec model, the meaning of each word is predicted by training a neural network with the 'surrounding' probabilities of all the other words in the collection. The surrounding probabilities of a word are the probabilities that the word is surrounded by each other word in the collection; neighbouring words therefore have a higher 'surrounding' probability than distant words. Doc2Vec builds on this model by adding a document vector, vector D in Figure 2.4. This document vector is trained during the training of the word vectors W. After all the words in the collection have been trained, vector D holds the numeric representation of the document and represents the concept of the document.
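As an illustration of how such document vectors can be obtained in practice, the sketch below trains a small Doc2Vec model with gensim; gensim is one possible library and the hyperparameter values are illustrative, not the ones used in this thesis.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["breaking news about the election",
         "the senate passed a new bill today",
         "celebrity spotted at a secret party"]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# Train a small Doc2Vec model on the toy corpus.
model = Doc2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=40)

# Infer a fixed-length document vector for an unseen article.
vector = model.infer_vector("new bill passed by the senate".split())
print(vector.shape)  # (50,)
```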

BERT

BERT (Bidirectional Encoder Representations from Transformers) is the first bidirectional document representation model and has been developed by Google [Devlin et al., 2018]. Before BERT was published, language models — like Doc2Vec — were trained in a one-directional way. In this approach, the language model is trained by looking at the document from either left-to-right or combined left-to-right and right-to-left. A bidirectional language model is trained by looking at the previous and next words simultaneously. This 'looking at the same time' approach leads to a deeper sense of language and results in context-based document representations.
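A common way to turn BERT into a single document representation is to feed the text through a pretrained model and use the final hidden state of the [CLS] token (averaging all token states is an alternative). The sketch below assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; neither is prescribed by this thesis.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Officials testified before Congress about disinformation campaigns."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Final hidden state of the [CLS] token as a 768-dimensional document vector.
doc_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)
print(doc_embedding.shape)  # torch.Size([768])
```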

2.2.3 Sparse Latent Representations

A sparse latent representation is a vector that is both sparse and latent. A vector is sparse when most of its values are zero, and a latent vector is a vector with reduced dimensions that only stores the most important features of the document. The creation of sparse latent representations is conducted with a neural network, as discussed in section 2.1.4.

2.2.4 Support Vector Machine

The support vector machine (SVM) is a supervised machine learning algorithm that is often used as a classifier. The following information is heavily based on [Cristianini and Shawe-Taylor, 2000].

The goal of an SVM is to construct a hyperplane in the N-dimensional space, where N is the number of features, that separates the classes in the best possible way. A new sample that needs to be labeled can then be classified by examining on which side of the hyperplane the sample is located. A hyperplane is a subspace with a dimension that is one less than the dimensional space, thus N − 1.

Figure 2.5: The SVM algorithm aims at constructing a hyperplane that maximizes the margin between classes

The identification of the hyperplane — that separates the classes in the best possible way — is conducted by maximizing the margin, the distance between both classes. The maximization of the margin is accomplished by using gradient descent and support vectors. Support vectors are the samples that are closest to the hyperplane and influence the orientation and position of the hyperplane the most. The loss function employed in this maximization is hinge loss, see Equation 2.10.
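A minimal sketch of an SVM text classifier with scikit-learn, trained on a made-up toy corpus of 'real' and 'fake' headlines using TF-IDF representations; it only illustrates the margin-based classification described above, not the classifiers that are built in chapter 4.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus; labels: 1 = real news, 0 = fake news.
texts = ["the senate passed the bill", "aliens secretly run the government",
         "stocks rose after the report", "miracle cure hidden by doctors"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

# Linear SVM on top of TF-IDF vectors: the classifier finds the separating
# hyperplane with the maximum margin between the two classes.
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```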


2.3 Summary

This chapter has discussed the foundations of document ranking and document classification. The first part of the chapter — focused on document ranking — started with introducing fundamental components of document ranking and information retrieval, such as the inverted index, stemming and tokenization. These concepts form the basis of document ranking and of NLP tasks like document classification. Thereafter, two ranking functions were introduced — TF-IDF and BM25. Then neural rankers and the standalone neural ranking model have been discussed. These fundamental components of document ranking are used in chapter 3, where SLRs are implemented in a ranker and the performance of this ranker is tested.

The document classification part of this chapter was aimed at introducing the concepts relevant for conducting document classification. Usually, document classification is conducted with a classification algorithm and documents that are converted into document representations. SVM is one of the best known classification algorithms — and the one subsequently used in this thesis — and aims to construct a hyperplane that separates the classes in the best possible way. Four document representations have been introduced: TF-IDF, Doc2Vec, BERT and SLR. TF-IDF is one of the most basic document representations, consisting of the TF-IDF values for all the terms in the document. Doc2Vec and BERT are document embeddings that aim to have the property that the relationship between two document vectors mirrors the linguistic relationship between the two documents. A sparse latent representation is a vector with mostly zero values that only stores the most important features of the document. These components are used in chapter 4 for the construction of several fake news classifiers.

The next chapter will implement SLRs in a document ranker and will test its performance.


Chapter 3

Anserini: The (Neural) Search Engine

This chapter will examine how a search engine utilizing SLRs can be implemented in Anserini. First, a standard ranker will be constructed (section 3.1). This standard ranker will be used as the foundation for implementing SLRs in section 3.2 and a fake news classifier in later chapters. In section 3.2, SLRs will be employed for re-ranking. The changes necessary to be able to use a sparse latent representation besides a term frequency representation can be divided into those related to document indexing and those related to query representation and search. In this chapter, the standard implementation of the Anserini indexing and searching processes is outlined, after which a description of the SLR re-ranking implementation follows. The complete implementation can be found in the Anserini fork [van Keizerswaard et al., 2020].

Thereafter, the constructed SLR re-ranker will be used for examining the impact of SLRs on document ranking. This will be conducted in section 3.3 by examining the performance of the SLR re-ranker, which uses SLRs for re-ranking. This thesis is not aimed at examining the efficiency of document ranking with SLRs. Therefore, the efficiency of the standalone neural ranker will not be measured; the thesis of Lodewijk van Keijzerswaard is aimed at this.

After examining the performance of the SLR re-ranker, the corresponding results will be discussed and it will be summarized how SLRs can be implemented in a document ranker. This will answer the first and second sub research questions.

Open-source toolkits play an important role in information retrieval research. In particular, allowing for the evaluation of performance benchmarks is an important aspect of these toolkits that provides a valuable addition to IR research.

This is the exact motivation behind Anserini: Reproducible Ranking Baselines Using Lucene [Yang et al., 2017]. As its description states, Anserini focuses on the ability to reproduce results of IR benchmarking, serving as a widely accessible point of comparison in IR research. Due to this 'reproducibility promise' of Anserini [Lin et al., 2016], it is a suitable toolkit for research into the use of sparse latent representations for document ranking and the impact that this use has on IR and downstream NLP tasks. The fact that Anserini is built on Lucene provides another argument to use it. Lucene is a Java library that provides many advanced searching functionalities. It has become the de facto standard for many search engines and would therefore make this research widely applicable.

3.1 Standard Ranking

3.1.1 Standard Indexing

The Anserini indexing component consists of several Java files and classes. The Java class IndexCollection, located in IndexCollection.java, is the main Anserini indexing component and controls the indexing pipeline.

For the task of inverted indexing, the Anserini component can be seen as a wrapper of Lucene, because Anserini is used for assembling the Lucene indexing components into an end-to-end indexer. Lucene supports multi-threaded indexing, but it only provides access to a collection of indexing components. Hence, an end-to-end indexer cannot be built by solely using Lucene.

Usually the Anserini indexing pipeline is used for indexing text from a collection of text documents. In this section this standard indexing approach is discussed. To run the indexing pipeline, a command has to be executed, to which several flags — determining the collection, the output folder and the index settings — have to be added.

Index The file DefaultLuceneGenerator.java is the first component in the indexing pipeline. A given collection is converted into Lucene documents containing three fields: contents, id and raw. By converting a collection into Lucene documents, an index can be constructed consisting of the inverted index and the documents.

Inverted Index The 'contents' field stores the data that Anserini will use to build the inverted index. In the standard indexing approach, an inverted index is constructed from the text of the documents in the collection. Hence, the 'contents' field in Lucene document j stores the text of document j. As discussed in 2.1.1 and illustrated in Figure 3.1, an inverted index consists of terms — words in the case of standard indexing — the IDs of the documents that contain these terms, and document frequencies. The ID value of a document is stored in the 'id' field of the Lucene document and connects the inverted index with the documents stored in the index.

Documents The documents are stored together with the document ID and the content from the 'raw' field. This is an optional field and can store additional information about the document, like a title, the number of words in the document or the document's sparse latent representation (see 3.2.1).


Figure 3.1: Overview of the index

3.1.2 Standard Querying

Given the index described above, the traditional TF-IDF or BM25 algorithms can perform a ranked search on this index. Inside the search function of the SearchCollection object, several objects are constructed to perform scoring. This function takes in the query, an IndexReader and a Similarity object, and returns the scored documents in a ScoredDocuments object. The query is the raw input string read by Anserini, the IndexReader provides a way to read the inverted index, and the Similarity object is used to compute document scores. The construction sequence consists of the QueryGenerator, BooleanQuery, one or multiple TermQuery's, TermWeights, FixedSimilarity's and Scorer objects, where the type of the last two depends on the type of ranking function that is used. The parts of these objects that are significant to this project are discussed below. An overview of this process can be found in Figure 3.2.

QueryGenerator The QueryGenerator object takes in the query and extracts tokens (terms) from it. Then, an analyser removes stopwords and stems the terms if needed. What remains is a collection of terms, also called a bag of words. For every term a TermQuery object is created, which is then added to the BooleanQuery. This process effectively splits the scoring of documents into scoring per term.


Figure 3.2: A simplified overview of the standard document scoring process of Anserini.

TermQuery For the purpose of this project, the TermQuery object is simply a wrapper for the TermWeight object. The TermQuery contains the term that needs to be matched and scored, and creates a TermWeight object.

TermWeight The function of the TermWeight object is to compute the data needed to score a document, and to find the documents that match the term. The IndexReader is used to gather the information required to create a TermStatistics and a CollectionStatistics object for the term. These objects are passed to a Similarity object to create a FixedSimilarity object (SimFixed in Figure 3.2), which has a score function that can score a document for a single term. The document matching also makes use of the IndexReader object to create a document enumeration, which confusingly is called a TermsEnum in the Anserini source code. This is done by looking up the document IDs that the term maps to, as described in Section 2.1.1, in the inverted index. Both the SimFixed and the document enumeration are passed to the TermScorer object.

Similarity The Similarity object determines the type of ranking function that is used (options include TF-IDF, BM25, query likelihood, and others).

FixedSimilarity The SimFixed is an object of which the only function is to take a document as a parameter and return a term-document score. The exact computation of the score differs between ranking functions.

Scorer A Scorer object receives the list of matched documents and a term-specific FixedSimilarity object. The search function will score every matched document for each term using the score function of the SimFixed object. The score for every document is calculated by taking the sum over the matched term-specific scores. These scores are accumulated in a ScoredDocuments object and are returned to the parent function to obtain a document ranking.

Something worth noting is that this search algorithm architecture does not score every document: if a document contains none of the terms from the query, it is never matched, and thus never scored.

3.2 SLR Re-Ranking

3.2.1 SLR Re-Ranking Indexing

One way to improve re-ranking efficiency is to precompute all document embeddings. For this reason an index is created that stores the standard text-based representation as well as the sparse latent representation. Posting lists are still created using the text-based representation to facilitate efficient first-stage ranking using BM25.

The raw field is changed to a dictionary that can contain both the raw text and the document embedding. This way the embeddings can simply be retrieved at query time instead of having to be computed live. This allows the re-ranking to score more of the top documents in roughly the same time, leading to a higher accuracy.

3.2.2 SLR Re-Ranking Querying

Building on top of the Standard Querying discussed previously, we implemented a way to compare the ranking scores of the conventional retrieval to the SLR scoring. This ’hybrid’ implementation searches using the conventional term based ranking and then recomputes the scores for the top-k documents based on their SLR.

As stated in 3.2.1, the SLR is stored in the index as a string value and can be retrieved by accessing the dictionary in the raw document field. This string is converted to a vector and its dot product with the query SLR is computed to get the neural score.

An advantage of this method is that it uses the efficiency of the traditional retrieval while directly examining the effectiveness of the neural search, by re-ranking the top-k documents using the SLR ranking capabilities. A disadvantage of this approach is that the SLR score re-ranking only happens for the top-k documents; other documents with potentially higher neural scores might be withheld by the first stage ranker.

This method is only aimed at directly comparing the ranking effectiveness of the two rankers; it does not evaluate the performance efficiency of the SLR. For that purpose, a fully standalone neural ranker had to be implemented. This fully standalone neural ranker has been implemented by Lodewijk van Keijzerswaard and is discussed in his thesis.
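Outside of the Anserini Java code, the hybrid re-ranking logic amounts to the following Python sketch; the inputs (a BM25 result list and a mapping from document ID to an already-parsed SLR vector) are assumptions made for illustration.

```python
import numpy as np

def rerank_top_k(bm25_results, doc_slrs, query_slr, k=50):
    # bm25_results: list of (doc_id, bm25_score) sorted by the first-stage ranker.
    # doc_slrs: doc_id -> SLR vector parsed from the 'raw' field of the index.
    top_k = bm25_results[:k]
    rescored = [(doc_id, float(np.dot(query_slr, doc_slrs[doc_id])))
                for doc_id, _ in top_k]
    # Replace the BM25 scores by the SLR dot-product scores and re-sort.
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)
```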


3.3 Testing performance of SLR Re-Ranking

In the previous sections, a search engine was developed that supports two approaches: standard ranking and re-ranking with SLRs. This section will compare the performance of the standard ranker and the SLR re-ranker. Through this comparison an answer can be given to the second sub research question: "What is the impact of SLRs on document ranking?". As described in 3.2.2, SLR re-ranking is only implemented to directly compare the ranking effectiveness of the two rankers. It is not implemented for evaluating the performance efficiency of the SLR. For that purpose, a fully standalone neural ranker is implemented and the performance of that ranker is discussed in the thesis of Lodewijk van Keijzerswaard. Therefore, this experiment will evaluate the ranking effectiveness of the two rankers and not the performance efficiency.

3.3.1 Method

First, an index was constructed with SLRs as additional content. Then, searching using the standard ranking approach (section 3.1.2) and searching using the SLR re-ranking approach (section 3.2.2) was performed. For computational reasons, both rankers only retrieved fifty documents per query. SLR re-ranking was conducted by calculating the dot product — the SLR similarity score — between the document SLR and the query SLR for all fifty documents per query. Thereafter, the standard BM25 score of the document and query was replaced by this SLR similarity score. The standard ranking was conducted using the standard searching approach. For both rankers the searching resulted in a results file. This results file contains the fifty retrieved documents for each query, sorted by their similarity score with the query. The results files of the two rankers contain the same fifty documents per query, but the order of the documents differs because of the different similarity scores — the BM25 score in the case of the standard ranker and the SLR similarity score in the case of the SLR re-ranker.

3.3.2 Data

Ranking was performed on the TREC Robust04 collection, a collection gathered by the Text REtrieval Conference (TREC). It contains 528,155 documents: news articles from the Financial Times, Federal Register 94, FBIS and the LA Times. The collection has been chosen for its extensiveness: it is large enough for the obtained results to be considered relevant. Moreover, evaluation can be performed straightforwardly with the standard TREC evaluation program and with the provided queries and relevance labels.

3.3.3 Evaluation

Evaluation has been executed with the TREC evaluation software for TREC collections. In addition to the documents, the collection provides a file with topics (queries) and a file with qrels (relevance labels). The topics file contains the 250 queries and their IDs. The qrels file contains, for each query, a list of documents that are considered relevant. These relevance labels are made by humans who manually select relevant documents for each query.

TREC evaluation compares the results file, produced by the ranker during searching, with the qrels file and calculates metrics based on the differences. In this experiment the precision scores for the top 5, top 10, top 15, top 20 and top 30 documents were calculated. The precision score for the top N documents (P@N) is calculated by dividing the number of relevant documents in the top N by N.
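Per query, the precision-at-N metric amounts to the small calculation sketched below (the document IDs are made up for illustration).

```python
def precision_at_n(ranked_doc_ids, relevant_doc_ids, n):
    # P@N: fraction of the top-n retrieved documents that are relevant.
    top_n = ranked_doc_ids[:n]
    return sum(1 for doc_id in top_n if doc_id in relevant_doc_ids) / n

ranking = ["d3", "d7", "d1", "d9", "d2"]   # retrieved order for one query
qrels = {"d1", "d2", "d5"}                 # documents judged relevant
print(precision_at_n(ranking, qrels, 5))   # 0.4
```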

3.3.4 Results

The precision scores for the two rankers can be seen in Table 3.1.

Ranker / Metric     P@5    P@10   P@15   P@20   P@30
Standard Ranking    0.50   0.44   0.40   0.36   0.31
SLR Re-Ranking      0.52   0.49   0.45   0.42   0.36

Table 3.1: Evaluation of the standard ranker and the SLR re-ranker on TREC Robust04

3.4 Conclusion and Discussion

In this chapter the SLR re-ranker was created to evaluate the difference in effectiveness between the two approaches. The SLRs were implemented in the standard ranker by embedding the SLRs in the word-based index for each document. Documents can be retrieved using the word-based ranking and are then re-ranked according to the dot product between the SLR of the query and their own SLR.

The impact of SLRs on document ranking can be deduced from the precision scores for the top 30 documents of the SLR re-ranker and the standard ranker, as shown in Table 3.1. The SLR re-ranker scores higher on each precision metric, and therefore it can be concluded that document ranking is improved by the implementation of SLRs. However, for computational reasons the rankers only retrieved fifty documents per query. To validate these findings, future research should aim at replicating the results when retrieving more documents.


Chapter 4

Fake News Classification

The previous chapter has implemented SLRs in a search engine and has examined the impact of SLRs on document ranking, which allowed the first and second sub-research questions to be answered. This chapter and chapter 5 will examine fake news classification and the implementation of fake news classification in document ranking. The best fake news classification approach will be investigated by constructing several fake news classifiers and comparing the results. These fake news classifiers will employ the four different document representations that have been discussed in section 2.2. One of the classifiers that will be constructed is a classifier using SLRs, and by examining the results of this classifier, the fourth sub-research question can be answered. By examining the impact of SLRs on document classification, a first step will be made in the investigation of using SLRs for document classification.

The rise of online mass media has led to an increase of fake news. While the amount of fake news on the internet is unknown, Facebook estimates that the platform may be infested with 60 million bots. Top-ranking intelligence officials in the United States testified before Congress with proof that fake news was deliberately spread by Russia to manipulate the 2016 US presidential election [Lazer et al., 2018]. According to [Lazer et al., 2018], fake news can be defined as fabricated information that imitates real news articles in form but not in organizational process or intent. Fake news articles do not ensure the accuracy and credibility of the information and can be misinformation — false or misleading information — or disinformation — false information that is purposely spread to deceive people.

Fake news classification is the task of assigning the labels ’real’ and ’fake’ to news articles. In this chapter, several approaches to fake news classification are discussed. This chapter starts with a description of the datasets that have been used for fake news classification. Section 4.2 will discuss how the four document representations have been created. Section 4.3 will address the experiments that have been conducted with these features and on the described datasets, and will display the results of these experiments. In the last section, conclusions will be drawn and briefly discussed.

The subsequent chapter will implement the best performing classifier in document ranking by using the real news probability of a document — the probability that a document is real according to the fake news classifier — in the re-ranking phase of a standard ranker.

4.1 Data

Fake news classification has been conducted with two datasets: a dataset gathered by McIntire and the ISOT fake news dataset. The following sections will give an overview of these datasets.

4.1.1 ISOT

The ISOT dataset is a fake news dataset gathered by the ISOT lab of the University of Victoria and contains 44,898 articles, of which 21,417 are real and 23,481 are fake [Ahmed et al., 2018]. The real articles were largely gathered from Reuters.com, and the fake articles were collected from unreliable websites that were flagged by the fact-checking organizations Politifact and Wikipedia. Each article in the dataset includes its title, text, subject and date of publication.

Figure 4.1: Overview of the ISOT fake news dataset.

The dataset contains articles on various subjects, as illustrated in Figure 4.1.

4.1.2 McIntire

The McIntire dataset is a fake news dataset gathered by George McIntire from Open Data Science and contains 6,334 articles, of which 3,171 are real and 3,163 are fake [McIntire, 2018]. Many of the real articles were collected from renowned media organizations such as The Guardian, Bloomberg, and the New York Times, while the fake articles were gathered by the data science community Kaggle; during the 2016 US election cycle, Kaggle hosted a competition where users could report fake news articles.

For each article the dataset contains an ID number, title, body of text, and the appropriate label of ‘REAL’ or ‘FAKE’. This dataset can be found in the GitHub repository referenced previously.

4.2 Creation of document representations

Fake news classification has been implemented with four different representations as features. For each of these configurations several experiments were conducted. The description of the experiments and the corresponding results can be found in 4.3. This section will focus on the creation process of the four different representations.

4.2.1 TF-IDF

In tfidfRepres.py the TF-IDF representations have been generated using the Python library scikit-learn. First the TfidfVectorizer class was instantiated with the following arguments: tokenizer = ’words’ and stop_words = ’english’. These arguments were set because tokenization and the removal of stopwords lead to a reduction of the vector dimensions, as described in 2.2.1. After class instantiation the representations were generated by calling the fit_transform method, with a list containing the articles as input.
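A minimal sketch of this step is shown below. Note that it uses the standard scikit-learn options analyzer="word" and stop_words="english", which may differ slightly from the exact arguments in tfidfRepres.py; the variable articles is an assumed list of article texts.

from sklearn.feature_extraction.text import TfidfVectorizer

# articles is assumed to be a list of article texts (strings)
vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
tfidf_matrix = vectorizer.fit_transform(articles)   # sparse matrix: documents x vocabulary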

4.2.2 Doc2Vec

Doc2Vec representations have been created by first training a Doc2Vec model and then utilizing the trained model for the generation of Doc2Vec vectors. The file train_d2v_model.py trains a Doc2Vec model using the Gensim Python library on the whole dataset (the train and test set). First the articles are tokenized using the nltk word_tokenize function and all uppercase letters are converted to lowercase. Then for each article a Gensim TaggedDocument is constructed, consisting of the tokenized words and an ID (the index of the article). After these pre-processing steps, a Doc2Vec model is instantiated with the parameters displayed in Table 4.1.

Epochs                    30
Vector dimensions         50
Training rate (alpha)     0.025
Training algorithm        Distributed Memory
Minimal word frequency    2

Table 4.1: The parameters for the training of the Doc2Vec model

Finally, the instantiated model is trained with the list containing the TaggedDocument of each article, and saved so that it can be used for the generation of Doc2Vec vectors. In d2vRepres.py, Doc2Vec representations are created by calling the infer_vector method on the saved model.
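A condensed sketch of the training and inference steps described above is given below; file names and exact pre-processing may differ from train_d2v_model.py and d2vRepres.py, and articles is an assumed list of article texts.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# articles is assumed to be a list of article texts (strings)
tagged_docs = [TaggedDocument(words=word_tokenize(text.lower()), tags=[i])
               for i, text in enumerate(articles)]

# dm=1 selects the Distributed Memory training algorithm (Table 4.1)
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=2, dm=1, epochs=30)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")

# Generate a 50-dimensional Doc2Vec vector for every article
doc_vectors = [model.infer_vector(doc.words) for doc in tagged_docs]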


4.2.3 BERT

For the creation of BERT representations, the libraries ‘bert-serving-server’ and ‘bert-serving-client’ have been employed. ‘Bert-serving-server’ was used for setting up a server that runs a pre-trained BERT model. The pre-trained model is the BERT-Base uncased model trained by Google [Google, 2020]. In bertRepres.py, the BertClient class from ‘bert-serving-client’ was employed for establishing the connection with the server. The representations have been generated by calling the encode method of the BertClient class with the articles as input.
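The interaction with the BERT server roughly looks as follows; the model directory path is an assumption, and articles is again an assumed list of article texts.

from bert_serving.client import BertClient

# The server must already be running, e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker=1
bc = BertClient()

# articles is assumed to be a list of article texts (strings)
bert_vectors = bc.encode(articles)   # numpy array of shape (num_articles, 768)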

4.2.4 SLR

The SLR representations have been collected by running the efficient NLP pipeline [Rau, 2020]. The produced .tsv file, in which the SLRs are stored as strings, was then processed in Python and converted to a list of SLR numpy arrays.
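Assuming each line of the .tsv file contains a document ID and its SLR as a whitespace-separated string of weights (the exact output format of the pipeline may differ, and the file name below is hypothetical), the conversion can be sketched as:

import csv
import numpy as np

slr_vectors = []
with open("slr_output.tsv", newline="") as f:   # hypothetical file name
    reader = csv.reader(f, delimiter="\t")
    for doc_id, slr_string in reader:
        # The SLR is stored as a string of weights; parse it into a numpy array
        slr_vectors.append(np.array(slr_string.split(), dtype=float))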

4.3 Experiments

This section will describe the fake news classification experiments that have been performed with different features and on different datasets.

4.3.1 Training and testing with McIntire

First, fake news classification was conducted on the McIntire dataset using four different features: TF-IDF, Doc2Vec, BERT and sparse latent representations. 70 percent of the dataset’s articles were designated as a training set for classification, and the remaining 30 percent were reserved as a test set. Using scikit-learn, an SVM classifier was constructed and the fit method was employed to train the classifier with the training set. After completing the training phase, the predict method was applied to the classifier with the test features as input, to predict the test labels. The predicted labels were given to the f1_score function, together with the original test labels, for F1 score calculation. The F1 scores for the fake news classification with the four different representations are displayed in Table 4.2.

Metric / Representation TF-IDF Doc2Vec Bert SLR

F1 0.93 0.90 0.90 0.75

Table 4.2: The F1 scores of fake news classification for four different representations on the McIntire dataset.
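A minimal sketch of the experiment described in section 4.3.1 is given below. It assumes that features holds the document representations (e.g. the TF-IDF matrix) and labels the corresponding ‘REAL’/‘FAKE’ labels; the default SVC kernel and the random_state value are assumptions rather than the thesis’s exact settings.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# features: document representations (e.g. the TF-IDF matrix)
# labels:   the corresponding 'REAL'/'FAKE' labels
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

clf = SVC()                       # SVM classifier with the default kernel
clf.fit(X_train, y_train)         # training phase
y_pred = clf.predict(X_test)      # predict the test labels
print(f1_score(y_test, y_pred, pos_label="FAKE"))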

Fakest and realest words

As Table 4.2 illustrates, the F1 scores for all of the representations were extremely high. Therefore, data analysis was conducted to clarify these results and to examine whether the articles contained words that make it very easy for the classifier to distinguish real from fake, such as the name of the publisher. If this were the case, such indicative words would need to be removed from the dataset in order for the classifier to work well on a less known or more diverse set of sources. To clarify these high F1 scores, the words that most strongly separate real articles from fake ones were subsequently investigated; these words are referred to as the ’realest words’ and the ’fakest words’.

’Real’ words are words that frequently occur in real news articles and rarely in fake news articles. Likewise, ’fake’ words occur frequently in fake articles and rarely in real news articles. The ‘realness’ or ‘fakeness’ of a particular word is measured by its fake:real ratio — how often it appears in fake articles as compared to real ones. The ‘realest’ and ‘fakest’ words of the dataset are found by calculating the fake:real ratios of all the words in the dataset and then sorting them in descending order. By this method, the top 15 fakest words are the 15 words at the top of the list and the top 15 realest words are the 15 words at the bottom of the list.

The calculation of the fake:real ratios started with counting, for each word in the dataset, the number of occurrences in real and in fake articles. This was conducted with the scikit-learn CountVectorizer, and afterwards the output was converted into two matrices containing, for each word, the number of occurrences in real and fake articles: the real_word_count matrix and the fake_word_count matrix. Then Laplace smoothing — adding one to each value in the matrix — was applied to avoid divisions by zero. After the Laplace smoothing, the fake:real ratios were calculated by dividing the fake_word_count matrix by the real_word_count matrix.
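A sketch of this ratio computation is given below; fake_articles and real_articles are assumed lists of article texts, and get_feature_names_out requires scikit-learn 1.0 or later.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# fake_articles and real_articles are assumed to be lists of article texts
vectorizer = CountVectorizer()
vectorizer.fit(fake_articles + real_articles)

# Total number of occurrences of every vocabulary word per class
fake_word_count = np.asarray(vectorizer.transform(fake_articles).sum(axis=0)).ravel()
real_word_count = np.asarray(vectorizer.transform(real_articles).sum(axis=0)).ravel()

# Laplace smoothing (add one) avoids divisions by zero
ratio = (fake_word_count + 1) / (real_word_count + 1)

words = vectorizer.get_feature_names_out()
order = np.argsort(ratio)[::-1]        # descending fake:real ratio
fakest_words = words[order][:15]       # top of the list
realest_words = words[order][-15:]     # bottom of the list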


Figure 4.2: The top 15 fakest words based on the ratio of occurrences in fake and real articles and divided into three classes

The top 15 realest words are illustrated in Figure 4.3 and the top 15 fakest words are illustrated in Figure 4.2.

Once the top 15 realest and fakest words were identified, they were sorted into related groups based on their subject. For instance, several of the realest words were grouped into a set of politicians’ names, including those of Republican presidential candidates Carly Fiorina and Ted Cruz. Another group captured common political terms like ’1237’ — the number of delegates that Trump needed to become the Republican nominee. Lastly, there was a group labeled “Other”, which contained the word ‘durst’ — the last name of a murderer who repeatedly made the news in 2016. With the exception of ‘durst’, all of the words in the top 15 realest words are words that would be used in political articles. Meanwhile, the fakest words naturally grouped themselves into website URLs, conspiracy-related words, and words related to foreign countries. Interestingly, most of the fakest website URL words referred to the Infowarsstore URL. Infowarsstore is an extreme right-wing website that publishes, among other things, fake news articles about chemtrails and 9/11 government conspiracies. Similarly, the "conspiracy-related" group consists of conspiracy websites like Trunews and Infowarsstore, conspiracy theorists like Pilger, and words often associated with conspiracy theories such as UFO.


Figure 4.3: The top 15 realest words based on the ratio of occurrences in real and fake articles and divided into three classes

These groupings suggest that real news articles focus on reporting politics at a high level, using political terms and the names of politicians. In contrast, fake news articles report on conspiracy theories, using conspiracy-related words and URLs of conspiracy websites. Neither top 15 list includes words that make it very easy for the classifier to distinguish fake and real news articles, such as the name of the newspaper. Hence the high F1 scores cannot simply be explained by the articles containing source-related words that ought to be removed.

To test whether the classifiers are merely overfitting on this dataset, and whether they can also distinguish fake and real news articles from another dataset, the classifiers will be tested on the ISOT dataset.

4.3.2 Training and testing with ISOT

Before the classifiers trained on the McIntire dataset were tested on the ISOT dataset, classifiers were constructed that were trained and tested on the significantly larger ISOT dataset. These classifiers, using the four different representations, were trained and tested in the same way as the classifiers trained and tested on the McIntire dataset (described in section 4.3.1). The F1 scores of fake news classification for the four representations on the ISOT dataset are shown in Table 4.3.


Metric / Representation TF-IDF Doc2Vec Bert SLR

F1 0.98 0.98 0.99 0.82

Table 4.3: The F1 scores of fake news classification for four different representations on the ISOT dataset.

As Table 4.3 shows, the F1 scores of the classifiers trained and tested on the ISOT dataset are also extremely high. To test whether these classifiers are overfitting on this dataset, and whether they can distinguish fake and real news articles from another dataset, the classifiers will be tested on the McIntire dataset.

4.3.3 Training with McIntire and testing on ISOT

To examine whether classifiers trained with the McIntire dataset can also distinguish fake news articles from real news articles on another dataset, the classifiers were tested with the ISOT dataset. Because the McIntire dataset mainly consists of political articles, the McIntire-trained classifiers were only tested with the political articles from the ISOT dataset. The 18,113 political ISOT articles consist of fake news articles given the ’politics’ subject and real news articles given the ’Politics-News’ subject. Because the fake articles in both datasets were gathered from sources that were flagged as fake, there is some overlap in the articles. In other words, some articles from the McIntire dataset — flagged as fake by Kaggle users — are also in the ISOT dataset because their sources have also been flagged by Politifact. Therefore, the articles that occur in both datasets have been removed from the ISOT dataset.

The classifiers have been trained with the complete McIntire dataset. Similar to the previous experiments, this was conducted by employing the fit method on the SVM object. After the training stage was finished, the classifiers were tested on the selected articles from the ISOT dataset and the F1 scores were calculated (Table 4.4).

Metric / Representation TF-IDF Doc2Vec Bert SLR

F1 0.76 0.79 0.59 0.68

Table 4.4: The F1 scores of fake news classification for four different representations trained with the McIntire dataset and tested on the ISOT dataset.

It can be seen that the F1 scores obtained from this experiment were significantly lower than the F1 scores from the previous experiment, where classifiers were trained and tested on the McIntire dataset (4.3.1). Further conclusions will be drawn in section 4.4.

4.3.4 Training with ISOT and testing on McIntire

In the last fake news classification experiment, SVM classifiers have been trained with the selected ISOT articles and have been tested on the complete McIntire dataset. See Table 4.5 for the acquired F1 scores.
