Detection of misinformation in Facebook data

(1)

Detection of misinformation

in Facebook data

(2)

Layout: typeset by the author using LA_TEX.

(3)

Detection of misinformation

in Facebook data

Creating A Classifier Using Text-Based Methods

Joris D. Hijstek 11876980

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam Faculty of Science

Science Park 904 1098 XH Amsterdam

Supervisor

Jaap Kamps and David Rau

Institute for Logic, Language and Computation Faculty of Science

University of Amsterdam Science Park 907 1098 XG Amsterdam

(4)

Abstract

Misinformation spread on the internet is a growing problem. Such spreading can occur accidentally or purposely, and can contribute to shifting narratives and civil unrest. It would be beneficial to automate the detection of such posts in order to combat the spread of misinformation. However, the importance of not classifying normal posts as misinformation cannot be understated. The aim of this thesis is to investigate whether it is possible to construct a classifier that can classify misinformation. In addition, this thesis tries to create a solid base for the creation of a classifier of misinformation in Facebook posts. To help with these tasks, this thesis first aims to create a baseline from datasets extracted from the internet, and extend this onto the Facebook dataset. The Facebook dataset is a section of years of posts created on the Facebook platform, and has many features that are almost impossible to obtain otherwise, which makes the Facebook dataset a unique opportunity to try and combat the spread of misinformation.

(5)

1 Problem statement

1.1 Introduction

In the last decades, the internet has connected humans more than ever before. With the creation of social media websites, such as Facebook (2004), Reddit (2005), and Twitter (2006), sharing information among friends has become even easier. The emergence of these websites has also made it easier for misinformation to spread unchecked. When spread purposely with malicious intent, this is called disinformation, derived from the term "dezinformatsiya", a term first coined by the KGB in the USSR. In the last years, especially in the run up and in the wake of the American presidential elections of 2016, there has been an increasing demand for Facebook to properly combat the spread of misinformation.

Edelman states that public opinion is not an observable entity, but a social con-struct. It is impossible to match one person to another when comparing their opinions, as it is possible to sway opinions based on how a political or economi-cal stance is presented. Therefore, he argues, it is of utmost importance to label misinformation as objectively as possible, as any falsely perceived objectivity can lead to the censorship of normal content, or inaction on malicious content.[2]

In this thesis, three different datasets are used to train classifiers, and these results are then compared.

For the first part of this thesis, two datasets were extracted from the internet: the Buzzfeed dataset1_{, and the news_articles.csv}2 _{dataset. These datasets are used}

to train six different classifiers using two different vectorization methods, and the same datasets are then used to train a neural network using pretrained embed-dings.

The second part of this thesis consists of the Facebook dataset, which, in its en-tirety, is about an exabyte in size. However, due to limitations in the data that was made available, the actual dataset is between two and three times as big as the publicly available datasets. A large amount of information is being sent to Facebook every minute. Therefore, it is impossible to manually screen every post for misinformation before it is posted. This presents a problem, as the demand for Facebook to take responsibility for its role in the spread of misinformation is increasing rapidly.

A number of posts in the Facebook dataset have a label, named ’tpfc_rating’, which contains information about its perceived truth value. These labels consist

1_{https://github.com/sfu-discourse-lab/Misinformation_detection} 2

(8)

https://www.kaggle.com/ruchi798/how-do-you-recognize-fake-of ’fact checked as false’, ’fact checked as true’, ’not eligible’, ’fact checked as mix-ture or false headline’, ’satire’, ’opinion’, and ’prank generator’.

As Facebook deals with real users, the raw dataset would contain highly sensitive personal data of a large number of users. Therefore, the dataset is preprocessed to increase privacy.[1] The following table shows all available columns that are present in the dataset:

URL attribute table Breakdown table

URL_rid URL_rid

Clean_text Year-month

Parent_domain Country Full_domain Age_bracket first_post_time Gender

first_post_time_unix Political page-affinity

Share_title views

Share_main_blurb clicks

tpfc_rating shares

tpfc_first_fact_check total_shares_without_clicks tpfc_first_fact_check unix likes

spam_usr_feedback loves false_news_usr_feeback hahas hate_speech_usr_feedback wows public_shares_top_country sorrys

angers

Table 1: Column names of the Facebook dataset

Next, the results yielded by the publicly available datasets and the Facebook dataset are compared, and an overlapping conclusion is drawn.

This conclusion aims to answer a general research question, which is formulated and divided into sub-questions in the next section.

(9)

1.2 Research questions

For this thesis, the main research question is "How viable are text-based approaches when classifying misinformation?"

In order to answer this question, this thesis is divided into multiple sub-questions.

I What is the current state-of-the-art when it comes to misinformation detec-tion?

II How well do text-based approaches work on datasets found on the internet?

III How well do text-based approaches work on the Facebook dataset?

IV How do the results of the publicly available datasets compare to those of the Facebook dataset?

1.3 Thesis overview

In this section, the rest of the thesis is laid out in single sentences.

Chapter 2 summarizes some state-of-the-art research that has been conducted on the spread and detection of misinformation, and furthermore discusses the theo-retical concepts behind the more complicated methods used in this thesis. Chapter 3 describes and analyzes the publicly available datasets. Chapter 4 lays out the experiments that were conducted on these datasets. Then, chapter 5 reviews and analyzes the Facebook dataset. The results yielded by the experiments conducted on the Facebook dataset are described in chapter 6. Then, chapter 7 and chapter 8 reveal the results achieved on the publicly available datasets and the Facebook dataset respectively. Chapter 9 describes some of the problems and considerations that were made, and lays out an idea for future work. Finally, in chapter 10, a general conclusion is drawn from the total set of results.

The last chapter gave an overview of the problem that this thesis is based on, and explains why it is necessary to combat the spread of misinformation as effec-tively as possible. It also provides the thesis with research questions to answer. The next section will first attempt to answer the first sub-question by summing up related work and development. Then, in order to provide a solid theoretical background, vital background information on different methods used in this thesis is be provided.

(10)

2 Background

In this section, some of the state-of-the-art approaches are explored, and some of the more abstract ideas that are used in this thesis are discussed and explained.

2.1 Related work

There have been numerous attempts at classifying misinformation using various methods. Rajdev, Meet, and Kyumin Lee. (2015) used different decision tree clas-sifiers to classify tweets generated during natural disasters. The features extracted from these tweets are similar to those in the Facebook dataset: time zone, n-grams, tweet creation time, and text. In addition to using decision trees, they created mul-tiple models, which they named flat classification and hierarchical classification. In flat classification, the possible labels are classified normally. In hierarchical classification, however, the labels were divided into subgroups, which were then classified separately. These models achieved a high accuracy, with a notable in-crease when using hierarchical classification.[3]

It is also useful to detect misinformation before it goes viral, as it is better to pre-vent spreading of misinformation than to correct it afterwards. Vicario, Michela Del, et al.(2019) created a system that allows for the detection of misinformation using sentiment analysis techniques and different classification algorithms. They used a diverse set of features which are similar to those in the Facebook dataset.[4]

A lot of effort has also been put into observing the spread of misinformation, rather than the possible misinformation itself. There are numerous researchers that have attempted to create a model of information networks, like Nguyen, Nam P., et al. and Tambuscio, Marcella, et al., such that the actual nodes, or connec-tion points of a network, can be isolated and quarantined.[7][8]

This thesis uses text-based approaches when classifying articles and Facebook posts. One of the methods used is TF-IDF, which will be described in the next sec-tion. Jones first mentioned using not only the frequency of a term in a document, but rather also using the frequency of that term compared to the entire corpus.[9] This statistic is widely used by most search engines for information retrieval.

2.2 TF-IDF

The term frequency–inverse document frequency is a measure that signifies the importance of a word to a document compared to the importance of that word to the entire corpus. The TF-IDF value is computed as follows:

(11)

TF-IDFt,d = tft,d× idft (1)

Here, tft,d is the term frequency which is the amount of times term t occurs in

document d,

and idft is the inverse document frequency which is calculated as

idft= log

N dft

(2)

where N is the total number of documents, and dft is the amount of documents in

the corpus that contain the term t.

Other than statistics like TF-IDF, there is also a need to actually use these statis-tics to classify misinformation. The next section will lay out the most important models used in this thesis.

2.3 Decision trees

Decision trees are models that can be used for either classifying data or applying regression. The idea behind decision trees is going from features in a dataset to a model and using this model to classify a new sample.

Figure 1: A simple decision tree classifying gender

A decision tree consists of nodes, branches, and leaves. Nodes are intersections that decide what side the algorithm should choose to most accurately classify the data. Branches connect nodes with other nodes and leaves, and represent the choice made in the previous node. Leaves are outcomes of the algorithm, and correspond to the class of the sample.

(12)

2.4 Support Vector Machine (SVM)

A support vector machine is a classification model that attempts to create an (n-1)-dimensional plane through n-dimensional data points, in order to create areas that are then used for classifying new data points.

2.5 Neural network and layers

A neural network is a type of architecture which attempts to loosely mimic a brain. A simple network consists layers containing nodes, which can accept input and send this input to the next layer. In essence, they are pattern recognition systems that work best on data that contains some sort of pattern.

Figure 2: A basic neural network with one hidden layer

2.5.1 Embedding layers

An embedding layer is a special type of layer that maps discrete values to a vector of continuous ones. It will map words like ’Trump’ and ’president’ to values relatively close to each other, and words like ’Christmas’ and ’neural’ will be mapped to values that are quite far apart. This means that the cosine similarity between two embeddings of similar texts will be greater than that of two more differing texts. It is possible to use pretrained embeddings, like Glove, to process a new corpus, which makes working with small datasets more viable. The image below shows a simple 2D representation of the embedding of a s all collection of words:

(13)

Figure 3: A representation of embedding

Words like sister, niece, and aunt are very close together as they are all female, but are relatively close to words brother, nephew, and uncle.

The primary use of such embeddings is creating a solid platform for datasets with limited vocabulary. If a model is trained on a small dataset, and thus, a limited vocabulary, the model will not have a realistic distribution of words, and will therefore not work well on new information.

As mentioned, the demand for the control over misinformation has been rising rapidly. There are different approaches when faced with this problem. For ex-ample, one could choose a text-based approach, where the focus lies solely with the actual content of the data, rather than the pattern around it and its spread. Another approach is analyzing the virality of a post, meaning that instead of its content, the behaviour of a post is studied.

In the next section, the details of the publicly available datasets are described and analyzed.

(14)

3 Summary of the publicly available datasets

In the previous section, some of the methods that will be used in this thesis were described, and a selection of state-of-the-art research was laid out.

In this chapter, the datasets that were available on the internet are described.

3.1 The smaller datasets

3.1.1 Features of the Buzzfeed dataset

The first dataset considered was the Buzzfeed-v02-originalLabels.txt dataset. This dataset consists of 1380 articles by 36 news outlets, which were rated ’mostly true’, ’mostly false’, ’no factual content’ (NFC), or ’mixture of true and false’. This dataset is similar to the Facebook dataset, as it consists of more than two labels.

It is feasible to look at other features than just word types. One feature is the length of the text. The corpus lengths of different news types in the Buzzfeed dataset are pictured below as a plotted distribution:

(15)

Since this dataset contains information about the author, we can also use the news outlet as a feature. Plotting the ratio of labels to total articles can give an indica-tion of a news outlet’s legitimacy:

(16)

(a) Ratio of true articles per news outlet (b) Ratio of false articles per news outlet

(c) Ratio of mixed articles per news outlet (d) Ratio of nfc content per news outlet

Figure 4: Ratios of labels for all news outlets

One limitation of this dataset is its size, so an outlet with one article will have a ration of 100% for its corresponding label. Still, it is a feature that is also present in the Facebook dataset, which makes it useful for a baseline.

3.1.2 Features of the news_articles.csv dataset

The second dataset considered as a baseline is the news_articles.csv dataset. It contains 2045 news articles by 68 authors. The labels used for this dataset are ’Real’ and ’Fake’. The most significant difference from the Buzzfeed dataset is that this dataset has two labels instead of four. (True or False)

(17)

(a) Ratio of true articles per news outlet (b) Ratio of false articles per news outlet

(18)

It is evident that most authors have a ratio of 100% for one of the labels. As with any dataset, preprocessing is a necessary step to ensure the results are as accurate as possible. As no two datasets are the same, it is imperative to create a different preprocessing pipeline for every dataset.

Of course, after analyzing the datasets, it is informative to describe the experi-ments that were conducted to classify misinformation. The next chapter describes these experiments, and gives a theoretical background for pretrained embeddings.

4 Method for the publicly available datasets

The previous section analyzed the details of the public datasets, like the label distribution, author credibility, and title lengths of different labels for both publicly available datasets. This chapter describes the experiments that were conducted on those datasets, and gives theoretical background for an underlying technique used for neural networks.

4.1 Preprocessing the public datasets

The datasets were downloaded in .csv and .tsv format, which first had to be read and inserted into a Pandas dataframe to make it possible to manipulate the data. Then, the data was cleaned. First, all words were normalized by making all letters lowercase. This ensures that the capitalized word ’Trump’ and the uncapitalized word ’trump’, for example, are mapped to the same term. Next, where necessary (as some datasets had already done some preprocessing), special tokens were re-moved to ensure that they don’t end up in the cleaned text. All columns that were deemed irrelevant or unhelpful were dropped, and any remaining columns were named identically across every dataset in order to be able to more quickly apply code to all datasets.

The Buzzfeed dataset was copied and converted into a binary classification task by removing the ’no factual content’ and ’mixture of true and false’ tags. This led to the creation of a third dataset to compare to the Facebook dataset, together with the news_articles dataset.

(19)

4.2 Classification using classifiers

After preprocessing, the resulting data contained a clean string of terms and a corresponding label. These strings can be used to classify misinformation. For the low-level methods of classification, six classification models from sklearn3 were used.

1. Decision Tree Classifier

2. Random Forest Classifier

3. Adaboost Classifier

4. Multinomial Naive Bayes Classi-fier

5. Gaussian Naive Bayes Classifier

6. Support vector machine

4.3 Neural network classification

As a contrast to classifiers, machine learning was also used in order to classify misinformation. Neural networks were tested on two different vectorizations of the data. Using the TensorFlow API Keras4_{, deep learning models were built to}

classify news text. To help these models, pretrained embeddings were used to provide the network with weights that were independent of the vocabulary size of the datasets:

4.3.1 Glove and Word2Vec

One of the main weaknesses of embedding is that it is very easy to overfit on a dataset if that dataset is not sufficiently large. Therefore, there have been relatively successful attempts at pretraining embeddings on large datasets in order to mitigate this problem. It is also possible to use pretrained embeddings for deep learning. Jeffrey Pennington, Richard Socher and Christopher D. Manning from Stanford University trained embeddings from different sources, like Wikipedia and Twitter, and created embeddings of different dimensions, which are publicly available.[5] Google has developed its own set of publicly available embeddings, called word2vec.[6]

Of course, training with these embeddings will lead to poorer results on newer data, as certain words like Covid-19 or pandemic might not have been used much, if at all, leading to poor or nonexistent embedding.

3_{https://scikit-learn.org/stable/} 4_{https://keras.io/}

(20)

5 Summary of the Facebook dataset

An interesting aspect of the Facebook dataset is that there is multiple ways to retrieve a viable subset from the database. In this thesis, the chosen subset is straightforward. First, only posts that have been labeled, and thus don’t have a NaN value for ’tpfc_rating’, are selected. Then, only the posts that have been shared in either the UK or the USA are selected. Finally, Spanish posts are mostly filtered out by removing entries that contain characters that usually only appear in Spanish text.

The approximate distribution of labels is as follows:

It is important to observe that there is a large class imbalance, as there are sig-nificantly more posts that were fact checked as false. The Facebook dataset is cleaned and manipulated to be as anonymous as possible before it is released to researchers. The dataset is comprised of different origins, that is, the country that the post originated from. However, selecting just American (’US’) and British (’UK’) posts results in some different languages, like Spanish, being included in the dataframe. This could potentially lead to lowered accuracy, meaning they have to be filtered out.

(21)

The distribution of the text length of different labels looks as follows:

(22)

5.0.1 Biased data

The initial datasets were extracted from sources found on the internet. This means that a single person or a group of persons had to go and manually collect news articles from websites and archives. Subsequently, it is almost inevitable that these datasets contain at least some sort of bias. For example, every false article may contain certain words or n-grams, resulting to a overfitted model. It could be useful to manually create a dataset containing data with as little bias as possi-ble, in order to compare the results to the unique features of the Facebook dataset.

When inspecting the feature importances of the best-performing model, the out-put is the following:

Buzzfeed Buzzfeed Binary news_articles Facebook Features > 0 13953 6583 25661 11651 Total features 30010 27983 47275 14848 Percentage of total 46,5% 23,5% 54,3% 78.5%

Table 2: Total features, relevant features and percentage of features of the datasets

Plotting the numbers like this shows that the Facebook dataset has a higher per-centage of terms that are actually being used for classification, meaning that the baseline datasets likely contain more identifying terms that divide true and false articles. It is possible to extract the top features when vectorizing to visualise the bias in a dataset. This is done using sklearn’s feature_importances_ and get_feature_names(), while limiting the n-gram range to 2.

Buzzfeed Buzzfeed Binary news_articles Facebook trial error daily caller brilliant breaking

sir harambe loading news

said digestible chapters afp com provides baffled hillary photo breitbart hawkins provides clintons hear published mins obama

Table 3: Top features of the datasets extracted using sklearn

Attempting to extract these features shows some words that might be the result of a small dataset, like ’said’, but more importantly, also reveals that there are cer-tain adjectives and names that the classifiers focus on. Adjectives like ’digestible’, ’baffled’, and ’brilliant’ are reveal to have a significant impact on classification.

(23)

Notably, ’breitbart’, an American far-right5 _{news site is shown to be closely}

re-lated to whether or not something is classified as misinformation.

In the next chapter, the experiments that were conducted on the Facebook dataset are described.

6 Method for the Facebook dataset

In the previous chapter, the Facebook dataset is analyzed to get a better un-derstanding of the underlying patterns and distributions. In this chapter, the experiments that were conducted on the Facebook dataset are described.

6.1 Facebook classifiers

The first step for the Facebook dataset is straightforward. First, a subset must be defined on which the experiments are conducted. On this subset, the the exact same methods as with the baseline was applied first. Next, it is important to notice that, compared to the baseline datasets, the Facebook dataset has little data in terms of actual text. Therefore, the baseline was applied to three sets of data: the text, the title, and both. These sets were vectorized the same way as the baseline using CountVectorizer() and TfidfVectorizer(), and then classified using the aforementioned classifiers:

1. Decision Tree Classifier

2. Random Forest Classifier

3. Adaboost Classifier

4. Multinomial Naive Bayes Classi-fier

5. Gaussian Naive Bayes Classifier

6. Support vector machine

Next, the problem of class imbalance is covered. The dataset is reduced to contain the same amount of true and false labels, and the results are shown.

The next chapter discusses the results achieved with the public datasets. It first illustrates the effectiveness of classifiers, and then shows the results accomplished with neural networks.

(24)

7 Public datasets results

In the previous chapter, the experiments that were conducted on both the baseline-and the Facebook dataset are described.

In this chapter, the main focus is the results that were yielded by the experiments conducted on the publicly available datasets.

The main goal of this thesis is to classify Facebook posts as different categories. However, as a baseline, smaller datasets were used to create a working pipeline and to get a general idea of what methods are useful.

7.1 Classification using classifiers

The resulting accuracy values for different combinations of features are shown in the tables below. The column named ’Buzzfeed Binary’ refers to the Buzzfeed dataset with only the ’mostly true’ and ’mostly false’ labels.

The classifiers were first used on a dataframe containing a bag-of-words model created by sklearn’s CountVectorizer().

Buzzfeed Buzzfeed Binary news_articles DTC 0.722 0.912 0.698 RFC 0.785 0.936 0.721 ABC 0.785 0.936 0.721 MNB 0.759 0.934 0.657 GNB 0.785 0.936 0.678 SVM 0.775 0.956 0.632

Table 4: Baseline accuracy with only the bag-of-words values as a feature

The algorithms were applied to a dataframe containing just the TF-IDF values of the entire corpus. These were calculated by sklearns’s TfidfVectorizer().

(25)

Buzzfeed Buzzfeed Binary news_articles DTC 0.700 0.907 0.703 RFC 0.774 0.936 0.736 ABC 0.774 0.936 0.736 MNB 0.786 0.942 0.652 GNB 0.773 0.937 0.686 SVM 0.775 0.956 0.711

Table 5: Baseline accuracy with only the TF-IDF values as a feature

7.2 Classification using neural networks

7.2.1 Embedding vectorization

Using Glove, an embedding matrix was trained in order to work with the limited data of the baseline.

Figure 6: The accuracy on news_articles.csv using Glove’s pretrained embeddings

Similarly, word2vec was used to create an embedding layer for the dataset. The accuracy values over 100 epochs are show below:

(26)

Figure 7: The accuracy on news_articles.csv using word2vec’s pretrained embed-dings

This chapter aims to answer the second sub-question: How well do text-based approaches work on datasets found on the internet? Using two different vectoriza-tion methods, the classifiers were able to achieve similar results. The non-binary datasets classified correctly with an accuracy between 70% and 80%, while the Buzzfeed binary dataset achieved results between 90% and 96%.

The neural network scored similarly when using Glove and word2vec, having an accuracy of 68% and 70% respectively.

The next section aims to answer the third sub-question: How well do text-based approaches work on the Facebook dataset?

(27)

8 Facebook results

The previous section presented the results that were obtained by the experiments conducted on the baseline datasets. This chapter describes the results of the experiments conducted on the Facebook dataset, starting with the results of the same baseline applied to it. Unfortunately, due to the memory constraints imposed by the offline environment provided by Facebook, it was not possible to train with pretrained embeddings. This is due to the increased vocabulary size that the Facebook dataset offers.

8.1 Facebook classifiers

Since the Facebook dataset lacks a lot of text, it is interesting to see how this affects performance in comparison to the baseline. First, the low-level methods for the baseline were used on combinations of the title and text, to see if it is still viable to classify on small amounts of text. First, the classifiers were applied to just the text:

Text Title Text+title DTC 0.648 0.675 0.660 RFC 0.700 0.712 0.715 ABC 0.660 0.666 0.665 MNB 0.702 0.715 0.714 GNB 0.656 0.646 0.694 SVM 0.660 0.665 0.665

Table 6: Accuracy of different subsets of the Facebook dataset text with CountVec-torizer()

Table 7: Accuracy of different subsets of the Facebook dataset text with TfidfVec-torizer()

(28)

It is facile to observe that the Facebook dataset performs worse and arguably in a more realistic way than the baseline. The observed accuracy of the different classifiers used for the baseline perform between 10% and 20% better than when those same classifiers are applied to the Facebook dataset.

8.2 Facebook dataset imbalance

As mentioned earlier, all datasets that were used contained quite a large class im-balance. For Facebook, the distribution of classes looks like this:

Figure 8: Distribution of Facebook labels (unbalanced)

We can attempt to compensate for this by arbitrarily removing datapoints labeled as false until we reach a stadium where the number of posts that are labeled as false is about the same as the number of posts that are labeled as true. Doing this, we obtain the following distribution:

(29)

Figure 9: Distribution of Facebook labels (balanced)

Running the same experiments on this balanced dataset yields results that are lower than the first test, but are more realistic when applied to an actual unbiased selection of data.

Table 8: Accuracy of different subsets of the balanced Facebook dataset text with CountVectorizer()

(30)

Table 9: Accuracy of different subsets of the balanced Facebook dataset text with TfidfVectorizer()

These last two tables show that class imbalance is a serious issue when building a classifier, and it is therefore important to ensure that a dataset contains similar amounts of samples belonging to different labels.

In this chapter, the third sub-question was answered: How well do text-based approaches work on the Facebook dataset? Using just the text provided with the Facebook posts, the achieved results scored remarkably well when keeping in mind the lack of text. The classifiers scored between 62% and 72%, meaning they scored about 10% worse than the public datasets. However, when taking class imbalance into account, the accuracy dropped dramatically. The classifiers achieved an ac-curacy between 30% and 56%.

The next section will focus on discussing the thesis as a whole, and lays out the problems that occurred and were present during the writing of this thesis. It also gives a small pointer for possible future work.

(31)

9 Discussion

9.1 Initial results

This thesis focuses on creating a set of results for publicly available datasets, and then comparing these results to the results that follow from experiments conducted on the Facebook dataset. Using different vectorization methods, low-level methods were created in order to classify two distinct datasets. The first encoding method is count vectorization, in which each word is converted to a count of the total occurrences of that term. The second encoding method used is TF-IDF vector-ization, in which each term is assigned a value depending on the frequency in the text compared to the entire corpus. These encodings performed in a comparable manner. Another thing to observe is that the TF-IDF vectorizer uses the datasets corpus by default, meaning that it might have averse effects if the classifier is used on new data.

It should be noted that the TF-IDF method used was a simple log-idf. This is one of many possible choices when computing the inverse document frequency. For example, a different method is the PDF (Proportional Document Frequency), in which the frequency of a term is calculated proportionally to the domain it occurs in.

9.2 Facebook dataset

At the beginning of the period in which this thesis was written, the general consen-sus was that the Facebook dataset would be about 1 exabyte in size. A subset of this rather large amount of data would have labels that would make it interesting to create a subset, that is, focus on a single political entity and create a classifier for that entity. Unfortunately, the dataset that was actually made available was significantly smaller, and a small portion of this smaller dataset was actually la-beled.

Another aspect of the dataset, albeit an ethical one, is that the labels provided are still added by humans. As every human is different, different reviewers might label a certain Facebook post differently, whether through bias or interpretation.

(32)

Finally, the Facebook dataset consists of posts that have already been flagged as possible breaches of Facebook’s policies. This means that any data that appears in the dataset is already more inclined to not be rated as true, leading to a set that contains noise. This could suggest that simple text-based methods alone might be too simple to be effective, as most posts that would qualify would have already been filtered out.

9.3 Constraint on machine learning

The first part of this thesis divided the methods into two sections: classifiers and machine learning. For the machine learning, pretrained embeddings were used to train a neural network in order to prevent overfitting on the relatively small vocabulary of the datasets used. Unfortunately, the Facebook dataset is secured on Facebook’s offline infrastructure, meaning that any experiment is completely dependant on the freedom that such infrastructure offers. It was therefore not possible to use a pretrained embedding to train a neural network, as even a batch size of 1 was too much for a vocabulary size of three million.

9.4 Future work

First and foremost, the TF-IDF statistic relies on the corpus that is used to com-pare to individual terms. In this thesis, the method that was chosen relies solely on the corpus provided by the individual datasets. Hence, future work might include using a large newspaper corpus to more accurately calculate TF-IDF values.

An important thing to remember is that the Facebook dataset was smaller than expected. Having such a relatively small dataset for such an important problem adds difficulty, but perhaps also realism. Limiting the amount of data that future researchers work with could be beneficial to future projects with limited data.

Perhaps one of the most challenging aspects of misinformation classification is the constantly changing vocabulary and subjects of targeted areas of information. Therefore, creating a system that allows for the adaptation of newly tagged data would greatly reduce future errors in a classifier. A challenge for this idea is that it would most likely require ongoing manual tagging.

(33)

10 Conclusion

Summarising, this thesis makes an attempt at classifying misinformation, and looks at what text-based methods work best. While high-level methods tend to work, working with a larger dataset can lead to increased training times with little improvement. Classifying misinformation is doable using simple methods, while keeping in mind the choices that were made for the baseline and the constraints set for the Facebook dataset.

Recalling the research questions,

I What is the current state-of-the-art when it comes to misinformation detec-tion?

II How well do text-based approaches work on datasets found on the internet?

III How well do text-based approaches work on the Facebook dataset?

IV How do the results of the publicly available datasets compare to those of the Facebook dataset?

we can add the first three sub-questions together to gain a viable platform to an-swer the final sub-question, and then anan-swer the main research question.

The results achieved by the text-based methods show that, when using a public dataset, classification methods yield good results, reaching an accuracy between 70% and 96%.

The classification methods used on the Facebook dataset performed worse, but perhaps more realistically, reaching an accuracy between 62% and 72%. If we at-tempt to remove the class imbalance, however, this accuracy drops between 30% and 56%.

Adding these results together to answer the fourth sub-question, we can conclude that, even though the Facebook dataset has more data, the public datasets scored higher in terms of accuracy, meaning that there is a high probability that the public datasets contained more bias.

Seeing how the features that the classifiers used to classify were very specific terms that had clear meaning, as illustrated by table 3, it is evident that a larger dataset leads to more realistic results, and that a dataset that is the result of Facebook’s platform can deliver a more natural dataset.

The text-based methods that were utilized tend to perform decently when used on the public datasets. When using a more realistic dataset, the performance dropped significantly, meaning that it is probably best to combine text-based methods with

(34)

Bibliography

[1] Solomon Messing, Christina DeGregorio, Bennett Hillenbrand, Gary King,Saurav Mahanti, Zagreb Mukerjee, Chaya Nayak, Nate Persily, Bogdan State,and Arjun Wilkins.

"Facebook Privacy-Protected Full URLs Data Set". 2020

[2] Edelman, Murray, and Murray Jacob Edelman Edelman. The politics of misinformation.

Cambridge University Press, 2001.

[3] Rajdev, Meet, and Kyumin Lee.

"Fake and spam messages: Detecting misinformation during natural disasters on social media."

2015 IEEE/WIC/ACM International Conference on Web Intelligence and In-telligent Agent Technology (WI-IAT). Vol. 1. IEEE, 2015.

[4] Vicario, Michela Del, et al.

"Polarization and fake news: Early warning of potential misinformation tar-gets."

ACM Transactions on the Web (TWEB) 13.2 (2019): 1-22.

[5] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global vectors for word representation."

Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

[6] Mikolov, Tomas, et al.

"Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013): 3111-3119.

[7] Nguyen, Nam P., et al.

"Containment of misinformation spread in online social networks." Proceedings of the 4th Annual ACM Web Science Conference. 2012.

(35)

[8] Tambuscio, Marcella, et al.

"Fact-checking effect on viral hoaxes: A model of misinformation spread in social networks."

Proceedings of the 24th international conference on World Wide Web. 2015.

[9] Jones, Karen Sparck.

"A statistical interpretation of term specificity and its application in retrieval." Journal of documentation (1972).

Detection of misinformation in Facebook data

Detection of misinformation

in Facebook data

Detection of misinformation

in Facebook data

Creating A Classifier Using Text-Based Methods

Abstract

Contents

1

Problem statement

1.1

Introduction

1.2

Research questions

1.3

Thesis overview

2

Background

2.1

Related work

2.2

TF-IDF

2.3

Decision trees

2.4

Support Vector Machine (SVM)

2.5

Neural network and layers

3

Summary of the publicly available datasets

3.1

The smaller datasets

4

Method for the publicly available datasets

4.1

Preprocessing the public datasets

4.2

Classification using classifiers

4.3

Neural network classification

5

Summary of the Facebook dataset

6

Method for the Facebook dataset

6.1

Facebook classifiers

7

Public datasets results

7.1

Classification using classifiers

7.2

Classification using neural networks

8

Facebook results

8.1

Facebook classifiers

8.2

Facebook dataset imbalance

9

Discussion

9.1

Initial results

9.2

Facebook dataset

9.3

Constraint on machine learning

9.4

Future work

10

Conclusion

Bibliography