
Credibility by Proxy: Source-level Credibility Assessment for News Websites Utilising Graph, Domain and Text Data

Jeroen Bastiaan den Boef

11245360

University of Amsterdam, Faculty of Science

June 24, 2020

Increased disinformation dissemination and its potential harm require an automated solution. This research proposes a new approach to combating disinformation dissemination by rating credibility at the domain level, combining classic machine learning features with the novel Graph Convolutional Network (GCN) algorithm to demonstrate its potential for automated credibility assessment. Experiments were conducted to determine the importance of graph based data on the task of automated credibility assessment. In these experiments, the GCN is compared to a basic Neural Network utilising a similar structure and the same input features. Multiple configurations are trialed in the experiments and suggestions for optimal performance are provided. The results demonstrate that the GCN significantly outperforms the basic Neural Network in most trialed configurations.

Bachelor Thesis Information Science (Informatiekunde, 18 credits)
Supervisors: dr. F.M. Nack, F.A.W. Hermsen MSc, W. Westra PhD
2nd Examiner: prof. dr. P.T. Groth


Contents

1. Introduction
2. Related work
3. Method
   3.1. Dataset
   3.2. Data preprocessing
   3.3. Baseline model
   3.4. Graph Convolutional Network
   3.5. Neural Network configurations
   3.6. Metrics
4. Results
   4.1. Classes
   4.2. Graph configurations
5. Discussion
   5.1. Performance
   5.2. Limitations & future work
6. Conclusion


1. Introduction

Nowadays, information is readily available and shareable through the internet. The rise of social media has led to an even more substantial increase in information sharing. Social media has influenced not only the volume in which information is shared, but also the speed at which it spreads. While this appears to be a positive development at first, it also brings certain dangers with it. Increased information volume and dissemination lead to an increased danger of disinformation. The damage caused by disinformation in this era is often already done before the disinformation can be rectified. An example of these modern disinformation dangers is the “Reddit Boston Bomber” incident of April 2013 (Shontell, 2013). After an incorrect identification of a suspect of the 2013 Boston Marathon bombing, an online manhunt was set up. This manhunt led to massive online harassment, doxxing and even death threats towards the family of the wrongly identified suspect, all within hours of the misinformation being published.

Fake or untrustworthy news spreads at a faster rate than credible news (Vosoughi et al., 2018). This higher dissemination rate has been linked to the novelty of fake news and the emotional reactions it might trigger in consumers. Currently, the most accurate way to combat this spread of misinformation is to identify and label it as such. This is done through a time-consuming and tedious process of manual fact-checking. With the speed at which this harmful disinformation can spread, manually fact-checking separate statements has become too slow. While previous research has a more fine-grained focus on ‘Fake News’ and credibility detection at the article or statement level, this research proposes a more fundamental and thorough credibility assessment of the article’s source (Hu et al., 2019; Fairbanks et al., 2018). By flagging domains with different levels of credibility, and constantly re-assessing this credibility rating, disinformation dissemination might be drastically reduced.

This research proposes a model which combines analysis of website and text features with graph based data through the state of the art Graph Convolutional Network algorithm. Fairbanks et al. (2018) posited that this type of combined approach, utilising both text and graph based features, should improve the performance metrics of factuality prediction models. Though previously proposed as a solution to the credibility assessment problem, this combined approach remains untested. The proposed Graph Convolutional Network model is compared to a baseline model utilising the same features, excluding the graph network. A more detailed explanation of the models can be found in the methods section (3). To determine the performance of this proposed approach, experiments were conducted to answer the following research question: Does integration of graph-based data into a baseline model by means of Graph Convolution lead to increased performance on the task of domain credibility assessment?


2. Related work

The terms factuality and credibility are used often throughout this paper. While similar in contextual use, they hold different meanings. To avoid misconceptions due to unclear definitions, these terms are defined as follows: factuality denotes the quality of being based on a fact, credibility denotes the quality of being trustworthy and believable. With regards to research on automated fact-checking and factuality prediction, the term factuality will often be used to describe article or statement level truth (Fairbanks et al., 2018; Baly et al., 2018, 2019). Credibility on a domain level with regards to news websites is denoted by the factuality of articles and statements published by said website.

Research regarding the topic of automated factuality assessment has thus far predominantly focused on ‘Fake News’ detection, political bias assessment and social media factuality assessment (Fairbanks et al., 2018; Hu et al., 2019). These foci lack generalizability towards the purpose of domain credibility assessment. Social media factuality assessment lacks generalizability for domain credibility purposes because the associated models assess the factuality of isolated claims. Fake news detection models generally scale the veracity assessment up to a document level. These models do, however, often exclusively analyse text features and thus do not generalize their veracity assessment to a higher level.

Automated domain credibility assessment remains largely under-researched, leaving many approaches untested. The existing research regarding this topic has predominantly focused on analysing website, URL, style and text-based features (Baly et al., 2018, 2019; Olteanu et al., 2013). On the task of article level factuality and political bias assessment, text-based features have performed well for bias detection. Unlike political bias detection, relying exclusively on text-based features has proven to be insufficient for factuality assessment (Fairbanks et al., 2018). Graph analysis on a network of connected news domains has, however, proven to be an accurate method to predict factuality. A tested approach to construct this network is to collect data on news domains linking to other news domains in their articles. While graph analysis has proven to be accurate for predicting factuality, this method has not yet been tested for the task of assessing domain credibility. This research combines this aforementioned method with other previously proven well-performing features.

In recent years, deep learning approaches have been adopted by the scientific community for the purpose of fake news detection. Designed specifically to handle graph based data, Graph Neural Networks are a possible deep learning approach to extract the relational data out of a graph of connected news domains (Zhou et al., 2018). One Graph Neural Network that seems to especially excel at this task is the novel Graph Convolutional Network (GCN) algorithm (Kipf and Welling, 2016). The GCN model takes neighbouring nodes and their features into account by using the adjacency matrix of the current node as a smoothing kernel, essentially smoothing out these features of neighbouring nodes and pooling them at every iteration. Case specific adaptations of the GCN model, such as the Multi-Depth GCN, have proven to outperform other state of the art models for the purpose of detecting fake news (Hu et al., 2019). The GCN enables usage of graph data in a way that takes the relationship between news domains and the features of these interacting domains into account.

The scale and nature of the internet has been one of the main issues in previous research regarding this topic (Olteanu et al., 2013). Utilising specific website characteristics as features to assess credibility is challenging due to the sheer size of the internet. Some features no longer generalise well due to the large volume of websites and the diverse range of website types. This research looks to solve this issue by restricting itself to solely news producing domains. By restricting features to types of domains (e.g. news websites or blogs), the features are expected to perform better within their domain subtypes.

3. Method

In order to approximate the impact of including graph-based data on the task of domain credibility assessment, two Neural Networks were designed. Both Neural Networks use the same input features and classify node factuality on the same scale. In addition to the input features, the GCN takes a weighted, directed graph as input. The chosen features for both the baseline Neural Network and the GCN are features that were identified as best performing in similar domain credibility classification tasks (Baly et al., 2018, 2019; Olteanu et al., 2013). These features are:

• Domain type: Type of the source URL, e.g. governmental and educational domains identifiable by their domain ending in .gov and .edu respectively.

• CSS rules: Denotes the number of CSS rules defined in the CSS document and inline HTML of the website, to approximate how much effort has been put into the production of a website.

• CSS selectors: Denotes the number of CSS selectors defined in the CSS document and inline HTML of the website, to approximate how much effort has been put into the production of a website.

• SMOG score: SMOG is an approximation of the number of years of education required to comprehend a given text. This feature is formally defined by the formula SMOG = 3 + √N_polysyllables, where N_polysyllables refers to the number of words consisting of three syllables or more in a given document. The SMOG score calculation requires the document to be composed of at least 30 sentences.

• Sentiment: Denotes average document level polarity, utilising OpenAI’s Unsupervised Sentiment Neuron (Radford et al., 2017). Approximates how positive or negative a domain tends to be in its wording.

• Exclamation marks: Denotes the number of exclamation marks per document.

• Question marks: Denotes the number of question marks per document.


• Wikipedia: Denotes whether a domain has an affiliated Wikipedia page.
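As a rough illustration of the SMOG feature described above, the score can be sketched with a naive vowel-group syllable heuristic. The syllable counter below is an assumption for illustration only; the actual feature extraction uses the Textstat library, as described in section 3.2.

```python
import math
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def count_syllables(word):
    # crude heuristic: one syllable per contiguous vowel group (assumption,
    # not a real syllabifier; Textstat is more accurate)
    return max(1, len(VOWEL_GROUPS.findall(word)))

def smog(text):
    words = re.findall(r"[A-Za-z]+", text)
    # polysyllables: words of three syllables or more
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 3 + math.sqrt(polysyllables)
```

Note that this simplified formula, like the one used in the thesis, assumes a sample of at least 30 sentences.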

The graph for the GCN is structured as a network of nodes where each node is a news domain and every edge is an outgoing link to another domain. The weight of an edge is determined by how often a node links to another node. This graph is formally defined as G = (V, E), consisting of the nodes v_i ∈ V and the edges (v_i, v_j) ∈ E. In this definition, G denotes the graph network, V is the set of nodes (news domains) in the network (v_1, v_2, ..., v_N) and E is the set of edges (outgoing links) in the network, where each edge is an ordered pair of nodes. In this ordered pair, v_i and v_j are the nodes connected by an edge and the direction of the edge is from v_i to v_j; thus v_i is the domain that links to v_j.
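As a toy illustration of this definition, the weighted edge set can be derived by counting how often each ordered domain pair occurs among the scraped links. The domain names below are made up for illustration.

```python
from collections import Counter

# hypothetical outgoing links observed in articles: (linking domain, linked domain)
observed_links = [
    ("newsa.example", "newsb.example"),
    ("newsa.example", "newsb.example"),
    ("newsa.example", "newsc.example"),
    ("newsb.example", "newsc.example"),
]

# edge weight = number of times v_i links to v_j
weights = Counter(observed_links)

# E: each edge is an ordered pair (v_i, v_j)
edges = sorted(weights)
```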

3.1. Dataset

Graph Neural Networks (GNNs) have a tendency to overfit when trained on standard training/test splits of the data (Wu et al., 2020). This is due to the fact that the whole graph is exposed to the Neural Network during training. To combat this, these kinds of Neural Networks usually train on 5% of the data, validate on 5% and test on the remaining 90%. Due to this abnormal data split, a large dataset is required to train a GNN, as 5% of a 2000 node dataset already results in a mere 100 node training set. This requirement made all the open source news datasets ineligible for this research, as they would result in a sub 50 node training set; thus a new dataset was created. The credibility scores of news domains were gathered by scraping the Media Bias/Fact Check website, which provides bias and factuality assessments for media domains (Van Zandt et al., 2020). These factuality assessments are performed by a team of independent researchers and journalists, using a set scoring rubric ranging from very low to very high factuality. Factuality and bias assessments by the Media Bias/Fact Check team have previously been used in similar credibility assessment and fake news detection research. Their 6 levels of reporting factuality are the classes the Neural Networks attempt to classify.
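On the 1613-domain dataset used in this research, the 5/5/90 split works out as follows:

```python
n_domains = 1613  # labeled domains in the final dataset

train = round(n_domains * 0.05)  # 5% for training
val = round(n_domains * 0.05)    # 5% for validation
test = n_domains - train - val   # remaining ~90% for testing
```

These sizes match the 81 training domains and 1451 test domains reported for the GCN in section 3.4.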

Depending on the configuration, the GCN utilises either a weighted directed graph containing news domains as nodes, with edges detailing which other nodes are being referred to, an unweighted version of this graph, or an unweighted, undirected version of this graph. This graph is further detailed in section 3.4. The Global Database of Events, Language and Tone (GDELT) was used to create this graph (Leetaru and Schrodt, 2013). This data, published by the GDELT project in January 2019, contains 99,270 news sources and the 30,072,787 outgoing links that are embedded in their published news articles throughout 2017-2019.

The overlap between the domains in the GDELT graph and the labeled Media Bias/Fact Check domains forms the core of this dataset. A corpus of news articles was created by scraping these domains using internal scraping tools developed by Owlin, the Newspaper3k Python library, and manual scraping (Ou-Yang, 2013). Domains that were blocking both the scrapers and manual access, as well as domains that were offline at the time of scraping, were dropped from the dataset. Out of the remaining 1613 domains, 101 were scraped by hand; the rest were scraped using the aforementioned methods. The majority of the offline domains were of the lowest factuality class, resulting in this class having only 17 occurrences in the dataset. To combat class imbalance and allow a stratified split over the training, test and validation sets, the Very Low factuality class was grouped together with the Low factuality class. The remaining domains are split over 5 levels of factuality as follows:

Factuality        Count
Very High         101
High              963
Mostly Factual    72
Mixed             313
Low               164

Table 1: Domain factuality classes and counts

Data to generate the CSS rules and CSS selectors features was gathered by scraping the domains using the Python Requests and BeautifulSoup libraries (Reitz et al., 2014; Richardson, 2013). Data for the Wikipedia feature was gathered by querying the Wikipedia API via a Python wrapper (Goldsmith, 2014).

3.2. Data preprocessing

A subset of the published GDELT data was extracted to remove any nodes and edges that did not have labels. This subset was subsequently transformed into a 2-dimensional PyTorch tensor detailing the edges between nodes. This tensor is accompanied by a 1-dimensional weight tensor detailing the number of times the nodes either link or are being linked to. Three versions of the graph were constructed to determine the impact of data flow. These three versions are an undirected graph, a directed graph with edges flowing from the linking domain to the linked domain, and a reversed version of this graph. All three graphs have the same weights on the edges.

Text-based features were extracted from the article corpus for each domain. These text based features are the SMOG, Sentiment, Question mark and Exclamation mark features. Articles that were too short to analyse were dropped from the dataset. Articles that were too long to analyse efficiently were trimmed down to the first 10,000 characters. Regex queries were constructed to gather counts of exclamation marks and question marks for their respective features. These counts were averaged over articles per domain. The SMOG score was calculated for every article using the Python Textstat library (Bansal and Aggarwal, 2020). Similar to the exclamation mark and question mark features, the average SMOG score per article was calculated per domain and used as a feature. Word level polarity was derived from the articles by splitting articles into sentences using spaCy and subsequently using OpenAI’s sentiment neuron to calculate the sentiment score for each word (Honnibal and Montani, 2017). This sentiment score is a decimal number between -1 and 1 which describes either negative or positive contextual sentiment of the word. These sentiment scores were averaged per article and finally per domain to achieve domain level sentiment polarity.
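The exclamation and question mark features described above reduce to simple regex counts averaged over a domain's articles; a minimal sketch:

```python
import re

def punctuation_features(articles):
    """Average exclamation and question mark counts per article for one domain."""
    excl = [len(re.findall(r"!", a)) for a in articles]
    ques = [len(re.findall(r"\?", a)) for a in articles]
    n = len(articles)
    return sum(excl) / n, sum(ques) / n
```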

The domain type feature was generated by stripping the domain names from their URLs. Types with fewer than 3 occurrences in the whole dataset were grouped together to improve performance. The domain types were subsequently numerically encoded and one-hot encoded. CSS files and HTML documents of the domains were queried using Regex and BeautifulSoup to establish a count of CSS rules and selectors for each domain. The queried Wikipedia data was used to determine whether domains had affiliated Wikipedia pages, either yes or no. This feature was also numerically encoded and one-hot encoded. To improve training time and overall classification accuracy, all input features were transformed into Gaussian features using Gauss Rank Transformation (Ioffe and Szegedy, 2015).
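A minimal sketch of the Gauss Rank Transformation step, assuming distinct feature values (tie handling is omitted for brevity): each value is replaced by its rank, the ranks are mapped into the open interval (0, 1), and the result is pushed through the inverse normal CDF.

```python
from statistics import NormalDist

def gauss_rank_transform(values):
    """Map a feature column to an approximately Gaussian distribution."""
    nd = NormalDist()
    n = len(values)
    # assign ranks 1..n (assumes distinct values; ties would need average ranks)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for r, i in enumerate(order):
        ranks[i] = r + 1
    # map ranks into (0, 1) and through the inverse normal CDF
    return [nd.inv_cdf(r / (n + 1)) for r in ranks]
```

The transformed column preserves the ordering of the original values while its empirical distribution becomes approximately standard normal.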

3.3. Baseline model

In order to gauge the performance of the GCN, it needs to be compared to a model with a similar structure. A simple Multi-layer Perceptron classifier was thus constructed, utilising the scikit-learn Python library (Pedregosa et al., 2011). This Multi-layer Perceptron consists of 2 layers with 5 neurons, a ReLU activation function and an L-BFGS solver due to the smaller dataset size. The baseline model is trained on 5% (81 domains) of the data and tested on the remaining 95% (1532 domains) to stay consistent with the structure of the GCN. The previously described Gaussian transformed features are the input data and the predicted classes are the encoded Media Bias/Fact Check factuality scores. A stratified training and test split was used to accurately represent the class balance in both the training and test set.

3.4. Graph Convolutional Network

Graph Convolutional Networks perform semi-supervised multi-class node classification on a graph network (Kipf and Welling, 2016). The GCN takes the adjacency structure of a graph into account by aggregating features of neighbouring nodes during convolutions. This is implemented through the layer-wise propagation rule defined as:

H^(l+1) = σ(D̂^(-1/2) Â D̂^(-1/2) H^(l) W^(l))

where H^(l+1) is the next representation of the node network, σ is a non-linear activation function such as the ReLU function, Â is the Adjacency Matrix of the current node combined with self-loops and D̂ denotes the Degree Matrix of Â. H^(l) is the previous representation of the node network, with W^(l) being an N sized weight matrix which serves as a classical fully connected layer to go from an N sized representation to an M sized representation. Depending on which layer of the GCN this is, N denotes the input features or the features provided by the previous layer, and M denotes the features provided to the next layer or the class probabilities. In this propagation rule, the Adjacency Matrix is used to aggregate the features of neighbouring nodes. Self-loops are added to include the features of the current node in the next representation. D̂, the Degree Matrix of Â, is used to normalize nodes in the network that have an unusual amount of neighbouring nodes. Essentially, this means that each layer of the GCN aggregates the network structure with the node features through the equation:

H = ReLU(Â X W)

with X being the N × C sized feature matrix, containing N nodes and C features. Because this is a classification problem, a Softmax classifier is used after these operations to classify nodes into an output class.
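The propagation rule above can be illustrated in plain Python on a toy graph. This is a didactic sketch only; the actual implementation uses PyTorch Geometric, as described below.

```python
import math

def gcn_layer(A, X, W):
    """One graph convolution: ReLU(D^(-1/2) (A + I) D^(-1/2) X W)."""
    n = len(A)
    # add self-loops: A_hat = A + I
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # D_hat^(-1/2): inverse square root of the degree matrix of A_hat
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    # symmetric normalisation of the adjacency matrix
    A_norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
              for i in range(n)]

    def matmul(M, N):
        return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
                 for j in range(len(N[0]))] for i in range(len(M))]

    # aggregate neighbour features, project with W, apply ReLU
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_norm, X), W)]
```

With no edges, each node keeps its own (projected) features via the self-loop; with edges, neighbouring features are smoothed together, which is exactly the pooling behaviour described in section 2.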

The practical implementation of the GCN consists of a combination of classic PyTorch Neural Network classes and PyTorch Geometric Graph Convolutional layers (Paszke et al., 2019; Fey and Lenssen, 2019). The GCN model has the PyTorch Neural Network module class at its core, containing two GCN layers. Similar to the original Kipf & Welling GCN, this model uses ReLU as the non-linear activation function and the Adam optimization algorithm to update network weights iteratively (Kingma and Ba, 2014). As the initial 0.01 learning rate made corrections that were too impactful, a 0.005 learning rate was adopted. In an effort to achieve convergence, multiple training epoch settings were trialed, and training for 300 epochs was found to produce the best results. The model operates with a hidden layer size of 10 units. Loss is evaluated using a Negative Log-Likelihood (NLL) function. This loss function is accompanied by class weights in an attempt to even out class imbalance. Finally, a LogSoftmax activation function is used on the output of the second GCN layer to derive class probabilities. The feature input for the GCN is identical to that of the baseline model: 28 Gauss Rank transformed features. These features are accompanied by an input of an edge matrix and, depending on the configuration of the GCN, an edge attribute tensor containing edge weights. Similar to the baseline model, the GCN was trained on 5% (81 domains) of the labeled data, and performance was validated during training on a validation set equal in size to the training set. Finally, the model was tested on the remaining 90% (1451 domains) of labeled data to evaluate performance.

3.5. Neural Network configurations

Different configurations of both the baseline Neural Network and the GCN were constructed and trained to estimate the impact of these differences on the models' classification performance. The different configurations can be split into two categories: differences in graph configurations and differences in class distributions. Different class distributions were tested to gauge the impact of class imbalance, while different graph configurations were tested to identify the impact of information flow through the graph structure. The different graph configurations alter the weight of the edges, the directionality of the graph, or a combination of these. The unweighted versions of the graph have the accompanying edge weights removed. Directed graph configurations have the graph in the direction as given in the GDELT dataset, where the information flows from the node making the reference to the referenced node. The reverse direction graphs have this information flow inverted. The different versions of tested class distributions are:

• 5 output classes, having all 5 classes as described in section 3.1 as output classes.

• 4 output classes, with the Low and Mixed classes combined into one Low class.

• 3 output classes, with the Low and Mixed classes combined into one Low class and Mostly Factual combined with High.

• Binary classification, with the Very High, High and Mostly Factual classes combined into one High class and the Low and Mixed classes combined into one Low class.
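These class configurations amount to a simple relabeling of the Media Bias/Fact Check classes. For example, the binary configuration can be expressed as:

```python
# binary configuration: collapse the five factuality labels into High / Low
BINARY_MAP = {
    "Very High": "High",
    "High": "High",
    "Mostly Factual": "High",
    "Mixed": "Low",
    "Low": "Low",
}

labels = ["Very High", "Mixed", "Mostly Factual"]
binary_labels = [BINARY_MAP[label] for label in labels]
```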

3.6. Metrics

The different models and configurations are compared through the macro averaged F1-score, which is the harmonic mean of recall and precision. F1-score, precision and recall are calculated per class, thus a combined score is required to evaluate the whole model. Macro averaged F1 is the most insightful of the averaged F1 scores, as it does not favour any class. This makes the metric well suited for this dataset, as it contains imbalanced classes. In order to answer the research question “Does integration of graph-based data into a baseline model by means of Graph Convolution lead to increased performance on the task of domain credibility assessment?”, the baseline Neural Network needs to be compared to the GCN. This comparison is done by splitting the test set results for both models into 10 batches. Separate F1-scores were calculated on all batches, creating a distribution of F1-scores for both models. These distributions are sets of observations and thus comparable through a paired sample t-test, with the null hypothesis (H0) that the true mean difference (µd) of the F1-score samples is equal to zero. This hypothesis can be rejected if the test yields a p-value below 0.05, essentially meaning that the models perform significantly differently.

4. Results

4.1. Classes

Results of the different class configurations as described in section 3.5 are summarized in Table 2. Reported values denote macro averaged precision, recall and F1-scores as a value between 0 and 1. The best performing class configuration is the binary model, with an average macro F1-score of 0.666 and a standard deviation of 0.015 over 10 runs of the model.

GCN performance metrics across different class configurations

Amount of classes  Macro precision  Macro recall  Macro F1
5                  0.368            0.304         0.272
4                  0.376            0.326         0.343
3                  0.509            0.457         0.466
2                  0.685            0.723         0.666

Table 2: Results of Graph Convolutional Network performance over the different amounts of output classes; the best performing configuration is denoted by bold values

4.2. Graph configurations

Results of the different graph configurations as described in section 3.5 are summarized in Table 3. As in section 4.1, reported values denote macro averaged precision, recall and F1-scores as a value between 0 and 1. The unweighted directed graph configuration has the best performance across all metrics, with a macro F1-score of 0.734 and a standard deviation of 0.05. The t-test comparing the baseline model to the GCN yields a p-value of 0.0002. As this is less than 0.05, H0 is rejected and thus the GCN has a significantly higher macro F1-score than the baseline. The integration of graph-based data into the baseline model by means of Graph Convolution thus leads to increased performance on the task of domain credibility assessment.

GCN results across different graph configurations

Graph configuration                 Macro precision  Macro recall  Macro F1
Baseline                            0.598            0.568         0.568
Unweighted undirected graph         0.712            0.749         0.717
Weighted reverse direction graph    0.586            0.563         0.416
Unweighted reverse direction graph  0.712            0.705         0.708
Weighted directed graph             0.705            0.745         0.701
Unweighted directed graph           0.726            0.759         0.734

Table 3: Results of Graph Convolutional Network performance over the different graph configurations with a binary output; the best performing configuration is denoted by bold values

5. Discussion

5.1. Performance

As the results in Table 2 show, the binary model outperforms the multiclass models by a wide margin. While classification inherently becomes easier the fewer output classes a model has, the differences in performance shown here can also be partially attributed to the class imbalance of the dataset. Some of the classes are only encountered a handful of times by the model during training, resulting in these classes rarely being predicted.


This is well illustrated by the performance of the Mostly Factual class in the 4 and 5 class configurations (tables 4, 5). The latter has a macro F1-score of 0.028 for this class and the former a macro F1-score of 0. Models with fewer output classes performing better might also be attributed to how closely similar some of the neighbouring classes are. This becomes evident in the commentary provided by Media Bias/Fact Check on their credibility assessment reports, which illustrates how minor the differences between the classes Very High and High are.

The results in Table 3 provide insight into how information flows through the graph during convolutions. Aside from the baseline model and the weighted reverse direction graph configuration, most graph structures perform similarly on this dataset. The addition of the edge weights seems to account for the biggest changes in performance, with the weighted versions of the models performing worse than their unweighted counterparts. While it might seem illogical that the inclusion of edge attributes would have a detrimental effect on the GCN’s performance, this effect might be explained by the weights not being normalized. While the majority of the graph configurations have only a minor impact on the macro F1 of the model, the weighted reverse direction graph is the outlier here. With a macro F1-score of 0.416, this is the worst performing graph configuration, producing poorer results on the test dataset than the baseline model. This might be attributed to the combination of the two poorer performing configurations, meaning that the unweighted reverse direction graph configuration is still able to correct itself during training whereas the weighted reverse direction configuration receives too much confusing input. These differences in performance suggest that the directionality and weight of a graph do impact the performance of Graph Convolutional Networks.

5.2. Limitations & future work

This subsection details several limitations of this research and how they might be overcome in future work. The main limitation of this research is the class imbalance of the dataset. This imbalance essentially forces the model to be reduced to a binary classifier, as there is too little data to properly train on all the classes. A more complete dataset with proper class representation might produce better results for the multiclass models. A second limitation of this research is the features used. While the features utilised for both the GCN and the baseline were previously well-performing features, not all of them might be as relevant as they were at the time of first evaluation. Due to steady developments in web programming, features such as the CSS rules and selectors become dated and less relevant. Future research might want to explore different options for supporting input features. Additionally, the features denoting average exclamation marks and question marks per article were not normalized for text length. Normalizing these averages for text length would provide a more representative value for each domain. Finally, the loss functions used in the baseline and GCN were not fully equal. While the baseline Multi-layer Perceptron classifier uses a similar loss function (logarithmic loss), this function does not support class weights. This gives the GCN an edge over the baseline, as its loss is corrected more for the imbalance of the classes. Utilising a baseline model with a class-weighted loss function would make for a fairer comparison.


6. Conclusion

This research trialed a novel approach to combating disinformation dissemination: rating news domains on their credibility at the source level rather than the article level. Graph based data was proposed as a way of incorporating domain interaction data into this credibility assessment, and an experiment was conducted to determine the effectiveness of integrating such data, by means of graph convolutions, on the task of automated domain credibility assessment. To train and test a Graph Convolutional Network, a dataset was created containing a corpus of articles and domain credibility ratings; from this corpus and additional queried and scraped data, a set of input features was constructed. The GCN was compared to a basic Neural Network with a similar structure, using the same input data. Alongside the input features, the GCN was provided with a directed graph (weighted or unweighted, depending on the configuration) describing the aforementioned domains and the other news producing domains they linked to in their articles. Both models were trained and tested on the same splits of the data, and multiple configurations were trialed with regards to the number of output classes as well as the graph structure for the GCN.
The experiments revealed the best configuration of the GCN to be a binary classifier with an unweighted directed graph, achieving a macro F1-score of 0.734. As this score is significantly higher than that of the basic Neural Network, the conclusion can be drawn that integrating graph based data into the baseline model by means of graph convolution improves performance on the task of domain credibility assessment. While the inclusion of graph based data outperforms a simple baseline without it, optimal performance was not achieved: more balanced data with improved class representation is required, particularly for the multiclass model configurations. Additionally, the input features could be improved to better reflect current web design strategies. Despite this suboptimal performance, the impact of utilising graph based data for domain credibility assessment has been demonstrated. Models relying solely on text or website features have proven insufficient to combat disinformation dissemination, and the presented results illustrate the potential of this novel approach.
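To make the graph convolution step concrete, the sketch below implements the propagation rule of Kipf and Welling (2016) for a single layer, H' = D̂^(-1/2) Â D̂^(-1/2) H W, where Â is the adjacency matrix with self-loops and D̂ its degree matrix. This is a minimal pure-Python illustration of the principle, not the thesis implementation (which uses PyTorch Geometric); the function name `gcn_layer` and the tiny undirected example graph are illustrative, and the thesis's weighted and directed graph variants would require a different normalisation.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph convolution: H' = D^-1/2 (A + I) D^-1/2 H W.

    adj:      n x n adjacency matrix (lists of lists)
    features: n x f node feature matrix H
    weights:  f x o layer weight matrix W
    """
    n = len(adj)
    # Add self-loops so each node retains its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation: divide each edge by sqrt(deg_i * deg_j).
    s = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    f = len(features[0])
    # Propagate: each node aggregates its neighbours' features (S @ H).
    h = [[sum(s[i][k] * features[k][j] for k in range(n)) for j in range(f)]
         for i in range(n)]
    o = len(weights[0])
    # Transform: apply the learned weights ((S H) @ W).
    return [[sum(h[i][k] * weights[k][j] for k in range(f)) for j in range(o)]
            for i in range(n)]
```

With identity features and identity weights on a 3-node path graph, the output is simply the normalised adjacency itself, which makes the aggregation behaviour easy to inspect before moving to a learned setting.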

References

Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., and Nakov, P. (2018). Predicting factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765.

Baly, R., Karadzhov, G., Saleh, A., Glass, J., and Nakov, P. (2019). Multi-task ordinal regression for jointly predicting the trustworthiness and the leading political ideology of news media. arXiv preprint arXiv:1904.00542.


Bansal, S. and Aggarwal, C. (2020). Textstat. https://pypi.org/project/textstat/. Last accessed on May 29, 2020.

Fairbanks, J., Fitch, N., Knauf, N., and Briscoe, E. (2018). Credibility assessment in the news: Do we need to read. In Proc. of the MIS2 Workshop held in conjunction with 11th Int'l Conf. on Web Search and Data Mining, pages 799–800.

Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.

Goldsmith, J. (2014). Wikipedia api (python). https://pypi.org/project/wikipedia/. Last accessed on May 12, 2020.

Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Hu, G., Ding, Y., Qi, S., Wang, X., and Liao, Q. (2019). Multi-depth graph convolutional networks for fake news detection. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 698–710. Springer.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Leetaru, K. and Schrodt, P. A. (2013). Gdelt: Global data on events, location, and tone. In ISA Annual Convention. Citeseer.

Olteanu, A., Peshterliev, S., Liu, X., and Aberer, K. (2013). Web credibility: Features exploration and credibility prediction. In European conference on information retrieval, pages 557–568. Springer.

Ou-Yang, L. (2013). Newspaper3k. https://newspaper.readthedocs.io/en/latest/. Last accessed on June, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8026–8037.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.


Radford, A., Jozefowicz, R., and Sutskever, I. (2017). Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.

Reitz, K., Cordasco, I., and Prewitt, N. (2014). Requests: HTTP for humans. https://2.python-requests.org/en/master. Last accessed on Mar 25, 2020.

Richardson, L. (2013). Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/. Last accessed on Mar 25, 2020.

Shontell, A. (2013). What it’s like when reddit wrongly accuses your loved one of murder. Last accessed on Mar 25, 2020.

Van Zandt, D., O’Leary, A., O’Conner Rubsam, K., White, K., Fowler, J., Kelley, D., Allen, M., Locke Siewert, F., and Huitsing, M. (2020). Media bias/fact check. https://mediabiasfactcheck.com/. Last accessed on Jun 12, 2020.

Vosoughi, S., Roy, D., and Aral, S. (2018). The spread of true and false news online. Science, 359(6380):1146–1151.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. (2020). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. (2018). Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434.


A. Appendix

GCN 5 class configuration results

Class            Precision  Recall  F1-score
High             0.789      0.288   0.422
Very High        0.394      0.286   0.331
Mixed            0.229      0.758   0.351
Mostly Factual   0.025      0.031   0.028
Low              0.404      0.158   0.227
Macro avg        0.368      0.304   0.272

Table 4: Precision, recall and F1-scores per class for the 5 class configuration
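The per-class F1-scores in Table 4 are the harmonic means of the precision and recall columns, and the macro average is their unweighted mean over the five classes. A small sketch recomputing these from the rounded table values (recomputed figures can differ in the last digit, since the table was presumably derived from unrounded scores):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class precision and recall from the 5 class configuration (Table 4).
per_class = {
    "High": (0.789, 0.288),
    "Very High": (0.394, 0.286),
    "Mixed": (0.229, 0.758),
    "Mostly Factual": (0.025, 0.031),
    "Low": (0.404, 0.158),
}

f1_scores = {cls: f1(p, r) for cls, (p, r) in per_class.items()}
# Macro average: unweighted mean over classes, so the rare classes
# (e.g. Mostly Factual) drag the score down as much as the common ones.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
```

This also shows why the macro F1 of 0.272 is so much lower than the High class F1 of 0.422: the poorly represented classes contribute equally to the average.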

GCN 4 class configuration results

Class            Precision  Recall  F1-score
High             0.620      0.677   0.647
Very High        0.538      0.308   0.392
Low & Mixed      0.346      0.321   0.333
Mostly Factual   0.000      0.000   0.000
Macro avg        0.376      0.326   0.343

Table 5: Precision, recall and F1-scores per class for the 4 class configuration
