
6 Data: metadata collection to link logs

6.5 Classifying websites: word vectorization and support-vector machines

In chapter 3 I wrote about the issue of ‘noise’. To reiterate, the most straightforward definition of noise would be websites which have nothing to do with religion, or at least do not have it as their main subject. These sites are not superfluous to this analysis as long as we are interested in graphing the position of our corpus within the internet at large. That interest, however, does not displace the other interest in what the localised ‘web spheres’ of similar sites look like. So once the non-religious sites have outlived their usefulness, we do in fact have to brand them as noise and filter them out to the best of our ability.

I will carry out a filtering experiment through the application of natural language processing (NLP) machine learning to probabilistically classify the content of the websites as either belonging to a religion or not. This is a so-called classifier problem, in which an algorithm learns to associate the features of data with specific labels, such as ‘relevant’ or ‘noise’. In a case such as separating noise from relevant material, one is dealing specifically with a ‘binary classifier’.146 Using a classifier is a supervised machine learning task, meaning that the labels which the algorithm learns to couple with features are supplied by the user. So, putting our own classifier to work on the corpus requires four steps: firstly, a source of data that can represent a particular category, together with labels for that data; secondly, a method to transform that data into a format suited to the training of a classifier; thirdly, a method for actually building a model that will classify data into categories; and finally, a way to evaluate the accuracy of said model.


146 James Pustejovsky and Amber Stubbs, Natural Language Annotation for Machine Learning (Sebastopol, CA: O’Reilly Media, 2013), 144.


Somewhat strangely, we have the labels for the data, but not the data itself. The metadata scheme contains a detailed categorisation into religious movements, which is perfect for transforming into a single label per website. In theory, the labels could be used to classify each site according to its movement: Christian, Buddhist, Hindu, or all the way down to specific subgroups such as Sunni or Shia Islam. The dataset’s focus on diversity, however, both strongly increases the number of labels and introduces imbalances between their frequencies. For instance, out of 70 Islamic sites, only 13 are marked Sunni and 9 Shia, likely too few to effectively train a model. Therefore, to initially assess the validity of websites as material for classifiers, it is best to keep the categorisation as broad as possible.

This of course requires coupling the labels with actual data: the content of the websites. Because the archived websites in the KB web collection are off-limits, I have to once more resort to scraping the live web for features relevant to the labels. I have decided to use the body text (designated by the HTML tag ‘<p>’, for paragraph) on all the pages of a website at a depth of 1, i.e., pages reachable from the front page.

This choice is a compromise between accuracy and automatability. Scraping all the text on these pages compiles a potpourri of content central to a religious organisation. For communities this will be their self-description on the ‘about’ page, descriptions of their faith, their events and organisational information. For blogs and news sites it will be their latest articles, for webshops their featured items, etc. There is both great opportunity and great danger in this approach, as the success of training a classifier with material from a broad range of websites rests on how strongly their use of (religious) language is shared, despite their differences in organisational identity.
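To make the scraping step concrete, the following is a minimal sketch of such a depth-1 paragraph scraper in Python, using the requests and BeautifulSoup libraries. The function and variable names are illustrative, and the sketch omits the politeness delays and robustness a real crawler needs; the code actually used is in Appendix 1.4.

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape_site_text(front_page_url, timeout=10):
    """Collect all <p> text from the front page and the pages it links to."""
    texts, seen = [], set()

    def paragraphs(url):
        # Fetch a page and append the text of all its <p> elements.
        soup = BeautifulSoup(requests.get(url, timeout=timeout).text, "html.parser")
        texts.extend(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        return soup

    # Depth 0: the front page itself.
    soup = paragraphs(front_page_url)
    # Depth 1: every internal link found on the front page.
    domain = urlparse(front_page_url).netloc
    for a in soup.find_all("a", href=True):
        url = urljoin(front_page_url, a["href"])
        if urlparse(url).netloc != domain or url in seen:
            continue  # stay on the same site and skip duplicate links
        seen.add(url)
        try:
            paragraphs(url)
        except requests.RequestException:
            continue  # unreachable pages are simply skipped
    return " ".join(texts)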

There is, in fact, another problem. Considering this is an attempt to reduce noise, saying I already have the labels is an incomplete truth: I have only half of the labels, those of the 600+ sites in the initial corpus. The 50,000+ websites which the crawler has returned, in dire need of pruning, are totally unlabelled. If I were to do this properly, I would have to make a selection of those linked-to sites that really are noise (that is, not religious), label them as such, and then use them to train the classifier. This is the method used by Waldherr et al.147 There truly is no automation without at least a baseline of hard labour, it seems. However, by reducing the scale at which noise is reduced, I can (in theory) leverage the existing labels without any additional manual inspection. Specifically, I can move from the ambition to separate ‘religious’ from ‘non-religious’ to something more manageable but nevertheless realistic: separating sites that belong to a specific movement from those that do not, for example ‘Christianity’ and ‘Not Christianity’. I can do this easily by transforming each site’s category into just such a binary label, with edge cases – sites of which a movement is a part but not a focus – left aside.
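As a small illustration, transforming the categorisation into such a binary could look as follows in Python; the field names and example values here are hypothetical, not the actual metadata scheme, and the edge cases are assumed to have been filtered out beforehand.

# Illustrative only: 'movement' is a hypothetical field name.
def binarise(movement, target="Christianity"):
    return target if movement == target else "Not " + target

sites = [
    {"url": "example-church.nl", "movement": "Christianity"},
    {"url": "example-mosque.nl", "movement": "Islam"},
]
labels = [binarise(site["movement"]) for site in sites]
# labels -> ['Christianity', 'Not Christianity']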

Essentially, the method of reducing noise becomes equal to the detection of a web sphere: the sphere of topically similar websites. Having thus trained the classifier, the most realistic task for it is to predict whether the sites to which a movement in our corpus links are themselves part of that movement or not. The success of its predictions then rests not only on the degree to which Christian sites share language, but also on whether taking non-Christian religious sites as the ‘noise’ category can be representative of any kind of non-Christian site.

The second step involves taking the most salient features of these texts and transforming them into a format that can be used to train a classifier. As machine learning works with numbers, the texts have to be transformed into a numerical representation, or a ‘feature vector’.

147 Waldherr et al., ‘Big Data, Big Noise: The Challenge of Finding Issue Networks on the Web’, 433.

A popular Python machine learning module, Scikit-Learn, offers two equally popular implementations of this step: CountVectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. CountVectorizer constructs what is called an ‘N-gram feature’.148 Rather simply, it processes a series of texts (removing punctuation, stop words, etc.) and returns a ‘document-term matrix’, of which the columns comprise all words found in all texts (or n-sized combinations of words; one word is called a ‘unigram’, two words a ‘bigram’, etc.), the rows represent the texts, and the values count the occurrences of a given gram in a text. CountVectorizer thus simply operates on term frequency per text. TF-IDF goes beyond this by taking into account the prevalence of a given word in the corpus as a whole: for a given text, the frequency of a term in that document raises its value, while the frequency of the word across the whole corpus reduces it. The rationale behind this calculation is that words which are truly central to a text’s meaning are scored highly, while generalist words spread evenly across the corpus have their importance reduced.149 While TF-IDF is the more fine-tuned method, it remains to be seen which will work better in our specific application, which in addition to training a classifier to recognize a category of texts trains it to recognize everything that is not that category. Hypothetically, the specificity of TF-IDF may be less useful here.
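A short sketch may clarify the difference between the two, using Scikit-Learn’s CountVectorizer and TfidfVectorizer classes on two invented ‘documents’:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "grace and faith in the parish community",
    "faith and prayer in the mosque community",
]

# CountVectorizer: rows are documents, columns are vocabulary terms
# (unigrams here), values are raw occurrence counts.
count_vec = CountVectorizer(stop_words="english", ngram_range=(1, 1))
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TfidfVectorizer: the same matrix reweighted. Terms appearing in every
# document are down-weighted; terms distinctive to one document score higher.
tfidf_vec = TfidfVectorizer(stop_words="english")
weights = tfidf_vec.fit_transform(docs)
print(weights.toarray().round(2))

In the count matrix, ‘faith’ and ‘community’ score as high as the distinctive terms; under TF-IDF their presence in both documents pushes their weights down relative to ‘grace’, ‘parish’, ‘prayer’ and ‘mosque’.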

The third step is choosing a classifier algorithm that can model the categorisation of the vectorized texts. It is this model which will be used to predict the categories of new data. I will use the Support-Vector Machine (SVM), a discriminative classification model. Discriminative classification, as the name suggests, tries to discriminate between categories in a spatially drawn distribution through a line, curve or, in our case (where every word in the corpus gets its own dimension), a manifold. It is possible, however, that multiple lines make correct discriminations, and the model cannot determine which one would actually be most accurate on new data. SVMs try to improve on this situation by additionally taking into account a margin between the line/curve and the nearest data points (the eponymous support vectors). Thus the ‘discrimination’ with the greatest flexibility is chosen.150
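In Scikit-Learn this step can be sketched as follows, with LinearSVC, an SVM with a linear kernel that is a common choice for sparse, high-dimensional text features; the training texts here are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder inputs: one scraped text per site, one binary label each.
texts = ["sermon faith parish grace", "recipes kitchen cooking oven"]
labels = ["Christianity", "Not Christianity"]

# A pipeline chains vectorization and classification; the parameter C
# trades the width of the margin against misclassified training points.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(texts, labels)
print(model.predict(["bible study group faith"]))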

Finally, the model must be trained and evaluated. Because we have no data specifically meant for testing, we have to use cross-validation. This involves setting aside a percentage (say, 10 or 20 percent) of the corpus, called the hold-out or test data, training the model on the remainder, and then evaluating its performance by having it predict the labels of the hold-out.151 Our dataset potentially faces a problem with this method, however, because its size of 488 site-texts is, for machine learning purposes, comparatively small. If only a single train-test split is made, it is uncertain whether the reported success rate accurately reflects the accuracy of the model or is strongly influenced by the randomness of sampling. This uncertainty can be mitigated by using an expanded cross-validation technique called ‘k-Fold Cross-Validation’, which involves making the train-test split not once, but k times.

The algorithm divides the corpus into k ‘folds’ of equal size. It then iteratively uses one of these folds as testing data and the remainder as training data. In this way, every part of the corpus gets the privilege of being the guinea pig.

148 Pustejovsky and Stubbs, 141.

149 Marius Borcan, ‘TF-IDF Explained And Python Sklearn Implementation’, Towards Data Science, 8 June 2020, https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275.

150 Jake VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data, First edition (Sebastopol, CA: O’Reilly Media, 2016), 405-408.

151 Gareth James et al., An Introduction to Statistical Learning: With Applications in R, Second edition, Springer Texts in Statistics (New York NY: Springer, 2021), 198.

The average of the folds’ success rates can then be taken as a more accurate measure of the model’s efficacy.152 The code which gathers the paragraph data and applies all of the above steps can be found in Appendix 1.4.
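For illustration, the contrast between a single hold-out split and k-fold cross-validation can be sketched with Scikit-Learn as follows; load_site_texts is a hypothetical loader standing in for the collection of the 488 site-texts and their binary labels.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_site_texts()  # hypothetical loader for the 488 site-texts
model = make_pipeline(CountVectorizer(stop_words="english"), LinearSVC())

# A single hold-out split: one accuracy figure, sensitive to which
# sites happen to land in the 20 percent test sample.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)
print(model.fit(X_train, y_train).score(X_test, y_test))

# k-fold cross-validation with k=5: every site serves once as test data,
# and the mean of the five scores is a steadier estimate of accuracy.
scores = cross_val_score(model, texts, labels, cv=5)
print(scores, scores.mean())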

152 Ibid., 203-204.
