Automatic Prediction of Comment Quality

Dirk Brand
Thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Stellenbosch University.

Supervisor: Prof. Brink van der Merwe
Co-supervisors: Dr. Steve Kroon & Dr. Loek Cleophas

March 2016


Declaration

By submitting this thesis electronically, I declare that the entirety of the work contained therein is my own, original work, that I am the sole author thereof (save to the extent explicitly otherwise stated), that reproduction and publication thereof by Stellenbosch University will not infringe any third party rights and that I have not previously in its entirety or in part submitted it for obtaining any qualification.

Date: March 2016

Copyright © 2016 Stellenbosch University All rights reserved


Abstract

Automatic Prediction of Comment Quality

Dirk Brand

Computer Science Division
Department of Mathematical Sciences
Stellenbosch University

MSc. Computer Science

December 2015

The problem of identifying and assessing the quality of short texts (e.g. comments, reviews or web searches) has been intensively studied. There are great benefits to being able to analyse short texts. As an example, advertisers might be interested in the sentiment of product reviews on e-commerce sites to more efficiently pair marketing material with content. Analysing short texts is a difficult problem, because traditional machine learning models generally perform better on data sets with larger samples, which often translates to more features. More data allow for better estimation of parameters for these models. Short texts generally do not have much content, but still carry high variability in that they may still consist of a large corpus of words.

This thesis investigates various methods for feature extraction for short texts in the context of online user comments. These methods include the leading manual feature extraction techniques for short texts, N-gram models and techniques based on word embeddings. The effect of using different kernels for a support vector classifier is also investigated. The investigation is centred around two data sets, one provided by News24 and the other extracted from Slashdot.org. It was found that N-gram models performed relatively well, mostly outperforming manual feature extraction techniques.


Uittreksel

Outomatiese voorspelling van die kwaliteit van aanlyn kommentaar

Dirk Brand

Afdeling Rekenaarwetenskap
Departement van Wiskundige Wetenskappe
Universiteit van Stellenbosch

MSc. Rekenaarwetenskap

Desember 2015

Om die kwaliteit van kort tekste (bv. internet kommentaar, soektogte of resensies) te identifiseer en te analiseer, is 'n probleem wat al redelik sorgvuldig in die navorsing bestudeer is. Daar is baie te baat by die vermoë om die kwaliteit van aanlyn teks te analiseer. Byvoorbeeld, aanlyn winkels mag moontlik geïnteresseerd wees in die sentiment van die verbruikers wat produkresensies gee oor hul produkte, aangesien dit kan help om meer akkurate bemarkingsmateriaal vir produkte te genereer. Analise van kort tekste is 'n uitdagende probleem, want tradisionele masjienleeralgoritmes vaar gewoonlik beter op datastelle met meer kenmerke as wat kort tekste kan bied. Ryker datastelle laat toe vir meer akkurate skatting van modelparameters.

Hierdie tesis bestudeer verskeie metodes vir kenmerkkonstruksie van kort tekste in die konteks van aanlyn kommentaar. Die metodes sluit die voorstaande handgemaakte kenmerkkonstruksietegnieke vir kort tekste, N-gram modelle en woordinbeddinge in. Die effek van verskillende kernmetodes vir klassifikasiemodelle word ook bestudeer. Die studie is gefokus rondom twee datastelle, waarvan een deur News24 voorsien is en die ander vanaf Slashdot.org bekom is. Ons het gevind dat N-gram modelle meestal beter presteer as die handgemaakte kenmerkkonstruksietegnieke.


Acknowledgements

I would like to express my sincere gratitude for the following people and organizations:

• my supervisors, Prof. Brink van der Merwe, Dr. Steve Kroon and Dr. Loek Cleophas, for their support and valued inputs;

• Naspers for their financial assistance;

• the MIH media lab for their assistance and provision of an excellent work environment;

• and finally, my fiancée, friends, and family for continued and valued support.


Contents

Abstract i

Uittreksel ii

Acknowledgements iii

Contents iv

List of Figures vii

List of Tables viii

Acronyms ix

1 Introduction 1

1.1 Problem Statement . . . 2

1.1.1 Comment Quality Prediction . . . 3

1.1.2 Research Question . . . 5

1.2 The Data . . . 5

1.2.1 News24 Basic Data . . . 6

1.2.2 Small News24 Data Set . . . 8

1.2.3 Slashdot.org Data . . . 9

1.2.4 The Unbalanced Data Problem . . . 11

1.3 Thesis Overview . . . 12

2 Background 13

2.1 Existing Moderation Schemes . . . 13

2.2 Literature Review . . . 16

2.2.1 Slashdot: Peer Moderation . . . 16

2.2.2 User Reputation . . . 16

2.2.3 Automatic Scoring and Prediction . . . 17

2.3 Supervised Learning Methods . . . 20

2.3.1 Naïve Bayes . . . 21

2.3.2 SVM . . . 21


2.4 N-Gram-Based Approaches . . . 25

2.4.1 N-Grams . . . 26

2.4.2 Skip-grams . . . 27

2.4.3 Literature Study . . . 28

2.5 Topic Modelling . . . 28

2.6 Deep Learning Networks . . . 30

2.6.1 Representation Learning . . . 32

2.7 Summary . . . 40

3 Feature Identification 42

3.1 Custom Feature Construction . . . 42

3.1.1 Post Features . . . 43

3.1.2 User Features . . . 47

3.2 N-Gram-Based Feature Construction . . . 50

3.2.1 Word N-grams . . . 51

3.2.2 Character N-grams . . . 51

3.2.3 Pre-Processing . . . 52

3.2.4 Constructing N-gram Representations . . . 52

3.3 Topic Modelling for Feature Construction . . . 53

3.4 Deep Learning for Feature Construction . . . 54

3.4.1 Text Pre-Processing . . . 55

3.4.2 Model Construction . . . 55

3.4.3 Feature Construction . . . 56

3.5 Most Relevant Features . . . 57

3.6 Summary . . . 59

4 Methodology 60

4.1 Quality Prediction Pipeline . . . 60

4.2 Feature Sets . . . 61

4.3 Feature Preprocessing . . . 62

4.3.1 Dimensionality Reduction . . . 63

4.3.2 Feature Normalization . . . 64

4.4 Hyper-Parameter Estimation . . . 64

4.5 Spam Detection . . . 65

4.6 Summary . . . 65

5 Results 67

5.1 Evaluation Metrics . . . 67

5.2 News24 Results . . . 69

5.3 Slashdot Results . . . 75

5.4 Summary . . . 79


6 Conclusion 80

6.1 Research Question . . . 80

6.2 Future work . . . 81

A Manual Feature Graphs 83

B Artificial Neural Networks 90

B.1 Classical Artificial Neural Networks . . . 90

B.1.1 The Neuron . . . 91

B.1.2 Feed-Forward Neural Networks . . . 94

B.1.3 The Backpropagation Algorithm . . . 95

C Language Regularities in Word Embedding Models 99

D Word Clusters in Word Embedding Models 100

E Experiment Reproducibility 101


List of Figures

1.1 Example News24 comment thread. . . 6

1.2 Typical Slashdot comment thread. . . 10

1.3 Slashdot comment score distribution. . . 10

2.1 Linear SVC of binary class variables. . . 22

2.2 Examples of function decision surfaces. . . 25

2.3 A graphical representation of LDA . . . 29

2.4 An illustration of topic modelling. . . 30

2.5 Example of the hierarchical nature of natural language. . . 32

2.6 Example of relationships between elements of word embedding. . . 34

2.7 Example word embedding. . . 35

2.8 Visualization of a bilingual word embedding. . . 36

2.9 Example of distributed NNLM. . . 38

2.10 Example of Continuous Bag-of-Words NNLM. . . 39

3.1 PageRank algorithm example. . . 49

3.2 HITS algorithm example. . . 50

3.3 A pipeline for extracting word-embedding features. . . 55

4.1 Training and testing pipeline . . . 61

A.1 Distribution of custom features for the large News24 data set. . . 85

A.2 Distribution of custom features for the small News24 data set. . . 87

A.3 Distribution of custom features for Slashdot comments. . . 89

B.1 The articial neuron. . . 91

B.2 Common activation functions. . . 93

B.3 The perceptron. . . 94

B.4 A basic 1-layer feed-forward neural network. . . 95

B.5 An example network to illustrate backpropagation of error. . 97


List of Tables

1.1 Statistics about the large News24 comment set. . . 6

1.2 Distribution of comments according to their report, hidden and hot-word labels. . . 7

1.3 Statistics about the small News24 comment set. . . 9

1.4 Statistics about the Slashdot comment set. . . 11

3.1 List of the custom feature extraction methods. . . 43

3.2 The most relevant manual features, unigrams, bigrams and trigrams. . . 58

4.1 The number of features of each feature set. . . 63

4.2 Parameters obtained through parameter tuning. . . 66

5.1 Baseline accuracy for classifying all samples as the majority class for the various data sets. . . 69

5.2 Results for the spam detection classifier. . . 69

5.3 Results for RBF SVM classification on the large News24 data set. . . 70

5.4 Results for linear SVM classification on the large News24 data set. . . 72

5.5 Results for RBF SVM classification on the small News24 data set. . . 73

5.6 Results for linear SVM classification on the small News24 data set. . . 74

5.7 Results for SVM classification on the Slashdot data set with three different comment labelling strategies. . . 76

5.8 Results for RBF SVM classification on the Slashdot data set. . . 77

5.9 Results for linear SVM classification on the Slashdot data set. . . 78


Acronyms

ANN artificial neural network.

CBOW Continuous Bag-of-Words.

kNN k-Nearest Neighbours.

LDA Latent Dirichlet Allocation.

LnDA Linear Discriminant Analysis.

NB Naïve Bayes.

PCA Principal Component Analysis.

RBF Radial Basis Function.

SVC Support Vector Classification.

SVM Support Vector Machine.

TF-IDF Term Frequency - Inverse Document Frequency.


Chapter 1

Introduction

The establishment of bulletin boards in the early days of the internet was an example of the online social interaction that eventually developed into what is now known as the social web [79]. Prominent examples of modern-day online social interaction include social media (e.g. YouTube), information sharing (e.g. Wikipedia) and online communities (e.g. Facebook). All these platforms depend on users to continuously generate, curate and annotate content.

There are now countless online platforms that permit users to generate content. These include forums, blogs, newsgroups and online news providers. One of the key features underpinning the success of these online communities is large-scale user engagement, seen in the form of rating, tagging and commenting on content [79]. User-contributed comments on web content offer a much richer, albeit unstructured, source of contextual information than ratings or tags. However, comments are often variable in quality, substance, relevance and style.

An online news provider is defined in this work as an internet entity that serves original journalistic content to users and then allows the users to engage with that content via comments and/or ratings (e.g. the New York Times). As the social web grows and people become increasingly socially aware [52], online news providers are becoming ever larger communities where users can discuss or comment on common issues in the context of news articles [48]. There are also sites that act as online news aggregators, that serve content either directly from real news sources or from users (e.g. Slashdot.org [42]).

An online news provider may fulfil many different roles, including educating people, providing timeous access to the latest news, and providing feedback to news providers about their content [91].

The importance of the role that online news plays in the media sector (especially when educating and informing people) leads news providers to strive to provide content of high quality, as well as to keep users engaged on


the site for as long as possible. Navigating through the mass of comments on articles to find useful information quickly becomes a daunting and time-consuming task for users. Therefore, to ensure high quality in user-submitted content (such as comments on articles), online news providers attempt to moderate or curate the content. Moderation on websites where discussions are fostered has been a topic of discussion in recent years [29, 50].

1.1 Problem Statement

There are great benefits to being able to analyse short texts: for example, advertisers might be interested in the sentiment of product reviews on e-commerce sites to more efficiently pair marketing material with content. However, analysing short texts is a difficult problem, because traditional machine learning models are generally developed and optimized for longer texts. Longer texts are able to produce denser feature sets for some feature construction techniques, which allows for better estimation of parameters for these machine learning models. Short texts generally do not have much content, but still carry high variability in that they may consist of a large corpus of words. Thus, short texts can be characterized as having both little content and being sparse. This makes it hard to build a representative feature space for short texts [36].

There are two dominant approaches to dealing with this problem. The first is to expand the short texts with meta-information (their context, date, etc.) or external larger documents (e.g. by adding the content of the corresponding article to each comment) [148]. The second is to determine a set of topics for the text corpus and assign topics to comments [132]. Depending on the domain and the content, the first approach could be a manual and time-consuming process, but more often the problem is that external texts that fit contextually with the short text are not readily available. When automatically identifying the quality of short texts that are already sufficiently categorized, the second approach is not as well suited. In the context of online comments and forum posts, texts are often in threads (i.e. a tree structure containing texts) attached to some web object (e.g. comments on a news article) where comments may well display similar topic distributions, making classification by topic models less applicable.

The problem of automatically determining the quality of short texts in threads is a previously studied classification task. A number of learning methods have been applied to this problem, including k-Nearest Neighbours (kNN) [68], naïve Bayes [62, 163] and Support Vector Machines (SVMs) [164]. Using SVM-based methods is a particularly popular approach [180] for tasks involving short texts, because SVMs are versatile and can be modified to work with both dense and sparse data.1 The kernel used by the algorithm can be customized according to the format of the data (e.g. numerical, string, graph or tree data), allowing for both linear and non-linear classification. The performance of SVM-based methods depends on the choice of kernel as well as the quality of the training data provided [80]. The SVM kernels considered in this work cannot be trained as-is on textual input data. The data must first be transformed into features that can be recognized by the specific kernel function (typically numeric features). Thus, most approaches to short text classification focus on improving the quality of the features extracted from the texts [160, 36, 180], with studies comparing the approaches using SVMs with one or two kernels.

1 The techniques for feature extraction used in this thesis produce both dense and sparse feature sets.
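To make this concrete, the sketch below shows one common way of turning raw comment text into numeric features and fitting support vector classifiers with two different kernels. This is a minimal illustration using scikit-learn, not the exact pipeline used in this thesis; the sample comments and labels are invented for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Invented toy data: 1 = acceptable comment, 0 = poor-quality comment.
comments = [
    "Great article, thanks for the insight!",
    "CLICK HERE for free money",
    "I disagree; the study ignores sampling bias.",
    "u r all idiots lol",
]
labels = [1, 0, 1, 0]

# Transform the text into sparse numeric features (TF-IDF over word unigrams).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

# The same features can be fed to SVMs with different kernels.
linear_clf = SVC(kernel="linear").fit(X, labels)
rbf_clf = SVC(kernel="rbf").fit(X, labels)

new = vectorizer.transform(["Insightful comment, well argued."])
print(linear_clf.predict(new), rbf_clf.predict(new))
```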

This thesis tackles the problem of automatically predicting and analysing the quality of online comments. This problem has been intensively studied since 2007 [132, 73, 89, 60, 178, 28]. We treat the task as a supervised learning problem and evaluate various approaches to feature set construction for multiple data sets, of which two were provided by News24 and a third obtained from Slashdot.org.

The leading feature extraction approaches for short texts include manual feature construction and models based on N-grams [180, 97, 32]. More recently, deep learning approaches to feature extraction have been proposed for various machine learning tasks, including text classification for short texts [159, 56]. This work considers these approaches in the context of predicting the quality of online comments.

1.1.1 Comment Quality Prediction

This investigation into techniques for automatic quality prediction of short texts was motivated by a data set provided by News24 (a subsidiary of Media24), a popular South African news provider that serves news articles to a mainly South African audience. They wished to replace their current article comment moderation system with a more sophisticated one, aiming at two goals, which shaped the nature of this study. First, they wanted to improve the general quality of commentary, so as to legitimise the website content and to better establish themselves as a world-class online news provider. This would hopefully have increased their user engagement. Second, they hoped to maintain the current user engagement levels of the website. The problem was that certain highly engaged users were aggressive, defamatory, and generally displayed malevolent behaviour that negatively affected the 24.com brand as well as the users of the site. After providing the data, but before the completion of this study, News24 revised their comment policy, permanently disabling all comments on all articles from the 11th of September, 2015 [172]. Their decision further illustrates the extent and severity of the problem; as the News24 editor-in-chief said in their official statement on the matter [172]: "Our decision to change our comments policy follows months of internal debate and discussion which has seen us consider all options practically available to us on how to wrangle the thousands of comments which are made on 24.com each day."

Before this policy revision, News24's system ranked comments by date only (oldest comments first), and allowed users to flag comments that they believed to be against News24's terms of use. A team of moderators then manually reviewed flagged comments, either unflagging the comment (and disallowing future flagging of the comment), or removing it permanently from the site. This was a labour-intensive and imperfect system, since moderators were often biased in their moderation [157]. Therefore, News24 wanted to reduce the effort and time required by editors to moderate the comments, or alternatively to remove the need for editors completely.

To compare the quality prediction models for News24 comments, another set of comments was collected from a more regulated online news aggregator website called Slashdot [42]. Slashdot allows users to post links to news and other articles from other source websites, including online news providers. Users are then able to comment on these articles, either as a registered user or anonymously. Their moderation goals are similar to News24's: they wish to promote quality comments, make their content as readable and accessible as possible and do this all in an efficient way that minimizes the time required by any single moderator. They designed an automatic moderator management system to achieve these goals, which is further discussed in Chapter 2.

The case of Slashdot is fairly similar to the News24 case. News24 also has a vast volume of comments to filter and a small team of editors to do it, so they have to rely on users to report malicious comments that they can then manually remove. The two data sources differ in that the Slashdot comment corpus generally contains well-formed comments with less colloquial or regional language usage than the News24 comments. News24 is known for appealing to a broad audience, being a general news provider, whereas the Slashdot community is generally more homogeneous and often more educated, since the type of news that is posted to Slashdot is of a technical nature. The scoring mechanism used by Slashdot (as discussed in Section 1.2.3) also provides a more fine-grained and subtle measure of quality than the flagging system used by News24.

The moderation approaches used by Slashdot could have served well for a news provider like News24, but the system would still rely on an initial comment score that would need to be automatically determined, after which a comment would be moderated over time, delaying its stable score. Thus, both News24 and Slashdot could benefit from a system that performs completely autonomous moderation of comments as they arrive.


1.1.2 Research Question

The quality of online comments, as investigated in this thesis, is related to the moderation schemes defined for the data sets used in the investigation. Poor-quality comments could often simply be spam, which is why we use a simple spam detection model as a baseline approach to measure the performance of our approaches. The leading manual feature construction approaches, as motivated by the research literature about online comment classification, are also investigated. As a comparison to these techniques, N-gram-based models and techniques from distributed representation models are also investigated.

Thus, this thesis aims to answer the following research question:

How do feature construction techniques based on N-gram models and distributed representation models fare against leading manual feature construction approaches?

Research Objectives

To address this research question, the following research objectives were identied:

1. investigate and implement various leading approaches to manual feature construction for online comments;

2. investigate and implement N-gram-based models for building representative feature sets for comments;

3. investigate and implement word embeddings as a representative technique for distributed representation of text; and

4. evaluate the predictive performance of these approaches against a baseline pre-trained spam detection model.

1.2 The Data

Three data sets will be used to contextualise and answer the research question posed in this thesis. The first data set is extracted from a collection of databases provided by News24. The second data set was obtained by having News24 staff manually classify (according to pre-set criteria) comments extracted from the databases originally provided by News24. The last data set was automatically extracted (via web scraping) from the Slashdot website, since a suitable existing Slashdot data set could not be found. The data sets are discussed in more detail below.


1.2.1 News24 Basic Data

News24 allowed its users to leave comments on articles. A user could either directly leave a comment on an article (referred to as a parent comment) or on a parent comment (referred to as a child comment). A parent comment together with all the child comments following it is called a comment thread. Figure 1.1 shows a fragment of a typical comment thread where one user posted a comment and another user commented on that comment. Most articles had multiple comment threads associated with them. Table 1.1 lists some summary statistics of this data set.

Figure 1.1: Part of a typical News24 comment thread.

Number of comments 130713

Number of parent comments 82325

Number of child comments 48388

Average number of child comments per parent 0.57

Average number of comments per article 22.20

Average number of words per comment 61.45

Percentage of `hidden' comments 24.6%

Percentage of reported comments 6.6%

Table 1.1: News24 comment corpus statistics.

Users were also able to vote on News24 comments in the form of likes and dislikes, as well as flag comments that they felt were against the terms of use of News24. When flagged, members of the editorial team decided whether the comment should be removed from the site or not. The editorial team also actively reviewed unflagged comments to determine whether they should be removed. A simple automatic system (i.e. one without the need for user reports or manual editorial effort) was also in place to detect and remove bad comments via obvious hot-word signals. Comments that were removed received a hidden status, but were still stored in the database. If a comment was not removed, it received the default visible status.

After a comment was flagged, editors decided whether to remove the comment (i.e. make it hidden) based on whether:


Unreported and visible 63.26%

Unreported and hidden 15.19%

Reported and visible 12.81%

Reported and hidden 6.06%

Automatically hidden (hot-words) 2.67%

Table 1.2: Distribution of comments according to their report, hidden and hot-word labels.

• it contained advertising or spam;

• it contained abusive language, hate speech or profanity;

• it contained completely incorrect grammar;

• it included text-speak;

• it included nicknames or insulting names for the president or the individual(s)/group the article is about; or

• it referred to racial stereotypes or contained racial slurs.

Thus, there are six implicit categories of comments in the News24 data set:

1. unreported comments unseen by editors;

2. unreported comments reviewed by editors and accepted by editors;

3. unreported comments reviewed by editors and made hidden by editors;

4. reported comments that were accepted by editors;

5. reported comments that were made hidden by editors; and

6. hot-word comments that were immediately automatically made hidden.

Table 1.2 shows the distribution of comments according to this categorisation. Since the data does not contain any information on whether comments have been seen by editors or not, categories 1 and 3 are collapsed into one category. This is a challenge when classifying the comments, since there might be comments that are being shown that fit the criteria for being made hidden, but have simply not been considered by editors. This problem motivated the necessity for a second data set that carried the guarantee of each comment having been seen by a member of the editorial team. This data set is further discussed in Section 1.2.2.

This thesis focusses on the task of predicting the hidden vs. visible status for unlabelled comments (typically newly posted comments) automatically, i.e. to classify a comment to reflect the opinion of the editors. A secondary task, which is not investigated, is predicting whether a comment would eventually be reported or not, i.e. to classify a comment to reflect the opinion of the users. The task of automatically finding hot-word comments has been completed by News24 and is not of interest in this thesis, thus these automatically removed comments are not included in the data sets used for this thesis.

Data Description

News24 provided two SQL databases, both containing data and metadata about comments. One database contains unlabelled comments that were accumulated over the last six years, and the other contains labelled comments that range over a single month. The database of labelled comments forms the basis for the large News24 data set, which we will henceforth refer to as News24-large. The second, smaller, labelled data set was generated from this large labelled data set, as described in Section 1.2.2, and will be referred to as News24-small. The large database of unlabelled comments was used to train the topic models and the deep learning models, as discussed in Chapter 3.

The labelled database contains the following relevant fields for each comment: a unique comment identifier, a unique author identifier, a unique thread identifier, the id of the parent comment (or null if a parent comment), the name of the author, the body of the comment, the title of the article, the body of the article, the number of reports, the comment's hidden status, and the date and time the comment was posted.

The unlabelled database contains the same fields, with the exception of the hidden status.
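As an illustration only, this record layout maps naturally onto a small data class; the field names below are invented to mirror the description and do not reflect the actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Comment:
    """One labelled News24 comment row (hypothetical field names)."""
    comment_id: str
    author_id: str
    thread_id: str
    parent_id: Optional[str]  # None if this is a parent comment
    author_name: str
    body: str
    article_title: str
    article_body: str
    report_count: int
    hidden: bool              # absent in the unlabelled database
    posted_at: datetime
```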

1.2.2 Small News24 Data Set

In an attempt to get a better labelled data set where there is no ambiguity on whether comments have been seen or not, a subset of the comment corpus was presented to the News24 editorial team for rating. For this task, an alternative rating scheme was proposed in an attempt to get more meaningful flags for comments.

The comments were each labelled into categories numbered from 1 to 3, with 1 representing a very low quality comment, 2 representing a comment that is suitable for the website, and 3 representing a remarkably sensible comment. The team decided internally that a comment was to be explicitly labelled as 1 if it:

• contained text-speak;

• consisted mainly of capital letters;

• included profanity;


Number of comments 4796

Number of parent comments 2943

Number of child comments 1853

Average number of child comments per parent 1.48

Average number of comments per article 30.94

Total number of words ±300000

Average number of words per comment 67.85

Ratings distribution 1=44%, 2=52%, 3=2.4%

Table 1.3: Statistics about the small News24 comment set.

• was not relevant to the topic;

• included the text "cANCer"2;

• contained insulting nicknames for the president or the individual the article is about;

• attacked another commenter;

• contained racist or abusive language; or

• referred to racial stereotypes.

2 This is a slur referring to the leading political party in South Africa, the African National Congress (ANC).

Table 1.3 shows some more information about the data sets and the distribution of the class labels. Since so few comments were labelled as 3, the labels of 2 and 3 were merged to split the corpus into two classes: acceptable or not acceptable for the live website. The resulting data set is thus similar to the large News24 data set, but with the added certainty that all the comments were reviewed by editors.

1.2.3 Slashdot.org Data

The Slashdot model is quite different from that of News24. Their articles are all sourced from other websites, with links posted on Slashdot by the users. Users are also able to comment on these posts, as well as comment on other comments, forming a comment tree of limitless depth, unlike News24, where the comment thread depth is limited to 2.

Comments are given integer scores from -1 to 5, where 5 is the highest score. Readers of the site can then set a score threshold for the comments to be displayed to them, effectively hiding all posts below some threshold (if set). Some posts are also tagged by both moderators and users to further help readers identify posts that they wish to read, as well as allow users to categorise posts.


Figure 1.2: Part of a typical Slashdot comment thread.

Figure 1.3: Distribution of comment scores on Slashdot.org.

The tags can be any word, but some common tags are typically used, such as funny, informative, troll and flamboyant.

Registered users' comments are generally seeded with a score of 1 (although this score could also be 0 or 2, based on the ratings of their previous posts, also called the user's karma), while anonymous users' comments begin with a score of 0. Once a comment is seeded with a score, Slashdot automatically assigns moderation privileges to certain users (willing participants that are registered, regular users with a positive karma), allowing them to modify this comment score. Figure 1.2 shows a typical Slashdot comment with a reply. At the time, the original comment received a score of 0 (probably the default score, since the post was made by an anonymous user), but the reply received a higher score of 2 (also probably due to the poster being a registered user with a high karma). As with the News24 comments, there is no way to clearly determine whether a comment has been moderated; however, Slashdot's moderation scheme assigns moderator privileges to users according to the demand of unmoderated comments. The moderation scheme is discussed further in Section 2.1.
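The seeding rules described above can be summarised in a few lines; the sketch below is only an approximation, since the exact karma thresholds Slashdot uses are not given here, so the cut-offs are invented.

```python
def seed_score(registered: bool, karma: int = 0) -> int:
    """Approximate seed score for a new Slashdot comment.

    Anonymous comments start at 0; registered users usually start at 1,
    nudged to 0 or 2 by the karma earned on previous posts. The karma
    cut-offs below are invented for illustration.
    """
    if not registered:
        return 0
    if karma < 0:    # poor posting history
        return 0
    if karma > 50:   # consistently well-rated posts
        return 2
    return 1

print(seed_score(registered=False))           # 0
print(seed_score(registered=True, karma=80))  # 2
```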

The distribution of ratings in the data sets is shown in Figure 1.3. Table 1.4 shows some statistics about the Slashdot comment corpus that was extracted.


Number of comments 289215

Number of articles 2231

Average number of comments per article 129.63

Number of anonymous comments 91057

Number of registered comments 198158

Average number of words per comment 84.41

Table 1.4: Slashdot comment corpus statistics.

Data Extraction Methodology

The Slashdot comments were obtained by means of parsing web data (i.e. web scraping). An open source Java library called Jsoup [71] was used to access a URL containing an archive of all Slashdot articles, arranged from most to least recent (http://slashdot.org/archive.pl). The archive page contains links to all articles published on Slashdot. Using Jsoup, each link was followed to its corresponding content page containing the article as well as all the comments related to that article. From each comment, the following information was extracted: the comment content, the comment's current score, the date and time the comment was posted, the username of the comment's author (or anonymous), the comment's immediate parent comment, as well as the comment at the root of the thread.

This process was executed on 2 August 2015 and was allowed to run until the comments of the 2231 most recent articles had been captured.
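As a rough sketch of this kind of scraping loop, the snippet below uses Python's requests and BeautifulSoup instead of Jsoup, and the CSS selectors are placeholders invented for the example, since Slashdot's actual markup is not reproduced here.

```python
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "http://slashdot.org/archive.pl"

def scrape_comments(max_articles: int = 5) -> list[dict]:
    """Follow article links from the archive page and collect comment text.

    The "a.story-link" and ".comment" selectors are placeholders; the
    real Slashdot markup differs.
    """
    archive = BeautifulSoup(requests.get(ARCHIVE_URL).text, "html.parser")
    links = [a["href"] for a in archive.select("a.story-link[href]")]
    comments = []
    for url in links[:max_articles]:
        page = BeautifulSoup(requests.get(url).text, "html.parser")
        for node in page.select(".comment"):
            comments.append({
                "body": node.get_text(strip=True),
                # score, timestamp, author, parent and root comment ids
                # would be pulled from further placeholder selectors here
            })
    return comments
```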

1.2.4 The Unbalanced Data Problem

With the large News24 and the Slashdot data sets, the class value is unevenly distributed among the data points: specifically, one class label clearly dominates the others in the data set. With the News24 data sets, only about 25% of the comments have been made hidden by editors. With the Slashdot data sets, the effects are more severe, with approximately 45% of comments having a score of 2 (largest class) and only 2% of comments having a score of -1 (smallest class).

With traditional measures of accuracy, a classifier can achieve inflated accuracy by simply classifying all the data points as the majority class [173] (about 75% in the case of the News24 data). To address this, other assessment measures such as sensitivity and specificity are also used in this thesis [34].

Attempts have been made to deal with this unbalanced data problem in other domains such as fraud detection [57, 22], where the data is often considerably more unbalanced. Two common types of solutions to this problem exist, viz. cost-based and sampling-based solutions [133]. Since we have no information about the cost of predicting certain classes above others, a cost-based solution would not be well suited. We use a sampling-based approach for dealing with class imbalance in the Slashdot data set (as discussed in Section 5.3).
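As a concrete illustration of why plain accuracy misleads on skewed labels, and of the kind of sampling-based correction referred to above, consider this sketch with a synthetic 75/25 label split standing in for the News24 data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels: 75% visible (0), 25% hidden (1), mimicking the skew.
y = rng.choice([0, 1], size=1000, p=[0.75, 0.25])

# A "classifier" that always predicts the majority class looks accurate...
majority = np.zeros_like(y)
print("accuracy:", accuracy_score(y, majority))   # about 0.75
# ...but its sensitivity (recall on the hidden class) exposes the failure.
print("sensitivity:", recall_score(y, majority))  # 0.0

# A simple sampling-based fix: randomly undersample the majority class
# so both classes are equally represented before training.
hidden = np.where(y == 1)[0]
visible = rng.choice(np.where(y == 0)[0], size=len(hidden), replace=False)
balanced = np.concatenate([hidden, visible])
print("balanced class counts:", np.bincount(y[balanced]))
```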

1.3 Thesis Overview

Chapter 2 contextualizes this thesis with previous approaches to similar problems in the research literature, and provides background on the techniques used in this thesis. Chapter 3 lists and discusses the construction of various domain-specific features based on research by other authors, as well as how features were extracted using N-grams, basic topic modelling techniques, and deep learning. Chapter 4 presents the research methodology, followed by experimental results in Chapter 5. Finally, Chapter 6 summarizes the findings of this thesis and discusses possible future work to continue this research.

Our results show that N-gram models generally perform better than manual techniques for feature construction, with character N-grams and skip-grams showing great promise. We are also able to confirm that word embedding models based on deep learning techniques are able to perform as well as the N-gram-based approaches in some cases.


Chapter 2

Background

This chapter discusses supervised learning techniques for quality prediction in short texts.1 Background information on N-gram models, topic modelling and deep learning is also presented. Section 2.1 details some moderation schemes used by websites that make use of user-contributed content. Section 2.2 provides an overview of the research literature regarding automated quality prediction for short texts. Section 2.3 contextualises the task of quality prediction in the field of machine learning and provides background on the Naïve Bayes (NB) and SVM algorithms used to approach this task. The sections that follow discuss the background of the techniques used for feature extraction in this thesis. Section 2.4 provides a background for N-gram-based techniques for feature extraction. Section 2.5 discusses topic modelling and how it is used for short texts. Finally, Section 2.6 provides background about the proposed deep learning techniques used in this thesis.

1 This chapter contains portions of work from previous papers co-authored with one or more of my co-supervisors [25, 26].

2.1 Existing Moderation Schemes

There are various existing moderation schemes that seek to filter and moderate user-submitted contributions based on their subjective quality (i.e. what the specific moderators define as quality). This section discusses some well-known examples of content moderation systems, namely those of Digg.com, Wikipedia.org, Slashdot.org and Reddit.com, as well as two common moderation tools, namely Debate and Disqus.

In 2004, Digg.com [49] was launched. Digg is an aggregator of online news content, curated by users and presented in a concise fashion. The content consists of articles and stories from various domains. Users can vote for content, giving an article an up-vote if the user liked it, increasing its Digg score. The Digg score allows the website to rank articles and to better filter information before presenting it to a user. The site also has several internal moderators who attempt to determine whether an article's Digg score is fair and accurate. If not, they can adjust the score to provide a more suitable ranking for an article. This is an example of a system that uses centralised moderation or supervisor moderation (where the moderation is carried out by internal moderators and not users). The users are only able to vote on content, not to explicitly moderate it.

An alternative moderation scheme is distributed moderation (where the majority of the moderation is carried out by users). There are various examples of distributed moderation, of which the most well known is Wikipedia [184]. Users can post verifiable content (verified by means of citations) and edit or remove content posted by other users. There are also bots for automatic detection of abuse by users, such as the system developed by Santiago M. Mola-Velasco [122]. This platform of distributed content provision and moderation has led to Wikipedia becoming the world's largest online encyclopedia [18]: it has more than 100,000 active unpaid volunteers and encyclopedia entries in more than 270 languages [18]. Their status as a free and open source encyclopedia is only possible because of the volunteers and the fundraising they are able to do. This is directly influenced by their user engagement, as this is what attracts volunteers and funders. Various publications mention the efficiency of Wikipedia's moderation scheme [47, 185].

Other examples of smaller-scale distributed moderation are Slashdot [42] and Reddit [123]. Slashdot's moderation scheme first assigns a seed rating to a comment, followed by continued moderation by users with moderator privileges (see below). All users in the system have karma representing their reputation on the site, with a positive karma value considered to indicate good reputation. A comment's seed rating is determined by the poster's karma level. The rating changes as moderators choose to increase or decrease the rating of a comment.

Slashdot's system automatically assigns moderator status to certain users. These users are chosen based on the following factors: whether they are a registered user, whether they regularly consume Slashdot content, whether they have been active for a certain period of time, and whether they are positive contributors themselves.

When a user is given moderator status, they are provided with a number of points of influence that they can use to moderate comments. Each comment they moderate deducts a point, and when their points have run out, they lose their moderator status until they are automatically asked to moderate again. Moderators are also not allowed to participate in the same discussion that they are moderating. A problem with using human moderators is that comments are often not immediately moderated, but rather when the moderator finds time to do so. Another problem is that moderators often do not find consensus on the rating a comment deserves.

Reddit uses a very structured system of distributed moderation. Reddit consists of multiple individual forums, called subreddits. Each subreddit has one or more moderators (often the subreddit's creators) assigned to it to control the content that is posted to the subreddit. A moderator has full control of the content of the subreddit, which is motivated by their interest in maintaining the vision and mission of the subreddit. Moderators are free to remove content, approve content that was erroneously removed, distinguish whether items are safe for work or not, and change the titles of posts. They cannot, however, edit submissions or see any personal details of a user that submitted content.

Comments on Reddit are either moderated by the community via a reporting scheme similar to that of News24, picked up by automatic spam filters, or manually assessed by moderators. Users are able to report comments for being against the rules of the particular subreddit: some subreddit rules are very strict, whereas others have more of an "anything goes" policy regarding content. These comments are then shown to the moderators of the subreddit for permanent removal or reinstatement. Reddit also allows users to provide up and down votes for comments, which serves as a way to filter comments that the community prefers.

On the surface, a system such as the moderation scheme used by Reddit works well, because moderation effort is distributed to users, instead of having a single site-wide point of moderation. This allows individuals to use forums that interest them, even though they may be controversial or run counter to public opinion, without the risk of having their comments removed. This does not always happen though, as moderators of subreddits may become tyrannical in their moderation practices, since they might be the sole moderator of a subreddit, which might discourage participation in the particular subreddit.

Various freely available tools exist for online news websites. Debate [21] is a website plugin for Wordpress that gives comment management functionality to blog administrators. It provides a sophisticated structure of comment threads, email-based replies, profiles for commenters and various widgets for the administrator to view statistics about the comment base. It uses a distributed reputation system based on comment quality ranking. Each registered user has a reputation score based on the average quality of their comments, which is in turn determined by the number of up-votes and down-votes their comments receive, as well as the length of the comment and the time the comment was posted. They do not do any additional language filtering of the comments beyond spam filtering [6, 107].

Disqus [165] is a free system that can be integrated into various web platforms (Wordpress, tumblr, Blogger, Drupal, etc.). The system provides a moderation tool to their customers that allows them to remove comments or mark them as spam. The actual filtering and ranking of comments happens on the Disqus servers and is mirrored to the site it is integrated with.

In summary, all these systems require manual moderation, although some systems provide initial seed ratings before manual moderation commences.


2.2 Literature Review

This section reviews various publications that parts of this thesis are based on. Some of these publications focus specifically on quality prediction in online comments, while others address more general text classification tasks.

2.2.1 Slashdot: Peer Moderation

Lampe and Resnick [97] attempted to study the efficacy of the Slashdot moderation scheme. They asked the question: "Can a system of distributed moderation quickly and consistently separate high and low quality comments in an online conversation?" Their analysis shows that the basic idea of distributed moderation works on Slashdot. After enough time, moderators seem to find some consensus, even though complete agreement is not achieved. The remaining problem they identified was that good-quality comments take too long to be identified by moderators.

Lampe and Resnick also noted that general user satisfaction diminishes as more users participate in a conversation space. Some users display disruptive and anti-social behaviour that reduces participation of other users in online conversations. They investigated various methods of limiting the disruptive effect of anti-social behaviour. These methods included analysis of usage logs (records of comments and moderations) and conducting interviews with moderators to get explanations of phenomena in the comments base. They examined the distribution of comment scores and observed a strong correlation between different levels of user participation and comment scores. They were also able to determine both the median time until a comment receives its first moderation, as well as the time until half of the comments that will eventually receive low (0 or less) or high (4 or greater) scores have been scored, viz. 83 minutes and 148 minutes respectively. Similar methods were investigated in other studies with other data sets (e.g. [64, 169]). Specifically, Szabo et al. [169] were able to show that most Digg stories reach their final stable popularity score within one day.

2.2.2 User Reputation

Interactions between users in an online social environment can be informative and productive, but also destructive. Although interaction on the internet is often characterised by anonymity, maintaining and using some form of reputation system for users has been shown to add value to both users and platform providers [140].

Resnick et al. [140] noted that a reputation system allows collection and aggregation of information about participants' past behaviour, while still allowing participants to remain anonymous. Users can base their interaction with other users of the system on those users' reputations. This


encourages trustworthy behaviour and deters untrustworthy users. Thus, reputation systems are a way of building trust online [66].

Chen et al. [35] distinguish between internal and external reputation scores.

External scores are made entirely public and can be used as an incentive for users to produce good content. Yahoo! Answers shows the points that a user has earned for providing good answers to questions, thus providing some measure of user reputation. Internal reputation, on the other hand, is never revealed to users and is only used in internal applications.

Reputation is often used for the following applications [35]:

• Ranking of content: user-contributed content can be ranked or recommended based on the poster's internal reputation;

• Enriching existing content: a site may want to show tweets (or other social media posts) of reputable users that relate to an article, as a means to enrich the article content;

• Peer moderation: to moderate comments or other user-contributed content on a site, certain reputable users could be given moderation privileges (e.g. StackOverflow [162]). Slashdot also makes use of this in their karma model; or

• Abuse detection: the reputation scores of users could be used as additional features in an abuse detector (since reputable users are typically less likely to abuse the system).

This thesis will consider the use of internal reputation for ranking of content and peer moderation when user-based features are constructed.

In 2007, Chen et al. [38] introduced a reputation model for users in a question and answering (QA) system. Their model combines traditional user ratings [141] (positive, neutral or negative votes towards a user) with an analysis of the QA social network.

They also proposed the idea of constructing a graph of user interaction (i.e. a sociogram where nodes are users and edges are interactions between users) and weighting the edges by the reputation of the users involved in the relation. Users with a higher reputation will affect other users' reputations to a greater extent, whether positively or negatively. This is similar to how PageRank [129] determines the relative importance of nodes in a network (e.g. web pages, articles or users).
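Since PageRank-style scores over such an interaction graph are later used as user features in this thesis, a compact sketch of the algorithm may be useful. This is a standard power-iteration implementation on a toy sociogram, not code taken from the thesis.

```python
def pagerank(graph: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank over {node: [outgoing neighbours]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:  # dangling node: spread its rank over all nodes
                for t in nodes:
                    new_rank[t] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# Toy sociogram: an edge u -> v means user u replied to user v.
interactions = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": []}
print(pagerank(interactions))
```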

2.2.3 Automatic Scoring and Prediction

Wanas et al. [180] attempted to extend the work done by Weimer et al. [183] and Lampe and Resnick [97]. They wanted to improve on the various techniques used by these authors.


Weimer et al. investigated methods for classifying forum posts. They proposed a set of features that addressed some known issues with classifying forum posts (e.g. the short average length of posts). The proposed features ranged from surface features (e.g. capitalized word frequency) to more complex linguistic features (e.g. relevance). They then designed and trained a classifier to classify comments as `bad' or `good'.

Similar to the study by Lampe and Resnick [97], Wanas et al. [180] investigated the moderation schemes used on Slashdot. The moderation scheme that Slashdot used during this study (and still uses) was somewhat dependent on human input, and as Wanas et al. noted, a significant amount of time needed to pass before users were able to identify good-quality comments. Additionally, earlier posts received more attention, and posts that received an incorrect seed rating (or early moderated rating) often did not have their rating changed. Wanas et al. proposed a scheme of automatic post ranking based on a set of features given to a classifier (SVM classification with a Radial Basis Function (RBF) kernel). Similar work was done by Hsu et al. [79], but using support vector regression (SVR).

Wanas et al. considered the features originally proposed by Weimer et al. [183] and constructed a set of 22 features, categorised into five classes, viz. relevance (the appropriateness of posts in their respective threads), originality (the novelty of posts compared to others in their threads), forum-specific (various measures of the level of discussion a post evokes), surface (how well the contributor presents their post) and posting component (e.g. presence and quality of weblinks in posts) features. The forum-specific features were shown to contribute most to the accuracy of the post ratings, but consisted of complex linguistic features that rely on posts consisting of higher-quality English, which is not necessarily the case for Slashdot. They overcame this problem by building features conscious of linguistic phenomena in online forum posts and by building a lexicon of keywords that were domain-specific.

Contrary to Weimer et al., Wanas et al. used slightly finer ratings for posts. The experiments by Wanas et al. showed their classifier to be 50% accurate when classifying posts as bad, average and good (according to their predetermined partitioning of the Slashdot posts' scores into these three classes).2 Their experiments also showed that structural features (length, punctuation, etc.) of posts were more significant in classification than features analysing the actual text (spelling, grammatical quality, etc.). We found that certain features based on user activity (e.g. number of posts of the user or the number of comments the user has made in the past, or out degree) emerged as important features (see Section 3.5).

2 We were able to obtain similar, but slightly better, results. For a full description of these results, see Chapter 5.

Hsu et al. [79] studied similar methods for predicting the quality of comments on Digg [49]. Instead of simply predicting a comment's score, they attempted to rank comments based on predicted scores. They used an SVR model with an RBF kernel. They focused their efforts on finding the relative rank of a comment, as opposed to the actual value. They compared their ranking to a random ranking and a date-wise chronological ordering. They were able to achieve a much higher ranking correlation score than the random or chronological orderings.

Cheng et al. [39] investigated the communities on three international news sites, viz. CNN.com, Breitbart.com and IGN.com. They attempted to predict a banned status for users (as opposed to a score or rating of the user's comment). They categorised features into four groups, namely post features (concerned with the literal content of posts), community features (popularity in the community as shown by votes), activity features (the user's general patterns of use) and moderator features (number of posts that have been deleted, etc.). They attempted to use these features to predict which users would become banned. Unsurprisingly, they showed that moderator features contribute the most to the performance of the classifier. The community features were a close second, which showed that community moderation was closely aligned with the editors' preferences.

Mishne and Glance [119] did a comprehensive study on online comments and built a binary decision tree classifier with a custom feature set similar to the features used in this thesis. They achieved a mean F1-score of 0.88 for 10-fold cross validation. Similarly, Brennan et al. [27] achieved an overall precision of 0.82 when classifying Slashdot comments with binary class labels, using an SVM classifier with 10-fold cross validation. Jamali and Rangwala [83] investigated various algorithms for comment classification, and they were able to obtain a 0.84 F1-score on binary classification using SVM-based methods with 5-fold cross validation. We were able to achieve an accuracy of 0.874 and a sensitivity (or recall) of 0.818 when evaluating a binary classifier on News24 comments, which is comparable to the results achieved in the literature. For more information, see Section 5.2.

Otterbacher [128] suggested an alternative approach to the community rating system employed by Lampe and Resnick [97]. Instead of rating the `interestingness' of a post (i.e. how interesting users may find a post), the community rates its `helpfulness' (i.e. the benefit a user gets from a post). The study was performed on user product reviews from Amazon.com. The study uses measures of post quality developed by Wang and Strong [181]. The framework comprises four measures of post quality, the first three of which are relevant in the context of this work:

1. Intrinsic quality: includes believability, accuracy, objectivity and reputation;

2. Contextual quality: includes relevance, timeliness, completeness and sentence/word counts;

3. Representational quality: includes ease of understanding, conciseness and consistency; and

4. Accessibility: concerns how easy it is for users to gain access to information.

The study looks at the correlation between these measures and the average `helpfulness' ratings of reviews (as provided by users). A simple linear regression model was trained for this purpose and it was able to achieve an R² score of 0.4.

The hand-crafted features designed and mentioned in the research above form part of the basis for the manual feature approaches studied in this thesis. The features we implement and investigate are further discussed in Chapter 3.

2.3 Supervised Learning Methods

This section discusses the major approaches to supervised learning that we use in this thesis. The SVM techniques specifically make use of the feature extraction approaches outlined in the rest of this chapter. As mentioned before, this thesis addresses the problem of automatically predicting the quality of online comments, and since we have labelled comments and the goal is to predict this label for unseen comments, this problem is a supervised learning problem.

Predicting the score of a Slashdot comment is a more complex task than predicting a binary label for a News24 comment, as there are seven classes and the classes have an intrinsic order, unlike the hidden status of News24 comments. The ideal classifier would be able to predict the correct score among these seven scores; however, as we discuss in Chapter 5, this task proved to be very hard to accomplish, partially due to the class imbalance in the Slashdot data.

Following a series of experiments with various labelling strategies for the Slashdot comments, the most viable option was to remove the comments labelled as `2' and group the comments into two classes (i.e. less than two and greater than two). This makes comparing the results of the News24 data sets and the Slashdot data set simpler, because they follow the same binary classification scheme.
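For illustration, the relabelling just described amounts to a filter and a threshold; the sketch below assumes the scores sit in a pandas column named score, which is a hypothetical name.

```python
import pandas as pd

# Hypothetical Slashdot comments with integer scores in [-1, 5].
df = pd.DataFrame({"score": [-1, 0, 1, 2, 3, 4, 5, 2, 1]})

# Drop the middle score and binarize the rest.
df = df[df["score"] != 2].copy()
df["label"] = (df["score"] > 2).astype(int)  # 1 = high quality, 0 = low
print(df["label"].value_counts())
```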

In this thesis, the main supervised learning algorithm used for comment quality prediction is the SVM algorithm for classification. A Naïve Bayes model is also used in the context of a simple spam detection model [161]. These techniques are explained in more detail below.


2.3.1 Naïve Bayes

The NB classifier [145] is a simple probabilistic model. Despite its very simple form, the algorithm has shown good performance in many real-world classification tasks [51, 90, 147]. Given an input feature vector $x = (x_1, ..., x_n)$, the NB model attempts to predict the conditional probability of the input vector having class label $C_k$, i.e. $p(C_k|x)$. This is known as the posterior probability of the class, which can be formulated, using Bayes' rule, as
$$p(C_k|x) = \frac{p(C_k)\,p(x|C_k)}{p(x)}$$
where $p(C_k)$ is known as the prior probability of the class, $p(x|C_k)$ is known as the conditional probability (or likelihood) of the observation $x$ given the class, and $p(x)$ is a normalization factor known as the evidence. In practice, the only interest is in the numerator of the fraction, since the denominator is the same for all classes for a specific observation $x$. NB makes the naïve assumption that the features in the feature vector are conditionally independent given the class membership. Thus, the likelihood can be reformulated as follows:
$$p(x|C_k) = p(x_1, ..., x_n|C_k) = \prod_{i=1}^{n} p(x_i|C_k)$$
which means the posterior probability satisfies:
$$p(C_k|x) \propto p(C_k) \prod_{i=1}^{n} p(x_i|C_k)$$
An NB classifier is trained to maximise this posterior probability given an observation $x$ and $K$ possible classes. This is shown in Equation 2.1.
$$\text{predicted label of } x \leftarrow \underset{k=1,...,K}{\arg\max}\; p(C_k) \prod_{i=1}^{n} p(x_i|C_k) \tag{2.1}$$

A naïve Bayes classifier trained on bag-of-words representations (as explained in Section 2.4.1) is used in this thesis as a simple technique for finding spam comments in the News24 data set.
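As an illustration, the following is a minimal sketch of such a spam filter using scikit-learn; the comments and labels below are invented, and the actual News24 spam model may differ in its details:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training comments with binary labels (1 = spam, 0 = not spam).
train_texts = [
    "Buy cheap watches now",
    "Great article, thanks for sharing",
    "Click here to win money",
    "I disagree with the author's conclusion",
]
train_labels = [1, 0, 1, 0]

# Bag-of-words representation: each comment becomes a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Multinomial NB fits the class priors and per-class word likelihoods,
# then predicts via Equation 2.1.
model = MultinomialNB()
model.fit(X_train, train_labels)

X_test = vectorizer.transform(["win cheap money now"])
print(model.predict(X_test))  # expected: [1]
```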

2.3.2 SVM

The theoretical basis of Support Vector Machines (SVMs) was laid by Vapnik and Lerner in 1963 [177], and later developed by Cortes and Vapnik to form the family of algorithms we know today [44]. In 1990 [154], the introduction of the kernel trick made SVMs much more versatile and able to model a much wider variety of data. The use of slack variables for classification of data that is not linearly separable was introduced by Smith in 1968 [158] and improved upon by Bennett and Mangasarian in 1992, when it was introduced to SVMs [13].

The simplest SVM implementation is a non-probabilistic binary linear classifier which attempts to find a line that can accurately separate data points into two classes. When the classes cannot be separated by a line, a non-linear function (a curve) might be able to separate them, but viewing the data points in a higher-dimensional space and using a hyperplane is more efficient computationally [3, 23, 153]. Specialised kernels (e.g. string kernels [102]) can be used to compare data points directly, without the need for intermediate feature construction.

Support Vector Machines (SVMs) are suited to many supervised learning tasks, because of how widely applicable the algorithm is, as well as for its high fault tolerance. Specifically, SVMs have built-in overfitting protection, making them especially suited for handling high-dimensional input spaces (such as the vector space feature sets discussed in Section 2.4) [100]. They are also well-suited to handling extremely sparse data sets (e.g. some of the N-gram models discussed in Section 2.4.1) [92].

SVMs can be trained on discrete (Support Vector Classification (SVC)) or continuous (SVR) labels. The specifics of SVC are discussed below.

Support Vector Classification

Support vector machines for binary classification attempt to construct a hyperplane that maximises the distance from the hyperplane to the nearest data points. A very simple 2-dimensional example is shown in Figure 2.1.

Figure 2.1: Linear SVC of binary class variables. Support vectors are shown as green markers on the margin (dashed line).


Consider a simple binary classification task with training data instances of the form $(x_i, y_i),\ i = 1, ..., m$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$. The support vector classifier attempts to find an $(n-1)$-dimensional hyperplane that can perfectly separate the data points labelled $-1$ from those labelled $1$, as in Figure 2.1, where red circles indicate points labelled $-1$, blue squares indicate points labelled $1$, and the distance from the solid line to a dotted line is the margin (shown as a 2-dimensional tube in Figure 2.1, but an $n$-dimensional tube in the general case). The margin lines (i.e. the dotted lines) are parallel to the hyperplane (i.e. the solid line) and touch the points nearest to the hyperplane on either side.

Since a hyperplane is a set of points satisfying
$$\beta_0 + \beta^T x = 0,$$
our aim is to find a hyperplane that separates the training data by finding $\beta_0$ and $\beta$ such that
$$y_i \cdot (\beta_0 + \beta^T x_i) > 0 \tag{2.2}$$
for all $i = 1, ..., m$.

Continuing from Equation 2.2, to maximise the margin $M$, the SVM optimises the values of $\beta_0, \beta_1, ..., \beta_n$ such that
$$\sum_{j=1}^{n} \beta_j^2 = 1 \tag{2.3}$$
and
$$y_i \cdot (\beta_0 + \beta^T x_i) \geq M, \quad \forall i \in 1, ..., m, \tag{2.4}$$
where $M$ is the width of the margin that is to be optimised. Note that $M$ could be made arbitrarily large by scaling $\beta_0$ and $\beta$ if the constraints mentioned above were not enforced. The optimisation problem can be solved using quadratic programming. This procedure finds a hyperplane that maximises the margin, which is then fixed.

This works for linearly separable data sets, but that is not always the case with real-world data. To address this, the strict separability constraint (Equation 2.4) is relaxed to allow some samples to fall within the margin. A set of non-negative slack variables $\{\xi_i, i = 1, ..., m\}$ is introduced (one for each training point), together with a penalty allowance $C$. The optimisation problem above now comes with $m + 1$ additional constraints:
$$\xi_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{m} \xi_i \leq C,$$
which essentially state that the slack variables are non-negative and their sum may not exceed $C$.

This problem is still solvable by quadratic programming. Once we have the corresponding weights, a new sample $x^*$ can be classified using the hyperplane by calculating the sign of
$$f(x^*) = \beta_0 + \beta^T x^*.$$
The magnitude of the value gives an indication of the confidence of the classification (i.e. the further the sample is from the separating hyperplane, the more confident one can be in the classification).

One key feature of the maximal margin approach is that the equation for the hyperplane only depends on the data points that lie directly on or over the margin. These points are known as support vectors and are shown as green data points in Figure 2.1. While training the model on a set of $m$ training data points, the algorithm only needs to make use of the inner products between data points, and not the points themselves, so it can be shown that $f(x)$ can be written as a linear combination of inner products:
$$f(x) = \beta_0 + \beta^T x = \beta_0 + \sum_{i=1}^{m} \alpha_i \langle x, x_i \rangle,$$
where $\alpha_i$ is a coefficient corresponding to the $i$-th training sample. When $x_i$ is not a support vector, $\alpha_i = 0$, so the algorithm only needs to look at the set of support vectors, say $\Omega$, and not all the data points, so the formula can be rewritten as follows:
$$f(x) = \beta_0 + \sum_{i \in \Omega} \alpha_i \langle x, x_i \rangle \tag{2.5}$$

This is a major computational advantage, since significantly fewer computations are necessary during evaluation of a trained model.
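As a brief illustration (a sketch on synthetic data, not the experimental setup of this thesis), scikit-learn's SVC exposes both the support vectors and the signed distance of Equation 2.5:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic, well-separated 2-D point clouds.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors contribute to f(x) in Equation 2.5;
# typically far fewer than the 40 training points.
print(len(clf.support_vectors_))

# The sign gives the predicted class; the magnitude reflects confidence.
print(clf.decision_function([[3.0, 3.0], [0.1, -0.2]]))
```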

The above formulae work for linear decision surfaces; however, some problems might require a non-linear separating boundary (i.e. not a hyperplane). This is tackled with the kernel trick.

The Non-Linear Case

A linear SVM might achieve poor accuracy when trying to classify data that is not almost linearly separable. A separating boundary might, however, be found by transforming the set of $n$ features into, say, $2n$ features $x_1, x_1^2, ..., x_n, x_n^2$ and then constructing a hyperplane in this $2n$-dimensional space.³ This means the non-linear problem in $n$ dimensions can be transformed into a $2n$-dimensional linear problem, which is easier to solve using the maximal margin approach.

³This transformation is a simple example. In general, the data samples just need to be mapped into some higher-dimensional feature space.


The kernel trick involves replacing the inner product in Equation 2.5 with a more general kernel function $K(x_i, x_j)$, which calculates the similarity between two vectors (as the inner product does in the linear case). The kernel trick can also be employed during training, since the quadratic programming task can be reformulated in terms of inner products using its dual form. Some popular kernels are:

• Linear Kernel: $K(x_i, x_j) = \langle x_i, x_j \rangle$;

• Polynomial Kernel: $K(x_i, x_j) = (x_i \cdot x_j)^d$; and

• Radial Basis Function (RBF) Kernel: $K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$.

The linear and RBF kernels are used in this thesis and their respective performance results are shown in Section 5.2. Illustrations of decision surfaces from the d-degree polynomial and radial basis kernels are shown in Figure 2.2.

Figure 2.2: Example d-degree polynomial and radial basis function decision surfaces for SVC.
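These kernel functions are straightforward to express directly; the following is a minimal NumPy sketch, with γ and d chosen arbitrarily for illustration:

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(x_i, x_j) = <x_i, x_j>
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    # K(x_i, x_j) = (x_i . x_j)^d
    return np.dot(xi, xj) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return np.exp(-gamma * np.linalg.norm(xi - xj) ** 2)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))
```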

2.4 N-Gram-Based Approaches

One of the approaches to feature extraction that this thesis investigates is a model based on classic bag-of-words approaches (also called N-gram models). No studies were found that specifically use N-gram-based approaches for quality prediction of internet comments; however, N-grams have been used for other related text classification tasks. Some relevant papers are [32, 118, 163]. N-gram models are widely applicable to a variety of problems, not only in information retrieval. These applications include probabilistic language modelling (where words are predicted using N-grams, often useful in translation) [10], DNA sequencing [171] and improved compression algorithms [76].


The models designed by Lampe and Resnick [97], Wanas et al. [180] and other authors interested in automatic comment filtering used various custom-designed comment features that often incorporate context around the comments and the users that post them. Designing and creating a manual feature set is time-consuming, and the features are often highly subjective, meaning that a user could manipulate the system if they knew which features were being used. It would thus be valuable to be able to extract features in a natural or even automatic way. Also, manual feature construction requires specific domain knowledge, making some of the techniques hard to generalise. Thus, techniques that do not depend on specific domain knowledge are valuable.

This thesis considers alternative approaches to feature construction for representing text data. This section investigates approaches that are taken from common practices in information retrieval [134]. It should be noted that information retrieval techniques are traditionally designed for working with longer texts, but they are applicable to short texts as well, although they typically result in sparser data sets.

In information retrieval, a piece of text is often represented by certain keywords or terms. A set of weights can also be associated with these terms to show their relative importance to the text. It is thus sensible to represent the texts in question as a set of term vectors [150]. This idea of text representation is often called the N-gram model [70], a specific type of vector space model for texts [151]. Traditionally, the term bag-of-words model implies that features represent single words (or unigrams), whereas the term N-gram model generally refers to models where features are N-grams with N > 1. In this thesis, the terms N-gram and N-gram model will be used to refer to models where features are N-grams with N ≥ 1.

N-gram models have been hugely successful in other natural language processing tasks [118, 55], but their performance on short texts such as internet comments, where the representations will likely be very sparse, is not well studied. This thesis investigates the use of N-grams for quality prediction of short texts, which has been shown to have some success in the research literature [32]. The various N-gram model representations used in this thesis are discussed in Chapter 3.
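As a concrete sketch of such a term-vector representation (using scikit-learn's TfidfVectorizer on two invented sentences; the representations actually used in this thesis are detailed in Chapter 3):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the blue bird flew away", "the red bird stayed home"]

# Each text becomes a vector of term weights; ngram_range=(1, 2) uses both
# unigrams and bigrams as terms, yielding a sparse document-term matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.shape)  # (2 documents, number of distinct N-grams)
```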

2.4.1 N-Grams

Joachims showed that the way text data is represented has a strong influence on the accuracy of a classifier [84]. The assumption Joachims makes is that using small sequences of words as features, instead of longer sequences of text, allows for greater generalization, since taking the total order of words into account (as opposed to only the order within each small word sequence) is often not necessary [5]. Also, better models can often be created, since less data is needed for training.


Formally, an N-gram is a series of contiguous objects (letters, words, syllables, or other linguistic units) from a longer piece of sample text. The simplest N-gram representation is the unigram, which only considers one object at a time. More interesting models with higher-order N-grams (e.g. bigrams and trigrams) are also used.

As an example, consider the sentence The blue bird flew away, which is composed of the following word N-grams:

unigrams: The, blue, bird, flew and away;

bigrams: The blue, blue bird, bird flew and flew away; and

trigrams: The blue bird, blue bird flew and bird flew away.

Similarly, the word medal consists of the following character N-grams:

unigrams: m, e, d, a and l;

bigrams: me, ed, da and al; and

trigrams: med, eda and dal.

The general pattern is that a sequence of k words (or characters) will consist of k unigrams, k − 1 bigrams, k − 2 trigrams, etc. This thesis investigates both word and character N-grams.
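A minimal sketch of N-gram extraction that reproduces the examples above:

```python
def ngrams(items, n):
    """Return all contiguous N-grams over a sequence (list of words or a string)."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]

words = "The blue bird flew away".split()
print([" ".join(g) for g in ngrams(words, 2)])
# ['The blue', 'blue bird', 'bird flew', 'flew away']

print(ngrams("medal", 3))
# ['med', 'eda', 'dal']
```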

2.4.2 Skip-grams

An N-gram is often taken to be a contiguous sequence, but other co-occurring sets of objects are also used (e.g. the first and last characters of words, i.e. skip-grams [40]). Character skip-grams are investigated in this thesis as an alternative to the normal character N-gram model. Skip-grams, in the context of the investigations of this thesis, are sequences of letters made from words where certain characters are left out (or skipped). This approach will be referred to as character skip-grams.

As an example, consider the following character skip-grams for the word south:

skip-grams: outh, suth, soth, souh and sout.

The motivation for using character skip-grams in this thesis is that they could help deal with misspelled or obfuscated words being regarded as different entities in the normal character N-gram model. For instance, if someone were to write loser as lo$er, the character skip-gram model would pick up on the fact that the words are similar.
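A minimal sketch of this delete-one-character construction, reproducing the examples above:

```python
def char_skip_grams(word):
    """All variants of `word` with exactly one character skipped."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]

print(char_skip_grams("south"))
# ['outh', 'suth', 'soth', 'souh', 'sout']

# Obfuscated spellings share skip-grams with the original word:
print(set(char_skip_grams("loser")) & set(char_skip_grams("lo$er")))
# {'loer'}
```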
