
Faculty of Economics and Business

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice of how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference and use the first name and et al. and the year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics


Sarcasm Detection in Reddit Comments

by

Štěpán Svoboda

11762616

Masters in Econometrics, Track: Big Data Business Analytics

Supervisor: L. S. Stephan MPhil
Second reader: prof. dr. C.G.H. Diks


This document is written by student Štěpán Svoboda, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

Faculty of Economics and Business, Amsterdam School of Economics

Masters in Econometrics, by Štěpán Svoboda

This thesis creates a new sarcasm detection model which leverages existing research in the sarcasm detection field and applies it to a novel data set from an online commenting platform. It takes several different approaches to extract information from the comments and their context (author's history and parent comment) and identifies the most promising ones. The main contribution is the identification of the most promising features on a large and novel data set, which should make the findings robust to noise in the data. The achieved accuracy was 69.5%, with the model being slightly better at detecting non-sarcastic than sarcastic comments. The best features, given the hardware limitations, were found to be the PoS-based and lexical-based ones.


I would like to thank my supervisor, Sanna Stephan, for all the help and feedback she provided me during the writing of my thesis. I’m also grateful to my friends Jan, Samuel and Radim for all their support and help during my studies and work on the thesis.


Contents

Statement of Originality
Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Glossary

1 Introduction
2 Sentiment Analysis and Sarcasm Detection
  2.1 Sentiment Analysis
    2.1.1 Polarity classification
    2.1.2 Beyond polarity classification
  2.2 Sarcasm Detection
    2.2.1 Survey of sarcasm detection
    2.2.2 Research papers
3 Data description
  3.1 Sarcasm detection data set
    3.1.1 Authors' history
4 Methodology
  4.1 Pre-processing & Feature engineering
    4.1.1 Bag-of-Words
    4.1.2 Sentiment-based features
    4.1.3 Lexical-based features
    4.1.4 PoS-based features
    4.1.5 Word similarity-based features
    4.1.6 User embeddings
  4.2 Neural network
    4.2.1 Feed-forward Neural network
    4.2.2 Error Backpropagation
  4.3 Evaluation
5 Empirical part
  5.1 Setup
  5.2 Results
    5.2.1 Discussion
6 Conclusion
A Truncated SVD
B Neural net optimization
  B.1 Neural net architecture
C Used libraries


List of Figures

4.1 A basic neural net (Bishop, 2006)


List of Tables

5.1 Confusion matrix of the main model
5.2 Summary of the performance of individual models
5.3 Confusion matrix of the model using as features: BoW and context
5.4 Confusion matrix of the model using as features: BoW, context and PoS-based features
5.5 Confusion matrix of the model using as features: BoW, context and similarity-based features
5.6 Confusion matrix of the model using as features: BoW, context and user similarity-based features
5.7 Confusion matrix of the model using as features: BoW, context and sentiment-based features
5.8 Confusion matrix of the model using as features: BoW, context and lexical-based features
C.1 Libraries used in this thesis


Abbreviations

BoW Bag-of-Words
CM Confusion Matrix
ML Machine Learning
NLP Natural Language Processing
PCA Principal Component Analysis
PoS Part-of-Speech
TF-IDF Term Frequency-Inverse Document Frequency
SGD Stochastic Gradient Descent
SVD Singular Value Decomposition
w.r.t. with respect to


Glossary

k-nearest neighbor A technique which places all observed data points in an n-dimensional space, where n is the number of features, and finds the k closest points based on a prespecified distance measure, e.g. Euclidean distance.

Amazon's Mechanical Turk An internet marketplace for tasks which computers are currently unable to do. Human users can choose and execute these tasks for money, thereby providing the input for the subsequent ML process. A typical ML example is acquiring a set of predefined labels for a data set, which can later be used to perform a supervised learning task.

bigram An augmentation of the preprocessing of the BoW: instead of single words, collocations of two words occupy the columns. The example sentence 'I went shopping' has three unigrams ('I', 'went' & 'shopping') and 2 bigrams ('I went', 'went shopping').

BoW This technique is applied directly to all the preprocessed documents (comments) and creates a matrix with all the words represented in the columns and all the documents represented by the rows. Each field of this matrix then holds the number of occurrences of a specific word in the specific document.

deep learning A type of ML based on Neural Nets aimed at learning data representations such as shapes and objects in images. 'Deep' in the name refers to the large number of layers.

determiner 'A modifying word that determines the kind of reference a noun or noun group has, e.g. a, the, every' (Oxford Dictionary). An example are the articles in the sentence 'The girl is a student', where both of them are determiners which can specify the definiteness, i.e. do we know the girl?, and the number of the appropriate nouns, e.g. an indefinite article is not compatible with a plural.

feature The ML equivalent of an explanatory variable. Non-numerical features can be recoded to a numerical form.

feature set A number of distinctive (yet related) features used together as one set.

hyperparameter In ML, a hyperparameter is a parameter whose value is not learned but is set before the learning begins.

lexical clue A specific type of expression in writing, akin to voice inflection and gesturing in speech, e.g. emoticons, capital letters and more than one punctuation mark in a row.

online learning This method allows a model to be trained on data that come in sequentially. It is commonly used when the data set is too large to handle at once and the computation is divided into many parts to make it feasible. It is called online learning since it can handle real-time data.

regularized regression OLS augmented by an additional term in the loss function; with an $l_2$ penalty it is called Ridge regression. The rewritten loss function takes the following form: $\sum_{i=1}^{n}\big(y_i - \sum_{j=1}^{k} x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^{k}\beta_j^2$. Alternative versions of regularization exist, such as the $l_1$ penalty, or LASSO regression, which takes the form $\sum_{i=1}^{n}\big(y_i - \sum_{j=1}^{k} x_{ij}\beta_j\big)^2 + \lambda\sum_{j=1}^{k}|\beta_j|$.

test set The data set on which the generalization properties of the validated model are evaluated. The generalization property is the ability of the model to perform well on previously unseen data.

unigram A single word in a column of a BoW.

validation set A data set which provides evaluation of the generalization properties of the model and allows hyperparameters to be tuned.


verb morphology The specific form of a verb, e.g. passive tense and infinitive.

word embeddings A process in which a word or a phrase is mapped to a vector of real numbers. It is a general term that can be associated with different methods; one of them is described briefly in Section 4.1.5.


1 Introduction

The analysis of unstructured data is a complex and intriguing problem. Unstructured data are usually composed mainly of text, but other information, such as dates and numeric values, can be present. Unlike structured data, they lack a rigid and uniform structure, i.e. rows being the instances of the phenomenon and columns being the properties of the individual instances. Structured data are commonly saved in the .csv format, since that format was created to store exactly this kind of data. An example of structured data might be a list of countries (instances) with their GDP, area and population represented in the columns. The final table is then a rectangle, which is why structured data are sometimes also called 'rectangular data'.

The lack of structure in unstructured data makes analysis difficult since the data are usually highly irregular. An example of unstructured data might be the online communication of employees, e.g. emails or an internal instant messaging platform. Many specialized techniques are required to retrieve the desired information.

The sheer volume of these data is gigantic. It was estimated in 1998 that somewhere between 80-90% of all business data are in unstructured form (Grimes), covering anything from an invoice or internal memo to a log of a machine's performance. This is only a rule of thumb and no quantitative research has been done, but it is still a somewhat accepted ratio and is often cited. This is an incredible amount of information that could be leveraged to improve virtually every aspect of business operations. The ratio is also supported by the Computer World magazine's claim that more than 70-80% of data are in unstructured form (Chakraborty & Pagolu).

One type of unstructured data analysis is sentiment analysis, which is essentially an automated way to extract opinions from the text at hand (Oxford Dictionary). The most common example of an extracted opinion is the polarity of a statement, i.e. positive/negative. A more detailed description follows in the next chapter. This type of analysis suffers from many drawbacks, which are enumerated and discussed in this thesis. This thesis focuses on one problem specifically: the presence of sarcasm in the available textual information. Sarcasm can easily mislead a sentiment analysis model since, taken literally, sarcastic statements have a meaning different from the one intended. An example is the statement 'It's great to be at work when there's such nice weather!'. This statement can at first seem an honest, and positive, remark since it does not include any typical signs of a negative comment, but with some background knowledge the hidden meaning, sarcasm, can easily be seen.

This thesis tries to build on and extend the current sarcasm detection literature and, by combining the best existing features, to create a working sarcasm detection model. In the process the distinct features are also compared against each other and the most beneficial ones are identified; this comparison of the features is one of the main extensions of the existing literature achieved by this thesis. The goal was pursued in two stages: in the first, data pre-processing and feature engineering took place, and in the second, the resulting features were used as input for a neural net, i.e. in this case a classification algorithm. In the first step several pre-processing approaches were used to obtain various types of features, and in the second step their actual benefit was evaluated. The benefit brought by the specific feature sets is also weighed against their computational burden and the overall usefulness is evaluated.

The difference between models in ML and standard econometrics should be highlighted since this thesis focuses exclusively on the former. In ML we cannot really derive the true underlying process and model it as is common in econometrics, and thus we do not know the exact rules that govern the model's decision process. This lack of interpretability is a trade-off for the overall better predictive power of these 'black-box' models.


The thesis is organized as follows. In Chapter 2 sentiment analysis is described as a whole, with special care given to sarcasm detection and the related literature. In Chapter 3 the used data are described, and in Chapter 4 both pre-processing and feature engineering are discussed at length alongside a description of the neural net. Afterwards the results are presented in Chapter 5 and the thesis ends with concluding remarks in Chapter 6.


2 Sentiment Analysis and Sarcasm Detection

Sentiment analysis is defined by the Oxford Dictionary as 'the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine the writer's attitude towards a particular topic or product'. This discipline, based mainly on Natural Language Processing (NLP) and text mining, is widely used in many fields, from marketing and spam detection (Peng & Zhong, 2014) to financial markets forecasting (Xing et al., 2018). The specific applications and how sentiment analysis is used in these fields are discussed in the following sections. This chapter is divided into two sections: first sentiment analysis is introduced in general, and afterwards sarcasm detection is discussed in detail.

Sarcasm detection is indeed an issue that typically prevents a straightforward sentiment analysis. The problem sarcastic or ironic (here used interchangeably) comments pose is that they often express a different polarity than they state literally. An example might be the sentence 'Great job!', which can be meant both sincerely and sarcastically. In both cases the sentiment seems positive at first sight, but the true meaning differs and that can throw the analysis off.

The other issues are not discussed in depth here and are only briefly mentioned. Different types of opinions, i.e. regular vs comparative and explicit vs implicit opinions, are an issue since detecting comparisons or implicit opinions is complicated. Subjectivity detection is also an issue since many applications of sentiment analysis want to discern between statements without emotional undertone and those with some emotional charge. The last problem sentiment analysis faces is the point of view, which can be either the author's or the reader's.

These problems are quite intuitive but some examples are given here nonetheless (Liu, 2012). A regular opinion is stating that Coke is good, while a comparative one is that Coke is better than Pepsi. Explicit is saying the iPhone has bad battery life, while implicit is saying the iPhone has shorter battery life than a Samsung. A subjective opinion is one expressing not a fact but a personal view, i.e. I like Apple products. Lastly the point of view matters as well. For example the news of a stock hike is good for people who own the stock but bad for people who shorted it.

The ways in which these problems can mislead the analysis are various. The comparative opinion requires the researcher to determine whether Coke is better than Pepsi or the other way around; it must be established what the relationship between the two compared things is. Implicit opinion suffers from the same problem: the structure of the sentence must be used to automatically determine which of the two things is compared to which. If this is not done the sentence can be understood incorrectly. Subjective and objective opinions depend a lot on specific use cases, but differentiating between emotionless and emotional statements can be important in removing potential noise which adds very little to the overall model. The point of view issue is similar to the problems implicit and comparative opinions pose. The entity which expressed the opinion must be determined for us to know whether there is some hidden agenda, e.g. an analyst claiming a stock has bad fundamentals might have a motivation to damage the stock since his firm could hold a short position in it.

2.1 Sentiment Analysis

Sentiment analysis (sometimes also known as opinion mining) aims to determine the position of the speaker/writer towards a specific issue. This can be done with respect to the overall polarity of the analyzed document/paragraph/sentence/... or the underlying emotional response to the analyzed event. There are many slightly different names for slightly different tasks within this broad category of sentiment analysis or opinion mining, such as opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion mining and review mining (Liu, 2012). Opinion extraction and sentiment mining are the same and focus only on identifying the existing sentiment/opinion in the given document. Subjectivity analysis is concerned with discerning between subjective and objective statements. The possible usage differs per the needs of the specific researcher, but a common usage is removing the objective sentences before determining polarity to improve the model's ability to differentiate between positive and negative statements. Affect analysis and emotion mining are the same since they aim to detect the expressed emotional state in the text. Review mining describes all types of sentiment analysis but used only on one specific type of text, i.e. reviews.

In sentiment analysis both the position of the writer and the specific issue addressed are of interest since different types of sentiment analysis deal with different problems arising from the different specifications (Liu, 2012). The position can be expressed in two ways: either we can describe the stance of a person towards the issue, i.e. positive or negative, or we can try to describe the emotional response elicited, i.e. sadness, anger, happiness etc. Beyond the emotional response, the subjectivity and aspect-based measures of the statement/document can be looked at. Aspect-based measures attempt to find expressed sentiment towards specific entities. A document can be negative toward one entity and at the same time positive toward another, i.e. a review can be critical of a museum and give an example of a good museum.

Subjectivity analysis aims to find subjective parts of the document in order to differentiate between different types of expressed opinion. Another option is to down-weight or remove the objective statements to achieve greater accuracy in simple polarity analysis. The aspect-based sentiment analysis aims to identify the sentiment expressed towards a specific entity mentioned in the text, which requires determining not only the sentiment but also the entities in the text (Liu, 2012). By looking at the opinions at finer granularity the analysis can be more precise and deliver more information.

Subjectivity analysis and the aspect-based approach are not discussed here in depth since they are not of main interest in this thesis; only applications of the polarity (positive/negative) of opinion or advanced sentiment classification are mentioned later on.


2.1.1 Polarity classification

A few examples of possible applications of the standard sentiment analysis, i.e. polarity classification, are given in this chapter. The examples include mainly data from social networks, and the tasks using sentiment analysis are varied. They range from marketing, e.g. identifying brand sentiment, counting the number of positive and negative reactions, and using polarity as an input for recommendation systems, e.g. Twitter, to financial forecasting. A practical example is Starbucks, which uses sentiment analysis to identify complaints on Twitter and answer all of these negative comments.

The first application discussed is detecting spam or fake reviews. Here the term spam reviews is used following the terminology of Peng & Zhong (2014). Users giving false positive or negative reviews to either boost or ruin the score of a product/store is not an uncommon issue. Spotting a spam review and removing it can greatly improve the trustworthiness of a rating site and improve the consumer experience, as Peng & Zhong (2014) discuss. They managed to improve the detection of spam reviews by employing a three-step algorithm. The first step was computing the sentiment of the reviews. This was followed by a second step where a set of discriminative rules was applied to detect unexpected patterns in the text. The third and final step was the creation of a time series based on the reviews and sentiment scores to detect any sudden anomalies.

Another possible application is using sentiment analysis as an input for a recommendation system to improve the recommendation precision. Yang et al. (2013) use a recommendation system based on two main approaches: one uses location-based data and the other uses sentiment analysis data to make recommendations. They use social networks with location data, and by using this information in tandem with the information from sentiment analysis, i.e. the sentiment of posts on these social media, they manage to create and successfully apply a novel recommendation engine. Another example worth mentioning is the paper of Gurini et al. (2013), where they explore the possibility of better friend recommendation systems. They took the standard content-based similarity measure, cosine similarity, and augmented it by plugging in the Sentiment-Volume-Objective (SVO) function. SVO is based on the expressed sentiment, the volume and the subjectivity of the reactions toward a concept by a specific user. The contribution of the sentiment analysis here is the augmentation of the measure used to recommend the specific users. Thanks to this augmentation the recommendation engine takes into account more information and knows how similar the attitudes of two persons are. By using not only the shared interests but also the sentiment analysis of the users' content they managed to create an original model.

In a newer paper Zimbra et al. (2016) decided to use neural nets and Twitter data to assign a brand a sentiment class in a three- or five-class sentiment system. Here sentiment analysis is used to find out the public perception of a certain brand, which can be very useful for marketing practitioners. They manually labeled tweets mentioning Starbucks and used those in their analysis.

Poecze et al. (2018) studied the effectiveness of social media posts based on the metrics provided by the platform and on sentiment analysis of the reactions. The effectiveness was measured by the number of different reactions, e.g. 'shares' and 'likes'. They found that sentiment analysis proved an invaluable complement in evaluating the effectiveness of certain posts since it was able to look past the standard metrics and determine why some forms of communication were less popular with the consumer base. The underlying positive/negative reaction of the base is better measured by sentiment analysis than by the number of views and reactions alone.

Ortigosa et al. (2014) concentrate solely on Facebook, determining the polarity of the users' opinions and the changes in their mood based on the changes of their comments' polarity. By combining a lexical-based approach and a machine learning-based one they managed to create a tool that first determines the polarity of a submission and then, by computing the user's standard sentiment polarity, detects significant emotional changes.

Lastly, the survey from Xing et al. (2018) focuses on financial forecasting. Natural language based financial forecasting (NLFF) is a strand of research focused on enhancing the quality of financial forecasting by including new explanatory variables based on NLP, with sentiment analysis having a prominent place in this area. Among the analyzed documents are corporate disclosures such as quarterly/annual reports, professional periodicals such as the Financial Times, aggregated news from sources like Yahoo Finance, message board activity and social media posts. The texts from these sources can then be used for different types of sentiment analysis like subjectivity and polarity analysis. This can be useful for many parties, an obvious example being any trader or hedge fund. Any information signalling a possible price movement of a stock is extremely valuable and can be used in the overall valuation.

2.1.2 Beyond polarity classification

Mohammad (2016) discusses sentiment analysis quite holistically, from standard polarity and valence of the shown sentiment to emotions. The problems of this strand of research are discussed alongside some of the underlying theory from psychology. One of the main issues is finding labeled data, since in emotion analysis we don't have just a binary problem, e.g. positive/negative or sarcastic/not sarcastic. The existing work is summarized in this paper, which makes it a good starting point for anyone interested in this specific area of sentiment analysis.

One of the solutions to the problem of missing labels in data sets can be the emotion lexicon presented in Staiano & Guerini (2014). It is an earlier work but it can serve as a good starting point for other emotion lexicons. The authors created a lexicon containing 37 thousand terms and a corresponding emotion score for each of the terms. This can be an invaluable source for anyone wanting to apply a sentiment analysis that goes beyond simple polarity.

2.2 Sarcasm Detection

Sarcasm detection has become a topic in sentiment analysis due to the problems sarcasm causes. A statement imparting negative/positive sentiment while seemingly having the opposite polarity can quite quickly lower the accuracy of sentiment analysis models (Liu, 2012). This gave rise to a specialized field of research – sarcasm detection. Most of the papers focused on this problem are quite new (less than 10 years old), and the progress from the first attempts using only word lexicons and lexical cues to the current literature using context, word embeddings and other neural net-based approaches is significant. This section is divided into two parts: in the first a survey paper is presented as a possible introduction to this field of research; in the second actual research papers are discussed and the evolution of this area of research is presented.


2.2.1 Survey of sarcasm detection

The most notable paper that concentrates on this topic is from Joshi et al. (2017). The setup, issues, different kinds of data sets, approaches, their performances and the trends present in the sarcasm detection literature are all discussed there. The problems discussed in that paper are mentioned here but are not analyzed in depth. In the following section some papers dealing with sarcasm detection are discussed and, when relevant, the issues are highlighted.

The linguistic background is not discussed here; it can be found with proper references in the Joshi et al. (2017) paper, alongside the overall problem definition.

The different types of available data are divided based on text length – short, long, transcripts/dialogue and miscellaneous or mixed, i.e. those that do not fit into the previous categories. The different approaches to the problem of sarcasm detection (illustrated in the choice of papers in the next section) can be divided into rule-based and feature-set-based. A rule-based system is typically a set of IF–THEN statements, such as an IF 'red' THEN 'stop' rule for a car. These rules can be much more complicated and possibly nested as well. The feature-set-based approach also comprises the usage of different algorithms, i.e. standard ML and deep learning ones. The trends are also described in the next section, since the choice of papers is organized along the time axis and nicely imparts the evolution of the field.

Then the issues with the process of sarcasm classification are discussed. The prominent issue is data annotation. Several options are discussed in the next chapter, but the main distinction is between self-labeled data and data annotated by a third party. The quality of these labels is also in question, since humans have trouble discerning between sarcastic and non-sarcastic utterances and no labels from a third party can be perfect. With self-labeling the data quality problem lies in selecting users that use the appropriate hashtag or a similar method to announce the use of sarcasm. This is coupled with cases where the hashtag is used either by accident or in a different meaning than intended, e.g. talking about the usage of #sarcasm. Another issue is the inherent skewness in the data, since most comments and tweets are not sarcastic and some measures need to be taken for the data set to be balanced.


Some local/temporal specifics of lexical clues are important as well. These clues are not constant across time or across different cultures/countries. The usage of emojis and some abbreviations, e.g. LOL, is a somewhat newer development and is not spread evenly over the globe. Background knowledge is required for constructing this type of feature set.

2.2.2 Research papers

In the following paragraphs the progress in the field is illustrated and the changes in the overall approach to classifying sarcasm are shown. Some of the changes are linked to the growth of computational power, e.g. the possibility of using neural network-based methods.

One of the early papers focused on identifying sarcasm is the Carvalho et al. (2009) paper that attempted to find irony in user-generated content by looking for lexical clues, specifically detecting irony in sentences containing normally positive words by looking for oral and gestural clues. Examples of these clues are, among others, emoticons, laughter expressions (haha, LOL, ...) and the number of punctuation marks in a row. They also found that some more complex linguistic information was quite inefficient in detecting irony; this more complex linguistic information comprised constructions such as diminutives, determiners and verb morphology. Since this research was done mainly on Portuguese text, some of the chosen features were specific to Portuguese and are not transferable to other languages. The evaluation of their work was done by hand: some comments were chosen at random and manually scored, with this score then being compared to the predicted value. Their contribution is showing that the overall sentiment analysis can be easily improved by using even simple lexical rules to identify some sarcastic comments.

Davidov et al. (2010), just like Carvalho et al., use manually labeled data. They use Amazon's Mechanical Turk to label a large collection of tweets and of Amazon reviews. This gives rise to the main data set, which is then divided into a training and a test set. This is the common and correct approach and later on in this section it is assumed as standard; only deviations from this approach are highlighted. Each instance is rated by three independent evaluators, which guarantees a high level of reliability. Their algorithm is based on expressing each observation in the training set as a vector based on the extracted features and then using the k-nearest neighbor technique to match the testing observations to their most similar vectors in the training set based on Euclidean distance.


Majority voting was then used to classify the testing observation. Their features included a separation of the sentences into different patterns identifying so-called high-frequency words and content words. They have pre-existing patterns and look at how similar the review/tweet is to their patterns, and each observation gets a score between 0 and 1 for each pattern. These patterns are then combined with some punctuation-based features.

Newer papers include Riloff et al. (2013), which focuses exclusively on Twitter data. Their data are once again labeled by human annotators and are chosen with special emphasis on choosing tweets that are sarcastic on their own and not as a reply. By doing this the context becomes insignificant and one of the common problems in sarcasm detection is solved. They used several feature sets: various existing sentiment lexicons, support vector machines on BoW (created with unigrams and unigrams & bigrams) and their own sentiment lexicon created by bootstrapping the key phrases. They have a very specific case of sarcasm in mind, one where a positive sentiment is followed by a negative sentiment with this incongruity creating the irony, e.g. 'I love being ignored'. They use this assumption to learn positive and negative phrases (using many iterations), which then serve as their lexicon used to identify the positive and negative sentiment present. If both are found in tandem, that is taken as evidence of sarcasm being present. The overall model is more complex, but this description is supposed to give only a very rough overview of the idea.

Wallace et al. (2014) is the first paper mentioned here that talks about the necessity of using context for detecting sarcasm. It claims that since humans usually need context to detect sarcasm, machines should as well. They reach this conclusion on a Reddit data set (Reddit is a social media site described in Chapter 3) which is scored by humans, recording during the scoring whether the annotator asked for additional context. They used a basic logistic regression with one explanatory dummy variable (asking for context) to show that in the case of an ironic comment the annotators are more likely to ask for context. Their claim was substantiated also by another model, a second logistic regression. They showed that their baseline model makes mistakes more often in the cases where the annotators asked for context, which again points to the intuitive conclusion that both machines and humans need context for sarcasm detection.

Bamman & Smith (2015) also try to detect sarcasm by leveraging contextual information. Another important feature of this paper is the data they are using. They use tweets and they started using self-labeled data, i.e. #sarcasm, #irony, etc., while taking measures to exclude retweets and other submissions that are not appropriate for their study. They used many standard approaches to create features such as unigrams and bigrams (BoW), part-of-speech tags, capitalization, punctuation, tweet sentiment and tweet word sentiment within one tweet. They also extracted features regarding the authors, such as their profile information and historical sentiment, and audience features such as the communication between author and addressee and their shared interests. They found that the combination of the information about the author and that contained in the tweet was nearly on par with the inclusion of all features at hand. They used a basic regularized regression and managed to reach an accuracy of 85%.

A similar avenue was studied by Khattri et al. (2015) who, just like Bamman & Smith (2015), used the author's history, but they focused on sentiment contrast rather than many different text-based features. They implement this idea by creating two models and then synthesizing them. The first finds the overwhelming historical sentiment towards a specific entity, and the second looks at what sentiment is shown towards the entity in the current tweet. This model using context is then combined with the standard sentiment incongruity within one tweet. They develop specific rules for combining these models and report a very good overall performance. The performance is based on standard classification measures described in Section 4.3. They also acknowledge some shortcomings of their approach, such as the assumption that the author's history contains the true sentiment and some accounts being deactivated, renamed or otherwise inaccessible. This accounted for approximately 10% of all authors.

Another paper worth mentioning is from Wallace et al. (2015), which is a follow-up of the paper Wallace et al. (2014) where it was shown that machines need context for detecting sarcasm just like humans. In the newer paper a proper model implementing the context as a feature is presented. The data set used was presented in the older paper and here it is only reused. As their features they decided to leverage the detected sentiment alongside the specific subreddit, which is a specific part of Reddit and is defined in the beginning of Chapter 3, and a so-called 'bag-of-NNP'. This bag is constructed in the same way as the standard BoW but only noun phrases (NNP), not unigrams, are taken into account from each comment. Then by creating an interaction feature set between the NNP features and subreddits they acquire their fourth feature type. To properly leverage these features they use a specific regularization combining both an l1 and an l2 penalty (explained in regularized regression), each levied on a specific subset of features.

Ghosh et al. (2015) focused on a somewhat different task: discerning between a literal and a sarcastic usage of a word. They used Twitter data and then employed Amazon's Mechanical Turk to crowdsource the labels on specific phrases. The tweets themselves were self-labeled, i.e. #sarcasm, #sarcastic, ..., and the phrases in the tweets that were meant sarcastically were labeled by the Turkers, i.e. people who perform tasks at Amazon's Mechanical Turk. The Turkers also came up with an alternative way to say the same message but without sarcasm. This gave rise to a lexicon of phrases with similar meaning to the original sarcastic utterance. These phrases were then transformed by word embeddings and used to classify the sarcastic tweets.

Joshi et al. (2016) also worked with word embeddings, but in a different fashion than Ghosh et al. (2015). Instead of using the embeddings themselves as features, they use them to find the word similarity between respective words within comments; however, they do not use contextual information in this paper. Only the features discussed previously (lexical, BoW-based, sentiment incongruity, punctuation) are used, with the addition of finding similarity between different words using their word embeddings. Several different pre-trained instances of word embeddings were used and the overall result was that these features bring significant additional value.

Amir et al. (2016) followed in the direction of Joshi et al. (2016) and used word embeddings as well. Their innovation was training user embeddings which contain context and the author's history. Their overall model is then a neural net with convolutional layers which processes both user embeddings and content features. They split these features into four groups – tweet features, author features, audience features and response features. They used an existing data set from the paper by Bamman & Smith (2015) and the used features were also the same, with the exception of the user embeddings. The proposed model seems to outperform the original model from Bamman & Smith (2015).


3 Data description

In this chapter the used data are presented. Two different data sets were used: one is a freely available data set created for sarcasm detection tasks, and the other is a data set of authors' history downloaded from Reddit for this thesis.

Reddit is the self-proclaimed front page of the internet. It is a social-media news aggregation website where each user can post in each Reddit forum (called a subreddit) a comment, link or video, and others comment and react to it in a typical hierarchical manner. Subreddits range from topics like politics and religion to the NBA (the highest US basketball league), the NFL (the highest US American football league) and all the way to country-specific subreddits like the Netherlands.

Both data sets are based on Reddit comments which were all in English. Almost all of Reddit is in English and the prepared data set was curated to include only English comments.

3.1 Sarcasm detection data set

The main data set used in this thesis is the one prepared by Khodak et al. (2017). The data are already split into a training and a testing part by the authors, and on top of that there are two versions – a balanced and an unbalanced one¹. In this thesis the balanced version is used due to the aim of the thesis and hardware limitations.

¹ A balanced data set is a data set where all classes are equally represented, while an unbalanced one has a higher prevalence of some classes than others.


The balanced version contains more than a million comments, which already presents a substantial computational burden.

The corpus consists of several different variables besides the label: the comment itself, the parent comment, the subreddit where the comment was published, the author, the publish date and the number of likes and dislikes. Naturally the comment itself and the context (parent comment and subreddit) hold the most information and are expected to bring a lot of explanatory power.

The data set is presented and described at quite some length in the original article; here only the most important aspects are highlighted.

The most important detail that must be mentioned is the way the labels are acquired. The sarcastic labels are added by Reddit commentators themselves by adding /s at the end of their comments. The labels are put in during the actual writing of the comment; they are not added at a later date.

Several filters were used to exclude noisy comments. These include filtering comments which are just URLs and special handling of some comments due to properties of Reddit conversations, such as replies to sarcastic comments, since these are usually very noisy according to Khodak et al. (2017).

The raw data set contains 533 million comments, of which 1.3 million are sarcastic, from January 2009 to April 2017. This set is controlled to include only authors who know the Reddit sarcasm notation, /s, by excluding people who have not used the standard notation in the month prior to their comment. These are excluded to minimize the noise from authors that do not know this notation. This sarcasm notation used on Reddit is known as self-labeling, which means only the author has any influence on the final label of the comment. Khodak et al. (2017) devote a large part of the paper to showing that the noise is minimized and the data are thus appropriate for analysis. They study the proportion of the sarcastic and non-sarcastic comments, the rates of false positives and negatives and the overall quality of the corpus for NLP tasks.

For a comment to be eligible for the data set it must pass several quality barriers, i.e.:

• the author is familiar with the notation, meaning they used the sarcastic label in the month prior to the comment,

• the comment is not a descendant, direct or indirect, of a comment labeled as sarcastic, due to noise in the labeling following a sarcastic comment,

• due to the sarcasm prevalence in the overall data set being 0.25%, only very few comments can be marked as sarcastic and actually be a reply to a sarcastic comment which was not labeled as one.

The noise from classifying sarcastic comments which were answers to non-labeled sarcastic comments exists, but on the overall scale it is minimal. The impact on the overall performance is negligible, but it is important to mention its existence. We are dealing with real-world data and obtaining proper labeling for every instance is simply impossible; this is as good as it gets.

The authors also performed a manual check on a small subset of the data and arrived at the conclusion that the rate of false positives is about 1% while the false negative rate is about 2%. This is an issue in the unbalanced setting, as the sarcasm prevalence is about 0.25%, but in the balanced setting of the data set this noise is quite limited. Overall the manual checks are described at length in the original paper and Khodak et al. (2017) state the filters are working reasonably well, as shown by the high overlap with the manual check. The manual label was created by a majority voting scheme between human annotators.

Khodak et al. (2017) also explain why Reddit is a better data source for sarcasm detection than most other commonly used sources. It holds the same advantage as Twitter in having self-labeled data, while being written in non-abbreviated English, unlike Twitter. Thanks to Reddit being organized into subreddits with a clear comment structure and context, it is much easier to recover the necessary context features. According to the authors these are the main advantages which make the data set more realistic and better suited for this type of research.

The balanced data set used in this thesis has slightly more than one million comments.

3.1.1 Authors’ history

The idea behind the author’s history is to create a typical behavior pattern of a user. We want to create a benchmark that is relevant in explaining the user’s behavior.


The number of authors present in the above-mentioned data set is quite large, approximately a quarter of a million. A large majority have only one comment and only a small number of people have many comments. The largest number of comments per author is 854. Thus the only way to retrieve an author's history is to download it directly from Reddit.

In order to create a reasonable representation of a user's behavior on Reddit, the 100 newest comments per user were downloaded. These then serve as a benchmark to compare the comment of interest to the "standard" way the user expresses themselves. This kind of information is again targeted at the incongruity within a person's expressions and is based on the original idea of Amir et al. (2016).

The specific process is described in detail in Section 4.1.6. These comments were used to create an equivalent of word embeddings for users: not a word but a user is mapped into the high-dimensional vector space.


4 Methodology

The methodology description is divided into several parts. Firstly the process of feature selection and text processing is explained and motivated. Secondly neural networks are described and explained. Thirdly and lastly the approach to model evaluation is explained and the chosen measures are justified.

4.1 Pre-processing & Feature engineering

Several text processing techniques were used to extract the signal from the available comments. Context is very important for sarcasm detection by humans, and the same applies to machines, as was already shown in the literature (Wallace et al., 2014). Both the original comment and its context (parent comment and author's history) are used to provide additional information for our modeling. The exact extent to which this information is used is described for all distinctive feature sets in the following sections. The used techniques are described in some detail there. Alongside the described features a specific form of context is used – a simple variable specifying the prevalence of sarcasm in the subreddit. The share of sarcastic comments is calculated by dividing the number of sarcastic comments by the overall number of comments, as sketched below.
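As an illustration, the following is a minimal sketch of how such a prevalence feature could be computed with pandas; the column names (subreddit, label) are illustrative assumptions, not the thesis's actual variable names.

```python
import pandas as pd

# Toy stand-in for the labeled corpus: one row per comment,
# 'label' is 1 for sarcastic comments and 0 otherwise.
df = pd.DataFrame({
    "subreddit": ["politics", "politics", "nba", "nba", "nba"],
    "label":     [1,          0,          0,     1,     0],
})

# Share of sarcastic comments per subreddit: sarcastic count / total count.
prevalence = df.groupby("subreddit")["label"].mean()

# Attach the prevalence of each comment's subreddit as a context feature.
df["subreddit_sarcasm_share"] = df["subreddit"].map(prevalence)
print(df)
```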


4.1.1 Bag-of-Words

This technique is applied to the text itself. Firstly all unnecessary words from all documents at hand are removed, and then the remaining words are used to fill a matrix, i.e. the BoW. In this matrix the documents represent the rows and the columns represent the words. Each field in the matrix thus marks how many times a word was used in a document.

This is a standard NLP method that is widely used due to its strength and predictive power.

The pre-processing of the text, i.e. removing all unnecessary words, consists of several steps to achieve a clean text that can be easily transformed into the BoW. This was achieved by an algorithm implemented using two Python libraries, spaCy (Honnibal & Montani, 2017) & NLTK (Bird et al., 2009), and Python's built-in functions. The steps taken to clean the text were:

• all words are converted into lower case letters,

• all words are stemmed – stemming refers to leaving in place only the stem of the word, e.g. fish, fishing, fisher and fished all have the same stem,

• only words are left in place – numbers, punctuation and special characters are removed,

• stop words are removed, i.e. words that are too general and carry no specific meaning, such as the, a, and, is, etc.,

• one- and two-letter words are also removed.

The algorithm described above was implemented from scratch, meaning only core functions, e.g. transformations to lower case letters and stemming, were used in the process. The rest of the algorithm was developed specifically for this thesis.

This leaves a significantly smaller vocabulary in the documents, and the preprocessed text is then used to create the BoW. While the first steps seem reasonable even at first sight, the last step, removing the one- and two-letter words, requires some additional justification. The reason is that many short words can in the end be stemmed to the same letter(s), e.g. the word 'saw' becomes 's'. With the large document collection at hand this can occur often and create substantial noise which would significantly lower the overall accuracy of our sarcasm detection model. This applies to all the steps above and not just the last one.
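To make the steps above concrete, here is a minimal sketch of such a cleaning pipeline using NLTK; the thesis's exact implementation is not reproduced, and the Porter stemmer and NLTK's English stop-word list are assumptions on my part.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean(comment: str) -> str:
    # Lower-case everything and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", comment.lower())
    # Drop stop words, stem, and drop one- and two-letter results.
    stems = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(s for s in stems if len(s) > 2)

print(clean("The fishers were fishing... GREAT weather for fishing!!!"))
```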

The next step is the normalization of this matrix, known as Term Frequency–Inverse Document Frequency (TF-IDF). The goal of this normalization is to down-weight words that are present in many documents, since that points to the word being a common occurrence and having less predictive power than those present only in a smaller subset of the documents. This step should improve the overall performance of our sarcasm detection model.

The TF-IDF measure is defined as

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t) \tag{4.1}$$

and depends on $d$, the document, and $t$, the term/word. The first factor, $\text{tf}(t, d)$, is intuitive and gives the number of occurrences of a word in a document, while the second is a bit more complicated. The inverse document frequency is defined as $\text{idf}(t) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$, where $n_d$ is the overall number of documents and $\text{df}(d, t)$ is the number of documents in which the term in question appears. The reason for this transformation is simple: the normalization helps bring out the more meaningful terms, those with more influence. The output is a simple BoW with each field containing not the raw number of occurrences of a word in a document but the normalized equivalent described in Equation 4.1.
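In practice this weighting is rarely implemented by hand; for example, scikit-learn's TfidfVectorizer builds the BoW and applies a closely related smoothed TF-IDF formula (it adds 1 to the idf term) in one step. Whether the thesis used this exact class is an assumption; the sketch below only illustrates the technique.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great job really great",
    "great weather for fishing",
    "love being ignored",
]

# Build the document-term matrix and apply TF-IDF weighting in one step.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```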

The final step is applying Truncated Singular Value Decomposition (SVD), a concept related to PCA, to lower the dimensionality of our matrix. Truncated SVD is briefly described next.

The Truncated SVD simply produces a low-rank approximation of the matrix in question, $X \approx X_k = U_k \Sigma_k V_k^T$, with $U_k \Sigma_k$ being the transformed training set with $k$ components. It is often used due to computational complexity, since it does not calculate the full and exact matrix but only its low-dimensional approximation. The low-dimensional approximation has the same number of rows and only $k$ columns. A more in-depth explanation can be found in the literature, e.g. Bishop (2006), or in Appendix A. The final output is then a matrix with a significantly smaller number of columns and the same number of rows, with each row still representing a specific document.

This transformation is applied to both the comment of interest and the parent comment.
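A hedged sketch of this reduction with scikit-learn's TruncatedSVD follows; the number of components (k = 2 for the toy data) is an illustrative choice, not the value used in the thesis.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great job really great",
    "great weather for fishing",
    "love being ignored",
]
X = TfidfVectorizer().fit_transform(docs)   # documents x terms

# Keep only k = 2 latent components; rows are preserved, columns shrink.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)            # equals U_k * Sigma_k

print(X_reduced.shape)                      # (3, 2): same rows, k columns
```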

4.1.2 Sentiment-based features

The sentiment-based features are extracted from both the parent comment and the comment of interest. The comment's sentiment and subjectivity are determined, alongside the most positive and the most negative word and the standard deviation of the polarity of all the words in the comment. Sentiment is a continuous measure on a scale from −1 to 1, while subjectivity is a continuous measure on a scale from 0 to 1. Negative polarity is associated with negative numbers, while positive polarity is associated with the positive part of the number axis. Objective statements are described by lower numbers, while subjective statements have higher values. The polarity scale expresses how positive/negative a specific word is, e.g. 'good' is not as positive as 'great'; it exists so that we can order words based on their overall positivity/negativity.

The reasoning behind this choice is based largely on the paper from Joshi et al. (2017) and its discussion of sarcasm and sentiment. If the comment includes at first sight an incongruity of sentiment, i.e. one sentence is positive and a second negative, it might be an indication of an unspoken message, i.e. sarcasm. The same applies to the previous comment: if the parent comment is in stark contrast with the comment in question, we might be able to use it to detect sarcasm. The idea behind including the comment's subjectivity is also obvious: it is difficult to express sarcasm without using at least some subjective terms that can carry emotional meaning.

The sentiment retrieval in this thesis was done using the Python library NLTK (Bird et al., 2009) and its pre-trained sentiment models. Each word has a specific sentiment and objectivity assigned, and the pre-trained model is essentially a look-up table from which the desired value is retrieved. This is a state-of-the-art NLP library widely used by many researchers.
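The text does not name the exact NLTK resource; SentiWordNet, which NLTK ships as a per-word look-up of positivity, negativity and objectivity scores, matches the description, so the following sketch assumes it.

```python
import statistics

import nltk
from nltk.corpus import sentiwordnet as swn

for pkg in ("wordnet", "sentiwordnet"):
    nltk.download(pkg, quiet=True)

def word_polarity(word: str) -> float:
    # Look up the first SentiWordNet entry; polarity = pos - neg, in [-1, 1].
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return synsets[0].pos_score() - synsets[0].neg_score()

tokens = ["great", "job", "being", "ignored"]
polarities = [word_polarity(t) for t in tokens]

features = {
    "most_positive": max(polarities),
    "most_negative": min(polarities),
    "polarity_std": statistics.pstdev(polarities),
}
print(features)
```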


4.1.3 Lexical-based features

Another set of features that might be indicative of sarcasm are features based on the textual expression of a user. There are some common ways to express sarcasm in the online world as a substitute for voice inflection, such as the usage of punctuation, capital letters and emojis. All of these options are studied in this thesis.

The number of capital letters divided by the comment's length is used to express how prevalent their usage is throughout the comment; a high value should be indicative of a comment more likely to be sarcastic. The same is done with words in all capitals: again, the higher the value, the higher the likelihood of the comment being sarcastic, based on intuition and the way people express themselves. This reasoning is used in the case of punctuation as well: the number of punctuation marks, the number of exclamation and question marks and the number of ellipsis cases (several dots in a row) are all retrieved and normalized per comment by its length. Lastly the number of emojis normalized by the comment's length is calculated.

These six lexical measures are calculated for both the parent and the original comment. Python's built-in regular expression library, re, was used.
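A minimal sketch of these measures follows; the exact normalization and the definition of an emoji used in the thesis are not specified, so this version normalizes by character length and approximates emojis with one common Unicode block.

```python
import re

def lexical_features(comment: str) -> dict:
    n = max(len(comment), 1)  # guard against empty comments
    return {
        # Share of upper-case letters in the comment.
        "capitals": sum(c.isupper() for c in comment) / n,
        # Words written in all capitals (at least two letters).
        "allcaps_words": len(re.findall(r"\b[A-Z]{2,}\b", comment)) / n,
        "punctuation": len(re.findall(r"[.,;:!?]", comment)) / n,
        "exclam_quest": len(re.findall(r"[!?]", comment)) / n,
        "ellipses": len(re.findall(r"\.{2,}", comment)) / n,
        # Crude emoji count via one Unicode block (an approximation).
        "emojis": len(re.findall(r"[\U0001F300-\U0001FAFF]", comment)) / n,
    }

print(lexical_features("Oh GREAT... ANOTHER Monday!!!"))
```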

4.1.4 PoS-based features

Another approach to processing the text is using the word types present in the comments. PoS tagging gives every word within the comment the appropriate tag, e.g. proper noun, pronoun, adjective, adverb. There are altogether 34 English word types used in the NLTK library (Bird et al., 2009), which was used for this specific task and was already mentioned in the sentiment-based feature engineering section. By extracting all word types from the comment we might be able to better detect sarcasm, since some word types are more likely to carry the sarcastic meaning than others. It is not easy to express sarcasm without adjectives and/or adverbs. By exploiting this knowledge we might be able to uncover some additional signal in the data. This is a novel approach not tried in the previous literature: PoS-based features were only scarcely used, and usually only as auxiliary measures and not as main explanatory variables.


We implemented this approach by creating the standard BoW with the columns representing all the word types. Then we used the entire matrix as an input, to see whether the comment's structure plays a role or not.
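A sketch of this PoS bag-of-words construction with NLTK; the tag subset shown is illustrative, while the real matrix has one column per tag in the full tagset.

    # Sketch of the PoS bag-of-words: one count column per word type.
    # Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data.
    from collections import Counter
    import nltk

    def pos_bow(comment, tagset):
        # Count how often each PoS tag occurs in the comment.
        counts = Counter(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(comment)))
        return [counts.get(tag, 0) for tag in tagset]

    TAGSET = ["NN", "NNP", "PRP", "JJ", "RB", "VB"]  # illustrative subset
    print(pos_bow("This is just SO incredibly great", TAGSET))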

4.1.5 Word similarity-based features

Determining word similarity is an idea dating back decades. Many approaches and techniques have been applied to this problem, but since the seminal paper from Mikolov et al. (2013) and the subsequent research of Mikolov and his team, their approach to word embeddings became the standard. Their algorithm is called word2vec: by creating a version of BoW, the so-called continuous bag-of-words, and training a shallow neural net over it, it managed to capture the underlying semantic meaning much better than all previous attempts. The continuous BoW is based on the idea of context: each row and column represent a word, and a field is filled with the number of times the word from the column appeared in the context of the word from the row. Context here means the immediate surrounding of the word from the row; the precise window size is not predefined and can change based on the problem at hand. In the end each word is represented by a vector in a high-dimensional space. Each vector is the output of the shallow neural net, and the dimensionality is determined by the output parameters of the neural net; the input for the neural net is the continuous BoW. A more detailed explanation can be found in the paper from Mikolov et al. (2013). Similarity in this kind of setting is determined by calculating the cosine distance between two vectors.

The implemented approach used the Python library spaCy (Honnibal & Montani, 2017) and took advantage of its pretrained word vectors. By using a pretrained model it is enough to simply extract the underlying vector of each word and then compare the vectors of interest when necessary.

Joshi et al. (2016) were among the first to apply this approach to sarcasm detection and found that, just like with sentiment, a certain semantic incongruity in the text can be indicative of the presence of sarcasm. Thus we compare all verbs and nouns within a comment against each other and find the highest and lowest similarity within these two groups. This tells us whether the comment is heterogeneous or homogeneous meaning-wise.


Studying the similarities and dissimilarities within a comment can lead to better detection of sarcasm. If a comment compares a person's skill at something to being 'as good as fish at flying', it is not meant sincerely, and the differences in semantic meaning can help detect the sarcastic comment: since 'fish' and 'flying' are dissimilar, the cosine distance would be large.

Lastly, the comment of interest and the parent comment are compared and their similarity is calculated using the cosine distance as well. This is done internally by the spaCy library by averaging the word vectors within each comment and then calculating the cosine similarity of the two resulting vectors.
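A sketch of these similarity features with spaCy, assuming the pretrained 'en_core_web_md' model (the specific pretrained model used may differ).

    # Sketch of the word similarity features using spaCy's pretrained vectors.
    from itertools import combinations
    import spacy

    nlp = spacy.load("en_core_web_md")  # medium English model with word vectors

    def similarity_features(comment, parent):
        doc, parent_doc = nlp(comment), nlp(parent)
        content = [t for t in doc if t.pos_ in ("NOUN", "VERB") and t.has_vector]
        sims = [a.similarity(b) for a, b in combinations(content, 2)]
        return {
            "max_sim": max(sims) if sims else 0.0,    # most similar noun/verb pair
            "min_sim": min(sims) if sims else 0.0,    # most dissimilar noun/verb pair
            "parent_sim": doc.similarity(parent_doc), # averaged-vector similarity
        }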

4.1.6 User embeddings

The last step of the feature engineering process is the creation of a vector representing a specific user. The user history downloaded from Reddit is used in tandem with the pretrained word vectors from the spaCy library mentioned in the previous section. This idea is based on the paper from Amir et al. (2016): it takes the past utterances of a user, i.e. their past comments, and averages the vectors of the words they used. Afterwards this vector that represents the user is compared to the comment being categorized, and their similarity is calculated using, once again, the cosine distance.

This feature is meant to identify potential deviations from the user's typical way of expression. If the user is behaving and writing differently than is common for them, it might be a sign of insincerity. This is simply another attempt to identify the potential incongruity present in the text in case of sarcasm.
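A sketch of this user-embedding comparison under the same assumptions; the helper name is illustrative.

    # Sketch of the user embedding: average the spaCy vectors of a user's
    # past comments and compare the new comment against that average.
    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")

    def user_comment_similarity(history, comment):
        # Cosine similarity between the averaged history vector and the
        # vector of the comment being classified.
        user_vec = np.mean([nlp(c).vector for c in history], axis=0)
        comment_vec = nlp(comment).vector
        denom = np.linalg.norm(user_vec) * np.linalg.norm(comment_vec)
        return float(user_vec @ comment_vec / denom) if denom else 0.0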

4.2 Neural network

Neural networks are widely used models based on a collection of connected units (called neurons) which loosely imitate the way the human brain works. The first relevant paper, McCulloch & Pitts (1943), is nowadays considered the first stepping stone in the creation of neural nets.

A good description of neural networks is from Kriesel (2007): "An artificial neural network is a network of simple elements called artificial neurons, which receive input, change their internal state (activation) according to that input, and produce output depending on the input and activation. The network forms by connecting the output of certain neurons to the input of other neurons forming a directed, weighted graph. The weights as well as the functions that compute the activation can be modified by a process called learning which is governed by a learning rule." This is a very brief and condensed explanation, although not the most intuitive one. A very intuitive explanation for econometricians, of a neural net used for binary classification, can be found in the paper from Mullainathan & Spiess (2017): "... for one standard implementation [of a neural net] in binary prediction, the underlying function class is that of nested logistic regressions: The final prediction is a logistic transformation of a linear combination of variables ("neurons") that are themselves such logistic transformations, creating a layered hierarchy of logit regressions. The complexity of this function class is controlled by the number of layers, the number of neurons per layer, and their connectivity (that is, how many variables from one level enter each logistic regression on the next)." The deep learning approach is often applied to specific types of data, i.e. text, audio and images, where different types of neurons are included alongside those already described. The main idea is to mimic some of the functions of the human brain and to give the net the ability of abstraction, such as recognizing patterns in images or in text; some examples are automatic translation, image classification and image captioning. This approach is not applied in this thesis and is not discussed further.

The choice of the neural net as a classifier was largely based on two reasons. The first is the overall strength of the algorithm: in current applications neural nets are the go-to algorithm and, if properly optimized, commonly achieve the best performance. The second is the ability of the classifier to use online learning.

4.2.1 Feed-forward Neural network

Several types of neural nets exist which are meant specifically to tackle image or speech recognition, text analysis, or standard classification problems such as the one presented in this thesis. The basic neural network used in this thesis is the one described in the previous section, essentially a series of logistic regressions. It does not have to be a series of logistic regressions per se; it can be any series of binary classifiers where each neuron acts as a classifier. In the terminology of econometrics and the context of a series of logistic regressions, the link function does not need to be the logistic function. The more complex neural nets with different architectures are not discussed here at length since they are used to solve different types of problems.

Figure 4.1: A basic neural net (Bishop, 2006)

The specific issues and details regarding how the default parameters of the neural net are set, such as the number of layers and the number of neurons per layer, are discussed in Chapter 5, as this is an empirical rather than a theoretical issue.

For simplicity, we illustrate the procedure with a simplified neural net that contains just 3 layers: inputs, outputs and an intermediate, so-called "hidden" layer. The described architecture is shown in Figure 4.1. As can be seen in Appendix B, much more complicated nets were used for the actual empirical analysis at hand. The mathematical description that follows is also based on this simple net, to make the explanation as simple and as informative as possible. The following example is based largely on the well-known Bishop (2006).

The input x for our neural network is one of the feature sets described in the previous sections, i.e. the result of applying one or several of the aforementioned pre-processing techniques. The underlying model can be written as


y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right) \qquad (4.2)

where the basis functions \phi_j(\mathbf{x}) depend on parameters that are adjusted alongside the coefficients w_j. The neural network uses Equation 4.2 as its underlying idea: each neuron's output is a nonlinear transformation of a linear combination of its inputs, where the weights, or adaptive parameters, are the coefficients of the linear combination.

As per Figure 4.1, M represents the number of hidden neurons and D the number of input variables; the figure serves as the basis for the overall description of the neural network. Firstly, we take M linear combinations of the input variables x_1, \ldots, x_D in the form

a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad (4.3)

where j = 1, \ldots, M corresponds to the neurons in the hidden layer of the neural net and the superscript (1) refers to the first (input) layer. The terms w_{ji}^{(1)} are referred to as weights and w_{j0}^{(1)} as biases; a bias is commonly known in econometrics as an intercept. The weights are randomly initialized; the exact approach is described in Chapter 5 and is taken from the paper by He et al. (2015). The quantity a_j is known as the activation, and each activation is then transformed by a nonlinear activation function h(\cdot), which leads to the final neuron output

z_j = h(a_j). \qquad (4.4)

The idea behind this approach is that the weighted inputs together must pass a certain threshold to be influential, which naturally depends on the specific activation function. All the neurons together then influence the final output, where it is the output threshold that matters: the output neuron with the 'strongest signal' is taken as the prediction for the given observation.


The neurons in the network after the first input layer are known as hidden units, and the quantities z_j correspond to their outputs. The next layer (the output layer in Figure 4.1) then has output

a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad (4.5)

where k = 1, \ldots, K and K represents the number of outputs; in the case of a binary outcome, K = 2. The output could also be just one neuron; this is not a universal choice and preferences in the literature vary. The classification decision is taken based on which of the two output neurons, the one representing the sarcastic or the non-sarcastic class, has the higher final activation value. This equation describes the second layer of the network (the output layer), and the resulting activation is again transformed with a nonlinear activation function, which gives us the set of network outputs y_k. The final activation function naturally depends on the data set at hand and its structure: for regression it is the identity function, while for binary classification it can be the standard logistic sigmoid y_k = \sigma(a_k), where

\sigma(a) = \frac{1}{1 + \exp(-a)}. \qquad (4.6)

If all the enumerated steps are combined, then the final output of the neural net in our example can be written as

y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) \qquad (4.7)

where all the weight and bias parameters are grouped together in the vector \mathbf{w}. This means the neural net is a nonlinear function from the input variables \{x_i\} to the output variables \{y_k\}, determined by the weights in the vector \mathbf{w}. The process described up to this point is also known as forward propagation. The choice of the activation functions is made empirically and is left for Chapter 5, which is focused on the empirical part.
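As an illustration, the following is a minimal PyTorch sketch of the network in Equation 4.7. The hidden-layer size and the ReLU activation are illustrative placeholders for the choices tuned in Chapter 5; the actual architectures used are in Appendix B.

    # Minimal PyTorch sketch of the one-hidden-layer net in Equation 4.7.
    import torch
    import torch.nn as nn

    class FeedForwardNet(nn.Module):
        def __init__(self, n_features, n_hidden=64):  # sizes are illustrative
            super().__init__()
            self.hidden = nn.Linear(n_features, n_hidden)  # weights w^(1), biases w^(1)_j0
            self.output = nn.Linear(n_hidden, 2)           # weights w^(2), biases w^(2)_k0

        def forward(self, x):
            z = torch.relu(self.hidden(x))        # z_j = h(a_j); ReLU as an example h
            return torch.sigmoid(self.output(z))  # y_k = sigma(a_k)

    net = FeedForwardNet(n_features=20)
    y = net(torch.randn(4, 20))  # forward propagation for a batch of 4 observations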


The overall training of the neural net happens in two stages: forward propagation and backpropagation. The forward propagation was already described and gives us the exact value of our error function, defined in Equation 4.8 or 4.9 below. Given the target t_n, the error function is non-convex due to the highly nonlinear nature of the neural net and typically takes the form

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 \qquad (4.8)

in the regression setting, where t_n is the value of the target, i.e. the dependent variable in econometrics, or

E(\mathbf{w}) = - \sum_{n=1}^{N} \{ t_n \log y_n + (1 - t_n) \log(1 - y_n) \} \qquad (4.9)

in the binary classification case, where t_n \in \{0, 1\}.

The relationship between forward propagation and backpropagation is quite straightforward. The forward propagation returns the value of the error function for the current weights and biases; the backpropagation then adjusts these weights to achieve a lower overall error. The specific way this is done is described next.

The non-convexity means there are multiple local extrema, and finding the global extremum is difficult and sometimes not necessary: finding a local extremum that is 'close enough' is often satisfactory. The problem is that we never know what 'close enough' means if we do not know the value of the global minimum. In practice this is handled by iterating over many hyperparameters and searching for the values which lead to the best performance of the neural net on the validation set while controlling for overfitting. We start with an initial vector \mathbf{w} and look for one that minimizes E(\mathbf{w}). The obvious first approach would be an analytical solution, but in the case of our non-convex error function this does not exist and we must look for the optimal vector \mathbf{w} with a numerical approach. When the vector \mathbf{w} is changed to \mathbf{w} + \delta\mathbf{w}, it causes the error to change by \delta E \simeq \delta\mathbf{w}^{T} \nabla E(\mathbf{w}), where \nabla E(\mathbf{w}) represents the gradient of the error function. The optimization of such functions is a common problem and is done in several steps. Usually a starting point \mathbf{w}^{(0)} is chosen and then in each step the weight vector is updated using

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)} \qquad (4.10)

where \tau represents the iteration step. Each algorithm approaches this problem differently and the individual algorithms are not described here at length; they usually differ in the way they update the weights at each step and how each step is made, i.e. whether only the gradient is added or some other gradient-based term as well. Some details can be found in Bishop (2006). Every library implementing neural nets includes these algorithms, including the PyTorch library (Paszke et al., 2017) used here.
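To illustrate the update rule of Equation 4.10, the following is a minimal training loop with plain stochastic gradient descent in PyTorch, reusing the FeedForwardNet sketch from above. The optimizer, learning rate and the use of a single output neuron with binary cross-entropy (Equation 4.9) are illustrative simplifications, not the tuned setup of Chapter 5.

    # Sketch of iterative weight updates (Equation 4.10) with plain SGD;
    # PyTorch computes the gradient of the error function automatically.
    import torch

    x = torch.randn(32, 20)                 # a batch of feature vectors
    t = torch.randint(0, 2, (32,)).float()  # binary targets t_n
    net = FeedForwardNet(n_features=20)     # the sketch defined above
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = torch.nn.BCELoss()            # Equation 4.9

    for step in range(100):
        optimizer.zero_grad()
        y = net(x)[:, 0]       # use the first output neuron's activation
        loss = loss_fn(y, t)   # forward propagation: value of E(w)
        loss.backward()        # backpropagation: gradient of E(w)
        optimizer.step()       # w <- w + delta_w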

4.2.2 Error Backpropagation

Backpropagation is a method of finding the optimal parameters of the neural net. Before we can describe the process of backpropagation in detail we must first discuss the overall approach to training the neural net.

Error backpropagation is an efficient technique for evaluating the gradient of the error function E(\mathbf{w}). The goal is to evaluate the derivative of the loss function and then move in the direction of steepest descent, as is common in nonlinear optimization. The derived backpropagation formulas are quite general. Bishop (2006) demonstrated them on a typical maximum-likelihood error function for a set of i.i.d. data, which comprises a sum of per-observation terms and is denoted as

E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}).
