
Bachelor Informatica

Exploring the contribution of semantic information in stance detection

Jurre Wolsink

June 17, 2019

Supervisor(s): Marco Del Tredici and Sandro Pezzelle

Informatica

Universiteit van Amsterdam


Abstract

Stance is the attitude toward a pre-defined target entity and can be deduced automatically from a text using a machine learning model. We look at the contribution of semantic information, in the form of word embeddings created using word2vec, to stance detection, and at different ways to combine the embedding vectors into one vector for the entire text. Our models are kept simple, with features derived only from the tweet text and a support-vector machine as classifier. They are tested on tweets gathered for the SemEval 2016 stance detection task.

We show that adding word embeddings to a simple n-gram system improves stance classification accuracy, outperforming all teams that participated in the task. We also show that using only embeddings achieves scores comparable to the n-gram based model. Finally, we find that concatenating word vectors to exploit the information in the word order performs worse than taking the average, while weighting the vectors with their term frequency-inverse document frequency does well on some performance measures.


Contents

1 Introduction
  1.1 Stance
  1.2 Goals
  1.3 Structure

2 Theoretical background
  2.1 Support-vector machines
    2.1.1 Binary to multi-class problems
  2.2 N-grams
  2.3 Term frequency-inverse document frequency (tf-idf)
  2.4 Word2vec embeddings
    2.4.1 Continuous bag of words
    2.4.2 Continuous skip-gram
    2.4.3 Combining embeddings

3 Experimental Setup
  3.1 Dataset
  3.2 Text processing
  3.3 Feature extraction
    3.3.1 N-grams
    3.3.2 Word vectors
  3.4 Classification
  3.5 Performance measurement

4 Results
  4.1 Results for stance classification
    4.1.1 Qualitative analysis
  4.2 Word vectors to instance vector
  4.3 Stop word removal

5 Discussion
  5.1 Conclusions
  5.2 Further research

Bibliography

Appendices
  A Ekphrasis text processing parameters
  B N-gram vectorization
  C Scaling the frequency weighted word vectors
  D Support vector machine parameters


CHAPTER 1

Introduction

1.1 Stance

Stance is the attitude a person takes toward a person, object, organization, or idea. Automatically determining stance has many applications, from companies that would like to know what people on social media think of their brand to investigating whether a news organization is biased.

Every year a competition called SemEval is held, allowing teams to compete on multiple natural language processing (NLP) tasks. This thesis looks at stance detection in the context of SemEval 2016 task 6a [13], which formalizes it as: ‘given a tweet text and a target entity, automatic natural language systems must determine whether the tweeter is in favor of the given target, against the given target, or whether neither inference is likely’. Note that ‘neither’ does not imply a neutral stance. An example is

Target: Hillary Clinton

Tweet: Hillary Clinton is a devious & deceitful career politician

We can deduce that the tweeter is likely to be against the target.

Sentiment analysis is determining if a text is ‘positive’, ‘negative’, or ‘neutral’. Stance is related to sentiment in that both express an opinion, but the target of this opinion can differ. Take the tweet

Target: Hillary Clinton

Tweet: A circus act such as Donald Trump is short term entertainment. Soon to be canceled.

The target of the sentiment is Donald Trump, but the given stance target is Hillary Clinton. Nonetheless, sentiment analysis is related enough that we can use some of its techniques for stance detection. For a closer look at the interplay between stance and sentiment, see [11]. Even though the words Hillary Clinton do not appear in the tweet, a human would probably say that the tweeter is in favor of her, because they know that Donald Trump was Clinton’s opponent in the 2016 US presidential election. This is world knowledge that a computer system would not have, which makes automated detection complicated.

1.2 Goals

The goal of this thesis is to look at the role of semantic information in stance detection. These types of NLP tasks often use features like n-grams, which are simple to work with and scale well but hold no information about the meaning of a word: ‘lion’ is as different from ‘tiger’ as it is from ‘table’. However, to infer stance, one needs to understand the meaning of a tweet, and it is reasonable to assume an automatic system would need to as well. We will use semantic information to try to accomplish this.


The model that won the SemEval competition, from MITRE [22], used features derived from word embeddings with a complex deep neural network model. These embeddings are real-valued vectors that represent words in the vocabulary. Vectors whose words have close meanings will be close to each other in Euclidean space, so these embeddings hold a lot of semantic information. Combining the meaning of the individual words should say something about the meaning of the tweet as a whole.

While discussing the results of the competition, Mohammad et al. [13] showed that using a simple linear classifier and n-gram based features scored higher than MITRE’s model. In [19], Sobhani and Mohammad mention briefly that including word embeddings in such a model gave even higher scores.

We will test how word embeddings perform both as a replacement for n-grams and in combination with them, and explore different ways to combine word vectors into one tweet vector. We use a simple system with no features other than the word n-grams and embeddings; although this might not give the best possible detection performance, it allows us to disentangle their relative contributions.

1.3 Structure

In chapter two, we look at the theory behind the concepts used in this thesis. Chapter three describes the models built from these concepts, and chapter four presents the results of testing them. Chapter five discusses the results and gives suggestions for further research.


CHAPTER 2

Theoretical background

2.1 Support-vector machines

Since they were first introduced in the 1960s, support-vector machines (SVMs) have been shown to perform well in classification using supervised learning. Given a set of training examples labeled as belonging to one of two classes, an SVM as described in [2] tries to find two parallel hyperplanes that separate the classes in the feature space such that the distance between them, the margin, is as large as possible. The hyperplane halfway between them is then the optimal decision boundary. Data points that lie on one of the hyperplanes are called support vectors. New data points are classified to one class or the other depending on which side of the decision hyperplane they fall on. Sobhani et al. showed in [19] that a model using an SVM outperforms all models implemented by the different teams of the stance detection task.

Hsu et al. [6] recommend scaling the features to [-1,1] or [0,1], both for numerical reasons and to avoid one feature dominating all others.

Figure 2.1: Feature space with a maximum-margin hyperplane. Circled points are support vectors. Taken from [3].

2.1.1 Binary to multi-class problems

A text can have one of three different stances, so we need to map this multi-class problem onto binary ones. The standard way to do this is called one-vs-rest and involves creating n SVM models, where n is the number of classes. The kth model is trained with the kth class as positive and all others as negative. When classifying a point x, it is assigned to the class whose model yields the highest value of the decision function, in other words, the model for which x lies furthest from the decision hyperplane on the positive side.

One-vs-one, first used with an SVM in [7], creates n(n − 1)/2 models each trained on the data of a different pair of classes. When we present a new data point to the classifier, every model chooses the class that best fits this point. The class that is chosen the most will be the output.

Experiments done by [5] show that the one-vs-one method produces better results.
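As an illustration, both strategies are available in Scikit-learn, which we use in chapter three; the following minimal sketch trains them on synthetic data rather than on our tweet features:

# Minimal sketch of one-vs-rest versus one-vs-one with Scikit-learn,
# on synthetic data instead of the tweet features used in chapter 3.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# One-vs-rest: n binary SVMs; pick the class whose decision value is highest.
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

# One-vs-one: n(n-1)/2 pairwise SVMs; pick the class that wins the most votes.
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)

print(ovr.predict(X[:5]), ovo.predict(X[:5]))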

2.2 N-grams

SVM classification works by partitioning a feature space. For our stance classification problem, the data are pieces of text, so we need to map them to numerical values in order to position them in the space.

One of the most straightforward ways to do this is by taking the n-grams, or contiguous sequences of n items, of a text. These items can be characters or words. For example, the sentence ‘to be or not to be’ contains the word 1-grams ‘to’, ‘be’, ‘or’, ‘not’, ‘to’, and ‘be’, and the 2-grams ‘to be’, ‘be or’, ‘or not’, ‘not to’, and ‘to be’. If we create a vector with one entry for every n-gram in the text corpus and fill in a one when an n-gram occurs in the text we are interested in and a zero otherwise, as in table 2.1, we have a feature we will call the binary n-gram, which we can use with the classifier.

Text: to be or not to be

2-gram          to be   be or   or not   not to   as is
Feature vector  1       1       1        1        0

Table 2.1: Example of binary word n-grams.

Filling in the number of times the n-gram occurs in a text instead of only the fact that it is present or not, results in the count n-gram shown in table 2.2.

Text: to be or not to be

2-gram          to be   be or   or not   not to   as is
Feature vector  2       1       1        1        0

Table 2.2: Example of count word n-grams.

If we divide the frequencies by the total number of n-grams, we can think of this as a probability table denoting the probability of x_i occurring in the text instance conditional on x_{i−n+1}, ..., x_{i−1}, assuming x_i is independent of all other x’s.
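As a minimal sketch, the binary and count n-grams of tables 2.1 and 2.2 can be reproduced with Scikit-learn’s CountVectorizer, the tool we use in subsection 3.3.1 (note that it orders its n-gram columns alphabetically, and that ‘as is’ from the tables would require a larger corpus):

# Minimal sketch reproducing tables 2.1 and 2.2 with Scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['to be or not to be']

count_vec = CountVectorizer(ngram_range=(2, 2))                # count n-grams
binary_vec = CountVectorizer(ngram_range=(2, 2), binary=True)  # binary n-grams

print(count_vec.fit_transform(corpus).toarray())
# [[1 1 1 2]] for the columns 'be or', 'not to', 'or not', 'to be'
print(binary_vec.fit_transform(corpus).toarray())
# [[1 1 1 1]]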

2.3 Term frequency-inverse document frequency (tf-idf)

The count n-gram will emphasize words or terms that occur often in a text. This emphasis may be misguided, as some terms are present in almost all texts, so their occurrence in a particular one does not say much. We need a way to measure how ‘important’ a word is. In [20] Spärck Jones suggested using the inverse document frequency:

idf(t, D) = log( |D| / (1 + |{d ∈ D : t ∈ d}|) )

where D is the domain corpus, d a text in D, and t the term of interest. One is added to the denominator to avoid division by zero. Combining this with the term frequency

tf(t, d) = |{t ∈ d}| / |d|

gives the term frequency-inverse document frequency:

tf-idf(t, d, D) = tf(t, d) · idf(t, D)

Because the tf is in the range [0,1] and the idf in [0, log(|D|)], the tf-idf is in the latter range as well. This may require features weighted with the tf-idf to be scaled back down to a range that works well with the SVM classifier. If a term is present in almost all texts, its tf-idf will be very low no matter its frequency in a particular one; conversely, if a term is rare, the high idf will increase the tf-idf. In essence, it places greater emphasis on a term if it distinguishes a text from others, which is suitable for our classification purpose.
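The following minimal sketch computes these quantities directly from the formulas above; Scikit-learn’s TfidfVectorizer, which we use later, smooths and normalizes slightly differently:

# Minimal sketch of tf-idf following the formulas in this section.
import math

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    return math.log(len(docs) / (1 + sum(1 for d in docs if term in d)))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [['hillary', 'is', 'devious'],
        ['trump', 'is', 'a', 'circus', 'act'],
        ['the', 'circus', 'is', 'in', 'town']]
print(tf_idf('devious', docs[0], docs))  # rare term: positive weight
print(tf_idf('is', docs[0], docs))       # appears everywhere: log(3/4) < 0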

2.4 Word2vec embeddings

The n-gram models treat words as indices in a vocabulary, with no notion of similarity between them. Mikolov et al. [10] proposed an algorithm to generate vectors for words in such a way that vectors are positioned close to each other in space when their corresponding words are close in meaning. These so-called word embeddings are an older idea, but [10] simplified the model, allowing it to be trained on more data and increasing accuracy across different NLP tasks.

2.4.1 Continuous bag of words

Word2vec works by training a neural network with one hidden layer of N dimensions to predict a word based on the C context words before and after it, as in figure 2.2. These words are represented as one-hot encoded vectors of the length of the vocabulary V, with a one at the position of the word of interest and zeros for all others. To deal with multiple input vectors, we take their average and multiply it by the weight matrix. Because this loses the word order, the model is called ‘continuous bag-of-words’.

Figure 2.2: Continuous bag of words word2vec model. Taken from [17].

A softmax function is used for the output layer. It ensures all output values are between zero and one and add up to one; in other words, we can interpret the output as probabilities. Softmax is defined as

σ(u)_i = e^{u_i} / Σ_{j=1}^{K} e^{u_j}

with u ∈ R^K and i = 1, ..., K.

If we feed in a single word, the hidden layer vector of size N is the embedding for that word. Because the words are one-hot encoded, the embedding of the word at position i in V is the ith row of W.
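A tiny numpy sketch with toy sizes for V and N illustrates why the lookup reduces to selecting a row:

# Minimal sketch: with a one-hot input, the hidden activation x @ W
# is exactly row i of the weight matrix W, i.e. the word's embedding.
import numpy as np

V, N = 5, 3                        # vocabulary size, embedding dimension
W = np.random.default_rng(0).normal(size=(V, N))

x = np.zeros(V)
x[2] = 1.0                         # one-hot vector for the word at index 2

assert np.allclose(x @ W, W[2])    # embedding lookup = selecting a row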

2.4.2 Continuous skip-gram

By averaging the context words, you lose all information about the word order. If we were to flip the neural network around to predict the C context words based on the word of interest, we would get the continuous skip-gram model shown in figure 2.3.

Figure 2.3: Skip-gram word2vec model. Taken from [17].

Because words further away will most likely be less related to the target word, we want to give them less weight. We do this by, for each training example, randomly selecting between 1 and C context words.

Skip-gram is slower than bag-of-words, as there are more matrix multiplications and softmax functions to compute, but it exhibits the additive compositionality trait that makes meaningful element-wise combining of word vectors possible [9]. For example, king − man + woman ≈ queen. This is possible because the word vectors have a linear relationship to the softmax inputs, as no non-linearity is introduced by the activation function of the hidden layer. Therefore, when you add two word vectors, you add two softmax inputs, and because of their logarithmic relationship to the context distribution, you are creating a product of two distributions, which can be thought of as a union. If you do this for all words in a text, you get a vector that represents the union of the context distributions of all those words. The consequence is that we can meaningfully combine the vectors of the individual words in a text to create one vector that represents the entire text.
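As a minimal sketch, this can be reproduced with Gensim and the pre-trained Google News vectors introduced in subsection 3.3.2 (a large one-time download):

# Minimal sketch of additive compositionality with Gensim's KeyedVectors.
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')  # pre-trained vectors, ~1.6 GB

# king - man + woman lands near queen in the embedding space.
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected: [('queen', ...)]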


2.4.3 Combining embeddings

We now have a way to create numerical representations of words, but the input data for the classification algorithm are not words but labeled texts. We need to create a representation of the entire text using the embeddings of the individual words in it. Sobhani et al. [19] take the element-wise average of all word vectors, which uses the fact that, because of the additive compositionality explained in the previous subsection, we can add individual word vectors; they then normalize the result by the number of words, so text length has no impact on the feature. We can also use the same approach as in section 2.3, that is, weight each word vector by its frequency and importance.

In order not to discard the information about the word order, the vectors can be concatenated as well, padding with zeros to make sure all vectors are the same size. This combination method might make it harder for the SVM classifier to find a good decision hyperplane, though, as the number of dimensions in the feature space increases dramatically.
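A minimal numpy sketch of the three combination techniques, using toy two-dimensional vectors and assumed tf-idf weights, looks as follows:

# Minimal sketch of the three ways to turn word vectors into one text vector.
import numpy as np

vecs = [np.array([0.2, -0.1]), np.array([0.4, 0.3]), np.array([-0.6, 0.5])]
tfidf = [0.9, 0.1, 0.5]      # assumed per-token weights from section 2.3
max_len = 5                  # tokens in the longest text of the corpus

# 1. Element-wise average, normalized by the number of words.
avg = np.mean(vecs, axis=0)

# 2. tf-idf weighted average: distinguishing words count more.
weighted = np.average(vecs, axis=0, weights=tfidf)

# 3. Concatenation, zero-padded to a fixed size, preserving word order.
concat = np.zeros(max_len * vecs[0].size)
concat[:len(vecs) * vecs[0].size] = np.concatenate(vecs)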


CHAPTER 3

Experimental Setup

3.1 Dataset

The dataset used to test our models is described in [12] and consists of 4,163 tweets manually annotated for stance toward five targets of interest: ‘Atheism’, ‘Climate Change is a Real Concern’, ‘Feminist Movement’, ‘Hillary Clinton’, and ‘Legalization of Abortion’. These were chosen because they are widely understood by people in the United States, where the annotation was done. The dataset’s sixth target, ‘Donald Trump’, is not used because it is not included in the data provided by the SemEval team for this task.

The Twitter API was polled to collect tweets with a hashtag at the end corresponding to one of the targets. These hashtags were removed, and because they always occurred at the end of the tweet, this often had no impact on the grammar.

A thousand randomly selected instances per target were annotated by hand for stance; a tweet was incorporated into the dataset if more than 60% of the annotators agreed with each other. The annotators were asked to classify the tweets as ‘in favor’, ‘against’, ‘neutral’, or ‘no stance’. The latter two were combined, as less than 0.1% received the ‘neutral’ label. Table 3.1 shows the number of tweets per target.

Target                      # Tweets
Atheism                     733
Climate change              564
Feminist movement           949
Hillary Clinton             984
Legalization of abortion    933
Total                       4163

Table 3.1: Number of tweets labeled per target.

As stated in the introduction, stance can often be deduced because the text shows an opinion toward something related to the target. Therefore, one of the goals of creating this dataset was to include tweets that express a stance toward something without explicitly mentioning it or even without expressing any stance at all. In table 3.2, we can see that more than a third of the included tweets do not aim their opinion towards the target.

The data is split into 2,914 tweets for training and 1,249 for testing. Substantial differences in the class distributions of the training and test sets would be undesirable, but as table 3.3 shows, this is not the case here: looking at the total percentages of the classes, none of the three differs by more than 10%.


                            % of opinion towards
Target                      Target   Other   Neither
Atheism                     49.25    46.38   4.37
Climate change              60.81    30.50   8.69
Feminist movement           68.28    27.40   4.32
Hillary Clinton             60.32    35.10   4.58
Legalization of abortion    63.67    30.97   5.36
Total                       58.80    36.19   5.01

Table 3.2: Distribution of target of opinion.

                            % of instances in train      % of instances in test
Target                      Favor   Against   Neither    Favor   Against   Neither
Atheism                     17.9    59.3      22.8       14.5    72.7      12.7
Climate change              53.7    3.8       42.5       72.8    6.5       20.7
Feminist movement           31.6    49.4      19.0       20.4    64.2      15.4
Hillary Clinton             17.1    57.0      25.8       15.3    58.3      26.4
Legalization of abortion    18.5    54.4      27.1       16.4    67.5      16.1
Total                       25.8    47.9      26.3       24.3    57.3      18.4

Table 3.3: Distribution of stance per target.

3.2 Text processing

We can only create n-grams and word vectors from tokens, so we need to process our tweets first. For this, we use Ekphrasis [1], a text processing tool geared towards text from social networks, created by the DataStories team for the SemEval 2017 sentiment analysis task. It supports:

• Tokenization, or splitting a string. This is more difficult for tweets as a lot of creative writing is used, like censored words (‘f*ck’), emphasis (‘I *really* like it’), or emoticons.

• Word normalization for URLs, emails, percentages, money, phone numbers, user names, times, dates, and numbers.

• Word annotation for repeated, elongated, all-caps, or censored words and for hashtags.

• Word segmentation. This is important for hashtags, as they have no corresponding word vectors, but their individual words do.

• Spell correction.

An example that uses all the above features is shown in table 3.4.

Original:  The *new* season of #TwinPeaks is coming on May 21, 2017. CANT WAIT \o/ !!! #tvseries #davidlynch :D

Processed: the new <emphasis> season of <hashtag> twin peaks </hashtag> is coming on <date>. cant <allcaps> wait <allcaps> <happy> ! <repeated> <hashtag> tv series </hashtag> <hashtag> david lynch </hashtag> <laugh>

Table 3.4: Example of the Ekphrasis text processor, from [1].
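As a minimal sketch, Ekphrasis can be driven as follows with a few of these options; the parameter sets actually selected by cross-validation are listed in appendix A:

# Minimal sketch of Ekphrasis pre-processing; appendix A lists the
# parameter combinations actually used for each feature type.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    normalize=['date'],                 # replace dates with a <date> token
    annotate={'allcaps', 'emphasis'},   # mark ALL-CAPS and *emphasized* words
    segmenter='twitter',                # hashtag word segmentation model
    unpack_hashtags=True,               # '#TwinPeaks' -> 'twin peaks'
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

tweet = 'The *new* season of #TwinPeaks is coming on May 21, 2017. CANT WAIT'
print(' '.join(text_processor.pre_process_doc(tweet)))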

Not all of these are used; we perform five-fold cross-validation to find the best combination of parameters. The cross-validation is done separately for the n-gram and embedding features; for example, it might be better to include a hashtag as-is for the n-gram features instead of as separate tokens. The parameters that were used can be found in appendix A.

The normalization is interesting because, intuitively, it might make sense not to perform it in the text processing for the embedding feature, as the tokens it creates, e.g. ‘<date>’, have no corresponding word vectors. However, it might still be better to include it, as dates, emails, etc. may hold no information for stance detection, and normalizing then has the same effect as removing them.

Saif et al. showed that sentiment analysis performance decreases when generic stop words are removed from Twitter messages [18]. Despite this, we will test removing stop words, as sentiment classification is related to but not the same as stance classification. To decide which words to remove, we use the NLTK stop word corpus described in [8], which contains 2,400 words across 11 languages.
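A minimal sketch of this filtering step with the NLTK corpus:

# Minimal sketch of removing English stop words with NLTK's corpus [8].
import nltk
nltk.download('stopwords', quiet=True)   # one-time corpus download
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['hillary', 'clinton', 'is', 'a', 'devious', 'career', 'politician']
print([t for t in tokens if t not in stop_words])
# ['hillary', 'clinton', 'devious', 'career', 'politician']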

3.3 Feature extraction

3.3.1 N-grams

We extract the word n-grams using Scikit-learn [14]. Character n-grams are not used, as they would exploit character-level information that is not available to the word vector model, making the comparison unfair. Cross-validation shows that the best results are obtained using one- and two-word n-grams, so we try the different vectorization techniques on these. The results can be found in appendix B; in short, the best score is obtained by a model with tf-idf vectorization and without stop word removal.

3.3.2 Word vectors

Word vectors are created from the tokens simply by looking them up in a large dictionary. The semantic space we use is from [9] and contains 300-dimensional vectors for 3 million words and phrases, trained on about 100 billion words from Google News using a skip-gram model. It is obtained using the Gensim download API [16]. All values are between -1 and 1, which is the range recommended for SVM classifiers [6].
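As a minimal sketch, the space can be loaded through the Gensim download API and queried per token, skipping tokens for which no vector exists (as described next):

# Minimal sketch: load the Google News space and look up token vectors,
# silently skipping tokens that are out of vocabulary.
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')     # 3M entries, 300 dimensions

tokens = ['devious', 'politician', '<date>']  # '<date>' has no vector
vectors = [wv[t] for t in tokens if t in wv]
print(len(vectors), vectors[0].shape)         # only in-vocabulary tokens kept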

We test the three different techniques described in subsection 2.4.3 to create one vector per text. A word is ignored if no vector can be found for it. For the word frequency technique, we multiply the word vector by its tf-idf score. Because this changes the scaling, the result is re-scaled using the MinMax function:

x_i = (x_i − min(x)) / (max(x) − min(x))

Hsu et al. [6] recommend scaling to either [0,1] or [-1,1]; intuitively the latter makes sense, as that is the range of the word vectors. We tried both, the results of which can be found in appendix C, and found that the range [0,1] performs best, so that is what we use for the experiments involving tf-idf.
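A minimal sketch with Scikit-learn’s MinMaxScaler, under the assumption that the formula above is applied per feature over the training set:

# Minimal sketch of MinMax re-scaling the tf-idf weighted features to
# [0, 1]; assumes per-feature scaling fitted on the training set only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.4, -1.2], [2.0, 0.3], [1.1, -0.5]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
print(scaler.transform(X_train))   # every column now spans exactly [0, 1]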

All feature vectors need to be of the same size, so when concatenating word vectors, we pad the vector with zeros up to the length of the text with the most tokens.

3.4 Classification

We use the SVM implementation of Scikit-learn [14], with parameters chosen using five-fold cross-validation. The defaults with a linear kernel turn out to perform best. Changing the multi-class strategy from one-vs-rest to one-vs-one does not change the scores, in contrast to the results of [5]. A list of all parameters can be found in appendix D.

According to [19], the information indicating stance is not necessarily the same for all targets; therefore, we train a separate classifier for each target. The scores presented in the next chapter are averages over all targets.
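A minimal sketch of this per-target setup, with hypothetical toy features standing in for the real n-gram and embedding matrices:

# Minimal sketch: one linear SVM per target, with Scikit-learn defaults.
# Toy random features stand in for the real n-gram/embedding matrices.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
data_by_target = {                       # hypothetical per-target data
    'Atheism': (rng.normal(size=(30, 4)), rng.integers(0, 3, 30)),
    'Hillary Clinton': (rng.normal(size=(30, 4)), rng.integers(0, 3, 30)),
}

classifiers = {
    target: SVC(kernel='linear').fit(X, y)
    for target, (X, y) in data_by_target.items()
}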

3.5 Performance measurement

To evaluate which model performs best, the F-score is used, as suggested for sentiment analysis in SemEval 2013 task 2 [21] and also used in the stance detection task; this allows us to compare our results with those of the competition participants.


The F-score is the harmonic mean of precision and recall, calculated per class. For example, the ‘favor’ precision is the number of correct ‘in favor’ results divided by the number of instances classified as ‘in favor’, while the ‘favor’ recall is the number of correct ‘in favor’ results divided by the number of instances that should have been classified as ‘in favor’:

F_favor = 2 · P_favor · R_favor / (P_favor + R_favor)

The average F-score is then the average of the ‘favor’ and ‘against’ classes:

F_avg = (F_favor + F_against) / 2

If one were to take a random tweet from Twitter, the probability of classifying it as in favor or against a specific target is small, and the ‘neither’ class would be over-represented [13]. Simply classifying every instance as ‘neither’ would obtain high scores, so we do not consider this class in the F-score. Of course, falsely marking a ‘neither’ tweet as in favor or against a target still has an indirect impact on the score.

Because the models are trained on each target individually, we have two choices for calculating an overall F-score. The micro-average computes F_favor and F_against over the instances of all targets pooled together and averages the two, while the macro-average is the mean of the per-target F_avg scores. The former is the official score metric used in the SemEval competition [13], as it is here. Nevertheless, the macro-average is interesting, as it shows how well a system performs across all targets, without the risk of being dominated by the score of a frequent one.
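As a minimal sketch, F_avg can be computed with Scikit-learn by restricting f1_score to the ‘favor’ and ‘against’ labels and macro-averaging over just those two:

# Minimal sketch of the metric on toy labels: the macro average of F
# over only 'favor' and 'against' equals (F_favor + F_against) / 2.
from sklearn.metrics import f1_score

y_true = ['favor', 'against', 'neither', 'against', 'favor']
y_pred = ['favor', 'against', 'against', 'against', 'neither']

f_avg = f1_score(y_true, y_pred, labels=['favor', 'against'], average='macro')
print(round(f_avg, 3))   # 0.733: neither is excluded, but errors on it
                         # still hurt favor/against precision and recall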


CHAPTER 4

Results

4.1 Results for stance classification

The overall results of our models are shown in table 4.1. The four benchmark classifiers displayed in row I are: a random classifier, which assigns a random class to each instance; a majority classifier, which labels every instance with the most frequent class for its target; our n-gram based model as described in subsection 3.3.1; and the best performing model from the SemEval competition, MITRE [22], which uses word embeddings and a neural network. Our word embedding based model uses the average of the word vectors, excludes stop words, and applies no scaling.

What is interesting is that the model with n-gram features attains a higher micro-average F-score than the much more complex MITRE model. The micro-average F-score of the majority classifier is also remarkably high; its macro-average, however, is rather low. This makes sense: looking at the distribution of stance per target, the targets ‘Atheism’ and ‘Climate change’ are very asymmetric and thus give the majority classifier good performance, while the targets ‘Feminist movement’ and ‘Hillary Clinton’ are more balanced, leading to a lower macro F-score.

Classifier                     Micro F   Macro F
I. Benchmarks
  a. Random                    33.3      30.5
  b. Majority                  65.2      40.1
  c. 1-2 word n-grams          68.2      53.2
  d. MITRE                     67.8      56.0
II. Semantic information
  a. Embeddings                67.9      53.9
  b. N-grams + embeddings      70.8      58.9

Table 4.1: Stance detection results on the test set.

Row II.a shows that a model using only semantic information in the form of word vectors does not outperform an n-gram based model when both use the same classifier: its micro-average F-score is slightly lower, while its macro-average is slightly higher. If the two features are combined, the results are quite positive, with a higher micro-average than either single-feature model. The difference in the macro-average F-score is even more substantial, showing that stance detection improves across all targets.

With a model similar to II.b, Sobhani et al. [19] obtained scores of 70.3 and 59.1 for the micro- and macro-average F-scores respectively. The fact that our results are in line with theirs, even though they use character n-grams that our models do not, suggests that either our pre-processing is superior, the word vectors they trained using continuous skip-gram on about 1.7 million relevant tweets capture less meaning than those we used, or simply that character n-grams have little influence on stance detection.


4.1.1 Qualitative analysis

To better understand why the model using word embedding features scores differently from the one using n-grams, we look at some examples for the target ‘Hillary Clinton’ displayed in table 4.2. In tweet a. the hashtag ‘WhyI’mNotVotingForHillary’ would indicate a strong stance against the target, but the embedding model cannot use this information as there exists no vector for the hashtag. The pre-processor breaks it up into separate words, but that does not always work as well. Sometimes, not using hashtags is an advantage, however. Example b. shows no stance about Hillary Clinton, but the n-gram model erroneously classified this example as ‘against’, likely because the hashtag ‘#freeallfour’ is used in other tweets that are annotated as such. The embeddings model looks at the meaning of the words in the hashtag and the rest of the text and correctly concludes that they do not indicate an opinion towards the target.

Word embeddings can only capture the semantic information they are trained on, and because the word vectors we use are trained on news articles, they hold little to no information about misspelled or slang words. In example c. we see how this can result in an incorrect classification. ‘Sheets’ probably does not refer to bed covers, so it pushes the average word vector of the text toward something the author did not mean.

Tweet a: @thehill : Women deserve a better candidate for the HIGH HONOR if first woman President: We ALL do! #WhyI’mNotVotingForHillary
    Embeddings: Favor    N-grams: Against    Annotators: Against

Tweet b: @FaithWarJournal Thank You for the follow, may God bless you all ! #freeallfour
    Embeddings: Neither    N-grams: Against    Annotators: Neither

Tweet c: @DesireeAaron @HillaryClinton Sheets Clinton Ya got to love it #HillaryonCNN #HillaryClinton
    Embeddings: Against    N-grams: Favor    Annotators: Favor

Table 4.2: Examples of tweets classified on stance toward Hillary Clinton using different models.

4.2 Word vectors to instance vector

Table 4.3 shows the results of the three techniques described in subsection 2.4.3 for combining all word vectors of a text into one vector. Using information about the frequency of tokens lowers the micro-average F-score but improves the macro-average, indicating worse performance on a frequent target but more even performance across all targets. Concatenating word vectors does not improve stance detection performance at all.

Classifier                         Micro F   Macro F
a. Average of vectors              67.9      53.9
b. Weighted average of vectors     64.9      56.7
c. Concatenated vectors            66.3      53.5

Table 4.3: Results of the different word vector combination techniques.


4.3 Stop word removal

In table 4.4 we see that stop words do not hold much information from which to classify stance, as removing them has a positive effect on performance when using the average word vector combination technique. This effect is even greater with the concatenated vectors, but this could also be because it results in smaller feature vectors.

Removing stop words has less of an effect on the weighted average of vectors, but this makes sense as tf-idf already gives less weight to words that occur in many texts, which would be the case for stop words.

Classifier                         Micro F   Macro F
I. With stop words
  a. Average of vectors            66.6      49.2
  b. Weighted average of vectors   64.4      55.1
  c. Concatenated vectors          58.8      46.5
II. Without stop words
  a. Average of vectors            67.9      53.9
  b. Weighted average of vectors   64.9      56.7
  c. Concatenated vectors          66.3      53.5

Table 4.4: Results with and without stop word removal.


CHAPTER 5

Discussion

5.1 Conclusions

In this thesis, we set out to see if adding semantic information to a classification model would improve stance detection performance. To do this, we created multiple models using different word embedding based features that capture semantic information and tested them on the tweet dataset created by [13]. In section 4.1 we showed that adding word embeddings to a simple n-gram system improves both micro- and macro-average F-scores, outperforming all baselines. We also showed that using only word vectors already achieves scores comparable to the n-gram based baseline model. We conclude that semantic information alone is sufficient to deduce stance automatically, and that it can also improve stance detection when added to an existing model.

In section 4.2, we showed that concatenating the word vectors to capture information about the word order does not improve results. Perhaps the classifier has difficulty with a feature space that has around 40 times the number of dimensions of the other combination techniques, or with the fact that short texts yield mostly zero-padded vectors while long ones yield vectors almost entirely filled with values. Using the frequency weighted word vectors results in a higher macro-average F-score and is therefore recommended when a task demands good performance across multiple targets.

5.2 Further research

Now that we have shown that word embeddings can improve stance detection performance, it would be interesting to look at different word vectors. Godin et al. [4], for example, describe vectors trained specifically on tweets, which might help with some of the issues described in 4.1.1, while researchers from Stanford University present word vectors created by their GloVe model [15], which uses word-word co-occurrence matrices.

Another avenue of future work is to see if the performance increase word embeddings give to stance detection tasks generalizes well to different datasets, especially ones that are less balanced.


Bibliography

[1] Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. “DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis”. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, Aug. 2017, pp. 747– 754.

[2] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. “A training algorithm for optimal margin classifiers”. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM. 1992, pp. 144–152.

[3] Kumar Dhairya. [Online; accessed June 8, 2019]. 2019. URL: https://towardsdatascience.com/demystifying-support-vector-machines-8453b39f7368.

[4] Fréderic Godin et al. “Multimedia Lab @ ACL WNUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations”. In: Proceedings of the Workshop on Noisy User-generated Text. 2015, pp. 146–153.

[5] Chih-Wei Hsu and Chih-Jen Lin. “A comparison of methods for multiclass support vector machines”. In: IEEE transactions on Neural Networks 13.2 (2002), pp. 415–425.

[6] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. “A practical guide to support vector classification”. In: (2003).

[7] Ulrich H.-G. Kreßel. “Pairwise classification and support vector machines”. In: Advances in kernel methods: support vector learning (1999), pp. 255–268.

[8] Edward Loper and Steven Bird. “NLTK: the natural language toolkit”. In: arXiv preprint cs/0205028 (2002).

[9] Tomas Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Advances in neural information processing systems. 2013, pp. 3111–3119.

[10] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013).

[11] Saif M Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. “Stance and sentiment in tweets”. In: ACM Transactions on Internet Technology (TOIT) 17.3 (2017), p. 26.

[12] Saif Mohammad et al. “A Dataset for Detecting Stance in Tweets”. In: LREC. 2016.

[13] Saif Mohammad et al. “Semeval-2016 task 6: Detecting stance in tweets”. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016, pp. 31–41.

[14] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[15] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543.

[16] Radim Řehůřek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”. English. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. http://is.muni.cz/publication/884893/en. Valletta, Malta: ELRA, May 2010, pp. 45–50.


[17] Xin Rong. “word2vec parameter learning explained”. In: arXiv preprint arXiv:1411.2738 (2014).

[18] Hassan Saif et al. “On stopwords, filtering and data sparsity for sentiment analysis of twitter”. In: (2014).

[19] Parinaz Sobhani, Saif Mohammad, and Svetlana Kiritchenko. “Detecting stance in tweets and analyzing its interaction with sentiment”. In: Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics. 2016, pp. 159–169.

[20] Karen Spärck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21.

[21] Theresa Wilson et al. “SemEval-2013 Task 2: Sentiment Analysis in Twitter”. In: Proceedings of the International Workshop on Semantic Evaluation, SemEval. Vol. 13. 2013.

[22] Guido Zarrella and Amy Marsh. “Mitre at semeval-2016 task 6: Transfer learning for stance detection”. In: arXiv preprint arXiv:1606.03784 (2016).


Appendices

A Ekphrasis text processing parameters

params = dict(
    normalize=['money'],
    annotate={'allcaps', 'elongated'},
    fix_html=False,
    segmenter='twitter',
    corrector=None,
    unpack_hashtags=True,
    unpack_contractions=False,
    spell_correct_elong=False,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[],
)

Listing 1: ekphrasis.classes.preprocessor.TextPreProcessor parameters used for the n-gram features.

params = dict(
    normalize=['user'],
    annotate={'emphasis', 'elongated'},
    fix_html=False,
    segmenter='twitter',
    corrector='twitter',
    unpack_hashtags=True,
    unpack_contractions=False,
    spell_correct_elong=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[],
)

Listing 2: ekphrasis.classes.preprocessor.TextPreProcessor parameters used for the word embedding features.


B N-gram vectorization

Classifier                  Micro F1   Macro F1
I. 1-2 n-grams with stop words
  a. Binary                 67.4       54.4
  b. Count                  67.2       54.1
  c. tf-idf                 68.2       53.2
II. 1-2 n-grams without stop words
  a. Binary                 68.0       54.8
  b. Count                  67.7       54.6
  c. tf-idf                 67.6       52.6


C Scaling the frequency weighted word vectors

Classifier             Micro F1   Macro F1
a. No scaling          61.9       53.7
b. Scale to [-1, 1]    62.6       54.3
c. Scale to [0, 1]     64.9       56.7


D Support vector machine parameters

params = dict(
    C=1,
    kernel='linear',
    degree=3,
    shrinking=True,
    probability=False,
    tol=1e-3,
    cache_size=200,
    class_weight=None,
    max_iter=10000,
    decision_function_shape='ovr',
)
