
Analysis of Semantic Textual Classification Errors by Neural Sentence Embedding Model

Kjeld Oostra 10748598

Bachelor thesis
Credits: 18 EC

Bachelor's Programme Artificial Intelligence

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: A. Soleimani
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

January 29th, 2020


Contents

1 Introduction
2 Related work
3 Method
  3.1 Dataset
  3.2 Universal Sentence Encoder
  3.3 Sentence pair features
  3.4 Experiment setup
    3.4.1 Predicting similarity scores
    3.4.2 Classifying results
    3.4.3 Calculating sentence pair features
4 Experiment results
  4.1 Word n-gram overlap
    4.1.1 Reduced bias on word overlap in test set
  4.2 Part-of-Speech tag n-gram overlap
  4.3 Part-of-Speech tag frequencies
  4.4 Sentence lengths
    4.4.1 Reduced bias on sentence length in test set
  4.5 Cosine similarity between word2vec averages
  4.6 Unbiased test set benchmark scores
5 Conclusion
6 Discussion and future work


Abstract

In this research, sentence pairs from the SemEval STS Benchmark are classified after being embedded by the transformer-based Universal Sentence Encoder, and the misclassified pairs are analyzed. The goal of this analysis is to ascertain a pattern in the misclassified sentence pairs. Several semantic and syntactic features are proposed and analyzed. While no evident pattern was found in the vector representations produced by the model, a bias in word overlap and a bias in sentence length were found in the benchmark test set. Removing these biases from the test set improved the performance of the model on the benchmark semantic similarity task.


1 Introduction

Sentence embedding models are models that represent sentences as n-dimensional feature vectors. These models play an increasingly large role in the field of natural language processing (NLP). They are useful for many applications such as information retrieval, machine translation and text classification. The feature vectors represent many characteristics of a sentence. Although it varies per specific model, the vector usually contains information on at least the sentence semantics. This means that, using these representations, sentences can be compared on their meaning rather than on merely syntactical characteristics. The metric corresponding to the distance between the sentence embeddings is commonly referred to as Semantic Textual Similarity (STS). Comparing sentences semantically can be very powerful for applications such as question answering and semantic information retrieval (Cer, Diab, Agirre, Lopez-Gazpio, & Specia, 2017), for assessing the similarity of legal court documents (Mandal et al., 2017), for faster recognition of medical diagnoses based on symptom descriptions (Wang et al., 2018), and for detecting plagiarism and fraud (Ferrero, Besacier, Schwab, & Agnès, 2017).

There are many ways of determining sentence embeddings or sentence representations, and in order to compare the performance of these models on STS tasks to the state-of-the-art, a benchmark test was introduced called the STS Benchmark (Cer et al., 2017). This benchmark is comprised of a dataset with sentence pairs and semantic similarity scores. The semantic relatedness of two sentences is determined using the cosine similarity between the sentence representations. The performance of any model on the STS task of this benchmark is measured as the Pearson correlation between the predicted similarity scores and the label similarity scores. As is indicated in the STS Benchmark article, state-of-the-art models have a performance score of around 0.75 to 0.80.
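As a minimal illustration of this evaluation setup, the sketch below computes cosine similarities between hypothetical (randomly generated) sentence embeddings and then the Pearson correlation against illustrative gold scores; it is not taken from the thesis code.

```python
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 512-dimensional embeddings for three sentence pairs.
pairs = [(np.random.rand(512), np.random.rand(512)) for _ in range(3)]
predicted = [cosine_similarity(u, v) for u, v in pairs]

# Illustrative gold similarity labels on the 0-5 STS scale for the same pairs.
labels = [4.2, 1.0, 3.4]

# The STS Benchmark reports the Pearson correlation between predictions and labels.
r, _ = pearsonr(predicted, labels)
print(f"Pearson r = {r:.3f}")
```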

For this research, the Universal Sentence Encoder is used to perform the STS Benchmark task (Cer et al., 2018). This is a model introduced by Google with a strong focus on transfer learning, made available on TensorFlow Hub. This model was chosen because it has state-of-the-art performance, it is available as a pretrained model through TensorFlow Hub (making reproduction of the experiment accessible and more consistent), and it does not require a lot of computational resources. The Universal Sentence Encoder is available in two variations; for this research the transformer-based model is used due to its overall superior performance, at the cost of a higher computational resource demand. The goal of this research is to find a pattern, syntactic or otherwise, in the misclassified sentence pairs. This could be used to ascertain either a bias in the benchmark data or a bias in the model's decision-making behaviour, which could ultimately be used to make suggestions to improve the model and/or the evaluation method.

In this paper, some background information is first provided by covering relevant research. Next, the components and experiment setup of the research are described, including the dataset used. The features of the sentence pairs that are analyzed with the aim of finding a pattern are explained, and analytical metrics are reported. It is found that none of the analyzed features clearly indicate a bias in the model; rather, a bias in the benchmark dataset on word overlap and sentence length is ascertained. For both biases, eliminating them from the dataset positively affects the performance of the model on the benchmark.


2 Related work

For numerous Natural Language Processing (NLP) tasks, the scarcity of training data poses a challenge when working with deep learning models. To combat the lack of availability of these large data sets, many NLP tasks rely on pre-trained word embedding models such as word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) or GloVe (Pennington, Socher, & Manning, 2014). These models generate n-dimensional vector representations of words that signify the semantic properties of a word based on the context the word appears in. The semantic similarity between two words can be determined by comparing the distance between the word vectors: distances between vector representations of semantically similar words (like "forest" and "woods") or semantically related words (like "forest" and "leaves") are shorter than those of semantically unrelated words (like "forest" and "window").
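A minimal sketch of this idea using the Gensim downloader API (assuming a recent Gensim version and the pretrained Google News word2vec vectors, which are a sizeable download) could look as follows; the relative ordering of the scores is the point, not their exact values.

```python
import gensim.downloader as api

# Load pretrained word2vec vectors (Google News, 300 dimensions; large download).
wv = api.load("word2vec-google-news-300")

# Similar and related words lie closer together in the vector space,
# which shows up as a higher cosine similarity.
print(wv.similarity("forest", "woods"))   # similar words   -> relatively high
print(wv.similarity("forest", "leaves"))  # related words   -> intermediate
print(wv.similarity("forest", "window"))  # unrelated words -> relatively low
```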

One of the applications of such word embedding models is in semantic textual similarity (STS) tasks such as paraphrase identification (recognition of sentences that are semantically similar or identical despite different words, phrasing or word order). For these tasks, the word embedding models can be extended to sentence embedding models. A simple baseline was presented that takes a weighted average of the GloVe word vectors in a sentence and removes the common component (Arora, Liang, & Ma, 2016). This yielded surprisingly strong results, especially for out-of-domain texts. However, this word averaging approach has a major pitfall, namely that it does not take word order into account. This means that, using models like this, sentences such as "the dog bites the man" and "the man bites the dog" would be represented the same despite their different meaning. One way to overcome this is by introducing n-grams, as is done for the sent2vec model (Pagliardini, Gupta, & Jaggi, 2018).
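A simplified sketch of that weighted-averaging baseline is given below; word_vectors (a word-to-vector mapping) and word_freq (corpus word counts) are assumed inputs, and the weighting constant a follows the a/(a + p(w)) form described by Arora et al. This is an illustration of the idea, not the authors' reference implementation.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """Frequency-weighted average of word vectors per sentence, followed by
    removal of the common component (the first singular vector of the matrix
    of sentence vectors). Word order is ignored entirely, so "the dog bites
    the man" and "the man bites the dog" map to the same vector."""
    total = sum(word_freq.values())
    dim = len(next(iter(word_vectors.values())))
    vecs = []
    for tokens in sentences:
        weighted = [a / (a + word_freq.get(w, 1) / total) * np.asarray(word_vectors[w])
                    for w in tokens if w in word_vectors]
        vecs.append(np.mean(weighted, axis=0) if weighted else np.zeros(dim))
    X = np.vstack(vecs)

    # Remove the common component shared by all sentence vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - X @ np.outer(u, u)
```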

Another approach is to train a sentence embedding model on a large corpus of sentences, either in combination with existing word embedding models and/or bi-gram representations, or through other mechanisms such as attention (as is the case with the Universal Sentence Encoder transformer-based model) (Cer et al., 2018). Popular models like the Universal Sentence Encoder and InferSent (Conneau, Kiela, Schwenk, Barrault, & Bordes, 2017) use the Stanford Natural Language Inference corpus (Bowman, Angeli, Potts, & Manning, 2015) for supervised training of the model, as well as other corpora such as Wikipedia or news feeds for unsupervised learning.

Because there are a lot of different methods of constructing a sentence embedding model, each with different results, the STS Benchmark was introduced as part of SemEval (Cer et al., 2017). This benchmark has been widely adopted in recent research on STS to indicate how new models compare to the state-of-the-art. The benchmark is based on a dataset of sentence pairs and a similarity score between 0 and 5, as determined manually by multiple human agents. A score of 0 means the sentences are completely dissimilar, while 5 means the sentences have exactly the same meaning. The performance of evaluated models is indicated as the Pearson correlation between the provided similarity scores and the corresponding outputs of the model.


3 Method

3.1 Dataset

Sentence 1: Nearly all countries grew quickly in the 1950s and 1960s for pretty much the same reasons.
Sentence 2: The USSR growth rate during the 50's was not exceptionally high.
Label: 1.0    Predicted: 0.289

Sentence 1: By the time that Qin had totally collapsed, Xiang Yu was China's hegemon.
Sentence 2: The Han Dynasty came after the Qin Dyansty, after the government under Shi Huang Di collapsed.
Label: 2.0    Predicted: 0.303

Sentence 1: In English, certainly the most common use of do is Do-Support.
Sentence 2: In traditional grammar, the word doing is a participle in all your examples.
Label: 0.6    Predicted: 0.322

Table 1: Sample of sentence pairs in the STS Benchmark dataset that are not similar, but evaluated as similar (false positives).

Sentence 1: Former company chief financial officer Franklyn M. Bergonzi pleaded guilty to one count of conspiracy on June 5 and agreed to cooperate with prosecutors.
Sentence 2: Last week, former chief financial officer Franklyn Bergonzi pleaded guilty to one count of conspiracy and agreed to cooperate with the government's investigation.
Label: 4.20    Predicted: 0.570

Sentence 1: Mr Pollard said: "This is a terrible personal tragedy and a shocking blow for James's family.
Sentence 2: Nick Pollard, the head of Sky News said: "This is a shocking blow for James's family.
Label: 3.40    Predicted: 0.312

Sentence 1: On Thursday, a Washington Post article argued that a 50 basis point cut from the Fed was more likely, contrary to the Wall Street Journal's line.
Sentence 2: On Thursday, a Post article argued that a 50 basis point cut from the Fed was more likely.
Label: 3.75    Predicted: 0.513

Table 2: Sample of sentence pairs in the STS Benchmark dataset that are similar and also evaluated as similar (true positives).


The data used for this experiment is the widely adopted SemEval 2012-2017 paraphrase data set. The data is subdivided into a training set, a development set and a test set. The training set is disregarded for this research; the development set is used for setting up the experiment environment and for an initial analysis of the data, and the test set is used for the actual error analysis. The development set consists of 1468 data rows and the test set consists of 1379 data rows. A sample of sentence pairs and label scores from the development set can be found in Table 1 and Table 2.

3.2 Universal Sentence Encoder

For this research, the results of the Universal Sentence Encoder are analyzed (Cer et al., 2018). The Universal Sentence Encoder research proposes two different models: a transformer type model (Vaswani et al., 2017) and a Deep Averaging Network (DAN) (Iyyer, Manjunatha, Boyd-Graber, & Daumé III, 2015). The transformer model achieves slightly better overall task performance on the STS benchmark. However, the DAN model has a time complexity of O(n) as opposed to O(n²) for the transformer model.

The transformer-based model uses context-aware word representations, constructed using attention in the encoding sub-graph of the transformer, to calculate the sentence embedding vector. The DAN model takes the average of word and bi-gram embeddings as input for training and passes this through a feedforward deep neural network. Both implementations take a PTB-tokenized string as input and return a 512-dimensional embedding vector. Both implementations of the Universal Sentence Encoder are made available by the authors as pre-trained models through TensorFlow Hub. Given its demonstrated superior overall task performance, including on STS tasks, the transformer-based model has been used for this research.
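A minimal sketch of loading the pre-trained encoder from TensorFlow Hub and embedding a few sentences is shown below; the module URL and version are assumptions (the "-large" module is the transformer variant), and the exact version may differ from the one used in the thesis.

```python
import tensorflow_hub as hub

# Load the pretrained transformer-based Universal Sentence Encoder
# (module URL/version assumed here).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

sentences = [
    "The man painted the car.",
    "The man crashes the car.",
]

# Each sentence is mapped to a 512-dimensional embedding vector.
embeddings = embed(sentences)
print(embeddings.shape)  # (2, 512)
```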

3.3 Sentence pair features

The aim is to find a pattern in the false positives following from the paraphrase identification task, using the Universal Sentence Encoder for sentence embedding. In order to achieve this, features need to be derived from the sentence pairs. The following features are analyzed in this research:

• Unigram word overlap
• Bigram word overlap
• Trigram word overlap
• Unigram Part-of-Speech tag overlap
• Bigram Part-of-Speech tag overlap
• Trigram Part-of-Speech tag overlap
• Differences in frequency of Part-of-Speech tags between sentence pair
• Mean of frequency of different Part-of-Speech tags between sentence pair
• Difference in number of tokens between sentence pair
• Difference in string lengths between sentence pair
• Mean of number of tokens of sentence pair
• Mean of string lengths of sentence pair
• Cosine similarity between averages of word2vec embeddings

The n-grams, Part-of-Speech tags and string tokens are determined using the Natural Language Toolkit (NLTK) (Loper & Bird, 2002). The Part-of-Speech tags are represented in the Universal Tagset format (see Table 3) (Petrov, Das, & McDonald, 2011). The pretrained word2vec implementation used is from Gensim and uses the Google News 300-dimensional vectors provided by the authors of the original model (Řehůřek & Sojka, 2010).

Tag Meaning Examples

ADJ adjective good, special, bad, remote

ADP adposition on, at, into, by, with

ADV adverb now, already, still, already

CONJ conjunction but, if, while, and, or, although

DET determiner, article the, a, most, every, no, which, some

NOUN noun house, Germany, time, car

NUM numeral thirty-four, sixteenth, 2020, 12:15

PRT particle to (fly), (look) up, (knock) out

PRON pronoun he, his, her, she, their, my, I, its, us

VERB verb say, told, given, walking, is, would

. punctuation . , ; !

X other typos, unofficial abbreviations etc.

Table 3: POS tags in the Universal Tagset
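As an illustration of how a few of these features can be computed with NLTK, the sketch below derives token counts, string lengths and per-tag frequency means and differences for a single sentence pair; the feature names are illustrative and not taken verbatim from the thesis code.

```python
import nltk
from collections import Counter

# One-time downloads: tokenizer, POS tagger and the universal tagset mapping.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

def pair_features(sent1: str, sent2: str) -> dict:
    """Token counts, string lengths and Universal POS tag frequency features
    (means and absolute differences) for one sentence pair."""
    tokens1, tokens2 = nltk.word_tokenize(sent1), nltk.word_tokenize(sent2)
    tags1 = Counter(tag for _, tag in nltk.pos_tag(tokens1, tagset="universal"))
    tags2 = Counter(tag for _, tag in nltk.pos_tag(tokens2, tagset="universal"))

    features = {
        "dtokens": abs(len(tokens1) - len(tokens2)),
        "mtokens": (len(tokens1) + len(tokens2)) / 2,
        "dstrlen": abs(len(sent1) - len(sent2)),
        "mstrlen": (len(sent1) + len(sent2)) / 2,
    }
    for tag in set(tags1) | set(tags2):
        features[f"d_{tag}"] = abs(tags1[tag] - tags2[tag])
        features[f"m_{tag}"] = (tags1[tag] + tags2[tag]) / 2
    return features
```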

3.4 Experiment setup

3.4.1 Predicting similarity scores

An experimental environment has been made available on the TensorFlow Hub page, which demonstrates an evaluation on the STS Benchmark. The pre-trained transformer-based Universal Sentence Encoder model can be loaded from TensorFlow Hub in Python. The STS Benchmark dataset is also publicly available for download and use. The sentence pairs and similarity scores are loaded from the STS Benchmark file and stored in a Pandas DataFrame. Feeding the sentence pairs to the embedding model results in vector representations of the sentences. These vectors are normalized (using L2 normalization), after which the cosine similarity is determined for each of the pairs. These scores are clipped to fit within the range -1.0 to 1.0 and then stored in a new column for the estimated scores in the original DataFrame. The label similarity scores and (scaled) predicted similarity scores are plotted in Figure 1, and predicted scores for a sample of sentence pairs are given in Tables 1 and 2.
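A sketch of this scoring step, under the assumption of a DataFrame with columns sent_1, sent_2 and score (the gold 0-5 similarity), could look like the following; the module URL and column names are assumptions.

```python
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

# Assumed layout: one row per STS Benchmark pair with its gold score.
df = pd.DataFrame({
    "sent_1": ["A man is playing a guitar.", "A woman is slicing an onion."],
    "sent_2": ["A man plays the guitar.", "A man is riding a horse."],
    "score": [4.8, 0.4],
})

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

# Embed both columns, L2-normalize, take the row-wise cosine similarity
# (dot product of unit vectors) and clip it to [-1, 1].
emb1 = tf.nn.l2_normalize(embed(df["sent_1"].tolist()), axis=1)
emb2 = tf.nn.l2_normalize(embed(df["sent_2"].tolist()), axis=1)
df["predicted"] = np.clip(tf.reduce_sum(emb1 * emb2, axis=1).numpy(), -1.0, 1.0)

# Benchmark metric: Pearson correlation between predicted and gold scores.
print(df["score"].corr(df["predicted"], method="pearson"))
```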



Figure 1: Scaled distribution of label scores (blue) and predicted scores (yellow)

3.4.2 Classifying results

In order to analyze the false positives, the results need to be classified into the four result types: true positives, true negatives, false positives and false negatives. The mid-range is used as the tipping point to discriminate between "similar" and "dissimilar" sentence pairs. For the true scores, this is 2.5. For the estimated scores, the mid-range value is approximately 0.178. To suppress trivial inaccuracies around the mid-range, all sentence pairs for which both the evaluated score and the true score lie within a 5% deviation from the respective mid-range are ignored. To make it easy to select the rows in a certain result category, four Boolean columns are added, one per category, indicating whether the sentence pair falls into that category.
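A sketch of this classification step, reusing the df from the previous sketch, is given below; the 5% exclusion band is implemented as 5% of each mid-range value, which is one possible reading of that rule, and the column names are assumptions.

```python
# Mid-range thresholds for the gold scores (scale 0-5) and the predicted scores.
label_mid = 2.5
pred_mid = (df["predicted"].min() + df["predicted"].max()) / 2  # ~0.178 in the thesis

# Drop pairs where both scores sit within 5% of their respective mid-range.
near_label = (df["score"] - label_mid).abs() <= 0.05 * label_mid
near_pred = (df["predicted"] - pred_mid).abs() <= 0.05 * abs(pred_mid)
df = df[~(near_label & near_pred)].copy()

# Boolean result-category columns.
gold_pos = df["score"] > label_mid
pred_pos = df["predicted"] > pred_mid
df["TP"] = gold_pos & pred_pos
df["TN"] = ~gold_pos & ~pred_pos
df["FP"] = ~gold_pos & pred_pos
df["FN"] = gold_pos & ~pred_pos
```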

3.4.3 Calculating sentence pair features

Extraction of the aforementioned sentence pair features is achieved by iterating over all rows in the DataFrame. This is done in a function that takes a DataFrame of sentence pairs as input and returns an updated DataFrame extended with the features stored in the designated columns. For each row, the two sentences are first converted to lists of tokens. Subsequently, these lists of tokens are used to determine the Universal Part-of-Speech tags. This is done using built-in functionality of NLTK.

For calculating the n-gram overlaps for both words and Part-of-Speech tags, the lists of word tokens and lists of Part-of-Speech tags are used as inputs for a helper function. This helper function takes two tokenized strings (unigrams) as input and returns the uni-, bi- and trigram overlaps. These values are obtained by first generating lists of bi- and trigrams using built-in functionality of NLTK. The n-grams are stored in frequency tables (Counter objects in Python), and the multiset intersection of the two Counters is taken using the & operator, which keeps each n-gram with the minimum of its two counts; the total count of the resulting intersection corresponds to the overlap.
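A sketch of such a helper, assuming NLTK token lists as input:

```python
from collections import Counter
from nltk.util import ngrams

def ngram_overlaps(tokens1, tokens2):
    """Uni-, bi- and trigram overlap between two token lists, computed as the
    size of the multiset intersection of their n-gram Counters."""
    overlaps = {}
    for n in (1, 2, 3):
        counts1 = Counter(ngrams(tokens1, n))
        counts2 = Counter(ngrams(tokens2, n))
        # Counter & Counter keeps each n-gram with the minimum of its two counts.
        overlaps[n] = sum((counts1 & counts2).values())
    return overlaps

# Example: high unigram overlap despite different meanings.
print(ngram_overlaps("the man painted the car".split(),
                     "the man crashes the car".split()))
```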


The number of occurrences of each unique Part-of-Speech tag is obtained using Counters. The means and differences are each stored in separate columns in the DataFrame for each tag. Furthermore, the means of and differences in number of tokens and sentence length are trivially determined using the length of the list of tokens and the string length of the sentences respectively.

Finally, for both sentences, the word tokens are embedded with the Gensim implementation of word2vec, using the pretrained Google News 300-dimensional vectors. Words that are not in the vocabulary of the model are ignored. Per sentence, the average of all word embedding vectors is taken. The value stored in the DataFrame is the cosine similarity between these two resulting vectors.
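A sketch of this feature, assuming a recent Gensim version and its downloader API for the pretrained vectors:

```python
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained Google News vectors

def word2vec_average(tokens):
    """Average of the word2vec vectors of in-vocabulary tokens (OOV words skipped)."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def average_cosine(tokens1, tokens2):
    """Cosine similarity between the two sentence-level word2vec averages."""
    u, v = word2vec_average(tokens1), word2vec_average(tokens2)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0
```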


4 Experiment results

4.1 Word n-gram overlap


Figure 2: Distribution of uni-, bi- and trigram overlap of words per result category (true negatives is blue, false negatives is orange, false positives is green, true positives is red).

       Word unigram   Word bigram   Word trigram
FP     5.811          3.243         1.811
TP     7.755          4.744         3.050
FN     7.045          3.545         2.107
TN     4.666          1.687         0.833

Table 4: Average overlap of uni-, bi- and trigrams of words between sentence pairs per result category.

The means of the word n-gram overlaps are shown in Table 4. The distributions of overlap rates are plotted in Figure 2. Ground-truth positive sentence pairs on average have a higher n-gram overlap (especially unigram overlap) than the ground-truth negative sentence pairs. A high word overlap could indicate a similar meaning; however, this does not have to be the case: "the man painted the car" and "the man crashes the car" have a high word overlap but entirely different meanings.


4.1.1 Reduced bias on word overlap in test set

In order to test the effect of suppressing the word overlap bias in the test set (reducing the probability that the model relies heavily on word overlap as a means of determining semantic similarity), all sentence pairs with a word unigram overlap of 6 or higher were ignored. Comparing the remaining predicted scores to the label scores yielded a higher performance score for the model on the benchmark: 0.832 as opposed to 0.810 on the original test set.
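Assuming the DataFrame carries a word_1gram overlap column computed with the helper above, this filtering step amounts to the following:

```python
# Drop the pairs that reinforce the word-overlap bias and re-score the benchmark.
unbiased = df[df["word_1gram"] < 6]
print(len(unbiased), unbiased["score"].corr(unbiased["predicted"], method="pearson"))
```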

4.2 Part-of-Speech tag n-gram overlap


Figure 3: Distribution of uni-, bi- and trigram overlap of Part-of-Speech tags per result category (true negatives is blue, false negatives is orange, false positives is green, true positives is red).

       Tag unigram   Tag bigram   Tag trigram
FP     8.108         5.649        3.946
TP     9.678         7.295        5.009
FN     10.098        6.964        4.304

Table 5: Average overlap of uni-, bi- and trigrams of Part-of-Speech tags between sentence pairs per result category.


The means of the tag n-gram overlaps are shown in Table 5. The distributions of overlap rates are plotted in Figure 3. At first glance, the average tag overlaps appear to be roughly proportional to the word overlaps. The correlation between the Part-of-Speech tag overlap and the word overlap is indeed strong: the Pearson correlation between the unigram overlaps is 0.920, for the bigram overlaps it is 0.821, and for the trigram overlaps it is 0.793. Consequently, it is difficult to say whether the model is aware of the Part-of-Speech of each word and represents this information in the sentence embedding vector.

4.3 Part-of-Speech tag frequencies

        FP      TP      FN      TN
VERB    0.858   0.875   0.805   0.711
NOUN    1.135   1.779   2.098   1.920
PRON    0.176   0.329   0.616   0.568
ADJ     0.568   0.726   0.790   0.426
ADV     0.203   0.190   0.375   0.212
ADP     0.973   1.142   1.036   0.788
CONJ    0.203   0.173   0.170   0.110
DET     0.770   1.293   1.393   1.330
NUM     0.662   0.302   0.353   0.169
PRT     0.162   0.281   0.299   0.265
X       0.027   0.001   0.027   0.016
.       0.784   1.121   1.670   1.029

Table 6: Average means of the number of occurrences of the various Part-of-Speech tags per result category.

        FP      TP      FN      TN
VERB    0.324   0.528   0.536   0.529
NOUN    0.676   0.723   1.196   0.780
PRON    0.027   0.095   0.214   0.214
ADJ     0.324   0.299   0.438   0.302
ADV     0.135   0.109   0.214   0.142
ADP     0.324   0.392   0.375   0.391
CONJ    0.027   0.084   0.089   0.085
DET     0.243   0.308   0.464   0.397
NUM     0.081   0.104   0.188   0.072
PRT     0.108   0.122   0.188   0.133
X       0.054   0.002   0.018   0.006
.       0.162   0.247   0.589   0.175

Table 7: Average differences in the number of occurrences of the various Part-of-Speech tags per result category.

The average means of and differences in the number of occurrences of each Part-of-Speech tag per sentence pair are shown in Tables 6 and 7 respectively. If any of the tags were distributed such that the predicted positives differed significantly (e.g. by at least one full occurrence) from the predicted negatives on average, that could suggest the model is sensitive to that Part-of-Speech, and its weight could be decreased in order to reduce the number of misclassifications. Since this is not the case for these results, no bias can be ascertained from the tag frequencies.

4.4 Sentence lengths

The average means and differences of string lengths and number of tokens per sentence pair are shown in Table 8 and the distributions are plotted in Figure 4. On average, the ground-truth positives consist of significantly more characters and tokens than the ground-truth negatives. Longer sentences may contain more information, however, this does not explicitly mean the sentences are more likely to be semantically similar. From the means and distributions, it does not seem that the Universal Sentence Encoder is sensitive to the numbers of tokens or string lengths. Rather, this seems to be another bias in the STS Benchmark test set.



Figure 4: Distributions of (from left to right, top to bottom) mean of number of tokens, difference in number of tokens, mean of sentence string lengths and difference in sentence string lengths. Plotted per result category (true negatives is blue, false negatives is orange, false positives is green, true positives is red).

                                   FP       TP       FN       TN
Average number of tokens           9.811    11.493   12.888   9.907
Difference in number of tokens     1.459    1.961    2.795    1.979
Average string length              49.838   57.398   62.991   44.467
Difference in string length        8.649    10.896   16.500   10.249

Table 8: Average differences and means of sentence length metrics per result category.

4.4.1 Reduced bias on sentence length in test set

In order to test the effect of suppressing the sentence length bias in the test set (reducing the probability that the model relies heavily on sentence length as a means of determining semantic similarity), all sentence pairs with a mean string length of 45 characters or higher were ignored. Comparing the remaining predicted scores to the label scores yielded a higher performance score for the model on the benchmark: 0.835 as opposed to 0.810 on the original test set.

4.5 Cosine similarity between word2vec averages

0.711 for true negatives. The distributions of the predicted positives have similar peaks, as is the case for the predicted negatives. The means are not very different, but there is a clear distinction: Figure 5 shows that the distributions of the predicted negatives both have a significantly larger deviation than those of the predicted positives.


Figure 5: Distributions of cosine similarities of word2vec embedding averages per result category (true negatives is blue, false negatives is orange, false positives is green, true positives is red).

This suggests the Universal Sentence Encoder makes similar mistakes to a simple word-averaging model, albeit far less often. The Pearson correlation between the word averages' cosine similarities and the scores predicted by the Universal Sentence Encoder is 0.677, and the correlation of these word average similarities with the label scores is 0.598 (as opposed to 0.810 for the Universal Sentence Encoder).

4.6 Unbiased test set benchmark scores

STS Benchmark test set variation             Pearson r   p-value      Rows
Original                                     0.810       1.196e-321   1379
Word unigram overlap < 6                     0.832       1.359e-187   728
Average string length < 45                   0.835       3.542e-189   724
Both biases removed from test set            0.843       3.431e-179   659
720 random samples removed from test set     0.810       5.252e-132   659

Table 9: Results on the original STS Benchmark test set, and for variations with biases and random samples removed from the test set.

Removing the bias for word unigram overlap and the bias for average string length each individually resulted in an improved correlation between the remaining predicted scores and the label scores. Removing both biases results in an even better benchmark score: 0.843. The number of samples is reduced by 720. In order to ensure the increased correlation is not due to the decreased number of sentences, an experiment in which 720 random samples were removed and the correlation was measured was repeated 1000 times. The average correlation score following from removing random samples is 0.810. The results for each test set variation are shown in Table 9.
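A sketch of that control experiment, reusing the score and predicted columns assumed in the earlier sketches:

```python
import numpy as np

# Remove the same number of random pairs 1000 times and average the resulting
# Pearson correlations, to check that shrinking the test set alone does not
# explain the higher score of the bias-filtered sets.
n_removed = len(df) - 659  # 720 pairs, matching the doubly-filtered test set size

scores = []
for _ in range(1000):
    kept = df.drop(df.sample(n=n_removed).index)
    scores.append(kept["score"].corr(kept["predicted"], method="pearson"))

print(np.mean(scores))  # stays close to the original 0.810 in the thesis
```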


5 Conclusion

In order to identify a pattern in one of the features that indicates over-sensitivity to that feature, the feature would have to show significantly different distributions per result category. In other words, the feature's values would have to be distributed clearly differently for the predicted positives than for the predicted negatives.

Of all the word n-gram overlaps, the unigrams provide the most distinguishable distributions. However, for all word n-grams there does not seem to be a considerable amount of confusion, since the ground-truth classifications are more consolidated than the predicted classifications (i.e. the false positives and true negatives are close, the true positives and false negatives are close, and those two clusters have different means). As such, nothing substantial can be concluded in terms of over-sensitivity of the model to this feature.

Looking at the Part-of-Speech n-gram overlaps, the distributions appear to be comparable to those of the word overlaps. Similarly, the ground-truth classifications are more consolidated than the predicted classifications. Had a pattern in the errors followed from these features, it could have implied that the model wrongly predicts that sentences are similar if their composition in terms of word types is similar. Because there is no clear confusion on this feature, this is not the case. Furthermore, because the tags are a hidden property of each sentence token and the correlation with the tokens (word n-grams) is considerably strong, it is difficult to conclude anything about this feature at all.

On top of evaluating the average n-gram overlap rates of Part-of-Speech tags, the number of occurrences is evaluated per tag. If one of these tags showed a pattern in the misclassified sentence pairs, that could suggest that the model values the average of or difference in the number of occurrences of a certain Part-of-Speech, such as nouns or verbs, over others. However, the model does not seem to be aware of this feature. The fact that neither the frequencies nor the n-gram overlaps of Part-of-Speech tags provide a conclusive pattern suggests the Universal Sentence Encoder does not represent Part-of-Speech information at all in its sentence embeddings.

For both the average string lengths and the average number of tokens, the distributions are again more consolidated for the ground-truth classifications than for the predicted classifications. Furthermore, the distributions of the differences in string length and in number of tokens overlap to a large extent. As such, the errors following from the representations produced by the Universal Sentence Encoder do not seem to stem from sentence length features.

From the word2vec averages, it can be derived that when a sentence pair embedded by the Universal Sentence Encoder is misclassified as similar, the cosine similarity of its word2vec embedding averages is very close to that of the true positives. This underlines the importance of not relying solely or primarily on unstructured word embeddings, but rather incorporating syntactic and other information as well.

This means that none of the analyzed features show a conclusive pattern that could be used to enhance the training method of the Universal Sentence Encoder. However, some interesting characteristics were identified that relate to the benchmark dataset.

The analysis of word n-gram overlap indicates that the ground-truth positives (true positives and false negatives) have a higher word n-gram overlap than the ground-truth negatives. The model does not seem to be sensitive to this feature; after all, a high word overlap does not have to mean the sentences are semantically similar. The ground-truth negatives are distributed around an overlap of approximately 5 words with a small deviation, and the ground-truth positives are distributed around an overlap of approximately 7 words with a larger deviation. If this bias is not taken into account by the benchmarked model, it negatively affects the benchmark score. Conversely, if the model were to take this bias into account, it would perform better on the benchmark even though this is not representative of reality.

The same can be said for the Part-of-Speech tag overlap; however, because it is highly correlated with the word overlap, it cannot convincingly be concluded that this is another bias in the dataset. A similar pattern does occur for the sentence lengths. Both the average number of tokens and the number of characters (string length) are higher for ground-truth positives than for ground-truth negatives. Especially the false negatives are distributed with a larger deviation, on both the means and the differences, compared to the other result categories. This could imply that the model is, for example, inclined to categorize a sentence pair as dissimilar if the sentences are longer or if the difference in sentence length is larger. In order to draw such a conclusion, however, the bias in sentence lengths in the benchmark dataset would first have to be completely removed.

For both the word overlap and sentence length biases, removing the sentence pairs that reinforce the bias from the test set increases the benchmark score, and removing both biases results in an even better benchmark score. Although the resulting test set is smaller, this does not seem to be the cause of the increase in correlation. The resulting score even surpasses that of most other state-of-the-art models.


6 Discussion and future work

By analyzing the features proposed in this research, no evident patterns in word and Part-of-Speech n-grams, sentence lengths, or word2vec averages were found in the sentence pairs that are misclassified by the Universal Sentence Encoder. To continue this research, the experiment could be repeated on another suitable test set. A suitable test set would be an unbiased, large corpus of sentence pairs that is not used for training the model and that contains not a binary paraphrase flag per sentence pair but a score range. Additionally or alternatively, new features could be introduced that have not yet been analyzed in this research. Modifying the input sentences to exclude stop words could also impact performance; however, it could be argued that these contribute to the semantic value of the sentences.

Following from this research, two biases are identified in the benchmark dataset. Removing either bias from the test set positively affects the Pearson correlation between the predicted similarity scores and the label scores, and removing both biases results in an even better score. The possibility that this is caused by the smaller size of the sample set is ruled out. However, the p-values are much higher for all but the original test set, which means the probability that a set such as the enhanced test set is generated at random is relatively higher. It could be argued that this makes the enhanced test sets less representative of general-purpose STS tasks. Alternatively, the benchmark could be augmented to include weights, with higher weights for sentence pairs that do not confirm the biases. This way, the test set would retain the same size and would still include samples with a higher average sentence length or word unigram overlap; after all, such samples do occur in realistic scenarios.

On the other hand, it could be argued that the biases found in the test set are, in fact, representative of how semantic textual similarity is determined by humans. If that were the case, the existence of these biases in the benchmark data would be well justified. To confirm this, an extensive (neuro)linguistic study would have to be set up that analyzes how humans classify sentence pairs semantically, while taking into account factors such as primary language, secondary language, vocabulary and culture, among other things. If the biases in the dataset are confirmed to be unjustified, ideally an update to the benchmark could be proposed that is unbiased and is extended with new sentence pairs.


References

Arora, S., Liang, Y., & Ma, T. (2016). A simple but tough-to-beat baseline for sentence embeddings.

Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. CoRR, abs/1508.05326 . Retrieved from http://arxiv.org/abs/1508.05326

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017, August). SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/S17-2001 doi: 10.18653/v1/S17-2001

Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., St. John, R., . . . Kurzweil, R. (2018). Universal sentence encoder. In submission to: EMNLP demonstration, Brussels, Belgium. Retrieved from https://arxiv.org/abs/1803.11175

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364 . Retrieved from http://arxiv.org/abs/1705.02364

Ferrero, J., Besacier, L., Schwab, D., & Agnès, F. (2017, August). CompiLIG at SemEval-2017 task 1: Cross-language plagiarism detection methods for semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 109–114). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/S17-2012 doi: 10.18653/v1/S17-2012

Iyyer, M., Manjunatha, V., Boyd-Graber, J., & Daumé III, H. (2015, July). Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1681–1691). Beijing, China: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P15-1162 doi: 10.3115/v1/P15-1162

Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics.

Mandal, A., Chaki, R., Saha, S., Ghosh, K., Pal, A., & Ghosh, S. (2017, November). Measuring similarity among legal court case documents (pp. 1–9). doi: 10.1145/3140107.3140119

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Pagliardini, M., Gupta, P., & Jaggi, M. (2018). Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Retrieved from http://dx.doi.org/10.18653/v1/N18-1049 doi: 10.18653/v1/n18-1049

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).

Petrov, S., Das, D., & McDonald, R. (2011). A universal part-of-speech tagset.

Řehůřek, R., & Sojka, P. (2010, May 22). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta, Malta: ELRA. (http://is.muni.cz/publication/884893/en)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polo-sukhin, I. (2017). Attention is all you need.

Wang, Y., Afzal, N., Fu, S., Wang, L., Shen, F., Rastegar-Mojarad, M., & Liu, H. (2018, October). MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation. Retrieved from http://dx.doi.org/10.1007/s10579-018-9431-1 doi: 10.1007/s10579-018-9431-1
