
Determination and reduction of Bias in length and word overlap in Semantic Textual Similarity

Jop Keuning 11014407

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: MSc. A. Soleimani Vilage, AI informatics institute

Faculty of Science University of Amsterdam

Science Park 608B 1098 XH Amsterdam


1 Abstract

In this thesis, three methods are tested to reduce the bias towards the difference in sentence length and the word overlap of sentence pairs when using the Google Universal Sentence Encoder model. These three methods consist of removing the last words in a sentence, removing random words, and removing words based on their universal POS-tag. Because only limited data is available for Semantic Textual Similarity, the goal is to reduce the bias while keeping the data set the same size. The three methods show varying degrees of success, but the most promise is shown by the method that removes words based on their POS-tags.


Contents

1 Abstract
2 Introduction
3 Related work
4 Method
  4.1 Bias Identification
  4.2 Bias Reduction
    4.2.1 Last Word Removal
    4.2.2 Random Word Removal
    4.2.3 Word Removal based on POS-tags
  4.3 Embedding Model
  4.4 Data
    4.4.1 STS Benchmark
    4.4.2 Quora sentence pairs
5 Results
  5.1 Bias Reduction on STS benchmark
    5.1.1 Control group, no bias reduction results
    5.1.2 Last Word Removal
    5.1.3 Random Word Removal
    5.1.4 Word Removal based on POS-tags
  5.2 Bias Reduction on Quora sentence pairs
    5.2.1 Control group, no bias reduction results
    5.2.2 Last Word Removal
    5.2.3 Random Word Removal
    5.2.4 Word Removal Based on POS-tags
6 Conclusion
7 Discussion
References


2 Introduction

In Natural Language Processing (NLP) there exists a field of study that concerns the similarity between different sentences. It might be easy for a human to discern both the differences and similarities between two sentences, but for a learning agent this is difficult. To perform these tasks, called Semantic Textual Similarity (STS) tasks, an automatic agent uses word embedding models to compare two sentences and find the similarities between them. The identification of similar sentences can be used for a variety of problems, such as comparing the similarity of texts or sentences to each other [1] [2]. This allows companies such as Quora to see whether a question asked has already been answered on their website, or allows search engines to find results that correspond to a query but would otherwise not have been found because the page uses many synonyms of the words in the query. Sentences can also be compared for similarity across different languages [3] [4], which can be used for machine translation between those languages.

To perform such a task, models such as GloVe [5], sent2vec [6] and unsupervised random walk [7] employ different methods to embed the sentences, and these embeddings are used to identify the similarity between sentences. All these models use a data set to train and test on for the specified task. These data sets can vary in what they contain, but one of the most commonly used is the STS Benchmark [8]. The STS Benchmark is a data set consisting of pairs of sentences which are in some form similar to each other. However, as in most data sets for problems solved with learning models, a bias may be present in the data. Here the concern lies with a specific type of bias: selection bias towards a specific feature within the data. The two features examined here, found to be present in earlier research [9], are the difference in string length and the n-gram word overlap between two sentences. The n-gram word overlap refers to a set of n words (a uni-gram being one word) that occurs in both sentences. This bias can cause a learning model that classifies sentences as similar or dissimilar to misclassify sentences, as the focus lies heavily on these two features. Take a hypothetical sentence pair in which one sentence has a particular meaning and the other has the same meaning but uses synonyms: the word overlap would be extremely low, but the sentences would nonetheless be similar. If the model is trained on data in which the bias is present, it will classify this sentence pair as dissimilar, even though on other fronts they can be seen as similar, which they are. The same is true for a hypothetical sentence pair with a large difference in string length, meaning one sentence is significantly longer than the other. If the extra words make a sentence longer but add nothing to its meaning, the model would classify the pair as dissimilar, even though the sentences are in fact similar.

The goal of this paper is to identify whether these biases are present in the data and to remove them as much as possible to increase the performance of the model. This is done by implementing three methods, each aiming to reduce both biases. The first of these methods, the Last Word Removal method, removes the last word of the longer sentence until the two sentences are roughly equal in length. This way, the difference in length between sentences classified as positive and those classified as negative becomes negligible, and the model will not learn this feature as an indicator of a sentence pair being similar or dissimilar. Similarly, the Last Word Removal method reduces the word overlap more the longer the sentence is: if multiple words are removed from the longer sentence, the chance that a removed word occurs in both sentences increases. This means that for sentence pairs labelled as similar the word overlap should be lowered, so the reliance on that feature to classify a sentence pair as similar or dissimilar is lowered as well.

The second method works in a similar way to the first, but instead of removing the last word it picks a random word in the sentence. This has the same effect on the sentence lengths of the pair but works slightly differently on the word overlap feature. Because words are picked from the sentence at random, words in the middle of the sentence can also be removed, which increases the chance that a removed word reduces the word overlap and should therefore reduce the word overlap bias overall.

The third method makes a small but significant change to the second: it also allows words in the middle of the sentence to be removed, but uses the corresponding POS-tags to determine which words should be removed. This way, the first words to be removed can be those with POS-tags that contribute less to the meaning of the sentence, which should keep the performance of the model high.

A large advantage of these methods, compared to simply removing the sentences which contribute to the bias towards particular features, is that they deal with a known problem within STS tasks: there is only a limited amount of data available for research and development. If that limited data is reduced even further, there is a possibility that so little remains that the learning model can no longer be said to give any significant output. With these methods, the amount of data remains the same while still aiming to reduce any bias present.


3 Related work

STS tasks concern themselves with finding similarities between sentences for certain goals. An example of this is when Quora provided a public data set for researchers with the ultimate goal of finding duplicate questions, thereby reducing the number of question pages needed on the website. To achieve this goal, the data needs to be represented in a way that allows the model to compare sentences. This is done by representing the sentences in the data as vectors with numeric values pertaining to the words in them. Most research makes use of established methods for representing sentences as vectors, such as GloVe [5] or Word2Vec [10], because these result in a higher accuracy than starting from random numbers aligned to words or most other methods used for word embedding. GloVe stands for Global Vectors; the model makes use of co-occurrence probabilities, the chance that a word in the corpus appears in the context of another word, in combination with a weighted least squares problem to determine the vector space for each word. Unlike GloVe, word2vec makes use of a hierarchical softmax, where a binary tree representation is used in combination with a skip-gram model to determine the vector space. Both models return an N-dimensional vector representation for a word within the corpus.

In STS tasks, the comparison is often made not between only two words but between whole sentences. To perform this task the entire sentence needs to be represented as a vector. Several models have been proposed over the years. One of these is the Sent2vec unsupervised model [6], which takes the average of the word embeddings in the sentence for both uni-grams and n-grams and combines this with the negative sampling method [10] for missing word prediction. This method outputs a single n-dimensional vector representation for the entire sentence.

A different model than Sent2vec that is often used as a baseline is the Smooth Inverse Frequency (SIF) method [11]. This method works by taking the weighted averages of the word vectors in a sentence and combining this with the removal of the projections of the average vectors on the first singular vector, a method known as common component removal. In this method the weighting is done by taking a parameter and applying that to the established word frequency.

Both these methods are unsupervised models for sentence embedding; an alternative is a supervised Long Short Term Memory (LSTM) model [12]. This model uses a tree-structured LSTM where the vector representation of each word in the sentence is used as the input value for the model. The tree-structured LSTM takes the input and a hidden layer, which are used to create memory cells, and information is left out of the memory cells by using a forget gate. In a tree-structured LSTM the forget gate is dependent on the number of child nodes, allowing the model to forget and remember certain aspects of the task for the different child nodes.

There are two reasons for a learning bias to occur within a model. The first is general machine learning bias. In one form of this bias, the learning model has a single hypothesis for the classification of a point in the data and all other hypotheses are eliminated; this is called absolute bias and leads to the model only classifying data entries if they exactly match the data set. Another form of machine learning bias, relative bias, is a bit more fluid. In relative bias, certain hypotheses are preferred over other hypotheses. This means that if a data point contains two features, each learned by the model for a different classification, it will classify the data point with the hypothesis it prefers, which is then a biased classification [13].

The second reason for bias is when the model itself has no noticeable absolute or relative bias but the data used to train the model contains the bias. In this case, the data has a feature that is mostly present in the training cases for one class and little to not at all present within the other classes. In STS tasks, biased data is most prominent in two forms. The first of these is semantic bias, where there is a bias present within the meaning of the data. This form of bias is most relevant in tasks where the meaning of the sentence is relevant to the outcome. One example is gender bias [14]. The goal is to identify and reduce, if not outright remove, these biases, which can be done by taking the gender information obtained from the biased vectors and other gender definition vectors and then subtracting the vector containing this gender information from the original word vector. This method is called Half-Sibling Regression and achieves an improvement of around 15% in classifying STS tasks when compared to a model on which this is not performed.

Another method for identifying bias is to compare the word vector of a certain association, such as positive or negative, with the word vector the bias is to be tested on. This comparison is done by taking the word vectors for both words and taking their dot product [15]. The method proposed in the same paper for removing this identified bias is to minimize the association in the word embedding towards a particular bias using orthogonal projection [15]. For gender bias, this works by taking the word vectors of both the masculine pronoun and its opposite, the feminine pronoun. It then finds the word embeddings containing gender bias using the previously mentioned bias identification and multiplies them with the orthogonal gender vector. This brings the word embedding to a more gender-neutral space where the gender leaning is around neutral. When this method was tested on female and male names, the association of work with male and of family with female decreased by 37.5%.

However, these methods focus mostly on removing the bias present within the semantic meaning of the sentences, which is less relevant for this thesis, as the focus here lies on the learning bias of the model, a form of over-fitting. This type of bias, often referred to as selection bias, occurs when a learning model learns a particular feature in the data set and uses it disproportionately to identify the class of a data point. A method for identifying which features of the data set can lead to selection bias is to look at the output of a learning model with different features of the data selected each time and to compare those results. If these results show that the expected value of the probability of a data point belonging to a certain class differs from the expected value of the same probability over the whole data set without feature selection, then selection bias is present [16].


4 Method

Figure 1: Method Diagram

4.1 Bias Identification

The first step is to determine whether the bias towards the difference in string length and towards the n-gram word overlap is present within the unaltered model and how large it is. To achieve that, a method from previous research is used [9]. This method works by first inserting the unaltered data into the model and then calculating the features, steps one and two in Figure 1. The calculation of the two features, as stated in the previous research, is done by first taking the predicted scores achieved by the model and the labelled scores of the data and using them to assign each sentence pair to be true positive, true negative, false positive or false negative. This is done by taking the mid-range of both the labels and the predicted scores, which for the STS Benchmark data lies at 2.5 for the labels and around 0.178 for the predicted scores, and for the Quora data at 0.5 for the labels and around 0.149 for the predicted scores. If a sentence pair scores higher than both the mid-range for the labels and for the predicted scores, it is classified as true positive; if it scores lower on both accounts it is classified as true negative. If the score is higher than the mid-range of the labels but lower than that of the predicted scores, it is classified as false negative, and if it is the other way around, with the predicted score above the mid-range and the labelled score below it, it is classified as false positive.
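As an illustration, the sketch below shows one way this assignment step could be implemented in Python, assuming the labelled scores and the predicted scores are available as two parallel lists; the function names and the exact tie handling are assumptions for this example, not the thesis code.

```python
def mid_range(values):
    """Mid-range of a list of scores: halfway between the minimum and maximum."""
    return (min(values) + max(values)) / 2


def classify_pairs(labels, predictions):
    """Assign each sentence pair to TP, TN, FN or FP by comparing its labelled
    and predicted score against the respective mid-ranges."""
    label_mid = mid_range(labels)       # e.g. 2.5 for the STS Benchmark labels
    pred_mid = mid_range(predictions)   # e.g. around 0.178 for the predicted scores
    outcomes = []
    for label, pred in zip(labels, predictions):
        if label >= label_mid and pred >= pred_mid:
            outcomes.append("TP")
        elif label < label_mid and pred < pred_mid:
            outcomes.append("TN")
        elif label >= label_mid and pred < pred_mid:
            outcomes.append("FN")
        else:
            outcomes.append("FP")
    return outcomes
```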

The following step, step two in Figure 1, is to calculate the values of the difference in string length and the n-gram word overlap for uni-, bi- and tri-grams. The difference in string length is calculated simply by taking the length of both sentences and subtracting the length of the second sentence in the pair from the length of the first. If this results in a negative outcome, meaning the second sentence is longer, the difference is multiplied by -1 to obtain a positive value.

The n-gram word overlap is calculated by using the word tokenization of the nltk library to turn both sentences into lists of tokens. After this, the two lists are compared, and each time a word occurs in both lists of tokens, the word overlap count is increased by one.
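A minimal sketch of both feature calculations is given below. The uni-gram counting follows the description above; generalising the overlap to bi- and tri-grams by counting consecutive token n-grams that occur in both sentences is an assumption for this example.

```python
from nltk.tokenize import word_tokenize  # requires the nltk "punkt" resource


def string_length_difference(sent1, sent2):
    """Absolute difference in character length between the two sentences."""
    return abs(len(sent1) - len(sent2))


def ngrams(tokens, n):
    """All consecutive n-grams in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(sent1, sent2, n):
    """Count how many n-grams of the first sentence also occur in the second."""
    grams2 = set(ngrams(word_tokenize(sent2), n))
    return sum(1 for gram in ngrams(word_tokenize(sent1), n) if gram in grams2)
```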


Finally, determining whether a bias is present in the string length difference or the word overlap, step three in the diagram of Figure 1, starts by separating the sentence pairs into four lists: pairs classified as true positive, true negative, false positive and false negative. To determine the bias in the difference in string length, the mean of the string length differences of the sentence pairs in each of the four lists is taken, and the means are compared to each other. If the mean for the false negative cases is significantly higher than that of the ground positives, there is a bias towards a large difference in string length, while if those values lie at around the same level, there is no bias present. In the case of the n-gram word overlaps, the overlap values are again separated into four lists of sentence pairs classified as true positive, true negative, false positive and false negative. Following this, there are two parts to determining the bias. The first part is the same as for the difference in string length, where the mean of all four lists is taken for each n-gram; in this case bias exists when the false negatives have a much lower word overlap mean than the ground positives. The second way is to make a distribution graph for the four lists and compare those. When the largest part of the distribution for false negatives lies at the lower word overlaps and there is a spike in the distribution for false positives at the higher word overlaps, there is a bias towards high word overlap in the model for classifying a sentence pair as positive.
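Assuming the classification outcomes and feature values are collected in a pandas DataFrame, the per-class means used for this comparison could be obtained as sketched below; the column names are hypothetical and chosen for this example.

```python
import pandas as pd


def feature_means_per_outcome(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of each feature per classification outcome (TP, FP, TN, FN).
    A much higher mean string length difference, or a much lower mean overlap,
    for the false negatives than for the ground positives indicates a bias."""
    features = ["len_diff", "uni_overlap", "bi_overlap", "tri_overlap"]
    return df.groupby("outcome")[features].mean()
```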

4.2 Bias Reduction

The fourth step, seen in the diagram of Figure 1, is the running of a bias reduction method. This thesis proposes three methods for reducing the bias found within the difference in string length and the n-gram word overlap: the Last Word Removal method, the Random Word Removal method and the POS-tag Word Removal method. These three methods work towards reducing the bias found to be present while keeping the size of the data set the same as that of the unaltered model.

4.2.1 Last Word Removal

The first method proposed in this thesis is the Last Word Removal method. In this method, the first step is to take each sentence pair with a difference in string length larger than 45. This number was taken from a bias reduction method in previous research [9]. For these pairs, the string lengths are computed and the longer of the two sentences is taken. From this sentence, the last word is removed, and the length is adjusted by the length of the word taken out. This is repeated until both sentences are around the same length. The new sentence pair is saved into a new dataframe together with the sentence pairs with a difference in string length below 45, which remain unaltered. This new dataframe is then inserted into the model, following step five of the diagram in Figure 1. The same is done for the scores of this dataframe as with the unaltered dataframe, and the true and false positives and negatives are calculated; this works the same as the feature calculation on the unaltered model results in step two of the diagram in Figure 1. The second to last step is the same as step three of the method in Figure 1, where the means of the n-gram word overlap and the difference in string length are taken and a distribution graph is made. Finally, the resulting means and graphs are compared to each other to see whether the method was effective in dealing with the bias, the last step in the diagram seen in Figure 1.

This method will reduce the difference in string length of the sentence pairs where it is largest to roughly equal, meaning that sentence pairs with a large difference in string length which are labelled as dissimilar will now have almost no difference in string length. This way, the model learns the other features as indicators of the pair being dissimilar, so the bias should decrease. As for the bias in word overlap, this method removes the last words in a sentence, which in sentences with different length but a similar meaning will often occur in both. This way, the word overlap for sentence pairs labelled as positive is reduced and the model will no longer learn a large word overlap as an indicator that the sentence pair should be classified as similar.
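A sketch of the Last Word Removal step is shown below. The 45-character threshold comes from the description above; the tolerance used to decide when the lengths count as "roughly equal" is an assumption for this example.

```python
def last_word_removal(sent1, sent2, threshold=45, tolerance=5):
    """Trim trailing words from the longer sentence of a pair whose string
    length difference exceeds `threshold`, until the lengths are roughly equal."""
    if abs(len(sent1) - len(sent2)) <= threshold:
        return sent1, sent2                       # pair left unaltered
    first_is_longer = len(sent1) >= len(sent2)
    longer, shorter = (sent1, sent2) if first_is_longer else (sent2, sent1)
    words = longer.split()
    while len(" ".join(words)) - len(shorter) > tolerance and len(words) > 1:
        words.pop()                               # drop the last word
    longer = " ".join(words)
    return (longer, shorter) if first_is_longer else (shorter, longer)
```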

4.2.2 Random Word Removal

The second method proposed is the Random Word Removal method, which follows the same steps seen in Figure 1 as the Last Word Removal method. Only step four, the running of the reduction method, works differently. With this method, sentence pairs with a difference in string length greater than 45 are taken. Then, the longer sentence is tokenized and the total number of tokens in the sentence is counted. A random number is generated between zero and the total number of tokens, and this number determines the word that will be removed from the sentence. The number of tokens used as the upper limit for the random number generator is lowered by one, and the string length of the longer sentence is decreased by the length of the word taken out. This is repeated until both sentences are of roughly equal string length. After this, the same steps are taken as with the Last Word Removal method; in the diagram of Figure 1, steps five, six, seven and eight are executed.

With regard to reducing the bias towards the difference in string length, the Random Word Removal functions in the same way as the Last Word Removal, decreasing the bias by reducing the difference in string length of the sentence pairs labelled negative. It should work better, however, on the word overlap bias. This is because this method not only removes words from the end of the sentence but also from the beginning and the middle, which increases the chance of removing a word that is present in both sentences and thus of reducing the word overlap.
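The Random Word Removal step could look like the sketch below, identical to the previous one except that the word to drop is chosen at random; the stopping tolerance is again an assumption.

```python
import random


def random_word_removal(sent1, sent2, threshold=45, tolerance=5):
    """Remove randomly chosen words from the longer sentence of a pair whose
    string length difference exceeds `threshold`, until roughly equal length."""
    if abs(len(sent1) - len(sent2)) <= threshold:
        return sent1, sent2
    first_is_longer = len(sent1) >= len(sent2)
    longer, shorter = (sent1, sent2) if first_is_longer else (sent2, sent1)
    words = longer.split()
    while len(" ".join(words)) - len(shorter) > tolerance and len(words) > 1:
        words.pop(random.randrange(len(words)))   # remove a word at a random position
    longer = " ".join(words)
    return (longer, shorter) if first_is_longer else (shorter, longer)
```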

4.2.3 Word Removal based on POS-tags

The final method tested within this thesis removes words based on their corresponding POS-tags. The first step is once again to take the sentence pairs with a difference in string length greater than 45. A list is made of the universal POS-tags, with tags such as punctuation, determiners, symbols, conjunctions, numerals and particles coming first and adjectives, verbs and nouns coming last. Then the longer sentence is taken and a list of its corresponding POS-tags is created using the nltk universal POS-tagger. Following this, the words in the longer sentence with POS-tags corresponding to the first tag in the list are removed, and the string length of that sentence is decreased by the length of each removed word. If the string length is still larger than that of the shorter sentence in the pair, the words with tags corresponding to the next POS-tag in the list are removed. This repeats until both sentences have roughly equal string length. The final four steps as shown in Figure 1 are once again taken.

The POS-tag Word Removal method works the same on the bias in the difference in string length as the other two methods, once again reducing the difference in string length of the sentence pairs labelled as negative, which will cause the model not to learn a large difference in string length as an indicator that a sentence pair is dissimilar. When only a few words are removed, it is less likely to remove words that occur in both sentences than the Random Word Removal method, because the words removed first have POS-tags that add little meaning to a sentence and are thus less likely to be present in both sentences; when more words are removed this chance increases. On the other hand, the words removed by this method are more likely to occur in both sentences than the ones removed by the Last Word Removal method, as they can occur in any part of the sentence, not just the ending.
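A sketch of the POS-tag based removal with nltk's universal tagger is given below. The exact ordering of the tag list is an assumption based on the description above (function words and punctuation first, content words last), and the stopping tolerance is again a chosen parameter.

```python
from nltk import pos_tag, word_tokenize  # requires the "punkt", "averaged_perceptron_tagger"
                                         # and "universal_tagset" nltk resources

# Removal order over universal POS-tags: tags carrying little meaning first,
# content-word tags last (assumed ordering).
REMOVAL_ORDER = [".", "DET", "X", "CONJ", "NUM", "PRT", "PRON", "ADP", "ADV",
                 "ADJ", "VERB", "NOUN"]


def pos_tag_word_removal(sent1, sent2, threshold=45, tolerance=5):
    """Remove words from the longer sentence of a pair, one POS-tag category
    at a time, until the string lengths are roughly equal."""
    if abs(len(sent1) - len(sent2)) <= threshold:
        return sent1, sent2
    first_is_longer = len(sent1) >= len(sent2)
    longer, shorter = (sent1, sent2) if first_is_longer else (sent2, sent1)
    tagged = pos_tag(word_tokenize(longer), tagset="universal")
    for tag in REMOVAL_ORDER:
        if len(" ".join(w for w, _ in tagged)) - len(shorter) <= tolerance:
            break
        tagged = [(w, t) for w, t in tagged if t != tag]  # drop all words with this tag
    longer = " ".join(w for w, _ in tagged)
    return (longer, shorter) if first_is_longer else (shorter, longer)
```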

4.3 Embedding Model

The model used in this thesis to determine the predicted score is the Google Universal Sentence Encoder [17]. This model takes in the data and separates it into three columns: the first sentence in a pair, the second sentence and lastly the similarity score or label assigned to the pair. For the encoding itself, two variants can be used: the first is the transformer model, the second is the Deep Averaging Network (DAN). In both variations of the model, the input strings are represented as a vector embedding using a tensor implemented in TensorFlow [18]. Both models give as output a 512-dimensional vector as the embedding for the sentence.

The difference between the two models lies in accuracy and efficiency. The transformer model achieves the higher accuracy of the two, but this comes at the cost of a quadratic time complexity of O(n²). In contrast, the DAN model achieves a slightly lower accuracy but has only a linear time complexity of O(n).
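As an illustration, the sketch below loads the DAN variant of the Universal Sentence Encoder from TensorFlow Hub and scores sentence pairs with the cosine similarity of their 512-dimensional embeddings; using cosine similarity as the predicted score is an assumption for this example.

```python
import numpy as np
import tensorflow_hub as hub

# DAN variant; the transformer variant is published as
# https://tfhub.dev/google/universal-sentence-encoder-large/5
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def similarity_scores(sentences1, sentences2):
    """Cosine similarity between the embeddings of corresponding sentences."""
    emb1 = embed(sentences1).numpy()
    emb2 = embed(sentences2).numpy()
    emb1 /= np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 /= np.linalg.norm(emb2, axis=1, keepdims=True)
    return np.sum(emb1 * emb2, axis=1)
```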

4.4 Data

Two data sets are used in this thesis to test the bias reduction methods for STS tasks: the STS Benchmark data set and the Quora question pairs data set. Two data sets are used to show that the methods in this paper are not data set dependent and can be applied to any similar data set used in STS tasks to reduce the bias towards certain features.

4.4.1 STS Benchmark

The STS Benchmark data set consists of a training, a development and a test set. The training set consists of 5749 sentence pairs collected from forums, news articles and captions, the development set contains 1500 pairs from the same sources, and the test set contains 1379 sentence pairs. Finally, the data set has labels ranking the similarity of the sentence pairs between zero and five.


4.4.2 Quora sentence pairs

The Quora sentence pairs data set differs slightly from the STS Benchmark data set. First of all, it is much larger and consists of 404290 sentence pairs. Furthermore, this data set is not split into a training, development and testing set beforehand, so this has to be done manually. Aside from the sentence pairs themselves, the data set also contains a sentence ID for each of the two sentences, which allows individual sentences to be easily found in the data set. Finally, it contains a set of boolean labels indicating whether the sentence pairs are duplicates of each other or not. Duplicate in this case means that the sentences have the same meaning but different wording.

Due to memory constraints and hardware limitations during the writing of this paper, the whole Quora data set could not be used for the model, as an Out Of Memory (OOM) error would occur. Therefore the data was shuffled on an index basis and the first 40429 sentence pairs, one tenth of the shuffled data set, were used as a training set. From the remaining sentence pairs not already in the training set, 20214 sentence pairs, or around one twentieth, were used for each of the development and testing sets.
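A possible way to perform this shuffle and split with pandas is sketched below; the file name and random seed are assumptions, and the column layout follows the public Quora release.

```python
import pandas as pd

# Hypothetical path to the released Quora question pairs file.
data = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

# Shuffle on an index basis, then take one tenth for training and
# one twentieth each for development and testing.
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(data)
train = data.iloc[: n // 10]
dev = data.iloc[n // 10 : n // 10 + n // 20]
test = data.iloc[n // 10 + n // 20 : n // 10 + 2 * (n // 20)]
```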


5 Results

5.1 Bias Reduction on STS benchmark

5.1.1 Control group, no bias reduction results

Before it can be concluded that the bias removal methods work for any bias present within the data set, their results must be compared to the results obtained with an unaltered data set. For the STS Benchmark data set the unaltered model achieved a Pearson correlation of 0.803; the n-gram overlaps and the means of the difference in string length can be seen in Figure 2, Table 1 and Table 2.

Figure 2: Distribution of n-gram word overlap in unaltered STS Benchmark data

Unaltered                   TP       FP      TN       FN
String length difference    19.321   9.05    17.875   21.576

Table 1: Mean of difference in string length


Unaltered    TP      FP      TN      FN
Uni-gram     7.871   4.667   4.199   7.311
Bi-gram      4.809   2.175   0.993   3.592
Tri-gram     3.086   1.108   0.388   2.236

Table 2: Average n-gram overlap

5.1.2 Last Word Removal

Figure 3: Distribution of n-gram word overlap after Last Word Removal in STS Benchmark data set

The Pearson correlation resulting from the Last Word Removal method was 0.800 on the development data set. Within this method, the last word of sentences with a difference in string length greater than 45 was removed until the two sentences within the pair were of roughly equal length. This is slightly lower than the Pearson correlation of the unaltered set.

When comparing the word overlap and the means of the difference in string length, the false positives now have a slightly lower mean word overlap for the uni-, bi- and tri-grams, as can be seen in Table 3. This is more often true for the true positives than the false positives and even more so for the false and true negatives. When looking at the uni-gram graphs in Figures 2a and 3a, it can be seen that the true negatives are more in line with the other three. The bias in this case has been reduced: it is no longer clear that a large uni-gram word overlap makes a large similarity score likely. However, with the bi-grams, seen in Figures 2b and 3b, this is not the case, as the true negatives are still more present at the lower word overlaps and the false negatives have a higher distribution at the higher overlaps. With the tri-gram overlap, looking at Figures 2c and 3c, a different picture altogether appears, where the true negatives are now less distributed over the lower word overlaps than the false negatives, but remain lower than the ground positives. On the other hand, the false negatives are more in line with the ground positives, meaning the selection bias towards seeing sentence pairs with high word overlap as similar has been slightly reduced.

When comparing the means of the differences in string length seen in Tables 1 and 4, it becomes clear that in this case the opposite effect has taken place, reinforcing a bias towards sentences with large differences in string length. The mean for the false negatives has grown by roughly 10, a factor of 1.5, while that for the true positives has nearly halved. On the other hand, the mean for the false positives has come more in line with that of the true positives, meaning that sentence pairs with a lower difference in string length are less likely to be seen as positive.

With these results, the model shows an accuracy that is quite similar to that of the model using the unaltered data set, scoring only slightly lower on the Pearson correlation; however, with the bias removed, the accuracy is more evenly spread over the different types of sentences, relying less on the word overlap feature. Unexpectedly, the results show that the reliance on the difference in string length feature has increased. This means that if an example sentence pair is taken with one sentence of three words and a string length of around 25, and the other with a string length of around 60, the model is more likely to classify the pair as dissimilar than the unaltered model would. On the other hand, for a sentence pair with a small string length difference but a large word overlap that is in fact dissimilar, the model on which the Last Word Removal method is applied is less likely to falsely classify the pair as similar.

Last Word Removal    TP      FP      TN      FN
Uni-gram             7.886   5.864   3.978   6.789
Bi-gram              4.935   3.015   0.980   2.828
Tri-gram             3.218   1.591   0.389   1.582

Table 3: Average n-gram overlap after Last Word Removal

Last Word Removal           TP       FP      TN       FN
String length difference    11.224   7.773   15.290   34.516

Table 4: The mean of the difference in string length with Last Word Removal

5.1.3 Random Word Removal

As with the Last Word Removal, the Pearson correlation is slightly lower than that of the unaltered data set, at 0.801, which is slightly higher than that of the Last Word Removal method.

When looking at the distribution graphs for the word overlap, all three cases are very similar to those resulting from the Last Word Removal method, as can be seen in Figure 3, and thus the same can be said about them concerning the effectiveness of the bias removal. The advantage of this method over the Last Word Removal is its slightly higher accuracy, though only by a small margin. However, this comes at the cost of an increase in the bias in the difference between string lengths, as with this method the bias towards seeing a sentence pair as negative for large string length differences has increased even further, to around 36.

Random Word Removal    TP      FP      TN      FN
Uni-gram               7.785   5.864   4.051   6.839
Bi-gram                4.935   3.015   0.980   2.828
Tri-gram               3.218   1.591   0.389   1.582

Table 5: Average n-gram overlap after Random Word Removal

Looking at these results, a similar outcome to that of the Last Word Removal is seen and thus the same things can be said about the effects of the method on the model in the way it would classify a sentence pair based on the word overlap and difference in string length.

Random Word Removal         TP       FP      TN       FN
String length difference    11.322   7.773   15.959   36.508

Table 6: The mean of the difference in string length with Random Word Removal

5.1.4 Word Removal based on POS-tags

The last method tested in this paper, in which words are removed from a sentence based on their POS-tags, achieves a Pearson correlation score of 0.802, which is again only slightly lower than that of the unaltered STS Benchmark data set.

When comparing the graphs of this method to those of the other two, a very similar picture arises, with limited bias reduction in the uni-gram overlap and small reductions in both the bi- and tri-gram overlaps, as can be seen when comparing Figure 3 to Figure 2. Looking at the means in Table 7, they are very similar, but this method achieves a slightly better result, as is also the case for the Pearson correlation. When looking at the means for the differences in string length in Table 8, however, the picture changes a bit. With this method, the means for true and false positives are even closer together, suggesting an improved bias reduction for sentence pairs with small differences in string length being classified as positive. On the other hand, the mean for the false negatives is more in line with the one seen for the Last Word Removal method, meaning the increase in the selection bias on that front is lower than for the Random Word Removal method. Out of the three methods, the Word Removal based on POS-tags also has the Pearson correlation score closest to that of the unaltered data set, meaning the smallest decrease in viability.


With the similarity between the results of this method when compared to the other two, it can once again be said that the model with the POS-tag Word Removal method applied to it will show better classification results when looking at the word overlap feature.

Overall it can be said that the word removal based on the POS-tags is the best performing out of the three tested methods on the STS Benchmark data set.

POS-tag Word Removal    TP      FP      TN      FN
Uni-gram                7.903   5.881   4.080   6.921
Bi-gram                 4.925   3.045   0.992   2.892
Tri-gram                3.205   1.612   0.391   1.639

Table 7: Average n-gram overlap after POS-tag Word Removal

POS-tag Word Removal        TP       FP      TN       FN
String length difference    11.420   8.090   16.883   35.969

Table 8: Mean of the difference in string length after POS-tag Word Removal

5.2 Bias Reduction on Quora sentence pairs

5.2.1 Control group, no bias reduction results

For the Quora sentence pairs data set the baseline Pearson correlation was a lot lower, at 0.510, which is likely due to the significantly smaller portion of the data set that could be used because of the hardware constraints. When looking at the distribution of the n-gram word overlap, it becomes immediately clear that there is almost no bias present within the uni-gram word overlap. This is not the case for the bi- and tri-gram overlap, where there is a bias towards large overlap being classified as positive and small overlap as negative.


Figure 4: Distribution of n-gram word overlap in unaltered Quora data

Unaltered    TP      FP      TN      FN
Uni-gram     6.829   7.149   4.886   5.817
Bi-gram      3.693   4.048   0.993   1.981
Tri-gram     2.175   2.688   0.339   0.730

Table 9: Average n-gram overlap

Unaltered data              TP       FP       TN       FN
String length difference    12.540   13.753   33.656   22.701

Table 10: Mean of the difference in string length


5.2.2 Last Word Removal

Figure 5: Distribution of n-gram word overlap after Last Word Removal in Quora data set

When applying the Last Word Removal method to the Quora data set, the Pearson correlation of the model is 0.489, which is lower than that achieved by the unaltered data set, just as with the STS Benchmark data set. However, the decrease in the Pearson correlation is significantly larger, at 0.021. When comparing the distributions of the n-gram word overlap, it becomes clear that the method is less effective at reducing the bias here than with the STS Benchmark data set. Firstly, there is no bias present within the uni-gram overlap to begin with, and secondly, the method does not reduce the bias within the bi- and tri-gram word overlap. In these cases the model still considers pairs with small word overlap as dissimilar and those with larger overlaps as similar.

However, this method succeeds in reducing the bias within the difference in string length, in contrast to when it was tested on the STS Benchmark data set. The ground negatives with the unaltered data set have a much higher mean for the string length differences than the data on which this method is performed. They are more in line here with the means for the ground positives and thus have reduced the bias in this aspect of the data.

With these results, consider the same hypothetical situation as with the Last Word Removal on the STS data, where a sentence pair is taken with a large difference in string length. In the unaltered model this sentence pair would be classified as dissimilar, largely due to the difference in string length. However, when the same sentence pair is inserted into the model on which the Last Word Removal method is applied, the model is less likely to classify it as a false negative. This is because the Last Word Removal method removes the selection bias towards classifying sentences with large differences in length as dissimilar. On the other hand, if the hypothetical sentence pair had around the same string length but a small word overlap, especially for bi-gram and tri-gram overlaps, the model would be more likely to classify it as dissimilar, while at the same time classifying sentence pairs with large overlap as similar.

Last Word Removal    TP      FP      TN      FN
Uni-gram             6.765   7.069   4.150   5.063
Bi-gram              3.681   4.094   0.875   1.642
Tri-gram             2.174   2.740   0.299   0.569

Table 11: Average n-gram overlap after Last Word Removal

Last Word Removal           TP       FP       TN       FN
String length difference    10.665   10.544   16.469   13.911

Table 12: Mean of the difference in string length after word removal

5.2.3 Random Word Removal

With the Random Word Removal method applied to the data, as with the Last Word Removal, the Pearson correlation is reduced compared to that achieved by the unaltered data set, to 0.466, which is 0.046 lower. This is a decrease of almost 5% and is quite significant.

When looking at the means and distributions of the n-grams, the same picture appears as resulted from the Last Word Removal method, which can be seen in Figure 5. Therefore, as with the Last Word Removal, the method is ineffective in reducing the bias, which even increases slightly when looking at the means of the word overlap. As seen with the Last Word Removal method, the Random Word Removal method also succeeds in reducing the bias in the differences in string length, but does so less effectively. There still exists a gap between the means for the ground positives and ground negatives, unlike with the Last Word Removal method, where this gap is more clearly reduced. This limited bias reduction in the string length difference, combined with the lack of bias reduction in the n-gram word overlap, makes the Random Word Removal method not well suited for bias reduction.

With the Random Word Removal method, things are slightly different from the Last Word Removal method when hypothetical sentences are inserted into the model. If the model altered with Random Word Removal is applied to a hypothetical sentence pair with a high difference in string length, it is more likely to classify the pair as dissimilar than the model altered with Last Word Removal, but less likely to do so than the unaltered model. On the other hand, if the hypothetical sentence pair has a high word overlap, it performs about the same, classifying sentence pairs with high word overlap as similar and those with low word overlap as dissimilar.

Random Word Removal    TP      FP      TN      FN
Uni-gram               6.800   7.100   4.160   4.567
Bi-gram                3.718   4.121   0.838   1.263
Tri-gram               2.205   2.762   0.281   0.426

Table 13: Average n-gram overlap after Random Word Removal

Random Word Removal         TP       FP       TN       FN
String length difference    10.757   10.541   20.142   21.593

Table 14: Mean of the difference in string length after word removal

5.2.4 Word Removal Based on POS-tags

Following the pattern of the previous methods on both data sets, the Pearson correlation is again reduced, this time to 0.480, making it 0.03 lower than the unaltered data set. This is an improvement over the Random Word Removal method, but slightly worse than that of the Last Word Removal method.

Concerning the n-gram overlap, there is no noticeable difference between the distribution graphs for the Last Word Removal and the POS-tag Word Removal methods. Furthermore, it can once again be seen that this method fails to reduce the bias in the bi- and tri-gram word overlap, as seen in Figure 5; the bias even increases more than with the previous two methods, especially for the tri-grams. The means of the differences in string length show a similar picture to the results of the Random Word Removal method, as seen in Table 16 and Table 14, reducing the bias present there but less significantly than the Last Word Removal method. Taking this together, it becomes clear that the Last Word Removal method is the superior of the three, reducing the bias present in the string length difference almost completely while maintaining the closest Pearson correlation score.

This means that sentences with high differences in string length will achieve about the same accuracy in the model to which the POS-tag Word Removal method is applied as in the model to which the Random Word Removal is applied. Just like the model altered by Random Word Removal, the model altered by the POS-tag Word Removal method will achieve higher accuracy on high string length differences than the unaltered model and lower than the model altered by Last Word Removal, and it scores about the same as the others on sentences with high and low word overlap.

POS-tag Word Removal    TP      FP      TN      FN
Uni-gram                6.810   7.101   4.285   4.986
Bi-gram                 3.719   4.101   0.880   1.492
Tri-gram                2.207   2.743   0.293   0.488

Table 15: Average n-gram overlap after POS-tag Word Removal


POS-tag Word Removal        TP       FP       TN       FN
String length difference    11.127   10.972   20.121   19.428

Table 16: Mean of the difference in string length after word removal

6 Conclusion

The three selection bias removal methods tried in this thesis, removing the last word from the longest sentence, removing a random word from the longest sentence and removing a word from the longest sentence based on its universal POS-tag, each achieve varying results. None can clearly be said to fully reduce the selection bias, as a bias still remains, and in the case of the difference in string length it can even be seen to increase. All three methods do reduce the selection bias in the word overlap, mostly in the uni-gram overlaps, as seen in the graphs. In doing so, the Pearson correlation value for the three methods does not decrease significantly, only by 0.01 or 0.02, while keeping the size of the data set the same. As the size and amount of available data for STS tasks is a large problem in the field, these methods offer a solution to the selection bias while keeping the data at a relevant size. This means the data is still usable, as opposed to simply setting a threshold and reducing the size of the data. Though setting a threshold increases the Pearson correlation score of the model [9], it also reduces the available data, risking over-fitting and making the model less relevant and usable. These methods offer a limited alternative to this problem. The method that shows the best results is the word removal based on POS-tags, which shows the lowest decrease in the Pearson correlation value, at 0.01, coming to 0.802, with a similar bias reduction in the word overlap to the other two methods, while keeping the bias increase in the difference between string lengths relatively low compared to the other methods. Even so, it is not a clear and perfect solution to the problem. For the Quora data set a clearly different outcome has been shown, with no reduction in any form of bias in the n-grams, though that bias was almost nonexistent in the uni-grams to begin with, and instead a large bias reduction concerning the string length difference, the opposite of what the methods achieved on the STS Benchmark data set.

Although the Pearson correlation score decreases more when these methods are applied to the Quora sentence pairs data set than when applied to the STS Benchmark data, they are more viable there. This is the case because, although they do not reduce the small bias present within the word n-gram overlap, this bias is not nearly as pronounced as in the STS Benchmark data set. On the other hand, the three methods shown in this thesis increase the bias within the string length differences on the STS Benchmark data set, but reduce it significantly within the Quora data set. Therefore, this major downside of the bias reduction within the STS Benchmark data set is not present for the Quora data set, making it a good option there.

Concerning the effect of these methods on the model, when using the STS Benchmark data set the optimal method for reducing bias and increasing the model's accuracy on sentences with large differences in string length would be the Last Word Removal. This method works slightly better in the classification of pairs with large word overlap than the unaltered data and around equal to the other two methods. However, out of the three methods used, this method will work best on classifying sentences with a large string length difference, even though it will still perform with a lower accuracy than the unaltered model.

For the effects on the model when using the Quora data set, the Last Word Removal will once again prove best suited for the goal of this thesis, achieving a better accuracy on sentences with large differences in string length than both the unaltered model and the Random Word Removal and POS-tag Word Removal methods, while performing the same as the rest on sentences with either large or low word overlap.

7 Discussion

Within this research the bias reduction methods have been tested on both the STS Benchmark data set and the Quora sentence pairs data set. However, the comparison is not completely valid, as only a tenth of the Quora data set was actually used. Though the data points were taken from the whole data set, this still leaves a significantly reduced amount and can have a large influence on the results found. This was due to hardware constraints at the time of this research, where it was not possible with the available resources to make use of the whole data set. A good way to test whether these methods hold up is to perform them on the complete Quora data set and compare the results.

Possible future research focused on reducing the selection bias in STS data sets could build on the methods shown in this thesis in multiple ways. Most significantly, the reduction methods should be stopped from increasing the selection bias in the difference in string length, which is the most glaring problem for the methods tested in this thesis. If this can be achieved, there are no downsides to applying these methods to a data set from the point of view of the learning model, though it can be argued that these biases are present because humans also perceive sentence similarity via these features.

Another avenue for further study is to improve the bias reduction methods on the points where they do work: the bias reduction on word overlap. None of the three methods shows a definitive and effective way to reduce this bias completely. An alternative way of reducing the word overlap bias might be to shuffle the words of the sentence into a random order and test whether this has the intended outcome. If so, the string length difference would be the only other major source of bias left within the data, and methods for reducing that bias could be applied without being concerned with the bias in the word overlap.

Lastly, although it is not highly pronounced, there is still a small bias towards n-gram overlap present within the Quora data set, concerning the bi- and tri-grams more specifically. The methods tested within this thesis fail to address this and even increase it, though only slightly. Further research could be done both on mitigating that increase within these methods and on finding alternate methods to decrease the n-gram overlap bias, to combine with the methods that decrease the bias towards string length difference.


References

[1] Eneko Agirre et al. “* SEM 2013 shared task: Semantic textual similarity”. In: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity. 2013, pp. 32–43.

[2] Eneko Agirre et al. "Semeval-2012 task 6: A pilot on semantic textual similarity". In: * SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). 2012, pp. 385–393.

[3] Eneko Agirre et al. "Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation". In: SemEval-2016. 10th International Workshop on Semantic Evaluation; 2016 Jun 16-17; San Diego, CA. Stroudsburg (PA): ACL (Association for Computational Linguistics), 2016, pp. 497–511.

[4] Junfeng Tian et al. “Ecnu at semeval-2017 task 1: Leverage kernel-based traditional nlp features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity”. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017, pp. 191–197.

[5] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation". In: Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 1532–1543. url: http://www.aclweb.org/anthology/D14-1162.

[6] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. "Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features". In: CoRR abs/1703.02507 (2017). arXiv: 1703.02507. url: http://arxiv.org/abs/1703.02507.

[7] Kawin Ethayarajh. “Unsupervised random walk sentence embeddings: A strong but simple baseline”. In: Proceedings of The Third Workshop on Representation Learning for NLP. 2018, pp. 91–100.

[8] Eneko Agirre. STSbenchmark. 2019. url: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark (visited on 06/04/2020).

[9] K. Oostra. “Analysis of Semantic Textual Classification Errors by Neural Sentence Embedding Model”. In: (2020).

[10] Tomas Mikolov et al. "Distributed Representations of Words and Phrases and their Compositionality". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 3111–3119. url: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

[11] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. “A Simple but Tough-to-Beat Baseline for Sentence Embeddings”. In: (2017).


[12] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. "Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks". In: (July 2015), pp. 1556–1566. doi: 10.3115/v1/P15-1150. url: https://www.aclweb.org/anthology/P15-1150.

[13] Thomas G Dietterich and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Tech. rep. Department of Computer Science, Oregon State University, 1995.

[14] Zekun Yang and Juan Feng. A Causal Inference Method for Reducing Gender Bias in Word Embedding Relations. 2019. arXiv: 1911.10787 [cs.CL].

[15] Adam Sutton, Thomas Lansdall-Welfare, and Nello Cristianini. “Biased Embeddings from Wild Data: Measuring, Understanding and Removing”. In: (2018). Ed. by Wouter Duivesteijn, Arno Siebes, and Antti Ukkonen, pp. 328–339.

[16] Surendra K. Singhi and Huan Liu. "Feature Subset Selection Bias for Classification Learning". In: Proceedings of the 23rd International Conference on Machine Learning. ICML '06. Pittsburgh, Pennsylvania, USA: Association for Computing Machinery, 2006, pp. 849–856. isbn: 1595933832. doi: 10.1145/1143844.1143951. url: https://doi.org/10.1145/1143844.1143951.

[17] Daniel Cer et al. Universal Sentence Encoder. 2018. arXiv: 1803.11175 [cs.CL].

[18] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5998–6008. url: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
