
Meaning of Numbers in text corpora

Lucas L. van Berkel
10747958

Bachelor Artificial Intelligence
College of Science, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

Supervisors:
Dr. Maarten Marx
Hosein Azarbonyad MSc
Information and Language Processing Systems (ILPS)
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam

Abstract

When training word embeddings on large text corpora, numbers are often discarded or substituted with a preassigned symbol to reduce the size of the corpus and vocabulary. We investigated whether numbers hold any meaning or function in the sentences they appear in by preprocessing them in different ways, using clustering and binning methods. Word embeddings trained on text corpora preprocessed with the various methods were tested to see whether their quality improved when the numbers were clustered or binned. The results showed improvements for almost every newly proposed method, which indicates that numbers influence the meaning of the words in their context.

Contents

1 Introduction
2 Literature review
  2.1 Word embeddings
  2.2 Preprocessing and hyperparameters
  2.3 Word2Vec
3 Method and approach
  3.1 Standard techniques
  3.2 Clustering
  3.3 Binning
4 Experimental results
  4.1 Datasets
    4.1.1 Dutch Parliament Documents
    4.1.2 English Wikipedia
  4.2 Hyperparameter settings
  4.3 Evaluation method
  4.4 Results
    4.4.1 Clustering
    4.4.2 Binning
5 Discussion
  5.1 Future research
References

1 Introduction

A major challenge in the field of Artificial Intelligence is to enable a computer to 'understand' words and sentences, in other words language, as well as we humans do. Natural Language Processing (NLP) has many hills to overcome before we can assert that computers mimic our ability to communicate through natural language. Natural language consists of sentences, which in turn consist of words. A computer can only understand language if it can understand the words. To enable computers to understand the meaning of words, we represent words in a high-dimensional embedding space, as vectors of several hundred dimensions. In this space, similar words that appear in the same context are mapped close to each other. The process of calculating those vectors is called word embedding. Representing words as high-dimensional vectors instead of unique symbols makes it possible to calculate similarities and dissimilarities between words, a feature that can be used in other areas of NLP, such as machine translation.

Training these vectors requires a data set consisting of words together with the contexts in which they occur. We humans understand the meaning of words because we have heard or read them countless times from a very young age. This suggests that, to offer computers the same opportunity, word embeddings should be trained on immense text corpora containing hundreds of millions of words, for example a combination of many newspaper or Wikipedia articles. However, in natural language many words in a sentence do not contribute to the semantic meaning of the sentence, but are only there for the sake of correct grammar (words like 'the' and 'of'). These words carry little meaning of their own and serve a mostly syntactic function. Common practice is therefore to delete these words from the sentences, which places semantically important words closer together and reduces the complexity of training the word embeddings, since these words occur highly frequently (Collobert et al., 2011).

Preprocessing techniques like the one mentioned above are commonly used to improve the quality of the training corpus (Collobert et al., 2011). Another preprocessing technique is to normalise all numbers occurring in the corpus, for example by substituting every number with a standard symbol, like '#', or even removing them entirely from the corpus (Schnabel, Labutov, Mimno, & Joachims, 2015). The motivation for this technique is similar to the motivation for removing words like 'the' and 'of': the token carries no semantic meaning of its own and removing it reduces the size of the vocabulary. However, one could argue that a number, despite having little meaning of its own, does influence the meaning of the words in its context. It could matter that a word often occurs in the context of a year or a telephone number. Apart from the argument that removing or substituting numbers reduces the size of the vocabulary and corpus, further justification for this preprocessing step is missing. This study aims to settle this matter and to find out whether numbers improve the quality of the word embeddings of the words in their context. That leads to the research question: what is the effect of preprocessing numbers in natural language sentences on the quality of the word embeddings obtained from the text corpus?

To study the research question, different word embedding models are evaluated on a standard linguistic task. The difference between the models lies in the preprocessing of the numbers that appear in the corpus. Common practice is to remove the numbers or substitute them with a preassigned symbol; this study additionally clusters and bins the numbers, using precalculated embedding vectors and the syntactic structure of the numbers. For every method a new corpus is created, which is used to train word embeddings. The resulting vectors are evaluated, and differences in accuracy show whether numbers fulfil a meaningful function in the text corpus. This thesis first describes the current state of the art in the field and the influence of preprocessing and hyperparameter tuning on the quality of word embeddings. Afterwards, the inner workings of the method used to calculate the word embeddings are described, as well as the newly proposed methods for preprocessing the numbers in the corpus. The method and approach section ends with an explanation of the evaluation method, the analogy task. The results contain the accuracies of the different preprocessing methods on the evaluation task, together with several figures that could explain the differences in accuracy. The thesis ends with the conclusion and discussion, including suggestions for future research regarding the preprocessing of numbers in text corpora.

2 Literature review

This study focuses on word embeddings, and primarily on what influences their quality. Word embeddings can be constructed with different models, each with its own advantages, but preprocessing and hyperparameters are also important. Since this study uses the Continuous Bag-of-Words architecture of the Word2Vec tool, a more elaborate explanation of it is provided.

2.1 Word embeddings

Word embedding methods can be divided into two approaches: prediction-based and count-based (Baroni, Dinu, & Kruszewski, 2014). In 2013, Mikolov, Chen, Corrado, and Dean (2013) proposed a new approach for creating word embeddings, using neural networks to predict context from a given word. Baroni et al. (2014) compared this new prediction-based approach with the 'older' count-based approach and concluded that the prediction-based approach outperforms the count-based approach on various linguistic tasks. Mikolov, Chen, et al. (2013) proposed two novel model architectures for computing continuous vector representations of words from large data sets. The first model is the Continuous Bag-of-Words Model. This architecture is similar to a feed-forward neural network-based language model, but the non-linear hidden layer is removed and the projection layer is shared for all words. The model is called Continuous Bag-of-Words (short: CBOW) because the order of the words is not taken into account. The size of the window of words taken as context around the central word is, however, variable. This model predicts the central word given the words surrounding it.

The second model is the Continuous Skip-gram Model (short: Skip-gram). It is similar to the Continuous Bag-of-Words Model, but instead of predicting the central word given the context, it predicts the context given the central word. Figure 1 illustrates the difference between CBOW and Skip-gram. The conclusion of the paper is that the proposed vector representations deliver good results on various syntactic and semantic language tasks. The big improvement is the reduction in the complexity of training the model and word embeddings. A training run that would take a regular feed-forward neural network language model around 14 days, using more CPU cores (180 cores) and a smaller vector dimensionality (100), was completed by the new models in around 2 days with fewer CPU cores (around 140 cores) and a larger vector dimensionality (1000).

Figure 1: Difference between CBOW and Skip-gram

The Word2Vec method, as the application of Mikolov, Chen, et al. (2013) was called, is currently the state of the art in word embedding because of its low complexity and high accuracy. Besides this method, other neural network-based approaches were proposed, like the Noise-Contrastive Estimation of Mnih and Kavukcuoglu (2013). Word2Vec uses Negative Sampling, a way to maximise the probability of a word-context pair by minimising the probability of word pairs that do not occur in the corpus. The newer method instead trains the word embeddings by training a logistic classifier to distinguish observed samples from samples drawn from a noise distribution (Mnih & Kavukcuoglu, 2013). The advantage of this method is its ability to fit models that are not normalised, reducing the training complexity immensely. However, the approach still uses a training method based on a neural network, which is computationally more expensive than the Word2Vec method.

2.2 Preprocessing and hyperparameters

Although Word2Vec has proven to be computationally faster than other word embedding methods, different word embedding approaches deliver different results on different linguistic tasks (Levy, Goldberg, & Dagan, 2015). One approach is not uniformly better than the others, especially if the hyperparameters of the different approaches are optimised (Levy et al., 2015). In that research, four different word embedding approaches were compared: Pointwise Mutual Information (PMI), Singular Value Decomposition (SVD), Skip-Gram Negative Sampling (SGNS, a method from the Word2Vec tool) and Global Vectors (GloVe). The first two are considered the more 'classical' count-based methods for word embedding, the latter two neural network approaches. The research concluded that the hyperparameters had a significant influence on the accuracy of linguistic tasks like word similarity and the analogy task. The quality of word embeddings is not primarily determined by the method used to calculate the embeddings or by the size of the vocabulary; hyperparameters and preprocessing of the text can improve the quality immensely. Those hyperparameters include, for example, varying the context window size, down-sampling frequent words, deleting rare words and normalising all vectors after training. This shows that preprocessing the data before training can significantly influence the accuracy of the training method used.

In addition to the hyperparameters, the training data itself is important for training good word embeddings. A popular data set, because of its immense size, is the English Wikipedia (Levy et al., 2015) (Collobert et al., 2011). Some data sets only contain sentences with similar semantic meaning, which reduces their quality, since the whole spectrum of language is desirable for a good data set (Tulkens, Emmery, & Daelemans, 2016). Given the importance of the source of the data set and the influence of preprocessing the data, it is remarkable that there is a lack of justification for some preprocessing decisions, for example leaving out digits in text corpora (Collobert et al., 2011) (Schnabel et al., 2015).

2.3 Word2Vec

The Word2Vec tool consists of two individual methods, which are fairly comparable. In this research the first model, the Continuous Bag-of-Words (CBOW) model, was used for calculating the word embedding vectors. To understand the differences in accuracy, it helps to understand the workings of the method used to calculate the word embeddings. The researchers behind Word2Vec developed the tool because, at the time, none of the previously proposed architectures were able to train word embeddings on text corpora larger than a few hundred million words with a dimensionality below 100. So the aim of the tool was, besides creating an architecture that produces word embeddings of good quality, also to have a low time and space complexity. The CBOW architecture is based on a neural network, since neural networks tend to preserve linear regularities between syntactically or semantically similar words (Zhila, Yih, Meek, Zweig, & Mikolov, 2013) (Mikolov, Yih, & Zweig, 2013). Since the complexity of the method matters, the training complexity needs to be taken into account. The training complexity of CBOW is proportional to

O = E × T × Q (1)

where E is the number of training epochs and T is the number of words in the data set. Q is a term that differs for every neural network architecture. Previous architectures used a non-linear hidden layer, which was the most expensive part in time and space. CBOW does not have this non-linear hidden layer, but it does have a projection layer that is shared among all words. Since the projection layer is shared by all words, rather than depending on the position of the input words, the model is invariant to the order of the words. That is the reason why this architecture is called Continuous Bag-of-Words. Besides taking words from the history of the context, CBOW also takes words from the future. The best results were obtained with a log-linear classifier, given the words in the history and in the future of the current (middle) word. For this architecture the training complexity term Q is

Q = N × D + D × log2(V ) (2)

with N the number of input context words, D the dimensionality of the word embeddings and V the size of the vocabulary. It is also important to note that better word embeddings are obtained for words that occur more often. Word2Vec uses a default minimum of five occurrences for a word to be taken into account, since a rarer word would be represented inaccurately. If a word does not reach this limit, it is taken out of the corpus.
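As a rough, back-of-the-envelope illustration (an estimate based on the settings of section 4.2, not a figure reported in the study): with a symmetric window of 5, so up to N = 10 context words, D = 200 dimensions and a vocabulary of roughly V = 3.2 million words, equation (2) gives

Q ≈ 10 × 200 + 200 × log2(3.2 × 10^6) ≈ 2,000 + 200 × 21.6 ≈ 6,300

operations per training word, dominated by the hierarchical softmax term D × log2(V).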

3 Method and approach

To research the meaning of numbers in text corpora, word embeddings obtained with different preprocessing techniques for numbers are created, evaluated and compared to each other. For the creation of the word embeddings the Word2Vec tool was used (Mikolov, Chen, et al., 2013); the tool delivers good results with low computation time. Several preprocessing techniques are compared to each other to see what influence, if any, numbers have on the quality of the word embeddings of the total vocabulary. These preprocessing techniques consist of three standard and four newly created techniques. The first standard technique is simply leaving the numbers as they are: the baseline technique. The second standard technique is to remove all numbers from the text corpus. Finally, the numbers can be substituted by a preassigned symbol, '#'. These standard techniques are the current common practice and show little regard for the function the numbers could have. The new techniques are divided into clustering and binning techniques. The clustering techniques use the word embeddings of numbers created by the baseline model. We cluster the numbers using two different techniques: first with k-means, and second by creating a graph with the numbers as nodes and clustering the graph. The binning techniques look at the syntactic form of the numbers and their context and place the numbers in predefined bins, using two binning methods. The first binning method bins the numbers by value. The second binning method bins the numbers by inspecting the context and the syntax of the number.

3.1 Standard techniques

• Baseline; The baseline technique is not really a preprocessing technique, since no extra effort is made to handle the numbers. Nevertheless, by leaving the numbers as they are, an important trade-off is made between training complexity and information. The assumption when training a model is that more information means higher accuracy. However, even when the numbers are left as they are, some numbers are automatically removed from the corpus because they do not appear more than five times, the minimum count of the Word2Vec tool. After all, numbers can take an effectively unlimited number of different forms, so some are bound to be removed. This means that the baseline technique does not truly leave all numbers as they are.

• Removed; The simplest preprocessing technique, apart from the baseline, is to remove all digits from the sentences. This decreases the size of the text corpus and the vocabulary, but risks losing a lot of information. Even the mere presence of a number would no longer be noticeable for CBOW, whereas we humans do notice when a number is deleted from a sentence.

• Substituted; If the numbers are substituted with a preassigned symbol like '#', instead of removed, the information that a number was present is not lost. This technique leads to a smaller vocabulary compared to the baseline technique, but a larger text corpus, since infrequently appearing numbers are no longer removed. A minimal sketch of these three standard techniques is given below.
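The following sketch illustrates the three standard techniques on a tokenised sentence (an illustration only, not the exact code used in the study; the helper names are made up here, and numbers are detected with the float-cast rule described in section 4.1.2):

def is_number(token):
    # A token counts as a number if it can be cast to a float.
    try:
        float(token)
        return True
    except ValueError:
        return False

def preprocess(tokens, technique):
    out = []
    for tok in tokens:
        if is_number(tok):
            if technique == "removed":
                continue          # drop the number entirely
            if technique == "substituted":
                tok = "#"         # replace every number with the same symbol
            # "baseline": leave the token untouched
        out.append(tok)
    return out

print(preprocess("the law was passed in 1994 by 120 votes".split(), "substituted"))
# ['the', 'law', 'was', 'passed', 'in', '#', 'by', '#', 'votes']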

3.2 Clustering

• K-means; In this research two unsupervised learning algorithms were used to cluster the numbers. The first of the two is k-means clustering. The word embeddings of the numbers were trained on the text corpus using the baseline model, so without any preprocessing of the numbers. This results in word embeddings for every number that occurs more than four times. Since the vector representations of the numbers are points in the high-dimensional space, clusters can be found among them. K-means is a common unsupervised learning algorithm. Its input is normally a set of vectors, in this study the word embeddings of the numbers from the baseline model. Before training, the desired number of clusters is given. For every cluster a random centroid is initialised in the same space as the number vectors. After initialisation, every vector is assigned to the nearest centroid using the Euclidean distance as distance measure. For every resulting cluster the mean is calculated, yielding a new centroid with the smallest mean distance to every point in the cluster.

With the new centroids, new clusters are formed, since some vectors may now be closer to another centroid than before. This process is repeated until the centroids converge, minimising the mean distance of every vector to its centroid. After every vector is assigned to a cluster, every number corresponding to a vector is replaced in the corpus with the name of its cluster. With this method similar numbers are clustered and will therefore occur together with their cluster name in the text corpus. As with substituting the numbers with a fixed token, no information about the presence of a number is lost, while information is added by differentiating the assigned symbols. In a preliminary experiment, the vectors were clustered for multiple numbers of centroids and the mean distance from every vector to its cluster centroid was calculated. The results of this preliminary experiment are illustrated in figure 2.
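A minimal sketch of this step (for illustration only; scikit-learn's KMeans is used here, the thesis does not state which implementation was used, and the toy vectors below stand in for the baseline number embeddings):

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the 200-dimensional baseline embeddings of the numbers.
rng = np.random.default_rng(0)
number_vectors = {str(n): rng.random(200) for n in [1, 2, 3, 100, 1994, 2007, 2008, 2009]}

tokens = list(number_vectors)
X = np.array([number_vectors[t] for t in tokens])

# Repeating this for several values of k and recording kmeans.inertia_
# (the summed squared distance to the centroids) gives the elbow curve of figure 2.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Every number token is later replaced in the corpus by its cluster name.
number_to_cluster = {t: f"cluster_{c}" for t, c in zip(tokens, kmeans.labels_)}
print(number_to_cluster)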


Figure 2 shows the expected course for a data set with no clearly distinct clusters. Adding more clusters decreases the mean distance, but the cost of training would be higher and the clusters would become meaningless. Common practice is to take the number of clusters around the 'elbow' of the curve, in this case around three to five clusters. These numbers of clusters were therefore used for training the clusters and substituting the numbers; they are called cluster 3, cluster 4 and cluster 5 in the results.

• Graph clustering; A second clustering method was used in an attempt to find more logical and meaningful clusters. This method creates a graph of the number vectors and separates the graph into clusters. For every number vector the 100 most similar tokens (words and numbers) were calculated. From those 100 tokens, the 20 most similar numbers were extracted. Between every number and each of its similar numbers an edge was created, weighted by their similarity, which was calculated using cosine similarity. If a number did not have 20 numbers among its 100 most similar tokens, no extra numbers were sought, since they would not be relevant to each other. This way, a graph is created in which the nodes are the numbers and the edges connect the nearest neighbours among those numbers, with the similarity between the numbers as edge weight. After constructing the graph, we apply the Louvain method to cluster the numbers (Blondel, Guillaume, Lambiotte, & Lefebvre, 2011). The Louvain method tries to maximise the modularity of the graph. Modularity is a measure of the structure of a graph: it measures the strength of the division of a graph into clusters. A graph with high modularity has many, strongly weighted edges between nodes within the same cluster and few edges between nodes of different clusters. Modularity Q is calculated with the formula:

Q = (1 / (2m)) Σ_ij [ A_ij − (k_i k_j) / (2m) ] δ(c_i, c_j)    (3)

where

– A_ij is the weight of the edge between nodes i and j

– k_i and k_j are the sums of the weights of the edges attached to nodes i and j

– m is the sum of all edge weights in the graph

– δ(c_i, c_j) is 1 if nodes i and j are in the same cluster and 0 otherwise

Since modularity Q is calculated for a given assignment of nodes to clusters, Q can be optimised over such assignments. Maximising modularity therefore both analyses the graph and clusters the numbers. However, because there is no predefined number of clusters, the number of clusters found by graph clustering can be a lot higher than with k-means. As with k-means, after graph clustering every number in the corpus is substituted with the name of its cluster.
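A minimal sketch of this construction (an illustration only; networkx and the python-louvain package are used here, the thesis does not state which implementation was used, and the toy vectors again stand in for the baseline number embeddings):

import numpy as np
import networkx as nx
import community as community_louvain   # pip install python-louvain

# Toy stand-in for the baseline number embeddings (token -> vector).
rng = np.random.default_rng(0)
number_vectors = {str(n): rng.random(200) for n in [1, 2, 3, 1994, 2008, 2009]}

tokens = list(number_vectors)
X = np.array([number_vectors[t] for t in tokens])
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit vectors, so dot product = cosine similarity
sims = X @ X.T

G = nx.Graph()
k = 3   # the study used the 20 most similar numbers; 3 keeps this toy example small
for i, t in enumerate(tokens):
    neighbours = np.argsort(-sims[i])[1:k + 1]      # indices of the k most similar other numbers
    for j in neighbours:
        G.add_edge(t, tokens[j], weight=float(sims[i, j]))

# Louvain community detection maximises modularity; it returns {node: cluster id}.
partition = community_louvain.best_partition(G, weight="weight")
number_to_cluster = {t: f"gcluster_{c}" for t, c in partition.items()}
print(number_to_cluster)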


3.3 Binning

• Naive binning; Binning is, in contrast to clustering, a supervised approach to grouping numbers that are similar to each other. We binned the numbers in two different ways: one quite naive, the other taking the context and form of the numbers into account. The naive binning is simple in design: every number is placed in a bin whose size grows logarithmically. The first bin holds the numbers 0 up to 100, the next 100 up to 1,000, and so on. No further intuition is added to the binning. The idea behind this method is that numbers that are syntactically similar are also semantically similar, so numbers that differ by one or two are similar, as are numbers with the same number of digits. One drawback of this method is that a single bin can become enormously large, since most numbers that are written fall into that bin.

• Binning; The more intuitive binning method also takes into account the context in which the numbers appear. The bins in this method correspond to the function the numbers can have. Most numbers are either an ID for the object they refer to or they indicate an amount of some object. Since years and days of the calendar (which are just IDs for the years and days they represent) are frequent in the corpus, separate bins were created for them to differentiate more numbers. Words like 'in' and 'between' indicate that a year number follows, and if the word after the number is a plural, the number probably indicates an amount. Moreover, if the number fits a regular expression for years, it is ultimately binned as a year if no other rule applies. These, and many other, if-then rules were tuned on a small training set of 502 numbers that were classified manually. Of the 502 numbers, 487 were classified correctly (an error of 2.98%).

The advantage of binning the numbers as a preprocessing method is that it is not necessary to train word embeddings before the numbers can be grouped together. Every number is taken into account, which means no number is left out of training. A minimal sketch of both binning schemes is given below.
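The sketch below illustrates the two schemes (the rules shown for the context-based variant are only a small, illustrative subset of the if-then rules described above; the function names, bin names and year pattern are made up here):

import re

def naive_bin(token):
    # Bins grow logarithmically: [0, 100), [100, 1000), [1000, 10000), ...
    value = float(token)
    if value != value:               # NaN-like tokens end up in an 'unidentified' bin
        return "bin_unidentified"
    value = abs(value)
    if value < 100:
        return "bin_min_100"
    if value < 1_000:
        return "bin_100_1000"
    if value < 10_000:
        return "bin_1000_10000"
    if value < 100_000:
        return "bin_10000_100000"
    return "bin_100000_max"

YEAR_RE = re.compile(r"^(1[0-9]{3}|20[0-9]{2})$")   # crude year pattern, an assumption

def context_bin(prev_word, token, next_word):
    # Tiny subset of the if-then rules: context words, plural after the number,
    # and a year regex as fallback; everything else becomes an ID.
    if prev_word in ("in", "between") and YEAR_RE.match(token):
        return "bin_year"
    if next_word.endswith("s"):          # plural after the number -> amount
        return "bin_amount"
    if YEAR_RE.match(token):
        return "bin_year"
    return "bin_id"

print(naive_bin("2008"))                 # bin_1000_10000
print(context_bin("in", "1994", "the"))  # bin_year
print(context_bin("won", "120", "votes"))# bin_amount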

The entire method and approach of this study is also summarised as pseudo-code in algorithm 1, to give an overview of the study.


Result: Accuracy’s different preprocess techniques parseWikipidiaText();

for line in xml do

if line not xml tag OR length in words <5 then remove punctuation from line;

add line to corpus; end

end

for preprocess technique do for line in corpus do

for word in line do

if word is number then

switch preprocess technique do case Removed do remove word; end case Substituted do word = #; end

case Cluster k-means do word = kCluster[word]; end

case Cluster graph do word = gCluster[word]; end

case Naive Binning do word = naiveBin(word); end

case Binning do

word = bin(line, word); end

end end end

Add line to new corpus; end

end

for every corpus do trainModel(corpus); evaluate(model); end


4 Experimental results

The section Experimental results is divided into the subsections datasets, hyperparameter settings, evaluation method and results. The datasets section clarifies the origins of the data sets used and why the data set used first did not suffice for this study. After that, the hyperparameter settings are discussed, to enable reproduction of the achieved results. The evaluation method is discussed in section 4.3, to explain how the quality of the word embeddings is measured. The final section discusses the accuracies achieved by the different methods on the evaluation task.

4.1 Datasets

4.1.1 Dutch Parliament Documents

The study originally started with the Dutch parliamentary documents from 1814 until 1995 as the text corpus used for preprocessing and training. This data set was constructed by scanning all the typed documents, for example those produced during parliamentary meetings. The scans were converted into text documents using optical character recognition (OCR). Figure 3 shows an example of such a scan. Preliminary inspection found mistakes introduced by converting the scans into text documents, which needed to be taken into account when testing the word embeddings. The text documents were finally stored in an XML format that gives every token in the text a part-of-speech tag, lemma, etc., using the FoLiA format for linguistic annotation (van Gompel & Reynaert, 2013). The corpus contained around 29 million sentences and 670 million tokens. Similar studies used data sets of similar sizes (Tulkens et al., 2016).

Figure 3: Sample scan from which the parliament documents were constructed using OCR, illustrating the difficulty and the possibility of errors

Since the corpus is in Dutch, the analogy task needed to be translated into Dutch as well to function as the evaluation method. The analogy task is discussed further in section 4.3. Every token in every sentence had been classified beforehand, giving every token its word class, lemma and stem. Using this classification, the numbers were recognised in the text. The results are presented in tables 1 and 2.

Pre-process technique Tokens in corpus Vocabulary

Baseline 533,241,094 511,892

Removed 526,112,174 492,098

Substituted 533,337,415 492,099

Table 1: Size of Dutch Parliament Documents corpora

Pre-process technique Questions seen Accuracy

Baseline 13185/20027(65.84%) 5.24%

Removed ,, 5.11%

Substituted ,, 5.32%

Table 2: Results of Dutch Parliament Documents

The results show a very low accuracy on the analogy task, but a slight improvement in accuracy when substituting the numbers with a fixed symbol. Another important point to notice in this table is the low percentage of questions seen in the analogy task: the analogy task only counts a question towards the total score if every word of the quartet is in the vocabulary. Since both corpora, a subset of one tenth of the entire set and the entire set itself, contain enough words, it is remarkable that the percentage of seen questions is that low. One explanation could be that some English words do not translate to a single Dutch word. For example, the English word 'Reducing' translates to the Dutch words 'Kleiner worden'. Considering that the analogy task was translated using Google Translate, it is impossible to check whether every translation is correct without going over every one of the roughly 20,000 lines. Because of this, every line containing more than four words was removed from the test set. The new results are shown in table 3.

Pre-process technique Questions seen Accuracy

Baseline 10076/16762(60.11%) 7.72%

Removed ,, 7.68%

Substituted ,, 7.67%

Table 3: Results of Dutch Parliament Documents with clean test set

These accuracies are not good enough to draw conclusions about the preprocessing technique used on the numbers: the Dutch Parliament Documents text corpus was not fit to conduct this research on. Therefore the English Wikipedia was used next, since that corpus is already in English and is immensely larger.


4.1.2 English Wikipedia

Since most similar studies are conducted on the English Wikipedia, this data set was considered next. A data dump of all articles and all editor talk pages of May 1st, 2017 was downloaded. This data set showed much more potential, since Wikipedia consists of typewritten text and not of text constructed using OCR. Besides that, the English Wikipedia is also of a much larger magnitude (approximately 5 billion words versus just over 500 million). Consequently, higher accuracies could be expected with the new data set. Since Wikipedia is in fact a large network of articles referencing each other, the data dump included a lot of hyperlinks and symbols indicating the presence of a hyperlink or the title of an article. These tags were removed using a Wiki parser. After the parsing, only the tags indicating the beginning and end of an article needed to be removed, after which the text could be concatenated, making it suitable for training with the Word2Vec tool.

A big difference between the Dutch Parliament Documents and the English Wikipedia is the representation of the words in the text. In the Dutch Parliament Documents every token was classified and lemmatised, whereas the English Wikipedia consists solely of raw text. Numbers were already annotated in the Dutch Parliament Documents, while for the English Wikipedia they had to be recognised before preprocessing and training the models. A token was considered a number when the string could be cast to a float. This way, every number that contained a decimal point or comma would still be recognised as a number. Moreover, most punctuation was removed from the corpus, so compound numbers like telephone numbers and ranges of years would also satisfy this criterion.

4.2 Hyperparameter settings

Hyperparameter optimisation is crucial for the quality of the word embeddings. Although this is not the objective of this study, it is important to note the hyperparameter settings so that the results can be reproduced. The settings used for training were:

• -size: 200; dimensionality of the word embeddings.

• -window: 5; maximum skip length between words.

• -sample: 1e-3; threshold for the occurrence of words; words that appear with a higher frequency in the training data are randomly down-sampled.

• -hs: 1; use hierarchical softmax.

• -negative: 0; number of negative examples.

• -threads: 12; number of threads used for training.

• -min-count: 5; discard words that appear less often than this threshold.

• -alpha: 0.025; starting learning rate.

• -cbow: 1; use the CBOW model (the default 0 is the skip-gram model).
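The study used the original Word2Vec tool; for readers who prefer a Python interface, a roughly equivalent configuration in gensim would look as follows (an illustration only, with parameter names as in recent gensim versions and placeholder file names, not the setup actually used here):

from gensim.models import Word2Vec

# corpus.txt is assumed to contain one preprocessed sentence per line.
model = Word2Vec(
    corpus_file="corpus.txt",
    vector_size=200,   # -size 200
    window=5,          # -window 5
    sample=1e-3,       # -sample 1e-3
    hs=1,              # -hs 1 (hierarchical softmax)
    negative=0,        # -negative 0
    workers=12,        # -threads 12
    min_count=5,       # -min-count 5
    alpha=0.025,       # -alpha 0.025
    sg=0,              # -cbow 1 (CBOW instead of skip-gram)
)
model.wv.save_word2vec_format("vectors.bin", binary=True)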

4.3 Evaluation method

After the text corpus has been preprocessed with respect to the numbers, the Word2Vec tool trains the word embeddings. As explained earlier, the process uses the words and their contexts to optimise the word embeddings. With the seven different preprocessing techniques, seven different models of word embeddings are formed, and their quality needs to be evaluated. A suitable method to test these models is the analogy task, a test that is provided with the Word2Vec tool (Mikolov, Chen, et al., 2013). The test consists of around 20,000 lines, each containing a quartet of words that are syntactically or semantically related. Every quartet consists of two pairs of words, with the same relation within each pair. Quartets that share a semantic relation could for example involve a country and its capital; such a line would look like 'France Paris Italy Rome'. A quartet that shares a syntactic relation could involve a plural relation, for example 'Car Cars Foot Feet'. The semantic relations consist of five subcategories and the syntactic relations of nine. An overview of the different semantic and syntactic relations that are tested is provided in table 4.

For every subcategory the accuracy is calculated separately, so the different preprocessing techniques can perform differently on each task. The accuracy is calculated as follows: take the vector of the second word of the quartet, subtract the vector of the first word and add the vector of the third word. Ideally the resulting vector would be the vector of the fourth word. Because an exact match is practically impossible in this high-dimensional space, the vocabulary vector nearest to the calculated vector is taken. If the found vector is the vector of the fourth word, it counts as a hit. The accuracy is the number of hits divided by the total number of seen questions. To keep the test fair, only lines where every word exists in the vocabulary are taken into account.
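A minimal sketch of this evaluation step (assuming the trained vectors are available as a dict mapping words to numpy arrays; the word2vec accuracy tool was used in the study, so this is only an illustration of the computation):

import numpy as np

def nearest_word(vec, vectors, exclude):
    # Return the vocabulary word whose cosine-normalised vector is closest to vec.
    best, best_sim = None, -1.0
    v = vec / np.linalg.norm(vec)
    for word, wv in vectors.items():
        if word in exclude:
            continue
        sim = float(v @ (wv / np.linalg.norm(wv)))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

def analogy_accuracy(questions, vectors):
    hits = seen = 0
    for a, b, c, d in questions:
        if not all(w in vectors for w in (a, b, c, d)):
            continue                 # only count questions whose words are all in the vocabulary
        seen += 1
        target = vectors[b] - vectors[a] + vectors[c]
        if nearest_word(target, vectors, exclude={a, b, c}) == d:
            hits += 1
    return hits / seen if seen else 0.0

# Example call: analogy_accuracy([("France", "Paris", "Italy", "Rome")], vectors)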


Subcategory: Example:

Semantic:
Capital common countries: Athens Greece Baghdad Iraq
Capital world: Abuja Nigeria Accra Ghana
Currency: Algeria dinar Angola kwanza
City in state: Chicago Illinois Houston Texas
Family: boy girl brother sister

Syntactic:
Adjective to adverb: amazing amazingly apparent apparently
Opposite: acceptable unacceptable aware unaware
Comparative: bad worse big bigger
Superlative: bad worst big biggest
Present to participle: code coding dance dancing
Nationality to adjective: Albania Albanian Argentina Argentinean
Past tense: dancing danced decreasing decreased
Plural: banana bananas bird birds
Plural verbs: decrease decreases describe describes

Table 4: Every subcategory and an example of it

4.4 Results

The English Wikipedia proved to be better suited to analysing the differences between the preprocessing techniques. The corpus is significantly larger, which the literature has shown to increase the quality of the word embeddings. Table 5 shows the sizes of the corpus and vocabulary, as calculated during training of the word embeddings. The size of the corpus is just below 5 billion tokens. The differences between the baseline, removed and substituted corpora are due to the removal of the numbers and to numbers that are left out of the baseline corpus because they did not occur at least five times. The difference between the removed and substituted sets indicates that 3.17% of the entire corpus consists of numbers. The vocabulary sizes differ between the corpora as expected.

Pre-process technique   Tokens in corpus   Vocabulary

Baseline                4,958,037,201      3,219,540
Removed                 4,802,038,935      3,149,759
Substituted             4,959,359,950      3,149,760
Clustered 3             ,,                 3,149,763
Clustered 4             ,,                 3,149,764
Clustered 5             ,,                 3,149,765
Clustered Graph         ,,                 3,149,921
Binning                 ,,                 3,149,764
Naive Binning           ,,                 3,149,765

Table 5: Size of the English Wikipedia corpora

With the larger vocabulary, larger corpus and original test set, the results were significantly better. Table 6 shows the accuracy of every trained model over the entire test set. In contrast to the Dutch Parliament Documents, the percentage of questions seen was 100%: every word in the test set appeared at least five times in the corpus. The percentages show improvements of some of the clustering and binning methods over the standard techniques.

Pre-process technique   Total accuracy over entire test set   Average accuracy over every sub-task

Baseline                44.35%   42.52%
Removed                 45.03%   42.54%
Substituted             45.09%   42.36%
Clustered 3             45.42%   43.23%
Clustered 4             45.06%   42.65%
Clustered 5             45.14%   42.88%
Clustered Graph         45.63%   42.99%
Binning                 45.34%   43.09%
Naive Binning           45.63%   42.55%

Table 6: Results of English Wikipedia on analogy task

The accuracy over the entire test set may give a wrong impression of the quality of the word embeddings, due to the differences in size of the sub-tasks the analogy task consists of. For example, the Capital-world sub-task consists of over 4,500 quartets, while the Family sub-task consists of only around 500 quartets. Taking the average of the accuracies of every sub-task gives a more accurate picture of the difference in quality between the methods: every sub-task is then normalised and equally important in the average accuracy. This score is shown next to the total accuracy. Both are shown because the first accuracy can be compared to accuracies in the literature, while the second gives a fairer representation of the quality of the word embeddings. The baseline, removed and substituted methods all performed worse than average, showing that handling numbers differently improves the quality of the word embeddings. The results for every sub-task are shown in appendix A.

When clustering and binning the numbers, it is important to know the frequency distribution of the numbers in the entire corpus. A recurring statistic in linguistics is Zipf's law, which states that the frequency distribution of words in text follows a certain pattern: in every language, the most frequent word occurs roughly twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Figure 4 shows the frequency of the 100,000 most frequent numbers versus their rank on a log-log scale.


Figure 4: Zipf’s law of numbers, frequency vs rank

Figure 4 shows that numbers do not follow the same frequency-rank pattern as words. If the line were perfectly diagonal, the frequency distribution would follow Zipf's law. Instead, frequent numbers appear roughly as often as the numbers near them in rank. This could mean that clusters can be distributed more evenly in total appearance, since more numbers appear frequently. The nine most frequent numbers were, surprisingly, all year numbers from 2006 until 2013 and 2016. The tenth most frequent number was 1.
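A sketch of how such a rank-frequency plot can be produced (a generic illustration with a placeholder corpus file, not the exact script used in the study):

from collections import Counter
import matplotlib.pyplot as plt

def is_number(token):
    # Same float-cast rule as in the earlier sketches.
    try:
        float(token)
        return True
    except ValueError:
        return False

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(tok for tok in line.split() if is_number(tok))

frequencies = [freq for _, freq in counts.most_common(100_000)]
ranks = range(1, len(frequencies) + 1)

plt.loglog(ranks, frequencies)
plt.xlabel("rank")
plt.ylabel("frequency")
plt.title("Rank-frequency distribution of numbers")
plt.show()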

4.4.1 Clustering

The clustering methods all showed improvements in accuracy, with the exception of clustered 4. This indicates that clustering the numbers by their vectors does add useful information. Tables 7, 8 and 9 show the size of every cluster and the number of appearances of every cluster in the corpus when using k-means clustering. What stands out is that every method clusters the most frequent numbers together. The clustering of the methods cluster 4 and cluster 5 is especially remarkable, since a tiny partition of the numbers is responsible for the majority of the number occurrences in the corpus. This can be explained by the high frequency of the year numbers 2007, 2008 and 2009, which were clustered together in every clustering method.


Cluster number   TNC      TCC           MFN
cluster #0       32,112   147,147,817   2008, 2007, 2009
cluster #1       10,086   751,956       800, 900, 350
cluster #2       32,289   8,121,677     666, 65, 0

Table 7: Clusters sizes with three clusters, TNC = Total numbers in cluster, TCC = Total of cluster in corpus, MFN = Most frequent numbers

Cluster number   TNC      TCC           MFN
cluster #0       9,920    118,934,293   2008, 2007, 2009
cluster #1       32,896   680,578       250, 5500, 27000
cluster #2       30,334   7,782,320     666, 65, 0
cluster #3       1,337    28,624,259    1845, 1837, 1841

Table 8: Clusters sizes with four clusters, TNC = Total numbers in cluster, TCC = Total of cluster in corpus, MFN = Most frequent numbers

Cluster number   TNC      TCC           MFN
cluster #0       31,773   970,375       400, 250, 800
cluster #1       3,174    54,841        8848, 88450298874, 10·10^12
cluster #2       9,251    28,624,259    1845, 1837, 1841
cluster #3       1,339    118,847,525   2008, 2007, 2009
cluster #4       28,950   7,524,450     666, 65, 0

Table 9: Clusters sizes with five clusters, TNC = Total numbers in cluster, TCC = Total of cluster in corpus, MFN = Most frequent numbers

With graph clustering, the number of clusters was not predefined; it is the number of clusters for which the modularity Q was optimised, in this case 164 clusters. Most clusters consisted of only one number that appeared only five to twenty times. Table 10 shows the five clusters that appear most often in the text corpus. Remarkably, the distribution of appearances is more balanced compared to the other clustering methods, and the cluster sizes are also significantly smaller. The two largest clusters consist of 209 different numbers, but they contain 71% of all number appearances in the corpus.


Cluster number   TNC   TCC          MFN
cluster #37      178   63,216,021   2008, 2007, 2009
cluster #72      31    48,908,581   1, 2, 3
cluster #34      478   10,759,532   1200, 1600, 1700
cluster #3       600   10,451,497   0131, 0001, 0130
cluster #15      970   8,112,083    666, 40, 65

Table 10: Clusters sizes of five largest clusters with graph clustering, TNC = Total numbers in cluster, TCC = Total of cluster in corpus, MFN = Most frequent numbers

4.4.2 Binning

Both binning methods achieve a higher accuracy than the standard techniques. With the normal binning method the distribution over the bins is fairly equal, whereas naive binning causes an unequal distribution. This is primarily because the year numbers between 1000 and 10000 are put in the largest bin. As seen earlier, the year numbers 2008, 2007 and 2009 are the most frequent numbers in the corpus. With the same accuracy as graph clustering, naive binning performed best on the analogy task. However, when normalising the accuracies over the sub-tasks, naive binning performed worse than the average of the nine methods. This is due to the accuracy of naive binning on the Capital-world sub-task; as mentioned, this sub-task is larger than the others. Naive binning achieved an accuracy on this sub-task that was slightly better than the others, which caused the optimistic overall accuracy. Normal binning, on the contrary, together with clustered 3 and graph clustering, achieved the best results relative to the average of the methods. The distributions of the numbers over the bins are shown in tables 11 and 12.

Bin name       Bin count    MFN
Amount         12,777,696   2, 3, 10
Day            35,982,836   1, 4, 15
ID             55,325,278   1, 2, 3
Year           52,731,560   2008, 2007, 2009
Unidentified   503,645      1, 2, 3

Table 11: Bin sizes with binning, MFN = Most frequent numbers


Bin name           Bin count    MFN
Bin min 100        54,899,869   1, 2, 3
Bin 100 1000       14,961,972   100, 666, 200
Bin 1000 10000     84,227,156   2008, 2007, 2009
Bin 10000 100000   1,341,365    10000, 20000, 50000
Bin 100000 max     1,880,780    100000, 200000, 500000
Bin unidentified   9,873        nan, -nan

Table 12: Bin sizes with naive binning, MFN = Most frequent numbers

5 Discussion

The objective of this study was to research the effect of different methods of preprocessing numbers on the quality of the word embeddings obtained from the preprocessed corpus. Preprocessing techniques that are common practice were compared to newly proposed methods, which consisted of clustering and binning the numbers. For clustering we used both k-means and graph clustering. For binning, we handled every occurrence of a number individually and binned it according to its syntax and context, or its value. The results show improvements in accuracy for multiple clustering and binning methods over the standard techniques. This indicates that, to improve the quality of word embeddings, it is worthwhile to handle numbers as tokens that hold a function in the context in which they appear. Although not every newly proposed method achieved a higher accuracy than the substituted model, clustering and binning groups each number with similar numbers. Frequent numbers end up grouped together, separated from less frequent numbers. This affects the word embeddings of the context by indicating whether the number that appears is frequent or not, which improves their quality.

Although the results look promising, the accuracies do not improve on every sub-task for any newly proposed method. Additionally, the improvements in accuracy are not large enough to obviously justify the extra time needed to cluster or bin the numbers. This holds especially for clustering, since the word embeddings have to be trained twice: first on the baseline corpus and then on the clustered corpus. Binning takes less time to form the corpus, but every sentence has to be checked for occurrences of numbers. Especially with the large corpora needed to train good word embeddings, this delays the training quite significantly. Preparing the corpus took around 3 hours and 40 minutes for binning the numbers and around 4 hours for clustering. Training a model took around 7 hours and 15 minutes for every corpus. So clustering takes about seven and a half hours longer in total than removing, substituting or binning the numbers, because of the additional baseline training run. This trade-off between efficiency and accuracy implies that it is not self-evidently better to cluster all the numbers from now on.


Besides, with large corpora like the one used in this study, tiny differences in the handling of every sentence have a large influence on the entire corpus. For example, when testing the different methods, it proved to be important to keep the inner loop the same for all methods. The inner loop is the loop in which every word is checked for being a number. In initial training, some methods discarded empty strings, whereas others kept them. Surprisingly, the methods that kept the empty strings showed an improvement in accuracy, probably because some words then did not appear in each other's context. It shows that the quality of the word embeddings can be influenced by seemingly illogical preprocessing choices. Additionally, as discussed earlier, the nine most frequent numbers were all year numbers ranging from 2006 until 2013 and 2016. This was due to time stamps attached to pieces of text: Wikipedia grew fastest from 2006 onwards, so these numbers did not occur in natural language. Although we humans still extract meaning from time stamps, it would be worthwhile to run the same test without them. Because of the irregular placement of these time stamps and a lack of time to train and evaluate new models, this study was not able to pursue this additional preprocessing step.

5.1 Future research

Given the results and the fact that a corpus, in this case the English Wikipedia, consists of numbers for over three per cent, it would be wise to continue exploring the function and meaning of numbers in text corpora. Due to the short time frame of this study, some potential methods have not been investigated. For example, it would be interesting to bin the numbers by their frequency in the corpus, grouping infrequent numbers together and keeping frequent numbers separate. Clustering did group the infrequent numbers together, but also the frequent numbers, resulting in a single cluster dominating the corpus. It would also be good to create a more dynamic binning method than the one used in this study; for example, a logistic classifier trained on a far larger training set than the one used to tune the current binning rules could take more context into account, as the CBOW model of the Word2Vec tool does. Finally, this study tested the quality of the word embeddings on one test only; to investigate the influence of clustering and binning numbers any further, it would be good to test the word embeddings on different linguistic tasks.

To summarise, clustering and binning the numbers improved the quality of the word embeddings, although the extra training time needs to be taken into account before deciding whether this improves the model overall. Preprocessing the numbers differently is not strictly necessary, but for those who want to optimise their word embeddings it is a worthwhile option.


References

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1) (pp. 238–247).

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, É. (2011). The Louvain method for community detection in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.

Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Yih, W.-t., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (Vol. 13, pp. 746–751).

Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems (pp. 2265–2273).

Schnabel, T., Labutov, I., Mimno, D. M., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In EMNLP (pp. 298–307).

Tulkens, S., Emmery, C., & Daelemans, W. (2016). Evaluating unsupervised Dutch word embeddings as a linguistic resource. arXiv preprint arXiv:1607.00225.

van Gompel, M., & Reynaert, M. (2013). FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal, 3, 63–81.

Zhila, A., Yih, W.-t., Meek, C., Zweig, G., & Mikolov, T. (2013). Combining heterogeneous models for measuring relational similarity. In HLT-NAACL (pp. 1000–1009).


A Results of all the models on the sub-tasks

Sub-task   Bas     Rem     Sub     Cl3     Cl4     Cl5     ClG     Bin     BnN
Cc         74.7    72.33   72.33   75.3    77.87   78.26   77.87   71.74   70.36
Cw         55.81   58.29   59.66   57.87   59.62   58.64   60.99   57.82   60.28
C          12.24   12.7    12.47   12.01   12.36   12.93   12.36   13.28   12.47
Cs         27.97   30.28   29.71   29.75   29.59   29.02   29.47   29.87   31.33
F          66.01   66.21   64.62   67      65.81   65.42   65.02   66.4    63.64
Adj-Adv    13.41   11.79   15.02   13.52   17.04   15.12   16.03   15.02   15.83
Opp        16.63   18.35   17.98   20.07   16.87   16.75   19.33   17.61   16.87
Com        59.61   56.31   55.48   57.73   57.58   58.26   55.63   60.29   56.91
Sup        28.61   28.7    27.01   30.66   27.18   27.99   27.36   29.68   29.14
Pre-Par    40.06   37.5    38.26   39.68   35.23   39.39   38.73   41.67   37.88
Nat-Adj    72.48   73.55   73.98   75.05   74.05   72.8    75.92   72.98   70.48
Pt         50.58   50.96   50.06   51.09   46.73   51.47   47.76   50.96   53.08
Pl         45.05   46.92   45.35   45.2    45.8    45.27   46.32   44.22   48.35
PlV        32.18   31.72   31.15   30.23   31.38   28.97   29.03   31.72   29.08
Ave        42.52   42.54   42.36   43.23   42.65   42.88   42.99   43.09   42.55

Table 13: Accuracy on every sub-task in percentage, the average shows the normalised total accuracy

Sub-task   Ave       Bas     Rem     Sub     Cl3     Cl4     Cl5     ClG     Bin     BnN
Cc         74.53%    0.17    -2.20   -2.20   0.77    3.34    3.73    3.34    -2.79   -4.17
Cw         58.78%    -2.97   -0.49   0.88    -0.91   0.84    -0.14   2.21    -0.96   1.50
C          12.54%    -0.30   0.16    -0.07   -0.53   -0.18   0.39    -0.18   0.74    -0.07
Cs         29.67%    -1.70   0.61    0.04    0.08    -0.08   -0.65   -0.20   0.20    1.66
F          65.57%    0.44    0.64    -0.95   1.43    0.24    -0.15   -0.55   0.83    -1.93
Adj-Adv    14.75%    -1.34   -2.96   0.27    -1.23   2.29    0.37    1.28    0.27    1.08
Opp        17.83%    -1.20   0.52    0.15    2.24    -0.96   -1.08   1.50    -0.22   -0.96
Com        57.53%    2.08    -1.22   -2.05   0.20    0.05    0.73    -1.90   2.76    -0.62
Sup        28.48%    0.13    0.22    -1.47   2.18    -1.30   -0.49   -1.12   1.20    0.66
Pre-Par    38.71%    1.35    -1.21   -0.45   0.97    -3.48   0.68    0.02    2.96    -0.83
Nat-Adj    73.48%    -1.00   0.07    0.50    1.57    0.57    -0.68   2.44    -0.50   -3.00
Pt         50.30%    0.28    0.66    -0.24   0.79    -3.57   1.17    -2.54   0.66    2.78
Pl         45.83%    -0.78   1.09    -0.48   -0.63   -0.03   -0.56   0.49    -1.61   2.52
PlV        30.61%    1.57    1.11    0.54    -0.38   0.77    -1.64   -1.58   1.11    -1.53
Ave        42.75%    -0.23   -0.21   -0.39   0.47    -0.11   0.12    0.23    0.33    -0.21

Table 14: Differences from the average of every model of every sub task in percentage point, bottom row shows average of differences

(26)

Cc = Capital common countries
Cw = Capital world
C = Currency
Cs = City in state
F = Family
Adj-Adv = Adjective - Adverb
Opp = Opposite
Com = Comparative
Sup = Superlative
Pre-Par = Present - Participle
Nat-Adj = Nationality - Adjective
Pt = Past tense
Pl = Plural
PlV = Plural verbs
