
Evaluation and comparison of word embedding models, for efficient text classification

Ilias Koutsakis

University of Amsterdam, Amsterdam, The Netherlands

ilias.koutsakis@gmail.com

George Tsatsaronis

Elsevier

Amsterdam, The Netherlands
g.tsatsaronis@elsevier.com

Evangelos Kanoulas

University of Amsterdam, Amsterdam, The Netherlands

E.Kanoulas@uva.nl

Abstract

Recent research on word embeddings has shown that they tend to outperform distributional models on word similarity and analogy detection tasks. However, there is not enough information on whether or not such embeddings can improve text classification of longer pieces of text, e.g. articles. More specifically, it is not yet clear whether or not the usage of word embeddings has a significant effect on various text classifiers, and how word embeddings perform when trained with different numbers of dimensions (not only the standard size of 300).

In this research, we determine that the use of word embeddings can create feature vectors that not only provide a formidable baseline, but also outperform traditional, count-based methods (bag of words, tf-idf) for the same number of dimensions. We also show that word embeddings appear to improve accuracy, even further than the distributional models baseline, when the amount of data is relatively small. In addition to that, the averaging of word embeddings appears to be a simple but effective way to keep a significant portion of the semantic information.

Besides the overall performance and the standard classification metrics (accuracy, recall, precision, F1 score), the time complexity of the embeddings will also be compared.

Keywords: text classification, word embeddings, fasttext, word2vec, vectorization

Acknowledgments

I would like to thank my supervisors, George Tsatsaronis and Evangelos Kanoulas, for guiding me and providing me with their years of experience, patience, and support, during the completion of this thesis.

I would also like to thank Elsevier, for giving me the chance to work and contribute to the work of amazing individuals, in such an establishment, and take advantage of their know-how and data, during my research.

Last, I would like to thank my mother, for constantly being on my side, supporting and pushing me.

1 Introduction

The accumulating amount of text data is increasing exponentially, day by day. This trend of constant production and analysis of textual information is not going to stop anytime soon. On the contrary, even non-traditional businesses, like banks¹, are starting to use Natural Language Processing for R&D and HR purposes. It is no wonder, then, that industries try to take advantage of new methodologies to create state-of-the-art products, in order to cover their needs and stay ahead of the curve.

This is how the concept of word embeddings became popular again in the last few years, especially after the work of Mikolov et al. [14], which showed that shallow neural networks can provide word vectors with some amazing geometrical properties. The central idea is that words can be mapped to fixed-size vectors of real numbers. Those vectors should be able to hold semantic context, which is why they have been very successful in sentiment analysis, word disambiguation and syntactic parsing tasks. However, those vectors can still represent single words only, something that, although useful, really limits the usage of word embeddings in a wider context.

1.1 Motivation

The motivation behind this thesis is to examine whether or not word embeddings can be used in document classification tasks. It is important to understand the main problem, which is how to form a document feature vector from the separate word vectors. We should also examine the possible performance issues that different dataset sizes can cause, but most importantly, datasets with different semantics.

In what follows, we present our findings, comparing the usage of word embeddings in text classification tasks, through the averaging of the word vectors, and their performance against the baseline, distributional models (bag of words and tf-idf).

2 Classification and Evaluation Metrics

Classification is the process where, given a set of classes, we try to determine one or more predefined classes/labels that a given object belongs to [13]. More specifically, using a learning method or learning algorithm, we wish to train a classifier or classification function γ that maps documents to classes, like this:

γ : X → C

¹ https://www.ibm.com/blogs/watson/2016/06/natural-language-processing-transforming-financial-industry-2/


Figure 1. A representation of the classification procedure, showing the training and the prediction part. (source: NLTK Documentation)

This group of machine learning algorithms is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ, where D is the training set containing the documents. The learning method Γ takes the training set D as input and returns the learned classification function γ.

It is important to note that binary datasets (with true/false labels only) and multiclass datasets (with a variety of classes to choose from) are not necessarily different problems. More specifically, the multiclass classification problem is usually decomposed into binary problems using the "one-vs-rest" method, where each class iteratively becomes the "true" class and is evaluated against the rest of the classes (which collectively become the "false" class).
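As an illustration of this decomposition (not code from the thesis), scikit-learn exposes it explicitly through the OneVsRestClassifier wrapper; the toy data below is made up, and many scikit-learn classifiers apply the same strategy internally.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 4-class dataset, standing in for vectorized documents.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# One binary logistic regression is fitted per class ("class c" vs. "the rest");
# prediction picks the class whose binary classifier is most confident.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))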

You can see an example of the process in Figure 1, taken from the NLTK documentation [12]. What follows is a short description of the classifiers used in this thesis, as well as the most common classifier evaluation metrics, both statistical and visual.

2.1 Common classifiers

For this thesis, we needed to test different classification algorithms and compare the results. We decided to settle on a few algorithms, representative of the different classification methods that exist. The implementations used are the ones found in Scikit-Learn, a machine learning library for Python [17].

The selected algorithms are the following:

2.1.1 Naive Bayes

Naive Bayes (NB) is a classifier based on applying Bayes' theorem with the "naive" assumption of independence between all the features. It is considered a particularly powerful machine learning algorithm, with multiple applications, especially in document classification and spam filtering [25].

In our experiments, we used the Gaussian Naive Bayes implementation from Scikit-Learn, which uses the following formula:

$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$
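As a brief, illustrative sketch (the random feature matrix below only stands in for document vectors), the classifier can be used as follows:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# X: one row per document (e.g. a 300-dimensional averaged embedding),
# y: the class label of each document.
rng = np.random.default_rng(0)
X = rng.random((100, 300))
y = rng.integers(0, 4, size=100)

clf = GaussianNB().fit(X, y)
print(clf.predict(X[:3]))        # predicted classes
print(clf.predict_proba(X[:3]))  # per-class probabilities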

2.1.2 Logistic Regression

Logistic regression (LR) is a regression model where the dependent variable is categorical. This allows it to be used in classification, where, as an optimization problem [20], it minimizes the following cost function:

$\min_{x} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i a_i^{T} x)\right)$

It was invented in 1958 [22], and it is similar to the naive Bayes classifier. But instead of using probabilities to set the model’s parameters, it searches for the parameters that will maximize the classifier performance [12].

2.1.3 Random Forest

Ensemble methods try to improve accuracy and generalization in predictions by using several estimators instead of one. Random Forest (RF) is an ensemble method based on Decision Trees. It uses the averaging method of ensembling, i.e. it averages the predictions of the classifiers used.

Decision Trees differ from the previously presented algorithms, as they are a non-parametric supervised learning method that tries to create a model predicting the value of a target variable based on several input variables [24]. You can take a look at an example of the created rules in Figure 2.

2.1.4 Support Vector Machines

Support Vector Machines/Classifiers (SVC) make use of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. They are very effective, compared to other algorithms, in cases where [17]:

• the dataset is very high-dimensional; and/or

• the number of dimensions is higher than the number of samples.

Both of those conditions apply immediately to text classification, and although SVMs are slower than other algorithms (especially Bayesian ones), they are very performant and memory efficient.

2.1.5 k-Nearest Neighbors

One of the simplest and most important algorithms in data mining is k-Nearest Neighbors (k-NN). It is a non-parametric method used for classification, where the input consists of the k closest training examples in the feature space, and the output is the predicted class [2]. Every input item is classified by a majority vote of its neighbors.

It is considered very sensitive to the structure of the data, and is thus commonly used for datasets with a small number of samples. Nearest neighbor rules in effect implicitly compute the decision boundary. For high-dimensional data (e.g. textual data), dimension reduction is usually performed prior to applying the k-NN algorithm, in order to avoid the effects of the curse of dimensionality [3].

Figure 2. A tree showing survival of passengers on the Titanic. The figures under the leaves show the probability of survival and the percentage of observations in the leaf. (source: Wikipedia)

2.2 Statistical Evaluation metrics

The metrics described below apply to binary classification. As mentioned before, in the case of multiclass datasets, each class is separately considered the positive one, and a "one-vs-rest" approach is used to get metric results separately. This becomes especially important when considering the different metrics, as some of them, e.g. the F1 score, are defined for binary classification only, and we need to be careful with the results.

In this context we consider tp, tn, fp, fn as true positive, true negative, false positive and false negative, respectively. We also consider P as the total positive samples and N as the total negative samples. It is important to note that since the methods below use mainly fractions, the best value is 1 and the worst value is 0.

• Accuracy: the fraction of correct predictions, on a dataset. It can be computed either using the count of the correct predictions, or their fraction on the total. We are using the fraction method, defined as:

$\text{Accuracy} = \frac{tp + tn}{P + N}$

• Precision: the ability of the classifier not to label as positive a sample that is negative, defined as:

$\text{Precision} = \frac{tp}{tp + fp}$

• Recall: ability of the classifier to find all the positive samples, defined as:

$\text{Recall} = \frac{tp}{tp + fn}$

• F1 score: the weighted harmonic mean of the preci-sion and recall, defined as:

$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
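To make the definitions concrete, here is a small, self-contained sketch that computes all four metrics from raw confusion counts (the counts themselves are invented for illustration):

def binary_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall and F1 from binary confusion counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # P + N = all samples
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics(tp=40, tn=45, fp=5, fn=10))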

2.3 Visual Evaluation Metrics - Confusion Matrix

In addition to the above, we can also make use of a visual metric, the confusion matrix. Also known as an error matrix, it is a special kind of contingency table, with dimensions equal to the number of classes. It summarizes the algorithm performance by exposing the false positives and false negatives.
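A minimal sketch of how such a matrix can be produced with scikit-learn (the labels below are invented):

from sklearn.metrics import confusion_matrix

y_true = ["news", "sport", "tech", "news", "tech", "sport", "news"]
y_pred = ["news", "sport", "news", "news", "tech", "sport", "tech"]

# Rows correspond to the true classes, columns to the predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["news", "sport", "tech"]))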

2.4 Differentiating between binary and multiclass datasets

In order to differentiate the classification metrics used for binary and multiclass datasets, there are a few ways to average binary metric calculations across the set of classes, each of which is useful in different scenarios. The averaging methods (explained at length in the scikit-learn documentation²) that are applicable here are:

• micro: gives each sample-class pair an equal contribution to the overall metric (except as a result of sample weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient.
• macro: simply calculates the mean of the binary metrics, without adding any weights. It can be especially useful in cases where classes with a small number of samples are important, so their performance needs to be taken into consideration.
• weighted: tries to rectify class imbalance by computing the average of binary metrics in which each class' score is weighted by its presence in the true data sample.

For example, let's assume that we classified a dataset and have some results. If we consider $y_{true}$ as the correct labels for each item, $y_{pred}$ as the predicted labels, and $L$ as the set of labels, then for Precision we would get:

• $\text{Prec}_{micro} = P(y_{true}, y_{pred})$
• $\text{Prec}_{macro} = \frac{1}{|L|} \sum_{l \in L} P(y_{true,l}, y_{pred,l})$

² http://scikit-learn.org/stable/modules/model_evaluation.html

In this thesis, we will be using macro averaging, as all our datasets are completely balanced, so the weighted or micro averages are not useful at all.
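For illustration, the scikit-learn metric helpers accept an average argument that selects between these schemes; a minimal sketch with invented labels:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 3, 3]
y_pred = [0, 1, 2, 1, 1, 0, 3, 2]

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))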

In addition, in the case of visual metrics, the representations change according to the requirements. For example, in the case of multiclass tasks, the confusion matrix becomes an N × N table, where N is the number of classes.

3 The Vector Space Model (VSM)

In order to use the text documents in a classifier, we need to create an appropriate and usable representation. This is achieved using the Vector Space Model (VSM), which was developed by Gerard Salton and his colleagues in 1975 [19] for the SMART information retrieval system. In the VSM, each document in a collection is represented as a point in a space (a vector in a vector space), and its usage revolutionized natural language processing and information retrieval.

In the VSM, all the documents and the queries are represented as vectors, where each dimension is a distinct word. E.g.:

• $document_i = (w_{i,1}, w_{i,2}, ..., w_{i,n})$
• $document_{i+1} = (w_{i+1,1}, w_{i+1,2}, ..., w_{i+1,n})$
• $query = (w_1, w_2, ..., w_n)$

Of course, the VSM only provides the outline of the procedure; there are different ways of weighting the words/features of the vector, namely the Bag of Words and Tf-Idf. An explanation of both follows.

3.1 Bag of Words (BoW)

The bag of words is a simplified representation of a document, based on one-hot encoding [23], where each word is represented in a vector by a categorical feature. It is one of the simplest and most effective tools in text mining and information retrieval. Its implementation in text mining is quite simple, and it requires having the whole corpus beforehand, in order to know the dimensions of the feature vectors. After transforming the text into a "bag of words", we can calculate various measures to characterize the text, but usually we use it together with the "term frequency", which is essentially the count of the words in each document.

3.2 Tf-Idf

Tf-Idf is an evolution of the Bag of Words, as it works under the assumption that not all words are equally important, no matter how often they appear. According to Aggarwal et al. [1], the most significant words are not the ones that appear most often, as these tend to be linking words such as "the", "or" and "and", which are crucial to the structure of the document but do not carry importance. So, a way is needed to re-weight the count features (bag of words) into floating point values suitable for usage by a classifier.

Figure 3. Plot of word embeddings, showing their dimensional qualities.

So, one of the main vector-space based methods used for text processing is the Tf-Idf (Term Frequency × Inverse Document Frequency) representation (Salton, 1983).

As the name says, the formula has two distinct parts:

• the term frequency (tf), which is provided by the formula: $tf = f_{t,d}$
• and the inverse document frequency, which is: $idf = \log \frac{N}{n_t}$

In this thesis, we will be using the default Tf-Idf vectorizer class of Scikit-Learn, which uses the "smooth idf" variation³.
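A minimal sketch of the two Scikit-Learn vectorizers discussed in this chapter (CountVectorizer for the Bag of Words counts, TfidfVectorizer with its default smooth-idf behaviour); the toy corpus is made up:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "stocks fell as the bank reported losses"]

bow = CountVectorizer().fit_transform(corpus)      # sparse term-count matrix
tfidf = TfidfVectorizer().fit_transform(corpus)    # smooth_idf=True by default, L2-normalized rows
print(bow.shape, tfidf.shape)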

4 Word Embeddings

The idea of using vector representations for words is not a new one. The earliest attempts to express word semantics as vectors stretch back to the 1950s (Osgood [16]); however, they have become a popular topic only in recent years. In the 1990s, the first methods to automatically generate contextual features were developed, with one of the most important being LSA [6]. The latest, massive interest in semantic word representations came from the recent research of Mikolov et al., who introduced the word2vec method [14].

In this chapter, we will describe the two different models that are part of the word2vec algorithm: Continuous Bag of Words (CBoW) and Skip-Gram. It is important to note that we will be using two different libraries, Gensim [18] and Facebook's FastText, by Bojanowski et al. [4].

A representation of the two models can be seen in Figure 4.

³ http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting


Figure 4. The CBoW and Skip-Gram models.

4.1 CBoW and Skip-Gram

Word2Vec differs from the previous, distributional models by its iterative nature: it does not require computing any global information (e.g. a co-occurrence matrix), but iterates over the whole dataset while adjusting the vectors, which makes it much more memory-efficient and easier to reuse.

CBoW and Skip-Gram are similar, with one big difference: although both of them rely on the same concept of moving a filter window of fixed size along the whole dataset and trying to predict certain words, CBoW predicts the central word from the context (surrounding) words, while Skip-Gram does the opposite and predicts the context words from the central word.

Both word2vec models are considered shallow learning because, as seen in the picture, they only include one hidden and one output layer. The input of the neural network is every context word, encoded using one-hot encoding. This means that for C words, each word is represented by a C-dimensional vector. The hidden layer simply has to take the average of all those word vectors, using the representations that form a weight matrix W1.

Here are the important differences between the two models:

• For CBoW, following from the hidden layer to the output layer, the second weight matrix W2 can be used to compute a score for each word in the vocabulary, and softmax can be used to obtain the distribution of words.
• For Skip-Gram, on the other hand, at the output layer we now output C multinomial distributions instead of just one. The training objective is to minimize the summed prediction error across all context words in the output layer.

Both CBoW and Skip-Gram are used extensively, and we will experiment with both.

4.2 Document vectorization using word embeddings

In order to retrieve a single feature vector for each document, we needed to find a way to combine the word vectors of each text collection. According to Mikolov et al., the averaging of the word vectors seems to provide a sufficient baseline, so we decided to test that on our own datasets, in different dimensions. The algorithm used to create the feature vector from the word embeddings is the following:

Algorithm 1: Vectorization of a document through the averaging of the word embeddings.

Data: a list of words, representing a document
Result: the document feature vector

1. initialize number of words = 0;
2. create a feature vector of 0s, with length = word embedding dimensions (e.g. 300);
3. for word = 1, 2, ..., n do
4.   if word in the trained word2vec model then
5.     increase the number of words by 1;
6.     get the word vector from the model;
7.     add the word vector to the feature vector (elementwise);
8.   end
9. end
10. divide the feature vector by the number of words to get the average;
11. return the feature vector of the document;
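A runnable sketch of Algorithm 1, assuming a Gensim model is available; the function and variable names are ours, and the Word2Vec parameter names follow Gensim 4.x (older releases use size instead of vector_size):

import numpy as np
from gensim.models import Word2Vec

def average_vector(words, keyed_vectors, dim):
    # Average the embeddings of the in-vocabulary words (Algorithm 1).
    feature = np.zeros(dim, dtype=np.float32)
    n_found = 0
    for word in words:
        if word in keyed_vectors:          # skip out-of-vocabulary words
            feature += keyed_vectors[word]
            n_found += 1
    return feature / n_found if n_found else feature

# Tiny toy corpus; in practice each entry is a preprocessed document.
corpus = [["the", "market", "fell"], ["the", "team", "won", "the", "match"]]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)
print(average_vector(["market", "won", "unknownword"], model.wv, 50).shape)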


5 Implementation and Datasets

In this chapter we will present the technical aspects of the implementation used for our experiments, as well as a brief description of the datasets.

5.1 Implementation

The implementation of the required functionality was based on a variety of tools available in Python. Python was the obvious choice for this project, due to the wide variety of useful libraries for data science, all of them being state of the art in their respective categories. The implementation of the core functionality is based on the scikit-learn [17] Python library, as it includes implementations of the most commonly-used procedures and algorithms (e.g. decomposition, vectorization, classification, evaluation). In addition, we took advantage of the sklearn API [5], which allowed us to create a completely automated analysis pipeline. The text vectors are created using the Scikit-Learn implementations of the Tf-Idf and Count (Bag of Words) vectorizers. In the case of pre-trained word embeddings, we opted for trying the two most popular tools right now: the Word2Vec implementation of the Gensim Python package [18] (which also provided many text pre-processing functions), as well as the implementation provided in Facebook's FastText library [10].

Regarding the tuning options, since we wanted to get a clear view of the baseline, we chose to use the classifiers with their default tunings (e.g. no regularization for Logistic Regression, no extra trees or depth for Random Forests, etc.). In the case of the word embeddings, we followed a similar procedure, using a context window of size w = 5. Words that occurred fewer than 5 times in the corpus were ignored, and high-frequency words were randomly downsampled, with the standard sample = 0.001 threshold.
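For illustration, this configuration might look as follows with the Gensim API (parameter names follow Gensim 4.x, where older releases use size instead of vector_size; the FastText library exposes analogous options):

from gensim.models import Word2Vec

# Stand-in corpus: a list of token lists, repeated so that min_count=5 is satisfied.
documents = [["stocks", "fell", "sharply"], ["the", "team", "won", "the", "match"]] * 10

model = Word2Vec(
    documents,
    vector_size=300,   # embedding dimensionality (10 to 500 in the experiments)
    window=5,          # context window size w = 5
    min_count=5,       # ignore words occurring fewer than 5 times
    sample=0.001,      # downsampling threshold for high-frequency words
    sg=1,              # 1 = Skip-Gram, 0 = CBoW
)
print(model.wv["stocks"].shape)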

The preprocessing was also minimal, as we decided, after the initial experiments, to just use stemming and stopword removal, using the Gensim preprocessing functions and the NLTK stopword list [12]. This significantly increased performance.
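A plausible sketch of such a preprocessing step (not the exact code of the thesis), combining Gensim's Porter stemmer with the NLTK stopword list:

from gensim.parsing.preprocessing import stem_text
from nltk.corpus import stopwords   # requires a one-off nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, drop English stopwords, then stem with Gensim's Porter stemmer.
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return stem_text(" ".join(tokens)).split()

print(preprocess("The markets were falling sharply during the crisis"))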

All the visualizations were created using the scikit-plot library [15], which is a wrapper over Matplotlib [8].

5.2 Dimension Reduction

An interesting issue that arose was the selection of the decomposition algorithm. Some early choices included manifold algorithms, e.g. t-SNE and MDS (multidimensional scaling). However, according to van der Maaten et al.⁴, such algorithms try to solve specific problems; mainly, the crowding problem that appears in two-dimensional and three-dimensional visualizations of very high-dimensional datasets. In addition, t-SNE is a non-parametric algorithm (it does not learn an explicit function that maps data from the input space), so it cannot be used with new data, making it unusable in business cases [21].

⁴ https://lvdmaaten.github.io/tsne/

Table 1. General dataset information

Corpus    Size     Labels   Avg. tokens
news      7,600    4        61
reviews   25,000   2        130
dbpedia   70,000   14       76

In the end we opted for PCA (Principal Component Analysis), one of the most popular algorithms for dimensionality reduction [9]. In terms of dimensionality reduction, it can be formulated as the problem of finding the m orthonormal directions minimizing the representation error.

The implementation used was TruncatedSVD, found in Scikit-Learn [5]. It is much more performant than plain PCA, as it can easily work with sparse matrices, which are the main representation of text documents.
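A brief sketch of this reduction step on a sparse tf-idf matrix (the toy corpus and the 3 components are chosen only for illustration):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "stocks fell as the bank reported losses",
          "the bank approved the loan"]

X = TfidfVectorizer().fit_transform(corpus)        # sparse document-term matrix
X_reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
print(X_reduced.shape)                             # dense (4, 3) matrix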

5.3 Datasets

It was very important to investigate the different outcomes in a variety of datasets. The datasets should be distinct, but also share some characteristics, to evaluate the differences and make meaningful comparisons. For this reason, we chose the following:

• a news dataset with 7,600 items and 4 classes;
• a binary dataset of 25,000 IMDB movie reviews; and
• a dataset with descriptions and ontologies from DBPedia, with 70,000 items.

The texts should be considered short, as we avoided using big datasets (like scientific publications) due to computational limitations. However, the collected datasets provide a clear look at standard corpora that are extensively used in both industry and academia (e.g. sentiment, news). You can find the information summary for the datasets in Table 1.

All the datasets were curated by Zhang et al. for the purposes of their own research⁵ [26].

6 Experiments and results

In our experiments, we conducted a variety of tests in order to evaluate the performance of word embeddings in comparison to distributional models (bag of words, tf-idf). Our research questions are mainly summarized as follows:

• can the usage of word embeddings outperform distributional models; and
• how do word embeddings, trained in different dimensions, perform against distributional models of the same dimensions?

To get answers, we created the following procedure:

⁵ https://goo.gl/PAK8mX

• select a dataset;
• train word embeddings on the dataset, for a specific number of dimensions, with all the available options (FastText and Word2Vec, both in their CBoW and Skip-Gram variants);
• compare the results with the results of the distributional models, after using PCA for dimension reduction; and
• compare the results to the baseline (distributional models without any dimension reduction).

Table 2. Accuracy dominance matrix (News dataset)

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     1
ft-cbow    0    0    0     0
ft-sg      3    6    3     5
bow        1    0    0     0
tf-idf     2    0    3     0

The training of the word embeddings happened in a variety of dimensions, from 10 up to 500, in order to investigate whether the results would be acceptable. In addition, we randomly selected a few combinations and applied 10-fold cross-validation, by hand, to determine that the results do not overfit, and the results were positive.
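As an illustration of this check (not the exact code used), scikit-learn's cross_val_score can run the 10-fold evaluation; the random matrix below merely stands in for document feature vectors:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 100))         # e.g. averaged 100-dimensional embeddings
y = rng.integers(0, 4, size=200)   # 4 class labels

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())  # similar scores across folds suggest no overfitting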

We will now present the results for each of our datasets. We will be using a combination of result tables and visualizations for the multiclass datasets. More specifically, we will be using the dominance matrix, a contingency-matrix-style table that shows the overall, accumulated performance of each algorithm, for each vectorization model.

A thorough representation of the retrieved results can be found in the Appendix.

6.1 Dataset: News Articles

6.1.1 Metrics

This was the smallest dataset of all, with 7,600 articles and 4 categories. Looking at the dominance matrix in Table 2, it is clear that the Skip-Gram model, specifically the one from FastText, outperforms all the other models almost every time, with the exception of Tf-Idf on SVC.

Similar results can be found when investigating Precision, Recall, and F1 score (see Table 3 for the F1 dominance matrix).

The most crucial finding here is, of course, that word embeddings seem to outperform the baseline (BoW/Tf-Idf without decomposition), even in a very low-dimensional space (for text documents at least), e.g. 10 or even 30 dimensions. Specifically, Random Forest and Naive Bayes have very important gains for a fraction of the original dimensions, making the model cheaper and far more effective. The plots (Figure 5 and Figure 6) show that from 10 to 500 dimensions, word embedding averaging is constantly higher than the best of the two distributional methods, Tf-Idf.

Table 3. F1 dominance matrix (News dataset)

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     1
ft-cbow    0    0    0     0
ft-sg      3    6    3     5
bow        0    0    0     0
tf-idf     3    0    3     0

Figure 5. Accuracy in Naive Bayes, for the News dataset.

Figure 6. Accuracy in Random Forest, for the News dataset.

6.1.2 Visual Metrics

Due to the fact that we have a multiclass dataset, it is quite interesting to investigate the confusion matrices and search for patterns. In this case, we chose as an example to investigate the results of Logistic Regression. It is very interesting that not only does the use of word embeddings (in this case Skip-Gram) improve on the Tf-Idf results, but also that the semantic meaning is kept after using the averaging vectorization method, as the error rate in classes that easily collide (in this case Business and Tech.) has not significantly (if at all) declined. Figure 7 and Figure 8 present these findings.

Figure 7. Tf-Idf results. Business and Tech are mixed (News dataset).

Figure 8. Skip-Gram results: overall improvement, but not on the classes that are semantically close (News dataset).

6.2 Dataset: IMDB Movie Reviews

This was a medium-sized, binary dataset, with 25,000 movie reviews. Looking at the dominance matrix in Table 4, it is not as clear as in the previous dataset exactly which model is the best. Tf-Idf provides a formidable baseline that cannot be surpassed in some cases. However, all in all, word embeddings seem to be quite successful here as well, especially Word2Vec embeddings, which did not perform as well before in comparison to FastText.

Table 4. Accuracy dominance matrix (IMDB Reviews)

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     1
ft-cbow    0    0    0     0
ft-sg      3    6    3     5
bow        1    0    0     0
tf-idf     2    0    3     0

Figure 9. Accuracy in Naive Bayes (IMDB Reviews dataset).

Figure 10. Accuracy in Random Forest (IMDB Reviews dataset).

Once again, word embeddings outperform the baseline, starting from as low as 30 dimensions, for Random Forest and Naive Bayes. Curiously, Word2Vec seems to be much more performant than before, even outperforming FastText in certain cases, as seen in Figures 9 and 10.

It is important to qualify the results. Although word embeddings immediately return better results than the baseline (no decomposition), the decomposition itself boosts the results of BoW and Tf-Idf. So, compared to the baseline, as well as to the dataset after the dimension reduction, word embeddings perform similarly well (with a slight loss), and outperform the distributional methods after around 300 dimensions.

Table 5. Accuracy dominance matrix (DBPedia Ontologies)

Model      LR   NB   SVC   RF
w2v-cbow   0    0    0     0
w2v-sg     0    0    0     0
ft-cbow    0    0    0     2
ft-sg      6    6    6     4
bow        0    0    0     0
tf-idf     0    0    0     0

6.3 Dataset: DBPedia Ontologies

The last dataset that we will use consists of the descriptions of certain DBPedia ontologies. These are short texts, of about a paragraph in length, that could describe anything, e.g. animals, plants, companies and villages. It is quite interesting to see what happens in such a diverse but also extended dataset. FastText dominated this dataset (Table 5), especially the Skip-Gram algorithm, although CBoW was also successful in some cases.

6.3.1 Confusion Matrix Comparison

Since we have a significant number of classes here, it makes sense to investigate whether our previous hypothesis, that word embeddings improve the overall performance but not in cases of semantic similarity, also appears here. For this purpose, we chose Naive Bayes at 30 dimensions, where Skip-Gram achieves a 0.876 accuracy rate and Tf-Idf is somewhat lower, at 0.815. The confusion matrices, since they are pretty big, can be found in the appendix.

The results were actually quite interesting. Word embeddings increased the overall performance, but in certain cases, where the semantic similarity was high, the performance worsened significantly. The classes affected were:

• Animal got a higher confusion percentage with Plant (both are life-forms); and

• Artist got confused with Album and WrittenWork (art).

It is quite clear that the averaging of the vectors creates some geometrical properties that are being used during the training of the classifier.

6.4 Time Measurements

The time measurements were largely as we expected. For both implementations, Skip-Gram is slower to train than CBoW, which is already known. It seems, though, that FastText is significantly slower than Word2Vec when using larger numbers of dimensions (e.g. around 100), something that needs to be taken into consideration.

Another important note here is that for larger numbers of dimensions, Tf-Idf is not significantly faster, so one could argue that training a word embedding model is the better choice, as it has the advantage that it can be retrained regularly, without having to use the whole dataset.

You can take a look at the time measurements in Table 6.

7 Conclusion and Future Work

The purpose of this thesis was to investigate and evaluate whether or not word embeddings can successfully be used in text classification. For the purposes of our experiments we used most of the important tools that are available to data scientists right now, including:

• the Gensim implementation of Word2Vec;

• the newest word embedding software from Facebook, FastText;

• some curated and balanced datasets.

Our results varied, depending on the dataset, the preprocessing and the classifier. However, we were able to identify some basic patterns that appeared consistently.

7.1 Evaluation of Word Embeddings

There is no question that Skip-Gram is much better than CBoW in their respective libraries, which seems to be in agreement with recent research from Levy and Goldberg [11]. It is also somewhat slower, and can be significantly slower depending on the architecture, especially in FastText. However, it seems to be worth it, with gains from 2%-5% and up to 20% with certain classifiers.

In addition, although FastText is better (significantly so, in some cases), it is not that much better most of the time. The increased training time that FastText requires, even with its multithreaded approach, makes Word2Vec the better choice, for all intents and purposes.

7.2 Evaluation of Classifiers

Generally speaking, word embeddings seem to provide huge gains with Bayesian algorithms, especially in lower dimensions. You can find the results in the Appendix. They also give a significant boost to Random Forests, even above the distributional baseline.

Logistic Regression and SVM are not that impressive, as word embeddings mostly underperform or perform minimally above the baseline. However, they provide steadily good results on all kinds of datasets, with a small number of dimensions (even 10 to 30 dimensions seem enough to get an acceptable accuracy).

A very interesting discovery, after looking into the multiclass confusion matrices, was that even if the accuracy is improved in general, there is a chance that for certain, semantically similar classes, it will worsen. This is most probably due to the fact that, since we have a predetermined number of dimensions, the averaging method that we use "clusters" the documents, as similar words provide a similar average.

Table 6. Time measurements for the training/vectorization of 2 datasets.

Embedding training time (sec.), for the 25,000-document dataset (left four columns) and the 70,000-document dataset (right four columns):

Dim   FT-CBOW   FT-SG    W2V-CBOW   W2V-SG   FT-CBOW   FT-SG    W2V-CBOW   W2V-SG
10    19.51     28.14    11.04      34.79    14.1      27.32    7.22       23.79
30    25.45     39.81    15.28      52.5     21.52     29.59    10.68      27.57
50    37.85     46.71    14.52      44.12    24.7      34.1     9.84       29.42
100   51.35     73.17    16.04      40.4     35.7      43.81    9.73       28.14
300   119.02    170.62   23.19      72.69    92.2      115.93   13.03      52.7
500   225.76    274.02   30.99      104.39   153.27    194      16.41      62.95

Vectorization time (sec.), for the same two datasets:

Dim   BoW      TF-IDF   BoW    TF-IDF
10    3.84     4.89     2.43   2.26
30    4.53     4.59     2.49   2.6
50    5.6      5.68     2.28   2.38
100   7.9      8.2      2.6    2.4
300   15.49    15.88    2.3    3.3
500   22.98    22.64    2.3    3.3

7.3 Dataset sizes and dimensions

It seems that the usage of word embeddings creates an early maximum, which does not improve much as the dimensions increase. This means that we can get acceptable results with a few dimensions, e.g. 10, but the returns are diminishing as we train the models with increasing dimensionality. Sometimes performance even worsens.

On the other hand, it seems that the size of the dataset plays a role, but not a very important one. Although the highest results that we got using word embeddings were clearly on the large dataset (95.9%), the small dataset achieved a very satisfactory 85.1% with just 100 dimensions, which is clearly important.

7.4 Future work

The aforementioned results open the road for some exciting new research opportunities. Apart from the logical next step, which is to use larger, more diverse datasets and different word embedding tools (e.g. GloVe), we would like to see what other ways of using the word vectors exist, and how successful they would be for the same issues.

One idea is to use a Tf-Idf weighting scheme instead of the simple averaging of word embeddings. Next, we should investigate the usage of certain neural network options, like CNNs and RNNs, which have been shown to provide excellent results, without having to deal with the vectorization issue that we had.

Last, right now the topic of "something2vec" is quite hot, and new research appears all the time, some of it quite specific, e.g. tweet2vec [7]. It would be interesting to see if specific problems could be solved with domain-specific corpora, without having to implement and train a domain- and problem-specific neural network.

References

[1] Charu C. Aggarwal and Cheng Xiang Zhai. 2012. Mining Text Data. Springer Publishing Company, Incorporated.

[2] N. S. Altman. 1992. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46, 3 (1992), 175–185. https://doi.org/10.2307/2685209

[3] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When Is "Nearest Neighbor" Meaningful?. In Proceedings of the 7th International Conference on Database Theory (ICDT '99). Springer-Verlag, London, UK, 217–235. http://dl.acm.org/citation.cfm?id=645503.656271

[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). http://arxiv.org/abs/1607.04606

[5] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108–122.

[6] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

[7] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2Vec: Character-Based Distributed Representations for Social Media. CoRR abs/1605.03481 (2016). http: //arxiv.org/abs/1605.03481

[8] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing In Science & Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/ MCSE.2007.55

[9] I.T. Jolliffe. 1986. Principal Component Analysis. Springer Verlag.


[10] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR abs/1607.01759 (2016). http://arxiv.org/abs/1607.01759

[11] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding As Implicit Matrix Factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). MIT Press, Cambridge, MA, USA, 2177–2185. http://dl.acm.org/citation.cfm?id=2969033.2969070

[12] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 63–70. https://doi.org/10.3115/1118108.1118117

[13] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. Curran Associates, Inc., 3111–3119.

[15] Reiichiro Nakano. 2017. reiinakano/scikit-plot: 0.2.6. Zenodo. (2017). https://github.com/reiinakano/scikit-plot

[16] Charles E. Osgood. 1952. The nature and measurement of meaning. Psychological Bulletin 49, 3 (1952), 197–237.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[18] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.

[19] G. Salton, A. Wong, and C. S. Yang. 1975. A Vector Space Model for Automatic Indexing. Commun. ACM 18, 11 (Nov. 1975), 613–620. https://doi.org/10.1145/361219.361220

[20] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. 2013. Minimizing Finite Sums with the Stochastic Average Gradient. CoRR abs/1309.2388 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1309.html#SchmidtRB13

[21] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. (2008).

[22] S. H. Walker and D. B. Duncan. 1967. Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 1 (June 1967), 167–179. http://view.ncbi.nlm.nih.gov/pubmed/6049533

[23] Wikipedia. 2017. Bag of Words model - Wikipedia. (2017). https://en.wikipedia.org/wiki/Bag-of-words_model [Online; accessed 14-June-2017].

[24] Wikipedia. 2017. Decision tree learning - Wikipedia. (2017). https://en.wikipedia.org/wiki/Decision_tree_learning [Online; accessed 11-June-2017].

[25] Harry Zhang. 2004. The Optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004).

[26] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. CoRR abs/1509.01626 (2015). http://arxiv.org/abs/1509.01626


Appendix: Tables and Figures

Table 7. IMDB Results for Logistic Regression and Naive Bayes

Dim. / Vec. | Logistic Regression: Accuracy, Prec (ma), Rec (ma), F1 (ma) | Naive Bayes: Accuracy, Prec (ma), Rec (ma), F1 (ma)

10 BoW      0.72247 0.71921 0.72247 0.71237 | 0.64617 0.69402 0.64617 0.65195
10 TF-IDF   0.69953 0.70551 0.69953 0.69201 | 0.65973 0.68739 0.65973 0.66107
10 w2v-cbow 0.76921 0.76638 0.76921 0.76387 | 0.73669 0.74867 0.73669 0.73614
10 ft-cbow  0.85764 0.85655 0.85764 0.85523 | 0.83226 0.83288 0.83226 0.83058
10 w2v-sg   0.75299 0.74731 0.75299 0.74659 | 0.73461 0.73752 0.73461 0.73086
10 ft-sg    0.85219 0.85058 0.85219 0.84928 | 0.83104 0.83267 0.83104 0.82915
30 BoW      0.87716 0.87625 0.87716 0.87645 | 0.78426 0.7936 0.78426 0.78453
30 TF-IDF   0.8876 0.88826 0.8876 0.88748 | 0.8159 0.83375 0.8159 0.81969
30 w2v-cbow 0.88067 0.87998 0.88067 0.88009 | 0.8014 0.81374 0.8014 0.80222
30 ft-cbow  0.92 0.91961 0.92 0.91954 | 0.87303 0.87717 0.87303 0.87289
30 w2v-sg   0.8929 0.89218 0.8929 0.89175 | 0.83607 0.84094 0.83607 0.83466
30 ft-sg    0.9299 0.92952 0.9299 0.92943 | 0.87677 0.87855 0.87677 0.8756
50 BoW      0.9025 0.90204 0.9025 0.90204 | 0.78346 0.79723 0.78346 0.78526
50 TF-IDF   0.9117 0.91166 0.9117 0.91152 | 0.83966 0.84439 0.83966 0.83968
50 w2v-cbow 0.89941 0.8988 0.89941 0.89891 | 0.81869 0.82892 0.81869 0.81916
50 ft-cbow  0.93111 0.93073 0.93111 0.93079 | 0.88037 0.88373 0.88037 0.88017
50 w2v-sg   0.91789 0.91727 0.91789 0.91726 | 0.84797 0.8525 0.84797 0.84722
50 ft-sg    0.94051 0.94017 0.94051 0.94018 | 0.88491 0.88717 0.88491 0.88433
100 BoW      0.91904 0.91871 0.91904 0.91873 | 0.75256 0.77877 0.75256 0.75203
100 TF-IDF   0.92704 0.92706 0.92704 0.92692 | 0.85651 0.85867 0.85651 0.8565
100 w2v-cbow 0.91206 0.91157 0.91206 0.91171 | 0.82267 0.83178 0.82267 0.823
100 ft-cbow  0.93971 0.93941 0.93971 0.93946 | 0.88206 0.88615 0.88206 0.88187
100 w2v-sg   0.93619 0.93581 0.93619 0.93585 | 0.85811 0.86491 0.85811 0.85811
100 ft-sg    0.95023 0.94998 0.95023 0.95 | 0.8903 0.89255 0.8903 0.88986
300 BoW      0.94297 0.94291 0.94297 0.94288 | 0.65029 0.72622 0.65029 0.65413
300 TF-IDF   0.9434 0.94316 0.9434 0.94313 | 0.85133 0.85289 0.85133 0.85115
300 w2v-cbow 0.91463 0.9142 0.91463 0.91432 | 0.82376 0.83341 0.82376 0.82452
300 ft-cbow  0.94111 0.94081 0.94111 0.94086 | 0.88256 0.88647 0.88256 0.88244
300 w2v-sg   0.94746 0.94742 0.94746 0.94736 | 0.85966 0.86561 0.85966 0.85951
300 ft-sg    0.95623 0.95604 0.95623 0.95605 | 0.88873 0.89201 0.88873 0.88849
500 BoW      0.94957 0.94966 0.94957 0.94951 | 0.57629 0.68072 0.57629 0.58711
500 TF-IDF   0.95311 0.95307 0.95311 0.953 | 0.83304 0.83601 0.83304 0.83332
500 w2v-cbow 0.91207 0.91161 0.91207 0.91172 | 0.81413 0.82522 0.81413 0.81483
500 ft-cbow  0.94037 0.94008 0.94037 0.94012 | 0.88261 0.8865 0.88261 0.8824
500 w2v-sg   0.94213 0.94187 0.94213 0.94187 | 0.8607 0.8678 0.8607 0.86051
500 ft-sg    0.95554 0.95535 0.95554 0.95537 | 0.88696 0.89086 0.88696 0.88667


Table 8. Naive Bayes and Random Forest for News Dataset

Dim. / Vec. | Naive Bayes: Accuracy, Prec (ma), Rec (ma), F1 (ma) | Random Forest: Accuracy, Prec (ma), Rec (ma), F1 (ma)

MAX BoW      0.78289 0.79344 0.78289 0.78458 | 0.81184 0.81052 0.81184 0.81042
MAX TF-IDF   0.76711 0.76786 0.76711 0.76735 | 0.77237 0.77078 0.77237 0.77068
10 BoW      0.52592 0.56142 0.52592 0.51306 | 0.70592 0.70324 0.70592 0.70372
10 TF-IDF   0.65421 0.67704 0.65421 0.64795 | 0.78829 0.78778 0.78829 0.78717
10 w2v-cbow 0.65461 0.66344 0.65461 0.64392 | 0.72132 0.71794 0.72132 0.7184
10 ft-cbow  0.62118 0.63319 0.62118 0.60097 | 0.72184 0.71953 0.72184 0.71964
10 w2v-sg   0.79474 0.79415 0.79474 0.79285 | 0.82224 0.82145 0.82224 0.8216
10 ft-sg    0.83105 0.82955 0.83105 0.8297 | 0.83474 0.83411 0.83474 0.83428
30 BoW      0.58592 0.63953 0.58592 0.58087 | 0.74184 0.74067 0.74184 0.7398
30 TF-IDF   0.67053 0.71837 0.67053 0.67194 | 0.80513 0.80449 0.80513 0.80412
30 w2v-cbow 0.64013 0.64387 0.64013 0.62944 | 0.72763 0.72489 0.72763 0.72537
30 ft-cbow  0.56066 0.58333 0.56066 0.53501 | 0.72474 0.72249 0.72474 0.72238
30 w2v-sg   0.80921 0.80758 0.80921 0.80752 | 0.83079 0.83022 0.83079 0.8299
30 ft-sg    0.82947 0.82791 0.82947 0.82818 | 0.83329 0.83265 0.83329 0.83264
50 BoW      0.59592 0.64861 0.59592 0.59413 | 0.73513 0.7345 0.73513 0.73349
50 TF-IDF   0.66461 0.70563 0.66461 0.66555 | 0.80605 0.80578 0.80605 0.80539
50 w2v-cbow 0.63579 0.6439 0.63579 0.6274 | 0.73197 0.72933 0.73197 0.72949
50 ft-cbow  0.54829 0.56786 0.54829 0.52126 | 0.73289 0.72995 0.73289 0.73017
50 w2v-sg   0.80697 0.80546 0.80697 0.80511 | 0.83237 0.83155 0.83237 0.83175
50 ft-sg    0.83171 0.83032 0.83171 0.83063 | 0.83789 0.83732 0.83789 0.83745
100 BoW      0.59974 0.65004 0.59974 0.59964 | 0.72013 0.71969 0.72013 0.71822
100 TF-IDF   0.65342 0.69429 0.65342 0.65385 | 0.79908 0.79898 0.79908 0.7981
100 w2v-cbow 0.60592 0.61194 0.60592 0.59694 | 0.72566 0.72273 0.72566 0.72311
100 ft-cbow  0.52355 0.55938 0.52355 0.50034 | 0.73382 0.73119 0.73382 0.73151
100 w2v-sg   0.80974 0.80825 0.80974 0.80779 | 0.83105 0.83031 0.83105 0.8304
100 ft-sg    0.82145 0.81997 0.82145 0.82003 | 0.83803 0.83731 0.83803 0.83751
300 BoW      0.59211 0.6394 0.59211 0.59415 | 0.67658 0.67531 0.67658 0.6735
300 TF-IDF   0.65408 0.68661 0.65408 0.65355 | 0.77447 0.77336 0.77447 0.77293
300 w2v-cbow 0.56039 0.57191 0.56039 0.54825 | 0.71789 0.71508 0.71789 0.71582
300 ft-cbow  0.465 0.48974 0.465 0.4411 | 0.72329 0.721 0.72329 0.7
300 w2v-sg   0.80671 0.80528 0.80671 0.80474 | 0.83737 0.83692 0.83737 0.83679
300 ft-sg    0.81921 0.81744 0.81921 0.81772 | 0.83211 0.83169 0.83211 0.83172
500 BoW      0.58132 0.62546 0.58132 0.58298 | 0.65184 0.65211 0.65184 0.64961
500 TF-IDF   0.65697 0.68577 0.65697 0.65601 | 0.75276 0.75242 0.75276 0.75127
500 w2v-cbow 0.53421 0.54084 0.53421 0.51946 | 0.70974 0.70661 0.70974 0.70696
500 ft-cbow  0.44487 0.47991 0.44487 0.4292 | 0.70618 0.70378 0.70618 0.70375
500 w2v-sg   0.80566 0.80431 0.80566 0.80373 | 0.82803 0.82726 0.82803 0.8275
500 ft-sg    0.81289 0.81128 0.81289 0.81116 | 0.83197 0.83156 0.83197 0.83163


Figure 11. Tf-Idf in Naive Bayes (DBPedia dataset).

Figure 12. SG in Naive Bayes (DBPedia dataset).


Table 9. DBPedia Dataset

Dim. / Vec. | Naive Bayes: Accuracy, Prec (ma), Rec (ma), F1 (ma) | Linear SVC: Accuracy, Prec (ma), Rec (ma), F1 (ma)

10 BoW      0.64617 0.69402 0.64617 0.65195 | 0.71024 0.70619 0.71024 0.68798
10 TF-IDF   0.65973 0.68739 0.65973 0.66107 | 0.69941 0.69368 0.69941 0.67798
10 w2v-cbow 0.73669 0.74867 0.73669 0.73614 | 0.7614 0.76026 0.7614 0.75214
10 ft-cbow  0.83226 0.83288 0.83226 0.83058 | 0.85403 0.85332 0.85403 0.85002
10 w2v-sg   0.73461 0.73752 0.73461 0.73086 | 0.74541 0.73856 0.74541 0.73158
10 ft-sg    0.83104 0.83267 0.83104 0.82915 | 0.84886 0.84765 0.84886 0.84389
30 BoW      0.78426 0.7936 0.78426 0.78453 | 0.87717 0.87578 0.87717 0.87598
30 TF-IDF   0.8159 0.83375 0.8159 0.81969 | 0.89477 0.89415 0.89477 0.89411
30 w2v-cbow 0.8014 0.81374 0.8014 0.80222 | 0.87943 0.8786 0.87943 0.87861
30 ft-cbow  0.87303 0.87717 0.87303 0.87289 | 0.91984 0.91941 0.91984 0.91926
30 w2v-sg   0.83607 0.84094 0.83607 0.83466 | 0.89226 0.89174 0.89226 0.89078
30 ft-sg    0.87677 0.87855 0.87677 0.8756 | 0.92979 0.9294 0.92979 0.92925
50 BoW      0.78346 0.79723 0.78346 0.78526 | 0.9021 0.9014 0.9021 0.90146
50 TF-IDF   0.83966 0.84439 0.83966 0.83968 | 0.91859 0.91809 0.91859 0.91818
50 w2v-cbow 0.81869 0.82892 0.81869 0.81916 | 0.89914 0.89841 0.89914 0.89845
50 ft-cbow  0.88037 0.88373 0.88037 0.88017 | 0.93119 0.93077 0.93119 0.9308
50 w2v-sg   0.84797 0.8525 0.84797 0.84722 | 0.91869 0.91804 0.91869 0.91799
50 ft-sg    0.88491 0.88717 0.88491 0.88433 | 0.94073 0.94036 0.94073 0.94036
100 BoW      0.75256 0.77877 0.75256 0.75203 | 0.91883 0.91842 0.91883 0.91843
100 TF-IDF   0.85651 0.85867 0.85651 0.8565 | 0.93333 0.93308 0.93333 0.9331
100 w2v-cbow 0.82267 0.83178 0.82267 0.823 | 0.91961 0.91918 0.91961 0.91929
100 ft-cbow  0.88206 0.88615 0.88206 0.88187 | 0.94219 0.9419 0.94219 0.94191
100 w2v-sg   0.85811 0.86491 0.85811 0.85811 | 0.93916 0.93881 0.93916 0.93882
100 ft-sg    0.8903 0.89255 0.8903 0.88986 | 0.95119 0.95092 0.95119 0.95095
300 BoW      0.65029 0.72622 0.65029 0.65413 | 0.9429 0.94288 0.9429 0.94282
300 TF-IDF   0.85133 0.85289 0.85133 0.85115 | 0.95066 0.95049 0.95066 0.95047
300 w2v-cbow 0.82376 0.83341 0.82376 0.82452 | 0.92766 0.92732 0.92766 0.92738
300 ft-cbow  0.88256 0.88647 0.88256 0.88244 | 0.94704 0.94679 0.94704 0.9468
300 w2v-sg   0.85966 0.86561 0.85966 0.85951 | 0.95364 0.9535 0.95364 0.95351
300 ft-sg    0.88873 0.89201 0.88873 0.88849 | 0.95844 0.95827 0.95844 0.95828
500 BoW      0.57629 0.68072 0.57629 0.58711 | 0.94881 0.94882 0.94881 0.94876
500 TF-IDF   0.83304 0.83601 0.83304 0.83332 | 0.95066 0.95047 0.95066 0.95044
500 w2v-cbow 0.81413 0.82522 0.81413 0.81483 | 0.92597 0.92559 0.92597 0.92567
500 ft-cbow  0.88261 0.8865 0.88261 0.8824 | 0.94699 0.94675 0.94699 0.94675
500 w2v-sg   0.8607 0.8678 0.8607 0.86051 | 0.95866 0.95852 0.95866 0.95853
500 ft-sg    0.88696 0.89086 0.88696 0.88667 | 0.9592 0.95902 0.9592 0.95905
