AN EVALUATION OF THE WORD MOVER’S DISTANCE AND THE CENTROID METHOD IN THE PROBLEM OF DOCUMENT CLUSTERING

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

JENNY TRUONG
11413581

MASTER INFORMATION STUDIES: DATA SCIENCE TRACK
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

2017-07-07

Internal Supervisor: Prof. Dr. Maarten de Rijke, University of Amsterdam
External Supervisor: Dr. Georgios Tsatsaronis, Elsevier

“Learning is experience. Everything else is just information.”


ABSTRACT

Neural network derived distributed word representations, also known as word embeddings, have recently attracted serious attention in and outside the NLP domain. Since Bengio et al. [1] proposed their version of a neural network language model (NNLM) in 2003, such models have been used in a wide range of text applications, including sentiment classification [2–4], part-of-speech tagging [5], named entity recognition [6, 7], parsing [8], semantic role labeling [9], and machine translation [10]. Inspired by their recent success in numerous word similarity and analogy-detection tasks, many studies have tried to utilize word embeddings in the problem of document similarity. A simple method is the use of the average embedding of a document as its vector representation. A more sophisticated approach is the Word Mover's Distance (WMD) [11], which measures the distance between two documents while taking word embeddings into account. The present research evaluates the two methods in the context of document clustering, or to be precise, in the clustering of scientific papers. Our findings revealed that the use of word embeddings can generally improve the computation of document similarities. Although the WMD is computationally more complex than using the average or a weighted average embedding as document representation, we cannot clearly report an improvement in performance. However, throughout the analysis, we observed a strong influence of the training examples on the performance of both methods. Our findings clearly show that in-domain training of the word embeddings can drastically improve the process of document clustering.


DECLARATION OF AUTHORSHIP

I, Jenny Truong, declare that this thesis titled, ’An Evaluation of the Word Mover’s Distance and the Centroid Method in the Problem of Document Clustering’, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: Date:


This document is dedicated to mum and dad. Thank you.


ACKNOWLEDGEMENTS

I would first like to acknowledge my thesis supervisor Prof. Dr. Maarten de Rijke for giving me valuable advice and steering me in the right direction whenever I needed it. I appreciate your time and effort.

I would also like to express my sincere gratitude to Dr. Georgios Tsatsaronis. Thank you for trusting in me. Your engagement, good advice and inspirational talks have made the internship at Elsevier a memorable time. I wish you all the best.

I would also like to thank my colleagues Deep, Sophia, Jan Jaap, Vincent and Ilias for the support and enjoyable company.

Finally, I must express my very profound gratitude to my family, to Michael and my lovely flatmates for providing me with unfailing support and continuous encouragement throughout the last year and through the process of this thesis.

This accomplishment would not have been possible without you.

Jenny Truong

TABLE OF CONTENTS

Declaration
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction

2 Literature Review
  2.1 Distributed Word Representations
    2.1.1 Topic Modelling
    2.1.2 Distributed Word Representations
    2.1.3 Comparison of Language Models
    2.1.4 The Influence of Training Examples
    2.1.5 Neural Network inspired Document Representations
  2.2 Document Similarity
    2.2.1 Pairwise Distance Measures for Text
    2.2.2 Word Mover’s Distance

3 Theoretical Background
  3.1 Language Models for Distributed Text Representation
    3.1.1 The Skip-Gram Model and Negative Sampling
    3.1.2 The Continuous-Bag-of-Words Model
    3.1.3 The Global Vector Model
  3.2 Methods
    3.2.1 Centroid Method
    3.2.2 Word Mover’s Distance

4 Experimental Setup
  4.1 Research Objectives
  4.2 List of Data Sets
  4.3 Word Embeddings
    4.3.1 Pre-trained Word Embeddings
    4.3.2 Domain-specific trained Word Embeddings
  4.4 Baselines
  4.5 Clustering Algorithms
    4.5.1 K-means
    4.5.2 K-medoids
    4.5.3 Complete-Linkage
    4.5.4 Ward Linkage
    4.5.5 Density-Based Spatial Clustering of Applications with Noise
  4.6 Cluster Validity
    4.6.1 BCubed F-measure
    4.6.2 Silhouette Score

5 Results
  5.1 Preliminaries
    5.1.1 Computational Cost
    5.1.2 Thresholds
    5.1.3 DBSCAN
  5.2 Influence of Different Training Models and Examples
    5.2.1 The Influence of Training Models
    5.2.2 The Influence of Training Examples
  5.3 Comparison of Methods
    5.3.1 Comparison of the WMD and the Centroid Approach
    5.3.2 Overall Comparison

6 Limitations and Conclusion
  6.1 Limitations and Future Research
  6.2 Conclusion

LIST OF TABLES

4.1 Table of Data Sets
4.2 List of Word Embeddings
5.1 Comparison of the WMD and the Centroid Approach
A.1 Table of results: Reuters R8
A.2 Table of results: Arxiv

LIST OF FIGURES

5.1 Top Scores: Arxiv
A.1 Document frequency thresholds for Arxiv
A.2 Visualization of top 500 words from the Arxiv set using pre-trained embeddings
A.3 Visualization of top 500 words from the Arxiv set using domain-specific trained embeddings
A.4 Reuters R8: Comparison of training models (F-Measures)
A.5 Reuters R8: Comparison of training models (Silhouette Score)
A.6 Arxiv: Comparison of training models (F-Measures)


CHAPTER 1

INTRODUCTION

The clustering of documents has been the subject of a broad range of scientific work in the natural language processing (NLP) literature [12–16]. Essentially, clustering aims to discover underlying structure in data and thus fills an important role in a wide range of text applications, including document organization and browsing [17], corpus summarization [18], and document classification [19]. A crucial part within academia is occupied by the study of extraction and representation methods that are able to capture intrinsic characteristics of text objects [16]. Most commonly, a set of documents is either presented as Bag-of-Words (BOW) [20] or as term frequency-inverse document frequency (TFIDF) [21] vectors that are embedded in a joint vector space model [22]. However, despite their popularity, these models have two substantial shortcomings: first, the document vectors tend to be sparse and high-dimensional; second, the discrete representation of words impedes the capture of potential semantic and syntactic relations between them, which results in a deficiency in identifying word relations such as homonyms, synonyms, and antonyms.

In a related field, neural network inspired language models have recently attracted serious attention from many researchers in and outside the NLP domain. In these methods, distributed word representations, denoted as word embeddings, are learned during prediction tasks with the result that semantically similar words have similar embeddings [1]. Due to shallow model architectures and training process simplifications, such as negative sampling [23], very large sets of texts can be processed, which allows the models to learn complex word relations. Since then, many studies have suggested that word embeddings outperform more traditional count-based methods in a wide range of word-similarity and analogy-detection tasks [24–26]. In this context, count-based methods refer to embeddings that are derived from word co-occurrence statistics within a document collection [27].

Inspired by their success, many researchers have tried to utilize word embeddings in the problem of document similarity [27–29]. A simple method is, for instance, the use of the average embedding of a document as its vector representation [30]. A more sophisticated approach is the Word Mover's Distance (WMD) [11], which measures the distance between two documents while taking word embeddings into account. The authors tested the WMD in several standard document classification tasks and suggest that it outperforms common state-of-the-art baselines. However, it is still unclear how the WMD performs in comparison to other methods involving word embeddings. There is also a lack of knowledge about how the WMD behaves when using different training models and parameters to obtain word embeddings.

This research gap motivated the present study. It aimed to investigate how the WMD performs in comparison to the more simplistic approach of averaging word embeddings. Since the two methods essentially rely on the quality of word embeddings, we further examined the influence of different training models and training examples. The computation of document similarities concerns a broad variety of NLP tasks. However, since word embeddings have been well studied in the context of text classification, this research addressed its objectives within the problem of document clustering. To be precise, we tested three widely-used training models in combination with two methods to compute inter-document distances, using five different clustering algorithms. The performances were evaluated using both extrinsic and intrinsic validation measures.

This paper is structured as follows: the next chapter briefly describes prior and related research work. It introduces methods of text representation and ties them to the problem of computing document similarities in the context of word embeddings. In Chapter 3, three widely-used training models to obtain word embeddings and two methods that utilize these embeddings to compute document similarities are outlined. This is followed by the description of the experimental setup of this research, a discussion of the results, and the limitations of the findings.


CHAPTER 2

LITERATURE REVIEW

2.1 Distributed Word Representations

2.1.1 Topic Modelling

A research field closely related to word embeddings is topic modelling [31, 32], which describes the study of methods to capture the semantic content of text documents. Essentially, topic models aim to leverage the vector space model [20, 22] with the result that documents are embedded as distributions over latent semantic features that are inherent in a collection. Popular methods are Latent Semantic Analysis (LSA)¹ [34, 35] and non-negative matrix factorization (NMF) [36]. These methods typically apply matrix decomposition to word co-occurrence statistics in order to obtain dense document embeddings. LDA [37] is often referred to as a Bayesian probabilistic variant in topic modelling. It learns latent topics as probabilistic distributions over words in order to present each document again as a probabilistic distribution over the learned topics [31]. According to Kusner et al. [11], the LSI and LDA representations, however, perform worse than the WMD in their experimental setup.

2.1.2 Distributed Word Representations

As we can see, the use of contextual information to derive semantic approximations of words has a long tradition in NLP research. The notion is based on the distributional hypothesis, which claims that semantically similar words tend to occur in similar contexts [38–40]. Count-based methods refer to methods that typically involve statistical operations on word-context co-occurrence matrices in order to obtain word embeddings [27, 39, 41]. Prediction-based methods, in contrast, refer to neural network inspired language models that only consider local substructures of text [1, 7, 23, 42]. Since Bengio et al. [1] proposed their version of a neural network language model (NNLM) in 2003, such models have been used in a wide range of text applications, including sentiment classification [2–4], part-of-speech tagging [5], named entity recognition [6, 7], parsing [8], semantic role labeling [9], and machine translation [10].

¹ Latent Semantic Analysis is often used as a synonym of Latent Semantic Indexing (LSI) [33], a method that originates from Information Retrieval research.


2.1.3 Comparison of Language Models

Several studies have been conducted in which the performances of different language models have been evaluated [27, 43–45]. Both the Skip-gram (SG) and the Continuous-Bag-of-Words (CBOW) models [23, 46, 47] have recently gained much popularity among NLP researchers, with studies suggesting that the models outperform count-based variants on numerous word similarity and analogy detection tasks by non-trivial margins [23, 27]. However, Levy and Goldberg [48] argue that the two models essentially resemble count-based methods by pointing out that the Skip-gram model with negative sampling (SGNS) implicitly factorizes a word-context pointwise mutual information (PMI) matrix. In fact, the authors observed that much of the performance gain attributed to neural network language models can rather be explained by certain design choices and hyper-parameter settings than by the model itself; and when transferring these design choices to count-based methods, similar performances can be achieved [45]. Congruently, Pennington et al. [41] state that the matrix-based global vector model (GloVe), which incorporates global co-occurrence statistics while still retaining linear substructures, surpasses the Skip-gram and CBOW models in their experiments. Yet, Lai et al. [44] found that these positive results can also be due to differences in training parameters. They observe that in Pennington et al.'s [41] experimental set-up, word embeddings trained by the Skip-gram and CBOW models were obtained after one iteration, while 25 iterations were applied for the GloVe model [44].

2.1.4 The Influence of Training Examples

Less attention has been paid to the influence of training examples. Mikolov et al. [23] found that training the CBOW model on a bigger corpus can lead to higher-quality word embeddings. However, some studies conclude that the content of the training corpus might be more relevant than the number of training examples [44, 49–51].

2.1.5 Neural Network inspired Document Representations

Inspired by the recent success of learning word embeddings by means of a simplistic neural network architecture, Le and Mikolov [52] expanded the idea in order to learn document representations. Instead of predicting the context of a word, their model is designed to predict words in a document. The embeddings are then trained via stochastic gradient descent and backpropagation [53]. Accordingly, a handful of models have been proposed that learn new document representations on top of word embeddings [29, 54–57].

The present research, in contrast, evaluated the direct use of word embeddings for computing document similarities in the problem of document clustering. In this context, the influence of different training methods and examples was further examined. For the analysis, the aforementioned Skip-gram, CBOW and GloVe models were selected as language models due to their popularity and state-of-the-art performances. A more detailed description of each model can be found in Chapter 3.1.

2.2 Document Similarity

2.2.1 Pairwise Distance Measures for Text

The clustering of documents requires a similarity measure that reflects the degree of closeness of each document pair [58, 59]. Huang [60] compared several similarity measures for the problem of document clustering. However, she could not find a significant performance difference between the cosine similarity and more sophisticated measures such as the Jaccard Coefficient, Pearson Correlation Coefficient, and averaged Kullback-Leibler Divergence. The Euclidean distance is the most widely-used distance measure in general. It measures the magnitude of the distance between two vectors. In contrast, the cosine similarity refers to the orientation, or rather to the cosine of the angle between the two vectors. Thus, it provides a similarity measure that is independent of text length [61].

2.2.2 Word Mover’s Distance

The concept of the WMD [11] is derived from the Earth Mover’s Distance (EMD), also known as the Wasserstein metric [62]. In the computer science literature, the EMD was first examined in the context of image retrieval applications [63]. A similar approach to the WMD can be found in Wan [64], in which documents are first decomposed into a set of subtopics. After that, the EMD is used to measure the cost of transforming the subtopics of one document into the subtopics of another document. However, the WMD is the first distance metric that utilizes the concept of the EMD as a pairwise distance measure for documents represented as word embeddings. A supervised variant of the WMD [65] was proposed in 2016. Since then, the WMD has been applied in several studies [66, 67].


CHAPTER 3

THEORETICAL BACKGROUND

3.1 Language Models for Distributed Text Representation

This section briefly describes three widely-used language models to train word embeddings and two methods that incorporate word embeddings to compute text similarities. The described models and methods are used afterwards in the experimental setup outlined in Chapter 4.

3.1.1 The Skip-Gram Model and Negative Sampling

The Skip-gram (SG) model [23] is a two-layer neural network language model that is designed to predict a set of target words from a context representation, which is the embedding of one input word. The target words are predicted using simple logistic regressions applied on the context representations. Given a vocabulary, a word is assigned to its embedding via one-hot encoding. Thus, the first layer acts as a look-up table. The embeddings are learned via back-propagation of the prediction error gradient [53]. Negative sampling [47] describes a simplified alternative to the hierarchical softmax approach, in which only the embeddings of k negative samples per positive sample are updated. According to the authors, 5-20 negative samples are necessary for smaller training sets and 2-5 are sufficient when the training set is of larger size.

3.1.2 The Continuous-Bag-of-Words Model

Congruent with the Skip-gram model, the Continuous-Bag-of-Words (CBOW) model [23] aims to predict target words from a context representation. Instead of using one word, the CBOW model averages the embeddings of a pre-defined window of words and uses this average as the context representation in order to predict one target word.

Neither model maintains word order information, which can result in some critical information loss [68]. A more detailed description of both models can be found in Mikolov et al. [23, 46, 47] and in Goldberg and Levy [69]. Models that consider word order can be found in Bengio et al. [1] and Mnih et al. [70].

3.1.3 The Global Vector Model

The matrix-based Global Vector (GloVe) model [41] incorporates global co-occurrence statistics while still maintaining linear substructures as in neural-network language models. The model is essentially a log-bilinear regression model that learns the embeddings on the non-zero entries of a word co-occurrence matrix, with the result that the dot product of two word vectors approximates the logarithm of their co-occurrence probability. Although training on only the non-zero entries is efficient, obtaining the word co-occurrence matrix can be computationally very costly, depending on the corpus size.

3.2 Methods

Let $D = \{d_1, \ldots, d_n\}$ be a set of documents, $V = \{v_1, \ldots, v_m\}$ the vocabulary of $D$, and $W \in \mathbb{R}^{d \times m}$ a matrix of $m$ $d$-dimensional word embeddings, wherein the $i$th column is the embedding $\vec{w}_i \in \mathbb{R}^d$ of the $i$th word in the vocabulary.

3.2.1 Centroid Method

A common method found in the literature is the use of the mean embedding of a document, denoted as its centroid, $\vec{\mu} = \frac{1}{|d|} \sum_{w \in d} \vec{w}$, as its representative vector. Additionally, the mean can be weighted by tf-idf weights [21], which discount the frequency of a word in a document by its appearance in the entire data set, $tfidf(d, v) = tf(d, v) \times \log \frac{|D|}{df(v)}$. The centroid method enables the use of common distance measures, such as the cosine measure, to compute the similarity of documents:

$$\cos(\sphericalangle(\vec{d}_1, \vec{d}_2)) = \frac{\vec{d}_1 \cdot \vec{d}_2}{\|\vec{d}_1\| \, \|\vec{d}_2\|} \qquad (3.1)$$
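To make the centroid method concrete, the following is a minimal sketch in Python of the plain and the tf-idf weighted centroid, compared with the cosine measure of equation (3.1). The toy embedding dictionary and idf weights are illustrative placeholders, not part of the original experiments.

```python
import numpy as np

vectors = {"oil": np.array([0.9, 0.1]), "price": np.array([0.4, 0.6])}   # toy embeddings
idf = {"oil": 1.2, "price": 0.7}                                         # toy idf weights

def centroid(tokens, vectors, idf=None):
    # stack the embeddings of all in-vocabulary tokens
    embs = np.array([vectors[t] for t in tokens if t in vectors])
    if idf is None:
        return embs.mean(axis=0)                                  # plain centroid
    w = np.array([idf[t] for t in tokens if t in vectors])
    return (embs * w[:, None]).sum(axis=0) / w.sum()              # tf-idf weighted centroid

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))        # equation (3.1)

d1 = centroid(["oil", "price"], vectors)
d2 = centroid(["oil", "price"], vectors, idf)
print(cosine_similarity(d1, d2))
```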

3.2.2 Word Mover’s Distance

The WMD [11] between documents $d_1$ and $d_2$ is defined as the minimum cumulative cost to ’move’ all words from $\vec{d}_1$ to $\vec{d}_2$, where $\vec{d}_1$ and $\vec{d}_2$ are the normalized bow-vectors of the documents. The cost of moving word $i$ to word $j$ is the sum of squared errors between their embeddings, times $T_{i,j}$, which denotes the portion of word $i$ in $\vec{d}_1$ that needs to be moved to word $j$ in $\vec{d}_2$. Portions are used since the documents can have different lengths:

$$WMD(d_1, d_2) = \min_{T \geq 0} \sum_{i,j=1}^{n} T_{i,j} \times \sum_{x=1}^{d} (\vec{w}_{i,x} - \vec{w}_{j,x})^2, \quad \text{with } T \in \mathbb{R}^{n \times n} \qquad (3.2)$$
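As an illustration of equation (3.2), the sketch below casts the WMD between two toy documents as a linear transport problem and solves it with SciPy; the embeddings and bag-of-words vectors are made up for illustration, and a production implementation would rather use a dedicated EMD/WMD solver.

```python
import numpy as np
from scipy.optimize import linprog

W = np.array([[0.0, 0.1], [1.0, 0.9], [0.2, 0.3]])   # 3 words, 2-dim toy embeddings
d1 = np.array([0.5, 0.5, 0.0])                        # normalised bow vector of document 1
d2 = np.array([0.0, 0.5, 0.5])                        # normalised bow vector of document 2

n = len(d1)
# cost of moving word i to word j: squared Euclidean distance between their embeddings
cost = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1).reshape(-1)

# constraints: outgoing flow of word i equals d1[i], incoming flow of word j equals d2[j]
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1          # sum_j T[i, j] = d1[i]
    A_eq[n + i, i::n] = 1                   # sum_i T[i, j] = d2[i]
b_eq = np.concatenate([d1, d2])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)   # the WMD between d1 and d2
```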


CHAPTER 4

EXPERIMENTAL SETUP

4.1 Research Objectives

The aim of the present research was the evaluation of methods that utilize word embeddings in the task of document clustering. This was approached in three sub-steps: first, the influence of different training models on the two methods described in Chapters 3.2.1 and 3.2.2 was investigated; in the second part, we tested for in-domain training effects on the performances of the two methods; the last part of the analysis was an overall comparison of the methods with respect to the used clustering algorithms. All components that were chosen to address the aforementioned research objectives, i.e., the data sets, training hyper-parameters and corpora, baselines, clustering algorithms, and performance measures, are described in the following.

4.2 List of Data Sets

Two labelled data sets, described in Table 4.1, were used within the experiment. The first data set belongs to the R8 version of the Reuters-21578 collection. The collection is a widely used standard for text categorization research. Notably, the documents are unevenly distributed over 8 domains. The second data set is a subset derived from the Arxiv archive. It contains abstracts of scientific papers. The documents are equally distributed over four sub-domains of the fields mathematics and physics. We used the English stop-word list provided by the Stanford CoreNLP [71] toolkit to identify stop-words to be omitted.

Table 4.1: Table of Data Sets

ID   Source   Content   No. of clusters   No. of docs   ∅ length   Total word count   Vocabulary

R8 Reuters UCI KDD [72] News 8 2189 60 12

4.3 Word Embeddings

4.3.1 Pre-trained Word Embeddings

Two widely-used, publicly available sets of pre-trained word vectors were used as generally trained word embeddings. Initializing pre-trained word vectors is a common method to improve the performance in the absence of a large training set [9, 73]. The first set we used is trained on 100 billion words from Google News using the CBOW model with negative sampling¹ [47]. The second set was trained on Wikipedia articles available in 2014 and the Gigaword 5 corpus² using the GloVe model [41]. Words in the data sets that were not present in the sets of pre-trained word vectors were omitted from the analysis.
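A minimal sketch of loading such pre-trained vectors with gensim is shown below; the file names are illustrative (the GloVe text vectors need a one-off conversion to word2vec format, for which gensim provides a glove2word2vec helper), and filtering out-of-vocabulary tokens mirrors the omission step described above.

```python
from gensim.models import KeyedVectors

# pre-trained Google News CBOW vectors (binary word2vec format)
google_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# GloVe vectors, assumed to be already converted to word2vec text format
glove_vectors = KeyedVectors.load_word2vec_format(
    "glove.6B.300d.word2vec.txt", binary=False)

# tokens missing from the pre-trained vocabulary are omitted from the analysis
tokens = [t for t in ["quantum", "field", "theoryy"] if t in google_vectors]
```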

4.3.2 Domain-specific trained Word Embeddings

In addition, we trained word embeddings on 274,416 scientific abstracts from the Arxiv database, derived from the same disciplines that are in the clustering set (see 4.2). For the sake of comparison, we tried to fix the training parameters as consistently as possible. Following Mikolov et al. [46], we used a window size of 10 for the SG model and 5 for the other two models. Further, during training, 10 negative samples per positive sample were updated, as the training set consists of relatively few words. Table 4.2 summarizes the training hyper-parameters and sources that have been used:

Table 4.2: List of Word Embeddings

ID        Model   Window   Dimensions   Iterations   NS (negative samples)   No. of words   Source
G CBOW    CBOW    5        300          12           3                       100 bln        Google News
G GLOVE   GloVe   15       300          8            -                       5 bln          Wikipedia 2014 + Gigaword 5
D SG      SG      10       300          5            10                      23 mln         Arxiv
D CBOW    CBOW    5        300          5            10                      23 mln         Arxiv
D GloVe   GloVe   5        300          5            -                       23 mln         Arxiv
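A minimal sketch of how the Skip-gram row of Table 4.2 could be reproduced with gensim's word2vec implementation is given below (GloVe is not part of gensim and is trained with its own toolkit). The tiny tokenised corpus, the min_count setting, and the output file name are illustrative; the keyword arguments follow gensim 4.x, where older releases use size and iter instead of vector_size and epochs.

```python
from gensim.models import Word2Vec

# placeholder for the 274,416 tokenised Arxiv abstracts
abstracts = [
    ["we", "study", "spontaneous", "symmetry", "breaking"],
    ["the", "operator", "algebra", "is", "commutative"],
]

sg_model = Word2Vec(
    sentences=abstracts,
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    window=10,         # window size 10 for SG, 5 for CBOW and GloVe (Table 4.2)
    vector_size=300,   # embedding dimensionality
    negative=10,       # 10 negative samples per positive sample
    epochs=5,          # iterations over the corpus
    min_count=1,       # kept low only so the toy corpus above runs
)
sg_model.wv.save_word2vec_format("d_sg_300d.bin", binary=True)  # illustrative output path
```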

4.4 Baselines

As baselines, we used the BOW model [20] with term-frequency vectors and with tf-idf vectors [21]. A description of the tf-idf weighting scheme can be found in Chapter 3.2.1.
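A minimal sketch of the two discrete baselines with scikit-learn vectorizers; the toy documents are illustrative placeholders for the preprocessed collections.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["crude oil prices rose sharply", "the central bank raised interest rates"]

tf_vectors = CountVectorizer().fit_transform(docs)       # bag-of-words term frequencies
tfidf_vectors = TfidfVectorizer().fit_transform(docs)    # tf-idf weighted vectors
```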

¹ Source: https://code.google.com/archive/p/word2vec/
² Source: https://nlp.stanford.edu/projects/glove/

4.5 Clustering Algorithms

The methods were evaluated in the context of document clustering. In total, we used five different clustering algorithms, which are briefly described below. Except for K-means and Ward's method, we used the cosine similarity and the WMD as pairwise distance metrics.

4.5.1 K-means

The K-means algorithm aims to find a partitioning so that the clusters are as dense as possible, or to be precise, such that the sum of distances between all documents and their cluster center is minimized. Generally, the sum of squared errors is used as criterion function:

$$J(C) = \sum_{i=1}^{k} \sum_{d_j \in c_i} \| \vec{d}_j - \vec{\mu}_i \|^2 \qquad (4.1)$$

where $D \in \mathbb{R}^{v \times n}$ is a provided data set of $n$ $v$-dimensional document vectors that is to be partitioned into $k$ clusters $C = \{c_1, \ldots, c_k\}$, and $\vec{\mu}_i$ is the mean of cluster $c_i$. The K-means process initially selects a random partitioning with $k$ clusters. Then, all documents are reassigned to the closest recalculated cluster center $\vec{\mu}_i$. This process repeats until cluster memberships stabilize [58, 59]. A main drawback of K-means is that it can only converge to a local minimum; thus, the performance varies with different initializations of the clusters. Furthermore, the value of $k$ must be defined before running the algorithm, which can be difficult when no prior knowledge of the data set is given.
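A minimal sketch of running K-means on centroid document vectors with scikit-learn; the random matrix stands in for the n × 300 matrix of document centroids from Chapter 3.2.1, and the cluster count matches the Reuters R8 set.

```python
import numpy as np
from sklearn.cluster import KMeans

doc_vectors = np.random.rand(100, 300)            # placeholder document centroids

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)          # cluster assignment per document
```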

4.5.2 K-medoids

K-medoids [74, 75] is a variant of K-means. Instead of the mean, the algorithm uses cluster medoids as centers. Thus, the centers are actual documents from the provided data set, which makes the algorithm less sensitive to outliers and noise [76]. Further, a distance matrix can be computed beforehand and used as a look-up table.

4.5.3 Complete-Linkage

Hierarchical clustering algorithms return a sequence of nested partitions resulting in a tree-like structure. Thereby, the algorithms differ in their approaches, i.e., agglomerative and divisive mode, and in the way inter-cluster distances are computed [16]. Complete-linkage clustering considers the highest dissimilarity between two objects to determine the distance between two clusters, $D(C_u, C_v) = \max_{d_i \in C_u,\, d_j \in C_v} d(d_i, d_j)$. Thus, unlike single-linkage clustering, it is less vulnerable to the phenomenon of chaining and therefore more suitable for text data [16].

4.5.4 Ward Linkage

Unlike complete-linkage clustering, Ward's method recursively merges the clusters such that the variance within the new clusters is minimized [77]. Thereby, the merging cost is defined as the increase in the total sum of squared errors when two clusters are merged.

4.5.5 Density-Based Spatial Clustering of Applications with Noise

The DBSCAN [78] algorithm is designed to find clusters of arbitrary shapes in large, spatial, and noisy data. Clusters are defined as dense areas in the data. Thus, the algorithm requires two specifications, i.e., (1) a minimum number of neighbouring objects (MinPts) that need to lie within a certain (2) radius (Eps) for a point to be defined as a core point. Each object that is ’reachable’ from a core point with respect to the given radius is assigned to that cluster.
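A minimal sketch of DBSCAN on a precomputed pairwise distance matrix (cosine or WMD), mirroring how the algorithm is used in the experiments; the random matrix, eps and min_samples values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

distances = np.random.rand(100, 100)              # placeholder pairwise distances
distances = (distances + distances.T) / 2         # symmetrise the matrix
np.fill_diagonal(distances, 0.0)

clusterer = DBSCAN(eps=0.5, min_samples=5, metric="precomputed")
labels = clusterer.fit_predict(distances)          # -1 marks noise points
```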

4.6 Cluster Validity

4.6.1 BCubed F-measure

The BCubed F-measure [79] was used as extrinsic validation metric. In extrinsic metrics, the output of a clustering process is compared to a gold standard [80]. According to Amigó et al. [81], the BCubed F-measure reliably considers homogeneity, completeness, and size of the clusters as well as the number of errors and the presence of a noise cluster during evaluation. Given a data set, the BCubed precision is the average precision of all objects within the data set, while the precision of an object is the proportion of objects with the same label in the same cluster. Correspondingly, the overall BCubed recall is the average recall of all objects. The overall BCubed score is then the harmonic mean of both metrics:

$$\text{BCubed F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4.2)$$
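A minimal sketch of the BCubed F-measure over a flat clustering, assuming equally weighted items; labels_true holds the gold classes and labels_pred the cluster assignments, and the toy arrays are illustrative.

```python
import numpy as np

def bcubed_f_measure(labels_true, labels_pred):
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    precisions, recalls = [], []
    for i in range(len(labels_true)):
        same_cluster = labels_pred == labels_pred[i]      # items sharing i's cluster
        same_class = labels_true == labels_true[i]        # items sharing i's gold label
        correct = np.logical_and(same_cluster, same_class).sum()
        precisions.append(correct / same_cluster.sum())   # precision of item i
        recalls.append(correct / same_class.sum())        # recall of item i
    precision, recall = np.mean(precisions), np.mean(recalls)
    return 2 * precision * recall / (precision + recall)  # equation (4.2)

print(bcubed_f_measure([0, 0, 1, 1], [0, 0, 0, 1]))
```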

4.6.2 Silhouette Score

The Silhouette Score [82] was chosen as intrinsic metric. Intrinsic metrics are based on statistical measures such as intra-cluster densities and inter-cluster distances; thus, they do not require any external benchmarks. Let $\bar{d}_i$ be the mean intra-cluster distance of an object and $\bar{d}_j$ its mean nearest-cluster distance. Then, the silhouette coefficient is computed as follows:

$$\text{Silhouette Score} = \frac{\bar{d}_j - \bar{d}_i}{\max(\bar{d}_i, \bar{d}_j)}$$
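A minimal sketch of computing the Silhouette Score with scikit-learn; the random document vectors and cluster labels are illustrative placeholders, and the metric argument can be set to "cosine" as below or to "precomputed" when a WMD distance matrix is available.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
doc_vectors = rng.random((100, 300))              # placeholder centroid vectors
labels = rng.integers(0, 4, size=100)             # placeholder cluster assignments

# intrinsic validation: mean silhouette coefficient over all documents
score = silhouette_score(doc_vectors, labels, metric="cosine")
print(score)
```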


CHAPTER 5

RESULTS

5.1 Preliminaries

5.1.1 Computational Cost

The applied clustering algorithms have different time and space complexities. Generally speaking, the most time-consuming step of clustering is the calculation of the similarities. For the K-medoids, Complete-Linkage, Ward's method, and DBSCAN algorithms, distance matrices can be calculated upfront since they rely only on inter-document similarities. The K-means algorithm, however, requires the computation of distances during the clustering, which, considering the number of iterations and the size of the two data sets, was too time-consuming when using the WMD as criterion function. Ward's method solves the clustering as a variance problem; hence, it does not require a pairwise distance measure. For that reason, the WMD was only applied to K-medoids, Complete-Linkage, and DBSCAN clustering.

5.1.2 Thresholds

Since the WMD implementation proposed by Kusner et al. [11] considers normalized bow-vectors as document representations, we introduced upper and lower thresholds regarding a word's document frequency. The bow-model with linear term frequencies as values assumes that the more often a word appears in a document, the more relevant it is for the document's distinctiveness. However, this does not apply to words that occur in almost every or in barely any document of a collection. The best results among all models were obtained when only words that appear in more than 5% and less than 70% of the documents in the collection were used. Results can be seen in Figure A.1 in the Appendix.
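Expressed with scikit-learn's vectorizer arguments, the document-frequency thresholds roughly correspond to the following sketch; the toy documents are illustrative, and min_df/max_df take proportions of the collection.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["crude oil prices rose", "oil exports fell", "interest rates rose again"]

# keep only words occurring in roughly more than 5% and less than 70% of the documents
vectorizer = CountVectorizer(min_df=0.05, max_df=0.70)
bow = vectorizer.fit_transform(docs)
```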

5.1.3 DBSCAN

The performance of the DBSCAN algorithm is highly dependent on its parameters [83]. Thus, minor changes in the Eps-neighborhood and MinPts values can drastically change the results of the algorithm. We obtained WMD values ranging from 0 to 4.5, thus we tested values ranging from 0.1 to 0.9 for the radius and values ranging from 0 to 25 for the minimum number of objects. In total, 225 combinations were tested. Further, we performed 5 runs per combination since the clustering result varies with the order in which the documents are processed in the course of the algorithm. We omitted all runs that led to a number of clusters deviating by more than 50% from the true cluster count. Within these constraints, only methods that used embeddings obtained from the G CBOW and D GloVe models succeeded in the Arxiv experiment. Thus, we excluded DBSCAN from the evaluation of the WMD in the analysis of the Arxiv set.

5.2 Influence of Different Training Models and Examples

Our first analysis focused on the evaluation of the influence of the different training models, described in Chapter 4.3.1, and the use of different training examples, described in Chapter 4.3.2. Altogether, our empirical results did not reveal a clear difference between the training models. However, we can confirm that the content of the training examples had a significant effect. In fact, the content was even more important than the number of training examples and the training model architecture.

5.2.1 The Influence of Training Models

We normalized the scores that were calculated in each clustering in order to show the relationships among the three models that were used in these experiments. Figures A.4-A.7 in the Appendix show box-plots of the average relative performance of each model across the clustering algorithms with regard to both performance measures. For the Reuters R8 data set, the pre-trained CBOW model outperformed the pre-trained GloVe model by 6 percentage points, whereas both models performed about the same in the Arxiv experiment. Quite the opposite can be seen regarding the models trained on domain-specific examples. In this case, the GloVe model performed best, while the performance of the SG and the CBOW models was about the same. Thus, with regard to the F-Measure, none of the models performed consistently better than the others. The same applies to the Silhouette Score, which considers the cluster densities. The results are in line with the findings of Levy and Goldberg [45, 48] and Lai et al. [44]. Unlike Pennington et al. [41], we fixed the number of iterations during training in our experiment.

5.2.2 The Influence of Training Examples

In fact, a possible explanation for the strong performance of the pre-trained CBOW model in the Reuters R8 experiment is that it was trained on news articles, which are the same kind of articles as in the clustered data set. The same applies when reviewing the Arxiv experiment, in which domain-specific trained embeddings performed considerably better than the generally trained embeddings, even though the training corpus is much smaller.

We plot the embeddings of the 500 most frequent words in the Arxiv data set (Figures A.2-A.3) to highlight the difference between the models. We can argue that, in the specific case of clustering scientific documents, in-domain training of word embeddings can lead to better results.

5.3 Comparison of Methods

In this section, we investigated which of the methods described in Chapter 3.2 performed best.

5.3.1 Comparison of the WMD and the Centroid Approach

For the analysis, we followed the findings in Chapter 5.2 and focused on the embeddings whose training corpus matches the domain of the clustered documents: the G CBOW embeddings for the Reuters R8 experiment and the D SG, D CBOW and D GloVe embeddings for the Arxiv experiment. For the Arxiv experiment, we averaged the results obtained with the three models. The results are shown in Table 5.1 below:

Table 5.1: Comparison of the WMD and the Centroid Approach

                 Reuters R8                       Arxiv
Method           K-medoids   Complete   DBSCAN   K-medoids   Complete
Mean             0.61        0.56       0.67     0.68        0.47
Weighted Mean    0.56        0.61       0.67     0.74        0.48
WMD              0.62        0.59       0.55     0.54        0.67

Reviewing the results, it is not possible to clearly state a difference in the performance of the three methods. This is in line with the findings of Nikolentzos et al. [29], in which neither the WMD nor a centroid method performed consistently better in their classification tasks. However, considering the high computational costs of the WMD, we can clearly argue that the centroid embedding is the more efficient method. With respect to the clustering algorithms, both K-medoids and Complete-Linkage clustering work best for the WMD. As mentioned in Chapter 5.1.3, DBSCAN is not suitable for the WMD when using a set of embeddings that have been trained on a domain-specific corpus. Generally, DBSCAN fails when the border objects of two clusters are too close [78]. As Figures A.2 and A.3 in the Appendix show, training the embeddings on a domain-specific corpus can lead to a more dense vocabulary due to the absence of external usage examples. Furthermore, all domains in the Arxiv corpus are somewhat adjacent to one another, since they have been derived from similar disciplines. These are possible explanations why domain-specific trained embeddings are unfavourable for DBSCAN clustering.

5.3.2 Overall Comparison

Finally, we compared the overall performances of the considered methods to the common discrete representation methods mentioned in Chapter 4.4. The results are shown in Tables A.1 and A.2 in the Appendix. Overall, the DBSCAN algorithm in combination with the centroid method performed best in the Reuters R8 experiment. In this case, the embeddings were trained on a very broad corpus consisting of over 100 billion words of news articles. In contrast, the DBSCAN algorithm performed worst in the Arxiv experiment, as explained in Chapter 5.3.1. As the results reveal, the methods involving word embeddings can outperform discrete word representation methods on both data sets. However, this strongly depends on the training of the word embeddings.


CHAPTER 6

LIMITATIONS AND CONCLUSION

6.1 Limitations and Future Research

In this section, we determine the scope of the presented findings by outlining several limitations. Preliminaries were already discussed in Chapter 5.1.

First, regarding the training models and hyper-parameter selection, we have aimed to design a comparison that is as fair as possible. We determined the selection of parameters based on knowledge derived from the literature. However, no prior tests were conducted with other parameter values. As the results in our analysis were not consistent among the training models, statements about the methods to compute document similarities can only be made conditionally. Interestingly, we observed stronger discrepancies when changing the training examples than when changing the model. Thus, we strongly suggest further investigation concerning the influence of the composition of the training set on the quality of word embeddings.

Second, another weakness of this study is that the two data sets used for clustering differ in multiple aspects, i.e., topics, distribution over the topics, quality of the content, language, etc. Thus, in order to explicitly attribute performance gains to certain training parameters, either the number of test sets has to be increased or the tested data sets have to be identical except for the considered parameter. Additionally, one may argue that a comparison on more data sets, both with general and domain-specific content, is required to be actually able to generalize our findings.

Third, several clustering algorithms used in these experiments can only converge to a local minimum, which leads to different results depending on the initialization. Although we performed multiple runs per clustering algorithm, there is no guarantee that the results are optimal.

6.2 Conclusion

In this research, we evaluated various methods involving word embeddings in the context of document clustering. Our findings revealed that the use of word embeddings can generally improve the computation of document similarities. We tested two common approaches to utilizing word embeddings for the problem of document clustering, that is, the WMD and the centroid method. Although the WMD is computationally more complex than using the average or a weighted average embedding as document representation, we cannot clearly report an improvement in performance. However, throughout the analysis, we observed a strong influence of the training examples on the performance of both methods. Our findings clearly show that in-domain training of the word embeddings can drastically improve the process of document clustering. In fact, this effect is even stronger than that of the number of training examples and the model architecture. However, overly isolated training can cause several clustering algorithms, such as DBSCAN, to fail due to an overly dense vocabulary. Finally, we strongly suggest further research concerning the composition of the training set in order to obtain high-quality embeddings with regard to the task.


APPENDIX A

APPENDIX OF RESULTS

Figure A.1: Document frequency thresholds for Arxiv

Figure A.2: Visualization of top 500 words from the Arxiv set using pre-trained embeddings


Figure A.3: Visualization of top 500 words from the Arxiv set using domain-specific trained embeddings

Table A.1: Table of results: Reuters R8

                      F-Measure                                        Silhouette Score
                      K-Means  K-medoids  Complete  Ward   DBSCAN      K-Means  K-medoids  Complete  Ward   DBSCAN
BOW                   0.51     0.42       0.56      0.49   0.66        0.03     0.07       -0.13     -0.06  0.13
TFIDF                 0.54     0.48       0.60      0.54   0.65        0.04     0.05       0.03      0.02   0.04
Mean     G CBOW       0.60     0.61       0.56      0.53   0.67        0.12     0.17       0.07      0.09   0.20
         G GloVe      0.64     0.43       0.53      0.43   0.52        0.14     0.23       0.06      0.13   0.25
wMean    G CBOW       0.60     0.56       0.61      0.52   0.67        0.03     0.13       0.02      0.02   0.20
         G GloVe      0.55     0.38       0.51      0.54   0.52        0.08     0.20       0.08      0.09   0.25
WMD      G CBOW       -        0.62       0.59      -      0.55        -        0.20       0.08      -      -
         G GloVe      -        0.61       0.58      -      0.53        -        0.21       0.09      -      -

Table A.2: Table of results: Arxiv

                      F-Measure                                        Silhouette Score
                      K-Means  K-medoids  Complete  Ward   DBSCAN      K-Means  K-medoids  Complete  Ward   DBSCAN
BOW                   0.53     0.46       0.50      0.69   0.48        0.02     0.05       0.00      0.04   0.03
TFIDF                 0.77     0.39       0.45      0.70   0.42        0.01     0.02       0.01      0.01   -0.01
Mean     G CBOW       0.77     0.66       0.43      0.54   0.43        0.14     0.10       0.15      0.04   0.58
         G GloVe      0.69     0.71       0.41      0.51   0.40        0.11     0.24       0.45      0.15   0.58
         D SG         0.80     0.71       0.43      0.62   0.40        0.13     0.20       0.26      0.17   0.30
         D CBOW       0.74     0.68       0.59      0.58   0.40        0.17     0.17       0.21      0.13   0.30
         D GloVe      0.76     0.66       0.41      0.76   0.40        0.18     0.23       0.45      0.28   0.61
wMean    G CBOW       0.44     0.66       0.41      0.56   0.43        0.09     0.12       0.33      0.15   0.37
         G GloVe      0.45     0.70       0.40      0.49   0.40        0.07     0.27       0.43      0.10   0.58
         D SG         0.48     0.73       0.41      0.54   0.40        0.12     0.19       0.27      0.18   0.30
         D CBOW       0.59     0.77       0.62      0.70   0.40        0.16     0.17       0.18      0.20   0.30
         D GloVe      0.52     0.72       0.41      0.50   0.40        0.20     0.27       0.39      0.15   0.61
WMD      G CBOW       -        0.53       0.58      -      0.41        -        0.04       0.05      -      -0.05
         G GloVe      -        0.55       0.63      -      -           -        0.04       0.03      -      -
         D SG         -        0.54       0.63      -      -           -        0.04       0.03      -      -
         D CBOW       -        0.53       0.72      -      -           -        0.05       0.06      -      -
         D GloVe      -        0.57       0.66      -      0.41        -        0.05       0.04      -      -0.03

Figure A.4: Reuters R8: Comparison of training models (F-Measures)

Figure A.5: Reuters R8: Comparison of training models (Silhouette Score)

Figure A.6: Arxiv: Comparison of training models (F-Measures)

BIBLIOGRAPHY

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.

[2] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL (1), pages 1555–1565, 2014.

[3] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.

[4] Andrew L Maas and Andrew Y Ng. A probabilistic model for semantic word vectors. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

[5] Cicero D Santos and Bianca Zadrozny. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826, 2014.

[6] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.

[7] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.

[8] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, Christopher Potts, et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer, 2013.

[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint.

[11] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. Journal of Machine Learning Research, 37:957–966, July 2015. URL http://www.jmlr.org/proceedings/papers/v37/kusnerb15.pdf.

[12] Ying Zhao and George Karypis. Criterion functions for document clustering: Experiments and analysis. Technical report, 2001.

[13] Ying Zhao and George Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.

[14] Peter G Anick and Shivakumar Vaithyanathan. Exploiting clustering and phrases for context-based information retrieval. In ACM SIGIR Forum, volume 31, pages 314–323. ACM, 1997.

[15] Douglass R Cutting, David R Karger, and Jan O Pedersen. Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pages 126–134. ACM, 1993.

[16] Charu C Aggarwal and ChengXiang Zhai. A survey of text clustering algorithms. In Mining text data, pages 77–128. Springer, 2012.

[17] Douglass R Cutting, David R Karger, Jan O Pedersen, and John W Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 318–329. ACM, 1992.

[18] Hinrich Schütze and Craig Silverstein. Projections for efficient document clustering. In ACM SIGIR Forum, volume 31, pages 74–81. ACM, 1997.

[19] Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, et al. Learning to classify text from labeled and unlabeled documents. AAAI/IAAI, 792, 1998.

[20] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

[21] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.

[22] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for auto-matic indexing. Communications of the ACM, 18(11):613–620, 1975.

[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[24] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding and kernel pca. Learning, 16(10), 2006.

[25] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[27] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1), pages 238–247, 2014.

[28] Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 83–84. International World Wide Web Conferences Steering Committee, 2016.

[29] Giannis Nikolentzos, Polykarpos Meladianos, François Rousseau, Michalis Vazirgiannis, and Yannis Stavrakas. Multivariate gaussian document representation from word embeddings for text categorization. EACL 2017, page 450, 2017.

[30] Giannis Nikolentzos, Polykarpos Meladianos, François Rousseau, Michalis Vazirgiannis, and Yannis Stavrakas. Multivariate gaussian document representation from word embeddings for text categorization. EACL 2017, page 450, 2017.

[31] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[32] Christos H Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 159–168. ACM, 1998.

[33] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391, 1990.

[34] Thomas K Landauer and Susan T Dumais. A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.


[35] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284, 1998.

[36] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 267–273. ACM, 2003.

[37] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

[38] George A Miller and Walter G Charles. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28, 1991.

[39] Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37:141–188, 2010.

[40] Rémi Lebret and Ronan Collobert. Word embeddings through hellinger pca. arXiv preprint arXiv:1312.5542, 2013.

[41] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.

[42] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[43] Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Matthew Purver. Evaluating neural word representations in tensor-based compositional settings. arXiv preprint arXiv:1408.6179, 2014.

[44] Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to generate a good word embedding. IEEE Intelligent Systems, 31(6):5–14, 2016.

[45] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.

[46] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Hlt-naacl, volume 13, pages 746–751, 2013.

[47] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[48] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185, 2014.

[49] Pontus Stenetorp, Hubert Soyer, Sampo Pyysalo, Sophia Ananiadou, and Takashi Chikayama. Size (and domain) matters: Evaluating semantic word space representations for biomedical text. In Proceedings of the 5th International Symposium on Semantic Mining in Biomedicine, 2012.

[50] Colin Cherry and Hongyu Guo. The unreasonable effectiveness of word representations for twitter named entity recognition. In HLT-NAACL, pages 735–745, 2015.

[51] Boxing Chen and Fei Huang. Semi-supervised convolutional networks for translation adaptation with tiny amount of in-domain data. CoNLL 2016, page 314, 2016.

[52] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.

[53] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[54] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[55] Peng Wang, Jiaming Xu, Bo Xu, Cheng-Lin Liu, Heng Zhang, Fangyuan Wang, and Hongwei Hao. Semantic clustering and convolutional neural network for short text categorization. In ACL (2), pages 352–357, 2015.

[56] Chao Xing, Dong Wang, Xuewei Zhang, and Chao Liu. Document classification with distributions of word vectors. In Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA), pages 1–5. IEEE, 2014.

[57] Chaochao Huang, Xipeng Qiu, and Xuanjing Huang. Text classification with document embeddings. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pages 131–140. Springer, 2014.

[58] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.

[59] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.

[60] Anna Huang. Similarity measures for text document clustering. In Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, pages 49–56, 2008.

[61] Gang Qian, Shamik Sural, Yuelong Gu, and Sakti Pramanik. Similarity between euclidean and cosine angle distance for nearest neighbor queries. In Proceedings of the 2004 ACM symposium on Applied computing, pages 1232–1237. ACM, 2004.

[62] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pages 59–66. IEEE, 1998.

[63] Elizaveta Levina and Peter Bickel. The earth mover’s distance is the mallows distance: Some insights from statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 251–256. IEEE, 2001.

[64] Xiaojun Wan. A novel document similarity measure based on earth movers distance. Information Sciences, 177(18):3718–3730, 2007.

[65] Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. Supervised word mover’s distance. In Advances in Neural Information Processing Systems, pages 4862–4870, 2016.

[66] Hamed Zamani and W Bruce Croft. Embedding-based query language models. In Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval, pages 147–156. ACM, 2016.

[67] Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu. From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, pages 404–415. ACM, 2016.

[68] Thomas K Landauer. On the computational basis of learning and cognition: Arguments from lsa. Psychology of learning and motivation, 41:43–84, 2002.

[69] Yoav Goldberg and Omer Levy. word2vec explained: Deriving mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[70] Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on Machine learning, pages 641–648. ACM, 2007.

[71] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL (System Demonstrations), pages 55–60, 2014.

[72] Seth Hettich and SD Bay. The uci kdd archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science, 152, 1999.

[73] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 129–136, 2011.

[74] Leonard Kaufman and Peter J Rousseeuw. Partitioning around medoids (program pam). Finding groups in data: an introduction to cluster analysis, pages 68–125, 1990.

[75] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.

[76] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009.

[77] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244, 1963.

[78] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.

[79] Amit Bagga and Breck Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th international conference on Computational linguistics - Volume 1, pages 79–85. Association for Computational Linguistics, 1998.

[80] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. Clustering validity checking methods: part ii. ACM Sigmod Record, 31(3):19–27, 2002.

[81] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4):461–486, 2009.

[82] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.

[83] Daniel Godfrey, Caley Johns, Carl Meyer, Shaina Race, and Carol Sadek. A case study in text mining: Interpreting twitter data from world cup tweets. arXiv preprint.
