
Document Clustering on a Large Scale

Master Thesis Software Engineering

Jetse Koopmans

jetse.koopmans@student.uva.nl

July 7, 2017, 85 pages

Supervisor: Rolf Jagerman, PhD student, rolf.jagerman@uva.nl
Host organisation: Blendle, www.blendle.com

Contact person: Arno Veenstra, Data Scientist, arno@blendle.com

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

1 Introduction
1.1 Context, relevancy and problem analysis
1.2 Project aims
1.3 Research question
1.4 Research method
1.5 Related work
1.5.1 Comparison of related work
1.6 Thesis outline

2 Text pre-processing
2.1 Tokenization
2.2 Stop words
2.3 Stemming
2.4 Addition of properties
2.5 Removal of short documents
2.6 Vector space model
2.7 Term Frequency-Inverse Document Frequency
2.8 Dimensionality reduction
2.8.1 Latent semantic indexing

3 Hierarchical clustering
3.1 Hierarchical
3.2 Agglomerative
3.2.1 Similarity measures
3.2.2 Linkage
3.3 Divisive
3.3.1 Determining the optimal number of clusters k
3.3.2 Flat clustering algorithms
3.3.3 Stop-conditions
3.4 Comparison of agglomerative and divisive
3.4.1 Time complexity
3.4.2 Performance

4 Evaluation
4.1 Intrinsic metrics
4.1.1 Silhouette
4.1.2 Calinski-Harabasz
4.1.3 Dunn
4.2 Extrinsic metrics
4.2.1 Precision & recall
4.2.2 F-measure
4.2.3 Extension for hierarchical clustering
4.3 Evaluation based on classification
4.3.1 ClusClas
4.3.2 Unsupervised ClusClas

5 Data sets
5.1 Wikipedia set
5.2 Blendle set
5.2.1 Krippendorff's alpha
5.2.2 Results of the labelling
5.3 The Iris flower set

6 Experimental results
6.1 Time: Agglomerative and divisive
6.2 Evaluation based on classification
6.2.1 ClusClas
6.2.2 Unsupervised ClusClas
6.3 Divisive
6.3.1 Example
6.3.2 Wikipedia set
6.3.3 Blendle set
6.3.4 Conclusion
6.4 ClusClas on Blendle data
6.5 Clustering 2.5 million documents
6.6 Adding documents to an existing clustering
6.6.1 Details
6.6.2 Results

7 Visualisation
7.1 Introduction
7.2 Visualisation
7.3 Example
7.4 Future

8 Conclusion
8.1 Future work

Bibliography

Appendices
A Most frequent words translation
B Unsupervised ClusClas
C Hierarchy of Blendle data set
D Wikipedia data set
D.1 K-means
D.2 Spherical k-means
D.3 Von Mises-Fisher
D.4 K-medoids: Euclidean
D.5 K-medoids: cosine
E Blendle data set
E.1 K-means
E.2 Spherical k-means
E.3 Von Mises-Fisher
E.4 K-medoids: Euclidean
E.5 K-medoids: cosine


Acknowledgements

First of all, I would like to thank my academic supervisor Rolf and my Blendle supervisor Arno. They guided me through the master's thesis from beginning to end with their insightful advice, helpful discussions and motivation. I would like to thank all colleagues at Blendle who helped me in any way. My special thanks go to Ana, who did a great job in managing the Master Software Engineering this year.

Finally, I would like to thank my family and friends for always being supportive throughout my studies. In particular, I am thankful for my brother Arend, who was my dual core processor this year, my girlfriend Kristel, for her love and support, and my friends Joris and Matt for their encouragement.


Abstract

Huge collections of unstructured documents are too large to be processed by hand. It is, however, interesting to discover the structure that is hidden in these large data sets.

In this thesis we create a system that examines different approaches to efficiently discover that hidden structure, in terms of both speed and quality. The requirements are that the system produces clusters of high quality, in a short amount of time, at a scalable size, with clearly visualised results.

The method that we propose is hierarchical divisive k-means clustering with f(K) to determine the optimal number of clusters per depth. We use two different optimization methods to scale this method to large collections of unstructured documents. We found that these methods improved the runtime remarkably, while the quality of the clusters, measured by both intrinsic and extrinsic metrics, was not significantly affected. The results are clearly visualised in an interactive D3 (Data-Driven Documents) visualisation.


Chapter 1

Introduction

This chapter gives a brief overview of what Blendle does, what kind of data it has and why it is relevant to do hierarchical clustering.

1.1 Context, relevancy and problem analysis

Blendle is a Dutch online news platform that houses many newspapers and magazines. It allows people to browse through everything and only pay for the stories they like. In the media, Blendle is called the "iTunes for news" [1]. At the time of writing, it was in the process of expanding to the United States of America and Germany. Blendle receives approximately 8000 documents per day from hundreds of sources, which often describe the same real-life events from different perspectives. Combine this daily stream with the existing database of more than 2.5 million documents and you get an enormous, constantly growing data set. In our modern society, businesses already gain value from analysing only a small fraction of their data [2]; analysing all of this unstructured data holds far greater potential.

So, consider the problem of being given 2.5 million documents plus a daily addition of 8000 documents and wanting to structure them. By hand, this takes ages and the results are highly subjective, but it can be done much faster by means of clustering algorithms. The existing database of 2.5 million documents functions as a training set to cluster the daily stream of 8000 incoming documents. To make sure the system stays useful in the future, the software needs to be highly scalable: 8000 documents per day amount to almost three million documents per year.

Assigning each document to only one cluster adds structure to the data. Adding hierarchy to the clustering has the potential to further increase this structure. For instance, a document could belong not only to sports (parent cluster), but also to football or basketball (child cluster) and perhaps even more.

Blendle deals with news documents, where fast results are very important. Therefore, the clustering of incoming documents must be completed quickly. This way, the editors have the best insights to choose which documents to publish online and to pick the "best" 10-15 documents before the newsletter is sent out around 6:45 A.M. on every workday.

The raw data generated by the clustering is not very useful for the editors. A single picture can contain a wealth of information and is processed much more quickly than a comparable page of words [3]. Therefore, a proper visualisation of the clustering output is just as important as the technical implementation of the clustering algorithm itself.

1.2 Project aims

The overall aim of the project is to develop a system that finds structure in news documents. This provides more insight in the data and supports editors in recommending documents to users. To


achieve this, we use hierarchical clustering. This is challenging due to scalability, the many clustering algorithms and the evaluation of the clustering. To solve this, we create a systematic comparison that examines existing clustering methods and, where needed, improves upon them. We also introduce an experimental set-up for a new evaluation method. The system requires that:

1. Clusters with high quality are produced according to both intrinsic and extrinsic metrics.
2. 2.5 million documents are clustered within 12 hours.
3. A daily batch of 8000 documents is processed within 1 minute.
4. It is still usable in terms of speed after five years.

Regarding the visualisation, our aim is to make the results of the clustering insightful for the editors, such that they can browse through the document structure effortlessly. To achieve this, we apply an existing visualisation technique to this data. This is challenging due to the enormous data set.

1.3 Research question

How can structure be discovered efficiently, in terms of speed and quality, in a large collection of unstructured documents?

Sub-questions

1. How is the quality of the found structure evaluated?

2. How can methods for clustering be scaled to the size of Blendle’s data set?

3. How are results presented to editors, such that the hierarchical nature of the structure is clearly highlighted?

1.4 Research method

The research questions are answered by a systematic comparison of clustering methods, where speed and quality are important. To increase speed, the documents could be preprocessed or, for example, sampling could be used. This must be done without significant loss of quality.

It is likely that various clustering methods do not agree with one another. Top-down hierarchical clustering might outperform bottom-up hierarchical clustering, or vice versa. Empirical evaluation is necessary to determine which method is appropriate. The Gensim framework [4] and scikit-learn [5] are solutions for trying out existing algorithms as well as implementing new algorithms related to clustering. To verify these methods, we create a training set from Blendle documents with human evaluators, download Wikipedia documents and use the well-known Iris data set.

To validate the results, both intrinsic and extrinsic metrics are used. Intrinsic metrics are based on between- and within-cluster distance and extrinsic metrics perform a comparison of the clustering with the ground truth. For the latter, an existing ground truth from Wikipedia and a newly generated ground truth from experts from Blendle are used.

The visualization depends on the data from the clustering and thus on which method is used for the clustering. An interactive visualization might be interesting and insightful. A handful of existing visualisation techniques could be used [6].

1.5 Related work

Much research has already been done on clustering. In this section we review several works that deal with (hierarchical) clustering of (news) documents, which are highly relevant to this thesis.


Incremental Clustering of News Reports [7]

This research describes a clustering system that can cluster news reports from disparate RSS streams into event-centric clusters. This is done with the k-means algorithm and Term Frequency-Inverse Document Frequency (TF-IDF) weighting. It performs rather poorly when clustering documents (or news reports) into more general categories. Every news report is clustered in 1 second on average. This was done on a Linux quad-core system with CPUs running at 2.4 GHz and 4 MB of cache, and with 8 GB of RAM.

Large scale news document clustering [8]

This research examines different approaches to clustering news documents so that two documents covering the same information belong to the same cluster. All of the examined algorithms show quite bad running times. Lönnberg and Yregård note that pre-processing news items may improve the quality and speed, and that one should be aware that it is hard (if not impossible) to guarantee a flawless result. 44,971 documents are processed in 517 seconds on an Intel Core i7 2.7 GHz (4 cores) processor with the JVM heap size limited to 4 GB.

Assigning Web News to Clusters [9]

Bouras and Tsogkas investigate the application of a broad spectrum of clustering algorithms, as well as similarity measures, to news documents that originate from the Web, and compare their efficiency for use in an online Web news service application. Sequential agglomerative hierarchical non-overlapping clustering methods feature an average complexity of at least O(n²), and most commonly O(n³), on the input size n, which in many cases is prohibitive for large data sets.

Real-time aanbevelingen voor nieuws (Real-time recommendations for news) [10]

This research focuses on a recommender system that filters documents according to the interests of the user. Leroux uses the Apache Storm framework to build scalable applications. Before recommending a document, the documents are clustered incrementally. The system needs 800 milliseconds for 250 recommendations.

Incremental Hierarchical Clustering of Text Documents [11]

This research evaluates an incremental hierarchical clustering algorithm, which is often used with non-text data sets, on text document data sets. Sahoo et al. also propose a way to evaluate the quality of the hierarchy generated by hierarchical clustering algorithms, by observing how often child clusters of a cluster get child classes of the class assigned to that cluster.

A Fast Incremental Clustering Algorithm [12]

In this research, Su et al. propose a fast incremental clustering algorithm that changes the radius threshold value dynamically. The algorithm restricts the number of final clusters and reads the original data set only once. The time spent on about 8000 items depends mainly on the number of clusters: for 5 clusters it takes 250 seconds and for 30 clusters it takes 10 seconds.

Similarity Measures for Text Document Clustering [13]

This research mentions that partitional clustering algorithms have been recognized as more suitable than hierarchical clustering schemes for processing large data sets. Huang compares and analyses the effectiveness of various similarity measures in partitional clustering of text document data sets, and concludes that the measures' overall performance is similar.


Large-Scale Multilingual Document Clustering [14]

This research focuses mainly on the multilingual part of clustering, where probabilistic and matrix-based clustering methods are examined carefully. Kishida manages to cluster 22,892 multilingual documents in almost an hour and 72,835 documents in 8.7 hours.

Multilingual and cross-lingual news topic tracking [15]

This system is able to identify related news across languages, without translating documents. The collection of clusters is mainly presented to the users as a flat, independent list of clusters. Pouliquen et al. focus on automated news analysis that consumes about 7600 news documents per day.

Analysing Entities and Topics in News documents using Statistical Topic Models [16]

In this research, standard entity recognizers are applied to extract names of people, organizations and locations from over 300,000 NY Times news documents. Unfortunately, the runtime performance is not mentioned. Newman et al. reached their goal: understanding the latent structure between entities.

Multilingual Topic Models for Unaligned Text [17]

Boyd-Graber and Blei develop a probabilistic model of text that is designed to analyse corpora composed of documents in two languages. This model finds topic spaces and matchings simultaneously. 9756 documents of a German-English parallel corpus are analysed. The runtime performance is unfortunately not mentioned.

Software Framework for Topic Modelling with Large Corpora [4]

Within this framework, Rehurek and Sojka implement several popular algorithms for topical inference. This framework is useful for trying out existing algorithms as well as implementing new algorithms related to clustering.

Large Scale Relational Information Visualization, Clustering, and Abstraction [18]

This research provides the models, measures and methods required to produce and evaluate the drawing of large amounts of relational information. Quigley identifies related problems with the visualization of large data sets.

1.5.1 Comparison of related work

Table 1.1 compares several studies in the field of (document) clustering. It gives a detailed overview of the following clustering features: hierarchical support, maximum size, speed and time complexity.

No existing research on hierarchical clustering deals with data at the scale that Blendle has: about 2.5 million documents with a daily addition of 8000. The complexity of the related work varies from O(n) [12] to O(n³) [9], which would imply that the Blendle data set is clustered in roughly 4 × 10⁶ to 6.4 × 10¹⁹ operations. Unfortunately, some studies do not mention their time complexity but only their speed: 26,310 reports in 26,310 seconds [7] and 72,835 documents in 31,292 seconds [14]. [10] focuses on real-time recommendations wherein clustering takes place, unfortunately on a small scale: 250 recommendations in 800 milliseconds. Some may be faster than others, but that might come at the cost of quality. [16] and [17] are about topic modelling, a form of mixture modelling in which each document belongs to all clusters but with different proportions or weights. Furthermore, four out of eleven studies support soft clustering. From table 1.1 we can conclude that the existing studies either do not support hierarchical clustering, do not scale enough for a hierarchical clustering of the Blendle data set, or do not provide the information necessary to draw conclusions.


Research | Hierarchical | Max. size | Speed | Complexity
Incremental Clustering of News Reports [7] | No | 26,310 | 26,310 in 26,310 s | -
Large scale news document clustering [8] | Yes | 44,971 | 44,971 in 517 s | O(n²)
Assigning Web News to Clusters [9] | Yes | 10,000 | - | O(n²)-O(n³)
Real-time aanbevelingen voor nieuws [10] | Yes | 250 | 250 in 0.8 s | -
Incremental Hierarchical Clustering of Text Documents [11] | Yes | 62,935 | - | -
A Fast Incremental Clustering Algorithm [12] | No | 8000 | - | O(n)
Similarity Measures for Text Document Clustering [13] | No | 18,828 | - | -
Analysing Entities and Topics in News documents using Statistical Topic Models [16] | Yes | 300,000 | - | -
Multilingual Topic Models for Unaligned Text [17] | No | 9756 | - | -
Large-Scale Multilingual Document Clustering [14] | Yes | 72,835 | 72,835 in 31,292 s | -
Multilingual and cross-lingual news topic tracking [15] | Yes | 7600 | - | -

Table 1.1: Comparison of related work for clustering.

1.6 Thesis outline

Figure 1.1 shows a high-level overview of the pipeline of the full system. This thesis and the system both follow this outline. The first step consists of downloading documents from the database. Second, the documents are cleaned in such a way that only relevant information remains. This is done in chapter 2 by tokenization, stop words removal, stemming, addition of properties and short document removal. Hereafter, documents are represented as vectors such that computers are able to compare them, and the dimensionality of those vectors can be reduced (also chapter 2). The vectors are then clustered with a hierarchical clustering algorithm of choice in chapter 3. Chapter 4 evaluates the results with existing measures and introduces a new experimental measure. For these evaluations, both existing and new data sets are used, which are described in chapter 5. To answer the research question, experiments with various algorithms and multiple methods for speed improvements are conducted in chapter 6. Finally, the results are visualised in chapter 7 to gain more insight in the resulting clustering, and our work is concluded in chapter 8. Appendices A, B, C, D, E and F provide extra information and full details of the experiments.


Figure 1.1: Overview of the system. It consists of four major steps: clean documents (tokenization, stop words removal, stemming, addition of properties, removal of short documents), create vectors (term weighting, dimensionality reduction), cluster vectors (clustering, evaluation) and result (visualisation).


Chapter 2

Text pre-processing

This chapter describes how documents are preprocessed in such a way that only relevant information remains. Hereafter, it is explained how computers are able to process documents.

2.1 Tokenization

Tokenization can be seen as the process of breaking documents into smaller units (tokens) such as terms and phrases. This process includes removing punctuation and lower-casing the documents. The documents are tokenized using the bag-of-words model [19]. In this model, each document is represented as the bag (multi-set) of its words. For example, consider two very simple documents:

(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.

Based on these two documents, the bag of words is constructed as follows:

["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"] This representation does not preserve the order of the words in the original documents. Hereafter, the term frequency, can be calculated. This is further explained in section2.7. For this example, the following two lists are constructed to record the term frequencies of all distinct words:

(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

In the first list, the first two entries are "1" and "2" and correspond to the words "John" and "likes": "John" appears once and "likes" appears twice in the first document.
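As a concrete illustration, the construction above can be reproduced with a few lines of Python. This sketch is ours, not code from the thesis; the tokenizer is a simplified stand-in for the process described in this section:

```python
# Build bag-of-words term-frequency vectors for the two example documents.
from collections import Counter

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]

def tokenize(text):
    # Strip punctuation and lower-case, then split on whitespace.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

tokenized = [tokenize(d) for d in docs]
vocab = list(dict.fromkeys(t for doc in tokenized for t in doc))  # first-seen order

for doc in tokenized:
    counts = Counter(doc)
    print([counts[term] for term in vocab])
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
# [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```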

2.2 Stop words

Common words such as "the", "an", "and", "by", "from" and "with" occur in almost all documents. They do not carry any information on their own and therefore add a lot of noise. Figure 2.1 illustrates the word frequencies in 10,000 random Blendle documents. Words like "de", "van", "een" and "het" are very frequent Dutch words. Because the words are not lowercased for this example, both "de" and "De" occur. An English translation of figure 2.1 is available in appendix A. This figure emphasizes that stop words should be removed. To create a better and faster clustering, stop words are removed from the corpus [20]. Because the dimensionality of each document is reduced, clustering becomes faster. The quality increases because documents are no longer considered similar merely because they share stop words.


Figure 2.1: The 40 most frequently occurring words in 10,000 random Blendle documents.

Unfortunately, removing stop words may also decrease the clustering quality. When the words "it" and "us" are identified as stop words, terms like "IT engineer" or "US citizen" are affected. Important information is lost: only "engineer" and "citizen" remain, and it is unclear what kind of engineer or citizen is being referred to [21].

2.3 Stemming

A computer sees "car", "cars" and "cars'" as three different words. To solve this problem, stemming is applied. In stemming, the morphological affixes of words are removed, leaving only the word stem. To continue the above example, the words "car", "cars" and "cars'" are all reduced to "car". This improves the clustering [22, 23]. Below are some examples that show how the stemmer works:

{enjoy, enjoying, enjoyed, enjoyable} → enjoy
{stems, stemmer, stemming, stemmed} → stem
{argue, argued, argues, arguing} → argu

Unfortunately, stemming also has its downsides. Some examples might not be stemmed correctly, e.g. irregular verbs and plural forms. Words with a different meaning might also be stemmed to the same word. For example:

{break, broke, broken} ↛ break
{foot, feet} ↛ foot
{computer, compute} → compute

(15)

How stemming is done exactly lies outside the scope of this thesis and is not covered. The stemmer that is used is SnowballStemmer from NLTK (http://www.nltk.org/_modules/nltk/stem/snowball.html).
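As a brief illustration (our sketch, not code from the thesis), NLTK's SnowballStemmer is applied per token. English is used here to match the examples above; Blendle's Dutch documents would use the "dutch" variant:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ["enjoy", "enjoying", "enjoyed", "enjoyable"]])
# ['enjoy', 'enjoy', 'enjoy', 'enjoy']
print([stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]])
# ['argu', 'argu', 'argu', 'argu']
```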

2.4 Addition of properties

Every word in the title can be interpreted as more important than any word in the content [24]. Therefore, after the previously mentioned pre-processing steps, every word in the title is added three times to the content to strengthen what the document is about. The ID of the provider, a unique identifier that belongs to a provider, is also added three times to the content. This emphasizes that some providers produce more or less the same kind of content; e.g. the Quote magazine focuses on readers that have more money than time and therefore produces more economic documents.

2.5 Removal of short documents

Compared with normal documents, short documents come with severe sparsity problems, and directly applying algorithms to short texts suffers from this [25]. Documents that, after the previously mentioned pre-processing steps, contain fewer than 100 tokens are therefore seen as documents with no actual meaning. They introduce a lot of noise in the clustering as they do not contain enough meaningful words, and so all documents with fewer than 100 tokens are removed from the corpus.

2.6 Vector space model

A widely used model for representing documents is the vector space model [26]. Each document is represented as an n-dimensional vector, where each dimension corresponds to a separate term. Each document is defined as:

$$ \vec{W}_d = (w_{1,d}, w_{2,d}, \ldots, w_{n,d}) \qquad (2.1) $$

where $\vec{W}_d$ is called the document vector for document d and $w_{n,d}$ the weight of each term in the document. The dimensionality of the vectors is the number of words in the vocabulary. If a term occurs in the document, the value of the corresponding vector entry is non-zero. To compute the values of the vectors, term weighting is used. One of the best known representations is Term Frequency-Inverse Document Frequency (TF-IDF); for instance, it is used by 83% of surveyed text-based research-paper recommender systems [27].

2.7 Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a method for modelling the importance of words in the documents of a corpus. The more frequently a word appears in a document, the more important it is. However, its importance is offset by the frequency of the word in the corpus of documents. This helps to adjust for the fact that some words appear more frequently in general. TF-IDF contains two components: Term Frequency (TF) and Inverse Document Frequency (IDF) [28].

Term Frequency

The Term Frequency (TF) counts the occurrences of terms in a document and is defined as:

$$ tf(t, d) = f_{t,d} \qquad (2.2) $$



where $f_{t,d}$ is the frequency of occurrences of the term t in document d. However, the TF of a document is not sufficient in many cases.

Inverse Document Frequency

The Inverse Document Frequency (IDF) is a measure of how much information the word provides with respect to the whole corpus. Essentially, a common term receives a low weight, while a rare word receives a higher weight. This is defined as:

$$ idf(t, D) = \log_2 \frac{|D|}{|\{d \in D : t \in d\}|} \qquad (2.3) $$

where |D| is the total number of documents in the corpus and $|\{d \in D : t \in d\}|$ the number of documents in which the term t appears. The value of idf is always greater than or equal to 0.

Term Frequency-Inverse Document Frequency calculation

Combining TF with IDF leads to a new weight for each term in a document:

$$ tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) \qquad (2.4) $$

The result is then normalised to unit length to prevent a bias towards longer documents, which probably have a higher term count regardless of the actual importance of that term in the document. Thus, a high TF-IDF is reached by a high term frequency in the given document and a low document frequency of the term in the whole corpus. This tends to filter out common terms. As a term appears in more documents, the ratio inside the $\log_2$ of the idf approaches 1, bringing tfidf (and idf) closer to 0.
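A minimal sketch of equations 2.2-2.4 in plain Python (our illustration; in practice a library implementation such as Gensim's TfidfModel or scikit-learn's TfidfVectorizer would be used):

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    n = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log2(n / df[t]) for t in df}           # equation 2.3
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)                                 # equation 2.2
        vec = {t: tf[t] * idf[t] for t in tf}             # equation 2.4
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalise to unit length to avoid a bias towards long documents.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors
```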

Term Frequency-Inverse Corpus Frequency

Because of the continuous stream of documents, it is very costly to recalculate the TF-IDF weights of all words in all documents for each arriving document. Therefore, the IDF is replaced by the inverse corpus frequency, turning TF-IDF into Term Frequency-Inverse Corpus Frequency (TF-ICF). The inverse corpus frequency is calculated once on the 2.5 million existing documents; when a new document arrives, it is combined with the TF to calculate the TF-ICF. Results show that TF-ICF produces document clusters of comparable quality to those generated by TF-IDF, and that it is significantly faster [29].
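A sketch of the idea, mirroring the TF-IDF example above; the thesis does not prescribe this code, and the handling of terms unseen in the corpus is our assumption:

```python
import math
from collections import Counter

def corpus_icf(corpus_tokens):
    # Computed once on the existing corpus (e.g. the 2.5 million documents).
    n = len(corpus_tokens)
    df = Counter(t for doc in corpus_tokens for t in set(doc))
    return {t: math.log2(n / df[t]) for t in df}

def tf_icf(new_doc_tokens, icf):
    # Applied to each newly arriving document; no corpus-wide recomputation.
    tf = Counter(new_doc_tokens)
    # Assumption: terms unseen in the corpus get weight 0.
    return {t: tf[t] * icf.get(t, 0.0) for t in tf}
```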

2.8 Dimensionality reduction

When a TF vector is used, the number of dimensions equals the size of the vocabulary, so the documents are high-dimensional. For example, a corpus of 10,000 Blendle documents is about 135,000-dimensional. When clustering high-dimensional data, a major challenge to overcome is the curse of dimensionality: complete enumeration of all subspaces becomes intractable with increasing dimensionality, due to the exponential growth of the number of possible values with each dimension. The main goal of dimensionality reduction is to project high-dimensional data into a lower-dimensional space, while trying to keep the essential features of the original data intact. The lower the dimensionality, the faster clustering will be. However, it is unclear whether a reduced dimensionality decreases the quality of the clustering. Therefore, we perform experiments on dimensionality reduction in sections 6.3.2 and 6.3.3.

Popular dimensionality reduction techniques are Independent Component Analysis (ICA), Latent Semantic Indexing (LSI) and Random Projection (RP) [30]. According to various experiments in which documents are clustered, ICA and LSI perform best. For ICA it is sometimes necessary to first pre-select some dimensions with another method, due to memory limitations: the input matrix for ICA, in contrast to the one for LSI, is dense [30, 31, 32].

There is some discussion about the "best" number of dimensions: [30] claims an effective reduction to 100-200 dimensions or even fewer, while [32] encourages dimensions in the range of 300-500 and [31] suggests between 10 and 100.


2.8.1 Latent semantic indexing

The main feature of LSI is the use of Singular Value Decomposition (SVD) to identify patterns in the relationships between terms and concepts. It is based on the principle that words that are used in the same contexts tend to have similar meanings, and it is thus able to uncover the underlying latent semantic structure. It mainly consists of the following steps: [32, 33]

1. A matrix is formed wherein each row corresponds to a term and each column to a document. Each element (m, n) in the matrix corresponds to the number of times that term m occurs in document n.

2. Local and global term weighting is applied to the entries in the term-document matrix.

3. By means of SVD, this matrix is reduced to a product of three matrices, one of which has non-zero values only on the diagonal.

4. Dimensionality is reduced by "zeroing out" all but the k largest values on this diagonal, together with the corresponding columns. This process is used to generate a k-dimensional vector space.

To get a better understanding of LSI, we show an example of the concept (from https://www.quora.com/What-is-a-good-explanation-of-Latent-Semantic-Indexing-LSI/). If each word only

had one meaning, LSI would be easy, since there would be a simple mapping from words to concepts:

word-1 → concept-1
word-2 → concept-2
... → ...
word-n → concept-m

But all languages have synonyms and words with multiple meanings. These ambiguities obscure the concepts to a point where even people can have a hard time understanding them. Here, the mapping is very different:

word-1 → concept-1, concept-2
word-2 → concept-3?
... → ...?
word-n → concept-n?

For example, the word bank when used together with mortgage, loans and rates probably means a financial institution. However, the word bank when used together with lures, casting and fish probably means a stream or river bank. Essentially, what LSI tries to do is to treat similar concepts as one. Two documents (within a much larger corpus) represented by LSI in 5 dimensions might be of the form:

(1) [0.106*device + 0.078*google + 0.060*talk + 0.057*voice + 0.044*icon]
(2) [0.047*link + 0.027*ui + 0.018*main + 0.017*level + 0.016*locale]

Thus each document is represented by a weighting of topics.

A new document can be added incrementally to an LSI representation. However, such an incremental addition fails to process the co-occurrences of the newly added documents, and new terms are ignored entirely. Thus, the quality of the LSI representation degrades as more documents are added, and eventually a re-computation is required.

For this thesis, LSI is used to reduce the dimensionality to 400, 200 and 100. The experiments in section 6.3 determine which number of dimensions is superior in this case, or whether to keep the original dimensions.
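A minimal sketch of such a reduction with the Gensim framework [4] mentioned in section 1.4; the tiny corpus and the choice of 200 topics are placeholders for illustration:

```python
from gensim import corpora, models

# tokenized_docs: lists of tokens after the pre-processing of this chapter.
tokenized_docs = [["john", "like", "movi"], ["john", "watch", "footbal"]]

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
tfidf = models.TfidfModel(bow_corpus)                      # term weighting
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=200)
reduced = [lsi[vec] for vec in tfidf[bow_corpus]]          # reduced vectors
```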



Chapter 3

Hierarchical clustering

This chapter explains how agglomerative and divisive hierarchical clustering algorithms work.

3.1 Hierarchical

Clustering algorithms are divided into two types: flat and hierarchical clustering. Flat clusterings simply consist of a certain number of clusters, and the relation between clusters is often undetermined. A hierarchical clustering has a certain structure, with the usual interpretation that each node stands for a subclass of its parent node; thus each node represents the cluster that contains all the objects of its descendants. Within clustering there is another important distinction: soft or hard clustering. In a soft clustering, each object is assigned to multiple clusters, while in a hard clustering it is assigned to one and only one cluster. Hierarchical clustering is usually hard, while flat clustering can be both hard and soft [34]. Generally, hierarchical clustering algorithms produce more in-depth information for detailed analyses, while flat clustering algorithms are more efficient and provide sufficient information for some purposes, such as small or simple data sets with a clear distinction [35]. On top of that, hierarchical clustering is often portrayed as the higher-quality clustering approach, but it is limited by its low runtime performance. Sometimes the approaches are combined to "get the best of both worlds" [36]. Whether to choose flat or hierarchical clustering also depends on the underlying data. Blendle deals with documents that fit well into a taxonomy [37]: e.g. a document can belong to sports, football and domestic. In this thesis, we focus on hard hierarchical clustering as well as a combination of both approaches. There are basically two forms of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down) clustering.

3.2 Agglomerative

In agglomerative clustering each document starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. In order to decide which clusters should be combined, a measure of similarity between documents is required and merge criteria need to be defined. Agglomerative clustering works as follows: [36]

1. Calculate a similarity matrix whose (i, j)-th entry gives the similarity between the i-th and j-th clusters.
2. Merge the most similar (closest) two clusters.
3. Update the similarity matrix with the newly created cluster.
4. Repeat steps 2 and 3 until all entries are in one single cluster.

Choosing an appropriate similarity measure in step 1 and the right linkage method in step 2 are crucial for the clustering; different choices can lead to very different clusterings.
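As an illustration (not code from the thesis), SciPy implements exactly this procedure; the random data, metric and linkage choices below anticipate sections 3.2.1 and 3.2.2:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(100, 50)                      # 100 toy document vectors
# Cosine distance (1 - cosine similarity) with average linkage.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=5, criterion="maxclust")  # cut the tree into 5 clusters
```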


3.2.1 Similarity measures

The similarity matrix reflects the degree of closeness or separation of the documents. Typically, the similarity in step 1 is estimated by a function calculating the distance between vectors. If two documents are close according to this distance, they are regarded as similar. Four popular measures are Euclidean distance, cosine similarity, Jaccard similarity and Manhattan distance. Research points out that, in the case of text documents, these measures are ranked in the order cosine > Jaccard > Euclidean > Manhattan in terms of performance [38, 39, 40]. Therefore, we have decided to focus only on cosine similarity and Jaccard similarity.

Cosine similarity

The cosine similarity can be seen as the cosine of the angle between vectors (documents). The direction of the document vectors is assumed to be more important than their length and the distance between them [41]. It is one of the most popular similarity measures applied to text documents [42, 43, 13]. It is defined as: [8]

$$ sim_C(\vec{W}_a, \vec{W}_b) = \frac{\vec{W}_a \cdot \vec{W}_b}{\|\vec{W}_a\| \, \|\vec{W}_b\|} \qquad (3.1) $$

where $\vec{W}_a$ and $\vec{W}_b$ are document vectors, the numerator is the dot product of these vectors and the denominator is the product of the Euclidean norms of the vectors. The Euclidean norm is defined as:

$$ \|\vec{W}_d\| = \sqrt{\vec{W}_d \cdot \vec{W}_d} \qquad (3.2) $$

where the inner product is the dot product. Since TF-IDF weights cannot be negative, the cosine similarity takes values between 0 and 1, where 1 represents two very similar documents and 0 represents very different documents. For documents reduced with LSI, however, the value ranges from -1 to 1. A two-dimensional example is shown in figure 3.1, where each vector represents a document: the closer the vectors are, the more similar they are.

(a) Cosine of angle near 1: related documents. (b) Cosine of angle near 0: unrelated documents.

Figure 3.1: Cosine similarity in two dimensions.

Jaccard similarity

The second similarity measure is the Jaccard similarity. It measures similarity as the size of the intersection of two documents divided by the size of their union. It is defined as: [44]

$$ sim_J(D_a, D_b) = \frac{|D_a \cap D_b|}{|D_a \cup D_b|} \qquad (3.3) $$

where $D_a$ and $D_b$ are two documents. Consider three very simple documents:

(1) ["word-2", "word-3", "word-4", "word-2"] (2) ["word-1", "word-5", "word-4", "word-2"] (3) ["word-1"]


The number of intersections, or shared words, between documents 1 and 2 is 2: "word-2" and "word-4". The union, or the number of unique words, is 5. This results in a Jaccard similarity of 2/5 = 0.4. Because documents 1 and 3 are completely dissimilar, their Jaccard similarity is 0.

This similarity measure must, however, be extended in order to accept vectors with real-valued components. The extended Jaccard similarity is calculated as follows: [8]

$$ sim_{EJ}(\vec{W}_a, \vec{W}_b) = \frac{\vec{W}_a \cdot \vec{W}_b}{\|\vec{W}_a\|^2 + \|\vec{W}_b\|^2 - \vec{W}_a \cdot \vec{W}_b} \qquad (3.4) $$

where $\vec{W}_a$ and $\vec{W}_b$ are document vectors and the same calculation is expressed in terms of the vector scalar product and magnitude. It compares the weight of shared terms against the weight of terms which are present in either of the documents [13]. As with the cosine similarity, the (extended) Jaccard similarity takes values between 0 and 1, where 1 represents two very similar documents and 0 represents very different documents.
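Both measures take only a few lines of NumPy; this sketch of equations 3.1 and 3.4 is our illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Equation 3.1: dot product over the product of Euclidean norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def extended_jaccard(a, b):
    # Equation 3.4: dot product over squared norms minus the dot product.
    dot = a @ b
    return dot / (np.linalg.norm(a) ** 2 + np.linalg.norm(b) ** 2 - dot)

a = np.array([1.0, 2.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
print(cosine_similarity(a, b), extended_jaccard(a, b))
```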

3.2.2 Linkage

Step 2 in agglomerative clustering consists of merging the most similar (closest) two clusters. There are several known linkage criteria for this; popular techniques are single, complete and average linkage [45, 46]. Research points out that these linkage methods are ranked in the order average > complete > single in terms of quality [47, 48, 49].

Single linkage

Single linkage clustering is based on the shortest distance. The link between two clusters is made by a single document pair: the two documents (one in each cluster) that are closest to each other according to a similarity measure. It is defined as:

$$ linkage_S(D_a, D_b) = \min_{d_a \in D_a,\, d_b \in D_b} \left(1 - sim_X(d_a, d_b)\right) \qquad (3.5) $$

where $1 - sim_X(d_a, d_b)$ denotes the distance between two documents, based on either the cosine or Jaccard similarity as explained in section 3.2.1. Single linkage clustering has a tendency to form long chains: as only one pair of documents needs to be close, irrespective of all others, clusters can become too spread out and not compact enough.

Complete linkage

Complete linkage can be seen as the opposite of single linkage. With complete linkage, the link between two clusters is made by the two documents that are farthest away from each other. Not surprisingly, it is defined as:

$$ linkage_C(D_a, D_b) = \max_{d_a \in D_a,\, d_b \in D_b} \left(1 - sim_X(d_a, d_b)\right) \qquad (3.6) $$

which only changes min to max in comparison with equation 3.5. Complete linkage is sensitive to outliers.

Average linkage

Average linkage is more or less a compromise between single and complete linkage. This time, the link between two clusters is made by the average distance between all pairs of documents, where each pair consists of one document from each cluster. It is defined as:

$$ linkage_A(D_a, D_b) = \frac{1}{|D_a| \cdot |D_b|} \sum_{d_a \in D_a} \sum_{d_b \in D_b} \left(1 - sim_X(d_a, d_b)\right) \qquad (3.7) $$


3.3 Divisive

In divisive clustering all documents start in one cluster and splits are performed recursively as one moves down the hierarchy. This is particularly useful if one does not want to generate a complete hierarchy all the way down to individual documents. For divisive clustering, flat clustering algorithms are used recursively to form a hierarchical clustering. In contrast with agglomerative clustering, this does not create a binary tree. Divisive clustering works as follows:

1. Determine the optimal number of clusters k.
2. Use k to cluster with a flat clustering algorithm.
3. Recursively repeat steps 1 and 2 until a stop condition has been reached.

Both the choice of the method to determine the optimal number of clusters k and the choice of flat clustering algorithm have a huge impact on the clustering.

3.3.1 Determining the optimal number of clusters k

For step 1, the optimal number of clusters must be found, but often there is no single "correct answer". Various interpretations depend on the shape and scale of the distribution of documents and on the clustering desired by the user. Ideally, documents within a cluster are as similar as possible, whereas documents from different clusters are as dissimilar as possible. There are several methods for making this decision: silhouette, gap statistic and f(K).

Silhouette

The silhouette is a measure of how close each point in one cluster is to points in the neighbouring clusters. It measures the similarity of a document to the documents of its own cluster compared to the documents of other clusters [51]. It is defined as:

$$ silhouette(d) = \frac{b_d - a_d}{\max(a_d, b_d)} \qquad (3.8) $$

where $a_d$ is the mean distance between document d and the other documents of its cluster, also known as the mean within-cluster distance, and $b_d$ the mean distance between document d and the documents of the nearest cluster that d is not a part of, also known as the mean between-cluster distance. The silhouette ranges from -1, indicating wrongly assigned documents, through 0, indicating overlapping clusters, to 1, indicating well-clustered documents. The average silhouette value of a clustering is used to determine the optimal value of k. Figure 3.2 shows an example silhouette analysis of 1000 random samples with an obvious synthetic clustering of 4 clusters. For each cluster, the samples are ordered by silhouette value, which explains the shape of each cluster in the figures. In figure 3.2a, the number of clusters is 4, which results in an average silhouette value of 0.74. Figure 3.2b performs worse with 6 clusters, resulting in an average silhouette value of 0.51. This analysis can be done for all candidate values of k to determine which number of clusters yields the best average silhouette value. So, according to the silhouette analysis, choosing 4 clusters outperforms choosing 6 in this example.
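A minimal sketch of this selection procedure with scikit-learn; the candidate range and the use of k-means are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    # Cluster for each candidate k and keep the k with the highest
    # average silhouette value over all samples.
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)
```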

Gap statistic

The gap statistic is another way to determine the optimal k. It builds a clustering on a uniformly sampled data set with the same dimensions as our own data and compares the sum of intra-cluster distances of the clusters of the randomly generated data with that of our data. The gap is the difference between these intra-cluster distances. It is essentially a way to find the "elbow" in a plot of intra-cluster distances against the number of clusters k: the optimal k is the smallest value for which the gap quantity becomes positive. Figure 3.3 shows an example of the gap statistic applied to 300 random samples with an obvious synthetic clustering of 4 clusters. The optimal number of clusters is clearly at k = 4. Choosing k = 2 might be the second best answer, as the gap statistic almost becomes positive there.


(a) 4 clusters, average of 0.74. (b) 6 clusters, average of 0.51.

Figure 3.2: Silhouette analysis for k-means. The red line indicates the mean silhouette value and the thickness of the clusters represents the cluster size.

Figure 3.3: Gap statistic for k-means. It finds that 4 clusters is optimal because that is the smallest value for which the gap statistic becomes positive.

f(K)

An alternative approach to finding the optimal number of clusters is a function f(K) that evaluates the quality of the resulting clustering. It takes into account the internal distribution of each cluster as well as its relation to other clusters. Each cluster is therefore represented by its distortion and its contribution to the sum of all distortions: [52]

$$ S_K = \sum_{j=1}^{K} I_j \qquad (3.9) $$

where K is the specified number of clusters and $I_j$ the distortion of cluster j, a measure of the distance between the points in a cluster and its centroid. The evaluation function f(K) is defined using the equations: [52]

$$ f(K) = \begin{cases} 1 & \text{if } K = 1 \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0, \forall K > 1 \\ 1 & \text{if } S_{K-1} = 0, \forall K > 1 \end{cases} \qquad (3.10) $$


$$ \alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \end{cases} \qquad (3.11) $$

where $N_d$ is the number of dimensions of the data set and $\alpha_K$ a weight factor. f(K) is then the ratio of the real distortion to the estimated distortion: the lower the value of f(K), the more concentrated the data distribution and the better defined the clusters. According to [52], values f(K) < 0.85 indicate good clusterings. Figure 3.4 shows an example of f(K) applied to 300 random samples with an obvious synthetic clustering of 4 clusters. It shows a slight preference for 4 over 2 clusters. Thus, the optimal number of clusters is 4; however, the valley at 2 indicates a potential alternative clustering.


Figure 3.4: f(K) method for k-means. It finds that 4 clusters is optimal.
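A minimal sketch of the f(K) method (our implementation of equations 3.9-3.11, using scikit-learn's KMeans inertia as the distortion $S_K$; the data X and the range of K are assumptions for illustration):

```python
from sklearn.cluster import KMeans

def f_of_k(X, max_k=10):
    n_dims = X.shape[1]
    f, s_prev, alpha = [], None, None
    for k in range(1, max_k + 1):
        s_k = KMeans(n_clusters=k, n_init=10).fit(X).inertia_  # distortion S_K
        if k == 1:
            f.append(1.0)                                      # equation 3.10
        else:
            # Weight factor alpha_K (equation 3.11).
            alpha = 1 - 3 / (4 * n_dims) if k == 2 else alpha + (1 - alpha) / 6
            f.append(s_k / (alpha * s_prev) if s_prev > 0 else 1.0)
        s_prev = s_k
    return f  # the k with the lowest f(K) is the suggested number of clusters
```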

3.3.2 Flat clustering algorithms

The flat clustering algorithm used in step 2 of divisive clustering has a huge impact on the clustering. Therefore, various flat clustering algorithms are analysed: k-means, spherical k-means, von Mises-Fisher (vMF) and two variants of k-medoids, with Euclidean distance and with cosine similarity [53, 54].

K-means

K-means is one of the most well-known clustering algorithms and is widely used in document clustering [8, 36]. It treats each document as an object with a location in space. Basically, it works as follows:

1. Initialize k centroids. These are just document vectors without underlying documents.
2. Assign each document to its closest centroid.
3. Update each centroid to be the mean of its constituent documents.

This process is repeated until a stop condition has been reached, e.g. when no document is assigned to a new centroid between iterations. For step 2, the Euclidean distance is used: the straight-line distance between the document and a centroid. The main objective of k-means is to minimize the mean-squared error [41]:

$$ E = \frac{1}{N} \sum_{d} \|d - \mu_{k(d)}\|^2 \qquad (3.12) $$

where the mean minimizes the squared error, $\{\mu_1, \ldots, \mu_k\}$ is the set of centroid vectors, $k(d) = \arg\min_{k \in \{1,\ldots,K\}} \|d - \mu_k\|$ is the index of the cluster centroid closest to document vector d, and N is the number of documents.


There are many ways to enhance this basic k-means algorithm [55, 56, 57]. We chose the mini-batch k-means variant due to its reduced computation time, the results generally being only slightly different from those of the standard algorithm [58]. It is based on subsets of the input data, randomly sampled in each iteration. It works as follows:

1. Draw b samples randomly from the data set to form a mini-batch.
2. Assign each document from the mini-batch to its closest centroid.
3. Update each centroid to be the mean of its constituent documents.
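A minimal sketch with scikit-learn's MiniBatchKMeans, which implements this mini-batch scheme; the toy documents and parameter values are illustrative assumptions:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["john likes movies", "john watches football", "mary likes movies"]
X = TfidfVectorizer().fit_transform(docs)           # documents as TF-IDF vectors
km = MiniBatchKMeans(n_clusters=2, batch_size=100).fit(X)
print(km.labels_)                                   # cluster index per document
```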

Spherical k-means

As described in section 3.2.1, cosine similarity is one of the most popular similarity measures applied to text documents. The spherical k-means algorithm is essentially k-means with cosine similarity instead of Euclidean distance, and it has been found to work well for text clustering [59, 40]. Instead of minimizing the mean-squared error as in equation 3.12, it aims to maximize the average cosine similarity:

$$ L = \sum_{d} d^T \mu_{k(d)} \qquad (3.13) $$

where $\{\mu_1, \ldots, \mu_k\}$ is a set of unit-length centroid vectors and $k(d) = \arg\max_k d^T \mu_k$ the index of the cluster centroid closest to document vector d. After each maximization step, the estimated cluster centroids are projected onto the unit sphere.

Von Mises-Fisher

The von Mises-Fisher (vMF) distribution has a mean direction $\mu$ and a concentration parameter $\kappa$. The latter characterizes how strongly the document vectors are concentrated about the mean direction $\mu$; a larger $\kappa$ thus leads to a more concentrated cluster of points. Each document $x_i$ drawn from the vMF distribution, as well as the mean direction, lies on the surface of the unit hypersphere (i.e., $\|x_i\|_2 = 1$ and $\|\mu\|_2 = 1$). The documents can be modelled as a mixture of vMF distributions with an additional weight parameter $\alpha$. Generally, in mixture models every document is assigned a weighted mixture of clusters. The mixture of vMF distributions and spherical k-means are closely related: the latter can be seen as a special case of the former in which all weights are equal and all concentrations are equal, in which case they behave the same. In [54], vMF outperformed k-means on documents.

K-medoids: Euclidean and cosine

K-medoids is related to the k-means algorithm. It attempts to minimize the sum of pairwise dissimilarities instead of the mean-squared error (3.12). In contrast to k-means, it chooses medoids, or exemplars, as cluster centers. In k-means, the mean is greatly influenced by outliers and thus may not represent the correct cluster center, while in k-medoids, the medoid is robust to outliers and is able to represent the cluster center correctly [60]. K-medoid clustering is most commonly realised by means of Partitioning Around Medoids (PAM), wherein the cost is defined as the sum of distances of points to their medoid. It works as follows:

1. Initialize k medoids (exemplars) randomly.
2. Assign each document to its closest medoid.
3. For each medoid m and each non-medoid data point o:
   (a) Swap m and o.
   (b) Recompute the cost.
   (c) If the total cost of the configuration increased in the previous step, undo the swap.

Step 3 is repeated while the cost of the configuration decreases. The Euclidean distance and the cosine similarity are likely to give different results [61].
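A naive sketch of the PAM loop described above (our illustration; dist is assumed to be a precomputed n × n distance matrix, e.g. Euclidean distances or one minus cosine similarities):

```python
import numpy as np

def pam(dist, k, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))  # step 1: random medoids
    cost = dist[:, medoids].min(axis=1).sum()             # sum of distances to medoids
    improved = True
    while improved:                                       # repeat while cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]  # swap medoid i and o
                new_cost = dist[:, candidate].min(axis=1).sum()
                if new_cost < cost:                       # keep the swap only if cost drops
                    medoids, cost, improved = candidate, new_cost, True
    labels = dist[:, medoids].argmin(axis=1)              # step 2: assign to closest medoid
    return medoids, labels
```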


3.3.3 Stop-conditions

For step 3 in divisive clustering a stop condition must be defined. Of course, if one wants to generate a complete hierarchy all the way down, the stop condition is that each document is in its own cluster (which is how agglomerative clustering starts). But to avoid computational expense, earlier stop conditions are used:

1. The maximum tree depth.
2. Sufficiently small clusters.
3. Sufficiently homogeneous clusters.

Of those, the first is the most popular and usually sufficient [62].

3.4 Comparison of agglomerative and divisive

3.4.1 Time complexity

Agglomerative clustering algorithms usually have a time complexity of O(n³ · d), with n the number of documents and d the dimension of each point [44]. Some implementations have been optimized and achieve O(n² · log(n) · d), or even O(n² · d) for single linkage [63].

If there is no desire to generate a complete hierarchy all the way down to individual document leaves, divisive clustering has the advantage of being more efficient than agglomerative clustering. Depending on the flat algorithm used, divisive algorithms are linear in the number of documents and clusters [44]. For divisive k-means, the complexity consists of a repetition of k-means clusterings; a drawback here is that k-means requires the number of clusters a priori. The time complexity of k-means is O(n · d · k), with n the number of documents, k the number of clusters and d the dimension of each point. Combining this with the maximum possible number of clusters m − 1 per depth and the maximum depth l of the divisive clustering, an example time complexity for divisive k-means is O((m − 1)^l · n · d · k).

Here follows an example to illustrate the difference for 10,000 documents. The dimension d is left out because it occurs in both complexities. Agglomerative varies from O(10000² = 10⁸) to O(10000³ = 10¹²). Divisive results in O(9³ · 10000 · 10 ≈ 7.29 × 10⁷) for a maximum depth of 3 and a maximum possible number of 10 clusters. As the number of documents increases, agglomerative grows polynomially (quadratically to cubically) and divisive linearly. For 1,000,000 documents, agglomerative varies from O(10¹²) to O(10¹⁸) while divisive results in 7.29 × 10⁹. So divisive is the clear winner here if one does not want to cluster all the way down, i.e. by keeping l and m low. Experiments on the time complexity in section 6.1 decide which form performs better.

3.4.2 Performance

Which algorithm performs best depends on the use case. Agglomerative clustering does not take the global distribution into account; it focuses on local patterns, and these early decisions are permanent and cannot be undone. Divisive clustering, on the other hand, has complete information about the global distribution, which is really helpful for making top-level decisions [44].

Research points out that divisive clustering has comparable or better clustering performance than agglomerative clustering for documents [64]; this research also mentions that agglomerative clustering is very time-consuming. Another study examines the performance of a divisive variant, bisecting k-means, against agglomerative clustering [36], and concludes that bisecting k-means seems to do consistently slightly better at producing document hierarchies.


Chapter 4

Evaluation

This chapter focuses on the evaluation of the resulting clusters to answer the question: how is the quality of the found structure evaluated? We focus on both intrinsic and extrinsic metrics and we present a new experimental evaluation method.

4.1 Intrinsic metrics

With intrinsic metrics, the clustering is evaluated based on the clustered data itself. Intrinsic metrics often assign the best score to clusters with a high similarity within a cluster and a low similarity between clusters, but a high score does not automatically mean good results [44]. Besides, intrinsic metrics tend to have a bias towards algorithms that use the same cluster model. For example, both k-means clustering and a distance-based intrinsic metric focus on optimizing object distances, resulting in an overly optimistic evaluation of the clustering. Intrinsic metrics therefore only give some insight into whether one algorithm performs better than another; this does not imply that one algorithm produces more valid results than another [65]. For this thesis we focus on three intrinsic metrics: Silhouette, Calinski-Harabasz (CH) and Dunn.

Figure 4.1 shows two clusterings of 1500 samples produced with k-means: one with 2 clusters and one with 3 clusters. For each of these clusterings, the three intrinsic metrics are calculated to give a better understanding of how intrinsic metrics work. For all three metrics, a higher score indicates a better result, as explained further in this section. All three score higher for figure 4.1b, indicating that choosing 3 clusters is better than 2 for this example. In figure 4.1a some orange samples fall in the green cluster, and the clusters are therefore disturbed.

To make these intrinsic metrics suitable for hierarchical clustering, a weighting according to the size of the clusters is applied.

4.1.1 Silhouette

Silhouette, as already explained in section 3.3.1, is one way of doing intrinsic evaluation. To recall: it measures how close each point in one cluster is to points in the neighbouring clusters. It ranges from -1, indicating wrongly assigned documents, through 0, indicating overlapping clusters, to 1, indicating well-clustered documents.

4.1.2 Calinski-Harabasz

The Calinski-Harabasz (CH) intrinsic metric, also known as the variance ratio criterion, evaluates the ratio of the between-cluster to the within-cluster sum of squares. It is defined as: [66]

CH(k) = \frac{B(k)}{W(k)} \times \frac{N - k}{k - 1} \qquad (4.1)

where k is the number of clusters, N the total number of documents, and B(k) and W(k) the between- and within-cluster sums of squares for the k clusters, respectively.


[Figure 4.1: Example scatter plot of 1500 samples to demonstrate intrinsic metrics. (a) 2 clusters, resulting in silhouette = 0.628, CH = 3418, Dunn = 0.021. (b) 3 clusters, resulting in silhouette = 0.733, CH = 10633, Dunn = 0.143.]

The sum of squares is often referred to as the sum, over all observations, of the squared differences of each observation from the overall mean [67]. CH is a non-negative real number; the higher its value, the better the clustering.

4.1.3 Dunn

Another intrinsic metric is Dunn. It is the ratio of the minimal between-cluster distance to the maximal within-cluster distance. It is defined as: [68]

D = \frac{d_{min}}{d_{max}} \qquad (4.2)

where d_{min} is the smallest distance between documents in two different clusters and d_{max} the largest within-cluster distance between documents. Like CH, Dunn is a non-negative real number, and a higher value indicates a "better" clustering.
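As an illustration, the following sketch computes all three intrinsic metrics for a flat clustering. silhouette_score and calinski_harabasz_score are available in recent versions of scikit-learn; the Dunn index is not, so it is computed directly from pairwise distances. Note that several variants of the Dunn index exist; this one follows the definition above.

```python
# Sketch: computing the three intrinsic metrics for a flat clustering.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # d_min: smallest distance between documents in two different clusters.
    d_min = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    # d_max: largest within-cluster distance between documents.
    d_max = max(pdist(c).max() for c in clusters if len(c) > 1)
    return d_min / d_max

X, _ = make_blobs(n_samples=1500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))         # in [-1, 1], higher is better
print(calinski_harabasz_score(X, labels))  # >= 0, higher is better
print(dunn_index(X, labels))               # >= 0, higher is better
```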

4.2 Extrinsic metrics

With extrinsic metrics, the clustering is evaluated by comparison with a gold standard. Ideally, the gold standard is produced by human raters with a good level of inter-rater agreement [44]; this is further explained in section 5.2.1. For this thesis we use precision and recall, combined into the F-measure. An analysis of a wide range of extrinsic evaluation metrics shows that only BCubed satisfies all of a set of formal constraints, each of which sheds light on an aspect of clustering quality that should be captured [69]. Therefore, the BCubed versions of precision, recall and F-measure are used.

4.2.1 Precision & recall

Precision denotes how many documents in a document's cluster belong to its category. Symmetrically, recall denotes how many documents of its category appear in its cluster. This is visually explained in figure 4.2, which we discuss later in this section. These metrics are calculated per document. If a document has a high recall, most of its related documents are found without leaving its cluster. If a document has a high precision, there are few noisy documents in its cluster. The overall precision and recall are the averaged precision and recall over all documents. To define precision and recall, first a notion of correctness is needed: two documents are correctly related when they share a category if and only if they appear in the same cluster: [69]

(28)

\text{Correctness}(e, e') = \begin{cases} 1 & \text{iff } L(e) = L(e') \leftrightarrow C(e) = C(e') \\ 0 & \text{otherwise} \end{cases} \qquad (4.3)

where e and e' are two different documents, C(e) the cluster assigned to e by the clustering and L(e) the category assigned to e by the gold standard. Precision and recall are then defined as: [69]

\text{Precision} = \text{Avg}_e\big[\text{Avg}_{e'.\,C(e)=C(e')}[\text{Correctness}(e, e')]\big] \qquad (4.4)

and

\text{Recall} = \text{Avg}_e\big[\text{Avg}_{e'.\,L(e)=L(e')}[\text{Correctness}(e, e')]\big] \qquad (4.5)

Figure 4.2 illustrates how these two are calculated for one document. The precision in figure 4.2a is 4/5 for the indicated document, because 4 out of the 5 documents in its cluster belong to its (green) category. The recall in figure 4.2b is 4/6 for the same document, because 4 out of the 6 green documents in total are in its cluster. Calculating this for all documents and averaging yields the overall precision and recall for this clustering.

[Figure 4.2: Example scatter plot of 13 documents in 3 clusters, demonstrating the extrinsic metrics for one document. (a) Precision(e) = 4/5. (b) Recall(e) = 4/6.]
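The same computation in code: below is a minimal sketch of the flat BCubed scores from equations 4.3 to 4.6. The function name and the list-based representation (one cluster id and one gold-standard category per document) are illustrative assumptions.

```python
# Sketch of the flat BCubed scores (equations 4.3-4.6). `clusters` and
# `categories` are equal-length lists: clusters[i] is the cluster id the
# clustering assigned to document i, categories[i] its gold-standard category.
def bcubed(clusters, categories, alpha=0.5):
    n = len(clusters)
    precision = recall = 0.0
    for e in range(n):
        same_cluster = [e2 for e2 in range(n) if clusters[e2] == clusters[e]]
        same_category = [e2 for e2 in range(n) if categories[e2] == categories[e]]
        # Within e's cluster, Correctness reduces to "shares e's category".
        precision += sum(categories[e2] == categories[e]
                         for e2 in same_cluster) / len(same_cluster)
        # Within e's category, Correctness reduces to "shares e's cluster".
        recall += sum(clusters[e2] == clusters[e]
                      for e2 in same_category) / len(same_category)
    precision /= n
    recall /= n
    f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f
```

For the 13 documents of figure 4.2, calling bcubed with their cluster ids and colour categories averages exactly the per-document fractions illustrated above.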

4.2.2 F-measure

Precision and recall are combined into a single evaluation metric: the F-measure. It can be interpreted as a weighted harmonic mean of precision and recall. It is defined as: [70]

F(\text{Recall}, \text{Precision}) = \frac{1}{\alpha \left(\frac{1}{\text{Precision}}\right) + (1 - \alpha)\left(\frac{1}{\text{Recall}}\right)} \qquad (4.6)

where \alpha and (1 − \alpha) denote the relative weights of the two metrics. Applying F with \alpha = 0.5 is recommended for BCubed [69].

4.2.3 Extension for hierarchical clustering

The above definitions do not account for hierarchical clusterings. From a general point of view, each document in a leaf cluster also occurs in all of the leaf's ancestors. A hierarchical clustering can therefore be seen as an overlapping clustering where each cluster at each depth is associated with a category [69]. For this, the multiplicity of occurrences of a document in clusters and categories must be taken into account. Precision and recall are therefore redefined as:


\text{MultiplicityPrecision}(e, e') = \frac{\min(|C(e) \cap C(e')|, |L(e) \cap L(e')|)}{|C(e) \cap C(e')|} \qquad (4.7)

and

\text{MultiplicityRecall}(e, e') = \frac{\min(|C(e) \cap C(e')|, |L(e) \cap L(e')|)}{|L(e) \cap L(e')|} \qquad (4.8)

where e and e' are two different documents, C(e) the set of clusters assigned to e by the clustering and L(e) the set of categories assigned to e by the gold standard. These scores are integrated into the overall precision and recall by replacing Correctness from equation 4.3 with MultiplicityPrecision and MultiplicityRecall in equations 4.4 and 4.5: [69]

\text{Precision} = \text{Avg}_e\big[\text{Avg}_{e'.\,C(e) \cap C(e') \neq \emptyset}[\text{MultiplicityPrecision}(e, e')]\big] \qquad (4.9)

and

\text{Recall} = \text{Avg}_e\big[\text{Avg}_{e'.\,L(e) \cap L(e') \neq \emptyset}[\text{MultiplicityRecall}(e, e')]\big] \qquad (4.10)

where the precision associated with one item is its averaged multiplicity precision over the other items sharing some of its clusters; the same holds for recall with respect to categories. The F-measure calculation remains the same.
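A minimal sketch of these multiplicity-based scores follows, assuming each document maps to a *set* of cluster ids and a set of category ids (for example, one id per depth of the hierarchy); the function name and this representation are illustrative assumptions.

```python
# Sketch of the multiplicity-based BCubed scores (equations 4.7-4.10).
# `clusters` and `categories` map each document id to the set of clusters
# and the set of categories it occurs in.
def multiplicity_bcubed(clusters, categories):
    docs = list(clusters)
    precision = recall = 0.0
    for e in docs:
        # Items sharing at least one cluster / at least one category with e.
        p_peers = [e2 for e2 in docs if clusters[e] & clusters[e2]]
        r_peers = [e2 for e2 in docs if categories[e] & categories[e2]]
        precision += sum(
            min(len(clusters[e] & clusters[e2]),
                len(categories[e] & categories[e2]))
            / len(clusters[e] & clusters[e2])
            for e2 in p_peers) / len(p_peers)
        recall += sum(
            min(len(clusters[e] & clusters[e2]),
                len(categories[e] & categories[e2]))
            / len(categories[e] & categories[e2])
            for e2 in r_peers) / len(r_peers)
    return precision / len(docs), recall / len(docs)
```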

4.3 Evaluation based on classification

4.3.1 ClusClas

In this section, we introduce an experimental evaluation paradigm for determining the intrinsic quality of clusters. The main idea is that different clustering and classification algorithms are based on different properties: one might minimize the mean squared error, while another maximizes the cosine similarity. Clustering and classification are often confused, but can be distinguished as follows: classification assigns objects to already defined classes, whereas for clustering no knowledge about the classes is provided. By first letting a particular clustering algorithm produce a clustering, a classification algorithm then tries to "reproduce" those results with its own particular properties. If this leads to a good F-measure, the clustering can be considered a good clustering: the properties of the classification algorithm are used to determine whether the clustering algorithm has done a good job. Essentially, the idea behind this experimental set-up is to combine the properties of both clustering and classification algorithms; therefore, we call the proposed methodology ClusClas. ClusClas works as follows (a sketch in code follows the list):

1. A clustering algorithm of choice creates a clustering with 90% of the data.
2. A classification model is trained on the results of the clustering.
3. The classification model from step 2 is used to predict the cluster of the remaining 10% of the data.
4. Test each classifier's performance using your favourite metric (e.g. F-measure).
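One plausible reading of this procedure in code is given below. K-means, logistic regression and the BCubed comparison against the gold standard are illustrative choices, not prescribed by ClusClas itself; `bcubed` is the function sketched in section 4.2.1, used because it compares cluster ids against gold categories without requiring label alignment.

```python
# Sketch of ClusClas; step numbers refer to the list above.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clusclas(X, gold, n_clusters=10):
    # Hold out 10% of the documents together with their gold-standard labels.
    X_train, X_test, _, gold_test = train_test_split(
        X, gold, test_size=0.1, random_state=0)
    # Step 1: cluster 90% of the data.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train)
    # Step 2: train a classifier on the cluster assignments.
    clf = LogisticRegression(max_iter=1000).fit(X_train, kmeans.labels_)
    # Step 3: predict the cluster of the remaining 10%.
    predicted = clf.predict(X_test)
    # Step 4: score the predictions, here with BCubed precision/recall/F
    # against the gold standard (bcubed as sketched in section 4.2.1).
    return bcubed(list(predicted), list(gold_test))
```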

For a reliable result, a clustering algorithm needs to be evaluated with several classification algorithms. A use case of ClusClas is to emphasize how well a clustering performs. As the method is experimental, it should only be used in addition to existing evaluation methods. By using the properties of both clustering and classification methods, the scope of how well a clustering performs is widened. The experiments in section 6.2 examine this. Note that when the classifier cannot properly separate the classes, it is very difficult to determine whether this is the classifier's or the clustering's fault.


4.3.2 Unsupervised ClusClas

The problem with ClusClas is that it still requires a ground truth, and if a ground truth is available, one should probably just use the BCubed F-measure, which is much more theoretically grounded. An alternative is unsupervised ClusClas, for which no ground truth is necessary. It works as follows (a sketch in code follows the list):

1. A clustering algorithm of choice creates a flat clustering of the data.
2. All data points get assigned a "label" indicating which cluster they are in (e.g. label A for all data points in the first cluster, B for all data points in the second cluster, and so on).
3. Split all data points into a train and a test set (e.g. 90% and 10%).
4. A classification model is trained on the results of the clustering.
5. Test each classifier's performance on the held-out test data points using your favourite metric (e.g. F-measure).

Here too, a clustering algorithm needs to be evaluated with several classification algorithms for a reliable result.
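A minimal sketch of unsupervised ClusClas follows, again with k-means and logistic regression as illustrative choices. Since the classifier is scored against the clustering's own labels rather than a gold standard, an ordinary macro-averaged F1 suffices here.

```python
# Sketch of unsupervised ClusClas; step numbers refer to the list above.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def unsupervised_clusclas(X, n_clusters=10):
    # Steps 1-2: cluster all data; the cluster ids act as the labels.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    # Step 3: split into 90% train and 10% test.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.1, random_state=0)
    # Step 4: train a classifier on the clustering's own labels.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Step 5: score the classifier on the held-out documents.
    return f1_score(y_te, clf.predict(X_te), average="macro")
```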
