University of Amsterdam

Final year project

A Comparison of Clustering Algorithms

Rick Vergunst

Supervised by Dr. Dick Heinhuis
Revised by Laura Wennekes

July 13, 2017


Abstract

Within this paper an attempt was made to compare different clustering techniques. The algorithms that were chosen each represent a different clustering technique. These algorithms were used to cluster six different data sets for which the correct classifications were known. The results were evaluated on the basis of several metrics in order to guarantee an inclusive overview. It can be concluded that overall the two best techniques in this test were Gaussian Mixture and Mini Batch K-Means, with the latter outperforming the former on scalability. For small data sets with few features, Birch proved most successful. Another conclusion that can be drawn from this paper is that if the desired number of clusters is unknown, DBSCAN shows the best scalability while maintaining decent performance.


Contents

1 Introduction
2 Theoretical Framework
   2.1 Clustering
   2.2 Cluster analysis techniques
   2.3 Cluster analysis comparisons
3 Methodology
   3.1 Data sets
       3.1.1 Banknote authentication
       3.1.2 Htru2
       3.1.3 Skin tone
       3.1.4 Spam detection
       3.1.5 Driving motion
       3.1.6 Forest recognition
       3.1.7 Data preparation
   3.2 Algorithms
       3.2.1 Gaussian Mixture
       3.2.2 Mini Batch K-Means
       3.2.3 Ward hierarchical clustering
       3.2.4 DBSCAN
       3.2.5 Birch
       3.2.6 Affinity Propagation
       3.2.7 Mean Shift
       3.2.8 Spectral Clustering
   3.3 Evaluation metrics
       3.3.1 Time and Memory
       3.3.2 Adjusted Rand Index
       3.3.3 Adjusted Mutual Information
       3.3.4 Homogeneity, completeness and V-measure
       3.3.5 Fowlkes-Mallows score
       3.3.6 Calinski-Harabaz Index
       3.3.7 Silhouette Coefficient
       3.3.8 Implementation
4 Results
   4.1 Time and Memory
   4.2 Known classification metrics
   4.3 Unknown class labels
5 Conclusion
6 Discussion
   6.1 Algorithm performance
   6.2 Metric performance
   6.3 Data sets
   6.4 Further research

1 Introduction

Within the fields of data mining and machine learning, many different methods to analyze data exist. These methods range from anomaly detection to classification. In data mining the goal is to find patterns within data, regardless of the nature of that data. The methods are often a combination of statistics, artificial intelligence, machine learning and databases (Fayyad et al., 1996). With these patterns one can try to give meaning to data, which in turn can, for example, support decision making (Berry and Linoff, 1997). Most available methods can easily be compared and measured: because they produce definite results, those results can be judged as either right or wrong. However, this does not hold true for the clustering method. Clustering is the counterpart of classification, in which data is grouped into a certain number of clusters (Jain et al., 1999). The difference between clustering and classification is whether or not the groups are pre-defined. An example of classification is the outcome of a football match: there are three possible groups, a win, a draw or a loss, and every data point has to be assigned to one of them. Within clustering this is not possible; the number of groups is unknown and should thus be found during the clustering process.

Clustering can be used in a broad range of scientific fields. This is recognized by both Jain et al. and Kaufman and Rousseeuw (Jain et al., 1999)(Kaufman and Rousseeuw, 2009). Applications of clustering appear not only within the different data mining fields, but also in fields such as economics, sociology, psychology and even environmental studies (Breiger et al., 1975)(Nugent and Meila, 2010)(Focardi et al., 2001)(Tan et al., 2013). As a result, any research that improves the relevant algorithms will yield progress in numerous fields. Within data mining, clustering holds a unique status: in order to find patterns in unstructured data, or to determine whether the data holds any value at all, one must resort to the clustering method. Because there is no alternative to this specific technique, it is vital that all available information surrounding it is kept up to date. This way, the best possible results are ensured at all times.

As stated by Mayer-Schönberger and Cukier, big data will gain importance in our society in the coming years (Mayer-Schönberger and Cukier, 2013). This is mostly due to the huge value the data can hold; extracting that value from big data can be a huge advantage for any business or institution (Katal et al., 2013)(Roski et al., 2014)(McAfee et al., 2012)(Tien, 2013). To understand these large amounts of data, they have to be analyzed efficiently, and the clustering method proves useful in this venture (Shirkhorshidi et al., 2014). Clustering can be the first step in finding new value in data, especially in large amounts of unlabeled and unstructured data. By making sense of this data and clustering it, businesses, for example, can adjust their strategy based on these clusters and change their behaviour based on the cluster that is relevant to them. Performance becomes an important factor in this new field as well, because the sheer amount of data being worked with keeps increasing. Because of this influx of data, the trial-and-error method is becoming more expensive and choosing the right algorithm is even more delicate. Some initial attempts to evaluate performance have already been made, but further research is necessary to get a better overview (Shirkhorshidi et al., 2014)(Aggarwal and Reddy, 2013). Furthermore, it is important to know which algorithms work well for big data. Especially determining the scalability of algorithms, and finding those that prove useful for big data or need only little adjustment, is vital since big data is invading every field of work (Manovich, 2011).

Within the clustering method, there are several different algorithms to choose from, each with its own features and merits. However, the results of these algorithms differ significantly and can give a different outlook on the data. The choice of algorithm therefore strongly determines the results and should not be taken lightly. Yet comparing methods is subject to interpretation and there is no definite true or false, which means the choice of algorithm can be seen as more of an art than a rational decision. Some attempts in this direction have been made, as will be discussed in the theoretical framework. This paper will therefore attempt to give an overview of the different available methods and how they compare to each other. By comparing them, it will be determined which algorithm is best suited for a certain case, and specifically how well the algorithms scale with the number of features and the size of the data sets. In order to do this, the paper will try to answer the following question:


Before answering this question, a discussion of previous work within this field is presented. Besides this, a theoretical framework and a problem statement are given in the section labelled 'Theoretical Framework'. After that, the 'Methodology' section discusses the different elements used and describes the coding and clustering as executed. The 'Results' section shows all results in tables to provide an inclusive overview of the findings; each table is followed by a short discussion of the results. Based on these results, the 'Conclusion' section deduces the answer to the research question stated before. Finally, all other findings and proposals for future research are discussed in the 'Discussion' section.

2 Theoretical Framework

In this section an attempt is made to create an overview of what clustering is and the depth it has. First, to get acquainted with clustering, the generic step process is analyzed. Second, different clustering techniques are examined; a wide range of techniques is available and each has its own merits and drawbacks. In order to fully understand the different algorithms available within these techniques, an example is given. Lastly, some papers that have attempted and executed a similar comparison of clustering are considered. From these, some inspiration was gained, and certain procedures that give a good outcome can be reproduced. Furthermore, this section gives an idea of how far the scientific world has come with comparing algorithms and further proves the relevance of the problem at hand.

2.1 Clustering

Cluster analysis is an important form of analysis and very relevant today. As stated by Kaufman and Rousseeuw (Kaufman and Rousseeuw, 2009), from childhood onward individuals divide concepts into groups: dogs and cats, male and female, but also more complex concepts in science. Classification, whether based on structured or unstructured data, is relevant and present in numerous fields of work today.

This is further shown by the research of Jain et al. (Jain et al., 1999). This study looks comprehensively at data clustering and concludes that clustering can be an elegant way of recognizing patterns and collecting results. Success boils down to choosing the right algorithm, but also depends on the data that is used. As Jain et al. recognize, clustering is typically applied to image retrieval, object recognition, document retrieval or data mining in order to get the expected results. It is therefore important for the researcher to make careful design choices in order to get the desired results.

Jain et al. used different steps in order to execute a cluster analysis (Jain et al., 1999). As stated in the research a cluster analysis consists of the following four steps:

1. Pattern representation

2. Similarity computation

3. Grouping process

4. Cluster representation

The first step is pattern representation, which is often skipped by researchers as it is the most complex step in the process of clustering. In this step the researcher must determine the important and relevant features of the problem at hand. In the case of small data sets this can be done by relying on previous experience, but with larger data sets the researcher has to run algorithms to determine these features. This process can be expensive and time consuming, hence the step is often skipped. The next step in the process is similarity computation, where the researcher takes the patterns found by the algorithm and compares them based on explicit or implicit knowledge. In this step the researcher attempts to differentiate between the patterns. The third step is the grouping process, which roughly consists of two methods: 'hierarchical schemes', which are more versatile and therefore give better results in more complex cases, and 'partitional schemes', which are less expensive and better suited for bigger problems when the cost of the solution matters. The fourth and final step in the process is cluster representation, where the researcher presents the solution in an understandable way. Graphs are often used for this step as they make the clusters easy to represent.

2.2 Cluster analysis techniques

The chosen cluster analysis technique determines the results and whether or not the question can be answered. It is therefore important to choose the right technique. Aggarwal and Reddy and Fahad et al. recently created an overview of the techniques available today (Aggarwal and Reddy, 2013)(Fahad et al., 2014). Table 1 below presents these techniques along with an example of an algorithm within each technique.

The first of these techniques is clustering based on probability models, a technique unlike most others (Bock, 1996). Within this technique, classification is performed based on the underlying probabilities of the chosen features. Based on these probabilities each data point is classified and clusters are created. This technique is flexible, as the researcher can mix and match models until the best suited model is found (Topchy et al., 2004)(Biernacki et al., 2000). An example of a probability model is the expectation-maximization (EM) algorithm, as presented in Table 1. The EM algorithm is an iterative technique consisting of two steps per iteration. The first step estimates the log-likelihood based on the current values of the parameters; a probability distribution is made over the function that comes out of this step. The second step maximizes these probabilities by adjusting the function and recalculating the parameters. The recalculated parameters can then be used for another iteration, which can be repeated as many times as desired (Do and Batzoglou, 2008)(Moon, 1996).

The next two techniques described by Aggarwal and Reddy are partitional and hierarchical techniques, presented in the second and third row of Table 1. Partitional algorithms have long been the usual go-to algorithms, as described by Wilson et al. (Wilson et al., 2002). These algorithms take the input and features and create different partitions from these parameters. Afterwards, the partitions are compared based on a function specific to the partitional algorithm at hand; the functions that such an algorithm tries to minimize are generally objective in nature (Jin and Han, 2010). These partitions consist of K clusters, which can be pre-determined or chosen by the algorithm. The requirements that must be met are that each point is contained in exactly one group and that each cluster has at least one data point. The best known algorithm in this technique, which has existed for more than 50 years, is the k-means algorithm (Jain, 2010). The k-means algorithm takes the desired number of clusters as input and then creates P partitions out of N data points. Every partition is given k cluster centers based on the given input. Afterwards, every data point is considered and added to the cluster center closest to it, usually based on the Euclidean distance. This is done for every partition before the partitions are compared. The objective is to minimize the sum of squared distances of every data point to the center of its cluster. The partition with the lowest sum is then chosen as the solution for this instance.

Hierarchical algorithms repeat a process multiple times. This process can either create more clusters (divisive) or fewer clusters (agglomerative), depending on the question at hand. This is done by calculating the distance between the existing clusters and then adjusting the clusters based on this calculation. Distances often used are the Manhattan and Euclidean distances for numeric values and the Hamming distance for non-numeric values. The process starts either with one cluster or with a cluster for each data point. Next, the distances between the clusters are calculated, or the optimal division is based on the largest distance possible. This process is then repeated until no further merges or splits are possible. Hierarchical algorithms result in a dendrogram, a graph of the different layers of clusters that are created. An obvious difference between hierarchical and partitional techniques is that the partitional approach aims for a specific number of k clusters, whereas the hierarchical method creates either more or fewer clusters and the researcher can choose a specific step in the layering.

The next technique is density based clustering, presented in the fourth row of Table 1. What sets density based clustering apart from other techniques is that outliers do not have a severe influence on the formed clusters. The method is based on the density of the data objects: clusters are contiguous regions of high density separated by low density areas (Kriegel et al., 2011). The idea of density based clustering is built on how humans differentiate concepts and regions within imagery. When shapes interrupt or flow through each other, the classic algorithms cannot follow this dynamic and try to divide it into separate clusters. Density based clustering, however, can recognize these regions by analyzing their density and transforming them into separate clusters. This means that the number of clusters found at the end is not pre-determined. One of the best known examples of density based clustering is DBSCAN (Ester et al., 1996). DBSCAN divides all data points into three categories. The first category is 'core points', points that have a certain minimum number of other points within a given reach. This reachability is determined via a function chosen for the instance at hand. The next category is 'density reachable points', points that are not themselves located within the given reach of enough other points but are connected through a core point. The third category is 'outlier points', points that are not in reach of other points and are therefore disconnected from any dense region. Based on these characteristics the classification of points is performed. Through this diversification a certain number of clusters is obtained and the outliers are filtered out, as they do not belong to any cluster.

The last technique is grid based clustering, presented in the fifth row of Table 1. This technique is somewhat similar to density based clustering, as it also looks at density. However, there is a significant difference in approach, as described by Grabusts and Borisov (Grabusts and Borisov, 2002). With a grid based clustering approach the researcher first divides the data set into a number of cells that together form a grid. Afterwards, the density of each cell is calculated based on the chosen algorithm. The cells are then sorted by their densities. This constructed hierarchy is analyzed and the algorithm tries to find cluster centers. Afterwards, the cells are divided between the cluster centers to create a number of clusters. STING is an example of an algorithm within the grid based approach. In STING, the data points are first divided into layers, which are in turn divided into a grid of regions containing data points. The first layer is then chosen, and for each cell in this grid a probability is calculated that determines whether the region is relevant or not; every cell is labeled accordingly. Next, the relevant regions are taken through the process again, which is repeated until the bottom layer is reached. A few relevant regions are filtered out, which then either answer the query or not. If they do not, the data points in these regions are further processed with the above steps. If the query is answered, the regions are returned as a result and the data is clustered.

Table 1: Clustering techniques

Clustering technique        Example
Probability models          Expectation-Maximization algorithm
Partitional algorithms      K-means algorithm
Hierarchical algorithms     Ward hierarchical clustering
Density based clustering    DBSCAN
Grid based clustering       STING

2.3 Cluster analysis comparisons

Small scale trials to compare clustering techniques have been conducted. Steinbach, Karypis and Kumar, for example, compared bisecting K-means with hierarchical agglomerative techniques, both quite traditional in nature (Steinbach et al., 2000). For this comparison they used two measures, namely the F-measure and entropy. Entropy is widely used across different scientific fields; the entropy function determines the 'goodness' of a cluster, in other words, to what extent the cluster accurately reflects the question at hand, as explained by Sripada and Rao (Sripada and Rao, 2011). The other measure, the F-measure, is usually employed as a combination of recall and precision (Hripcsak and Rothschild, 2005), but can be extended to measure the effectiveness of the clustering in a hierarchical technique.

The difficulty with comparing clusterings is the subjective nature of clusters and their dependence on the individual goals underlying the clustering. Rand therefore determined certain objective criteria to aid in this process (Rand, 1971). By looking at objective measures, the conclusions and comparison gain a more scientific basis. The first criterion is to what extent the algorithm uncovers the inherent structure of the data. If a technique can determine the structure of a data set, it will understand the data better, which usually leads to better clustering. The second criterion is to what extent resampling affects the clustering. If a clustering technique shows different results when resampling is done, the results of one clustering cycle will be unreliable, weakening any claims based on that clustering. The last criterion Rand stated concerns the handling of new data. If new data is added and the clustering differs vastly from the previous one, the clustering is sensitive, which again makes it unreliable. This, just as with the second criterion, weakens any claims made and must therefore be taken seriously into account.

Another interesting comparison has been made by Abbas et al. (Abbas et al., 2008). In their paper the authors present an objective overview of several algorithms: K-means, HC, SOM and EM. This selection was made to capture some of the diversity within the cluster analysis field. These algorithms were compared based on four factors: the size of the data set, the number of clusters, the type of data set and the type of software. These factors were then given different values and compared to one another to get an overview of the algorithms' performances. For the size of the data set, the researchers differentiated between huge and small data sets to look at the scalability of the algorithms; the data sets used consisted of 36,000 and 4,000 data points respectively. For the number of clusters, the researchers tried different values; because of their nature, the algorithms were able to create a given number of clusters, which made the comparison easy to achieve. The cluster counts used were 8, 16, 32 and 64. As for the type of data set, the researchers chose, for each algorithm, an ideal data set based on the type and characteristics of that algorithm, as well as a random data set. With this comparison the researchers evaluated the performance of each algorithm in new situations and in situations where the algorithm should shine; the number of clusters used in this endeavour was always 32. For the last factor, the researchers used different software packages to run the algorithms, namely the LNKnet Package and the Cluster and TreeView Package, which proved to make no difference.

A new approach was taken in the research of Fahad et al. (Fahad et al., 2014), which looked specifically at the application of clustering techniques to big data. This approach is thought of as promising for the future (Manyika et al., 2011). In this research five different algorithms were used that span multiple different techniques as described in the previous paragraphs. These five algorithms were Fuzzy-Cmeans, Birch, Denclue, Optimal Grid and EM. Fahad et al. chose to use eight different data sets with different characteristics. The differences between these sets were predominantly the number of instances within the data set, the attributes these instances had and the number of classes present in the data set. The foremost aim of this research was to test the different algorithms thoroughly. In the evaluation, different metrics were employed to capture more aspects of the algorithms. First, compactness and separation were used to determine how 'good' the clusters were. Furthermore, the Davies-Bouldin Index was applied to measure the ratio of within-cluster to between-cluster distances. Another index that was used was the Dunn Validity index, measuring both the compactness of and the separation between individual clusters. The second to last index was correctness, which uses the correct classification to determine the accuracy of the clustering. The last index was the Adjusted Rand Index, which compares instances that are in the same cluster and instances that are part of different clusters. Lastly, the quality of the clustering was determined through the Normalized Mutual Information formula. As shown in this research, numerous measures are available to achieve a complete overview of the data and algorithms used in a clustering.

Looking at the previous paragraphs in this section, several things can be deduced. The different steps in clustering are all vital in carrying out a good clustering process. These steps are fixed for every type of clustering, which creates a consistent context and a solid foundation in each situation. A wide range of techniques exists within clustering. By choosing algorithms that correspond to each of the techniques and differentiating between them, one can create a reasonable and complete overview of what is possible. These techniques differ so vastly in nature that correspondingly different results are expected. All of these techniques have advantages and disadvantages, which means that testing the full range of techniques should be attempted. From the comparisons done to date, quite a broad range of usable metrics can be deduced, each allowing a different aspect of the algorithm to be tested. For subjective evaluation, elements that could be looked at are the density and definition of the clusters. Objective metrics often require the classification to be known, which is normally not the case within clustering; objective measures are still possible by choosing data sets that are meant for classification. Important for a good comparison is to vary as many variables as possible within the data sets, either in size or in the number of features a data set has. By changing these variables one can create different contexts in order to mimic 'real life' scenarios as accurately as possible.

3 Methodology

The goal of this paper is to create an exhaustive overview of the available clustering methods and to compare them based on measures. It is therefore important to test as many variables as possible while remaining concise and clear. The present section outlines in detail the different steps taken in the process of obtaining the results and motivates certain choices. Furthermore, several pieces of code are given to provide a thorough understanding and improve reproducibility. The present research follows the steps discussed by Jain et al. in order to get as close to a real clustering process as possible (Jain et al., 1999). This section first outlines the different data sets and discusses their relevance. It is important to consider different types of data sets because their nature can vastly affect the results of a given algorithm, as was already partially shown by Fahad et al. (Fahad et al., 2014). Then, the algorithms used to cluster the different data sets are presented. These algorithms cover a broad range of clustering techniques, as discussed in the theoretical framework. Lastly, the evaluation and the way it was controlled during the process are discussed. The types of evaluation are motivated by the theoretical framework and previous work; additional motivations are provided where needed.

3.1 Data sets

The choice of data sets is a vital element in the process, as it determines the results and the contexts that are tested. In the present research, an effort was made to select data sets that outline the different scenarios that can occur in clustering. It was attempted to remove most external influence on the results by creating as many contexts as possible. The chosen data sets can be characterized by two variables in order to create an overview: the data set's size and the number of features used in the data set. Data set sizes are partitioned into three classes ranging from small to large. Small data sets contain up to 10,000 data points, medium data sets range between 10,000 and 100,000 data points, and large data sets have more than 100,000 data points and can go up to millions. Using these classes, the influence of size on an algorithm's performance can easily be examined. This variable is becoming more and more important as big data becomes relevant and gathering large amounts of data has never been easier. The second variable concerns a data set's features. The features determine what a data point consists of, and the algorithm clusters based on these features. Features can be anything, and their impact is determined by the algorithm. For this variable, two categories exist: few features, meaning up to ten per data point, or more than ten features per data point. Performance is affected by these features, and it is important to look at their influence, as their number can drastically change and impact the process.

In choosing data sets, one can either select simulated data or find sets of real-life data. In the present research the latter option was chosen, as this mimics a real-life clustering situation best. Furthermore, the data sets originate from different origins and types, ranging from biology to image recognition, to span a greater range of subjects. The data sets were retrieved from the UCI Machine Learning Repository, a database of data sets taken from real-life situations. Each data set is typified by different variables: a preferred task for the data, the size, the features and a brief description of the data. For the comparison in the present research, data sets with a classification task were chosen. This is ideal for a comparison, as the correct class is given along with the data. This, in conjunction with determining the number of clusters to create, opens up the use of many indices that require both the predicted and actual labels, thus increasing the grounds and depth of the comparison. Six data sets were chosen from the UCI directory, each with differing variables. These six data sets are discussed and explained in the following sections.

3.1.1 Banknote authentication

The first data set consists of images taken from either genuine or forged banknotes. This divides the data set into two distinct classes and therefore asks for two clusters. The images taken are 400 by 400 pixels and have been turned into four features that describe each image numerically. Colour did not play a part in the process, as some of the images were gray-scale, which would otherwise have influenced the process.


For the feature extraction of the images a Wavelet transform tool was used. This is a widely used method for describing and compressing images. The idea is to turn the image into wavelets that are described by certain features. By looking at these features and the differences between the different wavelets, an image can be reconstructed with only a few values, thus describing the differences between images in a rather efficient way. The features extracted from the banknote images are skewness, variance, kurtosis and entropy. Skewness determines how asymmetric the wavelet is, variance how much the wavelet deviates from the mean, kurtosis how pronounced the highest peaks of the wavelet are, and entropy how busy the wavelet, and thus the image, is. The data set has a total of 1,372 data points and can thus be considered a relatively small data set.

3.1.2 Htru2

The second data set used for the comparison is called Htru2. It contains data about pulsars: neutron stars that emit electromagnetic waves in pulses. Interesting to note is that every pulsar has its own unique wave and can thus be identified. The generated waves can be picked up by large radio telescopes, which makes detection easy on the one hand, but hard on the other, as other radio signals are picked up as well. It is therefore important to be able to identify which waves truly come from pulsars and which do not, a problem that can be solved by classifying the signals. This data set has eight features in total to separate the real pulsars from the 'fake' signals. The first four features are statistics of the pulse profile of the wave, and the second set of four features describes the DM-SNR curve obtained from the radio signal. The set contains a total of 17,898 samples of radio signals, of which 1,639 are real pulsar waves and 16,259 are 'fake' signals. The data set is considered a medium data set in this particular research.

3.1.3 Skin tone

The third data set is about skin segmentation and the pursuit of differentiating real skin tones from non-skin tones. The data set has three features, one per colour channel: the Blue, Green and Red values that together form a colour. The data comes from both the FERET and PAL databases, and the real skin colours were gathered from a diverse group of people differing in age, race and gender. The data set contains a total of 245,057 data points, of which 50,859 are real skin colours and the other 194,198 are non-skin data points.

3.1.4 Spam detection

The fourth data set is about spam detection, which is as relevant as ever with the increase of internet traffic. The data set contains information based on the text inside emails, with features created from that information. Through these features, the algorithm has to determine whether an email is spam or not. There are 57 features for each data point. The first 48 features are specific words that are common in emails; each value is the percentage of occurrences of the word relative to the total number of words in the email. The next six features are specific characters, whose occurrence is analyzed in the same fashion as the words. The last three features look at the occurrence of capital letters, recording respectively the average length of runs of capital letters, the longest run of capital letters and the total number of capital letters in the email. The set contains a total of 4,601 data points, of which 1,813 are real spam, collected from postmasters and individuals who filed emails as spam, and 2,788 are normal mail, taken from conversations between colleagues and the like.

3.1.5 Driving motion

The fifth data set contains data about drive diagnosis, based not on sensors but on features derived from the motor itself. These motors have certain defective components and can therefore be classified into 11 different types of motor. The motors were observed under twelve different operating conditions, varying in speed as well as load moments and forces. These conditions were then measured and a total of 48 features were extracted. For each measurement, the phases are considered, and certain statistical attributes such as mean, skewness and kurtosis were recorded and stored in the features. The set contains a total of 58,509 data points, which are, as mentioned before, divided into eleven types of motor.

3.1.6 Forest recognition

The sixth and final set selected for this study is a cover type set, in which different forests are divided into clusters based on observations of these forests. This is done via observations of 30 by 30 meter cells, which are given features. The forests are divided into seven different types of vegetation. The features are very diverse in order to make a clear distinction between the possible forest types: they include the elevation in meters and the slope of hills, but also the shading of the cell and the type of wilderness and soil the cell has. In total, every data point has 53 features to make the distinction, with very diverse data types ranging from colour indices to binary classification features. Important to note is that the cells contain as little human intervention as possible, so that the cells are as 'true' to the cover type as they can be. The data set contains a total of 581,012 data points and is the largest set used in this particular experiment.

3.1.7 Data preparation

Every data set used was prepared so that it could be used along with the other modules in an efficient and easy way. For this the pandas module was used (McKinney, 2011). Pandas allows the user to easily load data inside a DataFrame and adjust the values as needed. This DataFrame can then be used along with the scikit module to apply the techniques and evaluate the clustering. Below, the data preparation is shown as it was applied to every data set retrieved.

The first step in the process is to retrieve the file and transform it into a DataFrame. In the present research this was done via the from_csv function, in which the separator is set to a tab. The next step is to assign names to the columns based on the values. The last column was named Target to correspond to the correct labels. From this column a new DataFrame is created, and the Target column is dropped from the original DataFrame in the final line. Lastly, the Target DataFrame is adjusted so that ones become zeroes and twos become ones, as the labeling in the scikit module starts at zero; this prevents further problems with comparing.
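The original listing was lost in extraction; a minimal sketch of the preparation described above, assuming a tab-separated file and illustrative column names (modern pandas exposes this as read_csv rather than the deprecated from_csv), could look like this:

```python
import pandas as pd

# Load the tab-separated file into a DataFrame; the file name and
# column names are illustrative, not those of a specific data set.
df = pd.read_csv("banknote.txt", sep="\t",
                 names=["variance", "skewness", "kurtosis", "entropy", "Target"])

# Split off the correct labels and drop the Target column
# from the feature DataFrame.
target = df["Target"]
df = df.drop("Target", axis=1)

# Remap the labels so that ones become zeroes and twos become ones,
# since scikit-learn's labeling starts at zero.
target = target.replace({1: 0, 2: 1})
```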

3.2 Algorithms

Picking the right algorithms is important in order to test every aspect of clustering. It is important to choose algorithms that together cover the different techniques discussed in the theoretical framework. Besides that, the chosen algorithms must not be too specific in their usage, as the attempt is to make a general overview. Lastly, it is important to keep the implementation in mind, which can also be an important factor in choosing an algorithm. In the present research, the scikit-learn module is used for the implementation (Pedregosa et al., 2011). This is a widely used module for statistics and machine learning within Python. The module contains all types of statistical implementations, including a wide range of clustering techniques. Furthermore, the module is very well documented and also includes evaluation indices, which helps tremendously in evaluating the found results and comparing the algorithms.

3.2.1 Gaussian Mixture

The first algorithm used on the data sets in this study is the Gaussian Mixture model. As can be deduced from its name, this algorithm is of the probability model type, presented in the first row of Table 1. The model assumes that all data points come from Gaussian distributions whose parameters are unknown. This model uses expectation-maximization (EM) as the algorithm to create the model and fit it to the training data. As said before, EM uses a two-step method, where the first step generates expected values that are then maximized and used in the next iteration. The goal is to maximize the likelihood of the parameters in the model, which can be iterated over many times. Important to note is that the algorithm can converge to a local optimum, while remaining relatively fast. After the model has been created based on the given parameters and the outcome of the EM algorithm, which can be altered as well, the model can be used to cluster test data.

What makes the method unique compared to plain EM is that it allows parameters to be passed in to shape the clustering process. It allows for softer or fuzzier clustering, which means that data points can belong to multiple clusters: points are not hard-assigned to a cluster but rather given a score expressing the chance of belonging to each cluster.
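As a hedged illustration (the exact parameters used in this study are not shown; n_components=2 assumes a two-class data set), the scikit-learn implementation can be applied as follows, with df being the prepared feature DataFrame from the sketch above:

```python
from sklearn.mixture import GaussianMixture

# One mixture component per expected cluster; random_state fixes
# the EM initialization for reproducibility.
model = GaussianMixture(n_components=2, random_state=0)
labels = model.fit_predict(df)

# The soft side of the clustering: per-point probabilities of
# belonging to each component.
probabilities = model.predict_proba(df)
```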

3.2.2 Mini Batch K-Means

The second algorithm used was a variant of the K-means algorithm, namely mini batch K-means. This choice was made because the mini batch variant has improved computation time while not losing much accuracy (Sculley, 2010). As with the standard variant of K-means, the goal is to reduce the inertia, or within-cluster sum of squares. The algorithm alternates two steps: the assignment step

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \;\forall j,\ 1 \le j \le k \right\}$$

and the update step

$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$

In the assignment step, every data point x is assigned to one of the given k clusters, whose centers are initially randomly placed. This assignment is based on a distance, often the squared Euclidean distance between the point x and the cluster centre m. Every point is assigned to exactly one set S of the partition. Afterwards the update step is performed, where, based on the new assignments, the means of the clusters are recalculated to create new cluster centers, which are then fed into the first step again until convergence.

The algorithm differs from the original K-means in that the input data is divided into randomly sampled subsets. These subsets, the mini batches, are then used individually to decrease the computation time. The first step in the process is to draw a certain number of samples to form a mini batch, which are then assigned to the nearest centroid. The second step is to update the centroids, which is done per mini batch: every centroid is updated based on the streaming average of the specific batch used in the iteration. This decreases the movement of the centroids and therefore improves the computation time. The algorithm runs until it has reached convergence, which can be a local optimum, or until the given number of iterations is reached.
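A minimal scikit-learn sketch (the batch size and cluster count are illustrative, not the values used in this study):

```python
from sklearn.cluster import MiniBatchKMeans

# Each iteration draws a mini batch of batch_size samples and updates
# the centroids with a streaming average over that batch.
model = MiniBatchKMeans(n_clusters=2, batch_size=100, random_state=0)
labels = model.fit_predict(df)
```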

3.2.3 Ward hierarchical clustering

The Ward method is used in hierarchical clustering, which means that one starts either with every sample as its own cluster or with one single cluster, and keeps iterating until the desired number of clusters is found or no further steps are possible. The steps are taken based on a certain objective function that the algorithm tries to minimize or maximize. The Ward method provides such an objective function and is typified as the minimum variance method. The goal of this function is to determine the variance within the clusters and then minimize it. The variance can be calculated using numerous distance metrics. Effectively, the method determines the error sum of squares and tries to minimize it. At each step, the method calculates what happens to the function if clusters are merged or divided, and takes the step that minimizes the function at that point.


In the present study the Euclidean distance is used, a widely applied distance metric. It is often employed because it is very lightweight while giving relatively strong results. In this research the squared Euclidean distance is used for calculating the minimum variance. This distance can be denoted as follows:

$$d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2$$

In the method the distance between two points is calculated and kept in a matrix d. The distance metric can be anything but is often the Euclidean distance. Afterwards the matrix d is considered, and the data is divided into more or fewer clusters depending on the type of clustering. This division is based on the calculated distances, and after the new assignment the steps are repeated until a certain number of clusters is reached or no more steps are possible.
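A minimal scikit-learn sketch; Ward linkage implies the squared Euclidean distance criterion described above:

```python
from sklearn.cluster import AgglomerativeClustering

# Agglomerative clustering with Ward linkage: merges are chosen to
# minimize the increase in within-cluster variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(df)
```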

3.2.4 DBSCAN

Another technique is the DBSCAN method, which falls into the density based clustering algorithms. The name stands for density-based spatial clustering of applications with noise, and the method looks at the density of the data points. The algorithm takes two parameters: the minimum number of points required to define a cluster, and eps, the maximum distance for two points to be considered 'in the same neighbourhood'. These parameters are used to assess each point, after which it is determined into which category that point falls: a core point, a reachable point or an outlier. A point is considered a core point if at least the given minimum number of points lies within distance eps of it. Every cluster has to have at least one core point and is formed by every point, core or reachable, that is reachable from this core point based on the given distance metric. As with many other algorithms, the Euclidean distance is used here to calculate the distance. The algorithm first fully explores a neighbourhood or cluster before moving on to a not yet labeled point and starting a new cluster, thus creating a number of clusters that is not known in advance. Important to note is that the algorithm often revisits points it has already seen, which increases its running time.
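A hedged sketch (the eps and min_samples values are illustrative; suitable values depend on the data set):

```python
from sklearn.cluster import DBSCAN

# eps is the neighbourhood radius; min_samples the number of points
# within eps needed for a point to count as a core point.
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(df)

# Points labeled -1 are outliers; the number of clusters is found
# by the algorithm rather than given in advance.
n_clusters = len(set(labels)) - (1 if -1 in set(labels) else 0)
```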

3.2.5 Birch

Birch is a unique method because it uses a tree-like data structure to cluster the data. The algorithm proceeds through four phases to achieve its results. The first phase consists of creating a CF-tree out of the available data points. CF here means Clustering Feature, a triplet consisting of the number of data points, their linear sum and their squared sum. These features are organized based on the branching factor B and the threshold T. In the tree structure every node has at most B entries, where each entry is formed by a CF and a child node. These child nodes consist of CFs themselves, constrained by a certain amount, thus resulting in a tree of CFs. In the second phase the algorithm looks at the created tree and tries to create a smaller tree by cutting outliers and grouping CFs that are very similar. The third phase consists of applying an agglomerative clustering algorithm to cluster the leaves, or child nodes, of the CFs. By compressing the clusters here, the algorithm achieves the number of clusters given as a parameter. Afterwards, phase four can be applied, which examines the found clusters and looks for errors by considering the 'seeds' of the clusters and redistributing the data.

An important note regarding this algorithm is that every subcluster holds information that improves the memory usage: the number of samples, the linear sum, the squared sum and the centroids of the samples within the subcluster, and the squared norm of the centroids. By using these quantities, the calculation of the radius of a subcluster is improved, as only these specific attributes have to be held in memory while calculating.
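A minimal sketch; the branching_factor and threshold shown here are scikit-learn's defaults, not necessarily the values used in this study:

```python
from sklearn.cluster import Birch

# branching_factor bounds the number of CF entries per tree node (B);
# threshold bounds the radius of a subcluster (T).
model = Birch(n_clusters=2, branching_factor=50, threshold=0.5)
labels = model.fit_predict(df)
```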

3.2.6 Affinity Propagation

The next algorithm is affinity propagation, which is unique in its calculation. The algorithm sends messages between points to establish the connection between those points, and based on this the algorithm clusters points together. It chooses a function to determine the similarity between points and assigns each point based on those similarities. An example of such a function is the negative squared distance between points. It then applies two steps per message exchange to establish the connection. First, it uses the following formula:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \ne k} \left\{ a(i, k') + s(i, k') \right\}$$

In this formula r is the responsibility matrix and a is the availability matrix. r expresses how well suited a point is to be an exemplar, or clustering point, while the a matrix expresses how likely a point x would choose a point y as its exemplar. s in this formula contains the chosen similarity, as discussed before, which could be the negative squared distance. The second step is formed by the following formulas:

$$a(i, k) \leftarrow \min\left(0,\ r(k, k) + \sum_{i' \notin \{i, k\}} \max(0, r(i', k))\right)$$

and

$$a(k, k) \leftarrow \sum_{i' \ne k} \max(0, r(i', k))$$

In this step the algorithm reassigns the availability matrix based on the new responsibility matrix found in step one. Both steps are then repeated until no more changes occur for a number of iterations, or until a pre-determined number of iterations is reached.
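A minimal sketch; scikit-learn's implementation uses the negative squared Euclidean distance as the default similarity, and damping smooths the message updates between iterations:

```python
from sklearn.cluster import AffinityPropagation

# No number of clusters is given; it follows from the exemplars the
# message passing settles on.
model = AffinityPropagation(damping=0.5)
labels = model.fit_predict(df)
n_clusters = len(model.cluster_centers_indices_)
```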

3.2.7 Mean Shift

The core idea of mean shift is already present in its name: it is based around the centroids, or 'means', of the clusters. The algorithm uses the following formula:

$$m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)}$$

In this formula K is a kernel that determines the weight of a point, for example a Gaussian weighting. N in this context is the neighbourhood: all the points for which K is not equal to 0. The formula produces m, the weighted mean of the region, after which x is replaced by the newly found m and the process is repeated until convergence. The interesting thing about this algorithm is that it only looks at a region around the point it considers as a clustering point, which makes it rather efficient, as it does not use the whole data set in each iteration.
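A minimal sketch; estimate_bandwidth derives a kernel width (the size of the region N(x)) from the data itself, and the quantile value is illustrative:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# The bandwidth fixes how large a region the kernel K looks at.
bandwidth = estimate_bandwidth(df, quantile=0.2)
model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(df)
```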

3.2.8 Spectral Clustering

The last algorithm used is spectral clustering (Ng et al., 2002). The first step in this process is to create an affinity matrix, which establishes the similarity between points. This matrix is created by a certain technique, usually a radial basis function, which can be based on the Euclidean distance for example. After this matrix has been created, the algorithm uses k-means to create the clusters and builds a graph of the data points. This graph is then considered and a normalized cut problem is applied: the algorithm tries to cut an edge inside the graph such that the weight of that edge is outweighed by the remaining edges within the graph. Based on this, the algorithm tries to refine the clustering even further and eventually produces the final clustering.
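A minimal sketch; affinity="rbf" builds the affinity matrix with a radial basis function kernel, and assign_labels="kmeans" runs k-means on the spectral embedding, matching the description above:

```python
from sklearn.cluster import SpectralClustering

model = SpectralClustering(n_clusters=2, affinity="rbf",
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(df)
```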

3.3 Evaluation metrics

After applying the algorithms to the data sets, metrics have to be used to determine how 'good' each clustering was. It is important to involve as many metrics as possible, with a broad range of evaluation factors, to get a complete picture of the clustering. As with the algorithms, the scikit module was used to apply the metrics to the found labels, and because the present study had access to both the data sets and the known correct classifications, a broad range of the metrics available in the module could be used.

The choice of these specific metrics was based on both the theoretical framework and the desire to provide validation within the metrics. The theoretical framework shows that more metrics give a better overview in general, specifically because they allow the process to be evaluated from more angles. Most metrics require that the classification of a data set is known, which holds true in this research: because the classifications were indeed known, many metrics could be used and the angles of evaluation increased. Based on this, it was attempted to cover as many angles as possible in this paper. Something else that must be taken into account is the density and definition of clusters. Next to these guidelines, it was also attempted to build validation into the metrics where possible: some of the metrics test roughly the same thing. This helps to eliminate randomness as much as possible and also makes for a better comparison between the metrics. To further validate the comparisons, some metrics are used that directly influence each other and give more information on how the clustering proceeded and what its strong points were. Lastly, general metrics such as memory and time were added to test general performance, as not only the resulting clustering matters but also the cost of the process.

3.3.1 Time and Memory

The first two metrics are not necessarily clustering specific, but can be important factors in choosing which algorithm to use. The first is time, which is becoming more relevant with the increase of big data applications and the increasing availability of data. For the tracking of time, the standard Python module time was employed. By starting a timer before fitting the data and stopping it after the fitting, every algorithm can be timed in the same way. The whole process of starting the algorithm and assigning the labels was timed, as this is a constant factor for every algorithm.

Besides time, it is also important to look at the memory usage of an algorithm. As the data grows, an algorithm needs more memory to perform the clustering, and by tracking the memory usage the scalability of the algorithm can be predicted. This can be essential in choosing the right algorithm, as one must consider both the available memory and how much the algorithm needs. For the tracking of memory, a memory profiler, specifically a line-by-line profiler, was installed and used. Through this module, the memory usage of each line can be tracked, so it is accurately known how 'costly' the algorithm is. Below is an example of memory tracking, where the increment shows how many MiB the creating and fitting of the algorithm adds.
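The original example was included as an image; a minimal sketch of both measurements, assuming the memory_profiler package (whose @profile decorator prints a per-line report of memory usage and increments in MiB), might look like this:

```python
import time
from memory_profiler import profile
from sklearn.cluster import MiniBatchKMeans

@profile  # prints per-line memory usage; the increment column shows what each line adds in MiB
def cluster(df, n_clusters):
    start = time.time()                     # start the timer before fitting
    model = MiniBatchKMeans(n_clusters=n_clusters)
    labels = model.fit_predict(df)          # creating and fitting: the tracked increment
    elapsed = time.time() - start           # stop the timer after the labels are assigned
    return labels, elapsed
```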


3.3.2 Adjusted Rand Index

The first metric from the module is the Adjusted Rand Index, a slight adaptation of the normal Rand Index. The Rand Index itself requires both the known and the found classifications and uses the following formula:

$$RI = \frac{a + b}{C_2^{n_{\text{samples}}}}$$

In this formula, C equals the number of all possible pairs in the data set, a equals the number of pairs that are grouped together in both the correct classification C and the found clustering K, and b equals the number of pairs that are placed in different sets in both C and K. With this formula the values range between 0 and 1, where 1 is a perfect score and 0 is no match at all. The Adjusted Rand Index is corrected for chance and transforms into the following formula:

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

The expected Rand Index E[RI] is subtracted from both the Rand Index and the maximum Rand Index, and the former is then divided by the latter. The values thus range from -1 to 1, where -1 indicates bad labeling, 0 random labeling and 1 perfect labeling. This way more can be concluded from the found values and a better comparison is achieved. Furthermore, the formula does not assume any structure for the data and is thus very suitable for comparing clusterings whose structures differ from each other.
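A small illustration of this behaviour using scikit-learn's implementation (the label vectors are made up):

```python
from sklearn.metrics import adjusted_rand_score

# Identical grouping under different label names still scores 1.0 ...
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0

# ... while a grouping that is worse than chance scores below 0.
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1]))   # -0.5
```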

3.3.3 Adjusted Mutual Information

The second metric is similar to the previous one in that it also adjusts for chance and has the same value range with the same meaning. The Adjusted Mutual Information is, as its name suggests, an adjustment of the Mutual Information formula. That formula takes into account the entropy and a contingency table with both the known and the found clustering labels. Based on this table, it calculates the entropy of both sets of labels and uses that to calculate the performance of the clustering, which is bounded by the two entropies. The Adjusted Mutual Information changes this range to 0 and 1 and adjusts for chance in the same way as the Adjusted Rand Index, as can be seen from the formula:

$$AMI = \frac{MI - E[MI]}{\max(H(U), H(V)) - E[MI]}$$

Here MI is the found Mutual Information and the maximum is determined by the two entropies H(U) and H(V). The expected index is subtracted from both, and the two are divided to obtain the stated range. This way the values of different data sets and labelings can be compared, as the score is both normalized and adjusted for chance. The similarity between the first two metrics also shows in their advantages: this metric likewise has no problem with differing clustering structures.

3.3.4 Homogeneity, completeness and V-measure

The following three metrics influence each other and together say something about the execution of the clustering. Homogeneity determines whether each cluster contains only members of a single class, which would mean a perfect clustering in that respect. Completeness looks at the opposite: whether all members of the same class are assigned to the same cluster. The V-measure takes both values and calculates their harmonic mean. By looking at these values one can determine where the clustering went wrong and evaluate where improvements in the algorithm or the clustering can be made. All three metrics are bounded by 0 and 1.
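A short illustration with scikit-learn (the labels are made up): splitting one class over two clusters keeps homogeneity perfect but lowers completeness, and the V-measure sits in between as their harmonic mean.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Class 1 is split over clusters 1 and 2: every cluster is still pure
# (homogeneity 1.0), but class 1 is no longer complete.
h, c, v = homogeneity_completeness_v_measure([0, 0, 1, 1], [0, 0, 1, 2])
print(h, c, v)  # 1.0, ~0.67, 0.8
```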


3.3.5 Fowlkes-Mallows score

The Fowlkes-Mallows score is a widely adopted metric in statistics for cases where the correct classifications are known. It combines precision and recall, taking their geometric mean, to determine how well the clustering was handled. Precision is the number of correctly classified pairs divided by all pairs predicted to belong together, TP/(TP + FP); recall is the number of correctly classified pairs divided by all pairs that truly belong together, TP/(TP + FN). This yields the following formula:

FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}

Within this formula, TP is the number of true positive pairs, FP the number of false positives and FN the number of false negatives. As with the first two metrics, no assumptions are made about the structure of the clustering, and the score is bound between 0 and 1, just like the AMI.
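A hedged sketch of the scikit-learn call on made-up labels; as with the previous pair-counting metrics, only the pair structure matters, not the label names:

from sklearn.metrics import fowlkes_mallows_score

# A perfect clustering (up to label renaming) scores 1.0...
print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# ...while a clustering with no correctly grouped pairs scores 0.0.
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0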

3.3.6 Calinski-Harabaz Index

The first metric that does not need the known labels is the Calinski-Harabaz Index. It compares the between-cluster dispersion with the within-cluster dispersion to determine how well the clustering separates the data. Important to note is that a higher score is better and that there is no upper limit on the scores. The metric effectively rewards dense, well-separated clusters and penalizes clusterings that overlap. It is calculated with the following formula:

s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}

In this formula, N is the number of samples and k the number of clusters. B_k is the between-cluster dispersion matrix, whose trace Tr(B_k) measures how well the clusters are separated, while W_k is the within-cluster dispersion matrix, whose trace Tr(W_k) measures how dense the clusters themselves are.
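A hedged sketch of computing the index on synthetic data; the data and clustering below are placeholders. Note that the scikit-learn release used at the time spelled the function calinski_harabaz_score, while newer releases use calinski_harabasz_score:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(1000, 4)
labels = MiniBatchKMeans(n_clusters=3).fit_predict(X)
# Only the data and the found labels are needed; no true classes.
print(calinski_harabasz_score(X, labels))  # higher is better, unbounded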

3.3.7 Silhouette Coefficient

The second and last metric that does not need the known labels is the Silhouette Coefficient. It is a rather simple metric that looks at how well the clusters are defined. For each point, it computes both the mean distance to all other points in the same cluster and the mean distance to all points in the nearest other cluster. This yields the following formula:

s = \frac{b - a}{\max(a, b)}

In this formula, b is the mean distance to the nearest other cluster and a is the mean distance to the points inside the same cluster. A major drawback of this metric is that it is very memory heavy: it has to compute pairwise distances between a large number of points, which can also make it rather slow.
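A hedged sketch analogous to the previous one; scikit-learn's silhouette_score averages the coefficient over all samples, and the pairwise distance computation is exactly what makes it expensive on large data sets:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(1000, 4)
labels = MiniBatchKMeans(n_clusters=3).fit_predict(X)
print(silhouette_score(X, labels))  # in [-1, 1]; higher means better-defined clusters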

3.3.8 Implementation

For the implementation, a function was written that calculates all the metrics at once; it can be seen in Figure 2.


Figure 2: Metrics function

In this function, x is the correct labeling and y contains the predicted labels as found by the clustering process; z is the fitted model, which is used for the last two metrics. First, x and y are reordered so that the labels align, which allows the metrics to perform a correct evaluation.
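Since the figure itself is not reproduced here, the following is a minimal sketch of such a function under the stated assumptions. The pair-counting metrics above are invariant to label permutations, so the alignment step is omitted in this sketch, and the feature matrix is passed in for the two metrics that do not use the true labels:

from sklearn import metrics

def evaluate_clustering(x, y, data):
    # x: true labels, y: predicted labels,
    # data: feature matrix for the two label-free metrics.
    h, c, v = metrics.homogeneity_completeness_v_measure(x, y)
    return {
        "ARI": metrics.adjusted_rand_score(x, y),
        "AMI": metrics.adjusted_mutual_info_score(x, y),
        "homogeneity": h,
        "completeness": c,
        "v_measure": v,
        "FMI": metrics.fowlkes_mallows_score(x, y),
        "calinski_harabasz": metrics.calinski_harabasz_score(data, y),
        "silhouette": metrics.silhouette_score(data, y),
    }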


4 Results

This section contains the results of the clustering and the metric scores per data set, as obtained in the present research. The results are presented in tables and discussed in the text below each table. Each table lists every algorithm and has a column for each of the six data sets. The tables are grouped by metric, which makes it easy to compare the algorithms and their performance within a metric. Some values in the tables could not be obtained due to an error: Mem denotes a memory error and NaN a computational fault within the algorithm itself.

4.1 Time and Memory

Below, the two tables with the measured time (Table 2) and memory usage (Table 3) of the different algorithms on the different data sets can be found. Note that time is reported in seconds and memory in megabytes. Some algorithms were not able to complete the clustering of specific data sets, which reflects poorly on their efficiency and scalability. To assess efficiency, both time and memory are taken into consideration.

Table 2: Time (in seconds)

                      Bank     Htru2     Skin        Spam     Drive      Cov
Gaussian Mixture      0.0080   0.1190    1.3130      0.2110   14.1000    60.8650
Mini Batch K-Means    0.1450   0.0320    0.2010      0.0310   0.2690     0.7430
Ward Hierarchical     0.1810   13.0090   Mem         1.0790   861.7290   Mem
DBSCAN                0.0420   0.2950    5.1620      0.2960   34.2610    28.2820
BIRCH                 0.0670   15.1570   Mem         0.9470   2527.1780  Mem
Affinity Propagation  1.2950   842.1470  Mem         57.6290  Mem        Mem
Mean Shift            3.5330   164.3260  19036.2449  29.9190  6406.4210  Mem
Spectral Clustering   3.1439   872.0910  Mem         Mem      Mem        Mem

First, the time is considered. From Table 2, it can be concluded that the performance roughly splits into two groups. The first group, consisting of Affinity Propagation, Mean Shift and Spectral Clustering, took far more time than the second group. On the first data set, for example, these three generally needed more than one second, and up to close to a minute, to complete the clustering. Looking at the larger data sets, with both many and few features, similar patterns exist: the three algorithms in group 1 significantly underperform compared with the other five, once again taking considerably more time or failing to complete the clustering at all. As expected, this behaviour continues on the largest data sets, where only Mean Shift was able to finish. That run, however, took more than five hours, by far the longest running time recorded. The worst performing algorithm was undoubtedly Spectral Clustering: it could not complete even the smallest data sets and showed the worst scalability. Affinity Propagation was the next worst, as it generally took more time to cluster or was not able to cluster a data set at all, which leaves Mean Shift as the third worst.

Group 2 consists of the five remaining algorithms: Gaussian Mixture, Mini Batch K-Means, Ward, DBSCAN and Birch. On the smallest data sets these five show similar times, all taking roughly a second or less to complete the clustering; the differences are in the hundredths of seconds and thus negligible. If a 'worst' performer must nonetheless be selected from group 2, it is Ward, which took the longest of the five on the smallest data sets with both few and many features. Comparing few to many features, the latter takes more time to cluster, as is to be expected given the extra computational strain. As for the fastest, Gaussian Mixture leads on both data sets, be it by the small margins just discussed. Going up one size, interesting patterns emerge. The computational time increases for every algorithm, as expected, but the margins of these increases differ vastly. Ward and Birch increase significantly more than the rest: Birch even took 42 minutes where one size smaller it needed only a second, and Ward also shows quite an increase, especially with many features. Both thus handle scalability significantly worse than the others. Among the other three algorithms, one can also see differences in the ranking.


Mini Batch K-Means holds its speed and keeps its running time well beneath one second. The other two, Gaussian Mixture and DBSCAN, take considerably more time on the many-features data sets, and both also see an increase in running time on the few-features data set. On the largest sets, Mini Batch K-Means again stays under one second, while the other two show the same or increased computation times. Interestingly, DBSCAN actually decreased slightly in time on the many-features set while increasing considerably on the few-features set. It can thus be concluded that Mini Batch K-Means, which kept its computational time beneath one second throughout, is the most time-efficient of all the algorithms.

Table 3: Memory (in MB)

                      Bank    Htru2   Skin    Spam    Drive    Cov
Gaussian Mixture      1.852   3.598   4.070   2.820   5.090    5.094
Mini Batch K-Means    0.641   1.531   8.938   1.633   4.281    19.156
Ward Hierarchical     1.625   2.660   Mem     1.105   8.883    Mem
DBSCAN                0.438   3.559   16.652  1.172   0.824    1.688
BIRCH                 3.895   33.078  Mem     21.621  369.863  Mem
Affinity Propagation  2.410   Mem     Mem     9.281   Mem      Mem
Mean Shift            2.566   19.312  42.789  21.051  Mem      Mem
Spectral Clustering   4.367   Mem     Mem     Mem     Mem      Mem

The second efficiency metric, memory, is presented in Table 3. The same algorithms that scored poorly on time also score poorly on memory; to this group of three, however, Birch must be added. On the smallest data sets these four algorithms took significantly more memory than the others. Birch shows an interesting shift from scoring rather well on time to being second worst on memory for both small data sets, leaving only Spectral Clustering behind it. Spectral Clustering performs very badly on efficiency in general, being the worst on both time and memory consumption. Similar to the developments in Table 2, memory consumption at least triples from few features to many features for these four algorithms. Going up one size, the same increase occurs on the few-features set, with Birch even using ten times more memory. Overall, these algorithms scale rather badly and show high memory use even on the smallest data sets; all of them can be regarded as memory inefficient, or demanding. Spectral Clustering is easily the worst, followed by Birch, Mean Shift and Affinity Propagation.

The remaining four algorithms scored considerably better on memory usage and should thus be regarded as the more memory-efficient algorithms. On the smallest data sets the picture actually differs from the one for time. DBSCAN performed best of all on the small data set with few features, followed by Mini Batch K-Means, while the other two, Gaussian Mixture and Ward, took more memory. With many features, however, Ward uses the least, followed by DBSCAN, Mini Batch K-Means and lastly, by a margin, Gaussian Mixture. Gaussian Mixture thus seems to use more memory than the rest while being as fast as the others; a possible explanation is that a model has to be created, which costs additional memory, but this is speculation. As with the other algorithms, memory usage increases between few and many features, except for Ward, which took less memory to complete the clustering; this suggests that the number of features matters little for its memory usage and that it can take in more features efficiently. On the medium-sized data sets the rankings shift again. For few features, Mini Batch K-Means seems the most efficient, followed by Ward and then by both Gaussian Mixture and DBSCAN. DBSCAN scales badly in size and memory usage here, while Ward shows the smallest increase at this step of scale. With many features there are differences as well. Most notably, DBSCAN took less memory than on the smallest many-features data set and used the least memory of all the algorithms, which is rather strange. Mini Batch K-Means again needed considerably more memory than with few features, just as on the smaller data sets, which suggests that its memory scalability with respect to features is relatively bad. Gaussian Mixture underwent roughly the same increase regardless of size and features, so it seems to have less trouble with more features. This leaves the largest data set, which shows a different trend once again. Gaussian Mixture pulls ahead here, as its memory usage barely increases any more; the algorithm shows great scalability with respect to both features and size.


On few features it seems superior, since Mini Batch K-Means took twice as much and DBSCAN four times as much memory to complete the clustering. On the largest many-features data set, however, DBSCAN scored best, while Gaussian Mixture remained stable and Mini Batch K-Means once again increased by quite a margin, similar to the trend on few features. The trend of Mini Batch K-Means growing significantly with the number of features thus also holds here.

For overall efficiency, a top three is easily appointed: Gaussian Mixture, Mini Batch K-Means and DBSCAN. These algorithms clearly outperform the others when all scenarios are taken into account, and all three show exceptional scalability with both features and size. Mini Batch K-Means seems the best in terms of speed, as it always kept its running time beneath one second, which is extremely fast compared with the other two; as the data size increased, however, it used more memory, which could prove a problem on lower-end computers. Gaussian Mixture shows rather stable increases and good efficiency without usually being the outright best: it needed a bit more time to cluster but used less memory than Mini Batch K-Means and generally does well on few features, maintaining low time usage and a rather memory-efficient clustering process. DBSCAN is outperformed by both on the few-features data sets; with more features, however, it oddly becomes more efficient and easily outperforms the others on memory usage. The other algorithms are rather inefficient. Birch was able to keep up in time but used a lot of memory in comparison, while for Ward it is the other way around. The remaining three scored poorly on both metrics and can thus be regarded as the worst, with Spectral Clustering worst by quite a margin, as it could not even cluster the smallest data set with many features.

4.2 Known classification metrics

This section presents the metrics that rely on knowing the actual classification. As in the previous tables, some results are unavailable where an algorithm was not able to perform the clustering. The values in the tables correspond to the explanations given in the methodology section in the respective subsections. Some values are written in scientific notation, with an e followed by an exponent; these are extremely close to zero and can be regarded as such. Furthermore, homogeneity, completeness and V-measure will and should be regarded as a single metric, as they are closely related and heavily influence each other. Keep in mind that DBSCAN, Birch, Affinity Propagation and Mean Shift do not take the number of clusters as input, which means they are expected to be outperformed by the other algorithms.

Table 4: Adjusted Rand Index

                      Bank    Htru2    Skin     Spam     Drive   Cov
Gaussian Mixture      0.8459  0.0475   0.0538   0.7239   0.3439  0.0360
Mini Batch K-Means    0.7519  -0.0769  0.5977   -0.0300  0.4368  0.0306
Ward Hierarchical     0.7572  -0.0714  Mem      0.1881   0.0969  Mem
DBSCAN                0.3372  0.0      0.0005   -0.0323  0.0     1.4509e-16
BIRCH                 0.9424  -0.076   Mem      0.0641   0.1883  Mem
Affinity Propagation  0.0614  0.0003   Mem      0.0752   Mem     Mem
Mean Shift            0.0     -0.0266  -0.0679  -0.044   0.0252  Mem
Spectral Clustering   0.0113  -0.0610  Mem      Mem      Mem     Mem

The first metric to look at is the Adjusted Rand Index (Table 4), which measures the similarity of the clustering to the true classification while using chance normalization to strengthen the claims. On the first data set, the small one with few features, we see the same division as for time in Table 2. Affinity Propagation, Mean Shift and Spectral Clustering all achieved scores close to zero, which corresponds to essentially random labeling and is rather bad. The other algorithms, however, performed rather well on this data set within the metric. Birch performed best with roughly 0.9, a near-perfect score; right behind Birch comes Gaussian Mixture with 0.85, followed by Mini Batch K-Means and Ward with roughly equal scores of about 0.75, which is also rather high and shows great similarity with the actual classification.
