Unsupervised website visitor segmentation based on Convolutional Neural Networks and k-means

Unsupervised website visitor segmentation based on Convolutional Neural Networks and k-means

Submitted in partial fulfillment for the degree of Master of Science

Dimitar Dimitrov
12239496

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2019-07-05

Internal Supervisor: Chang Li, UvA, FNWI, IvI, c.li@uva.nl
External Supervisor: Michael Metternich, Company Supervisor, michael.metternich@bloomreach.com
3rd Supervisor: Dr Maarten Marx, UvA, FNWI, IvI, maartenmarx@uva.nl


Abstract

The digital era has made it possible for the end user to purchase almost anything on the Internet. This has led to the growth of the e-commerce industry, which in turn has pushed companies to search for ways to attract more customers. One such approach is tailoring the content presented to users, known as Content Management. The next step is to optimize the content based on user segments. Currently, those segments are created manually. In this research, we aim at clustering users based on keywords from the URLs of the pages they have visited. The keywords were transformed via a Convolutional Neural Network (CNN) before the actual clustering. We compared our results with similar research previously done within BloomReach [40]. In addition, by evaluating our CNN approach on an open dataset, we were able to compare our results with those reported by Jiaming Xu et al. [42], whose work laid the foundation of this research.

Keywords

Web Mining, Clustering, Classification, Text Analysis, Convolutional Neural Networks

1 Introduction

Although the digital era has changed a lot of domains, in this paper we focus on e-commerce. E-commerce represents the act of buying or selling goods through the World Wide Web (WWW) [19]. As suggested by R. Cooley et al. [6], a lot of companies rely on the internet to conduct and expand their business. To motivate this statement: according to [36], in 2019 e-commerce is expected to be responsible for over three trillion dollars in sales. This leads us to the conclusion that shopping is no longer tied to the physical shop, but is moving to the web. Bearing all this in mind, we can understand how important it is to make the online experience of the user as pleasant as possible. A possible improvement is to allow consumers to find what they need before they even know it. Although far-fetched, this is not a new practice; an example is an article in Forbes from 2012 [12]. In this article, the highlights are taken by the retail company Target, who were targeting their customers based on patterns generated from their data. However, this should be done with caution, as it might frighten and drive the visitors away.

Living in a tech-advanced world, we generate bits of data with each action we take online. Having a closer look at one of the marketing theories, the Consumer Decision Making Process [39], we can see that it consists of five different steps, of which the actual purchase is only the fourth. The first three steps have to do with the realization of the individual that he or she needs a certain product, followed by the search for similar items and their evaluation. While performing those three steps online, the user is generating their so-called digital identity by leaving footprints, such as a search history or a visit to a certain category page. The digital identity can then be completed with the actual realization of a purchase.

By collecting the digital identity of their visitors, usually in the form of log files, companies can segment them into groups. The groups can be based on, for example, a certain product group the visitors have searched for. With those segments, the business can target visitors in a more personal way, for instance by improving and adapting the content they see online, known as content management, thus improving their digital experience.

This thesis focuses on the extraction of valuable information from the online journey of website visitors. The project takes place in the digital company BloomReach. The company can be labelled a technology provider that aims at improving the experience of the end online user and their relationship with the business. To be more specific, the research is built around one of BloomReach's products, the Experience Manager. The product intends to give more power to the company, by allowing them to analyze and optimize content based on the audience. The content optimization is done based on user segments, which are created manually by administrators. We aim at automating the segment creation/suggestion process. As mentioned earlier, one way to store the digital footprint of the user is in log files, which are also the type of data used in this thesis. However, log files contain only the visited URLs, without any label or class specification; simply put, raw data. We could, however, extract insightful information through text analysis. This information can be used as input for different algorithms or Neural Networks; however, due to the lack of labels or class specifics, the approach must be unsupervised. An unsupervised algorithm is an algorithm capable of learning from a dataset without any labels, and capable of finding patterns in it [16].

1.1 Research Questions

In this work we investigate the application of Neural Networks, more specifically Convolutional Neural Networks (CNNs), for unsupervised clustering of textual data extracted from log files. The reason for focusing specifically on CNNs will be covered in the following sections.

However, our main motivation has to do with evaluating the work of Jiaming Xu et al. [42], which covers an interesting approach of training a CNN in an unsupervised manner. The output of the network is then used for clustering of the input. Motivated by this, as well as by in-house research, which will be described in Related Work, the main research question of this paper is:

RQ1: To what extent can the use of a CNN outperform K-Modes clustering in unsupervised user segmentation, based on keywords extracted from the online journey?

As the topic can quickly expand and overrun the time limitation of this project, and to improve evaluation, the work was split into two parts, each of which has its own set of sub-questions.


Part 1: CNN

• RQ2: How is our implementation of the work of Jiaming Xu et al. performing compared to the results they report?

Part 2: Available Data in-house

• RQ3: How can we evaluate the performance and results with the data in-house?

• RQ4: Does it make sense to scrape the visited page and use keywords based on the scraped content?

• RQ5: To what extent is the use of scraped content affecting the cluster performance and structure?

We will cover related work in chapter 2, while chapter 3 will introduce the methodology. Finally, chapters 4 and 5 will focus on evaluating the approach and drawing conclusions. Appendix A provides extended evaluation.

2 Related Work

2.1 Web Mining

Part of the work of R. Cooley et al. lies at the foundation of this thesis. They introduced multiple approaches for structuring raw data coming from the internet, and motivated techniques for extracting value out of it. This subsection is mainly based on two of their papers:

• Data Preparation for Mining World Wide Web Browsing Patterns [7]

• Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns [6]

In the first one, the authors describe and discuss different Data Mining techniques, among which are association rules and clustering analysis. The latter is the focus of this paper, as it emphasizes the benefits of grouping similar users together. Based on those clusters, companies can either develop a marketing strategy or execute one by targeting customers both online and offline. R. Cooley et al. share their theory that a certain web page can serve one of two purposes for the end user: either a navigation page or a content page. The key takeaway from this paper concerns the user transaction and how to extract what is actually relevant. The authors discuss three different modules for identifying specific transactions: the Reference Length Module, the Maximal Forward Reference Module and the Time Window Module. For this thesis, we focus only on the Maximal Forward Reference, initially proposed in [25]. This module takes into account the visitor ID, the timestamp of the visit and the page. The transactions based on this module are not related to the time the visitor spends on a page, but rather to the order of visited pages. A new transaction starts with a so-called forward reference, a page not in the current transaction, and a transaction finishes when the visitor goes back to a page which is already in the transaction. A group of consecutive URL visits will finish with a content page, and the pages leading up to it are navigational. An example from [6]: the sequence A,B,C,D,C,B,E,F,E,G would be split into 3 transactions: A,B,C,D; A,B,E,F; A,B,E,G. The content pages are D, F and G, the last pages visited before going backwards.
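For illustration, the module above can be sketched in a few lines of Python. This is a simplified sketch of the Maximal Forward Reference logic, not the exact implementation used in [6] or in our pipeline:

```python
def maximal_forward_references(pages):
    """Split a visit sequence into maximal forward references.

    A transaction ends when the visitor moves backward to a page
    already in the current path; the page visited just before the
    backward move is a content page.
    """
    transactions = []
    path = []
    extending = False  # True while the visitor keeps moving forward
    for page in pages:
        if page in path:
            # Backward reference: the forward path so far is one transaction.
            if extending:
                transactions.append(list(path))
            # Truncate the path back to the revisited page.
            path = path[:path.index(page) + 1]
            extending = False
        else:
            path.append(page)
            extending = True
    if extending and path:
        transactions.append(list(path))
    return transactions
```

Running this on the example sequence A,B,C,D,C,B,E,F,E,G reproduces the three transactions from [6].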

2.2 Relevant work in-house

The current phrasing of the main research question has to do with recent research done within the company [40]. As the research question (RQ1) suggests, we aim at comparing the performance of our approach using the same data.

In the previous approach, from now on referred to as the baseline, the aim was to segment website visitors based on contextual data retrieved from the visited URLs. In order to achieve this, they proposed the Automated Visitor Segmentation (AVS) pipeline, consisting of seven steps: reading (1), filtering (2) and sorting the data (3), followed by identifying transactions (4) and extracting information from them (5). Finally, unnecessary data was removed (6) and the data was clustered (7).

The exact data used in their research, as well as in ours, will be further discussed in Section 3.1. However, as the available data consisted of raw log files, the main focus of their pipeline (steps 1 to 6) is data pre-processing and cleaning. Interesting in the pipeline is step 4, which uses a method initially proposed by Ming-Syan Chen et al. [25]: the Maximal Forward Reference.

The output of the first six steps would be similar to the data presented in Figure 1. Regarding the last step of their pipeline, data clustering, they provide an extensive literature overview of possible directions and different algorithms capable of clustering data in an unsupervised manner, due to the lack of labels or class specifications. The final decision was based both on the advantages and disadvantages of the algorithms and on the type of data they had. The baseline approach was built around K-Modes clustering, hence the exact formulation of our RQ1. K-Modes is a variation of K-Means and was chosen based on the following points:

(1) The main functionality of the algorithm is to group the most frequent similar items.

(2) The algorithm supports both numerical and categorical data. In the baseline, they directly used the cleaned categorical data as input for the algorithm.

(3) As the algorithm is related to K-Means, it allows flexibility in the number of clusters, which is used as a parameter that can be set by the user.

The final output of the baseline approach is structured data containing the cleaned output from the first six steps, together with numeric cluster labels.
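To make the baseline concrete, the following is a minimal, self-contained sketch of the K-Modes idea: centroids are per-column modes and the distance is the simple matching (Hamming) dissimilarity. This is a toy version for illustration only; real experiments would typically use a dedicated implementation such as the `kmodes` Python package:

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(rows, k, iters=10, seed=0):
    """Toy k-modes over rows of categorical tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to the closest centroid.
        labels = [min(range(k), key=lambda c: hamming(row, centroids[c]))
                  for row in rows]
        # Update step: column-wise mode of each cluster's members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = tuple(Counter(col).most_common(1)[0][0]
                                     for col in zip(*members))
    return labels, centroids
```

The structure mirrors K-Means exactly (assign, then update), which is why the number of clusters remains a free parameter here as well.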

We will be going back to the baseline model to discuss their final results in the Evaluation section.

2.3 Neural Networks

In Section 3.1 Data Description we will further discuss the structure of the provided data. Nevertheless, we should point out that the data at hand represents sequences of words without any categories or classes attached to them, and arguably without any context. As already emphasized, such data points to the use of unsupervised learning.

Neural Networks consist of sets of algorithms and are inspired by the human brain and its neurons. In different sources, the neuron is regarded as either an information carrier or the working unit of the brain, which is also similar to its purpose in a Neural Network. Simply put, from a programming perspective, the neuron is a unit which takes some input, performs a calculation on this input and sends out the result. A Neural Network is basically a set of neurons connected together. Figure 1 is a simple Neural Network, created with [22].

Figure 1: Simple Neural Network Representation.

As shown in figure 1, there are arrows leading from the input layer to the output layer. For example, assume we have an image as input. The network will "evaluate" it pixel by pixel, and the final output will be, for instance, a class label, such as dog. The "evaluation" is not a simple summation but rather a processing architecture involving different functions, set-up variations and mathematics; going deep into it is out of the scope of this research. The second main algorithm involved in a Neural Network is back-propagation. Simply put, it is how the network learns. Instead of going forward, the network goes backward, and the weights of each neuron (the arrows in the figure) are adapted in order to minimize the error. The error itself is based on the result of the forward propagation and the actual ground truth, which, going back to our image example, is the actual label of the image. To summarize, the network learns which are the important features per class/label. This describes supervised learning, where a prediction is made based on examples.

Although from the basic explanation of Neural Networks we see that they are in a way a supervised approach, they can also be used in an unsupervised manner in order to provide a better representation of the input data. Considering this, and as we only have unlabeled data, we decided to investigate the possibility of clustering such a generated representation of our data. Although this approach is not new, there are certain points which need to be taken into account. For example, Dundar et al. [8] suggest an approach which incorporates both k-means and Neural Networks; however, they are using images and they have labels. Besides training, the unique count of the labels can be used as the number of clusters needed as a parameter for the algorithm. Another usage of the labels is to calculate the accuracy and quality of the generated clusters, by comparing the predicted ones against the ground truth.

Another important point to take into account is the choice of Neural Network, which is affected by multiple things, such as the data at hand and the expected outcome. There are multiple sources and guidelines; for example, Angus et al. propose criteria for choosing the best Neural Network [9], discussing this problem together with solutions.

Considering this, a workaround is needed for the lack of labels with regard to the training of the Neural Network. We also need to consider the type of data mentioned at the start of this subsection. Starting with the latter, as discussed in [13], short texts are limiting in the sense that they lack the syntax and grammar used in proper text. Secondly, short texts lack the statistical information needed for proper use of statistical approaches like topic modelling, and, as the authors state, such texts are ambiguous and thus hard to interpret. In our work, we are dealing with separate words extracted from URLs. Regarding the unsupervised training of Neural Networks, Jiaming Xu et al. [42] propose an approach worth investigating. In their paper, they suggest an architecture capable of learning the most important features without the use of any labels.

Motivated by their work, we decided to incorporate their approach for training a Neural Network and investigate whether we can firstly improve on their results and secondly make use of the approach for our own dataset.

3 Methodology

As already mentioned, this research is rather specific, as at its basis it is constructed around data from one of the main products of the sponsoring company: Experience Manager. Another point making our research specific is the fact that we focus only on the use of CNNs. Initially, CNNs were recognized and known for their performance in image-related tasks; however, as Wang et al. [41] suggest, they are capable of learning local features from words and phrases. The above statement also appears in the work of Yoon Kim [18], who compares the performance of different simple CNNs on seven tasks, two of which are sentiment analysis and question classification. The proposed method from Yoon Kim improved on 4 out of those 7 tasks. In addition, the approach used for the training of the Neural Network is based on the work of Jiaming Xu et al., and as we also aim at comparing our results with theirs, we used the same Neural Network type: a CNN.

The rest of this section is split into sub-sections, each providing part of the whole process and further building on the content above.

3.1 Data Description

The available in-house data was generated from one of the company's products, which is a Content Management System: Experience Manager. Whenever a client requests this product, the tool is deployed based on their requirements. One of the things stored by the system are log files of the interactions of the visitors with the website.

Initially, two datasets were provided: one from the sponsoring company's website and one from a client website. Both datasets had the same features per user; however, the client dataset was multilingual, which differed from the initial idea to focus on content in English, similar to the previous research.

This being said, the preferred dataset was generated based on the usage of the company's website, bloomreach.com, and contained the following information:

(1) Unique identifier, which distinguishes site visits/sessions;

(2) Unique user identifier, which distinguishes users and is stored in a cookie;

(3) Location information, which consists of country, city, latitude and longitude;

(4) Information about the day of the week the visit was made, as well as a separate timestamp per activity per user. This allows following the path described in section 2.1 Web Mining;

(5) Browser used;

(6) Referrer page: the page which got the user to the 'current' page. An example of a referrer page is google.com.

According to the earliest timestamp, the first record is from the 29th of August 2016 and the last record is from the 15th of April 2019. In total, the log file contains 12 610 512 rows. Based on the visitor identifier, there are 6 442 700 unique entries.

A sample, without the visitor identifier, can be seen in Table 1.

3.1.1 Data pre-processing. As Karl Grooves [10] explains, log files are not initially created for usability analysis, which points to the need for data cleaning. This is also seen in the pipeline architecture of the baseline method, where 6 out of 7 steps are data pre-processing related. In this research we aim at comparing our results with, and preferably extending, the baseline [40]. From this perspective, we evaluated the pre-processing steps that are part of the baseline, and although we used the same order of actions, such as reading, sorting etc., we tried to improve on them. The improvement, in our opinion, was in the way the keywords were extracted from the URL, per user. Figure 2 represents an example of a URL.

Figure 2: Sample URL Example.

The domain-related part is ignored, as it brings no value. The valuable information is filtered via a set of regular expression rules before it is tokenized, stemmed and filtered for stop words. The example from figure 2 will give us customer, police, national as keywords. This, combined with the correct grouping of visits, as explained in section 2.1, should be sufficient to allow us to see the interest of the visitor and from there build a valuable visitor feature.
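A simplified sketch of this extraction step is shown below. The stop-word list and the regular expression are illustrative only, and the stemming applied in the actual pipeline is omitted here:

```python
import re
from urllib.parse import urlparse

# Toy stop-word list; the real pipeline used a full stop-word list
# plus stemming, both omitted here for brevity.
STOP_WORDS = {"the", "a", "an", "and", "of", "en", "www", "html"}

def url_keywords(url):
    """Extract candidate keywords from the path of a URL.

    The domain is ignored (it carries no segmentation value); the path
    is split on non-alphabetic characters, lower-cased and filtered
    for stop words and very short tokens.
    """
    path = urlparse(url).path
    tokens = re.split(r"[^a-z]+", path.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]

# e.g. url_keywords("https://example.com/customer/police-national/")
# -> ['customer', 'police', 'national']
```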

The pre-processing will be touched on again in the Evaluation, as we compare the results from the baseline pre-processing and ours.

3.1.2 Web scraping. Although the initial idea, as explained until now, was to focus primarily on using only the URL, we decided to add an additional data source in order to compare performance. The summarized results can be seen in section 4 Evaluation, and the extended evaluation in Appendix A. Nowadays, the importance of the URL has somewhat fallen behind: for example, in a Google search the majority of people will look at the provided summary of the page, rather than at the specific link itself, the content after the domain.

Based on this, we can formulate the hypothesis that the actual content will be more helpful when segmenting users than the URL alone. To test the hypothesis, a framework was developed, capable of following a given URL and scraping the content. As Julia Kho [17] explains, web scraping is a technique to access and extract information from a given website.

Our framework consisted of a so-called spider (the actual scraper), a main repository and two separate repositories. The main repository was straightforward and was used to store the whole text content from the page, excluding any navigation bars, comments and other non-related information. Besides the content, the extraction date was also stored, for the case where the scraped page did not contain a date identifier. This was done for the sole purpose of having a way to refresh the repository. Depending on the website at hand, the refresh can be done either after a new system release or after a certain time period. In our case, the website is mainly related to documentation, meaning the content would only change around product updates or releases. The two separate repositories were stored in the form of dictionaries, as follows:

(1) {Scraped URL: keywords from that page}

(2) {Scraped URL: summary of that page}

The keywords and the summaries were extracted based on a ranking algorithm and the available textual content [28]. Important to mention is that there was a separate pre-processing step for the scraped content. This was needed as the raw scraped text also contained Hyper Text Markup Language (HTML) tags. Besides this, as some of the pages were documentation-related, we had code snippets mixed with text. The pre-processing here was based on a set of regular expressions and web-related programming libraries.
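A minimal sketch of such a cleaning step, using only the Python standard library, is given below. The set of skipped tags is illustrative; the actual framework relied on web-related libraries and a larger rule set:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags and non-content blocks."""
    SKIP = {"script", "style", "nav"}  # illustrative skip list

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_html(raw):
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(parser.parts)
```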

The two repositories described above were merged with our user data based on the visited URLs, thus creating additional features for each user. However, in the process of working we realized that using the summary of a page would over-complicate things and does not make sense with the current research in mind; instead, it is left for Future Work. The keywords, on the other hand, proved more useful, as is shown in section 4 Evaluation.

3.2 Methods

This section is split into three main parts. The first part covers the steps taken to prepare the data for the Neural Network, the second describes the idea behind the model design, and the third explains the steps taken to expand our research. Initially, we were only interested in using the keywords extracted from the URLs; however, we decided to investigate whether it would make more sense to use the actual content behind the URL by scraping the page.

3.2.1 Network Input. Following the pre-processing steps described in section 3.1.1 Data pre-processing, we produced a set of keywords per user. However, those keywords were still represented as text, and in order to be used by the Neural Network, we transformed the text into numeric values using a deep learning library. The initial set of keywords was transformed into a sequence in which each word was replaced by an index value, based on a word index dictionary. For example, 'bloomreach' was replaced by 1, whereas 'apache' was replaced by 228. The dictionary was created according to the frequency of each word in the whole corpus; lower indexes point to words which are more common and appear more often in the given corpus [5]. Based on the new sequences, the maximal length was taken, which was used to generate a padded


Table 1: Data sample from the in-house data.

timestamp               | pageUrl                                           | NewVisit | pageId
2016-08-29 18:46:05.809 | https://www.onehippo.org/library/administratio... | True     | hst:pages/documentation
2016-08-29 18:46:03.111 | https://www.onehippo.com/en/digital-experience... | True     | hst:pages/Digital-experience-platform
2016-08-29 18:46:09.518 | https://www.onehippo.org/                         | True     | hst:pages/home
2016-08-29 18:46:11.279 | https://www.onehippo.org/                         | True     | hst:pages/home
2016-08-29 18:46:14.663 | http://www.onehippo.com/connect/boston            | True     | hst:pages/boston

sequence for each of the elements in the corpus. Respectively, if the given element's length was lower than the maximum, 0's were added.
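The indexing and padding steps described above can be sketched as follows. This is a pure-Python approximation of what the deep learning library's tokenizer does; index 0 is reserved for padding:

```python
from collections import Counter

def build_word_index(corpus):
    """Map word -> index; the most frequent word gets index 1 (0 pads)."""
    counts = Counter(w for doc in corpus for w in doc)
    ranked = [w for w, _ in counts.most_common()]
    return {w: i + 1 for i, w in enumerate(ranked)}

def to_padded_sequences(corpus, word_index):
    """Replace words by indexes and right-pad with 0 to the max length."""
    seqs = [[word_index[w] for w in doc] for doc in corpus]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]
```

With this scheme, frequent words such as 'bloomreach' receive low indexes, exactly as described above.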

In parallel, an embedding matrix was initialized. This matrix is based on word embeddings: vector representations of words learned via a Neural Network. The word embeddings are publicly available through the work of Mikolov et al. [24]

The next step is to combine the sequence matrix generated earlier with a weighting factor, in order to account for each feature (word) in our sequence. We ran tests using all four available approaches as a weighting factor (the fourth being TF-IDF, discussed below); however, we only report the best-performing results.

• Binary: Having the whole corpus, evaluates entries (separate texts/documents) and returns 1 for each word from the corpus which is in the given entry. Respectively, a 0 is returned if the word is not in the entry.

• Count: Following the same logic as the binary approach, it returns 0 if the word is missing; however, if the word appears, it returns the number of occurrences. Important here is to account for stop words, such as: the, a etc.

• Frequency: 0 is returned if the given word is not in the processed text/document. If the word appears, it returns the proportion of the number of times the word appears against the total length of the text/document.
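The three weighting factors above can be illustrated with a small sketch (the vocabulary and tokens are toy examples; this mirrors the described behaviour, not the library's internals):

```python
def doc_vector(doc, vocab, mode="binary"):
    """Weight each vocabulary word for one document (a list of tokens)."""
    n = len(doc)
    vec = []
    for word in vocab:
        c = doc.count(word)
        if mode == "binary":
            vec.append(1 if c else 0)        # present or not
        elif mode == "count":
            vec.append(c)                    # raw number of occurrences
        elif mode == "freq":
            vec.append(c / n if n else 0.0)  # proportion of the document
        else:
            raise ValueError("unknown mode: %s" % mode)
    return vec
```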

The resulting matrix was then normalized in order to ensure that all values have a common scale [39]. Following the normalization, the normalized matrix was combined with the embedding matrix. The last step of the pre-processing is to follow the approach from the baseline method and binarize the product of the two matrices. As it is based on vector representations, the product can contain negative values, which after the binarization are changed to 0, while all positive ones get the value 1 instead.

The following few sentences of this subsection reflect on what was done in the baseline approach. In their work, Jiaming Xu et al. use a binarized (0 or 1) representation of the Average Embedding in order to train the Neural Network (figure 3). The binary representation is used instead of labels. In the baseline model, TF-IDF was used as a weighting factor for the features in the Average Embedding. Term frequency - inverse document frequency (TF-IDF) is used to represent the importance of each word with respect to its occurrences in the text [27]. Rephrasing: TF assumes that if a word occurs a lot in a given text, then this word should be descriptive of the text. IDF, on the other hand, reasons that if a word appears a lot in the given text/document, as well as in others, then most probably this word is not unique to the text and brings no meaning. Stop words, for example, which appear a lot in a text, bring arguably no context about the content. A high TF-IDF score means that the word is rare and specific to the document at hand.
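For illustration, a bare-bones TF-IDF computation is sketched below. We assume the unsmoothed log(N/df) variant; real libraries typically apply smoothing and normalization on top of this:

```python
import math

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {word: score} dict per doc.

    TF is the in-document proportion; IDF down-weights words that occur
    in many documents via log(N / df).
    """
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    scores = []
    for doc in corpus:
        s = {}
        for w in set(doc):
            tf = doc.count(w) / len(doc)
            s[w] = tf * math.log(n_docs / df[w])
        scores.append(s)
    return scores
```

Note that a word appearing in every document scores exactly 0, which matches the stop-word intuition above.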

Figure 3: Jiaming Xu et al. proposed architecture. [42]

3.2.2 Network Design. This section covers the design of our Neural Network, together with the reasoning for our decisions. Starting from the top, the final design: figure 4 represents the structure of our current CNN.

Figure 4: Our proposed architecture.

We incorporated the same approach used in the works of Jiaming Xu et al. and Yoon Kim and trained our CNN on top of pre-trained word vectors. The output from section 3.1.1 (simple keywords) is used as input for the steps described in 3.2.1 Network Input and then for the training of the network, instead of labels.


Our neural network starts with an embedding layer, which is a vital part when dealing with text in Neural Networks. In our work we are not training our own embeddings; rather, we simply load the embedding matrix mentioned in section 3.2.1 as weights [4]. The explanation of the Embedding Layer in the Keras documentation is rather vague [38] and simply states 'Turns positive integers (indexes) into dense vectors of fixed size'. Jason Brownlee [4] provides a good summary of what embeddings are and their purpose. In a nutshell, their purpose is to present words as dense vectors, which are based on the representation of each of the words in a continuous vector space. This approach is a better alternative to one-hot encoding, which represents each document as a vector of the size of the vocabulary length, with mostly 0's. One-hot encoding is based on the principle of whether the word is in the document or not: if it is not, 0 is applied.

Following different discussions and sources, such as the work of Nitish Srivastava et al. [33] on Dropout as a way to prevent Neural Networks from over-fitting, we directly apply a Dropout layer to the output of the Embedding Layer. Going back to the Keras documentation, the purpose of Dropout is to randomly set a portion of the input units to 0 during training. The setting, the dropping fraction, is a hyperparameter which can be tuned [37]. As presented in figure 4, following the Dropout, we have three convolutional layers on the same level.

There are a few things we considered in this part of our network design. As numerous sources explain, such as Nils Ackermann [1] or Jason Brownlee [3], 2D CNNs have been used in image processing, where the incoming input is of two-dimensional format. 1D CNNs, however, have been used for other tasks, such as Natural Language Processing (NLP) and our case, where the input data is of a different format. Regarding the use of three layers on the same level, it is important to go back to the idea behind the convolutional layer, which is to simply apply a filter, or a set of filters, to an input [2]. To further build on the use of filters in NLP tasks, we relate to the work of Siwei Lai et al. [20]. In the paper, the authors argue that in earlier studies of CNNs in NLP, researchers would rely on filters with a fixed size; however, when using such fixed sizes, one is prone either to lose information, when the size is too small, or to have, as Siwei Lai et al. point out, an enormous parameter space, when a larger size is used instead. With this in mind, and motivated by both the work of Yoon Kim [18] and Ye Zhang et al. [43], we made use of a set of three filters, each with a different size. Figure 5 is a shortened version of the work of Ye Zhang [43] and shows the idea behind the use of different filter sizes. Essentially, each filter will capture a different set of features. As an example, we can look at the work of Yoon Kim [18], where he shows the formula for generating a feature. A feature ci, coming from a given word window, is given by equation 1:

c_i = f(w · x_{i:i+h−1} + b) (1)

where f represents a non-linear activation function, w stands for the filter, which is applied to the given window of words x_{i:i+h−1}, and b is a bias term. This is a single feature, i.e. a single application of the filter. Once the filter is applied to all word windows, a feature map is created, equation 2.

c = [c_1, c_2, ..., c_{n−h+1}] (2)
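Assuming a ReLU activation and word vectors represented as plain Python lists (the names and sizes below are ours, purely illustrative), equations 1 and 2 can be sketched as:

```python
def feature_map(words, w, b, h):
    """Apply one convolution filter of height h over a sentence.

    words: list of word vectors (lists of floats), w: flat filter weights
    of length h * dim, b: bias. Returns c = [c_1, ..., c_{n-h+1}] (eq. 2),
    where each c_i = relu(w . x_{i:i+h-1} + b) (eq. 1).
    """
    relu = lambda v: max(0.0, v)
    features = []
    for i in range(len(words) - h + 1):
        # concatenate the h word vectors in the current window
        window = [x for word in words[i:i + h] for x in word]
        features.append(relu(sum(wi * xi for wi, xi in zip(w, window)) + b))
    return features
```

For instance, with two-dimensional word vectors and h = 2, a sentence of three words yields n − h + 1 = 2 features.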

Figure 5: Shortened version based on the work of Ye Zhang. [43]

Following the features, we apply the Pooling. As Harsh Pokharna [26] explains, the idea of pooling is to reduce the spatial size of the representation and the number of features. The work of Alon Jacovi [15] provides an extensive overview of how CNNs are used for Text Classification. Based on this work and other readings, we settled on Global Max Pooling. The idea of max pooling is to retrieve the highest value from a feature map.

Following the concatenation of the results, we finish with a fully connected layer, followed by dropout, and a dense layer used for the final prediction.
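Putting the pieces together, a minimal Keras sketch of the described topology (embedding → dropout → three parallel 1D convolutions with different filter sizes → global max pooling → concatenation → dense layers) might look as follows. All sizes below are illustrative placeholders, not the hyperparameters tuned via GridSearch in this work:

```python
from tensorflow.keras import Model, layers

def build_model(vocab_size=5000, max_len=50, embed_dim=48,
                n_filters=64, n_classes=100, drop_rate=0.5):
    """Sketch of the architecture in Figure 4; every size here is a
    placeholder, not a tuned hyperparameter from this thesis."""
    inp = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = layers.Dropout(drop_rate)(x)          # dropout on the embeddings
    pooled = []
    for size in (3, 4, 5):                    # three parallel filter sizes
        c = layers.Conv1D(n_filters, size, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    x = layers.Concatenate()(pooled)          # merge the three branches
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(drop_rate)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```

Each Conv1D branch produces a feature map per filter, the global max pooling keeps only the strongest response of each filter, and the concatenation combines the three filter sizes before the final prediction.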

4 Evaluation

4.1 Experimental Environment

The evaluation was conducted on a local machine equipped with an Intel Core i7 (2.5 GHz) processor, 16GB RAM (1600MHz DDR3) and macOS Mojave (Version 10.14.5) as operating system. The development itself took place on the same machine, where all the hyperparameters for the Neural Network were tuned via GridSearch, according to the machine's specifications. The code was written in Python 3.7.3 (Anaconda Distribution). The Convolutional Neural Network was developed via Keras, as it enables fast prototyping and testing. [30]

4.2 Model Comparison

In this subsection we aim at comparing our work with the results presented in the work of Jiaming Xu et al. Following their evaluation, we compared the performance based on the following two metrics:

(1) Accuracy (ACC): The proportion of correct predictions made by the model. [35] In our case, we check the k-means labels against the ground truth.

(2) Normalized Mutual Information (NMI): A measure describing the relatedness of two variables, i.e. how much one of the variables is able to describe the other [11]. The end value is normalized between 0 (no mutual information) and 1 (perfect correlation). [32]

The metrics focus on evaluating the quality of the clusters. Please note that when using a clustering algorithm, e.g. k-means, the cluster labels and assignments change each time the algorithm is re-run; this is known as the assignment problem. To account for this, Jiaming Xu et al. used a combinatorial optimization algorithm, the Hungarian algorithm [5] [21].
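To illustrate the assignment problem on a toy example, the sketch below brute-forces the best one-to-one relabeling of predicted clusters. This is only feasible for a handful of clusters; for realistic sizes (here, 100 clusters) the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment, would be used instead. The function is our illustration, not the evaluation code of [42]:

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best one-to-one relabeling of predicted clusters.

    Brute force over all permutations, so only feasible for a few clusters;
    assumes as many predicted clusters as ground-truth classes.
    """
    clusters = sorted(set(pred_labels))
    classes = sorted(set(true_labels))
    best = 0
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))   # candidate relabeling
        hits = sum(mapping[p] == t for p, t in zip(pred_labels, true_labels))
        best = max(best, hits)
    return best / len(true_labels)
```

For example, predictions [2, 2, 0, 0, 1, 1] against ground truth [0, 0, 1, 1, 2, 2] score a perfect 1.0 once the clusters are optimally relabeled.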

The comparison is based on the publicly available Stack Overflow dataset from Kaggle [34], which consists of question titles and the corresponding labels. Table 2 shows the performance comparison of the results reported by Jiaming Xu et al. against ours. Their approach is referred to as Self-Taught Convolutional Neural Networks for Short Text Clustering (STCC). The results of STCC are based on the average of 5 runs with 100 clusters [42]. Accordingly, we did the same.

Table 2: Comparison of Model Performance based on Stack Overflow Dataset.

Method | ACC(%) | NMI(%)
STCC | 51.13 ± 2.80 | 49.03 ± 1.46
Our Approach | 52.83 ± 2.68 | 49.09 ± 2.32

An extended version of our results can be seen in Table 3. According to those results, we can state with 95% confidence that our mean Accuracy falls within [52.30, 53.36] and our mean NMI within [48.64, 49.55].

As can be seen in Table 2 and Table 3, we were able to achieve an improvement in the final performance. In our opinion the improvement is due to two factors, one of which is the network design. The other is the weighting factor used in the creation of the training labels. In their work, Jiaming Xu et al. relied on TF-IDF as a weighting factor, whereas we report the scores based on Binary weighting, as it showed the best results.

4.3 Clustering comparison

In this subsection we compare our results with the results from the research previously done within the sponsoring company [40].

The comparison will be split in several parts. The first part will account for the visual and computational performance of the approaches used for extracting keywords from the URL. Following the keywords extraction, the cluster results will be discussed. The previous in-house research will be referred to as the baseline, and the number of clusters will be denoted by 'k'.

4.3.1 Keywords extraction. The first takeaway from the comparison is that our approach copes with occurrences of numeric values via a set of regular expressions. An example URL: 'https://www.onehippo.org/7_8/library/architecture/hippo-cms-7-architecture.html'.
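As an illustration of this idea (our sketch, not the exact production code), splitting the URL path on non-letter characters makes numeric fragments such as '7_8' disappear, and a small stop set removes URL-plumbing tokens; the stop set below is a hypothetical minimal one:

```python
import re
from urllib.parse import urlparse

# Hypothetical stop set of URL-related tokens; the real list would be larger.
URL_STOPWORDS = {"http", "https", "www", "html", "htm", "com", "org"}

def url_keywords(url):
    """Extract keyword candidates from the path of a URL."""
    path = urlparse(url).path.lower()        # domain is dropped automatically
    tokens = re.split(r"[^a-z]+", path)      # splits on digits, '/', '_', '-', '.'
    return [t for t in tokens if len(t) > 1 and t not in URL_STOPWORDS]
```

For the example URL above this yields ['library', 'architecture', 'hippo', 'cms', 'architecture']: the '7_8' fragment and the 'html' suffix are gone.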

Another important point is that, compared to the baseline, where the domain variations are hard-coded, our approach automatically recognizes the domain and focuses only on the important part, as shown in Figure 2. This, however, comes at the cost of increased execution time, as shown in Table 4.

In addition, our approach is better at filtering the final set of keywords by not allowing symbols, such as ':', or URL-related words, like 'http' or 'www'. A sample comparison can be seen in Figure 6.

Figure 6: Comparison between keyword extraction approaches, emphasizing numerical values, symbols and URL-related words.

4.3.2 Cluster Evaluation. As we do not have ground truth labels, we evaluated the clusters using the Silhouette Coefficient proposed by Rousseeuw [29]. This coefficient provides a graphical aid to the interpretation and validation of cluster analysis. Table 5 gives an overview of the interpretation of the different scores. This was done similarly to [40].

A score below 0 indicates that the data point does not belong to its assigned cluster.
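For intuition, the coefficient can be computed directly from its definition; the naive O(n²) sketch below is for small toy data only (scikit-learn offers silhouette_samples and silhouette_score for real use) and assumes every cluster has at least two points:

```python
from math import dist  # Python 3.8+

def silhouette_samples(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b the mean distance to the
    nearest other cluster (Rousseeuw, 1987)."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, m) in enumerate(zip(points, labels))
               if m == l and j != i]
        a = sum(own) / len(own)
        b = min(sum(dist(p, q) for q, m in zip(points, labels) if m == o) /
                labels.count(o)
                for o in set(labels) if o != l)
        scores.append((b - a) / max(a, b))
    return scores
```

Two well-separated toy blobs score above 0.7 per point, i.e. "strong structure" in the terms of Table 5, while overlapping blobs drag the scores toward 0 and below.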

4.3.2.1 Baseline Approach Summary: Results and Reasoning. The summary is based on the initial work, which can be found under [40]. In section 2.2 we provided an overview of the baseline approach; in this section we focus on their results.

Step 7 of the AVS pipeline segments the visitors into clusters, using the k-modes clustering described in section 2.2. The quality of the clusters was evaluated based on the Silhouette Coefficient, and for the actual calculation of the score they used a pre-computed distance matrix, which encodes how close the data points are to each other. However, due to the capacity limits of the function used to calculate the matrix, they used only 10 000 rows from the initial dataset.

In their work, several values of 'k' were investigated (28, 80, 112, 27, 14), along with their Silhouette scores. The interesting aspect of their work is that a huge amount of the data ended up in a single cluster, cluster 0, situated in the negative section, Figure 7.

Figure 7: Silhouette Score for k = 28.

Assuming that cluster 0 represents a 'noise bucket', which contains all the data unfit to be clustered, the cluster and the


Table 3: Comparison of Model Performance based on Stack Overflow Dataset.

# Runs | Min ACC(%) | Max ACC(%) | Mean ACC(%) | Mean NMI(%) | Std. Deviation (ACC) | Std. Deviation (NMI)
1 | 46.8 | 63 | 53.3 | 49.42 | 2.92 | 2.47
2 | 46.4 | 57.8 | 51.8 | 48.24 | 2.53 | 2.19
3 | 46 | 59.9 | 52.34 | 48.94 | 2.69 | 2.35
4 | 47.3 | 63.7 | 53.41 | 49.56 | 2.64 | 2.27
5 | 47.5 | 60 | 53.3 | 49.31 | 2.62 | 2.3
Average | 46.80 | 60.88 | 52.83 | 49.09 | 2.68 | 2.32

Table 4: Execution Time Comparison - Pre-processing.

Processed Data | Our Approach | Baseline Approach
10 000 | 2.9 seconds | 1.6 seconds
100 000 | 29.6 seconds | 19.3 seconds
1 000 000 | 4 minutes 54 seconds | 2 minutes 52 seconds

Table 5: Silhouette Coefficient Interpretation. [40]

Range | Interpretation
0.71 - 1.0 | Strong structure has been found
0.51 - 0.7 | Reasonable structure has been found
0.26 - 0.5 | Structure is weak, could be artificial
<0.25 | No substantial structure

corresponding data were removed, under the assumption that the rest of the clusters would remain the same and that the average score, the red dotted line, would increase significantly.

After removing the unfit data, they ran their clustering algorithm with k = 27, to account for the deleted cluster, and were able to achieve an increase of the average Silhouette score up to 0.61, Figure 8. A few additional plots from the baseline approach can be seen in Appendix A.

Figure 8: Silhouette Score for k = 27, unfit data removed.

The main takeaways from their results are:

(1) Cluster 0 contains noisy data, which cannot be clustered, thus should be used with caution.

(2) Leaving the noise aside, the remaining data holds a good structure. 'Good' is based on the scores of the clusters and the interpretation from Table 5, and describes the fact that data points are more related to the other points in the same cluster than to data points from other clusters.

(3) In their work, before the noise removal, the average score (dotted line) went above 0.2 only for 112 clusters, which shows that a high number of clusters is required in order to increase the average score. In general, the Silhouette Coefficient will increase with the number of clusters.

(4) With the removal of cluster 0, we observe an increase in the average score, based on the structure of the remaining clusters. From Figure 8, we can see that roughly half of the clusters score below 0.6, which according to Table 5 might be considered a weak or artificial structure.

4.3.2.2 Our Results. The scores from the baseline method were based on 10 000 rows, so to ensure a fair comparison we sampled the same amount of data from the same dataset. Interestingly, in our case we did not have a single cluster with noise; rather, every cluster was partially filled with noise. Considering this, we could not approach the problem as in the baseline, because removing whole clusters would also remove good data. Instead, we decided to utilize Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for noise removal. We found multiple sources, such as the works of Jiapeng Huang et al. [14] and Li Ma et al. [23], in which DBSCAN has proven useful for noise removal. As the idea was to generalize the solution, we did not tune the two parameters of the algorithm, but used the default setup suggested in the documentation of scikit-learn [31].
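For illustration only, a minimal O(n²) version of the DBSCAN procedure, labeling noise as -1 exactly as scikit-learn does, could look like this (in practice we used sklearn.cluster.DBSCAN with its defaults, eps=0.5 and min_samples=5):

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Toy DBSCAN: labels[i] is a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:       # not a core point (for now): noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:  # j is core: keep expanding the cluster
                queue.extend(reach)
    return labels
```

Filtering then amounts to dropping every point whose label is -1 and re-clustering the remainder.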

Due to the size of the plots, the rest of this sub-section will contain a summary of our results, including the main takeaways. The included plot is a direct comparison to the plot containing the best result as reported in [40], which was for k = 27. The extended version of the evaluation can be seen in Appendix A, where we include more tests and visualizations to defend our conclusions.

Figure 9 shows that for k = 28, with noise, our average Silhouette score is 0.296, which, although weak according to Table 5, is still an improvement compared to the baseline approach.

In the baseline approach, Georgios Vletsas et al. [40] manually removed the noise, which, as they describe, accounts for a large part of the dataset, yet they did not specify an exact percentage. In our case, the DBSCAN approach for noise removal similarly removed a large part of the data: roughly 40% for the keywords based on URLs and roughly 30% for the keywords based on scraped content. During the manual check we noticed that a lot of the large transactions were removed, as well as others with mixed signals in them. An example of the results can be seen in Figure 10, which shows the Silhouette representation for k = 27, similar to the baseline approach.


Figure 9: Silhouette Score without removing Noise on the Keywords extracted from URL’s, for k = 28.

Figure 10: Silhouette Score with removed Noise on the Keywords extracted from URL's.

An example of the final output can be seen in Figure 11, where the visitorId is mapped to the corresponding numeric cluster and a set of keywords representative of the cluster.

Figure 11: Filtered example of the final output of our approach with corresponding cluster labels.

As an addition, we evaluated our approach on keywords from the visited page, incorporating web scraping in our setup. However, as this was not done in the baseline method, we present the results in Appendix A and only summarize them here.

The main takeaways from our results are:

(1) In the baseline paper, they did not provide an overall plot showing the score per number of clusters, so we had to draw our conclusions from the provided plots, where the highest average score was for k = 112, a bit over 0.2. The average score is based on the structure of all clusters, and on this basis our 'noisy' clustering is better than the reported one. Although we had a better average score, the lack of a single 'noise' bucket resulted in noise within the clusters. This resulted in poor quality, a score below 0.6, for less than half of our clusters.

(2) We did not have a single 'noise' bucket; however, by using the default parameters of DBSCAN we were able to clean the overall noise and outperform their reported score. They do not claim this is the best possible score, but they do not report a higher one.

(3) Using scraped data drastically increased the input shape for the Neural Network, but it was worth it, as it showed better results.

(4) Taking it one step further we mapped the visitorId to a cluster together with the corresponding keywords for this cluster.

5 Conclusion

The main conclusion of this thesis is that clustering the learned features from Convolutional Neural Networks outperforms the use of k-modes clustering with and without noise removal. We base this conclusion on the Silhouette score, according to which our clusters have a better quality. Besides this, we were able to extract cleaner keywords from the URL by creating a new pre-processing function. Although our clusters are better, domain knowledge is still needed in order to make better sense out of the clusters, as also suggested in [40]. This being said, the cluster labels can still be used as suggestions for new or missing clusters for the system administrators.

An interesting point is that in the baseline approach they had a single cluster containing most of the noise, which allowed them to simply remove the noisy cluster. In our case, we had to incorporate another clustering algorithm, DBSCAN, as a noise removal step. Throughout our work we noticed that in some cases the transaction function suggested by Ming-Syan Chen et al. [25] creates large transactions, which cannot be clustered in a single cluster, as they have multiple representatives. With this in mind, it would be worthwhile to further investigate the grouping of URL visits. Another point worth investigating is the use of other Neural Networks in combination with either keywords from the URL or keywords from scraped content. For example, Recurrent Neural Networks could be used to build custom embeddings.

Last but not least, it might be useful to investigate another approach for noise removal, or to fine-tune DBSCAN for the data at hand. Bearing in mind that the default settings of the clustering algorithm already improved our results drastically, we expect fine-tuning its parameters to bring even more value.

As an addition, we conclude that using keywords from the scraped content of a visited page yields better results than clustering visitors based on keywords from the URL alone. Therefore


we can say that the scraped keywords are more descriptive. This is an interesting finding, which would help in user segmentation, bearing in mind that we were not able to find an open dataset suitable for this task.

6 Acknowledgments

I would like to express my gratitude to Michael Metternich for giving me the opportunity to be part of BloomReach and to work on this project, as well as for all the discussions and guidance. I would also like to especially thank Chang Li for being my supervisor, for always finding time for a short discussion and for his feedback throughout the project. I would also like to thank my family, close friends and colleagues for all the support and help they have given me. Last but not least, I would like to thank Dr. Maarten Marx for agreeing to be my second examiner.

References

[1] Nils Ackermann. 2018. Introduction to 1D Convolutional Neural Networks in Keras for Time Sequences. (Sep 2018). https://blog.goodaudience.com/introduction-to-1d-convolutional-neural-networks-in-keras-for-time-sequences-3a7ff801a2cf

[2] Jason Brownlee. 2019. A Gentle Introduction to Convolutional Layers for Deep Learning Neural Networks. (Apr 2019). https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/

[3] Jason Brownlee. 2019. How to Develop 1D Convolutional Neural Network Models for Human Activity Recognition. (Apr 2019). https://machinelearningmastery.com/cnn-models-for-human-activity-recognition-time-series-classification/

[4] Jason Brownlee. 2019. How to Use Word Embedding Layers for Deep Learning with Keras. (May 2019). https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

[5] Wikipedia Community. 2019. Hungarian algorithm. (May 2019). https://en.wikipedia.org/wiki/Hungarian_algorithm

[6] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. 1997. Grouping web page references into transactions for mining world wide web browsing patterns. In Proceedings 1997 IEEE Knowledge and Data Engineering Exchange Workshop. IEEE, IEEE, 3 Park Avenue, 17th Floor New York, 2–9.

[7] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. 1999. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems 1 (04 1999). https://doi.org/10.1007/BF03325089

[8] Aysegul Dundar, Jonghoon Jin, and Eugenio Culurciello. 2015. Convolutional Clustering for Unsupervised Learning. CoRR abs/1511.06241 (2015). arXiv:1511.06241 http://arxiv.org/abs/1511.06241

[9] J E. Angus. 1991. Criteria for Choosing the Best Neural Network: Part 1. Missing Missing, Missing (07 1991), 28.

[10] Karl Groves. 2007. The Limitations of Server Log Files for Usability Analysis. (Oct 2007). http://boxesandarrows.com/the-limitations-of-server-log-files-for-usability-analysis/

[11] Fred Guth. 2019. Mutual information. (Jun 2019). https://en.wikipedia.org/wiki/Mutual_information

[12] Kashmir Hill. 2016. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. (Mar 2016). https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#97b645366686

[13] Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou. 2015. Short text understanding through lexical-semantic analysis. Proceedings - International Conference on Data Engineering 2015 (05 2015), 495–506. https://doi.org/10.1109/ICDE.2015.7113309

[14] Jiapeng Huang, Yanqiu Xing, Haotian You, Lei Qin, Jing Tian, and Jianming Ma. 2019. Particle Swarm Optimization-Based Noise Filtering Algorithm for Photon Cloud Data in Forest Area. Remote Sensing 11, 8 (Apr 2019), 980. https://doi.org/10.3390/rs11080980

[15] Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding Convo-lutional Neural Networks for Text Classification. CoRR abs/1809.08037 (2018). arXiv:1809.08037 http://arxiv.org/abs/1809.08037

[16] Jhabel. 2019. Unsupervised learning. (Jun 2019). https://en.wikipedia.org/wiki/Unsupervised_learning

[17] Julia Kho. 2018. How to Web Scrape with Python in 4 Minutes. (Sep 2018). https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460

[18] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746–1751. https://doi.org/10.3115/v1/D14-1181

[19] Kku. 2019. E-commerce. (May 2019). https://en.wikipedia.org/wiki/E-commerce

[20] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15). AAAI Press, Austin, Texas, Article 2886636, 7 pages. http://dl.acm.org/citation.cfm?id=2886521.2886636

[21] Tilman Lange, Volker Roth, Mikio Braun, and Joachim Buhmann. 2004. Stability-Based Validation of Clustering Solutions. Neural Computation 16 (07 2004), 1299–1323. https://doi.org/10.1162/089976604773717621

[22] Alexander Lenail. 2019. NN-SVG. (2019). http://alexlenail.me/NN-SVG/index.html

[23] Li Ma, Lei Gu, Bo Li, Sou yi Qiao, and Jin Wang. 2014. G-DBSCAN: An Improved DBSCAN Clustering Method Based On Grid. In Conference Papers. 23–28. https://doi.org/10.14257/astl.2014.74.05

[24] Tomas Mikolov, Kai Chen, G.s Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR 2013 (01 2013).

[25] Ming-Syan Chen, Jong Soo Park, and P. S. Yu. 1996. Data mining for path traversal patterns in a web environment. In Proceedings of 16th International Conference on Distributed Computing Systems. IEEE, IEEE, 3 Park Avenue, 17th Floor New York, 385–392. https://doi.org/10.1109/ICDCS.1996.507986

[26] Harsh Pokharna. 2016. The best explanation of Convolutional Neural Networks on the Internet! (Jul 2016). https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8

[27] Nikhil Prakash. 2019. Tf–idf. (May 2019). https://en.wikipedia.org/wiki/Tf–idf

[28] Radim Rehurek. 2019. gensim: topic modelling for humans. (Apr 2019). https://radimrehurek.com/gensim/summarization/keywords.html

[29] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (1987), 53 – 65. https://doi.org/10.1016/0377-0427(87)90125-7

[30] Sayantini. 2019. Keras vs TensorFlow vs PyTorch | Deep Learning Frameworks. (May 2019). https://www.edureka.co/blog/keras-vs-tensorflow-vs-pytorch/

[31] scikit-learn developers. 2019. sklearn.cluster.DBSCAN. (2019). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

[32] scikit-learn developers. 2019. sklearn.metrics.normalized_mutual_info_score. (2019). https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html

[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (06 2014), 1929–1958.

[34] StackOverflow. 2012. Predict Closed Questions on Stack Overflow. (2012). https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/

[35] Google Developers Team. 2019. Classification: Accuracy | Machine Learning Crash Course | Google Developers. (2019). https://developers.google.com/machine-learning/crash-course/classification/accuracy

[36] HostingFacts Team. 2019. Internet Statistics and Facts (Including Mobile) for 2019. (2019). https://hostingfacts.com/internet-facts-stats/

[37] Keras.io Team. 2019. Core Layers. (2019). https://keras.io/layers/core/

[38] Keras.io Team. 2019. Keras Documentation. (2019). https://keras.io/layers/embeddings/

[39] /@urvashilluniya. 2019. Why Data Normalization is necessary for Machine Learning models. (Apr 2019). https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029

[40] Georgios Vletsas. 2018. Automated Visitor Segmentation and Targeting. Master Thesis. University of Amsterdam, Master Software Engineering, Science Park 904, Amsterdam, the Netherlands, 43. http://scriptiesonline.uba.uva.nl/document/660630

[41] Jenq-Haur Wang, Ting-Wei Liu, Xiong Luo, and Long Wang. 2018. An LSTM Approach to Short Text Sentiment Classification with Word Embeddings. In Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018). The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Hsinchu, Taiwan, 214–223. https://www.aclweb.org/anthology/O18-1021

[42] Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short Text Clustering via Convolutional Neural Networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. Association for Computational Linguistics, Denver, Colorado, 62–69. https://doi.org/10.3115/v1/W15-1509

[43] Ye Zhang and Byron C. Wallace. 2015. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. CoRR abs/1510.03820 (2015). arXiv:1510.03820 http://arxiv.org/abs/1510.03820


A Extended Evaluation

A.1 Baseline Approach Extended

The rest of the baseline evaluation investigated the outcomes when k is set to a value lower than 27 (k = 14, Figure 12) and a value higher than 27 (k = 40, Figure 13). The smaller choice of k led to the creation of new noise, as there were not enough clusters to separate the remaining data. When investigating the larger value, it became obvious that after 27 clusters the clusters were starting to be split into smaller sub-clusters.

Figure 12: Silhouette Score for k = 14, after removing unfit data.

Figure 13: Silhouette Score for k = 40, after removing unfit data.

A.2 Our Results with in-house data

In this subsection we present additional results regarding the keywords extracted from URL's, with and without noise. In addition, we present results based on the keywords extracted from the scraped content of the page visited by the user. The last set of figures presents the overall movement of the average Silhouette score. For the purpose of evaluating against the baseline approach, we chose the same numbers of clusters as those presented in the baseline evaluation.

Figure 14 and Figure 15 show the cluster quality for the Keywords from the URL, based on k = 80 and k = 112, with noise.

Figure 14: Silhouette Score without removing Noise on the Keywords from URL’s, k = 80.

Figure 15: Silhouette Score without removing Noise on the Keywords from URL’s, k = 112.

Figure 16 and Figure 17 show the cluster quality for the Keywords from the URL, based on k = 14 and k = 40, with noise removed based on DBSCAN default parameters.

Figure 16: Silhouette Score with removed Noise on the Keywords from URL's, k = 14.


Figure 17: Silhouette Score with removed Noise on the Keywords from URL's, k = 40.

Figure 18 and Figure 19 show the cluster quality for the Keywords based on scraped content, without removing noise, for k = 80 and k = 112.

Figure 18: Silhouette Score without removing Noise on the Keywords based on scraped content, k = 80.

Figure 19: Silhouette Score without removing Noise on the Keywords based on scraped content, k = 112.

Figure 20 and Figure 21 show the cluster quality for the Keywords based on scraped content, with noise removed based on DBSCAN default parameters, for k = 14 and k = 40.

Figure 20: Silhouette Score with removed Noise on the Keywords based on scraped content, k = 14.

Figure 21: Silhouette Score with removed Noise on the Keywords based on scraped content, k = 40.

Figure 22: Silhouette Avg Score movement with Noise on URL Keywords.

Figure 22 shows the overall movement of the average quality of the clusters based on Keywords extracted from URL's, with noise. The figure shows that our highest score is 0.437 for k = 120, whereas


in comparison, the baseline performance before the noise removal was around 0.2. As the figure shows, the score keeps growing with the number of clusters. We tested this by running the clustering for every number of clusters up to 400; the score for k = 400 was 0.578 and the curve showed continuous growth. Figure 23 shows the overall score with removed noise, where the highest score is 0.999 with 100 clusters. We tested with a maximum of 120 clusters.

Figure 23: Silhouette Avg Score movement without Noise on URL Keywords.

Figure 24 shows the average quality of the clusters based on Keywords extracted from the scraped content, with noise. The figure shows that our highest score is 0.623 for k = 118. Figure 25 shows the overall score with removed noise, where the highest score is 0.93 with k = 64. We tested with a maximum of 120 clusters.

Figure 24: Silhouette Avg Score movement with Noise on Scraped Keywords.

Figure 25: Silhouette Avg Score movement without Noise on Scraped Keywords.
