
Automated Visitor Segmentation

and Targeting

Georgios Vletsas

georgevle93@gmail.com

Spring 2018, 43 pages

Supervisor: Dr. Adam Belloum

Host organisation: BloomReach BV

Organisation Supervisor: Michael Metternich

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

Master Software Engineering

Contents

Abstract

1 Introduction
  1.1 Initial Studies
  1.2 Problem Statement
    1.2.1 Research Questions
    1.2.2 Solution Outline
    1.2.3 Research Methods
  1.3 Contributions
  1.4 Related Work
  1.5 Outline

2 Background
  2.1 Clustering
  2.2 Clustering Problems
    2.2.1 Categorical Data
    2.2.2 Large Data Sets
    2.2.3 Concept Drift
    2.2.4 Computational Complexity
  2.3 Clustering Algorithms
    2.3.1 K-Modes
    2.3.2 Alternatives
  2.4 User Transactions
    2.4.1 Reference Length
    2.4.2 Time Window
    2.4.3 Maximal Forward Reference
  2.5 Evaluation Methods
    2.5.1 Silhouette Coefficient

3 Automated Visitor Segmentation with Python
  3.1 Process Outline
    3.1.1 Reading the Dictionary
    3.1.2 Filtering
    3.1.3 Sorting
    3.1.4 Identifying Transactions
    3.1.5 Normalise Transactions
    3.1.6 Remove Unnecessary Data

4 Segment Identification
  4.1 K-Modes Usage
  4.2 Centroid Structure

5 Evaluation
  5.1 Results Synopsis
  5.2 Result Analysis
    5.2.1 Experimental environment and evaluation indexes
    5.2.2 Evaluation on scalability
    5.2.3 Evaluation on clustering quality
  5.3 Research Questions and Answers
  5.4 Threats to Validity
  5.5 Future Work

6 Conclusion

Abstract

In web commerce, a common way to increase conversion rates is to understand your audience and to provide them with a unique user experience. This is achieved by a process called personalisation, meaning to make content more relevant to your visitors. Grouping visitors into segments and providing targeted content is one of the many personalisation techniques; however, so far the process is limited because it requires manual creation of segments, with little data to back up their validity. Furthermore, not all audience types are obvious to the administrator of the website, leading to the possibility that many segments are missed. Data sets in these scenarios almost always consist solely of categorical (text) data, making them hard to handle with clustering algorithms.

In our work, we explored the possibility of creating a pipeline that automatically processes the historical data of a website and detects segments of the most common types of visitors, thus providing a way to determine and target the website's regular audience. We designed and implemented a scalable solution that can be used to process high-volume real-world data sets and can provide structurally sound segments using categorical data. The prototype was tested against a large real-world data set, and the resulting segments were evaluated and analysed in depth to provide a sound analysis of threats to validity and correctness.

Chapter 1

Introduction

In the world of web commerce, marketers and software developers alike are working to provide website visitors with easy, relevant and pleasant user experiences. User experience entails the way the user experiences a product, and is commonly overshadowed in the design process by aspects of product design such as aesthetic appeal and functionality. Garrett[Gar10] discusses the neglect user experience faces, and its importance, especially in web environments. User experience on the web is referred to as a Digital Experience, and an important aspect of a good Digital Experience is personalisation. Personalisation entails that the content a user sees is relevant to their interests, and aims to keep them browsing the website for a longer period of time and ultimately convert them into customers.

One of the many layers of personalisation involves creating user segments, so as to have an initial distinction between website visitors. A user segment represents a group of visitors that all share a set of similar characteristics. Each segment can be used to personalise the website content, with users fitting into said segment being presented with the corresponding version of the website.

In its current state, visitor segmentation is realised manually by the website administrator, something which can cause inaccuracies and may overlook the existence of segments that are of interest to the website. In this work we study whether we can design and develop a method to exploit the historical data of a website's visitors to create segments based on their behaviour and context. This method accepts unfiltered visitor data, processes it to a point where it may be used to create meaningful segments, and identifies those segments and their characteristics. We implement this as the Automated Visitor Segmentation (AVS) system.

1.1

Initial Studies

Over the years, studies have attempted to allow for web audience profiling using different methods. Some of these methods include using URL statistics[KDB+16], user transaction analysis[GA05], visitor journey analysis and user interests[ALA+11]. Most of these studies present viable, albeit computationally intensive, methods, but they always apply to specifically tailored data sets, commonly using straightforward data such as user location and neglecting the behaviour of the visitor and the context of the visit. Furthermore, there are common problems[WPB01] that need to be addressed, such as careful data pre-processing to maintain data tidiness[W+14], computational complexity, concept drift (changing user interests over time) and the need for large data sets. Pre-processing of the data in this case is closely related to feature extraction. It ensures that the data set is structured and modified to present contextual data usable by a machine learning algorithm capable of producing accurate segments.

One of the main ways of extracting contextual data is extracting user transactions from the user's web journey. Transactions in this context are individual page accesses grouped into semantic units. The most interesting algorithms solving this are the Reference Length module[CMS97a], which correlates the amount of time a user spends on a webpage with the classification of the page as navigation or content, and the Maximal Forward Reference module, which considers a backward reference as the end of a transaction. The most successful of the studies appear to use either Bayesian or Dempster-Shafer[XP01] approaches in combination with an unsupervised clustering algorithm.

Currently they all face the challenge of computational complexity, with the exception of Ahmed et al.[ALA+11]; however, their solution works on an individual level rather than a group level. This means that their work detects what one individual user might be interested in, rather than identifying groups of users with similar browsing patterns and interests. Our work is different in that it tries to escape single use cases, such as being applied to one specific website, and to provide an adaptable and extensible algorithm that can be used on different websites without requiring modification. It does, however, require that the websites use the same data skeleton, albeit with widely standardised data such as the user ID.

Data sets in web environments such as the one described in the previous paragraph often consist of categorical (text) data. Clustering algorithms such as the K-Means algorithm[Jai10] work exclusively with numerical data, using a Euclidean distance function to assign distances to each data point and thereby create the clusters. The fact that one cannot translate categorical data to distances without sacrificing accuracy renders K-Means unsuitable for this problem in its original form. However, an extension of K-Means called K-Modes was proposed by Huang[Hua98], modified to work with categorical data using a dissimilarity measure proposed by the author. This algorithm was further improved by Cao[CLL+12], who proposed a more efficient and accurate dissimilarity measure that takes into account not only the dissimilarity between objects, but also the dissimilarity between each object and the mode.

1.2

Problem Statement

Segmenting visitors on a website is the first layer of personalisation, with the purpose of giving the visitor the most relevant section of the website to browse according to their interests. From there on, other personalisation techniques take the lead. As it currently stands, both determining and suggesting segments are done manually, based on data such as geographic location or weather, as well as visitor information (if available) queried from back-end systems.

The Hippo Content Management System (CMS) allows website owners to manage and personalise the client's website swiftly and efficiently. One of the features of the system is grouping users into categories called Segments and personalising each segment with a different version of the website. In the context of this thesis, we will refer to the website beneficiary working with the CMS as the administrator.

A visitor may have a different goal each time they visit the website. The goal of the visitor may also differ from that of the business. The goal of personalised segments is to bridge the gap between the two goals as much as possible so that both parties gain value. However, we found the following problems:

1. Personal effort is required from the administrator to define segments, and the manual segments do not guarantee relevance/results towards their website’s goal.

2. The administrator does not have a way to ensure that they are not ignoring the existence of relevant visitor segments as derived from historical data of their website.

In this document, we will be using the words Segments and Clusters interchangeably. Unless stated otherwise, they will hold the same meaning.

1.2.1

Research Questions

During this research, we examine the following research questions:

1. Can we process visitor journey data in such a way as to allow the context of their visit to be identified and, if possible, to what extent?

2. Can we use visitor journey data in combination with an unsupervised clustering algorithm to identify user segments that can correctly describe the audience of the website?


3. Can this process be generalised to allow for automation of the pre-processing and clustering of the data, while still being able to provide meaningful segments?

4. How can we determine the optimal number of clusters needed to segment visitors meaningfully without manually pre-processing the data?

1.2.2

Solution Outline

Our solution is implemented as a pipeline, requiring only one set of data, usually called an access log, as the input. A pipeline is a set of instructions/functions executed in sequence, where the output of the previous step is the input to the next.

The data used as the initial input has a set of minimum requirements: it needs to contain the unique visitor ID, the URLs they visited, and the corresponding time-stamps. The data is accepted as either a CSV file or a JSON dictionary, and is transformed and processed as a data frame in the pipeline. In our solution, we use PyCharm 2018.1.4 as the IDE in combination with the languages and packages presented in Table 1.1:

     Item           Version
 1   Python         3.6
 2   pandas         0.22.0
 3   scikit-learn   0.19.1
 4   scipy          1.0.0
 5   numpy          1.14.0
 6   kmodes         0.9
 7   nltk           3.2.5
 8   R              3.5.0
 9   cluster        2.0.7-1

Table 1.1: Used Languages and Packages

The pipeline consists of a total of 7 steps, which can be divided into two categories: data pre-processing and clustering. A simplified outline of the steps in our pipeline is presented below, with more details following in Chapters 3 and 4:

Step  Input              Action                                                     Output
1     Raw data set       Read and transform                                         Data frame
2     Data frame         Filter unnecessary columns                                 Data frame
3     Data frame         Sort by user ID and time-stamp                             Sorted data frame
4     Sorted data frame  Identify transactions and remove extra rows                Data frame
5     Data frame         Extract keywords and content page from transaction paths   Data frame
6     Data frame         Remove transactions that only contain the homepage         Data frame
7     Data frame         Cluster data                                               Cluster centroids

Table 1.2: Pipeline Steps

The pipeline starts by getting the raw visitor data and goes through a series of transformations to satisfy two conditions:

• Have the context of the visit, as dictated by the visitor's behaviour, in data form.
• Ensure the data is in a form from which clustering can produce valid results.

This is achieved by the first group of steps, steps 1 to 6, as listed in Table 1.2. The seventh step achieves the clustering of the data into groups, thus creating the segments. For the purpose of answering our research questions, we restrict our data to the visitor ID, time-stamps, URLs and whether they are a returning visitor. The cluster centroids can be accessed as a data frame, clearly depicting the main characteristics of the segments. Centroids are discussed in more detail in Section 2.3.1.

1.2.3

Research Methods

Before attempting to answer the research questions, we conducted a literature study on web profiling, user segmentation, categorical data clustering and data preparation, in an effort to clarify how data should be structured for our goal and what ways exist to find segments. The results of this literature survey are presented in Section 1.4.

Data Pre-process Design.

We designed the pre-processing steps to improve the clustering technique's ability to yield positive results and to remain scalable. All the steps ensure that the resulting data is in a clean form that allows for easy processing, while also holding information about the context of the visitor's journey.

Hypotheses.

We assume that contextual and behavioural data can be used to profile web users into categories that are relevant to the website's goal. Furthermore, we assume that data changes at such a rate that segments will become obsolete after a certain amount of time. Lastly, we assume that by focusing on the skeleton of the data, we can automate data pre-processing even though the data differs across websites. So, our hypotheses are:

Hypothesis 1: Grouping visitor data by ID and time-stamp will yield the visitor’s journey through the website. This will aid in extracting the user transaction and thus provide contextual information to aid clustering.

Hypothesis 2: Using the visitor journey data in combination with a Reference Length or Maximal Forward Reference module[GA05, CMS97a] will yield the user transactions (traversals through navigation pages up to the first content page). These transactions should positively impact the results of the clustering algorithm and hold contextual data about the visit.

Hypothesis 3: Using K-Modes clustering[Hua98, CLL+12] in combination with data from Hypothesis 2, we can cluster visitors to retrieve segments that represent the most common types of visitors of the website.

Prototype.

We prototyped our solution by implementing the suggested pipeline and examined the resulting segments for validity.

Evaluation.

To argue about the correctness of our pipeline, we provide evidence in graphical form in addition to an analysis of our results.

We used a pre-computed distance matrix of our data set, in combination with the Silhouette method[Rou87] to demonstrate that the structure of our clusters is adequate.

1.3

Contributions

Contribution #1: Pipeline Design. For our data to be able to produce structurally strong clusters containing the desired results, we had to design a pipeline that handles the pre-processing of the data to bring it into the desired form. The pipeline is composed of steps such as transaction extraction, access log filtering, data transformation and manipulation. The process is fully automated and only requires from the user the initial access log data, in the form of JSON or CSV, and the URL variants of their homepage. Our work implements the Maximal Forward Reference model[CMS97a, MJHS96] to extract transaction information, and adapts parts of the process from the work of Xie et al.[XP01] in terms of filtering and data selection.

Contribution #2: Prototype Implementation. We implemented our pipeline design and applied it, together with the K-Modes algorithm, to a real-world data set of up to 1 million rows in order to evaluate its correctness and performance.

Our system achieves scalability successfully, as opposed to similar systems such as that of Xie et al.[XP01]. Our prototype implementation is fully automated, and it can adapt to different numbers of observations (columns) in the data set, subject to K-Modes' ability to still produce correct results. For every step of the pipeline, we explain and argue about the choices we have made and discuss the limitations. The results are analysed in Chapter 5.

Contribution #3: Proof Outline. After collecting evidence that our system works correctly by applying the tool to a large data set, we provide evidence of the structural correctness of our clusters using the Silhouette Coefficient[Rou87]. To further remove doubts about correctness, we provide an analysis of the evaluation and we create and test theories that would threaten its validity.

1.4

Related Work

In this section, we outline related work that was used to shape the results of this research.

Extracting Contextual Information

As a way to mine user browsing patterns and examine when users find an item of interest, Cooley et al.[CMS97a] implement three different methods of separating the visitor journey into individual page accesses grouped into semantic units, called Transactions, as proposed by Mobasher et al.[MJHS96]. The methods are the Reference Length (RL) module, the Maximal Forward Reference (MFR) module and the Time Window (TW) module. The RL module is based on the assumption that the amount of time a user spends on a page correlates with whether the page should be classified as a navigation or content page for that user. The navigation pages are expected to have a low variance of times, and the navigation references make up the lower end of the curve. The MFR module is based on how many pages are in the path from the first page up to the first page from which a backward reference is made. This assumes that the end page is the reference page and the pages up to that point are the navigation pages. Finally, the TW module defines a time parameter and divides the access log into time intervals.

Cooley et al. evaluated the aforementioned modules for traversal path analysis in further studies[CMS97a], with Reference Length outperforming the other two and Maximal Forward Reference being a close second in terms of accurate results under different conditions. It should be noted that Maximal Forward Reference heavily outperformed the others on sparse data, while Reference Length thrived on well-connected data. Xie et al.[XP01] further implemented the modules, in combination with a few other data processing techniques, for web user clustering. In our approach we use transactions for a similar goal, albeit with a different clustering mechanism.

Web User Profiling

Profiling web visitors has been an ongoing topic of research, facing problems such as concept drift and computational complexity[WPB01]. Researchers such as Xie et al.[XP01] have managed to cluster users using the Dempster-Shafer theory[Sha16] on access log data. They managed to meaningfully use transactions and other techniques mentioned by Wickham et al.[W+14], and their research serves as an important basis for our work. However, we use a different method to cluster visitors, namely K-Modes, whereas their approach is limited to URL data and suffers from a scalability problem due to its focus on accuracy.

Webb et al.[WPB01] discuss how, in web profiling, accuracy is less important than computational complexity, because low complexity enables use in high-volume real-world situations. A scalable solution is presented by Ahmed et al.[ALA+11], being able to process 200GB of data in less than two hours. However, their method requires the possible topics of interest to the visitors to be hand-picked before processing for the algorithm to provide meaningful clustering.

Categorical Data Clustering

Huang[Hua98] examined the problem of K-Means[HW79] not being able to cluster categorical data without transforming it into numerical form. He proposed a dissimilarity metric to define the distance between categorical objects, and developed two extensions of K-Means called K-Modes and K-Prototypes. The former handles exclusively categorical data, while the latter handles a mix of numerical and categorical data, with the added advantage that both algorithms scale linearly as opposed to the quadratic scaling of K-Means. In 2012, Cao[CLL+12] further improved the algorithm by proposing a new dissimilarity function that outperformed Huang's. In our approach we only handle categorical data, making K-Modes a prime candidate as our clustering method.

Data Preparation

Preparing data for mining is a step that is stressed in many research papers relevant to segmentation. Cooley et al.[CMS99] discuss the need to filter out unnecessary data to reduce the load, and how transactions can be used to concisely represent a visitor's behaviour. Wickham et al.[W+14] focus on keeping data sets tidy. This is separated into input-tidy and output-tidy, and both facilitate better manipulation, visualisation and modelling of the data. An important takeaway is normalising the data, along with the importance of separating data into observations and variables. In our work, many of these tidying aspects are used, such as normalising the JSON data set that we are using and focusing on the user as the variable, whilst the rest of the data is separated into observations about them.

Proving Correctness

Proving the correctness of categorical clustering is difficult to achieve, as not only can the notion of correctness vary with the goals of the data analyst, but the inability to accurately represent distances between categorical variables also hinders the evaluation options. A method that appears to work in most situations is the Silhouette Coefficient[Rou87], which compares, for each item, the mean distance to the other items in its cluster with the distance to the nearest neighbouring cluster. In many studies, such as Xie et al.[XP01], researchers have been forced to use manual evaluation methods. In our work, the Silhouette method works well, and we describe it in more detail in later sections.

1.5

Outline

In this section we outline the structure of this thesis. In Chapter 2 we introduce the background (clustering, evaluation methods, transactions). In Chapter 3 we describe the pre-processing of the data, how the context of the journey is extracted and the steps in our pipeline that achieve it. In Chapter 4 we describe the details of the implementation of the clustering algorithm we used, analyse how we used it, and describe the outline of its output. In Chapter 5 we present our results and their evaluation with an in-depth analysis of both parts, as well as threats to validity and future work. Finally, we conclude in Chapter 6.

Chapter 2

Background

2.1

Clustering

Clustering in data science is defined as the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to objects in other groups. Clustering cannot be defined precisely, and clusters are modelled differently by different people for various uses. As such, choosing the right model for our purpose can be a complicated activity. In our case, we only look at models that can handle categorical data, some of them being:

• Centroid models: K-Means, K-Modes, K-Prototypes, etc.
• Density-based models: DBSCAN, HDBSCAN, etc.
• Neural models: Fuzzy C-Means, Fuzzy Clustering, etc.

Clustering is further separated into Soft clustering (each object belongs to a cluster to a certain degree) and Hard clustering (each object either belongs in a cluster or not).

In our case, the goal is to group the visitors in such a way that our clusters are groups representing the most common types of visitor of the website. As an example, we can have a large group that browses a lot through the sports clothing category and has viewed a lot of shoe-related items. Many similar users are grouped together, making a segment of users that like sports and are interested in shoes.

2.2

Clustering Problems

In clustering visitor data, there is a common set of problems that hinder the usefulness of the algorithms and need to be taken into consideration, namely: handling categorical data, concept drift, computational complexity and the need for large data sets[WPB01]. Some of these problems apply to machine learning in general. In this section we briefly outline them.

2.2.1

Categorical Data

Categorical data in clustering refers to the case where the data objects are defined over categorical attributes. A categorical attribute is an attribute whose domain is a set of discrete values that are not inherently comparable[Agr96]. The term categorical reflects the fact that the observations in the data represent categories and not measures. Centroid-based clustering algorithms often use Euclidean distance as a means of comparing the attributes; however, with the ambiguity of categorical attributes this is not easily accomplished.

Clustering categorical data has been an ongoing problem over the years, with people often avoiding it due to bad results. The sample space of categorical attributes is discrete with no natural origin. One way the problem was handled was to transform the data into numeric values, but this does not necessarily produce meaningful results when the categorical domains are not ordered[Hua98]. Numerous approaches have appeared in the past, but the problem remains that categorical data is not always able to provide viable results due to its nature, and the results might be closer to an estimation rather than fact.

2.2.2

Large Data Sets

Machine learning algorithms do not build a model with acceptable accuracy until a relatively large number of examples is provided[WPB01]. There is also the problem of a data set being too large, which can cause over-fitting (increasing training time and reducing the quality of the results) or an inability to group adequately. As such, the input data can determine the success or failure of the machine learning algorithm.

In our case, large does not necessarily mean a large file size or a large number of rows, but rather a long time-span with an adequate density of information for each day. As an example, a valid data set could be one that contains data for two weeks with 10,000 visitors per day, or one that contains data for a month with 4,000 visitors a day. A data set that contains only a single day's visitors, even if that day has a very large number of visitors such as 100,000, would not provide valuable insight in our case.

2.2.3

Concept Drift

Concept drift is an adjustment problem: an adjustment to the rapidly changing interests of users. Over time we can detect changes in browsing patterns and interests, either in predictable periods (e.g. Christmas, summer) or unexpectedly at random moments in time. Thus, the data that we used and the result it produced might be useful for some time, but they will grow to be non-representative of the current reality, and it is important to be able to adjust to these changes. From a machine learning perspective, this problem is called concept drift.

This problem can be handled manually by processing new data, and that is why we decided it was important for our pipeline to be able to complete in less than an hour. The user can then quickly assess the new data and decide on a new course of action regarding the targeting of their segments.

However, this suffers from the aforementioned problem of requiring a large data set. Being able to detect changes in the behaviour of users is just the first step, and identifying the new segments accurately cannot be achieved without long observation of this behaviour. This is a problem that our work does not solve for unpredictable changes. For predictable periods, the user can use the data from the same period in previous years.

2.2.4

Computational Complexity

The current amount of user activity and the sheer number of users on the internet have surfaced the potential limitations of machine learning algorithms. Their use in these environments is hindered by the computational complexity they exhibit with this amount of data. This creates the need to use supercomputers or to limit the amount of data. However, the problem is still not solved, as this makes up-to-date calculations and use in real-time environments increasingly difficult. For this reason the quality of the results might be discounted for the sake of reducing complexity.

Accuracy is one of the first things that may be discounted. For example, a recommender algorithm with a predictive accuracy of 78% might be preferred over one that achieves 80% if the former requires considerably less CPU time, making it viable for deployment in real-world scenarios[WPB01]. In our work, our pipeline has linear O(n) complexity, which limits this problem to some degree.

2.3

Clustering Algorithms

In this section, we present the background and analysis of the relevant algorithms. We first present our algorithm of choice, K-Modes, and then delve into alternative algorithms, pointing out the reasons why they were not chosen over K-Modes.

2.3.1

K-Modes

Our algorithm of choice, K-Modes, is an extension of the Centroid-based K-Means. Huang[Hua98] proposed this model in 1998 to solve the problem of handling categorical data, and it was improved by Cao et al.[CLL+12] in 2012. Its main distinctions from K-Means are the use of a different dissimilarity metric and the use of the Mode instead of the Mean as the cluster representative. The Mode of a given set can be defined as the item that appears most often in the set, while the Mean is the average of the items in the set. For a better understanding, we first briefly describe the K-Means algorithm.

The K-Means algorithm is one of the most widely used clustering algorithms. Given a set of numeric objects X and an integer number k (≤ n), it searches for a partition of X into k clusters that minimises the within-group sum of squared errors. The integer k represents the number of groups the data points will be divided into. This process is often formulated as the following mathematical problem P1[SI84]:

    Minimise  P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \, d(X_i, Q_l)        (2.1)

    subject to  \sum_{l=1}^{k} w_{i,l} = 1, \quad 1 \le i \le n
                w_{i,l} \in \{0, 1\}, \quad 1 \le i \le n, \ 1 \le l \le k          (2.2)

where W is an n × k partition matrix, Q = \{Q_1, Q_2, \ldots, Q_k\} is a set of objects in the same object domain, and d is the squared Euclidean distance between two objects. It is important to know that the main functionality of the algorithm is that it groups objects together by calculating the Euclidean distance between them.

The central item of each cluster is called a Centroid, and the algorithm randomly assigns Centroids in the first iteration as a starting point. With these Centroids as the base, the algorithm calculates the (Euclidean) distance between the items and the Centroids; if items that belong in other clusters are detected in the current one, the mean of all the objects in the cluster is made the new Centroid and the process is repeated. This keeps repeating until objects are no longer moved to other groups after an iteration, meaning they have been grouped optimally. Equation (2.1) is often described as the cost function. Cost is defined as the sum of dissimilarities of all points with their closest Centroids. This represents the whole data set, not cluster to cluster. As such, if all clusters but one have a great structure and that one is very poor, the overall cost would still be high.

The problem formulation described above is valid for categorical and mixed-type objects; however, the dissimilarity metric used is what prevents it from clustering those objects. Huang made three modifications to resolve this:

1. Use a simple matching dissimilarity measure for categorical objects.
2. Replace the means of clusters by modes.

3. Use a frequency-based method to find the modes to solve the problem.

Dissimilarity Metrics. Dissimilarity metrics are used to compare two data distributions. In essence, they evaluate how close two items are, either in a distance space or by defining a dissimilarity measure between them. Closeness can be expressed in two ways: similarity measures (how alike two items are) or dissimilarity measures (how different two items are). Common properties among dissimilarity measures are:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q
2. d(p, q) = d(q, p) for all p and q

3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

A distance that satisfies these properties is called a metric. The dissimilarity metric for K-Modes is commonly referred to as the simple matching metric[KR09]. When comparing two objects described by m categorical attributes, the number of mismatches between the attributes of the two objects defines their dissimilarity. The smaller the number of mismatches, the more similar the objects. So, let X and Y be two categorical objects described by m categorical attributes. Formally, the dissimilarity function can be described as:

    d_1(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j)        (2.3)

where

    \delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \neq y_j) \end{cases}        (2.4)

With the cost function (2.1) becoming:

    P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,l} \, \delta(x_{i,j}, q_{l,j})        (2.5)

where

    w_{i,l} \in W \quad \text{and} \quad Q_1 = [q_{1,1}, q_{1,2}, \ldots, q_{1,m}] \in Q        (2.6)

Huang showed in his work that categorical clustering with the aforementioned dissimilarity function provided good results. Later, Cao et al.[CLL+12] formulated a new dissimilarity metric, exclusively for use with this algorithm, based on the rough membership function. In our work, we use Cao's metric, defined as follows:

Formally, a categorical information system is a quadruple IS = (U, A, V, f) where:

• U is the nonempty set of objects, called the universe;
• A is the nonempty set of attributes;
• V is the union of all attribute domains, i.e., V = \bigcup_{a \in A} V_a, where V_a is the domain of attribute a and it is finite and unordered;
• f : U \times A \to V is a mapping called an information function such that for any x \in U and a \in A, f(x, a) \in V_a.

Let IS = (U, A, V, f) and P ⊆ A. For any x, y ∈ U, the dissimilarity measure between x and y with respect to P is defined as:

    d_P(x, y) = \sum_{a \in P} d_a(x, y)        (2.7)

where

    d_a(x, y) = 1 - \mathrm{Sim}_a(x, y)        (2.8)

and

    \mathrm{Sim}_a(x, y) = \frac{\left| [x]_{\{a\}} \cap [y]_{\{a\}} \right|}{\left| [x]_{\{a\}} \right|}        (2.9)

The algorithm using Cao's dissimilarity function initially calculates the similarity between objects according to the equivalence between their attributes, similar to, albeit slightly different from, equation (2.3). It then proceeds to calculate a dissimilarity based on equation (2.7) between every object and all the current cluster centers, and assigns each object to the cluster whose mode is closest to it. After the assignments are finished, it uses equation (2.3) again to calculate the new modes and the process repeats. The algorithm, its definitions and its evaluation can be seen in more detail in Cao's paper[CLL+12]; however, this is the basic understanding needed for our work.
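To make the simple matching measure of equations (2.3) and (2.4) concrete, the following is a minimal Python sketch; the function name and the example objects are hypothetical and are not part of the thesis code.

def simple_matching_dissimilarity(x, y):
    # Equation (2.3): sum of per-attribute comparisons, where each mismatch
    # contributes 1 and each match contributes 0 (equation (2.4)).
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

# Two visitors described by (keywords, content page, new visit): they differ
# only in the content page, so d1 = 1.
print(simple_matching_dissimilarity(("sports", "shoes", "true"),
                                    ("sports", "shirts", "true")))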

Advantages. The choice of this algorithm for our work was influenced by the following initial set of factors:

• Our goal aligns with the algorithm's main functionality: grouping together the most frequent similar items in the data set.
• The computational complexity is linear, O(n), allowing large data sets to be processed regularly, thus also limiting concept drift.
• If needed, the algorithm also supports mixed-data (numerical and categorical) clustering under the name K-Prototypes.
• It is Centroid based, allowing us to be flexible in our results. In density-based algorithms we would not be able to choose the number of centres, thus losing the choice of specificity.

Being able to solve the computational complexity problem was one of the most crucial factors in choosing this algorithm. Currently, processing high-volume real-world data has proven a challenge in web personalisation[WPB01, CMM+01]. The system that we aim to develop should provide results in such environments, and a linear complexity allows for good scalability. This enables users to use large amounts of data to improve the precision of the results without being hindered by computational complexity. Furthermore, these data sets often consist solely of categorical data, and the algorithm can handle categorical data and be extended to support mixed data with little effort.

Last but not least, the algorithm serves exactly what we are aiming to do. Specific segments are in most cases going to be of higher quality if made by hand, as each administrator knows the target groups for their website best. Our pipeline would not be able to capture that type of audience using historical data. The purpose of our pipeline is therefore to recommend segments, based on this data, that the administrator might not be aware of, allowing them to understand the audience that visits their website and target it accordingly. K-Modes' ability to surface the most frequent items in a data set can therefore be smoothly applied to our case. It also allows for flexibility in the results. Choosing the number of groups k is left up to the administrator, as the goal can be highly subjective. Fewer segments give a more general view of the audience, and as the number increases, so does the specificity, allowing the user to target based on their preferences and environment. It is important to note that this flexibility is made possible by the sparsity and volume of the data, for reasons we discuss in Chapters 2 and 5.
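As an illustration of how the kmodes package from Table 1.1 can be invoked on purely categorical data, consider the following minimal sketch. The column names and values are hypothetical, and this is not the exact invocation used in the AVS prototype.

import pandas as pd
from kmodes.kmodes import KModes

# Hypothetical pre-processed frame containing only categorical observations.
frame = pd.DataFrame({
    "keywords":    ["sports", "sports", "sale",   "sports"],
    "contentPage": ["shoes",  "shoes",  "shirts", "shoes"],
    "newVisit":    ["true",   "false",  "true",   "true"],
})

# n_clusters is the number of segments k, left to the administrator;
# init='Cao' selects the package's Cao-based initialisation scheme.
km = KModes(n_clusters=2, init='Cao', n_init=5, verbose=0)
labels = km.fit_predict(frame)

# The modes of each cluster, readable as segment characteristics.
centroids = pd.DataFrame(km.cluster_centroids_, columns=frame.columns)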


Drawbacks. The main drawbacks of this algorithm are:

• Results are sensitive to the number of observations in the data.

• The optimal number of k (if needed) cannot be determined automatically.
• It does not handle noise, so noisy data sets can affect results negatively.

• As with K-Means, the sparsity of the data inside each cluster will determine if the clusters are identified with precision.

The number of observations (columns) in the data set affects the quality of the result, but not as much as the values themselves. If observations contain a large range of possible values, the results will be affected negatively. In our work, we handled sparsity in our URL data by giving URLs a more concise representation through keywords and transactions. The AVS system can successfully handle extra data that has limited values, such as binary or tertiary data points; however, one should refrain from using large amounts of more extensive data observations. This can negatively affect the precision of cluster identification and the general structure of the clusters.

Unlike other types of clustering algorithms, the algorithm cannot detect the optimal number of k for the best structure without user intervention. However, based on our goals this is not a significant drawback, due to the flexibility it provides, as mentioned above.

Finally, the algorithm does not distinguish noise from useful data points, something which is again taken care of by density-based methods. However, sparsity once again would not allow the use of density-based clustering, as it would remove the flexibility and may hinder the overall quality of the results.

2.3.2

Alternatives

Density Based

Density-based clusters are defined by areas of higher density in the data set, and objects outside those areas are considered noise. They work by linking together points that satisfy a certain distance threshold, thus creating a chain of closely related points. DBSCAN and HDBSCAN are very popular clustering algorithms that use the density-based model.

Figure 2.2: Visual density-based clustering example using DBSCAN.

The algorithm outputs dense areas of points, declares them clusters, and tries to assign them common characteristics.

Advantages.

• Removes noise to improve the overall quality of clusters.
• Automatically finds the optimal number of k.
• Can handle categorical variables.

Drawbacks.

• In our case, the optimal number of k may not always be desired, and it removes flexibility.
• In order to work with categorical variables, a transformation method to numerical values is necessary. This may cause loss of information and reduced accuracy.
• If all data is closely related, it is unable to diversify it into sub-clusters.
• If the relation of the items is not enough to satisfy the threshold, clusters without a very strong structure will be missed.
• The dimensionality of the data significantly impacts the performance. A significant decrease is seen on high-dimensional data.
• Has quadratic O(n²) time complexity.

While density-based clustering conveniently handles some of these problems, it creates more than it solves in our case. The most important point is the inability to be flexible with the number of segments, by not being able to find sub-clusters. Dimensionality[MHA17] is also an issue, more so with categorical data, where the threshold for high dimensionality decreases once the data has to be transformed into numerical variables.

Neural Models

Neural models such as fuzzy clustering are a form of soft clustering, as opposed to Centroid and density-based models, which belong to the hard clustering category. This means that each point does not strictly belong to one cluster, but can belong to more than one. This is expressed as a percentage of fit into each cluster (e.g. item a might fit 80% in cluster A but 46% in B), called membership grades[BEF84].

The main algorithm is Fuzzy C-Means[BEF84], and it is very similar to K-Means. The process is the same as K-Means, but the membership grade is additionally calculated after every iteration. It is important to note that the algorithm still assigns Centroids.

Advantages.

• Supports categorical data.
• Is flexible in the number of clusters.

Drawbacks.

• Complexity of O(n · d · c² · i), with n = number of data points, c = number of clusters, d = number of dimensions and i = number of iterations.
• The set of initial parameters required is larger than for K-Means, and results are sensitive to their correctness.
• Relies on one-hot encoding to handle categorical data.
• High sparsity in the data will increase complexity and significantly degrade results.

While having more initial parameters may allow for more flexibility, the increased complexity does not align with the goal of this work to automate the process while requiring as little input as possible. Furthermore, not having a natural dissimilarity metric akin to that of K-Modes, and having to use one-hot encoding, gives questionable validity to the results.

2.4

User Transactions

The generation of large volumes of data makes it difficult to establish the context of each visitor's visit. As such, Cooley et al.[CMS97a, CMS97b] introduced a way to identify the pages within a user's browsing behaviour that serve navigation purposes, and those that serve information content purposes. These groups of pages are named transactions: individual page accesses grouped into semantic units. Transactions are essentially the visitor's journey broken down into smaller sub-journeys that serve different purposes, and consequently hold different contextual information. Furthermore, they help reduce the scope of the information and thus the computational cost of the overall process. In this section we outline the three proposed transaction models.

2.4.1

Reference Length

The Reference Length (RL) model follows the assumption that the amount of time a user spends on a page classifies that page as navigation or content. The expectation is that the variance of times spent on navigation pages is small and that the majority of time is spent on content pages. This can be determined in two ways: either have access to front-end data identifying the exact time a user spent on a page, or calculate the difference between the time of the next reference and the current reference.

The authors[CMS97a] identify the errors this assumption can cause. The last reference has no next reference from which to calculate the difference, multiple visits from the same visitor will cause the difference in time to be hours long, and events such as lunch breaks or phone calls will cause erroneous identification of content pages. They suggest applying a reasonable minimum time threshold in order to weed out these problems.

There are two major drawbacks in this method if we assume that the aforementioned errors are kept to a minimum;

• The model performs poorly on sparse data.

• It requires an input parameter of an estimation of the overall percentage of references that are navigational.

These two drawbacks conflict with our work, and as such we opted not to use this model.

2.4.2

Time Window

The Time Window (TW) model simply divides the journey into time intervals no larger than a specified parameter. This model has the poorest performance of the three; however, as total journey time increases, so does its performance. The model has questionable results compared to the alternatives, so we opted not to use it.

2.4.3

Maximal Forward Reference

The Maximal Forward Reference (MFR) model requires the visitor ID, the URLs, and the access timestamp of each URL. Each transaction is defined as the set of pages in the path from the first page up to the page before a backward reference is made. The final page is the content page and the pages leading to it are the navigation pages.

This method has the advantage that it does not require an input parameter based on an assumption about the characteristics of a particular set of data. Furthermore, it does not require special data such as time spent on a page, and its performance increases for sparse data compared to the other two methods; however, it performs worse on well-connected data. A drawback is that the certainty of the content page being valid is limited by the user's navigational pattern. The user might be able to navigate through the website without making a backward reference, thus classifying some content pages as navigation pages, albeit rarely.

In our work we ultimately chose to work with this model due to the type of information we had available, and most importantly due to the sparsity of our data and the model’s performance.


2.5

Evaluation Methods

Evaluating categorical clustering is a problem on its own, as there is no clear way to distinguish whether items are correctly clustered or not. Success heavily depends on the dissimilarity metric used and the type of results the user is looking for from the clusters. However, the structure of the clusters can be measured with high accuracy using the Silhouette Coefficient. Unfortunately, this was the only method we could find that can evaluate our type of work with confidence without having to resort to real-world testing.

2.5.1

Silhouette Coefficient

Rousseeuw[Rou87] proposed the Silhouette Coefficient as a graphical aid to interpret and validate cluster analysis. Its main purpose is to assess the structural integrity of each cluster, discovering which items fit well within their clusters and which ones lie merely somewhere in between clusters. It is also commonly referred to as the Silhouette width.

The process involves the intra-cluster distance a and the mean nearest-cluster distance b. More specifically, a is the mean distance of the sample from the other samples in the cluster, and b is the mean distance between the sample and all the samples in the nearest cluster that the sample is not a part of. Thus, the Silhouette Coefficient for an object (sample) i is represented by the following formula:

    s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

From the algorithm we receive three types of results: the silhouette width for each sample, the average silhouette width for each cluster, and the overall average silhouette width for the whole data set. The width w is in the range −1 < w < 1, where negative values generally indicate that a sample is assigned to the wrong cluster, and values near 0 indicate overlapping clusters. The following interpretation of the mean silhouette width is proposed by Rousseeuw[Rou87]:

Range        Interpretation
0.71 – 1.0   A strong structure has been found.
0.51 – 0.7   A reasonable structure has been found.
0.26 – 0.5   The structure is weak and could be artificial.
< 0.25       No substantial structure has been found.

Table 2.1: Silhouette ranges interpretation.

The silhouette has been a widely used and successful method in clustering analysis; however, it requires the use of a dissimilarity metric to calculate the score. This could generally pose a problem with categorical data, but it is possible in combination with the K-Modes algorithm, as we can use any dissimilarity measure to calculate a pre-computed distance matrix to use with the silhouette algorithm.
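A minimal sketch of this idea, assuming a pre-processed categorical data frame named frame and the cluster labels named labels produced by K-Modes (both names are hypothetical):

import numpy as np
from sklearn.metrics import silhouette_score

def matching_distance_matrix(frame):
    # Pairwise simple matching dissimilarity: for each pair of rows,
    # count the number of columns in which they differ.
    values = frame.values
    n = len(values)
    distances = np.zeros((n, n))
    for i in range(n):
        distances[i] = (values != values[i]).sum(axis=1)
    return distances

# metric='precomputed' tells scikit-learn to treat the input as a distance matrix.
score = silhouette_score(matching_distance_matrix(frame), labels, metric='precomputed')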


Chapter 3

Automated Visitor Segmentation

with Python

In this section we outline the Automated Visitor Segmentation (AVS) pipeline. For each step of the pipeline, we present its purpose, structure and code, and analyse its functionality. Note that our code also includes unit tests for some of the pipeline steps; however, we purposefully exclude them from the thesis as we do not believe they are relevant to the discussion.

3.1

Process outline

The AVS prototype is implemented as a software pipeline using the Python programming language. A pipeline is a sequence of functions, each taking the output of the previous step as its input. The data input should be in a dictionary form such as JSON (JavaScript Object Notation), or a CSV (Comma Separated Values) file if extended accordingly. The pipeline consists of 7 steps in total, as listed in Table 1.2, of which 6 handle the pre-processing of the data and the final step handles the clustering. A simplified outline of the steps is presented below:

Step  Purpose
1     Read file
2     Filter data
3     Sort data frame
4     Identify transactions
5     Extract information from transactions
6     Remove unnecessary data
7     Cluster data

Table 3.1: Pipeline steps

A pipeline approach was chosen for two main reasons:

1. It allows for a fully automated process with only the initial input required.
2. It is simple to modify the pipeline by adding or removing steps.

The pipeline can indeed be modified easily by exchanging or adding steps in the pre-processing to adapt to changes required over time. It is also possible to use a different clustering algorithm by simply modifying step 7.

The logic behind the steps is as follows. The clustering algorithm requires a data frame with tidy data[W+14]. We wish to identify segments based on contextual information, such as what the visitor did during their visit, so we use transactions to capture this information. For the transactions to be extracted, we require each visitor's journey and the data in the form of a data frame; as such, we need to transform the JSON file into a data frame and group the rows of each visitor together, sorted by timestamp. After we identify the transactions, the information is a list of strings and is not tidy enough for the clustering algorithm to produce good results. This creates the need to transform it into a more concise form. Consequently, we remove unnecessary parts of the URL and only keep the important words that describe the journey of each transaction as keywords. Furthermore, we no longer require the whole transaction path, nor is it clustering-friendly, so we only keep the content page from each transaction, with the contextual information of the journey stored as keywords. Finally, we need to remove any unnecessary data that increases the computational complexity or does not provide useful input, such as users who only visited the homepage and left, or columns that we do not wish to include.

The whole pipeline can be initiated and run from beginning to end using a single function named segmentation_pipeline, with the input parameters being the data file in JSON format and the desired number of segments:

def segmentation_pipeline(self, file_path, number_of_segments=10):
    data_frame = self.read_and_sort_data(file_path)
    data_frame = self.pre_process(data_frame)
    clusters = self.cluster_data(data_frame, number_of_segments)
    return clusters

This returns a data frame containing the cluster centroids with any observations (columns) the user decided to use. In this snippet the columns kept are pre-determined, so that option is not exposed. Python was chosen due to its strong support for machine learning and data science libraries such as scikit-learn and pandas. The purpose of the pre-processing part is to bring the data into a form that is easy to handle by the clustering algorithm while containing contextual information. Our complete codebase is available on GitHub under the main branch. Keep in mind that while the code is functional and provides the wanted results, it is still a prototype. The extensions that we would like to make are noted in their respective subsections.
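For illustration, a hypothetical invocation of the pipeline might look as follows; the class name AVSPipeline and the file name are assumptions, not the actual names in our codebase.

# Hypothetical driver for the pipeline described above.
pipeline = AVSPipeline()
centroids = pipeline.segmentation_pipeline("access_log.json", number_of_segments=8)
print(centroids)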

3.1.1

Reading the Dictionary

The first step of the pipeline is a relatively simple read function. Using the pandas library, we read the given JSON file and store it, transformed, as a data frame. Transforming it into a data frame is important both for making the rest of the pre-processing smooth and for using the clustering algorithm, as matrices are generally the best way to feed data into clustering algorithms.

# Step 1
def json_read(self, filepath, multiline=False):
    file = pd.read_json(filepath, lines=multiline, convert_dates=False)
    print("Step 1/7 - Reading, done...")
    return file

In this function we accept the file path and multiline as input parameters, and return a data frame. The multiline parameter defines whether the JSON file is a multi-line (line-delimited) JSON or a normally structured one, and is set to False by default to accept normal JSON. The function also, by default, does not convert numerical timestamps into a normal date format, as it is important to keep that data intact for the MFR algorithm.

Arguably, the pandas read could be used directly without defining a new function; however, this would make it harder to extend or modify the code if the same behaviour were needed in more than one place. A case where this would be needed would be extending the code to read very large JSON files (over 2GB) that would need to be split and read separately.
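For illustration, a single hypothetical access-log record with the minimum required fields can be written out and read back as shown below; the field values are invented, and real Hippo CMS logs nest most of these fields under collectorData.

import pandas as pd

sample = ('{"visitorId": "a1b2c3", "timestamp": 1526050000000, '
          '"pageUrl": "https://example.com/sports/shoes", '
          '"newVisit": true, "pageId": "content"}\n')

with open("access_log.json", "w") as log:
    log.write(sample)

# lines=True reads line-delimited JSON; convert_dates=False keeps raw timestamps.
frame = pd.read_json("access_log.json", lines=True, convert_dates=False)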


Limitations. As mentioned above, it is not possible to read very large JSON files without first splitting them into smaller files. Our implementation does not yet support this. However, this should rarely be a problem as files of this magnitude more often than not contain more information than necessary to identify good segments.

3.1.2

Filtering

Filtering the data is where we decide what kind of information (characteristics) we want to use for our clustering. This code is intended for use with the Hippo CMS, so it also normalises any deeply nested JSON from the collectorData column, and by default keeps the visitor ID, the timestamp, the page URL, whether the user is a new visitor, and the page ID (which identifies the type of page, e.g. homepage).

1  # Step 2
2  # Keep list defines all the columns to be kept in the pipeline. All others are dropped.
3  def filter_columns(self, data_frame):
4      collector_data = json_normalize(data_frame['collectorData'])
5      all_data = pd.concat([collector_data, data_frame], axis=1)
6      keep_list = ['visitorId', 'timestamp', 'pageUrl',
7                   'newVisit', 'pageId']
8      processed_data = all_data[keep_list]
9      print("Step 2/7 - Filtering, done...")
10     return processed_data

What is returned is the filtered dataframe containing only the selected columns. This could be extended to accept the keep list as an input parameter so the user can define which columns to keep; in our code, however, we decided to keep it static. A possible extension is sketched after the limitations below.

Limitations. In the case of deeply nested JSON that is not normalised by the read function, the function cannot automatically determine which column to normalise and requires custom user intervention. This is a problem if the user needs to keep a column that sits inside this nested JSON. In our case, that column was collectorData, and we had to normalise it and then merge the result into our original dataframe in lines 4 and 5.
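A sketch of the extension mentioned above, in which the caller supplies both the keep list and the name of the nested column to normalise (the signature is our own suggestion, not part of the current prototype):

def filter_columns(self, data_frame, keep_list, nested_column='collectorData'):
    # Normalise the user-specified nested JSON column and merge it back in.
    nested = json_normalize(data_frame[nested_column])
    all_data = pd.concat([nested, data_frame], axis=1)
    return all_data[keep_list]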

3.1.3 Sorting

This step sorts the data frame by visitor ID and timestamp in order to reconstruct each visitor journey and group it together. This makes it faster for the transaction algorithm in step 4 to identify the journeys and then the transactions.

1 # Step 3
2 def json_sort(self, file):
3     sort_by = ["visitorId", "timestamp"]
4     sorted_file = file.sort_values(by=sort_by)
5     print("Step 3/7 - Sorting, done...")
6     return sorted_file

The function is given a fixed list of columns to sort by, visitorId and timestamp, and sorts the dataframe by those values.
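As a small illustration (the column values are made up), sorting a toy frame this way groups each visitor's rows together and orders them chronologically:

import pandas as pd

df = pd.DataFrame({'visitorId': ['B', 'A', 'A'],
                   'timestamp': [5, 9, 3],
                   'pageUrl': ['/home', '/products', '/home']})
print(df.sort_values(by=['visitorId', 'timestamp']))
# Visitor A's rows come first, ordered by timestamp (3, then 9),
# followed by visitor B's single row.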

3.1.4 Identifying Transactions

To identify the transactions we use the Maximal Forward Reference algorithm. Before going into the pipeline step itself, we outline our implementation of the MFR algorithm.


Maximal Forward Reference. The implementation consists of two parts: the initialisation and the algorithm that identifies the transactions. The initialisation is called in the pipeline step and receives the sorted data frame as input. Its purpose is to retrieve all rows of each unique visitor from the sorted data frame and pass their timestamps and URLs as input parameters to the actual algorithm, which returns all transaction paths as a list that is then added to the result list.

1 @staticmethod
2 def init_algorithm(sortedData):
3     result = []
4     grouped = sortedData.groupby('visitorId')
5     print("Initializing Transaction Extraction...")
6     for visitorId, group in grouped:
7         time = grouped.get_group(visitorId).timestamp.tolist()
8         path = grouped.get_group(visitorId).pageUrl.tolist()
9         result_paths = MFAlgorithm.run_MF_algorithm(visitorId, time, path)
10         result.extend(result_paths)
11     return result

The above code groups all rows belonging to each unique visitor ID using the pandas library. We then loop over the groups and, for each visitor, create two lists from that visitor's data: the timestamps and the URLs they accessed. These are passed to the MFR algorithm together with the visitor ID, and the resulting transactions are added to the list of results.
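As a side note, the group variable obtained from the loop already holds the rows of the current visitor, so the two get_group look-ups could be avoided; a leaner but equivalent sketch (not the code we ship) would be:

for visitorId, group in sortedData.groupby('visitorId'):
    time = group.timestamp.tolist()
    path = group.pageUrl.tolist()
    result.extend(MFAlgorithm.run_MF_algorithm(visitorId, time, path))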

Algorithm Implementation. The algorithm iterates over the given URLs, stored as pairs, and works with a boolean flag. We iterate through the pair list and keep appending elements to the current transaction list as they come. As soon as we detect that the current element is already present in the current transaction, the flag to end the transaction is set to True and the list is closed; this gives us our first transaction. We then start a new list, set the stop flag back to False, and repeat the process up to the final element of the input. Finally, we return the transactions as a list of tuples.


1 @staticmethod
2 def run_MF_algorithm(visitor, time, urls):
3     url_pairs = [('', urls[0], 0, 0)]
4     i = 0
5     number_of_urls = len(urls)
6     while (i + 1) < number_of_urls:
7         url_pairs.append((urls[i], urls[i + 1], i, i + 1))
8         i += 1
9
10     i = 0
11     end_transaction = False
12     all_transactions = []
13     current_transaction = []
14     timestamp = time[0]
15     number_of_pairs = len(url_pairs)
16
17     while i < number_of_pairs:
18         current_url, next_url, index_current, index_next = url_pairs[i]
19
20         # Initialize the transaction for the first URL
21         if current_url == '':
22             current_transaction.append(next_url)
23             timestamp = time[index_next]
24             i += 1
25             continue
26
27         # If the URL exists in the transaction, end transaction and add it to list.
28         # If not, we add the url to the transaction list and go on.
29         if next_url in current_transaction:
30             if not end_transaction:
31                 if current_transaction not in all_transactions:
32                     all_transactions.append((visitor, timestamp, current_transaction))
33                 this_index = current_transaction.index(next_url)
34                 current_transaction = current_transaction[0:this_index + 1]
35                 end_transaction = True
36             i += 1
37             continue
38         else:
39             if end_transaction:
40                 end_transaction = False
41                 timestamp = time[index_current]
42             current_transaction.append(next_url)
43
44         i += 1
45
46     if current_transaction not in all_transactions:
47         all_transactions.append((visitor, timestamp, current_transaction))
48     return all_transactions

Initially we create a list of tuples holding the current URL, the next URL, and the indices of both. We iterate through the whole visitor journey and create a pair for every step, so for a journey with URLs ABCDE we get .A, AB, BC, CD, DE together with their indices, where the dot represents an empty string used for initialisation. Next, there are three condition checks. The first initialises the first transaction: it checks whether the first URL of the pair is empty and, if so, starts the transaction list with the second URL of the pair. The next check tests whether the URL is already in the current transaction; if it is, the transaction ends and we continue the loop from the next index. If it is not, we simply add the current URL to the transaction list and keep iterating. In the end, we return the list of tuples holding all transaction paths for the visitor, together with the first timestamp of each.
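To make the behaviour concrete, the following is a usage sketch on a toy journey; the module name mf_algorithm is hypothetical, and the expected output follows from the implementation above.

from mf_algorithm import MFAlgorithm

# Toy journey: the visitor goes A -> B -> C, backtracks to B, then moves on to D.
urls = ['/a', '/b', '/c', '/b', '/d']
times = [1, 2, 3, 4, 5]

print(MFAlgorithm.run_MF_algorithm('visitor-1', times, urls))
# Two maximal forward references are returned:
# [('visitor-1', 1, ['/a', '/b', '/c']), ('visitor-1', 4, ['/a', '/b', '/d'])]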

1 # Step 4
2 def get_transactions(self, sorted_data):
3     transactions = mfa.init_algorithm(sorted_data)
4     data_frame = pd.DataFrame(transactions, columns=['visitorId', 'timestamp', 'transactionPath'])
5     data_frame = pd.merge(data_frame, sorted_data, on=['visitorId', 'timestamp'])
6     print("Step 4/7 - Extract transactions, done...")
7     return data_frame.drop(['timestamp', 'pageUrl'], axis=1)

Step 4. Finally, the above code shows the function of the fourth step. It first calls the algorithm to retrieve the transactions and then transforms them into a dataframe of three columns. We then merge the transaction dataframe with our initial one. Because we use the merge function, the transactions are merged on the rows where the timestamp and visitor ID match, and the remaining rows are dropped. This means the initial data frame now only keeps the row matching the first URL of each identified transaction (the merging point); the other rows of that transaction have been condensed into this first one. We illustrate this with an example for visitor A in Figure 3.1.

Figure 3.1: Example result of merging transaction table with initial dataframe.

Finally, we drop the URL column, as the list of URLs now lives in its own column. This decreases the number of rows dramatically while keeping the same amount of information, which benefits both the computational complexity and the scalability of the pipeline.
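The effect of the merge can be illustrated with a toy example (the column values are made up); pandas performs an inner join by default, so only the row whose timestamp matches the start of the transaction survives:

import pandas as pd

transactions = pd.DataFrame({'visitorId': ['A'], 'timestamp': [1],
                             'transactionPath': [['/home', '/products']]})
visits = pd.DataFrame({'visitorId': ['A', 'A'], 'timestamp': [1, 2],
                       'pageUrl': ['/home', '/products'], 'newVisit': [True, False]})
merged = pd.merge(transactions, visits, on=['visitorId', 'timestamp'])
# Only the row with timestamp 1 remains; the second visit row has been
# condensed into the transactionPath list of that first row.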

Limitations. The code requires at least three columns that contain the visitor ID, timestamp and URL. The sorting is not necessary, but it speeds up the process due to branch prediction.

3.1.5 Normalise Transactions

The fifth step consists of two activities: retrieving keywords from the transaction path, and extracting the last URL of the transaction, which is the content page. The latter is done in line 4 simply by taking the last element of the transaction path list.

1 # Step 5
2 def get_content_page_and_keywords(self, data_frame):
3     data_frame['keywords'] = data_frame.transactionPath.astype(str).apply(urlExtract.get_keywords)
4     data_frame['contentPage'] = data_frame.transactionPath.str[-1]
5     print("Step 5/7 - Keep content pages and get path keywords, done...")
6     return data_frame

Get keywords. Extracting keywords from small texts is a problem that is still being actively researched [HWW+15, WLWZ12]. We therefore decided to use our own method, which essentially reduces the URL path to the words that describe it. We do this by removing unnecessary parts such as the homepage and stop-words, and keeping the words that may describe what the URL leads to.

1 def get_keywords(url):
2     url = url[1:-1]
3     for home_page in homepages:
4         url = url.replace(home_page, " ")
5     url = url.replace(".html", " ")
6     url = url.replace(".xml", " ")
7     url = url.replace(".php", " ")
8     url = url.replace("/", " ")
9     url = url.replace("-", " ")
10     url = url.replace(".", " ")
11     url = url.replace(",", " ")
12     url = url.replace("\'", " ")
13     path_tokens = word_tokenize(url)
14     sw = set(stopwords.words('english'))
15     sw.add('en')
16     sw.add('nl')
17     sw.add('de')
18     sw.add('txt')
19     sw.add('oh')
20     sw.add('net')
21     sw.add('com')
22     sw.add('1')
23     sw.add('login')
24     sw.remove('about')
25     sw.remove('why')
26     result = [w for w in path_tokens if w not in sw]
27     return set(result)

The above code handles this. We first remove the homepages of the website, which are provided by the user, and replace other separators such as commas and hyphens with spaces. The next step tokenises the resulting string and builds a list of stop-words. Finally, we check every token and remove it if it matches a stop-word. The result is saved as a set for two reasons: it avoids duplicate words, and it means that transactions consisting of the same URLs in a different order produce the same keyword set, so the clustering algorithm identifies them as the same.
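As an illustration, assuming the user configured homepages = ['https://www.example.com'] for their site, applying the function to one stringified transaction path would behave roughly as follows (the exact output depends on the configured homepages and stop-words):

path = str(['https://www.example.com/en/products/garden-furniture.html',
            'https://www.example.com/en/products/outdoor-lighting.html'])
print(get_keywords(path))
# Roughly: {'products', 'garden', 'furniture', 'outdoor', 'lighting'}
# 'en' is dropped as a stop-word and duplicates collapse because the
# result is a set.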

Limitations. The main limitation of the keyword extraction method is its input: while it lets us keep the descriptive words of the URLs, the result depends largely on how descriptive the URLs themselves are. We attempted to use a method that extracts keywords by visiting the URL and analysing the content of the page, but this rarely produced useful results. Furthermore, there are stop-words that we should not be ignoring, such as 'about', which could describe an about-us page; the list of what should and should not be ignored cannot be known with certainty.

3.1.6 Remove Unnecessary Data

The final step of the pre-processing removes any transaction that consists of a single URL where that URL is either the homepage or an error page. This is done to avoid the non-useful segments that would otherwise result from the large number of visitors who do this.

1 # Step 6

3 def remove_homepage(self, data_frame):
4     data_frame = data_frame.drop(
5         data_frame[(
6             (data_frame.pageId == 'hst:pages/home') |
7             (data_frame.pageId == 'hst:pages/pagenotfound')
8         )
9         &
10         (data_frame.transactionPath.str.len() == 1)
11     ].index).reset_index(drop=True)
12     print("Step 6/7 - Remove visitors that only visited the homepage, done...")
13     return data_frame

We achieve this by using the pageId column provided by the Hippo CMS access log, which is configured so that pages belong to categories; we specifically need the categories of the home and error pages. Once the user provides these, we identify and drop all rows that match those IDs and whose transaction path consists of a single URL, and finally return the filtered dataframe.

Limitations. This function requires that the user provides either the configured names of the appropriate pageId categories, or the raw home and error page URLs, so that they can be removed. As previously mentioned, the code needs to be extended to accept these as initial parameters of the pipeline; a possible extension is sketched below.
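A sketch of such an extension, in which the page-ID names are passed as parameters with the current Hippo CMS values as defaults (the signature is a suggestion, not the shipped code):

def remove_homepage(self, data_frame, home_id='hst:pages/home',
                    error_id='hst:pages/pagenotfound'):
    # Drop single-URL transactions that ended on the home or error page.
    mask = (data_frame.pageId.isin([home_id, error_id])
            & (data_frame.transactionPath.str.len() == 1))
    return data_frame.drop(data_frame[mask].index).reset_index(drop=True)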


Chapter 4

Segment Identification

In this chapter we outline the usage of the K-Modes algorithm and briefly explain the structure of the results.

4.1 K-Modes usage

We used a scikit-learn1-based community implementation2 of the K-Modes algorithm in our work. It supports both Huang's [Hua98] and Cao's [CLL+12] dissimilarity metrics. We chose Cao's because it is designed specifically for categorical variables and shows better results overall, as discussed in Chapter 2. Besides its initial configuration, the algorithm requires a data frame as input. After running its iterations, it lets the user output either the cluster centroids or a list of all elements with the label of the cluster each was assigned to.

1 # Step 7
2 def cluster_data(self, data_frame, number_of_segments=10):
3     data_frame = data_frame.astype(str)
4     kmodes_cao = KModes(n_clusters=number_of_segments, init='Cao', verbose=1)
5     kmodes_cao.fit_predict(data_frame)
6
7     column_names = list(data_frame.columns.values)
8     clusters = pd.DataFrame(kmodes_cao.cluster_centroids_, columns=column_names)
9     print("Step 7/7 - Clustering, done...")
10     return clusters

In our function for the seventh step, we must first transform all contents of the data frame to strings, because it contains an unsupported type: the set holding our keywords. Line 4 configures the algorithm by choosing the number of segments and the dissimilarity metric, while verbose indicates whether progress information is printed during runtime. Line 5 runs the clustering algorithm on our data frame. Finally, we save and return the centroids as a data frame in line 8, with line 7 making sure that the column names match those of the initial data frame.
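If the per-row labels are wanted instead of (or next to) the centroids, line 5 could be adapted as sketched below; the segment column name is our own choice.

labels = kmodes_cao.fit_predict(data_frame)
labelled_data = data_frame.copy()
labelled_data['segment'] = labels  # cluster label of every observation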

4.2 Centroid Structure

The resulting centroids form a data frame that contains the cluster label (or ID) and all the data columns we decided to use in the clustering. Each row contains the centroid of a cluster, otherwise called the cluster representative. Figure 4.1 shows an example of the resulting centroids for a data set of 100.000 rows.

1 http://scikit-learn.org/stable/index.html
2 https://github.com/nicodv/kmodes


Figure 4.1: Cluster centroids for a 100.000 row dataset with k=17.

Column A represents the cluster label, while the remaining columns are named as in our initial data frame. In this example we chose to keep only whether the visitor is a returning one, the keywords, and the content page as our parameters. The content page tells us the last page a visitor was on before leaving, while the keywords are derived from all URLs in the transaction path. By looking at these results, the user can see what most of their visitors have been browsing, infer their interests, and transform them into targeting segments.

An apparent limitation shows in these results, specifically in cluster 6: while the keywords show what the user was interested in, the content page is the homepage because the journey ended there. Because the keywords are also a parameter, the effect on the quality of the clustering is minimal, but it is problematic for the final display of information. This is one of the limitations of the MFR algorithm that we would like to correct as future work.
