
3 Evaluation of consumer segmentation

3.3 Cluster analyses in general

3.3.2 Clustering procedure

Each clustering method has its own way of forming clusters, but it always involves optimizing a criterion, for instance making the variance within a cluster as small as possible or making the distance between clusters as large as possible. In practice, a distinction is made between three clustering methods:

Hierarchical, partitioning and TwoStep clustering (Mooi & Sarstedt, 2011). In the next sections the hierarchical, partitioning and TwoStep clustering methods are discussed (Figure 6).

Hierarchical clustering

Hierarchical clustering is a stepwise clustering method with a tree structure. Within hierarchical clustering methods a distinction is made between agglomerative and divisive clustering. Agglomerative clustering starts with a separate cluster for each consumer; the clusters with the smallest distance between them (the most similar consumers) are then merged and linked at a higher level of the hierarchy. This process continues in subsequent steps.

Divisive clustering methods work the other way around. First, all consumers are assigned to one single cluster, and this cluster is then gradually split up. Once a consumer is assigned to a cluster, it cannot be assigned to another cluster anymore. Divisive methods are rarely used in segmentation studies because they are computationally complex: a cluster of $N$ consumers can be split in $2^{N-1}-1$ different ways. For example, when a dataset of 2000 respondents ($N = 2000$) needs to be clustered, there are $2^{1999}-1$ possible divisions to evaluate in the first step alone, which is far too many.

Therefore, agglomerative hierarchical clustering methods are used more often (Xu & Wunsch, 2009; Mooi & Sarstedt, 2011).

Agglomerative hierarchical clustering

There are several methods for measuring similarities between consumers. A straightforward way to find the most similar consumers is to draw a straight line between two consumers and measure its length (Mooi & Sarstedt, 2011). The distance between two consumers $i$ and $j$ is then the square root of the sum of the squared differences between their variable values (Mooi & Sarstedt, 2011):

$$d_{ij} = \sqrt{\sum_{k=1}^{p}\left(x_{ik}-x_{jk}\right)^{2}} \qquad \text{(Euclidean distance)}$$

Here $d_{ij}$ is the distance between the two consumers $i$ and $j$; the variables are identified by the index $k$ ($k = 1, \dots, p$), and $x_{ik}$ and $x_{jk}$ are the values of variable $k$ for consumers $i$ and $j$. The distance is defined as the square root of the sum of squares of the differences between the corresponding coordinates of the points; it is the length of the straight line that connects consumer $i$ and consumer $j$.
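To illustrate, the following minimal Python sketch computes this Euclidean distance for two consumers described by the same variables; the variable values are made up for illustration only.

```python
import numpy as np

# Two consumers described by the same (illustrative) variables,
# e.g. age, monthly spend and number of store visits.
consumer_i = np.array([34.0, 250.0, 4.0])
consumer_j = np.array([29.0, 310.0, 6.0])

# Euclidean distance: the square root of the sum of the squared
# differences between the corresponding variable values.
d_ij = np.sqrt(np.sum((consumer_i - consumer_j) ** 2))
print(d_ij)  # equivalent to np.linalg.norm(consumer_i - consumer_j)
```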

Sometimes the data needs to be standardized, because variables measured on different scales do not contribute equally to the analysis. This standardization can be conducted through several methods. For instance, z-standardization rescales every variable to a mean of 0 and a standard deviation of 1.
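A minimal sketch of z-standardization, assuming a small made-up data matrix in which rows are consumers and columns are variables measured on very different scales:

```python
import numpy as np

# Illustrative data: rows are consumers, columns are variables on very
# different scales (age in years, yearly spend in euros).
X = np.array([
    [23.0,  1200.0],
    [45.0,  8300.0],
    [31.0,  2900.0],
    [52.0, 11100.0],
])

# z-standardization: subtract each column mean and divide by the column
# standard deviation, so every variable gets mean 0 and sd 1.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_z.mean(axis=0).round(6))  # approximately 0 for every variable
print(X_z.std(axis=0).round(6))   # 1 for every variable
```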

Another way to standardize data is by range. In addition, data can be standardized based on the correlation between variables. Once the method for measuring distance or similarity has been chosen, a cluster algorithm can be selected. Below is a list of the most commonly used agglomerative hierarchical cluster algorithms.

These algorithms differ in how the distance between clusters is defined. Within this research four hierarchical cluster algorithms are considered: single linkage, complete linkage, average linkage and centroid linkage. The differences between these algorithms are explained below and clarified by Figure 7.

Single linkage (nearest neighbour): The distance between a pair of clusters is determined by the two closest consumers, one in each cluster. This method tends to produce long, drawn-out clusters: because the two consumers with the smallest distance connect their clusters, a chaining effect occurs (Everitt et al., 2001). As a result, two clusters with very different characteristics can be connected to each other through noise. However, when the clusters are well separated from each other, the single linkage method works well.

Complete linkage: The distance between two clusters is determined by the two farthest consumers, one in each cluster. This technique limits the maximum distance between the consumers within a cluster.


Average linkage: The distance between two clusters is the average distance between all pairs of consumers, one from each cluster.

Centroid (geometric center) linkage: The distance between two clusters is the distance between their centroids; the two clusters with the closest centroids are merged.
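The four linkage rules can be tried out with SciPy's hierarchical clustering routines. The sketch below uses a small synthetic dataset and is only meant to show how the choice of linkage is expressed in code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic, already standardized data: 20 consumers with 2 variables,
# drawn from two well-separated groups.
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(3.0, 0.5, (10, 2))])

# The four agglomerative linkage rules discussed above, all based on
# Euclidean distances between consumers.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```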

Partitional clustering

Contrary to hierarchical clustering, partitional clustering assigns consumers to clusters without a hierarchical structure. The consumers are assigned to non-overlapping clusters. The best-known method for partitional clustering is the k-means algorithm (Xu & Wunsch, 2009; Berkhin, 2006). The k-means method represents each cluster by its mean and seeks an optimal partition of the data by minimizing the sum-of-squared-error with an iterative optimization procedure; see the objective function below.

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} d\!\left(x_i, c_j\right)^{2}$$

Where $d$ is the chosen distance function, $x_i$ is a consumer and $c_j$ is the cluster center (centroid); $k$ is the number of clusters ($j = 1, \dots, k$). The k-means clustering procedure thus minimizes the distance (difference) between the cluster center and the point that represents the consumer. As a result, clusters are formed in which every consumer within a cluster is (almost) similar to the average consumer of that cluster. k-means clustering has two versions. The first version is the EM (Expectation-Maximization) algorithm, also called Forgy's algorithm, which consists of two-step major iterations: all consumers are reassigned to the nearest cluster based on the centroids (geometric centers), after which the centroids of the newly assembled clusters are recomputed. This process iterates until a stopping criterion is reached (Xu & Wunsch, 2009). The other version of k-means clustering is iterative optimization, which reassigns consumers to another cluster based on a detailed analysis. This analysis shows the effect on the objective function of moving a consumer from its current cluster to any other cluster. When such a move has a positive effect, the consumer is relocated and the cluster centroids are recomputed.

According to Xu & Wunsch (2009), the k-means algorithm owes its popularity to its comprehensibility, easy application and solid basis in analysis of variance. However, they also point out several disadvantages. First, the results depend strongly on the initial guess of the centroids, so a calculated local optimum can differ from the global optimum. In addition, it is not obvious how the number of clusters should be chosen. Furthermore, the process is sensitive to outliers, the basic algorithm is not scalable, only numerical attributes are covered, and the resulting clusters can be unbalanced.
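A hedged sketch of the k-means procedure with scikit-learn, assuming standardized continuous data; running the algorithm from several random starting centroids (n_init) and fixing random_state softens the sensitivity to the initial guess mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic standardized data with three loose consumer groups.
X = np.vstack([rng.normal(loc, 0.6, (50, 2)) for loc in (-3.0, 0.0, 3.0)])

# n_init restarts k-means from several random centroid guesses and keeps
# the run with the lowest sum-of-squared-error, reducing the dependence
# on the initial centroids.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.inertia_)          # sum-of-squared-error of the best run
print(km.cluster_centers_)  # the three cluster centroids
```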

FIGURE 7, CLUSTER ALGORITHMS (MOOI & SARSTEDT, 2011)


TwoStep clustering

TwoStep cluster analysis is a method for finding natural groupings within a dataset, and the algorithm differs from traditional clustering methods in several respects. In the first place, a mixture of categorical and continuous variables can be used within this analysis, while k-means and hierarchical clustering only allow ratio/interval scaled variables.

The method treats the variables as independent, so a joint multinomial-normal distribution can be placed on the categorical and continuous variables together. Secondly, the procedure selects the best number of clusters automatically: the optimal number of clusters is selected by comparing the values of a model-choice criterion across the cluster solutions. Thirdly, the records are summarized by building a cluster feature (CF) tree, which makes it possible to analyze large datasets, just like the k-means clustering procedure (Moiseeva, 2013). The method was developed in 2001 by Chiu, Fang, Chen, Wang & Jeris and is implemented in SPSS Statistics.

The TwoStep cluster technique consists of two steps: pre-clustering and clustering. First, the consumers are classified into pre-clusters; in the second step these pre-clusters are clustered with a hierarchical procedure (Moiseeva, 2013). The pre-clustering step uses a CF tree, which consists of levels of nodes, each node containing a number of entries. A node can be an internal node or a leaf node. Leaf nodes hold the final sub-clusters, while internal nodes are used to guide a new consumer (record) to the correct leaf node. When a consumer reaches a leaf node, the algorithm finds the closest leaf entry in that node and the CF tree is updated. The consumer is only added to that entry when the consumer lies within a threshold distance of the nearest leaf entry; otherwise the consumer starts a new leaf entry. When there is not enough space within an existing leaf node, the leaf node is split up. When the CF tree becomes too big, the threshold distance is increased and the CF tree is rebuilt. This is an iterative procedure which finishes when all consumers are assigned to a pre-cluster.
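SPSS's TwoStep procedure itself is not available outside SPSS, but the pre-cluster-then-cluster idea can be approximated with scikit-learn's Birch algorithm, which also builds a CF tree and then merges the leaf entries into final clusters. The sketch below handles continuous variables only and is an approximation, not the SPSS implementation.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(2)
# A larger synthetic dataset of standardized continuous variables.
X = np.vstack([rng.normal(loc, 0.7, (5000, 3)) for loc in (-4.0, 0.0, 4.0)])

# Step 1: Birch grows a CF tree whose leaf entries summarize dense groups
# of consumers (the pre-clusters); `threshold` plays the role of the
# distance threshold for starting a new leaf entry.
# Step 2: the leaf entries are merged into n_clusters final clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))  # size of each final cluster
```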

Distance measure

The distance between clusters is calculated with a log-likelihood or a Euclidean measure. Within this cluster analysis the Euclidean distance measure is used for continuous variables only; it measures the distance between two cluster centers. In contrast to the Euclidean distance measure, the log-likelihood distance measure can handle continuous as well as categorical variables. However, continuous variables are assumed to be normally distributed and categorical variables are assumed to be multinomially distributed; in addition, all variables must be independent of each other (Moiseeva, 2013). The log-likelihood distance measure is based on probability. The following equation defines the distance between clusters 1 and 2:

$$d(1,2) = \xi_{1} + \xi_{2} - \xi_{\langle 1,2 \rangle}$$

Within this formula $d(1,2)$ is the distance between clusters 1 and 2; $\xi_{1}$ is the log-likelihood measure for cluster 1 and, consequently, $\xi_{2}$ is the log-likelihood measure for cluster 2; $\langle 1,2 \rangle$ is an index that represents the cluster formed by combining clusters 1 and 2.

The measure $\xi_{s}$ for cluster $s$ is defined as:

$$\xi_{s} = -N_{s}\left( \sum_{k=1}^{K^{A}} \tfrac{1}{2}\log\!\left(\hat{\sigma}_{k}^{2} + \hat{\sigma}_{sk}^{2}\right) + \sum_{k=1}^{K^{B}} \hat{E}_{sk} \right)$$

Where $N_{s}$ is the total number of consumers in cluster $s$; $K^{A}$ is the total number of continuous variables; $K^{B}$ is the total number of categorical variables; $\hat{\sigma}_{k}^{2}$ is the estimated variance of continuous variable $k$ for the entire dataset; and $\hat{\sigma}_{sk}^{2}$ is the estimated variance of continuous variable $k$ in cluster $s$. $\hat{E}_{sk}$ is calculated in the following equation:

$$\hat{E}_{sk} = -\sum_{l=1}^{L_{k}} \frac{N_{skl}}{N_{s}} \log \frac{N_{skl}}{N_{s}}$$

Within this equation $N_{skl}$ is the number of consumers in cluster $s$ whose categorical variable $k$ takes category $l$, and $L_{k}$ is the number of categories of variable $k$.
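The log-likelihood distance can be written down directly from the equations above. The following rough Python sketch assumes each cluster is given as an array of continuous values plus an array of integer-coded categorical values; the function and argument names are illustrative, not part of any library.

```python
import numpy as np

def xi(cont, cat, sigma2_dataset):
    """Log-likelihood measure xi_s for one cluster.
    cont: (N_s, K_A) continuous variables; cat: (N_s, K_B) integer-coded
    categorical variables; sigma2_dataset: (K_A,) variances over the
    entire dataset."""
    n_s = cont.shape[0]
    # Continuous part: 0.5 * log of (dataset variance + within-cluster variance).
    cont_term = 0.5 * np.log(sigma2_dataset + cont.var(axis=0)).sum()
    # Categorical part: entropy E_sk of each categorical variable in the cluster.
    cat_term = 0.0
    for k in range(cat.shape[1]):
        _, counts = np.unique(cat[:, k], return_counts=True)
        p = counts / n_s
        cat_term += -(p * np.log(p)).sum()
    return -n_s * (cont_term + cat_term)

def loglik_distance(c1_cont, c1_cat, c2_cont, c2_cat, sigma2_dataset):
    """d(1,2) = xi_1 + xi_2 - xi_<1,2>: the decrease in the log-likelihood
    measure when clusters 1 and 2 are merged into a single cluster."""
    merged_cont = np.vstack([c1_cont, c2_cont])
    merged_cat = np.vstack([c1_cat, c2_cat])
    return (xi(c1_cont, c1_cat, sigma2_dataset)
            + xi(c2_cont, c2_cat, sigma2_dataset)
            - xi(merged_cont, merged_cat, sigma2_dataset))
```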

Auto-clustering procedure

The auto-clustering procedure is what makes TwoStep clustering particularly interesting: the technique can determine the number of clusters automatically, although the researcher also has the option to specify a number of clusters. The number of clusters is selected automatically in two steps. In the first step, the Bayesian Information Criterion (BIC) or Akaike's Information Criterion (AIC) is calculated for every candidate number of clusters in order to find an initial estimate of the optimal number of clusters. AIC is a measure of the goodness of fit of an estimated statistical model and is especially suited to finding the best approximating model when the unknown, true model has a high-dimensional reality, whereas BIC is designed to find the most probable model given the data (Moiseeva, 2013).

$$\mathrm{BIC} = -2\ln L + m\ln n \qquad\qquad \mathrm{AIC} = -2\ln L + 2m$$

Where $n$ is the sample size, $\ln L$ is the log-likelihood of the model and $m$ is the number of free parameters to be estimated.
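The two criteria are straightforward to compute once the log-likelihood and the number of free parameters of a cluster solution are known; the values in the sketch below are purely illustrative.

```python
import numpy as np

def bic(log_likelihood, m, n):
    # BIC = -2 ln L + m ln n, with m free parameters and sample size n.
    return -2.0 * log_likelihood + m * np.log(n)

def aic(log_likelihood, m):
    # AIC = -2 ln L + 2 m.
    return -2.0 * log_likelihood + 2.0 * m

# Illustrative values only; in TwoStep the log-likelihood comes from the
# cluster model and m from the parameter-count equation given below.
print(bic(log_likelihood=-1250.0, m=12, n=2000))
print(aic(log_likelihood=-1250.0, m=12))
```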

According to studies comparing the performance of AIC and BIC, AIC performs well in small samples but is inconsistent and does not improve in performance as the sample grows. In contrast, BIC is consistent and improves in performance with larger sample sizes (Moiseeva, 2013). The number of free parameters $m_J$ for a solution with $J$ clusters is computed according to the following equation:

$$m_{J} = J\left( 2K^{A} + \sum_{k=1}^{K^{B}} \left(L_{k} - 1\right) \right)$$

Here $L_{k}$ is the number of categories for categorical variable $k$. The second step of the auto-clustering procedure refines the estimate from the first step. The following equation is used to calculate the change in distance between the two closest clusters:

$$R(J) = \frac{d_{\min}(C_{J})}{d_{\min}(C_{J+1})}$$

Here $C_{J}$ is the cluster model containing $J$ clusters and $d_{\min}(C_{J})$ is the minimum inter-cluster distance for that model; consequently, $C_{J+1}$ is the next larger model that contains one cluster more and $d_{\min}(C_{J+1})$ is the minimum inter-cluster distance for cluster model $C_{J+1}$. SPSS estimates models with different numbers of clusters (up to the specified maximum number of clusters) and reports the change in BIC and the ratio of distance measures for every solution.
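To make the two quantities concrete, the sketch below computes the change in BIC and the ratio of minimum inter-cluster distances for a made-up series of cluster solutions; the exact decision rules and thresholds used by SPSS are not reproduced here.

```python
import numpy as np

# Illustrative inputs: BIC for J = 1..6 clusters and the minimum
# inter-cluster distance d_min(C_J) of each cluster model C_J
# (d_min is undefined for the one-cluster model).
bic_values = np.array([5200.0, 4100.0, 3350.0, 3300.0, 3290.0, 3288.0])
d_min = np.array([np.inf, 9.5, 8.7, 3.1, 2.9, 2.8])

# Step 1: change in BIC when one more cluster is added; large drops mean
# the extra cluster is still worthwhile.
dbic = np.diff(bic_values)       # dBIC(J) = BIC(J+1) - BIC(J)

# Step 2: ratio of distance measures R(J) = d_min(C_J) / d_min(C_{J+1});
# a large ratio marks the point after which clusters lie much closer together.
ratio = d_min[1:-1] / d_min[2:]  # defined for J = 2..5

print("BIC changes:", dbic)
print("distance ratios:", ratio)
# Here the largest ratio occurs at J = 3, pointing to a 3-cluster solution.
```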

Selecting the clustering procedure

Now that all clustering techniques have been discussed, a well-founded choice of cluster analysis technique can be made for this research. Within this research we are probably dealing with a large dataset; therefore partitional clustering methods are not applicable. For measuring buying behaviour and consumer characteristics it will probably be necessary to use nominal and ordinal scales, whereas k-means and hierarchical clustering techniques require datasets with only interval and ratio variables. Variables like gender (nominal) and education level (ordinal) can therefore not be used in k-means and hierarchical clustering. TwoStep clustering does allow continuous as well as categorical variables. Therefore, the TwoStep clustering technique is the best method to form clusters for this research.