
Random effects mixture models for clustering time series



Geoffrey Bryan Coke

B.Sc., University of Victoria, 2004

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Mathematics and Statistics

© Geoffrey Bryan Coke, 2009
University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


Random Effects Mixture Models for Clustering Time Series

by

Geoffrey Bryan Coke

B.Sc., University of Victoria, 2004

Supervisory Committee

Dr. Min Tsao, Supervisor

(Department of Mathematics and Statistics)

Dr. Mary Lesperance, Departmental Member (Department of Mathematics and Statistics)

Dr. Julie Zhou, Departmental Member (Department of Mathematics and Statistics)


Supervisory Committee

Dr. Min Tsao, Supervisor

(Department of Mathematics and Statistics)

Dr. Mary Lesperance, Departmental Member (Department of Mathematics and Statistics)

Dr. Julie Zhou, Departmental Member (Department of Mathematics and Statistics)

ABSTRACT

In this thesis, we study cluster analysis of time series and develop a new mixture model which is effective for such analysis. Our study is motivated by a real life problem of clustering time series of electricity load (demand) for BC Hydro customers. BC Hydro collects electricity load data for selected customers for the purpose of grouping customers into homogeneous classes in terms of the load. Such homogeneous classes or clusters are useful for rate setting and long term generation capacity planning. The BC Hydro data set that we use in this thesis contains 923 load series representing 923 BC Hydro customers. Each load series consists of repeated hourly load measurements over a one year period and thus is a long time series. There are a number of clustering methods in the literature for clustering general multivariate data, but these are not effective for clustering such long time series. This is because time series such as the BC Hydro customers' load series typically have high dimensions and special covariance structures. Existing clustering methods are not designed to accommodate these special characteristics. The contributions of this thesis are the following: We first develop a mixture model based clustering method for time series which not only handles their high dimensions but also makes effective use of their special covariance structures. Our method is based on the random effects mixture model, a mixture model which we develop specifically for time series. We devise a special EM algorithm based on the AECM algorithm of Meng and van Dyk (1997) to handle the computation of the random effects mixture model. Once the model is computed, we assign individual time series to clusters by their posterior probabilities of belonging to the components of the mixture model. Then, to demonstrate the application of our method, we apply it to analyse the BC Hydro data. We obtain a new clustering of the BC Hydro sample which is superior to the existing clustering in terms of relevance and interpretability.

Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 The BC Hydro Data
  1.2 The BC Hydro Rate Classes
  1.3 Cluster Analysis of Time Series
  1.4 Overview of the Thesis

2 Heuristic Clustering Methods
  2.1 Heuristic Clustering Methods
    2.1.1 K-means clustering
    2.1.2 Hierarchical clustering
    2.1.3 Follow-the-leader clustering
  2.2 Dimension Reduction for Heuristic Clustering
    2.2.1 Fourier/wavelet regression
    2.2.2 Principal component analysis (PCA)
  2.3 Dimension Reduction for K-Means Clustering
    2.3.1 Fourier/wavelet regression and K-means
    2.3.2 Principal component analysis (PCA) and K-means
  2.4 Cluster Validity Assessment
  2.5 Example: Application of K-means Clustering Method
  2.6 Conclusion

3 Model Based Clustering Methods
  3.1 Model Based Clustering
    3.1.1 EM algorithm for computing MLE
    3.1.2 Posterior probabilities for cluster assignment
    3.1.3 Selecting a mixture model
  3.2 Multivariate Normal Mixtures
  3.3 Mixture Models for High Dimensional Data
    3.3.1 Mixture of spherical normal distributions
    3.3.2 Mixture of random coefficients models
  3.4 Example: Application of Mixture Model Based Clustering Methods
  3.5 Conclusion

4 Mixture Models for Time Series
  4.1 Random Effects Mixture Models for Clustering Time Series Data
  4.2 Covariance Models for Individual Components
    4.2.1 Autoregressive models
    4.2.2 Antedependence models
    4.2.3 Structured antedependence models
  4.3 Computational Aspects of the EM Algorithm for a Random Effects Mixture Model
    4.3.1 Alternating expectation conditional maximization
    4.3.2 Remarks on the AECM algorithm
  4.4 Future Research
  4.5 Example: Clustering of Time Series Based on a Random Effects Mixture Model
  4.6 Conclusion

5 Data Analysis
  5.1 BC Hydro Data
  5.2 Probability Models
    5.2.1 Probability models for the raw data
    5.2.2 Probability models for the unitized data
    5.3.1 Cluster analysis for the raw data
    5.3.2 Cluster analysis for the unitized data
  5.4 The Best Mixture Model
  5.5 Applications and Recommendations
    5.5.1 BC Hydro rate classes and model based clusters
    5.5.2 Recommendations for creating classifications from model based clusters
  5.6 Conclusion

Bibliography

List of Tables

Table 2.1 Matching matrix for the K-means clustering versus the simulated clustering.
Table 4.1 Parameter estimates for the 3 component random effects mixture with AR(1) serial correlation.
Table 5.1 Matching matrix for clusterings derived from raw data log-normal mixture models with K = 7 spherical components and K = 7 PAR(1) components.
Table 5.2 Matching matrix for clusterings derived from a raw data log-normal mixture model with K = 7 PAR(1) components and a unitized data log-ratio normal mixture model with K = 7 PAR(1) components.
Table 5.3 Parameter estimates for the best model (7 component mixture of log-normals with PAR(1) covariance structure).
Table 5.4 Mean load series summary measures for the best model (7 component mixture of log-normals with PAR(1) covariance structure).
Table 5.5 Matching matrix for rate classes and clusterings derived from the best model (7 component mixture of log-normals with PAR(1) covariance structure).
Table 5.6 Proportions of building activity type sample customers assigned to clusters derived from the best model (7 component mixture of log-normals with PAR(1) covariance structure).

List of Figures

Figure 1.1 923 BC Hydro customer sample raw load series. Each series is measured at daily intervals over a one year period from April 1, 2004 to March 31, 2005.
Figure 1.2 BC Hydro customer sample raw load series plotted by rate class. (a) Residential. (b) Commercial under 35kW. (c) Commercial over 35kW.
Figure 2.1 Clusters simulated from multivariate normal distributions with compound symmetric covariance structure. The bold curve in each plot is the mean curve. (a) Simulated residential. (b) Simulated commercial under 35kW. (c) Simulated commercial over 35kW.
Figure 2.2 Plot of the first two principal components with boundaries of the three clusters given by the K-means method. The inset plot magnifies the region containing the residential and commercial under 35kW simulated clusters.
Figure 3.1 Clusters simulated from a random coefficients mixture model. The bold curve in each plot is the mean curve. (a) Simulated residential. (b) Simulated commercial under 35kW. (c) Simulated commercial over 35kW.
Figure 3.2 Scree graph for eigenvalues of the simulated data. The scree graph is plotted with a split axis to highlight the shape of the curve.
Figure 3.3 The BIC scores for the simulated data mixture model with a Fourier design matrix and a PCA estimated design matrix. In both methods a model with 3 clusters has the minimum BIC score. The inset plot magnifies the region where the minimum BIC score is achieved.
Figure 4.1 Clusters simulated from a random effects mixture model. The random effect covariance matrices are AR(1) with different variances and autocorrelations. The bold curve in each plot is the mean curve. (a) Simulated residential. (b) Simulated commercial under 35kW. (c) Simulated commercial over 35kW.
Figure 4.2 The BIC scores for the simulated data mixture model. Model AR(1) with 3 clusters is selected, where "AR(1)" denotes first-order autoregressive covariance structure. EII: spherical equal volume; RC: random coefficient.
Figure 5.1 Illustration of the effect of the many-to-one nature of the unitizing transformation on a mixture distribution. (a) Sample points simulated from a mixture of 3 log-normal distributions are plotted. Component parameters µ_k and Σ_k for k = 1, 2, 3 were chosen so that two of the components would coincide after transformation (µ_1 = (0, 0, 0)^T, Σ_1 = diag(.02, .02, .02), µ_2 = (1, 1, 1)^T, Σ_2 = diag(.02, .02, .02), µ_3 = (−.25, .25, 1.5)^T and Σ_3 = diag(.05, .05, .05)). (b) Sample points after transformation into the simplex and 95% probability regions for the transformed components are represented in the ternary plot. The three components in R^3 are mapped to two components in S^2. (a) Simulated Data in R^3.
Figure 5.2 (a) The ternary plot from above. (b) Log-ratio transformed component sample points and 95% probability regions are plotted in R^2. Log-ratio transformed components in S^2 are normal in R^2, hence clustering the log-ratio transformed variables assuming normality should yield good results. Clustering the unitized variables in the simplex assuming normality could be problematic, as more than one normal distribution could be required to describe curvature in the component distributions. (a) Simulated Data in S^2. (b) Simulated Data in R^2.
Figure 5.3 Scree graph for eigenvalues of the raw data. The scree graph is plotted with a split axis to highlight the shape of the curve.
Figure 5.4 The BIC scores for the raw data mixture model under the normal assumption. Model PAR(1) with 7 clusters is selected, where "PAR(1)" denotes periodic autoregressive covariance structure. EII: spherical equal volume; VII: spherical unequal volume; AR(1): autoregressive; RC: random coefficient. The symbol = is used to denote equal component covariance matrices.
Figure 5.5 The BIC scores for the raw data mixture model under the log-normal assumption. Model PAR(1) with 7 clusters is selected, where "PAR(1)" denotes periodic autoregressive covariance structure. EII: spherical equal volume; VII: spherical unequal volume; AR(1): autoregressive; RC: random coefficient. The symbol = is used to denote equal component covariance matrices.
Figure 5.6 The BIC scores for the reduced unitized data mixture model under the normal assumption. Model PAR(1) with 7 clusters is selected, where "PAR(1)" denotes periodic autoregressive covariance structure. EII: spherical equal volume; VII: spherical unequal volume; AR(1): autoregressive. The symbol = is used to denote equal component covariance matrices.
Figure 5.7 The BIC scores for the unitized data mixture model under the log-ratio normal assumption. Model PAR(1) with 7 clusters is selected, where "PAR(1)" denotes periodic autoregressive covariance structure. EII: spherical equal volume; VII: spherical unequal volume; AR(1): autoregressive; RC: random coefficient. The symbol = is used to denote equal component covariance matrices.
Figure 5.8 Plots of clusters C1-C7 derived from the best mixture model (7 component mixture of log-normals with PAR(1) covariance structure). Component means are plotted next to the corresponding clusters. (a) Cluster 1. (b) Cluster 1 mean curve. (c) Cluster 2. (d) Cluster 2 mean curve. (e) Cluster 3. (f) Cluster 3 mean curve. (g) Cluster 4. (h) Cluster 4 mean curve. (i) Cluster 5. (j) Cluster 5 mean curve. (k) Cluster 6. (l) Cluster 6 mean curve. (m) Cluster 7.

ACKNOWLEDGEMENTS

I would like to thank Dr. Min Tsao for his invaluable guidance and encouragement. I would also like to thank the Department of Mathematics and Statistics for funding me with a scholarship.

Chapter 1

Introduction

This thesis studies cluster analysis of time series data and develops a new mixture model which is effective for such analysis. It is motivated by a real life problem of clustering time series of electricity load (demand) from BC Hydro.

The analysis of electricity load has been a key research area (Chicco et al., 2006; Carpaneto et al., 2006; Tsekouras et al., 2008) with important implications for electric utilities. This is driven by the utilities' need to know the composition of their customers, as such information is crucial for rate setting and system planning. Conceptually, each customer of, say, BC Hydro is represented by the customer's electricity load, which is a time series, and the composition of the customer population refers to some meaningful grouping or clustering of the population in terms of the electricity load. Hence clustering analysis is an important tool for understanding the composition. There are, however, two main challenges in performing cluster analysis on electricity load data. The first is the time series nature of such data and the second is their high dimensionality. Traditionally, utility managers have used simple qualitative methods to define classes. The most common method is to divide the population along the lines of residential and commercial customers. This avoids the difficulties of dealing with the load series directly, but the resulting classes are often not well defined and do not correspond well with the true underlying clusters of the load series. There are also a number of more sophisticated methods based on heuristic clustering techniques (see Chapter 2 for a review). While they give better results than the qualitative approach, they also do not make full use of the load data and their results may be difficult to interpret. In recent years, thanks to the advances in computing technology and the EM algorithm, mixture model based clustering methods have been made possible (see Chapter 3 for a review). The mixture model based approach is a powerful new tool for clustering analysis. It goes beyond the scope of the traditional clustering analysis by not just producing a grouping of the data but also a model for the underlying population. Nevertheless, existing mixture model based methods are not particularly suitable for our purpose of clustering the load data as the underlying mixture models are not built for long time series. In this thesis, we fill this gap by developing a mixture model specifically for long time series and applying it to cluster a load data set from BC Hydro.

1.1 The BC Hydro Data

BC Hydro, the British Columbia electrical utility, collects load data from selected customers for various reasons such as rate setting and system planning. The data set that we have is provided by Power Smart, a division of BC Hydro. The primary question of interest to BC Hydro regarding this data set is the identification of the underlying population clusters through the data, which will be used for system planning and refining the existing rate classifications.

The data set consists of 945 time series of hourly load values over a one year period from April 1, 2004 to March 31, 2005, representing the energy consumption profiles of 945 individual customers. Here the hourly load is defined as the average rate of energy consumption over an hour. Note that this sample of 945 is not a simple random sample but a stratified sample from the BC Hydro customer population. As the population is overwhelmingly dominated by residential customers, the stratification is necessary to ensure proper coverage of all types of customers in the population.

Each load series contains 365 × 24 = 8760 points of hourly load values. The high dimension of 8760 presents computational challenges to most clustering methods. Fortunately, for the purpose of clustering the sampled customers to support system planning and refining the existing rate classifications, the daily load series sufficiently captures the usage characteristics of interest. Hence, in our analysis we will be focusing on the daily load series where each point is the average rate of energy consumption over a day. To set up the notation, let x_t be the daily load of day t and let E_t be the total energy consumed in day t, measured in kilowatt-hours. The daily (average) load is then given by
$$x_t = E_t / 24, \qquad t = 1, \ldots, 365,$$
and is bounded between 0 and the sum of the maximum loads of all installed electric appliances. The daily load series for the ith customer is the vector x_i given by
$$x_i = (x_{i1}, \ldots, x_{i365})^T.$$
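To make the construction concrete, here is a minimal sketch (not from the thesis; it assumes the hourly loads are available as a NumPy array) of how 8760 hourly values are averaged into the 365 daily loads x_t and stacked into daily load series.

```python
import numpy as np

def daily_load_series(hourly_load):
    """Average 8760 hourly loads (kW) into 365 daily loads x_t = E_t / 24."""
    hourly_load = np.asarray(hourly_load, dtype=float)
    assert hourly_load.size == 365 * 24
    # Rows are days; the row sum is the daily energy E_t (kWh), and dividing
    # by 24 hours gives the daily average load x_t (kW).
    return hourly_load.reshape(365, 24).sum(axis=1) / 24.0

# Example: a matrix whose rows are simulated daily load series x_i.
rng = np.random.default_rng(0)
X = np.vstack([daily_load_series(rng.gamma(shape=2.0, scale=5.0, size=8760))
               for _ in range(5)])
print(X.shape)  # (5, 365)
```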

We detected 22 outlier load series in the data set which exhibited extreme behaviours such as zero load for some days. While it is difficult to determine the cause for each and every outlier, zero load is typically the consequence of an extended period of power outage, the shutting down of a business or the relocation of a residential customer. All outlier load series are removed from the data set. Hence our analysis will involve only the 923 normal load series. See Figure 1.1 for a plot of all 923 load series. Having removed all series with 0 load in a given day, the natural space for representing this data is the positive space R^365_+, the positive orthant of R^365, defined by
$$\mathbb{R}^{365}_{+} = \{(x_1, \ldots, x_{365})^T : x_1 > 0, \ldots, x_{365} > 0\}.$$

In light of the above notation, the question of interest to BC Hydro is to cluster the x_i (i = 1, 2, \ldots, 923). Each resulting cluster of the x_i sample will represent an unknown underlying population cluster of the raw load series.

[Figure 1.1: 923 BC Hydro customer sample raw load series. Each series is measured at daily intervals over a one year period from April 1, 2004 to March 31, 2005.]

In many clustering applications, standardization is proposed as a preprocessing step prior to clustering. In applications of clustering to load series, utilities are sometimes only concerned with identifying underlying clusters in terms of the shape of their daily load and not their absolute level. In such cases, it has been proposed that the load series be standardized to have unit sum and hence the absolute level of the load series is ignored. See Load Research Manual (2001). This standardization produces what is called a unitized load series, y_i = (y_{i1}, y_{i2}, \ldots, y_{i365})^T. The unitized load for time t, y_{it}, is calculated from the raw loads, x_{it}, as
$$y_{it} = x_{it} \Big/ \sum_{t=1}^{365} x_{it}, \qquad t = 1, \ldots, 365. \qquad (1.1)$$
The natural space for representing this data is the simplex space. The 364-dimensional simplex space S^364, embedded in R^365, is the set defined by
$$S^{364} = \{(x_1, \ldots, x_{365})^T : x_1 > 0, \ldots, x_{365} > 0;\ x_1 + \cdots + x_{365} = 1\}.$$

If we multiply both the numerator and the denominator of the right-hand side of equation (1.1) by 24 (hours), the unitized load is seen to be equivalent to the ratio of energy consumption in day t to total annual consumption. Hence this standardization removes the effect of total annual consumption. The effect of standardization and the interpretation of clustering results for unitized load series will be discussed in Chapter 5.
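A minimal sketch of the unitizing transformation (1.1), assuming the raw daily load series are stored as rows of a NumPy array; it is an illustration only, not the preprocessing code used for the BC Hydro data.

```python
import numpy as np

def unitize(X):
    """Apply equation (1.1): divide each row by its row sum, so each unitized
    series y_i has positive entries summing to one (a point in the simplex)."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, scale=5.0, size=(3, 365))  # stand-in raw load series
Y = unitize(X)
print(np.allclose(Y.sum(axis=1), 1.0))  # True: absolute level is removed
```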

1.2 The BC Hydro Rate Classes

A classification of the load series already exists in the form of the BC Hydro rate classes. These rate classes are largely based on qualitative information that is easy and convenient to observe (see description below). However, such information does not always represent the underlying load series well. As such, the rate classes may not agree with classifications based on the load series. Since the rate classes are the currently existing classification of the customers for BC Hydro, we give a brief review of such classes here and will refer back to this qualitative classification method later in Chapter 5.

There are several rate classes that BC Hydro uses but the three main ones are the residential, the commercial under 35kW (kilowatts) and the commercial over 35kW classes. The two commercial categories include customers from small businesses to large industrial groups. The size of the commercial customers is measured by their peak load in each month. Commercial customers with a peak load that falls under 35kW in every month are assigned to rate class commercial under 35kW. Commercial customers with peak load that exceeds 35 kW in any month are assigned to rate class commercial over 35kW. There are three other rate classes: transmission, irrigation pumping and street lighting. Transmission customers are large commercial customers supplied at transmission voltage levels of 60 kV to 500 kV. The irrigation pumping and street lighting rate classes are self descriptive. There are very few customers in these latter three classes and they have distinctively different load profiles than customers in the first three classes. Also, BC Hydro treats them separately and hence they are not included in the customer population for this study.

As the name rate classes itself suggests, such classes are used to set the price that BC Hydro charges its customers. For example, in fiscal year 2004/05, for residential customers, BC Hydro charged $0.0615 per kWh. For commercial under 35kW, the rate was $0.0691 per kWh. For commercial over 35kW, the first 14,800 kWh were charged at the rate of $0.0691 per kWh and additional kWh at $0.0332 per kWh. This class is also subject to an additional demand charge. The charge for the first 35 kW was nil, the next 115 kW were charged at $3.54 per kW and all additional kW at $6.79 per kW. The sample of 923 customers contains 207, 87 and 629 load series from the three classes, respectively. Figure 1.2 contains plots of the sample load series for these three classes of customers. As can be seen from the plots, there is a lot of variability among load series within each class, indicating that the rate classes are not homogeneous groups.
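To illustrate how the tiered energy and demand charges quoted above combine, here is a small sketch using the fiscal 2004/05 figures from the text; the example consumption and peak values are hypothetical.

```python
def commercial_over_35kw_bill(kwh, peak_kw):
    """Monthly charge under the fiscal 2004/05 rates quoted in the text:
    first 14,800 kWh at $0.0691, additional kWh at $0.0332, plus a demand
    charge of nil for the first 35 kW, $3.54/kW for the next 115 kW and
    $6.79/kW for all additional kW."""
    energy = 0.0691 * min(kwh, 14800) + 0.0332 * max(kwh - 14800, 0)
    demand = 3.54 * max(min(peak_kw, 150) - 35, 0) + 6.79 * max(peak_kw - 150, 0)
    return energy + demand

# Hypothetical customer: 40,000 kWh in a month with a 120 kW peak load.
print(round(commercial_over_35kw_bill(40000, 120), 2))
```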

1.3 Cluster Analysis of Time Series

We have seen that the rate classes amount to a qualitative (or indirect) method of clustering the load series which is not satisfactory. We now address some general issues of cluster analysis, in particular, those relevant to time series data.

The goal of clustering is to identify a meaningful group structure in a data set (or a population) with unknown structures. This is achieved by dividing the data into homogeneous groups where the within-group differences are minimized and the between-group differences are maximized according to some criteria. Most existing clustering methods are concerned with standard multivariate data vectors where the dimension is low and elements of the vector are not time ordered. Mardia et al. (1979) classified clustering methods for multivariate data into two major categories: heuristic methods and model based methods. For heuristic methods, the purpose of the clustering analysis is purely descriptive; the observations are assumed to have come from (usually a small number of) K clusters but no attempt is made to characterize these clusters. For model based methods, there is an underlying model for each of the K clusters that generated the observations. The cluster analysis uncovers not only the cluster membership of the observations but also the underlying distributions of individual clusters.

[Figure 1.2: BC Hydro customer sample raw load series plotted by rate class. (a) Residential. (b) Commercial under 35kW. (c) Commercial over 35kW.]

A time series, such as a BC Hydro customer load series, is a special multivariate data vector whose elements are measurements of the same variable over some time points, say t_1 < t_2 < \cdots < t_n. If all the time series in a sample are observed over the same set of time points t_j, in principle they can be simply regarded as multivariate observations and one could apply existing clustering methods for multivariate data to cluster the sample. Indeed, there are examples of clustering analyses for time series in the literature that simply treat the time series as an ordinary multivariate vector and use the clustering methods for multivariate data. We argue, however, that time series data need special attention due to their high dimensionality and special covariance structures. There is also a vast literature on time series models that we need to take advantage of when developing methods for clustering time series. This necessarily requires a model based approach, which is the focus of this thesis.

1.4 Overview of the Thesis

Motivated by the clustering/classification problem of the BC Hydro load series data, in this thesis we develop a new mixture model based approach for clustering time series. We now give an overview of subsequent chapters.

There are many heuristic clustering methods which do not require a model for the underlying population. The most prominent of these include the K-means method, the hierarchical method and the follow-the-leader method. These methods are the only ones that have been used in load research. In Chapter 2, we review these methods, which will serve as a contrasting background for the model based clustering methods in Chapters 3 and 4. A particularly important point to note here is that for these methods, a load series is regarded as just a multivariate vector; the time series nature of the load series is not recognized. Consequently, not all information contained in the load series has been utilized.

Thanks to the work of Fraley and Raftery (2002) and the availability of computer software in recent years, mixture model based clustering has become increasingly popular. In Chapter 3, we first review the mixture model and its application to clustering standard multivariate data. We discuss the difficulties associated with applying model based clustering methods to high dimensional time series data, in particular the large number of parameters that must be estimated. Then we describe an extension to the mixture model, the so called random coefficients mixture model, which has been used to cluster time series data. We note that this model greatly reduces the number of parameters involved but it does so by assuming that the data is inherently of a lower dimension (confined to a subspace). For high dimensional time series, these assumptions may not be appropriate.

Random effects models are important covariance models in longitudinal data analysis (Diggle et al., 2002). In Chapter 4, we develop a new mixture model, which we will refer to as the random effects mixture model, where the covariance structures of individual components are characterized by random effects models. Unlike those reviewed in Chapter 3, the random effects mixture model is constructed specifically for time series and is particularly suitable for them, as it can accommodate common covariance structures found in time series. With effective modelling of the covariances of individual components, the random effects model is also parsimonious in the number of parameters involved. We consider models with serially correlated random effects following either autoregressive or antedependence processes. We discuss the computation of random effects mixture models. The standard EM algorithm is too slow in this case, and we devise an efficient algorithm by adapting the general AECM algorithm of Meng and van Dyk (1997). We also discuss the implementation and demonstrate the effectiveness of the random effects model for clustering time series through a numerical example.

Finally, in Chapter 5, we conclude the thesis with a detailed data analysis using model based clustering methods. We apply two existing model based methods and our new method based on the random effects model to cluster the BC Hydro data set. We present the clustering results and compare these model based methods as well as the qualitative rate class method. We also make recommendations to assist BC Hydro in using our results to obtain better clustering of their customer population.


Chapter 2

Heuristic Clustering Methods

There are a number of works in the literature on clustering load series using different methods. Currently the most popular methods are K-means, hierarchical and follow-the-leader. See Chicco et al. (2006); Carpaneto et al. (2006); Tsekouras et al. (2008). These methods are all heuristic and not model based. In heuristic clustering, the load series are assigned to clusters using a simple distance measure. Load series are usually high dimensional. The high dimensionality gives rise to computational difficulties and makes it difficult to visualize and interpret results. A common strategy to address these difficulties is to apply some sort of dimension reduction method to the load series and then apply the clustering methods to the dimension reduced load series. In Chicco et al. (2006), the principal component analysis was used to perform the dimension reduction. In Carpaneto et al. (2006), the Fourier regression was used. In Petrescu and Scutariu (2002), the wavelet regression was used.

The purpose of this chapter is first to review the basic clustering methods mentioned above and then to discuss the associated dimension reduction methods. It should be noted that while the methods reviewed in this chapter have been applied to clustering load series by various authors, the choice of method remains a problem. Indeed, there is no consensus as to which of these methods is superior and the choices made in different applications, both those reviewed in this chapter and others presented in the literature, were largely heuristically motivated. In particular, the problem of choosing the method and determining the correct number of clusters has yet to be solved, primarily because there are yet no rigorous criteria for addressing these issues. This leads naturally to the consideration of more sophisticated model based methods in Chapters 3 and 4.

This chapter is organized as follows: In Section 2.1, we review the three heuristic clustering methods identified above. In Section 2.2, we discuss the three dimension reduction methods. In Section 2.3, we discuss dimension reduction for K-means clustering. In Section 2.4, we briefly review cluster validation. In Section 2.5, we examine the drawbacks of these methods through a simulated example where the K-means method fails to uncover the true cluster structure. We then conclude with a summary discussion in Section 2.6.

2.1 Heuristic Clustering Methods

Heuristic methods are characterized by their reliance on a simple distance measure to define clusters. The choice of the defining measure is fundamental to the clustering problem. The methods that we discuss below all involve a Euclidean distance measure. Recall that the N load series are represented by vectors x_i = (x_{i1}, \ldots, x_{in})^T in R^n, with i = 1, 2, \ldots, N. Hence, the Euclidean distance between two load series x_1 and x_2 is defined as
$$d(x_1, x_2) = \sqrt{(x_1 - x_2)^T (x_1 - x_2)} = \sqrt{\sum_{t=1}^{n} (x_{1t} - x_{2t})^2}.$$


Being based on a Euclidean distance measure, most of these clustering methods are strongly biased toward finding spherical clusters. If the data consists of spherical clusters this bias is not a limitation. However, as we subsequently discuss, for time series data this is unlikely to be the case.

We now review the K-means clustering method, the hierarchical method and the Follow-the-Leader method.

2.1.1 K-means clustering

The K-means method is an especially popular heuristic clustering method (Rencher, 2002). Given N observations, K-means clustering seeks to partition the observations into K clusters, {C_1, \ldots, C_K}, so as to minimize the objective function
$$W_K = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2, \qquad (2.1)$$
which is the sum of squared Euclidean distances of individual observations to their respective cluster centres, m_k = \sum_{j \in C_k} x_j / N_k. In this method the number of clusters, K, is assumed to be known. The (estimated) cluster centres and the assignment of the observations that minimize W_K make up the K-means solution to the clustering problem; the K clusters are defined by the estimated cluster centres, and the assignment divides the observations among the K clusters.

Finding a clustering that globally minimizes (2.1) requires a prohibitive amount of computation. The total number of possible clusterings of N observations into K clusters equals the total number of ways of placing N balls into K indistinguishable urns, which is roughly O(K^{N-K}). This is huge even for moderate values of N and K. Thus, although we can in principle find the exact minimum of W_K by computing W_K for every possible clustering, this exhaustive approach is not feasible due to the amount of computation that it requires. Therefore, the K-means method relies on an iterative algorithm which starts with initial choices of the cluster centres and then searches through a subset of possible clusterings. The algorithm (MacQueen, 1967) proceeds by the following steps:

1. Assume initial clusters C_1, \ldots, C_K. Compute the cluster centres m_1, \ldots, m_K, where
$$m_k = \sum_{j \in C_k} x_j / N_k.$$

2. Assign each point x_i to the cluster whose mean is closest according to the squared Euclidean distance,
$$\min_k \| x_i - m_k \|^2.$$

3. Once all of the points x_i have been assigned to a cluster, recalculate the cluster means.

4. If no point is reassigned from one cluster to another, stop. Otherwise, return to Step 2.

Steps 2 through 4 are repeated until the value of the objective function cannot be reduced further.
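The following minimal NumPy sketch implements the four steps above; it is a simplified illustration, not the implementation used for the analysis in this thesis.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means for an (N, n) data array X and K clusters."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    labels = rng.integers(0, K, size=N)          # Step 1: random initial clusters
    for _ in range(n_iter):
        # Steps 1/3: compute the cluster centres m_k from the current assignment.
        centres = np.vstack([X[labels == k].mean(axis=0) if np.any(labels == k)
                             else X[rng.integers(N)] for k in range(K)])
        # Step 2: assign each x_i to the closest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 4: stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    W_K = d2[np.arange(N), labels].sum()         # objective function (2.1)
    return labels, centres, W_K

# In practice the algorithm is restarted from many random initial clusterings
# and the solution with the smallest W_K is kept.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(5, 1, (50, 10))])
labels, centres, W_K = kmeans(X, K=2)
print(labels.shape, round(float(W_K), 1))
```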

The K-means algorithm is simple and relatively fast to iterate. It has a computational complexity of O(KNn), where K is the number of clusters specified by the user, N the sample size and n the dimensionality of the data vectors. However, being a gradient descent method, see Bottou and Bengio (1995), it is only capable of finding local minima. It will always converge to a local minimum, but because the objective function can have multiple local minima, it is not guaranteed to return the global minimum. Since the algorithm is fast, a common solution to the local minima problem is to run the algorithm a large number of times, with random initial clusters, and return the best clustering found.

In high dimensional time series data, the local minima problem is not easily solved by repeated running of the algorithm. As noted in Ding et al. (2002), the objective function for high dimensional data is often very rough, having many local minima. Hence, iterations almost always get trapped near the initial starting configuration. The difficulty in searching through a high dimensional space naturally leads to considerations of dimension reduction.

Another difficulty with the K-means method is in its interpretation. Geometrically, the K-means method with the Euclidean distance measure gives rise to K equal radius spheres in R^n that are centred on the K means. The clusters of a data set identified by the method are roughly the intersections of the data points and these K spheres.

2.1.2 Hierarchical clustering

Hierarchical clustering (Mardia et al., 1979) produces a nested hierarchy of clusterings. Two types of hierarchical clustering algorithms can be distinguished: agglomerative and divisive, depending on whether the clusters are formed in a bottom-up or top-down fashion. Agglomerative hierarchical clustering is more commonly used than divisive hierarchical clustering. Agglomerative clustering proceeds in a stepwise manner, starting with N singleton clusters, as many as the number of observations, and ending with a single cluster containing the entire data set. At each step, the two most similar clusters, as measured by a linkage criterion, are merged. The history of the process is kept in order to form a tree diagram, which shows all the steps in the hierarchical procedure, including distances at which clusters are merged.

Since the linkage criterion determines how the similarity between clusters at each step is measured, choosing a linkage criterion is an important task. Two extreme criteria are the single linkage criterion and the complete linkage criterion. Under the former criterion, similarity between two clusters C_i and C_j is defined as the minimum of the distance function d(v_1, v_2), where d is usually the Euclidean distance, v_1 ∈ C_i and v_2 ∈ C_j. In the latter case, the similarity between two clusters is defined as the maximal distance between C_i and C_j as measured by d(v_1, v_2). When agglomerative clustering is performed with the single linkage criterion, individual data vectors tend to join an existing cluster rather than join with another individual data vector to form a new cluster, whereas with the complete linkage criterion individual data vectors tend to form new clusters, rather than join existing clusters.

Having selected a linkage criterion, a tree diagram is built and a final clustering is constructed from the tree diagram by choosing a maximum distance and then cutting across the tree at that level or by directly selecting the distance corresponding to the desired number of clusters, K. This process can yield different clustering results depending upon the choice of linkage criterion and the choice of maximum distance between clusters. But it has a tendency to yield spherical clusters like the K-means method due to the Euclidean distance measure which is often used for hierarchical clustering.
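As an illustration, agglomerative clustering with the single or complete linkage criterion and Euclidean distance can be run with SciPy; this generic sketch is not the procedure of any of the load research papers cited.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of 10-dimensional vectors (stand-ins for load series).
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(6, 1, (30, 10))])

# Build the tree (dendrogram) bottom-up under the two extreme linkage criteria.
tree_single = linkage(X, method="single", metric="euclidean")
tree_complete = linkage(X, method="complete", metric="euclidean")

# Cut each tree to obtain the desired number of clusters, K = 2.
labels_single = fcluster(tree_single, t=2, criterion="maxclust")
labels_complete = fcluster(tree_complete, t=2, criterion="maxclust")
print(np.unique(labels_single), np.unique(labels_complete))
```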

2.1.3 Follow-the-leader clustering

The follow-the-leader clustering method has been described in Pao and Sobajic (1992). This technique uses an iterative process to compute the cluster centres. The number of clusters, and therefore cluster centres, are automatically derived from a researcher specified distance threshold.

In the first cycle, the algorithm makes the initial allocation of the N vectors to clusters by using a follow-the-leader approach, depending on the distance threshold. The first sample observation x_1 is selected as the centre of the first cluster and the algorithm proceeds to cluster the remaining N − 1 observations in the following way: at each step the (Euclidean) distances between the next observation, x_{i+1}, and all previously formed cluster centres are computed. If the minimal distance is less than the threshold, the observation is assigned to the corresponding cluster at which the minimal distance is achieved. The centre for this cluster is then recomputed as the average of observations already in the cluster and the new observation. If the minimal distance is greater than the threshold, the observation is declared to be the centre of a new cluster. Once all observations have been processed, K clusters and their centres will have been identified. In the second cycle, the algorithm once again goes over all N observations to allocate each to the K clusters identified in the first cycle using the centres identified in that cycle. If an x_i is allocated to a different cluster, the centres of the two affected clusters are updated. Such cycles are repeated to refine the clusters, with possibly different centre values as well as assignments of observations. The procedure stops when a cycle produces no reassignment of the N vectors. The process is controlled by the distance threshold, which has to be chosen by a trial and error approach.

Like the K-means method, the follow-the-leader method will produce spherical clusters. However, here the spheres have a fixed radius determined in advance by the distance threshold, whereas, in the K-means method the radius is determined from the data.
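Below is a minimal sketch of the procedure as described above (a simplified reading of the description, not Pao and Sobajic's implementation); the threshold value is an assumption chosen for the simulated example.

```python
import numpy as np

def follow_the_leader(X, threshold, max_cycles=50):
    """Threshold-driven clustering: the number of clusters is determined by the
    distance threshold rather than fixed in advance."""
    X = np.asarray(X, dtype=float)
    # First cycle: x_1 starts cluster 1; each subsequent observation joins the
    # nearest existing centre if within the threshold, otherwise starts a new cluster.
    centres = [X[0].copy()]
    members = [[0]]
    labels = np.zeros(len(X), dtype=int)
    for i in range(1, len(X)):
        d = np.linalg.norm(np.array(centres) - X[i], axis=1)
        k = int(d.argmin())
        if d[k] < threshold:
            members[k].append(i)
            labels[i] = k
            centres[k] = X[members[k]].mean(axis=0)     # update the affected centre
        else:
            labels[i] = len(centres)
            centres.append(X[i].copy())
            members.append([i])
    # Later cycles: reassign each observation to its nearest centre and update
    # the centres, repeating until a cycle produces no reassignment.
    centres = np.array(centres)
    for _ in range(max_cycles):
        new_labels = np.array([np.linalg.norm(centres - x, axis=1).argmin() for x in X])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centres[k] for k in range(len(centres))])
    return labels, centres

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (40, 5)), rng.normal(4, 0.5, (40, 5))])
labels, centres = follow_the_leader(X, threshold=3.0)
print(len(centres))  # number of clusters implied by this threshold
```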


2.2 Dimension Reduction for Heuristic Clustering

While we can apply the heuristic methods directly to high dimensional data, there are reasons to consider reducing the dimension first and then to cluster the dimension-reduced data. Such reasons include lower computational cost and improved visualization of the clusters and possibly easier interpretations. Dimension reduction is described in the literature as a mapping from a high dimensional space into a reduced dimensional space. Several dimension reduction methods have been proposed by load researchers as the load series are often of extremely high dimension and hence it is desirable to reduce their dimension. The proposed methods include: Fourier regression (Carpaneto et al., 2006), wavelet regression (Petrescu and Scutariu, 2002), and principal component analysis (PCA) (Chicco et al., 2006). We now discuss the basic idea underlying all three dimension reduction methods mentioned here and then give details about each method.

The basic idea is to first project the original observations into a (usually) lower dimensional subspace of dimension p ≤ n. In this subspace, the projection of each observation is represented by a p-vector, z = (z_1, \ldots, z_p)^T, which is referred to as the feature vector (of the original observation vector). The three methods involve three different types of lower dimensional subspaces, each of which is defined by a different orthonormal basis. Let {φ_1, φ_2, \ldots, φ_p} be one such orthonormal basis. The projection of an observation onto the subspace is then given by
$$\hat{x} = \Phi_p \Phi_p^T x,$$
where Φ_p is the n × p matrix [φ_1 | φ_2 | \cdots | φ_p]. It follows that the projection \hat{x} can be represented (in the subspace spanned by {φ_1, φ_2, \ldots, φ_p}) by a lower dimensional vector z whose elements are the coordinates with respect to the orthonormal basis vectors φ_j (j = 1, 2, \ldots, p); that is,
$$z = \Phi_p^T x.$$
This is the feature vector of x. Hence it can be seen that Φ_p^T defines a linear transformation/mapping from R^n into R^p. Depending on the selection of the subspace (basis), significant dimension reduction can be achieved. In principle, one can define the feature vector of x in a lower dimensional subspace with respect to some arbitrary basis, not necessarily orthonormal. While such a feature vector also accomplishes the dimension reduction of the original observation, the feature vector given by the orthonormal basis is the most desirable as it preserves the spherical shape of the original clusters underlying most of the heuristic methods.

Note that when p < n, the mapping Φ_p^T is singular and hence not all the information in the original observation x is preserved by the feature vector z. Because of this, the cluster membership of the original observations may not be completely preserved. Hence one of the important considerations in dimension reduction for cluster analysis is to choose a method which can best balance the need for a reduced dimension and the need to preserve the membership information. Methods that achieve such a balance are highly desirable. The three methods discussed below are among the best dimension reduction methods in use.

2.2.1 Fourier/wavelet regression

We review the Fourier and wavelet regression methods together in this subsection. These regression methods assume that the observation vectors x_i are vectors of values of smooth functions g_i(t) observed at times t = 1, \ldots, n plus measurement error (Ramsay and Silverman, 2005); that is, x_{it} = g_i(t) + \epsilon_{it}. The g_i(t) are functions in the space spanned by basis functions φ_1(t), φ_2(t), \ldots, φ_p(t). Hence
$$x_i = \Phi_p \beta_i + \epsilon_i, \qquad (2.2)$$
where Φ_p is the n × p design matrix with columns φ_j = (φ_j(1), φ_j(2), \ldots, φ_j(n))^T, β_i is a vector of regression coefficients to be estimated, and the ε_i are vectors of measurement errors assumed to have mean 0 and covariance matrix σ^2 I. The regression coefficients β_i are estimated separately for each observation using least squares,
$$b_i = (\Phi_p^T \Phi_p)^{-1} \Phi_p^T x_i.$$
If the basis functions are chosen so that the columns of Φ_p are orthonormal, then
$$b_i = \Phi_p^T x_i. \qquad (2.3)$$
Hence the regression coefficients provide a set of features in a reduced dimensional space for clustering. This treats dimension reduction as a regression problem where x is the response and functions of t, the φ_j(t), are the only explanatory variables. Of course, this finite dimensional representation provides only an approximation to the underlying g_i(t). Once we decide to apply this approach to dimension reduction, the issue of choosing a basis needs to be carefully addressed to ensure the approximation is accurate. Unfortunately, choosing a basis is a difficult problem since no basis will be uniformly optimal for all data. If the process is believed to be uniformly smooth, Fourier functions are often used. On the other hand, if the process is believed to have a number of local features, wavelet bases may be more appropriate. We now look at these two types of basis functions.


Fourier functions

The use of Fourier methods by load researchers was motivated by their recognition that load series tended to exhibit periodicities. This recognition naturally led to the consideration of the use of regression models involving sines and cosines to capture such periodicities. The Fourier functions
$$\phi_1(t) = \sqrt{\tfrac{1}{n}}, \qquad \phi_{2j}(t) = \sqrt{\tfrac{2}{n}} \cos(\omega_j (t-1)), \qquad \phi_{2j+1}(t) = \sqrt{\tfrac{2}{n}} \sin(\omega_j (t-1)),$$
where j = 1, 2, \ldots and ω_j = 2πj/n, constitute a complete orthogonal system in the space of square integrable functions, L^2[0, n]. These functions are periodic with period 2π/ω_j. Because the series are sampled at a finite number of n time points, only the first n Fourier functions are distinguishable. It is well known that in the case of equally sampled time points t = 1, 2, \ldots, n, the n vectors φ_1 = (φ_1(1), φ_1(2), \ldots, φ_1(n))^T, \ldots, φ_n = (φ_n(1), φ_n(2), \ldots, φ_n(n))^T are orthonormal and hence form a basis for R^n.

The existence of a Fast Fourier Transform (FFT) makes it possible to compute all the n Fourier coefficients in O(n log n) time. In contrast, computing coefficients from their definition in (2.3) takes O(n^2) time. Because of the speed of the FFT, when mapping the raw vectors it is often feasible to compute the complete set of coefficients from which a subset of coefficients is selected. Each raw vector is now represented by the corresponding subset of coefficients and clustering analysis is then performed on such Fourier coefficient representations of the raw vectors.
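The sketch below builds the orthonormal Fourier design matrix just described for equally spaced time points and checks that the feature vector b_i = Φ_p^T x_i of (2.3) can equivalently be read off an FFT; it is an illustration under these assumptions, not code from the thesis.

```python
import numpy as np

def fourier_design(n, p):
    """Orthonormal Fourier design matrix Phi_p (n x p): constant column followed
    by cosine/sine pairs at frequencies omega_j = 2*pi*j/n."""
    t = np.arange(n)
    cols = [np.full(n, np.sqrt(1.0 / n))]
    j = 1
    while len(cols) < p:
        w = 2.0 * np.pi * j / n
        cols.append(np.sqrt(2.0 / n) * np.cos(w * t))
        if len(cols) < p:
            cols.append(np.sqrt(2.0 / n) * np.sin(w * t))
        j += 1
    return np.column_stack(cols)

n, p = 365, 9
Phi_p = fourier_design(n, p)
print(np.allclose(Phi_p.T @ Phi_p, np.eye(p)))      # columns are orthonormal

rng = np.random.default_rng(0)
x = rng.gamma(2.0, 5.0, size=n)                     # a stand-in load series
b = Phi_p.T @ x                                     # feature vector, equation (2.3)

# The same coefficients via the FFT (O(n log n) rather than O(n^2)):
F = np.fft.rfft(x)
b_fft = [F[0].real / np.sqrt(n)]
for j in range(1, (p - 1) // 2 + 1):
    b_fft += [np.sqrt(2.0 / n) * F[j].real, -np.sqrt(2.0 / n) * F[j].imag]
print(np.allclose(b, np.array(b_fft)))              # True
```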

Wavelet functions

Wavelets can be thought of as a generalization of the Fourier functions to a much larger family of basis functions than sine and cosine functions. They combine the frequency localization of the Fourier functions with time localization.

The wavelet functions are constructed by choosing a suitable "mother wavelet" function Ψ and then considering all dyadic dilations (a product by powers of two) and translations of Ψ:
$$\Psi_{j,k}(t) = 2^{j/2} \Psi(2^{j} t - k), \qquad j = 0, 1, 2, \ldots;\ k = 0, 1, \ldots, 2^{j} - 1.$$
The mother wavelet is chosen to ensure that the wavelet functions, together with the constant function \sqrt{1/n}, form a complete orthogonal system for the space of square integrable functions, L^2[0, n]. Typically, the mother wavelet has compact support, and hence so do all the basis functions.

The Haar wavelets are the most elementary example of wavelets. The first orthonormal system was constructed using the Haar wavelet (Haar, 1910). The mother wavelet for the Haar wavelets is given by
$$\Psi^{\mathrm{Haar}}(t) = \begin{cases} \sqrt{1/n} & 0 < t \le n/2 \\ -\sqrt{1/n} & n/2 < t \le n \\ 0 & \text{otherwise.} \end{cases}$$
Because the series are sampled at a finite number of n time points, only the first n wavelet functions are distinguishable. If the sampled times are equally spaced and n is a power of 2, it can be shown that the Haar basis vectors are
$$\phi_1 = (\sqrt{1/n}, \ldots, \sqrt{1/n})^T, \ldots, \phi_n = (\Psi^{\mathrm{Haar}}_{j,k}(1), \ldots, \Psi^{\mathrm{Haar}}_{j,k}(n))^T.$$
Here these basis vectors are orthonormal. The power of 2 condition is necessary for orthonormality because of the dyadic scaling nature of wavelets. When the power of 2 condition is not satisfied, a wavelet-like orthonormal basis can still be constructed (Herley and Vetterli, 1994).

For cases where n is a power of 2, the Fast Wavelet Transform (FWT) makes it possible to compute all the n wavelet coefficients in O(n) time, as compared to O(n log n) time for the FFT. As a consequence, the complete set of wavelet regression coefficients can be computed extremely quickly, often in O(n) operations. With the wavelet coefficients, each raw vector is now represented by the corresponding set of coefficients. The dimension reduction is achieved when we select a subset of the coefficients to represent the raw vectors. Subsequent clustering analysis is then performed on the selected subset of coefficients.
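For illustration, a minimal O(n) orthonormal Haar transform (assuming n is a power of 2) can be written in a few lines; this is a generic sketch, not the wavelet code used in the cited load research.

```python
import numpy as np

def haar_transform(x):
    """Fast orthonormal Haar wavelet transform for length a power of 2.
    Returns the overall scaling coefficient followed by detail coefficients,
    coarsest level first."""
    x = np.asarray(x, dtype=float)
    n = x.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    details, approx = [], x
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail (wavelet) coefficients
        approx = (even + odd) / np.sqrt(2.0)         # running averages
    return np.concatenate([approx] + details[::-1])

rng = np.random.default_rng(0)
x = rng.gamma(2.0, 5.0, size=256)                    # stand-in series, 256 = 2^8 points
w = haar_transform(x)
print(np.allclose(x @ x, w @ w))                     # orthonormal: energy is preserved
# Dimension reduction: keep a subset of coefficients as features,
# here the 16 largest in magnitude.
features = w[np.argsort(np.abs(w))[::-1][:16]]
print(features.shape)
```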

2.2.2 Principal component analysis (PCA)

The regression methods that we have considered above make assumptions about the underlying random processes generating the data. The Principal Component Analysis (PCA), on the other hand, is concerned only with the core structure of a sample of N vector observations (Rencher, 2002).

The basic idea of the PCA is to describe the dispersion of the sample of N n-dimensional vector observations by introducing a new orthogonal coordinate system so that the axes line up with the directions along which the data is maximally spread out. PCA performs a coordinate transformation. The new variables (principal components) are uncorrelated linear combinations of the observed variables. The first principal component is the linear combination corresponding to the direction with maximum sample variance. The second principal component is the linear combination corresponding to a direction orthogonal to the first with maximal sample variation, and so on. There are as many principal components as the number of original variables.

If x_1, \ldots, x_N is a set of observations with mean \bar{x} = \sum_{i=1}^{N} x_i / N and sample covariance matrix
$$S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T,$$
then the principal component transformation of the ith observation is given by
$$z_i = \Phi^T x_i,$$
where Φ = (φ_1, φ_2, \ldots, φ_n) is the matrix of normalized eigenvectors of S, sorted in decreasing order of the corresponding eigenvalues λ_1 ≥ \cdots ≥ λ_n > 0. The orthonormality of the transformation follows from the spectral theorem, which states that the eigenvectors of the symmetric matrix S form an orthonormal basis for R^n. By the spectral theorem the sample covariance matrix of z, S_z, is diagonal,
$$S_z = \Phi^T S \Phi = \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}.$$
Hence it follows that the principal component transformation transforms the variables into principal components that are uncorrelated, with variances λ_1 ≥ \cdots ≥ λ_n > 0.

When used for dimension reduction, one usually hopes to summarize most of the variability in the data using only the principal components with the highest variance. In the context of dimension reduction, PCA has an optimal property: among all the linear representations of the raw vectors based on some basis, the PCA representations are the most accurate in terms of mean squared error (Fukunaga, 1990). The advantage of PCA, however, comes at a cost, as PCA is computationally more expensive than the DFT and DWT. The computation of the principal components involves the eigenvalue decomposition of the sample covariance matrix, requiring O(n^2) time. The eigenvalues and eigenvectors can be found by standard methods. For high dimensional spaces the methods involve numerical approximation.
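A minimal sketch of the PCA mapping above, computing the eigenvectors of the sample covariance matrix and projecting each observation onto the first p of them; it is illustrative only and uses simulated data.

```python
import numpy as np

def pca_features(X, p):
    """Project each row of X onto the first p principal component eigenvectors
    of the sample covariance matrix S."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)                # sample covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    Phi_p = eigvecs[:, order[:p]]              # first p principal directions
    return X @ Phi_p, eigvals[order]

rng = np.random.default_rng(0)
# Simulated data whose variability is concentrated in three directions.
scores = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
X = scores @ rng.normal(size=(3, 30)) + rng.normal(scale=0.1, size=(200, 30))
Z, lam = pca_features(X, p=3)
print(Z.shape)                                 # (200, 3) feature vectors
print(np.round(lam[:5] / lam.sum(), 3))        # variance concentrated in first components
```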

We have now discussed three clustering methods and three dimension reduction methods which give rise to nine different (combinations of) clustering methods for high dimensional data. Typically, we would decide on a clustering method we want to use first. The next question is what dimension reduction method is the most effective at preserving the cluster membership of the original data vectors in the context of the chosen clustering method. In the following section, we examine this question for the K-means clustering method.

2.3 Dimension Reduction for K-Means Clustering

As an example, we now consider the selection of the dimension reduction method in the context of using the K-means method. When K-means clustering is applied to a reduced dimensional feature vector, the objective function to be minimized is
$$W_K^p = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| z_i - \sum_{j \in C_k} z_j / N_k \Big\|^2.$$
As we have noted before, when mapping the original data into a reduced dimensional space, some clustering information may be lost; that is, the cluster membership of each original data vector may not be preserved by the transformed vector. This is typically the case when the clustering that minimizes W_K^p is different from the clustering that minimizes W_K (the K-means objective function for the original data). We now examine the consistency of the clustering results based on the original data and those based on the transformed data given by the regression and the PCA dimension reduction. We show that, when the regression approach is used for dimension reduction, clustering results based on the original and the transformed data are approximately consistent, provided the regression model assumptions are true. We also show that the PCA approach is the most natural method of dimension reduction in the context of the K-means method and that it gives consistent clustering results.

2.3.1 Fourier/wavelet regression and K-means

Suppose the data follow the regression model (2.2). Recall that since K-means clustering is based on the Euclidean distance, it is invariant to orthonormal transformations. That is, if Φ is an orthogonal n × n matrix, then the K-means objective functions for the x_i and for the Φ^T x_i are equal and hence share the same minimizing clustering,
$$W_K = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| x_i - \sum_{j \in C_k} x_j / N_k \Big\|^2 = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| \Phi^T x_i - \sum_{j \in C_k} \Phi^T x_j / N_k \Big\|^2.$$
This is just a restatement of the fact that orthogonal transformations preserve Euclidean distance.

Now if the x_i follow the regression model (2.2), then the estimated regression coefficients are given by
$$b_i = \Phi_p^T x_i.$$
The columns of Φ_p are orthonormal and span a subspace of dimension p. We can define V to be an n × (n − p) matrix with orthonormal columns such that Φ = (Φ_p : V) is an orthogonal n × n matrix. By the orthogonality of Φ (which implies V^T Φ_p = 0, so that V^T x_i = V^T ε_i under model (2.2)), we have
$$\Phi^T x_i = \begin{pmatrix} \Phi_p^T x_i \\ V^T x_i \end{pmatrix} = \begin{pmatrix} b_i \\ V^T \epsilon_i \end{pmatrix}.$$
It follows that the K-means criterion for the original data can be expressed as
$$W_K = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| x_i - \sum_{j \in C_k} x_j / N_k \Big\|^2 = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| \Phi^T x_i - \sum_{j \in C_k} \Phi^T x_j / N_k \Big\|^2$$
$$= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\{ \Big\| \Phi_p^T x_i - \sum_{j \in C_k} \Phi_p^T x_j / N_k \Big\|^2 + \Big\| V^T x_i - \sum_{j \in C_k} V^T x_j / N_k \Big\|^2 \Big\}$$
$$= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\{ \Big\| b_i - \sum_{j \in C_k} b_j / N_k \Big\|^2 + \Big\| V^T \epsilon_i - \sum_{j \in C_k} V^T \epsilon_j / N_k \Big\|^2 \Big\}.$$
Since the ε_i have mean 0 and variance σ^2 I, by the law of large numbers
$$\sum_{k=1}^{K} \sum_{i \in C_k} \Big\| V^T \epsilon_i - \sum_{j \in C_k} V^T \epsilon_j / N_k \Big\|^2 \approx N (n - p) \sigma^2,$$
which is a constant. Hence the K-means objective function for the original data x_i is approximately equal to the K-means objective function for the regression coefficients b_i plus a constant,
$$W_K \approx W_K^p + \text{constant} = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| z_i - \sum_{j \in C_k} z_j / N_k \Big\|^2 + \text{constant}.$$
Hence the clustering that minimizes W_K^p, the objective function for K-means clustering of the z_i, is close to the one that minimizes W_K.

The above argument also shows that if the regression model assumptions are violated, little can be said about the relationship between the two objective functions, W_K^p and W_K. In this case, the clustering results based on the original and the transformed data may differ substantially.
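The approximate consistency derived above can be checked numerically. The sketch below simulates series from the regression model (2.2) with a Fourier-type design matrix and compares the K-means clusterings of the raw series and of the coefficient features; it is an illustration under the stated model assumptions, using scikit-learn's KMeans rather than the thesis code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, p, N = 128, 7, 120

# Orthonormal Fourier-type design matrix (constant, cosine and sine columns).
t = np.arange(n)
cols = [np.full(n, np.sqrt(1.0 / n))]
for j in range(1, (p + 1) // 2):
    w = 2.0 * np.pi * j / n
    cols += [np.sqrt(2.0 / n) * np.cos(w * t), np.sqrt(2.0 / n) * np.sin(w * t)]
Phi_p = np.column_stack(cols)

# Simulate from model (2.2): two clusters of coefficient vectors beta_i.
beta_means = rng.normal(scale=4.0, size=(2, p))
true_labels = rng.integers(0, 2, size=N)
B = beta_means[true_labels] + rng.normal(scale=0.5, size=(N, p))
X = B @ Phi_p.T + rng.normal(scale=0.5, size=(N, n))    # x_i = Phi_p beta_i + eps_i

# K-means on the raw series versus on the coefficients b_i = Phi_p^T x_i.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_coef = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X @ Phi_p)
# Agreement up to label switching should be close to 1 under the model.
agree = max(np.mean(labels_raw == labels_coef), np.mean(labels_raw != labels_coef))
print(round(float(agree), 3))
```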

2.3.2 Principal component analysis (PCA) and K-means

Without loss of generality, in this section we assume the original data vectors x_i have mean zero, because the K-means method is invariant to translations of the data.

Dimension reduction based on PCA for K-means clustering has been applied to load series data in Chicco et al. (2006) with success. It has recently been shown, see Ding and He (2004), that the reason PCA often performs so well in practice is that the first K − 1 principal component eigenvectors span the cluster mean subspace, the K − 1 dimensional space spanned by the K cluster means. The dimension of the cluster mean subspace is K − 1 rather than K due to the single linear constraint imposed by the mean zero condition above. The cluster mean subspace is the best subspace for K-means clustering, as the raw data clustering can be recovered from the data projected onto the reduced dimensional cluster mean subspace.

To see that the cluster mean subspace is the best subspace for clustering, denote by $m_1, m_2, \ldots, m_K$ the $K$ cluster means which minimize the K-means objective function. These $K$ cluster means necessarily lie in a common $p$-dimensional subspace, $\mathrm{span}(m_1, m_2, \ldots, m_K)$, where $p = K - 1$. Let $\Phi_p$ be the $n \times p$ matrix whose columns are the first $p$ principal component eigenvectors. Let $V$ be an $n \times (n-p)$ matrix with orthonormal columns such that $\Phi = (\Phi_p : V)$ is an orthogonal $n \times n$ matrix. Then $V^T m_k = 0$ for $k = 1, 2, \ldots, K$, since each $m_k$ is a linear combination of the first $p$ principal component eigenvectors, that is, of the columns of $\Phi_p$. Since $\Phi$ is an orthogonal matrix, the K-means clustering of the $x_i$ will yield identical results to the K-means clustering of the $\Phi^T x_i$. We now make use of this point to show that the K-means clustering results for the $\Phi_p^T x_i$ are also the same. We do this by relating the K-means objective functions for the original data $x_i$ and the transformed data $\Phi_p^T x_i$. We have
$$
\begin{aligned}
W_K &= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| x_i - \sum_{j \in C_k} x_j/N_k \Big\|^2
     = \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| \Phi^T x_i - \Phi^T \sum_{j \in C_k} x_j/N_k \Big\|^2 \\
    &= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\{ \Big\| \Phi_p^T x_i - \Phi_p^T \sum_{j \in C_k} x_j/N_k \Big\|^2
       + \Big\| V^T x_i - V^T \sum_{j \in C_k} x_j/N_k \Big\|^2 \Big\} \\
    &= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| z_i - \sum_{j \in C_k} z_j/N_k \Big\|^2
       + \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| V^T x_i - V^T \sum_{j \in C_k} x_j/N_k \Big\|^2 . \qquad (2.4)
\end{aligned}
$$

Note that in the above expression the first term on the right-hand side is $W_K^p$ and the second term, which we denote by $g(C)$, is
$$
\begin{aligned}
g(C) &= \sum_{k=1}^{K} \sum_{i \in C_k} \Big\| V^T x_i - V^T \sum_{j \in C_k} x_j/N_k \Big\|^2 \\
     &= \sum_{i=1}^{N} \| V^T x_i \|^2 - \sum_{k=1}^{K} \Big( V^T \sum_{j \in C_k} x_j \Big)^T V^T \sum_{j \in C_k} x_j \Big/ N_k \\
     &= \sum_{i=1}^{N} \| V^T x_i \|^2 - h(C), \qquad (2.5)
\end{aligned}
$$
where $C$ refers to the clustering assignment $(C_1, C_2, \ldots, C_K)$. In (2.5), the first term is a constant and the second term $h(C)$ is greater than or equal to zero, with equality achieved at $C = C^*$, the solution to the K-means clustering problem for the original data. To see the latter point, note that
$$
h(C^*) = \sum_{k=1}^{K} N_k \Big( V^T \sum_{j \in C_k^*} x_j/N_k \Big)^T V^T \sum_{j \in C_k^*} x_j/N_k
       = \sum_{k=1}^{K} N_k (V^T m_k)^T V^T m_k .
$$
The term on the right-hand side is zero since, as noted above, $V^T m_k = 0$ for $k = 1, 2, \ldots, K$. It follows that $g(C)$ is maximized at $C^*$. By (2.4) and (2.5),
$$
W_K^p(C) = W_K(C) - g(C).
$$

Hence $W_K^p$ would be minimized by a $C$ which minimizes $W_K$ and maximizes $g(C)$ at the same time. From the above discussion, we see that $C^*$ satisfies both of these conditions. Thus $C^*$ also minimizes $W_K^p$; that is, the K-means clustering solution based on the PCA transformed data $\Phi_p^T x_i$ is the same as that for the original untransformed data.
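To illustrate this result, the sketch below (the simulated clusters, parameter values, and use of scikit-learn are assumptions for the example, not from the thesis) clusters a synthetic data set once in the full dimension and once after projecting onto the first $K-1$ principal components; the exact equivalence above concerns the optimal clustering, and for well-separated clusters the two computed partitions typically coincide, which an adjusted Rand index near 1 confirms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
n, N, K = 365, 500, 3                      # dimension, sample size, number of clusters

# Three well-separated clusters in R^n (illustrative values only).
means = rng.normal(scale=8.0, size=(K, n))
labels_true = rng.integers(0, K, N)
X = means[labels_true] + rng.normal(size=(N, n))
X = X - X.mean(axis=0)                     # centre the data (mean-zero assumption)

Z = PCA(n_components=K - 1).fit_transform(X)   # project onto the first K-1 PCs

lab_full = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
lab_pca  = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(Z)
print(adjusted_rand_score(lab_full, lab_pca))  # close to 1 when the partitions agree
```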

2.4 Cluster Validity Assessment

The clustering results will vary with the clustering and dimension reduction methods used and, of course, with the number of clusters $K$ chosen. It is thus necessary to assess the validity of clustering results and to identify the most reasonable clustering. To this end, a number of cluster validity indicators have been proposed in the literature. The cluster validity indicators considered in the load research literature are mostly numerical measures that capture both the within-cluster variability and the between-cluster variability; see Chicco et al. (2006). The smaller the values of an indicator, the better the clustering. A typical application is to compute the indicator for a range of values of $K$ and plot the results against $K$, which provides a means of identifying the most plausible $K$ for the data set.
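As an illustration of such a procedure (a sketch under assumed data; the Davies-Bouldin index is used here as one example of a within/between variability indicator and may differ from the particular indicators used in the load research literature), one can compute the indicator over a range of $K$ and look for the value of $K$ with the smallest index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def validity_curve(X, k_range):
    """Davies-Bouldin index of the K-means solution for each K (smaller is better)."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return scores

# Example: inspect the curve and pick the K with the smallest index value.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, size=(100, 24)) for c in (0.0, 4.0, 8.0)])
for k, s in validity_curve(X, range(2, 8)).items():
    print(k, round(s, 3))
```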

It should be noted that the clustering methods as well as the cluster validity indicators in the load research literature all rely on the Euclidean distance to measure distances between vectors. As previously noted, the clusters are thus implicitly assumed to have a spherical shape. This implicit assumption on the geometric shape of the clusters, while convenient, makes it difficult to identify clusters in cases where they do not have spherical shapes. Unfortunately, the cluster validity indicators based on the Euclidean distance are ineffective in detecting such cases, so alternative methods of clustering and validity assessment are needed to deal with them.

2.5 Example: Application of K-means Clustering Method

Many real applications of the clustering methods and validity indicators can be found in the load research literature, e.g., Chicco et al. (2006). Hence we will not include a real application here simply to illustrate their implementations. Instead, we wish to demonstrate through a simulated example that, although the above Euclidean distance based methods have been widely used in load research, they are not particularly suitable due to the implicit assumption of spherical clusters with equal radius imposed on the data. Indeed, the load series are time series and as such they have non-trivial correlation and variance. Hence the covariance matrix of a load series will typically not have a $\sigma^2 I$ structure.

We generated a data set which contains observations from three clusters, each of which is based on one of the three BC Hydro rate classes: residential, commercial under 35kW and commercial over 35kW. Specifically, each cluster was generated by randomly sampling from a multivariate normal distribution with mean and covariance matrix equal to those of the corresponding rate class. The mean of a rate class here simply refers to the sample mean of the rate class, but the covariance matrix is not the sample covariance matrix. This is because the sample covariance matrix is highly unstable (in fact singular for the residential and commercial under 35kW rate classes) due to the high dimension of $n = 365$ relative to the sample size of each rate class. So to specify the covariance of each rate class, we assume it has a simple parametric structure which allows us to estimate it with the standard maximum likelihood estimator. A simple covariance structure with non-trivial correlation that can be assumed is compound symmetry. Under this assumption, the covariance matrix is fully defined by two parameters, $\sigma_1^2$ and $\rho_1$: the first is the variance, assumed constant over time, and the second is the correlation between any two distinct time points, also assumed constant. The MLE of the compound symmetric covariance matrix may be found by averaging corresponding elements of the sample covariance matrix; see Szatrowski (1980). The original BC Hydro data contains 207, 87 and 629 observations from the three rate classes, respectively. The size of each cluster in the simulated data set is equal to the sample size of the corresponding rate class. This synthetic data set preserves the mean vector and the covariance matrix of the load series in each rate class, but it assumes that the underlying distribution of each component is multivariate normal. Figure (2.1) shows the simulated series.
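The following sketch outlines how such a data set could be simulated; it is an illustration only, not the thesis code. The class means and the $(\sigma_1^2, \rho_1)$ values below are hypothetical placeholders, since the estimates for the actual BC Hydro rate classes are not reproduced here; the compound symmetric MLE helper simply averages the corresponding elements of a sample covariance matrix, as described above.

```python
import numpy as np

def compound_symmetric_cov(n, sigma2, rho):
    """Compound symmetric covariance: sigma2 on the diagonal, sigma2*rho off the diagonal."""
    return sigma2 * ((1.0 - rho) * np.eye(n) + rho * np.ones((n, n)))

def cs_mle(S):
    """Estimate (sigma^2, rho) under compound symmetry by averaging the corresponding
    elements of the sample covariance matrix S (cf. Szatrowski, 1980)."""
    n = S.shape[0]
    sigma2 = np.mean(np.diag(S))
    rho = (S.sum() - np.trace(S)) / (n * (n - 1)) / sigma2
    return sigma2, rho

rng = np.random.default_rng(4)
n = 365
# Hypothetical (mean, sigma^2, rho, size) for each simulated rate class.
classes = [
    (np.full(n, 1.0), 0.2, 0.6, 207),   # "residential"
    (np.full(n, 2.0), 0.5, 0.5,  87),   # "commercial under 35kW"
    (np.full(n, 8.0), 9.0, 0.4, 629),   # "commercial over 35kW"
]
samples = [rng.multivariate_normal(m, compound_symmetric_cov(n, s2, r), size=N)
           for m, s2, r, N in classes]
X = np.vstack(samples)                                   # 923 simulated load series
labels = np.repeat(np.arange(len(classes)), [N for _, _, _, N in classes])
print(X.shape, np.bincount(labels))                      # (923, 365) [207  87 629]
```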

The K-means method was applied to the full dimensional data in this simulated sample of 923 load series. Because the data is known to consist of 3 clusters, K was set equal to 3. The K-means method was also applied to the reduced dimensional data given by the first two principal components. After repeated runs, as expected, both methods found a common solution. Figure (2.2) shows the plot of the first two principal components of the data points, along with the boundaries of the three clusters identified by the K-means method. The true cluster memberships of the corresponding original observations are colour coded. It can be seen from this figure that the K-means based clusters differ substantially from the true clusters. In particular, the blue cluster is divided into three clusters and the red and black clusters are clustered together.

Table (2.1) further illustrates the failure of the K-means method for this example. It shows the matching matrix, which represents the best agreement between the K-means based clusters and the real clusters.

Figure 2.1: Clusters simulated from multivariate normal distributions with compound symmetric covariance structure. Panels: (a) simulated residential; (b) simulated commercial under 35kW; (c) simulated commercial over 35kW. The bold curve in each plot is the mean curve.

Figure 2.2: Plot of the first two principal components with boundaries of the three clusters given by the K-means method. The inset plot magnifies the region containing the residential and commercial under 35kW simulated clusters.

The diagonal elements in the 3 × 3 matrix give the number of correctly classified data points, while the off-diagonal elements in a row represent the number of misclassifications into the other two clusters. Although all of the simulated residential observations are grouped together in one cluster and all of the commercial under 35kW observations are grouped together in one cluster, these two groups are placed in the same K-means cluster along with 276 commercial over 35kW observations. The degree of clustering disagreement in the matching matrix can be summarized by the classification error (defined in Meila (2005)), a simple single-number summary which takes values between 0 and 1, with 0 corresponding to the case where the two clusterings are identical. For this example, the classification error of the K-means method is 49.5%, which indicates again that the K-means method is not satisfactory.

Table 2.1: Matching matrix for the K-means clustering versus the simulated clustering.

                         K-means Clusters
  Simulated Clusters     C'1    C'2    C'3    Number of Subjects
  C1                     207      0      0    207
  C2                      87      0      0     87
  C3                     276     94    259    629
  Number of Subjects     570     94    259    923

  (classification error = 49.5%)
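For completeness, here is one way the matching matrix and the classification error could be computed; this is a sketch rather than the thesis's code, and it assumes the classification error is one minus the largest proportion of agreement over all one-to-one matchings of cluster labels (the definition attributed to Meila (2005) above). SciPy's Hungarian-algorithm routine `linear_sum_assignment` finds the best matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_matrix(true_labels, pred_labels):
    """Contingency table: rows = true clusters, columns = predicted clusters."""
    M = np.zeros((np.max(true_labels) + 1, np.max(pred_labels) + 1), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        M[t, p] += 1
    return M

def classification_error(true_labels, pred_labels):
    """1 minus the best-matched agreement rate."""
    M = matching_matrix(true_labels, pred_labels)
    rows, cols = linear_sum_assignment(-M)      # maximize the matched counts
    return 1.0 - M[rows, cols].sum() / M.sum()

# The matching matrix of Table 2.1 gives a classification error of about 49.5%.
M = np.array([[207, 0, 0],
              [ 87, 0, 0],
              [276, 94, 259]])
rows, cols = linear_sum_assignment(-M)
print(1.0 - M[rows, cols].sum() / M.sum())      # approximately 0.495
```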

2.6 Conclusion

We have reviewed three clustering methods and three dimension reduction methods, all of which require no information on the underlying population from which the data are drawn. While these have been used extensively in the computing science and engineering literature and can be effective in many cases, they are essentially heuristic and nonparametric in nature, as they are simply numerical means of clustering a set of data into spherical clusters without having to consider the shape of the underlying population clusters. The numerical example in the previous section has demonstrated a substantial weakness in such a heuristic approach; the disregard for the shape of the underlying population clusters can lead to high classification error when their shape is not spherical. This is indeed the case for the load data, and hence more sophisticated methods which can take the shape into consideration need to be developed. In the next chapter, we will consider model based clustering, where each underlying population cluster is assumed to follow a probability distribution chosen from a larger family of distributions (ranging from spherical to ellipsoidal) and the data set is modelled as a mixture of these distributions.
