Fast kernel spectral clustering based on incomplete
Cholesky factorization for large scale data analysis
Mihály Novák a,∗, Carlos Alzate b,c, Rocco Langone c, Johan A.K. Suykens c
a Institute for Nuclear Research of the Hungarian Academy of Sciences, MTA ATOMKI, 18/c Bem tér, H-4026 Debrecen, Hungary
b Smarter Cities Technology Centre, IBM Research Ireland, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland
c Department of Electrical Engineering, ESAT-STADIUS, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Abstract
A fast spectral clustering algorithm, suitable for large scale data clustering problems, is presented in this work. The algorithm is an improved version of the sparse multiway spectral clustering algorithm formulated as weighted kernel principal component analysis (PCA) within the primal-dual least-squares support vector machine (LS-SVM) framework. Sparsity is introduced by combining the reduced set method with an incomplete Cholesky factorization based low rank approximation of the kernel matrix. The out-of-sample capability of the clustering model and the possibility of model selection provide the opportunity to obtain the optimal model parameters through classical training-validation phases, and the obtained optimal model can be easily extended to unseen data points. However, the computational complexity of the original sparse algorithm is quadratic in the training set size, which prevented the application of the method to large scale data analysis problems. This quadratic computational complexity is reduced to linear in the proposed algorithm, while attractive properties such as the simple out-of-sample extension and the possibility of model selection through training and validation are preserved. Furthermore, a new, soft cluster membership encoding-decoding scheme and a corresponding model selection criterion are also proposed. These modifications make the algorithm capable of solving much larger clustering problems than the original one, with a refined model selection procedure. The significantly reduced computational complexity and the improved accuracy and sparsity of the presented algorithm compared to the original version are illustrated by experimental results obtained on difficult toy problems as well as real life problems such as image segmentation.
Keywords: spectral clustering, sparse model, large-scale data, kernel methods,
LS-SVM
∗Corresponding author
1. Introduction
Clustering algorithms comprise unsupervised, exploratory data mining techniques with the common goal of grouping objects that are similar according to some similarity measure. Beyond classical clustering algorithms such as K-means, a set of methods has been developed that makes use of some eigenvectors of a normalized affinity matrix derived from the available data. These spectral clustering algorithms are known to perform well even in cases where classical methods fail and have been successfully applied to data analysis, image processing, pattern recognition and many other machine learning related problems.

The classical formulation of spectral clustering algorithms [1, 2, 3, 4, 5, 6] starts from a graph partitioning problem that is NP-hard due to the discrete constraint on the cluster indicators. By relaxing the discrete constraint, the problem is transformed into an eigenvalue problem involving a normalized affinity matrix formed from pairwise similarities between the data points at hand. The discrete cluster indicators are then inferred from some selected eigenvectors that give the optimal solution of the relaxed problem.

Traditional spectral clustering models are typically based on the full set of data due to the lack of a clear extension to unseen data points. As a consequence, the memory usage and computation time of these algorithms grow quickly with increasing data size due to the increasing size of the involved eigenvalue problem. A possible solution to overcome these problems is offered by the spectral clustering algorithms [7, 8] based on the Nyström method [9]. A clustering model is trained on a subset of data points. The training phase involves the approximation of the implicit eigenfunctions by means of the Nyström method. The cluster membership of out-of-sample data points can then be determined by using this approximation. However, the underlying clustering model is not known and the parameter selection is done in a heuristic way.

Unlike the classical spectral clustering algorithms, which are formulated on graph theoretical bases, a clustering algorithm was proposed in [10, 11] as a weighted form of kernel principal component analysis [12, 13] derived within the least-squares support vector machine (LS-SVM) formalism [14] by using a primal-dual optimization framework. This kernel spectral clustering (KSC) algorithm provides a straightforward way to extend the clustering model, trained on a subset of data points, to unseen points through simple projections onto some eigenvectors obtained during the training phase. Furthermore, a model selection criterion can be derived by making use of the known structure of these eigenvectors in the ideal case. The optimal model parameters, such as the number of clusters or the kernel parameters, can be determined through the maximization of this model selection criterion. Therefore, this kernel spectral clustering model can be trained on a subset of data, validated and extended to out-of-sample points similarly to supervised methods, in spite of the unsupervised nature of the clustering problem.

The out-of-sample extension capability of the KSC model directly provides the possibility of sparse model construction by training the model on a selected subset of the full data set. It has been shown in [11] that the KSC model outperforms the Nyström approximation based clustering algorithm [8] both in terms of accuracy and computation time, even in case of random training set selection. Moreover, the computation time and the memory requirement of the original KSC model can be significantly reduced by using a clever training set selection that can notably increase the sparsity of the clustering model while improving the accuracy. Such sparse versions of the original KSC model were proposed based on the combination of the reduced set method [15, 16] with the incomplete Cholesky decomposition (ICD) based low rank approximation of the kernel matrix in [17] and by using a Rényi entropy maximization based training set selection in [18].

The sparse version of the KSC model, based on the combination of the ICD and the reduced set method, was shown to be able to perfectly solve, with a small reduced set size, problems that are very hard even for spectral clustering algorithms due to the properties of the corresponding spectrum [17] (see Sections 3.2.3 and 5.1). However, the computational complexity of this algorithm is quadratic in the training set size, which prevented its application to large scale problems.

An improved version of the ICD based sparse KSC algorithm is presented in this paper. The quadratic computational complexity in the training set size is reduced to linear, while all the attractive properties, such as the simple out-of-sample extension and the possibility of model selection through training and validation, are preserved. A new, soft cluster membership encoding-decoding scheme and a corresponding new model selection criterion are also proposed, both for the original non-sparse KSC algorithm and for the presented sparse version. These modifications make the algorithm capable of solving much larger clustering problems than the original one.

The paper is organized as follows: the spectral clustering problem and the classical algorithms are introduced in Section 2. The original non-sparse KSC model is reviewed in Section 3. Important properties of some eigenvectors and the corresponding score variables, which play an essential role in the spectral clustering interpretation of the weighted kernel PCA and in the introduced model selection criterion, are also discussed there. In Section 4, first the original ICD based sparse KSC model and algorithm is reviewed by highlighting the computational bottlenecks, then a new algorithm is proposed. The significantly improved computational time, sparsity and accuracy of the proposed algorithm over the original version are illustrated through computational experiments in Section 5.
2. Spectral clustering
Given a set of data points $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, together with some similarity measure $s : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ between pairs of data points $x_i, x_j$, the goal of clustering is to partition the input data set such that points in the same subset are similar and points in different subsets are dissimilar. An intuitive form of representing the data with the given similarity measure is a similarity graph $G = (V, E)$, where $V$ is the set of vertices, representing the data points in the graph, and $E$ is the set of edges between the vertices, weighted by the similarity between the corresponding data points. If the similarity measure is symmetric then the graph $G$ is undirected and a symmetric similarity matrix $S$ can be constructed from $E$ such that $S \in \mathbb{R}^{N \times N}_{\geq 0}$, $[S]_{ij} = s_{ij} = s(x_i, x_j) = s(x_j, x_i)$, $\forall i, j \in \{1, \dots, N\}$. The degree $d_i$ of a vertex, which represents the sum of the weights attached to this node, i.e. the sum of similarities between data point $x_i$ and all the data points in $\mathcal{D}$, is defined as $d_i = \sum_{j=1}^{N} s(x_i, x_j) = \sum_{j=1}^{N} [S]_{ij}$. The degree matrix $D$, associated to the given similarity graph $G$, is a diagonal matrix with diagonal elements $[d_1, \dots, d_N]$.
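As a small illustration of these definitions, the sketch below (not from the paper; the RBF similarity, the bandwidth `gamma` and the toy data are our own assumptions) builds the similarity matrix $S$, the degree vector and the degree matrix $D$ with NumPy.

```python
import numpy as np

def similarity_and_degree(X, gamma=1.0):
    """RBF similarity matrix S, degree vector d and diagonal degree matrix D
    for the data points stored in the rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq_dists / gamma)       # [S]_ij = s(x_i, x_j), symmetric
    d = S.sum(axis=1)                   # d_i = sum_j [S]_ij
    return S, d, np.diag(d)

# toy data: two well separated groups in R^2
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 8.0])
S, d, D = similarity_and_degree(X, gamma=2.0)
```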
The above clustering problem can now be reformulated, based on the similarity graph, as a graph partitioning problem [2, 6]. The goal is to find a partition of the vertices $V$ into $K$ disjoint subsets $\{A_1, \dots, A_K\}$, $V = A_1 \cup \dots \cup A_K$ and $A_k \cap A_l = \emptyset$, $\forall k, l \in \{1, \dots, K\}$, $k \neq l$, such that the edges between vertices belonging to the same subset have a high weight while the edges between vertices from different subsets have a low weight. Formulating this directly leads to the so-called mincut problem [19]. However, mincut can provide unbalanced partitions. The mincut objective can be modified to overcome this problem by involving the size of the subsets $A_k$ in the objective. The size of a subset can be measured either by the cardinality of the subset $|A_k|$ or by summing up the weights of all edges attached to the vertices that belong to the subset, $\mathrm{vol}(A_k) = \sum_{i \in A_k} d_i$. The former leads to the RatioCut [4] while the latter results in the Ncut objective [1]. Only the Ncut graph partitioning will be discussed in the following.
Based on the above discussion, the Ncut objective [1] can be written as
$$\mathrm{Ncut}\{A_1, \dots, A_K\} = \sum_{k=1}^{K} \sum_{i \in A_k,\, j \notin A_k} \frac{[S]_{ij}}{\mathrm{vol}(A_k)} \qquad (1)$$
One can introduce binary cluster indicator vectors $f_k \in \{0,1\}^N$, $k = 1, \dots, K$, such that the $i$-th component of $f_k$ is 1 if the $i$-th data point belongs to the $k$-th cluster, $i \in A_k$, and zero otherwise. The cluster indicator matrix $F \in \{0,1\}^{N \times K}$ is formed by these mutually orthogonal $f_k$ cluster indicator vectors. Defining the vectors $h_k = D^{1/2} f_k / \|D^{1/2} f_k\|_2$, $k = 1, \dots, K$, and the orthonormal matrix $H = [h_1, \dots, h_K]$, the Ncut objective given by Eq. (1) can be written as
$$\mathrm{Ncut}\{A_1, \dots, A_K\} = \sum_{k=1}^{K} \left(1 - \sum_{i \in A_k,\, j \in A_k} \frac{[S]_{ij}}{\mathrm{vol}(A_k)}\right) = K - \mathrm{Tr}\!\left(H^T D^{-\frac{1}{2}} S D^{-\frac{1}{2}} H\right) \qquad (2)$$
The Ncut minimization problem can then be reformulated as the following trace maximization
$$\operatorname*{argmin}_{H}\, \mathrm{Ncut}\{A_1, \dots, A_K\} = \operatorname*{argmax}_{H}\, \mathrm{Tr}\!\left(H^T L_{\mathrm{sym}} H\right) \quad \text{such that} \quad H^T H = I_K \qquad (3)$$
by introducing the (symmetric) normalized graph Laplacian $L_{\mathrm{sym}} = D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$ [6]. This optimization problem is NP-complete due to the discrete constraint on $H$ [1]. However, relaxing this discrete constraint by allowing $H$ to be a real valued matrix $\tilde{H} \in \mathbb{R}^{N \times K}$ yields [20]
$$\operatorname*{argmin}_{\tilde{H}}\, \mathrm{Ncut}\{A_1, \dots, A_K\} = \operatorname*{argmax}_{\tilde{H}}\, \mathrm{Tr}\!\left(\tilde{H}^T L_{\mathrm{sym}} \tilde{H}\right) \quad \text{such that} \quad \tilde{H}^T \tilde{H} = I_K \qquad (4)$$
According to the Ky Fan theorem [21], a solution of this relaxed trace maximization problem can be given by taking the eigenvectors $\alpha_k$, $k = 1, \dots, K$, associated to the $K$ highest eigenvalues $\lambda_k$ of the matrix $L_{\mathrm{sym}} = D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$, as columns of $\tilde{H}$. Hence, a continuous solution of the Ncut partitioning problem can be obtained by solving the eigenvalue problem $L_{\mathrm{sym}}\alpha = \lambda\alpha$.
Multiplying both sides from the left by $D^{-\frac{1}{2}}$ and introducing the vectors $\beta = D^{-\frac{1}{2}}\alpha$, the above eigenvalue problem can be transformed to $L_{\mathrm{rw}}\beta = \lambda\beta$, where $L_{\mathrm{rw}} = D^{-1} S$ is the so-called random walk normalized graph Laplacian [6]. Its name expresses the strong relation of this normalized graph Laplacian to a Markov random walk on a graph [22, 5]. $L_{\mathrm{rw}}$ is a row stochastic matrix whose $ij$-th entry represents the probability of moving from the $i$-th vertex to the $j$-th one in one step of the random walk. Therefore, $L_{\mathrm{rw}}$ can be interpreted as a stochastic transition matrix and plays a key role in the Markov random walk interpretation of clustering. Within this representation, the Ncut minimization corresponds to looking for a partition of the graph such that the random walk, described by $L_{\mathrm{rw}}$ as transition matrix, remains most of the time on vertices that belong to the same subset of the partition and there are only a few jumps between vertices that belong to different subsets of the partition [5, 6].
The last step of the clustering, either by solving the eigenvalue problem involving the symmetric $L_{\mathrm{sym}}$ matrix or the generalized eigenvalue problem derived from the non-symmetric $L_{\mathrm{rw}}$ matrix, is to convert the continuous relaxed solutions $\alpha^{(k)}$, $k = 1, \dots, K$, or $\beta^{(k)}$, $k = 1, \dots, K$, into discrete cluster indicators. This is usually done by applying simple K-means to the $N$ $K$-dimensional points formed by the corresponding components of the obtained $K$ leading eigenvectors. The latter choice, involving $L_{\mathrm{rw}}$, gives the spectral clustering algorithm of Shi and Malik [1] while the former, involving $L_{\mathrm{sym}}$, leads to the NJW algorithm [3].
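To make this last step concrete, the following minimal sketch (our own illustrative code, not the authors' implementation) performs the relaxed Ncut clustering on a given similarity matrix: it computes the $K$ leading eigenvectors through the symmetric matrix $D^{-1/2} S D^{-1/2}$, transforms them back to eigenvectors of $L_{\mathrm{rw}}$ via $\beta = D^{-1/2}\alpha$, and applies K-means to the rows; scikit-learn is used only for convenience.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_ncut(S, K):
    """Relaxed Ncut clustering: K leading eigenvectors of L_rw = D^{-1} S
    (computed via the symmetric D^{-1/2} S D^{-1/2}) followed by K-means
    on the rows of the N x K eigenvector matrix."""
    d_isqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L_sym = d_isqrt[:, None] * S * d_isqrt[None, :]   # D^{-1/2} S D^{-1/2}
    _, evecs = np.linalg.eigh(L_sym)                  # eigenvalues in ascending order
    alpha = evecs[:, -K:]                             # K leading eigenvectors
    beta = d_isqrt[:, None] * alpha                   # beta = D^{-1/2} alpha
    return KMeans(n_clusters=K, n_init=10).fit_predict(beta)

# usage with a symmetric similarity matrix S (e.g. built as in the sketch above):
# labels = spectral_clustering_ncut(S, K=2)
```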
3. Kernel spectral clustering through weighted kernel PCA
A multiway kernel spectral clustering algorithm with out-of-sample extension capability was introduced in [11], based on the primal-dual least-squares support vector machine (LS-SVM) [14, 13] formulation of weighted kernel PCA. The algorithm will be reviewed in this section together with some important properties of the solutions and the corresponding score variables that are essential for the kernel spectral clustering interpretation of the weighted kernel PCA. Furthermore, a new cluster membership encoding-decoding scheme and a model selection criterion will be proposed based on some of these properties.
3.1. LS-SVM formulation of weighted kernel PCA
Given an input data set $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, together with the corresponding weights $\mathcal{V} = \{v_l\}_{l=1}^N$, $v_l \in \mathbb{R}^+$, the goal of weighted kernel PCA is to find the vectors $w^{(k)}$, $k = 1, \dots, K < N$, such that the weighted variance of the projections of the weight centered feature map onto these $w^{(k)}$ vectors is maximal. Following the LS-SVM formulation of the unweighted version of the kernel PCA problem, given in [13], the weighted kernel PCA optimization can be formulated as follows:
$$\max_{w^{(k)}, b^{(k)}} \sum_{k=1}^{K} \sum_{l=1}^{N} v_l \left[0 - \left(w^{(k)T}\varphi(x_l) + b^{(k)}\right)\right]^2 \qquad (5)$$
where $\varphi(\cdot) : \mathbb{R}^d \to \mathbb{R}^{n_h}$ is the non-linear mapping of the $d$-dimensional input data to an $n_h$-dimensional high-dimensional feature space and $b^{(k)} \in \mathbb{R}$, $k = 1, \dots, K$, are bias terms¹. This maximization problem can be written as
$$\max_{w^{(k)}, e^{(k)}, b^{(k)}} \sum_{k=1}^{K} e^{(k)T} V e^{(k)} \qquad (6)$$
by introducing the diagonal weight matrix $V \in \mathbb{R}^{N \times N}$, $[V]_{ii} = v_i$, $i = 1, \dots, N$, the feature map matrix $\Phi \in \mathbb{R}^{N \times n_h}$ as
$$\Phi = \begin{bmatrix} \varphi^T(x_1) \\ \vdots \\ \varphi^T(x_N) \end{bmatrix} \qquad (7)$$
and the error vectors $e^{(k)} \in \mathbb{R}^N$, $k = 1, \dots, K$, as
$$e^{(k)} = \Phi w^{(k)} + b^{(k)} 1_N, \qquad e^{(k)} = \begin{bmatrix} e^{(k)}_1 \\ \vdots \\ e^{(k)}_N \end{bmatrix} = \begin{bmatrix} w^{(k)T}\varphi(x_1) + b^{(k)} \\ \vdots \\ w^{(k)T}\varphi(x_N) + b^{(k)} \end{bmatrix} \qquad (8)$$
The corresponding primal optimization problem can be formulated as [11]
$$\max_{w^{(k)}, e^{(k)}, b^{(k)}} J(w, e, b) = \frac{1}{2}\sum_{k=1}^{K} \gamma^{(k)} e^{(k)T} V e^{(k)} - \frac{1}{2}\sum_{k=1}^{K} w^{(k)T} w^{(k)} \quad \text{such that} \quad e^{(k)} = \Phi w^{(k)} + b^{(k)} 1_N,\ k = 1, \dots, K \qquad (9)$$
¹ It can be shown that using the bias term approach leads to the same weight centered kernel matrix as the explicit weighted centering of the feature map (see Appendix A).
where $\gamma^{(k)}$ are real positive constants. The Lagrangian of this constrained optimization problem is
$$\mathcal{L}(w^{(k)}, e^{(k)}, b^{(k)}; \beta^{(k)}) = \frac{1}{2}\sum_{k=1}^{K}\gamma^{(k)} e^{(k)T} V e^{(k)} - \frac{1}{2}\sum_{k=1}^{K} w^{(k)T} w^{(k)} - \sum_{k=1}^{K}\beta^{(k)T}\left(e^{(k)} - \Phi w^{(k)} - b^{(k)} 1_N\right) \qquad (10)$$
with Lagrange multiplier vectors $\beta^{(k)} \in \mathbb{R}^N$, $k = 1, \dots, K$. The Karush-Kuhn-Tucker (KKT) optimality conditions yield
$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial w^{(k)}} = 0 \;&\rightarrow\; w^{(k)} = \Phi^T\beta^{(k)} \\
\frac{\partial\mathcal{L}}{\partial e^{(k)}} = 0 \;&\rightarrow\; \beta^{(k)} = \gamma^{(k)} V e^{(k)} \\
\frac{\partial\mathcal{L}}{\partial b^{(k)}} = 0 \;&\rightarrow\; 1_N^T\beta^{(k)} = 0 \\
\frac{\partial\mathcal{L}}{\partial \beta^{(k)}} = 0 \;&\rightarrow\; e^{(k)} - \Phi w^{(k)} - b^{(k)} 1_N = 0
\end{aligned} \qquad (11)$$
for $k = 1, \dots, K$. Eliminating the primal variables $w^{(k)}$ and $e^{(k)}$ by substituting the first and second ($e^{(k)} = V^{-1}\beta^{(k)}/\gamma^{(k)}$) equations into the fourth results in
$$\frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \Phi\Phi^T\beta^{(k)} - 1_N b^{(k)} = 0, \quad k = 1, \dots, K \qquad (12)$$
Then applying the third KKT condition gives the expression for the last primal variable $b^{(k)}$
$$1_N^T\beta^{(k)} = \gamma^{(k)} 1_N^T V \Phi\Phi^T\beta^{(k)} + \gamma^{(k)} 1_N^T V 1_N b^{(k)} = 0 \;\rightarrow\; 1_N^T V 1_N b^{(k)} = -1_N^T V \Phi\Phi^T\beta^{(k)} \;\rightarrow\; b^{(k)} = -\frac{1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} \qquad (13)$$
for $k = 1, \dots, K$. Substituting this back into Eq. (12) gives the following equations for the stationary points of the Lagrangian expressed in terms of the dual variables $\beta^{(k)}$
$$\frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \Phi\Phi^T\beta^{(k)} + \frac{1_N 1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} = 0 \;\rightarrow\; \frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \left(I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N}\right)\Phi\Phi^T\beta^{(k)} = 0 \qquad (14)$$
$k = 1, \dots, K$, where $I_N$ is the identity matrix. Multiplying both sides by the diagonal weight matrix $V$ from the left and substituting $\lambda^{(k)} = 1/\gamma^{(k)}$, the solution can be expressed in terms of the dual variables $\beta^{(k)}$ as the following eigenvalue problem involving the dual variables as eigenvectors
$$V M_v \Omega\beta^{(k)} = \lambda^{(k)}\beta^{(k)}, \quad k = 1, \dots, K \qquad (15)$$
where
$$M_v = I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N} \qquad (16)$$
is the weighted centering matrix and $\Omega = \Phi\Phi^T$ is the kernel matrix with $ij$-th entry $[\Omega]_{ij} = \varphi(x_i)^T\varphi(x_j) = K(x_i, x_j)$, $i, j = 1, \dots, N$, where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a positive definite kernel.
It should be noted that in general Eq. (9) is a nonconvex problem, therefore the KKT optimality conditions are necessary but not sufficient [23, 11]. However, taking the stationary points of the Lagrangian, i.e. the $\beta$ eigenvectors of the $V M_v\Omega$ matrix, as the pool of candidate solutions, the objective given by Eq. (6) can be maximized by selecting the $K$ leading eigenvectors since
$$\sum_{k=1}^{K} e^{(k)T} V e^{(k)} = \sum_{k=1}^{K} \frac{\beta^{(k)T} V^{-1}}{\gamma^{(k)}}\, V\, \frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} = \sum_{k=1}^{K} \frac{1}{\gamma^{(k)2}}\,\beta^{(k)T} V^{-1}\beta^{(k)} = \sum_{k=1}^{K} \lambda^{(k)2} \qquad (17)$$
where the $V^{-1}$-normality (i.e. $\beta^{(k)T} V^{-1}\beta^{(k)} = 1$) of the $\beta^{(k)}$ eigenvectors of the $V M_v\Omega$ matrix was assumed.

The $k$-th projection of the $l$-th mapped input data point $\varphi(x_l)$ can be written as
$$z_l^{(k)} = z^{(k)}(x_l) = w^{(k)T}\varphi(x_l) + b^{(k)} = \beta^{(k)T}\Phi\varphi(x_l) + b^{(k)} = \sum_{i=1}^{N}\beta_i^{(k)}\varphi(x_i)^T\varphi(x_l) + b^{(k)} = \sum_{i=1}^{N}\beta_i^{(k)} K(x_i, x_l) + b^{(k)} \qquad (18)$$
by using the first KKT condition given by Eqs. (11) and the kernel trick. The $k$-th score variable $z^{(k)} \in \mathbb{R}^{N \times 1}$ is
$$z^{(k)} = \Phi w^{(k)} + 1_N b^{(k)} = \Phi\Phi^T\beta^{(k)} - \frac{1_N 1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} = \left(I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N}\right)\Phi\Phi^T\beta^{(k)} = M_v\Omega\beta^{(k)} = \lambda^{(k)} V^{-1}\beta^{(k)} \qquad (19)$$
The weighted kernel PCA primal optimization problem Eq. (9) thus transforms into an eigenproblem in the dual weight space, Eq. (15), involving the dual variables $\beta$ as eigenvectors of the weight centered weighted kernel matrix $V M_v\Omega$. The dual solutions of the weighted kernel PCA problem can be obtained by taking the $K$ leading eigenvectors of the $V M_v\Omega$ matrix. The effect of multiplying the kernel matrix $\Omega$ by the weighted centering matrix $M_v$ is that the weighted mean is removed from each column of the kernel matrix. Note that $M_v$ becomes the ordinary centering matrix by choosing the weight matrix equal to the identity matrix, $V = I_N$, and Eq. (15) then corresponds to the dual form of the unweighted kernel PCA problem discussed in [13].

According to Section 2, taking the kernel matrix as a similarity matrix and choosing the weight matrix equal to the inverse degree matrix, $V = D^{-1}$, Eq. (15) becomes
$$D^{-1} M_D \Omega\beta^{(k)} = \lambda^{(k)}\beta^{(k)}, \quad k = 1, \dots, K \qquad (20)$$
This is similar to the random walk spectral clustering eigenproblem based on the random walk normalized Laplacian $L_{\mathrm{rw}}$ discussed in Section 2. However, now the weight centered kernel matrix $M_D\Omega$ plays the role of the similarity matrix in the eigenproblem [11], where
$$M_D = I_N - \frac{1_N 1_N^T D^{-1}}{1_N^T D^{-1} 1_N} \qquad (21)$$
Furthermore, Eq. (20) can be transformed to
$$D^{-\frac{1}{2}} M_D \Omega M_D^T D^{-\frac{1}{2}}\alpha^{(k)} = \lambda^{(k)}\alpha^{(k)}, \quad k = 1, \dots, K \qquad (22)$$
by using the identity given by Eq. (66), then multiplying both sides by $D^{1/2}$ from the left and substituting $\alpha = D^{1/2}\beta$. The resulting eigenproblem is similar to the symmetric spectral clustering eigenproblem based on the symmetric normalized Laplacian $L_{\mathrm{sym}}$ discussed in Section 2, with the difference that now the symmetric weight centered kernel matrix $M_D\Omega M_D^T$ plays the role of the similarity matrix in the eigenproblem.

Therefore, a close relation can be shown between the spectral clustering algorithms, involving either the random walk Laplacian $L_{\mathrm{rw}}$ or the symmetric normalized Laplacian $L_{\mathrm{sym}}$, and the weighted kernel PCA by choosing the weight matrix as $V = D^{-1}$ [10]. However, some properties of the $K$ leading eigenvectors of Eqs. (20) and (22) will be different from those of the corresponding eigenvectors of $L_{\mathrm{rw}}$ and $L_{\mathrm{sym}}$ due to the weighted centering of the kernel matrix. These properties will be discussed in the next section.
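The training stage implied by Eqs. (20)-(21) amounts to a small amount of dense linear algebra. The sketch below is a non-sparse toy implementation under stated assumptions (RBF kernel, weight matrix $V = D^{-1}$); the function names are ours and it is meant only to illustrate the computation, not to reproduce the authors' code.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / gamma)

def ksc_train(X, K, gamma):
    """Solve D^{-1} M_D Omega beta = lambda beta (Eq. (20)) for the K-1 leading
    eigenvectors and compute the bias terms (Eq. (13) with V = D^{-1}) and the
    training score variables (Eq. (18))."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, gamma)                                # kernel / similarity matrix
    d_inv = 1.0 / Omega.sum(axis=1)                                # diagonal of D^{-1}
    M_D = np.eye(N) - np.outer(np.ones(N), d_inv) / d_inv.sum()    # Eq. (21)
    A = d_inv[:, None] * (M_D @ Omega)                             # D^{-1} M_D Omega
    evals, evecs = np.linalg.eig(A)                                # non-symmetric eigenproblem
    lead = np.argsort(-evals.real)[: K - 1]                        # K-1 leading eigenvectors
    B = evecs[:, lead].real                                        # columns: beta^(1), ..., beta^(K-1)
    b = -(d_inv @ (Omega @ B)) / d_inv.sum()                       # bias terms, Eq. (13)
    Z = Omega @ B + b                                              # score variables, Eq. (18)
    return B, b, Z
```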
3.2. Properties of the eigenvectors and the score variables
Some properties of the eigenvectors of the $D^{-1} M_D\Omega$ matrix and of the corresponding score variables will be discussed in the following. These properties are essential for the spectral clustering interpretation of the weighted kernel PCA and for the formulation of the clustering algorithm.

Let us assume that there are $K$ well formed clusters in the $N$ input data points and that the corresponding perfect partition of this input data set $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, is $\Delta = \{C_1, \dots, C_K\}$; $i \in C_k$ will be used as a shorthand for $x_i \in C_k$. Furthermore, it is assumed that this structure of the input data is well represented by the corresponding similarity graph and kernel matrix, i.e. the similarity graph consists of $K$ disconnected components, which results in a kernel matrix with $[\Omega]_{ij} \neq 0 \iff i, j \in C_k$, $\forall k = 1, \dots, K$, $i, j = 1, \dots, N$. This case, when the clusters are well formed, is often referred to as the ideal case [3].
3.2.1. Piecewise constant eigenvectors
In the ideal case, the algebraic multiplicity of the eigenvalue 1 of the $D^{-1}\Omega$ matrix is equal to the number of disconnected components in the corresponding similarity graph. It was shown in [5] that these $K$ leading eigenvectors of the $D^{-1}\Omega$ matrix are piecewise constant with respect to the partition $\Delta = \{C_1, \dots, C_K\}$ when the corresponding $K$ clusters are well formed, due to the block-stochastic property of the $D^{-1}\Omega$ matrix. A vector $x \in \mathbb{R}^N$ is piecewise constant relative to a partition $\Delta = \{C_1, \dots, C_K\}$ of the indices $i = 1, \dots, N$ if and only if $x_i = x_j$, $\forall i, j \in C_k$, $k = 1, \dots, K$. A matrix $P$, indexed by $i = 1, \dots, N$, is block-stochastic with respect to a partition $\Delta = \{C_1, \dots, C_K\}$ of the indices if the sums $P_{ik} = \sum_{j \in C_k} P_{ij}$ are constant $\forall i \in C_{k'}$, $\forall k, k' = 1, \dots, K$, and the matrix $R \in \mathbb{R}^{K \times K}$ with entries $P_{k'k}$ is non-singular [5]².

Although the $D^{-1} M_D\Omega$ matrix is not block-stochastic, one can show that it has $K-1$ eigenvalues equal to 1 and that the corresponding eigenvectors are piecewise constant on the underlying perfect partition of the input data.

² Note that in the ideal case $D^{-1}\Omega$ is block diagonal, which is a block-stochastic matrix with $R = I_K$.
The $i,j$-th element of the $D^{-1} M_D\Omega$ matrix can be written as
$$[D^{-1} M_D\Omega]_{ij} = \frac{1}{d_i}\left([\Omega]_{ij} - \frac{\sum_{l=1}^{N}\frac{1}{d_l}[\Omega]_{lj}}{\sum_{l=1}^{N}\frac{1}{d_l}}\right) \qquad (23)$$
and the sum of the $i$-th row is
$$\sum_{j=1}^{N}[D^{-1} M_D\Omega]_{ij} = \frac{\sum_{j=1}^{N}[\Omega]_{ij}}{d_i} - \frac{1}{\epsilon\, d_i}\sum_{l=1}^{N}\frac{\sum_{j=1}^{N}[\Omega]_{lj}}{d_l} = 1 - \frac{N}{\epsilon\, d_i} \qquad (24)$$
where $\epsilon = \sum_{i=1}^{N} 1/d_i$. It shows that, unlike $D^{-1}\Omega$, the $D^{-1} M_D\Omega$ matrix is not block-stochastic since the $d_i$, $i = 1, \dots, N$, are usually different for the different rows. However, in the ideal case, if $\beta^{(l)} \in \mathbb{R}^N$ is an eigenvector of the $D^{-1} M_D\Omega$ matrix with associated eigenvalue $\lambda^{(l)} = 1$, then it is also an eigenvector of the $D^{-1}\Omega$ matrix with the same eigenvalue. In order to see this, one first needs to show that the $b^{(l)}$ bias terms, associated to the $\beta^{(l)}$ eigenvectors with $\lambda^{(l)} = 1$, are zero in the ideal case. This can be shown by using the KKT conditions given in Eqs. (11) and taking into account that now $V = D^{-1}$:
$$1_N b^{(l)} = e^{(l)} - \Phi w^{(l)} = \frac{D\beta^{(l)}}{\gamma^{(l)}} - \Phi\Phi^T\beta^{(l)} = \left[\lambda^{(l)} D - \Omega\right]\beta^{(l)} \qquad (25)$$
Multiplying both sides by $1_N^T/N$ from the left yields
$$b^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)[d_1, \dots, d_N]\,\beta^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)\sum_{i=1}^{N} d_i\beta_i^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)1_N^T D\beta^{(l)}, \quad l = 1, \dots, K-1 \qquad (26)$$
which gives zero since $\lambda^{(l)} = 1$. Substituting the fourth KKT condition into the second results in
$$\beta^{(l)} = D^{-1} e^{(l)} = D^{-1}\Phi\Phi^T\beta^{(l)} = D^{-1}\Omega\beta^{(l)} \qquad (27)$$
by using $1/\gamma^{(l)} = \lambda^{(l)} = 1$ and $b^{(l)} = 0$. Thus, in the ideal case, if $\beta^{(l)} \in \mathbb{R}^N$ is an eigenvector of the $D^{-1} M_D\Omega$ matrix with associated eigenvalue $\lambda^{(l)} = 1$, then it is also an eigenvector of the $D^{-1}\Omega$ matrix with the same eigenvalue.

According to [5], the eigenvectors of the $D^{-1}\Omega$ matrix with associated eigenvalue equal to 1 are piecewise constant on the clusters in the ideal case, i.e. $\beta_i^{(l)} = \beta_j^{(l)} = \xi_k^{(l)}$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$. This can also be seen by writing $D^{-1}\Omega\beta^{(l)} = \beta^{(l)}$ for the components of $\beta^{(l)}$
$$\frac{1}{d_i}\sum_{j=1}^{N}[\Omega]_{ij}\beta_j^{(l)} = \beta_i^{(l)}, \quad \forall i = 1, \dots, N \qquad (28)$$
and assuming that the $i$-th input data point belongs to the $k$-th cluster
$$\sum_{j \in C_k}[\Omega]_{ij}\beta_j^{(l)} - d_i\beta_i^{(l)} = \sum_{j \in C_k}[\Omega]_{ij}\beta_j^{(l)} - \sum_{j \in C_k}[\Omega]_{ij}\beta_i^{(l)} = 0 \qquad (29)$$
This equation must hold $\forall i \in C_k$, $k \in \{1, \dots, K\}$, which indicates $\beta_i^{(l)} = \beta_j^{(l)} = \xi_k^{(l)}$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$.

Moreover, the centered property of these piecewise constant eigenvectors, $\sum_{i=1}^{N}\beta_i^{(l)} = \sum_{k=1}^{K} n_k\xi_k^{(l)} = 0$, shows that $K-1$ out of the $K$ constants $\xi_k^{(l)}$ that form the given eigenvector can be chosen independently. Hence, the eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue is a $K-1$ dimensional space spanned by the corresponding $K-1$ $D$-orthogonal eigenvectors. Thus these $K-1$ eigenvectors are centered and piecewise constant on the underlying clusters. Therefore, each of the $K$ well formed clusters is represented as a single point in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue.
3.2.2. Cluster membership is coded in signs
It has been shown above that the $D^{-1} M_D\Omega$ matrix has $K-1$ eigenvectors with eigenvalue equal to 1 if the underlying data set consists of $K$ well formed clusters, and that these $K-1$ leading eigenvectors are piecewise constant on the perfect partition of the input data. The eigenvectors of the $D^{-1} M_D\Omega$ matrix are mutually $D$-orthogonal, where $D$ is the diagonal degree matrix with positive diagonal entries, as was shown in Section 3.1. According to the third KKT condition in Eqs. (11), these eigenvectors are centered. The signs of the constants that form the $K-1$ leading eigenvectors must satisfy the following two criteria in order to ensure that the eigenvectors fulfil the centered and mutually $D$-orthogonal properties: (i) at least one out of the $K$ constants that form an individual eigenvector has a different sign than the others; (ii) no two vectors among these $K-1$ eigenvectors have the same signs for all of their corresponding constants. Therefore, each of the $K$ well formed clusters is represented as one of $K$ single points in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue, and these $K$ points, corresponding to different clusters, are located in different orthants of this $K-1$ dimensional eigen-subspace.
3.2.3. Some results of perturbation analysis
The ideal case, when the perfect partition of the input data set into $K$ disjoint subsets appears in the similarity graph as $K$ disconnected subgraphs, has been considered so far. The corresponding similarity matrix $\Omega$ is a block diagonal matrix in this ideal case. It has been shown above that each of the $K$ disconnected components is represented as a single point in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue, due to the fact that these eigenvectors are piecewise constant on the $K$ disconnected components. Therefore, the piecewise constant property of these $K-1$ leading eigenvectors is crucial in spectral clustering since it makes it possible to transform the original clustering problem, given in the input space, into a trivial clustering problem in the $K-1$ dimensional eigen-subspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue. The question is then how this piecewise constant property of the $K-1$ leading eigenvectors changes in the non-ideal case, when there are edges between the $K$ subgraphs of the ideal case, resulting in a non-block diagonal similarity matrix.

Perturbation analysis can provide an answer in the nearly ideal case, when there still are distinct clusters in the data set but the similarities between data points that belong to different clusters are not necessarily zero, only small compared to the within cluster similarities. The difference between the nearly ideal and the ideal case can be considered as a small perturbation. According to the Davis-Kahan theorem, if a symmetric matrix $A$ is modified by an additive perturbation $H$, the distance between the eigenspaces corresponding to the first $k$ eigenvalues of $A$ and the first $k$ eigenvalues of the perturbed matrix $\tilde{A} = A + H$ is bounded by $\|H\|_F/\delta_k$, where $\|\cdot\|_F$ denotes the Frobenius norm, $\delta_k = \lambda_k - \lambda_{k+1}$ is the additive eigengap and $\lambda_k$ is the $k$-th eigenvalue of $A$ when they are in decreasing order³. It means that the larger the gap between $\lambda_k$ and $\lambda_{k+1}$ is, the closer the $k$ leading eigenvectors of the perturbed matrix $\tilde{A}$ and of the original matrix $A$ are.

The consequence of the Davis-Kahan theorem is that if the perturbation is small and the eigengap between the $(K-1)$-th and $K$-th ordered eigenvalues of the $D^{-1} M_D\Omega$ matrix is large enough, the first $K-1$ leading eigenvectors will be very close to the corresponding eigenvectors of the ideal case. Therefore, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix will be approximately piecewise constant on the clusters even if the between cluster similarities are not necessarily zero (but small compared to the within cluster similarities) and/or the eigengap is large enough. The $K$ clusters will be represented in the $K-1$ dimensional space spanned by these $K-1$ leading eigenvectors as $K$ clouds around the $K$ corresponding points of the ideal case. The closer to the ideal case we are, the smaller the spread of the individual clouds is.

³ Although $D^{-1} M_D\Omega$ is not a symmetric matrix, it can always be transformed to the corresponding symmetric problem and vice versa (see Eq. (22)).
3.2.4. Properties of the score variables
Some important properties of the score variables will be discussed in the following. The ideal case, when the similarity matrix $\Omega$ has a block diagonal form, will be discussed first, then the deviations caused by perturbations will also be addressed, as in the case of the eigenvectors. It was shown above that the $D^{-1} M_D\Omega$ matrix has $K-1$ eigenvectors with eigenvalue equal to 1 when the similarity graph consists of $K$ disconnected components, and that these eigenvectors are piecewise constant on the disconnected components.

Property 1. Every entry in the $k$-th score variable has the same sign as the corresponding entry in the $k$-th eigenvector ($k = 1, \dots, K-1$).

This can be seen from Eq. (19) by substituting $V = D^{-1}$ and taking into account that $\lambda^{(k)} > 0$, $\forall k \in \{1, \dots, K-1\}$, and that $D$ is a diagonal matrix with positive diagonal entries.

Property 2. The score variables $z^{(k)}$, $k = 1, \dots, K-1$, are $D^{-1}$-orthogonal to the constant vector $1_N$, i.e. the weighted mean is removed from the score variables.

According to Eq. (19), the weighted mean of the score variables can be written as
$$\frac{1_N^T D^{-1} z^{(k)}}{1_N^T D^{-1} 1_N} = \frac{\lambda^{(k)}}{1_N^T D^{-1} 1_N}\,\underbrace{1_N^T\beta^{(k)}}_{=\,0} = 0, \quad \forall k \in \{1, \dots, K-1\} \qquad (30)$$
by choosing the weight matrix as $V = D^{-1}$ and using the centered property of the $\beta$ eigenvectors, i.e. the third KKT condition from Eqs. (11).
Property 3. The score variables $z^{(k)}$, $k = 1, \dots, K-1$, are mutually $D^{-1}$-orthogonal.

According to Eq. (19)
$$z^{(i)T} D^{-1} z^{(j)} = \left[\lambda^{(i)} D\beta^{(i)}\right]^T D^{-1}\left[\lambda^{(j)} D\beta^{(j)}\right] = \lambda^{(i)}\lambda^{(j)}\beta^{(i)T}\underbrace{D D^{-1}}_{I_N} D\beta^{(j)} = \lambda^{(i)}\lambda^{(j)}\underbrace{\beta^{(i)T} D\beta^{(j)}}_{=\,0 \text{ if } i \neq j} = 0 \ \text{ if } i \neq j, \quad \forall i, j = 1, \dots, K-1 \qquad (31)$$
where the mutual $D$-orthogonality of the $\beta$ eigenvectors of the $D^{-1} M_D\Omega$ matrix was used.

Property 4. The $K-1$ dimensional score points that correspond to input data points from the same cluster are collinear in the $K-1$ dimensional score space spanned by the score variables $z^{(k)}$, $k = 1, \dots, K-1$.

Let $z_i^* \in \mathbb{R}^{1 \times (K-1)}$ denote the $i$-th row of the $N \times (K-1)$ dimensional score matrix. The $k$-th component of $z_i^*$ is given by Eq. (18). In the ideal case, the similarity matrix is block diagonal and the $K-1$ leading eigenvectors $\beta$ of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying $K$ clusters, i.e. $K(x_i, x_j) \neq 0 \iff i, j \in C_p$, $p \in \{1, \dots, K\}$, and $\beta_i^{(k)} = \beta_j^{(k)} = \xi_p^{(k)}$, $\forall i, j \in C_p$, $p \in \{1, \dots, K\}$, $k \in \{1, \dots, K-1\}$. Assuming that the $i$-th input data point $x_i$ belongs to the $p$-th cluster, i.e. $i \in C_p$, $p \in \{1, \dots, K\}$, the $k$-th component of the corresponding $z_i^*$ can be written as
$$z_i^{*(k)} = \sum_{j=1}^{N}\beta_j^{(k)} K(x_j, x_i) + b^{(k)} = \sum_{j \in C_p}\beta_j^{(k)} K(x_j, x_i) + \sum_{j \notin C_p}\beta_j^{(k)} K(x_j, x_i) + b^{(k)} = \xi_p^{(k)}\underbrace{\sum_{j \in C_p} K(x_j, x_i)}_{=\,d_i} + \sum_{j \notin C_p}\beta_j^{(k)}\underbrace{K(x_j, x_i)}_{=\,0\ \forall j \notin C_p,\, i \in C_p} + b^{(k)} = d_i\xi_p^{(k)} + b^{(k)}, \quad \forall i \in C_p,\ p \in \{1, \dots, K\},\ \forall k \in \{1, \dots, K-1\} \qquad (32)$$
and
$$z_i^* = d_i\left[\xi_p^{(1)}, \dots, \xi_p^{(K-1)}\right] + \left[b^{(1)}, \dots, b^{(K-1)}\right], \quad \forall i \in C_p,\ p \in \{1, \dots, K\} \qquad (33)$$
Therefore, the $K-1$ dimensional score vectors $z_i^*$ that correspond to input data from the same cluster point in the same direction in the $K-1$ dimensional score space in the ideal case. Furthermore, the $K$ directions that correspond to the $K$ different clusters lie in different orthants of the $K-1$ dimensional score space, due to the fact that the cluster membership is coded in the signs of the $\xi_p^{(k)}$, $k = 1, \dots, K-1$, constants that form the $\beta$ eigenvectors. Therefore, the $K-1$ dimensional points $z_i^*$ that correspond to input data points from the same cluster are collinear in the $K-1$ dimensional space spanned by the $K-1$ score vectors corresponding to the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, and the $K$ distinct directions lie in different orthants of this score space.
According to Section 3.2.3, the similarity matrix becomes only nearly block diagonal if the between cluster similarities are not necessarily zero but small compared to the within cluster similarities. As a consequence, the contribution of the summation in Eq. (32) over $j \notin C_p$ is not zero, but it is still much smaller than the corresponding contribution over $j \in C_p$. Furthermore, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix remain approximately piecewise constant on the clusters in the presence of such a small perturbation (and/or in case of a large eigengap). The $K$ clusters are represented as $K$ well separated clouds in the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, scattered around the corresponding points of the ideal case. The closer to the ideal case we are, the smaller the spread of the individual clouds is. Therefore, the $K-1$ dimensional points $z_i^*$ that correspond to input data points from the same cluster are still approximately collinear in the $K-1$ dimensional space spanned by the $K-1$ score vectors corresponding to the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. The closer to the ideal case we are, the smaller the scatter of the $K$ point clouds in the corresponding eigenspace is and the more collinear the score points are.

Note that the special features of the ideal case were exploited only in the case of the last property of the score variables.
3.3. Cluster membership encoding and decoding
In order to obtain a clustering model based on the weighted kernel PCA formulation, one needs to select a cluster encoding-decoding scheme that makes the cluster membership assignment possible. In the case of classical spectral clustering algorithms [1, 3], this step is usually done by applying K-means in the appropriate eigen-subspace, as was discussed at the end of Section 2. The weighted kernel PCA formulation and the properties of the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix and of the corresponding score variables, discussed in the previous sections, provide different possible encoding schemes.

One scheme, introduced in [11], is based on the fact that the cluster membership is coded in the signs of the corresponding components of the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, as was discussed in Section 3.2.2. Therefore, each cluster $C_k$, $k = 1, \dots, K$, can be represented by a unique codeword $c^{(k)} \in \{+1, -1\}^{(K-1)}$. The candidate codewords can be obtained by converting the rows of the matrix formed by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix into $K-1$ dimensional sign vectors. According to Property 1 of the score variables discussed in Section 3.2.4, the score variables can also be used instead of the eigenvectors. Based on Sections 3.2.2 and 3.2.3, there are exactly $K$ different codewords in the ideal case and there can be more than $K$ in strongly perturbed cases. Thus, a codebook $\mathcal{CB} = \{c^{(k)}\}_{k=1}^{K}$ is formed by taking the $K$ most frequent codewords. Then each point is clustered based on the corresponding $K-1$ dimensional score point and this codebook. The score point is converted into a sign vector and compared to the elements of the codebook by calculating the Hamming distances. The codeword yielding the smallest Hamming distance is selected and the point is assigned to the corresponding cluster [11].
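A minimal sketch of this sign-based encoding-decoding step (illustrative only; the helper names are ours): the codebook is the set of the $K$ most frequent sign patterns of the training score points, and each point is then assigned to the codeword at minimum Hamming distance.

```python
import numpy as np
from collections import Counter

def sign_codebook(Z_train, K):
    """Codebook from the K most frequent sign patterns of the rows of the
    N x (K-1) training score (or eigenvector) matrix Z_train."""
    signs = np.where(Z_train >= 0, 1, -1)
    counts = Counter(map(tuple, signs))
    return np.array([cw for cw, _ in counts.most_common(K)])

def assign_by_hamming(Z, codebook):
    """Assign each score point to the codeword with the smallest Hamming distance."""
    signs = np.where(Z >= 0, 1, -1)
    dists = (signs[:, None, :] != codebook[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# usage: cb = sign_codebook(Z, K); labels = assign_by_hamming(Z, cb)
```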
A different encoding-decoding scheme can also be obtained based on angular similarities. According to Eq. (18), the $K-1$ dimensional score point $z_i^*$ that corresponds to the $i$-th input data point $x_i$ can be written as the linear combination of the vectors $e_j^* = [\beta_j^{(1)}, \dots, \beta_j^{(K-1)}]$, $j = 1, \dots, N$, plus the bias vector $b = [b^{(1)}, \dots, b^{(K-1)}]$
$$z_i^* = \sum_{j=1}^{N} K(x_i, x_j)\, e_j^* + b \qquad (34)$$
where $\beta^{(k)}$, $k = 1, \dots, K-1$, are the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. In the ideal case, Eq. (34) transforms into Eq. (33) since $e_i^* = e_j^* = e_k^* = [\xi_k^{(1)}, \dots, \xi_k^{(K-1)}]$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$, due to the piecewise constant property of the $\beta^{(k)}$, $k = 1, \dots, K-1$, eigenvectors. The $K$ clusters are represented by the $e_k^*$, $k = 1, \dots, K$, vectors as $K$ single points in different orthants of the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, and
$$z_i^* - b = \sum_{j=1}^{N} K(x_i, x_j)\, e_j^* = \sum_{j \in C_k} K(x_i, x_j)\, e_j^* = d_i e_k^*, \quad \forall i \in C_k,\ k \in \{1, \dots, K\} \qquad (35)$$
by using Eq. (34) and the $K(x_i, x_j) \neq 0 \iff i, j \in C_p$, $p \in \{1, \dots, K\}$, property of the kernel matrix in the ideal case. Therefore, the $z^* - b$ vectors that correspond to data points from the same cluster point in the same direction, and this direction is identical to the direction of the $e_k^*$ vector that represents the given cluster in the $K-1$ dimensional eigen-subspace⁴.

In perturbed cases, when the between cluster similarities are small compared to the within cluster similarities, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are approximately piecewise constant on the clusters. Each cluster is represented by the $e_j^*$, $j \in C_k$, $k \in \{1, \dots, K\}$, vectors as one of $K$ well localized point clouds in different orthants of the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. The $z^* - b$ vectors that correspond to data points from the same cluster will point approximately in the same direction, and these directions will be close to the direction of the average of the $e_j^*$ vectors associated to the given cluster.

An encoding-decoding scheme based on these properties can be obtained in the following way. The $e_j^*$ vectors can easily be clustered either by using the sign coding of the cluster membership as above or simply by using K-means. The sign coding of the cluster membership can be used for initializing K-means, which will then terminate after a single iteration in nearly ideal cases. After the $e_j^*$ vectors are clustered based on their angular similarities, a codebook $\mathcal{CB} = \{c^{(k)}/\|c^{(k)}\|_2\}_{k=1}^{K}$ can be formed by calculating an average $c^{(k)}$ vector for each cluster as $c^{(k)} = \sum_{j \in C_k} e_j^*/|C_k|$, $k \in \{1, \dots, K\}$. Then each point can be clustered based on the corresponding $z^* - b$ vector derived from the score and this codebook. The $z^* - b$ vector is normalized and compared to the elements of the codebook by calculating either the projections onto the codewords or the distances from the codewords. The codeword yielding the maximum projection or the minimum distance is selected and the point is assigned to the corresponding cluster.

⁴ One could write $z^*$ instead of $z^* - b$ because $b = 0$ in the ideal case, as was shown in Section 3.2.1.
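The angular scheme can be sketched in the same hedged spirit (illustrative helper names; `init_labels` could come from the sign decoding above or from K-means): per-cluster averages of the rows $e_j^*$ of the eigenvector matrix give the normalized codewords, and points are assigned by maximum projection of the normalized $z^* - b$ vectors.

```python
import numpy as np

def angular_codebook(B, init_labels, K):
    """Average the rows e_j^* of the N x (K-1) eigenvector matrix B per cluster
    and normalize the averages to obtain the angular codebook."""
    cb = np.vstack([B[init_labels == k].mean(axis=0) for k in range(K)])
    return cb / np.linalg.norm(cb, axis=1, keepdims=True)

def assign_by_angle(Z, b, codebook):
    """Assign each score point to the codeword with the largest projection
    (cosine similarity) of its normalized z^* - b vector."""
    V = Z - b
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return (V @ codebook.T).argmax(axis=1)
```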
3.4. Out-of-sample extension
Basic spectral clustering algorithms, discussed in Section 2, provide a clustering only for the given data set, without a clear extension to out-of-sample points, i.e. to new points that are not part of the training data set. The most popular way of extending the clustering model to test points is based on the Nyström method [9]. An embedding of a test point can be achieved by using the Nyström approximation of the implicit eigenfunctions [7, 8]. The kernel spectral clustering algorithm formulated in the weighted kernel PCA framework provides a natural and simple way of out-of-sample extension without relying on approximation techniques such as the Nyström method [11]. The out-of-sample extension proposed in [11] is based on the codebook discussed in the previous section and on the score variables associated to the out-of-sample points $\mathcal{D}^t = \{x_i^t\}_{i=1}^{N_t}$, $x_i^t \in \mathbb{R}^d$. The latter correspond to the projections of the mapped out-of-sample points $\varphi(x^t)$ onto the $w^{(k)}$, $k = 1, \dots, K-1$, vectors obtained implicitly during the training stage. Since the problem is solved in the dual weight space, these projections can be calculated by means of the $\beta^{(k)}$, $k = 1, \dots, K-1$, leading eigenvectors obtained in the training stage as
$$z(x_i^t)^{(k)} = z_i^{t(k)} = \sum_{j=1}^{N} K(x_j, x_i^t)\,\beta_j^{(k)} + b^{(k)}, \quad k = 1, \dots, K-1 \qquad (36)$$
by using Eq. (18). The $K-1$ dimensional out-of-sample score points $z_i^{*t} \in \mathbb{R}^{(K-1)}$, $i = 1, \dots, N_t$, have the same collinearity property that was discussed in Section 3.2.4. Therefore, the cluster membership of the out-of-sample points can be determined based on the corresponding $z^{*t}$ vectors by comparing them to the codebook, derived in the training stage as described in the previous section, using the appropriate decoding scheme.
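In code, the out-of-sample step of Eq. (36) is a single kernel evaluation against the training points followed by the decoding described above; a small self-contained sketch assuming an RBF kernel (the function name and arguments are our own):

```python
import numpy as np

def ksc_out_of_sample_scores(X_test, X_train, B, b, gamma):
    """Eq. (36): score variables of unseen points computed from the training
    eigenvectors B (columns beta^(k)) and bias terms b; decoding these scores
    against the training codebook then gives the cluster memberships."""
    sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    K_test = np.exp(-sq / gamma)        # K(x_j, x_i^t) for the RBF kernel
    return K_test @ B + b               # N_t x (K-1) out-of-sample score points
```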
3.5. Model selection
The quality of the clustering model obtained in the training stage strongly depends on the choice of the similarity function and its parameters. The most popular similarity function is the radial basis function kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2/\gamma)$ with parameter $\gamma$. Beyond the value of the $\gamma$ kernel parameter, the optimality of the model also depends on the chosen number of clusters $K$. Different $(\gamma, K)$ values result in different models, and the closer the $K-1$ leading eigenvectors of the associated $D^{-1} M_D\Omega$ matrix are to being piecewise constant, the closer to optimal the model is. Therefore, introducing a measure of the piecewise constantness of these leading eigenvectors results in a quality measure of the different models with different $(\gamma, K)$ values that can serve as a model selection criterion. Choosing the $(\gamma, K)$ value pair that maximizes this quality measure makes it possible to select the optimal clustering model.

The balanced line fit (BLF) quality measure was proposed in [11]. As was discussed in Section 3.2.4, the $z^* \in \mathbb{R}^{(K-1)}$ score points that correspond to input data points from the same cluster are collinear in the score space if the associated $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying $K$ clusters. The line fit part of this criterion measures the collinearity of the $z_i^{v*}$, $i \in C_k$, $k \in \{1, \dots, K\}$, score points, obtained by using an independent validation set $\mathcal{D}^v = \{x_i\}_{i=1}^{N_v}$, by means of PCA. The balance of the obtained clusters is also incorporated into the BLF. More details on the BLF model selection criterion can be found in [11].
A model selection criterion different from the BLF can be obtained by using the cluster membership encoding-decoding scheme based on angular similarities discussed in the previous section. Using this encoding-decoding scheme, the cluster membership assignment is done based on the angular similarities between the normalized codewords in the codebook and the normalized $z_i^{v*}$ vectors. The angular similarity between a normalized $z_i^{v*}$ vector and the normalized codebook vectors can be measured either by calculating the projections or the distances between these $K-1$ dimensional points. The individual projections lie in the $[-1, +1]$ interval while the corresponding distances lie in the $[0, 2]$ interval. The codeword yielding the highest projection or the smallest distance is selected and the validation point associated to the $z_i^{v*}$ vector is assigned to the corresponding cluster. The more collinear the $z_i^{v*}$, $i \in C_k$, $k \in \{1, \dots, K\}$, points are, the higher (smaller) the sum of these maximum projections (minimum distances) is. Maximizing (minimizing) the average projections (distances) on the validation set $\mathcal{D}^v$ results in the selection of the $(\gamma, K)$ parameters that correspond to the optimal clustering model. One can further refine this quality measure in order to give more penalty to those validation points that lie closer to the decision boundary of the model. Let us assume that the assignment is done by selecting the codeword from the codebook that yields the minimum distance to the normalized $z_i^{v*}$ point. Let this minimum distance be denoted by $s_i^{(1)}$ and the second smallest distance by $s_i^{(2)}$. The corresponding validation point lies on a decision boundary of the model if $s_i^{(1)}/s_i^{(2)} = 1$, while the validation point is assigned to the cluster with certainty if $s_i^{(1)} = 0$. Therefore, one can select the optimal model by maximizing the quantity
$$Q_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \sum_{k=1}^{K}\frac{1}{K}\sum_{i \in C_k}\frac{1 - s_i^{(1)}/s_i^{(2)}}{|C_k|} \qquad (37)$$
on a validation set $\mathcal{D}^v$. The closer the quantity $Q_m \in [0, 1]$ is to 1, the better the corresponding clustering model is. One can place further emphasis on balancing the obtained clusters in the model selection by introducing a balance term, similarly as in the case of the BLF [11]. The balanced quality measure can be written as
$$BQ_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \eta\, Q_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) + (1 - \eta)\,\mathrm{balance}(\mathcal{D}^v, \mathrm{model}(\gamma, K)) \qquad (38)$$
where
$$\mathrm{balance}(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \frac{\min\{|C_1|, \dots, |C_K|\}}{\max\{|C_1|, \dots, |C_K|\}} \qquad (39)$$
and $\eta \in [0, 1]$ is the importance given to $Q_m$ with respect to the balance. It should be noted that this model selection criterion can only be used for $K > 2$.
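A sketch of evaluating this criterion on a validation set (illustrative only; it assumes the distance-based decoding against a normalized codebook described above, and the function name is ours):

```python
import numpy as np

def balanced_quality(Z_val, b, codebook, eta=0.8):
    """BQ_m of Eq. (38): Q_m (Eq. (37)) computed from the smallest and second
    smallest codeword distances of each normalized validation score point,
    blended with the balance term of Eq. (39)."""
    V = Z_val - b
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    dists = np.linalg.norm(V[:, None, :] - codebook[None, :, :], axis=2)
    s = np.sort(dists, axis=1)
    s1, s2 = s[:, 0], s[:, 1]
    labels = dists.argmin(axis=1)
    K = codebook.shape[0]
    sizes = np.array([(labels == k).sum() for k in range(K)])
    per_cluster = [np.mean(1.0 - s1[labels == k] / s2[labels == k])
                   for k in range(K) if sizes[k] > 0]
    q_m = np.sum(per_cluster) / K                    # Eq. (37)
    balance = sizes.min() / sizes.max()              # Eq. (39)
    return eta * q_m + (1.0 - eta) * balance         # Eq. (38)

# model selection: evaluate balanced_quality over a grid of (gamma, K) values
# and keep the maximizer (valid for K > 2, as noted above).
```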
4. Sparse algorithms
Non-sparse kernel spectral clustering algorithms need to store and to eigendecompose an $N \times N$ matrix, where $N$ is the size of the input data set. The memory usage and computational time of these algorithms grow quickly with increasing input data set size. Kernel spectral clustering models with out-of-sample extension capability can reduce these memory and computational time problems by training the model on a subset of the input data set and using the out-of-sample extension capability of the model to infer the cluster membership of the unseen data.
4.1. Reduced set method
When the kernel spectral clustering problem is formulated as weighted kernel PCA, the problem is solved in the dual weight space. The solution consists of the determination of the $\beta^{(k)}$, $k = 1, \dots, K-1$, leading eigenvectors of the $D^{-1} M_D\Omega$ matrix associated to the training set $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subset \mathcal{D}$. According to the first KKT condition in Eqs. (11), the primal variables $w^{(k)}$, $k = 1, \dots, K-1$, are expressed as linear combinations of the mapped input data, $w^{(k)} = \Phi^T\beta^{(k)}$, $k = 1, \dots, K-1$. Since the components of the $\beta^{(k)}$ eigenvectors are usually not zero, each training data point contributes to the primal variable $w^{(k)}$, resulting in a non-sparse model. The goal is to find a reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$, $R < N_{tr}$, of points and corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$, to approximate the $w^{(k)}$ primal variables as [18]
$$w^{(k)} \approx \tilde{w}^{(k)} = \sum_{r=1}^{R}\zeta_r^{(k)}\varphi(\tilde{x}_r) = \Psi^T\zeta^{(k)}, \quad k = 1, \dots, K-1 \qquad (40)$$
where $\Psi = [\varphi(\tilde{x}_1), \dots, \varphi(\tilde{x}_R)]^T \in \mathbb{R}^{R \times n_h}$. An arbitrary data point $x \in \mathbb{R}^d$ can then be clustered based on the corresponding approximated score variables $\tilde{z}(x)^{(k)}$, $k = 1, \dots, K-1$, which can be calculated as [18]
$$z(x)^{(k)} \approx \tilde{z}(x)^{(k)} = \sum_{r=1}^{R}\zeta_r^{(k)} K(x, \tilde{x}_r) + \tilde{b}^{(k)} \qquad (41)$$
$k = 1, \dots, K-1$, where the $\tilde{b}^{(k)}$ are the approximated bias terms determined on the training set. The calculation of these approximated bias terms will be described in Sections 4.2.1 and 4.2.4.
Introducing sparsity into the model as described above corresponds to the so-called reduced set method [15, 16], which usually consists of three steps. First, the size of the eigenvalue problem is reduced from the full size of the input data set $N$ to $N_{tr} < N$ by selecting a subset for training, $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subset \mathcal{D} = \{x_i\}_{i=1}^{N}$. Once the required eigenvectors are derived based on the training set, a sparse representation of the model is constructed by determining the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ and the corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$. Since usually $R \ll N_{tr} < N$, a sparse model representation can be obtained and used for calculating the approximated score variables associated to arbitrary input data points by using Eq. (41).

The training data set is usually selected by random sampling in the case of Nyström approximation based kernel spectral clustering algorithms [8, 24]. Then the eigenvectors, derived based on the training set, are used to approximate the underlying eigenfunctions and to carry out the out-of-sample extension. Random sampling of the training set can also be used in the case of reduced set methods. The sparsity of the final model will be higher in this case compared to the Nyström method based clustering models, due to the application of the reduced set method. In spite of the higher level of sparsity, these clustering models outperform the Nyström approximation based models trained on the same training set [11, 18]. A Rényi entropy maximization based training set selection and a sparse kernel spectral clustering model based on the weighted kernel PCA formulation were proposed in [18]. The training set is constructed in this case by selecting points from the full input data set that gradually increase the quadratic Rényi entropy of the selected subset [25, 14]. The required eigenvectors are determined based on the selected training set and L1 penalization [16, 26] is used to obtain the reduced set [18]. Then the corresponding reduced set coefficients are derived through the solution of a minimization problem [16, 11]. Another sparse version of the weighted kernel PCA formulated kernel spectral clustering was proposed in [17], based on the incomplete Cholesky decomposition (ICD) and the reduced set method. This algorithm will be discussed in the following.
4.2. ICD based sparse algorithm
The incomplete Cholesky decomposition (ICD) and reduced set method based sparse version of the weighted kernel PCA formulated kernel spectral clustering algorithm will be discussed in this section. The original algorithm, proposed in [17], will be reviewed in Sections 4.2.1 and 4.2.2. The computational bottlenecks of this algorithm will be shown and a modified, much faster algorithm will be proposed in Section 4.2.4 that eliminates the computationally expensive parts of the original algorithm.
4.2.1. Approximated eigenvectors
The first step is to determine the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix associated to the randomly sampled training set $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subseteq \mathcal{D} = \{x_i\}_{i=1}^{N}$. The incomplete Cholesky decomposition (ICD) of the $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ matrix is used to reduce the size of the eigenproblem [27, 28, 17].

Any symmetric positive definite matrix $A \in \mathbb{R}^{M \times M}$ can be decomposed as $A = LL^T$, where $L \in \mathbb{R}^{M \times M}$ is a lower triangular matrix. If the spectrum of $A$ decays rapidly, it has a small numerical rank [29] and $A$ can be well approximated by $GG^T$, where $G \in \mathbb{R}^{M \times R}$, $R \ll M$ [30]. This is the incomplete Cholesky factorization of $A$ and it yields a decomposition in which the column space of $GG^T$ is spanned by a subset of the columns of $A$ [31]. The incomplete Cholesky decomposition with symmetric pivoting greedily selects the columns of $A$ and calculates the columns of $G$ such that the approximation of $A$ obtained by adding that column in the current step is the best [31]. The algorithm terminates when $\|PAP^T - GG^T\|_1 = \mathrm{Tr}(PAP^T - GG^T) \leq tol$, where $P$ is the permutation matrix associated to the symmetric permutations done during the ICD and $\|\cdot\|_1$ is the trace norm [28, 31].
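For completeness, a compact and merely illustrative implementation of this pivoted incomplete Cholesky factorization with the trace-based stopping rule; the function name and the round-off guard are our own choices.

```python
import numpy as np

def incomplete_cholesky(Omega, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky factorization Omega ~= G G^T.
    Greedily picks the pivot with the largest remaining diagonal entry and
    stops when the trace of the residual drops below tol.
    Returns G (N x R) and the list of selected pivot indices."""
    N = Omega.shape[0]
    max_rank = N if max_rank is None else max_rank
    d = np.diag(Omega).astype(float).copy()      # residual diagonal
    G = np.zeros((N, max_rank))
    pivots = []
    for j in range(max_rank):
        if d.sum() <= tol:                       # trace-norm stopping rule
            break
        i = int(np.argmax(d))
        pivots.append(i)
        G[:, j] = (Omega[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
        d[d < 0] = 0.0                           # guard against round-off
    return G[:, :len(pivots)], pivots
```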
Let us use the incomplete Cholesky decomposition $\tilde{\Omega} = GG^T$, $G \in \mathbb{R}^{N_{tr} \times R}$, $R \ll N_{tr}$, of the $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ matrix as a rank $R$ approximation in the form of
$$\Omega \approx \tilde{\Omega} = GG^T \qquad (42)$$
This can be written as
$$\Omega \approx U\Lambda^2 U^T \qquad (43)$$
by using the singular value decomposition of $G$, i.e. $G = U\Lambda V^T$, where $\Lambda \in \mathbb{R}^{R \times R}$ is the diagonal matrix of singular values and $U \in \mathbb{R}^{N_{tr} \times R}$ and $V \in \mathbb{R}^{R \times R}$ are the matrices of left and right singular vectors, respectively. Using this approximation in the $D^{-1} M_D\Omega$ matrix, the kernel spectral clustering eigenvalue problem can be written as [17]
$$\tilde{D}^{-1} M_{\tilde{D}}\, U\Lambda^2 U^T\tilde{\beta}^{(k)} = \tilde{\lambda}^{(k)}\tilde{\beta}^{(k)}, \quad k = 1, \dots, K-1 \qquad (44)$$
where the $\tilde{\cdot}$ denotes the approximation of the corresponding quantity based on Eq. (42). Premultiplying both sides by $U^T$ and substituting $\gamma^{(k)} = U^T\tilde{\beta}^{(k)} \in \mathbb{R}^R$ results in [17]
$$U^T\tilde{D}^{-1} M_{\tilde{D}}\, U\Lambda^2\gamma^{(k)} = \tilde{\lambda}^{(k)}\gamma^{(k)}, \quad k = 1, \dots, K-1 \qquad (45)$$
The elements of the diagonal matrix $\tilde{D}$ are equal to $[\tilde{D}]_{ii} = \sum_{j=1}^{N_{tr}}[\tilde{\Omega}]_{ij}$, which can be calculated quickly as $G(G^T 1_{N_{tr}})$. Note that the size of this eigenvalue problem is $R \times R$, which can be much smaller than the $N_{tr} \times N_{tr}$ size of the original problem when the spectrum of $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ decays quickly. The approximated eigenvectors $\tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$, of the original problem can be obtained from the corresponding $\gamma^{(k)}$, $k = 1, \dots, K-1$, eigenvectors as $\tilde{\beta}^{(k)} = U\gamma^{(k)}$, $k = 1, \dots, K-1$ [17].

The approximated bias terms $\tilde{b}^{(k)}$, $k = 1, \dots, K-1$, that appear in Eq. (41) can be calculated quickly by using Eq. (13), substituting $V = D^{-1} \approx \tilde{D}^{-1}$, $\Phi\Phi^T = \Omega \approx GG^T$ and $\beta^{(k)} = \tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$.
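Under the same hedged assumptions as the earlier sketches, Eqs. (42)-(45) translate into a few lines of linear algebra: the SVD of $G$ yields $U$ and $\Lambda$, the approximated degrees are obtained as $G(G^T 1_{N_{tr}})$, the small $R \times R$ eigenproblem is solved and the result is mapped back through $\tilde{\beta} = U\gamma$.

```python
import numpy as np

def icd_ksc_eigenvectors(G, K):
    """Approximated KSC eigenvectors from the ICD factor G (Eqs. (42)-(45)):
    solve the R x R eigenproblem U^T D~^{-1} M_D~ U Lambda^2 gamma = lambda gamma
    and map the K-1 leading solutions back as beta~ = U gamma."""
    Ntr = G.shape[0]
    U, s, _ = np.linalg.svd(G, full_matrices=False)             # G = U Lambda V^T
    d_inv = 1.0 / (G @ (G.T @ np.ones(Ntr)))                    # [D~]_ii = sum_j [Omega~]_ij
    MD_U = U - np.outer(np.ones(Ntr), d_inv @ U) / d_inv.sum()  # M_D~ U without forming M_D~
    A_small = (U.T @ (d_inv[:, None] * MD_U)) * (s ** 2)[None, :]   # Eq. (45), R x R
    evals, evecs = np.linalg.eig(A_small)
    lead = np.argsort(-evals.real)[: K - 1]
    return U @ evecs[:, lead].real                              # beta~ = U gamma
```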
4.2.2. Reduced set and reduced set coefficients
In order to obtain a sparse model representation, the next step is the determination of the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ and the corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$.

According to Eq. (40), these are then used to approximate the $w^{(k)}$, $k = 1, \dots, K-1$, vectors as linear combinations of the mapped reduced set points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$, thus these $\varphi(\tilde{x}_r)$ vectors should be linearly independent. The ICD algorithm with pivoting gradually selects the columns of $\Omega$ and builds $G \in \mathbb{R}^{N_{tr} \times R}$ such that the approximation $\Omega \approx \tilde{\Omega} = GG^T$ obtained by adding that column in the current step is the best [31]. The column space of $\tilde{\Omega}$ is spanned by the selected columns of $\Omega$, thus the corresponding mapped points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$ are linearly independent. Therefore, the pivots selected by the ICD automatically provide the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ [17]. The linear independence of the mapped reduced set points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$ also implies that $\Omega_{\Psi\Psi} = \Psi\Psi^T \in \mathbb{R}^{R \times R}$ is full rank, where $\Psi = [\varphi(\tilde{x}_1), \dots, \varphi(\tilde{x}_R)]^T \in \mathbb{R}^{R \times n_h}$.

The corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$, can be determined by minimizing the squared distance between the corresponding $w^{(k)} = \Phi^T\tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$, and $\tilde{w}^{(k)} = \Psi^T\zeta^{(k)}$, $k = 1, \dots, K-1$, vectors, where $\Phi = [\varphi(x_1), \dots, \varphi(x_{N_{tr}})]^T \in \mathbb{R}^{N_{tr} \times n_h}$ [16, 17, 18]. This can be written as
$$\min_{\zeta^{(k)}}\|w^{(k)} - \tilde{w}^{(k)}\|_2^2 = \min_{\zeta^{(k)}}\left\{w^{(k)T} w^{(k)} - 2\tilde{w}^{(k)T} w^{(k)} + \tilde{w}^{(k)T}\tilde{w}^{(k)}\right\} = \min_{\zeta^{(k)}}\left\{\tilde{\beta}^{(k)T}\Omega_{\Phi\Phi}\tilde{\beta}^{(k)} - 2\zeta^{(k)T}\Omega_{\Psi\Phi}\tilde{\beta}^{(k)} + \zeta^{(k)T}\Omega_{\Psi\Psi}\zeta^{(k)}\right\} \qquad (46)$$
$k = 1, \dots, K-1$, where $\Omega_{\Phi\Phi} = \Phi\Phi^T \in \mathbb{R}^{N_{tr} \times N_{tr}}$, $\Omega_{\Psi\Phi} = \Psi\Phi^T \in \mathbb{R}^{R \times N_{tr}}$ and $\Omega_{\Psi\Psi} = \Psi\Psi^T \in \mathbb{R}^{R \times R}$ are the corresponding kernel matrices. Taking the partial derivative of this objective with respect to $\zeta^{(k)}$ and setting it to zero, the first order optimality condition $\partial\|w^{(k)} - \tilde{w}^{(k)}\|_2^2/\partial\zeta^{(k)} = 0$ results in the following linear system
$$\Omega_{\Psi\Psi}\zeta^{(k)} = \Omega_{\Psi\Phi}\tilde{\beta}^{(k)} \qquad (47)$$
for the $\zeta^{(k)}$, $k = 1, \dots, K-1$, reduced set coefficients [16, 17, 18].
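Solving Eq. (47) is a single well-conditioned $R \times R$ linear solve per eigenvector, and all $K-1$ systems share the same coefficient matrix; a short sketch follows (indexing the kernel sub-blocks with the ICD pivots is our own assumption on how the reduced set is represented).

```python
import numpy as np

def reduced_set_coefficients(Omega_psi_psi, Omega_psi_phi, B_tilde):
    """Solve Eq. (47), Omega_PsiPsi zeta^(k) = Omega_PsiPhi beta~^(k), for all
    K-1 coefficient vectors at once (returned as the columns of the result)."""
    return np.linalg.solve(Omega_psi_psi, Omega_psi_phi @ B_tilde)

# with the pivots returned by the incomplete Cholesky sketch above:
#   Omega_psi_phi = Omega[pivots, :]                 # R x N_tr kernel block
#   Omega_psi_psi = Omega[np.ix_(pivots, pivots)]    # R x R kernel block
#   Zeta = reduced_set_coefficients(Omega_psi_psi, Omega_psi_phi, B_tilde)
```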
4.2.3. Clustering in the sparse model
Having both the reduced set and the corresponding reduced set coefficients determined, one can calculate the score variables for arbitrary data points by means of Eq. (41). The clustering of these points can be done based on the obtained score variables and the codebook as discussed in Section 3.3 when the codebook is formed based on the sign encoding-decoding scheme. However, the angular similarity based encoding-decoding scheme is slightly different in the case of the sparse model.

The angular similarity based encoding-decoding scheme was introduced in Section 3.3 based on the facts that the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying clusters in the ideal case and that the $z^* - b \in \mathbb{R}^{1 \times (K-1)}$ points are linear combinations of the $K$, $K-1$ dimensional vectors formed from these constants (see Eq. (34)). According to Eq. (41), the approximated $\tilde{z}_i^* - \tilde{b}$ point, associated to an arbitrary input data point $x_i \in \mathbb{R}^d$, is expressed as the linear combination of the reduced set coefficient points $\tau_r^* = [\zeta_r^{(1)}, \dots, \zeta_r^{(K-1)}]$, $r = 1, \dots, R$,
$$\tilde{z}_i^* - \tilde{b} = \sum_{r=1}^{R} K(x_i, \tilde{x}_r)\,\tau_r^* \qquad (48)$$
when the reduced set method based sparse model is used. Although the $\tau_r^*$ vectors associated to reduced set points $\tilde{x}_r$ from the same cluster, $r \in C_k$, $k \in \{1, \dots, K\}$, are not piecewise constant, one can show that in the ideal case they point in the same direction, i.e. they are collinear. In order to do so, let us assume that the rows and columns of $\Omega_{\Psi\Psi}$ and $\Omega_{\Psi\Phi}$ and the rows of $\tilde{B} = [\tilde{\beta}^{(1)}, \dots, \tilde{\beta}^{(K-1)}] \in \mathbb{R}^{N_{tr} \times (K-1)}$ are ordered according to the perfect partition of the training data set. $\Omega_{\Psi\Psi}$ is a block diagonal and $\Omega_{\Psi\Phi}$ is a block matrix in the ideal case. Furthermore, $\Omega_{\Psi\Psi}$ has full rank, thus it is invertible, and its inverse $\Omega_{\Psi\Psi}^{-1}$ matrix is also block diagonal, partitioned identically to $\Omega_{\Psi\Psi}$. Therefore, the $\Theta = \Omega_{\Psi\Psi}^{-1}\Omega_{\Psi\Phi} \in \mathbb{R}^{R \times N_{tr}}$ matrix is also a block matrix on the clusters, which means that $[\Theta]_{rj} \neq 0 \iff r, j \in C_k$, $k \in \{1, \dots, K\}$. Then the reduced set coefficient matrix $X = [\zeta^{(1)}, \dots, \zeta^{(K-1)}] \in \mathbb{R}^{R \times (K-1)}$ can be written as $X = \Theta\tilde{B}$ by using Eq. (47). Let us assume that the $r$-th reduced set point $\tilde{x}_r \in \mathcal{R}$ belongs to the $k$-th cluster. The $rp$-th element of $X$, which corresponds to the $p$-th component of the $\tau_r^*$ vector,