Fast kernel spectral clustering based on incomplete
Cholesky factorization for large scale data analysis
Mihály Novák a,∗, Carlos Alzate b,c, Rocco Langone c, Johan A.K. Suykens c
a Institute for Nuclear Research of the Hungarian Academy of Sciences, MTA ATOMKI, 18/c Bem tér, H-4026 Debrecen, Hungary
b Smarter Cities Technology Centre, IBM Research Ireland, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland
c Department of Electrical Engineering, ESAT-STADIUS, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Abstract
A fast spectral clustering algorithm, suitable for large scale data clustering problems, is presented in this work. The algorithm is an improved version of the sparse multiway spectral clustering algorithm formulated as weighted kernel principal component analysis (PCA) within the primal-dual least-squares support vector machine (LS-SVM) framework. Sparsity is introduced by combining the reduced set method with an incomplete Cholesky factorization based low rank approximation of the kernel matrix. The out-of-sample capability of the clustering model and the possibility of model selection provide the opportunity to obtain the optimal model parameters through classical training-validation phases, and the obtained optimal model can be easily extended to unseen data points. However, the computational complexity of the original sparse algorithm is quadratic in the training set size, which prevented the application of the method to large scale data analysis problems. This quadratic computational complexity is reduced to linear in the proposed algorithm, while attractive properties such as the simple out-of-sample extension and the possibility of model selection through training and validation are preserved. Furthermore, a new, soft cluster membership encoding-decoding scheme and a corresponding model selection criterion are also proposed. These modifications make the algorithm capable of solving much larger clustering problems than the original one, with a refined model selection procedure. The significantly reduced computational complexity and the improved accuracy and sparsity of the presented algorithm compared to the original version are illustrated by experimental results obtained on difficult toy problems as well as real life problems such as image segmentation.
Keywords: spectral clustering, sparse model, large-scale data, kernel methods,
LS-SVM
∗Corresponding author
1. Introduction
Clustering algorithms comprise unsupervised, exploratory data mining techniques with the common goal of grouping objects that are similar according to some similarity measure. Beyond classical clustering algorithms such as K-means, a set of methods has been developed that makes use of some eigenvectors of a normalized affinity matrix derived from the available data. These spectral clustering algorithms are known to perform well even in cases where classical methods fail and have been successfully applied to data analysis, image processing, pattern recognition and many other machine learning related problems.

The classical formulation of spectral clustering algorithms [1, 2, 3, 4, 5, 6] starts from a graph partitioning problem that is NP-hard due to the discrete constraint on the cluster indicators. By relaxing the discrete constraint, the problem is transformed into an eigenvalue problem involving a normalized affinity matrix formed from pairwise similarities between the data points at hand. The discrete cluster indicators are then inferred from some selected eigenvectors that give the optimal solution of the relaxed problem.

Traditional spectral clustering models are typically based on the full set of data due to the lack of a clear extension to unseen data points. As a consequence, the memory usage and computation time of these algorithms grow quickly with increasing data size due to the increasing size of the involved eigenvalue problem. A possible solution to overcome these problems is offered by the spectral clustering algorithms [7, 8] based on the Nyström method [9]. A clustering model is trained on a subset of data points. The training phase involves the approximation of the implicit eigenfunctions by means of the Nyström method. The cluster membership of out-of-sample data points can then be determined by using this approximation. However, the underlying clustering model is not known and the parameter selection is done in a heuristic way.

Unlike the classical spectral clustering algorithms, which are formulated on graph theoretical bases, a clustering algorithm was proposed in [10, 11] as a weighted form of kernel principal component analysis [12, 13] derived within the least-squares support vector machine (LS-SVM) formalism [14] by using a primal-dual optimization framework. This kernel spectral clustering (KSC) algorithm provides a straightforward way to extend the clustering model, trained on a subset of data points, to unseen points through simple projections onto some eigenvectors obtained during the training phase. Furthermore, a model selection criterion can be derived by making use of the known structure of these eigenvectors in the ideal case. The optimal model parameters, such as the number of clusters or the kernel parameters, can be determined through the maximization of this model selection criterion. Therefore, this kernel spectral clustering model can be trained on a subset of data, validated and extended to out-of-sample points similarly to supervised methods, in spite of the unsupervised nature of the clustering problem.

The out-of-sample extension capability of the KSC model directly provides the possibility of sparse model construction by training the model on a selected subset of the full data set. It has been shown in [11] that the KSC model outperforms the Nyström approximation based clustering algorithm [8] both in terms of accuracy and computation time, even in case of random training set selection. Moreover, the computation time and the memory requirement of the original KSC model can be significantly reduced by using a clever training set selection that can notably increase the sparsity of the clustering model while improving the accuracy. Such sparse versions of the original KSC model were proposed based on the combination of the reduced set method [15, 16] with the incomplete Cholesky decomposition (ICD) based low rank approximation of the kernel matrix in [17] and by using a Rényi entropy maximization based training set selection in [18].

The sparse version of the KSC model, based on the combination of the ICD and the reduced set method, was shown to be able to perfectly solve, with a small reduced set size, problems that are very hard even for spectral clustering algorithms due to the properties of the corresponding spectrum [17] (see Sections 3.2.3 and 5.1). However, the computational complexity of this algorithm is quadratic in the training set size, which prevented its application to large scale problems.

An improved version of the ICD based sparse KSC algorithm is presented in this paper. The quadratic computational complexity in the training set size is reduced to linear, while all the attractive properties, such as the simple out-of-sample extension and the possibility of model selection through training and validation, are preserved. A new, soft cluster membership encoding-decoding scheme and a corresponding new model selection criterion are also proposed, both for the original non-sparse KSC algorithm and for the presented sparse version. These modifications make the algorithm capable of solving much larger clustering problems than the original one.

The paper is organized as follows: the spectral clustering problem and the classical algorithms are introduced in Section 2. The original non-sparse KSC model is reviewed in Section 3. Important properties of some eigenvectors and the corresponding score variables, which play an essential role in the spectral clustering interpretation of the weighted kernel PCA and in the introduced model selection criterion, are also discussed there. In Section 4, first the original ICD based sparse KSC model and algorithm is reviewed by highlighting the computational bottlenecks, then a new algorithm is proposed. The significantly improved computational time, sparsity and accuracy of the proposed algorithm over the original version are illustrated through computational experiments in Section 5.
2. Spectral clustering
Given a set of data points $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, together with some similarity measure $s : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ between pairs of data points $x_i, x_j$, the goal of clustering is to partition the input data set such that points in the same subset are similar and points in different subsets are dissimilar. An intuitive form of representing the data with the given similarity measure is a similarity graph $G = (V, E)$, where $V$ is the set of vertices, representing the data points in the graph, and $E$ is the set of edges between the vertices, weighted by the similarity between the corresponding data points. If the similarity measure is symmetric then the graph $G$ is undirected and a symmetric similarity matrix $S$ can be constructed from $E$ such that $S \in \mathbb{R}^{N \times N}_{\geq 0}$, $[S]_{ij} = s_{ij} = s(x_i, x_j) = s(x_j, x_i)$, $\forall i, j \in \{1, \dots, N\}$. The degree $d_i$ of a vertex, which represents the sum of the weights attached to this node, i.e. the sum of similarities between data point $x_i$ and all the data points in $\mathcal{D}$, is defined as $d_i = \sum_{j=1}^{N} s(x_i, x_j) = \sum_{j=1}^{N} [S]_{ij}$. The degree matrix $D$, associated to the given similarity graph $G$, is a diagonal matrix with diagonal elements $[d_1, \dots, d_N]$.
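As a small illustration of these definitions, the sketch below (not from the paper; the RBF similarity, the bandwidth `gamma` and the toy data are our own assumptions) builds the similarity matrix $S$, the degree vector and the degree matrix $D$ with NumPy.

```python
import numpy as np

def similarity_and_degree(X, gamma=1.0):
    """RBF similarity matrix S, degree vector d and diagonal degree matrix D
    for the data points stored in the rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-sq_dists / gamma)       # [S]_ij = s(x_i, x_j), symmetric
    d = S.sum(axis=1)                   # d_i = sum_j [S]_ij
    return S, d, np.diag(d)

# toy data: two well separated groups in R^2
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 8.0])
S, d, D = similarity_and_degree(X, gamma=2.0)
```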
The above clustering problem can now be reformulated, based on the similarity graph, as a graph partitioning problem [2, 6]. The goal is to find a partition of the vertices $V$ into $K$ disjoint subsets $\{A_1, \dots, A_K\}$, $V = A_1 \cup \dots \cup A_K$ and $A_k \cap A_l = \emptyset$, $\forall k, l \in \{1, \dots, K\}$, $k \neq l$, such that the edges between vertices belonging to the same subset have a high weight while the edges between vertices from different subsets have a low weight. Formulating this directly leads to the so-called mincut problem [19]. However, mincut can provide unbalanced partitions. The mincut objective can be modified to overcome this problem by involving the size of the subsets $A_k$ in the objective. The size of a subset can be measured either by the cardinality of the subset $|A_k|$ or by summing up the weights of all edges attached to the vertices that belong to the subset, $\mathrm{vol}(A_k) = \sum_{i \in A_k} d_i$. The former leads to the RatioCut [4] while the latter results in the Ncut objective [1]. Only the Ncut graph partitioning will be discussed in the following.
Based on the above discussion, the Ncut objective [1] can be written as
$$\mathrm{Ncut}\{A_1, \dots, A_K\} = \sum_{k=1}^{K} \sum_{i \in A_k,\, j \notin A_k} \frac{[S]_{ij}}{\mathrm{vol}(A_k)} \qquad (1)$$
One can introduce binary cluster indicator vectors $f_k \in \{0,1\}^N$, $k = 1, \dots, K$, such that the $i$-th component of $f_k$ is 1 if the $i$-th data point belongs to the $k$-th cluster, $i \in A_k$, and zero otherwise. The cluster indicator matrix $F \in \{0,1\}^{N \times K}$ is formed by these mutually orthogonal $f_k$ cluster indicator vectors. Defining the vectors $h_k = D^{1/2} f_k / \|D^{1/2} f_k\|_2$, $k = 1, \dots, K$, and the orthonormal matrix $H = [h_1, \dots, h_K]$, the Ncut objective given by Eq. (1) can be written as
$$\mathrm{Ncut}\{A_1, \dots, A_K\} = \sum_{k=1}^{K} \left(1 - \sum_{i \in A_k,\, j \in A_k} \frac{[S]_{ij}}{\mathrm{vol}(A_k)}\right) = K - \mathrm{Tr}\!\left(H^T D^{-\frac{1}{2}} S D^{-\frac{1}{2}} H\right) \qquad (2)$$
The Ncut minimization problem can then be reformulated as the following trace maximization
$$\operatorname*{argmin}_{H}\, \mathrm{Ncut}\{A_1, \dots, A_K\} = \operatorname*{argmax}_{H}\, \mathrm{Tr}\!\left(H^T L_{\mathrm{sym}} H\right) \quad \text{such that} \quad H^T H = I_K \qquad (3)$$
by introducing the (symmetric) normalized graph Laplacian $L_{\mathrm{sym}} = D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$ [6]. This optimization problem is NP-complete due to the discrete constraint on $H$ [1]. However, relaxing this discrete constraint by allowing $H$ to be a real valued matrix $\tilde{H} \in \mathbb{R}^{N \times K}$ yields [20]
$$\operatorname*{argmin}_{\tilde{H}}\, \mathrm{Ncut}\{A_1, \dots, A_K\} = \operatorname*{argmax}_{\tilde{H}}\, \mathrm{Tr}\!\left(\tilde{H}^T L_{\mathrm{sym}} \tilde{H}\right) \quad \text{such that} \quad \tilde{H}^T \tilde{H} = I_K \qquad (4)$$
According to the Ky Fan theorem [21], a solution of this relaxed trace maximization problem can be given by taking the eigenvectors $\alpha_k$, $k = 1, \dots, K$, associated to the $K$ highest eigenvalues $\lambda_k$ of the matrix $L_{\mathrm{sym}} = D^{-\frac{1}{2}} S D^{-\frac{1}{2}}$, as columns of $\tilde{H}$. Hence, a continuous solution of the Ncut partitioning problem can be obtained by solving the eigenvalue problem $L_{\mathrm{sym}}\alpha = \lambda\alpha$.
Multiplying both sides from the left by $D^{-\frac{1}{2}}$ and introducing the vectors $\beta = D^{-\frac{1}{2}}\alpha$, the above eigenvalue problem can be transformed to $L_{\mathrm{rw}}\beta = \lambda\beta$, where $L_{\mathrm{rw}} = D^{-1} S$ is the so-called random walk normalized graph Laplacian [6]. Its name expresses the strong relation of this normalized graph Laplacian to a Markov random walk on a graph [22, 5]. $L_{\mathrm{rw}}$ is a row stochastic matrix whose $ij$-th entry represents the probability of moving from the $i$-th vertex to the $j$-th one in one step of the random walk. Therefore, $L_{\mathrm{rw}}$ can be interpreted as a stochastic transition matrix and plays a key role in the Markov random walk interpretation of clustering. Within this representation, the Ncut minimization corresponds to looking for a partition of the graph such that the random walk, described by $L_{\mathrm{rw}}$ as transition matrix, remains most of the time on vertices that belong to the same subset of the partition and there are only a few jumps between vertices that belong to different subsets of the partition [5, 6].
The last step of the clustering, either by solving the eigenvalue problem involving the symmetric $L_{\mathrm{sym}}$ matrix or the generalized eigenvalue problem derived from the non-symmetric $L_{\mathrm{rw}}$ matrix, is to convert the continuous relaxed solutions $\alpha^{(k)}$, $k = 1, \dots, K$, or $\beta^{(k)}$, $k = 1, \dots, K$, into discrete cluster indicators. This is usually done by applying simple K-means to the $N$ $K$-dimensional points formed by the corresponding components of the obtained $K$ leading eigenvectors. The latter choice, involving $L_{\mathrm{rw}}$, gives the spectral clustering algorithm of Shi and Malik [1] while the former, involving $L_{\mathrm{sym}}$, leads to the NJW algorithm [3].
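To make this last step concrete, the following minimal sketch (our own illustrative code, not the authors' implementation) performs the relaxed Ncut clustering on a given similarity matrix: it computes the $K$ leading eigenvectors through the symmetric matrix $D^{-1/2} S D^{-1/2}$, transforms them back to eigenvectors of $L_{\mathrm{rw}}$ via $\beta = D^{-1/2}\alpha$, and applies K-means to the rows; scikit-learn is used only for convenience.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_ncut(S, K):
    """Relaxed Ncut clustering: K leading eigenvectors of L_rw = D^{-1} S
    (computed via the symmetric D^{-1/2} S D^{-1/2}) followed by K-means
    on the rows of the N x K eigenvector matrix."""
    d_isqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L_sym = d_isqrt[:, None] * S * d_isqrt[None, :]   # D^{-1/2} S D^{-1/2}
    _, evecs = np.linalg.eigh(L_sym)                  # eigenvalues in ascending order
    alpha = evecs[:, -K:]                             # K leading eigenvectors
    beta = d_isqrt[:, None] * alpha                   # beta = D^{-1/2} alpha
    return KMeans(n_clusters=K, n_init=10).fit_predict(beta)

# usage with a symmetric similarity matrix S (e.g. built as in the sketch above):
# labels = spectral_clustering_ncut(S, K=2)
```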
3. Kernel spectral clustering through weighted kernel PCA
A multiway kernel spectral clustering algorithm with out-of-sample extension capability was introduced in [11], based on the primal-dual least-squares support vector machine (LS-SVM) [14, 13] formulation of weighted kernel PCA. The algorithm will be reviewed in this section together with some important properties of the solutions and the corresponding score variables that are essential for the kernel spectral clustering interpretation of the weighted kernel PCA. Furthermore, a new cluster membership encoding-decoding scheme and a model selection criterion will be proposed based on some of these properties.
3.1. LS-SVM formulation of weighted kernel PCA
Given an input data set $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, together with the corresponding weights $\mathcal{V} = \{v_l\}_{l=1}^N$, $v_l \in \mathbb{R}^+$, the goal of weighted kernel PCA is to find the vectors $w^{(k)}$, $k = 1, \dots, K < N$, such that the weighted variance of the projections of the weight centered feature map onto these $w^{(k)}$ vectors is maximal. Following the LS-SVM formulation of the unweighted version of the kernel PCA problem, given in [13], the weighted kernel PCA optimization can be formulated as follows:
$$\max_{w^{(k)}, b^{(k)}} \sum_{k=1}^{K} \sum_{l=1}^{N} v_l \left[0 - \left(w^{(k)T}\varphi(x_l) + b^{(k)}\right)\right]^2 \qquad (5)$$
where $\varphi(\cdot) : \mathbb{R}^d \to \mathbb{R}^{n_h}$ is the non-linear mapping of the $d$-dimensional input data to an $n_h$-dimensional high-dimensional feature space and $b^{(k)} \in \mathbb{R}$, $k = 1, \dots, K$, are bias terms¹. This maximization problem can be written as
$$\max_{w^{(k)}, e^{(k)}, b^{(k)}} \sum_{k=1}^{K} e^{(k)T} V e^{(k)} \qquad (6)$$
by introducing the diagonal weight matrix $V \in \mathbb{R}^{N \times N}$, $[V]_{ii} = v_i$, $i = 1, \dots, N$, the feature map matrix $\Phi \in \mathbb{R}^{N \times n_h}$ as
$$\Phi = \begin{bmatrix} \varphi^T(x_1) \\ \vdots \\ \varphi^T(x_N) \end{bmatrix} \qquad (7)$$
and the error vectors $e^{(k)} \in \mathbb{R}^N$, $k = 1, \dots, K$, as
$$e^{(k)} = \Phi w^{(k)} + b^{(k)} 1_N, \qquad e^{(k)} = \begin{bmatrix} e^{(k)}_1 \\ \vdots \\ e^{(k)}_N \end{bmatrix} = \begin{bmatrix} w^{(k)T}\varphi(x_1) + b^{(k)} \\ \vdots \\ w^{(k)T}\varphi(x_N) + b^{(k)} \end{bmatrix} \qquad (8)$$
The corresponding primal optimization problem can be formulated as [11]
$$\max_{w^{(k)}, e^{(k)}, b^{(k)}} J(w, e, b) = \frac{1}{2}\sum_{k=1}^{K} \gamma^{(k)} e^{(k)T} V e^{(k)} - \frac{1}{2}\sum_{k=1}^{K} w^{(k)T} w^{(k)} \quad \text{such that} \quad e^{(k)} = \Phi w^{(k)} + b^{(k)} 1_N,\ k = 1, \dots, K \qquad (9)$$
¹ It can be shown that using the bias term approach leads to the same weight centered kernel matrix as the explicit weighted centering of the feature map (see Appendix A).
where $\gamma^{(k)}$ are real positive constants. The Lagrangian of this constrained optimization problem is
$$\mathcal{L}(w^{(k)}, e^{(k)}, b^{(k)}; \beta^{(k)}) = \frac{1}{2}\sum_{k=1}^{K}\gamma^{(k)} e^{(k)T} V e^{(k)} - \frac{1}{2}\sum_{k=1}^{K} w^{(k)T} w^{(k)} - \sum_{k=1}^{K}\beta^{(k)T}\left(e^{(k)} - \Phi w^{(k)} - b^{(k)} 1_N\right) \qquad (10)$$
with Lagrange multiplier vectors $\beta^{(k)} \in \mathbb{R}^N$, $k = 1, \dots, K$. The Karush-Kuhn-Tucker (KKT) optimality conditions yield
$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial w^{(k)}} = 0 \;&\rightarrow\; w^{(k)} = \Phi^T\beta^{(k)} \\
\frac{\partial\mathcal{L}}{\partial e^{(k)}} = 0 \;&\rightarrow\; \beta^{(k)} = \gamma^{(k)} V e^{(k)} \\
\frac{\partial\mathcal{L}}{\partial b^{(k)}} = 0 \;&\rightarrow\; 1_N^T\beta^{(k)} = 0 \\
\frac{\partial\mathcal{L}}{\partial \beta^{(k)}} = 0 \;&\rightarrow\; e^{(k)} - \Phi w^{(k)} - b^{(k)} 1_N = 0
\end{aligned} \qquad (11)$$
for $k = 1, \dots, K$. Eliminating the primal variables $w^{(k)}$ and $e^{(k)}$ by substituting the first and second ($e^{(k)} = V^{-1}\beta^{(k)}/\gamma^{(k)}$) equations into the fourth results in
$$\frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \Phi\Phi^T\beta^{(k)} - 1_N b^{(k)} = 0, \quad k = 1, \dots, K \qquad (12)$$
Then applying the third KKT condition gives the expression for the last primal variable $b^{(k)}$
$$1_N^T\beta^{(k)} = \gamma^{(k)} 1_N^T V \Phi\Phi^T\beta^{(k)} + \gamma^{(k)} 1_N^T V 1_N b^{(k)} = 0 \;\rightarrow\; 1_N^T V 1_N b^{(k)} = -1_N^T V \Phi\Phi^T\beta^{(k)} \;\rightarrow\; b^{(k)} = -\frac{1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} \qquad (13)$$
for $k = 1, \dots, K$. Substituting this back into Eq. (12) gives the following equations for the stationary points of the Lagrangian expressed in terms of the dual variables $\beta^{(k)}$
$$\frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \Phi\Phi^T\beta^{(k)} + \frac{1_N 1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} = 0 \;\rightarrow\; \frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} - \left(I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N}\right)\Phi\Phi^T\beta^{(k)} = 0 \qquad (14)$$
$k = 1, \dots, K$, where $I_N$ is the identity matrix. Multiplying both sides by the diagonal weight matrix $V$ from the left and substituting $\lambda^{(k)} = 1/\gamma^{(k)}$, the solution can be expressed in terms of the dual variables $\beta^{(k)}$ as the following eigenvalue problem involving the dual variables as eigenvectors
$$V M_v \Omega\beta^{(k)} = \lambda^{(k)}\beta^{(k)}, \quad k = 1, \dots, K \qquad (15)$$
where
$$M_v = I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N} \qquad (16)$$
is the weighted centering matrix and $\Omega = \Phi\Phi^T$ is the kernel matrix with $ij$-th entry $[\Omega]_{ij} = \varphi(x_i)^T\varphi(x_j) = K(x_i, x_j)$, $i, j = 1, \dots, N$, where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a positive definite kernel.
It should be noted that in general Eq. (9) is a nonconvex problem, therefore the KKT optimality conditions are necessary but not sufficient [23, 11]. However, taking the stationary points of the Lagrangian, i.e. the $\beta$ eigenvectors of the $V M_v\Omega$ matrix, as the pool of candidate solutions, the objective given by Eq. (6) can be maximized by selecting the $K$ leading eigenvectors since
$$\sum_{k=1}^{K} e^{(k)T} V e^{(k)} = \sum_{k=1}^{K} \frac{\beta^{(k)T} V^{-1}}{\gamma^{(k)}}\, V\, \frac{V^{-1}\beta^{(k)}}{\gamma^{(k)}} = \sum_{k=1}^{K} \frac{1}{\gamma^{(k)2}}\,\beta^{(k)T} V^{-1}\beta^{(k)} = \sum_{k=1}^{K} \lambda^{(k)2} \qquad (17)$$
where the $V^{-1}$-normality (i.e. $\beta^{(k)T} V^{-1}\beta^{(k)} = 1$) of the $\beta^{(k)}$ eigenvectors of the $V M_v\Omega$ matrix was assumed.

The $k$-th projection of the $l$-th mapped input data point $\varphi(x_l)$ can be written as
$$z_l^{(k)} = z^{(k)}(x_l) = w^{(k)T}\varphi(x_l) + b^{(k)} = \beta^{(k)T}\Phi\varphi(x_l) + b^{(k)} = \sum_{i=1}^{N}\beta_i^{(k)}\varphi(x_i)^T\varphi(x_l) + b^{(k)} = \sum_{i=1}^{N}\beta_i^{(k)} K(x_i, x_l) + b^{(k)} \qquad (18)$$
by using the first KKT condition given by Eqs. (11) and the kernel trick. The $k$-th score variable $z^{(k)} \in \mathbb{R}^{N \times 1}$ is
$$z^{(k)} = \Phi w^{(k)} + 1_N b^{(k)} = \Phi\Phi^T\beta^{(k)} - \frac{1_N 1_N^T V \Phi\Phi^T\beta^{(k)}}{1_N^T V 1_N} = \left(I_N - \frac{1_N 1_N^T V}{1_N^T V 1_N}\right)\Phi\Phi^T\beta^{(k)} = M_v\Omega\beta^{(k)} = \lambda^{(k)} V^{-1}\beta^{(k)} \qquad (19)$$
The weighted kernel PCA primal optimization problem Eq. (9) thus transforms into an eigenproblem in the dual weight space, Eq. (15), involving the dual variables $\beta$ as eigenvectors of the weight centered weighted kernel matrix $V M_v\Omega$. The dual solutions of the weighted kernel PCA problem can be obtained by taking the $K$ leading eigenvectors of the $V M_v\Omega$ matrix. The effect of multiplying the kernel matrix $\Omega$ by the weighted centering matrix $M_v$ is that the weighted mean is removed from each column of the kernel matrix. Note that $M_v$ becomes the ordinary centering matrix by choosing the weight matrix equal to the identity matrix, $V = I_N$, and Eq. (15) then corresponds to the dual form of the unweighted kernel PCA problem discussed in [13].

According to Section 2, taking the kernel matrix as a similarity matrix and choosing the weight matrix equal to the inverse degree matrix, $V = D^{-1}$, Eq. (15) becomes
$$D^{-1} M_D \Omega\beta^{(k)} = \lambda^{(k)}\beta^{(k)}, \quad k = 1, \dots, K \qquad (20)$$
This is similar to the random walk spectral clustering eigenproblem based on the random walk normalized Laplacian $L_{\mathrm{rw}}$ discussed in Section 2. However, now the weight centered kernel matrix $M_D\Omega$ plays the role of the similarity matrix in the eigenproblem [11], where
$$M_D = I_N - \frac{1_N 1_N^T D^{-1}}{1_N^T D^{-1} 1_N} \qquad (21)$$
Furthermore, Eq. (20) can be transformed to
$$D^{-\frac{1}{2}} M_D \Omega M_D^T D^{-\frac{1}{2}}\alpha^{(k)} = \lambda^{(k)}\alpha^{(k)}, \quad k = 1, \dots, K \qquad (22)$$
by using the identity given by Eq. (66), then multiplying both sides by $D^{1/2}$ from the left and substituting $\alpha = D^{1/2}\beta$. The resulting eigenproblem is similar to the symmetric spectral clustering eigenproblem based on the symmetric normalized Laplacian $L_{\mathrm{sym}}$ discussed in Section 2, with the difference that now the symmetric weight centered kernel matrix $M_D\Omega M_D^T$ plays the role of the similarity matrix in the eigenproblem.

Therefore, a close relation can be shown between the spectral clustering algorithms, involving either the random walk Laplacian $L_{\mathrm{rw}}$ or the symmetric normalized Laplacian $L_{\mathrm{sym}}$, and the weighted kernel PCA by choosing the weight matrix as $V = D^{-1}$ [10]. However, some properties of the $K$ leading eigenvectors of Eqs. (20) and (22) will be different from those of the corresponding eigenvectors of $L_{\mathrm{rw}}$ and $L_{\mathrm{sym}}$ due to the weighted centering of the kernel matrix. These properties will be discussed in the next section.
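The training stage implied by Eqs. (20)-(21) amounts to a small amount of dense linear algebra. The sketch below is a non-sparse toy implementation under stated assumptions (RBF kernel, weight matrix $V = D^{-1}$); the function names are ours and it is meant only to illustrate the computation, not to reproduce the authors' code.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / gamma)

def ksc_train(X, K, gamma):
    """Solve D^{-1} M_D Omega beta = lambda beta (Eq. (20)) for the K-1 leading
    eigenvectors and compute the bias terms (Eq. (13) with V = D^{-1}) and the
    training score variables (Eq. (18))."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, gamma)                                # kernel / similarity matrix
    d_inv = 1.0 / Omega.sum(axis=1)                                # diagonal of D^{-1}
    M_D = np.eye(N) - np.outer(np.ones(N), d_inv) / d_inv.sum()    # Eq. (21)
    A = d_inv[:, None] * (M_D @ Omega)                             # D^{-1} M_D Omega
    evals, evecs = np.linalg.eig(A)                                # non-symmetric eigenproblem
    lead = np.argsort(-evals.real)[: K - 1]                        # K-1 leading eigenvectors
    B = evecs[:, lead].real                                        # columns: beta^(1), ..., beta^(K-1)
    b = -(d_inv @ (Omega @ B)) / d_inv.sum()                       # bias terms, Eq. (13)
    Z = Omega @ B + b                                              # score variables, Eq. (18)
    return B, b, Z
```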
3.2. Properties of the eigenvectors and the score variables
Some properties of the eigenvectors of the $D^{-1} M_D\Omega$ matrix and of the corresponding score variables will be discussed in the following. These properties are essential for the spectral clustering interpretation of the weighted kernel PCA and for the formulation of the clustering algorithm.

Let us assume that there are $K$ well formed clusters in the $N$ input data points and that the corresponding perfect partition of this input data set $\mathcal{D} = \{x_i\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, is $\Delta = \{C_1, \dots, C_K\}$; $i \in C_k$ will be used as a shorthand for $x_i \in C_k$. Furthermore, it is assumed that this structure of the input data is well represented by the corresponding similarity graph and kernel matrix, i.e. the similarity graph consists of $K$ disconnected components, which results in a kernel matrix with $[\Omega]_{ij} \neq 0 \iff i, j \in C_k$, $\forall k = 1, \dots, K$, $i, j = 1, \dots, N$. This case, when the clusters are well formed, is often referred to as the ideal case [3].
3.2.1. Piecewise constant eigenvectors
In the ideal case, the algebraic multiplicity of the eigenvalue 1 of the $D^{-1}\Omega$ matrix is equal to the number of disconnected components in the corresponding similarity graph. It was shown in [5] that these $K$ leading eigenvectors of the $D^{-1}\Omega$ matrix are piecewise constant with respect to the partition $\Delta = \{C_1, \dots, C_K\}$ when the corresponding $K$ clusters are well formed, due to the block-stochastic property of the $D^{-1}\Omega$ matrix. A vector $x \in \mathbb{R}^N$ is piecewise constant relative to a partition $\Delta = \{C_1, \dots, C_K\}$ of the indices $i = 1, \dots, N$ if and only if $x_i = x_j$, $\forall i, j \in C_k$, $k = 1, \dots, K$. A matrix $P$, indexed by $i = 1, \dots, N$, is block-stochastic with respect to a partition $\Delta = \{C_1, \dots, C_K\}$ of the indices if the sums $P_{ik} = \sum_{j \in C_k} P_{ij}$ are constant $\forall i \in C_{k'}$, $\forall k, k' = 1, \dots, K$, and the matrix $R \in \mathbb{R}^{K \times K}$ with entries $P_{k'k}$ is non-singular [5]².

Although the $D^{-1} M_D\Omega$ matrix is not block-stochastic, one can show that it has $K-1$ eigenvalues equal to 1 and that the corresponding eigenvectors are piecewise constant on the underlying perfect partition of the input data.

² Note that in the ideal case $D^{-1}\Omega$ is block diagonal, which is a block-stochastic matrix with $R = I_K$.
The $i,j$-th element of the $D^{-1} M_D\Omega$ matrix can be written as
$$[D^{-1} M_D\Omega]_{ij} = \frac{1}{d_i}\left([\Omega]_{ij} - \frac{\sum_{l=1}^{N}\frac{1}{d_l}[\Omega]_{lj}}{\sum_{l=1}^{N}\frac{1}{d_l}}\right) \qquad (23)$$
and the sum of the $i$-th row is
$$\sum_{j=1}^{N}[D^{-1} M_D\Omega]_{ij} = \frac{\sum_{j=1}^{N}[\Omega]_{ij}}{d_i} - \frac{1}{\epsilon\, d_i}\sum_{l=1}^{N}\frac{\sum_{j=1}^{N}[\Omega]_{lj}}{d_l} = 1 - \frac{N}{\epsilon\, d_i} \qquad (24)$$
where $\epsilon = \sum_{i=1}^{N} 1/d_i$. It shows that, unlike $D^{-1}\Omega$, the $D^{-1} M_D\Omega$ matrix is not block-stochastic since the $d_i$, $i = 1, \dots, N$, are usually different for the different rows. However, in the ideal case, if $\beta^{(l)} \in \mathbb{R}^N$ is an eigenvector of the $D^{-1} M_D\Omega$ matrix with associated eigenvalue $\lambda^{(l)} = 1$, then it is also an eigenvector of the $D^{-1}\Omega$ matrix with the same eigenvalue. In order to see this, one first needs to show that the $b^{(l)}$ bias terms, associated to the $\beta^{(l)}$ eigenvectors with $\lambda^{(l)} = 1$, are zero in the ideal case. This can be shown by using the KKT conditions given in Eqs. (11) and taking into account that now $V = D^{-1}$:
$$1_N b^{(l)} = e^{(l)} - \Phi w^{(l)} = \frac{D\beta^{(l)}}{\gamma^{(l)}} - \Phi\Phi^T\beta^{(l)} = \left[\lambda^{(l)} D - \Omega\right]\beta^{(l)} \qquad (25)$$
Multiplying both sides by $1_N^T/N$ from the left yields
$$b^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)[d_1, \dots, d_N]\,\beta^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)\sum_{i=1}^{N} d_i\beta_i^{(l)} = \frac{1}{N}\left(\lambda^{(l)} - 1\right)1_N^T D\beta^{(l)}, \quad l = 1, \dots, K-1 \qquad (26)$$
which gives zero since $\lambda^{(l)} = 1$. Substituting the fourth KKT condition into the second results in
$$\beta^{(l)} = D^{-1} e^{(l)} = D^{-1}\Phi\Phi^T\beta^{(l)} = D^{-1}\Omega\beta^{(l)} \qquad (27)$$
by using $1/\gamma^{(l)} = \lambda^{(l)} = 1$ and $b^{(l)} = 0$. Thus, in the ideal case, if $\beta^{(l)} \in \mathbb{R}^N$ is an eigenvector of the $D^{-1} M_D\Omega$ matrix with associated eigenvalue $\lambda^{(l)} = 1$, then it is also an eigenvector of the $D^{-1}\Omega$ matrix with the same eigenvalue.

According to [5], the eigenvectors of the $D^{-1}\Omega$ matrix with associated eigenvalue equal to 1 are piecewise constant on the clusters in the ideal case, i.e. $\beta_i^{(l)} = \beta_j^{(l)} = \xi_k^{(l)}$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$. This can also be seen by writing $D^{-1}\Omega\beta^{(l)} = \beta^{(l)}$ for the components of $\beta^{(l)}$
$$\frac{1}{d_i}\sum_{j=1}^{N}[\Omega]_{ij}\beta_j^{(l)} = \beta_i^{(l)}, \quad \forall i = 1, \dots, N \qquad (28)$$
and assuming that the $i$-th input data point belongs to the $k$-th cluster
$$\sum_{j \in C_k}[\Omega]_{ij}\beta_j^{(l)} - d_i\beta_i^{(l)} = \sum_{j \in C_k}[\Omega]_{ij}\beta_j^{(l)} - \sum_{j \in C_k}[\Omega]_{ij}\beta_i^{(l)} = 0 \qquad (29)$$
This equation must hold $\forall i \in C_k$, $k \in \{1, \dots, K\}$, which indicates $\beta_i^{(l)} = \beta_j^{(l)} = \xi_k^{(l)}$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$.

Moreover, the centered property of these piecewise constant eigenvectors, $\sum_{i=1}^{N}\beta_i^{(l)} = \sum_{k=1}^{K} n_k\xi_k^{(l)} = 0$, shows that $K-1$ out of the $K$ constants $\xi_k^{(l)}$ that form the given eigenvector can be chosen independently. Hence, the eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue is a $K-1$ dimensional space spanned by the corresponding $K-1$ $D$-orthogonal eigenvectors. Thus these $K-1$ eigenvectors are centered and piecewise constant on the underlying clusters. Therefore, each of the $K$ well formed clusters is represented as a single point in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue.
3.2.2. Cluster membership is coded in signs
It has been shown above that the $D^{-1} M_D\Omega$ matrix has $K-1$ eigenvectors with eigenvalue equal to 1 if the underlying data set consists of $K$ well formed clusters, and that these $K-1$ leading eigenvectors are piecewise constant on the perfect partition of the input data. The eigenvectors of the $D^{-1} M_D\Omega$ matrix are mutually $D$-orthogonal, where $D$ is the diagonal degree matrix with positive diagonal entries, as was shown in Section 3.1. According to the third KKT condition in Eqs. (11), these eigenvectors are centered. The signs of the constants that form the $K-1$ leading eigenvectors must satisfy the following two criteria in order to ensure that the eigenvectors fulfil the centered and mutually $D$-orthogonal properties: (i) at least one out of the $K$ constants that form an individual eigenvector has a different sign than the others; (ii) no two vectors among these $K-1$ eigenvectors have the same signs for all of their corresponding constants. Therefore, each of the $K$ well formed clusters is represented as one of $K$ single points in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue, and these $K$ points, corresponding to different clusters, are located in different orthants of this $K-1$ dimensional eigen-subspace.
3.2.3. Some results of perturbation analysis
The ideal case, when the perfect partition of the input data set into $K$ disjoint subsets appears in the similarity graph as $K$ disconnected subgraphs, has been considered so far. The corresponding similarity matrix $\Omega$ is a block diagonal matrix in this ideal case. It has been shown above that each of the $K$ disconnected components is represented as a single point in the $K-1$ dimensional eigenspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue, due to the fact that these eigenvectors are piecewise constant on the $K$ disconnected components. Therefore, the piecewise constant property of these $K-1$ leading eigenvectors is crucial in spectral clustering since it makes it possible to transform the original clustering problem, given in the input space, into a trivial clustering problem in the $K-1$ dimensional eigen-subspace of the $D^{-1} M_D\Omega$ matrix associated to the $\lambda = 1$ eigenvalue. The question is then how this piecewise constant property of the $K-1$ leading eigenvectors changes in the non-ideal case, when there are edges between the $K$ subgraphs of the ideal case, resulting in a non-block diagonal similarity matrix.

Perturbation analysis can provide an answer in the nearly ideal case, when there still are distinct clusters in the data set but the similarities between data points that belong to different clusters are not necessarily zero, only small compared to the within cluster similarities. The difference between the nearly ideal and the ideal case can be considered as a small perturbation. According to the Davis-Kahan theorem, if a symmetric matrix $A$ is modified by an additive perturbation $H$, the distance between the eigenspaces corresponding to the first $k$ eigenvalues of $A$ and the first $k$ eigenvalues of the perturbed matrix $\tilde{A} = A + H$ is bounded by $\|H\|_F/\delta_k$, where $\|\cdot\|_F$ denotes the Frobenius norm, $\delta_k = \lambda_k - \lambda_{k+1}$ is the additive eigengap and $\lambda_k$ is the $k$-th eigenvalue of $A$ when they are in decreasing order³. It means that the larger the gap between $\lambda_k$ and $\lambda_{k+1}$ is, the closer the $k$ leading eigenvectors of the perturbed matrix $\tilde{A}$ and of the original matrix $A$ are.

The consequence of the Davis-Kahan theorem is that if the perturbation is small and the eigengap between the $(K-1)$-th and $K$-th ordered eigenvalues of the $D^{-1} M_D\Omega$ matrix is large enough, the first $K-1$ leading eigenvectors will be very close to the corresponding eigenvectors of the ideal case. Therefore, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix will be approximately piecewise constant on the clusters even if the between cluster similarities are not necessarily zero (but small compared to the within cluster similarities) and/or the eigengap is large enough. The $K$ clusters will be represented in the $K-1$ dimensional space spanned by these $K-1$ leading eigenvectors as $K$ clouds around the $K$ corresponding points of the ideal case. The closer to the ideal case we are, the smaller the spread of the individual clouds is.

³ Although $D^{-1} M_D\Omega$ is not a symmetric matrix, it can always be transformed to the corresponding symmetric problem and vice versa (see Eq. (22)).
3.2.4. Properties of the score variables
Some important properties of the score variables will be discussed in the following. The ideal case, when the similarity matrix $\Omega$ has a block diagonal form, will be discussed first, then the deviations caused by perturbations will also be addressed, as in the case of the eigenvectors. It was shown above that the $D^{-1} M_D\Omega$ matrix has $K-1$ eigenvectors with eigenvalue equal to 1 when the similarity graph consists of $K$ disconnected components, and that these eigenvectors are piecewise constant on the disconnected components.

Property 1. Every entry in the $k$-th score variable has the same sign as the corresponding entry in the $k$-th eigenvector ($k = 1, \dots, K-1$).

This can be seen from Eq. (19) by substituting $V = D^{-1}$ and taking into account that $\lambda^{(k)} > 0$, $\forall k \in \{1, \dots, K-1\}$, and that $D$ is a diagonal matrix with positive diagonal entries.

Property 2. The score variables $z^{(k)}$, $k = 1, \dots, K-1$, are $D^{-1}$-orthogonal to the constant vector $1_N$, i.e. the weighted mean is removed from the score variables.

According to Eq. (19), the weighted mean of the score variables can be written as
$$\frac{1_N^T D^{-1} z^{(k)}}{1_N^T D^{-1} 1_N} = \frac{\lambda^{(k)}}{1_N^T D^{-1} 1_N}\,\underbrace{1_N^T\beta^{(k)}}_{=\,0} = 0, \quad \forall k \in \{1, \dots, K-1\} \qquad (30)$$
by choosing the weight matrix as $V = D^{-1}$ and using the centered property of the $\beta$ eigenvectors, i.e. the third KKT condition from Eqs. (11).
Property 3. The score variables $z^{(k)}$, $k = 1, \dots, K-1$, are mutually $D^{-1}$-orthogonal.

According to Eq. (19)
$$z^{(i)T} D^{-1} z^{(j)} = \left[\lambda^{(i)} D\beta^{(i)}\right]^T D^{-1}\left[\lambda^{(j)} D\beta^{(j)}\right] = \lambda^{(i)}\lambda^{(j)}\beta^{(i)T}\underbrace{D D^{-1}}_{I_N} D\beta^{(j)} = \lambda^{(i)}\lambda^{(j)}\underbrace{\beta^{(i)T} D\beta^{(j)}}_{=\,0 \text{ if } i \neq j} = 0 \ \text{ if } i \neq j, \quad \forall i, j = 1, \dots, K-1 \qquad (31)$$
where the mutual $D$-orthogonality of the $\beta$ eigenvectors of the $D^{-1} M_D\Omega$ matrix was used.

Property 4. The $K-1$ dimensional score points that correspond to input data points from the same cluster are collinear in the $K-1$ dimensional score space spanned by the score variables $z^{(k)}$, $k = 1, \dots, K-1$.

Let $z_i^* \in \mathbb{R}^{1 \times (K-1)}$ denote the $i$-th row of the $N \times (K-1)$ dimensional score matrix. The $k$-th component of $z_i^*$ is given by Eq. (18). In the ideal case, the similarity matrix is block diagonal and the $K-1$ leading eigenvectors $\beta$ of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying $K$ clusters, i.e. $K(x_i, x_j) \neq 0 \iff i, j \in C_p$, $p \in \{1, \dots, K\}$, and $\beta_i^{(k)} = \beta_j^{(k)} = \xi_p^{(k)}$, $\forall i, j \in C_p$, $p \in \{1, \dots, K\}$, $k \in \{1, \dots, K-1\}$. Assuming that the $i$-th input data point $x_i$ belongs to the $p$-th cluster, i.e. $i \in C_p$, $p \in \{1, \dots, K\}$, the $k$-th component of the corresponding $z_i^*$ can be written as
$$z_i^{*(k)} = \sum_{j=1}^{N}\beta_j^{(k)} K(x_j, x_i) + b^{(k)} = \sum_{j \in C_p}\beta_j^{(k)} K(x_j, x_i) + \sum_{j \notin C_p}\beta_j^{(k)} K(x_j, x_i) + b^{(k)} = \xi_p^{(k)}\underbrace{\sum_{j \in C_p} K(x_j, x_i)}_{=\,d_i} + \sum_{j \notin C_p}\beta_j^{(k)}\underbrace{K(x_j, x_i)}_{=\,0\ \forall j \notin C_p,\, i \in C_p} + b^{(k)} = d_i\xi_p^{(k)} + b^{(k)}, \quad \forall i \in C_p,\ p \in \{1, \dots, K\},\ \forall k \in \{1, \dots, K-1\} \qquad (32)$$
and
$$z_i^* = d_i\left[\xi_p^{(1)}, \dots, \xi_p^{(K-1)}\right] + \left[b^{(1)}, \dots, b^{(K-1)}\right], \quad \forall i \in C_p,\ p \in \{1, \dots, K\} \qquad (33)$$
Therefore, the $K-1$ dimensional score vectors $z_i^*$ that correspond to input data from the same cluster point in the same direction in the $K-1$ dimensional score space in the ideal case. Furthermore, the $K$ directions that correspond to the $K$ different clusters lie in different orthants of the $K-1$ dimensional score space, due to the fact that the cluster membership is coded in the signs of the $\xi_p^{(k)}$, $k = 1, \dots, K-1$, constants that form the $\beta$ eigenvectors. Therefore, the $K-1$ dimensional points $z_i^*$ that correspond to input data points from the same cluster are collinear in the $K-1$ dimensional space spanned by the $K-1$ score vectors corresponding to the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, and the $K$ distinct directions lie in different orthants of this score space.
According to Section 3.2.3, the similarity matrix becomes only nearly block diagonal if the between cluster similarities are not necessarily zero but small compared to the within cluster similarities. As a consequence, the contribution of the summation in Eq. (32) over $j \notin C_p$ is not zero, but it is still much smaller than the corresponding contribution over $j \in C_p$. Furthermore, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix remain approximately piecewise constant on the clusters in the presence of such a small perturbation (and/or in case of a large eigengap). The $K$ clusters are represented as $K$ well separated clouds in the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, scattered around the corresponding points of the ideal case. The closer to the ideal case we are, the smaller the spread of the individual clouds is. Therefore, the $K-1$ dimensional points $z_i^*$ that correspond to input data points from the same cluster are still approximately collinear in the $K-1$ dimensional space spanned by the $K-1$ score vectors corresponding to the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. The closer to the ideal case we are, the smaller the scatter of the $K$ point clouds in the corresponding eigenspace is and the more collinear the score points are.

Note that the special features of the ideal case were exploited only in the case of the last property of the score variables.
3.3. Cluster membership encoding and decoding
In order to obtain a clustering model based on the weighted kernel PCA formulation, one needs to select a cluster encoding-decoding scheme that makes the cluster membership assignment possible. In the case of classical spectral clustering algorithms [1, 3], this step is usually done by applying K-means in the appropriate eigen-subspace, as was discussed at the end of Section 2. The weighted kernel PCA formulation and the properties of the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix and of the corresponding score variables, discussed in the previous sections, provide different possible encoding schemes.

One scheme, introduced in [11], is based on the fact that the cluster membership is coded in the signs of the corresponding components of the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, as was discussed in Section 3.2.2. Therefore, each cluster $C_k$, $k = 1, \dots, K$, can be represented by a unique codeword $c^{(k)} \in \{+1, -1\}^{(K-1)}$. The candidate codewords can be obtained by converting the rows of the matrix formed by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix into $K-1$ dimensional sign vectors. According to Property 1 of the score variables discussed in Section 3.2.4, the score variables can also be used instead of the eigenvectors. Based on Sections 3.2.2 and 3.2.3, there are exactly $K$ different codewords in the ideal case and there can be more than $K$ in strongly perturbed cases. Thus, a codebook $\mathcal{CB} = \{c^{(k)}\}_{k=1}^{K}$ is formed by taking the $K$ most frequent codewords. Then each point is clustered based on the corresponding $K-1$ dimensional score point and this codebook. The score point is converted into a sign vector and compared to the elements of the codebook by calculating the Hamming distances. The codeword yielding the smallest Hamming distance is selected and the point is assigned to the corresponding cluster [11].
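A minimal sketch of this sign-based encoding-decoding step (illustrative only; the helper names are ours): the codebook is the set of the $K$ most frequent sign patterns of the training score points, and each point is then assigned to the codeword at minimum Hamming distance.

```python
import numpy as np
from collections import Counter

def sign_codebook(Z_train, K):
    """Codebook from the K most frequent sign patterns of the rows of the
    N x (K-1) training score (or eigenvector) matrix Z_train."""
    signs = np.where(Z_train >= 0, 1, -1)
    counts = Counter(map(tuple, signs))
    return np.array([cw for cw, _ in counts.most_common(K)])

def assign_by_hamming(Z, codebook):
    """Assign each score point to the codeword with the smallest Hamming distance."""
    signs = np.where(Z >= 0, 1, -1)
    dists = (signs[:, None, :] != codebook[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# usage: cb = sign_codebook(Z, K); labels = assign_by_hamming(Z, cb)
```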
A different encoding-decoding scheme can also be obtained based on angular similarities. According to Eq. (18), the $K-1$ dimensional score point $z_i^*$ that corresponds to the $i$-th input data point $x_i$ can be written as the linear combination of the vectors $e_j^* = [\beta_j^{(1)}, \dots, \beta_j^{(K-1)}]$, $j = 1, \dots, N$, plus the bias vector $b = [b^{(1)}, \dots, b^{(K-1)}]$
$$z_i^* = \sum_{j=1}^{N} K(x_i, x_j)\, e_j^* + b \qquad (34)$$
where $\beta^{(k)}$, $k = 1, \dots, K-1$, are the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. In the ideal case, Eq. (34) transforms into Eq. (33) since $e_i^* = e_j^* = e_k^* = [\xi_k^{(1)}, \dots, \xi_k^{(K-1)}]$, $\forall i, j \in C_k$, $k \in \{1, \dots, K\}$, due to the piecewise constant property of the $\beta^{(k)}$, $k = 1, \dots, K-1$, eigenvectors. The $K$ clusters are represented by the $e_k^*$, $k = 1, \dots, K$, vectors as $K$ single points in different orthants of the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix, and
$$z_i^* - b = \sum_{j=1}^{N} K(x_i, x_j)\, e_j^* = \sum_{j \in C_k} K(x_i, x_j)\, e_j^* = d_i e_k^*, \quad \forall i \in C_k,\ k \in \{1, \dots, K\} \qquad (35)$$
by using Eq. (34) and the $K(x_i, x_j) \neq 0 \iff i, j \in C_p$, $p \in \{1, \dots, K\}$, property of the kernel matrix in the ideal case. Therefore, the $z^* - b$ vectors that correspond to data points from the same cluster point in the same direction, and this direction is identical to the direction of the $e_k^*$ vector that represents the given cluster in the $K-1$ dimensional eigen-subspace⁴.

In perturbed cases, when the between cluster similarities are small compared to the within cluster similarities, the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are approximately piecewise constant on the clusters. Each cluster is represented by the $e_j^*$, $j \in C_k$, $k \in \{1, \dots, K\}$, vectors as one of $K$ well localized point clouds in different orthants of the $K-1$ dimensional space spanned by the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix. The $z^* - b$ vectors that correspond to data points from the same cluster will point approximately in the same direction, and these directions will be close to the direction of the average of the $e_j^*$ vectors associated to the given cluster.

An encoding-decoding scheme based on these properties can be obtained in the following way. The $e_j^*$ vectors can easily be clustered either by using the sign coding of the cluster membership as above or simply by using K-means. The sign coding of the cluster membership can be used for initializing K-means, which will then terminate after a single iteration in nearly ideal cases. After the $e_j^*$ vectors are clustered based on their angular similarities, a codebook $\mathcal{CB} = \{c^{(k)}/\|c^{(k)}\|_2\}_{k=1}^{K}$ can be formed by calculating an average $c^{(k)}$ vector for each cluster as $c^{(k)} = \sum_{j \in C_k} e_j^*/|C_k|$, $k \in \{1, \dots, K\}$. Then each point can be clustered based on the corresponding $z^* - b$ vector derived from the score and this codebook. The $z^* - b$ vector is normalized and compared to the elements of the codebook by calculating either the projections onto the codewords or the distances from the codewords. The codeword yielding the maximum projection or the minimum distance is selected and the point is assigned to the corresponding cluster.

⁴ One could write $z^*$ instead of $z^* - b$ because $b = 0$ in the ideal case, as was shown in Section 3.2.1.
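The angular scheme can be sketched in the same hedged spirit (illustrative helper names; `init_labels` could come from the sign decoding above or from K-means): per-cluster averages of the rows $e_j^*$ of the eigenvector matrix give the normalized codewords, and points are assigned by maximum projection of the normalized $z^* - b$ vectors.

```python
import numpy as np

def angular_codebook(B, init_labels, K):
    """Average the rows e_j^* of the N x (K-1) eigenvector matrix B per cluster
    and normalize the averages to obtain the angular codebook."""
    cb = np.vstack([B[init_labels == k].mean(axis=0) for k in range(K)])
    return cb / np.linalg.norm(cb, axis=1, keepdims=True)

def assign_by_angle(Z, b, codebook):
    """Assign each score point to the codeword with the largest projection
    (cosine similarity) of its normalized z^* - b vector."""
    V = Z - b
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return (V @ codebook.T).argmax(axis=1)
```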
3.4. Out-of-sample extension
Basic spectral clustering algorithms, discussed in Section 2, provide a clustering only for the given data set, without a clear extension to out-of-sample points, i.e. to new points that are not part of the training data set. The most popular way of extending the clustering model to test points is based on the Nyström method [9]. An embedding of a test point can be achieved by using the Nyström approximation of the implicit eigenfunctions [7, 8]. The kernel spectral clustering algorithm formulated in the weighted kernel PCA framework provides a natural and simple way of out-of-sample extension without relying on approximation techniques such as the Nyström method [11]. The out-of-sample extension proposed in [11] is based on the codebook discussed in the previous section and on the score variables associated to the out-of-sample points $\mathcal{D}^t = \{x_i^t\}_{i=1}^{N_t}$, $x_i^t \in \mathbb{R}^d$. The latter correspond to the projections of the mapped out-of-sample points $\varphi(x^t)$ onto the $w^{(k)}$, $k = 1, \dots, K-1$, vectors obtained implicitly during the training stage. Since the problem is solved in the dual weight space, these projections can be calculated by means of the $\beta^{(k)}$, $k = 1, \dots, K-1$, leading eigenvectors obtained in the training stage as
$$z(x_i^t)^{(k)} = z_i^{t(k)} = \sum_{j=1}^{N} K(x_j, x_i^t)\,\beta_j^{(k)} + b^{(k)}, \quad k = 1, \dots, K-1 \qquad (36)$$
by using Eq. (18). The $K-1$ dimensional out-of-sample score points $z_i^{*t} \in \mathbb{R}^{(K-1)}$, $i = 1, \dots, N_t$, have the same collinearity property that was discussed in Section 3.2.4. Therefore, the cluster membership of the out-of-sample points can be determined based on the corresponding $z^{*t}$ vectors by comparing them to the codebook, derived in the training stage as described in the previous section, using the appropriate decoding scheme.
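In code, the out-of-sample step of Eq. (36) is a single kernel evaluation against the training points followed by the decoding described above; a small self-contained sketch assuming an RBF kernel (the function name and arguments are our own):

```python
import numpy as np

def ksc_out_of_sample_scores(X_test, X_train, B, b, gamma):
    """Eq. (36): score variables of unseen points computed from the training
    eigenvectors B (columns beta^(k)) and bias terms b; decoding these scores
    against the training codebook then gives the cluster memberships."""
    sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    K_test = np.exp(-sq / gamma)        # K(x_j, x_i^t) for the RBF kernel
    return K_test @ B + b               # N_t x (K-1) out-of-sample score points
```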
3.5. Model selection
The quality of the clustering model obtained in the training stage strongly depends on the choice of the similarity function and its parameters. The most popular similarity function is the radial basis function kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2/\gamma)$ with parameter $\gamma$. Beyond the value of the $\gamma$ kernel parameter, the optimality of the model also depends on the chosen number of clusters $K$. Different $(\gamma, K)$ values result in different models, and the closer the $K-1$ leading eigenvectors of the associated $D^{-1} M_D\Omega$ matrix are to being piecewise constant, the closer to optimal the model is. Therefore, introducing a measure of the piecewise constantness of these leading eigenvectors results in a quality measure of the different models with different $(\gamma, K)$ values that can serve as a model selection criterion. Choosing the $(\gamma, K)$ value pair that maximizes this quality measure makes it possible to select the optimal clustering model.

The balanced line fit (BLF) quality measure was proposed in [11]. As was discussed in Section 3.2.4, the $z^* \in \mathbb{R}^{(K-1)}$ score points that correspond to input data points from the same cluster are collinear in the score space if the associated $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying $K$ clusters. The line fit part of this criterion measures the collinearity of the $z_i^{v*}$, $i \in C_k$, $k \in \{1, \dots, K\}$, score points, obtained by using an independent validation set $\mathcal{D}^v = \{x_i\}_{i=1}^{N_v}$, by means of PCA. The balance of the obtained clusters is also incorporated into the BLF. More details on the BLF model selection criterion can be found in [11].
A model selection criterion different from the BLF can be obtained by using the cluster membership encoding-decoding scheme based on angular similarities discussed in the previous section. Using this encoding-decoding scheme, the cluster membership assignment is done based on the angular similarities between the normalized codewords in the codebook and the normalized $z_i^{v*}$ vectors. The angular similarity between a normalized $z_i^{v*}$ vector and the normalized codebook vectors can be measured either by calculating the projections or the distances between these $K-1$ dimensional points. The individual projections lie in the $[-1, +1]$ interval while the corresponding distances lie in the $[0, 2]$ interval. The codeword yielding the highest projection or the smallest distance is selected and the validation point associated to the $z_i^{v*}$ vector is assigned to the corresponding cluster. The more collinear the $z_i^{v*}$, $i \in C_k$, $k \in \{1, \dots, K\}$, points are, the higher (smaller) the sum of these maximum projections (minimum distances) is. Maximizing (minimizing) the average projections (distances) on the validation set $\mathcal{D}^v$ results in the selection of the $(\gamma, K)$ parameters that correspond to the optimal clustering model. One can further refine this quality measure in order to give more penalty to those validation points that lie closer to the decision boundary of the model. Let us assume that the assignment is done by selecting the codeword from the codebook that yields the minimum distance to the normalized $z_i^{v*}$ point. Let this minimum distance be denoted by $s_i^{(1)}$ and the second smallest distance by $s_i^{(2)}$. The corresponding validation point lies on a decision boundary of the model if $s_i^{(1)}/s_i^{(2)} = 1$, while the validation point is assigned to the cluster with certainty if $s_i^{(1)} = 0$. Therefore, one can select the optimal model by maximizing the quantity
$$Q_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \sum_{k=1}^{K}\frac{1}{K}\sum_{i \in C_k}\frac{1 - s_i^{(1)}/s_i^{(2)}}{|C_k|} \qquad (37)$$
on a validation set $\mathcal{D}^v$. The closer the quantity $Q_m \in [0, 1]$ is to 1, the better the corresponding clustering model is. One can place further emphasis on balancing the obtained clusters in the model selection by introducing a balance term, similarly as in the case of the BLF [11]. The balanced quality measure can be written as
$$BQ_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \eta\, Q_m(\mathcal{D}^v, \mathrm{model}(\gamma, K)) + (1 - \eta)\,\mathrm{balance}(\mathcal{D}^v, \mathrm{model}(\gamma, K)) \qquad (38)$$
where
$$\mathrm{balance}(\mathcal{D}^v, \mathrm{model}(\gamma, K)) = \frac{\min\{|C_1|, \dots, |C_K|\}}{\max\{|C_1|, \dots, |C_K|\}} \qquad (39)$$
and $\eta \in [0, 1]$ is the importance given to $Q_m$ with respect to the balance. It should be noted that this model selection criterion can only be used for $K > 2$.
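A sketch of evaluating this criterion on a validation set (illustrative only; it assumes the distance-based decoding against a normalized codebook described above, and the function name is ours):

```python
import numpy as np

def balanced_quality(Z_val, b, codebook, eta=0.8):
    """BQ_m of Eq. (38): Q_m (Eq. (37)) computed from the smallest and second
    smallest codeword distances of each normalized validation score point,
    blended with the balance term of Eq. (39)."""
    V = Z_val - b
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    dists = np.linalg.norm(V[:, None, :] - codebook[None, :, :], axis=2)
    s = np.sort(dists, axis=1)
    s1, s2 = s[:, 0], s[:, 1]
    labels = dists.argmin(axis=1)
    K = codebook.shape[0]
    sizes = np.array([(labels == k).sum() for k in range(K)])
    per_cluster = [np.mean(1.0 - s1[labels == k] / s2[labels == k])
                   for k in range(K) if sizes[k] > 0]
    q_m = np.sum(per_cluster) / K                    # Eq. (37)
    balance = sizes.min() / sizes.max()              # Eq. (39)
    return eta * q_m + (1.0 - eta) * balance         # Eq. (38)

# model selection: evaluate balanced_quality over a grid of (gamma, K) values
# and keep the maximizer (valid for K > 2, as noted above).
```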
4. Sparse algorithms
Non-sparse kernel spectral clustering algorithms need to store and to eigendecompose an $N \times N$ matrix, where $N$ is the size of the input data set. The memory usage and computational time of these algorithms grow quickly with increasing input data set size. Kernel spectral clustering models with out-of-sample extension capability can reduce these memory and computational time problems by training the model on a subset of the input data set and using the out-of-sample extension capability of the model to infer the cluster membership of the unseen data.
4.1. Reduced set method
When the kernel spectral clustering problem is formulated as weighted kernel PCA, the problem is solved in the dual weight space. The solution consists of the determination of the $\beta^{(k)}$, $k = 1, \dots, K-1$, leading eigenvectors of the $D^{-1} M_D\Omega$ matrix associated to the training set $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subset \mathcal{D}$. According to the first KKT condition in Eqs. (11), the primal variables $w^{(k)}$, $k = 1, \dots, K-1$, are expressed as linear combinations of the mapped input data, $w^{(k)} = \Phi^T\beta^{(k)}$, $k = 1, \dots, K-1$. Since the components of the $\beta^{(k)}$ eigenvectors are usually not zero, each training data point contributes to the primal variable $w^{(k)}$, resulting in a non-sparse model. The goal is to find a reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$, $R < N_{tr}$, of points and corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$, to approximate the $w^{(k)}$ primal variables as [18]
$$w^{(k)} \approx \tilde{w}^{(k)} = \sum_{r=1}^{R}\zeta_r^{(k)}\varphi(\tilde{x}_r) = \Psi^T\zeta^{(k)}, \quad k = 1, \dots, K-1 \qquad (40)$$
where $\Psi = [\varphi(\tilde{x}_1), \dots, \varphi(\tilde{x}_R)]^T \in \mathbb{R}^{R \times n_h}$. An arbitrary data point $x \in \mathbb{R}^d$ can then be clustered based on the corresponding approximated score variables $\tilde{z}(x)^{(k)}$, $k = 1, \dots, K-1$, which can be calculated as [18]
$$z(x)^{(k)} \approx \tilde{z}(x)^{(k)} = \sum_{r=1}^{R}\zeta_r^{(k)} K(x, \tilde{x}_r) + \tilde{b}^{(k)} \qquad (41)$$
$k = 1, \dots, K-1$, where the $\tilde{b}^{(k)}$ are the approximated bias terms determined on the training set. The calculation of these approximated bias terms will be described in Sections 4.2.1 and 4.2.4.
Introducing sparsity into the model as described above corresponds to the so-called reduced set method [15, 16], which usually consists of three steps. First, the size of the eigenvalue problem is reduced from the full size of the input data set $N$ to $N_{tr} < N$ by selecting a subset for training, $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subset \mathcal{D} = \{x_i\}_{i=1}^{N}$. Once the required eigenvectors are derived based on the training set, a sparse representation of the model is constructed by determining the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ and the corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$. Since usually $R \ll N_{tr} < N$, a sparse model representation can be obtained and used for calculating the approximated score variables associated to arbitrary input data points by using Eq. (41).

The training data set is usually selected by random sampling in the case of Nyström approximation based kernel spectral clustering algorithms [8, 24]. Then the eigenvectors, derived based on the training set, are used to approximate the underlying eigenfunctions and to carry out the out-of-sample extension. Random sampling of the training set can also be used in the case of reduced set methods. The sparsity of the final model will be higher in this case compared to the Nyström method based clustering models, due to the application of the reduced set method. In spite of the higher level of sparsity, these clustering models outperform the Nyström approximation based models trained on the same training set [11, 18]. A Rényi entropy maximization based training set selection and a sparse kernel spectral clustering model based on the weighted kernel PCA formulation were proposed in [18]. The training set is constructed in this case by selecting points from the full input data set that gradually increase the quadratic Rényi entropy of the selected subset [25, 14]. The required eigenvectors are determined based on the selected training set and L1 penalization [16, 26] is used to obtain the reduced set [18]. Then the corresponding reduced set coefficients are derived through the solution of a minimization problem [16, 11]. Another sparse version of the weighted kernel PCA formulated kernel spectral clustering was proposed in [17], based on the incomplete Cholesky decomposition (ICD) and the reduced set method. This algorithm will be discussed in the following.
4.2. ICD based sparse algorithm
The incomplete Cholesky decomposition (ICD) and reduced set method based sparse version of the weighted kernel PCA formulated kernel spectral clustering algorithm will be discussed in this section. The original algorithm, proposed in [17], will be reviewed in Sections 4.2.1 and 4.2.2. The computational bottlenecks of this algorithm will be shown and a modified, much faster algorithm will be proposed in Section 4.2.4 that eliminates the computationally expensive parts of the original algorithm.
4.2.1. Approximated eigenvectors
The first step is to determine the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix associated to the randomly sampled training set $\mathcal{D}^{tr} = \{x_i^{tr}\}_{i=1}^{N_{tr}} \subseteq \mathcal{D} = \{x_i\}_{i=1}^{N}$. The incomplete Cholesky decomposition (ICD) of the $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ matrix is used to reduce the size of the eigenproblem [27, 28, 17].

Any symmetric positive definite matrix $A \in \mathbb{R}^{M \times M}$ can be decomposed as $A = LL^T$, where $L \in \mathbb{R}^{M \times M}$ is a lower triangular matrix. If the spectrum of $A$ decays rapidly, it has a small numerical rank [29] and $A$ can be well approximated by $GG^T$, where $G \in \mathbb{R}^{M \times R}$, $R \ll M$ [30]. This is the incomplete Cholesky factorization of $A$ and it yields a decomposition in which the column space of $GG^T$ is spanned by a subset of the columns of $A$ [31]. The incomplete Cholesky decomposition with symmetric pivoting greedily selects the columns of $A$ and calculates the columns of $G$ such that the approximation of $A$ obtained by adding that column in the current step is the best [31]. The algorithm terminates when $\|PAP^T - GG^T\|_1 = \mathrm{Tr}(PAP^T - GG^T) \leq tol$, where $P$ is the permutation matrix associated to the symmetric permutations done during the ICD and $\|\cdot\|_1$ is the trace norm [28, 31].
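For completeness, a compact and merely illustrative implementation of this pivoted incomplete Cholesky factorization with the trace-based stopping rule; the function name and the round-off guard are our own choices.

```python
import numpy as np

def incomplete_cholesky(Omega, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky factorization Omega ~= G G^T.
    Greedily picks the pivot with the largest remaining diagonal entry and
    stops when the trace of the residual drops below tol.
    Returns G (N x R) and the list of selected pivot indices."""
    N = Omega.shape[0]
    max_rank = N if max_rank is None else max_rank
    d = np.diag(Omega).astype(float).copy()      # residual diagonal
    G = np.zeros((N, max_rank))
    pivots = []
    for j in range(max_rank):
        if d.sum() <= tol:                       # trace-norm stopping rule
            break
        i = int(np.argmax(d))
        pivots.append(i)
        G[:, j] = (Omega[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
        d[d < 0] = 0.0                           # guard against round-off
    return G[:, :len(pivots)], pivots
```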
Let us use the incomplete Cholesky decomposition $\tilde{\Omega} = GG^T$, $G \in \mathbb{R}^{N_{tr} \times R}$, $R \ll N_{tr}$, of the $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ matrix as a rank $R$ approximation in the form of
$$\Omega \approx \tilde{\Omega} = GG^T \qquad (42)$$
This can be written as
$$\Omega \approx U\Lambda^2 U^T \qquad (43)$$
by using the singular value decomposition of $G$, i.e. $G = U\Lambda V^T$, where $\Lambda \in \mathbb{R}^{R \times R}$ is the diagonal matrix of singular values and $U \in \mathbb{R}^{N_{tr} \times R}$ and $V \in \mathbb{R}^{R \times R}$ are the matrices of left and right singular vectors, respectively. Using this approximation in the $D^{-1} M_D\Omega$ matrix, the kernel spectral clustering eigenvalue problem can be written as [17]
$$\tilde{D}^{-1} M_{\tilde{D}}\, U\Lambda^2 U^T\tilde{\beta}^{(k)} = \tilde{\lambda}^{(k)}\tilde{\beta}^{(k)}, \quad k = 1, \dots, K-1 \qquad (44)$$
where the $\tilde{\cdot}$ denotes the approximation of the corresponding quantity based on Eq. (42). Premultiplying both sides by $U^T$ and substituting $\gamma^{(k)} = U^T\tilde{\beta}^{(k)} \in \mathbb{R}^R$ results in [17]
$$U^T\tilde{D}^{-1} M_{\tilde{D}}\, U\Lambda^2\gamma^{(k)} = \tilde{\lambda}^{(k)}\gamma^{(k)}, \quad k = 1, \dots, K-1 \qquad (45)$$
The elements of the diagonal matrix $\tilde{D}$ are equal to $[\tilde{D}]_{ii} = \sum_{j=1}^{N_{tr}}[\tilde{\Omega}]_{ij}$, which can be calculated quickly as $G(G^T 1_{N_{tr}})$. Note that the size of this eigenvalue problem is $R \times R$, which can be much smaller than the $N_{tr} \times N_{tr}$ size of the original problem when the spectrum of $\Omega \in \mathbb{R}^{N_{tr} \times N_{tr}}$ decays quickly. The approximated eigenvectors $\tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$, of the original problem can be obtained from the corresponding $\gamma^{(k)}$, $k = 1, \dots, K-1$, eigenvectors as $\tilde{\beta}^{(k)} = U\gamma^{(k)}$, $k = 1, \dots, K-1$ [17].

The approximated bias terms $\tilde{b}^{(k)}$, $k = 1, \dots, K-1$, that appear in Eq. (41) can be calculated quickly by using Eq. (13), substituting $V = D^{-1} \approx \tilde{D}^{-1}$, $\Phi\Phi^T = \Omega \approx GG^T$ and $\beta^{(k)} = \tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$.
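Under the same hedged assumptions as the earlier sketches, Eqs. (42)-(45) translate into a few lines of linear algebra: the SVD of $G$ yields $U$ and $\Lambda$, the approximated degrees are obtained as $G(G^T 1_{N_{tr}})$, the small $R \times R$ eigenproblem is solved and the result is mapped back through $\tilde{\beta} = U\gamma$.

```python
import numpy as np

def icd_ksc_eigenvectors(G, K):
    """Approximated KSC eigenvectors from the ICD factor G (Eqs. (42)-(45)):
    solve the R x R eigenproblem U^T D~^{-1} M_D~ U Lambda^2 gamma = lambda gamma
    and map the K-1 leading solutions back as beta~ = U gamma."""
    Ntr = G.shape[0]
    U, s, _ = np.linalg.svd(G, full_matrices=False)             # G = U Lambda V^T
    d_inv = 1.0 / (G @ (G.T @ np.ones(Ntr)))                    # [D~]_ii = sum_j [Omega~]_ij
    MD_U = U - np.outer(np.ones(Ntr), d_inv @ U) / d_inv.sum()  # M_D~ U without forming M_D~
    A_small = (U.T @ (d_inv[:, None] * MD_U)) * (s ** 2)[None, :]   # Eq. (45), R x R
    evals, evecs = np.linalg.eig(A_small)
    lead = np.argsort(-evals.real)[: K - 1]
    return U @ evecs[:, lead].real                              # beta~ = U gamma
```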
4.2.2. Reduced set and reduced set coefficients
In order to obtain a sparse model representation, the next step is the determination of the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ and the corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$.

According to Eq. (40), these are then used to approximate the $w^{(k)}$, $k = 1, \dots, K-1$, vectors as linear combinations of the mapped reduced set points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$, thus these $\varphi(\tilde{x}_r)$ vectors should be linearly independent. The ICD algorithm with pivoting gradually selects the columns of $\Omega$ and builds $G \in \mathbb{R}^{N_{tr} \times R}$ such that the approximation $\Omega \approx \tilde{\Omega} = GG^T$ obtained by adding that column in the current step is the best [31]. The column space of $\tilde{\Omega}$ is spanned by the selected columns of $\Omega$, thus the corresponding mapped points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$ are linearly independent. Therefore, the pivots selected by the ICD automatically provide the reduced set $\mathcal{R} = \{\tilde{x}_r\}_{r=1}^{R} \subset \mathcal{D}^{tr}$ [17]. The linear independence of the mapped reduced set points $\{\varphi(\tilde{x}_r)\}_{r=1}^{R}$ also implies that $\Omega_{\Psi\Psi} = \Psi\Psi^T \in \mathbb{R}^{R \times R}$ is full rank, where $\Psi = [\varphi(\tilde{x}_1), \dots, \varphi(\tilde{x}_R)]^T \in \mathbb{R}^{R \times n_h}$.

The corresponding reduced set coefficients $\zeta^{(k)} \in \mathbb{R}^R$, $k = 1, \dots, K-1$, can be determined by minimizing the squared distance between the corresponding $w^{(k)} = \Phi^T\tilde{\beta}^{(k)}$, $k = 1, \dots, K-1$, and $\tilde{w}^{(k)} = \Psi^T\zeta^{(k)}$, $k = 1, \dots, K-1$, vectors, where $\Phi = [\varphi(x_1), \dots, \varphi(x_{N_{tr}})]^T \in \mathbb{R}^{N_{tr} \times n_h}$ [16, 17, 18]. This can be written as
$$\min_{\zeta^{(k)}}\|w^{(k)} - \tilde{w}^{(k)}\|_2^2 = \min_{\zeta^{(k)}}\left\{w^{(k)T} w^{(k)} - 2\tilde{w}^{(k)T} w^{(k)} + \tilde{w}^{(k)T}\tilde{w}^{(k)}\right\} = \min_{\zeta^{(k)}}\left\{\tilde{\beta}^{(k)T}\Omega_{\Phi\Phi}\tilde{\beta}^{(k)} - 2\zeta^{(k)T}\Omega_{\Psi\Phi}\tilde{\beta}^{(k)} + \zeta^{(k)T}\Omega_{\Psi\Psi}\zeta^{(k)}\right\} \qquad (46)$$
$k = 1, \dots, K-1$, where $\Omega_{\Phi\Phi} = \Phi\Phi^T \in \mathbb{R}^{N_{tr} \times N_{tr}}$, $\Omega_{\Psi\Phi} = \Psi\Phi^T \in \mathbb{R}^{R \times N_{tr}}$ and $\Omega_{\Psi\Psi} = \Psi\Psi^T \in \mathbb{R}^{R \times R}$ are the corresponding kernel matrices. Taking the partial derivative of this objective with respect to $\zeta^{(k)}$ and setting it to zero, the first order optimality condition $\partial\|w^{(k)} - \tilde{w}^{(k)}\|_2^2/\partial\zeta^{(k)} = 0$ results in the following linear system
$$\Omega_{\Psi\Psi}\zeta^{(k)} = \Omega_{\Psi\Phi}\tilde{\beta}^{(k)} \qquad (47)$$
for the $\zeta^{(k)}$, $k = 1, \dots, K-1$, reduced set coefficients [16, 17, 18].
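Solving Eq. (47) is a single well-conditioned $R \times R$ linear solve per eigenvector, and all $K-1$ systems share the same coefficient matrix; a short sketch follows (indexing the kernel sub-blocks with the ICD pivots is our own assumption on how the reduced set is represented).

```python
import numpy as np

def reduced_set_coefficients(Omega_psi_psi, Omega_psi_phi, B_tilde):
    """Solve Eq. (47), Omega_PsiPsi zeta^(k) = Omega_PsiPhi beta~^(k), for all
    K-1 coefficient vectors at once (returned as the columns of the result)."""
    return np.linalg.solve(Omega_psi_psi, Omega_psi_phi @ B_tilde)

# with the pivots returned by the incomplete Cholesky sketch above:
#   Omega_psi_phi = Omega[pivots, :]                 # R x N_tr kernel block
#   Omega_psi_psi = Omega[np.ix_(pivots, pivots)]    # R x R kernel block
#   Zeta = reduced_set_coefficients(Omega_psi_psi, Omega_psi_phi, B_tilde)
```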
4.2.3. Clustering in the sparse model
Having both the reduced set and the corresponding reduced set coefficients determined, one can calculate the score variables for arbitrary data points by means of Eq. (41). The clustering of these points can be done based on the obtained score variables and the codebook as discussed in Section 3.3 when the codebook is formed based on the sign encoding-decoding scheme. However, the angular similarity based encoding-decoding scheme is slightly different in the case of the sparse model.

The angular similarity based encoding-decoding scheme was introduced in Section 3.3 based on the facts that the $K-1$ leading eigenvectors of the $D^{-1} M_D\Omega$ matrix are piecewise constant on the underlying clusters in the ideal case and that the $z^* - b \in \mathbb{R}^{1 \times (K-1)}$ points are linear combinations of the $K$, $K-1$ dimensional vectors formed from these constants (see Eq. (34)). According to Eq. (41), the approximated $\tilde{z}_i^* - \tilde{b}$ point, associated to an arbitrary input data point $x_i \in \mathbb{R}^d$, is expressed as the linear combination of the reduced set coefficient points $\tau_r^* = [\zeta_r^{(1)}, \dots, \zeta_r^{(K-1)}]$, $r = 1, \dots, R$,
$$\tilde{z}_i^* - \tilde{b} = \sum_{r=1}^{R} K(x_i, \tilde{x}_r)\,\tau_r^* \qquad (48)$$
when the reduced set method based sparse model is used. Although the $\tau_r^*$ vectors associated to reduced set points $\tilde{x}_r$ from the same cluster, $r \in C_k$, $k \in \{1, \dots, K\}$, are not piecewise constant, one can show that in the ideal case they point in the same direction, i.e. they are collinear. In order to do so, let us assume that the rows and columns of $\Omega_{\Psi\Psi}$ and $\Omega_{\Psi\Phi}$ and the rows of $\tilde{B} = [\tilde{\beta}^{(1)}, \dots, \tilde{\beta}^{(K-1)}] \in \mathbb{R}^{N_{tr} \times (K-1)}$ are ordered according to the perfect partition of the training data set. $\Omega_{\Psi\Psi}$ is a block diagonal and $\Omega_{\Psi\Phi}$ is a block matrix in the ideal case. Furthermore, $\Omega_{\Psi\Psi}$ has full rank, thus it is invertible, and its inverse $\Omega_{\Psi\Psi}^{-1}$ matrix is also block diagonal, partitioned identically to $\Omega_{\Psi\Psi}$. Therefore, the $\Theta = \Omega_{\Psi\Psi}^{-1}\Omega_{\Psi\Phi} \in \mathbb{R}^{R \times N_{tr}}$ matrix is also a block matrix on the clusters, which means that $[\Theta]_{rj} \neq 0 \iff r, j \in C_k$, $k \in \{1, \dots, K\}$. Then the reduced set coefficient matrix $X = [\zeta^{(1)}, \dots, \zeta^{(K-1)}] \in \mathbb{R}^{R \times (K-1)}$ can be written as $X = \Theta\tilde{B}$ by using Eq. (47). Let us assume that the $r$-th reduced set point $\tilde{x}_r \in \mathcal{R}$ belongs to the $k$-th cluster. The $rp$-th element of $X$, which corresponds to the $p$-th component of the $\tau_r^*$ vector,