Fixed-Size Kernel Logistic Regression for Phoneme Classification
Peter Karsmakers 1,2, Kristiaan Pelckmans 2, Johan Suykens 2, Hugo Van hamme 3
1 IIBT, K.H. Kempen (Associatie KULeuven), B-2440 Geel, Belgium
2 ESAT-SCD/SISTA, K.U.Leuven, B-3001 Heverlee, Belgium
3 ESAT-PSI/SPEECH, K.U.Leuven, B-3001 Heverlee, Belgium
[peter.karsmakers,kristiaan.pelckmans,johan.suykens,hugo.vanhamme]@esat.kuleuven.be
Abstract
Kernel logistic regression (KLR) is a popular non-linear classification technique. Unlike an empirical risk minimization approach such as employed by Support Vector Machines (SVMs), KLR yields probabilistic outcomes based on a maximum likelihood argument, which are particularly important in speech recognition. Different from other KLR implementations, we use a Nyström approximation to solve large scale problems with estimation in the primal space, as done in fixed-size Least Squares Support Vector Machines (LS-SVMs). In the speech experiments it is investigated how a natural KLR extension to multi-class classification compares to binary KLR models coupled via a one-versus-one coding scheme. Moreover, a comparison to SVMs is made.
Index Terms: phoneme classification, kernel logistic regression, large-scale, multi-class
1. Introduction
To tackle the task of phoneme classification we choose a Logistic Regression (LR) and Kernel Logistic Regression (KLR) approach. Hidden Markov models (HMMs) [16] are the state-of-the-art technique for current automatic speech recognition (ASR) systems. It is widely recognized that estimating the HMM parameters via a maximum likelihood criterion does not directly optimize the classification performance of the models. It is therefore of interest to develop alternative methods which infer the parameters by discriminative measures of performance. Several techniques have been presented for the task of phoneme recognition, such as Linear Discriminant Analysis (LDA) (e.g., [1]), Multi-Layer Perceptrons (MLPs) (e.g., [17]), Hidden Conditional Random Fields (HCRFs) (e.g., [13]), Support Vector Machines (SVMs) (e.g., [18]), and KLR (e.g., [15]).
Although SVMs have shown promising results for phoneme recognition, the motivation for choosing an LR or KLR approach over an empirical risk minimization approach such as the SVM is that the former yields probabilistic outcomes based on a maximum likelihood argument instead of a binary decision. KLR has the additional advantage that its extension to the multi-class case is well described, which must be contrasted with the commonly used coding approach (see, e.g., [6], [5]). Obtaining phoneme probabilities offers ample perspective for integration of this work in an ASR system.
Unlike SVMs, KLR is by its nature not sparse and needs all training samples in its final model. Different adaptations of the original algorithm have been made to obtain sparseness, such as in [6]. In this paper we employ a different practical technique, suited for large data sets, based on fixed-size Least Squares Support Vector Machines (LS-SVMs) [5], which we can use because KLR is related to a weighted version of LS-SVMs [12].
Our experiments are performed on the TIMIT data set, where we compare two different multi-class KLR implementations against binary SVM classifiers combined via a one-versus-one coding scheme.
This paper is organized as follows. In Section 2 we give an introduction to logistic regression. Section 3 describes the extension to kernel logistic regression. A fixed-size implementation is given in Section 4. Section 5 describes the extension to multi-class KLR and Section 6 reports numerical results on the TIMIT speech data set. Finally, we conclude in Section 7.
2. Logistic regression
After introducing some notation, we recall the principles of logistic regression. Suppose we have a binary classification problem with a training set $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \{-1, 1\}$ with N samples, where the input samples $x_i$ are i.i.d. from an unknown probability distribution over the random vectors (X, Y). We define the first element of $x_i$ to be 1, so that we can incorporate the intercept term in the parameter vector w. The goal is to find a classification rule from the training data, such that when given a new input $x_*$ we can assign a class label to it. In logistic regression the conditional class probabilities are estimated via the logit stochastic models

$$ \Pr(Y = -1 \mid X = x; w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}, \qquad \Pr(Y = 1 \mid X = x; w) = \frac{1}{1 + \exp(w^T x)}. \qquad (1) $$
The class membership of a new point $x_*$ can be given by the classification rule

$$ \arg\max_{c \in \{-1, 1\}} \Pr(Y = c \mid X = x_*; w). \qquad (2) $$

The common method to infer the parameters of the different models is via the use of the penalized negative log likelihood (PNLL)
$$ \min_{w}\; \ell(w) = -\ln \prod_{i=1}^{N} \Pr(Y = y_i \mid X = x_i; w) + \frac{\nu}{2} w^T w, \qquad (3) $$

where the regularization parameter ν must be set such that the parameters in w stay small in order to obtain a good bias-variance trade-off and avoid overfitting.
We derive the objective function for LR by combining (1) with (3), which gives
$$ \ell_{LR}(w) = -\sum_{i \in D_1} \ln \frac{\exp(w^T x_i)}{1 + \exp(w^T x_i)} - \sum_{i \in D_2} \ln \frac{1}{1 + \exp(w^T x_i)} + \frac{\nu}{2} w^T w, \qquad (4) $$

where $D = \{(x_i, y_i)\}_{i=1}^{N}$, $D = D_1 \cup D_2$, $D_1 \cap D_2 = \emptyset$, and $D_1$ and $D_2$ collect the samples with $y_i = -1$ and $y_i = 1$, respectively. In the sequel we use the shorthand notation

$$ p_{c,i} = \Pr(Y = c \mid X = x_i; w). \qquad (5) $$

This PNLL criterion for LR is known to possess a number of useful properties, such as the fact that it is convex in the parameters w, smooth, and has asymptotic optimality properties.
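For illustration, the following is a minimal numerical sketch (not the authors' implementation) of evaluating the objective (4) under the model (1); the function name, the variable name `nu` for ν, and the toy data are assumptions introduced here.

```python
import numpy as np

def lr_objective(w, X, y, nu):
    """Penalized negative log-likelihood (4) for labels y in {-1, +1},
    using model (1): Pr(Y=-1|x) = exp(w'x)/(1+exp(w'x)), Pr(Y=+1|x) = 1/(1+exp(w'x))."""
    z = X @ w                                # linear scores w^T x_i
    log_p_neg = z - np.logaddexp(0.0, z)     # ln[ exp(z) / (1 + exp(z)) ], numerically safe
    log_p_pos = -np.logaddexp(0.0, z)        # ln[ 1 / (1 + exp(z)) ]
    log_lik = np.where(y == -1, log_p_neg, log_p_pos).sum()
    return -log_lik + 0.5 * nu * (w @ w)     # PNLL with ridge penalty

# tiny usage example; the first column of X is set to 1 to absorb the intercept
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = np.where(rng.normal(size=20) > 0, 1, -1)
print(lr_objective(np.zeros(3), X, y, nu=1.0))
```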
Until now we have defined a model and an objective function which has to be optimized to fit the parameters on the observed data. Most often this optimization is performed by a Newton-based strategy where the solution is found by iterating

$$ w^{(k)} = w^{(k-1)} + s^{(k)} \qquad (6) $$

over k until convergence. The minimization in this case is equivalent to an iteratively regularized re-weighted least squares (IRRLS) problem (e.g., [6]) which can be written as
$$ \min_{s^{(k)}}\; \frac{1}{2} \left\| X s^{(k)} - z^{(k)} \right\|^2_{W^{(k)}} + \frac{\nu}{2} \left( s^{(k)} + w^{(k-1)} \right)^T \left( s^{(k)} + w^{(k-1)} \right), \qquad (7) $$

where

$$ z^{(k)} = (W^{(k)})^{-1} q, \qquad (8) $$

and where we define $X = [x_1; \ldots; x_N]$, $g_i = p_{1,i}(1 - p_{1,i})$, $W = \mathrm{diag}([g_1; \ldots; g_N])$, $q_i = (p_{y_i,i} - 1)\, y_i$ and $q = [q_1; \ldots; q_N]$.
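A minimal sketch of this IRRLS/Newton iteration (6)-(8) for the linear model might look as follows; the zero initialization, the stopping rule, and the closed-form normal-equation solve of (7) are assumptions not prescribed by the text.

```python
import numpy as np

def fit_lr_irrls(X, y, nu, max_iter=50, tol=1e-8):
    """Newton / IRRLS iterations (6)-(8) for model (1).

    X : (N, d) design matrix (first column equal to 1 for the intercept)
    y : (N,) labels in {-1, +1};  nu : regularization parameter
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        z_lin = X @ w
        p_pos = 1.0 / (1.0 + np.exp(z_lin))          # Pr(Y=+1 | x_i; w) under model (1)
        p_yi = np.where(y == 1, p_pos, 1.0 - p_pos)  # p_{y_i, i}
        g = p_pos * (1.0 - p_pos)                    # g_i = p_{1,i}(1 - p_{1,i})
        q = (p_yi - 1.0) * y                         # q_i = (p_{y_i,i} - 1) y_i
        # The minimizer of (7) satisfies (X^T W X + nu I) s = X^T q - nu w^(k-1)
        H = X.T @ (X * g[:, None]) + nu * np.eye(d)
        s = np.linalg.solve(H, X.T @ q - nu * w)
        w = w + s                                    # update (6)
        if np.linalg.norm(s) < tol:
            break
    return w
```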
3. Kernel logistic regression
In this section we define the minimization problem for the kernel version of logistic regression. This result is based on an optimization argument as opposed to the use of an appropriate Representer Theorem [7]. The LR model as defined in (1) can be given a nonlinear extension towards kernel machines, where the inputs x are mapped to a high-dimensional space. Define $\Phi \in \mathbb{R}^{N \times d_\varphi}$ as X with $x_i$ replaced by $\varphi(x_i)$, where $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^{d_\varphi}$ denotes the feature map induced by a positive definite kernel. With the application of Mercer's theorem to the kernel matrix Ω, $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$, $i, j = 1, \ldots, N$, it is not required to compute the nonlinear mapping ϕ(·) explicitly, as this is done implicitly through the use of positive definite kernel functions K. For K there are usually the following choices:
$K(x_i, x_j) = x_i^T x_j$ (linear kernel); $K(x_i, x_j) = (x_i^T x_j + h)^b$ (polynomial of degree b, with h ≥ 0 a tuning parameter); $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$ (radial basis function, RBF), where σ is a tuning parameter.
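As a small illustration, these three kernel functions might be written as follows; the default parameter values are placeholders introduced here, not values used in the paper.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                           # x_i^T x_j

def polynomial_kernel(xi, xj, h=1.0, b=3):
    return (xi @ xj + h) ** b                                # (x_i^T x_j + h)^b, h >= 0

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)      # exp(-||x_i - x_j||_2^2 / sigma^2)
```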
In KLR the models are defined as

$$ \Pr(Y = -1 \mid X = x; w) = \frac{\exp(w^T \varphi(x))}{1 + \exp(w^T \varphi(x))}, \qquad \Pr(Y = 1 \mid X = x; w) = \frac{1}{1 + \exp(w^T \varphi(x))}. \qquad (9) $$
Starting from (7) we include the feature map and introduce the error variable e, which results in
$$ \min_{s^{(k)}, e^{(k)}}\; \frac{1}{2}\, e^{(k)T} W^{(k)} e^{(k)} + \frac{\nu}{2} \left( s^{(k)} + w^{(k-1)} \right)^T \left( s^{(k)} + w^{(k-1)} \right) \quad \text{such that} \quad z^{(k)} = \Phi s^{(k)} + e^{(k)}, \qquad (10) $$

which in the context of LS-SVMs is called the primal problem.
In its dual formulation the solution to this optimization problem can be found by iteratively solving the linear system

$$ \left( \frac{1}{\nu}\, \Omega + (W^{(k)})^{-1} \right) \alpha^{(k)} = z^{(k)} + \Omega\, \alpha^{(k-1)}, \qquad (11) $$

where $z^{(k)}$ is defined as in (8). The probabilities of a new point $x_*$ can be predicted using (9) with $w^T \varphi(x_*) = \frac{1}{\nu} \sum_{i=1}^{N} \alpha_i K(x_i, x_*)$. The proof can be found in [12].
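For illustration, a minimal sketch of the resulting prediction step: given dual variables α obtained from (11), the class probabilities of a new point follow from (9). The RBF kernel choice and all names are assumptions introduced here.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

def klr_predict_proba(alpha, X_train, x_new, nu, sigma=1.0):
    """Probabilities (9) for a new point, with w^T varphi(x_*) = (1/nu) sum_i alpha_i K(x_i, x_*)."""
    k = np.array([rbf_kernel(xi, x_new, sigma) for xi in X_train])
    f = (alpha @ k) / nu                      # w^T varphi(x_*)
    p_neg = np.exp(f) / (1.0 + np.exp(f))     # Pr(Y = -1 | x_*)
    p_pos = 1.0 / (1.0 + np.exp(f))           # Pr(Y = +1 | x_*)
    return p_neg, p_pos
```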
4. Kernel logistic regression: a fixed-size implementation
4.1. Nyström approximation
In the previous section we stated a primal and a dual formulation of the optimization problem. Suppose one takes a finite-dimensional feature map (e.g., a linear kernel); then one can equally well solve the primal as the dual problem. In fact, solving the primal problem is more advantageous for larger data sets, because the dimension of the unknowns $w \in \mathbb{R}^d$ is small compared to that of $\alpha \in \mathbb{R}^N$. In order to work in the primal space using a kernel function other than the linear one, it is required to compute an explicit approximation of the nonlinear mapping ϕ. This leads to a sparse representation of the model when estimating in the primal space.
Explicit expressions for ϕ can be obtained by means of an eigenvalue decomposition of the kernel matrix Ω with entries $K(x_i, x_j)$. Given the integral equation $\int K(x, x_j)\, \phi_i(x)\, p(x)\, dx = \lambda_i \phi_i(x_j)$, with solutions $\lambda_i$ and $\phi_i$ for a variable x with probability density p(x), we can write

$$ \varphi = \left[ \sqrt{\lambda_1}\, \phi_1,\; \sqrt{\lambda_2}\, \phi_2,\; \ldots,\; \sqrt{\lambda_{d_\varphi}}\, \phi_{d_\varphi} \right]. \qquad (12) $$

Given the data set, it is possible to approximate the integral by a sample average. This will lead to the eigenvalue problem (Nyström approximation [8])

$$ \frac{1}{N} \sum_{l=1}^{N} K(x_l, x_j)\, u_i(x_l) = \lambda_i^{(s)} u_i(x_j), \qquad (13) $$
where the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ from the continuous problem can be approximated by the sample eigenvalues $\lambda_i^{(s)}$ and the eigenvectors $u_i \in \mathbb{R}^N$ as

$$ \hat{\lambda}_i = \frac{1}{N}\, \lambda_i^{(s)}, \qquad \hat{\phi}_i = \sqrt{N}\, u_i. \qquad (14) $$
Based on this approximation, it is possible to compute the eigendecomposition of the kernel matrix Ω and use its eigenvalues and eigenvectors to compute the i-th required component of $\hat{\varphi}(x)$, simply by applying (12) if x is a training point, or for any new point $x_*$ by means of

$$ \hat{\varphi}_i(x_*) = \frac{1}{\sqrt{\lambda_i^{(s)}}} \sum_{j=1}^{N} u_{ji}\, K(x_j, x_*). \qquad (15) $$
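A minimal sketch of this feature construction with an RBF kernel might look as follows; the kernel choice, the eigenvalue cutoff `eps`, the function names, and the normalization convention (the eigenvalues of Ω are used directly, so that the features reproduce Ω on the selected points) are assumptions introduced here. In the fixed-size setting described next, a subsample of size M plays the role of the training set in (13)-(15).

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """K(a_i, b_j) = exp(-||a_i - b_j||_2^2 / sigma^2) for all pairs of rows."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def nystrom_features(X_sub, X_new, sigma=1.0, eps=1e-12):
    """Approximate feature map: i-th component of hat{varphi}(x_*), cf. (15)."""
    Omega = rbf_kernel_matrix(X_sub, X_sub, sigma)     # kernel matrix on the (sub)sample
    lam, U = np.linalg.eigh(Omega)                     # eigendecomposition, cf. (13)
    keep = lam > eps                                   # drop numerically zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    K_new = rbf_kernel_matrix(X_new, X_sub, sigma)     # K(x_*, x_j) for the new points
    # i-th feature of x_*:  (1 / sqrt(lam_i)) * sum_j u_ji K(x_j, x_*)
    return (K_new @ U) / np.sqrt(lam)
```

On the (sub)sample itself these explicit features reproduce the kernel matrix exactly, so estimating w in the primal then reduces to a weighted, regularized least squares problem on them.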
4.2. Sparseness and large scale problems
So far, using the entire training sample of size N to compute the approximation of ϕ yields at most N components, each of which can be computed by (14) for all x, where x is a row of X. However, if we have a large-scale problem, it has been motivated in [5] to use a subsample of M ≪ N data points to compute $\hat{\varphi}$. In this case, up to M components, which are called support vectors, will be computed. External criteria such as entropy maximization can be applied for an optimal selection of the subsample: given a fixed size M, the aim is to select the support vectors that maximize the quadratic Rényi entropy [9]

$$ H_R = -\ln \int p(x)^2\, dx, \qquad (16) $$

which can be approximated by using $\int \hat{p}(x)^2\, dx = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \Omega_{ij}$. The use of this active selection procedure can be important for large-scale problems, as it is related to the underlying density distribution of the sample. In this sense, the optimality of this selection is related to the final accuracy of the model. This finite-dimensional approximation $\hat{\varphi}(x)$ can be used in the primal problem (10) to estimate w with a sparse representation [5].
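A minimal sketch of such an entropy-based selection might look as follows; the randomized swap search, the RBF kernel, and all names are assumptions made here purely to illustrate criterion (16).

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def renyi_entropy(K_sub):
    """Estimate of (16): H_R = -ln( (1/M^2) * sum_ij Omega_ij )."""
    M = K_sub.shape[0]
    return -np.log(K_sub.sum() / M ** 2)

def select_support_vectors(X, M, sigma=1.0, n_iter=2000, seed=0):
    """Randomized swap search for a size-M working set maximizing (16)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=M, replace=False)       # initial random subsample
    best = renyi_entropy(rbf_kernel_matrix(X[idx], X[idx], sigma))
    for _ in range(n_iter):
        cand = rng.integers(X.shape[0])                        # candidate training point
        if cand in idx:
            continue
        trial = idx.copy()
        trial[rng.integers(M)] = cand                          # swap against a random working-set point
        H = renyi_entropy(rbf_kernel_matrix(X[trial], X[trial], sigma))
        if H > best:                                           # keep the swap only if the entropy increases
            idx, best = trial, H
    return idx
```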
5. Multi-class kernel logistic regression
Kernel logistic regression can be naturally extended to a multi-class version. Suppose we have a multi-class problem with C classes (C ≥ 2) and a training set $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \{1, 2, \ldots, C\}$ with N samples, where the input samples $x_i$ are i.i.d. from an unknown probability distribution over the random vectors (X, Y). In multi-class KLR the conditional class probabilities can be written as

$$ \Pr(Y = 1 \mid X = x; w_1, \ldots, w_m) = \frac{\exp(w_1^T \varphi(x))}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))} $$
$$ \Pr(Y = 2 \mid X = x; w_1, \ldots, w_m) = \frac{\exp(w_2^T \varphi(x))}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))} $$
$$ \vdots $$
$$ \Pr(Y = C \mid X = x; w_1, \ldots, w_m) = \frac{1}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))}, \qquad (17) $$

where m = C − 1.
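A minimal sketch of the class-probability computation (17), written for an explicit (e.g., Nyström-approximated) feature vector; the stacking of the C − 1 parameter vectors into a matrix and the toy numbers are assumptions introduced here.

```python
import numpy as np

def multiclass_klr_probabilities(W, phi_x):
    """Conditional class probabilities (17).

    W     : (m, d_phi) matrix with rows w_1, ..., w_m, where m = C - 1
    phi_x : (d_phi,) feature vector varphi(x)
    Returns the length-C vector [Pr(Y=1|x), ..., Pr(Y=C|x)].
    """
    scores = W @ phi_x                          # w_c^T varphi(x), c = 1, ..., m
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()                    # 1 + sum_c exp(w_c^T varphi(x))
    return np.append(expo, 1.0) / denom         # class C acts as the reference class

# tiny usage example with C = 3 classes and 4-dimensional features
W = np.array([[0.2, -0.1, 0.0, 0.5],
              [0.3,  0.4, -0.2, 0.1]])
print(multiclass_klr_probabilities(W, np.array([1.0, 0.5, -1.0, 2.0])))
```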
Deriving this multi-class implementation results in one large learning problem without the use of a coding scheme. Other possible multi-class implementations can be built by combining several independent binary classifiers via a common coding scheme approach, e.g., one-versus-one or one-versus-all. When using one-versus-all, C smaller learning problems have to be optimized, compared to one-versus-one, where C(C − 1)/2 small models have to be optimized. In [18] one-versus-one coding schemes resulted in better classification accuracies than one-versus-all; we therefore chose to compare the natural extension of multi-class KLR with one-versus-one KLR. To obtain probabilities when using a one-versus-one coding scheme we use method 3 described in [3].
The resulting pairwise probabilities $\mu_{ij} = \Pr(Y = i \mid Y = i \text{ or } Y = j,\, X = x)$ are transformed to the a posteriori probability by

$$ \Pr(Y = i \mid X = x) = 1 \Big/ \sum_{j=1,\, j \neq i}^{C} $$