Fixed-Size Kernel Logistic Regression for Phoneme Classification
Peter Karsmakers 1,2, Kristiaan Pelckmans 2, Johan Suykens 2, Hugo Van hamme 3
1 IIBT, K.H. Kempen (Associatie KULeuven), B-2440 Geel, Belgium
2 ESAT-SCD/SISTA, K.U.Leuven, B-3001 Heverlee, Belgium
3 ESAT-PSI/SPEECH, K.U.Leuven, B-3001 Heverlee, Belgium
[peter.karsmakers,kristiaan.pelckmans,johan.suykens,hugo.vanhamme]@esat.kuleuven.be
Abstract
Kernel logistic regression (KLR) is a popular non-linear classification technique. Unlike an empirical risk minimization approach such as employed by Support Vector Machines (SVMs), KLR yields probabilistic outcomes based on a maximum likelihood argument, which are particularly important in speech recognition. Different from other KLR implementations, we use a Nyström approximation to solve large scale problems with estimation in the primal space, as done in fixed-size Least Squares Support Vector Machines (LS-SVMs). In the speech experiments it is investigated how a natural KLR extension to multi-class classification compares to binary KLR models coupled via a one-versus-one coding scheme. Moreover, a comparison to SVMs is made.
Index Terms: phoneme classification, kernel logistic regression, large-scale, multi-class
1. Introduction
To tackle the task of phoneme classification we choose a Logistic Regression (LR) and Kernel Logistic Regression (KLR) approach. Hidden Markov models (HMMs) [16] are the state-of-the-art technique for current automatic speech recognition (ASR) systems. It is widely recognized that estimating the HMM parameters via a maximum likelihood criterion does not directly optimize the classification performance of the models. It is therefore of interest to develop alternative methods which infer the parameters by discriminative measures of performance. Several techniques have been presented for the task of phoneme recognition, such as Linear Discriminant Analysis (LDA) (e.g., [1]), Multi-Layer Perceptrons (MLPs) (e.g., [17]), Hidden Conditional Random Fields (HCRFs) (e.g., [13]), Support Vector Machines (SVMs) (e.g., [18]), and KLR (e.g., [15]).
Although SVMs have shown promising results for phoneme recognition, the motivation for choosing an LR or KLR approach over an empirical risk minimization approach such as the SVM is that the former yields probabilistic outcomes based on a maximum likelihood argument instead of a binary decision. KLR has the additional advantage that its extension to the multi-class case is well described, which must be contrasted with the commonly used coding approach (see, e.g., [6], [5]). Obtaining phoneme probabilities offers ample perspective for integration of this work in an ASR system.
Unlike SVMs, KLR is by its nature not sparse and needs all training samples in its final model. Different adaptations of the original algorithm have been made to obtain sparseness, such as in [6]. In this paper we employ a different practical technique, suited for large data sets, based on fixed-size Least Squares Support Vector Machines (LS-SVMs) [5], which we can use because KLR is related to a weighted version of LS-SVMs [12].
Our experiments are performed on the TIMIT data set, where we compare two different multi-class KLR implementations against binary SVM classifiers combined via a one-versus-one coding scheme.
This paper is organized as follows. In Section 2 we give an introduction to logistic regression. Section 3 describes the extension to kernel logistic regression. A fixed-size implementation is given in Section 4. Section 5 describes the extension to multi-class KLR and Section 6 reports numerical results on the TIMIT speech data set. Finally, we conclude in Section 7.
2. Logistic regression
After introducing some notation, we recall the principles of logistic regression. Suppose we have a binary classification problem with a training set $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \{-1, 1\}$ with N samples, where the input samples $x_i$ are i.i.d. from an unknown probability distribution over the random vectors (X, Y). We define the first element of $x_i$ to be 1, so that we can incorporate the intercept term in the parameter vector w. The goal is to find a classification rule from the training data, such that when given a new input $x_*$ we can assign a class label to it. In logistic regression the conditional class probabilities are estimated via the logit stochastic models

$$ \Pr(Y = -1 \mid X = x; w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)}, \qquad \Pr(Y = 1 \mid X = x; w) = \frac{1}{1 + \exp(w^T x)}. \qquad (1) $$
The class membership of a new point $x_*$ can be given by the classification rule

$$ \arg\max_{c \in \{-1, 1\}} \Pr(Y = c \mid X = x_*; w). \qquad (2) $$

The common method to infer the parameters of the different models is via the use of the penalized negative log likelihood (PNLL)
$$ \min_{w}\; \ell(w) = -\ln \prod_{i=1}^{N} \Pr(Y = y_i \mid X = x_i; w) + \frac{\nu}{2} w^T w, \qquad (3) $$

where the regularization parameter ν must be set such that the parameters in w stay small in order to obtain a good bias-variance trade-off and avoid overfitting.
We derive the objective function for LR by combining (1) with (3), which gives
$$ \ell_{LR}(w) = -\sum_{i \in D_1} \ln \frac{\exp(w^T x_i)}{1 + \exp(w^T x_i)} - \sum_{i \in D_2} \ln \frac{1}{1 + \exp(w^T x_i)} + \frac{\nu}{2} w^T w, \qquad (4) $$

where $D = \{(x_i, y_i)\}_{i=1}^{N}$, $D = D_1 \cup D_2$, $D_1 \cap D_2 = \emptyset$, and $D_1$ and $D_2$ collect the samples with $y_i = -1$ and $y_i = 1$, respectively. In the sequel we use the shorthand notation

$$ p_{c,i} = \Pr(Y = c \mid X = x_i; w). \qquad (5) $$

This PNLL criterion for LR is known to possess a number of useful properties, such as the fact that it is convex in the parameters w, smooth, and has asymptotic optimality properties.
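For illustration, the following is a minimal numerical sketch (not the authors' implementation) of evaluating the objective (4) under the model (1); the function name, the variable name `nu` for ν, and the toy data are assumptions introduced here.

```python
import numpy as np

def lr_objective(w, X, y, nu):
    """Penalized negative log-likelihood (4) for labels y in {-1, +1},
    using model (1): Pr(Y=-1|x) = exp(w'x)/(1+exp(w'x)), Pr(Y=+1|x) = 1/(1+exp(w'x))."""
    z = X @ w                                # linear scores w^T x_i
    log_p_neg = z - np.logaddexp(0.0, z)     # ln[ exp(z) / (1 + exp(z)) ], numerically safe
    log_p_pos = -np.logaddexp(0.0, z)        # ln[ 1 / (1 + exp(z)) ]
    log_lik = np.where(y == -1, log_p_neg, log_p_pos).sum()
    return -log_lik + 0.5 * nu * (w @ w)     # PNLL with ridge penalty

# tiny usage example; the first column of X is set to 1 to absorb the intercept
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = np.where(rng.normal(size=20) > 0, 1, -1)
print(lr_objective(np.zeros(3), X, y, nu=1.0))
```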
Until now we have defined a model and an objective function which has to be optimized to fit the parameters on the observed data. Most often this optimization is performed by a Newton-based strategy where the solution is found by iterating

$$ w^{(k)} = w^{(k-1)} + s^{(k)} \qquad (6) $$

over k until convergence. The minimization in this case is equivalent to an iteratively regularized re-weighted least squares (IRRLS) problem (e.g., [6]) which can be written as
$$ \min_{s^{(k)}}\; \frac{1}{2} \left\| X s^{(k)} - z^{(k)} \right\|^2_{W^{(k)}} + \frac{\nu}{2} \left( s^{(k)} + w^{(k-1)} \right)^T \left( s^{(k)} + w^{(k-1)} \right), \qquad (7) $$

where

$$ z^{(k)} = (W^{(k)})^{-1} q, \qquad (8) $$

and where we define $X = [x_1; \ldots; x_N]$, $g_i = p_{1,i}(1 - p_{1,i})$, $W = \mathrm{diag}([g_1; \ldots; g_N])$, $q_i = (p_{y_i,i} - 1)\, y_i$ and $q = [q_1; \ldots; q_N]$.
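A minimal sketch of this IRRLS/Newton iteration (6)-(8) for the linear model might look as follows; the zero initialization, the stopping rule, and the closed-form normal-equation solve of (7) are assumptions not prescribed by the text.

```python
import numpy as np

def fit_lr_irrls(X, y, nu, max_iter=50, tol=1e-8):
    """Newton / IRRLS iterations (6)-(8) for model (1).

    X : (N, d) design matrix (first column equal to 1 for the intercept)
    y : (N,) labels in {-1, +1};  nu : regularization parameter
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        z_lin = X @ w
        p_pos = 1.0 / (1.0 + np.exp(z_lin))          # Pr(Y=+1 | x_i; w) under model (1)
        p_yi = np.where(y == 1, p_pos, 1.0 - p_pos)  # p_{y_i, i}
        g = p_pos * (1.0 - p_pos)                    # g_i = p_{1,i}(1 - p_{1,i})
        q = (p_yi - 1.0) * y                         # q_i = (p_{y_i,i} - 1) y_i
        # The minimizer of (7) satisfies (X^T W X + nu I) s = X^T q - nu w^(k-1)
        H = X.T @ (X * g[:, None]) + nu * np.eye(d)
        s = np.linalg.solve(H, X.T @ q - nu * w)
        w = w + s                                    # update (6)
        if np.linalg.norm(s) < tol:
            break
    return w
```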
3. Kernel logistic regression
In this section we define the minimization problem for the kernel version of logistic regression. This result is based on an optimization argument as opposed to the use of an appropriate Representer Theorem [7]. The LR model as defined in (1) can be given a nonlinear extension towards kernel machines, where the inputs x are mapped to a high-dimensional space. Define $\Phi \in \mathbb{R}^{N \times d_\varphi}$ as X with $x_i$ replaced by $\varphi(x_i)$, where $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^{d_\varphi}$ denotes the feature map induced by a positive definite kernel. With the application of Mercer's theorem to the kernel matrix Ω, $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$, $i, j = 1, \ldots, N$, it is not required to compute the nonlinear mapping ϕ(·) explicitly, as this is done implicitly through the use of positive definite kernel functions K. For K there are usually the following choices:
$K(x_i, x_j) = x_i^T x_j$ (linear kernel); $K(x_i, x_j) = (x_i^T x_j + h)^b$ (polynomial of degree b, with h ≥ 0 a tuning parameter); $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$ (radial basis function, RBF), where σ is a tuning parameter.
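As a small illustration, these three kernel functions might be written as follows; the default parameter values are placeholders introduced here, not values used in the paper.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                           # x_i^T x_j

def polynomial_kernel(xi, xj, h=1.0, b=3):
    return (xi @ xj + h) ** b                                # (x_i^T x_j + h)^b, h >= 0

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)      # exp(-||x_i - x_j||_2^2 / sigma^2)
```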
In KLR the models are defined as

$$ \Pr(Y = -1 \mid X = x; w) = \frac{\exp(w^T \varphi(x))}{1 + \exp(w^T \varphi(x))}, \qquad \Pr(Y = 1 \mid X = x; w) = \frac{1}{1 + \exp(w^T \varphi(x))}. \qquad (9) $$
Starting from (7) we include the feature map and introduce the error variable e, which results in
$$ \min_{s^{(k)}, e^{(k)}}\; \frac{1}{2}\, e^{(k)T} W^{(k)} e^{(k)} + \frac{\nu}{2} \left( s^{(k)} + w^{(k-1)} \right)^T \left( s^{(k)} + w^{(k-1)} \right) \quad \text{such that} \quad z^{(k)} = \Phi s^{(k)} + e^{(k)}, \qquad (10) $$

which in the context of LS-SVMs is called the primal problem.
In its dual formulation the solution to this optimization problem can be found by iteratively solving the linear system

$$ \left( \frac{1}{\nu}\, \Omega + (W^{(k)})^{-1} \right) \alpha^{(k)} = z^{(k)} + \Omega\, \alpha^{(k-1)}, \qquad (11) $$

where $z^{(k)}$ is defined as in (8). The probabilities of a new point $x_*$ can be predicted using (9) with $w^T \varphi(x_*) = \frac{1}{\nu} \sum_{i=1}^{N} \alpha_i K(x_i, x_*)$. The proof can be found in [12].
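For illustration, a minimal sketch of the resulting prediction step: given dual variables α obtained from (11), the class probabilities of a new point follow from (9). The RBF kernel choice and all names are assumptions introduced here.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

def klr_predict_proba(alpha, X_train, x_new, nu, sigma=1.0):
    """Probabilities (9) for a new point, with w^T varphi(x_*) = (1/nu) sum_i alpha_i K(x_i, x_*)."""
    k = np.array([rbf_kernel(xi, x_new, sigma) for xi in X_train])
    f = (alpha @ k) / nu                      # w^T varphi(x_*)
    p_neg = np.exp(f) / (1.0 + np.exp(f))     # Pr(Y = -1 | x_*)
    p_pos = 1.0 / (1.0 + np.exp(f))           # Pr(Y = +1 | x_*)
    return p_neg, p_pos
```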
4. Kernel logistic regression: a fixed-size implementation
4.1. Nyström approximation
In the previous section we stated a primal and a dual formulation of the optimization problem. Suppose one takes a finite-dimensional feature map (e.g., a linear kernel); then one can equally well solve the primal as the dual problem. In fact, solving the primal problem is more advantageous for larger data sets, because the dimension of the unknowns $w \in \mathbb{R}^d$ is small compared to that of $\alpha \in \mathbb{R}^N$. In order to work in the primal space using a kernel function other than the linear one, it is required to compute an explicit approximation of the nonlinear mapping ϕ. This leads to a sparse representation of the model when estimating in the primal space.
Explicit expressions for ϕ can be obtained by means of an eigenvalue decomposition of the kernel matrix Ω with entries $K(x_i, x_j)$. Given the integral equation $\int K(x, x_j)\, \phi_i(x)\, p(x)\, dx = \lambda_i \phi_i(x_j)$, with solutions $\lambda_i$ and $\phi_i$ for a variable x with probability density p(x), we can write

$$ \varphi = \left[ \sqrt{\lambda_1}\, \phi_1,\; \sqrt{\lambda_2}\, \phi_2,\; \ldots,\; \sqrt{\lambda_{d_\varphi}}\, \phi_{d_\varphi} \right]. \qquad (12) $$

Given the data set, it is possible to approximate the integral by a sample average. This will lead to the eigenvalue problem (Nyström approximation [8])

$$ \frac{1}{N} \sum_{l=1}^{N} K(x_l, x_j)\, u_i(x_l) = \lambda_i^{(s)} u_i(x_j), \qquad (13) $$
where the eigenvalues $\lambda_i$ and eigenfunctions $\phi_i$ from the continuous problem can be approximated by the sample eigenvalues $\lambda_i^{(s)}$ and the eigenvectors $u_i \in \mathbb{R}^N$ as

$$ \hat{\lambda}_i = \frac{1}{N}\, \lambda_i^{(s)}, \qquad \hat{\phi}_i = \sqrt{N}\, u_i. \qquad (14) $$
Based on this approximation, it is possible to compute the eigendecomposition of the kernel matrix Ω and use its eigenvalues and eigenvectors to compute the i-th required component of $\hat{\varphi}(x)$, simply by applying (12) if x is a training point, or for any new point $x_*$ by means of

$$ \hat{\varphi}_i(x_*) = \frac{1}{\sqrt{\lambda_i^{(s)}}} \sum_{j=1}^{N} u_{ji}\, K(x_j, x_*). \qquad (15) $$
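A minimal sketch of this feature construction with an RBF kernel might look as follows; the kernel choice, the eigenvalue cutoff `eps`, the function names, and the normalization convention (the eigenvalues of Ω are used directly, so that the features reproduce Ω on the selected points) are assumptions introduced here. In the fixed-size setting described next, a subsample of size M plays the role of the training set in (13)-(15).

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """K(a_i, b_j) = exp(-||a_i - b_j||_2^2 / sigma^2) for all pairs of rows."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def nystrom_features(X_sub, X_new, sigma=1.0, eps=1e-12):
    """Approximate feature map: i-th component of hat{varphi}(x_*), cf. (15)."""
    Omega = rbf_kernel_matrix(X_sub, X_sub, sigma)     # kernel matrix on the (sub)sample
    lam, U = np.linalg.eigh(Omega)                     # eigendecomposition, cf. (13)
    keep = lam > eps                                   # drop numerically zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    K_new = rbf_kernel_matrix(X_new, X_sub, sigma)     # K(x_*, x_j) for the new points
    # i-th feature of x_*:  (1 / sqrt(lam_i)) * sum_j u_ji K(x_j, x_*)
    return (K_new @ U) / np.sqrt(lam)
```

On the (sub)sample itself these explicit features reproduce the kernel matrix exactly, so estimating w in the primal then reduces to a weighted, regularized least squares problem on them.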
4.2. Sparseness and large scale problems
So far, using the entire training sample of size N to compute the approximation of ϕ yields at most N components, each of which can be computed by (14) for all x, where x is a row of X. However, if we have a large-scale problem, it has been motivated in [5] to use a subsample of M ≪ N data points to compute $\hat{\varphi}$. In this case, up to M components, which are called support vectors, will be computed. External criteria such as entropy maximization can be applied for an optimal selection of the subsample: given a fixed size M, the aim is to select the support vectors that maximize the quadratic Rényi entropy [9]

$$ H_R = -\ln \int p(x)^2\, dx, \qquad (16) $$

which can be approximated by using $\int \hat{p}(x)^2\, dx = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \Omega_{ij}$. The use of this active selection procedure can be important for large-scale problems, as it is related to the underlying density distribution of the sample. In this sense, the optimality of this selection is related to the final accuracy of the model. This finite-dimensional approximation $\hat{\varphi}(x)$ can be used in the primal problem (10) to estimate w with a sparse representation [5].
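A minimal sketch of such an entropy-based selection might look as follows; the randomized swap search, the RBF kernel, and all names are assumptions made here purely to illustrate criterion (16).

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma ** 2)

def renyi_entropy(K_sub):
    """Estimate of (16): H_R = -ln( (1/M^2) * sum_ij Omega_ij )."""
    M = K_sub.shape[0]
    return -np.log(K_sub.sum() / M ** 2)

def select_support_vectors(X, M, sigma=1.0, n_iter=2000, seed=0):
    """Randomized swap search for a size-M working set maximizing (16)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=M, replace=False)       # initial random subsample
    best = renyi_entropy(rbf_kernel_matrix(X[idx], X[idx], sigma))
    for _ in range(n_iter):
        cand = rng.integers(X.shape[0])                        # candidate training point
        if cand in idx:
            continue
        trial = idx.copy()
        trial[rng.integers(M)] = cand                          # swap against a random working-set point
        H = renyi_entropy(rbf_kernel_matrix(X[trial], X[trial], sigma))
        if H > best:                                           # keep the swap only if the entropy increases
            idx, best = trial, H
    return idx
```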
5. Multi-class kernel logistic regression
Kernel logistic regression can be naturally extended to a multi-class version. Suppose we have a multi-class problem with C classes (C ≥ 2) and a training set $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \{1, 2, \ldots, C\}$ with N samples, where the input samples $x_i$ are i.i.d. from an unknown probability distribution over the random vectors (X, Y). In multi-class KLR the conditional class probabilities can be written as

$$ \Pr(Y = 1 \mid X = x; w_1, \ldots, w_m) = \frac{\exp(w_1^T \varphi(x))}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))} $$
$$ \Pr(Y = 2 \mid X = x; w_1, \ldots, w_m) = \frac{\exp(w_2^T \varphi(x))}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))} $$
$$ \vdots $$
$$ \Pr(Y = C \mid X = x; w_1, \ldots, w_m) = \frac{1}{1 + \sum_{c=1}^{m} \exp(w_c^T \varphi(x))}, \qquad (17) $$

where m = C − 1.
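A minimal sketch of the class-probability computation (17), written for an explicit (e.g., Nyström-approximated) feature vector; the stacking of the C − 1 parameter vectors into a matrix and the toy numbers are assumptions introduced here.

```python
import numpy as np

def multiclass_klr_probabilities(W, phi_x):
    """Conditional class probabilities (17).

    W     : (m, d_phi) matrix with rows w_1, ..., w_m, where m = C - 1
    phi_x : (d_phi,) feature vector varphi(x)
    Returns the length-C vector [Pr(Y=1|x), ..., Pr(Y=C|x)].
    """
    scores = W @ phi_x                          # w_c^T varphi(x), c = 1, ..., m
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()                    # 1 + sum_c exp(w_c^T varphi(x))
    return np.append(expo, 1.0) / denom         # class C acts as the reference class

# tiny usage example with C = 3 classes and 4-dimensional features
W = np.array([[0.2, -0.1, 0.0, 0.5],
              [0.3,  0.4, -0.2, 0.1]])
print(multiclass_klr_probabilities(W, np.array([1.0, 0.5, -1.0, 2.0])))
```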
Deriving this multi-class implementation results in one large learning problem without the use of a coding scheme. Other possible multi-class implementations can be built by combining several independent binary classifiers via a common coding scheme approach, e.g., one-versus-one or one-versus-all. When using one-versus-all, C smaller learning problems have to be optimized, compared to one-versus-one, where C(C − 1)/2 small models have to be optimized. In [18] one-versus-one coding schemes resulted in better classification accuracies than one-versus-all; we therefore chose to compare the natural extension of multi-class KLR with one-versus-one KLR. To obtain probabilities when using a one-versus-one coding scheme we use method 3 described in [3].
The resulting pairwise probabilities $\mu_{ij} = \Pr(Y = i \mid Y = i \text{ or } Y = j,\, X = x)$ are transformed to the a posteriori probability by

$$ \Pr(Y = i \mid X = x) = 1 \Big/ \sum_{j=1,\, j \neq i}^{C} $$