Citation/Reference: Huang X., Maier A., Hornegger J., Suykens J.A.K., "Indefinite Kernels in Least Squares Support Vector Machines and Principal Component Analysis", Applied and Computational Harmonic Analysis, Sep. 2016.
Archived version: Author manuscript. The content is identical to the content of the published paper, but without the final typesetting by the publisher.
Published version: http://dx.doi.org/10.1016/j.acha.2016.09.001
Journal homepage: http://www.journals.elsevier.com/applied-and-computational-harmonic-analysis/
IR: https://lirias.kuleuven.be/handle/123456789/557941
Indefinite kernels in least squares support vector machines and principal component analysis
Xiaolin Huang a,c,∗, Andreas Maier c, Joachim Hornegger c, Johan A.K. Suykens b
a Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China
b KU Leuven, ESAT-STADIUS, B-3001 Leuven, Belgium
c Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen–Nürnberg, 91058 Erlangen, Germany
Article info
Article history: Received 15 March 2016; received in revised form 29 August 2016; accepted 2 September 2016; available online xxxx. Communicated by Charles K. Chui.
Keywords: Least squares support vector machine; Indefinite kernel; Classification; Kernel principal component analysis

Abstract
Because of several successful applications, indefinite kernels have attracted much research interest in recent years. This paper addresses indefinite learning in the framework of least squares support vector machines (LS-SVM). Unlike existing indefinite kernel learning methods, which usually involve non-convex problems, the indefinite LS-SVM is still easy to solve, but the kernel trick and primal-dual relationship for LS-SVM with a Mercer kernel are no longer valid. In this paper, we give a feature space interpretation for indefinite LS-SVM. In the same framework, kernel principal component analysis with an indefinite kernel is discussed as well. In numerical experiments, LS-SVM with indefinite kernels for classification and kernel principal component analysis is evaluated. Its good performance, together with the feature space interpretation given in this paper, implies the potential use of indefinite LS-SVM in real applications.
© 2016 Published by Elsevier Inc.
1. Introduction
Mercer's condition is the traditional requirement on the kernel applied in classical kernel learning methods, such as the support vector machine with the hinge loss (C-SVM, [1]), least squares support vector machines (LS-SVM, [2,3]), and kernel principal component analysis (kPCA, [4]). However, in practice, one may meet sophisticated similarity or dissimilarity measures which lead to kernels violating Mercer's condition.
✩ This work was partially supported by the Alexander von Humboldt Foundation and the National Natural Science Foundation of China (61603248). Johan Suykens acknowledges support of ERC AdG A-DATADRIVE-B (290923), KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; FWO: G.0377.12, G.088114N, SBO POM (100031); IUAP P7/19 DYSCO.
* Corresponding author at: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China.
E-mail addresses: xiaolinhuang@sjtu.edu.cn (X. Huang), andreas.maier@fau.de (A. Maier), joachim.hornegger@fau.de (J. Hornegger), johan.suykens@esat.kuleuven.be (J.A.K. Suykens).
Since the kernel matrices induced by such kernels are real and symmetric, but not positive semi-definite, they are called indefinite kernels, and the corresponding learning methodology is called indefinite learning [5–15].
Two important problems arise for indefinite learning. First, it lacks the classical feature space interpretation available for a Mercer kernel, i.e., we cannot find a nonlinear feature mapping such that its inner product gives the value of an indefinite kernel function. Second, the lack of positive definiteness makes many learning models non-convex if an indefinite kernel is used. In the last decades, there has been continuous progress on these issues. In theory, indefinite learning in C-SVM has been discussed in Reproducing Kernel Kreĭn Spaces (RKKS), cf. [5–8]. Kernel Fisher discriminant analysis with an indefinite kernel can be found in [9–11], which is also discussed on RKKS. In algorithm design, the current mainstream is to find an approximate positive semi-definite (PSD) kernel and then apply a classical kernel learning algorithm based on that PSD kernel. These methods can be found in [12–14], and they have been reviewed and compared in [15]. That review also discusses directly applying indefinite kernels in some classical kernel learning methods which are not sensitive to metric violations. As suggested by [16], one can use an indefinite kernel to replace the PSD kernel in the dual formulation of C-SVM and solve it by sequential minimal optimization [17,18]. This kind of method enjoys a similar computational efficiency as the classical learning methods and hence is more attractive in practice.
Following the way of introducing indefinite kernels to C-SVM, we consider indefinite learning based on LS-SVM. Notice that using an indefinite kernel in C-SVM results in a non-convex problem, but indefinite learning based on LS-SVM is still easy to solve. However, Mercer's theorem is no longer valid. We have to find a new feature space interpretation and to give a characterization in terms of primal and dual problems, which are the theoretical targets of this paper. Since kPCA can be conducted in the framework of LS-SVM [19], we will discuss kPCA with an indefinite kernel as well.
This paper is organized as follows. Section 2 briefly reviews indefinite learning. Section 3 addresses LS-SVM with indefinite kernels and provides its feature space interpretation. A similar discussion on kPCA is given in Section 4. Then the performance of indefinite learning based on LS-SVM is evaluated by numerical experiments in Section 5. Finally, Section 6 gives a short conclusion.
2. Indefinite kernels
We start the discussion from C-SVM with a Mercer kernel. Given a set of training data \{x_i, y_i\}_{i=1}^m with x_i \in \mathbb{R}^n and y_i \in \{-1, +1\}, we aim to construct a discriminant function f(x): \mathbb{R}^n \to \mathbb{R} and use its sign for classification. Except for linearly separable problems, a nonlinear feature mapping \phi(x) is needed, and the discriminant function is usually formulated as f(x) = w^T \phi(x) + b. C-SVM trains w and b by the following optimization problem:
    \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i
    \text{s.t. } y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \ \forall i \in \{1, \dots, m\},   (1)
    \xi_i \geq 0, \ \forall i \in \{1, \dots, m\}.
It is well known that the dual problem of (1) takes the following formulation:
    \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i \alpha_i K_{ij} \alpha_j y_j
    \text{s.t. } \sum_{i=1}^{m} y_i \alpha_i = 0,   (2)
    0 \leq \alpha_i \leq C, \ \forall i \in \{1, \dots, m\},
where K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j) is the kernel matrix. For any kernel K which satisfies Mercer's condition, there is always a feature map \phi such that K(x_i, x_j) = \phi(x_i)^T \phi(x_j). This allows us to construct a classifier maximizing the margin in the feature space without explicitly knowing \phi.
Traditionally, in (2), we require positive semi-definiteness of K. But in some applications, especially in computer vision, there are many distances or dissimilarities for which the corresponding matrices are not PSD [20–22]. It is also possible that a kernel is PSD but this is very hard to verify [5]. Even for a PSD kernel, noise may make the dissimilarity matrix non-PSD [23,24]. All these facts motivated researchers to think about indefinite kernels in C-SVM. Notice that "indefinite kernels" literally covers many kernels, including asymmetric ones induced by asymmetric distances. But as in the rest of the indefinite learning literature, in this paper we restrict "indefinite kernel" to kernels that correspond to real symmetric indefinite matrices.
In theory, using indefinite kernels in C-SVM makes Mercer's theorem not applicable, which means that (1) and (2) are not a pair of primal-dual problems, and then the solution of (2) cannot be explained as margin maximization in a feature space. Moreover, the learning theory and approximation theory about C-SVM with PSD kernels are not valid, since the functional space spanned by indefinite kernels does not belong to any Reproducing Kernel Hilbert Space (RKHS). To link an indefinite kernel to an RKHS, we need a positive decomposition. Its definition is given by [5] as follows: an indefinite kernel K has a positive decomposition if there are two PSD kernels K_+, K_- such that
    K(u, v) = K_+(u, v) - K_-(u, v), \ \forall u, v.   (3)
For an indefinite kernel K that has a positive decomposition, there exist Reproducing Kernel Kreĭn Spaces (RKKS). Conditions for the existence of a positive decomposition are given by [5]. However, for a specific kernel, those conditions are usually hard to verify in practice. But at least, when the training data are given, the kernel matrix K has a decomposition as the difference of two PSD matrices. Whether any indefinite kernel has a positive decomposition is still an open question. Fortunately, (3) is always valid for u, v \in \{x_i\}_{i=1}^m. Thus, indefinite learning can be theoretically analyzed in RKKS and be implemented based on matrix decomposition in practice.
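On a fixed training set, such a decomposition of the kernel matrix is easy to compute from its eigendecomposition; a minimal numpy sketch (the toy matrix and all variable names are ours, for illustration only):

```python
import numpy as np

# A small real symmetric but indefinite kernel matrix (eigenvalues 3 and -1).
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# Eigendecomposition of the symmetric matrix K.
lam, V = np.linalg.eigh(K)

# Split the spectrum: K_plus keeps the positive part of the spectrum,
# K_minus the (negated) negative part, so that K = K_plus - K_minus.
K_plus  = (V * np.maximum(lam, 0.0)) @ V.T
K_minus = (V * np.maximum(-lam, 0.0)) @ V.T
```

Both `K_plus` and `K_minus` are PSD by construction, which is exactly the matrix-level version of (3) on the training data.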
The feature space interpretation for indefinite learning is given by [24] for a finite-dimensional Kreĭn space, which is also called a pseudo-Euclidean (pE) space. A pE space is denoted by \mathbb{R}^{(p,q)} with non-negative integers p and q. This space is a product of two Euclidean vector spaces \mathbb{R}^p \times i\mathbb{R}^q. An element in \mathbb{R}^{(p,q)} can be represented by its coordinate vector, and the coordinate vector gives the inner product \langle u, v \rangle_{pE} = u^T M v, where M is a diagonal matrix with the first p diagonal entries equal to 1 and the others equal to -1. If we link the components of M with the signs of the eigenvalues of the indefinite kernel matrix K, solving (2) for K is interpreted in [24] as distance minimization in \mathbb{R}^{(p,q)}. For the learning behavior in RKKS, one can find a discussion on the space size in [5], error bounds in [25], and asymptotic convergence in [26,27].
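The pE inner product is straightforward to evaluate once p and q are fixed; a small illustration (the function name is ours):

```python
import numpy as np

def pe_inner(u, v, p, q):
    """Pseudo-Euclidean inner product <u, v>_pE = u^T M v in R(p, q),
    where M = diag(1, ..., 1, -1, ..., -1) has p ones and q minus-ones."""
    m_diag = np.concatenate([np.ones(p), -np.ones(q)])
    return float(u @ (m_diag * v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
# In R(2, 1): 1*4 + 2*5 - 3*6 = -4
val = pe_inner(u, v, 2, 1)
```

Unlike a Euclidean inner product, this bilinear form is indefinite: vectors can have negative "squared norm", which is the geometric source of the non-convexity discussed below.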
When an indefinite kernel is used in C-SVM, (2) becomes a non-convex quadratic problem, since K is not positive semi-definite. For a non-convex problem, many algorithms based on global optimality are invalid.
An alternative way is to find an approximate PSD matrix K̃ for an indefinite one K, and then solve (2) for K̃. To obtain K̃, one can adjust the eigenvalues of K by: i) setting all negative eigenvalues to zero [12]; ii) flipping the signs of the negative eigenvalues [13]; iii) squaring the eigenvalues [26,27]. It can also be implemented by minimizing the Frobenius distance between K and K̃, as introduced by [14]. Since training and classification are based on two different kernels, the above methods are efficient only when K and K̃ are similar. Those methods are also time-consuming, since they additionally involve eigenvalue problems. To pursue computational effectiveness, one can use descent algorithms, e.g., sequential minimal optimization (SMO) developed by [17,18], to directly solve (2) for an indefinite kernel matrix. Though only local optima are guaranteed, the performance is still promising, as reported by [15] and [16].
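The three spectrum modifications above can be sketched in a few lines of numpy (a toy illustration; the function name and the example matrix are ours):

```python
import numpy as np

def psd_approximations(K):
    """Return the clip, flip, and square PSD surrogates of a real
    symmetric indefinite matrix K, obtained by modifying its spectrum."""
    lam, V = np.linalg.eigh(K)
    clip   = (V * np.maximum(lam, 0.0)) @ V.T  # i) negative eigenvalues -> 0
    flip   = (V * np.abs(lam)) @ V.T           # ii) signs of negative eigenvalues flipped
    square = (V * lam**2) @ V.T                # iii) eigenvalues squared
    return clip, flip, square

K = np.array([[1.0, 2.0],
              [2.0, 1.0]])        # eigenvalues 3 and -1, so K is indefinite
clip, flip, square = psd_approximations(K)
```

All three surrogates share the eigenvectors of K and differ only in how the negative part of the spectrum is treated, which is why they behave similarly when K is "nearly PSD".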
3. LS-SVM with real symmetric indefinite kernels
The current indefinite learning discussions are mainly for C-SVM. In this paper, we propose to use indefinite kernels in the framework of least squares support vector machines. In the dual space, LS-SVM is to solve the following linear system [2]:
    \begin{bmatrix} 0 & y^T \\ y & H + \gamma^{-1} I \end{bmatrix}
    \begin{bmatrix} b \\ \alpha \end{bmatrix}
    =
    \begin{bmatrix} 0 \\ 1 \end{bmatrix},   (4)

with \alpha = [\alpha_1, \dots, \alpha_m]^T and y = [y_1, \dots, y_m]^T,
where I is the identity matrix, 1 is an all-ones vector of proper dimension, and H is given by H_{ij} = y_i y_j K_{ij} = y_i y_j K(x_i, x_j).
We assume that the matrix in (4) is full rank. Then its solution can be obtained effectively, and the corresponding discriminant function is
    f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b.
The existing discussion about LS-SVM usually requires K to be positive semi-definite, such that Mercer's theorem is applicable and the solution of (4) is related to Fisher discriminant analysis in the feature space [28].
Now let us investigate indefinite kernels in LS-SVM (4). One good property is that even when K is indefinite, (4) is still easy to solve, which differs from C-SVM, where an indefinite kernel makes (2) non-convex. Though (4) with an indefinite kernel is easy to solve, the solution loses many properties of PSD kernels, and its feature space interpretation has to be analyzed in a pE space as well. This is based on the following proposition:
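As a concrete illustration of how (4) is solved, a minimal numpy sketch follows (the toy data, the kernel choice, and all names are ours, not from the paper); it builds H, solves the bordered linear system, and evaluates f(x) = \sum_i y_i \alpha_i K(x, x_i) + b:

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM dual system (4) for any symmetric kernel matrix K,
    PSD or not; gamma is the regularization parameter."""
    m = len(y)
    H = (y[:, None] * y[None, :]) * K            # H_ij = y_i y_j K_ij
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y                                 # first row:    y^T alpha = 0
    A[1:, 0] = y                                 # first column: y b
    A[1:, 1:] = H + np.eye(m) / gamma
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)                # assumes the matrix is full rank
    return sol[0], sol[1:]                       # b, alpha

def lssvm_predict(x, X, y, alpha, b, kernel):
    # f(x) = sum_i y_i alpha_i K(x, x_i) + b
    return sum(yi * ai * kernel(x, xi) for xi, yi, ai in zip(X, y, alpha)) + b

# Toy 1-D example with an indefinite tanh kernel.
kernel = lambda u, v: np.tanh(u @ v - 0.5)
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.array([[kernel(a, c) for c in X] for a in X])
b, alpha = lssvm_train(K, y, gamma=10.0)
```

Note that nothing in the linear solve depends on K being PSD: the only assumption used is that the bordered matrix in (4) is full rank.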
Proposition 1. Let \alpha^*, b^* be the solution of (4) for a symmetric but indefinite kernel matrix K.
i) There exist two feature maps \phi_+ and \phi_- such that

    w_+^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi_+(x_i), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* y_i \phi_-(x_i),

and (w_+^*, w_-^*, b^*) is a stationary point of the following primal problem:

    \min_{w_+, w_-, b, \xi} \; \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2   (5)
    \text{s.t. } y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) = 1 - \xi_i, \ \forall i \in \{1, 2, \dots, m\}.
ii) The dual problem of (5) is given by (4), where

    K_{ij} = K(x_i, x_j) = K_+(x_i, x_j) - K_-(x_i, x_j)   (6)

with two PSD kernels K_+ and K_-:

    K_+(x_i, x_j) = \phi_+(x_i)^T \phi_+(x_j),   (7)
    K_-(x_i, x_j) = \phi_-(x_i)^T \phi_-(x_j).   (8)
Proof. For an indefinite kernel K, we can always find two PSD kernels K_+, K_- and corresponding feature maps \phi_+, \phi_- satisfying (6)–(8). Using \phi_+ and \phi_-, the Lagrangian of (5) can be written as
    L(w_+, w_-, b, \xi; \alpha) = \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) - 1 + \xi_i \right].
Then the conditions for a stationary point yield
    \partial L / \partial w_+ = w_+ - \sum_{i=1}^{m} \alpha_i y_i \phi_+(x_i) = 0,
    \partial L / \partial w_- = -w_- - \sum_{i=1}^{m} \alpha_i y_i \phi_-(x_i) = 0,
    \partial L / \partial b = -\sum_{i=1}^{m} \alpha_i y_i = 0,
    \partial L / \partial \xi_i = \gamma \xi_i - \alpha_i = 0,
    \partial L / \partial \alpha_i = y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) - 1 + \xi_i = 0.
Eliminating the primal variables w_+, w_-, \xi, we get the optimality conditions:

    \sum_{i=1}^{m} \alpha_i y_i = 0 \quad \text{and} \quad y_i \left( \sum_{j=1}^{m} \alpha_j y_j \left[ \phi_+(x_i)^T \phi_+(x_j) - \phi_-(x_i)^T \phi_-(x_j) \right] + b \right) - 1 + \frac{\alpha_i}{\gamma} = 0, \ \forall i.
Substituting (6)–(8) into the above conditions leads to (4). Therefore, (4) is the dual problem of (5). If \alpha^*, b^* is the solution of (4), then b^* and

    w_+^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi_+(x_i), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* y_i \phi_-(x_i)

satisfy the first-order optimality conditions of (5), i.e., (w_+^*, w_-^*, b^*) is a stationary point of (5). □
Proposition 1 gives the primal problem and a feature space interpretation for LS-SVM with an indefinite kernel. Its proof relies on the positive decomposition (6) of the kernel matrix K, which exists for all real symmetric kernel matrices. But this does not mean that we can find a positive decomposition for the kernel function K, i.e., (3) is not necessarily valid; the verification is usually hard for a specific kernel. If such a kernel decomposition exists, Proposition 1 further shows that (4) pursues a small within-class scatter in a pE space \mathbb{R}^{(p,q)}. If not, the within-class scatter is minimized in a space associated with an approximate kernel K̃ = K_+ - K_-, which is equal to K on all the training data. In (7) and (8), the dimension of the feature maps could be infinite, and then the conclusion extends to the corresponding RKKS.
4. Real symmetric indefinite kernels in PCA
In the last section, we considered LS-SVM with an indefinite kernel for binary classification. The analysis is applicable to other tasks which can be solved in the framework of LS-SVM. In [19], the link between kernel principal component analysis and LS-SVM has been investigated. Accordingly, we can give the feature space interpretation for kernel PCA with an indefinite kernel.
For given data \{x_i\}_{i=1}^m, kernel PCA is to solve an eigenvalue problem:
Ωα = λα, (9)
where the centered kernel matrix \Omega is induced from a kernel K as follows:
    \Omega_{ij} = K(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K(x_r, x_s).
Traditionally, K is limited to be a PSD kernel. Then a Mercer kernel is employed, and (9) maximizes the variance in the related feature space.
Following the same way of introducing an indefinite kernel in C-SVM or LS-SVM, we can directly use an indefinite kernel in kPCA (9). Notice that for an indefinite kernel, the eigenvalues can be both positive and negative. All these eigenvalues are still real, due to the use of a symmetric kernel. There is no difference in the problem itself, and the projected variables can be calculated in the same way. However, the feature space interpretation fundamentally changes, which is discussed in the following proposition.
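In matrix form, the double centering of K and the eigendecomposition of \Omega carry over unchanged to an indefinite kernel; a numpy sketch (the kernel choice, toy data, and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 points in R^3
m = X.shape[0]

# TL1 kernel K(u, v) = max(rho - ||u - v||_1, 0), an indefinite kernel.
rho = 0.7 * X.shape[1]
D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise l1 distances
K = np.maximum(rho - D1, 0.0)

# Double centering: Omega = H K H with H = I - (1/m) 1 1^T,
# which is the matrix form of the entrywise definition of Omega.
H = np.eye(m) - np.ones((m, m)) / m
Omega = H @ K @ H

# Eigenvalue problem (9): the eigenvalues are real (Omega is symmetric)
# but may have both signs, since K need not be PSD.
lam, alpha = np.linalg.eigh(Omega)
# Scores of the training points on the leading component.
scores = Omega @ alpha[:, -1]
```

The solver does not care whether K is PSD; only the interpretation of the negative eigenvalues changes, as discussed in Proposition 2.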
Proposition 2. Let \alpha^* be the solution of (9) for an indefinite kernel K.
i) There are two feature maps \phi_+ and \phi_- such that

    w_+^* = \sum_{i=1}^{m} \alpha_i^* \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right),

and (w_+^*, w_-^*) is a stationary point of the following primal problem:

    \max_{w_+, w_-, \xi} \; \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right)   (10)
    \text{s.t. } \xi_i = w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right), \ \forall i \in \{1, \dots, m\}.
Here, \hat{\mu}_{\phi_+} and \hat{\mu}_{\phi_-} are the centering terms, i.e.,

    \hat{\mu}_{\phi_+} = \frac{1}{m} \sum_{i=1}^{m} \phi_+(x_i) \quad \text{and} \quad \hat{\mu}_{\phi_-} = \frac{1}{m} \sum_{i=1}^{m} \phi_-(x_i).
ii) If we choose \gamma = 1/\lambda and decompose K as in (6)–(8), then the dual problem of (10) is given by (9).
Proof. Again, for an indefinite kernel K, we can find two PSD kernels K_+, K_- and corresponding nonlinear feature maps \phi_+, \phi_- satisfying (6)–(8).
The Lagrangian of (10) can be written as
    L(w_+, w_-, \xi; \alpha) = \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) - \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i \right].
Then, from the conditions for a stationary point, we have
    \partial L / \partial w_+ = w_+ - \sum_{i=1}^{m} \alpha_i \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) = 0,
    \partial L / \partial w_- = -w_- - \sum_{i=1}^{m} \alpha_i \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) = 0,
    \partial L / \partial \xi_i = -\gamma \xi_i + \alpha_i = 0,
    \partial L / \partial \alpha_i = w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i = 0.
Elimination of the primal variables results in the following optimality condition:

    \frac{1}{\gamma} \alpha_i - \sum_{j=1}^{m} \alpha_j \left( \phi_+(x_j) - \hat{\mu}_{\phi_+} \right)^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + \sum_{j=1}^{m} \alpha_j \left( \phi_-(x_j) - \hat{\mu}_{\phi_-} \right)^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) = 0, \ \forall i \in \{1, 2, \dots, m\}.   (11)
Applying the kernel trick (7) and (8), we know that

    \left( \phi_\pm(x_i) - \hat{\mu}_{\phi_\pm} \right)^T \left( \phi_\pm(x_j) - \hat{\mu}_{\phi_\pm} \right) = K_\pm(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K_\pm(x_r, x_s).
Together with (6), the optimality condition (11) can be formulated as the eigenvalue problem (9). Therefore, (9) is the dual problem of (10) and gives a stationary solution of (10), which aims at maximal variance, the same as kPCA with PSD kernels. □
5. Numerical experiments
In the preceding sections, we discussed the use of indefinite kernels in the framework of LS-SVM for classification and kernel principal component analysis, respectively. The general conclusions are: i) indefinite LS-SVM shares the same optimization model as the PSD one, and hence the same toolbox, namely LS-SVMlab [35], is applicable; ii) regarding the computational load of LS-SVM, there is no difference between a PSD kernel and an indefinite kernel, i.e., using an indefinite kernel in LS-SVM brings no additional computational burden; iii) the feature space interpretation of LS-SVM for an indefinite kernel is extended to a pE space, and only a stationary point can be obtained.
In theory, indefinite kernels are a generalization of PSD ones, for which the negative part in (6) vanishes. In practice, there are indefinite kernels successfully applied in specific applications [29–31]. Now, with the feature space interpretation given in this paper, one can use LS-SVM and its modifications to learn from an indefinite kernel. In general, algorithmic properties holding for LS-SVM with PSD kernels remain valid when an indefinite kernel is used.
In this section, we test the performance of LS-SVM with indefinite kernels on some benchmark problems. It should be noticed that the performance heavily relies on the choice of kernel. Though there are already some indefinite kernels designed for specific tasks, it is still hard to find an indefinite kernel suitable for a wide range of problems. Therefore, PSD kernels, especially the radial basis function (RBF) kernel and the polynomial kernel, are currently dominant in kernel learning. One challenger among indefinite kernels is the tanh kernel, which has been evaluated in the framework of C-SVM [8,16]. Another possible indefinite kernel
Table 1
Test accuracy of LS-SVM with PSD and indefinite kernels.

Dataset    m     n    RBF (CV)  poly (CV)  tanh (CV)  TL1 (ρ = 0.7n)  TL1 (CV)
DBWords    32    242  84.3%     85.6%      75.0%      85.2%           84.4%
Fertility  50    9    86.7%     80.4%      83.8%      86.7%           87.8%
Planning   91    12   70.2%     67.9%      71.6%      70.6%           73.6%
Sonar      104   60   84.5%     83.1%      72.9%      84.3%           83.6%
Statlog    135   13   81.4%     75.2%      82.7%      83.8%           83.5%
Monk1      124   6    79.1%     78.3%      76.6%      73.4%           85.2%
Monk2      169   6    84.1%     75.6%      69.9%      53.4%           83.7%
Monk3      122   6    93.5%     93.5%      88.0%      97.2%           97.2%
Climate    270   20   93.2%     91.7%      92.6%      91.9%           92.0%
Liver      292   10   69.0%     67.9%      67.2%      69.7%           71.8%
Austr.     345   14   85.0%     85.1%      86.6%      86.0%           86.7%
Breast     349   10   96.4%     95.7%      96.6%      97.0%           97.1%
Trans.     374   4    78.3%     78.5%      78.9%      78.4%           78.4%
Splice     1000  121  89.4%     87.3%      90.1%      93.6%           94.9%
Spamb.     2300  57   93.1%     92.4%      91.7%      94.1%           94.2%
ML-prove   3059  51   72.5%     74.6%      71.8%      79.1%           79.3%
is a truncated ℓ1 distance (TL1) kernel, which has been recently proposed in [32]. The mentioned kernels are listed below:

• PSD kernels:
  – linear kernel: K(u, v) = u^T v,
  – RBF kernel with parameter \sigma: K(u, v) = \exp(-\|u - v\|_2^2 / \sigma^2),
  – polynomial kernel with parameters c \geq 0, d \in \mathbb{N}_+: K(u, v) = (u^T v + c)^d.
• Indefinite kernels:
  – tanh kernel with parameters c, d:^1 K(u, v) = \tanh(c u^T v + d),
  – TL1 kernel with parameter \rho: K(u, v) = \max\{\rho - \|u - v\|_1, 0\}.
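The two indefinite kernels above take only a couple of lines each; a sketch in numpy (the function names are ours):

```python
import numpy as np

def tanh_kernel(u, v, c, d):
    """tanh kernel K(u, v) = tanh(c * u^T v + d)."""
    return float(np.tanh(c * np.dot(u, v) + d))

def tl1_kernel(u, v, rho):
    """Truncated l1 distance kernel K(u, v) = max(rho - ||u - v||_1, 0)."""
    return max(rho - float(np.abs(np.asarray(u) - np.asarray(v)).sum()), 0.0)

u = np.array([0.2, 0.4])
v = np.array([0.6, 0.1])
n = len(u)
k1 = tanh_kernel(u, v, c=-1.0, d=0.5)   # c < 0: the clearly indefinite regime
k2 = tl1_kernel(u, v, rho=0.7 * n)      # rho = 0.7n, the default suggested in [32]
```

Note that on the diagonal the TL1 kernel always equals \rho, so its kernel matrices have a constant diagonal regardless of the data scale.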
These kernels will be compared in the framework of LS-SVM for both classification and principal component analysis. First, consider binary classification problems, for which the data are downloaded from the UCI Repository of Machine Learning Datasets [36]. For some datasets, both training and test data are provided. Otherwise, we randomly pick half of the data for training and the rest for testing. All training data are normalized to [0, 1]^n in advance. In the training procedure, there are a regularization coefficient and kernel parameters, which are tuned by 10-fold cross validation. Specifically, we randomly partition the training data into 10 subsets. One of these subsets is used for validation in turn and the remaining ones for training.
As discussed in [32], the performance of the TL1 kernel is not very sensitive to the value of \rho, and \rho = 0.7n was suggested. We thus also evaluate the TL1 kernel with \rho = 0.7n. With one parameter fewer to tune, much training time can be saved.
The above procedure is repeated 10 times, and the average classification accuracy on test data is reported in Table 1, where the number of training samples m and the problem dimension n are given as well. The best result for each dataset in the sense of average accuracy is underlined. The results confirm the potential use of indefinite kernels in LS-SVM: an indefinite kernel can achieve similar accuracy as a PSD kernel on most of the problems and can have better performance on some specific problems. This does not mean that an indefinite kernel surely improves the performance over PSD ones, but for some datasets, e.g., Monk1, Monk3, and Splice, it is worth considering indefinite learning with LS-SVM, which may achieve better accuracy within almost the same training time. Moreover, this experiment, in which the performance of
1 The tanh kernel is conditionally positive definite (CPD) when c ≥ 0 and is indefinite otherwise; see, e.g., [33,34]. In our experiments, we consider both positive and negative c, and hence the tanh kernel is regarded as an indefinite kernel.
Fig. 1. (a) Data points of one class come from the unit sphere and are marked by red circles. The other data points, shown by blue stars, come from a sphere with radius 3. (b) This dataset is not linearly separable, and thus linear PCA is not helpful for distinguishing the two classes. Instead, kernel PCA is needed and, if the parameter is suitably chosen, the reduced data can be correctly classified by a linear classifier. (c) The RBF kernel with σ = 0.05; (d) the TL1 kernel with ρ = 0.1n; (e) the TL1 kernel with ρ = 0.2n; (f) the TL1 kernel with ρ = 0.5n, which is similar to linear PCA. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
the TL1 kernel with ρ = 0.7n is satisfactory for many datasets, illustrates the good parameter stability of the TL1 kernel.
In the following, we use indefinite kernels for principal component analysis. As an intuitive example, we consider a 3-dimensional sphere problem that distinguishes data from the spheres with radii equal to 1 and 3. The data are shown in Fig. 1(a). To reduce the dimension, we apply PCA, kPCA with the RBF kernel, and kPCA with the TL1 kernel, respectively. The obtained two-dimensional data are displayed in Fig. 1(b)–(f), which roughly imply that a suitable indefinite kernel can be used for kernel principal component analysis.
To quantitatively evaluate kPCA with indefinite kernels, we choose from Table 1 the problems whose dimension is higher than 20 and then apply kPCA to reduce the data to n_r dimensions. For the reduced data, linear classifiers, trained by linear C-SVM with libsvm [37], are used to classify the test data.
The parameters, including kernel parameters and the regularization constant in linear C-SVM, are tuned based on 10-fold cross-validation. In Table 2, the average classification accuracy of 10 trials for different reduction ratios n_r/n is listed. The results illustrate that indefinite kernels can be used for kPCA. The performance is in general comparable to PSD kernels, and for some datasets the performance is significantly improved.
Summarizing all the experiments above, we observe the potential use of indefinite kernels in LS-SVM for classification and kPCA. For example, the TL1 kernel has similar performance as the RBF kernel on many problems and much better results on several datasets. Our aim in these experiments is not to claim which kernel is the best, which actually depends on the specific problem. Instead, we show that for some problems a proper indefinite kernel can significantly improve the performance over PSD ones, which may motivate researchers to design indefinite kernels and use them in LS-SVMs.
Table 2
Test accuracy based on kPCA with different reduction ratios.

Dataset    m     n   Ratio  Linear  RBF    poly   tanh   TL1
Sonar      104   60  10%    72.6%   75.6%  75.2%  63.8%  77.9%
                     30%    73.1%   79.1%  78.2%  71.0%  80.4%
                     50%    75.9%   80.7%  79.0%  71.9%  81.9%
Climate    270   21  10%    90.4%   90.5%  91.4%  91.5%  90.5%
                     30%    90.9%   90.8%  91.4%  91.6%  90.9%
                     50%    91.6%   91.4%  93.9%  91.6%  91.9%
Qsar       528   41  10%    74.4%   77.8%  75.5%  77.5%  78.8%
                     30%    85.4%   86.4%  84.1%  84.5%  85.9%
                     50%    85.9%   86.7%  85.4%  86.0%  86.2%
Splice     1000  60  10%    83.7%   86.6%  85.5%  83.7%  91.9%
                     30%    83.9%   87.7%  85.3%  83.0%  91.1%
                     50%    84.1%   87.8%  86.5%  85.2%  91.3%
Spamb.     2300  57  10%    84.7%   86.5%  84.9%  86.4%  88.4%
                     30%    87.7%   89.9%  88.3%  89.9%  91.8%
                     50%    90.7%   91.0%  91.5%  92.8%  92.8%
ML-prove   3059  51  10%    59.2%   69.7%  64.0%  63.3%  70.1%
                     30%    67.9%   70.3%  72.8%  69.3%  71.3%
                     50%    70.2%   71.0%  73.1%  68.9%  75.5%
6. Conclusion
In this paper, we proposed to use indefinite kernels in the framework of least squares support vector machines. In the training problem itself, there is no difference between definite and indefinite kernels.
Thus, one can easily use an indefinite kernel in LS-SVM by simply changing the kernel evaluation function.
Numerically, indefinite kernels achieve good performance compared with commonly used PSD kernels for both classification and kernel principal component analysis. This good performance motivated us to investigate the feature space interpretation for an indefinite kernel in LS-SVM, which is the main theoretical contribution of this paper. We hope that the theoretical analysis and the good performance shown in this paper can attract research and application interest in indefinite LS-SVM and indefinite kPCA in the future.
Acknowledgments
The authors would like to thank Prof. Lei Shi of Fudan University for helpful discussions. The authors are grateful to the anonymous reviewer for insightful comments.
References
[1] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[2] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[3] J.A.K. Suykens, T. Van Gestel, B. De Moor, J. De Brabanter, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, 2002.
[4] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[5] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004, pp. 639–646.
[6] Y. Ying, C. Campbell, M. Girolami, Analysis of SVM with indefinite kernels, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, 2009, pp. 2205–2213.
[7] S. Gu, Y. Guo, Learning SVM classifiers with indefinite kernels, in: Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012, pp. 942–948.
[8] G. Loosli, S. Canu, C.S. Ong, Learning SVM in Kreĭn spaces, IEEE Trans. Pattern Anal. Mach. Intell. 38 (6) (2016) 1204–1216.
[9] E. Pekalska, B. Haasdonk, Kernel discriminant analysis for positive definite and indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 31 (6) (2009) 1017–1032.
[10] B. Haasdonk, E. Pekalska, Indefinite kernel discriminant analysis, in: Proceedings of the 16th International Conference on Computational Statistics (COMPSTAT), 2010, pp. 221–230.
[11] S. Zafeiriou, Subspace learning in Kreĭn spaces: complete kernel Fisher discriminant analysis with indefinite kernels, in: Proceedings of the European Conference on Computer Vision (ECCV) 2012, 2012, pp. 488–501.
[12] E. Pekalska, P. Paclik, R.P. Duin, A generalized kernel approach to dissimilarity-based classification, J. Mach. Learn. Res. 2 (2002) 175–211.
[13] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551.
[14] R. Luss, A. d'Aspremont, Support vector machine classification with indefinite kernels, in: J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 2007, pp. 953–960.
[15] F. Schleif, P. Tino, Indefinite proximity learning: a review, Neural Comput. 27 (2015) 2039–2096.
[16] H.-T. Lin, C.-J. Lin, A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods, technical report, Department of Computer Science, National Taiwan University, 2003, http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf.
[17] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, 1999, pp. 185–208.
[18] R.-E. Fan, P.-H. Chen, C.-J. Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[19] J.A.K. Suykens, T. Van Gestel, J. Vandewalle, B. De Moor, A support vector machine formulation to PCA analysis and its kernel version, IEEE Trans. Neural Netw. 14 (2) (2003) 447–450.
[20] H. Ling, D.W. Jacobs, Using the inner-distance for classification of articulated shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2005, 2005, pp. 719–726.
[21] M.M. Deza, E. Deza, Encyclopedia of Distances, Springer, New York, 2009.
[22] W. Xu, R.C. Wilson, E.R. Hancock, Determining the cause of negative dissimilarity eigenvalues, in: Computer Analysis of Images and Patterns, Springer, New York, 2011, pp. 589–597.
[23] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, K. Obermayer, Classification on pairwise proximity data, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, 1998, pp. 438–444.
[24] B. Haasdonk, Feature space interpretation of SVMs with indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 482–492.
[25] I. Alabdulmohsin, X. Gao, X. Zhang, Support vector machines with indefinite kernels, in: Proceedings of the 6th Asian Conference on Machine Learning (ACML), 2014, pp. 32–47.
[26] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal. 30 (1) (2011) 96–109.
[27] Q. Wu, Regularization networks with indefinite kernels, J. Approx. Theory 166 (2013) 1–18.
[28] T. Van Gestel, J.A.K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis, Neural Comput. 14 (5) (2002) 1115–1147.
[29] A.J. Smola, Z.L. Ovari, R.C. Williamson, Regularization with dot-product kernels, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, 2000, pp. 308–314.
[30] H. Saigo, J.P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.
[31] B. Haasdonk, H. Burkhardt, Invariant kernel functions for pattern analysis and machine learning, Mach. Learn. 68 (1) (2007) 35–61.
[32] X. Huang, J.A.K. Suykens, S. Wang, A. Maier, J. Hornegger, Classification with truncated ℓ1 distance kernel, internal report 15-211, ESAT-SISTA, KU Leuven, 2015.
[33] M.D. Buhmann, Radial Basis Functions: Theory and Implementations, Cambridge Monographs on Applied and Computational Mathematics, vol. 12, 2004, pp. 147–165.
[34] H. Wendland, Scattered Data Approximation, Cambridge University Press, Cambridge, UK, 2004.
[35] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, J.A.K. Suykens, LS-SVMlab toolbox user's guide, internal report 10-146, ESAT-SISTA, KU Leuven, 2011.
[36] A. Frank, A. Asuncion, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA, 2010, http://archive.ics.uci.edu/ml.
[37] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.