Citation/Reference: Huang X., Maier A., Hornegger J., Suykens J.A.K., "Indefinite kernels in least squares support vector machines and principal component analysis", Applied and Computational Harmonic Analysis, Sep. 2016.

Archived version: author manuscript (the content is identical to the content of the published paper, but without the final typesetting by the publisher).

Published version: http://dx.doi.org/10.1016/j.acha.2016.09.001

Journal homepage: http://www.journals.elsevier.com/applied-and-computational-harmonic-analysis/

IR: https://lirias.kuleuven.be/handle/123456789/557941


Indefinite kernels in least squares support vector machines and principal component analysis

Xiaolin Huang (a,c,*), Andreas Maier (c), Joachim Hornegger (c), Johan A.K. Suykens (b)

(a) Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China
(b) KU Leuven, ESAT-STADIUS, B-3001 Leuven, Belgium
(c) Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen–Nürnberg, 91058 Erlangen, Germany

Article history: Received 15 March 2016; received in revised form 29 August 2016; accepted 2 September 2016; available online xxxx. Communicated by Charles K. Chui.

Keywords: Least squares support vector machine; Indefinite kernel; Classification; Kernel principal component analysis

Abstract. Because of several successful applications, indefinite kernels have attracted much research interest in recent years. This paper addresses indefinite learning in the framework of least squares support vector machines (LS-SVM). Unlike existing indefinite kernel learning methods, which usually involve non-convex problems, indefinite LS-SVM remains easy to solve, but the kernel trick and the primal-dual relationship valid for LS-SVM with a Mercer kernel no longer hold. In this paper, we give a feature space interpretation for indefinite LS-SVM. In the same framework, kernel principal component analysis with an indefinite kernel is discussed as well. In numerical experiments, LS-SVM with indefinite kernels is evaluated for classification and kernel principal component analysis. Its good performance, together with the feature space interpretation given in this paper, implies the potential use of indefinite LS-SVM in real applications.

© 2016 Published by Elsevier Inc.

1. Introduction

Mercer's condition is the traditional requirement on the kernel applied in classical kernel learning methods, such as the support vector machine with the hinge loss (C-SVM, [1]), least squares support vector machines (LS-SVM, [2,3]), and kernel principal component analysis (kPCA, [4]). However, in practice, one may meet sophisticated similarity or dissimilarity measures which lead to kernels violating Mercer's condition.

This work was partially supported by the Alexander von Humboldt Foundation and the National Natural Science Foundation of China (61603248). Johan Suykens acknowledges support of ERC AdG A-DATADRIVE-B (290923), KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; FWO: G.0377.12, G.088114N, SBO POM (100031); IUAP P7/19 DYSCO.

* Corresponding author at: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China.

E-mail addresses: xiaolinhuang@sjtu.edu.cn (X. Huang), andreas.maier@fau.de (A. Maier), joachim.hornegger@fau.de (J. Hornegger), johan.suykens@esat.kuleuven.be (J.A.K. Suykens).



Since the kernel matrices induced by such kernels are real and symmetric, but not positive semi-definite, they are called indefinite kernels, and the corresponding learning methodology is called indefinite learning [5–15].

Two important problems arise for indefinite learning. First, it lacks the classical feature space interpretation available for a Mercer kernel, i.e., we cannot find a nonlinear feature mapping whose inner product gives the value of an indefinite kernel function. Second, the lack of positive semi-definiteness makes many learning models non-convex when an indefinite kernel is used. In the last decades, there has been continuous progress on these issues. In theory, indefinite learning in C-SVM has been discussed in Reproducing Kernel Kreĭn Spaces (RKKS), cf. [5–8]. Kernel Fisher discriminant analysis with an indefinite kernel, also discussed in RKKS, can be found in [9–11]. In algorithm design, the current mainstream is to find an approximate positive semi-definite (PSD) kernel and then apply a classical kernel learning algorithm based on that PSD kernel. Such methods can be found in [12–14], and they have been reviewed and compared in [15]. That review also discusses directly applying indefinite kernels in some classical kernel learning methods which are not sensitive to metric violations. As suggested by [16], one can use an indefinite kernel to replace the PSD kernel in the dual formulation of C-SVM and solve it by sequential minimal optimization [17,18]. This kind of method enjoys a computational efficiency similar to the classical learning methods and hence is more attractive in practice.

Following the way of introducing indefinite kernels to C-SVM, we consider indefinite learning based on LS-SVM. Notice that using an indefinite kernel in C-SVM results in a non-convex problem, but indefinite learning based on LS-SVM is still easy to solve. However, Mercer's theorem is no longer valid: we have to find a new feature space interpretation and to give a characterization in terms of primal and dual problems, which are the theoretical targets of this paper. Since kPCA can be conducted in the framework of LS-SVM [19], we discuss kPCA with an indefinite kernel as well.

This paper is organized as follows. Section 2 briefly reviews indefinite learning. Section 3 addresses LS-SVM with indefinite kernels and provides its feature space interpretation. A similar discussion on kPCA is given in Section 4. Then the performance of indefinite learning based on LS-SVM is evaluated by numerical experiments in Section 5. Finally, Section 6 gives a short conclusion.

2. Indefinite kernels

We start the discussion from C-SVM with a Mercer kernel. Given a set of training data {x_i, y_i}_{i=1}^m with x_i ∈ R^n and y_i ∈ {−1, +1}, we try to construct a discriminant function f(x): R^n → R and use its sign for classification. Except for linearly separable problems, a nonlinear feature mapping φ(x) is needed, and the discriminant function is usually formulated as f(x) = w^⊤φ(x) + b. C-SVM trains w and b by the following optimization problem:

$$\min_{w, b, \xi}\ \frac{1}{2} w^\top w + C \sum_{i=1}^{m} \xi_i \qquad (1)$$

$$\text{s.t.}\ \ y_i\left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad \forall i \in \{1, \dots, m\}.$$

It is well known that the dual problem of (1) takes the following form:

$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i \alpha_i K_{ij} \alpha_j y_j \qquad (2)$$

$$\text{s.t.}\ \ \sum_{i=1}^{m} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \qquad \forall i \in \{1, \dots, m\},$$


where K_ij = K(x_i, x_j) = φ(x_i)^⊤φ(x_j) is the kernel matrix. For any kernel K satisfying Mercer's condition, there is always a feature map φ such that K(x_i, x_j) = φ(x_i)^⊤φ(x_j). This allows us to construct a classifier that maximizes the margin in the feature space without explicitly knowing φ.

Traditionally, in (2), we require K to be positive semi-definite. But in some applications, especially in computer vision, there are many distances or dissimilarities for which the corresponding matrices are not PSD [20–22]. It is also possible that a kernel is PSD but this is very hard to verify [5]. Even for a PSD kernel, noise may make the dissimilarity matrix non-PSD [23,24]. All these facts motivated researchers to think about indefinite kernels in C-SVM. Notice that "indefinite kernels" literally covers many kernels, including asymmetric ones induced by asymmetric distances. But, as in the rest of the indefinite learning literature, in this paper we restrict "indefinite kernel" to kernels that correspond to real symmetric indefinite matrices.
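As a quick numerical illustration (a minimal sketch of ours, with arbitrary data and parameters), one can verify that the tanh kernel discussed later yields a Gram matrix that is real and symmetric yet has negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # 20 random points in R^5

# tanh kernel with a negative offset, a classical indefinite example
K = np.tanh(X @ X.T - 1.0)

lam = np.linalg.eigvalsh(K)        # eigenvalues are real since K is symmetric
print(lam.min() < 0)               # typically True: K is indefinite
```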

In theory, using indefinite kernels in C-SVM makes Mercer's theorem inapplicable, which means that (1) and (2) are not a pair of primal-dual problems, and the solution of (2) then cannot be explained as margin maximization in a feature space. Moreover, the learning theory and approximation theory for C-SVM with PSD kernels are not valid, since the functional space spanned by indefinite kernels does not belong to any Reproducing Kernel Hilbert Space (RKHS). To link an indefinite kernel to RKHS, we need a positive decomposition. Its definition is given by [5] as follows: an indefinite kernel K has a positive decomposition if there are two PSD kernels K_+, K_- such that

$$K(u, v) = K_+(u, v) - K_-(u, v), \quad \forall u, v. \qquad (3)$$

For an indefinite kernel K that has a positive decomposition, there exist Reproducing Kernel Kreĭn Spaces (RKKS). Conditions for the existence of a positive decomposition are given by [5]; however, for a specific kernel, those conditions are usually hard to verify in practice, and whether every indefinite kernel has a positive decomposition is still an open question. But at least, when the training data are given, the kernel matrix K has a decomposition as the difference of two PSD matrices; that is, (3) is always valid for u, v ∈ {x_i}_{i=1}^m. Thus, indefinite learning can be theoretically analyzed in RKKS and be implemented based on matrix decomposition in practice.

The feature space interpretation for indefinite learning is given by [24] for a finite-dimensional Kreĭn space, which is also called a pseudo-Euclidean (pE) space. A pE space is denoted as R(p,q) with non-negative integers p and q. This space is a product of two Euclidean vector spaces, R^p × iR^q. An element in R(p,q) can be represented by its coordinate vector, and the coordinate vectors give the inner product ⟨u, v⟩_pE = u^⊤ M v, where M is a diagonal matrix with the first p diagonal components equal to 1 and the others equal to −1. If we link the components of M to the signs of the eigenvalues of the indefinite kernel matrix K, solving (2) for K is interpreted in [24] as distance minimization in R(p,q). For the learning behavior in RKKS, one can find discussions on the space size in [5], error bounds in [25], and asymptotic convergence in [26,27].
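To make the pE inner product concrete, here is a small sketch (p, q and the vectors are illustrative choices of ours):

```python
import numpy as np

p, q = 2, 1                              # the space R^(2,1)
M = np.diag([1.0] * p + [-1.0] * q)      # signature matrix of the pE inner product

u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])

print(u @ M @ v)    # <u, v>_pE = u' M v
print(u @ M @ u)    # can be negative: the pE inner product induces no norm
```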

When an indefinite kernel is used in C-SVM, (2) becomes a non-convex quadratic problem, since K is not positive semi-definite. For such a non-convex problem, many algorithms relying on global optimality are invalid. An alternative way is to find an approximate PSD matrix K̃ for an indefinite one K, and then solve (2) for K̃. To obtain K̃, one can adjust the eigenvalues of K by: i) setting all negative eigenvalues to zero [12]; ii) flipping the signs of the negative eigenvalues [13]; iii) squaring the eigenvalues [26,27]. It can also be done by minimizing the Frobenius distance between K and K̃, as introduced by [14]. Since training and classification are then based on two different kernels, the above methods are efficient only when K and K̃ are similar. Those methods are also time-consuming, since they additionally involve eigenvalue problems. To pursue computational effectiveness, one can use descent algorithms, e.g., sequential minimal optimization (SMO) developed by [17,18], to directly solve (2) for an indefinite kernel matrix. Though only local optima are guaranteed, the performance is still promising, as reported by [15] and [16].
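The three eigenvalue adjustments have a compact matrix form. The sketch below (our notation, assuming a precomputed symmetric kernel matrix K) shows all of them:

```python
import numpy as np

def approximate_psd(K, method="clip"):
    """PSD approximation of a symmetric, possibly indefinite matrix K."""
    lam, U = np.linalg.eigh(K)       # K = U diag(lam) U'
    if method == "clip":             # i) set negative eigenvalues to zero [12]
        lam = np.maximum(lam, 0.0)
    elif method == "flip":           # ii) flip the signs of negative eigenvalues [13]
        lam = np.abs(lam)
    elif method == "square":         # iii) square the eigenvalues [26,27]
        lam = lam ** 2
    return (U * lam) @ U.T           # reassemble U diag(lam) U'
```

The eigendecomposition dominates the cost here, which is exactly the overhead the direct SMO approach avoids.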


3. LS-SVM with real symmetric indefinite kernels

The current indefinite learning discussions are mainly for C-SVM. In this paper, we propose to use indefinite kernels in the framework of least squares support vector machines. In the dual space, LS-SVM solves the following linear system [2]:

$$\begin{bmatrix} 0 & y^\top \\ y & H + \frac{1}{\gamma} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (4)$$

where I is the identity matrix, 1 is an all-ones vector of the proper dimension, α = [α_1, ..., α_m]^⊤, and H is given by H_ij = y_i y_j K_ij = y_i y_j K(x_i, x_j).

We assume that the matrix in (4) is full rank. Then its solution can be obtained efficiently, and the corresponding discriminant function is

$$f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b.$$

The existing discussion of LS-SVM usually requires K to be positive semi-definite, such that Mercer's theorem is applicable and the solution of (4) is related to Fisher discriminant analysis in the feature space [28].
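Since (4) is a plain linear system, it can be solved with any dense solver. A minimal numpy sketch (our variable names, not the authors' implementation) is:

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM linear system (4); nothing here requires K to be PSD."""
    m = len(y)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y                                        # first row:    [0, y']
    A[1:, 0] = y                                        # first column: [0; y]
    A[1:, 1:] = np.outer(y, y) * K + np.eye(m) / gamma  # H + I / gamma
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)                       # assumes full rank, as above
    return sol[0], sol[1:]                              # b, alpha

def lssvm_predict(K_new, y, alpha, b):
    """Evaluate f(x) = sum_i y_i alpha_i K(x, x_i) + b for rows of K_new."""
    return K_new @ (y * alpha) + b
```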

Now let us investigate indefinite kernels in LS-SVM (4). One good property is that even when K is indefinite, (4) is still easy to solve, which differs from C-SVM, where an indefinite kernel makes (2) non-convex. Though (4) with an indefinite kernel is easy to solve, the solution loses many properties of PSD kernels, and its feature space interpretation has to be analyzed in a pE space as well. This is based on the following proposition:

Proposition 1. Let α, b be the solution of (4) for a symmetric but indefinite kernel matrix K.

i) There exist two feature maps φ_+ and φ_- such that

$$w_+ = \sum_{i=1}^{m} \alpha_i y_i \phi_+(x_i), \qquad w_- = -\sum_{i=1}^{m} \alpha_i y_i \phi_-(x_i),$$

which, together with b, is a stationary point of the following primal problem:

$$\min_{w_+, w_-, b, \xi}\ \frac{1}{2}\left( w_+^\top w_+ - w_-^\top w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 \qquad (5)$$

$$\text{s.t.}\ \ y_i\left( w_+^\top \phi_+(x_i) + w_-^\top \phi_-(x_i) + b \right) = 1 - \xi_i, \quad \forall i \in \{1, 2, \dots, m\}.$$

ii) The dual problem of (5) is given by (4), where

$$K_{ij} = K(x_i, x_j) = K_+(x_i, x_j) - K_-(x_i, x_j) \qquad (6)$$

with two PSD kernels K_+ and K_-:

$$K_+(x_i, x_j) = \phi_+(x_i)^\top \phi_+(x_j), \qquad (7)$$

and

$$K_-(x_i, x_j) = \phi_-(x_i)^\top \phi_-(x_j). \qquad (8)$$


Proof. For an indefinite kernel K, we can always find two PSD kernels K_+, K_- and corresponding feature maps φ_+, φ_- satisfying (6)–(8). Using φ_+ and φ_- in (5), the Lagrangian of (5) can be written as

$$L(w_+, w_-, b, \xi; \alpha) = \frac{1}{2}\left( w_+^\top w_+ - w_-^\top w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i\left( w_+^\top \phi_+(x_i) + w_-^\top \phi_-(x_i) + b \right) - 1 + \xi_i \right].$$

Then the conditions of a stationary point yield

$$\frac{\partial L}{\partial w_+} = w_+ - \sum_{i=1}^{m} \alpha_i y_i \phi_+(x_i) = 0,$$

$$\frac{\partial L}{\partial w_-} = -w_- - \sum_{i=1}^{m} \alpha_i y_i \phi_-(x_i) = 0,$$

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{m} \alpha_i y_i = 0,$$

$$\frac{\partial L}{\partial \xi_i} = \gamma \xi_i - \alpha_i = 0,$$

$$\frac{\partial L}{\partial \alpha_i} = y_i\left( w_+^\top \phi_+(x_i) + w_-^\top \phi_-(x_i) + b \right) - 1 + \xi_i = 0.$$

Eliminating the primal variables w_+, w_-, ξ, we get the optimality conditions

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \quad \text{and} \quad y_i\left( \sum_{j=1}^{m} \alpha_j y_j \left( \phi_+(x_i)^\top \phi_+(x_j) - \phi_-(x_i)^\top \phi_-(x_j) \right) + b \right) - 1 + \frac{\alpha_i}{\gamma} = 0.$$

Substituting (6)–(8) into the above conditions leads to (4). Therefore, (4) is the dual problem of (5). If α, b is the solution of (4), then b and

$$w_+ = \sum_{i=1}^{m} \alpha_i y_i \phi_+(x_i), \qquad w_- = -\sum_{i=1}^{m} \alpha_i y_i \phi_-(x_i)$$

satisfy the first-order optimality conditions of (5), i.e., w_+, w_-, b is a stationary point of (5). □

Proposition 1 gives the primal problem and a feature space interpretation for LS-SVM with an indefinite kernel. Its proof relies on the positive decomposition (6) of the kernel matrix K, which exists for all real symmetric kernel matrices. But this does not mean that we can find a positive decomposition for the kernel function K, i.e., (3) is not necessarily valid, and the verification is usually hard for a specific kernel. If such a kernel decomposition exists, Proposition 1 further shows that (4) pursues a small within-class scatter in a pE space R(p,q). If not, the within-class scatter is minimized in a space associated with an approximate kernel K̃ = K_+ − K_-, which equals K on all the training data. In (7) and (8), the dimension of the feature maps could be infinite, and then the conclusion extends to the corresponding RKKS.
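On a given training set, the decomposition used in the proof can be constructed explicitly from the eigendecomposition of the kernel matrix, together with empirical feature maps realizing (7) and (8) on the training points; a sketch of ours:

```python
import numpy as np

def positive_decomposition(K):
    """Split a symmetric K into PSD parts with K = K_plus - K_minus."""
    lam, U = np.linalg.eigh(K)
    pos, neg = np.maximum(lam, 0.0), np.maximum(-lam, 0.0)
    K_plus = (U * pos) @ U.T      # carries the positive part of the spectrum
    K_minus = (U * neg) @ U.T     # carries the flipped negative part
    Phi_plus = U * np.sqrt(pos)   # rows: phi_+(x_i); Phi_plus @ Phi_plus.T == K_plus
    Phi_minus = U * np.sqrt(neg)  # rows: phi_-(x_i); Phi_minus @ Phi_minus.T == K_minus
    return K_plus, K_minus, Phi_plus, Phi_minus
```

With p positive and q negative eigenvalues, the rows of the two maps give coordinates in the pE space R(p,q) of Section 2.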

4. Real symmetric indefinite kernel in PCA

In the last section, we considered LS-SVM with an indefinite kernel for binary classification. The analysis is applicable to other tasks which can be solved in the framework of LS-SVM. In [19], the link between kernel principal component analysis and LS-SVM has been investigated. Accordingly, we can give the feature space interpretation for kernel PCA with an indefinite kernel.


For given data {x_i}_{i=1}^m, kernel PCA solves an eigenvalue problem:

$$\Omega \alpha = \lambda \alpha, \qquad (9)$$

where the centered kernel matrix Ω is induced from a kernel K as follows:

$$\Omega_{ij} = K(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K(x_r, x_s).$$

Traditionally, K is limited to be a PSD kernel. Then a Mercer kernel is employed and (9) maximizes the variance in the related feature space.
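In matrix form, the centering reads Ω = H_c K H_c with H_c = I − (1/m)11^⊤, and the computation is identical whether K is PSD or indefinite; a sketch (our naming):

```python
import numpy as np

def kpca_eigen(K):
    """Centered kernel eigenvalue problem (9); K may be indefinite."""
    m = K.shape[0]
    Hc = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    Omega = Hc @ K @ Hc                    # equals the double-centered formula above
    lam, A = np.linalg.eigh(Omega)         # real eigenvalues, possibly negative
    order = np.argsort(-np.abs(lam))       # one option: rank components by |lambda|
    return lam[order], A[:, order]
```

Ranking components by |λ| is one plausible convention for the indefinite case; the paper does not prescribe a particular ordering.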

Following the same way of introducing an indefinite kernel in C-SVM or LS-SVM, we can directly use an indefinite kernel for kPCA (9). Notice that for an indefinite kernel, the eigenvalues can be both positive and negative; all of them are still real, due to the use of a symmetric kernel. There is no difference in the problem itself, and the projected variables can be calculated in the same way. However, the feature space interpretation fundamentally changes, which is discussed in the following proposition.

Proposition 2. Let α be the solution of (9) for an indefinite kernel K.

i) There are two feature maps φ_+ and φ_- such that

$$w_+ = \sum_{i=1}^{m} \alpha_i \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right), \qquad w_- = -\sum_{i=1}^{m} \alpha_i \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right),$$

which gives a stationary point of the following primal problem:

$$\max_{w_+, w_-}\ \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \frac{1}{2}\left( w_+^\top w_+ - w_-^\top w_- \right) \qquad (10)$$

$$\text{s.t.}\ \ \xi_i = w_+^\top \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^\top \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right), \quad \forall i \in \{1, \dots, m\}.$$

Here, μ̂_{φ_+} and μ̂_{φ_-} are the centering terms, i.e.,

$$\hat{\mu}_{\phi_+} = \frac{1}{m} \sum_{i=1}^{m} \phi_+(x_i) \quad \text{and} \quad \hat{\mu}_{\phi_-} = \frac{1}{m} \sum_{i=1}^{m} \phi_-(x_i).$$

ii) If we choose γ as γ = 1/λ and decompose K as in (6)–(8), then the dual problem of (10) is given by (9).

Proof. Again, for an indefinite kernel K, we can find two PSD kernels K_+, K_- and corresponding nonlinear feature maps φ_+, φ_- satisfying (6)–(8).

The Lagrangian of (10) can be written as

$$L(w_+, w_-, \xi; \alpha) = \frac{1}{2}\left( w_+^\top w_+ - w_-^\top w_- \right) - \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ w_+^\top \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^\top \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i \right].$$


Then from the conditions of a stationary point, we have

$$\frac{\partial L}{\partial w_+} = w_+ - \sum_{i=1}^{m} \alpha_i \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) = 0,$$

$$\frac{\partial L}{\partial w_-} = -w_- - \sum_{i=1}^{m} \alpha_i \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) = 0,$$

$$\frac{\partial L}{\partial \xi_i} = -\gamma \xi_i + \alpha_i = 0,$$

$$\frac{\partial L}{\partial \alpha_i} = w_+^\top \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^\top \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i = 0.$$

Elimination of the primal variables results in the following optimality condition:

$$\frac{1}{\gamma} \alpha_i - \sum_{j=1}^{m} \alpha_j \left[ \left( \phi_+(x_j) - \hat{\mu}_{\phi_+} \right)^\top \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) - \left( \phi_-(x_j) - \hat{\mu}_{\phi_-} \right)^\top \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) \right] = 0, \quad \forall i \in \{1, 2, \dots, m\}. \qquad (11)$$

Applying the kernel trick (7) and (8), we know that

$$\left( \phi_\pm(x_i) - \hat{\mu}_{\phi_\pm} \right)^\top \left( \phi_\pm(x_j) - \hat{\mu}_{\phi_\pm} \right) = K_\pm(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K_\pm(x_r, x_s).$$

Together with (6), the optimality condition (11) can be formulated as the eigenvalue problem (9) with λ = 1/γ. Therefore, (9) is the dual problem of (10) and gives a stationary solution of (10), which aims at maximal variance, just as kPCA with PSD kernels does. □

5. Numerical experiments

In the preceding sections, we discussed the use of indefinite kernels in the framework of LS-SVM for classification and kernel principal component analysis, respectively. The general conclusions are: i) indefinite LS-SVM shares the same optimization model as the PSD one, and hence the same toolbox, namely LS-SVMlab [35], is applicable; ii) in terms of computational load, there is no difference between a PSD kernel and an indefinite kernel, i.e., using an indefinite kernel in LS-SVM brings no additional computational burden; iii) the feature space interpretation of LS-SVM with an indefinite kernel is extended to a pE space, and only a stationary point can be obtained.

In theory, indefinite kernels are a generalization of PSD ones, which are constrained to have a zero negative part K_- in (6). In practice, indefinite kernels have been successfully applied in specific applications [29–31]. Now, with the feature space interpretation given in this paper, one can use LS-SVM and its modifications to learn from an indefinite kernel. In general, algorithmic properties holding for LS-SVM with PSD kernels remain valid when an indefinite kernel is used.

In this section, we test the performance of LS-SVM with indefinite kernels on some benchmark problems. It should be noticed that the performance heavily relies on the choice of kernel. Though some indefinite kernels have been designed for specific tasks, it is still hard to find an indefinite kernel suitable for a wide range of problems. Therefore, PSD kernels, especially the radial basis function (RBF) kernel and the polynomial kernel, are currently dominant in kernel learning. One challenger among indefinite kernels is the tanh kernel, which has been evaluated in the framework of C-SVM [8,16].


Table 1
Test accuracy of LS-SVM with PSD and indefinite kernels. The best result for each dataset, in the sense of average accuracy, is marked with an asterisk.

Dataset     m     n    RBF (CV)  poly (CV)  tanh (CV)  TL1 (ρ = 0.7n)  TL1 (CV)
DBWords     32    242  84.3%     85.6%*     75.0%      85.2%           84.4%
Fertility   50    9    86.7%     80.4%      83.8%      86.7%           87.8%*
Planning    91    12   70.2%     67.9%      71.6%      70.6%           73.6%*
Sonar       104   60   84.5%*    83.1%      72.9%      84.3%           83.6%
Statlog     135   13   81.4%     75.2%      82.7%      83.8%*          83.5%
Monk1       124   6    79.1%     78.3%      76.6%      73.4%           85.2%*
Monk2       169   6    84.1%*    75.6%      69.9%      53.4%           83.7%
Monk3       122   6    93.5%     93.5%      88.0%      97.2%*          97.2%*
Climate     270   20   93.2%*    91.7%      92.6%      91.9%           92.0%
Liver       292   10   69.0%     67.9%      67.2%      69.7%           71.8%*
Austr.      345   14   85.0%     85.1%      86.6%      86.0%           86.7%*
Breast      349   10   96.4%     95.7%      96.6%      97.0%           97.1%*
Trans.      374   4    78.3%     78.5%      78.9%*     78.4%           78.4%
Splice      1000  121  89.4%     87.3%      90.1%      93.6%           94.9%*
Spamb.      2300  57   93.1%     92.4%      91.7%      94.1%           94.2%*
ML-prove    3059  51   72.5%     74.6%      71.8%      79.1%           79.3%*

Another possible indefinite kernel is the truncated ℓ1 distance (TL1) kernel, which has been recently proposed in [32]. The mentioned kernels are listed below:

• PSD kernels:
  – linear kernel: K(u, v) = u^⊤v,
  – RBF kernel with parameter σ: K(u, v) = exp(−‖u − v‖₂² / σ²),
  – polynomial kernel with parameters c ≥ 0, d ∈ N₊: K(u, v) = (u^⊤v + c)^d.
• Indefinite kernels:
  – tanh kernel with parameters c, d (see footnote 1): K(u, v) = tanh(c u^⊤v + d),
  – TL1 kernel with parameter ρ: K(u, v) = max{ρ − ‖u − v‖₁, 0}.
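Both indefinite kernels above are one-liners to evaluate on data matrices; a sketch of ours using numpy broadcasting:

```python
import numpy as np

def tanh_kernel(X, Z, c, d):
    """K(u, v) = tanh(c * u'v + d), evaluated for all rows of X and Z."""
    return np.tanh(c * (X @ Z.T) + d)

def tl1_kernel(X, Z, rho):
    """TL1 kernel [32]: K(u, v) = max(rho - ||u - v||_1, 0)."""
    dist1 = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=2)  # pairwise l1 distances
    return np.maximum(rho - dist1, 0.0)
```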

These kernels are compared in the framework of LS-SVM for both classification and principal component analysis. First, consider binary classification problems, for which the data are downloaded from the UCI Repository of Machine Learning Datasets [36]. For some datasets, there are both training and test data; otherwise, we randomly pick half of the data for training and the rest for testing. All training data are normalized to [0, 1]^n in advance. In the training procedure, there are a regularization coefficient and kernel parameters, which are tuned by 10-fold cross validation. Specifically, we randomly partition the training data into 10 subsets; each subset is used for validation in turn, with the remaining ones used for training.

As discussed in [32], the performance of the TL1 kernel is not very sensitive to the value of ρ, and ρ = 0.7n was suggested. We thus also evaluate the TL1 kernel with fixed ρ = 0.7n. With one parameter fewer to tune, the training time can be largely reduced.

The above procedure is repeated 10 times, and the average classification accuracy on test data is reported in Table 1, where the training set size m and the problem dimension n are given as well. The best result for each dataset, in the sense of average accuracy, is marked with an asterisk. The results confirm the potential use of indefinite kernels in LS-SVM: an indefinite kernel can achieve accuracy similar to a PSD kernel on most of the problems and can perform better on some specific problems. This does not mean that an indefinite kernel surely improves the performance over PSD ones, but for some datasets, e.g., Monk1, Monk3, and Splice, it is worth considering indefinite learning with LS-SVM, which may give better accuracy within almost the same training time. Moreover, this experiment, in which the performance of the TL1 kernel with ρ = 0.7n is satisfactory for many datasets, illustrates the good parameter stability of the TL1 kernel.

1 The tanh kernel is conditionally positive definite (CPD) when c ≥ 0 and is indefinite otherwise; see, e.g., [33,34]. In our experiments, we consider both positive and negative c, and hence the tanh kernel is regarded as an indefinite kernel.


Fig. 1. (a) Data points of one class come from the unit sphere and are marked by red circles; the other data points, shown by blue stars, come from a sphere with radius 3. (b) This dataset is not linearly separable, and thus linear PCA is not helpful for distinguishing the two classes; instead, kernel PCA is needed, and if the parameter is suitably chosen, the reduced data can be correctly classified by a linear classifier. (c) The RBF kernel with σ = 0.05; (d) the TL1 kernel with ρ = 0.1n; (e) the TL1 kernel with ρ = 0.2n; (f) the TL1 kernel with ρ = 0.5n, which is similar to linear PCA. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


In the following, we use indefinite kernels for principal component analysis. As an intuitive example, we consider a 3-dimensional sphere problem that distinguishes data from spheres with radii equal to 1 and 3. The data are shown in Fig. 1(a). To reduce the dimension, we apply PCA, kPCA with the RBF kernel, and kPCA with the TL1 kernel, respectively. The obtained two-dimensional data are displayed in Fig. 1(b)–(f), which roughly imply that a suitable indefinite kernel can be used for kernel principal component analysis.

To quantitatively evaluate kPCA with indefinite kernels, we choose the problems from Table 1 whose dimension is higher than 20 and apply kPCA to reduce the data to n_r dimensions. For the reduced data, linear classifiers, trained by linear C-SVM with libsvm [37], are used to classify the test data. The parameters, including the kernel parameters and the regularization constant in linear C-SVM, are tuned by 10-fold cross-validation. In Table 2, the average classification accuracy over 10 trials is listed for different reduction ratios n_r/n. The results illustrate that indefinite kernels can be used for kPCA: the performance is in general comparable to PSD kernels, and for some datasets it is significantly improved.
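A sketch of this evaluation pipeline (ours; it uses the standard out-of-sample centering of the test kernel against the training statistics, and the linear classifier itself is left out):

```python
import numpy as np

def kpca_project(K_train, K_test, n_components):
    """Project training and test points onto leading centered kernel components."""
    m = K_train.shape[0]
    Hc = np.eye(m) - np.ones((m, m)) / m
    Omega = Hc @ K_train @ Hc                           # centered training kernel
    lam, A = np.linalg.eigh(Omega)
    A = A[:, np.argsort(-np.abs(lam))[:n_components]]   # keep largest |eigenvalues|
    t = K_test.shape[0]
    J = np.ones((t, m)) / m                             # row-averaging helper
    Kt_c = (K_test - J @ K_train
            - K_test @ np.ones((m, m)) / m
            + J @ K_train @ np.ones((m, m)) / m)
    return Omega @ A, Kt_c @ A                          # train scores, test scores
```

The returned coordinates can then be fed to any linear classifier, e.g., linear C-SVM as in the experiments.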

Summarizing all the experiments above, we observe the potential use of indefinite kernels in LS-SVM for classification and kPCA. For example, the TL1 kernel has performance similar to the RBF kernel on many problems and much better results on several datasets. Our aim in these experiments is not to claim which kernel is the best, which actually depends on the specific problem. Instead, we show that for some problems a proper indefinite kernel can significantly improve the performance over PSD ones, which may motivate researchers to design indefinite kernels and to use them in LS-SVMs.


Table 2
Test accuracy based on kPCA with different reduction ratios.

Dataset     m     n   Ratio  Linear  RBF    poly   tanh   TL1
Sonar       104   60  10%    72.6%   75.6%  75.2%  63.8%  77.9%
                      30%    73.1%   79.1%  78.2%  71.0%  80.4%
                      50%    75.9%   80.7%  79.0%  71.9%  81.9%
Climate     270   21  10%    90.4%   90.5%  91.4%  91.5%  90.5%
                      30%    90.9%   90.8%  91.4%  91.6%  90.9%
                      50%    91.6%   91.4%  93.9%  91.6%  91.9%
Qsar        528   41  10%    74.4%   77.8%  75.5%  77.5%  78.8%
                      30%    85.4%   86.4%  84.1%  84.5%  85.9%
                      50%    85.9%   86.7%  85.4%  86.0%  86.2%
Splice      1000  60  10%    83.7%   86.6%  85.5%  83.7%  91.9%
                      30%    83.9%   87.7%  85.3%  83.0%  91.1%
                      50%    84.1%   87.8%  86.5%  85.2%  91.3%
Spamb.      2300  57  10%    84.7%   86.5%  84.9%  86.4%  88.4%
                      30%    87.7%   89.9%  88.3%  89.9%  91.8%
                      50%    90.7%   91.0%  91.5%  92.8%  92.8%
ML-prove    3059  51  10%    59.2%   69.7%  64.0%  63.3%  70.1%
                      30%    67.9%   70.3%  72.8%  69.3%  71.3%
                      50%    70.2%   71.0%  73.1%  68.9%  75.5%

6. Conclusion

In this paper, we proposed to use indefinite kernels in the framework of least squares support vector machines. In the training problem itself, there is no difference between definite and indefinite kernels; thus, one can easily use an indefinite kernel in LS-SVM by simply changing the kernel evaluation function. Numerically, indefinite kernels achieve good performance compared with commonly used PSD kernels for both classification and kernel principal component analysis. This good performance motivated us to investigate the feature space interpretation of indefinite kernels in LS-SVM, which is the main theoretical contribution of this paper. We hope that the theoretical analysis and the good performance shown in this paper can attract research and application interest in indefinite LS-SVM and indefinite kPCA in the future.

Acknowledgments

The authors would like to thank Prof. Lei Shi of Fudan University for helpful discussions. The authors are grateful to the anonymous reviewers for insightful comments.

References

[1] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

[2] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.

[3] J.A.K. Suykens, T. Van Gestel, B. De Moor, J. De Brabanter, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, 2002.

[4] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.

[5] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004, pp. 639–646.

[6] Y. Ying, C. Campbell, M. Girolami, Analysis of SVM with indefinite kernels, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, 2009, pp. 2205–2213.

[7] S. Gu, Y. Guo, Learning SVM classifiers with indefinite kernels, in: Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012, pp. 942–948.

[8] G. Loosli, S. Canu, C.S. Ong, Learning SVM in Kreĭn spaces, IEEE Trans. Pattern Anal. Mach. Intell. 38 (6) (2016) 1204–1216.

[9] E. Pekalska, B. Haasdonk, Kernel discriminant analysis for positive definite and indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 31 (6) (2009) 1017–1032.

[10] B. Haasdonk, E. Pekalska, Indefinite kernel discriminant analysis, in: Proceedings of the 16th International Conference on Computational Statistics (COMPSTAT), 2010, pp. 221–230.

[11] S. Zafeiriou, Subspace learning in Kreĭn spaces: complete kernel Fisher discriminant analysis with indefinite kernels, in: Proceedings of the European Conference on Computer Vision (ECCV) 2012, 2012, pp. 488–501.

[12] E. Pekalska, P. Paclik, R.P. Duin, A generalized kernel approach to dissimilarity-based classification, J. Mach. Learn. Res. 2 (2002) 175–211.

[13] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551.

[14] R. Luss, A. d'Aspremont, Support vector machine classification with indefinite kernels, in: J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 2007, pp. 953–960.

[15] F. Schleif, P. Tino, Indefinite proximity learning: a review, Neural Comput. 27 (2015) 2039–2096.

[16] H.-T. Lin, C.-J. Lin, A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods, technical report, Department of Computer Science, National Taiwan University, 2003, http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf.

[17] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 185–208.

[18] R.-E. Fan, P.-H. Chen, C.-J. Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn. Res. 6 (2005) 1889–1918.

[19] J.A.K. Suykens, T. Van Gestel, J. Vandewalle, B. De Moor, A support vector machine formulation to PCA analysis and its kernel version, IEEE Trans. Neural Netw. 14 (2) (2003) 447–450.

[20] H. Ling, D.W. Jacobs, Using the inner-distance for classification of articulated shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2005, 2005, pp. 719–726.

[21] M.M. Deza, E. Deza, Encyclopedia of Distances, Springer, New York, 2009.

[22] W. Xu, R.C. Wilson, E.R. Hancock, Determining the cause of negative dissimilarity eigenvalues, in: Computer Analysis of Images and Patterns, Springer, New York, 2011, pp. 589–597.

[23] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, K. Obermayer, Classification on pairwise proximity data, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, 1998, pp. 438–444.

[24] B. Haasdonk, Feature space interpretation of SVMs with indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 482–492.

[25] I. Alabdulmohsin, X. Gao, X. Zhang, Support vector machines with indefinite kernels, in: Proceedings of the 6th Asian Conference on Machine Learning (ACML), 2014, pp. 32–47.

[26] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal. 30 (1) (2011) 96–109.

[27] Q. Wu, Regularization networks with indefinite kernels, J. Approx. Theory 166 (2013) 1–18.

[28] T. Van Gestel, J.A.K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis, Neural Comput. 14 (5) (2002) 1115–1147.

[29] A.J. Smola, Z.L. Ovari, R.C. Williamson, Regularization with dot-product kernels, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, 2000, pp. 308–314.

[30] H. Saigo, J.P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.

[31] B. Haasdonk, H. Burkhardt, Invariant kernel functions for pattern analysis and machine learning, Mach. Learn. 68 (1) (2007) 35–61.

[32] X. Huang, J.A.K. Suykens, S. Wang, A. Maier, J. Hornegger, Classification with truncated ℓ1 distance kernel, Internal Report 15-211, ESAT-SISTA, KU Leuven, 2015.

[33] M.D. Buhmann, Radial Basis Functions: Theory and Implementations, Cambridge Monographs on Applied and Computational Mathematics, vol. 12, 2004, pp. 147–165.

[34] H. Wendland, Scattered Data Approximation, Cambridge University Press, Cambridge, UK, 2004.

[35] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, J.A.K. Suykens, LS-SVMlab toolbox user's guide, Internal Report 10-146, ESAT-SISTA, KU Leuven, 2011.

[36] A. Frank, A. Asuncion, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA, 2010, http://archive.ics.uci.edu/ml.

[37] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.
