Citation/Reference: Huang X., Maier A., Hornegger J., Suykens J.A.K., "Indefinite Kernels in Least Squares Support Vector Machines and Principal Component Analysis", Applied and Computational Harmonic Analysis, Sep. 2016.
Archived version: Author manuscript. The content is identical to the content of the published paper, but without the final typesetting by the publisher.
Published version: http://dx.doi.org/10.1016/j.acha.2016.09.001
Journal homepage: http://www.journals.elsevier.com/applied-and-computational-harmonic-analysis/
IR: https://lirias.kuleuven.be/handle/123456789/557941
Indefinite kernels in least squares support vector machines and principal component analysis
Xiaolin Huang a,c,∗, Andreas Maier c, Joachim Hornegger c, Johan A.K. Suykens b
a Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China
b KU Leuven, ESAT-STADIUS, B-3001 Leuven, Belgium
c Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen–Nürnberg, 91058 Erlangen, Germany
Article info
Article history: Received 15 March 2016; received in revised form 29 August 2016; accepted 2 September 2016; available online xxxx. Communicated by Charles K. Chui.
Keywords: Least squares support vector machine; Indefinite kernel; Classification; Kernel principal component analysis

Abstract
Because of several successful applications, indefinite kernels have attracted much research interest in recent years. This paper addresses indefinite learning in the framework of least squares support vector machines (LS-SVM). Unlike existing indefinite kernel learning methods, which usually involve non-convex problems, the indefinite LS-SVM is still easy to solve, but the kernel trick and primal-dual relationship for LS-SVM with a Mercer kernel are no longer valid. In this paper, we give a feature space interpretation for indefinite LS-SVM. In the same framework, kernel principal component analysis with an indefinite kernel is discussed as well. In numerical experiments, LS-SVM with indefinite kernels for classification and kernel principal component analysis is evaluated. Its good performance, together with the feature space interpretation given in this paper, implies the potential use of indefinite LS-SVM in real applications.
© 2016 Published by Elsevier Inc.
1. Introduction
Mercer's condition is the traditional requirement on the kernel applied in classical kernel learning methods, such as the support vector machine with the hinge loss (C-SVM, [1]), least squares support vector machines (LS-SVM, [2,3]), and kernel principal component analysis (kPCA, [4]). However, in practice, one may meet sophisticated similarity or dissimilarity measures which lead to kernels violating Mercer's condition.
✩ This work was partially supported by the Alexander von Humboldt Foundation and the National Natural Science Foundation of China (61603248). Johan Suykens acknowledges support of ERC AdG A-DATADRIVE-B (290923), KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; FWO: G.0377.12, G.088114N, SBO POM (100031); IUAP P7/19 DYSCO.
* Corresponding author at: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, 200240 Shanghai, PR China.
E-mail addresses: xiaolinhuang@sjtu.edu.cn (X. Huang), andreas.maier@fau.de (A. Maier), joachim.hornegger@fau.de (J. Hornegger), johan.suykens@esat.kuleuven.be (J.A.K. Suykens).
Since the kernel matrices induced by such kernels are real and symmetric, but not positive semi-definite, they are called indefinite kernels, and the corresponding learning methodology is called indefinite learning [5–15].
Two important problems arise for indefinite learning. First, it lacks the classical feature space interpretation available for a Mercer kernel, i.e., we cannot find a nonlinear feature mapping such that its inner product gives the value of an indefinite kernel function. Second, the lack of positive definiteness makes many learning models non-convex if an indefinite kernel is used. In the last decades, there has been continuous progress on these issues. In theory, indefinite learning in C-SVM has been discussed in Reproducing Kernel Kreĭn Spaces (RKKS), cf. [5–8]. Kernel Fisher discriminant analysis with an indefinite kernel can be found in [9–11], which is also discussed on RKKS. In algorithm design, the current mainstream is to find an approximate positive semi-definite (PSD) kernel and then apply a classical kernel learning algorithm based on that PSD kernel. These methods can be found in [12–14], and they have been reviewed and compared in [15]. That review also discusses directly applying indefinite kernels in some classical kernel learning methods which are not sensitive to metric violations. As suggested by [16], one can use an indefinite kernel to replace the PSD kernel in the dual formulation of C-SVM and solve it by sequential minimal optimization [17,18]. This kind of method enjoys a similar computational efficiency as the classical learning methods and hence is more attractive in practice.
Following the way of introducing indefinite kernels to C-SVM, we consider indefinite learning based on LS-SVM. Notice that using an indefinite kernel in C-SVM results in a non-convex problem, but indefinite learning based on LS-SVM is still easy to solve. However, Mercer's theorem is no longer valid. We have to find a new feature space interpretation and to give a characterization in terms of primal and dual problems, which are the theoretical targets of this paper. Since kPCA can be conducted in the framework of LS-SVM [19], we will discuss kPCA with an indefinite kernel as well.
This paper is organized as follows. Section 2 briefly reviews indefinite learning. Section 3 addresses LS-SVM with indefinite kernels and provides its feature space interpretation. A similar discussion on kPCA is given in Section 4. Then the performance of indefinite learning based on LS-SVM is evaluated by numerical experiments in Section 5. Finally, Section 6 gives a short conclusion.
2. Indefinite kernels
We start the discussion from C-SVM with a Mercer kernel. Given a set of training data \{x_i, y_i\}_{i=1}^m with x_i \in \mathbb{R}^n and y_i \in \{-1, +1\}, we aim to construct a discriminant function f(x): \mathbb{R}^n \to \mathbb{R} and use its sign for classification. Except for linearly separable problems, a nonlinear feature mapping \phi(x) is needed, and the discriminant function is usually formulated as f(x) = w^T \phi(x) + b. C-SVM trains w and b by the following optimization problem:
    \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i
    \text{s.t. } y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \ \forall i \in \{1, \dots, m\},   (1)
    \xi_i \geq 0, \ \forall i \in \{1, \dots, m\}.
It is well known that the dual problem of (1) takes the following formulation:
    \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i \alpha_i K_{ij} \alpha_j y_j
    \text{s.t. } \sum_{i=1}^{m} y_i \alpha_i = 0,   (2)
    0 \leq \alpha_i \leq C, \ \forall i \in \{1, \dots, m\},
where K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j) is the kernel matrix. For any kernel K which satisfies Mercer's condition, there is always a feature map \phi such that K(x_i, x_j) = \phi(x_i)^T \phi(x_j). This allows us to construct a classifier maximizing the margin in the feature space without explicitly knowing \phi.
Traditionally, in (2), we require positive semi-definiteness of K. But in some applications, especially in computer vision, there are many distances or dissimilarities for which the corresponding matrices are not PSD [20–22]. It is also possible that a kernel is PSD but this is very hard to verify [5]. Even for a PSD kernel, noise may make the dissimilarity matrix non-PSD [23,24]. All these facts motivated researchers to think about indefinite kernels in C-SVM. Notice that "indefinite kernels" literally covers many kernels, including asymmetric ones induced by asymmetric distances. But as in the rest of the indefinite learning literature, in this paper we restrict "indefinite kernel" to kernels that correspond to real symmetric indefinite matrices.
In theory, using indefinite kernels in C-SVM makes Mercer's theorem not applicable, which means that (1) and (2) are not a pair of primal-dual problems, and then the solution of (2) cannot be explained as margin maximization in a feature space. Moreover, the learning theory and approximation theory about C-SVM with PSD kernels are not valid, since the functional space spanned by indefinite kernels does not belong to any Reproducing Kernel Hilbert Space (RKHS). To link an indefinite kernel to an RKHS, we need a positive decomposition. Its definition is given by [5] as follows: an indefinite kernel K has a positive decomposition if there are two PSD kernels K_+, K_- such that
    K(u, v) = K_+(u, v) - K_-(u, v), \ \forall u, v.   (3)
For an indefinite kernel K that has a positive decomposition, there exist Reproducing Kernel Kreĭn Spaces (RKKS). Conditions for the existence of a positive decomposition are given by [5]. However, for a specific kernel, those conditions are usually hard to verify in practice. But at least, when the training data are given, the kernel matrix K has a decomposition as the difference of two PSD matrices. Whether any indefinite kernel has a positive decomposition is still an open question. Fortunately, (3) is always valid for u, v \in \{x_i\}_{i=1}^m. Thus, indefinite learning can be theoretically analyzed in RKKS and be implemented based on matrix decomposition in practice.
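On a fixed training set, such a decomposition of the kernel matrix is easy to compute from its eigendecomposition; a minimal numpy sketch (the toy matrix and all variable names are ours, for illustration only):

```python
import numpy as np

# A small real symmetric but indefinite kernel matrix (eigenvalues 3 and -1).
K = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# Eigendecomposition of the symmetric matrix K.
lam, V = np.linalg.eigh(K)

# Split the spectrum: K_plus keeps the positive part of the spectrum,
# K_minus the (negated) negative part, so that K = K_plus - K_minus.
K_plus  = (V * np.maximum(lam, 0.0)) @ V.T
K_minus = (V * np.maximum(-lam, 0.0)) @ V.T
```

Both `K_plus` and `K_minus` are PSD by construction, which is exactly the matrix-level version of (3) on the training data.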
The feature space interpretation for indefinite learning is given by [24] for a finite-dimensional Kreĭn space, which is also called a pseudo-Euclidean (pE) space. A pE space is denoted by \mathbb{R}^{(p,q)} with non-negative integers p and q. This space is a product of two Euclidean vector spaces \mathbb{R}^p \times i\mathbb{R}^q. An element in \mathbb{R}^{(p,q)} can be represented by its coordinate vector, and the coordinate vector gives the inner product \langle u, v \rangle_{pE} = u^T M v, where M is a diagonal matrix with the first p diagonal entries equal to 1 and the others equal to -1. If we link the components of M with the signs of the eigenvalues of the indefinite kernel matrix K, solving (2) for K is interpreted in [24] as distance minimization in \mathbb{R}^{(p,q)}. For the learning behavior in RKKS, one can find a discussion on the space size in [5], error bounds in [25], and asymptotic convergence in [26,27].
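The pE inner product is straightforward to evaluate once p and q are fixed; a small illustration (the function name is ours):

```python
import numpy as np

def pe_inner(u, v, p, q):
    """Pseudo-Euclidean inner product <u, v>_pE = u^T M v in R(p, q),
    where M = diag(1, ..., 1, -1, ..., -1) has p ones and q minus-ones."""
    m_diag = np.concatenate([np.ones(p), -np.ones(q)])
    return float(u @ (m_diag * v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
# In R(2, 1): 1*4 + 2*5 - 3*6 = -4
val = pe_inner(u, v, 2, 1)
```

Unlike a Euclidean inner product, this bilinear form is indefinite: vectors can have negative "squared norm", which is the geometric source of the non-convexity discussed below.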
When an indefinite kernel is used in C-SVM, (2) becomes a non-convex quadratic problem, since K is not positive semi-definite. For a non-convex problem, many algorithms based on global optimality are invalid.
An alternative way is to find an approximate PSD matrix K̃ for an indefinite one K, and then solve (2) for K̃. To obtain K̃, one can adjust the eigenvalues of K by: i) setting all negative eigenvalues to zero [12]; ii) flipping the signs of the negative eigenvalues [13]; iii) squaring the eigenvalues [26,27]. It can also be implemented by minimizing the Frobenius distance between K and K̃, as introduced by [14]. Since training and classification are based on two different kernels, the above methods are efficient only when K and K̃ are similar. Those methods are also time-consuming, since they additionally involve eigenvalue problems. To pursue computational effectiveness, one can use descent algorithms, e.g., sequential minimal optimization (SMO) developed by [17,18], to directly solve (2) for an indefinite kernel matrix. Though only local optima are guaranteed, the performance is still promising, as reported by [15] and [16].
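The three spectrum modifications above can be sketched in a few lines of numpy (a toy illustration; the function name and the example matrix are ours):

```python
import numpy as np

def psd_approximations(K):
    """Return the clip, flip, and square PSD surrogates of a real
    symmetric indefinite matrix K, obtained by modifying its spectrum."""
    lam, V = np.linalg.eigh(K)
    clip   = (V * np.maximum(lam, 0.0)) @ V.T  # i) negative eigenvalues -> 0
    flip   = (V * np.abs(lam)) @ V.T           # ii) signs of negative eigenvalues flipped
    square = (V * lam**2) @ V.T                # iii) eigenvalues squared
    return clip, flip, square

K = np.array([[1.0, 2.0],
              [2.0, 1.0]])        # eigenvalues 3 and -1, so K is indefinite
clip, flip, square = psd_approximations(K)
```

All three surrogates share the eigenvectors of K and differ only in how the negative part of the spectrum is treated, which is why they behave similarly when K is "nearly PSD".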
3. LS-SVM with real symmetric indefinite kernels
The current indefinite learning discussions are mainly for C-SVM. In this paper, we propose to use indefinite kernels in the framework of least squares support vector machines. In the dual space, LS-SVM is to solve the following linear system [2]:
    \begin{bmatrix} 0 & y^T \\ y & H + \gamma^{-1} I \end{bmatrix}
    \begin{bmatrix} b \\ \alpha \end{bmatrix}
    =
    \begin{bmatrix} 0 \\ 1 \end{bmatrix},   (4)

with \alpha = [\alpha_1, \dots, \alpha_m]^T and y = [y_1, \dots, y_m]^T,
where I is the identity matrix, 1 is an all-ones vector of proper dimension, and H is given by H_{ij} = y_i y_j K_{ij} = y_i y_j K(x_i, x_j).
We assume that the matrix in (4) is full rank. Then its solution can be obtained effectively, and the corresponding discriminant function is
    f(x) = \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) + b.
The existing discussion about LS-SVM usually requires K to be positive semi-definite, such that Mercer's theorem is applicable and the solution of (4) is related to Fisher discriminant analysis in the feature space [28].
Now let us investigate indefinite kernels in LS-SVM (4). One good property is that even when K is indefinite, (4) is still easy to solve, which differs from C-SVM, where an indefinite kernel makes (2) non-convex. Though (4) with an indefinite kernel is easy to solve, the solution loses many properties of PSD kernels, and its feature space interpretation has to be analyzed in a pE space as well. This is based on the following proposition:
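As a concrete illustration of how (4) is solved, a minimal numpy sketch follows (the toy data, the kernel choice, and all names are ours, not from the paper); it builds H, solves the bordered linear system, and evaluates f(x) = \sum_i y_i \alpha_i K(x, x_i) + b:

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM dual system (4) for any symmetric kernel matrix K,
    PSD or not; gamma is the regularization parameter."""
    m = len(y)
    H = (y[:, None] * y[None, :]) * K            # H_ij = y_i y_j K_ij
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = y                                 # first row:    y^T alpha = 0
    A[1:, 0] = y                                 # first column: y b
    A[1:, 1:] = H + np.eye(m) / gamma
    rhs = np.concatenate(([0.0], np.ones(m)))
    sol = np.linalg.solve(A, rhs)                # assumes the matrix is full rank
    return sol[0], sol[1:]                       # b, alpha

def lssvm_predict(x, X, y, alpha, b, kernel):
    # f(x) = sum_i y_i alpha_i K(x, x_i) + b
    return sum(yi * ai * kernel(x, xi) for xi, yi, ai in zip(X, y, alpha)) + b

# Toy 1-D example with an indefinite tanh kernel.
kernel = lambda u, v: np.tanh(u @ v - 0.5)
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.array([[kernel(a, c) for c in X] for a in X])
b, alpha = lssvm_train(K, y, gamma=10.0)
```

Note that nothing in the linear solve depends on K being PSD: the only assumption used is that the bordered matrix in (4) is full rank.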
Proposition 1. Let \alpha^*, b^* be the solution of (4) for a symmetric but indefinite kernel matrix K.
i) There exist two feature maps \phi_+ and \phi_- such that

    w_+^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi_+(x_i), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* y_i \phi_-(x_i),

and (w_+^*, w_-^*, b^*) is a stationary point of the following primal problem:

    \min_{w_+, w_-, b, \xi} \; \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2   (5)
    \text{s.t. } y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) = 1 - \xi_i, \ \forall i \in \{1, 2, \dots, m\}.
ii) The dual problem of (5) is given by (4), where

    K_{ij} = K(x_i, x_j) = K_+(x_i, x_j) - K_-(x_i, x_j)   (6)

with two PSD kernels K_+ and K_-:

    K_+(x_i, x_j) = \phi_+(x_i)^T \phi_+(x_j),   (7)
    K_-(x_i, x_j) = \phi_-(x_i)^T \phi_-(x_j).   (8)
Proof. For an indefinite kernel K, we can always find two PSD kernels K_+, K_- and corresponding feature maps \phi_+, \phi_- satisfying (6)–(8). Using \phi_+ and \phi_-, the Lagrangian of (5) can be written as
    L(w_+, w_-, b, \xi; \alpha) = \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) + \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) - 1 + \xi_i \right].
Then the conditions for a stationary point yield
    \partial L / \partial w_+ = w_+ - \sum_{i=1}^{m} \alpha_i y_i \phi_+(x_i) = 0,
    \partial L / \partial w_- = -w_- - \sum_{i=1}^{m} \alpha_i y_i \phi_-(x_i) = 0,
    \partial L / \partial b = -\sum_{i=1}^{m} \alpha_i y_i = 0,
    \partial L / \partial \xi_i = \gamma \xi_i - \alpha_i = 0,
    \partial L / \partial \alpha_i = y_i \left( w_+^T \phi_+(x_i) + w_-^T \phi_-(x_i) + b \right) - 1 + \xi_i = 0.
Eliminating the primal variables w_+, w_-, \xi, we get the optimality conditions:

    \sum_{i=1}^{m} \alpha_i y_i = 0 \quad \text{and} \quad y_i \left( \sum_{j=1}^{m} \alpha_j y_j \left[ \phi_+(x_i)^T \phi_+(x_j) - \phi_-(x_i)^T \phi_-(x_j) \right] + b \right) - 1 + \frac{\alpha_i}{\gamma} = 0, \ \forall i.
Substituting (6)–(8) into the above conditions leads to (4). Therefore, (4) is the dual problem of (5). If \alpha^*, b^* is the solution of (4), then b^* and

    w_+^* = \sum_{i=1}^{m} \alpha_i^* y_i \phi_+(x_i), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* y_i \phi_-(x_i)

satisfy the first-order optimality conditions of (5), i.e., (w_+^*, w_-^*, b^*) is a stationary point of (5). □
Proposition 1 gives the primal problem and a feature space interpretation for LS-SVM with an indefinite kernel. Its proof relies on the positive decomposition (6) of the kernel matrix K, which exists for all real symmetric kernel matrices. But this does not mean that we can find a positive decomposition for the kernel function K, i.e., (3) is not necessarily valid; the verification is usually hard for a specific kernel. If such a kernel decomposition exists, Proposition 1 further shows that (4) pursues a small within-class scatter in a pE space \mathbb{R}^{(p,q)}. If not, the within-class scatter is minimized in a space associated with an approximate kernel K̃ = K_+ - K_-, which is equal to K on all the training data. In (7) and (8), the dimension of the feature maps could be infinite, and then the conclusion extends to the corresponding RKKS.
4. Real symmetric indefinite kernels in PCA
In the last section, we considered LS-SVM with an indefinite kernel for binary classification. The analysis is applicable to other tasks which can be solved in the framework of LS-SVM. In [19], the link between kernel principal component analysis and LS-SVM has been investigated. Accordingly, we can give the feature space interpretation for kernel PCA with an indefinite kernel.
For given data \{x_i\}_{i=1}^m, kernel PCA is to solve an eigenvalue problem:
Ωα = λα, (9)
where the centered kernel matrix \Omega is induced from a kernel K as follows:
    \Omega_{ij} = K(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K(x_r, x_s).
Traditionally, K is limited to be a PSD kernel. Then a Mercer kernel is employed, and (9) maximizes the variance in the related feature space.
Following the same way of introducing an indefinite kernel in C-SVM or LS-SVM, we can directly use an indefinite kernel in kPCA (9). Notice that for an indefinite kernel, the eigenvalues can be both positive and negative. All these eigenvalues are still real, due to the use of a symmetric kernel. There is no difference in the problem itself, and the projected variables can be calculated in the same way. However, the feature space interpretation fundamentally changes, which is discussed in the following proposition.
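In matrix form, the double centering of K and the eigendecomposition of \Omega carry over unchanged to an indefinite kernel; a numpy sketch (the kernel choice, toy data, and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 points in R^3
m = X.shape[0]

# TL1 kernel K(u, v) = max(rho - ||u - v||_1, 0), an indefinite kernel.
rho = 0.7 * X.shape[1]
D1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise l1 distances
K = np.maximum(rho - D1, 0.0)

# Double centering: Omega = H K H with H = I - (1/m) 1 1^T,
# which is the matrix form of the entrywise definition of Omega.
H = np.eye(m) - np.ones((m, m)) / m
Omega = H @ K @ H

# Eigenvalue problem (9): the eigenvalues are real (Omega is symmetric)
# but may have both signs, since K need not be PSD.
lam, alpha = np.linalg.eigh(Omega)
# Scores of the training points on the leading component.
scores = Omega @ alpha[:, -1]
```

The solver does not care whether K is PSD; only the interpretation of the negative eigenvalues changes, as discussed in Proposition 2.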
Proposition 2. Let \alpha^* be the solution of (9) for an indefinite kernel K.
i) There are two feature maps \phi_+ and \phi_- such that

    w_+^* = \sum_{i=1}^{m} \alpha_i^* \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right), \qquad w_-^* = -\sum_{i=1}^{m} \alpha_i^* \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right),

and (w_+^*, w_-^*) is a stationary point of the following primal problem:

    \max_{w_+, w_-, \xi} \; \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right)   (10)
    \text{s.t. } \xi_i = w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right), \ \forall i \in \{1, \dots, m\}.
Here, \hat{\mu}_{\phi_+} and \hat{\mu}_{\phi_-} are the centering terms, i.e.,

    \hat{\mu}_{\phi_+} = \frac{1}{m} \sum_{i=1}^{m} \phi_+(x_i) \quad \text{and} \quad \hat{\mu}_{\phi_-} = \frac{1}{m} \sum_{i=1}^{m} \phi_-(x_i).
ii) If we choose \gamma = 1/\lambda and decompose K as in (6)–(8), then the dual problem of (10) is given by (9).
Proof. Again, for an indefinite kernel K, we can find two PSD kernels K_+, K_- and corresponding nonlinear feature maps \phi_+, \phi_- satisfying (6)–(8).
The Lagrangian of (10) can be written as
    L(w_+, w_-, \xi; \alpha) = \frac{1}{2} \left( w_+^T w_+ - w_-^T w_- \right) - \frac{\gamma}{2} \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i \right].
Then, from the conditions for a stationary point, we have
    \partial L / \partial w_+ = w_+ - \sum_{i=1}^{m} \alpha_i \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) = 0,
    \partial L / \partial w_- = -w_- - \sum_{i=1}^{m} \alpha_i \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) = 0,
    \partial L / \partial \xi_i = -\gamma \xi_i + \alpha_i = 0,
    \partial L / \partial \alpha_i = w_+^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + w_-^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) - \xi_i = 0.
Elimination of the primal variables results in the following optimality condition:

    \frac{1}{\gamma} \alpha_i - \sum_{j=1}^{m} \alpha_j \left( \phi_+(x_j) - \hat{\mu}_{\phi_+} \right)^T \left( \phi_+(x_i) - \hat{\mu}_{\phi_+} \right) + \sum_{j=1}^{m} \alpha_j \left( \phi_-(x_j) - \hat{\mu}_{\phi_-} \right)^T \left( \phi_-(x_i) - \hat{\mu}_{\phi_-} \right) = 0, \ \forall i \in \{1, 2, \dots, m\}.   (11)
Applying the kernel trick (7) and (8), we know that

    \left( \phi_\pm(x_i) - \hat{\mu}_{\phi_\pm} \right)^T \left( \phi_\pm(x_j) - \hat{\mu}_{\phi_\pm} \right) = K_\pm(x_i, x_j) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_i, x_r) - \frac{1}{m} \sum_{r=1}^{m} K_\pm(x_j, x_r) + \frac{1}{m^2} \sum_{r=1}^{m} \sum_{s=1}^{m} K_\pm(x_r, x_s).
Together with (6), the optimality condition (11) can be formulated as the eigenvalue problem (9). Therefore, (9) is the dual problem of (10) and gives a stationary solution of (10), which aims at maximal variance, the same as kPCA with PSD kernels. □
5. Numerical experiments
In the preceding sections, we discussed the use of indefinite kernels in the framework of LS-SVM for classification and kernel principal component analysis, respectively. The general conclusions are: i) indefinite LS-SVM shares the same optimization model as the PSD one, and hence the same toolbox, namely LS-SVMlab [35], is applicable; ii) regarding the computational load of LS-SVM, there is no difference between a PSD kernel and an indefinite kernel, i.e., using an indefinite kernel in LS-SVM brings no additional computational burden; iii) the feature space interpretation of LS-SVM for an indefinite kernel is extended to a pE space, and only a stationary point can be obtained.
In theory, indefinite kernels are a generalization of PSD ones, for which the negative part in (6) vanishes. In practice, there are indefinite kernels successfully applied in specific applications [29–31]. Now, with the feature space interpretation given in this paper, one can use LS-SVM and its modifications to learn from an indefinite kernel. In general, algorithmic properties holding for LS-SVM with PSD kernels remain valid when an indefinite kernel is used.
In this section, we test the performance of LS-SVM with indefinite kernels on some benchmark problems. It should be noticed that the performance heavily relies on the choice of kernel. Though there are already some indefinite kernels designed for specific tasks, it is still hard to find an indefinite kernel suitable for a wide range of problems. Therefore, PSD kernels, especially the radial basis function (RBF) kernel and the polynomial kernel, are currently dominant in kernel learning. One challenger among indefinite kernels is the tanh kernel, which has been evaluated in the framework of C-SVM [8,16]. Another possible indefinite kernel
Table 1
Test accuracy of LS-SVM with PSD and indefinite kernels.

Dataset    m     n    RBF (CV)  poly (CV)  tanh (CV)  TL1 (ρ = 0.7n)  TL1 (CV)
DBWords    32    242  84.3%     85.6%      75.0%      85.2%           84.4%
Fertility  50    9    86.7%     80.4%      83.8%      86.7%           87.8%
Planning   91    12   70.2%     67.9%      71.6%      70.6%           73.6%
Sonar      104   60   84.5%     83.1%      72.9%      84.3%           83.6%
Statlog    135   13   81.4%     75.2%      82.7%      83.8%           83.5%
Monk1      124   6    79.1%     78.3%      76.6%      73.4%           85.2%
Monk2      169   6    84.1%     75.6%      69.9%      53.4%           83.7%
Monk3      122   6    93.5%     93.5%      88.0%      97.2%           97.2%
Climate    270   20   93.2%     91.7%      92.6%      91.9%           92.0%
Liver      292   10   69.0%     67.9%      67.2%      69.7%           71.8%
Austr.     345   14   85.0%     85.1%      86.6%      86.0%           86.7%
Breast     349   10   96.4%     95.7%      96.6%      97.0%           97.1%
Trans.     374   4    78.3%     78.5%      78.9%      78.4%           78.4%
Splice     1000  121  89.4%     87.3%      90.1%      93.6%           94.9%
Spamb.     2300  57   93.1%     92.4%      91.7%      94.1%           94.2%
ML-prove   3059  51   72.5%     74.6%      71.8%      79.1%           79.3%
is a truncated ℓ1 distance (TL1) kernel, which has been recently proposed in [32]. The mentioned kernels are listed below:

• PSD kernels:
  – linear kernel: K(u, v) = u^T v,
  – RBF kernel with parameter \sigma: K(u, v) = \exp(-\|u - v\|_2^2 / \sigma^2),
  – polynomial kernel with parameters c \geq 0, d \in \mathbb{N}_+: K(u, v) = (u^T v + c)^d.
• Indefinite kernels:
  – tanh kernel with parameters c, d:^1 K(u, v) = \tanh(c u^T v + d),
  – TL1 kernel with parameter \rho: K(u, v) = \max\{\rho - \|u - v\|_1, 0\}.
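The two indefinite kernels above take only a couple of lines each; a sketch in numpy (the function names are ours):

```python
import numpy as np

def tanh_kernel(u, v, c, d):
    """tanh kernel K(u, v) = tanh(c * u^T v + d)."""
    return float(np.tanh(c * np.dot(u, v) + d))

def tl1_kernel(u, v, rho):
    """Truncated l1 distance kernel K(u, v) = max(rho - ||u - v||_1, 0)."""
    return max(rho - float(np.abs(np.asarray(u) - np.asarray(v)).sum()), 0.0)

u = np.array([0.2, 0.4])
v = np.array([0.6, 0.1])
n = len(u)
k1 = tanh_kernel(u, v, c=-1.0, d=0.5)   # c < 0: the clearly indefinite regime
k2 = tl1_kernel(u, v, rho=0.7 * n)      # rho = 0.7n, the default suggested in [32]
```

Note that on the diagonal the TL1 kernel always equals \rho, so its kernel matrices have a constant diagonal regardless of the data scale.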
These kernels will be compared in the framework of LS-SVM for both classification and principal component analysis. First, consider binary classification problems, for which the data are downloaded from the UCI Repository of Machine Learning Datasets [36]. For some datasets, both training and test data are provided. Otherwise, we randomly pick half of the data for training and the rest for testing. All training data are normalized to [0, 1]^n in advance. In the training procedure, there are a regularization coefficient and kernel parameters, which are tuned by 10-fold cross validation. Specifically, we randomly partition the training data into 10 subsets. One of these subsets is used for validation in turn and the remaining ones for training.
As discussed in [32], the performance of the TL1 kernel is not very sensitive to the value of \rho, and \rho = 0.7n was suggested. We thus also evaluate the TL1 kernel with \rho = 0.7n. With one parameter fewer to tune, much training time can be saved.
The above procedure is repeated 10 times, and the average classification accuracy on test data is reported in Table 1, where the number of training samples m and the problem dimension n are given as well. The best result for each dataset in the sense of average accuracy is underlined. The results confirm the potential use of indefinite kernels in LS-SVM: an indefinite kernel can achieve similar accuracy as a PSD kernel on most of the problems and can have better performance on some specific problems. This does not mean that an indefinite kernel surely improves the performance over PSD ones, but for some datasets, e.g., Monk1, Monk3, and Splice, it is worth considering indefinite learning with LS-SVM, which may achieve better accuracy within almost the same training time. Moreover, this experiment, in which the performance of
1 The tanh kernel is conditionally positive definite (CPD) when c ≥ 0 and is indefinite otherwise; see, e.g., [33,34]. In our experiments, we consider both positive and negative c, and hence the tanh kernel is regarded as an indefinite kernel.
Fig. 1. (a) Data points of one class come from the unit sphere and are marked by red circles. The other data points, shown by blue stars, come from a sphere with radius 3. (b) This dataset is not linearly separable, and thus linear PCA is not helpful for distinguishing the two classes. Instead, kernel PCA is needed and, if the parameter is suitably chosen, the reduced data can be correctly classified by a linear classifier. (c) The RBF kernel with σ = 0.05; (d) the TL1 kernel with ρ = 0.1n; (e) the TL1 kernel with ρ = 0.2n; (f) the TL1 kernel with ρ = 0.5n, which is similar to linear PCA. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
the TL1 kernel with ρ = 0.7n is satisfactory for many datasets, illustrates the good parameter stability of the TL1 kernel.
In the following, we use indefinite kernels for principal component analysis. As an intuitive example, we consider a 3-dimensional sphere problem that distinguishes data from the spheres with radii equal to 1 and 3. The data are shown in Fig. 1(a). To reduce the dimension, we apply PCA, kPCA with the RBF kernel, and kPCA with the TL1 kernel, respectively. The obtained two-dimensional data are displayed in Fig. 1(b)–(f), which roughly imply that a suitable indefinite kernel can be used for kernel principal component analysis.
To quantitatively evaluate kPCA with indefinite kernels, we choose from Table 1 the problems whose dimension is higher than 20 and then apply kPCA to reduce the data to n_r dimensions. For the reduced data, linear classifiers, trained by linear C-SVM with libsvm [37], are used to classify the test data.
The parameters, including kernel parameters and the regularization constant in linear C-SVM, are tuned based on 10-fold cross-validation. In Table 2, the average classification accuracy of 10 trials for different reduction ratios n_r/n is listed. The results illustrate that indefinite kernels can be used for kPCA. The performance is in general comparable to PSD kernels, and for some datasets the performance is significantly improved.
Summarizing all the experiments above, we observe the potential use of indefinite kernels in LS-SVM for classification and kPCA. For example, the TL1 kernel has similar performance as the RBF kernel on many problems and much better results on several datasets. Our aim in these experiments is not to claim which kernel is the best, which actually depends on the specific problem. Instead, we show that for some problems a proper indefinite kernel can significantly improve the performance over PSD ones, which may motivate researchers to design indefinite kernels and use them in LS-SVMs.
Table 2
Test accuracy based on kPCA with different reduction ratios.

Dataset    m     n   Ratio  Linear  RBF    poly   tanh   TL1
Sonar      104   60  10%    72.6%   75.6%  75.2%  63.8%  77.9%
                     30%    73.1%   79.1%  78.2%  71.0%  80.4%
                     50%    75.9%   80.7%  79.0%  71.9%  81.9%
Climate    270   21  10%    90.4%   90.5%  91.4%  91.5%  90.5%
                     30%    90.9%   90.8%  91.4%  91.6%  90.9%
                     50%    91.6%   91.4%  93.9%  91.6%  91.9%
Qsar       528   41  10%    74.4%   77.8%  75.5%  77.5%  78.8%
                     30%    85.4%   86.4%  84.1%  84.5%  85.9%
                     50%    85.9%   86.7%  85.4%  86.0%  86.2%
Splice     1000  60  10%    83.7%   86.6%  85.5%  83.7%  91.9%
                     30%    83.9%   87.7%  85.3%  83.0%  91.1%
                     50%    84.1%   87.8%  86.5%  85.2%  91.3%
Spamb.     2300  57  10%    84.7%   86.5%  84.9%  86.4%  88.4%
                     30%    87.7%   89.9%  88.3%  89.9%  91.8%
                     50%    90.7%   91.0%  91.5%  92.8%  92.8%
ML-prove   3059  51  10%    59.2%   69.7%  64.0%  63.3%  70.1%
                     30%    67.9%   70.3%  72.8%  69.3%  71.3%
                     50%    70.2%   71.0%  73.1%  68.9%  75.5%
6. Conclusion
In this paper, we proposed to use indefinite kernels in the framework of least squares support vector machines. In the training problem itself, there is no difference between definite and indefinite kernels.
Thus, one can easily use an indefinite kernel in LS-SVM by simply changing the kernel evaluation function.
Numerically, indefinite kernels achieve good performance compared with commonly used PSD kernels for both classification and kernel principal component analysis. This good performance motivated us to investigate the feature space interpretation for an indefinite kernel in LS-SVM, which is the main theoretical contribution of this paper. We hope that the theoretical analysis and the good performance shown in this paper can attract research and application interest in indefinite LS-SVM and indefinite kPCA in the future.
Acknowledgments
The authors would like to thank Prof. Lei Shi of Fudan University for helpful discussions. The authors are grateful to the anonymous reviewer for insightful comments.
References
[1] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[2] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[3] J.A.K. Suykens, T. Van Gestel, B. De Moor, J. De Brabanter, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, 2002.
[4] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[5] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004, pp. 639–646.
[6] Y. Ying, C. Campbell, M. Girolami, Analysis of SVM with indefinite kernels, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, 2009, pp. 2205–2213.
[7] S. Gu, Y. Guo, Learning SVM classifiers with indefinite kernels, in: Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012, pp. 942–948.
[8] G. Loosli, S. Canu, C.S. Ong, Learning SVM in Kreĭn spaces, IEEE Trans. Pattern Anal. Mach. Intell. 38 (6) (2016) 1204–1216.
[9] E. Pekalska, B. Haasdonk, Kernel discriminant analysis for positive definite and indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 31 (6) (2009) 1017–1032.
[10] B. Haasdonk, E. Pekalska, Indefinite kernel discriminant analysis, in: Proceedings of the 16th International Conference on Computational Statistics (COMPSTAT), 2010, pp. 221–230.
[11] S. Zafeiriou, Subspace learning in Kreĭn spaces: complete kernel Fisher discriminant analysis with indefinite kernels, in: Proceedings of the European Conference on Computer Vision (ECCV) 2012, 2012, pp. 488–501.
[12] E. Pekalska, P. Paclik, R.P. Duin, A generalized kernel approach to dissimilarity-based classification, J. Mach. Learn. Res. 2 (2002) 175–211.
[13] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551.
[14] R. Luss, A. d'Aspremont, Support vector machine classification with indefinite kernels, in: J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 2007, pp. 953–960.
[15] F. Schleif, P. Tino, Indefinite proximity learning: a review, Neural Comput. 27 (2015) 2039–2096.
[16] H.-T. Lin, C.-J. Lin, A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods, technical report, Department of Computer Science, National Taiwan University, 2003, http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf.
[17] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, 1999, pp. 185–208.
[18] R.-E. Fan, P.-H. Chen, C.-J. Lin, Working set selection using second order information for training support vector machines, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[19] J.A.K. Suykens, T. Van Gestel, J. Vandewalle, B. De Moor, A support vector machine formulation to PCA analysis and its kernel version, IEEE Trans. Neural Netw. 14 (2) (2003) 447–450.
[20] H. Ling, D.W. Jacobs, Using the inner-distance for classification of articulated shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2005, 2005, pp. 719–726.
[21] M.M. Deza, E. Deza, Encyclopedia of Distances, Springer, New York, 2009.
[22] W. Xu, R.C. Wilson, E.R. Hancock, Determining the cause of negative dissimilarity eigenvalues, in: Computer Analysis of Images and Patterns, Springer, New York, 2011, pp. 589–597.
[23] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, K. Obermayer, Classification on pairwise proximity data, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, 1998, pp. 438–444.
[24] B. Haasdonk, Feature space interpretation of SVMs with indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 27 (4) (2005) 482–492.
[25] I. Alabdulmohsin, X. Gao, X. Zhang, Support vector machines with indefinite kernels, in: Proceedings of the 6th Asian Conference on Machine Learning (ACML), 2014, pp. 32–47.
[26] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal. 30 (1) (2011) 96–109.
[27] Q. Wu, Regularization networks with indefinite kernels, J. Approx. Theory 166 (2013) 1–18.
[28] T. Van Gestel, J.A.K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis, Neural Comput. 14 (5) (2002) 1115–1147.
[29] A.J. Smola, Z.L. Ovari, R.C. Williamson, Regularization with dot-product kernels, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, 2000, pp. 308–314.
[30] H. Saigo, J.P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.
[31] B. Haasdonk, H. Burkhardt, Invariant kernel functions for pattern analysis and machine learning, Mach. Learn. 68 (1) (2007) 35–61.
[32] X. Huang, J.A.K. Suykens, S. Wang, A. Maier, J. Hornegger, Classification with truncated ℓ1 distance kernel, internal report 15-211, ESAT-SISTA, KU Leuven, 2015.
[33] M.D. Buhmann, Radial Basis Functions: Theory and Implementations, Cambridge Monographs on Applied and Computational Mathematics, vol. 12, 2004, pp. 147–165.
[34] H. Wendland, Scattered Data Approximation, Cambridge University Press, Cambridge, UK, 2004.
[35] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, J.A.K. Suykens, LS-SVMlab toolbox user's guide, internal report 10-146, ESAT-SISTA, KU Leuven, 2011.
[36] A. Frank, A. Asuncion, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA, 2010, http://archive.ics.uci.edu/ml.
[37] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.