Sparsifying the Gram Matrix in LS-SVM Regression Models
B. Hamers, J.A.K. Suykens, B. De Moor
K.U. Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
{bart.hamers,johan.suykens}@esat.kuleuven.ac.be
Abstract. In this paper we investigate the use of compactly supported RBF kernels for nonlinear function estimation with LS-SVMs. The choice of compact kernels, recently proposed by Genton, may lead to computational improvements and memory reduction. Examples, however, illustrate that compactly supported RBF kernels may lead to severe loss in generalization performance for some applications, e.g. in chaotic time-series prediction. As a result, the usefulness of such kernels may be much more application dependent than the use of the RBF kernel.

Keywords. Support vector machines, nonlinear function estimation, compactly supported kernels, direct and iterative methods.
1 Introduction
Recently, kernel methods for pattern recognition and nonlinear function estimation have received a lot of attention. The performance of these methods is often excellent, although one of the disadvantages is the scaling to larger data sets. This is caused by the fact that many optimization methods demand the storage of the Gram matrix. Genton [3] recently showed an efficient method for constructing kernels with compact support without destroying the positive definiteness of the kernel. In this paper we study the consequences of using compactly supported RBF kernels. RBF kernels are frequently used in nonlinear function estimation problems [2]. A compactified version of this kernel could be computationally attractive. In this paper we apply this kernel to a number of toy problems and real life data sets. As a result, we observe that on certain problems, such as chaotic time series prediction, the use of compactly supported RBF kernels leads to loss in generalization performance, while for other problems (e.g. in lower dimensional problems) the quality of the results is comparable.

This paper is organized as follows. In Section 2 we discuss the compactly supported RBF kernel. In Section 3 we discuss methods for solving LS-SVM systems and how to exploit sparseness in the Gram matrix. In Section 4 illustrations on artificial and real life data sets are given.
2 Kernel matrix and compactly supported kernels
The kernel functions that are used in the support vector literature [1] are functions $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying the Mercer condition. Given a training data set $\{x_i, y_i\}_{i=1}^N$ with inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in \mathbb{R}$, the kernel matrix or Gram matrix $\Omega \in \mathbb{R}^{N \times N}$ is positive (semi-)definite, where $\Omega_{ij} = K(x_i, x_j)$.
In nonlinear function estimation, a frequently used kernel is the radial basis function (RBF) kernel $K(x,z) = \exp(-\|x-z\|^2/\sigma^2)$, where $\sigma \in \mathbb{R}$ is a tuning parameter of the model. This Gaussian kernel is a special case of the class of Matérn type kernels [3]. An important property of this class of kernels is that they can easily be transformed into compactly supported kernels. This means that the kernel will be zero if $\|x-z\|$ is larger than a cut-off distance $\theta_0$. As explained in Genton [3], one can multiply the kernel by $\max\{0, (1 - \|x-z\|/\theta_0)^{\nu_0}\}$, where $\theta_0 > 0$ and $\nu_0 \geq (d+1)/2$, to ensure positive definiteness. The danger of cutting off a kernel in another way is that one will lose positive definiteness. In this paper we investigate the use of the compactly supported Gaussian RBF kernel (CS-RBF)

\[
K(x,z) = \max\left(0,\, 1 - \frac{\|x-z\|}{3\sigma}\right)^{\nu_0} \exp\left(-\frac{\|x-z\|^2}{\sigma^2}\right). \tag{1}
\]

In order to avoid having too many extra parameters we decided to take the cut-off point $\theta_0 = 3\sigma$, where $\sigma$ denotes the bandwidth of the Gaussian RBF kernel. $\nu_0$ is chosen to be equal to the dimension of the input variables for the odd cases; when the dimension is even, it is augmented by one.
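As an illustration, the following is a minimal sketch (not the authors' code) of assembling the CS-RBF Gram matrix of Eq. (1) as a sparse matrix; the function name and the use of NumPy/SciPy are our own choices, and the dense distance computation is only for clarity — for genuinely large data sets one would compute only the pairs within the cut-off distance, e.g. with a spatial tree.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

def cs_rbf_gram(X, sigma):
    """CS-RBF Gram matrix of Eq. (1) with cut-off theta_0 = 3*sigma."""
    d = X.shape[1]
    nu0 = d if d % 2 == 1 else d + 1              # nu_0 as chosen in the paper
    D = cdist(X, X)                               # pairwise Euclidean distances
    taper = np.maximum(0.0, 1.0 - D / (3.0 * sigma)) ** nu0
    K = taper * np.exp(-(D ** 2) / sigma ** 2)    # exactly zero for D >= 3*sigma
    return csr_matrix(K)                          # store only the n_z non-zeros
```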
3 Nonlinear function estimation using LS-SVMs
We test the CS-RBF kernel in the context of LS-SVMs for nonlinear function estimation. This method is closely related to regularization networks, Gaussian processes and kernel ridge regression [1,2]. The emphasis in the LS-SVM formulation is on primal-dual interpretations as in standard SVMs, but simplified to a ridge regression formulation in the primal weight space, which can be infinite dimensional. In the primal weight space one has the model $y_i = w^T \varphi(x_i) + b + e_i$, with $\varphi(\cdot)$ the mapping to a high dimensional feature space as in standard SVMs; $e_i$ denotes the error for the $i$-th training data point. One minimizes $\min_{w,b,e} \frac{1}{2}w^T w + \gamma \frac{1}{2}\sum_{i=1}^N e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$ for $i = 1, \ldots, N$. For this constrained optimization problem one constructs a Lagrangian. The dual problem gives the KKT system

\[
\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I_N/\gamma \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix}
\tag{2}
\]

with $\alpha = [\alpha_1; \ldots; \alpha_N]$, $y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$. This results in the model $\hat{f}(x) = \sum_{i=1}^N \alpha_i K(x, x_i) + b$ with use of the kernel trick $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
This model can be robustified and sparsified as explained in [7]. Many algorithms for solving the linear system require a positive definite matrix, which is not the case here. Therefore, one can transform this system into $H\nu = 1_v$ and $H\eta = y$ with $H = \Omega + I_N/\gamma$ positive definite. From this we find that $b = \eta^T 1_v / s$ and $\alpha = \eta - b\nu$, where $s = \nu^T 1_v$.
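As a minimal sketch (our own naming, NumPy assumed), the two systems and the recovery of $b$ and $\alpha$ look as follows; the two right-hand sides are solved with a single factorization of $H$:

```python
import numpy as np

def lssvm_solve(Omega, y, gamma):
    """Solve H nu = 1_v and H eta = y, then recover b and alpha."""
    N = Omega.shape[0]
    H = Omega + np.eye(N) / gamma       # H = Omega + I_N/gamma, positive definite
    ones = np.ones(N)
    # Both systems share the matrix H, so solve them in one call.
    nu, eta = np.linalg.solve(H, np.column_stack([ones, y])).T
    b = (eta @ ones) / (nu @ ones)      # b = eta^T 1_v / s with s = nu^T 1_v
    alpha = eta - b * nu
    return alpha, b

# Predictions then follow the dual model: f_hat(x) = sum_i alpha_i K(x, x_i) + b.
```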
A first way to solve these two linear systems with the positive definite matrix $H$ is the Cholesky factorization [5]. An important disadvantage is that the matrix has to be completely stored in memory. Applying the CS-RBF kernel leads to a sparse matrix. The memory requirements then become proportional to the number of non-zero elements $n_z$. The computational cost is also reduced by making efficient use of the zero elements in the matrix. There exist different permutation algorithms (column count permutation, symmetric minimum degree, reverse Cuthill-McKee, ...) [4] on the elements of the sparse matrix that give a higher degree of sparseness in the Cholesky factor; a sketch of such a reordering is given below.
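For instance, SciPy exposes the reverse Cuthill-McKee permutation; since SciPy itself has no sparse Cholesky routine, the sketch below uses its sparse LU as a stand-in factorization, and the function name is our own.

```python
from scipy.sparse import csc_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import splu

def reordered_factor(H):
    """Permute a sparse symmetric H towards band form, then factorize it."""
    H = csc_matrix(H)
    perm = reverse_cuthill_mckee(H, symmetric_mode=True)  # bandwidth-reducing order
    Hp = csc_matrix(H[perm, :][:, perm])                  # symmetric permutation
    return splu(Hp), perm     # the factor of the banded Hp suffers less fill-in
```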
A second important class of methods to solve linear systems are Krylov methods. Such iterative methods are suitable for solving large scale problems. The conjugate gradient (CG) method can only be applied to positive definite matrices [6], [5]. The most demanding part in this algorithm is the matrix-vector product between $H$ and the conjugate directions. This cost can also be reduced by a CS-RBF kernel, since each product only involves the $n_z$ non-zero elements. In the CG method the condition number $\kappa(H)$ determines the convergence (note that this also depends on $\sigma$ and the regularization constant $\gamma$).
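A minimal sketch of this iterative route, under the same naming as above; `scipy.sparse.linalg.cg` stands in for whatever CG implementation one prefers:

```python
import numpy as np
from scipy.sparse.linalg import cg

def lssvm_solve_cg(H, y):
    """Iteratively solve H nu = 1_v and H eta = y; H may be sparse."""
    ones = np.ones(H.shape[0])
    nu, info1 = cg(H, ones)            # each iteration costs one product with H
    eta, info2 = cg(H, y)
    assert info1 == 0 and info2 == 0   # info == 0 means CG converged
    b = (eta @ ones) / (nu @ ones)
    return eta - b * nu, b             # alpha, b
```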
4 Examples
In this section we investigate the use of the CS-RBF kernel on a number of artificial and real-life data sets.
4.1 Sinc toy problem

Here we compare CS-RBF and RBF kernels for a noisy sinc function $f(x) = \sin(x)/x$ estimated by LS-SVMs. The tuning parameters are selected as $\gamma = 1.5$ and $\sigma = 3.7$. The inputs were taken between $-20$ and $20$ with an interspacing of 0.03. We added Gaussian noise to the inputs with zero mean and standard deviation 0.1. Fig. 1 shows that the performance of regression with the RBF and CS-RBF kernels is almost the same. The CS-RBF kernel gives a slightly larger bias and less smooth results. The pointwise variance of $\hat{f}(x)$ is larger for the CS-RBF kernel.
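For concreteness, here is a small sketch of this data set under one literal reading of the setup (the noise perturbing the inputs at which the sinc function is sampled); the random seed and the NumPy idioms are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is our own choice
x = np.arange(-20, 20, 0.03)             # 1334 training inputs, as in the text
noise = rng.normal(0.0, 0.1, x.shape)    # zero-mean Gaussian, std 0.1
y = np.sinc((x + noise) / np.pi)         # np.sinc(t) = sin(pi t)/(pi t), i.e. sin(x)/x
# The pairs (x_i, y_i) then feed the LS-SVM training of Section 3.
```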
An advantage of the CS-RBF kernel is the sparse $H$ matrix. For large data sets this results in a memory reduction. In the example of the sinc function with 1334 training points, the number of non-zero elements decreases from $1334^2 = 1779556$ to $n_z = 850160$. In this one-dimensional problem the $H$ matrix also has a very clear band structure. Notice that the $H$ matrix is independent of the $y_i$ values of the training set. This means that for each regression problem with the same $x_i$ values for the training set and hyperparameter set $(\gamma, \sigma)$, the $H$ matrix has this sparse band structure. This band structure, in combination with the sparseness in the matrix, gives a speed-up in the training procedure. Depending on the method used (Cholesky or conjugate gradient), the time needed to solve the two systems is the following: the Cholesky factorization needs 9.7450 sec of cpu-time to solve the two linear systems for the Gaussian RBF and 4.3270 sec for the compactly supported RBF. The conjugate gradient method needs respectively 4.1470 sec and 2.6540 sec. Hence, we typically observe that the compactly supported kernel results in a memory reduction and a reduction in computation time.
Fig. 1. LS-SVM results for nonlinear function estimation on the sinc function (top: the noisy data and the estimates with the RBF and CS-RBF kernels). The middle and bottom parts show respectively the bias and the variance of both estimates for both kernels.
We also tested the influence of the localization on the condition number of the matrix $H$. Fig. 2 shows that there is only a small difference in the condition number $\kappa(H)$ for the different values of $(\gamma, \sigma)$. Therefore, the speed of convergence for CG with RBF or CS-RBF kernels is comparable.
4.2 Boston housing data

As a second example, we tested the Boston housing data set. This data set consists of 506 cases with 14 attributes. We trained LS-SVMs on 406 randomly selected training data points and used 100 points as test set. We normalized the data except the binary variables. In Table 1 we show the performance of the LS-SVM for different values of the hyperparameter $\sigma$, where $\gamma = 30$ is kept constant. The performances of RBF and CS-RBF kernels were comparable on all performed tests. We see that by decreasing $\sigma$, the $H$ matrix becomes more and more sparse as a result of the localization of the kernel.
             σ = 1.5    σ = 2.0    σ = 10
  MSEtr      5.8e-3     1.3e-2     1.20e-1
  MSEtest    1.1e-1     1.0e-1     8.45e-2
  n_z/N²     0.37       0.84       1

Table 1. Performance for different values of the bandwidth σ for CS-RBF kernels on the Boston housing data. MSEtr and MSEtest are respectively the mean squared error on the training and test set. The ratio n_z/N² characterizes the degree of sparseness in the Gram matrix.
Fig. 2. This figure shows the logarithm of the condition number, log(κ(H)), for different hyperparameters (γ, σ): (left) RBF kernel; (right) CS-RBF kernel. Notice that the condition number is not significantly larger for the CS-RBF.
4.3 Santa Fe chaotic laser data time series prediction

In a third example, we use LS-SVMs for time series prediction on the Santa Fe laser data set. The model that we use is based on a trained one-step-ahead predictor $\hat{y}_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-n})$ with $n = 50$, where $y_k$ denotes the true output at discrete time instant $k$. In Fig. 3 we see that a good iterative prediction performance is obtained for the RBF kernel with hyperparameters $(\gamma, \sigma) = (70, 4)$ found by 10-fold crossvalidation. For the same hyperparameters the CS-RBF kernel has a very bad performance, as can be seen in Fig. 3. For almost similar performance either the cut-off point $\theta_0 = 50\sigma$ or the bandwidth of the CS-RBF kernel has to be increased. Unfortunately, both reduce the degree of sparseness in the Gram matrix to zero. A sketch of the lag-vector construction and the iterative use of the predictor is given below.
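The following is our own construction, with a hypothetical trained one-step-ahead model `predict` (e.g. an LS-SVM evaluated on a lag vector); it shows the embedding with $n = 50$ and the iterative prediction, where each output is fed back as an input.

```python
import numpy as np

def embed(series, n=50):
    """Build lag vectors [y_{k-1}, ..., y_{k-n}] with targets y_k."""
    X = np.array([series[k - n:k][::-1] for k in range(n, len(series))])
    y = np.asarray(series[n:])
    return X, y

def iterate_prediction(predict, history, steps, n=50):
    """Iterative prediction: feed each one-step-ahead output back as input."""
    window = list(history[-n:])                  # last n observations, chronological
    out = []
    for _ in range(steps):
        lag = np.array(window[::-1])[None, :]    # [y_{k-1}, ..., y_{k-n}]
        y_next = float(predict(lag))
        out.append(y_next)
        window = window[1:] + [y_next]           # slide the window over the prediction
    return np.array(out)
```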
Fig. 3. Santa Fe laser data prediction: (a) (-) real data, (-.) RBF kernel, (--) CS-RBF kernel; (b) (--) CS-RBF with cut-off point $\theta_0 = 50\sigma$, having no sparseness; (c) MSE on test data with respect to the cut-off point $\theta_0$, showing bad results for smaller $\theta_0$, i.e. sparser Gram matrices.

5 Conclusions
We have studied the use of compactly supported RBF kernels based on recent work by Genton. RBF kernels are frequently used for many applications. The use of a compactly supported version of the RBF kernel could result in a sparse Gram matrix, and thus decrease the computational cost and memory requirements. In our study we have seen that for certain problems the generalization performance of the RBF and the compactly supported RBF remains comparable, as well as the conditioning of the matrices with respect to iterative methods such as conjugate gradients. However, on a problem of chaotic time series prediction the compactly supported RBF kernel fails to produce good results when having a sparse Gram matrix. As a result one may conclude that compactly supported RBF kernels may be useful for some specific applications, but one should be careful using them in a general context.
Acknowledgements

Our research is supported by grants from several funding agencies and sources: Research Council KUL: Concerted Research Action GOA-Mefisto 666 (Mathematical Engineering), IDO (IOTA Oncology, Genetic networks), several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0256.97 (subspace), G.0115.01 (bio-i and microarrays), G.0240.99 (multilinear algebra), G.0197.02 (power islands), G.0407.02 (support vector machines), research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s (soft sensors), STWW-Genprom (gene promotor prediction), GBOU-McKnow (Knowledge management algorithms), Eureka-Impact (MPC-control), Eureka-FLiTE (flutter modeling), several PhD grants); Belgian Federal Government: DWTC (IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006): Dynamical Systems and Control: Computation, Identification & Modelling), Program Sustainable Development PODO-II (CP-TR-18: Sustainability effects of Traffic Management Systems); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. BDM is a full professor at K.U.Leuven Belgium, JS is a professor at K.U.Leuven Belgium and a postdoctoral researcher with FWO Flanders.
References

1. Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
2. Evgeniou T., Pontil M., Poggio T., "Regularization networks and support vector machines", Advances in Computational Mathematics, 13(1), 1-50, 2000.
3. Genton M., "Classes of kernels for machine learning: a statistics perspective", Journal of Machine Learning Research, 2, 299-312, 2001.
4. Gilbert J., Moler C., Schreiber R., "Sparse matrices in Matlab: design and implementation", SIAM Journal on Matrix Analysis and Applications, 13(1), 333-356, 1992.
5. Golub G., Van Loan C., Matrix Computations, Baltimore: The Johns Hopkins University Press, 2nd ed., 1990.
6. Greenbaum A., Iterative Methods for Solving Linear Systems, Philadelphia: SIAM, 1997.
7. Suykens J.A.K., De Brabanter J., Lukas L., Vandewalle J., "Weighted least squares support vector machines: robustness and sparse approximation", Neurocomputing, 48(1-4), 85-105, 2002.