
Sparsifying the Gram matrix in LS-SVM regression models

B. Hamers, J.A.K. Suykens, B. De Moor

K.U. Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
{bart.hamers,johan.suykens}@esat.kuleuven.ac.be

Abstract. In this paper we investigate the use of compactly supported RBF kernels for nonlinear function estimation with LS-SVMs. The choice of compact kernels, recently proposed by Genton, may lead to computational improvements and memory reduction. Examples, however, illustrate that compactly supported RBF kernels may lead to severe loss in generalization performance for some applications, e.g. in chaotic time-series prediction. As a result, the usefulness of such kernels may be much more application dependent than the use of the RBF kernel.

Keywords. Support vector machines, nonlinear function estimation, compactly supported kernels, direct and iterative methods.

1 Introduction

Recently, kernel methods for pattern recognition and nonlinear function estimation have received a lot of attention. The performance of these methods is often excellent, although one of the disadvantages is the upscaling to larger data sets. This is caused by the fact that many optimization methods demand the storage of the Gram matrix. Genton [3] recently showed an efficient method for constructing kernels with compact support without destroying the positive definiteness of the kernel. In this paper we study the consequences of using compactly supported RBF kernels. RBF kernels are frequently used in nonlinear function estimation problems [2], and a compactified version of this kernel could be computationally attractive. We apply this kernel to a number of toy problems and real-life data sets. We observe that on certain problems, such as chaotic time-series prediction, the use of compactly supported RBF kernels leads to a loss in generalization performance, while for other problems (e.g. lower dimensional problems) the quality of the results is comparable.

This paper is organized as follows. In Section 2 we discuss the compactly supported RBF kernel. In Section 3 we discuss methods for solving LS-SVM systems and how to exploit sparseness in the Gram matrix. In Section 4 illustrations on artificial and real-life data sets are given.

2 Kernel matrix and compactly supported kernels

The kernel functions that are used in the support vector literature [1] are functions $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying the Mercer condition. Given a training data set $\{x_i, y_i\}_{i=1}^N$ with inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in \mathbb{R}$, the kernel matrix or Gram matrix $\Omega \in \mathbb{R}^{N \times N}$ is positive (semi-)definite, where $\Omega_{ij} = K(x_i, x_j)$.

In nonlinear function estimation, a frequently used kernel is the radial basis function (RBF) kernel $K(x,z) = \exp(-\|x-z\|^2/\sigma^2)$, where $\sigma \in \mathbb{R}$ is a tuning parameter of the model. This Gaussian kernel is a special case of the class of Matern type kernels [3]. An important property of this class of kernels is that they can easily be transformed into compactly supported kernels. This means that the kernel will be zero if $\|x-z\|$ is larger than a cut-off distance $\theta_0$. As explained in Genton [3], one can multiply the kernel by $\max\{0, (1 - \|x-z\|/\theta_0)\}^{\nu_0}$, where $\nu_0 > 0$ and $\nu_0 \geq (d+1)/2$, to ensure positive definiteness. The danger of cutting off a kernel in any other way is that one will lose positive definiteness. In this paper we investigate the use of the compactly supported Gaussian RBF kernel (CS-RBF)

$$K(x,z) = \max\left\{0, \left(1 - \frac{\|x-z\|}{3\sigma}\right)\right\}^{\nu_0} \exp\left(-\frac{\|x-z\|^2}{\sigma^2}\right). \qquad (1)$$

In order to avoid having too many extra parameters, we decided to take the cut-off point $\theta_0 = 3\sigma$, where $\sigma$ denotes the bandwidth of the Gaussian RBF kernel. $\nu_0$ is chosen equal to the dimension of the input variables when that dimension is odd; when the dimension is even, it is augmented by one.
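As an illustration, the following is a minimal numpy/scipy sketch of eq. (1); the function name cs_rbf_gram and the dense-then-sparsify construction are our own scaffolding, not something prescribed by the paper.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import cdist

def cs_rbf_gram(X, Z, sigma):
    """CS-RBF Gram matrix of eq. (1): a Gaussian RBF kernel tapered by
    max{0, (1 - ||x-z||/theta0)}^nu0 with cut-off theta0 = 3*sigma."""
    d = X.shape[1]
    nu0 = d if d % 2 == 1 else d + 1          # nu0 >= (d+1)/2 is satisfied
    theta0 = 3.0 * sigma                      # cut-off point from the text
    D = cdist(X, Z)                           # pairwise Euclidean distances
    taper = np.maximum(0.0, 1.0 - D / theta0) ** nu0
    K = taper * np.exp(-(D ** 2) / sigma ** 2)
    return sparse.csr_matrix(K)               # zero beyond theta0 -> sparse
```

For truly large $N$ one would of course compute only the pairs with $\|x-z\| < \theta_0$ (e.g. via a k-d tree range query) rather than forming the dense distance matrix first.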

3 Nonlinear function estimation using LS-SVMs

We test the CS-RBF kernel in the context of LS-SVMs for nonlinear function estimation. This method is closely related to regularization networks, Gaussian processes and kernel ridge regression [1,2]. The emphasis in the LS-SVM formulation is on primal-dual interpretations as in standard SVMs, but simplified to a ridge regression formulation in the primal weight space, which can be infinite dimensional. In the primal weight space one has the model $y_i = w^T \varphi(x_i) + b + e_i$, with $\varphi(\cdot)$ the mapping to a high dimensional feature space as in standard SVMs and $e_i$ the error for the $i$-th training data point. One minimizes

$$\min_{w,b,e} \; \frac{1}{2} w^T w + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N.$$

For this constrained optimization problem one constructs a Lagrangian. The dual problem gives the KKT system



$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I_N/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (2)$$

with $\alpha = [\alpha_1; \ldots; \alpha_N]$, $y = [y_1; \ldots; y_N]$ and $1_v = [1; \ldots; 1]$. This results in the model $\hat{f}(x) = \sum_{i=1}^N \alpha_i K(x, x_i) + b$, with use of the kernel trick $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. This model can be robustified and sparsified as explained in [7]. Many algorithms for solving the linear system require a positive definite matrix, which is not the case here. Therefore, one can transform this system into $H\eta = 1_v$ and $H\nu = y$ with $H = \Omega + I_N/\gamma$ positive definite. From this we find that $b = \nu^T 1_v / s$ and $\alpha = \nu - b\eta$, where $s = \eta^T 1_v$.
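A small sketch of this solution procedure (plain numpy with dense solves; the function name is ours):

```python
import numpy as np

def lssvm_fit(Omega, y, gamma):
    """Solve KKT system (2) via the positive definite transformation:
    H eta = 1_v and H nu = y, with H = Omega + I_N / gamma."""
    N = Omega.shape[0]
    H = Omega + np.eye(N) / gamma
    ones = np.ones(N)
    eta = np.linalg.solve(H, ones)
    nu = np.linalg.solve(H, y)
    s = eta @ ones                  # s = eta^T 1_v
    b = (nu @ ones) / s             # b = nu^T 1_v / s
    alpha = nu - b * eta            # alpha = nu - b * eta
    return alpha, b                 # model: f(x) = sum_i alpha_i K(x, x_i) + b
```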

A first, direct method for solving linear systems with a positive definite matrix $H$ is the Cholesky factorization [5]. An important disadvantage is that the matrix has to be completely stored in memory. Applying the CS-RBF kernel leads to a sparse matrix, so the memory requirements become proportional to the number of non-zero elements $n_z$. The computational cost is reduced by making efficient use of the zero elements in the matrix. There exist different permutation algorithms (column count permutation, symmetric minimum degree, reverse Cuthill-McKee, ...) [4] on the elements of the sparse matrix that give a higher degree of sparseness in the Cholesky factor.
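For instance, with scipy one can apply a reverse Cuthill-McKee reordering before factorizing. A sketch, assuming a sparse symmetric positive definite H; the dense cho_factor stands in for a true sparse Cholesky (e.g. scikit-sparse's CHOLMOD) to keep the example self-contained:

```python
import numpy as np
from scipy import sparse
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.csgraph import reverse_cuthill_mckee

def solve_spd_rcm(H, rhs):
    """Solve H x = rhs after a reverse Cuthill-McKee reordering, which
    concentrates the non-zeros in a narrow band around the diagonal and
    thus limits fill-in in the Cholesky factor."""
    Hs = sparse.csr_matrix(H)
    perm = reverse_cuthill_mckee(Hs, symmetric_mode=True)
    Hp = Hs[perm, :][:, perm].toarray()   # symmetrically permuted matrix
    c = cho_factor(Hp)                    # Cholesky factorization [5]
    x = np.empty(len(rhs))
    x[perm] = cho_solve(c, rhs[perm])     # undo the permutation
    return x
```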

A second important class of methods to solve linear systems are Krylov methods. Such iterative methods are suitable for solving large scale problems. The conjugate gradient (CG) method can only be applied to positive definite matrices [6],[5]. The most demanding part of this algorithm is the matrix-vector product between $H$ and the conjugate directions. This can also be reduced by a CS-RBF kernel, since only the $n_z$ non-zero elements need to be touched. In the CG method the condition number $\kappa(H)$ determines the convergence (note that this also depends on $\sigma$ and the regularization constant $\gamma$).
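A sketch of the iterative route with scipy's CG, assuming Omega, y and gamma from above are in scope:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

# Iterative solution of H eta = 1_v and H nu = y from Section 3. With a
# sparse CS-RBF Gram matrix, every matrix-vector product inside CG costs
# O(n_z) instead of O(N^2).
H = sparse.csr_matrix(Omega) + sparse.eye(Omega.shape[0]) / gamma
eta, info_eta = cg(H, np.ones(H.shape[0]))
nu, info_nu = cg(H, y)                    # info == 0 signals convergence
```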

4 Examples

In this section we investigate the use of the CS-RBF kernel on a number of artificial and real-life data sets.

4.1 Sinc toy problem

Here we compare CS-RBF and RBF kernels for a noisy sinc function $f(x) = \sin(x)/x$ estimated by LS-SVMs. The tuning parameters are selected as $\gamma = 1.5$ and $\sigma = 3.7$. The inputs were taken between $-20$ and $20$ with an interspacing of 0.03. We added Gaussian noise to the inputs with zero mean and standard deviation 0.1. Fig. 1 shows that the performance of regression with the RBF and CS-RBF kernels is almost the same. CS-RBF gives a slightly larger bias and less smooth results, and the pointwise variance of $\hat{f}(x)$ is larger for the CS-RBF kernel.

An advantage of the CS-RBF kernel is the sparse $H$ matrix. For large data sets this results in a memory reduction. In the example of the sinc function with 1334 training points, the number of non-zero elements decreases from $1334^2 = 1779556$ to $n_z = 850160$. In this one-dimensional problem the $H$ matrix also has a very clear band structure. Notice that the $H$ matrix is independent of the $y_i$ values of the training set. This means that for each regression problem with the same $x_i$ values for the training set and hyperparameter set $(\gamma, \sigma)$, the $H$ matrix has this sparse band structure. This band structure, in combination with the sparseness of the matrix, yields a speed-up in the training procedure. Depending on the method used (Cholesky or conjugate gradient), the time needed to solve the two systems is as follows: the Cholesky factorization needs 9.7450 sec of cpu-time to solve the two linear systems for the Gaussian RBF and 4.3270 sec for the compactly supported RBF; the conjugate gradient method needs respectively 4.1470 sec and 2.6540 sec. Hence, we typically observe that the compactly supported kernel results in a memory reduction and a reduction of the computational time.

Fig. 1. LS-SVM results for nonlinear function estimation on the sinc function (top: noisy data, real function and the RBF/CS-RBF estimates versus the input). The middle and bottom parts show respectively the bias and variance of both estimates for both kernels.

We also tested the influence of the localization on the condition number of the matrix $H$. Fig. 2 shows that there is only a small difference in the condition number $\kappa(H)$ for the different values of $(\gamma, \sigma)$. Therefore, the speed of convergence for CG with RBF or CS-RBF kernels is comparable.
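A sketch reproducing the data generation and the sparsity count of this experiment, reusing the cs_rbf_gram sketch from Section 2 (the seed and the placement of the noise on the observed function values are our assumptions):

```python
import numpy as np

# Noisy sinc experiment of Section 4.1: 1334 inputs on [-20, 20) with
# interspacing 0.03, Gaussian noise with standard deviation 0.1.
rng = np.random.default_rng(0)
X = np.arange(-20.0, 20.0, 0.03).reshape(-1, 1)   # 1334 points
y = np.sinc(X / np.pi).ravel() + 0.1 * rng.standard_normal(len(X))

sigma = 3.7
K = cs_rbf_gram(X, X, sigma)                       # sketch from Section 2
print(K.count_nonzero(), "non-zeros out of", len(X) ** 2)
# the paper reports n_z = 850160 out of 1334^2 = 1779556
```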

4.2 Boston housing data

As a second example, we tested the Boston housing data set. This data set consists of 506 cases with 14 attributes. We trained LS-SVMs on 406 randomly selected training data points and used 100 points as test set. We normalized the data except the binary variables. In Table 1 we show the performance of the LS-SVM for different values of the hyperparameter $\sigma$, where $\gamma = 30$ is kept constant. The performances of the RBF and CS-RBF kernels were comparable on all performed tests. We see that by decreasing $\sigma$, the $H$ matrix becomes more and more sparse as a result of the localization of the kernel.

           σ = 1.5   σ = 2.0   σ = 10
MSEtr      5.8e-3    1.3e-2    1.20e-1
MSEtest    1.1e-1    1.0e-1    8.45e-2
nz/N²      0.37      0.84      1

Table 1. Performance for different values of the bandwidth σ for CS-RBF kernels on the Boston housing data. MSEtr and MSEtest are respectively the mean squared error on the training and test set. The ratio nz/N² characterizes the degree of sparseness in the Gram matrix.

Fig. 2. The logarithm of the condition number $\log(\kappa(H))$ for different hyperparameters $(\gamma, \sigma)$: (left) RBF kernel; (right) CS-RBF kernel. Notice that the condition number is not significantly larger for the CS-RBF.

4.3 Santa Fe chaotic laser data time series prediction

In a third example, we use LS-SVM for time series prediction on the Santa Fe laser data set. The model that we use is based on a trained one-step-ahead predictor $\hat{y}_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-n})$ with $n = 50$, where $y_k$ denotes the true output at discrete time instant $k$. In Fig. 3 we see that a good iterative prediction performance is obtained for the RBF kernel with hyperparameters $(\gamma, \sigma) = (70, 4)$ found by 10-fold cross-validation. For the same hyperparameters the CS-RBF kernel performs very badly, as can be seen in Fig. 3. To obtain almost similar performance, either the cut-off point $\theta_0 = 50$ or the bandwidth of the CS-RBF kernel has to be increased. Unfortunately, both reduce the degree of sparseness in the Gram matrix to zero.
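A sketch of the iterative (free-run) prediction scheme used here; `model` stands for any fitted one-step-ahead regressor with a predict method, which is our assumption and not an API from the paper:

```python
import numpy as np

def iterative_prediction(model, history, n_steps, n=50):
    """Free-run prediction with a one-step-ahead model
    y_hat_k = f(y_{k-1}, ..., y_{k-n}): each prediction is fed
    back as an input for the next step."""
    window = list(history[-n:])                    # last n true outputs
    preds = []
    for _ in range(n_steps):
        x = np.array(window[::-1]).reshape(1, -1)  # (y_{k-1}, ..., y_{k-n})
        y_hat = float(np.ravel(model.predict(x))[0])
        preds.append(y_hat)
        window = window[1:] + [y_hat]              # drop oldest, feed back
    return np.array(preds)
```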

Fig. 3. Santa Fe laser data prediction: (a) (-) real data, (-.) RBF kernel, (--) CS-RBF kernel; (b) (--) CS-RBF with cut-off point $\theta_0 = 50$, having no sparseness; (c) MSE on test data as a function of the cut-off point $\theta_0$, showing bad results for smaller $\theta_0$, i.e. sparser Gram matrices.

5 Conclusions

We have studied the use of compactly supported RBF kernels based on recent work by Genton. RBF kernels are frequently used in many applications, and a compactly supported version of the RBF kernel can result in a sparse Gram matrix, thus decreasing the computational cost and memory requirements. In our study we have seen that for certain problems the generalization performance of the RBF and the compactly supported RBF remains comparable, as does the conditioning of the matrices with respect to iterative methods such as conjugate gradient. However, on a problem of chaotic time series prediction, the compactly supported RBF kernel fails to produce good results when the Gram matrix is sparse. One may conclude that compactly supported RBF kernels may be useful for some specific applications, but one should be careful using them in a general context.

Acknowledgements

Our research is supported by grants from several funding agencies and sources: Research Council KUL: Concerted Research Action GOA-Mefisto 666 (Mathematical Engineering), IDO (IOTA Oncology, Genetic networks), several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0256.97 (subspace), G.0115.01 (bio-i and microarrays), G.0240.99 (multilinear algebra), G.0197.02 (power islands), G.0407.02 (support vector machines), research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s (soft sensors), STWW-Genprom (gene promotor prediction), GBOU McKnow (Knowledge management algorithms), Eureka-Impact (MPC-control), Eureka-FLiTE (flutter modeling), several PhD grants); Belgian Federal Government: DWTC (IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006): Dynamical Systems and Control: Computation, Identification & Modelling), Program Sustainable Development PODO-II (CP-TR-18: Sustainability effects of Traffic Management Systems); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. BDM is a full professor at K.U. Leuven Belgium; JS is a professor at K.U. Leuven Belgium and a postdoctoral researcher with FWO Flanders.

References

1. Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
2. Evgeniou T., Pontil M., Poggio T., "Regularization networks and support vector machines", Advances in Computational Mathematics, 13(1), 1-50, 2000.
3. Genton M., "Classes of kernels for machine learning: a statistics perspective", Journal of Machine Learning Research, 2, 299-312, 2001.
4. Gilbert J., Moler C., Schreiber R., "Sparse matrices in Matlab: design and implementation", SIAM Journal on Matrix Analysis, 13(1), 333-356, 1992.
5. Golub G., Van Loan C., Matrix Computations, Baltimore: The Johns Hopkins University Press, 2nd ed., 1990.
6. Greenbaum A., Iterative Methods for Solving Linear Systems, Philadelphia: SIAM, 1997.
7. Suykens J.A.K., De Brabanter J., Lukas L., Vandewalle J., "Weighted least squares support vector machines: robustness and sparse approximation", Neurocomputing,
