Sparsifying the Gram Matrix in LS-SVM Regression Models
B. Hamers, J.A.K. Suykens, B. De Moor
K.U. Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
{bart.hamers,johan.suykens}@esat.kuleuven.ac.be
Abstract. In this paper we investigate the use of compactly supported RBF kernels for nonlinear function estimation with LS-SVMs. The choice of compact kernels, recently proposed by Genton, may lead to computational improvements and memory reduction. Examples, however, illustrate that compactly supported RBF kernels may lead to severe loss in generalization performance for some applications, e.g. in chaotic time-series prediction. As a result, the usefulness of such kernels may be much more application dependent than the use of the RBF kernel.

Keywords. Support vector machines, nonlinear function estimation, compactly supported kernels, direct and iterative methods.
1 Introduction
Recently, kernel methods for pattern recognition and nonlinear function estimation have received a lot of attention. The performance of these methods is often excellent, although one of the disadvantages is the scaling to larger data sets. This is caused by the fact that many optimization methods demand the storage of the Gram matrix. Genton [3] recently showed an efficient method for constructing kernels with compact support without destroying the positive definiteness of the kernel. In this paper we study the consequences of using compactly supported RBF kernels. RBF kernels are frequently used in nonlinear function estimation problems [2]. A compactified version of this kernel could be computationally attractive. In this paper we apply this kernel to a number of toy problems and real life data sets. As a result, we observe that on certain problems, such as chaotic time series prediction, the use of compactly supported RBF kernels leads to loss in generalization performance, while for other problems (e.g. in lower dimensional problems) the quality of the results is comparable.

This paper is organized as follows. In Section 2 we discuss the compactly supported RBF kernel. In Section 3 we discuss methods for solving LS-SVM systems and how to exploit sparseness in the Gram matrix. In Section 4 illustrations on artificial and real life data sets are given.
2 Kernel matrix and compactly supported kernels
The kernel functions that are used in the support vector literature [1] are functions $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying the Mercer condition. Given a training data set $\{x_i, y_i\}_{i=1}^N$ with inputs $x_i \in \mathbb{R}^d$ and outputs $y_i \in \mathbb{R}$, the kernel matrix or Gram matrix $\Omega \in \mathbb{R}^{N \times N}$ is positive (semi-)definite, where $\Omega_{ij} = K(x_i, x_j)$.
In nonlinear function estimation, a frequently used kernel is the radial basis function (RBF) kernel $K(x,z) = \exp(-\|x-z\|^2/\sigma^2)$, where $\sigma \in \mathbb{R}$ is a tuning parameter of the model. This Gaussian kernel is a special case of the class of Matérn type kernels [3]. An important property of this class of kernels is that they can easily be transformed into compactly supported kernels. This means that the kernel will be zero if $\|x-z\|$ is larger than a cut-off distance $\theta_0$. As explained in Genton [3], one can multiply the kernel by $\max\{0, (1 - \|x-z\|/\theta_0)^{\nu_0}\}$, where $\theta_0 > 0$ and $\nu_0 \geq (d+1)/2$, to ensure positive definiteness. The danger of cutting off a kernel in another way is that one will lose positive definiteness. In this paper we investigate the use of the compactly supported Gaussian RBF kernel (CS-RBF)

\[
K(x,z) = \max\left(0,\, 1 - \frac{\|x-z\|}{3\sigma}\right)^{\nu_0} \exp\left(-\frac{\|x-z\|^2}{\sigma^2}\right). \tag{1}
\]

In order to avoid having too many extra parameters we decided to take the cut-off point $\theta_0 = 3\sigma$, where $\sigma$ denotes the bandwidth of the Gaussian RBF kernel. $\nu_0$ is chosen to be equal to the dimension of the input variables for the odd cases; when the dimension is even, it is augmented by one.
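As an illustration, the following is a minimal sketch (not the authors' code) of assembling the CS-RBF Gram matrix of Eq. (1) as a sparse matrix; the function name and the use of NumPy/SciPy are our own choices, and the dense distance computation is only for clarity — for genuinely large data sets one would compute only the pairs within the cut-off distance, e.g. with a spatial tree.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

def cs_rbf_gram(X, sigma):
    """CS-RBF Gram matrix of Eq. (1) with cut-off theta_0 = 3*sigma."""
    d = X.shape[1]
    nu0 = d if d % 2 == 1 else d + 1              # nu_0 as chosen in the paper
    D = cdist(X, X)                               # pairwise Euclidean distances
    taper = np.maximum(0.0, 1.0 - D / (3.0 * sigma)) ** nu0
    K = taper * np.exp(-(D ** 2) / sigma ** 2)    # exactly zero for D >= 3*sigma
    return csr_matrix(K)                          # store only the n_z non-zeros
```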
3 Nonlinear function estimation using LS-SVMs
We test the CS-RBF kernel in the context of LS-SVMs for nonlinear function estimation. This method is closely related to regularization networks, Gaussian processes and kernel ridge regression [1,2]. The emphasis in the LS-SVM formulation is on primal-dual interpretations as in standard SVMs, but simplified to a ridge regression formulation in the primal weight space, which can be infinite dimensional. In the primal weight space one has the model $y_i = w^T \varphi(x_i) + b + e_i$, with $\varphi(\cdot)$ the mapping to a high dimensional feature space as in standard SVMs; $e_i$ denotes the error for the $i$-th training data point. One minimizes $\min_{w,b,e} \frac{1}{2}w^T w + \gamma \frac{1}{2}\sum_{i=1}^N e_i^2$ s.t. $y_i = w^T \varphi(x_i) + b + e_i$ for $i = 1, \ldots, N$. For this constrained optimization problem one constructs a Lagrangian. The dual problem gives the KKT system

\[
\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I_N/\gamma \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix}
\tag{2}
\]

with $\alpha = [\alpha_1; \ldots; \alpha_N]$, $y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$. This results in the model $\hat{f}(x) = \sum_{i=1}^N \alpha_i K(x, x_i) + b$ with use of the kernel trick $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
This model can be robustified and sparsified as explained in [7]. Many algorithms for solving the linear system require a positive definite matrix, which is not the case here. Therefore, one can transform this system into $H\nu = 1_v$ and $H\eta = y$ with $H = \Omega + I_N/\gamma$ positive definite. From this we find that $b = \eta^T 1_v / s$ and $\alpha = \eta - b\nu$, where $s = \nu^T 1_v$.
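As a minimal sketch (our own naming, NumPy assumed), the two systems and the recovery of $b$ and $\alpha$ look as follows; the two right-hand sides are solved with a single factorization of $H$:

```python
import numpy as np

def lssvm_solve(Omega, y, gamma):
    """Solve H nu = 1_v and H eta = y, then recover b and alpha."""
    N = Omega.shape[0]
    H = Omega + np.eye(N) / gamma       # H = Omega + I_N/gamma, positive definite
    ones = np.ones(N)
    # Both systems share the matrix H, so solve them in one call.
    nu, eta = np.linalg.solve(H, np.column_stack([ones, y])).T
    b = (eta @ ones) / (nu @ ones)      # b = eta^T 1_v / s with s = nu^T 1_v
    alpha = eta - b * nu
    return alpha, b

# Predictions then follow the dual model: f_hat(x) = sum_i alpha_i K(x, x_i) + b.
```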
A first way to solve these two linear systems with the positive definite matrix $H$ is the Cholesky factorization [5]. An important disadvantage is that the matrix has to be completely stored in memory. Applying the CS-RBF kernel leads to a sparse matrix. The memory requirements then become proportional to the number of non-zero elements $n_z$. The computational cost is also reduced by making efficient use of the zero elements in the matrix. There exist different permutation algorithms (column count permutation, symmetric minimum degree, reverse Cuthill-McKee, ...) [4] on the elements of the sparse matrix that give a higher degree of sparseness in the Cholesky factor; a sketch of such a reordering is given below.
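For instance, SciPy exposes the reverse Cuthill-McKee permutation; since SciPy itself has no sparse Cholesky routine, the sketch below uses its sparse LU as a stand-in factorization, and the function name is our own.

```python
from scipy.sparse import csc_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import splu

def reordered_factor(H):
    """Permute a sparse symmetric H towards band form, then factorize it."""
    H = csc_matrix(H)
    perm = reverse_cuthill_mckee(H, symmetric_mode=True)  # bandwidth-reducing order
    Hp = csc_matrix(H[perm, :][:, perm])                  # symmetric permutation
    return splu(Hp), perm     # the factor of the banded Hp suffers less fill-in
```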
A second important class of methods to solve linear systems are Krylov methods. Such iterative methods are suitable for solving large scale problems. The conjugate gradient (CG) method can only be applied to positive definite matrices [6], [5]. The most demanding part in this algorithm is the matrix-vector product between $H$ and the conjugate directions. This cost can also be reduced by a CS-RBF kernel, since each product only involves the $n_z$ non-zero elements. In the CG method the condition number $\kappa(H)$ determines the convergence (note that this also depends on $\sigma$ and the regularization constant $\gamma$).
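A minimal sketch of this iterative route, under the same naming as above; `scipy.sparse.linalg.cg` stands in for whatever CG implementation one prefers:

```python
import numpy as np
from scipy.sparse.linalg import cg

def lssvm_solve_cg(H, y):
    """Iteratively solve H nu = 1_v and H eta = y; H may be sparse."""
    ones = np.ones(H.shape[0])
    nu, info1 = cg(H, ones)            # each iteration costs one product with H
    eta, info2 = cg(H, y)
    assert info1 == 0 and info2 == 0   # info == 0 means CG converged
    b = (eta @ ones) / (nu @ ones)
    return eta - b * nu, b             # alpha, b
```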
4 Examples
In this section we investigate the use of the CS-RBF kernel on a number of artificial and real-life data sets.
4.1 Sinc toy problem

Here we compare CS-RBF and RBF kernels for a noisy sinc function $f(x) = \sin(x)/x$ estimated by LS-SVMs. The tuning parameters are selected as $\gamma = 1.5$ and $\sigma = 3.7$. The inputs were taken between $-20$ and $20$ with an interspacing of 0.03. We added Gaussian noise to the inputs with zero mean and standard deviation 0.1. Fig. 1 shows that the performance of regression with the RBF and CS-RBF kernels is almost the same. The CS-RBF kernel gives a slightly larger bias and less smooth results. The pointwise variance of $\hat{f}(x)$ is larger for the CS-RBF kernel.
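For concreteness, here is a small sketch of this data set under one literal reading of the setup (the noise perturbing the inputs at which the sinc function is sampled); the random seed and the NumPy idioms are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is our own choice
x = np.arange(-20, 20, 0.03)             # 1334 training inputs, as in the text
noise = rng.normal(0.0, 0.1, x.shape)    # zero-mean Gaussian, std 0.1
y = np.sinc((x + noise) / np.pi)         # np.sinc(t) = sin(pi t)/(pi t), i.e. sin(x)/x
# The pairs (x_i, y_i) then feed the LS-SVM training of Section 3.
```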
An advantage of the CS-RBF kernel is the sparse $H$ matrix. For large data sets this results in a memory reduction. In the example of the sinc function with 1334 training points, the number of non-zero elements decreases from $1334^2 = 1779556$ to $n_z = 850160$. In this one-dimensional problem the $H$ matrix also has a very clear band structure. Notice that the $H$ matrix is independent of the $y_i$ values of the training set. This means that for each regression problem with the same $x_i$ values for the training set and hyperparameter set $(\gamma, \sigma)$, the $H$ matrix has this sparse band structure. This band structure, in combination with the sparseness in the matrix, gives a speed-up in the training procedure. Depending on the method used (Cholesky or conjugate gradient), the time needed to solve the two systems is the following: the Cholesky factorization needs 9.7450 sec of cpu-time to solve the two linear systems for the Gaussian RBF and 4.3270 sec for the compactly supported RBF. The conjugate gradient method needs respectively 4.1470 sec and 2.6540 sec. Hence, we typically observe that the compactly supported kernel results in a memory reduction and a reduction in computation time.
Fig. 1. LS-SVM results for nonlinear function estimation on the sinc function (top: the noisy data and the estimates with the RBF and CS-RBF kernels). The middle and bottom parts show respectively the bias and the variance of both estimates for both kernels.
We also tested the influence of the localization on the condition number of the matrix $H$. Fig. 2 shows that there is only a small difference in the condition number $\kappa(H)$ for the different values of $(\gamma, \sigma)$. Therefore, the speed of convergence for CG with RBF or CS-RBF kernels is comparable.
4.2 Boston housing data

As a second example, we tested the Boston housing data set. This data set consists of 506 cases with 14 attributes. We trained LS-SVMs on 406 randomly selected training data points and used 100 points as test set. We normalized the data except the binary variables. In Table 1 we show the performance of the LS-SVM for different values of the hyperparameter $\sigma$, where $\gamma = 30$ is kept constant. The performances of RBF and CS-RBF kernels were comparable on all performed tests. We see that by decreasing $\sigma$, the $H$ matrix becomes more and more sparse as a result of the localization of the kernel.
             σ = 1.5    σ = 2.0    σ = 10
  MSEtr      5.8e-3     1.3e-2     1.20e-1
  MSEtest    1.1e-1     1.0e-1     8.45e-2
  n_z/N²     0.37       0.84       1

Table 1. Performance for different values of the bandwidth σ for CS-RBF kernels on the Boston housing data. MSEtr and MSEtest are respectively the mean squared error on the training and test set. The ratio n_z/N² characterizes the degree of sparseness in the Gram matrix.
Fig. 2. This figure shows the logarithm of the condition number, log(κ(H)), for different hyperparameters (γ, σ): (left) RBF kernel; (right) CS-RBF kernel. Notice that the condition number is not significantly larger for the CS-RBF.
4.3 Santa Fe chaotic laser data time series prediction

In a third example, we use LS-SVMs for time series prediction on the Santa Fe laser data set. The model that we use is based on a trained one-step-ahead predictor $\hat{y}_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-n})$ with $n = 50$, where $y_k$ denotes the true output at discrete time instant $k$. In Fig. 3 we see that a good iterative prediction performance is obtained for the RBF kernel with hyperparameters $(\gamma, \sigma) = (70, 4)$ found by 10-fold crossvalidation. For the same hyperparameters the CS-RBF kernel has a very bad performance, as can be seen in Fig. 3. For almost similar performance either the cut-off point $\theta_0 = 50\sigma$ or the bandwidth of the CS-RBF kernel has to be increased. Unfortunately, both reduce the degree of sparseness in the Gram matrix to zero. A sketch of the lag-vector construction and the iterative use of the predictor is given below.
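The following is our own construction, with a hypothetical trained one-step-ahead model `predict` (e.g. an LS-SVM evaluated on a lag vector); it shows the embedding with $n = 50$ and the iterative prediction, where each output is fed back as an input.

```python
import numpy as np

def embed(series, n=50):
    """Build lag vectors [y_{k-1}, ..., y_{k-n}] with targets y_k."""
    X = np.array([series[k - n:k][::-1] for k in range(n, len(series))])
    y = np.asarray(series[n:])
    return X, y

def iterate_prediction(predict, history, steps, n=50):
    """Iterative prediction: feed each one-step-ahead output back as input."""
    window = list(history[-n:])                  # last n observations, chronological
    out = []
    for _ in range(steps):
        lag = np.array(window[::-1])[None, :]    # [y_{k-1}, ..., y_{k-n}]
        y_next = float(predict(lag))
        out.append(y_next)
        window = window[1:] + [y_next]           # slide the window over the prediction
    return np.array(out)
```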
Fig. 3. Santa Fe laser data prediction: (a) (-) real data, (-.) RBF kernel, (--) CS-RBF kernel; (b) (--) CS-RBF with cut-off point $\theta_0 = 50\sigma$, having no sparseness; (c) MSE on test data with respect to the cut-off point $\theta_0$, showing bad results for smaller $\theta_0$, i.e. sparser Gram matrices.

5 Conclusions
We have studied the use of compactly supported RBF kernels based on recent work by Genton. RBF kernels are frequently used for many applications. The use of a compactly supported version of the RBF kernel could result in a sparse Gram matrix, and thus decrease the computational cost and memory requirements. In our study we have seen that for certain problems the generalization performance of the RBF and the compactly supported RBF remains comparable, as well as the conditioning of the matrices with respect to iterative methods such as conjugate gradients. However, on a problem of chaotic time series prediction the compactly supported RBF kernel fails to produce good results when having a sparse Gram matrix. As a result one may conclude that compactly supported RBF kernels may be useful for some specific applications, but one should be careful using them in a general context.
Acknowledgements

Our research is supported by grants from several funding agencies and sources: Research Council KUL: Concerted Research Action GOA-Mefisto 666 (Mathematical Engineering), IDO (IOTA Oncology, Genetic networks), several PhD/postdoc & fellow grants; Flemish Government: Fund for Scientific Research Flanders (several PhD/postdoc grants, projects G.0256.97 (subspace), G.0115.01 (bio-i and microarrays), G.0240.99 (multilinear algebra), G.0197.02 (power islands), G.0407.02 (support vector machines), research communities ICCoS, ANMMM), AWI (Bil. Int. Collaboration Hungary/Poland), IWT (Soft4s (soft sensors), STWW-Genprom (gene promotor prediction), GBOU-McKnow (Knowledge management algorithms), Eureka-Impact (MPC-control), Eureka-FLiTE (flutter modeling), several PhD grants); Belgian Federal Government: DWTC (IUAP IV-02 (1996-2001) and IUAP V-10-29 (2002-2006): Dynamical Systems and Control: Computation, Identification & Modelling), Program Sustainable Development PODO-II (CP-TR-18: Sustainability effects of Traffic Management Systems); Direct contract research: Verhaert, Electrabel, Elia, Data4s, IPCOS. BDM is a full professor at K.U.Leuven Belgium, JS is a professor at K.U.Leuven Belgium and a postdoctoral researcher with FWO Flanders.
References

1. Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
2. Evgeniou T., Pontil M., Poggio T., "Regularization networks and support vector machines", Advances in Computational Mathematics, 13(1), 1-50, 2000.
3. Genton M., "Classes of kernels for machine learning: a statistics perspective", Journal of Machine Learning Research, 2, 299-312, 2001.
4. Gilbert J., Moler C., Schreiber R., "Sparse matrices in Matlab: design and implementation", SIAM Journal on Matrix Analysis and Applications, 13(1), 333-356, 1992.
5. Golub G., Van Loan C., Matrix Computations, Baltimore: The Johns Hopkins University Press, 2nd ed., 1990.
6. Greenbaum A., Iterative Methods for Solving Linear Systems, Philadelphia: SIAM, 1997.
7. Suykens J.A.K., De Brabanter J., Lukas L., Vandewalle J., "Weighted least squares support vector machines: robustness and sparse approximation", Neurocomputing, 48(1-4), 85-105, 2002.