Least Squares Support Vector Machine Classifiers
T. Van Gestel, J.A.K. Suykens, B. De Moor & J. Vandewalle
K.U. Leuven, Dept. of Electrical Engineering, ESAT-SISTA,
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Abstract. Automatic Relevance Determination (ARD) has been applied to multilayer perceptrons by inferring different regularization parameters for the input interconnection layer within the evidence framework. In this paper, this idea is extended towards Least Squares Support Vector Machines (LS-SVMs) for classification. Relating a probabilistic framework to the LS-SVM formulation on the first level of Bayesian inference, the hyperparameters are inferred on the second level. Model comparison is performed on the third level in order to select the parameters of the kernel function. ARD is performed by introducing a diagonal weighting matrix in the kernel function. These diagonal elements are obtained by evidence maximization on the third level of inference. Inputs with a low weight value are less relevant and can be removed.
1. Introduction

The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) [1, 3]. The model parameters are inferred from the data by applying Bayes' rule on the first level of inference, with the prior and likelihood corresponding to the regularization and error term, respectively. The hyperparameters that control the trade-off between error minimization and regularization are inferred on the second level. Model comparison can be performed on the third level. Automatic Relevance Determination (ARD) [3, 5] involves the automatic determination of relevant inputs. Within the evidence framework, ARD is applied to MLPs by introducing additional regularization hyperparameters for the interconnections of each input. Evidence maximization is used to infer the regularization parameters, and input selection can be performed by removing inputs with relatively large regularization constants.
E-mail: {tony.vangestel,johan.suykens}@esat.kuleuven.ac.be. T. Van Gestel and J.A.K. Suykens are a Research Assistant and a Postdoctoral Researcher with the Fund for Scientific Research-Flanders (FWO-Vlaanderen), respectively. This work was partially supported by grants and projects from the Flemish Gov. (Research Council KULeuven: Grants, GOA-Mefisto 666; FWO-Vlaanderen: Grants, res. proj. G.0240.99, G.0256.97, and comm. (ICCoS and ANMMM); AWI: Bil. Int. Coll.; IWT: STWW Eureka SINOPSYS, IMPACT) and from the Belgian Fed. Gov. (Interuniv. Attr. Poles: IUAP-IV/02, IV/24; Program Dur. Dev.).
In Support Vector Machines (SVMs) [8], the inputs $x \in \mathbb{R}^n$ are preprocessed in a nonlinear way by the mapping $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_f}$ that maps the input $x \to \varphi(x)$ to a high $n_f$-dimensional feature space. A linear decision line is then constructed in the feature space. The mapping $\varphi(x)$ is never explicitly calculated and Mercer's condition $\varphi(x_1)^T \varphi(x_2) = K(x_1, x_2)$ is applied instead. The weights and bias term of the SVM and LS-SVM can be obtained by applying Bayes' rule on the first level of inference [2, 9]. Hyperparameters are inferred on the second level, while the kernel parameters are obtained from model comparison on the third level of inference. In this paper, an ARD algorithm is proposed for LS-SVM classifiers [8] within the Bayesian evidence framework [9]. Since the mapping $\varphi(\cdot)$ is not explicitly known, the weights of the input layer are also unknown, and in this sense ARD by optimal hyperparameter selection on level 2 cannot be applied. Instead, ARD for SVMs is obtained by assigning a weight [6] to each input of the kernel function $K$. These weights are inferred by applying model comparison on the third level of Bayesian inference.

This paper is organized as follows. The inference of the model parameters and hyperparameters on levels 1 and 2 is reviewed in Sections 2 and 3, respectively. Automatic Relevance Determination by model comparison on level 3 is discussed in Section 4. An example is given in Section 5.
2. Inference of the Model Parameters (Level 1)

The LS-SVM classifier $y = \mathrm{sign}[w^T \varphi(x) + b]$ is inferred from the data $D = \{(x_i, y_i)\}_{i=1}^N$ by minimizing the cost function [8]

$$\min_{w,b} J_1(w,b) = \mu E_W + \zeta E_D = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \qquad (1)$$

subject to the constraints

$$e_i = 1 - y_i (w^T \varphi(x_i) + b), \quad i = 1, \ldots, N. \qquad (2)$$

The regularization and error term are defined as $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^N e_i^2$, respectively. The trade-off between regularization and training error is determined by the ratio $\gamma = \zeta/\mu$.
This cost function is obtained in [8] by modifying Vapnik's SVM formulation [10] so as to obtain a linear system in the dual space. Constructing the Lagrangian by introducing the Lagrange multipliers $\alpha_i$ for the equality constraints (2), a linear system is obtained in the dual space

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \qquad (3)$$

with $Y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$, $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and where Mercer's condition is applied within the $\Omega$ matrix: $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$. Possible kernel functions are, e.g., a linear kernel $K(x_1, x_2) = x_1^T x_2$ and an RBF kernel $K(x_1, x_2) = \exp(-\|x_1 - x_2\|_2^2/\sigma^2)$, where Mercer's condition holds for all possible choices of the kernel parameter $\sigma \in \mathbb{R}$. The LS-SVM classifier is then constructed as follows:

$$y(x) = \mathrm{sign}\Big[\sum_{i=1}^N \alpha_i y_i K(x, x_i) + b\Big], \qquad (4)$$

with latent variable $z = \sum_{i=1}^N \alpha_i y_i K(x, x_i) + b$, by definition.
A probabilistic interpretation for (1)-(2) is obtained by applying Bayes' rule

$$P(w,b \mid D, \log\mu, \log\zeta, \mathcal{H}) = \frac{P(D \mid w,b,\log\mu,\log\zeta,\mathcal{H})\, P(w,b \mid \log\mu,\log\zeta,\mathcal{H})}{P(D \mid \log\mu,\log\zeta,\mathcal{H})}, \qquad (5)$$

where the model $\mathcal{H}$ corresponds to the kernel function $K$, possibly with kernel parameters. The evidence $P(D \mid \log\mu,\log\zeta,\mathcal{H})$ is a normalizing constant. The prior is assumed to be of the form $P(w,b \mid \log\mu,\log\zeta,\mathcal{H}) = P(w \mid \log\mu,\mathcal{H})\, P(b \mid \mathcal{H})$, with $P(b \mid \mathcal{H})$ a non-informative uniform distribution. A Gaussian prior $P(w \mid \log\mu,\mathcal{H}) = (\frac{\mu}{2\pi})^{n_f/2} \exp(-\frac{\mu}{2} w^T w)$ is assumed. The likelihood is equal to $P(D \mid w,b,\log\zeta,\mathcal{H}) = \prod_{i=1}^N P(x_i \mid y_i, w, b, \log\zeta, \mathcal{H})\, P(y_i \mid w,b,\log\zeta,\mathcal{H})$, with the constant prior probabilities $P(y_i \mid w,b,\log\zeta,\mathcal{H})$ and where the following conditional probability is assumed: $P(x_i \mid y_i,w,b,\log\zeta,\mathcal{H}) = (\frac{\zeta}{2\pi})^{1/2} \exp[-\frac{\zeta}{2}(1 - y_i(w^T\varphi(x_i) + b))^2]$. By applying Bayes' rule (5), we obtain the posterior probability $P(w,b \mid D,\log\mu,\log\zeta,\mathcal{H}) \propto \exp(-\frac{\mu}{2} w^T w)\exp(-\frac{\zeta}{2}\sum_{i=1}^N e_i^2)$. The maximum a posteriori estimates $w_{\mathrm{MP}}$ and $b_{\mathrm{MP}}$ are obtained by minimizing the corresponding negative logarithm (1). This is equivalent to solving the linear system (3) in the dual space.
3. Inference of Hyperparameters (Level 2)

Applying Bayes' rule on the second level of inference [2, 3, 9], we obtain:

$$P(\log\mu, \log\zeta \mid D, \mathcal{H}) \propto \sqrt{\frac{\mu^{n_f}\, \zeta^N}{\det H}}\, \exp(-J_1(w_{\mathrm{MP}}, b_{\mathrm{MP}})), \qquad (6)$$

with the Hessian $H = \partial^2 J_1(w,b)/\partial [w;b]^2$. In the optimum, the following relations hold [2, 3, 9]: $2\mu_{\mathrm{MP}} E_W(w_{\mathrm{MP}}) = \gamma_{\mathrm{eff}} - 1$ and $2\zeta_{\mathrm{MP}} E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}}) = N - \gamma_{\mathrm{eff}}$, which is the Bayesian estimate of the variance $1/\zeta = \sum_{i=1}^N e_i^2/(N - \gamma_{\mathrm{eff}})$ of the noise $e_i$. Combining both relations, we obtain a relation between $\mu_{\mathrm{MP}}$ and the ratio $\gamma_{\mathrm{MP}} = \zeta_{\mathrm{MP}}/\mu_{\mathrm{MP}}$: $2\mu_{\mathrm{MP}}[E_W(w_{\mathrm{MP}}) + \gamma_{\mathrm{MP}} E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})] = N - 1$. For the LS-SVM, the effective number of parameters [1, 2, 3, 9] is equal to:

$$\gamma_{\mathrm{eff}} = 1 + \sum_{i=1}^{N_{\mathrm{eff}}} \frac{\zeta_{\mathrm{MP}}\lambda_{G,i}}{\mu_{\mathrm{MP}} + \zeta_{\mathrm{MP}}\lambda_{G,i}} = 1 + \sum_{i=1}^{N_{\mathrm{eff}}} \frac{\gamma_{\mathrm{MP}}\lambda_{G,i}}{1 + \gamma_{\mathrm{MP}}\lambda_{G,i}}, \qquad (7)$$

where the first term is obtained because no regularization on the bias term $b$ is used. The $N_{\mathrm{eff}}$ non-zero eigenvalues $\lambda_{G,i}$ correspond to the $N_{\mathrm{eff}}$ non-zero eigenvalues of the centered Gram matrix in the feature space and are the solutions to the eigenvalue problem [9]

$$\big(I_N - \tfrac{1}{N} Y Y^T\big)\,\Omega\,\nu_{G,i} = \lambda_{G,i}\,\nu_{G,i}, \quad i = 1, \ldots, N_{\mathrm{eff}} \le N - 1. \qquad (8)$$
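A minimal numerical sketch of (7)-(8), under the centering matrix form reconstructed in (8) (the helper name and eigenvalue tolerance are ours):

```python
import numpy as np

def effective_parameters(Omega, Y, mu, zeta):
    """gamma_eff = 1 + sum_i zeta*lam_i / (mu + zeta*lam_i)   -- eq. (7),
    with lam_i the non-zero eigenvalues of the centered matrix in eq. (8)."""
    N = len(Y)
    M = np.eye(N) - np.outer(Y, Y) / N          # centering matrix of (8)
    lam = np.real(np.linalg.eigvals(M @ Omega))
    lam = lam[lam > 1e-10]                       # keep the N_eff non-zero eigenvalues
    return 1.0 + np.sum(zeta * lam / (mu + zeta * lam))
```

For $\Omega = I_N$ the centered matrix has $N - 1$ unit eigenvalues, so $\gamma_{\mathrm{eff}} = 1 + (N-1)\zeta/(\mu+\zeta)$, consistent with the bound $1 \le \gamma_{\mathrm{eff}} \le N$.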
A practical way to find $\mu_{\mathrm{MP}}$ and $\zeta_{\mathrm{MP}}$ is to solve first the following scalar minimization problem in $\gamma$ [9]:

$$\min_\gamma J_2(\gamma) = \sum_{i=1}^{N-1} \log\big[\lambda_{G,i} + \tfrac{1}{\gamma}\big] + (N-1)\log[E_W(w_{\mathrm{MP}}) + \gamma E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})], \qquad (9)$$

with $\lambda_{G,i} = 0$ for $i > N_{\mathrm{eff}}$. In this optimization problem, expressions for $E_{D,\mathrm{MP}}$ and $E_{W,\mathrm{MP}}$ are obtained from the conditions for optimality of the Lagrangian on level 1 [8, 9]: $E_{D,\mathrm{MP}} = \frac{1}{2\gamma^2}\sum_{i=1}^N \alpha_i^2$ and $E_{W,\mathrm{MP}} = \frac{1}{2}\alpha^T\Omega\alpha = \frac{1}{2}\sum_{i=1}^N \alpha_i\big(1 - \frac{\alpha_i}{\gamma} - y_i b_{\mathrm{MP}}\big)$. From the optimal $\gamma_{\mathrm{MP}}$, one easily obtains $\mu_{\mathrm{MP}}$ and $\zeta_{\mathrm{MP}}$ using the relations in the optimum between $\mu$, $\zeta$, $\gamma$, $E_W(w_{\mathrm{MP}})$ and $E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})$.
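The scalar minimization (9) and the recovery of $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ can be sketched as follows. For readability this sketch treats $E_W$ and $E_D$ as fixed and minimizes over a log-spaced grid; in the actual algorithm they depend on $\gamma$ and are recomputed by solving (3) at each step. Names and the grid search are ours.

```python
import numpy as np

def infer_hyperparameters(lam, E_W, E_D, N, grid=None):
    """Minimize J_2(gamma) from eq. (9) over a log grid, then recover
    mu_MP and zeta_MP from 2*mu*(E_W + gamma*E_D) = N - 1 and zeta = gamma*mu.
    lam: the N-1 eigenvalues lam_{G,i}, padded with zeros for i > N_eff."""
    if grid is None:
        grid = np.logspace(-4, 4, 2000)
    def J2(g):
        return np.sum(np.log(lam + 1.0 / g)) + (N - 1) * np.log(E_W + g * E_D)
    gamma = grid[np.argmin([J2(g) for g in grid])]
    mu = (N - 1) / (2.0 * (E_W + gamma * E_D))
    zeta = gamma * mu
    return gamma, mu, zeta
```

By construction the returned values satisfy the level-2 optimality relations of Section 3 exactly.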
4. Automatic Relevance Determination by Inference of Kernel Parameters (Level 3)

By applying Bayes' rule on the third level, the posterior for the model $\mathcal{H}_j$ is obtained: $P(\mathcal{H}_j \mid D) \propto P(D \mid \mathcal{H}_j)\, P(\mathcal{H}_j)$. At this level, no evidence or normalizing constant is used since it is impossible to compare all possible models $\mathcal{H}_j$. The prior $P(\mathcal{H}_j)$ over all possible models is assumed to be uniform here. Hence, we obtain $P(\mathcal{H}_j \mid D) \propto P(D \mid \mathcal{H}_j)$. The likelihood $P(D \mid \mathcal{H}_j)$ corresponds to the evidence (6) of the previous level and can be approximated by [2, 3, 9]

$$P(D \mid \mathcal{H}_j) \propto P(D \mid \log\mu_{\mathrm{MP}}, \log\zeta_{\mathrm{MP}}, \mathcal{H}_j)\, \frac{\sigma_{\log\mu|D}\,\sigma_{\log\zeta|D}}{\sigma_{\log\mu}\,\sigma_{\log\zeta}}, \qquad (10)$$

with $\sigma_{\log\mu}$, $\sigma_{\log\zeta}$ the standard deviations of the Gaussian priors (level 2) on $\log\mu$, $\log\zeta$, respectively. The error bars $\sigma_{\log\mu|D}$ and $\sigma_{\log\zeta|D}$ can be approximated [3] as follows: $\sigma^2_{\log\mu|D} \simeq \frac{2}{\gamma_{\mathrm{eff}} - 1}$ and $\sigma^2_{\log\zeta|D} \simeq \frac{2}{N - \gamma_{\mathrm{eff}}}$. The posterior (10) becomes [9]:

$$P(D \mid \mathcal{H}_j) \propto \sqrt{\frac{\mu_{\mathrm{MP}}^{N_{\mathrm{eff}}}\, \zeta_{\mathrm{MP}}^{N-1}}{(\gamma_{\mathrm{eff}} - 1)(N - \gamma_{\mathrm{eff}})\prod_{i=1}^{N_{\mathrm{eff}}}(\mu_{\mathrm{MP}} + \zeta_{\mathrm{MP}}\lambda_{G,i})}}. \qquad (11)$$

One selects the kernel parameters, e.g., $\sigma_j$ for an RBF kernel, with maximal posterior $P(D \mid \mathcal{H}_j)$.
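In practice it is numerically safer to compare models via the logarithm of (11). A sketch (our helper, valid for $1 < \gamma_{\mathrm{eff}} < N$, and defined up to the additive constant hidden in the proportionality of (11)):

```python
import numpy as np

def log_model_evidence(mu, zeta, lam, gamma_eff, N):
    """log P(D|H_j) from eq. (11), up to an additive constant:
    0.5*[N_eff*log(mu) + (N-1)*log(zeta) - log(gamma_eff - 1)
         - log(N - gamma_eff) - sum_i log(mu + zeta*lam_i)]"""
    lam = np.asarray(lam)            # the N_eff non-zero eigenvalues lam_{G,i}
    return 0.5 * (len(lam) * np.log(mu) + (N - 1) * np.log(zeta)
                  - np.log(gamma_eff - 1.0) - np.log(N - gamma_eff)
                  - np.sum(np.log(mu + zeta * lam)))
```

Candidate kernel parameters (or ARD weights, below) are then ranked by this quantity.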
For Automatic Relevance Determination, we now introduce a diagonal¹ weighting matrix [6] $U = \mathrm{diag}([u(1); \ldots; u(n)])$. Each $u(k) \in \mathbb{R}^+$ weights the corresponding input $x(k)$, $k = 1, \ldots, n$, in the kernel function $K$. For an RBF kernel, the kernel function becomes

$$K(x_1, x_2) = \exp(-(x_1 - x_2)^T U (x_1 - x_2)/\sigma^2) = \exp(-(x_1 - x_2)^T \tilde{U} (x_1 - x_2)),$$

where the positive scale parameter $\sigma$ is taken into account by defining $\tilde{U} = \mathrm{diag}(\tilde{u}) = U/\sigma^2$. The weights $\tilde{u}$ are inferred by maximizing the model evidence (11); less important inputs will have relatively small weights.

¹Instead of using a diagonal weighting matrix, the approach may be generalized towards any positive definite weighting matrix $U \in \mathbb{R}^{n \times n}$. However, a physical interpretation of the importance of the weights is less obvious when there are significantly non-zero off-diagonal elements.
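The weighted RBF kernel above can be computed directly; a minimal sketch (our code, with the scale $\sigma^2$ already absorbed into the weight vector, i.e. `u` plays the role of $\tilde{u}$):

```python
import numpy as np

def weighted_rbf_kernel(X1, X2, u):
    """K(x1, x2) = exp(-(x1 - x2)^T diag(u) (x1 - x2)),
    with u the diagonal of U_tilde = U / sigma^2."""
    diff = X1[:, None, :] - X2[None, :, :]
    u = np.asarray(u, dtype=float)
    # weighted squared distances, summed over the input dimension
    return np.exp(-np.einsum('ijk,k,ijk->ij', diff, u, diff))
```

An input $k$ with $u(k) \to 0$ drops out of the kernel entirely, which is exactly the removal criterion used in the algorithm below.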
In order to find a good starting value for optimizing $u$ with respect to (11), we will first infer the optimal $\sigma$ from (11) for $u = [1; \ldots; 1]$. This value then serves as the starting point $u = [1; \ldots; 1]/\sigma^2$ for the more complex optimization of the weights $u$. A practical algorithm consists of the following steps:

1. Normalize the inputs to zero mean and unit variance.
2. Optimize $\sigma_j$ with respect to $P(D \mid \mathcal{H}_j)$ from (11). For each $\sigma_j$, the optimal $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ and $\gamma_{\mathrm{MP}}$ are inferred on level 2 as follows:
   (a) Solve the eigenvalue problem (8).
   (b) Minimize $J_2(\gamma)$ from (9); in each step one solves the linear system (3) on level 1 and calculates $E_{W,\mathrm{MP}}$ and $E_{D,\mathrm{MP}}$.
   (c) Given $\gamma_{\mathrm{MP}}$, calculate $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ and $\gamma_{\mathrm{eff}}$.
   (d) Calculate $P(D \mid \mathcal{H}_j)$ from (11).
3. Select an initial choice for $u$, e.g., $u = [1; \ldots; 1]/\sigma^2$ (when Step 2 was the previous step) or $u = u_{\mathrm{prev}}(\ldots; l-1; l+1; \ldots)$ otherwise.
4. Optimize $u_j$ with respect to $P(D \mid \mathcal{H}_j)$ from (11). See Step 2 for the different steps on level 1 (2b) and level 2 (2a-d).
5. Remove inputs $l$ with low $u(l)$ values; go back to Step 3.
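Steps 3-5 form an outer maximize-then-prune loop, which can be sketched generically as follows. This is our simplified stand-in, not the paper's implementation: `evidence(u, active)` abstracts the level-3 posterior (11) (each evaluation would internally run Steps 2a-2d), and the multiplicative coordinate ascent is a crude substitute for the actual optimizer; the threshold is illustrative.

```python
import numpy as np

def ard_prune(u_init, evidence, threshold=1e-2, max_rounds=10, step=0.5, max_sweeps=200):
    """Steps 3-5: optimize the weights u by coordinate ascent on the model
    evidence, then drop inputs whose weight falls below `threshold`.
    Returns the final weights and the boolean mask of surviving inputs."""
    u = np.asarray(u_init, dtype=float).copy()
    active = np.ones(u.size, dtype=bool)
    for _ in range(max_rounds):
        # Step 4: crude coordinate ascent over the still-active inputs
        for _ in range(max_sweeps):
            improved = False
            for k in np.where(active)[0]:
                for factor in (1.0 + step, 1.0 / (1.0 + step)):
                    trial = u.copy()
                    trial[k] *= factor
                    if evidence(trial, active) > evidence(u, active):
                        u = trial
                        improved = True
            if not improved:
                break
        # Step 5: remove inputs whose weight dropped below the threshold
        low = active & (u < threshold)
        if not low.any():
            break
        active &= ~low
    return u, active
```

The loop terminates either when no weight falls below the threshold or after `max_rounds` pruning rounds.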
The main difference between this approach and Gaussian Processes [5] is that GPs typically infer the kernel parameters on level 2, together with the hyperparameters. The LS-SVM formulation also allows one to derive analytical expressions [2, 9], while sampling techniques have been used to design and evaluate GPs [5].
5. Example: ARD with an RBF-kernel

We illustrate the ARD algorithm for an RBF kernel on the synthetic binary classification data set from [7]. The data set consists of a training set and a test set of $N = 250$ and $N_{\mathrm{test}} = 1000$ data points, respectively. Both classes $-1$ and $+1$ have equal prior probabilities and each class is an equal mixture of two normal distributions. Due to the overlap of the distributions, the optimal theoretical performance that can be achieved is 92.0%. The original problem has two inputs ($n = 2$). The example created to illustrate ARD is inspired by [4]: a first additional input $x(3)$ is constructed from input $x(1)$ by adding Gaussian noise with variance 0.25. This input has some relevance. The second additional input $x(4)$ is zero-mean, unit-variance Gaussian noise. Then, all inputs $x(1{:}4)$ were normalized to zero mean and unit variance [1].
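The construction of the augmented inputs can be sketched as follows (our code; the underlying two-input data `X` would come from the mixture distribution of [7], which we do not reproduce here):

```python
import numpy as np

def make_ard_inputs(X, rng):
    """Augment a two-input data set as in the example: x(3) is x(1) plus
    Gaussian noise with variance 0.25, x(4) is pure N(0, 1) noise; all four
    inputs are then normalized to zero mean and unit variance (Step 1)."""
    n = len(X)
    x3 = X[:, 0] + rng.normal(0.0, np.sqrt(0.25), n)   # somewhat relevant copy
    x4 = rng.normal(0.0, 1.0, n)                       # pure noise input
    Xa = np.column_stack([X[:, 0], X[:, 1], x3, x4])
    return (Xa - Xa.mean(axis=0)) / Xa.std(axis=0)
```

By construction, $x(3)$ remains strongly correlated with $x(1)$, while $x(4)$ is uncorrelated with all other inputs; ARD should therefore drive $u(4)$ towards zero first.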
From Step 2, we obtained $\sigma = 2.54$, $\mu_{\mathrm{MP}} = 1.52$ and $\zeta_{\mathrm{MP}} = 2.67$ (with $u = [1; 1; 1; 1]$). The training set and test set performances are 89.6% and 88.5%, respectively. In Step 3, we optimized $u$ with respect to (11) for all inputs $x(1{:}4)$. This yielded $u = [0.2237; 0.1307; 0.1804; 0.0016]$, $\mu_{\mathrm{MP}} = 1.56$, $\zeta_{\mathrm{MP}} = 2.74$, with training and test set performances of 89.6% and 89.2%, respectively. The evolution of $u$ during the optimization is depicted in Figure 1. Removing input $x(4)$ with very low relevance, we restarted the optimization with $u_{\mathrm{old}} = [0.2237; 0.1307; 0.1804]$ for inputs $x(1{:}3)$. We obtained $u = [1.4276; 0.4996; 0.0869]$, $\mu_{\mathrm{MP}} = 2.31$, $\zeta_{\mathrm{MP}} = 3.02$, while the training and test performances were 90.0% and 90.8%, respectively. Input $x(1)$ is now far more important than input $x(3)$. Removal of $x(3)$ and retraining with inputs $x(1{:}2)$ yields $u = [1.9461; 0.1386]$, $\mu_{\mathrm{MP}} = 1.56$ and $\zeta_{\mathrm{MP}} = 2.87$. Training and test set performances are now 89.6% and 91.0%, respectively.

Figure 1: Evolution of $u(1)$ (+), $u(2)$, $u(3)$ and $u(4)$ (o) as a function of the number of iterations $N_{\mathrm{iter}}$ of the optimization algorithm.
6. Conclusions

An Automatic Relevance Determination (ARD) algorithm is proposed for LS-SVM classifiers within the evidence framework. A diagonal weighting matrix is introduced for the inputs of the RBF kernel. The weights are inferred on the third level of Bayesian inference. Inputs corresponding to small weights have low relevance in the kernel function and can be removed. Although the RBF kernel is known to be quite insensitive to irrelevant inputs, the generalization behavior in our experiment is improved by using a weighting matrix.
References

[1] Bishop, C.M. Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] Kwok, J.T. Integrating the evidence framework and the Support Vector Machine. In Proc. of the European Symposium on Artificial Neural Networks (ESANN 1999), 177-182, Bruges, Belgium, 1999.
[3] MacKay, D.J.C. Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems, 6, 469-505, 1995.
[4] Nabney, I. Netlab: Algorithms for Pattern Recognition, 2001, to appear.
[5] Neal, R.M. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118, Springer, New York, 1996.
[6] Poggio, T. & Girosi, F. Networks for Approximation and Learning. Proceedings of the IEEE, 78(9), 1481-1497, 1990.
[7] Ripley, B.D. Neural Networks and Related Methods for Classification, Journal of the Royal Statistical Society B, 56(3), 409-456, 1994.
[8] Suykens, J.A.K. & Vandewalle, J. Least squares support vector machine classifiers, Neural Processing Letters, 9, 293-300, 1999.
[9] Van Gestel, T., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B. & Vandewalle, J. A Bayesian Framework for Least Squares Support Vector Machine Classifiers. Report TR 00-65, ESAT-SISTA, K.U. Leuven, Belgium, 2000. Submitted for publication.
[10] Vapnik, V. Statistical learning theory, John Wiley, New York, 1998.