Least Squares Support Vector Machine Classifiers
T. Van Gestel, J.A.K. Suykens, B. De Moor & J. Vandewalle
K.U. Leuven, Dept. of Electrical Engineering, ESAT-SISTA,
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Abstract. Automatic Relevance Determination (ARD) has been applied to multilayer perceptrons by inferring different regularization parameters for the input interconnection layer within the evidence framework. In this paper, this idea is extended towards Least Squares Support Vector Machines (LS-SVMs) for classification. Relating a probabilistic framework to the LS-SVM formulation on the first level of Bayesian inference, the hyperparameters are inferred on the second level. Model comparison is performed on the third level in order to select the parameters of the kernel function. ARD is performed by introducing a diagonal weighting matrix in the kernel function. These diagonal elements are obtained by evidence maximization on the third level of inference. Inputs with a low weight value are less relevant and can be removed.
1. Introduction

The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) [1, 3]. The model parameters are inferred from the data by applying Bayes' rule on the first level of inference, with the prior and likelihood corresponding to the regularization and error term, respectively. The hyperparameters that control the trade-off between error minimization and regularization are inferred on the second level. Model comparison can be performed on the third level. Automatic Relevance Determination (ARD) [3, 5] involves the automatic determination of relevant inputs. Within the evidence framework, ARD is applied to MLPs by introducing additional regularization hyperparameters for the interconnections of each input. Evidence maximization is used to infer the regularization parameters, and input selection can be performed by removing inputs with relatively large regularization constants.
E-mail: {tony.vangestel,johan.suykens}@esat.kuleuven.ac.be. T. Van Gestel and J.A.K. Suykens are a Research Assistant and a Postdoctoral Researcher with the Fund for Scientific Research-Flanders (FWO-Vlaanderen), respectively. This work was partially supported by grants and projects from the Flemish Gov. (Research Council KULeuven: Grants, GOA-Mefisto 666; FWO-Vlaanderen: Grants, res. proj. G.0240.99, G.0256.97, and comm. (ICCoS and ANMMM); AWI: Bil. Int. Coll.; IWT: STWW Eureka SINOPSYS, IMPACT) and from the Belgian Fed. Gov. (Interuniv. Attr. Poles: IUAP-IV/02, IV/24; Program Dur. Dev.).
In Support Vector Machines (SVMs) [8], the inputs $x \in \mathbb{R}^n$ are preprocessed in a nonlinear way by the mapping $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_f}$ that maps the input $x \to \varphi(x)$ to a high $n_f$-dimensional feature space. A linear decision line is then constructed in the feature space. The mapping $\varphi(x)$ is never explicitly calculated and Mercer's condition $\varphi(x_1)^T \varphi(x_2) = K(x_1, x_2)$ is applied instead. The weights and bias term of the SVM and LS-SVM can be obtained by applying Bayes' rule on the first level of inference [2, 9]. Hyperparameters are inferred on the second level, while the kernel parameters are obtained from model comparison on the third level of inference. In this paper, an ARD algorithm is proposed for LS-SVM classifiers [8] within the Bayesian evidence framework [9]. Since the mapping $\varphi(\cdot)$ is not explicitly known, the weights of the input layer are also unknown, and in this sense ARD by optimal hyperparameter selection on level 2 cannot be applied. Instead, ARD for SVMs is obtained by assigning a weight [6] to each input of the kernel function $K$. These weights are inferred by applying model comparison on the third level of Bayesian inference.

This paper is organized as follows. The inference of the model parameters and hyperparameters on levels 1 and 2 is reviewed in Sections 2 and 3, respectively. Automatic Relevance Determination by model comparison on level 3 is discussed in Section 4. An example is given in Section 5.
2. Inference of the Model Parameters (Level 1)

The LS-SVM classifier $y = \mathrm{sign}[w^T \varphi(x) + b]$ is inferred from the data $D = \{(x_i, y_i)\}_{i=1}^N$ by minimizing the cost function [8]

$$\min_{w,b} J_1(w,b) = \mu E_W + \zeta E_D = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \qquad (1)$$

subject to the constraints

$$e_i = 1 - y_i (w^T \varphi(x_i) + b), \quad i = 1, \ldots, N. \qquad (2)$$

The regularization and error term are defined as $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^N e_i^2$, respectively. The trade-off between regularization and training error is determined by the ratio $\gamma = \zeta/\mu$.
This cost function is obtained in [8] by modifying Vapnik's SVM formulation [10] so as to obtain a linear system in the dual space. Constructing the Lagrangian by introducing the Lagrange multipliers $\alpha_i$ for the equality constraints (2), a linear system is obtained in the dual space

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \qquad (3)$$

with $Y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$, $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and where Mercer's condition is applied within the $\Omega$ matrix: $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$. Possible kernel functions are, e.g., a linear kernel $K(x_1, x_2) = x_1^T x_2$ and an RBF kernel $K(x_1, x_2) = \exp(-\|x_1 - x_2\|_2^2/\sigma^2)$, where Mercer's condition holds for all possible choices of the kernel parameter $\sigma \in \mathbb{R}$. The LS-SVM classifier is then constructed as follows:

$$y(x) = \mathrm{sign}\Big[\sum_{i=1}^N \alpha_i y_i K(x, x_i) + b\Big], \qquad (4)$$

with latent variable $z = \sum_{i=1}^N \alpha_i y_i K(x, x_i) + b$, by definition.
A probabilistic interpretation for (1)-(2) is obtained by applying Bayes' rule

$$P(w,b \mid D, \log\mu, \log\zeta, \mathcal{H}) = \frac{P(D \mid w,b,\log\mu,\log\zeta,\mathcal{H})\, P(w,b \mid \log\mu,\log\zeta,\mathcal{H})}{P(D \mid \log\mu,\log\zeta,\mathcal{H})}, \qquad (5)$$

where the model $\mathcal{H}$ corresponds to the kernel function $K$, possibly with kernel parameters. The evidence $P(D \mid \log\mu,\log\zeta,\mathcal{H})$ is a normalizing constant. The prior is assumed to be of the form $P(w,b \mid \log\mu,\log\zeta,\mathcal{H}) = P(w \mid \log\mu,\mathcal{H})\, P(b \mid \mathcal{H})$, with $P(b \mid \mathcal{H})$ a non-informative uniform distribution. A Gaussian prior $P(w \mid \log\mu,\mathcal{H}) = (\frac{\mu}{2\pi})^{n_f/2} \exp(-\frac{\mu}{2} w^T w)$ is assumed. The likelihood is equal to $P(D \mid w,b,\log\zeta,\mathcal{H}) = \prod_{i=1}^N P(x_i \mid y_i, w, b, \log\zeta, \mathcal{H})\, P(y_i \mid w,b,\log\zeta,\mathcal{H})$, with the constant prior probabilities $P(y_i \mid w,b,\log\zeta,\mathcal{H})$ and where the following conditional probability is assumed: $P(x_i \mid y_i,w,b,\log\zeta,\mathcal{H}) = (\frac{\zeta}{2\pi})^{1/2} \exp[-\frac{\zeta}{2}(1 - y_i(w^T\varphi(x_i) + b))^2]$. By applying Bayes' rule (5), we obtain the posterior probability $P(w,b \mid D,\log\mu,\log\zeta,\mathcal{H}) \propto \exp(-\frac{\mu}{2} w^T w)\exp(-\frac{\zeta}{2}\sum_{i=1}^N e_i^2)$. The maximum a posteriori estimates $w_{\mathrm{MP}}$ and $b_{\mathrm{MP}}$ are obtained by minimizing the corresponding negative logarithm (1). This is equivalent to solving the linear system (3) in the dual space.
3. Inference of Hyperparameters (Level 2)

Applying Bayes' rule on the second level of inference [2, 3, 9], we obtain:

$$P(\log\mu, \log\zeta \mid D, \mathcal{H}) \propto \sqrt{\frac{\mu^{n_f}\, \zeta^N}{\det H}}\, \exp(-J_1(w_{\mathrm{MP}}, b_{\mathrm{MP}})), \qquad (6)$$

with the Hessian $H = \partial^2 J_1(w,b)/\partial [w;b]^2$. In the optimum, the following relations hold [2, 3, 9]: $2\mu_{\mathrm{MP}} E_W(w_{\mathrm{MP}}) = \gamma_{\mathrm{eff}} - 1$ and $2\zeta_{\mathrm{MP}} E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}}) = N - \gamma_{\mathrm{eff}}$, which is the Bayesian estimate of the variance $1/\zeta = \sum_{i=1}^N e_i^2/(N - \gamma_{\mathrm{eff}})$ of the noise $e_i$. Combining both relations, we obtain a relation between $\mu_{\mathrm{MP}}$ and the ratio $\gamma_{\mathrm{MP}} = \zeta_{\mathrm{MP}}/\mu_{\mathrm{MP}}$: $2\mu_{\mathrm{MP}}[E_W(w_{\mathrm{MP}}) + \gamma_{\mathrm{MP}} E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})] = N - 1$. For the LS-SVM, the effective number of parameters [1, 2, 3, 9] is equal to:

$$\gamma_{\mathrm{eff}} = 1 + \sum_{i=1}^{N_{\mathrm{eff}}} \frac{\zeta_{\mathrm{MP}}\lambda_{G,i}}{\mu_{\mathrm{MP}} + \zeta_{\mathrm{MP}}\lambda_{G,i}} = 1 + \sum_{i=1}^{N_{\mathrm{eff}}} \frac{\gamma_{\mathrm{MP}}\lambda_{G,i}}{1 + \gamma_{\mathrm{MP}}\lambda_{G,i}}, \qquad (7)$$

where the first term is obtained because no regularization on the bias term $b$ is used. The $N_{\mathrm{eff}}$ non-zero eigenvalues $\lambda_{G,i}$ correspond to the $N_{\mathrm{eff}}$ non-zero eigenvalues of the centered Gram matrix in the feature space and are the solutions to the eigenvalue problem [9]

$$\big(I_N - \tfrac{1}{N} Y Y^T\big)\,\Omega\,\nu_{G,i} = \lambda_{G,i}\,\nu_{G,i}, \quad i = 1, \ldots, N_{\mathrm{eff}} \le N - 1. \qquad (8)$$
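A minimal numerical sketch of (7)-(8), under the centering matrix form reconstructed in (8) (the helper name and eigenvalue tolerance are ours):

```python
import numpy as np

def effective_parameters(Omega, Y, mu, zeta):
    """gamma_eff = 1 + sum_i zeta*lam_i / (mu + zeta*lam_i)   -- eq. (7),
    with lam_i the non-zero eigenvalues of the centered matrix in eq. (8)."""
    N = len(Y)
    M = np.eye(N) - np.outer(Y, Y) / N          # centering matrix of (8)
    lam = np.real(np.linalg.eigvals(M @ Omega))
    lam = lam[lam > 1e-10]                       # keep the N_eff non-zero eigenvalues
    return 1.0 + np.sum(zeta * lam / (mu + zeta * lam))
```

For $\Omega = I_N$ the centered matrix has $N - 1$ unit eigenvalues, so $\gamma_{\mathrm{eff}} = 1 + (N-1)\zeta/(\mu+\zeta)$, consistent with the bound $1 \le \gamma_{\mathrm{eff}} \le N$.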
A practical way to find $\mu_{\mathrm{MP}}$ and $\zeta_{\mathrm{MP}}$ is to solve first the following scalar minimization problem in $\gamma$ [9]:

$$\min_\gamma J_2(\gamma) = \sum_{i=1}^{N-1} \log\big[\lambda_{G,i} + \tfrac{1}{\gamma}\big] + (N-1)\log[E_W(w_{\mathrm{MP}}) + \gamma E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})], \qquad (9)$$

with $\lambda_{G,i} = 0$ for $i > N_{\mathrm{eff}}$. In this optimization problem, expressions for $E_{D,\mathrm{MP}}$ and $E_{W,\mathrm{MP}}$ are obtained from the conditions for optimality of the Lagrangian on level 1 [8, 9]: $E_{D,\mathrm{MP}} = \frac{1}{2\gamma^2}\sum_{i=1}^N \alpha_i^2$ and $E_{W,\mathrm{MP}} = \frac{1}{2}\alpha^T\Omega\alpha = \frac{1}{2}\sum_{i=1}^N \alpha_i\big(1 - \frac{\alpha_i}{\gamma} - y_i b_{\mathrm{MP}}\big)$. From the optimal $\gamma_{\mathrm{MP}}$, one easily obtains $\mu_{\mathrm{MP}}$ and $\zeta_{\mathrm{MP}}$ using the relations in the optimum between $\mu$, $\zeta$, $\gamma$, $E_W(w_{\mathrm{MP}})$ and $E_D(w_{\mathrm{MP}}, b_{\mathrm{MP}})$.
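The scalar minimization (9) and the recovery of $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ can be sketched as follows. For readability this sketch treats $E_W$ and $E_D$ as fixed and minimizes over a log-spaced grid; in the actual algorithm they depend on $\gamma$ and are recomputed by solving (3) at each step. Names and the grid search are ours.

```python
import numpy as np

def infer_hyperparameters(lam, E_W, E_D, N, grid=None):
    """Minimize J_2(gamma) from eq. (9) over a log grid, then recover
    mu_MP and zeta_MP from 2*mu*(E_W + gamma*E_D) = N - 1 and zeta = gamma*mu.
    lam: the N-1 eigenvalues lam_{G,i}, padded with zeros for i > N_eff."""
    if grid is None:
        grid = np.logspace(-4, 4, 2000)
    def J2(g):
        return np.sum(np.log(lam + 1.0 / g)) + (N - 1) * np.log(E_W + g * E_D)
    gamma = grid[np.argmin([J2(g) for g in grid])]
    mu = (N - 1) / (2.0 * (E_W + gamma * E_D))
    zeta = gamma * mu
    return gamma, mu, zeta
```

By construction the returned values satisfy the level-2 optimality relations of Section 3 exactly.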
4. Automatic Relevance Determination by Inference of Kernel Parameters (Level 3)

By applying Bayes' rule on the third level, the posterior for the model $\mathcal{H}_j$ is obtained: $P(\mathcal{H}_j \mid D) \propto P(D \mid \mathcal{H}_j)\, P(\mathcal{H}_j)$. At this level, no evidence or normalizing constant is used since it is impossible to compare all possible models $\mathcal{H}_j$. The prior $P(\mathcal{H}_j)$ over all possible models is assumed to be uniform here. Hence, we obtain $P(\mathcal{H}_j \mid D) \propto P(D \mid \mathcal{H}_j)$. The likelihood $P(D \mid \mathcal{H}_j)$ corresponds to the evidence (6) of the previous level and can be approximated by [2, 3, 9]

$$P(D \mid \mathcal{H}_j) \propto P(D \mid \log\mu_{\mathrm{MP}}, \log\zeta_{\mathrm{MP}}, \mathcal{H}_j)\, \frac{\sigma_{\log\mu|D}\,\sigma_{\log\zeta|D}}{\sigma_{\log\mu}\,\sigma_{\log\zeta}}, \qquad (10)$$

with $\sigma_{\log\mu}$, $\sigma_{\log\zeta}$ the standard deviations of the Gaussian priors (level 2) on $\log\mu$, $\log\zeta$, respectively. The error bars $\sigma_{\log\mu|D}$ and $\sigma_{\log\zeta|D}$ can be approximated [3] as follows: $\sigma^2_{\log\mu|D} \simeq \frac{2}{\gamma_{\mathrm{eff}} - 1}$ and $\sigma^2_{\log\zeta|D} \simeq \frac{2}{N - \gamma_{\mathrm{eff}}}$. The posterior (10) becomes [9]:

$$P(D \mid \mathcal{H}_j) \propto \sqrt{\frac{\mu_{\mathrm{MP}}^{N_{\mathrm{eff}}}\, \zeta_{\mathrm{MP}}^{N-1}}{(\gamma_{\mathrm{eff}} - 1)(N - \gamma_{\mathrm{eff}})\prod_{i=1}^{N_{\mathrm{eff}}}(\mu_{\mathrm{MP}} + \zeta_{\mathrm{MP}}\lambda_{G,i})}}. \qquad (11)$$

One selects the kernel parameters, e.g., $\sigma_j$ for an RBF kernel, with maximal posterior $P(D \mid \mathcal{H}_j)$.
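In practice it is numerically safer to compare models via the logarithm of (11). A sketch (our helper, valid for $1 < \gamma_{\mathrm{eff}} < N$, and defined up to the additive constant hidden in the proportionality of (11)):

```python
import numpy as np

def log_model_evidence(mu, zeta, lam, gamma_eff, N):
    """log P(D|H_j) from eq. (11), up to an additive constant:
    0.5*[N_eff*log(mu) + (N-1)*log(zeta) - log(gamma_eff - 1)
         - log(N - gamma_eff) - sum_i log(mu + zeta*lam_i)]"""
    lam = np.asarray(lam)            # the N_eff non-zero eigenvalues lam_{G,i}
    return 0.5 * (len(lam) * np.log(mu) + (N - 1) * np.log(zeta)
                  - np.log(gamma_eff - 1.0) - np.log(N - gamma_eff)
                  - np.sum(np.log(mu + zeta * lam)))
```

Candidate kernel parameters (or ARD weights, below) are then ranked by this quantity.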
For Automatic Relevance Determination, we now introduce a diagonal¹ weighting matrix [6] $U = \mathrm{diag}([u(1); \ldots; u(n)])$. Each $u(k) \in \mathbb{R}^+$ weights the corresponding input $x(k)$, $k = 1, \ldots, n$, in the kernel function $K$. For an RBF kernel, the kernel function becomes

$$K(x_1, x_2) = \exp(-(x_1 - x_2)^T U (x_1 - x_2)/\sigma^2) = \exp(-(x_1 - x_2)^T \tilde{U} (x_1 - x_2)),$$

where the positive scale parameter $\sigma$ is taken into account by defining $\tilde{U} = \mathrm{diag}(\tilde{u}) = U/\sigma^2$. The weights $\tilde{u}$ are inferred by maximizing the model evidence (11); less important inputs will have relatively small weights.

¹Instead of using a diagonal weighting matrix, the approach may be generalized towards any positive definite weighting matrix $U \in \mathbb{R}^{n \times n}$. However, a physical interpretation of the importance of the weights is less obvious when there are significantly non-zero off-diagonal elements.
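The weighted RBF kernel above can be computed directly; a minimal sketch (our code, with the scale $\sigma^2$ already absorbed into the weight vector, i.e. `u` plays the role of $\tilde{u}$):

```python
import numpy as np

def weighted_rbf_kernel(X1, X2, u):
    """K(x1, x2) = exp(-(x1 - x2)^T diag(u) (x1 - x2)),
    with u the diagonal of U_tilde = U / sigma^2."""
    diff = X1[:, None, :] - X2[None, :, :]
    u = np.asarray(u, dtype=float)
    # weighted squared distances, summed over the input dimension
    return np.exp(-np.einsum('ijk,k,ijk->ij', diff, u, diff))
```

An input $k$ with $u(k) \to 0$ drops out of the kernel entirely, which is exactly the removal criterion used in the algorithm below.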
In order to find a good starting value for optimizing $u$ with respect to (11), we will first infer the optimal $\sigma$ from (11) for $u = [1; \ldots; 1]$. This value then serves as the starting point $u = [1; \ldots; 1]/\sigma^2$ for the more complex optimization of the weights $u$. A practical algorithm consists of the following steps:

1. Normalize the inputs to zero mean and unit variance.
2. Optimize $\sigma_j$ with respect to $P(D \mid \mathcal{H}_j)$ from (11). For each $\sigma_j$, the optimal $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ and $\gamma_{\mathrm{MP}}$ are inferred on level 2 as follows:
   (a) Solve the eigenvalue problem (8).
   (b) Minimize $J_2(\gamma)$ from (9); in each step one solves the linear system (3) on level 1 and calculates $E_{W,\mathrm{MP}}$ and $E_{D,\mathrm{MP}}$.
   (c) Given $\gamma_{\mathrm{MP}}$, calculate $\mu_{\mathrm{MP}}$, $\zeta_{\mathrm{MP}}$ and $\gamma_{\mathrm{eff}}$.
   (d) Calculate $P(D \mid \mathcal{H}_j)$ from (11).
3. Select an initial choice for $u$, e.g., $u = [1; \ldots; 1]/\sigma^2$ (when Step 2 was the previous step) or $u = u_{\mathrm{prev}}(\ldots; l-1; l+1; \ldots)$ otherwise.
4. Optimize $u_j$ with respect to $P(D \mid \mathcal{H}_j)$ from (11). See Step 2 for the different steps on level 1 (2b) and level 2 (2a-d).
5. Remove inputs $l$ with low $u(l)$ values; go back to Step 3.
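Steps 3-5 form an outer maximize-then-prune loop, which can be sketched generically as follows. This is our simplified stand-in, not the paper's implementation: `evidence(u, active)` abstracts the level-3 posterior (11) (each evaluation would internally run Steps 2a-2d), and the multiplicative coordinate ascent is a crude substitute for the actual optimizer; the threshold is illustrative.

```python
import numpy as np

def ard_prune(u_init, evidence, threshold=1e-2, max_rounds=10, step=0.5, max_sweeps=200):
    """Steps 3-5: optimize the weights u by coordinate ascent on the model
    evidence, then drop inputs whose weight falls below `threshold`.
    Returns the final weights and the boolean mask of surviving inputs."""
    u = np.asarray(u_init, dtype=float).copy()
    active = np.ones(u.size, dtype=bool)
    for _ in range(max_rounds):
        # Step 4: crude coordinate ascent over the still-active inputs
        for _ in range(max_sweeps):
            improved = False
            for k in np.where(active)[0]:
                for factor in (1.0 + step, 1.0 / (1.0 + step)):
                    trial = u.copy()
                    trial[k] *= factor
                    if evidence(trial, active) > evidence(u, active):
                        u = trial
                        improved = True
            if not improved:
                break
        # Step 5: remove inputs whose weight dropped below the threshold
        low = active & (u < threshold)
        if not low.any():
            break
        active &= ~low
    return u, active
```

The loop terminates either when no weight falls below the threshold or after `max_rounds` pruning rounds.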
The main difference between this approach and Gaussian Processes [5] is that GPs typically infer the kernel parameters on level 2, together with the hyperparameters. The LS-SVM formulation also allows one to derive analytical expressions [2, 9], while sampling techniques have been used to design and evaluate GPs [5].
5. Example: ARD with an RBF-kernel

We illustrate the ARD algorithm for an RBF kernel on the synthetic binary classification data set from [7]. The data set consists of a training set and a test set of $N = 250$ and $N_{\mathrm{test}} = 1000$ data points, respectively. Both classes $-1$ and $+1$ have equal prior probabilities and each class is an equal mixture of two normal distributions. Due to the overlap of the distributions, the optimal theoretical performance that can be achieved is 92.0%. The original problem has two inputs ($n = 2$). The example created to illustrate ARD is inspired by [4]: a first additional input $x(3)$ is constructed from input $x(1)$ by adding Gaussian noise with variance 0.25. This input has some relevance. The second additional input $x(4)$ is zero-mean, unit-variance Gaussian noise. Then, all inputs $x(1{:}4)$ were normalized to zero mean and unit variance [1].
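The construction of the augmented inputs can be sketched as follows (our code; the underlying two-input data `X` would come from the mixture distribution of [7], which we do not reproduce here):

```python
import numpy as np

def make_ard_inputs(X, rng):
    """Augment a two-input data set as in the example: x(3) is x(1) plus
    Gaussian noise with variance 0.25, x(4) is pure N(0, 1) noise; all four
    inputs are then normalized to zero mean and unit variance (Step 1)."""
    n = len(X)
    x3 = X[:, 0] + rng.normal(0.0, np.sqrt(0.25), n)   # somewhat relevant copy
    x4 = rng.normal(0.0, 1.0, n)                       # pure noise input
    Xa = np.column_stack([X[:, 0], X[:, 1], x3, x4])
    return (Xa - Xa.mean(axis=0)) / Xa.std(axis=0)
```

By construction, $x(3)$ remains strongly correlated with $x(1)$, while $x(4)$ is uncorrelated with all other inputs; ARD should therefore drive $u(4)$ towards zero first.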
From Step 2, we obtained $\sigma = 2.54$, $\mu_{\mathrm{MP}} = 1.52$ and $\zeta_{\mathrm{MP}} = 2.67$ (with $u = [1; 1; 1; 1]$). The training set and test set performances are 89.6% and 88.5%, respectively. In Step 3, we optimized $u$ with respect to (11) for all inputs $x(1{:}4)$. This yielded $u = [0.2237; 0.1307; 0.1804; 0.0016]$, $\mu_{\mathrm{MP}} = 1.56$, $\zeta_{\mathrm{MP}} = 2.74$, with training and test set performances of 89.6% and 89.2%, respectively. The evolution of $u$ during the optimization is depicted in Figure 1. Removing input $x(4)$ with very low relevance, we restarted the optimization with $u_{\mathrm{old}} = [0.2237; 0.1307; 0.1804]$ for inputs $x(1{:}3)$. We obtained $u = [1.4276; 0.4996; 0.0869]$, $\mu_{\mathrm{MP}} = 2.31$, $\zeta_{\mathrm{MP}} = 3.02$, while the training and test performances were 90.0% and 90.8%, respectively. Input $x(1)$ is now far more important than input $x(3)$. Removal of $x(3)$ and retraining with inputs $x(1{:}2)$ yields $u = [1.9461; 0.1386]$, $\mu_{\mathrm{MP}} = 1.56$ and $\zeta_{\mathrm{MP}} = 2.87$. Training and test set performances are now 89.6% and 91.0%, respectively.

Figure 1: Evolution of $u(1)$ (+), $u(2)$, $u(3)$ and $u(4)$ (o) as a function of the number of iterations $N_{\mathrm{iter}}$ of the optimization algorithm.
6. Conclusions

An Automatic Relevance Determination (ARD) algorithm is proposed for LS-SVM classifiers within the evidence framework. A diagonal weighting matrix is introduced for the inputs of the RBF kernel. The weights are inferred on the third level of Bayesian inference. Inputs corresponding to small weights have low relevance in the kernel function and can be removed. Although the RBF kernel is known to be quite insensitive to irrelevant inputs, the generalization behavior in our experiment is improved by using a weighting matrix.
References

[1] Bishop, C.M. Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] Kwok, J.T. Integrating the evidence framework and the Support Vector Machine. In Proc. of the European Symposium on Artificial Neural Networks (ESANN 1999), 177-182, Bruges, Belgium, 1999.
[3] MacKay, D.J.C. Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems, 6, 469-505, 1995.
[4] Nabney, I. Netlab: Algorithms for Pattern Recognition, 2001, to appear.
[5] Neal, R.M. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118, Springer, New York, 1996.
[6] Poggio, T. & Girosi, F. Networks for Approximation and Learning. Proceedings of the IEEE, 78(9), 1481-1497, 1990.
[7] Ripley, B.D. Neural Networks and Related Methods for Classification, Journal of the Royal Statistical Society B, 56(3), 409-456, 1994.
[8] Suykens, J.A.K. & Vandewalle, J. Least squares support vector machine classifiers, Neural Processing Letters, 9, 293-300, 1999.
[9] Van Gestel, T., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B. & Vandewalle, J. A Bayesian Framework for Least Squares Support Vector Machine Classifiers. Report TR 00-65, ESAT-SISTA, K.U. Leuven, Belgium, 2000. Submitted for publication.
[10] Vapnik, V. Statistical learning theory, John Wiley, New York, 1998.