
Least Squares Support Vector Machine Classifiers

T. Van Gestel, J.A.K. Suykens, B. De Moor & J. Vandewalle

K.U. Leuven, Dept. of Electrical Engineering, ESAT-SISTA,
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

Abstract. Automatic Relevance Determination (ARD) has been applied to multilayer perceptrons by inferring different regularization parameters for the input interconnection layer within the evidence framework. In this paper, this idea is extended towards Least Squares Support Vector Machines (LS-SVMs) for classification. Relating a probabilistic framework to the LS-SVM formulation on the first level of Bayesian inference, the hyperparameters are inferred on the second level. Model comparison is performed on the third level in order to select the parameters of the kernel function. ARD is performed by introducing a diagonal weighting matrix in the kernel function. These diagonal elements are obtained by evidence maximization on the third level of inference. Inputs with a low weight value are less relevant and can be removed.

1. Introduction

The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) [1, 3]. The model parameters are inferred from the data by applying Bayes' rule on the first level of inference, with the prior and likelihood corresponding to the regularization and error term, respectively. The hyperparameters that control the trade-off between error minimization and regularization are inferred on the second level. Model comparison can be performed on the third level. Automatic Relevance Determination (ARD) [3, 5] involves the automatic determination of relevant inputs. Within the evidence framework, ARD is applied to MLPs by introducing additional regularization hyperparameters for the interconnections of each input. Evidence maximization is used to infer the regularization parameters, and input selection can be performed by removing inputs with relatively large regularization constants.



E-mail: {tony.vangestel, johan.suykens}@esat.kuleuven.ac.be. T. Van Gestel and J.A.K. Suykens are a Research Assistant and a Postdoctoral Researcher with the Fund for Scientific Research-Flanders (FWO-Vlaanderen), respectively. This work was partially supported by grants and projects from the Flemish Gov. (Research Council KULeuven: Grants, GOA-Mefisto 666; FWO-Vlaanderen: Grants, res. proj. G.0240.99, G.0256.97, and comm. (ICCoS and ANMMM); AWI: Bil. Int. Coll.; IWT: STWW Eureka SINOPSYS, IMPACT) and from the Belgian Fed. Gov. (Interuniv. Attr. Poles: IUAP-IV/02, IV/24; Program Dur. Dev.).

In Least Squares Support Vector Machines (LS-SVMs) [8], the inputs x ∈ R^n are preprocessed in a nonlinear way by the mapping ϕ(·): R^n → R^{n_f} that maps the input x → ϕ(x) to a high n_f-dimensional feature space. A linear decision line is then constructed in this feature space. The mapping ϕ(x) is never explicitly calculated and Mercer's condition ϕ(x_1)^T ϕ(x_2) = K(x_1, x_2) is applied instead. The weights and bias term of the SVM and LS-SVM can be obtained by applying Bayes' rule on the first level of inference [2, 9]. Hyperparameters are inferred on the second level, while the kernel parameters are obtained from model comparison on the third level of inference. In this paper, an ARD algorithm is proposed for LS-SVM classifiers [8] within the Bayesian evidence framework [9]. Since the mapping ϕ(·) is not explicitly known, the weights of the input layer are also unknown, and in this sense ARD by optimal hyperparameter selection on level 2 cannot be applied. Instead, ARD for SVMs is obtained by assigning a weight [6] to each input of the kernel function K. These weights are inferred by applying model comparison on the third level of Bayesian inference.

This paper is organized as follows. The inference of the model parameters and hyperparameters on levels 1 and 2 is reviewed in Sections 2 and 3, respectively. Automatic Relevance Determination by model comparison on level 3 is discussed in Section 4. An example is given in Section 5.

2. Inference of the Model Parameters (Level 1)

The LS-SVM classifier y = sign[w^T ϕ(x) + b] is inferred from the data D = {(x_i, y_i)}_{i=1}^N by minimizing the cost function [8]

  min_{w,b} J_1(w,b) = μ E_W + ζ E_D = (μ/2) w^T w + (ζ/2) Σ_{i=1}^N e_i^2,   (1)

subject to the constraints

  e_i = 1 − y_i (w^T ϕ(x_i) + b),   i = 1, ..., N.   (2)

The regularization and error term are defined as E_W = (1/2) w^T w and E_D = (1/2) Σ_{i=1}^N e_i^2, respectively. The trade-off between regularization and training error is determined by the ratio γ = ζ/μ.

This cost function is obtained in [8] by modifying Vapnik's SVM formulation [10] so as to obtain a linear system in the dual space. Constructing the Lagrangian by introducing the Lagrange multipliers α_i for the equality constraints (2), a linear system is obtained in the dual space

  [ 0   Y^T             ] [ b ]   [ 0   ]
  [ Y   Ω + γ^{-1} I_N  ] [ α ] = [ 1_v ]   (3)

with Y = [y_1; ...; y_N], 1_v = [1; ...; 1], e = [e_1; ...; e_N], α = [α_1; ...; α_N], and where Mercer's condition is applied within the Ω matrix, Ω_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) = y_i y_j K(x_i, x_j). Possible kernel functions are, e.g., a linear kernel K(x_1, x_2) = x_1^T x_2 and an RBF kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖^2 / σ^2), where Mercer's condition holds for all possible choices of the kernel parameter σ ∈ R. The LS-SVM classifier is then constructed as follows:

  y(x) = sign[ Σ_{i=1}^N α_i y_i K(x, x_i) + b ],   (4)

with latent variable z = Σ_{i=1}^N α_i y_i K(x, x_i) + b, by definition.
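
For concreteness, the construction of (3)-(4) can be sketched as follows. This is only an illustration, not the authors' implementation; the helper names and the use of numpy are assumptions.

    import numpy as np

    def rbf_kernel(X1, X2, sigma2):
        # K(x1, x2) = exp(-||x1 - x2||^2 / sigma^2)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / sigma2)

    def lssvm_train(K, y, gamma):
        # Solve the dual system (3) for a precomputed kernel matrix K.
        N = len(y)
        Omega = np.outer(y, y) * K
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = Omega + np.eye(N) / gamma
        rhs = np.concatenate(([0.0], np.ones(N)))
        sol = np.linalg.solve(A, rhs)
        return sol[0], sol[1:]                     # b, alpha

    def lssvm_predict(K_new, y, b, alpha):
        # Classifier (4), with K_new[m, i] = K(x_new_m, x_i).
        return np.sign(K_new @ (alpha * y) + b)

With training inputs X (N x n) and labels y in {-1, +1}, a model would be obtained with, e.g., K = rbf_kernel(X, X, sigma2) followed by b, alpha = lssvm_train(K, y, gamma).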

A probabilistic interpretation for (1)-(2) is obtained by applying Bayes' rule

  P(w,b | D, log μ, log ζ, H) = P(D | w,b, log μ, log ζ, H) P(w,b | log μ, log ζ, H) / P(D | log μ, log ζ, H),   (5)

where the model H corresponds to the kernel function K, possibly with kernel parameters. The evidence P(D | log μ, log ζ, H) is a normalizing constant. The prior is assumed to be of the form P(w,b | log μ, log ζ, H) = P(w | log μ, H) P(b | H), with P(b | H) a non-informative uniform distribution. A Gaussian prior P(w | log μ, H) = (μ/2π)^{n_f/2} exp(−(μ/2) w^T w) is assumed. The likelihood is equal to P(D | w,b, log ζ, H) = Π_{i=1}^N P(x_i | y_i, w, b, log ζ, H) P(y_i | w, b, log ζ, H), with constant prior probabilities P(y_i | w, b, log ζ, H) and where the following conditional probability is assumed: P(x_i | y_i, w, b, log ζ, H) = (ζ/2π)^{1/2} exp[−(ζ/2)(1 − y_i(w^T ϕ(x_i) + b))^2]. By applying Bayes' rule (5), we obtain the posterior probability P(w,b | D, log μ, log ζ, H) ∝ exp(−(μ/2) w^T w) exp(−(ζ/2) Σ_{i=1}^N e_i^2). The maximum a posteriori estimates w_MP and b_MP are obtained by minimizing the corresponding negative logarithm (1). This is equivalent to solving the linear system (3) in the dual space.

3. Inference of Hyperparameters (Level 2)

Applying Bayes' rule on the second level of inference [2, 3, 9], we obtain:

  P(log μ, log ζ | D, H) ∝ (√(μ^{n_f} ζ^N) / √(det H)) exp(−J_1(w_MP, b_MP)),   (6)

with the Hessian H = ∂²J_1(w,b)/∂[w; b]². In the optimum, the following relations hold [2, 3, 9]: 2 μ_MP E_W(w_MP) = γ_eff − 1 and 2 ζ_MP E_D(w_MP, b_MP) = N − γ_eff, which yields the Bayesian estimate of the variance 1/ζ = Σ_{i=1}^N e_i^2 / (N − γ_eff) of the noise e_i. Combining both relations, we obtain a relation between μ_MP and the ratio γ_MP = ζ_MP/μ_MP: 2 μ_MP [E_W(w_MP) + γ_MP E_D(w_MP, b_MP)] = N − 1. For the LS-SVM, the effective number of parameters [1, 2, 3, 9] is equal to

  γ_eff = 1 + Σ_{i=1}^{N_eff} ζ_MP λ_{G,i} / (μ_MP + ζ_MP λ_{G,i}) = 1 + Σ_{i=1}^{N_eff} γ_MP λ_{G,i} / (1 + γ_MP λ_{G,i}),   (7)

where the first term is obtained because no regularization on the bias term b is used. The N_eff non-zero eigenvalues λ_{G,i} correspond to the N_eff non-zero eigenvalues of the centered Gram matrix in the feature space and are the solutions of the eigenvalue problem [9]

  (I_N − (1/N) Y Y^T) Ω ν_{G,i} = λ_{G,i} ν_{G,i},   i = 1, ..., N_eff ≤ N − 1.   (8)
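
Numerically, (7)-(8) can be evaluated as in the following sketch, which reads the reconstructed eigenvalue problem (8) literally and assumes Ω and y defined as in Section 2; the function names and the tolerance are assumptions.

    import numpy as np

    def gram_eigenvalues(Omega, y, tol=1e-10):
        # Eigenvalue problem (8): (I_N - (1/N) Y Y^T) Omega nu = lambda nu.
        N = len(y)
        M = np.eye(N) - np.outer(y, y) / N
        lam = np.linalg.eigvals(M @ Omega).real      # keep real parts
        return np.sort(lam[lam > tol])[::-1]         # the N_eff non-zero eigenvalues

    def effective_parameters(lam_G, mu, zeta):
        # Effective number of parameters (7); the leading 1 accounts for the
        # unregularized bias term b.
        return 1.0 + np.sum(zeta * lam_G / (mu + zeta * lam_G))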

A practical way to determine μ_MP and ζ_MP is to solve first the following scalar minimization problem in γ [9]:

  min_γ J_2(γ) = Σ_{i=1}^{N−1} log[λ_{G,i} + 1/γ] + (N − 1) log[E_W(w_MP) + γ E_D(w_MP, b_MP)],   (9)

with λ_{G,i} = 0 for i > N_eff. In this optimization problem, expressions for E_{D,MP} and E_{W,MP} are obtained from the conditions for optimality of the Lagrangian on level 1 [8, 9]: E_{D,MP} = (1/(2γ^2)) Σ_{i=1}^N α_i^2 and E_{W,MP} = (1/2) α^T Ω α = (1/2) Σ_{i=1}^N α_i (1 − α_i/γ − y_i b_MP). From the optimal γ_MP, one easily obtains μ_MP and ζ_MP using the relations in the optimum between μ, ζ, γ, E_W(w_MP) and E_D(w_MP, b_MP).
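
The level-2 inference thus reduces to a one-dimensional search over γ, re-solving the level-1 system (3) for each trial value. A minimal sketch, assuming the helpers from the previous sections; the use of scipy.optimize.minimize_scalar and the search bounds are assumptions, not part of the paper.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def level1_energies(K, y, gamma):
        # E_W and E_D at the level-1 optimum, from the optimality conditions.
        b, alpha = lssvm_train(K, y, gamma)
        E_D = np.sum(alpha ** 2) / (2.0 * gamma ** 2)
        E_W = 0.5 * np.sum(alpha * (1.0 - alpha / gamma - y * b))
        return E_W, E_D

    def infer_hyperparameters(K, y, lam_G):
        N = len(y)
        lam_full = np.concatenate([lam_G, np.zeros(N - 1 - lam_G.size)])

        def J2(gamma):
            # Cost function (9).
            E_W, E_D = level1_energies(K, y, gamma)
            return (np.sum(np.log(lam_full + 1.0 / gamma))
                    + (N - 1) * np.log(E_W + gamma * E_D))

        gamma_mp = minimize_scalar(J2, bounds=(1e-3, 1e3), method="bounded").x
        E_W, E_D = level1_energies(K, y, gamma_mp)
        # 2 mu_MP [E_W + gamma_MP E_D] = N - 1 and zeta_MP = gamma_MP * mu_MP.
        mu_mp = (N - 1) / (2.0 * (E_W + gamma_mp * E_D))
        return gamma_mp, mu_mp, gamma_mp * mu_mp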

4. Automatic Relevance Determination by Inference of Kernel Parameters (Level 3)

By applying Bayes' rule on the third level, the posterior for the model H_j is obtained: P(H_j | D) ∝ P(D | H_j) P(H_j). At this level, no evidence or normalizing constant is used since it is impossible to compare all possible models H_j. The prior P(H_j) over all possible models is assumed to be uniform here. Hence, we obtain P(H_j | D) ∝ P(D | H_j). The likelihood P(D | H_j) corresponds to the evidence (6) of the previous level and can be approximated by [2, 3, 9]

  P(D | H_j) ∝ P(D | log μ_MP, log ζ_MP, H_j) (σ_{log μ|D} σ_{log ζ|D}) / (σ_{log μ} σ_{log ζ}),   (10)

with σ_{log μ}, σ_{log ζ} the standard deviations of the Gaussian priors (level 2) on log μ, log ζ, respectively.

The error bars σ_{log μ|D} and σ_{log ζ|D} can be approximated [3] as follows: σ²_{log μ|D} ≃ 2/(γ_eff − 1) and σ²_{log ζ|D} ≃ 2/(N − γ_eff). The posterior (10) becomes [9]:

  P(D | H_j) ∝ √( μ_MP^{N_eff} ζ_MP^{N−1} / [ (γ_eff − 1)(N − γ_eff) Π_{i=1}^{N_eff} (μ_MP + ζ_MP λ_{G,i}) ] ).   (11)

One selects the kernel parameters, e.g., σ_j for an RBF kernel, with maximal posterior P(D | H_j).
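
In practice (11) is best evaluated on a log scale. The following small sketch (an illustration with assumed names, taking μ_MP, ζ_MP, γ_eff and the eigenvalues λ_{G,i} from the level-2 sketch) computes the log-evidence up to an additive constant:

    import numpy as np

    def log_model_evidence(mu_mp, zeta_mp, gamma_eff, lam_G, N):
        # log of (11), up to an additive constant.
        N_eff = lam_G.size
        log_num = N_eff * np.log(mu_mp) + (N - 1) * np.log(zeta_mp)
        log_den = (np.log(gamma_eff - 1.0) + np.log(N - gamma_eff)
                   + np.sum(np.log(mu_mp + zeta_mp * lam_G)))
        return 0.5 * (log_num - log_den)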

For Automatic Relevance Determination, we now introduce a diagonal¹ weighting matrix [6] U = diag([u(1); ...; u(n)]). Each u(k) ∈ R⁺ weights the corresponding input x(k), k = 1, ..., n, in the kernel function K. For an RBF kernel, the kernel function becomes

  K(x_1, x_2) = exp(−(x_1 − x_2)^T U (x_1 − x_2)/σ²) = exp(−(x_1 − x_2)^T Ũ (x_1 − x_2)),

where the positive scale parameter σ is taken into account by defining Ũ = diag(ũ) = U/σ². The weights ũ are inferred by maximizing the model evidence (11). Less important inputs will have relatively small weights.

¹ Instead of using a diagonal weighting matrix, the approach may be generalized towards any positive definite weighting matrix Ũ ∈ R^{n×n}. However, a physical interpretation of the importance of the weights is less obvious when there are significantly non-zero off-diagonal elements.
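
A sketch of this weighted RBF kernel (the function name and vectorized form are my own); it can stand in for the plain RBF kernel in the earlier sketches:

    import numpy as np

    def weighted_rbf_kernel(X1, X2, u_tilde):
        # K(x1, x2) = exp(-(x1 - x2)^T diag(u_tilde) (x1 - x2)),
        # with u_tilde = u / sigma^2.
        diff = X1[:, None, :] - X2[None, :, :]
        return np.exp(-(diff ** 2 * u_tilde[None, None, :]).sum(axis=2))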

In order to find a good starting value for optimizing ũ with respect to (11), we will first infer the optimal σ from (11) for u = [1; ...; 1]. This value then serves as the starting point ũ = [1; ...; 1]/σ² for the more complex optimization of the weights ũ. A practical algorithm consists of the following steps (a sketch in code is given after the list):

1. Normalize the inputs to zero mean and unit variance.

2. Optimize σ_j with respect to P(D | H_j) from (11). For each σ_j, the optimal μ_MP, ζ_MP and γ_MP are inferred on level 2 as follows:

   (a) Solve the eigenvalue problem (8).
   (b) Minimize J_2(γ) from (9); in each step one solves the linear system (3) on level 1 and calculates E_{W,MP} and E_{D,MP}.
   (c) Given γ_MP, calculate μ_MP, ζ_MP and γ_eff.
   (d) Calculate P(D | H_j) from (11).

3. Select an initial choice for ũ, e.g., ũ = [1; ...; 1]/σ² (when Step 2 was the previous step) or ũ = ũ_prev(..., l−1, l+1, ...) otherwise.

4. Optimize ũ_j with respect to P(D | H_j) from (11). See Step 2 for the different steps on level 1 (2b) and level 2 (2a-d).

5. Remove inputs l with low ũ(l) values and go back to Step 3.
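
A compact sketch of this loop, reusing the hypothetical helpers from the previous sections. The choice of optimizer (Nelder-Mead over log-weights), the handling of the initial σ² and the relative threshold for deciding which weights are "low" are assumptions; the paper does not prescribe them.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_evidence(log_u, X, y):
        # Steps 2(a)-(d)/4 for one candidate weight vector u_tilde = exp(log_u).
        u_tilde = np.exp(log_u)                        # keeps the weights positive
        K = weighted_rbf_kernel(X, X, u_tilde)
        Omega = np.outer(y, y) * K
        lam_G = gram_eigenvalues(Omega, y)                               # (8)
        gamma_mp, mu_mp, zeta_mp = infer_hyperparameters(K, y, lam_G)    # (9)
        gamma_eff = effective_parameters(lam_G, mu_mp, zeta_mp)          # (7)
        return -log_model_evidence(mu_mp, zeta_mp, gamma_eff, lam_G, len(y))  # (11)

    def ard(X, y, sigma2, rel_threshold=0.01):
        # Step 1 is assumed done (X normalized); sigma2 comes from Step 2.
        active = list(range(X.shape[1]))
        log_u = np.full(len(active), -np.log(sigma2))  # Step 3: u_tilde = 1/sigma^2
        while True:
            res = minimize(neg_log_evidence, log_u, args=(X[:, active], y),
                           method="Nelder-Mead")       # Step 4
            u_tilde = np.exp(res.x)
            keep = u_tilde > rel_threshold * u_tilde.max()
            if keep.all():
                return active, u_tilde
            active = [a for a, k in zip(active, keep) if k]   # Step 5
            log_u = res.x[keep]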

The main difference of this approach with respect to Gaussian Processes [5] is that GPs typically infer the kernel parameters on level 2, together with the hyperparameters. The LS-SVM formulation also allows one to derive analytic expressions [2, 9], while sampling techniques have been used to design and evaluate GPs [5].

5. Example: ARD with an RBF-kernel

We illustrate the ARD algorithm for an RBF kernel on the synthetic binary classification data set from [7]. The data set consists of a training set and a test set of N = 250 and N_test = 1000 data points, respectively. Both classes −1 and +1 have equal prior probabilities and each class is an equal mixture of two normal distributions. Due to the overlap of the distributions, the optimal theoretical performance that can be achieved is 92.0%. The original problem has two inputs (n = 2). The example created to illustrate ARD is inspired by [4]: a first additional input x(3) is constructed from input x(1) by adding Gaussian noise with variance 0.25. This input has some relevance. The second additional input x(4) is zero-mean, unit-variance Gaussian noise. Then, all inputs x(1:4) were normalized to zero mean and unit variance [1].
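
The construction of the extended input set can be sketched as follows. This is illustrative only: it assumes the N x 2 matrix X2 with Ripley's two original inputs is already available, and the random-number handling is my own.

    import numpy as np

    def add_ard_inputs(X2, rng=None):
        # x(3) = x(1) + Gaussian noise with variance 0.25; x(4) = pure noise.
        rng = np.random.default_rng(0) if rng is None else rng
        x3 = X2[:, 0] + rng.normal(0.0, np.sqrt(0.25), size=X2.shape[0])
        x4 = rng.normal(0.0, 1.0, size=X2.shape[0])
        X4 = np.column_stack([X2, x3, x4])
        # Normalize all inputs x(1:4) to zero mean and unit variance.
        return (X4 - X4.mean(axis=0)) / X4.std(axis=0)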

From Step 2, we obtained σ = 2.54, μ_MP = 1.52 and ζ_MP = 2.67 (with u = [1, 1, 1, 1]). The training set and test set performance are 89.6% and 88.5%, respectively. In Step 3, we optimized ũ with respect to (11) for all inputs x(1:4). This yielded ũ = [0.2237, 0.1307, 0.1804, 0.0016], μ_MP = 1.56 and ζ_MP = 2.74, with training and test set performances of 89.6% and 89.2%, respectively. The evolution of ũ during the optimization is depicted in Figure 1. Removing input x(4) with very low relevance, we restarted the optimization with ũ_old = [0.2237, 0.1307, 0.1804] for inputs x(1:3). We obtained ũ = [1.4276, 0.4996, 0.0869], μ_MP = 2.31 and ζ_MP = 3.02, while the training and test performances were 90.0% and 90.8%, respectively. Input x(1) is now far more important than input x(3). Removal of x(3) and retraining with inputs x(1:2) yields ũ = [1.9461, 0.1386], μ_MP = 1.56 and ζ_MP = 2.87. Training and test set performances are now 89.6% and 91.0%, respectively.

Figure 1: Evolution of ũ(1) (+), ũ(2), ũ(3) and ũ(4) (o) as a function of the number of iterations N_iter of the optimization algorithm.

6. Conclusions

An Automatic Relevance Determination (ARD) algorithm is proposed for LS-SVM classifiers within the evidence framework. A diagonal weighting matrix is introduced for the inputs of the RBF kernel. The weights are inferred on the third level of Bayesian inference. Inputs corresponding to small weights have low relevance in the kernel function and can be removed. Although the RBF kernel is known to be quite insensitive to irrelevant inputs, the generalization behavior in our experiment is improved by using a weighting matrix.

References

[1] Bishop, C.M. Neural Networks for Pattern Recognition, Oxford University Press, 1995.

[2] Kwok, J.T. Integrating the evidence framework and the Support Vector Machine. In Proc. of the European Symposium on Artificial Neural Networks (ESANN 1999), 177-182, Bruges, Belgium, 1999.

[3] MacKay, D.J.C. Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems, 6, 469-505, 1995.

[4] Nabney, I. Netlab: Algorithms for Pattern Recognition, 2001, to appear.

[5] Neal, R.M. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118, Springer, New York, 1996.

[6] Poggio, T. & Girosi, F. Networks for Approximation and Learning. Proceedings of the IEEE, 78(9), 1481-1497, 1990.

[7] Ripley, B.D. Neural Networks and Related Methods for Classification, Journal of the Royal Statistical Society B, 56(3), 409-456, 1994.

[8] Suykens, J.A.K. & Vandewalle, J. Least squares support vector machine classifiers, Neural Processing Letters, 9, 293-300, 1999.

[9] Van Gestel, T., Suykens, J.A.K., Lanckriet, G., Lambrechts, A., De Moor, B. & Vandewalle, J. A Bayesian Framework for Least Squares Support Vector Machine Classifiers. Report TR 00-65, ESAT-SISTA, K.U. Leuven, Belgium, 2000. Submitted for publication.

[10] Vapnik, V. Statistical learning theory, John Wiley, New York, 1998.
