Supervised Feature Selection Based on Generalized Matrix Learning Vector Quantization

Zetao Chen
S2061244
September 2012

Master Project
Artificial Intelligence
University of Groningen, the Netherlands

Internal Supervisor:
Dr. Marco Wiering (Artificial Intelligence, University of Groningen)

External Supervisor:
Prof. Michael Biehl (Computer Science, University of Groningen)

Contents

1 Introduction and Background
  1.1 Motivation
  1.2 Research Questions
  1.3 Thesis Outline

2 Machine Learning
  2.1 Basic Concepts of Machine Learning
    2.1.1 Definition of learning
    2.1.2 Definition of machine learning
    2.1.3 Data representation
  2.2 Classification
    2.2.1 Unsupervised and supervised learning
  2.3 Learning Algorithms
    2.3.1 SVM with RBF kernel
    2.3.2 LVQ
    2.3.3 Two variants of LVQ: GRLVQ and GMLVQ

3 Feature Selection
  3.1 Challenges
    3.1.1 Curse of dimensionality
    3.1.2 Irrelevance and redundancy
  3.2 General Framework
  3.3 Wrapper and Filter Approach
  3.4 Feature Ranking Techniques
    3.4.1 Information gain
    3.4.2 ReliefF
    3.4.3 Fisher

4 GMLVQ Based Feature Selection Algorithms
  4.1 Entropy Enforcement for Feature Ranking Results
  4.2 Way-Point Average Algorithm
  4.3 Feature Ranking Ambiguity Removal

5 Experiments and Results
  5.1 Data Set Description
  5.2 Experiment Design
  5.3 Results and Discussion
    5.3.1 Case Study 1: Adrenal Tumor
    5.3.2 Case Study 2: Ionosphere
    5.3.3 Case Study 3: Connectionist Bench Sonar
    5.3.4 Case Study 4: Breast Cancer
    5.3.5 Case Study 5: SPECTF Heart
  5.4 Discussion and Summary

6 Conclusion and Future Work

Abstract

Data mining involves the use of data analysis tools to discover and extract information from a data set and transform it into an understandable expression. One of its central problems is to identify a representative subset of features from which a learning model can be constructed. Feature selection is an important pre-processing step before data mining which aims to select a representative subset of features with high predictive information and eliminate irrelevant features with little importance for classification. By reducing the dimensionality of the data, feature selection helps to decrease the time needed for training, and by selecting the most relevant features and removing irrelevant and noisy data, the classification performance may be improved. Furthermore, with a smaller feature subset, the learned model may be more intuitive and easier to interpret.

This thesis investigates the extension of the Generalized Matrix LVQ (GMLVQ) model to feature selection. Generalized Matrix LVQ employs a full matrix as the distance metric in training. The diagonal and off-diagonal elements of the distance matrix respectively measure the contribution of each feature and feature pair to the classification; therefore, their distribution can provide a quantitative measurement of feature weight. Additional steps and analysis are performed to enforce a more effective feature selection result and to remove the weighting ambiguity. Moreover, compared to other methods which perform feature ranking first and learn a model after selecting the feature subset, GMLVQ based methods can combine feature ranking and classification in a single process, which helps to decrease the computation time.

Experiments in this thesis were performed on data sets collected from the UCI Machine Learning Repository [29]. The GMLVQ based feature weighting algorithm is compared with other state-of-the-art methods: Information Gain, Fisher and ReliefF. All four feature ranking methods are evaluated using both GMLVQ and the RBF-based Support Vector Machine (RBF-SVM) by increasing the size of the selected feature subset with a fixed step size. The results indicate that the performance of the GMLVQ based feature selection method is comparable to the other methods, and on some of the data sets it consistently outperforms them.

1 Introduction and Background

1.1 Motivation

For a machine learning algorithm to be successful on a given task, the representation and quality of the data are the first and most important factors. With the advance of database technology, data is easier to access and more features can be gathered for a specific task. However, more features do not necessarily result in more discriminative classifiers. Instead, when there are too many redundant or irrelevant features, the computation can be much more expensive and the classifier may have poor generalization performance due to the interference of noise; therefore, proper data preprocessing is essential for the successful training of machine learning algorithms.

Feature selection is one of the most important and frequently used preprocessing techniques [5]. It aims to identify and select the most discriminative subset of the original features while eliminating irrelevant, redundant and noisy data. Some studies have shown that irrelevant features can be removed without significant performance degradation [6]. The application of feature selection can have several benefits:

1. It reduces the data dimensionality, which helps the learning algorithms to work faster and more effectively;

2. In some cases, the classification accuracy can be improved by using a subset of all features;

3. The selected feature subset is usually a more compact result which can be interpreted more easily.

To perform feature selection, the training data can be with or without label information, corresponding to supervised or unsupervised feature selection. In unsupervised tasks [1, 2], without considering the label information, feature relevance can be evaluated by measuring some intrinsic properties within the data, such as the separability or covariance. In practice, unlabeled data is easier to obtain than labelled data, which indicates the significance of unsupervised algorithms. However, these methods ignore label information, which may lead to performance deterioration when the label information is available. Supervised feature selection is proposed to take the label information into account. It can generally be divided into two major frameworks: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20]. The filter model performs feature selection as a pre-processing step, independent of the choice of the classifier. The wrapper model, on the other hand, evaluates subsets of features according to their usefulness to a given predictor.

Feature selection techniques can be further categorized into feature ranking and feature subset selection. Feature ranking methods assign a weight to each feature and select the subset of features by choosing a threshold and eliminating all features which do not reach that score. Feature subset selection searches for the optimal subset which collectively has the best performance with respect to some predictor. In this thesis, a new method for feature ranking is investigated and compared with other state-of-the-art ones.

Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods. It was first introduced by Kohonen [3]. After that, several advanced cost functions were proposed to improve the performance, one example being Generalized LVQ [4], which is based only on the Euclidean distance. To model the different contributions of features to the classification, Generalized Relevance LVQ was proposed [4, 7] to extend the Euclidean distance with scaling or relevance factors for all features. The recently introduced Generalized Matrix LVQ (GMLVQ) [33] extends the distance measure further to account for the pairwise contribution of features. The distance matrix in GMLVQ contains information which may be useful for feature selection. For example, the diagonal element Λ_ii of the dissimilarity matrix can be regarded as a measurement of the overall relevance of feature i for the classification, and the off-diagonal element Λ_ij can be interpreted as the contribution of the feature pair i and j. A high absolute value indicates the existence of a highly relevant relationship, while an absolute value closer to zero suggests that it is not that important for the classification.

The above discussion illustrates the potential application of GMLVQ to feature ranking, which has not yet been fully investigated. Early studies include applying GMLVQ to select the best feature in the classification of lung disease [39] and to select the most discriminative marker in the diagnosis of adrenal tumors [41]. In this thesis, a further investigation is conducted and experiments on more data sets are carried out.

1.2 Research Questions

This thesis will attempt to answer the following questions:

1. Can the GMLVQ method be extended to perform feature ranking?

2. How well does this feature ranking perform? In this thesis, the GMLVQ based feature ranking technique will be compared with three other state-of-the-art feature ranking methods. All four methods will be evaluated by GMLVQ and RBF-SVM in terms of their AUC metric.

3. Can GMLVQ combine feature ranking and classification into one single process, and how well does this classification perform compared to other methods in which feature ranking and classification are performed in two steps?

1.3 Thesis Outline

This thesis has six chapters and is organized as follows. Chapter 2 presents the basic concepts of machine learning and the details of the learning algorithms that are later used to evaluate the performance of the various feature ranking algorithms. Chapter 3 discusses the idea of feature selection, its general framework and three state-of-the-art feature ranking techniques which will be compared with the GMLVQ based ranking method. Chapter 4 gives a description of the GMLVQ based feature ranking method; in this chapter, details are given on how to extract a feature ranking from GMLVQ, on the waypoint averaging algorithm and on how to obtain a unique feature ranking result. Chapter 5 elaborates on the experiments conducted to compare the four feature ranking techniques discussed above and is followed by Chapter 6, which states the conclusion and future work for this thesis.

2 Machine Learning

In this chapter, we first give a brief introduction to machine learning and some of its basic concepts. Data representation, classification and learning algorithms are then presented. We further present some specific learning algorithms, namely the RBF-based SVM, basic LVQ and two of its variants. The learning algorithms introduced in this chapter will later be utilized to evaluate the feature selection algorithms introduced in Chapter 3.

2.1 Basic Concepts of Machine Learning

2.1.1 Definition of learning

What is learning? Learning generally refers to the mutual interaction between the environment and a person through which one gains or modifies knowledge or skills. A more formal definition was given by Runyon in 1977 [36]: "Learning is a process in which behavior capabilities are changed as the result of experience, provided the change cannot be accounted for by native response tendencies, maturation, or temporary states of the organism due to fatigue, drugs, or other temporary factors."

One example of learning is the association between events. For example, if a person tastes an apple for the first time and finds it very delicious, he will assume that the next apple he encounters will also be delicious, even though he has not eaten it and that apple is different from the one he ate. The important discovery here is the association with the fact that apples are tasty. This association is the knowledge someone gains through the experience of eating an apple.

2.1.2 Definition of machine learning

Learning for computers falls into the field of machine learning. A widely accepted definition is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [37]. The experience here usually refers to the data which demonstrates the relationship between observed variables.

There are many example applications of machine learning. One of the largest groups lies in the categorization of objects into a set of pre-specified classes or labels. Some practical examples are:

1. Optical Character Recognition: classify images of handwritten characters into the specific letters;

2. Face Recognition: categorize facial images according to the person they belong to;

3. Medical Diagnosis: determine whether or not a patient suffers from some disease;

4. Stock Prediction: predict whether a stock goes up or down.

2.1.3 Data representation

In the field of machine learning, data is represented by a table where each row corresponds to one sample or instance and each column describes one attribute or feature. In the case of supervised learning, there is an additional column containing the label information for each instance. One example is shown in Figure 1. There are 14 instances in this example and each instance consists of the data with four features (Outlook, Temperature, Humidity, Wind) and the label information specifying whether or not to play.

The mathematical expressions of the data and labels are presented here to serve as the notation in this thesis. Let {x_i, y_i} denote the i-th instance, where x_i ∈ R^N denotes the data in the N-dimensional space and y_i is the corresponding label with C different possible values. In brief, the combination of data and label is expressed as:

    \{x_i, y_i\} \in \mathbb{R}^N \times C    (1)
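As a minimal illustration of this notation (a sketch with hypothetical toy values, not data from the thesis), the table-like representation can be stored as a matrix X with one instance per row together with a label vector y:

```python
import numpy as np

# Each row of X is one instance x_i with N features; y holds the labels y_i.
X = np.array([[5.1, 3.5, 1.4, 0.2],   # instance 1 (N = 4 features)
              [6.7, 3.1, 4.7, 1.5],   # instance 2
              [5.9, 3.0, 5.1, 1.8]])  # instance 3
y = np.array([0, 1, 1])               # class labels, here C = 2 possible values

n_samples, n_features = X.shape       # here: 3 instances, 4 features
```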

2.2 Classification

As discussed in the previous section, a major task in machine learning is to learn how to classify objects into one of a pre-defined set of labels. In such a task, a classifier has to capture the common properties of the objects in a class. For example, to identify whether a fruit is a banana, people have to check its color, size and shape and infer its label from this information.

2.2.1 Unsupervised and supervised learning

The classification task discussed above is generally referred to as supervised learning, where the labels of the training data are provided and the learning algorithm tries to generalize from the training instances so that novel objects can be classified into the correct categories. In contrast to supervised learning, unsupervised learning refers to learning in which the labels of the training data are unknown. Its goal is to group the training data into different clusters by evaluating some intrinsic properties within the data, such as the separability or covariance; therefore, the quality of the data provided for training is crucial. If irrelevant or noisy data are provided, misclassifications will happen on novel data.

2.3 Learning Algorithms

In this section, two supervised learning algorithms are described: the SVM with an RBF kernel and the LVQ algorithm with its two variants, GRLVQ and GMLVQ. GMLVQ and RBF-SVM will later be utilized to evaluate the performance of the four feature ranking methods.

2.3.1 SVM with RBF kernel

The Support Vector Machine (SVM) was originally proposed by Vapnik for classification and regression [25, 24, 26, 27] and was later extended to other applications [28]. It has attracted considerable attention in recent years due to its superior performance and well-developed theoretical foundation. For this reason, it also serves as an evaluation method for the feature selection results in this thesis.

The SVM finds an optimal hyperplane that separates training data of two or more classes while at the same time maximizing its margin. The linear Support Vector Machine, as the simplest and most basic case, is introduced first. Then we show how it can classify non-linearly separable data in a higher-dimensional feature space.

Linear SVM and separating hyperplane maximization The linear SVM is a supervised learning method which is built upon a group of labelled samples and performs binary classification in the feature space. Let us denote the data and labels as (x_i, y_i), where x_i ∈ R^N is an N-dimensional feature vector and y_i is the label of sample x_i. In a two-class problem, y_i ∈ {+1, −1}. The classification process of a supervised learning algorithm can then be regarded as a mapping f(x_i): R^N → R which maps the feature vector from the N-dimensional space to the class membership of the vector. Without loss of generality, it is assumed that f(x_i) > 0 and y_i = 1 indicate that the feature vector belongs to class 1, while f(x_i) < 0 and y_i = −1 specify class 2. A formal definition of linearly separable data can then be given as: a data set is linearly separable if the following equations hold:

    \forall y_i = 1 : f(x_i) > 0    (2)

    \forall y_i = -1 : f(x_i) \le 0    (3)

An illustrative example is shown in Figure 2, where the separating hyperplane divides the feature space into two halves. As can be seen from the figure, all the points with y_i = 1 are classified into the positive side of the hyperplane and the others, with y_i = −1, are on the opposite side.

The discriminant function in Figure 2 is a linear model and can be expressed as:

    f(x) = w^T x + b    (4)

where w is the weight vector and b is the bias. The hyperplane which divides the space into two half-spaces is expressed as:

    f(x) = w^T x + b = 0

The discriminant function f(x) also helps to measure the distance of a data point to the hyperplane. Consider the point x_d and its normal projection x_0 onto the hyperplane in Figure 2. The coordinates of the point x_d can then be expressed as:

    x_d = x_0 + d \frac{w}{\|w\|}    (5)

where d describes the algebraic distance between the points x_d and x_0. Because x_0 lies on the hyperplane, f(x_0) = 0. We have:

    f(x_d) = f\!\left(x_0 + d \frac{w}{\|w\|}\right) = w^T \left(x_0 + d \frac{w}{\|w\|}\right) + b    (6)

    = f(x_0) + d \frac{w^T w}{\|w\|} = d \|w\|    (7)

It follows that d = f(x_i)/\|w\|, and to enforce that d is always positive under correct classification, we define:

    d_i = \frac{y_i f(x_i)}{\|w\|}    (8)

The margin p can then be defined as the distance between the hyperplane and the closest data points from both sides:

    p = \min_{i=1,2,\dots,n} \frac{y_i f(x_i)}{\|w\|}    (9)

where n is the number of examples in the training data set. The linear SVM is trained to find an optimal hyperplane that maximizes the margin p. As shown in the formula above, this can be achieved either by maximizing the value of y_i f(x_i) of the closest points or by minimizing \|w\|. Since w^T x + b can be scaled without changing its sign, it is reasonable to impose the constraint that:

    y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n    (10, 11)

Therefore, the optimization problem can be formulated as [25]: given a set of training samples \{x_i, y_i\}_{i=1}^{n}, find the optimal parameters w and b which satisfy the constraint

    y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n    (12, 13)

and minimize the following function:

    L = \frac{1}{2} w^T w    (14)

This is called the primal problem and can be solved by constructing the Lagrange function [30] as below:

    J(w, b, a) = \frac{1}{2} w^T w - \sum_{i=1}^{n} a_i \left[ y_i (w^T x_i + b) - 1 \right]    (15)

The a_i here are called the Lagrange multipliers, and the solution of this optimization problem should be minimized with respect to w and b and maximized with respect to the a_i. As a result, it follows that

    \frac{\partial J(w, b, a)}{\partial w} = w - \sum_{i=1}^{n} a_i y_i x_i = 0    (16)

and

    \frac{\partial J(w, b, a)}{\partial b} = \sum_{i=1}^{n} a_i y_i = 0    (17)

which gives rise to

    w = \sum_{i=1}^{n} a_i y_i x_i    (18)

and

    \sum_{i=1}^{n} a_i y_i = 0    (19)

Then, by substituting the above two equations into equation (15), the equation becomes:

    Q(a) = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j x_i^T x_j    (20)

The corresponding problem is called the dual problem and is formulated as follows: given training samples \{x_i, y_i\}_{i=1}^{n}, find the optimal Lagrange multipliers \{a_i\}_{i=1}^{n} which maximize the objective function above and also satisfy the following constraints:

1. \sum_{i=1}^{n} a_i y_i = 0;

2. a_i \ge 0 for i = 1, 2, \dots, n.

After the Lagrange multipliers are determined, the weight vector can easily be obtained by

    w = \sum_{i=1}^{n} a_i y_i x_i    (21)

and the bias b can be determined by choosing a labelled sample \{x_i, y_i\} with a_i \ne 0 (a support vector, see below) and calculating:

    y_i (w^T x_i + b) = 1    (22)

    \forall y_i = 1 : b = 1 - w^T x_i \quad \text{or} \quad \forall y_i = -1 : b = -1 - w^T x_i    (23-25)

It is also important to state the Karush-Kuhn-Tucker theorem [25, 30], which gives the following condition at the saddle point of the Lagrangian:

    a_{i0} \left[ y_i (w_0^T x_i + b_0) - 1 \right] = 0 \quad \text{for } i = 1, 2, \dots, n    (26)

It states that a_{i0} \ne 0 only for the points which satisfy y_i (w_0^T x_i + b_0) = 1. These points are called the support vectors.

To sum up, we have:

    f(x) = \sum_{i=1}^{m} a_{i0} y_i x_i^T x + b_0    (27)

where \{x_i\}_{i=1}^{m} are the support vectors and \{a_{i0}\}_{i=1}^{m} are the corresponding Lagrange multipliers.

Non-linearly separable data and soft margin In practical applications, many data sets are non-linearly separable, which makes the algorithm of the previous section infeasible. One example is shown in Figure 3. As can be seen from the figure, although most of the points are classified onto the correct side, there are still some points which violate the hyperplane. These points either cross the boundary of the margin but are still located in the correct half-space, or have been misclassified into the incorrect half-space. In such cases, it is impossible to find a hyperplane which completely removes the errors; instead, a solution can be sought that minimizes the errors on the training data.

Slack variables are introduced to solve this problem. For a data set with n samples, there are n slack variables \{\varepsilon_i\}_{i=1}^{n} which satisfy:

    \forall y_i = 1 : w^T x_i + b > 1 - \varepsilon_i    (28)

    \forall y_i = -1 : w^T x_i + b \le -1 + \varepsilon_i    (29)

The slack variable \varepsilon_i is a measure of the violation of the margin. If 0 < \varepsilon_i < 1, the sample violates the margin but is still correctly classified. When \varepsilon_i > 1, the sample is classified into the wrong half-space. Since the goal is to have fewer training samples misclassified, a penalty term can be added:

    \eta(\varepsilon) = \sum_{i=1}^{n} \varepsilon_i    (30)

which should be minimized. It can be incorporated into the cost function of the previous section as:

    f = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \varepsilon_i    (31)

The parameter C controls the trade-off between the rigidity of the margin and the number of errors that can be tolerated during training. A larger value of C produces a model that is more accurate on the training data while at the same time increasing the risk of over-fitting; therefore, the value of C has to be optimized by the user during the experiment.

The corresponding Lagrange function for this problem is:

    J(w, b, a, \mu, \varepsilon) = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \mu_i \varepsilon_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 + \varepsilon_i \right]    (32)

where the \mu_i are the Lagrange multipliers for the slack variables.

Kernel trick Consider the typical XOR problem, which tries to separate four examples at the four corners of a rectangle such that the two examples connected by a diagonal belong to the same class. It is impossible to do this in the two-dimensional space, but when the data is projected into a three-dimensional space it becomes much easier. This example indicates that a non-linearly separable data set may become linearly separable in a higher-dimensional space; this kind of mapping increases the separability of the data set.

Let the function \theta define the non-linear mapping:

    \theta : \mathbb{R}^N \rightarrow H    (33)

The discriminant function then becomes:

    f(x) = \sum_{i=1}^{n} a_i y_i \, \theta(x_i)^T \theta(x) + b    (34)

The kernel function is defined here by:

    K(x, y) = \theta(x)^T \theta(y)    (35)

and the discriminant function turns into:

    f(x) = \sum_{i=1}^{n} a_i y_i K(x_i, x) + b    (36)

This expression avoids computing the explicit representation in the higher-dimensional space. Numerous kernels have been proposed to solve various kinds of problems. One of the most popular kernels is the RBF kernel, which is used in this thesis. The RBF kernel can be expressed as:

    K(x, y) = \exp\!\left( -\frac{1}{2\sigma^2} \|x - y\|^2 \right)    (37)

Here \sigma indicates the kernel width. A larger \sigma yields a smoother function, which helps to avoid over-fitting and reproducing the noise in the training data; a smaller \sigma, on the other hand, implies a more flexible function that can produce highly irregular decision boundaries. Hence, it is important to determine the optimal value for \sigma by means of cross-validation.
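A minimal sketch of how such a classifier could be trained in practice, assuming scikit-learn is available (the data below is a hypothetical toy set, and sklearn's gamma parameter corresponds to 1/(2σ²) in Eq. (37)):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data: 40 points in 2-D, two classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

# gamma = 1 / (2 * sigma^2), so tuning gamma is equivalent to tuning the kernel
# width sigma; C is the margin/penalty trade-off from Eq. (31).
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1 / (2 * s**2) for s in (0.5, 1.0, 2.0, 4.0)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```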

2.3.2 LVQ

Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods; it was first introduced by Kohonen [3]. It has some advantages over other methods. Firstly, the method can be easily implemented, and the complexity of the classifier can be controlled and determined by the user. Secondly, multi-class problems can be tackled naturally by the classifier without modifying the learning algorithm or decision rule. Lastly, the resulting classifier is intuitive and easy to interpret, due to its assignment of class prototypes and its intuitive mechanism of classifying new data points to the closest prototype. The resulting prototypes can then provide class-specific attributes of the data. This is a big advantage over methods such as SVMs or neural networks, which suffer from the drawback of behaving like a black box; because of this, LVQ has been applied in many fields, such as bioinformatics, satellite remote sensing and image analysis [34, 35, 39].

The training data for LVQ can be denoted as:

    \{x_i, y_i\}_{i=1}^{n} \in \mathbb{R}^N \times \{1, 2, \dots, C\}    (38)

where x_i denotes the data in the N-dimensional space and y_i is the label, with C different classes.

LVQ is parameterized by a set of prototypes representing the classes in feature space and by a distance measure, which may be the traditional Euclidean distance or a full matrix trained from the data. One example can be seen in Figure 4, where there are 4 different prototypes representing 3 different classes.

Traditional LVQ employs the Euclidean distance measure and is based on nearest-prototype classification. To be more specific, a set of prototypes is defined to represent the different classes. If one prototype per class is defined, the prototypes can be represented as W = \{w_j, c(w_j)\} \in \mathbb{R}^N \times \{1, 2, \dots, C\}. Each unseen example x_{new} is assigned the label of the prototype which has the closest distance to it with respect to the distance measure:

    c(x_{new}) \leftarrow c(w_k) \quad \text{with} \quad w_k = \arg\min_j d(w_j, x_{new})    (39)

This is called a winner-takes-all strategy.

Training of this model is guided by the minimization of the cost function:

    F = \sum_{i=1}^{n} \phi(\varepsilon_i) \quad \text{with} \quad \varepsilon_i = \frac{d(x_i, w_H) - d(x_i, w_M)}{d(x_i, w_H) + d(x_i, w_M)}    (40)

where \phi is any monotonic function (in this thesis, \phi(x) = x); w_H and w_M are respectively the closest prototype with the same and with a different label than sample x_i:

    w_H = \arg\min_j d(x_i, w_j) \quad \forall c(w_j) = c(x_i)    (41)

    w_M = \arg\min_j d(x_i, w_j) \quad \forall c(w_j) \ne c(x_i)    (42)

In traditional LVQ systems, only the locations of the prototypes are updated during training to minimize the errors: w_H is pushed toward the sample x_i and w_M is pushed away from it. The corresponding update steps, derived from the cost function F, are:

    \Delta w_H = -\alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{w_H} d(x_i, w_H)    (43)

    \Delta w_M = \alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{w_M} d(x_i, w_M)    (44)

where \alpha is the learning rate; \phi'(\varepsilon_i) = 1 because \phi(x) = x; \varepsilon'_{i,H} = 2 d(x_i, w_M) / [d(x_i, w_H) + d(x_i, w_M)]^2 and \varepsilon'_{i,M} = 2 d(x_i, w_H) / [d(x_i, w_H) + d(x_i, w_M)]^2; \nabla_{w_H} d(x_i, w_H) and \nabla_{w_M} d(x_i, w_M) are respectively the derivatives of the distance with respect to w_H and w_M, and therefore depend on the distance measure.
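The update rules above can be written compactly for the Euclidean case. The following is a minimal sketch assuming the squared Euclidean distance, φ(x) = x, and at least one prototype per class; it is an illustration, not the exact implementation used in the thesis:

```python
import numpy as np

def glvq_update(x, y, prototypes, proto_labels, lr=0.05):
    """One GLVQ step (Eqs. 40-44): pull the closest correct prototype w_H
    towards the sample x and push the closest incorrect prototype w_M away."""
    d = np.sum((prototypes - x) ** 2, axis=1)          # squared distances
    same = proto_labels == y
    h = np.where(same)[0][np.argmin(d[same])]          # index of w_H
    m = np.where(~same)[0][np.argmin(d[~same])]        # index of w_M
    dH, dM = d[h], d[m]
    # Derivatives of mu = (dH - dM) / (dH + dM) with respect to dH and dM.
    eH = 2.0 * dM / (dH + dM) ** 2
    eM = 2.0 * dH / (dH + dM) ** 2
    # grad_w d(x, w) = -2 (x - w) for the squared Euclidean distance.
    prototypes[h] += lr * eH * 2.0 * (x - prototypes[h])   # attract w_H
    prototypes[m] -= lr * eM * 2.0 * (x - prototypes[m])   # repel w_M
    return prototypes
```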

2.3.3 Two variants of LVQ: GRLVQ and GMLVQ

How the distance is calculated is very important in an LVQ system. One of the most popular metrics is the Euclidean distance, which is a special case of the Minkowski distance. The Euclidean distance from a data point x_i to a prototype w can be expressed as:

    d(w, x_i) = \sqrt{ \sum_{j=1}^{N} (x_i^j - w^j)^2 }    (45)

The Euclidean distance assigns the same weight to each feature, indicating that each feature contributes equally to the classification. However, in practical applications it is usually observed that different features contribute differently to the classification. Therefore, relevance learning [7, 4] was proposed to assign adaptive weight values to the different feature inputs:

    d(w, x_i) = \sqrt{ \sum_{j=1}^{N} \lambda_j (x_i^j - w^j)^2 }    (46)

The corresponding LVQ system is called GRLVQ [7, 4].

Each feature, besides its individual contribution to the classification, will also correlate with the others and thereby influence the performance. Generalized Matrix LVQ (GMLVQ) [38] was proposed to extend the previous methods. A full matrix of adaptive relevances is employed as the similarity metric and the distance is calculated as:

    d(w, x_i) = (x_i - w)^T \Lambda (x_i - w)    (47)

where \Lambda is a full N \times N matrix whose off-diagonal elements \Lambda_{i,j} account for the contribution of the feature pair i and j to the classification. The matrix \Lambda has to be positive (semi-)definite to keep the distance non-negative. This is achieved by constructing:

    \Lambda = \Omega^T \Omega    (48)

where \Omega is an arbitrary real M \times N matrix with M \le N. In this thesis, we only consider the case M = N. Substituting Eq. (48) into Eq. (47), we obtain:

    (x_i - w)^T \Lambda (x_i - w) = (x_i - w)^T \Omega^T \Omega (x_i - w) = \left[ \Omega (x_i - w) \right]^2 \ge 0    (49)

Notice that GRLVQ is a special case of GMLVQ with \Lambda = \mathrm{diag}(\{\lambda_i\}_{i=1}^{N}).

The derivative of the distance d(w, x_i) with respect to the prototype w is:

    \nabla_w d(w, x_i) = -2 \Lambda (x_i - w)    (50)

Substituting Eq. (50) into Eq. (43) and Eq. (44), we obtain the update rules for the closest correct and incorrect prototypes.

In the GMLVQ model, the update rule for the matrix \Omega also needs to be computed. The derivative of d(w, x_i) with respect to one single element \Omega_{lm} is:

    \nabla_{\Omega_{lm}} d(w, x_i) = \sum_{k} (x_i^m - w^m) \Omega_{lk} (x_i^k - w^k) + \sum_{j} (x_i^j - w^j) \Omega_{lj} (x_i^m - w^m)    (51)

    = 2 \cdot (x_i^m - w^m) \left[ \Omega (x_i - w) \right]_l    (52)

The resulting update of a single element \Omega_{lm} can then be expressed as:

    \Delta \Omega_{lm} = \Delta \Omega_{lm}^H + \Delta \Omega_{lm}^M = -\beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{\Omega_{lm}} d(x_i, w_H) + \beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{\Omega_{lm}} d(x_i, w_M)    (53)

where \beta is the learning rate for \Omega.
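A minimal sketch of the GMLVQ distance and the Ω gradient of Eqs. (47)-(52) is given below (an illustration only, assuming Λ = Ω^T Ω with M = N; the function names are hypothetical, and a full implementation would also combine the w_H and w_M terms as in Eq. (53) and renormalize Λ after each update):

```python
import numpy as np

def gmlvq_distance(x, w, omega):
    """d(w, x) = (x - w)^T Lambda (x - w) with Lambda = Omega^T Omega (Eqs. 47-49)."""
    proj = omega @ (x - w)          # Omega (x - w)
    return proj @ proj              # squared norm, always >= 0

def omega_gradient(x, w, omega):
    """Gradient of d(w, x) with respect to Omega (Eqs. 51-52):
    d d / d Omega_lm = 2 (x^m - w^m) [Omega (x - w)]_l."""
    diff = x - w
    return 2.0 * np.outer(omega @ diff, diff)   # entry (l, m) of the gradient
```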

3 Feature Selection

3.1 Challenges

In this section, two topics concerning the challenges in feature selection are discussed. The first issue is the curse of dimensionality and the second is the relevance and redundancy of features.

3.1.1 Curse of dimensionality

In machine learning, the term curse of dimensionality was initially coined by Richard Bellman [10] when he conducted his work on dynamic optimization [9, 10] and found it quite difficult to tackle the problems caused by high dimensionality. He stated:

"In view of all that we said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration? It is the curse of dimensionality, a malediction that has plagued the scientist from the earliest days." [10]

Many definitions have been given to date, but generally the term refers to the problems incurred by adding extra features to the space. The reliability of a learning model depends on the density of training examples in the feature space; increasing the data dimensionality makes the feature space sparser and thus deteriorates the generalization performance, so that more training examples are required. For example, if 5 samples are enough in each dimension, then 25 samples are sufficient to fill a two-dimensional square, but this number increases to 5^20 for a 20-dimensional hypercube.

It has also been observed that it becomes more difficult to estimate a kernel density in higher dimensions [11]. Table 1 illustrates the number of samples required to estimate a kernel density at the point 0 with a certain accuracy.

    Dimensionality    Sample size
    1                 4
    2                 19
    5                 786
    7                 10,700
    10                842,000

Table 1: Sample size required for kernel density estimation [11].

3.1.2 Irrelevance and redundancy

There is some controversy about the definition of feature relevance. A review [8] introduces the different relevance definitions that have been proposed in the literature. The authors present an example showing that the existing relevance definitions can produce unexpected results, and based on that they suggest that two different degrees of relevance are required: strong relevance and weak relevance. The definition of weak relevance can also be regarded as a definition of redundancy.

Let <X, Y> denote the training examples, where X ∈ R^N is the data and Y indicates the labels. Let F be the full feature set and F_i the i-th feature; each instance X is therefore an element of F_1 × F_2 × · · · × F_N. Let S_i = F − {F_i} denote the feature subset containing all features except F_i, and let s_i denote one value assignment of S_i. Let P denote the conditional probability of the label Y given a feature subset.

Strong relevance. A feature F_i is strongly relevant iff there exist x_i ∈ F_i, y ∈ Y and s_i with P(F_i = x_i, S_i = s_i) > 0 such that

    P(Y = y \mid S_i = s_i, F_i = x_i) \ne P(Y = y \mid S_i = s_i)

Weak relevance. A feature F_i is weakly relevant iff it is not strongly relevant and there exist a subset S_i' ⊆ S_i with a value assignment s_i', x_i ∈ F_i and y ∈ Y such that

    P(Y = y \mid S_i' = s_i', F_i = x_i) \ne P(Y = y \mid S_i' = s_i')

A feature F_i is called relevant if it is either strongly or weakly relevant to the class label; otherwise it is irrelevant. A feature F_i which is weakly relevant can become strongly relevant after removing a certain feature subset. Weak relevance can be interpreted as the existence of other relevant features which provide similar predictive power to the feature we are measuring; this is also what we call redundancy. It is important to note that a feature F_i which is weakly relevant or redundant should not be removed if the feature subset whose removal would make F_i strongly relevant has already been removed by the feature selection algorithm.

3.2 General Framework

The framework in Figure 5 shows that a typical feature selection system usually consists of four components: feature subset generation, feature subset evaluation, a stopping criterion and feature subset validation. As indicated in the figure, the complete feature set is first sent to the "Generation" module, which produces different feature subset candidates based on some search strategy. Each subset candidate is then evaluated in the "Evaluation" module by a certain evaluation measure, and a new subset which turns out to be better replaces the previously best one. This subset generation and evaluation is repeated until the given stopping criterion is met. After that, the finally selected feature subset is sent to the "Validation" module for validation by certain learning algorithms.

Two basic issues have to be addressed in the "Generation" module: the starting point and the search strategy.

Starting Point. Choose a point at which to start the search in the feature space. One choice is to begin with no features and then, in each iteration, expand the current feature subset with each feature that is not yet in the subset; the feature whose addition produces the best evaluation performance is added to the current subset. This is called forward selection (a sketch is given after this list). Another option is to do it the other way around: the search starts with the full feature set and then successively eliminates the feature whose removal results in the best evaluation performance. This search is called backward selection. A third alternative is to start by selecting a random feature subset [13] and then successively add or remove features depending on the performance. This random approach can avoid being trapped in local optima.

Search Strategy. There are three different search strategies: complete, heuristic and random. The complete strategy examines all possible feature subsets and is guaranteed to find the optimal one. When there are N features, the search has to examine 2^N subsets, which makes it unrealistic for large N. Heuristic search is guided by some heuristic; it is less computationally demanding, but the optimal subset is not guaranteed, since the heuristic determines whether or not a better subset can be found. The random strategy simply chooses the next feature subset at random; therefore, the probability of finding the optimal subset depends on how many trials are performed.
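The forward-selection strategy mentioned above can be sketched as a simple wrapper. This is a hypothetical illustration assuming scikit-learn and an RBF-SVM as the evaluation predictor (the thesis itself evaluates feature subsets with GMLVQ and RBF-SVM):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, n_select, estimator=None):
    """Greedy forward selection: repeatedly add the single feature whose
    inclusion gives the best cross-validated score of the wrapped classifier."""
    if estimator is None:
        estimator = SVC(kernel="rbf")          # evaluation predictor (assumption)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        scores = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        best_score, best_f = max(scores)       # best candidate in this iteration
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```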

3.3 Wrapper and Filter Approach

The evaluation methods in feature selection can generally be divided into two basic models: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20].

The filter model selects a feature subset as a pre-processing step, without considering the performance of the predictor. This is usually achieved by designing an evaluation function that scores features or feature subsets; evaluation functions that are frequently used are distance measures, information measures, dependency measures and consistency measures. The filter model does not involve any training of a learning algorithm and is thus much faster, which makes it suitable for large data sets.

In the wrapper model, a predetermined data mining algorithm is utilized to evaluate the feature subsets, and the candidate with the highest prediction performance is selected as the final subset. The wrapper model can usually select a feature subset with superior performance because it selects features better suited to the predetermined algorithm. However, because the algorithm has to be trained and tested for each subset candidate, the wrapper model tends to be very computationally expensive, especially for large feature sets.

3.4 Feature Ranking Techniques

3.4.1 Information gain

Information gain [21] measures the dependency between a feature X_i and the class label Y. It is a very popular technique in feature selection because it is easy to understand and compute. Information gain can also be regarded as a measure of the reduction in uncertainty about a feature X_i when the value of Y is known. Uncertainty is usually measured by Shannon's entropy.

Entropy. Entropy measures the amount of uncertainty that a feature X_i contains. It is given by

    H(X_i) = - \sum_{j=1}^{p} P(j) \log_2 P(j)    (54)

where p is the number of possible values of X_i and P(j) indicates the observation probability of the value j. From this formula, a more uniform distribution tends to produce a higher entropy. For example, if you toss a fair coin, there are two possible values, each with probability 0.5. Its entropy value is

    H(\text{coin toss}) = -2 \times (0.5 \times \log_2 0.5) = 1

As another example, if you roll a fair die, there are six possible outcomes, each with probability 1/6. Its entropy value is

    H(\text{die roll}) = -6 \times \left( \tfrac{1}{6} \times \log_2 \tfrac{1}{6} \right) = 2.585

Therefore, the higher the entropy, the more uncertainty the feature contains and the more difficult it is to predict the outcome.

Information Gain. The information gain of a feature X_i with respect to the label Y is:

    I(X_i, Y) = H(X_i) - H(X_i \mid Y)    (55)

where H(X_i) and H(X_i | Y) are respectively the entropy of feature X_i and the entropy of X_i after the value of Y is known. H(X_i | Y) is calculated as

    H(X_i \mid Y) = - \sum_{j} P(Y_j) \sum_{k} P(x_k \mid Y_j) \log_2 P(x_k \mid Y_j)    (56)

A better understanding can be gained from Figure 6. As can be seen in the figure, H(X) and H(Y) respectively measure the entropy of X and Y. The information gain I(X, Y) is a measure of the information shared by X and Y, and H(X, Y) is the information that X and Y collectively contain:

    H(X, Y) = - \sum_{x} \sum_{y} p(x, y) \log_2 p(x, y) = H(X \mid Y) + I(X, Y) + H(Y \mid X) = H(X) + H(Y) - I(X, Y)    (57)

If X and Y are highly correlated, then the information they share is very high, indicating a large value of I(X, Y). If Y is then known, much of the information about X can be inferred from Y, suggesting a low H(X | Y), and vice versa for H(Y | X).
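A minimal sketch of these quantities for discrete-valued features is given below (an illustration only; the coin and die examples from the text are reproduced at the bottom):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H = -sum_j P(j) log2 P(j) of a discrete variable (Eq. 54)."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """I(X, Y) = H(X) - H(X | Y) (Eqs. 55-56) for a discrete feature and labels."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    h_x_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        h_x_given_y += mask.mean() * entropy(feature[mask])   # P(Y_j) * H(X | Y_j)
    return entropy(feature) - h_x_given_y

# The coin and die examples from the text:
print(entropy([0, 1]))            # fair coin -> 1.0
print(entropy(list(range(6))))    # fair die  -> ~2.585
```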

3.4.2 ReliefF

Relief [22] is a univariate feature weighting algorithm in the filter model. It is based on the principle that an attribute that better separates similar instances belonging to different classes is more important and should be assigned a larger weight. The three basic steps to compute the feature weights are:

1. Find the nearest miss and nearest hit, where the nearest hit is the closest sample with the same class as the test sample and the nearest miss is the closest sample with a different label than the test sample;

2. Calculate the weight of each feature;

3. Return a ranked list of feature weights, or the top k features according to a given threshold.

The algorithm starts by initializing all feature weights to zero. It then randomly selects an instance from the samples and determines its nearest hit NH and nearest miss NM. Each feature weight is updated based on its ability to discriminate between NH and NM. The detailed pseudocode is given in Algorithm 1.

Algorithm 1: Relief
Description: there are P instances described by N features and C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; T iterations are performed.

1. Set all the feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N;
4.   Find the nearest hit NH(x) and nearest miss NM(x):
       NH(x) ← x_h, with x_h = argmin_j d(x, x_j) ∀ c(x_j) = c(x)
       NM(x) ← x_m, with x_m = argmin_k d(x, x_k) ∀ c(x_k) ≠ c(x)
5.   For i = 1 to N, do:
6.     w(i) = w(i) + d(x^i, NM(x)^i)/(P × T) − d(x^i, NH(x)^i)/(P × T)
7.   end do
8. end do
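A runnable sketch of Algorithm 1 is given below. It is an illustration only: it assumes the per-feature distance d(x^i, ·) is the absolute difference of the (typically normalized) feature values, and the squared Euclidean distance is used to find the neighbours.

```python
import numpy as np

def relief(X, y, n_iter=None, rng=None):
    """Relief feature weighting (Algorithm 1): for a randomly picked instance,
    reward features that differ at the nearest miss and penalize features
    that differ at the nearest hit."""
    rng = np.random.default_rng(rng)
    P, N = X.shape
    T = n_iter or P
    w = np.zeros(N)
    for _ in range(T):
        i = rng.integers(P)
        x, label = X[i], y[i]
        dist = np.sum((X - x) ** 2, axis=1)
        dist[i] = np.inf                               # exclude the instance itself
        hit = np.argmin(np.where(y == label, dist, np.inf))    # nearest hit
        miss = np.argmin(np.where(y != label, dist, np.inf))   # nearest miss
        w += np.abs(x - X[miss]) / (P * T) - np.abs(x - X[hit]) / (P * T)
    return w   # higher weight = more relevant feature
```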

ReliefF [23] extends the original Relief algorithm to deal with the multi-class situation. It incorporates two important improvements. First, the result is more robust to noise because the k nearest neighbours are considered instead of a single one. Second, it can deal with multi-class problems. The detailed pseudocode is shown in Algorithm 2.

Algorithm 2: ReliefF
Description: instances are described by N features and there are C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; k nearest neighbours are used; T iterations are performed; p(y) is the class probability, i.e. the probability of an instance belonging to class y.

1. Set all the feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N with label y_x;
4.   For y = 1 to C, do:
5.     Find the k nearest instances of x from class y: x(y, l), where l = 1, 2, ..., k;
6.     For i = 1 to N, do:
7.       For l = 1 to k, do:
8.         if y = y_x (nearest hit), then
9.           w(i) = w(i) − d(x^i, x(y, l)^i)/(T × k)
10.        else (nearest miss),
11.          w(i) = w(i) + [p(y)/(1 − p(y_x))] × d(x^i, x(y, l)^i)/(T × k)
12.        end if
13.      end for
14.    end for
15.  end for
16. end for

3.4.3 Fisher

The Fisher score [40] is an effective supervised feature selection criterion which aims to select features that take similar values within the same class and different values across different classes. The evaluation score of the Fisher criterion is:

    Fisher(f_i) = \frac{ \sum_{j=1}^{c} n_j (\mu_{i,j} - \mu_i)^2 }{ \sum_{j=1}^{c} n_j \sigma_{i,j}^2 }    (58)

where f_i is the i-th feature to be evaluated, n_j is the number of instances in class j, \mu_i is the mean of feature i, and \mu_{i,j} and \sigma_{i,j}^2 are respectively the mean and the variance of feature i in class j. The Fisher criterion is computationally efficient and widely applied; however, because it considers the features individually, it has no ability to deal with redundant features.
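A short sketch of Eq. (58) is shown below (illustration only; a small constant is added to the denominator to guard against features with zero within-class variance):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score of each feature (Eq. 58): between-class scatter of the
    per-class feature means divided by the within-class variance."""
    mu = X.mean(axis=0)                        # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - mu) ** 2
        den += n_c * Xc.var(axis=0)
    return num / (den + 1e-12)                 # higher score = more discriminative
```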

4 GMLVQ Based Feature Selection Algorithms

4.1 Entropy Enforcement for Feature Ranking Results

As stated before, the element \Lambda_{i,j} of the matrix \Lambda measures the correlation between features i and j, and the diagonal element \Lambda_{i,i} quantifies the contribution of feature i to the classification. This statement only makes sense when the features have similar magnitudes; therefore a z-score transformation is always performed on the data before the training starts. One example is shown in Figure 7, where 32 features are ranked with respect to the values of the diagonal elements. The 19th feature has the highest value, indicating that it has the largest correlation with the classification. An additional constraint is imposed so that after each adaptation the sum of the diagonal elements is equal to one:

    \sum_{i=1}^{N} \Lambda_{i,i} = 1    (59)

Figure 7: One example of the diagonal elements of \Lambda.

One of the ideal situations in feature ranking is that some of the features are much more important than the others, so that the least important features can be removed from the feature set without deteriorating the classification performance. An external entropy force is added to the cost function to push the diagonal elements towards this ideal situation.

The entropy force is defined as:

    \mathrm{Entropy}(\Lambda_{\mathrm{diag}}) = - \sum_{i=1}^{N} \Lambda_{i,i} \log_2 \Lambda_{i,i}    (60)

where N is the data dimension. This term reaches its maximum when all the diagonal elements are equal, i.e. when all features are equally important for the classification. Its minimization, on the other hand, pushes towards a discriminative feature relevance profile; in the extreme case, only one feature is identified as relevant for the classification and the relevances of all other features are zero.

It is integrated into the cost function by:

    F_{new} = F + \alpha \times \mathrm{Entropy}(\Lambda_{\mathrm{diag}})    (61)

where the regularization parameter \alpha controls the trade-off between the classification accuracy and the discrimination between features. A larger value of \alpha produces a more discriminative feature ranking result at the cost of classification performance. Their mutual relation on one of the data sets is visualized in Figure 8.

The choice of the regularization value depends on how important the accuracy and the discrimination are to the user and differs per data set. A safe way is to plot their relation for each data set and choose the optimal point based on the requirements. A more efficient way, used in this thesis, is to choose the value as N^2/10, where N is the data dimension. On all of the data sets considered in this thesis, such a value generates a considerably discriminative feature ranking result without deteriorating the performance to a large extent. An example with and without the entropy force can be seen in Figure 9.
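A small sketch of the entropy term of Eq. (60) and its gradient with respect to the diagonal elements is given below. This is an illustration only: in a full implementation the gradient would be scaled by α and chained through Λ = Ω^T Ω into the Ω update of Eq. (61); a clip guards against log(0).

```python
import numpy as np

def entropy_penalty(lam_diag):
    """Entropy of the trace-normalized diagonal of Lambda, Eq. (60)."""
    p = np.clip(lam_diag, 1e-12, None)        # guard against log(0)
    return -np.sum(p * np.log2(p))

def entropy_penalty_grad(lam_diag):
    """Gradient of Eq. (60) with respect to each diagonal element Lambda_ii:
    d/dp (-p log2 p) = -(log2 p + 1/ln 2)."""
    p = np.clip(lam_diag, 1e-12, None)
    return -(np.log2(p) + 1.0 / np.log(2.0))
```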

4.2 Way-Point Average Algorithm

Gradient based minimization is a popular and powerful method in non-linear optimization [31]. In this thesis, batch gradient descent is employed to train the GMLVQ model. One of the critical choices in gradient descent methods is the appropriate choice of the step size: too small a step size slows the convergence, while large steps can result in oscillatory or even divergent behavior.

In this section, a modification of batch gradient descent [32] is introduced which aims at better convergence behavior. The idea is that, during the training procedure, we compare the cost of the normal descent update with that of the gliding average over the most recent steps; if the latter produces a lower cost value, the minimization jumps to that configuration and decreases the step size at the same time. A more detailed description is given below.

Consider minimizing an objective function F with respect to an N-dimensional vector x ∈ R^N. A gradient descent process is started at x_0 and proceeds by generating a sequence of steps iteratively:

    x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|}    (62)

Note that the gradient has been normalized by its length, and therefore a_t is exactly the step length of the update: |x_{t+1} - x_t| = a_t.

The waypoint averaging algorithm starts at x_0 with initial step size a_0 and performs k steps with the step size unchanged:

    x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \quad \text{with } a_t = a_0    (63)

After that (t \ge k), the procedure proceeds as follows:

1. Perform the normal gradient descent step and evaluate the corresponding cost function:

    \tilde{x}_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(\tilde{x}_{t+1})    (64)

2. Perform the waypoint average over the previous j steps:

    \bar{x}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} x_{t-i} \quad \text{and calculate } F(\bar{x}_{t+1})    (65)

3. Determine the new position and step size by comparison:

    x_{t+1} = \tilde{x}_{t+1} \text{ and } a_{t+1} = a_t \quad \text{if } F(\tilde{x}_{t+1}) \le F(\bar{x}_{t+1}); \qquad x_{t+1} = \bar{x}_{t+1} \text{ and } a_{t+1} = \lambda a_t \text{ otherwise}    (66)

with the parameter 0 < \lambda < 1.

As can be seen from the algorithm, as long as the normal gradient descent step produces a position with lower cost than the waypoint average, the iteration proceeds as a normal gradient descent algorithm. On the other hand, F(\bar{x}_{t+1}) < F(\tilde{x}_{t+1}) indicates the potential existence of oscillatory behavior: under oscillatory conditions the position fluctuates around the local minimum, and it is expected that the average over the previous steps may provide a closer estimate of the minimum than the normal gradient descent update. It also indicates that the step size may be too large to reach the minimum and should be decreased for better convergence.

An intuitive example is shown in Figure 10 [32], which visualizes the adaptation steps of both the normal gradient descent and the waypoint averaging algorithm. The dotted lines mark the update trajectory of the normal gradient descent algorithm with constant step size, which displays strong oscillatory behavior. The waypoint averaging algorithm shares the same trajectory with the normal gradient descent in the first four steps; after that, it jumps to the average position over the previous steps and reduces the step size at the same time, which enables it to move closer to the minimum in the middle.

Figure 10: Adaptation steps of the waypoint averaging algorithm compared with normal gradient descent. From [32].
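A self-contained sketch of the procedure of Eqs. (62)-(66) for a generic objective is given below (illustration only; the toy quadratic at the bottom is a hypothetical example, and k = 3, λ = 2/3 follow the values used later in this section):

```python
import numpy as np

def waypoint_gradient_descent(f, grad, x0, step=0.1, k=3, shrink=2/3, n_steps=200):
    """Waypoint-averaging descent (Eqs. 62-66): take normalized gradient steps,
    and whenever the average over the last k positions has a lower cost than
    the plain gradient step, jump to that average and shrink the step size."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for t in range(n_steps):
        g = grad(x)
        x_grad = x - step * g / (np.linalg.norm(g) + 1e-12)   # normalized step
        if t < k:
            x = x_grad                                        # plain steps at first
        else:
            x_avg = np.mean(history[-k:], axis=0)             # waypoint average
            if f(x_grad) <= f(x_avg):
                x = x_grad                                    # keep the step size
            else:
                x, step = x_avg, shrink * step                # jump and shrink
        history.append(x.copy())
    return x

# Usage on a hypothetical ill-conditioned quadratic f(x) = x^T A x.
A = np.diag([1.0, 30.0])
x_min = waypoint_gradient_descent(lambda x: x @ A @ x, lambda x: 2 * A @ x,
                                  x0=[3.0, 2.0], step=0.5)
```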

When considering its application in GMLVQ, since the cost function in GMLVQ has to be optimized with respect to both the prototypes w and the matrix \Omega, two independent waypoint averaging procedures, one for w and one for \Omega, have to be performed. The typical scheme is formulated as follows.

Given a GMLVQ system represented by \Omega and a set of prototypes \{w_k\}_{k=1}^{M}, with cost function F, choose the starting points \Omega_0 and \{w_k^0\}_{k=1}^{M} and the initial step sizes a_0^{\Omega} and a_0^{w} for \Omega and w.

1. Perform k steps (k = 3 in this thesis) with the step sizes unchanged:

    \Omega_{t+1} = \Omega_t - a_t^{\Omega} \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \quad \text{with } a_t^{\Omega} = a_0^{\Omega}    (67)

    w_{t+1} = w_t - a_t^{w} \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \quad \text{with } a_t^{w} = a_0^{w}    (68)

After that (t \ge k), the procedure proceeds as follows:

2. Perform the normal gradient descent step and evaluate the corresponding cost function for both \Omega and w:

    \tilde{\Omega}_{t+1} = \Omega_t - a_t^{\Omega} \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(\tilde{\Omega}_{t+1})    (69)

    \tilde{w}_{t+1} = w_t - a_t^{w} \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(\tilde{w}_{t+1})    (70)

3. Perform the waypoint average over the previous j steps (j = 3 in this thesis):

    \bar{\Omega}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} \Omega_{t-i} \quad \text{and calculate } F(\bar{\Omega}_{t+1})    (71)

    \bar{w}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} w_{t-i} \quad \text{and calculate } F(\bar{w}_{t+1})    (72)

4. Determine the new step sizes and positions for both \Omega and w:

    \Omega_{t+1} = \tilde{\Omega}_{t+1} \text{ and } a_{t+1}^{\Omega} = a_t^{\Omega} \quad \text{if } F(\tilde{\Omega}_{t+1}) \le F(\bar{\Omega}_{t+1}); \qquad \Omega_{t+1} = \bar{\Omega}_{t+1} \text{ and } a_{t+1}^{\Omega} = \lambda a_t^{\Omega} \text{ otherwise}    (73)

    w_{t+1} = \tilde{w}_{t+1} \text{ and } a_{t+1}^{w} = a_t^{w} \quad \text{if } F(\tilde{w}_{t+1}) \le F(\bar{w}_{t+1}); \qquad w_{t+1} = \bar{w}_{t+1} \text{ and } a_{t+1}^{w} = \lambda a_t^{w} \text{ otherwise}    (74)

with the parameter \lambda = 2/3.

4.3 Feature Ranking Ambiguity Removal

Up to this point we have the input vectors and class labels:

    \{x_i, y_i\}_{i=1}^{n} \quad \text{with } x_i \in \mathbb{R}^N, \; y_i \in \{1, 2, \dots, C\}    (75)

associated with a set of prototypes:

    \{w_k\}_{k=1}^{M} \quad \text{where } M \ge C    (76)

and the distance is calculated as:

    d(x_i, w_k) = (x_i - w_k)^T \Lambda (x_i - w_k) = (x_i - w_k)^T \Omega^T \Omega (x_i - w_k) = | \Omega (x_i - w_k) |^2    (77)

where \Lambda, \Omega \in \mathbb{R}^{N \times N} and \Omega = [z_1, z_2, \dots, z_N]^T, with \{z_i\}_{i=1}^{N} being column vectors of dimension N. The feature ranking results can be obtained from the values of the diagonal elements of the matrix \Lambda. However, the question arises whether there is another matrix \Lambda which keeps the distance measure unchanged. If such a matrix exists, the feature ranking results can differ without modifying the classifier, which means that the feature ranking results obtained in the previous steps are not unique.

Consider a vector v_j which satisfies the following constraints:

    \forall i : v_j^T x_i = 0    (78)

    \forall k : v_j^T w_k = 0    (79)

If we add such a vector v_j to any row z_i^T of the matrix \Omega, consider, for instance, i = 1:

    \Omega_{new} = [z_1 + v_j, z_2, \dots, z_N]^T    (80)

then we can easily verify that the following mappings remain unchanged:

    \forall i : \Omega_{new} x_i = \Omega x_i    (81)

    \forall k : \Omega_{new} w_k = \Omega w_k    (82)

Therefore, the distances between any pair of input samples and prototypes remain the same:

    d(x_i, w_k) = | \Omega (x_i - w_k) |^2 = | \Omega_{new} (x_i - w_k) |^2 \quad \text{for all } i, k    (83)

Since the mapping and the distance calculation are the same for \Omega and \Omega_{new}, the cost functions, classification errors and classifiers they produce will also stay the same. However, the feature ranking results may vary between \Omega and \Omega_{new}, because there is no constraint enforcing consistency between the diagonal elements of \Lambda and \Lambda_{new}.

Without loss of generality, we assume that there are J such spurious vectors \{v_j\}_{j=1}^{J}, and since, as will be proved in later stages, all such vectors are eigenvectors of a constructed matrix, we can additionally assume that all the vectors \{v_j\}_{j=1}^{J} are orthonormal:

    v_j \cdot v_k = \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases}    (84)

The proposed solution is to project all the spurious directions \{v_j\}_{j=1}^{J} out of a given matrix \Omega:

    \Omega_{new}^T = \left[ I - \sum_{j=1}^{J} v_j v_j^T \right] \Omega^T    (85)

It follows that:

    \Omega_{new} (x_i - w_k) = \Omega (x_i - w_k) - \sum_{j=1}^{J} \Omega v_j \underbrace{v_j^T (x_i - w_k)}_{0} = \Omega (x_i - w_k)    (86)

    v_k^T \Omega_{new}^T = v_k^T \Omega^T - \sum_{j=1}^{J} \underbrace{v_k^T v_j}_{\delta_{jk}} v_j^T \Omega^T = v_k^T \Omega^T - v_k^T \Omega^T = 0    (87)

Hence, we can interpret the resulting matrix \Omega_{new} as the minimal representation of the mapping, which contains no contribution of the spurious directions v_j.

The next question is how to find all these vectors \{v_j\}_{j=1}^{J}. The conditions \forall i : v_j^T x_i = 0 and \forall k : v_j^T w_k = 0 can be rewritten as

    X^T v_j = 0 \quad \text{where } X = [x_1, x_2, \dots, x_L, w_1, w_2, \dots, w_M]    (88)
