Supervised Feature Selection Based on Generalized Matrix Learning
Vector Quantization
Zetao Chen
S2061244
September 2012
Master Project
Artificial Intelligence
University of Groningen, the Netherlands
Internal Supervisor:
Dr. Marco Wiering (Artificial Intelligence, University of Groningen)
External Supervisor:
Prof. Michael Biehl (Computer Science, University of Groningen)
Contents

1 Introduction and Background
  1.1 Motivation
  1.2 Research Questions
  1.3 Thesis Outline
2 Machine Learning
  2.1 Basic Concept of Machine Learning
    2.1.1 Definition of learning
    2.1.2 Definition of machine learning
    2.1.3 Data representation
  2.2 Classification
    2.2.1 Unsupervised and supervised learning
  2.3 Learning Algorithms
    2.3.1 SVM with RBF kernel
    2.3.2 LVQ
    2.3.3 Two variants of LVQ: GRLVQ and GMLVQ
3 Feature Selection
  3.1 Challenge
    3.1.1 Curse of dimensionality
    3.1.2 Irrelevance and redundancy
  3.2 General Framework
  3.3 Wrapper and Filter Approach
  3.4 Feature Ranking Technique
    3.4.1 Information gain
    3.4.2 ReliefF
    3.4.3 Fisher
4 GMLVQ Based Feature Selection Algorithms
  4.1 Entropy Enforcement for Feature Ranking Results
  4.2 Way-Point Average Algorithm
  4.3 Feature Ranking Ambiguity Removal
5 Experiments and Results
  5.1 Data Set Description
  5.2 Experiment Design
  5.3 Results and Discussion
    5.3.1 Case Study 1: Adrenal Tumor
    5.3.2 Case Study 2: Ionosphere
    5.3.3 Case Study 3: Connectionist Bench Sonar
    5.3.4 Case Study 4: Breast Cancer
    5.3.5 Case Study 5: SPECTF Heart
  5.4 Discussion and Summary
6 Conclusion and Future Work
Abstract

Data mining involves the use of data analysis tools to discover and extract information from a data set and transform it into an understandable expression. One of its central problems is to identify a representative subset of features from which a learning model can be constructed. Feature selection is an important pre-processing step before data mining; it aims to select a representative subset of features with high predictive information and to eliminate irrelevant features with little importance for classification. By reducing the dimensionality of the data, feature selection helps to decrease the training time, and by selecting the most relevant features and removing irrelevant and noisy data, the classification performance may be improved. Moreover, with a smaller feature subset, the learned model may be more intuitive and easier to interpret.

This thesis investigates the extension of the Generalized Matrix LVQ (GMLVQ) model to feature selection. GMLVQ employs a full matrix as the distance metric in training. The diagonal and off-diagonal elements of this matrix respectively measure the contribution of each feature and each feature pair to classification; therefore, their distribution can provide a quantitative measurement of feature weight. Further steps and analysis are performed to enforce a more discriminative feature selection result and to remove the weighting ambiguity. Moreover, compared to other methods, which perform feature ranking first and learn a model after selecting the feature subset, GMLVQ based methods can combine the processes of feature ranking and classification, which helps to decrease the computation time.

Experiments in this thesis were performed on data sets collected from the UCI Machine Learning Repository [29]. The GMLVQ based feature weighting algorithm is compared with other state-of-the-art methods: Information Gain, Fisher and ReliefF. All four feature ranking methods are evaluated using both GMLVQ and the RBF based Support Vector Machine (RBF-SVM) by increasing the size of the selected feature subset with a fixed step size. The results indicate that the performance of the GMLVQ based feature selection method is comparable to the other methods, and on some of the data sets it consistently outperforms them.
1 Introduction and Background
1.1 Motivation
For a machine learning algorithm to be successful on a given task, the representation and quality of the data are the first and most important factors. With the advance of database technology, data has become easier to access and more features can be gathered for a specific task. However, more features do not necessarily result in more discriminative classifiers. Instead, when there are too many redundant or irrelevant features, the computation can become much more expensive and the classifier may have poor generalization performance due to the interference of noise; therefore, proper data preprocessing is essential for the successful training of machine learning algorithms.

Feature selection is one of the most important and frequently used preprocessing techniques [5]. It aims to identify and select the most discriminative subset of the original features while eliminating irrelevant, redundant and noisy data. Some studies have shown that irrelevant features can be removed without significant performance degradation [6]. The application of feature selection has several benefits:

1. It reduces the data dimensionality, which helps the learning algorithms to work faster and more effectively;
2. In some cases, the classification accuracy can be improved by using a subset of all features;
3. The selected feature subset is usually more compact and can be interpreted more easily.
To perform feature selection, the training data can be with or without label information, corresponding to supervised or unsupervised feature selection. In unsupervised tasks [1, 2], without considering the label information, feature relevance can be evaluated by measuring some intrinsic properties of the data, such as the separability or covariance. In practice, unlabeled data is easier to obtain than labelled data, which indicates the significance of unsupervised algorithms. However, these methods ignore label information, which may lead to performance deterioration when the label information is available. Supervised feature selection is proposed to take the label information into account. It can generally be divided into two major frameworks: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20]. The filter model performs the feature selection as a pre-processing step, independent of the choice of the classifier. The wrapper model, on the other hand, evaluates subsets of features according to their usefulness to a given predictor.

Feature selection techniques can be further categorized into feature ranking and feature subset selection. Feature ranking methods assign a weight to each feature and select the subset of features by choosing a threshold and eliminating all features which do not achieve that score. Feature subset selection searches for the optimal subset which collectively has the best performance with respect to some predictor. In this thesis, a new method for feature ranking is investigated and compared with other state-of-the-art ones.
Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods; it was first introduced by Kohonen [3]. After that, several advanced cost functions were proposed to improve the performance, one example being Generalized LVQ [4], which is based only on the Euclidean distance. To model the different contributions of features to classification, Generalized Relevance LVQ was proposed [4, 7], which extends the Euclidean distance with scaling or relevance factors for all features. The recently introduced Generalized Matrix LVQ (GMLVQ) [33] extends the distance measurement further to account for the pairwise contributions of features. The distance matrix in GMLVQ contains information which may be useful for feature selection. For example, the diagonal element Λ_ii of the dissimilarity matrix can be regarded as a measurement of the overall relevance of feature i for classification, and the off-diagonal element Λ_ij can be interpreted as the contribution of the feature pair i and j. A high absolute value indicates the existence of a highly relevant relationship, while an absolute value closer to zero may suggest that it is not that important for the classification.

The above discussion illustrates the potential application of GMLVQ to feature ranking, which has not yet been fully investigated. Early studies include applying GMLVQ to select the best feature in the classification of lung disease [39] and to select the most discriminative marker in the diagnosis of adrenal tumors [41]. In this thesis, a further investigation is conducted and experiments on more data sets are carried out.
1.2 Research Questions
This thesis will attempt to answer the following questions:

1. Can the GMLVQ method be extended to perform feature ranking?
2. How well does the feature ranking perform? In this thesis, the GMLVQ based feature ranking technique will be compared with three other state-of-the-art feature ranking methods. All four methods will be evaluated with GMLVQ and RBF-SVM in terms of their AUC metric.
3. Can GMLVQ combine feature ranking and classification into one single process, and how well does the classification perform compared to other methods in which feature ranking and classification are performed in two separate steps?
1.3 Thesis Outline
This thesis has six chapters and is organized as follows. Chapter 2 presents the basic concepts in machine learning and the details of the learning algorithms (the Support Vector Machine and LVQ variants) that are used to evaluate the performance of the various feature ranking algorithms at a later stage. Chapter 3 discusses the idea of feature selection, its general framework and three state-of-the-art feature ranking techniques which will be compared with the GMLVQ based ranking method. Chapter 4 gives a description of the GMLVQ based feature ranking method; in this chapter, details are given on how to extract a feature ranking from GMLVQ, the waypoint averaging algorithm and how to obtain a unique feature ranking result. Chapter 5 elaborates on the experiments conducted to compare the four feature ranking techniques discussed above, and is followed by Chapter 6, which states the conclusion and future work of this thesis.
2 Machine Learning
In this chapter, we first give a brief introduction to machine learning and some of its basic concepts: data representation, classification and learning algorithms. We then present some specific learning algorithms, namely the RBF-based SVM, basic LVQ and two of its variants. The learning algorithms introduced in this chapter will later be used to evaluate the feature selection algorithms introduced in Chapter 3.
2.1 Basic Concept of Machine Learning
2.1.1 Definition of learning

What is learning? Learning generally refers to the mutual interaction between the environment and a person through which one gains or modifies knowledge or skills. A more formal definition was given by Runyon in 1977 [36]: "Learning is a process in which behavior capabilities are changed as the result of experience, provided the change cannot be accounted for by native response tendencies, maturation, or temporary states of the organism due to fatigue, drugs, or other temporary factors."

One example of learning is the association between events. For instance, if a person tastes an apple for the first time and finds it delicious, he will assume that the next apple he encounters will also be delicious, although he has not eaten it and that apple is different from the one he ate. The important discovery here is the association with the fact that the apple is tasty. This association is the knowledge someone gains from the experience of eating an apple.
2.1.2 Definition of machine learning

Learning for computers falls into the field of machine learning. A widely accepted definition is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [37]. The experience here usually refers to the data which demonstrate the relationship between observed variables.

There are many example applications of machine learning. One of the largest groups lies in the categorization of objects into a set of pre-specified classes or labels. Some practical examples are:

1. Optical Character Recognition: classify images of handwritten characters into the specific letters;
2. Face Recognition: categorize facial images by the person they belong to;
3. Medical Diagnosis: determine whether or not a patient suffers from some disease;
4. Stock Prediction: predict whether a stock will go up or down.
2.1.3 Data representation
In the field of machine learning, data is represented by a table where each row corresponds to one sample or instance and each column describes one attribute or feature. In the case of supervised learning, there is an additional column containing the label information for each instance. One example is shown in Figure 1. There are 14 instances in this example and each instance consists of data with four features, "Outlook", "Temperature", "Humidity" and "Wind", and the label information specifying whether or not to play.

The mathematical expressions for the data and labels are presented here to serve as the notation in this thesis. Let {x_i, y_i} denote the i-th instance, where x_i ∈ R^N denotes the data in the N-dimensional space and y_i is the corresponding label information with C different possible values. In brief, the combination of data and label is expressed as:

\[ \{x_i, y_i\} \in \mathbb{R}^N \times C \tag{1} \]

2.2 Classification
As discussed in the previous section, a major task in machine learning is to learn how to classify objects into one of a pre-defined set of labels. In such a task, objects are described by a set of features which characterize the objects in a class. For example, to identify whether a fruit is a banana, people have to check its color, size and shape and infer its label from this information.

2.2.1 Unsupervised and supervised learning

The classification task discussed above is generally referred to as supervised learning: the labels of the training data are provided and the learning algorithm tries to generalize from the training instances so that novel objects can be classified into the correct categories. In contrast, unsupervised learning refers to learning in which the labels of the training data are unknown. Its goal is to group the training data into different clusters by evaluating some intrinsic properties of the data, such as the separability or covariance; therefore, the quality of the data provided for training is crucial. If irrelevant or noisy data are provided, misclassifications will happen on novel data.
2.3 Learning Algorithms
In this section, two supervised learning algorithms will be described: the SVM algorithm with RBF kernel and the LVQ algorithm with its two variants, GRLVQ and GMLVQ. GMLVQ and RBF-SVM will later be used to evaluate the performance of the four feature ranking methods.
2.3.1 SVM with RBF kernel
The Support Vector Machine (SVM) was originally proposed by Vapnik for classification and regression [25, 24, 26, 27] and was later also extended to other applications [28]. It has attracted considerable attention in recent years due to its superior performance and well-developed theoretical foundation. For this reason, it also serves as an evaluation method for the feature selection results in this thesis.

The SVM is a method to find an optimal hyperplane that separates the training data of two or more classes while maximizing its margin. The linear Support Vector Machine, as the simplest and most basic case, will be introduced first. We will then show how it can classify non-linearly separable data in a feature space of higher dimension.
Linear SVM and separating hyperplane maximization. The linear SVM is a supervised learning method which is built upon a group of labelled samples and performs binary classification in the feature space. Let us denote the data and labels as (x_i, y_i), where x_i ∈ R^N is an N-dimensional feature vector and y_i is the label of sample x_i. In a two-class problem, y_i ∈ {+1, −1}. The classification process of a supervised learning algorithm can then be regarded as a mapping f(x_i): R^N → R which maps the feature vector from the N-dimensional space to the class membership of the vector and divides the space into two halves. Without loss of generality, it is assumed that f(x_i) > 0 and y_i = 1 indicate that the feature vector belongs to class 1, while f(x_i) < 0 and y_i = −1 specify class 2. A formal definition of linearly separable data can then be given: a data set is linearly separable if the following equations hold:

\[ \forall\, y_i = 1 : f(x_i) > 0 \tag{2} \]
\[ \forall\, y_i = -1 : f(x_i) \le 0 \tag{3} \]

An illustrative example is shown in Figure 2. As can be seen from the figure, all the points with y_i = 1 are classified to the positive side of the hyperplane and the others, with y_i = −1, are on the opposite side. The discriminant function in Figure 2 is a linear model and can be expressed as:

\[ f(x) = w^T x + b \tag{4} \]

where w indicates the weight vector and b is the bias. The hyperplane which divides the space into two half-spaces is expressed as:

\[ f(x) = w^T x + b = 0 \]
The discriminant function f(x) can also be used to measure the distance of a data point to the hyperplane. Consider the point x_d and its normal projection x_0 on the hyperplane in Figure 2. The coordinates of the point x_d can then be expressed as:

\[ x_d = x_0 + d\, \frac{w}{\|w\|} \tag{5} \]

where d describes the algebraic distance between the points x_d and x_0. Because x_0 is on the hyperplane, f(x_0) = 0. We have:

\[ f(x_d) = f\Big(x_0 + d\, \frac{w}{\|w\|}\Big) = w^T \Big(x_0 + d\, \frac{w}{\|w\|}\Big) + b \tag{6} \]
\[ = f(x_0) + d\, \frac{w^T w}{\|w\|} = d\, \|w\| \tag{7} \]

It follows that d = f(x_d)/‖w‖, and to enforce that d is always positive under correct classification, we define:

\[ d_i = \frac{y_i f(x_i)}{\|w\|} \tag{8} \]
The margin p can then be defined as the distance between the hyperplane and the closest data points on both sides:

\[ p = \min_{i=1,2,\dots,n} \frac{y_i f(x_i)}{\|w\|} \tag{9} \]

where n is the number of examples in the training data set. The linear SVM is trained to find an optimal hyperplane that maximizes the margin p. As shown in the formula above, this can be achieved either by maximizing the value of y_i f(x_i) for the closest points or by minimizing ‖w‖. Since w^T x + b can be scaled without changing its sign, it is reasonable to impose the constraint:

\[ y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n \tag{10} \]

Therefore, the optimization problem can be formulated as [25]: given a set of training samples {x_i, y_i}_{i=1}^n, find the optimal parameters w and b which satisfy the constraint

\[ y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n \tag{12} \]

and minimize the following function:

\[ L = \tfrac{1}{2}\, w^T w \tag{14} \]

This is called the primal problem and can be solved by constructing the Lagrange function [30]:

\[ J(w, b, a) = \tfrac{1}{2}\, w^T w - \sum_{i=1}^{n} a_i \left[ y_i (w^T x_i + b) - 1 \right] \tag{15} \]
The a_i here are called the Lagrange multipliers, and the solution of this optimization problem should be minimized with respect to w and b and maximized with respect to the a_i. As a result, it follows that

\[ \frac{\partial J(w, b, a)}{\partial w} = w - \sum_{i=1}^{n} a_i y_i x_i = 0 \tag{16} \]

and

\[ \frac{\partial J(w, b, a)}{\partial b} = \sum_{i=1}^{n} a_i y_i = 0 \tag{17} \]

which gives rise to

\[ w = \sum_{i=1}^{n} a_i y_i x_i \tag{18} \]

and

\[ \sum_{i=1}^{n} a_i y_i = 0 \tag{19} \]

Then, by substituting the above two equations into equation (15), the objective becomes:

\[ Q(a) = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j x_i^T x_j \tag{20} \]

The corresponding problem is called the dual problem and is formulated as follows: given training samples {x_i, y_i}_{i=1}^n, find the optimal Lagrange multipliers {a_i}_{i=1}^n which maximize the objective function above and satisfy the following constraints:

1. \( \sum_{i=1}^{n} a_i y_i = 0 \);
2. \( a_i \ge 0 \) for i = 1, 2, ..., n.
After the Lagrange multipliers are determined, the weight vector can easily be determined by

\[ w = \sum_{i=1}^{n} a_i y_i x_i \tag{21} \]

and the bias b can be determined by arbitrarily choosing a support vector {x_i, y_i} and solving

\[ y_i (w^T x_i + b) = 1 \tag{22} \]

which gives

\[ \forall\, y_i = 1 : b = 1 - w^T x_i \tag{23} \]

or

\[ \forall\, y_i = -1 : b = -1 - w^T x_i \tag{24} \]

It is also important to state the Karush-Kuhn-Tucker theorem [25, 30], which gives the following constraint on the saddle point of the Lagrangian:

\[ a_{i0} \left[ y_i (w_0^T x_i + b_0) - 1 \right] = 0 \quad \text{for } i = 1, 2, \dots, n \tag{26} \]

It states that a_{i0} ≠ 0 only for the points which satisfy y_i(w_0^T x_i + b_0) = 1; these points are called the support vectors.

To sum up, we have:

\[ f(x) = \sum_{i=1}^{m} a_{i0} y_i x_i^T x + b_0 \tag{27} \]

where {x_i}_{i=1}^m are the support vectors and {a_{i0}}_{i=1}^m the corresponding Lagrange multipliers.

Non-linearly separable data and soft margin. In practical applications,
many data sets are non-linearly separable, which makes the algorithm of the previous section infeasible. One example is shown in Figure 3. As can be seen from the figure, although most of the points are classified to the correct side, there are still some points which violate the hyperplane: they either cross the boundary of the margin but are still located in the correct half-space, or have been misclassified into the incorrect half-space. In such cases, it is impossible to find a hyperplane which completely removes the errors; instead, a solution can be proposed that minimizes the errors on the training data.
Slack variables are introduced to solve this problem. For a data set with n samples, there are n slack variables {ε_i}_{i=1}^n which satisfy:

\[ \forall\, y_i = 1 : w^T x_i + b \ge 1 - \varepsilon_i \tag{28} \]
\[ \forall\, y_i = -1 : w^T x_i + b \le -1 + \varepsilon_i \tag{29} \]

The slack variable ε_i is a measure of the violation of the margin. If 0 < ε_i < 1, the sample violates the margin but is still correctly classified. When ε_i > 1, the sample is classified into the wrong half-space. Since the goal is to have fewer training samples misclassified, a penalty term can be added:

\[ \eta(\varepsilon) = \sum_{i=1}^{n} \varepsilon_i \tag{30} \]

which should be minimized. It can be incorporated into the cost function of the previous section as:

\[ f = \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{n} \varepsilon_i \tag{31} \]

The parameter C controls the trade-off between the rigidity of the margin enforcement and the number of errors tolerated during training. A larger value of C produces a model that is more accurate on the training data while at the same time increasing the risk of overfitting; therefore, the value of C has to be optimized by the user during the experiment.

The corresponding Lagrange function for this problem is:

\[ J(w, b, a, \mu, \varepsilon) = \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \mu_i \varepsilon_i - \sum_{i=1}^{n} a_i \left[ y_i (w^T x_i + b) - 1 + \varepsilon_i \right] \tag{32} \]

where the µ_i are the Lagrange multipliers for the slack variables.

Kernel trick. Consider the typical XOR problem, which tries to separate four examples at the four corners of a rectangle such that the two examples connected by a diagonal belong to the same class. This is impossible in a two-dimensional space, but when the data is projected into a three-dimensional space it becomes much easier. This example indicates that a non-linearly separable data set may become linearly separable in a higher dimensional space: this kind of mapping increases the separability of the data set.
Let the function θ define the non-linear mapping:

\[ \theta : \mathbb{R}^N \to H \tag{33} \]

so that the discriminant function becomes:

\[ f(x) = \sum_{i=1}^{n} a_i y_i\, \theta(x_i)^T \theta(x) + b \tag{34} \]

The kernel function is defined here by:

\[ K(x, y) = \theta(x)^T \theta(y) \tag{35} \]

and the discriminant function turns into:

\[ f(x) = \sum_{i=1}^{n} a_i y_i K(x_i, x) + b \tag{36} \]
This expression avoids providing the exact representation in the higher dimensional space. Numerous kernels have been proposed to solve various kinds of problems. One of the most popular is the RBF kernel, which is used in this thesis. The RBF kernel can be expressed as:

\[ K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}} \tag{37} \]

where σ indicates the kernel width. A larger σ gives a smoother function, which avoids overfitting and reproducing the noise in the training data; a smaller σ, on the other hand, implies a more flexible function that can produce highly irregular decision boundaries. Hence, it is important to determine the optimal value for σ by means of cross-validation.
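The thesis itself does not list code, but the tuning procedure can be sketched as follows with scikit-learn (an illustrative assumption, not the software used in this work). Note that scikit-learn parameterizes the RBF kernel by gamma, which corresponds to 1/(2σ²), so a large σ means a small gamma; the data, grid values and seed are made up:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # toy data with 5 features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # labels that are not linearly separable

    param_grid = {
        "C": [0.1, 1, 10, 100],   # trade-off between margin rigidity and errors
        "gamma": [0.01, 0.1, 1],  # kernel width: gamma = 1 / (2 * sigma**2)
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))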
2.3.2 LVQ
Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods; it was first introduced by Kohonen [3]. It has several advantages over other methods. Firstly, the method can be implemented easily, and the complexity of the classifier can be controlled and determined by the user. Secondly, multi-class problems can be tackled naturally without modifying the learning algorithm or decision rule. Lastly, the resulting classifier is intuitive and easy to interpret, due to the assignment of class prototypes and the intuitive classification of new data points to the closest prototype. The resulting prototypes can then provide class-specific attributes of the data. This is a big advantage over methods such as SVMs or neural networks, which suffer from the drawback of acting like a black box; because of this, LVQ has been applied in many fields, such as bioinformatics, satellite remote sensing and image analysis [34, 35, 39].

The training data for LVQ can be denoted as:

\[ \{x_i, y_i\}_{i=1}^{n} \in \mathbb{R}^N \times \{1, 2, \dots, C\} \tag{38} \]

where x_i denotes the data in the N-dimensional space and y_i is the label among C different classes.
LVQ is parameterized by a set of prototypes representing the classes in feature space and a distance measurement, which may be the traditional Euclidean distance or a full matrix trained from the data. One example can be seen in Figure 4, where 4 different prototypes represent 3 different classes.

Traditional LVQ employs the Euclidean distance measurement and is based on nearest prototype classification. To be more specific, a set of prototypes is defined to represent the different classes. If one prototype per class is defined, the prototypes can be represented as W = {w_j, c(w_j)} ∈ R^N × {1, 2, ..., C}. Each unseen example x_new is assigned the label of the prototype closest to it with respect to the distance measurement:

\[ c(x_{\mathrm{new}}) \leftarrow c(w_k) \quad \text{with} \quad w_k = \arg\min_j d(w_j, x_{\mathrm{new}}) \tag{39} \]

This is called a winner-takes-all strategy.
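To make the winner-takes-all rule of Eq. (39) concrete, the following minimal Python sketch (with the Euclidean distance and invented prototypes; it is not the implementation used in the thesis) assigns a new point the label of its nearest prototype:

    import numpy as np

    def lvq_classify(x_new, prototypes, proto_labels):
        """Return the label of the prototype closest to x_new (Eq. 39)."""
        dists = np.linalg.norm(prototypes - x_new, axis=1)  # d(w_j, x_new)
        return proto_labels[np.argmin(dists)]               # winner takes all

    # toy setup: 4 prototypes representing 3 classes, as in Figure 4
    prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [0.0, 3.0]])
    proto_labels = np.array([1, 1, 2, 3])
    print(lvq_classify(np.array([0.9, 1.2]), prototypes, proto_labels))  # -> 1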
Training of this model is guided by the minimization of the cost function:

\[ F = \sum_{i=1}^{n} \phi(\varepsilon_i) \quad \text{with} \quad \varepsilon_i = \frac{d(x_i, w_H) - d(x_i, w_M)}{d(x_i, w_H) + d(x_i, w_M)} \tag{40} \]

where φ is any monotonic function (in this thesis, φ(x) = x), and w_H and w_M are respectively the closest prototype with the same and with a different label as sample x_i:

\[ w_H = \arg\min_j d(x_i, w_j) \quad \forall\, c(w_j) = c(x_i) \tag{41} \]
\[ w_M = \arg\min_j d(x_i, w_j) \quad \forall\, c(w_j) \neq c(x_i) \tag{42} \]

In traditional LVQ systems, only the locations of the prototypes are updated during training to minimize the errors: w_H is pushed toward the sample x_i and w_M is pushed away from it. The updates derived from the cost function F are expressed as:

\[ \Delta w_H = -\alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{w_H} d(x_i, w_H) \tag{43} \]
\[ \Delta w_M = \alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{w_M} d(x_i, w_M) \tag{44} \]

where α is the learning rate; φ'(ε_i) = 1 because φ(x) = x; ε'_{i,H} = 2 d(x_i, w_M)/[d(x_i, w_H) + d(x_i, w_M)]² and ε'_{i,M} = 2 d(x_i, w_H)/[d(x_i, w_H) + d(x_i, w_M)]²; and ∇_{w_H} d(x_i, w_H) and ∇_{w_M} d(x_i, w_M) are the derivatives of the distance with respect to w_H and w_M, which therefore depend on the distance measurement.
2.3.3 Two variants of LVQ: GRLVQ and GMLVQ

How the distance is calculated is very important in an LVQ system. One of the most popular metrics is the Euclidean distance, a special case of the Minkowski distance. The Euclidean distance from a data point x_i to a prototype w can be expressed as:

\[ d(w, x_i) = \sqrt{\sum_{j=1}^{N} (x_i^j - w^j)^2} \tag{45} \]

The Euclidean distance assigns the same weight to each feature, indicating that each feature contributes equally to the classification. In practical applications, however, it is usually observed that different features contribute differently. Therefore, relevance learning [7, 4] was proposed to assign adaptive weight values to the different feature inputs:

\[ d(w, x_i) = \sqrt{\sum_{j=1}^{N} \lambda_j (x_i^j - w^j)^2} \tag{46} \]

The corresponding LVQ system is called GRLVQ [7, 4].

Each feature, besides its individual contribution to the classification, also correlates with the others to influence the performance. Generalized Matrix LVQ (GMLVQ) [38] was proposed to extend the previous methods: a full matrix of adaptive relevances is employed as the similarity metric and the distance is calculated as:

\[ d(w, x_i) = (x_i - w)^T \Lambda (x_i - w) \tag{47} \]

where Λ is a full N × N matrix whose off-diagonal elements Λ_{i,j} account for the contribution of the feature pair i and j to the classification. The matrix Λ has to be positive definite to keep the distance result positive. Its positive definiteness is achieved by constructing:

\[ \Lambda = \Omega^T \Omega \tag{48} \]

where Ω is an arbitrary real M × N matrix with M ≤ N. In this thesis, however, we only consider the case M = N. Substituting Eq. (48) into Eq. (47), we obtain:

\[ (x_i - w)^T \Lambda (x_i - w) = (x_i - w)^T \Omega^T \Omega (x_i - w) = [\Omega (x_i - w)]^2 \ge 0 \tag{49} \]

Note that GRLVQ is a special case of GMLVQ in which Λ is diagonal, diag(Λ) = {λ_i}_{i=1}^N. The derivative of the distance d(w, x_i) with respect to the prototype w is:

\[ \nabla_w d(w, x_i) = -2 \Lambda (x_i - w) \tag{50} \]

Substituting Eq. (50) into Eq. (43) and Eq. (44), we obtain the update rules for the closest correct and incorrect prototypes.
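A small Python sketch may make Eqs. (47)-(49) concrete: it computes d(w, x) = |Ω(x − w)|², confirms that the result is non-negative, and checks that a diagonal Λ reduces to the weighted sum of squared feature differences underlying GRLVQ. The matrices and vectors are random stand-ins for trained quantities:

    import numpy as np

    def gmlvq_distance(x, w, omega):
        """d(w, x) = (x - w)^T Omega^T Omega (x - w) = |Omega (x - w)|^2."""
        diff = omega @ (x - w)
        return float(diff @ diff)

    rng = np.random.default_rng(1)
    N = 4
    omega = rng.normal(size=(N, N))          # full N x N matrix (the M = N case)
    x, w = rng.normal(size=N), rng.normal(size=N)
    print(gmlvq_distance(x, w, omega) >= 0)  # non-negative by construction, Eq. (49)

    # A diagonal Lambda recovers the weighted squared distance used by GRLVQ.
    lam = np.array([0.4, 0.3, 0.2, 0.1])
    d_diag = gmlvq_distance(x, w, np.diag(np.sqrt(lam)))
    print(np.isclose(d_diag, np.sum(lam * (x - w) ** 2)))  # True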
In the GMLVQ model, the update rule for the distance matrix Ω also needs to be computed. The derivative of d(w, x_i) with respect to a single element Ω_{lm} is:

\[ \nabla_{\Omega_{lm}} d(w, x_i) = \sum_k (x_i^m - w^m)\, \Omega_{lk} (x_i^k - w^k) + \sum_j (x_i^j - w^j)\, \Omega_{lj} (x_i^m - w^m) \tag{51} \]
\[ = 2\, (x_i^m - w^m) \left[ \Omega (x_i - w) \right]_l \tag{52} \]

The derivative of the cost function F with respect to a single element Ω_{lm} can then be expressed as:

\[ \Delta \Omega_{lm} = \Delta \Omega_{lm}^H + \Delta \Omega_{lm}^M = -2\beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{\Omega_{lm}} d(x_i, w_H) + 2\beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{\Omega_{lm}} d(x_i, w_M) \tag{53} \]

where β is the learning rate for Ω.
3 Feature Selection
3.1 Challenge
In this section, two challenges in feature selection will be discussed. The first issue is the curse of dimensionality and the second is the relevance and redundancy of features.

3.1.1 Curse of dimensionality

In machine learning, the term curse of dimensionality was initially coined by Richard Bellman [10] during his work on dynamic optimization [9, 10], where he found the problem quite difficult to tackle. He stated:

"In view of all that we said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration? It is the curse of dimensionality, a malediction that has plagued the scientist from the earliest days." [10]

Many definitions of it exist to date, but generally it refers to the problems incurred by adding extra features to the space. The reliability of a learning model depends on the density of training examples in the feature space: with increasing dimensionality the feature space becomes sparser, the generalization performance deteriorates, and more training examples are required. For example, if 5 samples are enough in each dimension, then 25 samples are sufficient to fill a two-dimensional cube, but this number increases to 5^20 for a 20-dimensional hypercube.

It has also been observed that it becomes more difficult to estimate a kernel in higher dimensions [11]. Table 1 illustrates the number of samples required to estimate a kernel at density 0 with a certain accuracy.

    Dimensionality    Sample size
    1                 4
    2                 19
    5                 786
    7                 10,700
    10                842,000

Table 1: Sample size required for kernel estimation [11].
3.1.2 Irrelevance and redundancy

There are some controversies in the definition of feature relevance. A review [8] introduces the different relevance definitions that have been proposed in the literature. The authors present an example showing that the existing relevance definitions produce unexpected results and, based on that, suggest that two different degrees of relevance are required: strong relevance and weak relevance. The definition of weak relevance can also be regarded as the definition of redundancy.

Let ⟨X, Y⟩ denote the training examples, where X ∈ R^N is the data and Y indicates the labels. Let F be the full feature set and F_i the i-th feature; each instance X is then one element of the product F_1 × F_2 × ··· × F_N. Let S_i = F − {F_i} denote the feature subset containing all features except F_i, and let s_i denote one value instantiation of S_i. Let P denote the conditional probability of the label Y given a feature subset.

Strong relevance. A feature F_i is strongly relevant iff there exist x_i ∈ F_i and y ∈ Y with P(x_i, s_i) > 0 such that

\[ P(Y = y \mid S_i = s_i, F_i = x_i) \neq P(Y = y \mid S_i = s_i) \]

Weak relevance. A feature F_i is weakly relevant iff it is not strongly relevant and there exist x_i ∈ F_i, s_i ⊆ S_i and y ∈ Y such that

\[ P(Y = y \mid S_i = s_i, F_i = x_i) \neq P(Y = y \mid S_i = s_i) \]

A feature F_i is called relevant if it is either strongly or weakly relevant to the class label; otherwise it is irrelevant. A weakly relevant feature F_i can become strongly relevant after a certain feature subset has been removed. Weak relevance can be interpreted as the existence of other relevant features which provide predictive power similar to that of the feature being measured; this is what we call redundancy. It is important to note that a weakly relevant or redundant feature F_i should not be removed if the feature subset whose removal would make F_i strongly relevant has already been removed by the feature selection algorithm.
3.2 General Framework
The framework in Figure 5 shows that a typical feature selection system usually consists of four components: feature subset generation, feature subset evaluation, a stopping criterion and feature subset validation. As indicated in the figure, the complete feature set is first sent to the "Generation" module, which produces different feature subset candidates based on some search strategy. Each subset candidate is then evaluated in the "Evaluation" module by a certain evaluation measurement; a new subset which turns out to be better replaces the previous best one. This subset generation and evaluation is repeated until the given stopping criterion is met. After that, the finally selected feature subset is sent to the "Validation" module for validation by certain learning algorithms.

Two basic issues have to be addressed in the "Generation" module: the starting point and the search strategy.

• Starting Point. Choose a point at which to start the search in the feature space. One choice is to begin with no features and then, in each iteration, expand the current feature subset by trying each feature that is not yet in the subset; the feature whose addition produces the best evaluation performance is added to the current subset. This is called forward selection (a minimal sketch follows this list). Another option is to proceed conversely: the search starts with the full feature set and successively eliminates the feature whose removal results in the best evaluation performance. This is called backward selection. A third alternative is to start from a random feature subset [13] and then successively add or remove features depending on the performance. This random approach can avoid being trapped in local optima.

• Search Strategy. There are three different search strategies: complete, heuristic and random. The complete strategy examines all possible feature subsets and is guaranteed to find the optimal one. When there are N features, the search examines 2^N subsets, which makes it unrealistic for large N. Heuristic search is guided by some heuristic; it is less computationally demanding, but finding the optimal subset is not guaranteed. The heuristic determines whether or not a better subset can be found. The random strategy simply chooses the next feature at random; the probability of finding the optimal subset therefore depends on how many epochs are tried.
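As a minimal sketch of the forward selection described in the "Starting Point" item above (the evaluate function is a hypothetical stand-in for any subset-evaluation measurement):

    def forward_selection(n_features, evaluate, max_size):
        """Greedily add the feature whose inclusion scores best."""
        selected = []
        while len(selected) < max_size:
            remaining = [f for f in range(n_features) if f not in selected]
            best = max(remaining, key=lambda f: evaluate(selected + [f]))
            selected.append(best)
        return selected

    # toy evaluation: pretend features 2 and 0 carry the most information
    scores = {2: 0.5, 0: 0.3, 1: 0.1, 3: 0.05}
    evaluate = lambda subset: sum(scores[f] for f in subset)
    print(forward_selection(4, evaluate, max_size=2))  # -> [2, 0]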
3.3 Wrapper and Filter Approach
The evaluation methods in feature selection can generally be divided into two basic models: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20].

The filter model selects a feature subset as a pre-processing step, without considering the predictor performance. It is usually achieved by designing an evaluation function; functions that are frequently used include distance measures, information measures, dependency measures and consistency measures. The filter model does not involve any training of a learning algorithm and is thus much faster, which makes it suitable for large data sets.

In the wrapper model, a predetermined data mining algorithm is used to evaluate the feature subsets, and the candidate with the highest prediction performance is selected as the final subset. The wrapper model can usually select a feature subset with superior performance because it selects features better suited to the predetermined algorithm. However, because the algorithm has to be trained and tested for each subset candidate, the wrapper model tends to be computationally very expensive, especially for large feature sizes.
3.4 Feature Ranking Technique
3.4.1 Information gain
Information gain [21] measures the dependency between a feature X_i and the class label Y. It is a very popular technique in feature selection because it is easy to understand and compute. Information gain can also be regarded as a measure of the reduction in uncertainty about a feature X_i when the value of Y is known. Uncertainty is usually measured by Shannon's entropy.

Entropy. Entropy measures the amount of uncertainty that a feature X_i contains. It is given by

\[ H(X_i) = -\sum_{j=1}^{p} P(j) \log_2 P(j) \tag{54} \]

where p is the number of possible values of X_i and P(j) indicates the observation probability of the value j. From this formula, a more uniform distribution tends to produce a higher entropy. For example, if you toss a fair coin, there are two possible outcomes, each with probability 0.5. Its entropy value is

\[ H(\text{coin toss}) = -2 \times (0.5 \times \log_2 0.5) = 1 \]

In another example, if you toss a die, there are six possible outcomes, each with probability 1/6. Its entropy value is

\[ H(\text{die toss}) = -6 \times \left( \tfrac{1}{6} \times \log_2 \tfrac{1}{6} \right) = 2.585 \]

Therefore, the higher the entropy, the more uncertainty there is and the more difficult it is to predict the outcome.
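The two entropy values above are easy to verify numerically; the short Python snippet below (not part of the thesis) reproduces them:

    import numpy as np

    def entropy(probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs[probs > 0]          # convention: 0 * log2(0) = 0
        return float(-np.sum(probs * np.log2(probs)))

    print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
    print(entropy([1.0 / 6] * 6))    # fair die: 2.5849... bits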
Information Gain. The information gain between a feature X_i and the label Y is:

\[ I(X_i, Y) = H(X_i) - H(X_i \mid Y) \tag{55} \]

where H(X_i) and H(X_i | Y) are respectively the entropy of feature X_i and the entropy of X_i after the value of Y is known. H(X_i | Y) is calculated as

\[ H(X_i \mid Y) = -\sum_j P(Y_j) \sum_k P(x_k \mid Y_j) \log_2 P(x_k \mid Y_j) \tag{56} \]

A better understanding can be gained from Figure 6. As can be seen in the figure, H(X) and H(Y) respectively measure the entropy of X and Y, the information gain I(X, Y) is a measure of the information shared by X and Y, and H(X, Y) is the information that X and Y collectively contain:

\[ H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y) = H(X \mid Y) + I(X, Y) + H(Y \mid X) = H(X) + H(Y) - I(X, Y) \tag{57} \]

If X and Y are highly correlated, the information they share is very high, indicating a large value of I(X, Y). Then, if Y is known, much of the information about X can be inferred from Y, suggesting a low H(X | Y), and vice versa for H(Y | X).
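As a numeric illustration of Eqs. (55) and (56), the sketch below computes I(X, Y) = H(X) − H(X | Y) from an invented joint probability table p(x, y):

    import numpy as np

    def entropy(p):
        p = p[p > 0]                      # convention: 0 * log2(0) = 0
        return float(-np.sum(p * np.log2(p)))

    p_xy = np.array([[0.3, 0.1],          # rows: values of X
                     [0.1, 0.5]])         # columns: values of Y
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # H(X | Y) = - sum_j P(Y_j) sum_k P(x_k | Y_j) log2 P(x_k | Y_j), Eq. (56)
    h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))
    print(entropy(p_x) - h_x_given_y)     # information gain I(X, Y), Eq. (55)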
3.4.2 ReliefF

Relief [22] is a univariate feature weighting algorithm in the filter model. It is based on the principle that an attribute which better separates similar instances of different classes is more important and should be assigned a larger weight. The three basic steps to compute the feature weights are:

1. Find the nearest miss and nearest hit, where the nearest hit is the closest sample with the same class as the test sample and the nearest miss is the closest sample with a different label;
2. Calculate the weight of each feature;
3. Return a ranked list of feature weights, or the top k features according to a given threshold.

The algorithm starts by initializing all feature weights to zero. It then randomly selects an instance from the samples and finds its nearest hit NH and nearest miss NM. Each feature weight is updated based on its ability to discriminate between NH and NM. The detailed pseudocode is given in Algorithm 1, where x^i denotes the value of feature i in instance x.

Algorithm 1: Relief
Description: There are P instances described by N features and C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; T iterations are performed.
1. Set all feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N;
4.   Find the nearest hit NH(x) and nearest miss NM(x):
       NH(x) ← x_h, with x_h = argmin_j d(x, x_j) ∀ c(x_j) = c(x)
       NM(x) ← x_m, with x_m = argmin_k d(x, x_k) ∀ c(x_k) ≠ c(x)
5.   For i = 1 to N, do:
6.     w(i) = w(i) + d(x^i, NM(x)^i)/(P × T) − d(x^i, NH(x)^i)/(P × T)
7.   end do
8. end do
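A compact Python sketch of Algorithm 1 is given below. It is a reconstruction rather than the thesis code, and it uses the absolute per-feature difference for the feature-wise distance d(x^i, ·), a common choice for numeric features:

    import numpy as np

    def relief(X, y, T=100, seed=0):
        """Relief feature weights for a two-class data set (Algorithm 1)."""
        rng = np.random.default_rng(seed)
        P, N = X.shape
        w = np.zeros(N)
        for _ in range(T):
            i = rng.integers(P)
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                                     # skip the point itself
            hit = np.argmin(np.where(y == y[i], d, np.inf))   # nearest hit
            miss = np.argmin(np.where(y != y[i], d, np.inf))  # nearest miss
            # reward features separating the miss, punish those separating the hit
            w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (P * T)
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = (rng.random(100) < 0.5).astype(int)
    X[:, 0] += 2 * y             # only feature 0 carries class information
    print(relief(X, y))          # the weight of feature 0 dominates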
ReliefF [23] extends the original Relief algorithm to deal with the multi-class situation. It incorporates two important improvements. First, the result is more robust to noise because k nearest neighbours are considered instead of a single one. Second, it can deal with multi-class problems. The detailed pseudocode is shown in Algorithm 2.

Algorithm 2: ReliefF
Description: Instances are described by N features and there are C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; k nearest neighbours are considered; T iterations are performed; p(y) is the class probability, specifying the probability of an instance being from class y.
1. Set all feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N with label y_x;
4.   For y = 1 to C, do:
5.     Find the k nearest instances x(y, l) of x from class y, where l = 1, 2, ..., k;
6.     For i = 1 to N, do:
7.       For l = 1 to k, do:
8.         If y = y_x (nearest hit), then
9.           w(i) = w(i) − d(x^i, x(y, l)^i)/(T × k)
10.        else (nearest miss),
11.          w(i) = w(i) + [p(y)/(1 − p(y_x))] × d(x^i, x(y, l)^i)/(T × k)
12.        end if
13.      end for
14.    end for
15.  end for
16. end for

3.4.3 Fisher

Fisher [40] is an effective supervised feature selection algorithm which aims to select features that take similar values within the same class and different values across different classes. The evaluation score of Fisher's algorithm is:
\[ \mathrm{Fisher}(f_i) = \frac{\sum_{j=1}^{c} n_j (\mu_{i,j} - \mu_i)^2}{\sum_{j=1}^{c} n_j \sigma_{i,j}^2} \tag{58} \]

where f_i is the i-th feature to be evaluated, n_j is the number of instances in class j, µ_i is the mean of feature i, and µ_{i,j} and σ_{i,j}² are respectively the mean and the variance of feature i in class j. The Fisher algorithm is computationally efficient and widely applied in many applications; however, because it considers the features individually, it has no ability to deal with redundant features.
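Eq. (58) translates directly into a few lines of Python; the sketch below runs on synthetic data in which only one feature is informative:

    import numpy as np

    def fisher_score(X, y):
        """Per-feature Fisher score as in Eq. (58)."""
        mu = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            n_c = len(Xc)
            num += n_c * (Xc.mean(axis=0) - mu) ** 2
            den += n_c * Xc.var(axis=0)
        return num / den

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (rng.random(200) < 0.5).astype(int)
    X[:, 2] += 3 * y              # feature 2 separates the classes
    print(fisher_score(X, y))     # feature 2 scores highest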
4 GMLVQ Based Feature Selection Algorithms
4.1 Entropy Enforcement for Feature Ranking Results
As stated above, the element Λ_{i,j} of the matrix Λ measures the correlation between features i and j, and the diagonal element Λ_{i,i} quantifies the contribution of feature i to classification. This statement only makes sense when the features have similar magnitudes; therefore, a z-score transformation is always performed on the data before training starts. One example is shown in Figure 7, where 32 features are ranked with respect to the values of their diagonal elements. The 19th feature has the highest value, indicating that it contributes most to the classification. An additional constraint is imposed so that after each adaptation the diagonal elements sum to one:

\[ \sum_{i=1}^{N} \Lambda_{i,i} = 1 \tag{59} \]
Figure 7: One example of the diagonal elements of Λ.
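The ranking step itself is simple once a model has been trained; the sketch below (illustrative only, with a random matrix standing in for a trained Ω) z-scores the data and ranks the features by the normalized diagonal of Λ = Ω^T Ω:

    import numpy as np

    def zscore(X):
        """Bring all features to comparable scales before training."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def rank_features(omega, feature_names):
        lam = omega.T @ omega                        # Lambda = Omega^T Omega
        relevance = np.diag(lam) / np.trace(lam)     # normalized, sums to one
        order = np.argsort(relevance)[::-1]
        return [(feature_names[i], float(relevance[i])) for i in order]

    rng = np.random.default_rng(0)
    omega = rng.normal(size=(3, 3))                  # stand-in for a trained Omega
    print(rank_features(omega, ["f1", "f2", "f3"]))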
One of the ideal situations in feature ranking is that some features are much more important than others, so that the least important features can be removed from the feature set without deteriorating the classification performance. An external entropy force is added to the cost function to push the diagonal elements toward this ideal situation.

The definition of the entropy force is:

\[ \mathrm{Entropy}(\Lambda_{\mathrm{diag}}) = -\sum_{i=1}^{N} \Lambda_{i,i} \log_2 \Lambda_{i,i} \tag{60} \]

where N is the data dimension. This term reaches its maximum when all the diagonal elements are equal, i.e. when all features are equally important for classification. Its minimization, on the other hand, pushes toward a discriminative feature relevance profile; at the extreme, only one feature is identified as relevant for classification and the relevances of all other features are zero.

It is integrated into the cost function by:

\[ F_{\mathrm{new}} = F + \alpha \times \mathrm{Entropy}(\Lambda_{\mathrm{diag}}) \tag{61} \]

where the regularization parameter α controls the trade-off between the classification accuracy and the discrimination between features. A larger value of α produces a more discriminative feature ranking result at the cost of classification performance. Their mutual relation on one of the data sets is visualized in Figure 8.
The choice of the regularization value depends on how important the accuracy and the discrimination are for the user, and it differs per data set. A safe way is to plot their relation for each data set and choose the optimal point based on the requirements. A more efficient way, used in this thesis, is to choose the value as N²/10, where N is the data dimension. On all of the data sets considered in this thesis, such a value generates a considerably discriminative feature ranking result without deteriorating the performance to a large extent. An example with and without the entropy force can be seen in Figure 9.
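A minimal sketch of Eqs. (60) and (61) follows; the cost value F, the diagonal of Λ and the choice of α are illustrative stand-ins for quantities produced during training:

    import numpy as np

    def diag_entropy(lam_diag):
        p = lam_diag[lam_diag > 0]
        return float(-np.sum(p * np.log2(p)))        # Eq. (60)

    def regularized_cost(F, lam_diag, alpha):
        return F + alpha * diag_entropy(lam_diag)    # Eq. (61)

    lam_diag = np.array([0.25, 0.25, 0.25, 0.25])    # all features equally relevant
    print(diag_entropy(lam_diag))                    # maximal entropy: 2.0 bits
    N = len(lam_diag)
    alpha = N ** 2 / 10                              # the heuristic discussed above
    print(regularized_cost(F=1.0, lam_diag=lam_diag, alpha=alpha))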
4.2 Way-Point Average Algorithm
Gradient based minimization is a popular and powerful method in non-linear optimization [31]. In this thesis, batch gradient descent is employed to train the GMLVQ model. One of the critical choices in gradient descent methods is the appropriate step size: too small a step size slows the convergence, while overly large steps can result in oscillatory or even divergent behavior.

In this section, a modification of batch gradient descent [32] is introduced which aims at better convergence behavior. The idea is that, during the training procedure, we compare the cost of the normal descent update with that of the gliding average over the most recent steps; if the latter produces a lower cost value, the minimization jumps to the averaged configuration and decreases the step size at the same time. A more detailed description follows.
Suppose we want to minimize an objective function F with respect to an N-dimensional vector x ∈ R^N. A gradient descent process is started at x_0 and iteratively generates a sequence of steps:

\[ x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \tag{62} \]

Note that the gradient has been normalized by its length; therefore, a_t is exactly the step length of the update: |x_{t+1} − x_t| = a_t.

The waypoint averaging algorithm starts at x_0 with an initial step size a_0 and performs k steps with the gradient steps unchanged:

\[ x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a_t = a_0 \tag{63} \]

After that (t ≥ k), the procedure proceeds as follows:

1. Perform the normal gradient descent step and evaluate the corresponding cost function:

\[ x^*_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(x^*_{t+1}) \tag{64} \]

2. Perform the waypoint average over the previous j steps:

\[ \bar{x}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} x_{t-i} \quad \text{and calculate } F(\bar{x}_{t+1}) \tag{65} \]

3. Determine the new step size and position by comparison:

\[ x_{t+1} = x^*_{t+1},\ a_{t+1} = a_t \quad \text{if } F(x^*_{t+1}) \le F(\bar{x}_{t+1}); \qquad x_{t+1} = \bar{x}_{t+1},\ a_{t+1} = \lambda a_t \quad \text{otherwise} \tag{66} \]

with the parameter 0 < λ < 1.
As can be seen from the algorithm, as long as the normal gradient descent step produces a position with lower cost than the waypoint average, the iteration proceeds as normal gradient descent.

On the other hand, F(x̄_{t+1}) < F(x*_{t+1}) indicates the potential existence of oscillatory behavior: under oscillatory conditions the position fluctuates around the local minimum, and the average over the previous steps is expected to provide a closer estimate of the minimum than the normal gradient descent update. It also indicates that the step size may be too large to reach the minimum and should be decreased for better convergence.

An intuitive example is shown in Figure 10 [32], which visualizes the update steps of both normal gradient descent and the waypoint averaging algorithm. The dotted lines mark the update trajectory of normal gradient descent with a constant step size, which displays strong oscillatory behavior. The waypoint averaging algorithm shares the same trajectory with normal gradient descent for the first four steps; after that, it jumps to the average position over the previous steps and reduces the step size at the same time, which enables it to move closer to the minimum in the middle.
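The generic procedure of Eqs. (62)-(66) can be sketched in a few lines of Python. The ill-conditioned quadratic used here is an invented test function on which fixed-step descent oscillates, and the initial phase length k and averaging window j are merged into one parameter since both equal 3 in this thesis:

    import numpy as np

    def waypoint_descent(F, gradF, x0, a0=0.5, lam=2/3, j=3, steps=100):
        x, a = np.asarray(x0, dtype=float), a0
        history = [x.copy()]
        for t in range(steps):
            g = gradF(x)
            x_star = x - a * g / np.linalg.norm(g)       # normal step, Eq. (64)
            if t + 1 >= j:
                x_bar = np.mean(history[-j:], axis=0)    # waypoint average, Eq. (65)
                if F(x_star) <= F(x_bar):
                    x = x_star                           # keep the normal step
                else:
                    x, a = x_bar, lam * a                # jump back, shrink the step
            else:
                x = x_star                               # initial unchanged steps
            history.append(x.copy())
        return x

    F = lambda x: x[0] ** 2 + 25 * x[1] ** 2
    gradF = lambda x: np.array([2 * x[0], 50 * x[1]])
    print(waypoint_descent(F, gradF, x0=[3.0, 1.0]))     # approaches (0, 0)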
When applying this to GMLVQ, the cost function has to be optimized with respect to both the prototypes w and the matrix Ω, so two independent waypoint averaging schemes, one for w and one for Ω, have to be performed. The typical scheme is formulated as follows. Given a GMLVQ system represented by Ω and a set of prototypes {w_k}_{k=1}^M with cost function F, choose the starting points Ω_0 and {w_k^0}_{k=1}^M and the initial step sizes a_0^Ω and a_0^w for Ω and w.

1. Perform k (k = 3 in this thesis) steps with the gradient steps unchanged:

\[ \Omega_{t+1} = \Omega_t - a^{\Omega}_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a^{\Omega}_t = a^{\Omega}_0 \tag{67} \]

\[ w_{t+1} = w_t - a^{w}_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a^{w}_t = a^{w}_0 \tag{68} \]

After that (t ≥ k), the procedure proceeds as follows:

2. Perform the normal gradient descent step and evaluate the corresponding cost function for both Ω and w:

\[ \Omega^*_{t+1} = \Omega_t - a^{\Omega}_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(\Omega^*_{t+1}) \tag{69} \]

\[ w^*_{t+1} = w_t - a^{w}_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(w^*_{t+1}) \tag{70} \]

3. Perform the waypoint average over the previous j (j = 3 in this thesis) steps:

\[ \bar{\Omega}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} \Omega_{t-i} \quad \text{and calculate } F(\bar{\Omega}_{t+1}) \tag{71} \]

\[ \bar{w}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} w_{t-i} \quad \text{and calculate } F(\bar{w}_{t+1}) \tag{72} \]

4. Determine the new step sizes and positions for both Ω and w:

\[ \Omega_{t+1} = \Omega^*_{t+1},\ a^{\Omega}_{t+1} = a^{\Omega}_t \quad \text{if } F(\Omega^*_{t+1}) \le F(\bar{\Omega}_{t+1}); \qquad \Omega_{t+1} = \bar{\Omega}_{t+1},\ a^{\Omega}_{t+1} = \lambda a^{\Omega}_t \quad \text{otherwise} \tag{73} \]

\[ w_{t+1} = w^*_{t+1},\ a^{w}_{t+1} = a^{w}_t \quad \text{if } F(w^*_{t+1}) \le F(\bar{w}_{t+1}); \qquad w_{t+1} = \bar{w}_{t+1},\ a^{w}_{t+1} = \lambda a^{w}_t \quad \text{otherwise} \tag{74} \]

with the parameter λ = 2/3.
.4.3 Feature Ranking Ambiguity Removal
Up to this point, we have the input vectors and class labels

\[ \{x_i, y_i\}_{i=1}^{n} \quad \text{with } x_i \in \mathbb{R}^N,\ y_i \in \{1, 2, \dots, C\} \tag{75} \]

associated with a set of prototypes

\[ \{w_k\}_{k=1}^{M} \quad \text{where } M \ge C \tag{76} \]

and the distance is calculated as

\[ d(x_i, w_k) = (x_i - w_k)^T \Lambda (x_i - w_k) = (x_i - w_k)^T \Omega^T \Omega (x_i - w_k) = |\Omega (x_i - w_k)|^2 \tag{77} \]

where Λ, Ω ∈ R^{N×N} and Ω = [z_1, z_2, ..., z_N]^T, with {z_i}_{i=1}^N column vectors of dimension N. The feature ranking results can be obtained from the values of the diagonal elements of the matrix Λ. However, the question arises whether there is another matrix Λ which keeps the distance measurement unchanged. If such a matrix exists, the feature ranking results can differ without modifying the classifier, which means the feature ranking results obtained in the previous steps are not unique.
Consider a vector v_j which satisfies the following constraints:

\[ \forall\, i : v_j^T x_i = 0 \tag{78} \]
\[ \forall\, k : v_j^T w_k = 0 \tag{79} \]

If we add such a vector v_j to any row z_i^T of the matrix Ω, consider, for instance, i = 1:

\[ \Omega_{\mathrm{new}} = [z_1 + v_j, z_2, \dots, z_N]^T \tag{80} \]

we can easily verify that the following mappings remain unchanged:

\[ \forall\, i : \Omega_{\mathrm{new}} x_i = \Omega x_i \tag{81} \]
\[ \forall\, k : \Omega_{\mathrm{new}} w_k = \Omega w_k \tag{82} \]

Therefore, the distances between any pair of input samples and prototypes stay the same:

\[ d(x_i, w_k) = |\Omega (x_i - w_k)|^2 = |\Omega_{\mathrm{new}} (x_i - w_k)|^2 \quad \text{for all } i, k \tag{83} \]

Since the mapping and the distance calculation are the same for Ω and Ω_new, the cost functions, classification errors and classifiers they produce also stay the same. However, the feature ranking results may vary between Ω and Ω_new, because there is no constraint enforcing consistency of the diagonal elements of Λ and Λ_new.
Without loss of generality, we assume that there are J such spurious vectors {v_j}_{j=1}^J, and since, as will be proved at a later stage, all such vectors are eigenvectors of a constructed matrix, we can additionally assume that the vectors {v_j}_{j=1}^J are orthonormal:

\[ v_j \cdot v_k = \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \tag{84} \]

The proposed solution is to project all the spurious directions {v_j}_{j=1}^J out of a given matrix Ω:

\[ \Omega_{\mathrm{new}}^T = \Big[ I - \sum_{j=1}^{J} v_j v_j^T \Big] \Omega^T \tag{85} \]
It follows that:

\[ \Omega_{\mathrm{new}} (x_i - w_k) = \Omega (x_i - w_k) - \sum_{j=1}^{J} \Omega\, v_j \underbrace{v_j^T (x_i - w_k)}_{=\,0} = \Omega (x_i - w_k) \tag{86} \]

\[ \Omega_{\mathrm{new}}\, v_k = \Omega\, v_k - \sum_{j=1}^{J} \Omega\, v_j \underbrace{v_j^T v_k}_{\delta_{jk}} = \Omega\, v_k - \Omega\, v_k = 0 \tag{87} \]

Hence, we can interpret the resulting matrix Ω_new as the minimal representation of the mapping, containing no contribution of the spurious directions v_j.
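The projection of Eq. (85) and the checks of Eqs. (86) and (87) can be verified numerically. In the sketch below the spurious direction is constructed by hand (the fourth coordinate is simply never used by the data), an artificial stand-in for the eigenvector computation described next:

    import numpy as np

    def remove_spurious(omega, V):
        """Project the orthonormal columns of V out of Omega, Eq. (85)."""
        proj = np.eye(omega.shape[1]) - V @ V.T    # I - sum_j v_j v_j^T
        return omega @ proj                        # equals (proj @ omega.T).T

    rng = np.random.default_rng(0)
    omega = rng.normal(size=(4, 4))
    v = np.zeros((4, 1)); v[3, 0] = 1.0            # direction unseen in the data
    omega_new = remove_spurious(omega, v)

    x, w = rng.normal(size=4), rng.normal(size=4)
    x[3] = w[3] = 0.0                              # v is orthogonal to x and w
    d_old = np.sum((omega @ (x - w)) ** 2)
    d_new = np.sum((omega_new @ (x - w)) ** 2)
    print(np.isclose(d_old, d_new))                # distances unchanged, Eq. (86)
    print(np.allclose(omega_new @ v, 0.0))         # spurious direction removed, Eq. (87)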
The next question is how to find all these vectors {v_j}_{j=1}^J. The conditions that