Supervised Feature Selection Based on Generalized Matrix Learning
Vector Quantization
Zetao Chen
S2061244
September 2012
Master Project
Artificial Intelligence
University of Groningen, the Netherlands
Internal Supervisor:
Dr. Marco Wiering (Artificial Intelligence, University of Groningen)
External Supervisor:
Prof. Michael Biehl (Computer Science, University of Groningen)
Contents

1 Introduction and Background
  1.1 Motivation
  1.2 Research Questions
  1.3 Thesis Outline
2 Machine Learning
  2.1 Basic Concept of Machine Learning
    2.1.1 Definition of learning
    2.1.2 Definition of machine learning
    2.1.3 Data representation
  2.2 Classification
    2.2.1 Unsupervised and supervised learning
  2.3 Learning Algorithms
    2.3.1 SVM with RBF kernel
    2.3.2 LVQ
    2.3.3 Two variants of LVQ: GRLVQ and GMLVQ
3 Feature Selection
  3.1 Challenge
    3.1.1 Curse of dimensionality
    3.1.2 Irrelevance and redundancy
  3.2 General Framework
  3.3 Wrapper and Filter Approach
  3.4 Feature Ranking Technique
    3.4.1 Information gain
    3.4.2 ReliefF
    3.4.3 Fisher
4 GMLVQ Based Feature Selection Algorithms
  4.1 Entropy Enforcement for Feature Ranking Results
  4.2 Way-Point Average Algorithm
  4.3 Feature Ranking Ambiguity Removal
5 Experiments and Results
  5.1 Data Set Description
  5.2 Experiment Design
  5.3 Results and Discussion
    5.3.1 Case Study 1: Adrenal Tumor
    5.3.2 Case Study 2: Ionosphere
    5.3.3 Case Study 3: Connectionist Bench Sonar
    5.3.4 Case Study 4: Breast Cancer
    5.3.5 Case Study 5: SPECTF Heart
  5.4 Discussion and Summary
6 Conclusion and Future Work
Abstract

Data mining involves the use of data analysis tools to discover and extract information from a data set and transform it into an understandable expression. One of its central problems is to identify a representative subset of features from which a learning model can be constructed. Feature selection is an important pre-processing step before data mining; it aims to select a representative subset of features with high predictive information and to eliminate irrelevant features with little importance for classification. By reducing the dimensionality of the data, feature selection helps to decrease the training time, and by selecting the most relevant features and removing irrelevant and noisy data, the classification performance may be improved. Moreover, with a smaller feature subset, the learned model may be more intuitive and easier to interpret.

This thesis investigates the extension of the Generalized Matrix LVQ (GMLVQ) model to feature selection. GMLVQ employs a full matrix as the distance metric in training. The diagonal and off-diagonal elements of this matrix respectively measure the contribution of each feature and each feature pair to classification; therefore, their distribution can provide a quantitative measurement of feature weight. Further steps and analysis are performed to enforce a more discriminative feature selection result and to remove the weighting ambiguity. Moreover, compared to other methods, which perform feature ranking first and learn a model after selecting the feature subset, GMLVQ based methods can combine the processes of feature ranking and classification, which helps to decrease the computation time.

Experiments in this thesis were performed on data sets collected from the UCI Machine Learning Repository [29]. The GMLVQ based feature weighting algorithm is compared with other state-of-the-art methods: Information Gain, Fisher and ReliefF. All four feature ranking methods are evaluated using both GMLVQ and the RBF based Support Vector Machine (RBF-SVM) by increasing the size of the selected feature subset with a fixed step size. The results indicate that the performance of the GMLVQ based feature selection method is comparable to the other methods, and on some of the data sets it consistently outperforms them.
1 Introduction and Background
1.1 Motivation
For a machine learning algorithm to be successful on a given task, the representation and quality of the data are the first and most important factors. With the advance of database technology, data has become easier to access and more features can be gathered for a specific task. However, more features do not necessarily result in more discriminative classifiers. Instead, when there are too many redundant or irrelevant features, the computation can become much more expensive and the classifier may have poor generalization performance due to the interference of noise; therefore, proper data preprocessing is essential for the successful training of machine learning algorithms.

Feature selection is one of the most important and frequently used preprocessing techniques [5]. It aims to identify and select the most discriminative subset of the original features while eliminating irrelevant, redundant and noisy data. Some studies have shown that irrelevant features can be removed without significant performance degradation [6]. The application of feature selection has several benefits:

1. It reduces the data dimensionality, which helps the learning algorithms to work faster and more effectively;
2. In some cases, the classification accuracy can be improved by using a subset of all features;
3. The selected feature subset is usually more compact and can be interpreted more easily.
To perform feature selection, the training data can be with or without label information, corresponding to supervised or unsupervised feature selection. In unsupervised tasks [1, 2], without considering the label information, feature relevance can be evaluated by measuring some intrinsic properties of the data, such as the separability or covariance. In practice, unlabeled data is easier to obtain than labelled data, which indicates the significance of unsupervised algorithms. However, these methods ignore label information, which may lead to performance deterioration when the label information is available. Supervised feature selection is proposed to take the label information into account. It can generally be divided into two major frameworks: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20]. The filter model performs the feature selection as a pre-processing step, independent of the choice of the classifier. The wrapper model, on the other hand, evaluates subsets of features according to their usefulness to a given predictor.

Feature selection techniques can be further categorized into feature ranking and feature subset selection. Feature ranking methods assign a weight to each feature and select the subset of features by choosing a threshold and eliminating all features which do not achieve that score. Feature subset selection searches for the optimal subset which collectively has the best performance with respect to some predictor. In this thesis, a new method for feature ranking is investigated and compared with other state-of-the-art ones.
Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods; it was first introduced by Kohonen [3]. After that, several advanced cost functions were proposed to improve the performance, one example being Generalized LVQ [4], which is based only on the Euclidean distance. To model the different contributions of features to classification, Generalized Relevance LVQ was proposed [4, 7], which extends the Euclidean distance with scaling or relevance factors for all features. The recently introduced Generalized Matrix LVQ (GMLVQ) [33] extends the distance measurement further to account for the pairwise contributions of features. The distance matrix in GMLVQ contains information which may be useful for feature selection. For example, the diagonal element Λ_ii of the dissimilarity matrix can be regarded as a measurement of the overall relevance of feature i for classification, and the off-diagonal element Λ_ij can be interpreted as the contribution of the feature pair i and j. A high absolute value indicates the existence of a highly relevant relationship, while an absolute value closer to zero may suggest that it is not that important for the classification.

The above discussion illustrates the potential application of GMLVQ to feature ranking, which has not yet been fully investigated. Early studies include applying GMLVQ to select the best feature in the classification of lung disease [39] and to select the most discriminative marker in the diagnosis of adrenal tumors [41]. In this thesis, a further investigation is conducted and experiments on more data sets are carried out.
1.2 Research Questions
This thesis will attempt to answer the following questions:

1. Can the GMLVQ method be extended to perform feature ranking?
2. How well does the feature ranking perform? In this thesis, the GMLVQ based feature ranking technique will be compared with three other state-of-the-art feature ranking methods. All four methods will be evaluated with GMLVQ and RBF-SVM in terms of their AUC metric.
3. Can GMLVQ combine feature ranking and classification into one single process, and how well does the classification perform compared to other methods in which feature ranking and classification are performed in two separate steps?
1.3 Thesis Outline
This thesis has six chapters and is organized as follows. Chapter 2 presents the basic concepts in machine learning and the details of the learning algorithms (the Support Vector Machine and LVQ variants) that are used to evaluate the performance of the various feature ranking algorithms at a later stage. Chapter 3 discusses the idea of feature selection, its general framework and three state-of-the-art feature ranking techniques which will be compared with the GMLVQ based ranking method. Chapter 4 gives a description of the GMLVQ based feature ranking method; in this chapter, details are given on how to extract a feature ranking from GMLVQ, the waypoint averaging algorithm and how to obtain a unique feature ranking result. Chapter 5 elaborates on the experiments conducted to compare the four feature ranking techniques discussed above, and is followed by Chapter 6, which states the conclusion and future work of this thesis.
2 Machine Learning
In this chapter, we first give a brief introduction to machine learning and some of its basic concepts: data representation, classification and learning algorithms. We then present some specific learning algorithms, namely the RBF-based SVM, basic LVQ and two of its variants. The learning algorithms introduced in this chapter will later be used to evaluate the feature selection algorithms introduced in Chapter 3.
2.1 Basic Concept of Machine Learning
2.1.1 Definition of learning

What is learning? Learning generally refers to the mutual interaction between the environment and a person through which one gains or modifies knowledge or skills. A more formal definition was given by Runyon in 1977 [36]: "Learning is a process in which behavior capabilities are changed as the result of experience, provided the change cannot be accounted for by native response tendencies, maturation, or temporary states of the organism due to fatigue, drugs, or other temporary factors."

One example of learning is the association between events. For instance, if a person tastes an apple for the first time and finds it delicious, he will assume that the next apple he encounters will also be delicious, although he has not eaten it and that apple is different from the one he ate. The important discovery here is the association with the fact that the apple is tasty. This association is the knowledge someone gains from the experience of eating an apple.
2.1.2 Definition of machine learning

Learning for computers falls into the field of machine learning. A widely accepted definition is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [37]. The experience here usually refers to the data which demonstrate the relationship between observed variables.

There are many example applications of machine learning. One of the largest groups lies in the categorization of objects into a set of pre-specified classes or labels. Some practical examples are:

1. Optical Character Recognition: classify images of handwritten characters into the specific letters;
2. Face Recognition: categorize facial images by the person they belong to;
3. Medical Diagnosis: determine whether or not a patient suffers from some disease;
4. Stock Prediction: predict whether a stock will go up or down.
2.1.3 Data representation
In the field of machine learning, data is represented by a table where each row corresponds to one sample or instance and each column describes one attribute or feature. In the case of supervised learning, there is an additional column containing the label information for each instance. One example is shown in Figure 1. There are 14 instances in this example and each instance consists of data with four features, "Outlook", "Temperature", "Humidity" and "Wind", and the label information specifying whether or not to play.

The mathematical expressions for the data and labels are presented here to serve as the notation in this thesis. Let {x_i, y_i} denote the i-th instance, where x_i ∈ R^N denotes the data in the N-dimensional space and y_i is the corresponding label information with C different possible values. In brief, the combination of data and label is expressed as:

\[ \{x_i, y_i\} \in \mathbb{R}^N \times C \tag{1} \]

2.2 Classification
As discussed in the previous section, a major task in machine learning is to learn how to classify objects into one of a pre-defined set of labels. In such a task, objects are described by a set of features which characterize the objects in a class. For example, to identify whether a fruit is a banana, people have to check its color, size and shape and infer its label from this information.

2.2.1 Unsupervised and supervised learning

The classification task discussed above is generally referred to as supervised learning: the labels of the training data are provided and the learning algorithm tries to generalize from the training instances so that novel objects can be classified into the correct categories. In contrast, unsupervised learning refers to learning in which the labels of the training data are unknown. Its goal is to group the training data into different clusters by evaluating some intrinsic properties of the data, such as the separability or covariance; therefore, the quality of the data provided for training is crucial. If irrelevant or noisy data are provided, misclassifications will happen on novel data.
2.3 Learning Algorithms
In this section, two supervised learning algorithms will be described: the SVM algorithm with RBF kernel and the LVQ algorithm with its two variants, GRLVQ and GMLVQ. GMLVQ and RBF-SVM will later be used to evaluate the performance of the four feature ranking methods.
2.3.1 SVM with RBF kernel
The Support Vector Machine (SVM) was originally proposed by Vapnik for classification and regression [25, 24, 26, 27] and was later also extended to other applications [28]. It has attracted considerable attention in recent years due to its superior performance and well-developed theoretical foundation. For this reason, it also serves as an evaluation method for the feature selection results in this thesis.

The SVM is a method to find an optimal hyperplane that separates the training data of two or more classes while maximizing its margin. The linear Support Vector Machine, as the simplest and most basic case, will be introduced first. We will then show how it can classify non-linearly separable data in a feature space of higher dimension.
Linear SVM and separating hyperplane maximization. The linear SVM is a supervised learning method which is built upon a group of labelled samples and performs binary classification in the feature space. Let us denote the data and labels as (x_i, y_i), where x_i ∈ R^N is an N-dimensional feature vector and y_i is the label of sample x_i. In a two-class problem, y_i ∈ {+1, −1}. The classification process of a supervised learning algorithm can then be regarded as a mapping f(x_i): R^N → R which maps the feature vector from the N-dimensional space to the class membership of the vector and divides the space into two halves. Without loss of generality, it is assumed that f(x_i) > 0 and y_i = 1 indicate that the feature vector belongs to class 1, while f(x_i) < 0 and y_i = −1 specify class 2. A formal definition of linearly separable data can then be given: a data set is linearly separable if the following equations hold:

\[ \forall\, y_i = 1 : f(x_i) > 0 \tag{2} \]
\[ \forall\, y_i = -1 : f(x_i) \le 0 \tag{3} \]

An illustrative example is shown in Figure 2. As can be seen from the figure, all the points with y_i = 1 are classified to the positive side of the hyperplane and the others, with y_i = −1, are on the opposite side. The discriminant function in Figure 2 is a linear model and can be expressed as:

\[ f(x) = w^T x + b \tag{4} \]

where w indicates the weight vector and b is the bias. The hyperplane which divides the space into two half-spaces is expressed as:

\[ f(x) = w^T x + b = 0 \]
The discriminant function f(x) can also be used to measure the distance of a data point to the hyperplane. Consider the point x_d and its normal projection x_0 on the hyperplane in Figure 2. The coordinates of the point x_d can then be expressed as:

\[ x_d = x_0 + d\, \frac{w}{\|w\|} \tag{5} \]

where d describes the algebraic distance between the points x_d and x_0. Because x_0 is on the hyperplane, f(x_0) = 0. We have:

\[ f(x_d) = f\Big(x_0 + d\, \frac{w}{\|w\|}\Big) = w^T \Big(x_0 + d\, \frac{w}{\|w\|}\Big) + b \tag{6} \]
\[ = f(x_0) + d\, \frac{w^T w}{\|w\|} = d\, \|w\| \tag{7} \]

It follows that d = f(x_d)/‖w‖, and to enforce that d is always positive under correct classification, we define:

\[ d_i = \frac{y_i f(x_i)}{\|w\|} \tag{8} \]
The margin p can then be defined as the distance between the hyperplane and the closest data points on both sides:

\[ p = \min_{i=1,2,\dots,n} \frac{y_i f(x_i)}{\|w\|} \tag{9} \]

where n is the number of examples in the training data set. The linear SVM is trained to find an optimal hyperplane that maximizes the margin p. As shown in the formula above, this can be achieved either by maximizing the value of y_i f(x_i) for the closest points or by minimizing ‖w‖. Since w^T x + b can be scaled without changing its sign, it is reasonable to impose the constraint:

\[ y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n \tag{10} \]

Therefore, the optimization problem can be formulated as [25]: given a set of training samples {x_i, y_i}_{i=1}^n, find the optimal parameters w and b which satisfy the constraint

\[ y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, n \tag{12} \]

and minimize the following function:

\[ L = \tfrac{1}{2}\, w^T w \tag{14} \]

This is called the primal problem and can be solved by constructing the Lagrange function [30]:

\[ J(w, b, a) = \tfrac{1}{2}\, w^T w - \sum_{i=1}^{n} a_i \left[ y_i (w^T x_i + b) - 1 \right] \tag{15} \]
The a_i here are called the Lagrange multipliers, and the solution of this optimization problem should be minimized with respect to w and b and maximized with respect to the a_i. As a result, it follows that

\[ \frac{\partial J(w, b, a)}{\partial w} = w - \sum_{i=1}^{n} a_i y_i x_i = 0 \tag{16} \]

and

\[ \frac{\partial J(w, b, a)}{\partial b} = \sum_{i=1}^{n} a_i y_i = 0 \tag{17} \]

which gives rise to

\[ w = \sum_{i=1}^{n} a_i y_i x_i \tag{18} \]

and

\[ \sum_{i=1}^{n} a_i y_i = 0 \tag{19} \]

Then, by substituting the above two equations into equation (15), the objective becomes:

\[ Q(a) = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j x_i^T x_j \tag{20} \]

The corresponding problem is called the dual problem and is formulated as follows: given training samples {x_i, y_i}_{i=1}^n, find the optimal Lagrange multipliers {a_i}_{i=1}^n which maximize the objective function above and satisfy the following constraints:

1. \( \sum_{i=1}^{n} a_i y_i = 0 \);
2. \( a_i \ge 0 \) for i = 1, 2, ..., n.
After the Lagrange multipliers are determined, the weight vector can easily be determined by

\[ w = \sum_{i=1}^{n} a_i y_i x_i \tag{21} \]

and the bias b can be determined by arbitrarily choosing a support vector {x_i, y_i} and solving

\[ y_i (w^T x_i + b) = 1 \tag{22} \]

which gives

\[ \forall\, y_i = 1 : b = 1 - w^T x_i \tag{23} \]

or

\[ \forall\, y_i = -1 : b = -1 - w^T x_i \tag{24} \]

It is also important to state the Karush-Kuhn-Tucker theorem [25, 30], which gives the following constraint on the saddle point of the Lagrangian:

\[ a_{i0} \left[ y_i (w_0^T x_i + b_0) - 1 \right] = 0 \quad \text{for } i = 1, 2, \dots, n \tag{26} \]

It states that a_{i0} ≠ 0 only for the points which satisfy y_i(w_0^T x_i + b_0) = 1; these points are called the support vectors.

To sum up, we have:

\[ f(x) = \sum_{i=1}^{m} a_{i0} y_i x_i^T x + b_0 \tag{27} \]

where {x_i}_{i=1}^m are the support vectors and {a_{i0}}_{i=1}^m the corresponding Lagrange multipliers.

Non-linearly separable data and soft margin. In practical applications,
many data sets are non-linearly separable, which makes the algorithm of the previous section infeasible. One example is shown in Figure 3. As can be seen from the figure, although most of the points are classified to the correct side, there are still some points which violate the hyperplane: they either cross the boundary of the margin but are still located in the correct half-space, or have been misclassified into the incorrect half-space. In such cases, it is impossible to find a hyperplane which completely removes the errors; instead, a solution can be proposed that minimizes the errors on the training data.
Slack variables are introduced to solve this problem. For a data set with n samples, there are n slack variables {ε_i}_{i=1}^n which satisfy:

\[ \forall\, y_i = 1 : w^T x_i + b \ge 1 - \varepsilon_i \tag{28} \]
\[ \forall\, y_i = -1 : w^T x_i + b \le -1 + \varepsilon_i \tag{29} \]

The slack variable ε_i is a measure of the violation of the margin. If 0 < ε_i < 1, the sample violates the margin but is still correctly classified. When ε_i > 1, the sample is classified into the wrong half-space. Since the goal is to have fewer training samples misclassified, a penalty term can be added:

\[ \eta(\varepsilon) = \sum_{i=1}^{n} \varepsilon_i \tag{30} \]

which should be minimized. It can be incorporated into the cost function of the previous section as:

\[ f = \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{n} \varepsilon_i \tag{31} \]

The parameter C controls the trade-off between the rigidity of the margin enforcement and the number of errors tolerated during training. A larger value of C produces a model that is more accurate on the training data while at the same time increasing the risk of overfitting; therefore, the value of C has to be optimized by the user during the experiment.

The corresponding Lagrange function for this problem is:

\[ J(w, b, a, \mu, \varepsilon) = \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \mu_i \varepsilon_i - \sum_{i=1}^{n} a_i \left[ y_i (w^T x_i + b) - 1 + \varepsilon_i \right] \tag{32} \]

where the µ_i are the Lagrange multipliers for the slack variables.

Kernel trick. Consider the typical XOR problem, which tries to separate four examples at the four corners of a rectangle such that the two examples connected by a diagonal belong to the same class. This is impossible in a two-dimensional space, but when the data is projected into a three-dimensional space it becomes much easier. This example indicates that a non-linearly separable data set may become linearly separable in a higher dimensional space: this kind of mapping increases the separability of the data set.
Let the function θ define the non-linear mapping:

\[ \theta : \mathbb{R}^N \to H \tag{33} \]

so that the discriminant function becomes:

\[ f(x) = \sum_{i=1}^{n} a_i y_i\, \theta(x_i)^T \theta(x) + b \tag{34} \]

The kernel function is defined here by:

\[ K(x, y) = \theta(x)^T \theta(y) \tag{35} \]

and the discriminant function turns into:

\[ f(x) = \sum_{i=1}^{n} a_i y_i K(x_i, x) + b \tag{36} \]
This expression avoids providing the exact representation in the higher dimensional space. Numerous kernels have been proposed to solve various kinds of problems. One of the most popular is the RBF kernel, which is used in this thesis. The RBF kernel can be expressed as:

\[ K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}} \tag{37} \]

where σ indicates the kernel width. A larger σ gives a smoother function, which avoids overfitting and reproducing the noise in the training data; a smaller σ, on the other hand, implies a more flexible function that can produce highly irregular decision boundaries. Hence, it is important to determine the optimal value for σ by means of cross-validation.
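The thesis itself does not list code, but the tuning procedure can be sketched as follows with scikit-learn (an illustrative assumption, not the software used in this work). Note that scikit-learn parameterizes the RBF kernel by gamma, which corresponds to 1/(2σ²), so a large σ means a small gamma; the data, grid values and seed are made up:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # toy data with 5 features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # labels that are not linearly separable

    param_grid = {
        "C": [0.1, 1, 10, 100],   # trade-off between margin rigidity and errors
        "gamma": [0.01, 0.1, 1],  # kernel width: gamma = 1 / (2 * sigma**2)
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))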
2.3.2 LVQ
Learning Vector Quantization (LVQ) is one of the most famous prototype-based supervised learning methods; it was first introduced by Kohonen [3]. It has several advantages over other methods. Firstly, the method can be implemented easily, and the complexity of the classifier can be controlled and determined by the user. Secondly, multi-class problems can be tackled naturally without modifying the learning algorithm or decision rule. Lastly, the resulting classifier is intuitive and easy to interpret, due to the assignment of class prototypes and the intuitive classification of new data points to the closest prototype. The resulting prototypes can then provide class-specific attributes of the data. This is a big advantage over methods such as SVMs or neural networks, which suffer from the drawback of acting like a black box; because of this, LVQ has been applied in many fields, such as bioinformatics, satellite remote sensing and image analysis [34, 35, 39].

The training data for LVQ can be denoted as:

\[ \{x_i, y_i\}_{i=1}^{n} \in \mathbb{R}^N \times \{1, 2, \dots, C\} \tag{38} \]

where x_i denotes the data in the N-dimensional space and y_i is the label among C different classes.
LVQ is parameterized by a set of prototypes representing the classes in feature space and a distance measurement, which may be the traditional Euclidean distance or a full matrix trained from the data. One example can be seen in Figure 4, where 4 different prototypes represent 3 different classes.

Traditional LVQ employs the Euclidean distance measurement and is based on nearest prototype classification. To be more specific, a set of prototypes is defined to represent the different classes. If one prototype per class is defined, the prototypes can be represented as W = {w_j, c(w_j)} ∈ R^N × {1, 2, ..., C}. Each unseen example x_new is assigned the label of the prototype closest to it with respect to the distance measurement:

\[ c(x_{\mathrm{new}}) \leftarrow c(w_k) \quad \text{with} \quad w_k = \arg\min_j d(w_j, x_{\mathrm{new}}) \tag{39} \]

This is called a winner-takes-all strategy.
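To make the winner-takes-all rule of Eq. (39) concrete, the following minimal Python sketch (with the Euclidean distance and invented prototypes; it is not the implementation used in the thesis) assigns a new point the label of its nearest prototype:

    import numpy as np

    def lvq_classify(x_new, prototypes, proto_labels):
        """Return the label of the prototype closest to x_new (Eq. 39)."""
        dists = np.linalg.norm(prototypes - x_new, axis=1)  # d(w_j, x_new)
        return proto_labels[np.argmin(dists)]               # winner takes all

    # toy setup: 4 prototypes representing 3 classes, as in Figure 4
    prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [0.0, 3.0]])
    proto_labels = np.array([1, 1, 2, 3])
    print(lvq_classify(np.array([0.9, 1.2]), prototypes, proto_labels))  # -> 1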
Training of this model is guided by the minimization of the cost function:

\[ F = \sum_{i=1}^{n} \phi(\varepsilon_i) \quad \text{with} \quad \varepsilon_i = \frac{d(x_i, w_H) - d(x_i, w_M)}{d(x_i, w_H) + d(x_i, w_M)} \tag{40} \]

where φ is any monotonic function (in this thesis, φ(x) = x), and w_H and w_M are respectively the closest prototype with the same and with a different label as sample x_i:

\[ w_H = \arg\min_j d(x_i, w_j) \quad \forall\, c(w_j) = c(x_i) \tag{41} \]
\[ w_M = \arg\min_j d(x_i, w_j) \quad \forall\, c(w_j) \neq c(x_i) \tag{42} \]

In traditional LVQ systems, only the locations of the prototypes are updated during training to minimize the errors: w_H is pushed toward the sample x_i and w_M is pushed away from it. The updates derived from the cost function F are expressed as:

\[ \Delta w_H = -\alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{w_H} d(x_i, w_H) \tag{43} \]
\[ \Delta w_M = \alpha \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{w_M} d(x_i, w_M) \tag{44} \]

where α is the learning rate; φ'(ε_i) = 1 because φ(x) = x; ε'_{i,H} = 2 d(x_i, w_M)/[d(x_i, w_H) + d(x_i, w_M)]² and ε'_{i,M} = 2 d(x_i, w_H)/[d(x_i, w_H) + d(x_i, w_M)]²; and ∇_{w_H} d(x_i, w_H) and ∇_{w_M} d(x_i, w_M) are the derivatives of the distance with respect to w_H and w_M, which therefore depend on the distance measurement.
2.3.3 Two variants of LVQ: GRLVQ and GMLVQ

How the distance is calculated is very important in an LVQ system. One of the most popular metrics is the Euclidean distance, a special case of the Minkowski distance. The Euclidean distance from a data point x_i to a prototype w can be expressed as:

\[ d(w, x_i) = \sqrt{\sum_{j=1}^{N} (x_i^j - w^j)^2} \tag{45} \]

The Euclidean distance assigns the same weight to each feature, indicating that each feature contributes equally to the classification. In practical applications, however, it is usually observed that different features contribute differently. Therefore, relevance learning [7, 4] was proposed to assign adaptive weight values to the different feature inputs:

\[ d(w, x_i) = \sqrt{\sum_{j=1}^{N} \lambda_j (x_i^j - w^j)^2} \tag{46} \]

The corresponding LVQ system is called GRLVQ [7, 4].

Each feature, besides its individual contribution to the classification, also correlates with the others to influence the performance. Generalized Matrix LVQ (GMLVQ) [38] was proposed to extend the previous methods: a full matrix of adaptive relevances is employed as the similarity metric and the distance is calculated as:

\[ d(w, x_i) = (x_i - w)^T \Lambda (x_i - w) \tag{47} \]

where Λ is a full N × N matrix whose off-diagonal elements Λ_{i,j} account for the contribution of the feature pair i and j to the classification. The matrix Λ has to be positive definite to keep the distance result positive. Its positive definiteness is achieved by constructing:

\[ \Lambda = \Omega^T \Omega \tag{48} \]

where Ω is an arbitrary real M × N matrix with M ≤ N. In this thesis, however, we only consider the case M = N. Substituting Eq. (48) into Eq. (47), we obtain:

\[ (x_i - w)^T \Lambda (x_i - w) = (x_i - w)^T \Omega^T \Omega (x_i - w) = [\Omega (x_i - w)]^2 \ge 0 \tag{49} \]

Note that GRLVQ is a special case of GMLVQ in which Λ is diagonal, diag(Λ) = {λ_i}_{i=1}^N. The derivative of the distance d(w, x_i) with respect to the prototype w is:

\[ \nabla_w d(w, x_i) = -2 \Lambda (x_i - w) \tag{50} \]

Substituting Eq. (50) into Eq. (43) and Eq. (44), we obtain the update rules for the closest correct and incorrect prototypes.
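A small Python sketch may make Eqs. (47)-(49) concrete: it computes d(w, x) = |Ω(x − w)|², confirms that the result is non-negative, and checks that a diagonal Λ reduces to the weighted sum of squared feature differences underlying GRLVQ. The matrices and vectors are random stand-ins for trained quantities:

    import numpy as np

    def gmlvq_distance(x, w, omega):
        """d(w, x) = (x - w)^T Omega^T Omega (x - w) = |Omega (x - w)|^2."""
        diff = omega @ (x - w)
        return float(diff @ diff)

    rng = np.random.default_rng(1)
    N = 4
    omega = rng.normal(size=(N, N))          # full N x N matrix (the M = N case)
    x, w = rng.normal(size=N), rng.normal(size=N)
    print(gmlvq_distance(x, w, omega) >= 0)  # non-negative by construction, Eq. (49)

    # A diagonal Lambda recovers the weighted squared distance used by GRLVQ.
    lam = np.array([0.4, 0.3, 0.2, 0.1])
    d_diag = gmlvq_distance(x, w, np.diag(np.sqrt(lam)))
    print(np.isclose(d_diag, np.sum(lam * (x - w) ** 2)))  # True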
In the GMLVQ model, the update rule for the distance matrix Ω also needs to be computed. The derivative of d(w, x_i) with respect to a single element Ω_{lm} is:

\[ \nabla_{\Omega_{lm}} d(w, x_i) = \sum_k (x_i^m - w^m)\, \Omega_{lk} (x_i^k - w^k) + \sum_j (x_i^j - w^j)\, \Omega_{lj} (x_i^m - w^m) \tag{51} \]
\[ = 2\, (x_i^m - w^m) \left[ \Omega (x_i - w) \right]_l \tag{52} \]

The derivative of the cost function F with respect to a single element Ω_{lm} can then be expressed as:

\[ \Delta \Omega_{lm} = \Delta \Omega_{lm}^H + \Delta \Omega_{lm}^M = -2\beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,H} \cdot \nabla_{\Omega_{lm}} d(x_i, w_H) + 2\beta \cdot \phi'(\varepsilon_i) \cdot \varepsilon'_{i,M} \cdot \nabla_{\Omega_{lm}} d(x_i, w_M) \tag{53} \]

where β is the learning rate for Ω.
3 Feature Selection
3.1 Challenge
In this section, two challenges in feature selection will be discussed. The first issue is the curse of dimensionality and the second is the relevance and redundancy of features.

3.1.1 Curse of dimensionality

In machine learning, the term curse of dimensionality was initially coined by Richard Bellman [10] during his work on dynamic optimization [9, 10], where he found the problem quite difficult to tackle. He stated:

"In view of all that we said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration? It is the curse of dimensionality, a malediction that has plagued the scientist from the earliest days." [10]

Many definitions of it exist to date, but generally it refers to the problems incurred by adding extra features to the space. The reliability of a learning model depends on the density of training examples in the feature space: with increasing dimensionality the feature space becomes sparser, the generalization performance deteriorates, and more training examples are required. For example, if 5 samples are enough in each dimension, then 25 samples are sufficient to fill a two-dimensional cube, but this number increases to 5^20 for a 20-dimensional hypercube.

It has also been observed that it becomes more difficult to estimate a kernel in higher dimensions [11]. Table 1 illustrates the number of samples required to estimate a kernel at density 0 with a certain accuracy.

    Dimensionality    Sample size
    1                 4
    2                 19
    5                 786
    7                 10,700
    10                842,000

Table 1: Sample size required for kernel estimation [11].
3.1.2 Irrelevance and redundancy

There are some controversies in the definition of feature relevance. A review [8] introduces the different relevance definitions that have been proposed in the literature. The authors present an example showing that the existing relevance definitions produce unexpected results and, based on that, suggest that two different degrees of relevance are required: strong relevance and weak relevance. The definition of weak relevance can also be regarded as the definition of redundancy.

Let ⟨X, Y⟩ denote the training examples, where X ∈ R^N is the data and Y indicates the labels. Let F be the full feature set and F_i the i-th feature; each instance X is then one element of the product F_1 × F_2 × ··· × F_N. Let S_i = F − {F_i} denote the feature subset containing all features except F_i, and let s_i denote one value instantiation of S_i. Let P denote the conditional probability of the label Y given a feature subset.

Strong relevance. A feature F_i is strongly relevant iff there exist x_i ∈ F_i and y ∈ Y with P(x_i, s_i) > 0 such that

\[ P(Y = y \mid S_i = s_i, F_i = x_i) \neq P(Y = y \mid S_i = s_i) \]

Weak relevance. A feature F_i is weakly relevant iff it is not strongly relevant and there exist x_i ∈ F_i, s_i ⊆ S_i and y ∈ Y such that

\[ P(Y = y \mid S_i = s_i, F_i = x_i) \neq P(Y = y \mid S_i = s_i) \]

A feature F_i is called relevant if it is either strongly or weakly relevant to the class label; otherwise it is irrelevant. A weakly relevant feature F_i can become strongly relevant after a certain feature subset has been removed. Weak relevance can be interpreted as the existence of other relevant features which provide predictive power similar to that of the feature being measured; this is what we call redundancy. It is important to note that a weakly relevant or redundant feature F_i should not be removed if the feature subset whose removal would make F_i strongly relevant has already been removed by the feature selection algorithm.
3.2 General Framework
The framework in Figure 5 shows that a typical feature selection system usually consists of four components: feature subset generation, feature subset evaluation, a stopping criterion and feature subset validation. As indicated in the figure, the complete feature set is first sent to the "Generation" module, which produces different feature subset candidates based on some search strategy. Each subset candidate is then evaluated in the "Evaluation" module by a certain evaluation measurement; a new subset which turns out to be better replaces the previous best one. This subset generation and evaluation is repeated until the given stopping criterion is met. After that, the finally selected feature subset is sent to the "Validation" module for validation by certain learning algorithms.

Two basic issues have to be addressed in the "Generation" module: the starting point and the search strategy.

• Starting Point. Choose a point at which to start the search in the feature space. One choice is to begin with no features and then, in each iteration, expand the current feature subset by trying each feature that is not yet in the subset; the feature whose addition produces the best evaluation performance is added to the current subset. This is called forward selection (a minimal sketch follows this list). Another option is to proceed conversely: the search starts with the full feature set and successively eliminates the feature whose removal results in the best evaluation performance. This is called backward selection. A third alternative is to start from a random feature subset [13] and then successively add or remove features depending on the performance. This random approach can avoid being trapped in local optima.

• Search Strategy. There are three different search strategies: complete, heuristic and random. The complete strategy examines all possible feature subsets and is guaranteed to find the optimal one. When there are N features, the search examines 2^N subsets, which makes it unrealistic for large N. Heuristic search is guided by some heuristic; it is less computationally demanding, but finding the optimal subset is not guaranteed. The heuristic determines whether or not a better subset can be found. The random strategy simply chooses the next feature at random; the probability of finding the optimal subset therefore depends on how many epochs are tried.
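As a minimal sketch of the forward selection described in the "Starting Point" item above (the evaluate function is a hypothetical stand-in for any subset-evaluation measurement):

    def forward_selection(n_features, evaluate, max_size):
        """Greedily add the feature whose inclusion scores best."""
        selected = []
        while len(selected) < max_size:
            remaining = [f for f in range(n_features) if f not in selected]
            best = max(remaining, key=lambda f: evaluate(selected + [f]))
            selected.append(best)
        return selected

    # toy evaluation: pretend features 2 and 0 carry the most information
    scores = {2: 0.5, 0: 0.3, 1: 0.1, 3: 0.05}
    evaluate = lambda subset: sum(scores[f] for f in subset)
    print(forward_selection(4, evaluate, max_size=2))  # -> [2, 0]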
3.3 Wrapper and Filter Approach
The evaluation methods in feature selection can generally be divided into two basic models: the filter model [14, 15, 16, 17] and the wrapper model [18, 19, 20].

The filter model selects a feature subset as a pre-processing step, without considering the predictor performance. It is usually achieved by designing an evaluation function; functions that are frequently used include distance measures, information measures, dependency measures and consistency measures. The filter model does not involve any training of a learning algorithm and is thus much faster, which makes it suitable for large data sets.

In the wrapper model, a predetermined data mining algorithm is used to evaluate the feature subsets, and the candidate with the highest prediction performance is selected as the final subset. The wrapper model can usually select a feature subset with superior performance because it selects features better suited to the predetermined algorithm. However, because the algorithm has to be trained and tested for each subset candidate, the wrapper model tends to be computationally very expensive, especially for large feature sizes.
3.4 Feature Ranking Technique
3.4.1 Information gain
Information gain [21] measures the dependency between a feature X_i and the class label Y. It is a very popular technique in feature selection because it is easy to understand and compute. Information gain can also be regarded as a measure of the reduction in uncertainty about a feature X_i when the value of Y is known. Uncertainty is usually measured by Shannon's entropy.

Entropy. Entropy measures the amount of uncertainty that a feature X_i contains. It is given by

\[ H(X_i) = -\sum_{j=1}^{p} P(j) \log_2 P(j) \tag{54} \]

where p is the number of possible values of X_i and P(j) indicates the observation probability of the value j. From this formula, a more uniform distribution tends to produce a higher entropy. For example, if you toss a fair coin, there are two possible outcomes, each with probability 0.5. Its entropy value is

\[ H(\text{coin toss}) = -2 \times (0.5 \times \log_2 0.5) = 1 \]

In another example, if you toss a die, there are six possible outcomes, each with probability 1/6. Its entropy value is

\[ H(\text{die toss}) = -6 \times \left( \tfrac{1}{6} \times \log_2 \tfrac{1}{6} \right) = 2.585 \]

Therefore, the higher the entropy, the more uncertainty there is and the more difficult it is to predict the outcome.
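The two entropy values above are easy to verify numerically; the short Python snippet below (not part of the thesis) reproduces them:

    import numpy as np

    def entropy(probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs[probs > 0]          # convention: 0 * log2(0) = 0
        return float(-np.sum(probs * np.log2(probs)))

    print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
    print(entropy([1.0 / 6] * 6))    # fair die: 2.5849... bits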
Information Gain. The information gain between a feature X_i and the label Y is:

\[ I(X_i, Y) = H(X_i) - H(X_i \mid Y) \tag{55} \]

where H(X_i) and H(X_i | Y) are respectively the entropy of feature X_i and the entropy of X_i after the value of Y is known. H(X_i | Y) is calculated as

\[ H(X_i \mid Y) = -\sum_j P(Y_j) \sum_k P(x_k \mid Y_j) \log_2 P(x_k \mid Y_j) \tag{56} \]

A better understanding can be gained from Figure 6. As can be seen in the figure, H(X) and H(Y) respectively measure the entropy of X and Y, the information gain I(X, Y) is a measure of the information shared by X and Y, and H(X, Y) is the information that X and Y collectively contain:

\[ H(X, Y) = -\sum_x \sum_y p(x, y) \log_2 p(x, y) = H(X \mid Y) + I(X, Y) + H(Y \mid X) = H(X) + H(Y) - I(X, Y) \tag{57} \]

If X and Y are highly correlated, the information they share is very high, indicating a large value of I(X, Y). Then, if Y is known, much of the information about X can be inferred from Y, suggesting a low H(X | Y), and vice versa for H(Y | X).
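As a numeric illustration of Eqs. (55) and (56), the sketch below computes I(X, Y) = H(X) − H(X | Y) from an invented joint probability table p(x, y):

    import numpy as np

    def entropy(p):
        p = p[p > 0]                      # convention: 0 * log2(0) = 0
        return float(-np.sum(p * np.log2(p)))

    p_xy = np.array([[0.3, 0.1],          # rows: values of X
                     [0.1, 0.5]])         # columns: values of Y
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # H(X | Y) = - sum_j P(Y_j) sum_k P(x_k | Y_j) log2 P(x_k | Y_j), Eq. (56)
    h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))
    print(entropy(p_x) - h_x_given_y)     # information gain I(X, Y), Eq. (55)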
3.4.2 ReliefF

Relief [22] is a univariate feature weighting algorithm in the filter model. It is based on the principle that an attribute which better separates similar instances of different classes is more important and should be assigned a larger weight. The three basic steps to compute the feature weights are:

1. Find the nearest miss and nearest hit, where the nearest hit is the closest sample with the same class as the test sample and the nearest miss is the closest sample with a different label;
2. Calculate the weight of each feature;
3. Return a ranked list of feature weights, or the top k features according to a given threshold.

The algorithm starts by initializing all feature weights to zero. It then randomly selects an instance from the samples and finds its nearest hit NH and nearest miss NM. Each feature weight is updated based on its ability to discriminate between NH and NM. The detailed pseudocode is given in Algorithm 1, where x^i denotes the value of feature i in instance x.

Algorithm 1: Relief
Description: There are P instances described by N features and C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; T iterations are performed.
1. Set all feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N;
4.   Find the nearest hit NH(x) and nearest miss NM(x):
       NH(x) ← x_h, with x_h = argmin_j d(x, x_j) ∀ c(x_j) = c(x)
       NM(x) ← x_m, with x_m = argmin_k d(x, x_k) ∀ c(x_k) ≠ c(x)
5.   For i = 1 to N, do:
6.     w(i) = w(i) + d(x^i, NM(x)^i)/(P × T) − d(x^i, NH(x)^i)/(P × T)
7.   end do
8. end do
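A compact Python sketch of Algorithm 1 is given below. It is a reconstruction rather than the thesis code, and it uses the absolute per-feature difference for the feature-wise distance d(x^i, ·), a common choice for numeric features:

    import numpy as np

    def relief(X, y, T=100, seed=0):
        """Relief feature weights for a two-class data set (Algorithm 1)."""
        rng = np.random.default_rng(seed)
        P, N = X.shape
        w = np.zeros(N)
        for _ in range(T):
            i = rng.integers(P)
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                                     # skip the point itself
            hit = np.argmin(np.where(y == y[i], d, np.inf))   # nearest hit
            miss = np.argmin(np.where(y != y[i], d, np.inf))  # nearest miss
            # reward features separating the miss, punish those separating the hit
            w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (P * T)
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = (rng.random(100) < 0.5).astype(int)
    X[:, 0] += 2 * y             # only feature 0 carries class information
    print(relief(X, y))          # the weight of feature 0 dominates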
ReliefF [23] extends the original Relief algorithm to deal with the multi-class situation. It incorporates two important improvements. First, the result is more robust to noise because k nearest neighbours are considered instead of a single one. Second, it can deal with multi-class problems. The detailed pseudocode is shown in Algorithm 2.

Algorithm 2: ReliefF
Description: Instances are described by N features and there are C different classes: x ∈ R^N, c(x) ∈ {1, 2, ..., C}; k nearest neighbours are considered; T iterations are performed; p(y) is the class probability, specifying the probability of an instance being from class y.
1. Set all feature weights to 0: ∀i, w(i) = 0;
2. For t = 1 to T, do:
3.   Randomly pick an instance x ∈ R^N with label y_x;
4.   For y = 1 to C, do:
5.     Find the k nearest instances x(y, l) of x from class y, where l = 1, 2, ..., k;
6.     For i = 1 to N, do:
7.       For l = 1 to k, do:
8.         If y = y_x (nearest hit), then
9.           w(i) = w(i) − d(x^i, x(y, l)^i)/(T × k)
10.        else (nearest miss),
11.          w(i) = w(i) + [p(y)/(1 − p(y_x))] × d(x^i, x(y, l)^i)/(T × k)
12.        end if
13.      end for
14.    end for
15.  end for
16. end for

3.4.3 Fisher

Fisher [40] is an effective supervised feature selection algorithm which aims to select features that take similar values within the same class and different values across different classes. The evaluation score of Fisher's algorithm is:
\[ \mathrm{Fisher}(f_i) = \frac{\sum_{j=1}^{c} n_j (\mu_{i,j} - \mu_i)^2}{\sum_{j=1}^{c} n_j \sigma_{i,j}^2} \tag{58} \]

where f_i is the i-th feature to be evaluated, n_j is the number of instances in class j, µ_i is the mean of feature i, and µ_{i,j} and σ_{i,j}² are respectively the mean and the variance of feature i in class j. The Fisher algorithm is computationally efficient and widely applied in many applications; however, because it considers the features individually, it has no ability to deal with redundant features.
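Eq. (58) translates directly into a few lines of Python; the sketch below runs on synthetic data in which only one feature is informative:

    import numpy as np

    def fisher_score(X, y):
        """Per-feature Fisher score as in Eq. (58)."""
        mu = X.mean(axis=0)
        num = np.zeros(X.shape[1])
        den = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            n_c = len(Xc)
            num += n_c * (Xc.mean(axis=0) - mu) ** 2
            den += n_c * Xc.var(axis=0)
        return num / den

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (rng.random(200) < 0.5).astype(int)
    X[:, 2] += 3 * y              # feature 2 separates the classes
    print(fisher_score(X, y))     # feature 2 scores highest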
4 GMLVQ Based Feature Selection Algorithms
4.1 Entropy Enforcement for Feature Ranking Results
As stated above, the element Λ_{i,j} of the matrix Λ measures the correlation between features i and j, and the diagonal element Λ_{i,i} quantifies the contribution of feature i to classification. This statement only makes sense when the features have similar magnitudes; therefore, a z-score transformation is always performed on the data before training starts. One example is shown in Figure 7, where 32 features are ranked with respect to the values of their diagonal elements. The 19th feature has the highest value, indicating that it contributes most to the classification. An additional constraint is imposed so that after each adaptation the diagonal elements sum to one:

\[ \sum_{i=1}^{N} \Lambda_{i,i} = 1 \tag{59} \]
Figure 7: One example of the diagonal elements of Λ.
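The ranking step itself is simple once a model has been trained; the sketch below (illustrative only, with a random matrix standing in for a trained Ω) z-scores the data and ranks the features by the normalized diagonal of Λ = Ω^T Ω:

    import numpy as np

    def zscore(X):
        """Bring all features to comparable scales before training."""
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def rank_features(omega, feature_names):
        lam = omega.T @ omega                        # Lambda = Omega^T Omega
        relevance = np.diag(lam) / np.trace(lam)     # normalized, sums to one
        order = np.argsort(relevance)[::-1]
        return [(feature_names[i], float(relevance[i])) for i in order]

    rng = np.random.default_rng(0)
    omega = rng.normal(size=(3, 3))                  # stand-in for a trained Omega
    print(rank_features(omega, ["f1", "f2", "f3"]))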
One of the ideal situations in feature ranking is that some features are much more important than others, so that the least important features can be removed from the feature set without deteriorating the classification performance. An external entropy force is added to the cost function to push the diagonal elements toward this ideal situation.

The definition of the entropy force is:

\[ \mathrm{Entropy}(\Lambda_{\mathrm{diag}}) = -\sum_{i=1}^{N} \Lambda_{i,i} \log_2 \Lambda_{i,i} \tag{60} \]

where N is the data dimension. This term reaches its maximum when all the diagonal elements are equal, i.e. when all features are equally important for classification. Its minimization, on the other hand, pushes toward a discriminative feature relevance profile; at the extreme, only one feature is identified as relevant for classification and the relevances of all other features are zero.

It is integrated into the cost function by:

\[ F_{\mathrm{new}} = F + \alpha \times \mathrm{Entropy}(\Lambda_{\mathrm{diag}}) \tag{61} \]

where the regularization parameter α controls the trade-off between the classification accuracy and the discrimination between features. A larger value of α produces a more discriminative feature ranking result at the cost of classification performance. Their mutual relation on one of the data sets is visualized in Figure 8.
The choice of the regularization value depends on how important the accuracy and the discrimination are for the user, and it differs per data set. A safe way is to plot their relation for each data set and choose the optimal point based on the requirements. A more efficient way, used in this thesis, is to choose the value as N²/10, where N is the data dimension. On all of the data sets considered in this thesis, such a value generates a considerably discriminative feature ranking result without deteriorating the performance to a large extent. An example with and without the entropy force can be seen in Figure 9.
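A minimal sketch of Eqs. (60) and (61) follows; the cost value F, the diagonal of Λ and the choice of α are illustrative stand-ins for quantities produced during training:

    import numpy as np

    def diag_entropy(lam_diag):
        p = lam_diag[lam_diag > 0]
        return float(-np.sum(p * np.log2(p)))        # Eq. (60)

    def regularized_cost(F, lam_diag, alpha):
        return F + alpha * diag_entropy(lam_diag)    # Eq. (61)

    lam_diag = np.array([0.25, 0.25, 0.25, 0.25])    # all features equally relevant
    print(diag_entropy(lam_diag))                    # maximal entropy: 2.0 bits
    N = len(lam_diag)
    alpha = N ** 2 / 10                              # the heuristic discussed above
    print(regularized_cost(F=1.0, lam_diag=lam_diag, alpha=alpha))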
4.2 Way-Point Average Algorithm
Gradient based minimization is a popular and powerful method in non-linear optimization [31]. In this thesis, batch gradient descent is employed to train the GMLVQ model. One of the critical choices in gradient descent methods is the appropriate step size: too small a step size slows the convergence, while overly large steps can result in oscillatory or even divergent behavior.

In this section, a modification of batch gradient descent [32] is introduced which aims at better convergence behavior. The idea is that, during the training procedure, we compare the cost of the normal descent update with that of the gliding average over the most recent steps; if the latter produces a lower cost value, the minimization jumps to the averaged configuration and decreases the step size at the same time. A more detailed description follows.
Suppose we want to minimize an objective function F with respect to an N-dimensional vector x ∈ R^N. A gradient descent process is started at x_0 and iteratively generates a sequence of steps:

\[ x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \tag{62} \]

Note that the gradient has been normalized by its length; therefore, a_t is exactly the step length of the update: |x_{t+1} − x_t| = a_t.

The waypoint averaging algorithm starts at x_0 with an initial step size a_0 and performs k steps with the gradient steps unchanged:

\[ x_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a_t = a_0 \tag{63} \]

After that (t ≥ k), the procedure proceeds as follows:

1. Perform the normal gradient descent step and evaluate the corresponding cost function:

\[ x^*_{t+1} = x_t - a_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(x^*_{t+1}) \tag{64} \]

2. Perform the waypoint average over the previous j steps:

\[ \bar{x}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} x_{t-i} \quad \text{and calculate } F(\bar{x}_{t+1}) \tag{65} \]

3. Determine the new step size and position by comparison:

\[ x_{t+1} = x^*_{t+1},\ a_{t+1} = a_t \quad \text{if } F(x^*_{t+1}) \le F(\bar{x}_{t+1}); \qquad x_{t+1} = \bar{x}_{t+1},\ a_{t+1} = \lambda a_t \quad \text{otherwise} \tag{66} \]

with the parameter 0 < λ < 1.
As can be seen from the algorithm, as long as the normal gradient descent step produces a position with lower cost than the waypoint average, the iteration proceeds as normal gradient descent.

On the other hand, F(x̄_{t+1}) < F(x*_{t+1}) indicates the potential existence of oscillatory behavior: under oscillatory conditions the position fluctuates around the local minimum, and the average over the previous steps is expected to provide a closer estimate of the minimum than the normal gradient descent update. It also indicates that the step size may be too large to reach the minimum and should be decreased for better convergence.

An intuitive example is shown in Figure 10 [32], which visualizes the update steps of both normal gradient descent and the waypoint averaging algorithm. The dotted lines mark the update trajectory of normal gradient descent with a constant step size, which displays strong oscillatory behavior. The waypoint averaging algorithm shares the same trajectory with normal gradient descent for the first four steps; after that, it jumps to the average position over the previous steps and reduces the step size at the same time, which enables it to move closer to the minimum in the middle.
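The generic procedure of Eqs. (62)-(66) can be sketched in a few lines of Python. The ill-conditioned quadratic used here is an invented test function on which fixed-step descent oscillates, and the initial phase length k and averaging window j are merged into one parameter since both equal 3 in this thesis:

    import numpy as np

    def waypoint_descent(F, gradF, x0, a0=0.5, lam=2/3, j=3, steps=100):
        x, a = np.asarray(x0, dtype=float), a0
        history = [x.copy()]
        for t in range(steps):
            g = gradF(x)
            x_star = x - a * g / np.linalg.norm(g)       # normal step, Eq. (64)
            if t + 1 >= j:
                x_bar = np.mean(history[-j:], axis=0)    # waypoint average, Eq. (65)
                if F(x_star) <= F(x_bar):
                    x = x_star                           # keep the normal step
                else:
                    x, a = x_bar, lam * a                # jump back, shrink the step
            else:
                x = x_star                               # initial unchanged steps
            history.append(x.copy())
        return x

    F = lambda x: x[0] ** 2 + 25 * x[1] ** 2
    gradF = lambda x: np.array([2 * x[0], 50 * x[1]])
    print(waypoint_descent(F, gradF, x0=[3.0, 1.0]))     # approaches (0, 0)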
When applying this to GMLVQ, the cost function has to be optimized with respect to both the prototypes w and the matrix Ω, so two independent waypoint averaging schemes, one for w and one for Ω, have to be performed. The typical scheme is formulated as follows. Given a GMLVQ system represented by Ω and a set of prototypes {w_k}_{k=1}^M with cost function F, choose the starting points Ω_0 and {w_k^0}_{k=1}^M and the initial step sizes a_0^Ω and a_0^w for Ω and w.

1. Perform k (k = 3 in this thesis) steps with the gradient steps unchanged:

\[ \Omega_{t+1} = \Omega_t - a^{\Omega}_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a^{\Omega}_t = a^{\Omega}_0 \tag{67} \]

\[ w_{t+1} = w_t - a^{w}_t \frac{\nabla F}{|\nabla F|} \quad \text{for } t = 0, 1, 2, \dots, k-1 \text{ with } a^{w}_t = a^{w}_0 \tag{68} \]

After that (t ≥ k), the procedure proceeds as follows:

2. Perform the normal gradient descent step and evaluate the corresponding cost function for both Ω and w:

\[ \Omega^*_{t+1} = \Omega_t - a^{\Omega}_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(\Omega^*_{t+1}) \tag{69} \]

\[ w^*_{t+1} = w_t - a^{w}_t \frac{\nabla F}{|\nabla F|} \quad \text{and calculate } F(w^*_{t+1}) \tag{70} \]

3. Perform the waypoint average over the previous j (j = 3 in this thesis) steps:

\[ \bar{\Omega}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} \Omega_{t-i} \quad \text{and calculate } F(\bar{\Omega}_{t+1}) \tag{71} \]

\[ \bar{w}_{t+1} = \frac{1}{j} \sum_{i=0}^{j-1} w_{t-i} \quad \text{and calculate } F(\bar{w}_{t+1}) \tag{72} \]

4. Determine the new step sizes and positions for both Ω and w:

\[ \Omega_{t+1} = \Omega^*_{t+1},\ a^{\Omega}_{t+1} = a^{\Omega}_t \quad \text{if } F(\Omega^*_{t+1}) \le F(\bar{\Omega}_{t+1}); \qquad \Omega_{t+1} = \bar{\Omega}_{t+1},\ a^{\Omega}_{t+1} = \lambda a^{\Omega}_t \quad \text{otherwise} \tag{73} \]

\[ w_{t+1} = w^*_{t+1},\ a^{w}_{t+1} = a^{w}_t \quad \text{if } F(w^*_{t+1}) \le F(\bar{w}_{t+1}); \qquad w_{t+1} = \bar{w}_{t+1},\ a^{w}_{t+1} = \lambda a^{w}_t \quad \text{otherwise} \tag{74} \]

with the parameter λ = 2/3.
.4.3 Feature Ranking Ambiguity Removal
Up to this point, we have the input vectors and class labels

\[ \{x_i, y_i\}_{i=1}^{n} \quad \text{with } x_i \in \mathbb{R}^N,\ y_i \in \{1, 2, \dots, C\} \tag{75} \]

associated with a set of prototypes

\[ \{w_k\}_{k=1}^{M} \quad \text{where } M \ge C \tag{76} \]

and the distance is calculated as

\[ d(x_i, w_k) = (x_i - w_k)^T \Lambda (x_i - w_k) = (x_i - w_k)^T \Omega^T \Omega (x_i - w_k) = |\Omega (x_i - w_k)|^2 \tag{77} \]

where Λ, Ω ∈ R^{N×N} and Ω = [z_1, z_2, ..., z_N]^T, with {z_i}_{i=1}^N column vectors of dimension N. The feature ranking results can be obtained from the values of the diagonal elements of the matrix Λ. However, the question arises whether there is another matrix Λ which keeps the distance measurement unchanged. If such a matrix exists, the feature ranking results can differ without modifying the classifier, which means the feature ranking results obtained in the previous steps are not unique.
Consider a vector v_j which satisfies the following constraints:

\[ \forall\, i : v_j^T x_i = 0 \tag{78} \]
\[ \forall\, k : v_j^T w_k = 0 \tag{79} \]

If we add such a vector v_j to any row z_i^T of the matrix Ω, consider, for instance, i = 1:

\[ \Omega_{\mathrm{new}} = [z_1 + v_j, z_2, \dots, z_N]^T \tag{80} \]

we can easily verify that the following mappings remain unchanged:

\[ \forall\, i : \Omega_{\mathrm{new}} x_i = \Omega x_i \tag{81} \]
\[ \forall\, k : \Omega_{\mathrm{new}} w_k = \Omega w_k \tag{82} \]

Therefore, the distances between any pair of input samples and prototypes stay the same:

\[ d(x_i, w_k) = |\Omega (x_i - w_k)|^2 = |\Omega_{\mathrm{new}} (x_i - w_k)|^2 \quad \text{for all } i, k \tag{83} \]

Since the mapping and the distance calculation are the same for Ω and Ω_new, the cost functions, classification errors and classifiers they produce also stay the same. However, the feature ranking results may vary between Ω and Ω_new, because there is no constraint enforcing consistency of the diagonal elements of Λ and Λ_new.
Without loss of generality, we assume that there are J such spurious vectors {v_j}_{j=1}^J, and since, as will be proved at a later stage, all such vectors are eigenvectors of a constructed matrix, we can additionally assume that the vectors {v_j}_{j=1}^J are orthonormal:

\[ v_j \cdot v_k = \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \tag{84} \]

The proposed solution is to project all the spurious directions {v_j}_{j=1}^J out of a given matrix Ω:

\[ \Omega_{\mathrm{new}}^T = \Big[ I - \sum_{j=1}^{J} v_j v_j^T \Big] \Omega^T \tag{85} \]
It follows that:

\[ \Omega_{\mathrm{new}} (x_i - w_k) = \Omega (x_i - w_k) - \sum_{j=1}^{J} \Omega\, v_j \underbrace{v_j^T (x_i - w_k)}_{=\,0} = \Omega (x_i - w_k) \tag{86} \]

\[ \Omega_{\mathrm{new}}\, v_k = \Omega\, v_k - \sum_{j=1}^{J} \Omega\, v_j \underbrace{v_j^T v_k}_{\delta_{jk}} = \Omega\, v_k - \Omega\, v_k = 0 \tag{87} \]

Hence, we can interpret the resulting matrix Ω_new as the minimal representation of the mapping, containing no contribution of the spurious directions v_j.
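The projection of Eq. (85) and the checks of Eqs. (86) and (87) can be verified numerically. In the sketch below the spurious direction is constructed by hand (the fourth coordinate is simply never used by the data), an artificial stand-in for the eigenvector computation described next:

    import numpy as np

    def remove_spurious(omega, V):
        """Project the orthonormal columns of V out of Omega, Eq. (85)."""
        proj = np.eye(omega.shape[1]) - V @ V.T    # I - sum_j v_j v_j^T
        return omega @ proj                        # equals (proj @ omega.T).T

    rng = np.random.default_rng(0)
    omega = rng.normal(size=(4, 4))
    v = np.zeros((4, 1)); v[3, 0] = 1.0            # direction unseen in the data
    omega_new = remove_spurious(omega, v)

    x, w = rng.normal(size=4), rng.normal(size=4)
    x[3] = w[3] = 0.0                              # v is orthogonal to x and w
    d_old = np.sum((omega @ (x - w)) ** 2)
    d_new = np.sum((omega_new @ (x - w)) ** 2)
    print(np.isclose(d_old, d_new))                # distances unchanged, Eq. (86)
    print(np.allclose(omega_new @ v, 0.0))         # spurious direction removed, Eq. (87)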
The next question is how to find all these vectors {v_j}_{j=1}^J. The conditions that