BAYESIAN FEATURE SELECTION FOR HEARING AID PERSONALIZATION
Alexander Ypma¹, Serkan Ozer², Erik van der Werf¹ and Bert de Vries¹,²
¹GN ReSound Research, GN ReSound A/S, Horsten 1, 5612 AX Eindhoven
²Signal Processing Systems group, Dept. of Electrical Engineering, TU Eindhoven
s.ozer@student.tue.nl, {aypma,evdwerf,bdevries}@gnresound.com
ABSTRACT
We formulate hearing aid personalization as a linear regression. Since sample sizes may be low and the number of features may be high, we resort to a Bayesian approach for sparse linear regression that can deal with many features, in order to find efficient representations for on-line usage. We compare to a heuristic feature selection approach that we optimized for speed. Results on synthetic data with irrelevant and redundant features indicate that Bayesian backfitting has labelling accuracy comparable to the heuristic approach (for moderate sample sizes), but takes much larger training times. We then determine features for hearing aid personalization by applying the method to hearing aid preference data.
1. HEARING AID PERSONALIZATION

Modern digital hearing aids contain advanced signal processing algorithms with many parameters. These are set to values that ideally match the needs and preferences of the user. Because of the large dimensionality of the parameter space and unknown determinants of user satisfaction, the fitting procedure becomes a complex task. Some of the user parameters are personalized by the hearing aid dispenser based on the nature of the hearing loss. Other parameters may be tuned on the basis of models for e.g. loudness perception [1]. But not every individual user preference can be put as a preset into the hearing aid: some particularities of the user may be hard to represent in the algorithm, and the user's typical acoustic environments and preference patterns may be changing. Therefore we should personalize a hearing aid during usage to actual user preferences.
The algorithms introduced in [2] are able to learn preferred parameters from control operations of a user. We formulated the personalization problem as a linear regression from acoustic features to preferred hearing aid parameters. It is however not known which features are necessary and sufficient for explaining the user preference. Both the type of the feature and the appropriate time scale at which the feature is computed are unknown. Taking 'just every interesting feature' into account may lead to high-dimensional input vectors, containing irrelevant and redundant features¹ that make on-line computations expensive and hamper generalization of the model. This is even more of an issue since the number of user adjustments may be small. We therefore choose a Bayesian feature selection scheme that can deal with many features and still obtain a sparse and well-generalizing model for observed preference data (in order to make the on-line sound processing as efficient as possible). We study the behaviour of such a feature selection scheme with synthetic data, and we compare with two benchmark methods for feature selection and linear regression. We then analyse preference data from a listening test, for efficient personalization of a hearing aid algorithm.
Fig. 1. Hearing aid personalization via user control.

The personalization problem is illustrated in figure 1. A hearing aid performs a sound processing function $f(i_t, y_t)$, where the functional form depends on the hearing aid parameter $y_t$. The Automatic Control (AC) unit takes input $x_t$, which holds a vector of acoustic features from the input signal $i_t$, and maps it to an output $v_t$. The user feeds back corrections to the hearing aid via a control wheel, to adjust for an AC-driven parameter value $v_t$ different from the desired value. This user feedback signal $e_t$ is absorbed by a user-specific parameter vector $b$ in order to achieve smaller future corrections from the user (hence: higher user satisfaction). The parameter $y_t$ reflects the user-preferred parameter value, and is given by $y_t = x_t^T b + e_t$. Finding the optimal $b$ coefficients boils down to a linear regression problem, which can be seen when analyzing the personalization problem [2].

¹Irrelevant features do not contribute to the output as such, whereas redundancy refers to features that are correlated with other features (and do not contribute to the output when the correlated features are also present).
We now turn to the problem of finding a sparse set of acoustic features for efficient processing on a hearing aid. Since user preferences are expected to change mainly over long-term usage, the coefficients $b$ are considered stationary for a certain data collection experiment. We therefore propose off-line feature selection based on a set of pre-collected preference data. To emphasize the off-line nature, we will index samples with $i$ rather than $t$ in the rest of the paper.
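As a concrete illustration of this formulation (not taken from the paper; data and variable names are our own), the following minimal NumPy sketch fits the user-specific coefficients $b$ by ordinary least squares from logged pairs of acoustic feature vectors and preferred parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data: N user adjustments, d acoustic features per adjustment.
N, d = 200, 6
X = rng.normal(size=(N, d))                 # acoustic feature vectors x_i
b_true = rng.normal(size=d)                 # "true" user-specific coefficients
y = X @ b_true + 0.1 * rng.normal(size=N)   # preferred values y_i = x_i^T b + e_i

# Least-squares estimate of b from the collected preference data.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat - b_true, 2))          # should be close to zero
```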
2. BAYESIAN FEATURE SELECTION
Backfitting [3] is a method for estimating additive linear models of the form

$y(\mathbf{x}) = \sum_{m=1}^{d} g_m(\mathbf{x}; \theta_m)$.

The non-linear basis functions $g_m$ absorb parameters $b_m$ as $g_m = b_m f_m(\mathbf{x}; \theta_m)$, and they contain auxiliary parameters $\theta_m$ that determine the shape of the basis function. Since we assume a linear map, the individual functions $f_m(\mathbf{x}; \theta_m)$ are also linear, and there are no auxiliary parameters $\theta_m$. Backfitting decomposes the statistical estimation problem into $d$ individual estimation problems by creating "fake targets" for each $g_m$ function. It effectively decouples the inference in each dimension, leading to a highly efficient algorithm.

2.1. Probabilistic backfitting
A probabilistic version of backfitting is derived in [4], where variables $z_{im}$ are introduced (figure 2) that act as unknown (fake) targets for each branch of regression input. One assumes that $z_{im}$ and $y_i$ are conditionally Gaussian:

$y_i \sim \mathcal{N}\!\left(\sum_{m=1}^{d} z_{im},\; \psi_y\right), \qquad z_{im} \sim \mathcal{N}\!\left(b_m f_m(\mathbf{x}_i),\; \psi_{z_m}\right) \quad (1)$

Parameters $\{\{b_m, \psi_{z_m}\}_{m=1}^{d}, \psi_y\}$ can be optimized using the EM framework for maximum likelihood fitting [5]. There one distinguishes between observed variables $\mathbf{x}_D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ and hidden variables $\mathbf{x}_H = \{\mathbf{z}_i\}$, and one maximizes the expected complete data log-likelihood $\langle \ln p(\mathbf{x}_D, \mathbf{x}_H; \phi)\rangle = \langle \ln p(\mathbf{y}, Z \mid X; \mathbf{b}, \boldsymbol{\psi}_z, \psi_y)\rangle$. Here, $Z$ denotes the $N$-by-$d$ matrix of all $z_{im}$, and design matrix $X$ contains the $f_m(\mathbf{x}_i)$ of all data points $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. The EM update equations for $b_m$, the noise variances $\psi_y$ and $\psi_{z_m}$, and the moments of $Z$ have complexity $O(d)$. For brevity, we only give the equation for the weights:

$b_m^{(n+1)} = b_m^{(n)} + w_m \, \frac{\sum_{i=1}^{N} \left( y_i - \sum_{k=1}^{d} b_k^{(n)} f_k(\mathbf{x}_i) \right) f_m(\mathbf{x}_i)}{\sum_{i=1}^{N} f_m(\mathbf{x}_i)^2} \quad (2)$

where $w_m = \psi_{z_m}/s$ and $s = \psi_y + \sum_{m=1}^{d} \psi_{z_m}$. This is 'probabilistic backfitting': when the factor $\psi_{z_m}/s$ is set to 1, the standard backfitting update is recovered.

Fig. 2. Graphical model for Bayesian backfitting, from [4].
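To make update (2) concrete, here is a minimal NumPy sketch of this probabilistic backfitting iteration for a purely linear model (so $f_m(\mathbf{x}_i) = x_{im}$). The fixed noise variances and the initialization are our own illustrative choices; the full algorithm of [4] also re-estimates $\psi_{z_m}$ and $\psi_y$ within EM.

```python
import numpy as np

def probabilistic_backfitting(X, y, psi_z, psi_y, n_iter=2000):
    """Iterate the probabilistic backfitting weight update of eq. (2).

    X     : (N, d) design matrix with f_m(x_i) = x_im (purely linear model)
    y     : (N,) targets
    psi_z : (d,) branch noise variances (kept fixed here for simplicity)
    psi_y : scalar output noise variance (kept fixed here for simplicity)
    """
    N, d = X.shape
    b = np.zeros(d)
    s = psi_y + psi_z.sum()
    w = psi_z / s                     # per-dimension step factor psi_zm / s
    denom = (X ** 2).sum(axis=0)      # sum_i f_m(x_i)^2 for each m
    for _ in range(n_iter):
        resid = y - X @ b             # y_i - sum_k b_k f_k(x_i)
        b = b + w * (X.T @ resid) / denom
    return b

# Toy usage on synthetic linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
b_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ b_true + 0.1 * rng.normal(size=500)
print(np.round(probabilistic_backfitting(X, y, psi_z=np.ones(5), psi_y=1.0), 2))
```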
2.2. Automatic relevance determination

Probabilistic backfitting makes it possible to use a Bayesian framework to regularize its least-squares solution against overfitting. Furthermore, for a certain choice of the prior over the coefficients $\mathbf{b}$ a sparse model is favored. E.g. one can place individual precision variables $\alpha_m$ over each of the regression parameters $b_m$ (figure 2) and choose the priors:

$\mathbf{b} \mid \boldsymbol{\alpha} \sim \prod_{m=1}^{d} \mathrm{Normal}(b_m;\, 0,\, 1/\alpha_m), \qquad \boldsymbol{\alpha} \sim \prod_{m=1}^{d} \mathrm{Gamma}(\alpha_m;\, \lambda_\alpha,\, \nu_\alpha^{(m)}) \quad (3)$

It can be shown [6] that the marginal prior over the coefficients is a multidimensional Student-t distribution, which places most of its probability mass along the axial ridges of the space. At these ridges the magnitude of only one of the parameters is large, hence favoring a sparse solution (this procedure is called automatic relevance determination or ARD). Extracting marginal posteriors such as $\ln p(\mathbf{b} \mid \mathbf{x}_D; \phi)$ is intractable, but can be done approximately using variational Bayes. In this approach, one assumes that variables and parameters factorize in the posterior, so the approximate joint posterior $Q(Z, \mathbf{b}, \boldsymbol{\alpha}) = Q(Z)\, Q(\mathbf{b}, \boldsymbol{\alpha})$. This leads to approximate marginal posteriors $Q(\mathbf{b})$, $Q(\boldsymbol{\alpha})$ of the form

$Q(\boldsymbol{\alpha}) = \prod_{m=1}^{d} \mathrm{Gamma}(\alpha_m;\, \hat{\lambda}_\alpha,\, \hat{\nu}_\alpha^{(m)}), \qquad Q(\mathbf{b}) = \prod_{m=1}^{d} t_{\hat{\nu}}(b_m;\, \mu_{b_m},\, \sigma^2_{b_m}) \quad (4)$

and $Q(Z)$ resembles the EM update expression, with parameters $\mathbf{b}$, $\boldsymbol{\alpha}$ replaced by their expectations $\langle \mathbf{b} \rangle$, $\langle \boldsymbol{\alpha} \rangle$. Note that in this approximation the posterior correlations between $\mathbf{b}$ and $\boldsymbol{\alpha}$ are retained [4]. Inference is done using the variational Bayesian EM algorithm, where one alternates variational E and M steps (inferring the above posteriors) with a maximization step in the hyperparameters. Substituting the expressions for $\langle z_{im} \rangle$ in the update equations for the distribution $Q(\mathbf{b})$ gives a 'backfitting-like' update for the regression coefficients [4]. The complexity of the full variational EM algorithm remains linear in the dimensionality $d$.
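For readers who want to experiment with ARD-style pruning without implementing the variational backfitting scheme of [4], scikit-learn's ARDRegression (a related but different ARD formulation, used here purely as an illustration) shows the same qualitative behaviour: precisions of irrelevant coefficients grow large and the corresponding posterior means shrink towards zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(2)

# Synthetic data: only the first 3 of 10 features are relevant.
N, d = 300, 10
X = rng.normal(size=(N, d))
b_true = np.zeros(d)
b_true[:3] = [2.0, -1.5, 1.0]
y = X @ b_true + 0.1 * rng.normal(size=N)

ard = ARDRegression()
ard.fit(X, y)
print(np.round(ard.coef_, 2))    # irrelevant coefficients shrunk towards zero
print(np.round(ard.lambda_, 1))  # learned precisions; large for irrelevant features
```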
2.3. Practical issues

When the hyperparameters $\lambda_\alpha, \nu_\alpha^{(m)}$ are initialized to small values, we effectively have an uninformative prior on the precision variables $\boldsymbol{\alpha}$. Because of the ARD mechanism, irrelevant components will have $\langle \alpha_m \rangle \to \infty$, so the posterior distribution over the corresponding coefficient $b_m$ will be narrow around zero. The significance of a posterior mean value unequal to zero can be tested directly via a t-test on the posterior means $\langle b_m \rangle$ (see [7]), since the marginal posterior over coefficients $b_m$ is a factorial t-distribution (4). This leaves a fully automatic regression and feature selection method, where the only remaining hyperparameters are the initial values for the noise variances $\psi_{z_m}, \psi_y$, the level of the t-test and the convergence criteria for the variational EM loop.
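A minimal sketch of such a relevance decision (with an illustrative significance level; the exact test statistic used in [7] may differ) keeps a feature when its posterior mean is significantly different from zero relative to its posterior standard deviation:

```python
import numpy as np
from scipy import stats

def relevant_features(b_mean, b_std, dof, alpha=0.05):
    """Keep feature m when the posterior mean <b_m> differs significantly from 0.

    b_mean : (d,) posterior means of the coefficients
    b_std  : (d,) posterior standard deviations of the coefficients
    dof    : degrees of freedom of the (factorial) t posterior
    """
    t_stat = np.abs(b_mean) / b_std
    threshold = stats.t.ppf(1.0 - alpha / 2.0, df=dof)  # two-sided test
    return np.where(t_stat > threshold)[0]

# Hypothetical posterior summaries for 5 coefficients.
print(relevant_features(np.array([1.2, 0.02, -0.8, 0.01, 0.3]),
                        np.array([0.1, 0.05, 0.2, 0.04, 0.25]), dof=50))
```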
3. FAST HEURISTIC FEATURE SELECTION

This section presents two fast greedy heuristic feature selection algorithms specifically tailored for the task of linear prediction. The algorithms apply (1) Forward Selection (FW) and (2) Backward Elimination (BW), which are known to be computationally attractive strategies that are robust against overfitting [8]. In our implementation both algorithms apply the following general procedure.

1. Preprocessing. For all features (input and output) subtract the mean and scale to unit variance. Features without variance are removed. For efficiency, second order statistics are pre-calculated on the full data.

2. Cross-validation. Repeat the following steps 10 times:
   (a) Split dataset: randomly take out 10% of the samples for validation. The statistics of the remaining 90% are used to generate the ranking.
   (b) Heuristically rank the features (see below).
   (c) Evaluate the ranking to find the number of features k that minimizes the validation error.

3. Wrap up. From all values k (found at 2c) select the median $k_m$ (round down in case of ties). Then, for all rankings, count the occurrences of a feature in the top-$k_m$ to select the $k_m$ most popular features and finally optimize their weights on the full dataset.

The difference between the two algorithms lies in the ranking strategy used at step 2b. However, before discussing this difference we first focus on the evaluation.
3.1. Efficient evaluation
The standard least squares error of a linear predictor, using weight vector $\mathbf{b}$ and ignoring a constant term for the output variance, is calculated by

$J = \mathbf{b}^T R \mathbf{b} - 2 \mathbf{r}^T \mathbf{b} \quad (5)$

where $R$ is the auto-correlation matrix defined as

$R = \sum_i \mathbf{x}_i \mathbf{x}_i^T \quad (6)$

and $\mathbf{r}$ is the cross-correlation vector defined as

$\mathbf{r} = \sum_i y_i \mathbf{x}_i \quad (7)$

Finding the optimal weights $\mathbf{b}$, using standard least-squares fitting, requires a well-conditioned invertible matrix $R$, which we ensure using the standard regularization technique of adding a small fraction to the diagonal elements of the correlation matrix. Since the regularized matrix $R$ is a non-singular symmetric positive definite matrix, we can use a Cholesky factorization, providing an upper triangular matrix $C$ satisfying the relation $C^T C = R$, to efficiently compute the least-squares solution

$\mathbf{b} = R^{-1}\mathbf{r} = C^{-1}(C^{-1})^T \mathbf{r} \quad (8)$

Moreover, intermediate solutions of actual weight values are often unnecessary, because it suffices to have an error measure for a particular subset $s$ (with auto- and cross-correlations $R_s$ and $\mathbf{r}_s$ obtained by selecting the corresponding rows and columns of $R$ and $\mathbf{r}$, and $C_s$ being the corresponding Cholesky factorization). We can therefore directly insert (8) into (5) to efficiently obtain the error on the training set using

$J_s = -\left((C_s^{-1})^T \mathbf{r}_s\right)^T \left((C_s^{-1})^T \mathbf{r}_s\right) \quad (9)$

Obtaining a Cholesky factorization from scratch, to test a selection of $k$ features, requires a computational complexity of $O(k^3)$; the subsequent matrix division then only requires $O(k^2)$. It is possible to update the factorization incrementally, reducing that complexity also to $O(k^2)$. Matlab's implementation of the complete factorization, however, proved …
3.2. Forward selection and backward elimination

Forward selection repetitively expands a set of features by always adding the most promising unused feature. Starting from an empty set, features are added one at a time. Once selected, features are never removed. To identify the most promising feature, FW investigates each (unused) feature, directly calculating errors using (9). In principle the procedure can provide a complete ordering of all features. The complexity, however, is dominated by the largest sets, so needlessly generating them is rather inefficient. FW therefore stops the search early once the minimal validation error has not decreased for at least 10 runs.

Backward elimination employs the reverse strategy of FW. Starting from a complete set of features it generates an ordering by each time taking out the least promising feature. To identify the least promising feature our algorithm investigates each feature still part of the set, and removes the one that provides the largest reduction (or smallest increase) of criterion (9). Since BW spends most time at the start, when the feature set is still large, not much can be gained using an early stopping criterion. Hence, in contrast to FW, BW always generates a complete ordering of all features.
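The greedy ranking itself is then only a few lines on top of the subset scorer above. This sketch is our own and omits the cross-validation loop and early-stopping rule described in the paper; it implements the forward selection ordering:

```python
import numpy as np

def forward_selection_ranking(R, r, subset_error):
    """Greedily rank features by repeatedly adding the one that lowers eq. (9) the most."""
    d = len(r)
    selected, remaining = [], list(range(d))
    while remaining:
        errors = [subset_error(R, r, selected + [m]) for m in remaining]
        best = remaining[int(np.argmin(errors))]
        selected.append(best)
        remaining.remove(best)
    return selected  # full greedy ordering; truncate at the k minimizing validation error

# Reusing R, r and subset_error from the previous sketch:
# print(forward_selection_ranking(R, r, subset_error))
```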
4. EXPERIMENTS

4.1. Evaluation of Bayesian backfitting
We generated artificial regression data as follows (a sketch of this procedure is given after the list):

1. Choose the total number of features $d$ and the number of irrelevant features $d_{ir}$. The number of relevant features is $d_{rel} = d - d_{ir}$.

2. Generate $N$ samples from a normal distribution of dimension $d - d_{ir}$. Pad the input vector with $d_{ir}$ zero dimensions.

3. Regression coefficients $b_m$, $m = 1, \ldots, d$ were drawn from a normal distribution, and coefficients with value $|b_m| < 0.5$ were clipped to 0.5. The first $d_{ir}$ coefficients were put to zero.

4. Optional: choose the number of redundant features $d_{red} = (d - d_{ir})/2$. The number of relevant features is now $d_{rel} = d_{red}$. Take the relevant features $[d_{ir}+1, \ldots, d_{ir}+d_{rel}]$, rotate them with a random rotation matrix and add them as redundant features by substituting features $[d_{ir}+d_{rel}+1, \ldots, d_{ir}+d_{rel}+d_{red}]$.

5. Outputs were generated according to the model, and Gaussian noise was added at an SNR of 10.

6. An independent test set was generated in the same manner, but the output noise was zero in this case (i.e. an infinite output SNR).

7. In all experiments, inputs and outputs were scaled to zero mean and unit variance after the data generation procedure was performed. Unnormalized weights were found by inverse transforming the weights found by the algorithms. The noise variance parameters $\psi_{z_m}$ and $\psi_y$ were initialized to $0.5/(d+1)$, thus assuming a total output noise variance of 0.5 initially.²
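The following NumPy sketch is our own reading of steps 1-7. For simplicity it makes the "irrelevant" dimensions inputs with zero regression coefficients (rather than literal zero padding, so that all features have non-zero variance), and the handling of the rotated redundant block is one plausible interpretation of step 4.

```python
import numpy as np
from scipy.stats import ortho_group

def generate_dataset(N, d, d_ir, redundant=False, snr=10.0, seed=0):
    """Synthetic regression data with irrelevant and (optionally) redundant features."""
    rng = np.random.default_rng(seed)
    # Steps 1-2 (simplified): draw all d input dimensions from a normal distribution;
    # the first d_ir dimensions are made irrelevant via zero coefficients in step 3.
    X = rng.normal(size=(N, d))
    # Step 3: coefficients from a normal distribution, small magnitudes clipped to 0.5,
    # the first d_ir coefficients set to zero (irrelevant features).
    b = rng.normal(size=d)
    b[np.abs(b) < 0.5] = 0.5
    b[:d_ir] = 0.0
    # Step 4 (optional): overwrite half of the relevant block with a random rotation
    # of the other half, making those features redundant.
    if redundant:
        d_red = (d - d_ir) // 2
        Q = ortho_group.rvs(d_red, random_state=seed)
        X[:, d_ir + d_red:d_ir + 2 * d_red] = X[:, d_ir:d_ir + d_red] @ Q
        b[d_ir + d_red:d_ir + 2 * d_red] = 0.0  # outputs depend only on the originals
    # Step 5: outputs from the linear model plus Gaussian noise at the given SNR.
    y_clean = X @ b
    y = y_clean + rng.normal(scale=np.std(y_clean) / np.sqrt(snr), size=N)
    # Step 7: standardize inputs and outputs after generation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()
    return X, y, b

X, y, b = generate_dataset(N=1000, d=15, d_ir=10)
print(X.shape, y.shape)
```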
4.1.1. Detecting irrelevant features
In the first experiment, $d_{ir} = 10$, so the first and the last five input features were irrelevant for predicting the output, and all other features were relevant. We varied the number of samples $N$ as [50, 100, 500, 1000, 10000] and studied two dimensionalities $d$ = [15, 50]. We repeated 10 runs of each feature selection experiment (with each time a new draw of the data), and trained both Bayesian and conventional feature selection methods on the data. The Bayesian method was trained for 200,000 cycles maximum, or until the likelihood was improving by less than 1e-4. We compared (i) labelling error, i.e. we counted the total number of mislabelings of a feature and normalized by the total number of features present in 10 runs; (ii) mean normalized prediction error on the test set; (iii) computational complexity (Matlab tic, toc). The mean labelling results over 10 repetitions³ are shown in figure 3.

Fig. 3. Mean labelling error versus log sample size. Upper graph is for $d = 50$, lower graph for $d = 15$.

We see that for both 15 and 50 features and moderate to high sample sizes⁴, variational Bayesian backfitting (VB) outperforms FW, and performs similarly to BW. For small sample sizes, FW and BW outperform VB.
²We noticed that initializing the noise variances to large values led to slow convergence with large sample sizes. Initializing to $0.5/(d+1)$ alleviated this problem.
³The result for $(d, N) = (50, 10000)$ is based on 5 runs.
⁴We define moderate sample size as $N = [100, \ldots, 1000]$ for $d = 15$ and $N = [1000, \ldots, 10000]$ for $d = 50$.
As for the prediction quality of the learned model using the selected features only, performance of all methods was similar (not shown). Apparently, including irrelevant features with small weights contributes little to the prediction error (but features may be expensive to compute!). Finally, the occasional improved accuracy of VB comes at the expense of larger training times (see figure 5).
4.1.2. Detecting redundant features
We then added redundant features to the data, i.e. we included optional step 4 in the data generation procedure. In a new experiment, $d$ was varied and the output SNR was set at 10. Since the relevant and the redundant features may be interchanged⁵, we determined the size of the redundant subset in each run (which should equal $d_{red}$ = [5, 10, 20] for $d$ = [20, 30, 50], respectively). In figure 4 we plot the mean size of the redundant subset estimated in 10 runs for different $d$, $d_{red}$, including one-standard-deviation error bars.

Fig. 4. Estimated $d_{red}$ vs. log sample size. Upper, middle and lower graphs are for $d$ = 50, 30, 20 and $d_{red}$ = 20, 10, 5.

Fig. 5. Running time vs. log sample size. Upper, middle and lower graphs for each method are for $d$ = 50, 30, 20.

For moderate sample sizes, both VB and the benchmark methods detect the redundant subset (though they are biased to somewhat larger values), but the accuracy of the VB estimate drops with large sample sizes. When inspecting the likelihood curves for these cases, it turned out that we did not reach convergence after 200,000 iterations, likely causing the worsening. Further, figure 5 shows that VB scales much less favourably with sample size than the benchmark methods, but on the other hand scales more favourably with input dimensionality. The superlinear scaling behaviour of the benchmark methods with the number of relevant features may eventually result in higher computation times for FW and BW. We conclude that VB is able to detect both irrelevant and redundant features in a relatively reliable manner for dimensionalities up to 50 (which was the maximum dimensionality studied) and moderate sample sizes. The benchmark methods seem more robust to small sample problems.

⁵A rotated set of relevant features may be considered by a feature selection method as more relevant than the original ones, in which case the originals become the redundant ones.

4.2. Feature selection in preference data
We implemented a hearing aid algorithm on a real-time platform and made one of the processing parameters of the virtual hearing aid on-line modifiable by the user. Six normal hearing subjects were exposed in a lab trial to an acoustic stimulus that consisted of several speech and noise snapshots picked from a database (each snapshot typically in the order of 10 seconds), which were combined in several ratios and appended. This led to one long stream of signal/noise episodes with different types of signals and noise in different ratios. The subjects were asked to listen to this stream several times in a row, and adjust the processing parameter as desired. Each time an adjustment was made, the acoustic input vector and the desired hearing aid processing parameter were stored. At the end of an experiment a set of input-output pairs was obtained, from which a regression model could be inferred using off-line training. We postulated that two types of features bear information about the user preference. Feature 1 was expected to be correlated with features 3 and 4 (features referred to as red1, red2, red3), whereas feature 2 was expected to be very different (referred to as indep). Further, features 2, 3 and 4 were computed at 6 different time scales, whereas the first feature was the same at all time scales, leading to 3 × 6 + 1 = 19 features. The number of adjustments for each of the subjects 1 to 6 was [43, 275, 703, 262, 99, 1020]. This means that we are in the realm of moderate sample size and moderate dimensionality, for which VB is accurate (see section 4). We then trained VB on the six datasets. In figure 6 we show, for subjects 3 and 6, a Hinton diagram of the posterior mean values for the variance (i.e. $1/\langle \alpha_m \rangle$).

Fig. 6. ARD-based selection of hearing aid features. Shown is a Hinton diagram of $1/\langle \alpha_m \rangle$, computed from preference data. Left: subject no. 3. Right: subject no. 6. Horizontal (left to right): time scale at which a feature is computed [1, 2, 3.5, 5, 7.5, 10]. Vertical (top to bottom): features [red1, indep, red2, red3]. Box size denotes relevance.

Subjects 3 and 6 adjust the hearing aid parameter primarily based on features indep and red2. Two other subjects only used feature indep, whereas one subject used all features indep, red1, red2, red3 (to some extent). One last subject's data could not be fit reliably (noise variances $\psi_{z_m}$ were high for all components).

5. DISCUSSION
From our synthetic data experiments, we conclude that VB is a useful method for doing accurate regression and feature selection at the same time, provided sample sizes are moderate to high and computation time is not an issue. When redundant features are present and sample sizes are high, one has to train for many iterations in order to reach convergence. However, labelling and prediction accuracies are comparable to the results obtained with our speed-optimized benchmark methods, which take much less computing time and scale more favourably with sample size. Only when dimensionalities are very high do we expect the linear scaling behaviour of VB with dimensionality to pay off. One further advantage of the VB approach over the benchmark methods is that one can compute a predictive distribution using the inferred parameter posteriors, allowing for predictions with confidence levels. From our preference data experiment, we noted that 4 out of 6 users personalized the hearing aid parameter based on (some of the) features indep and red2, which seem a good choice for inclusion in an on-line learning algorithm. For one of the users, either the sample size was too low, his preference was too noisy, or the linearity assumption of the model might not hold. In the future, an experimental setup with a different acoustic stimulus might be considered to possibly diminish the noise in the preference data.
6. CONCLUSIONS

We compared Bayesian backfitting feature selection to a speed-optimized heuristic approach. Results on synthetic data indicate that the method has labelling accuracy comparable to the fast heuristic approach (for moderate sample sizes), but takes much larger training times. VB is therefore a useful alternative to the benchmark methods FW and BW mainly when the sample size is moderate (accuracy is good), the number of features is high (linear scaling with dimensionality), and predictions with confidence levels are needed. In a hearing aid personalization problem, 4 out of 6 subjects showed preferences based on two types of features, giving valuable clues for feature choice in on-line algorithms.
7. ACKNOWLEDGMENTS
The authors would like to thank Job Geurts for help with the listening experiments and Tjeerd Dijkstra for useful discussions.
8. REFERENCES
[1] S. Launer and B. Moore, "Use of a loudness model for hearing aid fitting: On-line gain control in a digital hearing aid," Int. J. of Audiology, pp. 262-273, 2003.

[2] A. Ypma, B. de Vries, and J. Geurts, "Robust volume control personalization from on-line preference feedback," in Proc. 2006 IEEE International Workshop on Machine Learning for Signal Processing, MLSP'06, Maynooth, Ireland, September 6-8 2006, pp. 441-446.

[3] T. Hastie and R. Tibshirani, Generalized Additive Models, Chapman & Hall, 1990.

[4] A. A. D'Souza, Towards tractable parameter-free statistical learning, Ph.D. thesis, University of Southern California, 2004.

[5] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. of Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.

[6] M. Tipping, "Bayesian inference: An introduction to principles & practice in machine learning," in Advanced Lectures on Machine Learning, 2003, pp. 41-62.

[7] J. Ting et al., "Predicting EMG data from M1 neurons with variational Bayesian least squares," in Advances in Neural Information Processing Systems 18 (NIPS 2005), Cambridge, MA: MIT Press, 2005.

[8] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.