BAYESIAN FEATURE SELECTION FOR HEARING AID PERSONALIZATION
Alexander Ypma¹, Serkan Ozer², Erik van der Werf¹ and Bert de Vries¹,²
¹GN ReSound Research, GN ReSound A/S, Horsten 1, 5612 AX Eindhoven
²Signal Processing Systems group, Dept. of Electrical Engineering, TU Eindhoven
s.ozer@student.tue.nl, {aypma,evdwerf,bdevries}@gnresound.com
ABSTRACT
We formulate hearing aid personalization as a linear regression. Since sample sizes may be low and the number of features may be high, we resort to a Bayesian approach for sparse linear regression that can deal with many features, in order to find efficient representations for on-line usage. We compare to a heuristic feature selection approach that we optimized for speed. Results on synthetic data with irrelevant and redundant features indicate that Bayesian backfitting has labelling accuracy comparable to the heuristic approach (for moderate sample sizes), but takes much larger training times. We then determine features for hearing aid personalization by applying the method to hearing aid preference data.
1. HEARING AID PERSONALIZATION

Modern digital hearing aids contain advanced signal processing algorithms with many parameters. These are set to values that ideally match the needs and preferences of the user. Because of the large dimensionality of the parameter space and unknown determinants of user satisfaction, the fitting procedure becomes a complex task. Some of the user parameters are personalized by the hearing aid dispenser based on the nature of the hearing loss. Other parameters may be tuned on the basis of models for e.g. loudness perception [1]. But not every individual user preference can be put as a preset into the hearing aid: some particularities of the user may be hard to represent in the algorithm, and the user's typical acoustic environments and preference patterns may be changing. Therefore we should personalize a hearing aid during usage to actual user preferences.
The algorithms introduced in [2] are able to learn preferred parameters from control operations of a user. We formulated the personalization problem as a linear regression from acoustic features to preferred hearing aid parameters. It is however not known which features are necessary and sufficient for explaining the user preference. Both the type of the feature and the appropriate time scale at which the feature is computed are unknown. Taking 'just every interesting feature' into account may lead to high-dimensional input vectors, containing irrelevant and redundant features¹ that make on-line computations expensive and hamper generalization of the model. This is even more of an issue since the number of user adjustments may be small. We therefore choose a Bayesian feature selection scheme that can deal with many features and still obtain a sparse and well-generalizing model for observed preference data (in order to make the on-line sound processing as efficient as possible). We study the behaviour of such a feature selection scheme with synthetic data, and we compare with two benchmark methods for feature selection and linear regression. We then analyse preference data from a listening test, for efficient personalization of a hearing aid algorithm.
Fig. 1. Hearing aid personalization via user control.

The personalization problem is illustrated in figure 1. A hearing aid performs a sound processing function $f(i_t, y_t)$, where the functional form depends on the hearing aid parameter $y_t$. The Automatic Control (AC) unit takes input $x_t$, which holds a vector of acoustic features from the input signal $i_t$, and maps it to an output $v_t$. The user feeds back corrections to the hearing aid via a control wheel, to adjust for an AC-driven parameter value $v_t$ different from the desired value. This user feedback signal $e_t$ is absorbed by a user-specific parameter vector $b$ in order to achieve smaller future corrections from the user (hence: higher user satisfaction). The parameter $y_t$ reflects the user-preferred parameter value, and is given by $y_t = x_t^T b + e_t$. Finding the optimal $b$ coefficients boils down to a linear regression problem, which can be seen when analyzing the personalization problem [2].

¹Irrelevant features do not contribute to the output as such, whereas redundancy refers to features that are correlated with other features (and do not contribute to the output when the correlated features are also present).
We now turn to the problem of finding a sparse set of acoustic features for efficient processing on a hearing aid. Since user preferences are expected to change mainly over long-term usage, the coefficients $b$ are considered stationary for a certain data collection experiment. We therefore propose off-line feature selection based on a set of pre-collected preference data. To emphasize the off-line nature, we will index samples with $i$ rather than $t$ in the rest of the paper.
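As a concrete illustration of this formulation (not taken from the paper; data and variable names are our own), the following minimal NumPy sketch fits the user-specific coefficients $b$ by ordinary least squares from logged pairs of acoustic feature vectors and preferred parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data: N user adjustments, d acoustic features per adjustment.
N, d = 200, 6
X = rng.normal(size=(N, d))                 # acoustic feature vectors x_i
b_true = rng.normal(size=d)                 # "true" user-specific coefficients
y = X @ b_true + 0.1 * rng.normal(size=N)   # preferred values y_i = x_i^T b + e_i

# Least-squares estimate of b from the collected preference data.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat - b_true, 2))          # should be close to zero
```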
2. BAYESIAN FEATURE SELECTION
Backfitting [3] is a method for estimating additive linear models of the form

$y(\mathbf{x}) = \sum_{m=1}^{d} g_m(\mathbf{x}; \theta_m)$.

The non-linear basis functions $g_m$ absorb parameters $b_m$ as $g_m = b_m f_m(\mathbf{x}; \theta_m)$, and they contain auxiliary parameters $\theta_m$ that determine the shape of the basis function. Since we assume a linear map, the individual functions $f_m(\mathbf{x}; \theta_m)$ are also linear, and there are no auxiliary parameters $\theta_m$. Backfitting decomposes the statistical estimation problem into $d$ individual estimation problems by creating "fake targets" for each $g_m$ function. It effectively decouples the inference in each dimension, leading to a highly efficient algorithm.

2.1. Probabilistic backfitting
A probabilistic version of backfitting is derived in [4], where variables $z_{im}$ are introduced (figure 2) that act as unknown (fake) targets for each branch of regression input. One assumes that $z_{im}$ and $y_i$ are conditionally Gaussian:

$y_i \sim \mathcal{N}\!\left(\sum_{m=1}^{d} z_{im},\; \psi_y\right), \qquad z_{im} \sim \mathcal{N}\!\left(b_m f_m(\mathbf{x}_i),\; \psi_{z_m}\right) \quad (1)$

Parameters $\{\{b_m, \psi_{z_m}\}_{m=1}^{d}, \psi_y\}$ can be optimized using the EM framework for maximum likelihood fitting [5]. There one distinguishes between observed variables $\mathbf{x}_D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ and hidden variables $\mathbf{x}_H = \{\mathbf{z}_i\}$, and one maximizes the expected complete data log-likelihood $\langle \ln p(\mathbf{x}_D, \mathbf{x}_H; \phi)\rangle = \langle \ln p(\mathbf{y}, Z \mid X; \mathbf{b}, \boldsymbol{\psi}_z, \psi_y)\rangle$. Here, $Z$ denotes the $N$-by-$d$ matrix of all $z_{im}$, and design matrix $X$ contains the $f_m(\mathbf{x}_i)$ of all data points $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. The EM update equations for $b_m$, the noise variances $\psi_y$ and $\psi_{z_m}$, and the moments of $Z$ have complexity $O(d)$. For brevity, we only give the equation for the weights:

$b_m^{(n+1)} = b_m^{(n)} + w_m \, \frac{\sum_{i=1}^{N} \left( y_i - \sum_{k=1}^{d} b_k^{(n)} f_k(\mathbf{x}_i) \right) f_m(\mathbf{x}_i)}{\sum_{i=1}^{N} f_m(\mathbf{x}_i)^2} \quad (2)$

where $w_m = \psi_{z_m}/s$ and $s = \psi_y + \sum_{m=1}^{d} \psi_{z_m}$. This is 'probabilistic backfitting': when the factor $\psi_{z_m}/s$ is set to 1, the standard backfitting update is recovered.

Fig. 2. Graphical model for Bayesian backfitting, from [4].
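To make update (2) concrete, here is a minimal NumPy sketch of this probabilistic backfitting iteration for a purely linear model (so $f_m(\mathbf{x}_i) = x_{im}$). The fixed noise variances and the initialization are our own illustrative choices; the full algorithm of [4] also re-estimates $\psi_{z_m}$ and $\psi_y$ within EM.

```python
import numpy as np

def probabilistic_backfitting(X, y, psi_z, psi_y, n_iter=2000):
    """Iterate the probabilistic backfitting weight update of eq. (2).

    X     : (N, d) design matrix with f_m(x_i) = x_im (purely linear model)
    y     : (N,) targets
    psi_z : (d,) branch noise variances (kept fixed here for simplicity)
    psi_y : scalar output noise variance (kept fixed here for simplicity)
    """
    N, d = X.shape
    b = np.zeros(d)
    s = psi_y + psi_z.sum()
    w = psi_z / s                     # per-dimension step factor psi_zm / s
    denom = (X ** 2).sum(axis=0)      # sum_i f_m(x_i)^2 for each m
    for _ in range(n_iter):
        resid = y - X @ b             # y_i - sum_k b_k f_k(x_i)
        b = b + w * (X.T @ resid) / denom
    return b

# Toy usage on synthetic linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
b_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ b_true + 0.1 * rng.normal(size=500)
print(np.round(probabilistic_backfitting(X, y, psi_z=np.ones(5), psi_y=1.0), 2))
```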
2.2. Automatic relevance determination

Probabilistic backfitting makes it possible to use a Bayesian framework to regularize its least-squares solution against overfitting. Furthermore, for a certain choice of the prior over the coefficients $\mathbf{b}$ a sparse model is favored. E.g. one can place individual precision variables $\alpha_m$ over each of the regression parameters $b_m$ (figure 2) and choose the priors:

$\mathbf{b} \mid \boldsymbol{\alpha} \sim \prod_{m=1}^{d} \mathrm{Normal}(b_m;\, 0,\, 1/\alpha_m), \qquad \boldsymbol{\alpha} \sim \prod_{m=1}^{d} \mathrm{Gamma}(\alpha_m;\, \lambda_\alpha,\, \nu_\alpha^{(m)}) \quad (3)$

It can be shown [6] that the marginal prior over the coefficients is a multidimensional Student-t distribution, which places most of its probability mass along the axial ridges of the space. At these ridges the magnitude of only one of the parameters is large, hence favoring a sparse solution (this procedure is called automatic relevance determination or ARD). Extracting marginal posteriors such as $\ln p(\mathbf{b} \mid \mathbf{x}_D; \phi)$ is intractable, but can be done approximately using variational Bayes. In this approach, one assumes that variables and parameters factorize in the posterior, so the approximate joint posterior $Q(Z, \mathbf{b}, \boldsymbol{\alpha}) = Q(Z)\, Q(\mathbf{b}, \boldsymbol{\alpha})$. This leads to approximate marginal posteriors $Q(\mathbf{b})$, $Q(\boldsymbol{\alpha})$ of the form

$Q(\boldsymbol{\alpha}) = \prod_{m=1}^{d} \mathrm{Gamma}(\alpha_m;\, \hat{\lambda}_\alpha,\, \hat{\nu}_\alpha^{(m)}), \qquad Q(\mathbf{b}) = \prod_{m=1}^{d} t_{\hat{\nu}}(b_m;\, \mu_{b_m},\, \sigma^2_{b_m}) \quad (4)$

and $Q(Z)$ resembles the EM update expression, with parameters $\mathbf{b}$, $\boldsymbol{\alpha}$ replaced by their expectations $\langle \mathbf{b} \rangle$, $\langle \boldsymbol{\alpha} \rangle$. Note that in this approximation the posterior correlations between $\mathbf{b}$ and $\boldsymbol{\alpha}$ are retained [4]. Inference is done using the variational Bayesian EM algorithm, where one alternates variational E and M steps (inferring the above posteriors) with a maximization step in the hyperparameters. Substituting the expressions for $\langle z_{im} \rangle$ in the update equations for the distribution $Q(\mathbf{b})$ gives a 'backfitting-like' update for the regression coefficients [4]. The complexity of the full variational EM algorithm remains linear in the dimensionality $d$.
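For readers who want to experiment with ARD-style pruning without implementing the variational backfitting scheme of [4], scikit-learn's ARDRegression (a related but different ARD formulation, used here purely as an illustration) shows the same qualitative behaviour: precisions of irrelevant coefficients grow large and the corresponding posterior means shrink towards zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(2)

# Synthetic data: only the first 3 of 10 features are relevant.
N, d = 300, 10
X = rng.normal(size=(N, d))
b_true = np.zeros(d)
b_true[:3] = [2.0, -1.5, 1.0]
y = X @ b_true + 0.1 * rng.normal(size=N)

ard = ARDRegression()
ard.fit(X, y)
print(np.round(ard.coef_, 2))    # irrelevant coefficients shrunk towards zero
print(np.round(ard.lambda_, 1))  # learned precisions; large for irrelevant features
```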
2.3. Practical issues

When the hyperparameters $\lambda_\alpha, \nu_\alpha^{(m)}$ are initialized to small values, we effectively have an uninformative prior on the precision variables $\boldsymbol{\alpha}$. Because of the ARD mechanism, irrelevant components will have $\langle \alpha_m \rangle \to \infty$, so the posterior distribution over the corresponding coefficient $b_m$ will be narrow around zero. The significance of a posterior mean value unequal to zero can be tested directly via a t-test on the posterior means $\langle b_m \rangle$ (see [7]), since the marginal posterior over coefficients $b_m$ is a factorial t-distribution (4). This leaves a fully automatic regression and feature selection method, where the only remaining hyperparameters are the initial values for the noise variances $\psi_{z_m}, \psi_y$, the level of the t-test and the convergence criteria for the variational EM loop.
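A minimal sketch of such a relevance decision (with an illustrative significance level; the exact test statistic used in [7] may differ) keeps a feature when its posterior mean is significantly different from zero relative to its posterior standard deviation:

```python
import numpy as np
from scipy import stats

def relevant_features(b_mean, b_std, dof, alpha=0.05):
    """Keep feature m when the posterior mean <b_m> differs significantly from 0.

    b_mean : (d,) posterior means of the coefficients
    b_std  : (d,) posterior standard deviations of the coefficients
    dof    : degrees of freedom of the (factorial) t posterior
    """
    t_stat = np.abs(b_mean) / b_std
    threshold = stats.t.ppf(1.0 - alpha / 2.0, df=dof)  # two-sided test
    return np.where(t_stat > threshold)[0]

# Hypothetical posterior summaries for 5 coefficients.
print(relevant_features(np.array([1.2, 0.02, -0.8, 0.01, 0.3]),
                        np.array([0.1, 0.05, 0.2, 0.04, 0.25]), dof=50))
```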
3. FAST HEURISTIC FEATURE SELECTION

This section presents two fast greedy heuristic feature selection algorithms specifically tailored for the task of linear prediction. The algorithms apply (1) Forward Selection (FW) and (2) Backward Elimination (BW), which are known to be computationally attractive strategies that are robust against overfitting [8]. In our implementation both algorithms apply the following general procedure.

1. Preprocessing. For all features (input and output) subtract the mean and scale to unit variance. Features without variance are removed. For efficiency, second order statistics are pre-calculated on the full data.

2. Cross-validation. Repeat the following steps 10 times:
   (a) Split dataset: randomly take out 10% of the samples for validation. The statistics of the remaining 90% are used to generate the ranking.
   (b) Heuristically rank the features (see below).
   (c) Evaluate the ranking to find the number of features k that minimizes the validation error.

3. Wrap up. From all values k (found at 2c) select the median $k_m$ (round down in case of ties). Then, for all rankings, count the occurrences of a feature in the top-$k_m$ to select the $k_m$ most popular features and finally optimize their weights on the full dataset.

The difference between the two algorithms lies in the ranking strategy used at step 2b. However, before discussing this difference we first focus on the evaluation.
3.1. Efficient evaluation
The standard least squares error of a linear predictor, using weight vector $\mathbf{b}$ and ignoring a constant term for the output variance, is calculated by

$J = \mathbf{b}^T R \mathbf{b} - 2 \mathbf{r}^T \mathbf{b} \quad (5)$

where $R$ is the auto-correlation matrix defined as

$R = \sum_i \mathbf{x}_i \mathbf{x}_i^T \quad (6)$

and $\mathbf{r}$ is the cross-correlation vector defined as

$\mathbf{r} = \sum_i y_i \mathbf{x}_i \quad (7)$

Finding the optimal weights $\mathbf{b}$, using standard least-squares fitting, requires a well-conditioned invertible matrix $R$, which we ensure using the standard regularization technique of adding a small fraction to the diagonal elements of the correlation matrix. Since the regularized matrix $R$ is a non-singular symmetric positive definite matrix, we can use a Cholesky factorization, providing an upper triangular matrix $C$ satisfying the relation $C^T C = R$, to efficiently compute the least-squares solution

$\mathbf{b} = R^{-1}\mathbf{r} = C^{-1}(C^{-1})^T \mathbf{r} \quad (8)$

Moreover, intermediate solutions of actual weight values are often unnecessary, because it suffices to have an error measure for a particular subset $s$ (with auto- and cross-correlations $R_s$ and $\mathbf{r}_s$ obtained by selecting the corresponding rows and columns of $R$ and $\mathbf{r}$, and $C_s$ being the corresponding Cholesky factorization). We can therefore directly insert (8) into (5) to efficiently obtain the error on the training set using

$J_s = -\left((C_s^{-1})^T \mathbf{r}_s\right)^T \left((C_s^{-1})^T \mathbf{r}_s\right) \quad (9)$

Obtaining a Cholesky factorization from scratch, to test a selection of $k$ features, requires a computational complexity of $O(k^3)$; the subsequent matrix division then only requires $O(k^2)$. It is possible to update the factorization incrementally, reducing that complexity also to $O(k^2)$. Matlab's implementation of the complete factorization, however, proved …
3.2. Forward selection and backward elimination

Forward selection repetitively expands a set of features by always adding the most promising unused feature. Starting from an empty set, features are added one at a time. Once selected, features are never removed. To identify the most promising feature, FW investigates each (unused) feature, directly calculating errors using (9). In principle the procedure can provide a complete ordering of all features. The complexity, however, is dominated by the largest sets, so needlessly generating them is rather inefficient. FW therefore stops the search early once the minimal validation error has not decreased for at least 10 runs.

Backward elimination employs the reverse strategy of FW. Starting from a complete set of features it generates an ordering by each time taking out the least promising feature. To identify the least promising feature our algorithm investigates each feature still part of the set, and removes the one that provides the largest reduction (or smallest increase) of criterion (9). Since BW spends most time at the start, when the feature set is still large, not much can be gained using an early stopping criterion. Hence, in contrast to FW, BW always generates a complete ordering of all features.
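The greedy ranking itself is then only a few lines on top of the subset scorer above. This sketch is our own and omits the cross-validation loop and early-stopping rule described in the paper; it implements the forward selection ordering:

```python
import numpy as np

def forward_selection_ranking(R, r, subset_error):
    """Greedily rank features by repeatedly adding the one that lowers eq. (9) the most."""
    d = len(r)
    selected, remaining = [], list(range(d))
    while remaining:
        errors = [subset_error(R, r, selected + [m]) for m in remaining]
        best = remaining[int(np.argmin(errors))]
        selected.append(best)
        remaining.remove(best)
    return selected  # full greedy ordering; truncate at the k minimizing validation error

# Reusing R, r and subset_error from the previous sketch:
# print(forward_selection_ranking(R, r, subset_error))
```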
4. EXPERIMENTS

4.1. Evaluation of Bayesian backfitting
We generated artificial regression data as follows (a sketch of this procedure is given after the list):

1. Choose the total number of features $d$ and the number of irrelevant features $d_{ir}$. The number of relevant features is $d_{rel} = d - d_{ir}$.

2. Generate $N$ samples from a normal distribution of dimension $d - d_{ir}$. Pad the input vector with $d_{ir}$ zero dimensions.

3. Regression coefficients $b_m$, $m = 1, \ldots, d$ were drawn from a normal distribution, and coefficients with value $|b_m| < 0.5$ were clipped to 0.5. The first $d_{ir}$ coefficients were put to zero.

4. Optional: choose the number of redundant features $d_{red} = (d - d_{ir})/2$. The number of relevant features is now $d_{rel} = d_{red}$. Take the relevant features $[d_{ir}+1, \ldots, d_{ir}+d_{rel}]$, rotate them with a random rotation matrix and add them as redundant features by substituting features $[d_{ir}+d_{rel}+1, \ldots, d_{ir}+d_{rel}+d_{red}]$.

5. Outputs were generated according to the model, and Gaussian noise was added at an SNR of 10.

6. An independent test set was generated in the same manner, but the output noise was zero in this case (i.e. an infinite output SNR).

7. In all experiments, inputs and outputs were scaled to zero mean and unit variance after the data generation procedure was performed. Unnormalized weights were found by inverse transforming the weights found by the algorithms. The noise variance parameters $\psi_{z_m}$ and $\psi_y$ were initialized to $0.5/(d+1)$, thus assuming a total output noise variance of 0.5 initially.²
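The following NumPy sketch is our own reading of steps 1-7. For simplicity it makes the "irrelevant" dimensions inputs with zero regression coefficients (rather than literal zero padding, so that all features have non-zero variance), and the handling of the rotated redundant block is one plausible interpretation of step 4.

```python
import numpy as np
from scipy.stats import ortho_group

def generate_dataset(N, d, d_ir, redundant=False, snr=10.0, seed=0):
    """Synthetic regression data with irrelevant and (optionally) redundant features."""
    rng = np.random.default_rng(seed)
    # Steps 1-2 (simplified): draw all d input dimensions from a normal distribution;
    # the first d_ir dimensions are made irrelevant via zero coefficients in step 3.
    X = rng.normal(size=(N, d))
    # Step 3: coefficients from a normal distribution, small magnitudes clipped to 0.5,
    # the first d_ir coefficients set to zero (irrelevant features).
    b = rng.normal(size=d)
    b[np.abs(b) < 0.5] = 0.5
    b[:d_ir] = 0.0
    # Step 4 (optional): overwrite half of the relevant block with a random rotation
    # of the other half, making those features redundant.
    if redundant:
        d_red = (d - d_ir) // 2
        Q = ortho_group.rvs(d_red, random_state=seed)
        X[:, d_ir + d_red:d_ir + 2 * d_red] = X[:, d_ir:d_ir + d_red] @ Q
        b[d_ir + d_red:d_ir + 2 * d_red] = 0.0  # outputs depend only on the originals
    # Step 5: outputs from the linear model plus Gaussian noise at the given SNR.
    y_clean = X @ b
    y = y_clean + rng.normal(scale=np.std(y_clean) / np.sqrt(snr), size=N)
    # Step 7: standardize inputs and outputs after generation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()
    return X, y, b

X, y, b = generate_dataset(N=1000, d=15, d_ir=10)
print(X.shape, y.shape)
```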
4.1.1. Detecting irrelevant features
In the first experiment, $d_{ir} = 10$, so the first and the last five input features were irrelevant for predicting the output, and all other features were relevant. We varied the number of samples $N$ as [50, 100, 500, 1000, 10000] and studied two dimensionalities $d$ = [15, 50]. We repeated 10 runs of each feature selection experiment (with each time a new draw of the data), and trained both Bayesian and conventional feature selection methods on the data. The Bayesian method was trained for 200,000 cycles maximum, or until the likelihood was improving by less than 1e-4. We compared (i) labelling error, i.e. we counted the total number of mislabelings of a feature and normalized by the total number of features present in 10 runs; (ii) mean normalized prediction error on the test set; (iii) computational complexity (Matlab tic, toc). The mean labelling results over 10 repetitions³ are shown in figure 3.

Fig. 3. Mean labelling error versus log sample size. Upper graph is for $d = 50$, lower graph for $d = 15$.

We see that for both 15 and 50 features and moderate to high sample sizes⁴, variational Bayesian backfitting (VB) outperforms FW, and performs similarly to BW. For small sample sizes, FW and BW outperform VB.
²We noticed that initializing the noise variances to large values led to slow convergence with large sample sizes. Initializing to $0.5/(d+1)$ alleviated this problem.
³The result for $(d, N) = (50, 10000)$ is based on 5 runs.
⁴We define moderate sample size as $N = [100, \ldots, 1000]$ for $d = 15$ and $N = [1000, \ldots, 10000]$ for $d = 50$.
As for the prediction quality of the learned model using the selected features only, performance of all methods was similar (not shown). Apparently, including irrelevant features with small weights contributes little to the prediction error (but features may be expensive to compute!). Finally, the occasional improved accuracy of VB comes at the expense of larger training times (see figure 5).
4.1.2. Detecting redundant features
We then added redundant features to the data, i.e. we included optional step 4 in the data generation procedure. In a new experiment, $d$ was varied and the output SNR was set at 10. Since the relevant and the redundant features may be interchanged⁵, we determined the size of the redundant subset in each run (which should equal $d_{red}$ = [5, 10, 20] for $d$ = [20, 30, 50], respectively). In figure 4 we plot the mean size of the redundant subset estimated in 10 runs for different $d$, $d_{red}$, including one-standard-deviation error bars.

Fig. 4. Estimated $d_{red}$ vs. log sample size. Upper, middle and lower graphs are for $d$ = 50, 30, 20 and $d_{red}$ = 20, 10, 5.

Fig. 5. Running time vs. log sample size. Upper, middle and lower graphs for each method are for $d$ = 50, 30, 20.

For moderate sample sizes, both VB and the benchmark methods detect the redundant subset (though they are biased to somewhat larger values), but the accuracy of the VB estimate drops with large sample sizes. When inspecting the likelihood curves for these cases, it turned out that we did not reach convergence after 200,000 iterations, likely causing the worsening. Further, figure 5 shows that VB scales much less favourably with sample size than the benchmark methods, but on the other hand scales more favourably with input dimensionality. The superlinear scaling behaviour of the benchmark methods with the number of relevant features may eventually result in higher computation times for FW and BW. We conclude that VB is able to detect both irrelevant and redundant features in a relatively reliable manner for dimensionalities up to 50 (which was the maximum dimensionality studied) and moderate sample sizes. The benchmark methods seem more robust to small sample problems.

⁵A rotated set of relevant features may be considered by a feature selection method as more relevant than the original ones, in which case the originals become the redundant ones.

4.2. Feature selection in preference data
We implemented a hearing aid algorithm on a real-time platform and made one of the processing parameters of the virtual hearing aid on-line modifiable by the user. Six normal hearing subjects were exposed in a lab trial to an acoustic stimulus that consisted of several speech and noise snapshots picked from a database (each snapshot typically in the order of 10 seconds), which were combined in several ratios and appended. This led to one long stream of signal/noise episodes with different types of signals and noise in different ratios. The subjects were asked to listen to this stream several times in a row, and adjust the processing parameter as desired. Each time an adjustment was made, the acoustic input vector and the desired hearing aid processing parameter were stored. At the end of an experiment a set of input-output pairs was obtained, from which a regression model could be inferred using off-line training. We postulated that two types of features bear information about the user preference. Feature 1 was expected to be correlated with features 3 and 4 (features referred to as red1, red2, red3), whereas feature 2 was expected to be very different (referred to as indep). Further, features 2, 3 and 4 were computed at 6 different time scales, whereas the first feature was the same at all time scales, leading to 3 × 6 + 1 = 19 features. The number of adjustments for each of the subjects 1 to 6 was [43, 275, 703, 262, 99, 1020]. This means that we are in the realm of moderate sample size and moderate dimensionality, for which VB is accurate (see section 4). We then trained VB on the six datasets. In figure 6 we show, for subjects 3 and 6, a Hinton diagram of the posterior mean values for the variance (i.e. $1/\langle \alpha_m \rangle$).

Fig. 6. ARD-based selection of hearing aid features. Shown is a Hinton diagram of $1/\langle \alpha_m \rangle$, computed from preference data. Left: subject no. 3. Right: subject no. 6. Horizontal (left to right): time scale at which a feature is computed [1, 2, 3.5, 5, 7.5, 10]. Vertical (top to bottom): features [red1, indep, red2, red3]. Box size denotes relevance.

Subjects 3 and 6 adjust the hearing aid parameter primarily based on features indep and red2. Two other subjects only used feature indep, whereas one subject used all features indep, red1, red2, red3 (to some extent). One last subject's data could not be fit reliably (noise variances $\psi_{z_m}$ were high for all components).

5. DISCUSSION
From our synthetic data experiments, we conclude that VB is a useful method for doing accurate regression and feature selection at the same time, provided sample sizes are moderate to high and computation time is not an issue. When redundant features are present and sample sizes are high, one has to train for many iterations in order to reach convergence. However, labelling and prediction accuracies are comparable to the results obtained with our speed-optimized benchmark methods, which take much less computing time and scale more favourably with sample size. Only when dimensionalities are very high do we expect the linear scaling behaviour of VB with dimensionality to pay off. One further advantage of the VB approach over the benchmark methods is that one can compute a predictive distribution using the inferred parameter posteriors, allowing for predictions with confidence levels. From our preference data experiment, we noted that 4 out of 6 users personalized the hearing aid parameter based on (some of the) features indep and red2, which seem a good choice for inclusion in an on-line learning algorithm. For one of the users, either the sample size was too low, his preference was too noisy, or the linearity assumption of the model might not hold. In the future, an experimental setup with a different acoustic stimulus might be considered to possibly diminish the noise in the preference data.
6. CONCLUSIONS

We compared Bayesian backfitting feature selection to a speed-optimized heuristic approach. Results on synthetic data indicate that the method has labelling accuracy comparable to the fast heuristic approach (for moderate sample sizes), but takes much larger training times. VB is therefore a useful alternative to the benchmark methods FW and BW mainly when the sample size is moderate (accuracy is good), the number of features is high (linear scaling with dimensionality), and predictions with confidence levels are needed. In a hearing aid personalization problem, 4 out of 6 subjects showed preferences based on two types of features, giving valuable clues for feature choice in on-line algorithms.
7. ACKNOWLEDGMENTS
The authors would like to thank Job Geurts for help with the listening experiments and Tjeerd Dijkstra for useful discussions.
8. REFERENCES
[1] S. Launer and B. Moore, "Use of a loudness model for hearing aid fitting: On-line gain control in a digital hearing aid," Int. J. of Audiology, pp. 262-273, 2003.

[2] A. Ypma, B. de Vries, and J. Geurts, "Robust volume control personalization from on-line preference feedback," in Proc. 2006 IEEE International Workshop on Machine Learning for Signal Processing, MLSP'06, Maynooth, Ireland, September 6-8 2006, pp. 441-446.

[3] T. Hastie and R. Tibshirani, Generalized Additive Models, Chapman & Hall, 1990.

[4] A. A. D'Souza, Towards tractable parameter-free statistical learning, Ph.D. thesis, University of Southern California, 2004.

[5] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. of Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.

[6] M. Tipping, "Bayesian inference: An introduction to principles & practice in machine learning," in Advanced Lectures on Machine Learning, 2003, pp. 41-62.

[7] J. Ting et al., "Predicting EMG data from M1 neurons with variational Bayesian least squares," in Advances in Neural Information Processing Systems 18 (NIPS 2005), Cambridge, MA: MIT Press, 2005.

[8] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.