

Feature Selection via Detecting Ineffective Features

Kris De Brabanter
Dep. Electrical Engineering, Katholieke Universiteit Leuven
Kasteelpark Arenberg 10, 3001 Leuven
kris.debrabanter@esat.kuleuven.be

László Györfi
Dep. Computer Science & Information Theory, Budapest University of Technology and Economics
Magyar Tudósok körútja 2., Budapest
gyorfi@szit.bme.hu

Abstract: Consider the regression problem with a response variable $Y$ and a feature vector $X$. For the regression function $m(x) = \mathbb{E}\{Y \mid X = x\}$, we introduce a new and simple estimator of the minimum mean squared error $L^* = \mathbb{E}\{(Y - m(X))^2\}$. Let $X^{(-k)}$ be the feature vector in which the $k$-th component of $X$ is missing. In this paper we analyze a nonparametric test for the hypothesis that the $k$-th component is ineffective, i.e., $\mathbb{E}\{Y \mid X\} = \mathbb{E}\{Y \mid X^{(-k)}\}$ a.s.

Keywords: feature selection, minimum mean squared error, hypothesis test

1 Introduction

Let the label $Y$ be a real-valued random variable and let the feature vector $X = (X_1, \dots, X_d)$ be a $d$-dimensional random vector. The regression function $m$ is defined by

$$m(x) = \mathbb{E}\{Y \mid X = x\}.$$

The minimum mean squared error, also called the variance of the residual $Y - m(X)$, is denoted by

$$L^* := \mathbb{E}\{(Y - m(X))^2\} = \min_f \mathbb{E}\{(Y - f(X))^2\}.$$

The regression function $m$ and the minimum mean squared error $L^*$ cannot be calculated when the distribution of $(X, Y)$ is unknown. Assume, however, that we observe data

$$D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$$

consisting of independent and identically distributed copies of $(X, Y)$. $D_n$ can be used to produce an estimate of $L^*$. Nonparametric estimates of the minimum mean squared error are given in [2, 4].

2 New estimate of the minimum mean squared error

One can derive a new and simple estimator of $L^*$ by considering the definition

$$L^* = \mathbb{E}\{(Y - m(X))^2\} = \mathbb{E}\{Y^2\} - \mathbb{E}\{m(X)^2\}. \tag{1}$$

The first and second terms on the right-hand side of (1) can be estimated by $\frac{1}{n}\sum_{i=1}^{n} Y_i^2$ and $\frac{1}{n}\sum_{i=1}^{n} Y_i Y_{n,i,1}$, respectively, where $Y_{n,i,1}$ denotes the label of the first nearest neighbor of $X_i$ among $X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$. Therefore, the minimum mean squared error $L^*$ can be estimated by

$$\tilde{L}_n := \frac{1}{n}\sum_{i=1}^{n} Y_i^2 - \frac{1}{n}\sum_{i=1}^{n} Y_i Y_{n,i,1}. \tag{2}$$
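The reasoning behind (1) and (2) can be spelled out in a short derivation. This is a sketch that assumes $\mathbb{E}\{Y^2\} < \infty$; the symbol $X_{n,i,1}$ for the first nearest neighbor of $X_i$ is introduced here only for illustration, and the final approximation relies on $X_{n,i,1}$ being close to $X_i$ for large $n$.

```latex
\begin{align*}
\mathbb{E}\{(Y - m(X))^2\}
  &= \mathbb{E}\{Y^2\} - 2\,\mathbb{E}\{Y\, m(X)\} + \mathbb{E}\{m(X)^2\} \\
  &= \mathbb{E}\{Y^2\} - 2\,\mathbb{E}\{\mathbb{E}\{Y \mid X\}\, m(X)\} + \mathbb{E}\{m(X)^2\} \\
  &= \mathbb{E}\{Y^2\} - \mathbb{E}\{m(X)^2\}, \\[4pt]
\mathbb{E}\{Y_i\, Y_{n,i,1} \mid X_1, \dots, X_n\}
  &= m(X_i)\, m(X_{n,i,1}) \approx m(X_i)^2 .
\end{align*}
```

The last line uses the fact that, given the feature vectors, the labels of two distinct sample points are independent, so the product of a label with its neighbor's label targets $m(X)^2$ without any explicit estimate of $m$.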

One can show without any conditions that $\tilde{L}_n \to L^*$ a.s. Moreover, for bounded $|Y|$ and $\|X\|$, for Lipschitz continuous $m$, and for $d \ge 2$, we have (cf. [3])

$$\mathbb{E}\{|\tilde{L}_n - L^*|\} \le c_1 n^{-1/2} + c_2 n^{-2/d}.$$
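As a rough illustration of estimator (2), the following Python sketch computes it with a brute-force nearest-neighbor search. The function names `first_nn_indices` and `mmse_estimate`, the Euclidean metric, the tie-breaking by lowest index, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def first_nn_indices(X):
    """For each row of X, return the index of its nearest other row (Euclidean)."""
    sq = np.sum(X * X, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argmin(d2, axis=1)


def mmse_estimate(X, Y):
    """Estimate (2): mean(Y_i^2) - mean(Y_i * Y_{n,i,1})."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    nn = first_nn_indices(X)
    return float(np.mean(Y * Y) - np.mean(Y * Y[nn]))


# Toy data (assumed): m(x) = sin(pi * x_1) with noise variance 0.25,
# so the minimum mean squared error equals 0.25.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
Y = np.sin(np.pi * X[:, 0]) + 0.5 * rng.standard_normal(2000)
estimate = mmse_estimate(X, Y)
print(estimate)  # expected to lie in the vicinity of 0.25
```

The brute-force search costs $O(n^2 d)$ time and $O(n^2)$ memory, which is adequate for a few thousand points; a tree-based search would be the natural replacement for larger samples.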

3 Feature Selection and Hypothesis Test

One way of feature selection would be to detect ineffective components of the feature vector. Let $X^{(-k)} = (X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_d)$ be the $(d-1)$-dimensional feature vector obtained by leaving out the $k$-th component from $X$. Then the corresponding minimum error is

$$L^{*(-k)} := \mathbb{E}\left\{\left(Y - \mathbb{E}\{Y \mid X^{(-k)}\}\right)^2\right\}.$$

We want to test the following (null) hypothesis:

$$H_k : L^{*(-k)} = L^*,$$

which means that leaving out the $k$-th component does not increase the minimum mean squared error. The hypothesis $H_k$ means that

$$m(X) = \mathbb{E}\{Y \mid X\} = \mathbb{E}\{Y \mid X^{(-k)}\} =: m^{(-k)}(X^{(-k)}) \quad \text{a.s.}$$

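By using the data

$$D_n^{(-k)} = \{(X_1^{(-k)}, Y_1), \dots, (X_n^{(-k)}, Y_n)\},$$

$L^{*(-k)}$ can be estimated by

$$\tilde{L}_n^{(-k)} := \frac{1}{n}\sum_{i=1}^{n} Y_i^2 - \frac{1}{n}\sum_{i=1}^{n} Y_i Y_{n,i,1}^{(-k)},$$

where $Y_{n,i,1}^{(-k)}$ denotes the label of the first nearest neighbor of $X_i^{(-k)}$ among the remaining reduced feature vectors, so the corresponding test statistic is

$$\tilde{L}_n^{(-k)} - \tilde{L}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i \left(Y_{n,i,1} - Y_{n,i,1}^{(-k)}\right).$$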

We can accept the hypothesis $H_k$ if $\tilde{L}_n^{(-k)} - \tilde{L}_n$ is "close" to zero. However, with large probability the first nearest neighbors of $X_i$ and of $X_i^{(-k)}$ are the same sample point, in which case $Y_{n,i,1} - Y_{n,i,1}^{(-k)} = 0$ and the corresponding term vanishes from the test statistic. The probability $\mathbb{P}(Y_{n,i,1} = Y_{n,i,1}^{(-k)})$ decreases as $n$ increases (with $d$ fixed) and, conversely, increases as $d$ increases (with $n$ fixed). Hence, this test statistic can be small even when the hypothesis $H_k$ is not true.
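How often the two nearest neighbors coincide can be checked empirically. The sketch below is an illustration under assumed settings (uniform features on the unit cube, Euclidean distance, $n = 500$, dropping the first component); the helper names are hypothetical. It reports the fraction of sample points whose first nearest neighbor is unchanged when one component is removed.

```python
import numpy as np


def first_nn_indices(X):
    """For each row of X, return the index of its nearest other row (Euclidean)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # exclude the point itself
    return np.argmin(d2, axis=1)


def coincidence_fraction(X, k):
    """Fraction of points whose first nearest neighbor is unchanged after dropping column k."""
    nn_full = first_nn_indices(X)
    nn_drop = first_nn_indices(np.delete(X, k, axis=1))
    return float(np.mean(nn_full == nn_drop))


rng = np.random.default_rng(1)
n = 500
# One independent uniform sample per dimension d; component 0 is dropped each time.
fractions = {d: coincidence_fraction(rng.uniform(size=(n, d)), k=0) for d in (2, 5, 10)}
print(fractions)
```

Every coinciding pair contributes an exact zero to the unmodified statistic, regardless of whether the dropped component carries information about $Y$.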

To correct for this problem we modify the test statistic such that

$$\left(\hat{Y}_{n,i,1}, \hat{Y}_{n,i,1}^{(-k)}\right) = \left(Y_{n,i,1}, Y_{n,i,1}^{(-k)}\right) \quad \text{if } Y_{n,i,1} \ne Y_{n,i,1}^{(-k)},$$

and

$$\left(\hat{Y}_{n,i,1}, \hat{Y}_{n,i,1}^{(-k)}\right) = I_i \left(Y_{n,i,2}, Y_{n,i,1}^{(-k)}\right) + (1 - I_i)\left(Y_{n,i,1}, Y_{n,i,2}^{(-k)}\right) \quad \text{otherwise}$$

(where $Y_{n,i,2}$ denotes the label of the second nearest neighbor of $X_i$ among $X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$), with

$$I_i = \begin{cases} 0 & \text{with probability } 1/2, \\ 1 & \text{with probability } 1/2, \end{cases}$$

yielding

$$\hat{L}_n^{(-k)} - \hat{L}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i \left(\hat{Y}_{n,i,1} - \hat{Y}_{n,i,1}^{(-k)}\right).$$
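The modification can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `two_nn_indices` and `modified_statistic` are hypothetical, neighbors are compared by index rather than by label (the two agree almost surely for continuous labels), and the toy data at the end are assumed.

```python
import numpy as np


def two_nn_indices(X):
    """Indices of the first and second nearest neighbors of each row (Euclidean)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # exclude the point itself
    order = np.argsort(d2, axis=1)
    return order[:, 0], order[:, 1]


def modified_statistic(X, Y, k, rng):
    """Mean of Y_i * (Yhat_{n,i,1} - Yhat^{(-k)}_{n,i,1}); needs d >= 2 and n >= 3."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    nn1, nn2 = two_nn_indices(X)
    nn1k, nn2k = two_nn_indices(np.delete(X, k, axis=1))

    # Assumption: coincidence is detected by comparing neighbor indices.
    same = nn1 == nn1k
    coin = rng.integers(0, 2, size=len(Y)).astype(bool)  # I_i, a fair coin

    # I_i = 1: use the second neighbor in the full space.
    full_idx = np.where(same & coin, nn2, nn1)
    # I_i = 0: use the second neighbor in the reduced space.
    drop_idx = np.where(same & ~coin, nn2k, nn1k)
    return float(np.mean(Y * (Y[full_idx] - Y[drop_idx])))


# Toy data (assumed): only component 0 influences Y.
rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(size=(n, 2))
Y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)
stat_effective = modified_statistic(X, Y, k=0, rng=rng)
stat_ineffective = modified_statistic(X, Y, k=1, rng=rng)
print(stat_effective, stat_ineffective)
```

In this toy setting the statistic for the effective component should be clearly positive, while the one for the ineffective component should stay near zero.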

As in classical hypothesis testing, we need to find the limit distribution of the test statistic. The main difficulty here is that $\hat{L}_n^{(-k)} - \hat{L}_n$ is an average of dependent random variables. However, this dependence has a special property, called exchangeability. Based on a central limit theorem for exchangeable arrays [1], we can show the following result.

Theorem 1 Under the conditions of [1, Theorem 2], we have that

$$\sqrt{n}\left(\hat{L}_n^{(-k)} - \hat{L}_n\right) \xrightarrow{d} N\left(0, 2 L^* \mathbb{E}\{Y^2\}\right)$$

under the null hypothesis $H_k$.

In the above theorem, $L^*$ and $\mathbb{E}\{Y^2\}$ can be estimated by (2) and $\frac{1}{n}\sum_{i=1}^{n} Y_i^2$, respectively. Note that such a result is quite surprising, since under $H_k$ neither the smoothness of the regression function $m$ nor the dimension $d$ enters the limit distribution.
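Given the normal limit in Theorem 1, a decision rule can be sketched as below. The function name `ineffective_feature_test` and the numeric inputs are hypothetical, and the choice of a one-sided test (rejecting only for large positive values, since leaving out a component cannot decrease the minimum mean squared error) is an assumption; the paper does not state which rejection region it uses.

```python
import math


def ineffective_feature_test(stat, mmse_hat, second_moment_hat, n, alpha=0.05):
    """Return (z, p_value, accept) for H_k using the normal limit of Theorem 1.

    stat              -- value of the modified test statistic
    mmse_hat          -- estimate of the minimum mean squared error, e.g. from (2)
    second_moment_hat -- estimate of E{Y^2}, e.g. the mean of Y_i^2
    """
    variance = 2.0 * mmse_hat * second_moment_hat
    if variance <= 0.0:
        raise ValueError("variance estimate must be positive")
    z = math.sqrt(n) * stat / math.sqrt(variance)
    # Assumption: one-sided test, upper tail of the standard normal.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value, p_value >= alpha


# Illustrative numbers only (not taken from the paper).
z, p_value, accept = ineffective_feature_test(
    stat=0.004, mmse_hat=0.01, second_moment_hat=0.26, n=1000
)
print(z, p_value, accept)
```

Because estimate (2) can be non-positive in small samples, the sketch refuses to standardize in that case rather than returning a misleading value.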

4 Simulations

First, consider the following nonlinear function with 4 uniformly distributed inputs on $[0, 1]^4$ with $n = 1{,}000$:

$$Y = \sin(\pi X^{(1)}) \cos(\pi X^{(4)}) + \varepsilon, \quad \text{with } \varepsilon \sim N(0, 0.1^2).$$

Figure 1(a) illustrates how frequently the true subset, the true subset with an additional component, and the full subset are selected by the proposed test procedure during 1,000 runs. The significance level is set to 0.05.
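A single run of this first experiment might look as follows in Python. The sketch is not the authors' code: the helper names are hypothetical, the brute-force Euclidean neighbor search, the index-based coincidence check, and the one-sided threshold of about 1.645 are assumptions, and the selection rule actually used to produce Figure 1(a) is not specified in the text.

```python
import math
import numpy as np


def two_nn_indices(X):
    """Indices of the first and second nearest neighbors of each row (Euclidean)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # exclude the point itself
    order = np.argsort(d2, axis=1)
    return order[:, 0], order[:, 1]


def z_score_for_component(X, Y, k, rng):
    """Modified statistic for component k, standardized by the limit in Theorem 1."""
    n = len(Y)
    nn1, nn2 = two_nn_indices(X)
    nn1k, nn2k = two_nn_indices(np.delete(X, k, axis=1))
    same = nn1 == nn1k                                # neighbors coincide
    coin = rng.integers(0, 2, size=n).astype(bool)    # I_i, a fair coin
    full_idx = np.where(same & coin, nn2, nn1)
    drop_idx = np.where(same & ~coin, nn2k, nn1k)
    stat = np.mean(Y * (Y[full_idx] - Y[drop_idx]))
    second_moment = np.mean(Y * Y)
    mmse = second_moment - np.mean(Y * Y[nn1])        # estimate (2)
    return float(math.sqrt(n) * stat / math.sqrt(2.0 * mmse * second_moment))


rng = np.random.default_rng(3)
n, d = 1000, 4
X = rng.uniform(size=(n, d))
# Components 1 and 4 of the paper correspond to columns 0 and 3 here.
Y = np.sin(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 3]) + 0.1 * rng.standard_normal(n)

z_scores = [z_score_for_component(X, Y, k, rng) for k in range(d)]
threshold = 1.6449  # approximate upper 5% point of N(0, 1); one-sided test assumed
flagged_effective = [k + 1 for k, z in enumerate(z_scores) if z > threshold]
print(z_scores, flagged_effective)
```

Repeating such a run many times and tallying the flagged subsets would give frequencies of the kind summarized in Figure 1(a).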

Second, we experimentally verify Theorem 1 by means of bootstrap (10,000 replications). Consider the following five-dimensional function with additive noise:

$$Y = \sum_{i=1}^{5} c_i X^{(i)} + \varepsilon,$$

where $c_1 = 0$ and $c_i = 1$ for $i = 2, \dots, 5$. Let $X$ be uniform on $[0, 1]^5$ and $\varepsilon \sim N(0, 0.05^2)$. Figure 1(b) shows the histogram of $\sqrt{n}\left(\hat{L}_n^{(-k)} - \hat{L}_n\right)$ under the null hypothesis, i.e., $H_k$ for $k = 1$. A Kolmogorov-Smirnov test confirms this result.

[Figure 1. Panel (a): bar chart, vertical axis "Number of times selected (%)", categories Rest, True Subset, True Subset+1, ALL. Panel (b): density histogram, horizontal axis "Data", vertical axis "Density".]

Fig. 1: (a) Frequency of selecting the true subset, the true subset with an additional component, and the full subset. "Rest" denotes that at least one component of the true subset is selected; (b) Density histogram of $\sqrt{n}\left(\hat{L}_n^{(-k)} - \hat{L}_n\right)$ under the null hypothesis with the corresponding normal fit.

5 Conclusion

We have presented a simple nonparametric hypothesis test for detecting ineffective features. The simulations illustrate the capability of the proposed methodology.

Acknowledgments

Kris De Brabanter is supported by an FWO fellowship grant.

László Györfi was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013).

References

[1] J.R. Blum, H. Chernoff, M. Rosenblatt & H. Teicher. Central limit theorems for interchangeable processes. Canad. J. Math., 10:222–229, 1958.

[2] L. Devroye, D. Schäfer, L. Györfi & H. Walk. The estimation problem of minimum mean squared error. Statistics and Decisions, 21(1):15–28, 2003.

[3] L. Devroye, P. Ferrario, L. Györfi & H. Walk. Strong universal consistent estimate of the minimum mean squared error. Submitted, 2013.

[4] E. Liitiäinen, F. Corona & A. Lendasse. Residual variance estimation using a nearest neighbor statistic. J. Multivariate Anal., 101(4):811–823, 2010.
