A Comparative Study on Variable Selection for Nonlinear Classifiers


C. Lu ¹ , T. Van Gestel ¹ , J. A. K. Suykens ¹ , S. Van Huffel ¹ , I. Vergote ² , D. Timmerman ²

¹ Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium

² Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium

Email address: chuan.lu@esat.kuleuven.ac.be


1. Introduction

Variable selection refers to the problem of selecting input variables that are relevant for a given task. In pattern recognition, variable selection can have an impact on the economics of data acquisition and on the accuracy and complexity of the classifiers.

This study aims at input variable selection for nonlinear blackbox classifiers, particularly multi-layer perceptrons (MLPs) and least squares support vector machines (LS-SVMs).

2. Feature extraction

Pattern recognition: feature extraction -> classification

Feature Extraction

• Feature selection or variable selection
• Feature transformation, e.g. PCA: not desirable when the original variables must be preserved, difficult to interpret, and not immune from distortion under transformation

Variable Selection

• Variable (feature) measure
• Heuristic search: forward, backward, stepwise, hill-climbing, branch and bound…
• Filter approach: filter out irrelevant attributes before induction occurs
• Wrapper approach: focus on finding attributes that are useful for the performance of a specific type of model, rather than necessarily finding the relevant ones

Variable measure

• Correlation
• Mutual information (MI)
• Evidence (or Bayes factor) in Bayesian framework
• Classification performance
• Sensitivity analysis: change in the objective function J by removing variable i: DJ(i)
• Statistical partial F test (Chi-square value)
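The difference between the filter and wrapper approaches listed above can be illustrated with a minimal sketch. It assumes NumPy, labels coded 0/1, absolute Pearson correlation as the filter measure, and a nearest-class-mean classifier as a stand-in model for the wrapper. The function names (`filter_rank`, `nearest_mean_accuracy`, `wrapper_forward`) are hypothetical and are not taken from this study.

```python
import numpy as np

def filter_rank(X, y):
    """Filter approach: rank variables by |Pearson correlation| with the label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(-score), score

def nearest_mean_accuracy(X_tr, y_tr, X_te, y_te):
    """Stand-in classifier (assumption): assign each point to the nearest class mean."""
    m0 = X_tr[y_tr == 0].mean(axis=0)
    m1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - m0, axis=1)
    d1 = np.linalg.norm(X_te - m1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_te).mean())

def wrapper_forward(X_tr, y_tr, X_va, y_va, k, score_fn=nearest_mean_accuracy):
    """Wrapper approach: greedy forward search driven by validation accuracy."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            scores.append(score_fn(X_tr[:, cols], y_tr, X_va[:, cols], y_va))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

The filter scores each variable once, independently of any classifier, whereas the wrapper re-evaluates the chosen model for every candidate subset, which is why wrappers are usually the more expensive option.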


3. Considered nonlinear classifiers: MLPs and LS-SVMs

LS-SVM Classifier

The following model is taken:

\[ \min_{w,b} J(w,b) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 , \]

\[ \text{s.t.} \quad y_i \, [ w^T \varphi(x_i) + b ] = 1 - e_i , \quad i = 1, \dots, N , \]

with regularizer $\gamma$ and $f(x) = w^T \varphi(x) + b$.

Denote $Y = [y_1, \dots, y_N]^T$, $1_v = [1, \dots, 1]^T$, $e = [e_1, \dots, e_N]^T$, $\alpha = [\alpha_1, \dots, \alpha_N]^T$, and $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$, e.g.

RBF kernel: $K(x, z) = \exp\{ -\| x - z \|_2^2 / \sigma^2 \}$

Linear kernel: $K(x, z) = z^T x$

The problem is solved in dual space, resulting in the linear system

\[ \begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \]

Resulting classifier:

\[ y(x) = \mathrm{sign}\Big[ \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \Big] \]

Note: by integrating the MLP (MacKay 1992) or the LS-SVM (Van Gestel, Suykens 2002) with the Bayesian evidence framework, the tuning of hyperparameters and the computation of posterior class probabilities can be done in a unified way. Variable selection can also be done based on the model evidence.
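As an illustration of the dual-space solution of the LS-SVM classifier, the following minimal sketch solves the standard LS-SVM linear system (Suykens and Vandewalle, reference 3) with NumPy. It assumes labels in {-1, +1}. The function names, the default `gamma` and `sigma` values, and the use of a dense solver are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2) for all row pairs of A and B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)

def lssvm_train(X, y, gamma=10.0, kernel=rbf_kernel):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1] for alpha and b."""
    N = len(y)
    Omega = np.outer(y, y) * kernel(X, X)      # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                      # alpha, b

def lssvm_predict(X_train, y_train, alpha, b, X_new, kernel=rbf_kernel):
    """y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    return np.sign(kernel(X_new, X_train) @ (alpha * y_train) + b)
```

Because training reduces to one (N+1) x (N+1) linear system, the cost grows with the number of samples rather than with the number of input variables.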

 

Bayesian Evidence Framework

Inferences are divided into distinct levels. Here $H$ denotes the model (for the MLP: the network structure, e.g. the number of hidden neurons; for the LS-SVM: the kernel parameter, e.g. $\sigma$ for an RBF kernel) and $\lambda$ denotes the regularization hyperparameter.

Level 1: infer $w$, $b$ for given $\lambda$, $H$

\[ p(w, b \mid D, \lambda, H) = \frac{ p(D \mid w, b, \lambda, H) \, p(w, b \mid \lambda, H) }{ p(D \mid \lambda, H) } \propto \exp( -J(w, b) ) \]

=> the maximum a posteriori estimates of $w$ and $b$ are the solution of the basic MLP/LS-SVM classifier.

Level 2: infer the hyperparameter $\lambda$

\[ p(\lambda \mid D, H) = \frac{ p(D \mid \lambda, H) \, p(\lambda \mid H) }{ p(D \mid H) } \]

Level 3: compare models

\[ p(H_j \mid D) = \frac{ p(D \mid H_j) \, p(H_j) }{ p(D) } \propto p(D \mid H_j) \, p(H_j) \]

Choose the $H_j$ which maximizes the model evidence $p(D \mid H_j)$.

Consider a binary classification problem, given $D = \{ (x_i, y_i) \}_{i=1,\dots,N}$, where $x_i \in R^p$, $y_i \in \{0, 1\}$ in case of MLP, $y_i \in \{-1, 1\}$ in case of LS-SVM.

MLP Classifiers

Consider the one hidden layer MLP:

\[ f(x) = g( a(x, w) ), \quad \text{where} \quad a(x, w) = w^{(2)} g'( w^{(1)} x + b^{(1)} ) + b^{(2)} , \]

with activation function of the hidden layer (here $g'$ names the hidden-layer activation, not a derivative):

\[ g'(a) = \tanh(a) = \frac{ \exp(a) - \exp(-a) }{ \exp(a) + \exp(-a) } , \]

and of the output layer, the logistic function:

\[ g(a) = \frac{1}{ 1 + \exp(-a) } . \]

The network is trained by

\[ \min_{w,b} J(w, b) = \frac{\lambda}{2} w^T w + G , \]

with regularizer $\lambda$, where the cross entropy error function is

\[ G = - \sum_{i=1}^{N} \{ y_i \log f(x_i) + (1 - y_i) \log( 1 - f(x_i) ) \} . \]
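The one-hidden-layer MLP and its regularized cross-entropy objective can be written down in a few lines. This is a minimal sketch, assuming NumPy, labels in {0, 1}, a regularizer that penalizes the weights but not the biases, and hypothetical names (`mlp_forward`, `mlp_objective`, `lam`). It only evaluates the objective and does not train the network.

```python
import numpy as np

def mlp_forward(X, W1, b1, w2, b2):
    """f(x) = logistic(w2 . tanh(W1 x + b1) + b2), evaluated for every row of X."""
    hidden = np.tanh(X @ W1.T + b1)       # shape (N, n_hidden)
    a = hidden @ w2 + b2                  # shape (N,)
    return 1.0 / (1.0 + np.exp(-a))

def mlp_objective(X, y, W1, b1, w2, b2, lam):
    """J = (lam / 2) w^T w + G, with G the cross-entropy error."""
    f = np.clip(mlp_forward(X, W1, b1, w2, b2), 1e-12, 1.0 - 1e-12)
    G = -np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))
    w = np.concatenate([W1.ravel(), w2.ravel()])   # weights only (assumption)
    return 0.5 * lam * (w @ w) + G
```

With all weights and biases at zero every output equals 0.5, so the objective reduces to N log 2, which is a convenient sanity check before handing `mlp_objective` to an optimizer.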


4. Considered variable selection methods

Method: Mutual information feature selection under uniform information distribution (MIFS-U) [8]
• Variable measure: mutual information I(X;Y)
• Search: greedy search; begin from no variables and repeatedly select a feature until a predefined number k of variables is selected
• Predefined parameters: density function estimation (parametric or nonparametric); here the simple discretization method is used
• (Dis)advantages: linear/nonlinear, easy to compute; computational problems increase with k for very high dimensional data; information is lost due to discretization

Method: Bayesian LS-SVM variable forward selection (LSSVMB-FFS) [1]
• Variable measure: model evidence P(D|H)
• Search: greedy search; select each time the variable that gives the highest increase in model evidence, until there is no more increase
• Predefined parameters: kernel type
• (Dis)advantages: (non)linear; automatically selects the number of variables that maximizes the evidence; Gaussian assumption; computationally expensive for high dimensional data

Method: LS-SVM recursive feature elimination (LSSVM-RFE) [7]
• Variable measure: sensitivity DJ(i); for a linear kernel, use (w_i)²
• Search: recursively remove the variable(s) that have the smallest DJ(i)
• Predefined parameters: kernel type, regularization and kernel parameters
• (Dis)advantages: suitable for very high dimensional data; computationally expensive for large sample sizes and nonlinear kernels

Method: Stepwise logistic regression (SLR)
• Variable measure: Chi-square (statistical partial F-test)
• Search: stepwise; recursively add or remove a variable at each step
• Predefined parameters: p-values for determining addition or removal of variables in models
• (Dis)advantages: linear, easy to compute; troubles in case of multicollinearity
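The greedy mutual-information search can be sketched as follows. The sketch assumes NumPy and equal-width discretization. It also assumes that the MIFS-U selection criterion takes the form I(C; f) - beta * sum over selected s of [I(C; s) / H(s)] * I(f; s), which should be checked against reference [8]. The function names, the `beta` weight and the number of bins are illustrative choices, not values from this study.

```python
import numpy as np

def discretize(x, bins=10):
    """Equal-width binning of a continuous variable into integer codes."""
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]
    return np.digitize(x, edges)

def entropy(codes):
    """Shannon entropy (in bits) of a vector of integer codes."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_info(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for non-negative integer codes."""
    joint = a.astype(np.int64) * (int(b.max()) + 1) + b   # one code per (a, b) pair
    return entropy(a) + entropy(b) - entropy(joint)

def mifs_u(X, y, k, beta=1.0, bins=10):
    """Greedy forward selection of k variables with the assumed MIFS-U criterion."""
    p = X.shape[1]
    Xd = np.column_stack([discretize(X[:, j], bins) for j in range(p)])
    _, yc = np.unique(y, return_inverse=True)             # class labels as 0..C-1
    relevance = np.array([mutual_info(Xd[:, j], yc) for j in range(p)])
    H = np.array([entropy(Xd[:, j]) for j in range(p)])
    selected, remaining = [], list(range(p))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            penalty = sum(relevance[s] / H[s] * mutual_info(Xd[:, j], Xd[:, s])
                          for s in selected if H[s] > 0)
            score = relevance[j] - beta * penalty
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

The penalty term discounts candidates that share information with variables already selected, and the discretization step is where the information loss noted for this method occurs.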

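Sensitivity measure DJ(i) used by LSSVM-RFE:

\[ DJ(i) = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (Dw_i)^2 \]

For a nonlinear kernel, let $\alpha$ unchanged; need to recompute the kernel matrix with variable $i$ removed, $K(-i)$:

\[ DJ(i) = \frac{1}{2} \alpha^T \mathrm{diag}(Y) \, K \, \mathrm{diag}(Y) \, \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(Y) \, K(-i) \, \mathrm{diag}(Y) \, \alpha , \]

where $Y = [y_1, \dots, y_N]^T$ is the label vector and $K_{ij} = K(x_i, x_j)$ is the kernel matrix.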
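Recursive feature elimination with the sensitivity measure DJ(i) can be sketched as below. The sketch assumes NumPy, labels in {-1, +1}, removal of one variable per iteration, and a polynomial kernel of the form (x^T z + 1)^d. The function names and the default `gamma` are assumptions for illustration. For the linear kernel DJ(i) reduces to (w_i)²/2, which gives the same ranking as (w_i)².

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def poly_kernel(A, B, degree=2):
    return (A @ B.T + 1.0) ** degree

def lssvm_alpha(K, y, gamma):
    """Solve the LS-SVM dual linear system and return the support values alpha."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    return np.linalg.solve(A, rhs)[1:]

def lssvm_rfe(X, y, n_keep, gamma=1.0, kernel=poly_kernel):
    """Repeatedly drop the variable with the smallest DJ(i) until n_keep remain."""
    active = list(range(X.shape[1]))
    eliminated = []
    while len(active) > n_keep:
        Xa = X[:, active]
        K = kernel(Xa, Xa)
        v = lssvm_alpha(K, y, gamma) * y          # alpha_i * y_i, kept fixed below
        base = v @ K @ v
        dj = []
        for pos in range(len(active)):
            X_minus = np.delete(Xa, pos, axis=1)  # data without variable `pos`
            dj.append(0.5 * (base - v @ kernel(X_minus, X_minus) @ v))
        worst = int(np.argmin(dj))
        eliminated.append(active.pop(worst))
    return active, eliminated
```

Each iteration retrains the LS-SVM once and recomputes one kernel matrix per remaining variable, which is where the cost for large sample sizes and nonlinear kernels comes from.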

5. Experimental results on benchmark data sets

I. Synthetic data: noisy XOR problem

The problem is linearly inseparable. 50 randomly generated input data:
• X1, X2: random in {0,1}; Y: XOR(X1, X2)
• X3, X4: noise ~ N(0, 0.3) added to X1, X2
• X5, X6: noise ~ N(0, 0.5) added to X1, X2
• X7~X16: noise ~ N(0, 2)

Variable ranking (LSSVM-RFE): first 2 selected from

x1, x2    x1~x4    x1~x6
25/30     26/30    28/30

Classifier   top 2 variables: Acc (SD)   all 16 variables: Acc (SD)
MLP          0.93 (0.11)                 0.62 (0.076)
LSSVM        1 (0)                       0.66 (0.044)

Table 1. LSSVM-RFE (using a polynomial kernel with degree 2) correctly selected the top 2 variables 25 times out of the 30 random trials based on the 50 noisy training data. Performance is averaged on a test set of 100 examples over 30 random trials.

Notes:
- The MLP has 1 hidden layer with 2 hidden neurons, using the Bayesian MLP to determine the regularization parameter.
- The LSSVM classifier uses a polynomial kernel with degree 2.
- Linear classifiers and linear selection methods cannot solve the XOR problem, which is nonlinear.
- MIFS-U: the entropy of the first 2 binary variables is smaller than that of the other continuous variables.
- Bayesian LSSVM FFS: the evidence for the first 2 binary variables is smaller than that for the other continuous variables; however, backward Bayesian LSSVM selection can always remove the other noisy variables.

II. Biomedical real life data set

(1) Gene selection for leukemia classification []

#variables: 7129; classes: ALL, AML; #training data: 38; #test data: 34

Classifier       #V   MIFS-U   LSSVM-RFE   LSSVMB-FFS   SLR
LSSVM (linear)   4    0.912    0.942       0.735        0.912
LSSVM (linear)   8    0.912    0.942       0.765        0.912
LR               4    0.912    0.794       0.794        0.912
LR               8    0.882    0.735       0.765        0.882

Table 2. Accuracy on test set with different number of variables (#V). Columns MIFS-U to SLR are the variable selection methods.

* Linear kernels are used for LSSVM-RFE and LSSVMB-FFS.
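For reference, the noisy XOR benchmark described under part I could be generated along the following lines. This is a sketch under stated assumptions: NumPy is used, the second argument of N(0, ·) is treated as a standard deviation (the source does not say whether it is a standard deviation or a variance), labels are coded 0/1, and the function name `make_noisy_xor` is hypothetical.

```python
import numpy as np

def make_noisy_xor(n=50, seed=None):
    """Return X with 16 variables and y = XOR(X1, X2) for n random samples."""
    rng = np.random.default_rng(seed)
    x12 = rng.integers(0, 2, size=(n, 2)).astype(float)    # X1, X2 in {0, 1}
    y = np.logical_xor(x12[:, 0], x12[:, 1]).astype(int)   # Y = XOR(X1, X2)
    x34 = x12 + rng.normal(0.0, 0.3, size=(n, 2))          # X3, X4: noisy copies
    x56 = x12 + rng.normal(0.0, 0.5, size=(n, 2))          # X5, X6: noisier copies
    x7_16 = rng.normal(0.0, 2.0, size=(n, 10))             # X7..X16: pure noise
    X = np.hstack([x12, x34, x56, x7_16])
    return X, y
```

Calling `make_noisy_xor(50)` for training and `make_noisy_xor(100)` for testing, repeated over 30 seeds, would mirror the trial setup reported in Table 1.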

(2) Ovarian tumor classification

#variables: 27; classes: benign, malignant; #training data: 265; #test data: 160

Classifier   #V   MIFS-U   LSSVM-RFE   LSSVMB-FFS   SLR

Table 3. Accuracy on test set with different number of variables

6. Conclusions

• Good variable selection can improve the performance of the classifiers in both accuracy and computation.
• LSSVM-RFE is suitable for both linear and nonlinear classification problems, and can deal with very high dimensional data.
• Bayesian LSSVM forward selection can identify the important variables in some cases, but should be used with care, since its assumptions need to be satisfied.
• A strategy that combines variable ranking with wrapper methods should give more confidence in the selected variables.

References

1. C. Lu, T. Van Gestel, et al., Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines (2002), submitted paper.
2. D. Timmerman, H. Verrelst, et al., Artificial neural network models for the preoperative discrimination between malignant and benign adnexal masses, Ultrasound Obstet Gynecol (1999).
3. J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters (1999), 9(3).
4. T. Van Gestel, J.A.K. Suykens, et al., Bayesian framework for least squares support vector machine classifiers, Neural Computation (2002), 14(5).
5. D.J.C. MacKay, The evidence framework applied to classification networks, Neural Computation (1992), 4(5).
6. R. Kohavi and G. John, Wrappers for feature subset selection, Artificial Intelligence, special issue on relevance (1997), 97(1-2):273-324.
7. I. Guyon, J. Weston, et al., Gene selection for cancer classification using support vector machines, Machine Learning (2000).
8. N. Kwak and C.H. Choi, Input feature selection for classification problems, IEEE Transactions on Neural Networks (2002), 13(1).
