Variable Selection Using Linear Sparse Bayesian Models for Medical Classification Problems
C. Lu (1), J. A. K. Suykens (1), S. Van Huffel (1)
Email: {Chuan.Lu, Johan.Suykens, Sabine.VanHuffel}@esat.kuleuven.ac.be
(1) Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
Belgian Day on Biomedical Engineering, Oct 17, 2003
Introduction
Reasons for performing variable selection in medical classification problems:
It has an impact on the economics of data acquisition.
It affects the accuracy and complexity of the classifiers.
It is helpful in understanding the underlying mechanism that generated the data.
Our proposed method in this work:
Select the variables using Tipping's fast sparse Bayesian learning method with linear basis functions.
Use the selected variables in different types of linear classifiers.
Introduction
Application
Linear models used: linear discriminant analysis (LDA) models, logistic regression (LR) models, relevance vector machines (RVM) with linear kernels, and Bayesian least squares support vector machines (LS-SVM) with linear kernels.
We applied the method to real-life medical classification problems, including two cancer diagnosis problems based on micro-array data and a brain tumor classification problem based on MRS spectra.
The generalization performance of the compared models can be improved via the proposed variable selection procedure.
Sparse Bayesian Modeling
Principles
Sparse Bayesian learning is the application of Bayesian automatic relevance determination (ARD) to models linear in their parameters, by which sparse solutions to regression or classification tasks can be obtained [1].
The predictions are based upon a function y(x) defined over the input space:

  y(x; w) = w^T \phi(x)

Two forms for the basis functions have been considered:

  \phi(x) = (1, x_1, \ldots, x_d)^T, i.e. the original input variables;

  \phi(x) = (1, K(x, x_1), \ldots, K(x, x_N))^T,

where K(\cdot, \cdot) denotes a symmetric kernel function. Support vector machines (SVMs) and relevance vector machines (RVMs) generally adopt the kernel representation. Examples of symmetric kernels K(\cdot, \cdot):

  RBF kernel: K(x, z) = \exp\{ -\|x - z\|^2 / r^2 \}
  Linear kernel: K(x, z) = x^T z
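As an illustration, the two basis forms above can be assembled into design matrices (a minimal numpy sketch; the function names are our own):

```python
import numpy as np

def linear_basis(X):
    """phi(x) = (1, x_1, ..., x_d)^T for each row of X -> design matrix (N, d+1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def rbf_kernel_basis(X, centers, width):
    """phi(x) = (1, K(x, x_1), ..., K(x, x_N))^T with the RBF kernel
    K(x, z) = exp(-||x - z||^2 / width^2)."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / width ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), K])
```

With the kernel basis the design matrix is N x (N+1): one column per training point plus a bias column, which is the representation SVMs and RVMs use.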
Bayesian Inference
In the case of regression, the likelihood of the data set can be expressed as:

  p(t | w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (t_n - y(x_n; w))^2 \}

where \sigma^2 corresponds to the variance of the Gaussian noise model.

A smoother model is preferred by declaring smaller weights to be a priori more probable, through the use of Bayesian priors: e.g. a Gaussian prior with zero mean, independent for each weight, with a common inverse-variance hyperparameter \alpha:

  p(w | \alpha) = \prod_{m=1}^{M} (\alpha / 2\pi)^{1/2} \exp\{ -\alpha w_m^2 / 2 \}

Given the likelihood and prior, the posterior distribution of the weights follows from Bayes' rule:

  p(w | t, \alpha, \sigma^2) = \frac{p(t | w, \sigma^2) \, p(w | \alpha)}{p(t | \alpha, \sigma^2)}    (posterior = likelihood x prior / evidence)
Bayesian Inference
Approximate predictive distribution, using Type-II maximum likelihood:

  p(t_* | t) \approx \int p(t_* | w, \sigma^2_{MP}) \, p(w | t, \alpha_{MP}, \sigma^2_{MP}) \, dw = N(t_* | y_*, \sigma_*^2)

To find the 'most probable' values \alpha_{MP} and \sigma^2_{MP}, we maximize the marginal likelihood:

  p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2) \, p(w | \alpha) \, dw
                          = (2\pi)^{-N/2} |\sigma^2 I + \alpha^{-1} \Phi \Phi^T|^{-1/2} \exp\{ -\frac{1}{2} t^T (\sigma^2 I + \alpha^{-1} \Phi \Phi^T)^{-1} t \}

This is a Gaussian distribution over the single N-dimensional dataset vector t.

Model selection:
Models can be ranked by the model evidence p(t | \alpha, r), given a selected basis function and its related parameters r, such as the radius width in the RBF kernel basis function.
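For concreteness, the marginal likelihood above can be evaluated numerically (a sketch under the common-\alpha prior of this slide; the function name is our own):

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """log p(t | alpha, sigma^2) where t ~ N(0, C),
    C = sigma^2 I + alpha^-1 Phi Phi^T."""
    N = Phi.shape[0]
    C = sigma2 * np.eye(N) + (Phi @ Phi.T) / alpha
    _, logdet = np.linalg.slogdet(C)  # numerically stable log-determinant
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```

Type-II maximum likelihood then amounts to maximizing this quantity over \alpha and \sigma^2.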
Encoding Sparsity
A sparse Bayesian prior: using M hyperparameters \alpha = (\alpha_1, \ldots, \alpha_M), each \alpha_m independently controls the (inverse) variance of its weight w_m:

  p(w | \alpha) = \prod_{m=1}^{M} (\alpha_m / 2\pi)^{1/2} \exp\{ -\alpha_m w_m^2 / 2 \}

Hierarchical priors:
The prior p(w | \alpha) is Gaussian conditioned on \alpha; hyperpriors over all \alpha_m can be defined by Gamma hyperpriors (or, in particular, uniform hyperpriors on \log(\alpha)).
The true form of the hierarchical prior can be seen by marginalization:

  p(w_m) = \int p(w_m | \alpha_m) \, p(\alpha_m) \, d\alpha_m   <->   a penalty function \sum_m \log |w_m|

The weight posterior distribution:

  p(w | t, \alpha, \sigma^2) = \frac{p(t | w, \sigma^2) \, p(w | \alpha)}{p(t | \alpha, \sigma^2)} = (2\pi)^{-M/2} |\Sigma|^{-1/2} \exp\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \}

with

  \Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1},   \mu = \sigma^{-2} \Sigma \Phi^T t,   A = diag(\alpha_1, \ldots, \alpha_M).

Key: when \alpha_m = \infty, the corresponding w_m = 0, i.e. basis function m is pruned from the model.
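The posterior statistics \mu and \Sigma above translate directly into code (a minimal numpy sketch; names are illustrative):

```python
import numpy as np

def weight_posterior(Phi, t, alpha, sigma2):
    """Gaussian weight posterior N(w | mu, Sigma) with
    Sigma = (sigma^-2 Phi^T Phi + A)^-1,  mu = sigma^-2 Sigma Phi^T t,
    A = diag(alpha_1, ..., alpha_M)."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma
```

Note that as \alpha_m grows, the posterior mean \mu_m is shrunk toward zero; \alpha_m = \infty corresponds to pruning w_m entirely.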
Sparse Bayesian Learning Algorithm
\alpha_{MP} and \sigma^2_{MP} are found by maximizing the marginal likelihood:

  p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2) \, p(w | \alpha) \, dw = (2\pi)^{-N/2} |C|^{-1/2} \exp\{ -\frac{1}{2} t^T C^{-1} t \}

This is a zero-mean Gaussian process over t with covariance

  C = \sigma^2 I + \sum_{m=1}^{M} \alpha_m^{-1} \phi_m \phi_m^T,   where \phi_m = [\phi_m(x_1), \ldots, \phi_m(x_N)]^T.

A fast sequential learning algorithm [2]:
Analyzing the objective, the log marginal likelihood decomposes as

  \log p(t | \alpha) = \log p(t | \alpha_{-i}) + \frac{1}{2} [ \log \alpha_i - \log(\alpha_i + s_i) + q_i^2 / (\alpha_i + s_i) ]

where the 'sparsity' and 'quality' terms are defined as

  s_i = \phi_i^T C_{-i}^{-1} \phi_i,   q_i = \phi_i^T C_{-i}^{-1} t.

Note that these terms are independent of \alpha_i (but depend on all the other \alpha_{-i}).
Sparse Bayesian Learning Algorithm
A fast sequential learning algorithm (continued):
The dependence of the marginal likelihood on a single \alpha_i is captured by

  l(\alpha_i) = \frac{1}{2} [ \log \alpha_i - \log(\alpha_i + s_i) + q_i^2 / (\alpha_i + s_i) ]

Setting \partial l(\alpha_i) / \partial \alpha_i = 0 gives analytical solutions:

  If q_i^2 > s_i:    \alpha_i^{opt} = s_i^2 / (q_i^2 - s_i)
  If q_i^2 \le s_i:  \alpha_i^{opt} = \infty   (basis function i is excluded)

Optimization algorithm outline:
Start from an empty model (zero basis functions).
Sequentially add, delete, or update the hyperparameter \alpha_m of a basis function (based on the computed q_i and s_i) until convergence.
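The outline above can be sketched end-to-end for regression (our own simplified rendering of the sequential add/delete/re-estimate rule, with a fixed noise variance; all names are illustrative, not the authors' code):

```python
import numpy as np

def fast_sbl_regression(Phi, t, sigma2, n_sweeps=20):
    """Sequential sparse Bayesian learning (regression, fixed sigma2):
    cycle over the bases, applying the analytic alpha update / pruning rule."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)           # alpha_m = inf: basis m excluded
    active = np.zeros(M, dtype=bool)

    def c_inv():
        """C^-1 for the current model, C = sigma2 I + Phi_a A^-1 Phi_a^T,
        computed via the Woodbury identity."""
        idx = np.flatnonzero(active)
        if idx.size == 0:
            return np.eye(N) / sigma2
        Phi_a = Phi[:, idx]
        Sigma = np.linalg.inv(np.diag(alpha[idx]) + Phi_a.T @ Phi_a / sigma2)
        return (np.eye(N) - Phi_a @ Sigma @ Phi_a.T / sigma2) / sigma2

    for _ in range(n_sweeps):
        for m in range(M):
            Ci = c_inv()
            phi = Phi[:, m]
            S, Q = phi @ Ci @ phi, phi @ Ci @ t
            if active[m]:                # strip basis m's own contribution to C
                s = alpha[m] * S / (alpha[m] - S)
                q = alpha[m] * Q / (alpha[m] - S)
            else:
                s, q = S, Q
            if q * q > s:                # relevant: (re-)estimate alpha_m
                alpha[m] = s * s / (q * q - s)
                active[m] = True
            else:                        # irrelevant: prune basis m
                alpha[m] = np.inf
                active[m] = False

    idx = np.flatnonzero(active)
    Phi_a = Phi[:, idx]
    Sigma = np.linalg.inv(np.diag(alpha[idx]) + Phi_a.T @ Phi_a / sigma2)
    mu = Sigma @ Phi_a.T @ t / sigma2    # posterior mean of the active weights
    return idx, mu
```

With the original input variables as columns of Phi, the returned index set is exactly the selected variable subset.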
Sparse Bayesian Logistic Regression for Variable Selection
Sparse Bayesian classification
In the case of classification:
The likelihood is Bernoulli, rather than Gaussian:

  p(t | w) = \prod_{n=1}^{N} g\{y(x_n; w)\}^{t_n} [1 - g\{y(x_n; w)\}]^{1 - t_n}

with g(y) = 1 / (1 + e^{-y}) the logistic link function.
There is no noise variance \sigma^2.
The same prior is used as for regression.
A Gaussian approximation is used to compute p(w | t, \alpha).
Variable selection is a side-effect of sparse Bayesian learning when the original variables are taken as basis functions.
The most relevant variables for the identified logistic regression classifier can be obtained from the sparse solution of the Bayesian learning.
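The Bernoulli likelihood with the logistic link can be written down directly (a small numpy sketch; the function names are our own):

```python
import numpy as np

def logistic(y):
    """Logistic link g(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def bernoulli_log_likelihood(w, Phi, t):
    """log p(t | w) = sum_n [ t_n log g(y_n) + (1 - t_n) log(1 - g(y_n)) ],
    with y_n = phi(x_n)^T w and targets t_n in {0, 1}."""
    p = logistic(Phi @ w)
    eps = 1e-12                      # numerical floor to avoid log(0)
    return np.sum(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
```

In the sparse Bayesian classifier, this log-likelihood plus the Gaussian prior term is maximized iteratively, yielding the Gaussian approximation to p(w | t, \alpha).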
Sparse Bayesian Logistic Regression for Variable Selection
Variance of the variable selection results from:
multiple solutions and local optima of the algorithm in the case of a large number of candidate input variables;
sensitivity to small perturbations of the experimental conditions.
Possible solutions: bagging, model averaging, committee machines, ensemble learning, etc.
Probabilistic Linear Classifiers
Linear Discriminant Analysis
Logistic Regression
Bayesian Least Squares Support Vector Machine Classifier (Van Gestel and Suykens, 2001)
Starting from the feature-space formulation, analytic expressions are obtained in the dual space on the three levels of Bayesian inference.
Posterior class probabilities are obtained by marginalizing over the model parameters.
Relevance Vector Machines
Multiclass classification by pairwise coupling
Typically, K-class classification rules tend to be easier to learn for K = 2 than for K > 2.
Reduce a multiclass classification problem to a set of binary classification problems: one-against-one.
Obtain the probability estimates p(C = c | X = x) in the one-against-one case by coupling the pairwise probability estimates.
The coupling of pairwise estimates has several advantages.
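One way to couple the pairwise estimates into p(C = c | X = x) is the iterative scheme of Hastie and Tibshirani (a minimal sketch assuming equal weights for all class pairs; names are illustrative):

```python
import numpy as np

def couple_pairwise(R, n_iter=200, tol=1e-10):
    """Given pairwise estimates R[i, j] ~ p(C = i | C in {i, j}, x),
    iteratively find class probabilities p whose induced pairwise ratios
    mu_ij = p_i / (p_i + p_j) match R."""
    K = R.shape[0]
    p = np.full(K, 1.0 / K)              # start from the uniform distribution
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(K):
            num = sum(R[i, j] for j in range(K) if j != i)
            den = sum(p[i] / (p[i] + p[j]) for j in range(K) if j != i)
            p[i] *= num / den            # multiplicative update for class i
            p /= p.sum()                 # renormalize after each update
        if np.max(np.abs(p - p_old)) < tol:
            break
    return p
```

When the pairwise table is consistent (i.e. generated by some true p), this fixed-point iteration recovers that p exactly.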
Experiments – binary classification on micro-array data
Typical micro-array data:
a small number of samples but a large number of genes (variables), i.e. very high dimensionality.
Experimental setting:
Select the variables with the sparse Bayesian logistic regression model (repeated five times; the subset of variables with the highest marginal likelihood is chosen).
Test the selected variables using leave-one-out cross-validation.
Compare with the performance using all the variables. Classifiers used include Bayesian LS-SVM and RVM, which can handle the situation where #variables > #samples.
cancer     no. samples   no. genes   task
leukemia   72            7192        2 subtypes
colon      62            2000        disease/normal
Experimental Results – Leukemia data
Two leukemia subtypes: acute myeloid leukemia (AML), and acute lymphoblastic leukemia (ALL)
Experimental Results – Colon data set
distinguish cancerous and normal tissues
“harder” to classify than the leukemia data
Experiments – Brain tumor classification (multiclass)
Inference of Hyperparameters (Level 2)
Conclusions
Summary
Variable selection helps to analyze the data set.
Under the Bayesian evidence framework, the model regularization and kernel parameters of the LS-SVM classifier can be chosen in a unified way, without the need to select an additional validation set.
A forward input-selection procedure that tries to maximize the model evidence has been shown to identify the subset of important variables for model building.
A sparse approximation can further improve the generalization performance of the LS- SVM classifiers.
LS-SVMs have the potential to give reliable preoperative prediction of malignancy of ovarian tumors.
Future work
A larger scale validation is still needed.
A hybrid methodology, e.g. combining Bayesian networks with the learning of LS-SVM, might be more promising.
Acknowledgement
Use of the brain tumor data provided by the EU-funded INTERPRET project (IST-1999-10310, http://carbon.uab.es/INTERPRET) is gratefully acknowledged.
References
[1] Tipping M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244, 2001.
[2] Tipping M. E. and Faul A. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of Artificial Intelligence and Statistics '03, 2003.
[3] Suykens J.A.K., Van Gestel T., et al. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
[4] Guyon I., Weston J., et al. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.
[5] Lukas L., Devos A., et al. Classification of brain tumours using 1H MRS spectra. Internal report, ESAT-SISTA, K.U.Leuven, 2003.