
(1)

Variable Selection Using Linear Sparse Bayesian Models for Medical Classification Problems

C. Lu¹, J. A. K. Suykens¹, S. Van Huffel¹

Email: {Chuan.Lu, Johan.Suykens, Sabine.VanHuffel}@esat.kuleuven.ac.be

¹ Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium

(2)

Belgian Day on Biomedical Engineering, Oct 17, 2003

Introduction

Reasons for performing variable selection in medical classification problems:
- it has an impact on the economics of data acquisition;
- it affects the accuracy and complexity of the classifiers;
- it is helpful in understanding the underlying mechanism that generated the data.

Our proposed method in this work:
- select the variables using Tipping's fast sparse Bayesian learning method with linear basis functions;
- the selected variables are then used in different types of linear classifiers.

(3)

Introduction

Application

Linear models used: linear discriminant analysis (LDA), logistic regression (LR), relevance vector machines (RVM) with linear kernels, and Bayesian least squares support vector machines (LS-SVM) with linear kernels.

We applied the method to real-life medical classification problems, including two cancer diagnosis problems based on micro-array data and a brain tumor classification problem based on MRS spectra.

The generalization performance of the compared models can be improved via the proposed variable selection procedure.

(4)


Sparse Bayesian Modeling

Principles

Sparse Bayesian learning is the application of Bayesian automatic relevance determination (ARD) to models linear in their parameters, by which sparse solutions to regression or classification tasks can be obtained [1].

The predictions are based upon a function $y(\mathbf{x})$ defined over the input space:

$$y(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$

Two forms for the basis functions have been considered:
- $\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \ldots, x_d)^T$, i.e. the original input variables;
- $\boldsymbol{\phi}(\mathbf{x}) = (1, K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N))^T$, where $K(\mathbf{x}, \mathbf{x}_n)$ denotes a symmetric kernel function. Support vector machines (SVMs) and relevance vector machines (RVMs) generally adopt this kernel representation.

Symmetric kernel $K(\cdot, \cdot)$, e.g.
- RBF kernel: $K(\mathbf{x}, \mathbf{z}) = \exp\{-\|\mathbf{x} - \mathbf{z}\|^2 / \sigma^2\}$
- Linear kernel: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$
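For concreteness, a minimal numpy sketch (added here, not part of the original slides; function names are my own) of building the design matrix for the two basis choices:

```python
import numpy as np

def linear_basis(X):
    """Phi = (1, x_1, ..., x_d): original inputs plus a bias column,
    so sparse Bayesian learning prunes individual *variables*."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def kernel_basis(X, rbf_sigma=None):
    """Phi = (1, K(x, x_1), ..., K(x, x_N)): one column per training
    sample, as in RVMs; pruning then selects *samples*, not variables."""
    if rbf_sigma is None:                       # linear kernel x^T z
        K = X @ X.T
    else:                                       # RBF kernel
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-d2 / rbf_sigma**2)
    return np.hstack([np.ones((X.shape[0], 1)), K])
```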

(5)

Bayesian Inference

In the case of regression, the likelihood of the data set can be expressed as:

$$p(\mathbf{t} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(t_n - y(\mathbf{x}_n; \mathbf{w})\big)^2 \Big\}$$

where $\sigma^2$ corresponds to the variance of the Gaussian noise model.

A smoother model is preferred by declaring smaller weights to be a priori more probable, by use of Bayesian priors: e.g. a Gaussian prior with zero mean, independent for each weight, with a common inverse-variance hyperparameter $\alpha$:

$$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{M} \Big(\frac{\alpha}{2\pi}\Big)^{1/2} \exp\Big\{ -\frac{\alpha w_m^2}{2} \Big\}$$

Given the likelihood and prior, compute the posterior distribution of the weights via Bayes' rule:

$$p(\mathbf{w} \mid \mathbf{t}, \alpha, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \alpha)}{p(\mathbf{t} \mid \alpha, \sigma^2)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

(6)


Bayesian Inference

Approximate predictive distribution, using Type II maximum likelihood:

$$p(t_* \mid \mathbf{t}) \approx \int p(t_* \mid \mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w} \mid \mathbf{t}, \alpha_{MP}, \sigma^2_{MP})\, d\mathbf{w} = \mathcal{N}(t_* \mid y_*, \sigma_*^2)$$

To find the 'most probable' values $\alpha_{MP}$ and $\sigma^2_{MP}$, we maximize the marginal likelihood

$$p(\mathbf{t} \mid \alpha, \sigma^2) = \int p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w} = (2\pi)^{-N/2}\, \big|\sigma^2 I + \alpha^{-1} \Phi \Phi^T\big|^{-1/2} \exp\Big\{ -\frac{1}{2}\, \mathbf{t}^T \big(\sigma^2 I + \alpha^{-1} \Phi \Phi^T\big)^{-1} \mathbf{t} \Big\}$$

This is a Gaussian distribution over a single N-dimensional dataset vector $\mathbf{t}$.

Model selection: models can be ranked by the model evidence $p(\mathbf{t} \mid \phi, r)$, given a selected basis function $\phi$ and its related parameters $r$, such as the radius width in the RBF kernel basis function.
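For concreteness, a direct numpy evaluation of this marginal likelihood (a sketch added here, assuming the common-$\alpha$ prior of the previous slide; the function name is my own):

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """log p(t | alpha, sigma^2) for the common-alpha model above;
    a direct O(N^3) evaluation, fine for the small N of these data sets."""
    N = len(t)
    C = sigma2 * np.eye(N) + (Phi @ Phi.T) / alpha  # sigma^2 I + alpha^-1 Phi Phi^T
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```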

(7)

Encoding Sparsity

A sparse Bayesian prior: using M hyperparameters $\alpha = (\alpha_1, \ldots, \alpha_M)$, each $\alpha_m$ independently controls the (inverse) variance of its weight $w_m$:

$$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{M} \Big(\frac{\alpha_m}{2\pi}\Big)^{1/2} \exp\Big\{ -\frac{\alpha_m w_m^2}{2} \Big\}$$

Hierarchical priors:
- The prior $p(\mathbf{w} \mid \alpha)$ is Gaussian conditioned on $\alpha$; hyperpriors over all $\alpha_m$ can be defined by Gamma hyperpriors (or, in particular, uniform hyperpriors on $\log \alpha$).
- The true form of the hierarchical prior can be seen by marginalization:

$$p(w_m) = \int p(w_m \mid \alpha_m)\, p(\alpha_m)\, d\alpha_m \;\longleftrightarrow\; \text{a penalty function } \sum_m \log |w_m|$$

The weight posterior distribution:

$$p(\mathbf{w} \mid \mathbf{t}, \alpha, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \alpha)}{p(\mathbf{t} \mid \alpha, \sigma^2)} = (2\pi)^{-(N+1)/2}\, |\Sigma|^{-1/2} \exp\Big\{ -\frac{1}{2} (\mathbf{w} - \mu)^T \Sigma^{-1} (\mathbf{w} - \mu) \Big\}$$

with

$$\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1}, \qquad \mu = \sigma^{-2}\, \Sigma\, \Phi^T \mathbf{t}, \qquad A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$$

Key: when $\alpha_m = \infty$, the corresponding $w_m = 0$.
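A one-line justification, not spelled out on the slide, assuming the uniform hyperprior on $\log \alpha_m$ (i.e. $p(\alpha_m) \propto 1/\alpha_m$):

$$p(w_m) = \int_0^\infty \mathcal{N}(w_m \mid 0, \alpha_m^{-1})\, \alpha_m^{-1}\, d\alpha_m = \frac{1}{\sqrt{2\pi}} \int_0^\infty \alpha_m^{-1/2}\, e^{-\alpha_m w_m^2 / 2}\, d\alpha_m = \frac{1}{|w_m|}$$

so that $-\log p(\mathbf{w}) = \sum_m \log |w_m| + \text{const}$, which is exactly the penalty above.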

(8)


Sparse Bayesian Learning Algorithm

$\alpha_{MP}$ and $\sigma^2_{MP}$ are found by maximizing the marginal likelihood

$$p(\mathbf{t} \mid \alpha, \sigma^2) = \int p(\mathbf{t} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w} = (2\pi)^{-N/2}\, |C|^{-1/2} \exp\Big\{ -\frac{1}{2} \mathbf{t}^T C^{-1} \mathbf{t} \Big\}$$

This is a zero-mean Gaussian process over $\mathbf{t}$ with covariance

$$C = \sigma^2 I + \sum_{m=1}^{M} \alpha_m^{-1} \phi_m \phi_m^T, \qquad \text{where } \phi_m = [\phi_m(\mathbf{x}_1), \ldots, \phi_m(\mathbf{x}_N)]^T$$

A fast sequential learning algorithm [2]:

Analyzing the objective, the log marginal likelihood function:

$$\log p(\mathbf{t} \mid \alpha) = \log p(\mathbf{t} \mid \alpha_{-i}) + \frac{1}{2}\Big[ \log \alpha_i - \log(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i} \Big]$$

where the "quality" and "sparsity" terms have been defined as

$$q_i = \phi_i^T C_{-i}^{-1} \mathbf{t}, \qquad s_i = \phi_i^T C_{-i}^{-1} \phi_i$$

Note these terms are independent of $\alpha_i$ (but depend on all the other $\alpha_{-i}$).

(9)

Sparse Bayesian Learning Algorithm

A fast sequential learning algorithm (continued):

The dependence of the marginal likelihood on $\alpha_i$ is captured by

$$\ell(\alpha_i) = \frac{1}{2}\Big[ \log \alpha_i - \log(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i} \Big]$$

Setting $\partial \ell(\alpha_i) / \partial \alpha_i = 0$ gives analytical solutions:
- If $q_i^2 > s_i$: $\alpha_i^{opt} = \dfrac{s_i^2}{q_i^2 - s_i}$
- If $q_i^2 \le s_i$: $\alpha_i^{opt} = \infty$

Optimization algorithm outline (see the sketch below):
- Start from a model with zero basis functions.
- Sequentially add, delete, or re-estimate the hyperparameter of a basis function $\phi_i$ (based on the computed $q_i$ and $s_i$) until convergence.
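The slides outline the algorithm without code; below is a minimal numpy sketch of the sequential procedure of [2] for the regression case (all names are my own, and the classification variant would interleave a Gaussian approximation step). It rebuilds $C$ from scratch at every step for clarity, whereas [2] maintains it with efficient rank-one updates.

```python
import numpy as np

def fast_sbl(Phi, t, sigma2, n_iter=None):
    """Sketch of the fast sequential sparse Bayesian learning of [2]:
    one alpha_i is added / deleted / re-estimated per iteration."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)                 # inf <=> basis i pruned
    n_iter = n_iter or 10 * M
    for it in range(n_iter):
        i = it % M                             # sweep over the bases
        active = np.isfinite(alpha)
        C = sigma2 * np.eye(N)                 # C = sigma^2 I + sum_m alpha_m^-1 phi_m phi_m^T
        if active.any():
            Pa = Phi[:, active]
            C += (Pa / alpha[active]) @ Pa.T
        Cinv = np.linalg.inv(C)
        phi = Phi[:, i]
        S, Q = phi @ Cinv @ phi, phi @ Cinv @ t
        if np.isfinite(alpha[i]):              # S, Q use the full C; convert
            s = alpha[i] * S / (alpha[i] - S)  # to s, q with phi_i excluded
            q = alpha[i] * Q / (alpha[i] - S)
        else:
            s, q = S, Q
        alpha[i] = s**2 / (q**2 - s) if q**2 > s else np.inf
    active = np.isfinite(alpha)
    Pa = Phi[:, active]
    Sigma = np.linalg.inv(Pa.T @ Pa / sigma2 + np.diag(alpha[active]))
    mu = Sigma @ Pa.T @ t / sigma2             # posterior mean of the weights
    return active, mu
```

With the linear basis of the earlier sketch, the surviving columns of Phi correspond directly to selected input variables.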

(10)


Sparse Bayesian Logistic Regression for Variable Selection

Sparse Bayesian classification: in the case of classification,
- the likelihood is Bernoulli, rather than Gaussian:

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} g\{y(\mathbf{x}_n; \mathbf{w})\}^{t_n}\, \big[1 - g\{y(\mathbf{x}_n; \mathbf{w})\}\big]^{1 - t_n}$$

with $g(y) = 1/(1 + e^{-y})$ the logistic link function;
- there is no noise variance $\sigma^2$;
- the prior is the same as for regression;
- a Gaussian approximation is used to compute $p(\mathbf{w} \mid \mathbf{t}, \alpha)$.

Variable selection is a side effect of sparse Bayesian learning when the original variables are taken as basis functions.

The most relevant variables for the identified logistic regression classifier can be obtained from the sparse solution of the Bayesian learning.
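The slides give no code for this step. As a rough illustration only, the hypothetical `linear_basis` and `fast_sbl` sketches above can be run on ±1-coded labels as a crude regression stand-in for the Bernoulli likelihood with its Gaussian approximation; the point is how the selected variables are read off the sparse solution:

```python
import numpy as np

# Hypothetical usage, reusing the earlier sketches.  X: expression
# matrix, y: 0/1 labels (synthetic here so the snippet runs).
rng = np.random.default_rng(0)
X = rng.standard_normal((62, 200))
y = (X[:, 3] - X[:, 17] > 0).astype(float)

Phi = linear_basis(X)                  # identity basis: one column per gene
active, mu = fast_sbl(Phi, 2 * y - 1, sigma2=0.1)
selected = np.flatnonzero(active[1:])  # surviving columns = selected genes
```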

(11)

Sparse Bayesian Logistic Regression for Variable Selection

Variable selection is a side effect of sparse Bayesian learning when the original variables are taken as basis functions.

The most relevant variables for the identified logistic regression classifier can be obtained from the sparse solution of the Bayesian learning.

Variance of the variable selection results from:
- multiple solutions and local optima of the algorithm in the case of a large number of candidate input variables;
- sensitivity to small perturbations of the experimental conditions.

Possible remedies: bagging, model averaging, committee machines, ensemble learning, etc.

(12)


Probabilistic Linear Classifiers

Linear Discriminant Analysis

Logistic Regression

Bayesian Least Squares Support Vector Machine Classifier (Van Gestel and Suykens, 2001)
- Starting from the feature space formulation, analytic expressions are obtained in the dual space on the three levels of Bayesian inference.
- Posterior class probabilities are obtained by marginalizing over the model parameters.

Relevance Vector Machines

(13)

Multiple classification by pairwise coupling

Typically, K-class classification rules tend to be easier to learn for K = 2 than for K > 2.
- Reduce the multiclass classification problem to a set of binary classification problems: one-against-one.
- Obtain estimates of the class probabilities $p(C = c \mid X = x)$ in the one-against-one case by coupling the pairwise probability estimates (see the sketch below).
- The coupling of pairwise estimates has the following advantages:
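The slides do not spell out the coupling scheme; a common choice, assumed here, is the iterative procedure of Hastie and Tibshirani with equal pair weights. A minimal numpy sketch (names my own):

```python
import numpy as np

def pairwise_coupling(R, n_iter=100, tol=1e-8):
    """Couple one-against-one estimates R[i, j] ~ p(C=i | C in {i, j})
    into class probabilities p (Hastie-Tibshirani-style iteration,
    equal pair weights assumed)."""
    K = R.shape[0]
    p = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(K):
            mu = p[i] / (p[i] + p)             # mu_ij = p_i / (p_i + p_j)
            num = sum(R[i, j] for j in range(K) if j != i)
            den = sum(mu[j] for j in range(K) if j != i)
            p[i] *= num / den                  # match pairwise expectations
            p /= p.sum()                       # renormalize
        if np.abs(p - p_old).max() < tol:
            break
    return p
```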

(14)


Experiments – binary classification on micro-array data

Typical micro-array data: a small number of samples but a large number of genes (variables), i.e. very high dimensionality.

Experimental setting:
- Find the variables by a sparse Bayesian logistic regression model (repeat five times; choose the subset of variables with the highest marginal likelihood).
- Test the selected variables using leave-one-out cross-validation (see the helper sketched below).
- Compare with the performance using all the variables. Classifiers used include Bayesian LS-SVM and RVM, which can handle the situation where #variables > #samples.

cancer     no. samples   no. genes   task
leukemia   72            7192        2 subtypes
colon      62            2000        disease/normal
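A small helper illustrating the leave-one-out protocol described above; `fit_predict` is a placeholder for any of the compared classifiers (an assumption of mine, not from the slides):

```python
import numpy as np

def loo_error(X, t, selected, fit_predict):
    """Leave-one-out error of a classifier on the selected variables.
    fit_predict(X_tr, t_tr, x_te) -> predicted label; any of the
    compared linear models (LDA, LR, RVM, Bayesian LS-SVM) fits here."""
    Xs = X[:, selected]
    wrong = 0
    for i in range(len(t)):
        mask = np.arange(len(t)) != i          # hold out sample i
        pred = fit_predict(Xs[mask], t[mask], Xs[i])
        wrong += int(pred != t[i])
    return wrong / len(t)
```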

(15)

Experimental Results – Leukemia data

Two leukemia subtypes: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)

(16)


Experimental Results – Colon data set

distinguish cancerous and normal tissues

“harder” to classify than the leukemia data

(17)

Experiments – Brain tumor classification (multiclass)

Inference of Hyperparameters (Level 2)

(18)


Conclusions

Summary
- Variable selection helps to analyze the data set.
- Under the Bayesian evidence framework, the choice of the model regularization and kernel parameters for the LS-SVM classifier can be done in a unified way, without the need for a separate validation set.
- A forward input selection procedure that tries to maximize the model evidence proved able to identify the subset of important variables for model building.
- A sparse approximation can further improve the generalization performance of the LS-SVM classifiers.
- LS-SVMs have the potential to give reliable preoperative prediction of malignancy of ovarian tumors.

Future work
- A larger-scale validation is still needed.
- A hybrid methodology, e.g. combining Bayesian networks with the learning of LS-SVMs, might be more promising.

(19)

Acknowledgement

Use of the brain tumor data provided by the EU-funded INTERPRET project (IST-1999-10310, http://carbon.uab.es/INTERPRET) is gratefully acknowledged.

References

[1] Tipping M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244, 2001.

[2] Tipping M. E. and Faul A. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of Artificial Intelligence and Statistics '03, 2003.

[3] Suykens J. A. K., Van Gestel T., et al. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

[4] Guyon I., Weston J., et al. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.

[5] Lukas L., Devos A., et al. Classification of brain tumours using 1H MRS spectra. Internal report, ESAT-SISTA, K.U.Leuven, 2003.
