Variable Selection Using Linear Sparse Bayesian Models for Medical Classification Problems
C. Lu (1), J. A. K. Suykens (1), S. Van Huffel (1)
Email: {Chuan.Lu, Johan.Suykens, Sabine.VanHuffel}@esat.kuleuven.ac.be
(1) Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
Belgian Day on Biomedical Engineering, Oct 17, 2003
Introduction
Reasons for performing variable selection in medical classification problems:
It has an impact on the economics of data acquisition.
It affects the accuracy and complexity of the classifiers.
It is helpful in understanding the underlying mechanism that generated the data.
Our proposed method in this work:
Select the variables using Tipping's fast sparse Bayesian learning method with linear basis functions.
Use the selected variables in different types of linear classifiers.
Introduction
Application
Linear models used: linear discriminant analysis (LDA) models, logistic regression (LR) models, relevance vector machines (RVM) with linear kernels, and Bayesian least squares support vector machines (LS-SVM) with linear kernels.
We applied the method to real-life medical classification problems, including two cancer diagnosis problems based on micro-array data and a brain tumor classification problem based on MRS spectra.
The generalization performance of the compared models can be improved via the proposed variable selection procedure.
Sparse Bayesian Modeling
Principles
Sparse Bayesian learning is the application of Bayesian automatic relevance determination (ARD) to models linear in their parameters, by which sparse solutions to regression or classification tasks can be obtained [1].
The predictions are based upon a function y(x) defined over the input space:

  y(x; w) = w^T \phi(x)

Two forms for the basis functions have been considered:

  \phi(x) = (1, x_1, \ldots, x_d)^T, i.e. the original input variables;

  \phi(x) = (1, K(x, x_1), \ldots, K(x, x_N))^T,

where K(\cdot, \cdot) denotes a symmetric kernel function. Support vector machines (SVMs) and relevance vector machines (RVMs) generally adopt the kernel representation. Examples of symmetric kernels K(\cdot, \cdot):

  RBF kernel: K(x, z) = \exp\{ -\|x - z\|^2 / r^2 \}
  Linear kernel: K(x, z) = x^T z
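As an illustration, the two basis forms above can be assembled into design matrices (a minimal numpy sketch; the function names are our own):

```python
import numpy as np

def linear_basis(X):
    """phi(x) = (1, x_1, ..., x_d)^T for each row of X -> design matrix (N, d+1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def rbf_kernel_basis(X, centers, width):
    """phi(x) = (1, K(x, x_1), ..., K(x, x_N))^T with the RBF kernel
    K(x, z) = exp(-||x - z||^2 / width^2)."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / width ** 2)
    return np.hstack([np.ones((X.shape[0], 1)), K])
```

With the kernel basis the design matrix is N x (N+1): one column per training point plus a bias column, which is the representation SVMs and RVMs use.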
Bayesian Inference
In the case of regression, the likelihood of the data set can be expressed as:

  p(t | w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (t_n - y(x_n; w))^2 \}

where \sigma^2 corresponds to the variance of the Gaussian noise model.

A smoother model is preferred by declaring smaller weights to be a priori more probable, through the use of Bayesian priors: e.g. a Gaussian prior with zero mean, independent for each weight, with a common inverse-variance hyperparameter \alpha:

  p(w | \alpha) = \prod_{m=1}^{M} (\alpha / 2\pi)^{1/2} \exp\{ -\alpha w_m^2 / 2 \}

Given the likelihood and prior, the posterior distribution of the weights follows from Bayes' rule:

  p(w | t, \alpha, \sigma^2) = \frac{p(t | w, \sigma^2) \, p(w | \alpha)}{p(t | \alpha, \sigma^2)}    (posterior = likelihood x prior / evidence)
Bayesian Inference
Approximate predictive distribution, using Type-II maximum likelihood:

  p(t_* | t) \approx \int p(t_* | w, \sigma^2_{MP}) \, p(w | t, \alpha_{MP}, \sigma^2_{MP}) \, dw = N(t_* | y_*, \sigma_*^2)

To find the 'most probable' values \alpha_{MP} and \sigma^2_{MP}, we maximize the marginal likelihood:

  p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2) \, p(w | \alpha) \, dw
                          = (2\pi)^{-N/2} |\sigma^2 I + \alpha^{-1} \Phi \Phi^T|^{-1/2} \exp\{ -\frac{1}{2} t^T (\sigma^2 I + \alpha^{-1} \Phi \Phi^T)^{-1} t \}

This is a Gaussian distribution over the single N-dimensional dataset vector t.

Model selection:
Models can be ranked by the model evidence p(t | \alpha, r), given a selected basis function and its related parameters r, such as the radius width in the RBF kernel basis function.
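For concreteness, the marginal likelihood above can be evaluated numerically (a sketch under the common-\alpha prior of this slide; the function name is our own):

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """log p(t | alpha, sigma^2) where t ~ N(0, C),
    C = sigma^2 I + alpha^-1 Phi Phi^T."""
    N = Phi.shape[0]
    C = sigma2 * np.eye(N) + (Phi @ Phi.T) / alpha
    _, logdet = np.linalg.slogdet(C)  # numerically stable log-determinant
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```

Type-II maximum likelihood then amounts to maximizing this quantity over \alpha and \sigma^2.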
Encoding Sparsity
A sparse Bayesian prior: using M hyperparameters \alpha = (\alpha_1, \ldots, \alpha_M), each \alpha_m independently controls the (inverse) variance of its weight w_m:

  p(w | \alpha) = \prod_{m=1}^{M} (\alpha_m / 2\pi)^{1/2} \exp\{ -\alpha_m w_m^2 / 2 \}

Hierarchical priors:
The prior p(w | \alpha) is Gaussian conditioned on \alpha; hyperpriors over all \alpha_m can be defined by Gamma hyperpriors (or, in particular, uniform hyperpriors on \log(\alpha)).
The true form of the hierarchical prior can be seen by marginalization:

  p(w_m) = \int p(w_m | \alpha_m) \, p(\alpha_m) \, d\alpha_m   <->   a penalty function \sum_m \log |w_m|

The weight posterior distribution:

  p(w | t, \alpha, \sigma^2) = \frac{p(t | w, \sigma^2) \, p(w | \alpha)}{p(t | \alpha, \sigma^2)} = (2\pi)^{-M/2} |\Sigma|^{-1/2} \exp\{ -\frac{1}{2} (w - \mu)^T \Sigma^{-1} (w - \mu) \}

with

  \Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1},   \mu = \sigma^{-2} \Sigma \Phi^T t,   A = diag(\alpha_1, \ldots, \alpha_M).

Key: when \alpha_m = \infty, the corresponding w_m = 0, i.e. basis function m is pruned from the model.
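The posterior statistics \mu and \Sigma above translate directly into code (a minimal numpy sketch; names are illustrative):

```python
import numpy as np

def weight_posterior(Phi, t, alpha, sigma2):
    """Gaussian weight posterior N(w | mu, Sigma) with
    Sigma = (sigma^-2 Phi^T Phi + A)^-1,  mu = sigma^-2 Sigma Phi^T t,
    A = diag(alpha_1, ..., alpha_M)."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma
```

Note that as \alpha_m grows, the posterior mean \mu_m is shrunk toward zero; \alpha_m = \infty corresponds to pruning w_m entirely.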
Sparse Bayesian Learning Algorithm
\alpha_{MP} and \sigma^2_{MP} are found by maximizing the marginal likelihood:

  p(t | \alpha, \sigma^2) = \int p(t | w, \sigma^2) \, p(w | \alpha) \, dw = (2\pi)^{-N/2} |C|^{-1/2} \exp\{ -\frac{1}{2} t^T C^{-1} t \}

This is a zero-mean Gaussian process over t with covariance

  C = \sigma^2 I + \sum_{m=1}^{M} \alpha_m^{-1} \phi_m \phi_m^T,   where \phi_m = [\phi_m(x_1), \ldots, \phi_m(x_N)]^T.

A fast sequential learning algorithm [2]:
Analyzing the objective, the log marginal likelihood decomposes as

  \log p(t | \alpha) = \log p(t | \alpha_{-i}) + \frac{1}{2} [ \log \alpha_i - \log(\alpha_i + s_i) + q_i^2 / (\alpha_i + s_i) ]

where the 'sparsity' and 'quality' terms are defined as

  s_i = \phi_i^T C_{-i}^{-1} \phi_i,   q_i = \phi_i^T C_{-i}^{-1} t.

Note that these terms are independent of \alpha_i (but depend on all the other \alpha_{-i}).
Sparse Bayesian Learning Algorithm
A fast sequential learning algorithm (continued):
The dependence of the marginal likelihood on a single \alpha_i is captured by

  l(\alpha_i) = \frac{1}{2} [ \log \alpha_i - \log(\alpha_i + s_i) + q_i^2 / (\alpha_i + s_i) ]

Setting \partial l(\alpha_i) / \partial \alpha_i = 0 gives analytical solutions:

  If q_i^2 > s_i:    \alpha_i^{opt} = s_i^2 / (q_i^2 - s_i)
  If q_i^2 \le s_i:  \alpha_i^{opt} = \infty   (basis function i is excluded)

Optimization algorithm outline:
Start from an empty model (zero basis functions).
Sequentially add, delete, or update the hyperparameter \alpha_m of a basis function (based on the computed q_i and s_i) until convergence.
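The outline above can be sketched end-to-end for regression (our own simplified rendering of the sequential add/delete/re-estimate rule, with a fixed noise variance; all names are illustrative, not the authors' code):

```python
import numpy as np

def fast_sbl_regression(Phi, t, sigma2, n_sweeps=20):
    """Sequential sparse Bayesian learning (regression, fixed sigma2):
    cycle over the bases, applying the analytic alpha update / pruning rule."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)           # alpha_m = inf: basis m excluded
    active = np.zeros(M, dtype=bool)

    def c_inv():
        """C^-1 for the current model, C = sigma2 I + Phi_a A^-1 Phi_a^T,
        computed via the Woodbury identity."""
        idx = np.flatnonzero(active)
        if idx.size == 0:
            return np.eye(N) / sigma2
        Phi_a = Phi[:, idx]
        Sigma = np.linalg.inv(np.diag(alpha[idx]) + Phi_a.T @ Phi_a / sigma2)
        return (np.eye(N) - Phi_a @ Sigma @ Phi_a.T / sigma2) / sigma2

    for _ in range(n_sweeps):
        for m in range(M):
            Ci = c_inv()
            phi = Phi[:, m]
            S, Q = phi @ Ci @ phi, phi @ Ci @ t
            if active[m]:                # strip basis m's own contribution to C
                s = alpha[m] * S / (alpha[m] - S)
                q = alpha[m] * Q / (alpha[m] - S)
            else:
                s, q = S, Q
            if q * q > s:                # relevant: (re-)estimate alpha_m
                alpha[m] = s * s / (q * q - s)
                active[m] = True
            else:                        # irrelevant: prune basis m
                alpha[m] = np.inf
                active[m] = False

    idx = np.flatnonzero(active)
    Phi_a = Phi[:, idx]
    Sigma = np.linalg.inv(np.diag(alpha[idx]) + Phi_a.T @ Phi_a / sigma2)
    mu = Sigma @ Phi_a.T @ t / sigma2    # posterior mean of the active weights
    return idx, mu
```

With the original input variables as columns of Phi, the returned index set is exactly the selected variable subset.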
Sparse Bayesian Logistic Regression for Variable Selection
Sparse Bayesian classification
In the case of classification:
The likelihood is Bernoulli, rather than Gaussian:

  p(t | w) = \prod_{n=1}^{N} g\{y(x_n; w)\}^{t_n} [1 - g\{y(x_n; w)\}]^{1 - t_n}

with g(y) = 1 / (1 + e^{-y}) the logistic link function.
There is no noise variance \sigma^2.
The same prior is used as for regression.
A Gaussian approximation is used to compute p(w | t, \alpha).
Variable selection is a side-effect of sparse Bayesian learning when the original variables are taken as basis functions.
The most relevant variables for the identified logistic regression classifier can be obtained from the sparse solution of the Bayesian learning.
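The Bernoulli likelihood with the logistic link can be written down directly (a small numpy sketch; the function names are our own):

```python
import numpy as np

def logistic(y):
    """Logistic link g(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + np.exp(-y))

def bernoulli_log_likelihood(w, Phi, t):
    """log p(t | w) = sum_n [ t_n log g(y_n) + (1 - t_n) log(1 - g(y_n)) ],
    with y_n = phi(x_n)^T w and targets t_n in {0, 1}."""
    p = logistic(Phi @ w)
    eps = 1e-12                      # numerical floor to avoid log(0)
    return np.sum(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))
```

In the sparse Bayesian classifier, this log-likelihood plus the Gaussian prior term is maximized iteratively, yielding the Gaussian approximation to p(w | t, \alpha).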
Sparse Bayesian Logistic Regression for Variable Selection
Variance of the variable selection results from:
multiple solutions and local optima of the algorithm in the case of a large number of candidate input variables;
sensitivity to small perturbations of the experimental conditions.
Possible solutions: bagging, model averaging, committee machines, ensemble learning, etc.
Probabilistic Linear Classifiers
Linear Discriminant Analysis
Logistic Regression
Bayesian Least Squares Support Vector Machine Classifier (Van Gestel and Suykens, 2001)
Starting from the feature-space formulation, analytic expressions are obtained in the dual space on the three levels of Bayesian inference.
Posterior class probabilities are obtained by marginalizing over the model parameters.
Relevance Vector Machines
Multiclass classification by pairwise coupling
Typically, K-class classification rules tend to be easier to learn for K = 2 than for K > 2.
Reduce a multiclass classification problem to a set of binary classification problems: one-against-one.
Obtain the probability estimates p(C = c | X = x) in the one-against-one case by coupling the pairwise probability estimates.
The coupling of pairwise estimates has several advantages.
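One way to couple the pairwise estimates into p(C = c | X = x) is the iterative scheme of Hastie and Tibshirani (a minimal sketch assuming equal weights for all class pairs; names are illustrative):

```python
import numpy as np

def couple_pairwise(R, n_iter=200, tol=1e-10):
    """Given pairwise estimates R[i, j] ~ p(C = i | C in {i, j}, x),
    iteratively find class probabilities p whose induced pairwise ratios
    mu_ij = p_i / (p_i + p_j) match R."""
    K = R.shape[0]
    p = np.full(K, 1.0 / K)              # start from the uniform distribution
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(K):
            num = sum(R[i, j] for j in range(K) if j != i)
            den = sum(p[i] / (p[i] + p[j]) for j in range(K) if j != i)
            p[i] *= num / den            # multiplicative update for class i
            p /= p.sum()                 # renormalize after each update
        if np.max(np.abs(p - p_old)) < tol:
            break
    return p
```

When the pairwise table is consistent (i.e. generated by some true p), this fixed-point iteration recovers that p exactly.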
Experiments – binary classification on micro-array data
Typical micro-array data:
a small number of samples but a large number of genes (variables), i.e. very high dimensionality.
Experimental setting:
Select the variables with the sparse Bayesian logistic regression model (repeated five times; the subset of variables with the highest marginal likelihood is chosen).
Test the selected variables using leave-one-out cross-validation.
Compare with the performance using all the variables. Classifiers used include Bayesian LS-SVM and RVM, which can handle the situation where #variables > #samples.
cancer     no. samples   no. genes   task
leukemia   72            7192        2 subtypes
colon      62            2000        disease/normal
Experimental Results – Leukemia data
Two leukemia subtypes: acute myeloid leukemia (AML), and acute lymphoblastic leukemia (ALL)
Experimental Results – Colon data set
distinguish cancerous and normal tissues
“harder” to classify than the leukemia data
Experiments – Brain tumor classification (multiclass)
Inference of Hyperparameters (Level 2)
Conclusions
Summary
Variable selection helps to analyze the data set.
Under the Bayesian evidence framework, the model regularization and kernel parameters of the LS-SVM classifier can be chosen in a unified way, without the need to select an additional validation set.
A forward input-selection procedure that tries to maximize the model evidence has been shown to identify the subset of important variables for model building.
A sparse approximation can further improve the generalization performance of the LS- SVM classifiers.
LS-SVMs have the potential to give reliable preoperative prediction of malignancy of ovarian tumors.
Future work
A larger scale validation is still needed.
A hybrid methodology, e.g. combining Bayesian networks with the learning of LS-SVM, might be more promising.
Acknowledgement
Use of the brain tumor data provided by the EU-funded INTERPRET project (IST-1999-10310, http://carbon.uab.es/INTERPRET) is gratefully acknowledged.
References
[1] Tipping M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–244, 2001.
[2] Tipping M. E. and Faul A. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of Artificial Intelligence and Statistics '03, 2003.
[3] Suykens J.A.K., Van Gestel T., et al. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
[4] Guyon I., Weston J., et al. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.
[5] Lukas L., Devos A., et al. Classification of brain tumours using 1H MRS spectra. Internal report, ESAT-SISTA, K.U.Leuven, 2003.