Prediction of Malignancy of Ovarian Tumors Using Least Squares Support Vector Machines
C. Lu1, T. Van Gestel1, J. A. K. Suykens1, S. Van Huffel1, I. Vergote2, D. Timmerman2
1 Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium
2 Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium
Email: chuan.lu@esat.kuleuven.ac.be
Table: Demographic, serum marker, color Doppler imaging (CDI) and morphologic variables (mean ± SD, or % of cases).

  Variable (symbol)                    Benign        Malignant
  Demographic
    Age (age)                          45.6 ± 15.2   56.9 ± 14.6
    Postmenopausal (meno)              31.0 %        66.0 %
  Serum marker
    CA 125 (log) (l_ca125)             3.0 ± 1.2     5.2 ± 1.5
  CDI
    High color score (colsc3,4)        19.0 %        77.3 %
  Morphologic
    Abdominal fluid (asc)              32.7 %        67.3 %
    Bilateral mass (bilat)             13.3 %        39.0 %
    Unilocular cyst (un)               45.8 %         5.0 %
    Multiloc/solid cyst (mulsol)       10.7 %        36.2 %
    Solid (sol)                         8.3 %        37.6 %
    Smooth wall (smooth)               56.8 %         5.7 %
    Irregular wall (irreg)             33.8 %        73.2 %
    Papillations (pap)                 12.5 %        53.2 %
1. Introduction
Ovarian masses are a common problem in gynecology. A reliable test for preoperative discrimination between benign and malignant ovarian tumors is of considerable help to clinicians in choosing the appropriate treatment for their patients.
In this work, we develop and evaluate several LS-SVM models within the Bayesian evidence framework to preoperatively predict the malignancy of ovarian tumors. The analysis includes exploratory data analysis, optimal input variable selection, parameter estimation, and performance evaluation via receiver operating characteristic (ROC) curve analysis.
2. Methods
[Figure: biplot of the ovarian tumor data (o: benign case, x: malignant case), visualizing the correlations between the variables and the relations between the variables and the clusters.]

Procedure of developing models to predict the malignancy of ovarian tumors:
- Patient data: University Hospitals Leuven, 1994~1999; 425 records, 25 features; 32% malignant.
- Data exploration: univariate analysis (descriptive statistics, histograms); preprocessing; multivariate analysis (PCA, factor analysis, stepwise logistic regression).
- Input selection: forward selection maximizing the evidence of a Bayesian LS-SVM (RBF, linear).
- Model building: Bayesian LS-SVM classifier (RBF, linear); logistic regression.
- Model evaluation: ROC analysis (AUC); cross-validation (hold-out, k-fold CV).
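The evaluation step in this pipeline rests on ROC analysis. As a minimal illustration (not the authors' code), the AUC can be computed directly from the classifier scores via the rank-sum (Mann-Whitney) formulation:

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a randomly chosen malignant case (label 1)
    receives a higher score than a benign one (label 0); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# perfectly separated toy scores give AUC = 1.0
print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

This rank formulation is threshold-free, which is why the tables below can report one AUC per model alongside accuracy/sensitivity/specificity at several cutoffs.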
LS-SVM Classifier within Bayesian Evidence Framework
Given the training data D = {(x_i, y_i)}_{i=1}^N, with x_i ∈ R^p and y_i ∈ {−1, +1}, the following model is taken:

    y(x) = sign[ w^T φ(x) + b ],

where w and b are found by solving

    min_{w,b,e} J(w, b, e) = (μ/2) w^T w + (ζ/2) Σ_{i=1}^N e_i^2
    s.t. y_i [ w^T φ(x_i) + b ] = 1 − e_i,  i = 1, ..., N,

with regularizer γ = ζ/μ. The problem is solved in the dual space; the resulting LS-SVM classifier is

    y(x) = sign[ Σ_{i=1}^N α_i y_i K(x, x_i) + b ],

where α = [α_1, ..., α_N]^T and b follow from the linear system

    [ 0      Y^T          ] [ b ]   [ 0   ]
    [ Y   Ω + γ^{-1} I_N  ] [ α ] = [ 1_v ],

with Y = [y_1, ..., y_N]^T, 1_v = [1, ..., 1]^T, e = [e_1, ..., e_N]^T, and Ω_ij = y_i y_j φ(x_i)^T φ(x_j) = y_i y_j K(x_i, x_j) by Mercer's theorem, for any positive definite kernel K(·,·):

    RBF:    K(x, z) = exp{ −‖x − z‖^2 / σ^2 }
    Linear: K(x, z) = x^T z

Level 1: infer w, b
Given a model H (kernel parameter, e.g. σ for RBF kernels) and the hyperparameters μ, ζ:

    p(w, b | D, μ, ζ, H) ∝ p(D | w, b, μ, ζ, H) p(w, b | μ, ζ, H) ∝ exp(−J(w, b)),

so the maximum a posteriori estimates of w and b are the solution of the basic LS-SVM classifier.

Posterior class probability:

    p(y | x, D, H) = p(x | y, D, H) p(y) / p(x | D, H),

with p(y) the prior class probability (y = ±1).

Level 2: infer hyperparameters μ, ζ

    p(μ, ζ | D, H) = p(D | μ, ζ, H) p(μ, ζ | H) / p(D | H).

Assuming the prior p(μ, ζ | H) to be separable and uniformly distributed, maximizing the posterior amounts to maximizing the model evidence p(D | μ, ζ, H).

Level 3: compare models

    p(H_j | D) = p(D | H_j) p(H_j) / p(D).

Assuming the prior p(H_j) to be uniformly distributed, choose the H_j which maximizes the evidence p(D | H_j).
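Training an LS-SVM therefore reduces to one linear solve in the dual space. A minimal numpy sketch (gamma plays the role of ζ/μ; the gamma and sigma values are illustrative, not the evidence-optimized ones from the paper):

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    """Solve [[0, Y^T], [Y, Omega + I/gamma]] [b; alpha] = [0; 1_v]."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]          # alpha, b

def lssvm_predict(Xnew, X, y, alpha, b, sigma=1.0):
    # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
    return np.sign(rbf_kernel(Xnew, X, sigma) @ (alpha * y) + b)

# tiny separable example: two benign-like and two malignant-like points
X = np.array([[0.0], [0.2], [2.0], [2.2]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, b = lssvm_train(X, y)
print(lssvm_predict(X, X, y, alpha, b))  # all four training points classified correctly
```

Note that, unlike a standard SVM, every training point gets a nonzero alpha; sparseness is traded for a linear system instead of a QP.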
Input variable selection
Given a certain kernel type, forward selection is performed:
  Initial: start from zero variables.
  Add: at each iteration, add the variable that gives the greatest increase in the current model evidence.
  Stop: when adding any remaining variable no longer increases the model evidence.
[Figure: black-box view of the Bayesian LS-SVM classifier. Inputs: training data D, a new observation x, the kernel type (RBF/linear), and an initial set of {σ_j} for RBF kernels. Outputs: the classification y(x), the posterior class probability p(y = 1 | x, D, H), the model evidence, and the fitted parameters (including b).]

10 variables were selected using an RBF kernel:
l_ca125, pap, sol, colsc3, bilat, meno, asc, shadows, colsc4, irreg
3. Experimental Results
RMI (risk of malignancy index) = score_morph × score_meno × CA125
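As a concrete illustration of this baseline index, a small helper; the 0/1/3 morphologic scoring and the 1/3 menopausal scoring follow the common RMI-I convention, which is an assumption here since the abstract does not spell it out:

```python
def rmi(n_ultrasound_features, postmenopausal, ca125):
    """Risk of malignancy index: RMI = score_morph * score_meno * CA125.
    Assumed RMI-I convention: score_morph = 0, 1 or 3 for 0, 1 or >=2
    suspicious ultrasound features; score_meno = 3 if postmenopausal
    else 1; CA125 is the serum level in U/ml."""
    score_morph = (0, 1, 3)[min(n_ultrasound_features, 2)]
    score_meno = 3 if postmenopausal else 1
    return score_morph * score_meno * ca125

# e.g. 3 suspicious features, postmenopausal, CA125 = 100 U/ml
print(rmi(3, True, 100))  # → 900
```

Because the index is a plain product, a single large CA125 value can dominate it; the learned models above combine many more variables.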
1) Results from temporal validation
Training set: data from the first treated 265 patients.
Test set: data from the latest treated 160 patients.

[Figure: ROC curves on the test set for LS-SVM (RBF), LS-SVM (linear), LR, and RMI.]
Performance on the test set (temporal validation):

  Model type     AUC     Cutoff  Accuracy (%)  Sensitivity (%)  Specificity (%)
  RMI            0.8733  0.4     78.13         74.07            80.19
                         0.3     76.88         81.48            74.53
  LR1            0.9111  0.4     80.63         75.96            83.02
                         0.3     80.63         77.78            82.08
  LS-SVM1 (LIN)  0.9141  0.4     81.25         77.78            83.02
                         0.3     81.88         83.33            81.13
  LS-SVM1 (RBF)  0.9184  0.4     83.13         81.48            83.96
                         0.3     84.38         85.19            83.96
2) Results from randomized cross-validation (30 runs)
The data were randomly separated into a training set (n = 265) and a test set (n = 160), stratified so that #benign : #malignant ≈ 2:1 in each of the training and test sets. This was repeated 30 times.

Averaged performance over the 30 validation runs:

  Model type     AUC (SD)         Cutoff  Accuracy (%)  Sensitivity (%)  Specificity (%)
  RMI            0.8882 (0.0318)  0.5     82.6          81.73            83.06
                                  0.4     81.1          83.87            79.85
  LR1            0.9397 (0.0238)  0.5     83.3          89.33            80.55
                                  0.4     81.9          91.6             77.55
  LS-SVM1 (LIN)  0.9405 (0.0236)  0.5     84.3          87.4             82.91
                                  0.4     82.8          90.47            79.27
  LS-SVM1 (RBF)  0.9424 (0.0232)  0.5     84.9          86.53            84.09
                                  0.4     83.5          90               80.58
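The stratified repeated hold-out described above can be sketched as follows (a minimal stand-alone version, not the authors' code; class counts of 136 malignant / 289 benign are taken from the 32%-malignant figure in the data description):

```python
import random

def stratified_split(labels, n_train, seed):
    """One stratified random train/test split: sample each class
    separately so the benign:malignant ratio (about 2:1 here) is
    preserved in both the training and the test set."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    frac = n_train / len(labels)
    train = []
    for idx in by_class.values():
        rng.shuffle(idx)
        train += idx[: round(len(idx) * frac)]
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

# 30 repeated splits of 425 cases into 265 train / 160 test, as in the paper
labels = [1] * 136 + [0] * 289   # ~32% malignant
splits = [stratified_split(labels, 265, seed) for seed in range(30)]
```

Each run would then retrain the models on the training indices and compute the AUC, accuracy, sensitivity and specificity on the test indices before averaging over the 30 runs.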