Optimizing variable selection and cost using a genetic algorithm for modelling adnexal masses with Bayesian networks
Olivier Gevaert1, Bart De Moor1, Dirk Timmerman2
1Dept. Electrical Engineering, ESAT-SCD-BioI, Katholieke Universiteit Leuven, Belgium.
2Division of Gynaecological Oncology, Department of Obstetrics and Gynaecology, UZ Gasthuisberg, Katholieke Universiteit Leuven, Leuven, Belgium
Statistical methods have already proven their usefulness in the diagnosis and prognosis of complex diseases. When the number of variables becomes large or the relationships between the variables become complex, advanced methods are needed to build models that help clinicians make reliable predictions of the outcome. Possible methods include logistic regression (Timmerman et al., 2005), Support Vector Machines (Pochet and Suykens, 2006; De Smet et al., 2006) and Bayesian networks (Gevaert et al., 2006). These statistical or mathematical methods provide a less biased way of analysing clinical data and allow a large number of variables to be modelled.
We have used Bayesian networks to model the malignancy of adnexal masses based on clinical data. Bayesian networks offer an alternative way of modelling clinical data, since a Bayesian network can explain its reasoning. We combined Bayesian networks with a genetic algorithm that can add variables to or remove variables from the model, thereby performing variable selection. Moreover, since each variable had a cost associated with it, we concurrently minimized the cost of the selected variables. The cost of each variable was specified by a gynaecological expert and reflected a combination of the subjectivity, the financial cost and the time needed to measure that variable.
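The combination of variable selection and cost minimization described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function `run_ga`, the parameter `cost_weight`, the weighted-sum fitness, and the toy scoring function are all assumptions made for illustration. In particular, `toy_score` is only a stand-in for the predictive performance of a Bayesian network trained on the selected variables, and the abstract does not state how performance and cost were combined.

```python
import random

def run_ga(n_vars, costs, score_fn, cost_weight=0.1, pop_size=30,
           generations=40, mutation_rate=0.05, seed=0):
    """Hypothetical GA: each individual is a list of booleans, one per variable."""
    rng = random.Random(seed)
    total_cost = sum(costs)

    def fitness(mask):
        # An empty variable set cannot be used to build a model.
        if not any(mask):
            return float("-inf")
        cost = sum(c for c, keep in zip(costs, mask) if keep)
        # Assumed trade-off: performance minus a normalised cost penalty.
        return score_fn(mask) - cost_weight * cost / total_cost

    def tournament(population):
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    population = [[rng.random() < 0.5 for _ in range(n_vars)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual over
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            cut = rng.randrange(1, n_vars)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation adds or removes a variable by flipping its bit.
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)

# Toy setup (all values invented): variables 0-2 are informative and cheap,
# variables 3-5 are uninformative and expensive.
toy_costs = [1, 1, 1, 5, 5, 5]

def toy_score(mask):
    # Stand-in for model performance; a real run would train and
    # evaluate a Bayesian network on the selected variables here.
    return 0.5 + 0.15 * sum(1 for i in (0, 1, 2) if mask[i])

best_mask, best_fitness = run_ga(len(toy_costs), toy_costs, toy_score)
print(best_mask, round(best_fitness, 3))
```

Encoding each candidate model as a bit mask makes "adding" and "removing" a variable a single bit flip, so the standard crossover and mutation operators apply directly. The weight given to cost relative to performance is a design choice that would need to be tuned to the clinical setting.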
This method was applied to the data resulting from the International Ovarian Tumour Analysis (IOTA) multicenter study on the pre-operative characterisation of ovarian tumours. The first phase of IOTA, which was initiated in 1998, was finished in 2002 and resulted in a data set consisting of 1346 masses and 1152 patients with 68 variables. We used this data set to construct a Bayesian network and to predict the malignancy of ovarian masses while optimizing variable selection and cost. The results showed that the performance was similar to that obtained with all variables, which means that a subset of less “costly” variables can be chosen without losing prediction accuracy. The developed models will be tested prospectively on new patients whose data are being collected in the next phase of IOTA.
Reference List
De Smet,F., De Brabanter,J., Van den Bosch,T., Pochet,N., Amant,F., Van Holsbeke,C., Moerman,P., De Moor,B., Vergote,I., and Timmerman,D. (2006). New models to predict depth of infiltration in endometrial carcinoma based on transvaginal sonography. Ultrasound Obstet Gynecol 27, 664-671.
Gevaert,O., De Smet,F., Kirk,E., Van Calster,B., Bourne,T., Van Huffel,S., Moreau,Y., Timmerman,D., De Moor,B., and Condous,G. (2006). Predicting the outcome of pregnancies of unknown location: Bayesian networks with expert prior information compared to logistic regression. Human Reproduction 21.
Pochet,N.L. and Suykens,J.A. (2006). Support vector machines versus logistic regression: improving prospective performance in clinical decision-making. Ultrasound Obstet Gynecol 27, 607-608.
Timmerman,D., Testa,A.C., Bourne,T., Ferrazzi,E., Ameye,L., Konstantinovic,M.L., Van Calster,B., Collins,W.P., Vergote,I., Van Huffel,S., and Valentin,L. (2005). Logistic regression model to distinguish between the benign and malignant adnexal mass before surgery: a multicenter study by the International Ovarian Tumor Analysis Group. J Clin Oncol 23, 8794-8801.