Data mining scenarios for the discovery of subtypes and the comparison of algorithms
Colas, F.P.R.
Citation
Colas, F. P. R. (2009, March 4). Data mining scenarios for the discovery of
subtypes and the comparison of algorithms. Retrieved from https://hdl.handle.net/1887/13575
Version: Corrected Publisher’s Version
License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden
Downloaded from: https://hdl.handle.net/1887/13575
Note: To cite this publication please use the final published version (if
applicable).
Data Mining Scenarios
for the Discovery of Subtypes and
the Comparison of Algorithms
DOCTORAL THESIS
to obtain
the degree of Doctor at Universiteit Leiden,
by authority of the Rector Magnificus Prof. mr. P. F. van der Heijden, in accordance with the decision of the Doctorate Board (College voor Promoties),
to be defended on Wednesday 4 March 2009 at 13:45
by Fabrice Pierre Robert Colas
born in Laval, France, in 1981.
Doctoral committee (Promotiecommissie):
Prof. dr. J.N. Kok, LIACS, Universiteit Leiden (promotor)
Dr. Ingrid Meulenbelt, LUMC
Prof. dr. F. Famili, NRC-IIT, Ottawa, Canada
Prof. dr. T.H.W. Bäck, LIACS, Universiteit Leiden
Prof. dr. G. Rozenberg, LIACS, Universiteit Leiden
Prof. dr. B.R. Katzy, LIACS, Universiteit Leiden
This work was carried out under a grant from the Netherlands BioInformatics Center (NBIC).
Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms
Fabrice Pierre Robert Colas
Thesis Universiteit Leiden
ISBN 978-90-9023888-3
Copyright © 2008 by Fabrice Pierre Robert Colas
All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.
Printed in France
Email: fabrice.colas@grano-salis.net
To my parents
Contents
Introduction
Data mining scenarios
Part I: Subtype Discovery by Cluster Analysis
Part II: Automatic Text Classification
Publications
Part I: Subtype Discovery by Cluster Analysis
Part II: Automatic Text Classification

I Subtype Discovery by Cluster Analysis

1 Application Domains
1.1 Introduction
1.2 Osteoarthritis
1.3 Parkinson’s disease
1.4 Drug discovery
1.5 Concluding remarks

2 A Scenario for Subtype Discovery by Cluster Analysis
2.1 Introduction
2.2 Data preparation and clustering
2.2.1 Data preparation
2.2.2 The reliability and validity of a cluster result
2.2.3 Clustering by a mixture of Gaussians
2.3 Model selection
2.3.1 A score to compare models
2.3.2 Valid model selection
2.4 Characterizing, comparing and evaluating cluster results
2.4.1 Visualizing subtypes
2.4.2 Statistical characterization and comparison of subtypes
2.4.3 Statistical evaluation of subtypes
2.5 Concluding remarks

3 Reliability of Cluster Results for Different Time Adjustments
3.1 Introduction
3.2 Methods
3.3 Experimental results
3.4 Why does optimizing the r² not boost the cluster reliability?
3.5 Concluding remarks

4 Subtyping in Osteoarthritis, Parkinson’s disease and Drug Discovery
4.1 Introduction
4.2 Subtyping in Osteoarthritis
4.2.1 Outline of the analysis
4.2.2 Model selection
4.2.3 Subtype characteristics and evaluation
4.3 Subtyping in Parkinson’s disease
4.3.1 Outline of the analysis
4.3.2 Model selection
4.3.3 Subtype characteristics
4.3.4 Outline of the post hoc analysis
4.4 Subtyping in drug discovery
4.4.1 Outline of the analysis
4.4.2 Model selection
4.4.3 Subtype characteristics
4.5 Concluding remarks

5 Scenario Implementation as the R SubtypeDiscovery Package
5.1 Introduction
5.2 Design of the scenario implementation
5.2.1 Methods for data preparation and data specific settings
5.2.2 The dataset class (cdata) and its generic methods
5.2.3 The cluster result class (cresult) and its generic methods
5.2.4 Statistical methods to characterize, compare and evaluate subtypes
5.2.5 Other methods
5.3 Sample analyses
5.3.1 Analysis on the original scores
5.3.2 Analysis on the principal components
5.4 Concluding remarks

II Automatic Text Classification

6 A Scenario for the Comparison of Algorithms in Text Classification
6.1 Introduction
6.2 Conducting fair classifier comparisons
6.3 Classification algorithms
6.3.1 k Nearest Neighbors
6.3.2 Naive Bayes
6.3.3 Support Vector Machines
6.3.4 Implementation of the algorithms
6.4 Definition of the scenario
6.4.1 Evaluation methodology and measures
6.4.2 Dimensions of experimentation
6.5 Experimental data
6.5.1 To study the behaviors of the classifiers
6.5.2 To study the scale-up of SVM in large bag of words feature spaces
6.6 Concluding remarks

7 Comparison of Classifiers
7.1 Introduction
7.2 Experimental data
7.3 Parameter optimization
7.3.1 Support Vector Machines
7.3.2 k Nearest Neighbors
7.4 Comparisons for increasing document and feature sizes
7.5 Related work
7.6 Concluding remarks

8 Does SVM Scale up to Large Bag of Words Feature Spaces?
8.1 Introduction
8.2 Experimental data
8.3 Best performing SVM
8.4 Nature of SVM solutions
8.5 A performance drop for SVM
8.6 Relating the performance drop to outliers in the data
8.7 Related work
8.8 Concluding remarks

Conclusions
Subtype Discovery by Cluster Analysis
Automatic Text Classification