Data mining scenarios for the discovery of subtypes and the comparison of algorithms

Colas, F.P.R.

Citation

Colas, F. P. R. (2009, March 4). Data mining scenarios for the discovery of subtypes and the comparison of algorithms. Retrieved from https://hdl.handle.net/1887/13575

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/13575

Note: To cite this publication please use the final published version (if applicable).

Data Mining Scenarios

for the Discovery of Subtypes and the Comparison of Algorithms

DOCTORAL THESIS (PROEFSCHRIFT)

to obtain

the degree of Doctor at Universiteit Leiden,

by authority of the Rector Magnificus Prof. mr. P. F. van der Heijden, in accordance with the decision of the College voor Promoties (Doctorate Board),

to be defended on Wednesday 4 March 2009 at 13:45

by Fabrice Pierre Robert Colas

born in Laval, France, in 1981.

Doctoral committee (Promotiecommissie):

Prof. dr. J.N. Kok, LIACS, Universiteit Leiden (promotor)

Dr. Ingrid Meulenbelt, LUMC

Prof. dr. F. Famili, NRC-IIT, Ottawa, Canada

Prof. dr. T.H.W. Bäck, LIACS, Universiteit Leiden

Prof. dr. G. Rozenberg, LIACS, Universiteit Leiden

Prof. dr. B.R. Katzy, LIACS, Universiteit Leiden

This work was carried out under a grant from the Netherlands BioInformatics Center (NBIC).

Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms

Fabrice Pierre Robert Colas

Thesis, Universiteit Leiden

ISBN 978-90-9023888-3

Copyright © 2008 by Fabrice Pierre Robert Colas

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

Printed in France

Email: fabrice.colas@grano-salis.net

To my parents

Contents

Introduction

Data mining scenarios

Part I: Subtype Discovery by Cluster Analysis

Part II: Automatic Text Classification

Publications

Part I: Subtype Discovery by Cluster Analysis

Part II: Automatic Text Classification

I Subtype Discovery by Cluster Analysis

1 Application Domains

1.1 Introduction

1.2 Osteoarthritis

1.3 Parkinson’s disease

1.4 Drug discovery

1.5 Concluding remarks

2 A Scenario for Subtype Discovery by Cluster Analysis

2.2 Data preparation and clustering

2.2.1 Data preparation

2.2.2 The reliability and validity of a cluster result

2.2.3 Clustering by a mixture of Gaussians

2.3 Model selection

2.3.1 A score to compare models

2.3.2 Valid model selection

2.4 Characterizing, comparing and evaluating cluster results

2.4.1 Visualizing subtypes

2.4.2 Statistical characterization and comparison of subtypes

2.4.3 Statistical evaluation of subtypes

3 Reliability of Cluster Results for Different Time Adjustments

3.2 Methods

3.3 Experimental results

3.4 Why does optimizing the r² not boost the cluster reliability?

4 Subtyping in Osteoarthritis, Parkinson’s disease and Drug Discovery

4.2 Subtyping in Osteoarthritis

4.2.1 Outline of the analysis

4.2.2 Model selection

4.2.3 Subtype characteristics and evaluation

4.3 Subtyping in Parkinson’s disease

4.3.3 Subtype characteristics

4.3.4 Outline of the post hoc-analysis

4.4 Subtyping in drug discovery

4.4.3 Subtype characteristics

5 Scenario Implementation as the R SubtypeDiscovery Package

5.2 Design of the scenario implementation

5.2.1 Methods for data preparation and data specific settings

5.2.2 The dataset class (cdata) and its generic methods

5.2.3 The cluster result class (cresult) and its generic methods

5.2.4 Statistical methods to characterize, compare and evaluate subtypes

5.2.5 Other methods

5.3 Sample analyses

5.3.1 Analysis on the original scores

5.3.2 Analysis on the principal components

II Automatic Text Classification

6 A Scenario for the Comparison of Algorithms in Text Classification

6.2 Conducting fair classifier comparisons

6.3 Classification algorithms

6.3.1 k Nearest Neighbors

6.3.2 Naive Bayes

6.3.3 Support Vector Machines

6.3.4 Implementation of the algorithms

6.4 Definition of the scenario

6.4.1 Evaluation methodology and measures

6.4.2 Dimensions of experimentation

6.5 Experimental data

6.5.1 To study the behaviors of the classifiers

6.5.2 To study the scale-up of SVM in large bag of words feature spaces

7 Comparison of Classifiers

7.3 Parameter optimization

7.3.1 Support Vector Machines

7.3.2 k Nearest Neighbors

7.4 Comparisons for increasing document and feature sizes

7.5 Related work

8 Does SVM Scale up to Large Bag of Words Feature Spaces?

8.3 Best performing SVM

8.4 Nature of SVM solutions

8.5 A performance drop for SVM

8.6 Relating the performance drop to outliers in the data

8.7 Related work

8.8 Concluding remarks

Conclusions

Subtype Discovery by Cluster Analysis

Automatic Text Classification

Appendices

A Two Dimensional Molecular Descriptors

B Additional Results in Text Classification

Bibliography

Samenvatting (Summary)

Curriculum Vitae

Acknowledgements
