
Data mining scenarios for the discovery of subtypes and the comparison of algorithms

Colas, F.P.R.

Citation

Colas, F. P. R. (2009, March 4). Data mining scenarios for the discovery of subtypes and the comparison of algorithms. Retrieved from https://hdl.handle.net/1887/13575

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/13575

Note: To cite this publication please use the final published version (if applicable).


Data Mining Scenarios

for the Discovery of Subtypes and the Comparison of Algorithms

DOCTORAL THESIS

to obtain

the degree of Doctor at Universiteit Leiden,

by authority of the Rector Magnificus Prof. mr. P. F. van der Heijden, by decision of the Doctorate Board (College voor Promoties),

to be defended on Wednesday 4 March 2009 at 13:45

by Fabrice Pierre Robert Colas

born in Laval, France, in 1981.


Doctoral committee (Promotiecommissie):

Prof. dr. J.N. Kok, LIACS, Universiteit Leiden (Promotor)

Dr. Ingrid Meulenbelt, LUMC

Prof. dr. F. Famili, NRC-IIT, Ottawa, Canada

Prof. dr. T.H.W. Bäck, LIACS, Universiteit Leiden

Prof. dr. G. Rozenberg, LIACS, Universiteit Leiden

Prof. dr. B.R. Katzy, LIACS, Universiteit Leiden

This work was carried out under a grant from the Netherlands BioInformatics Center (NBIC).

Data Mining Scenarios for the Discovery of Subtypes and the Comparison of Algorithms

Fabrice Pierre Robert Colas. Thesis, Universiteit Leiden. ISBN 978-90-9023888-3

Copyright © 2008 by Fabrice Pierre Robert Colas

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

Printed in France

Email: fabrice.colas@grano-salis.net


To my parents


Contents

Introduction
  Data mining scenarios
    Part I: Subtype Discovery by Cluster Analysis
    Part II: Automatic Text Classification
  Publications
    Part I: Subtype Discovery by Cluster Analysis
    Part II: Automatic Text Classification

I Subtype Discovery by Cluster Analysis

1 Application Domains
1.1 Introduction
1.2 Osteoarthritis
1.3 Parkinson’s disease
1.4 Drug discovery
1.5 Concluding remarks

2 A Scenario for Subtype Discovery by Cluster Analysis
2.1 Introduction
2.2 Data preparation and clustering
2.2.1 Data preparation
2.2.2 The reliability and validity of a cluster result
2.2.3 Clustering by a mixture of Gaussians
2.3 Model selection
2.3.1 A score to compare models
2.3.2 Valid model selection
2.4 Characterizing, comparing and evaluating cluster results
2.4.1 Visualizing subtypes
2.4.2 Statistical characterization and comparison of subtypes
2.4.3 Statistical evaluation of subtypes
2.5 Concluding remarks

3 Reliability of Cluster Results for Different Time Adjustments
3.1 Introduction
3.2 Methods
3.3 Experimental results
3.4 Why does optimizing the r² not boost the cluster reliability?
3.5 Concluding remarks

4 Subtyping in Osteoarthritis, Parkinson’s disease and Drug Discovery
4.1 Introduction
4.2 Subtyping in Osteoarthritis
4.2.1 Outline of the analysis
4.2.2 Model selection
4.2.3 Subtype characteristics and evaluation
4.3 Subtyping in Parkinson’s disease
4.3.1 Outline of the analysis
4.3.2 Model selection
4.3.3 Subtype characteristics
4.3.4 Outline of the post hoc analysis
4.4 Subtyping in drug discovery
4.4.1 Outline of the analysis
4.4.2 Model selection
4.4.3 Subtype characteristics
4.5 Concluding remarks

5 Scenario Implementation as the R SubtypeDiscovery Package
5.1 Introduction
5.2 Design of the scenario implementation
5.2.1 Methods for data preparation and data specific settings
5.2.2 The dataset class (cdata) and its generic methods
5.2.3 The cluster result class (cresult) and its generic methods
5.2.4 Statistical methods to characterize, compare and evaluate subtypes
5.2.5 Other methods
5.3 Sample analyses
5.3.1 Analysis on the original scores
5.3.2 Analysis on the principal components
5.4 Concluding remarks

II Automatic Text Classification

6 A Scenario for the Comparison of Algorithms in Text Classification
6.1 Introduction
6.2 Conducting fair classifier comparisons
6.3 Classification algorithms
6.3.1 k Nearest Neighbors
6.3.2 Naive Bayes
6.3.3 Support Vector Machines
6.3.4 Implementation of the algorithms
6.4 Definition of the scenario
6.4.1 Evaluation methodology and measures
6.4.2 Dimensions of experimentation
6.5 Experimental data
6.5.1 To study the behaviors of the classifiers
6.5.2 To study the scale-up of SVM in large bag of words feature spaces
6.6 Concluding remarks

7 Comparison of Classifiers
7.1 Introduction
7.2 Experimental data
7.3 Parameter optimization
7.3.1 Support Vector Machines
7.3.2 k Nearest Neighbors
7.4 Comparisons for increasing document and feature sizes
7.5 Related work
7.6 Concluding remarks

8 Does SVM Scale up to Large Bag of Words Feature Spaces?
8.1 Introduction
8.2 Experimental data
8.3 Best performing SVM
8.4 Nature of SVM solutions
8.5 A performance drop for SVM
8.6 Relating the performance drop to outliers in the data
8.7 Related work
8.8 Concluding remarks

Conclusions
  Subtype Discovery by Cluster Analysis
  Automatic Text Classification

Appendices
A Two Dimensional Molecular Descriptors
B Additional Results in Text Classification

Bibliography

Samenvatting (Summary in Dutch)

Curriculum Vitae

Acknowledgements
