Statistical methods for microarray data
Goeman, Jelle Jurjen
Citation
Goeman, J. J. (2006, March 8). Statistical methods for microarray data.
Retrieved from https://hdl.handle.net/1887/4324
Version:
Corrected Publisher’s Version
License:
Licence agreement concerning inclusion of doctoral
thesis in the Institutional Repository of the University
of Leiden
The publication of this thesis was supported by the Fonds Medische Statistiek and by the Thomas Stieltjes Institute for Mathematics.
Statistical Methods
for Microarray Data
Pathway Analysis, Prediction Methods and
Visualization Tools
THESIS
to obtain the degree of Doctor at Leiden University,
by authority of the Rector Magnificus Dr. D. D. Breimer, professor in the Faculty of Mathematics and Natural Sciences and that of Medicine,
by decision of the Doctorate Board, to be defended on Wednesday 8 March 2006
at 15:15
by
DOCTORATE COMMITTEE
SUPERVISORS: Prof. dr. J. C. van Houwelingen
Prof. dr. S. A. van de Geer
· Eidgenössische Technische Hochschule, Zürich
REFEREE: Prof. dr. S. Richardson
· Imperial College, London
OTHER MEMBERS: Prof. dr. C. Kooperberg
· Fred Hutchinson Cancer Research Center, Seattle
Contents
1 Introduction and overview
1.1 Biological context
1.2 Statistical context
1.3 This thesis
2 Testing Association of a Pathway with a Clinical Variable
2.1 Introduction
2.2 The data
2.3 The model
2.4 The score test
2.5 Properties of the test
2.6 Some technical adjustments
2.7 Handling small sample size
2.8 Handling missing values
2.9 Application: AML/ALL
2.10 Application: Heat Shock
2.11 Discussion
3 Testing Association of a Pathway with Survival
3.1 Introduction
3.2 The model
3.3 Derivation of the test
3.4 Interpretation
3.5 Application: osteosarcoma data
3.6 Discussion
4 A goodness-of-fit test for multinomial logistic regression
4.1 Introduction
4.2 The multinomial logistic regression model
4.3 Testing goodness-of-fit by smoothing
4.4 Distribution of the test statistic
4.5 Testing for the presence of a random effect
4.6 Connection to binary logistic regression
4.7 Simulation results
4.8 Application: liver enzyme data
4.9 Discussion
4.10 Variance of the test statistic
4.11 Derivation of the test statistic
5 Testing against a high-dimensional alternative
5.1 Introduction
5.2 Empirical Bayes testing
5.3 The locally most powerful test
5.4 Nuisance parameters
5.5 Distribution of the test statistic
5.6 The linear model
5.7 Power of the score test
5.8 A new look at the F-test
5.9 Sparse alternatives
5.10 Simulations
5.11 Discussion
5.12 Proofs of the lemmas
6 Model-based dimension reduction
6.1 Introduction
6.2 Bias and variance
6.3 A basic joint model
6.4 Regression
6.5 Easy prediction
6.6 Estimation
6.7 Prediction
6.8 Supervised Principal Components
6.9 Application
6.10 Discussion
6.11 Proofs of the theorems
A Manual of the GlobalTest package
A.1 Introduction
A.2 Global testing of a single pathway
A.3 Multiple global testing
A.4 Diagnostic plots
Samenvatting (summary in Dutch)
Bibliography
Curriculum Vitae
Published and submitted chapters
The following chapters have been published in scientific journals or have been submitted for publication:
Chapter 2:
J. J. Goeman, S. A. van de Geer, F. de Kort, and J. C. van Houwelingen (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1), 93–99.
Chapter 3:
J. J. Goeman, J. Oosting, A. M. Cleton-Jansen, J. Anninga, and J. C. van Houwelingen (2005). Testing association of a pathway with survival using gene expression data. Bioinformatics 21 (9), 1950–1957.
Chapter 4:
J. J. Goeman and S. le Cessie. A goodness-of-fit test for multinomial logistic regression. submitted.
Chapter 5:
J. J. Goeman, S. A. van de Geer, and J. C. van Houwelingen (2006). Testing against a high-dimensional alternative. Journal of the Royal Statistical Society, Series B 68, in press.
Chapter 6:
J. J. Goeman and J. C. van Houwelingen. Model-based dimension reduction for high-dimensional regression. submitted.
Chapter 7:
P. H. C. Eilers and J. J. Goeman (2004). Enhancing scatterplots with smoothed densities. Bioinformatics 20 (5), 623–628.
Appendix:
J. J. Goeman and J. Oosting (2005). Globaltest: testing association of a group of genes with a clinical variable. R package, version 3.2.0. www.bioconductor.org.
CHAPTER 1
Introduction and overview
The subject of this thesis is the statistical analysis of high-dimensional data. It is motivated by (and primarily focussed on) problems arising from microarray gene expression data, a new type of high-dimensional data, which has become important in many areas of biology and medicine in the last decade.
The thesis is a collection of six articles and a software manual. The articles are self-contained and they can in principle be read in any order. However, such random reading of the chapters would not do justice to the close connections that exist between them, which are partly obscured by the fact that the articles were written for different journals and therefore for different audiences with different backgrounds and interests.
The objective of this introduction is to provide the context in which the papers should be read and to make the connections between the different chapters more explicit. It is not meant to be a full review of microarray data and their analysis. Such reviews can be found, for example, in Díaz-Uriarte (2005), Speed (2003) and Simon et al. (2003). Section 1.1 gives a short introduction to gene expression data and to the biological and clinical questions arising from them. Section 1.2 reviews some of the statistical methods that have been developed in recent years to address these questions. Section 1.3 examines the contribution of this thesis to the field.
1.1 Biological context
The microarray is a recent technology from molecular biology that was first developed around 1995 (Schena et al., 1995) and became widely available around the turn of the century (see Ewis et al., 2005, for a short history). The microarray is designed to measure the activity of gene expression, the process by which the genetic information in DNA is used to make proteins. Microarray technology gives the biologist the potential to greatly increase knowledge of the functions of genes and of the biology of disease, and to improve the diagnosis and prognosis of patients.