
Statistical methods for microarray data

Goeman, Jelle Jurjen

Citation

Goeman, J. J. (2006, March 8). Statistical methods for microarray data. Retrieved from https://hdl.handle.net/1887/4324

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/4324


The publication of this thesis was supported by the Fonds Medische Statistiek and by the Thomas Stieltjes Institute for Mathematics.


Statistical Methods for Microarray Data

Pathway Analysis, Prediction Methods and Visualization Tools

Thesis

to obtain the degree of Doctor at Leiden University,

on the authority of the Rector Magnificus Dr. D. D. Breimer, professor in the Faculty of Mathematics and Natural Sciences and that of Medicine,

by decision of the Doctorate Board, to be defended on Wednesday 8 March 2006

at 15:15

by


Doctoral Committee

Supervisors: Prof. dr. J. C. van Houwelingen

Prof. dr. S. A. van de Geer

· Eidgenössische Technische Hochschule, Zürich

Referee: Prof. dr. S. Richardson

· Imperial College, London

Other members: Prof. dr. C. Kooperberg

· Fred Hutchinson Cancer Research Center, Seattle


Contents

1 Introduction and overview
1.1 Biological context
1.2 Statistical context
1.3 This thesis

2 Testing Association of a Pathway with a Clinical Variable
2.1 Introduction
2.2 The data
2.3 The model
2.4 The score test
2.5 Properties of the test
2.6 Some technical adjustments
2.7 Handling small sample size
2.8 Handling missing values
2.9 Application: AML/ALL
2.10 Application: Heat Shock
2.11 Discussion

3 Testing Association of a Pathway with Survival
3.1 Introduction
3.2 The model
3.3 Derivation of the test
3.4 Interpretation
3.5 Application: osteosarcoma data
3.6 Discussion

4 A goodness-of-fit test for multinomial logistic regression
4.1 Introduction
4.2 The multinomial logistic regression model
4.3 Testing goodness-of-fit by smoothing
4.4 Distribution of the test statistic
4.5 Testing for the presence of a random effect
4.6 Connection to binary logistic regression
4.7 Simulation results
4.8 Application: liver enzyme data
4.9 Discussion
4.10 Variance of the test statistic
4.11 Derivation of the test statistic

5 Testing against a high-dimensional alternative
5.1 Introduction
5.2 Empirical Bayes testing
5.3 The locally most powerful test
5.4 Nuisance parameters
5.5 Distribution of the test statistic
5.6 The linear model
5.7 Power of the score test
5.8 A new look at the F-test
5.9 Sparse alternatives
5.10 Simulations
5.11 Discussion
5.12 Proofs of the lemmas

6 Model-based dimension reduction
6.1 Introduction
6.2 Bias and variance
6.3 A basic joint model
6.4 Regression
6.5 Easy prediction
6.6 Estimation
6.7 Prediction
6.8 Supervised Principal Components
6.9 Application
6.10 Discussion
6.11 Proofs of the theorems

A Manual of the GlobalTest package
A.1 Introduction
A.2 Global testing of a single pathway
A.3 Multiple global testing
A.4 Diagnostic plots

Summary (in Dutch)

Bibliography

Curriculum Vitae


Published and submitted chapters

The following chapters have been published in scientific journals or have been submitted for publication:

Chapter 2:

J. J. Goeman, S. A. van de Geer, F. de Kort, and J. C. van Houwelingen (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1), 93–99.

Chapter 3:

J. J. Goeman, J. Oosting, A. M. Cleton-Jansen, J. Anninga, and J. C. van Houwelingen (2005). Testing association of a pathway with survival using gene expression data. Bioinformatics 21 (9), 1950–1957.

Chapter 4:

J. J. Goeman and S. le Cessie. A goodness-of-fit test for multinomial logistic regression. Submitted.

Chapter 5:

J. J. Goeman, S. A. van de Geer, and J. C. van Houwelingen (2006). Testing against a high-dimensional alternative. Journal of the Royal Statistical Society, Series B 68, in press.

Chapter 6:

J. J. Goeman and J. C. van Houwelingen. Model-based dimension reduction for high-dimensional regression. Submitted.

Chapter 7:

P. H. C. Eilers and J. J. Goeman (2004). Enhancing scatterplots with smoothed densities. Bioinformatics 20 (5), 623–628.

Appendix:

J. J. Goeman and J. Oosting (2005). Globaltest: testing association of a group of genes with a clinical variable. R package, version 3.2.0. www.bioconductor.org.


Chapter 1

Introduction and overview

The subject of this thesis is the statistical analysis of high-dimensional data. It is motivated by (and primarily focussed on) problems arising from microarray gene expression data, a new type of high-dimensional data, which has become important in many areas of biology and medicine in the last decade.

The thesis is a collection of six articles and a software manual. The articles are self-contained and they can in principle be read in any order. However, such random reading of the chapters would not do justice to the close connections that exist between them, which are partly obscured by the fact that the articles were written for different journals and therefore for different audiences with different backgrounds and interests.

The objective of this introduction is to provide the context in which the papers should be read and to make the connections between the different chapters more explicit. It is not meant to be a full review of microarray data and their analysis. Such reviews can be found, for example, in Díaz-Uriarte (2005), Speed (2003) and Simon et al. (2003). Section 1.1 gives a short introduction to gene expression data and the biological and clinical questions arising from them. Section 1.2 reviews some of the statistical methods that have been developed in recent years to address these questions. Section 1.3 examines the contribution of this thesis to the field.

1.1 Biological context

The microarray is a recent technology from molecular biology that was first developed around 1995 (Schena et al., 1995) and became widely available around the turn of the century (see Ewis et al., 2005, for a short history). The microarray is designed to measure the activity of gene expression, which is the process by which the genetic information in DNA is used to make proteins. Microarray technology gives the biologist the potential to greatly increase knowledge of the functions of genes and of the biology of disease, and it may also allow improvements in the diagnosis and prognosis of patients.
