

Statistical methods for microarray data

Goeman, Jelle Jurjen

Citation

Goeman, J. J. (2006, March 8). Statistical methods for microarray data.

Retrieved from https://hdl.handle.net/1887/4324

Version:

Corrected Publisher’s Version

License:

Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from:

https://hdl.handle.net/1887/4324


The publication of this thesis was supported by the Fonds Medische Statistiek and by the Thomas Stieltjes Institute for Mathematics.


Statistical Methods for Microarray Data

Pathway Analysis, Prediction Methods and Visualization Tools

DOCTORAL THESIS

to obtain the degree of Doctor at Leiden University,

by authority of the Rector Magnificus Dr. D. D. Breimer, professor in the Faculty of Mathematics and Natural Sciences and in that of Medicine,

according to the decision of the Doctorate Board (College voor Promoties), to be defended on Wednesday 8 March 2006

at 15:15

by

Jelle Jurjen Goeman


DOCTORATE COMMITTEE

SUPERVISORS: Prof. dr. J. C. van Houwelingen

Prof. dr. S. A. van de Geer

· Eidgenössische Technische Hochschule, Zürich

REFEREE: Prof. dr. S. Richardson

· Imperial College, London

OTHER MEMBERS: Prof. dr. C. Kooperberg

· Fred Hutchinson Cancer Research Center, Seattle


Published and submitted chapters

The following chapters have been published in scientific journals or have been submitted for publication:

Chapter 2:

J. J. Goeman, S. A. van de Geer, F. de Kort, and J. C. van Houwelingen (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1), 93–99.

Chapter 3:

J. J. Goeman, J. Oosting, A. M. Cleton-Jansen, J. Anninga, and J. C. van Houwelingen (2005). Testing association of a pathway with survival using gene expression data. Bioinformatics 21 (9), 1950–1957.

Chapter 4:

J. J. Goeman and S. le Cessie. A goodness-of-fit test for multinomial logistic regression. Submitted.

Chapter 5:

J. J. Goeman, S. A. van de Geer, and J. C. van Houwelingen (2006). Testing against a high-dimensional alternative. Journal of the Royal Statistical Society, Series B 68, in press.

Chapter 6:

J. J. Goeman and J. C. van Houwelingen. Model-based dimension reduction for high-dimensional regression. Submitted.

Chapter 7:

P. H. C. Eilers and J. J. Goeman (2004). Enhancing scatterplots with smoothed densities. Bioinformatics 20 (5), 623–628.

Appendix:

J. J. Goeman and J. Oosting (2005). Globaltest: testing association of a group of genes with a clinical variable. R package, version 3.2.0. www.bioconductor.org.


CHAPTER 1

Introduction and overview

The subject of this thesis is the statistical analysis of high-dimensional data. It is motivated by (and primarily focussed on) problems arising from microarray gene expression data, a new type of high-dimensional data, which has become important in many areas of biology and medicine in the last decade.

The thesis is a collection of six articles and a software manual. The articles are self-contained and they can in principle be read in any order. However, such random reading of the chapters would not do justice to the close connections that exist between them, which are partly obscured by the fact that the articles were written for different journals and therefore for different audiences with different backgrounds and interests.

The objective of this introduction is to provide the context in which the papers should be read and to make the connections between the different chapters more explicit. It is not meant to be a full review of microarray data and their analysis; such reviews can be found, for example, in Díaz-Uriarte (2005), Speed (2003) and Simon et al. (2003). Section 1.1 gives a short introduction to gene expression data and the biological and clinical questions arising from them. Section 1.2 reviews some of the statistical methods that have been developed in recent years to address these questions. Section 1.3 examines the contribution of this thesis to the field.

1.1 Biological context

The microarray is a recent technology from molecular biology which was first developed around 1995 (Schena et al., 1995), and became widely available around the turn of the century (see Ewis et al., 2005, for a short history). The microarray is designed to measure the activity of gene expression, which is the process by which the genetic information in DNA is used to make proteins. Microarray technology gives the biologist the potential to greatly increase knowledge of the functions of genes and of the biology of disease, and it allows improvements in the diagnosis and prognosis of patients.