Microarray data analysis

(1)

Microarray data analysis

Yves Moreau

(2)

2 Microarrays



Genome-wide monitoring of gene activities by measurement of the levels of RNA transcripts



Massively parallel



Fully automated



Standardizable

(3)

3 Applications

 Identification of regulatory mechanisms

 Modeling of genetic networks for function identification and transgenesis

 Drug development: study gene expression in

 Disease

 Model systems

 Pathogens

 Response to drug treatment

 Diagnosis

 Survey of Single Nucleotide Polymorphisms

 Genotyping for drug tayloring

(4)

4 cDNA microarray

http://www.bio.davidson.edu/

courses/genomics/chip/chip.html

(5)

1. Collecting sample

2. Extracting mRNA

3. Labeling

4. Hybridizing

5. Scanning

6. Visualizing

(6)

6 cDNA microarrays

 Red: high in test, low in reference

 Green: low in test, high in reference

 Yellow: high in test, high in reference

 Black: low in test, low in reference

(7)

7 Microarray production

Clones

Plasmide preparation

PCR amplification

Reordering

Spotting

Zoom - pins

(8)

8 DNA chips

 Silicon substrate

 In situ synthesis of oligonucleotides (25 bases) by photolithography

 Multiple probes per gene (match + mismatch)

 Ultra-high density possible

(9)

9 DNA chip fabrication

(10)

10 DNA chip cDNA microarray

Examples of scanner outputs

(11)

11 Relative vs. absolute expression



Oligonucleotide arrays

 Single measurement per feature

 Absolute expression level



cDNA arrays

 Two samples per experiment

 Test

 Reference

 Measurement is ratio of gene expression in test sample vs.

reference sample

 Increases reproducibility

 More difficult to handle during analysis

(12)

12 Types of experiments

 Wild-type vs. mutant

 Knock-out, conditional knock-out

 Overexpression construct, inducible overexpression

 Detection of the targets of a transcription factor

 Groups of patient samples

 Different types of tumors

 Multiple conditions

 Expression patterns in the presence of drugs or toxins

 Time-course experiment

 Response to a signal

 Stress

 Development

 Design of experiment

(13)

13 Preprocessing

(14)

14 Simplified flowchart



Ratio-based analysis



Analysis of variance

Filtering Normalisation

Ratio

Log transformation

Statistical test

(15)

15 Normalization



Correction for intensity-dependent effect



M-A plot (M = log

₂

(R/G), A = ½log*

₂

(R.G))

(16)

16 Log transformation



Multiplicative versus additive error



Equal treatment of over- and underexpression

(17)

17 Statistical analysis

(18)

18 Significance testing: t-test



Two groups t and c

 Repetition of comparison experiments

 e.g., benign vs. malignant tumor, diauxic shift



Data normally distributed

 Absolute expression is not normally distributed

 R/G, logR/logG, logR/G approximately normally distributed



t-test (Student distribution)

 Empiric mean m

 Estimate s of standard deviation

 Number of experiments n



Number of repetition often too low => Bayesian t-test

c c t

t

c t

n s n

s

m t m

₂ ₂



 

(19)

19 Significance testing: Bayesian t-test



Gene expression of gene gen i at replicate k is normally distributed



Parameters of the normal distribution are themselves random variables (hierarchical model)

) ,

; (

)

( g

_ik^t

N g

_ik^t _i^t _i^t

P   

gamma inverse

Scaled

) ,

; (

) (

) /

,

; ( )

| (

) (

)

| ( )

, (

2 0 0

2 2

0 2

2 2

2

























I P

N P

P P

P



(20)

20 Significance testing: Bayesian t-test



Computation of the distribution of the posterior for the mean and standard deviation



Priors

 0 prior mean

 0 weight (“pseudocount”) for prior mean

 0 prior standard deviation

 0 weight for prior standard deviation



For the t-test, it is primarily the improvement of the

estimate of the standard deviation that is of importance

 0 = m

(21)

21 Significance testing: Bayesian t-test



Posterior mean estimate of the mean and standard deviation



Effect is similar to the use of 

0

pseudodata with standard deviation 

₀



Regularized estimates are then used in t-test



Little difference if more than 5 replicas 2

) 1 (

0

2 2

0 2 0







 



n

s s n

m m

p p







(22)

22 Clustering

(23)

23 Goal of clustering



Microarray data



Form coherent groups of

 Genes

 Patient samples (e.g., tumors)

 Drug or toxin response



Study these groups to get insight into biological processes

 Genes in same clusters can have the same function or same regulation

(24)

24 Identifying prevalent expression patterns (clusters)

(25)

25

original coordinate system original coordinate system

new coordinate system

Principal Component Analysis



PCA detects directions that explain most of the

variation of the data

(26)

26

5

2 4

1 3

Hierarchical clustering

3

1

4 2

5

Distance between linked clusters

Dendrogram

(27)

27 Hierarchical clustering

 Construction of gene tree based on correlation matrix

(28)

28 K-means



Initialization

 Choose the number of clusters K and start from random positions for the K centers



Iteration

 Assign points to the closest center

 Move each center to the center of mass of the assigned points



Termination

 Stop when the centers

have converged or maximum number of iterations

Initialization

(29)

29 Iteration 1

K-means



Initialization



Iteration



Termination

(30)

30 Iteration 1

K-means



Initialization



Iteration



Termination

(31)

31 Iteration 3

K-means



Initialization



Iteration



Termination

(32)

32 EM clustering

(33)

33 EM clustering



Isotropic d-dimensional Gaussian mixture



Hidden observations label to which component/cluster a data point belongs with



EM algorithm

EM

EM 1

EM

arg max ( )

arg max ( | , ) ln ( , | )

i

n

i

n n i n n

n m

Q

P m x P x m

 



 





 

1 1

1 1 1

( | ) ( ; , )

with ( ,..., ; ,..., ; ,..., )

n K

j j j

n j

K K K

p x w N x I

w w

  

    

 





1

,...,

_n

{1,..., }

c c  K

( | )

j n

w  P c  j 

(34)

34 EM clustering



Expectation step

 

EM

1 1

( ) ( | , ) ln ( | , ).

i

N K

n n i n n j

n j

Q

_

 P c j x  p x c j  w

 

   

EM 1 1 1

2 EM

1 1 2

( ) with ( ,..., ; ,..., ; ,..., )

( | , ) ln ln Cst

2

i K K K

N K

n j

n n i j j

n j j

Q w w

P c j x w d x



     

  

 





  

 

      

 

 



(35)

35 Optimization with Lagrange multipliers

 Constraint

 Lagrange multipliers

 Optimization w.r.t.  gives back the constraint

 Optimization w.r.t. ,  , AND wj (hidden observations!!!)

j

1

j

w 



( ) ( )

_j

1

j

Q   Q    ^  w  ^ 

  



( ) 0 for all 1,..., ( ) 0 for all 1,...,

j

Q j K









  



  





( ) 0 for all 1,...,

j

Q j K

w



  





(36)

36 Lagrange multipliers



Elimination of the Lagrange multipliers and estimation of the probability of each component

1

( ) 0 for all 1,...,

( | , )

0 (*) partial derivative ( | , ) rearrangements of the terms

( | , ) sum over all clusters

j i

n n i

n ij

i

n n i j

n

i

n n i j

j n j

Q j K

w

P c j x w

P c j x w j



 



  



   

 



 



(37)

37 Lagrange multipliers



Cont’d



Posterior probability of the labeling

1

( | , ) inversion of the sum signs sum over all labelings 1

1 ( | , ) substitution in (*)

n n i

n j

i j j

i

j n n i

n

P c j x

N w

w P c j x

N

 







 

  





( | , ) ( | )

( | , )

( ; , ) ( ; , )

n n i n i

n n i

i

n j j j

i

n l l l

l

P x c j P c j P c j x

P x c j

N x I w

 

 

 





(38)

38 Lagrange multipliers



Estimation of the new  s

 

1

1 1 2

1

( ) 0 voor alle 1,...,

2( )

( | , ) 0

2 ( | , )

( | , )

j i

i

n j

n n i

n i

j

n n i n

i n

j

n n i

n

Q j K

P c j x x

P c j x





 





 



  



   

 

  

 

 



 







(39)

39 Lagrange multipliers



Elimination of the new  s

 

1

3

1 1

1 1 2

( ) 0 for all 1,...,

( | , ) . 2 0

2 ( | , )

1 ( | , )

j i

i

n j

n n i i i

n j j

i

n n i n j

i n

j

n n i

n

Q j K

d x P c j x

P c j x x

d P c j x





 

 

 

 



 



  



   

 

    

 

 

 

 







(40)

40 EM clustering



Initialization: random parameters 

₀



Iteration



End: when parameters have converged or maximum number of iterations is reached

1

1 2

1

( ; , ) ( | , )

( ; , )

1 ( | , )

( | , )

( | , ) 1

( | , )

i

j n i i

n n i i

l n l l

l i

j n n i

n

n n i n

i n

j

n n i

n

i

n n i n j

i n

j

n n i

n

w N x I

P c j x

w N x I

w P c j x

N

P c j x x

P c j x

P c j x x

d P c j x

  

 



 

 

 



 



 

 

 



(41)

41 Practical aspects of EM clustering



K-means is a “hard” version of EM with Gaussian mixture with unit covariance



Local minima are often a real problem for EM, so a good initialization can dramatically improve the results



Clusters in sparse regions can collapse to a unique data point

 The variance of the cluster goes to zero aan the likelihood becomes infinite

 Unacceptable

 Too small clusters are pruned and replaced by a new random initialization

(42)

42 Practical aspects of EM clustering



After estimation of the model, clustering can be done via

“soft assignment” or “hard assignment”



Choice of the number of clusters is a difficult problem



EM clustering possible for diagonal covariance; often

impossible for full covariance matrix because of the large number of parameters



EM clustering possible for other distributions of the exponential family



Implementations

 AutoClass

 mclust

(43)

43

(44)

44 Clustering

(45)

45

Identificatie van frequente uitdrukkingspatronen (clusters)

(46)

46 Doel van clustering



Microroostergegevens



Vorm coherenten groepen van

 Genen

 Patientstalen (bvb., tumoren)

 Drug- of toxine-effecten



Bestudeer deze groepen om inzicht te krijgen in de biologische processen

 Genen in de zelfde cluster kunnen de zelfde functie hebben

 Functie van onbekende genen kan ontdekt worden

 Patienten in de zelfde cluster kunnen clinisch gerelateerd zijn

 Verfijning van therapeutische categorieën

(47)

47

original coordinate system original coordinate system

new coordinate system

Principale componentanalyse



PCA detecteert de richtingen die de meeste

variantie van de gegevens verklaren

(48)

48 Principale componentanalyse



Principale componenten zijn de eigenvectoren U

_i

van de covariantiematrix  :



Reductie tot M dimensie: projectie van

gegevensvectoren op de M eigenvectoren met de M grootste eigenwaarden  minimaal verlies aan

informatie

 

ⁿ



^T

n

μ x μ

x

Σ    

(49)

49

5

2 4

1 3

Hierarchische clustering

3

1

4 2

5

Afstand tussen verbonden clusters

Dendrogram

(50)

50 Hierarchische clustering

 Opbouw van een genboom gebaseerd op correlatiematrix

(51)

51 Afstandsmaten

 Om hierarchisch te clusteren, moeten afstandmaten gedefinieerd worden

1. Similariteitsmaat tussen twee objecten (genen of experimenten)

 Gecentreerde correlatie

 Niet-gecentreerde correlatie

 Absolute correlatie

 Euclidisch afstand

2. Afstandsmaat tussen clusters

 Enkelvoudige Linkage: afstand tussen punten van dichtstbijzijnde paar

 Volledige Linkage: afstand tussen punten van verste paar

 Gemiddelde Linkage: gemiddelde afstand tussen punten van alle paren

 Centroïdemethode: afstand tussen zwaartepunten van de groepen

 Methode van Ward

(52)

52 Afstand tussen punten

 Pearson correlatie

 Indien er waarden ontbreken, wordt de correlatie berekend enkel voor de dimensies waar beide vectoren een waarde hebben

 Een afstand kan bekomen worden als d = 1 – |s| of d = (1 – s)/2

 Niet-gecentreerde correlatie







 _n

i

i n

i

i n i

i i

y y

x x

y y x x y

x s

1

2 1

) (

) )(

( )

, (





  _n

i i n

i i n i

i i

y x

y x y

x s

1 2 1

2

) 1

, (

 Spearman rankcorrelatie

n n

y x

y x r

n i

i i









3 1

2 2

)) ( rank )

( (rank 6

1 ) , (

(53)

53 Afstand tussen punten



Euclidische afstand



Manhattan afstand



l

_p

-norm



Mahalanobis afstand







 ⁿ

i

i y

x y

x d

1

)2

( )

, (







 ⁿ

i

i y

x y

x d

1

|

| )

, (

n p

i

p i

i y

x y

x d

/ 1

1

) (

) ,

( 



 



 







) (

) ,

(x y x y W x y

d   ^T 

(54)

54 Afstand tussen clusters



Enkelvoudige linkage



Volledige linkage



Methode van Ward ) , ( min )

,

( R S

,

d i j

d 

i j

) , ( max )

,

( R S

,

d i j

d 

i j



Gemiddelde linkage



Centroïdemethode





i j

j i S d

S R R

d ( , )

|

||

| ) 1 , (

2(R,S) x^R y^S 2

d  

2 2

|

||

| ) 2

,

( x^R y^S

S R

S S R

R

d 

 

(55)

55 Algoritme



Initialisatie

 Iedere punt i vormt een cluster Ci

 Bereken de matrix van alle paargewijse afstanden tussen alle punten



Iteratie

 Pak het paar (i,j) waarvoor d(Ci,Cj ) minimaal is

 Verbind Ci en C_j met een tak van lengte d(Ci,C_j)

 Definieer een nieuwe cluster Ck als de unie van Ci en Cj

 Bereken de afstand tussen Ck en alle andere clusters

 De afstanden kunnen incrementeel berekend worden



Eind

 Stop wanneer er maar één cluster overblijft

(56)

56 Clustervorm



Gemiddelde linkage en centroïdemethoden



Enkelvoudige linkage



Volledige linkage

(57)

57 Hierarchische clustering



Hierarchische clustering

 De clusteringboom geeft een idee van de afstand tussen de verschillende punten (ultrametrische afstand)

 De bladeren van de boom zijn willekeurig geordend

 Iedere tak kan omgedraaid worden

 Nabijheid in de “heatmap” kan misleidend zijn

 De verschillende variaties (afstand, linkage) geven soms

uiteenlopende resultaten (keuze tussen die variaties is moeilijk)

 De meeste implementatie hebben geheugenproblemen omdat er een O(N²) afstandsmatrix berekend moet worden

 Hierarchische clustering produceert geen clusters

 Clusters kunnen bekomen worden door te knippen op een maximale afstand

 Keuze van de drempel is moeilijk

 Clusters kunnen ook handmatig geplukt worden

(58)

58 K-means



Initialisatie

 Kies het aantal clusters K en start met willekeurige posities voor K centra



Iteratie

 Ken punten toe tot

het dichtsbijzijnde centrum

 Beweeg ieder centrum tot het zwaartepunt van de hieraan toegekende punten



Einde

 Stop na dat de centra geconvergeerd zijn

Initialisatie

(59)

59 Iteratie 1

K-means



Initialisatie



Iteratie



Einde

(60)

60 Iteratie 1

K-means



Initialisatie



Iteratie



Einde

(61)

61 Iteratie 3

K-means



Initialisatie



Iteratie



Einde

(62)

62 Batch vs. online



K-means minimaliseert de totale quantisatiefout



In de ‘batch’ versie van K-means worden alle punten eerst toegekend en dan worden alle centra geupdated



In de ‘online’ versie van K-means wordt het toegekende centrum geupdated na iedere punt



De ‘online’ versie is vaak minder gevoelig aan lokale minima in optimalisatie van de quantisatiefout









^N

i

i i

K

x C x

E

1

2

( ( ))

(63)

63 Evaluatie van clusteringpartities



Evaluatie van de kwaliteit van een clustering is essentieel om verschillende runs en methoden te kunnen vergelijken



Silhouetcoefficienten

 Ieder punt i krijgt een score s(i) van hoe goed het bij zijn cluster C(i) hoort

 Voor C(i) bereken

 a(i) = gemiddelde afstand van i tot de punten van C(i) behalve i

 Voor alle andere clusters Ck  C(i), bereken

 d(i, C_k) = gemiddelde afstand van i tot de punten van C_k  C(i)

 Selecteer de naburige cluster

 b(i) = mink d(i, Ck)

(64)

64

 Silhouet van punt i



Gemiddelde silhouet voor de data set

 Maat voor de clusteringkwaliteit



Gemiddelde silhouet kan gebruikt worden om het aantal clusters te optimalizeren

 Silhouetcoefficient



Silhouetcoefficient is een maat voor de structuur die gedetecteerd kan worden in de data

1 ) ( 1

)) , ( ), ( max(

) ( )

) (

(     s i 

i b i a

i a i

i b s

s

)

(

max k s

SC 

k

(65)

65 K-means



Het kiezen van het aantal clusters is een groot probleem voor de clustering van microroostergegevens



Het algoritme kan geblokkeerd geraken in een lokaal minimum



Twee verschillende runs produceren verschillende

resultaten

(66)

66 Self-organizing maps

 Initialisatie

 Kies een roostertopologie (b.v.b, 3 x 2 rechthoekig rooster)

 Knopen van het rooster liggen in m-dimensionale ruimte

 Iteratie (20.000 – 50.000)

 Een datapunt p wordt willekeurig gekozen en de knopen worden verplaatst in de richting van p

 Dichtstbijzinde knoop Np wordt het meest verplaatst

 Andere knopen worden in mindere maat verplaatst in functie van hun afstand tot Np in de roostertopologie

 Einde

 Stop als clustercentra niet meer bewegen

 Knopen = cluster centra

 Roostertopologie geeft naburige clusters

(67)

67 Self-organizing maps

 Visualisatie: genuitdrukkingsprofielen per cluster samen met roostertopologie

 SOM probeert een 2D-oppervlakte te fitten op de hoog-dimensionale gegevens

 Indien deze verondersteling niet geldt is de SOM

niet betrouwbaar voor clustering

 Keuze van het aantal knopen moeilijk

 SOM goed voor visualisatie

(68)

68 Classificatie

(69)

69 Classificatie



Classificatie van microroostergegevens is interessant omdat het het pad opent voor nieuwe diagnostica- en

“disease management” toepassingen

 Voorspelling van de therapeutische response van een tumor op chemoterapie



Classificatie kan ook gebruikt worden om de functionele klasse van een gen te voorspellen op basis van het

profiel in een brede reeks microroosterexperimenten

(70)

70 Classificatie



Kenmerkselectie



Dimensionaliteitsreductie



Classificatie

 Discriminantanalyse

 Nearest-neighbor classificatie

 Neurale netwerken

 Support Vector Machines

(71)

71 Voorbeeld: onderscheiding van twee vormen van leukemie



Leukemie dataset:

 Twee klassen: Acute Lymphoblastische Leukemie (klasse 1) and Acute Myeloide Leukemie (klasse 2)

 Trainingset met 38 voorbeelden (27 ALL en 11 AML)

 Onafhankelijke testset met 34 voorbeelden

 ~7000 geneuitdrukkingsniveaus voor iedere staal

= 7000 x 38 matrix voor trainingset



Herschalering, logaritmische transformatie, en

normalizatie

(72)

72 Kenmerkselectie (één-dimensionaal)



Correlatiecoefficient:

 |P(g,c)| is een maat voor de correlatie tussen

uitdrukkingsniveaus van een gen g en het klasseverschil c



Selecteer kenmerken

 Welke genen?

 Genen met meest positieve en meest negatieve |P(g,c)|

 Hoeveel genen?

 Significantie-analyse

) , ( )

, (

) , ( )

, ) (

, (

2 1

c g c

g

c g c

c g g

P  





 

(73)

73 Dimensionaliteitsreductie met PCA



Principale componenten zijn de eigenvectoren U

_i

van de covariantiematrix  :



Reductie tot M dimensie: projectie van

gegevensvectoren op de M eigenvectoren met de M grootste eigenwaarden  minimaal verlies aan

informatie

 

ⁿ



^T

n

μ x μ

x

Σ    

(74)

74 PCA in 2 dimensies

-20 -15 -10 -5 0 5 10 15 20 25

-20 -15 -10 -5 0 5 10 15

* = training O = independent Class 1/ALL Class 2/AML

(75)

75 Fisher Lineaire Discriminantanalyse



Reductie tot één dimensie



De vector w wordt gekozen zodat

 De intraklassespreiding klein is

 De klassescheiding (verschil tussen klassegemiddelde) in één dimensie groot is



De vector w wordt gegeven door waar

x w y 

^T



₂ ₁



1

μ μ

S

w 

_w^













T



n

C n

n T n

C n

n

μ x μ x μ x μ x

w

S

2

2 1

2 1 1

 

^

   

^

 

(76)

76 Fisher Lineaire Discriminantanalyse



Drempel w

_o

:

 y <= wo  Klasse 1

 y > wo  Klasse 2



Lineaire Discriminantanalyse na PCA in 5 dimensies geeft maar 2 misclassificaties op de testset

(34 voorbeelden)

μ

w

_o



^T

(77)

77 Resultaat (2 dimensies)

-15 -10 -5 0 5 10 15 20

-10 -5 0 5 10 15

O = independent Class 1/ALL Class 2/AML

(78)

78 Logistische regressie

Number of

inputs % ROC area

training % ROC area prospective

15 1 98.93

10 1 98.93

5 1 97.57

3 1 97.10

2 1 98.21

(79)

79 Neuraal netwerk



Single-hidden-layer perceptron

 Preprocessing + PCA tot n dimensies

 n inputs

 5 verborgen neuronen

 Met regularisatie

(80)

80 Neural network (5 verborgen neuronen)

Number of

inputs % ROC area

training % ROC area prospective

15 1 99.29

10 1 98.39

5 1 97.15

3 1 96.10

2 1 98.21

(81)

81 Samenvatting



Significantietesten



Clustering

 Hierarchische clustering

 K-means

 Self-Organizing Maps



Classificatie

Microarray data analysis