9 CLUSTER VALIDATION, COMPARING CLUSTER RESULTS


adapted 060305

Table of contents

9 Cluster validation, comparing cluster results
9.1. Introduction
9.3. Biological validation


9.1. Introduction

Depending on

• the preprocessing
• the algorithm
• the distance metric

clustering will produce different results. Even clustering of unrelated data will still produce clusters, although they might not be biologically meaningful. Cluster validation after clustering is therefore of utmost importance. In the following, different methods to compare cluster results are described.

Clusters can be compared from a statistical point of view: the coherence of a cluster can be tested with a distance measure, or the robustness of a cluster result can be analyzed by some kind of sensitivity analysis. It is of course very hard to select the best cluster output, since the "biologically true" solution will only be known if the biological system studied is completely characterized. Still, from a biological point of view there are ways to validate a cluster result, such as motif finding or testing for enrichment of functional classes within a cluster. These biological validations can give an indication of the validity of a cluster.

Fig. 1 Dependence of cluster result

9.3. Biological validation

There are different ways of validating in silico the outcome of a cluster experiment biologically:


• Motif finding: coexpressed genes are expected to be coregulated, either at the transcriptional or at the posttranslational level. When coregulation occurs at the transcriptional level, the coregulated genes are expected to contain a consensus sequence in their upstream regions (promoter regions). This is a short DNA sequence that is recognized by a transcriptional regulator. If the genes in a cluster indeed contain such a consensus motif, its presence points towards transcriptional coregulation. A common regulatory mechanism might explain the similarity in the behavior of these genes during a gene expression profiling experiment, and therefore biologically confirms the cluster output (the fact that they have tightly coexpressed profiles). Note, however, that the converse does not hold: not finding a common motif does not indicate a biologically irrelevant clustering.

Fig. 2 Location of the motif in a promoter region of a gene. Motifs are recognized by transcriptional regulators.

• Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of genes found to be differentially expressed between two or more conditions under study. The challenge faced by the researcher is to translate this list of differentially regulated genes into a better understanding of the underlying biological phenomena. Translating a list of differentially expressed genes into a functional profile that offers insight into the cellular mechanisms is a very tedious task if performed manually. Typically, one would take each regulated gene, search various public databases and compile a list of, for instance, the biological processes that the gene is involved in. To construct a master list of all the biological processes in which at least one gene is involved, this task must be repeated for each gene.

Further processing of this list provides a list of those biological processes that are common between several of the regulated genes. It is expected that those biological processes that occur more frequently in this list will be more relevant to the studied condition. For instance, if all the genes found to be regulated were involved in apoptosis, one would conclude that the condition studied has a significant impact on the apoptotic pathway.

• Literature searches, text mining: check whether genes belonging to the same cluster have similarities in their functional annotation. For example, biologically valid clusters will be enriched in MIPS/GO categories. Genes belonging to the cluster are expected to be part of the same biological pathway. Based on the known annotation of genes belonging to a cluster, the function of unknown genes in that cluster can be approximated. To facilitate such searches, tools have been developed that allow easy retrieval of information from different publicly available databases (e.g. MIPS, GO, PubMed, LocusLink). These tools compile a specific functional profile for each gene. Subsequently they calculate the statistical significance of the overrepresentation of a specific functional profile in a cluster or in a set of differentially expressed genes.

• Ontologies provide a structured description of biological information that is extremely useful for computational management. One of the most widely accepted ontologies is the Gene Ontology (GO; Ashburner et al., 2000), which organizes information on molecular function, biological processes and cellular components for a number of different organisms.
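The manual procedure described in the list above (collect the biological processes of each regulated gene, then count how often each process occurs) can be sketched in a few lines of Python. This is only an illustration: the gene names and annotations are hypothetical, and a real analysis would retrieve them from a resource such as GO or MIPS.

```python
from collections import Counter

# Hypothetical annotations (gene -> biological processes); not real data.
annotations = {
    "geneA": {"apoptosis", "cell cycle"},
    "geneB": {"apoptosis"},
    "geneC": {"apoptosis", "DNA repair"},
    "geneD": {"cell cycle"},
}


def functional_profile(genes, annotations):
    """Count, for each biological process, how many of the genes are annotated to it."""
    counts = Counter()
    for gene in genes:
        # Genes without any known annotation contribute nothing to the profile.
        counts.update(annotations.get(gene, set()))
    return counts


# "geneX" stands for a gene of unknown function.
regulated = ["geneA", "geneB", "geneC", "geneX"]
profile = functional_profile(regulated, annotations)
for process, count in profile.most_common():
    print(process, count)
```

Processes at the top of this tally are candidates for being relevant to the studied condition. Whether their frequency is higher than expected by chance still has to be tested statistically.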

Example 1 Hypergeometric distribution (binomial)

The results of the expression profiling experiment of Cho et al. (1998), which studied the yeast cell cycle (Saccharomyces cerevisiae) in a synchronized culture, are often used as a benchmark data set. The data set contains 6220 expression profiles taken over 17 time points (measurements at 10-min intervals, covering nearly two cell cycles; also see http://cellcycle-www.stanford.edu). These data are frequently used to validate new clustering algorithms because the majority of the genes included have been functionally classified and because a functional classification scheme exists (MIPS database, see http://mips.gsf.de/proj/yeast/catalogues/funcat/index.html), which makes it possible to validate the results biologically.

Assume that a certain clustering method finds a set of clusters in the Cho et al. (1998) data. We could objectively look for functionally enriched clusters as follows. Suppose that one of the clusters has n genes, of which k belong to a certain functional category in the MIPS database, and suppose that this functional category in turn contains f genes in total. Also suppose that the total data set contains g genes (in the case of Cho et al. (1998), g would be 6220). Using the cumulative hypergeometric probability distribution, we can calculate the probability or P-value that this degree of enrichment could have occurred by chance, i.e., the chance of finding at least k genes from this specific functional category in this specific cluster:

$$
P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}} = \sum_{i=k}^{\min(n,f)} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}}.
$$

These P-values can be calculated for each functional category in each cluster. Since there are about 200 functional categories in the MIPS database, only clusters with a P-value < 0.0003 for a certain functional category are said to be significantly enriched (overall level of significance 0.05; dividing by the number of categories gives 0.05/200 = 0.00025, rounded to 0.0003). Note that these P-values can also be used to compare the results from functionally matching clusters identified by two different clustering algorithms on the same data.


This hypergeometric calculation (for additional information on the statistics see Draghici et al., 2003) has been implemented in OntoExpress (http://vortex.cs.wayne.edu/Projects.html#Onto-Express) for a few vertebrate organisms and in GO4G (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) for human.
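To make the enrichment P-value of Example 1 concrete, the following minimal Python sketch evaluates the cumulative hypergeometric sum directly with exact integer binomial coefficients. The function name and the example counts (n = 50, f = 100, k = 10) are assumptions chosen for illustration. Only g = 6220 and the roughly 200 MIPS categories come from the text, and this is not the code used by OntoExpress or GO4G.

```python
from math import comb


def enrichment_p_value(k, n, f, g):
    """Chance of finding at least k genes of a category in a cluster.

    n: genes in the cluster, f: genes in the functional category,
    g: genes in the whole data set, k: cluster genes in the category.
    """
    if not (0 <= n <= g and 0 <= f <= g and k >= 0):
        raise ValueError("counts must satisfy 0 <= n, f <= g and k >= 0")
    upper = min(n, f)
    # Sum the hypergeometric probabilities for i = k .. min(n, f).
    # math.comb returns 0 when n - i exceeds g - f, so impossible terms vanish.
    favourable = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return favourable / comb(g, n)


# Hypothetical cluster: 50 genes, 10 of them in a category of 100 genes.
p = enrichment_p_value(k=10, n=50, f=100, g=6220)
threshold = 0.05 / 200  # significance level divided by about 200 categories
print(f"P = {p:.3e}, significantly enriched: {p < threshold}")
```

Working with exact integers avoids the rounding problems that arise when many very small probabilities are added in floating point. For large data sets a library routine for the hypergeometric survival function would be a reasonable alternative.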

Example 2 χ² test (or Fisher's exact test if the sample is small)

Alternative approaches include a χ² test for equality of proportions and Fisher's exact test. For the purpose of applying these tests, the data can be organized as shown in the table below. The dot notation for an index is used to represent the summation over that index. In this notation, the number of genes on the chip is N = N.1, the number of genes in functional category F is M = N11, the number of genes selected as differentially regulated is K = N.2, and the number of differentially regulated genes in F is x = N12. Using this notation, the χ² test involves calculating the value of the χ² statistic.

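Calculate χ²:

$$
\chi^2 = \sum_{i=1}^{k} \frac{(o_i - E_i)^2}{E_i}
$$

where o_i is the observed and E_i the expected count in cell i, and k is the number of cells.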

The value thus calculated can be compared with critical values obtained from a χ² distribution with df = (r - 1)(c - 1) = (2 - 1) · (2 - 1) = 1 degree of freedom, where r and c are the number of rows and columns of the table.

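The marginal proportions are

E_F = N1./N..
E_notF = N2./N..
E_diff = N.1/N..
E_notdiff = N.2/N..

The expected count of a cell is the grand total times the product of its two marginal proportions, for example

E_diff&F = N.. × E_diff × E_F, and in general Eij = (Ni.N.j /N..)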
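As a rough illustration of the calculation above, the sketch below derives the expected counts Eij = Ni.N.j/N.. from the margins of a 2 × 2 table and sums (o - E)²/E over the four cells. The table layout (rows: in F / not in F, columns: differentially regulated / not) and the counts are assumptions made for this example, and no continuity correction is applied. For 1 degree of freedom the upper tail probability of the χ² distribution equals erfc(sqrt(χ²/2)), which keeps the example free of external libraries.

```python
from math import erfc, sqrt


def expected_counts(table):
    """Expected cell counts E_ij = N_i. * N_.j / N_.. under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_2x2(table):
    """Return the chi-square statistic and its P-value (df = 1) for a 2x2 table."""
    expected = expected_counts(table)
    # Assumes every expected count is positive (no empty row or column).
    statistic = sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, chosen only to illustrate the arithmetic.
observed = [[10, 20], [30, 40]]
statistic, p_value = chi_square_2x2(observed)
print(f"chi-square = {statistic:.3f}, P = {p_value:.3f}")
```

For these counts the statistic is about 0.79, well below the critical value of 3.84 for df = 1 at the 0.05 level, so the proportions would not be judged different.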

However, the χ² test for equality of proportions cannot be used for small samples. The rule of thumb is that all expected frequencies Eij = (Ni.N.j /N..) should be greater than or equal to 5 for the test to provide valid conclusions. If this is not the case, Fisher's exact test can be used (for information see http://www.physics.csbsju.edu/stats/exact.html).
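The rule of thumb and the fallback to Fisher's exact test can be sketched as follows. This is a minimal, assumption-laden illustration: the function names and the example table are hypothetical, and the two-sided P-value is defined here as the sum of the probabilities of all tables (with the same margins) that are at most as likely as the observed one, which is one common convention and not necessarily the one used by a given tool.

```python
from math import comb


def min_expected_count(table):
    """Smallest expected frequency E_ij = N_i. * N_.j / N_.. of a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return min(r * c / grand_total for r in row_totals for c in col_totals)


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test P-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    p_value = sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical small table: every expected count is 2, so chi-square is not advised.
small_table = [[3, 1], [1, 3]]
use_fisher = min_expected_count(small_table) < 5
print(use_fisher, fisher_exact_2x2(small_table))
```

Because the test enumerates all tables with the observed margins, it stays valid for small counts at the price of more computation for large ones.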

FatiGO is a web-based application (http://fatigo.bioinfo.cnio.es) that carries out simple data mining using Gene Ontology for DNA microarray data. The data mining consists of assigning the most characteristic Gene Ontology term to each cluster. GO terms are related to Human, Mouse, Rat, Arabidopsis, Fly, Worm and Yeast genes and proteins.

FatiGO implements Fisher's exact test for 2x2 contingency tables to compare two groups of genes and to extract a list of GO terms whose distribution among the groups is significantly different. The results of the test are corrected for multiple testing to obtain an adjusted P-value. The results are displayed in HTML and txt format, together with a tree view of the GO terms associated with the list of genes and the number of genes annotated to each GO term. OntoExpress also offers the option to use the χ² test or Fisher's exact test.
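Since many GO terms are tested at once, the raw P-values have to be adjusted. The text does not state which correction FatiGO applies, so the sketch below shows two widely used procedures purely as an illustration: the Bonferroni correction (multiply each P-value by the number of tests) and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The input P-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the more conservative of the two. It corresponds to the threshold used in Example 1, where the significance level is divided by the number of functional categories.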