Fish genomes : a powerful tool to uncover new functional elements in vertebrates

Stupka, E.

Citation

Stupka, E. (2011, May 11). Fish genomes : a powerful tool to uncover new functional elements in vertebrates. Retrieved from https://hdl.handle.net/1887/17640

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/17640

Note: To cite this publication please use the final published version (if applicable).

Chapter 3: Shuffling of cis-regulatory elements is a pervasive feature of the vertebrate lineage

Published in: Genome Biology, 2006, Vol 7:R56

Abstract

Background: All vertebrates share a remarkable degree of similarity in their development as well as in the basic functions of their cells. Despite this, attempts at unearthing genome-wide regulatory elements conserved throughout the vertebrate lineage using BLAST-like approaches have thus far detected noncoding conservation in only a few hundred genes, mostly associated with regulation of transcription and development. We used a unique combination of tools to obtain regional global-local alignments of orthologous loci. This approach takes into account the shuffling of regulatory regions that is likely to occur over evolutionary distances greater than those separating mammalian genomes. It revealed one order of magnitude more vertebrate conserved elements than was previously reported, in over 2,000 genes, including a high number of genes found in the membrane and extracellular regions. Our analysis revealed that 72% of the elements identified have undergone shuffling. We tested the ability of the elements identified to enhance transcription in zebrafish embryos and compared their activity with a set of control fragments. We found that more than 80% of the elements tested were able to enhance transcription significantly, prevalently in a tissue-restricted manner corresponding to the expression domain of the neighboring gene. Our work elucidates the importance of shuffling in the detection of cis-regulatory elements. It also elucidates how similarities across the vertebrate lineage, which go well beyond development, can be explained not only within the realm of coding genes but also in that of the sequences that ultimately govern their expression.

Introduction

Enhancers are cis-acting sequences that increase the utilization and/or specificity of eukaryotic promoters, can function in either orientation, and often act in a distance- and position-independent manner [1]. The regulatory logic of enhancers is often conserved throughout vertebrates, and their activity relies on sequence modules containing binding sites that are crucial for transcriptional activation.

However, recent studies on the cis-regulatory logic of Otx in ascidians pointed out that there can be great plasticity in the arrangement of binding sites within individual functional modules. This degeneracy, combined with the involvement of a few crucial binding sites, is sufficient to explain how the regulatory logic of an enhancer can be retained in the absence of detectable sequence conservation [2]. These observations, together with the fact that we are still far from fully understanding the grammar of transcription factor binding sites and their conservation [3], make it difficult to assess the extent of conservation in vertebrate cis-regulatory elements.

Very little is known about the evolutionary mobility of enhancer and promoter elements within the genome as well as within a specific locus. Sporadic studies of selected gene families have addressed questions related to the mobility of regulatory sequences involving promoter shuffling [4] and enhancer shuffling [5]; these describe the gain or loss of individual regulatory elements exchanged between specific genes in a cassette manner [6]. These studies suggested that a wide variety of different regulatory motifs and mutational mechanisms have operated upon non-coding regions over time. These studies, however, were conducted before the advent of large-scale genome sequencing, and thus they were performed on a scale that would not allow the authors to derive more general conclusions on the mobility and shuffling of regulatory elements.

The basic tenet of comparative genomics is that constraint on functional genomic elements has kept their sequence conserved throughout evolution. The completion of the draft sequence of several mammalian genomes has been an important milestone in the search for conserved sequence elements in noncoding DNA. It has been estimated that the proportion of small segments in the mammalian genome that is under purifying selection within intergenic regions is about 5% and that this proportion is much greater than can be explained by protein-coding sequences alone, implying that the genome contains many additional features (such as untranslated regions, regulatory elements, non-protein-coding genes, and structural elements) that are under selection for biological functions [7-11]. In order to address this issue, sequence comparisons across longer evolutionary distances and, in particular, with the compact Fugu rubripes genome have been shown to be useful in dissecting the regulatory grammar of genes long before the advent of genome sequencing [12]. More recently, the completion of the draft sequence of several fish genomes has allowed larger scale approaches for the detection of several regulatory conserved noncoding features.

Several studies have addressed the issue of conserved non-coding sequences on a larger scale. A first study on chromosome 21 [13] revealed conserved nongenic sequences (CNGs); these were identified using local sequence alignments between the human and mouse genome of high similarity, which were shown to be untranscribed. A separate study focusing on sequences with 100% identity [14] revealed the presence of ultraconserved elements (UCEs) on a genome-wide scale, and finally conserved noncoding elements (CNEs) [15] were found by performing local sequence comparisons between the human and fugu genomes showing enhancer activity in zebrafish co-injection assays. Although the CNG study yielded a very large number of elements dispersed across the genome, and bearing no clear relationship to the genes surrounding them, the latter studies (UCEs and CNEs) were almost exclusively associated with genes that have been termed 'trans-dev' (that is, they are involved in developmental processes and/or regulation of transcription).

One of the major drawbacks of current genome-wide studies is that they rely on methods for local alignment, such as BLAST (basic local alignment search tool) [16] and FASTA [17], which were developed when the bulk of available sequences to be aligned were coding. It has been shown that such algorithms are not as efficient in aligning noncoding sequences [18]. To tackle this issue new algorithms and strategies have been developed in order to search for conserved and/or over-represented motifs from sequence alignments, such as the motif conservation score [19], the threaded blockset aligner program [20] and the regulatory potential score [21], as well as phastCons elements and scores [22].

However, all of these rely on a BLAST-like algorithm to produce the initial sequence alignment. They are thus subject to some of the sensitivity limitations of this algorithm, and they do not constitute a major shift in alignment strategy that would model more closely the evolution of regulatory sequences.

Two approaches were recently reported which provide novel alignment strategies: the promoterwise algorithm coupled with 'evolutionary selex' [23] and the CHAOS (CHAins Of Scores) alignment program [24]. Whereas the former has been used to validate a set of short motifs, which have been shown to be of functional importance, the latter has not been coupled to experimental verification to estimate its potential for the discovery of conserved regulatory sequences. Unlike other fast algorithms for genomic alignment, CHAOS does not depend on long exact matches, it does not require extensive ungapped homology, and it does allow for mismatches within alignment seeds, all of which are important when comparing noncoding regions across distantly related organisms. Thus, CHAOS could be a suitable method for the identification of short conserved regions that have remained functional despite their location having changed during vertebrate evolution. The only method available that attempts to tackle the question of shuffled elements and that makes use of CHAOS is Shuffle-Lagan [25]; however, it has not been used on a genome-wide scale and its ability to detect enhancers has not been verified experimentally.

Until recently our ability to verify the function of sequence elements on a large scale within an in vivo context was strongly limited. This task was eased significantly by co-injection experiments in zebrafish embryos [26], which allow a significant scale-up in the number of regulatory elements tested; this is fundamental when one is trying to elucidate general principles regarding regulatory elements, the grammar of which still eludes us. The co-injection technique used to test shuffled conserved regions (SCEs) for enhancer activity was previously shown to be a simple way to test cis-acting regulatory elements [15,27,28] and an efficient way to test many elements in a relatively short period of time [15].

The analysis described herein attempts to tackle the issue of the extent, mobility, and function of conserved noncoding elements across vertebrate orthologous loci using a unique combination of tools aimed at identifying global-local regionally conserved elements. We first used orthologous loci from four mammalian genomes to extract 'regionally conserved elements' (rCNEs) using MLAGAN [29], and then used CHAOS to verify the extent of conservation of those rCNEs within their orthologous loci in fish genomes. The analysis was conducted annotating the extent of shuffling undergone by the elements identified. Finally, we investigated the activity of rearranged and shuffled elements as enhancer elements in vivo. We found that the inclusion of additional genomes, the use of a combined global-local strategy, and the deployment of a sensitive alignment algorithm such as CHAOS yield an increase of one order of magnitude in the number of potentially functional noncoding elements detected as being conserved across vertebrates. We also found that the majority of these have undergone shuffling and are likely to act as enhancers in vivo, based on the more than 80% rate of functional and tissue-restricted enhancers detected in our zebrafish co-injection study.

Results

The dataset described in this analysis is available on the internet [30] for full download, as well as a searchable site to identify SCEs belonging to individual genes.

Identification of mammalian regionally conserved elements

For each group of orthologous genes global multiple alignments among the human, mouse, rat, and dog loci were performed using MLAGAN [25]. We took into consideration all genes for which there were predicted orthologs within Ensembl [31] in the mouse genome, human genome, and any third mammalian species, which led us to analyze 9,749 groups of orthologous genes (36% of the annotated mouse genes). Most genes (about 88%) were found to be conserved in all four species considered, with only about 12% found in three out of four species (about 6% in each triplet; Figure 1). For each locus we took into account the whole genomic repeat-masked sequence containing the transcriptional unit as well as the complete flanking sequences up to the preceding and following gene. This led us to analyze 37% of the murine genome sequence overall. The alignments were parsed using VISTA (visualizing global DNA sequence alignments of arbitrary length) [32], searching for segments with a minimum length of 100 base pairs (bp) and 70% identity. We further selected these regions by only taking into account those that were found at least in mouse, human, and a third mammalian species and which overlapped by at least 50 bp, which resulted in a set of 364,358 rCNEs (Table 1). These were then filtered stringently to distinguish 'genic' from 'nongenic' (see Materials and methods, below). This analysis classified 22.7% of the resulting rCNEs as 'genic', while 281,644 nongenic elements account for about 46 megabases, or 1.77%, of the murine genome.
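The length and identity thresholds used to parse the alignments can be illustrated with a minimal sketch. The function below is not VISTA's implementation; it is a simplified sliding-window scan, written under the assumption that two sequences are supplied as equal-length gapped alignment strings. The names `conserved_windows`, `aln_a`, and `aln_b` are hypothetical.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.70):
    """Return merged (start, end) alignment-column ranges, end exclusive,
    covered by at least one window of `window` columns whose identity is
    >= `min_identity`. Simplified illustration, not the VISTA algorithm."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")

    # 1 where both rows carry the same base (gap-gap columns do not count).
    matches = [int(a == b and a != "-") for a, b in zip(aln_a, aln_b)]

    regions = []
    running = sum(matches[:window])  # matches in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end          # extend the previous region
            else:
                regions.append([start, end])  # open a new region
    return [tuple(region) for region in regions]
```

For example, two 200-column sequences that are identical over the first 120 columns and entirely mismatched afterwards yield the single region (0, 150), because the last qualifying window starts at column 50 and contains exactly 70 matches.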

Figure 1 Number of conserved gene loci versus number of rCNEs identified in the mouse, rat, human, and dog genomes. Graph showing the number of rCNEs found conserved in the dog, rat, mouse and human genomes versus the number of genes found conserved across the same genomes. Although almost 90% of the genes can be found in all four genomes, most rCNEs can be found only in three out of four genomes. rCNE, regionally conserved element.

We further annotated mammalian rCNEs based on their position in the mouse genome with respect to the gene locus in order to define whether they were located before the annotated transcription start site (TSS; 'pre-gene'), within the intronic portion of the gene, or posterior to the transcriptional unit ('post-gene'). Approximately 54% of rCNEs were found to fall within intergenic regions, of which 37% were post-gene and 63% pre-gene (Table 1).

Table 1 Transcription potential, localization, and number of mammalian rCNEs. a) Type of conserved non-coding sequence (rCNE). b) Total number of rCNEs, including genic and nongenic. c) Number of genic rCNEs: overlapping EMBL proteins, ESTs, GenScan predictions, and Ensembl genes. d) Number of nongenic rCNEs: not overlapping EMBL proteins, ESTs, GenScan, and Ensembl genes. e) Total number of rCNEs, including pre-gene, intronic and post-gene. f) Number of pre-gene rCNEs: rCNEs localized before the translation start of the reference gene. g) Number of intronic rCNEs: rCNEs localized within the introns of the reference gene. h) Number of post-gene rCNEs: rCNEs localized after the translation end of the reference gene. EST, expressed sequence tag; rCNE, regionally conserved non-coding element.

Shuffling of conserved elements is a widespread phenomenon

We searched for conservation of rCNEs in teleost genomes using CHAOS [24], selecting regions that presented at least 60% identity over a minimum length of 40 bp as compared with the mouse sequence of the rCNEs. This method allowed us to identify regions that are reversed or moved in the fish locus with respect to the corresponding mammalian locus. For each locus in every species analyzed we took into account the whole genomic repeat-masked sequence containing the transcriptional unit as well as the complete flanking sequences up to the preceding and following gene. We defined as SCEs those regions of the mouse genome that were conserved at least in the fugu orthologous locus and filtered out any sequence shorter than 20 bp as a result of the overlap analysis with zebrafish and tetraodon (see Materials and methods, below, for details). Our analysis identified 21,427 nonredundant nongenic SCEs, which were found in about 30% of the genes analyzed (2,911; Table 2). The distribution of their length and percentage identity is shown in Figure 2e,f. The median length and percentage identity (45 bp and 67%, respectively) reflect closely the cutoffs provided to CHAOS in the alignment (40 bp and 60% identity), although there is a significant number of outliers whose length is equal to or greater than 200 bp (223 elements whose maximum length is 669 bp) and whose median percentage identity is 74%. No elements were identified that were completely identical to their mouse counterpart (the maximum percentage identity found was 97%).

Figure 2 Distribution of length, percentage identity and shuffling categories of SCEs. SCEs were categorized based on their change in location and orientation in Fugu rubripes with respect to their location and orientation in the mouse locus. The entire locus, comprising the entire flanking sequence up to the next upstream and downstream gene was taken into consideration. Definitions of specific classes: (a) collinear SCEs (elements that have not undergone any change in location or orientation within the entire gene locus); (b) reversed SCEs (elements that have changed their orientation in the fish locus with respect to the mouse locus, but have remained in the same portion of the locus); (c) moved SCEs (elements that have moved between the pre-gene, post-gene and intronic portions of the locus); (d) Moved-reversed (elements that have undergone both of the above changes). (e) Frequency distribution of SCE length in base pairs. (f) Frequency distribution of percentage identity of SCE hits in fugu. SCE, shuffled conserved region.

Table 2 Transcription potential, localization, and number of vertebrate SCEs. ^aType of SCE. ^bTotal number of SCEs, including genic and nongenic. ^cNumber of genic SCEs: overlapping EMBL proteins, ESTs, GenScan predictions, and Ensembl genes. ^dNumber of nongenic SCEs: not overlapping EMBL proteins, ESTs, GenScan, and Ensembl genes. ^eTotal number of SCEs, including pre-gene, intronic, and post-gene. ^fNumber of pre-gene SCEs: SCEs localized before the translation start of the reference gene. ^gNumber of intronic SCEs: SCEs localized within the introns of the reference gene. ^hNumber of post-gene SCEs: SCEs localized after the translation end of the reference gene. EST, expressed sequence tag; SCE, shuffled conserved element.

We decided to investigate further the extent to which the elements identified, which are still retained within the locus analyzed, have shuffled in position and orientation relative to the transcriptional unit, and would thus be missed by a simple regional global alignment (such as MLAGAN). This analysis revealed that only 28% of the elements identified have retained the same orientation and the same position with respect to the transcriptional unit (that is to say, they have remained pre-gene, intronic, or post-gene; labeled as 'collinear'; Figure 2a), whereas the others have shifted in terms of orientation ('reversed'; Figure 2b), position ('moved'; Figure 2c), or both ('moved-reversed'; Figure 2d). Thus, 72% of the SCEs identified would have been missed by a global, albeit regional, alignment approach.
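The four shuffling categories follow from two independent comparisons between the mouse and fish placements of an element: whether the locus portion changed and whether the orientation changed. The sketch below makes that decision table explicit. The `Placement` type and the `shuffling_class` function are hypothetical names introduced for illustration and are not taken from the analysis pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where an element sits relative to the reference gene (illustrative)."""
    region: str   # 'pre-gene', 'intronic' or 'post-gene'
    strand: int   # +1 or -1, orientation relative to the gene


def shuffling_class(mouse: Placement, fish: Placement) -> str:
    """Classify an element by comparing its mouse and fish placements."""
    moved = mouse.region != fish.region        # changed locus portion
    flipped = mouse.strand != fish.strand      # changed orientation
    if moved and flipped:
        return "moved-reversed"
    if moved:
        return "moved"
    if flipped:
        return "reversed"
    return "collinear"
```

Under this scheme an element that lies downstream of the gene in mouse and upstream in reverse orientation in fugu would be labeled 'moved-reversed'.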

A possible explanation for the large number of non-collinear elements is that they could appear shuffled owing to assembly artifacts. To assess this possibility, we analyzed the number of SCEs containing a single hit in fugu and not classified as collinear that also had a match in tetraodon. If the shuffling were merely due to assembly artifacts, then we would expect approximately half of the non-collinear hits in fugu also to be non-collinear in tetraodon. The results, however, were significantly different, because more than 80% of the elements were non-collinear in both species (P < 2.2 × 10⁻¹⁶, obtained by performing a χ² comparison between the proportion obtained and the expected 0.5/0.5 proportion). These findings emphasize that shuffling is a mechanism of particular relevance when searching for short, well conserved elements across long evolutionary distances and that its true extent can only be detected by using a sensitive global-local alignment approach, as opposed to a fast genome-wide approach [25].
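A χ² comparison against an expected 0.5/0.5 split can be sketched in a few lines. The element counts behind the reported P value are not given in the text, so the counts used in the note below are purely hypothetical. The function name `chi2_against_even_split` is also an assumption, and the P value relies on the standard identity for one degree of freedom, P = erfc(sqrt(χ²/2)).

```python
import math


def chi2_against_even_split(n_noncollinear_both, n_other):
    """Goodness-of-fit chi-square (1 degree of freedom) of two observed
    counts against an expected 50/50 split. Returns (chi2, p_value)."""
    total = n_noncollinear_both + n_other
    if total == 0:
        raise ValueError("at least one observation is required")
    expected = total / 2
    chi2 = ((n_noncollinear_both - expected) ** 2
            + (n_other - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With hypothetical counts of 800 elements non-collinear in both species and 200 otherwise, the statistic is 360 and the P value falls far below 2.2 × 10⁻¹⁶; an exactly even split gives a statistic of 0 and a P value of 1.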

Two examples of SCEs that were identified in our study are shown in Figure 3.

Example A shows the locus of Sema6d, a semaphorin gene whose product is located in the plasma membrane and is involved in cardiac morphogenesis. This locus contains a conserved element that is found after the transcriptional unit at the 3' end of the gene in all mammals analyzed, whereas it is located upstream in fish genomes and reversed in orientation in the fugu and tetraodon genomes.

Example B shows the locus of the tyrosine phosphatase receptor type G protein, a candidate tumor suppressor gene, which has a conserved element in the first intron of all mammalian loci analyzed. This element is found in reversed orientation in all fish genomes: downstream of the gene in the fugu and tetraodon genomes, and in the second intron in the zebrafish genome.

Figure 3 Examples of loci containing shuffled conserved elements. (a) The Sema6d (sema domain, transmembrane domain, and cytoplasmic domain, semaphorin 6D; MGI:2387661) locus contains a post-genic moved-reversed conserved element. The SCE is found downstream from the gene in mammalian loci and upstream of the gene in fish genomes, and in reverse orientation only in the genomes of fugu and tetraodon. (b) The Ptprg (protein tyrosine phosphatase, receptor type G; MGI:97814) locus contains an intronic moved-reversed conserved element. The SCE is found in the first intron of the Ptprg gene in mammalian genomes, downstream of the gene in reverse orientation in fugu and tetraodon, and in the second intron in reverse orientation in zebrafish. Boxes represent the multiple alignments of the SCEs identified. SCE, shuffled conserved region.

Shuffled conserved regions cast a wider net of nongenic conservation across the genome

We analyzed the type of genes that are associated with SCEs by assessing the distribution of Gene Ontology (GO) terms [33] using GOstat [34] (see Materials and methods, below). Although the results indicate significant over-representation of gene classes typical of genes harboring noncoding conservation ('trans-dev' enrichment) as reported previously, the number of genes within our analysis containing nongenic SCEs (2,911) is approximately an order of magnitude greater than the number of genes containing CNEs (330). The overlap between the two datasets is 291 genes, and so almost all (>88%) genes containing CNEs also contain SCEs. A GO analysis comparing genes containing CNEs and those containing SCEs (Figure 4) revealed that there are several GO categories that are significantly under-represented in the CNE dataset as compared with ours. These categories were not seen in the previous analysis because they are not over-represented in our dataset as compared with the entire genome.
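The relationship between the two gene sets can be checked with simple arithmetic on the three counts quoted above (2,911 genes with SCEs, 330 genes with CNEs, 291 shared). The variable names are illustrative only.

```python
genes_with_sce = 2911   # genes containing nongenic SCEs
genes_with_cne = 330    # genes containing CNEs
shared_genes = 291      # genes present in both datasets

# Share of each dataset that is covered by the overlap.
pct_of_cne_genes = 100 * shared_genes / genes_with_cne
pct_of_sce_genes = 100 * shared_genes / genes_with_sce

# How many times larger the SCE gene set is than the CNE gene set.
fold_more_genes = genes_with_sce / genes_with_cne

print(f"{pct_of_cne_genes:.1f}% of CNE genes also carry SCEs")
print(f"{pct_of_sce_genes:.1f}% of SCE genes also carry CNEs")
print(f"SCE gene set is {fold_more_genes:.1f}-fold larger")
```

The overlap therefore covers roughly 88% of the CNE gene set but only about 10% of the SCE gene set, and the SCE gene set is a little under nine times larger.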

Figure 4 GO Classification of genes harboring CNEs versus genes harboring SCEs. All genes containing CNEs and/or SCEs were analyzed for GO term classification. Genes containing CNEs are shown in red and genes containing SCEs are shown in gray. Plots show differences in absolute numbers as well as relative percentages. Classification is shown for (a) cellular component and (b) biological process categories. CNE, conserved noncoding element; GO, Gene Ontology; SCE, shuffled conserved region.

The most striking difference is found in the analysis by cellular components; there is an approximate 54-fold enrichment in genes belonging to the extracellular regions that contain SCEs as compared with genes in the same class that contain CNEs. In fact SCEs are present in more than 50% of the genes we were able to classify as belonging to the extracellular matrix and in 35% of those belonging to the extracellular space, whereas CNEs are only found in six and two such genes, respectively. These gene sets differ significantly in both extracellular regions and membrane GO cellular component categories (P < 0.001).

Enrichments on the order of 10-fold to 13-fold are seen when comparing genes involved in physiological and cellular processes, respectively. For both of these categories our analysis was able to identify SCEs in more than 30% of the genes belonging to the class. The differences, although substantial (about sevenfold), are not as extreme when comparing 'trans-dev' genes (genes categorized as belonging to the 'regulation of biological process' and 'development' classes using GO) because the CNE dataset has a stronger bias for those genes (P < 0.001). Finally, although we identified SCEs in 40% of genes assigned to the 'behavior' class, none of the genes in this class has CNEs. The data thus suggest that there are both quantitative and qualitative differences between the two datasets.

The proximal promoter region is a shuffling 'oasis'

Because a large proportion of our dataset undergoes shuffling, we decided to investigate whether shuffling is a property that is dependent on proximity to the transcriptional unit. To address this question we divided our dataset of nongenic SCEs between collinear (as discussed above) and non-collinear (all other categories discussed above taken together) elements, and analyzed the distribution of their distances from the TSS (pre-gene set), the intron start (intron-start set), the intron end (intron-end set) and the 3' end of the transcript (post-gene set). This analysis demonstrated that collinear elements were distributed significantly closer to the start and the end of the transcriptional unit compared with non-collinear elements, whereas no differences were observed in terms of proximity to the intron start and intron end (Figure S1).

Figure S1 Boxplots comparing the distribution of the distance of collinear versus non-collinear nongenic SCEs from the transcriptional unit. [Figure not reproduced. Recoverable panel titles: 'INTRON START SCE distribution' and 'INTRON END SCE distribution'. Each panel contrasts collinear and non-collinear elements along a distance axis anchored at the gene or intron boundary.]

In order to investigate this phenomenon at higher resolution, we subdivided all loci analyzed in our dataset into 1,000 bp windows within each of these areas, and verified whether the proportion of collinear versus non-collinear elements deviated significantly from the expected proportions in any of these windows (see Materials and methods, below, for details). The results of the analysis are shown in Figure 5. The only window that exhibited a high χ² result with significantly fewer shuffled elements than collinear ones (P = e-08) was the 1,000 bp window immediately upstream of the TSS. No similar results were found in any other 1,000 bp windows across the gene loci analyzed. Similar results were obtained when deploying other window sizes (data not shown). To ascertain whether the result observed was due to annotation problems, we inspected the GO classification of the genes that presented nongenic collinear elements in the 1,000 bp window discussed above and observed significant enrichment (P < 0.001) for 'trans-dev' genes, whereas the same test conducted on genic collinear elements in the same window revealed no significant GO enrichment.
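A window-by-window test of this kind can be sketched as follows. The exact expected proportions are defined in Materials and methods and are not reproduced here, so this sketch assumes that the expected split in each window equals the overall collinear to non-collinear ratio. The function name `window_chi2` and its inputs (lists of non-negative integer distances in bp) are hypothetical.

```python
import math
from collections import Counter


def window_chi2(collinear_dist, noncollinear_dist, window=1000):
    """Per-window chi-square (1 d.f.) of collinear vs non-collinear counts
    against the overall proportion. Distances are non-negative integers.
    Returns {window_index: (n_collinear, n_noncollinear, chi2, p_value)}."""
    if not collinear_dist or not noncollinear_dist:
        raise ValueError("both classes need at least one element")

    col = Counter(d // window for d in collinear_dist)
    non = Counter(d // window for d in noncollinear_dist)

    # Assumed null model: every window shares the overall proportion.
    p_col = len(collinear_dist) / (len(collinear_dist) + len(noncollinear_dist))

    results = {}
    for w in sorted(set(col) | set(non)):
        n = col[w] + non[w]
        exp_col, exp_non = n * p_col, n * (1 - p_col)
        chi2 = ((col[w] - exp_col) ** 2 / exp_col
                + (non[w] - exp_non) ** 2 / exp_non)
        results[w] = (col[w], non[w], chi2, math.erfc(math.sqrt(chi2 / 2)))
    return results
```

Because many windows are tested at once, a real analysis would also need some form of multiple-testing control; that step is deliberately left out of this sketch.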

Figure 5 Analysis of SCE shuffling in 1000 bp windows. Each column in the figure shows the analysis of a locus portion (pre-gene, intron-start, intron-end and post-gene) divided into 1000 bp windows. In each column the first graph indicates the number of collinear SCEs identified, the second graph the number of noncollinear SCEs identified, and the third graph the χ² test used to identify windows that show a significant deviation from the expected proportion of collinear to noncollinear SCEs. The P value is shown for the only window (1000 bp upstream of the transcription start site) that exhibits significant deviation from the expected proportion. bp, base pairs; SCE, shuffled conserved region.

Shuffled conserved regions are able to predict vertebrate enhancers

In order to verify the ability of SCEs to predict functional enhancer elements, we conducted an overlap analysis (see Materials and methods, below) of SCEs with 98 mouse enhancer elements deposited in GenBank. We compared the overlap of SCEs with that of two other datasets that present conservation in fish genomes, namely CNEs and UCEs. The results presented in Figure 6 show that although CNEs and UCEs are able to detect only one and two known enhancers from our dataset, respectively, SCEs detect 18 of them successfully.
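An overlap analysis of this kind reduces to counting how many annotated enhancers are touched by at least one conserved element. The sketch below assumes half-open `(chromosome, start, end)` coordinates and a minimum overlap of 1 bp; both conventions, the function names, and the coordinates in the usage note are hypothetical, and the criteria actually used are those given in Materials and methods.

```python
def overlap_bp(a, b):
    """Number of shared bases between two (chrom, start, end) intervals,
    with half-open coordinates; 0 if they lie on different chromosomes."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def count_detected(enhancers, elements, min_overlap=1):
    """Count enhancers overlapped by at least one conserved element."""
    return sum(
        1
        for enhancer in enhancers
        if any(overlap_bp(enhancer, el) >= min_overlap for el in elements)
    )
```

Running `count_detected` once per element set (SCEs, CNEs, UCEs) against the same enhancer list gives directly comparable counts. For instance, an enhancer at `("chr1", 100, 600)` is detected by an element at `("chr1", 550, 595)`, whereas an element that merely abuts an enhancer end does not count under half-open coordinates.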

Figure 6 Overlap of known mouse enhancers with conserved elements. All mouse enhancers deposited in GenBank (94) were mapped to the genome and compared with previously published conserved elements (UCEs and CNEs) as well as our own dataset of SCEs to verify their overlap. Only one known mouse enhancer is overlapped by a CNE and two by a UCE, whereas our dataset of SCEs identifies 18 known mouse enhancers as being conserved within fish genomes. CNE, conserved noncoding element; SCE, shuffled conserved region; UCE, ultraconserved element.

Shuffled conserved regions act as enhancers in vivo

In order to validate the cis-regulatory activity of SCEs we chose a subset of SCEs to be tested for in vivo enhancer activity by amplifying them from the fugu genome and co-injecting them in zebrafish embryos with a minimal promoter-reporter construct, yielding transient transgenic zebrafish embryos. Twenty-seven SCEs were tested, of which four overlapped known mouse enhancers for which activity had not previously been reported in fish, and the remaining 23 (from 12 genes, of which four were not trans-dev genes, for a total of eight fragments not associated with trans-dev genes) did not overlap any known feature. As a control set, 12 noncoding, non-repeated, and non-conserved fragments were also chosen for co-injection assays, of which nine were from the same genes from which SCEs had been picked and three were from random genes (see Materials and methods, below, for details). Owing to the mosaic expression patterns that are obtained with this technique, results were recorded in two ways: by counting the number of cells stained for X-Gal and recording, where possible, the tissue in which the LacZ-positive cells were found; and by plotting LacZ-positive cells on expression maps that represent a composite overview of the LacZ-positive cells of all the embryos tested. Results of the cell counts are shown in Table 3 and the expression maps are shown in Figure 7. The cell counts were used to define statistically which fragments exhibited tissue-restricted enhancer activity or generalized enhancer activity (see Materials and methods, below). As a positive control a published regulatory element from the shh locus, ar-C [27], was co-injected with the HSP:lacZ fragment. From a total of 27 SCEs, 22 (about 81%) were able to enhance significantly the activity of the HSP:lacZ construct in comparison with the embryos injected with HSP:lacZ only (see Materials and methods, below, for details). Of these, three out of the four tested known mouse enhancers that were found to be conserved in fish were confirmed to act as enhancers in fish. A similar percentage of positive results (82.6%) was obtained excluding these enhancers from the count. The enhancer effect in 20 out of the 22 positive SCEs was not generalized but observed in a tissue-restricted manner.
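The success rates quoted above follow directly from the counts in the text, as this short arithmetic check shows. The variable names are illustrative only.

```python
tested_sces = 27            # SCEs co-injected in total
positive_sces = 22          # SCEs with significant enhancer activity
known_tested = 4            # tested SCEs overlapping known mouse enhancers
known_positive = 3          # of those, confirmed as enhancers in fish
tissue_restricted = 20      # positive SCEs with tissue-restricted activity

overall_rate = 100 * positive_sces / tested_sces
# Excluding the known enhancers leaves 19 positives out of 23 fragments.
novel_rate = 100 * (positive_sces - known_positive) / (tested_sces - known_tested)
restricted_rate = 100 * tissue_restricted / positive_sces

print(f"overall: {overall_rate:.1f}%")
print(f"excluding known enhancers: {novel_rate:.1f}%")
print(f"tissue-restricted among positives: {restricted_rate:.1f}%")
```

The last figure, about 91% of the positive elements, is not stated in the text but is implied by the 20 out of 22 count.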

Table 3 Analysis of X-Gal staining in zebrafish embryos co-injected with the HSP promoter and SCEs or control fragments. For each DNA fragment tested the following information is given, from left to right: the gene locus in which the DNA fragment is found; whether the gene is classified in the 'trans-dev' class by GO (Y = yes, N = no); the identifier given to the SCE or control fragment; the size of the SCE; the class (rev = reversed, mov = moved, mre = moved and reversed, col = collinear, Ctrl = control); whether the element shows potential enhancer function (Y = yes, N = no); the number of embryos injected; the total number of X-Gal-stained cells; the ratio of stained cells to the number of embryos observed (with bold highlighting those with significant generalized enhancer activity); the P values for the significance of the number of cells observed in the fragment tested versus the HSP:lacZ control for each tissue (bold for P values < 0.01; see Materials and methods). See Additional data file 3 for further information on the fragments tested. CNS, central nervous system; SCE, shuffled conserved element.


Gene           Trans-dev  Name   SCE bp  Class        ENH  Embryos  Cells  ce/emb  P values (Muscle, Notochord, CNS, Eye, Ear, Vessels, Other)
No             NA         lacZ           Neg control       161      40     0.25
Shh            Y          ArC            Pos control       96       242    2.52    8.48E-07
Shh            Y          12058  45      Rev          Y    139      69     0.5     6.86E-09
Otx2           Y          13988  51      Mov          Y    111      93     0.84    0.6444 0.006269 0.5536 0.3155
Gata3          Y          15402  40      Mre          Y    107      103    0.96    0.398 0.5764 0.1906 1
Ets            Y          8744   40      Mov          Y    105      180    1.57    0.002593 4.78E-09
Ets            Y          8745   46      Mov          Y    133      210    1.58    0.1558 0.6015 0.3619 2.15E-06
Ets            Y          8726   41      Mre          Y    159      345    2.17    0.05534 0.6136 0.1485 2.08E-06
Ets            Y          8728   48      Mre          Y    149      176    1.18    0.0444 0.129 0.07924 1.31E-05
Pax2b          Y          31027  39      Col          Y    149      105    0.7     0.002374 0.06327 0.1902
Pax6a          Y          15696  33      Mov          Y    133      122    0.92    8.21E-06 0.3343 0.01268
Pax3           Y          24781  42      Mov          N    124      67     0.54    0.02982 0.5287 1
Zfpm2          Y          23818  48      Col          Y    140      119    0.85    1.49E-06 0.01296 1
Zfpm2          Y          23838  48      Mre          Y    131      148    0.98    0.0003576 0.04369 0.1231
Tmeff2         N          26014  48      Mov          N    164      125    0.76    0.7654 0.02301 0.3371 0.2801
Tmeff2         N          26015  38      Mov          Y    120      159    1.33    0.001035 0.303 0.2088
Tmeff2         N          26016  51      Mre          Y    109      148    1.36    0.0006309 0.0149 0.5862
Jag1b          Y          16407  37      Col          N    136      98     0.72    1 0.1849 1 1
Jag1b          Y          16408  55      Col          Y    142      109    0.86    5.45E-08 0.006524 0.3245
Jag1b          Y          16409  44      Rev          N    106      54     0.51    1 0.5088 1 0.5058
Mapkap1        N          17058  37      Mov          Y    143      295    2.06    0.6825 0.05292 0.3788 0.6065 1
Mapkap1        N          17059  39      Mov          Y    136      171    1.26    0.6686 0.004037 0.5973 0.077 0.5197
Mab21l2        Y          23001  42      Col          Y    142      317    2.23    1.24E-07 0.004985 0.2339
Mab21l2        Y          23002  37      Mre          Y    155      122    0.79    7.85E-08 0.004138
Hmx3           Y          11669  150     Col          Y    165      136    0.82    0.001029 0.07062 0.01423
Lmx1b          Y          17027  300     Col          Y    116      105    0.91    0.00762 0.1876 1
3110004L20Rik  N          5803   45      Mre          N    65       16     0.25    0.2929 1
3110004L20Rik  N          5802   39      Mov          Y    122      320    2.62    0.1874 0.01209
Elmo1          N          6026   45      Rev          Y    103      76     0.74    0.007132 0.6848
Ets            Y          11216  NA      Ctrl         N    104      74     0.71    1 0.6954
Gata3          Y          3255   NA      Ctrl         N    174      110    0.63    0.04481 0.281 0.5739 0.02163
1300007F04Rik  N          2797   NA      Ctrl         N    157      115    0.73
Tmeff2         N          198    NA      Ctrl         N    145      23     0.16    0.7448 0.6597 0.3651
Mab21l2        Y          909    NA      Ctrl         N    165      92     0.56    0.06359 1 1 1
3110004L20Rik  N          410    NA      Ctrl         N    107      23     0.21    0.01984
Elmo1          N          10157  NA      Ctrl         N    146      38     0.26    0.287 0.8126
Shh            Y          11271  NA      Ctrl         Y    165      83     0.5     3.34E-07 1 1 1
Impact         Y          5990   NA      Ctrl         N    150      101    0.67    0.6496 0.2754 0.0622
Ubl7           N          268    NA      Ctrl         Y    117      644    5.5     0.0003325 7.15E-11 0.02555 0.6197
Lmx1b          Y          11767  NA      Ctrl         N    116      15     0.13    0.2743 0.0707 1
Irx3           Y          5945   NA      Ctrl         N    93       15     0.16    0.03938

P values are listed in the order in which they appear for each row; empty tissue cells and bold highlighting are not preserved here, so individual P values cannot be assigned to specific tissue columns.
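The ce/emb column of Table 3 is the total number of X-Gal-stained cells divided by the number of embryos. The sketch below recomputes that ratio for a handful of rows copied from the table; the row selection, the names, and the 0.01 rounding tolerance are illustrative assumptions, and the significance testing itself (described in Materials and methods) is not reproduced here.

```python
# (label, embryos, stained_cells, reported_ce_per_emb), values copied from Table 3.
ROWS = [
    ("lacZ negative control", 161, 40, 0.25),
    ("ArC positive control", 96, 242, 2.52),
    ("Mab21l2 SCE 23001", 142, 317, 2.23),
    ("Ubl7 control 268", 117, 644, 5.5),
]

def cells_per_embryo(stained_cells, embryos):
    """Average number of stained cells per injected embryo."""
    return stained_cells / embryos

def matches_reported(stained_cells, embryos, reported, tolerance=0.01):
    """True if the recomputed ratio agrees with the tabulated value
    to within the (assumed) rounding tolerance."""
    return abs(cells_per_embryo(stained_cells, embryos) - reported) <= tolerance

for label, embryos, cells, reported in ROWS:
    ratio = cells_per_embryo(cells, embryos)
    flag = "ok" if matches_reported(cells, embryos, reported) else "check"
    print(f"{label:24s} {ratio:.2f} (reported {reported}) {flag}")
```

Normalizing by embryo count matters because the number of embryos injected differs from fragment to fragment, so raw cell totals are not directly comparable.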



Figure 7 Expression profiles of X-Gal stained embryos. (a-f) Expression profiles of 1-day-old X-Gal stained zebrafish embryos. Each expression map represents a composite overview of the LacZ-positive cells of 65-175 embryos. Gene names and fragment/SCE ids are shown. The detailed distribution of X-Gal stained cells in different tissues, as well as data for all other fragments, is shown in Table 3. Side views of the head region of LacZ-stained embryos are shown with anterior to the left. (a) HSP:lacZ-injected embryo. (d) Embryo co-injected with SCE 3121 associated with the Jag1b gene. (f) Embryo co-injected with SCE 4939 associated with the Mab21l2 gene. SCE, shuffled conserved element.

The expression patterns obtained in our experiments were compared with expression data retrieved from the Zebrafish Information Network [35,36].

Multiple SCEs found within a single gene locus gave similar tissue-restricted enhancer activity. For example, all four SCEs tested from the ets-1 locus gave expression that was highly specific to the blood precursors (SCE 1646 in Figure 7c). This result is in accordance with reported data, which showed ets-1 expression in the arterial and venous systems. Moreover, both elements tested from the zfpm2 gene (also described as fog2 [37]) gave enhancer activity specific to the central nervous system (CNS), which is in accordance with a recent report showing that the expression of both fog2 paralogs is restricted to the brain [37]. Similarly, elements tested from the mab-21-like genes gave CNS- and eye-specific enhancer activity (SCE 4939; Figure 7f). This pattern of expression corresponds with the patterns reported in the brain, neurons, and eye [38,39].

The SCEs found in the pax6a and hmx3 genes gave CNS-specific enhancement, which is in accordance with the reported expression of these genes in the CNS [35]. Finally, SCE 3121 from the jag1b gene gave specific expression in the CNS and in the eye (Figure 7d), which is in partial agreement with the reported expression of this gene (expressed in the rostral end of the pronephric duct, nephron primordia, and the region extending from the otic vesicle to the eye [40]).

Novel enhancer functions were also detected for SCEs neighboring lmx1b1, which showed CNS-specific activity, and for SCEs neighboring four genes not belonging to the trans-dev category, namely mapkap1 (Figure 7e), tmeff2 and 3110004L20Rik (producing proteins integral to the membrane), and elmo1 (associated with the cytoskeleton), which exhibited strong generalized and/or tissue-specific activity. No endogenous expression data are available for these genes for comparison. In contrast to the results with SCEs, only two of 12 (about 17%) of the genomic control fragment set derived from the same loci of the SCEs exhibited significant enhancement of LacZ activity (Table 3).
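To illustrate how far apart the two success rates are (22 of 27 SCEs versus 2 of 12 control fragments), the counts can be placed in a 2x2 table and examined with a one-sided Fisher's exact test. This is a hedged sketch only: the test is not part of the analysis described in this chapter, the function name is hypothetical, and the implementation uses nothing beyond the hypergeometric formula from the Python standard library.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (row 1 = SCEs, row 2 = controls); columns are
    enhancer-positive and enhancer-negative counts. Returns the probability
    of seeing at least `a` positives in row 1 if positives were distributed
    at random between the two groups (hypergeometric null).
    """
    row1 = a + b
    positives = a + c
    total = a + b + c + d
    denominator = comb(total, row1)
    upper = min(row1, positives)
    numerator = sum(
        comb(positives, k) * comb(total - positives, row1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator

# Counts taken from the text: 22 of 27 SCEs positive, 2 of 12 controls positive.
p_value = fisher_one_sided(22, 5, 2, 10)
print(f"one-sided Fisher's exact P = {p_value:.2e}")
```

The exact test is chosen for this illustration because the control group is small, where approximations based on large samples would be less reliable.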

Taken together, these data demonstrate that SCEs act as bona fide enhancers that can drive tissue-restricted as well as generalized expression during embryo development.
