Fish genomes : a powerful tool to uncover new functional elements in vertebrates

Stupka, E.

Citation

Stupka, E. (2011, May 11). Fish genomes : a powerful tool to uncover new functional elements in vertebrates. Retrieved from https://hdl.handle.net/1887/17640

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/17640

Note: To cite this publication please use the final published version (if applicable).

Chapter 3: Shuffling of cis-regulatory elements is a pervasive feature of the vertebrate lineage

Published in: Genome Biology, 2006, Vol 7:R56

Abstract

Background: All vertebrates share a remarkable degree of similarity in their development as well as in the basic functions of their cells. Despite this, attempts at unearthing genome-wide regulatory elements conserved throughout the vertebrate lineage using BLAST-like approaches have thus far detected noncoding conservation in only a few hundred genes, mostly associated with regulation of transcription and development. We used a unique combination of tools to obtain regional global-local alignments of orthologous loci. This approach takes into account shuffling of regulatory regions, which is likely to occur over evolutionary distances greater than those separating mammalian genomes. It revealed one order of magnitude more vertebrate conserved elements than was previously reported, in over 2,000 genes, including a high number of genes found in the membrane and extracellular regions. Our analysis revealed that 72% of the elements identified have undergone shuffling. We tested the ability of the elements identified to enhance transcription in zebrafish embryos and compared their activity with a set of control fragments. We found that more than 80% of the elements tested were able to enhance transcription significantly, prevalently in a tissue-restricted manner corresponding to the expression domain of the neighboring gene. Our work elucidates the importance of shuffling in the detection of cis-regulatory elements. It also elucidates how similarities across the vertebrate lineage, which go well beyond development, can be explained not only within the realm of coding genes but also in that of the sequences that ultimately govern their expression.

Introduction

Enhancers are cis-acting sequences that increase the utilization and/or specificity of eukaryotic promoters, can function in either orientation, and often act in a distance- and position-independent manner [1]. The regulatory logic of enhancers is often conserved throughout vertebrates, and their activity relies on sequence modules containing binding sites that are crucial for transcriptional activation.

However, recent studies on the cis-regulatory logic of Otx in ascidians pointed out that there can be great plasticity in the arrangement of binding sites within individual functional modules. This degeneracy, combined with the involvement of a few crucial binding sites, is sufficient to explain how the regulatory logic of an enhancer can be retained in the absence of detectable sequence conservation [2]. These observations, together with the fact that we are still far from fully understanding the grammar of transcription factor binding sites and their conservation [3], make it difficult to assess the extent of conservation in vertebrate cis-regulatory elements.

Very little is known about the evolutionary mobility of enhancer and promoter elements within the genome as well as within a specific locus. Sporadic studies of selected gene families have addressed questions related to the mobility of regulatory sequences involving promoter shuffling [4] and enhancer shuffling [5]; these describe the gain or loss of individual regulatory elements exchanged between specific genes in a cassette manner [6]. These studies suggested that a wide variety of different regulatory motifs and mutational mechanisms have operated upon non-coding regions over time. They were, however, conducted before the advent of large-scale genome sequencing, and thus on a scale that would not allow the authors to derive more general conclusions on the mobility and shuffling of regulatory elements.

The basic tenet of comparative genomics is that constraint on functional genomic elements has kept their sequence conserved throughout evolution. The completion of the draft sequence of several mammalian genomes has been an important milestone in the search for conserved sequence elements in noncoding DNA. It has been estimated that the proportion of small segments in the mammalian genome that is under purifying selection within intergenic regions is about 5% and that this proportion is much greater than can be explained by protein-coding sequences alone, implying that the genome contains many additional features (such as untranslated regions, regulatory elements, non-protein-coding genes, and structural elements) that are under selection for biological functions [7-11]. In order to address this issue, sequence comparisons across longer evolutionary distances and, in particular, with the compact Fugu rubripes genome have been shown to be useful in dissecting the regulatory grammar of genes long before the advent of genome sequencing [12]. More recently, the completion of the draft sequence of several fish genomes has allowed larger scale approaches for the detection of several regulatory conserved noncoding features.

Several studies have addressed the issue of conserved non-coding sequences on a larger scale. A first study on chromosome 21 [13] revealed conserved nongenic sequences (CNGs); these were identified using local sequence alignments of high similarity between the human and mouse genomes, and were shown to be untranscribed. A separate study focusing on sequences with 100% identity [14] revealed the presence of ultraconserved elements (UCEs) on a genome-wide scale, and finally conserved noncoding elements (CNEs) [15] were found by performing local sequence comparisons between the human and fugu genomes and were shown to have enhancer activity in zebrafish co-injection assays. Although the CNG study yielded a very large number of elements dispersed across the genome, and bearing no clear relationship to the genes surrounding them, the elements from the latter studies (UCEs and CNEs) were almost exclusively associated with genes that have been termed 'trans-dev' (that is, they are involved in developmental processes and/or regulation of transcription).

One of the major drawbacks of current genome-wide studies is that they rely on methods for local alignment, such as BLAST (basic local alignment search tool) [16] and FASTA [17], which were developed when the bulk of available sequences to be aligned were coding. It has been shown that such algorithms are not as efficient in aligning noncoding sequences [18]. To tackle this issue, new algorithms and strategies have been developed in order to search for conserved and/or over-represented motifs from sequence alignments, such as the motif conservation score [19], the threaded blockset aligner program [20] and the regulatory potential score [21], as well as phastCons elements and scores [22].

However, all of these rely on a BLAST-like algorithm to produce the initial sequence alignment. They are thus subject to some of the sensitivity limitations of this algorithm and do not constitute a major shift in alignment strategy that would model more closely the evolution of regulatory sequences.

Two approaches were recently reported which provide novel alignment strategies: the promoterwise algorithm coupled with 'evolutionary selex' [23] and the CHAOS (CHAins Of Scores) alignment program [24]. Whereas the former has been used to validate a set of short motifs, which have been shown to be of functional importance, the latter has not been coupled to experimental verification to estimate its potential for the discovery of conserved regulatory sequences. Unlike other fast algorithms for genomic alignment, CHAOS does not depend on long exact matches, it does not require extensive ungapped homology, and it does allow for mismatches within alignment seeds, all of which are important when comparing noncoding regions across distantly related organisms. Thus, CHAOS could be a suitable method for the identification of short conserved regions that have remained functional despite their location having changed during vertebrate evolution. The only method available that attempts to tackle the question of shuffled elements and that makes use of CHAOS is Shuffle-Lagan [25]; however, it has not been used on a genome-wide scale and its ability to detect enhancers has not been verified experimentally.

Until recently our ability to verify the function of sequence elements on a large scale within an in vivo context was strongly limited. This task was eased significantly by co-injection experiments in zebrafish embryos [26], which allow significant scale-up in the quantity of regulatory elements tested; this is fundamental when one is trying to elucidate general principles regarding regulatory elements, the grammar of which still eludes us. The co-injection technique used to test shuffled conserved regions (SCEs) for enhancer activity was previously shown to be a simple way to test cis-acting regulatory elements [15,27,28] and an efficient way to test many elements in a relatively short period of time [15].

The analysis described herein attempts to tackle the issue of the extent, mobility, and function of conserved noncoding elements across vertebrate orthologous loci using a unique combination of tools aimed at identifying global-local regionally conserved elements. We first used orthologous loci from four mammalian genomes to extract 'regionally conserved elements' (rCNEs) using MLAGAN [29], and then used CHAOS to verify the extent of conservation of those rCNEs within their orthologous loci within fish genomes. The analysis was conducted annotating the extent of shuffling undergone by the elements identified. Finally, we investigated the activity of rearranged and shuffled elements as enhancer elements in vivo. We found that the inclusion of additional genomes, the use of a combined global-local strategy, and the deployment of a sensitive alignment algorithm such as CHAOS yield an increase of one order of magnitude in the number of potentially functional noncoding elements detected as being conserved across vertebrates. We also found that the majority of these have undergone shuffling and are likely to act as enhancers in vivo, based on the more than 80% rate of functional and tissue-restricted enhancers detected in our zebrafish co-injection study.

Results

The dataset described in this analysis is available on the internet [30] for full download, as well as through a searchable site to identify SCEs belonging to individual genes.

Identification of mammalian regionally conserved elements

For each group of orthologous genes, global multiple alignments among the human, mouse, rat, and dog loci were performed using MLAGAN [29]. We took into consideration all genes for which there were predicted orthologs within Ensembl [31] in the mouse genome, human genome, and any third mammalian species, which led us to analyze 9,749 groups of orthologous genes (36% of the annotated mouse genes). Most genes (about 88%) were found to be conserved in all four species considered, with only about 12% found in three out of four species (about 6% in each triplet; Figure 1). For each locus we took into account the whole genomic repeat-masked sequence containing the transcriptional unit as well as the complete flanking sequences up to the preceding and following gene. This led us to analyze 37% of the murine genome sequence overall. The alignments were parsed using VISTA (visualizing global DNA sequence alignments of arbitrary length) [32], searching for segments of minimum 100 base pairs (bp) length and 70% identity. We further selected these regions by only taking into account those that were found at least in mouse, human, and a third mammalian species and which overlapped by at least 50 bp, which resulted in a set of 364,358 rCNEs (Table 1). These were then filtered stringently to distinguish 'genic' from 'nongenic' (see Materials and methods, below). This analysis classified 22.7% of the resulting rCNEs as 'genic', while 281,644 nongenic elements account for about 46 megabases, or 1.77%, of the murine genome.
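The length and identity cutoffs quoted above (at least 100 bp at 70% identity) can be illustrated with a minimal sliding-window sketch over one pair of rows from a global alignment. This is not VISTA's implementation; the function name `conserved_segments`, its parameters, and the window-merging rule are assumptions made purely for illustration.

```python
def conserved_segments(seq_a, seq_b, min_len=100, min_identity=0.70):
    """Return merged (start, end) alignment-column ranges (end exclusive)
    covered by at least one window of `min_len` columns whose identity
    is >= `min_identity`.

    Hypothetical helper: it only mimics the cutoffs quoted in the text,
    not the actual VISTA algorithm.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if n < min_len:
        return []
    # 1 where both rows carry the same base; gap columns never count as matches.
    match = [int(a == b and a != "-")
             for a, b in zip(seq_a.upper(), seq_b.upper())]
    segments = []
    window_matches = sum(match[:min_len])
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window one column to the right.
            window_matches += match[start + min_len - 1] - match[start - 1]
        if window_matches / min_len >= min_identity:
            end = start + min_len
            if segments and start <= segments[-1][1]:
                # Overlapping or adjacent passing windows are merged.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments


# 100 identical columns followed by 50 mismatching columns.
print(conserved_segments("A" * 100 + "C" * 50, "A" * 100 + "G" * 50))
```

Because whole passing windows are merged, a reported segment can extend into flanking columns that are themselves poorly conserved; a production tool would typically trim such edges.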

Figure 1 Number of conserved gene loci versus number of rCNEs identified in the mouse, rat, human, and dog genomes. Graph showing the number of rCNEs found conserved in the dog, rat, mouse and human genomes versus the number of genes found conserved across the same genomes. Although almost 90% of the genes can be found in all four genomes, most rCNEs can be found only in three out of four genomes. rCNE, regionally conserved element.

We further annotated mammalian rCNEs based on their position in the mouse genome with respect to the gene locus in order to define whether they were located before the annotated transcription start site (TSS; 'pre-gene'), within the intronic portion of the gene, or posterior to the transcriptional unit ('post-gene'). Approximately 54% of rCNEs were found to fall within intergenic regions, of which 37% were post-gene and 63% pre-gene (Table 1).

Table 1 Transcription potential, localization, and number of mammalian rCNEs. a) Type of conserved non-coding sequence (rCNE). b) Total number of rCNEs, including genic and nongenic. c) Number of genic rCNEs: overlapping EMBL proteins, ESTs, GenScan predictions, and Ensembl genes. d) Number of nongenic rCNEs: not overlapping EMBL proteins, ESTs, GenScan, and Ensembl genes. e) Total number of rCNEs, including pre-gene, intronic and post-gene. f) Number of pre-gene rCNEs: rCNEs localized before the translation start of the reference gene. g) Number of intronic rCNEs: rCNEs localized within the introns of the reference gene. h) Number of post-gene rCNEs: rCNEs localized after the translation end of the reference gene. EST, expressed sequence tag; rCNE, regionally conserved non-coding element.

Shuffling of conserved elements is a widespread phenomenon

We searched for conservation of rCNEs in teleost genomes using CHAOS [24], selecting regions that presented at least 60% identity over a minimum length of 40 bp as compared with the mouse sequence of the rCNEs. This method allowed us to identify regions that are reversed or moved in the fish locus with respect to the corresponding mammalian locus. For each locus in every species analyzed we took into account the whole genomic repeat-masked sequence containing the transcriptional unit as well as the complete flanking sequences up to the preceding and following gene. We defined as SCEs those regions of the mouse genome that were conserved at least in the fugu orthologous locus and filtered out any sequence shorter than 20 bp as a result of the overlap analysis with zebrafish and tetraodon (see Materials and methods, below, for details). Our analysis identified 21,427 nonredundant nongenic SCEs, which were found in about 30% of the genes analyzed (2,911; Table 2). The distribution of their length and percentage identity is shown in Figure 2e,f. The median length and percentage identity (45 bp and 67%, respectively) closely reflect the cutoffs provided to CHAOS in the alignment (40 bp and 60% identity), although there is a significant number of outliers whose length is equal to or greater than 200 bp (223 elements whose maximum length is 669 bp) and whose median percentage identity is 74%. No elements were identified that were completely identical to their mouse counterpart (the maximum percentage identity found was 97%).

 

Figure 2 Distribution of length, percentage identity and shuffling categories of SCEs. SCEs were categorized based on their change in location and orientation in Fugu rubripes with respect to their location and orientation in the mouse locus. The entire locus, comprising the entire flanking sequence up to the next upstream and downstream gene, was taken into consideration. Definitions of specific classes: (a) collinear SCEs (elements that have not undergone any change in location or orientation within the entire gene locus); (b) reversed SCEs (elements that have changed their orientation in the fish locus with respect to the mouse locus, but have remained in the same portion of the locus); (c) moved SCEs (elements that have moved between the pre-gene, post-gene and intronic portions of the locus); (d) moved-reversed SCEs (elements that have undergone both of the above changes). (e) Frequency distribution of SCE length in base pairs. (f) Frequency distribution of percentage identity of SCE hits in fugu. SCE, shuffled conserved region.

Table 2 Transcription potential, localization, and number of vertebrate SCEs. a) Type of SCE. b) Total number of SCEs, including genic and nongenic. c) Number of genic SCEs: overlapping EMBL proteins, ESTs, GenScan predictions, and Ensembl genes. d) Number of nongenic SCEs: not overlapping EMBL proteins, ESTs, GenScan, and Ensembl genes. e) Total number of SCEs, including pre-gene, intronic, and post-gene. f) Number of pre-gene SCEs: SCEs localized before the translation start of the reference gene. g) Number of intronic SCEs: SCEs localized within the introns of the reference gene. h) Number of post-gene SCEs: SCEs localized after the translation end of the reference gene. EST, expressed sequence tag; SCE, shuffled conserved element.

We decided to investigate further the extent to which the elements identified, which are still retained within the locus analyzed, have shuffled in terms of position and orientation relative to the transcriptional unit, and would thus be missed by a simple regional global alignment (such as MLAGAN). This analysis revealed that only 28% of the elements identified have retained the same orientation and the same position with respect to the transcriptional unit taken into account (that is to say, have remained pre-gene, intronic, or post-gene; labeled as 'collinear'; Figure 2a), whereas the others have shifted in terms of orientation ('reversed'; Figure 2b), position ('moved'; Figure 2c), or both ('moved-reversed'; Figure 2d). Thus, almost three-quarters (72%) of the SCEs identified would have been missed by a global, albeit regional, alignment approach.
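The four shuffling categories can be stated compactly in code. The sketch below is a minimal illustration of the definitions given in the text and in Figure 2, assuming each element has already been assigned a locus portion and a strand in both species; the `Placement` type, the `classify_sce` function, and the strand values in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where an element sits within a locus (hypothetical representation)."""
    portion: str  # "pre-gene", "intronic" or "post-gene"
    strand: int   # +1 or -1, relative to the direction of transcription


def classify_sce(mouse: Placement, fish: Placement) -> str:
    """Assign one of the four categories described in the text."""
    moved = mouse.portion != fish.portion
    flipped = mouse.strand != fish.strand
    if moved and flipped:
        return "moved-reversed"
    if moved:
        return "moved"
    if flipped:
        return "reversed"
    return "collinear"


# Sema6d-like case: downstream of the gene in mouse, upstream and
# reversed in fugu (strand values are illustrative only).
print(classify_sce(Placement("post-gene", +1), Placement("pre-gene", -1)))
```

Note that this portion-level definition treats a shift from one intron to another as unchanged position, because both placements are 'intronic'.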

A possible explanation for the large number of non-collinear elements is that they could appear shuffled owing to assembly artifacts. To assess this possibility, we analyzed the number of SCEs containing a single hit in fugu and not classified as collinear that also had a match in tetraodon. If the shuffling were merely due to assembly artifacts, then we would expect approximately half of the non-collinear hits in fugu also to be non-collinear in tetraodon. The results, however, were significantly different: more than 80% of the elements were non-collinear in both species (P < 2.2 × 10⁻¹⁶, obtained by performing a χ² comparison between the proportion obtained and the expected 0.5/0.5 proportion). These findings emphasize that shuffling is a mechanism of particular relevance when searching for short, well conserved elements across long evolutionary distances and that its true extent can only be detected by using a sensitive global-local alignment approach, as opposed to a fast genome-wide approach [25].
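The comparison against an expected 0.5/0.5 split is a χ² goodness-of-fit test with one degree of freedom. The sketch below shows how such a test can be computed with the Python standard library alone; the function name and the counts (800 versus 200) are hypothetical, since the text reports only that more than 80% of elements were non-collinear in both species.

```python
import math


def chi_square_vs_even_split(n_both_noncollinear, n_fugu_only_noncollinear):
    """Chi-square goodness-of-fit test against an expected 50/50 split.

    Returns (chi2, p). With one degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    total = n_both_noncollinear + n_fugu_only_noncollinear
    if total == 0:
        raise ValueError("at least one observation is required")
    expected = total / 2
    chi2 = sum((observed - expected) ** 2 / expected
               for observed in (n_both_noncollinear, n_fugu_only_noncollinear))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts chosen only to reproduce an 80/20 split.
chi2, p = chi_square_vs_even_split(800, 200)
print(chi2, p)
```

With counts of this size an 80/20 split lies far in the tail of the distribution, which is consistent with a P value below the 2.2 × 10⁻¹⁶ reporting floor used by some statistics packages.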

Two examples of SCEs that were identified in our study are shown in Figure 3. Example A shows the locus of Sema6d, a semaphorin gene whose product is located in the plasma membrane and is involved in cardiac morphogenesis. This locus contains a conserved element that is found after the transcriptional unit at the 3' end of the gene in all mammals analyzed, whereas it is located upstream in fish genomes and reversed in orientation in the fugu and tetraodon genomes. Example B shows the locus of the tyrosine phosphatase receptor type G protein, a candidate tumor suppressor gene, which has a conserved element in the first intron of all mammalian loci analyzed. This element is found in reversed orientation in all fish genomes: downstream of the gene in the fugu and tetraodon genomes, and in the second intron in the zebrafish genome.

 

Figure 3 Examples of loci containing shuffled conserved elements. (a) The Sema6d (sema domain, transmembrane domain, and cytoplasmic domain, semaphorin 6D; MGI:2387661) locus contains a post-genic moved-reversed conserved element. The SCE is found downstream from the gene in mammalian loci and upstream of the gene in fish genomes, and in reverse orientation only in the genomes of fugu and tetraodon. (b) The Ptprg (protein tyrosine phosphatase, receptor type G; MGI:97814) locus contains an intronic moved-reversed conserved element. The SCE is found in the first intron of the Ptprg gene in mammalian genomes, downstream of the gene in reverse orientation in fugu and tetraodon, and in the second intron in reverse orientation in zebrafish. Boxes represent the multiple alignments of the SCEs identified. SCE, shuffled conserved region.

Shuffled conserved regions cast a wider net of nongenic conservation across the genome

We analyzed the type of genes that are associated with SCEs by assessing the distribution of Gene Ontology (GO) terms [33] using GOstat [34] (see Materials and methods, below). Although the results indicate significant over-representation of gene classes typical of genes harboring noncoding conservation ('trans-dev' enrichment) as reported previously, the number of genes within our analysis containing nongenic SCEs (2,911) is approximately an order of magnitude greater than the number of genes containing CNEs (330). The overlap between the two datasets is 291 genes, and so almost all (>88%) genes containing CNEs also contain SCEs. A GO analysis comparing genes containing CNEs and those containing SCEs (Figure 4) revealed that there are several GO categories that are significantly under-represented in the CNE dataset as compared with ours. These categories were not seen in the previous analysis because they are not over-represented in our dataset as compared with the entire genome.
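The relationship between the two gene sets follows directly from the three counts quoted above (2,911 genes with SCEs, 330 with CNEs, 291 shared). The short sketch below works through that arithmetic; the function name and dictionary keys are hypothetical.

```python
def overlap_summary(n_sce_genes, n_cne_genes, n_shared):
    """Summarize how two gene sets overlap, given their sizes and intersection."""
    return {
        # Fraction of CNE-containing genes that also contain SCEs.
        "cne_genes_with_sces": n_shared / n_cne_genes,
        # Fraction of SCE-containing genes that also contain CNEs.
        "sce_genes_with_cnes": n_shared / n_sce_genes,
        # How many times larger the SCE gene set is.
        "sce_to_cne_gene_ratio": n_sce_genes / n_cne_genes,
    }


stats = overlap_summary(n_sce_genes=2911, n_cne_genes=330, n_shared=291)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The overlap of 291 genes covers about 88% of the CNE-containing genes but only about 10% of the SCE-containing genes, so the SCE set largely contains the CNE set and is roughly nine times larger.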

 

Figure 4 GO Classification of genes harboring CNEs versus genes harboring SCEs. All genes containing CNEs and/or SCEs were analyzed for GO term classification. Genes containing CNEs are shown in red and genes containing SCEs are shown in gray. Plots show differences in absolute numbers as well as relative percentages. Classification is shown for (a) cellular component and (b) biological process categories. CNE, conserved noncoding element; GO, Gene Ontology; SCE, shuffled conserved region.

The most striking difference is found in the analysis by cellular components; there is an approximate 54-fold enrichment in genes belonging to the extracellular regions that contain SCEs as compared with genes in the same class that contain CNEs. In fact SCEs are present in more than 50% of the genes we were able to classify as belonging to the extracellular matrix and in 35% of those belonging to the extracellular space, whereas CNEs are only found in six and two such genes, respectively. These gene sets differ significantly in both extracellular regions and membrane GO cellular component categories (P < 0.001).

Enrichments in the order of 10-fold to 13-fold are seen when comparing genes involved in physiological and cellular processes, respectively. For both of these categories our analysis was able to identify SCEs in more than 30% of the genes belonging to the class. The differences, although substantial (about sevenfold), are not as extreme when comparing 'trans-dev' genes (genes categorized as belonging to the 'regulation of biological process' and 'development' classes using GO) because the CNE dataset has a stronger bias for those genes (P < 0.001). Finally, although we identified SCEs in 40% of genes assigned to the 'behavior' class, none of the genes in this class has CNEs. The data thus suggest that there are both quantitative and qualitative differences between the two datasets.

The proximal promoter region is a shuffling 'oasis'

Because a large proportion of our dataset undergoes shuffling, we decided to investigate whether shuffling is a property that is dependent on proximity to the transcriptional unit. To address this question we divided our dataset of nongenic SCEs between collinear (as discussed above) and non-collinear (all other categories discussed above taken together) elements, and analyzed the distribution of their distances from the TSS (pre-gene set), the intron start (intron-start set), the intron end (intron-end set) and the 3' end of the transcript (post-gene set). This analysis demonstrated that collinear elements were distributed significantly closer to the start and the end of the transcriptional unit compared with non-collinear elements, whereas no differences were observed in terms of proximity to the intron start and intron end (Figure S1).

(19)

 

Figure S1 Boxplots comparing the distribution of the distance of collinear versus non-collinear non-genic SCEs from the transcriptional unit.

[Figure: four boxplot panels, each comparing collinear and noncollinear SCEs. Axes show distance in bp: from about -400,000 to the gene; from about -400,000 to the intron (panel title: INTRON START SCE distribution); from the gene to 400,000; and from the intron to 400,000 (panel title: INTRON END SCE distribution).]


In order to investigate this phenomenon at higher resolution, we subdivided all loci analyzed in our dataset into 1,000 bp windows within the four areas described above (pre-gene, intron-start, intron-end and post-gene), and verified whether the proportion of collinear versus non-collinear elements deviated significantly from the expected proportions in any of these windows (see Materials and methods, below, for details). The results of the analysis are shown in Figure 5. The only window that exhibited a high χ2 result, with significantly fewer shuffled elements than collinear ones (P = e-08), was the 1,000 bp window immediately upstream of the TSS. No similar results were found in any other 1,000 bp window across the gene loci analyzed. Similar results were obtained when deploying other window sizes (data not shown). To ascertain whether the result observed was due to annotation problems, we inspected the GO classification of the genes that presented non-genic collinear elements in the 1,000 bp window discussed above and observed significant enrichment (P < 0.001) for 'trans-dev' genes, whereas the same test conducted on genic collinear elements in the same window revealed no significant GO enrichment.
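The windowed comparison described above can be illustrated with a minimal Python sketch. Because the Materials and methods are not reproduced in this section, the sketch assumes that the expected proportion of collinear elements is the dataset-wide proportion and that each window is scored with a one-degree-of-freedom χ2 goodness-of-fit test. The function names and the example counts are hypothetical.

```python
import math


def window_chi2(collinear, noncollinear, expected_fraction):
    """Chi-square goodness-of-fit (1 d.f.) for one window.

    expected_fraction is the assumed expected share of collinear elements
    and must lie strictly between 0 and 1. Returns (chi2, p_value).
    """
    total = collinear + noncollinear
    if total == 0:
        return 0.0, 1.0  # empty window: nothing to test
    exp_col = total * expected_fraction
    exp_non = total - exp_col
    chi2 = ((collinear - exp_col) ** 2 / exp_col
            + (noncollinear - exp_non) ** 2 / exp_non)
    # For 1 d.f. the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def scan_windows(counts):
    """Score every window against the dataset-wide collinear fraction.

    counts is a list of (collinear, noncollinear) tuples, one per window;
    both classes are assumed to occur at least once overall.
    """
    total_col = sum(c for c, _ in counts)
    total_all = sum(c + n for c, n in counts)
    fraction = total_col / total_all
    return [window_chi2(c, n, fraction) for c, n in counts]


# Hypothetical counts: the middle window is enriched in collinear elements.
example = [(10, 30), (40, 10), (10, 30)]
results = scan_windows(example)
for (col, non), (chi2, p) in zip(example, results):
    print(f"collinear={col} noncollinear={non} chi2={chi2:.2f} P={p:.2e}")
```

A real analysis over many windows would also have to deal with small expected counts and with multiple testing; neither is handled in this sketch.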


 

Figure 5 Analysis of SCE shuffling in 1000 bp windows. Each column in the figure shows the analysis of a locus portion (pre-gene, intron-start, intron-end and post-gene) divided into 1000 bp windows. In each column the first graph indicates the number of collinear SCEs identified, the second graph the number of noncollinear SCEs identified, and the third graph the χ2 test used to identify windows that show a significant deviation from the expected proportion of collinear to noncollinear SCEs. The P value is shown for the only window (1000 bp upstream of the transcription start site) that exhibits significant deviation from the expected proportion. bp, base pairs; SCE, shuffled conserved region.

Shuffled  conserved  regions  are  able  to  predict  vertebrate  enhancers    

In order to verify the ability of SCEs to predict functional enhancer elements, we conducted an overlap analysis (see Materials and methods, below) of SCEs with 98 mouse enhancer elements deposited in GenBank. We compared the overlap of SCEs with that of two other datasets that present conservation in fish genomes, namely CNEs and UCEs. The results presented in Figure 6 show that although CNEs and UCEs detect only one and two known enhancers from our dataset, respectively, SCEs successfully detect 18 of them.
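A minimal sketch of such an overlap analysis is given below. The criterion actually used is described in Materials and methods, which are not part of this section, so the sketch assumes that an enhancer counts as detected when it shares at least one base with any conserved element on the same chromosome. The coordinates and names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_detected(enhancers, elements):
    """Number of enhancers overlapped by at least one conserved element."""
    return sum(
        1 for enh in enhancers
        if any(overlaps(enh, el) for el in elements)
    )


# Hypothetical coordinates, for illustration only.
enhancers = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
elements = [("chr1", 150, 160), ("chr2", 300, 400)]
detected = count_detected(enhancers, elements)
print(detected)
```

The same function can be run once per conserved-element dataset (SCEs, CNEs, UCEs) to obtain counts comparable to those reported in Figure 6. For genome-scale inputs a sorted or indexed interval structure would be preferable to this quadratic scan.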

 

Figure 6 Overlap of known mouse enhancers with conserved elements. All mouse enhancers deposited in GenBank (94) were mapped to the genome and compared with previously published conserved elements (UCEs and CNEs) as well as our own dataset of SCEs to verify their overlap. Only one known mouse enhancer is overlapped by a CNE and two by a UCE, whereas our dataset of SCEs identifies 18 known mouse enhancers as being conserved within fish genomes. CNE, conserved noncoding element; SCE, shuffled conserved region; UCE, ultraconserved element.

Shuffled  conserved  regions  act  as  enhancers  in  vivo    

In order to validate the cis-regulatory activity of SCEs we chose a subset of SCEs to be tested for in vivo enhancer activity by amplifying them from the fugu genome and co-injecting them in zebrafish embryos with a minimal promoter-reporter construct, yielding transient transgenic zebrafish embryos. Twenty-seven SCEs were tested, of which four overlapped known mouse enhancers for which activity had not previously been reported in fish, and the remaining 23 (from 12 genes, of which four were not trans-dev genes, for a total of eight fragments not associated with trans-dev genes) did not overlap any known feature. As a control set, 12 noncoding, non-repeated, and non-conserved fragments were also chosen for co-injection assays, of which nine were from the same genes from which SCEs had been picked and three were from random genes (see Materials and methods, below, for details). Owing to the mosaic expression patterns that are obtained with this technique, results were recorded in two ways: by counting the number of cells stained for X-Gal and recording, where possible, the tissue in which the LacZ-positive cells were found; and by plotting LacZ-positive cells on expression maps that represent a composite overview of the LacZ-positive cells of all the embryos tested. Results of the cell counts are shown in Table 3 and the expression maps are shown in Figure 7. The cell counts were used to define statistically which fragments exhibited tissue-restricted enhancer activity or generalized enhancer activity (see Materials and methods, below). As a positive control a published regulatory element from the shh locus, ar-C [27], was co-injected with the HSP:lacZ fragment. From a total of 27 SCEs, 22 (about 81%) were able to enhance significantly the activity of the HSP:lacZ construct in comparison with the embryos injected with HSP:lacZ only (see Materials and methods, below, for details). Among these, three of the four known mouse enhancers tested (those found to be conserved in fish) were confirmed to act as enhancers in fish. A similar percentage of positive results (82.6%) was obtained when these enhancers were excluded from the count. The enhancer effect in 20 out of the 22 positive SCEs was not generalized but observed in a tissue-restricted manner.
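The two percentages quoted above follow directly from the counts in the text, as the short Python calculation below shows. The variable names are illustrative; the only inputs are the numbers given in this paragraph.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


tested, positive = 27, 22             # all SCEs tested / SCEs scored as enhancers
known_tested, known_positive = 4, 3   # subset overlapping known mouse enhancers

# All SCEs: 22 of 27.
all_rate = percent(positive, tested)                                    # 81.5
# Excluding the known mouse enhancers: (22 - 3) of (27 - 4) = 19 of 23.
novel_rate = percent(positive - known_positive, tested - known_tested)  # 82.6

print(all_rate, novel_rate)
```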


 

Table 3 Analysis of X-Gal staining in zebrafish embryos co-injected with the HSP promoter and SCEs or control fragments. For each DNA fragment tested the following information is given, from left to right: the gene locus in which the DNA fragment is found; whether GO classifies the gene in the 'trans-dev' class (Y = yes, N = no); the identifier given to the SCE or control fragment; the size of the SCE; the class (rev = reversed, mov = moved, mre = moved and reversed, col = collinear, Ctrl = control); a summary of the potential enhancer function of the element (Y = yes, N = no); the number of embryos injected; the total number of X-Gal-stained cells; the ratio of stained cells to embryos observed (with bold highlighting those with significant generalized enhancer activity); and the P values for the significance of the number of cells observed in the fragment tested versus the HSP:lacZ control for each tissue (bold for P values < 0.01; see Materials and methods). See Additional data file 3 for further information on the fragments tested. CNS, central nervous system; SCE, shuffled conserved element.

Gene           Trans-dev  Name   SCE bp  SCE class    ENH  Embryo  Cell  ce/emb  P values, in order across the tissue columns (Muscle, Notochord, CNS, Eye, Ear, Vessels, Other; empty cells omitted)
No             NA         lacZ           Neg control       161     40    0.25
Shh            Y          ArC            Pos control       96      242   2.52    8.48E-07
Shh            Y          12058  45      Rev          Y    139     69    0.5     6.86E-09
Otx2           Y          13988  51      Mov          Y    111     93    0.84    0.6444  0.006269  0.5536  0.3155
Gata3          Y          15402  40      Mre          Y    107     103   0.96    0.398  0.5764  0.1906  1
Ets            Y          8744   40      Mov          Y    105     180   1.57    0.002593  4.78E-09
Ets            Y          8745   46      Mov          Y    133     210   1.58    0.1558  0.6015  0.3619  2.15E-06
Ets            Y          8726   41      Mre          Y    159     345   2.17    0.05534  0.6136  0.1485  2.08E-06
Ets            Y          8728   48      Mre          Y    149     176   1.18    0.0444  0.129  0.07924  1.31E-05
Pax2b          Y          31027  39      Col          Y    149     105   0.7     0.002374  0.06327  0.1902
Pax6a          Y          15696  33      Mov          Y    133     122   0.92    8.21E-06  0.3343  0.01268
Pax3           Y          24781  42      Mov          N    124     67    0.54    0.02982  0.5287  1
Zfpm2          Y          23818  48      Col          Y    140     119   0.85    1.49E-06  0.01296  1
Zfpm2          Y          23838  48      Mre          Y    131     148   0.98    0.0003576  0.04369  0.1231
Tmeff2         N          26014  48      Mov          N    164     125   0.76    0.7654  0.02301  0.3371  0.2801
Tmeff2         N          26015  38      Mov          Y    120     159   1.33    0.001035  0.303  0.2088
Tmeff2         N          26016  51      Mre          Y    109     148   1.36    0.0006309  0.0149  0.5862
Jag1b          Y          16407  37      Col          N    136     98    0.72    1  0.1849  1  1
Jag1b          Y          16408  55      Col          Y    142     109   0.86    5.45E-08  0.006524  0.3245
Jag1b          Y          16409  44      Rev          N    106     54    0.51    1  0.5088  1  0.5058
Mapkap1        N          17058  37      Mov          Y    143     295   2.06    0.6825  0.05292  0.3788  0.6065  1
Mapkap1        N          17059  39      Mov          Y    136     171   1.26    0.6686  0.004037  0.5973  0.077  0.5197
Mab21l2        Y          23001  42      Col          Y    142     317   2.23    1.24E-07  0.004985  0.2339
Mab21l2        Y          23002  37      Mre          Y    155     122   0.79    7.85E-08  0.004138
Hmx3           Y          11669  150     Col          Y    165     136   0.82    0.001029  0.07062  0.01423
Lmx1b          Y          17027  300     Col          Y    116     105   0.91    0.00762  0.1876  1
3110004L20Rik  N          5803   45      Mre          N    65      16    0.25    0.2929  1
3110004L20Rik  N          5802   39      Mov          Y    122     320   2.62    0.1874  0.01209
Elmo1          N          6026   45      Rev          Y    103     76    0.74    0.007132  0.6848
Ets            Y          11216  NA      Ctrl         N    104     74    0.71    1  0.6954
Gata3          Y          3255   NA      Ctrl         N    174     110   0.63    0.04481  0.281  0.5739  0.02163
1300007F04Rik  N          2797   NA      Ctrl         N    157     115   0.73
Tmeff2         N          198    NA      Ctrl         N    145     23    0.16    0.7448  0.6597  0.3651
Mab21l2        Y          909    NA      Ctrl         N    165     92    0.56    0.06359  1  1  1
3110004L20Rik  N          410    NA      Ctrl         N    107     23    0.21    0.01984
Elmo1          N          10157  NA      Ctrl         N    146     38    0.26    0.287  0.8126
Shh            Y          11271  NA      Ctrl         Y    165     83    0.5     3.34E-07  1  1  1
Impact         Y          5990   NA      Ctrl         N    150     101   0.67    0.6496  0.2754  0.0622
Ubl7           N          268    NA      Ctrl         Y    117     644   5.5     0.0003325  7.15E-11  0.02555  0.6197
Lmx1b          Y          11767  NA      Ctrl         N    116     15    0.13    0.2743  0.0707  1
Irx3           Y          5945   NA      Ctrl         N    93      15    0.16    0.03938


 

Figure 7 Expression profiles of X-Gal stained embryos. (a-f) Expression profiles of 1-day-old X-Gal stained zebrafish embryos. Each expression map represents a composite overview of the LacZ-positive cells of 65-175 embryos. Gene names and fragment/SCE id are shown. Detailed distribution of X-Gal stained cells in different tissues as well as data for all other fragments are shown in Table 3. Side views of the head region of LacZ-stained embryos are shown with anterior to the left. (a) HSP-lacZ injected embryo. (d) Embryo co-injected with SCE 3121 associated with Jag1b gene. (f) Embryo co-injected with SCE 4939 associated with Mab21l2 gene. SCE, shuffled conserved region.

The expression patterns obtained in our experiments were compared with expression data retrieved from the Zebrafish Information Network [35,36]. Multiple SCEs found within a single gene locus gave similar tissue-restricted enhancer activity. For example, all four SCEs tested from the ets-1 locus gave expression that was highly specific to the blood precursors (SCE 1646 in Figure 7c). This result is in accordance with reported data, which showed ets-1 expression in the arterial system and venous system. Moreover, both elements tested from the zfpm2 (also described as fog2 [37]) gene gave central nervous system (CNS)-specific enhancer activity, which is in accordance with a recent report showing that the expression of both fog2 paralogs is restricted to the brain [37]. Similarly, elements tested from the mab-21-like genes gave CNS-specific and eye-specific enhancer activity (SCE 4939; Figure 7f). This pattern of expression corresponds with the patterns reported in the brain, neurons, and eye [38,39]. The SCEs that were found in the pax6a and hmx3 genes gave CNS-specific enhancement, which is in accordance with the reported expression of these genes in the CNS [35]. Finally, SCE 3121 from the gene jag1b gave specific expression in the CNS and in the eye (Figure 7d), which is in partial agreement with reported expression of this gene (expressed in the rostral end of the pronephric duct, nephron primordia, and the region extending from the otic vesicle to the eye [40]).

Novel enhancer functions were also detected for SCEs neighboring lmx1b1, which showed CNS-specific activity, and SCEs neighboring four genes not belonging to the trans-dev category, namely mapkap1 (Figure 7e), tmeff2 and 3110004L20Rik (producing proteins integral to the membrane), and elmo1 (associated with the cytoskeleton), which exhibited strong generalized and/or tissue-specific activity. No endogenous expression data are available for these genes for comparison. In contrast to the results with SCEs, only two of the 12 fragments (about 17%) in the genomic control set derived from the same loci as the SCEs exhibited significant enhancement of LacZ activity (Table 3).

Taken together, these data demonstrate that SCEs act as bona fide enhancers that can drive tissue-restricted as well as generalized expression during embryo development.
