the issue
TCCTGGCCTACATGTTCTTTGGCAAAGGATCTTCAAAATCAACGGCTCCCGGTGCGGCGATCATCCATTTCTTCGGAGGGATTCACGAGATTTACTTCCCGTAC ATTCTGATGAAACCTGGCCCTGATTCTCGCAGCCATTGCCGGCGGAGCAAGCGGACTCTTAACATTACGATCTTTAATGCCGGACTTGTCGCGGCAGCGTCACC GGGAAGCATTATCGCATTGATGGCAATGACGCCAAGAGGAGGCTATTTCGGCGTATTGGCGGGTGTATTGGTCGCTGCAGCTGTATCGTTCATCGTTTCAGCAG TGATCCTGAAATCCTCTAAAGCTAGTGAAGAAGACCTGGCTGCCGCAACAGAAAAAATGCAGTCCATGAAGGGGAAGAAAAGCCAAGCAGCAGCTGCTTTAG AGGCGGAACAAGCCAAAGCAGAGAAGCGTCTGAGCTGTCTCCTGAAAGCGCGAACAAAATTATCTTTTCGTGTGATCCGGGATGGGATCAAGTGCCATGGGG GCATCCATCTTAAGAAACAAAGTGAAAAAGCGGAGCTTGACATCAGTGTGACCAACACGGCCATTAACAATCTGCCAAGCGATGCGGATATTGTCATCACCCA
CAAAGATTTAACAGACCGCGCGAAAGCAAAGCTGCCGAACGCGACGCACATATCAGTGGATAACTTCTTAAACAGCCCGAAATACGACGAGCTGATTGAAAA GCTGAAAAGTAATCTTATAGAAAGAGAGTATTGTCATGCAAGTACTCGCAAAGGAAACATTAAACTCAATCAAACGGTATCATCAAAAGAAGAGGCTATCAA ATTGGCAGGCCAGACGCTGATTGACAACGGCTACGTGACAGAGGATTACATTAGCAAAATGTTTGACCGTGAAGAAACGTCTTCTACGTTTATGGGGAATTTC ATTGCCATTCCACACGGCACAGAAGAAGCGAAAAGCGAGGTGCTTCACTCAGGAATTTCAATCATACAGATTCCAGAGGGCGTTGAGTACGGAGAAGGCAAC ACGGCAAAAGTGGTATTCGGCATTGCGGGTAAAAATAATGAGCATTTAGACATTTTGTCTAACATCGCCATTATCTGTTCAGAAGAAGAAACATTGAACGCCT GATCTCCGCTAAAGCGAAGAAGATTTGATCGCCATTTCAACGAGGTGAACTGACATGATCGCCTTACATTTCGGTGCGGGAAATATCGGGAGAGGATTTATCG GCGCGCTGCTTCACCACTCCGGCTATGATGTGGTGTTTGCGGATGTGAACGAAACGATGGTCAGCCTCCTCAATGAAAAAAAAGAATACACAGTGGAACTGGC GGAAGAGGGACGTTCATCGGAGATCATTGGCCCGGTGAGCGCTATTAACAGCGGCAGTCAGACCGAGGAGCTGTACCGGCTGATGAATGAGGCGGCGCTCAT CACAACAGCTGTCGGCCCGAATGTCCTGAAGCTGATTGCCCCGTCTATCGCAGAAGGTTTAAGACGAAGAAATACTGCAAACACACTGAATATCATTGCCTGC GAAAATATGATTGGCGGAAGCAGCTTCTTAAAGAAAGAAATATACAGCCATTTAACGGAAGCAGAGCAGAAATCCGTCAGTGAAACGTTAGGTTTTCCGAAT TCTGCCGTTGACCGGATCGTCCCGATTCAGCATCATGAAGACCCGCTGAAAGTATCGGTTGAACCATTTTTCGAATGGGTCATTGATGAATCAGGCTTTAAAGG GAAAACACCAGTCATAAACGGCGCACTGTTTGTTGATGATTTAACGCCGTACATCGAACGGAAGCTGTTTACGGTCAATACCGGACACGCGGTCACAGCGTAT GTCGGCTATCAGCGCGGACTCAAAACGGTCAAAGAAGCAATTGATCATCCGGAAATCCGCCGTGTTGTTCATTCGGCGCTGCTTGAAACTGGTGACTATCTCGT
CAAATCGTATGGCTTTAAGCAAACTGAACACGAACAATATATTAAAAATCAGCGGTCGCTTTTAAAATCCTTTCATTTCGGACGATGTGACCCGCGTAGCGAG GTCACCTCTCAGAAAACTGGGAGAAAATGTAGACTTGTAGGCCCGGCAAAGAAAATAAAAGAACCGAATGCACTGGCTGAAGGAATTGCCGCAGCACTGCGC TTCGATTTCACCGGTGACCCTGAAGCGGTTGAACTGCAAGCGCTGATCGAAGAAAAGGATACAGCGGCGTACTTCAAGAGGTGTGCGGCATTCAGTCCCATGA ACCGTTGCACGCCATCATTTTAAAGAAACTTAATCAATAACCGACCACCCGTGACACAATGTCACGGGCTTTTTACTATCTCGCAATCTAGTATAATAGAAAGC GCTTACGATAACAGGGGAAGGAGAATGACGATGAAACAATTTGAGATTGCGGCAATACCGGGAGACGGAGTAGGAAAGAGGTTGTAGCGGCTGCTGAGAAA GTGCTTCATACAGCGGCTGAGGTACACGGAGGTTTGTCATTCTCATTCACAGCTTTTCCATGGAGCTGTGATTATTACTTGGAGCACGGCAAAAATGATGCCCG AAGATGGAATACATACGCTTACTCAATTTGAAGCAGTTTTTGGGAGCTGTCGGAAATCCGAAGCTGGTTCCCGATCATATATCGTTATGGGGCTGCTGCTGAAA TCCGGAGGGAGCTTGAGCTTTCCATTAATATGAGACCCGCCAAACAAATGGCAGGCATTACGTCGCCGCTTCTGCATCCAAATGATTTTTGACTTCGTGGTGAT TCGCGAGAACAGTGAAGGTGAATACAGTGAAGTTGTCGGGCGCATTCACAGAGGCGATGATGAAATCGCCATCCAGAATGCCGTGTTTACGAGAAAAGCGAC AGAACGTGTCATGCGCTTTGCCTTCGAATTGGCGAAAAAACGGCGCACACTCGTGACAAGCGCCACAAAGTCTAACGGCATTTATCACGCGATGCCGTTTTGG GATGAAGTCTTTCAGCAGACAGCCGCTGATTATAGCGGAATCGAGACATCATCTCAGCATATTGATGCGCTGGCCGCTTTTTTTGTGACGCGTCCGGAAACGTT TGATGTCATTGTGGCGAGCAAATTGTTCGGTGATATTTTAACCGACATCAGCTCAAGCCTGATGGAAAGCATCGGCATTGCGCCTCCCGACATCAATCCATCCG GCAAATATCCGTCCATGTTTGAACCGGTTCACGGCTCAGCTCCTGACATTGCCGGACAGGCCTTGCCAATCCGATCGGCCAGATTTGGACAGCGAAGCTGATGC TCGACCACTTCGGAGAGGAAGAATTGGGGGCGAAAATTCTGGATGTAATGGAGCAAGTGACTGCCGACGGCATCAAAACACGCGACATTGGGGGACAAAGCA CAACGGCTGAGGTCACTGATGAAATCTGTTCGCGCTTAAGAAAGCTCTGATGAATCAGGCCGGTGGCAGATGGCTGCCCCGGTCTGTCCATTTCCTTACGAAA ATTTCCACGAAAGTCTAACCAAGCAGATCCAAATGCTGTATAATAATTTGGAATTCTTAGGAAAGCATCGGGTGAAGGAAGTTGAATGCAAAAACAATCACGT
TAAAGAAAAAAAGAAAAATCAAAACGATCGTTGTACTCAGTATCATTATGATCGCAGCTCTCATTTTTACGATCAGATTGGTGTTTTACAAGCCTTTTCTTATT GAAGGATCATCAATGGCCCCAACGCTTAAAGACTCAGAAAGAATTCTGGTTGATAAAGCAGTCAAATGGACTGGCGGGTTTCACAGAGGAGACATCATAGTC ATTCATGACAAAAAGAGCGGCCGCTCATTTGTCAAACGTTTAATCGGTTTGCCTGGTGACAGCATTAAAATGAAAAATGATCAGCTATACATAAATGATAAAA AGGTGGAAGAACCATACTTAAAGGAATATAAACAGGAGGTCAAAGAGTCGGGTGTAACCTTAACAGGTGACTTCGAAGTTGAGGTTCCTTCCGGTAAATATTT TGTGATGGGAGATAACCCTGATATAAGTGGAGCAATTAAACAAAATGGCGCCAAAGGATGTACGCGCCCTGATACGAGAGGGGAAAATAAACGGGCCGACCG CAGGCATGTCCGGCGGCTACGCCCAAGCGAATCTTGTGGTTTTGAAAAAGGACCTTGCGTTTGATTTTCTGCTGTTTTGCCAGCGAAATCAAAAGCCCTGCCCC
Annotation of the 400 kb contig around AP2 on chromosome IV
The different strategies to build the structure of genes
• experimental
• predictive
– extrinsic / comparative
– intrinsic / ab initio
the experimental approach
Methods to localize genes on genome sequences
The experimental approach
Identify and clone the cognate transcripts (as cDNA), sequence them, and compare the cDNA with the genomic DNA (gDNA).
It is the ONLY secure method!
Even this method has its bottlenecks: cDNAs are rarely full length ...
There are often alternative transcripts … but only one or a few are cloned or considered for analysis.
The nucleic acid sequence does not provide experimental information on the translation product(s).
A minimum of bioinformatics is needed:
cDNA and gDNA sequence comparison ...
and exact localization of splice sites at intron-exon borders: NNNag/Gtaagt……AG/gtNNN
For high throughput, this requires dedicated software:
e.g. Sim4
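Once the cDNA has been aligned to the gDNA, the intron-exon borders can be checked against the usual GT...AG intron ends. The following is a minimal sketch only: it assumes the exon coordinates on the gDNA are already known (this is the part a tool such as Sim4 computes), and the function names and the toy sequence are hypothetical.

```python
def intron_boundaries(gdna, exons):
    """Return the (donor, acceptor) dinucleotides of each intron.

    exons: list of (start, end) exon coordinates on gdna, 0-based,
    end-exclusive, sorted from 5' to 3' (assumed input format).
    """
    gdna = gdna.upper()
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = gdna[prev_end:next_start]
        boundaries.append((intron[:2], intron[-2:]))
    return boundaries


def splice(gdna, exons):
    """Join the exons to rebuild the transcript sequence."""
    return "".join(gdna[start:end] for start, end in exons)


def is_canonical(donor, acceptor):
    """True for the most common intron ends: GT at the 5' side, AG at the 3' side."""
    return donor == "GT" and acceptor == "AG"


# Toy gene: exon 1, one intron, exon 2 (invented sequence).
gdna = "ATGGCCAAG" + "GTAAGTCTTTTCTTTCAG" + "GTTGAATAA"
exons = [(0, 9), (27, 36)]

for donor, acceptor in intron_boundaries(gdna, exons):
    print(donor, acceptor, is_canonical(donor, acceptor))
```

An intron that fails this check is not necessarily wrong (non-canonical introns exist), but it is a good candidate for a misplaced border in the alignment.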
the predictive approaches
Methods to localize genes on genome sequences
Predictive Methods
the extrinsic (comparative) method
The extrinsic method: search for similarities in protein & nucleic acid sequence databases
Rationale:
many genes and proteins are already documented
the genomic DNA may contain one of them, or at least a close or distant homologue
Predictive Methods
The extrinsic method: protein databases
Due to a richer alphabet (20 amino acids compared to 4 nucleotides), protein sequence databases are the most efficient and the most informative.
In the best case, a hit in a database search indicates:
• the existence of a gene
• the complete exon-intron structure of this gene
• the function this gene codes for
Multiple alignment, instead of one-to-one comparison, makes it possible to find outliers among database homologues [e.g. partial sequences] or to point to peculiarities of the gene product that is the object of the search: here, the N-terminal extension indicates an organelle subcellular localization.
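One practical consequence of comparing at the protein level can be illustrated with a small sketch: two coding sequences that use different synonymous codons look only partly similar as nucleotides, yet are identical once translated. The two toy sequences and the helper names below are invented for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA sequence in frame 1, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical homologues coding the same peptide with different codons.
seq_a = "ATGGCTAAAGAACTGTCT"
seq_b = "ATGGCCAAGGAGTTATCA"

print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:   ", identity(translate(seq_a), translate(seq_b)))
```

Real database searches score similar (not only identical) amino acids with substitution matrices, which this sketch does not attempt to reproduce.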
Predictive Methods
The extrinsic method: limits & bottlenecks
Closely homologous sequences need to be present in the databases: orphan and fast-evolving genes are typically not found this way.
Partial and wrong sequences cause problems.
This approach identifies and gives the structure for only a fraction of the genes in a complete genome (e.g. 40%), and gives incomplete information for another fraction (e.g. 20%).
Predictive Methods
The extrinsic method: flaws & bottlenecks
Protein searches rely on correct gene annotation in databases …
Does a given database hit refer to an experimentally documented entity or to a virtual one?
How can the source of information be tracked, and the features given in databases be validated?
Predictive Methods
The extrinsic method: gDNA versus mRNAs
The EST case: what is it really?
Expressed Sequence Tags (ESTs)
obtained from mRNA isolated from a given organ and cloned as cDNA in large libraries
sequenced from one end (often 3’)
in a single pass, as far as possible (100-800 bp)
Predictive Methods
The extrinsic method: EST pros & cons
+ the closest to the experimental method
+ no assumption needed
+ alternative transcripts are often found this way
- poor quality of EST sequences (error rate >1%)
- unequal coverage, depending on gene expression level
- partial sequences (though they may be assembled)
- directional: 3’ (and 5’) exons are best covered
- many ESTs are needed for correct annotation: >10^6 for human
Predictive Methods
The extrinsic method: gDNA versus gDNA
The “Conserved Exon” Method:
comparison of non-documented genomic DNA with another non-documented gDNA
Rationale: since coding sequences are more conserved in evolution, (coding) exons should appear more similar to each other than introns and intergenic regions.
No need for transcript or protein data.
Applies well to comparisons between genomes of closely related species: e.g. mouse-human…
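The rationale of the conserved exon method can be sketched as a sliding-window identity profile over two genomic sequences. This is an illustration under strong assumptions: the two sequences are taken as already aligned (same length, "-" for gaps), and the window size, threshold and toy sequences are arbitrary choices, not values from any published tool.

```python
def window_identity(aligned_a, aligned_b, window, step=1):
    """Return (start, identity) for each window along two aligned sequences.

    Gap positions ("-") never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aligned_a) - window + 1, step):
        a = aligned_a[start:start + window]
        b = aligned_b[start:start + window]
        matches = sum(x == y and x != "-" for x, y in zip(a, b))
        profile.append((start, matches / window))
    return profile


def conserved_windows(profile, threshold):
    """Start positions of the windows whose identity reaches the threshold."""
    return [start for start, ident in profile if ident >= threshold]


# Toy alignment: a conserved block followed by a diverged block.
aligned_a = "ACGTACGTAC" + "TTTTTTTTTT"
aligned_b = "ACGTACGTAC" + "GAGAGAGAGA"

profile = window_identity(aligned_a, aligned_b, window=10, step=10)
print(profile)
print(conserved_windows(profile, threshold=0.8))
```

Windows above the threshold are only candidate exons: conserved non-coding elements give the same signal, so the profile is a hint rather than a gene model.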
Methods to localize genes on genome sequences
Predictive Methods
the intrinsic (ab initio) method
Intrinsic Gene Prediction
• Not every DNA sequence is a gene
• Sequences of genes have specific features, which are often linked to the expression of these genes:
• this applies to properties of sequences as a whole
– Coding sequences: 3 bp periodicity, codon usage, GC content
• or to local signals
– translation starts and stops, splice sites, polyA site, TATA box, promoter cis-acting motifs...
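Two of the global properties listed above, GC content and 3 bp periodicity, can be measured in a few lines. The sketch below computes the overall GC content and the GC content at each of the three codon positions; a marked difference between positions is one simple sign of periodicity. The function names and the toy CDS are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc_by_codon_position(cds):
    """GC content at codon positions 1, 2 and 3 of an in-frame sequence."""
    cds = cds.upper()
    return tuple(gc_content(cds[offset::3]) for offset in range(3))


# Toy in-frame coding sequence: ATG GCC AAG GTG TAA
cds = "ATGGCCAAGGTGTAA"

print("overall GC:", round(gc_content(cds), 3))
print("GC at positions 1, 2, 3:", gc_by_codon_position(cds))
```

On real data these statistics are computed over many genes; a single short sequence like this one is far too small to be meaningful.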
Intrinsic Gene Prediction
The case of prokaryotic (bacterial) genomes:
Genes do not contain introns and are generally close to each other.
The task then consists essentially in finding potential protein coding sequences (CDS).
Intrinsic Gene Prediction
Finding Protein Coding Sequences
Search for n-mers (hexamers)
3-periodic Markov models (GeneMark, Glimmer)
Why is this frame coding, and not any of the other 5 ?
[Figure: the six possible reading frames of a sequence, numbered 1 to 6]
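Before asking which frame is coding, the candidate frames have to be enumerated. The sketch below lists ATG-to-stop open reading frames in all six frames. It is an illustration only: the frame numbering (1-3 forward, 4-6 reverse), the coordinate convention and the toy sequence are assumptions, and coordinates on the reverse strand are given on the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one frame (0, 1 or 2).

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=2):
    """Return (strand, frame_number, start, end) for every ORF found."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    found = []
    for frame in range(3):
        found += [("+", frame + 1, s, e) for s, e in orfs_in_frame(seq, frame, min_codons)]
        found += [("-", frame + 4, s, e) for s, e in orfs_in_frame(rc, frame, min_codons)]
    return found


toy = "CCATGAAACCCGGGTAGCC"
print(six_frame_orfs(toy))
```

Enumeration alone does not answer the question on the slide: several frames often contain an ORF, and it is the statistical models (hexamers, 3-periodic Markov models) that score which one looks coding.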
Intrinsic Gene Prediction
The case of eukaryotic genomes:
Genes quite often contain introns, which may be numerous and/or large (example).
The space between genes (intergenic regions) may be large and may contain transposons and repeats.
[Figure: structure of a eukaryotic gene and of its transcript. The gene: Transcription Start Site, 5’UTR exon, 5’UTR intron, start exon (ATG, translation initiation), internal exons and internal introns, stop exon (stop), 3’UTR intron, 3’UTR exon. The transcript: CAP, 5’UTR (non coding), coding sequence (CDS) from ATG to stop, 3’UTR (non coding), AAAAAAA.]
Intrinsic Gene Prediction
Relies on combinatorial, statistical and/or A.I. methods
May integrate several individual sensors
Needs training sets of documented genes
Intrinsic Gene Prediction
Is not universal!
Each (group of) species has its own genome “style”.
Therefore:
• each method has to be trained, and even adapted, for a given genome, and needs a species-specific gene set for this purpose
• the performance of a given algorithm or integrated software may vary a lot from one species to another...
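The need for species-specific training can be illustrated with codon usage. The sketch below learns a codon frequency table from a training set of coding sequences and scores a query as a sum of log-odds against a uniform background; the same query scores well under a model trained on its own "style" and poorly under another. Everything here (function names, pseudocount, the two tiny training sets) is invented for illustration and is much simpler than what real gene finders use.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codons(cds):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def train_codon_usage(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from a list of coding sequences."""
    counts = Counter(c for cds in training_cds for c in codons(cds))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, usage):
    """Sum of log-odds of each codon against a uniform 1/64 background."""
    return sum(math.log(usage[c] * 64) for c in codons(seq) if c in usage)


# Two hypothetical species with different codon preferences.
species_a_train = ["ATGGCCGCCAAGGCCTAA", "ATGAAGGCCGCCAAGTGA"]
species_b_train = ["ATGGCTGCTAAAGCTTAA", "ATGAAAGCTGCTAAATGA"]
usage_a = train_codon_usage(species_a_train)
usage_b = train_codon_usage(species_b_train)

query = "GCCAAGGCC"  # written in the codon "style" of species A
print("score under model A:", round(coding_score(query, usage_a), 2))
print("score under model B:", round(coding_score(query, usage_b), 2))
```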
EuGene as an example of an integrated gene prediction and modeling platform
[Figure: the EuGene pipeline. Input: genome sequence. Extrinsic modules: Blastn, TBlastx, RepeatMasker. Intrinsic modules: content potential for coding, intron and intergenic (Poplar IMM); splice sites (SpliceMachine); start ATG (translation start site prediction). All modules feed the EuGene DAG. Output: gene models, given as feature locations of the form join(…) and complement(join(…)).]
EuGene, a Black Box?
[Figure: the same pipeline diagram. The extrinsic modules (Blastn, Blastx, TBlastx, RepeatMasker) and the intrinsic modules (Poplar IMM content potential for coding, intron and intergenic; SpliceMachine splice sites; start ATG / translation start site prediction) feed the graph that turns the input genome sequence into output gene models.]
EuGene Directed Acyclic Graph
Shifting from exon to intron …
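The idea of finding a gene model as a path through a directed acyclic graph can be sketched with a small dynamic program. This is not EuGene's actual graph or scoring: it is a toy with three states, one node per (position, state), made-up per-position scores standing in for the sensors, and a fixed penalty for shifting from one state to another.

```python
STATES = ("intergenic", "exon", "intron")

# Allowed moves between consecutive positions (assumed, simplified grammar).
ALLOWED = {
    "intergenic": ("intergenic", "exon"),
    "exon": ("exon", "intron", "intergenic"),
    "intron": ("intron", "exon"),
}


def best_path(scores, switch_penalty=1.0):
    """Return the highest-scoring state sequence.

    scores: one dict per position, mapping each state to a score
    (higher means the evidence favours that state).
    """
    n = len(scores)
    best = [{s: float("-inf") for s in STATES} for _ in range(n)]
    back = [{s: None for s in STATES} for _ in range(n)]
    for s in STATES:
        best[0][s] = scores[0][s]
    for i in range(1, n):
        for prev in STATES:
            for cur in ALLOWED[prev]:
                penalty = switch_penalty if cur != prev else 0.0
                candidate = best[i - 1][prev] + scores[i][cur] - penalty
                if candidate > best[i][cur]:
                    best[i][cur] = candidate
                    back[i][cur] = prev
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[n - 1][s])
    path = [state]
    for i in range(n - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]


def toy_scores(labels, reward=2.0):
    """Fake sensor output: each position rewards one state."""
    return [{s: (reward if s == label else 0.0) for s in STATES} for label in labels]


labels = ["intergenic"] * 2 + ["exon"] * 2 + ["intron"] * 2 + ["exon"] * 2 + ["intergenic"]
print(best_path(toy_scores(labels)))
```

The switch penalty is what keeps the path from following every local fluctuation: with a high enough penalty, a single position that looks like an intron inside an exon is ignored.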
EuGene self-training of intrinsic modules
Training set of poplar cDNAs mapped on the genome sequence
TBlastN against Arabidopsis full-length proteins
Discard cDNAs giving no hit
Let EuGene make predictions based on extrinsic data (Blastn, Blastx, RepeatMasker)
Select predicted genes covered by full-length (FL) cDNA
[Figure: the intrinsic modules being trained (coding potential / CDS: Poplar IMM; splice sites: SpliceMachine; start ATG: translation start site prediction) and the extrinsic modules, all feeding the EuGene DAG.]