the issue

Academic year: 2021


the issue

(3)

TCCTGGCCTACATGTTCTTTGGCAAAGGATCTTCAAAATCAACGGCTCCCGGTGCGGCGATCATCCATTTCTTCGGAGGGATTCACGAGATTTACTTCCCGTAC ATTCTGATGAAACCTGGCCCTGATTCTCGCAGCCATTGCCGGCGGAGCAAGCGGACTCTTAACATTACGATCTTTAATGCCGGACTTGTCGCGGCAGCGTCACC GGGAAGCATTATCGCATTGATGGCAATGACGCCAAGAGGAGGCTATTTCGGCGTATTGGCGGGTGTATTGGTCGCTGCAGCTGTATCGTTCATCGTTTCAGCAG TGATCCTGAAATCCTCTAAAGCTAGTGAAGAAGACCTGGCTGCCGCAACAGAAAAAATGCAGTCCATGAAGGGGAAGAAAAGCCAAGCAGCAGCTGCTTTAG AGGCGGAACAAGCCAAAGCAGAGAAGCGTCTGAGCTGTCTCCTGAAAGCGCGAACAAAATTATCTTTTCGTGTGATCCGGGATGGGATCAAGTGCCATGGGG GCATCCATCTTAAGAAACAAAGTGAAAAAGCGGAGCTTGACATCAGTGTGACCAACACGGCCATTAACAATCTGCCAAGCGATGCGGATATTGTCATCACCCA

CAAAGATTTAACAGACCGCGCGAAAGCAAAGCTGCCGAACGCGACGCACATATCAGTGGATAACTTCTTAAACAGCCCGAAATACGACGAGCTGATTGAAAA GCTGAAAAGTAATCTTATAGAAAGAGAGTATTGTCATGCAAGTACTCGCAAAGGAAACATTAAACTCAATCAAACGGTATCATCAAAAGAAGAGGCTATCAA ATTGGCAGGCCAGACGCTGATTGACAACGGCTACGTGACAGAGGATTACATTAGCAAAATGTTTGACCGTGAAGAAACGTCTTCTACGTTTATGGGGAATTTC ATTGCCATTCCACACGGCACAGAAGAAGCGAAAAGCGAGGTGCTTCACTCAGGAATTTCAATCATACAGATTCCAGAGGGCGTTGAGTACGGAGAAGGCAAC ACGGCAAAAGTGGTATTCGGCATTGCGGGTAAAAATAATGAGCATTTAGACATTTTGTCTAACATCGCCATTATCTGTTCAGAAGAAGAAACATTGAACGCCT GATCTCCGCTAAAGCGAAGAAGATTTGATCGCCATTTCAACGAGGTGAACTGACATGATCGCCTTACATTTCGGTGCGGGAAATATCGGGAGAGGATTTATCG GCGCGCTGCTTCACCACTCCGGCTATGATGTGGTGTTTGCGGATGTGAACGAAACGATGGTCAGCCTCCTCAATGAAAAAAAAGAATACACAGTGGAACTGGC GGAAGAGGGACGTTCATCGGAGATCATTGGCCCGGTGAGCGCTATTAACAGCGGCAGTCAGACCGAGGAGCTGTACCGGCTGATGAATGAGGCGGCGCTCAT CACAACAGCTGTCGGCCCGAATGTCCTGAAGCTGATTGCCCCGTCTATCGCAGAAGGTTTAAGACGAAGAAATACTGCAAACACACTGAATATCATTGCCTGC GAAAATATGATTGGCGGAAGCAGCTTCTTAAAGAAAGAAATATACAGCCATTTAACGGAAGCAGAGCAGAAATCCGTCAGTGAAACGTTAGGTTTTCCGAAT TCTGCCGTTGACCGGATCGTCCCGATTCAGCATCATGAAGACCCGCTGAAAGTATCGGTTGAACCATTTTTCGAATGGGTCATTGATGAATCAGGCTTTAAAGG GAAAACACCAGTCATAAACGGCGCACTGTTTGTTGATGATTTAACGCCGTACATCGAACGGAAGCTGTTTACGGTCAATACCGGACACGCGGTCACAGCGTAT GTCGGCTATCAGCGCGGACTCAAAACGGTCAAAGAAGCAATTGATCATCCGGAAATCCGCCGTGTTGTTCATTCGGCGCTGCTTGAAACTGGTGACTATCTCGT

CAAATCGTATGGCTTTAAGCAAACTGAACACGAACAATATATTAAAAATCAGCGGTCGCTTTTAAAATCCTTTCATTTCGGACGATGTGACCCGCGTAGCGAG GTCACCTCTCAGAAAACTGGGAGAAAATGTAGACTTGTAGGCCCGGCAAAGAAAATAAAAGAACCGAATGCACTGGCTGAAGGAATTGCCGCAGCACTGCGC TTCGATTTCACCGGTGACCCTGAAGCGGTTGAACTGCAAGCGCTGATCGAAGAAAAGGATACAGCGGCGTACTTCAAGAGGTGTGCGGCATTCAGTCCCATGA ACCGTTGCACGCCATCATTTTAAAGAAACTTAATCAATAACCGACCACCCGTGACACAATGTCACGGGCTTTTTACTATCTCGCAATCTAGTATAATAGAAAGC GCTTACGATAACAGGGGAAGGAGAATGACGATGAAACAATTTGAGATTGCGGCAATACCGGGAGACGGAGTAGGAAAGAGGTTGTAGCGGCTGCTGAGAAA GTGCTTCATACAGCGGCTGAGGTACACGGAGGTTTGTCATTCTCATTCACAGCTTTTCCATGGAGCTGTGATTATTACTTGGAGCACGGCAAAAATGATGCCCG AAGATGGAATACATACGCTTACTCAATTTGAAGCAGTTTTTGGGAGCTGTCGGAAATCCGAAGCTGGTTCCCGATCATATATCGTTATGGGGCTGCTGCTGAAA TCCGGAGGGAGCTTGAGCTTTCCATTAATATGAGACCCGCCAAACAAATGGCAGGCATTACGTCGCCGCTTCTGCATCCAAATGATTTTTGACTTCGTGGTGAT TCGCGAGAACAGTGAAGGTGAATACAGTGAAGTTGTCGGGCGCATTCACAGAGGCGATGATGAAATCGCCATCCAGAATGCCGTGTTTACGAGAAAAGCGAC AGAACGTGTCATGCGCTTTGCCTTCGAATTGGCGAAAAAACGGCGCACACTCGTGACAAGCGCCACAAAGTCTAACGGCATTTATCACGCGATGCCGTTTTGG GATGAAGTCTTTCAGCAGACAGCCGCTGATTATAGCGGAATCGAGACATCATCTCAGCATATTGATGCGCTGGCCGCTTTTTTTGTGACGCGTCCGGAAACGTT TGATGTCATTGTGGCGAGCAAATTGTTCGGTGATATTTTAACCGACATCAGCTCAAGCCTGATGGAAAGCATCGGCATTGCGCCTCCCGACATCAATCCATCCG GCAAATATCCGTCCATGTTTGAACCGGTTCACGGCTCAGCTCCTGACATTGCCGGACAGGCCTTGCCAATCCGATCGGCCAGATTTGGACAGCGAAGCTGATGC TCGACCACTTCGGAGAGGAAGAATTGGGGGCGAAAATTCTGGATGTAATGGAGCAAGTGACTGCCGACGGCATCAAAACACGCGACATTGGGGGACAAAGCA CAACGGCTGAGGTCACTGATGAAATCTGTTCGCGCTTAAGAAAGCTCTGATGAATCAGGCCGGTGGCAGATGGCTGCCCCGGTCTGTCCATTTCCTTACGAAA ATTTCCACGAAAGTCTAACCAAGCAGATCCAAATGCTGTATAATAATTTGGAATTCTTAGGAAAGCATCGGGTGAAGGAAGTTGAATGCAAAAACAATCACGT

TAAAGAAAAAAAGAAAAATCAAAACGATCGTTGTACTCAGTATCATTATGATCGCAGCTCTCATTTTTACGATCAGATTGGTGTTTTACAAGCCTTTTCTTATT GAAGGATCATCAATGGCCCCAACGCTTAAAGACTCAGAAAGAATTCTGGTTGATAAAGCAGTCAAATGGACTGGCGGGTTTCACAGAGGAGACATCATAGTC ATTCATGACAAAAAGAGCGGCCGCTCATTTGTCAAACGTTTAATCGGTTTGCCTGGTGACAGCATTAAAATGAAAAATGATCAGCTATACATAAATGATAAAA AGGTGGAAGAACCATACTTAAAGGAATATAAACAGGAGGTCAAAGAGTCGGGTGTAACCTTAACAGGTGACTTCGAAGTTGAGGTTCCTTCCGGTAAATATTT TGTGATGGGAGATAACCCTGATATAAGTGGAGCAATTAAACAAAATGGCGCCAAAGGATGTACGCGCCCTGATACGAGAGGGGAAAATAAACGGGCCGACCG CAGGCATGTCCGGCGGCTACGCCCAAGCGAATCTTGTGGTTTTGAAAAAGGACCTTGCGTTTGATTTTCTGCTGTTTTGCCAGCGAAATCAAAAGCCCTGCCCC

(4)

Annotation of the 400Kb contig around AP2 on chromosome IV

(5)

The different strategies to build the structure of genes:

• experimental
• predictive
– extrinsic / comparative
– intrinsic / ab initio

(6)

the experimental approach

(7)

Methods to localize genes on genome sequences

The experimental approach

Identify and clone the cognate transcripts (as cDNA), sequence them, and compare the cDNA with the genomic DNA (gDNA).

It is the ONLY secure method!

(8)

The experimental approach

Even this method has its bottlenecks:

• cDNAs are rarely full length.
• There are often alternative transcripts, but only one or a few are cloned or considered for analysis.
• The nucleic acid sequence does not provide experimental information on the translation product(s).

A minimum of bioinformatics is needed:

• cDNA and gDNA sequence comparison
• exact localization of splice sites at intron-exon borders: NNNag/Gtaagt……AG/gtNNN

For high throughput this requires dedicated software, e.g. Sim4.
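The comparison step above can be illustrated with a minimal Python sketch. It assumes the cDNA has already been aligned to the gDNA (for instance by a tool such as Sim4) and that the exon coordinates are available as 0-based, end-exclusive pairs. The function names and the toy sequences are hypothetical, and only the canonical GT...AG intron rule is checked.

```python
def intron_boundaries(exons):
    """Return the (start, end) of each intron lying between consecutive exons.

    `exons` is a sorted list of (start, end) pairs on the gDNA,
    0-based and end-exclusive.
    """
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def is_canonical_intron(gdna, start, end):
    """True if the intron begins with GT (donor) and ends with AG (acceptor)."""
    intron = gdna[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def spliced_transcript(gdna, exons):
    """Concatenate the exons, i.e. what the cDNA should look like."""
    return "".join(gdna[s:e] for s, e in exons)


# Toy example (invented sequences, for illustration only).
exon1 = "ATGGCCAAG"
intron = "GTAAGTCTAACTTTTTCTTGCAG"
exon2 = "GTGCTGAAATAA"
gdna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gdna))]

introns = intron_boundaries(exons)
all_canonical = all(is_canonical_intron(gdna, s, e) for s, e in introns)
```

An alignment whose introns fail this check is not necessarily wrong (non-canonical splice sites exist), but it is a good candidate for manual inspection.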

(9)

the predictive approaches

(10)

Methods to localize genes on genome sequences

Predictive Methods

the extrinsic (comparative) method

(11)

Methods to localize genes on genome sequences

Predictive Methods

The extrinsic method: search for similarities in protein and nucleic acid sequence databases.

Rationale:

• many genes and proteins are already documented
• the genomic DNA may contain one of them, or at least a close or distant homologue

(12)

Predictive Methods

The extrinsic method: protein databases

Owing to a richer alphabet (20 amino acids compared with 4 nucleotides), protein sequence databases are the most efficient and the most informative.

In the best case, a hit in a database search indicates:

• the existence of a gene
• the complete exon-intron structure of this gene
• the function this gene codes for

(13)

Multiple alignment, instead of one-to-one alignment, makes it possible to find outliers among database homologues (e.g. partial sequences) or to point to peculiarities of the gene product that is the object of the search. Here, the N-terminal extension signals subcellular localization in an organelle.

(14)

Predictive Methods

The extrinsic method: limits & bottlenecks

• Closely homologous sequences need to be present in the databases: orphan and fast-evolving genes are typically not found this way.
• Partial and wrong sequences cause problems.
• This approach identifies and gives the structure of only a fraction of the genes in a complete genome (e.g. 40%), and gives incomplete information for another fraction (e.g. 20%).

(15)

Predictive Methods

The extrinsic method: flaws & bottlenecks

• Protein searches rely on correct gene annotation in databases …
• Does a given database hit refer to an experimentally documented entity or to a virtual one?
• How can the source of information be tracked and the features given in databases validated?

(16)

Predictive Methods

The extrinsic method: gDNA versus mRNAs

The EST case: what is it, really?

Expressed Sequence Tags (ESTs) are:

• obtained from mRNA isolated from a given organ
• cloned as cDNA in large libraries
• sequenced from one extremity (often 3’)
• sequenced in a single pass, as far as possible (100-800 bp)

(17)

Predictive Methods

The extrinsic method: EST pros & cons

+ the closest to the experimental method
+ no assumption needed
+ alternative transcripts are often found this way

- poor quality of EST sequences (error rate > 1%)
- unequal coverage, depending on gene expression level
- partial sequences (though they may be assembled)
- directional: 3’ (and 5’) exons are best covered
- many ESTs are needed for correct annotation: > 10^6 for human

(18)

Predictive Methods

The extrinsic method: gDNA versus gDNA

The “Conserved Exon” method: comparison of non-documented genomic DNA with another non-documented gDNA.

Rationale: coding sequences being more conserved in evolution, (coding) exons should appear more similar to each other than introns and intergenic regions.

No need for transcript or protein data.

Applies well to comparisons between genomes of closely related species, e.g. mouse-human…
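The rationale of the conserved exon method can be sketched as a sliding-window identity profile. The Python example below assumes the two genomic sequences have already been aligned (equal length, with "-" for gaps); the function names, the window size and the 0.8 threshold are illustrative assumptions, not values from the slides.

```python
def window_identity(aln_a, aln_b, window=30, step=10):
    """Return (window_start, identity) pairs along a pairwise alignment.

    Identity is the fraction of columns in the window where both
    sequences carry the same nucleotide (gap columns never match).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, matches / window))
    return profile


def conserved_windows(profile, threshold=0.8):
    """Start positions of windows whose identity reaches the threshold."""
    return [start for start, identity in profile if identity >= threshold]


# Toy alignment: first half identical, second half diverged.
seq_a = "ACGT" * 10
seq_b = seq_a[:20] + "T" * 20
profile = window_identity(seq_a, seq_b, window=20, step=20)
candidates = conserved_windows(profile)
```

Windows that stand out from the surrounding background are candidate exons; the appropriate threshold depends on how far apart the two species are.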

(19)
(20)

Methods to localize genes on genome sequences

Predictive Methods

the intrinsic (ab initio) method

(21)

Intrinsic Gene Prediction

• Not every DNA sequence is a gene

• Sequences of genes have specific features, which are often linked to the expression of these genes:

• this applies to properties of sequences as a whole

– Coding sequences : 3bp-periodicity, codon usage, GC content

• or to local signals

– translation start and stops, splice sites, polyA site, TATA box, promoter cis-acting motifs....
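As an illustration of the "sequence as a whole" properties listed above, the following Python sketch computes overall GC content and GC content at each codon position for a given reading frame. The function names are hypothetical. A strong difference between the three codon positions is one simple symptom of the 3 bp periodicity of coding sequences.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_by_codon_position(seq, frame=0):
    """GC content at codon positions 1, 2 and 3 for the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return [gc_content(seq[frame + position::3]) for position in range(3)]
```

For example, `gc_by_codon_position("GATGATGAT")` returns `[1.0, 0.0, 0.0]`, an extreme toy case of position-dependent composition.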

(22)

Intrinsic Gene Prediction

The case of prokaryotic (bacterial) genomes :

Genes do not contain introns and are generally close to each other

The task then consists essentially in finding potential protein coding sequences (CDS).
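A first approximation of this task is a six-frame scan for open reading frames. The Python sketch below is a simplified illustration under stated assumptions: only ATG is accepted as start codon, the first ATG after a stop is taken as the start, and the minimum length is an arbitrary parameter. Real prokaryotic gene finders also use alternative start codons and coding statistics.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Return (start, end) of ATG...stop ORFs in the three frames of one strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq, min_codons=30):
    """Scan both strands; minus-strand hits are reported in plus-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    result = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    minus = orfs_in_strand(reverse_complement(seq), min_codons)
    result += [("-", n - e, n - s) for s, e in minus]
    return result
```

On a real genome this yields many overlapping candidates in different frames, which is why the content statistics of the next slides are needed to decide which candidates are likely to be genuine CDS.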

(23)

Intrinsic Gene Prediction

Finding Protein Coding Sequences

Search for n-mers (hexamers)

3-periodic Markov models (GeneMark, Glimmer)
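To make the idea of a 3-periodic Markov model concrete, here is a minimal Python sketch. It is not the GeneMark or Glimmer implementation: it uses a fixed order k, simple pseudocounts, and hypothetical function names, and it assumes upper-case sequences containing only A, C, G and T. One transition table is learned per codon position from known coding sequences, a single table is learned from background sequence, and a candidate frame is scored by the log-likelihood ratio.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(sequences, k=2, periods=3, pseudo=1.0):
    """Learn counts[phase][context][nucleotide].

    With periods=3 the phase is the codon position of the emitted
    nucleotide (training sequences must start in frame); with
    periods=1 the model is an ordinary, non-periodic Markov chain.
    """
    counts = [defaultdict(lambda: dict.fromkeys(ALPHABET, pseudo))
              for _ in range(periods)]
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[i % periods][seq[i - k:i]][seq[i]] += 1
    return counts


def log_prob(table, context, nucleotide):
    row = table[context]  # unseen contexts fall back to uniform pseudocounts
    return math.log(row[nucleotide] / sum(row.values()))


def frame_score(seq, frame, coding, background, k=2):
    """Log-likelihood ratio of seq being coding in `frame` versus background.

    `frame` is the index (0, 1 or 2) of the first codon's first base.
    """
    score = 0.0
    for i in range(k, len(seq)):
        context, nucleotide = seq[i - k:i], seq[i]
        phase = (i - frame) % 3
        score += log_prob(coding[phase], context, nucleotide)
        score -= log_prob(background[0], context, nucleotide)
    return score


def best_frame(seq, coding, background, k=2):
    return max(range(3), key=lambda f: frame_score(seq, f, coding, background, k))
```

The three reverse-strand frames can be scored the same way on the reverse complement of the sequence, which gives the six-way comparison asked about on the next slide.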

(24)

Why is this frame coding, and not any of the other 5?

[Figure: the six reading frames, labelled 1 to 6]

(25)

Intrinsic Gene Prediction

The case of eukaryotic genomes :

Genes quite often contain introns, which may be numerous and/or large (example).

The space between genes (intergenic regions) may be large and may contain transposons and repeats.

(26)

[Figure: structure of a eukaryotic gene and of its transcript. The gene: transcription start site, 5’UTR exon, 5’UTR intron, start exon (ATG, translation initiation), internal exons and internal introns, stop exon (stop), 3’UTR intron, 3’UTR exon. The transcript: CAP, non-coding 5’UTR, coding sequence (CDS) from ATG to stop, non-coding 3’UTR, poly(A) tail (AAAAAAA).]

(27)

Intrinsic Gene Prediction

Relies on combinatorial, statistical and/or A.I. methods.

May integrate several individual sensors.

Needs training sets of documented genes.

(28)

Intrinsic Gene Prediction

Is not universal!

Each (group of) species has its own genome “style”.

Therefore:

• each method has to be trained, and even adapted, for a given genome, and needs a species-specific gene set for this purpose
• the performance of a given algorithm or integrated software may vary a lot from one species to another...

(29)

EUGENE as an example of an integrated gene prediction and modeling platform

(30)

EuGene, a black box?

[Figure: EuGene workflow. Input: genome sequence. Extrinsic modules: Blastn, TBlastx, RepeatMasker. Intrinsic modules (labelled Poplar): IMM (content potential for coding, intron and intergenic regions), SpliceMachine (splice sites), translation start site prediction (start ATG). All evidence feeds the EuGene DAG. Output: gene models, given as feature locations of the form join(...) and complement(join(...)).]

(31)

[Figure: EuGene workflow with genome sequence as input and gene models as output. Extrinsic modules: Blastn, Blastx, TBlastx, RepeatMasker. Intrinsic modules (labelled Poplar): IMM (content potential for coding, intron and intergenic regions), SpliceMachine (splice sites), translation start site prediction (start ATG). All evidence feeds the EuGene DAG.]

(32)

EuGene Directed Acyclic Graph
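The slide only names the data structure, so the following Python sketch is a generic toy illustration of gene-structure prediction as a best-path search in a directed acyclic graph, not EuGene's actual graph or scoring. Each sequence position has one node per state; edges go from position i-1 to position i, so the graph has no cycles. The state names, the allowed transitions and the switch penalty are assumptions made for this example.

```python
STATES = ("intergenic", "exon", "intron")

# Allowed moves between consecutive positions (a simplified grammar).
ALLOWED = {
    "intergenic": ("intergenic", "exon"),
    "exon": ("exon", "intron", "intergenic"),
    "intron": ("intron", "exon"),
}


def best_path(scores, switch_penalty=2.0):
    """Return the highest-scoring state labelling of the sequence.

    `scores` is a list with one dict per position, mapping each state
    to the evidence score for that state at that position.
    """
    n = len(scores)
    best = [{s: float("-inf") for s in STATES} for _ in range(n)]
    back = [{} for _ in range(n)]
    for s in STATES:
        best[0][s] = scores[0][s]
    for i in range(1, n):
        for prev in STATES:
            for cur in ALLOWED[prev]:
                penalty = switch_penalty if cur != prev else 0.0
                candidate = best[i - 1][prev] + scores[i][cur] - penalty
                if candidate > best[i][cur]:
                    best[i][cur] = candidate
                    back[i][cur] = prev
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[n - 1][s])
    path = [state]
    for i in range(n - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]


def favour(state, high=5.0):
    """Toy evidence: `high` for one state, 0 for the others."""
    return {s: (high if s == state else 0.0) for s in STATES}


labels = (["intergenic"] * 3 + ["exon"] * 3 + ["intron"] * 3
          + ["exon"] * 3 + ["intergenic"] * 2)
toy_scores = [favour(s) for s in labels]
predicted = best_path(toy_scores)
```

In an integrated predictor, the per-position scores would come from the intrinsic and extrinsic modules (coding potential, splice site scores, similarity hits), and the best path through the graph is read out as the gene model.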

(33)

Shifting from exon to intron …

(34)

EuGene self-training of intrinsic modules

[Figure: self-training workflow; the order of the steps is reconstructed from the slide labels]

• Training set of poplar cDNAs mapped on the genome sequence
• TBlastN against Arabidopsis full-length proteins; discard cDNAs giving no hit
• Let EuGene make predictions based on extrinsic data (Blastn, Blastx, RepeatMasker)
• Select predicted genes covered by full-length (FL) cDNA
• Intrinsic modules concerned (Poplar): IMM (coding potential, CDS), SpliceMachine (splice sites), translation start site prediction (start ATG)

(35)

a typical EUGENE output graph
