• No results found

Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002

N/A
N/A
Protected

Academic year: 2021

Share "Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002"

Copied!
32
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002

(2)

n Sequence alignment n Hidden Markov Models n Profile HMMs

(3)
(4)

n Regular expression

n Problem: regular expression does not distinguish

n Exceptional TGCTAGG n Consensus ACACATC ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC [AT][CG][AC][ACGT]*A[TG][GC]

(5)

A .8 C 0 G 0 T .2 A 0 C .8 G .2 T 0 A .8 C .2 G 0 T 0 A 1 C 0 G 0 T 0 A 0 C 0 G .2 T .8 A 0 C .8 G .2 T 0 A .2 C .4 G .2 T .2 1.0 1.0 .6 .4 .6 .4 1.0 1.0 Sequence score 0472 0 8 0 1 8 0 1 1 6 0 4 0 6 0 8 0 1 8 0 1 8 0 ) ( . . . . . . . . . P = × × × × × × × × × × × × = ACACATC Transition probabilities Emission probabilities

(6)

n Log odds for sequence S: A: 1.16 T:-0.22 C: 1.16 G:-0.22 A: 1.16 C:-0.22 A: 1.39 G:-0.22 T: 1.16 C: 1.16 G:-0.22 A:-0.22 C: 0.47 G:-0.22 T:-0.22 0 0 -0.51 -0.92 -0.51 -0.92 0 0 25 . 0 log ) ( log 25 . 0 ) ( log P SL = P SL

(7)

65 . 6 16 . 1 0 16 . 1 39 . 1 51 . 0 47 . 0 51 . 0 16 . 1 0 16 . 1 0 16 . 1 ) ( odds log = + + + + − + − + + + + = − ACACATC -0.97 TGCT--AGG (exceptional) 4.6 ACCG--ATC 4.9 AGA---ATC 5.3 ACAC--AGC 3.0 TCAACTATC 4.9 ACA---ATG 6.7 ACAC--ATC (consensus) Log odds Sequence

(8)

n Hidden Markov Model

n The observations are the emitted symbols n The state transitions are hidden

n Estimation and training of HMMs can be done with

several dynamic programming methods

n We can compute the probability of observing a sequence

of observations given the sequence of the states (see previous slides)

(9)

n In general, we do not know the sequence of the states

(as the states are hidden) but we can find which sequence of states gives the highest probability of producing the sequence of observations = Viterbi

algorithm

n Also, we can estimate the probability of producing a

sequence of observations independently of the state sequence = forward algorithm

n Further, we can estimate the probability of the different

states at each observation in the observation sequence = backward algorithm

(10)

n Parameter estimation

n Based on a set of sequences, we want to find an HMM that

could have produced this set of sequences

n If the state sequence is known, estimating the transition and

emission probabilities amounts to counting

n If the state sequence is not known, we optimize the parameters

so as to maximize the overall likelihood of the data

n The strategy is to find the best state sequences given the current

parameters and to reestimate the parameters given the best state sequences; this procedure is then iterated until

convergence

n Viterbi training uses the Viterbi algorithm in the iteration

n Baum-Welch training uses the forward-backward algorithm in the

(11)
(12)

n Hidden Markov Models for multiple alignments n Match, insert, and delete states

(13)

GGWWRGdy.ggkkqLWFPSNYV IGWLNGynettgerGDFPGTYV PNWWEGql..nnrrGIFPSNYV DEWWQArr..deqiGIVPSK--GEWWKAqs..tgqeGFIPFNFV GDWWLArs..sgqtGYIPSNYV GDWWDAel..kgrrGKVPSNYL -DWWEArslssghrGYVPSNYV GDWWYArslitnseGYIPSTYV GEWWKArslatrkeGYIPSNYV GDWWLArslvtgreGYVPSNFV GEWWKAkslsskreGFIPSNYV GEWCEAgt.kngq.GWVPSNYI SDWWRVvnlttrqeGLIPLNFV LPWWRArd.kngqeGYIPSNYI RDWWEFrsktvytpGYYESGYV EHWWKVkd.algnvGYIPSNYV IHWWRVqd.rngheGYVPSSYL KDWWKVev..ndrqGFVPAAYV

Multiple alignment (+ conserved columns)

.85

(14)

containing previously unseen residues

n To avoid this problem, add pseudocounts (add extra counts as if

prior data was available)

New profile HMM

.85 .33

(15)

n More sophisticated pseudocounts are possible

n Different types of local alignments can be done with

HMMs (e.g., Smith-Waterman)

n Profile HMMs can be estimated from unaligned

sequences

n This produces also the multiple alignment n Need large number of sequences

(16)

n SAM: University of California Santa Cruz

n http://www.cse.ucsc.edu/research/compbio/sam.html

n Web service:

http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html (takes time)

n Hmmer (‘hammer’): Washington University, St. Louis

(17)

n PFAM

n http://www.sanger.ac.uk/Software/Pfam/search.shtml

n Collection of protein families and protein domains

n Provides multiple alignment of the protein families for the

domains

n Provides the domain organization of proteins n Provides profile HMMs of the domains

(18)
(19)
(20)

n Higher gene density in prokaryotes n No introns in prokaryotes

n In prokaryotes, large ORFs (>300-500bp) likely to contain a gene but

smaller genes are harder to detect

n Eukaryotes n Promoter n CpG islands n 5’UTR n Exons n Introns n 3’UTR n Splice sites n Repeats

(21)

n Sources of evidence (positive and negative)

n Similarity to features not likely to overlap coding sequence (e.g.,

Alu repeats)

n Sequence similarity to known genes (e.g., found by BLASTX) n Statistical measure of codon bias

n Template matches to functional sites (e.g., splice site) n The structure must respect the biological grammar

(22)

n Search by signal

n Detect short signals in the genome

n E.g., splice site, signal peptide, glycosylation site n Neural networks can be useful here

n Search by content

n Detect extended regions in the genome n E.g., coding regions, CpG islands

n Hidden Markov Models are useful here

(23)

n Hidden Markov Models can be used to predict genes

n Homology to a known gene is also a strong methods for

detecting genes

n More and more gene prediction packages combine both

(24)

n In mammalians, CpG islands have higher G+C and CG

dinucleotide content than the rest of the DNA

n CpG islands arise in active regions where no

deactivation by methylation takes place (CG

dinucleotides in methylated regions disappear by deamination)

n CpG islands may be used as gene markers in

(25)

n Repeats make up a large part of the human genome n Alu repeats

n Long Interspersed Elements (LINEs) n Short Interspersed Elements (SINEs)

(26)

n Promoter region contains the element that control the

expression of the gene

n Prediction of the promoter region (e.g., prediction of the

TATA-box) is difficult

n Current methods in human do not significantly better than the

(27)
(28)

n Introns usually

n Begin with GU dinucleotide (donor splice site) n End with AG dinucleotide (acceptor splice site)

n Neural networks hav been applied for analysis of splice

sites, they are able to detect complex correlation between positions in a functional site

(29)

n Polyadenylation (cleavage of pre-mRNA 3' end and

synthesis of poly-(A) tract) is a very important early step of pre-mRNA processing

n The most well-known signal involved in this process is

AATAAA, located 15-20 nucleotides upstream from the poly-(A) site (site of cleavage)

n Real AATAAA signals can differ from AATAAA

consensus sequence. The most frequent natural variant, ATTAAA, is nearly as active as the canonical sequence.

(30)

n Most coding potentials are based on analysis of codon

usage

n The HMMs keeps track of some kind of average coding

potential around each position

n The increase and decrease of the coding potential will

(31)
(32)

n Current performance on exon prediction is acceptable n However, grouping the correct exons into the genes is

still problematic

n In many cases, the majority of the predicted genes will

Referenties

GERELATEERDE DOCUMENTEN

Even though the WASN nodes are restricted to exchange information with neighbor- ing nodes only, the use of a distributed averaging algorithm results in a CAP model estimate with

Firstly, the link between the different rank-1 approximation based noise reduction filters and the original speech distortion weighted multichannel Wiener filter is investigated

Hearing aids typically use a serial concatenation of Noise Reduction (NR) and Dynamic Range Compression (DRC).. However, the DRC in such a con- catenation negatively affects

This paper presents a variable Speech Distortion Weighted Multichannel Wiener Filter (SDW-MWF) based on soft output Voice Activity Detection (VAD) which is used for noise reduction

Once again it is clear that GIMPC2 has allowed noticeable gains in feasibility and moreover has feasible regions of similar volume to OMPC with larger numbers of d.o.f. The reader

A parallel paper (Rossiter et al., 2005) showed how one can extend the feasible regions for interpolation based predictive control far more widely than originally thought, but

In [1] the construction of controllability sets for linear systems with polytopic model uncertainty and polytopic disturbances is described. These sets do not take a given

Keywords : Predictive control, LPV systems, interpolation, computational simplicity, feasibility This paper first introduces several interpolation schemes, which have been derived