Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002
n Sequence alignment n Hidden Markov Models n Profile HMMs
n Regular expression
n Problem: regular expression does not distinguish
n Exceptional TGCTAGG n Consensus ACACATC ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC [AT][CG][AC][ACGT]*A[TG][GC]
A .8 C 0 G 0 T .2 A 0 C .8 G .2 T 0 A .8 C .2 G 0 T 0 A 1 C 0 G 0 T 0 A 0 C 0 G .2 T .8 A 0 C .8 G .2 T 0 A .2 C .4 G .2 T .2 1.0 1.0 .6 .4 .6 .4 1.0 1.0 Sequence score 0472 0 8 0 1 8 0 1 1 6 0 4 0 6 0 8 0 1 8 0 1 8 0 ) ( . . . . . . . . . P = × × × × × × × × × × × × = ACACATC Transition probabilities Emission probabilities
n Log odds for sequence S: A: 1.16 T:-0.22 C: 1.16 G:-0.22 A: 1.16 C:-0.22 A: 1.39 G:-0.22 T: 1.16 C: 1.16 G:-0.22 A:-0.22 C: 0.47 G:-0.22 T:-0.22 0 0 -0.51 -0.92 -0.51 -0.92 0 0 25 . 0 log ) ( log 25 . 0 ) ( log P SL = P S − L
65 . 6 16 . 1 0 16 . 1 39 . 1 51 . 0 47 . 0 51 . 0 16 . 1 0 16 . 1 0 16 . 1 ) ( odds log = + + + + − + − + + + + = − ACACATC -0.97 TGCT--AGG (exceptional) 4.6 ACCG--ATC 4.9 AGA---ATC 5.3 ACAC--AGC 3.0 TCAACTATC 4.9 ACA---ATG 6.7 ACAC--ATC (consensus) Log odds Sequence
n Hidden Markov Model
n The observations are the emitted symbols n The state transitions are hidden
n Estimation and training of HMMs can be done with
several dynamic programming methods
n We can compute the probability of observing a sequence
of observations given the sequence of the states (see previous slides)
n In general, we do not know the sequence of the states
(as the states are hidden) but we can find which sequence of states gives the highest probability of producing the sequence of observations = Viterbi
algorithm
n Also, we can estimate the probability of producing a
sequence of observations independently of the state sequence = forward algorithm
n Further, we can estimate the probability of the different
states at each observation in the observation sequence = backward algorithm
n Parameter estimation
n Based on a set of sequences, we want to find an HMM that
could have produced this set of sequences
n If the state sequence is known, estimating the transition and
emission probabilities amounts to counting
n If the state sequence is not known, we optimize the parameters
so as to maximize the overall likelihood of the data
n The strategy is to find the best state sequences given the current
parameters and to reestimate the parameters given the best state sequences; this procedure is then iterated until
convergence
n Viterbi training uses the Viterbi algorithm in the iteration
n Baum-Welch training uses the forward-backward algorithm in the
n Hidden Markov Models for multiple alignments n Match, insert, and delete states
GGWWRGdy.ggkkqLWFPSNYV IGWLNGynettgerGDFPGTYV PNWWEGql..nnrrGIFPSNYV DEWWQArr..deqiGIVPSK--GEWWKAqs..tgqeGFIPFNFV GDWWLArs..sgqtGYIPSNYV GDWWDAel..kgrrGKVPSNYL -DWWEArslssghrGYVPSNYV GDWWYArslitnseGYIPSTYV GEWWKArslatrkeGYIPSNYV GDWWLArslvtgreGYVPSNFV GEWWKAkslsskreGFIPSNYV GEWCEAgt.kngq.GWVPSNYI SDWWRVvnlttrqeGLIPLNFV LPWWRArd.kngqeGYIPSNYI RDWWEFrsktvytpGYYESGYV EHWWKVkd.algnvGYIPSNYV IHWWRVqd.rngheGYVPSSYL KDWWKVev..ndrqGFVPAAYV
Multiple alignment (+ conserved columns)
.85
containing previously unseen residues
n To avoid this problem, add pseudocounts (add extra counts as if
prior data was available)
New profile HMM
.85 .33
n More sophisticated pseudocounts are possible
n Different types of local alignments can be done with
HMMs (e.g., Smith-Waterman)
n Profile HMMs can be estimated from unaligned
sequences
n This produces also the multiple alignment n Need large number of sequences
n SAM: University of California Santa Cruz
n http://www.cse.ucsc.edu/research/compbio/sam.html
n Web service:
http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html (takes time)
n Hmmer (‘hammer’): Washington University, St. Louis
n PFAM
n http://www.sanger.ac.uk/Software/Pfam/search.shtml
n Collection of protein families and protein domains
n Provides multiple alignment of the protein families for the
domains
n Provides the domain organization of proteins n Provides profile HMMs of the domains
n Higher gene density in prokaryotes n No introns in prokaryotes
n In prokaryotes, large ORFs (>300-500bp) likely to contain a gene but
smaller genes are harder to detect
n Eukaryotes n Promoter n CpG islands n 5’UTR n Exons n Introns n 3’UTR n Splice sites n Repeats
n Sources of evidence (positive and negative)
n Similarity to features not likely to overlap coding sequence (e.g.,
Alu repeats)
n Sequence similarity to known genes (e.g., found by BLASTX) n Statistical measure of codon bias
n Template matches to functional sites (e.g., splice site) n The structure must respect the biological grammar
n Search by signal
n Detect short signals in the genome
n E.g., splice site, signal peptide, glycosylation site n Neural networks can be useful here
n Search by content
n Detect extended regions in the genome n E.g., coding regions, CpG islands
n Hidden Markov Models are useful here
n Hidden Markov Models can be used to predict genes
n Homology to a known gene is also a strong methods for
detecting genes
n More and more gene prediction packages combine both
n In mammalians, CpG islands have higher G+C and CG
dinucleotide content than the rest of the DNA
n CpG islands arise in active regions where no
deactivation by methylation takes place (CG
dinucleotides in methylated regions disappear by deamination)
n CpG islands may be used as gene markers in
n Repeats make up a large part of the human genome n Alu repeats
n Long Interspersed Elements (LINEs) n Short Interspersed Elements (SINEs)
n Promoter region contains the element that control the
expression of the gene
n Prediction of the promoter region (e.g., prediction of the
TATA-box) is difficult
n Current methods in human do not significantly better than the
n Introns usually
n Begin with GU dinucleotide (donor splice site) n End with AG dinucleotide (acceptor splice site)
n Neural networks hav been applied for analysis of splice
sites, they are able to detect complex correlation between positions in a functional site
n Polyadenylation (cleavage of pre-mRNA 3' end and
synthesis of poly-(A) tract) is a very important early step of pre-mRNA processing
n The most well-known signal involved in this process is
AATAAA, located 15-20 nucleotides upstream from the poly-(A) site (site of cleavage)
n Real AATAAA signals can differ from AATAAA
consensus sequence. The most frequent natural variant, ATTAAA, is nearly as active as the canonical sequence.
n Most coding potentials are based on analysis of codon
usage
n The HMMs keeps track of some kind of average coding
potential around each position
n The increase and decrease of the coding potential will
n Current performance on exon prediction is acceptable n However, grouping the correct exons into the genes is
still problematic
n In many cases, the majority of the predicted genes will