Profile Searches
Revised 07/11/06
Overview
• Introduction
• Motif representation
• Motif screening
• Motif Databases
• Exercise
Features characteristic for the whole family Multiple sequence alignment
Introduction
How to represent the characteristic features?
Motif model: captures the family characteristic features
• regular expression, weight matrix, HMM profile
Introduction
Multiple sequence alignment
Construct model
Scan new sequence with the model
Unaligned sequences
• model: captures the family characteristic features
• used to detect remote homologs of a family
Overview
• Introduction
• Motif representation
– String based representation
• Consensus
• Regular expression
– Probabilistic representation
• PSSM
• HMM
• Profile
• Motif screening
• Motif Databases
• Exercise
HMM
Multiple sequence alignment
Construct model
Scan new sequence with the model
I. II.
Unaligned sequences
III.
Consensus sequence:
– Reductionistic representation of a motif
– Most frequent instance is used as a representative – Loss of information
Regular expression:
– More complex representation allowing motif degeneracy
String Based Representation
Symbol Meaning Origin of designation
G G Guanine
A A Adenine
T T Thymine
C C Cytosine
R G or A puRine
Y T or C pYrimidine
M A or C aMino
K G or T Keto
S G or C Strong interaction (3 H bonds) W A or T Weak interaction (2 H bonds)
H A or C or T not-G, H follows G in the alphabet B G or T or C not-A, B follows A
V G or C or A not-T (not-U), V follows U D G or A or T not-C, D follows C
N G or A or T or C aNy
CTTAATATTAACTTAAT Consensus
CTTAAKRTTMAYTTAAT Regular expression
String Based Representation
cell signalsignal
motif Gene 1 Gene 2Gene 3Gene 4
signal
?
translation transcription
mRNA
protein gene
chromosome
DNA motifs
String Based Representation
Sequences involved in enzymatic reactions (PROSITE)
String Based Representation
Overview
• Introduction
• Motif representation
– String based representation
• Consensus
• Regular expression
– Probabilistic representation
• PSSM
• HMM
• Profile
• Motif screening
• Motif Databases
• Exercise
Probabilistic
PSSM
Frequency matrix
w T T
T
w G G
G
w C C
C
w A A
A
... , 2 , 1
,
... , 2 , 1
,
... , 2 , 1
,
... , 2 , 1
,
G A A T T C A T G T C A C T T C A T T G
75 . 0 75 . 0 5 . 0 01 . 0 01 . 0
25 . 0 25 . 0 01 . 0 01 . 0 25 . 0
01 . 0 01 . 0 25 . 0 01 . 0 75 . 0
01 . 0 01 . 0 25 . 0 1
01 . 0
T G C A
75 . 0 75 . 0 5 . 0 0 . 0 0 . 0
25 . 0 25 . 0 0 . 0 0 . 0 25 . 0
0 . 0 0
. 0 25 . 0 0 . 0 75 . 0
0 . 0 0
. 0 25 . 0 1 0
. 0
T G C A
Pseudo Counts
Frequency matrix Alignment
735 . 0 735 . 0 495 . 0 0097 . 0 0098 . 0
245 . 0 245 . 0 0099 . 0 0097 . 0 245 . 0
0098 . 0 0098 . 0 2475 . 0 0097 . 0 735 . 0
0098 . 0 0098 . 0 2475 . 0 97 . 0 0098 . 0
T G C A
735 . 0 735 . 0 495 . 0 0097 . 0 0098 . 0
245 . 0 245 . 0 0099 . 0 0097 . 0 245 . 0
0098 . 0 0098 . 0 2475 . 0 0097 . 0 735 . 0
0098 . 0 0098 . 0 2475 . 0 97 . 0 0098 . 0
T G C A
5558 . 1 5558 . 1 985 . 0 68 . 4 673 . 4
0291 . 0 0291 . 0 65 . 4 68 . 4 0291 . 0
673 . 4 673 . 4 014 . 0 68 . 4 5558 . 1
673 . 4 673 . 4 014 . 0 95 . 1 673 . 4
T G C A
Probabilistic
PSSM
G A A T T C A T G T C A C T T C A T T G
Convert into PSSM Alignment
PSSM p(A)=p(C)=p(G)=p(T)=0.25
25) . 0
01 . (0 log2
Motif logo
PSSM
msa
Regular expression
Weight matrix
Motif logo
Motif Representation
CTTAATATTAACTTAAT Consensus
CTTAAKRTTMAYTTAAT Regular expression
PSSM (motif logo)
Definition HMM
• State sequence path p:
– Probability of a state depends only on the previous state
1 ) ( i l i k kl P
a
– Transition probability from state l to state k
– emission probability: probability that symbol b is seen when in state k
) (
)
(b P xi b i k
ek
akl ek(b)
State l State k
A HIDDEN Markov model: it is not possible to tell what state the system is in by looking at the corresponding symbol
Finding the possible paths = decoding
begin Mj
Ij Dj
end
HMM
HMM
Probabilistic model that represents the alignment of the family – Gapped multiple alignment
– Distinct states separated by transition probabilities (i.e. the probability of moving from one state to the next)
– The current stateis only dependent on the previous state (first order Markov process)
– The sequence of states followed in the model is called the path
– Each state has the probability of emitting a certain symbol of the alphabet
(A,C,T,G for DNA) or one of the 20 amino acids for proteins: emission probability
• HMM can model any possible sequence
• It defines a probability distribution over the whole space of sequences
• Training a HMM: search for the parametrisation that makes this distribution peak around members of the family
• Parametrisation
– Determine model structure
• Length of alignment
• Number of insert states
– Determine the probability parameters
HMM
HMM
Training a HMM
– Determine structure of the model
– Determine emission and transition probabilities
E.g. the first column: e1(A) = 4/5; e1(T) = 1/5; e1(C) = 0; e1(G) = 0;
E.g. the second column: e2(A) = 0; e2(T) = 0; e2(C) = 4/5; e2(G) = 1/5;
E.g. the third column: e3(A) = 4/5; e3(T) = 0; e3(C) = 1/5; e3(G) = 0;
ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC
A 0.8 C PC G PC T 0.2
A PC C 0.8 G 0.2 T PC
A 0.8 C G T 0.2
A 1 C G
1 1 0.4 T
A 0.2 C 0.4 G 0.2 T 0.2
0.6
0.6 0.4
Profile representation
• Suppose I (amino acid b) is the ancestor
• What is the probability of observing a T (amino acid a) in the first column (position p) of the alignment
• This probability is reflected by the score M M(p,a)= W(p,b) X Y(a,b)
M(1,I)= W(1,T) X Y(I,T)
M is dependent on
• The observed frequency of T in the first position of the alignment (W)
• The probability of mutating I => T (according to PAM) (Y)
I A … I S T T V A V A I L T V T V I A I V
b
Profile representation
gaps
Overview
• Introduction
• Motif representation
• Motif screening
• Motif Databases
• Exercise
HMM
Multiple sequence alignment
Construct profile HMM
Scan new sequence with the profile
I. II.
Unaligned sequences
III.
Screening
Screening
• The multiple alignment of the family is known (Clustal W)
• The motif to be detected is known but the multiple alignment does not yet exist
– Motifs already described in literature
– Construct the multiple alignment, derive the model
• Neither the motif nor the multiple alignment exist
– Probabilistic motif detection
• Obtained Motif Model used for genome wide screening (Motif Scanner)
• Identification of putative additional targets
• Use sliding window
• Attribute to each sequence within the sliding window a score
• Rank the hits based on their score and select the most promising candidates
Genome wide screening
Screening
Screening
• Distinct methods differ in the motif representation and the scoring system used
• Consensus Sequence or Regular expression (pattern match) – Very conservative
– Do not allow mismatches
• PSSM / HMM: more complicated scoring schemes – based on information content
– Log likelihood
– Less conservative
– Difficult choice of threshold score
– Tradeoff between sensitivity and selectivity
Screening
• FDR (1-Precision) FP/(TP+FP)
• Precision TP/(TP+FP)
• Specificity (related to the false positive fraction= 1-spec) TN/(TN+FP)
• Sensitivity (true positive fraction = recall) TP/(TP+FN)
Screening
• E- value: corresponds to the probability of
finding a score equal or better than the one
observed, by chance alone.
Screening with Regular Expression
Simple perl scripts
Screening with PSSM
i i
x W ix
i
p
score q
odds
1
log
2log
70 , 0 05 , 0 01 , 0 50 , 0 90 , 0 10 , 0
10 , 0 30 , 0 01 , 0 80 , 0 02 , 0 10 , 0
10 , 0 05 , 0 97 , 0 05 , 0 06 , 0 20 , 0
10 , 0 60 , 0 01 , 0 10 , 0 02 , 0 60 , 0 PSSM
Background frequency of each of the four nucleotides:
0 , 25
xi
p
Sequentie Odds Log odds Opmerking ATGCAT 720,8829 9,49362 consensus
CTGCGT 120,1471 6,90866 Similar sequence CATGGC 0,0002 -11,99046 Different sequence
• Slide a window of length W over a sequence
• Calculate for each subsequence within the window a log odds-score
• The highest scoring positions correspond to the most likely locations of the motif
9.4 = log2(720)
(0.6*0.9*0.8*0.97*0.6*0.7)/(0.25^6)
Screening with HMM
• Belongs a sequence to a family of proteins?
• Scoring a sequence with a HMM
– aligning the sequence to the HMM
– finding the hidden path that generates the sequence
• A sequence can be generated by different paths
• Enumerate all paths and calculate for each path the probability that is generates the sequence
• Viterbi Path: most likely path
• Total probability that sequence is generated by HMM = sum of probabilities of all possible paths
Screening with HMM
)
| ,
( ATCAGT m 1m 2 d 3 m 4 m 5i5 m 6 HMM P
0005772 , 0
95 , 0 70 , 0 95 , 0 25 , 0 20 , 0 60 , 0 85 , 0 97 , 0 80 , 0 10, 0 90 , 0 90 , 0 60 , 0 95 , 0
)
| (
).
| ( ).
| ( ).
| ( ).
| ( ).
| ( ).
| (
).
| ( ).
| ( ).
| ( ).
| ( ).
| ( ).
| ( ).
| (
6 6
5 6 5
5 5 5
4 5
4 3
4 2
3 2
1 2 1
1
m END
T m T P i m T i G P m i T m A P m m T
m C P d m T m d T m T P m m T m A P BEGIN
m T
)
| ,
(
log P ATCAGT m 1 m 2 d 3 m 4 m 5i5 m 6 HMM
7587 , 10
95 , 0 log 70
, 0 log 95
, 0 log 25 , 0 log ....
90 , 0 log 60 , 0 log 95 , 0 log
)
| (
log )
| ( log )
| ( log )
| ( log ...
)....
| ( log )
| ( log )
| ( log )
| ( log
6 6
5 6 5
2 1
2 1
1
m END
T m
T P i
m T i
G P
m T P m
m T m
A P BEGIN
m T
Example for 1 path
ATCAGT
Screening with HMM
• Calculate the probability of the sequence being generated by the HMM profile of a protein family versus a random model
= align the unknown sequence with the HMM
– The sequence can be generated by different paths
• Impossible to enumerate all possibilities
– What is the most probable path? (Viterbi, backtracking) – What is the total probability? (Forward)
) (
) mod 2 (
log P data Random el data
P
Bits score
A T
A -
- T ATT
and
TTC
Screening with HMM
• Hidden Markov model because if we observe a sequence, the path of states that was followed by the Markov model to generate the observed sequence is unknown or hidden.
• This hidden path contains the information on how the observed sequence should be aligned with the profile.
• Usually a sequence can be generated in multiple ways by the Markov model and more hidden paths (corresponding to distinct alignments) are possible. Usually not all possible paths have an equal probability. Indeed some transitions are not very likely (low transition probability). Usually the path with the highest probability (highest score = most likely path) corresponds to the best alignment.
Screening with HMM
• Detecting the underlying sequence of states allows to uncover the most probable path of transitions (decoding)
– VITERBI Algorithm: most probable path (backtracking)
• Start at first position (state k)
• Move to next most probable state l
– Vk(i) is the probability of the most probable path ending in state k – Calculate probability
– Viterbi algorithm allows to detect the most probable path and the probability of this most probable path
akl k i
k v xi
el l i
v ( ) ( )max ( ( 1))
begin Mj
Ij Dj
end
begin Mj Ij Dj
end
-ACA---ATG -TCAACTATC -ACAC--AGC -AGA---ATC -ACCG--ATC
A AC
Calculate Score state 1:
S(1)= a(BM) +e(A) S(2)= a(BI) + e(A) S(3)= a(BD)
-
Maximal score state M:
S(1)= a(BM) +e(A)
S(1)= a(BI) + e(A) + a(IM)+e(C)
HMM
ACAAG
Conclusion
• Distinct methods differ in the motif representation and the scoring system used
• Consensus Sequence or Regular expression (pattern match) – Very conservative
– Do not allow mismatches
• PSSM / HMM: more complicated scoring schemes – based on information content
– Log likelihood – Less conservative
– Difficult choice of threshold score
– Tradeoff between sensitivity and selectivity
Overview
• Introduction
• Motif representation
• Motif screening
• Motif Databases
– Prosite – Blocks – pFAM
• Exercise
Pfam
• Pfam starts from a set of automatically generated domain alignments (generated by PsiBlast).
• From these alignments a HMM is calculated
• Subsequently all sequences in the SwissProt database of proteins are classified in protein families
– By scoring them with the representative HMMs – Ranking sequences according to their score
– separate class members from the other sequences in the database based on a suitable threshold
• Pfam 7.0 is such a database that contains a total of 3360 families. Pfam contains multiple protein alignments and profile-HMMs of these families.
PSI-blast
Automatically generated sequences
PfamB families
Construct seed HMM PfamA
PRODOM alignment
Discover PRODOM alignments not covered by PfamA
Pfam
Pfam
• Full: alignment on which the Pfam HMM was based
• HMMs for global and fragment search
Pfam
Screening an new sequence against Pfam HMMs to classify the novel sequence
Pfam
• Each Pfam family: "trusted cutoff" and a "noise cutoff“
• TC1 is the lowest score for sequences included in the family
• NC1 is the highest score for sequences not included in the Full alignment
The probability that the sequence was generated by the HMM and the probability that the sequence was generated by a null model
E-value is the number of hits that would be expected to have a score equal or better than this by chance alone
•Raw score: bitscore
Scores in Pfam
Pfam
PROSITE
Patterns (regular expressions) (ScanProsite) – Shorter than Pfam
• Enzyme catalytic sites
• Prosthetic group attachment sites (heme, pyridoxal- phosphate, biotin, etc)
• Amino acids involved in binding a metal ion
• Cysteines involved in disulfide bonds
• Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein
PROSITE
• Profiles (Profile representation)
PROSITE
Aminael
renew
BLOCKS
• Database of ungapped alignments
• Motif models represented as PSSMs
Example sequence
>gi|1071819|pir||B54759 ba-type ubiquinol oxidase Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVRGFA DAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVGGAVITM ASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWTSFCANILIVA SFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLFGYSSMVYATVCITVLS YLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTFVIGGMTGVLLAVPPADFVLH NSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPLYILGLMGVTRRLRVFDDPDLRIW FAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPAYNFAFNPISHGLDTWWEMKQQGATRPT GGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTFNYNRDYYIPVSEIEATEDARTRQLAQGV
http://www.expasy.org/prosite/
http://www.sanger.ac.uk/Software/Pfam/search.shtml http://blocks.fhcrc.org/
Scan sequence with prosite, Pfam, Blocks
PSI-BLAST
Overview
Query Sequence Unknown
Blast Sequence to search for close homologs
Search pFAM, Prosite for conserved motifs
You detected homology with an annotated
protein family
Make a multiple sequence alignment Generate profile or HMM
Search database for remote homologs Blast
ClustalW PFAM
PROSITE
HMMer, PSSM
Profile Search
PSI-blast
• exercises
Bits score (log odd score Bayesian)
– Posterior: HMM model: is this a globin domain?
– Likelihood calculated: probability of the sequence being generated by the HMM model
– Prior probability:
• p(model) – Bayes
) ( ) (
) ,
(D M p M D p D
p
) ( M D p
) ( M D
p p (M )
) (
) ( ) ) (
( p D
M p M D D p
M
p
)
( ) (
) ,
(D M p DM p M
p
M R
1 2 3 4