Profile Searches

(1)

Profile Searches

Revised 07/11/06

(2)

Overview

• Introduction

• Motif representation

• Motif screening

• Motif Databases

• Exercise

(3)

Features characteristic for the whole family Multiple sequence alignment

Introduction

How to represent the characteristic features?

Motif model: captures the family characteristic features

• regular expression, weight matrix, HMM profile

(4)

Introduction

Multiple sequence alignment

Construct model

Scan new sequence with the model

Unaligned sequences

• model: captures the family characteristic features

• used to detect remote homologs of a family

(5)

Overview

• Introduction

• Motif representation

– String based representation

• Consensus

• Regular expression

– Probabilistic representation

• PSSM

• HMM

• Profile

• Motif screening

• Motif Databases

• Exercise

(6)

HMM

Construct model

Scan new sequence with the model

I. II.

Unaligned sequences

III.

(7)

Consensus sequence:

– Reductionistic representation of a motif

– Most frequent instance is used as a representative – Loss of information

Regular expression:

– More complex representation allowing motif degeneracy

String Based Representation

Symbol Meaning Origin of designation

G G Guanine

A A Adenine

T T Thymine

C C Cytosine

R G or A puRine

Y T or C pYrimidine

M A or C aMino

K G or T Keto

S G or C Strong interaction (3 H bonds) W A or T Weak interaction (2 H bonds)

H A or C or T not-G, H follows G in the alphabet B G or T or C not-A, B follows A

V G or C or A not-T (not-U), V follows U D G or A or T not-C, D follows C

N G or A or T or C aNy

(8)

CTTAATATTAACTTAAT Consensus

CTTAAKRTTMAYTTAAT Regular expression

String Based Representation

(9)

cell signalsignal

motif Gene 1 Gene 2Gene 3Gene 4

signal

?

translation transcription

mRNA

protein gene

chromosome

DNA motifs

String Based Representation

(10)

Sequences involved in enzymatic reactions (PROSITE)

String Based Representation

(11)

Overview

• Introduction

• Motif representation

– String based representation

• Consensus

• Regular expression

– Probabilistic representation

• PSSM

• HMM

• Profile

• Motif screening

• Motif Databases

• Exercise

(12)

Probabilistic

PSSM

Frequency matrix



















w T T

T

w G G

G

w C C

C

w A A

A

... , 2 , 1

,

... , 2 , 1

,

... , 2 , 1

,

... , 2 , 1

,





G A A T T C A T G T C A C T T C A T T G

















75 . 0 75 . 0 5 . 0 01 . 0 01 . 0

25 . 0 25 . 0 01 . 0 01 . 0 25 . 0

01 . 0 01 . 0 25 . 0 01 . 0 75 . 0

01 . 0 01 . 0 25 . 0 1

01 . 0

T G C A



















75 . 0 75 . 0 5 . 0 0 . 0 0 . 0

25 . 0 25 . 0 0 . 0 0 . 0 25 . 0

0 . 0 0

. 0 25 . 0 0 . 0 75 . 0

0 . 0 0

. 0 25 . 0 1 0

. 0

T G C A



Pseudo Counts

Frequency matrix Alignment

















735 . 0 735 . 0 495 . 0 0097 . 0 0098 . 0

245 . 0 245 . 0 0099 . 0 0097 . 0 245 . 0

0098 . 0 0098 . 0 2475 . 0 0097 . 0 735 . 0

0098 . 0 0098 . 0 2475 . 0 97 . 0 0098 . 0

T G C A



(13)

















735 . 0 735 . 0 495 . 0 0097 . 0 0098 . 0

245 . 0 245 . 0 0099 . 0 0097 . 0 245 . 0

0098 . 0 0098 . 0 2475 . 0 0097 . 0 735 . 0

0098 . 0 0098 . 0 2475 . 0 97 . 0 0098 . 0

T G C A





















5558 . 1 5558 . 1 985 . 0 68 . 4 673 . 4

0291 . 0 0291 . 0 65 . 4 68 . 4 0291 . 0

673 . 4 673 . 4 014 . 0 68 . 4 5558 . 1

673 . 4 673 . 4 014 . 0 95 . 1 673 . 4

T G C A



Probabilistic

PSSM

G A A T T C A T G T C A C T T C A T T G

Convert into PSSM Alignment

PSSM p(A)=p(C)=p(G)=p(T)=0.25

25) . 0

01 . (0 log₂

Motif logo

(14)

PSSM

msa

Regular expression

Weight matrix

Motif logo

(15)

Motif Representation

CTTAATATTAACTTAAT Consensus

CTTAAKRTTMAYTTAAT Regular expression

PSSM (motif logo)

(16)

Definition HMM

• State sequence path p:

– Probability of a state depends only on the previous state

1 ) ( i l i k kl P

a      

– Transition probability from state l to state k

– emission probability: probability that symbol b is seen when in state k

) (

)

(b P xi b i k

ek    

a_kl e_k(b)

State l State k

A HIDDEN Markov model: it is not possible to tell what state the system is in by looking at the corresponding symbol

Finding the possible paths = decoding

begin M_j

I_j D_j

end

HMM

(17)

HMM

Probabilistic model that represents the alignment of the family – Gapped multiple alignment

– Distinct states separated by transition probabilities (i.e. the probability of moving from one state to the next)

– The current stateis only dependent on the previous state (first order Markov process)

– The sequence of states followed in the model is called the path 

– Each state has the probability of emitting a certain symbol of the alphabet

(A,C,T,G for DNA) or one of the 20 amino acids for proteins: emission probability

(18)

• HMM can model any possible sequence

• It defines a probability distribution over the whole space of sequences

• Training a HMM: search for the parametrisation that makes this distribution peak around members of the family

• Parametrisation

– Determine model structure

• Length of alignment

• Number of insert states

– Determine the probability parameters

HMM

(19)

HMM

Training a HMM

– Determine structure of the model

– Determine emission and transition probabilities

E.g. the first column: e₁(A) = 4/5; e₁(T) = 1/5; e₁(C) = 0; e₁(G) = 0;

E.g. the second column: e₂(A) = 0; e₂(T) = 0; e₂(C) = 4/5; e₂(G) = 1/5;

E.g. the third column: e₃(A) = 4/5; e₃(T) = 0; e₃(C) = 1/5; e₃(G) = 0;

ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC

A 0.8 C PC G PC T 0.2

A PC C 0.8 G 0.2 T PC

A 0.8 C G T 0.2

A 1 C G

1 1 0.4 T

A 0.2 C 0.4 G 0.2 T 0.2

0.6

0.6 0.4

(20)

Profile representation

• Suppose I (amino acid b) is the ancestor

• What is the probability of observing a T (amino acid a) in the first column (position p) of the alignment

• This probability is reflected by the score M M(p,a)= W(p,b) X Y(a,b)

M(1,I)= W(1,T) X Y(I,T)

M is dependent on

• The observed frequency of T in the first position of the alignment (W)

• The probability of mutating I => T (according to PAM) (Y)

I A … I S T T V A V A I L T V T V I A I V

b

(21)

Profile representation

gaps

(22)

Overview

• Introduction

• Motif representation

• Motif screening

• Motif Databases

• Exercise

(23)

HMM

Construct profile HMM

Scan new sequence with the profile

I. II.

Unaligned sequences

III.

(24)

Screening

(25)

Screening

• The multiple alignment of the family is known (Clustal W)

• The motif to be detected is known but the multiple alignment does not yet exist

– Motifs already described in literature

– Construct the multiple alignment, derive the model

• Neither the motif nor the multiple alignment exist

– Probabilistic motif detection

(26)

• Obtained Motif Model used for genome wide screening (Motif Scanner)

• Identification of putative additional targets

• Use sliding window

• Attribute to each sequence within the sliding window a score

• Rank the hits based on their score and select the most promising candidates

Genome wide screening

Screening

(27)

Screening

• Distinct methods differ in the motif representation and the scoring system used

• Consensus Sequence or Regular expression (pattern match) – Very conservative

– Do not allow mismatches

• PSSM / HMM: more complicated scoring schemes – based on information content

– Log likelihood

– Less conservative

– Difficult choice of threshold score

– Tradeoff between sensitivity and selectivity

(28)

Screening

• FDR (1-Precision) FP/(TP+FP)

• Precision TP/(TP+FP)

• Specificity (related to the false positive fraction= 1-spec) TN/(TN+FP)

• Sensitivity (true positive fraction = recall) TP/(TP+FN)

(29)

Screening

• E- value: corresponds to the probability of

finding a score equal or better than the one

observed, by chance alone.

(30)

Screening with Regular Expression

Simple perl scripts

(31)

Screening with PSSM

i i

x W ix

i

p

score q

odds 





1

log

2

log















70 , 0 05 , 0 01 , 0 50 , 0 90 , 0 10 , 0

10 , 0 30 , 0 01 , 0 80 , 0 02 , 0 10 , 0

10 , 0 05 , 0 97 , 0 05 , 0 06 , 0 20 , 0

10 , 0 60 , 0 01 , 0 10 , 0 02 , 0 60 , 0 PSSM

Background frequency of each of the four nucleotides:

 0 , 25

xi

p

Sequentie Odds Log odds Opmerking ATGCAT 720,8829 9,49362 consensus

CTGCGT 120,1471 6,90866 Similar sequence CATGGC 0,0002 -11,99046 Different sequence

• Slide a window of length W over a sequence

• Calculate for each subsequence within the window a log odds-score

• The highest scoring positions correspond to the most likely locations of the motif

9.4 = log2(720)

(0.6*0.9*0.8*0.97*0.6*0.7)/(0.25^6)

(32)

Screening with HMM

• Belongs a sequence to a family of proteins?

• Scoring a sequence with a HMM

– aligning the sequence to the HMM

– finding the hidden path that generates the sequence

• A sequence can be generated by different paths

• Enumerate all paths and calculate for each path the probability that is generates the sequence

• Viterbi Path: most likely path

• Total probability that sequence is generated by HMM = sum of probabilities of all possible paths

(33)

Screening with HMM

)

| ,

( ATCAGT m ₁m ₂ d ₃ m ₄ m ₅i₅ m ₆ HMM P

0005772 , 0

95 , 0 70 , 0 95 , 0 25 , 0 20 , 0 60 , 0 85 , 0 97 , 0 80 , 0 10, 0 90 , 0 90 , 0 60 , 0 95 , 0

)

| (

).

| ( ).

| (

).

| ( ).

| (

6 6

5 6 5

5 5 5

4 5

4 3

4 2

3 2

1 2 1

1







m END

T m T P i m T i G P m i T m A P m m T

m C P d m T m d T m T P m m T m A P BEGIN

m T

)

| ,

(

log P ATCAGT m ₁ m ₂ d ₃ m ₄ m ₅i₅ m ₆ HMM



7587 , 10

95 , 0 log 70

, 0 log 95

, 0 log 25 , 0 log ....

90 , 0 log 60 , 0 log 95 , 0 log

)

| (

log )

| ( log )

| ( log ...

)....

| ( log )

| ( log

6 6

5 6 5

2 1

1











m END

T m

T P i

m T i

G P

m T P m

m T m

A P BEGIN

m T

Example for 1 path

ATCAGT

(34)

Screening with HMM

• Calculate the probability of the sequence being generated by the HMM profile of a protein family versus a random model

= align the unknown sequence with the HMM

– The sequence can be generated by different paths

• Impossible to enumerate all possibilities

– What is the most probable path? (Viterbi, backtracking) – What is the total probability? (Forward)

) (

) mod 2 (

log P data Random el data

P

Bits score

A T

A -

- T ATT

and

TTC

(35)

Screening with HMM

• Hidden Markov model because if we observe a sequence, the path of states that was followed by the Markov model to generate the observed sequence is unknown or hidden.

• This hidden path contains the information on how the observed sequence should be aligned with the profile.

• Usually a sequence can be generated in multiple ways by the Markov model and more hidden paths (corresponding to distinct alignments) are possible. Usually not all possible paths have an equal probability. Indeed some transitions are not very likely (low transition probability). Usually the path with the highest probability (highest score = most likely path) corresponds to the best alignment.

(36)

Screening with HMM

• Detecting the underlying sequence of states allows to uncover the most probable path of transitions (decoding)

– VITERBI Algorithm: most probable path (backtracking)

• Start at first position (state k)

• Move to next most probable state l

– V_k(i) is the probability of the most probable path ending in state k – Calculate probability

– Viterbi algorithm allows to detect the most probable path and the probability of this most probable path

akl k i

k v xi

el l i

v ( ) ( )max ( ( 1))

begin M_j

I_j D_j

end

(37)

(38)

begin M_j I_j D_j

end

-ACA---ATG -TCAACTATC -ACAC--AGC -AGA---ATC -ACCG--ATC

A AC

Calculate Score state 1:

S(1)= a(BM) +e(A) S(2)= a(BI) + e(A) S(3)= a(BD)

-

Maximal score state M:

S(1)= a(BM) +e(A)

S(1)= a(BI) + e(A) + a(IM)+e(C)

HMM

ACAAG

(39)

Conclusion

• Distinct methods differ in the motif representation and the scoring system used

• Consensus Sequence or Regular expression (pattern match) – Very conservative

– Do not allow mismatches

• PSSM / HMM: more complicated scoring schemes – based on information content

– Log likelihood – Less conservative

– Difficult choice of threshold score

– Tradeoff between sensitivity and selectivity

(40)

Overview

• Introduction

• Motif representation

• Motif screening

• Motif Databases

– Prosite – Blocks – pFAM

• Exercise

(41)

Pfam

• Pfam starts from a set of automatically generated domain alignments (generated by PsiBlast).

• From these alignments a HMM is calculated

• Subsequently all sequences in the SwissProt database of proteins are classified in protein families

– By scoring them with the representative HMMs – Ranking sequences according to their score

– separate class members from the other sequences in the database based on a suitable threshold

• Pfam 7.0 is such a database that contains a total of 3360 families. Pfam contains multiple protein alignments and profile-HMMs of these families.

PSI-blast

Automatically generated sequences

PfamB families

Construct seed HMM PfamA

PRODOM alignment

Discover PRODOM alignments not covered by PfamA

(42)

Pfam

(43)

Pfam

• Full: alignment on which the Pfam HMM was based

• HMMs for global and fragment search

(44)

Pfam

Screening an new sequence against Pfam HMMs to classify the novel sequence

(45)

Pfam

• Each Pfam family: "trusted cutoff" and a "noise cutoff“

• TC1 is the lowest score for sequences included in the family

• NC1 is the highest score for sequences not included in the Full alignment

The probability that the sequence was generated by the HMM and the probability that the sequence was generated by a null model

E-value is the number of hits that would be expected to have a score equal or better than this by chance alone

•Raw score: bitscore

Scores in Pfam

(46)

Pfam

(47)

PROSITE

Patterns (regular expressions) (ScanProsite) – Shorter than Pfam

• Enzyme catalytic sites

• Prosthetic group attachment sites (heme, pyridoxal- phosphate, biotin, etc)

• Amino acids involved in binding a metal ion

• Cysteines involved in disulfide bonds

• Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein

(48)

PROSITE

• Profiles (Profile representation)

(49)

PROSITE

Aminael

renew

(50)

BLOCKS

• Database of ungapped alignments

• Motif models represented as PSSMs

(51)

Example sequence

>gi|1071819|pir||B54759 ba-type ubiquinol oxidase Paracoccus denitrificans

MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVRGFA DAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVGGAVITM ASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWTSFCANILIVA SFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLFGYSSMVYATVCITVLS YLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTFVIGGMTGVLLAVPPADFVLH NSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPLYILGLMGVTRRLRVFDDPDLRIW FAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPAYNFAFNPISHGLDTWWEMKQQGATRPT GGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTFNYNRDYYIPVSEIEATEDARTRQLAQGV

http://www.expasy.org/prosite/

http://www.sanger.ac.uk/Software/Pfam/search.shtml http://blocks.fhcrc.org/

Scan sequence with prosite, Pfam, Blocks

(52)

PSI-BLAST

(53)

Overview

Query Sequence Unknown

Blast Sequence to search for close homologs

Search pFAM, Prosite for conserved motifs

You detected homology with an annotated

protein family

Make a multiple sequence alignment Generate profile or HMM

Search database for remote homologs Blast

ClustalW PFAM

PROSITE

HMMer, PSSM

Profile Search

PSI-blast

(54)

(55)

• exercises

(56)

Bits score (log odd score Bayesian)

– Posterior: HMM model: is this a globin domain?

– Likelihood calculated: probability of the sequence being generated by the HMM model

– Prior probability:

• p(model) – Bayes

) ( ) (

) ,

(D M p M D p D

p 

) ( M D p

) ( M D

p p (M )

) (

) ( ) ) (

( p D

M p M D D p

M

p 

) 

( ) (

) ,

(D M p DM p M

p 

M R

1 2 3 4