4 MOTIF REPRESENTATION AND SCREENING

(1)

4 MOTIF REPRESENTATION AND SCREENING

4 MOTIF REPRESENTATION AND SCREENING...1

4.1. Introduction...2

4.2. Input...3

4.3. Motif representations...4

4.3.1 String based approach...4

4.3.2 Probabilistic approaches...5

4.4. Scoring databases with motif models...14

4.4.1 Scoring a sequence database with a string based motif model...15

4.4.2 Scoring a sequence database with a PSSM...15

4.4.3 Scoring a sequence database with a profile representation...16

4.4.4 Scoring a sequence database with a HMM model...16

4.5. Assessing the relevance of an alignment when performing high throughput screenings. .20 4.5.1 Sensitivity versus specificity...20

4.5.3 Extreme value distribution...21

4.6. Iterative searches...21

4.7. Databases of motif models...21

4.7.1 Pfam, the proteins family database...22

4.7.2 Prosite...24

4.7.3 BLOCKS proteins family database...26

Adapted 7/11/2006

(2)

4.1. Introduction

As shown previously a multiple alignment captures much more information on a protein family than does a pairwise alignment. When searching a database for remote homologs of a protein, pairwise search programs such as Blast and FastA might fail, if the degree of overall conservation between the homologous sequences has become too low.

Usually, only the functionally important residues or the residues important for the 3D structure of the proteins are still conserved. In these cases, the proteins are, despite their low overall similarity still true homologs because they share the same essential residues, characteristic for that protein family.

This problem is illustrated by the globin family. The globin polypeptide chains from organisms as diverse as humans, insects, and plants are folded in the same general three-dimensional pattern, yet there are only two positions within the some 150 residues of the chain that contain the same amino acid in all globins. That is, this "globin fold" is encoded in many different amino acid sequences, some differing from others in as many as 130 positions. Any single globin sequence represents just one realization of the globin fold (article Gribskov).

The overall degree of similarity becomes too low to detect members of this family by using classical database search tools such as Blast, FastA.

However, when one uses the variation that is present in a multiple alignment of a set of related sequences when searching a database, members of the protein (DNA) family that match combinations of sequence characters in the multiple alignment can also be found (whereas only the combinations found in the original sequence would be matched I single query sequences were to be used).

The first question that will be solved in this chapter is how to summarize the sequence variation that is present in a multiple alignment in a model that can be used for screening a sequence database.

Approaches based on this principle are called model based (also called profile or pattern based methods) methods. Indeed, based on a multiple alignment of representative members of the protein (DNA) family, one can build a representation that summarizes only the residues that are characteristic for that family of proteins (or DNA). These conserved residues constitute the motif or domain, characteristic for the protein or DNA family. In this chapter, we will show how such motif or domain (the set of conserved residues of the multiple alignment of the protein or DNA family) can be represented by a regular expression (consensus) or a probabilistic representation (HMM, profile, weight matrix) and how these representations can be constructed from a known multiple alignment. Subsequently in a second part, we will show how these motif representations can be used to screen sequence databases in order to also retrieve the remote homologs (this in contrast to the database searching methods which are based on a single query sequence). In a third part we discuss different scores to assess the significance of the hits obtained by these screening procedures.

The power of model based methods can be further enhanced through iteration of the search procedure. After a motif model is used to screen a sequence database, new similar members of the protein (DNA) family can be detected. A new multiple alignment, which includes these new sequences, can be constructed, a new motif model abstracted, and a new database search performed.

The procedure can be iterated as often as desired or until convergence, when no new statistically

significant sequences are detected.

(3)

This will be discussed in a fourth part where we describe PSI-Blast as an example of an iterative motif model based search algorithm. In a last part we describe some databases (Pfam, PROSITE, TRANFAC) that contain motif representations of known protein families.

Motif representations can also be constructed from a set of unaligned sequences. This requires local multiple alignment procedures (also called pattern search algorithms). We will discuss some of them in the next chapter on motif detection.

4.2. Input

The standard procedure of constructing a model based representation of a motif starts with a set of sequences whose membership in the protein- or domain-family under study is well established. The sequences are aligned using any of the available multiple-alignment methods (such as, for instance, ClustalW). It has proved to be very advantageous to include information on the 3D-structure of one or more family members in the alignment process, if it is available. In many cases the alignment can be significantly improved by manual editing. The manually verified multiple alignment is used as input to construct a motif model. Two issues are important to keep in mind when constructing this

“set of trusted sequences”:

 It is important not to include sequences with doubtful relationship to the family under consideration since even a single inappropriate sequence can severely degrade the reliability of the constructed motif model.

 The sample of “trusted sequences” should be representative for all members of the family and is should be balanced (they should all more or less have the same phylogenetic distance to each other). If this is not the case (because not enough biological knowledge is available to construct a balanced and representative sample)

o To compensate for an unbalanced sampling: Sequences might have to be weighted before they can be used to construct a motif model. Indeed, if several sequences in the multiple alignment are similar, the multiple alignment and the motif model derived from it will be biased in favor of those sequences. This effect is particularly pronounced if the initial set of trusted sequences contains both unique sequences and multiple members from closely related sequence families.

o Pseudocounts might have to be added to compensate for an unrepresentative sampling (see below).

Multiple sequence alignment

Construct model

Scan new sequence with the model

Unaligned sequences

Multiple sequence alignment

Construct model

Scan new sequence with the model

Unaligned

sequences

(4)

4.3. Motif representations

Two different ways of representing a motif can be considered: a string-based approach and a probabilistic approach (PSSM, profile or HHM).

4.3.1 String based approach

4.3.1.1 Consensus sequence

A motif consensus is a reductionistic representation of a motif where the most frequent instance of the motif is used as a representation. The representation is simple, but results in a considerable loss of information that was originally present in the multiple alignment. Such simple representation does not allow for degeneracies of the motif (presence of mismatches).

Example: Pribnowbox: TATAAT 4.3.1.2 Regular expression

As compared to the consensus sequence regular expressions allow a more flexible representation of the motifs. Degeneracy at a certain position in the motif is represented by conventional characters (IUPAC code).: www.chem.qmw.ac.uk/iupac/). A regular expression represents a family of consensus sequences each occurring with a similar probability.

Symbol Meaning Origin of designation

G G Guanine

A A Adenine

T T Thymine

C C Cytosine

R G or A puRine

Y T or C pYrimidine

M A or C aMino

K G or T Keto

S G or C Strong interaction (3 H bonds) W A or T Weak interaction (2 H bonds)

H A or C or T not-G, H follows G in the alphabet B G or T or C not-A, B follows A

V G or C or A not-T (not-U), V follows U

D G or A or T not-C, D follows C

N G or A or T or C aNy

Despite the more accurate representation of a regular expression, still part of the information is lost i.e. the probability with which a certain nucleotide can occur at a certain position. Regular expressions are frequently used for DNA sequences but also for protein sequences. PROSITE (http://us.expasy.org/prosite/) is a database of protein families and domains (see below for details).

Motifs in Prosite are represented as regular expressions (also called patterns following the Prosite nomenclature).

Example of a PROSITE pattern:

(5)

4.3.2 Probabilistic approaches

Probabilistic representations provide a more subtle description of the motif than a regular expression.

4.3.2.1 Position specific scoring matrix(PSSM)

A motif can also be described probabilistically by a weight (or frequency) matrix or PSSM (position specific scoring matrix). A PSSM is derived from a frequency matrix or a weight matrix. In a weight matrix the columns represent the motif positions while the rows describe the probability of occurrence of the respective nucleotides.

For each motif position i the weight matrix describes the relative frequency q

ij

that a certain nucleotide x

i

is observed. This frequency equals the number of observations of a certain nucleotide in the column of the alignment divided by the number of aligned sequences.

 





 







WT T

T

WG G

G

WC C

C

WA A

A

q q

q

q q

q

q q

q

q q

q atrix frequencym

...

2 1

Note that local alignment tools such as MEME, Gibbs Sampling use the weight matrices as motif representations (see next chapter).

Weight matrices can be converted into PSSM (position specific scoring matrices) by simple log transformation (see below).

Example of a weight or frequency matrix:

For a motif of length n using an alphabet of c characters, a frequency matrix A is a c by n matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the motif.

... A C G T G ...

... A C C T G ...

... T C C C G ...

... A C C C T ...

... A C G C T ...

... T C G T G ...

... A C G T G ...

... A G G T G ...

... A C T T G ...

... A C G T G ...

The frequency model (a 10 by 5 matrix) for this motif is then:

A 0.8 0.0 0.0 0.0 0.0

C 0.0 0.9 0.3 0.3 0.0

(6)

G 0.0 0.1 0.6 0.0 0.8 T 0.2 0.0 0.1 0.7 0.2

The quality of the probability matrix depends to a large extent on the actual sample of the protein family on which the multiple alignment from which the probability matrix was derived was based.

If the number of sequences with the found motif is large and reasonably diverse, the sequences represent a good statistical sampling of all sequences that are ever likely to be found with that same motif. If a given column in 20 sequences has only L it is not very likely that a different amino acid will be found in other sequences with that motif because the residue is probably important for the function. In contrast, another column in the motif from the 20 sequences may have several amino acids and some amino acids may not be represented at all. Even more variation may be expected at that position in other sequences although the more abundant amino acids already found in that column probably would be favored. Thus if a good sampling of sequences is available the number of sequences is sufficiently large and the motif structure is not too complex, it should, in principle be possible to obtain frequencies highly representative of the same motif in other sequences also.

However, the number of sequences for producing the motif may be small and the sequences might not be representative for the whole family. In this case, the obtained frequencies in each column might not be representative for all amino acids that can be expected to occur in that column. In that case the estimates of amino acid frequencies are improved by adding pseudocounts. Knowing how much count to add is difficult. The contribution of the pseudocount should decrease as the information of the estimated frequencies increases. Ie the pseudocount will become dominant if the number of sequences from which the probability matrix was derived is low and non representative.

The higher the impact of the pseudocounts, the less informative the probability matrix is. Because values of 0 can not be converted by log transformation, taking pseudocounts overcomes this problem as well.

Example of adding pseudocounts

Once the pseudocounts are added, a frequency matrix can be converted into a PSSM.

Matrix values in a PSSM are log odds scores obtained by dividing the counts of the residue in the

alignment by the expected (random) number of counts based on sequence composition, and

converting the ratio to a log score (similar to the approach followed in substitution probability

matrices). Each entry is calculated by taking for each column the logarithm in base 2 (bit units) of

the total of the real counts plus the pseudocounts for each amino acid (nucleotide) divided by the

probability of that amino acid (nucleotide).

(7)

Suppose that the random frequency (B) of finding a certain base at a position is 0.25 (assuming equal frequencies), then the 'log likelihood' weight (scoring) matrix for the specific motif shown above becomes:

Normalized matrix

 





 









735 . 0 735 . 0 495 . 0 0097 .

0 0098 .

0 245 . 0 245 . 0 0099 .

0 0097 .

0 245 . 0

0098 .

0 0098 .

0 2475 .

0 0097 .

0 735 . 0

0098 .

0 0098 .

0 2475 .

0 97 . 0 0098 .

0 T G C A



After the log transformation, a PSSM is obtained: Log2(normalized matrix/ background)

 





 











5558 .

1 5558

. 1 985

. 0 68 . 4 673

. 4

0291 .

0 0291

. 0 65

. 4 68

. 4 0291

. 0

673 . 4 673

. 4 014

. 0 68

. 4 5558

. 1

673 . 4 673

. 4 014

. 0 95

. 1 673

. 4

T G C A



The quality and quantity of the information provided by the PSSM varies with the column and is

represented by motif logo's. A motif logo illustrates the amount of information in each column of

the motif (derived from the PSSM). The X-axis represents the motif positions (corresponding to the

columns of the PSSM matrix). The height of each column describes the decrease in uncertainty

provided by the information in that column of the PSSM. The higher the column the more useful

that position is for finding matches in sequences. In each column are the symbols of the nucleotides

found at that position in the motif with a height proportional to the frequency of that nucleotide in

the column. Tools to generate a sequence logo from a multiple alignment are at this site

http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi.

(8)

msa

Regular expression

Weight matrix

Motif logo msa

Regular expression

Weight matrix

Motif logo

Different ways of representing a motif. Aligned subsequences are displayed at the top. Based on this alignment a regular expression can be built. The highlighted positions indicate degenerate nucleotides where “n” indicates a random nucleotide,

“s” indicates a C or a G and “w” indicates an A or a T. Another representation is the weigth matrix) that contains the frequency of a particular nucleotide at a specific position. Only the frequencies for the first position and the highlighted positions are displayed. Based on this matrix, a motif logo can be constructed, giving a visual representation of the binding site.

PSSM's or weigth matrices can be used to screen other sequences or the complete genome of an organism of interest using the appropriate scoring algorithms (see below, for proteins: Mast, PSI- BLAST, Block Searcher, for DNA sequences: MotifScanner)

Databases that use PSSM or frequency matrices as motif representation are, for instance, for proteins the BLOCKS database http://blocks.fhcrc.org/ (see also below) for DNA motifs (Transfac, RegulonDB).

4.3.2.2 Profile representation

A conserved part of a multiple sequence alignment (msa) is converted into a type of scoring matrix which is called a profile. A profile includes scores for amino acid substitutions, and gaps in each column of the conserved region.

The entries in the matrix are derived from the log odds substitution matrix that was used to produce the alignment:

The value of the profile for amino acid a at position p is M(p,a) = l=i W(p,b)

X

Y(a,b),

 where Y(a,b) is the entry in the log odds substitution matrix ( Dayhoff s PAM matrix)

 W(p, b) is a weight for the appearance of amino acid b at position p. This weight is determined as follows: Suppose that amino acid b appears n(b,p) times in position p in the NR input sequences. Then a simple average weight is given by W(b,p) = n(b,p)/NR.

For example, if column 1 in the msa has 5 I, 3 T and 2V then the frequency of each amino acid in the column is 0.5 I =W(I,1), 0.3 T =W(T,1)and 0.2 V= W(V,1).

To calculate the entry of I in the profile matrix: it supposes that all observed amino acids (I, T, V) in the msa arose from I in the ancestor. Using the entries of I-I, I-T and I-V in the Dayhoff PAM250 would be 5, 0 and 4 (expressing the mutation probability of an I into one of the observed values) and weighting them by the observed frequencies of each of the amino acids results in the following entry for I in the first column of the profile matrix would be:

0.5X5+0.3X0+0.2X4=3.3.

(9)

This is repeated for all amino acids, i.e, each of the 20 amino acids is once considered as the ancestor of the pattern of each column.

The profile also allows specifying position-dependent penalties for insertions and deletions.

Insertions and deletions in families of aligned homologous sequences tend to occur more frequently in regions between segments of regular secondary structure than within them. This is encoded in the 21st column of each row of the profile by a penalty for insertion/deletion of the corresponding probe residue. This penalty can be set high to prevent insertions inside known regular secondary structural elements and set low to allow insertions in regions where insertions are observed in the probe.

The profile is 22 columns wide, one column for each of the 20 amino acids, and one column for the gap penalties (sometimes two if gap opening and gap extension penalties are taken into account).

Rows correspond to columns in the msa (or positions in the motif). The consensus sequence, derived from the most common amino acid in each column is listed in the left hand column.

Once a profile table has been generated, it may be used in database searches for additional sequences with the same motif. A protein database that uses this motif representation is the PROSITE database http://ca.expasy.org/prosite/.

4.3.2.3 Hidden Markov models

HMM are not only used for sequence alignment but have a wide range of bioinformatics applications:

finding homologous sequences

(10)

producing multiple alignments

producing motif model (a motif or domain HMM)

analysis of sequence composition and patterns (cpG islands)

location of genes by predicting open reading frames (Gene prediction) prediction of structural features in protein sequences

Because of their prevalence in bioinformatics we will discuss them in a more general detail.

4.3.2.3.1 Definition

Formally a hidden Markov model is a probabilistic model. An HMM can be viewed as a generative model that generates or emits sequences ( any sequence of …) based on a particular sequence of states and symbols. A distinction is made between the sequence of symbols (letters of the alphabet) and the sequence of states (match, insertion,…). Suppose we have access to a validated multiple alignment. The Markov model then captures the statistical relevant information in the alignment.

How does it do so?

Suppose we have an ungapped alignment without insertions and without deletions. A single probabilistic model that represents the complete alignment of the family is given below:

Beg e _k (b ) Mj end

State l State k

It consists of distinct states (in this case only M (match) states) because for simplicity no insertions or deletions were allowed. The distinct states are separated by transition probabilities (i.e. the probability of moving from one state to the next.). The basic assumption of a Markov process states the current state is only dependent on the previous state (first order Markov process). The sequence of states followed in the model is called the path . In this case only one path can be followed.

Each state has the probability of emitting a certain symbol of the alphabet (A,C,T,G for DNA) or one of the 20 amino acids for proteins. This is called the emission probability.

Transition probability: probability of going to state l if previous state was k 1 )

( k

i l P i

a kl      

Emission probability: probability that the symbol b is seen when in state k

) (

)

( b P x i b i k e k    

In this simple case the transition probabilities are equal to 1 because there is at each step only one transition possible. The HMM shown here is an alternative representation of the PSSM and the emission probabilities are calculated in exactly the same way.

Lets now make things a bit more complicated and introduce gaps (insertions and deletions) in the model.

Below is represented the Markov model for a gapped multiple alignment in its graphical

representation. To handle insertions i.e. portions of the sequence that do not match anything in the

model a new set of states is introduced I

i

(represented by diamonds), where I

i

will be used to model

insertions after the residue matching the i

th

column of the multiple alignment. the I

i

have an emission

(11)

distribution eI

i

(a) that normally is set equal to the background distribution q

a

. We also need transitions from M

i

to I

i

, a loop transition from I

i

to itself and a transition back from I

i

to M

i+1

. To model the gaps, silent states D

j

are introduced (circles). These silent states do not emit residues, a sequence of silent states allows to move from one match state to any other later one in the alignment. The cost of a deletion will then be the sum of the M->D transition, followed by a number D->D transitions and a final D->M transition. This is exactly the same as for the I states but remark that the path for the D states is different. D->D transitions have different transition probabilities while the I->I transitions always have the same transition probability.

Figuur 1.2: Algemene vorm van een HMM. De letters m duiden op een matchtoestand, i op een insertietoestand en d op een deletietoestand. De verbindingslijnen geven de toegelaten transities toe. Aan elke verbindingslijn is een bepaalde transitiewaarschijnlijkheid verbonden maar voor de duidelijkheid worden deze niet weergegeven: De match- en insertietoestanden bezitten emissieprobabiliteiten voor de vier nucleotiden. Enkel de probabiliteiten van de matchtoestand zijn weergegeven in bovenstaande figuur. De niet-weergegeven emissieprobabiliteiten van de insertietoestanden worden vaak gelijk gesteld aan de achtergrondfrequenties.

(12)

The HMM is a generative model because a sequence is generated by following a path  through the model.

We call it a Hidden Markov model because if we observe a sequence, the path of states that was followed by the Markov model to generate the observed sequence is unknown or hidden. This hidden path is often the thing we are more interested because it contains the information on how the observed sequence should be aligned with the model. Usually a sequence can be generated in multiple ways by the Markov model and more hidden paths (corresponding to distinct alignments) are possible. Usually not all possible paths have an equal probability. Indeed some transitions are not very likely (low transition probability). Usually the path with the highest probability (highest score = most likely path) corresponds to the best alignment.

4.3.2.3.2 Estimation of the parameters of a Markov model

Theoretically the hidden markov models as described previously can model any possible sequence of residues. It therefore defines a probability distribution over the whole space of sequences. The aim of the parametrisation is to make this distribution peak around members of the family.

Searching for the appropriate parametrisation is done be training the Markov model on the given sequence alignment i.e. a reliable multiple alignment of members of the family is used as a traniningsset to determine the parameters of the HMM.

Multiple sequence alignment

Construct profile HMM

Scan new sequence with the profile (HMM)

Complete databases of HMM representing profiles characteristic for protein families are constructed (Pfam)

What is the chance that a sequence is generated by the HMM

I. II.

Unaligned sequences

ClustalW HMM

III. The parametrisation involves at first:

1) determining the structure of the model (see graphic representation)

This comes down to choosing the length of the model i.e. choosing which columns of the multiple alignment should be treated as match states and which as insert states. EM based algorithms (Baum Welch) are used for this purpose (see multiple sequence alignment using HMM) but for the time being we use the simple heuristic: Columns that are more than half gap characters should be modeled as inserts.

2) the probability parameters (emission and transition)

Because the alignments are given the transition and emission probabilities can just be obtained by

counting up the numbers an emission or a transition is used (which can be shown to be the

maximum likelihood estimation).

(13)

 

b' (b') E k

k (b) (b) E

e k  

' ' l A A kl kl a kl

In the following is given an example of how an HMM can be derived from the given alignment.

The first 3 columns are match columns and the transition probability for going from 1 M state to the next M state is 1. In each such match column the emission probabilities can be calculated by counting the frequencies with which each nucleotide is observed.

E.g. the first column: e

1

(A) = 4/5; e

1

(T) = 1/5; e

1

(C) = 0; e

1

(G) = 0;

E.g. the second column: e

2

(A) = 0; e

2

(T) = 0; e

2

(C) = 4/5; e

2

(G) = 1/5;

E.g. the third column: e

3

(A) = 4/5; e

3

(T) = 0; e

3

(C) = 1/5; e

3

(G) = 0;

From the third M column: in 3/5 sequences there is a M->I transition: a3I = 0.6 In 2/5 sequences there is a M->M transition a34 = 0.4

The calculation of the I->I transition is as follows: from the 5 insert states, 2 (blue) I->I transitions occur and 3 I->M transitions corresponding to probabilities of 0.4 and 0.6 respectively.

Remark that in the example the multiple alignment is already given and the HMM model is derived from the alignment i.e. the state sequence is known from the multiple alignment. A HMM can also be used for constructing the multiple alignment (estimation of the parameters when paths are unknown). In this case an iterative procedure based on Viterbi training or on EM can be used (see motif detection).

4.3.2.4 Adding pseudocounts

An often observed problem with the construction of HMM is that especially if the trainingsset is small (few columns in the multiple alignment), some transitions and emissions may not be seen in the trainingsset and the corresponding emission and transition probabilities would acquire zero (see example above). This would imply that they would never be allowed in the future and that when the HMM model is used for scoring a database for instance members of the family containing such transitions would be missed. To avoid this problem of overfitting and to enhance the generalization properties of the Markov model often pseudocounts are added. Choosing pseudocounts is a topic on

ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC

A 0.8 C PC G PC T 0.2

A PC C 0.8 G 0.2 T PC

A 0.8 C G T 0.2

A 1 C G

1 1 0.4 T

A 0.2 C 0.4 G 0.2 T 0.2 0.6

0.6

0.4

(14)

itself. In any case the lower the number of sequences in the alignment the more pronounced the influence of the pseudocounts will be.

Databases that use the HMM representation of a motif are for instance, for proteins the PfAM database http://www.sanger.ac.uk/Software/Pfam/search.shtml

4.4. Scoring databases with motif models

One a motif model (either a regular expression, a position specific scoring matrix, a HMM, a motif profile) is compiled for a class of protein or DNA sequences, this model can be used to screen novel genomes or sequence databases to retrieve other, novel instances of the motif. The presence of a motif in a sequence with unknown function can help in annotating this novel sequence.

1. Motif model construction. Motifs can be DNA or protein motifs as already explained before. A Before screening can be performed, a curated motif model needs to be compiled from a multiple alignment. Different approaches can be considered:

o The multiple alignment of the family is known

 It was for instance generated by a multiple alignment program such asCclustalW.

This multiple alignment can directly be converted into one of the models described above.

o The motif to be detected is known but the multiple alignment does not yet exist

 Motifs already described in literature can either be represented by regular expression (string). From these described motif instances a multiple alignment is constructed (for instance, by Motifs, see further) which can subsequently be converted in a motif model as described above.

o Neither the motif nor the multiple alignment exist

 The use of a machine learning approach is needed to detect novel motifs, not yet described in literature. Most of the probabilistic motif detection procedures (based on expectation maximization or Gibbs sampling) represent the motifs as PSSM or frequency matrices (see also chapter motif detection).

Subsequently novel sequences of interest are scanned for the presence of the motif. Every possible sequence position is scored by sliding a window along the sequence, one position at the time.

Sequence matches with highest score are the best matches. For the probabilistic motif models, the statistical relevance of the matches can be assessed by a score. How this scoring is performed for the screening with each of the motif models is shown below. How, based on these scoring schemes the statistical significance of the hits is assessed is also outlined below.

4.4.1 Scoring a sequence database with a string based motif model

A database of sequences can be searched with a regular expression. These string based methods are often simple perl scripts (comparable to find and replace in word).

Example for DNA sequences: RSA tools http://rsat.ulb.ac.be/rsat/

4.4.2 Scoring a sequence database with a PSSM

Example of calculating a score when screening a sequence with a PSSM

(15)

Suppose based on a local multiple alignment a PSSM with length W was constructed to model a certain regulatory binding site (the length of the motif equals the number of columns in the motif model).

A novel sequence S consisting of the nucleotides x

1

, x

2

, … x

W

can be scored using the following formula:







^W

i x

ix

i i

p score q

odds

1

[1.1]

where p

x_i

represents the background frequency of the nucleotide x

i

. This score describes the probability that a subsequence matches the motif model described by the PSSM. The product is called the combined odds score, the individual factors ( q /

ixi

p

xi

) the odds scores. Calculations are usually performed in logarithmic scale (in that case called log odds score) where the individual terms are ^log

2

⁽ q

ix_i

^/ p

x_i

⁾ . De log odds-score is then

i i

x W ix

i

p

score q

odds 





1

log

2

log [1.2]

Example of odds- and log odds-score

 





 







70 , 0 05 , 0 01 , 0 50 , 0 90 , 0 10 , 0

10 , 0 30 , 0 01 , 0 80 , 0 02 , 0 10 , 0

10 , 0 05 , 0 97 , 0 05 , 0 06 , 0 20 , 0

10 , 0 60 , 0 01 , 0 10 , 0 02 , 0 60 , 0 PSSM

Background frequency of each of the four nucleotides: p

xi

^ ⁰ ^, ²⁵

Calculation of the odds ratio and log odds ratio for the consensus (ATGCAT), a similar sequence (CTGCGT) and a very different sequence (CATGGC):

Sequentie Odds Log odds Opmerking

ATGCAT 720,8829 9,49362 consensus

CTGCGT 120,1471 6,90866 Similar sequence CATGGC 0,0002 -11,99046 Different sequence

By sliding a window of length W over a novel sequence, for each subsequence within the window a log odds-score can be calculated. The highest scoring positions correspond to the most likely locations of the motif.

Example:

For DNA sequences: Motif Scanner (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html), RSA tools (http://embnet.cifn.unam.mx/~jvanheld/rsa-tools/).

For proteins (Mast, PSI-BLAST, Block Searcher):

Remark that programs such as psiblast make use of motif models are be more sensitive.

4.4.3 Scoring a sequence database with a profile representation

(16)

4.4.4 Scoring a sequence database with a HMM model

As mentioned before one of the main purposes of developing HMM is to use them to detect membership in a family by obtaining significant matches of a sequence to a motif HMM. There are different ways of scoring a sequence against a motif HMM. In fact scoring a sequence with a HMM comes down to aligning the sequence to the HMM i.e.we try to find the hidden path that generates the sequence. Retrieving this hidden path of states (matches, insertions, deletions) is also called decoding. As mentioned previously, a sequence can be generated by different paths for which in the best case only one has a high probability (this is a problem similar to the fact that a dynamic programming algorithm can give rise to different high scoring alignments). The Viterbi algorithm allows to retrieve the most likely path (the score for the sequence corresponds to the maximum likelihood score) while the forward algorithm allows to calculate the total probability of a sequence being generated by the HMM.

For scoring sequences with motif HMM, calculations for matches are usually expressed in log-odds ratios of the resulting probability to the probability given of x given a standard random model.

Scoring a sequence means calculating the probability that a given sequence is generated by the HMM versus being generated by a random model.

Multiple sequence alignment

Construct profile HMM

Scan new sequence with the profile (HMM)

Complete databases of HMM representing profiles characteristic for protein families are constructed (Pfam)

What is the chance that a sequence is generated by the HMM

I. II.

Unaligned sequences

ClustalW HMM

III.

(17)

4.4.4.2 Example of calculating the score of a sequence against a HMM

The arrows indicate the most likely alignment of the subsequence ATCAGT to the HMM. The most likely path s calculated by the Viterbi algorithm through the HMM is m

1

m

2

d

3

m

4

m

5

i

5

m

6

. Only the probabilities of the followed transitions are indicated.

The probability that the subsequence ATCAGT follows the indicated path m

1

m

2

d

3

m

4

m

5

i

5

m

6

can be calculated. The emission probability was taken equal as the background frequency and equals 0.25 for each nucleotide.

A famous screening tool based on HMMs is the HMMER (http://hmmer.janelia.org/) which is also the tool used by PFAM.

)

| ,

( ATCAGT m

₁

m

₂

d

₃

m

₄

m

₅

i

₅

m

₆

HMM P

0005772 , 0

95 , 0 70 , 0 95 , 0 25 , 0 20 , 0 60 , 0 85 , 0 97 , 0 80 , 0 10, 0 90 , 0 90 , 0 60 , 0 95 , 0

)

| ( ).

| (

).

| ( ).

| (

6 6

5 6 5 5

5 5 4

5

4 3

4 2 3 2 1

2 1 1







m END T m T P i m T i G P m i T m A P m m T

m C P d m T m d T m T P m m T m A P BEGIN m T

)

| ,

(

log P ATCAGT m

₁

m

₂

d

₃

m

₄

m

₅

i

₅

m

₆

HMM



7587 , 10

95 , 0 log 70 , 0 log 95 , 0 log 25 , 0 log ....

90 , 0 log 60 , 0 log 95 , 0 log

)

| ( log )

| ( log ...

)....

| ( log )

| ( log

6 6

5 6 5

2 1

1











m END T m

T P i

m T i

G P

m T P m

m T m

A P BEGIN

m

T

(18)

4.5. Assessing the relevance of an alignment when performing high throughput screenings

4.5.1 Sensitivity versus specificity

The first and most important criterion is that a good signature pattern must be as short as possible, should detect all or most of the sequences it is designed to describe and should not give too many false positive results. In other words it must exhibit both high sensitivity and high specificity.

When screening, there is always a trade off between sensitivity and specificity (selectivity).

Sensitivity:

FN TP recall TP

SENS   

Sensitivity is the proportion of true positives that are correctly identified by the test.

Specificity:

FP TN SPEC TN

 

Specificity is the proportion of true negatives that are correctly identified by the test.

FDR (false discovery rate):

FP TP FDR FP

 

Precision = 1-FDR FP TP ecision TP

  Pr

4.5.3 Extreme value distribution

Like most similarity search techniques, a database search with a motif model returns a sorted list of potential matches ranked by a quality score. Because there is no statistical theory that allows for direct computation of the probability of obtaining a certain score by chance, one has to rely on empirical methods for significance estimation. How does this works:

 Randomize the sequences e.g, by shuffling them. If random sequences are used, it is important that the sequences are generated with a procedure that preserves certain statistical properties of biological sequences known to have an influence on the score distribution such as compositional bias and the actual length distribution

 Apply the scoring procedure that was used on the real data on the random data

 Estimate as such empirically the random score distribution

(19)

 Fit this distribution by a mathematical function (usually the extreme value distribution) (Prosite help file). The obtained E-value then corresponds to the probability of finding a score equal or better than the one observed, by chance alone.

4.6. Iterative searches

Iterated search methods have led to biologically important observations. One such method is PSI- BLAST. The procedure PSI-BLAST uses can be summarized in five steps:

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a model (weight matrix), from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and model, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.

(3) The model is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

4.7. Databases of motif models

Proteins can, according to their functional and sequence similarities be grouped into “protein families”. The degree of similarity of the proteins across the sequences in each family is far from uniform. While some regions are clearly conserved, others display little sequence similarity. Often the conserved regions are crucial to the protein’s function, for example, they correspond to the enzymatic catalytic sites.

The description of a protein family by its conserved regions thus highlights the family’s characteristic and distinctive sequence features. These features can be used to probe an uncharacterized sequence to indicate its function. Therefore, databases of conserved features of protein families can be used to classify sequences from proteins, cDNAs and genomic DNA or to infer a function for an unknown protein.

Distinct databases of protein domains (motifs) exist

 ProDom

 PRINTS

 Pfam

 Prosite

 BLOCKS

Some of these databases will be described below.

4.7.1 Pfam, the proteins family database

4.7.1.1 introduction

From training sets of known members of protein classes HMMs are constructed and deposited in

databases. These HMM represent usually domains in proteins. These database of HMM can then be

(20)

used to classify a novel sequence i.e. to find out which domains are present in a novel sequence with unknown function. This means that the novel sequence is aligned against all HMMs in the database. Pfam 7.0 is such a database that contains a total of 3360 families. Each domain of a protein family is represented by a HMM. Pfam contains multiple protein alignments and motif- HMMs of these families.

4.7.1.2 Construction of Pfam

PSI-blast

Automatically generated sequences

PfamB families

Construct seed HMM PfamA

PRODOM alignment

Discover PRODOM alignments not covered by PfamA

Pfam starts from a set of automatically generated domain alignments (generated by PsiBlast). From these alignments a HMM is calculated. Subsequently all sequences in the SwissProt database of proteins are scored with the HMMs. The resulting scores are ranked and a suitable threshold is selected to separate class members from the other sequences in the database. One can start with a seed of sequences that are multiple aligned, build a HMM, use the HMM to retrieve remote homologs, reconstruct the alignment … iterative motif search).

Unfortunately, because these raw scores of a HMM (Pm(S)) are length-dependent, they must be modified before the ranking can occur. Various techniques have been developed to work around this shortcoming. One approach is to calculate the difference in standard deviations between the raw score of a sequence and the mean score of sequences of similar length in the database. A more sophisticated method scores a sequence by a . This number is the log of the ratio between two probabilities - the probability that the sequence was generated by the HMM and the probability that the sequence was generated by a null model, whose parameters reflect the general amino acid distribution in the training sequences. http://www.sanger.ac.uk/Software/Pfam/

4.7.1.3 Scores in Pfam

In the formalism used by HMMs (basically a Bayesian statistical formalism), the HMM of the

domain is considered as a model which "produces" sequences with a certain probability for each

different sequence. For each possible sequence, these probabilities are very low, but they are

compared to the probability that this sequence was produced by a "random" (also called NULL)

model. The log of the ratio of these two likelihoods are reported (a log odds ratio). The base of the

log is base2, giving a statistic that is commonly called a "bit score". That's what the HMMER raw

scores are. The higher this score, the better. The scores described below are derived from this raw

score.

(21)

In HMMER2, like BLAST, E-values (expectation values) are calculated. The E-value is the number of hits that would be expected to have a score equal or better than this by chance alone. A good E- value is much less than 1. Around 1 is what we expect just by chance. In principle, all you need to decide on the significance of a match is the E-value.

Besides the E-values a second, and even more empirical, s scoring system is used. For each Pfam family, a "trusted cutoff" and a "noise cutoff", TC1 and NC1 are recorded. TC1 is the lowest score for sequences included in the family (e.g. in the Full alignment). NC1 is the highest score for sequences not included in the Full alignment.

Therefore, we can consider a hit very significant if it scores better than the trusted cutoff, better than the noise cutoff, and has a significant E-value.

The "sequence classification score" is the total score of a sequence aligned to a model; if there are more than one domain, the sequence score is the sum of all (finding multiple domains increases our confidence that the sequence belongs to that protein family, even if each domain individually is a weak match.)

The "domain score" is a score for a single domain. (These two scores are identical for single domain proteins.)

4.7.1.4 Scoring a sequence with Pfam models

I.e. screening an new sequence against Pfam HMMs to classify the novel sequence.

HMMER is most sensitive at identifying complete domains: its preferred algorithm is a "motif alignment" algorithm that is neither fully global nor fully local alignment, but instead looks for an alignment that is global with respect to the model, but (multiply) local with respect to the sequence -- e.g. to look for one or more complete domains in a query sequence.

Fragments of domains do occur, especially in truncated sequences, translated ESTs, or because of

insertions of new domains in existing ones.

(22)

Each Pfam family contains two HMMs, one built to find complete domains (ls mode) the other to match fragmants of domains (fs mode). This allows Pfam to be as complete as possible, but at the cost of having to carry out twice as many HMM searches. The most sensitive way to search against Pfam is to combine the results of both the ls and fs mode HMMs.

4.7.2 Prosite http://ca.expasy.org/prosite/

PROSITE contains 1156 documentation entries that describe 1585 different patterns, rules and profiles/matrices

A prosite pattern is defined as a regular expression A prosite profile is a profile representation (see above)

4.7.2.1 Construction of PROSITE patterns and profiles Patterns from the literature

A number of the patterns described in this dictionary have been published. We have tested those patterns on the SWISS-PROT knowledgebase to see if the signature pattern was still specific to the group of family of proteins since the paper was published. If this was the case we used the published pattern as such, otherwise we updated the pattern using methods similar to those used to develop a new pattern and which are described in the following sub-section.

Steps in the development of a new pattern

 start by studying review(s) on a group or family of proteins.

 build an alignment table of the proteins discussed in that review. If necessary add to this table new published sequences relevant to the subject under consideration. Using such alignment tables pay particular attention to the residues and regions thought or proved to be important to the biological function of that group of proteins. These biologically significant regions or residues are generally:

-Enzyme catalytic sites.

- - Prostethic group attachment sites (heme, pyridoxal-phosphate, biotin, etc).

- - Amino acids involved in binding a metal ion.

- - Cysteines involved in disulfide bonds.

- - Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

 Try to find a short (not more than four or five residues long) conserved sequence which is part of a region known to be important or which include biologically significant residue(s).

Pattern(s) created at this stage are the 'core' pattern(s). The most recent version of the SWISS-PROT knowledgebase is then scanned with these core pattern(s). If a core pattern will detect all the proteins under consideration and none (or very few) of the other proteins, the process will be stopped at this stage and use the core pattern as a bona fide signature.

 In most cases a lot of extra sequences will be picked up which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the pattern, is then necessary. In some cases a good pattern will never be found and the process has to be retried with a core pattern from a different part of the sequence. It must also be noted that take particular attention is taken to try to avoid 'false' patterns. The following example describes what is called a 'false' pattern:

Let us assume that we have a partial alignment of three sequences around an active site residue (in

this example an histidine whose position is marked with an asterisk) as shown below:

(23)

* ALRDFATHDDF SMTAEATHDSI ECDQAATHEAS

Here we would start scanning with a core pattern with the sequence A-T-H-[D or E]. This pattern is small and would probably pick up too many false positive results. According to the procedure outlined above, we would then have to extend the core pattern. But in this case, any extension would be artificial and group together residues which have different properties and which are represented only once in a given position of the alignment. For example, we could scan with the pattern [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E]. This pattern would probably only pick up the sequences which are in the alignment, but it would be biologically meaningless; there is no consensus in the first three positions of the pattern and the pattern does not even group residues with identical physicochemical properties. Consequently, this pattern would probably fail to detect a new sequence containing the same active site but having a different N-terminal sequence.

Methodology for the development of profile entries

In subsequent step patterns (regular expressions) are converted into profiles. A weight matrix is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.

The profiles are developed according to the method of Gribskov (also outlined above). In order to be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches. In addition, it should correctly align those residues having analogous functions or structural properties according to experimental data.

4.7.3 BLOCKS proteins family database

The Blocks Database consists of ungapped multiple alignments of short regions, called

‘blocks’. The Blocks models are based on frequency matrices. The database is constructed from sequences of protein families using a fully automated method (PROTOMAT system). The blocks database can be used to

 Convert your multiple alignment in a block (BlockMaker)

 Search a query sequence for the presence of previously annotated blocks. (BlockSearcher,

IMPALA searcher, reverse PSI-Blast searcher, MAST)

(24)

FIGURE. Blocks Database main page available at http://blocks.fhcrc.org/

BlockMaker:

The program automatically extracts conserved, ungapped regions (blocks) from an msa using two different approaches.

 Using the two-step PROTOMAT system. The first step incorporates a motif finder algorithm (MOTIF) that exhaustively evaluates spaced triplets of amino acids that are common to multiple sequences. The statistical significance of these motifs is assessed by comparing their occurrences in the genuine set to their occurrence in a set of shuffled sequences. In a second step of the PROTOMAT system seed blocks are extended.

Figure An example of a MOTIF BLOCK containing aligned sequences from 10 proteins. Block so not contain gaps. The pattern A Q I which occurs in all 10 sequences is detected by the motif finder algorithm within the PROTOMAT system.

 Using Gibbs sampling (MEME) see next chapter.

The blocks in the BLOCKS database are derived from PROSITE catalogue using the PROTOMAT

system.

(25)

An example of a block database entry can be found below.

FIGURE. Blocks Database format. A. Each block entry is divided into header and sequences parts. The header part consists of four lines. The ID (identification) line contains the block’s family short description and identifies the entry as a block type. The AC (accession) line gives the block’s accession code and the minimal and maximal distances of the block from the previous block or the protein N’ end. The DE (description) line contains the long description of the family.

The short and long descriptions are taken from PROSITE. The BL (block) line gives the spaced triplet motif of the block, the block’s width, number of sequences used to derive the block. Each sequence line contains the sequence Swiss-Prot name, the start position of the segment (define sequence segment) (100 being most distant). B. Block sequence logo.

Often a conserved region is modelled as a concatenation of several blocks, corresponding to

different regions in the conserved domain (see below).

(26)

Screening with Blocks

Below is given the output of screening query sequence, a bacterial dichlorocatechol oxidase with the BLOCKS database.

From these results it appears that, as was to be expected the best match is obtained with the

blocks of the iron-containing alcohol dehydrogenase family. The representative domain of this

family consists of 4 concatenated blocks. In our query sequence all four blocks are present in

the same order as they were present in the sequences used to construct the block, that is,

A→B→C→D. This can be seen in the block map.

This map shows that the distances between the three blocks representing this family fit the distances between the segments of the query that align with these blocks.

Therefore, because all

representative blocks of this family, the order of the blocks and the distance between the blocks

has been conserved, the query is likely to be a member of the iron-containing alcohol

dehydrogenase family.

(27)

Note that the Blocks database was used as a resource to construct the BLOSUM series of

amino acid substitution matrices.

(28)

(29)