Methods for the construction of profile entries for the PROSITE database

The following terminology is adopted in this document:

- The term `profile' refers to a quantitative motif description based on the generalised profile syntax.

- The term `pattern' refers to a qualitative motif description based on a regular expression-like syntax such as the one currently used in PROSITE entries marked as PATTERN.

- The term `motif' refers to the biological object one attempts to approximate by a pattern or a profile.

Kay Hofmann and Philipp Bucher

Bioinformatics Group, Swiss Institute for Experimental Cancer Research, CH-1066 Epalinges s/Lausanne, Switzerland.

(INTERNET: khofmann@isrec-sun1.unil.ch ; pbucher@isrec-sun1.unil.ch)

The PROSITE pattern library has proved to be a useful tool for the detection of motifs in protein sequences. Both consensus sequences for posttranslational modifications and `signatures' characteristic of certain protein families are documented and stored in a searchable format.

However, not all characterised protein motifs and domains can be described using the PROSITE syntax, which is based on regular expressions. Recently, we started to add profile entries to PROSITE, using the `generalised profile syntax', which had been designed for that purpose. Profiles are two-dimensional tables of position specific match-, gap-, and insertion-scores, normally derived from aligned sequence families. Here we describe the methods we use for the generation of a profile entry for the PROSITE database.

The standard procedure starts with a set of sequences whose membership in the protein or domain family under study is well established. The sequences are aligned using any of the available multiple-alignment methods. It has proved very advantageous to include information on the 3D-structure of one or more family members in the alignment process, if it is available. In many cases the alignment can be significantly improved by manual editing. Subsequently, a profile is constructed from the multiple alignment, using sequence weighting to account for subfamily bias, and limited gap-excision. The profile is first compared with every sequence in a randomised protein database, derived from SwissProt by regional shuffling, in order to analyse the score statistics and to calculate the necessary scaling parameters. Finally, the profile is used to search a nonredundant database of all available protein sequences. The scaling parameters are used to estimate the significance of an observed score; database sequences with scores exceeding a certain confidence threshold are tentatively accepted as members of the sequence family. After checking for biological plausibility, this newly derived set of sequences is used for the construction of a second profile, essentially as described above. The whole iterative process is continued until no new sequences fulfil the imposed significance criteria. For the inclusion of the profile into PROSITE, appropriate cut-off scores are derived from the statistical parameters and the relevant sequences in SwissProt are classified into the categories `true positive', `false positive', `false negative', and `sequence fragment'.
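The iterative part of this procedure can be outlined as a simple loop. The following is only a sketch: the function name `refine_profile` and the three callables passed to it (`build_profile`, `search`, `is_plausible`) are hypothetical placeholders for the alignment, profile construction, database search and manual plausibility check described in the text, not part of any existing program.

```python
from typing import Callable, Iterable, Set, TypeVar

P = TypeVar("P")  # opaque profile type; its structure is not modelled here


def refine_profile(
    seed_ids: Iterable[str],
    build_profile: Callable[[Set[str]], P],   # hypothetical: align, weight, build, scale
    search: Callable[[P], Iterable[str]],     # hypothetical: IDs scoring above the threshold
    is_plausible: Callable[[str], bool],      # hypothetical: biological plausibility check
    max_rounds: int = 20,                     # safety limit, an assumption of this sketch
) -> Set[str]:
    """Grow the trusted sequence set until a search round adds nothing new."""
    members = set(seed_ids)
    for _ in range(max_rounds):
        profile = build_profile(members)
        hits = set(search(profile))
        # Only hits that are new and pass the plausibility check are accepted.
        new = {h for h in hits - members if is_plausible(h)}
        if not new:
            break  # no new sequences fulfil the criteria: stop iterating
        members |= new
    return members
```

The loop ends either when a round yields no new accepted sequence or when the safety limit is reached; the latter is not part of the procedure described above.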

Up to now we have constructed about 40 different profiles for frequent and important protein domains. With the recent improvements in the profile construction methods, we expect this number to increase relatively quickly. Nevertheless, we intend to prepare a limited number of high-quality profiles representing defined domains and protein families rather than a large number of automatically created profiles of poor quality.

[Figure: Flowchart of the profile construction strategy]

Explanation of the most important steps

1 Choosing an initial set of trusted sequences

The construction of each new profile entry begins with a set of sequences, either total proteins or local homology domains, which can be assumed to belong to the same family. It is important not to include sequences with doubtful relationship to the family under consideration since even a single inappropriate sequence can severely degrade profile performance. Criteria we accept for establishing the relationship between the sequences of the starting set include the following:

- Highly significant sequence similarity between all members using pairwise comparison techniques.

- Knowledge of a common functionality of the sequences in combination with a reasonable degree of sequence similarity.

- A common three-dimensional fold of the sequences allowing reliable superposition of the structures.

2 Construction of a multiple alignment

Several methods can be used to align the proteins of the trusted set, depending on the amount of available information. If the 3D-structure of more than one sequence of the family is known, we usually start with a structural alignment of these sequences, derived from a superposition of the structures. If no structural data are available, we start with a multiple alignment generated by programs like ClustalW or Pileup. In most cases that include divergent proteins, a manual refinement of the initial multiple alignment is necessary. If some of the sequences are very divergent, it has proven advantageous to exclude these sequences from the initial alignment and add them at a later stage to the already existing multiple alignment.

3 Calculation of the sequence weights

The introduction of sequence weights improves the performance of the resulting profiles. This effect is particularly pronounced if the initial set of trusted sequences contains both unique sequences and multiple members from closely related sequence families.
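The effect of weighting can be illustrated with a small example. The text does not state which weighting scheme is used, so the sketch below assumes one widely used scheme, position-based weighting in the style of Henikoff and Henikoff, purely for illustration. The function name `position_based_weights` and the convention of ignoring gap characters (`-`) are assumptions of this sketch.

```python
from collections import Counter
from typing import List


def position_based_weights(alignment: List[str]) -> List[float]:
    """Position-based sequence weights, normalised to sum to 1.

    Assumed scheme (not taken from the text): in each column, a residue
    type seen n times among k distinct types contributes 1 / (k * n) to
    every sequence carrying it. Gap characters ('-') are ignored.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [row[col] for row in alignment]
        counts = Counter(c for c in column if c != "-")
        k = len(counts)  # number of distinct residue types in this column
        for i, c in enumerate(column):
            if c != "-":
                weights[i] += 1.0 / (k * counts[c])

    total = sum(weights)
    return [w / total for w in weights] if total else weights
```

For an alignment of two identical sequences and one unique sequence, the two identical sequences share the weight that the unique sequence receives alone, which is the kind of correction for subfamily bias described above.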

4 Construction of a generalised profile from a weighted alignment

The generalised profile syntax for PROSITE entries is based on the concept of profiles introduced by Gribskov et al. (2).

In the default procedure, we use a 10*log10-scaled version of the BLOSUM45 comparison matrix[3], applying symmetrical gap-opening and gap-closing penalties of 1.05 each, and a gap-extension penalty of 0.21.
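To make the default gap parameters concrete, the sketch below computes the total penalty of a single gap. The three constants are the values quoted above. The cost model itself is an assumption of this sketch: it charges the opening and closing penalty once each and the extension penalty once per gapped position. The function name `gap_penalty` is hypothetical.

```python
GAP_OPEN = 1.05    # gap-opening penalty quoted in the text
GAP_CLOSE = 1.05   # gap-closing penalty quoted in the text
GAP_EXTEND = 0.21  # gap-extension penalty quoted in the text


def gap_penalty(length: int) -> float:
    """Total penalty for one gap spanning `length` positions.

    Assumed model: open + close + extend * length, i.e. the extension
    penalty is also charged for the first gapped position.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return GAP_OPEN + GAP_CLOSE + GAP_EXTEND * length
```

Under this assumed model, a gap of one position costs 2.31 and a gap of ten positions costs 4.20, so the fixed cost of introducing a gap dominates for short gaps.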

Depending on the purpose of the profile (i.e. if it reflects a complete protein family or a localised homology domain that is part of larger sequences) we can force it to favour local or global alignment behaviour, or any intermediate thereof.

(2) Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. USA 84:4355-4358 (1987).

5 Estimating the statistical significance of profile matches

Like most similarity search techniques, a protein database search with a profile returns a sorted list of potential matches ranked by a quality score. Because there is no statistical theory that allows for direct computation of the probability of obtaining a certain score by chance, one has to rely on empirical methods for significance estimation. Such methods typically attempt to fit the parameters of a mathematical function to the score distribution of chance matches found in real or random sequences. If random sequences are used, it is important that the sequences are generated with a procedure that preserves certain statistical properties of biological sequences known to have an influence on the score distribution such as compositional bias and the actual length distribution.
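An empirical fit of this kind can be sketched as follows. The functional form is an assumption made here for illustration only: the logarithm of the rank of a chance match (the number of chance matches scoring at least as high) is taken to fall linearly with the score in the high-scoring tail. The function names `fit_score_tail` and `expected_chance_matches` and the parameter `top_n` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_score_tail(random_scores: Sequence[float], top_n: int = 200) -> Tuple[float, float]:
    """Least-squares fit of log10(rank) = a + b * score.

    Only the top_n highest chance scores are used; rank 1 is the best
    score. Returns the intercept a and the slope b.
    """
    top = sorted(random_scores, reverse=True)[:top_n]
    if len(top) < 2:
        raise ValueError("need at least two scores to fit a line")
    ranks = [math.log10(r) for r in range(1, len(top) + 1)]

    n = len(top)
    mean_x = sum(top) / n
    mean_y = sum(ranks) / n
    sxx = sum((x - mean_x) ** 2 for x in top)
    if sxx == 0:
        raise ValueError("scores must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(top, ranks))

    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def expected_chance_matches(score: float, a: float, b: float) -> float:
    """Expected number of chance matches scoring at least `score`."""
    return 10 ** (a + b * score)
```

With the fitted parameters, the expected number of chance matches at a given score can be read off directly. This number refers to a database of the same size as the random one used for the fit.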

The specific method we use for significance tests with PROSITE profiles uses a regionally shuffled version of SWISS-PROT preserving the original length distribution and amino acid composition in successive windows of length 20. Each profile is compared against this random database to produce a list of high-scoring profile matches sorted by score.
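Regional shuffling of a single sequence can be sketched in a few lines. The sketch assumes that `successive windows' means non-overlapping windows starting at the first residue, with a shorter final window where the sequence length is not a multiple of 20. The function name `regional_shuffle` is hypothetical and does not refer to the program actually used.

```python
import random
from typing import Optional


def regional_shuffle(sequence: str, window: int = 20,
                     rng: Optional[random.Random] = None) -> str:
    """Shuffle residues within successive non-overlapping windows.

    The sequence length is preserved, and so is the amino acid
    composition of every window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)  # permute residues inside this window only
        pieces.append("".join(chunk))
    return "".join(pieces)
```

Applying the function to every database entry gives a random database with the original length distribution, in which compositionally biased regions stay where they were in the native sequences.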

(1) Pearson W.R., Lipman D.J. Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).

6 Plausibility checking of new sequences

As mentioned before, it is very important not to include unrelated proteins into the sequence set used for the profile construction. Every potential candidate sequence detected in the database search has to be carefully checked before it can be used for the iterative profile refinement.

The most important condition a potential new family member has to meet is the statistical significance of the profile score, as described in section 5. For sequences with no available functional or structural information, we usually require a residual error probability p<0.01. If biological or structural data suggest a meaningful relationship of the test sequence to the family under consideration, we use a relaxed stringency criterion. A frequent example of this situation is the potential occurrence of additional copies of a repeat domain in a protein that already contains several `accepted' copies of this domain.

If biological or structural data apparently contradict the relationship indicated by the profile score, we increase the stringency up to p<0.0005. In cases where even this condition is met, a re-evaluation of the contradicting data is indicated.
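The acceptance rule of this section can be summarised as a small decision function. The thresholds 0.01 and 0.0005 are taken from the text. The relaxed threshold is not given there, so it is left as a required argument (`relaxed_p`, a hypothetical parameter). The strictest quoted value is used whenever contradicting data exist, which simplifies the `up to' in the text. The names `Evidence` and `passes_significance` are likewise assumptions of this sketch.

```python
from enum import Enum


class Evidence(Enum):
    NONE = "none"                    # no functional or structural information
    SUPPORTING = "supporting"        # data suggest a meaningful relationship
    CONTRADICTING = "contradicting"  # data apparently contradict the relationship


DEFAULT_P = 0.01    # threshold quoted in the text for sequences without information
STRICT_P = 0.0005   # strictest threshold quoted in the text


def passes_significance(p_value: float, evidence: Evidence, relaxed_p: float) -> bool:
    """Return True if the candidate meets the significance criterion.

    relaxed_p is hypothetical: the text only says that a relaxed
    criterion is used when supporting data exist.
    """
    if evidence is Evidence.SUPPORTING:
        threshold = relaxed_p
    elif evidence is Evidence.CONTRADICTING:
        threshold = STRICT_P
    else:
        threshold = DEFAULT_P
    return p_value < threshold
```

A candidate that passes this test is still subject to the manual plausibility check; the function only encodes the statistical part of the decision.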

Profiles in PROSITE: signal transduction and other domains

domains in signal transduction proteins

SH2 domain**, SH3 domain**, PH domain**, C1 domain**, C2 domain**, PID domain**, rasGAP domain*, rhoGAP domain*, rapGAP family*, rabGAP family, arfGAP family, cdc24-type rasGRF domain*, cdc25-type rasGRF domain*, rcc-repeat (ranGRF)*, rhoGDI-family, rabGDI-family

putative intracellular protein/protein interaction domains

C3H2C3-type RING finger*, rsp5/WW-domain*, forkhead-associated (FHA) domain*, polo-box*, death-box*, lipoxygenase appendage domain*, bromo-domain*, chromo-domain*, IQ-domain*, BTB-domain*, α-latrotoxin receptor interaction domain

some DNA/RNA-binding domains

forkhead-domain*, myb-domain*, ets-domain*, HMG (high-mobility group) domain*, MCM (mini-chromosome maintenance) domain, KH-domain*

some catalytic domains

protein-kinase domain**, lipid kinase (PI3K) domain*, PI-specific PLC X-box** and Y-box**, bacterial PLC-domain, bacterial SMase, intracellular (plant-type) PLD, extracellular PLD, intracellular PLA2, HECT-domain (ubiquitin-transferase)*

some repeat domains

leucine-rich repeat*, LRR-flanking regions*, TPR repeat**, wd40 repeat**, ankyrin repeat*, spectrin repeat*, gelsolin repeat*, filamin/ABP280 repeat

a few extracellular domains

cub domain**, anaphylatoxin-domain**, saposin II-domain*, C-type lectin domain*, thrombospondin type I domain*, archaebacterial surface layer repeat*

miscellaneous protein families

hsp20-family**, globin-family**, cpn10-family*, ricin-family*, IMB/FBP/IPP-family*

Profiles that are already part of PROSITE are labeled with (**), experimental profiles that are not yet part of PROSITE but can be searched using our WWW-server (http://ulrec3.unil.ch) are labeled with (*).