
(1)

EXERCISE 5: ITERATIVE PROFILE SEARCHES (PSI-BLAST) Adapted 05012007

Note that, in order to build the multiple sequence alignment above, we:

1. started from a single sequence;

2. blasted it against the NCBI nr database;

3. selected all the resulting hits;

4. performed a multiple alignment;

5. constructed a profile based on this alignment, which can subsequently be used to discover remote homologs.
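Step 5 above can be illustrated with a minimal sketch. It assumes a toy ungapped alignment, uniform background frequencies and simple add-one pseudocounts; all sequences and constants are hypothetical, and the real PSI-BLAST profile additionally uses sequence weighting and substitution-matrix-based pseudocounts.

```python
import math
from collections import Counter

# Toy ungapped alignment (hypothetical sequences of equal length).
alignment = [
    "GAWYA",
    "GSWYA",
    "GTWFA",
    "GAWYV",
]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: add-one smoothing


def build_profile(seqs):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    total = len(seqs) + PSEUDOCOUNT * len(ALPHABET)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum the column scores of an ungapped sequence of profile length."""
    return sum(column[aa] for column, aa in zip(profile, seq))


profile = build_profile(alignment)
print(round(score(profile, "GAWYA"), 2), round(score(profile, "KKKKK"), 2))
```

A sequence that follows the conserved columns scores positive, while an unrelated one scores negative; this is what allows a profile to pick up homologs that a single query sequence would miss.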

PSI-BLAST Introduction

All these steps can be performed automatically with the PSI-BLAST server, which builds the profile itself and uses it to detect remote homologs:

The procedure PSI-BLAST uses can be summarized in four steps:

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile (weight matrix), from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
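The loop formed by steps (2) to (4) can be sketched as follows. This is only a toy model: the database, the peptide sequences, the frequency-sum score and the score threshold are all hypothetical stand-ins for the gapped BLAST search and E-value threshold that PSI-BLAST really uses.

```python
from collections import Counter

# Hypothetical toy database of equal-length peptides (no gaps, for simplicity).
DATABASE = {
    "seqB": "GSWYAR",
    "seqC": "GSWFAR",
    "seqD": "GTWFVR",
    "seqE": "KLMNPQ",
}


def build_profile(seqs):
    """Per-column residue frequencies of the query plus the included hits."""
    n = len(seqs)
    return [{aa: c / n for aa, c in Counter(col).items()} for col in zip(*seqs)]


def search(profile, threshold):
    """Return the names of database entries whose score reaches the threshold."""
    hits = set()
    for name, seq in DATABASE.items():
        s = sum(col.get(aa, 0.0) for col, aa in zip(profile, seq))
        if s >= threshold:
            hits.add(name)
    return hits


def iterate(query, threshold=3.9, max_iterations=10):
    """Repeat search and profile rebuilding until the hit set stops changing."""
    included = set()
    profile = build_profile([query])
    for iteration in range(1, max_iterations + 1):
        hits = search(profile, threshold)
        if hits == included:  # no novel entries: converged
            return included, iteration
        included = hits
        profile = build_profile([query] + [DATABASE[n] for n in sorted(included)])
    return included, max_iterations


included, n_iterations = iterate("GAWYAK")
print(sorted(included), n_iterations)
```

In this toy run, seqC is too distant to be found with the query alone, but it is picked up in the second iteration once seqB has been added to the profile; the third iteration finds nothing new, so the search has converged.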

# Paste the complete original sequence into PSI-BLAST. The format page displays the number of hits together with the annotated domains. How many hits do you find?

You find a hit with 2 domains: COX and FTR1.

What will be the effect of having 2 domains?

Because a query with several domains gives mixed-up results (hits to either domain), we will select a part of the sequence that shows homology to only one domain.
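Selecting part of a sequence can also be done with a few lines of code. The sketch below parses FASTA text and cuts out a region given 1-based, inclusive coordinates; the example record and the coordinates are hypothetical and do not correspond to the real domain boundaries, which have to be read from the domain annotation.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.replace(" ", ""))  # drop spaces inside sequence lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def extract_region(sequence, start, end):
    """Return residues start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[start - 1:end]


def to_fasta(header, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines)


# Hypothetical record and coordinates, for illustration only.
example = ">toy two-domain protein\nMINRLLNNLT SFFTDNRWLF\nSTNHKDIGTL\n"
header, seq = read_fasta(example)[0]
domain = extract_region(seq, 11, 20)
print(to_fasta(header + " (residues 11-20)", domain))
```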

Try the following sequence with two domains:

(2)

>gi|11467051|ref|NP_042527.1| cytochrome oxidase subunit 1 and subunit 2 [Acanthamoeba castellanii]

MINRLLNNLTSFFTDNRWLFSTNHKDIGTLYLIFGGFSGIIGTIFSMIIRLELAAPGSQILSGNSQLYNV IITAHAFVMIFFFVMPVMIGGFGNWFVPLMIGAPDMAFPRLNNISFWLLPPSLFLLLCSSLVEFGAGTGW TVYPPLSSIVAHSGGSVDLAIFSLHLAGISSLLGAINFITTIFNMRVPGLSMHKLPLFVWSVLITAFLLL FSLPVLAGAITMLLTDRNFNTSFFDPSGGGDPILYQHLFWFFGHPEVYILILPAFGIVSQIIGTFSNKSI FGYIGMVYAMLSIAVLGFIVWAHHMYTVGLDVDTRAYFTAATMMIAVPTGIKIFSWIATLWGGQIVRKTP LLFVIGFLILFTLGGLTGIVLSNAGLDIMLHDTYYVVAHFHYVLSMGAVFAFFAGFYYWFWKISGYTYNE MYGNVHFWLMFIGVNLTFFPMHFVGLAGMPRRIPDYPDNYYYWNILSSFGSIISSVSVIVFFYLIYLAFN NNNTPKLIKLVHSIFAPYINTLSKNLLTFASIKSTSDSSFFKFSKFFIFFMVSLSVLFIFYDSLLCLNDH TNSWKIGFQDPTTPIAYGIIKLHDHILFFLAVILFVVGYLLLSTYKKFYYGSLNNDLPESKRISLFDTLI NTYKENLSFNVTNRTYNINHGTTIEIIWTILPAFILLFIAVPSFALLYAMDEIIDPVLTVKVIGHQWYWS YEYSDYSVVYSNRMLDYDSIDRFAAMEMMYKGMGYLKDRSLLSYLYIPMVIPETTIKFDSYMIHEAELNL GDLRLLKTDMPLFLPKNTHIRLLITSSDVLHSWAVPSFGVKVDAVPGRLNQTSLYLKNTGTFYGQCSELC GVNHAFMPIEVYVVNPVYFYNYVYIYFKNFNLI

PSI-BLAST the COX family

# Try to find a part of the original sequence located around the site involved in binding the Cu ligands. This site is well conserved in all members of the family (see the multiple alignment).

Using this part of the sequence to construct a profile will be very informative.

Paste this sequence into PSI-BLAST and restrict the BLAST results to prokaryotic sequences.

Perform a few iterations: check the alignments of the retrieved hits. Do these still contain the conserved part?

If so, the alignment might still be reliable. Because there are so many terminal oxidase sequences in the database, this program does not converge quickly (in the exercise below, the algorithm converges, i.e. it no longer finds novel entries, after 3 iterations).

By using PSI-BLAST we can find members of the cytcaa3 oxidases (standard form) but also some quinol oxidases. The most remote members, the cytcbb3 oxidases, are harder to find. How would you explain that?

(If you do not pick up a member, you cannot include it in the training set used to construct the profile.)

>unknown gene

MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGAR LIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGV ALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLF KVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIIS HVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMW GGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGK MSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFY TLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH

# do this exercise at home

Try the PSI BLAST method with the following sequence and compare its results with an ordinary blast search after the first and the second iteration:

>gi|8170799|gb|AAC33476.2| unknown [Azospirillum brasilense]

MAGPLRSPDGQRRRCMDATPTWWWLGRPIRTATVTARSRFRRIWCSNAGVRCWSCPTPALSPTSGNRRVLVAWNGSREAA RAVADAMPILTAAKRVVVMAVNPKAGPAGIGDEPGADIAKHLSRHGCRVEATHIVTDQIDPGDTLLNTVADESCDLLVMG AYARSRVREQVLGGMTRYMLEHMTVPVLMSH

After the first iteration:

How many hits with BLAST, and how many with PSI-BLAST? (266)

Iterate in PSI-BLAST:

How many more sequences do you find compared to the first BLAST iteration? Can you recover the sequence with the worst score in the ordinary BLAST search?
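Comparing two hit lists by hand is error-prone when there are hundreds of hits. A minimal sketch, assuming you have copied the hit identifiers (and, for the ordinary BLAST search, their E-values) out of the reports; the identifiers and values below are hypothetical:

```python
def compare_hits(previous, current):
    """Report which identifiers are shared, newly found or lost between two searches."""
    previous, current = set(previous), set(current)
    return {
        "shared": sorted(previous & current),
        "new": sorted(current - previous),
        "lost": sorted(previous - current),
    }


def worst_hit(scored_hits):
    """Return the identifier with the highest (i.e. worst) E-value."""
    return max(scored_hits, key=lambda hit: hit[1])[0]


# Hypothetical results: (identifier, E-value) for BLAST, identifiers for PSI-BLAST.
blast = [("seq1", 1e-50), ("seq2", 1e-10), ("seq3", 0.5)]
psiblast = ["seq1", "seq2", "seq3", "seq4", "seq5"]

report = compare_hits([name for name, _ in blast], psiblast)
worst = worst_hit(blast)
print(len(report["new"]), "new hits;", worst, "recovered:", worst in psiblast)
```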

Kathleen Marchal ESAT/SCD 2

(3)

Continue to iterate? How many iterations are needed to find all sequences?

PSI-BLAST the lipocalin family

Start from the human lipocalin NP_006735 (NP_006735.fasta)

Perform a PSI-BLAST search using the default parameters against the nonredundant database. There are about 120 hits (half of them above and half of them below the default inclusion threshold of E = 0.005). By inspection, the retained hits are all called RBP or apolipoprotein D (another lipocalin). Some of the matches that are not retained (insecticyanins) are authentic lipocalins, as they have similar three-dimensional structures and a related biological function as carrier proteins. Other proteins are viral and are false positives.
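The effect of the inclusion threshold can be made explicit with a small sketch: hits with an E-value below the threshold are used to build the profile for the next iteration, and the others are only reported. The hit names and E-values are hypothetical.

```python
INCLUSION_THRESHOLD = 0.005  # default inclusion threshold mentioned in the text


def split_by_threshold(hits, threshold=INCLUSION_THRESHOLD):
    """Split (name, evalue) pairs into hits included in the profile and the rest."""
    included = [(name, e) for name, e in hits if e < threshold]
    excluded = [(name, e) for name, e in hits if e >= threshold]
    return included, excluded


# Hypothetical hit list, for illustration only.
hits = [
    ("hit_1", 2e-90),
    ("hit_2", 4e-12),
    ("hit_3", 0.003),
    ("hit_4", 0.02),
    ("hit_5", 1.4),
]

included, excluded = split_by_threshold(hits)
print([name for name, _ in included], [name for name, _ in excluded])
```

A true family member such as an insecticyanin can end up in the excluded list, which is why the list below the threshold deserves inspection as well.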

To check whether the sequences selected by PSI-BLAST belong to the protein family, select them (all of them (lipocalin_PsiBlastit_abovetreshold.fasta) or a selection). For instance, select the sequences listed above the inclusion threshold (E value better than 0.005), save them as a FASTA file and import them into BioEdit. Delete the sequences that are either much shorter or much longer than the remainder (lipocalin_PsiBlastit_abovetreshold_deleted.fasta), because sequences of deviating length might interfere with the ClustalW multiple alignment. Export the sequences as a FASTA file from BioEdit and use them as input for a multiple alignment in ClustalW. Inspect the multiple alignment: you can clearly see that these sequences belong to the same family (clustalw_psiblastabovethreshold_deleted.aln). In the multiple alignment you can clearly discern two subfamilies (as can also be seen from their names, apolipoprotein D and retinol binding protein), but the same signature is conserved in both subfamilies (indicated by a box in the alignment).
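The manual deletion of much shorter or much longer sequences can be approximated with a simple length filter. The sketch below keeps sequences whose length lies within a tolerance around the median; the 30% tolerance and the records are assumptions for illustration, not a rule used by BioEdit or ClustalW.

```python
from statistics import median


def filter_by_length(records, tolerance=0.3):
    """Keep (name, sequence) records whose length is within +/- tolerance of the median."""
    med = median(len(seq) for _, seq in records)
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    kept = [(name, seq) for name, seq in records if low <= len(seq) <= high]
    removed = [name for name, seq in records if not low <= len(seq) <= high]
    return kept, removed


# Hypothetical records: only the sequence lengths matter here.
records = [
    ("typical_1", "A" * 180),
    ("typical_2", "A" * 190),
    ("typical_3", "A" * 200),
    ("fragment", "A" * 60),
    ("fusion", "A" * 450),
]

kept, removed = filter_by_length(records)
print([name for name, _ in kept], removed)
```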

(4)

Now perform the ClustalW alignment of the sequences selected above and below the E value of 0.005 (lipocalin_PsiBlastit1.fasta, lipocalin_PsiBlastit1_deleted.fasta). Note that it now contains a few sequences that do not contain the specific signature of the lipocalin family. These are false positives (misannotations). We can try to keep them out of the PSI-BLAST search by performing a PHI-BLAST as a first step, using an identified signature as a pattern.

Instead of constructing all the multiple alignments yourself, you can simply change the view of the BLAST results so that you immediately see the multiple alignment instead of all the pairwise alignments.

(5)

Now perform the same PSI-BLAST search with the human lipocalin as a query, but limit your search to the mammalian sequences (the databases are too large: if you use the nr database you might need hundreds of iterations before convergence).

Run PSI-BLAST for a few iterations (at least 5 will be needed to find the remotest members of the lipocalin family).

The family contains 3 major protein groups (the retinol binding proteins, the apolipoproteins and the odorant binding proteins). Reiterate PSI-BLAST until you have found representative members of the three subfamilies.

Pick a representative of each subfamily (or a few, 5sequenceafterPsi5_mammals_lipocalin.fasta) and construct a multiple alignment using ClustalW. Note that in the alignment only a few amino acids are still conserved over these long evolutionary distances (GXW[Y][A]). This is a signature of the lipocalin protein family. In the subsequent iterations of the PSI-BLAST search, these conserved residues will carry most of the weight in the profile. Gradually you will detect more remote members of the family.

(6)

To further enhance the specificity of the PSI-BLAST search, the initial BLAST search can be enhanced with a PHI-BLAST search. This requires that the retrieved sequences are both similar to the query sequence and contain the signature of the protein family (pattern).
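PHI-BLAST patterns are written in PROSITE syntax. The sketch below converts a small subset of that syntax (residues, x, [..], {..} and repeat counts) into a regular expression and keeps only the sequences that contain the pattern. It covers the pattern requirement only, not the similarity search, and the pattern G-x-W and the sequences are simplified, hypothetical examples.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, optionally with (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core.lower() == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def contains_pattern(sequence, pattern):
    """True if the sequence contains at least one match to the pattern."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


# Hypothetical candidate hits, for illustration only.
candidates = {"hitA": "MKTGAWYALA", "hitB": "MKTLLPQRSA"}
with_signature = [name for name, seq in candidates.items() if contains_pattern(seq, "G-x-W")]
print(prosite_to_regex("G-x(2,4)-{P}-W"), with_signature)
```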

Of course, it is not straightforward to find this pattern if the protein family is largely unknown. One could search the literature to see whether a pattern has already been described for any member of this family (preferably the ones closely related to your query sequence). Otherwise one could perform a PSI-BLAST search, make a multiple alignment of related sequences as shown above and check whether the same residues are consistently conserved. These are the most likely candidates to form a pattern. The initial PSI-BLAST result might contain false positive hits. These can be removed by refining the BLAST search with a pattern search (using the pattern detected in an initial screening).
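Checking which residues are consistently conserved can be sketched as a scan over the columns of the multiple alignment. The toy alignment and the conservation cut-offs below are hypothetical; in practice the candidate pattern should still be checked by eye against the full alignment.

```python
from collections import Counter


def consensus_pattern(alignment, min_fraction=1.0):
    """Return a string with the conserved residue per column, or 'x' otherwise.

    A column counts as conserved when its most common symbol is a residue
    (not a gap) present in at least min_fraction of the sequences.
    """
    n = len(alignment)
    symbols = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / n >= min_fraction:
            symbols.append(residue)
        else:
            symbols.append("x")
    return "".join(symbols)


# Hypothetical aligned sequences ('-' marks a gap).
toy_alignment = [
    "KGTWYA-L",
    "RGSWYAEL",
    "KGAWFA-I",
]

strict = consensus_pattern(toy_alignment)        # fully conserved columns only
relaxed = consensus_pattern(toy_alignment, 0.6)  # tolerate one deviating sequence
print(strict, relaxed)
```

Lowering the cut-off yields a longer but less reliable pattern, so the strict version is the safer starting point for a pattern search.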