Using the Blocks Database to Recognize Functional Domains

Blocks Tutorial

Jorja G. Henikoff

Fred Hutchinson Cancer Research Center

Telephone: 206-667-4509

Fax: 206-667-5889

Email: jorja@fhcrc.org

Elizabeth A. Greene

Fred Hutchinson Cancer Research Center

Telephone: 206-667-6576

Fax: 206-667-6497

Email: eagreene@fhcrc.org

Nick Taylor

Fred Hutchinson Cancer Research Center

Telephone: 206-667-6576

Fax: 206-667-6497

Email: ntaylor@hmc.edu

Shmuel Pietrokovski

Weizmann Institute of Science

Telephone: +972 (8) 934 2747

Fax: +972 (8) 934 4180

Email: shmuel.pietrokovski@weizmann.ac.il

Steven Henikoff

Howard Hughes Medical Institute

Fred Hutchinson Cancer Research Center

Telephone: 206-667-4515

Fax: 206-667-5889

Email: steveh@fhcrc.org

Key terms: protein motif, amino acid sequence conservation, multiple sequence alignment, protein homology searching, PCR primer design

Abstract

Blocks are ungapped multiple alignments of segments of related protein sequences that correspond to the most conserved regions of proteins. The Blocks Database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins. Procedures in this unit describe the analysis of proteins and families using Blocks-based tools, including searching, exploring relationships with trees, making blocks and designing PCR primers with blocks for isolating homologous sequences.

Using the Blocks Database to Recognize Functional Domains

Blocks are ungapped multiple alignments of segments of related protein sequences that correspond to the most conserved regions of proteins. The Blocks Database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins (Henikoff and Henikoff, 1991). The current Blocks+ Database, generated by the automated PROTOMAT system, includes protein families documented in InterPro (Apweiler et al., 2000) and Prints (Attwood et al., 2000).

Part 1 describes retrieval of a Blocks Database entry and numerous options for displaying and analyzing conserved sequence information. Appendix 1 describes searching other databases with block queries (Pietrokovski et al., 1998).

Parts 2 and 3 describe procedures for analyzing a sequence of interest using Blocks-based tools. Part 4 introduces the ProWeb Tree Viewer, a graphical tool that facilitates the exploration of relationships between protein family members.

Part 5 illustrates how a user can create blocks from a set of related sequences using Block Maker (Henikoff et al., 1995). Part 6 describes the use of blocks in designing optimal PCR primers by applying the CODEHOP strategy (Rose et al., 1998). These procedures are illustrated with an example of current interest.

Part 1: EXPLORING PROTEIN FAMILIES USING THE BLOCKS DATABASE

The blocks for each protein family entry in the Blocks Database can be retrieved and displayed, and can be used as queries in searches of other databases. There are three ways to access information in the Blocks Database:

Web interface. The best way to access the Blocks Database is through the Web at http://blocks.fhcrc.org/.

Download. The Blocks Database is available as a text file from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/unix/.

Necessary Resources

Hardware. A workstation, personal computer or terminal connected to the Internet.

Software. An E-mail program for the E-mail interface, and any type of Web browser for the Web interface. Either the Chime or the Rasmol helper application is needed to view protein structures using a browser. A file transfer program is needed to download the data files.

Data Files. The Blocks Database is distributed as an ASCII text file.

1. Get a Blocks Database entry.

We use the Blocks Database entry for the C-5 cytosine-specific DNA methylases as an example. Open the Blocks Web site in a Web browser: http://blocks.fhcrc.org/. The first window to appear is shown in Figure 1.

a. Click on "Get Blocks by key word".

b. Enter "cytosine and methylase" and hit "Enter". One item is returned, the entry IPB001525. This is the Blocks Database accession number for blocks made from the InterPro family with accession number IPR001525 (Apweiler et al., 2000).

c. Click on the link to IPB001525. The entire Blocks Database entry for IPB001525 is shown in text format. The first page is reproduced in Figure 2. There are six blocks for this family labeled IPB001525A to IPB001525F. Links at the top of the page lead directly to the blocks. The first part of IPB001525B is shown in Figure 3. Each block starts with ID, AC and DE lines adapted from InterPro. They list, respectively, the InterPro short identifier, the Blocks accession number, and the InterPro description of the family. The AC line also includes the minimum and maximum distance from the end of the previous block to this block across all sequences. For the A block, these numbers are the distances from the beginning of the sequences. The BL line following the DE line in each block contains information from PROTOMAT, including a three-character motif, the width of the block and the number of sequence segments in it.

Additional numerical calibration points (99.5% and strength) are used by the BLIMPS searching program described in Parts 2 and 3.

The aligned sequence segments follow the BL line in each block. Each segment is preceded by its sequence identifier from Swiss-Prot/TrEMBL (Bairoch and Apweiler, 2000) and by the position of the segment's first residue in that sequence. Clicking on the sequence identifier link brings up the Swiss-Prot/TrEMBL entry for the sequence. Sequence segments are clumped and separated by blank lines if at least 80% of the aligned residues match between any pair of segments. Numerical sequence segment weights are shown to the right of each segment (Henikoff and Henikoff, 1994). The higher this weight, the more dissimilar the segment is from other segments in the block, with the segment most dissimilar from all others having a weight of 100.
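The weighting idea can be illustrated with a minimal Python sketch. It assumes the position-based scheme usually associated with the cited Henikoff and Henikoff (1994) paper: in each column a segment receives 1/(k x n), where k is the number of distinct residues in the column and n is the number of segments sharing its residue. The sums are then rescaled so the largest weight is 100. The function name and the toy block are hypothetical, and the weights shown in the Blocks Database may be computed with details not reproduced here.

```python
from collections import Counter


def position_based_weights(segments):
    """Return one weight per segment, rescaled so the largest weight is 100.

    Assumed scheme: each column contributes 1 / (k * n) to a segment,
    where k is the number of distinct residues in the column and n is
    the number of segments that share this segment's residue there.
    """
    if not segments:
        return []
    width = len(segments[0])
    if any(len(seg) != width for seg in segments):
        raise ValueError("all segments in a block must have the same width")

    raw = [0.0] * len(segments)
    for col in range(width):
        counts = Counter(seg[col] for seg in segments)
        kinds = len(counts)
        for i, seg in enumerate(segments):
            raw[i] += 1.0 / (kinds * counts[seg[col]])

    top = max(raw)
    return [round(100.0 * w / top) for w in raw]


# Hypothetical toy block, not taken from IPB001525.
toy_block = ["PCQAF", "PCQAF", "PCQGW", "SCQDY"]
weights = position_based_weights(toy_block)
print(weights)  # [60, 60, 80, 100]
```

The two identical segments share the lowest weight, and the segment that differs from the others in the most columns receives 100, matching the behavior described above.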

Each block in a Blocks Database entry contains segments from the same sequences, but the order is different since the segments clump differently in each block. The six IPB001525 blocks each contain segments from the same 158 sequences.
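Because clumping depends only on the segments within one block, the same sequences can group differently from block to block. The sketch below is one plausible reading of the 80% rule stated above: two segments fall into the same clump if they are connected by a chain of pairs that are each at least 80% identical (single linkage). The function names and toy segments are hypothetical, and the exact procedure used to build the database may differ.

```python
def identity(a, b):
    """Fraction of aligned positions at which two equal-width segments match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def clump(segments, threshold=0.8):
    """Group segment indices by single linkage at the given identity threshold."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the representative of i's clump.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical segments: 0 and 2 are only 60% identical to each other,
# but each is 80% identical to segment 1, so all three clump together.
toy_segments = ["AAAAAAAAAA", "AAAAAAAAGG", "AAAAAAGGGG", "CCCCCCCCCC"]
clumps = clump(toy_segments)
print(clumps)  # [[0, 1, 2], [3]]
```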

At the top of the Blocks Database entry page are several links that provide additional information and views.

2. Display blocks graphically.

a. Map. Click on "Block Map". The locations of all six blocks in all 158 sequences are displayed.

b. Logos. Under the "Logos" bullet, select "GIF" display format. The six blocks are shown as sequence logos (Schneider and Stephens, 1990) reproduced in Figure 4. A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position, and the total height of all the residues in the position is proportional to the conservation of the position. Highly conserved motifs, such as the "PCQ" in IPB001525B and "ENV" in IPB001525C, stand out more clearly in logos than in the text format.

Logos may also be displayed in other formats.
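The logo description can be made concrete with a short Python sketch for a single column. It assumes the basic information-content measure of Schneider and Stephens (1990) for a 20-letter alphabet: column height = log2(20) minus the column's entropy, with each residue's height being its frequency times that total. It omits any small-sample correction or sequence weighting, which the logos on the Blocks site may apply. The function name is hypothetical.

```python
import math
from collections import Counter


def logo_column(residues, alphabet_size=20):
    """Return {residue: letter height in bits} for one alignment column.

    Total column height is log2(alphabet_size) minus the Shannon entropy
    of the observed residue frequencies (no small-sample correction).
    """
    counts = Counter(residues)
    total = len(residues)
    freqs = {aa: n / total for aa, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {aa: f * information for aa, f in freqs.items()}


# A fully conserved column reaches the maximum height, log2(20) bits.
conserved = logo_column("PPPP")
# A mixed column is shorter, and P is still its tallest letter.
mixed = logo_column("PPSA")
print(conserved)
print(mixed)
```

In this simplified form an invariant position, such as each residue of the "PCQ" motif if it were perfectly conserved, would reach the full log2(20) bits (about 4.32), while variable positions shrink toward zero.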

c. Phylogenetic tree. Under the "Tree from blocks alignment" bullet, select "ProWeb TreeViewer". It takes a few minutes to build and display a phylogenetic tree computed from the sequence segments in the blocks (Chapter 6). The tree is displayed in a separate browser window. The ProWeb TreeViewer is discussed in Part 4.

d. Protein structures. If any of the sequences in the blocks for a family has a structure in the Protein Data Bank (http://www.rcsb.org/pdb/), then the blocks can be displayed on the structure. Select "PDB entries". Two sequences in the blocks, MTH1_HAEHA and MTH3_HAEAE, have known structures that overlap the block regions, 6MHT and 1DCT respectively. Click on "6MHT" under the "3D Blocks" column. A thumbnail sketch of the structure with the six blocks marked in different colors is displayed, along with links to start Web browser helper applications for the Chime or Rasmol structure viewers.

3. Other links.

a. Design polymerase chain reaction (PCR) primers from blocks. The COnsensus-DEgenerate Hybrid Oligonucleotide Primers (CODEHOP; Rose et al., 1998) tool designs PCR primers from protein multiple alignments. It is described in Part 6.

b. Predict amino acid substitutions in blocks. The Sorting Intolerant from Tolerant (SIFT; Ng and Henikoff, 2001) program predicts which amino acid substitutions in each block position are likely to affect protein function. Clicking on the SIFT link brings up the SIFT entry form with the IPB001525 blocks inserted.

c. Additional links. For some families in the Blocks Database, links are provided to other Web sites with related information. For IPB001525, there are links to CYRCA (Kunin et al., 2001) and MetaFam (Silverstein et al., 2001).

Appendix 1: SEARCH BLOCKS VERSUS OTHER DATABASES.

Representations of the six IPB001525 blocks can be used to search other databases for additional C-5 cytosine-specific DNA methylases. This approach is more powerful than searching with a single protein sequence (Henikoff and Henikoff, 1997).

1. COBBLER sequence.

Select "COBBLER sequence" under the "Search Blocks vs other databases" bullet. COBBLER stands for COnsensus Biasing By Locally Embedding Residues. A single sequence is selected from the set of blocks and enriched by replacing the conserved regions delineated by the blocks with consensus residues derived from the blocks. Embedding consensus residues improves performance with readily available single sequence query searching programs, such as BLAST [ (Altschul et al., 1990); Unit 3A.4] and FASTA [ (Pearson, 1990); Unit 3A.5]. The IPB001525 blocks are embedded in the portion of MTF1_FUSNU spanned by the blocks. The blocks are shown in upper case and the intervening sequence in lower case. Click on "Gap-Blast Search" and a search of the COBBLER sequence against the non-redundant protein database is automatically started at NCBI's Blast Web site in a separate browser window. Other BLAST searching options are also provided. The COBBLER sequence may also be copied and pasted into other sequence searching Web pages.
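The embedding step can be pictured with a small Python sketch. It only performs the final substitution: given a chosen sequence and, for each block, a start position and a consensus string, it writes the consensus in upper case over the block region and leaves the rest in lower case, as in the COBBLER display described above. How COBBLER selects the sequence and derives the consensus residues is not reproduced here. The function name, the toy sequence and the consensus strings are hypothetical.

```python
def embed_consensus(sequence, blocks):
    """Overlay block consensus residues on a sequence.

    blocks is a list of (start, consensus) pairs, with start given as a
    0-based offset into sequence. Block regions come back in upper case,
    intervening sequence in lower case.
    """
    out = list(sequence.lower())
    for start, consensus in blocks:
        end = start + len(consensus)
        if start < 0 or end > len(sequence):
            raise ValueError("block does not fit inside the sequence")
        # Same-length slice assignment keeps the sequence length unchanged.
        out[start:end] = consensus.upper()
    return "".join(out)


# Hypothetical 16-residue sequence with two hypothetical blocks.
cobbler_like = embed_consensus("MKVLDLFSGIGGFSLG", [(4, "DLFSG"), (11, "FSQA")])
print(cobbler_like)  # mkvlDLFSGigFSQAg
```

The result is an ordinary single sequence, which is why it can be pasted into any sequence searching program that accepts a protein query.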

2. MAST search.

Select "MAST search" under the "Search Blocks vs other databases" bullet and a MAST searching form will appear in a separate browser window. MAST is a searching tool at the San Diego Supercomputer Center [(Bailey and Gribskov, 1998); Unit 2.5]. The six IPB001525 blocks are converted into numerical position-specific scoring matrices (PSSMs; Henikoff and Henikoff, 1996), each holding 20 scores per block position, one for each amino acid's probable occurrence at that position.

MAST scans all six of these PSSMs against one of several amino acid or nucleotide sequence databases and returns the results by E-mail. Enter an E-mail address in the MAST form and select a sequence database to search. Consult the MAST help files by clicking on the links for the other options. For our example, select the Drosophila database and accept the defaults for the other options. MAST will search for C-5 cytosine-specific DNA methylases among Drosophila proteins. The list of MAST hits is shown in Figure 5. The top hit, AAF53163.1, is an unequivocal DNA methyltransferase homolog with an E-value of 4.7 x 10^-26.
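To show what a position-specific scoring matrix looks like and how it is scanned along a sequence, here is a deliberately simple Python sketch. It assumes unweighted counts, a flat pseudocount and a uniform background. The method cited above (Henikoff and Henikoff, 1996) and the MAST program use more refined pseudocounts, sequence weights and statistics, so this is not their implementation. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_to_pssm(segments, pseudocount=1.0):
    """Build a list of columns, each a dict of 20 log-odds scores (bits)."""
    n = len(segments)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    denom = n + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(segments[0])):
        counts = Counter(seg[col] for seg in segments)
        pssm.append({
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, query):
    """Slide the PSSM along the query; return (start, score) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        # Residues outside the 20-letter alphabet score 0.0.
        score = sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical three-column block and query.
toy_pssm = block_to_pssm(["PCQ", "PCQ", "SCQ"])
hit = best_hit(toy_pssm, "MKAPCQGF")
print(hit[0])  # 3 (0-based start of the window "PCQ")
```

Each column holds exactly 20 scores, and a search tool reports the windows whose summed scores are statistically unlikely to arise by chance.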

3. LAMA search.

Select "LAMA search" under the "Search Blocks vs other databases" bullet and a LAMA searching form will appear in a separate browser window with the IPB001525 blocks inserted in the query field. LAMA (Local Alignment of Multiple Alignments) is a program for comparing protein multiple sequence alignments with each other (Pietrokovski, 1996). The program can search databases of multiple alignments in the Blocks Database format. The search is for sequence similarities between conserved regions of protein families. The method is sensitive, detecting weak sequence relationships between protein families and sequence similarities beyond the range of conventional sequence database searches. Under the "Select database to search" heading on the LAMA form, select "Prints Database" and click the "Perform Search" button. The Prints Database [(Attwood et al., 2000); Unit 2.8] is another collection of ungapped conserved regions of protein families similar in philosophy to the Blocks Database. Four hits are reported by LAMA to two different Prints entries. IPB001525A,C,D are aligned with PR00105A,B,C. PR00105 is the Prints entry for cytosine-specific DNA methyltransferases. Click on the "Logo" icon at the right of each LAMA hit to see the blocks aligned as logos. IPB001525C has a weaker alignment with PR00115E, the fifth of six blocks representing the fructose-1,6-bisphosphatases in the Prints Database. The aligned logos for these two blocks show that both blocks have highly conserved P, F and E residues in the same relative positions.

Part 2: ANALYZING PROTEIN SEQUENCES WITH THE BLOCK SEARCHER

The primary use of the Blocks Database is to classify a query sequence as belonging to one or more known protein families based on sharing conserved regions. This part discusses classifying a protein query and Part 3 discusses classifying a DNA sequence query.

Web interface. The best way to compare a query sequence with the Blocks Database is through the Web at http://blocks.fhcrc.org. Three different searching programs are available.

UNIX programs. Programs to search the Blocks Database and analyze results are available for UNIX systems from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/unix/blimps/.

Data Files Query sequences are accepted in FASTA or GenBank format.

1. Select a searching option.

Open the Blocks Web site in a Web browser: http://blocks.fhcrc.org/ (Figure 1). Three searching options are provided: Block Searcher (Henikoff and Henikoff, 1991), Reverse PSI-BLAST Searcher and IMPALA Searcher (Schaffer et al., 1999). Block Searcher uses the original BLIMPS (Henikoff et al., 1995) program. Reverse PSI-BLAST and IMPALA are searching programs from the NCBI group and use the BLAST searching algorithms and statistics (Schaffer et al., 2001). All three of these programs convert blocks to position-specific scoring matrices (PSSMs) for searching. Of the three, Reverse PSI-BLAST is the fastest way to search the Blocks Database, requiring less than a minute for the average protein query on our Web server. eMotif at Stanford University (Huang and Brutlag, 2001) is an even faster, although less sensitive, way to search the Blocks Database, requiring perhaps a second to search the Blocks+ and Prints databases. eMotif attains high speed by searching amino acid strings rather than PSSMs.

Whereas Reverse PSI-BLAST, IMPALA and eMotif are limited to protein query sequences, Block Searcher accepts both protein and nucleotide query sequences, translating DNA sequences on-the-fly. Click on the "Block Searcher" link and the form shown in Figure 6 appears. Links to PSI-BLAST, IMPALA, eMotif and other protein family searching sites are included on the Block Searcher page.

2. Submit a protein query sequence to the Block Searcher.

For our example, we are interested in cytosine methyltransferases in Drosophila. Using step 2 of Appendix 1 for this protein family, a MAST search of the IPB001525 blocks against Drosophila proteins returns GenBank sequence AAF53163.1 as the top hit (Figure 5). Follow the 'E' link for AAF53163.1 to the amino acid sequence entry, display it in FASTA format, and copy and paste it into the sequence box of the Block Searcher form. Accept the default values for the rest of the options on the form and click "Perform Search". The search takes a few minutes and the results can optionally be returned to an E-mail address.

By default the "Blocks+" Database is searched. This database represents InterPro families (Apweiler et al., 2000), plus additional families from Prints (Attwood et al., 2000). Blocks for the InterPro families are made by PROTOMAT, but Prints blocks are taken directly from the Prints Database. Optionally the "Blocks+ Database without compositionally biased blocks" may be searched. This is a subset of Blocks+ with highly biased blocks removed to reduce false positive hits to compositionally biased queries. A description of the current release of Blocks+ is at http://blocks.fhcrc.org/blocks_release.html. The entire Prints Database may also be searched.

The default cutoff expected value is 1; an average protein is expected to hit one protein family by chance. There are several output options. However, all but "Summary with alignments" and "Summary", which omits the alignments, are specialized and not generally recommended. The BLIMPS searching program will examine the query sequence to determine whether it is amino acid or DNA, but sequence type may be specified. These are the only options for a protein query.

3. Examine results returned by the Block Searcher.

Block Searcher results are prefaced by a description of the version of the Blocks Database searched and a brief description of the output format. The query title and length are then listed, followed by the number of blocks compared and the number of query-block alignments scored. In Figure 7, the top hit for AAF53163.1 is IPB001525 with a combined E-value of 6.9 x 10^-27 for five of the six IPB001525 blocks, which are aligned with the query in the correct order and with distances between them compatible with those observed in sequences in the blocks. A second hit to PR01035 with an expected value of 0.67 is probably spurious because only one of twelve PR01035 blocks was aligned.
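The "correct order and compatible distances" criterion can be sketched in Python as follows. The sketch assumes that each hit records a block letter, a start position and a block width, and that the allowed distance range from the end of the previous block comes from the AC line described in Part 1. It checks order for all hits, and checks the distance range only between consecutive block letters. This is an illustration, not the Block Searcher's actual scoring rule, and all names, widths and ranges are hypothetical.

```python
def hits_are_consistent(hits, distance_ranges):
    """Check that block hits are in family order with plausible spacing.

    hits: list of (block_letter, start, width) for one family.
    distance_ranges: {block_letter: (min_dist, max_dist)}, the distance
    from the end of the previous block to the start of this block.
    """
    ordered = sorted(hits, key=lambda h: h[0])  # order by block letter
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - (prev[1] + prev[2])
        if gap < 0:
            return False  # out of order or overlapping
        consecutive = ord(cur[0]) - ord(prev[0]) == 1
        if consecutive:
            low, high = distance_ranges[cur[0]]
            if not low <= gap <= high:
                return False
    return True


# Hypothetical widths and distance ranges (not the real IPB001525 values).
example_ranges = {"C": (5, 30), "D": (10, 60)}
in_order = [("B", 70, 12), ("C", 95, 15), ("D", 130, 20)]
swapped = [("B", 95, 12), ("C", 70, 15)]
print(hits_are_consistent(in_order, example_ranges))  # True
print(hits_are_consistent(swapped, example_ranges))   # False
```

A hit supported by several blocks that pass such a check is much harder to obtain by chance than a single isolated block, which is why the one-block PR01035 hit is treated with suspicion.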

Because there is often a question concerning the reality of twilight zone hits with marginal E-values, you should ask whether a suspected match is detectable using a different searching program. Reverse PSI-BLAST (or IMPALA) and eMotif differ very substantially from Block Searcher and each other in the way they align and score matches, and so they are unlikely to detect the same chance similarities. Verify your search of AAF53163.1 using Reverse PSI-BLAST (http://blocks.fhcrc.org/blocks/rpsblast.html). Reverse PSI-BLAST reports a portion of the CXXC zinc finger family (IPB002857) with E=0.37. Investigation of IPB002857 reveals that it was annotated by InterPro as a domain that is usually found upstream of cytosine methylases. Because the Blocks Database is generated automatically, it occasionally includes a conserved region adjacent to the annotated domain if that region is found in a large fraction of the sequences.

Thus, IPB002857 includes regions from known cytosine methyltransferases that are found in IPB001525. As Reverse PSI-BLAST allows gaps, it tends to extend alignments, increasing sensitivity at the expense of selectivity relative to Block Searcher.

Following the hit summary are alignment details for each hit. Each block in the hit and its location in the query sequence is listed with individual expected values. A schematic map is shown to compare the block alignments with the range of alignments in sequences in the blocks. Finally, the query segment is aligned with a single segment most like it from each block. AAF53163.1 is aligned with IPB001525B-F, but not with IPB001525A. IPB001525B is aligned starting at position 70, IPB001525C at position 95, etc. The query is most similar to Trembl sequence O35212 in IPB001525B, to O43669 in IPB001525D, and to PMT1_SCHPO|P40999 in IPB001525C, E and F.

Because the alignments of AAF53163.1 with IPB001525B-F are so clear, it is curious that Block IPB001525A is missing. The GenBank annotation (Unit 1.2) documents AAF53163.1 as a predicted protein from a large sequencing project, and it thus may not have been adequately scrutinized. It was translated from AE003635.1:7013..8050, so a DNA query can be extracted from AE003635 including more upstream sequence where the A block may lie. One such query is shown in Figure 8.

Part 3: ANALYZING DNA SEQUENCES WITH THE BLOCK SEARCHER

If you have a DNA query, the Block Searcher will translate it into protein in all frames on one or both strands. Each block in a family is aligned with the translated query sequence independently and then hits are assembled on each strand. Therefore, all blocks in a hit are on the same strand, but not necessarily in the same frame.
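A minimal Python sketch of six-frame translation follows. It uses the standard genetic code and numbers forward frames 1 to 3 and reverse-strand frames -1 to -3. That numbering and the function names are assumptions made for illustration, and the BLIMPS program may label frames and handle ambiguous bases differently.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna, both_strands=True):
    """Return {frame: protein}; frames 1..3 forward, -1..-3 reverse."""
    frames = {offset + 1: translate(dna[offset:]) for offset in range(3)}
    if both_strands:
        rc = reverse_complement(dna)
        for offset in range(3):
            frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frames("ATGCCATGTCAGTAA")
print(frames[1])  # MPCQ*
print(frames[2])  # CHVS
```

Because each block is aligned against every translated frame independently, two blocks of one hit can land in different frames of the same strand, for example on either side of a frameshift or an unrecognized intron.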

Web interface The best way to compare a DNA query sequence with the Blocks Database is through the Web at http://blocks.fhcrc.org/blocks_search.html.

Data Files Query sequences are accepted in FASTA or GenBank format.

1. Select a searching option.

Open the Blocks Web site in a Web browser: http://blocks.fhcrc.org (Figure 1). Only the Block Searcher (Henikoff and Henikoff, 1991) will handle a DNA query sequence, translating it on-the-fly. Click on the “Block Searcher” link and the form shown in Figure 6 appears.

2. Submit a DNA query sequence to the Block Searcher.

Copy and paste the DNA sequence (Figure 8) into the Block Searcher form. Because DNA queries are translated before comparing them with the Blocks Database, three (one strand) or six (both strands) times as many comparisons are made as for a protein query. Therefore, this type of search takes longer and may result in higher background levels of false positive hits. To reduce the background, select the "Blocks+ database without compositionally biased blocks". Hits are pieced together from blocks on the same strand, although they may be in different frames. To reduce search time, select "Forward Strand" under "Additional optional search parameters for a DNA query" at the "Strands to search" bullet.

You may also want to select "DNA" under "Optionally force query sequence type". If extra line feeds are introduced during copy-and-paste so that a long FASTA title line becomes two lines, BLIMPS may decide the query is protein. It is a good idea to check the title line after the paste operation as different workstations and browsers produce different results.

Click the "Perform Search" button and wait for the results (usually a few minutes).

3. Examine results returned by the Block Searcher.

This time, the top hit includes all six IPB001525 blocks with a combined E-value of 2.9 x 10^-31 (Figure 9). The A block is in a different frame (1) than the other five blocks (2) and is located upstream of the region of AE003635 translated for AAF53163.1. Further analysis (not shown here) reveals a 49-nucleotide intron between the A and B blocks that was missed by the gene prediction programs. The corrected protein, now dubbed "Dnmt2", is shown in Figure 10.

The Block Searcher results with the DNA query also report a hit to the single block representing IPB001529 with an E-value of 0.7. Because this RNA polymerase M/15 Kd subunit family is represented by only one block in the Blocks Database, the hit cannot be dismissed as a false positive on the grounds that other blocks are missing. However, the alignment reveals a stop codon within the block region of the query, which makes a true match unlikely. The corrected protein can be searched against the Blocks Database to see if this hit again turns up (it does not, nor does the hit to PR01035 (Figure 7) show up with the corrected query).

In order to explore how Dnmt2 relates to the other C-5 cytosine-specific DNA methylases, click on one of the IPB001525 links on the Block Searcher results to get the Part 1 page for this group.

Part 4: VIEWING TREES BASED ON BLOCKS

A phylogenetic tree is made for each protein family in the Blocks Database using the multiple alignments in the block regions only. The neighbor-joining algorithm [ (Saitou and Nei, 1987); Unit 6.4] is applied using Clustal W [ (Thompson et al., 1994); Unit 2.4]. The Kimura correction for multiple substitutions is applied. If there are not too many sequences, 100 bootstrap values are calculated. The output from Clustal W is a tree file in a format which can be read by most tree display programs (Unit 6.2).
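The distance correction can be illustrated with a short Python sketch. It assumes the widely quoted form of Kimura's empirical correction for protein distances, D = -ln(1 - p - 0.2 p^2), where p is the fraction of differing positions. This is offered as the likely meaning of "Kimura correction" here rather than as a statement of Clustal W's exact implementation, which may treat very divergent pairs differently. The function names and toy segments are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of positions that differ between two ungapped, equal-width segments."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal width")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def kimura_distance(p):
    """Kimura-style correction for multiple substitutions at the same site."""
    inside = 1.0 - p - 0.2 * p * p
    if inside <= 0:
        raise ValueError("sequences too divergent for this approximation")
    return -math.log(inside)


# Hypothetical concatenated block segments from two sequences.
p = p_distance("PCQAFSGAGK", "PCQGFSLAGK")
d = kimura_distance(p)
print(p)            # 0.2
print(round(d, 4))  # 0.2332
```

The corrected distance is always at least as large as the observed fraction of differences, and the gap widens as sequences diverge. A matrix of such pairwise distances is the input to neighbor-joining.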

The ProWeb TreeViewer allows you to interactively explore trees made from blocks, zooming in on sections of interest, and to view additional information associated with the sequences used to create the tree. It also facilitates making new blocks from subtrees. This type of analysis is valuable when your sequence belongs to a clade from a large family which may have somewhat different properties than the entire family.

Web interface. The ProWeb TreeViewer is available through the Web at http://www.proweb.org/treeviewer/info.html.

1. Start the ProWeb TreeViewer.

From the Part 1 page for IPB001525, click on the "ProWeb TreeViewer" link near the top of the page. Alternatively, enter "IPB001525" in the form at http://www.proweb.org/treeviewer. A phylogenetic tree appears.

2. Select a subtree.

In the Block Searcher output from Part 3, step 3 (Figure 9), Dnmt2 is most like PMT1_SCHPO, O35212 and O43669.

Near the top of the page, you should see a subtree that contains sequences PMT1_SCHPO, O43669, O14717, O35212 and O55055 (Figure 11). Prune the tree to include only this subtree by clicking on the small solid blue box at the junction between PMT1_SCHPO and the other four sequences. The pruned tree is shown in a new browser window (Figure 12). Below the pruned tree are several links. "View FASTA files of these sequences" shows the full-length sequences included in the tree. "View extracted subclade Blocks" shows the sequence segments from the IPB001525 blocks for the five sequences following a MAST form which uses these pruned blocks as a query. There is also a link to the CODEHOP page described in Part 6 to design PCR primers from these pruned blocks.

3. Link to Block Maker.

Because the IPB001525 blocks represent conserved regions in all the 158 sequences in the group, they may not capture the conserved regions in this subtree particularly well. Click on "Run BLOCK MAKER on these sequences" to make new blocks from just these five sequences. The Block Maker input form will appear in a new browser window with the five sequences already inserted (Figure 13).

Part 5: USING BLOCK MAKER

Block Maker finds blocks in a group of related protein sequences. Block Maker uses the PROTOMAT algorithm (Henikoff and Henikoff, 1991), a two-step procedure. First, candidate motifs are found using a motif-finder. Then a best set of motifs is assembled along the length of most of the sequences.

Block Maker runs PROTOMAT twice, first using MOTIF (Smith et al., 1990) and second using a Gibbs sampler [ (Neuwald et al., 1995); Unit 2.13] as motif-finding algorithms. It returns both sets of blocks. While the system attempts to align all sequences provided, it will sometimes exclude sequences that are too diverged from the majority of sequences for the similarity to be detected.

Web interface. The best way to use Block Maker is through the Web at http://blocks.fhcrc.org/blocks/make_blocks.html. For the E-mail interface, instructions are returned when the word "help" appears in the subject heading of the E-mail message.

UNIX programs. Programs to make Blocks are available for UNIX systems from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/unix/blimps/.

Data Files Query sequences are accepted in FASTA or GenBank format.

1. Submit sequences to Block Maker.

The input to Block Maker is a set of related sequences in FASTA or GenBank format. In Figure 13, sequences have been preinserted for the subtree closest to Dnmt2 (Figure 12) by the ProWeb TreeViewer (Part 4). Sequences can be edited within the form. For our example, copy and paste the corrected Dnmt2 sequence ( Figure 10) into the form after the five preinserted sequences in order to make blocks from it and the other sequences in the subtree. As an alternative to copy and paste, the name of a file on your workstation containing the sequences can be entered in the "Enter the name of a file containing your protein sequences" field.

A minimum of three sequences is required to make blocks and Block Maker will accept up to 250 sequences depending on their combined lengths. Block Maker requires considerable computer resources, and so sequence sets with combined length of more than 15,000 amino acids must be submitted to the E-mail server by entering an email address on the Web form (Figure 13). Sets of sequences with a combined length of more than 100,000 cannot be processed by the Block Maker servers and the programs must be installed locally.
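The size limits above can be summarized as a small decision function. This Python sketch only restates the thresholds given in the text (at least three sequences, at most 250, more than 15,000 residues routed to the E-mail server, more than 100,000 requiring a local installation). The function name and return strings are hypothetical, and the server may apply further limits that depend on combined length.

```python
def block_maker_route(sequences):
    """Say how a set of protein sequences would have to be submitted."""
    count = len(sequences)
    combined = sum(len(seq) for seq in sequences)
    if count < 3:
        return "too few sequences"
    if count > 250:
        return "too many sequences"
    if combined > 100_000:
        return "install locally"
    if combined > 15_000:
        return "e-mail server"
    return "web form"


# Hypothetical inputs: only the lengths matter for this check.
small_set = ["A" * 350] * 6      # 2,100 residues in total
large_set = ["A" * 4000] * 5     # 20,000 residues in total
print(block_maker_route(small_set))  # web form
print(block_maker_route(large_set))  # e-mail server
```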

Enter "Dnmt2" in the "Enter a short description of your group of sequences" field and click the "Make Blocks" button.

2. Examine Block Maker results.

The Block Maker results resemble the Get Blocks display (Part 1) for an entry in the Blocks Database, except that two sets of blocks are displayed. Following an introduction briefly describing the result format is a "Block Maps" link which, when clicked, compares the locations of the MOTIF and Gibbs blocks. Both algorithms found five blocks, but they differ. The A and E blocks correspond, but the Gibbs B block lies between the MOTIF A and B blocks, and the MOTIF B and C blocks correspond to the Gibbs C and D blocks. The MOTIF D block lies between the Gibbs D and E blocks. The MOTIF B and C blocks are contiguous in all six sequences, as are the Gibbs C and D blocks. Both sets of blocks are wider than the blocks in IPB001525 because the six sequences used to make them are more similar to one another than are the 158 in IPB001525.

3. Blocks from Motif.

Click on "Blocks from Motif" (Figure 14). The "Logos", "Tree" and "Search" links are described in Part 1 and Appendix 1. It is instructive to do a LAMA search of these blocks against the Blocks Database to see how the blocks made from this subtree correspond to those from IPB001525 (Appendix 1, step 3). Click on the "LAMA" link to start the search. MOTIF misses the IPB001525B region containing the catalytic PCQ motif because PMT1_SCHPO has SCQ in this position. PMT1_SCHPO is a cryptic pseudogene in the unmethylated Schizosaccharomyces pombe genome, and replacement of SCQ by PCQ turns it into a DNA methyltransferase that is active in vitro (Pinarbasi et al., 1996).

4. Blocks from Gibbs.

Click on "Blocks from Gibbs" (Figure 15). It is again instructive to do a LAMA search of these blocks against the Blocks Database. In contrast to the MOTIF blocks, the Gibbs B block contains the PCQ motif corresponding to IPB001525B and correctly aligns PMT1_SCHPO in it. The Gibbs motif-finder uses a statistical approach that does not depend as heavily on sequence identity as does MOTIF, which looks for a few common residues in most sequences (Neuwald et al., 1995).

Click on the "CODEHOP" link to design primers from Gibbs blocks for polymerase chain reactions (PCR).

Part 6: DESIGNING PRIMERS FROM BLOCKS

The CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primers) program designs DNA primers that you can use to amplify distantly related homologs of a gene of interest (Rose et al., 1998). A CODEHOP primer has a degenerate 3' "core", 11-12 bp long and spanning four codons of highly conserved amino acids, and a non-degenerate 5' consensus "clamp" whose length depends on the desired annealing temperature and is typically between 20 and 30 bp.

Web interface. CODEHOP is accessible at http://blocks.fhcrc.org/codehop.html.

UNIX programs. The CODEHOP program is available for UNIX systems from ftp://ftp.ncbi.nlm.nih.gov/repository/blocks/unix/blimps/.

Data Files Input is in Blocks format as described at http://blocks.fhcrc.org/block_format.html. Utilities are available at http://blocks.fhcrc.org/process_blocks.html to convert common multiple alignment formats to Blocks format.

1. Submitting blocks to CODEHOP.

Blocks are inserted into the CODEHOP Web form by Get Blocks (Part 1) and by Block Maker (Part 5, steps 3 and 4, Figure 16). Near the top of the form is a link to the "Blocks multiple alignment processor", which carves out blocks from multiple alignments and then inserts them into the form. Alternatively, blocks can be copied and pasted into the form.

The blocks can be edited within the form. For instance, you may want to adjust the sequence segment weights to emphasize some sequences over others. Setting a sequence weight to zero will ignore the contribution of that sequence to the block.

2. CODEHOP parameters.

Usually it is only necessary to select an appropriate codon usage table for back-translation of amino acids and use the default values for the other parameters. If primers are not found when the defaults are used, then read the "Getting started" guide, which describes how to adjust the parameters systematically to obtain a satisfactory set. There are several parameters that can be set, including clamp annealing temperature, degeneracy and "strictness", which are described in detail in the "Full Help file".

3. CODEHOP results.

Starting with the Gibbs blocks from Part 5, step 4 (Figure 16), select the Drosophila melanogaster codon usage table by scrolling through the list of tables next to "Codon usage table", then click "Look for primers". You will see a large number of suggested primers, with the degenerate core in lower case, using the standard degenerate alphabet, and the consensus clamp in upper case. Some primers have the comment "CLAMP NEEDS EXTENSION". Using the CODEHOP strategy, primers cannot extend beyond the limits of the blocks and this comment indicates that the melting temperature is lower than desired. You can copy and paste the primer into the oligo temperature calculation site linked from the comment and add residues to the 5' end until the desired temperature is reached. The most reliable primers will be the least degenerate, and so by reducing the maximum degeneracy from the default of 128 to 32, a smaller set of primers is reported (Figure 17). These are mapped along a consensus sequence representing each block and summarized at the bottom of the page, providing all oligo sequences from 5' to 3' for ordering from a supplier. In the example, the best primer pair consists of an 8-fold degenerate primer to IPB001525A (tacgtrrtrcgGAACTTGCTCCGGG) and either of two overlapping 32-fold degenerate primers to the complement of IPB001525C (ctyttrcanktCCCGAAGCTCCACAGGT or ttrcanktyccGAAGCTCCACAGGTTCT).
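The degeneracy values quoted for the example primers can be checked by hand or with a few lines of code. The following is a minimal Python sketch, not part of CODEHOP itself. It assumes only what is stated above: the degenerate core is printed in lower case using the standard IUPAC degenerate alphabet, and the clamp is printed in upper case. The function names are hypothetical.

```python
from math import prod

# Number of nucleotides represented by each standard IUPAC code.
IUPAC_DEGENERACY = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "s": 2, "w": 2, "k": 2, "m": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def split_primer(primer):
    """Separate the lower-case degenerate core from the upper-case clamp."""
    core = "".join(ch for ch in primer if ch.islower())
    clamp = "".join(ch for ch in primer if ch.isupper())
    return core, clamp


def degeneracy(primer):
    """Number of distinct oligonucleotides in the degenerate primer pool."""
    core, _clamp = split_primer(primer)
    return prod(IUPAC_DEGENERACY[ch] for ch in core)


if __name__ == "__main__":
    for primer in (
        "tacgtrrtrcgGAACTTGCTCCGGG",
        "ctyttrcanktCCCGAAGCTCCACAGGT",
        "ttrcanktyccGAAGCTCCACAGGTTCT",
    ):
        core, clamp = split_primer(primer)
        print(primer, "core:", len(core), "nt", "clamp:", len(clamp), "nt",
              "degeneracy:", degeneracy(primer))
```

Each degenerate position multiplies the pool size by the number of bases it represents, so three "r" positions give 2 x 2 x 2 = 8, and one "n" with three two-fold positions gives 4 x 2 x 2 x 2 = 32.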

DATA INTERPRETATION

Different protein family search engines can produce different results, especially in the twilight zone. A search of the Blocks Database does not guarantee correct or complete results. An expected value is provided for each hit by Block Searcher based on statistics developed for the MAST system (Bailey and Gribskov, 1998), but the value can be skewed by compositional bias and repeated domains. Single block hits require careful evaluation, and it is important to verify uncertain hits using other searching methods, such as Reverse PSI-BLAST.

Phylogenetic trees are becoming increasingly useful for discerning subfamily relationships; however, they are no better than the alignments on which they are based. Although block alignments are limited to conserved regions, and so are likely to be correct, slight misalignments can occur within a block where it spans a short variable region. Other uncertainties in the reliability of trees stem from differences in rates of evolution between positions and from compromises made in constructing trees, in this case using neighbor-joining. Branch lengths indicate the degree of divergence of sequences; however, uncertainties in evolutionary rates add an unknown degree of uncertainty. Although Tree Viewer indicates which nodes are judged to be reliable by coloring those with 75% bootstrap support, this is meant only as a rough guide to reliability.

Block Maker includes two different motif-finding algorithms, MOTIF and Gibbs sampling, that use different scoring systems. As a result, it is unlikely that the same block alignment will be detected unless it is real. Block Maker always returns a set of blocks, even when these are from randomly chosen sequences. You may be surprised at how real such block alignments can appear, sometimes rivaling alignments that we are accustomed to seeing in molecular biology publications (Henikoff, 1991). An interesting exercise for students is to randomly select 10 sequences of >300 amino acids and run Block Maker on them. Using these blocks in a MAST search will invariably detect each of the sequences that went into them, despite the fact that the alignments have no meaning! This illustrates why you should never use the mere ability to obtain a plausible alignment between two sequences as evidence that they are related. Database searching is well-suited to validating similarity, as the E-values that are returned can be interpreted in the context of a comparison against a large set of truly unrelated proteins or families, without depending on subjective judgments.

COMMENTARY

Background Information

The utility of blocks

Blocks, or motifs, correspond to minimal units of protein function. They are typically short amino acid segments that are conserved in sequence and in length. Motifs form protein active sites, substrate and cofactor-binding sites, and structural features crucial for function. Although individual amino acids comprise smaller units than blocks, they are not sufficiently specific to define a unique function. For example, a position with either Asp or Glu residues can be part of a metal binding site, a protein binding site, etc. Larger units, made up of multiple motifs, comprise protein domains that most often correspond to structural folds. Some distinct domains nevertheless share common motifs, for example, HTH DNA binding motifs, P-loop ATP-binding motifs and Rossmann fold-like phosphate/sulphate binding loops. Unlike 3D structural folds, motifs do not generally assume a stable structure by themselves and depend on the presence of other (less sequence conserved) protein segments to support and position them. The alignment-based searching methods that comprise the Blocks system can be used for detection and analysis of protein functional building blocks in different contexts.

Block-based alignment methods differ from those based on global multiple sequence alignment. Both perform better than single sequence analyses in identifying the functionally critical sequence regions from a group of related sequences. Block-based methods are explicitly designed to identify conserved regions, whereas more global multiple sequence alignment usually includes alignment of both conserved and non-conserved regions. Global multiple sequence alignment may also be unable to align short conserved regions that are found in different contexts. Multiple blocks can be joined to achieve a global alignment, a strategy used by Gapped-BLAST and PSI-BLAST (Altschul et al., 1997), but the converse is not always true, because in global alignment the boundary between conserved and non-conserved is often unclear.

Global multiple alignment methods have been widely used to identify complete domains, which typically consist of multiple blocks and adjacent regions. These methods have become standard for automatic annotation of genomic sequence, because they tend to identify complete domains. Blocks-based methods are more suitable for analyzing critical regions and residues within domains, and so the two classes of methods are complementary.

Making blocks

Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences. Resulting candidate motifs are assembled into a best set along the lengths of the sequences to give a multiple alignment consisting of ungapped conserved regions separated by unaligned regions of variable size. The Blocks Database consists of blocks constructed from protein families cataloged in the InterPro (Apweiler et al., 2000) collection of protein families.

MOTIF looks for spaced triples in most of the sequences and aligns them around these triples (Smith et al., 1990). A spaced triple is a set of three amino acids separated by two distances. For Block Maker, all spaced triples with all combinations of two distances, each ranging from 0 to 17 amino acids, are tallied.
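The tallying of spaced triples can be illustrated with a short Python sketch. This is not the MOTIF implementation; it only enumerates triples and counts how many sequences contain each one. It assumes that a distance is the number of residues lying between two members of the triple, so that a distance of 0 means adjacent residues. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def spaced_triples(seq, max_dist=17):
    """Return the set of spaced triples (aa1, d1, aa2, d2, aa3) found in seq.

    d1 and d2 are the numbers of residues between successive members of the
    triple (an assumption of this sketch), each from 0 to max_dist.
    """
    triples = set()
    n = len(seq)
    for i in range(n):
        for d1 in range(max_dist + 1):
            j = i + d1 + 1
            if j >= n:
                break
            for d2 in range(max_dist + 1):
                k = j + d2 + 1
                if k >= n:
                    break
                triples.add((seq[i], d1, seq[j], d2, seq[k]))
    return triples


def tally_triples(seqs, max_dist=17):
    """Count, for each spaced triple, the number of sequences containing it."""
    counts = defaultdict(int)
    for seq in seqs:
        for triple in spaced_triples(seq, max_dist):
            counts[triple] += 1
    return counts


if __name__ == "__main__":
    toy = ["APCQL", "GPCQK", "PSCQ"]  # toy segments, not real proteins
    counts = tally_triples(toy)
    shared = [t for t, c in counts.items() if c >= 2]
    print(sorted(shared))
```

Triples present in most of the input sequences are the candidates around which an ungapped alignment could be anchored.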

PROTOMAT also has been modified to utilize a Gibbs sampler as motif-finder (Neuwald et al., 1995). GIBBS uses a statistical sampling algorithm to find motifs and does not rely on finding amino acid identities in the sequences.

Searching with blocks

Block alignments are converted into position-specific scoring matrices (PSSMs) for searching. Each PSSM column corresponds to a block position and includes 20 numerical scores representing the odds for each amino acid occurring in that position. Calculation of the Block Searcher PSSMs uses sophisticated methods of sequence weighting and pseudo-count estimation shown to be effective in comprehensive tests (Henikoff and Henikoff, 1996). A theoretical score distribution is computed for each PSSM (Tatusov et al., 1994). A query sequence is compared with each PSSM in the Blocks Database by aligning it with the block at every possible position and adding the log-odds scores in each PSSM column. The highest-scoring alignment is saved and the probability of its score looked up in the theoretical distribution.
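The scanning step described above (align the block at every possible position, add the column scores, keep the best) can be sketched in Python as follows. This is an illustration only: the toy PSSM values, the function name, and the score given to residues missing from a column are assumptions, and the lookup of the score in a theoretical distribution is not shown.

```python
def best_ungapped_score(query, pssm, missing=-2.0):
    """Slide an ungapped PSSM along the query and return (best_score, start).

    pssm is a list of dicts, one per block column, mapping amino acid to a
    log-odds score. Residues absent from a column get the `missing` score
    (an assumption of this sketch). Returns (None, None) if the query is
    shorter than the block.
    """
    width = len(pssm)
    best_score, best_start = None, None
    for start in range(len(query) - width + 1):
        score = sum(
            column.get(query[start + offset], missing)
            for offset, column in enumerate(pssm)
        )
        if best_score is None or score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


if __name__ == "__main__":
    # Toy three-column PSSM favoring P-C-Q, with a smaller score for S-C-Q.
    toy_pssm = [{"P": 5.0, "S": 1.0}, {"C": 9.0}, {"Q": 6.0}]
    print(best_ungapped_score("MKSCQAPCQL", toy_pssm))
```

Because the alignment is ungapped, the scan is a single pass over the query for each block, which is what makes searching a whole database of PSSMs practical.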

For families with multiple blocks, each block is aligned and scored individually with the query, and the probabilities of all the blocks are combined to give the overall expected value for the alignment of the query with the blocks for the family (Bailey and Gribskov, 1998). Multiple blocks are only combined in a hit if they occur in order and within reasonable distances of one another within the query sequence. Reasonable distances are determined by looking at the distances between blocks in the known members of a family.
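The requirement that blocks occur in order and within reasonable distances can be expressed as a simple check. The Python sketch below is not Block Searcher's code; it assumes that each hit is given by its start position in the query, that the distance between two blocks is measured from the end of one block to the start of the next, and that the allowed (minimum, maximum) distances have already been taken from known family members. All names are hypothetical.

```python
def hits_are_consistent(starts, widths, allowed_gaps):
    """Check that block hits are in order and plausibly spaced.

    starts: start positions of the hits in the query, in family block order.
    widths: width of each block.
    allowed_gaps: (min, max) allowed distance between the end of block i-1
        and the start of block i, one pair for each successive pair of blocks.
    """
    for i in range(1, len(starts)):
        gap = starts[i] - (starts[i - 1] + widths[i - 1])
        low, high = allowed_gaps[i - 1]
        # A negative gap means the blocks overlap or are out of order.
        if gap < low or gap > high:
            return False
    return True


if __name__ == "__main__":
    widths = [20, 15, 25]
    allowed = [(5, 40), (10, 60)]
    print(hits_are_consistent([10, 60, 120], widths, allowed))   # plausible
    print(hits_are_consistent([60, 10, 120], widths, allowed))   # out of order
    print(hits_are_consistent([10, 500, 560], widths, allowed))  # too far apart
```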

Block searches against sequences can be improved upon by searching blocks against blocks. In such cases both query and target are devoid of non-conserved sequence regions, and both are defined by amino acid distribution in each position (Pietrokovski, 1996). Since the block-to-block alignment is ungapped and over relatively short regions, it is possible to automatically identify consistent alignments of several blocks (Kunin et al., 2001).

Because blocks are inherently local, they can accommodate partial sequences, such as those that are available from EST projects. The Block Searcher facilitates this task by accepting DNA queries, which it translates in 3 or 6 frames, piecing together multiple block hits in different frames on a DNA strand. This feature is also useful for identifying missing exons caused by alternative splicing or gene mis-prediction, as illustrated in the Dnmt2 example.

Using blocks for tree construction

Multiple sequence alignments and phylogenetic trees constructed from them are well suited to reconstruct relationships between the component sequences. However, regions that are wrongly aligned will confound this analysis. Because blocks correspond to more confidently aligned segments, trees based on them may be more reliable than reconstructions based on global alignment. The Tree Viewer tool describes relationships between sequences that are derived from the best aligned regions of proteins, and this reduces the concern that divergence is an artifact of misalignment.

Using CODEHOP to isolate orthologs in related organisms

CODEHOP primers overcome problems of both degenerate and consensus methods for primer design. Hybrid primers consist of a relatively short 3' degenerate core and a 5' non-degenerate clamp. Reducing the length of the 3' core to a minimum decreases the total number of individual primers in the degenerate primer pool. Hybridization of the 3' degenerate core with the target template is stabilized by the 5' consensus clamp of the primer and the target sequence during the initial PCR cycles.

Even in the postgenomic era, sequencing has hardly begun on the vast majority of genomes on earth, and so methods are still needed for isolating homologs that are not present in sequence databases. The CODEHOP primer designer can aid in this task, by implementing a strategy that permits high stringency annealing to avoid mispriming by chance. PCR primer design takes advantage of the accumulation of sequence data, which facilitates the task of obtaining homologous sequences from organisms of interest. As illustrated by the Dnmt2 example, using just a subfamily of cytosine methylases allows primers to be designed specifically for members of this subtree, which should succeed in most organisms that have Dnmt2 orthologs. The cytosine methylase family is typical in that it is so diverse that the design of PCR primers to specifically amplify them all is infeasible. Fortunately, the diversity of most protein families is mostly evident in paralogous relationships, and so limiting oneself to probable orthologs is likely to be a sound general strategy. As orthologs are expected to share function, the primer design strategy illustrated for Dnmt2 allows a user to focus on shared function despite the possible occurrence of paralogs that may be functionally dissimilar.

Critical Parameters and Troubleshooting

Blocks Database retrieval

Usually a keyword or sequence name is sufficient to retrieve a family using Get Blocks. However, homology searching is a more reliable way to determine if a protein belongs to one or more families, and Reverse PSI-BLAST is fast and sensitive. Because block alignments may differ from those used for the corresponding InterPro entry, occasional significant hits may not correspond to their InterPro annotations, and an example of this is found in Part 2 section 3.

Blocks are not made for every InterPro entry. In particular, they are not made for entries that are subsets of other entries. This reduces overlap between families in the Blocks Database.

Avoiding spurious hits in searches

The expected (E) value is the most critical parameter. E=1 means that a single hit is expected to occur by chance, and so raising the E-value cutoff should result in more hits being reported. View significant E-values with caution when there is compositional bias, and use filtering on such queries (Wootton and Federhen, 1993). Alternatively, search the Blocks+ database with compositionally biased blocks removed. Compositional bias can be especially severe when non-coding short repetitive sequences are present in DNA queries. In addition to searching the Blocks+ database with compositionally biased blocks removed, you can perform a search using only the coding strand of the query to reduce background.

Block Searcher does not penalize gaps, and so it is possible that very long DNA queries will report successive blocks that are implausibly far apart on the same strand. One of these may be spurious, especially if there is compositional bias in either the query or the database entry. If a family is represented by only a single block, then the hit's quality is more difficult to judge. In this case perform another search using the Reverse PSI-BLAST or IMPALA Searcher to confirm the hit, as these programs use different alignment algorithms and statistics than Block Searcher.

Block Maker features

Block Maker constructs blocks using two very different motif-finders, MOTIF and Gibbs, requiring no externally provided parameters other than the set of protein sequences submitted to it. Non-overlapping blocks are found and a "best set" of blocks is reported, sometimes discarding individual sequences that do not sufficiently conform with the others. A sequence can be discarded if it lacks some of the strongest motifs found in other sequences, or if its motifs are out of order or overlap.

The complementary strengths and weaknesses of MOTIF and Gibbs mean that you can compare their results as a "reality check". PROTOMAT will always report blocks, even if random sequences are provided. If sequences truly have motifs in common, then both methods yield similar, and sometimes identical, sets of blocks. However, if sequences have nothing in common, the two motif-finding algorithms tend to pick up completely different meaningless blocks.

Repeated domains are not handled by Block Maker. Rather, only a single repeat member is aligned within a block. MEME (Bailey and Elkan, 1994), which is available from http://meme.sdsc.edu/meme/website/, is designed to align all of the repeat members within a block. MEME uses a statistical approach that is comparable to Gibbs sampling.

Using CODEHOP interactively

There are ways of reducing the stringency if you do not get predictions using the default parameters, or if you don't like what you get. Raising the strictness of the core region, for example from 0.0 to 0.1 or even to 0.25, will discriminate against the less probable codons. If one or more of the sequences is expected to be closer to the desired target gene, then raising its weight relative to the others can reduce the size of the target primer pool without requiring that you raise the degeneracy or strictness. You do this by editing the sequence segment weight, which appears in the last column of the block in the Web form.

The maximum sequence weight in a block from the Blocks Database or Block Maker is 100, so you might upweight your favorite sequence to 200 or 400. You can also ignore the contribution of individual sequences to the block by down-weighting them to 0 if they are too divergent or misaligned and so prevent finding a solution.
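To see why changing a weight shifts the result, consider how weighted residue proportions in one block column respond to up-weighting or zeroing a sequence. The Python sketch below is only an illustration of weighted counting; it is not how CODEHOP computes its DNA PSSM, and the segments, weights and function name are hypothetical.

```python
from collections import Counter


def weighted_column_frequencies(segments, weights, column):
    """Weighted proportion of each residue at one column of a block.

    segments: aligned, equal-length sequence segments of the block.
    weights: one weight per segment; a weight of 0 removes that segment's
        contribution entirely.
    """
    totals = Counter()
    for segment, weight in zip(segments, weights):
        totals[segment[column]] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {aa: w / grand_total for aa, w in totals.items() if w > 0}


if __name__ == "__main__":
    segments = ["PCQ", "SCQ", "PCQ"]  # toy block segments
    print(weighted_column_frequencies(segments, [100, 50, 50], 0))
    print(weighted_column_frequencies(segments, [100, 0, 50], 0))   # ignore 2nd
    print(weighted_column_frequencies(segments, [400, 50, 50], 0))  # favor 1st
```

Setting the second weight to 0 removes the S from the first column altogether, whereas up-weighting the first sequence to 400 only reduces the influence of S without eliminating it.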

Clamp residues can be selected as the most common codons of the consensus amino acids. Otherwise, the clamp residues are the ones with maximum weight in the DNA PSSM, which may result in artificial codons. These do not affect the primers chosen, but the output may be disturbing.

Suggestions for Further Analysis

Conserved regions of proteins are those that are most likely to suffer deleterious effects when mutated (Ng and Henikoff, 2001). SIFT (Sorting Intolerant from Tolerant, http://blocks.fhcrc.org/sift/SIFT.html) is a Web tool for predicting which changes are likely to affect protein function based on conservation. Given a multiple alignment such as a set of blocks, SIFT predicts which changes can be expected to damage the protein. If SIFT is given a sequence, it uses PSI-BLAST to obtain homologous sequences from sequence databanks for multiple alignment. When applied to human polymorphism data, SIFT identifies disease loci with about 70% accuracy (Ng and Henikoff, 2002). CODDLE (http://www.proweb.org/coddle) and PARSESNP (http://www.proweb.org/parsesnp) are general Web tools for polymorphism and mutation assessment that take sequence input from a variety of sources, display gene models, and use Blocks Database alignments to aid in identifying regions most suitable for targeted mutagenesis.