Introduction to the theory of sequence alignment

(1)

Introduction to the theory of

sequence alignment

Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002

(2)

References

n _{R. Durbin, A. Krogh, S. Eddy, G. Mitchinson, Biological Sequence}

Analysis: Probabilistic Models of Proteins and Nucleic Acids, Oxford

University Press, 1998

n S.F. Altschul et al., Basic Local Alignment Search Tool, J. Mol. Biol.,

No. 215, pp. 403-410, 1990

n S.F. Altschul et al., Gapped BLAST and PSI-BLAST: a new

generation of protein database search programs, Nucleic Acids

(3)

Overview

n Alignment of two sequences

n DNA n Proteins n Similarity vs. homology n Similarity n Homology n Orthology n Paralogy n Elements of an alignment n _{Dynamic programming}

(4)

Overview

n Global alignment

n Needleman-Wunsch algorithm

n Local alignment

n Smith-Waterman algorithm n Affine gap penalty

n Substitution matrices n PAM n BLOSUM n Gap penalty n Significance evaluation n BLAST

(5)

Overview

n PHI-BLAST

n Multiple alignment n PSI-BLAST

(6)

(7)

(8)

Evolution of sequence databases

n Genbank n _SWISSProt

(9)

Molecular evolution

n Genomes through imperfect replication and natural

selection

(10)

Similarity vs. homology

n Sequences are similar if they are sufficiently resembling at the

sequence level (DNA, protein, …)

n _{Similarity can arise from}

n Homology

n Convergence (functional constraints)

n Chance

n _{Sequences are homologous if they arise from a common ancestor}

n Homologous sequences are paralogous if their differences involve a

gene duplication event

n Homologous sequences are orthologous iftheir differences are not

(11)

Orthology vs. paralogy

δ-globin - hu man β-globin - human β-globin - mouse β-globin -ch icken leghemoglob in - lupin α-globin - ch imp α-globin -mouse myog lobin -wha le

(12)

Phylogeny

n Relationships between genes and proteins can be inferred on the

basis of their sequences

(13)

Homology for the prediction of structure

and function

n Homologous proteins have comparable structures

n _{Homologous proteins potentially have similar functions}

(14)

Homology for prediction with DNA

n Conserved regions arise from evolutionary pressure and

are therefore functionally important

n Genes

n Control regions

n Comparative genomics

n Genes can be predicted by comparing genomes at an

appropriate evolutionary distance (e.g., mouse and human)

(15)

(16)

Elements of an alignment

n _{Types of alignments}

n DNA vs. protein

n Pairwise va. multiple alignment

n Global alignment

n Local alignment

n Scoring model for alignments

n Substitutions

n Gaps (insertions, deletions)

n Substitution matrix and gap penalty

n Algorithm n Dynamic programming n Heuristic n _{Significance evaluation}

HEAGAWGHE-E

--P-AW-HEAE

(17)

Global alignment

x y

(18)

Global alignment

n Alignment of ‘human alpha globin’ against ‘human beta globin’, ‘lupin

leghemoglobin’ and ‘glutathionine S-transferase homolog F11G11.2’ (‘+’ for good substitutions)

HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPEFQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE Strong homology

Low similarity / structural homology

(19)

Local alignment

x y

(20)

Substitution matrix and gap penalty

n The alignment of two residues can be more or less likely

n _{To compute the quality of an alignment, we assign a gain or a penalty}

to the alignment of two residues

n _{Gaps also have a penalty}

……… … … … … … … … ……… 7 -3 0 0 1 -1 Q ……… -3 13 -4 -2 -4 -1 C ……… 0 -4 8 2 -2 -2 D ……… 0 -2 2 7 -1 -1 N ……… 1 -4 -2 -1 7 -2 R ……… -1 -1 -2 -1 -2 5 A EGHILKMFPSTWYV Q C D N R A HEAGAWGHE-E --P-AW-HEAE

(21)

Substitution matrix for DNA

n Standard 5 -4 -4 -4 T -4 5 -4 -4 G -4 -4 5 -4 C -4 -4 -4 5 A T G C A

(22)

Dynamic programming

n To align is to find the minimum penalty / maximum score path

through the penalty table = DYNAMIC PROGRAMMING

n _{Substitution matrix} = BLOSUM 50 n _{Gap penalty = -8} -8 -8 -8 -8 -8 -8 -8 0 * -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 * 6 -1 6 0 -3 -1 -1 E 6 -1 6 0 -3 -1 -1 E 0 -2 0 10 -3 -2 -2 H -3 -3 -1 -3 -1 6 0 E 0 -3 5 0 5 -1 -2 A -3 -3 -1 -3 -1 6 0 E -2 -3 -2 -2 -2 0 10 H -3 15 -3 -3 -3 -3 -3 W 0 -3 5 0 5 -1 -2 A -2 -4 -1 -2 -1 -1 -2 P G W A G A E H HEAGAWGHE-E --P-AW-HEAE

(23)

Dynamic programming

C₁ C₂ C₃ C₄ C₅ C₇ C₆ C₈ 5 7 3 4 2 5 2 6 4 3 5 3 5 Shortest path from C₁ to C₈

Shortest path from C₁ to C₅ Shortest path from C₅ to C₈

n Belman’s optimality principle

(24)

(25)

Global alignment

n Needleman-Wunsch algorithm

n Progressively complete the table F(i,j) (!!! column, row) that keeps

track of the maximum score for the alignment of sequence x up to x_i to sequence y up to y_j

n _{Substitution matrix s(x, y) and gap penalty d} n Recurrence

I G A x_i A I G A x_i G A x_i

-L G V y_j G V y_j - - S L G V y_j

{ F(i-1,j-1) + s(x_i, y_j) substitution

F(i,j) = max { F(i-1,j) – d deletion

{ F(i,j-1) – d insertion

(26)

-56 -48 -40 -32 -24 -16 -8 0 * -80 -72 -64 -56 -48 -40 -32 -24 -16 -8 * 1 2 -5 -19 -37 -60 -73 E -9 -5 3 -11 -29 -52 -65 E -12 -15 -7 -3 -21 -44 -57 H -15 -14 -12 -6 -11 -24 -38 E -12 -12 -11 -11 -3 -16 -30 A -15 -12 -9 -16 -16 -8 -22 E -7 -13 -9 -8 -13 -18 -14 H -13 -5 -15 -7 -6 -11 -18 W -36 -28 -20 -12 -4 -3 -10 A -49 -42 -33 -25 -17 -9 -2 P G W A G A E H F(i,j) F(i-1,j) F(i,j-1) F(i-1,j-1) s(x_i, y_j) – d – d

Start from to left

Complete progressively by recurrence Use traceback pointers

{ F(i-1,j-1) + s(x_i, y_j)

F(i,j) = max { F(i-1,j) – d { F(i,j-1) – d

(27)

Local alignment

n Smith-Waterman algorithm

n Best alignment between subsequences of x and y

n If the current alignment has a negative score, it is better

to start a new alignment

{ 0 restart

{ F(i-1,j-1) + s(x_i, y_j) substitution

F(i,j) = max { F(i-1,j) – d deletion

{ F(i,j-1) – d insertion

(28)

6 14 22 18 12 0 0 0 2 10 0 H 0 0 0 0 0 0 0 * 0 0 0 0 0 0 0 0 0 0 * 26 27 -19 0 0 0 E 16 20 28 0 0 0 E 4 10 18 4 0 0 H 0 4 12 18 13 6 0 E 4 0 5 13 21 8 0 A 10 4 0 0 8 16 2 E 12 20 0 2 0 0 0 W 0 0 5 0 5 0 0 A 0 0 0 0 0 0 0 P G W A G A E H

Start from top left

Complete progressively by recurrence Traceback from the highest score

and stop at zero

AWGHE AW-HE

(29)

Substitution matrices

n How can we choose a reasonable substitution matrix? n _{Look at a set of confirmed alignments (with gaps) and}

compute the amino acid frequences q_a, the substitution frequences p_ab, and the gap function f(g)

n _{Likelihood model}

n Random sequences: P(x,y|R) = Π_iq_xiΠ_jq_yj n Alignment: P(x,y|M) = Π_ip_xixj

n Odds ratios: P(x,y|M)/P(x,y|R) = Π_ip_xixj/(Π_iq_xiΠ_jq_yj)

n Log-odds score: S(x,y) = Σ_is(x_i,y_i) with s(a,b) = log(p_ab/q_aq_b)

(30)

PAM matrix

n Point Accepted Mutations matrix n Problems

n Alignments are not independent for related proteins

n Different alignments correspond to different evolution times

n PAM1 matrix

n Tree of protein families

n Estimate ancestral sequences

n Estimate mutations at short evolutionary distance n Scale to a substitution matrix

n 1% Point Accepted Mutations (PAM1)

n _{PAM250 is 250% Point Accepted Mutations (~20%}

(31)

BLOSUM matrix

n BLOCKS SUbstitution Matrix

n _{PAM does not work so well at large evolutionary}

distances

n Ungapped alignments of protein families from the

BLOCKS database

n Group sequences with more than L% identical amino

acids (e.g., BLOSUM62)

n Use the substitution frequency of amino acids between

the different groups (with correction for the group size) to derive the substitution matrix

(32)

BLAST

n For large databases, Smith-Waterman local alignment is

too slow

n _{Basic Local Alignment Search Tool (BLAST) is a fast}

heuristic algorithm for local alignment (http://www.ncbi.nlm.nih.gov/Entrez)

n BLASTP – protein query on protein database

n BLASTN – nucleotide query on nucleotide database

n BLASTX – translated nucleotide query on protein database

(translation into the six reading frames)

n TBLASTN – protein query on translated nucleotide db

(33)

BLASTP

n Step 1: Find all words of length w (e.g., w=3) for which

there is a match in the query sequence with score at least T (e.g., T=11) for the chosen substitution matrix (e.g., BLOSUM62 with gap penalty 10+g)

n Step 2: Use a finite state automaton to find all matches

(34)

BLASTP

n Step 3: Check which hits have another hit without

overlap within a distance of A (e.g., A=40) (the distance must be identical on the query and on the target)

(two-hits)

n Step 4: Extend the left hit of the two-hits in both

directions by ungapped alignment ; stop the extension when the score drops by X_g (e.g., X_g=40) under the best score so far (high scoring segment pair HSP)

(35)

BLASTP

n Step 5: Extend the HSPs with normalized score above S_g

(S_g =22 bits) by ungapped alignment ; stop the extension when the score drops by X_g (e.g., X_g=40) under the best score so far ; select the best gapped local alignment

n _{Step 6: Compute the significance of the alignments ; for}

the significant alignments, repeat the gapped alignment with a higher dropoff parameter X_g for more accuracy

(36)

BLASTP

Query Target Two-hits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Hits Local alignment

(37)

0.34 36 Running time 197 198 P20705 H+_{-transport ATP synthase}

81 83 P14942 Glutathione transferase 211 211 P10635 Cytochrome P450 261 262 P10318 Histocompatibility antigen 137 138 P07327 Alcohol dehydrogenase 53 53 P05013 Interferon alpha 128 128 P03435 Hemagglutinin 28 28 P02232 Globin 252 255 P01111 Ras 108 108 P01008 Serine protease inhibitor

275 275 P00762 Serine protease BLAST (# matches) Smith-Waterman Query (SWISS-PROT) Protein family

BLASTP example

(38)

(39)

Multiple alignment

n What to do if we want to compare together the

sequences of several proteins from the same family (not pair by pair)?

n Multiple alignments are, for example, important in the

study of the role played by the residues in the 3D

structure of the proteins (ideally, aligned residues should have the same 3D coordinates and should correspond evolutionarily)

(40)

(41)

Multiple alignment

n Elements of a multiple alignment method

n Scoring model

n Algorithm for finding the best aligment

n _{Ideally, given a phylogenetic tree for the sequences, the}

probability of a multiple alignment is the product of the probabilities of all the events needed for this tree (in practice, not enough data will be available for this)

(42)

Scoring model

n A pragmatic scoring function is of the form

S(m) = G + Σ_iS(m_i)

G gap score, m_i column i of the multiple alignment, S(m_i)

score for column i

n Column score: sum of pairs score

S(m_i)=Σ_lΣ_k<ls(m_ik_,m

il)

s() substitution matrix

n Sum of pairs score not statistically well-founded

n Extension of log-odds score (3 seqs.): log(p_abc/(q_aq_bq_c))

(43)

PSI-BLAST

n Position-Specific Iterated BLAST

n _{Searching the database not with a single sequence but}

with a score matrix describing a protein family is much more sensitive

n _Idea:

n Run BLAST for the query sequence n Gather the significant alignments

n Build a specific matrix describing the aligned sequences n Run BLAST using this specific matrix to recruit more seqs. n Build new matrix and iterate until no more seqs. are recruited)

(44)

PSI-BLAST

n Motifs can be found in a protein family; if we search

using such motifs we can detect more distant relationships

n It is difficult to delineate multiple motifs automatically in a

set of proteins; therefore, PSI-BLAST builds a motif as long as the query (Lx20 matrix)

n The motif is not a substitution matrix but gives at each

position a specific score for the presence of an amino

acid and is thus called a position-specific scoring matrix (PSSM) (gap scores are not position specific)

(45)

Position-specific scoring matrix

MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL MFKRKGFLHWYTGEGMEPVEFSEAQSDLEDLILEYQQY MFSRKAFLHWYTGEGMEEGDFAEADNNVSDLLSEYQQY RGAFLDQFRREDIFKDDLNELDESRETVDCLVQEYEAA RNAFLDNFRRESMFQDDLTELDIARDTVDCLVQEYEAA RSKRAFIDKFEKIDNFSLDMMDDAMHIVQDLLDEYKAV RNAFLEQYKKEAPFQDGLDEFDEARAVVMDLVGEYEAA RDAYMNIFKQTKIFEDNLDEFDSSDEVVKSLIDEYAAA ENYKKESMFSSADGQGNFEEMESSKEITQNLIDEYKSA RNAFLEQYKKEAIFEDDLNEFDSSRDVVADLINEYEAC RNAFMPQYQKEAMFEKNLDEFDEARATVQDLIEEYQAC MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDFEEV MYAKRAFVHWYVGEGMEEGEFSEAREDMAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALERDYEEV

(46)

BLAST with PSSMs

n Only minor modifications necessary to run BLAST with a

position-specific scoring matrix (essentially, replace the score provided by the query and the substitution matrix

s(t_i,q_i) by the score provided by the PSSM s(t_i,C))

n To avoid having to modify the parameters of BLAST, the

(47)

0.87 0.34 36 Running time 207 197 198 P20705

H+_{-transport ATP synthase}

142 81 83 P14942 Glutathione transferase 224 211 211 P10635 Cytochrome P450 338 261 262 P10318 Histocompatibility antigen 160 137 138 P07327 Alcohol dehydrogenase 53 53 53 P05013 Interferon alpha 130 128 128 P03435 Hemagglutinin 623 28 28 P02232 Globin 375 252 255 P01111 Ras 111 108 108 P01008

Serine protease inhibitor

286 275 275 P00762 Serine protease PSI-BLAST (# matches) BLAST Smith-Waterman Query (SWISS-PROT) Protein family

PSI-BLAST example

(48)

CLUSTALW

n Multiple alignment can be seen as multidimensional

dynamic programming; however, for N sequences of

length L, the complexity is O(LN_{) and therefore infeasible}

in practice

n CLUSTALW (Cluster and ALign) is an example of

progressive multiple alignment

n Build a guide tree based on sequence similarity (not a real

phylogenetic tree)

n Do pairwise alignment going back up the guide tree

n Pairwise sequence alignment

n Sequence-group alignment

(49)

sequences distance matrix pairwise alignment sequence-group alignment group-group alignment guide tree

(50)

CLUSTALW

n Algorithm

n Construct distance matrix of all pairs

n Perform pairwise gapped alignment of all pairs

n Score as percentage differences in aligned sequences (gaps

excluded)

n Build the guide tree

n Use the neighbor-joining method

n Place the root on the edge where the mean branch length is equal on

both sides

n Use the three to derive a weight for each sequence

n Progressively align along the tree (seq.-seq., seq.-prof., prof.-prof.)

n Lock already aligned sequences together

n Align two groups using sum-of-pairs score for dynamic programming

(51)

Summary

n Similarity vs. homology n Dynamic programming n Local alignment n Needleman – Wunsch n Global alignment n Smith – Waterman n Affine gap penalty

n Substitution matrices

n PAM

n BLOSUM

(52)

n BLAST

n _PHI-BLAST

n Multiple alignment n PSI-BLAST