Introduction to the theory of
sequence alignment
Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002
References
n R. Durbin, A. Krogh, S. Eddy, G. Mitchinson, Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids, Oxford
University Press, 1998
n S.F. Altschul et al., Basic Local Alignment Search Tool, J. Mol. Biol.,
No. 215, pp. 403-410, 1990
n S.F. Altschul et al., Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs, Nucleic Acids
Overview
n Alignment of two sequences
n DNA n Proteins n Similarity vs. homology n Similarity n Homology n Orthology n Paralogy n Elements of an alignment n Dynamic programming
Overview
n Global alignment
n Needleman-Wunsch algorithm
n Local alignment
n Smith-Waterman algorithm n Affine gap penalty
n Substitution matrices n PAM n BLOSUM n Gap penalty n Significance evaluation n BLAST
Overview
n PHI-BLAST
n Multiple alignment n PSI-BLAST
Evolution of sequence databases
n Genbank n SWISSProt
Molecular evolution
n Genomes through imperfect replication and natural
selection
Similarity vs. homology
n Sequences are similar if they are sufficiently resembling at the
sequence level (DNA, protein, …)
n Similarity can arise from
n Homology
n Convergence (functional constraints)
n Chance
n Sequences are homologous if they arise from a common ancestor
n Homologous sequences are paralogous if their differences involve a
gene duplication event
n Homologous sequences are orthologous iftheir differences are not
Orthology vs. paralogy
δ-globin - hu man β-globin - human β-globin - mouse β-globin -ch icken leghemoglob in - lupin α-globin - ch imp α-globin -mouse myog lobin -wha lePhylogeny
n Relationships between genes and proteins can be inferred on the
basis of their sequences
Homology for the prediction of structure
and function
n Homologous proteins have comparable structures
n Homologous proteins potentially have similar functions
Homology for prediction with DNA
n Conserved regions arise from evolutionary pressure and
are therefore functionally important
n Genes
n Control regions
n Comparative genomics
n Genes can be predicted by comparing genomes at an
appropriate evolutionary distance (e.g., mouse and human)
Elements of an alignment
n Types of alignments
n DNA vs. protein
n Pairwise va. multiple alignment
n Global alignment
n Local alignment
n Scoring model for alignments
n Substitutions
n Gaps (insertions, deletions)
n Substitution matrix and gap penalty
n Algorithm n Dynamic programming n Heuristic n Significance evaluation
HEAGAWGHE-E
--P-AW-HEAE
Global alignment
x y
Global alignment
n Alignment of ‘human alpha globin’ against ‘human beta globin’, ‘lupin
leghemoglobin’ and ‘glutathionine S-transferase homolog F11G11.2’ (‘+’ for good substitutions)
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPEFQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE Strong homology
Low similarity / structural homology
Local alignment
x y
Substitution matrix and gap penalty
n The alignment of two residues can be more or less likely
n To compute the quality of an alignment, we assign a gain or a penalty
to the alignment of two residues
n Gaps also have a penalty
……… … … … … … … … ……… 7 -3 0 0 1 -1 Q ……… -3 13 -4 -2 -4 -1 C ……… 0 -4 8 2 -2 -2 D ……… 0 -2 2 7 -1 -1 N ……… 1 -4 -2 -1 7 -2 R ……… -1 -1 -2 -1 -2 5 A EGHILKMFPSTWYV Q C D N R A HEAGAWGHE-E --P-AW-HEAE
Substitution matrix for DNA
n Standard 5 -4 -4 -4 T -4 5 -4 -4 G -4 -4 5 -4 C -4 -4 -4 5 A T G C ADynamic programming
n To align is to find the minimum penalty / maximum score path
through the penalty table = DYNAMIC PROGRAMMING
n Substitution matrix = BLOSUM 50 n Gap penalty = -8 -8 -8 -8 -8 -8 -8 -8 0 * -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 * 6 -1 6 0 -3 -1 -1 E 6 -1 6 0 -3 -1 -1 E 0 -2 0 10 -3 -2 -2 H -3 -3 -1 -3 -1 6 0 E 0 -3 5 0 5 -1 -2 A -3 -3 -1 -3 -1 6 0 E -2 -3 -2 -2 -2 0 10 H -3 15 -3 -3 -3 -3 -3 W 0 -3 5 0 5 -1 -2 A -2 -4 -1 -2 -1 -1 -2 P G W A G A E H HEAGAWGHE-E --P-AW-HEAE
Dynamic programming
C1 C2 C3 C4 C5 C7 C6 C8 5 7 3 4 2 5 2 6 4 3 5 3 5 Shortest path from C1 to C8Shortest path from C1 to C5 Shortest path from C5 to C8
n Belman’s optimality principle
Global alignment
n Needleman-Wunsch algorithm
n Progressively complete the table F(i,j) (!!! column, row) that keeps
track of the maximum score for the alignment of sequence x up to xi to sequence y up to yj
n Substitution matrix s(x, y) and gap penalty d n Recurrence
I G A xi A I G A xi G A xi
-L G V yj G V yj - - S L G V yj
{ F(i-1,j-1) + s(xi, yj) substitution
F(i,j) = max { F(i-1,j) – d deletion
{ F(i,j-1) – d insertion
-56 -48 -40 -32 -24 -16 -8 0 * -80 -72 -64 -56 -48 -40 -32 -24 -16 -8 * 1 2 -5 -19 -37 -60 -73 E -9 -5 3 -11 -29 -52 -65 E -12 -15 -7 -3 -21 -44 -57 H -15 -14 -12 -6 -11 -24 -38 E -12 -12 -11 -11 -3 -16 -30 A -15 -12 -9 -16 -16 -8 -22 E -7 -13 -9 -8 -13 -18 -14 H -13 -5 -15 -7 -6 -11 -18 W -36 -28 -20 -12 -4 -3 -10 A -49 -42 -33 -25 -17 -9 -2 P G W A G A E H F(i,j) F(i-1,j) F(i,j-1) F(i-1,j-1) s(xi, yj) – d – d
Start from to left
Complete progressively by recurrence Use traceback pointers
{ F(i-1,j-1) + s(xi, yj)
F(i,j) = max { F(i-1,j) – d { F(i,j-1) – d
Local alignment
n Smith-Waterman algorithm
n Best alignment between subsequences of x and y
n If the current alignment has a negative score, it is better
to start a new alignment
{ 0 restart
{ F(i-1,j-1) + s(xi, yj) substitution
F(i,j) = max { F(i-1,j) – d deletion
{ F(i,j-1) – d insertion
6 14 22 18 12 0 0 0 2 10 0 H 0 0 0 0 0 0 0 * 0 0 0 0 0 0 0 0 0 0 * 26 27 -19 0 0 0 E 16 20 28 0 0 0 E 4 10 18 4 0 0 H 0 4 12 18 13 6 0 E 4 0 5 13 21 8 0 A 10 4 0 0 8 16 2 E 12 20 0 2 0 0 0 W 0 0 5 0 5 0 0 A 0 0 0 0 0 0 0 P G W A G A E H
Start from top left
Complete progressively by recurrence Traceback from the highest score
and stop at zero
AWGHE AW-HE
Substitution matrices
n How can we choose a reasonable substitution matrix? n Look at a set of confirmed alignments (with gaps) and
compute the amino acid frequences qa, the substitution frequences pab, and the gap function f(g)
n Likelihood model
n Random sequences: P(x,y|R) = ΠiqxiΠjqyj n Alignment: P(x,y|M) = Πipxixj
n Odds ratios: P(x,y|M)/P(x,y|R) = Πipxixj/(ΠiqxiΠjqyj)
n Log-odds score: S(x,y) = Σis(xi,yi) with s(a,b) = log(pab/qaqb)
PAM matrix
n Point Accepted Mutations matrix n Problems
n Alignments are not independent for related proteins
n Different alignments correspond to different evolution times
n PAM1 matrix
n Tree of protein families
n Estimate ancestral sequences
n Estimate mutations at short evolutionary distance n Scale to a substitution matrix
n 1% Point Accepted Mutations (PAM1)
n PAM250 is 250% Point Accepted Mutations (~20%
BLOSUM matrix
n BLOCKS SUbstitution Matrix
n PAM does not work so well at large evolutionary
distances
n Ungapped alignments of protein families from the
BLOCKS database
n Group sequences with more than L% identical amino
acids (e.g., BLOSUM62)
n Use the substitution frequency of amino acids between
the different groups (with correction for the group size) to derive the substitution matrix
BLAST
n For large databases, Smith-Waterman local alignment is
too slow
n Basic Local Alignment Search Tool (BLAST) is a fast
heuristic algorithm for local alignment (http://www.ncbi.nlm.nih.gov/Entrez)
n BLASTP – protein query on protein database
n BLASTN – nucleotide query on nucleotide database
n BLASTX – translated nucleotide query on protein database
(translation into the six reading frames)
n TBLASTN – protein query on translated nucleotide db
BLASTP
n Step 1: Find all words of length w (e.g., w=3) for which
there is a match in the query sequence with score at least T (e.g., T=11) for the chosen substitution matrix (e.g., BLOSUM62 with gap penalty 10+g)
n Step 2: Use a finite state automaton to find all matches
BLASTP
n Step 3: Check which hits have another hit without
overlap within a distance of A (e.g., A=40) (the distance must be identical on the query and on the target)
(two-hits)
n Step 4: Extend the left hit of the two-hits in both
directions by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far (high scoring segment pair HSP)
BLASTP
n Step 5: Extend the HSPs with normalized score above Sg
(Sg =22 bits) by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far ; select the best gapped local alignment
n Step 6: Compute the significance of the alignments ; for
the significant alignments, repeat the gapped alignment with a higher dropoff parameter Xg for more accuracy
BLASTP
Query Target Two-hits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Hits Local alignment0.34 36 Running time 197 198 P20705 H+-transport ATP synthase
81 83 P14942 Glutathione transferase 211 211 P10635 Cytochrome P450 261 262 P10318 Histocompatibility antigen 137 138 P07327 Alcohol dehydrogenase 53 53 P05013 Interferon alpha 128 128 P03435 Hemagglutinin 28 28 P02232 Globin 252 255 P01111 Ras 108 108 P01008 Serine protease inhibitor
275 275 P00762 Serine protease BLAST (# matches) Smith-Waterman Query (SWISS-PROT) Protein family
BLASTP example
Multiple alignment
n What to do if we want to compare together the
sequences of several proteins from the same family (not pair by pair)?
n Multiple alignments are, for example, important in the
study of the role played by the residues in the 3D
structure of the proteins (ideally, aligned residues should have the same 3D coordinates and should correspond evolutionarily)
Multiple alignment
n Elements of a multiple alignment method
n Scoring model
n Algorithm for finding the best aligment
n Ideally, given a phylogenetic tree for the sequences, the
probability of a multiple alignment is the product of the probabilities of all the events needed for this tree (in practice, not enough data will be available for this)
Scoring model
n A pragmatic scoring function is of the form
S(m) = G + ΣiS(mi)
G gap score, mi column i of the multiple alignment, S(mi)
score for column i
n Column score: sum of pairs score
S(mi)=ΣlΣk<ls(mik,m
il)
s() substitution matrix
n Sum of pairs score not statistically well-founded
n Extension of log-odds score (3 seqs.): log(pabc/(qaqbqc))
PSI-BLAST
n Position-Specific Iterated BLAST
n Searching the database not with a single sequence but
with a score matrix describing a protein family is much more sensitive
n Idea:
n Run BLAST for the query sequence n Gather the significant alignments
n Build a specific matrix describing the aligned sequences n Run BLAST using this specific matrix to recruit more seqs. n Build new matrix and iterate until no more seqs. are recruited)
PSI-BLAST
n Motifs can be found in a protein family; if we search
using such motifs we can detect more distant relationships
n It is difficult to delineate multiple motifs automatically in a
set of proteins; therefore, PSI-BLAST builds a motif as long as the query (Lx20 matrix)
n The motif is not a substitution matrix but gives at each
position a specific score for the presence of an amino
acid and is thus called a position-specific scoring matrix (PSSM) (gap scores are not position specific)
Position-specific scoring matrix
MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL MFKRKGFLHWYTGEGMEPVEFSEAQSDLEDLILEYQQY MFSRKAFLHWYTGEGMEEGDFAEADNNVSDLLSEYQQY RGAFLDQFRREDIFKDDLNELDESRETVDCLVQEYEAA RNAFLDNFRRESMFQDDLTELDIARDTVDCLVQEYEAA RSKRAFIDKFEKIDNFSLDMMDDAMHIVQDLLDEYKAV RNAFLEQYKKEAPFQDGLDEFDEARAVVMDLVGEYEAA RDAYMNIFKQTKIFEDNLDEFDSSDEVVKSLIDEYAAA ENYKKESMFSSADGQGNFEEMESSKEITQNLIDEYKSA RNAFLEQYKKEAIFEDDLNEFDSSRDVVADLINEYEAC RNAFMPQYQKEAMFEKNLDEFDEARATVQDLIEEYQAC MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDFEEV MYAKRAFVHWYVGEGMEEGEFSEAREDMAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALERDYEEVBLAST with PSSMs
n Only minor modifications necessary to run BLAST with a
position-specific scoring matrix (essentially, replace the score provided by the query and the substitution matrix
s(ti,qi) by the score provided by the PSSM s(ti,C))
n To avoid having to modify the parameters of BLAST, the
0.87 0.34 36 Running time 207 197 198 P20705
H+-transport ATP synthase
142 81 83 P14942 Glutathione transferase 224 211 211 P10635 Cytochrome P450 338 261 262 P10318 Histocompatibility antigen 160 137 138 P07327 Alcohol dehydrogenase 53 53 53 P05013 Interferon alpha 130 128 128 P03435 Hemagglutinin 623 28 28 P02232 Globin 375 252 255 P01111 Ras 111 108 108 P01008
Serine protease inhibitor
286 275 275 P00762 Serine protease PSI-BLAST (# matches) BLAST Smith-Waterman Query (SWISS-PROT) Protein family
PSI-BLAST example
CLUSTALW
n Multiple alignment can be seen as multidimensional
dynamic programming; however, for N sequences of
length L, the complexity is O(LN) and therefore infeasible
in practice
n CLUSTALW (Cluster and ALign) is an example of
progressive multiple alignment
n Build a guide tree based on sequence similarity (not a real
phylogenetic tree)
n Do pairwise alignment going back up the guide tree
n Pairwise sequence alignment
n Sequence-group alignment
sequences distance matrix pairwise alignment sequence-group alignment group-group alignment guide tree
CLUSTALW
n Algorithm
n Construct distance matrix of all pairs
n Perform pairwise gapped alignment of all pairs
n Score as percentage differences in aligned sequences (gaps
excluded)
n Build the guide tree
n Use the neighbor-joining method
n Place the root on the edge where the mean branch length is equal on
both sides
n Use the three to derive a weight for each sequence
n Progressively align along the tree (seq.-seq., seq.-prof., prof.-prof.)
n Lock already aligned sequences together
n Align two groups using sum-of-pairs score for dynamic programming
Summary
n Similarity vs. homology n Dynamic programming n Local alignment n Needleman – Wunsch n Global alignment n Smith – Waterman n Affine gap penaltyn Substitution matrices
n PAM
n BLOSUM
n BLAST
n PHI-BLAST
n Multiple alignment n PSI-BLAST