Bioinformatics
Multiple Alignment
Overview
• Introduction Multiple Alignments
• Global multiple alignment
– Introduction
– Scoring
– Algorithms
Algorithms
[Diagram: Multiple Alignment; HMM; Pattern recognition; Dynamic Programming; Heuristic Searches; Motif Searches; Database searches]
Chapter 2
Introduction
• Global multiple alignment (ClustalW)
– Proteins, nucleotides
– Long stretches of conservation essential
– Identification of protein family profiles
– Score gaps
• Local multiple alignments (Motif Detection, Profile construction)
– Proteins, nucleotides
– Short stretches of conservation (12 NT, 6 AA)
– Identification of regulatory motifs (DNA, protein)
– No explicit gap scoring
– Explicit use of a profile
Introduction
Evolution
• duplication
• speciation
Primary sequence
Homologs in related organisms
Families of proteins
Multiple sequence alignment
Features characteristic of the whole family
Introduction
Multiple sequence alignment
Features characteristic for the protein family
Profile (HMM)
Detect remote members of the family
Phylogeny
Reconstruct phylogenetic relationships
Scoring a multiple alignment
Assumptions:
– Independence between columns
– Residues within a column are independent (i.e. representative members of a sequence family should be chosen, and all evolutionary subfamilies should be represented)
– Alignment score: score for all the columns and gaps
$$S(m) = G + \sum_i S(m_i)$$
where $m_i$ is column $i$ of the alignment $m$ and $G$ is the score for the gaps
Scoring
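• The sum-of-pairs (SP) score is an approximation: each column is scored by summing the substitution scores of all pairs of residues in it

$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l) \qquad (1)$$

where $s(a,b)$ comes from a scoring matrix such as PAM or BLOSUM
• But for a three-way alignment the score should be

$$\log\frac{p_{abc}}{q_a q_b q_c} \qquad (2)$$

instead of

$$\log\frac{p_{ab}}{q_a q_b} + \log\frac{p_{bc}}{q_b q_c} + \log\frac{p_{ac}}{q_a q_c} \qquad (3)$$

• SP problem:
– Column with N sequences that all have L (score L-L is 5): SP score $5 \times N(N-1)/2$
– Column with N−1 sequences that have L and one that has G (score L-G is −4): SP score $5 \times N(N-1)/2 - 9(N-1)$
– Relative difference between the two scores:

$$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$$

The relative difference in score between the correct and the incorrect alignment decreases with the number of sequences in the alignment.
[Figure: example sequences RAL, RTL, CAL, RAG; labels a, b, c]
Counterintuitive!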
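The SP problem can be checked numerically. The following is a minimal sketch, assuming only the two substitution scores quoted on the slide (L-L = 5, L-G = −4); the names `SCORES`, `pair_score`, `sp_column_score` and `relative_difference` are illustrative and not part of the course material.

```python
from itertools import combinations

# Only the two substitution scores quoted in the text are needed here.
SCORES = {("L", "L"): 5, ("G", "L"): -4}


def pair_score(a, b):
    """Look up s(a, b); the key is sorted so the lookup is symmetric."""
    return SCORES[tuple(sorted((a, b)))]


def sp_column_score(column):
    """Sum-of-pairs score of one alignment column (sum over all residue pairs)."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def relative_difference(n):
    """Relative score loss when one of n aligned L residues is replaced by G."""
    correct = sp_column_score("L" * n)
    incorrect = sp_column_score("L" * (n - 1) + "G")
    return (correct - incorrect) / correct


for n in (3, 5, 10, 20):
    # The second and third printed values should agree: 18 / (5 N).
    print(n, relative_difference(n), 18 / (5 * n))
```

The loop shows the relative difference shrinking as N grows, even though more L residues are additional evidence that the G is misplaced.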
Algorithm
Multidimensional dynamic programming
Tedious formalism (optimal alignment)
• Computation of the whole dynamic programming matrix with $L_1 \times L_2 \times \dots \times L_N$ entries
• Maximize over all $2^N - 1$ combinations of gaps in a column
• Time complexity $O(2^N L^N)$
Clever algorithm: Carrillo & Lipman (MSA)
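To make the cost concrete, the sketch below enumerates the $2^N - 1$ ways of extending an alignment by one column and multiplies by the number of matrix cells. It is an illustration under simplifying assumptions (all sequences of equal length, one boundary layer per dimension, so $(L+1)^N$ cells); `column_moves` and `dp_cost` are hypothetical helper names.

```python
from itertools import product


def column_moves(n):
    """All ways to add one column to an alignment of n sequences.

    1 means the sequence contributes a residue, 0 means it gets a gap.
    The all-gap column is excluded, which leaves 2**n - 1 moves.
    """
    return [move for move in product((0, 1), repeat=n) if any(move)]


def dp_cost(n, length):
    """Rough operation count: (number of cells) x (moves examined per cell)."""
    cells = (length + 1) ** n  # includes the boundary layer in each dimension
    return cells * len(column_moves(n))


for n in range(2, 7):
    print(n, len(column_moves(n)), dp_cost(n, 100))
```

With sequences of length 100 the count already grows by roughly two orders of magnitude per added sequence, which is why exact multidimensional dynamic programming is limited to a handful of sequences.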
Algorithm
N(N−1)/2 pairwise sequence alignments
Multiple sequence alignment
Progressive alignment: “once a gap, always a gap”
Similarity matrix

      A     B     C
B   142
C    95   101
D    60    62    55

Progressive clustering
Guide tree (leaf order: D C B A)
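The similarity matrix above is enough to rebuild the guide tree by hand or in code. The sketch below uses simple average-linkage clustering on similarities; this is an assumption chosen for brevity, not necessarily the clustering used on the slide (ClustalW, for example, uses neighbour joining). The function names are illustrative.

```python
from itertools import combinations

# Pairwise similarity scores taken from the matrix above.
similarity = {
    frozenset("AB"): 142,
    frozenset("AC"): 95, frozenset("BC"): 101,
    frozenset("AD"): 60, frozenset("BD"): 62, frozenset("CD"): 55,
}


def leaves(tree):
    """Return the sequence names below a tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    return [name for subtree in tree for name in leaves(subtree)]


def cluster_similarity(t1, t2, sim):
    """Average pairwise similarity between the leaves of two clusters."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(sim[frozenset(pair)] for pair in pairs) / len(pairs)


def guide_tree(names, sim):
    """Repeatedly merge the two most similar clusters until one tree remains."""
    clusters = list(names)
    while len(clusters) > 1:
        t1, t2 = max(combinations(clusters, 2),
                     key=lambda p: cluster_similarity(p[0], p[1], sim))
        clusters = [c for c in clusters if c not in (t1, t2)]
        clusters.append((t1, t2))
    return clusters[0]


tree = guide_tree("ABCD", similarity)
print(tree)  # ('D', ('C', ('A', 'B')))
```

A and B (142) are joined first, C is added next (average similarity 98 to the A/B cluster), and D joins last, which matches the leaf order D C B A of the guide tree.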
Algorithm
Progressive alignment methods
• Hierarchical (heuristic): succession of pairwise alignments
• Two sequences are aligned by standard pairwise alignment
• This alignment is fixed
• Align next sequence
• Different algorithms
– Order of the alignment
– Progression:
» Alignment of a new sequence to a growing alignment
» Subfamilies are built up on a tree structure and alignments are aligned to alignments
– Process used to align and score sequences to alignments
• Heuristic approach:
– Align most similar pairs of sequences first
– Similarity is read from a guide tree (quick and dirty, and therefore unsuitable for phylogenetic inference)
Algorithm
Disadvantage
But it is advantageous to use position-specific information from an existing alignment
e.g. mismatches at highly conserved positions should be penalized more than mismatches at variable positions
e.g. gap penalties might be higher in regions that contain no gaps than in regions that already contain gaps
PROFILE ALIGNMENT
(hidden Markov, frequency matrices)
C T T G T C A T G T C A C T T C A T T G
$$\phi = \begin{pmatrix} 0 & 0.25 & 1 & 0 & 0 \\ 0 & 0.25 & 0 & 0.75 & 0 \\ 0.25 & 0 & 0 & 0.25 & 0.25 \\ 0.75 & 0.5 & 0 & 0 & 0.75 \end{pmatrix}$$
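A frequency-matrix profile of this kind can be computed directly from a gap-free block of aligned sequences. The sketch below is illustrative only: the four aligned sequences are hypothetical (they are not the example from the slide), gaps and pseudocounts are ignored, and `frequency_matrix` and `profile_score` are made-up names.

```python
from collections import Counter

ALPHABET = "ACGT"


def frequency_matrix(aligned):
    """Return {residue: [frequency in column 0, column 1, ...]}.

    Assumes equal-length, gap-free aligned sequences.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = {residue: [] for residue in ALPHABET}
    for column in zip(*aligned):
        counts = Counter(column)
        for residue in ALPHABET:
            profile[residue].append(counts[residue] / len(aligned))
    return profile


def profile_score(profile, window):
    """Probability of a window under the profile (product of column frequencies)."""
    p = 1.0
    for position, residue in enumerate(window):
        p *= profile[residue][position]
    return p


# Hypothetical alignment, chosen only to give simple frequencies.
aligned = ["ACGT", "ACGA", "ACTT", "TCGT"]
profile = frequency_matrix(aligned)
for residue in ALPHABET:
    print(residue, profile[residue])
print(profile_score(profile, "ACGT"))
```

Because each column has its own frequencies, a mismatch at the fully conserved second column (C) costs far more than a mismatch at a variable column, which is the position-specific behaviour motivated above. In practice pseudocounts are usually added so that unseen residues do not get probability zero.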
Algorithm
PROFILE based progressive multiple alignment : CLUSTALW
– Construct distance matrix by pairwise dynamic programming
– Convert similarity scores to evolutionary distances
– Construct a guide tree (clustering, neighbour-joining clustering)
– Progressively align in order of decreasing similarity
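The first two steps can be sketched as follows. This is a simplified illustration, not the CLUSTALW implementation: it uses a plain Needleman-Wunsch alignment with hypothetical match/mismatch/gap scores and a linear gap penalty, and it assumes the distance is one minus the fraction of identical residues over the aligned (non-gap) positions. The sequences and function names are invented for the example.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(a, b):
    """1 - fraction of identical residues over aligned, non-gap positions."""
    x, y = global_align(a, b)
    aligned = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not aligned:
        return 1.0
    return 1 - sum(p == q for p, q in aligned) / len(aligned)


# Hypothetical input sequences.
seqs = {"s1": "GATTACA", "s2": "GATTACA", "s3": "GACTATA", "s4": "CCCTTTT"}
for (name_a, a), (name_b, b) in combinations(seqs.items(), 2):
    print(name_a, name_b, round(distance(a, b), 3))
```

The resulting N(N−1)/2 distances are what a clustering method would then turn into the guide tree that fixes the order of the progressive alignment.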