• No results found

Introduction to the theory of sequence alignment

N/A
N/A
Protected

Academic year: 2021

Share "Introduction to the theory of sequence alignment"

Copied!
52
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Introduction to the theory of

sequence alignment

Yves Moreau Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002

(2)

References

n R. Durbin, A. Krogh, S. Eddy, G. Mitchinson, Biological Sequence

Analysis: Probabilistic Models of Proteins and Nucleic Acids, Oxford

University Press, 1998

n S.F. Altschul et al., Basic Local Alignment Search Tool, J. Mol. Biol.,

No. 215, pp. 403-410, 1990

n S.F. Altschul et al., Gapped BLAST and PSI-BLAST: a new

generation of protein database search programs, Nucleic Acids

(3)

Overview

n Alignment of two sequences

n DNA n Proteins n Similarity vs. homology n Similarity n Homology n Orthology n Paralogy n Elements of an alignment n Dynamic programming

(4)

Overview

n Global alignment

n Needleman-Wunsch algorithm

n Local alignment

n Smith-Waterman algorithm n Affine gap penalty

n Substitution matrices n PAM n BLOSUM n Gap penalty n Significance evaluation n BLAST

(5)

Overview

n PHI-BLAST

n Multiple alignment n PSI-BLAST

(6)
(7)
(8)

Evolution of sequence databases

n Genbank n SWISSProt

(9)

Molecular evolution

n Genomes through imperfect replication and natural

selection

(10)

Similarity vs. homology

n Sequences are similar if they are sufficiently resembling at the

sequence level (DNA, protein, …)

n Similarity can arise from

n Homology

n Convergence (functional constraints)

n Chance

n Sequences are homologous if they arise from a common ancestor

n Homologous sequences are paralogous if their differences involve a

gene duplication event

n Homologous sequences are orthologous iftheir differences are not

(11)

Orthology vs. paralogy

δ-globin - hu man β-globin - human β-globin - mouse β-globin -ch icken leghemoglob in - lupin α-globin - ch imp α-globin -mouse myog lobin -wha le

(12)

Phylogeny

n Relationships between genes and proteins can be inferred on the

basis of their sequences

(13)

Homology for the prediction of structure

and function

n Homologous proteins have comparable structures

n Homologous proteins potentially have similar functions

(14)

Homology for prediction with DNA

n Conserved regions arise from evolutionary pressure and

are therefore functionally important

n Genes

n Control regions

n Comparative genomics

n Genes can be predicted by comparing genomes at an

appropriate evolutionary distance (e.g., mouse and human)

(15)
(16)

Elements of an alignment

n Types of alignments

n DNA vs. protein

n Pairwise va. multiple alignment

n Global alignment

n Local alignment

n Scoring model for alignments

n Substitutions

n Gaps (insertions, deletions)

n Substitution matrix and gap penalty

n Algorithm n Dynamic programming n Heuristic n Significance evaluation

HEAGAWGHE-E

--P-AW-HEAE

(17)

Global alignment

x y

(18)

Global alignment

n Alignment of ‘human alpha globin’ against ‘human beta globin’, ‘lupin

leghemoglobin’ and ‘glutathionine S-transferase homolog F11G11.2’ (‘+’ for good substitutions)

HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPEFQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE Strong homology

Low similarity / structural homology

(19)

Local alignment

x y

(20)

Substitution matrix and gap penalty

n The alignment of two residues can be more or less likely

n To compute the quality of an alignment, we assign a gain or a penalty

to the alignment of two residues

n Gaps also have a penalty

……… … … … … … … … ……… 7 -3 0 0 1 -1 Q ……… -3 13 -4 -2 -4 -1 C ……… 0 -4 8 2 -2 -2 D ……… 0 -2 2 7 -1 -1 N ……… 1 -4 -2 -1 7 -2 R ……… -1 -1 -2 -1 -2 5 A EGHILKMFPSTWYV Q C D N R A HEAGAWGHE-E --P-AW-HEAE

(21)

Substitution matrix for DNA

n Standard 5 -4 -4 -4 T -4 5 -4 -4 G -4 -4 5 -4 C -4 -4 -4 5 A T G C A

(22)

Dynamic programming

n To align is to find the minimum penalty / maximum score path

through the penalty table = DYNAMIC PROGRAMMING

n Substitution matrix = BLOSUM 50 n Gap penalty = -8 -8 -8 -8 -8 -8 -8 -8 0 * -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 * 6 -1 6 0 -3 -1 -1 E 6 -1 6 0 -3 -1 -1 E 0 -2 0 10 -3 -2 -2 H -3 -3 -1 -3 -1 6 0 E 0 -3 5 0 5 -1 -2 A -3 -3 -1 -3 -1 6 0 E -2 -3 -2 -2 -2 0 10 H -3 15 -3 -3 -3 -3 -3 W 0 -3 5 0 5 -1 -2 A -2 -4 -1 -2 -1 -1 -2 P G W A G A E H HEAGAWGHE-E --P-AW-HEAE

(23)

Dynamic programming

C1 C2 C3 C4 C5 C7 C6 C8 5 7 3 4 2 5 2 6 4 3 5 3 5 Shortest path from C1 to C8

Shortest path from C1 to C5 Shortest path from C5 to C8

n Belman’s optimality principle

(24)
(25)

Global alignment

n Needleman-Wunsch algorithm

n Progressively complete the table F(i,j) (!!! column, row) that keeps

track of the maximum score for the alignment of sequence x up to xi to sequence y up to yj

n Substitution matrix s(x, y) and gap penalty d n Recurrence

I G A xi A I G A xi G A xi

-L G V yj G V yj - - S L G V yj

{ F(i-1,j-1) + s(xi, yj) substitution

F(i,j) = max { F(i-1,j) – d deletion

{ F(i,j-1) – d insertion

(26)

-56 -48 -40 -32 -24 -16 -8 0 * -80 -72 -64 -56 -48 -40 -32 -24 -16 -8 * 1 2 -5 -19 -37 -60 -73 E -9 -5 3 -11 -29 -52 -65 E -12 -15 -7 -3 -21 -44 -57 H -15 -14 -12 -6 -11 -24 -38 E -12 -12 -11 -11 -3 -16 -30 A -15 -12 -9 -16 -16 -8 -22 E -7 -13 -9 -8 -13 -18 -14 H -13 -5 -15 -7 -6 -11 -18 W -36 -28 -20 -12 -4 -3 -10 A -49 -42 -33 -25 -17 -9 -2 P G W A G A E H F(i,j) F(i-1,j) F(i,j-1) F(i-1,j-1) s(xi, yj) – d – d

Start from to left

Complete progressively by recurrence Use traceback pointers

{ F(i-1,j-1) + s(xi, yj)

F(i,j) = max { F(i-1,j) – d { F(i,j-1) – d

(27)

Local alignment

n Smith-Waterman algorithm

n Best alignment between subsequences of x and y

n If the current alignment has a negative score, it is better

to start a new alignment

{ 0 restart

{ F(i-1,j-1) + s(xi, yj) substitution

F(i,j) = max { F(i-1,j) – d deletion

{ F(i,j-1) – d insertion

(28)

6 14 22 18 12 0 0 0 2 10 0 H 0 0 0 0 0 0 0 * 0 0 0 0 0 0 0 0 0 0 * 26 27 -19 0 0 0 E 16 20 28 0 0 0 E 4 10 18 4 0 0 H 0 4 12 18 13 6 0 E 4 0 5 13 21 8 0 A 10 4 0 0 8 16 2 E 12 20 0 2 0 0 0 W 0 0 5 0 5 0 0 A 0 0 0 0 0 0 0 P G W A G A E H

Start from top left

Complete progressively by recurrence Traceback from the highest score

and stop at zero

AWGHE AW-HE

(29)

Substitution matrices

n How can we choose a reasonable substitution matrix? n Look at a set of confirmed alignments (with gaps) and

compute the amino acid frequences qa, the substitution frequences pab, and the gap function f(g)

n Likelihood model

n Random sequences: P(x,y|R) = ΠiqxiΠjqyj n Alignment: P(x,y|M) = Πipxixj

n Odds ratios: P(x,y|M)/P(x,y|R) = Πipxixj/(ΠiqxiΠjqyj)

n Log-odds score: S(x,y) = Σis(xi,yi) with s(a,b) = log(pab/qaqb)

(30)

PAM matrix

n Point Accepted Mutations matrix n Problems

n Alignments are not independent for related proteins

n Different alignments correspond to different evolution times

n PAM1 matrix

n Tree of protein families

n Estimate ancestral sequences

n Estimate mutations at short evolutionary distance n Scale to a substitution matrix

n 1% Point Accepted Mutations (PAM1)

n PAM250 is 250% Point Accepted Mutations (~20%

(31)

BLOSUM matrix

n BLOCKS SUbstitution Matrix

n PAM does not work so well at large evolutionary

distances

n Ungapped alignments of protein families from the

BLOCKS database

n Group sequences with more than L% identical amino

acids (e.g., BLOSUM62)

n Use the substitution frequency of amino acids between

the different groups (with correction for the group size) to derive the substitution matrix

(32)

BLAST

n For large databases, Smith-Waterman local alignment is

too slow

n Basic Local Alignment Search Tool (BLAST) is a fast

heuristic algorithm for local alignment (http://www.ncbi.nlm.nih.gov/Entrez)

n BLASTP – protein query on protein database

n BLASTN – nucleotide query on nucleotide database

n BLASTX – translated nucleotide query on protein database

(translation into the six reading frames)

n TBLASTN – protein query on translated nucleotide db

(33)

BLASTP

n Step 1: Find all words of length w (e.g., w=3) for which

there is a match in the query sequence with score at least T (e.g., T=11) for the chosen substitution matrix (e.g., BLOSUM62 with gap penalty 10+g)

n Step 2: Use a finite state automaton to find all matches

(34)

BLASTP

n Step 3: Check which hits have another hit without

overlap within a distance of A (e.g., A=40) (the distance must be identical on the query and on the target)

(two-hits)

n Step 4: Extend the left hit of the two-hits in both

directions by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far (high scoring segment pair HSP)

(35)

BLASTP

n Step 5: Extend the HSPs with normalized score above Sg

(Sg =22 bits) by ungapped alignment ; stop the extension when the score drops by Xg (e.g., Xg=40) under the best score so far ; select the best gapped local alignment

n Step 6: Compute the significance of the alignments ; for

the significant alignments, repeat the gapped alignment with a higher dropoff parameter Xg for more accuracy

(36)

BLASTP

Query Target Two-hits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Hits Local alignment

(37)

0.34 36 Running time 197 198 P20705 H+-transport ATP synthase

81 83 P14942 Glutathione transferase 211 211 P10635 Cytochrome P450 261 262 P10318 Histocompatibility antigen 137 138 P07327 Alcohol dehydrogenase 53 53 P05013 Interferon alpha 128 128 P03435 Hemagglutinin 28 28 P02232 Globin 252 255 P01111 Ras 108 108 P01008 Serine protease inhibitor

275 275 P00762 Serine protease BLAST (# matches) Smith-Waterman Query (SWISS-PROT) Protein family

BLASTP example

(38)
(39)

Multiple alignment

n What to do if we want to compare together the

sequences of several proteins from the same family (not pair by pair)?

n Multiple alignments are, for example, important in the

study of the role played by the residues in the 3D

structure of the proteins (ideally, aligned residues should have the same 3D coordinates and should correspond evolutionarily)

(40)
(41)

Multiple alignment

n Elements of a multiple alignment method

n Scoring model

n Algorithm for finding the best aligment

n Ideally, given a phylogenetic tree for the sequences, the

probability of a multiple alignment is the product of the probabilities of all the events needed for this tree (in practice, not enough data will be available for this)

(42)

Scoring model

n A pragmatic scoring function is of the form

S(m) = G + ΣiS(mi)

G gap score, mi column i of the multiple alignment, S(mi)

score for column i

n Column score: sum of pairs score

S(mi)=ΣlΣk<ls(mik,m

il)

s() substitution matrix

n Sum of pairs score not statistically well-founded

n Extension of log-odds score (3 seqs.): log(pabc/(qaqbqc))

(43)

PSI-BLAST

n Position-Specific Iterated BLAST

n Searching the database not with a single sequence but

with a score matrix describing a protein family is much more sensitive

n Idea:

n Run BLAST for the query sequence n Gather the significant alignments

n Build a specific matrix describing the aligned sequences n Run BLAST using this specific matrix to recruit more seqs. n Build new matrix and iterate until no more seqs. are recruited)

(44)

PSI-BLAST

n Motifs can be found in a protein family; if we search

using such motifs we can detect more distant relationships

n It is difficult to delineate multiple motifs automatically in a

set of proteins; therefore, PSI-BLAST builds a motif as long as the query (Lx20 matrix)

n The motif is not a substitution matrix but gives at each

position a specific score for the presence of an amino

acid and is thus called a position-specific scoring matrix (PSSM) (gap scores are not position specific)

(45)

Position-specific scoring matrix

MFGKRAFVHHYVGEGMEENEFTDARQDLYELEVDYANL MFKRKGFLHWYTGEGMEPVEFSEAQSDLEDLILEYQQY MFSRKAFLHWYTGEGMEEGDFAEADNNVSDLLSEYQQY RGAFLDQFRREDIFKDDLNELDESRETVDCLVQEYEAA RNAFLDNFRRESMFQDDLTELDIARDTVDCLVQEYEAA RSKRAFIDKFEKIDNFSLDMMDDAMHIVQDLLDEYKAV RNAFLEQYKKEAPFQDGLDEFDEARAVVMDLVGEYEAA RDAYMNIFKQTKIFEDNLDEFDSSDEVVKSLIDEYAAA ENYKKESMFSSADGQGNFEEMESSKEITQNLIDEYKSA RNAFLEQYKKEAIFEDDLNEFDSSRDVVADLINEYEAC RNAFMPQYQKEAMFEKNLDEFDEARATVQDLIEEYQAC MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDFEEV MYAKRAFVHWYVGEGMEEGEFSEAREDMAALEKDYEEV MYAKRAFVHWYVGEGMEEGEFSEAREDLAALEKDYEEV MYSKRAFVHWYVGEGMEEGEFSEAREDLAALERDYEEV

(46)

BLAST with PSSMs

n Only minor modifications necessary to run BLAST with a

position-specific scoring matrix (essentially, replace the score provided by the query and the substitution matrix

s(ti,qi) by the score provided by the PSSM s(ti,C))

n To avoid having to modify the parameters of BLAST, the

(47)

0.87 0.34 36 Running time 207 197 198 P20705

H+-transport ATP synthase

142 81 83 P14942 Glutathione transferase 224 211 211 P10635 Cytochrome P450 338 261 262 P10318 Histocompatibility antigen 160 137 138 P07327 Alcohol dehydrogenase 53 53 53 P05013 Interferon alpha 130 128 128 P03435 Hemagglutinin 623 28 28 P02232 Globin 375 252 255 P01111 Ras 111 108 108 P01008

Serine protease inhibitor

286 275 275 P00762 Serine protease PSI-BLAST (# matches) BLAST Smith-Waterman Query (SWISS-PROT) Protein family

PSI-BLAST example

(48)

CLUSTALW

n Multiple alignment can be seen as multidimensional

dynamic programming; however, for N sequences of

length L, the complexity is O(LN) and therefore infeasible

in practice

n CLUSTALW (Cluster and ALign) is an example of

progressive multiple alignment

n Build a guide tree based on sequence similarity (not a real

phylogenetic tree)

n Do pairwise alignment going back up the guide tree

n Pairwise sequence alignment

n Sequence-group alignment

(49)

sequences distance matrix pairwise alignment sequence-group alignment group-group alignment guide tree

(50)

CLUSTALW

n Algorithm

n Construct distance matrix of all pairs

n Perform pairwise gapped alignment of all pairs

n Score as percentage differences in aligned sequences (gaps

excluded)

n Build the guide tree

n Use the neighbor-joining method

n Place the root on the edge where the mean branch length is equal on

both sides

n Use the three to derive a weight for each sequence

n Progressively align along the tree (seq.-seq., seq.-prof., prof.-prof.)

n Lock already aligned sequences together

n Align two groups using sum-of-pairs score for dynamic programming

(51)

Summary

n Similarity vs. homology n Dynamic programming n Local alignment n Needleman – Wunsch n Global alignment n Smith – Waterman n Affine gap penalty

n Substitution matrices

n PAM

n BLOSUM

(52)

n BLAST

n PHI-BLAST

n Multiple alignment n PSI-BLAST

Referenties

GERELATEERDE DOCUMENTEN

Dit zal ertoe leiden dat de afdeling Sales &amp; Marketing, die in rechtstreeks contact staat met de klanten van Kappa Triton, sneller en eenduidiger klanten kan informeren over

This research explains on the role of group dynamics in IT and business alignment and the particular focus is on the influences of team roles in the alignment process of

common end alignment Driving matrix NE/WE =&gt; NE+WE/NE-WE after thermal transient PR alignment acts on reference mass. Local control on marionette

Galvo feedback on NE bench (07/2006) Galvo feedback on NE bench

This file provides example of setting the alignment of \Columns on the page: right (default), left or center.. The column are 0.44\textwidth, and we

It is widely accepted in the optimization community that the gradient method can be very inefficient: in fact, for non- linear optimal control problems where g = 0

 Step 5: Extend the HSPs with normalized score above S g (S g =22 bits) by ungapped alignment ; stop the extension when the score drops by X g (e.g., X g =40) under the best

Secondly several computer guided automated alignment processes have been discussed: the sensitivity table method, merit Function Regression method and differential wavefront