Sequence Alignment.


1 Sequence Alignment

2 Outline
DNA Sequence Comparison: First Success Stories
Dynamic Programming vs. Recursion
Relation of Sequences
Comparing Sequences
Hamming Distance vs. Edit Distance
Sequence Alignment
Longest Common Subsequence Problem
Scoring Matrices: PAM and BLOSUM
Local vs. Global Alignment

3 Problem Solving Approach
Dynamic Programming: bottom-up approach; each subproblem is calculated once; FIFO approach; memory: result table
Divide and Conquer (recursion): top-down approach; repetitive calculations; LIFO approach; memory: return point + environment variables on the runtime stack
Example: Fibonacci
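The Fibonacci example named on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names `fib_recursive` and `fib_table` and the call counter are assumptions made here to contrast the repeated work of plain recursion with the fill-once result table of dynamic programming.

```python
def fib_recursive(n, calls):
    """Top-down recursion: the same subproblems are recomputed many times."""
    calls[0] += 1  # count every call to expose the repeated work
    if n < 2:
        return n
    return fib_recursive(n - 1, calls) + fib_recursive(n - 2, calls)


def fib_table(n):
    """Bottom-up dynamic programming: each entry of the table is filled once."""
    table = [0] * (n + 2)
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


calls = [0]
print(fib_recursive(20, calls))  # 6765
print(calls[0])                  # number of recursive calls, in the thousands
print(fib_table(20))             # 6765, using only 19 additions
```

The recursive version keeps its pending work on the runtime stack, while the table version only needs the array of earlier results.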

4 DNA Sequence Comparison: First Success Story
Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s function.
Computing a similarity score between two genes tells how likely it is that they have similar functions.
Dynamic programming is a technique for revealing similarities between genes.


6 Relation Of Sequences
Homolog: has a common ancestor
Ortholog: gene or protein family members in two different organisms
Paralog: gene or protein family members in one organism
Xenolog: has some similarities, but not through direct descent from a common ancestor (for example, a gene acquired by horizontal transfer)

7 Relation Of Sequences

8 Comparing Sequences
Sequence Alignment
Pairwise Sequence Alignment: between 2 sequences
Multiple Sequence Alignment: between more than 2 sequences

9 Aligning Sequences without Insertions and Deletions: Hamming Distance
Given two DNA sequences v and w:
v : ATATATAT
w : TATATATA
The Hamming distance $d_H(v, w) = 8$ is large, but the sequences are very similar.
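As a small illustration, the Hamming distance can be computed by comparing the two sequences position by position. The function name `hamming_distance` is chosen here for illustration; the sequences are the ATATATAT / TATATATA pair used in this deck.

```python
def hamming_distance(v, w):
    """Count positions at which two equal-length sequences differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8: every position differs
```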

10 Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v : ATATATAT-
w : -TATATATA
The edit distance: $d(v, w) = 2$.
Hamming distance neglects insertions and deletions in DNA.

11 Edit Distance
Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.
$d(v, w)$ = minimum number of elementary operations to transform v into w
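A minimal sketch of how this definition can be computed with dynamic programming, assuming unit cost for each insertion, deletion, and substitution. The function name `edit_distance` and the table name `d` are illustrative choices, not taken from the slides.

```python
def edit_distance(v, w):
    """Levenshtein distance with unit costs, filled bottom-up."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the first i letters of v and first j of w
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i letters
    for j in range(m + 1):
        d[0][j] = j  # insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # match or substitution
    return d[n][m]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
```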

12 Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: $d_H(v, w) = 8$
Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance: $d(v, w) = 2$ (one insertion and one deletion)
How do we find which j goes with which i?

13 Edit Distance: Example
TGCATAT → ATCCGAT in 5 steps
also TGCATAT → ATCCGAT in 4 steps
Can it be done in 3 steps?

14 Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: an alignment of V and W with 4 matches, annotated with the four column types: match, mismatch, insertion, and deletion; insertions and deletions are together called indels]

15 Longest Common Subsequence (LCS) – Alignment without Mismatches
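Given two sequences $v = v_1 v_2 \dots v_n$ and $w = w_1 w_2 \dots w_m$, a common subsequence of v and w is given by
a sequence of positions in v: $1 \le i_1 < i_2 < \dots < i_t \le n$
and a sequence of positions in w: $1 \le j_1 < j_2 < \dots < j_t \le m$
such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every $1 \le k \le t$.
The LCS of v and w is a common subsequence for which t is maximal.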

16 Edit Graph for LCS Problem
Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: find a path with the maximum number of diagonal edges.
[Figure: edit graph with columns j = 1..8 and rows i = 1..7 labeled T, G, C, A, T, A, C]

17 Computing LCS
Let $v_i$ = prefix of v of length i: $v_1 \dots v_i$
and $w_j$ = prefix of w of length j: $w_1 \dots w_j$
The length of LCS($v_i$, $w_j$) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$

18 Computing LCS (cont’d)
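$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$
[Figure: vertex (i, j) with incoming edges from (i-1, j) and (i, j-1), and a diagonal edge of weight 1 from (i-1, j-1)]
Every alignment path is from source to sink.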

19 Every Path in the Grid Corresponds to an Alignment
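V = A T - G T
    | |   |
W = A T C G -
[Figure: grid with W = ATCG along one axis and V = ATGT along the other, positions 1 to 4, showing the path for this alignment]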

20 Alignment as a Path in the Edit Graph
v = A T _ G T T A T _
w = A T C G T _ A _ C
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
[Figure: edit graph with v and w along the axes, positions 1 to 7, showing this path]
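The correspondence between an alignment and a path can be sketched in a few lines. The helper name `alignment_to_path` is hypothetical; it assumes the two rows of the alignment have equal length, use `_` for gaps, and never place a gap in both rows of the same column.

```python
def alignment_to_path(v_row, w_row, gap="_"):
    """Convert the two rows of an alignment into vertices (i, j) of the edit graph."""
    if len(v_row) != len(w_row):
        raise ValueError("alignment rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1  # a letter of v was consumed
        if b != gap:
            j += 1  # a letter of w was consumed
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
# [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
```

A column with letters in both rows is a diagonal edge; a column with a gap advances only one of the two indices.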

21 Alignment: Dynamic Programming
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$
Arrows show where the score originated from:
↑ if from the top
← if from the left
↖ if $v_i = w_j$

22 Backtracking Example
Find a match in row and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since $v_i = w_j$, $s_{i,j} = s_{i-1,j-1} + 1$:
$s_{2,2} = [s_{1,1} = 1] + 1$
$s_{2,5} = [s_{1,4} = 1] + 1$
$s_{4,2} = [s_{3,1} = 1] + 1$
$s_{5,2} = [s_{4,1} = 1] + 1$
$s_{7,2} = [s_{6,1} = 1] + 1$
[Figure: partially filled dynamic programming table for v and w, indices 1 to 7]

23 LCS Algorithm
LCS(v, w)
  for i ← 0 to n
    s[i,0] ← 0
  for j ← 1 to m
    s[0,j] ← 0
  for i ← 1 to n
    for j ← 1 to m
      s[i,j] ← max of:
                 s[i-1,j]
                 s[i,j-1]
                 s[i-1,j-1] + 1, if v_i = w_j
      b[i,j] ← “↑” if s[i,j] = s[i-1,j]
               “←” if s[i,j] = s[i,j-1]
               “↖” if s[i,j] = s[i-1,j-1] + 1
  return (s[n,m], b)

24 Printing LCS: Backtracking
PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if b[i,j] = “↖”
    PrintLCS(b, v, i-1, j-1)
    print v_i
  else
    if b[i,j] = “↑”
      PrintLCS(b, v, i-1, j)
    else
      PrintLCS(b, v, i, j-1)
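The LCS pseudocode and its backtracking step can be sketched in Python as below. This is one possible rendering under assumptions made here: the backtracking pointers are stored as the strings "up", "left", and "diag" instead of arrows, and the recursive PrintLCS is replaced by an equivalent loop named `read_lcs` that returns the subsequence rather than printing it.

```python
def lcs(v, w):
    """Fill the score table s and the backtracking pointers b."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j], b[i][j] = s[i - 1][j], "up"
            if s[i][j - 1] > s[i][j]:
                s[i][j], b[i][j] = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
    return s[n][m], b


def read_lcs(b, v, i, j):
    """Follow the pointers back from (i, j), collecting letters on diagonal edges."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


v, w = "ATGTTAT", "ATCGTAC"
length, pointers = lcs(v, w)
print(length, read_lcs(pointers, v, len(v), len(w)))  # 5 ATGTA
```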

25 LCS Runtime
It takes O(nm) time to fill in the n×m dynamic programming matrix.
Why O(nm)? The pseudocode consists of a “for” loop nested inside another “for” loop, and each of the n×m entries is filled in constant time.

26 From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
In the LCS Problem, we scored 1 for matches and 0 for indels.
Consider penalizing indels and mismatches with negative scores.
Simplest scoring schema:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty

27 Simple Scoring
When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
#matches - μ(#mismatches) - σ(#indels)
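As a worked example of this formula, the sketch below scores an alignment that is already given as two gapped rows. The function name `score_alignment` and the penalty values in the example call are assumptions for illustration; the alignment is the ATGTTAT / ATCGTAC one shown earlier, written with `-` for gaps.

```python
def score_alignment(v_row, w_row, mu, sigma, gap="-"):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    matches = mismatches = indels = 0
    for a, b in zip(v_row, w_row):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=1))  # 1
```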

28 The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema.
Input: strings v and w and a scoring schema
Output: alignment of maximum score
Edge weights: ↑ and → edges = -σ; diagonal edges = +1 if match, -μ if mismatch
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \ne w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$
μ : mismatch penalty
σ : indel penalty
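The global alignment recurrence can be sketched as follows. This is a minimal illustration under assumptions made here: the function name `global_alignment`, integer penalties, the first row and column initialized to multiples of -σ, and a traceback that recomputes which predecessor produced each score instead of storing arrows.

```python
def global_alignment(v, w, mu=1, sigma=1):
    """Return (score, v_row, w_row) for a best global alignment with +1/-mu/-sigma scoring."""
    n, m = len(v), len(w)

    def diag(i, j):
        return 1 if v[i - 1] == w[j - 1] else -mu

    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i  # v_1..v_i aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -sigma * j  # w_1..w_j aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + diag(i, j),
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)

    # Trace back from the sink (n, m) to the source (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + diag(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, v_row, w_row = global_alignment("ATGTTAT", "ATCGTAC")
print(score)
print(v_row)
print(w_row)
```

When several alignments share the best score, this traceback returns just one of them; which one depends on the order of the checks.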


30 Scoring Matrices
To generalize scoring, consider a (4+1) × (4+1) scoring matrix δ. For an amino acid sequence alignment, the scoring matrix would be of size (20+1) × (20+1). The extra row and column hold the scores for comparison with the gap character “-”. This simplifies the algorithm as follows:
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

31 Making a Scoring Matrix
Scoring matrices are created based on biological evidence.
Alignments can be thought of as two sequences that differ due to mutations.
Some of these mutations have little effect on the protein’s function; therefore some penalties, $\delta(v_i, w_j)$, will be less harsh than others.

32 Scoring Matrix: Example
     A   R   N   K
A    5  -2  -1  -1
R    -   7  -1   3
N    -   -   7   0
K    -   -   -   6
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
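The arithmetic of this example can be checked with a small lookup table. The dictionary `delta` below is an assumption: it holds the scores for A, R, N, and K as they appear in the BLOSUM50 matrix, which is consistent with the total of 11 given on the slide. The helper name `pair_score` is hypothetical.

```python
# Upper triangle of the 4 x 4 fragment (values as in BLOSUM50; assumed here).
delta = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}


def pair_score(a, b):
    """Look up a symmetric score regardless of the order of the two residues."""
    return delta[(a, b)] if (a, b) in delta else delta[(b, a)]


column_scores = [pair_score(a, b) for a, b in zip("AKRANR", "KAAANK")]
print(column_scores)       # [-1, -1, -2, 5, 7, 3]
print(sum(column_scores))  # 11
```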

33 Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
Polar to polar: aspartate → glutamate
Nonpolar to nonpolar: alanine → valine
Similarly behaving residues: leucine to isoleucine

34 Scoring matrices
Amino acid substitution matrices: PAM, BLOSUM
DNA substitution matrices:
DNA is less conserved than protein sequences
It is less effective to compare coding regions at the nucleotide level

35 PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
some residues may have mutated several times
some residues may have returned to their original state
some residues may not have changed at all

36 PAMX
$\text{PAM}_x = \text{PAM}_1^x$
$\text{PAM}_{250} = \text{PAM}_1^{250}$
PAM250 is a widely used scoring matrix.
[Table: PAM250 scoring matrix, rows and columns labeled Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), ..., Trp (W), Tyr (Y), Val (V)]
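The relation between PAM1 and PAMx is a matrix power, which can be illustrated on a toy example. Everything below is an assumption made for illustration: a two-letter alphabet, a made-up 2 × 2 mutation probability matrix `toy_pam1` with a 1% chance of change per step, and the helper names `mat_mul` and `mat_pow`. Real PAM matrices are 20 × 20 and are further converted into log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, x):
    """Raise a square matrix to the non-negative integer power x."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1": each letter stays the same with probability 0.99 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Even after 250 steps the probability of seeing the original letter stays above one half in this toy model, because some positions never change and others change and revert.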

37 BLOSUM
Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
Matrix name indicates evolutionary distance
BLOSUM62 was created using sequences sharing no more than 62% identity

38 The Blosum50 Scoring Matrix

39 Comparing PAM and BLOSUM


41 Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.


43 Local vs. Global Alignment
In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment.

44 Local vs. Global Alignment (cont’d)
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C
Local alignment (better alignment to find a conserved segment):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

45 Local Alignment: Example
Compute a “mini” Global Alignment to get Local
[Figure: edit graph contrasting the local alignment path with the global alignment path]

46 Similarity Based on Dot Plots
[Figures: simple dot plot; dot plot with 75% identity filtering]
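A dot plot can be sketched as a list of position pairs. The interpretation of "75% identity filtering" used here is an assumption: a dot is kept at (i, j) only if at least 75% of the letters agree in a window starting at v[i] and w[j]. The function name `dot_plot` and its parameters `window` and `min_identity` are illustrative.

```python
def dot_plot(v, w, window=1, min_identity=1.0):
    """Return the (i, j) pairs whose windows reach the identity threshold."""
    dots = []
    for i in range(len(v) - window + 1):
        for j in range(len(w) - window + 1):
            same = sum(v[i + k] == w[j + k] for k in range(window))
            if same / window >= min_identity:
                dots.append((i, j))
    return dots


# Simple dot plot: one dot per pair of identical letters.
print(dot_plot("ACGT", "ACGT"))
# Filtered dot plot: windows of 4 letters with at least 75% identity.
print(dot_plot("ACGTA", "ACCTA", window=4, min_identity=0.75))
```

With `window=1` every chance match produces a dot; a longer window with a threshold suppresses isolated dots and leaves the diagonals that indicate similar segments.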

47 Dotplot for a small protein against itself
[Figure labels: identity (i = j); similarity of sequence with other parts of itself]

48 Local Alignments: Why?
Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species.
A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

49 The Local Alignment Problem
Goal: Find the best local alignment between two strings
Input: Strings v, w and scoring matrix δ
Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

50 The Problem with this Problem
Long run time $O(n^4)$:
- In the grid of size n × n there are ~$n^2$ vertices (i,j) that may serve as a source.
- For each such vertex, computing alignments from (i,j) takes $O(n^2)$ time.

51 The Local Alignment Recurrence
The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment.
The recurrence:
$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
Power of ZERO: the 0 is the only change from the original recurrence of a Global Alignment, since there is only one “free ride” edge entering into every vertex.
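The local alignment recurrence can be sketched as below. The assumptions here are a simple scoring scheme in place of a full matrix δ (+1 for a match, -μ for a mismatch, -σ for an indel), integer penalties, and the function name `local_alignment`. The 0 in the `max` is the free ride from the source, and the traceback stops as soon as it reaches a cell with score 0.

```python
def local_alignment(v, w, mu=1, sigma=1):
    """Return (score, v_row, w_row) for a best local alignment."""
    n, m = len(v), len(w)

    def delta(a, b):
        return 1 if a == b else -mu

    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,  # free ride from the source
                          s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j

    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


v = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
w = "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"
score, v_row, w_row = local_alignment(v, w)
print(score)
print(v_row)
print(w_row)
```

The whole table is filled once, so the running time is O(nm) rather than the O(n^4) of aligning from every possible source vertex.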

52 Local Align vs. Global Align

53 Local Align vs. Global Align

54 Further Reading
Better gap penalty strategy: -(ρ + σx)
Multiple Sequence Alignment:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
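The gap penalty -(ρ + σx) can be illustrated with a short calculation. The reading assumed here is the usual affine one: ρ is a one-time cost for opening a gap, σ is the cost per gap position, and x is the gap length. The function names and the example values ρ = 10, σ = 1 are hypothetical.

```python
def affine_gap_penalty(x, rho, sigma):
    """Score contribution -(rho + sigma * x) of one gap of length x."""
    if x <= 0:
        return 0
    return -(rho + sigma * x)


def linear_gap_penalty(x, sigma):
    """Score contribution -sigma * x of x independent indels."""
    return -sigma * x


# One gap of length 5 versus five separate gaps of length 1.
print(affine_gap_penalty(5, rho=10, sigma=1))      # -15
print(5 * affine_gap_penalty(1, rho=10, sigma=1))  # -55
print(linear_gap_penalty(5, sigma=1))              # -5 either way
```

Under this scheme a single long gap is much cheaper than many scattered gaps, whereas a linear penalty cannot tell the two cases apart.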

55 Heuristic Alignment Algorithms


