
Multiple Sequence Alignment



Presentation Transcript


  1. Multiple Sequence Alignment Alexei Drummond

  2. Week 3 Learning Outcomes • Be able to compute the Smith-Waterman (local) pairwise alignment of two sequences given a score matrix and gap penalty • Be able to compute the Needleman-Wunsch (global) pairwise alignment of two sequences given a score matrix and gap penalty • Understand the principle of log-odds scoring. CS369 2007

  3. Week 4 Learning Outcomes • Be able to recognize simple problems that are amenable to dynamic programming (DP) and design a DP algorithm to solve such problems. • Understand the principle of linear space optimal pairwise alignment • Understand the principle of quadratic-time pairwise alignment with affine gap penalties. CS369 2007

  4. Computational Biology Pairwise sequence alignment (global and local) Multiple sequence alignment Substitution matrices Database searching BLAST Sequence statistics Global Local Evolutionary tree reconstruction CS369 2007 Adapted from slide by Dannie Durant

  5. Multiple sequence alignment • Definition: Given sequences X(1)…X(N) of lengths n1…nN, seek A(1)…A(N) of equal length n ≥ max{ni} such that • X(i) is obtained from A(i) by removing gap characters • No column contains all gaps • The score of the alignment is optimal CS369 2007
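
The first two conditions in the definition above can be checked mechanically. The following is a minimal sketch, assuming rows are plain strings and "-" is the gap character; the function name `is_valid_alignment` is hypothetical, and the optimality condition is not checked.

```python
def is_valid_alignment(sequences, rows, gap="-"):
    """Check the structural conditions of a multiple alignment (not optimality)."""
    # One row per input sequence, and at least one row.
    if not rows or len(sequences) != len(rows):
        return False
    # All rows A(i) must share the same length n.
    if len({len(row) for row in rows}) != 1:
        return False
    # Removing gap characters from A(i) must give back X(i).
    if any(row.replace(gap, "") != seq for seq, row in zip(sequences, rows)):
        return False
    # No column may consist of gaps only.
    return not any(all(ch == gap for ch in column) for column in zip(*rows))


sequences = ["ACTCCAT", "AGTCCT", "ACGTCAT"]
rows = ["A-CTCCAT", "A-GTCC-T", "ACGTCA-T"]
print(is_valid_alignment(sequences, rows))
```

The example rows are the three-sequence alignment used on the later scoring slides.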

  6. Definitions Sequence i Row i in alignment Column j in alignment CS369 2007

  7. Multiple sequence alignment The first 55 amino acids of the albumin protein in 4 vertebrate animals unaligned and aligned. CS369 2007

  8. Multiple sequence alignment • Align N sequences, so that residues in each column share a property of interest • A common ancestor / evolutionary history • A structural or functional role CS369 2007

  9. Multiple sequence alignment I H T A L H N V L H H V F Y H V Characters in the same column share evolutionary history CS369 2007

  10. Structure-based alignment CS369 2007 Adapted from slide by Dannie Durant

  11. Scoring function: sum of pairs Column Score A-CTCCAT A-GTCC-T ACGTCA-T CS369 2007

  12. Scoring function: sum of pairs Column Score A-CTCCAT A-GTCC-T ACGTCA-T CS369 2007

  13. Scoring function: tree-based (1) A-CTCCAT (2) A-GTCC-T (3) ACGTCA-T [Figure: tree for the third column, with leaves C (1), G (2), G (3) and both internal nodes labelled G] • Assumptions • Sequences (in particular the characters in a column) evolved from a common ancestor • Evolution is parsimonious - mutations are rare CS369 2007

  14. Scoring function: tree-based (1) A-CTCCAT (2) A-GTCC-T (3) ACGTCA-T The score is the minimum number of substitutions needed to explain the data, considering all possible internal labels. Here are 3 of the 16 possible internal labelings of two internal nodes, and the corresponding number of substitutions implied. [Figure: the tree with leaves C (1), G (2), G (3) under internal labels G and G (1 substitution), C and G (1 substitution), and C and C (2 substitutions)] CS369 2007

  15. Sum of pairs versus tree-based. For the column A A A G G: SP_Score = 6, Tree_Score = 1. CS369 2007
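
The two numbers on this slide can be reproduced with a small brute-force sketch. It assumes a unit cost per substitution, counts mismatching pairs for the SP value, and tries every labeling of the internal nodes for the tree value. The nested-tuple tree encoding, the function names, and the tree shape chosen for the five-leaf column are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def internal_nodes(tree, path=()):
    """Yield a path (tuple of child indices) for every internal node."""
    if isinstance(tree, str):  # a leaf is a single character
        return
    yield path
    for idx, child in enumerate(tree):
        yield from internal_nodes(child, path + (idx,))


def substitutions(tree, labels, path=()):
    """Count tree edges whose two endpoints carry different characters."""
    if isinstance(tree, str):
        return 0
    total = 0
    for idx, child in enumerate(tree):
        child_path = path + (idx,)
        child_label = child if isinstance(child, str) else labels[child_path]
        total += child_label != labels[path]
        total += substitutions(child, labels, child_path)
    return total


def tree_score(tree):
    """Minimum substitutions over all labelings of the internal nodes."""
    nodes = list(internal_nodes(tree))
    return min(
        substitutions(tree, dict(zip(nodes, combo)))
        for combo in product(ALPHABET, repeat=len(nodes))
    )


def sp_mismatches(column):
    """Number of unordered pairs in the column that differ."""
    return sum(a != b for a, b in combinations(column, 2))


print(tree_score(("C", ("G", "G"))))                   # column C, G, G
print(sp_mismatches("AAAGG"))                          # column A, A, A, G, G
print(tree_score(((("A", "A"), "A"), ("G", "G"))))     # assumed tree shape
```

With two internal nodes and four nucleotides there are 4 × 4 = 16 labelings, matching the count quoted on slide 14. Enumeration grows exponentially with the number of internal nodes, so this is only suitable for toy trees.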

  16. Tree-based scores • Thought to be the “most biological” but • We don’t know the tree • We need to infer the characters on internal nodes (more on that in later lectures) • There may be different trees for different parts of the alignment (if recombination has occurred) • Not always relevant for structural alignments • Sum of pairs is almost always used in practice. CS369 2007

  17. Linear gap scores & SP scoring • Treat gap as separate symbol: s(a,-) = s(-,a) = gap score, s(-,-) = 0 • “Sum of Pairs” (SP) scoring function: the score of column i is the sum of s over all pairs of rows j < k among the N sequences [Figure: column i of an N-row alignment with rows j and k marked] CS369 2007
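
A minimal sketch of SP scoring with a linear gap score, applied to the three-row alignment from slides 11 and 12. The numeric values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; the slides do not give a score matrix.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """s(a, b) with the gap treated as a separate symbol."""
    if a == GAP and b == GAP:
        return 0            # s(-,-) = 0
    if a == GAP or b == GAP:
        return gap          # s(a,-) = s(-,a) = gap score
    return match if a == b else mismatch


def column_score(column, **scores):
    """Sum of pair scores over all unordered pairs of rows in one column."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(column, 2))


def sp_scores(rows, **scores):
    """Per-column SP scores for an alignment given as equal-length strings."""
    return [column_score(column, **scores) for column in zip(*rows)]


rows = ["A-CTCCAT", "A-GTCC-T", "ACGTCA-T"]
per_column = sp_scores(rows)
print(per_column, sum(per_column))  # [3, -4, -1, 3, 3, -1, -4, 3] 2
```

Because s(-,-) = 0, a column in which two of the three rows carry a gap costs only the two residue-gap pairs.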

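  18. Multidimensional dynamic programming. Define $\alpha_{i_1,\ldots,i_N}$ = max score of an alignment up to the subsequences ending with $x^{1}_{i_1},\ldots,x^{N}_{i_N}$. Recurrence: $\alpha_{i_1,\ldots,i_N} = \max\{\alpha_{i_1-1,\ldots,i_N-1} + S(x^{1}_{i_1},\ldots,x^{N}_{i_N}),\ \alpha_{i_1,i_2-1,\ldots,i_N-1} + S(-,x^{2}_{i_2},\ldots,x^{N}_{i_N}),\ \ldots,\ \alpha_{i_1,\ldots,i_{N-1},i_N-1} + S(-,\ldots,-,x^{N}_{i_N})\}$, taken over all $2^N - 1$ ways of placing gaps in this column. Space $O(\prod_k n_k)$; time $O(2^N \prod_k n_k)$ column evaluations. CS369 2007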

  19. Dynamic programming for multiple sequence alignment Traceback Optimal score CS369 2007
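
The multidimensional recurrence can be sketched directly for small inputs. The code below is an illustration under assumed scores (match +1, mismatch -1, gap -2, s(-,-) = 0) and returns only the optimal SP score; traceback is omitted. The function names are hypothetical.

```python
from itertools import combinations, product


def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                      # s(-,-) = 0
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def msa_score(seqs, **scores):
    """Optimal SP score of a global alignment of all sequences in seqs."""
    n = len(seqs)
    origin = (0,) * n
    # Each move says which sequences contribute a residue to the new column;
    # excluding the all-gap column leaves 2^N - 1 possibilities.
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    alpha = {origin: 0}
    # product() visits cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)]
            candidate = alpha[prev] + sp_column(column, **scores)
            if best is None or candidate > best:
                best = candidate
        alpha[cell] = best
    return alpha[tuple(len(s) for s in seqs)]


print(msa_score(["ACTCCAT", "AGTCCT", "ACGTCAT"]))
```

The table has one cell per combination of prefix lengths, which is why exact alignment is only practical for a handful of sequences.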

  20. MSA Carrillo and Lipman (1988), Lipman, Altschul and Kececioglu (1989). Can optimally align up to 8-10 protein sequences of up to 500 residues. CS369 2007

  21. Multiple alignment software Really need approximation methods. Different techniques • Progressive global alignment of sequences starting with an alignment of the most similar sequences and then building a full alignment by adding more sequences • Iterative methods that make an initial alignment of groups of sequences and then refine the alignment to achieve a better result (Barton-Sternberg, Simulated annealing, stochastic hill climbing, genetic algorithms) • Use of probabilistic models of the indel and substitution process to do statistical inference of alignment. (“Statistical alignment”) CS369 2007

  22. Progressive alignment Align sequences (pairwise) in some (greedy) order • Decisions • (1) Order of alignments • (2) Alignment of sequence to group (only), or allow group to group • (3) Method of alignment, and scoring function CS369 2007

  23. Guide tree: this? or this? [Figure: two alternative guide trees] CS369 2007
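
One way to choose between candidate guide trees is to cluster the sequences greedily by distance. The sketch below uses average-linkage clustering, which is an assumption made for illustration; the slides do not specify a clustering method at this point. The `dist` dictionary keyed by `frozenset` pairs and the function names are hypothetical.

```python
from itertools import combinations


def average_distance(members_a, members_b, dist):
    """Mean distance between the leaves of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def guide_tree(names, dist):
    """Return (tree, merge_order), merging the closest clusters first."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)
    merge_order = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist
            ),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        merge_order.append(merged[0])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merge_order


# Made-up distances: A is close to B, C is close to D.
dist = {frozenset(pair): 0.6 for pair in combinations("ABCD", 2)}
dist[frozenset("AB")] = 0.1
dist[frozenset("CD")] = 0.2
tree, order = guide_tree(list("ABCD"), dist)
print(tree)
```

The merge order doubles as the order in which a progressive method would perform its alignments.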

  24. Feng & Doolittle (1987) Overview (1) Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances” (either p-distance or a genetic distance based on a Markov model). (2) Construct a guide tree from the distance matrix using an appropriate clustering algorithm. (3) Starting from the first node added to the tree, align the child nodes (which may be two sequences, a sequence and an alignment, or two alignments). (4) Repeat for all other nodes in the order that they were added to the tree, until all sequences have been aligned. CS369 2007
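
The first step of the overview, all-pairs distances from pairwise alignments, can be sketched as follows. This version uses Needleman-Wunsch with a linear gap penalty and the p-distance; the scores (match +1, mismatch -1, gap -2) and the function names are assumptions for illustration, not the parameters of the original method.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two aligned rows."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(row_a, row_b):
    """Fraction of differing sites among columns without a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 1.0


def distance_matrix(seqs):
    """Upper triangle of the distance matrix: N(N-1)/2 entries."""
    return {
        (i, j): p_distance(*needleman_wunsch(seqs[i], seqs[j]))
        for i, j in combinations(range(len(seqs)), 2)
    }


distances = distance_matrix(["ACGT", "ACGT", "AGGT"])
print(distances)
```

Only the pairs with i < j are stored because the distance is symmetric and the diagonal is zero.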

  25. Feng & Doolittle (1987) sequence-to-group: Best pairwise alignment determines alignment to group. [Figure: a new sequence and an existing group, drawn as rows of X] CS369 2007

  26. Feng & Doolittle (1987) sequence-to-group: Best pairwise alignment determines alignment to group. [Figure] CS369 2007

  27. Feng & Doolittle (1987) sequence-to-group: Best pairwise alignment determines alignment to group. This column is encouraged because it has no cost. [Figure] CS369 2007

  28. Feng & Doolittle (1987) sequence-to-group: Best pairwise alignment determines alignment to group. [Figure] CS369 2007

  29. Feng & Doolittle (1987) sequence-to-group: Best pairwise alignment determines alignment to group. [Figure] CS369 2007

  30. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure: two groups, drawn as rows of X] CS369 2007

  31. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure] CS369 2007

  32. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure] CS369 2007

  33. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure] CS369 2007

  34. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure] CS369 2007

  35. Feng & Doolittle (1987) group-to-group: Best pairwise alignment determines alignment of groups. [Figure] CS369 2007

  36. Feng & Doolittle (1987) • After an alignment is completed, gap symbols are replaced by “X”. • “Once a gap, always a gap”. • Encourages gaps to occur in the same columns in subsequent alignments. • Implemented by PILEUP (from the GCG package). CS369 2007
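
The "once a gap, always a gap" rule can be illustrated with a small sketch of the sequence-to-group step. It assumes the group rows already use "X" for earlier gaps, so any "-" in the freshly aligned anchor row must be a new gap column. The function name and argument layout are hypothetical, not taken from any implementation.

```python
def add_sequence_to_group(group, anchor, aligned_anchor, aligned_new):
    """Merge a new sequence into a group via its best pairwise alignment.

    group          -- equal-length rows in which earlier gaps are already "X"
    anchor         -- index of the group row the new sequence was aligned to
    aligned_anchor -- group[anchor] with "-" inserted by the pairwise alignment
    aligned_new    -- the new sequence as aligned against the anchor row
    """
    if aligned_anchor.replace("-", "") != group[anchor]:
        raise ValueError("aligned_anchor does not match the anchor row")
    merged = ["" for _ in group]
    col = 0
    for ch in aligned_anchor:
        if ch == "-":
            # A new gap in the anchor becomes a gap column for the whole group.
            merged = [row + "-" for row in merged]
        else:
            merged = [row + src[col] for row, src in zip(merged, group)]
            col += 1
    merged.append(aligned_new)
    # Freeze every gap so later alignment steps cannot remove it.
    return [row.replace("-", "X") for row in merged]


group = ["ACXT", "ACGT"]  # the X in the first row is a frozen gap
result = add_sequence_to_group(group, 1, "ACG-T", "ACGGT")
print(result)  # ['ACXXT', 'ACGXT', 'ACGGT']
```

Existing "X" symbols are copied through unchanged, which is how earlier gaps stay fixed in their columns.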

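  37. Profile alignment group-to-group [Figure: group A stacked above group B] Total alignment score = score(A) + score(B) + score(A*B), where score(A) and score(B) are the scores within each group and score(A*B) sums the pairs formed by one row of A and one row of B. CS369 2007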
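
The decomposition of the total score can be checked numerically under SP scoring. The sketch below uses assumed scores (match +1, mismatch -1, gap -2, s(-,-) = 0) and splits the three-row alignment from slide 11 into a two-row group A and a one-row group B; the split and the function names are illustrative.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """s(a, b) with a linear gap score and s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """SP score over all pairs of rows within one alignment."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


def cross_score(group_a, group_b):
    """Sum over columns of every (row of A, row of B) pair."""
    total = 0
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        total += sum(pair_score(a, b) for a in col_a for b in col_b)
    return total


group_a = ["A-CTCCAT", "A-GTCC-T"]
group_b = ["ACGTCA-T"]
total = sp_score(group_a + group_b)
print(total, sp_score(group_a), sp_score(group_b), cross_score(group_a, group_b))
```

Since score(A) and score(B) do not change when two fixed groups are aligned to each other, only the cross term needs to be optimised in a group-to-group alignment.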

  38. CLUSTALW • Thompson, Higgins and Gibson (1994). • Widely used implementation of profile-based progressive multiple alignment. • Similar to the Feng-Doolittle method, except for the use of profile alignment methods. • Overview: • Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to approximate pairwise “distances”. • Construct a guide tree from the distance matrix using a neighbour-joining clustering algorithm. • Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment. • Plus many other heuristics. CS369 2007

  39. CLUSTAL W heuristics • Closely related sequences are aligned with hard matrices (BLOSUM80) and distant sequences are aligned with soft matrices (BLOSUM50). • Hydrophobic residues (which are more likely to be buried) are given higher gap penalties than hydrophilic residues (which are more likely to be surface-accessible). • Gap-open penalties are also decreased if the position is spanned by 5 or more consecutive hydrophilic residues. CS369 2007

  40. CLUSTAL W heuristics • Both gap-open penalties and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all gaps to occur in the same places in an alignment. • In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low scoring alignment until later in the progressive alignment phase when more profile information has been accumulated. CS369 2007

  41. Iterative refinement • i.e. “hill climbing”. Slightly change solution to improve score. Converge to local optimum. • e.g. Barton-Sternberg (1987) multiple alignment • Find the two sequences with the highest pairwise similarity and align them using standard dynamic programming alignment. • Find sequence most similar to a profile of the alignment of the first two, and align it to first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment. • Remove sequence X(1) and realign it to a profile of the other aligned sequences X(2)…X(N) by profile-sequence alignment. Repeat for sequences X(2)…X(N). • Repeat the previous alignment step a fixed number of times, or until the alignment score converges. CS369 2007

  42. Clustal X CS369 2007

  43. Clustal X CS369 2007

  44. CLUSTALX CS369 2007

  45. CLUSTALX CS369 2007

  46. Alignment of 28 sequences, one per row:
      C_aminophilum AGCT.YCGCATGRAGCAGTG TGAAAA.... ............ACTCCGGT GGTACAGGAT
      C_colinum AGTA..GGCATCTACAAGTT GGAAAA.... ............ACTGAGGT GGTATAGGAG
      C_lentocellum GGTATTCGCTTGATTATNATAGTAAA.... ............GATTTATC GCCATAGGAT
      C_botulinum_D TTTA.TGGCATCATACATAAAATAATCAAA ..........GGAGCAATCC GCTTTGAGAT
      C_novyi_A TTTA.CGGCAT....CGTAG AATAATCAAA ..........GGAGCAATCC GCTTTGAGAT
      C_gasigenes AGTT.TCGCATGAAACA... GC.AATTAAA ..........GGAGAAATCC GCTATAAGAT
      C_aurantibutyricum A.NT.TCGCATGGAGCA... AC.AATCAAA ..........GGAGCAAT.CACTATAAGAT
      C_sp_C_quinii AGTT.T.GCATGGGACA... GC.AATTAAA ..........GGAGCAATCC GCTATGAGAT
      C_perfringens AAGA.TGGCAT.CATCA... TTCAACCAAA ..........GGAGCAATCC GCTATGAGAT
      C_cadaveris TTTT.CTGCATGGGAAA... GTC.ATGAAA ..........GGAGCAATCC GCTGTAAGAT
      C_cellulovorans ATTC.TCGCATGAGAGA... .TGTATCAAA ..........GGAGCAATCC GCTATAAGAT
      C_K21 TTGR.TCGCATGATCKAAACATCAAAGGAT ..TTTTCTTTGGAAAATTCCACTTTGAGAT
      C_estertheticum TTGA.TCGCATGATCTTAACATCAAAGGAA ..TTT..TTCGG..AATTTCACTTTGAGAT
      C_botulinum_A AGAA.TCGCATGATTTTCTTATCAAAGATT ..T............ATT.. GCTTTGAGAT
      C_sporogenes AGAA.TCGCATGATTTTCTTATCAAAGATT ..T............ATT.. GCTTTGAGAT
      C_argentinense AAGG.TCGCATGACTTTTATACCAAAGGAG ..T............AATCC GCTATGAGAT
      C_subterminale AAGG.TCGCATGACTTTTATACCAAAGGAG ..T............AATCC GCTATGAGAT
      C_tetanomorphum TTTT.CCGCATGAAAAACTAATCAAAGGAG ..T............AAT.C GCTTTGAGAT
      C_pasteurianum AGTT.TCACATGGAGCTTTAATTAAAGGAG ..T............AATCC GCTTTGAGAT
      C_collagenovorans TTGA.TCGCATGGTCGAAATATTAAAGGAG ..T............AATCC GCTTACAGAT
      C_histolyticum TTTA.ATGCATGTTAGAAAG ATTAAAGGAG ..............CAATCC GCTTTGAGAT
      C_tyrobutyricum AGTT.TCACATGGAATTTGG ATGAAAGGAG ..T............AATTC GCTTTGAGAT
      C_tetani GGTT.TCGCATGAAACTTTAACCAAAGGAG ..T............AATCT GCTTTGAGAT
      C_barkeri GACA.TCGCATGGTGTT... .TTAATGAAA ............ACTCCGGT GCCATGAGAT
      C_thermocellum GGCA.TCGTCCTGTTAT... .CAAAGGAGA ............AATCCGGT ...ATGAGAT
      Pep_prevotii AGTC.TCGCATGGNGTTATCATCAAAGA.. ..............TTTATC GGTGTAAGAT
      C_innocuum ACGGAGCGCATGCTCTGTATATTAAAGCGCCCTTCAAGGCGTGAAC.... ....ATGGAT
      S_ruminantium AGTTTCCGCATGGGAGCTTG ATTAAAGATG GCCTCTACTTGTAAGCTATC GCTTTGCGAT

  47. The alignment from slide 46, annotated with the motif TCAAAGGAG.

  48. Alignment - considerations • The programs simply try to maximize the number of matches • The “best” alignment may not be the correct biological one • Multiple alignments are done progressively • Such alignments get progressively worse as you add sequences • Mistakes that occur during alignment process are frozen in. • Unless the sequences are very similar you will almost certainly have to correct manually CS369 2007

  49. Manual alignment software • Geneious (cross-platform): https://www.geneious.com/ • CINEMA (Java applet): https://www.biochem.ucl.ac.uk • Seqapp/Seqpup (Mac/PC/UNIX): https://iubio.bio.indiana.edu • Se-Al (Macintosh): https://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html • BioEdit (PC): https://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html CS369 2007

