Abstract
Computation of multiple sequence alignment (MSA) is usually formulated as a combinatory optimization problem of an objective function. Solving the problem for virtually all sensible objective functions is known to be NP-complete implying that some heuristics must be adopted. Several general strategies have been proven effective to obtain accurate MSAs in reasonable computational costs. This chapter is devoted to a brief summary of most successful heuristic approaches.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Carrillo H, Lipman D (1988) The multiple sequence alignment problem in biology. SIAM J Appl Math 48:1073–1082
Gupta SK, Kececioglu JD, Schaffer AA (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J Comput Biol 2:459–472
Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36:D202–D205
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5, Suppl. 3. National Biomedical Research Foundation, Silver Spring, MD, pp 345–352
Chiaromonte F, Yap VB, Miller W (2002) Scoring pairwise genomic sequence alignments. In: Altman RB, Dunker AK, Hunter L, Klein TED, Lauderdale K (eds) Pacific symposium on biocomputing. World Scientific, Singapore, pp 115–126
Frith MC, Hamada M, Horton P (2010) Parameters for accurate genome alignment. BMC Bioinformatics 11:80
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
Sellers PH (1974) On the theory and computation of evolutionary distances. SIAM J Appl Math 26:787–793
Waterman MS, Smith TF, Beyer WA (1976) Some biological sequence metrics. Adv Math 20:367–387
Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162:705–708
Gotoh O (1990) Optimal sequence alignment allowing for long gaps. Bull Math Biol 52:359–373
Waterman MS, Byers TH (1985) A dynamic-programming algorithm to find all solutions in a neighborhood of the optimum. Math Biosci 77:179–188
Bishop MJ, Thompson EA (1986) Maximum likelihood alignment of DNA sequences. J Mol Biol 190:159–165
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33:114–124
Miyazawa S (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng 8:999–1009
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
Holmes I, Durbin R (1998) Dynamic programming alignment accuracy. J Comput Biol 5:493–504
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast statistical alignment. PLoS Comput Biol 5:e1000392
Gotoh O (1990) Consistency of optimal sequence alignments. Bull Math Biol 52:509–525
Kruskal JB, Sankoff D (1983) An anthology of algorithms and concepts for sequence comparison. In: Sankoff D, Kruskal J (eds) Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading, MA, pp 265–310
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14:407–422
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
Kececioglu JD (1993) The maximum weight trace problem in multiple sequence alignment. Lect Notes Comput Sci 684:106–119
Roshan U, Livesay DR (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22:2715–2721
Pei J, Grishin NV (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34:4364–4374
Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26:1958–1964
Paten B, Herrero J, Beal K, Birney E (2009) Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 25:295–301
Paten B, Herrero J, Beal K, Fitzgerald S, Birney E (2008) Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 18:1814–1828
Hogeweg P, Hesper B (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 20:175–186
Kruspe M, Stadler PF (2007) Progressive multiple sequence alignments from triplets. BMC Bioinformatics 8:254
Lassmann T, Frings O, Sonnhammer EL (2009) Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res 37:858–865
Loytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 83:5155–5159
Muth R, Manber U (1996) Approximate multiple string search. Lect Notes Comput Sci 1075:75–86
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282
Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
Sneath PHA, Sokal RP (1973) Numerical taxonomy. Freeman, San Francisco, CA
Wheeler TJ, Kececioglu JD (2007) Multiple alignment by aligning alignments. Bioinformatics 23:i559–i568
Plyusnin I, Holm L (2012) Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 13:64
Gronau I, Moran S (2007) Optimal implementations of UPGMA and other common clustering algorithms. Inform Process Lett 104:205–210
Katoh K, Toh H (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23:372–374
Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithm Mol Bio 5:21
Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 84:4355–4358
Hein J (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol Biol Evol 6:649–668
Lee C, Grasso C, Sharlow MF (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18:452–464
Loytynoja A, Vilella AJ, Goldman N (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics 28:1684–1691
Gotoh O (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264:823–838
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351–360
Barton GJ, Sternberg MJ (1987) A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol 198:327–337
Subbiah S, Harrison SC (1989) A method for multiple sequence alignment with gaps. J Mol Biol 209:539–548
Berger MP, Munson PJ (1991) A novel randomized iterative strategy for aligning multiple protein sequences. Comput Appl Biosci 7:479–484
Gotoh O (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput Appl Biosci 9:361–370
Altschul SF (1989) Gap costs for multiple sequence alignment. J Theor Biol 138:297–309
Altschul SF, Carroll RJ, Lipman DJ (1989) Weights for data related by a tree. J Mol Biol 207:647–653
Gotoh O (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput Appl Biosci 10:379–387
Ma B, Wang Z, Zhang K (2003) Alignment between two multiple alignments. Lect Notes Comput Sci 2676:254–265
Gotoh O (1999) Multiple sequence alignment: algorithms and applications. Adv Biophys 36:159–206
Kececioglu JD, Starrett D (2004) Aligning alignments exactly. In: Gusfield D, Bourne P, Istrail S, Pevzner P, Waterman M (eds) Proceedings of the 8th ACM conference on computational molecular biology (RECOMB). ACM Press, New York, pp 85–96
Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518
Yamada S, Gotoh O, Yamana H (2006) Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinformatics 7:524
Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960
Edgar RC, Sjolander K (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20:1301–1308
Wang G, Dunbrack RL Jr (2004) Scoring profile-to-profile sequence alignments. Protein Sci 13:1612–1626
Altschul SF, Wootton JC, Zaslavsky E, Yu YK (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
Edgar RC (2009) Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics 10:396
Muller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13–18
Gotoh O (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 11:543–551
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–664
Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27:2369–2376
Darling AC, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394–1403
Hohl M, Kurtz S, Ohlebusch E (2002) Efficient multiple genome alignment. Bioinformatics 18(Suppl 1):S312–S320
Choi JH, Cho HG, Kim S (2005) GAME: a simple and efficient whole genome alignment method using maximal exact match filtering. Comput Biol Chem 29:244–253
Kryukov K, Saitou N (2010) MISHIMA–a new method for high speed multiple alignment of nucleotide sequences of bacterial genome scale data. BMC Bioinformatics 11:142
Crochemore M, Hancart C, Lecroq T (2007) Algorithms on strings. Cambridge University Press, Cambridge
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13:721–731
Bray N, Pachter L (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res 14:693–699
Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res 10:950–958
Bray N, Dubchak I, Pachter L (2003) AVID: a global alignment program. Genome Res 13:97–102
Morgenstern B (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211–218
Rausch T, Emde AK, Weese D, Doring A, Notredame C, Reinert K (2008) Segment-based multiple sequence alignment. Bioinformatics 24:i187–i192
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23:e24–e29
Sahraeian SM, Yoon BJ (2010) PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 38:4917–4928
Thompson JD, Thierry JC, Poch O (2003) RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 19:1155–1161
Yamada S, Gotoh O, Yamana H (2009) Improvement in speed and accuracy of multiple sequence alignment program PRIME. Inform Media Tech 4:317–327
Sadreyev RI, Baker D, Grishin NV (2003) Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci 12:2262–2272
Tomii K, Akiyama Y (2004) FORTE: a profile-profile comparison tool for protein fold recognition. Bioinformatics 20:594–595
Soding J, Remmert M (2011) Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 21:404–411
Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
Rost B, Sander C (1994) Conservation and prediction of solvent accessibility in protein families. Proteins 20:216–226
Simossis VA, Heringa J (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33:W289–W294
Zhou H, Zhou Y (2005) SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21:3615–3621
Pei J, Sadreyev R, Grishin NV (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19:427–428
Pei J, Grishin NV (2007) PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23:802–808
Papadopoulos JS, Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23:1073–1079
O’Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J Mol Biol 340:385–395
Pei J, Kim BH, Grishin NV (2008) PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res 36:2295–2300
Smith TF, Waterman MS, Fitch WM (1981) Comparative biosequence metrics. J Mol Evol 18:38–46
Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithm 1:359–373
Hamada M, Asai K (2012) A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 19:532–549
Acknowledgment
I thank Dr. Kentaro Tomii for instruction about protein fold recognition methods. This work was partly supported by Kakenhi (Grant-in-Aid for Scientific Research) B (grant number 22310124) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Gotoh, O. (2014). Heuristic Alignment Methods. In: Russell, D. (eds) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_2
Download citation
DOI: https://doi.org/10.1007/978-1-62703-646-7_2
Published:
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-645-0
Online ISBN: 978-1-62703-646-7
eBook Packages: Springer Protocols