Heuristic Alignment Methods

Gotoh, Osamu

doi:10.1007/978-1-62703-646-7_2

Osamu Gotoh^3,4

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1079))

5158 Accesses
4 Citations

Abstract

Computation of multiple sequence alignment (MSA) is usually formulated as a combinatory optimization problem of an objective function. Solving the problem for virtually all sensible objective functions is known to be NP-complete implying that some heuristics must be adopted. Several general strategies have been proven effective to obtain accurate MSAs in reasonable computational costs. This chapter is devoted to a brief summary of most successful heuristic approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Protocol: USD 49.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Carrillo H, Lipman D (1988) The multiple sequence alignment problem in biology. SIAM J Appl Math 48:1073–1082
Google Scholar
Gupta SK, Kececioglu JD, Schaffer AA (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J Comput Biol 2:459–472
PubMed CAS Google Scholar
Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36:D202–D205
PubMed CAS Google Scholar
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5, Suppl. 3. National Biomedical Research Foundation, Silver Spring, MD, pp 345–352
Google Scholar
Chiaromonte F, Yap VB, Miller W (2002) Scoring pairwise genomic sequence alignments. In: Altman RB, Dunker AK, Hunter L, Klein TED, Lauderdale K (eds) Pacific symposium on biocomputing. World Scientific, Singapore, pp 115–126
Google Scholar
Frith MC, Hamada M, Horton P (2010) Parameters for accurate genome alignment. BMC Bioinformatics 11:80
PubMed Google Scholar
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
PubMed CAS Google Scholar
Sellers PH (1974) On the theory and computation of evolutionary distances. SIAM J Appl Math 26:787–793
Google Scholar
Waterman MS, Smith TF, Beyer WA (1976) Some biological sequence metrics. Adv Math 20:367–387
Google Scholar
Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162:705–708
PubMed CAS Google Scholar
Gotoh O (1990) Optimal sequence alignment allowing for long gaps. Bull Math Biol 52:359–373
PubMed CAS Google Scholar
Waterman MS, Byers TH (1985) A dynamic-programming algorithm to find all solutions in a neighborhood of the optimum. Math Biosci 77:179–188
Google Scholar
Bishop MJ, Thompson EA (1986) Maximum likelihood alignment of DNA sequences. J Mol Biol 190:159–165
PubMed CAS Google Scholar
Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33:114–124
PubMed CAS Google Scholar
Miyazawa S (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng 8:999–1009
PubMed CAS Google Scholar
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
Google Scholar
Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
Google Scholar
Holmes I, Durbin R (1998) Dynamic programming alignment accuracy. J Comput Biol 5:493–504
PubMed CAS Google Scholar
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340
PubMed CAS Google Scholar
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast statistical alignment. PLoS Comput Biol 5:e1000392
PubMed Google Scholar
Gotoh O (1990) Consistency of optimal sequence alignments. Bull Math Biol 52:509–525
PubMed CAS Google Scholar
Kruskal JB, Sankoff D (1983) An anthology of algorithms and concepts for sequence comparison. In: Sankoff D, Kruskal J (eds) Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading, MA, pp 265–310
Google Scholar
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14:407–422
PubMed CAS Google Scholar
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
PubMed CAS Google Scholar
Kececioglu JD (1993) The maximum weight trace problem in multiple sequence alignment. Lect Notes Comput Sci 684:106–119
Google Scholar
Roshan U, Livesay DR (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22:2715–2721
PubMed CAS Google Scholar
Pei J, Grishin NV (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34:4364–4374
PubMed CAS Google Scholar
Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26:1958–1964
PubMed CAS Google Scholar
Paten B, Herrero J, Beal K, Birney E (2009) Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 25:295–301
PubMed CAS Google Scholar
Paten B, Herrero J, Beal K, Fitzgerald S, Birney E (2008) Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res 18:1814–1828
PubMed CAS Google Scholar
Hogeweg P, Hesper B (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 20:175–186
PubMed CAS Google Scholar
Kruspe M, Stadler PF (2007) Progressive multiple sequence alignments from triplets. BMC Bioinformatics 8:254
PubMed Google Scholar
Lassmann T, Frings O, Sonnhammer EL (2009) Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res 37:858–865
PubMed CAS Google Scholar
Loytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562
PubMed Google Scholar
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539
PubMed Google Scholar
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
PubMed CAS Google Scholar
Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 83:5155–5159
PubMed CAS Google Scholar
Muth R, Manber U (1996) Approximate multiple string search. Lect Notes Comput Sci 1075:75–86
Google Scholar
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282
PubMed CAS Google Scholar
Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
PubMed CAS Google Scholar
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
PubMed Google Scholar
Sneath PHA, Sokal RP (1973) Numerical taxonomy. Freeman, San Francisco, CA
Google Scholar
Wheeler TJ, Kececioglu JD (2007) Multiple alignment by aligning alignments. Bioinformatics 23:i559–i568
PubMed CAS Google Scholar
Plyusnin I, Holm L (2012) Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 13:64
PubMed CAS Google Scholar
Gronau I, Moran S (2007) Optimal implementations of UPGMA and other common clustering algorithms. Inform Process Lett 104:205–210
Google Scholar
Katoh K, Toh H (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23:372–374
PubMed CAS Google Scholar
Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithm Mol Bio 5:21
Google Scholar
Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 84:4355–4358
PubMed CAS Google Scholar
Hein J (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol Biol Evol 6:649–668
PubMed CAS Google Scholar
Lee C, Grasso C, Sharlow MF (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18:452–464
PubMed CAS Google Scholar
Loytynoja A, Vilella AJ, Goldman N (2012) Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics 28:1684–1691
PubMed Google Scholar
Gotoh O (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264:823–838
PubMed CAS Google Scholar
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
PubMed CAS Google Scholar
Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351–360
PubMed CAS Google Scholar
Barton GJ, Sternberg MJ (1987) A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol 198:327–337
PubMed CAS Google Scholar
Subbiah S, Harrison SC (1989) A method for multiple sequence alignment with gaps. J Mol Biol 209:539–548
PubMed CAS Google Scholar
Berger MP, Munson PJ (1991) A novel randomized iterative strategy for aligning multiple protein sequences. Comput Appl Biosci 7:479–484
PubMed CAS Google Scholar
Gotoh O (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput Appl Biosci 9:361–370
PubMed CAS Google Scholar
Altschul SF (1989) Gap costs for multiple sequence alignment. J Theor Biol 138:297–309
PubMed CAS Google Scholar
Altschul SF, Carroll RJ, Lipman DJ (1989) Weights for data related by a tree. J Mol Biol 207:647–653
PubMed CAS Google Scholar
Gotoh O (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput Appl Biosci 10:379–387
PubMed CAS Google Scholar
Ma B, Wang Z, Zhang K (2003) Alignment between two multiple alignments. Lect Notes Comput Sci 2676:254–265
Google Scholar
Gotoh O (1999) Multiple sequence alignment: algorithms and applications. Adv Biophys 36:159–206
PubMed CAS Google Scholar
Kececioglu JD, Starrett D (2004) Aligning alignments exactly. In: Gusfield D, Bourne P, Istrail S, Pevzner P, Waterman M (eds) Proceedings of the 8th ACM conference on computational molecular biology (RECOMB). ACM Press, New York, pp 85–96
Google Scholar
Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518
PubMed CAS Google Scholar
Yamada S, Gotoh O, Yamana H (2006) Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinformatics 7:524
PubMed Google Scholar
Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960
PubMed Google Scholar
Edgar RC, Sjolander K (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20:1301–1308
PubMed CAS Google Scholar
Wang G, Dunbrack RL Jr (2004) Scoring profile-to-profile sequence alignments. Protein Sci 13:1612–1626
PubMed CAS Google Scholar
Altschul SF, Wootton JC, Zaslavsky E, Yu YK (2010) The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 6:e1000852
PubMed Google Scholar
Edgar RC (2009) Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics 10:396
PubMed Google Scholar
Muller T, Spang R, Vingron M (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 19:8–13
PubMed CAS Google Scholar
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13–18
PubMed CAS Google Scholar
Gotoh O (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 11:543–551
PubMed CAS Google Scholar
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
PubMed CAS Google Scholar
Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome Res 12:656–664
PubMed CAS Google Scholar
Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27:2369–2376
PubMed CAS Google Scholar
Darling AC, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394–1403
PubMed CAS Google Scholar
Hohl M, Kurtz S, Ohlebusch E (2002) Efficient multiple genome alignment. Bioinformatics 18(Suppl 1):S312–S320
PubMed Google Scholar
Choi JH, Cho HG, Kim S (2005) GAME: a simple and efficient whole genome alignment method using maximal exact match filtering. Comput Biol Chem 29:244–253
PubMed CAS Google Scholar
Kryukov K, Saitou N (2010) MISHIMA–a new method for high speed multiple alignment of nucleotide sequences of bacterial genome scale data. BMC Bioinformatics 11:142
PubMed Google Scholar
Crochemore M, Hancart C, Lecroq T (2007) Algorithms on strings. Cambridge University Press, Cambridge
Google Scholar
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13:721–731
PubMed CAS Google Scholar
Bray N, Pachter L (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res 14:693–699
PubMed CAS Google Scholar
Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res 10:950–958
PubMed CAS Google Scholar
Bray N, Dubchak I, Pachter L (2003) AVID: a global alignment program. Genome Res 13:97–102
PubMed CAS Google Scholar
Morgenstern B (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15:211–218
PubMed CAS Google Scholar
Rausch T, Emde AK, Weese D, Doring A, Notredame C, Reinert K (2008) Segment-based multiple sequence alignment. Bioinformatics 24:i187–i192
PubMed Google Scholar
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
PubMed CAS Google Scholar
Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23:e24–e29
PubMed CAS Google Scholar
Sahraeian SM, Yoon BJ (2010) PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 38:4917–4928
PubMed CAS Google Scholar
Thompson JD, Thierry JC, Poch O (2003) RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 19:1155–1161
PubMed CAS Google Scholar
Yamada S, Gotoh O, Yamana H (2009) Improvement in speed and accuracy of multiple sequence alignment program PRIME. Inform Media Tech 4:317–327
Google Scholar
Sadreyev RI, Baker D, Grishin NV (2003) Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci 12:2262–2272
PubMed CAS Google Scholar
Tomii K, Akiyama Y (2004) FORTE: a profile-profile comparison tool for protein fold recognition. Bioinformatics 20:594–595
PubMed CAS Google Scholar
Soding J, Remmert M (2011) Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 21:404–411
PubMed Google Scholar
Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
PubMed CAS Google Scholar
Rost B, Sander C (1994) Conservation and prediction of solvent accessibility in protein families. Proteins 20:216–226
PubMed CAS Google Scholar
Simossis VA, Heringa J (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33:W289–W294
PubMed CAS Google Scholar
Zhou H, Zhou Y (2005) SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21:3615–3621
PubMed CAS Google Scholar
Pei J, Sadreyev R, Grishin NV (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19:427–428
PubMed CAS Google Scholar
Pei J, Grishin NV (2007) PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 23:802–808
PubMed CAS Google Scholar
Papadopoulos JS, Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23:1073–1079
PubMed CAS Google Scholar
O’Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J Mol Biol 340:385–395
PubMed Google Scholar
Pei J, Kim BH, Grishin NV (2008) PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res 36:2295–2300
PubMed CAS Google Scholar
Smith TF, Waterman MS, Fitch WM (1981) Comparative biosequence metrics. J Mol Evol 18:38–46
PubMed CAS Google Scholar
Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithm 1:359–373
Google Scholar
Hamada M, Asai K (2012) A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 19:532–549
PubMed CAS Google Scholar

Download references

Acknowledgment

I thank Dr. Kentaro Tomii for instruction about protein fold recognition methods. This work was partly supported by Kakenhi (Grant-in-Aid for Scientific Research) B (grant number 22310124) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

Author information

Authors and Affiliations

Computational Biology Research Center (CBRC), Tokyo, Japan
Osamu Gotoh
National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Osamu Gotoh

Authors

Osamu Gotoh
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Dept. of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA
David J Russell

Rights and permissions

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Gotoh, O. (2014). Heuristic Alignment Methods. In: Russell, D. (eds) Multiple Sequence Alignment Methods. Methods in Molecular Biology, vol 1079. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-646-7_2

Download citation

DOI: https://doi.org/10.1007/978-1-62703-646-7_2
Published: 23 August 2013
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-645-0
Online ISBN: 978-1-62703-646-7
eBook Packages: Springer Protocols

Publish with us

Policies and ethics