Sequence Comparison Tools

  • Michael Imelfort


The evolution of methods for capturing genetic sequence data has inspired a parallel evolution of computational tools for analyzing and comparing those data. Indeed, much of the progress in modern biological research stems from the application of such technology. In this chapter we provide an overview of the main classes of tools currently used for sequence comparison. For each class we describe how the tools work, their history, and their current state. Hundreds of tools have been produced to align, cluster, filter, or otherwise analyze sequence data, and it would be impossible to list them all in this chapter, so we cover only those that most readers are likely to encounter. We apologize to researchers who feel that their particular piece of software should have been included here. The reader will notice considerable overlap, both in concept and in application, between tools; in many cases one tool or algorithm forms part of another tool’s implementation. Most of the more popular sequence comparison tools are based on ideas and algorithms that can be traced back to the 1960s and 1970s, when the cost of computing power first became low enough to enable widespread development in this area. Where applicable we describe the original algorithms and then list the iterations of the idea (often by different people in different labs), noting the important changes introduced at each stage. Finally, we describe the software packages used by today’s bioinformaticians. A quick search will turn up many papers that formally compare different implementations of a particular algorithm, so while we may note that one algorithm is more efficient or accurate than another, we stress that we have not performed any formal benchmarking or comparison analysis here.


Keywords (added by machine, not by the authors): Query Sequence; Alignment Algorithm; Pairwise Alignment; Progressive Method; Progressive Alignment



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. University of Queensland, Queensland, Australia
