Sequence Alignment

  • Benu Atri
  • Olivier Lichtarge


Recent advances in computational science have made it possible not only to acquire and store large amounts of sequence data but also to analyze many sequences simultaneously. Sequences diverge because variations accumulate over evolutionary timescales. How variable or conserved a region is across two or more sequences says much about its importance for maintaining functional and structural integrity. Regions of high similarity may reflect evolutionary relationships, i.e., shared ancestry, and can be uncovered by sequence alignment, which is also the first step in most bioinformatics analyses. In this chapter we discuss common terminology related to sequence alignment, how to choose the appropriate alignment strategy for a given problem, different alignment algorithms, and the most common tools for pairwise, multiple, and whole-genome sequence alignment.


Sequence identity; Sequence homology; Sequence similarity; Substitution matrices; Distance matrices; Pairwise sequence alignment; Multiple sequence alignment; BLAST; FASTA; Genome alignment



This work is supported by a grant from the NIH Research Project Grant Program (2R01GM079656). The authors are grateful to Dr. David C. Marciano, Dr. Angela Wilkins, and Dr. Rhonald C. Lua for their helpful comments.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Benu Atri (1)
  • Olivier Lichtarge (1, 2, 3)

  1. Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, USA
  2. Center for Computational and Integrative Biomedical Research (CIBR), Baylor College of Medicine, Houston, USA
  3. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, USA
