A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10834)

Abstract

The rapidly growing volume of genomic data, including data from pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g., among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir.

In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, in order to quantify the probability that a given alignment was no better than chance, and applied it to Human Papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences was reduced from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers. This indicates that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, allowing distances based on k-mer frequencies to serve as a proxy for evolutionary distance. We constructed phylogenies from primate mtDNA coding sequences and from Ebola samples. Primate mtDNA coding region k-mer UPGMA phylogenies reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (\(p \le 1\times 10^{-5}\)). We characterized differences in selection pressure between coding and non-coding regions, and between early and late cell cycle genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while early versus late cell cycle proteins showed differences, with late cycle proteins resembling an influenza-like immunological response; we note that the g-proteins are among the late genes.
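To make the idea of a k-mer frequency distance and its Mantel comparison concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Euclidean distance between relative k-mer frequency vectors (the abstract does not state which distance was used) and a simple one-sided permutation Mantel test based on Pearson correlation. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers, counted over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # windows with ambiguous bases are ignored
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return [counts[m] / total for m in kmers]


def euclidean(p, q):
    """Euclidean distance; an assumed stand-in for the paper's distance measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(seqs, k):
    """Symmetric matrix of pairwise k-mer frequency distances."""
    freqs = [kmer_frequencies(s, k) for s in seqs]
    n = len(freqs)
    return [[euclidean(freqs[i], freqs[j]) for j in range(n)] for i in range(n)]


def _upper(d):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided permutation Mantel test between two distance matrices.

    Rows and columns of d2 are permuted together; the p-value is the share of
    permutations whose correlation is at least the observed one.
    """
    rng = random.Random(seed)
    n = len(d1)
    flat1 = _upper(d1)
    observed = _pearson(flat1, _upper(d2))
    idx = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(idx)
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if _pearson(flat1, _upper(permuted)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

In use, `distance_matrix(sequences, 3)` would give 3-mer frequency distances that could be passed to `mantel` together with a matrix of tree-derived distances for the same samples. The smallest p-value this test can report is 1 / (permutations + 1), so resolving a significance level near \(10^{-5}\) would need on the order of \(10^{5}\) permutations.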

F. Utro and D.E. Platt—Contributed equally to this work.

Notes

  1. http://www.hpvcenter.se/html/refclones.html

Author information

Corresponding author

Correspondence to Filippo Utro.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Utro, F., Platt, D.E., Parida, L. (2019). A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction. In: Bartoletti, M., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2017. Lecture Notes in Computer Science, vol 10834. Springer, Cham. https://doi.org/10.1007/978-3-030-14160-8_3

  • DOI: https://doi.org/10.1007/978-3-030-14160-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14159-2

  • Online ISBN: 978-3-030-14160-8

  • eBook Packages: Computer Science, Computer Science (R0)
