A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction

Utro, Filippo; Platt, Daniel E.; Parida, Laxmi

doi:10.1007/978-3-030-14160-8_3

Conference paper
First Online: 14 February 2019

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10834)

Abstract

The rapidly growing volume of genomic data, including pathogen genomes, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g., among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir.

In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, in order to quantify the probability that a given alignment was no better than chance, and applied it to Human Papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences fell from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers. This indicates that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, so that distances based on k-mer frequencies can serve as a proxy for evolutionary distance. We constructed phylogenies from primate mtDNA coding sequences and from Ebola genomes. UPGMA phylogenies built from k-mers of the primate mtDNA coding region reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (\(p \le 1\times 10^{-5}\)). We also characterized differences in selection pressure between coding and non-coding regions, and between early and late cell-cycle genes in Ebola. The coding versus non-coding comparison showed evidence of purifying selection, while the early versus late comparison showed differences, with late-cycle proteins resembling an influenza-like immunological response; we note that the g-proteins are among the late genes.
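As an illustration of the two computational ingredients named in the abstract (distances between k-mer frequency profiles, and a Mantel test comparing two distance matrices), the following is a minimal sketch rather than the authors' implementation. The abstract does not state which distance was used, so the Euclidean distance here is an assumption, as are the function names, the permutation count, and the one-sided test on Pearson correlation.

```python
import itertools
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers over overlapping windows."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with N or other ambiguity codes are skipped
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def euclidean(p, q):
    """Euclidean distance between two frequency profiles (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(seqs, k):
    """Pairwise k-mer frequency distances as a full symmetric matrix."""
    profiles = [kmer_frequencies(s, k) for s in seqs]
    n = len(seqs)
    return [[euclidean(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided permutation Mantel test between two distance matrices.

    Rows and columns of d2 are permuted together; the p-value is the
    fraction of permutations (plus the observed one) whose correlation
    is at least as large as the observed correlation.
    """
    n = len(d1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    idx = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(idx)
        y = [d2[idx[i]][idx[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

In use, `distance_matrix(sequences, 3)` would give the 3-mer frequency distances, and `mantel` would compare that matrix against tree-derived distances obtained elsewhere. A matrix of this kind could also be passed to an average-linkage (UPGMA) clustering routine to build a tree.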

F. Utro and D.E. Platt contributed equally to this work.

Notes

1. http://www.hpvcenter.se/html/refclones.html

References

Barrette, R.W., et al.: Discovery of swine as a host for the Reston ebolavirus. Science 325(5937), 204–206 (2009)
Blaisdell, B.E.: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Nat. Acad. Sci. 83, 5155–5159 (1986)
Boyce, K., Sievers, F., Higgins, D.G.: Instability in progressive multiple sequence alignment algorithms. Algorithms Mol. Biol. 10(1), 1–10 (2015)
Chan, C.X., Bernard, G., Poirion, O., Hogan, J.M., Ragan, M.A.: Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci. Rep. 4(6504), 1–9 (2014)
Chor, B., Horn, D., Goldman, N., Levy, Y., Massingham, T.: Genomic DNA k-mer spectra: models and modalities. Genome Biol. 10, R108 (2009)
Dembo, A., Karlin, S., Zeitouni, O.: Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Probab. 22(4), 2022–2039 (1994)
Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G.: Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinform. 8, 252 (2007)
Giancarlo, R., Scaturro, D., Utro, F.: Textual data compression in computational biology: a synopsis. Bioinformatics 25, 1575–1586 (2009)
Giancarlo, R., Rombo, S.E., Utro, F.: Epigenomic k-mer dictionaries: shedding light on how sequence composition influences nucleosome positioning in vivo. Bioinformatics 31, 2939–2946 (2015)
Gire, S.K., et al.: Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 345, 1369–1372 (2014)
Haubold, B.: Alignment-free phylogenetics and population genetics. Briefings Bioinform. 15, 407–418 (2013)
Karlin, S., Altschul, S.F.: Methods for assessing the statistical significance of molecular sequence features by using general scoring functions. PNAS 87(6), 2264–2268 (1990)
Katoh, K., Standley, D.M.: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30(4), 772–780 (2013)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Statist. 22, 79–86 (1951)
Lo Bosco, G.: Alignment free dissimilarities for nucleosome classification. In: Angelini, C., Rancoita, P.M.V., Rovetta, S. (eds.) CIBB 2015. LNCS, vol. 9874, pp. 114–128. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44332-4_9
Song, K., Ren, J., Reinert, G., Deng, M., Waterman, M.S., Sun, F.: New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Briefings Bioinform. 15(3), 343–353 (2014)
Stamatakis, A.: RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9), 1312–1313 (2014)
Utro, F., Di Benedetto, V., Corona, D.F., Giancarlo, R.: The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes. Bioinformatics 32, 835–842 (2016)

Author information

Authors and Affiliations

Computational Biology Center, IBM T. J. Watson Research, Yorktown Heights, NY, 10598, USA
Filippo Utro, Daniel E. Platt & Laxmi Parida

Corresponding author

Correspondence to Filippo Utro.

Editor information

Editors and Affiliations

University of Cagliari, Cagliari, Italy
Massimo Bartoletti
University of Genova, Genoa, Italy
Annalisa Barla
University of Stirling, Stirling, UK
Andrea Bracciali
Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
Gunnar W. Klau
Houston Methodist Research Institute, Houston, TX, USA
Leif Peterson
University of Udine, Udine, Italy
Alberto Policriti
University of Salerno, Fisciano, Italy
Roberto Tagliaferri

Cite this paper

Utro, F., Platt, D.E., Parida, L. (2019). A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction. In: Bartoletti, M., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2017. Lecture Notes in Computer Science, vol 10834. Springer, Cham. https://doi.org/10.1007/978-3-030-14160-8_3

DOI: https://doi.org/10.1007/978-3-030-14160-8_3
Published: 14 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14159-2
Online ISBN: 978-3-030-14160-8
eBook Packages: Computer Science (R0)
