Abstract
Kolmogorov complexity offers several ways to study natural processes that can be expressed as sequences of symbols from a finite alphabet, such as DNA sequences. Although Kolmogorov complexity is not computable, it can be approximated by lossless normal compressors. In this paper, we use a specific DNA compressor to approximate the Kolmogorov complexity and assess its normality. We then apply it to several datasets of DNA sequences representing complete genomes of different species and domains. The results yield several evolution-related insights associated with complexity, namely that, globally, archaea have higher relative complexity than bacteria and eukaryotes.
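The idea of approximating Kolmogorov complexity with a lossless compressor, and of checking that compressor for normality, can be illustrated with a minimal sketch. It assumes Python's general-purpose zlib as a stand-in; it is not the DNA-specific compressor used in the paper, and the function names (`compressed_bits`, `relative_complexity`, `idempotency_ratio`) are hypothetical. Here "relative complexity" is taken to be the compressed size divided by the trivial cost of log2(4) = 2 bits per base, which is also an assumption of this sketch.

```python
import math
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Size in bits of the zlib-compressed sequence.

    This is only an upper-bound style estimate of the Kolmogorov
    complexity; zlib is a stand-in, not the paper's DNA compressor.
    """
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def relative_complexity(seq: str, alphabet_size: int = 4) -> float:
    """Compressed size divided by the trivial encoding size.

    The trivial encoding spends log2(alphabet_size) bits per symbol,
    i.e. 2 bits per base for the DNA alphabet {A, C, G, T}.
    """
    return compressed_bits(seq) / (len(seq) * math.log2(alphabet_size))


def idempotency_ratio(seq: str) -> float:
    """C(xx) / C(x): close to 1 for a compressor that behaves normally."""
    return compressed_bits(seq + seq) / compressed_bits(seq)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    random_dna = "".join(rng.choice("ACGT") for _ in range(5000))
    repetitive_dna = "ACGT" * 1250  # same length, highly redundant

    print("relative complexity (random):    ", relative_complexity(random_dna))
    print("relative complexity (repetitive):", relative_complexity(repetitive_dna))
    print("idempotency ratio C(xx)/C(x):    ", idempotency_ratio(random_dna))
```

A highly repetitive sequence should score far lower than a random one of the same length. The idempotency ratio probes only one of the normality properties; monotonicity, symmetry, and distributivity would need analogous checks. A general-purpose compressor with a bounded window, such as zlib, tends to drift away from idempotency once the sequence outgrows that window, which is one reason a compressor's normality is worth assessing before relying on it.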
Acknowledgments
This work was partially funded by FEDER (Programa Operacional Factores de Competitividade - COMPETE) and by National Funds through the FCT - Foundation for Science and Technology, in the context of the projects UID/CEC/00127/2013, PTCD/EEI-SII/6608/2014.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pratas, D., Pinho, A.J. (2017). On the Approximation of the Kolmogorov Complexity for DNA Sequences. In: Alexandre, L., Salvador Sánchez, J., Rodrigues, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science, vol 10255. Springer, Cham. https://doi.org/10.1007/978-3-319-58838-4_29
DOI: https://doi.org/10.1007/978-3-319-58838-4_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58837-7
Online ISBN: 978-3-319-58838-4
eBook Packages: Computer Science, Computer Science (R0)