Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA

  • Rómulo Antão
  • Alexandre Mota
  • J. A. Tenreiro Machado
Original Paper

Abstract

This paper addresses the problem of developing a similarity index for different objects. The limitations of current metrics are evaluated. The normalized compression distance, based on the non-computable Kolmogorov complexity, is examined and compared with two alternative measures. As a case study, a phylogenetic tree of different mammals is constructed by applying this technique to a mitochondrial DNA database.
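
The normalized compression distance named in the abstract is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length produced by a real compressor standing in for the non-computable Kolmogorov complexity. The following is a minimal sketch of that formula, not the authors' implementation: the choice of zlib as compressor, the synthetic sequences, and the names compressed_size, ncd, seq_a, seq_b and seq_c are all assumptions made for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input.

    Assumption: zlib is only a convenient stand-in for an ideal compressor;
    it is not stated to be the compressor used in the paper.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic sequences for illustration only (not real mitochondrial DNA).
rng = random.Random(0)
seq_a = bytes(rng.choice(b"ACGT") for _ in range(2000))

# seq_b: a copy of seq_a with 20 random point substitutions.
mutated = bytearray(seq_a)
for pos in rng.sample(range(len(mutated)), 20):
    mutated[pos] = rng.choice(b"ACGT")
seq_b = bytes(mutated)

# seq_c: an independently generated random sequence.
seq_c = bytes(rng.choice(b"ACGT") for _ in range(2000))

print("NCD(a, b) =", ncd(seq_a, seq_b))  # related sequences
print("NCD(a, c) =", ncd(seq_a, seq_c))  # unrelated sequences
```

Under these assumptions, the related pair should score closer to 0 than the unrelated pair. A pairwise matrix of such distances is the kind of input a tree-building method would take, although the exact values depend on the compressor chosen.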

Keywords

  • Kolmogorov complexity
  • Normalized compression distance
  • Mitochondrial DNA

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.


Copyright information

© Springer Science+Business Media B.V., part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
  2. Department of Electrical Engineering, Institute of Engineering, Polytechnic of Porto, Porto, Portugal
