Nonlinear Dynamics, Volume 93, Issue 3, pp 1059–1071

Kolmogorov complexity as a data similarity metric: application in mitochondrial DNA

  • Rómulo Antão
  • Alexandre Mota
  • J. A. Tenreiro Machado
Original Paper


This paper addresses the problem of developing a similarity index for different objects and evaluates the limitations of current metrics. The normalized compression distance, based on the non-computable Kolmogorov complexity, is examined and compared with two alternative measures. As a case study, a phylogenetic tree of different mammals is constructed by applying this technique to a mitochondrial DNA database.
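As a rough illustration of the normalized compression distance, the sketch below approximates the Kolmogorov complexity of a string by the length of its compressed form, since the true quantity is non-computable. It uses the widely cited formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The choice of zlib as the compressor, the helper names compressed_size and ncd, and the toy sequences are assumptions made for this example; they are not the setup or data used in the paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data (hypothetical, not real mitochondrial DNA):
# two nearly identical repetitive sequences and one pseudo-random sequence.
rng = random.Random(0)
seq_a = b"ACGTTGCA" * 200
seq_b = b"ACGTTGCA" * 199 + b"ACGTAGCA"
seq_c = bytes(rng.choice(b"ACGT") for _ in range(1600))

print("NCD(a, b):", ncd(seq_a, seq_b))  # similar sequences: smaller value
print("NCD(a, c):", ncd(seq_a, seq_c))  # unrelated sequences: larger value
```

Computing this value for every pair of sequences yields a distance matrix, which a clustering method can then turn into a tree. With a real compressor the result is only an approximation, and it may fall slightly outside the interval [0, 1].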


Keywords: Kolmogorov complexity; Normalized compression distance; Mitochondrial DNA


Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer Science+Business Media B.V., part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
  2. Department of Electrical Engineering, Institute of Engineering, Polytechnic of Porto, Porto, Portugal
