
The normalized information distance is a universal distance measure for objects of all kinds. Because it is based on Kolmogorov complexity, it is uncomputable, but it can be approximated in practice. First, if the objects have a string representation, compression algorithms can be used to approximate their Kolmogorov complexity. Second, for names and abstract concepts, page-count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations, and presents numerous examples of successful real-world applications of these distance measures, ranging from bioinformatics to music clustering to machine translation.
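
As a concrete illustration (not drawn from the chapter itself), the sketch below computes the two practical realizations mentioned above: the normalized compression distance, here using zlib as the stand-in compressor, and the Google-based distance, here taking hypothetical page counts as plain function arguments instead of querying a search engine. The compressor choice, the example strings, and the page counts are assumptions for illustration only; a better compressor gives a closer approximation to the ideal, uncomputable distance.

```python
import math
import zlib

def compressed_size(data: bytes) -> int:
    # Length in bytes of the zlib-compressed string, used as a rough
    # stand-in for the (uncomputable) Kolmogorov complexity C(data).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance:
    #   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    # Page-count-based distance:
    #   NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
    #               / (log N - min(log f(x), log f(y)))
    # fx, fy: pages containing each term; fxy: pages containing both;
    # n: total number of indexed pages. All four values are assumed inputs here.
    lfx, lfy, lfxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lfx, lfy) - lfxy) / (ln - min(lfx, lfy))

if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = b"the quick brown fox jumps over the lazy cat " * 50
    c = bytes(range(256)) * 10
    print(ncd(a, b))  # near 0: the two texts share almost all of their structure
    print(ncd(a, c))  # closer to 1: little shared structure
    # Hypothetical page counts, for illustration only.
    print(ngd(fx=100_000, fy=80_000, fxy=60_000, n=10_000_000_000))
```

Pairwise distances computed this way can be fed directly into a standard clustering routine, which is the feature-free, parameter-free data-mining setting the chapter describes.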

Author information

Correspondence to Paul M. B. Vitányi, Frank J. Balbach or Rudi L. Cilibrasi.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Vitányi, P.M.B., Balbach, F.J., Cilibrasi, R.L., Li, M. (2009). Normalized Information Distance. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_3

  • DOI: https://doi.org/10.1007/978-0-387-84816-7_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-84815-0

  • Online ISBN: 978-0-387-84816-7
