Normalized Information Distance

Vitányi, Paul M. B.; Balbach, Frank J.; Cilibrasi, Rudi L.; Li, Ming

doi:10.1007/978-0-387-84816-7_3

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
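The two practical realizations mentioned above can be sketched in a few lines. This is a minimal illustration and not the chapter's own code: it assumes zlib as the stand-in compressor, uses the commonly published forms of the normalized compression distance and of the page-count based (Google) similarity distance, and takes page counts as plain numbers supplied by the caller. The function names `compressed_size`, `ncd`, and `ngd` are hypothetical, and the sample inputs are made up for illustration.

```python
import math
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of zlib-compressed data; a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Page-count distance from counts for x, y, both together, and index size n.

    All counts must be positive. They are assumed to come from some search
    index; no real search API is called here.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


if __name__ == "__main__":
    # Made-up inputs: two similar texts and one block of pseudo-random bytes.
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy cat " * 20
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(len(a)))

    print("ncd(a, b)     =", round(ncd(a, b), 3))
    print("ncd(a, noise) =", round(ncd(a, noise), 3))

    # Hypothetical page counts, not real search statistics.
    print("ngd           =", round(ngd(5.0e6, 2.0e6, 4.0e5, 1.0e10), 3))
```

Values near 0 indicate similar objects and values near 1 indicate unrelated ones. With a real compressor the result can slightly exceed 1 or fall short of 0 for identical inputs, so the numbers are best read as relative rankings rather than exact distances.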

Author information

Authors and Affiliations

CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
Paul M. B. Vitányi & Rudi L. Cilibrasi
University of Waterloo, Waterloo, ON, Canada
Frank J. Balbach & Ming Li

Corresponding authors

Correspondence to Paul M. B. Vitányi, Frank J. Balbach or Rudi L. Cilibrasi.

Editor information

Editors and Affiliations

Department of Biostatistics and Department of Genome Sciences, University of Washington, 1705 NE Pacific St., Box 357730, Seattle, WA, 98195, USA
Frank Emmert-Streib
Queen's University Belfast Computational Biology and Machine Learning, Center for Cancer Research and Cell Biology School of Biomedical Sciences, 97 Lisburn Road, Belfast, BT9 7BL, UK
Frank Emmert-Streib
Institute of Discrete Mathematics and Geometry, Vienna University of Technology, Wiedner Hauptstr. 8–10, Vienna, 1040, Austria
Matthias Dehmer
Probability and Statistics, University of Coimbra Center for Mathematics, Apartado 3008, Coimbra, 3001–454, Portugal
Matthias Dehmer

About this chapter

Cite this chapter

Vitányi, P.M.B., Balbach, F.J., Cilibrasi, R.L., Li, M. (2009). Normalized Information Distance. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_3

DOI: https://doi.org/10.1007/978-0-387-84816-7_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7
eBook Packages: Computer Science, Computer Science (R0)