The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and is therefore uncomputable, but it can be approximated in practice in two ways. First, if the objects have a string representation, compression algorithms can be used to approximate the Kolmogorov complexity. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
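The two practical realizations mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own implementation: it assumes zlib as the stand-in compressor (any real compressor only approximates Kolmogorov complexity), and the function names `compressed_size`, `ncd`, and `ngd` are chosen here for illustration. The page counts passed to `ngd` must come from some search index; the ones in the usage lines are illustrative values.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of zlib-compressed data, a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized distance from page counts.

    fx, fy: number of pages containing each term (must be > 0)
    fxy:    number of pages containing both terms (must be > 0)
    n:      total number of indexed pages, used as a normalizing constant
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Usage: similar strings should score lower than unrelated ones.
a = b"ACGT" * 500
b = b"ACGT" * 499 + b"ACGA"
c = bytes(range(256)) * 8
print(ncd(a, b), ncd(a, c))

# Illustrative page counts for two terms and their co-occurrence.
print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))
```

Values near 0 indicate similar objects and values near 1 indicate unrelated ones. With a real compressor the result can slightly exceed 1 or stay above 0 for identical inputs, so the numbers are best read comparatively, for example as entries of a distance matrix handed to a clustering method.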
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Vitányi, P.M.B., Balbach, F.J., Cilibrasi, R.L., Li, M. (2009). Normalized Information Distance. In: Emmert-Streib, F., Dehmer, M. (eds) Information Theory and Statistical Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-84816-7_3
DOI: https://doi.org/10.1007/978-0-387-84816-7_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-84815-0
Online ISBN: 978-0-387-84816-7
eBook Packages: Computer Science (R0)