Similarity of Objects and the Meaning of Words

  • Rudi Cilibrasi
  • Paul Vitanyi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3959)


We survey the emerging area of compression-based, parameter-free similarity distance measures useful in data mining, pattern recognition, learning, and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract, like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between (the names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression, and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
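
As a rough illustration of the two kinds of distance named in the abstract, the sketch below computes a compression-based distance between two literal objects and a page-count-based distance between two names. The abstract itself gives no formulas; the expressions used here are the normalized compression distance and normalized Google distance as they are commonly stated in the authors' related work, and should be read as an assumption about what is meant. The function names, the choice of zlib as the compressor, and the sample inputs are all hypothetical; page counts are passed in as plain integers rather than fetched from any search service.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based distance between two literal objects.

    Assumed form: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Page-count-based distance between two search terms.

    fx, fy: page counts for each term alone; fxy: pages containing both;
    n: assumed total number of indexed pages. All counts must be positive,
    since the formula takes their logarithms.
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical sample inputs: a repetitive English phrase and unrelated byte data.
text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4

same_distance = ncd(text, text)    # an object compared with itself
cross_distance = ncd(text, other)  # two unrelated objects
```

With a real compressor the distance of an object to itself is small but not exactly zero, and results depend on the compressor chosen, so values from this sketch are only indicative.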


Latent Semantic Analysis, Kolmogorov Complexity, Language Tree, Normalize Compression Distance, Compression Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Rudi Cilibrasi (1)
  • Paul Vitanyi (1)
  1. CWI, Amsterdam, The Netherlands
