A Data-Compression Approach to the Monolingual GIRT Task: An Agnostic Point of View

  • Daniela Alderuccio
  • Luciana Bordoni
  • Vittorio Loreto
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3237)

Abstract

In this paper we apply a data-compression IR method to the GIRT social science database, focusing on the monolingual task in German and English. For this purpose we use a recently proposed general scheme for context recognition and context classification of strings of characters (in particular texts) or other coded information. The key point of the method is the computation of a suitable measure of remoteness (or similarity) between two strings of characters. This measure reflects the distance between the structures present in the two strings, i.e. between the distributions of elements in the two compared sequences. The hypothesis is that this information-theoretic measure of remoteness between two sequences could reflect their semantic distance. It is worth stressing the generality and versatility of our information-theoretic method, which applies to any corpus of character strings, whatever the type of coding (i.e. language) used.
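As a rough illustration of how a compression-based measure of remoteness between two strings can be computed, the sketch below asks how many extra compressed bytes a query costs when it is appended to a document. It is only a minimal sketch under stated assumptions: Python's `zlib` (DEFLATE) stands in for whichever compressor the authors used, and the helper names `compressed_size`, `cross_cost` and `rank_by_remoteness`, as well as the toy corpus, are hypothetical and not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: str, probe: str) -> int:
    """Extra compressed bytes needed to encode `probe` once `reference` is known.

    The compressor can reuse substrings of `reference` when encoding `probe`,
    so a small value suggests the two strings share structure.
    Assumption: this difference is used as a stand-in for the remoteness
    measure; it is not the authors' exact definition.
    """
    ref = reference.encode("utf-8")
    return compressed_size(ref + probe.encode("utf-8")) - compressed_size(ref)


def rank_by_remoteness(query: str, documents: dict[str, str]) -> list[tuple[str, int]]:
    """Return (document id, cost) pairs sorted from least to most remote."""
    costs = {doc_id: cross_cost(text, query) for doc_id, text in documents.items()}
    return sorted(costs.items(), key=lambda item: item[1])


# Toy corpus (hypothetical, not taken from GIRT).
documents = {
    "labour": "A survey of unemployment among young workers and the effects "
              "of labour market policy in industrial regions.",
    "family": "Changing family structures, marriage patterns and the "
              "upbringing of children in rural households.",
}
query = "unemployment among young workers"

ranking = rank_by_remoteness(query, documents)
print(ranking)
```

Because the procedure never inspects words or grammar, the same code runs unchanged on German or English text. One practical caveat of this stand-in compressor is that DEFLATE only looks back 32 KB, so longer reference texts would need to be truncated or split.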

Keywords

Relative Entropy; Semantic Distance; Kolmogorov Complexity; Reference Text; Context Classification
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Daniela Alderuccio (1)
  • Luciana Bordoni (1)
  • Vittorio Loreto (2)
  1. ENEA – Uda/Advisor, Centro Ricerche Casaccia, S. Maria di Galeria (Rome), Italy
  2. Physics Dept., “La Sapienza”, Univ. in Rome, Rome, Italy