Using Kullback-Leibler Distance for Text Categorization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Abstract

A text categorization system aims to assign appropriate categories from a predefined classification scheme to incoming documents. These assignments may be used for various purposes, such as filtering or retrieval. This paper introduces a new, effective model for text categorization on a large corpus (roughly 1 million documents). Categorization is performed using the Kullback-Leibler distance (KLD) between the probability distribution of the document to classify and the probability distribution of each category. Using the same representation of categories, experiments show that the KLD method achieves substantial improvements over the tfidf method.
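The abstract does not give the paper's exact formulation, so the following is only a minimal sketch of the general idea: estimate a term distribution for the document and for each category, then pick the category with the smallest Kullback-Leibler divergence from the document. The additive smoothing (`alpha`), the function names, and the toy data are all assumptions made for illustration and are not taken from the paper. Note also that the standard divergence D(P‖Q) used here is not symmetric, so it is a "distance" only in an informal sense.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed term probabilities over a fixed vocabulary.

    Smoothing (an assumption of this sketch) keeps every probability
    positive, so the divergence below is always finite.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


def kl_divergence(p, q):
    """Standard KL divergence D(p || q) = sum_t p(t) * log(p(t) / q(t))."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def classify(doc_tokens, category_tokens, alpha=1.0):
    """Return the closest category name and the score of every category."""
    # Shared vocabulary: every term seen in the document or any category.
    vocabulary = set(doc_tokens)
    for tokens in category_tokens.values():
        vocabulary.update(tokens)

    p_doc = term_distribution(doc_tokens, vocabulary, alpha)
    scores = {
        name: kl_divergence(p_doc, term_distribution(tokens, vocabulary, alpha))
        for name, tokens in category_tokens.items()
    }
    # Lower divergence means the category distribution is closer to the document.
    return min(scores, key=scores.get), scores


# Hypothetical toy data, for illustration only.
categories = {
    "sports": "match goal team goal player".split(),
    "finance": "stock market bank stock price".split(),
}
best, scores = classify("team goal match".split(), categories)
print(best, scores)
```

In this toy run the document shares all of its terms with the "sports" sample, so that category receives the lowest divergence. A real system would estimate the category distributions from many training documents rather than a single token list.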

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bigi, B. (2003). Using Kullback-Leibler Distance for Text Categorization. In: Sebastiani, F. (ed.) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_22

  • DOI: https://doi.org/10.1007/3-540-36618-0_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
