Skip to main content

A Simple Probability Based Term Weighting Scheme for Automated Text Classification

  • Conference paper
New Trends in Applied Artificial Intelligence (IEA/AIE 2007)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4570))

  • 1359 Accesses

Abstract

In the automated text classification, tfidf is often considered as the default term weighting scheme and has been widely reported in literature. However, tfidf does not directly reflect terms’ category membership. Inspired by the analysis of various feature selection methods, we propose a simple probability based term weighting scheme which directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are nicely supported by probability estimates which embody the category membership. Our experimental study based on two data sets, including Reuters-21578, demonstrates that the proposed probability based term weighting scheme outperforms tfidf significantly using Bayesian classifier and Support Vector Machines (SVM).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co, Boston, MA (1999)

    Google Scholar 

  2. Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM symposium on Applied computing, ACM Press, New York (2003)

    Google Scholar 

  3. Dumais, S., Chen, H.: Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000), ACM Press, New York (2000)

    Google Scholar 

  4. Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3, 1289–1305 (2003)

    MATH  Google Scholar 

  5. Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) Machine Learning: ECML-98. LNCS, vol. 1398, Springer, Heidelberg (1998)

    Chapter  Google Scholar 

  6. Joachims, T.: A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (2001)

    Google Scholar 

  7. Leopold, E., Kindermann, J.: Text Categorization with Support Vector Machines - How to Represent Texts in Input Space. Machine Learning 46, 423–444 (2002)

    Article  MATH  Google Scholar 

  8. Liu, Y., Loh, H.T., Tor, S.B.: Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium (2004)

    Google Scholar 

  9. Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perception learning, and a usability case study for text categorization. In: ACM SIGIR Forum. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1997)

    Google Scholar 

  10. Rennie, J.D.M., Shih, L., Teevan, J., Karger, D.R.: Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (2003)

    Google Scholar 

  11. Ruiz, M.E., Srinivasan, P.: Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5, 87–118 (2002)

    Article  MATH  Google Scholar 

  12. Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24, 513–523 (1988)

    Article  Google Scholar 

  13. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)

    Google Scholar 

  14. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34, 1–47 (2002)

    Article  Google Scholar 

  15. Sun, A., Lim, E.-P., Ng, W.-K., Srivastava, J.: Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16, 1305–1308 (2004)

    Article  Google Scholar 

  16. van_Rijsbergen, C.J.: Information Retrieval. 2nd edn. Butterworths, London, UK (1979)

    Google Scholar 

  17. Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York (1999)

    Google Scholar 

  18. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

    MATH  Google Scholar 

  19. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1999)

    Google Scholar 

  20. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning (1997)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Hiroshi G. Okuno Moonis Ali

Rights and permissions

Reprints and permissions

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Liu, Y., Loh, H.T. (2007). A Simple Probability Based Term Weighting Scheme for Automated Text Classification. In: Okuno, H.G., Ali, M. (eds) New Trends in Applied Artificial Intelligence. IEA/AIE 2007. Lecture Notes in Computer Science(), vol 4570. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73325-6_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-73325-6_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73322-5

  • Online ISBN: 978-3-540-73325-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics