Handling of Imbalanced Data in Text Classification: Category-Based Term Weights

Abstract

Learning from imbalanced data has emerged as a new challenge for the machine learning (ML), data mining (DM) and text mining (TM) communities. Two workshops, held at the AAAI conference in 2000 [17] and the ICML conference in 2003 [7], and a special issue of ACM SIGKDD Explorations [8] were dedicated to this topic. The problem has attracted growing interest among researchers and practitioners seeking ways to handle imbalanced data. An excellent review of the state of the art is given by Gary Weiss [43].

References

  1. Baeza-Yates R. & Ribeiro-Neto B. (1999) Modern information retrieval. Addison-Wesley Longman Publishing Co. Inc., Boston, MA, USA

  2. Baoli L., Qin L. & Shiwen Y. (2004) An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP) 3:215–226

  3. Blum A. & Mitchell T. (1998) Combining Labeled and Unlabeled Data with Co-Training. In: COLT: Proceedings of the Workshop on Computational Learning Theory

  4. Brank J., Grobelnik M., Milic-Frayling N. & Mladenic D. (2003) Training text classifiers with SVM on very few positive examples. Report MSR-TR-2003-34

  5. Burges C. J. C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2:121–167

  6. Castillo M. D. d. & Serrano J. I. (2004) A multistrategy approach for digital text categorization from imbalanced documents. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:70–79

  7. Chawla N., Japkowicz N. & Kolcz A. (eds) (2003) Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets

  8. Chawla N., Japkowicz N. & Kolcz A. (eds) (2004) Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter 6

  9. Debole F. & Sebastiani F. (2003) Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM Symposium on Applied computing

  10. Dietterich T., Margineantu D., Provost F. & Turney P. (eds) (2000) Proceedings of the ICML’2000 Workshop on Cost-sensitive Learning

  11. Dumais S. & Chen H. (2000) Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000)

  12. Elkan C. (2001) The Foundations of Cost-Sensitive Learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI’01)

  13. Fan W., Yu P. S. & Wang H. (2004) Mining Extremely Skewed Trading Anomalies. In: Advances in Database Technology-EDBT 2004: 9th International Conference on Extending Database Technology

  14. Forman G. (2003) An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3:1289–1305

  15. Ghani R. (2002) Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In: International Conference on Machine Learning (ICML 2002)

  16. Goldman S. & Zhou Y. (2000) Enhancing Supervised Learning with Unlabeled Data. In: Proceedings of 17th International Conference on Machine Learning

  17. Japkowicz N. (ed) (2000) Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets. AAAI Tech Report WS-00-05, AAAI

  18. Japkowicz N., Myers C. & Gluck M. A. (1995) A Novelty Detection Approach to Classification. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95)

  19. Joachims T. (1998) Text categorization with Support Vector Machines: Learning with many relevant features. In: ECML-98, Tenth European Conference on Machine Learning

  20. Joachims T. (2001) A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

  21. Joachims T. (2002) Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers

  22. Leopold E. & Kindermann J. (2002) Text Categorization with Support Vector Machines-How to Represent Texts in Input Space. Machine Learning 46:423–444

  23. Lewis D. D. & Gale W. A. (1994) A Sequential Algorithm for Training Text Classifiers. In: Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval

  24. Lewis D. D., Yang Y., Rose T. G. & Li F. (2004) RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research 5:361–397

  25. Liu A. Y. C. (2004) The effect of oversampling and undersampling on classifying imbalanced text datasets. Master's thesis. University of Texas at Austin

  26. Liu B., Dai Y., Li X., Lee W. S. & Yu P. (2003) Building Text Classifiers Using Positive and Unlabeled Examples. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM’03)

  27. Liu Y., Loh H. T. & Tor S. B. (2004) Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium 2004

  28. Liu Y., Loh H. T., Youcef-Toumi K. & Tor S. B. (2005) MCV1: An Engineering Paper Corpus for Manufacturing Knowledge Retrieval. submitted to the Journal of Knowledge and Information System (KAIS)

  29. Man L. (2004) A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with Support Vector Machines. In: Text Seminar of CHIME Group at the National University of Singapore

  30. Manevitz L. M. & Yousef M. (2002) One-class svms for document classification. The Journal of Machine Learning Research 2:139–154

  31. Mladenic D. & Grobelnik M. (1999) Feature Selection for Unbalanced Class Distribution and Naive Bayes. In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML’99

  32. Ng H. T., Goh W. B. & Low K. L. (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: ACM SIGIR Forum, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval

  33. Nickerson A., Japkowicz N. & Milios E. (2001) Using Unsupervised Learning to Guide Re-Sampling in Imbalanced Data Sets. In: Proceedings of the Eighth International Workshop on AI and Statistics

  34. Nigam K. P. (2001) Using unlabeled data to improve text classification. PhD thesis. Carnegie Mellon University

  35. Raskutti B. & Kowalczyk A. (2004) Extreme re-balancing for SVMs: a case study. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:60–69

  36. Rijsbergen C. J. v. (1979) Information Retrieval. 2nd edn. Butterworths, London, UK

  37. Ruiz M. E. & Srinivasan P. (2002) Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5:87–118

  38. Salton G. & Buckley C. (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24:513–523

  39. Salton G. & McGill M. J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA

  40. Sebastiani F. (2002) Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34:1–47

  41. Sun A., Lim E. P., Ng W. K. & Srivastava J. (2004) Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16:1305–1308

  42. Vapnik V. N. (1999) The Nature of Statistical Learning Theory. 2nd edn. Springer-Verlag, New York

  43. Weiss G. M. (2004) Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:7–19

  44. Weiss G. M. & Provost F. (2003) Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315–354

  45. Yang Y. (1996) Sampling Strategies and Learning Efficiency in Text Categorization. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access

  46. Yang Y. & Liu X. (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval

  47. Yang Y. & Pedersen J. O. (1997) A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning

  48. Yu H., Zhai C. & Han J. (2003) Text Classification from Positive and Unlabeled Documents. In: Proceedings of the twelfth international conference on Information and knowledge management (CIKM 2003)

  49. Zelikovitz S. & Hirsh H. (2000) Improving Short Text Classification Using Unlabeled Background Knowledge. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000)

  50. Zheng Z., Wu X. & Srihari R. (2004) Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets 6:80–89

Copyright information

© 2007 Springer-Verlag London Limited

About this chapter

Cite this chapter

Liu, Y., Loh, H.T., Kamal, YT., Tor, S.B. (2007). Handling of Imbalanced Data in Text Classification: Category-Based Term Weights. In: Kao, A., Poteet, S.R. (eds) Natural Language Processing and Text Mining. Springer, London. https://doi.org/10.1007/978-1-84628-754-1_10

  • DOI: https://doi.org/10.1007/978-1-84628-754-1_10

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84628-175-4

  • Online ISBN: 978-1-84628-754-1

  • eBook Packages: Computer Science, Computer Science (R0)
