Utilizing DTRS for Imbalanced Text Classification

  • Conference paper
Rough Sets (IJCRS 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9920)

Abstract

Imbalanced data classification is one of the challenging problems in data mining and machine learning research. Traditional classification algorithms are often biased towards the majority class when learning from imbalanced data. Many approaches have been proposed to address this problem, including data re-sampling, algorithm modification, and cost-sensitive learning. However, most of them focus on only one of these techniques. This paper proposes to combine algorithm modification and cost-sensitive learning based on the decision-theoretic rough set (DTRS) model. In particular, we use the naive Bayes classifier as the base classifier and modify it for imbalanced learning. For cost-sensitive learning, we adopt the systematic method from DTRS to derive the required thresholds that have the minimum decision cost. Our experimental results on three well-known text classification databases show that the unified DTRS performs similarly to the naive Bayes classifier on balanced class distributions, outperforms it on imbalanced datasets, and is competitive with other imbalanced learning classifiers.
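
To illustrate how thresholds with minimum decision cost can be derived, the sketch below applies the threshold formulas commonly given for the DTRS model in the rough set literature. It is not taken from the paper: the loss values, the `Losses` container, and the function names are hypothetical, and the paper's modified naive Bayes classifier is not reproduced here. The posterior passed to `decide` is assumed to come from some probabilistic classifier such as naive Bayes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Losses:
    """Hypothetical losses: lam_xy is the cost of action x when the true state is y.

    Actions: p (accept), b (defer), n (reject).
    States:  p (document belongs to the class), n (it does not).
    """
    lam_pp: float
    lam_bp: float
    lam_np: float
    lam_pn: float
    lam_bn: float
    lam_nn: float


def dtrs_thresholds(l: Losses) -> Tuple[float, float]:
    """Return (alpha, beta) for three-way decisions from the standard DTRS formulas.

    Assumes lam_pp <= lam_bp < lam_np and lam_nn <= lam_bn < lam_pn.
    """
    alpha = (l.lam_pn - l.lam_bn) / ((l.lam_pn - l.lam_bn) + (l.lam_bp - l.lam_pp))
    beta = (l.lam_bn - l.lam_nn) / ((l.lam_bn - l.lam_nn) + (l.lam_np - l.lam_bp))
    return alpha, beta


def two_way_threshold(l: Losses) -> float:
    """Return the single threshold gamma used when no deferment action is allowed."""
    return (l.lam_pn - l.lam_nn) / ((l.lam_pn - l.lam_nn) + (l.lam_np - l.lam_pp))


def decide(posterior: float, alpha: float, beta: float) -> str:
    """Map a posterior P(class | document) to the minimum-cost action."""
    if posterior >= alpha:
        return "accept"
    if posterior <= beta:
        return "reject"
    return "defer"


if __name__ == "__main__":
    # Assumed costs: missing a minority-class document (lam_np) is the most expensive error.
    losses = Losses(lam_pp=0, lam_bp=2, lam_np=10, lam_pn=4, lam_bn=1, lam_nn=0)
    alpha, beta = dtrs_thresholds(losses)
    print(alpha, beta, two_way_threshold(losses))
    print(decide(0.3, alpha, beta))
```

With these assumed costs the two-way threshold falls below 0.5, so a document is assigned to the minority class at a lower posterior than a cost-blind classifier would require. This is one way cost sensitivity can counteract a bias towards the majority class.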

Author information

Corresponding author

Correspondence to Bing Zhou.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Zhou, B., Yao, Y., Liu, Q. (2016). Utilizing DTRS for Imbalanced Text Classification. In: Flores, V., et al. Rough Sets. IJCRS 2016. Lecture Notes in Computer Science, vol 9920. Springer, Cham. https://doi.org/10.1007/978-3-319-47160-0_20

  • DOI: https://doi.org/10.1007/978-3-319-47160-0_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47159-4

  • Online ISBN: 978-3-319-47160-0

  • eBook Packages: Computer Science, Computer Science (R0)
