Reducing Effects of Class Imbalance Distribution in Multi-class Text Categorization

  • Part Pramokchon
  • Punpiti Piamsa-nga
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 265)

Abstract

In multi-class text classification, when the number of documents in each class is highly imbalanced, feature ranking methods usually perform poorly because the larger classes dominate the feature scores while the smaller classes are effectively ignored. This research addresses the problem by splitting the larger classes into several smaller subclasses according to document proximity using k-means clustering; feature scores are then computed over these subclasses instead of the original main classes. This cluster-based feature scoring method is proposed to reduce the influence of skewed class distributions. Compared with feature sets selected from the main classes and from ground-truth subclasses, the experimental results show that the feature set selected by the proposed method achieves a significant improvement when classifying an imbalanced corpus, the RCV1v2 dataset.
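To make the idea concrete, the sketch below illustrates one possible reading of the cluster-based scoring step: classes larger than a size threshold are partitioned into k-means subclasses, and the feature-ranking metric is then computed against the refined subclass labels rather than the original classes. The chi-square metric, the `max_class_size` threshold, and the fixed number of subclasses are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch of cluster-based feature scoring, assuming a chi-square
# ranking metric and a simple size threshold for deciding which classes to
# split; the paper's exact metric and cluster-count rule may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import chi2

def cluster_based_feature_scores(X, y, max_class_size=1000, n_subclasses=5):
    """Split large classes into k-means subclasses, then score each term
    against the refined subclass labels instead of the original classes.

    X : non-negative document-term matrix (counts or tf-idf), dense or sparse
    y : original class label per document
    """
    y = np.asarray(y)
    refined = np.array([str(c) for c in y], dtype=object)  # e.g. "earn" -> "earn#2"
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        if len(idx) > max_class_size:                 # assumed test for a "large" class
            km = KMeans(n_clusters=n_subclasses, n_init=10, random_state=0)
            sub = km.fit_predict(X[idx])
            refined[idx] = [f"{label}#{s}" for s in sub]
    scores, _ = chi2(X, refined)                      # score terms against subclasses
    return scores

# Usage: rank terms by score and keep the top-k as the selected feature set.
# selected = np.argsort(cluster_based_feature_scores(X, y))[::-1][:2000]
```

Scoring against subclasses keeps any single dominant class from washing out terms that are discriminative only for the minority classes, which is the effect the proposed method aims to reduce.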

Keywords

feature selection, ranking method, text categorization, class imbalance distribution

References

  1. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
  2. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
  3. Forman, G.: Feature Selection for Text Classification. In: Computational Methods of Feature Selection. Chapman and Hall/CRC Press (2007)
  4. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: 14th International Conference on Machine Learning, pp. 412–420. Morgan Kaufmann Publishers Inc. (1997)
  5. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17, 491–502 (2005)
  6. Soucy, P., Mineau, G.W.: Feature Selection Strategies for Text Categorization. In: Xiang, Y., Chaib-draa, B. (eds.) AI 2003. LNCS (LNAI), vol. 2671, pp. 505–509. Springer, Heidelberg (2003)
  7. Uchyigit, G., Clark, K.: A new feature selection method for text classification. International Journal of Pattern Recognition and Artificial Intelligence 21, 423–438 (2007)
  8. Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. SIGKDD Explor. Newsl. 6, 80–89 (2004)
  9. He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE Trans. on Knowl. and Data Eng. 21, 1263–1284 (2009)
  10. Makrehchi, M., Kamel, M.S.: Impact of Term Dependency and Class Imbalance on the Performance of Feature Ranking Methods. International Journal of Pattern Recognition and Artificial Intelligence 25, 953–983 (2011)
  11. Makrehchi, M., Kamel, M.S.: Combining feature ranking for text classification. In: IEEE International Conference on Systems, Man and Cybernetics (ISIC 2007), pp. 510–515 (2007)
  12. MacQueen, J.B.: Some Methods for Classification and Analysis of MultiVariate Observations. In: Le Cam, L.M., Neyman, J. (eds.) Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)
  13. Lewis, D.D., Yang, Y., Rose, T., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5, 361–397 (2004)
  14. Lee, L.-W., Chen, S.-M.: New Methods for Text Categorization Based on a New Feature Selection Method and a New Similarity Measure Between Documents. In: Ali, M., Dapoigny, R. (eds.) IEA/AIE 2006. LNCS (LNAI), vol. 4031, pp. 1280–1289. Springer, Heidelberg (2006)
  15. Drucker, H., Wu, D., Vapnik, V.N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10, 1048–1054 (1999)
  16. Elias, F.C., Elena, M., Irene, D.A., Jos, R., Ricardo, M.: Introducing a Family of Linear Measures for Feature Selection in Text Categorization. IEEE Transactions on Knowledge and Data Engineering 17, 1223–1232 (2005)
  17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand
