Abstract
Feature selection plays an important role in text categorization. To address drawbacks of the traditional information gain (IG) approach and of recently improved variants, this paper proposes an improved IG feature selection method based on relative document frequency distribution. The method reduces the impact of unbalanced data sets and low-frequency features, and takes into account both the frequency distribution of a feature within a category and its relative document frequency distribution across categories. Experimental results on the NLPCC-ICCPOL 2016 task of stance detection in Chinese microblogs show that the improved method outperforms the traditional IG approach and another improved method in feature selection.
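For orientation, the traditional IG score that the proposed method builds on can be sketched as follows. This is a minimal illustration of the standard document-frequency form, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), and not the improved method of this paper, whose weighting terms are not given in the abstract. The function names and the toy documents and labels are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Traditional IG of a term, based only on its presence or absence per document.

    docs   : list of token sets, one per document (hypothetical input format)
    labels : list of class labels aligned with docs
    """
    classes = sorted(set(labels))
    # Class counts among documents that contain / do not contain the term.
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)

    n = len(docs)
    n_t = sum(with_t.values())
    n_not = n - n_t

    h_c = entropy([with_t[c] + without_t[c] for c in classes])
    h_given_t = entropy([with_t[c] for c in classes])
    h_given_not = entropy([without_t[c] for c in classes])
    return h_c - (n_t / n) * h_given_t - (n_not / n) * h_given_not


# Toy corpus (made up for illustration): two stance classes, four documents.
docs = [{"good", "policy"}, {"good", "idea"}, {"bad", "policy"}, {"bad", "idea"}]
labels = ["favor", "favor", "against", "against"]

ig_good = information_gain(docs, labels, "good")      # term perfectly separates the classes
ig_policy = information_gain(docs, labels, "policy")  # term is spread evenly over both classes
```

Because this baseline only looks at whether a term occurs in a document, it ignores how often the term occurs within a category and how class sizes differ, which are the weaknesses the abstract says the improved method targets.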
Acknowledgements
This research work is supported by the National Natural Science Foundation of China (No. 61402220, No. 61502221), the Scientific Research Fund of Hunan Provincial Education Department (No. 14B153, No. 16C1378), and the Philosophy and Social Science Foundation of Hunan Province (No. 14YBA335).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Peng, J., Yang, XH., Ouyang, CP., Liu, YB. (2016). An Improved Information Gain Algorithm Based on Relative Document Frequency Distribution. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. NLPCC-ICCPOL 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_49
DOI: https://doi.org/10.1007/978-3-319-50496-4_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)