A Term Weighting Scheme Approach for Vietnamese Text Classification
The term weighting scheme, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies have shown that the choice of term weighting scheme largely determines classification performance. Term weighting has been studied extensively for English text classification, but few studies have addressed Vietnamese text classification. In this paper, we propose a term weighting scheme called normalize(tf.rf max), which is based on tf.rf, one of the most effective term weighting schemes to date. We conducted experiments comparing the proposed normalize(tf.rf max) scheme with tf.rf and tf.idf on a Vietnamese text classification benchmark. The results show that the proposed scheme achieves about 3%–5% higher accuracy than the other term weighting schemes.
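To make the two baseline schemes concrete, the following is a minimal sketch of tf.idf and tf.rf on a toy corpus. It assumes the commonly cited formulations, idf = log(N / df) and rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and other documents containing the term. The function names, the toy tokens, and the labels are illustrative assumptions, not taken from the paper. The proposed normalize(tf.rf max) scheme is not defined in this abstract, so it is not implemented here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def tf_rf(docs, labels, positive):
    """Weight each term by raw term frequency times its relevance frequency
    with respect to the category named by `positive` (assumed formulation)."""
    a = Counter()  # documents of the positive category containing the term
    c = Counter()  # documents of all other categories containing the term
    for doc, label in zip(docs, labels):
        (a if label == positive else c).update(set(doc))
    return [
        {t: tf * math.log2(2 + a[t] / max(1, c[t])) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


# Toy, pre-tokenized corpus (hypothetical data, diacritics omitted).
docs = [
    ["bong", "da", "bong"],
    ["bong", "ro"],
    ["kinh", "te"],
    ["kinh", "te", "bong"],
]
labels = ["sport", "sport", "economy", "economy"]

idf_weights = tf_idf(docs)
rf_weights = tf_rf(docs, labels, positive="sport")
```

Unlike tf.idf, tf.rf uses the category labels: a term concentrated in the positive category receives a larger weight than one spread evenly across categories, which is the property the abstract's tf.rf baseline relies on.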
Keywords: Term weighting scheme, Vietnamese text classification, tf.idf, tf.rf
This research is funded by Vietnam National University, Ho Chi Minh City (VNU-HCM) under grant number C2014-26-04.