Performance Comparison of TF*IDF, LDA and Paragraph Vector for Document Classification
To meet the demand for fast and effective document classification in Web 2.0, the most direct strategy is to reduce the dimensionality of the document representation without much information loss. Topic models and neural network language models are the two main approaches to representing documents in a low-dimensional space. To compare the effectiveness of the bag-of-words model, topic models and neural network language models for document classification, TF*IDF, latent Dirichlet allocation (LDA) and the Paragraph Vector model are selected. Support vector machine classifiers are then trained on the vectors generated by each of the three methods, and their performance is evaluated on English and Chinese document collections. The experimental results show that TF*IDF outperforms LDA and Paragraph Vector, but its high-dimensional vectors consume considerable time and memory. Furthermore, cross validation reveals that stop-word elimination and the size of the training set significantly affect the performance of LDA and Paragraph Vector, and that Paragraph Vector shows the potential to outperform the other two methods. Finally, suggestions on stop-word elimination and training data size for LDA and Paragraph Vector are provided.
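As background for the comparison, the TF*IDF weighting that the first method relies on can be sketched in a few lines of pure Python. The toy corpus and whitespace-free token lists below are illustrative only, not the paper's English or Chinese datasets, and the SVM classification stage that the paper builds on top of these vectors is omitted:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF*IDF vectors for a tokenized corpus.

    TF is the raw term frequency within a document; IDF is
    log(N / df), where df is the number of documents containing
    the term. Returns one {term: weight} dict per document.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Toy corpus: in practice the tokens come from a real tokenizer.
docs = [["topic", "model", "lda"],
        ["neural", "network", "model"],
        ["svm", "classifier"]]
vecs = tf_idf(docs)
# "model" occurs in 2 of 3 documents (IDF = log(3/2)), while "lda"
# occurs in only one (IDF = log 3), so "lda" gets the higher weight.
```

The dictionary-per-document output also makes the abstract's point about dimensionality concrete: each document's vector conceptually spans the whole vocabulary, which is why TF*IDF representations grow large and costly compared with the low-dimensional LDA and Paragraph Vector alternatives.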
Keywords: TF*IDF · LDA · Paragraph Vector · Support vector machine · Document classification
This research is supported by the National Natural Science Foundation of China under Grant Nos. 61473284, 61379046 and 71371107. The authors would like to thank the other members who contributed their effort to the experiments.