
Performance Comparison of TF*IDF, LDA and Paragraph Vector for Document Classification

  • Jindong Chen
  • Pengjia Yuan
  • Xiaoji Zhou
  • Xijin Tang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 660)

Abstract

To meet the demand for fast and effective document classification in the Web 2.0 era, the most direct strategy is to reduce the dimensionality of the document representation without losing much information. Topic models and neural network language models are the two main approaches to representing documents in a low-dimensional space. To compare the effectiveness of the bag-of-words, topic-model and neural-network-language-model approaches for document classification, TF*IDF, latent Dirichlet allocation (LDA) and the Paragraph Vector model are selected. Support vector machine classifiers are trained on the vectors generated by each of the three methods, and their performance is evaluated on English and Chinese document collections. The experimental results show that TF*IDF outperforms LDA and Paragraph Vector, but its high-dimensional vectors cost considerably more time and memory. Furthermore, cross-validation reveals that stop-word elimination and the size of the training set significantly affect the performance of LDA and Paragraph Vector, and that Paragraph Vector shows the potential to surpass the other two methods. Finally, suggestions on stop-word elimination and data size for LDA and Paragraph Vector training are provided.
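To make the compared pipeline concrete, the following is a minimal sketch, not the authors' implementation: it builds all three document representations with off-the-shelf libraries (scikit-learn and gensim 4.x are assumed) and trains a linear SVM on each. The toy corpus, topic count, vector size, stop-word setting and kernel choice are illustrative placeholders, not the paper's experimental settings.

    # Sketch of the comparison: TF*IDF vs. LDA vs. Paragraph Vector, each fed to an SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = ["first example document ...", "second example document ..."]  # placeholder corpus
    labels = [0, 1]  # placeholder class labels; a real run needs many labeled documents

    # 1) TF*IDF: high-dimensional sparse bag-of-words weighting.
    tfidf_vecs = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # 2) LDA: low-dimensional document-topic proportions learned from raw term counts.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda_vecs = LatentDirichletAllocation(n_components=50, random_state=0).fit_transform(counts)

    # 3) Paragraph Vector (gensim's Doc2Vec, PV-DM by default): dense learned embeddings.
    tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
    pv_model = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=1)
    pv_vecs = [pv_model.dv[i] for i in range(len(docs))]

    # Train a linear SVM on each representation and compare by cross-validation
    # (cv=5 requires at least five examples per class).
    for name, X in [("TF*IDF", tfidf_vecs), ("LDA", lda_vecs), ("PV", pv_vecs)]:
        print(name, cross_val_score(LinearSVC(), X, labels, cv=5).mean())

Note the trade-off the abstract describes: the LDA and Paragraph Vector representations above are dense and orders of magnitude lower-dimensional than the sparse TF*IDF matrix, so downstream SVM training is cheaper in time and memory even when TF*IDF scores higher.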

Keywords

TF*IDF · LDA · Paragraph Vector · Support vector machine · Document classification

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant Nos. 61473284, 61379046 and 71371107. The authors would like to thank the other members who contributed their efforts to the experiments.

Copyright information

© Springer Nature Singapore Pte Ltd. 2016

Authors and Affiliations

  • Jindong Chen (1)
  • Pengjia Yuan (2)
  • Xiaoji Zhou (1)
  • Xijin Tang (2)
  1. China Academy of Aerospace Systems Science and Engineering, Beijing, People’s Republic of China
  2. Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People’s Republic of China
