Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion

  • Minh-Triet Tran
  • Lap Q. Trieu
  • Huy Q. Tran

Abstract

Document vectorization with an appropriate encoding scheme is an essential component of many document processing tasks, including text document classification, retrieval, and generation. Training a dedicated document embedding for a specific domain may require a large amount of data and sufficient resources. This motivates us to propose a novel document representation scheme with two main components. First, we train TD2V, a generic pre-trained document embedding for English documents, on more than one million tweets from Twitter. Second, we propose a domain adaptation process with adversarial training to adapt TD2V to different domains. To classify a document, we use the ranked list of its most similar documents, obtained with a query expansion technique, either Average Query Expansion or Discriminative Query Expansion. Experiments on datasets from different online sources show that, using TD2V alone, our method classifies documents more accurately than existing methods. Applying the adversarial adaptation process further improves accuracy on the BBC, BBCSport, Amazon4, and 20NewsGroup datasets. We also evaluate our method on the specific domain of sensitivity classification and achieve an accuracy higher than \(95\%\), even with text fragments as short as 1024 characters, on five datasets: Snowden, Mormon, Dyncorp, TM, and Enron.
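The classification step described above can be illustrated with a minimal sketch of Average Query Expansion. This is not the authors' implementation: the two-dimensional vectors stand in for TD2V embeddings, the function names (`rank`, `average_query_expansion`, `classify`) and the parameters `top_k` and `expand_k` are hypothetical, and the majority vote over the ranked list is an assumption, since the abstract does not state how the label is derived from the ranking. Discriminative Query Expansion is not shown.

```python
import math
from collections import Counter


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def rank(query, corpus):
    """Return corpus indices sorted by decreasing similarity to the query."""
    return sorted(range(len(corpus)),
                  key=lambda i: cosine(query, corpus[i]),
                  reverse=True)


def average_query_expansion(query, corpus, expand_k):
    """Average the query with its expand_k nearest corpus vectors."""
    top = rank(query, corpus)[:expand_k]
    vectors = [normalize(query)] + [normalize(corpus[i]) for i in top]
    return [sum(v[d] for v in vectors) / len(vectors)
            for d in range(len(query))]


def classify(query, corpus, labels, top_k=3, expand_k=2):
    """Re-rank with the expanded query, then vote over the top_k labels.

    The majority vote is an assumption made for this sketch.
    """
    expanded = average_query_expansion(query, corpus, expand_k)
    top = rank(expanded, corpus)[:top_k]
    votes = Counter(labels[i] for i in top)
    return votes.most_common(1)[0][0]


# Toy 2-d vectors standing in for document embeddings (illustrative only).
corpus = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
          [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]
labels = ["sport", "sport", "sport", "business", "business", "business"]

predicted = classify([0.7, 0.3], corpus, labels)
print(predicted)
```

In this sketch the expanded query is pulled toward the cluster of its nearest neighbours, which tends to make the second ranking less sensitive to noise in a single query vector.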

Keywords

Document embedding, Adversarial domain adaptation, Document classification, Document representation, Doc2Vec, Query expansion

Notes

Acknowledgements

This research is funded by the Department of Science and Technology, Ho Chi Minh City, under grant number 40/2015/HD-SKHCN.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
