
Addressing Unseen Word Problem in Text Classification

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2018)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10859))

Abstract

The word-based Deep Neural Network (DNN) approach to text classification suffers performance issues due to its limited vocabulary. Character-based Convolutional Neural Network (CNN) models were proposed to address this issue, but character-based models do not inherently capture the sequential relationship of words in text. Hence, there is scope for further improvement: addressing the unseen-word problem through a character model while maintaining sequential context through a word-based model. In this work, we propose methods that combine character- and word-based models for efficient text classification. The methods are evaluated on several benchmark datasets and compared against state-of-the-art results.
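The combination the abstract describes — word vectors for in-vocabulary words, character-level features as a fallback for unseen words — can be illustrated with a minimal sketch. Everything below (the toy vocabulary, the n-gram hashing fallback, all names) is illustrative and assumed, not the paper's actual method; the hashing fallback is in the spirit of character n-gram embeddings such as CHARAGRAM.

```python
import zlib

import numpy as np

DIM = 8  # toy embedding dimension
rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real system would load word2vec-style vectors.
word_vecs = {w: rng.standard_normal(DIM) for w in ["the", "movie", "was", "good"]}

def char_ngrams(word, n=3):
    """Character trigrams with boundary markers, e.g. '<movie>' -> '<mo', 'mov', ..."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def char_embed(word, dim=DIM, buckets=1000):
    """Hash each character n-gram into a bucket and average deterministic
    per-bucket vectors, so ANY string, seen or unseen, gets a representation."""
    vec = np.zeros(dim)
    grams = char_ngrams(word)
    for g in grams:
        idx = zlib.crc32(g.encode()) % buckets  # stable hash, reproducible runs
        vec += np.random.default_rng(idx).standard_normal(dim)
    return vec / max(len(grams), 1)

def embed(word):
    """Word-level vector when the word is known; character-level fallback otherwise."""
    return word_vecs[word] if word in word_vecs else char_embed(word)

# "moviee" is an unseen typo: the word model alone would map it to a single
# UNK vector, while the character fallback still yields a word-specific vector.
sentence = ["the", "moviee", "was", "good"]
matrix = np.stack([embed(w) for w in sentence])  # shape (4, DIM); feed to a CNN/RNN
```

Note the design point the abstract hinges on: the sentence matrix keeps word order, so a downstream word-level sequence model retains sequential context, while the per-word fallback handles out-of-vocabulary tokens.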



Author information

Correspondence to Promod Yenigalla.



Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Yenigalla, P., Kar, S., Singh, C., Nagar, A., Mathur, G. (2018). Addressing Unseen Word Problem in Text Classification. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2018. Lecture Notes in Computer Science, vol 10859. Springer, Cham. https://doi.org/10.1007/978-3-319-91947-8_36


  • DOI: https://doi.org/10.1007/978-3-319-91947-8_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91946-1

  • Online ISBN: 978-3-319-91947-8

  • eBook Packages: Computer Science, Computer Science (R0)
