Deep Learning and Its Applications to Natural Language Processing

  • Haiqin Yang
  • Linkai Luo
  • Lap Pong Chueng
  • David Ling
  • Francis Chin
Part of the Cognitive Computation Trends book series (COCT, volume 2)


Natural language processing (NLP), which uses computer programs to process large amounts of language data, is a key research area in artificial intelligence and computer science. Deep learning technologies have been well developed and applied in this area. However, the literature still lacks a succinct survey that would give readers a quick understanding of (1) how deep learning technologies apply to NLP and (2) what the promising applications are. In this survey, we investigate recent developments in NLP, centered on natural language understanding, to answer these two questions. First, we explore newly developed word embedding (word representation) methods. Then, we describe two powerful learning models, Recurrent Neural Networks and Convolutional Neural Networks. Next, we outline five key NLP applications in three groups: (1) part-of-speech tagging and named entity recognition, two fundamental NLP applications; (2) machine translation and automatic English grammatical error correction, two applications with prominent commercial value; and (3) image description, an application requiring technologies from both computer vision and NLP. Finally, we present a series of benchmark datasets that researchers can use to evaluate model performance in these applications.


Keywords

Deep learning; Natural language processing; Word2Vec; Recurrent neural networks; Convolutional neural networks



The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. UGC/IDS14/16).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Haiqin Yang (1)
  • Linkai Luo (1)
  • Lap Pong Chueng (1)
  • David Ling (1)
  • Francis Chin (1)
  1. Department of Computing, Deep Learning Research and Application Centre, Hang Seng Management College, Sha Tin, Hong Kong
