
Text Sequence Modeling and Deep Learning

Abstract

Much of the discussion in the previous chapters has focused on a bag-of-words representation of text. While the bag-of-words representation is sufficient in many practical applications, there are cases in which the sequential aspects of text become more important.


Notes

  1. The original definition of the distance graph [15] differs from the skip-gram definition given here in a very minor way: the original definition always assumes that a word is connected to itself. Such a definition also allows a distance graph of order k = 0, which corresponds to a traditional bag-of-words representation with only self-loops. For example, for the sentence “Adam ate an apple” there would always be a self-loop at each of the four nodes, even though no word is repeated within its own context window. The slightly modified definition used here includes a self-loop only when a word actually occurs in its own context (a sketch of this definition appears after these notes). For many traditional applications, however, this distinction does not seem to affect the results.

  2. Note that \(\overline{u}_{i}\) and \(\overline{v}_{j}\) are added in the updates, which is a slight abuse of notation: although \(\overline{u}_{i}\) is a row vector and \(\overline{v}_{j}\) is a column vector, the updates are intuitively clear (a sketch appears after these notes).

  3. An LSTM was used, which is a variation on the vanilla RNN discussed here.

  4. https://www.nasa.gov/mission_pages/chandra/cosmic-winter-wonderland.html

  5. In principle, one can also allow it to be input at all time-stamps, but doing so only seems to worsen performance.

  6. The original work in [464] seems to use this option [274]. In the Google Neural Machine Translation system [620], this weight is removed. This system is now used in Google Translate.

  7. Here, we treat the forget bits as a vector of binary bits, although the vector actually contains continuous values in (0, 1), which can be viewed as probabilities (a sketch appears after these notes). As discussed earlier, the binary abstraction helps us understand the conceptual nature of the operations.
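
A minimal Python sketch of the modified distance-graph definition from Note 1 follows. It is illustrative only and not taken from the chapter; the function name, the order-k window parameter, and the edge-count representation are assumptions made for the example.

    from collections import Counter

    def distance_graph_edges(tokens, k):
        """Directed edge counts for an order-k distance graph under the modified
        definition: an edge u -> v is added whenever v occurs within k positions
        after u. A self-loop u -> u therefore appears only when the word u
        actually repeats inside its own window, unlike the original definition,
        which adds a self-loop at every node unconditionally."""
        edges = Counter()
        for i, u in enumerate(tokens):
            for j in range(i + 1, min(i + k + 1, len(tokens))):
                edges[(u, tokens[j])] += 1
        return edges

    # "Adam ate an apple" contains no repeated words, so the order-2 graph
    # below has five edges and, under the modified definition, no self-loops.
    print(distance_graph_edges(["Adam", "ate", "an", "apple"], k=2))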
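For Note 2, the following numpy sketch shows one skip-gram negative-sampling style update in which a row vector and a column vector are added. It assumes U stores the row vectors \(\overline{u}_{i}\) and V stores the column vectors \(\overline{v}_{j}\); the function name, shapes, and learning rate are illustrative choices rather than the chapter's exact formulation.

    import numpy as np

    def sgns_update(U, V, i, j, label, lr=0.025):
        """One stochastic gradient step on the logistic loss for the pair
        (word i, context j). `label` is 1 for an observed pair and 0 for a
        negative sample. U[i] is conceptually a row vector and V[:, j] a
        column vector, but both are 1-D arrays of the same length here, so
        adding one to the other is well defined."""
        u_i, v_j = U[i], V[:, j]
        err = label - 1.0 / (1.0 + np.exp(-u_i.dot(v_j)))
        U[i] += lr * err * v_j      # row vector updated with a column vector
        V[:, j] += lr * err * u_i   # column vector updated with a row vector

    # Tiny illustration: 5-word vocabulary, 4-dimensional embeddings.
    rng = np.random.default_rng(0)
    U, V = rng.normal(0, 0.1, (5, 4)), rng.normal(0, 0.1, (4, 5))
    sgns_update(U, V, i=2, j=3, label=1)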
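For Note 7, the numpy sketch below of a standard LSTM forget gate shows why the "forget bits" are soft rather than truly binary: the sigmoid keeps every entry strictly inside (0, 1), so the cell state is scaled elementwise instead of being hard-reset. The weight names and dimensions are generic choices, not the chapter's notation.

    import numpy as np

    def forget_gate(W_f, U_f, b_f, x_t, h_prev, c_prev):
        """Standard LSTM forget gate: f_t = sigmoid(W_f x_t + U_f h_prev + b_f).
        Every entry of f_t lies strictly in (0, 1) and acts like the probability
        of retaining the corresponding component of the cell state."""
        f_t = 1.0 / (1.0 + np.exp(-(W_f @ x_t + U_f @ h_prev + b_f)))
        return f_t, f_t * c_prev   # partially retained cell state

    # Tiny illustration with 3 hidden units and 2 inputs (dimensions are arbitrary).
    rng = np.random.default_rng(0)
    W_f, U_f, b_f = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)
    f_t, c_t = forget_gate(W_f, U_f, b_f, x_t=rng.normal(size=2),
                           h_prev=np.zeros(3), c_prev=np.ones(3))
    print(f_t)   # values strictly between 0 and 1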

Bibliography

  1. C. Aggarwal. Data mining: The textbook. Springer, 2015.

  2. C. Aggarwal and P. Zhao. Towards graphical models for text processing. Knowledge and Information Systems, 36(1), pp. 1–21, 2013. [Preliminary version in ACM SIGIR, 2010]

  3. M. Baroni, G. Dinu, and G. Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, pp. 238–247, 2014.

  4. M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), pp. 673–721, 2010.

  5. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.

  6. C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

  7. J. Bullinaria and J. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3), pp. 510–526, 2007.

  8. R. Bunescu and R. Mooney. A shortest path dependency kernel for relation extraction. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731, 2005.

  9. R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. NIPS Conference, pp. 171–178, 2005.

  10. K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014. https://arxiv.org/pdf/1406.1078.pdf

  11. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/1412.3555

  12. K. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), pp. 22–29, 1990.

  13. M. Collins and N. Duffy. Convolution kernels for natural language. NIPS Conference, pp. 625–632, 2001.

  14. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, pp. 2493–2537, 2011.

  15. R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML Conference, pp. 160–167, 2008.

  16. A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. ACL Conference, 2004.

  17. A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question Answering. ACL, pp. 1608–1618, 2013.

  18. A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. ACM KDD Conference, 2014.

  19. Y. Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research (JAIR), 57, pp. 345–420, 2016.

  20. Y. Goldberg and O. Levy. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. https://arxiv.org/abs/1402.3722

  21. I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.

  22. A. Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012. http://rd.springer.com/book/10.1007%2F978-3-642-24797-2

  23. A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. https://arxiv.org/abs/1308.0850

  24. A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.

  25. M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS, 1(2), pp. 6, 2010.

  26. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), pp. 1735–1785, 1997.

  27. S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press, 2001.

  28. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), pp. 359–366, 1989.

  29. M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, and H. Daume III. A Neural Network for Factoid Question Answering over Paragraphs. EMNLP, 2014.

  30. C. Johnson. Logistic matrix factorization for implicit feedback data. NIPS Conference, 2014.

  31. N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. EMNLP, pp. 1700–1709, 2013.

  32. A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015. https://arxiv.org/abs/1506.02078

  33. A. Karpathy. The unreasonable effectiveness of recurrent neural networks, Blog post, 2015. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

  34. Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

  35. J. Lau and T. Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv:1607.05368, 2016. https://arxiv.org/abs/1607.05368

  36. Q. Le. Personal communication, 2017.

  37. Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.

  38. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. NIPS Conference, pp. 2177–2185, 2014.

  39. O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, pp. 211–225, 2015.

  40. O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representations. CoNLL, 2014.

  41. Z. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015. https://arxiv.org/abs/1506.00019

  42. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, pp. 419–444, 2002.

  43. K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28(2), pp. 203–208, 1996.

  44. J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. ICML Conference, pp. 1033–1040, 2011.

  45. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781

  46. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS Conference, pp. 3111–3119, 2013.

  47. T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. Interspeech, Vol 2, 2010.

  48. T. Mikolov, W. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL, pp. 746–751, 2013.

  49. T. Mikolov, Q. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013. https://arxiv.org/abs/1309.4168

  50. A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. ICML Conference, pp. 641–648, 2007.

  51. A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. NIPS Conference, pp. 2265–2273, 2013.

  52. A. Mnih and Y. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv:1206.6426, 2012. https://arxiv.org/abs/1206.6426

  53. H. Niitsuma and M. Lee. Word2Vec is a special case of kernel correspondence analysis and kernels for natural language processing. arXiv preprint arXiv:1605.05087, 2016. https://arxiv.org/abs/1605.05087

  54. R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. ICML, (3), 28, pp. 1310–1318, 2013. http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf

  55. J. Pennington, R. Socher, and C. Manning. Glove: Global Vectors for Word Representation. EMNLP, pp. 1532–1543, 2014.

  56. L. Polanyi and A. Zaenen. Contextual valence shifters. Computing Attitude and Affect in Text: Theory and Applications, pp. 1–10, Springer, 2006.

  57. L. Qian, G. Zhou, F. Kong, Q. Zhu, and P. Qian. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. International Conference on Computational Linguistics, pp. 697–704, 2008.

  58. R. Rehurek and P. Sojka. Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010. https://radimrehurek.com/gensim/index.html

  59. X. Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014. https://arxiv.org/abs/1411.2738

  60. M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), pp. 2673–2681, 1997.

  61. S. Siencnik. Adapting word2vec to named entity recognition. Nordic Conference of Computational Linguistics, NODALIDA, 2015.

  62. M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. Interspeech, 2012.

  63. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS Conference, pp. 3104–3112, 2014.

  64. P. Turney and P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), pp. 141–188, 2010.

  65. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR Conference, pp. 3156–3164, 2015.

  66. J. Weston, A. Bordes, S. Chopra, A. Rush, B. van Merrienboer, A. Joulin, and T. Mikolov. Towards ai-complete question answering: A set of pre-requisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. https://arxiv.org/abs/1502.05698

  67. J. Weston, S. Chopra, and A. Bordes. Memory networks. ICLR, 2015.

  68. D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083–1106, 2003.

  69. M. Zhang, J. Zhang, and J. Su. Exploring syntactic features for relation extraction using a convolution tree kernel. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 288–295, 2006.

  70. M. Zhang, J. Zhang, J. Su, and G. Zhou. A composite kernel to extract relations between entities with both flat and structured features. International Conference on Computational Linguistics and the Annual Meeting of the Association for Computational Linguistics, pp. 825–832, 2006.

  71. https://code.google.com/archive/p/word2vec/

  72. https://www.tensorflow.org/tutorials/word2vec/

  73. http://clic.cimec.unitn.it/composes/toolkit/

  74. https://github.com/stanfordnlp/GloVe

  75. https://deeplearning4j.org/

  76. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  77. http://deeplearning.net/tutorial/lstm.html

  78. http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

  79. https://deeplearning4j.org/lstm

  80. https://github.com/karpathy/char-rnn

  81. https://arxiv.org/abs/1609.08144

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Aggarwal, C.C. (2018). Text Sequence Modeling and Deep Learning. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_10

  • DOI: https://doi.org/10.1007/978-3-319-73531-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73530-6

  • Online ISBN: 978-3-319-73531-3

  • eBook Packages: Computer Science; Computer Science (R0)
