Text Sequence Modeling and Deep Learning



Much of the discussion in the previous chapters has focused on the bag-of-words representation of text. While a bag-of-words representation is sufficient for many practical applications, there are cases in which the sequential ordering of the words carries information that the bag-of-words representation discards.
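
To make this concrete, the following minimal sketch (in Python; the toy sentences and the bag_of_words helper are illustrative assumptions, not drawn from the chapter) shows two sentences with opposite meanings that receive identical bag-of-words representations:

```python
from collections import Counter

def bag_of_words(sentence):
    """Map a sentence to an unordered multiset of its word counts."""
    return Counter(sentence.lower().split())

# Two sentences with opposite meanings but identical word counts.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)  # True: a bag-of-words model cannot tell these apart
```

A model that reads the words in sequence, by contrast, receives two different inputs and can distinguish the two sentences.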



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
