On Multilingual Training of Neural Dependency Parsers

  • Michał Zapotoczny
  • Paweł Rychlikowski
  • Jan Chorowski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


We show that a recently proposed neural dependency parser can be improved by joint training on multiple languages from the same family. The parser is implemented as a deep neural network whose only input is orthographic representations of words. In order to successfully parse, the network has to discover how linguistically relevant concepts can be inferred from word spellings. We analyze the representations of characters and words that are learned by the network to establish which properties of languages were accounted for. In particular we show that the parser has approximately learned to associate Latin characters with their Cyrillic counterparts and that it can group Polish and Russian words that have a similar grammatical function. Finally, we evaluate the parser on selected languages from the Universal Dependencies dataset and show that it is competitive with other recently proposed state-of-the-art methods, while having a simple structure.
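The cross-script association described above (Latin characters mapping onto their Cyrillic counterparts) can be probed with nearest-neighbour queries in the learned character-embedding space. The sketch below is a toy illustration of that kind of analysis, not the paper's code: the embedding table is synthetic (the paper's embeddings are learned end-to-end by the parser), and the character set and helper names (`cosine`, `nearest`) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table: Latin 'a', 'r', 's' and Cyrillic 'а', 'р', 'с'.
# In the paper these vectors are learned by the parser; here they are synthetic.
chars = ['a', 'r', 's', 'а', 'р', 'с']
emb = {c: rng.normal(size=8) for c in chars}

# Simulate the learned cross-script association by placing each Cyrillic
# vector close to its Latin counterpart.
for lat, cyr in [('a', 'а'), ('r', 'р'), ('s', 'с')]:
    emb[cyr] = emb[lat] + 0.05 * rng.normal(size=8)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(c):
    """Most similar other character under cosine similarity."""
    return max((o for o in chars if o != c), key=lambda o: cosine(emb[c], emb[o]))

# With embeddings like these, Latin 'r' retrieves Cyrillic 'р' and vice versa.
print(nearest('r'), nearest('а'))
```

Run on a trained parser's actual embedding table, queries of this form are what reveal whether the network has discovered script correspondences on its own.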


Keywords: Dependency parsing · Recurrent neural networks · Multitask training



The experiments used Theano [6], Blocks and Fuel [22] libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402, National Center for Research and Development (Poland) grant Audioscope (Applied Research Program, 3rd contest, submission no. 245755).


  1. Alberti, C., et al.: SyntaxNet models for the CoNLL 2017 shared task. arXiv:1703.04929, March 2017
  2. Ammar, W., et al.: Many languages, one parser. Trans. Assoc. Comput. Linguist. 4, 431–444 (2016)
  3. Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., Collins, M.: Globally normalized transition-based neural networks. arXiv:1603.06042 [cs], March 2016
  4. Ballesteros, M., Dyer, C., Smith, N.A.: Improved transition-based parsing by modeling characters instead of words with LSTMs. arXiv:1508.00657 (2015)
  5. Bender, E.M.: On achieving and evaluating language-independence in NLP. Linguist. Issues Lang. Technol. 6(3), 1–26 (2011)
  6. Bergstra, J., et al.: Theano: a CPU and GPU math expression compiler. In: Proceedings of SciPy (2010)
  7. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
  8. Chen, D., Manning, C.D.: A fast and accurate dependency parser using neural networks. In: EMNLP, pp. 740–750 (2014)
  9. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014)
  10. Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end continuous speech recognition using attention-based recurrent NN: first results. arXiv:1412.1602 [cs, stat], December 2014
  11. Chorowski, J., Zapotoczny, M., Rychlikowski, P.: Read, tag, and parse all at once, or fully-neural dependency parsing. CoRR abs/1609.03441 (2016)
  12. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734 (2016)
  13. Duong, L., Cohn, T., Bird, S., Cook, P.: A neural network model for low-resource universal dependency parsing. In: EMNLP, pp. 339–348 (2015)
  14. Dyer, C., Ballesteros, M., Ling, W., Matthews, A., Smith, N.A.: Transition-based dependency parsing with stack long short-term memory. arXiv:1505.08075 (2015)
  15. Edmonds, J.: Optimum branchings. J. Res. Natl. Bur. Stand. B 71B(4), 233–240 (1966)
  16. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: ICML, pp. 1319–1327 (2013)
  17. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: ACL, vol. 1, pp. 1234–1244 (2015)
  18. Hinton, G.E., McClelland, J.L., Rumelhart, D.E.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, vol. 1. MIT Press/Bradford Books, Cambridge (1986)
  19. Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling. arXiv:1602.02410 [cs], February 2016
  20. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. arXiv:1508.06615 (2015)
  21. Kiperwasser, E., Goldberg, Y.: Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv:1603.04351 [cs], March 2016
  22. van Merriënboer, B., et al.: Blocks and fuel: frameworks for deep learning. arXiv:1506.00619 [cs, stat], June 2015
  23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
  24. Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, Makuhari, Chiba, Japan, September 2010
  25. Nivre, J.: Algorithms for deterministic incremental dependency parsing. Comput. Linguist. 34(4), 513–553 (2008)
  26. Nivre, J., et al.: MaltParser: a language-independent system for data-driven dependency parsing. Nat. Lang. Eng. (2005)
  27. Nivre, J., et al.: Universal dependencies 1.2
  28. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
  29. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1929–1958 (2014)
  30. Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. arXiv:1505.00387 [cs], May 2015
  31. Titov, I., Henderson, J.: A latent variable model for generative dependency parsing. In: Proceedings of IWPT (2007)
  32. Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a foreign language. arXiv:1412.7449 [cs, stat], December 2014
  33. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144, September 2016
  34. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv:1212.5701 (2012)
  35. Zhang, X., Cheng, J., Lapata, M.: Dependency parsing as head selection. CoRR abs/1606.01280 (2016)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Michał Zapotoczny (1)
  • Paweł Rychlikowski (1)
  • Jan Chorowski (1), corresponding author

  1. Institute of Computer Science, University of Wrocław, Wrocław, Poland
