Bilingual Data Selection Using a Continuous Vector-Space Representation

  • Mara Chinea-Rios
  • Germán Sanchis-Trilles
  • Francisco Casacuberta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10029)


Data selection aims to choose, from an available pool of sentences, the best subset with which to train a pattern recognition system. In this article, we present a bilingual data selection method that leverages a continuous vector-space representation of word sequences to select the best subset of a bilingual corpus for training a machine translation system. We compared our proposal with a state-of-the-art data selection technique (cross-entropy selection) and obtained promising results that were consistent across different language pairs.
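This page does not include the paper's method details, so the following is only a generic sketch of vector-space data selection, under the assumption that sentence embeddings (e.g. doc2vec-style vectors) have already been computed: pool sentences are ranked by cosine similarity to the centroid of an in-domain sample, and the top-k are kept. All function and variable names here are hypothetical.

```python
import numpy as np

def select_sentences(pool_vecs, in_domain_vecs, k):
    """Return indices of the k pool sentences closest (cosine) to the
    centroid of the in-domain embeddings."""
    centroid = in_domain_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)          # unit-length centroid
    norms = np.linalg.norm(pool_vecs, axis=1)     # per-sentence norms
    sims = (pool_vecs @ centroid) / norms         # cosine similarities
    return np.argsort(-sims)[:k]                  # best-k indices

# Toy data standing in for real sentence embeddings.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 50))               # candidate pool
in_domain = rng.normal(size=(100, 50)) + 1.0     # shifted "in-domain" cloud
selected = select_sentences(pool, in_domain, 200)
```

For a bilingual corpus, the same ranking could be applied to source-side and target-side embeddings and the scores combined, but the precise combination used in the paper is not described on this page.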


Keywords: Vector space representation · Data selection · Bilingual corpora



The research leading to these results has received funding from the Generalitat Valenciana under grant PROMETEOII/2014/030 and the FPI (2014) grant by Universitat Politècnica de València.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Mara Chinea-Rios (1)
  • Germán Sanchis-Trilles (2)
  • Francisco Casacuberta (1)
  1. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Valencia, Spain
  2. Sciling, Universitat Politècnica de València, Valencia, Spain
