Augmenting SMT with Semantically-Generated Virtual-Parallel Corpora from Monolingual Texts

  • Krzysztof Wołk
  • Agnieszka Wołk
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 745)


Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. Manually creating parallel corpora is expensive and time-consuming, and the language data required for statistical machine translation (SMT) do not exist in quantities sufficient to yield usable statistics, even to begin research. Known approaches for building parallel resources from multiple sources, such as comparable or quasi-comparable corpora, are complicated and produce rather noisy output that requires further processing and in-domain adaptation. Moreover, optimizing the performance of comparable-corpora mining algorithms requires a quality parallel corpus on which to train a good data classifier. In this research, we developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We translated large monolingual resources with multiple translation systems and strictly measured the compatibility of the translations using rules based on the Levenshtein distance. The results produced by this approach were very favorable: the generated corpora improved the quality of SMT systems and appear useful for many other natural language processing tasks.
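
The agreement check described above can be illustrated with a minimal sketch. The abstract does not state the exact rules or thresholds, so everything specific here is an assumption made for illustration: the character-level distance, the normalization by the longer string, the 0.8 threshold, the requirement that every pair of outputs agree, the choice to keep the first system's output as the target side, and all function names.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings.

    Dividing by the longer length is an assumed normalization.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def translations_agree(candidates, threshold=0.8):
    """Assumed rule: every pair of system outputs must be similar enough."""
    return all(similarity(x, y) >= threshold
               for x, y in combinations(candidates, 2))


def build_virtual_corpus(sources, systems, threshold=0.8):
    """Keep (source, translation) pairs on which all systems agree.

    `systems` is a list of hypothetical callables, one per MT system,
    each mapping a source sentence to its translation.
    """
    corpus = []
    for sentence in sources:
        candidates = [translate(sentence) for translate in systems]
        if translations_agree(candidates, threshold):
            # Assumption: the first system's output becomes the target side.
            corpus.append((sentence, candidates[0]))
    return corpus
```

In practice each entry of `systems` would wrap a real translation engine. A stricter threshold discards more sentences but should leave a cleaner corpus, which is the trade-off this kind of filter controls.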


Keywords: Data filtration · Corpora building · Machine learning · Data mining · Parallel corpora · Machine translation



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Polish-Japanese Academy of Information Technology, Warsaw, Poland
