Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level

  • Krzysztof Wołk
  • Emilia Zawadzka
  • Agnieszka Wołk
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 745)

Abstract

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. The approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Data loss was minimized through parameter adaptation. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show a significant improvement in filtering, measured in terms of MT quality.

Keywords

  • Parallel corpora filtration
  • Noisy-parallel corpora
  • Comparable corpora
  • In-domain data adaptation
  • Machine translation

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Krzysztof Wołk (1)
  • Emilia Zawadzka (1)
  • Agnieszka Wołk (1)
  1. Polish-Japanese Academy of Information Technology, Warsaw, Poland
