
Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level

  • Conference paper
Trends and Advances in Information Systems and Technologies (WorldCIST'18 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 745)


Abstract

Text alignment and text quality are critical to the accuracy of machine translation (MT) systems and to other text-processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. The approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Data loss was minimized through parameter adaptation. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show a significant improvement in filtering, as measured by MT quality.



Author information

Corresponding author

Correspondence to Krzysztof Wołk.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Wołk, K., Zawadzka, E., Wołk, A. (2018). Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-77703-0_79


  • DOI: https://doi.org/10.1007/978-3-319-77703-0_79

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77702-3

  • Online ISBN: 978-3-319-77703-0

  • eBook Packages: Engineering, Engineering (R0)
