Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level

Wołk, Krzysztof; Zawadzka, Emilia; Wołk, Agnieszka

doi:10.1007/978-3-319-77703-0_79

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 745)

Included in the following conference series:

World Conference on Information Systems and Technologies

Abstract

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and of other text processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. The approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Data loss was minimized through parameter adaptation. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show a significant improvement in filtering, measured in terms of MT quality.

Author information

Authors and Affiliations

Polish-Japanese Academy of Information Technology, Koszykowa 86, Warsaw, Poland
Krzysztof Wołk, Emilia Zawadzka & Agnieszka Wołk

Corresponding author

Correspondence to Krzysztof Wołk.

Editor information

Editors and Affiliations

Departamento de Engenharia Informática, Universidade de Coimbra, Coimbra, Portugal
Álvaro Rocha
College of Engineering, The Ohio State University, Columbus, OH, USA
Hojjat Adeli
DSI/EEUM, Universidade do Minho, Guimarães, Portugal
Luís Paulo Reis
DIMES, Università della Calabria, Arcavacata di Rende, Italy
Sandra Costanzo

About this paper

Cite this paper

Wołk, K., Zawadzka, E., Wołk, A. (2018). Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-77703-0_79

DOI: https://doi.org/10.1007/978-3-319-77703-0_79
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77702-3
Online ISBN: 978-3-319-77703-0
eBook Packages: Engineering, Engineering (R0)
