Abstract
Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and of other text processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach, developed and evaluated on Polish-to-English translation. The approach was developed on a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation. Parameter adaptation was used to minimize data loss. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and pruning of the comparison model. The results show a significant improvement in filtering, as measured by MT quality.
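The abstract does not specify which statistical measures are used, so the following is only a minimal sketch of the general idea of bi-sentence filtering, not the authors' method. It assumes two illustrative measures (a token length ratio and coverage by a small bilingual lexicon); the function names, the thresholds `min_ratio` and `min_coverage`, and the toy lexicon are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer token count, in [0, 1]."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def lexicon_coverage(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with at least one lexicon translation in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    lexicon: Dict[str, Set[str]],
    min_ratio: float = 0.5,      # hypothetical threshold, to be tuned
    min_coverage: float = 0.5,   # hypothetical threshold, to be tuned
) -> List[Tuple[str, str]]:
    """Keep only sentence pairs that pass both illustrative measures."""
    return [
        (s, t)
        for s, t in pairs
        if length_ratio(s, t) >= min_ratio
        and lexicon_coverage(s, t, lexicon) >= min_coverage
    ]


if __name__ == "__main__":
    # Toy Polish-English lexicon and sentence pairs, invented for illustration.
    toy_lexicon = {"kot": {"cat"}, "śpi": {"sleeps", "sleeping"}, "pies": {"dog"}}
    toy_pairs = [
        ("kot śpi", "the cat sleeps"),
        ("pies śpi", "stock prices fell sharply today"),
    ]
    print(filter_pairs(toy_pairs, toy_lexicon))
```

In a sketch like this, lowering the thresholds keeps more data at the cost of more noise, which is the kind of trade-off that parameter adaptation would have to balance.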
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Wołk, K., Zawadzka, E., Wołk, A. (2018). Statistical Approach to Noisy-Parallel and Comparable Corpora Filtering for the Extraction of Bi-lingual Equivalent Data at Sentence-Level. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-77703-0_79
DOI: https://doi.org/10.1007/978-3-319-77703-0_79
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77702-3
Online ISBN: 978-3-319-77703-0
eBook Packages: Engineering, Engineering (R0)