Abstract
We describe efforts to build better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, since subtitles in one language are often translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; here, however, we create a much larger bi-text (the biggest to date), and we further generate a better-quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but movie subtitles also carry time information. We exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.
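To make the idea of time-based alignment concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It pairs each source segment with the target segment whose time span overlaps it most, scanning only a bounded number of candidates. The names `Segment`, `overlap`, `align_by_time`, `max_lookahead`, and `min_ratio` are hypothetical, as is the greedy one-to-one strategy and the 0.5 overlap threshold; the target texts in the usage note are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time interval shared by two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src, tgt, max_lookahead=5, min_ratio=0.5):
    """Greedy one-to-one pairing of subtitle segments by time overlap.

    Assumption: both lists are sorted by start time. For each source
    segment, at most `max_lookahead` target segments are examined, and
    the best candidate is accepted only if the overlap covers at least
    `min_ratio` of the shorter of the two segments.
    """
    pairs = []
    j = 0  # index of the first target segment still available
    for s in src:
        # Skip target segments that ended before this source segment starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best_k, best_ov = None, 0.0
        for k in range(j, min(j + max_lookahead, len(tgt))):
            ov = overlap(s, tgt[k])
            if ov > best_ov:
                best_k, best_ov = k, ov
        if best_k is None:
            continue  # no overlapping candidate within the window
        t = tgt[best_k]
        shorter = min(s.end - s.start, t.end - t.start)
        if shorter > 0 and best_ov / shorter >= min_ratio:
            pairs.append((s.text, t.text))
            j = best_k + 1
    return pairs
```

For example, with source segments at 0.0–2.0 s, 2.5–5.0 s, and 9.0–11.0 s and target segments at 0.1–2.1 s, 2.6–4.9 s, 6.0–7.0 s, and 9.2–11.1 s, the sketch pairs the first, second, and fourth target segments with the three source segments and leaves the unmatched third target segment out. A practical aligner would also need to handle one-to-many matches and timing drift between subtitle files, which this sketch does not attempt.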
Notes
- 1.
- 2. The data can be found at http://workshop2013.iwslt.org.
- 3.
- 4. This is the official scoring method for the translation tracks into Arabic at IWSLT’13: http://alt.qcri.org/tools/arabic-normalizer.
- 5. We set M to be at most 5 in order to prevent the algorithm from unreasonably iterating up to the last segment looking for a match.
- 6.
References
Abdelali, A., Guzmán, F., Sajjad, H., Vogel, S.: The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland (2014)
Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Federico, M.: Report on the 10th IWSLT evaluation campaign. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany, pp. 15–24 (2013)
Foster, G., Kuhn, R.: Stabilizing minimum error rate training. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT 2009, Athens, Greece, pp. 242–249 (2009)
Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. Comput. Linguist. 19(1), 75–102 (1993)
Guzmán, F., Nakov, P., Vogel, S.: Analyzing optimization for statistical machine translation: MERT learns verbosity, PRO learns length. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL 2015, Beijing, China, pp. 62–72 (2015)
Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation, WMT 2011, Edinburgh, Scotland, UK, pp. 187–197 (2011)
Hopkins, M., May, J.: Tuning as ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, pp. 1352–1362 (2011)
Jansen, D., Alcala, A., Guzmán, F.: AMARA: a sustainable, global solution for accessibility, powered by communities of volunteers. In: Stephanidis, C., Antona, M. (eds.) UAHCI 2014. LNCS, vol. 8516, pp. 401–411. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07509-9_38
El Kholy, A., Habash, N.: Orthographic and morphological processing for English-Arabic statistical machine translation. Mach. Transl. 26(1–2), 25–45 (2012)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Tenth Machine Translation Summit, MT Summit 2005, Phuket, Thailand, pp. 79–86 (2005)
Koehn, P., Axelrod, A., Mayne, A.B., Callison-Burch, C., Osborne, M., Talbot, D.: Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2005, Pittsburgh, PA, USA (2005)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions, ACL 2007, Prague, Czech Republic, pp. 177–180 (2007)
Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL 2003, Edmonton, Canada, vol. 1, pp. 48–54 (2003)
Lavecchia, C., Smaïli, K., Langlois, D.: Building parallel corpora from movies. In: Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2007, Funchal, Madeira, Portugal (2007)
Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Report in Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance (2005)
Monroe, W., Green, S., Manning, C.D.: Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2014, Baltimore, MD, USA, pp. 206–211 (2014)
Nakov, P.: Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In: Proceedings of the Third Workshop on Statistical Machine Translation, WMT 2008, Columbus, Ohio, USA, pp. 147–150 (2008)
Nakov, P., Al Obaidli, F., Guzmán, F., Vogel, S.: Parameter optimization for statistical machine translation: it pays to learn from hard examples. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2013, Hissar, Bulgaria, pp. 504–510 (2013)
Nakov, P., Guzmán, F., Vogel, S.: Optimizing for sentence-level BLEU+1 yields short translations. In: Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, pp. 1979–1994 (2012)
Nakov, P., Guzmán, F., Vogel, S.: A tale about PRO and monsters. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2013, Sofia, Bulgaria, pp. 12–17 (2013)
Nakov, P., Ng, H.T.: Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, vol. 3, pp. 1358–1367 (2009)
Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, OH, USA, pp. 117–120 (2008)
Sajjad, H., Guzmán, F., Nakov, P., Abdelali, A., Murray, K., Al Obaidli, F., Vogel, S.: QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation. In: Proceedings of the 10th International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany (2013)
Tiedemann, J.: Building a multilingual parallel subtitle corpus. In: Proceedings of the Computational Linguistics in the Netherlands, CLIN 2007, Nijmegen, Netherlands (2007)
Tiedemann, J.: Improved sentence alignment for movie subtitles. In: Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, pp. 582–588 (2007)
Tiedemann, J.: Synchronizing translated movie subtitles. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008 (2008)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 2214–2218 (2012)
Volk, M.: The automatic translation of film subtitles. A machine translation success story? J. Lang. Technol. Comput. Linguist. (JLCL) 24(3), 115–128 (2009)
Volk, M., Harder, S.: Evaluating MT with translations or translators: what is the difference? In: Proceedings of the Machine Translation Summit XI, MT-Summit 2007, Copenhagen, Denmark (2007)
Wang, P., Nakov, P., Ng, H.T.: Source language adaptation for resource-poor machine translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, pp. 286–296 (2012)
Wang, P., Nakov, P., Ng, H.T.: Source language adaptation approaches for resource-poor machine translation. Comput. Linguist. 42, 277–306 (2016)
Xiao, H., Wang, X.: Constructing parallel corpus from movie subtitles. In: Li, W., Mollá-Aliod, D. (eds.) ICCPOL 2009. LNCS (LNAI), vol. 5459, pp. 329–336. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00831-3_32
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Al-Obaidli, F., Cox, S., Nakov, P. (2018). Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_11
DOI: https://doi.org/10.1007/978-3-319-75487-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)