Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Abstract

We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.
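
To make the idea of exploiting time information concrete, the following is a minimal sketch of time-overlap alignment between two subtitle tracks. It is not the algorithm proposed in the paper, whose details are not given in this abstract. The `Segment` type, the overlap score, the `min_ratio` threshold, and the `max_lookahead` window are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One subtitle fragment with its display interval in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def overlap_ratio(a: Segment, b: Segment) -> float:
    """Temporal intersection divided by the total span covered by both segments."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    span = max(a.end, b.end) - min(a.start, b.start)
    return inter / span


def align_by_time(
    src: List[Segment],
    tgt: List[Segment],
    min_ratio: float = 0.5,   # assumed threshold, not taken from the paper
    max_lookahead: int = 5,   # assumed bound on how far ahead to search
) -> List[Tuple[str, str]]:
    """Greedy one-to-one alignment: pair each source segment with the
    best-overlapping target segment within a bounded forward window."""
    pairs: List[Tuple[str, str]] = []
    j = 0  # index of the first target segment not yet consumed
    for s in src:
        best_k, best_ratio = None, 0.0
        for k in range(j, min(j + max_lookahead, len(tgt))):
            r = overlap_ratio(s, tgt[k])
            if r > best_ratio:
                best_k, best_ratio = k, r
        if best_k is not None and best_ratio >= min_ratio:
            pairs.append((s.text, tgt[best_k].text))
            j = best_k + 1  # keep the alignment monotonic
    return pairs
```

A length-based aligner would instead score candidate pairs by their character or word counts. The sketch uses only timestamps, so it skips target fragments that have no temporal counterpart on the source side, provided they fall within the lookahead window.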

Notes

  1. https://www.amara.org/en/.

  2. The data can be found at http://workshop2013.iwslt.org.

  3. For a broader discussion see also [6, 21].

  4. This is the official scoring method for the translation tracks into Arabic at IWSLT’13: http://alt.qcri.org/tools/arabic-normalizer.

  5. We set M to be at most 5 in order to prevent the algorithm from unreasonably iterating up to the last segment looking for a match.

  6. https://wit3.fbk.eu.

References

  1. Abdelali, A., Guzmán, F., Sajjad, H., Vogel, S.: The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland (2014)

  2. Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)

  3. Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Federico, M.: Report on the 10th IWSLT evaluation campaign. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany, pp. 15–24 (2013)

  4. Foster, G., Kuhn, R.: Stabilizing minimum error rate training. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT 2009, Athens, Greece, pp. 242–249 (2009)

  5. Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. Comput. Linguist. 19(1), 75–102 (1993)

  6. Guzmán, F., Nakov, P., Vogel, S.: Analyzing optimization for statistical machine translation: MERT learns verbosity, PRO learns length. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL 2015, Beijing, China, pp. 62–72 (2015)

  7. Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation, WMT 2011, Edinburgh, Scotland, UK, pp. 187–197 (2011)

  8. Hopkins, M., May, J.: Tuning as ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, pp. 1352–1362 (2011)

  9. Jansen, D., Alcala, A., Guzmán, F.: AMARA: a sustainable, global solution for accessibility, powered by communities of volunteers. In: Stephanidis, C., Antona, M. (eds.) UAHCI 2014. LNCS, vol. 8516, pp. 401–411. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07509-9_38

  10. El Kholy, A., Habash, N.: Orthographic and morphological processing for English-Arabic statistical machine translation. Mach. Transl. 26(1–2), 25–45 (2012)

  11. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Tenth Machine Translation Summit, MT Summit 2005, Phuket, Thailand, pp. 79–86 (2005)

  12. Koehn, P., Axelrod, A., Mayne, A.B., Callison-Burch, C., Osborne, M., Talbot, D.: Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2005, Pittsburgh, PA, USA (2005)

  13. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions, ACL 2007, Prague, Czech Republic, pp. 177–180 (2007)

  14. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL 2003, Edmonton, Canada, vol. 1, pp. 48–54 (2003)

  15. Lavecchia, C., Smaïli, K., Langlois, D.: Building parallel corpora from movies. In: Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2007, Funchal, Madeira, Portugal (2007)

  16. Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Report in Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance (2005)

  17. Monroe, W., Green, S., Manning, C.D.: Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2014, Baltimore, MD, USA, pp. 206–211 (2014)

  18. Nakov, P.: Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In: Proceedings of the Third Workshop on Statistical Machine Translation, WMT 2008, Columbus, Ohio, USA, pp. 147–150 (2008)

  19. Nakov, P., Al Obaidli, F., Guzmán, F., Vogel, S.: Parameter optimization for statistical machine translation: it pays to learn from hard examples. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2013, Hissar, Bulgaria, pp. 504–510 (2013)

  20. Nakov, P., Guzmán, F., Vogel, S.: Optimizing for sentence-level BLEU+1 yields short translations. In: Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, pp. 1979–1994 (2012)

  21. Nakov, P., Guzmán, F., Vogel, S.: A tale about PRO and monsters. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2013, Sofia, Bulgaria, pp. 12–17 (2013)

  22. Nakov, P., Ng, H.T.: Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, vol. 3, pp. 1358–1367 (2009)

  23. Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, OH, USA, pp. 117–120 (2008)

  24. Sajjad, H., Guzmán, F., Nakov, P., Abdelali, A., Murray, K., Al Obaidli, F., Vogel, S.: QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation. In: Proceedings of the 10th International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany (2013)

  25. Tiedemann, J.: Building a multilingual parallel subtitle corpus. In: Proceedings of the Computational Linguistics in the Netherlands, CLIN 2007, Nijmegen, Netherlands (2007)

  26. Tiedemann, J.: Improved sentence alignment for movie subtitles. In: Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, pp. 582–588 (2007)

  27. Tiedemann, J.: Synchronizing translated movie subtitles. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008 (2008)

  28. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 2214–2218 (2012)

  29. Volk, M.: The automatic translation of film subtitles. A machine translation success story? J. Lang. Technol. Comput. Linguist. (JLCL) 24(3), 115–128 (2009)

  30. Volk, M., Harder, S.: Evaluating MT with translations or translators: what is the difference? In: Proceedings of the Machine Translation Summit XI, MT-Summit 2007, Copenhagen, Denmark (2007)

  31. Wang, P., Nakov, P., Ng, H.T.: Source language adaptation for resource-poor machine translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, pp. 286–296 (2012)

  32. Wang, P., Nakov, P., Ng, H.T.: Source language adaptation approaches for resource-poor machine translation. Comput. Linguist. 42, 277–306 (2016)

  33. Xiao, H., Wang, X.: Constructing parallel corpus from movie subtitles. In: Li, W., Mollá-Aliod, D. (eds.) ICCPOL 2009. LNCS (LNAI), vol. 5459, pp. 329–336. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00831-3_32

Author information

Correspondence to Preslav Nakov.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Al-Obaidli, F., Cox, S., Nakov, P. (2018). Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_11

  • DOI: https://doi.org/10.1007/978-3-319-75487-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75486-4

  • Online ISBN: 978-3-319-75487-1

  • eBook Packages: Computer Science, Computer Science (R0)
