
Parallel Texts Extraction from Multimodal Comparable Corpora

  • Conference paper
Advances in Natural Language Processing (JapTAL 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Abstract

Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of text on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval over a large target-language text corpus to extract parallel sentences. A series of experiments on data from the IWSLT’11 speech translation task shows the feasibility of our approach.
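
The pipeline described in the abstract (transcribe, translate, retrieve, extract) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the speech recognition step is assumed to have already produced text transcripts, `toy_translate` is a hypothetical word-by-word stand-in for a baseline SMT system, bag-of-words cosine similarity stands in for a real information retrieval engine, and the `threshold` value and the English/French toy data are illustrative assumptions only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_parallel(transcripts, translate, target_corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    transcripts   -- source-language strings (assumed ASR output)
    translate     -- callable standing in for the baseline SMT system
    target_corpus -- list of target-language sentences to search
    threshold     -- hypothetical minimum similarity to keep a pair
    """
    if not target_corpus:
        return []
    target_vectors = [Counter(tokenize(s)) for s in target_corpus]
    pairs = []
    for source in transcripts:
        # The automatic translation is used only as a retrieval query.
        query = Counter(tokenize(translate(source)))
        scores = [cosine(query, vec) for vec in target_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((source, target_corpus[best], scores[best]))
    return pairs


# Hypothetical toy dictionary standing in for a real SMT system.
TOY_DICTIONARY = {
    "the": "le", "cat": "chat", "sleeps": "dort",
    "i": "je", "like": "aime", "music": "musique",
}


def toy_translate(sentence):
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(TOY_DICTIONARY.get(w, w) for w in tokenize(sentence))


transcripts = ["the cat sleeps", "i like music"]
target_corpus = [
    "le chat dort sur le tapis",
    "il pleut aujourd'hui",
    "nous aimons la musique classique",
]
pairs = extract_parallel(transcripts, toy_translate, target_corpus)
for source, target, score in pairs:
    print(f"{score:.3f}\t{source}\t{target}")
```

In this sketch the source side of each extracted pair is the transcript itself, while its translation serves only to find a candidate on the target side. The similarity threshold discards transcripts whose best candidate is only loosely related.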

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Afli, H., Barrault, L., Schwenk, H. (2012). Parallel Texts Extraction from Multimodal Comparable Corpora. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_5

  • DOI: https://doi.org/10.1007/978-3-642-33983-7_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33982-0

  • Online ISBN: 978-3-642-33983-7

  • eBook Packages: Computer Science, Computer Science (R0)
