Parallel Texts Extraction from Multimodal Comparable Corpora

Afli, Haithem; Barrault, Loïc; Schwenk, Holger

doi:10.1007/978-3-642-33983-7_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Included in the following conference series:

International Conference on NLP

Abstract

Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences. We have performed a series of experiments on data from the IWSLT’11 speech translation task that show the feasibility of our approach.
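
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the audio has already been transcribed and translated, so the inputs are plain strings. The abstract does not specify the retrieval model or the filtering criterion, so TF-IDF cosine similarity and the `threshold` parameter below are stand-in assumptions, and all function names are hypothetical rather than taken from the chapter.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real system would normalize more).
    return text.lower().split()


def build_idf(docs):
    # Smoothed inverse document frequency computed over the target-language corpus.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse TF-IDF vector; terms unseen in the target corpus are dropped.
    counts = Counter(tokenize(text))
    return {term: c * idf[term] for term, c in counts.items() if term in idf}


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_transcripts, translations, target_corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    The machine translation of each transcript is used as the query.
    `threshold` is a hypothetical cut-off, not a value from the chapter.
    """
    if not target_corpus:
        return []
    idf = build_idf(target_corpus)
    target_vecs = [tfidf_vector(sentence, idf) for sentence in target_corpus]
    pairs = []
    for src, hyp in zip(source_transcripts, translations):
        query = tfidf_vector(hyp, idf)
        scores = [cosine(query, vec) for vec in target_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((src, target_corpus[best], scores[best]))
    return pairs
```

Each returned tuple holds the source-side transcript, the retrieved target-language sentence, and the similarity score. Pairing the transcript (not its machine translation) with the retrieved sentence is what would make the output usable as additional parallel training data.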

Author information

Authors and Affiliations

Université du Maine, Avenue Olivier Messiaen, F-72085, Le Mans, France
Haithem Afli, Loïc Barrault & Holger Schwenk

Editor information

Editors and Affiliations

Information and Media Center, Toyohashi University of Technology, 1-1 Hibarigaoka, Tenpakucho, 441-8580, Toyohashi, Japan
Hitoshi Isahara & Kyoko Kanzaki

About this paper

Cite this paper

Afli, H., Barrault, L., Schwenk, H. (2012). Parallel Texts Extraction from Multimodal Comparable Corpora. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_5

DOI: https://doi.org/10.1007/978-3-642-33983-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)
