Abstract
Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus in the target language to extract parallel sentences. We have performed a series of experiments on data from the IWSLT’11 speech translation task that show the feasibility of our approach.
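The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the audio has already been transcribed and translated upstream, and it uses TF-IDF cosine similarity with a fixed threshold as a stand-in for the information retrieval and filtering components. The abstract does not specify the retrieval model or the filtering criterion, so the function names, the whitespace tokenizer, and the threshold value are all hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every term in the target corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the target corpus are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_transcripts, translations, target_corpus, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    source_transcripts: ASR output in the source language (assumed given).
    translations: baseline SMT output for each transcript (assumed given).
    target_corpus: candidate sentences in the target language.
    threshold: hypothetical cut-off; a real system would tune this value.
    """
    corpus_tokens = [tokenize(s) for s in target_corpus]
    idf = build_idf(corpus_tokens)
    corpus_vecs = [tfidf_vector(t, idf) for t in corpus_tokens]

    pairs = []
    for src, hyp in zip(source_transcripts, translations):
        # The automatic translation serves as the query against the target corpus.
        query = tfidf_vector(tokenize(hyp), idf)
        best_score, best_idx = max(
            ((cosine(query, vec), i) for i, vec in enumerate(corpus_vecs)),
            default=(0.0, -1),
        )
        # Keep the pair only if the best candidate is similar enough.
        if best_idx >= 0 and best_score >= threshold:
            pairs.append((src, target_corpus[best_idx], best_score))
    return pairs
```

Each returned tuple holds the source transcript, the retrieved target sentence, and the similarity score. Because recognition and translation errors both degrade the query, the threshold trades the number of extracted pairs against their quality.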
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Afli, H., Barrault, L., Schwenk, H. (2012). Parallel Texts Extraction from Multimodal Comparable Corpora. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_5
DOI: https://doi.org/10.1007/978-3-642-33983-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)