Extracting Parallel Phrases from Comparable Data

Hewavitharana, Sanjika; Vogel, Stephan

doi:10.1007/978-3-642-20128-8_10

Abstract

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these parallel phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment; a phrase extraction approach that does not rely on the Viterbi path, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the effectiveness of these approaches in detecting alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.
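As a rough illustration of the first approach, the sketch below shows the consistency heuristic commonly used in standard phrase extraction: a source span and a target span form a phrase pair only if no word alignment link connects a word inside one span to a word outside the other. This is a minimal sketch under stated assumptions, not the chapter's implementation. The function name, the `max_len` parameter and the toy French–English sentence are hypothetical, the links are assumed to come from a Viterbi word alignment, and the usual extension over unaligned boundary words is omitted for brevity.

```python
from typing import List, Set, Tuple

# A word alignment is assumed to be a set of (source index, target index) links,
# e.g. the Viterbi alignment produced by a word alignment model.
Alignment = Set[Tuple[int, int]]


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    links: Alignment,
    max_len: int = 4,  # hypothetical cap on phrase length, in words
) -> List[Tuple[str, str]]:
    """Return phrase pairs that are consistent with the word alignment."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tgt_positions = [j for (i, j) in links if i1 <= i <= i2]
            if not tgt_positions:
                continue  # a fully unaligned source span yields no pair
            j1, j2 = min(tgt_positions), max(tgt_positions)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: every link into the target span must start
            # inside the source span.
            consistent = all(
                i1 <= i <= i2 for (i, j) in links if j1 <= j <= j2
            )
            if consistent:
                pairs.append(
                    (" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
                )
    return pairs


# Toy example (not from the chapter): the last two words are reordered.
src = ["la", "maison", "bleue"]
tgt = ["the", "blue", "house"]
links = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrase_pairs(src, tgt, links)
```

In this toy case "maison bleue" pairs with "blue house", while "la maison" is rejected because its target span would also cover "blue", which is linked to a source word outside the span. In comparable rather than parallel sentences, a single alignment path is forced to cover non-parallel material as well, which is one plausible reason for the abstract's interest in an approach that does not depend on the Viterbi path.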

Notes

1. Arabic Gigaword Fourth Edition (LDC2009T30).
2. English Gigaword Fourth Edition (LDC2009T13).

Author information

Authors and Affiliations

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Sanjika Hewavitharana & Stephan Vogel

Corresponding author

Correspondence to Sanjika Hewavitharana.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Hewavitharana, S., Vogel, S. (2013). Extracting Parallel Phrases from Comparable Data. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_10

DOI: https://doi.org/10.1007/978-3-642-20128-8_10
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)