Extracting Parallel Phrases from Comparable Data

Building and Using Comparable Corpora

Abstract

Mining parallel data from comparable corpora is a promising approach to overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these parallel phrases creates a valuable resource, especially for low-resource languages. In this chapter we explore three phrase alignment approaches for detecting parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path of word alignments; a phrase extraction approach that does not rely on the Viterbi path but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate how effectively each approach detects alignments for phrase pairs that have a known alignment in comparable sentence pairs. The results show that the non-Viterbi alignment approach outperforms the other two approaches in terms of F-measure.
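
To make the first approach and the evaluation metric concrete, the following is a minimal sketch, not the chapter's implementation. It assumes the consistency criterion commonly used in standard phrase extraction: given a word alignment (for instance a Viterbi alignment) as a set of (source index, target index) links, the target phrase for a source span is the smallest target span covering all linked words, and it is rejected if any word inside that target span is linked to a source word outside the source span. The function names `target_span_for` and `f_measure`, and the toy alignment, are hypothetical.

```python
from typing import Optional, Set, Tuple

Link = Tuple[int, int]   # (source word index, target word index)
Span = Tuple[int, int]   # inclusive (start, end) word indices


def target_span_for(source_span: Span, alignment: Set[Link]) -> Optional[Span]:
    """Return the minimal target span consistent with source_span, or None.

    Assumption: 'consistent' means no target word inside the returned span
    is linked to a source word outside source_span.
    """
    i_start, i_end = source_span
    # Target positions linked to any word inside the source span.
    linked = [j for (i, j) in alignment if i_start <= i <= i_end]
    if not linked:
        return None  # nothing aligned, so no phrase pair can be read off
    j_start, j_end = min(linked), max(linked)
    # Reject the pair if a target word in the span links outside the source span.
    for (i, j) in alignment:
        if j_start <= j <= j_end and not (i_start <= i <= i_end):
            return None
    return (j_start, j_end)


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy alignment, invented for illustration only.
toy_alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
consistent = target_span_for((1, 2), toy_alignment)
inconsistent = target_span_for((0, 1), toy_alignment)
```

In the toy alignment, source words 1 and 2 map to target words 2 and 1, so the span (1, 2) yields a consistent target span, while the span (0, 1) is rejected because target word 1 is linked to source word 2, which lies outside it. In comparable sentences much of the material has no translation on the other side, so alignment links in those regions are unreliable. This gives some intuition for why an approach that does not depend on a single Viterbi path may behave differently.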

Notes

  1. Arabic Gigaword Fourth Edition (LDC2009T30).

  2. English Gigaword Fourth Edition (LDC2009T13).

Author information

Corresponding author

Correspondence to Sanjika Hewavitharana.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hewavitharana, S., Vogel, S. (2013). Extracting Parallel Phrases from Comparable Data. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_10

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science (R0)
