Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method

  • Conference paper
Artificial Intelligence and Signal Processing (AISP 2013)

Abstract

This paper presents a novel method for extracting parallel sub-sentential fragments from comparable corpora. The method extracts bilingual sentence fragments from noisy sentence pairs. We define a similarity measure between bilingual sentence fragments as a linear combination of several new features: fragment length, log-likelihood-ratio (LLR) score, alignment path specifications in the block, and translation coverage fraction. This makes it possible to extract useful machine translation training data from comparable corpora that contain no parallel sentence pairs. Evaluations indicate that the proposed method is efficient: it not only outperforms existing similar systems in precision and recall but also improves the performance of a statistical machine translation system.
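
As a rough illustration of a similarity measure built as a linear combination of the four features named above, the following minimal Python sketch may help. It is not the authors' implementation: the feature scaling, the weight values, the threshold, and the names `FragmentFeatures`, `similarity`, and `select_fragments` are all assumptions made for this example, since the abstract gives none of them.

```python
from dataclasses import dataclass


@dataclass
class FragmentFeatures:
    """Feature values for one candidate fragment pair (assumed scaled to [0, 1])."""
    length: float          # fragment length feature
    llr_score: float       # log-likelihood-ratio association score
    alignment_path: float  # score summarizing the alignment path in the block
    coverage: float        # fraction of words that have a translation


# Hypothetical weights: the abstract does not state any values.
WEIGHTS = FragmentFeatures(length=0.2, llr_score=0.3, alignment_path=0.2, coverage=0.3)


def similarity(f: FragmentFeatures, w: FragmentFeatures = WEIGHTS) -> float:
    """Linear combination of the four features."""
    return (
        w.length * f.length
        + w.llr_score * f.llr_score
        + w.alignment_path * f.alignment_path
        + w.coverage * f.coverage
    )


def select_fragments(candidates, threshold=0.5):
    """Keep the labels of candidates whose similarity reaches the (assumed) threshold."""
    return [label for label, feats in candidates if similarity(feats) >= threshold]


candidates = [
    ("likely parallel", FragmentFeatures(0.8, 0.9, 0.7, 0.9)),
    ("likely noise", FragmentFeatures(0.3, 0.1, 0.2, 0.1)),
]
selected = select_fragments(candidates)
```

In a real system the weights and threshold would presumably be tuned on held-out data rather than fixed by hand as they are here.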



Author information

Corresponding author

Correspondence to Zeinab Rahimi.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Rahimi, Z., Samani, M.H., Khadivi, S. (2014). Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method. In: Movaghar, A., Jamzad, M., Asadi, H. (eds) Artificial Intelligence and Signal Processing. AISP 2013. Communications in Computer and Information Science, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-10849-0_29

  • DOI: https://doi.org/10.1007/978-3-319-10849-0_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10848-3

  • Online ISBN: 978-3-319-10849-0

  • eBook Packages: Computer Science, Computer Science (R0)
