Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method

Rahimi, Zeinab; Samani, Mohammad Hossein; Khadivi, Shahram

doi:10.1007/978-3-319-10849-0_29

Part of the book series: Communications in Computer and Information Science (CCIS, volume 427)

Included in the following conference series:

International Symposium on Artificial Intelligence and Signal Processing

Abstract

We present a novel method for extracting parallel sub-sentential fragments from comparable corpora. The method extracts bilingual sentence fragments from noisy sentence pairs. We define a similarity measure between bilingual sentence fragments as a linear combination of several new features: fragment length, LLR score, alignment path specifications in the block, and translation coverage fraction. This makes it possible to extract useful machine translation training data from comparable corpora that contain no parallel sentence pairs. Evaluations indicate that the proposed method is very efficient: it not only outperforms existing similar systems in precision and recall, but also helps improve the performance of a statistical machine translation system.
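
As a rough illustration of a similarity measure built as a linear combination of features, the following minimal Python sketch scores one candidate fragment pair. It is not the authors' implementation: the abstract names the four features but gives neither their exact definitions nor their weights, so the `FragmentFeatures` class, the `WEIGHTS` values, and the assumption that each feature is already scaled to [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FragmentFeatures:
    """Feature values for one candidate fragment pair.

    Hypothetical container: each value is assumed to be pre-scaled to [0, 1].
    """
    length: float          # fragment length feature
    llr: float             # LLR score feature
    alignment_path: float  # alignment path feature for the block
    coverage: float        # translation coverage fraction


# Placeholder weights chosen only for illustration; the abstract does not
# state the weights used in the paper.
WEIGHTS = {"length": 0.2, "llr": 0.3, "alignment_path": 0.2, "coverage": 0.3}


def similarity(features: FragmentFeatures, weights=WEIGHTS) -> float:
    """Return the weighted sum of the feature values (a linear combination)."""
    return sum(w * getattr(features, name) for name, w in weights.items())


# Usage: score two made-up candidates and compare them.
strong = FragmentFeatures(length=0.8, llr=0.9, alignment_path=0.7, coverage=0.9)
weak = FragmentFeatures(length=0.1, llr=0.1, alignment_path=0.1, coverage=0.1)
print(similarity(strong) > similarity(weak))
```

With these placeholder weights, a candidate that scores higher on every feature necessarily receives a higher similarity; how the weights are actually set in the paper is not described in the abstract.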

Author information

Authors and Affiliations

Department of Speech and Natural Language Processing, Research Center of Intelligent Signal Processing (RCISP), Tehran, Iran
Zeinab Rahimi
Department of Secure Infrastructures, Research Center of Intelligent Signal Processing (RCISP), Tehran, Iran
Mohammad Hossein Samani
Department of Computer Engineering, Amirkabir University of Technology, Hafez Avenue, Tehran, Iran
Shahram Khadivi

Corresponding author

Correspondence to Zeinab Rahimi.

Editor information

Editors and Affiliations

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Ali Movaghar
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Mansour Jamzad
Sharif University of Technology, Tehran, Iran
Hossein Asadi

About this paper

Cite this paper

Rahimi, Z., Samani, M.H., Khadivi, S. (2014). Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method. In: Movaghar, A., Jamzad, M., Asadi, H. (eds) Artificial Intelligence and Signal Processing. AISP 2013. Communications in Computer and Information Science, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-10849-0_29

DOI: https://doi.org/10.1007/978-3-319-10849-0_29
Published: 26 September 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10848-3
Online ISBN: 978-3-319-10849-0
eBook Packages: Computer Science, Computer Science (R0)
