Abstract
We present a novel method for extracting parallel sub-sentential fragments from comparable corpora. The method extracts bilingual sentence fragments from noisy sentence pairs. We define a similarity measure between bilingual sentence fragments as a linear combination of several new features: fragment length, log-likelihood ratio (LLR) score, alignment path specifications in the block, and translation coverage fraction. This method enables us to extract useful machine translation training data from comparable corpora that contain no parallel sentence pairs. Evaluations indicate that the proposed method is efficient: it outperforms existing comparable systems in precision and recall, and it also improves the performance of a statistical machine translation system.
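As an illustration of the similarity measure described in the abstract, the following minimal sketch scores a candidate fragment pair as a weighted sum of the four named features. The abstract gives neither the weights nor the way each feature is computed or normalized, so the weight values, the assumption that each feature is already scaled to [0, 1], and all names (FragmentFeatures, WEIGHTS, similarity) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class FragmentFeatures:
    """Feature values for one candidate bilingual fragment pair.

    Assumption: every feature has already been scaled to [0, 1];
    the abstract does not say how the features are normalized.
    """
    length: float          # fragment length
    llr: float             # LLR score
    alignment_path: float  # alignment path specifications in the block
    coverage: float        # translation coverage fraction


# Hypothetical weights for illustration only; the paper's values are not
# given in the abstract.
WEIGHTS = FragmentFeatures(length=0.2, llr=0.3, alignment_path=0.2, coverage=0.3)


def similarity(f: FragmentFeatures, w: FragmentFeatures = WEIGHTS) -> float:
    """Linear combination of the four features."""
    return (
        w.length * f.length
        + w.llr * f.llr
        + w.alignment_path * f.alignment_path
        + w.coverage * f.coverage
    )


# Two made-up candidates: one well-aligned fragment pair and one noisy pair.
good = FragmentFeatures(length=0.8, llr=0.9, alignment_path=0.7, coverage=0.9)
noisy = FragmentFeatures(length=0.3, llr=0.1, alignment_path=0.2, coverage=0.2)

ranked = sorted([("good", good), ("noisy", noisy)],
                key=lambda item: similarity(item[1]), reverse=True)
```

With weights of this kind, candidates can be ranked or filtered by score; how the score is actually used to accept or reject fragments is not stated in the abstract.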
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Rahimi, Z., Samani, M.H., Khadivi, S. (2014). Extracting Parallel Fragments from Comparable Documents Using a Feature-Based Method. In: Movaghar, A., Jamzad, M., Asadi, H. (eds) Artificial Intelligence and Signal Processing. AISP 2013. Communications in Computer and Information Science, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-10849-0_29
DOI: https://doi.org/10.1007/978-3-319-10849-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10848-3
Online ISBN: 978-3-319-10849-0
eBook Packages: Computer Science, Computer Science (R0)