Abstract
This study addresses the integration of rich additional information into the phrase-based approach, known as factored translation, which is an extension of phrase-based statistical machine translation (PBSMT). This approach has proven successful when translating English into a morphologically rich language. PBSMT serves as the baseline of this work. We extend the phrase-based translation approach by integrating additional linguistic knowledge, namely part-of-speech (POS) tags, to create a factored model. The main contribution of this study is a new approach for Arabic–English translation that injects Combinatory Categorial Grammar (CCG) supertags into the factored model to form an integrated model (POS + CCG). The system was trained on a freely available multi-UN corpus for the Arabic–English language pair. The Moses decoder, an open-source factored SMT system, was used to integrate these data into the target language model and the target side of the translation model. Results showed improvements in the BLEU automatic score across language models (LMs) of various high n-gram orders. The integration of the factors (POS + CCG) into the translation was tested successfully. Overall, evaluation with 3-, 5-, 7-, and 9-gram LMs showed that the integrated model achieved higher BLEU scores than PBSMT. With the 3-gram LM, the integrated model improved translation quality by 1.54, 1.29, and 0.21 % over the PBSMT, POS, and CCG models, respectively.
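To make the idea of target-side factors concrete, the following minimal sketch shows how a tokenized English sentence could be annotated with POS and CCG supertag factors in the vertical-bar-separated token format that Moses reads for factored training. The function name `to_factored` and the example tags are hypothetical illustrations, not taken from the paper's data or tooling; in practice the tags would come from a POS tagger and a CCG supertagger.

```python
def to_factored(tokens, pos_tags, supertags, sep="|"):
    """Join surface form, POS tag, and CCG supertag into factored tokens.

    Each output token has the shape word|POS|CCG. The helper name and the
    factor order are assumptions made for this illustration.
    """
    if not (len(tokens) == len(pos_tags) == len(supertags)):
        raise ValueError("all factor sequences must have the same length")

    factored = []
    for word, pos, ccg in zip(tokens, pos_tags, supertags):
        for factor in (word, pos, ccg):
            # A factor must be non-empty and must not contain the separator
            # or whitespace, otherwise the token could not be split back.
            if not factor or sep in factor or any(c.isspace() for c in factor):
                raise ValueError(f"invalid factor: {factor!r}")
        factored.append(sep.join((word, pos, ccg)))
    return " ".join(factored)


if __name__ == "__main__":
    # Hand-written, illustrative annotations (not from the paper's corpus).
    words = ["the", "committee", "adopted", "the", "resolution"]
    pos = ["DT", "NN", "VBD", "DT", "NN"]
    ccg = ["NP[nb]/N", "N", "(S[dcl]\\NP)/NP", "NP[nb]/N", "N"]
    print(to_factored(words, pos, ccg))
```

Keeping the surface form as the first factor means a POS-only or CCG-only model can be derived from the same annotated corpus by selecting a subset of factors at training time.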
Change history
17 May 2018
The original version of this article unfortunately contained a mistake. The family name of the first author was incomplete. The complete family name is “Rajeh Ali” as given above.
Cite this article
Rajeh, H.A., Li, Z. & Ayedh, A.M. A Novel Approach by Injecting CCG Supertags into an Arabic–English Factored Translation Machine. Arab J Sci Eng 41, 3071–3080 (2016). https://doi.org/10.1007/s13369-016-2075-9
DOI: https://doi.org/10.1007/s13369-016-2075-9