Abstract
This paper proposes a hybrid word alignment model for Phrase-Based Statistical Machine Translation (PB-SMT). The proposed model provides the most informative alignment links offered by both unsupervised and semi-supervised word alignment models. Two unsupervised word alignment models, namely GIZA++ and the Berkeley aligner, are combined with a rule-based word alignment technique. The unsupervised alignment models are trained on both the surface form and the root form of the training data and provide alignment tables for the corresponding training data. The rule-based aligner targets named entities (NEs) and syntactically motivated chunks. NEs are aligned through transliteration using a joint source-channel model. Chunks are aligned with a bootstrapping approach: the source chunks are translated into the target language using a baseline PB-SMT model, and the resulting chunk hypotheses are then validated against the target corpus using a fuzzy matching technique. Experiments are carried out after single-tokenizing the multiword NEs. The effectiveness of the proposed hybrid alignment model was evaluated extrinsically on MT quality using well-known automatic MT evaluation metrics such as BLEU and NIST. Our best system provided significant improvements over the baseline as measured by BLEU.
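As a rough illustration of two steps named in the abstract, the sketch below single-tokenizes multiword NEs and merges alignment links from several aligners with rule-based links. It is not the chapter's implementation: the function names, the underscore joiner, the "i-j" link notation, and the plain set union used for combination are all assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def single_tokenize(tokens: List[str], ne_spans: List[Tuple[int, int]]) -> List[str]:
    """Join each multiword NE span [start, end) into one token.

    The underscore joiner is an assumption for illustration only.
    """
    span_end = {}
    for start, end in ne_spans:
        if end <= start:
            raise ValueError("each NE span must cover at least one token")
        span_end[start] = end

    out = []
    i = 0
    while i < len(tokens):
        if i in span_end:
            end = span_end[i]
            out.append("_".join(tokens[i:end]))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_links(line: str) -> Set[Link]:
    """Parse one sentence pair's links written as 'i-j' pairs, e.g. '0-0 1-2'."""
    links = set()
    for pair in line.split():
        src, tgt = pair.split("-")
        links.add((int(src), int(tgt)))
    return links


def combine_links(statistical: Iterable[Set[Link]], rule_based: Set[Link]) -> Set[Link]:
    """Hypothetical combination: union of all statistical links plus rule-based links."""
    combined = set(rule_based)
    for links in statistical:
        combined |= links
    return combined


if __name__ == "__main__":
    sentence = ["Sachin", "Tendulkar", "visited", "New", "Delhi"]
    print(single_tokenize(sentence, [(0, 2), (3, 5)]))

    aligner_a = parse_links("0-0 1-2")
    aligner_b = parse_links("0-0 2-1")
    print(sorted(combine_links([aligner_a, aligner_b], {(1, 2)})))
```

A plain union keeps every link proposed by any aligner; a stricter scheme (for example, intersection or voting) would trade recall for precision, and the abstract does not say which the authors use.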
Notes
- 1. The EILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.
- 2.
- 3.
- 4. The IL-ILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.
Acknowledgements
The research leading to these results has received funding from the EU project EXPERT, under the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007-2013, REA grant agreement no. 317471.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Pal, S., Naskar, S.K. (2016). Hybrid Word Alignment. In: Costa-jussà, M., Rapp, R., Lambert, P., Eberle, K., Banchs, R., Babych, B. (eds) Hybrid Approaches to Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-21311-8_3
DOI: https://doi.org/10.1007/978-3-319-21311-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21310-1
Online ISBN: 978-3-319-21311-8
eBook Packages: Computer Science, Computer Science (R0)