Hybrid Word Alignment

Pal, Santanu; Naskar, Sudip Kumar

doi:10.1007/978-3-319-21311-8_3

Chapter
First Online: 13 July 2016

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

This paper proposes a hybrid word alignment model for Phrase-Based Statistical Machine Translation (PB-SMT). The proposed model provides the most informative alignment links offered by both unsupervised and semi-supervised word alignment models. Two unsupervised word alignment models, namely GIZA++ and the Berkeley aligner, are combined with a rule-based word alignment technique. The unsupervised alignment models are trained on both the surface form and the root form of the training data and provide alignment tables for the corresponding training data. The rule-based aligner is aimed at aligning named entities (NEs) and syntactically motivated chunks. NEs are aligned through transliteration using a joint source-channel model. Chunks are aligned using a bootstrapping approach: the source chunks are translated into the target language using a baseline PB-SMT model, and the chunk hypotheses are subsequently validated against the target corpus using a fuzzy matching technique. Experiments are carried out after single-tokenizing the multiword NEs. The effectiveness of the proposed hybrid alignment model was evaluated extrinsically on MT quality using well-known automatic MT evaluation metrics such as BLEU and NIST. Our best system provided significant improvements over the baseline as measured by BLEU.
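
As a rough illustration of two steps named in the abstract, the sketch below single-tokenizes multiword NEs and merges alignment links from several aligners. It is not the authors' implementation. The function names, the underscore joiner, the `i-j` link format, and the use of a plain union as the combination rule are all assumptions made for this example; the abstract does not say how the alignment tables are actually combined.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def single_tokenize(tokens: List[str],
                    ne_spans: Iterable[Tuple[int, int]],
                    joiner: str = "_") -> List[str]:
    """Merge each [start, end) multiword NE span into one token.

    The joiner character is an assumption for illustration.
    """
    span_end = {start: end for start, end in ne_spans}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        end = span_end.get(i)
        if end is not None and end > i + 1:
            out.append(joiner.join(tokens[i:end]))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_links(line: str) -> Set[Link]:
    """Parse whitespace-separated 'i-j' index pairs, e.g. '0-0 1-2'."""
    links: Set[Link] = set()
    for pair in line.split():
        src, tgt = pair.split("-")
        links.add((int(src), int(tgt)))
    return links


def combine_links(*link_sets: Set[Link]) -> Set[Link]:
    """Union the links from several aligners (assumed combination rule)."""
    combined: Set[Link] = set()
    for links in link_sets:
        combined |= links
    return combined
```

For example, `single_tokenize("New Delhi is the capital".split(), [(0, 2)])` joins the first two tokens into `New_Delhi`, and `combine_links` can then merge link sets produced for the re-tokenized sentence pair by the statistical and rule-based aligners.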

Notes

1. The EILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.
2. http://nlp.stanford.edu/software/lex-parser.shtml
3. http://crfchunker.sourceforge.net/
4. The IL-ILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.

Acknowledgements

The research leading to these results has received funding from the EU project EXPERT – the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007-2013/ under REA grant agreement no. 317471.

Author information

Authors and Affiliations

Universität des Saarlandes, Saarbrücken, Germany
Santanu Pal
Jadavpur University, Kolkata, India
Sudip Kumar Naskar

Corresponding author

Correspondence to Santanu Pal.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Marta R. Costa-jussà
University of Aix-Marseille and University of Mainz, Marseille, France
Reinhard Rapp
Pompeu Fabra University, Barcelona, Spain
Patrik Lambert
Lingenio GmbH, Heidelberg, Baden-Württemberg, Germany
Kurt Eberle
Institute for Infocomm Research, Singapore
Rafael E. Banchs
Centre for Translation Studies, University of Leeds School of Modern Languages & Cultures, Leeds, United Kingdom
Bogdan Babych

About this chapter

Cite this chapter

Pal, S., Naskar, S.K. (2016). Hybrid Word Alignment. In: Costa-jussà, M., Rapp, R., Lambert, P., Eberle, K., Banchs, R., Babych, B. (eds) Hybrid Approaches to Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-21311-8_3

DOI: https://doi.org/10.1007/978-3-319-21311-8_3
Published: 13 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21310-1
Online ISBN: 978-3-319-21311-8
eBook Packages: Computer Science, Computer Science (R0)
