Hybrid Word Alignment

  • Santanu Pal
  • Sudip Kumar Naskar
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

This paper proposes a hybrid word alignment model for Phrase-Based Statistical Machine Translation (PB-SMT). The proposed model retains the most informative alignment links offered by both unsupervised and semi-supervised word alignment models. Two unsupervised word alignment models, GIZA++ and the Berkeley aligner, are combined with a rule-based word alignment technique. The unsupervised models are trained on both the surface form and the root form of the training data and provide alignment tables for the corresponding data. The rule-based aligner targets named entities (NEs) and syntactically motivated chunks. NEs are aligned through transliteration using a joint source-channel model. Chunks are aligned with a bootstrapping approach: the source chunks are translated into the target language using a baseline PB-SMT model, and the resulting chunk hypotheses are then validated against the target corpus using a fuzzy matching technique. Experiments are carried out after single-tokenizing the multiword NEs. The effectiveness of the proposed hybrid alignment model was evaluated extrinsically on MT quality using well-known automatic MT evaluation metrics such as BLEU and NIST. Our best system provided significant improvements over the baseline as measured by BLEU.
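
To make two of the steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that single-tokenizing a multiword NE means joining its tokens with an underscore, that alignments are exchanged as `i-j` index pairs on one line per sentence, and that the hybrid step keeps links proposed by enough statistical aligners and then adds the rule-based links. The function names (`single_tokenize`, `parse_links`, `combine`) and the `min_votes` parameter are hypothetical; the abstract does not specify the actual combination rule.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def single_tokenize(tokens: List[str],
                    ne_spans: Iterable[Tuple[int, int]],
                    joiner: str = "_") -> List[str]:
    """Merge each multiword NE span [start, end) into a single token.

    Assumption: spans do not overlap and are given as token offsets.
    """
    span_end = {start: end for start, end in ne_spans}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        end = span_end.get(i)
        if end is not None and end > i + 1:
            out.append(joiner.join(tokens[i:end]))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_links(line: str) -> Set[Link]:
    """Parse a line of 'i-j' pairs, e.g. '0-0 1-2', into a set of links."""
    links: Set[Link] = set()
    for pair in line.split():
        src, tgt = pair.split("-")
        links.add((int(src), int(tgt)))
    return links


def combine(aligner_links: List[Set[Link]],
            rule_links: Set[Link],
            min_votes: int = 1) -> Set[Link]:
    """Hypothetical combination rule: keep links proposed by at least
    `min_votes` statistical aligners, then add all rule-based links."""
    votes = Counter(link for links in aligner_links for link in links)
    kept = {link for link, n in votes.items() if n >= min_votes}
    return kept | rule_links


if __name__ == "__main__":
    sent = ["Sudip", "Kumar", "Naskar", "visited", "New", "Delhi"]
    print(single_tokenize(sent, [(0, 3), (4, 6)]))
    giza = parse_links("0-0 1-1")
    berkeley = parse_links("0-0 1-2")
    print(sorted(combine([giza, berkeley], {(2, 2)}, min_votes=2)))
```

With `min_votes=1` the statistical part reduces to a plain union of the aligners' links; raising it moves toward an intersection, trading recall for precision.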

Keywords

  • Statistical Machine Translation
  • Parallel Corpus
  • Language Pair
  • Name Entity
  • Word Alignment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

The research leading to these results has received funding from the EU project EXPERT, the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007-2013, under REA grant agreement no. 317471.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Universität des Saarlandes, Saarbrücken, Germany
  2. Jadavpur University, Kolkata, India
