Hybrid Word Alignment

Abstract

This paper proposes a hybrid word alignment model for Phrase-Based Statistical Machine Translation (PB-SMT). The hybrid model provides the most informative alignment links, drawn from both unsupervised and semi-supervised word alignment models. Two unsupervised word alignment models, GIZA++ and the Berkeley aligner, are combined with a rule-based word alignment technique. The unsupervised alignment models are trained on both the surface form and the root form of the training data and provide alignment tables for the corresponding training data. The rule-based aligner targets named entities (NEs) and syntactically motivated chunks. NEs are aligned through transliteration using a joint source-channel model. Chunks are aligned with a bootstrapping approach: the source chunks are translated into the target language using a baseline PB-SMT model, and the resulting chunk hypotheses are then validated against the target corpus using a fuzzy matching technique. Experiments are carried out after single-tokenizing the multiword NEs. The effectiveness of the proposed hybrid alignment model is evaluated extrinsically on MT quality using well-known automatic MT evaluation metrics such as BLEU and NIST. Our best system provides significant improvements over the baseline as measured by BLEU.
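Two of the steps named in the abstract, single-tokenizing multiword NEs and merging alignment links from several aligners, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the underscore joiner, and in particular the combination rule (keep links on which all statistical aligners agree, then let rule-based links override conflicting ones) are hypothetical and are not taken from the chapter, which does not state its combination procedure in the abstract.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def single_tokenize(tokens: List[str],
                    ne_spans: Iterable[Tuple[int, int]],
                    joiner: str = "_") -> List[str]:
    """Merge each multiword NE span [start, end) into a single token.

    The joiner character is an assumption; any symbol absent from the
    corpus would serve the same purpose.
    """
    span_end = {start: end for start, end in ne_spans}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        end = span_end.get(i)
        if end is not None and end > i:
            out.append(joiner.join(tokens[i:end]))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out


def combine_alignments(statistical: List[Set[Link]],
                       rule_based: Set[Link]) -> Set[Link]:
    """Hypothetical combination rule (not the chapter's):

    1. keep only links proposed by every statistical aligner;
    2. drop those that touch a token already aligned by the rule-based aligner;
    3. add all rule-based links.
    """
    agreed = set.intersection(*statistical) if statistical else set()
    src_covered = {s for s, _ in rule_based}
    tgt_covered = {t for _, t in rule_based}
    kept = {(s, t) for s, t in agreed
            if s not in src_covered and t not in tgt_covered}
    return kept | rule_based


# Toy usage with made-up data.
sentence = ["Sri", "Lanka", "is", "an", "island"]
merged_tokens = single_tokenize(sentence, [(0, 2)])

giza_links = {(0, 0), (1, 1), (2, 2)}
berkeley_links = {(0, 0), (1, 2), (2, 2)}
ne_links = {(0, 1)}
hybrid_links = combine_alignments([giza_links, berkeley_links], ne_links)
```

In this toy case the NE link replaces the statistical link for source token 0, while the link on which both statistical aligners agree for source token 2 is kept.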

Notes

  1. The EILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.

  2. http://nlp.stanford.edu/software/lex-parser.shtml

  3. http://crfchunker.sourceforge.net/

  4. The IL-ILMT project is funded by the Department of Electronics and Information Technology (DEITY), Ministry of Communications and Information Technology (MCIT), Government of India.

Acknowledgements

The research leading to these results has received funding from the EU project EXPERT, under the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013), REA grant agreement no. 317471.

Corresponding author

Correspondence to Santanu Pal.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Pal, S., Naskar, S.K. (2016). Hybrid Word Alignment. In: Costa-jussà, M., Rapp, R., Lambert, P., Eberle, K., Banchs, R., Babych, B. (eds) Hybrid Approaches to Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-21311-8_3

  • DOI: https://doi.org/10.1007/978-3-319-21311-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21310-1

  • Online ISBN: 978-3-319-21311-8

  • eBook Packages: Computer Science, Computer Science (R0)
