iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation

Wang, Longyue; Wong, Derek F.; Chao, Lidia S.; Lu, Yi; Xing, Junwen

doi:10.1007/978-3-642-41491-6_26


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)


Abstract

Data selection is an important technique for improving data-driven models, especially in large-scale natural language processing (NLP). Recent research on domain adaptation for statistical machine translation (SMT) has focused on applying individual data selection models. In this paper, we propose a hybrid data selection model named iCPE, which combines three state-of-the-art similarity metrics, Cosine tf-idf, Perplexity and Edit distance, at both the corpus level and the model level. We conduct experiments on the Hong Kong Law Chinese-English corpus, and the results show that this simple and effective hybrid model outperforms both the baseline system trained on the entire data and the best rival method. The consistent performance gains of the proposed approach have important implications for mining very large corpora in computationally limited environments.
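To make the three similarity metrics named in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how the scores are computed or combined, so the add-one smoothed unigram language model, the max-over-sentences matching, the toy sentences, and the union-of-top-k combination in the hypothetical `select_union` helper are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized sentences."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; words unseen in the idf table get weight 0."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def unigram_perplexity(tokens, counts, total, vocab_size):
    """Perplexity under an add-one smoothed unigram LM (toy assumption)."""
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))


def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def score(sentence, in_domain):
    """Score one candidate sentence against a small in-domain sample."""
    idf = build_idf(in_domain)
    vec = tfidf_vector(sentence, idf)
    counts = Counter(w for d in in_domain for w in d)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unknown words
    return {
        # Higher is more similar.
        "cosine": max(cosine(vec, tfidf_vector(d, idf)) for d in in_domain),
        # Lower is more in-domain.
        "perplexity": unigram_perplexity(sentence, counts, total, vocab_size),
        # Edit distance normalized to a similarity in [0, 1].
        "edit": max(
            1.0 - edit_distance(sentence, d) / max(len(sentence), len(d), 1)
            for d in in_domain
        ),
    }


def select_union(candidates, in_domain, k):
    """Hypothetical combination: union of the top-k indices under each metric."""
    scored = [score(c, in_domain) for c in candidates]
    idx = range(len(candidates))
    by_cos = sorted(idx, key=lambda i: -scored[i]["cosine"])[:k]
    by_ppl = sorted(idx, key=lambda i: scored[i]["perplexity"])[:k]
    by_edit = sorted(idx, key=lambda i: -scored[i]["edit"])[:k]
    return sorted(set(by_cos) | set(by_ppl) | set(by_edit))


# Toy data, invented for illustration only.
in_domain = [
    ["the", "court", "shall", "hear", "the", "appeal"],
    ["the", "tribunal", "may", "dismiss", "the", "appeal"],
]
candidates = [
    ["the", "court", "may", "hear", "the", "appeal"],
    ["i", "like", "green", "tea"],
]

for c in candidates:
    print(" ".join(c), score(c, in_domain))
print("selected indices:", select_union(candidates, in_domain, k=1))
```

In this toy setting the legal-sounding candidate scores higher on cosine and edit similarity and lower on perplexity than the unrelated one, so all three criteria agree. On real data the criteria can disagree, which is the usual motivation for combining them. A practical system would replace the unigram model with an n-gram language model built by a dedicated toolkit.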



Author information

Authors and Affiliations

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau S.A.R., China
Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu & Junwen Xing


Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang


About this paper

Cite this paper

Wang, L., Wong, D.F., Chao, L.S., Lu, Y., Xing, J. (2013). iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_26


DOI: https://doi.org/10.1007/978-3-642-41491-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)
