Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation

Yao, Liang; Liu, Mengyi; Hong, Yu; Liu, Hao; Yao, Jianmin

doi:10.1007/978-981-10-2993-6_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 669)

Included in the following conference series:

Chinese National Conference on Social Media Processing

Abstract

Current domain-specific machine translation (MT) suffers from a lack of high-quality bilingual corpora. Existing work in this field has shown the advantage of adaptation data selection (Ada-selection) for enriching such corpora. Encouraged by the empirical finding that topic distribution is conducive to characterizing a distinctive domain, we propose to use a topic model to improve Ada-selection. Based on a joint LDA approach, we incorporate topic distribution into measuring the relevance between the target domain and the candidate parallel sentence pairs. On this basis, we select the highly relevant candidates as the high-quality domain-specific bilingual corpus. In practice, we apply our method to the acquisition of domain-specific corpora from general-domain data. Experiments on an end-to-end domain-specific MT task show that our method outperforms the state of the art, yielding gains of at least 1.5 BLEU points across different scales of training data.
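
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the relevance measure, so cosine similarity between topic vectors is assumed here, and the topic distributions are taken as precomputed inputs rather than inferred with a joint LDA model. The names `cosine`, `select_relevant`, `domain_topics`, `candidates` and `top_k` are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_relevant(domain_topics, candidates, top_k):
    """Rank candidate sentence pairs by topic similarity to the target
    domain and keep the top_k most relevant ones.

    candidates: list of (sentence_pair, topic_distribution) tuples, where
    the topic distribution is assumed to come from a topic model.
    """
    scored = [(cosine(domain_topics, topics), pair) for pair, topics in candidates]
    # Sort on the score only, highest relevance first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]


# Toy data: three general-domain candidates over three topics (made up).
domain_topics = [0.7, 0.2, 0.1]
candidates = [
    (("src sentence A", "tgt sentence A"), [0.6, 0.3, 0.1]),
    (("src sentence B", "tgt sentence B"), [0.1, 0.1, 0.8]),
    (("src sentence C", "tgt sentence C"), [0.8, 0.1, 0.1]),
]
selected = select_relevant(domain_topics, candidates, top_k=2)
```

With this toy data, candidates C and A are kept because their topic mass concentrates on the same topic as the target domain, while B is dropped. Any other similarity or divergence over distributions could be substituted for `cosine` without changing the rest of the sketch.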

Notes

1. http://gibbslda.sourceforge.net/
2. We determine n by testing the values {5, 15, 30, 50, 100} in our experiments, and find that n = 30 performs better than the other values.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (Nos. 61672368, 61373097, 61672367 and 61272259). The authors would like to thank the anonymous reviewers for their insightful comments and suggestions. Yu Hong, Associate Professor at Soochow University, is the corresponding author of the paper; his email address is tianxianer@gmail.com.

Author information

Authors and Affiliations

Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China
Liang Yao, Mengyi Liu, Yu Hong, Hao Liu & Jianmin Yao

Corresponding author

Correspondence to Yu Hong.

Editor information

Editors and Affiliations

Beijing Language and Culture University, Beijing, China
Yuming Li
Jiangxi Normal University, Nanchang, China
Guoxiong Xiang
Dalian University of Technology, Dalian, China
Hongfei Lin
Jiangxi Normal University, Nanchang, China
Mingwen Wang

About this paper

Cite this paper

Yao, L., Liu, M., Hong, Y., Liu, H., Yao, J. (2016). Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation. In: Li, Y., Xiang, G., Lin, H., Wang, M. (eds) Social Media Processing. SMP 2016. Communications in Computer and Information Science, vol 669. Springer, Singapore. https://doi.org/10.1007/978-981-10-2993-6_14

DOI: https://doi.org/10.1007/978-981-10-2993-6_14
Published: 19 October 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2992-9
Online ISBN: 978-981-10-2993-6
eBook Packages: Computer Science, Computer Science (R0)
