Abstract
Domain-specific machine translation (MT) currently suffers from a lack of high-quality bilingual corpora. Existing work in this field has shown the advantage of adaptation data selection (Ada-selection) for enriching such corpora. Encouraged by the empirical finding that topic distribution is conducive to characterizing a distinctive domain, we propose to use a topic model to improve Ada-selection. Based on a joint LDA approach, we incorporate topic distribution into measuring the relevance between the target domain and the candidate parallel sentence pairs. On this basis, we select the highly relevant candidates as high-quality domain-specific bilingual corpora. In practice, we apply our method to the acquisition of domain-specific corpora from the general domain. Experiments on an end-to-end domain-specific MT task show that our method outperforms the state of the art, yielding gains of at least 1.5 BLEU points at different scales of training data.
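The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that topic distributions for the target domain and for each candidate sentence pair have already been inferred by some topic model, and it uses cosine similarity as the relevance measure. Both the similarity choice and the names `cosine`, `select_candidates`, and `top_k` are assumptions made for illustration; the abstract does not specify the relevance function, and the toy vectors below are not data from the paper.

```python
import math
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_candidates(
    domain_topics: Sequence[float],
    candidates: List[Tuple[SentencePair, Sequence[float]]],
    top_k: int,
) -> List[Tuple[float, SentencePair]]:
    """Rank candidate sentence pairs by topical relevance and keep the top_k."""
    scored = [(cosine(domain_topics, topics), pair) for pair, topics in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy input: a 3-topic distribution for the target domain and
# for each general-domain candidate pair (values invented for illustration).
domain = [0.7, 0.2, 0.1]
pool = [
    (("src sentence A", "tgt sentence A"), [0.5, 0.4, 0.1]),
    (("src sentence B", "tgt sentence B"), [0.1, 0.1, 0.8]),
    (("src sentence C", "tgt sentence C"), [0.8, 0.1, 0.1]),
]

selected = select_candidates(domain, pool, top_k=2)
for score, (src, tgt) in selected:
    print(f"{score:.3f}\t{src}\t{tgt}")
```

In this toy pool, candidates A and C share the dominant topic of the target domain and are kept, while B is dominated by a different topic and is filtered out. A different relevance measure over the same distributions could be substituted without changing the ranking-and-truncation structure.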
Notes
- 1.
- 2.
We determined n by testing the values {5, 15, 30, 50, 100} in our experiments and found that n = 30 performs better than the other values.
Acknowledgements
This research is supported by the National Natural Science Foundation of China (Nos. 61672368, 61373097, 61672367, and 61272259). The authors would like to thank the anonymous reviewers for their insightful comments and suggestions. Yu Hong, Associate Professor at Soochow University, is the corresponding author of the paper; his email address is tianxianer@gmail.com.
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yao, L., Liu, M., Hong, Y., Liu, H., Yao, J. (2016). Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation. In: Li, Y., Xiang, G., Lin, H., Wang, M. (eds) Social Media Processing. SMP 2016. Communications in Computer and Information Science, vol 669. Springer, Singapore. https://doi.org/10.1007/978-981-10-2993-6_14
DOI: https://doi.org/10.1007/978-981-10-2993-6_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2992-9
Online ISBN: 978-981-10-2993-6
eBook Packages: Computer Science, Computer Science (R0)