Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation

  • Conference paper
Social Media Processing (SMP 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 669)

Abstract

Current domain-specific machine translation (MT) suffers from a lack of high-quality bilingual corpora. Existing work in this field has shown the advantage of adaptation data selection (Ada-selection) for enriching such corpora. Encouraged by the empirical finding that topic distribution is conducive to characterizing a distinctive domain, we propose to use a topic model to improve Ada-selection. Based on a joint LDA approach, we incorporate topic distribution into measuring the relevance between the target domain and the candidate parallel sentence pairs. On this basis, we select the highly relevant candidates as high-quality domain-specific bilingual corpora. In practice, we apply our method to the acquisition of domain-specific corpora from general-domain data. Experiments on an end-to-end domain-specific MT task show that our method outperforms the state of the art, yielding an improvement of at least 1.5 BLEU points at different scales of training data.
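The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which relevance measure is used, so cosine similarity between topic distributions is an assumption here, as are the function names `cosine` and `select_candidates`, the `top_k` cutoff, and all toy numbers. The topic distributions are taken as given (for example, as inferred by an LDA model); this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_candidates(
    domain_topics: Sequence[float],
    candidates: List[Tuple[str, str, Sequence[float]]],
    top_k: int,
) -> List[Tuple[str, str, float]]:
    """Rank (source, target, topic_distribution) triples by relevance to the
    target-domain topic distribution and keep the top_k most relevant pairs."""
    scored = [
        (src, tgt, cosine(domain_topics, theta)) for src, tgt, theta in candidates
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Toy data with three topics; every number is made up for illustration.
domain = [0.7, 0.2, 0.1]
pool = [
    ("src a", "tgt a", [0.6, 0.3, 0.1]),
    ("src b", "tgt b", [0.1, 0.1, 0.8]),
    ("src c", "tgt c", [0.3, 0.4, 0.3]),
]
selected = select_candidates(domain, pool, top_k=2)
for src, tgt, score in selected:
    print(f"{src} ||| {tgt} ||| relevance={score:.3f}")
```

In this toy pool, the pairs whose topic mass sits mostly on the first topic rank highest, since that topic dominates the hypothetical in-domain distribution. A fixed `top_k` is only one possible selection rule; a relevance threshold would serve equally well in a sketch like this.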

Notes

  1. http://gibbslda.sourceforge.net/.

  2. We determine n by testing {5, 15, 30, 50, 100} in our experiments and find that n = 30 performs better than the other values.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (Grants No. 61672368, No. 61373097, No. 61672367, and No. 61272259). The authors would like to thank the anonymous reviewers for their insightful comments and suggestions. Yu Hong, Associate Professor at Soochow University, is the corresponding author of this paper and can be reached at tianxianer@gmail.com.

Corresponding author

Correspondence to Yu Hong.

Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yao, L., Liu, M., Hong, Y., Liu, H., Yao, J. (2016). Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation. In: Li, Y., Xiang, G., Lin, H., Wang, M. (eds) Social Media Processing. SMP 2016. Communications in Computer and Information Science, vol 669. Springer, Singapore. https://doi.org/10.1007/978-981-10-2993-6_14

  • DOI: https://doi.org/10.1007/978-981-10-2993-6_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2992-9

  • Online ISBN: 978-981-10-2993-6

  • eBook Packages: Computer Science, Computer Science (R0)
