Identifying Word Translations in Scientific Literature Based on Labeled Bilingual Topic Model and Co-occurrence Features

  • Mingjie Tian
  • Yahui Zhao
  • Rongyi Cui
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11221)


To exploit the increasingly rich multilingual resources and multi-label data in scientific literature, and to mine the correlations between languages, this paper proposes a labeled bilingual topic model and a similarity metric based on co-occurrence features, both of which can be applied to the task of identifying word translations. First, it is assumed that the keywords of a scientific article are relevant to the abstract of the same article; the keywords are extracted and treated as labels, each label is assigned to a topic, and the “latent” topics are thereby instantiated. Second, the abstracts of the articles are used to train the labeled bilingual topic model, which yields a representation of each word as a distribution over topics. Finally, each word is matched with its most similar word in the other language using the proposed similarity metric. The experimental results show that the labeled bilingual topic model achieves higher precision than a bilingual model based on “latent” topics, and that co-occurrence features strengthen the attraction between bilingual word pairs, which improves identification.
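The final matching step can be illustrated with a minimal sketch. The abstract does not define the proposed similarity metric, so cosine similarity over per-topic word distributions is used here purely as a stand-in, and the co-occurrence features are not modeled. The function names, the placeholder words, and the three-topic vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_u * norm_v)


def best_translation(source_vec, target_vecs):
    """Return the target word whose topic vector is closest to source_vec.

    Assumption: both languages share one set of labeled topics, so the
    vectors are directly comparable position by position.
    """
    return max(target_vecs, key=lambda w: cosine(source_vec, target_vecs[w]))


# Hypothetical distributions over three shared labeled topics.
source_topics = {"src_network": [0.8, 0.1, 0.1]}
target_topics = {
    "network": [0.7, 0.2, 0.1],
    "protein": [0.1, 0.1, 0.8],
    "grammar": [0.1, 0.8, 0.1],
}

match = best_translation(source_topics["src_network"], target_topics)
print(match)
```

In this toy data the source word concentrates its mass on the first topic, so the target word with the same profile is selected. A metric that also uses co-occurrence features, as the paper proposes, would replace `cosine` in this sketch.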


Topic model; Label; Co-occurrence features; Word translations



This research was financially supported by the State Language Commission of China under Grant No. YB135-76.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Intelligent Information Processing Lab., Department of Computer Science and Technology, Yanbian University, Yanji, China
