Learning Chinese-Japanese Bilingual Word Embedding by Using Common Characters

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Abstract

Bilingual word embedding, which maps the word embeddings of two languages into one vector space, has been widely applied in machine translation, word sense disambiguation, and other tasks. However, no model has been universally accepted for learning bilingual word embeddings. In this work, we propose a novel model named CJ-BOC to learn Chinese-Japanese word embeddings. Because Chinese and Japanese share a large number of common characters, we exploit them in our training process. We demonstrate the effectiveness of this exploitation through both theoretical and experimental study. To evaluate the performance of CJ-BOC, we conducted comprehensive experiments, which show both its speed advantage and the high quality of the learned word embeddings.
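As a rough illustration of why common characters can serve as a cross-lingual signal, the sketch below scores a Chinese word against a Japanese word by the overlap of their characters. It is a toy example and not the CJ-BOC model: the function names and the three-entry variant table `HYPOTHETICAL_VARIANTS` are assumptions made for illustration, standing in for a full Chinese-Japanese character mapping table.

```python
# Toy illustration only: this is NOT the CJ-BOC model described in the paper.
# HYPOTHETICAL_VARIANTS is a tiny hand-made stand-in for a real mapping table
# between simplified Chinese characters and their Japanese forms.
HYPOTHETICAL_VARIANTS = {
    "图": "図",
    "书": "書",
    "馆": "館",
}


def normalize_chinese(word):
    """Map each simplified Chinese character to its Japanese form when known."""
    return {HYPOTHETICAL_VARIANTS.get(ch, ch) for ch in word}


def common_char_ratio(zh_word, ja_word):
    """Jaccard overlap between the character sets of the two words."""
    zh_chars = normalize_chinese(zh_word)
    ja_chars = set(ja_word)
    if not zh_chars or not ja_chars:
        return 0.0
    return len(zh_chars & ja_chars) / len(zh_chars | ja_chars)


if __name__ == "__main__":
    for zh, ja in [("图书馆", "図書館"), ("大学", "学生"), ("老师", "先生")]:
        print(zh, ja, common_char_ratio(zh, ja))
```

Word pairs that share all of their characters after normalization score 1.0, and pairs with no shared characters score 0.0. A score like this says nothing about meaning on its own; it only shows the kind of surface evidence that shared characters provide.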

Notes

  1. https://code.google.com/archive/p/word2vec, accessed March 17, 2016.

  2. http://www.statmt.org/moses/giza/GIZA++.html, accessed June 11, 2016.

  3. http://taku910.github.io/mecab, accessed May 12, 2016.

  4. https://github.com/fxsjy/jieba, accessed August 2, 2015.

  5. Our source code is available at https://github.com/jileiwang/cjboc.

Acknowledgement

This research is supported in part by the Major State Basic Research Development Program of China (973 Program, 2012CB315803), the National Natural Science Foundation of China (61371078), and the Research Fund for the Doctoral Program of Higher Education of China (20130002110051).

Corresponding author

Correspondence to Shu-Tao Xia.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Wang, J., Luo, S., Li, Y., Xia, ST. (2016). Learning Chinese-Japanese Bilingual Word Embedding by Using Common Characters. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_7

  • DOI: https://doi.org/10.1007/978-3-319-47650-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47649-0

  • Online ISBN: 978-3-319-47650-6

  • eBook Packages: Computer Science, Computer Science (R0)
