Learning Chinese-Japanese Bilingual Word Embedding by Using Common Characters

Wang, Jilei; Luo, Shiying; Li, Yanning; Xia, Shu-Tao

doi:10.1007/978-3-319-47650-6_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Bilingual word embedding, which maps the word embeddings of two languages into a single vector space, has been widely applied in areas such as machine translation and word sense disambiguation. However, no model for learning bilingual word embeddings has been universally accepted. In this work, we propose a novel model named CJ-BOC to learn Chinese-Japanese word embeddings. Because Chinese and Japanese share a large number of common characters, we exploit these characters in the training process. We demonstrate the effectiveness of this approach through both theoretical and experimental study. To evaluate the performance of CJ-BOC, we conducted a comprehensive experiment, which shows both its speed advantage and the high quality of the learned word embeddings.
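The abstract does not describe how CJ-BOC uses common characters, so the following is only a minimal sketch of the underlying observation: a Chinese word and a Japanese word can be compared by the characters they share. The function names, the Jaccard-style score, and the three-entry `JA_TO_ZH` table are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical, hand-picked mapping from Japanese kanji forms to simplified
# Chinese forms. A real system would need a complete mapping table.
JA_TO_ZH = {"図": "图", "書": "书", "館": "馆"}


def normalize_ja(word, table=JA_TO_ZH):
    """Replace Japanese character forms with their Chinese counterparts."""
    return "".join(table.get(ch, ch) for ch in word)


def shared_characters(zh_word, ja_word, table=JA_TO_ZH):
    """Return the set of characters common to both words after normalization."""
    return set(zh_word) & set(normalize_ja(ja_word, table))


def overlap_score(zh_word, ja_word, table=JA_TO_ZH):
    """Jaccard overlap of the two character sets (0.0 when both are empty)."""
    zh_chars = set(zh_word)
    ja_chars = set(normalize_ja(ja_word, table))
    union = zh_chars | ja_chars
    return len(zh_chars & ja_chars) / len(union) if union else 0.0


if __name__ == "__main__":
    # Identical written forms share every character.
    print(shared_characters("学生", "学生"))
    # Different glyph forms only match once the mapping table is applied.
    print(overlap_score("图书馆", "図書館"))
    print(overlap_score("图书馆", "図書館", table={}))
```

Word pairs with a high overlap score are plausible candidates for cross-lingual links, although shared characters alone do not guarantee shared meaning.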

Notes

1. https://code.google.com/archive/p/word2vec, accessed March 17, 2016.
2. http://www.statmt.org/moses/giza/GIZA++.html, accessed June 11, 2016.
3. http://taku910.github.io/mecab, accessed May 12, 2016.
4. https://github.com/fxsjy/jieba, accessed August 2, 2015.
5. Our source code is available at https://github.com/jileiwang/cjboc.

Acknowledgement

This research is supported in part by the Major State Basic Research Development Program of China (973 Program, 2012CB315803), the National Natural Science Foundation of China (61371078), and the Research Fund for the Doctoral Program of Higher Education of China (20130002110051).

Author information

Authors and Affiliations

Tsinghua University, Beijing, 100084, China
Jilei Wang & Shu-Tao Xia
Northeastern University, Shenyang, 110819, China
Shiying Luo
Renmin University of China, Beijing, 100872, China
Yanning Li

Corresponding author

Correspondence to Shu-Tao Xia.

Editor information

Editors and Affiliations

University of Passau, Passau, Germany
Franz Lehner
University of Passau, Passau, Germany
Nora Fteimi

About this paper

Cite this paper

Wang, J., Luo, S., Li, Y., Xia, ST. (2016). Learning Chinese-Japanese Bilingual Word Embedding by Using Common Characters. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_7

DOI: https://doi.org/10.1007/978-3-319-47650-6_7
Published: 05 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)