Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity

Lu, Hsuehkuan; Cao, Yixin; Lei, Hou; Li, Juanzi

doi:10.1007/978-981-15-0118-0_33

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1058)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

Joint learning of words and entities benefits various NLP tasks, but most existing work focuses on a single-language setting. Cross-lingual representation learning has received considerable attention recently, yet it is still restricted by the availability of parallel data. In this paper, a method is proposed to jointly embed texts and entities using comparable data. In addition to evaluation on public semantic textual similarity datasets, a new task (cross-lingual text extraction) is proposed to assess the similarity between texts, and a dataset for this task is contributed. Experiments show that the proposed method outperforms cross-lingual representation methods that use parallel data on cross-lingual tasks, and achieves competitive results on monolingual tasks.

Notes

1. The Wikipedia dump was downloaded from the website: https://dumps.wikimedia.org.
2. https://en.wikipedia.org/wiki/Latin_music.
3. https://github.com/hsuehkuan-lu/KEBTR.
4. https://github.com/fxsjy/jieba.

Acknowledgement

The work is supported by NSFC key project (U1736204, 61533018, 61661146007), Ministry of Education and China Mobile Joint Fund (MCM20170301), and THUNUS NExT Co-Lab.

Author information

Authors and Affiliations

Department of CST, Tsinghua University, Beijing, 100084, China
Hsuehkuan Lu, Hou Lei & Juanzi Li
School of Computing, National University of Singapore, Singapore, Singapore
Yixin Cao

Corresponding author

Correspondence to Hsuehkuan Lu.

Editor information

Editors and Affiliations

Guilin University of Technology, Guilin, China
Xiaohui Cheng
Northeast Forestry University, Harbin, China
Weipeng Jing
Harbin University of Science and Technology, Harbin, China
Xianhua Song
National Academy of Guo Ding Institute of Data Science, Harbin, China
Zeguang Lu

About this paper

Cite this paper

Lu, H., Cao, Y., Lei, H., Li, J. (2019). Knowledge-Enhanced Bilingual Textual Representations for Cross-Lingual Semantic Textual Similarity. In: Cheng, X., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2019. Communications in Computer and Information Science, vol 1058. Springer, Singapore. https://doi.org/10.1007/978-981-15-0118-0_33

DOI: https://doi.org/10.1007/978-981-15-0118-0_33
Published: 13 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0117-3
Online ISBN: 978-981-15-0118-0
eBook Packages: Computer Science, Computer Science (R0)
