NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources

Ma, Shutian; Zhang, Xiaoyong; Zhang, Chengzhi

doi:10.1007/978-3-319-50496-4_79

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10102)

Abstract

Many algorithms for measuring Chinese word similarity have been introduced, since it is a fundamental issue in various natural language processing tasks. Previous work has focused mainly on using existing semantic knowledge bases or large-scale corpora. However, both knowledge bases and corpora are limited in coverage and in how quickly their data is updated. Ensemble learning is therefore used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods rely on TYCCL and HowNet. The two corpus-based methods compute similarities by retrieving results from web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation suggests that the TYCCL-based method performs best on the testing dataset. However, with appropriately tuned parameters, ensemble learning could outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
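The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `SVR` as a stand-in for the support vector regression they used, a linear kernel and the values of `C` and `epsilon` chosen only for illustration, and entirely made-up feature values and gold scores. The column layout (one column per base similarity measure) and the helper names `cosine_similarity` and `train_ensemble` are hypothetical.

```python
import math

from sklearn.svm import SVR


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors.

    One common way to turn word embeddings into a similarity score;
    returns 0.0 if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_ensemble(features, gold_scores):
    """Fit a support vector regressor that maps base similarities to a gold score.

    Kernel and hyperparameters are illustrative assumptions, not values
    reported in the paper.
    """
    model = SVR(kernel="linear", C=10.0, epsilon=0.1)
    model.fit(features, gold_scores)
    return model


# Toy training data (invented for this sketch). Each row is one word pair;
# hypothetical columns: TYCCL-based, HowNet-based, web-search-based,
# embedding-based similarity, each already scaled to [0, 1].
TRAIN_FEATURES = [
    [0.95, 0.90, 0.85, 0.80],
    [0.80, 0.75, 0.70, 0.60],
    [0.60, 0.65, 0.55, 0.50],
    [0.50, 0.40, 0.45, 0.55],
    [0.30, 0.35, 0.40, 0.25],
    [0.20, 0.15, 0.30, 0.20],
    [0.10, 0.05, 0.15, 0.10],
    [0.70, 0.60, 0.65, 0.75],
]
# Invented human similarity judgements for the same pairs.
TRAIN_GOLD = [9.2, 7.5, 6.0, 5.0, 3.4, 2.1, 1.2, 7.0]

model = train_ensemble(TRAIN_FEATURES, TRAIN_GOLD)

# Two unseen pairs: one that every base measure rates as similar,
# one that every base measure rates as dissimilar.
new_pairs = [
    [0.90, 0.85, 0.80, 0.90],
    [0.15, 0.10, 0.20, 0.10],
]
predictions = model.predict(new_pairs)
print(predictions)
```

In a setup like this, the regressor learns how much weight each base measure deserves from the training pairs, which is one way an ensemble can outperform its strongest single component once its parameters are tuned.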

Notes

1. Available at: https://code.google.com/p/word2vec/.
2. Available at: http://ictclas.nlpir.org/.
3. Available at: http://radimrehurek.com/gensim/index.html.
4. Available at: https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
5. Available at: http://www.keenage.com/.
6. Available at: http://pennyliang.com/.
7. Available at: http://lafnews.com/corpus/.
8. 20 websites are selected based on the URL amount except two video websites.

Acknowledgments

This work is supported by the Major Projects of the National Social Science Fund (13&ZD174), the National Social Science Fund Project (No. 14BTQ033), and the Graduate Students Education Innovation Project of Jiangsu Province (No. KYLX16_0407).

Author information

Authors and Affiliations

Department of Information Management, Nanjing University of Science and Technology, Nanjing, 210094, China
Shutian Ma, Xiaoyong Zhang & Chengzhi Zhang
Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing, 210093, China
Chengzhi Zhang

Corresponding author

Correspondence to Chengzhi Zhang.

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Chin-Yew Lin
Brandeis University, Waltham, Massachusetts, USA
Nianwen Xue
Peking University, Beijing, China
Dongyan Zhao
Fudan University, Shanghai, China
Xuanjing Huang
Peking University, Beijing, China
Yansong Feng

About this paper

Cite this paper

Ma, S., Zhang, X., Zhang, C. (2016). NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science (LNAI), vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_79

DOI: https://doi.org/10.1007/978-3-319-50496-4_79
Published: 02 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)
