Abstract
Many algorithms for measuring Chinese word similarity have been introduced, since word similarity is a fundamental issue in various natural language processing tasks. Previous work focused mainly on existing semantic knowledge bases or large-scale corpora. However, both knowledge bases and corpora are limited in coverage and in how readily their data can be updated. Ensemble learning can therefore be used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods are built on TYCCL and HowNet. The two corpus-based methods compute similarities by querying web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation on the test dataset suggests that the TYCCL-based method performs best. However, with appropriately tuned parameters, ensemble learning could outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
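The combination step described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn's `SVR` as the regressor, although the notes point to LIBSVM, and it assumes five similarity features per word pair. The feature order, the toy values, the gold scores, the hyperparameters, and the helper names `fit_ensemble` and `predict_similarity` are all hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Each row is one word pair; each column is the similarity produced by one
# base method. Assumed (hypothetical) column order:
# TYCCL, HowNet, web search, word vectors (news), word vectors (microblog).
X_train = np.array([
    [0.90, 0.85, 0.70, 0.80, 0.75],
    [0.10, 0.20, 0.15, 0.05, 0.10],
    [0.60, 0.55, 0.40, 0.65, 0.50],
    [0.30, 0.25, 0.35, 0.20, 0.30],
    [0.80, 0.70, 0.60, 0.75, 0.65],
    [0.05, 0.10, 0.20, 0.10, 0.15],
])

# Made-up human similarity judgements for the same pairs, scaled to [0, 1].
y_train = np.array([0.88, 0.12, 0.58, 0.27, 0.76, 0.08])


def fit_ensemble(X, y, C=1.0, epsilon=0.05, gamma="scale"):
    """Fit an RBF-kernel support vector regressor on base-method similarities."""
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
    model.fit(X, y)
    return model


def predict_similarity(model, X):
    """Return the combined similarity for each row of base-method similarities."""
    return model.predict(np.asarray(X, dtype=float))


model = fit_ensemble(X_train, y_train)

# Two unseen word pairs: one that every base method rates as similar,
# and one that every base method rates as dissimilar.
X_new = np.array([
    [0.85, 0.80, 0.65, 0.78, 0.70],
    [0.15, 0.10, 0.20, 0.12, 0.10],
])
pred = predict_similarity(model, X_new)
print(pred)
```

In a sketch like this, `C`, `epsilon`, and `gamma` are the parameters one would tune, for example by cross-validation on the training pairs, which is in line with the abstract's remark that the ensemble depends on appropriate parameter tuning.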
Notes
- 1. Available at: https://code.google.com/p/word2vec/.
- 2. Available at: http://ictclas.nlpir.org/.
- 3. Available at: http://radimrehurek.com/gensim/index.html.
- 4. Available at: https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
- 5. Available at: http://www.keenage.com/.
- 6. Available at: http://pennyliang.com/.
- 7. Available at: http://lafnews.com/corpus/.
- 8. 20 websites are selected based on the URL amount except two video websites.
Acknowledgments
This work is supported by the Major Projects of the National Social Science Fund (13&ZD174), the National Social Science Fund Project (No. 14BTQ033), and the Graduate Students Education Innovation Project of Jiangsu Province (No. KYLX16_0407).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Ma, S., Zhang, X., Zhang, C. (2016). NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_79
DOI: https://doi.org/10.1007/978-3-319-50496-4_79
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50495-7
Online ISBN: 978-3-319-50496-4
eBook Packages: Computer Science, Computer Science (R0)