NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources

  • Conference paper
Natural Language Understanding and Intelligent Applications (ICCPOL 2016, NLPCC 2016)

Abstract

Many Chinese word similarity measures have been proposed, because word similarity is a fundamental issue in many natural language processing tasks. Previous work has focused mainly on existing semantic knowledge bases or large-scale corpora. However, knowledge bases and corpora are both limited in coverage and in how quickly their data are updated. Ensemble learning is therefore used to improve performance by combining similarities. This paper describes a Chinese word similarity measure that applies ensemble learning to knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods rely on TYCCL and HowNet. The two corpus-based methods compute similarities by querying web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to produce the final similarity. Evaluation on the test dataset suggests that the TYCCL-based method performs best. However, with appropriately tuned parameters, ensemble learning could outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
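The combination step can be illustrated with a small sketch. It is not the authors' implementation: it assumes a linear model with an epsilon-insensitive loss, trained by plain subgradient descent, as a stand-in for a full support vector regression solver. The function names, the three feature columns (a subset of the methods, chosen for brevity), and all scores and gold values are hypothetical and serve only to show how per-method similarities could be fed into one regressor.

```python
def predict(w, b, x):
    """Linear combination of the per-method similarity scores."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def objective(w, b, X, y, epsilon, lam):
    """Regularised mean epsilon-insensitive loss (linear SVR objective)."""
    hinge = sum(max(0.0, abs(predict(w, b, xi) - yi) - epsilon)
                for xi, yi in zip(X, y))
    return 0.5 * lam * sum(wj * wj for wj in w) + hinge / len(X)


def train_linear_svr(X, y, epsilon=0.05, lam=1e-3, lr=0.05, epochs=500):
    """Subgradient descent on the objective; returns the best (w, b) seen."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    best = (objective(w, b, X, y, epsilon, lam), list(w), b)
    for t in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            if abs(err) <= epsilon:
                continue  # inside the epsilon tube: no loss, no gradient
            sign = 1.0 if err > 0 else -1.0
            grad_b += sign / n
            for j in range(d):
                grad_w[j] += sign * xi[j] / n
        step = lr / (1.0 + t) ** 0.5  # decaying step size
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
        loss = objective(w, b, X, y, epsilon, lam)
        if loss < best[0]:
            best = (loss, list(w), b)
    return best[1], best[2]


# Hypothetical per-pair scores: [TYCCL-based, HowNet-based, word2vec on news].
X = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.5, 0.4], [0.1, 0.2, 0.1]]
# Hypothetical human similarity ratings, rescaled to [0, 1].
y = [0.85, 0.15, 0.50, 0.10]

w, b = train_linear_svr(X, y)
print("weights:", w, "bias:", b)
print("ensemble similarity:", predict(w, b, [0.7, 0.6, 0.5]))
```

Each row of `X` holds the scores that the individual methods assign to one word pair, so the regressor learns how much to trust each resource. A kernel SVR from an established library would normally replace the hand-written trainer, and its parameters would need tuning, as the abstract notes.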

Notes

  1. Available at: https://code.google.com/p/word2vec/.

  2. Available at: http://ictclas.nlpir.org/.

  3. Available at: http://radimrehurek.com/gensim/index.html.

  4. Available at: https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

  5. Available at: http://www.keenage.com/.

  6. Available at: http://pennyliang.com/.

  7. Available at: http://lafnews.com/corpus/.

  8. Twenty websites were selected based on URL count, excluding two video websites.

Acknowledgments

This work is supported by Major Projects of National Social Science Fund (13&ZD174), National Social Science Fund Project (No. 14BTQ033) and the Graduate Students Education Innovation Project of Jiangsu Province (No. KYLX16_0407).

Author information

Corresponding author

Correspondence to Chengzhi Zhang.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Ma, S., Zhang, X., Zhang, C. (2016). NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning Based on Multiple Resources. In: Lin, CY., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds) Natural Language Understanding and Intelligent Applications. ICCPOL 2016, NLPCC 2016. Lecture Notes in Computer Science, vol 10102. Springer, Cham. https://doi.org/10.1007/978-3-319-50496-4_79

  • DOI: https://doi.org/10.1007/978-3-319-50496-4_79

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50495-7

  • Online ISBN: 978-3-319-50496-4

  • eBook Packages: Computer Science, Computer Science (R0)
