Constructing and validating word similarity datasets by integrating methods from psychology, brain science and computational linguistics

  • Yu Wan
  • Yidong Chen
  • Xiaodong Shi
  • Changle Zhou


Human-scored gold-standard word similarity datasets normally consist of word pairs with corresponding similarity scores. These datasets are popular resources for evaluating word similarity models, which are essential components of many natural language processing tasks. This paper proposes a novel multidisciplinary method for constructing and validating gold-standard word similarity datasets. The proposed method differs from previous ones in that it draws on methods from three disciplines, namely psychology, brain science and computational linguistics, to validate the soundness of the constructed datasets. Specifically, to the best of our knowledge, this is the first time that event-related potential (ERP) experiments have been used to validate word similarity datasets. Using the proposed method, we constructed a Chinese gold-standard word similarity dataset with 260 word pairs and demonstrated its soundness using the interdisciplinary validation methods. Although this paper focuses only on constructing a Chinese gold-standard dataset, the proposed method is applicable to other languages.
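As a rough illustration of how a dataset of this shape is commonly used, the sketch below scores a word similarity model by the Spearman rank correlation between its outputs and the human scores. This is a common evaluation choice in the field and is not claimed to be the procedure used in this paper. The word pairs, the scores, and the names `gold`, `model_scores`, and `evaluate` are all made up for the example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(gold, model_scores):
    """Correlate human scores with model scores over the gold word pairs."""
    human = [score for _, _, score in gold]
    predicted = [model_scores[(w1, w2)] for w1, w2, _ in gold]
    return spearman(human, predicted)


# Hypothetical gold-standard entries: (word1, word2, mean human score).
gold = [
    ("car", "automobile", 9.1),
    ("coast", "shore", 8.3),
    ("bird", "crane", 5.7),
    ("noon", "string", 0.6),
]

# Hypothetical model outputs (for example, cosine similarities).
model_scores = {
    ("car", "automobile"): 0.88,
    ("coast", "shore"): 0.61,
    ("bird", "crane"): 0.70,
    ("noon", "string"): 0.05,
}

print(round(evaluate(gold, model_scores), 3))
```

Rank correlation is used here rather than a direct comparison of raw values because human ratings and model scores usually live on different scales; only the ordering of the pairs is compared.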


Keywords: Word similarity; Dataset construction and validation; Multidisciplinary method; Computational linguistics; Psychology; ERPs



This work was supported by the National Natural Science Foundation of China (No. 61573294), the National Social Science Foundation of China (No. 16AZD049) and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Yu Wan (1, 2)
  • Yidong Chen (1, 2)
  • Xiaodong Shi (1, 2)
  • Changle Zhou (1, 2)

  1. Department of Cognitive Science, School of Information and Engineering, Xiamen University, Xiamen, People’s Republic of China
  2. Fujian Key Laboratory of Brain-Inspired Computing Technique and Applications, Xiamen University, Xiamen, People’s Republic of China
