Behavior Research Methods, Volume 50, Issue 2, pp 711–729

Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments



Abstract

Best-worst scaling is a judgment format in which participants are presented with a set of items and choose the superior and inferior items in the set. Best-worst scaling generates a large quantity of information per judgment, because each judgment allows for inferences about the rank value of all unjudged items. This property makes best-worst scaling a promising judgment format for research in psychology and natural language processing concerned with estimating the semantic properties of tens of thousands of words. A variety of scoring algorithms have been devised in the previous literature on best-worst scaling. However, for reasons of computational efficiency, these algorithms cannot practically be applied to cases in which thousands of items need to be scored. New algorithms are presented here for converting responses from best-worst scaling into item scores for thousands of items (many-item scoring problems). These scoring algorithms are validated through simulation and empirical experiments, and considerations related to noise, the underlying distribution of true values, and trial design are identified that can affect the relative quality of the derived item scores. The newly introduced scoring algorithms consistently outperformed scoring algorithms used in the previous literature on scoring many-item best-worst data.
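To make the judgment format concrete, the sketch below shows (a) the pairwise rankings a single best-worst trial implies, and (b) the simple "best minus worst" counting score that prior literature commonly uses as a baseline. This is a minimal illustration, not a reproduction of the paper's new scoring algorithms; the function names and the four-item trial data are hypothetical.

```python
from collections import defaultdict

def implied_pairs(trial, best, worst):
    """Pairwise rankings implied by one best-worst trial.

    Choosing `best` implies best > every other item; choosing `worst`
    implies every other item > worst. A 4-item trial thus yields 5 of
    the 6 possible pairwise comparisons from only two responses.
    """
    pairs = []
    for item in trial:
        if item != best:
            pairs.append((best, item))   # best is ranked above item
        if item not in (best, worst):
            pairs.append((item, worst))  # item is ranked above worst
    return pairs

def best_minus_worst_scores(trials):
    """Baseline counting score: (# times chosen best - # times chosen
    worst) / (# times the item appeared), per item."""
    wins = defaultdict(int)
    losses = defaultdict(int)
    appearances = defaultdict(int)
    for trial, best, worst in trials:
        wins[best] += 1
        losses[worst] += 1
        for item in trial:
            appearances[item] += 1
    return {item: (wins[item] - losses[item]) / appearances[item]
            for item in appearances}

# Hypothetical trial: four words rated for (say) valence.
trials = [(("happy", "table", "sad", "war"), "happy", "war")]
scores = best_minus_worst_scores(trials)
```

With only one trial, `happy` scores 1.0, `war` scores -1.0, and the unjudged items score 0.0 but are still bounded between the chosen extremes, which is the information gain the abstract refers to.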


Keywords: Best-worst scaling · Tournament scoring · Rank judgment · Semantics · Human judgment


Author note

Thank you to Jordan Louviere for helpful discussion on scoring best-worst judgments. Thank you to Marc Brysbaert, Emmanuel Keuleers, Svetlana Kiritchenko, Pawel Mandera, Saif Mohammad, and Chris Westbury for feedback on earlier drafts of the manuscript.



Copyright information

© Psychonomic Society, Inc. 2017

Authors and Affiliations

Department of Psychology, University of Alberta, Edmonton, Canada
