Abstract
Automatic short answer grading (ASAG) systems are designed to automatically assess short natural-language answers ranging in length from a few words to a few sentences. Many ASAG techniques have been proposed in the literature. In this paper, we critically analyse the role of evaluation measures used for assessing the quality of ASAG techniques. In real-world settings, multiple factors, such as difficulty level and diversity of student answers, vary significantly across questions, leading to different ASAG techniques emerging as superior under different evaluation measures. Building upon this observation, we propose automatically learning a mapping from questions to ASAG techniques using minimal human (expert/crowd) feedback. We do this by formulating the learning task as a contextual bandits problem and providing a rigorous regret minimization algorithm that handles key practical considerations, such as noisy experts and similarity between questions. Our approach offers the flexibility to include new ASAG systems on the fly and does not require the human expert to have knowledge of the working details of a system while providing feedback. With extensive simulations on a standard dataset, we demonstrate that our approach yields outcomes that are remarkably consistent with human evaluations.
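To make the contextual-bandits formulation concrete, the sketch below shows a minimal per-context epsilon-greedy selector: contexts stand for questions (or clusters of similar questions), arms stand for candidate ASAG techniques, and rewards stand for human feedback on a graded answer. This is an illustrative simplification under assumed names (`EpsilonGreedyContextualBandit`, `graderA`, `graderB` are hypothetical); the paper's actual algorithm additionally handles noisy experts and exploits similarity between questions.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Illustrative selector mapping contexts (questions) to arms (ASAG techniques).

    A simplified stand-in for the paper's regret-minimization algorithm:
    with probability epsilon it explores a random technique, otherwise it
    exploits the technique with the highest mean observed feedback.
    """

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        # per-context pull counts and running mean rewards per arm
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = defaultdict(lambda: defaultdict(float))

    def select(self, context):
        """Pick an ASAG technique for this question context."""
        if random.random() < self.epsilon:
            return random.choice(self.arms)  # explore
        vals = self.values[context]
        return max(self.arms, key=lambda a: vals[a])  # exploit

    def update(self, context, arm, reward):
        """Fold in human (expert/crowd) feedback as an incremental mean."""
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.values[context][arm] += (reward - self.values[context][arm]) / n
```

New ASAG systems can be added on the fly simply by appending to `arms`; the expert only supplies a reward for the produced grade and never needs to know how any technique works internally.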
Roy, S., Rajkumar, A. & Narahari, Y. Selection of automatic short answer grading techniques using contextual bandits for different evaluation measures. Int J Adv Eng Sci Appl Math 10, 105–113 (2018). https://doi.org/10.1007/s12572-017-0202-9