Progress and Challenges for Automated Scoring and Feedback Systems for Large-Scale Assessments

  • Denise Whitelock
  • Duygu Bektik
Reference work entry
Part of the Springer International Handbooks of Education book series (SIHE)


Large-scale assessment refers to tests administered to large numbers of students and used at local, state, and national levels to measure the progress of schools against educational standards. To produce accurate and fair measurements, large-scale assessment systems need to include all eligible students, which means a high volume of exam scripts must be marked by the tens of thousands of examiners appointed by exam boards. The demand for large-scale assessment, together with the high cost of manual marking and limited turnaround time, has driven the development, over many years, of automated assessment and marking. This chapter reviews the history and development of automated assessment systems. It presents findings from empirical research and highlights the theoretical considerations that emerge from such developments. In addition, the practical aspects of developing such assessments are explored, with examples primarily from the UK and USA, including the systems and tools available, the current capabilities of natural language processing (NLP) approaches, and their limitations, ethical concerns, and future potential.


Large-scale assessment · Automated assessment · Automated analysis of student scripts · Ethical issues in automated scoring



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. The Open University, Milton Keynes, UK

Section editors and affiliations

  1. Mary Webb, King's College London, London, UK
  2. Dirk Ifenthaler, University of Mannheim, Mannheim, Germany
