Evaluating the Detection of Aberrant Responses in Automated Essay Scoring

  • Mo Zhang
  • Jing Chen
  • Chunyi Ruan
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 140)


Abstract

As automated essay scoring grows in popularity, the measurement issues associated with it take on greater importance. One such issue is the detection of aberrant responses. In this study, we considered aberrant responses to be those unsuitable for machine scoring because they had characteristics that the scoring system could not process. Since no such system can yet understand language the way a human rater does, the detection of aberrant responses is important for all automated essay scoring systems. Aberrant responses can be identified before and after machine scoring is attempted (i.e., pre-screening and post-hoc screening). Such identification is essential if the technology is to be used as the primary scoring method. In this study, we investigated the functioning of a set of pre-screening advisory flags that have been used in different automated essay scoring systems. In addition, we evaluated whether the size of the human–machine discrepancy could be predicted, as a precursor to developing a general post-hoc screening method. These analyses were conducted using one scoring system as a case example. Empirical results suggested that some pre-screening advisories operated more effectively than others. With respect to post-hoc screening, relatively little scoring difficulty was found overall, thereby reducing the ability to predict human–machine discrepancy for those responses that passed through pre-screening. Limitations of the study and suggestions for future studies are also provided.
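The pre-screening idea described above can be illustrated with a minimal sketch. The flag names, thresholds, and checks below are illustrative assumptions for exposition only; they are not the advisories of the system studied in this chapter.

```python
# Hypothetical pre-screening advisory flags for essay responses.
# All flag names and thresholds are assumptions for illustration,
# not those of any operational scoring system.

def prescreen(essay: str, min_words: int = 25) -> list:
    """Return a list of advisory flags; an empty list means the
    response passes pre-screening and may be machine scored."""
    flags = []
    words = essay.split()

    # Flag responses too short for reliable feature extraction.
    if len(words) < min_words:
        flags.append("TOO_SHORT")

    # Flag responses dominated by repeated tokens (e.g., copied text).
    if words:
        unique_ratio = len(set(w.lower() for w in words)) / len(words)
        if unique_ratio < 0.5:
            flags.append("EXCESSIVE_REPETITION")

    # Flag responses with little alphabetic content (e.g., keyboard noise).
    if essay and sum(c.isalpha() for c in essay) / len(essay) < 0.5:
        flags.append("NON_TEXTUAL")

    return flags
```

A response raising any flag would be routed to human scoring rather than scored by machine; post-hoc screening would instead examine responses after a machine score has been produced.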


Keywords: Aberrant response · Automated essay scoring · Scoring difficulty



Acknowledgments

We would like to thank Shelby Haberman for his substantial help in providing guidance on the design, analyses, and interpretation. We also thank Beata Beigman Klebanov, Nitin Madnani, Don Powers, Andre Rupp, David Williamson, Randy Bennett, and Andries van der Ark for their review of and suggestions on previous versions of this manuscript.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Research and Development Division, Educational Testing Service, Princeton, NJ, USA
