Evaluating the Detection of Aberrant Responses in Automated Essay Scoring
As automated essay scoring grows in popularity, the measurement issues associated with it take on greater importance. One such issue is the detection of aberrant responses. In this study, we considered aberrant responses to be those unsuitable for machine scoring because they have characteristics that the scoring system cannot process. Because no such system can yet understand language the way a human rater does, detecting aberrant responses is important for all automated essay scoring systems. Aberrant responses can be identified both before and after machine scoring is attempted (i.e., through pre-screening and post-hoc screening). Such identification is essential if the technology is to serve as the primary scoring method. In this study, we investigated the functioning of a set of pre-screening advisory flags that have been used in different automated essay scoring systems. In addition, we evaluated whether the size of the human–machine discrepancy could be predicted, as a precursor to developing a general post-hoc screening method. These analyses were conducted using one scoring system as a case example. Empirical results suggested that some pre-screening advisories operated more effectively than others. With respect to post-hoc screening, relatively little scoring difficulty was found overall, which limited the ability to predict human–machine discrepancy for responses that passed pre-screening. Limitations of the study and suggestions for future research are also provided.
Keywords: Aberrant response · Automated essay scoring · Scoring difficulty
We would like to thank Shelby Haberman for his substantial help in providing guidance on the design, analyses, and interpretation. We also thank Beata Beigman Klebanov, Nitin Madnani, Don Powers, Andre Rupp, David Williamson, Randy Bennett, and Andries van der Ark for their reviews of, and suggestions on, previous versions of this manuscript.