Scoring Fairness in Large-Scale High-Stakes English Language Testing: An Examination of the National Matriculation English Test

Chapter in English Language Education and Assessment

Abstract

Empirical research exploring test fairness in the scoring of written performance has mostly been conducted in the North American context; little such research has been conducted in Asian countries such as China. Considering the extremely high stakes of large-scale testing in this context, this study examines which features of writing, whether intended or unintended to be measured, affected raters’ scoring decisions in the National Matriculation English Test (NMET) in China, and how. The study further explores whether rating behaviours differed between novice and experienced NMET raters. The results highlight the extent to which raters attended to the NMET rating scale, leading to a deeper understanding of scoring fairness in large-scale high-stakes tests within the Chinese context, with implications for scoring fairness in similar contexts internationally.

Notes

  1. In this chapter, scoring and rating (singular) are used interchangeably.

  2. Cronbach’s alpha for the 13 Likert-scaled questions was 0.72 (an illustrative computation follows these notes).

  3. Plagiarism was a new writing feature reported by the NMET raters. By “plagiarism”, the raters referred to the fact that certain test-takers either copied sentences from reading passages in the Reading Comprehension section of the same NMET test paper or wrote down sentences recited from NMET writing templates, the content of which may be only loosely related to the NMET writing task.

  4. Relevance was another new writing feature reported by the NMET raters. By “relevance”, the raters referred to the degree to which the test-takers’ writing matched the NMET writing topic.

  5. Experienced NMET raters refer to raters who have had NMET rating experience more than once; novice NMET raters refer to raters recruited and trained as NMET raters for the first time.

  6. Experienced EFL teacher raters refer to raters with at least 5 years of EFL teaching experience; novice EFL teacher raters are those who have taught EFL for less than 5 years.
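
For readers who want to see how the reliability figure in Note 2 is obtained, below is a minimal sketch of Cronbach’s alpha, α = k/(k − 1) × (1 − Σ s²ᵢ / s²ₜ), where k is the number of items, s²ᵢ the variance of item i, and s²ₜ the variance of respondents’ total scores, written in Python with NumPy. The response matrix is randomly generated stand-in data, not the study’s questionnaire responses, so the printed value will not reproduce the reported 0.72.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a (respondents x items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                         # number of items (13 in the study)
        item_vars = scores.var(axis=0, ddof=1)      # sample variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    # Stand-in data: 30 hypothetical respondents x 13 Likert-scaled (1-5) questions.
    # Independent random answers give an alpha near zero; real, internally
    # consistent questionnaire data would yield a value such as the reported 0.72.
    rng = np.random.default_rng(42)
    responses = rng.integers(1, 6, size=(30, 13))
    print(f"alpha = {cronbach_alpha(responses):.2f}")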

Acknowledgements

The study was supported by a SEED research grant (Liying Cheng: Principal Investigator) from the Faculty of Education, Queen’s University, Kingston, Ontario, Canada.

Author information

Corresponding author

Correspondence to Yi Mei.

Copyright information

© 2014 Springer Science+Business Media Singapore

About this chapter

Cite this chapter

Mei, Y., & Cheng, L. (2014). Scoring Fairness in Large-Scale High-Stakes English Language Testing: An Examination of the National Matriculation English Test. In: Coniam, D. (Ed.), English Language Education and Assessment. Springer, Singapore. https://doi.org/10.1007/978-981-287-071-1_11
