Scoring Fairness in Large-Scale High-Stakes English Language Testing: An Examination of the National Matriculation English Test

Chapter in English Language Education and Assessment

Abstract

Empirical research exploring test fairness in the scoring of written performance has mostly been conducted in the North American context; little such research has been conducted in Asian countries such as China. Considering the extremely high stakes of large-scale testing in this context, this study examines which features of writing, whether intended or unintended to be measured, affected raters’ scoring decisions in the National Matriculation English Test (NMET) in China, and how. The study further explores whether rating behaviours differed between novice and experienced NMET raters. The results highlight the extent to which raters attended to the NMET rating scale, leading to a deeper understanding of scoring fairness in large-scale high-stakes tests within the Chinese context, with implications for scoring fairness in similar contexts internationally.

Notes

  1. In this chapter, scoring and rating (singular) are used interchangeably.

  2. Cronbach’s alpha for the 13 Likert-scaled questions was 0.72 (an illustrative computation follows these notes).

  3. Plagiarism was a new writing feature reported by the NMET raters. By “plagiarism”, the raters referred to the fact that certain test-takers either copied sentences from reading passages in the Reading Comprehension section of the same NMET test paper or wrote down sentences recited from NMET writing templates, the content of which may be only loosely related to the NMET writing task.

  4. Relevance was another new writing feature reported by the NMET raters. By “relevance”, the raters referred to the degree to which the test-takers’ writing matched the NMET writing topic.

  5. Experienced NMET raters refer to raters who have had NMET rating experience more than once; novice NMET raters refer to raters recruited and trained as NMET raters for the first time.

  6. Experienced EFL teacher raters refer to raters with at least 5 years of EFL teaching experience; novice EFL teacher raters are those who have taught EFL for less than 5 years.
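
For readers who want to see how the reliability figure in Note 2 is obtained, below is a minimal sketch of Cronbach’s alpha, α = k/(k − 1) × (1 − Σ s²ᵢ / s²ₜ), where k is the number of items, s²ᵢ the variance of item i, and s²ₜ the variance of respondents’ total scores, written in Python with NumPy. The response matrix is randomly generated stand-in data, not the study’s questionnaire responses, so the printed value will not reproduce the reported 0.72.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a (respondents x items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                         # number of items (13 in the study)
        item_vars = scores.var(axis=0, ddof=1)      # sample variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    # Stand-in data: 30 hypothetical respondents x 13 Likert-scaled (1-5) questions.
    # Independent random answers give an alpha near zero; real, internally
    # consistent questionnaire data would yield a value such as the reported 0.72.
    rng = np.random.default_rng(42)
    responses = rng.integers(1, 6, size=(30, 13))
    print(f"alpha = {cronbach_alpha(responses):.2f}")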

Acknowledgements

The study was supported by a SEED research grant (Liying Cheng: Principal Investigator) from the Faculty of Education, Queen’s University, Kingston, Ontario, Canada.

Author information

Corresponding author

Correspondence to Yi Mei.

Copyright information

© 2014 Springer Science+Business Media Singapore

About this chapter

Cite this chapter

Mei, Y., & Cheng, L. (2014). Scoring Fairness in Large-Scale High-Stakes English Language Testing: An Examination of the National Matriculation English Test. In: Coniam, D. (Ed.), English Language Education and Assessment. Springer, Singapore. https://doi.org/10.1007/978-981-287-071-1_11
