Investigating the Use of an Empirically Derived, Binary-Choice and Boundary-Definition (EBB) Scale for the Assessment of English Language Spoken Proficiency

  • Stefan O’Grady
Part of the Second Language Learning and Teaching book series (SLLT)


The drive towards English-medium instruction in modern Turkish higher education has increased the need for valid assessment methods. However, with ever-increasing numbers of test-takers and limited time in which to assess them, collecting information about prospective students’ language proficiency can strain limited resources. Speaking ability is notoriously difficult to assess, and many universities in Turkey lack the capacity to carry out large-scale speaking assessment that is both fair and useful. A common problem in this educational context is disagreement between raters about test-takers’ spoken proficiency. The current study maps out the procedure for developing an “Empirically derived, Binary-choice, Boundary-definition” (EBB) rating scale (Turner & Upshur, 1996) for a test of second language speaking ability. An EBB scale is derived empirically and requires raters to make binary choices that define the boundaries between score levels. Two groups of trained raters used the EBB scale and an analytical scale to assess speaking samples from 30 English language learners studying in the English preparatory program of a Turkish university. The test results were analyzed using multifaceted Rasch analysis. The EBB scale produced higher levels of inter-rater and intra-rater consistency than the analytical scale. These findings indicate that EBB rating scales may be a viable alternative to traditional rating scales for large-scale testing programs in English-medium universities.


Keywords: Second language speech assessment; Rating scales; Rasch analysis


  1. Baddeley, A. (2007). Working memory, thought, and action. Oxford: Oxford University Press.
  2. Ducasse, A. (2009). Raters as scale makers for an L2 Spanish speaking test: Using paired discourse to develop a rating scale for communicative interaction. In A. Brown & K. Hill (Eds.), Tasks and criteria in performance assessment (pp. 15–39). Frankfurt: Peter Lang.
  3. Field, J. (2011). Cognitive validity. In L. Taylor (Ed.), Examining speaking (Studies in Language Testing 30) (pp. 65–112). Cambridge: Cambridge University Press.
  4. Fulcher, G. (2003). Testing second language speaking. London: Pearson.
  5. Fulcher, G., & Davidson, F. (2007). Language testing and assessment. London: Routledge.
  6. Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information processing approach to task design. Language Learning, 51(3), 401–436.
  7. Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Retrieved from
  8. Linacre, J. (2013). A user’s guide to FACETS: Rasch-Model Computer Programs (Program Manual 3.71.0). Retrieved from
  9. Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
  10. McNamara, T. (1996). Measuring second language performance. London: Longman.
  11. Purpura, J. (2016). Second and foreign language assessment. The Modern Language Journal, 100, 190–208.
  12. Read, J. (2015). Assessing English proficiency for university study. Basingstoke: Palgrave Macmillan.
  13. Selvi, A. F. (2014). Medium of instruction debate in Turkey: Oscillating between national ideas and bilingual ideals. Current Issues in Language Planning, 15(2), 133–152.
  14. Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency and lexis. Applied Linguistics, 30(4), 510–532.
  15. Turner, C., & Upshur, J. (1996). Developing rating scales for the assessment of second language performance. Australian Review of Applied Linguistics, 13, 55–79.
  16. Turner, C., & Upshur, J. (2002). Rating scales derived from student samples: Effects of the scale maker and the student sample on scale content and student scores. TESOL Quarterly, 36(1), 49–70.
  17. Weir, C. (2005). Language testing and validation. Basingstoke: Palgrave Macmillan.
  18. Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Retrieved from

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Izmir University of Economics, Izmir, Turkey
