Abstract

The purpose of this chapter is to orient readers to reliability considerations specific to instruments and data coding practices in applied linguistics (AL) research. To that end, the chapter begins with a general discussion of the different types of reliability (both internal and external to an instrument itself), including the indices and models used to estimate reliability and their respective interpretations. Methods for improving the reliability of data coding and instrument scoring practices are then discussed, followed by a summary of best practices in coder/rater training and norming. Throughout, the chapter outlines guidelines for addressing common limitations in reliability analysis and reporting in AL research, including suggestions for handling these issues in operational contexts.

Notes

  1.

    In this chapter, we use the general term instrument to encompass the variety of tools that may be employed in empirical studies, such as tests, surveys, performance assessments, and questionnaires. We also use more specific terms where a more precise illustration is helpful or required.

  2.

    It should also be noted that in some areas of AL research (e.g., large-scale language assessment), certain approaches to reliability analysis, such as parallel-forms or test-retest reliability, are becoming increasingly obsolete as item- and task-banking and computer-adaptive testing replace former assessment delivery methods, such as traditional paper-and-pencil tests and even some first-generation computer-based tests. Most of the data collected from these newer large-scale test-delivery systems have properties that make traditional approaches to reliability estimation inefficient, if not impossible. Psychometricians and other measurement professionals charged with analyzing such data typically use item response theory (IRT) or Rasch measurement in their analyses, as each test-taker may, in theory, receive a unique set of items or tasks on any given occasion. As access to this type of technology is rare in most AL research contexts, we have chosen to present approaches to reliability analysis here that are accessible to most, if not all, AL researchers.

  3.

    Most statistics discussed in this chapter are easily obtained using statistical software, such as SPSS (SPSS, Inc.), and thus do not require hand calculation. However, for the formulae underlying the different types of reliability, or for how to calculate reliability statistics by hand, please see the Resources for Further Reading at the end of the chapter.
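
For readers who prefer to script these calculations rather than rely on a point-and-click package such as SPSS, the sketch below shows how two reliability statistics commonly reported in AL research, Cronbach's alpha (internal consistency) and Cohen's kappa (inter-coder agreement), might be computed. This is a minimal illustration, not part of the original chapter: the data, function names, and the choice of Python with NumPy are assumptions made for the example, and the calculations follow the standard formulas for each coefficient.

```python
# Minimal sketch (illustrative only): Cronbach's alpha and Cohen's kappa
# computed from made-up data, using only NumPy.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: a respondents-by-items matrix of scores."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def cohen_kappa(coder_a, coder_b) -> float:
    """Chance-corrected agreement between two coders on the same observations."""
    a, b = np.asarray(coder_a), np.asarray(coder_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)
    # Expected agreement under independence, from each coder's marginal proportions.
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

if __name__ == "__main__":
    # Hypothetical data: 6 respondents answering 4 Likert-type questionnaire items.
    responses = np.array([
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 5, 5, 4],
        [2, 2, 3, 2],
        [4, 4, 4, 5],
        [3, 2, 3, 3],
    ])
    print(f"Cronbach's alpha: {cronbach_alpha(responses):.3f}")

    # Hypothetical data: two coders assigning categorical codes to 10 utterances.
    coder_1 = ["request", "refusal", "request", "apology", "request",
               "refusal", "apology", "request", "refusal", "apology"]
    coder_2 = ["request", "refusal", "request", "apology", "refusal",
               "refusal", "apology", "request", "request", "apology"]
    print(f"Cohen's kappa:    {cohen_kappa(coder_1, coder_2):.3f}")
```

Both functions return a single coefficient; whether a given value is adequate for a particular instrument or coding scheme is a matter of interpretation taken up in the body of the chapter.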

Author information

Correspondence to Kirby C. Grabowski.

Copyright information

© 2018 The Author(s)

About this chapter

Cite this chapter

Grabowski, K.C., Oh, S. (2018). Reliability Analysis of Instruments and Data Coding. In: Phakiti, A., De Costa, P., Plonsky, L., Starfield, S. (eds) The Palgrave Handbook of Applied Linguistics Research Methodology. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-137-59900-1_24

  • DOI: https://doi.org/10.1057/978-1-137-59900-1_24

  • Publisher Name: Palgrave Macmillan, London

  • Print ISBN: 978-1-137-59899-8

  • Online ISBN: 978-1-137-59900-1

  • eBook Packages: Social Sciences (R0)
