Reliability

Abstract

The measurement precision of test scores has a within-person and a between-persons aspect. The standard error of measurement assesses the precision with which a single test taker is measured, whereas reliability assesses how well a test differentiates among the test takers of a population. Reliability applies to the observed test score, to the difference between two scores (e.g., pretest and posttest), and to item response model estimates of a latent trait value.

Classical Test Theory (CTT) provides two theoretical definitions and one operational definition of reliability. Theoretically, the reliability of the observed test score is the squared product-moment correlation (pmc) between observed and true test scores in a population of test takers. Under the assumptions of CTT, this squared pmc equals the ratio of the between-persons true score variance to the observed score variance in that population. Neither theoretical definition can be used to compute the reliability coefficient, because true scores are not observed. Under the assumptions of CTT, both theoretical definitions equal the pmc between two parallel tests, and this operational definition is used to compute the reliability coefficient. Similar theoretical definitions are given for the difference score and the latent trait estimate. Operational definitions are not needed for these, because their reliability can be computed without parallel tests.

Some counterintuitive properties of reliability are discussed. First, high reliability of the observed test score does not guarantee that the test is unidimensional. Second, lower reliability does not imply less precise estimation of test taker parameters (i.e., the test taker's true score, true difference score, and latent trait value) or of population parameters (i.e., the means of the test score, difference score, and latent trait estimate). Third, lower reliability does not imply lower power of tests of the null hypothesis of equal mean scores in two groups (e.g., experimental and control groups). Fourth, reliability applies to continuous latent variables (latent traits) but has to be adapted for discrete latent variables (latent classes): accuracy and consistency of test score classification (e.g., masters and nonmasters of a skill) are comparable to the theoretical and operational definitions of reliability, respectively, but consistency cannot be used to assess accuracy.

Reliability is a between-persons concept. It is relevant to the measurement of individual differences, but does not apply to other situations.
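
The agreement among the three CTT definitions of observed-score reliability described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the chapter. It assumes normally distributed true scores and independent, equal-variance errors on two parallel tests, and the function names, sample size, and standard deviations are hypothetical choices made only for illustration.

```python
import random
from statistics import fmean


def variance(x):
    """Population variance of a sequence."""
    m = fmean(x)
    return sum((a - m) ** 2 for a in x) / len(x)


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n=50_000, true_sd=3.0, error_sd=2.0, seed=1):
    """Simulate CTT scores X = T + E and compare the reliability definitions.

    Assumed setup (illustrative only): true scores T are normal, and two
    parallel tests share T but have independent errors with equal variance.
    """
    rng = random.Random(seed)
    true = [rng.gauss(50.0, true_sd) for _ in range(n)]
    test_1 = [t + rng.gauss(0.0, error_sd) for t in true]
    test_2 = [t + rng.gauss(0.0, error_sd) for t in true]
    return {
        # Population value implied by the simulation settings.
        "theoretical": true_sd**2 / (true_sd**2 + error_sd**2),
        # Definition 1: squared pmc between observed and true scores.
        "squared_corr": pearson(test_1, true) ** 2,
        # Definition 2: true score variance over observed score variance.
        "variance_ratio": variance(true) / variance(test_1),
        # Operational definition: pmc between two parallel tests.
        "parallel_corr": pearson(test_1, test_2),
    }


if __name__ == "__main__":
    for name, value in simulate_reliability().items():
        print(f"{name:>15}: {value:.3f}")
```

With a large simulated sample, the three estimates should all lie close to the value implied by the chosen variances. Only the parallel-test correlation could be computed from real data, because the first two quantities require the unobserved true scores.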

Author information

Corresponding author

Correspondence to Gideon J. Mellenbergh.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mellenbergh, G.J. (2019). Reliability. In: Counteracting Methodological Errors in Behavioral Research. Springer, Cham. https://doi.org/10.1007/978-3-030-12272-0_15
