# Reliability

## Abstract

The measurement precision of test scores has a within-person and a between-persons aspect. The standard error of measurement expresses how precisely a single test taker is measured, whereas reliability expresses how well a test differentiates among the test takers of a population. Reliability applies to the observed test score, to the difference between two scores (e.g., pretest and posttest), and to item response model estimates of a latent trait value.

Classical test theory (CTT) provides two theoretical definitions and one operational definition of reliability. Theoretically, the reliability of the observed test score is the squared product-moment correlation (pmc) between observed and true test scores in a population of test takers. Under the assumptions of CTT, this squared pmc equals the ratio of the between-persons true score variance to the observed score variance in that population. Because true scores are unobservable, neither theoretical definition can be used to compute the reliability coefficient. Under the assumptions of CTT, both theoretical definitions equal the pmc between two parallel tests, and this operational definition is used to compute the reliability coefficient. Similar theoretical definitions are given for the difference score and the latent trait estimate. Operational definitions are not needed in these cases, because the reliability of the difference score and of the latent trait estimate can be computed without parallel tests.

Some counterintuitive properties of reliability are discussed. First, high reliability of the observed test score does not guarantee that the test is unidimensional. Second, lower reliability does not imply less precise estimation of test taker parameters (i.e., the true score, true difference score, and latent trait value) or of population parameters (i.e., the means of the test score, difference score, and latent trait estimate). Third, lower reliability does not imply lower power of tests of the null hypothesis that two groups (e.g., an experimental and a control group) have equal mean scores. Fourth, reliability applies to continuous latent variables (latent traits) but has to be adapted to discrete latent variables (latent classes): accuracy and consistency of test score classification (e.g., of masters and nonmasters of a skill) are comparable to the theoretical and operational definitions of reliability, respectively, but consistency cannot be used to assess accuracy.

Reliability is a between-persons concept. It is relevant to the measurement of individual differences but does not apply to other situations.
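The equivalence of the three CTT definitions of observed-score reliability can be illustrated with a small simulation. The sketch below is not taken from the entry itself. It assumes normally distributed true scores and errors, and the function names (`pearson`, `simulate`) and parameter values (`sd_true`, `sd_error`, population size) are hypothetical choices made only for illustration. Each simulated person has a true score, and two parallel tests are formed by adding independent errors with the same variance.

```python
import random
import statistics


def pearson(x, y):
    """Product-moment correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_persons=50_000, sd_true=3.0, sd_error=2.0, seed=1):
    """Return three reliability estimates for a simulated population.

    Illustrative assumptions: normal true scores and normal errors,
    with errors independent of true scores and of each other.
    """
    rng = random.Random(seed)
    true = [rng.gauss(50.0, sd_true) for _ in range(n_persons)]
    # Two parallel tests: same true score, independent equal-variance errors.
    x1 = [t + rng.gauss(0.0, sd_error) for t in true]
    x2 = [t + rng.gauss(0.0, sd_error) for t in true]

    squared_corr = pearson(x1, true) ** 2  # theoretical definition 1
    variance_ratio = statistics.pvariance(true) / statistics.pvariance(x1)  # definition 2
    parallel_corr = pearson(x1, x2)  # operational definition
    return squared_corr, variance_ratio, parallel_corr


if __name__ == "__main__":
    labels = ("squared corr(X, T)", "var(T) / var(X)", "corr(X, X')")
    for label, value in zip(labels, simulate()):
        print(f"{label:>20}: {value:.3f}")
```

With the default values, all three quantities should lie close to 9 / (9 + 4) ≈ 0.69, up to sampling error. Only the third can be computed from real data, because it does not require the true scores. Calling `simulate(sd_true=1.0)` lowers the expected reliability to about 0.20 although the error standard deviation, and hence the standard error of measurement for each person, is unchanged. This is in line with the point that lower reliability does not by itself imply less precise measurement of an individual test taker.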

## Keywords

- Classical test theory (CTT)
- Classification accuracy
- Classification consistency
- Difference score
- Latent classes
- Latent trait
- Measurement precision
- Operational definition of reliability
- Standard error of measurement
- Theoretical definitions of reliability