The measurement precision of test scores has within-person and between-persons aspects. The standard error of measurement assesses the precision of the measurement of a single test taker, whereas reliability assesses the differentiation among the test takers of a population. Reliability applies to the observed test score, the difference between two (e.g., pretest and posttest) scores, and item response model estimates of a latent trait value. Classical Test Theory (CTT) developed two theoretical definitions and one operational definition of reliability. Theoretically, the reliability of the observed test score is the squared product moment correlation (pmc) between observed and true test scores in a population of test takers. Under the assumptions of CTT, this squared pmc equals the ratio of the between-persons true score variance to the observed test score variance in that population. Because true scores are unobservable, these two definitions cannot be used to compute the reliability coefficient. Under the assumptions of CTT, however, both theoretical definitions are equal to the pmc between two parallel tests. This operational definition is used to compute the reliability coefficient. Similar theoretical definitions are given for the difference score and the latent trait estimate. Operational definitions are not needed for these, because parallel tests are not needed to compute the reliability of the difference score and the latent trait estimate. Some counterintuitive properties of reliability are discussed. First, high reliability of the observed test score does not guarantee that the test is unidimensional. Second, lower reliability does not imply lower estimation precision of test taker parameters (i.e., his or her true score, true difference score, and latent trait value) or of population parameters (i.e., the means of the test score, difference score, and latent trait estimate). Third, lower reliability does not imply lower power of tests of the null hypothesis of equal mean scores of two (e.g., experimental and control) groups. Fourth, reliability applies to continuous latent variables (latent traits), but has to be adapted to discrete latent variables (latent classes): accuracy and consistency of test score classification (e.g., masters and nonmasters of a skill) are comparable to the theoretical and operational definitions of reliability, respectively, but consistency cannot be used to assess accuracy. Reliability is a between-persons concept. It is relevant within the context of the measurement of individual differences, but does not apply to other situations.
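The equality of the two theoretical definitions and the operational definition can be illustrated with a small simulation. The sketch below is only an illustration under assumed conditions that are not taken from the chapter: normally distributed true scores and errors, errors independent of true scores and of each other, and hypothetical standard deviations of 3 (true score) and 2 (error). The function names `pearson` and `simulate_reliability` are likewise made up for this example. In a simulation the true scores are known, so all three quantities can be computed; with real data only the parallel-test correlation is available.

```python
import random
import statistics


def pearson(x, y):
    """Product moment correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n_persons=20000, sd_true=3.0, sd_error=2.0, seed=1):
    """Compare three definitions of reliability on simulated CTT data.

    All distributional choices here are assumptions made for illustration.
    """
    rng = random.Random(seed)
    # True scores of the simulated population of test takers.
    true = [rng.gauss(50.0, sd_true) for _ in range(n_persons)]
    # Two parallel tests: same true score, independent errors, equal error variance.
    x1 = [t + rng.gauss(0.0, sd_error) for t in true]
    x2 = [t + rng.gauss(0.0, sd_error) for t in true]
    return {
        # Theoretical definition 1: squared pmc between observed and true scores.
        "squared_corr": pearson(x1, true) ** 2,
        # Theoretical definition 2: true score variance over observed score variance.
        "variance_ratio": statistics.pvariance(true) / statistics.pvariance(x1),
        # Operational definition: pmc between two parallel tests.
        "parallel_corr": pearson(x1, x2),
        # Value implied by the assumed variances.
        "population_value": sd_true**2 / (sd_true**2 + sd_error**2),
    }


result = simulate_reliability()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With the assumed variances, the value implied by the model is 9/13. The three sample quantities should lie close to it and to each other, differing only by sampling error.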
Keywords: Classical test theory (CTT); Classification accuracy; Classification consistency; Difference score; Latent classes; Latent trait; Measurement precision; Operational definition of reliability; Standard error of measurement; Theoretical definitions of reliability