Measurement Fundamentals: Reliability and Validity

Evaluation Methods in Biomedical and Health Informatics

Part of the book series: Health Informatics (HI)

Abstract

This chapter begins an in-depth study of measurement by introducing the “classical theory” of measurement, which suits this book’s level of discussion. The classical theory hinges on the concepts of reliability and validity as the indices of the quality of the measurement process. Accordingly, this chapter develops both concepts and introduces methods for determining the reliability and validity of any given measurement process.


Notes

  1. Technically, there is a difference between a scale and an index (Crossman 2019), but for purposes of this discussion the terms can be used interchangeably. Also note that the term scale has two uses in measurement. In addition to the definition given above, scale can also refer to the set of response options from which one chooses when completing a rating form or questionnaire. In popular parlance, one might be asked to “respond on a scale of 1–10” to indicate how satisfied you are with an information resource. Usually, it is possible to infer the sense in which the term “scale” is being used from the context of the statement.

  2. The object score can also be computed as the summed score of all observations. Using the mean and summed scores will yield the same reliability results as long as there are no missing observations.

  3. Those familiar with the concept of confidence intervals might think of the observed score ±1 standard error of measurement as the 68% confidence interval for the true score. The 95% confidence interval would be approximately the observed score plus or minus two standard errors of measurement.

  4. The basics of ANOVA are discussed in Chap. 12 of this book.
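The confidence-interval arithmetic in note 3 can be sketched in a few lines of Python. The standard deviation, reliability, and observed score below are hypothetical values for illustration, not figures from the chapter:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD of observed scores * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical values: observed-score SD of 10 and reliability of 0.84.
s = sem(10.0, 0.84)          # 10 * sqrt(0.16) = 4.0
observed = 50.0
ci68 = (observed - s, observed + s)           # ~68% CI for the true score
ci95 = (observed - 2 * s, observed + 2 * s)   # ~95% CI, per note 3
```

The ±2 standard errors for the 95% interval is the approximation used in note 3; 1.96 would be the more precise multiplier.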

References

  • Boateng GO, Neilands TB, Frongillo EA, Melgar-Quiñonez HR, Young SL. Best practices for developing and validating scales for health, social, and behavioral research: a primer. Front Public Health. 2018;149:1–18.

  • Briesch AM, Swaminathan H, Welsh M, Chafouleas SM. Generalizability theory: a practical guide to study design, implementation, and interpretation. J Sch Psychol. 2014;52:13–35.

  • Clarke JR, Cebula DP, Webber BL. Artificial intelligence: a computerized decision aid for trauma. J Trauma. 1988;28:1250–4.

  • Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.

  • Crossman A. The differences between indexes and scales: definitions, similarities, and differences. ThoughtCo. 2019. https://www.thoughtco.com/indexes-and-scales-3026544. Accessed 10 Jun 2021.

  • Dykes PC, Hurley A, Cashen M, Bakken S, Duffy ME. Development and psychometric evaluation of the impact of health information technology (I-HIT) scale. J Am Med Inform Assoc. 2007;14:507–14.

  • Kerlinger FN. Foundations of behavioral research. New York: Holt, Rinehart and Winston; 1986.

  • Kimberlin CL, Winterstein AG. Validity and reliability of measurement instruments used in research. Am J Health-Syst Pharmacy. 2008;65:2276–84.

  • Moriarity DP, Alloy LB. Back to basics: the importance of measurement properties in biological psychiatry. Neurosci Biobehav Rev. 2021;123:72–82.

  • Orr K, Howe HS, Omran J, Smith KA, Palmateer TM, Ma AE, et al. Validity of smartphone pedometer applications. BMC Res Notes. 2015;8:1–9.

  • Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. Am Psychol. 1989;44:922–32.

  • Thanasegaran G. Reliability and validity issues in research. Integr Dissemin. 2009;4:35–40.

  • Tractenberg RE. Classical and modern measurement theories, patient reports, and clinical outcomes. Contemp Clin Trials. 2010;31:1–3.

  • VandenBos GR. Validity. APA Dictionary of Psychology. Washington, DC: American Psychological Association; 2007. https://dictionary.apa.org/validity. Accessed 10 Apr 2021.

  • Watson JC. Establishing evidence for internal structure using exploratory factor analysis. Measure Eval Counsel Develop. 2017;50:232–8.

  • Weinstein MC, Fineberg HV, Elstein AS, Frazier HS, Neuhauser D, Neutra RR, et al. Clinical decision analysis. Philadelphia: W.B. Saunders; 1980.


Author information

Correspondence to Charles P. Friedman.

Electronic Supplementary Material

Data 7.1

(XLSX 54 kb)

Data 7.2

(PDF 368 kb)

Answers to Self-Tests

Self-Test 7.1

  1.

    (a) Adding a constant has no effect on the standard error of measurement, as it affects neither the standard deviation nor the reliability.

    (b) Multiplication by a constant multiplies the standard error of measurement by that same constant.

  2.

    (a) The scores are 13, 13, 10, 15, 6, 8 for Objects A–F. The standard deviation of the six scores is .86.

    (b) .30.

    (c) The reliability would increase because the scores for Object 1, across observations, become more consistent. The reliability in fact increases to 0.92.

    (d) A decrease in reliability.
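The effect of linear transformations described in answers 1(a) and 1(b) can be demonstrated numerically. The scores and reliability below are hypothetical; the key assumption, consistent with answer 1(a), is that a linear transformation leaves the reliability coefficient itself unchanged:

```python
import math
import statistics

def sem(scores, reliability):
    """Standard error of measurement from a list of observed scores."""
    sd = statistics.pstdev(scores)   # population SD of observed scores
    return sd * math.sqrt(1.0 - reliability)

scores = [4, 7, 5, 9, 6]   # hypothetical observed scores
r = 0.80                   # hypothetical reliability (unaffected by linear transforms)

base    = sem(scores, r)
shifted = sem([x + 5 for x in scores], r)   # adding a constant: SEM unchanged
scaled  = sem([3 * x for x in scores], r)   # multiplying by 3: SEM triples
```

Adding 5 leaves the standard deviation (and hence the SEM) untouched, while multiplying by 3 triples the standard deviation and therefore the SEM.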

Self-Test 7.2

  1.

    (a) 0.95.

    (b) 0.65.

  2.

    (a) In this case, the tasks are the objects and the testers are the observations.

    (b) In a perfectly reliable measurement, all observations have the same value for a given object. So Tester 2 would also give a rating of “4” for Task 1.

    (c) By the Prophecy Formula, the estimated reliability would be .78.

  3.

    (a) The attribute is, for a given method and test sequence, the percentage of carbon atoms within the threshold distance. The observations are the test sequences. The objects are the prediction methods.

    (b) The matrix would have 14 columns corresponding to the test sequences as observations and 123 rows corresponding to prediction methods as objects.

    (c) A very high reliability, on the order of .9, would be sought. The demonstration study seeks to rank order the objects themselves, as opposed to comparing groups of objects. This suggests the use of a large number of test sequences.
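The “Prophecy Formula” invoked in answer 2(c) is the Spearman-Brown prophecy formula from classical test theory. A minimal sketch, using illustrative numbers rather than the values from the exercise (whose inputs are not shown here):

```python
def prophecy(reliability, k):
    """Spearman-Brown prophecy formula: projected reliability when the
    number of observations is multiplied by a factor of k."""
    return k * reliability / (1.0 + (k - 1.0) * reliability)

# Illustrative only: doubling the observations of a measurement with
# reliability 0.60 projects a reliability of 0.75.
projected = prophecy(0.60, 2)
```

Note that k is the multiplier on the number of observations, not the new number of observations itself, and k below 1 projects the loss of reliability from dropping observations.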

Self-Test 7.3

  1.

    It is possible that other types of movement, besides whatever the study team defines as “walking,” would count as walking steps. These movements might include climbing stairs, standing up or sitting down, or cycling. Also, computation of distance walked will be influenced by stride length, which is in turn related to a person’s height. So the person’s height could be a “polluting” construct.

  2.

    (a) Criterion-related validity. The number of comments generated by the IRB, which would indicate a person’s ability to produce a highly compliant research protocol, could be considered a criterion for validation in this case. The direction of the relationship between scores on the test and number of comments would be negative: greater knowledge should generate smaller numbers of comments.

    (b) Construct validity. The validation process is based on hypothesized relationships, of varying strength, between the attribute of interest (computer literacy) and four other variables.

    (c) Content validity.

  3.

    (a) Some examples:

    On a scale of 1–5 (5 being highest), rate the effect DIATEL has had on your HbA1c levels.

    Rate from strongly agree to strongly disagree:

    “I am healthier because of my use of DIATEL”

    “I plan to continue to use DIATEL”

    (b) This study can be done by using each item of the questionnaire as an “independent” observation. A representative sample of diabetic patients who have used DIATEL would be recruited. The patients would complete the questionnaire. The data would be displayed in an objects (patients) by observations (items) matrix. From this, the reliability coefficient can be computed.

    (c) Using the Prophecy Formula, the reliability with five items is predicted to be .54.

    (d) Examples of external data include: the number of DIATEL features patients use (Do patients who use more features give higher ratings on the questionnaire?); whether patients use DIATEL in the future (Do patients’ ratings on the questionnaire predict future use?); and whether patients mention DIATEL to their clinicians (Do patients who rate DIATEL higher mention it more often?).
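The reliability computation described in answer 3(b), over an objects (patients) by observations (items) matrix, is commonly carried out with Cronbach's coefficient alpha (Cronbach 1951). A minimal sketch with a made-up ratings matrix, not data from the exercise:

```python
from statistics import variance

def cronbach_alpha(matrix):
    """Cronbach's alpha for an objects-by-observations matrix
    (rows = objects, e.g. patients; columns = observations, e.g. items)."""
    k = len(matrix[0])                                  # number of items
    item_vars = [variance([row[j] for row in matrix]) for j in range(k)]
    total_var = variance([sum(row) for row in matrix])  # variance of object totals
    return (k / (k - 1.0)) * (1.0 - sum(item_vars) / total_var)

# Made-up ratings: 4 patients (rows) x 3 questionnaire items (columns)
ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
alpha = cronbach_alpha(ratings)
```

When every item ranks the objects identically, alpha reaches 1.0; inconsistent items inflate the summed item variances relative to the total-score variance and pull alpha down.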


Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Friedman, C.P., Wyatt, J.C., Ash, J.S. (2022). Measurement Fundamentals: Reliability and Validity. In: Evaluation Methods in Biomedical and Health Informatics. Health Informatics. Springer, Cham. https://doi.org/10.1007/978-3-030-86453-8_7


  • DOI: https://doi.org/10.1007/978-3-030-86453-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86452-1

  • Online ISBN: 978-3-030-86453-8

  • eBook Packages: Medicine, Medicine (R0)
