Current Epidemiology Reports, Volume 6, Issue 1, pp 14–22

Sampling and Sampling Frames in Big Data Epidemiology

  • Stephen J. Mooney
  • Michael D. Garber
Genetic Epidemiology (C Amos, Section Editor)
Part of the following topical collections:
  1. Topical Collection on Genetic Epidemiology


Purpose of Review

The ‘big data’ revolution affords the opportunity to reuse administrative datasets for public health research. While such datasets offer dramatically increased statistical power compared with conventional primary data collection, typically at much lower cost, their use also raises substantial inferential challenges. In particular, it can be difficult to make population inferences because the sampling frames for many administrative datasets are undefined. We reviewed options for accounting for sampling in big data epidemiology.

Recent Findings

We identified three common strategies for accounting for sampling when the available data were not collected from a deliberately constructed sample: (1) explicitly reconstruct the sampling frame, (2) test the potential impacts of sampling using sensitivity analyses, and (3) limit inference to the sample.
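As a rough illustration of how the first two strategies might look in practice, the sketch below reweights a convenience sample to match an assumed target population (strategy 1) and then varies that assumption to see how far the estimate moves (strategy 2). The data, the stratum labels, the population shares, and the helper names (`poststratification_weights`, `weighted_mean`, `sensitivity_range`) are all hypothetical and chosen only for illustration; post-stratification is one of several possible weighting approaches and is not necessarily the one used in the reviewed studies.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Return one weight per record so that the weighted stratum shares
    in the sample match the assumed shares in the target population."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return [population_shares[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    """Weighted mean of the outcome values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sensitivity_range(values, sample_strata, candidate_old_shares):
    """Recompute the weighted estimate under several assumed population
    shares for the 'old' stratum and return the (min, max) estimates."""
    estimates = []
    for share in candidate_old_shares:
        shares = {"young": 1 - share, "old": share}
        weights = poststratification_weights(sample_strata, shares)
        estimates.append(weighted_mean(values, weights))
    return min(estimates), max(estimates)


# Made-up convenience sample: 80% of records fall in the "young" stratum.
strata = ["young"] * 8 + ["old"] * 2
outcome = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 1 = has the condition

# Strategy 3 (limit inference to the sample): the unweighted prevalence
# describes only the records in hand.
naive = sum(outcome) / len(outcome)

# Strategy 1 (reconstruct the frame): assume, hypothetically, that the
# target population is split evenly between the two strata.
assumed_shares = {"young": 0.5, "old": 0.5}
weights = poststratification_weights(strata, assumed_shares)
adjusted = weighted_mean(outcome, weights)

# Strategy 2 (sensitivity analysis): vary the assumed "old" share.
low, high = sensitivity_range(outcome, strata, [0.3, 0.5, 0.7])

print(f"naive={naive:.4f} adjusted={adjusted:.4f} range=({low:.4f}, {high:.4f})")
```

The gap between the naive and adjusted estimates, and the width of the range under different assumed shares, give a sense of how much the conclusions depend on what is assumed about the sampling frame.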


Summary

Inference from big data can be challenging because the impacts of sampling are unclear. Attention to sampling frames can minimize risks of bias.


Keywords

Big data, Research methods, Sampling, Sampling frames, Secondary data


Funding Information

This work was supported by grants from the National Library of Medicine (1K99LM012868) and the National Heart, Lung, and Blood Institute (F31HL143900). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Compliance with Ethical Standards

Conflict of Interest

Stephen J. Mooney reports grants from the National Library of Medicine and the Better Bike Share Coalition during the conduct of the study. Michael D. Garber reports grants from the National Heart, Lung, and Blood Institute and the American College of Sports Medicine during the conduct of the study.

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Epidemiology, University of Washington, Seattle, USA
  2. Harborview Injury Prevention and Research Center, University of Washington, Seattle, USA
  3. Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, USA
