Sampling and Sampling Frames in Big Data Epidemiology
Purpose of Review
The ‘big data’ revolution affords the opportunity to reuse administrative datasets for public health research. While such datasets offer dramatically increased statistical power compared with conventional primary data collection, typically at much lower cost, their use also raises substantial inferential challenges. In particular, it can be difficult to make population inferences because the sampling frames for many administrative datasets are undefined. We reviewed options for accounting for sampling in big data epidemiology.
We identified three common strategies for accounting for sampling when the available data were not collected from a deliberately constructed sample: (1) explicitly reconstruct the sampling frame, (2) test the potential impacts of sampling using sensitivity analyses, and (3) limit inference to the sample itself.
Inference from big data can be challenging because the impacts of sampling are unclear. Attention to sampling frames can minimize risks of bias.
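As a rough illustration of the first two strategies, the Python sketch below reweights stratum-specific prevalences to an assumed population composition (a simple post-stratification) and then repeats the calculation under several assumed compositions as a sensitivity analysis. The stratum names, counts, and population shares are hypothetical values chosen for illustration, not data from the review, and the function names are our own.

```python
from typing import Dict, Tuple

# stratum -> (number of cases, number of people observed in the dataset)
Sample = Dict[str, Tuple[int, int]]


def crude_prevalence(sample: Sample) -> float:
    """Prevalence in the dataset as observed, ignoring how people entered it."""
    cases = sum(c for c, _ in sample.values())
    total = sum(n for _, n in sample.values())
    return cases / total


def poststratified_prevalence(sample: Sample, shares: Dict[str, float]) -> float:
    """Weight each stratum's prevalence by its assumed share of the target population."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * sample[s][0] / sample[s][1] for s, share in shares.items())


# Hypothetical administrative dataset: 40% of records come from the older stratum.
sample: Sample = {"younger": (30, 600), "older": (80, 400)}

# Strategy 1: reweight to an assumed (reconstructed) population composition.
assumed_shares = {"younger": 0.8, "older": 0.2}
reweighted = poststratified_prevalence(sample, assumed_shares)

# Strategy 2: sensitivity analysis over plausible population compositions.
scenarios = {
    share_older: poststratified_prevalence(
        sample, {"younger": 1.0 - share_older, "older": share_older}
    )
    for share_older in (0.2, 0.4, 0.6)
}

if __name__ == "__main__":
    print(f"crude: {crude_prevalence(sample):.3f}")
    print(f"reweighted: {reweighted:.3f}")
    for share_older, estimate in scenarios.items():
        print(f"assumed older share {share_older:.1f}: {estimate:.3f}")
```

If the estimate shifts materially across plausible compositions, as it does in this toy example, conclusions about the wider population depend heavily on the unobserved sampling process, and limiting inference to the sample may be the safer choice.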
Keywords: Big data; Research methods; Sampling; Sampling frames; Secondary data
This work was supported by grants from the National Library of Medicine (1K99LM012868) and the National Heart, Lung, and Blood Institute (F31HL143900). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Compliance with Ethical Standards
Conflict of Interest
Stephen J. Mooney reports grants from the National Library of Medicine and the Better Bike Share Coalition during the conduct of the study. Michael D. Garber reports grants from the National Heart, Lung, and Blood Institute and the American College of Sports Medicine during the conduct of the study.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by any of the authors.