Abstract
Purpose of Review
Incomplete data are a common problem in the statistical analysis of environmental epidemiological research, yet many researchers still ignore this complication. We evaluate the performance of two commonly used multiple imputation (MI) methods, fully conditional specification (FCS) and multivariate normal (MVN) imputation, for handling missing data and compare them with complete case analysis (CCA). We further discuss issues that arise when these methods are used.
Recent Findings
MI is a simulation-based approach to dealing with incomplete data. It replaces the missing values with plausible values and accounts for the additional uncertainty caused by the missing information. In general, MI performs better than ad hoc techniques such as CCA. To illustrate this, we use data on 944 women from the Collaborative Perinatal Project (CPP) and compare estimates across these methods. The goal is to examine whether each of two outcomes in the data set, birth weight and spontaneous abortion (SA), is associated with mothers’ smoking status during pregnancy, adjusting for baseline covariates in the model.
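The way MI carries the extra uncertainty from missing information into the final estimate can be sketched with the standard combining rules (Rubin's rules). The abstract does not spell out this pooling step, so the sketch below is an assumption about how it would be done. The function name `pool_rubin` and the input numbers are hypothetical and are not taken from the CPP analysis.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors.

    Hypothetical helper illustrating Rubin's combining rules; it is not
    code from the article.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    # Total variance: within + between, inflated by 1/m for finite m
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Made-up coefficient estimates and squared standard errors from m = 3
# imputed data sets (illustrative values only).
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The between-imputation term is what separates MI from filling in each missing value once: when the imputed data sets disagree, the pooled standard error grows to reflect that disagreement.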
Summary
The results indicate that MI is better suited to handling incomplete data and leads to a significant improvement in parameter estimates compared with CCA. The two MI methods produced similar point estimates but slightly different standard errors.
Abbreviations
- ANOVA: analysis of variance
- CCA: complete case analysis
- CPP: Collaborative Perinatal Project
- FCS: fully conditional specification
- ICE: imputation by chained equations
- MAR: missing at random
- MCAR: missing completely at random
- MI: multiple imputation
- MICE: multivariate imputation by chained equations
- MCMC: Markov chain Monte Carlo
- MNAR: missing not at random
- MVN: multivariate normal
- SA: spontaneous abortion
- SD: standard deviation
This article is part of the Topical Collection on Methods in Environmental Epidemiology
Cite this article
Allotey, P.A., Harel, O. Multiple Imputation for Incomplete Data in Environmental Epidemiology Research. Curr Envir Health Rpt 6, 62–71 (2019). https://doi.org/10.1007/s40572-019-00230-y
DOI: https://doi.org/10.1007/s40572-019-00230-y