
Multiple Imputation for Incomplete Data in Environmental Epidemiology Research

  • Methods in Environmental Epidemiology (AZ Pollack and NJ Perkins, Section Editors)
Current Environmental Health Reports

Abstract

Purpose of Review

Incomplete data are a common problem in the statistical analysis of environmental epidemiological research, yet many researchers still ignore this complication. We evaluate the performance of two commonly used multiple imputation (MI) methods, fully conditional specification and multivariate normal imputation, for handling missing data and compare them with complete case analysis (CCA). We further discuss issues that arise when these methods are used.

Recent Findings

MI is a simulation-based approach to handling incomplete data. It replaces each missing value with several plausible values and carries the additional uncertainty caused by the missing information through to the final estimates. In general, MI performs better than ad hoc techniques such as CCA. To illustrate this, we use data on 944 women from the Collaborative Perinatal Project and compare estimates across these methods. The goal is to examine whether each of two outcomes in the data set, birth weight and spontaneous abortion, is associated with mothers’ smoking status during pregnancy, adjusting for baseline covariates in the model.
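The way MI allows for the additional uncertainty can be sketched with Rubin's rules, the standard way of combining results across imputed data sets. The snippet below is a minimal illustration, not the analysis code used in this article: the function name `pool_rubin` and the five coefficient and variance values are hypothetical and do not come from the Collaborative Perinatal Project data.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between  # total variance under Rubin's rules
    return q_bar, sqrt(total)


# Hypothetical regression coefficients and variances from m = 5 imputed data sets
est = [-0.20, -0.25, -0.18, -0.22, -0.25]
var = [0.010, 0.011, 0.009, 0.010, 0.010]
pooled, se = pool_rubin(est, var)
print(pooled, se)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance; the between-imputation term is where the uncertainty due to the missing values enters.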

Summary

Results indicate that MI is better suited than CCA to handling incomplete data and led to a significant improvement in parameter estimates. The two MI methods produced similar point estimates but slightly different standard errors.

Abbreviations

ANOVA: analysis of variance
CCA: complete case analysis
CPP: Collaborative Perinatal Project
FCS: fully conditional specification
ICE: imputation by chained equations
MAR: missing at random
MCAR: missing completely at random
MI: multiple imputation
MICE: multivariate imputation by chained equations
MCMC: Markov chain Monte Carlo
MNAR: missing not at random
MVN: multivariate normal
SA: spontaneous abortion
SD: standard deviation

Author information

Corresponding author

Correspondence to Ofer Harel.

Additional information

This article is part of the Topical Collection on Methods in Environmental Epidemiology

About this article

Cite this article

Allotey, P.A., Harel, O. Multiple Imputation for Incomplete Data in Environmental Epidemiology Research. Curr Envir Health Rpt 6, 62–71 (2019). https://doi.org/10.1007/s40572-019-00230-y

  • DOI: https://doi.org/10.1007/s40572-019-00230-y
