How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment

Abstract

The study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice of approach should be based primarily on the missingness mechanism: one should choose CCA under MCAR, MI under MAR, and, again, CCA under MNAR. Although MIM also produces almost unbiased estimates under MCAR and MNAR, it leads to inefficient regression coefficients, that is, ones with excessively large standard errors and, consequently, incorrect p-values.
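
To make the comparison in the abstract concrete, the following is a minimal sketch of how the three missingness mechanisms and two of the approaches (CCA and MIM) might be set up for an OLS model. It is not the authors' experiment: the data-generating process, the coefficients, the 30% missingness rate, and the helper names (`missing_mask`, `fit_cca`, `fit_mim`) are all assumptions made for illustration, and multiple imputation is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data (assumed, not taken from the study):
# x is a categorical predictor with levels 0, 1, 2; z is a continuous covariate.
x = rng.integers(0, 3, size=n)
z = 0.5 * x + rng.normal(size=n)
y = 1.0 + 0.5 * (x == 1) + 1.0 * (x == 2) + 0.8 * z + rng.normal(size=n)


def missing_mask(mechanism, rate=0.3):
    """Return a boolean array that is True where x is treated as missing."""
    if mechanism == "MCAR":      # missingness unrelated to any variable
        p = np.full(n, rate)
    elif mechanism == "MAR":     # missingness depends on the observed z only
        p = 2 * rate / (1 + np.exp(-(z - z.mean())))
    elif mechanism == "MNAR":    # missingness depends on x itself
        p = rate * (0.5 + 0.5 * x)
    else:
        raise ValueError(mechanism)
    return rng.random(n) < p


def dummies(codes, levels):
    """Dummy-code the given levels; any level not listed is the reference."""
    return np.column_stack([(codes == lv).astype(float) for lv in levels])


def ols(design, target):
    """Ordinary least squares coefficients via a least-squares solve."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def fit_cca(miss):
    """Complete case analysis: drop every row where x is missing."""
    keep = ~miss
    design = np.column_stack(
        [np.ones(keep.sum()), dummies(x[keep], [1, 2]), z[keep]]
    )
    return ols(design, y[keep])   # [intercept, b_x1, b_x2, b_z]


def fit_mim(miss):
    """Missing-indicator method: keep all rows, code missing as its own level."""
    x_obs = np.where(miss, 3, x)  # code 3 stands for "missing"
    design = np.column_stack([np.ones(n), dummies(x_obs, [1, 2, 3]), z])
    return ols(design, y)         # [intercept, b_x1, b_x2, b_missing, b_z]


results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = missing_mask(mechanism)
    results[mechanism] = (fit_cca(mask), fit_mim(mask))
    print(mechanism, "CCA:", np.round(results[mechanism][0], 2))
    print(mechanism, "MIM:", np.round(results[mechanism][1], 2))
```

In this toy setup the true coefficients are 0.5 and 1.0 for the two dummy variables and 0.8 for `z`, so the printed estimates can be compared against them for each mechanism. A single run says nothing about standard errors; assessing efficiency would require repeating the simulation many times.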

Data availability and material

All data files are available upon request.

Code availability

All scripts are available upon request.

Notes

  1. The only exceptions are the articles (Choi et al. 2019) and (Donders et al. 2006), in which the authors use simulated data but limit their analysis to continuous variables.

  2. The exception is the paper (Henry et al. 2013), in which the race variable contains missing values; however, in that study the authors use real data and do not control any factors that may affect the results of the comparison.

  3. All the technical files are available upon request.

  4. Random forest-based multiple imputation was carried out with the ‘sklearn’ package (specifically, its IterativeImputer class) in Python (Pedregosa et al. 2011), which is analogous to the ‘mice’ package in R.

  5. θ is the true value of a parameter.

  6. ANOVA was carried out with the ‘statsmodels’ package (specifically, its anova_lm function) in Python (Seabold and Perktold 2010).

  7. CHAID was carried out with the ‘randan’ package (specifically, its CHAIDRegressor class) in Python.
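
Note 4 mentions random forest-based imputation with the IterativeImputer class. A minimal sketch of how such a setup might look is given below. It is an assumption-laden illustration rather than the authors' script: the toy data, the 0/1 coding of the categorical variable, the rounding of imputed values, and the use of different random seeds to obtain several completed data sets are all choices made here for brevity.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 300

# Hypothetical data (assumed): a binary categorical x coded 0/1, related to z.
z = rng.normal(size=n)
x = (z + rng.normal(size=n) > 0).astype(float)

data = np.column_stack([x, z])
miss = rng.random(n) < 0.2        # 20% of x set to missing completely at random
data[miss, 0] = np.nan


def impute_once(seed):
    """Return one completed copy of the data, imputed with a random forest."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=seed),
        max_iter=5,
        random_state=seed,
    )
    completed = imputer.fit_transform(data)
    # The regressor predicts values between 0 and 1 for the coded category;
    # rounding back to 0/1 is a simplification assumed for this sketch.
    completed[:, 0] = np.rint(completed[:, 0])
    return completed


# Several completed data sets, differing only through the random seed.
imputations = [impute_once(seed) for seed in range(5)]
print(len(imputations), imputations[0].shape)
```

Each completed data set would then be analysed separately and the estimates pooled; that pooling step is not shown. Varying only the seed of the forest is a shortcut, and it may understate between-imputation variability compared with drawing imputations from a predictive distribution.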

References

  1. Akande, O., Li, F., Reiter, J.: An empirical comparison of multiple imputation methods for categorical data. Am. Stat. 71, 162–170 (2017). https://doi.org/10.1080/00031305.2016.1277158

  2. Allison, P.D.: Multiple imputation for missing data: a cautionary tale. Sociol. Methods Res. 28, 301–309 (2000). https://doi.org/10.1177/0049124100028003003

  3. Allison, P.D.: Imputation of categorical variables with PROC MI. Proc. SAS Users Group Int. Conf. (SUGI) 30, 113–130 (2005)

  4. Bartlett, J.W., Carpenter, J.R., Tilling, K., Vansteelandt, S.: Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 15, 719–730 (2014). https://doi.org/10.1093/biostatistics/kxu023

  5. Bartlett, J.W., Harel, O., Carpenter, J.R.: Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. Am. J. Epidemiol. 182, 730–736 (2015). https://doi.org/10.1093/aje/kwv114

  6. Chen, J., Hossler, D.: The effects of financial aid on college success of two-year beginning nontraditional students. Res. High Educ. 58, 40–76 (2017). https://doi.org/10.1007/s11162-016-9416-0

  7. Choi, J., Dekkers, O.M., le Cessie, S.: A comparison of different methods to handle missing data in the context of propensity score analysis. Eur. J. Epidemiol. 34, 23–36 (2019). https://doi.org/10.1007/s10654-018-0447-z

  8. Donders, A.R.T., van der Heijden, G.J.M.G., Stijnen, T., Moons, K.G.M.: Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59, 1087–1091 (2006). https://doi.org/10.1016/j.jclinepi.2006.01.014

  9. Doove, L.L., Van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data Anal. 72, 92–104 (2014). https://doi.org/10.1016/j.csda.2013.10.025

  10. Dougherty, C.: Introduction to Econometrics. Oxford University Press, Oxford (2016)

  11. Gentle, J.E. (ed.): Handbook of Computational Statistics: Concepts and Methods. Springer, Berlin (2012)

  12. Gesser-Edelsburg, A., Zemach, M., Lotan, T., Elias, W., Grimberg, E.: Perceptions, intentions and behavioral norms that affect pre-license driving among Arab youth in Israel. Accid. Anal. Prev. 111, 1–11 (2018). https://doi.org/10.1016/j.aap.2017.11.005

  13. Greenacre, M., Pardo, R.: Subset correspondence analysis: visualizing relationships among a selected set of response categories from a questionnaire survey. Sociol. Methods Res. 35, 193–218 (2006). https://doi.org/10.1177/0049124106290316

  14. Groenwold, R.H.H., White, I.R., Donders, A.R.T., Carpenter, J.R., Altman, D.G., Moons, K.G.M.: Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Can. Med. Assoc. J. 184, 1265–1269 (2012). https://doi.org/10.1503/cmaj.110977

  15. Henry, A.J., Hevelone, N.D., Lipsitz, S., Nguyen, L.L.: Comparative methods for handling missing data in large databases. J. Vasc. Surg. 58, 1353-1359.e6 (2013). https://doi.org/10.1016/j.jvs.2013.05.008

  16. Hughes, R.A., Heron, J., Sterne, J.A.C., Tilling, K.: Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int. J. Epidemiol. 48, 1294–1304 (2019). https://doi.org/10.1093/ije/dyz032

  17. Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91, 222–230 (1996). https://doi.org/10.1080/01621459.1996.10476680

  18. Knol, M.J., Janssen, K.J.M., Donders, A.R.T., Egberts, A.C.G., Heerdink, E.R., Grobbee, D.E., Moons, K.G.M., Geerlings, M.I.: Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J. Clin. Epidemiol. 63, 728–736 (2010). https://doi.org/10.1016/j.jclinepi.2009.08.028

  19. Maimon, O., Rokach, L. (eds.): Data Mining and Knowledge Discovery Handbook. Springer, New York (2010)

  20. Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963). https://doi.org/10.1080/01621459.1963.10500855

  21. Morris, T.P., White, I.R., Crowther, M.J.: Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102 (2019). https://doi.org/10.1002/sim.8086

  22. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  23. Ratner, B.: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press, Boca Raton (2017)

  24. Rickles, J., Heppen, J.B., Allensworth, E., Sorensen, N., Walters, K.: Online credit recovery and the path to on-time high school graduation. Educ. Res. 47, 481–491 (2018). https://doi.org/10.3102/0013189X18788054

  25. Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581

  26. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. Wiley, Hoboken (1987)

  27. Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997)

  28. Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference (2010)

  29. Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., Hemingway, H.: Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am. J. Epidemiol. 179, 764–774 (2014). https://doi.org/10.1093/aje/kwt312

  30. Slade, E., Naylor, M.G.: A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Stat. Med. 39, 1156–1166 (2020). https://doi.org/10.1002/sim.8468

  31. Stavseth, M.R., Clausen, T., Røislien, J.: How handling missing data may impact conclusions: a comparison of six different imputation methods for categorical questionnaire data. SAGE Open Med. 7, 205031211882291 (2019). https://doi.org/10.1177/2050312118822912

  32. Stekhoven, D.J., Bühlmann, P.: MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012). https://doi.org/10.1093/bioinformatics/btr597

  33. Strebkov, D., Shevchuk, A., Lukina, A., Melianova, E., Tyulyupo, A.: Social factors of contractor selection on freelance online marketplace: study of contests using big data. J. Econ. Sociol. 20, 25–65 (2019)

  34. Sundararajan, A., Sarwat, A.I.: Evaluation of missing data imputation methods for an enhanced distributed PV generation prediction. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) Proceedings of the Future Technologies Conference (FTC) 2019, pp. 590–609. Springer, Cham (2020)

  35. Tang, F., Ishwaran, H.: Random forest missing data algorithms. Stat. Anal. Data Min.: ASA Data Sci. J. 10, 363–377 (2017). https://doi.org/10.1002/sam.11348

  36. Trevizo, D., Lopez, M.J.: Neighborhood segregation and business outcomes: Mexican immigrant entrepreneurs in Los Angeles county. Sociol. Persp. 59, 668–693 (2016). https://doi.org/10.1177/0731121416629992

  37. Van der Heijden, G.J., Donders, A.R.T., Stijnen, T., Moons, K.G.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59, 1102–1109 (2006)

  38. van Kuijk, S.M., Viechtbauer, W., Peeters, L.L., Smits, L.: Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study. Epidemiol., Biostat. Public Health (2016). https://doi.org/10.2427/11598

  39. Vermunt, J.K., Van Ginkel, J.R., Van Der Ark, L.A., Sijtsma, K.: Multiple imputation of incomplete categorical data using latent class analysis. Sociol. Methodol. 38, 369–397 (2008)

  40. Waljee, A.K., Mukherjee, A., Singal, A.G., Zhang, Y., Warren, J., Balis, U., Marrero, J., Zhu, J., Higgins, P.D.: Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 3, e002847 (2013). https://doi.org/10.1136/bmjopen-2013-002847

  41. Weiss, M.J., Bloom, H.S., Verbitsky-Savitz, N., Gupta, H., Vigil, A.E., Cullinan, D.N.: How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. J. Res. Educ. Eff. 10, 843–876 (2017). https://doi.org/10.1080/19345747.2017.1300719

  42. White, I.R., Thompson, S.G.: Adjusting for partially missing baseline measurements in randomized trials. Stat. Med. 24, 993–1007 (2005). https://doi.org/10.1002/sim.1981

  43. Zhang, P.: Multiple imputation: theory and method. Int. Stat. Rev. 71, 581–592 (2003). https://doi.org/10.1111/j.1751-5823.2003.tb00213.x

  44. Zhelyazkova, N., Ritschard, G.: Parental leave take-up of fathers in Luxembourg. Popul. Res. Policy Rev. 37, 769–793 (2018). https://doi.org/10.1007/s11113-018-9470-8

  45. Zhuchkova, S., Rotmistrov, A.: Handling missing data with CHAID: results of a statistical experiment. Sociology: methodology, methods, mathematical modeling. 46, 85–122 (2018)

Funding

The publication was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2020 (Grant No. 20-04-016) and within the framework of the Russian Academic Excellence Project "5–100".

Author information

Corresponding author

Correspondence to Svetlana Zhuchkova.

Ethics declarations

Conflict of interest

We declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhuchkova, S., Rotmistrov, A. How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment. Qual Quant (2021). https://doi.org/10.1007/s11135-021-01114-w

Keywords

  • Categorical data
  • Complete case analysis
  • Missing data
  • Missing indicator method
  • Multiple imputation
  • Random forest
  • Regression analysis
  • Statistical experiment