Abstract
This study compares three approaches to handling missing data in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice of approach should be based primarily on the missingness mechanism: one should choose CCA under MCAR, MI under MAR, and, again, CCA under MNAR. Although MIM also produces almost unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, that is, ones with inflated standard errors and, consequently, incorrect p-values.
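To make the contrast between two of the approaches concrete, the following minimal sketch fits an OLS model under CCA and under MIM on simulated data with values missing completely at random. It is an illustration only, not the simulation design used in the study: the data-generating model, the sample size, the 30% missingness rate, and all variable names are assumptions, and plain NumPy least squares stands in for a statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: categorical predictor x with levels 0, 1, 2 (0 is the
# reference level) and a fully observed continuous covariate z.
x = rng.integers(0, 3, size=n)
z = rng.normal(size=n)
true_beta = np.array([1.0, 0.5, -0.5, 0.3])  # intercept, level 1, level 2, z
y = (true_beta[0] + true_beta[1] * (x == 1) + true_beta[2] * (x == 2)
     + true_beta[3] * z + rng.normal(size=n))

# MCAR (assumed 30% rate): every value of x is missing with the same probability.
missing = rng.random(n) < 0.3
observed = ~missing


def ols(design, outcome):
    """Return OLS coefficients via least squares."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


# Complete case analysis: drop every row in which x is missing.
X_cca = np.column_stack([
    np.ones(observed.sum()),
    x[observed] == 1,
    x[observed] == 2,
    z[observed],
]).astype(float)
beta_cca = ols(X_cca, y[observed])

# Missing-indicator method: keep all rows and treat "missing" as one more
# category of x, represented by its own dummy variable (last column).
X_mim = np.column_stack([
    np.ones(n),
    (x == 1) & observed,
    (x == 2) & observed,
    z,
    missing,
]).astype(float)
beta_mim = ols(X_mim, y)

print("true:", true_beta)
print("CCA: ", beta_cca.round(3))
print("MIM: ", beta_mim[:4].round(3))
```

In this setup z is generated independently of x, which is a favorable case for MIM; the sketch says nothing about the standard errors that the abstract identifies as the weak point of MIM.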
Data availability and material
All data files are available upon request.
Code availability
All scripts are available upon request.
Notes
The exception is the paper by Henry et al. (2013), in which the race variable contains missing values; in that study, however, the authors use real data and do not control any factors that may affect the results of the comparison.
All the technical files are available upon request.
Random forest-based multiple imputation was carried out with the ‘sklearn’ package (specifically its IterativeImputer class) in Python (Pedregosa et al. 2011), which is analogous to the ‘mice’ package in R.
θ is the true value of a parameter.
ANOVA was carried out with the ‘statsmodels’ package (specifically its anova_lm function) in Python (Seabold and Perktold 2010).
CHAID was carried out with the ‘randan’ package (specifically its CHAIDRegressor class) in Python.
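The note on random forest-based multiple imputation can be illustrated with a short sketch built on the IterativeImputer class. It is not the authors' script. The simulated data, the integer coding of the categorical variable, the rounding of imputed values back to valid category codes, the number of imputations, and all function names are assumptions made for illustration. Because a random forest regressor does not sample from a predictive distribution, variation between imputations here comes only from the random seed.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 300

# Hypothetical data: integer-coded categorical x (levels 0, 1, 2),
# continuous covariate z, and outcome y.
z = rng.normal(size=n)
x = np.clip(np.round(1 + z + rng.normal(scale=0.5, size=n)), 0, 2)
y = 1.0 + 0.5 * (x == 1) - 0.5 * (x == 2) + 0.3 * z + rng.normal(size=n)

# MAR: the probability that x is missing depends only on the observed z.
missing = rng.random(n) < 1 / (1 + np.exp(-z))
data = np.column_stack([x, z, y])
data[missing, 0] = np.nan


def impute_once(seed):
    """Return one completed copy of `data` (columns: x, z, y)."""
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    imputer = IterativeImputer(estimator=forest, max_iter=5,
                               skip_complete=True, random_state=seed)
    completed = imputer.fit_transform(data.copy())
    # Assumption: regression output is rounded back to a valid category code.
    completed[:, 0] = np.clip(np.round(completed[:, 0]), 0, 2)
    return completed


def fit_ols(completed):
    """OLS of y on dummies for x (reference level 0) and on z."""
    cat = completed[:, 0]
    design = np.column_stack(
        [np.ones(len(cat)), cat == 1, cat == 2, completed[:, 1]]
    ).astype(float)
    coef, *_ = np.linalg.lstsq(design, completed[:, 2], rcond=None)
    return coef


m = 5  # assumed number of imputations
estimates = np.array([fit_ols(impute_once(seed)) for seed in range(m)])
pooled = estimates.mean(axis=0)  # pooled point estimate (mean over imputations)
print(pooled.round(3))
```

Averaging the coefficients over the completed datasets gives the pooled point estimate under Rubin's rules; pooling the standard errors would additionally require the within- and between-imputation variances, which this sketch omits.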
References
Akande, O., Li, F., Reiter, J.: An empirical comparison of multiple imputation methods for categorical data. Am. Stat. 71, 162–170 (2017). https://doi.org/10.1080/00031305.2016.1277158
Allison, P.D.: Multiple imputation for missing data: a cautionary tale. Sociol. Methods Res. 28, 301–309 (2000). https://doi.org/10.1177/0049124100028003003
Allison, P.D.: Imputation of categorical variables with PROC MI. Proc. SAS Users Group Int. Conf. (SUGI) 30, 113–130 (2005)
Bartlett, J.W., Carpenter, J.R., Tilling, K., Vansteelandt, S.: Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 15, 719–730 (2014). https://doi.org/10.1093/biostatistics/kxu023
Bartlett, J.W., Harel, O., Carpenter, J.R.: Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. Am. J. Epidemiol. 182, 730–736 (2015). https://doi.org/10.1093/aje/kwv114
Chen, J., Hossler, D.: The effects of financial aid on college success of two-year beginning nontraditional students. Res. High Educ. 58, 40–76 (2017). https://doi.org/10.1007/s11162-016-9416-0
Choi, J., Dekkers, O.M., le Cessie, S.: A comparison of different methods to handle missing data in the context of propensity score analysis. Eur. J. Epidemiol. 34, 23–36 (2019). https://doi.org/10.1007/s10654-018-0447-z
Donders, A.R.T., van der Heijden, G.J.M.G., Stijnen, T., Moons, K.G.M.: Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59, 1087–1091 (2006). https://doi.org/10.1016/j.jclinepi.2006.01.014
Doove, L.L., Van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data Anal. 72, 92–104 (2014). https://doi.org/10.1016/j.csda.2013.10.025
Dougherty, C.: Introduction to Econometrics. Oxford University Press, Oxford (2016)
Gentle, J.E. (ed.): Handbook of Computational Statistics: Concepts and Methods. Springer, Berlin (2012)
Gesser-Edelsburg, A., Zemach, M., Lotan, T., Elias, W., Grimberg, E.: Perceptions, intentions and behavioral norms that affect pre-license driving among Arab youth in Israel. Accid. Anal. Prev. 111, 1–11 (2018). https://doi.org/10.1016/j.aap.2017.11.005
Greenacre, M., Pardo, R.: Subset correspondence analysis: visualizing relationships among a selected set of response categories from a questionnaire survey. Sociol. Methods Res. 35, 193–218 (2006). https://doi.org/10.1177/0049124106290316
Groenwold, R.H.H., White, I.R., Donders, A.R.T., Carpenter, J.R., Altman, D.G., Moons, K.G.M.: Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Can. Med. Assoc. J. 184, 1265–1269 (2012). https://doi.org/10.1503/cmaj.110977
Henry, A.J., Hevelone, N.D., Lipsitz, S., Nguyen, L.L.: Comparative methods for handling missing data in large databases. J. Vasc. Surg. 58, 1353-1359.e6 (2013). https://doi.org/10.1016/j.jvs.2013.05.008
Hughes, R.A., Heron, J., Sterne, J.A.C., Tilling, K.: Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int. J. Epidemiol. 48, 1294–1304 (2019). https://doi.org/10.1093/ije/dyz032
Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91, 222–230 (1996). https://doi.org/10.1080/01621459.1996.10476680
Knol, M.J., Janssen, K.J.M., Donders, A.R.T., Egberts, A.C.G., Heerdink, E.R., Grobbee, D.E., Moons, K.G.M., Geerlings, M.I.: Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J. Clin. Epidemiol. 63, 728–736 (2010). https://doi.org/10.1016/j.jclinepi.2009.08.028
Maimon, O., Rokach, L. (eds.): Data Mining and Knowledge Discovery Handbook. Springer, New York (2010)
Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963). https://doi.org/10.1080/01621459.1963.10500855
Morris, T.P., White, I.R., Crowther, M.J.: Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102 (2019). https://doi.org/10.1002/sim.8086
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Ratner, B.: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press, Boca Raton (2017)
Rickles, J., Heppen, J.B., Allensworth, E., Sorensen, N., Walters, K.: Online credit recovery and the path to on-time high school graduation. Educ. Res. 47, 481–491 (2018). https://doi.org/10.3102/0013189X18788054
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581
Rubin, D.B. (ed.): Multiple imputation for nonresponse in surveys. Wiley, Hoboken (1987)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997)
Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with Python. In: Proceedings of the 9th Python in Science Conference (2010)
Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., Hemingway, H.: Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. Am. J. Epidemiol. 179, 764–774 (2014). https://doi.org/10.1093/aje/kwt312
Slade, E., Naylor, M.G.: A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Stat. Med. 39, 1156–1166 (2020). https://doi.org/10.1002/sim.8468
Stavseth, M.R., Clausen, T., Røislien, J.: How handling missing data may impact conclusions: a comparison of six different imputation methods for categorical questionnaire data. SAGE Open Med. 7, 205031211882291 (2019). https://doi.org/10.1177/2050312118822912
Stekhoven, D.J., Bühlmann, P.: MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012). https://doi.org/10.1093/bioinformatics/btr597
Strebkov, D., Shevchuk, A., Lukina, A., Melianova, E., Tyulyupo, A.: Social factors of contractor selection on freelance online marketplace: study of contests using big data. J. Econ. Sociol. 20, 25–65 (2019)
Sundararajan, A., Sarwat, A.I.: Evaluation of missing data imputation methods for an enhanced distributed PV generation prediction. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) Proceedings of the Future Technologies Conference (FTC) 2019, pp. 590–609. Springer, Cham (2020)
Tang, F., Ishwaran, H.: Random forest missing data algorithms. Stat. Anal. Data Min.: ASA Data Sci. J. 10, 363–377 (2017). https://doi.org/10.1002/sam.11348
Trevizo, D., Lopez, M.J.: Neighborhood segregation and business outcomes: Mexican immigrant entrepreneurs in Los Angeles county. Sociol. Persp. 59, 668–693 (2016). https://doi.org/10.1177/0731121416629992
Van der Heijden, G.J., Donders, A.R.T., Stijnen, T., Moons, K.G.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59, 1102–1109 (2006)
van Kuijk, S.M., Viechtbauer, W., Peeters, L.L., Smits, L.: Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study. Epidemiol., Biostat. Public Health (2016). https://doi.org/10.2427/11598
Vermunt, J.K., Van Ginkel, J.R., Van Der Ark, L.A., Sijtsma, K.: Multiple imputation of incomplete categorical data using latent class analysis. Sociol. Methodol. 38, 369–397 (2008)
Waljee, A.K., Mukherjee, A., Singal, A.G., Zhang, Y., Warren, J., Balis, U., Marrero, J., Zhu, J., Higgins, P.D.: Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 3, e002847 (2013). https://doi.org/10.1136/bmjopen-2013-002847
Weiss, M.J., Bloom, H.S., Verbitsky-Savitz, N., Gupta, H., Vigil, A.E., Cullinan, D.N.: How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. J. Res. Educ. Eff. 10, 843–876 (2017). https://doi.org/10.1080/19345747.2017.1300719
White, I.R., Thompson, S.G.: Adjusting for partially missing baseline measurements in randomized trials. Stat. Med. 24, 993–1007 (2005). https://doi.org/10.1002/sim.1981
Zhang, P.: Multiple imputation: theory and method. Int. Stat. Rev. 71, 581–592 (2007). https://doi.org/10.1111/j.1751-5823.2003.tb00213.x
Zhelyazkova, N., Ritschard, G.: Parental leave take-up of fathers in Luxembourg. Popul. Res. Policy Rev. 37, 769–793 (2018). https://doi.org/10.1007/s11113-018-9470-8
Zhuchkova, S., Rotmistrov, A.: Handling missing data with CHAID: results of a statistical experiment. Sociology: methodology, methods, mathematical modeling. 46, 85–122 (2018)
Funding
The publication was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2020 (Grant No. 20-04-016) and by the Russian Academic Excellence Project "5–100".
Ethics declarations
Conflict of interest
We declare no conflicts of interest.
About this article
Cite this article
Zhuchkova, S., Rotmistrov, A. How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment. Qual Quant 56, 1–22 (2022). https://doi.org/10.1007/s11135-021-01114-w
DOI: https://doi.org/10.1007/s11135-021-01114-w