How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment

Zhuchkova, Svetlana; Rotmistrov, Aleksei

doi:10.1007/s11135-021-01114-w

Published: 20 February 2021

Volume 56, pages 1–22, (2022)

Quality & Quantity

Abstract

This study compares three approaches to handling missing values in categorical variables: complete case analysis (CCA), multiple imputation (MI) based on random forest, and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice of approach should be based primarily on the missingness mechanism: CCA when data are missing completely at random (MCAR), MI when they are missing at random (MAR), and, again, CCA when they are missing not at random (MNAR). Although MIM also produces almost unbiased estimates under MCAR and MNAR, it yields inefficient regression coefficients, that is, coefficients with inflated standard errors and, consequently, incorrect p-values.
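
The comparison of CCA and MIM under MCAR can be illustrated with a minimal simulation sketch. It is not the authors' experiment: the sample size, the three-level predictor, the coefficient values, the 30% missingness rate, and all function names are assumptions chosen for illustration, and MI is omitted for brevity. Bias is computed as the mean estimate across replications minus the true value.

```python
import numpy as np

TRUE_THETA = 2.0  # assumed true coefficient of level 2 vs. reference level 0


def ols(X, y):
    """Least-squares coefficients for design matrix X (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def simulate(rng, n=2000, miss_rate=0.3):
    """Draw one data set with an MCAR mask on the categorical predictor."""
    cat = rng.integers(0, 3, size=n)      # categorical predictor, levels 0/1/2
    x2 = rng.normal(size=n)               # fully observed continuous covariate
    y = 1.0 * (cat == 1) + TRUE_THETA * (cat == 2) + 0.5 * x2 + rng.normal(size=n)
    missing = rng.random(n) < miss_rate   # MCAR: independent of all variables
    return cat, x2, y, missing


def fit_cca(cat, x2, y, missing):
    """Complete case analysis: drop rows where the predictor is missing."""
    keep = ~missing
    X = np.column_stack([np.ones(keep.sum()), cat[keep] == 1, cat[keep] == 2, x2[keep]])
    return ols(X.astype(float), y[keep])[2]  # coefficient of the level-2 dummy


def fit_mim(cat, x2, y, missing):
    """Missing-indicator method: missing values form their own category."""
    obs = ~missing
    X = np.column_stack([np.ones(len(y)), (cat == 1) & obs, (cat == 2) & obs, missing, x2])
    return ols(X.astype(float), y)[2]  # coefficient of the level-2 dummy


def run_experiment(reps=50, seed=0):
    """Bias of the level-2 coefficient: mean estimate minus the true value."""
    rng = np.random.default_rng(seed)
    est = {"CCA": [], "MIM": []}
    for _ in range(reps):
        data = simulate(rng)
        est["CCA"].append(fit_cca(*data))
        est["MIM"].append(fit_mim(*data))
    return {name: float(np.mean(vals) - TRUE_THETA) for name, vals in est.items()}


bias = run_experiment()
print(bias)
```

Under these assumptions both biases should be close to zero, in line with the abstract's statement for MCAR. Making the mask depend on observed variables such as x2 or y (MAR), or on cat itself (MNAR), would turn the sketch into the other two scenarios.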

Data availability and material

All data files are available upon request.

Code availability

All scripts are available upon request.

Notes

The only exceptions are the articles by Choi et al. (2019) and Donders et al. (2006), in which the authors use simulated data but limit their analysis to continuous variables.
The exception is the paper by Henry et al. (2013), in which the race variable contains missing values; however, the authors use real data and do not control any factors that may affect the results of the comparison.
All the technical files are available upon request.
Random forest-based multiple imputation was carried out with the ‘sklearn’ package (specifically its IterativeImputer class) in Python (Pedregosa et al. 2011), which is equivalent to the ‘mice’ package in R.
θ is the true value of a parameter.
ANOVA was carried out with the ‘statsmodels’ package (specifically its anova_lm function) in Python (Seabold and Perktold 2010).
CHAID was carried out with the ‘randan’ package (specifically its CHAIDRegressor class) in Python.
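
The note on random forest-based imputation can be illustrated with a hedged sketch of how IterativeImputer might be set up for an integer-coded categorical column. The authors' actual configuration is not given here, so the RandomForestClassifier estimator, the number of imputations m, the other parameter values, and the toy data are all assumptions. Repeating the imputation with different seeds is one way to obtain several completed data sets, which would then be analysed separately and pooled (Rubin 1987).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier


def make_toy_data(n=300, miss_rate=0.2, seed=0):
    """Integer-coded categorical columns; the first one receives missing values."""
    rng = np.random.default_rng(seed)
    g1 = rng.integers(0, 2, size=n)                    # binary, fully observed
    g2 = rng.integers(0, 4, size=n)                    # four levels, fully observed
    cat = (g1 + g2 + rng.integers(0, 2, size=n)) % 3   # three levels, related to g1 and g2
    X = np.column_stack([cat, g1, g2]).astype(float)
    mask = rng.random(n) < miss_rate                   # assumed MCAR mask, for illustration
    X[mask, 0] = np.nan
    return X, mask


def impute_m_times(X, m=5, seed=0):
    """Return m completed copies of X, each from a differently seeded forest."""
    completed = []
    for i in range(m):
        imputer = IterativeImputer(
            estimator=RandomForestClassifier(n_estimators=20, random_state=seed + i),
            initial_strategy="most_frequent",  # starting fill suited to category codes
            skip_complete=True,                # model only the column that has gaps
            max_iter=3,
            random_state=seed + i,
        )
        completed.append(imputer.fit_transform(X))
    return completed


X, mask = make_toy_data()
completed = impute_m_times(X, m=3)
# Category counts of the imputed column in each completed data set
print([np.bincount(c[:, 0].astype(int), minlength=3) for c in completed])
```

A classifier is used so that imputed values stay within the observed category codes; skip_complete=True keeps the imputer from fitting a model for columns that have no gaps.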

References

Akande, O., Li, F., Reiter, J.: An empirical comparison of multiple imputation methods for categorical data. Am. Stat. 71, 162–170 (2017). https://doi.org/10.1080/00031305.2016.1277158
Allison, P.D.: Multiple imputation for missing data: a cautionary tale. Sociol. Methods Res. 28, 301–309 (2000). https://doi.org/10.1177/0049124100028003003
Allison, P.D.: Imputation of categorical variables with PROC MI. Proc. SAS Users Group Int. Conf. (SUGI) 30, 113–130 (2005)
Bartlett, J.W., Carpenter, J.R., Tilling, K., Vansteelandt, S.: Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 15, 719–730 (2014). https://doi.org/10.1093/biostatistics/kxu023
Bartlett, J.W., Harel, O., Carpenter, J.R.: Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. Am. J. Epidemiol. 182, 730–736 (2015). https://doi.org/10.1093/aje/kwv114
Chen, J., Hossler, D.: The effects of financial aid on college success of two-year beginning nontraditional students. Res. High Educ. 58, 40–76 (2017). https://doi.org/10.1007/s11162-016-9416-0
Choi, J., Dekkers, O.M., le Cessie, S.: A comparison of different methods to handle missing data in the context of propensity score analysis. Eur. J. Epidemiol. 34, 23–36 (2019). https://doi.org/10.1007/s10654-018-0447-z
Donders, A.R.T., van der Heijden, G.J.M.G., Stijnen, T., Moons, K.G.M.: Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59, 1087–1091 (2006). https://doi.org/10.1016/j.jclinepi.2006.01.014
Doove, L.L., Van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data Anal. 72, 92–104 (2014). https://doi.org/10.1016/j.csda.2013.10.025
Dougherty, C.: Introduction to Econometrics. Oxford University Press, Oxford (2016)
Gentle, J.E. (ed.): Handbook of Computational Statistics: Concepts and Methods. Springer, Berlin (2012)
Gesser-Edelsburg, A., Zemach, M., Lotan, T., Elias, W., Grimberg, E.: Perceptions, intentions and behavioral norms that affect pre-license driving among Arab youth in Israel. Accid. Anal. Prev. 111, 1–11 (2018). https://doi.org/10.1016/j.aap.2017.11.005
Greenacre, M., Pardo, R.: Subset correspondence analysis: visualizing relationships among a selected set of response categories from a questionnaire survey. Sociol. Methods Res. 35, 193–218 (2006). https://doi.org/10.1177/0049124106290316
Groenwold, R.H.H., White, I.R., Donders, A.R.T., Carpenter, J.R., Altman, D.G., Moons, K.G.M.: Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Can. Med. Assoc. J. 184, 1265–1269 (2012). https://doi.org/10.1503/cmaj.110977
Henry, A.J., Hevelone, N.D., Lipsitz, S., Nguyen, L.L.: Comparative methods for handling missing data in large databases. J. Vasc. Surg. 58, 1353-1359.e6 (2013). https://doi.org/10.1016/j.jvs.2013.05.008
Hughes, R.A., Heron, J., Sterne, J.A.C., Tilling, K.: Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int. J. Epidemiol. 48, 1294–1304 (2019). https://doi.org/10.1093/ije/dyz032
Jones, M.P.: Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91, 222–230 (1996). https://doi.org/10.1080/01621459.1996.10476680
Knol, M.J., Janssen, K.J.M., Donders, A.R.T., Egberts, A.C.G., Heerdink, E.R., Grobbee, D.E., Moons, K.G.M., Geerlings, M.I.: Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J. Clin. Epidemiol. 63, 728–736 (2010). https://doi.org/10.1016/j.jclinepi.2009.08.028
Maimon, O., Rokach, L. (eds.): Data Mining and Knowledge Discovery Handbook. Springer, New York (2010)
Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963). https://doi.org/10.1080/01621459.1963.10500855
Morris, T.P., White, I.R., Crowther, M.J.: Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102 (2019). https://doi.org/10.1002/sim.8086
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Ratner, B.: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press, Boca Raton (2017)
Rickles, J., Heppen, J.B., Allensworth, E., Sorensen, N., Walters, K.: Online credit recovery and the path to on-time high school graduation. Educ. Res. 47, 481–491 (2018). https://doi.org/10.3102/0013189X18788054
Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581
Rubin, D.B. (ed.): Multiple imputation for nonresponse in surveys. Wiley, Hoboken (1987)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997)
Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with python. In: Proceedings of the 9th Python in Science Conference (2010)
Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., Hemingway, H.: Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study. Am. J. Epidemiol. 179, 764–774 (2014). https://doi.org/10.1093/aje/kwt312
Slade, E., Naylor, M.G.: A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Stat. Med. 39, 1156–1166 (2020). https://doi.org/10.1002/sim.8468
Stavseth, M.R., Clausen, T., Røislien, J.: How handling missing data may impact conclusions: a comparison of six different imputation methods for categorical questionnaire data. SAGE Open Med. 7, 205031211882291 (2019). https://doi.org/10.1177/2050312118822912
Stekhoven, D.J., Buhlmann, P.: MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012). https://doi.org/10.1093/bioinformatics/btr597
Strebkov, D., Shevchuk, A., Lukina, A., Melianova, E., Tyulyupo, A.: Social factors of contractor selection on freelance online marketplace: study of contests using big data. J. Econ. Sociol. 20, 25–65 (2019)
Sundararajan, A., Sarwat, A.I.: Evaluation of missing data imputation methods for an enhanced distributed PV generation prediction. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) Proceedings of the Future Technologies Conference (FTC) 2019, pp. 590–609. Springer, Cham (2020)
Tang, F., Ishwaran, H.: Random forest missing data algorithms. Stat. Anal. Data Min.: ASA Data Sci. J. 10, 363–377 (2017). https://doi.org/10.1002/sam.11348
Trevizo, D., Lopez, M.J.: Neighborhood segregation and business outcomes: Mexican immigrant entrepreneurs in Los Angeles county. Sociol. Persp. 59, 668–693 (2016). https://doi.org/10.1177/0731121416629992
Van der Heijden, G.J., Donders, A.R.T., Stijnen, T., Moons, K.G.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)
van Kuijk, S.M., Viechtbauer, W., Peeters, L.L., Smits, L.: Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study. Epidemiol., Biostat. Public Health (2016). https://doi.org/10.2427/11598
Vermunt, J.K., Van Ginkel, J.R., Van Der Ark, L.A., Sijtsma, K.: Multiple imputation of incomplete categorical data using latent class analysis. Sociol. Methodol. 38(1), 369–397 (2008)
Waljee, A.K., Mukherjee, A., Singal, A.G., Zhang, Y., Warren, J., Balis, U., Marrero, J., Zhu, J., Higgins, P.D.: Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 3, e002847 (2013). https://doi.org/10.1136/bmjopen-2013-002847
Weiss, M.J., Bloom, H.S., Verbitsky-Savitz, N., Gupta, H., Vigil, A.E., Cullinan, D.N.: How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. J. Res. Educ. Eff. 10, 843–876 (2017). https://doi.org/10.1080/19345747.2017.1300719
White, I.R., Thompson, S.G.: Adjusting for partially missing baseline measurements in randomized trials. Stat. Med. 24, 993–1007 (2005). https://doi.org/10.1002/sim.1981
Zhang, P.: Multiple imputation: theory and method. Int. Stat. Rev. 71, 581–592 (2007). https://doi.org/10.1111/j.1751-5823.2003.tb00213.x
Zhelyazkova, N., Ritschard, G.: Parental leave take-up of fathers in Luxembourg. Popul. Res. Policy Rev. 37, 769–793 (2018). https://doi.org/10.1007/s11113-018-9470-8
Zhuchkova, S., Rotmistrov, A.: Handling missing data with CHAID: results of a statistical experiment. Sociology: methodology, methods, mathematical modeling. 46, 85–122 (2018)

Funding

The publication was prepared within the framework of the Academic Fund Program at the National Research University Higher School of Economics (HSE) in 2020 (Grant No. 20-04-016) and by the Russian Academic Excellence Project "5–100".

Author information

Authors and Affiliations

Faculty of Social Science, HSE University, Moscow, Russia
Svetlana Zhuchkova
Faculty of Social Science, HSE University, Moscow, Russia
Aleksei Rotmistrov

Corresponding author

Correspondence to Svetlana Zhuchkova.

Ethics declarations

Conflict of interest

We declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhuchkova, S., Rotmistrov, A. How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment. Qual Quant 56, 1–22 (2022). https://doi.org/10.1007/s11135-021-01114-w

Accepted: 11 February 2021
Published: 20 February 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11135-021-01114-w
