Variable selection techniques after multiple imputation in high-dimensional data

Abstract

High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis, but, like many other statistical techniques, it requires complete cases without missing values. A variety of variable selection techniques is available for complete data, whereas comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed datasets. If a variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset, and it remains unclear how variable selection should be carried out on multiply imputed data. In this paper, we propose to use the magnitudes of the parameter estimates of each candidate predictor across all imputed datasets for its selection: a constraint is imposed on the sum of the absolute values of these estimates to decide whether the predictor is retained in or removed from the model. The proposed method for identifying the informative predictors is compared with existing approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of correctly identified informative predictors) and false alarm rates (the proportion of non-informative predictors falsely identified as informative) for different numbers of imputed datasets. The proposed technique is simple to implement, performs equally well in high-dimensional and low-dimensional settings, and proves to be a good competitor to existing approaches across the simulation settings considered. The performance of the different variable selection techniques is also examined on a real dataset with missing values.
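
As a rough illustration of the selection rule sketched in the abstract, the code below fits an L1-penalised regression to each imputed dataset, accumulates the absolute coefficient estimates of every predictor across imputations, and retains a predictor only if this aggregated magnitude exceeds a threshold. This is a minimal sketch in Python (scikit-learn), not the authors' implementation; the function name `select_across_imputations`, the threshold `tau`, and the choice of the lasso as the base estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def select_across_imputations(imputed_datasets, y, tau=1e-3):
    """Select predictors by aggregating |coefficients| across imputations.

    imputed_datasets : list of (n, p) design matrices, one per imputed dataset
    y                : (n,) response vector
    tau              : illustrative threshold on the summed absolute estimates
    """
    p = imputed_datasets[0].shape[1]
    abs_coef_sum = np.zeros(p)
    for X in imputed_datasets:
        fit = LassoCV(cv=5).fit(X, y)       # lasso fit on one completed dataset
        abs_coef_sum += np.abs(fit.coef_)   # accumulate |beta_j| over imputations
    # keep predictor j only if the summed magnitude exceeds the threshold
    return np.flatnonzero(abs_coef_sum > tau)

# toy example standing in for M = 5 multiply imputed datasets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=100)
imputations = [X + rng.normal(scale=0.01, size=X.shape) for _ in range(5)]
print(select_across_imputations(imputations, y))   # expected to include 0 and 1
```

In practice the completed datasets would come from a multiple imputation routine such as chained equations rather than the perturbed copies used in this toy example.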


Author information

Corresponding author

Correspondence to Faisal Maqbool Zahid.


Appendix

See Tables 2, 3, 4 and 5.

Table 2 Simulation study with normal predictors: values of the Matthews correlation coefficient (MCC)
Table 3 Simulation study with binary predictors: values of the Matthews correlation coefficient (MCC)
Table 4 Simulation study with normal predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods
Table 5 Simulation study with binary predictors: average hit rates and false alarm rates (in percentages) over \(S=500\) simulated datasets for different variable selection methods
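
For reference when reading Tables 2-5, the reported measures can be expressed in terms of the numbers of informative predictors selected (TP), non-informative predictors selected (FP), informative predictors missed (FN), and non-informative predictors correctly excluded (TN). These are the standard definitions, stated here for convenience; the paper's exact operational definitions may differ in presentation.

\[
\text{hit rate} = \frac{TP}{TP + FN}, \qquad
\text{false alarm rate} = \frac{FP}{FP + TN},
\]
\[
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}.
\]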

About this article

Cite this article

Zahid, F.M., Faisal, S. & Heumann, C. Variable selection techniques after multiple imputation in high-dimensional data. Stat Methods Appl 29, 553–580 (2020). https://doi.org/10.1007/s10260-019-00493-7

Keywords

  • High-dimensional data
  • Multiple imputation
  • LASSO
  • Rubin’s rules
  • Variable selection