Variable selection techniques after multiple imputation in high-dimensional data

  • Faisal Maqbool Zahid
  • Shahla Faisal
  • Christian Heumann
Original Paper


Abstract

High-dimensional data arise in diverse fields of scientific research, and missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis, but, like many other statistical techniques, it requires complete cases without missing values. A variety of variable selection techniques is available for complete data, whereas comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach for handling missing values and obtaining completed datasets. If a variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset, and it remains unclear in the literature how variable selection should be implemented on multiply imputed data. In this paper, we propose to use the magnitude of the parameter estimates of each candidate predictor across all imputed datasets for its selection: a constraint is imposed on the sum of the absolute values of these estimates to select or remove the predictor from the model. The proposed method for identifying informative predictors is compared with other approaches in an extensive simulation study. Performance is compared on the basis of hit rates (the proportion of correctly identified informative predictors) and false alarm rates (the proportion of non-informative predictors incorrectly identified as informative) for different numbers of imputed datasets. The proposed technique is simple to implement, performs as well in high-dimensional settings as in low-dimensional ones, and is a good competitor to existing approaches across the simulation settings. The performance of the different variable selection techniques is also examined on a real dataset with missing values.


Keywords: High-dimensional data · Multiple imputation · LASSO · Rubin’s rules · Variable selection
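The abstract describes the selection rule only in words. The following is a minimal sketch of the idea under simplifying assumptions that are not from the paper: a hand-written coordinate-descent lasso, mean-plus-noise imputation as a crude stand-in for proper multiple imputation (e.g. MICE), and arbitrary choices of penalty and selection threshold. A predictor is kept when the sum of the absolute values of its estimates across the M imputed datasets is nonzero, and the result is scored with the hit rate and false alarm rate defined in the abstract.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso: minimize 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with predictor j removed from the fit
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r
            # soft-thresholding update
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(0)
n, p, M = 100, 20, 5                      # M = number of imputed datasets
X_full = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]               # three informative predictors
y = X_full @ beta + rng.normal(size=n)

# introduce roughly 10% missing values completely at random
X_miss = X_full.copy()
X_miss[rng.random((n, p)) < 0.1] = np.nan

# stand-in "multiple imputation": column mean plus noise, repeated M times
col_means = np.nanmean(X_miss, axis=0)
abs_sums = np.zeros(p)
for _ in range(M):
    noise = rng.normal(scale=0.1, size=(n, p))
    X_imp = np.where(np.isnan(X_miss), col_means + noise, X_miss)
    abs_sums += np.abs(lasso_cd(X_imp, y, lam=10.0))

# keep a predictor if its summed |estimate| across imputations is nonzero
selected = set(np.flatnonzero(abs_sums > 1e-8))
informative = set(range(3))
hit_rate = len(selected & informative) / len(informative)
false_alarm_rate = len(selected - informative) / (p - len(informative))
```

With strong effects and mild missingness as above, the summed-magnitude rule retains all informative predictors; the false alarm rate then reflects how many noise predictors survive the lasso penalty across imputations.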




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Statistics, Government College University Faisalabad, Faisalabad, Pakistan
  2. Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
