Missing Data Imputation and Its Effect on the Accuracy of Classification

  • Conference paper
Data Science

Abstract

Multivariate data sets frequently have missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that an observation has an attribute value missing. A common way of coping with missing values is to replace each one with a plausible value and then analyse the completed data set using standard methods. We evaluate the effect that some commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical data sets in which observations have been made missing at random in different proportions. Our analysis finds that hot deck imputation, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and Random Forest imputation (MissForest) perform similarly regardless of the amount of missing data and have the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
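The simulation design described in the abstract (delete values at random in a chosen proportion, impute, then measure classification accuracy) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the synthetic two-class data, the nearest-centroid classifier, the single hold-out split, and every function name (`make_missing`, `mean_impute`, `hot_deck_impute`, `run_experiment`) are hypothetical stand-ins. Only mean imputation and a simple random hot deck are shown; IRMI, FAMD and MissForest are not implemented here.

```python
import random


def make_missing(rows, proportion, rng):
    """Copy rows, setting each value to None with probability `proportion`."""
    return [[None if rng.random() < proportion else v for v in row]
            for row in rows]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled


def hot_deck_impute(rows, rng):
    """Random hot deck: replace each None with a value taken from a
    randomly chosen donor row in which that column is observed."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        donors = [r[j] for r in rows if r[j] is not None]
        for r in filled:
            if r[j] is None:
                r[j] = rng.choice(donors) if donors else 0.0
    return filled


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on the training rows and return the
    proportion of test rows assigned to the correct class."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def run_experiment(proportion, seed=0):
    """Return accuracy per imputation method for one missing-data proportion."""
    rng = random.Random(seed)
    # Assumed synthetic data: 200 rows, 4 attributes, class 1 shifted by 2.0.
    labels = [i % 2 for i in range(200)]
    rows = [[rng.gauss(2.0 * y, 1.0) for _ in range(4)] for y in labels]
    holey = make_missing(rows, proportion, rng)
    completed_sets = {
        "mean": mean_impute(holey),
        "hot_deck": hot_deck_impute(holey, rng),
    }
    # Single hold-out split: first 150 rows train, last 50 rows test.
    return {
        name: nearest_centroid_accuracy(data[:150], labels[:150],
                                        data[150:], labels[150:])
        for name, data in completed_sets.items()
    }


if __name__ == "__main__":
    for p in (0.05, 0.15, 0.30):
        print(p, run_experiment(p))
```

Repeating `run_experiment` over many seeds and averaging the accuracies would give a mean percentage correctly classified per method and proportion, which is the kind of summary the abstract reports.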

Correspondence to Lynette A. Hunt.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hunt, L.A. (2017). Missing Data Imputation and Its Effect on the Accuracy of Classification. In: Palumbo, F., Montanari, A., Vichi, M. (eds) Data Science. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-55723-6_1
