Missing Data Imputation and Its Effect on the Accuracy of Classification

Hunt, Lynette A.

doi:10.1007/978-3-319-55723-6_1

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Multivariate data sets frequently have missing observations scattered throughout the data set. Many machine learning algorithms assume that there is no particular significance in the fact that a particular observation has an attribute value missing. A common approach to coping with these missing values is to replace each one with a plausible value and then analyse the completed data set using standard methods. We evaluate the effect that some commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical data sets in which observations have been made missing at random in different proportions. Our analysis finds that imputation using hot deck, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and Random Forest Imputation (MissForest) performs similarly regardless of the amount of missing data, and that these methods have the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
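The simulation design described in the abstract (delete values in a chosen proportion, impute, classify, record the percentage correctly classified) can be illustrated with a minimal sketch. It is not the chapter's code or data: the synthetic two-class data, the nearest-centroid classifier, the alternating train/test split and all function names are assumptions made for illustration. Values are deleted completely at random, and only two simple imputers are shown: column-mean imputation and a simplified random hot deck that draws a donor value from the same column without matching on other variables. IRMI, FAMD and MissForest are not reproduced here.

```python
import random
import statistics


def make_data(n_per_class=100, seed=0):
    """Synthetic two-class data with two numeric features (illustrative only)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, centre in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            rows.append([rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)])
            labels.append(label)
    return rows, labels


def delete_at_random(rows, proportion, seed=0):
    """Replace each value by None independently with probability `proportion`."""
    rng = random.Random(seed)
    return [[None if rng.random() < proportion else v for v in row] for row in rows]


def impute_mean(rows):
    """Fill each missing value with the mean of the observed values in its column.

    Assumes every column keeps at least one observed value.
    """
    means = [statistics.fmean(v for v in col if v is not None) for col in zip(*rows)]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def impute_hot_deck(rows, seed=0):
    """Simplified random hot deck: draw a donor value from the same column."""
    rng = random.Random(seed)
    donors = [[v for v in col if v is not None] for col in zip(*rows)]
    return [
        [rng.choice(donors[j]) if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def centroid_accuracy(rows, labels):
    """Train a nearest-centroid classifier on even rows, score it on odd rows."""
    pairs = list(zip(rows, labels))
    train, test = pairs[0::2], pairs[1::2]
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in train if y == cls]
        centroids[cls] = [statistics.fmean(col) for col in zip(*members)]

    def predict(row):
        # Pick the class whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return sum(predict(r) == y for r, y in test) / len(test)


if __name__ == "__main__":
    rows, labels = make_data()
    for proportion in (0.0, 0.1, 0.3, 0.5):
        holed = delete_at_random(rows, proportion, seed=1)
        for name, impute in (("mean", impute_mean), ("hot deck", impute_hot_deck)):
            acc = centroid_accuracy(impute(holed), labels)
            print(f"missing={proportion:.0%}  {name:8s}  accuracy={acc:.3f}")
```

Running the script prints one accuracy figure per combination of missing proportion and imputer. A fuller study along the lines of the abstract would repeat each setting over many random deletions and average the results, rather than rely on a single draw as this sketch does.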

Author information

Authors and Affiliations

University of Waikato, Hamilton, New Zealand
Lynette A. Hunt

Corresponding author

Correspondence to Lynette A. Hunt.

Editor information

Editors and Affiliations

Department of Political Sciences, University of Naples Federico II, Napoli, Italy
Francesco Palumbo
Department of Statistical Sciences Paolo Fortunati, Alma Mater Studiorum, University of Bologna, Bologna, Italy
Angela Montanari
Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
Maurizio Vichi

About this paper

Cite this paper

Hunt, L.A. (2017). Missing Data Imputation and Its Effect on the Accuracy of Classification. In: Palumbo, F., Montanari, A., Vichi, M. (eds) Data Science. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-55723-6_1

DOI: https://doi.org/10.1007/978-3-319-55723-6_1
Published: 05 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55722-9
Online ISBN: 978-3-319-55723-6
eBook Packages: Mathematics and Statistics (R0)
