Abstract
Multivariate data sets frequently have missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that an observation has an attribute value missing. A common approach to coping with missing values is to replace each one with a plausible value and then analyse the completed data set using standard methods. We evaluate the effect that some commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical datasets where observations have been made missing at random in different proportions. Our analysis finds that imputation using hot deck, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and Random Forest imputation (MissForest) performs similarly regardless of the amount of missing data and yields the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
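The simulation design described in the abstract (delete values at random in several proportions, impute, then measure classification accuracy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, column-mean imputation stands in for the methods studied (it is not hot deck, IRMI, FAMD or MissForest), the classifier is a simple nearest-centroid rule, and all function names are hypothetical.

```python
import random

def make_data(n_per_class=100, seed=0):
    """Two synthetic Gaussian classes in 3 dimensions (illustration only)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, centre in enumerate([(0.0, 0.0, 0.0), (3.0, 3.0, 3.0)]):
        for _ in range(n_per_class):
            rows.append([rng.gauss(c, 1.0) for c in centre])
            labels.append(label)
    return rows, labels

def mask_at_random(rows, proportion, seed=0):
    """Return a copy of rows with the given proportion of cells set to None.
    Cells are chosen uniformly at random (an assumed missingness mechanism)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = [row[:] for row in rows]
    for i, j in rng.sample(cells, round(proportion * len(cells))):
        masked[i][j] = None
    return masked

def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def nearest_centroid_accuracy(rows, labels):
    """Train on even-indexed rows, test on odd-indexed rows."""
    pairs = list(zip(rows, labels))
    train, test = pairs[0::2], pairs[1::2]
    centroids = {}
    for y in set(labels):
        members = [r for r, lab in train if lab == y]
        centroids[y] = [sum(col) / len(members) for col in zip(*members)]

    def predict(r):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(r, centroids[y])))

    return sum(predict(r) == y for r, y in test) / len(test)

rows, labels = make_data()
results = {}
for proportion in (0.0, 0.1, 0.3, 0.5):
    # Simplification: imputation is applied to the whole data set before
    # the train/test split.
    completed = impute_column_means(mask_at_random(rows, proportion))
    results[proportion] = nearest_centroid_accuracy(completed, labels)
    print(f"{proportion:.0%} missing: accuracy {results[proportion]:.3f}")
```

Replacing `impute_column_means` with another imputation routine and repeating the loop over several random seeds would give the kind of per-method, per-proportion accuracy comparison the abstract refers to.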
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hunt, L.A. (2017). Missing Data Imputation and Its Effect on the Accuracy of Classification. In: Palumbo, F., Montanari, A., Vichi, M. (eds) Data Science. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-55723-6_1
DOI: https://doi.org/10.1007/978-3-319-55723-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55722-9
Online ISBN: 978-3-319-55723-6
eBook Packages: Mathematics and Statistics (R0)