Abstract
In many data mining applications, part of the data values, sometimes a considerable part, is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle it in a rather ad hoc way or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, an approach based on filling in (imputing) one or more plausible values for each missing value. One advantage of this approach is that the imputation phase is separated from the analysis phase, so that different data mining algorithms can be applied to the completed data sets. We compare imputation with surrogate splits, as used in CART, for handling missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of the predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results, as it profits from variance reduction in much the same way as procedures such as bagging.
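The idea of fitting one model per completed data set and averaging the M resulting models can be illustrated with a minimal sketch. It is not the procedure used in the paper: the model-based data augmentation step is replaced here by simple random draws from the observed values of each column, and a full tree is replaced by a one-split stump. All function names (`impute_once`, `fit_stump`, `fit_with_multiple_imputation`, `predict_average`) and the toy data are hypothetical.

```python
import random
from statistics import mean


def impute_once(rows, rng):
    """Return a completed copy of rows; each None is replaced by a random
    draw from the observed values of the same column.
    Assumption: a crude stand-in for model-based imputation, and every
    column must contain at least one observed value."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(r)]
        for r in rows
    ]


def fit_stump(X, y):
    """Fit a one-split tree on complete data by minimizing squared error.
    Returns (column, threshold, left_mean, right_mean)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between distinct neighbours,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every column is constant: predict the overall mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def fit_with_multiple_imputation(X, y, m=5, seed=0):
    """Create m completed data sets and fit one stump on each."""
    rng = random.Random(seed)
    return [fit_stump(impute_once(X, rng), y) for _ in range(m)]


def predict_average(stumps, row):
    """Average the m predictions; with 0/1 labels this is an averaged
    estimate of the class-1 probability. The row must be complete."""
    return mean(predict_stump(s, row) for s in stumps)


# Toy data: two features with missing entries (None) and a binary label.
X = [[1.0, 5.0], [2.0, None], [None, 6.0], [8.0, 1.0], [9.0, None], [7.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

stumps = fit_with_multiple_imputation(X, y, m=5, seed=0)
p = predict_average(stumps, [1.5, 5.5])
```

Because imputation happens before any model is fitted, `fit_stump` could be swapped for another learner without touching `impute_once`, which mirrors the separation of imputation and analysis phases described above.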
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Feelders, A. (1999). Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation? In: Żytkow, J.M., Rauch, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-48247-5_38
DOI: https://doi.org/10.1007/978-3-540-48247-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66490-1
Online ISBN: 978-3-540-48247-5
eBook Packages: Springer Book Archive