Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation?

Feelders, Ad

doi:10.1007/978-3-540-48247-5_38


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1704)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery


Abstract

In many applications of data mining, a sometimes considerable part of the data values is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle missing data in a rather ad hoc way, or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, which is based on filling in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, so that different data mining algorithms can be applied to the completed data sets. We compare the use of imputation to surrogate splits, as used in CART, to handle missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results, as it profits from variance reduction in much the same way as procedures such as bagging.
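
The idea of fitting one model per imputed data set and averaging the M models can be illustrated with a small sketch. It is not the procedure used in the paper: the imputation step below is a deliberately crude stand-in (random draws from a per-class normal distribution fitted to the observed values of each column) rather than simulation-based data augmentation, and the "tree" is a single-split stump. All function names (`impute_once`, `fit_stump`, `averaged_proba`) and the toy data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def impute_once(X, y, rng):
    """Return a completed copy of X. Each missing value (None) is replaced by a
    random draw from a normal distribution fitted to the observed values of the
    same column within the same class. ASSUMPTION: a simplified imputation
    model, not the data augmentation described in the abstract."""
    completed = [list(row) for row in X]
    for j in range(len(X[0])):
        for label in set(y):
            observed = [r[j] for r, c in zip(X, y) if c == label and r[j] is not None]
            mu, sigma = statistics.mean(observed), statistics.pstdev(observed)
            for row, c in zip(completed, y):
                if c == label and row[j] is None:
                    row[j] = rng.gauss(mu, sigma)
    return completed


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(X, y):
    """Fit a one-split tree (a stump) by minimizing weighted Gini impurity.
    Returns (feature, threshold, P(y=1 | left leaf), P(y=1 | right leaf))."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for t in values[:-1]:  # the split "x[j] <= t" keeps both leaves non-empty
            left = [c for row, c in zip(X, y) if row[j] <= t]
            right = [c for row, c in zip(X, y) if row[j] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, j, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict_proba(stump, x):
    """Probability of class 1 for a complete instance x under one stump."""
    j, t, p_left, p_right = stump
    return p_left if x[j] <= t else p_right


def averaged_proba(stumps, x):
    """Average the class-1 probabilities of the M stumps."""
    return sum(predict_proba(s, x) for s in stumps) / len(stumps)


# Toy training data with missing entries marked as None (illustrative only).
X = [[1.0, 5.0], [2.0, None], [None, 4.0], [3.0, 6.0],
     [7.0, 1.0], [8.0, None], [None, 2.0], [9.0, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

M = 5  # number of imputations, and therefore of fitted models
rng = random.Random(0)
stumps = [fit_stump(impute_once(X, y, rng), y) for _ in range(M)]

print(averaged_proba(stumps, [2.0, 5.0]))
print(averaged_proba(stumps, [8.0, 1.0]))
```

Because each imputation draws different values for the missing entries, the M fitted models can differ, and averaging their predictions smooths out that variation much as bagging averages over bootstrap samples. The imputation step is separate from the model fitting, so `fit_stump` could be swapped for any other learner that expects complete data.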


References

Blake, C., Keogh, E., Merz, C.J.: UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences (1999), http://www.ics.uci.edu/~mlearn/MLRepository.html
Breiman, L.: Bagging predictors. Machine Learning 26(2), 123–140 (1996)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth, Belmont, California (1984)
Rubin, D.B.: Multiple imputation after 18+ years. Journal of the American Statistical Association 91, 473–489 (1996)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall, London (1997)
Schafer, J.L., Olsen, M.K.: Multiple imputation for multivariate missing-data problems: a data analyst’s perspective. Multivariate Behavioral Research 33(4), 545–571 (1998)
Tanner, M.A.: Tools for Statistical Inference, 3rd edn. Springer, Heidelberg (1996)

Author information

Authors and Affiliations

CentER for Economic Research, Tilburg University, PO Box 90153, 5000 LE, Tilburg, The Netherlands
Ad Feelders


Editor information

Editors and Affiliations

Computer Science Department, UNC Charlotte, Charlotte, N.C. 28223 and Institute of Computer Science, Polish Academy of Sciences,
Jan M. Żytkow
Faculty of Informatics and Statistics, University of Economics, Prague, nám. W. Churchilla 4, 130 67, Prague, Czech Republic
Jan Rauch

About this paper

Cite this paper

Feelders, A. (1999). Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation? In: Żytkow, J.M., Rauch, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-48247-5_38


DOI: https://doi.org/10.1007/978-3-540-48247-5_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66490-1
Online ISBN: 978-3-540-48247-5
eBook Packages: Springer Book Archive
