Influence of Data Distribution in Missing Data Imputation

Santos, Miriam Seoane; Soares, Jastin Pompeu; Henriques Abreu, Pedro; Araújo, Hélder; Santos, João

doi:10.1007/978-3-319-59758-4_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10259)

Included in the following conference series:

Conference on Artificial Intelligence in Medicine in Europe

Abstract

Dealing with missing data is a crucial step in the preprocessing stage of most data mining projects. In healthcare contexts especially, addressing this issue is fundamental, since it may determine whether critical patient information that can help physicians in their daily clinical practice is kept or lost. Over the years, many researchers have addressed this problem by implementing a set of imputation techniques and evaluating their performance in classification tasks. These classic approaches, however, do not consider some intrinsic properties of the data that could be related to the performance of those algorithms, such as the distribution of each feature. Establishing a correspondence between data distribution and the most suitable imputation method avoids the need to repeatedly test a large set of methods, since it provides a heuristic for the best choice for each feature in the study. The goal of this work is to understand the relationship between data distribution and the performance of well-known imputation techniques, namely Mean, Decision Trees, k-Nearest Neighbours, Self-Organizing Maps and Support Vector Machines imputation. Several publicly available datasets, all complete, were selected according to characteristics such as the number of distributions, features and instances. Missing values were artificially generated at different percentages, and the imputation methods were evaluated in terms of Predictive and Distributional Accuracy. Our findings show that there is a relationship between features' distribution and algorithms' performance, although some factors must be taken into account, such as the number of features per distribution and the missing rate at stake.
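
The evaluation loop described in the abstract (remove values from a complete feature, impute them, then score the result) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic exponential feature, the completely-at-random removal, the use of RMSE as a stand-in for Predictive Accuracy, and the use of the Kolmogorov-Smirnov distance as a stand-in for Distributional Accuracy. Only mean imputation is shown; the helper names are hypothetical.

```python
import random
import statistics
from bisect import bisect_right


def inject_mcar(values, rate, rng):
    """Return a copy of `values` with round(rate * n) entries set to None,
    chosen completely at random (an assumed missingness mechanism)."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Replace each None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def rmse_on_missing(true, masked, imputed):
    """Predictive-accuracy proxy: RMSE over the entries that were removed."""
    errs = [(t - p) ** 2 for t, m, p in zip(true, masked, imputed) if m is None]
    return (sum(errs) / len(errs)) ** 0.5


def ks_distance(a, b):
    """Distributional-accuracy proxy: largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


rng = random.Random(0)
# Synthetic right-skewed feature; the paper uses real, complete datasets instead.
skewed = [rng.expovariate(1.0) for _ in range(500)]

for rate in (0.1, 0.3, 0.5):
    masked = inject_mcar(skewed, rate, rng)
    imputed = mean_impute(masked)
    print(rate, rmse_on_missing(skewed, masked, imputed), ks_distance(skewed, imputed))
```

Because mean imputation places every imputed value at a single point, one would expect the distributional score to degrade as the missing rate grows, which is the kind of interaction between distribution, method and missing rate that the study examines.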

Acknowledgments

This article is a result of the project NORTE-01-0145-FEDER-000027, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

Author information

Authors and Affiliations

Department of Informatics Engineering, Faculty of Sciences and Technology, CISUC, University of Coimbra, Coimbra, Portugal
Miriam Seoane Santos, Jastin Pompeu Soares & Pedro Henriques Abreu
Department of Electrical and Computer Engineering, Faculty of Sciences and Technology, ISR, University of Coimbra, Coimbra, Portugal
Hélder Araújo
IPO-Porto Research Centre (CI-IPOP), Porto, Portugal
João Santos

Corresponding author

Correspondence to Pedro Henriques Abreu.

Editor information

Editors and Affiliations

Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Annette ten Teije
Medical University of Vienna, Vienna, Austria
Christian Popow
University of Pennsylvania, Philadelphia, Pennsylvania, USA
John H. Holmes
University of Pavia, Pavia, Italy
Lucia Sacchi

About this paper

Cite this paper

Santos, M.S., Soares, J.P., Henriques Abreu, P., Araújo, H., Santos, J. (2017). Influence of Data Distribution in Missing Data Imputation. In: ten Teije, A., Popow, C., Holmes, J., Sacchi, L. (eds) Artificial Intelligence in Medicine. AIME 2017. Lecture Notes in Computer Science, vol 10259. Springer, Cham. https://doi.org/10.1007/978-3-319-59758-4_33

DOI: https://doi.org/10.1007/978-3-319-59758-4_33
Published: 30 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59757-7
Online ISBN: 978-3-319-59758-4
eBook Packages: Computer Science, Computer Science (R0)
