Influence of Data Distribution in Missing Data Imputation

  • Miriam Seoane Santos
  • Jastin Pompeu Soares
  • Pedro Henriques Abreu (email author)
  • Hélder Araújo
  • João Santos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10259)


Dealing with missing data is a crucial step in the preprocessing stage of most data mining projects. In healthcare contexts especially, addressing this issue is fundamental, since it may determine whether critical patient information that can help physicians in their daily clinical practice is kept or lost. Over the years, many researchers have addressed this problem by implementing a set of imputation techniques and evaluating their performance in classification tasks. These classic approaches, however, do not consider intrinsic data information that could be related to the performance of those algorithms, such as the distribution of each feature. Establishing a correspondence between data distribution and the most suitable imputation method avoids the need to repeatedly test a large set of methods, since it provides a heuristic for the best choice for each feature in the study. The goal of this work is to understand the relationship between data distribution and the performance of well-known imputation techniques: Mean, Decision Trees, k-Nearest Neighbours, Self-Organizing Maps and Support Vector Machines imputation. Several publicly available datasets, all complete, were selected according to characteristics such as the number of distributions, features and instances. Missing values were artificially generated at different percentages and the imputation methods were evaluated in terms of Predictive and Distributional Accuracy. Our findings show that there is a relationship between the distribution of the features and the performance of the algorithms, although some factors must be taken into account, such as the number of features per distribution and the missing rate at stake.
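The evaluation protocol summarised above (start from complete data, remove values artificially at several rates, impute, then compare the imputed values with the hidden true ones) can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the toy two-feature dataset, the completely-at-random missingness, the use of RMSE as a stand-in for predictive accuracy and of the two-sample Kolmogorov-Smirnov distance as a stand-in for distributional accuracy, and all function names. Only Mean and a simple k-Nearest Neighbours imputation are shown.

```python
import bisect
import math
import random
import statistics


def make_data(n, seed=0):
    """Toy complete dataset (assumed): x is uniform, y is right-skewed in x."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.exp(2.0 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


def mask_at_random(n, rate, seed=1):
    """Choose round(rate * n) row indices uniformly at random (assumed MCAR)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), round(rate * n)))


def impute_mean(xs, ys, missing):
    """Fill every missing y with the mean of the observed y values."""
    hidden = set(missing)
    observed = [y for i, y in enumerate(ys) if i not in hidden]
    return [statistics.fmean(observed)] * len(missing)


def impute_knn(xs, ys, missing, k=5):
    """Fill each missing y with the mean y of its k nearest observed rows in x."""
    hidden = set(missing)
    observed = [(xs[i], ys[i]) for i in range(len(xs)) if i not in hidden]
    filled = []
    for i in missing:
        nearest = sorted(observed, key=lambda row: abs(row[0] - xs[i]))[:k]
        filled.append(statistics.fmean([y for _, y in nearest]))
    return filled


def rmse(true, pred):
    """Root mean squared error between true and imputed values."""
    return math.sqrt(statistics.fmean([(t - p) ** 2 for t, p in zip(true, pred)]))


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between the ECDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, value):
        return bisect.bisect_right(sample, value) / len(sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in a + b)


def evaluate(rate, n=400):
    """Hide values of y at the given rate and score both imputation methods."""
    xs, ys = make_data(n)
    missing = mask_at_random(n, rate)
    true = [ys[i] for i in missing]
    scores = {}
    for name, method in (("mean", impute_mean), ("knn", impute_knn)):
        pred = method(xs, ys, missing)
        scores[name] = {"rmse": rmse(true, pred), "ks": ks_distance(true, pred)}
    return scores


if __name__ == "__main__":
    for rate in (0.05, 0.20, 0.40):
        for name, score in evaluate(rate).items():
            print(f"rate={rate:.2f} {name}: "
                  f"rmse={score['rmse']:.3f} ks={score['ks']:.3f}")
```

On this toy skewed feature, mean imputation collapses every missing value onto a single point, so it distorts the distribution of the imputed values far more than the neighbour-based fill does. That is the kind of distribution-dependent behaviour the study sets out to characterise; the numbers printed here say nothing about the paper's actual results.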


Keywords: Missing data; Machine learning imputation; Data distribution; Healthcare contexts



This article is a result of the project NORTE-01-0145-FEDER-000027, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Miriam Seoane Santos (1)
  • Jastin Pompeu Soares (1)
  • Pedro Henriques Abreu (1, email author)
  • Hélder Araújo (2)
  • João Santos (3)

  1. Department of Informatics Engineering, Faculty of Sciences and Technology, CISUC, University of Coimbra, Coimbra, Portugal
  2. Department of Electrical and Computer Engineering, Faculty of Sciences and Technology, ISR, University of Coimbra, Coimbra, Portugal
  3. IPO-Porto Research Centre (CI-IPOP), Porto, Portugal
