Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data


Historical and real-time healthcare data sets are valuable sources of information for predictive data analytics. However, most historical healthcare data sets are beset with quality problems. One of the most frequent is missing values, which arise from inaccuracies in data transmission or data entry. An appropriate technique for handling missing values is required to produce good-quality data sets and, in turn, better prediction results. Removing the records with missing values, known as marginalization, is the easiest way out, but it reduces the volume of the historical data set and disturbs its class balance. An alternative to marginalization is imputation, which replaces missing values with plausible values. This paper proposes a missing value imputation technique, CLUSTIMP, that uses an unsupervised neural network, Adaptive Resonance Theory 2 (ART2). The efficiency of the proposed method is evaluated on the incomplete Mammographic Mass data set and the Hepatocellular Carcinoma (HCC) data set from the UCI repository, with Root Mean Squared Error (RMSE) and classification accuracy as the evaluation metrics. The proposed CLUSTIMP algorithm outperforms existing state-of-the-art imputation methods, reducing classifier error rates by between 2 and 11%.
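To make the contrast between marginalization and cluster-based imputation concrete, the following is a minimal sketch and not the CLUSTIMP algorithm itself. It assumes a simple leader clustering with a distance threshold as a stand-in for ART2 (the threshold plays a role only loosely comparable to a vigilance parameter), fills each missing value from the nearest cluster prototype, and scores the result with RMSE. The toy records, the threshold of 2.0, and all function names are hypothetical and chosen only for illustration.

```python
import math

def marginalize(rows):
    """Marginalization: drop every record that contains a missing value (None)."""
    return [r for r in rows if None not in r]

def centroid(members):
    """Column-wise mean of a list of complete records."""
    return [sum(col) / len(members) for col in zip(*members)]

def distance(row, proto):
    """Euclidean distance computed only over the features observed in `row`."""
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(row, proto) if x is not None))

def leader_clusters(complete_rows, threshold):
    """Leader clustering (assumed stand-in for ART2): a record joins the nearest
    cluster if its prototype is within `threshold`, otherwise it starts a new one."""
    clusters = []
    for row in complete_rows:
        if clusters:
            best = min(clusters, key=lambda c: distance(row, centroid(c)))
            if distance(row, centroid(best)) <= threshold:
                best.append(row)
                continue
        clusters.append([row])
    return [centroid(c) for c in clusters]

def impute(rows, threshold):
    """Fill each missing value with the matching feature of the nearest prototype."""
    prototypes = leader_clusters(marginalize(rows), threshold)
    filled = []
    for row in rows:
        proto = min(prototypes, key=lambda p: distance(row, p))
        filled.append([p if x is None else x for x, p in zip(row, proto)])
    return filled

def rmse(true_vals, imputed_vals):
    """Root Mean Squared Error between true and imputed values."""
    squared = [(t - i) ** 2 for t, i in zip(true_vals, imputed_vals)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data: two well-separated groups, two records with a gap.
rows = [
    [1.0, 2.0],
    [1.2, 1.8],
    [8.0, 9.0],
    [8.2, 9.2],
    [1.1, None],   # assumed true value of the missing entry: 2.1
    [None, 9.1],   # assumed true value of the missing entry: 7.9
]

kept = marginalize(rows)            # marginalization shrinks the data set
filled = impute(rows, threshold=2.0)  # imputation keeps every record
error = rmse([2.1, 7.9], [filled[4][1], filled[5][0]])
print(len(rows), len(kept), error)
```

In this sketch marginalization discards the two incomplete records, whereas imputation keeps all six and lets the quality of the filled-in values be measured directly against known ground truth, which is the role RMSE plays as an evaluation metric.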





Author information



Corresponding author

Correspondence to K. Shobha.


About this article


Cite this article

Shobha, K., Savarimuthu, N. Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data. J Ambient Intell Human Comput (2020).



  • Missing values
  • Imputation
  • Unsupervised
  • Neural network
  • Classifiers