Journal of Medical Systems, 43:17

A Systematic Mapping Study of Data Preparation in Heart Disease Knowledge Discovery

  • H. Benhar
  • A. Idri
  • J. L. Fernández-Alemán
Systems-Level Quality Improvement
Part of the following topical collections:
  1. Health Information Systems & Technologies


The increasing amount of data produced by biomedical and healthcare systems has created a need for knowledge data discovery (KDD) methodologies. Data mining (DM) offers a set of powerful techniques for identifying and extracting relevant information from medical datasets, which can greatly benefit doctors and patients, particularly for diseases with high mortality and morbidity rates, such as heart disease (HD). Nonetheless, raw medical data pose several challenges, such as missing data, noise, redundancy and high dimensionality, which make the extraction of useful and relevant information difficult. Intensive research has therefore recently begun on preparing raw healthcare data before knowledge extraction. In any KDD process, data preparation is the step prior to DM that deals with data imperfections in order to improve data quality, satisfy the requirements of DM techniques and improve their performance. The objective of this paper is to perform a systematic mapping study (SMS) of data preparation for KDD in cardiology so as to provide an overview of the quantity and type of research carried out in this respect. The SMS covered a set of 58 selected papers published between January 2000 and December 2017. The selected studies were analyzed according to six criteria: year and channel of publication, preparation task, medical task, DM objective, research type and empirical type. The results show that a large amount of data preparation research was carried out in order to improve the performance of DM-based decision support systems in cardiology. Researchers were mainly interested in the data reduction preparation task, and particularly in feature selection. Moreover, the majority of the selected studies focused on classification for the diagnosis of HD. Two main research types were identified in the selected studies: solution proposal and evaluation research. The most frequently used empirical type was historical-based evaluation.


Keywords: Data preparation, Heart disease, Knowledge discovery, Systematic map



This work was conducted within the research project MPHR-PPR1-2015-2017. The authors would like to thank the Moroccan MESRSFC and CNRST for their support. It is also a part of the GINSENG project (TIN2015-70259-C2-2-R) supported by the Spanish Ministry of Economy and Competitiveness and European FEDER funds.

Compliance with ethical standards

Conflict of interests

All the authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Rabat, Morocco
  2. Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Murcia, Spain
