Patients Stratification in Imbalanced Datasets: A Roadmap

  • Chiheb Karray
  • Nebras Gharbi
  • Mohamed Jmaiel
Conference paper
Part of the Advances in Predictive, Preventive and Personalised Medicine book series (APPPM, volume 10)


Learning in an imbalanced context is characterized by a highly disproportionate number of data instances in each class of the dataset. Assigning the correct class to each instance is well studied with supervised learning techniques. However, the study of the same phenomenon in unsupervised learning environments lags behind. This paper highlights some of the main issues hindering the application of unsupervised learning techniques (clustering techniques) in an imbalanced data setting. It also presents a solution to these issues. The solution avoids the observed drawbacks by employing a different set of clustering algorithms and combining them in an aggregated learning framework. These algorithms would be assessed with measures tailored to the nature of the techniques and to the unique constraints that the imbalanced learning environment imposes. The suggested framework is intended to be applied to the patients stratification problem.
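
To make the kind of issue mentioned above concrete, the following is a minimal sketch, not the authors' method. The abstract does not name a specific algorithm; k-means is used here as an assumed example of a clustering technique that can struggle with imbalance. The data, the function names (kmeans_1d, uniform_effect_demo), and the initial centroids are all hypothetical and chosen only for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            for x in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[k]
            )
        if new_centroids == centroids:  # assignments no longer change
            break
        centroids = new_centroids
    return labels, centroids


def uniform_effect_demo():
    # Synthetic, made-up data (an assumption for illustration only):
    # 100 majority points spread over [0, 9.9] and 5 minority points near 12.
    majority = [i * 0.1 for i in range(100)]
    minority = [12.0 + i * 0.1 for i in range(5)]
    points = majority + minority
    # Deterministic start: one centroid at each extreme of the data.
    labels, centroids = kmeans_1d(points, [min(points), max(points)])
    sizes = [labels.count(k) for k in range(2)]
    minority_cluster = labels[-1]
    # Majority points that ended up in the same cluster as the minority.
    shared = sum(1 for lab in labels[:100] if lab == minority_cluster)
    return labels, centroids, sizes, shared


if __name__ == "__main__":
    labels, centroids, sizes, shared = uniform_effect_demo()
    print("cluster sizes:", sizes)
    print("majority points sharing the minority cluster:", shared)
```

On this toy data the two clusters come out at broadly similar sizes, and the small group is merged with part of the large one instead of being isolated. This is one way an imbalanced setting can mislead a clustering algorithm that minimizes within-cluster variance.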


Keywords: Imbalanced data; Clustering; Patients stratification; Ensemble learning



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Chiheb Karray (1, 2)
  • Nebras Gharbi (3, 1)
  • Mohamed Jmaiel (1, 2)

  1. Digital Research Centre of Sfax, Sfax, Tunisia
  2. ReDCAD Laboratory, Sfax University, Sfax, Tunisia
  3. MIRACL Laboratory, University of Sfax, Sfax, Tunisia
