Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm

Abstract

Learning from imbalanced data sets is a relatively new challenge for breast cancer diagnosis, where disease cases are often rare relative to the normal population. Traditional algorithms are accuracy-oriented and therefore biased toward the majority class, whereas combinations of sampling methods with ensemble classifiers have shown good performance. In this paper, a hybrid of cluster-based undersampling and boosted C5.0 is proposed. The proposed classification model consists of two phases: cluster analysis and classification. In the cluster analysis phase, the affinity propagation algorithm is used to determine the number of clusters, and k-means clustering is then used to select the border and informative samples. In the classification phase, the C5.0 algorithm is used in conjunction with boosting to leverage the strengths of the individual classifiers. The proposed algorithm is assessed on 14 benchmark imbalanced data sets taken from the UCI repository. Extensive experimental results on these data sets demonstrate that the proposed algorithm achieves better classification performance in terms of Matthews correlation coefficient (MCC) than existing imbalanced-data classification algorithms.
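The two-phase pipeline described in the abstract can be sketched in scikit-learn. This is a minimal illustration under assumptions, not the authors' implementation: affinity propagation fixes the number of clusters, k-means then keeps the "border" majority samples (here taken to be those farthest from their centroid), and AdaBoost over decision trees stands in for boosted C5.0, since no standard open-source Python C5.0 exists. The helper name `cluster_undersample` and the per-cluster keep budget are hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def cluster_undersample(X_maj, n_keep_per_cluster=5):
    """Phase 1: affinity propagation determines k, then k-means selects
    border majority samples (those farthest from their cluster centroid)."""
    ap = AffinityPropagation(random_state=0).fit(X_maj)
    k = max(len(ap.cluster_centers_indices_), 2)  # guard against non-convergence
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)
    keep = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X_maj[idx] - km.cluster_centers_[c], axis=1)
        keep.extend(idx[np.argsort(dist)[-n_keep_per_cluster:]])  # border-most
    return np.array(keep)

# Toy imbalanced data: ~95% majority class
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)
maj, mnr = X[y == 0], X[y == 1]

kept = cluster_undersample(maj)
X_bal = np.vstack([maj[kept], mnr])
y_bal = np.r_[np.zeros(len(kept)), np.ones(len(mnr))]

# Phase 2: boosted decision trees (stand-in for boosted C5.0)
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                         n_estimators=50, random_state=0)
clf.fit(X_bal, y_bal)
print(clf.score(X_bal, y_bal))
```

The "farthest from centroid = border" heuristic is one plausible reading of "border and informative samples"; the paper's exact selection rule may differ.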



Author information


Corresponding author

Correspondence to Li Chen.

Additional information


Recommended by Associate Editor M. Jaead Khan under the direction of Editor Doo Yong Lee. This work was supported by the National Natural Science Foundation of China (61966029, 72061030), the Shaanxi Technology Committee Industrial Public Relations Project (2018GY-146), the Shaanxi Provincial Department of Education Scientific Research Project (19JK0996), the Yulin City Science and Technology Bureau (2019-93-3, 2019-77-3), and Yulin University (14YK37).

Jue Zhang received her M.S. degree in software engineering from Xidian University, Xi’an, China, in 2010. She is now pursuing a Ph.D. degree at Northwest University. Her research interests include machine learning, data mining, and imbalanced data classification.

Li Chen received her Ph.D. degree from Xidian University, Xi’an, China, in 2003. She is a Professor and Doctoral Supervisor at the School of Information Science and Technology, Northwest University, Xi’an, China. Her research interests include data mining, intelligent information processing, and intelligent control of big data. She has authored and co-authored over 100 papers in journals and conference proceedings.

Jian-xue Tian received his M.S. degree from the University of Electronic Science and Technology of China, Chengdu, Sichuan Province, China, in 2014. His current research interests include fuzzy systems and intelligent control.

Fazeel Abid received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He was a visiting internship student from Pakistan. His current research interests include sentiment analysis, social networks, and web mining from a deep learning perspective.

Wusi Yang received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He is currently a lecturer at Xianyang Normal University. His current research interests focus on machine learning, swarm intelligence optimization, and multi-objective optimization.

Xiao-fen Tang received her Ph.D. degree in computer science from Northwest University, Xi’an, China. She is currently an associate professor at Ningxia University. Her research interests are mainly in the areas of neural networks and bioinformatics. She has published several research papers in scholarly journals in these areas.


Cite this article

Zhang, J., Chen, L., Tian, Jx. et al. Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm. Int. J. Control Autom. Syst. (2021). https://doi.org/10.1007/s12555-019-1061-x


Keywords

  • Breast cancer diagnosis
  • cluster analysis
  • imbalanced data classification
  • sample selection
  • undersampling