Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm

Abstract

Learning from imbalanced data sets is a relatively new challenge for breast cancer diagnosis, where disease cases are often rare relative to the normal population. Traditional algorithms are accuracy-oriented and therefore biased toward the majority class, whereas combinations of sampling methods with ensemble classifiers have shown good performance. In this paper, a hybrid of cluster-based undersampling and boosted C5.0 is proposed. The proposed classification model consists of two phases: cluster analysis and classification. In the cluster analysis phase, the affinity propagation algorithm is used to determine the number of clusters, and k-means clustering is then used to select the border and informative samples. In the classification phase, the C5.0 algorithm is used in conjunction with boosting to leverage the strengths of the individual classifiers. The proposed algorithm is assessed on 14 benchmark imbalanced data sets taken from the UCI repository. Extensive experimental results on these data sets demonstrate that the proposed algorithm achieves better classification performance in terms of Matthews correlation coefficient (MCC) than existing imbalanced-data classification algorithms.
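The two-phase pipeline described in the abstract can be sketched in scikit-learn. This is a minimal illustration under assumptions, not the authors' implementation: affinity propagation fixes the number of clusters, k-means then keeps the "border" majority samples (here taken to be those farthest from their centroid), and AdaBoost over decision trees stands in for boosted C5.0, since no standard open-source Python C5.0 exists. The helper name `cluster_undersample` and the per-cluster keep budget are hypothetical.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def cluster_undersample(X_maj, n_keep_per_cluster=5):
    """Phase 1: affinity propagation determines k, then k-means selects
    border majority samples (those farthest from their cluster centroid)."""
    ap = AffinityPropagation(random_state=0).fit(X_maj)
    k = max(len(ap.cluster_centers_indices_), 2)  # guard against non-convergence
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)
    keep = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X_maj[idx] - km.cluster_centers_[c], axis=1)
        keep.extend(idx[np.argsort(dist)[-n_keep_per_cluster:]])  # border-most
    return np.array(keep)

# Toy imbalanced data: ~95% majority class
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)
maj, mnr = X[y == 0], X[y == 1]

kept = cluster_undersample(maj)
X_bal = np.vstack([maj[kept], mnr])
y_bal = np.r_[np.zeros(len(kept)), np.ones(len(mnr))]

# Phase 2: boosted decision trees (stand-in for boosted C5.0)
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                         n_estimators=50, random_state=0)
clf.fit(X_bal, y_bal)
print(clf.score(X_bal, y_bal))
```

The "farthest from centroid = border" heuristic is one plausible reading of "border and informative samples"; the paper's exact selection rule may differ.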



Author information


Corresponding author

Correspondence to Li Chen.

Additional information


Recommended by Associate Editor M. Jaead Khan under the direction of Editor Doo Yong Lee. This work was supported by the National Natural Science Foundation of China (61966029, 72061030), the Shaanxi Technology Committee Industrial Public Relations Project (2018GY-146), the Shaanxi Provincial Department of Education Scientific Research Project (19JK0996), the Yulin City Science and Technology Bureau (2019-93-3, 2019-77-3), and Yulin University (14YK37).

Jue Zhang received her M.S. degree in software engineering from Xidian University, Xi’an, China, in 2010. She is now pursuing a Ph.D. degree at Northwest University. Her research interests include machine learning, data mining, and imbalanced data classification.

Li Chen received her Ph.D. degree from Xidian University, Xi’an, China, in 2003. She is a Professor and Doctoral Supervisor at the School of Information Science and Technology, Northwest University, Xi’an, China. Her research interests include data mining, intelligent information processing, and intelligent control of big data. She has authored and co-authored over 100 papers in journals and conference proceedings.

Jian-xue Tian received his M.S. degree from the University of Electronic Science and Technology of China, Chengdu, Sichuan Province, China, in 2014. His current research interests include fuzzy systems and intelligent control.

Fazeel Abid received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He was a visiting internship student from Pakistan. His current research interests include sentiment analysis, social networks, and web mining from a deep learning perspective.

Wusi Yang received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He is currently a lecturer at Xianyang Normal University. His current research interests focus on machine learning, swarm intelligence optimization, and multi-objective optimization.

Xiao-fen Tang received her Ph.D. degree in computer science from Northwest University, Xi’an, China. She is currently an associate professor at Ningxia University. Her research interests are mainly in the areas of neural networks and bioinformatics. She has published several research papers in scholarly journals in these areas.


Cite this article

Zhang, J., Chen, L., Tian, Jx. et al. Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm. Int. J. Control Autom. Syst. (2021). https://doi.org/10.1007/s12555-019-1061-x


Keywords

  • Breast cancer diagnosis
  • cluster analysis
  • imbalanced data classification
  • sample selection
  • undersampling