Learning from imbalanced data sets is a relatively new challenge in breast cancer diagnosis, where disease cases are often rare relative to the normal population. Traditional algorithms are accuracy-oriented, which biases them toward the majority class. Combinations of sampling methods with ensemble classifiers have shown good performance. In this paper, a hybrid of cluster-based undersampling and the boosted C5.0 algorithm is proposed. The proposed classification model consists of two phases: cluster analysis and classification. In the cluster analysis phase, the affinity propagation algorithm is used to determine the number of clusters, and k-means clustering is then applied to select the border and informative samples. In the classification phase, the C5.0 algorithm is used in conjunction with a boosting technique to leverage the strengths of the individual classifiers. The proposed algorithm is assessed on 14 benchmark imbalanced data sets taken from the UCI repository. Extensive experimental results on these data sets demonstrate that the proposed algorithm achieves better classification performance in terms of the Matthews correlation coefficient (MCC) than existing imbalanced-data classification algorithms.
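The two-phase pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: affinity propagation is used only to pick the number of clusters for the majority class, k-means then partitions it, and the points farthest from each centroid are kept as a rough proxy for the paper's "border and informative" samples (the exact selection rule is not reproduced here). Since C5.0 has no standard Python implementation, AdaBoost over decision trees stands in for boosted C5.0.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = 0
# Synthetic imbalanced data (~90% majority) standing in for a UCI data set.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9],
                           flip_y=0.02, random_state=rng)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=rng)
X_maj = X_tr[y_tr == 0]  # majority (normal) class
X_min = X_tr[y_tr == 1]  # minority (disease) class

# Phase 1a: affinity propagation defines the number of clusters k.
ap = AffinityPropagation(random_state=rng).fit(X_maj)
k = max(len(ap.cluster_centers_indices_), 2)  # fallback if AP collapses

# Phase 1b: k-means partitions the majority class; from each cluster keep
# the ~30% of samples farthest from the centroid (a border-sample proxy).
km = KMeans(n_clusters=k, n_init=10, random_state=rng).fit(X_maj)
dists = np.linalg.norm(X_maj - km.cluster_centers_[km.labels_], axis=1)
keep_idx = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    n_keep = max(1, int(0.3 * len(members)))
    keep_idx.extend(members[np.argsort(-dists[members])][:n_keep])
keep_idx = np.asarray(keep_idx)

# Rebalanced training set: all minority samples + undersampled majority.
X_res = np.vstack([X_min, X_maj[keep_idx]])
y_res = np.concatenate([np.ones(len(X_min)), np.zeros(len(keep_idx))])

# Phase 2: boosted tree ensemble, evaluated with MCC as in the paper.
clf = AdaBoostClassifier(n_estimators=50, random_state=rng).fit(X_res, y_res)
mcc = matthews_corrcoef(y_te, clf.predict(X_te))
```

The 30% retention fraction and the farthest-from-centroid rule are illustrative assumptions; the paper's own selection criterion should be consulted for the actual method.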
Recommended by Associate Editor M. Jaead Khan under the direction of Editor Doo Yong Lee. This work was supported by the National Natural Science Foundation of China (61966029, 72061030), the Shaanxi Technology Committee Industrial Public Relation Project (2018GY-146), the Shaanxi Provincial Department of Education Scientific Research Project (19JK0996), the Yulin City Science and Technology Bureau (2019-93-3, 2019-77-3), and Yulin University (14YK37).
Jue Zhang received her M.S. degree in software engineering from Xidian University, Xi’an, China, in 2010. She is currently pursuing a Ph.D. degree at Northwest University. Her research interests include machine learning, data mining, and imbalanced data classification.
Li Chen received her Ph.D. degree from Xidian University, Xi’an, China, in 2003. She is a Professor and Doctoral Supervisor in the School of Information Science and Technology at Northwest University, Xi’an, China. Her research interests include data mining, intelligent information processing, and intelligent control of big data. She has authored and co-authored over 100 papers in journals and conference proceedings.
Jian-xue Tian received his M.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2014. His current research interests include fuzzy systems and intelligent control.
Fazeel Abid received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He was a visiting internship student from Pakistan. His current research interests include sentiment analysis, social networks, and web mining from a deep learning perspective.
Wusi Yang received his Ph.D. degree from Northwest University, Xi’an, China, in 2020. He is currently a lecturer at Xianyang Normal University. His current research interests focus on machine learning, swarm intelligence optimization, and multi-objective optimization.
Xiao-fen Tang received her Ph.D. degree in computer science from Northwest University, Xi’an, China. She is currently an associate professor at Ningxia University. Her research interests are mainly in the areas of neural networks and bioinformatics. She has published several research papers in scholarly journals in these areas.
Cite this article
Zhang, J., Chen, L., Tian, Jx. et al. Breast Cancer Diagnosis Using Cluster-based Undersampling and Boosted C5.0 Algorithm. Int. J. Control Autom. Syst. (2021). https://doi.org/10.1007/s12555-019-1061-x
- Breast cancer diagnosis
- cluster analysis
- imbalanced data classification
- sample selection