Abstract
There can be various aspects that can affect the performance of a machine learning classifier, among which the unbalanced dataset is the most prominent. The unbalanced dataset is the one in which there is a disproportion among classes i.e. instances belonging to the one class heavily outnumber instances belonging to all other classes. This problem of the unbalanced dataset is more common in medical data as it is collected from the real world where the number of persons affected by the disease will always be less than the non-affected persons. Due to this disproportion among the classes, classifiers face difficulties in learning concepts related to the class in minority. Most of all data balancing techniques are created keeping general data in mind and are not viable for medical data. In this paper, a method is proposed that helps balance medical data more effectively and at the same time increase performance and decrease the leaning time for the classifier.
K. Kalra and R. Goyal—These authors contributed equally to this work.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Liu, T.: Feature selection based on mutual information for gear faultydiagnosis on the imbalanced dataset. J. Comput. Inf. Syst. 8(18), 7831–7838 (2012)
Mena, L., Gonzalez, J.A.: Symbolic one-class learning from imbalanced datasets: application in medical diagnosis. Int. J. Artif. Intell. Tools 18(02), 273–309 (2009)
Cieslak, D.A., Chawla, N.V., Striegel, A.: Combating imbalance in network intrusion datasets. In: GrC, pp. 732–737, May 2006
Thomas, C.: Improving intrusion detection for imbalanced network traffic. Secur. Commun. Netw. 6(3), 309–324 (2013)
Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. ACM SIGKDD Explor. Newsl. 6(1), 80–89 (2004)
Li, Y., Sun, G., Zhu, Y.: Data imbalance problem in text classification. In: 2010 Third International Symposium on Information Processing (ISIP), pp. 301–305. IEEE, October 2010
Perols, J.: Financial statement fraud detection: an analysis of statistical and machine learning algorithms. Audit. J. Pract. Theory 30(2), 19–50 (2011)
Ghazikhani, A., Monsefi, R., Yazdi, H.S.: Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing 122, 535–544 (2013)
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(4), 463–484 (2012)
Qian, Y., Liang, Y., Li, M., Feng, G., Shi, X.: A resampling ensemble algorithm for classification of imbalance problems. Neurocomputing 143, 57–67 (2014)
Pearson, R., Goney, G., Shwaber, J.: Imbalanced clustering for microarray time-series. In: Proceedings of the ICML, vol. 3 (2003)
Sun, Y., Kamel, M.S., Wang, Y.: Boosting for learning multiple classes with imbalanced class distribution. In: Sixth International Conference on Data Mining, ICDM 2006, pp. 592–602. IEEE, December 2006
Chen, K., Lu, B.L., Kwok, J.T.: Efficient classification of multi-label and imbalanced data using min-max modular classifiers. In: International Joint Conference on Neural Networks, IJCNN 2006, pp. 1770–1775. IEEE, July 2006
Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington (2016)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 3, 408–421 (1972)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IEEE International Joint Conference on Neural Networks, IJCNN 2008. IEEE World Congress on Computational Intelligence, pp. 1322–1328. IEEE, June 2008
Drummond, C., Holte, R.C.: C4. 5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on Learning from Imbalanced Datasets II, vol. 11, pp. 1–8. Citeseer, Washington DC, August 2003
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kalra, K., Goyal, R., Kaur, S., Kumar, P. (2018). PDD Algorithm for Balancing Medical Data. In: Singh, M., Gupta, P., Tyagi, V., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2018. Communications in Computer and Information Science, vol 905. Springer, Singapore. https://doi.org/10.1007/978-981-13-1810-8_26
Download citation
DOI: https://doi.org/10.1007/978-981-13-1810-8_26
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1809-2
Online ISBN: 978-981-13-1810-8
eBook Packages: Computer ScienceComputer Science (R0)