Abstract
Unlike well-established benchmark data sets, most real-world data sets have a skewed class distribution: the number of instances of one class far exceeds that of the other classes. This uneven distribution produces imbalanced data sets, which pose a serious difficulty for learning procedures. Because of their intrinsically complex data characteristics, the analysis of such imbalanced data sets has also become an active avenue of research. An imbalanced class distribution can be handled effectively by oversampling the minority class, an approach that is usually independent of the classifier. An oversampling technique, the Clustering Minority Samples Over Sampling Technique (CMSOT), is proposed to enhance the classification of imbalanced data sets. The proposed technique is implemented on Apache Hadoop in a MapReduce environment. The data sets are drawn mainly from the UCI repository. The effect on true positive rates with respect to the imbalance ratio is studied, together with the improvement in classification obtained from the generated pool. The experimental results and the corresponding statistical analysis of the oversampled data sets indicate that the proposed technique outperforms the selected benchmark techniques.
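To make the terms in the abstract concrete, the following is a minimal sketch of the imbalance ratio and of a generic cluster-then-interpolate oversampler. It is not the CMSOT algorithm, whose details the abstract does not give, and it runs on a single machine rather than on Hadoop. The function names (`imbalance_ratio`, `kmeans`, `oversample_minority`), the use of k-means, and the linear interpolation step are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (illustrative); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires len(points) >= k
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def oversample_minority(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating inside clusters."""
    rng = random.Random(seed)
    assign = kmeans(minority, k, seed=seed)
    clusters = [[p for p, a in zip(minority, assign) if a == c] for c in range(k)]
    clusters = [c for c in clusters if c]  # drop empty clusters
    synthetic = []
    while len(synthetic) < n_new:
        cluster = rng.choice(clusters)
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()  # position on the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical toy data: 90 majority labels, 10 minority points in two groups.
minority = [[0.1 * i, 0.05 * i] for i in range(5)] + \
           [[5.0 + 0.1 * i, 5.0 - 0.05 * i] for i in range(5)]
labels = [0] * 90 + [1] * len(minority)
new_points = oversample_minority(minority, n_new=80)
balanced_labels = labels + [1] * len(new_points)
```

Interpolating only between points of the same cluster keeps synthetic samples near existing minority regions instead of in the space between them, which is the usual motivation for clustering before oversampling.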
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Patil, S., Sonavane, S. (2021). Investigation of Imbalanced Big Data Set Classification: Clustering Minority Samples Over Sampling Technique. In: Deshpande, P., Abraham, A., Iyer, B., Ma, K. (eds) Next Generation Information Processing System. Advances in Intelligent Systems and Computing, vol 1162. Springer, Singapore. https://doi.org/10.1007/978-981-15-4851-2_32
DOI: https://doi.org/10.1007/978-981-15-4851-2_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-4850-5
Online ISBN: 978-981-15-4851-2
eBook Packages: Intelligent Technologies and Robotics (R0)