Abstract
Data stream mining has become very popular in recent years, as advanced electronic devices generate continuous data streams. The performance of standard learning algorithms is compromised by the class imbalance present in real-world data streams. In this paper we propose a novel algorithm, incremental oversampling for data streams (IOSDS), which uses a unique oversampling technique to almost balance the data sets and thereby minimize the effect of imbalance on the stream mining process. The experimental analysis is conducted on 15 data chunks of data streams with varied sizes and different imbalance ratios. The results suggest that the proposed IOSDS algorithm improves knowledge discovery over benchmark algorithms such as C4.5 and Hoeffding tree in terms of standard performance measures, namely accuracy, AUC, precision, recall and F-measure.
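To illustrate the general idea of almost balancing a data chunk by oversampling, a minimal sketch follows. It is not the IOSDS algorithm, whose details the abstract does not give; it substitutes plain random duplication of minority instances. The names `oversample_chunk`, `imbalance_ratio`, and `target_ratio` are hypothetical, and the chunk is assumed to be a binary-labelled list of `(features, label)` pairs.

```python
import random
from collections import Counter


def imbalance_ratio(chunk):
    """Return majority-class size divided by minority-class size."""
    counts = Counter(label for _, label in chunk)
    return max(counts.values()) / min(counts.values())


def oversample_chunk(chunk, target_ratio=0.9, seed=0):
    """Randomly duplicate minority instances in one data chunk.

    Assumption: plain random oversampling stands in for the paper's
    own (unspecified here) oversampling technique.
    target_ratio is the desired minority/majority size ratio, so a
    value just below 1.0 "almost" balances the chunk.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in chunk)
    if len(counts) < 2:
        return list(chunk)  # only one class present, nothing to balance
    (_, n_major), (minority, n_minor) = counts.most_common(2)
    minority_rows = [row for row in chunk if row[1] == minority]
    # Number of extra minority copies needed to reach the target ratio.
    n_needed = max(0, round(target_ratio * n_major) - n_minor)
    extra = [rng.choice(minority_rows) for _ in range(n_needed)]
    return list(chunk) + extra


# Toy chunk: 90 majority ("neg") and 10 minority ("pos") instances.
chunk = [((i,), "neg") for i in range(90)] + [((i,), "pos") for i in range(10)]
balanced = oversample_chunk(chunk, target_ratio=0.9)
print(imbalance_ratio(chunk), imbalance_ratio(balanced))
```

In a streaming setting, a function like this would be applied to each arriving chunk before the chunk is passed to the learner, so the stream is never loaded as a whole.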
About this article
Cite this article
Anupama, N., Jena, S. A novel approach using incremental oversampling for data stream mining. Evolving Systems 10, 351–362 (2019). https://doi.org/10.1007/s12530-018-9249-5
DOI: https://doi.org/10.1007/s12530-018-9249-5