Active Learning by Clustering for Drifted Data Stream Classification

  • Jakub Zgraja
  • João Gama
  • Michał Woźniak
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 967)


Data stream classifier learning usually assumes that the labels of all incoming examples are available without delay and can be used to update the predictive model. This assumption is naive, because obtaining every class label requires a relatively high labeling budget. Methods that can train data stream classifiers on partially labeled data are therefore highly desirable. Among them, active learning [1] is a promising direction: it selects only the most valuable examples to be labeled and used to build an accurate predictive model. When designing such a system, however, we must ensure that the chosen active learning strategy can handle changes in the data distribution and adapt to them quickly. In this work, we focus on novel active learning strategies designed to tackle such changes effectively. We propose a novel active learning method for data stream classifiers based on the query-by-clustering approach. Experimental evaluation shows that the proposed approach is useful for reducing the labeling cost of classifiers for drifting data streams.
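To make the general idea of querying by clustering under a labeling budget more concrete, the following is a minimal illustrative sketch. It is not the method proposed in the paper, whose details are not given here. The class name `ClusterQuerySketch`, the parameters `radius` and `budget`, and the `oracle` callback are all hypothetical. The sketch assumes a simple nearest-centroid online clustering and asks for a label only when an example lands in a cluster that has no label yet and the budget allows it.

```python
import math
from collections import Counter


class ClusterQuerySketch:
    """Toy query-by-clustering learner (illustrative assumption, not the paper's algorithm)."""

    def __init__(self, radius=1.0, budget=0.2):
        self.radius = radius   # hypothetical: max distance for joining an existing cluster
        self.budget = budget   # hypothetical: max fraction of seen examples that may be labeled
        self.centroids = []    # one centroid per cluster
        self.counts = []       # number of examples absorbed by each cluster
        self.labels = []       # Counter of labels obtained for each cluster
        self.seen = 0
        self.queried = 0

    def _nearest(self, x):
        """Return (index, distance) of the closest centroid, or (None, inf) if there is none."""
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def process(self, x, oracle):
        """Cluster x, query the oracle if its cluster is unlabeled and budget remains, then predict."""
        self.seen += 1
        i, d = self._nearest(x)
        if i is None or d > self.radius:
            # x is far from every cluster: open a new one (a possible sign of drift)
            self.centroids.append(list(x))
            self.counts.append(1)
            self.labels.append(Counter())
            i = len(self.centroids) - 1
        else:
            # incremental mean update of the matched centroid
            self.counts[i] += 1
            n = self.counts[i]
            self.centroids[i] = [c + (v - c) / n for c, v in zip(self.centroids[i], x)]
        if not self.labels[i] and self.queried < self.budget * self.seen:
            self.labels[i][oracle(x)] += 1
            self.queried += 1
        if self.labels[i]:
            return self.labels[i].most_common(1)[0][0]
        return None  # no label known for this cluster yet


# Usage on a tiny synthetic stream with two well-separated groups.
stream = [(0, 0), (0.5, 0.5), (10, 10), (10.5, 9.5), (0.2, 0.1)]
learner = ClusterQuerySketch(radius=2.0, budget=0.5)
preds = [learner.process(x, lambda p: "a" if p[0] < 5 else "b") for x in stream]
print(preds, learner.queried)
```

In this toy stream only two of the five examples are sent to the oracle, one per cluster. A realistic stream learner would also need a forgetting mechanism for outdated clusters, which this sketch omits.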


Active learning, Data streams, Classification



This work is supported by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wrocław University of Science and Technology.


  1. Settles, B.: Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114 (2012)
  2. Gama, J.: Knowledge Discovery from Data Streams, 1st edn. Chapman & Hall/CRC, Boca Raton (2010)
  3. Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 44:1–44:37 (2014)
  4. Domingos, P., Hulten, G.: Mining high-speed data streams, pp. 71–80. ACM Press (2000)
  5. Tsymbal, A.: The problem of concept drift: definitions and related work. Technical report, Trinity College Dublin (2004)
  6. Gama, J., Gaber, M.: Learning from Data Streams: Processing Techniques in Sensor Networks. Springer, Heidelberg (2007)
  7. Zliobaite, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. In: Japkowicz, N., Stefanowski, J. (eds.) Big Data Analysis: New Algorithms for a New Society, pp. 91–114. Springer, Cham (2016)
  8. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. SIGMOD Rec. 34(2), 18–26 (2005)
  9. Ienco, D., Bifet, A., Žliobaitė, I., Pfahringer, B.: Clustering based active learning for evolving data streams. In: Fürnkranz, J., Hüllermeier, E., Higuchi, T. (eds.) DS 2013. LNCS (LNAI), vol. 8140, pp. 79–93. Springer, Heidelberg (2013)
  10. de Faria, E.R., de Leon Ferreira Carvalho, A.C.P., Gama, J.: MINAS: multiclass learning algorithm for novelty detection in data streams. Data Min. Knowl. Discov. 30(3), 640–680 (2016)
  11. Kremer, H., et al.: An effective evaluation measure for clustering on evolving data streams. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011, pp. 868–876. ACM, New York (2011)
  12. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
  13. Dheeru, D., Taniskidou, E.K.: UCI machine learning repository (2017)
  14. Blackard, J.A., Dean, D.J.: Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput. Electron. Agricult. 24, 131–151 (1999)
  15. Harries, M., Wales, N.S.: Splice-2 comparative evaluation: electricity pricing. Technical report (1999)
  16. Zhu, X.H.: Stream data mining repository (2010)
  17. Zliobaite, I.: How good is the electricity benchmark for evaluating concept drift adaptation. CoRR, abs/1301.3524 (2013)
  18. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB 2003, vol. 29, pp. 81–92. VLDB Endowment (2003)
  19. Kranen, P., Assent, I., Baldauf, C., Seidl, T.: The ClusTree: indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst. 29(2), 249–272 (2011)
  20. Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2007, pp. 133–142. ACM, New York (2007)
  21. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Log. Soft Comput. 17(2–3), 255–287 (2011)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Systems and Computer Networks, Wrocław University of Science and Technology, Wrocław, Poland
  2. Laboratory of Artificial Intelligence and Decision Support, Faculty of Economics, University of Porto, Porto, Portugal
