On the Impact of Class Imbalance in GP Streaming Classification with Label Budgets
Streaming data scenarios introduce requirements that do not arise under the supervised learning paradigms typically employed for classification. Specific examples include anytime operation, non-stationary processes, and limited label budgets. From the perspective of class imbalance, this implies that it is not even possible to guarantee that all classes are present in the samples of data used to construct a model. Moreover, decisions about which subset of data to sample must be made without label information; labels are provided only after sampling. This is a more challenging task than the non-streaming (offline) scenario, in which the training partition already contains label information. In this work, we investigate the utility of different protocols for sampling from the stream under the above constraints. A uniform sampling protocol was previously shown to be reasonably effective for both evolutionary and non-evolutionary streaming classifiers. We introduce a scheme that uses the current ‘champion’ classifier to bias the sampling of training instances during the course of the stream. The resulting streaming framework for genetic programming is more effective at sampling minor classes and therefore at reacting to changes in the underlying process responsible for generating the data stream.
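The contrast between the two sampling protocols can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed window of unlabeled instances, and the inverse-frequency weighting are all hypothetical choices made here. The only properties taken from the text are that selection happens before any labels are seen, that at most `budget` labels are requested, and that the champion's output biases the selection.

```python
import random
from collections import Counter


def uniform_sample(window, budget, rng):
    """Pick up to `budget` indices uniformly at random, without using labels."""
    return rng.sample(range(len(window)), min(budget, len(window)))


def champion_biased_sample(window, budget, champion, rng):
    """Pick up to `budget` indices, favouring instances whose *predicted*
    class is rare in the window (hypothetical weighting, assumed here).

    `champion` is any callable mapping an instance to a predicted class.
    True labels are never consulted; they would be requested afterwards
    only for the returned indices.
    """
    predicted = [champion(x) for x in window]
    counts = Counter(predicted)
    # Assumption: weight each instance by the inverse frequency of its
    # predicted class, so each predicted class has equal total weight.
    weights = [1.0 / counts[p] for p in predicted]

    candidates = list(range(len(window)))
    chosen = []
    # Weighted sampling without replacement, one index at a time.
    for _ in range(min(budget, len(window))):
        pick = rng.choices(
            candidates, weights=[weights[i] for i in candidates], k=1
        )[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy window: 95 instances near 0 (major class), 5 near 1 (minor class).
    window = [0.0] * 95 + [1.0] * 5
    champion = lambda x: int(x > 0.5)  # stand-in for an evolved classifier
    for name, picked in [
        ("uniform", uniform_sample(window, 10, rng)),
        ("biased", champion_biased_sample(window, 10, champion, rng)),
    ]:
        minor = sum(1 for i in picked if window[i] > 0.5)
        print(name, "label requests on minor-class instances:", minor)
```

Under this weighting, an instance the champion assigns to a rarely predicted class is more likely to consume part of the label budget, which is one plausible way to raise the chance that minor classes appear in the training sample. A champion that never predicts the minor class gives no such benefit, so the bias is only as good as the current model.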
Keywords: Streaming data classification, Non-stationary, Class imbalance, Benchmarking
This research is supported by the Canadian Safety and Security Program (CSSP) E-Security grant. The CSSP is led by the Defense Research and Development Canada, Centre for Security Science (CSS) on behalf of the Government of Canada and its partners across all levels of government, response and emergency management organizations, nongovernmental agencies, industry and academia.