
1 Introduction

The classification of difficult data is currently a frequently studied research topic. Data streams are one of many examples of such data. A stream must be processed within limited time and memory, and each incoming object can be used only once. Classifiers are also required to be adaptable. A common phenomenon accompanying streams is concept drift, which changes the distribution of the incoming data. Such changes may occur indefinitely.

Another problem is data imbalance, which significantly increases the difficulty when combined with streams. An uneven distribution of objects among classes is fairly common in real data sets. It is not a problem when the differences are small, but it becomes serious when the number of majority-class objects greatly exceeds the number of minority-class objects. One known way to deal with this difficulty is data sampling. Sampling methods either reduce the number of objects in the dominant class or generate artificial objects of the minority class [2].

Another approach is to design methods with built-in mechanisms for adapting to this type of data. One such method is Learn++CDS [6], which combines Learn++NSE [7] for nonstationary streams with SMOTE [2] for oversampling. The next method, Learn++NIE, is similar to the previous one but differs slightly: a classification error measure is introduced and a variation of bagging is used to balance the data. Wang et al. [19] designed a method that uses the k-means clustering algorithm to undersample data by generating prototypes from centroids. The REA method was proposed by Chen and He [4] as an extension of SERA [3] and MuSeRA [5]. This family of methods estimates the similarity between previous minority-class samples and the current minority data from the chunk.

One of the demanding situations when classifying imbalanced data streams is the temporary disappearance of the minority class, or its appearance only in later stages. This can cause a significant decrease in quality or even prevent a typical classifier from working. One solution to this problem is to use one-class classifiers, which can make decisions based on objects from a single class. Krawczyk et al. [11] proposed forming an ensemble of one-class classifiers, in which data clustered within the samples of each class are used to train new models and expand the ensemble. J. Liu et al. [14] designed a modular committee of single-class classifiers based on data density analysis. This is a similar approach, in which clusters are created within a single-class data set. Krawczyk and Woźniak [10] presented various metrics that enable the creation of effective one-class classifier committees.

This paper proposes an ensemble method for classifying imbalanced data streams. The purpose of this work is to conduct preliminary experiments and analyze the results in order to determine whether the designed method can handle imbalanced data streams and compete with state-of-the-art methods. The main contributions of this work are as follows:

  • A proposal for an OCEIS method for classifying imbalanced data streams based on one-class SVM classifiers

  • Introduction of an appropriate combination rule allowing full use of the potential of the one-class SVM classifier ensemble

  • Designing a proper learning procedure for the proposed method using division of data into classes and k-means clustering

  • Experimental evaluation of the proposed OCEIS method using real and synthetically generated imbalanced data streams and a comparison with state-of-the-art methods

Fig. 1. Decision regions visualisation on the paw dataset from the Keel.es repository [1]

2 Proposed Method

The proposed method, One-Class support vector machine classifier Ensemble for Imbalanced data Stream (OCEIS), combines several approaches to data classification. Its core idea is to use one-class support vector machines (OCSVM) to classify imbalanced binary problems. It is a chunk-based data stream method.

In the first step of Algorithm 1, the chunk of training data is divided into a minority set (\(D_{min}\)) and a majority set (\(D_{maj}\)). Each of these sets is then divided into clusters. Krawczyk et al. [11] indicate the importance of this idea. Decomposing the data over the feature space reduces the overlap between the decision areas of the classifiers in the ensemble (Fig. 1). The k-means algorithm [15] is used to create the clusters. The key aspect is choosing the right number of clusters. The Silhouette Value (SV) [18] helps here: it measures how similar an object is to its own cluster compared to other clusters. Kaufman et al. [9] introduced the Silhouette Coefficient (SC) as the maximum value of the mean SV over the entire dataset.
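The selection of the number of clusters can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The function name `select_clusters` and its parameters are hypothetical. One assumption is labelled in the code: the silhouette is undefined for a single cluster, so k = 1 serves only as a fallback here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_clusters(X, k_max, random_state=0):
    """Return (k_best, labels) for the k with the highest mean silhouette.

    Assumption: k = 1 cannot be scored by the silhouette, so it is used
    only as a fallback when no k >= 2 can be evaluated.
    """
    X = np.asarray(X)
    n = len(X)
    best_k, best_labels, best_sc = 1, np.zeros(n, dtype=int), -np.inf
    # silhouette_score needs between 2 and n - 1 distinct cluster labels
    for k in range(2, min(k_max, n - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        if len(np.unique(labels)) < 2:
            continue  # degenerate clustering, cannot be scored
        sc = silhouette_score(X, labels)
        if sc > best_sc:
            best_k, best_labels, best_sc = k, labels, sc
    return best_k, best_labels
```

In this sketch the function would be called once on the minority set and once on the majority set of each chunk.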

Minority and majority data are divided into cluster sets (\(Cmin_{t,k}\), \(Cmaj_{t,k}\)) with the number of centroids ranging from 1 to \(K_{max}\). The number of clusters with the highest SC value is selected (\(K_{best}\)). This process is performed separately for the minority and the majority data. The resulting clusters are then used to fit new OCSVM models (\(h_{t,i}\), \(h_{t,j}\)), which are added to the pools of classifier committees (\(H_{min}\), \(H_{maj}\)). The method is designed to operate on data streams by default. For this reason, a simple forgetting mechanism, also known as incremental learning, was implemented, so that only models trained on data from a certain time interval are used. Once the number of processed chunks (t) reaches a set number (S), the models built on the oldest chunk are removed from the ensemble in each iteration.
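The per-cluster model fitting and the forgetting mechanism might look like the sketch below. It is an illustration under stated assumptions rather than the published code. The class name `ChunkEnsemble` and its parameters are hypothetical, and the number of clusters is fixed for brevity instead of being chosen by the Silhouette Coefficient. The OCSVM hyperparameters are likewise assumptions.

```python
from collections import deque

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM


class ChunkEnsemble:
    """Keeps one-class SVMs built on the last `max_chunks` chunks only."""

    def __init__(self, max_chunks=5, n_clusters=2):
        # deque(maxlen=S) silently drops the oldest chunk's models
        self.h_min = deque(maxlen=max_chunks)
        self.h_maj = deque(maxlen=max_chunks)
        self.n_clusters = n_clusters  # assumption: fixed, not SC-selected

    def _fit_cluster_models(self, X):
        if len(X) == 0:
            return []  # class absent from this chunk
        k = min(self.n_clusters, len(X))
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        # one OCSVM per cluster of a single class
        return [OneClassSVM(gamma="scale").fit(X[labels == c])
                for c in np.unique(labels)]

    def partial_fit(self, X, y, minority_label=1):
        X, y = np.asarray(X), np.asarray(y)
        self.h_min.append(self._fit_cluster_models(X[y == minority_label]))
        self.h_maj.append(self._fit_cluster_models(X[y != minority_label]))
        return self
```

Using a bounded deque is one simple way to express "remove the models built on the oldest chunk"; other data structures would work equally well.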

Algorithm 1 (listing not reproduced)
Algorithm 2 (listing not reproduced)

A crucial component of any classifier ensemble is the combination rule, which makes the final decision based on the predictions of the ensemble members. Designing a good decision rule is vital for proper operation and satisfactory classification quality. OCEIS uses one-class classifiers and class clustering, which changes how the ensemble works. The well-known majority voting rule [20] does not allow this kind of committee to make correct decisions. The number of classifiers for individual classes may vary significantly depending on the number of clusters. In this situation, there is a considerable risk that the decision will be based mainly on the majority-class classifiers.

OCEIS uses an original combination rule (Algorithm 2) based on the distance from the decision boundary of each classifier to the predicted samples. In the first step, the distances (\(Dist_{i,m}\), \(Dist_{j,m}\)) are calculated from all objects of the predicted data to the hypersphere of each model in the minority and the majority committee. These values are computed by the DecisionFunction. When the examined object is inside the checked hypersphere, the value is positive; when it is outside, the value is negative. Then, for each sample, the highest value is determined within the majority committee (\(D_{maj}\)) and within the minority committee (\(D_{min}\)); here these symbols denote decision values, not the data sets of Algorithm 1. When \(D_{maj}\) is greater than \(D_{min}\), the object is assigned to the majority class. Similarly, when \(D_{min}\) is greater than \(D_{maj}\), the object is assigned to the minority class.
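The combination rule described above can be sketched with scikit-learn's `OneClassSVM.decision_function`, which returns positive values inside the learned region and negative values outside. The function name `combine` and the label encoding are hypothetical. The tie-breaking choice is an assumption, since the text does not say what happens when both values are equal.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def combine(models_min, models_maj, X, minority_label=1, majority_label=0):
    """Assign each sample to the committee with the highest decision value."""
    X = np.asarray(X)
    # rows: models, columns: samples
    dist_min = np.stack([m.decision_function(X) for m in models_min])
    dist_maj = np.stack([m.decision_function(X) for m in models_maj])
    best_min = dist_min.max(axis=0)  # best minority value per sample
    best_maj = dist_maj.max(axis=0)  # best majority value per sample
    # Assumption: ties go to the majority class (not specified in the text)
    return np.where(best_min > best_maj, minority_label, majority_label)
```

Because only the single best value per committee is compared, the outcome does not depend on how many models each committee holds, which is the property that plain majority voting lacks here.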

3 Experimental Evaluation

The main purpose of this experiment was to check how well the proposed method performs in comparison with other methods for classifying imbalanced data streams. The following research hypothesis was formulated:

It is possible to design a method whose classification quality on imbalanced data streams is statistically better than or equal to that of the selected state-of-the-art methods.

3.1 Experiment Setup

All tests were carried out using 24 generated streams and 30 real streams (Table 1). The generated data come from the stream-learn [12] generator. The generated streams differ in the level of imbalance (10%, 20% or 30%), the label noise (0% or 10%) and the type of drift (incremental or sudden). All generated data streams have 10 features and two classes, and consist of 100,000 objects each. The proposed method was compared with the following state-of-the-art methods:

[List of compared methods not reproduced; the results below refer to L++CDS, L++NIE, KMC, REA and OUSE.]

The SVM implementation from the scikit-learn framework [17] was used as the base classifier in all committees. The OCEIS implementation and the experimental environment are available in a public GitHub repository. Four metrics were used to measure quality: Gmean, precision, recall and specificity. The results were compared using Wilcoxon statistical pair tests. Each method was compared with OCEIS, and the resulting wins, ties and losses are shown in Fig. 2 and Fig. 3.
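For reference, the four metrics can be computed from the confusion matrix as in the sketch below. The text does not define Gmean, so the common definition (geometric mean of recall and specificity) is assumed here. Treating the minority class as the positive class is also an assumption, and the function name `imbalance_metrics` is hypothetical.

```python
import numpy as np


def imbalance_metrics(y_true, y_pred, positive=1):
    """Recall (TPR), specificity (TNR), precision and Gmean for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    # guard against empty denominators, e.g. a chunk without minority objects
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Assumption: Gmean = sqrt(TPR * TNR)
    gmean = float(np.sqrt(recall * specificity))
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "gmean": gmean}
```

Under this definition, a classifier that ignores one class entirely gets a Gmean of zero, which is why the metric is popular for imbalanced problems.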

Table 1. Overview of real datasets used in experimental evaluation (KEEL [1] and PROMISE Software Engineering Repository [13]), IR - Imbalance Ratio

3.2 Results Analysis

The results of the Wilcoxon rank-sum pair statistical tests show that OCEIS classifies with a quality similar to that of the tested methods. For the synthetic data streams (Fig. 2), L++CDS has a certain advantage over the other methods, with L++NIE and OCEIS in second place. The OUSE and L++NIE methods show a noticeable tendency to favour the minority class, which is manifested by higher Recall (TPR) but causes a significant drop in Specificity (TNR). The worst method in this test was REA, which shows a strong bias towards the majority class. The results are clearer for the real data sets (Fig. 3). Despite many ties, the best performing method is OCEIS. The exceptions are Recall for OUSE and Specificity for REA.

Fig. 2. Wilcoxon pair rank sum tests for synthetic data streams. The dashed vertical line is the critical value at a significance level of 0.05 (green - win, yellow - tie, red - lose)

Fig. 3. Wilcoxon pair rank sum tests for real data streams. The dashed vertical line is the critical value at a significance level of 0.05 (green - win, yellow - tie, red - lose)

Charts of the Gmean score over the data chunks provide some useful information about the results. To improve readability, the data were smoothed with a Gaussian filter before plotting. This smooths out abrupt fluctuations and makes the trends easier to read. The first observation is that for an incremental drift stream (Fig. 4), the quality of OCEIS does not degrade over time. The negative effect of concept drift can be seen for the KMC and REA methods, whose quality deteriorates significantly as subsequent data chunks arrive.
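The smoothing step can be illustrated with `scipy.ndimage.gaussian_filter1d`. The filter width used in the experiments is not stated, so `sigma=2.0` is an assumption. The per-chunk scores below are synthetic placeholder values, not results from this study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

# Placeholder per-chunk Gmean values (synthetic, NOT results from the paper)
raw = np.clip(0.8 + 0.1 * rng.standard_normal(200), 0.0, 1.0)

# Assumption: sigma is measured in chunks; larger values smooth more strongly
smoothed = gaussian_filter1d(raw, sigma=2.0)
```

The smoothed series has the same length as the raw one, so it can be plotted against the same chunk indices.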

Fig. 4. Gmean score over the data chunks for synthetic data with incremental drift

For sudden concept drift (Fig. 5), a certain decrease in quality is noticeable for every tested method to a greater or lesser degree. However, L++CDS, L++NIE and OCEIS recover quickly from this drop, so it does not significantly affect the overall classification quality. The other methods behave somewhat erratically on sudden drifts. The real shuttle-4vsA stream (Fig. 6) shows a clear advantage of OCEIS over the other tested methods. Similar behaviour can be observed for other real streams.

Fig. 5. Gmean score over the data chunks for synthetic data with sudden drift

Fig. 6. Gmean score over the data chunks for real stream shuttle-4-5vsA

When analyzing the results, one should pay attention to the significant divergence in the performance of the proposed method between synthetic and real data streams. The real data streams are highly diverse, while the artificial streams were generated using one type of generator (albeit with different settings). The generated data streams are therefore biased towards one type of data distribution, which was probably easy to analyze for some of the models, while the bias of the remaining models was not consistent with this type of data generator. Therefore, in the future, we plan to carry out experiments on an expanded pool of synthetic streams produced by other generators.

4 Conclusions

We proposed an algorithm for classifying imbalanced data streams based on a one-class classifier ensemble. Based on the experimental results, the formulated research hypothesis seems to be confirmed. OCEIS achieves results at a similar level to the compared methods, but it is worth noting that it performs best on real stream data, which is an important advantage. Another advantage is that it shows no tendency towards excessive classification of objects into one of the classes, which was a problem for the REA and OUSE methods in the experiments. Such “stability” contributes significantly to the quality of classification and to obtaining satisfactory results.

For synthetic data streams, the proposed algorithm is not the worst-performing one. However, one can see some dominance of the methods from the Learn++ family, because the decision made by OCEIS is built on all classifiers in the committee together. One possible way to change this would be to group newly created models by data chunk. This would build subcommittees (the Learn++NIE method works similarly), and decisions would then be made for each subcommittee separately. Extending this with weighted voting may significantly improve predictive performance. Another improvement would be the introduction of a drift detector, which would enable the ensemble to be cleaned up after concept drift is detected.

The conducted research indicates the potential of the presented method. It is worth extending the research to streams with other types of concept drift. It would also be beneficial to test more real streams to gain broader knowledge of how the method works on real data. One idea for further research that arose while working on this paper is to test the method on streams in which the imbalance ratio changes over time. A particularly interesting experiment would involve imbalanced data streams in which the minority class temporarily disappears or appears only after some time.