
1 Introduction

Traditional machine learning algorithms assume that the numbers of instances belonging to the problem classes are relatively similar. However, in many real problems the size of one class (the majority class) may significantly exceed that of the other (the minority class). This biases the algorithms towards the majority class, although the correct recognition of the less common class is often more important. This research trend is known as learning from imbalanced data [8] and is still widely discussed in scientific works.

There are three main approaches to dealing with imbalanced data classification:

  • Data-level methods focusing on modifying the training set in such a way that it becomes suitable for classic learning algorithms (e.g., oversampling and undersampling).

  • Algorithm-level methods that modify existing classification algorithms to offset their bias towards the majority class.

  • Hybrid methods combining the strengths of the previously mentioned approaches.

Many works on imbalanced data classification employ classifier ensembles [16]. One of the more promising directions is Dynamic Ensemble Selection (des) [5]. Dynamic selection (ds) methods select a single classifier or an ensemble (from an available classifier pool) to predict the decision for each unknown query, based on the assumption that each of the base classifiers is an expert in a different region of the feature space. The classification of each unknown sample by des involves three steps:

  • Definition of the region of competence; that is, how to define the local region surrounding the unknown sample, in which the competence level of the base models is estimated. This local region of competence is found in the dynamic selection dataset (dsel), which is usually part of the training set.

  • Defining the selection criterion later used to assess the competence of the base classifiers in the local region of competence (e.g., accuracy or diversity).

  • Determination of the selection mechanism, deciding whether a single classifier or an ensemble is chosen. A minimal sketch of these three steps follows below.
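To make these three steps concrete, below is a minimal, illustrative sketch (not the code used in this paper) built on scikit-learn's NearestNeighbors; the function names, the accuracy-based competence criterion and the 0.5 selection threshold are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def region_of_competence(query, X_dsel, y_dsel, k=7):
    """Step 1: the local region of competence consists of the k nearest
    neighbors of the query found in DSEL."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_dsel)
    idx = nn.kneighbors(query.reshape(1, -1), return_distance=False)[0]
    return X_dsel[idx], y_dsel[idx]

def competence_levels(pool, X_lrc, y_lrc):
    """Step 2: assess every base classifier in the region, here simply
    by its local accuracy (diversity-based criteria are also possible)."""
    return np.array([clf.score(X_lrc, y_lrc) for clf in pool])

def selection_mechanism(pool, competences, threshold=0.5):
    """Step 3: keep all classifiers above a competence threshold (an
    ensemble), falling back to the single most competent one."""
    chosen = [clf for clf, c in zip(pool, competences) if c > threshold]
    return chosen or [pool[int(np.argmax(competences))]]
```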

Previous work related to the imbalanced data classification using classifier ensembles and des involves various approaches. Ksieniewicz in [9] proposed an Undersampled Majority Class Ensemble (umce) employing different combination methods and pruning, based on a k-fold division of the majority class to divide an imbalanced problem into many balanced ones. Chen et al. [4] presented the Dynamic Ensemble Selection Decision-making (desd) algorithm to select the most appropriate classifiers using a weighting mechanism to highlight the base models that are better suited for recognizing the minority class. Zyblewski et al. in [17] proposed the Minority Driven Ensemble (mde) for highly imbalanced data streams classification and Roy et al. in [14] combined preprocessing with dynamic ensemble selection to classify both binary and multiclass stationary imbalanced datasets.

The main contributions of this work are as follows:

  • The proposition of new dynamic selection methods adapted to the classification of highly imbalanced data.

  • Experimental evaluation of the proposed algorithms on a large number of diverse benchmark datasets and a detailed comparison with state-of-the-art approaches.

2 Dynamic Ensemble Selection Based on Imbalance Ratio and Euclidean Distance

This paper proposes two algorithms of dynamic classifier selection for the imbalanced data classification problem: the Dynamic Ensemble Selection using Euclidean distance (dese) and the Dynamic Ensemble Selection using Imbalance Ratio and Euclidean distance (desire).

The generation of the classifier pool is based on the Bagging approach [2], and more specifically on the Stratified Bagging, in which the samples are drawn with replacement from the minority and majority class separately in such a way that each bootstrap maintains the original training set class proportion. This is necessary due to the high imbalance, which in the case of standard bagging can lead to the generation of training sets containing only the majority class.
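A minimal sketch of such a stratified bootstrap draw, following the description above, could look as follows; the function names are hypothetical and this is not the paper's implementation.

```python
import numpy as np
from sklearn.base import clone

def stratified_bootstrap(X, y, random_state=None):
    """Sample with replacement within each class separately, so the
    bootstrap keeps the original training set class proportions."""
    rng = np.random.default_rng(random_state)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=np.sum(y == c), replace=True)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

def stratified_bagging_pool(base_clf, X, y, pool_size=5):
    """Train one clone of the base model per stratified bootstrap."""
    return [clone(base_clf).fit(*stratified_bootstrap(X, y, seed))
            for seed in range(pool_size)]
```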

Both proposed methods are derived in part from algorithms based on local oracles, and more specifically from knora-u [7], which assigns each base classifier a weight based on the number of correctly classified instances in the local region of competence and then combines them by weighted majority voting. The computational cost of this type of method depends mainly on the size of the classifier pool and the dsel size, as the k-nearest neighbors technique used to define local competence regions can be costly for large datasets. Instead of hard voting, dese and desire rely on the probabilities returned by the base models and calculate weights for each classifier separately for the minority and majority classes.
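For reference, the knora-u idea described above can be sketched as follows; this is an illustrative reimplementation with hypothetical names, not the implementation used in the experiments.

```python
import numpy as np

def knora_u_weights(pool, X_lrc, y_lrc):
    """knora-u style competence: one vote for every correctly
    classified neighbor in the local region of competence."""
    return np.array([(clf.predict(X_lrc) == y_lrc).sum() for clf in pool])

def weighted_majority_vote(pool, weights, x):
    """Combine the hard predictions by weighted majority voting."""
    votes = {}
    for clf, w in zip(pool, weights):
        label = clf.predict(x.reshape(1, -1))[0]
        votes[label] = votes.get(label, 0) + w
    return max(votes, key=votes.get)
```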

The proposed methods come in two variants: Correct (denoted as C), where weights are modified only in the case of correct classification, and All (denoted as A), where, in addition to correct decisions, weights are also affected by incorrect ones. The exact weight calculation procedure is presented in Algorithm 1.

For each instance, the proposed algorithms perform the following steps:

  • First, the k-nearest neighbors of a given instance are found in dsel; these form the local region of competence lrc.

  • Next, each classifier \(\varPsi _j\) from the pool classifies all samples belonging to lrc.

  • In the following steps, the classifier weights are modified separately for the minority and majority class, starting from an initial value. The All variant uses all four conditions of Algorithm 1, while the Correct variant relies only on the two conditions triggered by correct classification. In the case of dese, the modifications are based on the Euclidean distance between the classified sample and its neighbor from the local competence region; in the case of desire, the Euclidean distance is additionally scaled by the percentage of the minority or majority class so that more emphasis is placed on the minority class. A hedged sketch of this weighting follows below.
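Since Algorithm 1 is not reproduced here, the following is only a speculative sketch of the weighting idea. Everything specific in it is an assumption rather than the paper's rule: weights start from zero, the minority class carries label 1, and the distance term is taken as exp(-||x - x_i||).

```python
import numpy as np

def distance_based_weights(pool, x, X_lrc, y_lrc, p_min, p_maj,
                           variant="C", scale_by_prior=True):
    """Speculative sketch of the dese/desire per-class weighting.
    Assumed (not taken from Algorithm 1): zero initial weights,
    minority label 1, distance term exp(-||x - x_i||). Correct
    decisions raise the weight; in the 'All' variant ('A') incorrect
    ones also lower it. With scale_by_prior=True the term is divided
    by the class prior, as in desire, so minority evidence counts more."""
    w_min = np.zeros(len(pool))
    w_maj = np.zeros(len(pool))
    for j, clf in enumerate(pool):
        preds = clf.predict(X_lrc)
        for x_i, y_i, p_i in zip(X_lrc, y_lrc, preds):
            term = np.exp(-np.linalg.norm(x - x_i))
            if scale_by_prior:                     # the desire scaling
                term /= p_min if y_i == 1 else p_maj
            if p_i == y_i:
                (w_min if y_i == 1 else w_maj)[j] += term
            elif variant == "A":                   # 'All' also penalizes
                (w_min if y_i == 1 else w_maj)[j] -= term
    return w_min, w_maj
```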

Finally, the weights obtained from dese or desire are normalized to the [0, 1] range and multiplied by the ensemble support matrix. The combination is carried out according to the maximum rule [6], which selects the classifier most confident in its prediction. The choice of this combination rule was dictated by the small number of instances in the datasets, which significantly reduces the risk of the base classifiers overfitting.
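A hedged sketch of this combination step, continuing the assumptions above (binary labels with the minority class as 1, min-max normalization to [0, 1]):

```python
import numpy as np

def max_rule_combination(pool, x, w_min, w_maj):
    """Normalize the class-wise weights to [0, 1], multiply them by the
    ensemble support matrix and apply the maximum rule: the class behind
    the single largest weighted support wins."""
    def minmax(w):
        span = w.max() - w.min()
        return (w - w.min()) / span if span > 0 else np.ones_like(w)
    weights = np.column_stack([minmax(w_maj), minmax(w_min)])  # (n_clf, 2)
    supports = np.vstack([clf.predict_proba(x.reshape(1, -1))[0]
                          for clf in pool])                    # (n_clf, 2)
    weighted = supports * weights
    return int(np.unravel_index(np.argmax(weighted), weighted.shape)[1])
```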

[Algorithm 1: calculation of the classifier weights in dese and desire]

3 Experimental Evaluation

This section presents the details of the experimental study, the datasets used, and the results that the proposed approaches achieved compared to state-of-the-art methods.

3.1 Experimental Set-Up

The main goal of the following experiments was to compare the performance of the proposed dynamic selection methods, designed specifically for the task of imbalanced data classification, with state-of-the-art ensemble methods paired with preprocessing. The evaluation in each of the experiments is based on metrics commonly used to assess the quality of classification for imbalanced problems: F1 score [15], precision and recall [13], G-mean [11] and balanced accuracy score (bac) [3], according to the stream-learn [10] implementation. All experiments have been implemented in Python and can be repeated using the code published on GitHub.
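For readers who wish to recompute the metric set without stream-learn, scikit-learn equivalents are sketched below; the G-mean is taken here as the square root of recall times specificity, which is an assumption about the stream-learn variant used.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

def imbalance_report(y_true, y_pred):
    """The five metrics computed with scikit-learn equivalents."""
    rec = recall_score(y_true, y_pred)                # sensitivity
    spec = recall_score(y_true, y_pred, pos_label=0)  # specificity
    return {
        "F1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": rec,
        "G-mean": np.sqrt(rec * spec),
        "BAC": balanced_accuracy_score(y_true, y_pred),
    }
```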

As the base models, three popular classifiers were selected according to the scikit-learn [12] implementation: Gaussian Naive Bayes (gnb), Classification and Regression Trees (cart) and the k-Nearest Neighbors classifier (knn). The size of the classifier pool was fixed successively at several values, the smallest being 5 base models. The evaluation was carried out using repeated k-fold cross-validation. Due to the small number of instances in the datasets, dsel is defined as the entire training set.
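The base model setup translates directly to scikit-learn, where cart is realized by DecisionTreeClassifier:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# The three base models of the experimental set-up in scikit-learn form.
base_models = {
    "gnb": GaussianNB(),
    "cart": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
}
```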

The experiments were carried out on datasets from the keel repository [1], which contain binary problems created through various combinations of class merging. All datasets have a high imbalance ratio. Problem characteristics are presented in Table 1.

Table 1. Dataset characteristics.

Subsections 3.2 and 3.3 present the results of experiments comparing the presented methods, dese in Experiment 1 and desire in Experiment 2, with state-of-the-art ensemble algorithms used for imbalanced data classification.

Both the proposed and reference methods occur in versions with preprocessing (in the form of random oversampling) and without it; the use of oversampling is denoted by the letter O preceding the acronym of the method. As reference methods, a single classifier, stratified bagging (sb) and dynamic selection in the form of the knora-u algorithm were selected.
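As an illustration of the O variants, random oversampling of a training split can be performed, for instance, with imbalanced-learn (the specific library is an assumption; the paper only states that random oversampling was used):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# A toy imbalanced problem; oversampling is applied to the training
# split only, duplicating minority samples up to class parity.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                           random_state=42)
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```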

The radar diagrams show the average global ranks achieved by each of the tested algorithms in terms of each of the evaluation metrics, while the tables show the results of the Wilcoxon signed-rank (\(p=0.05\)) statistical test for a pool size of 5 base classifiers. The numbers under the average rank of each method indicate the algorithms that are statistically significantly worse than the one in question. The complete results for each of the datasets and the full statistical analysis can be found on GitHub.
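The statistical procedure itself can be illustrated with SciPy; the per-dataset scores below are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired Wilcoxon signed-rank test at p = 0.05 on hypothetical
# per-dataset scores of two methods (the pairing must be kept).
scores_a = np.array([0.81, 0.74, 0.69, 0.90, 0.77, 0.83])
scores_b = np.array([0.78, 0.70, 0.71, 0.85, 0.73, 0.80])
stat, p = wilcoxon(scores_a, scores_b)
print(f"p = {p:.3f}", "significant" if p < 0.05 else "not significant")
```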

3.2 Experiment 1 – Euclidean Distance-Based Approach

Figure 1 shows how the average ranks for dese and the reference methods change in terms of the different metrics depending on the ensemble size. The proposed methods (especially odese-c) achieve higher rankings for small pools in terms of each metric, with the exception of recall. While the single classifier and bagging favor recall, odese-c and dese-c favor precision. As the number of base classifiers increases, the bac- and G-mean-based rankings deteriorate to the knora-u level, while the F1 score remains high due to high precision.

Table 2 presents the results of the statistical analysis, which shows that the odese-c method performs statistically significantly better than all reference methods in terms of each metric except for recall.

When the base classifier is cart, as seen in Fig. 2, for the smallest pool, dese-c (both without and with oversampling) achieves higher ranks than the reference methods in terms of each of the five metrics. As the number of classifiers increases, we can observe that while oknora-u and osb stand out in terms of precision, odese-c performs better in terms of the other metrics, and odese-a, despite a low F1 score and precision, achieves the highest average ranks in terms of bac, G-mean and recall. Table 3 confirms that for the pool of five base classifiers, odese-c is statistically significantly better than all reference methods, while odese-a performs statistically significantly better than odese-c in terms of recall, G-mean and bac.

Table 2. Statistical tests on mean ranks for gnb with pool size = 5.
Fig. 1. Mean ranks for gnb classifier.

Table 3. Statistical tests on mean ranks for cart with pool size = 5.
Fig. 2. Mean ranks for cart classifier.

Table 4. Statistical tests on mean ranks for knn with pool size = 5.
Fig. 3. Mean ranks for knn classifier.

In Fig. 3 and Table 4 we can see that the proposed methods using oversampling do not differ statistically from the reference methods, except for the single classifier, which is characterized by high precision but at the same time achieves the worst mean ranks on the remaining metrics. As the number of base classifiers increases, knora-u and osb achieve higher average ranks than odese-c and odese-a.

3.3 Experiment 2 – Scaled Euclidean Distance-Based Approach

The results below show the average ranks for the proposed desire method, which calculates weights based on Euclidean distances scaled by the percentages of the minority and majority classes in the training set.

In the case of gnb as the base model (Fig. 4), the odesire-c method achieves the best results compared to the reference methods in terms of mean ranks based on F1 score, precision, G-mean and bac. When the ensemble size increases, the proposed method equals oknora-u in terms of bac and G-mean but retains the advantage in terms of F1 score and precision. Moreover, the more base classifiers there are, the smaller the differences between desire using preprocessing and the version without it. Table 5 presents the results of the statistical analysis, which shows that odesire-c is statistically better than all reference methods when the number of base classifiers is low.

Figure 5 shows that for a small classifier pool, odesire-c achieves higher ranks than the reference methods in terms of each evaluation metric, and as the classifier number increases, it loses significantly in precision compared to osb and oknora-u. odesire-a has a high recall, which unfortunately is reflected in the lowest precision and F1 score. In Table 6 we see that, for a pool of 5 base classifiers, desire-c both with and without preprocessing is statistically significantly better than the reference methods in terms of all metrics except one: G-mean in the case of desire-c and recall for odesire-c.

When the base classifier is knn (Fig. 6), as in the case of dese, odesire-c is not statistically worse than osb and oknora-u (Table 7), and as the number of classifiers in the pool increases, its average global ranks deteriorate significantly compared to the reference methods.

Fig. 4. Mean ranks for gnb classifier.

Table 5. Statistical tests on mean ranks for gnb with pool size = 5.
Fig. 5. Mean ranks for cart classifier.

Table 6. Statistical tests on mean ranks for cart with pool size = 5.
Fig. 6. Mean ranks for knn classifier.

Table 7. Statistical tests on mean ranks for knn with pool size = 5.

3.4 Lessons Learned

The presented results confirmed that dynamic selection methods adapted specifically for imbalanced data classification can achieve statistically better results than state-of-the-art ensemble methods coupled with preprocessing, especially when the pool of base classifiers is relatively small. This may be because bagging has not yet stabilized at such sizes, while the proposed method chooses the best single classifier. The Correct approach, in which the weights of the models are changed only if the instances belonging to the local competence region are correctly classified, proved to be more balanced in terms of all evaluation measures. This may indicate that the weight penalties for incorrect classification in the All approach are too severe. When knn is used as the base classifier, the proposed methods performed statistically similarly to knora-u for a small pool, and achieved statistically inferior ranks compared to the reference methods for larger pools. This is probably due to the way knn calculates its supports, which does not suit the algorithms proposed in this work. For gnb and cart, dese-c and desire-c achieved results statistically better than or similar to the reference methods, often without the use of preprocessing, since the proposed weighting has a built-in mechanism for dealing with imbalance.

4 Conclusions

The main purpose of this work was to propose a novel solution based on dynamic classifier selection for the imbalanced data classification problem. Two methods were proposed, namely dese and desire, which use the Euclidean distance and the imbalance ratio of the training set to select the most appropriate model for the classification of each new sample. Research conducted on benchmark datasets and the accompanying statistical analysis confirmed the usefulness of the proposed methods, especially when there is a need to maintain a relatively small number of classifiers.

Future work may involve the exploration of different approaches to weighting the base classifiers, the use of different combination methods, and the application of the proposed methods to imbalanced data stream classification.