1 Introduction

Several classifiers have been proposed for supervised classification; among them, an important family is that of contrast pattern-based classifiers. A pattern is an expression, defined in a language, that describes a set of objects. For example, a pattern that describes a set of plants can be the following: \([Petal\_Width \in [0.60, 1.60]] \wedge [Roots \le 10] \wedge [Stem = \text{``Thick''}]\). A pattern that appears significantly more often in one class than in the remaining classes is called a contrast pattern. Finally, a classifier that predicts the class of a query object based on a set of contrast patterns is called a contrast pattern-based classifier. It is important to highlight that pattern-based classifiers, as well as their results, can be understood by experts in the application domain through the patterns associated with each class. Also, contrast pattern-based classifiers have reported significantly better classification results than other popular classification models, such as naive Bayes, nearest neighbor, bagging, boosting, and SVM [7, 11].
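To make the notions of pattern and support concrete, the following minimal sketch encodes the example pattern as a predicate and measures the fraction of each class that it covers. The plant objects, the class labels "A" and "B", and the function names are made up for illustration and do not come from the paper.

```python
# Illustrative only: the objects and class labels below are invented.
def covers(obj):
    """Return True if obj satisfies the example pattern from the text."""
    return (0.60 <= obj["Petal_Width"] <= 1.60
            and obj["Roots"] <= 10
            and obj["Stem"] == "Thick")


def support(pattern, objects, label):
    """Fraction of the objects of class `label` covered by the pattern."""
    in_class = [o for o in objects if o["class"] == label]
    if not in_class:
        return 0.0
    return sum(pattern(o) for o in in_class) / len(in_class)


plants = [
    {"Petal_Width": 1.2, "Roots": 8, "Stem": "Thick", "class": "A"},
    {"Petal_Width": 0.9, "Roots": 5, "Stem": "Thick", "class": "A"},
    {"Petal_Width": 2.0, "Roots": 8, "Stem": "Thick", "class": "A"},
    {"Petal_Width": 1.0, "Roots": 12, "Stem": "Thin", "class": "B"},
    {"Petal_Width": 1.1, "Roots": 9, "Stem": "Thick", "class": "B"},
    {"Petal_Width": 0.3, "Roots": 4, "Stem": "Thin", "class": "B"},
    {"Petal_Width": 1.9, "Roots": 3, "Stem": "Thin", "class": "B"},
]

support_a = support(covers, plants, "A")  # 2 of the 3 objects of class A
support_b = support(covers, plants, "B")  # 1 of the 4 objects of class B
```

In this toy data the pattern covers a larger share of class A than of class B, which is the kind of difference a contrast pattern is expected to show.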

In some real-world applications, the objects are not equally distributed among the classes, as in online banking fraud detection, liver and pancreas disorders, forecasting of ozone levels, prediction of protein sequences, and face recognition. In these applications, there are significantly fewer objects belonging to one class (commonly labeled as the minority class) than to the remaining classes. This problem is known as the class imbalance problem [18,19,20].

Some classifiers that show good classification results in problems with balanced classes do not necessarily achieve good performance in class imbalance problems. The main reason is that they are biased toward the majority class (the class with more objects). Accordingly, the accuracy of these classifiers for the minority class could be close to zero [19, 20].

In class imbalance problems, some pattern-based classifiers, like CAEP [8], do not achieve good classification results because the contrast patterns of the minority class are fewer and have lower support than the contrast patterns of the majority class. As a result, classification strategies that are based only on the support of the contrast patterns tend to be biased toward the majority class [2, 18, 19].

One proposal for supervised classification based on contrast patterns in class imbalance problems is to select just a subset of good contrast patterns. The idea is to select, for each class, a collection of high-quality patterns. Consequently, at the classification stage, the low-support contrast patterns of the minority class are not overwhelmed by the high-support contrast patterns of the majority class, which are far more numerous [4, 9, 19, 24, 25].

In the literature, there are three main approaches for contrast pattern selection: (i) selecting only the best contrast pattern, (ii) selecting the k best contrast patterns, and (iii) selecting all contrast patterns covering the training dataset [4, 9, 18, 24, 25]. In this paper, we propose a novel contrast pattern selection method by class for class imbalance problems; the idea consists of selecting all the contrast patterns of the minority class and only a certain percent of the contrast patterns of the majority class. When the selected contrast patterns are used by a contrast pattern-based classifier, our proposal obtains better accuracy results than other state-of-the-art contrast pattern selection approaches.

The rest of the paper has the following structure. Section 2 contains a brief description of the main contrast pattern selection approaches reported in the state-of-the-art. Section 3 introduces our proposal for selecting contrast patterns in class imbalance problems. Section 4 provides the experimental setup. Section 5 presents the experimental results as well as a discussion of them. Finally, Sect. 6 provides our conclusions and future work.

2 Related Work

In pattern-based classification, an important task is to select a collection of high-quality patterns for obtaining good classification results [18, 27]. Additionally, the fewer the patterns, the faster the classification stage and the easier it is for experts in the application domain to understand the results.

Three main approaches have been proposed in the literature for selecting contrast patterns [4, 17, 18, 22, 24, 25]. These approaches use a quality measure for contrast patterns with the aim of creating a ranking of contrast patterns. A quality measure is a function \(q(P, C, {\bar{C}}) \rightarrow \mathbb{R}\), which assigns a higher value to a pattern P when it better discriminates the objects in a class C from the objects in the remaining problem classes \({\bar{C}}\) [17, 18]. Usually, the measure for ranking the contrast patterns depends on the contrast pattern-based classifier to be used (e.g., confidence for association rules, \(\chi^2\) for decision trees, growth rate for emerging patterns). The three main approaches for selecting contrast patterns are:

  • Best contrast pattern (Best CP): Select the best contrast pattern, according to the ranking, covering the query object. This approach is used by several rule-based classifiers, like CBA [16], for classifying query objects. A drawback of this approach in class imbalance problems is that the best contrast pattern according to the ranking commonly belongs to the majority class. Consequently, the accuracy of the classifier for the minority class is poor.

  • Best k contrast patterns (Best k): Select the best k contrast patterns from the ranking that cover the query object. Usually, this approach is used by rule-based classifiers like CPAR [26] and some emerging pattern selection methods, such as the one proposed in [17]. Recently, the authors of [18] proposed using a fixed percent of patterns instead of a fixed number of patterns. A disadvantage of this approach in class imbalance problems is that the patterns of the minority class are few and have low support, so selecting only a few patterns of the minority class could degrade the accuracy of the classifier for the minority class.

  • Covering the training dataset (Covering): Select the best contrast patterns, according to the ranking, covering all the objects of the training dataset. This approach is used by some rule-based classifiers, like ACN [14] and CMAR [15], and some emerging pattern selection methods [11, 17, 18], showing good accuracy results.
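The three approaches above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited classifiers: patterns are hypothetical dictionaries with a name, a quality value, and a covering predicate; objects are plain numbers; and Covering is interpreted as keeping a pattern only when it covers at least one training object that is still uncovered.

```python
# Hypothetical patterns over numeric objects, already sorted by quality.
ranked = [
    {"name": "p1", "quality": 0.9, "covers": lambda x: x < 5},
    {"name": "p2", "quality": 0.8, "covers": lambda x: x < 3},
    {"name": "p3", "quality": 0.7, "covers": lambda x: x >= 5},
    {"name": "p4", "quality": 0.6, "covers": lambda x: x >= 8},
]


def best_cp(ranked, query):
    """Best CP: the single top-ranked pattern covering the query."""
    for p in ranked:
        if p["covers"](query):
            return [p]
    return []


def best_k(ranked, query, k):
    """Best k: the k top-ranked patterns covering the query."""
    return [p for p in ranked if p["covers"](query)][:k]


def covering(ranked, training):
    """Covering: walk the ranking and keep each pattern that covers at
    least one still-uncovered training object (an assumed reading)."""
    uncovered = set(range(len(training)))
    selected = []
    for p in ranked:
        newly = {i for i in uncovered if p["covers"](training[i])}
        if newly:
            selected.append(p)
            uncovered -= newly
        if not uncovered:
            break
    return selected


training = [1, 2, 4, 6, 9]
names = lambda ps: [p["name"] for p in ps]
best_cp_names = names(best_cp(ranked, 2))       # ["p1"]
best_k_names = names(best_k(ranked, 9, 2))      # ["p3", "p4"]
covering_names = names(covering(ranked, training))  # ["p1", "p3"]
```

Note that Best CP and Best k depend on the query object, whereas Covering produces a single subset from the training data alone.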

The second and third approaches were studied in [18] for selecting contrast patterns in class imbalance problems. The authors tested several quality measures for contrast patterns and concluded that the best quality measure for ranking contrast patterns in class imbalance problems is Jaccard [23]. Based on this conclusion, we propose a novel contrast pattern selection method, which uses the Jaccard quality measure for ranking the contrast patterns.
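As an illustration of how a Jaccard-style quality measure can rank patterns, the sketch below uses the usual set-overlap form: the number of objects of the class covered by the pattern, divided by the number of objects that are either covered by the pattern or belong to the class. Whether [23] defines the measure in exactly this form is an assumption here, and the object identifiers are invented.

```python
def jaccard(covered_ids, class_ids):
    """Set-overlap Jaccard between the objects covered by a pattern and
    the objects of the pattern's class (assumed form of the measure)."""
    covered, members = set(covered_ids), set(class_ids)
    union = covered | members
    if not union:
        return 0.0
    return len(covered & members) / len(union)


# Hypothetical class with four objects and two candidate patterns.
class_ids = {0, 1, 2, 3}
pattern_cover = {
    "p1": {0, 1, 2, 7},           # 3 hits, 1 object from another class
    "p2": {0, 4, 5, 6, 7, 8},     # 1 hit, 5 objects from other classes
}

scores = {name: jaccard(ids, class_ids) for name, ids in pattern_cover.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here p1 scores 3/5 and p2 scores 1/9, so p1 is ranked first: the measure penalizes both the class objects a pattern misses and the objects of other classes it covers.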

3 New Contrast Pattern Selection Method

In this section, we introduce a contrast pattern selection method for class imbalance problems.

Usually, in class imbalance problems, contrast pattern mining algorithms extract several patterns with high support for the majority class and only a few patterns, with low support, for the minority class [2, 18, 19]. This causes some contrast pattern-based classifiers, like CAEP [8], to become biased toward the majority class. Some strategies address this by selecting a collection of high-quality patterns, but their main drawback is that some patterns of the minority class, which could help at the classification stage, are discarded [15, 16, 26]. To solve this problem, we propose selecting the patterns by class: all the contrast patterns of the minority class and only a few contrast patterns of the majority class. The main idea is to keep the useful patterns of the minority class and to avoid selecting many patterns of the majority class, which could overwhelm the patterns of the minority class at the classification stage.

Our pattern selection method can be described by the following steps:

  1. Select all the contrast patterns of the minority class to avoid reducing the number of patterns of this class.

  2. Rank the contrast patterns of the majority class by using a quality measure for contrast patterns.

  3. Select the best k contrast patterns of the majority class. The k value is a percent of the total number of patterns, which is provided by the user.
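The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the percent is applied to the number of majority-class patterns and that a fractional count is rounded up; the pattern dictionaries, the class labels, and the `quality` callable are hypothetical.

```python
import math


def select_by_class(patterns, minority_label, percent, quality):
    """Keep every minority-class pattern plus the top `percent` percent
    of the majority-class patterns, ranked by `quality`."""
    # Step 1: keep all the patterns of the minority class.
    minority = [p for p in patterns if p["cls"] == minority_label]
    majority = [p for p in patterns if p["cls"] != minority_label]
    # Step 2: rank the majority-class patterns, best first.
    ranked = sorted(majority, key=quality, reverse=True)
    # Step 3: keep the best k, where k is a percent of the majority
    # patterns (rounding up is an assumption of this sketch).
    k = math.ceil(len(ranked) * percent / 100)
    return minority + ranked[:k]


# Hypothetical input: 2 minority patterns and 10 majority patterns.
patterns = [{"cls": "min", "quality": 0.30}, {"cls": "min", "quality": 0.20}]
patterns += [{"cls": "maj", "quality": q / 10} for q in range(1, 11)]

by_quality = lambda p: p["quality"]
selected_10 = select_by_class(patterns, "min", 10, by_quality)
selected_25 = select_by_class(patterns, "min", 25, by_quality)
```

With these inputs, a value of 10% keeps the two minority patterns and the single best majority pattern, while 25% keeps the two minority patterns and the three best majority patterns.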

Commonly, the number of patterns extracted from a database depends on different factors, such as the nature of the training dataset, the contrast pattern mining algorithm, and the a priori global discretization, among others. Hence, instead of selecting a fixed number of contrast patterns, our contrast pattern selection method selects a percent of the patterns. The main reason is that a fixed number could be too high relative to the number of mined patterns, in which case almost all the patterns would be selected, or too small, in which case the number of patterns would be reduced more than necessary.

Finally, it is important to highlight that, in step 2, we suggest using the Jaccard quality measure [23] because this measure has shown good results for ranking contrast patterns in class imbalance problems [18].

4 Experimental Setup

To evaluate the performance of the proposed selection method, we compare our proposal against the three main approaches for selecting contrast patterns reported in the literature. First, we extract the patterns by using a contrast pattern miner. After that, we create a pattern ranking by applying the Jaccard quality measure [23] to the collection of extracted patterns. Next, we select the patterns by using the three main pattern selection approaches described in Sect. 2 and our proposal. Finally, each subset of patterns is used to build a contrast pattern-based classifier. In this way, we can detect which selection method attains better classification results. Since the contrast pattern miner and the classifier are the same, and the only difference is the set of contrast patterns chosen by the selection method, good or bad classification performance can be attributed to the selection method employed.

Table 1 shows the 95 databases used in our experiments, which were taken from the KEEL dataset repository [1]. To avoid problems due to data distribution in class imbalance problems, for each database, we performed a distribution optimally balanced stratified five-fold cross-validation, as suggested in [20].

Table 1. Summary of the imbalanced databases used in our study, listing the name in the KEEL dataset repository (Name), the number of objects (#Objects), the number of features (#Feat.), and the imbalance ratio (IR) [21].
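For reference, the IR of a two-class database is commonly computed as the number of objects in the majority class divided by the number of objects in the minority class. A minimal sketch with made-up class labels:

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical database: 90 negative objects and 10 positive objects.
labels = ["negative"] * 90 + ["positive"] * 10
ir = imbalance_ratio(labels)  # 90 / 10
```

A perfectly balanced database has an IR of 1, and larger values indicate a stronger imbalance.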

To assess the performance of our classification results, we used the AUC measure [13] because it is the most widely used measure for class imbalance problems [18,19,20]. All our AUC results were averaged over the five folds of the cross-validation.
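As a reminder of what the AUC measures, the sketch below uses its rank-based formulation: the probability that a randomly chosen minority (positive) object receives a higher score than a randomly chosen majority object, with ties counted as one half. This is one standard way to compute the AUC and is an assumption here; the paper may compute it differently. The scores, labels, and per-fold values are made up.

```python
def auc(scores, labels, positive):
    """Rank-based AUC: fraction of (positive, negative) pairs in which
    the positive object has the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two minority and four majority objects.
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.5]
labels = [1, 1, 0, 0, 0, 0]
fold_auc = auc(scores, labels, positive=1)  # 6 of 8 pairs ordered correctly

# Averaging over five folds, as in the experimental setup (made-up values).
fold_aucs = [0.81, 0.79, 0.84, 0.80, 0.76]
mean_auc = sum(fold_aucs) / len(fold_aucs)
```

Unlike plain accuracy, this value does not improve when a classifier simply assigns every object to the majority class.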

As the contrast pattern-based classifier, we selected PBC4cip [19], since it has reported better AUC results than other state-of-the-art classifiers for class imbalance problems [19].

For mining contrast patterns, we selected the Random Forest miner (RFm) [10] using the Hellinger distance [3] as node splitting measure, as suggested in [19]. The main reason is that RFm has been used jointly with the PBC4cip classifier, obtaining higher accuracies than other state-of-the-art contrast pattern mining algorithms [19].
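To illustrate why the Hellinger distance suits imbalanced data as a node splitting measure, the sketch below evaluates a candidate split in a two-class problem using the form commonly given for Hellinger distance decision trees: the distance between the distribution of the positive objects over the child nodes and the distribution of the negative objects over the same child nodes. That RFm uses exactly this form is an assumption, and the counts are made up.

```python
import math


def hellinger_split(branches):
    """Hellinger distance of a candidate split in a two-class problem.
    `branches` holds one (n_positive, n_negative) pair per child node."""
    total_pos = sum(p for p, _ in branches)
    total_neg = sum(n for _, n in branches)
    if total_pos == 0 or total_neg == 0:
        return 0.0
    return math.sqrt(sum(
        (math.sqrt(p / total_pos) - math.sqrt(n / total_neg)) ** 2
        for p, n in branches
    ))


# Hypothetical node with 10 positive and 990 negative objects.
perfect = hellinger_split([(10, 0), (0, 990)])     # classes fully separated
useless = hellinger_split([(5, 495), (5, 495)])    # same mix in both children
```

Because each class is normalized by its own size, the value depends on how well the split separates the classes rather than on how many objects each class has, which is the property that makes the measure attractive under class imbalance.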

For selecting contrast patterns, we used the three main approaches reported in the literature (see Sect. 2). For selecting the best k contrast patterns by class, we used the values 10%, 50%, and 80%, which have been used in previous studies for class imbalance problems [18]. For selecting contrast patterns using our proposal, we used the following k values: 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, and 80%. With these values, we can investigate whether our proposal is able to attain statistically similar AUC results while using fewer patterns than the number of patterns used in [18].

We also used the Friedman test and the Shaffer and Finner post-hoc procedures to compare all the classification results, as suggested in [5, 6]. Post-hoc results are shown by using CD (critical difference) diagrams [5]. In a CD diagram, the rightmost classifier is the best classifier. The position of a classifier within the segment represents its rank value, and if two or more classifiers share a thick line, they have statistically similar behavior.
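The rank values placed on a CD diagram are the average ranks used by the Friedman test. The sketch below computes these average ranks and the basic Friedman chi-square statistic (without a tie correction) for a small, made-up table of AUC values. The Shaffer and Finner post-hoc procedures are not covered, and the function names are hypothetical.

```python
def average_ranks(results):
    """results[i][j] is the AUC of method j on database i. Rank 1 is the
    best method on a database; tied methods share the average rank."""
    n_methods = len(results[0])
    totals = [0.0] * n_methods
    for row in results:
        order = sorted(range(n_methods), key=lambda j: row[j], reverse=True)
        pos = 0
        while pos < n_methods:
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(results) for t in totals]


def friedman_statistic(results):
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(results), len(results[0])
    ranks = average_ranks(results)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks)
                                     - k * (k + 1) ** 2 / 4)


# Hypothetical AUC values: 4 databases (rows) and 3 methods (columns).
results = [
    [0.90, 0.85, 0.70],
    [0.88, 0.88, 0.60],
    [0.95, 0.80, 0.75],
    [0.70, 0.72, 0.65],
]
ranks = average_ranks(results)
statistic = friedman_statistic(results)
```

For this toy table the average ranks are 1.375, 1.625, and 3.0, so the first method would be placed at the best end of the diagram.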

5 Experimental Results

This section analyzes and discusses the classification results achieved by the contrast pattern selection methods described in Sect. 2 and by our proposal, using all the imbalanced databases shown in Table 1.

To simplify the presentation, a supplementary material website has been created for this paper, which contains several tables of experimental results as well as detailed tables of the statistical test results.

Fig. 1. CD diagram with a statistical comparison (using \(\alpha = 0.05\)) of the AUC results of our proposal for selecting contrast patterns, using different k values over all the tested databases.

Figure 1 shows a CD diagram with a statistical comparison of the AUC results obtained by our proposal using different k values and considering all the imbalanced databases shown in Table 1. From this figure, we can conclude that our proposal using k = 25% obtained the best position in the Friedman ranking. However, the difference between the AUC results of our proposal using k = 25% and those obtained using k = 10%, 15%, 20%, 30%, 35%, and 40% is not statistically significant. Therefore, we selected k = 10%, since it selects the fewest patterns.

Figure 2 shows a CD diagram with a statistical comparison of the AUC results obtained by our proposal, using k = 10%, against the other contrast pattern selection methods reviewed in Sect. 2, as well as against using all the contrast patterns (All CPs). From this figure, we can see that the differences between the AUC results of our proposal and those achieved by Best k = 50%, Best k = 80%, and All CPs are not statistically significant. However, our proposal obtains a better position in the Friedman ranking than using all the contrast patterns (All CPs). Also, our proposal uses fewer contrast patterns for classification than the other approaches with statistically similar behavior. On the other hand, notice that the contrast pattern selection method Best CP obtained statistically the worst results. This is because, in class imbalance problems, the best contrast pattern according to the ranking commonly comes from the majority class and, consequently, the accuracy for the minority class is greatly affected.

Fig. 2. CD diagram with a statistical comparison (using \(\alpha = 0.05\)) of the AUC results of our proposal and the other contrast pattern selection methods reported in the literature.

5.1 Regarding Different Class Imbalance Levels

To study the effect of the class imbalance level on the contrast pattern selection methods previously analyzed, we divided the databases into groups according to their IR. To do this, we used the Discretize method, taken from the WEKA Data Mining Tool [12], to create six equal-frequency groups. These groups are shown in Table 1 using thin horizontal lines.
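Equal-frequency grouping can be sketched as follows. This is a simplified illustration and not a reproduction of the WEKA filter: it assumes that cut points are placed midway between the two neighboring values at each group boundary, that the intervals are closed on the right, and that there are no duplicate values at the boundaries. The IR values are made up.

```python
def equal_frequency_cuts(values, n_bins):
    """Cut points that split the sorted values into n_bins groups of
    (nearly) equal size, placed midway between neighboring values."""
    if n_bins > len(values):
        raise ValueError("more bins than values")
    ordered = sorted(values)
    cuts = []
    for b in range(1, n_bins):
        i = round(b * len(ordered) / n_bins)  # first index of the next bin
        cuts.append((ordered[i - 1] + ordered[i]) / 2)
    return cuts


def bin_index(value, cuts):
    """0-based bin of a value, using right-closed intervals (lo, hi]."""
    return sum(value > c for c in cuts)


# Hypothetical IR values of twelve databases, split into six groups.
irs = [1.8, 2.5, 3.2, 5.1, 6.4, 8.9, 9.5, 12.3, 15.8, 22.1, 30.6, 41.4]
cuts = equal_frequency_cuts(irs, 6)

counts = [0] * 6
for ir in irs:
    counts[bin_index(ir, cuts)] += 1
```

With twelve values and six groups, each group receives exactly two databases, which is the defining property of an equal-frequency split.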

Table 2. Results of the best contrast pattern selection for each bin

Table 2 shows the best contrast pattern selection method for each bin. From this table, we can conclude that for the least imbalanced databases (Bin1), as well as for Bin4 and Bin6, the best contrast pattern selection method is Best k = 80%. For Bin3, the best selection method is our proposed method using k = 10%. Finally, for Bin2 and Bin5, the best contrast pattern selection method is Best k = 50%. These results help us to choose the contrast pattern selection method depending on the class imbalance level of the database.

6 Conclusions and Future Work

Selecting a collection of high-quality patterns is an important task for pattern-based classification. The main aim is to achieve good classification results using as few patterns as possible, in order to obtain a model that is easier for experts in the application domain to understand. Following this idea, the main contribution of this paper is a new contrast pattern selection method for contrast pattern-based classification in class imbalance problems. Our proposal selects all the patterns of the minority class and, based on a ranking computed through the Jaccard measure, a percent of the best patterns of the majority class.

From our experiments using several imbalanced databases, we can conclude that our proposal obtains its best position in the Friedman ranking when it uses 25% of the contrast patterns of the majority class, although the differences with respect to k = 10%, 15%, 20%, 30%, 35%, and 40% are not statistically significant. Also, our proposal using k = 10% significantly outperforms other contrast pattern selection methods reported in the state of the art, like Best CP, Covering, and Best k = 10%. Moreover, our proposal using k = 10% shows no statistically significant differences with respect to other contrast pattern selection methods, like Best k = 50%, Best k = 80%, and All CPs, but these methods need more patterns.

On the other hand, based on our experiments regarding the class imbalance ratio of the databases, we suggest the following: if the database has an IR smaller than or equal to 5.300, or an IR in (12.810, 23.730] or in (39.905, 129.440], then Best k = 80% is the best contrast pattern selection method. If the database has an IR in (9.175, 12.810], then our proposal using k = 10% is recommended. Finally, if the database has an IR in (5.300, 9.175] or in (23.730, 39.905], then we suggest using the contrast pattern selection method Best k = 50%.
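The recommendation above can be written as a simple lookup on the IR of a database. The function name and the returned strings are hypothetical, and the thresholds are the interval limits stated in the text. No recommendation is returned for an IR above 129.440, since that range was not studied.

```python
def recommend_selection(ir):
    """Map the IR of a database to the suggested selection method.
    Intervals are closed on the right, as in the text."""
    if ir <= 5.300:
        return "Best k = 80%"
    if ir <= 9.175:
        return "Best k = 50%"
    if ir <= 12.810:
        return "Proposal, k = 10%"
    if ir <= 23.730:
        return "Best k = 80%"
    if ir <= 39.905:
        return "Best k = 50%"
    if ir <= 129.440:
        return "Best k = 80%"
    return None  # outside the range covered by the experiments


suggestions = {ir: recommend_selection(ir) for ir in (3.0, 9.0, 10.5, 20.0, 30.0, 100.0)}
```

For example, a database with an IR of 10.5 falls in (9.175, 12.810], so the proposal with k = 10% is suggested.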

Finally, as future work, we will explore the use of maximal or closed contrast patterns as an alternative for selecting a reduced subset of contrast patterns for classification in class imbalance problems.