Ensemble with estimation: seeking for optimization in class noisy data
Class noise, also known as mislabeled data in the training set, can lead to poor classification accuracy no matter which machine learning method is used. A reasonable estimate of class noise therefore has a significant impact on the performance of learning methods. However, error in existing estimation methods is theoretically inevitable, and it impairs the performance of an optimal classifier trained on noisy data. Instead of seeking a single optimal classifier on noisy data, in this work we use a set of weak classifiers, whose weakness is caused by the negative impact of noisy data, to learn a strong ensemble classifier based on the training error and the estimated class noise. With this strategy, the proposed ensemble-with-estimation method overcomes the gap between the estimated and the true distribution of class noise. Our method does not require any a priori knowledge about class noise. We prove that the optimal ensemble classifier on the noisy distribution can approximate the optimal classifier on the clean distribution as the training set grows. Comparisons with existing algorithms show that our method outperforms state-of-the-art approaches on a large number of benchmark datasets in different domains. Both the theoretical analysis and the experimental results reveal that our method improves performance, works well on clean data, and is robust to the algorithm parameter.
A typical machine learning method uses a classifier learned from a labeled dataset (i.e., the training data) to predict the class labels of new samples (i.e., the testing data). In most classification applications, the labels of the training data are assumed to be correct. However, real-world datasets often contain noise, which may occur either in the features of the data, defined as attribute noise, or in the labels of the data, defined as class noise.
Many studies have focused on handling attribute noise since it is quite common in machine learning and data mining tasks. However, researchers such as [1, 2] have indicated that class noise can be potentially more detrimental than attribute noise. The study of the class noise problem is therefore essential for improving classification performance [1]. We must point out that class noise is unavoidable in many real-world applications such as disease prediction in medical applications [3], food labeling for the food industry [4], and manual data labeling in some natural language processing applications [5, 6, 7].
Generally speaking, there are two types of strategies to deal with class noise. The first entails learning with noisy data and the second is based on noise elimination.
Learning with noise assumes that each training sample is assigned a weight based on an estimated probability of class noise, that is, the class noise rate. The learning algorithm considers the class noise rate while learning from the original noisy training data [8]. Unfortunately, this method requires a priori knowledge of the class noise rate in the training data.
The noise elimination strategy attempts to detect and remove erroneous data from the training set [9, 10, 11]. This method can reduce the rate of class noise in the training data, yet it often leads to an overall reduction of training samples. Even though the reduced training data may be less noisy, the smaller training set may result in worse performance of a learning algorithm than using the original noisy data. Therefore, good noise estimation methods are important to avoid overrating class noise, which can lead to large data reduction. One successful approach to class noise elimination is based on kernel density estimation, using methods such as k-nearest neighbors (kNN) or the Parzen window to detect and remove noisy data [10, 12, 13]. However, there is a theoretical flaw in these methods. They rest on kernel density estimation, which requires a small neighborhood radius \(\varepsilon\) to satisfy the manifold assumption, and on the central limit theorem, which requires a large \(\varepsilon\) to estimate the parameters of a Gaussian distribution. These contradictory assumptions limit the performance of such methods in applications.
In this paper, we propose a novel method to estimate the noise rate for each training sample. In order to avoid overrating the class noise rate and to avoid the contradictory assumptions, we introduce the sum of Rademacher distribution [14] instead of the central limit theorem to estimate class noise. We choose the kNN graph for the kernel density estimation because it is more sensitive to class noise.
Based on this noise estimation method, we then modify a given surrogate loss function using the estimated class noise rate. The modified surrogate loss function is the optimization objective of the classifier. According to the theoretical analysis, this loss function is sensitive to the parameter in the estimation algorithm. We therefore propose a sampling-based algorithm that builds a strong classifier from a series of weak classifiers as an ensemble, which is used to optimize the loss function on the training data and to overcome the sensitivity to the parameter. Traditionally, the ensemble method is not suitable for handling class noise because it also enhances the noise in the learning process [15, 16]. However, our method takes the noise rate into consideration and makes the algorithm adapt to the noisy distribution. Both the theoretical analysis and the experimental results reveal that our method improves performance, works well on clean data, and is robust to the algorithm parameter.

A class noise estimation approach is proposed and a new weighted loss function is given;

The proposed loss function based on the estimated class noise experimentally demonstrates better class noise estimation performance than the existing popular algorithms;

In comparison with the existing methods of noise estimation, our approach requires no a priori knowledge about the class noise and resolves the contradiction in the currently used theory. Performance evaluation also shows that we can indeed achieve state-of-the-art results.
Identifying class noise is an important issue in machine learning. Previous publications cover both the theory [17] and the application aspects [12, 13] of this topic. In particular, with the development of deep learning in recent years, how to train a robust neural network has become a hot topic in the design of learning architectures [18, 19, 20, 21]. In this section, we briefly introduce related works from three perspectives: the source of class noise, the handling of class noise, and the application of class noise.
Class noise can exist for different reasons. When used for disease prediction in medical applications [3], training data contains some probability of false positives or false negatives because the data comes from medical experiments. In other words, class noise exists naturally and cannot be avoided. Food labeling for the food industry [4] also faces the class noise problem. As shown by [4], beef has a higher price than mutton in some countries in South Africa. Mislabeling is thus done deliberately to gain more profit. The manual labeling of data in some natural language processing applications [5, 6, 7] also contains class noise because there is always a possibility of inter-annotator inconsistency.
Given x, the set of features of a sample, let \(\widetilde{y}\) be the observed label of x, let y be the true label of the sample x, and let p be the probability of flipping the true label into a noisy label (thus p is the class noise rate, or noise rate for short). For any training algorithm, only x and \(\widetilde{y}\) can be observed. In general, class noise can be categorized into three different models based on the dependence of the noise on y and p [22]. The first model is called the noise completely at random model [17, 23, 24]. The basic idea is that the class noise rate of a sample is completely random and independent of the labels and the feature set. Thus, an observed label is determined only by the true label and the class noise rate. The second model is called the noise at random model [25, 26, 27]. In this model, the class noise rate depends on the true label of a sample and is independent of the feature set of the sample. In other words, different labels have different probabilities of being flipped to the wrong label. The third model is the noise not at random model [28, 29]. This model assumes that the class noise rate is affected by both the label and the feature set of the sample. Informally, this model can be described as follows: a sample will be mislabeled as the most similar category.
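The three noise models can be illustrated with a minimal Python sketch that injects label noise into clean binary labels. The function names, the `scores` argument, and the rule that makes samples near the decision boundary more likely to flip in the third model are assumptions made for illustration only; they are not taken from the cited works.

```python
import random

def flip_ncar(labels, p, rng):
    """Noise completely at random: every label flips with the same probability p."""
    return [-y if rng.random() < p else y for y in labels]

def flip_nar(labels, p_by_class, rng):
    """Noise at random: the flip probability depends only on the true label."""
    return [-y if rng.random() < p_by_class[y] else y for y in labels]

def flip_nnar(scores, labels, p_max, rng):
    """Noise not at random: the flip probability also depends on the features.

    Assumption for illustration: `scores` holds a signed margin per sample, and
    samples close to the boundary (small |score|) are more likely to flip.
    """
    noisy = []
    for s, y in zip(scores, labels):
        p = p_max / (1.0 + abs(s))
        noisy.append(-y if rng.random() < p else y)
    return noisy

rng = random.Random(0)
y = [1] * 500 + [-1] * 500
noisy = flip_ncar(y, 0.2, rng)
rate = sum(a != b for a, b in zip(y, noisy)) / len(y)
print(f"observed flip rate under NCAR: {rate:.3f}")
```

Only the noisy labels would be visible to a learner; the true labels are kept here solely to measure the realized flip rate.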
Theoretical research on learning with class noise was initiated by Angluin and Laird in 1988 [17]. In their work, all instances of the labeled data for binary classification have a uniform label-inversion probability \(p \in [0,1/2)\). This is referred to as random classification noise.
Class noise is typically assumed to be stochastic in algorithms that can handle it. The work in Ref. [8] assumed a noise rate learned from a priori knowledge and gave every sample the same probability, a simple assumption that may not be reasonable in all scenarios. The Cut Edge Weight Statistic method [10] also required a prior assumed noise rate. This method used the prior probability as a hypothesis to test whether a training sample satisfies the null hypothesis. The method also required the neighbors of a training sample to follow the central limit theorem, which is not reasonable because the set of neighbors is too small. Other works simply did not consider the noise rate [5, 9, 11, 30, 31].
There are two basic categories of strategies to deal with class noise in training data: learning with class noise [5, 8] and class noise elimination [9, 10, 11]. The learning with noise strategy approximates the distribution of noiseless training data using the distribution of the original training data with class noise, and it must have a priori knowledge about the class noise [8]. The problem, however, is that a priori knowledge of the class noise is typically not available, which limits the applicability of this method. Li [24] uses the Kernel Fisher method to estimate the class noise. The estimate is then used by a robust algorithm that can tolerate noise.
The class noise elimination strategy attempts to detect the samples with high noise probabilities and remove them from the training set. There are different methods to detect class noise, and they can be categorized as classification-based methods and graph-based methods. The classification-based method was first proposed by Brodley in 1999 [9]. He used k-fold cross-validation to split the training set into k subsets, and used any \(k-1\) subsets as the training data to classify the remaining data. If the classification result for a sample differs from the original label, that sample is considered class noise and is removed. Zhu et al. proposed a more efficient algorithm suitable for large datasets [30]. Zhu also proposed a cost-sensitive approach based on k-fold cross-validation [1]. Sluba employed 10-fold cross-validation to detect class noise [11]. The graph-based method, known as the Cut Edge Weight Statistic method, was proposed by Zighed [10]. The principal idea is based on the manifold assumption and the Bernoulli distribution assumption. A similar approach was proposed by Jiang and Zhou, who used a kNN graph to detect class noise without considering the noise rate [32]. There are three major problems with the elimination approach. First, some correctly labeled training data can be eliminated because of potentially inaccurate class noise identification. Second, the number of samples in the training data is reduced, potentially harming the performance of the learning algorithm. Third, the manifold assumption requires a small k, whereas the sum of Bernoulli variables approaches a Gaussian distribution only when k is larger than 25. This paradox limits the performance gain in class noise estimation. In elimination-based methods, reliable noise estimation is crucial, and inaccurate estimates can degrade performance compared to no noise elimination at all.
Some works also revise the AdaBoost learning algorithm to handle noisy data [33, 34, 35]. In principle, AdaBoost [36] is not a suitable learning algorithm for noisy data. The reason is that during the learning process AdaBoost enhances the misclassified training samples in the next iteration. A noisy sample that is actually classified correctly may be regarded as misclassified because of its noisy label, and enhancing it leads to worse performance [15, 16]. However, some strategies can be used to smooth the weight updates in the learning steps of AdaBoost to avoid overfitting on noisy data [33, 34]. Another method simply allows boosting to misclassify some of the training samples to obtain a more robust booster [35]. There are also adaptive methods that use a confidence score and remove a sample if its noise estimation confidence is higher than a threshold [37, 38, 39].
Before presenting our proposed algorithm, we need to explain the background and the problem setup. The fundamental problem is that each sample in the training data has some probability of being mislabeled, which means that the samples have class noise. Hence, we need to estimate the probability of mislabeling for each training sample, referred to as the class noise rate in the rest of this paper. After that, we can train a classifier on the noisy training data based on the estimated probability.
The second problem is how to measure the learning ability of the classifier given the estimated noise rate. Is it possible that our classifier, trained on the noisy distribution, can achieve performance similar to that of a classifier trained on the clean distribution? If the answer is yes, we can use the difference between the two classifiers to measure the learning ability of the classifier trained on the noisy distribution. If the classifier trained on the noisy distribution is the optimal classifier on the clean distribution, this difference should theoretically be small.
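To make the small-neighborhood objection concrete, the following Python sketch compares an exact binomial upper tail with its normal approximation for a small and a larger number of neighbors. It is only an illustration of how a Gaussian approximation behaves for small k; it is not the test statistic used by the Cut Edge Weight Statistic method, and the values of k, p and the threshold m are arbitrary assumptions.

```python
import math

def binom_upper_tail(k, p, m):
    """Exact P(X >= m) for X ~ Binomial(k, p)."""
    return sum(math.comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(m, k + 1))

def normal_upper_tail(k, p, m):
    """Normal approximation of P(X >= m), without continuity correction."""
    mu = k * p
    sigma = math.sqrt(k * p * (1 - p))
    z = (m - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

for k, m in ((5, 3), (50, 15)):
    exact = binom_upper_tail(k, 0.2, m)
    approx = normal_upper_tail(k, 0.2, m)
    print(f"k={k:2d}  exact={exact:.4f}  normal={approx:.4f}  ratio={exact / approx:.2f}")
```

With k = 5 the approximation understates the tail by a much larger factor than with k = 50, which is the kind of mismatch that motivates replacing the Gaussian assumption for small neighborhoods.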
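The classification-based filter described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the base learner is a nearest-centroid rule chosen only to keep the example self-contained (the cited works use other classifiers), and the names `centroid_classifier` and `kfold_filter` are hypothetical.

```python
import random

def centroid_classifier(train):
    """Fit a nearest-centroid rule on (x, y) pairs with y in {+1, -1}."""
    cents = {}
    for label in (1, -1):
        pts = [x for x, y in train if y == label]
        if pts:
            cents[label] = [sum(c) / len(pts) for c in zip(*pts)]

    def predict(x):
        return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, cents[l])))

    return predict

def kfold_filter(data, k, rng):
    """Indices whose observed label disagrees with a model trained on the other k-1 folds."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    suspects = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in idx if i not in held]
        predict = centroid_classifier(train)
        suspects += [i for i in held_out if predict(data[i][0]) != data[i][1]]
    return sorted(suspects)

rng = random.Random(0)
data = [((rng.gauss(2, 0.5), rng.gauss(2, 0.5)), 1) for _ in range(50)]
data += [((rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)), -1) for _ in range(50)]
for i in (3, 17, 60):          # inject three label flips
    x, y = data[i]
    data[i] = (x, -y)
print(kfold_filter(data, k=5, rng=rng))
```

On well-separated toy data the filter flags the injected flips. On real data it would also flag some correctly labeled samples, which is the first problem with elimination noted above.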
In this section, we first give a strict formalized definition of the class noise rate, followed by the learning ability measure of the classifiers.
Let D denote the clean distribution without class noise, and let \((x_1,y_1),(x_2,y_2),\ldots ,(x_n,y_n)\) denote n training samples from D with true binary labels \(y_i\) \((y_i =\pm 1,i=1,2,\ldots ,n)\). When there is class noise, \(\widetilde{D}\) denotes the observed distribution, which contains class noise, and the training samples from the noisy distribution \(\widetilde{D}\) are \((x_1,\widetilde{y}_1),(x_2,\widetilde{y}_2),\ldots ,(x_n,\widetilde{y}_n)\), where the label \(\widetilde{y}_i\) may differ from the true label \(y_i\). In this paper, we want to estimate the noise rate at the level of individual samples. We define the class noise rate as follows:
Definition 1
Let \((x_i,\widetilde{y}_i)\in \widetilde{D}\) be a sample from the noisy distribution. The class noise rate is the probability that the observed label differs from the true label of \(x_i\), denoted by \(P(\widetilde{y}_i \ne y_i \mid x_i)\).
According to this definition, the class noise rate is defined on individual data samples. In other words, we allow different samples to have different noise rates. Thus, the first issue is how to estimate the class noise rate for each training sample. The main challenge, however, is that a learner can only observe the noisy data \((x_i,\widetilde{y}_i) \in \widetilde{D}\), and there is no a priori knowledge about the class noise rate or the clean distribution D.
For the time being, let us assume that we have a reasonable estimation method. Then the second issue is how to measure the performance of a classifier. Generally speaking, the objective of a classifier is to minimize a given loss function on a given set of training data. However, in this paper, the observed training data contains class noise. The minimized loss function on noisy training data may not be the minimized loss function on the clean data. So the issue is whether we can use the estimation of the class noise rate for each individual sample to modify the loss function on the noisy distribution so that the modified loss function can also minimize the loss on the clean distribution.
In order to address the problem formally, we give some definitions below.
Definition 2
Let \(f:X\rightarrow \mathbb {R}\) be a real-valued decision function, defined as \(f(x)=P(y=1\mid x)-1/2\). The risk of f on the clean distribution under the 0–1 loss is given by \(R_D(f)=E_{(x,y)\sim D}(1_{\mathrm{sign}(f(x))\ne y})\).
Let l(f(x), y) denote a loss function with a real-valued prediction for the clean distribution, where \(y=\pm 1\) is the true label of x. We can then use the estimated class noise rate of each individual sample to modify the loss function on the noisy distribution with an observed label, denoted as \(\widetilde{l}(f(x),\widetilde{y})\). The modified loss function is marked with a tilde \(\widetilde{\cdot }\) because it is defined under the noisy distribution \(\widetilde{D}\). We can then define three related risks as follows:
Definition 3.1
The empirical \(\widetilde{l}\)-risk on the training data: \(\widehat{R}_{\widetilde{l}}(f) = \frac{1}{n} \sum _{i=1}^{n} \widetilde{l}(f(x_i),\widetilde{y}_i)\).
Definition 3.2
The expected \(\widetilde{l}\)-risk under the noisy distribution \(\widetilde{D}\): \(R_{\widetilde{l},\widetilde{D}}(f)=E_{(x,\widetilde{y}) \sim \widetilde{D}} (\widetilde{l}(f(x),\widetilde{y}))\).
Definition 3.3
The expected \(l\)-risk under the clean distribution D: \(R_{l,D}(f)=E_{(x,y) \sim D} (l(f(x),y))\).
(Here, the hat \(\widehat{\cdot }\) means that the marked object is an estimated result. The tilde \(\widetilde{\cdot }\) means that the noisy label influences the marked object.)
Here the empirical \(\widetilde{l}\)-risk \(\widehat{R}_{\widetilde{l}}(f)\) is the error of the classifier trained on the noisy distribution, and the expected \(l\)-risk \(R_{l,D}(f)\) is the expected error of a classifier trained on the clean distribution. The learning ability of a trained classifier is the difference between the two risks, \(\widehat{R}_{\widetilde{l}}(f) - R_{l,D}(f)\). It indicates the distance between our classifier trained on noisy data and the optimal classifier on the clean distribution. The objective of our algorithm is to make \(\widehat{R}_{\widetilde{l}}(f) - R_{l,D}(f)\) approach 0 as the size of the noisy training set grows.
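As a simple worked illustration of why this gap needs a correction, consider a special case that is not the per-sample setting of this paper: assume every label is flipped independently of x with the same probability \(p<1/2\), as in the random classification noise model [17]. Writing \(R_{\widetilde{D}}(f)\) for the 0–1 risk measured against the observed labels (a symbol introduced here only for this illustration), a prediction disagrees with the observed label either when it is wrong and the label was kept, or when it is right and the label was flipped:

\[
R_{\widetilde{D}}(f) = (1-p)\,R_D(f) + p\,\bigl(1-R_D(f)\bigr) = p + (1-2p)\,R_D(f).
\]

Even a classifier with \(R_D(f)=0\) therefore shows a risk of p on the observed labels, so the unmodified risk on noisy data cannot approach the clean risk without taking the noise rate into account.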
In this section, we adopt the class noise estimation method proposed in our previous work [40]. Due to the length limitation, we only introduce the basic idea and the theorem. The details of the mathematical proof can be found in [40]. The idea is based on the kernel density estimation method. In this method, the label of an individual sample should be similar to the labels of its most similar neighbors even if there is class noise in the data. Therefore, we first present a class noise model based on the kernel density estimation method. We mainly use the kNN graph as the kernel density estimation method, and we also introduce the formula based on the Parzen window (known as the e-graph). Since the kernel density estimation method is sensitive to class noise, it can overrate the class noise. To avoid overrating noise, we introduce a loose distribution called the Sum of Random Noise (SRN) in this section. As will be seen in the evaluation, our method is more reasonable theoretically and shows higher performance in the experiments when compared to the existing estimation methods [10, 12, 13].
Definition 4
Now, we can define the class noise rate for \((x_i,\widetilde{y}_i)\) as the probability that \(S_i\) does not follow SRN. Because \(I_{ij}=1\) indicates that the sample \(x_i\) has a different label from its nearest neighbor and the similarity metric is between 0 and 1, a larger \(S_i\) means a higher probability that \(x_i\) has a noisy label. So, we only consider the upper quantile of SRN here. The class noise rate is given in Definition 5.
Definition 5
For any individual sample \((x_i,\widetilde{y}_i) \in \widetilde{D}\) with sum of noisy similarity \(S_i\), and with the probability under SRN denoted by \(P_{SRN}\), the class noise rate of \((x_i,\widetilde{y}_i)\) is \(1-P_{SRN}(s \ge S_i)\).
The formal definition above is easier to understand with the following explanation. If the principle of entropy maximum (PEM) reveals a best guessed label, the upper quantile of SRN is the probability of the individual sample having a “worse” label than a guessed one. So Definition 5 can be restated descriptively: the class noise rate of a sample is the probability of the observed label being worse than a guessed label.
Definition 5 defines the class noise rate of the sample \((x_i,\widetilde{y}_i)\) as the probability of \(S_i\) not following SRN. We can explain the definition in a different way. If the label of a training sample is totally random under PEM, the label can be considered a guess. We can then obtain the distribution of \(S_i\) from a “guessing result”. If the individual sample \((x_i,\widetilde{y}_i)\) contains class noise, the corresponding \(S_i\) should be larger than the “guessing result”. Thus, we can say that \(S_i\) probably does not follow SRN. The formal theorem is given below:
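Definition 5 can be illustrated with a short Python sketch that evaluates the upper tail of a weighted sum of random signs exactly by enumeration. Several details are assumptions made here because Definition 4 is not reproduced in this text: \(S_i\) is taken to be \(\sum_j w_{ij} I_{ij}\), the indicator \(I_{ij}\) is taken to be +1 when the neighbor's label differs and -1 otherwise, and SRN is modeled as the same weighted sum with independent Rademacher signs. The function names are hypothetical.

```python
from itertools import product

def srn_upper_tail(weights, s_obs):
    """Exact P(sum_j w_j * r_j >= s_obs) for independent Rademacher signs r_j."""
    k = len(weights)
    hits = sum(
        1
        for signs in product((1, -1), repeat=k)
        if sum(w * r for w, r in zip(weights, signs)) >= s_obs - 1e-12
    )
    return hits / 2**k

def class_noise_rate(weights, disagree):
    """Definition 5 under the assumptions above: 1 - P_SRN(s >= S_i)."""
    s_i = sum(w * i for w, i in zip(weights, disagree))
    return 1.0 - srn_upper_tail(weights, s_i)

w = [0.9, 0.8, 0.7, 0.6, 0.5]   # similarities to the k = 5 nearest neighbors
print(class_noise_rate(w, [1, 1, 1, 1, 1]))       # every neighbor disagrees
print(class_noise_rate(w, [-1, -1, -1, -1, -1]))  # every neighbor agrees
```

Enumeration is only practical for small k. The lemmas that follow replace it with a closed-form tail bound.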
Theorem 1
Lemma 1
Here \(\parallel \cdot \parallel _1\) and \(\parallel \cdot \parallel _2\) are the \(\mathcal {L}_1\) and \(\mathcal {L}_2\) norms, and \(w'_i + w''_i = w_i\). The formula of \(K_{1,2}(w_i,t)\) is well known as the K-method of real interpolation for Banach spaces [41]. The proof of Lemma 1 was given in detail by [14]. According to Lemma 1, we can easily obtain Lemma 2 from a suboptimal solution of \(K_{1,2}(w_i,t)\).
Lemma 2
The proof of Lemma 2 is given in our previous work [40].
According to Definition 4, the estimated class noise rate \(P_c(x_i)\) for \((x_i,\widetilde{y}_i)\in \widetilde{D}\) is the probability of \((x_i,\widetilde{y}_i)\) not following SRN. According to Lemma 2, the probability of \((x_i,\widetilde{y}_i)\) coming from SRN is less than \(\exp \left( -\frac{ \left( \sum _{j=1}^{K} {w_{ij}I_{ij}} \right) ^4 }{2 \left( \parallel w_i \parallel _1 \parallel w_i \parallel _2 \right) ^2} \right)\) when \(t>0\).
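The bound above leads to a closed-form estimate that can be sketched as follows. The sketch assumes that the exponent carries a negative sign (as a tail bound requires), that \(I_{ij}\in\{+1,-1\}\), and that the trivial bound 1 is used when the weighted sum is not positive; these are assumptions for illustration, and the function names are hypothetical.

```python
import math

def srn_tail_bound(weights, s_obs):
    """Bound exp(-S^4 / (2 * (||w||_1 * ||w||_2)^2)) on P_SRN(s >= S).

    Assumption: when S <= 0 the bound is taken to be the trivial value 1.
    """
    if s_obs <= 0:
        return 1.0
    l1 = sum(abs(w) for w in weights)
    l2 = math.sqrt(sum(w * w for w in weights))
    return math.exp(-(s_obs**4) / (2.0 * (l1 * l2) ** 2))

def estimated_noise_rate(weights, disagree):
    """Estimated class noise rate P_c(x_i) = 1 - bound, with S_i = sum_j w_ij * I_ij."""
    s_i = sum(w * i for w, i in zip(weights, disagree))
    return 1.0 - srn_tail_bound(weights, s_i)

w = [0.9, 0.8, 0.7, 0.6, 0.5]   # similarities to the k = 5 nearest neighbors
for pattern in ([1, 1, 1, 1, 1], [1, 1, 1, 1, -1], [-1, -1, -1, -1, -1]):
    print(pattern, round(estimated_noise_rate(w, pattern), 3))
```

The estimate grows as more of the similar neighbors carry a different label, and it is 0 when the neighbors agree with the observed label.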
In this section, we introduce our learning algorithm based on the estimated class noise rate. We propose to modify a given surrogate loss function using the estimated class noise rate. The modified surrogate loss function is the optimization objective of the classifier. Two training algorithms are used to optimize the loss function on the training data. One is a perceptron-based method, which follows the learning with noise strategy and aims to reduce the impact of noisy data. The other is a sampling-based AdaBoost method, which follows the class noise elimination strategy and focuses on selecting high-quality training data rather than identifying low-quality data.
5.1 Class noise estimation based on loss function
Without loss of generality, we do not specify any particular loss function on the observed distribution. Theoretically, any loss function can be used in Formula (6) to obtain a modified surrogate loss function. In Formula (6), the numerator consists of two parts. The first part is the original loss function with the observed labels, weighted by their label correctness probabilities. The second is a penalty for the loss function with an inverted label (i.e., the case in which the observed label is incorrect), weighted by the class noise rate. The denominator is based on the average class noise rate to ensure that the expectation of the loss on noisy training data approximates the expectation of the loss on clean data. This is an updated version of the original loss function on noisy data.
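The structure described for Formula (6) can be sketched in Python as follows. The exact formula is not reproduced in this text, so the form below is an assumption: the inverted-label term is subtracted and the denominator is \(1-2\bar{p}\), where \(\bar{p}\) is the average noise rate, which is the construction commonly used to make a noisy-label loss unbiased. The logistic loss and the names `modified_loss`, `p_noise` and `p_mean` are illustrative choices.

```python
import math

def logistic_loss(score, label):
    """Standard logistic surrogate loss for a label in {+1, -1}."""
    return math.log1p(math.exp(-label * score))

def modified_loss(score, observed, p_noise, p_mean, loss=logistic_loss):
    """Assumed form: ((1 - p) * l(f, y~) - p * l(f, -y~)) / (1 - 2 * p_mean)."""
    keep = (1.0 - p_noise) * loss(score, observed)   # observed label is correct
    flip = p_noise * loss(score, -observed)          # observed label is inverted
    return (keep - flip) / (1.0 - 2.0 * p_mean)

# Sanity check under a uniform noise rate: averaging the modified loss over the
# two possible observed labels recovers the clean loss.
p, score, y = 0.2, 0.7, 1
expected = (1 - p) * modified_loss(score, y, p, p) + p * modified_loss(score, -y, p, p)
print(abs(expected - logistic_loss(score, y)) < 1e-12)
```

The check holds only for a uniform, correctly estimated noise rate. With per-sample estimates the match is approximate, which is consistent with the dependence on estimation quality discussed in this section.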

Will \(\widehat{R}_{\widetilde{l}}(f)\) converge to \(R_{\widetilde{l},\widetilde{D}}(f)\) under the noisy distribution when n grows? If the answer is “Yes”, it means that we could train a stable classifier on the noisy training data.

Will \(R_{\widetilde{l},\widetilde{D}}(f)\) converge to \(R_{l,D}(f)\) under the clean distribution? If the answer is “Yes”, it means that the trained classifier by noisy training data is also the optimal classifier on the clean distribution.
In other words, if we have a perfectly correct estimate of the class noise rate, the empirical \(\widetilde{l}\)-risk on the training data \(\widehat{R}_{\widetilde{l}}(f)\) will converge to the risk \(R_{\widetilde{l},\widetilde{D}}(f)\) of the loss function on the noisy data when the size of the training data is sufficiently large.
Because the second item is bounded, the risk of our estimation is determined by the first item, which is related to the estimation result. This can be interpreted as follows: the performance of the classifier trained on noisy data is determined by the estimation result as the size of the training set grows.
5.2 Learning in class noise
According to Sect. 5.1, the basic idea is to use the surrogate loss function defined in Formula (6) to train a classifier on the noisy training data. In a previous work [40], a simple Perceptron based online learning method is used with noisy class data. However, the theoretical analysis has revealed that the result of the surrogate loss function would be affected by the estimated class noise rate. It is shown in Formula (8).
According to the theoretical analysis, direct use of a perceptron or any other linear optimizer as the basic classifier (as proposed in our previous work [40]) faces one problem: when the estimation is unreliable, the performance of the surrogate loss function may be limited because the first item in \(R_{l,D}(\widehat{f}) - R_{l,D}(\widehat{f}_{p})\) may not be sufficiently small. Since the estimation method proposed in Sect. 4 is based on the kNN graph, the estimate depends on the parameter k and the similarity measure. An incorrect parameter of the similarity measure may misguide the estimation result, and the error in the estimate will affect the learning algorithm according to our analysis.
Hence, we propose an AdaBoost-based method built on the class noise elimination strategy to reduce the impact of estimation error. The reason is that AdaBoost makes use of a bag of weak classifiers to achieve better classification by enhancing the “misclassified” samples in the learning process. However, for noisy data, a sample “misclassified” by a base classifier may actually indicate good performance because the observed label is incorrect. These samples should be given a lower weight since they have been learned well. If we give these samples a higher weight, as traditional AdaBoost does, performance may worsen because the algorithm enhances the samples with noisy labels and overfits on them.
In our method, we solve the problem by adjusting the weight. The basic idea is to use the surrogate loss function given in Formula (6) as the objective function. We then use the AdaBoost.M1 [36] method to achieve an optimal result on the noisy training data. In the learning step, we use the surrogate loss function to avoid overfitting on the noisy data without the need to eliminate any training sample. In the optimization, a sample with a high class noise probability is given a low weight when “misclassified” by the base classifier, so that even if the noisy sample is “misclassified” in training, it will not obtain a higher weight in the next iteration. With this strategy, our algorithm can adaptively handle class noise in training instead of overfitting on noisy data. The pseudo code is shown in Algorithm 1. The weight of each sample is initialized to the probability that its label is correct. This weight is then used to sample the training data. That is, we use the samples that seem “correct” as the training data for the base classifier. Based on the error on the whole dataset, we update the weights so that a sample with a high correct-label probability that is misclassified receives a high weight and is more likely to be added to the training data in the next round.
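The weight handling described above can be contrasted with the classic AdaBoost.M1 update in a small Python sketch. This is not Algorithm 1 itself: the rule in `noise_aware_update`, which scales a “misclassified” sample by the probability that its observed label is correct, is an assumed simplification for illustration, and all function names are hypothetical.

```python
def initial_weights(p_noise):
    """Start from the probability that each observed label is correct, normalized."""
    raw = [1.0 - p for p in p_noise]
    total = sum(raw)
    return [w / total for w in raw]

def adaboost_m1_update(weights, correct, beta):
    """Classic AdaBoost.M1: shrink correctly classified samples by beta, then normalize."""
    new = [w * (beta if c else 1.0) for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

def noise_aware_update(weights, correct, beta, p_noise):
    """Assumed variant: a 'misclassified' sample is scaled by (1 - p_i), so a
    probably mislabeled sample is not enhanced in the next round."""
    new = [w * (beta if c else 1.0 - p) for w, c, p in zip(weights, correct, p_noise)]
    z = sum(new)
    return [w / z for w in new]

p_noise = [0.05, 0.05, 0.05, 0.90]      # the last sample is probably mislabeled
correct = [True, True, False, False]    # base classifier vs. observed labels
uniform = [0.25] * 4
beta = 0.25                             # err / (1 - err) for a weighted error of 0.2

print([round(w, 3) for w in adaboost_m1_update(uniform, correct, beta)])
print([round(w, 3) for w in noise_aware_update(uniform, correct, beta, p_noise)])
print([round(w, 3) for w in initial_weights(p_noise)])
```

Under the classic update both misclassified samples gain the same weight, whereas the assumed noise-aware rule raises only the one whose label is probably correct. The resulting weights could then drive the sampling step, for example with `random.choices(range(n), weights=weights, k=n)`.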
This means that the classifier can be optimized on the noisy distribution and achieves an optimal result on the clean distribution.
Our experiments evaluate the performance of the noise estimation method and show its usefulness in improving learning performance. The evaluations are based on experiments with varying class-noise rates and training-set sizes, compared to other state-of-the-art systems. We use twelve public datasets for binary classification with different class-noise rates: (1) the LEU [43] set of cancer data, whose dimensionality we reduce to 20 by PCA; (2) the Splice dataset for DNA sequence splice-junction classification; (3) the UCI Adult dataset collection containing seven subsets (referred to as UCI.a1a to UCI.a7a in this paper) of independent training and testing data [44]; (4) the DBWorld emails dataset in English (DB for short), which consists of 64 emails manually collected from the DBWorld mailing list and classified into two classes, “announces of conferences” and “everything else”; (5) the Farm Ads dataset in English (FADS for short), collected from text ads found on twelve websites that deal with various farm-animal-related topics, with binary labels based on whether the content owner approves of the ad; (6) the Twitter Dataset for Arabic Sentiment Analysis (TDA for short), whose class labels are opinion polarities; (7) the Product Reviews from Amazon in three categories, Book, DVD and Music (PRA for short), with class labels being the opinion polarities; (8) Banknote, the banknote authentication dataset, whose data were extracted from images of genuine and forged banknote-like specimens, digitized with an industrial camera usually used for print inspection; (9) Haberman, which contains cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer [45]; (10) ILPD, which contains 416 liver patient records and 167 non-liver patient records [46]; (11) QSAR, which contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into two classes (ready and not ready biodegradable) [47]; (12) Spect, which contains cardiac Single Photon Emission Computed Tomography (SPECT) images, with each patient classified into one of two categories, normal and abnormal [48].
Overview of datasets
Dataset   Type       Dimension  Training size (+/− fraction)  Testing size (+/− fraction)
LEU       Cancer     7129 (20)  38 (0.71/0.29)                34 (0.59/0.41)
Splice    DNA        60         1000 (0.52/0.48)              2175 (0.52/0.48)
UCI.a1a   Adult      123        1605 (0.37/0.63)              30956 (0.24/0.76)
UCI.a2a   Adult      123        2265 (0.25/0.75)              30296 (0.24/0.76)
UCI.a3a   Adult      123        3185 (0.24/0.76)              29376 (0.24/0.76)
UCI.a4a   Adult      123        4781 (0.25/0.75)              27780 (0.24/0.76)
UCI.a5a   Adult      123        6414 (0.24/0.76)              26147 (0.24/0.76)
UCI.a6a   Adult      123        11220 (0.24/0.76)             21341 (0.24/0.76)
UCI.a7a   Adult      123        16100 (0.24/0.76)             16461 (0.24/0.76)
DB        English    4698       64 (0.45/0.55)                –
FADS      English    54877      4143 (0.51/0.49)              –
TDA       Arabic     7415       2000 (0.50/0.50)              –
PRABook   Chinese    74643      4000 (0.50/0.50)              –
PRADVD    Chinese    74638      4000 (0.50/0.50)              –
PRAMusic  Chinese    74638      4000 (0.50/0.50)              –
Banknote  Image      5          1372 (0.56/0.44)              –
Haberman  Medical    3          306 (0.26/0.74)               –
ILPD      Medical    10         583 (0.29/0.71)               –
QSAR      Chemicals  41         1055 (0.34/0.66)              –
Spect     Medical    22         80 (0.50/0.50)                187 (0.92/0.08)
To introduce class noise into the training set, we stochastically invert the binary labels of training samples with probabilities of 10%, 20%, and 30%. For datasets (1)–(3), we train the binary classification algorithm on the inverted noisy data. The algorithm is tuned on a development set before being used on the testing set. We use \(\hbox {SVM}^{\mathrm{light}}\) as the basic classifier for this experiment. We choose two kinds of similarity measures: the cosine similarity for text data, and a Euclidean-distance-based similarity for other data. The parameter k in the kNN graph is 5, set experimentally. For datasets (4)–(11), the experimental result is from a five-fold cross-validation as they do not have separate testing data.
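The graph construction in this setup can be sketched with the Python standard library. The cosine similarity is standard, but the mapping from Euclidean distance to a similarity in (0, 1], taken here as 1 / (1 + d), is an assumption because the exact formula is not given in the text. The names `knn_graph` and `euclidean_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_similarity(a, b):
    """Assumed mapping of Euclidean distance into (0, 1]: 1 / (1 + d)."""
    return 1.0 / (1.0 + math.dist(a, b))

def knn_graph(points, k, similarity):
    """For each point, list its k most similar other points as (index, similarity)."""
    graph = []
    for i, p in enumerate(points):
        sims = [(similarity(p, q), j) for j, q in enumerate(points) if j != i]
        sims.sort(reverse=True)
        graph.append([(j, s) for s, j in sims[:k]])
    return graph

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for i, neighbors in enumerate(knn_graph(points, k=1, similarity=euclidean_similarity)):
    print(i, neighbors)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but would need an index structure for the larger datasets in the table.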
LiC [40]: a perceptron-based online learning method from our previous work;
NHERD [49]: a widely used open-source robust method;
\({\varvec{\ell }}_{log}\) [8]: a log-loss method using the same loss function as LiC;
CEWS [10]: a method using cut edge weight statistics.
Furthermore, to examine whether an ensemble is necessary, we also compare our methods with:
the original SVMs;
an ensemble SVMs without class noise handling;
an e-graph based method (two samples are linked if the similarity between them is larger than e; we test three parameter values, \(e = 0.2\), 0.1, and 0.05);
Laplace-distribution-based CEWS.
6.1.1 Comparison with other methods
Table 2 shows the performance of our proposed algorithms compared with the other state-of-the-art methods on the 12 datasets. The results show that our proposed algorithms outperform the other algorithms in most cases. Both the macro average and the micro average, weighted over the size of the different class labels, clearly show that our algorithms outperform all other methods. Note that \({\varvec{\ell }}_{log}\), CEWS, and Laplace are provided with the class-noise rate, so the comparison with our method is not completely fair. Even with the class-noise rate provided, \({\varvec{\ell }}_{log}\), CEWS, and Laplace outperform LiC and AdaBC only at the relatively low class-noise levels (10% and 20%), and only on two or three datasets. This is because these methods require all samples to have the same class-noise rate for the probability weighting.
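The two summary rows can be read as follows. This is a hedged sketch: it assumes the macro average is the unweighted mean of per-dataset accuracies and that the micro average weights each dataset by its number of test samples. The exact weighting used for Table 2 is not spelled out in the text, and the helper names are hypothetical. The two accuracies are the AdaBC results at 10% noise for LEU and Splice, with test sizes taken from the dataset overview.

```python
def macro_average(accuracies):
    """Unweighted mean: every dataset counts equally."""
    return sum(accuracies) / len(accuracies)


def micro_average(accuracies, sizes):
    """Size-weighted mean: larger datasets contribute more (assumed weighting)."""
    total = sum(sizes)
    return sum(a * n for a, n in zip(accuracies, sizes)) / total


accs = [90.91, 85.56]   # AdaBC at 10% noise on LEU and Splice
sizes = [34, 2175]      # corresponding test-set sizes
print(round(macro_average(accs), 2), round(micro_average(accs, sizes), 2))
```

Under this assumption the micro average is pulled toward the accuracy on the larger dataset, which is why the two rows can rank systems differently.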
For tiny training sets, such as LEU, which has only 38 training samples, and DB, which has only 64 training samples, the size effect is very prominent. Methods based on eliminating noisy data do not work well on these datasets because removing noisy data also removes useful training data. This is particularly true when the class-noise rate increases to 20% and 30%, where the high percentage of noise has a large effect on the training data. Most methods perform well at the 10% class-noise rate but degrade considerably at the 20% and 30% rates. In contrast, LiC performs well and shows a significant advantage on these two datasets at all three class-noise rates. This is because, when the training set is small, its size matters more than its quality, and LiC does not remove any training data. It can therefore make full use of the data for training. Owing to its good class-noise estimation, LiC also performs better than \({\varvec{\ell }}_{log}\), even though \({\varvec{\ell }}_{log}\) does not remove any training data either.
When the training set grows, as in Splice with 1000 samples or TDA with 2000 samples, the quality of the training samples becomes much more important. On these relatively large datasets, AdaBC performs better than the other methods. As the training set becomes larger still, the advantages of our methods become even more obvious; examples are UCI.Adult, with up to 16,100 samples, and the Amazon review text, with 4000 samples. At the 30% noise level, AdaBC can achieve up to 5% higher accuracy than the other methods, as the micro average of accuracy at 30% class noise shows.
We also compare the performance of SVMs and boosting of SVMs on noisy data with the other methods. When the class-noise rate is low, the class-noise handling approaches do not show a significant performance gain over SVMs. But as class noise increases, our proposed method and the other class-noise handling approaches start to take effect and show marked improvements over SVMs. The authors of [22] claim that IMTD is cheap and easy to implement. However, it is also likely to remove a substantial amount of data. CEWS has a similar problem. When the class noise is 30%, which is a high level, our proposed method is better than the other methods.
Note that the e-graph based method and the Laplace-based CEWS are also compared with our proposed methods. Since an e-graph does not need the ranking step, it is more convenient to build than a kNN graph. However, the experimental results show that the value of e should differ between tasks, so choosing a reasonable e is an important issue. This issue is even more serious in our experiments because of the diversity of the datasets. Even after tuning e for each task, we did not obtain better performance than LiC or AdaBC. This again shows that the kNN graph is a better choice, because a suitable k is much easier to find. The Laplace-based CEWS uses the Laplace distribution and the Bernoulli distribution, both of which are similar to the Gaussian distribution. Since CEWS is a discretized result of these distributions, the performance of this method is quite similar to that of the original CEWS.
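The difference between the two neighborhood graphs can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `knn_graph` and `e_graph`, the adjacency-list representation, and the toy points are assumptions, and cosine similarity is used as in the text-data setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_graph(points, k, sim=cosine):
    """Link each sample to its k most similar other samples (needs ranking)."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        ranked = sorted(others, key=lambda j: sim(p, points[j]), reverse=True)
        graph[i] = ranked[:k]
    return graph


def e_graph(points, e, sim=cosine):
    """Link two samples whenever their similarity exceeds e (no ranking)."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if sim(points[i], points[j]) > e:
                graph[i].append(j)
                graph[j].append(i)
    return graph


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(knn_graph(points, k=1))
print(e_graph(points, e=0.5))
print(e_graph(points, e=0.2))
```

In the kNN graph every sample has exactly k neighbors, whereas in the e-graph the neighborhood size depends on both e and the local density of the data, which is one way to see why a single e rarely suits every dataset.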
Performance of the 10 systems
Data  Noise (%)  AdaBC  NHERD  \({\varvec{\ell }}_{log}^{*}\)  LiC  IMTD  \(\hbox {CEWS}^{*}\)  Egraph (0.2)  Egraph (0.1)  Egraph (0.05)  \(\hbox {Laplace}^{*}\)  SVMs  Boosting 

LEU  10  90.91  81.62  90.91  87.88  78.79  73.53  58.82  73.53  76.47  73.53  73.53  73.53 
20  87.88  76.62  84.85  90.91  67.65  73.53  58.82  73.53  73.53  67.65  67.65  67.65  
30  76.62  58.68  57.58  78.79  58.82  58.82  58.82  58.82  67.64  58.82  58.82  58.82  
Splice  10  85.56  72.14  83.99  84.67  83.42  83.15  85.28  85.05  85.37  83.07  83.15  83.15 
20  83.44  66.63  83.02  83.05  82.91  82.39  82.76  82.80  82.94  81.99  82.34  82.34  
30  81.71  61.14  78.31  79.53  80.83  73.54  79.54  79.53  79.54  73.65  75.68  78.34  
UCI.a1a  10  83.68  81.23  83.33  83.40  83.40  82.86  82.93  83.17  83.48  83.01  82.87  83.20 
20  83.49  79.30  81.85  81.40  79.79  82.95  83.34  82.95  83.27  82.99  80.18  80.12  
30  81.46  74.78  77.10  77.30  78.39  81.31  79.11  78.30  78.30  81.46  79.65  78.30  
UCI.a2a  10  84.40  82.25  83.92  84.24  83.79  83.71  84.19  84.10  84.11  84.00  83.85  83.79 
20  83.53  79.34  82.72  83.16  82.32  81.54  81.72  82.63  82.99  80.96  80.86  82.62  
30  82.28  74.13  76.22  80.66  76.58  81.05  76.50  77.19  77.19  81.09  77.79  77.19  
UCI.a3a  10  84.06  82.79  83.93  84.17  83.96  83.07  83.91  83.97  83.83  83.22  83.70  83.60 
20  83.45  81.14  82.23  83.49  78.02  82.83  81.18  82.19  82.17  82.59  78.52  78.39  
30  82.28  77.52  80.23  81.02  79.28  81.35  76.91  78.30  78.30  81.53  77.25  78.31  
UCI.a4a  10  84.19  83.48  84.05  84.28  83.95  83.37  84.33  84.31  84.24  83.02  84.13  83.95 
20  84.17  82.34  82.90  84.10  82.70  83.51  83.11  83.96  84.01  83.33  82.64  82.14  
30  82.62  80.05  81.54  81.81  78.09  82.31  78.66  79.06  78.06  82.12  78.31  78.06  
UCI.a5a  10  84.60  83.57  83.78  84.45  84.03  83.87  84.47  84.23  84.22  83.64  84.40  84.12 
20  84.04  82.88  83.06  84.20  83.55  83.23  83.62  84.05  83.68  83.12  83.07  83.46  
30  82.12  80.42  74.71  79.20  80.30  82.22  79.45  79.52  78.23  82.27  79.49  79.52  
UCI.a6a  10  84.64  83.97  83.20  84.77  83.83  84.44  84.50  84.56  84.61  84.05  84.46  83.75 
20  84.04  82.45  80.37  83.38  78.00  81.48  80.21  78.00  78.65  81.64  83.96  78.00  
30  81.60  80.22  77.83  81.83  78.00  78.00  78.36  78.00  77.99  78.00  77.99  78.00  
UCI.a7a  10  84.96  84.58  82.93  85.65  84.45  83.14  84.66  84.75  84.81  82.99  84.45  84.64 
20  84.53  82.87  80.43  84.27  78.17  81.65  83.90  78.18  78.18  82.01  78.18  78.18  
30  83.26  80.28  78.67  83.49  79.33  80.86  79.33  80.33  79.63  81.13  79.33  77.35  
DB  10  91.66  91.66  87.98  91.66  66.67  73.53  91.66  87.98  73.53  74.01  91.66  91.66 
20  58.33  75.00  84.50  91.66  58.33  67.65  58.33  58.33  58.33  67.65  58.33  58.33  
30  58.33  58.33  80.52  81.82  58.33  58.33  58.33  58.33  58.33  58.33  58.33  58.33  
FADS  10  89.55  86.44  90.91  82.81  75.15  60.74  86.74  85.66  86.97  61.06  89.12  90.04 
20  85.23  81.27  81.82  84.38  73.47  53.78  82.81  81.43  84.36  53.66  83.61  84.99  
30  83.31  78.42  63.64  85.58  71.07  49.58  71.07  49.58  49.58  49.68  79.95  80.82  
TDA  10  83.61  82.92  82.35  84.06  71.88  47.68  81.25  80.97  79.66  48.06  82.56  82.13 
20  79.46  76.81  77.70  79.66  65.28  45.97  76.53  75.45  62.33  45.68  75.41  73.82  
30  79.64  68.70  76.23  77.94  58.92  47.19  71.22  69.91  48.56  47.99  73.22  73.56  
PRABook  10  79.55  75.65  78.02  77.72  63.38  64.36  78.43  78.26  78.00  65.24  76.91  77.36 
20  78.06  74.31  76.29  75.55  61.65  56.97  76.59  77.94  74.51  57.91  77.56  77.25  
30  77.93  69.52  76.97  77.03  59.43  70.04  77.03  76.21  73.22  73.33  74.31  72.51  
PRADVD  10  78.93  79.21  79.55  80.42  69.24  80.20  79.21  79.06  74.69  79.99  78.03  78.09 
20  78.42  74.33  79.30  79.43  73.10  69.86  77.98  77.98  71.29  70.21  74.66  74.94  
30  77.24  70.16  77.55  78.96  62.39  71.98  78.00  76.92  69.55  71.78  74.31  72.76  
PRAMusic  10  80.65  71.39  79.35  78.83  72.16  52.06  79.65  80.21  74.69  52.06  80.25  79.33 
20  79.36  77.27  76.25  79.84  68.30  52.45  78.99  79.22  75.44  52.45  74.69  74.76  
30  74.88  69.78  72.13  75.23  71.26  70.75  73.21  75.46  71.06  70.75  73.99  72.96  
Banknote  10  98.29  93.25  90.07  89.38  94.54  97.95  97.95  98.29  98.29  97.95  98.29  98.29 
20  98.29  91.44  87.32  91.43  93.86  97.61  97.26  97.61  97.26  97.61  97.61  97.61  
30  97.61  83.21  76.38  80.13  91.47  97.61  92.74  97.61  97.61  97.61  97.61  97.61  
Haberman  10  79.81  70.29  72.46  73.42  72.52  73.91  71.43  71.43  71.43  73.91  71.43  71.43 
20  75.44  69.71  73.91  74.21  70.96  72.46  69.71  69.71  66.53  72.46  71.43  71.43  
30  77.69  66.53  75.36  75.96  70.11  71.43  66.53  66.53  66.53  71.43  71.43  71.43  
ILPD  10  77.26  53.77  64.35  73.26  74.25  74.25  50.98  50.98  50.98  74.69  50.98  72.55 
20  75.25  49.27  74.25  74.25  72.55  70.13  50.33  50.33  50.33  69.96  50.33  72.55  
30  76.24  48.51  62.37  75.24  66.45  69.14  46.21  50.33  50.33  70.04  46.21  72.55  
QSAR  10  74.89  72.67  67.26  67.26  71.36  72.54  71.81  71.81  71.81  72.54  70.40  71.81 
20  71.36  69.35  67.26  67.26  70.91  69.51  66.96  66.96  66.96  69.51  66.96  66.96  
30  66.96  66.96  66.96  66.96  66.96  66.96  66.96  66.96  66.96  66.96  66.96  66.96  
Spect  10  94.86  73.25  82.79  90.32  62.30  70.59  74.33  74.33  74.33  70.36  74.86  74.33 
20  85.93  80.98  83.87  94.08  56.68  77.01  70.05  71.42  72.33  76.44  77.01  70.05  
30  81.49  69.69  71.51  78.49  50.27  59.36  58.82  69.51  66.35  60.21  59.89  58.82  
Macro average  10  84.68  79.18  81.27  82.36  77.07  75.55  80.93  80.69  79.42  75.62  80.82  81.96 
20  82.73  76.66  79.95  82.04  74.24  73.31  77.12  76.90  75.55  73.27  76.60  77.15  
30  79.41  71.49  74.96  78.85  71.45  72.26  73.05  73.03  70.80  72.60  73.77  74.91  
Micro average  10  84.23  82.38  83.40  83.88  82.61  81.61  83.73  83.72  83.57  81.59  83.58  83.56 
20  83.55  80.62  81.69  82.96  79.52  80.24  81.99  81.70  81.48  80.16  80.44  80.11  
30  81.87  77.00  77.45  80.27  77.32  79.46  77.82  77.72  76.87  79.59  78.15  77.86 
In this section, we answer the remaining questions: (1) “Can we use kNN directly?”; (2) “If the SRN distribution seems to be similar to the Gaussian distribution, what is the difference between them?”; (3) “Since the performance of AdaBC looks similar to that of LiC, is it necessary to use AdaBC?”; and (4) “Can we use the estimation method on clean data?”.
6.2.1 kNN, SRN and Gaussian
From the experiment above, we can see that the estimation method proposed in this paper is more flexible. An individual sample is considered noisy with high probability only when most of its neighbors have different labels. If the training data has a high probability of label noise, the number of disagreements should be high. In addition, compared with kNN and the Gaussian distribution, SRN gives a more reasonable estimation according to our analysis above. That is why our method performs better than CEWS (a Gaussian-distribution-based method) in the experiment. We do not compare SRN with the original kNN-based method because the performance of kNN is similar to that of CEWS, and CEWS is theoretically the sounder method.
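The neighbor-disagreement idea can be illustrated with a plain disagreement ratio over a neighborhood graph. Note that this is a cruder stand-in, not the SRN-based estimate itself, whose formula is not reproduced in this section; the function names, the 0.5 threshold, and the toy graph are assumptions for illustration.

```python
def disagreement_ratio(labels, neighbors):
    """For each sample, the fraction of its neighbors carrying a different label."""
    ratios = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            ratios.append(0.0)  # no neighbors, no evidence of noise
            continue
        differing = sum(1 for j in nbrs if labels[j] != labels[i])
        ratios.append(differing / len(nbrs))
    return ratios


def likely_noisy(labels, neighbors, threshold=0.5):
    """Indices whose neighbors mostly disagree with their label."""
    ratios = disagreement_ratio(labels, neighbors)
    return [i for i, r in enumerate(ratios) if r > threshold]


# Two tight clusters {0, 1, 2} and {3, 4, 5}; sample 2 carries the wrong label.
neighbors = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
labels = [1, 1, -1, -1, -1, -1]
print(disagreement_ratio(labels, neighbors))
print(likely_noisy(labels, neighbors))
```

Only the sample whose neighbors mostly disagree with it is flagged; samples 0 and 1 each see one disagreeing neighbor out of two and stay below the strict majority threshold.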
6.2.2 LiC vs. AdaBC on different parameters
In the experiments of Sect. 4. C, LiC achieves better performance than AdaBC on most datasets. Why, then, do we need the AdaBoost-based method, which seems to be more complicated? Figure 1 shows a set of performance evaluations on UCI.Adult using different values of k in the kNN graph under 10%, 20%, and 30% class noise, respectively. Figure 1 shows very clearly that the performance of LiC degrades sharply as k increases. This is the gap between the noisy training data and the clean training data revealed by the theoretical analysis: the performance of LiC depends on the estimation result. In fact, it is a well-known conclusion that the best k in kNN should be no larger than 5 (see most textbooks on AI or machine learning, such as [52]); otherwise the precision of the method is limited, which Fig. 1 confirms again. When k is no larger than 5, LiC achieves its best performance among all k values.
In Fig. 1d–f, the gap between the two methods becomes larger as k increases, which also supports our analysis. AdaBC seems to be quite robust to the value of k. This is because even if the estimation is wrong, it still reveals some truth about the clean distribution, and the adaptive sampling method ensembles a series of weak classifiers into a strong classifier. The error bound in Formula (9) is also affected by the class-noise rate: a low class-noise rate does less harm to the performance of the base classifier. That is why AdaBC is robust to the value of k.
6.2.3 Stability of LiC and AdaBC
Figure 2a–c shows the performance of LiC at noise levels of 10%, 20%, and 30% on the UCI.Adult data, respectively. Since LiC is a perceptron-based method, we also examine the effect of different learning rates. We use three learning rates, 0.02, 0.01, and 0.005, in this experiment. The different learning rates achieve similar top performance. A smaller learning rate picks up performance more slowly, but it outperforms the higher learning rates after iteration 5. The performance gain from a smaller learning rate is much more obvious when the noise level increases. In fact, for the 30% class-noise case, there is a 4% gap between different learning rates.
Figure 2d–f shows the boxplots for the respective noise levels for AdaBC. Since AdaBC is an AdaBoost-based method, the experiment on AdaBC focuses on the mean and variance of accuracy. In Fig. 2, the top performances of the two methods are similar. But AdaBC gives a more stable performance because the mean accuracy stays at a similar level as the iteration number increases. The variance is also small in the figure. In contrast, the performance of LiC peaks at a certain iteration number and then degrades, because the accumulated noise takes its toll on performance.
Now we can answer the question posed at the beginning of this section. Even though the performances of the two methods are similar, we still have reason to use AdaBC, because LiC needs the optimal iteration number and learning rate to achieve top performance. LiC is also less stable than AdaBC: if the parameters are not suitable for a dataset, the performance degradation is obvious. Compared with LiC, AdaBC gives stable and robust results, and that is why we propose AdaBC, although AdaBC may not be suited to small training datasets.
6.2.4 Estimating the noise on clean data
Performance on clean data
Data  LiC  AdaBC  SVMs  Data  LiC  AdaBC  SVMs 

LEU  57.58  73.53  73.53  FADS  78.42  89.55  90.91 
Splice  66.63  84.97  85.29  TDA  68.70  84.97  85.64 
UCI.a1a  78.45  84.57  84.29  PRABook  72.03  79.98  80.13 
UCI.a2a  77.21  84.15  84.57  PRADVD  69.96  81.25  81.63 
UCI.a3a  77.96  83.96  84.51  PRAMusic  68.71  79.61  78.61 
UCI.a4a  78.33  84.11  84.51  Banknote  83.21  97.95  97.95 
UCI.a5a  78.41  84.27  84.39  Haberman  72.52  80.34  81.43 
UCI.a6a  77.25  83.59  84.71  ILPD  74.25  81.25  82.55 
UCI.a7a  78.00  84.66  84.80  QSAR  67.26  77.35  77.35 
DB  58.33  91.66  91.66  Spect  77.00  92.34  97.64 
Note that LiC performs worse than the original SVMs on the clean datasets. The main reason is that, in the loss function of LiC given by Formula (6), samples with a high class-noise rate receive a penalty. When the estimated class-noise rate is higher than 50%, the weight of the penalty term is larger than that of the original loss term, which effectively inverts the label of the training sample. Since LiC can introduce class noise into clean data in this way, it is reasonable that it performs worse than the original SVMs. Unlike LiC, AdaBC is an ensemble method whose base classifier is still an SVM. Its loss function estimates the error rate of each base classifier and calculates the classifier's weight from this error rate. Usually, the error rate is less than 0.5, which means that each base classifier has a positive weight. The AdaBC algorithm thus becomes a bagging method over SVMs. Each base classifier in the bagging has a weight, but the weight is meaningless since the training data is clean. Since each classifier is trained on only part of the training set, the performance is no better than that of the original SVMs, but it is still comparable.
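The sign argument can be checked with the classifier weight of standard AdaBoost, \(\alpha = \frac{1}{2}\ln \frac{1-\epsilon }{\epsilon }\), where \(\epsilon \) is the error rate of a base classifier. This is an assumption for illustration: AdaBC's own weighting formula is not reproduced in this section and may differ. The function names and the example error rates are hypothetical.

```python
import math


def classifier_weight(error_rate):
    """Standard AdaBoost weight: positive exactly when error_rate < 0.5."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - error_rate) / error_rate)


def ensemble_predict(weights, votes):
    """Weighted majority vote over +1/-1 predictions of the base classifiers."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score >= 0 else -1


errors = [0.20, 0.35, 0.45]            # all better than chance
weights = [classifier_weight(e) for e in errors]
print([round(w, 3) for w in weights])  # every weight is positive
print(ensemble_predict(weights, [1, -1, -1]))
```

With all error rates below 0.5 every base classifier votes with the correct polarity, so a misestimated weight only changes how strongly a classifier counts, never the direction of its vote.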
In conclusion, the estimation of class noise is used to weigh the samples in LiC but to weigh the classifiers in AdaBC. On clean data, an incorrect weight on samples is much more harmful than an incorrect weight on classifiers. This is because the former leads to a mislabeled sample, whereas the latter only introduces an incorrect weight. Fortunately, the weight still has the correct polarity and the training data of each classifier is clean. That is why AdaBC also works well on clean data.
Generally speaking, AdaBC achieves the best performance in the evaluation. However, a detailed examination of the experiments shows that LiC is sensitive to algorithm parameters, including the k value in the kNN graph, the learning rate of the perceptron, and the termination point of the algorithm; it does not work well on clean data either. AdaBC is better than LiC in this respect: it is robust to all of the above parameters. Another advantage of AdaBC is its performance on clean data. Even when the data does not contain noise, AdaBC still performs well. Therefore, the AdaBoost-based improvement is necessary.
In this paper, we present a novel class-noise estimation method. We apply our estimation result in an AdaBoost-based algorithm to handle class noise. The algorithm is competitive with state-of-the-art techniques and shows superior performance on real datasets. We analyze the algorithm's performance on different training set sizes and class-noise rates. The results conform to the learning theorem provided in Eq. (8). In future work, we will investigate noise handling in semi-supervised tasks such as semi-supervised classification and transductive transfer learning, and also look into the domain adaptation problem. We will also consider different noise rates for different classes, since label noise rates are often class dependent in practice.
