Abstract
Evaluation of incremental classification algorithms is a complex task because there are many aspects to evaluate. Besides aspects such as accuracy and generalization that are usually evaluated in the context of classification, we also need to assess how the algorithm handles two main challenges of incremental learning: concept drift and catastrophic forgetting. However, the current methodology evaluates only catastrophic forgetting, using two scenarios for class addition and expansion. We generalize the methodology by proposing two new scenarios of incremental learning, class inclusion and separation, that evaluate the handling of concept drift. We demonstrate the proposed methodology on the evaluation of three different incremental classifiers and show that it provides a more complete and finer-grained evaluation.
1 Introduction
Evaluation of incremental learning algorithms is a complex task since there are many possible evaluation scenarios. Each scenario can evaluate multiple aspects of the incremental algorithm, such as accuracy convergence, robustness against catastrophic forgetting (CF), or concept drift (CD) handling. Testing multiple aspects at once, however, does not usually help in identifying the particular issues of the examined learning algorithm. Therefore, we need basic evaluation scenarios, each addressing a specific aspect of the incremental learning algorithm, to tackle one issue at a time.
An incrementally trained classifier is a classifier that is trained on consecutive tasks. In each task, the classifier is fed with labeled samples which cannot be stored but must be integrated into the classifier during the training. Such repeated training over a long period faces two main challenges called concept drift and catastrophic forgetting [2]. The symptom of catastrophic forgetting is a decrease in classifier performance measured on the previously trained tasks [4]. Such evaluation of the performance decrease is used as a metric for many proposed incremental algorithms [3, 6, 9], where the authors introduce incremental class learning and data permutation scenarios [4].
In the incremental class learning scenario, the classifier learns a different class in each task, while in the data permutation scenario, the classifier learns the same classes but with shuffled feature-vector components. Both scenarios examine how well the classifier is able to aggregate new data without forgetting the learned class distributions, but they do not examine the algorithm's adaptability to concept drift. Concept drift is a consequence of a non-stationary environment where the class distribution changes over time [10]. The detection of distribution changes is still an open problem, and its solutions depend on the type of concept drift [1, 13]. To evaluate the concept drift handling of different incremental algorithms, a scenario has to be designed in which the concept drift is evident. An example of such evident concept drift is when a previously presented sample is presented again but with a different label [7, 13]. The change of label requires the classifier to un-train the old label on the sample and then train the new one. Such an operation should be tested during the evaluation of an incremental algorithm.
The evaluation methodology for incrementally trained classifiers observed from various papers [3, 6, 7, 12] can be generally divided into three steps:
1. Test the basic properties of the classifier (e.g., accuracy, generalization, how fast it converges) within just one task.
2. Test the behavior of the classifier in the minimal incremental classification problem, where we have two tasks during which we train the classifier on given samples of two classes.
3. Test scalability by adding more classes and by increasing the number of tasks.
The main contribution of this paper relates to the second step, for which we introduce basic evaluation scenarios. We show that there are \(2^9\) possible scenarios that can be inferred from the basic presuppositions of the minimal incremental classification problem. In the context of incremental algorithm evaluation, we filter out symmetric and redundant scenarios to get four basic evaluation scenarios (depicted in Fig. 1): class addition (ADD), expansion (EXP), inclusion (INC), and separation (SEP). Scenarios ADD and EXP correspond to incremental class learning and data permutation [4], respectively, which are used to evaluate how the algorithm handles catastrophic forgetting. The new scenarios INC and SEP introduce the label change described in various concept drift cases [7, 13]. The benefit of these basic scenarios is that they are easy to construct from existing datasets (e.g., MNIST [8]) and the classifier can be treated as a black box. Furthermore, by evaluating the classifier with each basic scenario, we can analyze its properties separately. The proposed evaluation is demonstrated on multiple incremental classifiers.
The rest of the paper is organized as follows. The formal definition and inference of scenarios are provided in Sect. 2. The evaluated incremental algorithms are introduced in Sect. 3 and the evaluation results are reported in Sect. 4 with a detailed discussion and interpretation of the evaluation results in Sect. 4.1. The paper is concluded in Sect. 5.
2 Basic Scenarios of Incremental Classification
The incrementally trained classifier is being trained during consecutive tasks \(T_1\), \(T_2\), \(T_3\), ..., where for each consecutive task \(T_i\), the classifier \(F^{T_i}\) is trained on batch \(D^{T_i} = \{(\varvec{x}^j,l^j)\}_{1\le j \le m}\) of m labeled samples. Samples \(\varvec{x} \in X\) are labeled by one of n labels \(l \in L = \{L_1, \dots , L_n\}\). We are interested in the minimal incremental classification problem where we have just two tasks \(T_1\), \(T_2\) and two labels \(L_1\), \(L_2\). Having just two labels, during each task \(T_i\), each sample \(\varvec{x}\in X\) is in one of three states \(S = \{S_1, S_2, S_0\}\); the sample \(\varvec{x}\) is either
- \(S_1\): presented with the first label,
- \(S_2\): presented with the second label, or
- \(S_0\): not presented.
Having just two tasks, each sample \(\varvec{x}\in X\) has a tuple of states \((s, s')\in S^2\), where s and \(s'\) are the states of \(\varvec{x}\) during \(T_1\) and \(T_2\), respectively. Let a base set \(C_{s,s'} \subset X\) be the set of samples with the state tuple \((s, s')\). All nine base sets are pairwise disjoint, and their union gives X (see Fig. 2).
Each base set \(C_{s, s'}\) is either empty or non-empty. The base sets \(C_{1,1}\), \(C_{1,2}\), \(C_{2,1}\), and \(C_{2,2}\) contain samples that are presented twice (in \(T_1\) and then in \(T_2\)). If samples are drawn from a continuous probability distribution, the base sets with samples presented twice are always empty (the probability of sampling the same point twice is zero). However, in the context of classifier evaluation, we present the same sample \(\varvec{x}\) to the classifier with different labels \(l, l'\) to examine whether it can change its model in such a way that, after task \(T_2\), it labels \(\varvec{x}\) as \(l'\). Thus, in the evaluation scenarios, the base sets \(C_{1,1}\), \(C_{1,2}\), \(C_{2,1}\), and \(C_{2,2}\) can be non-empty.
Let a scenario be an assignment function \(r:S^2\rightarrow \{0,1\}\), where \(r(s,s')=1\) if \(C_{s,s'}\) is non-empty and \(r(s,s')=0\) otherwise. There are \(2^9\) scenarios, which we prune with the following constraints. First, we consider the labels to be symmetric and assume that the base set \(C_{0,0}\) is always non-empty. Second, in the context of incremental algorithm evaluation, we want to measure how well the classifier can classify samples trained only in \(T_1\) despite changes in \(T_2\). The base set \(C_{1,0}\) (or \(C_{2,0}\)) is the set of samples trained only in \(T_1\), and the base sets with samples that change labels in \(T_2\) are \(C_{0,2}\), \(C_{0,1}\), \(C_{2,1}\), and \(C_{1,2}\). Combining a non-empty \(C_{1,0}\) with each of these four base sets gives us four basic evaluation scenarios. Additionally, \(C_{0,2}\) is non-empty in all four basic scenarios to ensure that after task \(T_2\), there are always two classes to classify. The basic evaluation scenarios are listed in Table 1 and illustrated in Fig. 1.
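As an illustration, the combinatorics above can be sketched in a few lines of Python; the construction follows the constraints stated in the text (\(C_{0,0}\), \(C_{1,0}\), and \(C_{0,2}\) non-empty in every basic scenario), while the exact name-to-base-set mapping of Table 1 is not reproduced here:

```python
from itertools import product

# States: 0 = not presented, 1 = first label, 2 = second label.
STATE_PAIRS = [(s, s2) for s in (0, 1, 2) for s2 in (0, 1, 2)]

# A scenario assigns non-empty (1) or empty (0) to each of the 9 base sets.
all_scenarios = list(product((0, 1), repeat=len(STATE_PAIRS)))
assert len(all_scenarios) == 2 ** 9  # 512 assignment functions

# Base sets non-empty in every basic scenario, per the pruning constraints.
ALWAYS = {(0, 0), (1, 0), (0, 2)}
# Base sets whose samples change labels in T2; each yields one basic scenario.
CHANGING = [(0, 2), (0, 1), (2, 1), (1, 2)]

basic_scenarios = [frozenset(ALWAYS | {c}) for c in CHANGING]
```

Deduplicating over `frozenset` keeps the scenarios comparable as plain sets of non-empty base sets; the first combination collapses onto the always-present sets, which corresponds to the minimal scenario with no re-presented samples.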
Thus the evaluation of a binary classifier \(F^{T_i}:X \rightarrow \{1, 2\}\) is the examination of its performance for \(T_1\) and \(T_2\) in all basic scenarios, i.e.,
3 Incremental Classifiers
We introduce three incremental classifiers, ENS, ENSGEN, and ENSGENDEL, to present the proposed evaluation using the minimal scenarios listed in Table 1. The ENS is an ensemble of two multilayer perceptrons (MLPs), where each MLP is trained to classify its respective class. Each MLP is trained independently in the ADD scenario, and thus it should be robust to catastrophic forgetting. The ENS is trained in each task with Algorithm 1, and the label prediction is made by \(F(\varvec{x})=\arg \max _{l\in L} f_l(\varvec{x})\).
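The ensemble prediction \(F(\varvec{x})=\arg \max _{l\in L} f_l(\varvec{x})\) can be sketched as follows; the per-class scorers are toy stand-ins, since the paper treats each MLP as a black box:

```python
import numpy as np

def ens_predict(x, scorers):
    """Ensemble prediction: each binary scorer f_l votes for its own class,
    and the label with the highest score wins (F(x) = argmax_l f_l(x))."""
    scores = [f(x) for f in scorers]
    return int(np.argmax(scores)) + 1  # labels are 1-indexed (L_1, L_2)

# Toy stand-in scorers for two classes separated along the first feature.
f1 = lambda x: 1.0 - x[0]  # high when x[0] is small
f2 = lambda x: x[0]        # high when x[0] is large

print(ens_predict(np.array([0.9, 0.1]), [f1, f2]))  # → 2
```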
The ENSGEN is an extension of ENS in which we generate (replay) samples from an autoencoder. The technique of generating samples to prevent catastrophic forgetting is called memory replay [11]. Many implementations of memory replay (i.e., sample generation) use autoencoders, e.g., [14]. For each label \(l\in \{L_1, L_2\}\), we have an autoencoder composed of the encoder \(e_l:X \rightarrow Z\) and the decoder \(d_l: Z \rightarrow X\), where we call \(Z = [0, 1]^N\) the latent space. The autoencoder \(d_l\circ e_l\) is trained on samples labeled with l (along with the classifier \(f_l\)), and during the next task, we let the autoencoder generate samples that resemble the samples from the previous task. The sample generation method \(\textsc {generate}(d, f)\) is defined as
where \(\textsc {sample\_uniformly}(Z)\) gets M random samples of the latent space, which are first decoded by the decoder d and then filtered by the classifier f. The update method of the ENS can also be used for the ENSGEN, for which Algorithm 1 is modified as follows. Line 4 and Line 5 of Algorithm 1 are modified to
The condition in the minimization iterator (Line 7, Algorithm 1) is changed to
where we prioritize training of the autoencoder \(d_l\circ e_l\) over the classifier \(f_l\) for two reasons. First, preliminary experiments showed that the classifier \(f_l\) is easier to train. Second, if the autoencoder is not well trained, then the generate(d, f) method returns only a small number of samples in (4) and (5). The optimization of the autoencoder \(d_l\circ e_l\) is implemented by adding
after Line 9 of Algorithm 1.
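The \(\textsc{generate}(d, f)\) method described above can be sketched as follows, with a toy decoder and classifier; the acceptance rule \(f(x) > \theta\) is an assumption, as the paper does not spell out how the classifier filters the decoded samples:

```python
import numpy as np

def generate(d, f, latent_dim=8, M=10, theta=0.5):
    """Memory-replay sketch: draw M uniform latent points from Z = [0, 1]^N,
    decode them with d, and keep only the samples that f accepts.
    The acceptance rule f(x) > theta is an assumption."""
    z = np.random.uniform(0.0, 1.0, size=(M, latent_dim))
    decoded = np.array([d(zi) for zi in z])
    accepted = np.array([f(x) for x in decoded]) > theta
    return decoded[accepted]

# Toy stand-ins: an identity "decoder" and a classifier scoring the first feature.
replayed = generate(d=lambda z: z, f=lambda x: x[0], M=100)
```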
On the other hand, the ENSGEN classifier can fail to relabel some of the samples from task \(T_1\) in the INC and SEP scenarios. The samples that need to be relabeled (\(C_{2,1}\) and \(C_{1,2}\)) can lie within the cluster of the generated samples (see (4) and (5)). The cost functions of the autoencoder \(e\circ d\) and the classifier f are both smooth (differentiable); thus, by minimizing the cost function over the set of samples \(A_{l}\) (see Line 8 in Algorithm 1 and (7)), we also minimize the cost in the close neighborhoodFootnote 1 of the samples \(A_{l}\). Therefore, if two sets of samples share a close neighborhood but have conflicting minimization objectives (e.g., the two sets have different labels), the minimization process slows down. We propose an extension of ENSGEN, the ENSGENDEL classifier, which “subtracts” the close neighborhood of the new samples \(A^o_{l'}\) from the generated samples of label l. Hence, the classifier \(f_l\) cannot be trained to label \(A^o_{l'}\) as l. The modification of ENSGEN is to replace Line 4 and Line 5 of Algorithm 1 with
where \(\varepsilon \) is the minimum distance between the generated and new samples.
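The subtraction step can be sketched as follows; this is a simplified stand-in, assuming a Euclidean metric, for the neighborhood removal that ENSGENDEL performs on the generated samples:

```python
import numpy as np

def subtract_neighborhood(generated, new_samples, eps=0.1):
    """ENSGENDEL-style filtering sketch: drop every generated sample that lies
    within distance eps of any newly presented sample, so the old class is
    never reinforced on points that must be relabeled."""
    # Pairwise distances between generated (G, D) and new samples (N, D).
    dists = np.linalg.norm(generated[:, None, :] - new_samples[None, :, :], axis=2)
    keep = dists.min(axis=1) >= eps
    return generated[keep]

gen = np.array([[0.0, 0.0], [1.0, 1.0]])
new = np.array([[0.05, 0.0]])
kept = subtract_neighborhood(gen, new, eps=0.1)  # only [1.0, 1.0] survives
```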
4 Results
In this section, we report how the proposed evaluation scenarios (see Sect. 2) can improve the analysis of incremental classifiers, demonstrated on the classifiers described in Sect. 3. Moreover, to set a baseline, we also train a single MLP classifier SNG with layer sizes 784-500-500-2, a softmax layer, and the cross-entropy loss function. The ENS classifier has \(\theta =0.1\), \(M=1000\), and two binary MLPs, each with layer sizes 784-500-250-125-1. The ENSGEN and ENSGENDEL classifiers have \(\theta =7\), \(M=10\), \(\varepsilon =0.1\), and two autoencoders, each composed of an encoder and a decoder with layer sizes 784-500-200-8 and 8-200-500-784, respectively. A rectifier is used as the activation function for all hidden layers. The output layers of the encoder and of the MLPs in ENS have a sigmoid activation function. All neural networks are trained with Adam [5] with the learning rate set to 0.0001. All the hyperparameters were found empirically.
Different scenarios are created using the MNIST [8] dataset, which has roughly 7000 samples per MNIST class (zero, one, ..., nine), where each sample is a \(28\times 28\) image of a digit. The dataset is divided into a training and a testing set with a ratio of 6 to 1. We construct scenarios by assigning some of the MNIST classes to the base sets \(C_{i,j}\). Two assignment configurations are described in Table 2: the 021 assignment, which is made from easily distinguishable digits (zeros, twos, and ones), and the 197 assignment, which contains digits that are harder to distinguish (ones, nines, and sevens). The classifiers are trained on scenarios created from the training set, and the evaluation is calculated on scenarios created from the testing set. The results are shown in Tables 3 and 4.
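The construction of per-task training batches from a base-set assignment can be sketched as follows; the miniature assignment used here is hypothetical (the actual 021 and 197 assignments of MNIST classes are given in Table 2):

```python
def build_tasks(base_sets):
    """Turn base sets C_{s,s'} (state pairs over tasks T1, T2) into per-task
    labeled batches. States: 0 = not presented, 1/2 = presented with that
    label. `base_sets` maps a state pair to a list of samples."""
    t1, t2 = [], []
    for (s1, s2), samples in base_sets.items():
        for x in samples:
            if s1:
                t1.append((x, s1))  # sample presented in T1 with label s1
            if s2:
                t2.append((x, s2))  # sample presented in T2 with label s2
    return t1, t2

# A hypothetical miniature assignment; strings stand in for digit images.
base_sets = {(1, 0): ["a1", "a2"], (0, 2): ["b1"], (1, 2): ["c1"]}
t1, t2 = build_tasks(base_sets)
# T1 trains a1, a2, c1 with label 1; T2 trains b1 and relabels c1 to label 2.
```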
4.1 Discussion
The overall accuracy of the classifiers can be compared from the results in Table 3, where the regular evaluation on the ADD and EXP scenarios [3, 6, 7, 12] is extended with the proposed INC and SEP scenarios. In the 197 assignment of the INC scenario, we can see that the ENS classifier is unable to relabel some of the previously presented samples (the accuracy in the \(T_1\) column should be at most roughly 0.5, but it is 0.91 for the ENS classifier). Such low relabeling performance is most likely caused by the similarity of the digits used in the 197 assignment (ones, nines, and sevens), because in the 021 assignment, the ENS classifier can relabel the previously presented samples (the ENS has 0 accuracy in the \(T_1\) column of the INC scenario). With the SEP scenario, we can distinguish the performance of the ENSGEN and ENSGENDEL classifiers, which have almost identical results in all other scenarios. Thus, we gain more information about the evaluated classifiers by evaluating them with the proposed scenarios SEP and INC.
The regular evaluation listed in Table 3 is good for comparing multiple classifiers. However, for a finer analysis, we propose to evaluate the accuracy on each base set, as shown in Table 4, where the column \(C_{1,0}\) shows how well the classifier “remembers” the base set \(C_{1,0}\) after task \(T_2\). The accuracies in the column \(C_{1,0}\) show that the classifiers ENSGEN and ENSGENDEL remember the previously learned samples almost perfectly. Other interesting columns are \(C_{2,1}\) and \(C_{1,2}\), which show how well the classifier relabels the previously trained samples. In the 021 assignment of the SEP scenario, the ENSGEN classifier was able to relabel only 0.78 of the samples, while ENSGENDEL was able to relabel almost all of them. Such explicit information is lost in the regular overall evaluation (see Table 3) because it aggregates over multiple base sets.
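The per-base-set evaluation can be sketched as follows; `final_labels`, a mapping from each base set to the label expected after \(T_2\), is a helper introduced for illustration:

```python
def per_base_set_accuracy(classifier, base_sets, final_labels):
    """Evaluate accuracy separately on each non-empty base set C_{s,s'}:
    after T2, a sample counts as correct if the classifier outputs the
    label it was most recently trained with."""
    acc = {}
    for key, samples in base_sets.items():
        if not samples:
            continue
        target = final_labels[key]
        hits = sum(classifier(x) == target for x in samples)
        acc[key] = hits / len(samples)
    return acc

# Toy example: samples are scalars, classifier thresholds at 5.
clf = lambda x: 1 if x < 5 else 2
base_sets = {(1, 0): [1, 2], (0, 2): [7, 8], (1, 2): [3, 9]}
final_labels = {(1, 0): 1, (0, 2): 2, (1, 2): 2}
acc = per_base_set_accuracy(clf, base_sets, final_labels)
# acc[(1, 2)] isolates relabeling performance, like the C_{1,2} column.
```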
The results in the 197 assignment are worse than the results in the 021 assignment in most cases. From this difference, we can draw the lesson that it is important to try more assignments, as pointed out in [12], because each MNIST class (or any class of a different dataset) has different qualities. Quantity is another aspect to consider: in this paper, the base sets are of (roughly) equal cardinality. Scenarios with base sets of different cardinalities could evaluate the classifier's robustness against unbalanced data. Thus, it is good practice to use the basic evaluation scenarios with multiple different assignments for a thorough examination of the incremental classifier.
5 Conclusion
In this paper, we propose a generalization of the current methodology for incremental classifier evaluation by introducing four basic evaluation scenarios: class addition, expansion, inclusion, and separation. Three incremental classifiers are presented to demonstrate the methodology within the proposed evaluation scenarios. Each classifier has been evaluated with the proposed methodology, and we assess how well it handles the catastrophic forgetting and concept drift issues. Moreover, the proposed generalization allows us to design a finer evaluation that can test particular aspects of incremental learning, such as remembering previously trained samples or selectively relabeling previously learned samples. Such a detailed methodology for incremental learning evaluation should improve the development of incremental classifiers, and researchers are therefore encouraged to consider it in their developments.
Notes
1. In a metric space X, a neighborhood of the point \(\varvec{x}\) is defined as a ball of radius r with \(\varvec{x}\) in the center: \(\mathcal B_d(\varvec{x}, r) = \{\varvec{y}\,|\,d(\varvec{x}, \varvec{y})<r; \varvec{y}\in X\}\), where d is a metric function. A close neighborhood is a neighborhood with a very small radius r.
References
Freund, Y., Mansour, Y.: Learning under persistent drift. In: Ben-David, S. (ed.) EuroCOLT 1997. LNCS, vol. 1208, pp. 109–118. Springer, Heidelberg (1997). https://doi.org/10.1007/3-540-62685-9_10
Gepperth, A., Hammer, B.: Incremental learning algorithms and applications. In: European Symposium on Artificial Neural Networks (ESANN), pp. 357–368 (2016)
Goodfellow, I.J., Mirza, M., Xiao, D., Courville, A., Bengio, Y.: An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013). arXiv e-prints: arXiv:1312.6211
Kemker, R., Abitino, A., McClure, M., Kanan, C.: Measuring catastrophic forgetting in neural networks. CoRR abs/1708.02072 (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)
Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017). https://doi.org/10.1073/pnas.1611835114
Lane, T., Brodley, C.E.: Approaches to online learning and concept drift for user identification in computer security. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD 1998, pp. 259–263. AAAI Press (1998)
LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). http://yann.lecun.com/exdb/mnist/. Accessed 29 Jan 2019
Lee, S., Kim, J., Ha, J., Zhang, B.: Overcoming catastrophic forgetting by incremental moment matching. CoRR abs/1703.08475 (2017)
Moreno-Torres, J.G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recogn. 45(1), 521–530 (2012). https://doi.org/10.1016/j.patcog.2011.06.019
Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: a review. Neural Netw. 113, 54–71 (2019). https://doi.org/10.1016/j.neunet.2019.01.012
Pfülb, B., Gepperth, A., Abdullah, S., Kilian, A.: Catastrophic forgetting: still a problem for DNNs. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018, Part I. LNCS, vol. 11139, pp. 487–497. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01418-6_48
Wang, K., Zhou, S., Fu, C.A., Yu, J.X.: Mining changes of classification by correspondence tracing. In: Proceedings of the SIAM International Conference on Data Mining, pp. 95–106. SIAM (2003). https://doi.org/10.1137/1.9781611972733.9
Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., Zhang, Z., Fu, Y.: Incremental classifier learning with generative adversarial networks. CoRR abs/1802.00853 (2018)
Acknowledgments
This work was supported by the Czech Science Foundation (GAČR) under research project No. 18-18858S.
© 2019 Springer Nature Switzerland AG
Szadkowski, R., Drchal, J., Faigl, J. (2019). Basic Evaluation Scenarios for Incrementally Trained Classifiers. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning. ICANN 2019. Lecture Notes in Computer Science, vol. 11728. Springer, Cham. https://doi.org/10.1007/978-3-030-30484-3_41