1 Introduction

With the rapid development of data collection technology, it is of great importance to analyze and extract knowledge from vast amounts of data. However, data commonly arrive in a streaming form and are usually collected from non-stationary environments, and thus they are evolving in nature. In other words, the joint distribution between the input feature and the target label changes over time, a phenomenon referred to as concept drift in the literature (Gama et al. 2014). If we simply ignore the distribution change when learning from an evolving data stream, the performance will drop dramatically, which is undesirable both empirically and theoretically. Consequently, the concept drift problem has become one of the most challenging issues in data stream learning and has drawn researchers' attention to the design of practically effective and theoretically sound algorithms.

However, a data stream with concept drift is essentially impossible to learn (predict) without any assumption on the distribution change: if the underlying distribution changes arbitrarily or even adversarially, there is no hope of learning a good model to make predictions. We share the same assumption as most previous works, namely, that previous data contain some knowledge useful for future prediction. Sliding window based approaches (Klinkenberg and Joachims 2000; Bifet and Gavaldà 2007; Kuncheva and Zliobaite 2009), forgetting based strategies (Koychev 2000; Klinkenberg 2004; Zhao et al. 2019) and ensemble based methods (Kolter and Maloof 2005, 2007; Sun et al. 2018) all share this assumption; they differ only in how the knowledge in previous data is exploited and utilized.

Another issue is that most previous works on handling concept drift focus on algorithm design; only a few consider theoretical properties (Helmbold and Long 1994; Crammer et al. 2010; Mohri and Medina 2012). Some works propose algorithms along with theoretical analysis: for instance, Kolter and Maloof (2005) provide mistake and loss bounds and guarantee that the performance of the proposed approach is relative to that of the base learner, and Harel et al. (2014) detect concept drift via resampling and provide bounds based on stability analysis. Nevertheless, few works give clear theoretical guarantees or justifications of why and how to utilize knowledge in previous data to combat concept drift.

In this paper, we propose a novel and effective approach for handling Concept drift via model reuse, or Condor. It consists of two modules: the \({\mathtt {ModelUpdate}}\) module exploits knowledge from previous models to help build the new model, while the \({\mathtt {WeightUpdate}}\) module adaptively assigns weights to previous models according to their performance, where the weights represent the reusability of previous models on current data. We theoretically justify the advantage of the ModelUpdate module via generalization analysis, showing that our model reuse approach benefits from a good weighted combination of previous models. Meanwhile, the WeightUpdate module guarantees that the weights eventually concentrate on the better-fit models and thus provides a good weighted combination of previous models as the initialization for the ModelUpdate module when training the new model. Together, the two modules make the proposed algorithm successful. Besides, we investigate the overall performance on the whole data stream by dynamic regret analysis. Experiments on both synthetic and real-world datasets validate the superiority of our approach, and the empirical studies demonstrate the effectiveness of the model reuse mechanism for handling concept drift.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 proposes our approach. Sections 4 and 5 present theoretical analysis. Section 6 reports experimental results. Finally, we conclude the paper and discuss future work in Sect. 7.

2 Related work

Concept Drift The phenomenon of concept drift has been well recognized in recent research (Gama et al. 2014; Gomes et al. 2017). The capability of handling concept drift is one of the fundamental requirements of learnware (Zhou 2016), and also a crucial step towards robust and reliable artificial intelligence (Dietterich 2017, 2019).

Basically, without any structural information about the data stream, we cannot expect to learn from historical data and make any meaningful prediction. Therefore, a common assumption on concept drift, as mentioned above, is that previous data contain useful knowledge. In particular, most previous works assume that nearby data items carry more useful information about the current data, and researchers have therefore proposed plenty of approaches based on sliding window and forgetting mechanisms. Sliding window based approaches maintain the most recent data items and discard old ones, with a fixed or adaptive window size (Klinkenberg and Joachims 2000; Kuncheva and Zliobaite 2009). Forgetting based approaches do not explicitly drop old items but downweight previous data items according to their age (Koychev 2000; Klinkenberg 2004).

Another important category is the ensemble based approaches, which adaptively add or delete base classifiers and dynamically adjust weights when dealing with the evolving data stream. A series of works borrows ideas from boosting (Schapire 1990) and online boosting (Beygelzimer et al. 2015) by endowing the learning system with the capability to cope with non-stationarity via dynamically adjusting the weights of classifiers. To name a few representatives: dynamic weighted majority (DWM) dynamically creates and removes weighted experts in response to distribution changes (Kolter and Maloof 2003, 2007); additive expert ensemble (AddExp) adaptively adjusts the additive expert pool and provides mistake bounds as theoretical guarantees (Kolter and Maloof 2005); learning in non-stationary environments (Learn\(^{++}\).NSE) trains one new classifier for each batch of data and then combines these classifiers (Elwell and Polikar 2011); and Hybrid Forest (Rad and Haeri 2019) adaptively switches to a suitable classifier by multiple classifier selection according to the “hybrid power” quantity defined therein. Plenty of approaches have been proposed for learning from evolving data streams; we refer the reader to the comprehensive surveys (Gama et al. 2014; Gomes et al. 2017). For boosting and ensemble approaches, we recommend the books (Schapire and Freund 2012; Zhou 2012).

On the surface, our approach is somewhat similar to DWM and AddExp: all of them maintain a model pool and adjust weights to penalize models with poor performance. However, we differ in the model update procedure, as they do not leverage previous knowledge and reuse models to help build the new model and update the model pool. Besides, our weight update strategies are also different.

Model Reuse Model reuse is an important learning problem, also named model transfer, hypothesis transfer, or learning from auxiliary classifiers. The basic setting is that one desires to reuse pre-trained models to help further model building, especially when the data are too scarce to train a decent model directly. A series of works builds on the idea of biased regularization, which adds previous models as a biased regularizer to empirical risk minimization and achieves good performance in plenty of scenarios (Duan et al. 2009; Tommasi et al. 2010, 2014). Reddi et al. (2015) adopt such techniques in the covariate shift setting to reduce the variance introduced by the standard reweighted empirical risk minimization. There are also other attempts and applications: Segev et al. (2017) develop model reuse with random forests; Ye et al. (2018) reuse models to transfer an invariant meta feature representation between heterogeneous feature spaces by semantic mapping; and Wu et al. (2019) propose a novel model reuse method for multi-party learning by optimizing the global behavior of an ensemble of heterogeneous local models. Besides, Li et al. (2013) apply the model reuse technique to adapt to different performance measures. Apart from algorithm design, theoretical foundations have recently been established via algorithmic stability (Kuzborskij and Orabona 2013), Rademacher complexity (Kuzborskij and Orabona 2017) and transformation functions (Du et al. 2017).

Our paper proposes to handle the concept drift problem via model reuse. The idea of exploiting knowledge by reusing previous models is reminiscent of some past works that cope with concept drift by transfer learning, like the temporal inductive transfer (TIX) approach (Forman 2006) and the diversity and transfer-based ensemble learning (DTEL) approach (Sun et al. 2018). Both are batch-style approaches, that is, they need to receive a batch of data each time, whereas ours can update either in an incremental style or in a batch mode. TIX concatenates the predictions of previous models onto the features of the next data batch, and a new model is learned from the augmented batch. DTEL chooses a decision tree as the base learner and builds a new tree by directly using the latest data batch, while “fine-tuning” previous models by direct tree structural adaptation; it maintains a fixed-size model pool with a selection criterion based on a diversity measurement. Neither depicts the reusability of previous models, which is carried out by the WeightUpdate module in our approach. Last but not least, our approach comes with sound theoretical guarantees; in particular, we carry out a generalization justification of why and how to reuse previous models, whereas theirs are in general not theoretically clear.

3 Proposed approach

We consider the scenario where data arrive one by one sequentially and concept drift may emerge in the data stream. We propose a novel algorithm, Condor, to handle the potential concept drift. The high-level idea is to adapt to the non-stationarity by extracting the useful knowledge in previous data, and we therefore adopt the model reuse technique to handle concept drift. To realize this goal, we design two core modules, the \({\mathtt {ModelUpdate}}\) module and the \({\mathtt {WeightUpdate}}\) module, both of which are essential for reusing the knowledge in previous data and automatically adapting to non-stationary environments. In the following, we give detailed descriptions of the proposed algorithm.

Condor periodically updates the model upon reaching the maximum update period p, a data-dependent parameter reflecting the inherent extent of fluctuation and non-stationarity. Meanwhile, we also run a concept drift detector \({\mathfrak {D}}\) in the background in case of abrupt changes. When the maximum update period is reached, or an abrupt change is detected, instead of resetting the model pool and incrementally training a new model, we aim to reuse the knowledge in previous models in a weighted manner to enhance the overall performance. Therefore, there are two issues to address:

  • Suppose each previous model is associated with a proper weight. How can these models be reused in a weighted manner to train the new model?

  • How can a proper weight be obtained for each previous model? Note that the weight should represent the model's “reusability” on the current data.

To address the above two issues, our approach consists of the following two important modules.

  1. \({\mathtt {ModelUpdate}}\) by model reuse: we reuse previous models to build the new model and update the model pool, making use of the biased regularization technique for multiple model reuse learning.

  2. \({\mathtt {WeightUpdate}}\) by expert advice: the weight of each previous model is updated according to its performance on the current data, in an exponentially weighted average manner.

We present descriptions of these two modules in the next two subsections.

3.1 Model update by model reuse

During the model update stage, we propose to reuse previous models in a weighted manner to adapt to the current data epoch. Note that in this subsection the weight of each previous model is assumed to be given in advance; the weight update procedure is specified in the next subsection.

Consider the kth model update as illustrated in Fig. 1: we desire to use the previous models in the model pool \(H = \{h_1,\ldots ,h_{k-1}\}\) along with the current data epoch \(S_k\) to train a new model \(h_k\). With a slight abuse of notation, we denote \(S_k = \{(\mathbf{x }_1,y_1),\ldots ,(\mathbf{x }_m,y_m)\}\).

Fig. 1 Illustration of the main idea: our approach periodically conducts the model update and runs the drift detector \({\mathfrak {D}}\) in the background in case of abrupt changes. During the model update, on one hand, we utilize the data items in the current epoch \(S_k\); on the other hand, we reuse knowledge from previous models (\(\{h_1,h_2,\ldots ,h_{k-1}\}\)) via model reuse

We first adopt a linear classifier as the base model and reuse previous models via the biased regularization technique (Schölkopf et al. 2001; Tommasi et al. 2014). We remark that one may also use kernel methods: the Nyström method or random Fourier features can transform the kernelized problem into a linear one; details can be found in Yang et al. (2012).
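As an aside, here is a minimal sketch of the standard random Fourier feature construction for the RBF kernel \(k(\mathbf{x },\mathbf{x }') = \exp (-\Vert \mathbf{x }-\mathbf{x }'\Vert ^2/(2\sigma ^2))\); the dimension D, bandwidth sigma, and function names below are our illustrative choices, not part of the paper:

```python
# Sketch: an approximate feature map z(x) such that z(x)^T z(x')
# approximates the RBF kernel; drawing (W, b) once and reusing them
# for every data item keeps train/test features consistent.
import numpy as np

def make_rff(d, D=200, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # spectral samples of the kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # random phases
    return lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

With such a map fixed up front, each incoming item is transformed once, and the linear machinery below applies unchanged.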

When the base model is a linear classifier, the new model is obtained by the following regularized empirical risk minimization,

$$\begin{aligned} {\hat{\mathbf{w }}}_k = {{\,\mathrm{arg\,min}\,}}_{\mathbf{w }} \left\{ \frac{1}{m}\sum _{i=1}^m \ell \left( \langle \mathbf{w } ,\mathbf{x }_i\rangle ,y_i\right) + \mu \varOmega (\mathbf{w }, \mathbf{w }_p)\right\} , \end{aligned}$$
(1)

where \(\ell :{\mathbb {R}} \times {\mathcal {Y}} \mapsto {\mathbb {R}}_+\) is a non-negative loss function and \(\mu > 0\) is a positive trade-off regularization coefficient. Besides, we denote by \({\mathcal {H}}\) the hypothesis set, and \(\varOmega : {\mathcal {H}}\times {\mathcal {H}} \mapsto {\mathbb {R}}_+\) is a model reuse regularizer specifying that the final model is built upon the previous models, satisfying \(\varOmega (\mathbf{w }_p,\mathbf{w }_p) = 0\), with the typical choice \(\Vert \mathbf{w } - \mathbf{w }_p\Vert ^2\). Here \(\mathbf{w }_p\) is the linear weighted combination of previous models, namely \(\mathbf{w }_p = \sum _{j=1}^{k-1} \beta _j {\hat{\mathbf{w }}}_j\), where \(\beta _j\) is the weight associated with the previous model \(h_j\), representing its reusability on data in the current epoch.

Since data in a relatively stationary epoch of an evolving data stream are usually scarce, the model reuse mechanism is quite useful, as it reduces the sample complexity by reusing previous models as a basis; this will be theoretically investigated via generalization analysis in Theorem 1.

For simplicity, in this paper we choose the square loss with \(\ell _2\) regularization in the practical implementation, which is essentially the Least Squares Support Vector Machine (LS-SVM) (Suykens et al. 2002). The optimal solution of Eq. (1) can be expressed in the form \({\hat{\mathbf{w }}}_k = \sum _{i=1}^m \alpha _i \mathbf{x }_i\), where the coefficients \({\varvec{\alpha }} = [\alpha _{1},\ldots ,\alpha _{m}]^{\mathrm {T}}\) can be obtained from the optimality conditions. Specifically, it suffices to solve the following linear Karush-Kuhn-Tucker (KKT) system (Suykens et al. 2002, Chapter 3, pp. 73),

$$\begin{aligned} \begin{bmatrix} \mathbf{K }+\frac{1}{\mu }\mathbf{I } & \mathbf{1 } \\ \mathbf{1 }^{\mathrm {T}} & 0 \end{bmatrix} \begin{bmatrix} {\varvec{\alpha }} \\ b \end{bmatrix} = \begin{bmatrix} \mathbf{y } - \sum _{j=1}^{k-1} \beta _j {\hat{\mathbf{y }}}_j \\ 0 \end{bmatrix}, \end{aligned}$$
(2)

where \(\mathbf{K }\) is the linear kernel matrix, i.e., \(\mathbf{K }_{ij} = \mathbf{x }_i^{\mathrm {T}} \mathbf{x }_j\). Besides, \(\mathbf{y }\) and \({\hat{\mathbf{y }}}_j\) are the vectors containing the labels of the data stream and the predictions of the jth previous model, that is, \(\mathbf{y }=[y_1,\ldots ,y_m]^\mathrm {T}\) and \({\hat{\mathbf{y }}}_j = [\langle {\hat{\mathbf{w }}}_j,\mathbf{x }_1 \rangle , \ldots , \langle {\hat{\mathbf{w }}}_j,\mathbf{x }_m \rangle ]^\mathrm {T}\). The KKT system can be solved by Gaussian elimination. For larger datasets, to alleviate the memory cost and reduce the computational complexity, iterative methods are recommended, for instance the conjugate gradient method (Suykens et al. 2002, Chapter 3.4.1, pp. 86).
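For concreteness, here is a minimal dense-algebra sketch of one such update; the function and variable names are ours, and the right-hand side shows that the new coefficients fit the residual left after subtracting the weighted predictions of previous models:

```python
# Sketch of one ModelUpdate step via the KKT system of Eq. (2),
# assuming a linear kernel and labels y in {-1, +1}.
import numpy as np

def model_update(X, y, prev_w, beta, mu):
    """X: (m, d) data of epoch S_k; prev_w: previous models hat{w}_j; beta: their weights."""
    m = X.shape[0]
    K = X @ X.T                                    # linear kernel matrix K_ij = x_i^T x_j
    y_prev = sum(b_j * (X @ w_j) for b_j, w_j in zip(beta, prev_w))
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = K + np.eye(m) / mu                 # K + (1/mu) I
    A[:m, m] = 1.0                                 # column of ones
    A[m, :m] = 1.0                                 # bottom row 1^T
    rhs = np.append(y - y_prev, 0.0)               # residual after model reuse
    sol = np.linalg.solve(A, rhs)                  # direct (Gaussian elimination) solve
    alpha, bias = sol[:m], sol[m]
    return X.T @ alpha, bias                       # hat{w}_k = sum_i alpha_i x_i
```

For large m, one would replace `np.linalg.solve` with a conjugate gradient routine, in line with the iterative methods recommended above.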

If concept drift occurs very frequently or the data stream accumulates for a long time, the size of the model pool will explode in the absence of a model retirement mechanism. Therefore, we cap the model pool size at K. One could keep the K models with the largest diversity, as done in the work of Sun et al. (2018); for simplicity, we only keep the latest K models in the pool.

Remark 1

The biased regularization model reuse learning is not limited to the binary classification task and can be extended to the multi-class scenario; we defer the notations and corresponding theoretical analyses to Sect. 5. Meanwhile, our framework generalizes to other base models equipped with suitable model reuse mechanisms; for instance, one can use a decision tree as the base classifier along with a suitable model reuse strategy. More extensions will be investigated in future work.

Algorithm 1 The overall procedure of Condor

3.2 Weight update by expert advice

During the weight update stage, we propose to update the weights such that each weight represents the reusability of the corresponding model on the current data. To this end, we update the weight of each model by the prediction with expert advice technique (Cesa-Bianchi et al. 1997; Cesa-Bianchi and Lugosi 2006). The intuitive idea is to adaptively adjust the weight distribution over previous models according to their performance, so as to reflect their reusability.

Specifically, when the ModelUpdate stage finishes, the weight distribution over the model pool H is reinitialized. We adopt a uniform initialization: \(\beta _{1,k} = 1/|H |\) for \(k=1,\ldots ,|H |\). After the initialization, the weights are updated as follows. We first receive the new data item \(\mathbf{x }_t\), and each previous model provides its prediction \({\hat{y}}_{t,k}\). The final prediction \({\hat{y}}_t\) is made as a weighted combination of those predictions (\({\hat{y}}_{t,k}\), for \(k=1,\ldots ,|H |\)). Next, the true label \(y_t\) is revealed, and the weight distribution is updated according to the loss each model suffers, in an exponentially weighted manner,

$$\begin{aligned} \beta _{t+1,k} \propto \beta _{t,k} \exp \{-\eta \ell ({\hat{y}}_{t,k},y_t)\}. \end{aligned}$$
(3)
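As an illustration, here is a minimal sketch of this update together with the weighted pool prediction; the helper names are ours, and the per-model losses are assumed precomputed:

```python
# Sketch of the WeightUpdate step of Eq. (3): exponentially weighted
# averaging over the model pool H; renormalizing keeps a distribution.
import numpy as np

def weight_update(beta, losses, eta):
    """beta: weights over |H| models; losses: ell(hat{y}_{t,k}, y_t) for each model."""
    beta = np.asarray(beta) * np.exp(-eta * np.asarray(losses))
    return beta / beta.sum()

def pool_prediction(beta, preds):
    """Weighted combination of the pool's predictions hat{y}_{t,k}."""
    return float(np.dot(beta, preds))
```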

The weight update procedure in Eq. (3) is simple yet efficient. Next, we show that the WeightUpdate mechanism returns a good weight distribution, implying that our approach can reuse previous models properly. We have the following observation regarding the weight distribution.

Observation 1

(Weight Concentration) During the WeightUpdate procedure in epoch \(S_k\), the weights will concentrate on those previous models that suffer a small cumulative loss on \(S_k\).

Proof

By a simple analysis of the WeightUpdate procedure, we know that the weight associated with the jth previous model is proportional to \(\beta _{1,j}\exp \{-\eta L_{S_k}^{(j)}\}\), where \(L_{S_k}^{(j)} = \sum _{i\in S_k} \ell (h_j(\mathbf{x }_i),y_i)\) and \(j =1,\ldots ,k-1\). Therefore, a model that suffers a small cumulative loss will be associated with a large weight. \(\square \)

Notwithstanding its simplicity, this observation plays a vital role in the success of our approach. It guarantees that the algorithm adaptively assigns more weight to better-fit previous models and thus essentially depicts the “reusability” of each model. Therefore, the weight update procedure is particularly useful when recurring concepts emerge in the data stream. We give empirical evidence as support in Sects. 6.3 and 6.4.

The overall procedure of the proposed approach Condor is summarized in Algorithm 1, where lines 10-12 mainly conduct the weight update procedure presented in Eq. (3) and the model update procedure shown in line 14 is realized by Eq. (1). From the update procedures, we conclude that the overall space complexity is \(O(d(K+p))\), because Condor needs to store p data items in each epoch and K previous models, where d is the dimensionality of the feature space. The overall time complexity is \(O(p^3 + dp^2 + dpK)\); specifically, the \(O(p^3 + dp^2)\) term is mainly devoted to solving the KKT system of Eq. (2), while the O(dpK) term accounts for the predictions of p instances in each epoch. We remark that the time complexity can be further reduced by iterative methods for solving the LS-SVM, particularly when the matrix of the KKT system has a small condition number (Shewchuk 1994).
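Putting the pieces together, here is a high-level sketch of the loop in Algorithm 1, reusing the helpers sketched above; the detector interface (`add`, `drift_detected`) is an illustrative stand-in for a detector such as ADWIN, not its actual API:

```python
# Sketch of Condor's main loop: predict with the weighted pool,
# update weights per item (Eq. (3)), and rebuild the model at the
# end of each epoch or upon detected drift (Eqs. (1)-(2)).
import numpy as np

def condor(stream, p, K, eta, mu, detector, loss):
    pool, beta, epoch = [], np.array([]), []        # model pool H, weights, epoch S_k
    for x, y in stream:
        if pool:
            preds = np.array([x @ w + b for (w, b) in pool])
            y_hat = pool_prediction(beta, preds)    # weighted pool prediction
            beta = weight_update(beta, [loss(pk, y) for pk in preds], eta)
            detector.add(loss(y_hat, y))            # background drift detector D
        epoch.append((x, y))
        if len(epoch) >= p or detector.drift_detected():
            X = np.array([xi for xi, _ in epoch])
            Y = np.array([yi for _, yi in epoch])
            w_k, b_k = model_update(X, Y, [w for w, _ in pool], beta, mu)
            pool = (pool + [(w_k, b_k)])[-K:]       # retire all but the latest K models
            beta = np.full(len(pool), 1.0 / len(pool))  # uniform reinitialization
            epoch = []
```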

4 Theoretical analysis

In this section, we provide theoretical justifications for the proposed approach Condor. Observation 1 in Sect. 3.2 demonstrates that the WeightUpdate mechanism provides a good weight distribution for the weighted combination of previous models as the initialization. We will further show that the model reuse mechanism is capable of taking advantage of this initialization when training the new model. Therefore, they together make our proposed approach successful.

To this end, we present the results in the following two aspects:

  • Local analysis: investigate the performance in each epoch, from both generalization and regret aspects.

  • Global analysis: examine the regret on the whole data stream.

To better present the main results, we defer all proofs to the appendix.

4.1 Local analysis

Local analysis aims to scrutinize the performance in a particular epoch. On the one hand, we are concerned with the generalization ability of the model obtained by the ModelUpdate module; on the other hand, we examine the cumulative regret of the predictions.

Let us consider the epoch \(S_k\): the ModelUpdate module reuses previous models \(h_1,\ldots ,h_{k-1}\) to help build the new model \(h_k\), as shown in Fig. 1. To simplify the presentation, we introduce some notation. Suppose the data stream S has length T and is partitioned into k epochs \(S_1, S_2,\ldots ,S_k\).Footnote 1 For epoch \(S_k\), we assume the distribution is identical, i.e., \(S_k\) is a sample of \(m_k\) points drawn i.i.d. according to distribution \({\mathcal {D}}_k\), where \(m_k\) denotes its length.

4.1.1 Generalization analysis

First, we conduct generalization analysis on the ModelUpdate module. Define the risk and empirical risk of the hypothesis (model) h on epoch \(S_k\) by

$$\begin{aligned} R(h) = {\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k} [\ell (h(\mathbf{x }),y)], \quad {\hat{R}}(h) = \frac{1}{m_k}\sum _{i\in S_k} \ell (h(\mathbf{x }_i),y_i). \end{aligned}$$

Here, with a slight abuse of notation, we also use \(S_k\) to denote the index set of the epoch, and write \({\hat{R}}(h)\) instead of \({\hat{R}}_{S_k}(h)\) for simplicity. The new model \(h_k\) is built and updated on epoch \(S_k\) via the ModelUpdate module; we then have the following generalization error bound.

Theorem 1

Assume that the non-negative loss function \(\ell :{\mathbb {R}}\times {\mathcal {Y}} \mapsto {\mathbb {R}}_+\) is bounded by \(M\ge 0\) and that, for all \(y\in {\mathcal {Y}}\), \(\ell (\cdot ,y)\) is L-Lipschitz continuous. Also, assume the regularizer \(\varOmega : {\mathcal {H}}\times {\mathcal {H}} \mapsto {\mathbb {R}}_+\) is a non-negative and \(\lambda \)-strongly convex function in its first argument w.r.t. a norm \(||\cdot ||\). Let the source model \(h_p\), a linear combination of previous models, be given. Then, for any \(\delta > 0\), with probability at least \(1-\delta \), we have,Footnote 2

$$\begin{aligned} R(h_k) - {\hat{R}}(h_k) \le L\sqrt{\frac{\epsilon _1}{m}} + 3\sqrt{\frac{\epsilon _2\log (1/\delta )}{m}}+ \frac{3M\log (1/\delta )}{4m}, \end{aligned}$$

where \(\epsilon _1 = \frac{2B^2 R_p}{\lambda \mu }\) and \(\epsilon _2 = \frac{M}{4} R_p + 2LM\sqrt{\frac{\epsilon _1}{m}}\). Besides, \(B = \sup _{\mathbf{x }\in {\mathcal {X}}} ||\mathbf{x }||_\star \) and \(R_p = R(h_p) = {\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k}[\ell (h_p(\mathbf{x }),y)]\), representing the risk of the reused model on the underlying distribution of the current data.

To better present the result, we keep only the leading terms w.r.t. m and \(R_p\), and obtain

$$\begin{aligned} R(h_k) - {\hat{R}}(h_k) = O\left( \frac{{\tilde{\epsilon }}_1}{\sqrt{m}} + \frac{{\tilde{\epsilon }}_2}{m}\right) , \end{aligned}$$
(4)

where \({\tilde{\epsilon }}_1 = \left( \sqrt{\frac{R_p}{\lambda \mu }} +\root 4 \of {\frac{R_p}{\lambda \mu m}}\right) \) and \({\tilde{\epsilon }}_2 = \left( \sqrt{\frac{1}{\lambda }} + \root 4 \of {\frac{1}{\lambda }} \right) \).

Remark 2

Equation (4) shows that the model returned by the ModelUpdate procedure enjoys an \(O(1/\sqrt{m})\) generalization bound under certain conditions. In particular, when the source model (a weighted combination of previous models) is sufficiently good, that is, when it has a small risk on the current distribution \({\mathcal {D}}_k\) (i.e., \(R_p\rightarrow 0\)), we obtain an O(1 / m) bound, a fast-rate convergence guarantee. Such a result is highly desirable because the number of data items in each epoch is typically limited. Theorem 1 thus theoretically justifies the effectiveness of the model reuse mechanism, which incorporates the knowledge of previous models when learning the new model on the current data: this mechanism can significantly reduce the sample complexity, especially when previous models are reused properly. Meanwhile, the WeightUpdate mechanism supplies a satisfying weight distribution for reusing previous models, because it guarantees that more weight concentrates on the better-fit models (see Observation 1).

Remark 3

It is noteworthy that the generalization analysis does not require the loss function to be convex; we only assume boundedness and Lipschitz continuity (in Theorem 1) for the loss function, along with a strong convexity condition for the regularizer. These conditions are easily satisfied by common models. For example, in SVMs we use \(\ell _2\) regularization, which is 2-strongly convex, and the hinge loss \(\ell (z,y)=[1-yz]_+\), which is 1-Lipschitz continuous. We remark that the boundedness of the loss function can be achieved with a bounded model space and bounded training items. This is a general assumption that also appears in the analysis of support vector machines (Mohri et al. 2018, Theorem 5.10), online learning (Cesa-Bianchi and Lugosi 2006, Theorem 2.2) and learning to rank (Mohri et al. 2018, Theorem 10.1 and Corollary 10.2). The limitation of constraining the model space could possibly be relaxed by the technique of localized complexity (Bartlett et al. 2005; Sridharan et al. 2008), which will be investigated in future work.

Remark 4

The main techniques in the proof are inspired by Kuzborskij and Orabona (2017), but we differ in two aspects. First, their analysis requires a smoothness condition on the loss function, so their results do not apply under our conditions. Second, we extend the analysis of model reuse to multi-class scenarios; the results are presented in Sect. 5 as an independent part for a clear presentation. In addition, the work of Reddi et al. (2015) utilizes the biased regularization technique to reduce the variance of the importance sampling estimator in covariate shift scenarios, along with a theoretical analysis. We remark that their results focus on the deviation of the expected risk between the obtained model and the vanilla importance sampling estimator, and do not give the generalization error of the obtained model, i.e., \(R(h) - {\hat{R}}(h)\). More importantly, their results do not exhibit fast-rate convergence when the reusable model is good. In contrast, we present a fast-rate generalization error analysis, theoretically justifying the advantage of the model reuse mechanism.

4.1.2 Regret analysis

To proceed to the regret analysis, we need to introduce more notation. Let \(L_T\) be the global cumulative loss on the whole data stream S, namely \(L_T = \sum _{i=1}^T \ell ({\hat{y}}_i,y_i)\). Meanwhile, on epoch \(S_k\), let \(L_{S_k}\) be the cumulative loss suffered by our approach over the data in epoch \(S_k\), and \(L_{S_k}^{(j)}\) the local cumulative loss over the data in epoch \(S_k\) suffered by the previous model \(h_j\),

$$\begin{aligned} L_{S_k} = \sum _{i\in S_k} \ell ({\hat{y}}_i,y_i), \quad L_{S_k}^{(j)} = \sum _{i\in S_k} \ell (h_j(\mathbf{x }_i),y_i). \end{aligned}$$

We adopt the concept of cumulative regret (or simply regret) from online learning (Zinkevich 2003; Cesa-Bianchi and Lugosi 2006; Hazan 2016) as the performance measure, where the regret is defined as the difference between the accumulated loss of the predictions and that of a particular expert.

We demonstrate that our approach suffers a small cumulative loss and is able to benefit from recurring concept drift scenarios.

Theorem 2

(Local Regret, Cesa-Bianchi and Lugosi 2006) Assume that the loss function \(\ell :{\mathbb {R}} \times {\mathcal {Y}} \mapsto {\mathbb {R}}_+\) is convex in its first argument and takes values in [0, 1]. When the step size is set as \(\eta = \sqrt{(8\ln (k-1))/m_k}\), we have

$$\begin{aligned} \mathrm {Regret}_{S_k} = L_{S_k} - \min _{j=1,\ldots ,k-1} L_{S_k}^{(j)} \le \sqrt{(m_k/2) \ln (k-1)}. \end{aligned}$$
(5)

Furthermore, suppose we know that some previous model matches the current data quite well. Then, by setting the step size as \(\eta = \ln (1+\sqrt{2\ln (k-1)/L_{j_k^*}})\), where \(L_{j^*_k} = \min _{j=1,\ldots ,k-1} L_{S_k}^{(j)}\) is the cumulative loss of the best-fit previous model, we have

$$\begin{aligned} \mathrm {Regret}_{S_k} = L_{S_k} - L_{j_k^*} \le \sqrt{2L_{j_k^*} \ln (k-1)} + \ln (k-1). \end{aligned}$$
(6)

Apparently, the quantity \(L_{j_k^*}\) is only available after all \(m_k\) prediction rounds. However, this can be compensated for by the “doubling trick” (Cesa-Bianchi et al. 1997), letting \(\eta \) change according to the current best previous model.

From the regret bounds in Eqs. (5) and (6), we can see that the order of the regret bound is substantially improved from the typical \(O(\sqrt{m_k})\) to \(O\left( \ln k\right) \), independent of the number of data items in the epoch, provided \(L_{j_k^*} \ll \sqrt{m_k}\), that is, provided the cumulative loss of the best-fit previous model is small.

Remark 5

Theorem 2 implies that if the concept of epoch \(S_k\), or a similar one, has emerged previously, our approach enjoys a substantially improved local regret provided a proper (essentially, larger) step size is chosen. The theory accords with our intuition on why model reuse helps on concept drift data streams: in many situations, although the underlying distribution may change over time, concepts can be recurring, i.e., disappear and re-appear (Katakis et al. 2010; Gama and Kosina 2014). The conclusion thus demonstrates that our approach can benefit from such recurring or similar concepts.

4.2 Global analysis

In this part, we investigate the global behavior of the proposed approach, namely, the performance on the whole data stream.

We remark that the local regret analysis in Theorem 2 gives a (static) regret guarantee; that is, the benchmark in Eqs. (5) and (6) is the cumulative loss of a fixed best-fit previous model. This is reasonable since the underlying distribution within each local epoch is considered identical. However, static regret is not suitable for the global analysis. The rationale behind static regret is that the best fixed decision in hindsight is reasonably good over the whole data stream; in non-stationary environments, however, the optimal decision drifts over time. We thus adopt the dynamic regret (Zinkevich 2003; Besbes et al. 2015), denoted by “D-Regret”, as the performance measure: a more stringent metric that compares the predictions to a time-varying comparator sequence.

Theorem 3

(Global Dynamic Regret) Assume that the loss function \(\ell :{\mathbb {R}} \times {\mathcal {Y}} \mapsto {\mathbb {R}}_+\) is convex in its first argument and takes values in [0, 1]. By setting the epoch size (maximal update period) \(p = \lceil \root 3 \of {\frac{\ln K}{2}}\left( \frac{T}{2V_T}\right) ^{2/3} \rceil \) and the step size in epoch \(S_k\) as \(\eta _k = \sqrt{(8\ln K)/m_k}\), we have

$$\begin{aligned} \mathrm {D}\text {-}\mathrm {Regret}_T = \sum _{t=1}^{T} \ell ({\hat{y}}_t,y_t) - \sum _{t=1}^{T} \ell (h_t^*(\mathbf{x }_t),y_t) = O\Big (V_T^{1/3} T^{2/3}\Big ), \end{aligned}$$
(7)

where \(h_t^* \in {{\,\mathrm{arg\,min}\,}}_{h\in H} \ell (h(\mathbf{x }_t),y_t)\) is the optimal classifier at round t. Besides, \(V_T\) is the function variation defined by

$$\begin{aligned} V_T = \sum _{t=2}^{T}\sup _{h\in {{H}}} | \ell (h(\mathbf{x }_{t-1}),y_{t-1}) - \ell (h(\mathbf{x }_{t}),y_{t})|. \end{aligned}$$
(8)

Evidently, the function variation measures the non-stationarity of the data stream.

Remark 6

The regret bound in Theorem 3 is different from traditional (static) regret bounds (Hazan 2016). Essentially, it measures the difference between the global cumulative loss and the sum of the local cumulative losses suffered by the best possible models. The comparator therefore changes dynamically over time, which captures the distribution change in the sequence. It is thus named dynamic regret and is more suitable as a performance measure in non-stationary environments.

Remark 7

The term \(V_T\) in the bound is called the function variation; it essentially characterizes the non-stationarity of the data stream: the more non-stationary the stream, the larger the value of \(V_T\). In this sense, the dynamic regret bound in Eq. (7) adapts to the non-stationarity of the data stream.

Note that the optimal choice of the epoch size p depends on the function variation \(V_T\), which is unfortunately unknown ahead of time. There are at least two ways to eliminate this undesired dependence: if one knows a variation budget \(B_T\) such that \(V_T \le B_T\), then p can be set according to \(B_T\); alternatively, one can appeal to the doubling trick (Cesa-Bianchi et al. 1997) or grid search (Koolen et al. 2014; Zhang et al. 2018) to replace the unknown quantity with empirically attainable ones. We do not go into details, as this exceeds the scope of this paper.
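For a concrete feel of the prescription in Theorem 3, take the illustrative values (ours, not the paper's) \(K = 25\), \(T = 10^4\) and a variation budget \(V_T = 10\); then

$$\begin{aligned} p = \left\lceil \root 3 \of {\frac{\ln 25}{2}} \left( \frac{10^4}{2\times 10}\right) ^{2/3} \right\rceil = \lceil 1.172 \times 63.0 \rceil = 74, \end{aligned}$$

so a more volatile stream (larger \(V_T\)) dictates shorter epochs and thus more frequent model updates.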

5 Multi-class model reuse learning

In this section, we extend the generalization analysis from the binary model reuse learning to the multi-class scenario. All proofs are deferred to “Appendix D”.

We first introduce new notations for a clear presentation, as the notations in multi-class learning scenarios are slightly different from those in the binary case.

Let \({\mathcal {X}}\) denote the input feature space and \({\mathcal {Y}} = \{1,2,\ldots ,c\}\) the target label space. Our analysis acts on the last data epoch \(S_k = \{(\mathbf{x }_1,{y}_1),\ldots ,(\mathbf{x }_{m_k},{y}_{m_k})\}\), a sample of \(m_k\) points drawn i.i.d. according to distribution \({\mathcal {D}}_k\), where \(\mathbf{x }_i\in {\mathbb {R}}^d\) and \({y}_i \in {\mathcal {Y}}\) is a single class from \(\{1,\ldots ,c\}\). Given the multi-class hypothesis set \({\mathcal {H}}\), any hypothesis \(h\in {\mathcal {H}}\) maps from \({\mathcal {X}}\times {\mathcal {Y}}\) to \({\mathbb {R}}\) and makes the prediction by \(\mathbf{x }\mapsto {{\,\mathrm{arg\,max}\,}}_{{y}\in {\mathcal {Y}}} h(\mathbf{x },{y})\). This naturally gives rise to the definition of the margin \(\rho _h(\mathbf{x },{y})\) of the hypothesis h at a labeled instance \((\mathbf{x },{y})\),

$$\begin{aligned} \rho _h(\mathbf{x },{y}) = h(\mathbf{x },{y}) - \max _{y' \ne y} h(\mathbf{x },y'). \end{aligned}$$

Based on the margin and the non-negative loss function \(\ell :{\mathbb {R}} \mapsto {\mathbb {R}}_+\), we can define the risk and empirical risk of a hypothesis h on epoch \(S_k\) as,

$$\begin{aligned} R(h) = {\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k} \left[ \mathbf{1 }_{\rho _h(\mathbf{x },y) \le 0}\right] , \quad {\hat{R}}_S(h) = \frac{1}{m_k}\sum _{i\in S_k} \ell (\rho _h(\mathbf{x }_i,y_i)). \end{aligned}$$

One should be aware that the definition of the loss function in the multi-class scenario is different from that in the binary classification.
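To make the notation concrete, here is a small sketch of the margin and the empirical margin risk for a linear hypothesis \(h_W(\mathbf{x }) = W^{\mathrm {T}}\mathbf{x }\); the hinge-type loss below is one L-regular choice (with \(L = 1\), see Definition 1), and classes are 0-indexed in code, unlike in the text:

```python
# Sketch of the multi-class margin rho_h(x, y) and the empirical
# margin risk hat{R}_S(h) from the definitions above.
import numpy as np

def margin(W, x, y):
    """rho_h(x, y) = h(x, y) - max_{y' != y} h(x, y')."""
    scores = W.T @ x                          # one score per class
    rival = np.max(np.delete(scores, y))      # best competing class
    return scores[y] - rival

def empirical_margin_risk(W, X, Y, loss=lambda t: max(0.0, 1.0 - t)):
    """Average ell(rho) over the sample; ell(t) = [1 - t]_+ satisfies (i)-(iii)."""
    return float(np.mean([loss(margin(W, x, y)) for x, y in zip(X, Y)]))
```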

First, we state the optimization formulation of multi-class biased regularization model reuse learning,

$$\begin{aligned} {\hat{W}} = {{\,\mathrm{arg\,min}\,}}_{W} \left\{ \frac{1}{m} \sum _{i=1}^m \ell \left( \rho _{h_W}(\mathbf{x }_i,y_i)\right) + \mu \varOmega (W, W_p) \right\} , \end{aligned}$$
(9)

where \(\ell :{\mathbb {R}} \mapsto {\mathbb {R}}_+\) is a non-negative loss function, \(h_W(\mathbf{x }) = W^\mathrm {T}\mathbf{x }\), and \(\mu > 0\) is a positive trade-off regularization coefficient. Besides, \(\varOmega : {\mathcal {H}}\times {\mathcal {H}} \mapsto {\mathbb {R}}_+\) is a model reuse regularizer specifying that the final model is built upon the previous models, satisfying \(\varOmega (W_p,W_p) = 0\), with the typical choice \(\Vert W - W_p\Vert _F^2\). Here \(W_p\) is the linear weighted combination of previous models, namely \(W_p = \sum _{j=1}^{k-1} \beta _j {\hat{W}}_j\), where \(\beta _j\) is the weight associated with the previous model \(h_j\), representing its reusability on data in the current epoch.

Apart from non-negativity, we assume the loss function is regular, as defined in the work of Lei et al. (2015).

Definition 1

(Regular Loss, cf. Definition 2 of Lei et al. 2015) We call a loss function \(\ell :{\mathbb {R}} \mapsto {\mathbb {R}}\) L-regular if it satisfies the following properties:

  (i) \(\ell (t)\) bounds the 0-1 loss from above: \(\ell (t) \ge \mathbf{1 }_{t\le 0}\);

  (ii) \(\ell (t)\) is L-Lipschitz continuous, i.e., \(|\ell (t_1) - \ell (t_2)|\le L |t_1 - t_2|\);

  (iii) \(\ell (t)\) is decreasing and has a zero point \(c_{\ell }\), i.e., there exists \(c_{\ell }\) such that \(\ell (c_{\ell }) = 0\).

Our goal is to provide the generalization analysis, namely, to prove the convergence of the risk R(h) to the empirical risk \({\hat{R}}_S(h)\) and to establish the rate. Since \({\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k}[{\hat{R}}_S(h)] \ne R(h)\), we cannot directly utilize concentration inequalities. To make this feasible, we introduce the risk w.r.t. the loss function \(\ell \),

$$\begin{aligned} R_{\ell }(h) = {\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k} \left[ \ell (\rho _h(\mathbf{x },y))\right] . \end{aligned}$$

From property (i) in Definition 1, we know that the risk R(h) is a lower bound of \(R_{\ell }(h)\), that is, \(R(h)\le R_{\ell }(h)\). Thus, it suffices to establish a generalization bound between \(R_{\ell }(h)\) and \({\hat{R}}_S(h)\). Evidently, \({\mathbb {E}}_{(\mathbf{x },y)\sim {\mathcal {D}}_k}[{\hat{R}}_S(h)] = R_{\ell }(h)\), so concentration inequalities apply again.

In the theoretical analysis, we specify the regularizer as the squared Frobenius norm, namely \(\varOmega (W, W_p) = ||W - W_p||_F^2\), and provide the following generalization error bound.

Theorem 4

Let \({\mathcal {H}} \subseteq {\mathbb {R}}^{{\mathcal {X}}\times {\mathcal {Y}}}\) be a hypothesis set with \({\mathcal {Y}} = \{1,2,\ldots ,c\}\). Assume that the non-negative loss function \(\ell :{\mathbb {R}} \mapsto {\mathbb {R}}_+\) is L-regular and M-bounded. Let the source model \(h_p\), a linear combination of previous models, be given. Then, for any \(\delta > 0\), with probability at least \(1-\delta \), the following holds,Footnote 3

$$\begin{aligned} R(h_{{\hat{W}}}) - {\hat{R}}_S(h_{{\hat{W}}}) \le 2L c^2 \sqrt{\frac{\epsilon _1}{ m}} + 3\sqrt{\frac{\epsilon _2\log (1/\delta )}{4m}}+ \frac{3M\log (1/\delta )}{4m}. \end{aligned}$$

where \(\epsilon _1 = \frac{B^2R_p}{2\mu }\) and \(\epsilon _2 = M\left( 8L c^2 \sqrt{\frac{\epsilon _1}{m}}+R_p\right) \). Besides, \(B = \sup _{\mathbf{x }\in {\mathcal {X}}} \Vert \mathbf{x }\Vert _2\).

To better present the result, we keep only the leading terms w.r.t. m and \(R_p\), and obtain

$$\begin{aligned} R(h_k) - {\hat{R}}_S(h_k) = O\left( \frac{c^2}{\sqrt{m}}\left( \sqrt{\frac{R_p}{\mu }} +\root 4 \of {\frac{R_p}{\mu m}}\right) + \frac{1}{m}\root 4 \of {\frac{1}{\mu }}\right) , \end{aligned}$$
(10)

where \(R_p = R_{\ell }(h_p) = {\mathbb {E}}_{(\mathbf{x },y) \sim {\mathcal {D}}_k}[\ell (\rho _{h_p}(\mathbf{x },y))]\) represents the risk of the reused model on the current distribution.

Remark 8

From Theorem 4, we can see that the main result and conclusion in the multi-class case are very similar to those in the binary case. Equation (10) shows that Condor enjoys an \(O(1/\sqrt{m})\) generalization bound, consistent with common learning guarantees. More importantly, Condor enjoys an O(1 / m) fast-rate generalization guarantee when \(R_p\rightarrow 0\), namely when the previous model \(h_p\) is highly “reusable” for the current data. This shows the effectiveness of the ModelUpdate module, which reuses previous models to help build the new model in multi-class scenarios: since the number of data items in each epoch is usually limited, the model reuse mechanism can significantly reduce the sample complexity by utilizing previous models as the basis when constructing the new model.

Finally, it is noteworthy that the current generalization error bound admits a quadratic dependence on the number of classes c; one might further sharpen this to a square-root dependence by applying the vector Rademacher complexity technique (Maurer 2016) along with a more refined analysis.

6 Experiments

In this section, we first present the experimental results on both synthetic and real-world concept drift datasets (Sect. 6.1). Then, we provide empirical support for the effectiveness of the model reuse mechanism (Sect. 6.2). Next, we justify the efficacy of the weight update mechanism through empirical studies of the weight concentration phenomenon (Sect. 6.3) and experiments on recurring concept drift datasets (Sect. 6.4). Finally, we conduct a parameter study in Sect. 6.5.

6.1 Results on synthetic and real-world datasets

Contenders We compare with two classes of state-of-the-art concept drift approaches. The first is the ensemble category, including (a) \({\mathtt {Learn}}^{++}.{\mathtt {NSE}}\) (Elwell and Polikar 2011), (b) \({\mathtt {DWM}}\) (Kolter and Maloof 2003, 2007), (c) \({\mathtt {AddExp}}\) (Kolter and Maloof 2005) and (d) \({\mathtt {Hybrid Forest}}\) (Rad and Haeri 2019). The second is the transfer category, including (e) \({\mathtt {DTEL}}\) (Sun et al. 2018) and (f) \({\mathtt {TIX}}\) (Forman 2006). Essentially, DTEL and TIX also adopt the ensemble idea; we classify them into the transfer category to highlight their model transfer strategies.

Settings In the experiments, we set the maximum update period (epoch size) to \(p = 50\) and the model pool size to \(K=25\).Footnote 4 In case of abrupt changes, we choose the ADWIN algorithm (Bifet and Gavaldà 2007) as the drift detector \({\mathfrak {D}}\), with the default parameter setting reported in its paper.

Table 1 Basic statistics of datasets with concept drift
Fig. 2 Holdout accuracy comparisons on three synthetic datasets

Synthetic Datasets Since it is not realistic to know detailed concept drift information, such as the start and the end of a change, for real-world datasets in advance, we employ six widely used synthetic datasets built from SEA, CIR, SIN, and STA and their variants. Besides, another six synthetic datasets are also adopted: 1CDT, 1CHT, UG-2C-2D, UG-2C-3D, UG-2C-5D, and GEARS-2C-2D. Table 1 summarizes their basic statistics; detailed dataset information is provided in “Appendix E.1”.

We plot holdout accuracy comparisons on three synthetic datasets: SEA200A, SEA200G, and SEA500G. The holdout accuracy is calculated over testing data generated from the same distribution as the training data at each time stamp. Following Sun et al. (2018), we manually split the time horizon of each dataset into 120 epochs for a clear presentation, and all algorithms perform a model update after each epoch ends. Note that each synthetic dataset has four stages; in other words, the distribution changes three times, with the decision boundary changing every 30 epochs. From Fig. 2, we can see that the accuracy of all approaches drops when an abrupt concept drift occurs. Nevertheless, Condor remains relatively stable, recovers rapidly as new data items arrive, and finally achieves the highest accuracy among all approaches, which validates its effectiveness.

Real-world Datasets We adopt ten real-world datasets: Usenet-1, Usenet-2, Luxembourg, Spam, Email, Weather, GasSensor, Powersupply, Electricity, and Covertype. The number of data items varies from 1500 to 581,012, and the number of classes varies from 2 to 6. Detailed descriptions are provided in “Appendix E.1”. We run all experiments for five trials and report the mean and standard deviation of the predictive accuracy in Table 2, where predictive accuracy is the ratio of the number of rounds with correct predictions to the time horizon T. This measure reflects the average performance of an algorithm over the whole data stream. Apart from the real-world datasets, the synthetic datasets are also included.

Table 2 shows that Condor outperforms the other contenders. It achieves the best performance on 16 out of 22 datasets in total and ranks second on four others. In particular, Condor performs significantly better than the other approaches on all ten real-world datasets. The reason Condor performs poorly on two synthetic datasets (CIR500G and SIN500G) is that these datasets are highly nonlinear (generated by a circle and a sine function, respectively). This can be addressed by a non-linear mapping: adopting the random feature technique to linearize these two datasets improves the accuracy from 68.41 ± 0.87 to 85.14 ± 0.08 on CIR500G, and from 65.68 ± 0.12 to 73.59 ± 0.43 on SIN500G. These results show the superiority of our proposed approach.

From the win/tie/loss counts summarized in the last row of Table 2, we can see that DWM is the most competitive contender. We now discuss its complexity: in each epoch, its space complexity is \(O(d(K+p))\) and its time complexity is \(O(dp^2 + dpK)\). Since there is typically a limited number of data items in each epoch, the complexity of DWM is basically comparable with that of Condor.

Table 2 Performance comparisons on synthetic and real-world datasets

6.2 Effectiveness of model reuse mechanism

In the theoretical analysis presented in the previous sections, we demonstrated that the WeightUpdate mechanism leads to the weight concentration phenomenon, which is highly useful for adaptively reusing previous models according to their reusability and thus provides a good combination of previous models. Theorems 1 and 4 further show that the model reuse mechanism is capable of exploiting such an initialization, in both binary and multi-class classification. We now validate this effectiveness via empirical studies.

Fig. 3 Performance comparisons (in predictive accuracy) of Condor with/without the model reuse mechanism

Figure 3 shows the performance comparison between Condor and Condor without model reuse (that is, directly training a new model without reusing previous models). We observe that the model reuse mechanism does help, especially on the “difficult” datasets where Condor without model reuse achieves only around 60% accuracy, whereas the model reuse mechanism brings at least a ten percent accuracy improvement.

As shown later in the parameter study (Sect. 6.5), the default epoch size p is set to 50; that is, each epoch contains at most 50 data items for training. It would therefore be rather undesirable to directly train a new model on such a limited amount of data. Theorems 1 and 4 reveal that the model reuse mechanism reduces the sample complexity when provided with a proper initialization; the empirical results accord with this theoretical insight.

6.3 Effectiveness of weight concentration mechanism

We examine the weight concentration phenomenon on the real-world dataset Emailing list, whose detailed concept drift information is shown in Table 3. Concept drift happens every 300 rounds, in a recurring manner.

Let us consider the epoch covering rounds 1200–1500 (epoch 5) and focus on the weights of the previous models \(h_1,h_2,h_3,h_4\), namely \(\beta _1,\beta _2,\beta _3,\beta _4\). The concept of epoch 5 has emerged previously, i.e., it is the same as the concepts of epoch 1 and epoch 3. Thus, we expect the weight distribution to concentrate on \(\beta _1\) and \(\beta _4\). As shown in Table 4, the empirical results confirm that the weights \(\beta _1\) and \(\beta _4\) indeed dominate, while \(\beta _2\) and \(\beta _3\) are almost zero. The result validates that the returned weight distribution largely concentrates on those previous models that better fit the current data epoch. Additionally, the results, to some extent, justify why our approach succeeds in recurring concept drift scenarios; a more detailed elaboration is presented in the next subsection.

Table 3 Emailing list dataset: there are three different topics in total; “+” indicates the user is interested in a topic, while “−” indicates not
Table 4 Demonstration of the weight concentration phenomenon on the Emailing list dataset
Table 5 Performance comparisons on two recurring concept drift datasets: Email list and Spam filtering

6.4 Recurring concept drift

In this subsection, we conduct performance comparisons in the recurring concept drift scenario, a specific sub-type of concept drift in which previous concepts may disappear and then re-appear in the future; previous models may therefore be beneficial for future learning. Previous studies show that one needs to consider the recurring structure specifically, otherwise the performance will drop dramatically, even for approaches that deal with gradually evolving concept drift.

Datasets We adopt two popular real-world datasets with recurring concept drift, i.e., the Email list and Spam filtering datasets (Katakis et al. 2008, 2010; Jaber et al. 2013). Both datasets are extracted from email corpora, and the concept is determined by users' personal interests, which change in a recurring manner.

Comparisons The contenders come from the following four categories: (a) sliding window based approaches, including \({\mathtt {SVM}}\)-\({\mathtt {fix}}\) (a batch implementation by SVM) and \({\mathtt {NB}}\)-\({\mathtt {sw}}\) (updates using only the data in the sliding window, based on incremental Naive Bayes); (b) ensemble based approaches, including \({\mathtt {Learn}}^{{\mathtt {++}}}.{\mathtt {NSE}}\), \({\mathtt {DWM}}\) and \({\mathtt {AddExp}}\); (c) transfer learning based approaches, including \({\mathtt {TIX}}\) and \({\mathtt {DTEL}}\); and (d) recurring approaches, specifically designed for recurring concept drift scenarios, including the Conceptual Clustering and Prediction (\({\mathtt {CCP}}\)) approach, an ensemble method that handles recurring concept drift via similarity clustering (Katakis et al. 2010), as well as Dynamic Adaptation to Concept Change (\({\mathtt {DACC}}\)) and its adaptive variant \({\mathtt {ADACC}}\), which detect recurring concept drift based on a new second-order online learning mechanism (Jaber et al. 2013). Since the codes of CCP, DACC and ADACC are not available, we directly use the results reported in their papers; following their settings, we use the whole dataset without any random splitting.

Results Table 5 reports the experimental results. Condor exhibits encouraging performance on both datasets across three different performance measures: it performs significantly better than the general concept drift approaches and is comparable to, or even better than, the approaches designed explicitly for the recurring concept drift scenario.

The effectiveness of Condor on recurring datasets stems from the weight concentration effect, since our approach guarantees that the weights concentrate on the best-fit previous models (see Observation 1).

Fig. 4 Parameter study on different real-world datasets

6.5 Parameter study

In this part, we study the influence of the parameters. There are four of them: the epoch size p, the model pool size K, the regularization coefficient \(\mu \) and the step size \(\eta \). We run the experiments five times and plot the mean and standard deviation of the predictive accuracy with respect to different parameter values in Fig. 4.

Epoch Size We vary the epoch size (i.e., maximum update period) p from 25 to 500. Figure 4a shows that the overall performance is relatively stable across different epoch sizes. However, for some highly non-stationary datasets (particularly email and electricity), the accuracy curve drops significantly with a larger epoch size, because the data within an overly large epoch are no longer i.i.d., making the model trained on them undesirable. On the other hand, although a small epoch size leads to more timely updates, the model trained in each epoch suffers from insufficient data. In fact, the epoch size is a data-dependent parameter, reflecting the inherent extent of fluctuation; its setting may also be subject to requirements of real-world applications. In this paper, we set the default epoch size p to 50.

Model Pool Size We vary the model pool size K from 5 to 50. Figure 4b shows that the predictive accuracy rises as the model pool size increases and then plateaus, gaining little from an even larger pool. Although a larger K might still be benign for performance, the memory cost also increases significantly with it. We set the default model pool size K to 25.

Regularization Coefficient We vary the regularization coefficient \(\mu \) from \(2\times 10^{-6}\) to \(2\times 10^{6}\). Figure 4c shows that with a relatively large \(\mu \), all datasets essentially achieve their best performance and are insensitive to the exact value. This agrees with intuition, since \(\mu \) trades off the empirical loss against the biased regularization: a larger value places more importance on reusing models, i.e., it exploits more information from previous models when building the new one. Hence, the result again implies the effectiveness of model reuse. We set the default regularization coefficient \(\mu \) to 200.

Step Size We vary the step size \(\eta \) from 0 to 1. From Fig. 4d, we can see that when the step size is relatively large, say larger than 0.5, the performance is satisfying and stable. This matches the theoretically suggested value from Theorem 2, \(\eta _{\texttt {theory}} = \sqrt{8\ln (K)/p} = \sqrt{8\ln (25)/50} \approx 0.718\), where the model pool size K is 25 and the epoch size p is 50 by default. We set the default step size \(\eta \) to 0.75.

7 Conclusion

In this paper, a novel and effective approach, Condor, is proposed for handling concept drift via model reuse. It consists of two key modules, \({\mathtt {ModelUpdate}}\) and \({\mathtt {WeightUpdate}}\). Specifically, the ModelUpdate mechanism reuses previous models in a weighted manner to train a new model on the current data, while the WeightUpdate mechanism adaptively adjusts the weights of previous models according to their performance. By the generalization analysis, we prove that the model reuse mechanism helps when previous models are properly reused. Through the regret analysis, we show that the weights eventually concentrate on the better-fit models and thus provide a good weighted combination of previous models as the initialization for the ModelUpdate mechanism. Moreover, our approach enjoys an \(O(T^{2/3}V_T^{1/3})\) dynamic regret, where T is the length of the stream and \(V_T\) is the function variation, representing the non-stationarity of the data stream. Empirical results demonstrate the superiority of our approach over the compared methods on both synthetic and real-world datasets.

In the future, it would be interesting to incorporate more techniques from model reuse learning into handling concept drift. Another interesting direction is to incorporate a Condor-like approach into the recently proposed abductive learning (Zhou 2019), a new paradigm that leverages both machine learning and logical reasoning, so as to enable it to handle changing concepts and predicates.