
1 Introduction

Learning from data streams has been a topic of active research in recent years [11]. In this branch of machine learning, systems are sought that learn incrementally, and maybe even in real time, on a continuous and potentially unbounded stream of data, and that are able to properly adapt themselves to changes of environmental conditions or properties of the data-generating process. Systems with these properties have already been developed for different machine learning and data mining tasks, such as clustering and classification.

An extension of machine learning methods to the setting of data streams comes with a number of challenges. In particular, the standard batch mode of learning, in which the entire data as a whole is provided as an input to the learning algorithm, is no longer applicable. Correspondingly, the data must be processed in a single pass, which implies an incremental mode of learning and model adaptation.

Domingos and Hulten [9] list a number of properties that an ideal stream mining system should exhibit, and suggest corresponding design decisions:

  • the system uses only a limited amount of memory;

  • the time to process a single record is short and ideally constant;

  • the data is volatile, and a single data record is accessed only once;

  • the model produced in an incremental way is equivalent to the model that would have been obtained through common batch learning (on all data records seen so far);

  • the learning algorithm reacts to concept change (i.e., any change of the underlying data generating process) in a proper way and maintains a model that always reflects the current concept.

Rule-based learning is an especially popular approach in the realm of data streams, not only in the machine learning but also in the computational intelligence community, where it has been studied under the notion of “evolving fuzzy systems” [21]. In this paper, we develop a method that combines the strengths of two existing approaches for regression on data streams rooted in different learning paradigms. Our method induces a set of fuzzy rules, which, compared to conventional rules with Boolean antecedents, has the advantage of producing smooth regression functions. To do so, it makes use of an induction technique inspired by AMRules, a very efficient and effective learning algorithm that yields state-of-the-art performance for regression on data streams.

The rest of the paper is organized as follows. Following a review of related work, we introduce our method in Sect. 3. A comprehensive experimental study, in which the method is compared to several competitors, is presented in Sect. 4, prior to concluding the paper in Sect. 5.

2 Related Work

In the past ten years, learning from data streams has been considered for different learning tasks. Approaches to supervised learning have mostly focused on classification. Here, the Hoeffding tree method [8] has gained a lot of attention, and meanwhile, many modifications and improvements of the original method have been proposed [6]. In addition to the induction of decision trees, the learning of systems of decision rules is supported by several approaches, such as the Adaptive Very Fast Decision Rules (AVFDR) classifier [17]. AVFDR can be seen as an extension of the Very Fast Decision Rules (VFDR) classifier [12], which incrementally induces a compact set of decision rules from a data stream.

Regression on data streams has gained less attention than classification, with a few notable exceptions. AMRules [1] can be seen as an extension of AVFDR to the case of numeric target values. Another approach is based on the induction of model trees [15]. Moreover, regression on data streams has been studied quite extensively in the computational intelligence and fuzzy systems community [2, 4, 21]. Specifically relevant for us is FLEXFIS [20], which learns a system of so-called Takagi-Sugeno-Kang (TSK) rules [25]. In the following, we describe these methods in some more detail.

FIMTDD (Fast Incremental Model Trees with Drift Detection) is an approach for inducing model trees for regression on data streams. Similar to Hoeffding trees, it uses Hoeffding’s inequality [14] for choosing the best splitting attribute. Since FIMTDD tackles regression problems, attributes are evaluated in terms of the reduction of the target attribute’s standard deviation. Each leaf node of the induced tree is associated with a linear function, which is learned (using stochastic gradient descent) from the instances that fall into that leaf node.

AMRules (Adaptive Model Rules) learns rules that are specified by a conjunction of literals on the input attributes in the premise part, and a linear function of the attributes in the consequent. The latter is chosen so as to maximize predictive accuracy in the sense of minimizing the root mean squared error. Adaptive statistical measures are maintained in each rule in order to describe the instance subspace covered by that rule. Each rule is initialized with a single literal and successively expanded with new literals. The best literal to be added, if any, is chosen on the basis of Hoeffding’s bound, in a manner that is similar to the expansion of a Hoeffding tree. In their paper [1], the authors distinguish between decision lists and unordered rule sets and, correspondingly, propose two different update and prediction schemes. The first one keeps the rules in the order in which they were learned; only the first rule covering an example is used for prediction and updated afterward. The second strategy updates all the rules that cover an example, and combines these rules’ predictions by a weighted sum. The authors also show that the latter strategy outperforms the former one, and hence use it for the rest of their study.

FLEXFIS (Flexible Fuzzy Inference Systems) induces a set of fuzzy rules, making use of fuzzy logic as a generalization of conventional (Boolean) logic [20]. More specifically, it uses so-called Takagi-Sugeno-Kang (TSK) rules that are defined by a fuzzy predicate in the premise part and a linear function of the input features in the consequent. As a result, an instance can be covered by a rule to a certain degree, reflecting a degree of relevance of the rule in the corresponding part of the instance space. Correspondingly, the prediction of a TSK system is produced by a weighted average of the outputs of the individual rules. The regions covered by the rules in the input space are defined by means of clustering methods: The instances in the training data are first clustered, and the fuzzy-logical predicates in the rule antecedents are obtained by projecting the clusters onto the individual dimensions of the input space. For learning on data streams, clustering is done in an incremental way. Moreover, the functions in the rule consequents are adapted using recursive weighted least squares (RWLS) estimation [19].

Our approach essentially seeks to combine the increased expressiveness of fuzzy rules as used by methods such as FLEXFIS, which allows for approximating a regression function in a smoother and much more flexible way, with the efficiency and effectiveness of rule induction techniques such as AMRules. In fact, existing methods for learning fuzzy rules, including FLEXFIS and eTS+ [25], are usually slow and computationally inefficient. The complexity is mainly caused by the use of clustering methods, which have the additional disadvantage of producing rules that always involve all input attributes, as well as by the costly matrix operations (such as inversion) required by RWLS.

3 The TSK-Streams Learning Algorithm

Our method, called TSK-Streams, is an adaptive incremental rule induction algorithm for regression on data streams. The model produced by TSK-Streams is a so-called Takagi-Sugeno-Kang (TSK) fuzzy system [25], a type of rule-based system that is widely used in the fuzzy logic community.

3.1 TSK Fuzzy Systems

A TSK rule \(R_i\) is a fuzzy rule of the following form:

$$\begin{aligned}&\text {IF} \; \quad (x_1 \; \text {IS} \; A_{i,1}) \; \text {AND} \quad \ldots \quad \text {AND} \; (x_d \; \text {IS} \; A_{i,d}) \nonumber \\&\text {THEN} \quad l_i(\varvec{x}) = w_{i,0}+w_{i,1}x_1+w_{i,2}x_2+ \ldots +w_{i,d}x_d , \end{aligned}$$
(1)

where \((x_1, \ldots , x_d)^\top \) is the feature representation of an instance \(\varvec{x} \in \mathbb {R}^d\), and \(A_{i,j}\) defines the jth antecedent of \(R_i\) in terms of a soft constraint. The coefficients \(w_{i,0}, \ldots , w_{i,d} \in \mathbb {R}\) in the consequent part of the rule specify an affine function of the features (input attributes).

Modeling the soft constraint \(A_{i,j}\) in terms of a fuzzy set with membership function \(\mu _{j}^{(i)}: \, \mathbb {R} \longrightarrow [0,1]\), the truth degree of the predicate \((x_j \; \text {IS} \; A_{i,j})\) is given by \(\mu _{j}^{(i)}(x_j)\), that is, the degree of membership of \(x_j\) in \(\mu _{j}^{(i)}\). Moreover, modeling the logical conjunction in terms of a triangular norm \(\top \) [16], i.e., an associative, commutative, non-decreasing binary operator \(\top :\, [0,1]^2 \longrightarrow [0,1]\) with neutral element 1 and absorbing element 0, the overall degree to which an instance \(\varvec{x}\) satisfies the premise of the rule \(R_i\) is given by

$$\begin{aligned} \mu _i(\varvec{x}) = \top \left( \mu _{1}^{(i)}(x_1) , \ldots , \mu _{d}^{(i)}(x_d) \right) . \end{aligned}$$
(2)

In the following, we will adopt the simple product norm, i.e., \(\top (u,v)=uv\). Note that \(A_{i,j}\) could be an empty constraint, which is modeled by \(\mu _{j}^{(i)} \equiv 1\); this means that the jth attribute \(x_j\) effectively does not occur in the premise of the rule (1).

Now, consider a TSK system consisting of C rules \(RS = \{ R_1, \ldots , R_C \}\). Given an instance \(\varvec{x}\) as an input, each rule \(R_i\) is supposed to “fire” with the (activation) degree (2). Correspondingly, the output produced by the system is defined in terms of a weighted average of the outputs produced by the individual rules:

$$\begin{aligned} \hat{y} = \sum _{i=1}^{C} \varPsi _i(\varvec{x}) \cdot l_i(\varvec{x}) , \end{aligned}$$
(3)

where

$$\begin{aligned} \varPsi _i(\varvec{x}) = \frac{\mu _i(\varvec{x})}{\sum _{j=1}^{C} \mu _j(\varvec{x})} . \end{aligned}$$
(4)
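To make Eqs. (1)–(4) concrete, the following is a minimal Python sketch of a TSK rule and the weighted-average prediction of a TSK system; the class and function names are illustrative and not part of the method’s actual implementation, and membership functions are assumed to be given as callables.

```python
import numpy as np

class TSKRule:
    """One TSK rule: a membership function per feature and an affine consequent."""

    def __init__(self, memberships, weights):
        # memberships: list of callables mu_j(x_j) -> [0, 1]; None encodes an empty constraint
        # weights: consequent coefficients [w_0, w_1, ..., w_d]
        self.memberships = memberships
        self.weights = np.asarray(weights, dtype=float)

    def activation(self, x):
        # Degree to which x satisfies the premise, using the product t-norm, cf. Eq. (2)
        degree = 1.0
        for mu, x_j in zip(self.memberships, x):
            if mu is not None:
                degree *= mu(x_j)
        return degree

    def output(self, x):
        # Affine consequent l_i(x) = w_0 + w_1 x_1 + ... + w_d x_d, cf. Eq. (1)
        return self.weights[0] + float(np.dot(self.weights[1:], x))


def tsk_predict(rules, x):
    """Weighted-average prediction of a TSK system, cf. Eqs. (3) and (4)."""
    activations = np.array([r.activation(x) for r in rules])
    total = activations.sum()
    # If no rule fires at all, fall back to an unweighted average of the consequents
    psi = activations / total if total > 0 else np.full(len(rules), 1.0 / len(rules))
    return float(np.dot(psi, [r.output(x) for r in rules]))
```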

3.2 Online Rule Induction

TSK-Streams learns rules incrementally, starting with a default rule. This rule has an empty premise and covers the entire input space.

For each rule \(R_i\), TSK-Streams continuously checks whether one of its extensions may improve the performance of the current system. Here, expanding a rule \(R_i\) means splitting it into two new rules, which are obtained by adding, respectively, a new predicate \((x_j \text { IS } A_{i,j})\) and \((x_j \text { IS } \lnot A_{i,j})\) as an additional antecedent. Considering the current rule as the default, the former defines a specialization, while the latter can be seen as what remains of this default. \(A_{i,j}\) is modeled in terms of a fuzzy set with membership function \(\mu _{j,l}^{(i)}\), which is chosen from a fuzzy partition \(\{ \mu _{j,1} , \ldots , \mu _{j,k} \}\) of the domain of feature \(x_j\) (cf. Sect. 3.3 below), and its negation \(\lnot A_{i,j}\) is characterized by the membership function \(\bar{\mu }_{j,l}^{(i)} = 1 - \mu _{j,l}^{(i)}\). We denote the corresponding expansions by \(R_i \oplus \mu _{j,l}^{(i)}\) and \(R_i \oplus \bar{\mu }_{j,l}^{(i)}\), respectively.

We distinguish between features \(x_j\) that are included by a positive literal \((x_j \in \mu _{j,l}^{(i)})\) and those included by a negative literal \((x_j \in \bar{\mu }_{j,l}^{(i)})\), collecting the indices of the former in the index set I and those of the latter in \(\bar{I}\). In a single rule, each attribute is only allowed to occur in a single positive literal. Negative literals may be added as long as the conjunction of the constraints on \(x_j\) does not become too restrictive, which would suggest a kind of inconsistency (no single value of \(x_j\) would satisfy the rule premise to a high degree). Details of the rule expansion procedure are given in pseudocode in Algorithm 1.

[Algorithm 1: rule expansion procedure (pseudocode not reproduced here)]
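As an illustration of the expansion step, here is a simplified sketch; it is not the authors’ Algorithm 1, it reuses the hypothetical TSKRule class sketched in Sect. 3.1, conjoins constraints on the same feature via the product t-norm, and omits the bookkeeping of the index sets \(I\) and \(\bar{I}\).

```python
import copy

def expand_rule(rule, j, mu):
    """Candidate expansions R ⊕ mu_{j,l} and R ⊕ (1 - mu_{j,l}) on feature j."""
    conj = lambda f, g: (lambda v: f(v) * g(v))   # conjunction of constraints (product t-norm)
    neg_mu = lambda v: 1.0 - mu(v)                # negated fuzzy set

    pos, neg = copy.deepcopy(rule), copy.deepcopy(rule)
    old = rule.memberships[j]
    pos.memberships[j] = mu if old is None else conj(old, mu)
    neg.memberships[j] = neg_mu if old is None else conj(old, neg_mu)
    return pos, neg
```

In TSK-Streams, both candidates replace the original rule only if the expansion is accepted by the statistical test described in Sect. 3.5.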

3.3 Online Discretization and Fuzzification

As a basis for rule expansion, TSK-Streams maintains a fuzzy partition for each feature \(x_j\), i.e., a discretization of the domain of \(x_j\) into a finite number of (overlapping) fuzzy sets \(\{ \mu _{j,1}, \ldots , \mu _{j,k} \}\).

The discretization process is based on the Partition Incremental Discretization (PID) proposed by Gama and Pinto [13]. PID is a technique that builds histograms on data streams in an adaptive manner. In a first layer, continuous input values produced by the data stream are grouped into intervals. A second layer then uses the intervals of the first layer to build histograms, using either equal frequency or equal width binning. In this work, we extend the PID approach as follows:

  • Layer 1: This layer discretizes and summarizes the values observed for one input feature into an initial set of intervals.

  • Layer 2: This layer merges or splits intervals of the first layer, with the goal of creating intervals of (approximately) equal frequency.

  • Layer 3: This layer transforms the second layer’s intervals, which are of the form \(X_{j,l} = [b,c]\), into fuzzy sets \(\mu _{j,l}\). We employ fuzzy sets with a core \([b,c]\), in which the degree of membership is 1, and support \([a,d] \supset [b,c]\); outside the support, the membership is 0. The boundary of the fuzzy set \(\mu _{j,l}\) is modeled in terms of a smooth “S-shaped” transition between full and zero membership:

$$\begin{aligned} \mu _{j,l}(x)= \left\{ \begin{array}{cl} 0 &{} \text {if } x < a \\ 2 \left( \frac{x-a}{b-a} \right) ^2 &{} \text {if } a \le x < (a+b)/2 \\ 1 - 2 \left( \frac{b-x}{b-a} \right) ^2 &{} \text {if } (a+b)/2 \le x < b \\ 1 &{} \text {if } b \le x \le c \\ 1 - 2 \left( \frac{x-c}{d-c} \right) ^2 &{} \text {if } c < x \le (c+d)/2 \\ 2 \left( \frac{d-x}{d-c} \right) ^2 &{} \text {if } (c+d)/2 < x \le d \\ 0 &{} \text {if } x > d \end{array} \right. . \end{aligned}$$
(5)

For the fuzzy set \(\mu _{j,l}\) associated with \(X_{j,l}\), we set \(a=b- \alpha \cdot |X_{j,l-1}|\) and \(d=c + \alpha \cdot |X_{j,l+1}|\), where \(|X_{j,l-1}|\) and \(|X_{j,l+1}|\) denote, respectively, the lengths of the left and right neighbor interval of \(X_{j,l}\), and \(\alpha \,\in \, ]0,1[\) is an overlap degree. For the leftmost (rightmost) interval of the partition, we set \(a=b\) (\(d=c\)).
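A minimal sketch of this third layer, i.e., turning an equal-frequency interval into the fuzzy set (5); the function names and the interval representation are illustrative assumptions, not the paper’s implementation.

```python
def s_shaped_membership(a, b, c, d):
    """Fuzzy set with core [b, c] and support [a, d], cf. Eq. (5)."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:                                   # rising left shoulder on [a, b]
            if x < (a + b) / 2:
                return 2 * ((x - a) / (b - a)) ** 2
            return 1 - 2 * ((b - x) / (b - a)) ** 2
        if x <= (c + d) / 2:                        # falling right shoulder on [c, d]
            return 1 - 2 * ((x - c) / (d - c)) ** 2
        return 2 * ((d - x) / (d - c)) ** 2
    return mu


def fuzzify_interval(intervals, l, alpha=0.15):
    """Turn the l-th interval [b, c] of a feature's partition into a fuzzy set."""
    b, c = intervals[l]
    length = lambda iv: iv[1] - iv[0]
    a = b if l == 0 else b - alpha * length(intervals[l - 1])                    # leftmost: a = b
    d = c if l == len(intervals) - 1 else c + alpha * length(intervals[l + 1])   # rightmost: d = c
    return s_shaped_membership(a, b, c, d)
```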

[Algorithms 2 and 3: online discretization and fuzzification (pseudocode not reproduced here)]

3.4 Learning Rule Consequents

FLEXFIS fits linear functions in the consequent parts of the rules via recursive weighted least squares estimation (RWLS) [19]. Since this approach requires multiple matrix inversions, it is computationally expensive. Therefore, inspired by AMRules, we instead apply a gradient method to learn consequents more efficiently.

Upon arrival of a new training instance \((\varvec{x}_t,y_t)\), the squared error of the prediction \(\hat{y}_t\) produced by TSK-Streams can be computed as follows (adopting the convention \(x_{t,0} := 1\)):

$$\begin{aligned} E_t = (y_t - \hat{y}_t)^2 = \left( y_t- \sum _{R_i \in RS} \varPsi _{i}(\varvec{x}_t) \sum _{j=0}^{d} \omega _{i,j} x_{t,j} \right) ^2 \end{aligned}$$
(6)

Invoking the principle of stochastic gradient descent, the coefficients \(\omega _{i,j}\) are then shifted into the negative direction of the gradient:

$$\begin{aligned} \varvec{\omega } \leftarrow \varvec{\omega } - \eta \nabla E_t , \end{aligned}$$
(7)

where \(\eta \) is the learning rate. Component-wise, this yields the following update rule:

$$\begin{aligned} \omega _{i,j} \leftarrow \omega _{i,j} + 2\, \eta \, (y_t - \hat{y}_t)\, \varPsi _i(\varvec{x}_t)\, x_{t,j} \end{aligned}$$
(8)

The process of updating the rule consequents is summarized in Algorithm 4.

[Algorithm 4: update of the rule consequents (pseudocode not reproduced here)]
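The following sketch illustrates one such update step; it is not the authors’ Algorithm 4, it reuses the hypothetical TSKRule sketch from Sect. 3.1 and follows the convention \(x_{t,0} = 1\) from Eq. (6).

```python
import numpy as np

def update_consequents(rules, x_t, y_t, eta=0.01):
    """One stochastic gradient step on all consequent weights, cf. Eqs. (6)-(8)."""
    x_ext = np.concatenate(([1.0], np.asarray(x_t, dtype=float)))   # x_{t,0} := 1
    activations = np.array([r.activation(x_t) for r in rules])
    total = activations.sum()
    if total == 0.0:            # no rule covers the instance; nothing to update
        return
    psi = activations / total                                       # normalized activations, Eq. (4)
    y_hat = float(np.dot(psi, [r.output(x_t) for r in rules]))      # system prediction, Eq. (3)
    error = y_t - y_hat
    for psi_i, rule in zip(psi, rules):
        # the negative gradient of E_t w.r.t. rule i's weights is 2 * error * psi_i * x_ext
        rule.weights += 2.0 * eta * error * psi_i * x_ext
```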

3.5 Adaptation of the Model Structure

As outlined above, TSK-Streams continuously adapts the rule system through the adaptation of fuzzy sets used as rule antecedents and linear functions in the rule consequents. While these are adaptations of the system’s parameters, the decision to replace a rule by one of its expansions can be seen as a structural change.

Needless to say, structural changes should generally be handled with care, especially when increasing the complexity of the model. Therefore, learning methods typically stick to the current model until being sufficiently convinced of a potential improvement through an expansion; to this end, the estimated difference in performance needs to be significant in a statistical sense.

Similar to Hoeffding trees [8], AMRules [10], and FIMTDD [15], we apply Hoeffding’s inequality in order to support these decisions. The Hoeffding inequality probabilistically bounds the difference between the expected value E(X) of a random variable X with support \([a,b] \subset \mathbb {R}\) and its empirical mean \(\bar{X}\) on an i.i.d. sample of size n in terms of

$$\begin{aligned} P \Big ( \vert \bar{X} -\mathrm {E} (X) \vert > \epsilon \Big ) \le \exp \left( -{\frac{2n\epsilon ^{2}}{(b-a)^{2}}}\right) . \end{aligned}$$
(9)

More specifically, we decide to split a rule \(R_i\), i.e., to replace the rule with two rules \(R_i \,\oplus \, \mu _{j,l}^{(i)}\) and \(R_i \, \oplus \, \bar{\mu }_{j,l}^{(i)}\), by considering the reduction in the sum of squared errors (SSE). To this end, the SSE of the current system (rule set RS) is compared to the SSE of all alternative systems \((RS \setminus \{R_i\}) \,\cup \,\{ R_i \, \oplus \,\mu _{j,l}^{(i)}, R_i\,\oplus \,\bar{\mu }_{j,l}^{(i)} \}\). Let \(SSE_{best}\) and \(SSE_{2ndbest}\) denote, respectively, the SSE of the expansion with the lowest and the second lowest error. The best expansion is then adopted whenever

$$\begin{aligned} \frac{SSE_{best}}{SSE_{2ndbest}} < 1 - \epsilon , \end{aligned}$$
(10)

or when \(\epsilon \) becomes smaller than a tie-breaking constant \(\tau \). The threshold \(\epsilon \) is derived from (9) by requiring the bound to hold with confidence \(1- \delta \), i.e., by setting the right-hand side to \(\delta \) and solving for \(\epsilon \), which yields \(\epsilon = \sqrt{\ln (1/\delta )/(2n)}\); since the ratio (10) is bounded in ]0, 1], \(b-a\) is set to 1. Refer to Algorithm 5 for details.
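A compact sketch of this decision rule, with hypothetical function names and the values of \(\delta \) and \(\tau \) used in the experiments of Sect. 4:

```python
import math

def hoeffding_epsilon(n, delta, value_range=1.0):
    """Solve Eq. (9) for epsilon: with prob. at least 1 - delta, |mean - expectation| <= epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def adopt_best_expansion(sse_best, sse_2nd_best, n, delta=0.01, tau=0.05):
    """Adopt the best expansion if the ratio (10) is significantly below 1, or on a near-tie."""
    eps = hoeffding_epsilon(n, delta)          # the ratio lies in ]0, 1], hence b - a = 1
    return sse_best / sse_2nd_best < 1.0 - eps or eps < tau
```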

Instead of looking for a global improvement of the entire system, an alternative is to monitor the performance of individual rules and to base decisions about rule expansion on this performance. In this case, Hoeffding’s bound is applied to the sum of weighted squared errors (SWSE), where the weighted squared error of a rule \(R_i\) on a training example \((\varvec{x}_t, y_t)\) is given by

$$\begin{aligned} WSE_t = \varPsi _i(\varvec{x}_t) (y_t - \hat{y}_t)^2 = \left( \frac{\mu _i(\varvec{x}_t)}{\sum _{R_j \in RS} \mu _j(\varvec{x}_t) } \right) (y_t - \hat{y}_t)^2 . \end{aligned}$$

This error is then compared with the weighted error of the system in which the rule is replaced by extensions \(R_i \oplus \mu _{j,l}^{(i)}\) and \(R_i \oplus \bar{\mu }_{j,l}^{(i)}\). The usefulness of such extensions can be checked using the same kind of hypothesis testing as above.

To avoid an excessive increase in the number of rules, also coming with a danger of overfitting, we propose a penalization mechanism that consists of adding a complexity term C to \(\epsilon \). For the global variant, we set \(C=\frac{1- \log (2)/\log (|RS|)}{\sqrt{d}}\), where RS is the current set of rules and d the number of features. For the local variant, we use \(C=\frac{1- \log (2)/\log (|I \cup \bar{I}|)}{\sqrt{d}}\) when comparing the extensions of a rule \(R=(I,M,\bar{I},\bar{M},\varvec{\omega })\).
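For illustration, the two penalty terms as a small sketch; it assumes at least two rules, respectively at least two literals, so that the logarithm in the denominator does not vanish.

```python
import math

def penalty_global(num_rules, d):
    """Complexity term added to epsilon in the global variant (|RS| >= 2 assumed)."""
    return (1.0 - math.log(2) / math.log(num_rules)) / math.sqrt(d)

def penalty_local(num_literals, d):
    """Complexity term for the local variant, based on |I ∪ Ī| of the rule being expanded."""
    return (1.0 - math.log(2) / math.log(num_literals)) / math.sqrt(d)
```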

[Algorithm 5: rule expansion decision based on the Hoeffding bound (pseudocode not reproduced here)]

3.6 Change Detection

To detect a drop in a rule’s performance, possibly caused by a concept drift, we employ the adaptive windowing (ADWIN) [5] drift detection method. The advantage of this technique, compared to the Page-Hinkley test (PH) [22] used by AMRules, is that ADWIN is non-parametric (it makes no assumptions about the observed random variable). Moreover, it has only one parameter \(\delta _{adwin}\), which represents the tolerance towards false alarms. We apply this change detection method locally, in each rule, to the absolute error committed by that rule on an example, provided the example is covered by the rule.

Upon detecting a drift in the rule \(R_p=(I_p,M_p,\bar{I}_p,\bar{M}_p,\varvec{\omega }_p)\), we find its sibling rule \(R_q=(I_q,M_q,\bar{I}_q,\bar{M}_q,\varvec{\omega }_q)\), from which it differs by only a single literal, i.e., there is a fuzzy set \(\mu _{i,j}\) that satisfies one of the following criteria: \((\mu _{i,j} \in M_p) \wedge (i \in I_p) \wedge (\bar{\mu }_{i,j} \in \bar{M}_q) \wedge (i \in \bar{I}_q)\) or \((\bar{\mu }_{i,j} \in \bar{M}_p) \wedge (i \in \bar{I}_p) \wedge (\mu _{i,j} \in M_q) \wedge (i \in I_q)\). The rule \(R_p\) is then simply removed from the rule set, and its sibling \(R_q\) is updated accordingly, by removing either \((i,\mu _{i,j})\) from \((I_q,M_q)\) or \((i,\bar{\mu }_{i,j})\) from \((\bar{I}_q,\bar{M}_q)\), depending on which of the two criteria is satisfied. If the sibling rule \(R_q\) has already been expanded before the drift is detected, the same procedure is applied to its children.

4 Empirical Evaluation

In this section, we conduct experiments in order to study the performance of TSK-Streams in comparison to other algorithms. More precisely, we analyze predictive accuracy and runtime of the algorithms, the size of the models they produce, as well as their ability to recover in the presence of a concept drift.

4.1 Setup

Our proposed fuzzy learner, TSK-Streams, is implemented within the MOA (Massive Online Analysis) [7] framework, an open-source software environment for mining and analyzing large data sets in a stream-like manner.

In the following evaluations, we compare TSK-Streams with the three methods introduced before: AMRules, FIMTDD, and FLEXFIS. AMRules and FIMTDD are both included in the MOA distribution, and we use them in their default settings with \(\delta =0.01\) and \(\tau =0.05\) for the Hoeffding bound. For the parametrization of TSK-Streams, we use the same values of \(\delta \) and \(\tau \), so as to ensure maximal comparability with AMRules and FIMTDD. For the discretization, we use the following parameters: the number of intervals \(k=5\), the overlapping threshold \(\upsilon =0.2\), the exponential weighting factor \(\lambda = 0.999\), and the overlap degree \(\alpha = 0.15\). FLEXFIS is implemented in Matlab and offers a function for finding optimal parameter values. We used this function to tune all parameters except the so-called “forgetting parameter”, for which we manually found the value 0.999 to perform best.

All experiments are conducted using the test-then-train evaluation procedure; this procedure uses each instance for both training and testing. First, the model is evaluated on the instance, and then a single incremental learning step is carried out.
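A minimal sketch of such a test-then-train (prequential) loop; the model interface and names are illustrative assumptions and do not correspond to MOA’s API.

```python
import math

def test_then_train(model, stream):
    """Test-then-train evaluation: predict on each instance first, then learn from it."""
    squared_error, n = 0.0, 0
    for x_t, y_t in stream:               # stream yields (feature vector, target) pairs
        y_hat = model.predict(x_t)        # evaluate on the yet unseen instance ...
        squared_error += (y_t - y_hat) ** 2
        n += 1
        model.learn(x_t, y_t)             # ... then perform one incremental learning step
    return math.sqrt(squared_error / n)   # RMSE over the whole stream
```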

4.2 Results

In the first part of the evaluation, we perform experiments on standard synthetic and real benchmark data sets collected from the UCI repository [18] and other repositories; Table 1 provides a summary of the type, the number of attributes, and the number of instances of each data set. Table 2 shows the average RMSE and the corresponding standard error over ten rounds for each data set. In this table, the winning approach on each data set is highlighted in bold font, and our approach is marked with an asterisk whenever it outperforms the three competitors. As can be seen, our fuzzy rule learner, in both its global and its local variant, is superior to the other methods in terms of generalization performance. In a pairwise comparison, the global variant of TSK-Streams outperforms AMRules and FLEXFIS on 11 of the 14 data sets, and performs better than FIMTDD on 13; the local variant outperforms AMRules, FLEXFIS, and FIMTDD on 8, 11, and 13 data sets, respectively. According to a Wilcoxon signed-rank test, the global variant of our method thus outperforms AMRules, FLEXFIS, and FIMTDD with p-values 0.067, 0.041, and 0.0008, respectively, and the local variant outperforms FIMTDD with p-value 0.0008.
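Such a pairwise significance test can be computed along the following lines; the RMSE values below are purely illustrative placeholders, not the numbers from Table 2, and SciPy’s implementation of the Wilcoxon signed-rank test is used.

```python
from scipy.stats import wilcoxon

# Per-data-set RMSE values of two competing learners (illustrative numbers only)
rmse_tsk_streams = [0.21, 0.35, 0.12, 0.48, 0.09, 0.33, 0.27,
                    0.15, 0.41, 0.22, 0.30, 0.18, 0.25, 0.37]
rmse_amrules     = [0.24, 0.36, 0.15, 0.47, 0.11, 0.38, 0.29,
                    0.14, 0.45, 0.25, 0.34, 0.21, 0.28, 0.40]

statistic, p_value = wilcoxon(rmse_tsk_streams, rmse_amrules)
print(f"Wilcoxon signed-rank test: statistic={statistic}, p-value={p_value:.4f}")
```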

Table 1. Data sets
Table 2. Performance of the algorithms in terms of RMSE.
Table 3. Performance of the algorithms in terms of the runtime and model size.
Fig. 1. Performance curves (RMSE, averaged over ten runs) on the distance-to-hyperplane data, with a drift from the squared distance (red curve) to the cubed distance (green curve) in the middle of the episode. The recovery curve is plotted in blue. Ideally, this curve quickly reaches the performance level of the second stream (green curve). (Color figure online)

Table 3 shows the performance in terms of runtime and model size. TSK-Streams is often a bit slower than AMRules and FIMTDD. At the same time, however, it is significantly faster than FLEXFIS, reducing runtime by a factor of around 10. Regarding the model size, we report the number of rules (leaves for FIMTDD) merely as an indicator of model complexity, without implying specific claims.

In the second part of the evaluation, we study the ability of our approach to recover from a performance drop in the presence of a concept drift. To this end, we make use of so-called recovery analysis as introduced in [24]. Recovery analysis aims at assessing a learner’s ability to maintain its generalization performance in the presence of concept drift; it provides an idea of how quickly a drift is recognized, to what extent it affects the prediction performance, and how quickly the learner manages to adapt its model to the new condition. The main idea of recovery analysis is to employ three streams in parallel, two “pure streams” and one “mixture”, instead of using a single data stream. The mixture stream resembles the first pure stream at the beginning and the second pure stream at the end; it thus contains a concept drift, which is induced by modeling the sampling probability as a sigmoidal function. Due to lack of space, we refer the reader to [24] for details of the methodology.

In general, we find that TSK-Streams recovers quite well in comparison to the other methods. As an illustration, we plot the recovery curves (blue lines) for the distance-to-hyperplane data set in Fig. 1. As can be seen, FLEXFIS exhibits a relatively large drop in performance. FIMTDD does not even manage to recover completely by the end of the stream. Compared to this, TSK-Streams and AMRules recover quite well.

5 Conclusion

In this paper, we proposed TSK-Streams, an evolving fuzzy rule learner for regression that meets the requirements of incremental and adaptive learning on data streams. Our method combines the expressivity and flexibility of TSK fuzzy rules with the efficiency and effectiveness of concepts for rule induction as implemented in algorithms such as AMRules.

In an experimental study, we compared TSK-Streams with AMRules, FIMTDD, and FLEXFIS, state-of-the-art regression algorithms for learning from data streams, on real and synthetic data. The results we obtained show that our learner compares very favorably and achieves superior performance. Moreover, it manages to adapt and recover well after a concept drift.

In future work, we plan to elaborate on extensions and variants of TSK-Streams that may lead to further improvements in performance. These developments will be accompanied by additional experiments and case studies.