Abstract
Forecasting is an important task across several domains. The widespread interest in forecasting stems from the uncertainty and complex evolving structure of time series. Forecasting methods are typically designed to cope with temporal dependencies among observations, but it is widely accepted that none is universally applicable. Therefore, a common solution to these tasks is to combine the opinion of a diverse set of forecasts. In this paper we present an approach based on arbitrating, in which several forecasting models are dynamically combined to obtain predictions. Arbitrating is a metalearning approach that combines the output of experts according to predictions of the loss that they will incur. We present an approach for retrieving out-of-bag predictions that significantly improves its data efficiency. Finally, since diversity is a fundamental component of ensemble methods, we propose a method for explicitly handling the interdependence between experts when aggregating their predictions. Results from extensive empirical experiments provide evidence of the method’s competitiveness relative to state-of-the-art approaches. The proposed method is publicly available in a software package.
Keywords
Dynamic ensembles · Metalearning · Time series · Combining expert advice · Forecasting · Dependency and diversity

1 Introduction
Time series is an important topic in several research communities. The generalised interest in time series arises from the dynamic characteristics of many real-world phenomena. Uncertainty is a major issue in these problems, which complicates the exact understanding of their future behaviour. This is the key motivation for the study of forecasting methods.
Organisations across a wide range of domains rely on forecasting as a decision support tool. For example, financial analysts forecast the behaviour of stock prices for economic profit. Intelligent transportation systems forecast the short-term traffic flow to enhance the operational efficiency in road networks.
In the last few decades, the research community has produced a considerable number of contributions on forecasting methods. These have been designed to cope with the time dependency of the data. Time series often comprise non-stationarities and complex time-evolving structures, also known as concept drift (Gama et al. 2014), which hamper the forecasting process.
One of the most common approaches to forecasting is the dynamic combination of several experts, i.e., dynamic ensemble methods. Ensemble methods have been shown to provide a superior predictive performance relative to single learning algorithms (Brown et al. 2005). Notwithstanding, selecting the weights of each individual expert in the combination rule is known to be a difficult task.
The state-of-the-art approaches for dynamically combining experts for forecasting are mostly based on estimates of predictive performance. The loss of each expert is tracked over time and used to combine them in an adaptive way. Some of these approaches have interesting theoretical loss upper bounds based on regret minimisation (Cesa-Bianchi and Lugosi 2006).
Metalearning approaches are also commonly used. An example is stacking (Wolpert 1992), which directly models the interdependencies between experts. This characteristic may be important to take into account the diversity among experts, which is a key component in ensemble learning (Brown et al. 2005).
In this paper we present a metalearning strategy to combine the available forecasting models in an adaptive way. However, contrary to stacking, we separately model the individual expertise of each forecasting model and specialise the models across the time series. Consequently, the forecasting models are combined in such a way that they are only selected to predict examples that they are expected to be good at. Moreover, as opposed to tracking the error on past instances, our combination approach is more proactive, as it is based on predictions of the future loss of the models. This can result in a faster adaptation to changes in the environment.
The motivation for our approach is that different learning models have different areas of expertise across the input space. In time series forecasting there is evidence that forecasting models have a varying relative performance over time (Aiolfi and Timmermann 2006). Moreover, it is also common for the underlying process generating the time series to have recurrent structures due to factors such as seasonality (Gama and Kosina 2014). In this context, we hypothesise that the arbitrage metalearning strategy enables the ensemble to better detect changes in the relative performance of models or changes between different regimes and quickly adapt itself to the environment.
While a given base-learner \(M^i\) is trained to model the future values of the time series, its metalearning associate \(Z^i\) is trained to model the error of \(M^i\). The arbiter \(Z^i\) can then make predictions regarding the error that \(M^i\) will incur when predicting future values of the time series. The larger the estimates produced by \(Z^i\) (relative to the other models in the ensemble), the lower the weight of \(M^i\) in the combination rule.
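To make this weighting concrete, the sketch below turns the arbiters' predicted losses into combination weights. The softmax-style rule and the function name are illustrative, not the paper's exact transformation:

```python
import numpy as np

def loss_to_weights(predicted_losses):
    """Turn predicted losses into combination weights:
    the larger the predicted loss, the smaller the weight."""
    e = np.asarray(predicted_losses, dtype=float)
    # Negate so that small losses get large scores, then normalise.
    scores = np.exp(-(e - e.min()))
    return scores / scores.sum()

# The expert with the smallest predicted loss gets the largest weight.
w = loss_to_weights([0.5, 1.0, 2.0])
```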
Diversity among the experts is a fundamental component in building ensemble methods (Brown et al. 2005). We start by addressing this issue implicitly, by using experts with different learning strategies, i.e. heterogeneous ensembles. Our assumption is that the ensemble heterogeneity is useful to cope with the different dynamic regimes of time series. Besides heterogeneity we encourage diversity explicitly during the aggregation of the output of experts. This is achieved by taking into account not only predictions of performance produced by the arbiters, but also the correlation among experts in a recent window of observations.
We validate the proposed method on 62 real-world time series. Empirical experiments suggest that our method is competitive with different adaptive methods for combining experts and with other metalearning approaches such as stacking (Wolpert 1992). In the interest of reproducible research, ADE is publicly available as an R software package.^{1} Moreover, all experiments reported in the paper are also reproducible.^{2}

In summary, the main contributions of this paper are:

- ADE, a method for the arbitrage of forecasting experts;

- The introduction of a blocked prequential procedure in the arbitrage approach to obtain out-of-bag predictions on the training set, increasing the data available to train the metalearning models;

- A sequential reweighting strategy for controlling the redundancy among the outputs of the experts, using their correlation in a recent window of observations;

- An extensive empirical study encompassing: statistical comparisons with state-of-the-art approaches; analysis of the different deployment strategies of the proposed method; sensitivity analysis of the main parameters of the proposed method; a relative scalability analysis in terms of execution time; and a study on the value of increasing the number of experts in the ensemble.
2 Related work
In this section we review the literature related to our work. First we explain the position of the proposed method in the literature (Sect. 2.1). Then we briefly describe the state-of-the-art methods for dynamically combining expert outputs, both using windowing and metalearning approaches (Sects. 2.2, 2.3). We list their characteristics and limitations, as well as highlight our contributions. Particularly in the latter, we overview previous publications that led to this work. Finally, we briefly overview the typical approaches for encouraging diversity in ensemble methods (Sect. 2.4).
2.1 Dynamic combiners
Dynamic ensemble methods for forecasting are a well-studied topic in the literature. For example, Clemen (1989) presented an annotated bibliography comprising over 200 approaches.
This work is focused on the application of dynamic combination approaches for numerical and univariate time series forecasting tasks. According to the taxonomy presented by Kuncheva (2004), our approach can be regarded as a dynamic combiner. This type of strategy builds the experts in advance. The ensemble then adapts to concept drift by dynamically changing the combination rule.
2.2 Windowing strategies for expert combination
Combining different experts is a difficult task, and several methods have been proposed to accomplish this. Particularly in forecasting, the simple average of the available experts (equal weights) has been shown to be a robust combination method (Clemen and Winkler 1986). Its competitive performance relative to approaches using estimated weights is known in the forecasting literature as the “forecasting combination puzzle” (Genre et al. 2013). Nonetheless, more sophisticated approaches have been proposed.
Simple averages are sometimes complemented with model selection before aggregation, also known as trimmed means. For example, Jose and Winkler (2008) propose trimming a percentage of the worst forecasters on past data and averaging the output of the remaining experts.
One of the most common and successful approaches to combine predictive models in time-dependent data is to weight them according to their performance. Typically the performance is determined on a window of recent data, or by using some other forgetting mechanism that promotes the importance of recency. The idea is that recent observations are more similar to the one we intend to predict, and thus they are considered more relevant. For example, Newbold and Granger (1974) use this approach for combining forecasting models. More recently, van Rijn et al. (2018) proposed a method for data stream classification. As opposed to fusing experts, they select the best recently performing one to classify the next observation.
AEC is a method for adaptively combining forecasters (Sánchez 2008). It uses an exponential reweighting strategy to combine forecasters according to their past performance, including a forgetting factor that gives more importance to recent values. Timmermann argues that, for the prediction of stock returns, models have only short-lived periods of predictability (Timmermann 2008). He proposes an adaptive combination based on the recent \(R^2\) of the forecasters. If all models have poor explained variance (low \(R^2\)) over the recent observations, the forecast is set to the mean value of those observations. Otherwise, the experts are combined by averaging their predictions with the arithmetic mean.
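A minimal sketch of this family of strategies is shown below: generic exponential reweighting with a forgetting factor, in the spirit of AEC but not its exact update rule. The function name and parameters are illustrative:

```python
import numpy as np

def exp_reweight(losses, eta=1.0, forgetting=0.9):
    """Exponentially weight experts by discounted past losses.

    `losses` is a (time x experts) array. Older losses are discounted
    by `forgetting**age`, so recent performance dominates. This is a
    generic sketch, not AEC's exact update rule.
    """
    losses = np.asarray(losses, dtype=float)
    t = losses.shape[0]
    discounts = forgetting ** np.arange(t - 1, -1, -1)  # oldest discounted most
    cum_loss = (discounts[:, None] * losses).sum(axis=0)
    w = np.exp(-eta * cum_loss)
    return w / w.sum()

# Expert 1 has been consistently better, so it receives more weight.
w = exp_reweight([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
```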
In online learning, several strategies have been proposed for aggregating expert advice. These are typically based on regret minimisation and have interesting theoretical properties. Regret is the average error suffered with respect to the best expert we could have followed. In this paper we focus on the following three approaches: the exponentially weighted average, the polynomially weighted average, and fixed share aggregation. For a thorough review of these methods we refer to the seminal work by Cesa-Bianchi and Lugosi (2006). Zinkevich (2003) proposed an online convex programming approach based on gradient descent that also guarantees regret bounds.
The outlined models are related to our work in the sense that they employ adaptive heuristics to combine forecasters. However, these heuristics are incremental or sliding summary statistics of relative past performance. Our intuition is that such approaches have a short memory and may fail to capture long-range relationships between changes in the underlying time series and the performance of the experts. Conversely, we explore differences among the experts to specialise them across the data space based on a regression analysis. Moreover, we use a more proactive heuristic based on predictions of the relative future performance of the individual forecasters.
2.3 Metalearning strategies for expert combination
Metalearning provides a way for modelling the learning process of a learning algorithm (Brazdil et al. 2008). Several methods use this approach to improve the combination or selection of models (Pinto et al. 2016; Rossi et al. 2014; Todorovski and Džeroski 2003; Wolpert 1992).
A popular and successful approach for dynamically combining experts is to apply multiple regression on the output of the experts. For example, Gaillard and Goude (2015) describe a setup in which Ridge regression is used to aggregate experts by minimising the L2-regularised least squares. The idea behind these approaches is similar to stacking (Wolpert 1992), a widely used approach to combine predictive models.
Our proposal follows a metalearning strategy called arbitrating. This approach was introduced before for dynamic selection of classifiers (Ortega et al. 2001). A prediction is made using a combination of different classifiers that are selected according to their expertise concerning the input data. The expertise of a model is learned using a metalearner, one for each available base classifier, which models the confidence of its base counterpart. At runtime, the classifier with the highest confidence is selected to make a prediction.
The initial indication that arbitration produces interesting results in forecasting came from a case study on solar radiation forecasting (Cerqueira et al. 2017). In that work, the arbitration mechanism was adapted in a straightforward manner, showing an improvement over stacking (Wolpert 1992).
The proposed dynamic ensemble method, ADE, was first introduced in previous work (Cerqueira et al. 2017a). The idea behind arbitration was reworked and applied to time series forecasting problems from several domains. Several of its drawbacks were addressed: the inefficient use of the available data, by using out-of-bag samples from the training set; the combination rule, made more robust by using a committee of recent well-performing models; and the general translation to time series forecasting tasks, which are fundamentally different from classification tasks. In this paper we extend and improve the approach. The main difference is a diversity-inducing procedure during expert aggregation that explicitly models their interdependence. On top of this, we significantly enlarge the experiments used to validate the method. We also provide an in-depth analysis of ADE to give more insight into its characteristics.
2.3.1 Mixture of experts
The proposed dynamic ensemble is related to mixture of experts (ME) (Jacobs et al. 1991), in the sense that each expert is specialised in a certain region of the input space. The main difference to ME is the way the weights of the experts are computed. ME estimates the weights using a gating function, typically a neural network with as many output units as experts, trained using Expectation–Maximisation. Our approach uses a set of arbiters that predict the loss of the experts. ADE also differs in the training procedure of the experts and in how diversity is encouraged in the ensemble. ME is typically composed of neural network experts built incrementally, where the gating function explicitly controls the patterns each neural network learns according to their relative performance. This results in relatively independent experts. Conversely, ADE works as a dynamic combiner (Kuncheva 2004). Diversity is introduced implicitly by employing a set of heterogeneous experts, which are trained with the whole set of available observations. During expert aggregation, diversity is also encouraged by considering the redundancy among the outputs of the experts.
2.4 Diversity creation methods
A wide range of contributions exist for encouraging diversity in ensemble methods. These are typically based on input manipulation [e.g. bagging (Breiman 1996)], output manipulation [e.g. Error-Correcting Output Coding (Dietterich and Bakiri 1991)], or manipulation of the architectures used to build the experts. For a comprehensive read on diversity creation approaches we refer to the survey by Brown et al. (2005).
We propose a method that encourages diversity during the aggregation of experts. This is accomplished by manipulating the experts’ weights according to the redundancy of their output. To the best of our knowledge, there is no closely related approach in the machine learning literature. However, our approach is inspired by notions of diversity from information retrieval. An example is the seminal Maximal Marginal Relevance approach (Carbonell and Goldstein 1998). This method is typically used to rank a list of documents answering a given query by considering not only the relevance of each document individually, but also its redundancy with respect to documents already ranked.
3 Arbitrated dynamic ensemble
In this section we formalise ADE. We start by describing the predictive task, and then explain the different steps of the methodology.
A time series Y is a temporal sequence of values \(Y = \{y_1, y_2, \dots , y_t\}\), where \(y_i\) is the value of Y at time i. We focus on numeric time series, i.e., \(y_i \in \mathbb {R}, \forall i \in \{1, \dots , t\}\). We frame the problem of time series forecasting as a regression task. The temporal dependency is modelled by having the previous observations as attributes in the learning of the experts. In order to enhance the representation of the time series, this approach can be extended by using summary statistics on the embedding vectors, or other external domain-specific knowledge.
To be more precise, we use time delay embedding (Takens 1981) to represent Y in a Euclidean space with embedding dimension K. Effectively, we construct a set of observations based on the past K lags of the time series. Each observation is composed of a feature vector \(x_i \in \mathbb {X} \subset \mathbb {R}^K\), which denotes the previous K values, and a target value \(y_i \in \mathbb {Y} \subset \mathbb {R}\), which represents the value we want to predict. The objective is to construct a model \(f : \mathbb {X} \rightarrow \mathbb {Y}\), where f denotes the regression function.
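The embedding construction can be sketched as follows (a minimal Python version for illustration; the paper's implementation is in R):

```python
import numpy as np

def time_delay_embed(y, k):
    """Time delay embedding: build (x_i, y_i) pairs where x_i holds
    the k previous values and y_i is the value to predict."""
    y = np.asarray(y, dtype=float)
    X = np.array([y[i:i + k] for i in range(len(y) - k)])
    target = y[k:]
    return X, target

X, t = time_delay_embed([1, 2, 3, 4, 5], k=2)
# X rows: [1, 2], [2, 3], [3, 4]; targets: 3, 4, 5
```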

ADE comprises three main steps:

- Training the base-learners: the set of heterogeneous experts that are used to forecast future values of Y;

- Training the metalearners: arbiters that model and predict the loss of the experts;

- Predicting \(y_{t+1}\): combining the output of the experts according to the output of the arbiters and the correlation among the experts' outputs to forecast the next value of the time series.
3.1 Training the experts
The first step of ADE is to train m individual forecasters. Each \(M^j, \forall j \in \{1, \ldots , m\}\), is built using the available time series Y. The objective is to predict \(y_{t+1}\), the next value of Y. This is accomplished by having each expert build a model \(f : \mathbb {X} \rightarrow \mathbb {Y}\).
M comprises a set of heterogeneous models, for example decision trees and artificial neural networks. Heterogeneous models have different inductive biases and assumptions regarding the process generating the data. Effectively, we expect the models to have different expertise across the time series. Later we will present an approach, complementary to ensemble heterogeneity, that encourages diversity during the aggregation of the experts (Sect. 3.3.3).
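The structure of this step can be illustrated with two toy experts sharing a common fit/predict interface. These hand-rolled stand-ins (a least-squares linear model and a one-nearest-neighbour model) merely illustrate heterogeneity; the paper's actual experts are the learners listed later in Table 2:

```python
import numpy as np

class LinearExpert:
    """Ordinary least squares on the embedded observations."""
    def fit(self, X, y):
        Xb = np.c_[X, np.ones(len(X))]            # add intercept column
        self.coef_, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return self
    def predict(self, X):
        return np.c_[X, np.ones(len(X))] @ self.coef_

class NearestNeighbourExpert:
    """Predicts the target of the closest training embedding vector."""
    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X, float), np.asarray(y, float)
        return self
    def predict(self, X):
        d = ((self.X_[None, :, :] - np.asarray(X, float)[:, None, :]) ** 2).sum(-1)
        return self.y_[d.argmin(axis=1)]

# All experts are fit on the same embedded training data.
experts = [LinearExpert(), NearestNeighbourExpert()]
fitted = [e.fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0]) for e in experts]
preds = [float(e.predict([[4.0]])[0]) for e in fitted]
```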
3.2 Training the arbiters
In the metalearning step of ADE the goal is to build models capable of modelling the expertise of each base-learner across the input space.
Our assumption is that not all models will perform equally well at any given prediction point. This idea is in accordance with findings reported in prior work (Aiolfi and Timmermann 2006). Systematic evidence was found that some models have varying relative performance over time and that other models are persistently good (or bad) throughout the time series. Furthermore, in many environments the dynamic concepts have a recurring nature, due to, for example, seasonality. These findings can be regarded as instances of the No Free Lunch theorem presented by Wolpert (2002). This theorem essentially states that no learning algorithm is the most appropriate for all tasks.
In effect, we use metalearning to dynamically weigh baselearners and adapt the combined model to changes in the relative performance of the base models, as well as for the presence of different regimes in the time series.
We perform this regression analysis on a metalevel to understand how the error of a given model relates to the dynamics and the structure of the time series. Effectively, we can capitalise on this knowledge by dynamically combining baselearners according to the expectation of how they will perform.
3.2.1 Blocked prequential for out-of-bag predictions
Typical metalearning approaches for dynamic model selection or combination only start the metalearning layer at runtime. This is the case of, for example, the original arbitrating formulation by Ortega et al. (2001) or the work of Gama and Kosina (2014). This is motivated by the need for unbiased samples to build reliable metalearners. However, this means that, at the beginning, few observations are available to train the metalearners, which might result in underfitting.
ADE uses the training set to produce out-of-bag predictions, which are then used to compute an unbiased estimate of the loss of each base-learner. By retrieving out-of-bag samples from the training set we are able to significantly increase the amount of data available to the metalearners. We hypothesise that this strategy improves the overall performance of the ensemble by improving the accuracy of each metalearner.
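The splitting logic can be sketched as follows. This is a growing-window variant under simple assumptions; the exact block sizes and fold handling in ADE may differ:

```python
def blocked_prequential(n, b):
    """Blocked prequential with a growing window: split n observations
    into b contiguous, ordered folds; each fold (after the first) is
    predicted by experts trained on all preceding folds, yielding
    out-of-bag predictions while preserving temporal order."""
    bounds = [round(i * n / b) for i in range(b + 1)]
    splits = []
    for i in range(1, b):
        train_idx = list(range(bounds[i]))                # all earlier blocks
        test_idx = list(range(bounds[i], bounds[i + 1]))  # the next block
        splits.append((train_idx, test_idx))
    return splits

splits = blocked_prequential(10, 5)
```

Because folds are contiguous and training data always precedes test data, no future observation leaks into the past, unlike standard shuffled cross-validation.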
3.3 Predicting \(y_{t+1}\)
For predicting the next value of the time series, \(y_{t+1}\), ADE combines the output of the experts M according to the output of the arbiters and the recent correlation among the experts.
3.3.1 Committee of models for prediction
In the original arbitrating architecture the expert with the highest confidence (predicted by the arbiters) is selected to make a prediction. Our approach is to combine the output of the experts, as opposed to selecting a single one.
As described earlier, the predictive performance of forecasting models has been reported to vary over a given time series. We address this issue with a committee of models, where we trim recently poor-performing models from the combination rule for the upcoming prediction [e.g. trimmed means (Jose and Winkler 2008)].

As we explain in Sect. 2, the state-of-the-art approaches for dynamic combination in time series rely on past performance to quantify the weight of the experts. Specifically, this is typically used for dynamic selection (e.g. Jose and Winkler 2008) or dynamic combination (e.g. Newbold and Granger 1974). Here we use this information for dynamic selection. Formally, we select the \(\varOmega \)% of base-learners with the lowest mean absolute error in the last \(\lambda \) observations (\({}^{\varOmega }{M}\)), suspending the remaining ones. The predictions of the metalevel models (\({}^{\varOmega }{Z}\)) are used to weigh the selected forecasters.

In summary, if we expect \(M^j\) to make a large error \(e^j\) on a given observation relative to the other experts, we assign it a small weight (or even suspend it) in the final prediction. Conversely, if we expect \(M^j\) to incur a small loss relative to its peers, we increase its weight for the upcoming prediction.
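The two mechanisms above (trimming by recent error, then weighting the survivors by predicted loss) can be sketched as follows. The softmax-style transformation is a stand-in for the paper's exact weighting function:

```python
import numpy as np

def committee_weights(recent_abs_errors, predicted_losses, omega=0.5):
    """Keep the omega fraction of experts with the lowest mean absolute
    error over the recent window, then weight the survivors by their
    arbiters' predicted losses (lower predicted loss -> higher weight)."""
    recent_abs_errors = np.asarray(recent_abs_errors, dtype=float)  # (window x experts)
    predicted_losses = np.asarray(predicted_losses, dtype=float)
    mae = recent_abs_errors.mean(axis=0)
    n_keep = max(1, int(round(omega * len(mae))))
    keep = np.argsort(mae)[:n_keep]        # best recent experts; rest suspended
    w = np.zeros(len(mae))
    scores = np.exp(-(predicted_losses[keep] - predicted_losses[keep].min()))
    w[keep] = scores / scores.sum()
    return w

w = committee_weights(
    [[0.1, 0.2, 1.0, 1.0], [0.1, 0.2, 1.0, 1.0]],  # recent errors (window x experts)
    [0.1, 0.2, 0.0, 0.0],                          # arbiters' predicted losses
)
```

Experts 2 and 3 are suspended by the committee despite their low predicted losses, because their recent errors are the largest.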
3.3.2 Combining the experts
3.3.3 Sequential reweighting of experts
Most combination approaches, particularly dynamic ones, weigh experts by maximising estimates of predictive performance (c.f. Sect. 2). However, when the experts are highly redundant it is important to model their interdependence. This can be done in one of two ways:
(a) “training procedures that result in relatively independent experts”;

(b) “aggregation methods that explicitly or implicitly model the dependence among the experts”.

The inputs to this aggregation step are:

- the output of the experts, \(\hat{y}^{M}_{t+1} = \{\hat{y}^{1}_{t+1}, \dots , \hat{y}^{m}_{t+1}\}\);

- their respective weights, predicted by the arbiters and scaled such that \(w^{M}_{t+1} = \{w^{1}_{t+1}, \dots , w^{m}_{t+1}\}\) with \(\sum _{i=1}^{m} w^{i}_{t+1} = 1\).
Notwithstanding, time series exhibit characteristics that this type of method needs to cope with, e.g. the variation in relative performance that forecasters show over a time series. We formalise our idea for the dynamic combination of forecasting experts in Algorithm 2. We use the correlation among the outputs of the experts to quantify their redundancy. This correlation is computed over a window of recent observations to cope with possible non-stationarities of the time series.
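A Maximal-Marginal-Relevance-flavoured sketch of this reweighting idea is shown below. The penalty form and the `penalty` parameter are illustrative, not Algorithm 2 verbatim:

```python
import numpy as np

def sequential_reweight(weights, recent_preds, penalty=0.5):
    """Sequentially reweight experts: visit them from largest to smallest
    weight and shrink each weight according to the expert's correlation
    (over a recent window of predictions) with experts already visited."""
    w = np.asarray(weights, dtype=float).copy()
    P = np.asarray(recent_preds, dtype=float)        # (window x experts)
    corr = np.corrcoef(P, rowvar=False)
    accepted = []
    for j in np.argsort(-w):                         # highest weight first
        if accepted:
            redundancy = max(abs(corr[j, a]) for a in accepted)
            w[j] *= (1 - penalty * redundancy)
        accepted.append(j)
    return w / w.sum()

# Expert 1 duplicates expert 0's recent predictions, so it is penalised.
recent = [[1, 1, 4], [2, 2, 1], [3, 3, 3], [4, 4, 2]]  # window x experts
w = sequential_reweight([0.4, 0.4, 0.2], recent)
```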
4 Experiments
In this section we present the experiments carried out to validate ADE. We start by describing the overall setup. We compare the proposed method to state-of-the-art approaches for combining the output of experts. Specifically, we focus on approaches designed to cope with temporal dependencies. Afterwards, we perform sensitivity analyses to enhance our understanding of the components of ADE. To encourage reproducible research, we published the code used to perform these experiments (c.f. footnote 2).
The experiments are designed to address the following research questions:

Q1: How does the performance of the proposed method compare to the performance of state-of-the-art methods for time series forecasting tasks and state-of-the-art methods for combining forecasting models?

Q2: Is it beneficial to use a weighting scheme in our arbitrating strategy instead of selecting the predicted best expert, as originally proposed (Ortega et al. 2001)?

Q3: Is it beneficial to use out-of-bag predictions from the training set to increase the data used to train the metalearners?

Q4: How does the performance of ADE vary with the introduction of a committee, where recently poor-performing base-learners are discarded from the upcoming prediction, as opposed to weighting all the models?

Q5: What is the impact of the sequential reweighting procedure on ADE’s performance?

Q6: How does the performance of ADE vary with different updating strategies for the base and meta models?

Q7: How sensitive is ADE to the parameters \(\varOmega \) and \(\lambda \), and to the size of the ensemble in terms of the number of experts?

Q8: How does ADE scale, in terms of computational effort, in comparison to other state-of-the-art approaches for combining forecasters?

Q9: What is the impact of the sequential reweighting procedure on state-of-the-art approaches for combining experts? Moreover, how does this approach compare with methods that handle correlation in the feature space (e.g. principal component analysis)?
4.1 Experimental setup
Table 1 Datasets and respective summary
ID  Time series  Data source  Data characteristics  Size  K (embedding dimension)  I (differences applied) 

1  Rotunda AEP  Porto water consumption from different locations in the city of Porto (Cerqueira et al. 2017a)  Halfhourly values from Nov. 11, 2015 to Jan. 11, 2016  3000  30  0 
2  Preciosa mar  3000  9  1  
3  Amial  3000  11  0  
4  Global horizontal radiation  Solar radiation monitoring (Cerqueira et al. 2017a)  Hourly values from Apr. 25, 2016 to Aug. 25, 2016  3000  23  1 
5  Direct normal radiation  3000  19  1  
6  Diffuse horizontal radiation  3000  18  1  
7  Average wind speed  3000  10  1  
8  Humidity  Bike sharing (Cerqueira et al. 2017a)  Hourly values from Jan. 1, 2011 to Mar. 1, 2011  1338  11  0 
9  Windspeed  1338  12  0  
10  Total bike rentals  1338  8  0  
11  AeroStock 1  Stock price values from different aerospace companies (Cerqueira et al. 2017a)  Daily stock prices from January 1988 through October 1991  949  6  1 
12  AeroStock 2  949  13  1  
13  AeroStock 3  949  7  1  
14  AeroStock 4  949  8  1  
15  AeroStock 5  949  6  1  
16  AeroStock 6  949  10  1  
17  AeroStock 7  949  8  1  
18  AeroStock 8  949  8  1  
19  AeroStock 9  949  9  1  
20  AeroStock 10  949  8  1  
21  CO.GT  Air quality indicators in an Italian city (Lichman 2013)  Hourly values from Mar. 10, 2004 to Apr. 04 2005  3000  30  1 
22  PT08.S1.CO  3000  8  1  
23  NMHC.GT  3000  10  1  
24  C6H6.GT  3000  13  0  
25  PT08.S2.NMHC  3000  9  0  
26  NOx.GT  3000  10  1  
27  PT08.S3.NOx  3000  10  1  
28  NO2.GT  3000  30  1  
29  PT08.S4.NO2  3000  8  0  
30  PT08.S5.O3  3000  8  0  
31  Temperature  3000  8  1  
32  RH  3000  23  1  
33  Humidity  3000  10  1  
34  Electricity total load  Hospital energy loads (Cerqueira et al. 2017a)  Hourly values from Jan. 1, 2016 to Mar. 25, 2016  3000  19  0 
35  Equipment load  3000  30  0  
36  Gas energy  3000  10  1  
37  Gas heat energy  3000  13  1  
38  Water heater Energy  3000  30  0  
39  Total demand  Australian electricity (Koprinska et al. 2011)  Halfhourly values from Jan. 1, 1999 to Mar. 1, 1999  2833  6  0 
40  Recommended retail price  2833  19  0  
41  Sea level pressure  Ozone level detection (Lichman 2013)  Daily values from Jan. 2, 1998 to Dec. 31, 2004  2534  9  0 
42  Geopotential height  2534  7  0  
43  K Index  2534  7  0  
44  Flow of Vatnsdalsa river  Data market (Hyndman 2017)  Daily, from Jan. 1, 1972 to Dec. 31, 1974  1095  11  0 
45  Rainfall in Melbourne  Daily, from 1981 to 1990  3000  29  0  
46  Foreign exchange rates  Daily, from Dec. 31, 1979 to Dec. 31, 1998  3000  6  1  
47  Max. temperatures in Melbourne  Daily, from 1981 to 1990  3000  7  0  
48  Min. temperatures in Melbourne  Daily, from 1981 to 1990  3000  6  0  
49  Precipitation in River Hirnant  Halfhourly, from Nov. 1, 1972 to Dec. 31, 1972  2928  6  1  
50  IBM common stock closing prices  Daily, from Jan. 2, 1962 to Dec. 31, 1965  1008  10  1  
51  Internet traffic data I  Hourly, from Jun. 7, 2005 to Jul. 31, 2005  1231  10  0  
52  Internet traffic data II  Hourly, from Nov. 19, 2004 to Jan. 27, 2005  1657  11  1  
53  Internet traffic data III  from Nov. 19, 2004 to Jan. 27, 2005—data collected at 5 min intervals  3000  6  1  
54  Flow of Jokulsa Eystri river  Daily, from Jan. 1, 1972 to Dec. 31, 1974  1096  21  0  
55  Flow of O. Brocket  Daily, from Jan. 1, 1988 to Dec. 31, 1991  1461  6  1  
56  Flow of Saugeen river I  Daily, from Jan. 1, 1915 to Dec. 31, 1979  1400  6  0  
57  Flow of Saugeen river II  Daily, from Jan. 1, 1988 to Dec. 31, 1991  3000  30  0  
58  Flow of Fisher River  Daily, from Jan. 1, 1974 to Dec. 31, 1991  1461  6  0  
59  No. of Births in Quebec  Daily, from Jan. 1, 1977 to Dec. 31, 1990  3000  6  1  
60  Precipitation in O. Brocket  Daily, from Jan. 1, 1988 to Dec. 31, 1991  1461  29  0  
61  Min. temperature  Porto weather (Cerqueira et al. 2017a)  Daily values from Jan. 1, 2010 to Dec. 28, 2013  1456  8  0 
62  Max. temperature  1456  10  0 
To account for trend, we applied the KPSS statistical test (Kwiatkowski et al. 1992) to the data. Time series that are not trend-stationary according to this test are differenced until the test is passed. This approach is commonly used for trend inclusion in forecasting models, for example ARIMA. Specifically, we follow the procedure adopted by the automatic forecasting model auto.arima from the forecast R package (Hyndman 2014). The number of differences applied to each time series is described in the last column of Table 1.
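The differencing loop can be sketched as below. The stationarity check here is a deliberately crude placeholder standing in for the actual KPSS test (available, e.g., in statsmodels or the forecast R package):

```python
import numpy as np

def ndiffs(y, is_trend_stationary, max_d=2):
    """Difference the series until the supplied stationarity test passes,
    mirroring the KPSS-based procedure of auto.arima. Returns the number
    of differences applied and the differenced series."""
    y = np.asarray(y, dtype=float)
    d = 0
    while d < max_d and not is_trend_stationary(y):
        y = np.diff(y)
        d += 1
    return d, y

def toy_test(y):
    """Crude placeholder for KPSS: flags a series as non-stationary when
    a fitted linear trend dominates its dispersion."""
    slope = np.polyfit(np.arange(len(y)), y, 1)[0]
    return abs(slope) * len(y) <= y.std() + 1e-8

# A pure linear trend needs exactly one difference.
d, yd = ndiffs(np.arange(100.0), toy_test)
```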
We estimate the optimal embedding dimension (K) using the method of False Nearest Neighbours (Kennel et al. 1992). This method analyses the behaviour of the nearest neighbours as we increase K. According to Kennel et al. (1992), with a low suboptimal K many of the nearest neighbours will be false. Then, as we increase K and approach an optimal embedding dimension, those false neighbours disappear. We set the tolerance for false nearest neighbours to 1%. The embedding dimension estimated for each series is shown in Table 1.
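A simplified version of the false nearest neighbours criterion is sketched below (brute-force, using only the distance-ratio test of Kennel et al. (1992), for illustration only):

```python
import numpy as np

def false_neighbour_ratio(y, k, threshold=10.0):
    """Fraction of false nearest neighbours for embedding dimension k.
    A neighbour in k dimensions is 'false' if adding the (k+1)-th
    coordinate moves it away by more than `threshold` times the
    k-dimensional distance."""
    y = np.asarray(y, dtype=float)
    n = len(y) - k                      # number of embeddable points
    Xk = np.array([y[i:i + k] for i in range(n)])
    extra = y[k:]                       # the (k+1)-th coordinate of each point
    false = 0
    for i in range(n):
        d = np.linalg.norm(Xk - Xk[i], axis=1)
        d[i] = np.inf                   # exclude the point itself
        j = d.argmin()
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > threshold:
            false += 1
    return false / n

# A smooth ramp is perfectly predictable from two lags: no false neighbours.
ratio = false_neighbour_ratio(np.arange(50.0), k=2)
```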

Local trend, estimated according to the ratio between the standard deviation of the embedding vector and the standard deviation of the differenced embedding vectors;

Skewness, for measuring the symmetry of the distribution of the embedding vectors;

Mean, as a measure of centrality of the embedding vectors;

Standard deviation, as a dispersion metric;

Serial correlation, estimated using a BoxPierce test statistic;

Longrange dependence, using a Hurst exponent estimation with wavelet transform;

Chaos, using the maximum Lyapunov exponent to measure the level of chaos in the system.
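A few of these summary statistics are simple enough to sketch directly. The sketch below computes the mean, standard deviation, skewness, and local-trend ratio for one embedding vector; the serial-correlation, long-range-dependence, and chaos statistics require dedicated routines and are omitted here.

```python
def meta_features(v):
    """Summary statistics for one embedding vector: mean, standard deviation,
    skewness, and the local-trend ratio std(v) / std(diff(v)). Population
    (1/n) moments are used for simplicity."""
    n = len(v)
    mean = sum(v) / n
    sd = (sum((x - mean) ** 2 for x in v) / n) ** 0.5
    skew = (sum((x - mean) ** 3 for x in v) / n) / sd ** 3 if sd else 0.0
    diffs = [b - a for a, b in zip(v, v[1:])]
    dmean = sum(diffs) / len(diffs)
    dsd = (sum((d - dmean) ** 2 for d in diffs) / len(diffs)) ** 0.5
    trend = sd / dsd if dsd else float("inf")  # std(v) / std(diff(v))
    return {"mean": mean, "sd": sd, "skewness": skew, "local_trend": trend}
```

These per-vector statistics form the metafeature vector presented to the arbiters alongside the embedding vector itself.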
4.1.1 Evaluation procedure
4.2 Ensemble setup and baselines
Table 2: Summary of the experts
ID  Algorithm  Parameter  Value 

SVR  Support vector regr. (Karatzoglou et al. 2004)  Kernel  {Linear, RBF Polynomial, Laplace} 
Cost  {1}  
\(\epsilon \)  {0.1}  
MARS  Multivar. A. R. splines (Milborrow 2012)  Degree  {1, 3} 
No. terms  {7, 15}  
Forward thresh.  {0.001}  
RF  Random forest (Wright 2015)  No. trees  {100, 250, 500} 
PPR  Proj. pursuit regr. (R Core Team 2013)  No. terms  {2, 5} 
Method  {Super smoother, spline}  
RBR  Rulebased regr. (Kuhn et al. 2014)  No. iterations  {10, 25, 50, 100} 
GBR  Generalized boosted regr. (Ridgeway 2015)  Depth  {5, 10} 
Distribution  {Gaussian, Laplace}  
No. trees  {500, 1000}  
Learning rate  {0.1}  
MLP  Multilayer perceptron (Venables and Ripley 2002)  Hidden units  {3, 5, 7, 10, 15, 25} 
Decay  {0.01}  
GLM  Generalised linear regr. (Friedman et al. 2010)  Penalty mixing  {0, 0.2, 0.4, 0.6, 0.8, 1} 
GP  Gaussian processes (Karatzoglou et al. 2004)  Kernel  {Linear, RBF, Polynomial, Laplace} 
Tolerance  {0.001, 0.01}  
PCR  Principal comp. regr. (Mevik et al. 2016)  Default  – 
PLS  Partial least regr. (Mevik et al. 2016)  Method  {Kernel, SIMPLS} 
ARIMA  ARIMA (Hyndman 2014)  Auto  – 
ETS  Exponential smoothing (Hyndman 2014)  Method  {ETS, TBATS} 
The set M of experts forming the ensemble is summarised in Table 2. Different parameter settings are used for each of the individual learners, adding up to 50 base models. Parameters that are not specified were set to default values or are automatically tuned. The choice of the number of experts is analysed in Sect. 4.4.3.
We use a Random Forest as the metalearner. The blocked prequential procedure used to obtain out-of-bag samples was run with 10 folds (b = 10). The committee for each prediction (Sect. 3.3.1) contains the 50% of forecasters with the best performance in the last 50 observations (\(\varOmega \) and \(\lambda \) are both set to 50). We suspend only half the models in the interest of keeping the combined model readily adaptable to changes in the environment: an average-performing model may rapidly become important, and the combined model should be able to capture these situations. By setting \(\lambda \) to 50 we strive for estimates of recent performance that render a robust committee. The sensitivity of ADE to different values of \(\varOmega \) and \(\lambda \) is analysed in Sect. 4.4.2. We used Pearson's method as the correlation function for the sequential reweighting of experts (Sect. 3.3.3).
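The committee-formation step described above can be sketched as follows. This is a hypothetical sketch: the simple inverse-loss normalisation stands in for the transformation ADE applies to the arbiters' predicted losses, and the function name is illustrative.

```python
def ade_weights(recent_losses, predicted_losses, omega=0.5):
    """Keep the `omega` fraction of experts with the lowest average loss over
    the last lambda observations, then weight the committee members inversely
    to the losses predicted by their arbiters. Suspended experts receive
    weight zero; weights sum to one."""
    m = len(recent_losses)
    k = max(1, round(omega * m))
    committee = sorted(range(m), key=lambda i: recent_losses[i])[:k]
    inv = {i: 1.0 / (predicted_losses[i] + 1e-9) for i in committee}
    total = sum(inv.values())
    return [inv.get(i, 0.0) / total for i in range(m)]
```

With four experts and omega = 0.5, for instance, only the two experts with the lowest recent loss share the weight, and a suspended expert regains weight as soon as its recent performance improves, which is the adaptability argued for above.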
 Stacking:

An adaptation of stacking (Wolpert 1992) for time series, where a metamodel is learned using the base-level predictions as attributes. To preserve the temporal order of observations, the out-of-bag predictions used to train the metalearner (a random forest) are obtained using a blocked prequential procedure (cf. Sect. 3.2.1). Different strategies for training the metalearner (e.g. holdout) were tested and blocked prequential presented the best results;
 Arbitrating:

An approach following the original arbitrating method presented by Ortega et al. (2001), cf. Sect. 2.3;
 Simple:

The approach in which the available experts are simply averaged using an arithmetic mean (Timmermann 2006);
 SimpleTrim:

Simple average with model selection: \(\varOmega \)% of the best past performing models are selected and aggregated with a simple average;
 LossTrain:

Weighted static combination of experts, in which the weights are set according to the performance of experts in the training set;
 BestTrain:

An approach that selects the model with the best performance in the training data to predict the entire test set;
 WindowLoss:

Weighted adaptive combination of experts. The weights are computed according to the performance of the experts in the last \(\lambda \) observations (Newbold and Granger 1974);
 Blast:

Similar to WindowLoss, but selects the best expert in the last \(\lambda \) observations for prediction. van Rijn et al. (2018) showed its competitiveness using streaming data;
 AEC:

The adaptive combination procedure AEC (Sánchez 2008), cf. Sect. 2.2;
 ERP:

The adaptive combination procedure proposed by Timmermann (2008), cf. Sect. 2.2;
 EWA:

A forecast combination approach based on an exponentially weighted average; we refer to the seminal work by Cesa-Bianchi and Lugosi for a comprehensive description and theoretical properties (Cesa-Bianchi and Lugosi 2006, Section 2.1);
 FixedShare:

The fixed share approach due to Herbster and Warmuth (1998), which is designed for tracking the best expert across a time series (Cesa-Bianchi and Lugosi 2006, Section 5.2);
 MLpol:

The polynomially weighted average forecast combination (Cesa-Bianchi and Lugosi 2003). See Cesa-Bianchi and Lugosi for a comprehensive description and theoretical properties (Cesa-Bianchi and Lugosi 2006, Section 2.1);
 OGD:

An approach based on online gradient descent that provides theoretical loss bound guarantees (Zinkevich 2003);
 ARIMA:

A state-of-the-art method for time series forecasting. We use the implementation provided in the forecast R package (Hyndman 2014), which automatically tunes ARIMA to an optimal parameter setting.
 Naive:

Baseline that uses the value of the previous observation (\(y_t\)) for predicting \(y_{t+1}\);
 SeasonalNaive:

Baseline that uses the value of the observation from the previous seasonal period for predicting \(y_{t+1}\) (Hyndman 2014). Particularly, for daily time series we use the value from the previous week, and for hourly time series we use the value from the day before;
 ExpSmoothing:

The exponential smoothing state space model typically used for forecasting (Hyndman 2014).
For the approaches EWA, MLpol, FixedShare, and OGD, we used the software package opera (Gaillard and Goude 2016).
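Among the combination baselines above, the exponentially weighted average admits a particularly compact sketch. The step below is a minimal illustration of the standard update; the learning rate value is illustrative and in practice requires tuning.

```python
import math

def ewa_update(weights, losses, eta=0.5):
    """One step of the exponentially weighted average forecaster: each
    expert's weight is scaled by exp(-eta * loss) and the weights are
    renormalised. The combined forecast at each step is then the weighted
    mean of the experts' predictions."""
    w = [wi * math.exp(-eta * li) for wi, li in zip(weights, losses)]
    total = sum(w)
    return [wi / total for wi in w]
```

Starting from uniform weights and applying this update after every observation, experts that accumulate losses are exponentially down-weighted, which is the behaviour the theoretical regret bounds of Cesa-Bianchi and Lugosi (2006) quantify.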
 ADEselectbest:

A variant of ADE in which, at each time point, the best model is selected to make a prediction, where the best model is the one with the lowest predicted loss. This is in accordance with the original arbitrating architecture (Ortega et al. 2001);
 ADEallmodels:

A variant of ADE without the formation of a committee. In this case, all forecasting models are weighted according to their expertise in the input data;
 ADEnoreweight:

A variant of ADE in which there is no reweighting of the experts according to the correlation of their predictions (Sect. 3.3.3);
 ADEv0:

The preliminary version of ADE (Cerqueira et al. 2017a). Besides not using the reweighting of experts, this approach transforms the output of the arbiters with a softmax function, instead of the linear transformation used in the current version;
 ADEvanilla:

A baseline variant of ADE with a simpler weighting approach: the error (\(\hat{y} - y\)) predicted by the arbiters is simply added to the output of the respective expert. The final prediction is computed as the average of the shifted outputs of the experts.
4.3 Results
4.3.1 Comparing ADE to the state of the art approaches
ADE presents the best average rank relative to state of the art aggregation methods. This value is considerably better compared to widely used approaches, including Stacking, Simple, or WindowLoss. From the numbers in Table 3, ADE wins in most of the problems against each other approach, often in a considerable way (i.e., with probability above 95%). Among the combination approaches, BestTrain presents one of the worst average ranks, which suggests that the combination of different experts is worthwhile in terms of predictive performance. The simple average aggregation coupled with model selection leads to an interesting average rank, which is only topped by that of ADE. These results are corroborated by the outcome of the Bayes sign test, which suggests that ADE has a higher probability of winning compared to each other approach.
Figure 6 is useful for visualising the magnitude of the difference in predictive performance, something to which average ranks are blind. The distribution of the percentage difference varies according to the model under comparison. In general, ADE shows a reasonable difference when compared with most of the other approaches.
These results answer the research question Q1 regarding the performance of ADE relative to the state of the art approaches for combining forecasting experts.
Table 3: Paired comparisons between ADE and the baselines in the 62 time series
Method  ADE loses  ADE draws  ADE wins 

Stacking  12 (3)  16 (2)  34 (13) 
Arbitrating  1 (0)  2 (0)  59 (41) 
Simple  3 (1)  24 (17)  35 (24) 
SimpleTrim  4 (1)  45 (32)  13 (10) 
LossTrain  3 (1)  21 (8)  38 (25) 
WindowLoss  3 (2)  35 (26)  24 (19) 
Blast  1 (0)  2 (0)  59 (42) 
AEC  0 (0)  6 (4)  56 (47) 
ERP  8 (1)  21 (8)  33 (23) 
BestTrain  3 (0)  6 (0)  53 (42) 
EWA  0 (0)  23 (8)  39 (9) 
FixedShare  0 (0)  6 (2)  56 (27) 
MLpol  5 (2)  43 (26)  14 (2) 
OGD  1 (0)  23 (8)  38 (19) 
ARIMA  8 (5)  17 (7)  37 (33) 
Naive  0 (0)  0 (0)  62 (61) 
SeasonalNaive  0 (0)  0 (0)  62 (62) 
ExpSmoothing  6 (5)  10 (4)  46 (45) 
ADEselectbest  1 (1)  11 (2)  50 (24) 
ADEallmodels  3 (1)  34 (22)  25 (16) 
ADEnoreweight  1 (0)  53 (47)  8 (6) 
ADEv0  1 (1)  46 (31)  15 (9) 
ADEvanilla  5 (1)  2 (0)  55 (34) 
4.3.2 Comparing ADE to its variants
ADE shows a consistent advantage over ADEallmodels (Q4). This suggests that it is indeed worthwhile to prune the ensemble for each prediction (as opposed to combining all the forecasters). ADE's performance is also considerably better relative to ADEselectbest, which gives evidence for the hypothesis that the combination of experts (as opposed to selection) provides better results (Q3). ADE is also superior to ADEvanilla, which bypasses the weighting scheme, directly adjusting the output of the experts according to the predictions of the arbiters.
ADE shows a consistent improvement over the variant that does not perform a sequential reweighting of the experts according to recent correlation (Sect. 3.3.3) (Q5). The magnitude of the difference in performance is small (Fig. 6), which is corroborated by the high number of draws shown in Table 3. However, it is important to note that the sequential reweighting method does not generally compromise performance (only one loss in 62 problems), and improves it several times. Finally, ADE also shows a systematic improvement over its preliminary version (Cerqueira et al. 2017a). Besides not using the sequential reweighting approach, ADEv0 aggregates the output of the experts using a softmax function. We tested this approach in the experimental setup of this work and found that it does not improve the results over a linear transformation.
4.4 Further analyses of ADE
Following the comparison of ADE with the state of the art, in this section we provide a more detailed analysis of its workflow. The goal is to enhance our understanding of how the method works. This analysis encompasses: (1) an analysis of the different possible deployment strategies; (2) a sensitivity analysis on the parameters \(\varOmega \) and \(\lambda \); (3) a scalability analysis in terms of relative computational time; (4) a study on the impact of adding additional experts; and (5) additional analysis of the sequential reweighting method. Unless stated otherwise, the experimental setup is the same as described previously.
4.4.1 Analyzing training strategies
In this section we address the research questions Q4 and Q6. In a dynamic environment it is common to update the model over time, either online or in chunks of observations. Time-dependent data are prone to changes in the underlying distribution, and continuous training of models ensures that one has an up-to-date model. Since ADE rests on two layers of models, we analysed different approaches for updating these and studied their implications in terms of predictive performance.
 M0_Z0:

Both experts (M) and arbiters (Z) are trained on the training data and not updated during test time (ADE as reported in the main experiments);
 M0_Z1:

M is trained only on the training data, but Z is retrained every \(\Delta \) observations;
 M1_Z0:

M is retrained every \(\Delta \) observations, but Z is trained only on the training data;
 M1_Z1:

Both M and Z are retrained every \(\Delta \) observations, which is particularly interesting if the models in M are typical online methods (e.g. ARIMA);
 ADEruntime:

A variant of ADE in which the blocked prequential procedure is not used to obtain out-of-bag samples to increase the data provided to the metalearners. In this scenario, the arbiters are trained only on data obtained at runtime, every \(\Delta \) observations, which is in accordance with the original arbitrating strategy and other metalearning approaches used in time-dependent scenarios (Gama and Kosina 2014). M is fit only on the training data.
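The blocked prequential procedure that ADEruntime forgoes can be sketched as follows. This is an illustrative growing-window sketch under stated assumptions: `fit` and `predict` are placeholders for any learner interface, and the function name is hypothetical.

```python
def blocked_prequential_oob(X, y, fit, predict, b=10):
    """Split the training data into b contiguous blocks; for each block after
    the first, train on all earlier blocks and predict the block. The first
    block only ever serves as training data, so it yields no out-of-bag
    predictions."""
    n = len(y)
    bounds = [round(i * n / b) for i in range(b + 1)]
    oob = []
    for k in range(1, b):
        model = fit(X[:bounds[k]], y[:bounds[k]])
        for i in range(bounds[k], bounds[k + 1]):
            oob.append((i, predict(model, X[i])))
    return oob  # (index, prediction) pairs covering blocks 2..b
```

Running this once per expert yields out-of-bag predictions, and hence loss estimates, for most of the training set, which is the extra data the arbiters are fit on.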
The results are presented in Fig. 8, with a barplot representing the average rank and respective standard deviation of each deployment strategy.
ADE (also denoted as M0_Z0 in this particular analysis) shows a better average rank relative to ADEruntime, which suggests that it is better to get out-of-bag predictions from the available data to improve the fit of the metalearners.
The results also suggest that updating the experts and the arbiters at runtime is better than not updating them. This outcome is expected due to the possible presence of concept drift (Gama et al. 2014). In particular, the M1_Z1 approach presents the best average rank. Although the difference in average rank is negligible, the results also suggest that updating the experts but not the arbiters (M1_Z0) renders a better average rank than the inverted strategy (M0_Z1).
4.4.2 Sensitivity analysis on \(\varOmega \) and \(\lambda \)
In this and the next subsection we answer the research question Q7 regarding the sensitivity analysis of ADE. Besides the setup of experts and arbiters, ADE has two main parameters: \(\varOmega \), which represents the ratio of experts selected at each time step for forecasting; and \(\lambda \), which denotes the window size used to compute the performance of the experts (for selecting which ones to arbitrate).
The results are shown in Fig. 9. The figure comprises two heatmaps, which show the average rank (left) and the respective standard deviation (right) of each (\(\varOmega \), \(\lambda \)) combination across the 33 datasets. A higher average rank (i.e. worse performance) and a higher rank standard deviation are denoted by darker tiles.
Regarding \(\varOmega \), the best performing values are the ones in the middle of the searched distribution. In principle, this parameter depends to a great extent on the number of experts and their predictive ability. The results also suggest that, except for extremely low \(\lambda \) values, fixing \(\varOmega \) and varying \(\lambda \) renders a relatively stable average rank.
The heatmap on the right-hand side suggests that the (\(\varOmega \), \(\lambda \)) combinations with the lowest rank standard deviation are in the middle of the searched distributions.
In principle and in practice, varying the value of \(\lambda \) follows the stability-plasticity dilemma (Carpenter et al. 1991): small values of \(\lambda \) (i.e. a small window of recent observations) lead to greater reactiveness, but also make the model susceptible to outliers. Conversely, higher values lead to greater stability, while losing some responsiveness and possibly retaining outdated information.
4.4.3 Value of additional experts
In the experiments presented in the previous sections ADE was employed with 50 experts (Table 2). In this section we analyse the sensitivity of ADE to different ensemble compositions. Specifically, we tested ensembles with sizes from 5 to 100 in multiples of 5: \(Q = \{5, 10, 15, \dots, 95, 100\}\), rendering a total of 20 different possible ensemble sizes for analysis.
The result of this analysis is presented in Fig. 10. Generally, including more experts in the ensemble leads to better performance, closer to that of the ensemble with 100 models. However, the difference becomes negligible for values above 50. The uncertainty in performance is represented by the vertical bars and is computed as the standard deviation across the Monte Carlo repetitions. This value also becomes increasingly small as more base models are included.
4.4.4 Scalability analysis
The results are presented in Fig. 11 as boxplots. On all problems, ADE takes more time to run than SimpleTrim. The difference between this method and ADE is mostly driven by the fitting and prediction steps of the arbiters. As expected, ADE also takes more time than ARIMA. Being a single model (as opposed to an ensemble), ARIMA also has considerably lower storage requirements than ADE.
In summary, ADE scales worse than both approaches. Although omitted, it also takes more time than the remaining state of the art approaches used earlier (Q8).
4.4.5 Further analyses of the sequential reweighting procedure
In Sect. 3.3.3 we presented an approach for handling the interdependencies among experts during their aggregation. The core arbitrage approach does not explicitly model these interdependencies, and the reweighting approach was designed to overcome this limitation. In the previous section we provided evidence of the benefits of the sequential reweighting approach when applying it to ADE, although the results suggest that the magnitude of its impact is not substantial. Notwithstanding, applying this method does not generally decrease performance, and improves it several times.
Table 4: Paired comparisons showing the impact of the sequential reweighting approach in state of the art methods
Method  Without reweight wins  Draw  With reweight wins 

WindowLoss  4 (2)  31 (24)  27 (24) 
AEC  32 (22)  25 (18)  5 (1) 
EWA  33 (14)  25 (21)  4 (0) 
FixedShare  41 (24)  20 (18)  1 (0) 
MLpol  27 (9)  27 (20)  8 (2) 
OGD  21 (4)  26 (16)  14 (5) 
The results of the first analysis are reported in Table 4, where each approach in the first column is compared with itself when using the sequential reweighting approach. Similarly to Table 3, this table shows paired comparisons of the respective method with and without the application of the sequential reweighting method. In parentheses are the results that occur with at least 95% probability according to the Bayesian correlated t-test.
Besides ADE, the results suggest that the approach is also beneficial to WindowLoss. However, when applied to the other tested approaches its impact vanishes, and it often decreases predictive performance.
Figure 12 shows the results of the second analysis. ADE shows the best average rank across the tested approaches. The average ranks suggest that applying the sequential reweighting procedure improves the predictive performance of the three variants of ADE. Even when accounting for correlation in feature space, the sequential reweighting approach still improves the average rank during expert aggregation.
5 Discussion and future work
5.1 On concept drift
Some of the design decisions behind ADE are based on prior work regarding the variance in relative performance of forecasting models over a time series (Aiolfi and Timmermann 2006) and on potential recurring structures present in the time series. However, there are cases in which time series change into new concepts and both the experts and arbiters may become outdated. Although we do not explicitly cover these scenarios, a possible strategy to address this issue is to track the loss of the ensemble: if its performance decreases beyond some tolerance, new base-learners could be introduced (e.g. Gama and Kosina 2014) or existing ones retrained. Since an arbitration approach provides a modular architecture, models can be added (or removed) as needed. Gama et al. (2014) survey several approaches for concept drift adaptation that could also be adopted.
5.2 On the sequential reweighting procedure
In its preliminary version (Cerqueira et al. 2017a) we argued that one of ADE's limitations was that it did not directly model the interdependencies among experts. We address this issue in this work using a sequential reweighting procedure that controls the redundancy among the outputs of the experts by considering their recent correlation. This approach is independent of ADE. However, its application with ADE is particularly interesting because the reweighting occurs during aggregation and does not compromise ADE's modularity.
Despite the evidence of its benefits, the sequential reweighting approach has room for improvement. Consider the following (rather extreme) example: one expert producing forecasts systematically below the true value by a fixed magnitude, and another expert with similar behaviour but with forecasts above the true value. These two experts are highly correlated, but in fact complement each other greatly. Effectively, using Pearson's correlation as a measure of similarity can be a suboptimal solution in this case. Future work includes the exploration of better similarity functions. A possibly interesting line of enquiry is to follow Brown's work on the study of diversity in classifiers from an information-theoretic perspective (Brown 2009). In particular, instead of measuring the redundancy among experts only according to their outputs, we can also take into consideration the target value, i.e. conditional redundancy.
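The example above can be made concrete with a small numerical illustration (the series values are arbitrary): two experts with opposite constant biases are perfectly correlated, yet their simple average recovers the true values exactly.

```python
def pearson(a, b):
    """Sample Pearson correlation between two forecast series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

truth = [3.0, 5.0, 4.0, 6.0, 5.5]
below = [v - 1.0 for v in truth]   # systematically under-forecasts
above = [v + 1.0 for v in truth]   # systematically over-forecasts

r = pearson(below, above)          # maximal redundancy according to Pearson
avg = [(a + b) / 2 for a, b in zip(below, above)]  # recovers `truth` exactly
```

A reweighting scheme driven purely by this correlation would penalise exactly the pair of experts whose combination is most valuable, which motivates the search for better similarity functions mentioned above.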
Finally, the application of the sequential reweighting approach to other dynamic aggregation methods does not render the same positive effects as seen when it is applied to ADE. We plan to study this issue further in future work.
5.3 On scalability
In the previous section we identified the computational effort required by ADE, relative to other approaches, as its main limitation. In the future we plan to address this issue, eventually adapting the method to a streaming scenario. One possibility is to use a single arbiter (instead of one arbiter for each expert) designed for multi-target regression, i.e. a single regression model that forecasts the errors of all base models. However, for ensembles with a large number of base models this can be cumbersome, given the number of target variables involved.
5.4 Scope of the experimental setup
From a broad perspective, forecasting can be split into different varieties. In this work we focus on univariate time series, assuming that only the variable of interest is available.
We also centre our goal on predicting the next value of the time series, and assume immediate feedback from the environment. However, in many application domains one is often interested in predicting multiple steps into the future. Although we do not evaluate the proposed method in this setting, it can be extended to multi-step forecasting using state of the art approaches for this purpose. We intend to study the application of ADE in these settings in future work.
Finally, as described in Sect. 4.1, we focus on time series with a high sampling frequency, specifically half-hourly, hourly, and daily data. The main reason is that a high sampling frequency is typically associated with more data, which is important for fitting the predictive models. Standard forecasting benchmark data are typically more centred on time series with a low sampling frequency, for example the M competition data (Makridakis et al. 1982).
5.5 Other research lines
We plan to address the previous limitations of ADE by exploring the described potential solutions. Besides these, there are other interesting open research questions. Specifically, we will study ways of quantifying and leveraging the uncertainty of the arbiters regarding the loss that the experts will incur. For example, one could develop an approach in which, when the uncertainty of the output of the arbiters is high, the weights are smoothed. This could be accomplished efficiently using, for example, an infinitesimal jackknife (Wager et al. 2014) (provided random forests are used as arbiters).
We also plan to study the ability of the method, and how it can be adapted, to the timely detection of anomalies, i.e., activity monitoring (Fawcett and Provost 1999). Another interesting analysis could be using ADE in a continual learning setup, where instances for a sequence of tasks are observed over time.
6 Conclusions
In this paper we presented ADE, a dynamic ensemble method. We focused on time series forecasting problems, where the objective is to predict future values of a sequence of observations.
ADE is comprised of a set of forecasting experts pre-trained on the available data. A metalearning approach is used to dynamically estimate the weight factors of these experts at runtime. This is accomplished by having a set of arbiters that model the error of each expert and predict how well each will perform on future observations. The resulting weights are used to obtain the aggregated prediction of the ensemble. This aggregation may temporarily assign zero weight to some experts if their current performance is estimated to be too poor. This suspension decision may be revised in future time steps, thus contributing to the robustness of the approach to regime changes.
We argued that this metalearning approach is useful to better capture recurring changes in the environment, in particular long-range temporal dependencies (e.g. seasonal factors) that short-memory windowing approaches may fail to grasp efficiently.
Our proposal also includes a sequential reweighting approach for modelling the interdependencies among experts. Specifically, this approach is designed to control and reduce the redundancy in the output of the experts during their aggregation. Within the proposed arbitrage approach we also include a procedure for retrieving outofbag observations from the training set. These are used to fit the arbiters, significantly improving the data efficiency of the method.
We carried out an extensive empirical study to better characterise the performance of our proposal. This study has provided clear evidence on the competitiveness of our method in terms of predictive performance when compared to the state of the art. We also discussed its limitations and provided guidelines for solving them in future work. The main point for improvement is the scalability of the method. We plan to address this issue and potentially adapt ADE to streaming or incremental scenarios.
In the interest of reproducible science all methods are publicly available as an R software package.
Footnotes
 1.
tsensembler: on CRAN or at https://github.com/vcerqueira/tsensembler.
 2.
Instructions at: https://github.com/vcerqueira/forecasting_experiments.
Notes
Acknowledgements
The authors acknowledge the insightful comments by the anonymous reviewers, G. Brown, L. Todorovski, N. Moniz, and M. Almeida. This work is financed by the ERDF (European Regional Development Fund) through the Operational Programme for Competitiveness and Internationalisation (COMPETE 2020 Programme) within project "POCI-01-0145-FEDER-006961", and by National Funds through the FCT (Fundação para a Ciência e a Tecnologia, the Portuguese Foundation for Science and Technology) as part of Project UID/EEA/50014/2013. Project "NORTE-01-0145-FEDER-000036" is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). This work was partly funded by the ECSEL Joint Undertaking, the framework programme for research and innovation Horizon 2020 (2014-2020), under Grant Agreement No. 662189 (MANTIS-2014-1).
References
 Aiolfi, M., & Timmermann, A. (2006). Persistence in forecasting performance and conditional combination strategies. Journal of Econometrics, 135(1), 31–53.MathSciNetzbMATHGoogle Scholar
 Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through bayesian analysis. The Journal of Machine Learning Research, 18(1), 2653–2688.MathSciNetzbMATHGoogle Scholar
 Brazdil, P., Carrier, C. G., Soares, C., & Vilalta, R. (2008). Metalearning: Applications to data mining. Berlin: Springer.zbMATHGoogle Scholar
 Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.zbMATHGoogle Scholar
 Brown, G. (2009). An information theoretic perspective on multiple classifier systems. International Workshop on Multiple Classifier Systems (pp. 344–353). Berlin: Springer.Google Scholar
 Brown, G., Wyatt, J., Harris, R., & Yao, X. (2005). Diversity creation methods: A survey and categorisation. Information Fusion, 6(1), 5–20.Google Scholar
 Brown, G., Wyatt, J. L., & Tiňo, P. (2005). Managing diversity in regression ensembles. Journal of Machine Learning Research, 6(Sep), 1621–1650.MathSciNetzbMATHGoogle Scholar
 Carbonell, J., & Goldstein, J. (1998). The use of mmr, diversitybased reranking for reordering documents and producing summaries (pp. 335–336). ACM.Google Scholar
 Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). Artmap: Supervised realtime learning and classification of nonstationary data by a selforganizing neural network. Neural Networks, 4(5), 565–588. https://doi.org/10.1016/08936080(91)90012T.Google Scholar
 Cerqueira, V., Torgo, L., Pinto, F., & Soares, C. (2017). Arbitrated ensemble for time series forecasting. In Joint European conference on machine learning and knowledge discovery in databases (pp. 478–494). Springer.Google Scholar
 Cerqueira, V., Torgo, L., Smailović, J., Mozetič, I. (2017). A comparative study of performance estimation methods for time series forecasting. In proceedings of the 4th international conference on on data science and advanced analytics (pp. 529–538). IEEE. https://doi.org/10.1109/DSAA.2017.7.
 Cerqueira, V., Torgo, L., & Soares, C. (2017). Arbitrated ensemble for solar radiation forecasting. International workconference on artificial neural networks (pp. 720–732). Cham: Springer.Google Scholar
 Cesa-Bianchi, N., & Lugosi, G. (2003). Potential-based algorithms in online prediction and game theory. Machine Learning, 51(3), 239–261.
 Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. New York: Cambridge University Press.
 Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5(4), 559–583.
 Clemen, R. T., & Winkler, R. L. (1986). Combining economic forecasts. Journal of Business and Economic Statistics, 4(1), 39–46.
 Dawid, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147(2), 278–292.
 De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496), 1513–1527.
 Dietterich, T. G., & Bakiri, G. (1991). Error-correcting output codes: A general method for improving multiclass inductive learning programs. In AAAI (pp. 572–577).
 Fawcett, T., & Provost, F. (1999). Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 53–62). ACM.
 Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
 Gaillard, P., & Goude, Y. (2015). Forecasting electricity consumption by aggregating experts; how to design a good set of experts. In Modeling and stochastic learning for forecasting in high dimensions (pp. 95–115). Springer.
 Gaillard, P., & Goude, Y. (2016). opera: Online prediction by expert aggregation. R package version 1.0. https://CRAN.R-project.org/package=opera.
 Gama, J., & Kosina, P. (2014). Recurrent concepts in data streams classification. Knowledge and Information Systems, 40(3), 489–507.
 Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.
 Genre, V., Kenny, G., Meyler, A., & Timmermann, A. (2013). Combining expert forecasts: Can anything beat the simple average? International Journal of Forecasting, 29(1), 108–121.
 Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32(2), 151–178.
 Hyndman, R. (2017). Time series data library. http://data.is/TSDLdemo. Accessed 11 December 2017.
 Hyndman, R. J. (2014). forecast: Forecasting functions for time series and linear models. With contributions from G. Athanasopoulos, S. Razbash, D. Schmidt, Z. Zhou, Y. Khan, C. Bergmeir, and E. Wang. R package version 5.6.
 Jacobs, R. (1995). Methods for combining experts’ probability assessments. Neural Computation, 7(5), 867–888.
 Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
 Jose, V. R. R., & Winkler, R. L. (2008). Simple robust averages of forecasts: Some empirical results. International Journal of Forecasting, 24(1), 163–169.
 Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). kernlab—An S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.
 Kennel, M. B., Brown, R., & Abarbanel, H. D. (1992). Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45(6), 3403.
 Koprinska, I., Rana, M., & Agelidis, V. G. (2011). Yearly and seasonal models for electricity load forecasting. In The 2011 international joint conference on neural networks (IJCNN) (pp. 1474–1481). IEEE.
 Kuhn, M., Weston, S., & Keefer, C. (2014). Cubist: Rule- and Instance-Based Regression Modeling, with C code for Cubist by Ross Quinlan. R package version 0.0.18.
 Kuncheva, L. I. (2004). Classifier ensembles for changing environments. In Multiple classifier systems: 5th international workshop, MCS 2004, Cagliari, Italy, June 9–11, 2004, proceedings (pp. 1–15). Berlin: Springer. https://doi.org/10.1007/978-3-540-25966-4_1.
 Kwiatkowski, D., Phillips, P. C., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1–3), 159–178.
 Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 30 Aug 2017.
 Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., et al. (1982). The accuracy of extrapolation (time series) methods: Results of a forecasting competition. Journal of Forecasting, 1(2), 111–153.
 Mevik, B. H., Wehrens, R., & Liland, K. H. (2016). pls: Partial least squares and principal component regression. R package version 2.6-0. https://CRAN.R-project.org/package=pls.
 Milborrow, S. (2012). earth: Multivariate adaptive regression spline models. Derived from mda:mars by Trevor Hastie and Rob Tibshirani. R package.
 Newbold, P., & Granger, C. W. (1974). Experience with forecasting univariate time series and the combination of forecasts. Journal of the Royal Statistical Society. Series A (General), 137(2), 131–165.
 Ortega, J., Koppel, M., & Argamon, S. (2001). Arbitrating among competing classifiers using learned referees. Knowledge and Information Systems, 3(4), 470–490.
 Pinto, F., Soares, C., & Mendes-Moreira, J. (2016). Chade: Metalearning with classifier chains for dynamic combination of classifiers. In Joint European conference on machine learning and knowledge discovery in databases. Springer.
 R Core Team. (2013). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
 Ridgeway, G. (2015). gbm: Generalized boosted regression models. R package version 2.1.1.
 Rossi, A. L. D., de Leon Ferreira, A. C. P., Soares, C., De Souza, B. F., et al. (2014). MetaStream: A meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing, 127, 52–64.
 Sánchez, I. (2008). Adaptive combination of forecasts with application to wind energy. International Journal of Forecasting, 24(4), 679–693.
 Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980: Proceedings of a symposium held at the University of Warwick 1979/80 (pp. 366–381). Berlin: Springer. https://doi.org/10.1007/BFb0091924.
 Timmermann, A. (2006). Forecast combinations. Handbook of Economic Forecasting, 1, 135–196.
 Timmermann, A. (2008). Elusive return predictability. International Journal of Forecasting, 24(1), 1–18.
 Todorovski, L., & Džeroski, S. (2003). Combining classifiers with meta decision trees. Machine Learning, 50(3), 223–249.
 van Rijn, J. N., Holmes, G., Pfahringer, B., & Vanschoren, J. (2018). The online performance estimation framework: Heterogeneous ensemble learning for data streams. Machine Learning, 107(1), 149–176.
 Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. ISBN 0-387-95457-0.
 Wager, S., Hastie, T., & Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1), 1625–1651.
 Wang, X., Smith-Miles, K., & Hyndman, R. (2009). Rule induction for forecasting method selection: Meta-learning the characteristics of univariate time series. Neurocomputing, 72(10), 2581–2594.
 Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.
 Wolpert, D. H. (2002). The supervised learning no-free-lunch theorems. In R. Roy, M. Köppen, S. Ovaska, T. Furuhashi, & F. Hoffmann (Eds.), Soft computing and industry (pp. 25–42). London: Springer. https://doi.org/10.1007/978-1-4471-0123-9_3.
 Wright, M. N. (2015). ranger: A fast implementation of random forests. R package.
 Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 928–936).