# Online aggregation of unbounded losses using shifting experts with confidence


## Abstract

We develop the setting of sequential prediction based on shifting experts and on a “smooth” version of the method of specialized experts. To aggregate the expert predictions, we use the AdaHedge algorithm, a version of the Hedge algorithm with an adaptive learning rate, and extend it by the Fixed Share meta-algorithm. In this way we combine the advantages of both algorithms: (1) we bound the shifting regret, a more demanding performance measure than ordinary regret; (2) the regret bounds remain valid in the case of signed, unbounded losses of the experts. Also, (3) we incorporate into this scheme a “smooth” version of the method of specialized experts, which allows us to make more flexible and accurate predictions. All results are obtained in the adversarial setting: no assumptions are made about the nature of the data source. We present results of numerical experiments for short-term forecasting of electricity consumption based on real data.

## Keywords

On-line learning · Prediction with expert advice · Unbounded losses · Adaptive learning rate · Algorithm Hedge · Method of mixing past posteriors · Shifting experts · Specialized experts · Confidence level · Short-term prediction of electricity consumption

## 1 Introduction

We consider sequential prediction in the general framework of decision theoretic online learning or the Hedge setting by Freund and Schapire (1997), which is a variant of prediction with expert advice, see e.g. Littlestone and Warmuth (1994), Freund and Schapire (1997), Vovk (1990, 1998) and Cesa-Bianchi and Lugosi (2006).

At each step *t*, the aggregating algorithm maintains a weight for each expert *i*; weight updating is based on exponential weighting with a constant or variable learning rate \(\eta \). The goal of the algorithm is to design weight updates that guarantee that the loss of the aggregating algorithm is never much larger than the loss of the best expert or the best convex combination of the losses of the experts.

Basic notations and definitions

| Notation | Meaning |
|---|---|
| \(\mathbf{l}_t=(l_{1,t},\dots , l_{N,t})\) | loss vector at step \(t\) |
| \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\) | vector of confidence levels at step \(t\) |
| \({\hat{\mathbf{l}}}_t=(\hat{l}_{1,t},\dots , \hat{l}_{N,t})\) | vector of transformed losses |
| \(l_t^-=\min _{1\le i\le N}l_{i,t}\), \(l_t^+=\max _{1\le i\le N}l_{i,t}\) | minimal and maximal loss |
| \(s_t=l_t^+-l_t^-\) | loss range |
| \(\mathbf{q}_t=(q_{1,t},\dots ,q_{N,t})\) | comparison vector at step \(t\) |
| \(\mathbf{w}^{\mu }_t=(w^\mu _{1,t},\dots ,w^\mu _{N,t})\) | experts' weights after the loss update |
| \(\mathbf{w}_t=(w_{1,t},\dots ,w_{N,t})\) | experts' posterior weights at step \(t\) |
| \(\mathbf{w}^*_t=(w^*_{1,t},\dots ,w^*_{N,t})\) | the learner's prediction, where \(w^*_{i,t}=\frac{w_{i,t}p_{i,t}}{\sum _{i=1}^N w_{i,t}p_{i,t}}\) for \(1\le i\le N\) |
| \(h_t=(\mathbf{w}^*_t\cdot \mathbf{l}_t)\) | Hedge loss (dot product of the two vectors) |
| \(m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t \hat{l}_{i,t}}\) | mixloss |
| \(\delta _t=h_t-m_t\) | mixability gap |
| \(\alpha _t\) | Fixed Share parameter (we put \(\alpha _t=\frac{1}{t}\)) |
| \(L_T^-=\sum \limits _{t=1}^T l_t^-\), \(L_T^+=\sum \limits _{t=1}^T l_t^+\) | cumulative minimal and maximal losses |
| \(S_T=\max _{1\le t\le T} s_t\) | maximum loss range |
| \(H_T=\sum \limits _{t=1}^T h_t\) | cumulative loss of the algorithm |
| \(M_T=\sum \limits _{t=1}^T m_t\) | cumulative mixloss |
| \(\varDelta _T=\sum _{t=1}^T\delta _t\) | cumulative mixability gap |
| \(R^{(\mathbf{q})}_T=\sum \limits _{t=1}^T\sum \limits _{i=1}^N q_{i,t} p_{i,t} (h_t-l_{i,t})\) | confidence shifting regret |
| \(\eta _t=\frac{\ln ^*N}{\varDelta _{t-1}}\) | variable learning rate, where \(\ln ^*N=\max \{1,\ln N\}\) (we put \(0/0=0\)) |

A more challenging goal is to learn well when the comparator \(\mathbf{q}\) changes over time, i.e. the algorithm competes with the cumulative sum \(\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)\), where the comparison vector \(\mathbf{q}_t\) changes over time. An important special case is when the \(\mathbf{q}_t\) are unit vectors; then the sequence of trials is partitioned into segments. In each segment the loss of the algorithm is compared to the loss of a particular expert, and this expert changes at the beginning of a new segment. The goal of the aggregation algorithm is to do almost as well as the sum of losses of the experts forming the best partition. Algorithms and bounds for shifting comparators were presented by Herbster and Warmuth (1998). This method, called Fixed Share, was generalized by Bousquet and Warmuth (2002) to the method of Mixing Past Posteriors (MPP), in which arbitrary mixing schemes are considered. In what follows, MPP mixing schemes will be used in our algorithms.

Most papers in the prediction with expert advice setting either consider uniformly bounded losses or assume the existence of a specific loss function (see Vovk 1990; Cesa-Bianchi and Lugosi 2006). But in some practical applications, this assumption is too restrictive. We allow losses at any step to be unbounded and signed. The notion of a specific loss function is not used.

AdaHedge presented by de Rooij et al. (2014) is among a few algorithms that do not have similar restrictions. This algorithm is a version of the classical Hedge algorithm of Freund and Schapire (1997) and is a refinement of the Cesa-Bianchi and Lugosi (2006) algorithm. AdaHedge is completely parameterless and tunes the learning rate \(\eta \) in terms of a direct measure of past performance.
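For concreteness, the core weight and learning-rate updates of AdaHedge can be sketched in a few lines. This is a minimal illustration (uniform prior, losses fed one step at a time), not the exact implementation of de Rooij et al. (2014):

```python
import numpy as np

def adahedge(losses):
    """Minimal AdaHedge sketch: exponential weights over cumulative losses
    with the adaptive learning rate eta_t = ln*(N) / Delta_{t-1}."""
    T, N = losses.shape
    ln_n = max(1.0, np.log(N))     # ln* N
    L = np.zeros(N)                # cumulative expert losses
    Delta = 0.0                    # cumulative mixability gap Delta_t
    H = 0.0                        # cumulative loss of the algorithm
    for t in range(T):
        eta = np.inf if Delta == 0 else ln_n / Delta
        if np.isinf(eta):
            w = (L == L.min()).astype(float)   # follow-the-leader limit
        else:
            w = np.exp(-eta * (L - L.min()))   # shift exponent for stability
        w /= w.sum()
        l = losses[t]
        h = w @ l                              # Hedge loss h_t
        if np.isinf(eta):
            m = l[L == L.min()].min()          # mixloss in the eta -> inf limit
        else:
            m = l.min() - np.log(w @ np.exp(-eta * (l - l.min()))) / eta
        Delta += h - m                         # mixability gap delta_t >= 0
        H += h
        L += l
    return H

# two experts with constant losses 0 and 1: the algorithm quickly
# concentrates on the better expert, so its cumulative loss stays small
H = adahedge(np.tile([0.0, 1.0], (100, 1)))
```

Note that no loss range or horizon is supplied: the learning rate is tuned purely from the accumulated mixability gap.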

In the case where losses of the experts are uniformly bounded the upper bound (2) takes the form \(O(\sqrt{T\ln N})\).

We emphasize that the versions of the Fixed Share and MPP algorithms presented by Herbster and Warmuth (1998) and Bousquet and Warmuth (2002) use a constant learning rate, while AdaHedge uses an adaptive learning rate that is tuned online.

The first contribution of this paper is the ConfHedge-1 algorithm, which combines the advantages of both of these algorithms: (1) we bound the shifting regret, a more demanding performance measure than ordinary regret; (2) the regret bounds remain valid in the case of signed unbounded losses of the experts.

The application we consider below, sequential short-term (one-hour-ahead) forecasting of electricity consumption, takes place in a variant of the basic problem of prediction with expert advice called prediction with specialized (or sleeping) experts. At each round only some of the experts output a prediction while the others are inactive. Each expert is expected to provide accurate forecasts mostly under given external conditions, which can be known beforehand. For instance, in the case of the prediction of electricity consumption, experts can be specialized to a season, to a temperature range, to working days, or to public holidays, etc.

The method of specialized experts was first proposed by Freund et al. (1997) and further developed by Adamskiy et al. (2012), Chernov and Vovk (2009), Devaine et al. (2013), Kalnishkan et al. (2015). With this approach, at each step *t*, a set of specialized experts \(E_t\subseteq \{1,\dots , N\}\) is given. A specialized expert *i* issues its forecasts not at all steps \(t=1,2,\dots \), but only when \(i\in E_t\). At any step, the aggregating algorithm uses forecasts of only “active (non-sleeping)” experts.

The second contribution of this paper is that we have incorporated into ConfHedge-1 a smooth generalization of the method of specialized experts. At each time moment *t*, we complement the expert *i* forecast by a confidence level which is a real number \(p_{i,t}\in [0,1]\).

The setting of prediction with experts that report their confidences as a number in the interval [0, 1] was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007), Gaillard et al. (2011), Gaillard et al. (2014).

In particular, \(p_{i,t}=1\) means that the expert forecast is used in full, whereas in the case of \(p_{i,t}=0\) it is not taken into account at all (the expert sleeps). In cases where \(0<p_{i,t}<1\) the expert's forecast is partially taken into account. For example, with a gradual drop in temperature, the corresponding specialized expert gradually loses its ability to make accurate predictions of electricity consumption. The dependence of \(p_{i,t}\) on the values of exogenous parameters can be predetermined by a specialist in the domain or can be constructed using regression analysis on historical data.

In Sect. 2, we present the ConfHedge-1 algorithm, which is a loss allocation algorithm adapted for the case, where the losses of the experts can be signed and unbounded. Also, this algorithm takes into account the confidence levels of the experts predictions. In Sect. 3.2, ConfHedge-2 variant of this algorithm is presented for the case when experts make forecasts and calculate their losses using a convex loss function.

In Theorem 1 we present the upper bounds for the shifting regret of these algorithms. The proof of this theorem is given in Sect. A. Some details of the proof from de Rooij et al. (2014) are presented as a supplementary material in Sect. B. All results are obtained in the adversarial setting and no assumptions are made about the nature of data source.

In Sect. 3.3, the techniques of confidence level selection and experts training are presented. We also present the results of numerical experiments of the short-term prediction of electricity consumption with the use of the proposed algorithms.

The approach that sets confidence levels for expert predictions of electricity consumption is more general than the approach of Devaine et al. (2013), which uses “sleeping” experts. In our numerical experiments the aggregating algorithm with soft confidence levels outperforms other versions of aggregating algorithms, including those which use sleeping experts.

## 2 Online loss allocation algorithm

In this section we present an algorithm for the optimal online allocation of unbounded signed losses of the experts. In Sect. 3.2, a variant of this algorithm will be presented for the case when experts make forecasts and calculate their losses using a convex loss function.

We assume that at each step *t*, along with the losses \(l_{i,t}\) of the experts, their confidence levels are given as a vector \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\), where \(p_{i,t}\in [0,1]\) for \(1\le i\le N\). We assume that \(\Vert \mathbf{p}_t\Vert _1>0\) for all *t*.

The confidence level \(p_{i,t}\) shows the degree to which we take expert *i*'s prediction into account. We define the auxiliary virtual loss of expert *i* as a random variable equal to \(l_{i,t}\) with probability \(p_{i,t}\) and to \(h_t\) with probability \(1-p_{i,t}\); the transformed loss \(\hat{l}_{i,t}=p_{i,t}l_{i,t}+(1-p_{i,t})h_t\) is its expectation with respect to the probability distribution \(\mathbf{p}_{i,t}=(p_{i,t},1-p_{i,t})\).

At any step *t* we use the weights \(w_{i,t}\) of the experts \(1\le i\le N\) which were computed at the previous step. The algorithm loss is defined as \(h_t=\sum \limits _{i=1}^N w_{i,t}\hat{l}_{i,t}\).^{1}

Also, we compute the mixloss \( m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t\hat{l}_{i,t}} \) and the mixability gap \(\delta _t=h_t-m_t\), which are used in the construction of the algorithm.

By the method MPP of Bousquet and Warmuth (2002), a mixing scheme is defined by a vector \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots , \beta ^{t+1}_t)\), where \(\sum \limits _{s=0}^t\beta ^{t+1}_s=1\) and \(\beta ^{t+1}_s\ge 0\) for \(0\le s\le t\).

In what follows the vector \(\mathbf{w}^\mu _t=(w^\mu _{1,t},\dots ,w^\mu _{N,t})\) presents the normalized experts weights at step *t*. The corresponding posterior probability distribution \(\mathbf{w}_{t+1}=(w_{1,t+1},\dots ,w_{N,t+1})\) for step \(t+1\) is defined as a convex combination \(\mathbf{w}_{t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _s\) with weights \(\beta ^{t+1}_s\), \(0\le s\le t\), where \(\mathbf{w}^\mu _s=(w^\mu _{1,s},\dots ,w^\mu _{N,s})\).

The vector \(\beta ^{t + 1}\) defines the weights by which the past distributions of experts are mixed. It can be re-set at each step *t*.
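A single mixing step is just a convex combination of stored past posteriors. The sketch below uses hypothetical weight vectors and the Fixed Share style choice \(\beta ^{t+1}_t=1-\alpha \), \(\beta ^{t+1}_0=\alpha \) with \(\alpha _t=1/t\), one admissible scheme among many:

```python
import numpy as np

def mix_past_posteriors(past_mu, beta):
    """MPP step: w_{t+1} = sum_s beta_s * w^mu_s for s = 0..t,
    where beta is a probability vector over the stored posteriors."""
    past_mu, beta = np.asarray(past_mu), np.asarray(beta)
    assert np.isclose(beta.sum(), 1.0) and (beta >= 0).all()
    return beta @ past_mu        # convex combination of past posteriors

# hypothetical stored posteriors w^mu_0 .. w^mu_4 for N = 4 experts
past_mu = [[0.25, 0.25, 0.25, 0.25],   # w^mu_0: initial distribution
           [0.40, 0.30, 0.20, 0.10],
           [0.60, 0.20, 0.10, 0.10],
           [0.70, 0.10, 0.10, 0.10],
           [0.80, 0.10, 0.05, 0.05]]
t = 4
alpha = 1.0 / t                        # alpha_t = 1/t as in the paper
beta = np.zeros(t + 1)
beta[0], beta[t] = alpha, 1.0 - alpha  # mix current weights with w^mu_0
w_next = mix_past_posteriors(past_mu, beta)
```

Since the mixture of probability vectors is again a probability vector, `w_next` is a valid posterior for step \(t+1\).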

**Algorithm ConfHedge-1**

Put \(w_{i,1}=w^\mu _{i,0}=\frac{1}{N}\) for \(i=1,\dots , N\), \(\varDelta _0=0\), \(\eta _1=\infty \).

FOR \(t=1,\dots ,T\)

1. Receive confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\) of the experts \(1\le i\le N\), where \(\Vert \mathbf{p}_t\Vert _1>0\).
2. Predict with the distribution \(\mathbf{w}^*_t=(w^*_{1,t},\dots ,w^*_{N,t})\), where \(w^*_{i,t}=\frac{w_{i,t}p_{i,t}}{\sum _{i=1}^N w_{i,t}p_{i,t}}\) for \(1\le i\le N\).
3. Receive a vector \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\) containing the losses of the experts.
4. Compute the loss \(h_t=(\mathbf{l}_t\cdot \mathbf{w}^*_t)\) of the algorithm.
5. Update the weights and the learning rate in three stages:
   - *Loss Update.* Define \(w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-h_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-h_t)}}\) for \(1\le i\le N\).
   - *Mixing Update.* Choose a mixing scheme \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)\) and define the future weights of the experts \(w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}\) for \(1\le i\le N\).
   - *Learning Rate Update.* Define the mixloss \(m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})h_t)}\). Let \(\delta _t=h_t-m_t\) and \(\varDelta _t=\varDelta _{t-1}+\delta _t\). Define the learning rate \(\eta _{t+1}=\ln ^*N/\varDelta _t\) for use at the next step \(t+1\).

ENDFOR
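The pseudocode above can be transcribed into a short Python sketch. For concreteness it hard-codes the Fixed Share style mixing with the initial distribution and \(\alpha _t=1/t\); this is an illustration of the scheme, not a reference implementation:

```python
import numpy as np

def conf_hedge_1(losses, confidences):
    """ConfHedge-1 sketch: losses[t] and confidences[t] are length-N arrays
    of (possibly unbounded, signed) losses l_{i,t} and levels p_{i,t}."""
    T, N = losses.shape
    ln_n = max(1.0, np.log(N))           # ln* N
    w = np.full(N, 1.0 / N)              # posterior weights w_t
    w_mu0 = np.full(N, 1.0 / N)          # initial distribution w^mu_0
    Delta, eta, H = 0.0, np.inf, 0.0
    for t in range(1, T + 1):
        p, l = confidences[t - 1], losses[t - 1]
        w_star = w * p / (w * p).sum()   # prediction w*_t
        h = w_star @ l                   # algorithm loss h_t
        H += h
        lhat = p * l + (1 - p) * h       # transformed losses
        # Loss Update (eta_1 = infinity puts all mass on the argmin)
        if np.isinf(eta):
            w_mu = (lhat == lhat.min()) * w
            m = lhat.min()
        else:
            w_mu = w * np.exp(-eta * (lhat - lhat.min()))
            m = lhat.min() - np.log(w @ np.exp(-eta * (lhat - lhat.min()))) / eta
        w_mu = w_mu / w_mu.sum()
        # Mixing Update: Fixed Share style, mix with the initial distribution
        alpha = 1.0 / t
        w = (1 - alpha) * w_mu + alpha * w_mu0
        # Learning Rate Update
        Delta += h - m                   # delta_t = h_t - m_t >= 0
        eta = ln_n / Delta if Delta > 0 else np.inf
    return H

# two experts with constant losses -1 and +1 and full confidence:
# the algorithm's cumulative loss approaches that of the better expert
losses = np.tile([-1.0, 1.0], (50, 1))
H = conf_hedge_1(losses, np.ones((50, 2)))
```

With all confidence levels equal to one the transformed losses coincide with the actual losses, and the sketch reduces to a Fixed Share variant of AdaHedge.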

We have \(m_t\le h_t\) by Jensen's inequality (convexity of the exponential function); hence \(\delta _t\ge 0\) and \(\varDelta _t\le \varDelta _{t+1}\) for all *t*.

We will use the following mixing schemes by Bousquet and Warmuth (2002):

### Example 1

(Fixed Share to the start vector.) Put \(\beta ^{t+1}_t=1-\alpha _t\), \(\beta ^{t+1}_0=\alpha _t\), and \(\beta ^{t+1}_s=0\) for \(0<s<t\): at each step *t*, the current weights are mixed with the initial (uniform) distribution.

### Example 2

(Uniform Past.) Put \(\beta ^{t+1}_t=1-\alpha _t\) and \(\beta ^{t+1}_s=\frac{\alpha _t}{t}\) for \(0\le s<t\): at each step *t*, for each expert *i*, the current weights are mixed uniformly with the weights \(w^\mu _{i,s}\) of all past steps *s*.

Bousquet and Warmuth (2002) considered the notion of shifting regret with respect to a sequence \(\mathbf{q}_1,\mathbf{q}_2,\dots ,\mathbf{q}_T\) of comparison vectors: \(R_T=H_T-\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)\).^{2}

The confidence shifting regret \(R^{(\mathbf{q})}_T\) defined above generalizes this notion: if \(p_{i,t}=1\) for all *i* and *t*, then \(R^{(\mathbf{q})}_T=R_T\).

Assume for a moment that the losses of the experts are uniformly bounded for all *i* and *t*. Using the techniques of Sect. A for \(\eta _t\sim \sqrt{\frac{\ln ^*N}{t}}\), one can prove the shifting regret bound (6), where *k* is the number of switches of the comparison vectors \(\mathbf{q}_t\) on the time interval \(1\le t\le T\).^{3}

Our goal is to obtain a similar bound in the absence of boundness assumptions for the expert losses. Let the mixing scheme of Example 1 be used.

The following theorem presents the upper bounds for the confidence shifting regret in the case where no assumptions are made about boundness of the losses of the experts.

### Theorem 1

For any *T* and for any sequence \(\mathbf{q}_1,\dots ,\mathbf{q}_T\) of comparison vectors, the regret bounds (7) and (8) hold, where *k* is the number of switches of the comparison vectors on the time interval \(1\le t\le T\).

The bound (7) is an analogue for the shifting experts of the bound from Cesa-Bianchi et al. (2007) and the bound (8) is an analogue of the bound (16) of Theorem 8 from de Rooij et al. (2014). Proof of Theorem 1 is given in Sects. A and B.

A disadvantage of the bounds (8) and (9) below is the presence of a term that depends quadratically on the number *k* of switches. Whether such a dependence is necessary is an open question. However, this term does not depend on the loss of the algorithm; it has only a slowly growing multiplicative factor \(O(\ln ^2 T)\). Corollary 1 below shows that in some special cases this dependence can be eliminated.

The bound (8) of Theorem 1 can be simplified in different ways:

### Corollary 1

For any *T* and for any sequence \(\mathbf{q}_1,\dots ,\mathbf{q}_T\) of comparison vectors, the simplified bounds (9)–(11) hold.

The bound (10) linearly depends on the number of switches (for the proof see Sect. B). The bound (11) follows from (10). If the losses of the experts are uniformly bounded then the bound (11) is of the same order as the bound (6).

An important special case of Theorem 1 is when the comparison vectors \(\mathbf{q}_t=\mathbf{e}_{i_t}\) are unit vectors and \(p_{i,t}\in \{0,1\}\), i.e., the specialists case is considered for a composite expert \(i_1,\dots ,i_T\). Then the confidence shifting regret equals \(R^{(\mathbf{q})}_T=\sum \limits _{t\le T:\,p_{i_t,t}=1} (h_t-l_{i_t,t})\) and the corresponding differences in the right-hand side of inequality (8) are \(L^+_T-L^{(\mathbf{q}-)}_T=\sum \limits _{t\le T:\,p_{i_t,t}=1} (l^+_t-l_{i_t,t})+ \sum \limits _{t\le T:\,p_{i_t,t}=0} s_t\) and \(L^{(\mathbf{q}+)}_T-L^-_T=\sum \limits _{t\le T:\,p_{i_t,t}=1} (l_{i_t,t}-l^-_t) + \sum \limits _{t\le T:\,p_{i_t,t}=0} s_t\).

The bound (10) is important if the algorithm is to be used for a scenario in which we are provided with a sequence of gain vectors \(\mathbf{g}_t\) rather than losses: we can transform these gains into losses using \(\mathbf{l}_t=-\mathbf{g}_t\), and then run the algorithm. Assume that \(p_{i,t}=1\) for all *i* and *t*. The bound then implies that we incur small regret with respect to a composite expert if it has very small cumulative gain relative to the minimum gain (see also de Rooij et al. 2014).

Similar bounds can also be obtained for the mixing scheme of Example 2, where \(\gamma _{k,T}=(2k+3)\ln T+(k+2)\) (see Sect. A).

## 3 Numerical experiments

Section 3.1 presents the results of applying ConfHedge-1 to synthetic data. In Sect. 3.3 the results of the short-term prediction of electricity consumption are presented. In these experiments we use the ConfHedge-2 algorithm, a variant of the previous algorithm adapted for the case where the experts present numerical forecasts. The scheme of this algorithm is given in Sect. 3.2.

### 3.1 Unbounded signed losses

In this experiment, the one-step losses of three experts are synthetic unbounded signed values with *N*(0, 1) additive noise. Confidence levels of all experts are always equal to one. Figure 1a shows mean values of these one-step expert losses. Figure 1b shows the cumulative losses of the three individual experts and the cumulative losses of AdaHedge and ConfHedge-1. These experiments show that ConfHedge-1 is not inferior to AdaHedge and, after some time, even outperforms it.

### 3.2 Aggregation of expert forecasts

In this section we suppose that the losses of the experts are computed using a loss function \(\lambda (\omega ,\gamma )\) that is convex in \(\gamma \), where \(\omega \) is an outcome and \(\gamma \) is a forecast. Outcomes can belong to an arbitrary set; forecasts form a linear space.^{4}

At each step *t*, the experts' forecasts \(\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})\) and their confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\) are given, where \(p_{i,t}\in [0,1]\) for all \(1\le i\le N\). Define the auxiliary virtual forecast \(\tilde{c}_{i,t}\) of expert *i* as a random variable equal to \(c_{i,t}\) with probability \(p_{i,t}\) and to \(\gamma _t\) with probability \(1-p_{i,t}\), where \(\gamma _t\) is the forecast of the aggregating algorithm defined below. The expectation of the expert *i* virtual forecast is equal to \(\hat{c}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{c}_{i,t}]=p_{i,t}c_{i,t}+(1-p_{i,t})\gamma _t\).
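A quick numerical check of this definition (with hypothetical weights, forecasts, and confidence levels): the aggregated forecast \(\gamma _t=\frac{\sum _i p_{i,t}w_{i,t}c_{i,t}}{\sum _i p_{i,t}w_{i,t}}\) computed by the algorithm below is exactly the \(\mathbf{w}_t\)-average of the virtual forecasts, the fixed-point property this definition relies on:

```python
import numpy as np

# hypothetical weights, forecasts and confidence levels for N = 3 experts
w = np.array([0.5, 0.3, 0.2])      # posterior weights w_{i,t}
c = np.array([10.0, 12.0, 20.0])   # expert forecasts c_{i,t}
p = np.array([1.0, 0.5, 0.0])      # confidence levels p_{i,t}

# aggregated forecast of the algorithm below
gamma = (p * w * c).sum() / (p * w).sum()

# virtual forecasts c-hat_{i,t} = p*c + (1-p)*gamma ...
c_hat = p * c + (1 - p) * gamma
# ... whose w-average reproduces gamma
assert np.isclose((w * c_hat).sum(), gamma)
```

Note that the sleeping expert (\(p_{3,t}=0\)) contributes the aggregated forecast itself as its virtual forecast, so it neither helps nor hurts.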

**Algorithm ConfHedge-2**

Define \(w_{i,1}=w^\mu _{i,0}=\frac{1}{N}\) for \(i=1,\dots , N\), \(\varDelta _0=0\), \(\eta _1=\infty \).

FOR \(t=1,\dots ,T\)

1. Receive the expert forecasts \(\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})\) and their confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\).
2. Compute the aggregating algorithm forecast \(\gamma _t=\frac{\sum _{i=1}^N p_{i,t}w_{i,t} c_{i,t}}{\sum _{i=1}^N p_{i,t}w_{i,t}}\).
3. Receive an outcome \(\omega _t\) and compute the experts' losses \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\), where \(l_{i,t}=\lambda (\omega _t,c_{i,t})\), \(1\le i\le N\), and the algorithm loss \(a_t=\lambda (\omega _t,\gamma _t)\).
4. Update the experts' weights and the learning rate in three stages:
   - *Loss Update.* Define \(w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-a_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-a_t)}}\) for \(1\le i\le N\).
   - *Mixing Update.* Choose a mixing scheme \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)\) and define the future experts' weights \(w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}\) for \(1\le i\le N\).
   - *Learning Rate Update.* Compute the mixloss \(m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)}\). Define \(\delta _t=h_t-m_t\), where \(h_t=\sum _{i=1}^N w_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)\), and \(\varDelta _t=\varDelta _{t-1}+\delta _t\). Set \(\eta _{t+1}=\ln ^*N/\varDelta _t\) as the learning rate for the next step.

ENDFOR
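As with ConfHedge-1, the scheme admits a direct transcription. The sketch below uses the absolute loss \(\lambda (\omega ,\gamma )=|\omega -\gamma |\) and the same Fixed Share style mixing with \(\alpha _t=1/t\); it is illustrative, not a reference implementation:

```python
import numpy as np

def conf_hedge_2(outcomes, forecasts, confidences):
    """ConfHedge-2 sketch with the absolute loss |omega - gamma|."""
    T, N = forecasts.shape
    ln_n = max(1.0, np.log(N))              # ln* N
    w = np.full(N, 1.0 / N)                 # posterior weights w_t
    w_mu0 = w.copy()                        # initial distribution w^mu_0
    Delta, eta, A = 0.0, np.inf, 0.0
    for t in range(1, T + 1):
        c, p = forecasts[t - 1], confidences[t - 1]
        gamma = (p * w * c).sum() / (p * w).sum()  # aggregated forecast
        omega = outcomes[t - 1]
        l = np.abs(omega - c)               # expert losses l_{i,t}
        a = abs(omega - gamma)              # algorithm loss a_t
        A += a
        lhat = p * l + (1 - p) * a          # transformed losses
        h = w @ lhat                        # Hedge loss h_t
        if np.isinf(eta):                   # eta_1 = infinity
            w_mu = (lhat == lhat.min()) * w
            m = lhat.min()
        else:
            w_mu = w * np.exp(-eta * (lhat - lhat.min()))
            m = lhat.min() - np.log(w @ np.exp(-eta * (lhat - lhat.min()))) / eta
        w_mu = w_mu / w_mu.sum()
        w = (1 - 1.0 / t) * w_mu + (1.0 / t) * w_mu0   # Fixed Share mixing
        Delta += h - m
        eta = ln_n / Delta if Delta > 0 else np.inf
    return A

# one expert forecasts the outcome exactly, the other is off by 2:
# the cumulative loss A_T of the algorithm stays small
outcomes = np.arange(50, dtype=float) % 5.0
forecasts = np.stack([outcomes, outcomes + 2.0], axis=1)
A = conf_hedge_2(outcomes, forecasts, np.ones((50, 2)))
```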

Let \(A_T=\sum _{t=1}^T a_t\) be the loss of ConfHedge-2. We keep the notation \(H_T=\sum _{t=1}^T h_t\) and \(L^{(\mathbf{q})}_T=\sum _{t=1}^T (\mathbf{q}_t\cdot {\hat{\mathbf{l}}}_t)\). Theorem 1 also holds for these quantities. Hence, using the same notation as in Sect. 2, we obtain the bound (16) and \(H_T-L^{(\mathbf{q})}_T\le \gamma _{k,T}\varDelta _T\).

Since, by convexity of the loss function, \(a_t=\lambda (\omega _t,\gamma _t)=\lambda (\omega _t,\sum _{i=1}^N w_{i,t}\hat{c}_{i,t})\le \sum _{i=1}^N w_{i,t}\lambda (\omega _t,\hat{c}_{i,t})\le \sum _{i=1}^N w_{i,t}\hat{l}_{i,t}=h_t\) for all *t*, we have \(A_T\le H_T\).

### 3.3 The electrical loads forecasting

The second group of numerical experiments was performed with the contest data of the GefCom2012 competition conducted on the Kaggle platform (Hong et al. 2014). The main objective of this competition was to predict the daily course of hourly electrical loads (demand values for electricity) in 20 regions according to temperature records at 11 meteorological stations. Databases are available at http://www.kaggle.com/datasets. The basic data were provided in the form of the table “temperature-history” with archive records of temperature monitoring at 11 meteorological stations and the table “load-history” with hourly electrical load data recorded at 20 power distribution stations of the region for the period from 01.01.2004 to 30.06.2008. Additional calendar information (seasons, days of the week, and working days vs. holidays) could also be used.

Figure 2 shows the averaged curves of the daily electrical loads for each of the four seasons of 2004–2005 in the selected network. We see that the course of the averaged curves clearly depends on the time of day, and also varies from season to season. In addition, the working day and weekend day patterns demonstrate distinct differences in the level of electricity usage. Based on this figure, a simple scheme of forming an ensemble of experts, i.e., specialized algorithms that can only process strictly defined data, was chosen; the scheme includes the following categories: four times of day (night, morning, day, evening); working days and weekend days (two categories); four seasons (winter, spring, summer, fall), all this giving \(4\times 2\times 4=32\) specialized experts (Stepwise Linear Regression). We also use extra four experts, each of which is focused on one of the seasons of the year, and one nonsleeping expert (Random Forest algorithm). Thus, we used a total of 37 experts.

At each moment of time, the confidence function of a given expert is calculated as a product of the confidence functions for each of its specializations. For example, Fig. 3 shows the stages of constructing the confidence function for the expert focused on night forecasting (0–6 a.m.) on the working days of January. Thus synthesized confidence functions are used to form individual training samples for each expert at the stage of training and to aggregate expert forecasts at the stage of testing.
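One possible way to synthesize such a product confidence function is sketched below. The trapezoidal membership shapes and the night/working-day expert are hypothetical illustrations; the exact shapes used in the experiments are those of Fig. 3:

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear ramps."""
    return float(np.clip(min((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0))

def confidence_night_working(hour, is_working_day):
    """Hypothetical confidence of an expert specialized to night hours
    (0-6 a.m.) on working days: the product of the per-specialization
    confidence functions."""
    p_night = trapezoid(hour, -2.0, 0.0, 6.0, 8.0)  # smooth night window
    p_work = 1.0 if is_working_day else 0.0         # hard day-type membership
    return p_night * p_work
```

At 3 a.m. on a working day this expert is fully trusted; at 7 a.m. its forecast enters the aggregation with weight 0.5; on a weekend it sleeps.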

To compare the scheme of “smooth mixing” with the scheme of “sleeping experts”, the experiments on expert decision aggregation were performed in two stages. First, only the scheme of mixing the sleeping and awake experts was used, i.e., the confidence level took only two values (0 or 1), and then the mixing algorithm from Sect. 2 of this work was used.

The evolution of the differences of cumulative losses \(L^1_T-L^3_T\) and \(L^2_T-L^3_T\), where \(L^1_T\) is the cumulative loss of the anytime nonsleeping Random Forest algorithm and \(L^2_T\), \(L^3_T\) are the cumulative losses of the two mixing schemes (“sleeping experts” and “smooth mixing”), is shown in Fig. 4a.

The mean cumulative losses (Mean Absolute Error, MAE) \(\frac{1}{T}L^1_T\) of the Random Forest algorithm and \(\frac{1}{T}L^2_T\), \(\frac{1}{T}L^3_T\) of the two schemes of expert mixing are shown in Fig. 4b.^{5}

In this experiment, the “smooth mixing” algorithm outperforms the aggregating algorithm using “sleeping experts”, and both of these algorithms outperform the anytime Random Forest forecasting algorithm.

## 4 Conclusion

In this paper we extend the AdaHedge algorithm by de Rooij et al. (2014) for a case of shifting experts and for a smooth version of the method of specialized experts, where at any time moment each expert’s forecast is provided with a confidence level which is a number between 0 and 1.

To aggregate the experts' predictions, we use methods of shifting experts and the AdaHedge algorithm with an adaptive learning rate. In this way, we combine the advantages of both algorithms. We bound the shifting regret, a more demanding performance measure than ordinary regret, and we do not impose restrictions on the expert losses. Also, we incorporate into this scheme a smooth version of the method of specialized experts in the spirit of Blum and Mansour (2007), which allows us to make more flexible and accurate predictions.

We obtained the new upper bounds for the regret of our algorithms, which generalize similar upper bounds for the case of specialized experts.

A disadvantage of Theorem 1 and of Corollary 1 is the asymmetry of the bounds (9) and (10): the first of them has a term that depends quadratically on the number *k* of switches. Whether such a dependence is necessary is an open question.

All results are obtained in the adversarial setting, no assumptions are made about the nature of data source.

We present the results of numerical experiments on short-term forecasting of electricity consumption based on real data. In these experiments, the “smooth mixing” algorithm outperforms the aggregating algorithm with “sleeping experts”, and both of these algorithms outperform the anytime Random Forest forecasting algorithm.

## Footnotes

- 1.
In the simple Hedge we put \(w_{i,t+1}=w^\mu _{i,t}\). Some other mixing schemes will be given below.

- 2.
The notion of regret with respect to a comparison vector was first defined by Kivinen and Warmuth (1999).

- 3.
Whether this bound is tight is an open question. Some lower bounds for mixloss (for the logarithmic loss function with the learning rate \(\eta =1\)) were obtained by Adamskiy et al. (2012). They show an information-theoretic lower bound for mixloss that must hold for any algorithm, and which is tight for Fixed Share.

- 4.
In our experiments, the absolute loss function \(\lambda (\omega ,\gamma )=|\omega -\gamma |\) was used, where \(\omega \) and \(\gamma \) are real numbers. In practical applications, we can also use its biased variant \(\lambda (\omega ,\gamma )=\mu _1|\omega -\gamma |_{-}+\mu _2|\omega -\gamma |_{+}\), where \(|r|_{-}=-\min \{0,r\}\) and \(|r|_{+}=\max \{0,r\}\). The positive numbers \(\mu _1\) and \(\mu _2\) provide a balance of losses between the deviations of the forecasts \(\gamma \) and outcomes \(\omega \) in the positive and negative directions.

- 5.
The absolute loss function was used in these experiments.

- 6.
Mixloss is a very useful intermediate concept: its cumulative variant is less than or equal to the cumulative loss of the best expert (up to a small term), and, on the other hand, the cumulative mixloss is close to the cumulative loss of the aggregating algorithm. For the logarithmic loss function, the mixloss coincides with the loss of the Vovk aggregating algorithm (see Adamskiy et al. 2012; Cesa-Bianchi and Lugosi 2006; de Rooij et al. 2014).

## Notes

### Acknowledgements

This paper is an extended version of the conference paper V’yugin (2017). This work was supported by Russian Science Foundation, project 14-50-00150.

## References

- Adamskiy, D., Koolen, W. M., Chernov, A., & Vovk, V. (2012). A closer look at adaptive regret. In N. H. Bshouty, G. Stoltz, N. Vayatis, & T. Zeugmann (Eds.), *Algorithmic learning theory. ALT 2012. Lecture notes in computer science* (Vol. 7568). Berlin, Heidelberg: Springer.
- Blum, A., & Mansour, Y. (2007). From external to internal regret. *Journal of Machine Learning Research*, *8*, 1307–1324.
- Bousquet, O., & Warmuth, M. (2002). Tracking a small set of experts by mixing past posteriors. *Journal of Machine Learning Research*, *3*, 363–396.
- Chernov, A., & Vovk, V. (2009). Prediction with expert evaluators' advice. In R. Gavaldà, G. Lugosi, T. Zeugmann, & S. Zilles (Eds.), *Proceedings of the twentieth international conference on algorithmic learning theory. Lecture notes in computer science* (Vol. 5809, pp. 8–22). Berlin: Springer.
- Cesa-Bianchi, N., & Lugosi, G. (2006). *Prediction, learning, and games*. Cambridge: Cambridge University Press.
- Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. *Machine Learning*, *66*(2/3), 321–352.
- de Rooij, S., van Erven, T., Grunwald, P., & Koolen, W. (2014). Follow the leader if you can, hedge if you must. *Journal of Machine Learning Research*, *15*, 1281–1316.
- Devaine, M., Gaillard, P., Goude, Y., & Stoltz, G. (2013). Forecasting electricity consumption by aggregating specialized experts. *Machine Learning*, *90*(2), 231–260.
- Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. *Journal of Computer and System Sciences*, *55*, 119–139.
- Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In *Proceedings of the 29th annual ACM symposium on theory of computing* (pp. 334–343).
- Gaillard, P., Goude, Y., & Stoltz, G. (2011). A further look at the forecasting of the electricity consumption by aggregation of specialized experts. Technical report. pierre.gaillard.me/doc/GaGoSt-report.pdf.
- Gaillard, P., Stoltz, G., & van Erven, T. (2014). A second-order bound with excess losses. *JMLR: Workshop and Conference Proceedings*, *35*, 176–196.
- Herbster, M., & Warmuth, M. (1998). Tracking the best expert. *Machine Learning*, *32*(2), 151–178.
- Hong, T., Pinson, P., & Fan, S. (2014). Global energy forecasting competition 2012. *International Journal of Forecasting*, *30*(2), 357–363.
- Kalnishkan, Y., Adamskiy, D., Chernov, A., & Scarfe, T. (2015). Specialist experts for prediction with side information. In *IEEE international conference on data mining workshop (ICDMW)* (pp. 1470–1477). IEEE.
- Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. In P. Fischer & H. U. Simon (Eds.), *Computational learning theory: 4th European conference (EuroCOLT '99)* (pp. 153–167). Springer.
- Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. *Information and Computation*, *108*, 212–261.
- Vovk, V. (1990). Aggregating strategies. In M. Fulk & J. Case (Eds.), *Proceedings of the 3rd annual workshop on computational learning theory* (pp. 371–383). San Mateo, CA: Morgan Kaufmann.
- Vovk, V. (1998). A game of prediction with expert advice. *Journal of Computer and System Sciences*, *56*(2), 153–173.
- Vovk, V. (1999). Derandomizing stochastic prediction strategies. *Machine Learning*, *35*(3), 247–282.
- V'yugin, V. (2017). Online aggregation of unbounded signed losses using shifting experts. *Proceedings of Machine Learning Research*, *60*, 1–15. http://proceedings.mlr.press/v60/