
1 Introduction

Electricity load forecasting is of great importance for the economic operation and security of a power system. The accuracy of electricity load forecasting directly influences the control and planning of power system operation. It was estimated that a 1% increase in forecasting error would increase the annual operating cost of the UK power system by £10 million (in 1984) [4]. Experts believe that this effect could become even stronger due to the emergence of highly uncertain energy sources, such as solar and wind generation. Depending on the lead time horizon, electricity load forecasting ranges from short-term forecasting (minutes or hours ahead) to long-term forecasting (years ahead) [13]. With increasingly competitive markets and demand response energy management [15], short-term load forecasting is becoming more and more important [25]. In this paper, we therefore focus on this problem.

Electricity load forecasting is a difficult task because the load is influenced by many uncertain factors. Various methods have been proposed for it, including statistical methods, time series analysis, and machine learning algorithms [21]. Some recent work uses multiple kernels to build prediction models for electricity load forecasting. For example, in [1], Gaussian kernels with different parameters are applied to learn peak power consumption. In [8], different types of kernels are used for different features, and a multi-task learning algorithm is proposed and applied to low-level load consumption data to improve the aggregated load forecasting accuracy. However, these existing methods rely on a fixed set of kernel coefficients (typically all set to 1), implicitly assuming that all kernels are equally important for forecasting, which is suboptimal in real-world applications.

Multiple kernel learning (MKL) [2], which learns both the kernel-based predictor and the combination weights of the kernels, could be tailored to this problem. Through MKL, different kernels can receive different weights according to their influence on the outputs. However, learning with multiple kernels usually involves a complicated convex optimization problem, which limits its application to large-scale problems. Although some progress has been made in improving the efficiency of the learning algorithms, most of it focuses on classification tasks [23, 26], whereas electricity load forecasting is a regression problem for which computation time is an important issue.

Another practical issue for load forecasting is the lack of data to build a reliable forecasting model. For example, consider a set of newly built houses (target domain) for which we want to predict the load consumption. We may not have enough data to build a prediction model for these new houses, while we have a large amount of data or knowledge from other houses (source domain). The challenge here is to perform transfer learning [18], which relies on the assumption that there are common structures or factors that can be shared across the domains. The objective of transfer learning for load forecasting is to improve forecasting performance by discovering such shared knowledge and leveraging it for electricity load prediction in the target buildings.

In this paper, we address both challenges within a novel boosting-based MKL framework. In particular, we first propose the boosting-based multiple kernel regression (BMKR) algorithm to improve the computational efficiency of MKL. Furthermore, we extend BMKR to the context of transfer learning and propose two variants: kernel-level boosting-based transfer multiple kernel regression (K-BTMKR) and model-level boosting-based transfer multiple kernel regression (M-BTMKR). Our contribution, from an algorithmic perspective, is two-fold: we propose a boosting-based learning framework (1) to learn regression models with multiple kernels efficiently, and (2) to leverage the MKL models learned from other domains. On the application side, this work introduces transfer learning to the load forecasting problem, which opens up avenues for future work.

2 Background

2.1 Multiple Kernel Regression

Let \(\mathcal {S} = \{(x_n,y_n),n=1,\dots ,N\} \subset \mathbb {R}^d \times \mathbb {R}\) be a data set with N samples, and let \(\mathcal {K}=\{k_m: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R},m=1,\dots ,M\}\) be a set of M kernel functions. The objective of MKL is to learn a prediction model, a linear combination of the M kernels, by solving the following optimization problem [11]:

$$\begin{aligned} \min _{\eta \in \varDelta } \min _{F\in \mathcal {H}_{K}} \frac{1}{2} ||F||^2_{K} + C \sum _{n=1}^N \ell (F(x_n),y_n), \end{aligned}$$
(1)

where \(\varDelta = \{\eta \in \mathbb {R}_+^M \mid \sum _{m=1}^M \eta _m = 1\}\) is the set of kernel weights, \(\mathcal {H}_K\) is the reproducing kernel Hilbert space (RKHS) induced by the kernel \(K(x,x_n) = \sum _{m=1}^M \eta _m k_m(x,x_n)\), and \(\ell (F(x),y)\) is a loss function. In this paper we use the squared loss \(\ell (F(x),y) = \frac{1}{2} (F(x)-y)^2\) for the regression problem. The solution of Eq. 1 is of the form

$$\begin{aligned} F(x) = \sum _{n=1}^N \alpha _n K(x,x_n), \end{aligned}$$
(2)

where the coefficients \(\{\alpha _n\}\) and \(\{\eta _m\}\) are learned from samples.

Compared with single kernel approaches, MKL algorithms can provide better learning capability and alleviate the burden of designing specific kernels to handle diverse multivariate data.
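For illustration, the following minimal Python sketch shows how a combined kernel and the prediction in Eq. 2 can be evaluated, assuming Gaussian base kernels; the coefficients \(\{\alpha_n\}\) and weights \(\{\eta_m\}\) here are placeholder values rather than learned parameters.

```python
# A minimal sketch of the combined-kernel prediction in Eq. 2, assuming
# Gaussian base kernels; alpha (sample coefficients) and eta (kernel
# weights) are placeholder values, not learned parameters.
import numpy as np

def rbf_kernel(x, z, gamma):
    """Gaussian kernel k_m(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def combined_kernel(x, z, gammas, eta):
    """K(x, z) = sum_m eta_m * k_m(x, z), the combined kernel in Eq. 1."""
    return sum(e * rbf_kernel(x, z, g) for e, g in zip(eta, gammas))

def predict(x, X_train, alpha, gammas, eta):
    """F(x) = sum_n alpha_n * K(x, x_n), i.e., Eq. 2."""
    return sum(a * combined_kernel(x, x_n, gammas, eta)
               for a, x_n in zip(alpha, X_train))

# Toy usage with made-up numbers
X_train = np.random.randn(5, 3)             # 5 samples, 3 features
alpha = np.random.randn(5)                  # sample coefficients
gammas = [2.0 ** p for p in range(-2, 2)]   # bandwidths of the base kernels
eta = np.ones(len(gammas)) / len(gammas)    # kernel weights on the simplex
print(predict(np.zeros(3), X_train, alpha, gammas, eta))
```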

2.2 Gradient Boosting and \(\epsilon \)-Boosting

Gradient boosting [10, 16] is an ensemble learning framework which combines multiple hypotheses by performing gradient descent in function space. More specifically, the model learned by gradient boosting can be expressed as:

$$\begin{aligned} F(x) = \sum _{t=1}^T \rho ^t f^t(x), \end{aligned}$$
(3)

where T is the number of total boosting iterations, and the t-th base learner \(f^t\) is selected such that the distance between \(f^t\) and the negative gradient of the loss function at \(F=F^{t-1}\) is minimized:

$$\begin{aligned} f^t = \mathop {\arg \min }_f \sum _{n=1}^N \left( f(x_n)-r^t_{n} \right) ^2, \end{aligned}$$
(4)

where \(r_{n}^t = - \left[ \frac{\partial \ell (F(x_n),y_n)}{\partial F}\right] _{F=F^{t-1}}\), and \(\rho ^t\) is the step size which can either be fixed or chosen by line search. Plugging in the squared loss we have \(r_{n}^t = y_n-F^{t-1}(x_n)\). In other words, gradient boosting with squared loss essentially fits the residual at each iteration.

Let \(\mathcal {F}=\{f_1,\dots ,f_J\}\) be a set of candidate functions, where \(J=|\mathcal {F}|\) is the size of the function space, and let \(f: \mathbb {R}^d \rightarrow \mathbb {R}^J\), \(f(x) = [f_1(x),\dots ,f_J(x)]^\top \), be the mapping defined by \(\mathcal {F}\). Gradient boosting with squared loss usually proceeds in a greedy way: the step size is simply set to \(\rho ^t=1\) for all iterations. On the other hand, if the step size \(\rho ^t\) is set to some small constant \(\epsilon >0\), it can be shown that, under a monotonicity condition, this variant of gradient boosting, referred to as \(\epsilon \)-boosting in [20], essentially solves an \(\ell _1\)-regularized learning problem [12]:

$$\begin{aligned} \min _{||\beta ||_1 \le \mu } \sum _{n=1}^N\frac{1}{N} \ell \left( \beta ^\top f(x_n), y_n \right) , \end{aligned}$$
(5)

where \(\beta \in \mathbb {R}^J\) is the coefficient vector, and \(\mu \) is the regularization parameter, such that \(\epsilon T\le \mu \). In other words, \(\epsilon \)-boosting implicitly controls the regularization via the number of iterations T rather than \(\mu \).
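As an illustration, the following Python sketch implements \(\epsilon \)-boosting with the squared loss over a finite dictionary of candidate functions; the function and variable names are ours, and in BMKR the candidates are instead fitted on the fly, as described in Sect. 3.1.

```python
# A minimal sketch of epsilon-boosting with the squared loss over a finite
# dictionary of candidate functions; names are illustrative. Running T
# iterations with step eps keeps the l1 norm of the coefficients bounded
# by eps * T, as in Eq. 5.
import numpy as np

def epsilon_boosting(candidates, X, y, eps=0.05, T=100):
    """candidates: list of functions f_j mapping an (N, d) array to (N,)."""
    H = np.column_stack([f(X) for f in candidates])  # f_j(x_n), shape (N, J)
    beta = np.zeros(H.shape[1])                      # coefficient vector
    F = np.zeros(len(y))                             # current predictions
    for _ in range(T):
        r = y - F                         # residuals = negative gradient
        corr = H.T @ r                    # correlation with each candidate
        j = int(np.argmax(np.abs(corr)))  # best-fitting candidate (Eq. 4)
        step = eps * np.sign(corr[j])
        beta[j] += step                   # small update of one coordinate
        F += step * H[:, j]
    return beta
```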

2.3 Transfer Learning from Multiple Sources

Let \(\mathcal {S}_T = \{(x_n,y_n),n=1,\dots ,N\}\) be the data set from the target domain, and \(\{\mathcal {S}_1,\dots ,\mathcal {S}_S\}\) be the data sets from S source domains, where \(\mathcal {S}_s=\left\{ (x_n^s,y_n^s), n = 1,\dots ,N_s\right\} \) are the samples of the s-th source. Let \(\{F_1,\dots ,F_S\}\) be the prediction models learned from S source domains. In this work, the s-th model \(F_s\) is trained by some MKL algorithm (e.g., BMKR), and is of the form:

$$\begin{aligned} F_s(x) = \sum _{m=1}^M \eta _m^s h_m^s(x) = \sum _{m=1}^M \eta _m^s \sum _{n=1}^{N_s} \alpha _n^s k_m(x,x_n^s). \end{aligned}$$
(6)

The objective of transfer learning is to build a model F with good generalization ability in the target domain, using the data set \(\mathcal {S}_T\) (which is typically small) and the knowledge learned from the sources \(\{\mathcal {S}_1,\dots ,\mathcal {S}_S\}\). In this work, we assume that such knowledge has been embedded into \(\{F_1,\dots ,F_S\}\), and therefore the problem reduces to exploring which model structures can be transferred from the source domains to the target domain. This type of learning approach is also referred to as parameter transfer [18].

3 Methods

3.1 Boosting Based Multiple Kernel Learning Regression

The idea of BMKR is to learn an ensemble model with multiple kernel regressors using the gradient boosting framework. The starting point of our method is similar to multiple kernel boosting (MKBoost) [23], which adapts AdaBoost [9] for multiple kernel classification. We extend this idea to a more general framework of gradient boosting [10, 16], which allows different loss functions for different types of learning problems. In this paper, we focus on the regression problem and use the squared loss.

At the t-th boosting iteration, for each kernel \(k_m, m=1, \dots , M\), we first train a kernel regression model such as support vector regression (SVR) by fitting the current residuals, and obtain a solution of the form:

$$\begin{aligned} f^t_m (x) = \sum _{n=1}^N \alpha _{t,n} k_m(x,x_n). \end{aligned}$$
(7)

Then, among the M candidates, we choose the regression model with the smallest fitting error,

$$\begin{aligned} f^t = \mathop {\arg \min }_{f^t_m, m \in \{1,\dots ,M\}} e^t_m, \end{aligned}$$
(8)

where \(e^t_m = \frac{1}{2}\sum _{n=1}^N \left( f^t_m(x_n)-r_n^t\right) ^2\), and add it to the ensemble F. The final hypothesis of BMKR is expressed as in Eq. 3.

The pseudo-code of BMKR is shown in Algorithm 1. For gradient boosting with squared loss, the step size \(\rho ^t\) is not strictly necessary [3], and we can either simply set it to 1, or to a fixed small value \(\epsilon \) as suggested by \(\epsilon \)-boosting. Note that at each boosting iteration, instead of fitting all N samples, we can select only \(N'\) samples for training an SVR model, as suggested in [23], which substantially reduces the computational complexity of each iteration when \(N' \ll N\).

[Algorithm 1: Pseudo-code of BMKR]
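To make the procedure concrete, the following Python sketch implements the BMKR training loop under stated assumptions: scikit-learn's SVR is used as the base kernel regressor, Gaussian kernels over a grid of bandwidths serve as candidates, and the hyper-parameter values are illustrative rather than those used in the experiments.

```python
# A minimal sketch of the BMKR training loop of Algorithm 1, assuming
# scikit-learn's SVR as the base kernel regressor; the kernel grid,
# epsilon, and sampling ratio are illustrative.
import numpy as np
from sklearn.svm import SVR

def bmkr_fit(X, y, gammas, T=100, eps=0.05, ratio=0.9, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    ensemble = []                        # list of (step size, fitted SVR)
    F = np.zeros(N)                      # current ensemble predictions
    for _ in range(T):
        r = y - F                        # residuals (squared-loss gradient)
        idx = rng.choice(N, size=int(ratio * N), replace=False)
        best_err, best_svr = np.inf, None
        for g in gammas:                 # try each candidate kernel (Eq. 7)
            svr = SVR(kernel="rbf", gamma=g).fit(X[idx], r[idx])
            err = np.sum((svr.predict(X) - r) ** 2)
            if err < best_err:           # keep the smallest fitting error (Eq. 8)
                best_err, best_svr = err, svr
        ensemble.append((eps, best_svr))
        F += eps * best_svr.predict(X)   # epsilon-step along the chosen learner
    return ensemble

def bmkr_predict(ensemble, X):
    """Final hypothesis F(x) = sum_t rho^t f^t(x), as in Eq. 3."""
    return sum(rho * svr.predict(X) for rho, svr in ensemble)
```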

3.2 Boosting Based Transfer Regression

As explained in Sect. 1, we typically have very little data in the target domain, and therefore the model can easily overfit, especially if we train a complicated MKL model, even with the boosting approach. To deal with this issue, we can implicitly regularize the candidate functions at each boosting iteration by constraining the learning process to the function space spanned by the kernel functions trained on the source domains, rather than training the model in the function space spanned by arbitrary kernels. The underlying assumption of this approach, however, is that at least one source domain is closely related to the target domain, so that the kernel functions learned from the source domains can be reused. If this assumption does not hold, negative transfer could hurt the prediction performance. To avoid this situation, we also keep an MKL model that is trained only on the target domain. Consequently, the challenge becomes how to balance the knowledge embedded in the models learned from the source domains against the data fit in the target domain.

To address this issue in a principled manner, we follow the idea of \(\epsilon \)-boosting [6, 20] and propose the BTMKR algorithm for transfer learning. Knowledge can be transferred at two levels, kernel level and model level, leading to two variants denoted K-BTMKR and M-BTMKR, respectively. At each iteration, K-BTMKR selects a single kernel function from the \(S\,\times \,M\) candidate kernels, while M-BTMKR selects a multiple kernel model from the S source domains. Therefore, K-BTMKR has higher “resolution” and more flexibility, at the price of a higher risk of overfitting, as the dimension of its search space is M times that of M-BTMKR.

Kernel-Level Transfer (K-BTMKR). Let \(\mathcal {H}=\{h_1^1,\dots ,h_M^1,\dots ,h_1^S,\dots ,h_M^S\}\) be the set of MS candidate kernel functions learned from the S source domains, and \(\mathcal {F}=\{f_1,\dots ,f_J\}\) be the set of J candidate kernel functions from the target domain. Note that since the kernel functions from the source domains are fixed, the size of \(\mathcal {H}\) is finite, while the function space of the target domain is infinite, since the weights learned by SVR can be arbitrary (cf. Eq. 7). For simplicity of analysis, we assume J is also finite. Given the mapping \(h: \mathbb {R}^d \rightarrow \mathbb {R}^{MS}\), \(h(x) = [h_1^1(x),\dots ,h_M^S(x)]^\top \), defined by \(\mathcal {H}\) and the mapping f defined by \(\mathcal {F}\), we formulate the transfer learning problem as:

$$\begin{aligned}&\min _{\beta _{\mathcal {S}},\beta _{\mathcal {T}}} \mathcal {L}\left( \beta _{\mathcal {S}}, \beta _{\mathcal {T}} \right) \quad \text {s.t.} \;\ ||\beta _{\mathcal {S}}||_1 + \lambda ||\beta _{\mathcal {T}}||_1 \le \mu , \end{aligned}$$
(9)

where \(\mathcal {L}( \beta _{\mathcal {S}}, \beta _{\mathcal {T}} ) \triangleq \sum _{n=1}^N\ell ( \beta _{\mathcal {S}}^\top h(x_n) + \beta _{\mathcal {T}}^\top f(x_n), y_n )\), \(\beta _{\mathcal {S}} \triangleq [\beta _1^1,\dots ,\beta _M^S]^\top \in \mathbb {R}^{MS}\) and \(\beta _{\mathcal {T}} \triangleq [\beta _1,\dots ,\beta _J]^\top \in \mathbb {R}^J\) are the coefficient vectors for the source domains and the target domain respectively, and \(\lambda \) is a parameter that controls how strongly we penalize \(\beta _{\mathcal {T}}\) relative to \(\beta _{\mathcal {S}}\). Intuitively, if the data from the target domain are limited, we should set \(\lambda \ge 1\) to favor the models learned from the source domains, in order to avoid overfitting.

Following the idea of \(\epsilon \)-boosting [12, 20], Eq. 9 can be solved by slowly increasing the value of \(\mu \) by \(\epsilon \), from 0 to a desired value. More specifically, let \(g(x) = [h(x)^\top , f(x)^\top ]^\top \), \(\beta = [\beta _{\mathcal {S}}^\top , \beta _{\mathcal {T}}^\top ]^\top \), and \(\varDelta \beta = [\varDelta \beta _{\mathcal {S}}^\top , \varDelta \beta _{\mathcal {T}}^\top ]^\top \). At the t-th boosting iteration, the coefficient vector \(\beta \) is updated to \(\beta +\varDelta \beta \) by solving the following optimization problem:

$$\begin{aligned}&\min _{\varDelta \beta } \mathcal {L}\left( \beta +\varDelta \beta \right) \quad \text {s.t.} \quad ||\varDelta \beta _{\mathcal {S}}||_1 + \lambda ||\varDelta \beta _{\mathcal {T}}||_1 \le \epsilon \end{aligned}$$
(10)

Since \(\epsilon \) is very small, the objective function of Eq. 10 can be approximated by a first-order Taylor expansion, which gives

$$\begin{aligned} \mathcal {L}\left( \beta +\varDelta \beta \right) \approx \mathcal {L}\left( \beta \right) + \nabla \mathcal {L}\left( \beta \right) ^\top \varDelta \beta , \end{aligned}$$
(11)

where

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial {\beta _j}} = \sum _{n=1}^N -r_n^t g_j(x_n), \quad \forall j \in \{1,\dots , MS+J\}. \end{aligned}$$
(12)

With the change of variables \(\tilde{\beta }_{\mathcal {T}} \leftarrow \lambda \beta _{\mathcal {T}}\), it can be shown that Eq. 10 is (approximately) solved by

$$\begin{aligned} \varDelta \beta _j = {\left\{ \begin{array}{ll} \epsilon , \quad &{}\text {if } j = \arg \mathrm{max}_{j} \frac{\sum _{n=1}^N r_n^t g_j(x_n)}{\lambda _j}\\ 0, \quad &{}\text {otherwise} \end{array}\right. }, \end{aligned}$$
(13)

where \(\lambda _j=1\) for all \(j\in \{1,\dots , MS\}\), and \(\lambda _j=\lambda \) otherwise. In practice, as the function space of the target domain is infinite, the target-domain candidate functions are computed on the fly by fitting the current residuals, as shown in Algorithm 2.
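As an illustration, the following Python sketch shows the selection step of Eq. 13, assuming that the candidate predictions have already been evaluated on the training samples; the function and argument names are ours.

```python
# A minimal sketch of the K-BTMKR selection step in Eq. 13, assuming that
# all candidate functions have already been evaluated on the N training
# samples; function and argument names are illustrative.
import numpy as np

def select_candidate(residual, source_preds, target_preds, lam):
    """residual: (N,) current residuals r^t.
    source_preds: (N, M*S) predictions of the fixed source kernel functions.
    target_preds: (N, J) predictions of target-domain candidates fitted to
    the residuals. lam is the penalty on the target candidates."""
    score_src = source_preds.T @ residual          # lambda_j = 1
    score_tgt = (target_preds.T @ residual) / lam  # lambda_j = lambda
    scores = np.concatenate([score_src, score_tgt])
    j = int(np.argmax(scores))                     # coordinate receiving +epsilon
    is_target = j >= source_preds.shape[1]
    return j, is_target
```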

Model-Level Transfer (M-BTMKR). The derivation of M-BTMKR is similar to that of K-BTMKR, and therefore is omitted here.

3.3 Computational Complexity

The computational complexity of BMKR, as analyzed in [23], is \(\mathcal {O}(TM\xi (N))\), where \(\xi (N)\) is the complexity of training a single SVR with N samples. Standard learning approaches formulate SVR as a quadratic programming (QP) problem, for which \(\xi (N)\) is \(\mathcal {O}(N^3)\); lower complexity (around \(\mathcal {O}(N^2)\)) can be achieved with other solvers (e.g., LIBSVM [5]). More importantly, BMKR can adopt a stochastic learning approach, as suggested in [23], which selects only \(N'\) samples for training an SVR at each boosting iteration. This yields a complexity of \(\mathcal {O}(TM (N+\xi (N')))\), which makes the algorithm tractable for large-scale problems by choosing \(N' \ll N\). The computational complexity of the BTMKR algorithms is \(\mathcal {O}(TM(SN+\xi (N)))\). Note that in the context of transfer learning, we use all the samples from the target domain, as the data set is usually small.

4 Experiments and Simulation Results

In this section, we evaluate the proposed algorithms on the problem of short-term electricity load forecasting for residential houses. Several factors, including day type, weather conditions, and the lagged load consumption itself, may affect the load profile of a given house. In this paper, we use three kinds of features for load forecasting: lagged load consumption (electricity consumed in the last three hours), temperature in the last three hours, and weekday/weekend information.
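For reference, the following Python sketch shows one way to assemble these features from hourly load and temperature series; it assumes pandas Series indexed by timestamps, and the column names are illustrative.

```python
# A minimal sketch of the feature construction described above, assuming
# hourly load and temperature are given as pandas Series with a
# DatetimeIndex; column names are illustrative.
import pandas as pd

def build_features(load: pd.Series, temp: pd.Series) -> pd.DataFrame:
    df = pd.DataFrame({
        # lagged load consumption over the last three hours
        "load_lag1": load.shift(1),
        "load_lag2": load.shift(2),
        "load_lag3": load.shift(3),
        # temperature over the last three hours
        "temp_lag1": temp.shift(1),
        "temp_lag2": temp.shift(2),
        "temp_lag3": temp.shift(3),
        # weekday (0) vs. weekend (1) indicator
        "is_weekend": pd.Series(load.index.dayofweek >= 5,
                                index=load.index).astype(int),
        # forecasting target: load in the current hour
        "target": load,
    })
    return df.dropna()
```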

[Fig. 1. Load data for four winter days]

[Fig. 2. Load data for three houses]

4.1 Data Description

The historical temperature data are obtained from [14], and the residential house load consumption data are provided by the US Department of Energy [17]. The data set includes hourly residential load consumption data for 24 locations in New York State in 2012. For each location, it provides data for three types of houses, based on the house size: low, base, and high. Figure 1 shows the load consumption of a base-type house over four consecutive winter days. We can see that the load consumption starts to decrease around 8 am and increases quickly from 4 pm. Figure 2 shows the load consumption of three high-consumption houses in nearby cities on the same winter day. It can be observed that the load consumption of house 1 is similar to that of house 2, while both differ from house 3.

4.2 BMKR for Electricity Load Forecasting

To test the performance of BMKR, we use the 2012 data of a high-consumption house in New York City. We test the performance of BMKR separately for different seasons, and compare it with single kernel SVR and linear regression. We set the number of boosting iterations to 100, the step size \(\epsilon \) to 0.05, and the sampling ratio to 0.9. To accelerate the learning process, we initialize the model with linear regression. The candidate kernels for BMKR are Gaussian kernels with 10 different widths (\(2^{-4}, 2^{-3}, ..., 2^{5}\)) and a linear kernel. We repeat the simulation 10 times; each time we randomly choose 50% of the data in the season as training data and the remaining 50% as testing data.

Table 1 reports the mean and standard deviation (std dev) of the mean absolute percentage error (MAPE) for BMKR and the two baselines. BMKR achieves the best forecasting performance for all seasons, obtaining 3.3% and 3.8% average MAPE improvements over linear regression and single kernel SVR, respectively.

Table 1. MAPE (%) performance (mean ± std dev) for high load consumption houses
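For completeness, the MAPE metric reported throughout the experiments can be computed as in the following sketch; the function name is ours.

```python
# A minimal sketch of the MAPE metric reported in Tables 1-3; the function
# name is illustrative.
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```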

4.3 Transfer Regression for Electricity Load Forecasting

We evaluate the proposed transfer regression algorithms, M-BTMKR and K-BTMKR, on high-consumption houses. We randomly pick 6 high-consumption houses as target houses and use the remaining 18 high-consumption houses as source houses. We repeat the simulation 10 times for each house; each time we randomly choose 36 samples as training data and 100 samples as testing data for the target house. For the source houses, we randomly choose 600 samples as training data in each simulation. For K-BTMKR and M-BTMKR, \(\lambda \) is chosen by cross validation to balance the model learned from the source house data against the model learned from the target house data.

The performance of M-BTMKR and K-BTMKR is compared with linear regression, single kernel SVR, and BMKR. The candidate kernels and boosting settings are the same as in Sect. 4.2. For the baselines, the forecasting models are trained only with data from the target houses. The results, shown in Table 2, indicate that the proposed transfer algorithms significantly improve the forecasting performance. For each individual location, the best result is achieved by either K-BTMKR or M-BTMKR, and M-BTMKR shows the best performance on average. The forecasting accuracies of M-BTMKR and K-BTMKR are very close to each other, and both are much better than those of the baseline algorithms without transfer. In other words, with the proposed transfer algorithms, the knowledge learned from the source houses is properly transferred to the target house.

Table 2. Transfer learning MAPE (%) performance for high load consumption houses
Table 3. Transfer learning MAPE (%) performance for high load consumption target houses with low load consumption source houses

4.4 Negative Transfer Analysis

Sometimes the consumption patterns of source houses and target houses can be quite different, and we would like the transfer algorithms to prevent potential negative transfer in such scenarios. Here we present a case study to show the importance of balancing the knowledge learned from the source domains against the data fit in the target domain. We use the same high-load target houses as described in Sect. 4.3, but for the source houses we randomly choose eighteen houses of the low type. We repeat the simulation 10 times, and the results are shown in Table 3.

[Algorithm 2: Pseudo-code of BTMKR]

The proposed algorithms are compared with linear regression, single kernel SVR, BMKR, \(\text {M-BTMKR}_{woT}\), and \(\text {K-BTMKR}_{woT}\), where \(\text {M-BTMKR}_{woT}\) and \(\text {K-BTMKR}_{woT}\) denote BTMKR variants that do not keep an MKL model trained on the target domain (i.e., \(f^*\) in Algorithm 2 is not trained). The simulation results show that, without an MKL model trained on the target domain, we encounter a severe negative transfer problem, and the forecasting accuracy becomes much worse than that of the models learned without transfer. Meanwhile, the proposed M-BTMKR and K-BTMKR successfully avoid such negative transfer. In this case, M-BTMKR and K-BTMKR still perform better than the other algorithms, although the forecasting accuracy of K-BTMKR is very close to that of BMKR. M-BTMKR achieves the best average forecasting performance and provides a 14.37% average forecasting accuracy improvement over BMKR. In summary, the BTMKR algorithms can avoid negative transfer when the data distributions of the source and target domains are quite different.

5 Related Work

Various techniques have been proposed to learn MKL models efficiently [11]. Our BMKR algorithm is originally inspired by [23], which applies the idea of AdaBoost to train a multiple-kernel classifier; BMKR is a more general framework that can adopt different loss functions for different learning tasks. Furthermore, the boosting approach provides a natural way to address small sample size problems by leveraging transfer learning techniques. The original work on boosting-based transfer learning [7] introduces a sample-reweighting mechanism based on AdaBoost for classification problems. Later, this approach was generalized to regression [19] and to transferring knowledge from multiple sources [24]. In [6], a gradient boosting based algorithm is proposed for multi-task learning, under the assumption that the model parameters of all tasks share a common factor. In [22], the transfer boosting and multi-task boosting algorithms are generalized to the online learning setting. While both multiple kernel learning and transfer learning have been studied extensively, little effort has been devoted to addressing the two issues simultaneously. Our BTMKR algorithm distinguishes itself from these methods by dealing with both learning problems in a unified and principled manner. To the best of our knowledge, this is the first attempt to transfer MKL models for regression problems.

6 Conclusion

In this paper, we first propose BMKR, a gradient boosting based multiple kernel learning framework for regression, which is suitable for short-term electricity load forecasting. Unlike traditional MKL methods, the proposed BMKR algorithm learns the combination weight of each kernel with a boosting-style algorithm. Simulation results on residential data show that short-term electricity load forecasting can be improved with BMKR. We further extend the proposed boosting framework to the context of transfer learning and propose two boosting-based transfer multiple kernel regression algorithms: K-BTMKR and M-BTMKR. Empirical results suggest that both algorithms can efficiently transfer the knowledge learned from source houses to target houses and significantly improve the forecasting performance when the target and source houses have similar electricity load consumption patterns. We also investigate the effects of negative transfer and show that the proposed algorithms can prevent potential negative transfer when the source houses are quite different from the target houses.