1 Introduction

Multi-task learning (MTL) aims to improve generalization performance by learning multiple tasks in parallel and exploiting their intrinsic relationships. While single-task learning (STL) provides a simple solution by learning each task independently, in many cases data from multiple sources share similar patterns and can jointly contribute to improving each other’s performance. A typical example is feedback from different users in a market, which can be closely related and share similar traits; an efficient learning algorithm should be able to capture the relatedness of the collected data. Representative MTL methods for boosting generalization performance on batch data include common parameter sharing (Caruana 1997), low-rank subspace projection (Ando and Zhang 2005; Negahban and Wainwright 2008) and trace norm regularization (Pong et al. 2010). Recent years have also witnessed extensive studies (Cavallanti et al. 2010; Dekel et al. 2006; Saha et al. 2011; Li et al. 2014; Zhang et al. 2016) on streaming data, known as online multi-task learning (OMTL) algorithms, which offer low computation cost and strong adaptation to changing environments.

The big-data era has brought new challenges for MTL, since MTL naturally bears the mission of computing multiple tasks simultaneously, and it is more desirable to distribute the computation so that each unit’s local load is light. In many scenarios, data are collected at different places, and sending all data back to one learning center incurs extra transportation cost and may be constrained by geographic distance. It is clearly more favorable to distribute smaller subsets of data to different learning units, which then collaborate with their direct neighbors to find the optimal solution (Sayed 2014). Beyond these advantages, distributed learning algorithms also offer insensitivity to link or node failures and protection of user privacy.

Despite these benefits and the strong demand for it, distributed MTL remains an area that has not been fully explored. Dinuzzo et al. (2011) first explored distributed MTL by performing information fusion with a server–client strategy, and later research (Wang et al. 2016a, b) proposed distributed MTL algorithms that further restrict task relatedness through shared sparsity. These methods all operate under a centralized setting with the common ‘master and worker’ framework, where the ‘master’ collects real-time information from all tasks to construct a global model and then feeds it back to each ‘worker’ to further exploit local content.

MTL under a decentralized setting is more challenging. First of all, each node communicates only with its direct neighbors, and there is no global model (‘master’) on a decentralized distributed network. Direct information flow is strictly restricted to a limited range, and any attempt to construct global sparsity is impractical. Moreover, previous research (Sayed 2014) has proved that all nodes eventually converge to a common optimal value after sufficient training, whereas the optimal solutions for different tasks in MTL may only be similar rather than identical. This discrepancy leads to a ‘Pareto optimality’ between tasks: a compromise that reaches a uniform global optimum at the cost of increasing each task’s own loss. Figure 1 gives a simple illustration of ‘Pareto optimality’ between two tasks, where the final \(\mathbf {w}_T\) terminates at some point between \(\mathbf {w}_1^*\) and \(\mathbf {w}_2^*\). Clearly, there is no guarantee that such a compromise will be better than learning each task individually without communication.

Fig. 1: Two tasks with different optimal values \(\mathbf {w}_1^*\) and \(\mathbf {w}_2^*\). The network will finally converge to some point \(\mathbf {w}\) between \(\mathbf {w}_1^*\) and \(\mathbf {w}_2^*\). Any attempt to decrease one task’s loss comes at the cost of increasing the other’s (e.g. moving \(\mathbf {w}\) down along the red line increases \(J_2(\mathbf {w})\)). (Color figure online)

To overcome these challenges, Chen et al. (2014) presented a possible solution that clusters similar tasks into groups and only requires tasks within each group to share the same minimizer, while another line of work (Bertrand and Moonen 2010, 2011) assumes that the node-specific parameters lie in a common subspace. However, these methods all require prior knowledge, such as how the parameter space is divided or which cluster each task belongs to, and such assumptions may not hold in real-world scenarios. Besides, these methods strictly rely on the similarity of tasks and are very sensitive to outlier tasks.

Motivated by the merits and challenges of distributed MTL, we present a new learning framework that departs from the stereotype of limiting each learning node to a ‘worker’ for one specific task. We instead propose a more general setting where each learning unit is able to learn all tasks, and knowledge is transferred in two phases: task information is first shared within each node, and then all nodes collaborate to reach a common minimizer. The proposed algorithms avoid the challenge of ‘Pareto optimality’ since the potential optimum for all nodes is identical rather than merely similar.

Our key contribution is twofold. First, we propose a simple but novel framework for the decentralized distributed OMTL problem, which is not only applicable to our own algorithms but also able to transform many existing distributed STL methods [e.g. the distributed dual averaging algorithm (Duchi et al. 2012)] into decentralized distributed MTL algorithms. To the best of our knowledge, this is the first model to enable OMTL on decentralized distributed networks without any prior knowledge. Second, our algorithm enjoys a lower \(\mathcal {O}(\sqrt{T})\) theoretical regret given a fixed relationship matrix, whereas previous studies either provide only empirical results (Saha et al. 2011) or are upper bounded by a larger regret under the distributed setting [see \(R_T = \mathcal {O}({\sqrt{T}\log T})\) in Theorem 2 of Duchi et al. (2012)].

The remainder of this paper is organized as follows: Sect. 2 introduces our distributed OMTL algorithm. Section 3 establishes the regret bound for the proposed algorithm with a fixed relationship matrix, and experiments are conducted in Sect. 4. Section 5 reviews related work and concludes the paper.

Notations: Lower-case letters (\(\alpha ,\beta ,\cdots \)) denote scalars, and lower-case bold letters (\(\mathbf {w},\mathbf {v},\cdots \)) denote vectors. Upper-case letters (\(U, P, Q\)) denote matrices. For distributed multi-task learning, superscripts denote the task index and subscripts denote the node and round indices (e.g. \(\mathbf {w}_{i,t}^m\) denotes the weight vector for the m-th task on node i at the t-th round). The combination weight \(a_{i,j}\) denotes the weight assigned to information flowing from node j to node i.

2 Decentralized distributed online multi-task classification (DOM)

We present our new learning framework in this section, together with two corresponding algorithms. The novel part is that we separate the information flow into two phases: multi-task information is first transferred within each node, and the nodes then collaborate with their direct neighbors to reach a common minimizer on the network.

2.1 Problem formulation

Consider a set of m classification tasks learned in parallel, where each task receives one instance \(\mathbf {x}_t^i \in \mathbb {R}^d\) on each round. The OMTL algorithm is required to make a prediction for each task based on its stored learning weight \(W_t\): \( \hat{y}_{t}^i = \text {sign}(\hat{q}_{t}^i) = \text {sign}\left( (\mathbf {w}_{t}^i)^{\top } \mathbf {x}_{t}^i \right) \) for \(i \in [1,\ldots ,m],\) where \(\mathbf {w}^i_t\) denotes the i-th column of \(W_t\). The predictions are evaluated by a canonical loss function f, and in this paper we adopt the hinge loss \(f = [1- y_t^i (\mathbf {w}_t^i)^{\top }\mathbf {x}_t^i]_{+}\) for simplicity. Upon receiving the penalty, the algorithm updates to \(\mathbf {w}_{t+1}\) to minimize its current hinge loss and prepares for predictions on the next round. Our ultimate goal is to minimize the overall learning loss relative to the best \(\mathbf {w}_{*}^i\) for task i in hindsight (Shalev-Shwartz 2012), also known as the regret,

$$\begin{aligned} R_T = \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {w}_{t}^i) - f(\mathbf {w}_{*}^i) \right] . \end{aligned}$$
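
To make the protocol concrete, here is a minimal per-round sketch (a NumPy illustration we add for exposition; storing \(\mathbf {w}^i_t\) as column i of an array W is an assumed layout):

```python
import numpy as np

def predict(W, x, i):
    """Prediction of task i on instance x: W[:, i] stores w_t^i,
    the i-th column of W_t."""
    q_hat = W[:, i] @ x
    return np.sign(q_hat), q_hat

def hinge_loss(W, x, y, i):
    """Hinge loss f = [1 - y * (w_t^i)^T x]_+ used throughout the paper."""
    return max(0.0, 1.0 - y * (W[:, i] @ x))
```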

Instead of assigning each node to tackle one specific task (Bertrand and Moonen 2010, 2011; Chen et al. 2014) and accepting the compromise of ‘Pareto optimality’, we propose a more general setting where each node is able to deal with all tasks and instances from different tasks can be randomly allocated to any learning node.

Specifically, each node concatenates instances from all m tasks as a compound instance vector

$$\begin{aligned} \mathbf {\phi }^\top _{i,t} = \left[ (\mathbf {x}^1_{i,t})^{\top },(\mathbf {x}^2_{i,t})^{\top },\ldots ,(\mathbf {x}^{m}_{i,t})^{\top }\right] , \forall i \in [1,\ldots ,m] \end{aligned}$$
(1)

and the corresponding weight vector for the i-th node is compounded as

$$\begin{aligned} \mathbf {v}^\top _{i,t} = \left[ (\mathbf {w}^1_{i,t})^\top , (\mathbf {w}^2_{i,t})^\top ,\ldots ,(\mathbf {w}^m_{i,t})^\top \right] , \forall i \in [1,\ldots ,m]. \end{aligned}$$
(2)

To understand why we concatenate the weights on each node, we provide the following insight. Denote by \(\mathbf {w}_{*}^{j}\) the optimal learning weight for the j-th task and by \(\mathbf {v}_{i,*}\) the optimal learning weight for the i-th node. If we follow the stereotype of assigning the i-th node to be a ‘worker’ for one specific task (the j-th task), we immediately conclude that the optimal weight of this learning node equals that of its assigned task, namely \(\mathbf {v}_{i,*} = {\mathbf {w}_{*}^{j}}\). Since the \(\mathbf {w}_{*}^{j}\) differ across tasks, the optimal weights on different nodes are also disparate and cannot be efficiently learned over the network. In contrast, the optimal concatenated value, \( \mathbf {v}^\top _{i,*} = \left[ ({\mathbf {w}_{*}^{1}})^{\top }, ({\mathbf {w}_{*}^{2}})^{\top },\ldots ,({\mathbf {w}_{*}^{m}})^{\top }\right] \), is identical for all nodes under our proposed framework, and it also endows each node with the ability to learn all tasks.
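
A minimal NumPy sketch of this construction (the sizes are illustrative assumptions):

```python
import numpy as np

m, d = 4, 100                       # m tasks, d features (illustrative)

# Per-task weights stacked into one compound vector v (Eq. 2).
W = [np.random.randn(d) for _ in range(m)]
v = np.concatenate(W)               # shape (m*d,); its optimum is shared by all nodes

def compound_instance(x, task):
    """Zero-padded compound instance, as in Eq. (3) below: only the block
    belonging to the task that produced x is non-zero."""
    phi = np.zeros(m * d)
    phi[task * d:(task + 1) * d] = x
    return phi

# The compound prediction then reduces to the single-task inner product.
x = np.random.randn(d)
assert np.isclose(v @ compound_instance(x, 2), W[2] @ x)
```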

2.2 Node update

Although one node can receive multiple instances on each round, we follow the common assumption for OMTL that the algorithm receives m instances per round and randomly allocates them to the m learning nodes.Footnote 1 Suppose the i-th node receives only one instance \(\mathbf {x}_{i,t}^{i_t}\) from the \(i_t\)-th task; the compound instance vector in Eq. (1) then becomes

$$\begin{aligned} \phi _{i,t}^\top = \left[ 0,\ldots ,(\mathbf {x}_{i,t}^{i_t})^\top ,\ldots ,0 \right] . \end{aligned}$$
(3)

The prediction is

$$\begin{aligned} \hat{y}_{i,t} = \text {sign}(\hat{q}_{i,t}) = \text {sign}\left( \mathbf {v}_{i,t}^{\top } \mathbf {\phi }_{i,t} \right) = \text {sign} \left( (\mathbf {w}^{i_t}_{i,t})^{\top } \mathbf {x}^{i_t}_{i,t} \right) . \end{aligned}$$
(4)

Learning on the network is clearly inefficient if we simply update the \(i_t\)-th task’s learning weight as

$$\begin{aligned} \mathbf {w}^{i_t}_{i,t+1} = \mathbf {w}^{i_t}_{i,t} - \eta \nabla f(\hat{q}_{i,t}), \end{aligned}$$
(5)

and we need to design a proper scheme to share the instance information for all tasks within each node.

2.2.1 Node updating rule

Following Evgeniou et al. (2005), we define the multi-task kernel feature mapping \(\varPsi \) that projects \(\mathbf {x}_{i,t}^{i_t} \in \mathbb {R}^d\) into the md-dimensional Reproducing Kernel Hilbert Space (RKHS) \(\mathcal {H}\) such that \(\varPsi (\mathbf {x}_{i,t}^{i_t}) = B_{\bigotimes }^{-1} \phi _t \); the corresponding inner product is defined as

$$\begin{aligned} \left\langle \varPsi (\mathbf {x}_{i,s}^{i_s}) , \varPsi (\mathbf {x}_{i,t}^{i_t}) \right\rangle _{B_{\bigotimes }} = \left( B_{\bigotimes }^{-1} \phi _s \right) ^\top B_{\bigotimes }\left( B_{\bigotimes }^{-1} \phi _t \right) = \phi ^\top _s B_{\bigotimes }^{-1}\phi _t, \end{aligned}$$

where B is an \(m \times m\) symmetric matrix that captures the relationship among tasks, and \(B_{\bigotimes }\) denotes the Kronecker product \(B_{\bigotimes }= B {{\mathrm{\bigotimes }}}I_d\).

The hinge loss function in the multi-task kernel space is then defined as

$$\begin{aligned} f(\mathbf {v}) = \max \left\{ 0,1- y_{i,t}^{i_t} \langle \mathbf {v}, \varPsi (\mathbf {x}_{i,t}^{i_t}) \rangle _{B_{\bigotimes }} \right\} , \end{aligned}$$

and if \(f(\mathbf {v})\) is not zero, the derivative will be

$$\begin{aligned} \nabla f(\mathbf {v}) = - y_{i,t}^{i_t} \varPsi (\mathbf {x}_{i,t}^{i_t}) = -y_{i,t}^{i_t} B_{\bigotimes }^{-1} \phi _t . \end{aligned}$$
(6)

For node i, our goal is to select the optimal \(\mathbf {v}_i\) that minimizes its current loss, i.e.

$$\begin{aligned} \min _{\mathbf {v}_i} f(\mathbf {v}_i) \end{aligned}$$

With Taylor’s expansion in the kernel space, we replace such a minimization problem with

$$\begin{aligned} \min _{\mathbf {v}_i} \ \left\langle \nabla f, \mathbf {v}_i \right\rangle _{B_{\bigotimes }} + \frac{1}{2\eta } \Vert \mathbf {v}_i - \mathbf {v}_{i,t}\Vert ^2_{B_{\bigotimes }} . \end{aligned}$$

Setting the derivative to zero, we have the updating rule as

$$\begin{aligned} \mathbf {v}_i = \mathbf {v}_{i,t} - \eta \nabla f(\mathbf {v}_{i,t}). \end{aligned}$$
(7)

Substituting Eq. (6) for \(\nabla f(\mathbf {v}_{i,t})\), the updating rule for each node becomes

$$\begin{aligned} \mathbf {v}_i = \mathbf {v}_{i,t} + \eta y_{i,t}^{i_t} B_{\bigotimes }^{-1} \phi _{i,t} . \end{aligned}$$
(8)

Recalling that \(B_{\bigotimes }^{-1} = B^{-1} {{\mathrm{\bigotimes }}}I_d\) and the expressions for \(\mathbf {v}_i\) and \(\phi _{i,t}\) in Eqs. (2) and (3), we obtain the closed-form updating rule for the j-th task as

$$\begin{aligned} \mathbf {w}_{i,t+1}^j= \mathbf {w}_{i,t}^j + \eta B_{j,i_t}^{-1} \mathbf {x}_{i,t}^{i_t} y_{i,t}^{i_t}\ \ \ \text {for} \ \ \forall j \in [1,\ldots ,m], \end{aligned}$$
(9)

where \(B_{j,i_t}^{-1}\) denotes the \((j,i_t)\)-th component of matrix \(B^{-1}\). Compared with Eq. (5) where only the \(i_t\)-th task is updated, Eq. (9) updates learning weights for all tasks by transferring information in the multi-task kernel space.
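
A sketch of this update (assuming \(B^{-1}\) is precomputed as an array B_inv and the node’s per-task weights form the columns of W):

```python
import numpy as np

def node_update(W, B_inv, x, y, task, eta):
    """Closed-form update of Eq. (9): every column (task) of W moves along
    y * x, scaled by the corresponding entry of column `task` of B^{-1},
    so a single instance informs all tasks on this node."""
    return W + eta * np.outer(x, y * B_inv[:, task])
```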

[Algorithm 1: DOM with a fixed relationship matrix B]

2.2.2 Selection of relationship matrix B

For the selection of matrix B, we first present an algorithm based on a fixed relationship matrix (Cavallanti et al. 2010),

$$\begin{aligned} B^{-1}_{j,i_t} = \frac{b+m\delta _{j,i_t}}{(b+1)m} \ \ \ \text {for} \ \ \ \forall j,i_t \in [1,\ldots ,m], \end{aligned}$$
(10)

where \(\delta _{j,i_t} = 1 \) if \( j = i_t \) and \(\delta _{j,i_t} = 0\) otherwise. The corresponding updating rule for each task becomes

$$\begin{aligned} \mathbf {w}_{i,t+1}^j= \mathbf {w}_{i,t}^j + \eta \frac{b+m\delta _{j,i_t}}{(b+1)m} y_{i,t}^{i_t} \mathbf {x}_{i,t}^{i_t} \ \ \ \text {for} \ \ \ \forall j \in [1,\ldots ,m]. \end{aligned}$$
(11)

The intuition behind Eq. (11) is that the tasks’ relationship is reflected by the parameter b; the theoretically optimal \(b_*\) is given in Sect. 3, or b can simply be tuned on real-world datasets. Specifically, if \(b = 0\), there is no connection between tasks and only the \(i_t\)-th task is updated, while if \(b = \infty \), all tasks share the information of instance \((\mathbf {x}_{i,t}^{i_t},y_{i,t}^{i_t})\) with equal weight \(\frac{1}{m}\).
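
A small sketch of the matrix in Eq. (10); its limiting cases reproduce the two regimes just described:

```python
import numpy as np

def fixed_B_inv(m, b):
    """B^{-1} of Eq. (10): entry (j, k) equals (b + m * delta_{j,k}) / ((b + 1) * m)."""
    return (b + m * np.eye(m)) / ((b + 1.0) * m)

print(fixed_B_inv(3, 0))      # identity: no information transfer between tasks
print(fixed_B_inv(3, 1e9))    # approximately 1/m everywhere: full sharing
```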

Since the relationship between tasks may not be static in the long run, we also propose a dynamic relationship matrix B with a smooth updating rule (Murugesan et al. 2016),Footnote 2

$$\begin{aligned} B^{-1}_{i,t} = \alpha B^{-1}_{0} + (1-\alpha ) P_{i,t}, \end{aligned}$$
(12)

where \(B_0^{-1}\) is an initial matrix that can be set to I or any fixed stochastic matrix, and \(P_{i,t}\) is the relationship matrix learned from the data on each round. Specifically, the \((i_t,j)\)-th component of \(P_{i,t}\) is updated as

$$\begin{aligned} p_{i_t,j} = \frac{\exp (-f(\mathbf {w}_{i,t}^{j},\mathbf {x}_{i,t}^{i_t},y_{i,t}^{i_t}))}{\sum _{j^{\prime } = 1}^{m}\exp (-f(\mathbf {w}_{i,t}^{j^{\prime }},\mathbf {x}_{i,t}^{i_t},y_{i,t}^{i_t}))} \ \ \ \text {for} \ \ \ \forall j,i_t \in [1,\ldots ,m]. \end{aligned}$$
(13)

The intuition behind Eq. (13) is that if \(\mathbf {w}_{i,t}^{j}\) also works well for \((\mathbf {x}_{i,t}^{i_t},y_{i,t}^{i_t})\), a large transfer coefficient from task j to task \(i_t\) is assigned, and the normalization ensures that \(P_{i,t}\) remains a row-stochastic matrix after the update.
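
A hedged sketch of this dynamic update (NumPy; the column layout of W follows the earlier sketches):

```python
import numpy as np

def hinge(w, x, y):
    return max(0.0, 1.0 - y * (w @ x))

def update_P_row(P, W, x, y, task):
    """Softmax row update of Eq. (13) on one node. W[:, j] holds this node's
    weight for task j; tasks whose weights also fit (x, y) well receive a
    large transfer coefficient towards `task`. Normalisation keeps P
    row stochastic."""
    losses = np.array([hinge(W[:, j], x, y) for j in range(W.shape[1])])
    row = np.exp(-losses)
    P[task] = row / row.sum()
    return P

def dynamic_B_inv(B0_inv, P, alpha):
    """Smooth combination of Eq. (12): alpha * B0^{-1} + (1 - alpha) * P."""
    return alpha * B0_inv + (1.0 - alpha) * P
```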

[Algorithm 2: DOM with a dynamic relationship matrix B]

2.3 Nodes communication

Having designed the information-sharing scheme within each node, we now turn to the communication among nodes. We assume that the network topology is static and that information among nodes is exchanged through a fixed combination matrix A. Since previous research (Tu and Sayed 2012) shows that diffusion learning outperforms incremental and consensus learning, we adopt the diffusion method with the ‘Combine and Then Adapt’ (CTA) strategy in this paper.

Denote the weight vector for the i-th node at the end of round t as \(\mathbf {u}_{i,t}\); our updating rule is then:

$$\begin{aligned} \text {Step I}&: \mathbf {v}_{i,t} = \sum _{j=1}^m a_{i,j} \mathbf {u}_{j,t} , \end{aligned}$$
(14)
$$\begin{aligned} \text {Step II}&: \mathbf {u}_{i,t+1} = \mathbf {v}_{i,t} - \eta \nabla f(\mathbf {v}_{i,t}), \end{aligned}$$
(15)

where \(a_{i,j}\) denotes the (i, j)-th component of the combination matrix A. Node i first combines the learning weights of its direct neighbors and then repeats Eq. (7) to learn from the newly arrived instance in the multi-task kernel space.
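
A minimal sketch of one CTA round for node i (assuming U stacks all nodes’ compound weights as rows and grad_f implements Eq. (6)):

```python
import numpy as np

def cta_step(U, A, i, grad_f, eta):
    """'Combine Then Adapt' (Eqs. 14-15): node i first averages its
    neighbours' weights using row i of the combination matrix A (zero
    entries for non-neighbours), then takes a gradient step."""
    v_i = A[i] @ U                      # Step I: combine
    return v_i - eta * grad_f(v_i)      # Step II: adapt
```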

The combination matrix A, with components \(a_{i,j}\), determines how information flows within the network. To guarantee efficient updating, we make a few assumptions below.

Assumption 1

Denote the set of direct neighbors of the i-th node as N(i) and its cardinality as N. If \(j \in N(i)\), there exists a scalar \(\epsilon > 0\) such that \(a_{i,j} \ge \epsilon \); if \(j \notin N(i)\), then \(a_{i,j} = 0\).

Assumption 2

\(\sum _{j = 1}^{m} a_{i,j} = 1 \) and \(\sum _{i = 1}^{m} a_{i,j} = 1\) hold for any i and j.

Assumption 3

There exists at least one node i such that \(i \in N(i)\), and there is no node j such that \(N(j) = \{j\}\).

Assumption 1 ensures that each node gives sufficient learning weight to each of its direct neighbors and does not combine weights from indirect neighbors. Assumption 2 guarantees that the combination matrix A is doubly stochastic, and Assumption 3 confirms that no node is isolated.
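
As a concrete example satisfying all three assumptions, here is a sketch of a combination matrix for a ring network (an illustration we add; it is not part of the algorithm itself):

```python
import numpy as np

def ring_combination_matrix(n):
    """Uniform weights over {left neighbour, self, right neighbour} for a
    ring of n >= 3 nodes: symmetric with unit row sums, hence doubly
    stochastic (Assumption 2); self-loops and epsilon = 1/3 satisfy
    Assumptions 1 and 3."""
    A = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            A[i, j % n] = 1.0 / 3.0
    return A

A = ring_combination_matrix(5)
assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)
```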

Finally, Algorithms 1 and 2 summarize our decentralized distributed online multi-task classification algorithms with a fixed B and a dynamic B separately.

3 Theoretical analysis

We analyze the regret bound of DOM with a fixed relationship matrix in this section.

3.1 Proof setup

We start by defining several auxiliary notations that help us track the information flow within the network.

No ‘external’ information flows into the network in Step I, since all nodes simply combine their neighbors’ weights. Step II introduces a new instance at each node, together with its gradient information. To capture this information, we define

$$\begin{aligned} \mathbf {p}_{i,t+1} = \mathbf {u}_{i,t+1} - \mathbf {v}_{i,t} {\mathop {=}\limits ^{\text {(15)}}} - \eta \nabla f(\mathbf {v}_{i,t}). \end{aligned}$$

Assuming that all instances lie within the space \(\varPsi = \{\mathbf {x}\mid \Vert \mathbf {x}\Vert \le D\}\), we can upper bound \(\mathbf {p}\) as

$$\begin{aligned} \Vert \mathbf {p}_{i,t+1}\Vert _{B_{\bigotimes }} {\mathop {=}\limits ^{\text {(6)}}} \Vert \eta y_{i,t}^{i_t} B_{\bigotimes }^{-1} \phi _{i,t} \Vert _{B_{\bigotimes }} {\mathop {=}\limits ^{\text {(3)(10)}}} \eta \Vert \mathbf {x}_{i,t}^{i_t}\Vert \sqrt{(B^{-1})_{i_t,i_t}}. \end{aligned}$$

Denoting \(L = \max \{ \Vert \mathbf {x}_{i,t}^{i_t}\Vert \sqrt{(B^{-1})_{i_t,i_t}} \} \), information flow is therefore bounded as

$$\begin{aligned} \Vert \mathbf {p}_{i,t+1}\Vert _{B_{\bigotimes }} \le \eta L \end{aligned}$$
(16)

Since any two norms on a finite-dimensional vector space are equivalent (Strang et al. 1993) and B is stochastic, we have

$$\begin{aligned} c\Vert \nabla f(\mathbf {v}_{i,t})\Vert _{B_{\bigotimes }} \le \Vert \nabla f(\mathbf {v}_{i,t})\Vert _{B_{\bigotimes }^{-1}} \le C\Vert \nabla f(\mathbf {v}_{i,t})\Vert _{B_{\bigotimes }}, \end{aligned}$$
(17)

for some constants c and C.

On the other hand, a matrix chain is defined as \(\Phi (k,s) = A^{k-s} \ (k \ge s \ge 0)\) so the evolution of \(\mathbf {u}_{j,t+1}\) is derived as

$$\begin{aligned} \mathbf {u}_{j,t+1} = \mathbf {p}_{j,t+1} + \sum _{l=1}^t \sum _{i=1}^m[\Phi (t+1,l)]_{j,i} \mathbf {p}_{i,l}. \end{aligned}$$
(18)

We define an auxiliary ‘central’ weight as \(\mathbf {z}_t = \frac{1}{m}\sum _{i=1}^m\mathbf {u}_{i,t}\), and its evolution can be derived as

$$\begin{aligned} \mathbf {z}_{t} = \frac{1}{m}\sum _{i=1}^m\mathbf {u}_{i,0} + \frac{1}{m}\sum _{l=1}^{t}\sum _{i=1}^m\mathbf {p}_{i,l} \end{aligned}$$

Without loss of generality, we assume all the learning weights start from \(\mathbf {0}\), or \(\mathbf {u}_{i,0} = \mathbf {0}\). Then the above equation can be written as

$$\begin{aligned} \mathbf {z}_{t} = \frac{1}{m}\sum _{l=1}^{t}\sum _{i=1}^m\mathbf {p}_{i,l}. \end{aligned}$$
(19)

3.2 Weight difference and regret bound

Our first goal is to bound the difference between these two learning weights, \(\Vert \mathbf {z}_t - \mathbf {u}_{j,t}\Vert _{B_{\bigotimes }}\).

Lemma 1

Suppose the network is static and strongly connected, and Assumptions 1, 2 and 3 hold. Then the following inequality holds:

$$\begin{aligned} \left| \Phi (k,s)_{i,j} - \frac{1}{m}\right| \le \theta \beta ^{k-s} \end{aligned}$$

for all \(k \ge s \ge 0\), where \(\theta = (1 - \frac{\epsilon }{4m^2})^{-2}\) and \(\beta = (1 - \frac{\epsilon }{4m^2})^{\frac{1}{Q}}\) depend only on the size and the topology of the network.

The above lemma quantifies how quickly information spreads to the individual nodes before reaching the uniform distribution, and it enables the bound on the weight difference in the following lemma.

Lemma 2

Suppose all instances lie within the range \(\varPsi \) and Lemma 1 holds. Then

$$ \Vert \mathbf {z}_{t+1} - \mathbf {u}_{j,t+1} \Vert _{B_{\bigotimes }} \le \eta L \Delta $$

where \(\Delta = m\theta \frac{\beta }{1-\beta } + 2\).

Equipped with these two lemmas, we can proceed to the second goal of bounding the regret between the sequence \(\{\mathbf {u}_{i,t}\}\) and the potential best classifier \(\mathbf {u}_*\), i.e.

$$\begin{aligned} \min \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {u}_{i,t}) - f(\mathbf {u}_*)\right] . \end{aligned}$$
(20)

Given an arbitrary \(\mathbf {u}\), we have

$$\begin{aligned}&\sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {u}_{i,t}) - f(\mathbf {u}) \right] \nonumber \\&\quad = \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {z}_t) - f(\mathbf {u})\right] + \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {u}_{i,t}) - f(\mathbf {z}_t) \right] \nonumber \\&\quad \le \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {z}_t) - f(\mathbf {u}) \right] + \sum _{t=0}^{T-1}\sum _{i=1}^m\Vert \nabla f(\mathbf {u}_{i,t})\Vert _{B_{\bigotimes }^{-1}} \Vert \mathbf {u}_{i,t} - \mathbf {z}_t\Vert _{B_{\bigotimes }} \nonumber \\&\quad {\mathop {\le }\limits ^{\text {(17)}}}\sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {z}_t) - f(\mathbf {u}) \right] + \sum _{t=0}^{T-1}\sum _{i=1}^mCL \Vert \mathbf {u}_{i,t} - \mathbf {z}_t\Vert _{B_{\bigotimes }} \end{aligned}$$
(21)

Equation (21) includes two terms, and we derive the upper bound for the first term in the following lemma.

Lemma 3

The difference between \(\mathbf {z}_t\) and any fixed \(\mathbf {u}\) is bounded as

$$\begin{aligned}&\sum _{t=0}^{T-1}\sum _{i=1}^m[f(\mathbf {z}_t) - f(\mathbf {u})] \le \frac{m\Vert \mathbf {u}\Vert ^2_{B_{\bigotimes }}}{2\eta } + CL \sum _{t=0}^{T-1}\sum _{i=1}^m\Vert \mathbf {z}_t - \mathbf {u}_{i,t}\Vert _{B_{\bigotimes }} + \eta mT L^2/2 . \end{aligned}$$

Combining Lemmas 2 and 3 into Eq. (21), we come to the following theorem.

Theorem 4

Let \(\mathbf {u}_*\) denote the optimal classifier in hindsight. Then the regret of the proposed DOM algorithm is upper bounded as

$$\begin{aligned} \sum _{t=0}^{T-1}\sum _{i=1}^m\left[ f(\mathbf {u}_{i,t}) - f(\mathbf {u}_*)\right] \le \frac{m\Vert \mathbf {u}_*\Vert ^2_{B_{\bigotimes }}}{2\eta } + \eta mT L^2\left( \frac{1}{2} + 2C\Delta \right) . \end{aligned}$$

Remark: (1) By selecting \(\eta = \frac{\Vert \mathbf {u}_*\Vert _{B_{\bigotimes }}}{ L \sqrt{T (1+ 4 C\Delta )}}\), the optimum regret bound can be written as

$$\begin{aligned} mL \Vert \mathbf {u}_*\Vert _{B_{\bigotimes }} \sqrt{T(1+4C\Delta )}, \end{aligned}$$

which is an \(\mathcal {O}(\sqrt{T})\) regret.
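
Indeed, substituting this \(\eta \) makes the two terms in Theorem 4 equal, giving a one-line check of the rate:

$$\begin{aligned} \frac{m\Vert \mathbf {u}_*\Vert ^2_{B_{\bigotimes }}}{2\eta } = \eta m T L^2\left( \frac{1}{2} + 2C\Delta \right) = \frac{mL\Vert \mathbf {u}_*\Vert _{B_{\bigotimes }}\sqrt{T(1+4C\Delta )}}{2}, \end{aligned}$$

and their sum recovers the bound above.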

(2) The regret bound depends on the relationship between the optimal weights \(\mathbf {u}^i_*\) of the tasks. If B is static, then

$$ \Vert \mathbf {u}_*\Vert _{B_{\bigotimes }}^2 = \sum _{i=1}^{m} \Vert \mathbf {u}_*^i\Vert ^2 + b \sum _{i=1}^{m} \Vert \mathbf {u}^i_* - \bar{\mathbf {u}}\Vert ^2,$$

where \(\bar{\mathbf {u}}= \frac{1}{m} \sum _{i=1}^{m}\mathbf {u}^i_*\) is the average of all potential optimal weight vectors. By the definitions of L and \( \Vert \mathbf {u}_*\Vert _{B_{\bigotimes }}^2\), if \(\Vert \mathbf {u}^i_*\Vert ^2 < \frac{m}{m-1} \Vert \mathbf {u}^i_* - \bar{\mathbf {u}}\Vert ^2\), the optimal \(b_*\) can be derived as

$$\begin{aligned} b_*= \sqrt{(m-1)\frac{\sum _{i=1}^{m}\Vert \mathbf {u}^i_*\Vert ^2 - \sum _{i=1}^{m}\Vert \mathbf {u}^i_*- \bar{\mathbf {u}}\Vert ^2 }{\sum _{i=1}^{m}\Vert \mathbf {u}^i_*- \bar{\mathbf {u}}\Vert ^2 }}, \end{aligned}$$

or \(b = 0\) otherwise.
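
In practice, \(b_*\) can be computed directly from (estimates of) the tasks’ optimal weights; a hedged sketch that falls back to \(b = 0\) whenever the square root is not real:

```python
import numpy as np

def optimal_b(U):
    """b_* from Remark (2). U[i] is (an estimate of) task i's optimal
    weight u_*^i, stacked as the rows of an (m, d) array."""
    m = U.shape[0]
    u_bar = U.mean(axis=0)              # average optimal weight
    S = (U ** 2).sum()                  # sum_i ||u_*^i||^2
    V = ((U - u_bar) ** 2).sum()        # sum_i ||u_*^i - u_bar||^2
    if V == 0.0 or S <= V:              # square root not real: use b = 0
        return 0.0
    return float(np.sqrt((m - 1) * (S - V) / V))
```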

(3) A decentralized network with good connectivity leads to a small \(\Delta \) (since \(\theta \) and \(\beta \) are then relatively small), and therefore its learning regret is lower than that of a poorly connected network.

4 Experiment

In this section, we compare our algorithms with several competitors on both synthetic and real-world datasets to explore the influence of multi-task learning and distributed learning. We first introduce the experimental data and the comparison algorithms, followed by a discussion of the results.

4.1 Benchmarks

Our work highlights the challenges of distributing OMTL across multiple nodes, and we classify the competing algorithms into two groups:

  1. Centralized Learning

    • PM: the original Perceptron algorithm (Rosenblatt 1958), augmented with a learning rate to adjust its adaptation to the data.

    • PA: the Online Passive-Aggressive algorithm (Crammer et al. 2006).

    • COML: Collaborative Online Multi-task Learning (Li et al. 2014), representing centralized OMTL methods.

  2. Decentralized Distributed Learning

    • NS: the previous strategy (Chen et al. 2014) of assigning each node to tackle one specific task, also known as the Node-Specific method.

    • DST: the Distributed Single-Task method, which disperses each task to all learning nodes without information sharing between tasks.

    • DA: the Dual Averaging algorithm for distributed optimization (Duchi et al. 2012), originally designed for single-task learning but extendable to the OMTL setting under our framework.

    • DOM-I: the proposed algorithm with a fixed relationship matrix; each node can tackle all tasks, and information is shared across all tasks and nodes.

    • DOM-II: the proposed algorithm with a smooth dynamic relationship matrix.

Three types of networks, namely fully-connected, grid-connected and ring-connected, are used to examine the effect of network topology. Each learning node assigns equal weight \(1/N\) to each of its direct neighbors, where N is the number of its direct neighbors. The algorithms are evaluated by their error prediction rates, and each experiment consists of 50 runs in which the instances are shuffled beforehand and randomly assigned to the learning nodes. The learning rate \(\eta \) is tuned by grid search over \(\{10^{-5},\ldots ,10^{+2} \}\).

We first construct a synthetic dataset via random walks in parameter space with Gaussian increments. The initial value is \(\mathbf {w}^0 \in \mathbb {R}^{100}\), whose first 60 components are set to 1 and the rest to \(-1.5\). We then construct the potential optimal weight vectors for 9 similar tasks as \(\mathbf {w}^{i+1}= \mathbf {w}^i + \varepsilon \), where \(\varepsilon \sim N(0,0.1I)\) and \(i = 0,\ldots ,8\). A training sample \(\mathbf {x}= [x_1,\ldots ,x_{100}]\) is generated by sampling each \(x_i\) uniformly from \([-3,3]\), and for each task we obtain 2000 samples.
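
A sketch of this construction (the text does not specify how labels are generated; assuming \(y = \text {sign}((\mathbf {w}^i)^\top \mathbf {x})\) is our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, n_samples = 100, 9, 2000

# w^0: first 60 components equal 1, the remaining 40 equal -1.5.
w = np.concatenate([np.ones(60), -1.5 * np.ones(d - 60)])

# Random walk: w^{i+1} = w^i + eps with eps ~ N(0, 0.1 I).
task_weights = []
for _ in range(n_tasks):
    w = w + rng.normal(0.0, np.sqrt(0.1), size=d)
    task_weights.append(w.copy())

# 2000 samples per task, coordinates uniform on [-3, 3]; labels are an
# assumption (y = sign(w^i . x)), not specified in the text.
X = rng.uniform(-3.0, 3.0, size=(n_tasks, n_samples, d))
Y = [np.sign(X[i] @ task_weights[i]) for i in range(n_tasks)]
```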

Algorithm efficiency is also tested on two commonly used real-world datasets. MHC-I Footnote 3 (Peters et al. 2006) contains a subset of human MHC-I alleles (A0201, A0202, A0203, A0206, A0301, A3101, A3301, A6801, A6802), with features extracted by bigram amino acid encoding (Li et al. 2014) to project each protein sequence into a 400-dimensional feature space. Sentiment Footnote 4 (Blitzer et al. 2007) contains user reviews of 4 types of products (books, DVD, electronics and kitchen), and each review lies in a 230610-dimensional feature space based on the corresponding word sequence.

4.2 Performance analysis

Table 1: Error Prediction Rate (%) and Standard Deviations (%) on Real-world Datasets

Fig. 2: a Sentiment-Full. b Sentiment-Grid. c Synthetic-Full. d MHC-I-Full. The top row shows the learning process on the Sentiment dataset; the bottom row shows the parameter effect on the Synthetic and MHC-I datasets.

The synthetic dataset consists of 9 similar tasks, and Table 1 shows that OMTL strategies significantly improve the overall performance, reducing the error rate from \(12\%\) to \(6\%\). Previous studies have validated that OMTL algorithms boost generalization performance compared with STL, and this property also holds under the distributed setting, as evidenced by the good performance of the previous method (NS) that assigns one node to one specific task. Meanwhile, our new strategy of enabling each node to tackle all tasks successfully transforms an STL algorithm (DA) into an MTL method. Among all the distributed methods, our proposed algorithm enjoys a lower error rate; it even outperforms traditional centralized learning on the fully-connected network and remains competitive on the grid-connected and ring-connected networks.

Unlike our synthetic dataset, the relationships between real-world tasks are more complicated: the optimal weight vectors of different tasks may not be similar, and outlier tasks often exist. The results in Table 1 indicate: (1) The centralized OMTL algorithm (COML) still achieves better performance than learning each task individually. (2) The previous strategy of assigning each node to one specific task (NS) actually increases the error prediction rate, which is readily explained by our earlier analysis of ‘Pareto optimality’ when the optimal weight vectors differ substantially across tasks. (3) Our information-sharing strategy still works well for both DA and DOM. Comparison with DST, a naive scheme that disperses each task to all learning nodes, shows that information sharing between tasks has a clear positive effect on distributed learning. In particular, the DOM algorithms perform favorably compared with the typical centralized OMTL algorithm, especially on networks with good connectivity. The theory suggests our algorithm achieves an \(\mathcal {O}(\sqrt{T})\) regret bound while its counterpart in previous research (DA) has \(\mathcal {O}({\sqrt{T}\log T})\); in practice, DOM does outperform DA, with lower error rates on all datasets in Table 1 and faster learning progress in Fig. 2.

Figure 2 also depicts the effect of the parameters for DOM with a constant and a dynamic relationship matrix on fully-connected networks.Footnote 5 For the algorithm with a constant B, the theoretically optimal \(b_*\) was given in Theorem 4, and the experiments validate this result: the empirically best \(b_{exp}\) is rather close to \(b_*\). For DOM-II we set \(B_{0} = I\); setting \(\alpha =1\) is then equivalent to learning each task independently. We observe that combining an initial fixed relationship matrix with a dynamic term performs better than learning each task independently or making the relationship entirely dynamic.

Finally, it is important to note that the goal of this work is not to claim a new OMTL algorithm beating existing centralized algorithms, but to make OMTL algorithms feasible and effective under the challenging decentralized distributed setting. The experimental results show that the proposed decentralized algorithm matches the performance of centralized OMTL (COML) on both synthetic and real-world datasets, clearly validating its efficacy.

5 Related work and conclusion

Our paradigm stems from recent advances in distributed algorithms and (online) multi-task learning. Therefore, we will briefly review previous research in these two areas separately.

Since the pioneering research (Tsitsiklis 1984; Bertsekas 1983; Bertsekas and Tsitsiklis 1989) of the 1980s, distributed optimization problems have attracted extensive study in different areas. Based on the adopted strategy, most previous work can be roughly divided into three groups: incremental, consensus and diffusion learning [Chapter 7 in Sayed (2014)]. Incremental methods (Bertsekas 1997; Blatt et al. 2007; Neto and Pierro 2009) require the network to form a special ring and transmit the updated information along a cyclic trajectory, one node after the other, while consensus methods (DeGroot 1974; Johansson et al. 2008; Nedic and Ozdaglar 2009) require each node to perform two updating steps: it first aggregates the iterates from its direct neighbors and then updates this aggregate with the (sub)gradient evaluated at its previous weights. Diffusion methods (Chen and Sayed 2012; Sayed 2013) keep the combination step of consensus learning but evaluate the (sub)gradient at the combined learning weights. Recent studies (Tu and Sayed 2012; Sayed 2014) have proved that the diffusion strategy allows information to diffuse more thoroughly through the network and therefore outperforms the other two strategies in constant step-size scenarios.

Stemming from the seminal work of Caruana (1997), numerous works have shown that relevant information sharing boosts the generalization performance of MTL; representative methods include common parameter sharing (Caruana 1997), low-rank subspace projection (Ando and Zhang 2005; Negahban and Wainwright 2008) and trace norm regularization (Pong et al. 2010). Meanwhile, with the merits of reduced computation and storage cost, online multi-task learning (OMTL) has recently attracted growing research interest; it also handles sequential data naturally and adapts to changing environments. Studies of OMTL can be traced back to Dekel et al. (2006), where separate tasks are updated through a common loss. Cavallanti et al. (2010) designed perceptron-based algorithms with a fixed matrix to capture the relationship between tasks. Later, Saha et al. (2011) extended this work by learning the task relationship from the data itself and allowing it to be dynamic. Other OMTL literature includes Li et al. (2014), where a collaborative model is proposed, and Lugosi et al. (2009), where OMTL with hard constraints is studied.

However, distributing OMTL over a decentralized network leads to the aforementioned challenge of ‘Pareto optimality’, which reflects the contradiction between OMTL problems (different minimizers) and distributed networks (a single common minimizer). In this paper, we depart from the stereotype of limiting a node to be a ‘worker’ for one specific task and propose a novel learning framework in which every node is able to learn all tasks at the same time. The information-sharing scheme in our algorithms is separated into two phases: multi-task information is first shared within each node, and then the whole network is pushed towards a common minimizer by communication among the nodes. This framework not only supports our proposed algorithms but is also applicable to transforming other STL algorithms (e.g. Dual Averaging) into distributed MTL algorithms. Moreover, we show that the proposed algorithm has an \(\mathcal {O}(\sqrt{T})\) regret upper bound given a fixed relationship matrix, and we validate the algorithms’ performance on both synthetic and real-world datasets.