1 Introduction

Multi-task learning (MTL) (Caruana 1997) aims to jointly learn multiple machine learning tasks by exploiting their commonality to boost the generalization performance of each task. Similar to many standard machine learning techniques, in MTL, a single machine is assumed to be able to access all training data over different tasks. However, in practice, especially in the context of smart city, training data for different tasks is owned by different organizations and geo-distributed over different local machines, and centralizing the data may result in expensive cost of data transmission and cause privacy and security issues. Take personalized healthcare as a motivating example. In this context, learning a personalized healthcare prediction model from each user’s personal data including his/her profile and various sensor readings from his/her mobile device is considered as a different task. On one hand, the personal data may be too sparse to learn a precise prediction model for each task, and thus MTL is desired. On the other hand, some of the users may not be willing to share their personal data, which results in a failure of applying standard MTL methods. Thus, a distributed MTL algorithm is more preferred. However, if frequent communication is required for the distributed MTL algorithm to obtain an optimal prediction model for each task, users have to pay for expensive cost on data transmission, which is not practical. Therefore, designing a communication-efficient MTL algorithm in the distributed computing environment is crucial to address the aforementioned problem.

Though a number of distributed machine learning frameworks have been proposed, most of them are focused on single task learning problems (Li et al. 2014; Boyd et al. 2011; Jaggi et al. 2014; Ma et al. 2015). In particular, COCOA+ as a general distributed machine learning framework has been proposed for strongly convex learning problems (Smith et al. 2017b; Ma et al. 2015; Jaggi et al. 2014). To handle non-strongly regularizers (e.g., \(\ell _1\)-norm), Smith et al. (2015, 2017b) extended COCOA+ by directly solving the primal problem instead of its dual problem. However, in their proposed method, data needs to be distributed by features rather than instances. In our problem setting, we suppose the training data for different tasks is originally geo-distributed over different machines. In this case, to use the method proposed in Smith et al. (2015, 2017b), one has to first centralize the data of all the tasks and then re-distribute the data w.r.t. different sets of features, which is impractical.

In this paper, different from previous methods, we focus on the MTL formulation with a \(\ell _{2,1}\)-norm regularization on the weight matrix over all the tasks, and offer a communication-efficient distributed optimization framework to solve it. Specifically, we have two main contributions: (1) We first present an efficient distributed optimization method that enjoys a fast convergence rate for solving the \(\ell _{2,1}\)-norm regularized MTL problem. To achieve this, we carefully design a subproblem for each local worker by incorporating an extrapolation step on the dual variables. We theoretically prove that with the well-designed local subproblem, our proposed method obtains a faster convergence rate than COCOA+ (Ma et al. 2015; Smith et al. 2017b), especially on ill-conditioned problems. Recently, Ma et al. (2017) also attempted to improve the convergence rate of COCOA+. However, our acceleration scheme is different from theirs. Specifically, with a strongly convex regularizer, the acceleration (Ma et al. 2017) can only be done for Lipschitz continuous losses, while our method is able to improve the convergence rate for both smooth and Lipschitz continuous losses. (2) To further reduce the communication cost at each round when handling extremely high-dimensional data, we propose a dynamic feature screening approach to progressively eliminate the features that are associated with zeros values in the optimal solution. Consequently, the communication cost can be substantially reduced as there are only a few features associated with nonzero values in the solution due to the effect of the sparsity regularization. Note that there exist several data or feature screening approaches for single task learning or MTL problems. We believe that this is the first proposed to reduce communication cost in distributed optimization.

Recently, there have been several attempts at developing distributed optimization frameworks for MTL. Baytas et al. (2016) and Xie et al. (2017) proposed asynchronous proximal gradient based algorithms for distributed MTL. Their proposed methods, however, are communication-heavy as gradients need to be frequently communicated among machines. Wang et al. (2016) proposed a Distributed debiased Sparse Multi-task Lasso (DSML) algorithm. In DSML, there is only one round of communication between the local workers and the master. However, it requires the local workers to perform heavy computation (i.e., estimating a \(d \times d\) sparse matrix) to obtain a debiased lasso solution. More importantly, DSML makes a stronger assumption to ensure support recovery. More recently, to provide trade-off between local computation and global communication, COCOA+ has been extended for multi-task relationship learning by Liu et al. (2017). Later, this problem is further studied in Smith et al. (2017a) by considering statistical and systems challenges. Note that our work is different from Liu et al. (2017) and Smith et al. (2017a) in two ways: (1) Our proposed method enjoys a faster convergence rate than that analyzed in Liu et al. (2017) and Smith et al. (2017a) since their rates are the same as COCOA+. (2) We study different MTL models. Specifically, Liu et al. (2017) and Smith et al. (2017a) studied task-relationship based MTL model (Zhang and Yeung 2010) while our problem is feature based MTL. They are different as discussed in Zhang and Yang (2017). Moreover, as our work focuses on feature-based MTL model with sparsity (Obozinski et al. 2010, 2011; Wang and Ye 2015), it enables us to design a tailored feature screening technique to further reduce the communication cost. Unlike our framework, decentralized MTL methods have also been studied in Wang et al. (2018), Bellet et al. (2018), Vanhaesebrouck et al. (2017) and Zhang et al. (2018). However, these approaches may incur heavier communication cost because frequent communications are often required between tasks in MTL.

2 Notation and preliminaries

Throughout this paper, \({\mathbf {w}} \in \mathbb {R}^{dK}\) and \({\mathbf {W}} \in \mathbb {R}^{d \times K}\) denote a vector and a matrix, respectively, and \({\mathscr {G}}\) denotes a set.

  • \([m] \mathop {=}\limits ^{\mathrm{def}}\{i ~|~ 1 \le i \le m, i \in \mathbb {N}\}\), \(\big \{\mathscr {G}_j^{}\big \}_{j=1}^d\): \(\mathscr {G}_j^{} \mathop {=}\limits ^{\mathrm{def}}\big \{(k-1)d + j~|~ k \in [K] \big \}\), \([x]_+ \mathop {=}\limits ^{\mathrm{def}}\max (x, 0)\).

  • \(w_i^{}\) and \(W_{ij}^{}\): the ith and (ij)th entries of \({\mathbf {w}}\) and \({\mathbf {W}}\), respectively.

  • \({\mathbf {W}}_{i\varvec{\cdot }}^{}\): the ith row of \({\mathbf {W}}\), \({\mathbf {w}}_{\mathscr {G}}^{} \mathop {=}\limits ^{\mathrm{def}}\{w_i^{} ~|~ i \in \mathscr {G}\},{\mathbf {W}}_{\mathscr {G}\varvec{\cdot }} \mathop {=}\limits ^{\mathrm{def}}\{{\mathbf {W}}_{i \varvec{\cdot }} ~|~ i \in \mathscr {G}\}\).

  • \({\mathbf {0}}\): a vector or matrix with all its entries equal to 0, \({\mathbf {I}}\): identity matrix.

  • \(\Vert {\mathbf {w}}\Vert \mathop {=}\limits ^{\mathrm{def}}\sqrt{\langle {\mathbf {w}}, {\mathbf {w}}\rangle }\): \(\ell _2\)-norm of \({\mathbf {w}}\), \(\Vert {\mathbf {W}}\Vert _\text {F}^{} \mathop {=}\limits ^{\mathrm{def}}\sqrt{\mathrm{tr}[{\mathbf {W}}^\top {\mathbf {W}}]}\): Frobenius norm of \({\mathbf {W}}\).

  • \(\Vert {\mathbf {w}}\Vert _{2,1}^{} \mathop {=}\limits ^{\mathrm{def}}\sum _{j=1}^d \Vert {\mathbf {w}}_{\mathscr {G}_j^{}}\Vert \) and \(\Vert {\mathbf {W}}\Vert _{2,1}^{} \mathop {=}\limits ^{\mathrm{def}}\sum _{j=1}^d \Vert {\mathbf {W}}_{j\varvec{\cdot }}\Vert \): \(\ell _{2,1}^{}\)-norm of \({\mathbf {w}}\) and \({\mathbf {W}}\), respectively.

Definition 1

A function \(f(\cdot )\) is L-Lipschitz continuous with respect to \(\Vert \cdot \Vert \), if \(\forall {\mathbf {w}}, \widehat{{\mathbf {w}}} \in \mathbb {R}^d\) it holds that \(|f(\widehat{{\mathbf {w}}}) - f({\mathbf {w}})| \le L \Vert \widehat{{\mathbf {w}}} - {\mathbf {w}}\Vert \).

Definition 2

A function \(f(\cdot )\) is L-smooth with respect to \(\Vert \cdot \Vert \), if \(\forall {\mathbf {w}}, \widehat{{\mathbf {w}}} \in \mathbb {R}^d\) it holds that \(f(\widehat{{\mathbf {w}}}) \le f({\mathbf {w}}) + \langle \nabla f({\mathbf {w}}), \widehat{{\mathbf {w}}} - {\mathbf {w}}\rangle + L\Vert \widehat{{\mathbf {w}}} - {\mathbf {w}}\Vert ^2/2\).

Definition 3

A function \(f(\cdot )\) is \(\gamma \)-strongly convex with respect to \(\Vert \cdot \Vert \), if \(\forall {\mathbf {w}}, \widehat{{\mathbf {w}}} \in \mathbb {R}^d\) it holds that \(f(\widehat{{\mathbf {w}}}) \ge f({\mathbf {w}}) + \langle \nabla f({\mathbf {w}}), \widehat{{\mathbf {w}}} - {\mathbf {w}} \rangle + \gamma \left\| \widehat{{\mathbf {w}}} - {\mathbf {w}}\right\| ^2/2\).

Definition 4

For function \(f(\cdot )\), its convex conjugate \(f_{}^*(\cdot )\) is defined as \(f_{}^*(\varvec{\alpha }) \mathop {=}\limits ^{\mathrm{def}}\sup _{{\mathbf {w}}} \big \{ \langle \varvec{\alpha }, {\mathbf {w}}\rangle - f({\mathbf {w}}) \big \}\).

Lemma 1

(Hiriart-Urruty and Lemaréchal 1993) Assume that function f is closed and convex. If f is \((1/\gamma )\)-smooth w.r.t. \(\Vert \cdot \Vert \), then \(f^*\) is \(\gamma \)-strongly convex w.r.t. the dual norm \(\Vert \cdot \Vert _*\).

3 Problem setup

For simplicity, we consider the setting with K tasks distributed over K workers.Footnote 1 For each task k, we have \(n_k^{}\) labeled instances \(\{{\mathbf {x}}_i^k, y_i^k\}_{i=1}^{n_k^{}}\) stored locally on worker k, where \({\mathbf {x}}_i^k \in \mathbb {R}^d\) is the ith input, and \(y_i^k\) is the corresponding output. Our goal is to jointly learn different models in terms of \({\mathbf {w}}^k \in \mathbb {R}^d, k \in [K]\) for each task. For ease of presentation, we define

  • \(n \mathop {=}\limits ^{\mathrm{def}}\sum _{k=1}^K n_k^{}\): the total number of training instances over all the tasks.

  • \({\mathbf {X}}^k \mathop {=}\limits ^{\mathrm{def}}\big [{\mathbf {x}}_1^k, \ldots , {\mathbf {x}}_{n_k^{}}^k\big ] \in \mathbb {R}^{d \times n_k^{}}\) and \({\mathbf {y}}^k \mathop {=}\limits ^{\mathrm{def}}\big [y_1^k, \ldots , y_{n_k^{}}^k\big ]^\top \in \mathbb {R}^{n_k^{}}\): the input and output for task k.

  • \({\mathbf {W}} \mathop {=}\limits ^{\mathrm{def}}[{\mathbf {w}}^1, \ldots , {\mathbf {w}}^K] \in \mathbb {R}^{d \times K}\): the weight matrix over all the tasks.

  • \({\mathbf {A}} \mathop {=}\limits ^{\mathrm{def}}\mathrm{diag}\big ({\mathbf {X}}^1, \ldots , {\mathbf {X}}^K\big ) \in \mathbb {R}^{dK \times n}\), \({\mathbf {w}} \mathop {=}\limits ^{\mathrm{def}}[({\mathbf {w}}^1)^\top , \ldots , ({\mathbf {w}}^K)^\top ]^\top \in \mathbb {R}^{dK}\).

We focus on the following MTL formulation with sparsity regularization (Obozinski et al. 2010, 2011; Lee et al. 2010; Wang and Ye 2015):

$$\begin{aligned} \min _{{\mathbf {W}}} \frac{1}{n} f({\mathbf {w}}) + \lambda \Big (\rho \Vert {\mathbf {W}}\Vert _{2,1} + \frac{1 - \rho }{2} \Vert {\mathbf {W}}\Vert _\text {F}^2\Big ), \end{aligned}$$
(1)

where \(f({\mathbf {w}}) \mathop {=}\limits ^{\mathrm{def}}\sum _{k=1}^K \sum _{i=1}^{n_k^{}} f_{ki}^{}\big (\langle {\mathbf {x}}_i^k, {\mathbf {w}}^k\rangle \big )\), \(f_{ki}^{}\big (\langle {\mathbf {x}}_i^k, {\mathbf {w}}^k\rangle \big )\) is the loss function of the kth task on the ith data point \(({\mathbf {x}}_i^k, y_i^k)\) and \(\rho \in (0, 1)\). The group sparsity regularization \(\Vert {\mathbf {W}}\Vert _{2,1}\) aims to improve the generalization performance for each task by selecting important features, whose effect to the overall objective is controlled by the parameter \(\lambda \). Note that the regularization term \(\Vert {\mathbf {W}}\Vert _\text {F}^2\) is not only to control the complexity of each linear model but also to facilitate distributed optimization.Footnote 2 One can rewrite (1) as the following vectorization form,

$$\begin{aligned} \min _{{\mathbf {w}}} \Big \{P({\mathbf {w}}) \mathop {=}\limits ^{\mathrm{def}}\frac{1}{n} f({\mathbf {w}}) + \lambda g({\mathbf {w}})\Big \}, \end{aligned}$$
(2)

where \(g({\mathbf {w}}) \mathop {=}\limits ^{\mathrm{def}}\rho \sum _{j=1}^d \Vert {\mathbf {w}}_{\mathscr {G}_j^{}}\Vert + (1 - \rho )\Vert {\mathbf {w}}\Vert ^2/2\).

3.1 Dual problem

Compared to the primal problem, it is well-known that there is a dual variable associated with each training instance in its dual problem. This property makes the dual problem more tractable for distributed optimization if training instances are stored on different workers. Let \(\varvec{\alpha }= [\alpha _1^1, \ldots , \alpha _{n_{K}^{}}^K]^\top \in \mathbb {R}^n\). As derived in “Appendix A”, the dual problem of (2) is

$$\begin{aligned} \min _{\varvec{\alpha }} \bigg \{ D(\varvec{\alpha }) \mathop {=}\limits ^{\mathrm{def}}\frac{1}{n} f^*(-\varvec{\alpha }) + \lambda g^*\bigg ( \frac{{\mathbf {A}}\varvec{\alpha }}{\lambda n} \bigg ) \bigg \}, \end{aligned}$$
(3)

where \(f^*(-\varvec{\alpha }) \mathop {=}\limits ^{\mathrm{def}}\sum _{k=1}^K \sum _{i=1}^{n_k^{}} f_{ki}^*(-\alpha _i^k)\), \(f_{ki}^*(\cdot )\) is the conjugate function of \(f_{ki}(\cdot )\) and

$$\begin{aligned} g^*\bigg (\frac{{\mathbf {A}}\varvec{\alpha }}{\lambda n}\bigg ) \mathop {=}\limits ^{\mathrm{def}}\sum _{j=1}^d \Bigg \{ g_j^*\bigg (\frac{{\mathbf {A}}_{\mathscr {G}_{j}^{}{\varvec{\cdot }}}\varvec{\alpha }}{\lambda n} \bigg ) \mathop {=}\limits ^{\mathrm{def}}\frac{\big [\Vert {\mathbf {A}}_{\mathscr {G}_{j}^{}\varvec{\cdot }} \varvec{\alpha }\Vert - \rho \lambda n\big ]_+^2}{2(1 - \rho )\lambda ^2 n^2} \Bigg \}. \end{aligned}$$

Let \({\mathbf {w}}_\star \) and \(\varvec{\alpha }_\star ^{}\) be optimal solutions to (2) and (3), respectively. One can obtain a primal solution \({\mathbf {w}}(\varvec{\alpha })\) from any dual feasible \(\varvec{\alpha }\) via

$$\begin{aligned} {\mathbf {w}}(\varvec{\alpha }) \mathop {=}\limits ^{\mathrm{def}}\nabla g^*\big ({\mathbf {A}}\varvec{\alpha }/(\lambda n) \big ). \end{aligned}$$
(4)

Thus, the duality gap at \(\varvec{\alpha }\) is \(G(\varvec{\alpha }) \mathop {=}\limits ^{\mathrm{def}}P({\mathbf {w}}(\varvec{\alpha })) - (-D(\varvec{\alpha })) = P({\mathbf {w}}(\varvec{\alpha })) + D(\varvec{\alpha })\).

4 Efficient distributed optimization

For ease of presentation, we further introduce some additional notations. Let \(\{\mathscr {P}_k^{}\}_{k=1}^K\) be a partition of [n] such that \(\varvec{\alpha }_{\mathscr {P}_k^{}}^{} \in \mathbb {R}^{n_k^{}}\) are the dual variables associated with the training instances of the kth task. For \(k \in [K]\), \({\mathbf {A}} \in \mathbb {R}^{dK \times n}\) and \({\mathbf {z}} \in \mathbb {R}^n\), we define

  • \(\widehat{{\mathbf {A}}}_{}^{k} \in \mathbb {R}^{dK \times n}\): \(\big (\widehat{{\mathbf {A}}}_{}^{k}\big )_{\varvec{\cdot } i} \mathop {=}\limits ^{\mathrm{def}}{\mathbf {A}}_{\varvec{\cdot } i} \) if \(i \in \mathscr {P}_k^{}\), otherwise \({\mathbf {0}}\).

  • \(\widehat{\varvec{\alpha }}^k\in \mathbb {R}^n\): \(\big (\widehat{\varvec{\alpha }}^k\big )_i \mathop {=}\limits ^{\mathrm{def}}\alpha _i\) if \(i \in \mathscr {P}_k^{}\), otherwise 0, \(\varvec{\alpha }^k \in \mathbb {R}^{n_k^{}}\): \(\varvec{\alpha }^k \mathop {=}\limits ^{\mathrm{def}}\varvec{\alpha }_{\mathscr {P}_k^{}}^{}\), \(f_k^*(-\widehat{\varvec{\alpha }}^k) \mathop {=}\limits ^{\mathrm{def}}\sum _{i \in \mathscr {P}_k^{}} f_{ki}^*(-\alpha _i^k)\).

Recall that we assume \(\{{\mathbf {X}}_{}^{k}, {\mathbf {y}}^{k}\}_{k=1}^K\) to be stored over K local workers. Therefore, it is highly desirable to develop a communication-efficient distributed optimization method to solve (3). Note that one can adopt COCOA+ (Ma et al. 2015; Smith et al. 2017b) to solve the dual problem, which is similar to the idea of adopting COCOA+ for distributed multi-task relationship learning (Liu et al. 2017; Smith et al. 2017a). However, in this way, the convergence rate of such a COCOA+-based approach fails to reach the best one as discussed in Arjevani and Shamir (2015). To address this problem, we present an efficient distributed optimization method to solve (3) with a faster convergence rate compared with the COCOA+-based approach. The high-level idea of the proposed method is summarized in Algorithm 1, and the details are discussed as follows.

figure a

In order to minimize (3) with respect to \(\varvec{\alpha }\) in a distributed environment, one needs to design a subproblem for each worker such that the objective value of (3) decreases when each worker minimizes its local subproblem by only accessing its local data. In (3), the term \(f_{}^*(\cdot )\) is separable for examples on different workers but \(g_{}^*(\cdot )\) is not. Note that \(g_{}^*(\cdot )\) is a smooth function. By Definition 2, it has a quadratic upper bound based on a reference point \({\mathbf {u}}\) that is separable. By making use of this upper bound, one can design a subproblem for each worker such that \(D(\varvec{\alpha })\) decreases if each worker minimizes its local subproblem. Let \(\eta \mathop {=}\limits ^{\mathrm{def}}(1 - \rho )\lambda n^2\). The following subproblem is used for the kth worker at the tth iteration:

$$\begin{aligned} \widehat{\varvec{\alpha }}_t^k \mathop {=}\limits ^{\mathrm{def}}\mathop {{{\,\mathrm{argmin}\,}}}\limits _{\widehat{\varvec{\alpha }}_t^k \in \mathbb {R}^n} L_k\big (\widehat{\varvec{\alpha }}_t^k; \widehat{{\mathbf {u}}}_t^k, {\mathbf {w}}({\mathbf {u}}_t^{})\big ), \end{aligned}$$
(5)

where \({\mathbf {u}}_t^{}\) is a reference point at the tth iteration and

$$\begin{aligned} L_k\big (\widehat{\varvec{\alpha }}_t^k; \widehat{{\mathbf {u}}}_t^k, {\mathbf {w}}({\mathbf {u}}_t^{})\big )&\mathop {=}\limits ^{\mathrm{def}} \frac{1}{n}f_k^*\big (-\widehat{\varvec{\alpha }}_t^k\big ) + \frac{\lambda }{K} g^*\Big (\frac{{\mathbf {A}}{\mathbf {u}}_t^{}}{\lambda n}\Big ) + \frac{1}{n} \Big \langle {\mathbf {w}}({\mathbf {u}}_t^{}), {\mathbf {A}}\big (\widehat{\varvec{\alpha }}_t^k - \widehat{{\mathbf {u}}}_t^k \big ) \Big \rangle \nonumber \\&+ \,\frac{1}{2\eta } \big \Vert {\mathbf {A}}\big (\widehat{\varvec{\alpha }}_t^k - \widehat{{\mathbf {u}}}_t^k\big )\big \Vert ^2. \end{aligned}$$
(6)

It can be proved that \(D(\varvec{\alpha }_t^{}) \le \textstyle {\sum }_{k=1}^K L_k\big (\widehat{\varvec{\alpha }}_t^k; \widehat{{\mathbf {u}}}_t^k, {\mathbf {w}}({\mathbf {u}}_t^{})\big )\) holds for any \({\mathbf {u}}_t^{}\). Therefore, \(D(\varvec{\alpha })\) can be minimized by employing each local worker to solve its own local subproblem 5. With \({\mathbf {w}}({\mathbf {u}}_t^{})\), each subproblem can be minimized by only accessing the corresponding local data \(({\mathbf {X}}^k, {\mathbf {y}}^k)\).

In the literature of distributed optimization, e.g., COCOA+-based approaches (Ma et al. 2015; Smith et al. 2017a, b; Liu et al. 2017), the reference point \({\mathbf {u}}_t^{}\) is set to be the solution of last iteration \(\varvec{\alpha }_{t-1}^{}\). It leads to that the convergence rate of COCOA+-based approachs fails to reach the best one as discussed in Arjevani and Shamir (2015). In contrast, \({\mathbf {u}}_t^{}\) in our proposed method is set as follows,

$$\begin{aligned} {\mathbf {u}}_{t+1}^{} = \varvec{\alpha }_t^{} + \frac{\big (1 - \theta _{t-1}^{}\big )\theta _{t-1}^{}}{\theta _t^{}+\theta _{t-1}^2} \big (\varvec{\alpha }_t^{} - \varvec{\alpha }_{t-1}^{}\big ), \end{aligned}$$

where \(\theta _t^{}\) is the solution of

$$\begin{aligned} \theta _t^2 = \big (1 - \theta _t^{}\big )\theta _{t-1}^2 + \vartheta \eta \theta _t^{}, \end{aligned}$$
(7)

where \(\vartheta \mathop {=}\limits ^{\mathrm{def}}\mu /n\). The definition of \({\mathbf {u}}_{t+1}^{}\) implies

$$\begin{aligned} {\mathbf {A}}{\mathbf {u}}_{t+1}^{} = \sum _{k=1}^K \bigg \{{\mathbf {A}}\widehat{\varvec{\alpha }}_t^k + \frac{\theta _{t-1}^{}\big (1 - \theta _{t-1}^{}\big )}{\theta _t^{}+\theta _{t-1}^2} \Big ({\mathbf {A}}\widehat{\varvec{\alpha }}_t^k - {\mathbf {A}}\widehat{\varvec{\alpha }}_{t-1}^k\Big )\bigg \}. \end{aligned}$$
(8)

Specifically, \({\mathbf {u}}_{t+1}^{}\) is obtained based on an extrapolation from \(\varvec{\alpha }_t^{}\) and \(\varvec{\alpha }_{t-1}^{}\). This is similar to Nesterov’s acceleration technique (Nesterov 2013). As we will see, this technique yields a faster convergence rate compared to COCOA+-based approaches (Ma et al. 2015; Smith et al. 2017a, b; Liu et al. 2017). Recently, Zheng et al. (2017) presented an accelerated distributed alternating dual maximization algorithm for single task learning, where an extrapolation is applied on the primal variable for acceleration. For smooth losses, they only proved the accelerated convergence rate in terms of primal suboptimality while we also prove it for duality gap, resulting in a stronger result.

Remark 1

In each iteration of Algorithm 1, \({\mathbf {w}}({\mathbf {u}}_t^{})\) and \(\{{\mathbf {A}}\widehat{\varvec{\alpha }}_t^k\}_{k=1}^K\) are communicated between master and workers. By the definitions of \({\mathbf {A}}\) and \(\widehat{\varvec{\alpha }}_t^k\), we note that \(\big ({\mathbf {w}}({\mathbf {u}}_t^{})\big )^k \in \mathbb {R}^d\) and \({\mathbf {X}}_{}^k\varvec{\alpha }_t^k \in \mathbb {R}^d\) are actually communicated between master and the kth worker. Therefore, its communication cost for each iteration is the same as COCOA+ in which \(\big ({\mathbf {w}}(\varvec{\alpha }_t^{})\big )^k \in \mathbb {R}^d\) and \({\mathbf {X}}_{}^k\varvec{\alpha }_t^k \in \mathbb {R}^d\) are communicated. Note that \({\mathbf {w}}({\mathbf {u}}_{t+1}^{})\) depends on \({\mathbf {A}}\varvec{\alpha }_t^{}\) but also \({\mathbf {A}}\varvec{\alpha }_{t-1}^{}\), therefore we can keep a copy of \({\mathbf {A}}\varvec{\alpha }_{t-1}^{}\) on the master until iteration t. In this way, no extra communication cost is induced in each iteration by our method for acceleration.

5 Convergence analysis

In this section, we analyze the convergence rate of the proposed method and show that it is faster than COCOA+-based approaches. All the proofs can be found in “Appendix”. In our analysis, we assume that all \(f_{ki}^*, k \in [K], i \in [n_k^{}]\) are \(\mu \)-strongly convex (\(\mu \ge 0\)) with respect to the norm \(\Vert \cdot \Vert \). According to Lemma 1, it is equivalent to assuming that all \(f_{ki}^{}\), for \(k \in [K]\) and \(i \in [n_k^{}]\) are \((1/\mu )\)-smooth with respect to the norm \(\Vert \cdot \Vert \). Since \(\mu \) is allowed to be 0, our analysis also covers the case that all \(f_{ki}^*\), for \(k \in [K]\) and \(i \in [n_k^{}]\) are only generally convex (i.e., \(\mu = 0\)), which implies that all \(f_{ki}\) for \(k \in [K]\) and \(i \in [n_k^{}]\) are Lipschitz continuous instead of smooth. To facilitate analysis, we also assume that \(L_k\big (\widehat{\varvec{\alpha }}_t^k; \widehat{{\mathbf {u}}}_t^k, {\mathbf {w}}({\mathbf {u}}_t^{})\big )\) is exactly solved for any \(k \in [K]\) and \(t \ge 1\).

By defining \(\zeta _t^{}\mathop {=}\limits ^{\mathrm{def}}\theta _t^2/\eta \), (7) becomes

$$\begin{aligned} \zeta _t^{} = \big (1 - \theta _t^{}\big ) \zeta _{t-1}^{} + \vartheta \theta _t^{}. \end{aligned}$$
(9)

For any \(t \ge 1\) and \(k \in [K]\), \(\widehat{{\mathbf {v}}}_t^k\) is defined as

$$\begin{aligned} \widehat{{\mathbf {v}}}_t^k \mathop {=}\limits ^{\mathrm{def}}\widehat{\varvec{\alpha }}_{t-1}^k + \big (\widehat{\varvec{\alpha }}_t^k - \widehat{\varvec{\alpha }}_{t-1}^k\big )\big /\theta _t^{}, ~~ k \in [K]. \end{aligned}$$
(10)

In addition, the suboptimality on dual objective function \(\epsilon _D^t\) is defined as \(\epsilon _D^t \mathop {=}\limits ^{\mathrm{def}}D(\varvec{\alpha }_t^{}) - D(\varvec{\alpha }_\star ^{}), t \ge 0\). By using the above notations, the following lemma shows that there is an upper bound for the suboptimality \(\epsilon _D^t\). As we will see, this is the foundation for analyzing the convergence rate of duality gap.

Lemma 2

Consider applying Algorithm 1 to solve (3), the following inequality holds for any \(t \ge 1\),

$$\begin{aligned} \epsilon _D^t + R^t \le \gamma _t^{} \big (\epsilon _D^0 + R^0 \big ), \end{aligned}$$
(11)

where \(R^t = \frac{\zeta _t^{}}{2} \sum _{k=1}^K \big \Vert {\mathbf {A}}\big (\widehat{\varvec{\alpha }}_\star ^k - \widehat{{\mathbf {v}}}_t^k\big )\big \Vert ^2, \gamma _t^{} = \prod _{i=1}^t \big (1 - \theta _i\big )\) for any \(t \ge 1\) and \(\gamma _0^{} = 1\).

It can be found that the form of \(\gamma _t^{}\) determines the convergence rate of Algorithm 1. Therefore, next, we study the convergence rate by using the upper bound of \(\gamma _t^{}\) under different settings of the loss function.

5.1 Convergence rate for smooth losses

By applying Lemma 2, the following lemma characterizes the effect of iterations of Algorithm 1 when the loss functions \(f_{ki}^{}\)’s are \((1/\mu )\)-smooth for any \(k \in [K]\) and \(i \in [n_k^{}]\).

Lemma 3

Assume the loss functions \(f_{ki}^{}\)’s are \((1/\mu )\)-smooth for any \(k \in [K]\) and \(i \in [n_k^{}]\). If \(\theta _0^{} = \sqrt{\vartheta \eta }\) and \((1-\rho )\lambda \mu n \le 1\), then the following inequality holds for any \(t \ge 1\)

$$\begin{aligned} \epsilon _D^t \le \Big (1 - \sqrt{(1 - \rho )\lambda \mu n} \Big )^t \big (\epsilon _D^0 + R^0\big ). \end{aligned}$$
(12)

Let \(\sigma _\text {max}^{} \mathop {=}\limits ^{\mathrm{def}}\max _{\varvec{\alpha }\ne 0} \Vert {\mathbf {A}}\varvec{\alpha }\Vert _{}^2/\Vert \varvec{\alpha }\Vert _{}^2\). By applying Lemma 3, the next theorem shows the communication complexities for smooth losses in terms of dual objective and duality gap.

Theorem 1

Assume the loss functions \(f_{ki}^{}\)’s are \((1/\mu )\)-smooth for any \(k \in [K]\) and \(i \in [n_k^{}]\). If \(\theta _0^{} = \sqrt{\vartheta \eta }\) and \((1-\rho )\lambda \mu n \le 1\), then after T iterations in Algorithm 1 with

$$\begin{aligned} T \ge \sqrt{\frac{1}{(1 - \rho )\lambda \mu n}} \log \bigg ( \big (1+\sigma _\text {max}^{}\big )\frac{\epsilon _D^0}{\epsilon _D^{}}\bigg ), \end{aligned}$$

\(D\big (\varvec{\alpha }_T^{}\big ) - D\big (\varvec{\alpha }_\star ^{}) \le \epsilon _D^{}\) holds. Furthermore, after T iterations with

$$\begin{aligned} T \ge \sqrt{\frac{1}{(1 - \rho )\lambda \mu n}} \log \bigg ( \big (1 + \sigma _\text {max}^{}\big ) \frac{(1 - \rho )\lambda \mu n + \sigma _\text {max}^{}}{(1 - \rho )\lambda \mu n} \frac{\epsilon _D^0}{\epsilon _G^{}} \bigg ), \end{aligned}$$

it holds that \(P\big ({\mathbf {w}}(\varvec{\alpha }_T^{})) - (-D(\varvec{\alpha }_T^{})) \le \epsilon _G^{}\).

Following Zhang and Xiao (2017), we define the condition number \(\kappa \) as \(\kappa \mathop {=}\limits ^{\mathrm{def}}\max _{k,i} \Vert {\mathbf {x}}_i^k\Vert ^2/(\lambda \mu )\). With the above analysis, the communication complexity of our method is linear with respect to \(\sqrt{\kappa }\), while it is linear with \(\kappa \) for COCOA+-based approaches (Ma et al. 2015; Smith et al. 2017b). The value of \(\kappa \) is typically the order of n as \(\lambda \) is usually set to the order of 1 / n (Bousquet and Elisseeff 2002). Therefore, our method is expected to converge faster than COCOA+-based approaches.

5.2 Convergence rate for Lipschitz continuous losses

Next, we present the convergence rate of the Algorithm 1 when the loss function is just general convex and Lipschitz continuous.

Theorem 2

Assume the loss functions \(f_{ki}^{}\)’s are generally convex and L-Lipschitz continuous for any \(k \in [K]\), \(i \in [n_k^{}]\). If \(\theta _0 = 1\), the following inequality holds for any \(t \ge 1\)

$$\begin{aligned} \epsilon _D^t \le \frac{1}{(t + 2)^2} \left( 4\epsilon _D^0 + \frac{8L^2\sigma _\text {max}^{}}{(1 - \rho )\lambda n^2} \right) . \end{aligned}$$
(13)

After T iterations in Algorithm 1 with

$$\begin{aligned} T \ge \sqrt{\frac{8L^2\sigma _\text {max}^{}}{(1 - \rho )\lambda n^2\epsilon _D^{}} + \frac{4\epsilon _D^0}{\epsilon _D^{}}} - 2, \end{aligned}$$
(14)

it holds that \(D\big (\varvec{\alpha }_T^{}\big ) - D\big (\varvec{\alpha }_\star ^{}) \le \epsilon _D^{}\).

Remark 2

For generally convex loss function, the dual objective obtained by Algorithm 1 decreses in \(O(1/t^2)\) instead of O(1 / t) for COCOA+. Therefore, the complexity for obtaining \(\epsilon _D^{}\)-suboptimal solution is \(\sqrt{1/\epsilon _D^{}}\) that is faster than that of COCOA+ (i.e., \(1/\epsilon _D^{}\)).

6 Further reduce communication cost via dynamic feature screening

In Sect. 4, we present an acceleration method for distributed optimization of (3) that reduces the communication cost in terms of iteration of communications. As discussed in Remark 1, the communication cost of our method in each iteration is linear with the number of features d, that is the same as previous distributed optimization methods for sparsity-regularized problems. It can be expensive for high-dimensional data. To address this issue, we present a method to reduce the communication cost for each iteration by exploiting the sparsity of \({\mathbf {w}}_\star \) (Bonnefoy et al. 2015; Fercoq et al. 2015; Ndiaye et al. 2017). It is well-known that the \(\ell _{2,1}\)-norm regularization is able to produce a row sparse pattern on \({\mathbf {W}}_{\star }^{}\) (Obozinski et al. 2011, 2010; Yuan et al. 2006; Zou and Hastie 2005). In other words, \(({{\mathbf {w}}_\star })_{\mathscr {G}_{j}^{}}\) will be \({\mathbf {0}}\) for most \(\mathscr {G}_{j}^{},j \in [d]\). Thereafter, we refer the jth feature as an inactive feature if \(({\mathbf {w}}_\star )_{\mathscr {G}_{j}^{}} = {\mathbf {0}}\), otherwise an active feature. The key idea of feature screening is to identify inactive features before sending the updated information to workers (Line 4 in Algorithm 1). In this way, the communication cost can be reduced since it is linear with the number active features.

To identify inactive features, we need to exploit the KKT condition of (2)

$$\begin{aligned}&\big (\alpha _\star \big )_i^k \in \partial f_{ki}^{}\big (\big \langle {\mathbf {x}}_i^k, {\mathbf {w}}_\star ^k \big \rangle \big ), \forall k \in [K], i \in [n_k^{}] , \end{aligned}$$
(15)
$$\begin{aligned}&\frac{{\mathbf {A}}_{\mathscr {G}_{j}^{}\varvec{\cdot }} \varvec{\alpha }_\star ^{}}{\lambda n} \in (1 - \rho )\big ({\mathbf {w}}_\star \big )_{\mathscr {G}_{j}^{}} + \rho \partial \big \Vert ({\mathbf {w}}_\star )_{\mathscr {G}_{j}^{}}^{}\big \Vert , \forall j \in [d]. \end{aligned}$$
(16)

By checking the subgradient of \(\Vert \cdot \Vert \), it implies \(({{\mathbf {w}}_\star })_{\mathscr {G}_{j}^{}} = {\mathbf {0}}\) if \(\Vert ({{\mathbf {w}}_\star })_{\mathscr {G}_{j}^{}}\Vert < 1\). Combining this fact with (16), we have

$$\begin{aligned} \big \Vert {\mathbf {A}}_{\mathscr {G}_{j}^{}{\varvec{\cdot }}}^{} \varvec{\alpha }_\star ^{} \big \Vert < \rho \lambda n \; \Rightarrow \; ({{\mathbf {w}}_\star })_{\mathscr {G}_{j}^{}} = {\mathbf {0}}. \end{aligned}$$
(17)
figure b

It can be shown that one can obtain the exact optimum even without considering these inactive features during optimization. Therefore, one can reduce the communication cost by discarding these inactive features, thus less information needs to be communicated. To use (17) to identify inactive features, one needs to have \(\varvec{\alpha }_\star ^{}\) that is unknown before the optimization problem (3) is solved. Next, we show that a feasible set \(\mathscr {F}\) can be constructed for \(\varvec{\alpha }_\star ^{}\) by using the strong convexity of \(D(\varvec{\alpha })\).

Crucial Value\(\lambda _\text {max}\): In view of (17) and (15), there exists a crucial value \(\lambda _\text {max}\) such that \({\mathbf {w}}_\star = {\mathbf {0}}\) for any \(\lambda \ge \lambda _\text {max}\). Let \({\mathbf {r}} = [f_{11}'(0), \ldots , f_{Kn_K^{}}'(0)] \in \mathbb {R}^n\), (15) implies that \(\varvec{\alpha }_\star ^{} = {\mathbf {r}}\) when \({\mathbf {w}}_\star ^{} = {\mathbf {0}}\). By substituting \(\varvec{\alpha }_\star ^{}\) into (17), we obtain \(\lambda _\text {max} = \max _{j \in [d]} \Vert {\mathbf {A}}_{\mathscr {G}_{j}^{}}{\mathbf {r}}\Vert /(\rho n)\). It is trivial to obtain a closed form solution \({\mathbf {w}}_\star = {\mathbf {0}}\) and \(\varvec{\alpha }_\star ^{} = {\mathbf {r}}\) if \(\lambda \ge \lambda _\text {max}\). Therefore, we only focus on the cases when \(\lambda < \lambda _\text {max}\).

Feasible Set of\(\varvec{\alpha }_\star ^{}\): Lemma 1 implies \(D(\varvec{\alpha })\) is strongly convex if \(f_{ki}^{}\)’s are smooth for all k and i. By using this fact, the dual optimal solution \(\varvec{\alpha }_\star ^{}\) can be bounded in terms of \(\varvec{\alpha }\) and its duality gap \(G(\varvec{\alpha })\) as stated in the following lemma.

Lemma 4

Assume the loss functions \(f_{ki}^{}\)’s are \((1/\mu )\)-smooth for any \(k \in [K],i \in [n_k^{}]\). For any dual feasible solution \(\varvec{\alpha }\), it holds that \(\varvec{\alpha }_\star ^{} \in \mathscr {F} \mathop {=}\limits ^{\mathrm{def}}\big \{\varvec{\theta }~|~ \Vert \varvec{\theta }- \varvec{\alpha }\Vert \le \sqrt{2G(\varvec{\alpha })n/\mu } \big \}\).

By using Lemma 4, (17) can be relaxed as

$$\begin{aligned} \max _{\varvec{\theta }\in \mathscr {F}} \Vert {\mathbf {A}}_{\mathscr {G}_j^{}\varvec{\cdot }} \varvec{\theta }\big \Vert < \rho \lambda n \; \Rightarrow \; ({\mathbf {w}}_\star )_{\mathscr {G}_j^{}} = {\mathbf {0}}. \end{aligned}$$
(18)

In other words, we need to solve the following problem

$$\begin{aligned} \max _{\varvec{\theta }} \big \Vert {\mathbf {A}}_{\mathscr {G}_j^{}\varvec{\cdot }} \varvec{\theta }\big \Vert , ~~\mathrm{s.t.}~\Vert \varvec{\theta }- \varvec{\alpha }\Vert \le \sqrt{2G(\varvec{\alpha })n/\mu }. \end{aligned}$$
(19)

Although it is non-convex, the global optimum of (19) can be obtained by using the result in Gay (1981). Let us define \({\mathbf {H}} \in \mathbb {R}^{K \times K}, {\mathbf {g}} \in \mathbb {R}^K, \upsilon _j, \mathscr {I}_j, \bar{\mathscr {I}}_j\) and \(\bar{{\mathbf {s}}} \in \mathbb {R}^K\) as

  • \({\mathbf {H}} \mathop {=}\limits ^{\mathrm{def}}-\mathrm{diag}\big (2\Vert {\mathbf {X}}_{j\varvec{\cdot }}^1\Vert ^2, \ldots , 2\big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^K\big \Vert ^2\big )\), \({\mathbf {g}} \mathop {=}\limits ^{\mathrm{def}}-2{\big [\big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^1\big \Vert \big |\big \langle {\mathbf {X}}_{j\varvec{\cdot }}^1, \varvec{\alpha }^1\big \rangle \big |, \ldots , \big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^K\big \Vert \big |\big \langle {\mathbf {X}}_{j\varvec{\cdot }}^K, \varvec{\alpha }^K\big \rangle \big | \big ]}^\top \).

  • \(\upsilon _j \mathop {=}\limits ^{\mathrm{def}}\max _{k \in [K]} \big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^k\big \Vert ^2\), \(\mathscr {I}_j \mathop {=}\limits ^{\mathrm{def}}\Big \{k ~\big |~ \big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^k\big \Vert ^2 = \upsilon _j, k \in [K] \Big \}\), \(\bar{\mathscr {I}}_j \mathop {=}\limits ^{\mathrm{def}}[K]\setminus \mathscr {I}_j\).

  • \(\bar{s}_k^{} \mathop {=}\limits ^{\mathrm{def}}\frac{\big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^k\big \Vert \big |\big \langle {\mathbf {X}}_{j\varvec{\cdot }}^k, \varvec{\alpha }^k \big \rangle \big |}{\upsilon _j^{} - \big \Vert {\mathbf {X}}_{j\varvec{\cdot }}^k\big \Vert ^2} ~\text {if}~ k \in \bar{\mathscr {I}}_j, ~\text {otherwise}~ \bar{s}_k^{} \mathop {=}\limits ^{\mathrm{def}}0\).

By using the above notations, the solution of (19) is given in the following lemma.

Lemma 5

If \(\upsilon _j^{} = 0\), the maximum value of (19) is 0. Otherwise, the upper bound is

$$\begin{aligned} \sum _{k=1}^K \big \langle {\mathbf {X}}_{j\varvec{\cdot }}^k, \varvec{\alpha }^k\big \rangle ^2 + \frac{nG(\varvec{\alpha })}{\mu } \vartheta _\star - \frac{1}{2}\left\langle {\mathbf {g}}, {\mathbf {s}}_\star \right\rangle , \end{aligned}$$

where \(\vartheta _\star \) and \({\mathbf {s}}_\star \) are defined as follows: (a) \(\vartheta _\star = 2 \upsilon _j\) and \({\mathbf {s}}^\star = \bar{{\mathbf {s}}} + \widehat{{\mathbf {s}}}\) if 1) \(\exists ~\widehat{{\mathbf {s}}} \in \mathbb {R}^K\) with \(\widehat{{\mathbf {s}}}_{\mathscr {I}_j} = {\mathbf {0}}\) and \(\Vert \bar{{\mathbf {s}}} + \widehat{{\mathbf {s}}}\Vert = \sqrt{2G(\varvec{\alpha })n/\mu }\), and 2) \(\left\langle {\mathbf {X}}_{\cdot j}^t, \varvec{\theta }_t^{}\right\rangle = 0, \forall t \in \mathscr {I}_j\). (b) Otherwise, \(\vartheta _\star > 2 \upsilon _j\) is solution of \(\Vert \left( {\mathbf {H}} + \vartheta _\star {\mathbf {I}}\right) ^{-1} {\mathbf {g}}\Vert = \sqrt{2G(\varvec{\alpha })n/\mu }\), and \({\mathbf {s}}_\star = - \left( {\mathbf {H}} + \vartheta _\star {\mathbf {I}}\right) ^{-1}{\mathbf {g}}\).

To perform screening every p iterations, one can simply add the following three lines before line 4 in Algorithm 1.

  • if\(t\%p = 0\) then

  •      Call Algorithm 2

  • end if

Costs of Screening: Note that the screening is performed on the master every p iterations.

  • By carefully examining the detailed screening rule, the master actually only needs \({\mathbf {A}}\varvec{\alpha }_t^{}\) when evaluating screening rule. Even without screening, the \({\mathbf {A}}\varvec{\alpha }_t^{}\) needs to be computed and sent to the master in each iteration as stated in Algorithm 1 and Remark 1. Therefore, the feature screening does not induce extra communication cost.

  • Regarding the computational cost, we note that the screening problem is dependent on the number of active features that is at most d (there are less and less feature due to screening). As shown in Lemma 5, the screening problem for each feature is a one dimension variable optimization problem. It either has a closed form solution (Case 1) or can be efficiently solved by using Newton’s method (Case 2) that usually takes less than 5 iterations to meet the accuracy \(10^{-15}\).

  • More importantly, by screening out inactive features, it can substantially save optimization problem, especially on local computation. Recall that the local SDCA computation complexity is O(Hd) where H is the local SDCA iteration number and its is usually more than \(10^5\). Compared to local SDCA computation cost, the cost of screening is negligible.

We note that Ndiaye et al. (2015) also presented a feature screening method for multi-task learning. However, in their work, all tasks are assumed to share the same training data while our method allows each task to has its own training data. Consequently, the feature screening problem (19) becomes non-convex instead of convex, which is different from and more challenging than that studied in Ndiaye et al. (2015). In addition, Wang and Ye (2015) developed a static screening rule that exploits the solution at another regularization parameter and only performs screening before the optimization procedure. By contrast, our screening rule is a dynamic with a weaker assumption to exploit the latest solution to repeatedly perform screening during optimization. Therefore, our screening is more practical and performs better empirically.

Difference between Our Proposed Method and COCOA+ We denote the proposed method by \(\texttt {DMTL}_{S}\). There are two main differences between \(\hbox {DMTL}_S\) and COCOA+. First, \(\hbox {DMTL}_S\) constructs the subproblem 5 by using the extrapolation of the solutions in last two iterations that is able to achieve accelerated convergence rate. In contrast, COCOA+ only uses the solution of last iteration. Second, \(\hbox {DMTL}_S\) presents a dynamic feature screening method to reduce the communication cost for each iteration by exploiting the sparsity of the model.

7 Experiments

7.1 Experimental setting

In previous sections, we present our method by focusing on distributed MTL. We hereby conduct experiments to show the advantages of the proposed method for MTL. In fact, our approach can also be extended for distributed single task learning (STL) and the details are provided in the “Appendix”.

To demonstrate the advantages of \(\texttt {DMTL}_S\), we compare \(\texttt {DMTL}_S\) with a COCOA+-based approach (Ma et al. 2015; Smith et al. 2017b) and its extension MOCHA (Smith et al. 2017a) to solve the dual problem (3). In our experiments, the squared loss is used for regression, and the smoothed hinge loss (Shalev-Shwartz and Zhang 2013) is used for classification with \(\mu = 0.5\) for all experiments. It is clear to see that \(f_{ki}^{}\) is \((1/\mu )\)-smooth. For ease of comparison, the local subproblem is solved by using SDCA (Shalev-Shwartz and Zhang 2013) for all methods. The number of iterations for SDCA is set to \(H = 10^4\) for all datasets.

We run all experiments on a local server with 64 worker cores. A distributed environment is simulated on the machine by using distributed platform Petuum (Xing et al. 2015),Footnote 3 and workers for each task are assigned to isolated processes that communicate solely through the platform. Regarding the performance, we evaluate the number of communication iterations required by different methods to obtain a solution with prescribed duality gap. Due to the limitation of computational resources, we are not able to perform experiments on a real distributed environment. However, the results (i.e., the number of communication iterations) reported in this paper does not depend on the environment that it runs on. Compared to COCOA+, the additional computation incurred by our method is negligible: the computation complexity of each iteration of COCOA+ is \(O(H \times d)\). The additional computations required by our method for acceleration and feature screening is O(d) and O(d), respectively. This cost is negligible compared to that of SDCA because H is usually around \(10^5\).

We conduct experiments on the following three datasets (Table 1).

Synthetic Data contains \(K = 10\) regression tasks and generated by using \(y_i^k = \langle {\mathbf {x}}_i^k, {\mathbf {w}}^k\rangle + \epsilon \). The number of examples for each task is randomly generated, which ranges from 903 to 1098. \({\mathbf {x}}_i^k \in \mathbb {R}^{50,000}\) is drawn from \(\mathscr {N}({\mathbf {0}}, {\mathbf {I}})\) and \(\epsilon \sim \mathscr {N}({\mathbf {0}}, 0.5 {\mathbf {I}})\). To obtain a \({\mathbf {W}}\) with row sparsity, we randomly select 400 dimensions from [d] and generate them from \(\mathscr {N}({\mathbf {0}}, {\mathbf {I}})\) for all tasks. For each task, extra noise from \(\mathscr {N}({\mathbf {0}}, 0.5 {\mathbf {I}})\) is added to \({\mathbf {W}}\).

Table 1 Statistics of the datasets for MTL

News20 (Lang 1995) is a collection of around 20, 000 documents from 20 different newsgroups. To construct a multi-task learning problem, we create 5 binary classification tasks using data of all the 5 groups from comp as positive examples. For the negative examples, we choose data from misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball and rec.sport.hockey. The number of training examples for each task ranges from 1163 to 1190, and the number of features is 34, 967.

MDS (Blitzer et al. 2007) includes product reviews on 25 domains in Amazon. We use 22 domains each of which has more than 100 examples for multitask binary sentiment classification. To simulate MTL, we randomly select 1000 examples as training data for the domain with more than 1000 examples. Consequently, the number training examples for each domain ranges from 220 to 1000. The number of features of is 10, 000.

7.2 Results of faster convergence rate

In order to test the convergence rate of \(\texttt {DMTL}_S\), we compare it with the COCOA+-based approach to solving (3) under varying values of \(\lambda \). In view of Sect. 6, we chose \(\lambda = 10^{-2} \lambda _\text {max}\) and \(\lambda = 10^{-3} \lambda _\text {max}\) to solve (3). We set \(\varvec{\alpha }_0^{} = {\mathbf {0}}\) for all methods and \(\rho = 0.9\) for all experiments.

Figure 1 shows the comparison results in terms of the numbers of iterations for communication used by \(\texttt {DMTL}_S\) and COCOA+ to obtain a solution meeting a prescribed duality gap. From the Fig. 1, we can observe that:

  • \(\texttt {DMTL}_S\) is significantly faster than COCOA+ in terms of the number of iterations to meet a prescribed duality gap. Take the synthetic dataset and News20 for example, to obtain a solution at \(\lambda = 10^{-3}\lambda _\text {max}\) with duality gap \(10^{-5}\), \(\texttt {DMTL}_S\) obtains speedups of a factor of 6.64 and 6.94 over COCOA+ on the two datasets, respectively.

  • Generally, the speedup obtained by \(\texttt {DMTL}_S\) is more significant for small values of \(\lambda \). For example, when \(\lambda = 10^{-2} \lambda _\text {max}\), \(\texttt {DMTL}_S\) converges 4.81 and 4.05 times faster than COCOA+ on the synthetic dataset and News20, respectively. In contrast, the speedups is improved up to 7.00 and 5.70 times faster than COCOA+ when \(\lambda = 10^{-3} \lambda _\text {max}\).

  • The improvement is more pronounced when a higher precision is used as the stopping criterion. Take News20 with \(\lambda = 10^{-3} \lambda _\text {max}\) for example, the speedups of \(\texttt {DMTL}_S\) over COCOA+ are 4.00, 4.94, 5.70 and 6.94 when the duality gaps are \(10^{-2}\), \(10^{-3}\), \(10^{-4}\) and \(10^{-5}\), respectively.

7.3 Robust to straggler

In Smith et al. (2017a), MOCHA is proposed to improve COCOA+ to handle systems heterogeneity, e.g., straggler. That means some workers are considerably slower than others and the stragglers fail to return prescribed accurate solution for some iterations. Here, we compare our method with COCOA+ equipped with handling system heterogeneity as presented in Smith et al. (2017a) on News20 and show that our method converges faster even if there exist stragglers. Specifically, we perform experiments under the setting of Smith et al. (2017a) by using different values of H for different workers to simulate the effect of stragglers. The value of H for each iteration is draw from \([0.9 n_\text {min}, n_\text {min}]\) to simulate low variability environment and \([0.5 n_\text {min}, n_\text {min}]\) to simulate high variability environment, where \(n_\text {min} = \min _{k} n_k^{}\).

Fig. 1
figure 1

Duality gap versus communicated iterations on the three datasets for \(\lambda = 10^{-2} \lambda _\text {max}\) and \(\lambda = 10^{-3}\lambda _\text {max}\)

Fig. 2
figure 2

Duality gap versus communicated iterations on News20 with systems heterogeneity for \(\lambda = 10^{-2} \lambda _\text {max}\) and \(\lambda = 10^{-3} \lambda _\text {max}\). Here, COCOA+ denotes its original version equipped with handling system heterogeneity as presented in Smith et al. (2017a)

As shown in Fig. 2, our method is still able to substantially reduce the number of communication for both low and high variability environments. This shows that empirically \(\texttt {DMTL}_S\) is robust to straggler although our analysis assumes that the local subproblem needs to be exactly solved.

7.4 Results of reduced communication cost

To demonstrate the effect of dynamic screening for reducing communication cost, we perform a warm start cross validation experiment on News20 and MDS. Specifically, we solve (3) with 50 various values of \(\lambda \), \(\{\lambda _i\}_{i=1}^{50}\), which are equally distributed on the logarithmic grid from \(0.01 \lambda _\text {max}\) to \(0.3 \lambda _\text {max}\) sequentially (i.e., the solution of \(\lambda _i\) is used as the initialization of \(\lambda _{i-1}\)). To evaluate the total communication cost for the 50 values of \(\lambda \), we calculate the total number of vectors of dimension d used for communication for each worker. We experiment on the following two settings: 1) \(\texttt {DMTL}_S\)without dynamic screening (Without DS), and 2) \(\texttt {DMTL}_S\)with dynamic screening (With DS). Figure 3 presents the total communication cost used by \(\texttt {DMTL}_S\)without and with dynamic screening to solve (3) over \(\{\lambda _i\}_{i=1}^{50}\) on News20 and MDS.

Fig. 3
figure 3

Effect of dynamic screening for reducing communication cost. Total communication cost (normalized by feature dimension d) used by our method without and with dynamic screening for solving (3) over \(\{\lambda _i\}_{i=1}^{50}\) on News20 and MDS

From Fig. 3, we can observe that:

  • The communication cost has been substantially reduced by the proposed dynamic screening because the most inactive features have been progressively identified and discarded during optimization. For example, when the prescribed duality is \(10^{-7}\), the communication cost reduction by the proposed method is \(83.32\%\) and \(67.43\%\) on News20 and MDS, respectively.

  • This advantage of dynamic screening is more significant when a higer precision is used as the stopping criterion. On News20, the speedup increases from 5.99 to 8.63 when the duality gap changes from \(10^{-7}\) to \(10^{-8}\). This is because more inactive features can be screened out when a more accurate solution is obtained.

  • More importantly, the proposed dynamic screening is more pronounced for the problem with higher dimension. Take the duality gap of \(10^{-8}\) for example, the speedups obtained by dynamic screening are 8.63 and 4.14 on News20 and MDS, respectively, where News20 is of much higher dimensionality than MDS.

8 Conclusion

In this paper, we present a new distributed optimization method, \(\texttt {DMTL}_S\), for MTL with matrix sparsity regularization. We provide theoretical convergence analysis for \(\texttt {DMTL}_S\). We also propose a data screening method to further reduce the communication cost. We carefully design and conduct extensive experiments on both synthetic and real-world datasets to verify the faster convergence rate and the reduced communication cost of \(\texttt {DMTL}_S\) in comparison with two state-of-the-art baselines, COCOA+ and MOCHA.