Abstract
Packing and covering linear programs (PC-LPs) form an important class of linear programs (LPs) across computer science, operations research, and optimization. Luby and Nisan (in: STOC, ACM Press, New York, 1993) constructed an iterative algorithm for approximately solving PC-LPs in nearly linear time, where the time complexity scales nearly linearly in N, the number of nonzero entries of the matrix, and polynomially in \(1/\varepsilon \), where \(\varepsilon \) is the (multiplicative) approximation error. Unfortunately, existing nearly linear-time algorithms (Plotkin et al. in Math Oper Res 20(2):257–301, 1995; Bartal et al., in: Proceedings 38th annual symposium on foundations of computer science, IEEE Computer Society, 1997; Young, in: 42nd annual IEEE symposium on foundations of computer science (FOCS’01), IEEE Computer Society, 2001; Koufogiannakis and Young in Algorithmica 70:494–506, 2013; Young in Nearly linear-time approximation schemes for mixed packing/covering and facility-location linear programs, 2014. arXiv:1407.3015; Allen-Zhu and Orecchia, in: SODA, 2015) for solving PC-LPs require time at least proportional to \(\varepsilon ^{-2}\). In this paper, we break this longstanding barrier by designing a packing solver that runs in time \(\widetilde{O}(N \varepsilon ^{-1})\) and a covering LP solver that runs in time \(\widetilde{O}(N \varepsilon ^{-1.5})\). Our packing solver can be extended to run in time \(\widetilde{O}(N \varepsilon ^{-1})\) for a class of well-behaved covering programs. In a follow-up work, Wang et al. (in: ICALP, 2016) showed that all covering LPs can be converted into well-behaved ones by a reduction that blows up the problem size only logarithmically.
Notes
Luby and Nisan, who originally studied iterative solvers for this class of problems [24], dubbed them positive LPs. However, the class of LPs with non-negative constraint matrices is slightly larger, including mixed-packing-and-covering LPs. For this reason, we prefer to stick to the PC-LP terminology.
Most width-dependent solvers study the minmax problem \(\min _{\begin{array}{c} x\ge 0 , {\mathbbm {1}}^{T} x = 1 \end{array}} \; \max _{\begin{array}{c} y\ge 0 , {\mathbbm {1}}^{T} y = 1 \end{array}} \; y^T A x ,\) whose optimal value equals \(1/\mathsf {OPT}\). Their approximation guarantees are often written in terms of additive error. We have translated their guarantees into multiplicative error for a clearer comparison.
Some of these solvers still have a \(\textsf {polylog}(\rho )\) dependence. Since each occurrence of \(\log (\rho )\) can be replaced with \(\log (nm)\) after slightly modifying the matrix A, we have done so in Table 1 for a fair comparison.
This can be verified by observing that our objective \(f_\mu (x)\), to be introduced later, is not globally Lipschitz smooth, so that one cannot apply accelerated gradient descent directly.
Due to space limitations, we quickly sketch why logarithmic word size suffices for our algorithms. On one hand, one can prove that if, in an iteration, x is calculated with a small additive error \(1/\textsf {poly}(1/\varepsilon ,n,m)\), then the objective f(x) may increase only by \(1/\textsf {poly}(1/\varepsilon ,n,m)\) in that iteration. The proof relies on the facts that (1) one can assume without loss of generality that all entries of A are at most \(\textsf {poly}(1/\varepsilon ,n,m)\), and (2) our algorithms ensure \(f(x) < \textsf {poly}(1/\varepsilon , n, m)\) for all iterations with high probability, so even though we use exponential functions, f(x) cannot change much in additive terms. On the other hand, one can similarly prove that each \(\nabla _i f(x)\) can be calculated within an additive error of \(1/\textsf {poly}(1/\varepsilon ,n,m)\) in each iteration. Together, these imply that the total error incurred by arithmetic operations can be made negligible.
If \(\min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\} = 0\) then the packing LP is unbounded, so we are done. Otherwise, if \(\min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\} = v > 0\), we scale all entries of A by 1/v and scale \(\mathsf {OPT}\) by v.
Note that some of the previous results (such as [7, 31]) appear to directly minimize \(\sum _{j=1}^m e^{((Ax)_j - 1)/\mu }\) as opposed to its logarithm g(x). However, their per-iteration objective decrease is multiplicative, meaning it is essentially equivalent to performing a single gradient-descent step on g(x) with additive objective decrease.
If \(\min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\} = 0\) then the covering LP is infeasible, so we are done. Otherwise, if \(\min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\} = v > 0\), we scale all entries of A by 1/v and scale \(\mathsf {OPT}\) by v.
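The rescaling in the packing and covering footnotes above is a one-pass preprocessing step. The following is a minimal sketch (the function name and return convention are ours), written for the column-wise (packing) case; the covering case applies the same routine to the rows, i.e., to \(A^T\):

```python
import numpy as np

def rescale(A):
    """Scale A so that the smallest column infinity-norm becomes 1.

    Returns (A / v, v) where v = min_i ||A_{:i}||_inf, or (None, 0.0) when
    v = 0, i.e. the packing LP is unbounded (resp. the covering LP is
    infeasible for the row-wise variant). OPT is rescaled by v accordingly.
    """
    v = np.abs(A).max(axis=0).min()  # smallest column infinity-norm
    if v == 0:
        return None, 0.0
    return A / v, v
```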
The constant 9 in this section can be replaced with any other constant greater than 1.
This negative width technique is related to [7, Definition 3.2], where the authors analyze the multiplicative weight update method in a special case when the oracle returns loss values only in \([-\ell , +\rho ]\), for some \(\ell \ll \rho \). This technique is also a sub-case of a more general theory of mirror descent, known as the local-norm convergence, that we have summarized in a separate and later paper [3].
We wish to point out that this proof coincides with a lemma from the accelerated coordinate descent theory of Fercoq and Richtárik [17]. Their paper is about optimizing an objective function that is Lipschitz smooth, and thus seemingly unrelated to our work.
This is because, our parameter choices ensure that \((1+\gamma )\alpha _k n < 1/2\beta \), which further means \(-(1+\gamma )\alpha _k n\xi _{k,i}^{(i)} \le 1/2\). As a result, we must have \(\mathsf {z}_{k,i}^{(i)} \le \mathsf {z}_{k-1,i} \cdot e^{0.5} < 2 \mathsf {z}_{k-1,i}\) (see the explicit definition of the mirror step at Proposition 6.4).
References
Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In: SODA (2016)
Allen-Zhu, Z., Li, Y., Oliveira, R., Wigderson, A.: Much faster algorithms for matrix scaling. In: FOCS (2017). arXiv:1704.02315
Allen-Zhu, Z., Liao, Z., Orecchia, L.: Spectral sparsification and regret minimization beyond multiplicative updates. In: STOC (2015)
Allen-Zhu, Z., Orecchia, L.: Using optimization to break the epsilon barrier: a faster and simpler width-independent algorithm for solving positive linear programs in parallel. In: SODA (2015)
Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent. In: ITCS (2017)
Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. In: ICML (2016)
Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8, 121–164 (2012)
Awerbuch, B., Khandekar, R.: Stateless distributed gradient descent for positive linear programs. In: STOC (2008)
Awerbuch, B., Khandekar, R., Rao, S.: Distributed algorithms for multicommodity flow problems via approximate steepest descent framework. ACM Trans. Algorithms 9(1), 1–14 (2012)
Bartal, Y., Byers, J.W., Raz, D.: Global optimization using local information with applications to flow control. In: Proceedings 38th Annual Symposium on Foundations of Computer Science, pp. 303–312. IEEE Computer Society (1997)
Bartal, Y., Byers, J.W., Raz, D.: Fast, distributed approximation algorithms for positive linear programming with applications to flow control. SIAM J. Comput. 33(6), 1261–1279 (2004)
Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization. Soc. Ind. Appl. Math., pp. 315–341 (2013)
Bienstock, D., Iyengar, G.: Faster approximation algorithms for packing and covering problems. Technical report, Columbia University, September 2004. Preliminary version published in STOC ’04
Byers, J., Nasser, G.: Utility-based decision-making in wireless sensor networks. In: First Annual Workshop on Mobile and Ad Hoc Networking and Computing (MobiHOC), pp. 143–144. IEEE (2000)
Chudak, F.A., Eleutério, V.: Improved approximation schemes for linear programming relaxations of combinatorial optimization problems. In: Proceedings of the 11th International IPCO Conference on Integer Programming and Combinatorial Optimization, pp. 81–96 (2005)
Duan, R., Pettie, S.: Linear-time approximation for maximum weight matching. J. ACM 61(1), 1–23 (2014)
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
Fleischer, L.K.: Approximating fractional multicommodity flow independent of the number of commodities. SIAM J. Discrete Math. 13(4), 505–520 (2000)
Garg, N., Könemann, J.: Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM J. Comput. 37(2), 630–652 (2007)
Grigoriadis, M.D., Khachiyan, L.G.: Fast approximation schemes for convex programs with many blocks and coupling constraints. SIAM J. Optim. 4(1), 86–107 (1994)
Jain, R., Ji, Z., Upadhyay, S., Watrous, J.: QIP = PSPACE. J. ACM 58(6), 30 (2011)
Klein, P., Young, N.E.: On the number of iterations for Dantzig–Wolfe optimization and packing-covering approximation algorithms. SIAM J. Comput. 44(4), 1154–1172 (2015)
Koufogiannakis, C., Young, N.E.: A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica 70, 494–506 (2013). (Previously appeared in FOCS ’07)
Luby, M., Nisan, N.: A parallel approximation algorithm for positive linear programming. In: STOC, pp. 448–457. ACM Press, New York (1993)
Madry, A.: Faster approximation schemes for fractional multicommodity flow problems via dynamic graph algorithms. In: STOC. ACM Press, New York (2010)
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nesterov, Y.: Rounding of convex sets and efficient gradient methods for linear programming problems. Optim. Methods Softw. 23(1), 109–128 (2008)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 269, 543–547 (1983)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim. 22(2), 341–362 (2012)
Plotkin, S.A., Shmoys, D.B., Tardos, É.: Fast approximation algorithms for fractional packing and covering problems. Math. Oper. Res. 20(2), 257–301 (1995). (conference version published in FOCS 1991)
Trevisan, L.: Parallel approximation algorithms by positive linear programming. Algorithmica 21(1), 72–88 (1998)
Wang, D., Mahoney, M., Mohan, N., Rao, S.: Faster parallel solver for positive linear programs via dynamically-bucketed selective coordinate descent. ArXiv e-prints arXiv:1511.06468 (2015)
Wang, D., Rao, S., Mahoney, M.W.: Unified acceleration method for packing and covering problems via diameter reduction. In: ICALP (2016)
Young, N.E.: Sequential and parallel algorithms for mixed packing and covering. In: 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS’01), pp. 538–546. IEEE Computer Society (2001)
Young, N.E.: Nearly linear-time approximation schemes for mixed packing/covering and facility-location linear programs. ArXiv e-prints arXiv:1407.3015 (2014)
Zurel, E., Nisan, N.: An efficient approximate allocation algorithm for combinatorial auctions. In: Proceedings of the 3rd ACM Conference on Electronic Commerce, pp. 125–136. ACM (2001)
Additional information
An earlier version of this paper appeared on arXiv http://arxiv.org/abs/1411.1124 in November 2014. A 6-page abstract of this paper, “Nearly-Linear Time Positive LP Solver with Faster Convergence Rate,” excluding Sects. 4, 5 and 6, and excluding all the proofs in Sects. 2 and 3, was presented at the STOC 2015 conference in Portland, OR.
Appendix
1.1 Proof of Lemma 3.6
Lemma 3.6
We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).
Proof
This is true at the beginning as \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) (see Fact 2.8) and \(\mathsf {z}_0 = 0 \in \varDelta _{\mathsf {box}}\). In fact, it suffices for us to show that for every \(k\ge 1\), \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) for some scalars \(\gamma _k^l\) satisfying \(\sum _l \gamma _k^l = 1\) and \(\gamma _k^l \ge 0\) for each \(l = 0,\dots ,k\). If this is true, we can prove the lemma by induction: at each iteration \(k \ge 1\),
-
1.
\(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) must be in \(\varDelta _{\mathsf {box}}\) because \(\mathsf {y}_{k-1}\) and \(\mathsf {z}_{k-1}\) are and \(\tau \in [0,1]\),
-
2.
\(\mathsf {z}_k\) is in \(\varDelta _{\mathsf {box}}\) by the definition that \(\mathsf {z}_k = {\text {arg min}}_{z\in \varDelta _{\mathsf {box}}}\{\cdots \}\), and
-
3.
\(\mathsf {y}_k\) is also in \(\varDelta _{\mathsf {box}}\) because \(\mathsf {y}_k= \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) is a convex combination of the \(\mathsf {z}_l\)’s and \(\varDelta _{\mathsf {box}}\) is convex.
For the rest of the proof, we show that \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) for every \(k\ge 1\), with appropriate coefficients \(\gamma _k^l\).Footnote 13
This is true at the base case \(k=1\) because \(\mathsf {y}_1 = \mathsf {x}_1 + \frac{1}{n\alpha _1 L}(\mathsf {z}_1 - \mathsf {z}_0) = \frac{1}{n\alpha _1 L} \mathsf {z}_1 + \big (1-\frac{1}{n\alpha _1 L}\big ) \mathsf {z}_0\). For the general \(k\ge 2\), we have
Therefore, we obtain \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) as desired.
It is now easy to check that under our definition of \(\alpha _k\) (which satisfies \(\alpha _k\ge \alpha _{k-1}\) and \(\alpha _k\ge \alpha _0 = \frac{1}{nL}\)), we must have \(\gamma _k^l \ge 0\) for all k and l. Also,
\(\square \)
1.2 Proof of Proposition 4.5
Proposition 4.5
-
(a)
\(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) x^*\).
-
(b)
\(f_\mu (x) \ge (1-\varepsilon )\mathsf {OPT}\) for every \(x \ge 0\).
-
(c)
For any \(x \ge 0\) satisfying \(f_\mu (x) \le 2\mathsf {OPT}\), we must have \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).
-
(d)
If \(x\ge 0\) satisfies \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\) for some \(\delta \in [0,1]\), then \(\frac{1}{1-\varepsilon } x\) is a \(\frac{1+\delta }{1-\varepsilon }\)-approximate solution to the covering LP.
Proof
-
(a)
We have \({\mathbbm {1}}^{T} u^* = (1+\varepsilon /2)\mathsf {OPT}\) by the definition of \(\mathsf {OPT}\). Also, from the feasibility constraint \(A x^* \ge {\mathbbm {1}}\) in the covering LP, we have \(A u^* - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), and can compute \(f_\mu (u^*)\) as follows:
$$\begin{aligned} f_\mu (u^*) = \mu \sum _j e^{\frac{1}{\mu } (1 - (A u^*)_j)} + {\mathbbm {1}}^{T} u^*&\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + (1+\varepsilon /2)\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} + (1+\varepsilon /2)\mathsf {OPT}\le (1+\varepsilon )\mathsf {OPT}. \end{aligned}$$
-
(b)
Suppose towards contradiction that \(f_\mu (x) < (1-\varepsilon )\mathsf {OPT}\). Since \(f_\mu (x) < \mathsf {OPT}\le m\), we must have \(e^{\frac{1}{\mu }(1-(Ax)_j)} \le f_\mu (x) / \mu \le m / \mu \) for every \(j\in [m]\). This further implies \((Ax)_j \ge 1-\varepsilon \) by the definition of \(\mu \); in other words, \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\). By the definition of \(\mathsf {OPT}\), we must then have \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), and therefore \(f_\mu (x) \ge {\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), a contradiction.
-
(c)
To show \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\), we can assume that \(v = \max _j (1 - (Ax)_j) > \varepsilon \) because otherwise we are done. Under this definition, we have
$$\begin{aligned} \textstyle f_\mu (x) \ge \mu e^{\frac{v}{\mu }} = \mu \big ((\frac{nm}{\varepsilon })^4\big )^{v/\varepsilon } \ge \frac{\varepsilon }{4\log (nm/\varepsilon )} (\frac{nm}{\varepsilon })^4 \gg 2\mathsf {OPT}, \end{aligned}$$contradicting our assumption that \(f_\mu (x) \le 2\mathsf {OPT}\). Therefore, we must have \(v \le \varepsilon \), that is, \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).
-
(d)
For any x satisfying \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\le 2 \mathsf {OPT}\), owing to Proposition 4.5c, we first have that x is approximately feasible, i.e., \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\). Next, because \({\mathbbm {1}}^{T} x \le f_\mu (x) \le (1+\delta )\mathsf {OPT}\), we know that x yields an objective \({\mathbbm {1}}^{T} x \le (1+\delta )\mathsf {OPT}\). Letting \(x' = \frac{1}{1-\varepsilon } x\), we have both that \(x'\) is feasible (i.e., \(Ax' \ge {\mathbbm {1}}\)) and that \(x'\) has an objective \({\mathbbm {1}}^{T} x'\) at most \(\frac{1+\delta }{1-\varepsilon } \mathsf {OPT}\).
\(\square \)
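As a numeric sanity check of Proposition 4.5, the smoothed objective \(f_\mu (x) = \mu \sum _j e^{\frac{1}{\mu }(1 - (Ax)_j)} + {\mathbbm {1}}^{T} x\) can be evaluated on a toy covering LP. The instance below is our own illustration, with \(\mu = \frac{\varepsilon }{4\log (nm/\varepsilon )}\) (the value implied by the bounds in parts (a) and (c)):

```python
import numpy as np

def f_mu(A, x, mu):
    # smoothed covering objective: mu * sum_j exp((1 - (Ax)_j)/mu) + 1^T x
    return mu * np.exp((1.0 - A @ x) / mu).sum() + x.sum()

# Toy covering LP: A = I_2, so OPT = min 1^T x s.t. Ax >= 1 equals 2 at x* = (1,1).
A = np.eye(2)
x_star, OPT = np.ones(2), 2.0
n = m = 2
eps = 0.1
mu = eps / (4 * np.log(n * m / eps))

# Part (a): u* = (1 + eps/2) x* satisfies f_mu(u*) <= (1 + eps) OPT.
u_star = (1 + eps / 2) * x_star
assert f_mu(A, u_star, mu) <= (1 + eps) * OPT

# Part (b): f_mu(x) >= (1 - eps) OPT for every x >= 0 (spot-checked on samples).
rng = np.random.default_rng(0)
for x in rng.uniform(0.0, 3.0, size=(200, 2)):
    assert f_mu(A, x, mu) >= (1 - eps) * OPT
```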
1.3 Missing proofs for Sect. 5
In this section we prove Theorem 5.3. Because the proof structure is almost identical to that of Theorem 3.4, we mostly point out the differences rather than repeat the proofs. The following three lemmas are completely identical to the ones in the packing LP case, so we restate them below:
Lemma C.1
(cf. Lemma 3.3) Each iteration of \(CovLPSolver^{\mathsf {wb}}\) can be implemented to run in expected O(N / n) time.
Lemma C.2
(cf. Lemma 3.6) We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).
Lemma C.3
(cf. Lemma 3.7) For every \(u\in \varDelta _{\mathsf {box}}\), it satisfies \(\big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 .\)
For the gradient descent guarantee of Sect. 3.3, one can first note that Lemma 2.7 remains true: this can be verified by replacing \(\nabla _i f_\mu (x) + 1\) in its proof with \(1 - \nabla _i f_\mu (x)\). For this reason, Lemma 3.9 (which is built on Lemma 2.7) also remains true. We state it below:
Lemma C.4
(cf. Lemma 3.9) We have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \ge 0\).
Putting it all together Denote by \(\eta _{k}^{(i)} \in \mathbb {R}_{\le 0}^n \) the vector that is non-zero only at coordinate i, and satisfies \(\eta _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) - \xi _{k,i}^{(i)} \in (-\infty , 0]\). In other words, the full gradient
can (in expectation) be decomposed into a large but non-positive component \(\eta _k^{(i)} \in (-\infty , 0]^n\) and a small component \(\xi _k^{(i)} \in [-1,1]^n\). As in Sect. 3.4, for any \(u \in \varDelta _{\mathsf {box}}\), we can use a basic convexity argument and the mirror descent lemma to compute that
Above, \(\textcircled {{\small 1}}\)is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 2}}\)uses convexity and Lemma C.3. We can establish the following lemma to upper bound the boxed term in (C.2). Its proof is in the same spirit as that of Lemma 3.10, and is the only place where we require all vectors to reside in \(\varDelta _{\mathsf {box}}\).
Lemma C.5
(cf. Lemma 3.10) For every \(u \in \varDelta _{\mathsf {box}}\),
Proof of Lemma C.5
Now there are three possibilities:
-
If \(\eta _{k,i}^{(i)}=0\), then we must have \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\). Lemma C.4 implies
$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad = n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 2 n^2 \alpha _k^2 L \cdot ( f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) ) \end{aligned}$$
-
If \(\eta _{k,i}^{(i)} < 0\) and \(\mathsf {z}_{k,i}^{(i)} < \frac{10}{\Vert A_{:i}\Vert _{\infty }}\) (thus \(\mathsf {z}_{k}^{(i)}\) is not on the boundary of \(\varDelta _{\mathsf {box}}\)), then we precisely have \(\mathsf {z}_{k,i}^{(i)} = \mathsf {z}_{k-1,i} + \frac{n\alpha _k}{\Vert A_{:i}\Vert _{\infty }}\), and accordingly \(\mathsf {y}_{k,i}^{(i)} = \mathsf {x}_{k,i} + \frac{1}{L \Vert A_{:i}\Vert _{\infty }} > \mathsf {x}_{k,i}\). In this case,
$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{-10}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{<} n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{-10}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} 10 n \alpha _k L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\big ( 20 n \alpha _k L + 2 n^2 \alpha _k^2 L \big ) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$Above, \(\textcircled {{\small 1}}\)follows from the fact that \(\mathsf {z}_{k-1},u \in \varDelta _{\mathsf {box}}\) and therefore \(z_{k-1,i}\ge 0\) and \(u_i \le \frac{10}{\Vert A_{:i}\Vert _{\infty }}\) by the definition of \(\varDelta _{\mathsf {box}}\), and \(u\ge 0\); \(\textcircled {{\small 2}}\)follows from the fact that \(\mathsf {x}_k\) and \(\mathsf {y}_k^{(i)}\) are only different at coordinate i, and \(\xi _{k,i}^{(i)}=-1 > \nabla _i f_\mu (\mathsf {x}_k)\) (since \(\eta _{k,i}^{(i)}<0\)); \(\textcircled {{\small 3}}\)follows from the fact that \(\mathsf {y}_{k}^{(i)} = \mathsf {x}_{k} + \frac{\mathbf {e}_i}{L \Vert A_{:i}\Vert _{\infty }}\); and \(\textcircled {{\small 4}}\)uses Lemma C.4.
-
If \(\eta _{k,i}^{(i)} < 0\) and \(\mathsf {z}_{k,i}^{(i)} = \frac{10}{\Vert A_{:i}\Vert _{\infty }}\), then we have
$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}- \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{\le }\big \langle n \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }4 n^2 \alpha _k^2 L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$Above, \(\textcircled {{\small 1}}\)is because \(u_i \le \frac{10}{\Vert A_{:i}\Vert _{\infty }} = \mathsf {z}_{k,i}^{(i)}\) and \(\eta _{k,i}^{(i)} < 0\), together with \(\nabla _i f_\mu (\mathsf {x}_k) < \xi _{k,i}^{(i)}\) and \(\mathsf {x}_{k,i} \le \mathsf {y}_{k,i}^{(i)}\); \(\textcircled {{\small 2}}\)uses \(\nabla _i f_\mu (\mathsf {x}_k) = \eta _{k,i}^{(i)} - 1 < \eta _{k,i}^{(i)}\) and \(\mathsf {z}_{k,i}^{(i)} \ge \mathsf {z}_{k-1,i}\); \(\textcircled {{\small 3}}\)is from our choice of \(\mathsf {y}_k\) which satisfies that \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\); and \(\textcircled {{\small 4}}\)uses Lemma C.4.
Combining the three cases, and using the fact that \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), we conclude that
Above, the last inequality uses our choice of \(\alpha _k\), which implies \(n \alpha _k \le n \alpha _T = \frac{1}{\varepsilon L} \le \frac{1}{4}\).
Plugging Lemma C.5 back into (C.2), we have
Above, \(\textcircled {{\small 1}}\)uses Lemma C.5; and \(\textcircled {{\small 2}}\)is because we have chosen \(\tau \) to satisfy \(\frac{1}{\tau } = 21 n L \).
Next, recall that we have picked \(\alpha _{k}\) so that \((21n L - 1) \alpha _k = 21n L \cdot \alpha _{k-1}\) in \(CovLPSolver^{\mathsf {wb}}\). Telescoping (C.3) for \(k=1,\dots ,T\) and choosing \(u^*=(1+\varepsilon /2) x^*\), we have
Here, the second inequality is due to \(f_\mu (\mathsf {y}_0) = f_\mu (x^{\mathsf {start}}) \le 3\mathsf {OPT}\) from Fact 5.2, and the fact that
Finally, using the fact that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} \big (1 - \frac{1}{21 nL}\big )^k = 21 n \alpha _T L \big (1-(1-\frac{1}{21 nL})^T\big )\), we rearrange and obtain that
We choose \(T = \lceil 21 n L \log (1/\varepsilon ) \rceil \) so that \(\frac{1}{n\alpha _T L} = (1-\frac{1}{21 n L})^T \le \varepsilon \). Combining this with the fact that \(f_\mu (u^*)\le (1+\varepsilon )\mathsf {OPT}\) (see Proposition 4.5a), we obtain
Therefore, we have finished proving Theorem 5.3. \(\square \)
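The choice of T in the proof above can be checked numerically: for \(x = \frac{1}{21nL}\) we have \((1-x)^T \le e^{-xT}\) and \(xT \ge \log (1/\varepsilon )\), so the decay factor indeed drops below \(\varepsilon \). A minimal sketch (the parameter values are our own illustrations, and \(\log \) is read as the natural logarithm):

```python
import math

def iteration_count(n, L, eps):
    # T = ceil(21 n L log(1/eps)), as chosen at the end of the proof
    return math.ceil(21 * n * L * math.log(1 / eps))

# since (1 - x)^T <= exp(-x T) and x T >= log(1/eps) for x = 1/(21 n L),
# the telescoped factor (1 - 1/(21 n L))^T is at most eps:
n, L, eps = 50, 10.0, 0.01
T = iteration_count(n, L, eps)
assert (1 - 1 / (21 * n * L)) ** T <= eps
```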
1.4 Missing proofs for Sect. 6
Proposition 6.4
If \(\mathsf {z}_{k-1}\in \varDelta _{\mathsf {simplex}}\) and \(\mathsf {z}_{k-1}>0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any scalar \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:
-
1.
\(z \leftarrow \mathsf {z}_{k-1}\).
-
2.
\(z_i \leftarrow z_i \cdot e^{-\delta }\).
-
3.
If \({\mathbbm {1}}^{T} z > 2\mathsf {OPT}'\), \(z \leftarrow \frac{2\mathsf {OPT}'}{{\mathbbm {1}}^{T} z} z\).
-
4.
Return z.
Proof
Let us denote by z the returned value of the described procedure, and define \(g(u) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}V_{\mathsf {z}_{k-1}}(u) + \langle \delta \mathbf {e}_i, u \rangle \). Since \(\varDelta _{\mathsf {simplex}}\) is a convex body and \(g(\cdot )\) is convex, to show \(z = {\text {arg min}}_{u \in \varDelta _{\mathsf {simplex}}} \{g(u)\}\), it suffices to prove that \(\langle \nabla g(z), u-z \rangle \ge 0\) for every \(u \in \varDelta _{\mathsf {simplex}}\). Since the gradient \(\nabla g(z)\) can be written explicitly, this is equivalent to
If the re-scaling in step 3 is not executed, then we have \(z_\ell = \mathsf {z}_{k-1,\ell }\) for every \(\ell \ne i\), and \(z_i = \mathsf {z}_{k-1,i} \cdot e^{-\delta }\); thus, the left-hand side is zero so the above inequality is true for every \(u\in \varDelta _{\mathsf {simplex}}\).
Otherwise, we have \({\mathbbm {1}}^{T} z = 2\mathsf {OPT}'\), and there exists some constant \(Z>1\) such that \(z_\ell = \mathsf {z}_{k-1,\ell } / Z\) for every \(\ell \ne i\), and \(z_i = \mathsf {z}_{k-1,i} \cdot e^{-\delta } / Z\). In such a case, the left-hand side equals
It is now clear that since \(\log Z > 0\) and \({\mathbbm {1}}^{T} u \le 2\mathsf {OPT}' = {\mathbbm {1}}^{T} z\), the above quantity is always non-negative, finishing the proof. \(\square \)
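The four steps of Proposition 6.4 are straightforward to implement, and the optimality of the returned point can be tested directly against random feasible points. Below is a minimal sketch (variable names and the randomized check are ours), writing cap for \(2\mathsf {OPT}'\) so that \(\varDelta _{\mathsf {simplex}} = \{z \ge 0: {\mathbbm {1}}^{T} z \le \mathrm {cap}\}\):

```python
import numpy as np

def V(x, y):
    # Bregman divergence of w(u) = sum_l (u_l log u_l - u_l), i.e. generalized KL
    return float(np.sum(y * np.log(y / x) + x - y))

def mirror_step(z_prev, i, delta, cap):
    """Steps 1-4 of Proposition 6.4 over Delta = {z >= 0 : 1^T z <= cap}."""
    z = z_prev.copy()
    z[i] *= np.exp(-delta)      # step 2: multiplicative update on coordinate i
    if z.sum() > cap:           # step 3: rescale back onto the capped simplex
        z *= cap / z.sum()
    return z

# check optimality of z = argmin_{u in Delta} V(z_prev, u) + delta * u_i
# by comparing against random feasible points:
rng = np.random.default_rng(1)
cap, i, delta = 1.0, 2, -3.0             # negative delta triggers the rescaling
z_prev = rng.uniform(0.05, 0.15, size=5)  # strictly positive, sum < cap
z = mirror_step(z_prev, i, delta, cap)
g = lambda u: V(z_prev, u) + delta * u[i]
for _ in range(300):
    u = rng.dirichlet(np.ones(5)) * rng.uniform(0.1, cap)  # random feasible u > 0
    assert g(z) <= g(u) + 1e-9
```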
Lemma 6.13
Denoting by \(\gamma {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}2\alpha _T n\), we have
Proof
Define \(w(x){\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sum _i x_i \log (x_i) - x_i\) and accordingly, \(V_x(y) = w(y) - \langle \nabla w(x), y-x \rangle - w(x) = \sum _i y_i \log \frac{y_i}{x_i} + x_i - y_i\). We first compute using the classical analysis of mirror descent step as follows:
Above, \(\textcircled {{\small 1}}\)is because \(\mathsf {z}^{(i)}_k = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}}\big \{V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma )\alpha _k n \xi _k^{(i)}, z \rangle \big \}\), which is equivalent to saying
In particular, we have \({\mathbbm {1}}^{T} \frac{u^*}{1+\gamma } = {\mathbbm {1}}^{T} \frac{(1+\varepsilon /2)x^*}{1+\gamma } < 2\mathsf {OPT}\le 2\mathsf {OPT}'\) and therefore \(\frac{u^*}{1+\gamma } \in \varDelta _{\mathsf {simplex}}\). Substituting \(u = \frac{u^*}{1+\gamma } \) into the above inequality we get \(\textcircled {{\small 1}}\).
Next, we upper bound the term in the box:
Above, \(\textcircled {{\small 1}}\)uses the facts that (i) \(a \log \frac{a}{b} + b - a \ge 0\) for any \(a,b>0\), (ii) \(\mathsf {z}_{k-1,i}-\mathsf {z}_k^{(i)}\) and \(\xi _{k,i}\) have the same sign, and (iii) \(\xi _{k,i'}^{(i)}=0\) for every \(i' \ne i\); \(\textcircled {{\small 2}}\)uses the inequality that for every \(a,b>0\), we have \( a \log \frac{a}{b} + b - a \ge \frac{(a-b)^2}{2\max \{a,b\}}\); \(\textcircled {{\small 3}}\)uses the fact that \(\mathsf {z}^{(i)}_{k,i} \le 2\mathsf {z}_{k-1,i}\) (see Footnote 14); \(\textcircled {{\small 4}}\)uses Cauchy–Schwarz: \(ab - b^2/4 \le a^2\); \(\textcircled {{\small 5}}\)uses \((1+\gamma )^2 < 2\); \(\textcircled {{\small 6}}\)uses \(|\xi _{k,i}| \le 1\) and \(\gamma = 2\alpha _T n \ge 2\alpha _k n\); \(\textcircled {{\small 7}}\)uses \(\xi _{k,i} \ge -\beta \).
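Two of the scalar inequalities invoked above, \(a\log \frac{a}{b} + b - a \ge \frac{(a-b)^2}{2\max \{a,b\}}\) and \(ab - b^2/4 \le a^2\) for \(a,b>0\), can be spot-checked numerically; a minimal sketch:

```python
import math
import random

random.seed(0)
for _ in range(10000):
    a = random.uniform(1e-3, 10.0)
    b = random.uniform(1e-3, 10.0)
    # generalized-KL lower bound used in the second step
    assert a * math.log(a / b) + b - a >= (a - b) ** 2 / (2 * max(a, b)) - 1e-9
    # the Cauchy-Schwarz (AM-GM) step: ab - b^2/4 <= a^2
    assert a * b - b * b / 4 <= a * a + 1e-12
```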
Next, we combine (D.1) and (D.2) to conclude that
Taking expectation on both sides with respect to i, and using the property that \({\mathbbm {1}}^{T} \mathsf {z}_{k-1} \le 3\mathsf {OPT}' \le 6\mathsf {OPT}\), we obtain that
\(\square \)
Lemma 6.14
For every \(i\in [n]\), we have
-
(a)
\(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), and
-
(b)
\(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{\mu \beta }{12} \cdot \langle - \widetilde{\eta }_{k}^{(i)}, u^* \rangle .\)
Proof of Lemma 6.14 part (a)
If \(i\not \in B_k\) is not a large index, then \(\mathsf {y}_k^{(i)} = \mathsf {x}_k\) and the claim is trivial; we therefore focus on \(i\in B_k\) in the remaining proof. Recall that \(\mathsf {y}_k^{(i)} = \mathsf {x}_k + \delta \mathbf {e}_i\) for some \(\delta >0\) defined in Algorithm 3, so we have
It is clear that \(\langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle \) decreases as \(\tau \) increases, and therefore it suffices to prove that \(\langle A_{:i}, p(\mathsf {x}_k + \delta \mathbf {e}_i) \rangle \ge 1\).
Suppose (for simplicity of notation) that the entries of the column \(A_{:i}\) are sorted in increasing order of \(A_{j,i}\). Now, by the definition of the algorithm (recall (6.1)), there exists some \(j^* \in [m]\) satisfying that
Next, by our choice of \(\delta \) which satisfies \(\delta = \frac{\mu \beta }{2A_{j^*,i}} \le \frac{\mu \beta }{2A_{j,i}} \) for every \(j \le j^*\), we have for every \(j\le j^*\):
and as a result,
\(\square \)
Proof of Lemma 6.14 part (b)
Owing to part (a), for every coordinate i such that \(\widetilde{\eta }_{k,i}\ge 0\), we automatically have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), so the lemma is obvious. Therefore, let us focus only on coordinates i such that \(\widetilde{\eta }_{k,i}<0\); these are necessarily large indices \(i\in B_k\). Recall from Definition 6.11 that \(\widetilde{\eta }_{k,i} = (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i\), so we have
For simplicity of description, suppose again that the i-th column is sorted in non-decreasing order, that is, \(A_{1,i}\le \cdots \le A_{m,i}\). The definition of \(j^*\) then simplifies to
Let \(j^{\flat } \in [m]\) be the row such that
Note that such a \(j^{\flat }\) must exist because \(\sum _{j=1}^m \widetilde{A}_{j,i} \cdot p_j > 1+\beta \). It is clear that \(j^{\flat } \ge j^*\), owing to the definition that \(\widetilde{A}_{ji} \le A_{ji}\) for all \(i\in [n], j\in [m]\). Defining \(\delta ^{\flat } = \frac{\mu \beta }{2A_{j^{\flat },i}} \le \delta \), the objective decrease is lower bounded as
where the inequality is because \(\delta ^{\flat } \le \delta \) and \(\langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle \ge 1 \) for all \(\tau \le \delta \) (see the proof of part (a)).
Part I. To lower bound I, we use the monotonicity of \(p_j(\cdot )\) and obtain that
However, our choice of \(\delta ^{\flat } = \frac{\mu \beta }{2A_{j^{\flat },i}} \le \frac{\mu \beta }{2A_{j,i}} \) for all \(j\le j^{\flat }\) ensures that
Therefore, we obtain that
where the inequality is because \(\big (\frac{2}{3}-\frac{\beta }{2} \big ) \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \ge \frac{4-3\beta }{6} \cdot (1+\beta ) \ge \frac{2}{3}\) whenever \(\beta \le \frac{1}{3}\) (or equivalently, whenever \(\varepsilon \le 1/9\)).
Now, suppose that \(\sum _{j \le j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) - (1+\beta ) = b \cdot \widetilde{A}_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k)\) for some \(b \in [0,1]\). Note that we can do so by the very definition of \(j^{\flat }\). Then, we must have
Therefore, we conclude that
Above, the last inequality is because \(u^*_i \cdot \widetilde{A}_{j^{\flat },i} \le \langle \widetilde{A}_{j^{\flat } :}, u^* \rangle \le 2\) by our definition of \(\widetilde{A}\).
Part \(I'\). To lower bound \(I'\), consider every \(j> j^{\flat }\) and the integral
Note that whenever \(\tau \le \frac{\mu \beta }{2A_{j,i}} \le \frac{\mu \beta }{2A_{j^{\flat },i}} = \delta ^{\flat }\), we have that \(p_j(\mathsf {x}_k + \tau \mathbf {e}_i) \ge p_j(\mathsf {x}_k) \cdot e^{-\beta /2} \ge \frac{1}{2} p_j(\mathsf {x}_k)\). Therefore,
This implies a lower bound on \(I'\):
where again in the last inequality we have used \(u^*_i \cdot \widetilde{A}_{j^{\flat },i} \le \langle \widetilde{A}_{j^{\flat } :}, u^* \rangle \le 2\) by our definition of \(\widetilde{A}\).
Together. Combining the lower bounds on I and \(I'\), we obtain
\(\square \)
1.5 Proof of Lemma 3.3: Efficient Implementation of \(PacLPSolver\)
In this section, we illustrate how to implement each iteration of \(PacLPSolver\) to run in an expected O(N / n) time. We maintain the following quantities
throughout the algorithm, so as to ensure the following invariants are always satisfied
It is clear that when \(k=0\), letting \(\mathsf {az}_k = A \mathsf {z}_0\), \(\mathsf {y}'_k = \mathsf {y}_0\), \(\mathsf {ay}'_k = A\mathsf {y}_0\), \(B_{k,1}=0\), and \(B_{k,2}=1\), we can ensure that all the invariants are satisfied initially. We denote by \(\Vert A_{:i}\Vert _0\) the number of nonzero elements in the vector \(A_{:i}\). In each iteration \(k=1,2,\dots ,T\):
-
The step \(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\) does not need to be implemented.
-
The value \(\nabla _i f(\mathsf {x}_k)\) requires the knowledge of \(p_j(\mathsf {x}_k) = e^{\frac{1}{\mu } ((A\mathsf {x}_k)_j - 1)}\) for each j such that \(A_{ij}\ne 0\). Accordingly, for each j, we need to know the value
$$\begin{aligned} (A \mathsf {x}_k)_j= & {} \tau (A \mathsf {z}_{k-1})_j + (1-\tau ) (A \mathsf {y}_{k-1})_j \\= & {} \big (\tau + (1-\tau ) B_{k-1,1} \big ) \mathsf {az}_{k-1,j} + (1-\tau ) B_{k-1,2} \mathsf {ay}'_{k-1,j} . \end{aligned}$$This can be computed in O(1) time for each j, and \(O(\Vert A_{:i}\Vert _0)\) time in total.
-
Recall that the step \(\mathsf {z}_k \leftarrow {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) can be written as \(\mathsf {z}_{k} = \mathsf {z}_{k-1} + \delta \mathbf {e}_i\) for some \(\delta \in \mathbb {R}\) that can be computed in O(1) time (see Proposition 3.2). Observe also that \(\mathsf {z}_k = \mathsf {z}_{k-1} + \delta \mathbf {e}_i\) yields \(\mathsf {y}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{\delta \mathbf {e}_i}{n \alpha _k L}\) due to Line 6 and Line 10 of Algorithm 1. Therefore, we perform two explicit updates on \(\mathsf {z}_k\) and \(\mathsf {az}_k\) as
$$\begin{aligned}\mathsf {z}_k \leftarrow \mathsf {z}_{k-1} + \delta \mathbf {e}_i , \quad \mathsf {az}_k \leftarrow \mathsf {az}_{k-1} + \delta A_{:i}\end{aligned}$$and two implicit updates on \(\mathsf {y}_k\) as
$$\begin{aligned} \begin{array}{lll} &{}&{}B_{k,1} = \tau + (1-\tau ) B_{k-1,1} , B_{k,2} = (1-\tau ) B_{k-1,2} , \\ &{}&{}\mathsf {y}'_k \leftarrow \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) , \mathsf {ay}'_k \leftarrow \mathsf {ay}'_{k-1} + \delta A_{:i}\cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) \end{array} \end{aligned}$$It is not hard to verify that after these updates, \( A \mathsf {y}_k' = \mathsf {ay}'_k\) and we have
$$\begin{aligned}&B_{k,1} \cdot \mathsf {z}_k + B_{k,2} \cdot \mathsf {y}'_k = B_{k,1} \cdot \big (\mathsf {z}_{k-1} + \delta \mathbf {e}_i\big ) \\&\quad + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}_{k-1} + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}_{k-1} + B_{k,2} \cdot \mathsf {y}'_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} \\&\quad = \big (\tau + (1-\tau ) B_{k-1,1}\big ) \cdot \mathsf {z}_{k-1} + \big ((1-\tau ) B_{k-1,2}\big ) \cdot \mathsf {y}'_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} \\&\quad = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} = \mathsf {y}_k, \end{aligned}$$so the invariant \(\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k + B_{k,2} \cdot \mathsf {y}'_k \) also holds. In sum, after performing updates on \(A\mathsf {z}_k\) and \(\mathsf {ay}'_k\) in time \(O(\Vert A_{:i}\Vert _0)\), we can ensure that the invariants in (E.1) and (E.2) are satisfied at iteration k.
In sum, we only need \(O(\Vert A_{:i}\Vert _0)\) time to perform the updates in \(PacLPSolver\) for an iteration k if the coordinate i is selected. Therefore, each iteration of \(PacLPSolver\) can be implemented to run in an expected \(O(\mathbf {E}_i[\Vert A_{:i}\Vert _0]) = O(N/n)\) time.
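To make this bookkeeping concrete, the following is a minimal Python sketch (our illustration, not the authors' code) of one coordinate step. The step sizes \(\delta \) and \(\tau \) and the scalar \(\frac{1}{n \alpha _k L}\) are treated as given inputs, and a dense list-of-rows matrix stands in for sparse column access:

```python
def pac_coordinate_step(A, i, delta, tau, inv_nalphaL, state):
    """One iteration of the bookkeeping, maintaining the invariants
    az = A z,  ay' = A y',  y = B1 * z + B2 * y'.
    A is a dense list of rows here; with a sparse column format the
    two column scans below touch only the nonzeros of column i."""
    z, az, yp, ayp = state["z"], state["az"], state["yp"], state["ayp"]
    # explicit updates: z_k = z_{k-1} + delta * e_i, and az accordingly
    z[i] += delta
    for j, row in enumerate(A):
        if row[i] != 0:
            az[j] += delta * row[i]
    # implicit update of y, folded into the scalars B1, B2
    B1 = tau + (1 - tau) * state["B1"]
    B2 = (1 - tau) * state["B2"]
    # correction so that B1 * z_k + B2 * y'_k equals the new y_k
    corr = delta * (-B1 + inv_nalphaL) / B2
    yp[i] += corr
    for j, row in enumerate(A):
        if row[i] != 0:
            ayp[j] += corr * row[i]
    state["B1"], state["B2"] = B1, B2
    return state
```

One can check the invariant \(\mathsf {y}_k = B_{k,1} \mathsf {z}_k + B_{k,2} \mathsf {y}'_k\) against an explicit recomputation of \(\tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{\delta }{n\alpha _k L} \mathbf {e}_i\); only column i of A is touched, which is where the \(O(\Vert A_{:i}\Vert _0)\) cost comes from.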
1.6 Proof of Lemma 6.5: Efficient Implementation of \(CovLPSolver\)
In this section we illustrate how to implement each iteration of \(CovLPSolver\) to run in an expected O(N / n) time. We maintain the following quantities
throughout the algorithm, so as to maintain the following invariants
It is clear that when \(k=0\), letting \(\mathsf {z}_k' = \mathsf {z}_0\), \(\mathsf {sz}_k=1\), \(\mathsf {sumz}_k = {\mathbbm {1}}^{T} \mathsf {z}_0\), \(\mathsf {az}'_k = A\mathsf {z}_0\), \(\mathsf {y}'_k = \mathsf {y}_0\), \(\mathsf {ay}'_k = A\mathsf {y}_0\), \(B_{k,1}=0\), and \(B_{k,2}=1\), we can ensure that all the invariants are satisfied initially.
We denote by \(\Vert A_{:i}\Vert _0\) the number of nonzero elements in vector \(A_{:i}\). In each iteration \(k=1,2,\dots ,T\):
-
The step \(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\) does not need to be implemented.
-
The value \(p_j(\mathsf {x}_k) = e^{\frac{1}{\mu } (1 - (A\mathsf {x}_k)_j)}\) for each j only requires the knowledge of
$$\begin{aligned} (A \mathsf {x}_k)_j= & {} \tau (A \mathsf {z}_{k-1})_j + (1-\tau ) (A \mathsf {y}_{k-1})_j \\= & {} \big (\tau + (1-\tau ) B_{k-1,1} \big ) \frac{\mathsf {az}'_{k-1,j}}{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,2} \mathsf {ay}'_{k-1,j} . \end{aligned}$$This can be computed in O(1) time.
-
The value \(\nabla _i f(\mathsf {x}_k)\) requires the knowledge of \(p_j(\mathsf {x}_k)\) for each \(j\in [m]\) such that \(A_{ij}\ne 0\). Since we have \(\Vert A_{:i}\Vert _0\) such j’s, we can compute \(\nabla _i f(\mathsf {x}_k)\) in \(O(\Vert A_{:i}\Vert _0)\) time.
-
Letting \(\delta = (1+\gamma ) n \alpha _k \xi _{k,i}^{(i)}\), recall that the mirror step \(\mathsf {z}_k \leftarrow {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) has a very simple form (see Proposition 6.4): first multiply the i-th coordinate of \(\mathsf {z}_{k-1}\) by \(e^{-\delta }\) and then, if the sum of all coordinates has exceeded \(2\mathsf {OPT}'\), scale everything down so as to sum up to \(2\mathsf {OPT}'\). This can be implemented as follows: setting \(\delta _1 = \mathsf {z}'_{k-1,i} (e^{-\delta }-1)\),
$$\begin{aligned} \begin{array}{l} \mathsf {z}'_{k} \leftarrow \mathsf {z}'_{k-1} + \delta _1 \mathbf {e}_i , \quad \mathsf {az}'_k \leftarrow \mathsf {az}'_{k-1} + \delta _1 A_{:i}, \\ \mathsf {sumz}_{k} \leftarrow \mathsf {sumz}_{k-1} + \delta _1 , \quad \mathsf {sz}_k \leftarrow \mathsf {sz}_{k-1} \cdot \max \Big \{1, \frac{\mathsf {sumz}_k}{\mathsf {sz}_{k-1}\cdot 2\mathsf {OPT}'} \Big \} . \end{array} \end{aligned}$$These updates can be implemented to run in \(O(\Vert A_{:i}\Vert _0)\) time, and together they ensure that the invariants in (F.1) are satisfied at iteration k.
-
Recall that the gradient step is of the form \(\mathsf {y}_k \leftarrow \mathsf {x}_k + \delta _2 \cdot \mathbf {e}_i\) for some value \(\delta _2 \ge 0\). This value \(\delta _2\) can be computed in \(O(\Vert A_{:i}\Vert _0)\) time, since each \(p_j(\mathsf {x}_k)\) can be computed in O(1) time, and we can sort the rows of each column of A by preprocessing.
Since \(\mathsf {y}_k = \mathsf {x}_k + \delta _2 \cdot \mathbf {e}_i = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \delta _2 \mathbf {e}_i\), we can implement this update by letting
$$\begin{aligned} \begin{array}{l} B_{k,1} = \frac{\tau }{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,1} , B_{k,2} = (1-\tau ) B_{k-1,2} \\ \mathsf {y}'_k \leftarrow \mathsf {y}'_{k-1} + \mathbf {e}_i \cdot \left( - \frac{B_{k,1} \delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) , \mathsf {ay}'_k \leftarrow \mathsf {ay}'_{k-1} + A_{:i}\cdot \left( - \frac{B_{k,1} \delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) \end{array}\end{aligned}$$It is not hard to verify that after these updates, \(\mathsf {ay}'_k = A \mathsf {y}'_k\) and we have
$$\begin{aligned}&B_{k,1} \cdot \mathsf {z}'_k + B_{k,2} \cdot \mathsf {y}'_k = B_{k,1} \cdot \big (\mathsf {z}'_{k-1} + \delta _1 \mathbf {e}_i\big ) \\&\qquad + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \mathbf {e}_i \cdot \left( - \frac{B_{k,1}\delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}'_{k-1} + B_{k,2} \cdot \big (\mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i / B_{k,2} \big ) \\&\quad = B_{k,1} \cdot \mathsf {z}'_{k-1} + B_{k,2} \cdot \mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i \\&\quad = \big (\frac{\tau }{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,1}\big ) \cdot \mathsf {z}'_{k-1} + \big ((1-\tau ) B_{k-1,2}\big ) \cdot \mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i \\&\quad = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \delta _2 \mathbf {e}_i = \mathsf {y}_k , \end{aligned}$$so that the invariant \(\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k' + B_{k,2} \cdot \mathsf {y}'_k\) is also satisfied. In sum, after running time \(O(\Vert A_{:i}\Vert _0)\), we can ensure that the invariants in (F.2) are satisfied at iteration k.
In sum, we only need \(O(\Vert A_{:i}\Vert _0)\) time to perform the updates in \(CovLPSolver\) for an iteration k if the coordinate i is selected. Therefore, each iteration of \(CovLPSolver\) can be implemented to run in an expected \(O(\mathbf {E}_i[\Vert A_{:i}\Vert _0]) = O(N/n)\) time.
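The lazy rescaling in the mirror step deserves a short illustration: a naive truncation to the scaled simplex would touch all n coordinates, whereas folding the rescaling into the single scalar \(\mathsf {sz}\) costs O(1). Below is a minimal Python sketch (ours, not the authors' code), where cap plays the role of \(2\mathsf {OPT}'\) and the increment \(\delta \) is assumed given:

```python
import math

def cov_mirror_step(i, delta, cap, state):
    """Lazy simplex mirror step: the true iterate is z = z' / sz,
    so a global rescaling is a single scalar update to sz."""
    zp = state["zp"]
    # multiply coordinate i of z' by e^{-delta}; d1 is the additive change
    d1 = zp[i] * (math.exp(-delta) - 1.0)
    zp[i] += d1
    state["sumz"] += d1                      # sumz tracks sum(z')
    # if sum(z) = sumz / sz exceeds cap, fold the rescaling into sz
    state["sz"] *= max(1.0, state["sumz"] / (state["sz"] * cap))
    return state
```

After the update, \(\sum _i \mathsf {z}_{k,i} = \mathsf {sumz}_k / \mathsf {sz}_k\) is exactly cap whenever the rescaling fires, and \(\mathsf {sz}\) is unchanged otherwise; no coordinate of \(\mathsf {z}'\) other than the i-th is ever touched.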
Allen-Zhu, Z., Orecchia, L. Nearly linear-time packing and covering LP solvers. Math. Program. 175, 307–353 (2019). https://doi.org/10.1007/s10107-018-1244-x