
Nearly linear-time packing and covering LP solvers

Achieving width-independence and \(\varepsilon ^{-1}\)-convergence

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

Packing and covering linear programs (PC-LPs) form an important class of linear programs (LPs) across computer science, operations research, and optimization. Luby and Nisan (in: STOC, ACM Press, New York, 1993) constructed an iterative algorithm for approximately solving PC-LPs in nearly linear time, where the time complexity scales nearly linearly in N, the number of nonzero entries of the matrix, and polynomially in \(1/\varepsilon \), where \(\varepsilon \) is the (multiplicative) approximation error. Unfortunately, existing nearly linear-time algorithms (Plotkin et al. in Math Oper Res 20(2):257–301, 1995; Bartal et al., in: Proceedings 38th annual symposium on foundations of computer science, IEEE Computer Society, 1997; Young, in: 42nd annual IEEE symposium on foundations of computer science (FOCS’01), IEEE Computer Society, 2001; Koufogiannakis and Young in Algorithmica 70:494–506, 2013; Young in Nearly linear-time approximation schemes for mixed packing/covering and facility-location linear programs, 2014. arXiv:1407.3015; Allen-Zhu and Orecchia, in: SODA, 2015) for solving PC-LPs require time at least proportional to \(\varepsilon ^{-2}\). In this paper, we break this longstanding barrier by designing a packing LP solver that runs in time \(\widetilde{O}(N \varepsilon ^{-1})\) and a covering LP solver that runs in time \(\widetilde{O}(N \varepsilon ^{-1.5})\). Our packing solver can be extended to run in time \(\widetilde{O}(N \varepsilon ^{-1})\) for a class of well-behaved covering programs. In a follow-up work, Wang et al. (in: ICALP, 2016) showed that all covering LPs can be converted into well-behaved ones by a reduction that blows up the problem size only logarithmically.


Notes

  1. Luby and Nisan, who originally studied iterative solvers for this class of problems [24], dubbed them positive LPs. However, the class of LPs with non-negative constraint matrices is slightly larger, including mixed-packing-and-covering LPs. For this reason, we prefer to stick to the PC-LP terminology.

  2. Most width-dependent solvers study the minmax problem \(\min _{\begin{array}{c} x\ge 0 , {\mathbbm {1}}^{T} x = 1 \end{array}} \; \max _{\begin{array}{c} y\ge 0 , {\mathbbm {1}}^{T} y = 1 \end{array}} \; y^T A x ,\) whose optimal value equals \(1/\mathsf {OPT}\). Their approximation guarantees are often written in terms of additive error. We have translated their performances to multiplicative error for a clear comparison.

  3. Some of these solvers still have a \(\textsf {polylog}(\rho )\) dependence. Since each occurrence of \(\log (\rho )\) can be replaced with \(\log (nm)\) after slightly modifying the matrix A, we have done so in Table 1 for a fair comparison.

  4. This can be verified by observing that our objective \(f_\mu (x)\), to be introduced later, is not globally Lipschitz smooth, so that one cannot apply accelerated gradient descent directly.

  5. Due to space limitations, we only sketch why logarithmic word size suffices for our algorithms. On the one hand, one can prove that if, in an iteration, x is calculated with a small additive error \(1/\textsf {poly}(1/\varepsilon ,n,m)\), then the objective f(x) may increase by at most \(1/\textsf {poly}(1/\varepsilon ,n,m)\) in that iteration. The proof relies on the facts that (1) one can assume without loss of generality that all entries of A are at most \(\textsf {poly}(1/\varepsilon ,n,m)\), and (2) our algorithms ensure \(f(x) < \textsf {poly}(1/\varepsilon , n, m)\) in all iterations with high probability, so even though we use exponential functions, f(x) does not change additively by much. On the other hand, one can similarly prove that each \(\nabla _i f(x)\) can be calculated within additive error \(1/\textsf {poly}(1/\varepsilon ,n,m)\) in each iteration. Together, these imply that the total error incurred by arithmetic operations can be made negligible.

  6. If \(\min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\} = 0\) then the packing LP is unbounded so we are done. Otherwise, if \(\min _{i\in [n]}\{\Vert A_{:i}\Vert _{\infty }\} = v > 0\), we scale all entries of A by \(1/v\), and scale \(\mathsf {OPT}\) by v.

  7. Note that some of the previous results (such as [7, 31]) appear to directly minimize \(\sum _{j=1}^m e^{((Ax)_j - 1)/\mu }\) as opposed to its logarithm g(x). However, their per-iteration objective decrease is multiplicative, meaning it is essentially equivalent to performing a single gradient-descent step on g(x) with additive objective decrease.

  8. The exact same \(f_\mu (x)\) also appeared in our previous work [4], albeit without this smoothing interpretation and without the constraint \( x \in \varDelta _{\mathsf {box}}.\) The techniques in [4] only lead to \(\varepsilon ^{-2}\) convergence (see Table 1).

  9. A similar gradient truncation was developed in our prior work [4], but for a different purpose (to ensure parallelism) and not applied to coordinate gradient. The truncation idea of this paper also inspired later works in matrix scaling [2] and in SDP [1].

  10. If \(\min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\} = 0\) then the covering LP is infeasible so we are done. Otherwise, if \(\min _{j\in [m]}\{\Vert A_{j :}\Vert _{\infty }\} = v > 0\), we scale all entries of A by \(1/v\), and scale \(\mathsf {OPT}\) by v.

  11. The constant 9 in this section can be replaced with any other constant greater than 1.

  12. This negative width technique is related to [7, Definition 3.2], where the authors analyze the multiplicative weight update method in a special case when the oracle returns loss values only in \([-\ell , +\rho ]\), for some \(\ell \ll \rho \). This technique is also a sub-case of a more general theory of mirror descent, known as the local-norm convergence, that we have summarized in a separate and later paper [3].

  13. We wish to point out that this proof coincides with a lemma from the accelerated coordinate descent theory of Fercoq and Richtárik [17]. Their paper is about optimizing an objective function that is Lipschitz smooth, and thus irrelevant to our work.

  14. This is because, our parameter choices ensure that \((1+\gamma )\alpha _k n < 1/2\beta \), which further means \(-(1+\gamma )\alpha _k n\xi _{k,i}^{(i)} \le 1/2\). As a result, we must have \(\mathsf {z}_{k,i}^{(i)} \le \mathsf {z}_{k-1,i} \cdot e^{0.5} < 2 \mathsf {z}_{k-1,i}\) (see the explicit definition of the mirror step at Proposition 6.4).

References

  1. Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In: SODA (2016)

  2. Allen-Zhu, Z., Li, Y., Oliveira, R., Wigderson, A.: Much faster algorithms for matrix scaling. In: FOCS (2017). arXiv:1704.02315

  3. Allen-Zhu, Z., Liao, Z., Orecchia, L.: Spectral sparsification and regret minimization beyond multiplicative updates. In: STOC (2015)

  4. Allen-Zhu, Z., Orecchia, L.: Using optimization to break the epsilon barrier: a faster and simpler width-independent algorithm for solving positive linear programs in parallel. In: SODA (2015)

  5. Allen-Zhu, Z., Orecchia, L.: Linear coupling: an ultimate unification of gradient and mirror descent. In: ITCS (2017)

  6. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. In: ICML (2016)

  7. Arora, S., Hazan, E., Kale, S.: The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8, 121–164 (2012)

  8. Awerbuch, B., Khandekar, R.: Stateless distributed gradient descent for positive linear programs. In: STOC (2008)

  9. Awerbuch, B., Khandekar, R., Rao, S.: Distributed algorithms for multicommodity flow problems via approximate steepest descent framework. ACM Trans. Algorithms 9(1), 1–14 (2012)

  10. Bartal, Y., Byers, J.W., Raz, D.: Global optimization using local information with applications to flow control. In: Proceedings 38th Annual Symposium on Foundations of Computer Science, pp. 303–312. IEEE Computer Society (1997)

  11. Bartal, Y., Byers, J.W., Raz, D.: Fast, distributed approximation algorithms for positive linear programming with applications to flow control. SIAM J. Comput. 33(6), 1261–1279 (2004)

  12. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization. Soc. Ind. Appl. Math. 315–341 (2013)

  13. Bienstock, D., Iyengar, G.: Faster approximation algorithms for packing and covering problems. Technical report, Columbia University, September 2004. Preliminary version published in STOC ’04

  14. Byers, J., Nasser, G.: Utility-based decision-making in wireless sensor networks. In: 2000 First Annual Workshop on Mobile and Ad Hoc Networking and Computing (MobiHOC), pp. 143–144. IEEE (2000)

  15. Chudak, F.A., Eleutério, V. : Improved approximation schemes for linear programming relaxations of combinatorial optimization problems. In: Proceedings of the 11th International IPCO Conference on Integer Programming and Combinatorial Optimization, pp. 81–96 (2005)

  16. Duan, R., Pettie, S.: Linear-time approximation for maximum weight matching. J. ACM 61(1), 1–23 (2014)

  17. Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  18. Fleischer, L.K.: Approximating fractional multicommodity flow independent of the number of commodities. SIAM J. Discrete Math. 13(4), 505–520 (2000)

  19. Garg, N., Könemann, J.: Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM J. Comput. 37(2), 630–652 (2007)

  20. Grigoriadis, M.D., Khachiyan, L.G.: Fast approximation schemes for convex programs with many blocks and coupling constraints. SIAM J. Optim. 4(1), 86–107 (1994)

  21. Jain, R., Ji, Z., Upadhyay, S., Watrous, J.: QIP = PSPACE. J. ACM 58(6), 30 (2011)

  22. Klein, P., Young, N.E.: On the number of iterations for Dantzig–Wolfe optimization and packing-covering approximation algorithms. SIAM J. Comput. 44(4), 1154–1172 (2015)

  23. Koufogiannakis, C., Young, N.E.: A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica 70, 494–506 (2013). (Previously appeared in FOCS ’07)

  24. Luby, M., Nisan, N.: A parallel approximation algorithm for positive linear programming. In: STOC, pp. 448–457. ACM Press, New York (1993)

  25. Madry, A.: Faster approximation schemes for fractional multicommodity flow problems via dynamic graph algorithms. In: STOC. ACM Press, New York (2010)

  26. Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  27. Nesterov, Y.: Rounding of convex sets and efficient gradient methods for linear programming problems. Optim. Methods Softw. 23(1), 109–128 (2008)

  28. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 269, 543–547 (1983)

  29. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  30. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  31. Plotkin, S.A., Shmoys, D.B., Tardos, É.: Fast approximation algorithms for fractional packing and covering problems. Math. Oper. Res. 20(2), 257–301 (1995). (conference version published in FOCS 1991)

  32. Trevisan, L.: Parallel approximation algorithms by positive linear programming. Algorithmica 21(1), 72–88 (1998)

  33. Wang, D., Mahoney, M., Mohan, N., Rao, S.: Faster parallel solver for positive linear programs via dynamically-bucketed selective coordinate descent. ArXiv e-prints arXiv:1511.06468 (2015)

  34. Wang, D., Rao, S., Mahoney, M.W.: Unified acceleration method for packing and covering problems via diameter reduction. In: ICALP (2016)

  35. Young, N.E.: Sequential and parallel algorithms for mixed packing and covering. In: 42nd Annual IEEE Symposium on Foundations of Computer Science (FOCS’01), pp. 538–546. IEEE Computer Society (2001)

  36. Young, N.E.: Nearly linear-time approximation schemes for mixed packing/covering and facility-location linear programs. ArXiv e-prints arXiv:1407.3015 (2014)

  37. Zurel, E., Nisan, N.: An efficient approximate allocation algorithm for combinatorial auctions. In: Proceedings of the 3rd ACM Conference on Electronic Commerce, pp. 125–136. ACM (2001)


Corresponding author

Correspondence to Zeyuan Allen-Zhu.

Additional information

An earlier version of this paper appeared on arXiv http://arxiv.org/abs/1411.1124 in November 2014. A 6-page abstract of this paper, “Nearly-Linear Time Positive LP Solver with Faster Convergence Rate,” excluding Sects. 4, 5 and 6, and excluding all the proofs in Sects. 2 and 3, was presented at the STOC 2015 conference in Portland, OR.

Appendix

1.1 Proof of Lemma 3.6

Lemma 3.6

We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).

Proof

This is true at the beginning as \(\mathsf {x}_0 = \mathsf {y}_0 = x^{\mathsf {start}}\in \varDelta _{\mathsf {box}}\) (see Fact 2.8) and \(\mathsf {z}_0 = 0 \in \varDelta _{\mathsf {box}}\). In fact, it suffices for us to show that for every \(k\ge 1\), \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) for some scalars \(\gamma _k^l\) satisfying \(\sum _l \gamma _k^l = 1\) and \(\gamma _k^l \ge 0\) for each \(l = 0,\dots ,k\). If this is true, we can prove the lemma by induction: at each iteration \(k \ge 1\),

  1. \(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau )\mathsf {y}_{k-1}\) must be in \(\varDelta _{\mathsf {box}}\) because \(\mathsf {y}_{k-1}\) and \(\mathsf {z}_{k-1}\) are and \(\tau \in [0,1]\),

  2. \(\mathsf {z}_k\) is in \(\varDelta _{\mathsf {box}}\) by the definition that \(\mathsf {z}_k = {\text {arg min}}_{z\in \varDelta _{\mathsf {box}}}\{\cdots \}\), and

  3. \(\mathsf {y}_k\) is also in \(\varDelta _{\mathsf {box}}\) because \(\mathsf {y}_k= \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) is a convex combination of the \(\mathsf {z}_l\)’s and \(\varDelta _{\mathsf {box}}\) is convex.

For the rest of the proof, we show that \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) for every \(k\ge 1\), with the following coefficients (see Footnote 13):

$$\begin{aligned} \gamma _{k}^l = \left\{ \begin{array}{ll} (1-\tau )\gamma _{k-1}^l, &{}\quad l = 0,\ldots ,k-2; \\ \big (\frac{1}{n \alpha _{k-1} L} - \frac{1}{n \alpha _k L}\big ) + \tau \big (1 - \frac{1}{n \alpha _{k-1} L}\big ), &{}\quad l=k-1; \\ \frac{1}{n\alpha _k L}, &{}\quad l=k. \end{array} \right. \end{aligned}$$

This is true at the base case \(k=1\) because \(\mathsf {y}_1 = \mathsf {x}_1 + \frac{1}{n\alpha _1 L}(\mathsf {z}_1 - \mathsf {z}_0) = \frac{1}{n\alpha _1 L} \mathsf {z}_1 + \big (1-\frac{1}{n\alpha _1 L}\big ) \mathsf {z}_0\). For the general \(k\ge 2\), we have

$$\begin{aligned} \mathsf {y}_k&= \mathsf {x}_k + \frac{1}{n \alpha _k L}(\mathsf {z}_k - \mathsf {z}_{k-1}) \\&= \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{1}{n \alpha _k L}(\mathsf {z}_k - \mathsf {z}_{k-1}) \\&= \tau \mathsf {z}_{k-1} + (1-\tau ) \left( \sum _{l=0}^{k-2} \gamma _{k-1}^l \mathsf {z}_l + \frac{1}{n \alpha _{k-1} L} \mathsf {z}_{k-1}\right) + \frac{1}{n \alpha _k L}(\mathsf {z}_k - \mathsf {z}_{k-1}) \\&= \left( \sum _{l=0}^{k-2} (1-\tau ) \gamma _{k-1}^l \mathsf {z}_l \right) + \bigg (\left( \frac{1}{n \alpha _{k-1} L} - \frac{1}{n \alpha _k L}\right) + \tau \left( 1 - \frac{1}{n \alpha _{k-1} L}\right) \bigg ) \mathsf {z}_{k-1} \\&\quad + \frac{1}{n\alpha _k L} \mathsf {z}_k . \end{aligned}$$

Therefore, we obtain \(\mathsf {y}_k = \sum _{l=0}^k \gamma _k^l \mathsf {z}_l\) as desired.

It is now easy to check that, under our definition of \(\alpha _k\) (which satisfies \(\alpha _k\ge \alpha _{k-1}\) and \(\alpha _k\ge \alpha _0 = \frac{1}{nL}\)), we must have \(\gamma _k^l \ge 0\) for all k and l. Also,

$$\begin{aligned} \sum _l \gamma _k^l&= \sum _{l=0}^{k-2} (1-\tau ) \gamma _{k-1}^l + \bigg (\left( \frac{1}{n \alpha _{k-1} L} - \frac{1}{n \alpha _k L}\right) + \tau \left( 1 - \frac{1}{n \alpha _{k-1} L}\right) \bigg ) + \frac{1}{n\alpha _k L} \\&= (1-\tau ) \left( 1-\frac{1}{n\alpha _{k-1}L}\right) + \bigg (\left( \frac{1}{n \alpha _{k-1} L} - \frac{1}{n \alpha _k L}\right) + \tau \left( 1 - \frac{1}{n \alpha _{k-1} L}\right) \bigg ) \\&\quad + \frac{1}{n\alpha _k L} = 1 . \end{aligned}$$

\(\square \)
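As a quick sanity check on this recursion, the following short script (a sketch, not part of the paper) rebuilds the coefficients \(\gamma _k^l\) from the formula above for an arbitrary non-decreasing sequence \(\alpha _k\) with \(\alpha _0 = \frac{1}{nL}\), and verifies numerically that they stay non-negative and sum to one. The concrete values of n, L, \(\tau \), and the growth rate of \(\alpha _k\) are illustrative placeholders.

```python
# Numeric sanity check of the coefficient recursion in Lemma 3.6 (a sketch).
# n, L, tau and the growth of alpha[k] are illustrative placeholders; the proof
# only needs alpha to be non-decreasing with alpha[0] = 1/(n*L) and tau in [0, 1].
n, L, tau, T = 5, 4.0, 0.3, 50
alpha = [1.0 / (n * L)]
for _ in range(T):
    alpha.append(alpha[-1] * 1.05)                        # any non-decreasing choice

gamma = {0: [1.0]}                                        # seed only; the claim is for k >= 1
for k in range(1, T + 1):
    prev = gamma[k - 1]
    cur = [(1 - tau) * g for g in prev[:k - 1]]           # l = 0, ..., k-2
    cur.append((1 / (n * alpha[k - 1] * L) - 1 / (n * alpha[k] * L))
               + tau * (1 - 1 / (n * alpha[k - 1] * L)))  # l = k-1
    cur.append(1 / (n * alpha[k] * L))                    # l = k
    gamma[k] = cur
    assert all(g >= -1e-12 for g in cur)                  # non-negativity
    assert abs(sum(cur) - 1.0) < 1e-9                     # convex combination
print("coefficient recursion verified for", T, "iterations")
```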

1.2 Proof of Proposition 4.5

Proposition 4.5

  (a) \(f_\mu (u^*) \le (1+\varepsilon )\mathsf {OPT}\) for \(u^* {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}(1+\varepsilon /2) x^*\).

  (b) \(f_\mu (x) \ge (1-\varepsilon )\mathsf {OPT}\) for every \(x \ge 0\).

  (c) For any \(x \ge 0\) satisfying \(f_\mu (x) \le 2\mathsf {OPT}\), we must have \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).

  (d) If \(x\ge 0\) satisfies \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\) for some \(\delta \in [0,1]\), then \(\frac{1}{1-\varepsilon } x\) is a \(\frac{1+\delta }{1-\varepsilon }\)-approximate solution to the covering LP.

Proof

  (a) We have \({\mathbbm {1}}^{T} u^* = (1+\varepsilon /2)\mathsf {OPT}\) by the definition of \(\mathsf {OPT}\). Also, from the feasibility constraint \(A x^* \ge {\mathbbm {1}}\) in the covering LP, we have \(A u^* - {\mathbbm {1}}\ge \varepsilon /2 \cdot {\mathbbm {1}}\), and can compute \(f_\mu (u^*)\) as follows:

    $$\begin{aligned} f_\mu (u^*) = \mu \sum _j e^{\frac{1}{\mu } (1 - (A u^*)_j)} + {\mathbbm {1}}^{T} u^*&\le \mu \sum _j e^{\frac{-\varepsilon /2}{\mu }} + (1+\varepsilon /2)\mathsf {OPT}\\&\le \frac{\mu m}{(nm)^2} + (1+\varepsilon /2)\mathsf {OPT}\le (1+\varepsilon )\mathsf {OPT}. \end{aligned}$$
  (b) Suppose towards contradiction that \(f_\mu (x) < (1-\varepsilon )\mathsf {OPT}\). Since \(f_\mu (x) <\mathsf {OPT}\le m\), we must have, for every \(j\in [m]\), that \(e^{\frac{1}{\mu }(1-(Ax)_j)} \le f_\mu (x) / \mu \le m / \mu \). This further implies \((Ax)_j \ge 1-\varepsilon \) by the definition of \(\mu \). In other words, \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\). By the definition of \(\mathsf {OPT}\), we must then have \({\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), so \(f_\mu (x) \ge {\mathbbm {1}}^{T} x \ge (1-\varepsilon )\mathsf {OPT}\), giving a contradiction.

  (c) To show \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\), we can assume that \(v = \max _j (1 - (Ax)_j) > \varepsilon \), because otherwise we are done. Under this definition, we have

    $$\begin{aligned} \textstyle f_\mu (x) \ge \mu e^{\frac{v}{\mu }} = \mu \big ((\frac{nm}{\varepsilon })^4\big )^{v/\varepsilon } \ge \frac{\varepsilon }{4\log (nm/\varepsilon )} (\frac{nm}{\varepsilon })^4 \gg 2\mathsf {OPT}, \end{aligned}$$

    contradicting our assumption that \(f_\mu (x) \le 2\mathsf {OPT}\). Therefore, we must have \(v \le \varepsilon \), that is, \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\).

  (d) For any x satisfying \(f_\mu (x) \le (1+\delta )\mathsf {OPT}\le 2 \mathsf {OPT}\), owing to Proposition 4.5c, we first have that x is approximately feasible, i.e., \(Ax \ge (1-\varepsilon ){\mathbbm {1}}\). Next, because \({\mathbbm {1}}^{T} x \le f_\mu (x) \le (1+\delta )\mathsf {OPT}\), we know that x yields an objective \({\mathbbm {1}}^{T} x \le (1+\delta )\mathsf {OPT}\). Letting \(x' = \frac{1}{1-\varepsilon } x\), we have both that \(x'\) is feasible (i.e., \(Ax' \ge {\mathbbm {1}}\)) and that \(x'\) has an objective \({\mathbbm {1}}^{T} x'\) of at most \(\frac{1+\delta }{1-\varepsilon } \mathsf {OPT}\).

\(\square \)
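For concreteness, the smoothed covering objective analyzed in this proposition, \(f_\mu (x) = \mu \sum _j e^{\frac{1}{\mu }(1 - (Ax)_j)} + {\mathbbm {1}}^{T} x\), can be evaluated as in the sketch below. The choice \(\mu = \varepsilon /(4\log (nm/\varepsilon ))\) is read off from the exponent used in part (c); the random instance and the feasible point are illustrative placeholders, used only to see the part (a)-style bound in action.

```python
import numpy as np

def f_mu(A, x, mu):
    """Smoothed covering objective f_mu(x) = mu * sum_j exp((1 - (Ax)_j)/mu) + 1^T x."""
    slack = 1.0 - A @ x                        # positive entries = violated covering constraints
    return mu * np.exp(slack / mu).sum() + x.sum()

rng = np.random.default_rng(0)
m, n, eps = 8, 6, 0.1
A = rng.uniform(0.1, 1.0, size=(m, n))        # placeholder covering matrix (all entries positive)
mu = eps / (4 * np.log(n * m / eps))          # smoothing parameter matching part (c)

x_feas = np.ones(n) / (A @ np.ones(n)).min()  # scaled so that A @ x_feas >= 1 (feasible)
u = (1 + eps / 2) * x_feas                    # the scaled-up point used in part (a)

# The exponential penalty at u is negligible, so f_mu(u) is essentially
# (1 + eps/2) * 1^T x_feas, comfortably below the (1 + eps) * 1^T x_feas bound.
print(f_mu(A, u, mu), (1 + eps) * x_feas.sum())
```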

1.3 Missing proofs for Sect. 5

In this section we prove Theorem 5.3. Because the proof structure is almost identical to that of Theorem 3.4, we mostly point out the differences rather than repeat the proofs. The following three lemmas are identical to the ones in the packing LP case, so we restate them below:

Lemma C.1

(cf. Lemma 3.3) Each iteration of \(CovLPSolver^{\mathsf {wb}}\) can be implemented to run in expected O(N / n) time.

Lemma C.2

(cf. Lemma 3.6) We have \(\mathsf {x}_k,\mathsf {y}_k,\mathsf {z}_k\in \varDelta _{\mathsf {box}}\) for all \(k=0,1,\dots ,T\).

Lemma C.3

(cf. Lemma 3.7) For every \(u\in \varDelta _{\mathsf {box}}\), it satisfies \(\big \langle n\alpha _k \xi _{k}^{(i)}, \mathsf {z}_{k-1} - u \big \rangle \le n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 .\)

For the gradient descent guarantee of Sect. 3.3, one can first note that Lemma 2.7 remains true: this can be verified by replacing \(\nabla _i f_\mu (x) + 1\) in its proof with \(1 - \nabla _i f_\mu (x)\). For this reason, Lemma 3.9 (which is built on Lemma 2.7) also remains true. We state it below:

Lemma C.4

(cf. Lemma 3.9) We have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{1}{2} \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)}\rangle \ge 0\).

Putting it all together. Denote by \(\eta _{k}^{(i)} \in \mathbb {R}_{\le 0}^n \) the vector that is non-zero only at coordinate i, and satisfies \(\eta _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) - \xi _{k,i}^{(i)} \in (-\infty , 0]\). In other words, the full gradient

$$\begin{aligned} \nabla f_\mu (\mathsf {x}_k) = \mathbf {E}_i[(0,\dots ,n \nabla _i f_\mu (\mathsf {x}_k),\dots ,0)] = \mathbf {E}_i[ n \eta _{k}^{(i)} + n \xi _k^{(i)} ] \end{aligned}$$

can be (in expectation) decomposed into a large but non-positive component \(\eta _k^{(i)} \in (-\infty , 0]^n\) and a small component \(\xi _k^{(i)} \in [-1,1]^n\). As in Sect. 3.4, for any \(u \in \varDelta _{\mathsf {box}}\), we can use a basic convexity argument and the mirror descent lemma to compute that

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1}-u \big \rangle \nonumber \\&\quad = \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {z}_{k-1} \big \rangle + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{=} \frac{(1-\tau ) \alpha _k}{\tau } \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {y}_{k-1} - \mathsf {x}_k \big \rangle \nonumber \\&\qquad + \mathbf {E}_{i} \left[ \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + \big \langle n \alpha _k\xi _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle \right] \end{aligned}$$
(C.1)
$$\begin{aligned}&\quad \overset{\textcircled {{\small 2}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\qquad + \mathbf {E}_{i} \Big [\; \boxed {\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle } \nonumber \\&\qquad + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k^{(i)} - u\Vert _A^2 \Big ] \end{aligned}$$
(C.2)

Above, \(\textcircled {{\small 1}}\) is because \(\mathsf {x}_{k} = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\), which implies that \(\tau (\mathsf {x}_k - \mathsf {z}_{k-1}) = (1-\tau ) (\mathsf {y}_{k-1} - \mathsf {x}_k)\). \(\textcircled {{\small 2}}\) uses convexity and Lemma C.3. We can establish the following lemma to upper bound the boxed term in (C.2). Its proof is in the same spirit as that of Lemma 3.10, and it is the only place where we require all vectors to reside in \(\varDelta _{\mathsf {box}}\).

Lemma C.5

(cf. Lemma 3.10) For every \(u \in \varDelta _{\mathsf {box}}\),

$$\begin{aligned} \big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 21n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})). \end{aligned}$$

Proof of Lemma C.5

Now there are three possibilities:

  • If \(\eta _{k,i}^{(i)}=0\), then we must have \(\xi _{k,i}^{(i)} = \nabla _i f_\mu (\mathsf {x}_k) \in [-1,1]\). Lemma C.4 implies

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad = n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \le 2 n^2 \alpha _k^2 L \cdot ( f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) ) \end{aligned}$$
  • If \(\eta _{k,i}^{(i)} < 0\) and \(\mathsf {z}_{k,i}^{(i)} < \frac{10}{\Vert A_{:i}\Vert _{\infty }}\) (thus \(\mathsf {z}_{k}^{(i)}\) is not on the boundary of \(\varDelta _{\mathsf {box}}\)), then we precisely have \(\mathsf {z}_{k,i}^{(i)} = \mathsf {z}_{k-1,i} + \frac{n\alpha _k}{\Vert A_{:i}\Vert _{\infty }}\), and accordingly \(\mathsf {y}_{k,i}^{(i)} = \mathsf {x}_{k,i} + \frac{1}{L \Vert A_{:i}\Vert _{\infty }} > \mathsf {x}_{k,i}\). In this case,

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{-10}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{<} n \alpha _k \cdot \nabla _i f_\mu (\mathsf {x}_k) \cdot \frac{-10}{\Vert A_{:i}\Vert _{\infty }} + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} 10 n \alpha _k L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }\big ( 20 n \alpha _k L + 2 n^2 \alpha _k^2 L \big ) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\) follows from the fact that \(\mathsf {z}_{k-1},u \in \varDelta _{\mathsf {box}}\), and therefore \(\mathsf {z}_{k-1,i}\ge 0\) and \(u_i \le \frac{10}{\Vert A_{:i}\Vert _{\infty }}\) by the definition of \(\varDelta _{\mathsf {box}}\), and \(u\ge 0\); \(\textcircled {{\small 2}}\) follows from the fact that \(\mathsf {x}_k\) and \(\mathsf {y}_k^{(i)}\) are only different at coordinate i, and \(\xi _{k,i}^{(i)}=-1 > \nabla _i f_\mu (\mathsf {x}_k)\) (since \(\eta _{k,i}^{(i)}<0\)); \(\textcircled {{\small 3}}\) follows from the fact that \(\mathsf {y}_{k}^{(i)} = \mathsf {x}_{k} + \frac{\mathbf {e}_i}{L \Vert A_{:i}\Vert _{\infty }}\); and \(\textcircled {{\small 4}}\) uses Lemma C.4.

  • If \(\eta _{k,i}^{(i)} < 0\) and \(\mathsf {z}_{k,i}^{(i)} = \frac{10}{\Vert A_{:i}\Vert _{\infty }}\), then we have

    $$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 1}}}{\le }\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}- \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 2}}}{\le }\big \langle n \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {z}_{k-1} - \mathsf {z}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 3}}}{=} n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k - \mathsf {y}_k^{(i)} \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\quad \overset{\textcircled {{\small 4}}}{\le }4 n^2 \alpha _k^2 L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

    Above, \(\textcircled {{\small 1}}\) is because \(u_i \le \frac{10}{\Vert A_{:i}\Vert _{\infty }} = \mathsf {z}_{k,i}^{(i)}\) and \(\eta _{k,i}^{(i)} < 0\), together with \(\nabla _i f_\mu (\mathsf {x}_k) < \xi _{k,i}^{(i)}\) and \(\mathsf {x}_{k,i} \le \mathsf {y}_{k,i}^{(i)}\); \(\textcircled {{\small 2}}\) uses \(\nabla _i f_\mu (\mathsf {x}_k) = \eta _{k,i}^{(i)} - 1 < \eta _{k,i}^{(i)}\) and \(\mathsf {z}_{k,i}^{(i)} \ge \mathsf {z}_{k-1,i}\); \(\textcircled {{\small 3}}\) is from our choice of \(\mathsf {y}_k\), which satisfies \(\mathsf {z}_{k-1} -\mathsf {z}_k^{(i)} = n \alpha _k L (\mathsf {x}_k - \mathsf {y}_k^{(i)})\); and \(\textcircled {{\small 4}}\) uses Lemma C.4.

Combining the three cases, and using the fact that \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), we conclude that

$$\begin{aligned}&\big \langle n \alpha _k \eta _{k}^{(i)}, \mathsf {z}_{k-1}-u \big \rangle + n^2 \alpha _k^2 L \cdot \big \langle \xi _{k}^{(i)}, \mathsf {x}_{k} - \mathsf {y}_k^{(i)} \big \rangle \\&\le (20n \alpha _k L + 4n^2 \alpha _k^2 L) \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) \\&\le 21 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) . \end{aligned}$$

Above, the last inequality uses our choice of \(\alpha _k\), which implies \(n \alpha _k \le n \alpha _T = \frac{1}{\varepsilon L} \le \frac{1}{4}\).

Plugging Lemma C.5 back into (C.2), we have

$$\begin{aligned}&\alpha _k (f_\mu (\mathsf {x}_k) - f_\mu (u)) \le \big \langle \alpha _k \nabla f_\mu (\mathsf {x}_k), \mathsf {x}_k-u \big \rangle \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{\le }\frac{(1-\tau )\alpha _k}{\tau } (f_\mu (\mathsf {y}_{k-1})-f_\mu (\mathsf {x}_k)) \nonumber \\&\qquad + \mathbf {E}_{i} \Big [ 21 n \alpha _k L \cdot (f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k - u\Vert _A^2 \Big ] \nonumber \\&\quad \overset{\textcircled {{\small 2}}}{\le }\alpha _k f_\mu (\mathsf {x}_k) + \big (21n\alpha _k L - \alpha _k \big ) f_\mu (\mathsf {y}_{k-1}) \nonumber \\&\qquad + \mathbf {E}_{i} \Big [ -21 n \alpha _k L \cdot f_\mu (\mathsf {y}_k^{(i)}) + \frac{1}{2}\Vert \mathsf {z}_{k-1} - u\Vert _A^2 - \frac{1}{2}\Vert \mathsf {z}_k - u\Vert _A^2 \Big ] . \end{aligned}$$
(C.3)

Above, \(\textcircled {{\small 1}}\) uses Lemma C.5; and \(\textcircled {{\small 2}}\) is because we have chosen \(\tau \) to satisfy \(\frac{1}{\tau } = 21 n L \).

Next, recall that we have picked \(\alpha _{k}\) so that \((21n L - 1) \alpha _k = 21n L \cdot \alpha _{k-1}\) in \(CovLPSolver^{\mathsf {wb}}\). Telescoping (C.3) for \(k=1,\dots ,T\) and choosing \(u^*=(1+\varepsilon /2) x^*\), we have

$$\begin{aligned} - \sum _{k=1}^T \alpha _k f_\mu (u^*) \le 21 f_\mu (\mathsf {y}_0) - 21 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + \Vert \mathsf {z}_0 - u^*\Vert _A^2 \le - 21 n\alpha _T L \cdot \mathbf {E}[f_\mu (\mathsf {y}_T)] + 75\mathsf {OPT}. \end{aligned}$$

Here, the second inequality is due to \(f_\mu (\mathsf {y}_0) = f_\mu (x^{\mathsf {start}}) \le 3\mathsf {OPT}\) from Fact 5.2, and the fact that

$$\begin{aligned}&\Vert \mathsf {z}_0 - u^*\Vert _A^2 = \Vert u^*\Vert _A^2 = \sum _{i=1}^n (u^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\le (1+\varepsilon /2)^2\sum _{i=1}^n (x^*_i)^2 \cdot \Vert A_{:i}\Vert _{\infty }\\&\quad \le 10(1+\varepsilon /2)^2 \sum _{i=1}^n x^*_i < 12\mathsf {OPT}. \end{aligned}$$

Finally, using the fact that \(\sum _{k=1}^T \alpha _k = \alpha _T \cdot \sum _{k=0}^{T-1} \big (1 - \frac{1}{21 nL}\big )^k = 21 n \alpha _T L \big (1-(1-\frac{1}{21 nL})^T\big )\), we rearrange and obtain that

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le \frac{\sum _k \alpha _k}{21 n\alpha _T L} f_\mu (u^*) + \frac{75}{21 n\alpha _T L} \mathsf {OPT}=&\, \big (1-(1-\frac{1}{21 nL})^T\big ) f_\mu (u^*) \nonumber \\&+ \frac{75}{21 n\alpha _T L} \mathsf {OPT}. \end{aligned}$$

We choose \(T = \lceil 21 n L \log (1/\varepsilon ) \rceil \) so that \(\frac{1}{n\alpha _T L} = (1-\frac{1}{21 n L})^T \le \varepsilon \). Combining this with the fact that \(f_\mu (u^*)\le (1+\varepsilon )\mathsf {OPT}\) (see Proposition 4.5a), we obtain

$$\begin{aligned} \mathbf {E}[f_\mu (\mathsf {y}_T)] \le (1+\varepsilon )\mathsf {OPT}+ 3.6\varepsilon \cdot \mathsf {OPT}< (1+4.6\varepsilon )\mathsf {OPT}. \end{aligned}$$

Therefore, we have finished proving Theorem 5.3. \(\square \)
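The parameter choices used at the end of this proof are easy to verify numerically. The sketch below (with illustrative placeholder values for n, L, and \(\varepsilon \)) runs the recursion \((21nL-1)\alpha _k = 21nL\cdot \alpha _{k-1}\) starting from \(\alpha _0 = \frac{1}{nL}\) and checks that, after \(T = \lceil 21 n L \log (1/\varepsilon )\rceil \) iterations, \(\frac{1}{n\alpha _T L} = (1-\frac{1}{21nL})^T\) indeed drops below \(\varepsilon \) (reading \(\log \) as the natural logarithm).

```python
import math

# Placeholder parameters; only the relations between them matter here.
n, L, eps = 50, 10.0, 0.05
T = math.ceil(21 * n * L * math.log(1 / eps))

alpha = 1.0 / (n * L)                        # alpha_0 = 1/(nL)
for _ in range(T):
    alpha *= 21 * n * L / (21 * n * L - 1)   # (21nL - 1) * alpha_k = 21nL * alpha_{k-1}

print(1.0 / (n * alpha * L))                 # equals (1 - 1/(21nL))^T
print((1 - 1 / (21 * n * L)) ** T, "<=", eps)
```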

1.4 Missing proofs for Sect. 6

Proposition 6.4

If \(\mathsf {z}_{k-1}\in \varDelta _{\mathsf {simplex}}\) and \(\mathsf {z}_{k-1}>0\), the minimizer \(z = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) for any scalar \(\delta \in \mathbb {R}\) and basis vector \(\mathbf {e}_i\) can be computed as follows:

  1. \(z \leftarrow \mathsf {z}_{k-1}\).

  2. \(z_i \leftarrow z_i \cdot e^{-\delta }\).

  3. If \({\mathbbm {1}}^{T} z > 2\mathsf {OPT}'\), \(z \leftarrow \frac{2\mathsf {OPT}'}{{\mathbbm {1}}^{T} z} z\).

  4. Return z.

Proof

Let us denote by z the returned value of the described procedure, and let \(g(u) {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}V_{\mathsf {z}_{k-1}}(u) + \langle \delta \mathbf {e}_i, u \rangle \). Since \(\varDelta _{\mathsf {simplex}}\) is a convex body and \(g(\cdot )\) is convex, to show \(z = {\text {arg min}}_{u \in \varDelta _{\mathsf {simplex}}} \{g(u)\}\), it suffices to prove that for every \(u \in \varDelta _{\mathsf {simplex}}\), \(\langle \nabla g(z), u-z \rangle \ge 0\). Since the gradient \(\nabla g(z)\) can be written explicitly, this is equivalent to

$$\begin{aligned} \textstyle \delta (u_i - z_i) + \sum _{\ell =1}^n \log \frac{z_{\ell }}{\mathsf {z}_{k-1,\ell }} \cdot (u_\ell - z_\ell ) \ge 0 . \end{aligned}$$

If the re-scaling in step 3 is not executed, then we have \(z_\ell = \mathsf {z}_{k-1,\ell }\) for every \(\ell \ne i\), and \(z_i = \mathsf {z}_{k-1,i} \cdot e^{-\delta }\); thus, the left-hand side is zero so the above inequality is true for every \(u\in \varDelta _{\mathsf {simplex}}\).

Otherwise, we have \({\mathbbm {1}}^{T} z = 2\mathsf {OPT}'\) and there exists some constant \(Z>1\) such that \(z_\ell = \mathsf {z}_{k-1,\ell } / Z\) for every \(\ell \ne i\), and \(z_i = \mathsf {z}_{k-1,i} \cdot e^{-\delta } / Z\). In such a case, the left-hand side equals

$$\begin{aligned} \textstyle (u_i - z_i) \cdot (\delta - \delta ) + \sum _{\ell =1}^n -\log Z \cdot (u_\ell - z_\ell ) . \end{aligned}$$

Since \(\log Z > 0\) and \({\mathbbm {1}}^{T} u \le 2\mathsf {OPT}' = {\mathbbm {1}}^{T} z\), the above quantity is always non-negative, finishing the proof. \(\square \)
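The three-step procedure of Proposition 6.4 translates directly into code. The sketch below is only an illustration (the values of \(\mathsf {OPT}'\), the coordinate i, and \(\delta \) are placeholders): it multiplies coordinate i by \(e^{-\delta }\) and rescales the whole vector only when the sum constraint of \(\varDelta _{\mathsf {simplex}}\) is violated.

```python
import numpy as np

def mirror_step(z_prev, i, delta, opt_prime):
    """Sketch of Proposition 6.4: argmin_{z in Delta_simplex} { V_{z_prev}(z) + <delta*e_i, z> },
    where Delta_simplex = { z >= 0 : 1^T z <= 2 * OPT' }."""
    z = z_prev.copy()
    z[i] *= np.exp(-delta)                 # step 2: multiplicative update on coordinate i
    total = z.sum()
    if total > 2 * opt_prime:              # step 3: rescale back onto the simplex constraint
        z *= 2 * opt_prime / total
    return z

# Illustrative usage with placeholder values (not taken from the paper).
z_prev = np.array([0.5, 0.3, 0.2])
print(mirror_step(z_prev, i=1, delta=-0.8, opt_prime=0.6))
```

As the quantities \(\mathsf {sz}_k\) and \(\mathsf {sumz}_k\) maintained in Sect. 1.6 suggest, an efficient implementation keeps the global scaling factor and the coordinate sum separately instead of rescaling the whole vector explicitly.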

Lemma 6.13

Denoting by \(\gamma {\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}2\alpha _T n\), we have

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta . \end{aligned}$$

Proof

Define \(w(x){\mathop {=}\limits ^{\mathrm {\scriptscriptstyle def}}}\sum _i x_i \log (x_i) - x_i\) and, accordingly, \(V_x(y) = w(y) - \langle \nabla w(x), y-x \rangle - w(x) = \sum _i y_i \log \frac{y_i}{x_i} + x_i - y_i\). We first compute, using the classical analysis of the mirror descent step:

$$\begin{aligned}&\gamma \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} \big \rangle + \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \nonumber \\&\quad = (1+\gamma )\alpha _k \Big \langle n \xi _k^{(i)}, \mathsf {z}^{(i)}_k - \frac{u^*}{1+\gamma } \Big \rangle + (1+\gamma )\alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k \big \rangle \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{\le } \Big \langle \nabla w(\mathsf {z}_{k-1}) - \nabla w(\mathsf {z}^{(i)}_k), \mathsf {z}^{(i)}_k - \frac{u^*}{1+\gamma } \Big \rangle + (1+\gamma )\alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k \big \rangle \nonumber \\&\quad = \left( w\big (\frac{u^*}{1+\gamma }\big ) - w(\mathsf {z}_{k-1}) - \Big \langle \nabla w(\mathsf {z}_{k-1}), \frac{u^*}{1+\gamma } - \mathsf {z}_{k-1}\Big \rangle \right) \nonumber \\&\qquad - \left( w\big (\frac{u^*}{1+\gamma }\big ) - w(\mathsf {z}^{(i)}_k) - \Big \langle \nabla w(\mathsf {z}^{(i)}_k), \frac{u^*}{1+\gamma } - \mathsf {z}^{(i)}_k\Big \rangle \right) \nonumber \\&\qquad + \left( w(\mathsf {z}_{k-1}) - w(\mathsf {z}^{(i)}_k) - \big \langle \nabla w(\mathsf {z}_{k-1}), \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k\big \rangle \right) \nonumber \\&\qquad + (1+\gamma )\alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k \big \rangle \nonumber \\&\quad = V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) + \boxed {(1+\gamma )\alpha _k\big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k \big \rangle - V_{\mathsf {z}_{k-1}}(\mathsf {z}^{(i)}_k)} . \end{aligned}$$
(D.1)

Above, \(\textcircled {{\small 1}}\) is because \(\mathsf {z}^{(i)}_k = {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}}\big \{V_{\mathsf {z}_{k-1}}(z) + \langle (1+\gamma )\alpha _k n \xi _k^{(i)}, z \rangle \big \}\), which is equivalent to saying

$$\begin{aligned}&\forall u \in \varDelta _{\mathsf {simplex}},\quad \langle \nabla V_{\mathsf {z}_{k-1}}(\mathsf {z}^{(i)}_k) + (1+\gamma )\alpha _k n \xi _k^{(i)}, u - \mathsf {z}^{(i)}_k \rangle \ge 0 \\ \Longleftrightarrow \quad&\forall u \in \varDelta _{\mathsf {simplex}},\quad \langle \nabla w(\mathsf {z}^{(i)}_k) - \nabla w(\mathsf {z}_{k-1}) + (1+\gamma )\alpha _k n \xi _k^{(i)}, u - \mathsf {z}^{(i)}_k \rangle \ge 0 . \end{aligned}$$

In particular, we have \({\mathbbm {1}}^{T} \frac{u^*}{1+\gamma } = {\mathbbm {1}}^{T} \frac{(1+\varepsilon /2)x^*}{1+\gamma } < 2\mathsf {OPT}\le 2\mathsf {OPT}'\) and therefore \(\frac{u^*}{1+\gamma } \in \varDelta _{\mathsf {simplex}}\). Substituting \(u = \frac{u^*}{1+\gamma } \) into the above inequality we get \(\textcircled {{\small 1}}\).

Next, we upper bound the term in the box:

$$\begin{aligned}&(1+\gamma )\alpha _k\langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - \mathsf {z}^{(i)}_k \rangle - V_{\mathsf {z}_{k-1}}(\mathsf {z}^{(i)}_k) \nonumber \\&\quad \overset{\textcircled {{\small 1}}}{\le }(1+\gamma )\alpha _k n \xi _{k,i} \cdot (\mathsf {z}_{k-1,i} - \mathsf {z}^{(i)}_{k,i}) - \left( \mathsf {z}^{(i)}_{k,i} \log \frac{\mathsf {z}^{(i)}_{k,i}}{\mathsf {z}_{k-1,i}} + \mathsf {z}_{k-1,i} - \mathsf {z}^{(i)}_{k,i} \right) \nonumber \\&\quad \overset{\textcircled {{\small 2}}}{\le }(1+\gamma )\alpha _k n \xi _{k,i} \cdot (\mathsf {z}_{k-1,i} - \mathsf {z}^{(i)}_{k,i}) - \frac{|\mathsf {z}^{(i)}_{k,i} - \mathsf {z}_{k-1,i}|^2}{2\max \{\mathsf {z}^{(i)}_{k,i}, \mathsf {z}_{k-1,i}\}} \nonumber \\&\quad \overset{\textcircled {{\small 3}}}{\le }(1+\gamma )\alpha _k n \xi _{k,i} \cdot (\mathsf {z}_{k-1,i} - \mathsf {z}^{(i)}_{k,i}) - \frac{|\mathsf {z}^{(i)}_{k,i} - \mathsf {z}_{k-1,i}|^2}{4 \mathsf {z}_{k-1,i}} \nonumber \\&\quad \overset{\textcircled {{\small 4}}}{\le }(1+\gamma )^2\mathsf {z}_{k-1,i} \cdot (\alpha _k n \xi _{k,i})^2 \overset{\textcircled {{\small 5}}}{\le }2\mathsf {z}_{k-1,i} \cdot (\alpha _k n \xi _{k,i})^2 \overset{\textcircled {{\small 6}}}{\le }\mathsf {z}_{k-1,i} \cdot \gamma \alpha _k n |\xi _{k,i}| \nonumber \\&\quad \overset{\textcircled {{\small 7}}}{\le }\mathsf {z}_{k-1,i} \cdot \gamma \alpha _k n \xi _{k,i} + 2 \mathsf {z}_{k-1,i} \cdot \gamma \alpha _k n \beta = \gamma \alpha _k \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} \rangle + 2 \mathsf {z}_{k-1,i} \cdot \gamma \alpha _k n \beta . \end{aligned}$$
(D.2)

Above, \(\textcircled {{\small 1}}\) uses the facts that (i) \(a \log \frac{a}{b} + b - a \ge 0\) for any \(a,b>0\), (ii) \(\mathsf {z}_{k-1,i}-\mathsf {z}_{k,i}^{(i)}\) and \(\xi _{k,i}\) have the same sign, and (iii) \(\xi _{k,i'}^{(i)}=0\) for every \(i' \ne i\); \(\textcircled {{\small 2}}\) uses the inequality that for every \(a,b>0\), we have \( a \log \frac{a}{b} + b - a \ge \frac{(a-b)^2}{2\max \{a,b\}}\); \(\textcircled {{\small 3}}\) uses the fact that \(\mathsf {z}^{(i)}_{k,i} \le 2\mathsf {z}_{k-1,i}\) (see Footnote 14); \(\textcircled {{\small 4}}\) uses Cauchy–Schwarz: \(ab - b^2/4 \le a^2\); \(\textcircled {{\small 5}}\) uses \((1+\gamma )^2 < 2\); \(\textcircled {{\small 6}}\) uses \(|\xi _{k,i}| \le 1\) and \(\gamma = 2\alpha _T n \ge 2\alpha _k n\); \(\textcircled {{\small 7}}\) uses \(\xi _{k,i} \ge -\beta \).

Next, we combine (D.1) and (D.2) to conclude that

$$\begin{aligned} \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \le V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) + 2 \mathsf {z}_{k-1,i} \cdot \gamma \alpha _k n \beta . \end{aligned}$$

Taking expectation on both sides with respect to i, and using the property that \({\mathbbm {1}}^{T} \mathsf {z}_{k-1} \le 3\mathsf {OPT}' \le 6\mathsf {OPT}\), we obtain that

$$\begin{aligned} \mathbf {E}_i \big [ \alpha _k \big \langle n \xi _k^{(i)}, \mathsf {z}_{k-1} - u^* \big \rangle \big ] \le V_{\mathsf {z}_{k-1}}\big (\frac{u^*}{1+\gamma }\big ) - \mathbf {E}_i \Big [ V_{\mathsf {z}^{(i)}_k}\big (\frac{u^*}{1+\gamma }\big ) \Big ] + 12 \mathsf {OPT}\cdot \gamma \alpha _k \beta . \end{aligned}$$

\(\square \)

Lemma 6.14

For every \(i\in [n]\), we have

  (a) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), and

  (b) \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge \frac{\mu \beta }{12} \cdot \langle - \widetilde{\eta }_{k}^{(i)}, u^* \rangle .\)

Proof of Lemma 6.14 part (a)

If \(i\not \in B_k\) is not a large index, then \(\mathsf {y}_k^{(i)} = \mathsf {x}_k\) and the claim is trivial, so we focus on \(i\in B_k\) in the rest of the proof. Recall that \(\mathsf {y}_k^{(i)} = \mathsf {x}_k + \delta \mathbf {e}_i\) for some \(\delta >0\) defined in Algorithm 3, so we have

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) = \int _{\tau = 0}^{\delta } \langle - \nabla f_\mu (\mathsf {x}_k + \tau \mathbf {e}_i), \mathbf {e}_i \rangle d \tau = \int _{\tau = 0}^{\delta } \big ( \langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle - 1 \big ) d \tau . \end{aligned}$$

It is clear that \(\langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle \) decreases as \(\tau \) increases, and therefore it suffices to prove that \(\langle A_{:i}, p(\mathsf {x}_k + \delta \mathbf {e}_i) \rangle \ge 1\).

Suppose (for simplicity of notation) that the rows of \(A_{:i}\) are sorted in increasing order of \(A_{j,i}\). Now, by the definition of the algorithm (recall (6.1)), there exists some \(j^* \in [m]\) satisfying

$$\begin{aligned} \sum _{j< j^*} A_{j,i} \cdot p_j(\mathsf {x}_k) < 1+\beta \quad \text {and}\quad \sum _{j \le j^*} A_{j,i} \cdot p_j(\mathsf {x}_k) \ge 1+\beta . \end{aligned}$$

Next, by our choice of \(\delta \) which satisfies \(\delta = \frac{\mu \beta }{2A_{j^*,i}} \le \frac{\mu \beta }{2A_{j,i}} \) for every \(j \le j^*\), we have for every \(j\le j^*\):

$$\begin{aligned} p_j(\mathsf {x}_k + \delta \mathbf {e}_i) = p_j(\mathsf {x}_k) \cdot e^{-\frac{A_{j,i} \delta }{\mu }} \ge p_j(\mathsf {x}_k) \cdot e^{-\beta /2} \ge p_j(\mathsf {x}_k) \cdot (1-\beta /2) , \end{aligned}$$

and as a result,

$$\begin{aligned}&\langle A_{:i}, p(\mathsf {x}_k + \delta \mathbf {e}_i) \rangle \ge \sum _{j\le j^*} A_{j,i} \cdot p_j (\mathsf {x}_k + \delta \mathbf {e}_i) \ge (1-\beta /2) \sum _{j\le j^*} A_{j,i} \cdot p_j (\mathsf {x}_k) \\&\quad \ge (1-\beta /2) (1+\beta ) \ge 1 . \end{aligned}$$

\(\square \)
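To make the threshold index \(j^*\) and the step length \(\delta = \frac{\mu \beta }{2A_{j^*,i}}\) used in this proof concrete, here is a small sketch (an illustration, not the paper's Algorithm 3): given the i-th column of A and the current weights \(p(\mathsf {x}_k)\), it scans the entries in increasing order of \(A_{j,i}\) until the weighted prefix sum reaches \(1+\beta \), and returns the corresponding step length; if the full sum never reaches \(1+\beta \), coordinate i is not a large index.

```python
import numpy as np

def truncation_step(A_col, p, mu, beta):
    """Sketch of the choice of j* and delta from the proof above: scan rows in increasing
    order of A_{j,i} until sum_{j <= j*} A_{j,i} * p_j >= 1 + beta, then set
    delta = mu * beta / (2 * A_{j*,i}).  Returns None when i is not a large index."""
    order = np.argsort(A_col)                     # rows sorted by increasing A_{j,i}
    running = 0.0
    for j in order:
        running += A_col[j] * p[j]
        if running >= 1 + beta:
            return j, mu * beta / (2 * A_col[j])  # (row index of j*, step length delta)
    return None

# Placeholder data, only to illustrate the computation.
A_col = np.array([0.2, 1.5, 0.7, 0.4])
p = np.array([1.0, 0.8, 0.9, 1.1])
print(truncation_step(A_col, p, mu=0.01, beta=0.3))
```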

Proof of Lemma 6.14 part (b)

Owing to part (a), for every coordinate i such that \(\widetilde{\eta }_{k,i}\ge 0\), we automatically have \(f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)}) \ge 0\), so the lemma is obvious. Therefore, let us focus only on coordinates i such that \(\widetilde{\eta }_{k,i}<0\); these are necessarily large indices \(i\in B_k\). Recall from Definition 6.11 that \(\widetilde{\eta }_{k,i} = (1+\beta ) - (\widetilde{A}^T p(\mathsf {x}_k))_i\), so we have

$$\begin{aligned} \textstyle \sum _{j=1}^m \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) - (1+\beta ) > 0 .\end{aligned}$$

For simplicity of description, suppose again that the i-th column is sorted in non-decreasing order, that is, \(A_{1,i}\le \cdots \le A_{m,i}\). The definition of \(j^*\) can then be written as

$$\begin{aligned} \textstyle \sum _{j< j^*} A_{j,i} \cdot p_j(\mathsf {x}_k) < 1+\beta \quad \text {and}\quad \sum _{j \le j^*} A_{j,i} \cdot p_j(\mathsf {x}_k) \ge 1+\beta .\end{aligned}$$

Let \(j^{\flat } \in [m]\) be the row such that

$$\begin{aligned} \textstyle \sum _{j< j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) < 1+\beta \quad \text {and}\quad \sum _{j \le j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) \ge 1+\beta .\end{aligned}$$

Note that such a \(j^{\flat }\) must exist because \(\sum _{j=1}^m \widetilde{A}_{j,i} \cdot p_j > 1+\beta \). It is clear that \(j^{\flat } \ge j^*\), owing to the definition that \(\widetilde{A}_{ji} \le A_{ji}\) for all \(i\in [n], j\in [m]\). Defining \(\delta ^{\flat } = \frac{\mu \beta }{2A_{j^{\flat },i}} \le \delta \), the objective decrease is lower bounded as

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})&= \int _{\tau = 0}^{\delta } \langle - \nabla f_\mu (\mathsf {x}_k + \tau \mathbf {e}_i), \mathbf {e}_i \rangle d \tau = \int _{\tau = 0}^{\delta } \big ( \langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle - 1 \big ) d \tau \\&\ge \int _{\tau = 0}^{\delta ^{\flat }} \big ( \langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle - 1 \big ) d \tau \\&= \underbrace{\int _{\tau =0}^{\delta ^{\flat }} \left( -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) \right) d\tau }_{I}\\&\quad + \underbrace{\sum _{j>j^{\flat }} \int _{\tau =0}^{\delta ^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) d\tau }_{I'} \end{aligned}$$

where the inequality is because \(\delta ^{\flat } \le \delta \) and \(\langle A_{:i}, p(\mathsf {x}_k + \tau \mathbf {e}_i) \rangle \ge 1 \) for all \(\tau \le \delta \) (see the proof of part (a)).

Part I. To lower bound I, we use the monotonicity of \(p_j(\cdot )\) and obtain that

$$\begin{aligned} I =&\int _{\tau =0}^{\delta ^{\flat }} \left( -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) \right) d\tau \ge \delta ^{\flat }\\&\cdot \left( -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \delta ^{\flat } \mathbf {e}_i) \right) . \end{aligned}$$

However, our choice of \(\delta ^{\flat } = \frac{\mu \beta }{2A_{j^{\flat },i}} \le \frac{\mu \beta }{2A_{j,i}} \) for all \(j\le j^{\flat }\) ensures that

$$\begin{aligned} \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \delta ^{\flat } \mathbf {e}_i)&\ge \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \cdot e^{\frac{-A_{j,i} \cdot \delta ^{\flat }}{\mu }} \\&\ge \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \cdot (1-\beta /2) . \end{aligned}$$

Therefore, we obtain that

$$\begin{aligned} I \ge \delta ^{\flat } \left( -1 + (1-\beta /2)\sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k)\right) \ge \frac{\delta ^{\flat }}{3} \left( -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \right) , \end{aligned}$$

where the inequality is because \(\big (\frac{2}{3}-\frac{\beta }{2} \big ) \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \ge \frac{4-3\beta }{6} \cdot (1+\beta ) \ge \frac{2}{3}\) whenever \(\beta \le \frac{1}{3}\) (or equivalently, whenever \(\varepsilon \le 1/9\)).

Now, suppose that \(\sum _{j \le j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) - (1+\beta ) = b \cdot \widetilde{A}_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k)\) for some \(b \in [0,1]\). Note that we can do so by the very definition of \(j^{\flat }\). Then, we must have

$$\begin{aligned} -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k)&\ge -1 + \sum _{j< j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) + A_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k) \\&= -1 + (1+\beta ) - (1-b) \widetilde{A}_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k) + A_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k) \\&\ge \beta + b \cdot A_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k) . \end{aligned}$$

Therefore, we conclude that

$$\begin{aligned} I&\ge \frac{\delta ^{\flat }}{3} \left( -1 + \sum _{j\le j^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k) \right) > \frac{\delta ^{\flat }}{3} \cdot b \cdot A_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k)\\&= \frac{\mu \beta }{6\widetilde{A}_{j^{\flat },i}} \cdot b \cdot \widetilde{A}_{j^{\flat },i} \cdot p_{j^{\flat }}(\mathsf {x}_k) \\&= \frac{\mu \beta }{6\widetilde{A}_{j^{\flat },i}} \cdot \left( - (1+\beta ) + \sum _{j \le j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k)\right) \\&\ge \frac{\mu \beta }{12} \cdot u^*_i \cdot \left( - (1+\beta ) + \sum _{j \le j^{\flat }} \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) \right) . \end{aligned}$$

Above, the last inequality is because \(u^*_i \cdot \widetilde{A}_{j^{\flat },i} \le \langle \widetilde{A}_{j^{\flat } :}, u^* \rangle \le 2\) by our definition of \(\widetilde{A}\).

Part \(I'\). To lower bound \(I'\), consider every \(j> j^{\flat }\) and the integral

$$\begin{aligned} \int _{\tau =0}^{\delta ^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) d\tau . \end{aligned}$$

Note that whenever \(\tau \le \frac{\mu \beta }{2A_{j,i}} \le \frac{\mu \beta }{2A_{j^{\flat },i}} = \delta ^{\flat }\), we have that \(p_j(\mathsf {x}_k + \tau \mathbf {e}_i) \ge p_j(\mathsf {x}_k) \cdot e^{-\beta /2} \ge \frac{1}{2} p_j(\mathsf {x}_k)\). Therefore,

$$\begin{aligned} \int _{\tau =0}^{\delta ^{\flat }} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) d\tau \ge \int _{\tau =0}^{\frac{\mu \beta }{2A_{j,i}}} A_{j,i} \cdot p_j(\mathsf {x}_k + \tau \mathbf {e}_i) d\tau \ge \frac{\mu \beta }{2A_{j,i}} \cdot A_{j,i} \cdot \frac{1}{2}p_j (\mathsf {x}_k) . \end{aligned}$$

This implies a lower bound on \(I'\):

$$\begin{aligned} I' \ge \sum _{j>j^{\flat }} \frac{\mu \beta }{4A_{j,i}} \cdot A_{j,i} \cdot p_j(\mathsf {x}_k) \ge \frac{\mu \beta }{8} \cdot \sum _{j>j^{\flat }} u^*_i \cdot \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k) ,\end{aligned}$$

where again in the last inequality we have used \(u^*_i \cdot \widetilde{A}_{j,i} \le \langle \widetilde{A}_{j :}, u^* \rangle \le 2\) for every j, by our definition of \(\widetilde{A}\).

Together. Combining the lower bounds on I and \(I'\), we obtain

$$\begin{aligned} f_\mu (\mathsf {x}_k) - f_\mu (\mathsf {y}_k^{(i)})\ge & {} I + I' \ge \frac{\mu \beta }{12} \cdot u^*_i \cdot \left( - (1+\beta ) + \sum _{j=1}^m \widetilde{A}_{j,i} \cdot p_j(\mathsf {x}_k)\right) \\= & {} \frac{\mu \beta }{12} \cdot \langle -\widetilde{\eta }_{k}^{(i)}, u^* \rangle . \end{aligned}$$

\(\square \)

1.5 Proof of Lemma 3.3: Efficient Implementation of \(PacLPSolver\)

In this section, we illustrate how to implement each iteration of \(PacLPSolver\) to run in an expected O(N / n) time. We maintain the following quantities

$$\begin{aligned} \mathsf {z}_k \in \mathbb {R}_{\ge 0}^n, \quad \mathsf {az}_k \in \mathbb {R}_{\ge 0}^m, \quad \mathsf {y}'_k \in \mathbb {R}^n, \quad \mathsf {ay}'_k \in \mathbb {R}^m, \quad B_{k,1},B_{k,2} \in \mathbb {R}_+ \end{aligned}$$

throughout the algorithm, so as to ensure the following invariants are always satisfied

$$\begin{aligned}&A \mathsf {z}_k = \mathsf {az}_k , \end{aligned}$$
(E.1)
$$\begin{aligned}&\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k + B_{k,2} \cdot \mathsf {y}'_k , \quad A \mathsf {y}_k' = \mathsf {ay}'_k . \end{aligned}$$
(E.2)

It is clear that when \(k=0\), letting \(\mathsf {az}_k = A \mathsf {z}_0\), \(\mathsf {y}'_k = \mathsf {y}_0\), \(\mathsf {ay}'_k = A\mathsf {y}_0\), \(B_{k,1}=0\), and \(B_{k,2}=1\), we can ensure that all the invariants are satisfied initially. We denote by \(\Vert A_{:i}\Vert _0\) the number of nonzero elements in the vector \(A_{:i}\). In each iteration \(k=1,2,\dots ,T\):

  • The step \(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\) does not need to be implemented.

  • The value \(\nabla _i f(\mathsf {x}_k)\) requires the knowledge of \(p_j(\mathsf {x}_k) = e^{\frac{1}{\mu } ((A\mathsf {x}_k)_j - 1)}\) for each j such that \(A_{ij}\ne 0\). Accordingly, for each j, we need to know the value

    $$\begin{aligned} (A \mathsf {x}_k)_j= & {} \tau (A \mathsf {z}_{k-1})_j + (1-\tau ) (A \mathsf {y}_{k-1})_j \\= & {} \big (\tau + (1-\tau ) B_{k-1,1} \big ) \mathsf {az}_{k-1,j} + (1-\tau ) B_{k-1,2} \mathsf {ay}'_{k-1,j} . \end{aligned}$$

    This can be computed in O(1) time for each j, and \(O(\Vert A_{:i}\Vert _0)\) time in total.

  • Recall that the step \(\mathsf {z}_k \leftarrow {\text {arg min}}_{z \in \varDelta _{\mathsf {box}}} \big \{\frac{1}{2}\Vert z - \mathsf {z}_{k-1}\Vert _A^2 + \langle n \alpha _k \xi _{k}^{(i)}, z \rangle \big \}\) can be written as \(\mathsf {z}_{k} = \mathsf {z}_{k-1} + \delta \mathbf {e}_i\) for some \(\delta \in \mathbb {R}\) that can be computed in O(1) time (see Proposition 3.2). Observe also that \(\mathsf {z}_k = \mathsf {z}_{k-1} + \delta \mathbf {e}_i\) yields \(\mathsf {y}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{\delta \mathbf {e}_i}{n \alpha _k L}\), due to Line 6 and Line 10 of Algorithm 1. Therefore, we perform two explicit updates on \(\mathsf {z}_k\) and \(\mathsf {az}_k\) as

    $$\begin{aligned}\mathsf {z}_k \leftarrow \mathsf {z}_{k-1} + \delta \mathbf {e}_i , \quad \mathsf {az}_k \leftarrow \mathsf {az}_{k-1} + \delta A_{:i}\end{aligned}$$

    and two implicit updates on \(\mathsf {y}_k\) as

    $$\begin{aligned} B_{k,1}&= \tau + (1-\tau ) B_{k-1,1} , \qquad B_{k,2} = (1-\tau ) B_{k-1,2} , \\ \mathsf {y}'_k&\leftarrow \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \cdot \frac{1}{B_{k,2}}\right) , \qquad \mathsf {ay}'_k \leftarrow \mathsf {ay}'_{k-1} + \delta A_{:i}\cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \cdot \frac{1}{B_{k,2}}\right) \end{aligned}$$

    It is not hard to verify that after these updates, \( A \mathsf {y}_k' = \mathsf {ay}'_k\) and we have

    $$\begin{aligned}&B_{k,1} \cdot \mathsf {z}_k + B_{k,2} \cdot \mathsf {y}'_k = B_{k,1} \cdot \big (\mathsf {z}_{k-1} + \delta \mathbf {e}_i\big ) \\&\quad + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( - \frac{B_{k,1}}{B_{k,2}} + \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}_{k-1} + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \delta \mathbf {e}_i \cdot \left( \frac{1}{n\alpha _k L} \frac{1}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}_{k-1} + B_{k,2} \cdot \mathsf {y}'_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} \\&\quad = \big (\tau + (1-\tau ) B_{k-1,1}\big ) \cdot \mathsf {z}_{k-1} + \big ((1-\tau ) B_{k-1,2}\big ) \cdot \mathsf {y}'_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} \\&\quad = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \frac{\delta \mathbf {e}_i}{n\alpha _k L} = \mathsf {y}_k, \end{aligned}$$

    so the invariant \(\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k + B_{k,2} \cdot \mathsf {y}'_k \) also holds. In sum, after performing the updates on \(\mathsf {z}_k\), \(\mathsf {az}_k\), \(\mathsf {y}'_k\), and \(\mathsf {ay}'_k\) in time \(O(\Vert A_{:i}\Vert _0)\), we can ensure that the invariants in (E.1) and (E.2) are satisfied at iteration k.

Altogether, we only need \(O(\Vert A_{:i}\Vert _0)\) time to perform the updates of \(PacLPSolver\) in iteration k when coordinate i is selected. Therefore, each iteration of \(PacLPSolver\) can be implemented to run in an expected \(O(\mathbf {E}_i[\Vert A_{:i}\Vert _0]) = O(N/n)\) time.
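To make the lazy bookkeeping concrete, the following is a minimal Python sketch of one such iteration, assuming the chosen coordinate i, the step length \(\delta \) from Proposition 3.2, and the scalar \(n \alpha _k L\) are supplied by the caller; the names PackingState, packing_lazy_iteration, and current_y are illustrative and not part of the paper's pseudocode.

import numpy as np
from scipy.sparse import csc_matrix

class PackingState:
    """Quantities maintained by the lazy implementation (invariants (E.1)-(E.2))."""
    def __init__(self, A, z0, y0):
        self.A = csc_matrix(A)                        # column-major storage: column i accessible in O(||A_{:i}||_0)
        self.z = np.asarray(z0, dtype=float).copy()   # z_k (explicit)
        self.az = self.A @ self.z                     # (E.1):  az_k = A z_k
        self.y_prime = np.asarray(y0, dtype=float).copy()
        self.ay_prime = self.A @ self.y_prime         # (E.2):  ay'_k = A y'_k
        self.B1, self.B2 = 0.0, 1.0                   # (E.2):  y_k = B1 * z_k + B2 * y'_k

def packing_lazy_iteration(s, i, delta, tau, n_alpha_L):
    """Apply z_k = z_{k-1} + delta * e_i lazily, touching only column i of A."""
    lo, hi = s.A.indptr[i], s.A.indptr[i + 1]
    rows, vals = s.A.indices[lo:hi], s.A.data[lo:hi]
    # explicit updates on z_k and az_k
    s.z[i] += delta
    s.az[rows] += delta * vals
    # implicit updates on y_k via the scalars B1, B2 and the vectors y'_k, ay'_k
    s.B1 = tau + (1.0 - tau) * s.B1
    s.B2 = (1.0 - tau) * s.B2
    c = delta * (-s.B1 / s.B2 + 1.0 / (n_alpha_L * s.B2))
    s.y_prime[i] += c
    s.ay_prime[rows] += c * vals

def current_y(s):
    """Recover y_k = B1 * z_k + B2 * y'_k (only needed when the algorithm terminates)."""
    return s.B1 * s.z + s.B2 * s.y_prime

Since only the nonzero rows of column i are touched, one call costs \(O(\Vert A_{:i}\Vert _0)\) arithmetic operations, matching the accounting above.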

1.6 Proof of Lemma 6.5: Efficient Implementation of \(CovLPSolver\)

In this section, we illustrate how to implement each iteration of \(CovLPSolver\) to run in an expected O(N / n) time. We maintain the following quantities

$$\begin{aligned}&\mathsf {z}_k' \in \mathbb {R}_+^n, \quad \mathsf {sz}_k \in \mathbb {R}_+, \quad \mathsf {sumz}_k \in \mathbb {R}_+, \quad \mathsf {az}'_k \in \mathbb {R}_{\ge 0}^m, \quad \mathsf {y}'_k \in \mathbb {R}^n, \\&\mathsf {ay}'_k \in \mathbb {R}^m, \quad B_{k,1},B_{k,2} \in \mathbb {R}_+ \end{aligned}$$

throughout the algorithm, so as to maintain the following invariants

$$\begin{aligned}&\mathsf {z}_k = \mathsf {z}_k' / \mathsf {sz}_k, \quad&\mathsf {sumz}_k = {\mathbbm {1}}^{T} \mathsf {z}'_k, \quad&A \mathsf {z}_k = \mathsf {az}'_k / \mathsf {sz}_k, \end{aligned}$$
(F.1)
$$\begin{aligned}&\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k' + B_{k,2} \cdot \mathsf {y}'_k, \quad&A \mathsf {y}'_k = \mathsf {ay}'_k . \end{aligned}$$
(F.2)

It is clear that when \(k=0\), letting \(\mathsf {z}_k' = \mathsf {z}_0\), \(\mathsf {sz}_k=1\), \(\mathsf {sumz}_k = {\mathbbm {1}}^{T} \mathsf {z}_0\), \(\mathsf {az}'_k = A\mathsf {z}_0\), \(\mathsf {y}'_k = \mathsf {y}_0\), \(\mathsf {ay}'_k = A\mathsf {y}_0\), \(B_{k,1}=0\), and \(B_{k,2}=1\), we can ensure that all the invariants are satisfied initially.

We denote by \(\Vert A_{:i}\Vert _0\) the number of nonzero elements in vector \(A_{:i}\). In each iteration \(k=1,2,\dots ,T\):

  • The step \(\mathsf {x}_k = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1}\) does not need to be implemented.

  • The value \(p_j(\mathsf {x}_k) = e^{\frac{1}{\mu } (1 - (A\mathsf {x}_k)_j)}\) for each j only requires the knowledge of

    $$\begin{aligned} (A \mathsf {x}_k)_j= & {} \tau (A \mathsf {z}_{k-1})_j + (1-\tau ) (A \mathsf {y}_{k-1})_j \\= & {} \big (\tau + (1-\tau ) B_{k-1,1} \big ) \frac{\mathsf {az}'_{k-1,j}}{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,2} \mathsf {ay}'_{k-1,j} . \end{aligned}$$

    This can be computed in O(1) time.

  • The value \(\nabla _i f(\mathsf {x}_k)\) requires the knowledge of \(p_j(\mathsf {x}_k)\) for each \(j\in [m]\) such that \(A_{ij}\ne 0\). Since we have \(\Vert A_{:i}\Vert _0\) such j’s, we can compute \(\nabla _i f(\mathsf {x}_k)\) in \(O(\Vert A_{:i}\Vert _0)\) time.

  • Letting \(\delta = (1+\gamma ) n \alpha _k \xi _{k,i}^{(i)}\), recall that the mirror step \(\mathsf {z}_k \leftarrow {\text {arg min}}_{z \in \varDelta _{\mathsf {simplex}}} \big \{ V_{\mathsf {z}_{k-1}}(z) + \langle \delta \mathbf {e}_i, z \rangle \big \}\) has a very simple form (see Proposition 6.4): first multiply the i-th coordinate of \(\mathsf {z}_{k-1}\) by \(e^{-\delta }\) and then, if the sum of all coordinates has exceeded \(2\mathsf {OPT}'\), scale everything down so that the coordinates sum to \(2\mathsf {OPT}'\). This can be implemented as follows: setting \(\delta _1 = \mathsf {z}'_{k-1,i} (e^{-\delta }-1)\),

    $$\begin{aligned} \begin{array}{l} \mathsf {z}'_{k} \leftarrow \mathsf {z}'_{k-1} + \delta _1 \mathbf {e}_i , \quad \mathsf {az}'_k \leftarrow \mathsf {az}'_{k-1} + \delta _1 A_{:i}, \\ \mathsf {sumz}_{k} \leftarrow \mathsf {sumz}_{k-1} + \delta _1 , \quad \mathsf {sz}_k \leftarrow \mathsf {sz}_{k-1} \cdot \max \Big \{1, \frac{\mathsf {sumz}_k}{\mathsf {sz}_{k-1}\cdot 2\mathsf {OPT}'} \Big \} . \end{array} \end{aligned}$$

    These updates can be implemented to run in \(O(\Vert A_{:i}\Vert _0)\) time, and they together ensure that the invariants in (F.1) are satisfied at iteration k.

  • Recall that the gradient step is of the form \(\mathsf {y}_k \leftarrow \mathsf {x}_k + \delta _2 \cdot \mathbf {e}_i\) for some value \(\delta _2 \ge 0\). This value \(\delta _2\) can be computed in \(O(\Vert A_{:i}\Vert _0)\) time, since each \(p_j(\mathsf {x}_k)\) can be computed in O(1) time and the rows within each column of A can be pre-sorted in a preprocessing step.

    Since \(\mathsf {y}_k = \mathsf {x}_k + \delta _2 \cdot \mathbf {e}_i = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \delta _2 \mathbf {e}_i\), we can implement this update by letting

    $$\begin{aligned} \begin{array}{l} B_{k,1} = \frac{\tau }{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,1} , B_{k,2} = (1-\tau ) B_{k-1,2} \\ \mathsf {y}'_k \leftarrow \mathsf {y}'_{k-1} + \mathbf {e}_i \cdot \left( - \frac{B_{k,1} \delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) , \mathsf {ay}'_k \leftarrow \mathsf {ay}'_{k-1} + A_{:i}\cdot \left( - \frac{B_{k,1} \delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) \end{array}\end{aligned}$$

    It is not hard to verify that after these updates, \(\mathsf {ay}'_k = A \mathsf {y}'_k\) and we have

    $$\begin{aligned}&B_{k,1} \cdot \mathsf {z}'_k + B_{k,2} \cdot \mathsf {y}'_k = B_{k,1} \cdot \big (\mathsf {z}'_{k-1} + \delta _1 \mathbf {e}_i\big ) \\&\qquad + B_{k,2} \cdot \left( \mathsf {y}'_{k-1} + \mathbf {e}_i \cdot \left( - \frac{B_{k,1}\delta _1}{B_{k,2}} + \frac{\delta _2}{B_{k,2}}\right) \right) \\&\quad = B_{k,1} \cdot \mathsf {z}'_{k-1} + B_{k,2} \cdot \big (\mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i / B_{k,2} \big ) \\&\quad = B_{k,1} \cdot \mathsf {z}'_{k-1} + B_{k,2} \cdot \mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i \\&\quad = \big (\frac{\tau }{\mathsf {sz}_{k-1}} + (1-\tau ) B_{k-1,1}\big ) \cdot \mathsf {z}'_{k-1} + \big ((1-\tau ) B_{k-1,2}\big ) \cdot \mathsf {y}'_{k-1} + \delta _2 \mathbf {e}_i \\&\quad = \tau \mathsf {z}_{k-1} + (1-\tau ) \mathsf {y}_{k-1} + \delta _2 \mathbf {e}_i = \mathsf {y}_k , \end{aligned}$$

    so that the invariant \(\mathsf {y}_k = B_{k,1} \cdot \mathsf {z}_k' + B_{k,2} \cdot \mathsf {y}'_k\) is also satisfied. In sum, after spending \(O(\Vert A_{:i}\Vert _0)\) time on these updates, we can ensure that the invariants in (F.2) are satisfied at iteration k.

Altogether, we only need \(O(\Vert A_{:i}\Vert _0)\) time to perform the updates of \(CovLPSolver\) in iteration k when coordinate i is selected. Therefore, each iteration of \(CovLPSolver\) can be implemented to run in an expected \(O(\mathbf {E}_i[\Vert A_{:i}\Vert _0]) = O(N/n)\) time.
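Analogously to the packing case, here is a minimal Python sketch of one lazy iteration, assuming the mirror-step exponent \(\delta \) (from Proposition 6.4), the gradient-step length \(\delta _2\), and the threshold \(2\mathsf {OPT}'\) are computed elsewhere; the names CoveringState and covering_lazy_iteration are illustrative and not part of the paper's pseudocode.

import numpy as np
from scipy.sparse import csc_matrix

class CoveringState:
    """Quantities maintained by the lazy implementation (invariants (F.1)-(F.2))."""
    def __init__(self, A, z0, y0, two_opt):
        self.A = csc_matrix(A)
        self.z_prime = np.asarray(z0, dtype=float).copy()    # z'_k, with z_k = z'_k / sz_k
        self.sz = 1.0                                         # sz_k
        self.sumz = float(self.z_prime.sum())                 # sumz_k = 1^T z'_k
        self.az_prime = self.A @ self.z_prime                 # (F.1):  A z_k = az'_k / sz_k
        self.y_prime = np.asarray(y0, dtype=float).copy()
        self.ay_prime = self.A @ self.y_prime                 # (F.2):  ay'_k = A y'_k
        self.B1, self.B2 = 0.0, 1.0                           # (F.2):  y_k = B1 * z'_k + B2 * y'_k
        self.two_opt = two_opt                                # the threshold 2 * OPT'

def covering_lazy_iteration(s, i, delta, delta_2, tau):
    """One lazy iteration for coordinate i, in O(||A_{:i}||_0) time."""
    lo, hi = s.A.indptr[i], s.A.indptr[i + 1]
    rows, vals = s.A.indices[lo:hi], s.A.data[lo:hi]
    sz_prev = s.sz
    # mirror step: multiply the i-th coordinate of z by e^{-delta}, then rescale if the sum exceeds 2*OPT'
    delta_1 = s.z_prime[i] * (np.exp(-delta) - 1.0)
    s.z_prime[i] += delta_1
    s.az_prime[rows] += delta_1 * vals
    s.sumz += delta_1
    s.sz = sz_prev * max(1.0, s.sumz / (sz_prev * s.two_opt))
    # gradient step y_k = x_k + delta_2 * e_i, performed implicitly via B1, B2, y'_k, ay'_k
    s.B1 = tau / sz_prev + (1.0 - tau) * s.B1
    s.B2 = (1.0 - tau) * s.B2
    c = -s.B1 * delta_1 / s.B2 + delta_2 / s.B2
    s.y_prime[i] += c
    s.ay_prime[rows] += c * vals

As in the packing sketch, each call touches only the nonzero rows of column i, so the per-iteration cost is \(O(\Vert A_{:i}\Vert _0)\), and hence \(O(N/n)\) in expectation over the random choice of i.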
