Abstract
This article reviews the gradient sampling methodology for solving nonsmooth, nonconvex optimization problems. We state an intuitively straightforward gradient sampling algorithm and summarize its convergence properties. Throughout this discussion, we emphasize the simplicity of gradient sampling as an extension of the steepest descent method for minimizing smooth objectives. We provide an overview of various enhancements that have been proposed to improve practical performance, as well as of several extensions that have been proposed in the literature, such as for solving constrained problems. We also clarify certain technical aspects of the analysis of gradient sampling algorithms, most notably related to the assumptions one needs to make about the set of points at which the objective is continuously differentiable. Finally, we discuss possible future research directions.
Dedicated to Krzysztof Kiwiel, in recognition of his fundamental work on algorithms for nonsmooth optimization
Notes
1. Although this fact has been known for decades, most of the examples that appear in the literature are rather artificial since they were designed with exact line searches in mind. Analyses showing that the steepest descent method with inexact line searches converges to nonstationary points of some simple convex nonsmooth functions have appeared recently in [1, 22].
2. This oversight went unnoticed for 12 years until J. Portegies and T. Mitchell brought it to our attention recently.
References
Asl, A., Overton, M.L.: Analysis of the gradient method with an Armijo–Wolfe line search on a class of nonsmooth convex functions. Optim. Methods Softw. (2017). https://doi.org/10.1080/10556788.2019.1673388
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Birgin, E., Martinez, J., Raydan, M.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3), 1–21 (2014)
Burke, J.V., Lin, Q.: The gradient sampling algorithm for directionally Lipschitzian functions (in preparation)
Burke, J.V., Overton, M.L.: Variational analysis of non-Lipschitz spectral functions. Math. Program. 90(2, Ser. A), 317–351 (2001)
Burke, J.V., Lewis, A.S., Overton, M.L.: Approximating subdifferentials by random sampling of gradients. Math. Oper. Res. 27(3), 567–584 (2002)
Burke, J.V., Lewis, A.S., Overton, M.L.: Two numerical methods for optimizing matrix stability. Linear Algebra Appl. 351/352, 117–145 (2002)
Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2005)
Burke, J.V., Henrion, D., Lewis, A.S., Overton, M.L.: HIFOO—a MATLAB package for fixed-order controller design and \(H_\infty\) optimization. In: Fifth IFAC Symposium on Robust Control Design, Toulouse (2006)
Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley, New York (1983). Reprinted by SIAM, Philadelphia, 1990
Crema, A., Loreto, M., Raydan, M.: Spectral projected subgradient with a momentum term for the Lagrangean dual approach. Comput. Oper. Res. 34(10), 3174–3186 (2007)
Curtis, F.E., Overton, M.L.: A sequential quadratic programming algorithm for nonconvex, nonsmooth constrained optimization. SIAM J. Optim. 22(2), 474–500 (2012)
Curtis, F.E., Que, X.: An adaptive gradient sampling algorithm for nonsmooth optimization. Optim. Methods Softw. 28(6), 1302–1324 (2013)
Curtis, F.E., Que, X.: A quasi-Newton algorithm for nonconvex, nonsmooth optimization with global convergence guarantees. Math. Program. Comput. 7(4), 399–428 (2015)
Curtis, F.E., Mitchell, T., Overton, M.L.: A BFGS-SQP method for nonsmooth, nonconvex, constrained optimization and its evaluation using relative minimization profiles. Optim. Methods Softw. 32(1), 148–181 (2017)
Curtis, F.E., Robinson, D.P., Zhou, B.: A self-correcting variable-metric algorithm framework for nonsmooth optimization. IMA J. Numer. Anal. (2019). https://doi.org/10.1093/imanum/drz008
Davis, D., Drusvyatskiy, D.: Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29(1), 207–239 (2019). https://doi.org/10.1137/18M1178244
Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. (2019). https://doi.org/10.1007/s10208-018-09409-5
Estrada, A., Mitchell, I.M.: Control synthesis and classification for unicycle dynamics using the gradient and value sampling particle filters. In: Proceedings of the IFAC Conference on Analysis and Design of Hybrid Systems, pp. 108–114 (2018)
Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, New York (1987)
Fletcher, R.: On the Barzilai-Borwein method. In: Qi, L., Teo, K., Yang, X. (eds.) Optimization and Control with Applications, pp. 235–256. Springer, Boston (2005)
Guo, J., Lewis, A.S.: Nonsmooth variants of Powell’s BFGS convergence theorem. SIAM J. Optim. 28(2), 1301–1311 (2018). https://doi.org/10.1137/17M1121883
Hare, W., Nutini, J.: A derivative-free approximate gradient sampling algorithm for finite minimax problems. Comput. Optim. Appl. 56(1), 1–38 (2013). https://doi.org/10.1007/s10589-013-9547-6
Helou, E.S., Santos, S.A., Simões, L.E.A.: On the differentiability check in gradient sampling methods. Optim. Methods Softw. 31(5), 983–1007 (2016)
Helou, E.S., Santos, S.A., Simões, L.E.A.: On the local convergence analysis of the gradient sampling method for finite max-functions. J. Optim. Theory Appl. 175(1), 137–157 (2017)
Hosseini, S., Uschmajew, A.: A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds. SIAM J. Optim. 27(1), 173–189 (2017). https://doi.org/10.1137/16M1069298
Kiwiel, K.C.: A method for solving certain quadratic programming problems arising in nonsmooth optimization. IMA J. Numer. Anal. 6(2), 137–152 (1986)
Kiwiel, K.C.: Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim. 18(2), 379–388 (2007)
Kiwiel, K.C.: A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim. 20(4), 1983–1994 (2010). https://doi.org/10.1137/090748408
Larson, J., Menickelly, M., Wild, S.M.: Manifold sampling for ℓ1 nonconvex optimization. SIAM J. Optim. 26(4), 2540–2563 (2016). https://doi.org/10.1137/15M1042097
Lemaréchal, C., Oustry, F., Sagastizábal, C.: The U-Lagrangian of a convex function. Trans. Am. Math. Soc. 352(2), 711–729 (2000)
Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3), 702–725 (2002)
Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-Newton methods. Math. Program. 141(1–2, Ser. A), 135–163 (2013). https://doi.org/10.1007/s10107-012-0514-2
Lin, Q.: Sparsity and nonconvex nonsmooth optimization. Ph.D. thesis, Department of Mathematics, University of Washington (2009)
Loreto, M., Aponte, H., Cores, D., Raydan, M.: Nonsmooth spectral gradient methods for unconstrained optimization. EURO J. Comput. Optim. 5(4), 529–553 (2017)
Mifflin, R., Sagastizábal, C.: A VU-algorithm for convex minimization. Math. Program. 104(2–3), 583–608 (2005)
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017). https://doi.org/10.1007/s10208-015-9296-2
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Rockafellar, R.T.: Lagrange multipliers and subderivatives of optimal value functions in nonlinear programming. In: Sorensen, D.C., Wets, R.J.B. (eds.) Mathematical Programming Studies, Chap. 3, pp. 28–66. North-Holland, Amsterdam (1982). http://www.springerlink.com/index/g03582565267714p.pdf
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 317. Springer, Berlin (1998). https://doi.org/10.1007/978-3-642-02431-3
Tang, C.M., Liu, S., Jian, J.B., Li, J.L.: A feasible SQP-GS algorithm for nonconvex, nonsmooth constrained optimization. Numer. Algorithms 65(1), 1–22 (2014). https://doi.org/10.1007/s11075-012-9692-5
Traft, N., Mitchell, I.M.: Improved action and path synthesis using gradient sampling. In: Proceedings of the IEEE Conference on Decision and Control, pp. 6016–6023 (2016)
Appendices
Appendix 1
This appendix is devoted to justifying the requirement that D, the set of points on which the locally Lipschitz function f is continuously differentiable, must be an open full-measure subset of \(\mathbb {R}^n\), instead of the original assumption in [8] that D should be an open and dense set in \(\mathbb {R}^n\).
There are two ways in which the analyses in [8, 28] actually depend on D having full measure:
1. The most obvious is that both papers require that the points sampled in each iteration should lie in D, and a statement is made in both papers that this occurs with probability one, but this is not the case if D is assumed only to be an open dense subset of \(\mathbb {R}^n\). However, as already noted earlier and justified in Appendix 2, this requirement can be relaxed, as in Algorithm GS given in Sect. 6.2, to require only that f be differentiable at the sampled points.
2. The set D must have full measure for Property 6.1, stated below, to hold. The proofs in [8, 28] depend critically on this property, which follows from [6, Eq. (1.2)] (where it was stated without proof). For completeness we give a proof here, followed by an example that demonstrates the necessity of the full measure assumption.
Property 6.1
Assume that D has full measure and, for 𝜖 > 0, let
$$\displaystyle \begin{aligned} G_\epsilon({\boldsymbol x}) := \operatorname{cl}\operatorname{conv} \nabla f\left(\bar{B}({\boldsymbol x};\epsilon)\cap D\right). \end{aligned}$$
For all 𝜖 > 0 and all \({\boldsymbol x}\in \mathbb {R}^n\), one has \(\partial f({\boldsymbol x}) \subseteq G_\epsilon({\boldsymbol x})\), where \(\partial f\) is the Clarke subdifferential set presented in Definition 1.8.
Proof of Property 6.1
Let \({\boldsymbol x}\in \mathbb {R}^n\) and \(v \in \partial f({\boldsymbol x})\). We have from [10, Theorem 2.5.1] that Theorem 1.2 can be stated in a more general manner. Indeed, for any set S with zero measure, and considering \(\Omega_f\) to be the set of points at which f fails to be differentiable, the following holds:
$$\displaystyle \begin{aligned} \partial f({\boldsymbol x}) = \operatorname{conv}\left\{ \lim_{j\to\infty} \nabla f({\boldsymbol y}^j) \,:\, {\boldsymbol y}^j \to {\boldsymbol x},\ {\boldsymbol y}^j \notin S \cup \Omega_f \right\}. \end{aligned}$$
In particular, since D has full measure and f is differentiable on D, one may take \(S = \mathbb{R}^n\setminus D\) (so that \(\Omega_f \subseteq S\)), and it follows that
$$\displaystyle \begin{aligned} \partial f({\boldsymbol x}) = \operatorname{conv}\left\{ \lim_{j\to\infty} \nabla f({\boldsymbol y}^j) \,:\, {\boldsymbol y}^j \to {\boldsymbol x},\ {\boldsymbol y}^j \in D \right\}. \end{aligned}$$
Considering this last relation and Carathéodory’s theorem, it follows that \(v = \sum_{i=1}^{n+1} \lambda_i \boldsymbol{\xi}^i\) for some \(\lambda_i \geq 0\) with \(\sum_{i=1}^{n+1} \lambda_i = 1\), where, for all \(i \in \{1,\dots,n+1\}\), one has \(\boldsymbol {\xi }^i = \lim \limits _j \nabla f({\boldsymbol y}^{j,i})\) for some sequence \(\{{\boldsymbol y}^{j,i}\}_{j\in \mathbb {N}} \subset D\) converging to x. Hence, there must exist a sufficiently large \(j_i \in \mathbb {N}\) such that, for all \(j \geq j_i\), one obtains
$$\displaystyle \begin{aligned} {\boldsymbol y}^{j,i} \in \bar{B}({\boldsymbol x};\epsilon) \cap D. \end{aligned}$$
Recalling that \(G_\epsilon({\boldsymbol x})\) is the closure of \( \operatorname {\mathrm {conv}}\nabla f\left (\bar {B}({\boldsymbol x}; \epsilon )\cap D\right )\), it follows that \(\boldsymbol{\xi}^i \in G_\epsilon({\boldsymbol x})\) for all \(i \in \{1,\dots,n+1\}\). Moreover, since \(G_\epsilon({\boldsymbol x})\) is convex, we have \(v \in G_\epsilon({\boldsymbol x})\). The result follows since \({\boldsymbol x} \in \mathbb {R}^n\) and \(v \in \partial f({\boldsymbol x})\) were arbitrarily chosen. \(\square \)
With the assumption that D has full measure, Property 6.1 holds and hence the proofs of the results in [8, 28] are all valid. In particular, the proof of (ii) in [28, Lemma 3.2], which borrows from [8, Lemma 3.2], depends on Property 6.1. See also [8, the top of p. 762].
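Before turning to the counterexample, a minimal numerical sketch (ours, not part of the original analysis) may help fix ideas: for f(x) = |x| one has \(D = \mathbb{R}\setminus\{0\}\) and ∂f(0) = [−1, 1], and gradients sampled from a ball around 0 generate an interval containing the subdifferential, exactly as Property 6.1 asserts. In one dimension the convex hull of the sampled gradients is simply an interval, so it can be inspected without solving a quadratic program.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_abs(y):
    """Gradient of f(x) = |x| at any nonzero y, i.e., at any y in D."""
    return np.sign(y)

x, eps, m = 0.0, 0.1, 50
samples = x + eps * rng.uniform(-1.0, 1.0, size=m)  # points in B(x; eps)
samples = samples[samples != 0.0]                   # lie in D with probability one
grads = grad_abs(samples)

# In one dimension, conv of the sampled gradients is the interval
# [min, max]; as the sample grows it fills G_eps(0) = [-1, 1], which
# contains the Clarke subdifferential of |.| at 0, as Property 6.1 asserts.
print("sampled approximation of G_eps(0): [%g, %g]" % (grads.min(), grads.max()))

# The minimum-norm element of this interval is 0: the stationarity
# measure that gradient sampling methods drive toward zero.
lo, hi = grads.min(), grads.max()
min_norm = 0.0 if lo <= 0.0 <= hi else min(abs(lo), abs(hi))
print("min-norm element:", min_norm)
```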
The following example shows that Property 6.1 might not hold if D is assumed only to be an open dense set, not necessarily of full measure.
Example 6.2
Let δ ∈ (0, 1) and \(\{q_k\}_{k\in \mathbb {N}}\) be an enumeration of the rational numbers in (0, 1). Define
$$\displaystyle \begin{aligned} \mathcal{Q}_k := \left(q_k - \frac{\delta}{2^{k+1}},\, q_k + \frac{\delta}{2^{k+1}}\right)\cap(0,1)\ \ \text{for all}\ k\in\mathbb{N}, \qquad D := \bigcup_{k\in\mathbb{N}} \mathcal{Q}_k. \end{aligned}$$
Clearly, its Lebesgue measure satisfies 0 < λ(D) ≤ δ. Moreover, the set D is an open dense subset of [0, 1]. Now, let \(i_{D}:[0,1]\to \mathbb {R}\) be the indicator function of the set D,
$$\displaystyle \begin{aligned} i_D(x) := \begin{cases} 1 & \text{if}\ x \in D, \\ 0 & \text{if}\ x \notin D. \end{cases} \end{aligned}$$
Then, considering the Lebesgue integral, we define the function \(f:[0,1]\to \mathbb {R}\),
$$\displaystyle \begin{aligned} f(x) := \int_0^x i_D(s)\,\mathrm{d}s. \end{aligned}$$
Let us prove that f is a Lipschitz continuous function on (0, 1). To see this, note that given any a, b ∈ (0, 1) with b > a, it follows that
$$\displaystyle \begin{aligned} |f(b) - f(a)| = \int_a^b i_D(s)\,\mathrm{d}s \leq b - a, \end{aligned}$$
which ensures that f is a Lipschitz continuous function on (0, 1). Consequently, the Clarke subdifferential set of f at any point in (0, 1) is well defined. Moreover, we claim that, for all \(k\in \mathbb {N}\), f is continuously differentiable at any point \(q\in \mathcal {Q}_k\) and the following holds:
$$\displaystyle \begin{aligned} f'(q) = i_D(q) = 1. \end{aligned} \tag{6.7}$$
Indeed, given any \(q\in \mathcal {Q}_k\), we have
$$\displaystyle \begin{aligned} \frac{f(q+t)-f(q)}{t} = \frac{1}{t}\int_q^{q+t} i_D(s)\,\mathrm{d}s \quad\text{for any}\ t > 0\ \text{with}\ q + t < 1. \end{aligned}$$
Since \(\mathcal {Q}_k\) is an open set, we can find \(\overline t>0\) such that \([q,q+t]\subset \mathcal {Q}_k\subset D\), for all \(t\leq \overline t\). Hence, given any \(t\in (0,\overline {t}]\), it follows that
$$\displaystyle \begin{aligned} \frac{f(q+t)-f(q)}{t} = \frac{1}{t}\int_q^{q+t} 1\,\mathrm{d}s = 1 = i_D(q). \end{aligned}$$
The same reasoning can be used to see that the left derivative of f at q exists and is equal to i D(q). Consequently, we have f′(q) = i D(q) = 1 for all \(q\in \mathcal {Q}_k\), which yields that f is continuously differentiable on D.
By the Lebesgue differentiation theorem, we know that f′(x) = i D(x) almost everywhere. Since the set [0, 1] ∖ D does not have measure zero, this means that there must exist z ∈ [0, 1] ∖ D such that f′(z) = i D(z) = 0. Defining \(\epsilon := \min \{z,1-z\}/2\), we see, by (6.7), that the set
$$\displaystyle \begin{aligned} G_\epsilon(z) = \operatorname{cl}\operatorname{conv} f'\left(\bar{B}(z;\epsilon)\cap D\right) \end{aligned}$$
is the singleton \(G_\epsilon(z) = \{1\}\). However, since f′(z) = 0, it follows that \(0 \in \partial f(z)\), which implies \(\partial f(z) \not\subseteq G_\epsilon(z)\).
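The smallness of D in this construction can also be checked numerically. The sketch below, which is illustrative only and uses the interval choice displayed above, enumerates the first K rationals in (0, 1), builds the corresponding intervals \(\mathcal{Q}_k\), and bounds the Lebesgue measure of their union by merging overlaps; any enumeration of the rationals would do, and the one used here is chosen purely for convenience.

```python
from fractions import Fraction

delta, K = 0.1, 2000

# Enumerate K distinct rationals q_k in (0, 1) by increasing denominator.
rationals, seen = [], set()
den = 2
while len(rationals) < K:
    for num in range(1, den):
        q = Fraction(num, den)
        if q not in seen:
            seen.add(q)
            rationals.append(float(q))
    den += 1
rationals = rationals[:K]

# Q_k has half-width delta / 2**(k+1) for k = 1, 2, ...; the Python index
# starts at 0, hence the exponent k + 2.  (Clipping to (0, 1) is skipped,
# which can only enlarge the computed bound.)
intervals = sorted(
    (q - delta / 2 ** (k + 2), q + delta / 2 ** (k + 2))
    for k, q in enumerate(rationals)
)

# Merge overlapping intervals and total their lengths.
measure = 0.0
cur_lo, cur_hi = intervals[0]
for lo, hi in intervals[1:]:
    if lo > cur_hi:
        measure += cur_hi - cur_lo
        cur_lo, cur_hi = lo, hi
    else:
        cur_hi = max(cur_hi, hi)
measure += cur_hi - cur_lo

print("measure of the first %d intervals of D: %.6f (<= delta = %.1f)"
      % (K, measure, delta))
```

So D contains every rational in (0, 1), hence is dense, yet its measure never exceeds δ, which is precisely why the full measure assumption cannot be dropped.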
Note that it is stated on [8, p. 754] and [28, p. 381] that the following holds: for all \(0 \leq \epsilon_1 < \epsilon_2\) and all \({\boldsymbol x}\in \mathbb {R}^n\), one has \(\bar \partial _{\epsilon _1} f({\boldsymbol x}) \subseteq G_{\epsilon _2}({\boldsymbol x})\). Property 6.1 is a special case of this statement with \(\epsilon_1 = 0\), and hence this statement too holds only under the full measure assumption.
Finally, it is worth mentioning that in practice, the full measure assumption on D usually holds. In particular, whenever a real-valued function is semi-algebraic (or, more generally, “tame”)—in other words, for all practical purposes virtually always—it is continuously differentiable on an open set of full measure. Hence, the original proofs hold in such contexts.
Appendix 2
In this appendix, we summarize why it is not necessary that the iterates and sampled points of the algorithm lie in the set D on which f is continuously differentiable: rather, it is sufficient to ensure that f is differentiable at these points, as in Algorithm GS. We do this by outlining how to modify the proofs in [28] to extend to this case.
1. That the gradients at the sampled points \(\{{\boldsymbol x}^{k,j}\}\) exist follows with probability one from Rademacher’s theorem, while the existence of the gradients at the iterates \(\{{\boldsymbol x}^k\}\) is ensured by the statement of Algorithm GS. Notice that the proof of part (ii) of [28, Theorem 3.3] still holds in our setting with the statement that the components of the sampled points are “sampled independently and uniformly from \(\bar {B}({\boldsymbol x}^k;\epsilon )\cap D\)” replaced with “sampled independently and uniformly from \(\bar {B}({\boldsymbol x}^k;\epsilon )\)”.
2. One needs to verify that f being differentiable at \({\boldsymbol x}^k\) is enough to ensure that the line search procedure presented in (6.3) terminates finitely. This is straightforward. Since \(\nabla f({\boldsymbol x}^k)\) exists, the directional derivative along any vector \({\boldsymbol d}\in \mathbb {R}^n\setminus \{0\}\) is given by \(f'({\boldsymbol x}^k;{\boldsymbol d}) = \nabla f({\boldsymbol x}^k)^T{\boldsymbol d}\). Furthermore, since \(-\nabla f({\boldsymbol x}^k)^T{\boldsymbol g}^k \leq -\|{\boldsymbol g}^k\|^2\) (see [8, p. 756]), it follows, for any β ∈ (0, 1), that there exists \(\overline t>0\) such that
$$\displaystyle \begin{aligned} f({\boldsymbol x}^k-t{\boldsymbol g}^k) < f({\boldsymbol x}^k) - t\beta\|{\boldsymbol g}^k\|^2\ \ \text{for any}\ \ t \in (0,\overline t). \end{aligned}$$
This shows that the line search is well defined; a sketch of such a backtracking procedure is given at the end of this appendix.
3. The only place where we actually need to modify the proof in [28] concerns item (ii) in Lemma 3.2, where it is stated that \(\nabla f({\boldsymbol x}^k) \in G_\epsilon (\bar {\boldsymbol x})\) (for a particular point \(\bar {\boldsymbol x}\)) because \({\boldsymbol x}^k \in \bar {B}(\bar {\boldsymbol x};\epsilon /3) \cap D\); the latter is not true if \({\boldsymbol x}^k\notin D\). However, using Property 6.1, we have
$$\displaystyle \begin{aligned} \nabla f({\boldsymbol x}^k)\in {\partial} f({\boldsymbol x}^k)\subset G_{\epsilon/3}({\boldsymbol x}^k) \subset G_{\epsilon}(\bar {\boldsymbol x}) \text{ when } {\boldsymbol x}^k\in \bar{B}(\bar {\boldsymbol x};\epsilon/3), \end{aligned}$$
and therefore \(\nabla f({\boldsymbol x}^k) \in G_\epsilon (\bar {\boldsymbol x})\) even when \({\boldsymbol x}^k\notin D\).
Finally, although it was convenient in Appendix 1 to state Property 6.1 in terms of D, it actually holds if D is replaced by any full measure set on which f is differentiable. Nonetheless, it is important to note that the proofs of the results in [8, 28] do require that f be continuously differentiable on D. This assumption is used in the proof of (i) in [28, Lemma 3.2].
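To complement item 2 above, the following is a minimal sketch of a backtracking realization of the line search condition in (6.3). The parameter names beta (sufficient decrease constant), gamma (backtracking factor), and t0 (initial step) are placeholders of ours, not notation taken from the chapter; as argued in item 2, differentiability of f at the current iterate suffices for the loop to terminate.

```python
import numpy as np

def line_search(f, x, g, beta=1e-4, gamma=0.5, t0=1.0, max_iter=100):
    """Backtrack until f(x - t*g) < f(x) - t*beta*||g||^2 and return t."""
    fx, gnorm2 = f(x), float(np.dot(g, g))
    t = t0
    for _ in range(max_iter):
        if f(x - t * g) < fx - t * beta * gnorm2:
            return t
        t *= gamma  # shrink the trial step
    return 0.0  # safeguard; not reached when g is a genuine descent direction

# Usage on f(x) = ||x||_1 at a point with no zero components, where f is
# differentiable and the gradient is the sign vector:
f = lambda x: np.abs(x).sum()
x = np.array([1.0, -2.0])
g = np.sign(x)
print(line_search(f, x, g))  # prints 1.0: the full step already decreases f
```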
Acknowledgements
The authors would like to acknowledge the following financial support. J.V. Burke was supported in part by the U.S. National Science Foundation grant DMS-1514559. F.E. Curtis was supported in part by the U.S. Department of Energy grant DE-SC0010615. A.S. Lewis was supported in part by the U.S. National Science Foundation grant DMS-1613996. M.L. Overton was supported in part by the U.S. National Science Foundation grant DMS-1620083. L.E.A. Simões was supported in part by the São Paulo Research Foundation (FAPESP), Brazil, under grants 2016/22989-2 and 2017/07265-0.