
Projective Approximation Based Quasi-Newton Methods

  • Conference paper
  • Machine Learning, Optimization, and Big Data (MOD 2017)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10710)

Abstract

We consider the problem of optimizing a convex function of a vector parameter. Many quasi-Newton optimization methods require constructing and storing an approximation of the Hessian matrix or its inverse to take the function curvature into account, which imposes high computational and memory requirements. We propose four quasi-Newton methods based on consecutive projective approximation. The idea of these methods is to approximate the product of the inverse Hessian and the gradient in a low-dimensional space using an appropriate projection, and then to reconstruct it back to the original space as the search direction for the next estimate. By exploiting Hessian rank deficiency in a special way, the proposed approach requires storing neither the Hessian matrix nor its inverse, thus reducing memory requirements. We give a theoretical motivation for the proposed algorithms and prove several properties of the corresponding estimates. Finally, we compare the proposed methods with several existing ones on modelled data. Although the proposed algorithms turned out to be inferior to the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method, they have the important advantage of being easy to extend and improve. Moreover, two of them do not require knowledge of the function gradient.
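For illustration, the following is a minimal Python sketch of a projective quasi-Newton-style step in the spirit described above: recent iterates span a low-dimensional subspace, the projected Hessian is approximated from gradient differences, and the resulting Newton-like direction is lifted back to the original space. The function name `projective_qn_direction`, the subspace dimension `q`, the QR-based projection, the secant-style least-squares fit and the regularization `reg` are illustrative assumptions, not the exact construction used in the paper.

```python
import numpy as np

def projective_qn_direction(xs, gs, q=3, reg=1e-8):
    """Illustrative sketch (not the paper's exact algorithm): approximate the
    product of the inverse Hessian and the gradient in a low-dimensional
    subspace spanned by recent steps, then lift the direction back to R^d."""
    X = np.asarray(xs, dtype=float)   # K x d recent iterates
    G = np.asarray(gs, dtype=float)   # K x d corresponding gradients
    S = np.diff(X, axis=0)[-q:]       # last q step differences
    Y = np.diff(G, axis=0)[-q:]       # last q gradient differences
    # Orthonormal basis of the step subspace: P is q x d with P P^T = I_q.
    Pt, _ = np.linalg.qr(S.T)         # d x q, orthonormal columns
    P = Pt.T
    # Secant-style fit of the projected Hessian Q ~ P H P^T from H s ~ y.
    Sp, Yp = S @ Pt, Y @ Pt           # project steps and gradient differences
    Q = np.linalg.lstsq(Sp, Yp, rcond=None)[0]
    Q = 0.5 * (Q + Q.T) + reg * np.eye(Q.shape[0])   # symmetrize, regularize
    # Newton-like direction in the subspace, reconstructed in R^d.
    return -Pt @ np.linalg.solve(Q, P @ G[-1])
```

A line search along the returned direction (for example, backtracking on the Armijo condition) would then produce the next iterate; without a positive-definiteness safeguard on the small matrix, descent is not guaranteed in this sketch.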


References

  1. Box, G.E., Draper, N.R., et al.: Empirical Model-Building and Response Surfaces. Wiley, New York (1987)
  2. Boyd, S.: Global optimization in control system analysis and design. In: Control and Dynamic Systems V53: High Performance Systems Techniques and Applications: Advances in Theory and Applications, vol. 53, p. 1 (2012)
  3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  4. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA J. Appl. Math. 6(1), 76–90 (1970)
  5. Conn, A.R., Gould, N.I., Toint, P.L.: Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 50(1–3), 177–195 (1991)
  6. Davidon, W.C.: Variable metric method for minimization. SIAM J. Optim. 1(1), 1–17 (1991)
  7. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theor. 52(4), 1289–1306 (2006)
  8. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, New York (1987)
  9. Ford, J., Moghrabi, I.: Multi-step quasi-Newton methods for optimization. J. Comput. Appl. Math. 50(1–3), 305–323 (1994)
  10. Forrester, A., Keane, A.: Recent advances in surrogate-based optimization. Prog. Aerosp. Sci. 45(1), 50–79 (2009)
  11. Granichin, O., Volkovich, Z.V., Toledano-Kitai, D.: Randomized Algorithms in Automatic Control and Data Mining, vol. 67. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-642-54786-7
  12. Granichin, O.N.: Stochastic approximation search algorithms with randomization at the input. Autom. Remote Control 76(5), 762–775 (2015)
  13. Hoffmann, W.: Iterative algorithms for Gram-Schmidt orthogonalization. Computing 41(4), 335–348 (1989)
  14. Krause, A.: SFO: a toolbox for submodular function optimization. J. Mach. Learn. Res. 11, 1141–1144 (2010)
  15. Nesterov, Y.: Introductory Lectures on Convex Programming, Volume I: Basic Course. Citeseer (1998)
  16. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35(151), 773–782 (1980)
  17. Polyak, B.T.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software Inc., Publications Division, New York (1987)
  18. Senov, A.: Accelerating gradient descent with projective response surface methodology. In: Battiti, R., Kvasov, D.E., Sergeyev, Y.D. (eds.) LION 2017. LNCS, vol. 10556, pp. 376–382. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69404-7_34


Acknowledgments

This work was supported by the Russian Science Foundation (project 16-19-00057).

Author information

Correspondence to Alexander Senov.


A Proofs

Proof (of Proposition 1).

$$\begin{aligned}&\partial _{\mathbf {x}} \mathop {\arg \!\min }\limits _{ \{\mathbf {x}\in \mathbb {R}^d\,:\,\mathbf {P}\mathbf {x} = \widehat{\mathbf {z}}\} } \sum \limits _{t=1}^{K} \left\| \mathbf {x}_t - \mathbf {x} \right\| _2^2 \\&=\partial _{\mathbf {x}} \sum \limits _{t=1}^{K} \left\| \mathbf {P}^\top \mathbf {P}\mathbf {x}_t + \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \mathbf {x}_t - \left( \mathbf {P}^\top \widehat{\mathbf {z}} + \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \mathbf {x}\right) \right\| _2^2 \\&=\partial _{\mathbf {x}} \sum \limits _{t=1}^{K} \left\| \mathbf {P}^\top \mathbf {P}\mathbf {x}_t - \mathbf {P}^\top \widehat{\mathbf {z}} \right\| _2^2 + \partial _{\mathbf {x}} \sum \limits _{t=1}^{K} \left\| \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \left( \mathbf {x}_t - \mathbf {x} \right) \right\| _2^2 \\&= \sum \limits _{t=1}^{K} 2 (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\left( (\mathbf {I} - \mathbf {P}^\top \mathbf {P}) \mathbf {x}_t - \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \mathbf {x} \right) \\&=2 (\mathbf {I} - \mathbf {P}^\top \mathbf {P}) \sum \limits _{t=1}^{K} \mathbf {x}_t - 2 K \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \mathbf {x} = 0. \end{aligned}$$

Since \((\mathbf {I}-\mathbf {P}^\top \mathbf {P})\) is not invertible, the above equation has an infinite number of solutions. Hence, we are free to choose any one of them, e.g. \(\mathbf {x}=\frac{1}{K}\sum _{t=1}^{K} \mathbf {x}_t\).    \(\square \)
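As a quick sanity check of this argument, the following Python snippet (randomly generated data, purely illustrative and not part of the paper) verifies numerically that \(\widehat{\mathbf {x}} = \mathbf {P}^\top \widehat{\mathbf {z}} + (\mathbf {I}-\mathbf {P}^\top \mathbf {P})\overline{\mathbf {x}}\) minimizes \(\sum _t \Vert \mathbf {x}_t-\mathbf {x}\Vert _2^2\) over \(\{\mathbf {x}: \mathbf {P}\mathbf {x}=\widehat{\mathbf {z}}\}\), by comparing it with the closed-form solution obtained through a null-space parametrization of the constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, K = 8, 3, 20
P = np.linalg.qr(rng.standard_normal((d, q)))[0].T   # q x d, rows orthonormal: P P^T = I_q
X = rng.standard_normal((K, d))                      # points x_1, ..., x_K
z_hat = rng.standard_normal(q)

x_bar = X.mean(axis=0)
x_hat = P.T @ z_hat + (np.eye(d) - P.T @ P) @ x_bar  # candidate from Proposition 1

# Brute-force check: parametrize the feasible set {x : P x = z_hat} as
# x = P^T z_hat + N w, with N an orthonormal basis of the null space of P,
# and minimize the quadratic objective over w in closed form.
N = np.linalg.svd(P)[2][q:].T                        # d x (d - q) null-space basis
w_opt = N.T @ (x_bar - P.T @ z_hat)
x_opt = P.T @ z_hat + N @ w_opt

assert np.allclose(P @ x_hat, z_hat)                 # feasibility of the candidate
assert np.allclose(x_hat, x_opt)                     # same constrained minimizer
```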

Proof (of Proposition 2). From Proposition 1 and the fact that \(\mathop {\arg \!\min }\limits _{\mathbf {z}\in \mathbb {R}^q}\, f(\mathbf {P}^\top \mathbf {z} + \mathbf {v}) = -\mathbf {P} \mathbf {H}^{-1} \mathbf {b}\) it follows that \(\widehat{\mathbf {x}} = \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \overline{\mathbf {x}} - \mathbf {P}^\top \mathbf {P} \mathbf {H}^{-1}\mathbf {b} \). Hence

$$\begin{aligned}&\left\| \arg \!\min \, f - \widehat{\mathbf {x}} \right\| _2^2 = \left\| -\mathbf {H}^{-1}\mathbf {b} - \widehat{\mathbf {x}} \right\| _2^2 \\&\qquad = \left\| - \left( \mathbf {I} - \mathbf {P}^\top \mathbf {P}\right) \overline{\mathbf {x}} + \mathbf {P}^\top \mathbf {P} \mathbf {H}^{-1}\mathbf {b} - \mathbf {H}^{-1} \mathbf {b} \right\| _2^2 \\&\qquad = \left\| (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\left( \mathbf {H}^{-1} \mathbf {b} - \overline{\mathbf {x}} \right) \right\| _2^2. \end{aligned}$$

   \(\square \)
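The algebraic core of this step is the identity \(\mathbf {u} - \big ((\mathbf {I}-\mathbf {P}^\top \mathbf {P})\overline{\mathbf {x}} + \mathbf {P}^\top \mathbf {P}\mathbf {u}\big ) = (\mathbf {I}-\mathbf {P}^\top \mathbf {P})(\mathbf {u}-\overline{\mathbf {x}})\), with \(\mathbf {u}\) in the role of the unconstrained minimizer \(\arg \!\min f\) (signs depend on the convention used for \(\mathbf {H}^{-1}\mathbf {b}\)). The short Python check below (random data, purely illustrative) confirms this identity numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 10, 4
P = np.linalg.qr(rng.standard_normal((d, q)))[0].T   # q x d, P P^T = I_q
u, x_bar = rng.standard_normal(d), rng.standard_normal(d)

proj_c = np.eye(d) - P.T @ P                         # projector onto the complement of row(P)
x_hat = proj_c @ x_bar + P.T @ P @ u                 # subspace estimate built around x_bar
# The estimation error equals the complement projection of (u - x_bar).
assert np.allclose(u - x_hat, proj_c @ (u - x_bar))
```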

Proof (of Proposition 3). First,

$$\begin{aligned} \Vert \mathbf {H} - \widehat{\mathbf {H}}\Vert _F^2 = \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {H} (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\Vert _F ^2 + \Vert \mathbf {P}^\top (\mathbf {P} \mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}}) \mathbf {P}\Vert _F^2 \end{aligned}$$

One can note that the entries \(\widehat{\mathbf {Q}}_{i,j}\) are normally distributed random variables such that \(\mathrm {E}\left[ \widehat{\mathbf {Q}}\right] = \mathbf {P}\mathbf {H}\mathbf {P}^\top \) (see, for example, [1]). Moreover, consider the vectorization \(\widehat{\mathbf {\theta }}\) of the upper triangle of the matrix \(\widehat{\mathbf {Q}}\):

$$\begin{aligned} \widehat{\mathbf {\theta }} = \left( \widehat{\mathbf {Q}}_{1,1}, \widehat{\mathbf {Q}}_{1,2}, \ldots , \widehat{\mathbf {Q}}_{q-1,q}, \widehat{\mathbf {Q}}_{q,q}\right) , \end{aligned}$$

Its covariance matrix is equal to \(\mathbf {\Sigma }_\theta = \frac{\sigma _\varepsilon }{m}\ddot{\mathbf {Z}}\ddot{\mathbf {Z}}^\top \), where the rows \(\ddot{\mathbf {Z}}_{i,\cdot }\) consist of the quadratic elements of \(\mathbf {z}_i = \mathbf {P}\mathbf {x}_i\):

$$\begin{aligned} \ddot{\mathbf {Z}}_{i,\cdot } = \left( \mathbf {z}_{i}^{(1)} \mathbf {z}_{i}^{(1)}, \mathbf {z}_{i}^{(1)} \mathbf {z}_{i}^{(2)}, \ldots , \mathbf {z}_{i}^{(q - 1)} \mathbf {z}_{i}^{(q)}, \mathbf {z}_{i}^{(q)} \mathbf {z}_{i}^{(q)}\right) ^\top . \end{aligned}$$

Next, denote by \(\mathbf {\theta }\) the vectorization of the upper triangle of the matrix \(\mathbf {P} \mathbf {H}\mathbf {P}^\top \) and consider the eigendecomposition \(\mathbf {\Sigma }_\theta = \mathbf {U} \mathbf {\Lambda } \mathbf {U}^\top \). Then the vector \(\widehat{\mathbf {\beta }} = \mathbf {U}\widehat{\mathbf {\theta }}\) has a Gaussian distribution with covariance matrix \(\mathbf {\Lambda }\), and

$$\begin{aligned} \Vert \mathbf {P}^\top (\mathbf {P} \mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}}) \mathbf {P}\Vert _F^2 \le \Vert \mathbf {P}^\top \Vert _F^2 \Vert \mathbf {P} \mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}} \Vert _F^2 \Vert \mathbf {P}\Vert _F^2 = q^2 \Vert \mathbf {P} \mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}} \Vert _F^2 . \\ \Vert \mathbf {P} \mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}} \Vert _F^2 = \Vert \widehat{\mathbf {\theta }} - \mathbf {\theta } \Vert _2^2 = \left( \mathbf {U}(\widehat{\mathbf {\theta }} - \mathbf {\theta })\right) ^\top \left( \mathbf {U}(\widehat{\mathbf {\theta }} - \mathbf {\theta })\right) = \sum _{i=1}^{q^2} \xi _i^2 \sim \sum _{i=1}^{q^2} \lambda _i^2 \chi ^2(1). \end{aligned}$$

Thus, \(C(\mathbf {X}\mathbf {P}^\top ) = \max \limits _i \lambda _i^2\).

   \(\square \)
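The first inequality above only uses submultiplicativity of the Frobenius norm together with \(\Vert \mathbf {P}\Vert _F^2 = \Vert \mathbf {P}^\top \Vert _F^2 = q\) for a projection with orthonormal rows. A small Python check of that bound (random matrices, illustrative only; the matrix A stands in for \(\mathbf {P}\mathbf {H}\mathbf {P}^\top - \widehat{\mathbf {Q}}\)):

```python
import numpy as np

rng = np.random.default_rng(2)
d, q = 12, 3
P = np.linalg.qr(rng.standard_normal((d, q)))[0].T   # q x d with P P^T = I_q
A = rng.standard_normal((q, q))                      # stand-in for P H P^T - Q_hat

lhs = np.linalg.norm(P.T @ A @ P, "fro") ** 2
rhs = q ** 2 * np.linalg.norm(A, "fro") ** 2
assert np.isclose(np.linalg.norm(P, "fro") ** 2, q)  # ||P||_F^2 = q
assert lhs <= rhs + 1e-12                            # q^2 bound holds
# Since P has orthonormal rows, ||P^T A P||_F actually equals ||A||_F here,
# so the q^2 factor is a conservative (but valid) bound.
assert np.isclose(lhs, np.linalg.norm(A, "fro") ** 2)
```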


Copyright information

© 2018 Springer International Publishing AG

About this paper


Cite this paper

Senov, A. (2018). Projective Approximation Based Quasi-Newton Methods. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol. 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_3


  • DOI: https://doi.org/10.1007/978-3-319-72926-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72925-1

  • Online ISBN: 978-3-319-72926-8

