Abstract
Over the past few years, there has been increasing interest in understanding the geometry of the non-convex programs that arise naturally in machine learning. It is particularly interesting to identify additional properties of the non-convex objective under which popular optimization methods, such as gradient descent, escape saddle points and converge to a local minimum. The strict saddle property (Definition 5.6) is one such property, and it has been shown to hold in a broad range of applications.
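The escape behavior described above can be illustrated with a minimal sketch of perturbed gradient descent on a toy strict saddle function. The function, step size, noise level, and iteration count below are illustrative choices, not taken from the chapter: f(x, y) = x⁴/4 − x²/2 + y²/2 has a strict saddle at the origin (its Hessian there is diag(−1, 1), so the smallest eigenvalue is strictly negative) and local minima at (±1, 0).

```python
import random

def grad(x, y):
    # Gradient of f(x, y) = x**4/4 - x**2/2 + y**2/2.
    # (0, 0) is a strict saddle: the Hessian there is diag(-1, 1),
    # giving a strict negative-curvature direction along x.
    return x**3 - x, y

def perturbed_gd(x, y, step=0.1, iters=500, noise=1e-3, seed=0):
    rng = random.Random(seed)
    for _ in range(iters):
        gx, gy = grad(x, y)
        # A small random perturbation pushes the iterate off the saddle
        # along the negative-curvature direction, after which plain
        # gradient descent carries it to a local minimum.
        x = x - step * gx + noise * rng.uniform(-1, 1)
        y = y - step * gy + noise * rng.uniform(-1, 1)
    return x, y

x, y = perturbed_gd(0.0, 0.0)  # start exactly at the saddle
print(f"converged near ({x:.2f}, {y:.2f})")  # |x| close to 1: a local minimum
```

Started exactly at the saddle, unperturbed gradient descent would stay there forever; the perturbation is what makes escape happen, which is the mechanism formalized in the escape-time analyses this chapter surveys.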
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Shi, B., Iyengar, S.S. (2020). Related Work on Geometry of Non-Convex Programs. In: Mathematical Theories of Machine Learning - Theory and Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-17076-9_6
Print ISBN: 978-3-030-17075-2
Online ISBN: 978-3-030-17076-9