Related Work on Geometry of Non-Convex Programs

Abstract

Over the past few years, there has been increasing interest in understanding the geometry of the non-convex programs that naturally arise in machine learning. It is particularly interesting to identify additional properties of a non-convex objective under which popular optimization methods (such as gradient descent) escape saddle points and converge to a local minimum. The strict saddle property (Definition 5.6) is one such property, and it has been shown to hold in a broad range of applications.
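
To make the escape mechanism concrete, here is a minimal sketch in Python of perturbed gradient descent in the spirit of Ge et al. [16] and Jin et al. [24]. The toy objective f(x, y) = x^4/4 - x^2/2 + y^2/2, the step size, and the perturbation radius are illustrative assumptions, not taken from the chapter: the origin is a strict saddle (Hessian eigenvalues -1 and +1), and (+1, 0), (-1, 0) are the local minima.

import numpy as np

# Gradient of the toy objective f(x, y) = x**4/4 - x**2/2 + y**2/2
# (an illustrative choice, not an example from the chapter).
def grad(z):
    x, y = z
    return np.array([x**3 - x, y])

def perturbed_gd(z0, eta=0.1, radius=1e-3, tol=1e-6, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(steps):
        g = grad(z)
        if np.linalg.norm(g) < tol:
            # Near a first-order stationary point: a small random kick
            # almost surely has a component along the negative-curvature
            # direction of a strict saddle, and subsequent gradient steps
            # amplify that component until the iterate leaves the saddle.
            z = z + radius * rng.standard_normal(2)
        else:
            z = z - eta * g
    return z

print(perturbed_gd([0.0, 0.0]))  # ends near (1, 0) or (-1, 0), not at the saddle

The strict saddle property guarantees that every saddle point encountered has such a direction of strictly negative curvature, which is exactly what makes the perturbation step effective.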

References

  1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima faster than gradient descent, in STOC (2017), pp. 1195–1199. http://arxiv.org/abs/1611.01146

  2. A. Anandkumar, R. Ge, Efficient approaches for escaping higher order saddle points in non-convex optimization, in Conference on Learning Theory (2016), pp. 81–102. arXiv preprint arXiv:1602.05908

  3. A. Arnold, Y. Liu, N. Abe, Temporal causal modeling with graphical Granger methods, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2007), pp. 66–75

  4. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends in Mach. Learn. 8(3–4), 231–357 (2015)

  5. M.T. Bahadori, Y. Liu, On causality inference in time series, in AAAI Fall Symposium: Discovery Informatics (2012)

  6. P.S. Bradley, O.L. Mangasarian, K-plane clustering. J. Global Optim. 16(1), 23–32 (2000)

  7. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  8. Y. Carmon, J.C. Duchi, Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016)

  9. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756 (2016)

  10. C.M. Carvalho, M.S. Johannes, H.F. Lopes, N.G. Polson, Particle learning and smoothing. Stat. Sci. 25, 88–106 (2010)

  11. X. Chen, Y. Liu, H. Liu, J.G. Carbonell, Learning spatial-temporal varying graphs with applications to climate data analysis, in AAAI (2010)

  12. F.E. Curtis, D.P. Robinson, M. Samadi, A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Math. Program. 162(1–2), 1–32 (2014)

  13. A. Doucet, S. Godsill, C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput. 10(3), 197–208 (2000)

  14. M. Eichler, Graphical modelling of multivariate time series with latent variables. Preprint, Universiteit Maastricht (2006)

  15. E. Elhamifar, R. Vidal, Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)

  16. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points—online stochastic gradient for tensor decomposition, in Proceedings of the 28th Conference on Learning Theory (2015), pp. 797–842

  17. P.E. Gill, W. Murray, Newton-type methods for unconstrained and linearly constrained optimization. Math. Program. 7(1), 311–350 (1974)

  18. C.W.J. Granger, Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3), 424–438 (1969)

  19. C.W.J. Granger, Testing for causality: a personal viewpoint. J. Econ. Dyn. Control. 2, 329–352 (1980)

  20. R. Heckel, H. Bölcskei, Robust subspace clustering via thresholding. IEEE Trans. Inf. Theory 61(11), 6320–6342 (2015)

  21. D. Heckerman, A tutorial on learning with Bayesian networks, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 301–354

  22. M. Hardt, T. Ma, B. Recht, Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191 (2016)

  23. R. Heckel, M. Tschannen, H. Bölcskei, Dimensionality-reduced subspace clustering. Inf. Inference: A J. IMA 6(3), 246–283 (2017)

  24. C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, M.I. Jordan, How to escape saddle points efficiently, in Proceedings of the 34th International Conference on Machine Learning (2017), pp. 1724–1732

  25. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 (2017)

  26. R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, A. Emili, M. Snyder, J.F. Greenblatt, M. Gerstein, A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302(5644), 449–453 (2003)

  27. Y. Liu, T. Bahadori, H. Li, Sparse-GEV: sparse latent space model for multivariate extreme value time series modeling. arXiv preprint arXiv:1206.4685 (2012)

  28. A.C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, N. Abe, Spatial-temporal causal modeling for climate change attribution, in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, New York, 2009), pp. 587–596

  29. G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2013)

  30. Y. Liu, A. Niculescu-Mizil, A.C. Lozano, Y. Lu, Learning temporal causal graphs for relational time-series analysis, in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 687–694

  31. J.D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M.I. Jordan, B. Recht, First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)

  32. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  33. J.D. Lee, M. Simchowitz, M.I. Jordan, B. Recht, Gradient descent only converges to minimizers, in Conference on Learning Theory (2016), pp. 1246–1257

  34. M. Liu, T. Yang, On noisy negative curvature descent: competing with gradient descent for faster non-convex optimization. arXiv preprint arXiv:1709.08571 (2017)

  35. Y. Ma, H. Derksen, W. Hong, J. Wright, Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans. Pattern Anal. Mach. Intell. 29(9), 1546–1562 (2007)

  36. J.J. Moré, D.C. Sorensen, On the use of directions of negative curvature in a modified Newton method. Math. Program. 16(1), 1–20 (1979)

  37. K.P. Murphy, Dynamic Bayesian networks: representation, inference and learning, Ph.D. thesis, University of California, Berkeley, 2002

  38. K.P. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, MA, 2012)

  39. Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), in Soviet Mathematics Doklady, vol. 27 (1983), pp. 372–376

  40. Y. Nesterov, A. Nemirovsky, A general approach to polynomial-time algorithms design for convex programming. Technical report, Centr. Econ. & Math. Inst., USSR Acad. Sci., Moscow, USSR, 1988

  41. Y. Nesterov, B.T. Polyak, Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)

  42. B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  43. M. O’Neill, S.J. Wright, Behavior of accelerated gradient methods near critical points of nonconvex problems. arXiv preprint arXiv:1706.07993 (2017)

  44. D. Park, C. Caramanis, S. Sanghavi, Greedy subspace clustering, in Advances in Neural Information Processing Systems (2014), pp. 2753–2761

  45. R. Pemantle, Nonconvergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18(2), 698–712 (1990)

  46. B.T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  47. I. Panageas, G. Piliouras, Gradient descent only converges to minimizers: non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405 (2016)

  48. D.E. Rumelhart, G.E. Hinton, R.J. Williams et al., Learning representations by back-propagating errors. Cogn. Model. 5(3), 1 (1988)

  49. C.W. Royer, S.J. Wright, Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131 (2017)

  50. S.J. Reddi, M. Zaheer, S. Sra, B. Poczos, F. Bach, R. Salakhutdinov, A.J. Smola, A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)

  51. W. Su, S. Boyd, E. Candes, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, in Advances in Neural Information Processing Systems (2014), pp. 2510–2518

  52. M. Soltanolkotabi, E.J. Candes, A geometric analysis of subspace clustering with outliers. Ann. Stat. 40(4), 2195–2238 (2012)

  53. M. Soltanolkotabi, E. Elhamifar, E.J. Candes, Robust subspace clustering. Ann. Stat. 42(2), 669–699 (2014)

  54. I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in International Conference on Machine Learning (2013), pp. 1139–1147

  55. J. Sun, Q. Qu, J. Wright, A geometric analysis of phase retrieval, in 2016 IEEE International Symposium on Information Theory (ISIT) (IEEE, Piscataway, 2016), pp. 2379–2383

  56. P.A. Traganitis, G.B. Giannakis, Sketched subspace clustering. IEEE Trans. Signal Process. 66(7), 1663–1675 (2017)

  57. T. Park, G. Casella, The Bayesian Lasso. J. Am. Stat. Assoc. 103(482), 681–686 (2008)

  58. P. Tseng, Nearest q-flat to m points. J. Optim. Theory Appl. 105(1), 249–252 (2000)

  59. M. Tsakiris, R. Vidal, Algebraic clustering of affine subspaces. IEEE Trans. Pattern Anal. Mach. Intell. 40(2), 482–489 (2017)

  60. M.C. Tsakiris, R. Vidal, Theoretical analysis of sparse subspace clustering with missing entries. arXiv preprint arXiv:1801.00393 (2018)

  61. R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA). IEEE Trans. Pattern Anal. Mach. Intell. 27(12), 1945–1959 (2005)

  62. A.C. Wilson, B. Recht, M.I. Jordan, A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635 (2016)

  63. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Nat. Acad. Sci. 113(47), E7351–E7358 (2016)

  64. Y. Wang, Y.-X. Wang, A. Singh, A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data, in International Conference on Machine Learning (2015), pp. 1422–1431

  65. Y. Wang, Y.-X. Wang, A. Singh, Differentially private subspace clustering, in Advances in Neural Information Processing Systems (2015), pp. 1000–1008

  66. Y.-X. Wang, H. Xu, Noisy sparse subspace clustering. J. Mach. Learn. Res. 17(12), 1–41 (2016)

  67. J. Yan, M. Pollefeys, A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate, in European Conference on Computer Vision (Springer, Berlin, 2006), pp. 94–106

  68. C. Yang, D. Robinson, R. Vidal, Sparse subspace clustering with missing entries, in International Conference on Machine Learning (2015), pp. 2463–2472

  69. C. Zou, J. Feng, Granger causality vs. dynamic Bayesian network inference: a comparative study. BMC Bioinf. 10(1), 122 (2009)

  70. H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat Methodol. 67(2), 301–320 (2005)

  71. C. Zeng, Q. Wang, S. Mokhtari, T. Li, Online context-aware recommendation with time varying multi-armed bandit, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2016), pp. 2025–2034

  72. C. Zeng, Q. Wang, W. Wang, T. Li, L. Shwartz, Online inference for time-varying temporal dependency discovery from time series, in 2016 IEEE International Conference on Big Data (Big Data) (IEEE, Piscataway, 2016), pp. 1281–1290

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Shi, B., Iyengar, S.S. (2020). Related Work on Geometry of Non-Convex Programs. In: Mathematical Theories of Machine Learning - Theory and Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-17076-9_6

  • DOI: https://doi.org/10.1007/978-3-030-17076-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-17075-2

  • Online ISBN: 978-3-030-17076-9

  • eBook Packages: Engineering, Engineering (R0)
