Optimization Formulation

  • Bin Shi
  • S. S. Iyengar


Based on the description of the statistical model in the previous section, we formulate the problems to be solved from two angles: one from the field of optimization, the other from sampling of a probability distribution. Practically, from the viewpoint of efficient computer algorithms, the representative of the first is the expectation–maximization (EM) algorithm. The EM algorithm finds (local) maximum likelihood estimates of the parameters of a statistical model in scenarios where the likelihood equations cannot be solved directly. Such models involve latent variables along with unknown parameters and known data observations; that is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of unobserved data points. A mixture model can be described in these terms by assuming that each observed data point has a corresponding unobserved data point, or latent variable, that specifies the mixture component to which the data point belongs.


Keywords: Expectation maximization · Markov chain Monte Carlo (MCMC) · Linear regression · Ridge regression · Lasso · Elastic-net · Accelerated gradient descent · Subspace clustering · Sequential updating · Online algorithms · Multivariate time series



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Bin Shi (1)
  • S. S. Iyengar (2)
  1. University of California, Berkeley, USA
  2. Florida International University, Miami, USA
