Mathematical Programming, Volume 174, Issue 1–2, pp 293–326

Sub-sampled Newton methods

Farbod Roosta-Khorasani · Michael W. Mahoney

Full Length Paper, Series B


Abstract

For large-scale finite-sum minimization problems, we study non-asymptotic, high-probability global and local convergence properties of variants of Newton's method in which the Hessian and/or the gradient are randomly sub-sampled. For Hessian sub-sampling, random matrix concentration inequalities allow one to sub-sample in a way that suitably preserves second-order information, i.e., curvature. For gradient sub-sampling, approximate matrix multiplication results from randomized numerical linear algebra provide a way to construct a sub-sampled gradient that retains as much first-order information as possible. While the required sample sizes all depend on problem-specific constants, e.g., the condition number, we demonstrate that the local convergence rates are problem-independent.
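The sub-sampling scheme described above can be sketched on a least-squares instance of the finite-sum problem, where each component is f_i(x) = (1/2)(a_i^T x − b_i)^2. The function name, sample sizes, step size, and iteration count below are illustrative assumptions, not the paper's exact algorithm or its theoretically prescribed sample sizes.

```python
import numpy as np

def subsampled_newton_ls(A, b, x0, s_g, s_h, iters=100, alpha=1.0, seed=0):
    """Sub-sampled Newton sketch for min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = x0.copy()
    for _ in range(iters):
        # Sub-sampled gradient: average nabla f_i over a random subset of size s_g.
        idx_g = rng.choice(n, size=s_g, replace=False)
        g = A[idx_g].T @ (A[idx_g] @ x - b[idx_g]) / s_g
        # Sub-sampled Hessian: average of a_i a_i^T over a random subset of size s_h.
        idx_h = rng.choice(n, size=s_h, replace=False)
        H = A[idx_h].T @ A[idx_h] / s_h
        # Damped Newton step; lstsq guards against a singular sub-sampled Hessian.
        x = x - alpha * np.linalg.lstsq(H, g, rcond=None)[0]
    return x

# Usage: recover x_true from a consistent over-determined system.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = subsampled_newton_ls(A, b, np.zeros(5), s_g=500, s_h=500)
```

Because the system is consistent, every per-sample residual vanishes at the optimum, so the sub-sampled gradient is exact there and the iteration converges to x_true; in the general (noisy) case the gradient sample size would have to grow to drive the error below the gradient-estimation noise.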


Keywords

Newton-type methods · Local and global convergence · Sub-sampling

Mathematics Subject Classification

49M15 65K05 90C25 90C06 



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2018

Authors and Affiliations

  1. School of Mathematics and Physics, University of Queensland, St Lucia, Australia
  2. ICSI and Department of Statistics, UC Berkeley, Berkeley, USA
