
Optimization for Large-Scale Machine Learning with Distributed Features and Observations

  • Conference paper
  • First Online:
Machine Learning and Data Mining in Pattern Recognition (MLDM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10358)

Abstract

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine learning and predictive modeling, distributed optimization methods have recently garnered ample attention in the literature. Although previous research has mostly focused on settings where either the observations or the features of the problem at hand are stored in a distributed fashion, the situation where both are partitioned across the nodes of a computer cluster (doubly distributed) has barely been studied. In this work we propose two doubly distributed optimization algorithms. The first falls under the umbrella of distributed dual coordinate ascent methods, while the second belongs to the class of stochastic gradient/coordinate descent hybrid methods. We conduct numerical experiments in Spark using real-world and simulated data sets and study the scaling properties of our methods. Our empirical evaluation shows that the proposed algorithms outperform a block distributed ADMM method, which, to the best of our knowledge, is the only other existing doubly distributed optimization algorithm.
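
To make the doubly distributed layout concrete, the sketch below (plain Python/NumPy, not the authors' Spark implementation) shows how a design matrix and its labels can be partitioned by both observations (row blocks) and features (column blocks), so that each cluster node holds one rectangular block. The function name block_partition and the 3-by-2 worker grid are illustrative assumptions made for this example.

    # Illustrative sketch of doubly distributed data: the n-by-d matrix X is
    # split into a grid of rectangular blocks, one per (hypothetical) worker,
    # so both observations and features are partitioned across the cluster.
    import numpy as np

    def block_partition(X, y, n_row_blocks, n_col_blocks):
        """Split (X, y) into a grid of blocks; each cell maps to one worker."""
        row_blocks = np.array_split(np.arange(X.shape[0]), n_row_blocks)
        col_blocks = np.array_split(np.arange(X.shape[1]), n_col_blocks)
        blocks = {}
        for i, rows in enumerate(row_blocks):
            for j, cols in enumerate(col_blocks):
                # worker (i, j) stores its sub-matrix and the labels of its rows
                blocks[(i, j)] = (X[np.ix_(rows, cols)], y[rows])
        return blocks

    # Example: a 12 x 8 data set split across a 3 x 2 grid of workers.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((12, 8))
    y = rng.integers(0, 2, size=12)
    for (i, j), (Xb, yb) in block_partition(X, y, 3, 2).items():
        print(f"worker ({i},{j}): block {Xb.shape}, labels {yb.shape}")

In a Spark implementation, each such block would typically be one element of an RDD keyed by its (row-block, column-block) index, so that updates can be computed locally on each block before any communication takes place.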


Notes

  1.

    In [17], the size of the partitions was \(3,000\times 5,000\), but due to the BLAS issue mentioned earlier, we resorted to smaller problems to obtain comparable run-times across all methods.

  2.

    http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.

References

  1. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)

  2. Babanezhad, R., Ahmed, M.O., Virani, A., Schmidt, M., Konečný, J., Sallinen, S.: Stop wasting my gradients: practical SVRG (2015). arXiv preprint arXiv:1511.01942

  3. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  4. Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(1), 165–202 (2012)

  5. Frostig, R., Ge, R., Kakade, S.M., Sidford, A.: Competing with the empirical risk minimizer in a single pass (2014). arXiv preprint arXiv:1412.6606

  6. Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM (2008)

  7. Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 3068–3076 (2014)

  8. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  9. Konečný, J., Qu, Z., Richtárik, P.: Semi-stochastic coordinate descent (2014). arXiv preprint arXiv:1412.6293

  10. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)

  11. Ma, C., Smith, V., Jaggi, M., Jordan, M.I., Richtárik, P., Takáč, M.: Adding vs. averaging in distributed primal-dual optimization (2015). arXiv preprint arXiv:1502.03508

  12. Mann, G., McDonald, R.T., Mohri, M., Silberman, N., Walker, D.: Efficient large-scale distributed training of conditional maximum entropy models. NIPS 22, 1231–1239 (2009)

  13. McDonald, R., Hall, K., Mann, G.: Distributed training strategies for the structured perceptron. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 456–464 (2010)

  14. Mokhtari, A., Koppel, A., Ribeiro, A.: Doubly random parallel stochastic methods for large scale learning (2016). arXiv preprint arXiv:1603.06782

  15. Mota, J.F., Xavier, J.M., Aguiar, P.M., Püschel, M.: D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)

  16. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  17. Parikh, N., Boyd, S.: Block splitting for distributed optimization. Math. Program. Comput. 6(1), 77–102 (2014)

  18. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)

  19. Richtárik, P., Takáč, M.: Distributed coordinate descent method for learning with big data (2013). arXiv preprint arXiv:1310.2059

  20. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)

  21. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 1–52 (2015)

  22. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  23. Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs (2013). arXiv preprint arXiv:1303.2314

  24. Wang, H., Banerjee, A.: Randomized block coordinate descent for online and stochastic optimization (2014). arXiv preprint arXiv:1407.0107

  25. Xu, Y., Yin, W.: Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J. Optim. 25(3), 1686–1716 (2015)

  26. Yang, T.: Trading computation for communication: distributed stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 629–637 (2013)

  27. Zhang, C., Lee, H., Shin, K.G.: Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In: International Conference on Artificial Intelligence and Statistics, pp. 1398–1406 (2012)

  28. Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp. 3329–3337 (2014)

  29. Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. In: NIPS, vol. 4, p. 4 (2010)

Author information

Corresponding author

Correspondence to Alexandros Nathan.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Nathan, A., Klabjan, D. (2017). Optimization for Large-Scale Machine Learning with Distributed Features and Observations. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science, vol 10358. Springer, Cham. https://doi.org/10.1007/978-3-319-62416-7_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-62416-7_10

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62415-0

  • Online ISBN: 978-3-319-62416-7

  • eBook Packages: Computer Science, Computer Science (R0)
