Abstract
As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine learning and predictive modeling, distributed optimization methods have recently garnered ample attention in the literature. Although previous research has mostly focused on settings where either the observations or the features of the problem at hand are stored in a distributed fashion, the situation where both are partitioned across the nodes of a computer cluster (doubly distributed) has barely been studied. In this work we propose two doubly distributed optimization algorithms. The first falls under the umbrella of distributed dual coordinate ascent methods, while the second belongs to the class of stochastic gradient/coordinate descent hybrid methods. We conduct numerical experiments in Spark using real-world and simulated data sets and study the scaling properties of our methods. Our empirical evaluation shows that both proposed algorithms outperform a block distributed ADMM method, which, to the best of our knowledge, is the only other existing doubly distributed optimization algorithm.
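To make the doubly distributed layout concrete, the following minimal NumPy sketch (our illustration, not the authors' implementation; all names, block sizes, and the toy reduction are hypothetical) partitions a data matrix into row-and-column blocks and shows the cross-block combination a doubly distributed method must perform before any gradient or coordinate update:

# Illustrative sketch of "doubly distributed" data: a matrix A of
# n observations x d features is split along BOTH axes, so worker (i, j)
# holds only the observations of row group i restricted to the features
# of column group j. Sizes and the 2 x 3 worker grid are toy choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 6                   # observations, features (toy sizes)
row_parts, col_parts = 2, 3   # grid of workers: 2 x 3 = 6 blocks

A = rng.standard_normal((n, d))

# Split rows first, then columns, yielding one sub-matrix per (i, j) worker.
blocks = {
    (i, j): col_block
    for i, row_block in enumerate(np.array_split(A, row_parts, axis=0))
    for j, col_block in enumerate(np.array_split(row_block, col_parts, axis=1))
}

# No single worker can compute A @ w: each (i, j) block contributes only
# the partial inner products A_ij @ w_j, which must be summed across j.
w = rng.standard_normal(d)
w_parts = np.array_split(w, col_parts)
partial = {(i, j): blocks[i, j] @ w_parts[j] for (i, j) in blocks}

# Row-wise reduction over column groups recovers the full predictions.
preds = np.concatenate(
    [sum(partial[i, j] for j in range(col_parts)) for i in range(row_parts)]
)
assert np.allclose(preds, A @ w)

In the paper's Spark setting, each (i, j) block would reside on a separate executor, so the inner sum over j corresponds to a network reduction rather than a local loop; this communication step is what distinguishes doubly distributed methods from algorithms that partition only observations or only features.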
Notes
- 1.
In [17] the size of the partitions was \(3,000\times 5,000\), but due to the BLAS issue mentioned earlier, we resorted to smaller problems to obtain comparable run-times across all methods.
References
Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)
Babanezhad, R., Ahmed, M.O., Virani, A., Schmidt, M., Konečný, J., Sallinen, S.: Stop wasting my gradients: practical SVRG (2015). arXiv preprint arXiv:1511.01942
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(1), 165–202 (2012)
Frostig, R., Ge, R., Kakade, S.M., Sidford, A.: Competing with the empirical risk minimizer in a single pass (2014). arXiv preprint arXiv:1412.6606
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM (2008)
Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 3068–3076 (2014)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Konečný, J., Qu, Z., Richtárik, P.: Semi-stochastic coordinate descent (2014). arXiv preprint arXiv:1412.6293
Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)
Ma, C., Smith, V., Jaggi, M., Jordan, M.I., Richtárik, P., Takáč, M.: Adding vs. averaging in distributed primal-dual optimization (2015). arXiv preprint arXiv:1502.03508
Mann, G., McDonald, R.T., Mohri, M., Silberman, N., Walker, D.: Efficient large-scale distributed training of conditional maximum entropy models. In: Advances in Neural Information Processing Systems, vol. 22, pp. 1231–1239 (2009)
McDonald, R., Hall, K., Mann, G.: Distributed training strategies for the structured perceptron. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 456–464 (2010)
Mokhtari, A., Koppel, A., Ribeiro, A.: Doubly random parallel stochastic methods for large scale learning (2016). arXiv preprint arXiv:1603.06782
Mota, J.F., Xavier, J.M., Aguiar, P.M., Püschel, M.: D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Parikh, N., Boyd, S.: Block splitting for distributed optimization. Math. Program. Comput. 6(1), 77–102 (2014)
Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)
Richtárik, P., Takáč, M.: Distributed coordinate descent method for learning with big data (2013). arXiv preprint arXiv:1310.2059
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 1–52 (2015)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs (2013). arXiv preprint arXiv:1303.2314
Wang, H., Banerjee, A.: Randomized block coordinate descent for online and stochastic optimization (2014). arXiv preprint arXiv:1407.0107
Xu, Y., Yin, W.: Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J. Optim. 25(3), 1686–1716 (2015)
Yang, T.: Trading computation for communication: distributed stochastic dual coordinate ascent. In: Advances in Neural Information Processing Systems, pp. 629–637 (2013)
Zhang, C., Lee, H., Shin, K.G.: Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In: International Conference on Artificial Intelligence and Statistics, pp. 1398–1406 (2012)
Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp. 3329–3337 (2014)
Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, vol. 4, p. 4 (2010)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Nathan, A., Klabjan, D. (2017). Optimization for Large-Scale Machine Learning with Distributed Features and Observations. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2017. Lecture Notes in Computer Science, vol. 10358. Springer, Cham. https://doi.org/10.1007/978-3-319-62416-7_10
DOI: https://doi.org/10.1007/978-3-319-62416-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62415-0
Online ISBN: 978-3-319-62416-7
eBook Packages: Computer Science, Computer Science (R0)