Distributing Frank–Wolfe via map-reduce

  • Armin Moharrer
  • Stratis Ioannidis
Regular Paper

Abstract

Large-scale optimization problems abound in data mining and machine learning applications, and the computational challenges they pose are often addressed through parallelization. We identify structural properties under which a convex optimization problem can be massively parallelized via map-reduce operations using the Frank–Wolfe (FW) algorithm. The class of problems that can be tackled this way is quite broad and includes experimental design, AdaBoost, and projection onto a convex hull. Implementing FW via map-reduce eases parallelization and deployment on commercial distributed computing frameworks. We demonstrate this by implementing FW over Spark, an engine for parallel data processing, and establish that parallelization through map-reduce yields significant performance improvements: we solve problems with 20 million variables using 350 cores in 79 min, whereas the same computation takes 48 h when executed serially.
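To make the map-reduce framing concrete, below is a minimal PySpark sketch of FW iterations for one member of this problem class: minimizing a smooth convex function over the probability simplex, where the linear minimization oracle (LMO) reduces to finding the coordinate with the smallest partial derivative. The toy quadratic objective, the step-size schedule, and the partition count are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's implementation) of map-reduce Frank-Wolfe
# in PySpark, assuming minimization of a smooth convex f over the probability
# simplex. There, the LMO argmin_{s in simplex} <grad f(x), s> is attained at
# the vertex e_j whose coordinate j has the smallest partial derivative.
from pyspark import SparkContext

sc = SparkContext(appName="fw-mapreduce-sketch")

n = 10**6  # number of variables, partitioned across the cluster
# State: (index, value) pairs; start at the simplex barycenter.
x = sc.parallelize(range(n), numSlices=350).map(lambda i: (i, 1.0 / n)).cache()

for k in range(100):
    gamma = 2.0 / (k + 2)  # standard diminishing FW step size

    # Map: evaluate the partial derivative at every coordinate in parallel.
    # For the toy objective f(x) = 0.5 * ||x||^2, the gradient is x itself.
    grad = x.map(lambda iv: (iv[0], iv[1]))

    # Reduce: the LMO over the simplex picks the coordinate j with the
    # smallest partial derivative, i.e., the minimizing vertex e_j.
    j, _ = grad.reduce(lambda a, b: a if a[1] <= b[1] else b)

    # Map: convex-combination update x <- (1 - gamma) * x + gamma * e_j.
    # (A real implementation would also unpersist old RDDs / truncate lineage.)
    x = x.map(lambda iv, j=j, g=gamma:
              (iv[0], (1.0 - g) * iv[1] + (g if iv[0] == j else 0.0))).cache()
```

The pattern illustrates why FW suits map-reduce for this problem class: the expensive gradient evaluation is an embarrassingly parallel map over coordinates, while the LMO collapses to a single reduce.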

Keywords

Frank–Wolfe · Distributed algorithms · Convex optimization · Spark

Acknowledgements

We thank our reviewers for their very useful comments and suggestions. This work was supported by National Science Foundation (NSF) CAREER grant CCF-1750539.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Electrical and Computer Engineering Department, Northeastern University, Boston, USA