GPU_MF_SGD: A Novel GPU-Based Stochastic Gradient Descent Method for Matrix Factorization

  • Mohamed A. NassarEmail author
  • Layla A. A. El-Sayed
  • Yousry Taha
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 887)


Recommender systems are used in most of nowadays applications. Providing real-time suggestions with high accuracy is considered as one of the most crucial challenges that face them. Matrix factorization (MF) is an effective technique for recommender systems as it improves the accuracy. Stochastic Gradient Descent (SGD) for MF is the most popular approach used to speed up MF. SGD is a sequential algorithm, which is not trivial to be parallelized, especially for large-scale problems. Recently, many researches have proposed parallel methods for parallelizing SGD. In this research, we propose GPU_MF_SGD, a novel GPU-based method for large-scale recommender systems. GPU_MF_SGD utilizes Graphics Processing Unit (GPU) resources by ensuring load balancing and linear scalability, and achieving coalesced access of global memory without preprocessing phase. Our method demonstrates 3.1X–5.4X speedup over the most state-of-the-art GPU method, CuMF_SGD.


Collaborative filtering (CF) Matrix factorization (MF) GPU implementation Stochastic Gradient Descent (SGD) 


  1. 1.
    Ricci, F., et al.: Recommender Systems Handbook. Springer, New York (2011)CrossRefGoogle Scholar
  2. 2.
    Ekstrand, M.D., et al.: Collaborative filtering recommender systems. Found. Trends Hum. Comput. Interact. 4(2), 81–173 (2011)CrossRefGoogle Scholar
  3. 3.
    Poriya, A., et al.: Non-personalized recommender systems and user-based collaborative recommender systems. Int. J. Appl. Inf. Syst. 6(9), 22–27 (2014)Google Scholar
  4. 4.
    Aamir, M., Bhusry, M.: Recommendation system: state of the art approach. Int. J. Comput. Appl. 120, 25–32 (2015)Google Scholar
  5. 5.
    Recommender System. Accessed 11 July 2017
  6. 6.
    Jin, J., et al.: GPUSGD: a GPU-accelerated stochastic gradient descent algorithm for matrix factorization. Concurr. Comput. Pract. Exp. 28, 3844–3865 (2016)CrossRefGoogle Scholar
  7. 7.
    Xie, X., et al.: CuMF_SGD: parallelized stochastic gradient descent for matrix factorization on GPUs. In: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. ACM (2017)Google Scholar
  8. 8.
    Li, H., et al.: MSGD: a novel matrix factorization approach for large-scale collaborative filtering recommender systems on GPUs. IEEE Trans. Parallel Distrib. Syst. 29(7), 1530–1544 (2018)CrossRefGoogle Scholar
  9. 9.
    Nassar, M.A., El-Sayed, L.A.A., Taha, Y.: Efficient parallel stochastic gradient descent for matrix factorization using GPU. In: 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST). IEEE (2016)Google Scholar
  10. 10.
    Wen, Z.: Recommendation system based on collaborative filtering. In: CS229 Lecture Notes, Stanford University, December 2008Google Scholar
  11. 11.
    Leskovec, J., et al.: Mining of Massive Datasets, Chap. 9, pp. 307–340. Cambridge University Press, Cambridge (2014)Google Scholar
  12. 12.
    Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)CrossRefGoogle Scholar
  13. 13.
    Kaleem, R., et al.: Stochastic gradient descent on GPUs. In: Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, pp. 81–89 (2015)Google Scholar
  14. 14.
    Konstan, J.A., Riedl, J.: Recommender systems: from algorithms to user experience. User Model. User Adap. Inter. 22(1), 101–123 (2012)CrossRefGoogle Scholar
  15. 15.
    Anastasiu, D.C., et al.: Big Data and Recommender Systems (2016)Google Scholar
  16. 16.
    Melville, P., Sindhwani, V.: Recommender systems. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 829–838. Springer, New York (2011)Google Scholar
  17. 17.
    Kant, V., Bharadwaj, K.K.: Enhancing recommendation quality of content-based filtering through collaborative predictions and fuzzy similarity measures. J. Proc. Eng. 38, 939–944 (2012)CrossRefGoogle Scholar
  18. 18.
    Ma, A., et al.: A FPGA-based accelerator for neighborhood-based collaborative filtering recommendation algorithms. In: Proceedings of IEEE International Conference on Cluster Computing, pp. 494–495, September 2015Google Scholar
  19. 19.
    Anthony, V., Ayala, A., et al.: Speeding up collaborative filtering with parametrized preprocessing. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015Google Scholar
  20. 20.
    Gates, M., et al.: Accelerating collaborative filtering using concepts from high performance computing. In: IEEE International Conference in Big Data (Big Data) (2015)Google Scholar
  21. 21.
    Wang, Z., et al.: A CUDA-enabled parallel implementation of collaborative filtering. Proc. Comput. Sci. 30, 66–74 (2014)CrossRefGoogle Scholar
  22. 22.
    Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM (2011)Google Scholar
  23. 23.
    Chin, W.-S., et al.: A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Trans. Intell. Syst. Technol. 6(1), 2 (2015)CrossRefGoogle Scholar
  24. 24.
    Zastrau, D., Edelkamp, S.: Stochastic gradient descent with GPGPU. In: Proceedings of the 35th Annual German Conference on Advances in Artificial Intelligence (KI’12), pp. 193–204 (2012)Google Scholar
  25. 25.
    Shah, A., Majumdar, A.: Accelerating low-rank matrix completion on GPUs. In: Proceedings of International Conference on Advances in Computing, Communications and Informatics, December 2014Google Scholar
  26. 26.
    Kato, K., Hosino, T.: Singular value decomposition for collaborative filtering on a GPU. IOP Conf. Ser. Mater. Sci. Eng. 10(1), 012017 (2010)CrossRefGoogle Scholar
  27. 27.
    Foster, B., et al.: A GPU-based approximate SVD algorithm. In: Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, vol. 1, pp. 569–578. Springer, Berlin (2012)CrossRefGoogle Scholar
  28. 28.
    Yu, H.-F., et al.: Parallel matrix factorization for recommender systems. Knowl. Inf. Syst. 41(3), 793–819 (2014)CrossRefGoogle Scholar
  29. 29.
    Yu, H.F., Hsieh, C.J., et al.: Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In: Proceedings of the IEEE 12th International Conference on Data Mining, pp. 765–774 (2012)Google Scholar
  30. 30.
    Yun, H., Yu, H.-F., Hsieh, C.-J., Vishwanathan, S.V.N., Dhillon, I.: NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. Proc. VLDB Endow. 7(11), 975–986 (2014)CrossRefGoogle Scholar
  31. 31.
    Yang, X., et al.: High performance coordinate descent matrix factorization for recommender systems. In: Proceedings of the Computing Frontiers Conference. ACM (2017)Google Scholar
  32. 32.
    Zadeh, R., et al.: Matrix completion via alternating least square (ALS). In: CME 323 Lecture Notes, Stanford University, Spring (2016)Google Scholar
  33. 33.
    Tan, W., Cao, L., Fong, L.: Faster and cheaper: parallelizing large-scale matrix factorization on GPUs. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2016 (2016)Google Scholar
  34. 34.
    Aberger, C.R.: Recommender: An Analysis of Collaborative Filtering Techniques (2016)Google Scholar
  35. 35.
    Papamakarios, G.: Comparison of Modern Stochastic Optimization Algorithms (2014)Google Scholar
  36. 36.
    Toulis, P., Airoldi, E., Rennie, J.: Statistical analysis of stochastic gradient methods for generalized linear models. In: International Conference on Machine Learning, pp. 667–675 (2014)Google Scholar
  37. 37.
    Toulis, P., Tran, D., Airoldi, E.: Towards stability and optimality in stochastic gradient descent. In: Artificial Intelligence and Statistics, pp. 1290–1298 (2016)Google Scholar
  38. 38.
    Zhou, Y., Wilkinson, D., et al.: Large-scale parallel collaborative filtering for the Netflix prize. In: Proceedings of International Conference on Algorithmic Aspects in Information and Management (2008)Google Scholar
  39. 39.
    Xie, X., Tan, W., Fong, L.L., Liang, Y.: Cumf_sgd: fast and scalable matrix factorization (2016). arXiv preprint arXiv:1610.05838.
  40. 40.
    Tang, K.: Collaborative filtering with batch stochastic gradient descent, July 2015.
  41. 41.
    Niu, F., et al.: HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701, June 2011Google Scholar
  42. 42.
    Gemulla, R., et al.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 69–77 (2011)Google Scholar
  43. 43.
    Zhang, H., Hsieh, C.-J., Akella, V.: Hogwild++: a new mechanism for decentralized asynchronous stochastic gradient descent. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 629–638. IEEE (2016)Google Scholar
  44. 44.
    Zhang, C., Ré, C.: Dimmwitted: a study of main-memory statistical analytics. Proc. VLDB Endow. 7(12), 1283–1294 (2014)CrossRefGoogle Scholar
  45. 45.
    Udell, M., et al.: Generalized low rank models. Found. Trends Mach. Learn. 9(1), 1–118 (2016)CrossRefGoogle Scholar
  46. 46.
    CUDA C Programming Guide. Accessed 5 Sept 2016
  47. 47.
    Nunna, K.C., et al.: A survey on big data processing infrastructure: evolving role of FPGA. Int. J. Big Data Intell. 2(3), 145–156 (2015) CrossRefGoogle Scholar
  48. 48.
    Nassar, M.A., El-Sayed, L.A.A.: Radix-4 modified interleaved modular multiplier based on sign detection. In: International Conference on Computer Science and Information Technology, pp. 413–423. Springer, Berlin (2012)Google Scholar
  49. 49.
    Nassar, M.A., El-Sayed, L.A.A.: Efficient interleaved modular multiplication based on sign detection. In: IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), November 2015Google Scholar
  50. 50.
    Karydi, E., et al.: Parallel and distributed collaborative filtering: a survey. J. ACM Comput. Surv. 49(2), 37 (2016)Google Scholar
  51. 51.
    Ma, X., Wang, C., Yu, Q., Li, X., Zhou, X.: A FPGA-based accelerator for neighborhood-based collaborative filtering recommendation algorithms. In: 2015 IEEE International Conference on Cluster Computing (CLUSTER), pp. 494–495. IEEE (2015)Google Scholar
  52. 52.
  53. 53.
    Lathia, N.: Evaluating collaborative filtering over time. Ph.D. thesis (2010)Google Scholar
  54. 54.
  55. 55.
  56. 56.
    GPU memory types – performance comparison. Accessed 5 Sept 2015
  57. 57.
    Pankratius, V., et al.: Fundamentals of Multicore Software Development. CRC Press, Boca Raton (2011)zbMATHGoogle Scholar
  58. 58.
    del Mundo, C., Feng, W.: Enabling efficient intra-warp communication for fourier transforms in a many-core architecture. In: Proceedings of the 2013 ACM/IEEE International Conference on Supercomputing (2013)Google Scholar
  59. 59.
    Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, p. 3. ACM (2011)Google Scholar
  60. 60.
    Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19 (2016)Google Scholar
  61. 61.
  62. 62.
    Bennett, J., Lanning, S.: The Netflix prize. In: Proceedings of KDD Cup and Workshop, p. 35 (2007)Google Scholar
  63. 63.
    Dror, G., Koenigstein, N., Koren, Y., Weimer, M.: The Yahoo! music dataset and KDD-Cup’11. In: Proceedings of KDD Cup 2011, pp. 3–18 (2012)Google Scholar
  64. 64.
    Zheng, L.: Performance evaluation of latent factor models for rating prediction. Ph.D. dissertation, University of Victoria (2015)Google Scholar
  65. 65.
    Low, Y., et al.: GraphLab: a new parallel framework for machine learning. In: Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI-10, pp. 340–349, July 2010Google Scholar
  66. 66.
    Chin, W.-S., et al.: A learning-rate schedule for stochastic gradient methods to matrix factorization. In: PAKDD, pp. 442–455 (2015)Google Scholar
  67. 67. Accessed July 2017
  68. 68. Accessed July 2017
  69. 69.
    Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook, pp. 257–297. Springer, Boston (2011)CrossRefGoogle Scholar
  70. 70.
    Ginger, T., Bochkov, Y.: Predicting business ratings on yelp report (2015).
  71. 71.
    Hwu, W.: Efficient host-device data transfer. In: Lecture Notes, University of Illinois at Urbana-Champaign, December 2014Google Scholar
  72. 72.
    Bhatnagar, A.: Accelerating a movie recommender system using VirtualCL on a heterogeneous GPU cluster. Master thesis, July 2015Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mohamed A. Nassar
    • 1
    Email author
  • Layla A. A. El-Sayed
    • 1
  • Yousry Taha
    • 1
  1. 1.Department of Computer and Systems EngineeringAlexandria UniversityAlexandriaEgypt

Personalised recommendations