
PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms

  • Conference paper
  • In: Benchmarking, Measuring, and Optimizing (Bench 2019)
  • Part of the book series: Lecture Notes in Computer Science, volume 12093

Abstract

Matrix factorization is the basis of many recommendation systems. Although alternating least squares with weighted-\(\lambda \)-regularization (ALS-WR) is widely used for matrix factorization in collaborative filtering, it suffers from insufficient parallelism and inefficient memory access. We therefore propose PSL, which accelerates the ALS-WR algorithm by exploiting parallelism, sparsity, and locality on x86 platforms. PSL processes 20 million ratings, and its multi-threading speedup reaches up to 14.5\(\times \) on a 20-core machine.
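To make the abstract concrete, below is a minimal C++/OpenMP sketch of the update structure that ALS-WR repeats each iteration. It is an illustrative assumption, not the authors' PSL implementation: all names (RatingsCSR, update_users, solve, K, LAMBDA) are invented here, and a tuned code would call an optimized solver such as LAPACK/MKL rather than the naive one below. The sketch shows where the three properties in the title come from: each user's small normal-equation solve is independent (parallelism), ratings are traversed through compressed sparse rows (sparsity), and factor vectors are read as contiguous K-wide strips (locality).

```cpp
// Minimal sketch (NOT the authors' PSL code) of one ALS-WR half-iteration:
// update every user's latent factors in parallel with OpenMP, reading the
// ratings from a CSR-style sparse layout.
#include <cstddef>
#include <vector>

constexpr int K = 8;            // latent-factor dimension (illustrative)
constexpr double LAMBDA = 0.05; // weighted-lambda regularization strength

// Solve A x = b for a K x K system by Gaussian elimination without pivoting;
// adequate here because ALS produces symmetric positive-definite systems.
void solve(double A[K][K], double b[K], double x[K]) {
    for (int p = 0; p < K; ++p)
        for (int r = p + 1; r < K; ++r) {
            double f = A[r][p] / A[p][p];
            for (int c = p; c < K; ++c) A[r][c] -= f * A[p][c];
            b[r] -= f * b[p];
        }
    for (int r = K - 1; r >= 0; --r) {
        double s = b[r];
        for (int c = r + 1; c < K; ++c) s -= A[r][c] * x[c];
        x[r] = s / A[r][r];
    }
}

// CSR-style ratings: user u's rated items live in [row_ptr[u], row_ptr[u+1]).
struct RatingsCSR {
    std::vector<std::size_t> row_ptr; // size = num_users + 1
    std::vector<int> col_idx;         // item ids
    std::vector<double> val;          // ratings
};

// One user-side ALS-WR update. Each user's k x k solve is independent, so the
// outer loop parallelizes cleanly; item factors are read-only in this phase.
void update_users(const RatingsCSR& R,
                  std::vector<double>& user_f,         // num_users * K
                  const std::vector<double>& item_f) { // num_items * K
    std::size_t num_users = R.row_ptr.size() - 1;
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t u = 0; u < num_users; ++u) {
        std::size_t n_u = R.row_ptr[u + 1] - R.row_ptr[u];
        if (n_u == 0) continue;          // user with no ratings: nothing to fit
        double A[K][K] = {}, b[K] = {};
        for (std::size_t e = R.row_ptr[u]; e < R.row_ptr[u + 1]; ++e) {
            const double* q = &item_f[static_cast<std::size_t>(R.col_idx[e]) * K];
            for (int i = 0; i < K; ++i) {
                for (int j = 0; j < K; ++j) A[i][j] += q[i] * q[j]; // A += q q^T
                b[i] += R.val[e] * q[i];                            // b += r_ui q
            }
        }
        for (int i = 0; i < K; ++i) A[i][i] += LAMBDA * n_u; // weighted-lambda term
        solve(A, b, &user_f[u * K]);
    }
}
```

The item-side half-iteration is symmetric, with the roles of users and items swapped; ALS alternates the two until convergence.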


Notes

  1. The source code of AIBench is publicly available from http://www.benchcouncil.org/benchhub/AIBench/ (sign up to get access).


Author information

Corresponding authors

Correspondence to Weixin Deng, Pengyu Wang, Jing Wang, Chao Li or Minyi Guo.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Deng, W., Wang, P., Wang, J., Li, C., Guo, M. (2020). PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_10


  • DOI: https://doi.org/10.1007/978-3-030-49556-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-49555-8

  • Online ISBN: 978-3-030-49556-5

  • eBook Packages: Computer Science, Computer Science (R0)
