Abstract
Matrix factorization underpins many recommendation systems. Although alternating least squares with weighted-\(\lambda \)-regularization (ALS-WR) is widely used for collaborative-filtering matrix factorization, the algorithm suffers from limited parallel execution and inefficient memory access. We therefore propose PSL, a solution that accelerates ALS-WR by exploiting parallelism, sparsity and locality on x86 platforms. PSL can process 20 million ratings, and its multi-threaded speedup reaches up to 14.5\(\times \) on a 20-core machine.
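To make the optimization target concrete, the following is a minimal single-threaded sketch of the ALS-WR iteration the paper accelerates. It is an illustrative reconstruction from the standard ALS-WR formulation (Zhou et al., 2008), not the authors' PSL implementation: dense numpy arrays stand in for the sparse rating matrix, and the function name `als_wr` is our own. In each half-iteration, one factor matrix is fixed and each row of the other is obtained by solving a small \(k \times k\) linear system, with the regularizer weighted by the number of ratings that user or item has (the "weighted-\(\lambda\)" part).

```python
import numpy as np

def als_wr(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """ALS with weighted-lambda regularization: find U (m x k), V (n x k)
    such that R ~= U @ V.T on the observed entries indicated by `mask`."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k)) * 0.1
    V = rng.standard_normal((n, k)) * 0.1
    for _ in range(iters):
        # Fix V; solve one k x k system per user. The regularizer is scaled
        # by n_i, the number of ratings user i has (weighted-lambda).
        for i in range(m):
            obs = mask[i]
            if not obs.any():
                continue
            Vi = V[obs]
            A = Vi.T @ Vi + lam * obs.sum() * np.eye(k)
            U[i] = np.linalg.solve(A, Vi.T @ R[i, obs])
        # Fix U; solve per item symmetrically, weighting by n_j ratings.
        for j in range(n):
            obs = mask[:, j]
            if not obs.any():
                continue
            Uj = U[obs]
            A = Uj.T @ Uj + lam * obs.sum() * np.eye(k)
            V[j] = np.linalg.solve(A, Uj.T @ R[obs, j])
    return U, V
```

The per-user and per-item solves are independent of one another, which is exactly the parallelism (and, with a sparse rating matrix, the sparsity and locality) that PSL exploits on multi-core x86.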
Notes
1. The source code of AIBench is publicly available from http://www.benchcouncil.org/benchhub/AIBench/ (sign up to get access).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Deng, W., Wang, P., Wang, J., Li, C., Guo, M. (2020). PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49555-8
Online ISBN: 978-3-030-49556-5