Abstract
Matrix factorization underpins many recommendation systems. Although alternating least squares with weighted-\(\lambda \)-regularization (ALS-WR) is widely used for collaborative-filtering matrix factorization, the algorithm suffers from limited parallel execution and inefficient memory access. We therefore propose PSL, a solution that accelerates ALS-WR by exploiting parallelism, sparsity and locality on x86 platforms. PSL can process 20 million ratings, and its multi-threaded speedup reaches up to 14.5\(\times \) on a 20-core machine.
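To make the optimization target concrete, the following is a minimal single-threaded sketch of the ALS-WR iteration the paper accelerates. It is an illustrative reconstruction from the standard ALS-WR formulation (Zhou et al., 2008), not the authors' PSL implementation: dense numpy arrays stand in for the sparse rating matrix, and the function name `als_wr` is our own. In each half-iteration, one factor matrix is fixed and each row of the other is obtained by solving a small \(k \times k\) linear system, with the regularizer weighted by the number of ratings that user or item has (the "weighted-\(\lambda\)" part).

```python
import numpy as np

def als_wr(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """ALS with weighted-lambda regularization: find U (m x k), V (n x k)
    such that R ~= U @ V.T on the observed entries indicated by `mask`."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k)) * 0.1
    V = rng.standard_normal((n, k)) * 0.1
    for _ in range(iters):
        # Fix V; solve one k x k system per user. The regularizer is scaled
        # by n_i, the number of ratings user i has (weighted-lambda).
        for i in range(m):
            obs = mask[i]
            if not obs.any():
                continue
            Vi = V[obs]
            A = Vi.T @ Vi + lam * obs.sum() * np.eye(k)
            U[i] = np.linalg.solve(A, Vi.T @ R[i, obs])
        # Fix U; solve per item symmetrically, weighting by n_j ratings.
        for j in range(n):
            obs = mask[:, j]
            if not obs.any():
                continue
            Uj = U[obs]
            A = Uj.T @ Uj + lam * obs.sum() * np.eye(k)
            V[j] = np.linalg.solve(A, Uj.T @ R[obs, j])
    return U, V
```

The per-user and per-item solves are independent of one another, which is exactly the parallelism (and, with a sparse rating matrix, the sparsity and locality) that PSL exploits on multi-core x86.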
Notes
1. The source code of AIBench is publicly available from http://www.benchcouncil.org/benchhub/AIBench/ (sign up to get access).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Deng, W., Wang, P., Wang, J., Li, C., Guo, M. (2020). PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49555-8
Online ISBN: 978-3-030-49556-5