Vectorizable Design and Implementation of Matrix Multiplication on Vector Processor
Matrix-vector multiplication is a core computation in many scientific-computing algorithms, and mapping it efficiently onto vector processors is a difficult vectorization problem. In this study, motivated by the back-propagation (BP) algorithm in deep-learning applications, we analyze the BP algorithm in depth and, guided by the characteristics of the vector-processor architecture, propose an efficient vectorization method for matrix-vector multiplication. The L1D is configured in SRAM mode, and a double-buffer "ping-pong" scheme smooths data transfers through the multi-level memory hierarchy, so that kernel computation overlaps with DMA data movement and the kernel runs at peak speed, achieving the best computational efficiency. Transferring the matrix in transposed form via DMA avoids inefficient column-wise accesses and floating-point summation reductions across the VPEs, yielding optimal kernel performance. Experimental results on MATRIX2 show that the single-core performance of the proposed double-precision matrix multiplication reaches 94.45 GFLOPS, with a kernel computation efficiency of 99.39%.
Keywords: Matrix-vector multiplication · Vector processor · BP algorithm · Vectorization
This work was supported by the National Natural Science Foundation of China (Grants 61133007 and 61572025).