Efficient Processing of Convolutional Neural Networks on SW26010
Artificial intelligence has developed rapidly in recent years, and deep neural networks are the basis of many artificial intelligence applications, so accelerating their computation is very important. To explore the potential for accelerating deep neural networks on various hardware platforms, we propose a weight-stationary convolutional neural network optimization method for the SW26010 processor. We reorder the convolution loops and use a hybrid DMA transmission mode to increase memory bandwidth and reduce memory access overhead. On top of those, we apply further optimizations based on register communication, double buffering with asynchronous DMA transfers, instruction scheduling, and other schemes. Finally, we achieve double-precision convolution performance over 2.4 Tflops, which is 81% of the processor's peak performance. Across multiple parameter configurations, we achieve a \(2.4-4.0\times \) speedup over the Tesla K80 GPU with cuDNN v7.
Keywords: SW26010 processor · Convolutional neural networks · Weight-stationary · Parallel model · Many-core architecture · Deep learning