Deep Fusion: A Software Scheduling Method for Memory Access Optimization

  • Yimin Zhuang
  • Shaohui Peng
  • Xiaobing Chen
  • Shengyuan Zhou
  • Tian Zhi
  • Wei Li
  • Shaoli Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11783)


Deep neural networks (DNNs) are considered state-of-the-art artificial intelligence methods in a very broad range of applications. However, DNNs are both compute-intensive and memory-intensive, which makes them difficult to deploy in practical scenarios. Owing to their favorable parallel computing ability, a series of DNN accelerators have been proposed. However, growing on-chip computing capacity and the increasing number of parameters in neural networks make memory access a bottleneck. In this paper, we analyze existing DNN algorithms and observe that the special structure of neural networks gives them two useful characteristics: unilateral directivity and local independence. Based on these characteristics, we propose a general software scheduling method to reduce memory access cost. In our experiments, the method reduces memory access cost by 32% and achieves a speedup of 1.6x on average on our experimental platform; the best result is on ResNet-50, with a reduction of up to 56% and a speedup of up to 2.62x.


Fusion Reuse On-chip Memory 



This work is partially supported by the National Key Research and Development Program of China (under Grant 2017YFB1003104), the NSF of China (under Grants 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, 61732007 and 61732020), the Beijing Natural Science Foundation (JQ18013), the 973 Program of China (under Grant 2015CB358800), the National Science and Technology Major Project (2018ZX01031102), the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (KFJ-HGZX-013), the Key Research Projects in Frontier Science of Chinese Academy of Sciences (QYZDB-SSW-JSC001), the Strategic Priority Research Program of Chinese Academy of Sciences (XDB32050200, XDC01020000) and the Standardization Research Project of Chinese Academy of Sciences (BZ201800001).



Copyright information

© IFIP International Federation for Information Processing 2019

Authors and Affiliations

  • Yimin Zhuang (1, 2), corresponding author
  • Shaohui Peng (1, 2)
  • Xiaobing Chen (1, 2)
  • Shengyuan Zhou (1, 2)
  • Tian Zhi (1, 2)
  • Wei Li (1, 3)
  • Shaoli Liu (1, 3)

  1. SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. Cambricon Tech. Ltd, Shanghai, China
