A Heterogeneous and Reconfigurable Embedded Architecture for Energy-Efficient Execution of Convolutional Neural Networks

  • Konstantin Lübeck
  • Oliver Bringmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11479)


Convolutional neural networks (CNNs) are becoming increasingly popular for recognition tasks such as image classification and speech recognition. However, CNNs have high memory and computational demands, which makes it challenging to implement them on cost-efficient and energy-autonomous hardware. To cope with this challenge, we present a heterogeneous and reconfigurable embedded architecture implemented on an inexpensive and widely available entry-level system on chip (SoC). Our architecture combines an ARM CPU with a coarse-grained reconfigurable architecture (CGRA), which execute a CNN in parallel to achieve higher energy efficiency. Our results show up to 130% higher performance and 78% better energy efficiency compared with an embedded Nvidia GPU.



This work has been partially funded by the Stiftung Industrieforschung through the scholarship for master’s theses and the German Federal Ministry of Education and Research (BMBF) under grant number 16ES0876 (GENIAL!).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Tübingen, Tübingen, Germany
