
A Heterogeneous and Reconfigurable Embedded Architecture for Energy-Efficient Execution of Convolutional Neural Networks

  • Konstantin Lübeck
  • Oliver Bringmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11479)

Abstract

Machine-learning-based convolutional neural networks (CNNs) are becoming increasingly popular for identification tasks such as image classification or speech recognition. However, CNNs have high memory and computational demands, which makes it challenging to implement them on cost-efficient and energy-autonomous hardware. To cope with this challenge, we present a heterogeneous and reconfigurable embedded architecture implemented on an inexpensive and widely available entry-level system on chip (SoC). Our architecture combines an ARM CPU and a coarse-grained reconfigurable architecture (CGRA), which execute a CNN in parallel to reach higher energy efficiency. Our results show up to 130% higher performance and 78% better energy efficiency compared with an embedded Nvidia GPU.
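The abstract's core idea, running one CNN on a CPU and a CGRA in parallel, can be illustrated with a minimal sketch. The paper's actual partitioning scheme is described in the full text, not here, so the static output-channel split below and the names `conv2d_valid` and `run_partition` are illustrative assumptions; two Python threads merely stand in for the two compute units.

```python
import threading

def conv2d_valid(image, kernel):
    """Naive valid 2D convolution (cross-correlation) on nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw - kw + 1) for _ in range(ih - kh + 1)]
    for y in range(len(out)):
        for x in range(len(out[0])):
            s = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    s += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = s
    return out

def run_partition(image, kernels, results, indices):
    # One worker (standing in for the CPU or the CGRA) computes
    # only its assigned output feature maps.
    for i in indices:
        results[i] = conv2d_valid(image, kernels[i])

image = [[float((x + y) % 3) for x in range(6)] for y in range(6)]
kernels = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]  # 4 output maps
results = [None] * len(kernels)

# Static split: first half of the output maps goes to one compute
# unit, the second half to the other; both run concurrently.
split = len(kernels) // 2
cpu = threading.Thread(target=run_partition,
                       args=(image, kernels, results, range(split)))
cgra = threading.Thread(target=run_partition,
                        args=(image, kernels, results, range(split, len(kernels))))
cpu.start(); cgra.start()
cpu.join(); cgra.join()
```

In a real deployment the split would be tuned so that both units finish a layer at roughly the same time; an even channel split like the one above is only the simplest possible load balance.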


Acknowledgments

This work has been partially funded by the Stiftung Industrieforschung through the scholarship for master’s theses and the German Federal Ministry of Education and Research (BMBF) under grant number 16ES0876 (GENIAL!).


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, University of Tübingen, Tübingen, Germany
