Floating Point CGRA based Ultra-Low Power DSP Accelerator


Coarse-Grained Reconfigurable Arrays (CGRAs) are emerging in both academia and industry as energy-efficient accelerators that offer a high degree of flexibility. However, given recent advances in algorithms and the performance requirements of applications, supporting only integer and logical operations limits the appeal of traditional CGRAs. In this paper, we propose a novel CGRA architecture and an associated compilation flow that support both integer and floating-point computations for energy-efficient acceleration of DSP applications. Experimental results show that the proposed accelerator achieves a speedup of up to 4.61× over a DSP-optimized, ultra-low-power RISC-V based CPU, with an area overhead of 1.9×, when executing seizure detection, an application representative of a wide range of EEG signal processing workloads. The proposed CGRA is up to 6.5× more energy efficient than the single-core CPU. Compared with execution on the multi-core CPU with 8 cores, it achieves an energy gain of up to 4.4×.



  1. To maintain consistency, a single PULP-cluster template with 8 RI5CY cores was used for all experiments in this paper. The PULP-cluster automatically disables cores that are not in use.

  2. The PULP-cluster includes a shared FPU cluster consisting of 4 FPUs; it automatically disables FPUs that are not in use.



Author information



Corresponding author

Correspondence to Rohit Prasad.


About this article


Cite this article

Prasad, R., Das, S., Martin, K.J.M. et al. Floating Point CGRA based Ultra-Low Power DSP Accelerator. J Sign Process Syst (2021). https://doi.org/10.1007/s11265-020-01630-2



  • CGRA architecture
  • Floating point
  • Ultra-low power
  • DSP acceleration