
Loop Parallelization Techniques for FPGA Accelerator Synthesis

Published in: Journal of Signal Processing Systems

Abstract

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, their support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators; for data streaming, this also requires the generation of glue logic to distribute the image data. Loop coarsening, in contrast, processes multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL, and compare our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
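
To make the distinction concrete, the following is a minimal, self-contained C++ sketch of loop coarsening for a 3x3 box filter in the style targeted by C-based HLS tools such as Vivado HLS. It is an illustration under assumptions, not the paper's generated code: the function name coarsen_filter3x3, the image dimensions, and the coarsening factor PF are hypothetical. Only the kernel operator (box3x3) is replicated PF times inside one accelerator; in an HLS flow, the column loop would be pipelined and the innermost loop unrolled so that the PF kernel instances operate in parallel.

    // Illustrative loop-coarsening sketch (hypothetical names and sizes).
    #include <cstdint>

    constexpr int WIDTH  = 1024;  // image width, assumed to be a multiple of PF
    constexpr int HEIGHT = 1024;  // image height
    constexpr int PF     = 4;     // coarsening factor: output pixels per iteration

    // The kernel operator: a 3x3 box filter over three image rows, clamped at
    // the image borders. This is the part that loop coarsening replicates.
    static inline uint8_t box3x3(const uint8_t above[WIDTH],
                                 const uint8_t center[WIDTH],
                                 const uint8_t below[WIDTH], int x) {
      uint16_t sum = 0;
      for (int dx = -1; dx <= 1; ++dx) {
        int xx = x + dx;
        if (xx < 0) xx = 0;
        if (xx >= WIDTH) xx = WIDTH - 1;
        sum += above[xx] + center[xx] + below[xx];
      }
      return static_cast<uint8_t>(sum / 9);
    }

    void coarsen_filter3x3(const uint8_t in[HEIGHT][WIDTH],
                           uint8_t out[HEIGHT][WIDTH]) {
      for (int y = 0; y < HEIGHT; ++y) {
        const int ya = (y == 0) ? 0 : y - 1;           // clamped upper row
        const int yb = (y == HEIGHT - 1) ? y : y + 1;  // clamped lower row
        // Column loop advances by PF pixels per iteration; in an HLS flow this
        // loop would typically carry a pipeline directive (initiation interval 1).
        for (int x = 0; x < WIDTH; x += PF) {
          // Replicated kernel operators: PF outputs per iteration. In an HLS
          // flow this loop would be fully unrolled into PF parallel instances.
          for (int p = 0; p < PF; ++p) {
            out[y][x + p] = box3x3(in[ya], in[y], in[yb], x + p);
          }
        }
      }
    }

Loop tiling, by comparison, would instantiate the entire filter accelerator once per image region and add glue logic to split and merge the pixel streams.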



Notes

  1. Jetson DC power analysis of the CUDA smoke particle demo, adjusted for fan and system power consumption: http://wccftech.com/nvidia-tegra-k1-performance-power-consumption-revealed-xiaomi-mipad-ship-32bit-64bit-denver-powered-chips/.

Abbreviations

ALUT: Adaptive Look-Up Table
AOC: Altera Offline Compiler
AOCL: Altera SDK for OpenCL
ASIC: Application-Specific Integrated Circuit
BRAM: Block Random Access Memory
CUDA: Compute Unified Device Architecture
DLP: Data-Level Parallelism
DPRAM: Dual-Port RAM
DSL: Domain-Specific Language
DSP: Digital Signal Processor
EDA: Electronic Design Automation
eGPU: Embedded GPU
F: Frequency
FF: Flip-Flop
FIFO: First In First Out
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
half-ALM: Half-Adaptive Logic Module
HDL: Hardware Description Language
Hipacc: Heterogeneous Image Processing Acceleration
HLS: High-Level Synthesis
IDE: Integrated Development Environment
II: Initiation Interval
ILP: Instruction-Level Parallelism
IO: Input/Output
LAT: Latency
LU: Logic Utilization
LUT: Look-Up Table
OpenCL: Open Computing Language
PPnR: Post Place and Route
RGBA: Red Green Blue Alpha
RTL: Register Transfer Level
SIMD: Single Instruction Multiple Data
SPMD: Single Program Multiple Data
SU: Speedup
ThSU: Theoretical Speedup
TP: Throughput


Acknowledgments

This work is partly supported by the German Research Foundation (DFG), as part of the Research Training Group 1773 “Heterogeneous Image Systems”, and as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89). The Tesla K20 used for this research was donated by the Nvidia Corporation.

Author information

Corresponding author: Oliver Reiche.


About this article

Cite this article

Reiche, O., Özkan, M.A., Hannig, F. et al. Loop Parallelization Techniques for FPGA Accelerator Synthesis. J Sign Process Syst 90, 3–27 (2018). https://doi.org/10.1007/s11265-017-1229-7

