Journal of Signal Processing Systems, Volume 90, Issue 1, pp 3–27

Loop Parallelization Techniques for FPGA Accelerator Synthesis

  • Oliver Reiche
  • M. Akif Özkan
  • Frank Hannig
  • Jürgen Teich
  • Moritz Schmid

Abstract

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires generating glue logic that distributes the image data. Conversely, loop coarsening processes multiple pixels in parallel, replicating only the kernel operator within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we compare our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
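
To make the distinction between the two techniques concrete, the following is a minimal sketch in plain C, not the code generated by the framework described in the abstract. The 3-tap horizontal mean filter, the image size, the coarsening factor `V`, and all function names are assumptions chosen for illustration. Vendor-specific pragmas and the glue logic for distributing stream data are deliberately left out; the inner loop over `v` only marks where a synthesis tool would be expected to replicate the kernel operator.

```c
#include <assert.h>
#include <string.h>

#define WIDTH  16   /* hypothetical image width  */
#define HEIGHT 8    /* hypothetical image height */
#define V      4    /* hypothetical coarsening factor (pixels per iteration) */

typedef unsigned char pixel_t;

static int clampi(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Kernel operator (assumed): 3-tap horizontal mean with clamped borders. */
static pixel_t kernel(const pixel_t *row, int x, int width) {
    int l = row[clampi(x - 1, 0, width - 1)];
    int c = row[x];
    int r = row[clampi(x + 1, 0, width - 1)];
    return (pixel_t)((l + c + r) / 3);
}

/* Baseline: one pixel is produced per loop iteration. */
void filter_baseline(const pixel_t *in, pixel_t *out, int width, int height) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[y * width + x] = kernel(&in[y * width], x, width);
}

/* Loop tiling: the image is split into horizontal stripes, and each stripe
 * is handled by its own copy of the whole accelerator. The stripes are
 * independent here because the filter only reads within a row. */
void filter_tiled(const pixel_t *in, pixel_t *out,
                  int width, int height, int tiles) {
    assert(tiles > 0 && height % tiles == 0);
    int rows = height / tiles;
    for (int t = 0; t < tiles; ++t)  /* conceptually parallel accelerators */
        filter_baseline(&in[t * rows * width], &out[t * rows * width],
                        width, rows);
}

/* Loop coarsening: a single accelerator produces V pixels per iteration;
 * only the kernel operator is replicated. */
void filter_coarsened(const pixel_t *in, pixel_t *out, int width, int height) {
    assert(width % V == 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; x += V)
            for (int v = 0; v < V; ++v)  /* would be fully unrolled in hardware */
                out[y * width + x + v] = kernel(&in[y * width], x + v, width);
}

/* Runs all three variants on a synthetic image; returns 1 if they agree. */
int run_and_compare(void) {
    pixel_t in[WIDTH * HEIGHT], ref[WIDTH * HEIGHT];
    pixel_t tiled[WIDTH * HEIGHT], coarse[WIDTH * HEIGHT];
    for (int i = 0; i < WIDTH * HEIGHT; ++i)
        in[i] = (pixel_t)((i * 37 + 11) % 256);
    filter_baseline(in, ref, WIDTH, HEIGHT);
    filter_tiled(in, tiled, WIDTH, HEIGHT, 2);
    filter_coarsened(in, coarse, WIDTH, HEIGHT);
    return memcmp(ref, tiled, sizeof ref) == 0 &&
           memcmp(ref, coarse, sizeof ref) == 0;
}
```

Calling `run_and_compare()` checks that both transformations leave the result unchanged. In this sketch the tiled variant duplicates the entire loop nest per region, whereas the coarsened variant keeps one loop nest and widens its body, which mirrors the replication difference the abstract describes.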

Keywords

Altera OpenCL, Vivado HLS, Vectorization, Loop coarsening, Loop tiling

Acronyms

ALUT: Adaptive Look-Up Table
AOC: Altera Offline Compiler
AOCL: Altera SDK for OpenCL
ASIC: Application-Specific Integrated Circuit
BRAM: Block Random Access Memory
CUDA: Compute Unified Device Architecture
DLP: Data-Level Parallelism
DPRAM: Dual-Port-RAM
DSL: Domain-Specific Language
DSP: Digital Signal Processor
EDA: Electronic Design Automation
eGPU: embedded GPU
F: Frequency
FF: Flipflop
FIFO: First In First Out
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
half-ALM: half-Adaptive Logic Module
HDL: Hardware Description Language
Hipacc: Heterogeneous Image Processing Acceleration
HLS: High-Level Synthesis
IDE: Integrated Development Environment
II: Initiation Interval
ILP: Instruction-Level Parallelism
IO: Input/Output
LAT: Latency
LU: Logic Utilization
LUT: Look-Up Table
OpenCL: Open Computing Language
PPnR: Post Place and Route
RGBA: Red Green Blue Alpha
RTL: Register Transfer Level
SIMD: Single Instruction Multiple Data
SPMD: Single Program Multiple Data
SU: Speedup
ThSU: Theoretical Speedup
TP: Throughput

Acknowledgments

This work is partly supported by the German Research Foundation (DFG), as part of the Research Training Group 1773 “Heterogeneous Image Systems”, and as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89). The Tesla K20 used for this research was donated by the Nvidia Corporation.


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Oliver Reiche (1)
  • M. Akif Özkan (1)
  • Frank Hannig (1)
  • Jürgen Teich (1)
  • Moritz Schmid (2)

  1. Hardware/Software Co-Design, Department of Computer Science, Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
  2. Siemens Healthcare GmbH, Forchheim, Germany