Abstract
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, their support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening processes multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL, and compare them to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
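The difference between the two transformations can be illustrated with a minimal C sketch. It is not the code generated by HIPAcc. It assumes a flattened 1D image, a hypothetical point operator `kernel_op`, a hypothetical parallelization factor `FACTOR`, and an image size that is a multiple of `FACTOR`; the function names are likewise illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FACTOR 4 /* hypothetical parallelization factor */

/* Hypothetical point operator standing in for the kernel: invert a pixel. */
static uint8_t kernel_op(uint8_t px) { return (uint8_t)(255 - px); }

/* Baseline: one pixel per loop iteration. */
void filter_baseline(const uint8_t *in, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = kernel_op(in[i]);
}

/* Loop tiling: the image is split into FACTOR regions and each region is
 * handled by its own copy of the complete loop. In hardware, each copy
 * would be a replicated accelerator working in parallel. */
void filter_tiled(const uint8_t *in, uint8_t *out, size_t n) {
    assert(n % FACTOR == 0); /* assumption: size divisible by FACTOR */
    size_t tile = n / FACTOR;
    for (size_t t = 0; t < FACTOR; ++t)
        filter_baseline(in + t * tile, out + t * tile, tile);
}

/* Loop coarsening: a single loop remains, but each iteration processes
 * FACTOR adjacent pixels, so only kernel_op is replicated. */
void filter_coarsened(const uint8_t *in, uint8_t *out, size_t n) {
    assert(n % FACTOR == 0); /* assumption: size divisible by FACTOR */
    for (size_t i = 0; i < n; i += FACTOR)
        for (size_t v = 0; v < FACTOR; ++v) /* candidate for full unrolling */
            out[i + v] = kernel_op(in[i + v]);
}
```

All three functions compute the same result; they differ only in which part of the loop nest is duplicated. In an HLS flow, the inner loop of the coarsened variant would typically be fully unrolled so that the `FACTOR` operator instances run concurrently, whereas the tiled variant additionally needs logic to distribute the regions to the replicated loops.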
Notes
Jetson DC power analysis of the CUDA smoke particle demo, adjusted for fan and system power consumption: http://wccftech.com/nvidia-tegra-k1-performance-power-consumption-revealed-xiaomi-mipad-ship-32bit-64bit-denver-powered-chips/.
Abbreviations
- ALUT: Adaptive Look-Up Table
- AOC: Altera Offline Compiler
- AOCL: Altera SDK for OpenCL
- ASIC: Application-Specific Integrated Circuit
- BRAM: Block Random Access Memory
- CUDA: Compute Unified Device Architecture
- DLP: Data-Level Parallelism
- DPRAM: Dual-Port-RAM
- DSL: Domain-Specific Language
- DSP: Digital Signal Processor
- EDA: Electronic Design Automation
- eGPU: embedded GPU
- F: Frequency
- FF: Flipflop
- FIFO: First In First Out
- FPGA: Field Programmable Gate Array
- GPU: Graphics Processing Unit
- half-ALM: half-Adaptive Logic Module
- HDL: Hardware Description Language
- Hipacc: Heterogeneous Image Processing Acceleration
- HLS: High-Level Synthesis
- IDE: Integrated Development Environment
- II: Initiation Interval
- ILP: Instruction-Level Parallelism
- IO: Input/Output
- LAT: Latency
- LU: Logic Utilization
- LUT: Look-Up Table
- OpenCL: Open Computing Language
- PPnR: Post Place and Route
- RGBA: Red Green Blue Alpha
- RTL: Register Transfer Level
- SIMD: Single Instruction Multiple Data
- SPMD: Single Program Multiple Data
- SU: Speedup
- ThSU: Theoretical Speedup
- TP: Throughput
Acknowledgments
This work is partly supported by the German Research Foundation (DFG), as part of the Research Training Group 1773 “Heterogeneous Image Systems”, and as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89). The Tesla K20 used for this research was donated by the Nvidia Corporation.
Cite this article
Reiche, O., Özkan, M.A., Hannig, F. et al. Loop Parallelization Techniques for FPGA Accelerator Synthesis. J Sign Process Syst 90, 3–27 (2018). https://doi.org/10.1007/s11265-017-1229-7
DOI: https://doi.org/10.1007/s11265-017-1229-7