Source-to-Source Optimization for HLS
This chapter describes the source code optimization techniques and automation tools for FPGA design with high-level synthesis (HLS) design flow. HLS has lifted the design abstraction from RTL to C/C++, but in practice extensive source code rewriting is often required to achieve a good design using HLS—especially when the design space is too large to determine the proper design options in advance. In addition, this code rewriting requires not only the knowledge of hardware microarchitecture design, but also familiarity with the coding style for the high-level synthesis tools. Automatic source-to-source transformation techniques have been applied in software compilation and optimization for a long time. They can also greatly benefit the FPGA accelerator design in a high-level synthesis design flow. In general, source-to-source optimization for FPGA will be much more complex and challenging than that for CPU software because of the much larger design space in microarchitecture choices combined with temporal/spatial resource allocation. The goal of source-to-source transformation is to reduce or eliminate the design abstraction gap between software/algorithm development and existing HLS design flows. This will enable the fully automated FPGA design flows for software developers, which is especially important for deploying FPGAs in data centers, so that many software developers can efficiently use FPGAs with minimal effort for acceleration.
KeywordsInverse Discrete Cosine Transform Loop Transformation Loop Fusion Initiation Interval Design Abstraction
This work is partially supported by the National Science Foundation Small Business Innovation Research (SBIR) Grant No. 1520449 for project entitled “Customized Computing for Big Data Applications”.
- [AK08]S. Aditya, V. Kathail, Algorithmic synthesis using PICO: an integrated framework for application engine synthesis and verification from high level C algorithms, High-Level Synthesis: From Algorithm to Digital Circuit, Springer Netherlands, 2008, Chap. 4, pp. 53–74.Google Scholar
- [Bas04]C. Bastoul. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 7–16. IEEE Computer Society, 2004.Google Scholar
- [CG15]A. Cilardo and L. Gallo. Improving multibank memory access parallelism with lattice-based partitioning. ACM Trans. Archit. Code Optim., 11(4):45:1–45:25, January 2015.Google Scholar
- [CHL+12]J. Cong, M. Huang, B. Liu, P. Zhang, and Y. Zou. Combining module selection and replication for throughput-driven streaming programs. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’12, pages 1018–1023, San Jose, CA, USA, 2012. EDA Consortium.Google Scholar
- [CHZ14]J. Cong, M. Huang, and P. Zhang. Combining computation and communication optimizations in system synthesis for streaming applications. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, FPGA ’14, pages 213–222, New York, NY, USA, 2014. ACM.Google Scholar
- [CJLZ09]J. Cong, W. Jiang, B. Liu, and Y. Zou. Automatic memory partitioning and scheduling for throughput and power optimization. In Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD ’09, pages 697–704, New York, NY, USA, 2009. ACM.Google Scholar
- [CJLZ11]J. Cong, W. Jiang, B. Liu, and Y. Zou. Automatic memory partitioning and scheduling for throughput and power optimization. ACM Transactions on Design Automation of Electronic Systems (TODAES), 16(2):15, 2011.Google Scholar
- [CZZ12]J. Cong, P. Zhang, and Y. Zou. Optimizing memory hierarchy allocation with loop transformations for high-level synthesis. In Proceedings of the 49th Annual Design Automation Conference, pages 1233–1238. ACM, 2012.Google Scholar
- [Fea92]P. Feautrier. Some efficient solutions to the affine scheduling problem. part ii. multidimensional time. International journal of parallel programming, 21(6):389–420, 1992.Google Scholar
- [HWBR09]A. Hagiescu, W.-F. Wong, D. F. Bacon, and R. Rabbah. A computing origami: Folding streams in FPGAs. In Design Automation Conference, 2009. DAC’09. 46th ACM/IEEE, pages 282–287. IEEE, 2009.Google Scholar
- [LLV15]LLVM. LLVM - Low Level Virtual Machine, 2015. http://www.llvm.org [Online; accessed 1-April].
- [Ope15a]OpenAcc. OpenACC directives for accelerators, 2015. http://www.openacc-standard.org/ [Online; accessed 4-August].
- [Ope15b]OpenMP. The OpenMP API specification for parallel programming, 2015. http://openmp.org/ [Online; accessed 4-August].
- [Pou10]L.-N. Pouchet. Interative Optimization in the Polyhedral Model. PhD thesis, University of Paris-Sud 11, Orsay, France, January 2010.Google Scholar
- [PSKK15]N. K. Pham, A. K. Singh, A. Kumar, and M. M. A. Khin. Exploiting loop-array dependencies to accelerate the design space exploration with high level synthesis. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pages 157–162. EDA Consortium, 2015.Google Scholar
- [PZSC13]L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-based data reuse optimization for configurable computing. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, pages 29–38. ACM, 2013.Google Scholar
- [SW12]B. C. Schafer and K. Wakabayashi. Divide and conquer high-level synthesis design space exploration. ACM Trans. Des. Autom. Electron. Syst., 17(3):29:1–29:19, July 2012.Google Scholar
- [WBC14]F. Winterstein, S. Bayliss, and G. A. Constantinides. Separation logic-assisted code transformations for efficient high-level synthesis. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pages 1–8. IEEE, 2014.Google Scholar
- [WFY+15]F. Winterstein, K. Fleming, H.-J. Yang, S. Bayliss, and G. Constantinides. Matchup: Memory abstractions for heap manipulating programs. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 136–145. ACM, 2015.Google Scholar
- [WLC14]Y. Wang, P. Li, and J. Cong. Theory and algorithm for generalized memory partitioning in high-level synthesis. In Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays, pages 199–208. ACM, 2014.Google Scholar
- [WLZ+13]Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong. Memory partitioning for multidimensional arrays in high-level synthesis. In Proceedings of the 50th Annual Design Automation Conference, page 12. ACM, 2013.Google Scholar
- [WZCC12]Y. Wang, P. Zhang, X. Cheng, and J. Cong. An integrated and automated memory optimization flow for FPGA behavioral synthesis. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 257–262. IEEE, 2012.Google Scholar
- [YFAE14]H. Yang, K. Fleming, M. Adler, and J. Emer. LEAP shared memories: Automating the construction of FPGA coherent memories. In 2014 Symposium on Field-Programmable Custom Computing Machines, pages 117–124. IEEE, 2014.Google Scholar
- [ZFJ+08b]Z. Zhang, Y. Fan, W. Jiang, G. Han, C. Yang, and J. Cong. AutoPilot: A platform-based ESL synthesis system. In High-Level Synthesis, pages 99–112. Springer, 2008.Google Scholar
- [ZLC+13]W. Zuo, P. Li, D. Chen, L.-N. Pouchet, S. Zhong, and J. Cong. Improving polyhedral code generation for high-level synthesis. In Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, page 15. IEEE Press, 2013.Google Scholar