Source-to-Source Optimization for HLS

  • Jason Cong
  • Muhuan Huang
  • Peichen Pan
  • Yuxin Wang
  • Peng Zhang


This chapter describes source code optimization techniques and automation tools for FPGA design in a high-level synthesis (HLS) design flow. HLS has lifted the design abstraction from RTL to C/C++, but in practice extensive source code rewriting is often required to achieve a good design with HLS, especially when the design space is too large to determine the proper design options in advance. This rewriting requires not only knowledge of hardware microarchitecture design but also familiarity with the coding style expected by HLS tools. Automatic source-to-source transformation techniques have long been applied in software compilation and optimization, and they can also greatly benefit FPGA accelerator design in an HLS design flow. In general, source-to-source optimization for FPGAs is much more complex and challenging than for CPU software, because the design space of microarchitecture choices is much larger and is combined with temporal/spatial resource allocation. The goal of source-to-source transformation is to reduce or eliminate the design abstraction gap between software/algorithm development and existing HLS design flows. Closing this gap will enable fully automated FPGA design flows for software developers, which is especially important for deploying FPGAs in data centers, so that many software developers can use FPGAs efficiently for acceleration with minimal effort.
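As a small illustration of the kind of source rewriting discussed above, the following sketch applies loop fusion, one classic source-to-source transformation, to a two-stage computation. It is not taken from the chapter: the function names, the array size N, and the arithmetic are hypothetical and chosen only for illustration. The fused version removes the intermediate array, which in an HLS flow would typically save on-chip memory and leave a single loop for the tool to pipeline.

```c
#include <stddef.h>

#define N 8 /* hypothetical problem size, chosen only for illustration */

/* Before: two separate loops communicate through a full intermediate array. */
void scale_then_offset_original(const int in[N], int out[N]) {
    int tmp[N]; /* intermediate buffer holding all N scaled values */
    for (size_t i = 0; i < N; ++i)
        tmp[i] = in[i] * 3;
    for (size_t i = 0; i < N; ++i)
        out[i] = tmp[i] + 1;
}

/* After loop fusion: one loop, and the buffer shrinks to a scalar.
 * The rewrite is legal here because iteration i of the second loop
 * depends only on iteration i of the first loop. */
void scale_then_offset_fused(const int in[N], int out[N]) {
    for (size_t i = 0; i < N; ++i) {
        int t = in[i] * 3; /* replaces tmp[i] */
        out[i] = t + 1;
    }
}

/* Returns 1 when both versions produce identical output on a sample input. */
int versions_agree(void) {
    int in[N], a[N], b[N];
    for (size_t i = 0; i < N; ++i)
        in[i] = (int)i - 3; /* includes negative, zero, and positive values */
    scale_then_offset_original(in, a);
    scale_then_offset_fused(in, b);
    for (size_t i = 0; i < N; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling versions_agree() is a quick functional check that the rewritten code still computes the same result as the original, the property any source-to-source transformation has to preserve.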


Keywords: Inverse Discrete Cosine Transform, Loop Transformation, Loop Fusion, Initiation Interval, Design Abstraction



This work is partially supported by the National Science Foundation Small Business Innovation Research (SBIR) Grant No. 1520449 for the project entitled “Customized Computing for Big Data Applications”.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Jason Cong (1, 2; corresponding author)
  • Muhuan Huang (1)
  • Peichen Pan (1)
  • Yuxin Wang (1)
  • Peng Zhang (1)

  1. Falcon Computing Solutions, Inc., Los Angeles, USA
  2. Computer Science Department, University of California, Los Angeles, USA
