Software Compilation Techniques for Heterogeneous Embedded Multi-Core Systems

  • Rainer Leupers
  • Miguel Angel Aguilar
  • Jeronimo Castrillon
  • Weihua Sheng


The increasing demands of modern embedded systems, such as high performance and energy efficiency, have motivated the use of heterogeneous multi-core platforms enabled by Multiprocessor Systems-on-Chip (MPSoCs). To fully exploit the power of these platforms, new tools are needed that address the growing software complexity and keep developer productivity high. An MPSoC compiler is a tool-chain that tackles the problems of application modeling, platform description, software parallelization, software distribution, and code generation so that the target platform is used efficiently. This chapter discusses various aspects of compilers for heterogeneous embedded multi-core systems, using the well-established single-core C compiler technology as a baseline for comparison. After a brief introduction to MPSoC compiler technology, the important ingredients of the compilation process are explained in detail. Finally, a number of case studies from academia and industry are presented to illustrate the concepts discussed in this chapter.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Rainer Leupers (1)
  • Miguel Angel Aguilar (1)
  • Jeronimo Castrillon (2)
  • Weihua Sheng (3)

  1. Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Aachen, Germany
  2. Center for Advancing Electronics Dresden, TU Dresden, Dresden, Germany
  3. Silexica GmbH, Köln, Germany
