Just in Time Load Balancing

  • Rosario Cammarota
  • Alexandru Nicolau
  • Alexander V. Veidenbaum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7760)


Leveraging Loop Level Parallelism (LLP) is one of the most attractive techniques for improving program performance on emerging multi-cores. Ordinary programs contain a large number of parallel and DOALL loops. However, emerging multi-core designs feature a rapid increase in the number of on-chip cores, and the ways such cores share on-chip resources, such as the pipeline and the memory hierarchy, lead to an increase in the number of possible high-performance configurations. This trend in emerging multi-core design makes attaining peak performance through the exploitation of LLP an increasingly complex problem.

In this paper, we propose a new iteration scheduling technique to speed up the execution of DOALL loops on complex multi-core systems. Our technique targets DOALL loops with a variable cost per iteration that exhibit either predictable or unpredictable behavior across multiple instances of the loop. In the former case, our technique implements a quick run-time pass, which identifies chunks of iterations containing the same amount of work, followed by a static assignment of such chunks to cores. If static parallel execution is not profitable, our technique can decide to run the loop either sequentially or in parallel using dynamic scheduling, with an appropriate selection of the chunk size to optimize performance.
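The idea of cutting the iteration space into chunks that contain the same amount of work can be illustrated with a short sketch. The abstract does not describe how the run-time pass is implemented, so the code below is only one plausible reading: it assumes a per-iteration cost estimate is already available and places chunk boundaries with a prefix sum. The function name `equal_work_chunks` and the cost list are hypothetical and do not come from the paper.

```python
from bisect import bisect_left
from itertools import accumulate


def equal_work_chunks(costs, num_cores):
    """Split iterations 0..len(costs)-1 into num_cores contiguous chunks
    of roughly equal total cost.

    costs: non-negative estimated cost per iteration (an assumption of
           this sketch; the paper's cost model is not given here).
    Returns one half-open (start, end) range per core.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # prefix[i] is the total cost of iterations 0..i (non-decreasing).
    prefix = list(accumulate(costs))
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for k in range(1, num_cores):
        target = total * k / num_cores
        # Cut right after the first iteration whose prefix sum reaches
        # the k-th share of the total work.
        cut = bisect_left(prefix, target) + 1
        bounds.append(max(bounds[-1], min(cut, len(costs))))
    bounds.append(len(costs))
    return list(zip(bounds[:-1], bounds[1:]))


# Triangular workload: iteration i costs i + 1 units.
costs = [1, 2, 3, 4, 5, 6, 7, 8]
chunks = equal_work_chunks(costs, 4)
work = [sum(costs[s:e]) for s, e in chunks]
print(chunks)  # [(0, 4), (4, 6), (6, 7), (7, 8)]
print(work)    # [10, 11, 7, 8]
```

For this workload, an equal-count split into four chunks of two iterations would give per-core work of 3, 7, 11 and 15 units, so the most loaded core would carry 15 units instead of 11. Each resulting range could then be assigned statically to one core.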

We implemented our technique in GNU GCC/OpenMP and demonstrate promising results on three important linear algebra kernels (matrix multiply, Gauss-Jordan elimination and adjoint convolution), for which near-optimal speedup against existing scheduling techniques is attained. Furthermore, we demonstrate the impact of our approach on the already parallelized program 470.lbm from SPEC CPU2006, which implements the Lattice Boltzmann Method. On 470.lbm, our technique attains a speedup of up to 65% on the state-of-the-art 4-core, 2-way Simultaneous Multi-Threading Intel Sandy Bridge architecture.


Keywords: Iteration Space, Dynamic Schedule, Schedule Technique, Chunk Size, Parallel Loop
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Rosario Cammarota (1)
  • Alexandru Nicolau (1)
  • Alexander V. Veidenbaum (1)
  1. University of California, Irvine, USA
