Scaling FMM with Data-Driven OpenMP Tasks on Multicore Architectures

  • Abdelhalim Amer
  • Satoshi Matsuoka
  • Miquel Pericàs
  • Naoya Maruyama
  • Kenjiro Taura
  • Rio Yokota
  • Pavan Balaji
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9903)


Poor scalability on parallel architectures can be attributed to several factors, among which idle times, data movement, and runtime overhead are predominant. Conventional parallel loops and nested parallelism have proved successful for regular computational patterns. For more complex and irregular cases, however, these methods often perform poorly because they consider only a subset of these costs. Although data-driven methods are gaining popularity for efficiently utilizing computational cores, their data movement and runtime costs can be prohibitive for highly dynamic and irregular algorithms, such as fast multipole methods (FMMs). Furthermore, loop tiling, a technique that promotes data locality and has been successful for regular parallel methods, has received little attention in the context of dynamic and irregular parallelism.

We present a method to exploit loop tiling in data-driven parallel methods. Specifically, we describe a methodology for spawning work units with high data-locality potential. Work units operate on tiled computational patterns and serve as building blocks in an OpenMP task-based data-driven execution. By adjusting the work-unit granularity, idle times and runtime overheads are also taken into account. We apply this method to a popular FMM implementation and show that, with careful tuning, the new method outperforms existing parallel-loop and user-level-thread-based implementations by up to fourfold on 48 cores.


Keywords: Idle Time · Work Unit · Fast Multipole Method · Parallel Efficiency · Parallel Loop



This material is based upon work supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357, and by JST, CREST (Research Areas: Advanced Core Technologies for Big Data Integration; Development of System Software Technologies for post-Peta Scale High Performance Computing).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Abdelhalim Amer (1)
  • Satoshi Matsuoka (2)
  • Miquel Pericàs (3)
  • Naoya Maruyama (4)
  • Kenjiro Taura (5)
  • Rio Yokota (2)
  • Pavan Balaji (1)

  1. Argonne National Laboratory, Lemont, USA
  2. Tokyo Institute of Technology, Tokyo, Japan
  3. Chalmers University of Technology, Gothenburg, Sweden
  4. RIKEN Advanced Institute for Computational Science, Hyogo, Japan
  5. University of Tokyo, Tokyo, Japan
