
TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

  • Xinhai Chen
  • Jie Liu
  • Shengguo Li
  • Peizhen Xie
  • Lihua Chi
  • Qinglin Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11334)

Abstract

As high performance computing systems grow in size, the cost of communication between processors has become a key performance bottleneck. Default process-to-processor mapping strategies do not take the topology of the interconnection network into account, so messages may have to travel long distances across the network. To improve communication locality, we propose a new topology-aware mapping method called TAMM. Starting from an accurate description of the application's communication pattern and of the network topology, TAMM employs a two-step optimization strategy to obtain an efficient mapping for a variety of parallel applications. The strategy first extracts an appropriate subset of the idle computing resources on the underlying system and then constructs an optimized one-to-one mapping with a refined iterative algorithm. Experimental results demonstrate that TAMM can effectively improve communication performance on the Tianhe-2A supercomputer.
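The two-step idea in the abstract can be illustrated with a small sketch. It is not the TAMM algorithm, whose details the abstract does not give. It assumes a communication matrix `comm` (traffic volume between process pairs), a distance matrix `dist` (hops between nodes), a greedy node selection as a stand-in for step one, and a simple pairwise-swap refinement as a stand-in for step two. All function names are hypothetical, and the cost used here (traffic volume times distance, often called hop-bytes) is a common metric in the mapping literature rather than one confirmed by this abstract.

```python
from itertools import combinations


def hop_bytes(comm, dist, mapping):
    """Sum of traffic volume times network distance over all process pairs."""
    n = len(mapping)
    return sum(comm[i][j] * dist[mapping[i]][mapping[j]]
               for i in range(n) for j in range(i + 1, n))


def select_nodes(dist, idle, k):
    """Step 1 (greedy stand-in): pick k idle nodes that lie close together."""
    chosen = [idle[0]]
    while len(chosen) < k:
        rest = [v for v in idle if v not in chosen]
        # Add the idle node with the smallest total distance to those chosen.
        chosen.append(min(rest, key=lambda v: sum(dist[v][u] for u in chosen)))
    return chosen


def refine(comm, dist, mapping):
    """Step 2 (stand-in): swap process pairs while the cost keeps dropping."""
    mapping = list(mapping)
    best = hop_bytes(comm, dist, mapping)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(mapping)), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            cost = hop_bytes(comm, dist, mapping)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo the swap if it did not reduce the cost.
                mapping[i], mapping[j] = mapping[j], mapping[i]
    return mapping, best
```

Here `mapping[i]` is the node assigned to process `i`. A caller would pass the output of `select_nodes` as the initial mapping to `refine`, which keeps the mapping one-to-one because it only ever swaps assignments.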

Keywords

  • High performance computing systems
  • Topology-aware mapping
  • Communication pattern
  • Network topology

Notes

Acknowledgment

This research work was supported in part by the National Key Research and Development Program of China (2017YFB0202104), the National Natural Science Foundation of China under Grant Nos. 91530324 and 91430218, the China Postdoctoral Science Foundation (CPSF) under Grant No. 2014M562570, and a Special Financial Grant from CPSF under Grant No. 2015T81127.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China
  2. Institute of Advanced Science and Technology, Hunan Institute of Traffic Engineering, Hengyang, China
