Application-aware NoC management in GPUs multitasking

  • Zhen Xu
  • Xia Zhao
  • Zhiying Wang
  • Canqun Yang

Abstract

Current network-on-chip (NoC) designs in GPUs are agnostic to application requirements, which wastes performance under GPU multitasking. We observe that applications can generally be classified as either network-sensitive or network-insensitive. We propose application-aware NoC (AA-NoC) management to better exploit these application characteristics. AA-NoC consists of topology-aware streaming multiprocessor (SM) mapping and adaptive virtual channel (VC) management. The topology-aware SM mapping is implemented in the concurrent thread array scheduler, and the adaptive VC management relies on lightweight online profiling that incurs only limited hardware overhead. Evaluation results show that, compared to a traditional application-agnostic NoC, AA-NoC improves system throughput (STP) and average normalized turnaround time (ANTT) by 19.7% and 20.9%, respectively.
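The adaptive VC management described above can be pictured as a simple policy: profile each co-running application's network demand online, classify it as network-sensitive or network-insensitive, and bias VC allocation toward the sensitive one. The sketch below is only an illustration of that idea, not the paper's actual algorithm; the threshold, the VC count per port, and all function names are hypothetical.

```python
# Illustrative sketch (assumptions, not the paper's mechanism): classify two
# co-running GPU applications by profiled NoC injection rate, then partition
# the router virtual channels (VCs) in favour of the network-sensitive one.

TOTAL_VCS = 4                 # VCs per router input port (assumed value)
SENSITIVITY_THRESHOLD = 0.5   # injection rate (flits/cycle) cut-off (assumed)

def classify(injection_rate):
    """Label an application by its profiled NoC demand."""
    if injection_rate >= SENSITIVITY_THRESHOLD:
        return "network-sensitive"
    return "network-insensitive"

def allocate_vcs(rate_a, rate_b, total_vcs=TOTAL_VCS):
    """Split VCs between two co-running apps, favouring the sensitive one."""
    a, b = classify(rate_a), classify(rate_b)
    if a == b:                        # same class: split evenly
        half = total_vcs // 2
        return half, total_vcs - half
    if a == "network-sensitive":      # sensitive app gets the larger share
        return total_vcs - 1, 1
    return 1, total_vcs - 1

print(allocate_vcs(0.8, 0.1))  # → (3, 1)
```

In a hardware realisation this decision would be re-evaluated periodically from the online profiling counters, so the partition tracks phase changes in the co-running applications.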

Keywords

GPUs · Network-on-chip · Multitasking

Acknowledgements

Funding was provided by the National Natural Science Foundation of China (Grant Nos. 61572508, 61672526, 61402488) and the National Key R&D Program of China (Grant No. 2017YFB0202003).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. National University of Defense Technology, Changsha, China
