Abstract
With the rapidly growing complexity of 3D applications, the memory subsystem has become the most bandwidth-exhausting bottleneck in a Graphics Processing Unit (GPU). To produce realistic images, tens to hundreds of thousands of primitives are used. Furthermore, each primitive generates thousands of pixels, and these pixels are computed by shaders that apply special effects and may blend multiple texture pixels fetched from external memory to obtain a final color. To hide the long latency of texture operations, the shaders are usually highly multithreaded to increase their throughput. However, conventional memory scheduling mechanisms are unaware of the producer-consumer relationship between primitives and pixels: they either assume that all initiators are independent or use a fixed priority scheme. This paper proposes Demand Look-Ahead (DLA) memory access scheduling, which dynamically generates priorities for the memory request scheduler based on the status of each unit in the GPU. By considering the producer-consumer relationship, the proposed mechanism reschedules the most urgent requests to be serviced first. Experimental results show that the proposed DLA improves FPS and IPC by 1.47 % and 1.44 %, respectively, over First-Ready First-Come-First-Serve (FR-FCFS). When DLA is integrated with Bank-level Parallelism Awareness (BPA), DLA-BPA improves FPS and IPC by 7.28 % and 6.55 %, respectively. Furthermore, DLA-BPA improves shader thread performance by 22.06 % and increases the attainable bandwidth by 5.91 %.
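The abstract does not give the DLA priority function, so the following Python sketch only illustrates the general idea. It contrasts the FR-FCFS baseline (row-buffer hits first, then oldest first) with a hypothetical demand-aware ordering. The unit names, the `downstream_fill` input, and the rule "urgency = 1 - fill level of the buffer a unit feeds" are assumptions made for illustration, not the authors' mechanism.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    unit: str     # issuing GPU unit (hypothetical names)
    bank: int     # DRAM bank addressed by the request
    row: int      # DRAM row addressed by the request
    arrival: int  # arrival order; a lower value means an older request


def fr_fcfs(queue: List[Request], open_rows: Dict[int, int]) -> Optional[Request]:
    """Baseline FR-FCFS: prefer row-buffer hits, then the oldest request."""
    if not queue:
        return None
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


def urgency_from_fill(downstream_fill: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: the emptier the buffer a unit feeds, the closer its
    consumer is to stalling, so the more urgent that unit's requests are."""
    return {unit: 1.0 - fill for unit, fill in downstream_fill.items()}


def demand_first(queue: List[Request], open_rows: Dict[int, int],
                 urgency: Dict[str, float]) -> Optional[Request]:
    """Hypothetical demand-aware order: highest urgency first, ties broken
    by the FR-FCFS rule (row hit, then age)."""
    if not queue:
        return None
    return min(queue, key=lambda r: (-urgency.get(r.unit, 0.0),
                                     open_rows.get(r.bank) != r.row,
                                     r.arrival))


queue = [
    Request("texture", bank=0, row=5, arrival=0),
    Request("vertex_fetch", bank=1, row=9, arrival=1),
    Request("texture", bank=0, row=7, arrival=2),
]
open_rows = {0: 7, 1: 3}  # currently open row in each bank
urgency = urgency_from_fill({"texture": 0.9, "vertex_fetch": 0.1})

baseline_pick = fr_fcfs(queue, open_rows)
demand_pick = demand_first(queue, open_rows, urgency)
```

In this toy queue, FR-FCFS services the youngest texture request because it hits the open row, while the demand-aware order services the vertex fetch first because the buffer it feeds is nearly empty.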
Notes
In the following sections, the term “rendering batch” refers to the batch used in graphics rendering. Otherwise, “batch” denotes a group of memory requests for memory access scheduling.
Acknowledgments
This work is supported in part by the National Science Council of the Republic of China, Taiwan, under Grant NSC 101-2221-E-033-049.
Cite this article
Hsiao, CC., Lo, MJ. & Chu, SL. Demand look-ahead memory access scheduling for 3D graphics processing units. Multimed Tools Appl 73, 1391–1416 (2014). https://doi.org/10.1007/s11042-013-1639-x
DOI: https://doi.org/10.1007/s11042-013-1639-x