Multimedia Tools and Applications, Volume 73, Issue 3, pp 1391–1416

Demand look-ahead memory access scheduling for 3D graphics processing units

  • Chih-Chieh Hsiao
  • Min-Jen Lo
  • Slo-Li Chu


With the rapidly growing complexity of 3D applications, the memory subsystem has become the most bandwidth-exhausting bottleneck in a Graphics Processing Unit (GPU). To produce realistic images, tens to hundreds of thousands of primitives are used. Furthermore, each primitive generates thousands of pixels, which shaders compute with special effects, in some cases blending multiple texture pixels fetched from external memory to obtain the final color. To hide the long latency of texture operations, shaders are usually highly multithreaded to increase their throughput. However, conventional memory scheduling mechanisms are unaware of the producer-consumer relationship between primitives and pixels: they either assume that all initiators are independent or use a fixed priority scheme. This paper proposes Demand Look-Ahead (DLA) memory access scheduling, which dynamically generates priorities for the memory request scheduler based on the status of each unit in the GPU. By considering the producer-consumer relationship, the proposed mechanism reschedules the most urgent requests to be serviced first. Experimental results show that DLA improves FPS and IPC by 1.47 % and 1.44 %, respectively, over First-Ready First-Come-First-Serve (FR-FCFS). When DLA is integrated with Bank-level Parallelism Awareness (BPA), DLA-BPA improves FPS and IPC by 7.28 % and 6.55 %, respectively. Furthermore, with DLA-BPA, shader thread performance improves by 22.06 % and attainable bandwidth increases by 5.91 %.
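The contrast between FR-FCFS and a demand-driven priority can be illustrated with a minimal Python sketch. It is not the paper's DLA algorithm, whose priority function is not given in this abstract. The `Request` fields, the `fill` occupancy map, and the `demand_priority` rule are all hypothetical and are used only to show how a status-derived priority could be placed ahead of the usual row-hit and age ordering.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # cycle at which the request entered the queue
    bank: int     # DRAM bank addressed
    row: int      # DRAM row addressed
    source: str   # hypothetical initiator name, e.g. "vertex" or "texture"


def fr_fcfs_pick(queue, open_rows):
    """FR-FCFS: prefer requests that hit the open row, then the oldest one."""
    # A row hit yields False, which sorts before True (a row miss).
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


def demand_priority(source, fill):
    """Assumed urgency rule: a source is urgent when the buffer it feeds is
    nearly empty. `fill` maps source -> occupancy in [0.0, 1.0]."""
    return 1.0 - fill.get(source, 1.0)


def demand_aware_pick(queue, open_rows, fill):
    """Most urgent source first, then row hits, then oldest (illustrative only)."""
    return min(
        queue,
        key=lambda r: (
            -demand_priority(r.source, fill),
            open_rows.get(r.bank) != r.row,
            r.arrival,
        ),
    )


queue = [
    Request(arrival=0, bank=0, row=5, source="texture"),
    Request(arrival=1, bank=0, row=7, source="vertex"),
    Request(arrival=2, bank=1, row=3, source="texture"),
]
open_rows = {0: 7, 1: 9}                  # row currently open in each bank
fill = {"vertex": 0.9, "texture": 0.1}    # texture consumers are close to starving

print(fr_fcfs_pick(queue, open_rows).source)
print(demand_aware_pick(queue, open_rows, fill).source)
```

In this toy queue, FR-FCFS selects the vertex request because it hits the open row, while the demand-aware variant selects the oldest texture request because the buffer that texture fetches feed is nearly empty under the assumed rule.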


Keywords: Demand look-ahead; GPU; Graphics rendering; Memory access scheduling



This work is supported in part by the National Science Council of Republic of China, Taiwan under Grant NSC 101-2221-E-033-049.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Information and Computer Engineering, Chung Yuan Christian University, Chung Li, Taiwan
