CPU-GPU System Designs for High Performance Cloud Computing

Chapter in the book High Performance Cloud Auditing and Applications

Abstract

Improving parallel computing capability can greatly increase the efficiency of high performance cloud computing. By combining the powerful scalar processing of the CPU with the efficient parallel processing of the GPU, CPU-GPU systems provide a hybrid computing environment that can be dynamically optimized for cloud computing applications. One of the critical issues in CPU-GPU system design is the so-called memory wall, which here denotes the combined design challenges of memory coherence, bandwidth, capacity, and power budget. Optimizing the memory design not only improves run-time performance but also enhances the reliability of the CPU-GPU system. In this chapter, we introduce the mainstream and emerging memory hierarchy designs in CPU-GPU systems, discuss techniques that optimize data allocation and migration between the CPU and GPU to improve performance and power efficiency, and present the challenges and opportunities of CPU-GPU systems.
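The data-migration tradeoff described in the abstract can be made concrete with a toy first-order cost model: offloading a kernel to the GPU pays off only when the compute-time savings outweigh the cost of moving its data across the CPU-GPU interconnect. The function name, throughput figures, and bandwidth below are illustrative assumptions for the sketch, not values taken from the chapter.

```python
# Toy first-order model of the CPU-GPU offload decision.
# All numbers are illustrative assumptions, not measurements.

def offload_wins(flops, cpu_gflops, gpu_gflops, bytes_moved, pcie_gbps):
    """Return True if GPU compute time plus interconnect transfer time
    beats CPU-only execution for a kernel of `flops` operations."""
    t_cpu = flops / (cpu_gflops * 1e9)
    t_gpu = flops / (gpu_gflops * 1e9) + bytes_moved / (pcie_gbps * 1e9)
    return t_gpu < t_cpu

# A large, compute-dense kernel amortizes the transfer cost...
print(offload_wins(flops=1e12, cpu_gflops=100, gpu_gflops=2000,
                   bytes_moved=1e9, pcie_gbps=16))   # True
# ...while a small kernel is dominated by data movement.
print(offload_wins(flops=1e8, cpu_gflops=100, gpu_gflops=2000,
                   bytes_moved=1e9, pcie_gbps=16))   # False
```

This is why the data-placement techniques surveyed in the chapter matter: shrinking or hiding the transfer term (e.g., via fused CPU+GPU memory hierarchies) moves the break-even point so that far smaller kernels benefit from the GPU.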



Acknowledgements

This material is based upon work partially supported by the National Science Foundation (NSF) grant CNS-1116171 and the Air Force Research Laboratory (AFRL) Visiting Faculty Research Program (VFRP) extension grant LRIR 11RI01COR. We are grateful to Prof. Hai (Helen) Li of the University of Pittsburgh Department of Electrical and Computer Engineering for her generous help.

Author information

Correspondence to Yiran Chen.


Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Chen, Y., Guo, J., Sun, Z. (2014). CPU-GPU System Designs for High Performance Cloud Computing. In: Han, K., Choi, BY., Song, S. (eds) High Performance Cloud Auditing and Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3296-8_11

  • DOI: https://doi.org/10.1007/978-1-4614-3296-8_11

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-3295-1

  • Online ISBN: 978-1-4614-3296-8

  • eBook Packages: Engineering (R0)
