Abstract
Improvement of parallel computing capability will greatly increase the efficiency of high performance cloud computing. By combining the powerful scalar processing on CPU with the efficient parallel processing on GPU, CPU-GPU systems provide a hybrid computing environment that can be dynamically optimized for cloud computing applications. One of the critical issues in CPU-GPU system designs is the so called memory wall, which denotes the design complexity of memory coherence, bandwidth, capacity, and power budget. The optimization of the memory designs can not only improve the run-time performance but also enhance the reliability of the CPU-GPU system. In this chapter, we will introduce the mainstream and emerging memory hierarchy designs in CPU-GPU systems, discuss the techniques that can optimize the data allocation and migration between CPU and GPU for performance and power efficiency improvement, and present the challenges and opportunities of CPU-GPU systems.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Branover, A., Foley, D., Steinman, M.: AMD fusion APU: Llano. IEEE Micro 32(2), 28–37 (2012). doi:10.1109/MM.2012.2
Daga, M., Aji, A.M., Feng, W.c.: On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing. In: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, SAAHPC’11, Knoxville, pp. 141–149 (2011). doi:10.1109/SAAHPC.2011.29
Dally, B.: nvidia.com, PROJECT DENVER: Processor to usher in new era of computing. http://goo.gl/HepP5 (2011)
Desikan, R., Lefurgy, C., Keckler, S., Burger, D.: ibm.com, On-chip MRAM as a high-bandwidth low-latency replacement for DRAM physical memories. http://goo.gl/lyvV2 (2008)
Dong, X., Wu, X., Sun, G., Xie, Y., Li, H., Chen, Y.: Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement. In: Proceedings of the 45th Annual Design Automation Conference, DAC’08, Anaheim, pp. 554–559. ACM, New York (2008). doi:10.1145/1391469.1391610
Ferreira, A.P., Zhou, M., Bock, S., Childers, B., Melhem, R., Mossé, D.: Increasing PCM main memory lifetime. In: Proceedings of the Conference on Design, Automation and Test in Europe, DATE’10, Leuven, pp. 914–919. European Design and Automation Association, Leuven (2010)
Foley, D., Bansal, P., Cherepacha, D., Wasmuth, R., Gunasekar, A., Gutta, S., Naini, A.: A low-power integrated x86–64 and graphics processor for mobile computing devices. IEEE J. Solid-State Circuits 47(1), 220–231 (2012)
Gutta, S.R., Foley, D., Naini, A., Wasmuth, R., Cherepacha, D.: A low-power integrated x86-64 and graphics processor for mobile computing devices. In: Proceedings of the 2011 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, pp. 270–272. IEEE (2011). doi:10.1109/ISSCC.2011.5746314
Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach, 4th edn. Morgan Kaufmann, Burlington (2007)
Jablin, T.B., Prabhu, P., Jablin, J.A., Johnson, N.P., Beard, S.R., August, D.I.: Automatic CPU-GPU communication management and optimization. SIGPLAN Notice 47(6), 142–151 (2011). doi:10.1145/ 2345156.1993516
Ji, F., Aji, A.M., Dinan, J., Buntinas, D., Balaji, P., Feng, W.c., Ma, X.: Efficient intranode communication in GPU-accelerated systems. In: Proceedings of the 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, IPDPSW’12, Shanghai, pp. 1838–1847 (2012). doi:10.1109/IPDPSW.2012.227
Jiang, L., Du, Y., Zhang, Y., Childers, B.R., 0002, J.Y.: LLS: Cooperative integration of wear-leveling and salvaging for PCM main memory. In: Proceedings of the 41st International Conference on Dependable Systems & Networks (DSN), Hong Kong, pp. 221–232. IEEE (2011). doi:10.1109/DSN.2011.5958221
Kawahara, T., Takemura, R., Miura, K., Hayakawa, J., Ikeda, S., Lee, Y., Sasaki, R., Goto, Y., Ito, K., Meguro, T.: 2Mb SPRAM (SPin-Transfer Torque RAM) with bit-by-bit bi-directional current write and parallelizing-direction current read. IEEE J. Solid-State Circuit 43(1), 109–120 (2008)
Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’09, Raleigh, pp. 101–110. ACM, New York (2009). doi:10.1145/1504176.1504194
Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News 38(3), 451–460 (2010). doi:10.1145/1816038.1816021
Meredith, J., Roth, P., Spafford, K., Vetter, J.: Performance implications of nonuniform device topologies in scalable heterogeneous architectures. IEEE Micro 31(5), 66–75 (2011). doi:10.1109/MM.2011.79
mpi-forum.org, MPI: A message-passing interface standard version 2.2. http://goo.gl/SEqm1 (2009)
Nere, A., Lipasti, M.: Cortical architectures on a GPGPU. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU’10, Pittsburgh, pp. 12–18. ACM, New York (2010). doi:10.1145/1735688.1735693
Qureshi, M.K., Franceschini, M., Lastras-Montaño, L.A.: Improving read performance of phase change memories via write cancellation and write pausing. In: Proceedings of the 16th International Symposium on High Performance Computer Architecture (HPCA), Bangalore, pp. 1–11. IEEE Computer Society (2010). doi:10.1109/HPCA.2010. 5416645
Qureshi, M.K., Franceschini, M.M., Lastras-Montaño, L.A., Karidis, J.P.: Morphable memory system: a robust architecture for exploiting multi-level phase change memories. SIGARCH Comput. Archit. News 38(3), 153–162 (2010). doi:10.1145/1816038.1815981
Qureshi, M.K., Karidis, J., Franceschini, M., Srinivasan, V., Lastras, L., Abali, B.: Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, New York, pp. 14–23. ACM, New York (2009). doi:10.1145/1669112. 1669117
Qureshi, M.K., Srinivasan, V., Rivers, J.A.: Scalable high performance main memory system using phase-change memory technology. In: Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA’09, Austin, pp. 24–33. ACM, New York (2009). doi:10.1145/1555754.1555760
Rauchwerger, L., Amato, N., Padua, D.: A scalable method for run-time loop parallelization. Int. J. Parallel Program. 23(6), 537–576 (1995)
Saltz, J., Mirchandaney, R., Crowley, K.: Run-time parallelization and scheduling of loops. IEEE Trans. Comput. 40(5), 603–612 (1991)
Spafford, K.L., Meredith, J.S., Lee, S., Li, D., Roth, P.C., Vetter, J.S.: The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In: Proceedings of the 9th Conference on Computing Frontiers, CF’12, Caligari, pp. 103–112. ACM, New York (2012). doi:10. 1145/2212908.2212924
Sun, G., Dong, X., Xie, Y., Li, J., Chen, Y.: A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In: IEEE Symposium on High-Performance Computer Architecture (HPCA), Los Alamitos, pp. 239–249. IEEE Computer Society, Los Alamitos (2009). doi:10.1109/ HPCA.2009.4798259
Ueng, S.Z., Lathara, M., Baghsorkhi, S.S., Hwu, W.M.W.: CUDA-Lite: reducing GPU programming complexity. Languages and Compilers for Parallel Computing, pp. 1–15. Springer, Berlin/Heidelberg (2008)
Venkatasubramanian, S., Vuduc, R.W.: Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In: Proceedings of the 23rd International Conference on Supercomputing, ICS’09, Yorktown Heights, pp. 244–255. ACM, New York (2009). doi:10.1145/1542275.1542312
Ware, M., Rajamani, K., Floyd, M., Brock, B., Rubio, J., Rawson, F., Carter, J.: Architecting for power management: The IBM®; POWER7TM approach. In: Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture, HPCA’10, Bangalore, pp. 1–11 (2010). doi:10.1109/HPCA.2010.5416627
Wolfe, M.: Implementing the PGI accelerator model. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU’10, Pittsburgh, pp. 43–50. ACM, New York (2010). doi:10.1145/1735688.1735697
Xu, W., Sun, H., Wang, X., Chen, Y., Zhang, T.: Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM). IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 19(3), 483–493 (2011). doi:10.1109/TVLSI.2009.2035509
Yan, Y., Grossman, M., Sarkar, V.: JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In: Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par’09, Delft, pp. 887–899. Springer-Verlag, Berlin/Heidelberg (2009). doi:10.1007/978-3-642-03869-3_82
Yang, Y., Xiang, P., Kong, J., Mantor, M., Zhou, H.: A unified optimizing compiler framework for different GPGPU architectures. ACM Trans. Archit. Code Optim. 9(2), 9:1–9:33 (2012). doi:10.1145/2207222.2207225
Yuffe, M., Knoll, E., Mehalel, M., Shor, J., Kurts, T.: A fully integrated multi-CPU, GPU and memory controller 32nm processor. In: Proceedings of the 2011 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, pp. 264–266 (2011). doi:10.1109/ISSCC.2011.5746311
Zhao, J., Sun, G., Loh, G., Xie, Y.: Energy-efficient GPU design with configurable package graphic memory. In: ISLPED’12, Redondo Beach, pp. 403–408 (2012)
Zhou, P., Zhao, B., Yang, J., Zhang, Y.: Energy reduction for STT-RAM using early write termination. In: Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD’09, New York, pp. 264–268. ACM, New York (2009). doi:10.1145/1687399. 1687448
Zidan, M.A., Bonny, T., Salama, K.N.: High performance technique for database applications using a hybrid GPU/CPU platform. In: Proceedings of the 21st Edition of the Great Lakes Symposium on VLSI, GLSVLSI’11, Lausanne, pp. 85–90. ACM, New York (2011). doi:10.1145/1973009.1973027
Acknowledgements
This material is based upon work partially supported by the National Science Foundation (NSF) grant CNS-1116171 and the Air Force Research Laboratory (AFRL) Visiting Faculty Research Program (VFRP) extension grant LRIR 11RI01COR. We are grateful to Prof. Hai (Helen) Li from the University of Pittsburgh Department of Electrical and Computer Engineering for generous help.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Chen, Y., Guo, J., Sun, Z. (2014). CPU-GPU System Designs for High Performance Cloud Computing. In: Han, K., Choi, BY., Song, S. (eds) High Performance Cloud Auditing and Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3296-8_11
Download citation
DOI: https://doi.org/10.1007/978-1-4614-3296-8_11
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-3295-1
Online ISBN: 978-1-4614-3296-8
eBook Packages: EngineeringEngineering (R0)