On the Acceleration of Graph500: Characterizing PCIe Overheads with Multi-GPUs

Daga, Mayank

doi:10.1007/978-3-319-61982-8_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10150)

Included in the following conference series:

International Conference on Vector and Parallel Processing

Abstract

Graphics Processing Units (GPUs) have fundamentally altered the approach to parallel computing despite the substantial PCIe overheads that they incur. To maximize performance-per-dollar, systems are now being deployed with multiple GPUs in the same node. However, multiple GPUs exacerbate the PCIe overheads by imposing additional data-movement penalties when non-local data is moved.

In this paper, we first evaluate the PCIe performance loss that occurs due to improper affinity between CPUs and GPUs, using a PCIeBandwidth benchmark specifically developed for systems with multiple GPUs. Our experiments demonstrate that the performance loss can be up to 2.5× on a single GPU and up to 4.4× when four GPUs are used. We then apply the lessons from the PCIe studies to optimize and accelerate the Graph500 benchmark on a 4-GPU, multi-socket system. Our optimization techniques include binding the CPU threads to appropriate cores and carefully partitioning the data for every GPU. We achieve a speedup of 1.8× over a single-GPU implementation.
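The two optimizations named in the abstract, affinity-aware thread binding and per-GPU data partitioning, can be illustrated with a minimal sketch. This is not the paper's implementation: the topology tables `GPU_TO_SOCKET` and `SOCKET_TO_CORES`, the helper names, and the choice of a contiguous 1D vertex split are all assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical topology (an assumption, not taken from the paper):
# which CPU socket each GPU is attached to, and which cores each socket owns.
GPU_TO_SOCKET: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}
SOCKET_TO_CORES: Dict[int, List[int]] = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def local_cores(gpu: int) -> List[int]:
    """Return the cores on the socket that the given GPU is attached to."""
    return SOCKET_TO_CORES[GPU_TO_SOCKET[gpu]]


def partition_vertices(num_vertices: int, num_gpus: int) -> List[Tuple[int, int]]:
    """Split [0, num_vertices) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(num_vertices, num_gpus)
    ranges = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional vertex each.
        size = base + (1 if gpu < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def plan(num_vertices: int) -> Dict[int, dict]:
    """Pair every GPU with its socket-local cores and its vertex range."""
    gpus = sorted(GPU_TO_SOCKET)
    parts = partition_vertices(num_vertices, len(gpus))
    return {
        gpu: {"cores": local_cores(gpu), "vertices": parts[i]}
        for i, gpu in enumerate(gpus)
    }


if __name__ == "__main__":
    for gpu, entry in plan(10).items():
        print(gpu, entry)
```

On Linux, the host thread that drives a given GPU could then be pinned to the planned cores with `os.sched_setaffinity(0, cores)`, so that its PCIe transfers originate from the socket local to that GPU. Real systems would query the topology from the platform rather than hard-code it.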

Author information

Authors and Affiliations

AMD Research, Advanced Micro Devices, Inc., Sunnyvale, USA
Mayank Daga

Corresponding author

Correspondence to Mayank Daga.

Editor information

Editors and Affiliations

University of Porto, Porto, Portugal
Inês Dutra
University of Porto, Porto, Portugal
Rui Camacho
University of Porto, Porto, Portugal
Jorge Barbosa
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Osni Marques

About this paper

Cite this paper

Daga, M. (2017). On the Acceleration of Graph500: Characterizing PCIe Overheads with Multi-GPUs. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_12

DOI: https://doi.org/10.1007/978-3-319-61982-8_12
Published: 14 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61981-1
Online ISBN: 978-3-319-61982-8
eBook Packages: Computer Science, Computer Science (R0)