On the Acceleration of Graph500: Characterizing PCIe Overheads with Multi-GPUs

  • Conference paper
High Performance Computing for Computational Science – VECPAR 2016 (VECPAR 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10150)

Abstract

Graphics Processing Units (GPUs) have fundamentally altered the approach to parallel computing despite the substantial PCIe overheads that they incur. To maximize performance-per-dollar, systems are now being deployed with multiple GPUs in the same node. However, multiple GPUs exacerbate the PCIe overheads by imposing additional data-movement penalties when moving non-local data.

In this paper, we first evaluate the PCIe performance loss that occurs due to improper affinity between CPUs and GPUs, using a PCIeBandwidth benchmark specifically developed for systems with multiple GPUs. Our experiments demonstrate that the performance loss can be up to 2.5× on a single GPU and up to 4.4× when four GPUs are used. We then apply the lessons from the PCIe studies to optimize and accelerate the Graph500 benchmark on a 4-GPU, multi-socket system. Our optimization techniques include binding the CPU threads to appropriate cores as well as carefully partitioning the data for every GPU. We achieve a speedup of 1.8× over a single-GPU implementation.
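
The two optimizations named above (binding CPU threads to appropriate cores and partitioning data per GPU) can be illustrated with a minimal planning sketch. It is not the paper's implementation: the abstract does not specify the topology or the partitioning scheme, so the GPU-to-socket map, the core lists, and the contiguous block partition of vertices below are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical topology (assumption): two sockets, two GPUs attached to each.
GPU_TO_NODE: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1}
NODE_TO_CORES: Dict[int, List[int]] = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def local_cores(gpu: int) -> List[int]:
    """Return the cores on the socket the given GPU is attached to."""
    return NODE_TO_CORES[GPU_TO_NODE[gpu]]


def partition_vertices(num_vertices: int, num_gpus: int) -> List[Tuple[int, int]]:
    """Split [0, num_vertices) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(num_vertices, num_gpus)
    ranges = []
    start = 0
    for g in range(num_gpus):
        # The first `extra` GPUs take one additional vertex each.
        size = base + (1 if g < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def build_plan(num_vertices: int) -> Dict[int, dict]:
    """Pair every GPU with its socket-local cores and its vertex range."""
    gpus = sorted(GPU_TO_NODE)
    parts = partition_vertices(num_vertices, len(gpus))
    return {
        g: {"cores": local_cores(g), "vertices": parts[i]}
        for i, g in enumerate(gpus)
    }
```

On Linux, a host thread that drives a given GPU could apply its entry in the plan with `os.sched_setaffinity(0, cores)` before issuing transfers, so that its PCIe traffic originates from the socket nearest that GPU. On a real machine the topology would be queried rather than hard-coded.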

Author information

Corresponding author

Correspondence to Mayank Daga.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Daga, M. (2017). On the Acceleration of Graph500: Characterizing PCIe Overheads with Multi-GPUs. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_12

  • DOI: https://doi.org/10.1007/978-3-319-61982-8_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61981-1

  • Online ISBN: 978-3-319-61982-8

  • eBook Packages: Computer Science, Computer Science (R0)
