Understanding the SIMD Efficiency of Graph Traversal on GPU

Cheng, Yichao; An, Hong; Chen, Zhitao; Li, Feng; Wang, Zhaohui; Jiang, Xia; Peng, Yi

doi:10.1007/978-3-319-11197-1_4


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8630)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing


Abstract

Graphs are a widely used data structure, and graph algorithms such as breadth-first search (BFS) are key components of a great number of applications. Recent studies have attempted to accelerate graph algorithms on highly parallel graphics processing units (GPUs). Although many graph algorithms on large graphs exhibit abundant parallelism, their performance on GPUs still faces formidable challenges, one of which is mapping irregular computation onto the GPU's vectorized execution model.
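A minimal level-synchronous BFS over a graph stored in compressed sparse row (CSR) form shows where the irregularity comes from: every vertex in the current frontier can be processed in parallel, but each one has a different number of neighbors to scan. This is a sequential Python sketch for illustration only; the CSR layout and the names `row_ptr`, `col_idx` and `bfs_levels` are our assumptions, not taken from the paper.

```python
from typing import List


def bfs_levels(row_ptr: List[int], col_idx: List[int], source: int) -> List[int]:
    """Level-synchronous BFS on a CSR graph.

    Returns the BFS level of every vertex, or -1 for unreachable vertices.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        next_frontier = []
        # On a GPU, iterations of this loop are the parallel tasks that get
        # mapped onto threads or warps; their cost varies with vertex degree.
        for u in frontier:
            for e in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[e]
                if level[v] == -1:
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level
```

For example, the directed graph with edges 0→1, 0→2, 1→3, 2→3 and an isolated vertex 4 is `row_ptr = [0, 2, 3, 4, 4, 4]`, `col_idx = [1, 2, 3, 3]`, and `bfs_levels(row_ptr, col_idx, 0)` gives `[0, 1, 1, 2, -1]`.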

In this paper, we investigate the link between graph topology and the performance of BFS on GPUs. We introduce a novel model to analyze the components of SIMD underutilization, and show that SIMD lanes are wasted either because of workload imbalance between tasks or because of the heterogeneity of each task. We also develop corresponding metrics to quantify the SIMD efficiency of BFS on GPUs. Finally, we demonstrate the applicability of these metrics by using them to profile the performance of different mapping strategies.
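To make lane waste concrete, the following sketch counts useful versus issued lane-steps for two common ways of mapping BFS frontier vertices onto SIMD lanes: one vertex per lane ("thread-mapped"), where a warp runs as long as its highest-degree lane, and one vertex per warp ("warp-mapped"), where lanes stride over a single neighbor list and idle in the last partial step. This is our own simplified illustration, not the model or metrics defined in the paper; it assumes each neighbor visit costs one step, ignores divergence inside the loop body, and takes a warp width of 32.

```python
from typing import Sequence

WARP_WIDTH = 32  # assumed SIMD width (an NVIDIA warp)


def thread_mapped_efficiency(degrees: Sequence[int], width: int = WARP_WIDTH) -> float:
    """One frontier vertex per lane; each warp runs for the max degree in its group."""
    useful = sum(degrees)
    issued = 0
    for start in range(0, len(degrees), width):
        group = degrees[start:start + width]
        # All `width` lanes stay occupied until the longest neighbor list finishes.
        issued += width * max(group)
    return useful / issued if issued else 1.0


def warp_mapped_efficiency(degrees: Sequence[int], width: int = WARP_WIDTH) -> float:
    """One frontier vertex per warp; lanes stride over that vertex's neighbors."""
    useful = sum(degrees)
    # A degree-d vertex needs ceil(d / width) steps of the whole warp.
    issued = sum(width * ((d + width - 1) // width) for d in degrees)
    return useful / issued if issued else 1.0
```

With a toy width of 4 and frontier degrees `[1, 1, 1, 9]`, the thread-mapped efficiency is 12/36 = 1/3 because three lanes idle while one scans nine neighbors, and the warp-mapped efficiency is 12/24 = 0.5 because low-degree vertices leave most of the warp idle. Uniform degrees of 2 reverse the ranking (1.0 versus 0.5), which loosely illustrates why the better mapping depends on graph topology.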

We thank Xiaoqiang Li, Haibo Zhang and Tao Wang for their constructive feedback. This work is supported by the National Hi-tech Research and Development Program of China under contract 2012AA010902 and by the National Basic Research Program of China under contract 2011CB302501.



Author information

Authors and Affiliations

University of Science and Technology of China, Hefei, China
Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang & Yi Peng


Editor information

Editors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 60616-3793, Chicago, IL, USA
Xian-he Sun
School of Computer Science and Technology, Dalian Maritime University, 1 Linghai Road, 116026, Dalian, China
Wenyu Qu
University of Ottawa, SEECS, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Wanlei Zhou
Dalian Maritime University, No. 1 Linghai Road, 116026, Dalian, China
Zhiyang Li & Tingting Yang
BeiHang University, XueYuan Road No. 37, HaiDian District, Beijing, China
Hua Guo
University of Bradford, BD7 1DP, Bradford, West Yorkshire, United Kingdom
Geyong Min
Computer Network Information Center, Chinese Academy of Sciences, 100190, Beijing, China
Yulei Wu
27 Shanda Nanlu, 250100, Jinan City, Shandong Province, China
Lei Liu

About this paper

Cite this paper

Cheng, Y. et al. (2014). Understanding the SIMD Efficiency of Graph Traversal on GPU. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_4

DOI: https://doi.org/10.1007/978-3-319-11197-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11196-4
Online ISBN: 978-3-319-11197-1
eBook Packages: Computer Science, Computer Science (R0)
