Abstract
Graphs are a widely used data structure, and graph algorithms such as breadth-first search (BFS) are key components in a great number of applications. Recent studies have attempted to accelerate graph algorithms on highly parallel graphics processing units (GPUs). Although many graph algorithms on large graphs exhibit abundant parallelism, their performance on GPUs still faces formidable challenges, one of which is mapping the irregular computation onto the GPU's vectorized execution model.
In this paper, we investigate the link between graph topology and the performance of BFS on GPUs. We introduce a novel model to analyze the components of SIMD underutilization. We show that SIMD lanes are wasted due either to the workload imbalance between tasks or to the heterogeneity of each task. We also develop corresponding metrics to quantify the SIMD efficiency of BFS on GPUs. Finally, we demonstrate the applicability of the metrics by using them to profile the performance of different mapping strategies.
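As a rough illustration of the first source of waste named above (workload imbalance between tasks), the following Python sketch estimates lane utilization when each SIMD lane expands one frontier vertex. The function name `simd_efficiency`, the one-vertex-per-lane mapping, and the warp width of 32 are assumptions made for this illustration; this is not the model or the metrics defined in the paper.

```python
from typing import List

WARP_SIZE = 32  # assumed SIMD width (one NVIDIA warp); illustrative only


def simd_efficiency(frontier_degrees: List[int], warp_size: int = WARP_SIZE) -> float:
    """Hypothetical estimate of the fraction of lane-steps doing useful work.

    Assumes each lane expands one frontier vertex and loops over its
    neighbors, so a warp stays busy until its highest-degree lane finishes.
    """
    useful = 0  # lane-steps that inspect a real edge
    issued = 0  # lane-steps the warp occupies, whether busy or idle
    for start in range(0, len(frontier_degrees), warp_size):
        group = frontier_degrees[start:start + warp_size]
        useful += sum(group)
        # Every lane of the warp is held for as long as the longest lane runs.
        issued += max(group) * warp_size
    # An empty or edge-free frontier wastes nothing by this definition.
    return useful / issued if issued else 1.0


if __name__ == "__main__":
    uniform = [4] * 32          # every vertex has the same degree
    skewed = [1] * 31 + [32]    # one high-degree vertex stalls the warp
    print(simd_efficiency(uniform))
    print(simd_efficiency(skewed))
```

Under these assumptions, a frontier with uniform degrees keeps every lane busy, while a single high-degree vertex leaves the other 31 lanes idle for most of the warp's lifetime. A graph with a skewed degree distribution would therefore be expected to score poorly under this naive mapping.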
We thank Xiaoqiang Li, Haibo Zhang and Tao Wang for their constructive feedback. This work is supported financially by the National Hi-tech Research and Development Program of China under contract 2012AA010902 and the National Basic Research Program of China under contract 2011CB302501.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cheng, Y. et al. (2014). Understanding the SIMD Efficiency of Graph Traversal on GPU. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_4
DOI: https://doi.org/10.1007/978-3-319-11197-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11196-4
Online ISBN: 978-3-319-11197-1
eBook Packages: Computer Science, Computer Science (R0)