# A Survey on Graph Processing Accelerators: Challenges and Opportunities

- 7 Downloads

## Abstract

Graph is a well known data structure to represent the associated relationships in a variety of applications, e.g., data science and machine learning. Despite a wealth of existing efforts on developing graph processing systems for improving the performance and/or energy efficiency on traditional architectures, dedicated hardware solutions, also referred to as graph processing accelerators, are essential and emerging to provide the benefits significantly beyond what those pure software solutions can offer. In this paper, we conduct a systematical survey regarding the design and implementation of graph processing accelerators. Specifically, we review the relevant techniques in three core components toward a graph processing accelerator: preprocessing, parallel graph computation, and runtime scheduling. We also examine the benchmarks and results in existing studies for evaluating a graph processing accelerator. Interestingly, we find that there is not an absolute winner for all three aspects in graph acceleration due to the diverse characteristics of graph processing and the complexity of hardware configurations. We finally present and discuss several challenges in details, and further explore the opportunities for the future research.

## Keywords

graph processing accelerator domain-specific architecture performance energy efficiency## Preview

Unable to display preview. Download preview PDF.

## Supplementary material

## References

- [1]Malewicz G, Austern M H, Bik A J, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In
*Proc. ACM SIGMOD Int. Conf. Management of Data*, June 2010, pp.135-146.Google Scholar - [2]Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein J M. Distributed GraphLab: A framework for machine learning and data mining in the cloud.
*Proceedings of the VLDB Endowment*, 2012, 5(8): 716-727.CrossRefGoogle Scholar - [3]Shun J, Blelloch G E. Ligra: A lightweight graph processing framework for shared memory. In
*Proc. the 18th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, February 2013, pp.135-146.Google Scholar - [4]Kyrola A, Blelloch G E, Guestrin C. GraphChi: Large-scale graph computation on just a PC. In
*Proc. the 10th USENIX Conf. Operating Systems Design and Implementation*, October 2012, pp.31-46.Google Scholar - [5]Roy A, Mihailovic I, Zwaenepoel W. X-Stream: Edge-centric graph processing using streaming partitions. In
*Proc. the 24th ACM SIGOPS Symp. Operating Systems Principles*, November 2013, pp.472-488.Google Scholar - [6]Zhong J, He B. Medusa: A parallel graph processing system on graphics processors.
*ACM SIGMOD Record*, 2014, 43(2): 35-40.CrossRefGoogle Scholar - [7]Khorasani F, Vora K, Gupta R, Bhuyan L N. CuSha: Vertex-centric graph processing on GPUs. In
*Proc. the 23rd Int. Symp. High-Performance Parallel and Distributed Computing*, June 2014, pp.239-252.Google Scholar - [8]Wang Y, Davidson A, Pan Y, Wu Y, Riffel A, Owens J D. Gunrock: A high-performance graph processing library on the GPU. In
*Proc. the 21st ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, March 2016, Article No. 11.Google Scholar - [9]Shi X, Luo X, Liang J, Zhao P, Di S, He B, Jin H. Frog: Asynchronous graph processing on GPU with hybrid coloring model.
*IEEE Trans. Knowledge and Data Engineering*, 2018, 30(1): 29-42.CrossRefGoogle Scholar - [10]Fu Z, Personick M, Thompson B. MapGraph: A high level API for fast development of high performance graph analytics on GPUs. In
*Proc. the 2nd International Workshop on Graph Data Management Experiences and Systems*, June 2014, Article No. 2.Google Scholar - [11]Liu H, Huang H H. Enterprise: Breadth-first graph traversal on GPUs. In
*Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2015, Article No. 68.Google Scholar - [12]Beamer S, Asanovic K, Patterson D. Locality exists in graph processing: Workload characterization on an ivy bridge server. In
*Proc. IEEE Int. Symp. Workload Characterization*, November 2015, pp.56-65.Google Scholar - [13]Malicevic J, Lepers B, Zwaenepoel W. Everything you always wanted to know about multicore graph processing but were afraid to ask. In
*Proc. the 2017 USENIX Annual Technical Conf.*, July 2017, pp.631-643.Google Scholar - [14]Nai L, Hadidi R, Sim J, Kim H, Kumar P, Kim H. Graph-PIM: Enabling instruction-level PIM offloading in graph computing frameworks. In
*Proc. the 2007 IEEE Int. Symp. High Performance Computer Architecture*, February 2017, pp.457-468.Google Scholar - [15]Yao P, Zheng L, Liao X, Jin H, He B. An efficient graph accelerator with parallel data conflict management. In
*Proc. Int. Conf. Parallel Architectures and Compilation Techniques*, November 2018, Article No. 8.Google Scholar - [16]Ham T J, Wu L, Sundaram N, Satish N, Martonosi M. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In
*Proc. the 49th Annual IEEE/ACM Int. Symp. Microarchitecture*, October 2016, Article No. 56.Google Scholar - [17]Nai L, Xia Y, Tanase I G, Kim H, Lin C Y. GraphBIG: Understanding graph computing in the context of industrial solutions. In
*Proc. Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2015, Article No. 69.Google Scholar - [18]Satish N, Sundaram N, Patwary M M, Seo J, Park J, Hassaan M A, Sengupta S, Yin Z, Dubey P. Navigating the maze of graph analytics frameworks using massive graph datasets. In
*Proc. ACM SIGMOD Int. Conf. Management of Data*, June 2014, pp.979-990.Google Scholar - [19]Ben-Nun T, Sutton M, Pai S, Pingali K. Groute: An asynchronous multi-GPU programming model for irregular computations. In
*Proc. the 22nd ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, February 2017, pp.235-248.Google Scholar - [20]Hennessy J, Patterson D. Domain specific architectures. In
*Computer Architecture: A Quantitative Approach*(6*th edition*), Merken S, McFadden N (eds.), Elsevier, 2017, pp.540-606.Google Scholar - [21]Ceze L, Hill M D, Sankaralingam K, Wenisch T F. Democratizing design for future computing platforms. arXiv:1706.08597, 2017. http://arxiv.org/abs/1706.08597, Jun. 2017.
- [22]Lee Y, Waterman A, Cook H
*et al*. An agile approach to building RISC-V microprocessors.*IEEE Micro*, 2016, 36(2): 8-20.CrossRefGoogle Scholar - [23]Caulfield A M, Chung E S, Putnam A
*et al*. A cloud-scale acceleration architecture. In*Proc. the 49th Annual IEEE/ACM Int. Symp. Microarchitecture*, October 2016, Article No. 7.Google Scholar - [24]de Lorimier M, Kapre N, Mehta N
*et al*. GraphStep: A system architecture for sparse-graph algorithms. In*Proc. the 14th Annual IEEE Symp. Field-Programmable Custom Computing Machines*, April 2006, pp.143-151.Google Scholar - [25]Attia O G, Johnson T, Townsend K, Jones P, Zambreno J. CyGraph: A reconfigurable architecture for parallel breadth-first search. In
*Proc. the 2004 Int. Parallel and Distributed Processing Symp. Workshops*, May 2014, pp.228-235.Google Scholar - [26]Dai G, Chi Y, Wang Y, Yang H. FPGP: Graph processing framework on FPGA a case study of breadth-first search. In
*Proc. the 2006 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2016, pp.105-110.Google Scholar - [27]Zhou S, Chelmis C, Prasanna V K. High-throughput and energy-efficient graph processing on FPGA. In
*Proc. the 24th IEEE Annual Int. Symp. Field-Programmable Custom Computing Machines*, May 2016, pp.103-110.Google Scholar - [28]Dai G, Huang T, Chi Y, Xu N, Wang Y, Yang H. Fore-Graph: Exploring large-scale graph processing on multi-FPGA architecture. In
*Proc. the 2017 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2017, pp.217-226.Google Scholar - [29]Ozdal M M, Yesil S, Kim T, Ayupov A, Greth J, Burns S, Özturk Ö. Energy efficient architecture for graph analytics accelerators. In
*Proc. the 43rd ACM/IEEE Annual Int. Symp. Computer Architecture*, June 2016, pp.166-177.Google Scholar - [30]Zhou J, Liu S, Guo Q, Zhou X, Zhi T, Liu D, Wang C, Zhou X, Chen Y, Chen T. TuNao: A high-performance and energy-efficient reconfigurable accelerator for graph processing. In
*Proc. the 17th IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing*, May 2017, pp.731-734.Google Scholar - [31]Ayupov A, Yesil S, Ozdal M M, Kim T, Burns S, Özturk Ö. A template-based design methodology for graph-parallel hardware accelerators.
*IEEE Trans. Computer Aided Design of Integrated Circuits and Systems*, 2018, 37(2): 420-430.CrossRefGoogle Scholar - [32]Ahn J, Hong S, Yoo S, Mutlu O, Choi K. A scalable processing-in-memory accelerator for parallel graph processing. In
*Proc. the 42nd ACM/IEEE Annual Int. Symp. Computer Architecture*, June 2015, pp.105-117.Google Scholar - [33]Pawlowski J T. Hybrid memory cube (HMC). In
*Proc. the 23rd IEEE Hot Chips Symp.*, August 2011, Article No. 15.Google Scholar - [34]Kim J, Kim Y. HBM: Memory solution for bandwidth-hungry processors. In
*Proc. the 26th IEEE Hot Chips Symp.*, August 2014, Article No. 19.Google Scholar - [35]Wong H S, Lee H Y, Yu S, Chen Y S, Wu Y, Chen P S, Lee B, Chen F T, Tsai M J. Metal-oxide RRAM.
*Proceedings of the IEEE*, 2012, 100(6): 1951-1970.CrossRefGoogle Scholar - [36]Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the web. Technical Report, Stanford InfoLab, 1999. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf, Jan. 2019.
- [37]McCune R R, Weninger T, Madey G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing.
*ACM Trans. Computing Surveys*, 2015, 48(2): Article No. 25.Google Scholar - [38]Shi X, Zheng Z, Zhou Y, Jin H, He L, Liu B, Hua Q. Graph processing on GPUs: A survey.
*ACM Trans. Computing Surveys*, 2018, 50(6): Article No. 81.Google Scholar - [39]Heidari S, Simmhan Y, Calheiros R N, Buyya R. Scalable graph processing frameworks: A taxonomy and open challenges.
*ACM Trans. Computing Surveys*, 2018, 51(3): Article No. 60.Google Scholar - [40]Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. Power-Graph: Distributed graph-parallel computation on natural graphs. In
*Proc. the 10th USENIX Symp. Operating Systems Design and Implementation*, October 2012, pp.17-30.Google Scholar - [41]Avery C. Giraph: Large-scale graph processing infrastructure on Hadoop. In
*Proc. the 2011 Hadoop Summit*, June 2011, pp.5-9.Google Scholar - [42]Gonzalez J E, Xin R S, Dave A, Crankshaw D, Franklin M J, Stoica I. GraphX: Graph processing in a distributed dataflow framework. In
*Proc. the 11th USENIX Symp. Operating Systems Design and Implementation*, October 2014, pp.599-613.Google Scholar - [43]Teixeira C H, Fonseca A J, Serafini M, Siganos G, Zaki M J, Aboulnaga A. Arabesque: A system for distributed graph mining. In
*Proc. the 25th Symp. Operating Systems Principles*, October 2015, pp.425-440.Google Scholar - [44]Chen R, Shi J, Chen Y, Chen H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In
*Proc. the 10th European Conf. Computer Systems*, April 2015, Article No. 1.Google Scholar - [45]Zhu X, Chen W, Zheng W, Ma X. Gemini: A computation-centric distributed graph processing system. In
*Proc. the 12th USENIX Symp. Operating Systems Design and Implementation*, November 2016, pp.301-316.Google Scholar - [46]Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P. Mizan: A system for dynamic load balancing in large-scale graph processing. In
*Proc. the 8th ACM European Conf. Computer Systems*, April 2013, pp.169-182.Google Scholar - [47]Randles M, Lamb D, Taleb-Bendiab A. A comparative study into distributed load balancing algorithms for cloud computing. In
*Proc. the 24th IEEE Int. Conf. Advanced Information Networking and Applications Workshops*, April 2010, pp.551-556.Google Scholar - [48]Zhao Y, Yoshigoe K, Xie M, Zhou S, Seker R, Bian J. LightGraph: Lighten communication in distributed graph-parallel processing. In
*Proc. the 2004 IEEE Int. Congress on Big Data*, June 2014, pp.717-724.Google Scholar - [49]Wang P, Zhang K, Chen R, Chen H, Guan H. Replication-based fault-tolerance for large-scale graph processing. In
*Proc. the 44th Annual IEEE/IFIP Int. Conf. Dependable Systems and Networks*, June 2014, pp.562-573.Google Scholar - [50]Nguyen D, Lenharth A, Pingali K. A lightweight infrastructure for graph analytics. In
*Proc. the 24th ACM SIGOPS Symp. Operating Systems Principles*, November 2013, pp.456-471.Google Scholar - [51]Sundaram N, Satish N, Patwary M M, Dulloor S R, Anderson M J, Vadlamudi S G, Das D, Dubey P. GraphMat: High performance graph analytics made productive.
*Proceedings of the VLDB Endowment*, 2015, 8(11): 1214-1225.CrossRefGoogle Scholar - [52]Zhang K, Chen R, Chen H. NUMA-aware graph-structured analytics. In
*Proc. the 20th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, February 2015, pp.183-193.Google Scholar - [53]Han W S, Lee S, Park K, Lee J H, Kim M S, Kim J, Yu H. TurboGraph: A fast parallel graph engine handling billion-scale graphs in a single PC. In
*Proc. the 19th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, August 2013, pp.77-85.Google Scholar - [54]Yuan P, Zhang W, Xie C, Jin H, Liu L, Lee K. Fast iterative graph computation: A path centric approach. In
*Proc. the 2004 Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2014, pp.401-412.Google Scholar - [55]Zhu X, Han W, Chen W. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In
*Proc. the 2005 USENIX Annual Technical Conf.*, July 2015, pp.375-386.Google Scholar - [56]Chi Y, Dai G, Wang Y, Sun G, Li G, Yang H. NXgraph: An efficient graph processing system on a single machine. In
*Proc. the 32nd IEEE Int. Conf. Data Engineering*, May 2016, pp.409-420.Google Scholar - [57]Maass S, Min C, Kashyap S, Kang W, Kumar M, Kim T. Mosaic: Processing a trillion-edge graph on a single machine. In
*Proc. the 12th ACM European Conf. Computer Systems*, April 2017, pp.527-543.Google Scholar - [58]Seo H, Kim J, Kim M S. GStream: A graph streaming processing method for large-scale graphs on GPUs. In
*Proc. the 20th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, February 2015, pp.253-254.Google Scholar - [59]Soman J, Kishore K, Narayanan P J. A fast GPU algorithm for graph connectivity. In
*Proc. the 24th IEEE Int. Symp. Parallel*&*Distributed Processing, Workshops and PhD Forum*, April 2010, Article No. 87.Google Scholar - [60]McLaughlin A, Bader D A. Scalable and high performance betweenness centrality on the GPU. In
*Proc. the 2014 Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2014, pp.572-583.Google Scholar - [61]Sariyüce A E, Kaya K, Saule E, Çatalyürek Ü V. Betweenness centrality on GPUs and heterogeneous architectures. In
*Proc. the 6th Workshop on General Purpose Processor Using Graphics Processing Units*, March 2013, pp.76-85.Google Scholar - [62]Davidson A A, Baxter S, Garland M, Owens J D. Work-efficient parallel GPU methods for single-source shortest paths. In
*Proc. the 28th IEEE Int. Parallel and Distributed Processing Symp.*, May 2014, pp.349-359.Google Scholar - [63]Hong S, Chafi H, Sedlar E, Olukotun K. Green-Marl: A DSL for easy and efficient graph analysis. In
*Proc. the 17th Int. Conf. Architectural Support for Programming Languages and Operating Systems*, March 2012, pp.349-362.Google Scholar - [64]Gharaibeh A, Reza T, Santos-Neto E, Costa L B, Sallinen S, Ripeanu M. Efficient large-scale graph processing on hybrid CPU and GPU systems. arXiv:1312.3018, 2013. http://arxiv.org/abs/1312.3018, Dec. 2018.
- [65]Zhang T, Zhang J, Shu W, Wu M Y, Liang X. Efficient graph computation on hybrid CPU and GPU systems.
*The Journal of Supercomputing*, 2015, 71(4): 1563-1586.CrossRefGoogle Scholar - [66]Liu H, Huang H H, Hu Y. iBFS: Concurrent breadth-first search on GPUs. In
*Proc. the 2016 Int. Conf. Management of Data*, June 2016, pp.403-416.Google Scholar - [67]Sengupta D, Song S L, Agarwal K, Schwan K. GraphReduce: Processing large-scale graphs on accelerator-based systems. In
*Proc. the 2015 Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2015, Article No. 28.Google Scholar - [68]Kim M S, An K, Park H, Seo H, Kim J. GTS: A fast and scalable graph processing method based on streaming topology to GPUs. In
*Proc. the 2016 Int. Conf. Management of Data*, June 2016, pp.447-461.Google Scholar - [69]Han L, Shen Z, Shao Z, Huang H H, Li T. A novel ReRAM-based processing-in-memory architecture for graph computing. In
*Proc. the 6th IEEE Non-Volatile Memory Systems and Applications Symp.*, August 2017, Article No. 13.Google Scholar - [70]Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: Accelerating graph processing using ReRAM. In
*Proc. the 2018 IEEE Int. Symp. High Performance Computer Architecture*, February 2018, pp.531-543.Google Scholar - [71]Zhang J, Khoram S, Li J. Boosting the performance of FPGA-based graph processor using hybrid memory cube: A case for breadth first search. In
*Proc. the 2017 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2017, pp.207-216.Google Scholar - [72]Oguntebi T, Olukotun K. GraphOps: A dataflow library for graph analytics acceleration. In
*Proc. the 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2016, pp.111-117.Google Scholar - [73]Dai G, Huang T, Chi Y, Zhao J, Sun G, Liu Y, Wang Y, Xie Y, Yang H. GraphH: A processing-in-memory architecture for large-scale graph processing.
*IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems.*doi: https://doi.org/10.1109/TCAD.2018.2821565. - [74]Zhang J, Li J. Degree-aware hybrid graph traversal on FPGA-HMC platform. In
*Proc. the 2018 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2018, pp.229-238.Google Scholar - [75]Zhou S, Kannan R, Min Y, Prasanna V K. FASTCF: FPGA-based accelerator for stochastic-gradient-descent-based collaborative filtering. In
*Proc. the 2018 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2018, pp.259-268.Google Scholar - [76]Khoram S, Zhang J, Strange M, Li J. Accelerating graph analytics by co-optimizing storage and access on an FPGAHMC platform. In
*Proc. the 2018 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2018, pp.239-248.Google Scholar - [77]Han L, Shen Z, Liu D, Shao Z, Huang H H, Li T. A novel ReRAM-based processing-in-memory architecture for graph traversal.
*ACM Trans. Storage*, 2018, 14(1): Article No. 9.Google Scholar - [78]Wang Q, Jiang W, Xia Y, Prasanna V. A message-passing multi-softcore architecture on FPGA for breadth-first search. In
*Proc. the 2010 Int. Conf. Field-Programmable Technology*, December 2010, pp.70-77.Google Scholar - [79]Umuroglu Y, Morrison D, Jahre M. Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform. In
*Proc. the 25th Int. Conf. Field Programmable Logic and Applications*, September 2015, Article No. 12.Google Scholar - [80]Zhou S, Prasanna V K. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In
*Proc. the 29th Int. Symp. Computer Architecture and High Performance Computing*, October 2017, pp.137-144.Google Scholar - [81]Zhang M, Zhuo Y, Wang C, Gao M, Wu Y, Chen K, Kozyrakis C, Qian X. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In
*Proc. the 2018 IEEE Int. Symp. High Performance Computer Architecture*, February 2018, pp.544-557.Google Scholar - [82]Huang T, Dai G, Wang Y, Yang H. HyVE: Hybrid vertexedge memory hierarchy for energy-efficient graph processing. In
*Proc. the 2018 Design, Automation and Test in Europe Conference and Exhibition*, March 2018, pp.973-978.Google Scholar - [83]Ozdal M M, Yesil S, Kim T, Ayupov A, Greth J, Burns S, Ozturk O. Graph analytics accelerators for cognitive systems.
*IEEE Micro*, 2017, 37(1): 42-51.CrossRefGoogle Scholar - [84]Kapre N. Custom FPGA-based soft-processors for sparse graph acceleration. In
*Proc. the 26th IEEE Int. Conf. Application-Specific Systems, Architectures and Processors*, July 2015, pp.9-16.Google Scholar - [85]Betkaoui B, Thomas D B, Luk W, Przulj N. A framework for FPGA acceleration of large graph problems: Graphlet counting case study. In
*Proc. the 2011 Int. Conf. Field-Programmable Technology*, December 2011, Article No. 2.Google Scholar - [86]Betkaoui B, Wang Y, Thomas D B, Luk W. A reconfigurable computing approach for efficient and scalable parallel graph exploration. In
*Proc. the 23rd IEEE Int. Conf. Application-Specific Systems, Architectures and Processors*, July 2012, pp.8-15.Google Scholar - [87]Betkaoui B, Wang Y, Thomas D B, Luk W. Parallel FPGA-based all pairs shortest paths for sparse networks: A human brain connectome case study. In
*Proc. the 22nd Int. Conf. Field Programmable Logic and Applications*, August 2012, pp.99-104.Google Scholar - [88]Nurvitadhi E, Weisz G, Wang Y, Hurkat S, Nguyen M, Hoe J C, Martínez J F, Guestrin C. GraphGen: An FPGA framework for vertex-centric graph computation. In
*Proc. the 22nd IEEE Annual Int. Symp. Field-Programmable Custom Computing Machines*, May 2014, pp.25-28.Google Scholar - [89]Attia O G, Grieve A, Townsend K R, Jones P, Zambreno J. Accelerating all-pairs shortest path using a message-passing reconfigurable architecture. In
*Proc. the 2015 Int. Conf. Reconfigurable Computing and FPGAs*, December 2015, Article No. 5.Google Scholar - [90]Engelhardt N, So H K. GraVF: A vertex-centric distributed graph processing framework on FPGAs. In
*Proc. the 26th Int. Conf. Field Programmable Logic and Applications*, August 2016, Article No. 62.Google Scholar - [91]Jin H, Yao P, Liao X, Zheng L, Li X. Towards dataflow-based graph accelerator. In
*Proc. the 37th IEEE Int. Conf. Distributed Computing Systems*, June 2017, pp.1981-1992.Google Scholar - [92]Zhou S, Chelmis C, Prasanna V K. Accelerating large-scale single-source shortest path on FPGA. In
*Proc. the 2015 Int. Parallel and Distributed Processing Symposium Workshop*, May 2015, pp.129-136.Google Scholar - [93]Zhou S, Chelmis C, Prasanna V K. Optimizing memory performance for FPGA implementation of PageRank. In
*Proc. the 2015 Int. Conf. Reconfigurable Computing and FPGAs*, December 2015, Article No. 53.Google Scholar - [94]Jun S W, Wright A, Zhang S, Xu S, Arvind. GraFBoost: Using accelerated flash storage for external graph analytics. In
*Proc. the 45th ACM/IEEE Int. Symp. Computer Architecture*, June 2018, pp.411-424.Google Scholar - [95]Thomas D, Moorby P. The Verilog
^{®}Hardware Description Language (5th edition). Springer, 2002.Google Scholar - [96]Ashenden P J. The Designer’s Guide to VHDL (3rd edition). Morgan Kaufmann, 2008.Google Scholar
- [97]Lee J, Kim H, Yoo S, Choi K, Hofstee H P, Nam G J, Nutter M R, Jamsek D. ExtraV: Boosting graph processing near storage with a coherent accelerator.
*Proceedings of the VLDB Endowment*, 2017, 10(12): 1706-1717.CrossRefGoogle Scholar - [98]Kim G, Kim J, Ahn J H, Kim J. Memory-centric system interconnect design with hybrid memory cubes. In
*Proc. the 22nd Int. Conf. Parallel Architectures and Compilation Techniques*, September 2013, pp.145-155.Google Scholar - [99]Xu C, Niu D, Muralimanohar N, Balasubramonian R, Zhang T, Yu S, Xie Y. Overcoming the challenges of crossbar resistive memory architectures. In
*Proc. the 21st IEEE Int. Symp. High Performance Computer Architecture*, February 2015, pp.476-488.Google Scholar - [100]Do J, Kee Y S, Patel J M, Park C, Park K, DeWitt D J. Query processing on smart SSDs: Opportunities and challenges. In
*Proc. the 2013 ACM SIGMOD Int. Conf. Management of Data*, June 2013, pp.1221-1230.Google Scholar - [101]Jun S W, Liu M, Lee S, Hicks J, Ankcorn J, King M, Xu S, Arvind. BlueDBM: An appliance for big data analytics. In
*Proc. the 42nd ACM Annual Int. Symp. Computer Architecture*, June 2015, pp.1-13.Google Scholar - [102]Zhang J, Jung M. FlashAbacus: A self-governing flash-based accelerator for low-power systems. In
*Proc. the 13th EuroSys Conf.*, April 2018, Article No. 15.Google Scholar - [103]Ozdal M M. Emerging accelerator platforms for data centers.
*IEEE Design*&*Test*, 2018, 35(1): 47-54.Google Scholar - [104]Weisz G, Melber J, Wang Y, Fleming K, Nurvitadhi E, Hoe J C. A study of pointer-chasing performance on shared-memory processor-FPGA systems. In
*Proc. the 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2016, pp.264-273.Google Scholar - [105]Gu B, Yoon A S, Bae D H, Jo I, Lee J, Yoon J, Kang J U, Kwon M, Yoon C, Cho S, Jeong J, Chang D. Biscuit: A framework for near-data processing of big data workloads. In
*Proc. the 43rd Int. Symp. Computer Architecture*, June 2016, pp.153-165.Google Scholar - [106]Son Y, Choi J, Jeon J, Min C, Kim S, Yeom H Y, Han H. SSD-assisted backup and recovery for database systems. In
*Proc. the 33rd IEEE Int. Conf. Data Engineering*, April 2017, pp.285-296.Google Scholar - [107]Song W S, Gleyzer V, Lomakin A, Kepner J. Novel graph processor architecture, prototype system, and results. In
*Proc. the 2016 IEEE High Performance Extreme Computing Conference*, September 2016, Article No. 59.Google Scholar - [108]Jin H, Yao P, Liao X. Towards dataflow based graph processing.
*Science China Information Sciences*, 2017, 60(12): Article No. 126102.Google Scholar - [109]Windh S, Budhkar P, Najjar W A. CAMs as synchronizing caches for multithreaded irregular applications on FPGAs. In
*Proc. the 2015 ACM/IEEE Int. Conf. Computer-Aided Design*, November 2015, pp.331-336.Google Scholar - [110]Wang L, Yang X, Dai H. Scratchpad memory allocation for arrays in permutation graphs.
*Science China Information Sciences*, 2013, 56(5): 1-13.MathSciNetGoogle Scholar - [111]Gao M, Ayers G, Kozyrakis C. Practical near-data processing for in-memory analytics frameworks. In
*Proc. the 2015 Int. Conf. Parallel Architecture and Compilation*, October 2015, pp.113-124.Google Scholar - [112]Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the Internet topology.
*ACM SIGCOMM Computer Communication Review*, 1999, 29(4): 251-262.CrossRefGoogle Scholar - [113]Xie C, Chen R, Guan H, Zang B, Chen H. SYNC or ASYNC: Time to fuse for distributed graph-parallel computation. In
*Proc. the 20th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming*, February 2015, pp.194-204.Google Scholar - [114]Ozdal M M, Yesil S, Kim T, Ayupov A, Burns S, Ozturk O. Architectural requirements for energy efficient execution of graph analytics applications. In
*Proc. the 2015 IEEE/ACM Int. Conf. Computer-Aided Design*, November 2015, pp.676-681.Google Scholar - [115]Beamer S, Asanović K, Patterson D. Direction-optimizing breadth-first search. In
*Proc. the 2012 Int. Conf. High Performance Computing, Networking, Storage and Analysis*, November 2012, Article No. 12.Google Scholar - [116]Beamer S, Asanović K, Patterson D. The GAP benchmark suite. arXiv:1508.03619, 2015. http://arxiv.org/abs/1508.03619, May 2017.
- [117]Scarpazza D P, Villa O, Petrini F. Efficient breadth-first search on the Cell/B.E. processor.
*IEEE Trans. Parallel and Distributed Systems*, 2008, 19(10): 1381-95.Google Scholar - [118]Milenković T, Lai J, Pržulj N. GraphCrunch: A tool for large network analyses.
*BMC Bioinformatics*, 2008, 9: Article No. 70.Google Scholar - [119]Hong S, Oguntebi T, Olukotun K. Efficient parallel graph exploration on multi-core CPU and GPU. In
*Proc. the 2011 Int. Conf. Parallel Architectures and Compilation Techniques*, October 2011, pp.78-88.Google Scholar - [120]Matsumoto K, Nakasato N, Sedukhin S G. Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In
*Proc. the 13th IEEE Int. Conf. High Performance Computing and Communications*, September 2011, pp.145-152.Google Scholar - [121]Siek J G, Lee L Q, Lumsdaine A. The Boost Graph Library: User Guide and Reference Manual (PAP/CDR edition). Addison-Wesley Professional, 2001.Google Scholar
- [122]Ma X, Zhang D, Chiou D. FPGA-accelerated transactional execution of graph workloads. In
*Proc. the 2017 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays*, February 2017, pp.227-236.Google Scholar - [123]Zheng D, Mhembere D, Burns R, Vogelstein J, Priebe C E, SzalayA S. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In
*Proc. the 13th USENIX Conf. File and Storage Technologies*, February 2015, pp.45-58.Google Scholar - [124]Rodeh O. B-trees, shadowing, and clones.
*ACM Transactions on Storage*, 2008, 3(4): Article No. 2.Google Scholar - [125]Sha M, Li Y, He B, Tan K L. Accelerating dynamic graph analytics on GPUs.
*Proceedings of the VLDB Endowment*, 2017, 11(1): 107-120.CrossRefGoogle Scholar - [126]Shi X, Cui B, Shao Y, Tong Y. Tornado: A system for real-time iterative analysis over evolving data. In
*Proc. the 2016 Int. Conf. Management of Data*, June 2016, pp.417-430.Google Scholar - [127]Chen H, Sun Z, Yi F, Su J. BufferBank storage: An economic, scalable and universally usable in-network storage model for streaming data applications.
*Science China Information Sciences*, 2016, 59(1): 1-15.Google Scholar - [128]Zhang M, Wu Y, Chen K, Qian X, Li X, Zheng W. Exploring the hidden dimension in graph processing. In
*Proc. the 12th USENIX Conf. Operating Systems Design and Implementation*, November 2016, pp.285-300.Google Scholar - [129]Battaglia P W, Hamrick J B, Bapst V
*et al*. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261, 2018. http://arxiv.org/abs/1806.01261, Jun. 2018. - [130]Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S. graph2vec: Learning distributed representations of graphs. arXiv:1707.05005, 2017. https://arxiv.org/abs/1707.05005, Jun. 2018.
- [131]Ribeiro L F, Saverese P H, Figueiredo D R. Struc2vec: Learning node representations from structural identity. In
*Proc. the 23rd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, August 2017, pp.385-394.Google Scholar - [132]Zheng L, Liao X, Jin H. Efficient and scalable graph parallel processing with symbolic execution.
*ACM Trans. Architecture and Code Optimization*, 2018, 15(1): Article No. 3.Google Scholar - [133]Li Z, Liu L, Deng Y, Yin S, Wang Y, Wei S. Aggressive pipelining of irregular applications on reconfigurable hardware. In
*Proc. the 44th Annual Int. Symp. Computer Architecture*, June 2017, pp.575-586.Google Scholar - [134]Zheng L, Liao X, Jin H, Zhao J, Wang Q. Scalable concurrency debugging with distributed graph processing. In
*Proc. the 2018 Int. Symp. Code Generation and Optimization*, February 2018, pp.188-199.Google Scholar - [135]Jouppi N P, Young C, Patil N
*et al*. In-datacenter performance analysis of a tensor processing unit. In*Proc. the 44th Annual Int. Symp. Computer Architecture*, June 2017, pp.1-12.Google Scholar - [136]Chen T, Du Z, Sun N, Wang J, Wu C, Chen Y, Temam O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In
*Proc. the 19th Int. Conf. Architectural Support for Programming Languages and Operating Systems*, March 2014, pp.269-284.Google Scholar