Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors
Abstract
Heterogeneous processors integrate distinct compute resources, such as CPUs and GPUs, into the same chip, and thus can exploit the advantages of both types of compute units while avoiding their disadvantages. In this work, we evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU–GPU heterogeneous processor using 956 sparse matrices. Five characteristics, i.e., load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic characteristics, are our major considerations. The experimental results show that although the CPU and GPU parts access the same DRAM, very different performance behaviors are observed. For example, although the GPU part in general outperforms the CPU part, it cannot match the best performance of the CPU part in all cases. Moreover, the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than on a high-end discrete GPU.
Keywords
Heterogeneous processor · Performance analysis · Sparse matrix computation

1 Introduction
About a decade ago, the graphics processing unit (GPU) was introduced to high performance computing. Because of its high peak compute performance and bandwidth, a large number of compute kernels and real-world applications have been accelerated on GPUs (Owens et al. 2008). However, it has also been reported that not all compute patterns are suitable for GPU computing, due to their irregularity (Lee et al. 2010) and the time-consuming memory copy between host memory and GPU memory (Gregg and Hazelwood 2011). As a result, the heterogeneous processor, also called an accelerated processing unit (APU) or CPU–GPU integrated architecture, has been expected to exploit the advantages of both CPUs and GPUs and avoid memory copies between the memory areas of different devices. Schulte et al. (2015) and Vijayaraghavan et al. (2017) recently reported that, with a good design, heterogeneous processors can be a competitive building block for exascale computing systems.
The effective design of such heterogeneous processors is challenging. For example, because CPU and GPU applications normally have very different memory access patterns, implementing efficient cache coherence between the two parts is an open problem. Several hardware and software supporting techniques have been developed (Agarwal et al. 2016; Dashti and Fedorova 2017; Power et al. 2013). Also, when both parts share the last level cache, the data prefetching scheme can be improved by adding new instructions (Yang et al. 2012). In addition, low-power and performance-per-watt optimization are crucial design targets as well (Branover et al. 2012; Zhu et al. 2017a).
Despite the difficulties in architecture design, several usable high performance heterogeneous processors, such as AMD Carrizo (Krishnan et al. 2016), Intel Skylake (Doweck et al. 2017) and NVIDIA Denver (Boggs et al. 2015), have been released in recent years. Such integrated architectures have inspired a number of novel techniques for various parallel problems. Daga et al. evaluated several kernels and applications on AMD heterogeneous processors (Daga et al. 2011), and optimized B+ tree search (Daga and Nutter 2012) and breadth-first search (BFS) (Daga et al. 2014). Zhang et al. (2018) developed a faster BFS through traversal order optimization. Puthoor et al. (2016) developed new DAG scheduling methods on heterogeneous processors. Liu and Vinter designed a new heap data structure called ad-heap (Liu and Vinter 2014) and new sparse matrix–vector multiplication (SpMV) and sparse matrix–matrix multiplication (SpGEMM) algorithms (Liu and Vinter 2015c, 2015) for heterogeneous processors. Said et al. (2017) demonstrated that seismic imaging can be faster and more energy efficient on heterogeneous processors. Zhu et al. (2017a, b) and Zhang et al. (2015, 2017b) studied co-run behaviors of various kernels, and Zhang et al. (2017a) developed effective workload partitioning approaches for heterogeneous processors.
However, irregular algorithms, in particular sparse matrix and graph computations, have not been systematically studied in existing work. Zhang et al. (2017a, b) took sparse matrix–vector multiplication (SpMV) and several graph kernels into consideration in their co-run benchmarks and scheduling algorithm design, but only a very limited number of sparse matrices and graphs were used for benchmarking. Also, other important sparse matrix kernels, e.g., SpGEMM (Buluç and Gilbert 2012; Liu and Vinter 2015b; Liu et al. 2018, 2019), sparse matrix transposition (SpTRANS) (Wang et al. 2016) and sparse triangular solve (SpTRSV) (Liu et al. 2017; Wang et al. 2018), have not been well studied on heterogeneous processors.
In this paper, we evaluate and analyze the performance behaviors of eight representative sparse kernels on the latest AMD APU, the Ryzen 5 2400G, which includes CPU cores codenamed Zen and GPU cores codenamed Vega. Among the eight kernels, four are from scientific computation, i.e., sparse matrix–vector multiplication (SpMV), sparse matrix–matrix multiplication (SpGEMM), sparse matrix transposition (SpTRANS), and sparse triangular solve (SpTRSV), and the other four are from graph computing, i.e., PageRank (PR), graph coloring (GC), connected component (CC), and breadth-first search (BFS). We use 956 large sparse matrices from the SuiteSparse Matrix Collection (Davis and Hu 2011) as the benchmark suite for obtaining statistically significant experimental results. We then analyze the best performance configurations, in terms of algorithm and compute resource, for matrices of various sparsity structures. We mainly consider five characteristics: load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic characteristics. Moreover, a performance comparison with a high-end discrete GPU is also given for a better understanding of sparse problems on various architectures. We finally discuss several challenges and opportunities for achieving higher performance for sparse matrix and graph kernels on heterogeneous processors.
2 Background
2.1 Heterogeneous processors
Compared to homogeneous chip multiprocessors such as CPUs and GPUs, heterogeneous processors combine different types of cores into one chip, and thus can deliver improved overall performance and power efficiency (Schulte et al. 2015; Vijayaraghavan et al. 2017), provided that sufficient heterogeneous parallelism exists. The main characteristics of heterogeneous processors are unified shared memory and fast communication among the different types of cores in the same chip. The Cell Broadband Engine can be seen as an early form of heterogeneous processor. Currently, because of mature CPU and GPU architectures, programming environments, and various applications, the CPU–GPU integrated heterogeneous processor with multiple instruction set architectures is the most widely adopted choice.
Although the compute capacity of coupled CPU–GPU processors is currently lower than that of discrete GPUs, the heterogeneous processor is a potential trend for future processors. Major hardware vendors have all released heterogeneous processors, such as AMD Carrizo (Krishnan et al. 2016), Intel Skylake (Doweck et al. 2017) and NVIDIA Denver (Boggs et al. 2015). In addition, future heterogeneous processors can be more powerful; with good design, they can even be applied in exascale computing systems (Schulte et al. 2015; Vijayaraghavan et al. 2017).
2.2 Sparse matrix and graph computations

In this work, we evaluate the following eight sparse matrix and graph kernels:
Sparse matrix–vector multiplication (SpMV) that multiplies a sparse matrix A with a dense vector x and obtains a dense vector y;

Sparse matrix–matrix multiplication (SpGEMM) that multiplies a sparse matrix A with another sparse matrix B and obtains a resulting sparse matrix C;

Sparse matrix transposition (SpTRANS) that transposes a sparse matrix A in row-major order to \(A^T\) in row-major order (or both in column-major order);

Sparse triangular solve (SpTRSV) that computes a dense solution vector x from a system \(Lx=b\), where L is a lower triangular sparse matrix and b is a dense right-hand side vector;

PageRank (PR) that ranks Internet web pages in search engines by counting the links between different web pages and weighting each page;

Graph coloring (GC) that assigns colors to vertices such that any two adjacent vertices have different colors, which is a special case of graph labeling;

Connected component (CC) that marks vertices in different components and calculates the number of connected components;

Breadth-first search (BFS) that explores a path from a root node to each node in a graph, such that in each step the unvisited neighbors of visited nodes are marked to be visited in the next step.
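To make these definitions concrete, the SpTRSV kernel, for example, can be written serially over the CSR (compressed sparse row) format. The following is an illustrative Python sketch under the assumption that the diagonal entry is stored last in each row; it is not the evaluated GPU implementation.

```python
def sptrsv_csr(row_ptr, col_idx, vals, b):
    """Serial forward substitution solving L x = b for a lower
    triangular sparse matrix L stored in CSR.
    Assumes the diagonal entry is the last nonzero of each row."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        # Entries before the diagonal multiply already-solved components of x.
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]
        # Divide by the diagonal entry of row i.
        x[i] = s / vals[row_ptr[i + 1] - 1]
    return x
```

The loop makes the sequential dependence explicit: row i cannot be solved before the rows its off-diagonal columns refer to, which is exactly what parallel SpTRSV algorithms must work around.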
2.3 Characteristics of parallel sparse matrix and graph kernels
Unlike dense matrix computations, sparse matrix and graph kernels have several unique characteristics (Liu 2015).
The first one is load balancing. Dense matrix operations can easily be executed in parallel through row-wise, column-wise or 2D block-wise decomposition. However, the nonzero entries of a sparse matrix can appear at any locations. Hence, which decomposition method gives the best load balancing depends on the sparsity structure, the operation pattern, and the concrete hardware device.
The second is indirect addressing. Because of the compressed storage format, the nonzero entries of a sparse matrix have to be accessed through indirect addresses stored in its index array. It is well known that indirect addressing brings more memory transactions and a lower cache hit rate, and cannot be optimized at compile time since the addresses are only known at runtime.
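The indirect addressing pattern can be seen in a minimal row-wise CSR SpMV, sketched here in Python for illustration only (the benchmarked kernels are OpenMP/OpenCL implementations):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Row-wise CSR SpMV computing y = A * x.
    The gather x[col_idx[k]] is the indirect, index-array-driven
    access discussed above: its target is unknown until runtime."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]  # indirect addressing
        y[i] = s
    return y
```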
The third is memory reallocation. Several sparse kernels, such as the addition or multiplication of two sparse matrices, generate sparse output. The number of nonzero entries and their distribution are not known in advance. Precomputing an upper bound is one method to deal with the unknown number of nonzero entries of the output; however, this method may waste memory space. Another method is to preallocate a sparse output and reallocate more space if the initial size proves insufficient. However, such memory reallocation is currently expensive on GPUs.
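The preallocate-then-reallocate strategy can be sketched as follows. This illustrative Python helper (the names are ours) emulates the allocate–copy–release sequence explicitly, since that is what a GPU program without true reallocation has to perform:

```python
def append_with_realloc(buf, cap, size, value):
    """Append one nonzero to an output row buffer, doubling capacity
    only when the current buffer is full. Emulates the
    preallocate-then-reallocate scheme for sparse output whose final
    length is unknown in advance."""
    if size == cap:                # initial guess exhausted:
        cap *= 2                   # allocate a larger space,
        new_buf = [0.0] * cap
        new_buf[:size] = buf       # copy the existing entries,
        buf = new_buf              # and release the old space
    buf[size] = value
    return buf, cap, size + 1
```

Starting from a small initial capacity, only rows that really overflow pay the copy cost, while short rows waste no space.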
The fourth is atomic operations. Some kernels depend heavily on atomic operations to collect nonzeros or to synchronize the workload of thread blocks. For example, thread blocks can use atomic operations on global variables to communicate and obtain the execution status of other thread blocks for fast synchronization. However, performance-wise, atomic operations are inherently sequential, though they can be implemented more efficiently through certain architectural designs.
The fifth is dynamic characteristics. Some graph computing kernels have dynamic characteristics, meaning that the computation is divided into several iterations and each iteration only relates to part of a graph, i.e., part of a sparse matrix. Dynamic characteristics relate to both the input and the algorithm. When the workload of a compute iteration is too low, the GPU fails to utilize all its compute cores.
Summary of the characteristics and sparse kernels

Characteristics          Description                                              Sparse kernels
Load balancing           Efficient workload distribution                          SpMV
Indirect addressing      Data addresses are held in intermediate locations        SpMV
Memory reallocation      Allocating memory space during runtime                   SpGEMM
Atomic operations        Exclusive execution by one thread                        SpTRANS, SpTRSV
Dynamic characteristics  Dynamically processing different parts during execution  PR, GC, CC, BFS
3 Evaluation methodology
3.1 Platform
In this evaluation we use a heterogeneous processor, the AMD Ryzen 5 2400G APU, composed of four Zen CPU cores and 11 GPU cores running at 3.6 GHz and 1.25 GHz, respectively. Each CPU core can run two simultaneous threads, and each GPU core has 64 AMD GCN cores. The system memory is 16 GB dual-channel DDR4-2933 with a theoretical peak bandwidth of 46.9 GB/s. The operating system is 64-bit Microsoft Windows 10\(^{1}\). The GPU driver version is 18.5.1. The development environment is AMD APP SDK 3.0 and OpenCL 2.0.
3.2 Sparse kernels
To evaluate load balancing and indirect addressing, on the CPU part we benchmark a classic row-wise CSR SpMV algorithm and the CSR5 SpMV algorithm proposed by Liu and Vinter (2015a), parallelized with OpenMP and vectorized by the compiler; on the GPU part we test two other SpMV kernels, i.e., the CSR-adaptive algorithm proposed by Greathouse and Daga (2014) and the CSR5 SpMV algorithm. The CSR-adaptive algorithm collects short rows into groups to shrink the gaps between row lengths for better load balancing, and the CSR5 SpMV algorithm evenly divides nonzeros into small tiles of the same size for load balancing and uses a vectorized segmented sum to utilize the wide SIMD units on GPUs. We can observe the impact of load balancing by comparing these algorithms, because they use different load balancing strategies. For the impact of indirect addressing, we can analyze the performance difference between the CPU and the GPU, because they have different data access patterns and thus behave differently under indirect addressing.
To test memory reallocation, we analyze an application that involves it. We run an SpGEMM algorithm developed by Liu and Vinter (2015b) that calculates the number of floating point operations of each row, groups rows with similar numbers of operations into the same bin, and uses a different method for each bin. The rows requiring more computations may need larger allocations, but much of that space is ultimately wasted since the final result can be much shorter. Thus it is better to preallocate a small space and reallocate a larger one only when the small space is inadequate. Because GPUs lack the ability to reallocate memory, a GPU program has to allocate a larger space, copy the entries from the current space, and finally release the old space. This method is inefficient and wastes memory space. To avoid this slow processing, Liu and Vinter (2015b) exploit the reallocation scheme on the host memory to accelerate the procedure.
To benchmark atomic operations, we use two kernels that involve atomic operations: an atomic-based SpTRANS method described by Wang et al. (2016) and a synchronization-free SpTRSV algorithm proposed by Liu et al. (2017). The SpTRANS method first uses atomic-add operations to count the number of nonzeros in each column (assuming both the input and output matrices are in row-major order) and then scatters nonzeros from rows into columns through an atomic-based counter. The SpTRSV problem is inherently sequential. The synchronization-free SpTRSV algorithm uses atomic operations as a communication mechanism between thread groups: when a thread group finishes its work, it atomically updates some variables in global memory, and other thread groups that are busy-waiting notice the change and start to complete their jobs.
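The column-counting phase of the atomic-based SpTRANS can be sketched as follows. Since Python has no hardware atomic-add, a per-counter lock stands in for it in this illustrative version; on a GPU each `counts[c] += 1` would be a single atomic-add instruction:

```python
import threading

def count_columns_atomic(col_idx, n_cols, n_threads=4):
    """First phase of atomic-based SpTRANS: count nonzeros per column.
    Each thread scans a slice of the nonzero index array and atomically
    increments the counter of its target column."""
    counts = [0] * n_cols
    locks = [threading.Lock() for _ in range(n_cols)]

    def worker(lo, hi):
        for k in range(lo, hi):
            c = col_idx[k]
            with locks[c]:        # emulated atomic-add
                counts[c] += 1

    chunk = (len(col_idx) + n_threads - 1) // n_threads
    threads = [threading.Thread(
                   target=worker,
                   args=(t * chunk, min((t + 1) * chunk, len(col_idx))))
               for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counts
```

A prefix sum over `counts` then yields the row pointer array of the transposed matrix, after which the scatter phase places each nonzero using a second set of atomic counters.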
To evaluate dynamic characteristics, we use four graph computing algorithms in our experiment. Different graph applications may exhibit various dynamic characteristics (Wang et al. 2019). PageRank involves all vertices of a graph in computation; graph coloring and connected component have a large number of active vertices at first, after which the number decreases; BFS has low parallelism in its first iterations, after which the parallelism increases. These dynamic characteristics make GPU acceleration challenging.
3.3 Matrices
We use matrices downloaded from the SuiteSparse Matrix Collection (Davis and Hu 2011) (formerly known as the University of Florida Sparse Matrix Collection). There are currently 2757 matrices in the collection. To avoid experimental errors from executing small matrices, we only select relatively large matrices with no fewer than 100,000 and no more than 200,000,000 nonzero elements. Under this condition, 956 matrices are selected and tested to obtain statistically significant experimental results. These matrices are used as the input for the sparse kernels described above.
4 Experimental results
4.1 SpMV performance and analysis
Figure 3a plots a performance comparison of two SpMV methods, i.e., CSR-omp and CSR5-omp, on the CPU side, and Fig. 3b shows a similar comparison of two SpMV methods, i.e., CSR-adaptive-ocl and CSR5-ocl, on the GPU side. It can be seen that the methods deliver comparable performance on both parts. But it is also noticeable that when the variation of row length is larger than 1 (shown as \(10^0\) in the figures), the CSR5-omp and CSR5-ocl methods outperform CSR-omp and CSR-adaptive-ocl in many cases. However, when the variation is smaller than \(10^0\), CSR-omp and CSR-adaptive-ocl give better performance on some matrices. This means that even on moderately parallel devices of four CPU cores or 11 GPU cores, load balancing is still a problem, and CSR5, as a load-balanced method, is more competitive than the naïve CSR and CSR-adaptive methods. However, it is also worth noting that the latter two algorithms work better on regular problems, since they avoid the extra operations designed for load-balanced computation.
As for indirect addressing, it is also interesting to see the performance difference between CSR5-omp and CSR5-ocl in Fig. 3c. Although the two methods access the same DRAM and thus utilize the same bandwidth, CSR5-ocl in most cases offers better performance and gives up to 3\(\times\) speedup over CSR5-omp. This may be because the GPU runs many more simultaneous threads than the CPU, and thus can hide the latency of randomly accessing the vector x. However, it can also be seen that CSR5-omp achieves higher performance in several cases (see the green dots near 80 GB/s). This may be because cache locality on CPUs is better than on GPUs (Li et al. 2017a, b), and the sparsity structures of those matrices can exploit the caches better.
Finally, we plot the best performance on both sides (CPU-best denotes the best performance of CSR-omp and CSR5-omp, and GPU-best denotes the best performance of CSR-adaptive-ocl and CSR5-ocl) in Fig. 3d. It can be seen that GPU-best in general outperforms CPU-best, but the latter can achieve higher peak performance (CPU-best approaches 100 GB/s, whereas GPU-best stays below 80 GB/s). We believe that this is also due to the combined effects of the CPU's caches and the GPU's larger number of execution threads.
4.2 SpGEMM performance and analysis
It can be seen that in most cases the two methods deliver comparable performance. This is because only a few very long rows require the progressive allocation, and the overall performance is not affected by those rows. However, there are still many cases that receive over 2\(\times\) speedups from the reallocation on unified memory. This means that the memory reallocation technique can be very useful for irregular problems such as SpGEMM. Because the integrated GPU can use shared host memory, algorithms originally designed for GPUs can be further accelerated on heterogeneous processors.
4.3 SpTRANS and SpTRSV performance and analysis
Figures 5a and 6a demonstrate absolute performance. It can be seen that the Titan X GPU can be up to nearly 20\(\times\) faster than the integrated GPU, but in many cases delivers only a few times speedup. However, in Figs. 5b and 6b, it is clear that the AMD integrated GPU gives much better bandwidth utilization than the NVIDIA discrete GPU. Overall, although the discrete GPU offers over 10\(\times\) the theoretical bandwidth of the integrated GPU, SpTRANS and SpTRSV, which depend heavily on atomic operations, do not receive the benefits of the higher bandwidth. Although we lack implementation details of the atomic operations on both architectures, it is very interesting to see that the relatively low-end integrated GPU accessing host memory can deliver such high atomic memory utilization.
4.4 PageRank performance and analysis
From Fig. 7, we can see that PageRank on GPUs is more likely to achieve high performance when the variation of edges is low; in contrast, PageRank on CPUs shows no obvious performance variation with the variation of edges. This is due to the architecture difference. GPUs have massive numbers of parallel computing cores, and a group of cores is required to execute in a SIMD manner, which means that threads in a co-execution group (a "wavefront" in OpenCL terminology) need to execute the same instruction simultaneously. When the workload distribution within the co-execution group is unbalanced, the threads that finish earlier need to wait for the other threads before processing the next workload, which causes performance degradation. For graph computing applications like PageRank, this problem can be more serious. In contrast, the CPU performance in Fig. 7 fluctuates only slightly with the variation of edges; CPUs do not have this problem because of their out-of-order execution model and relatively larger cache capacity. The highest performance the GPU reaches is 7.5 GTEPS, while the CPU only reaches 1.2 GTEPS.
Figure 7 also shows the performance speedup of the GPU over the CPU on the integrated architecture. The average speedup is about 2.3\(\times\), and the highest speedup is 14.0\(\times\). We can see that when the variation of edges is low, the GPU can better exploit its compute capacity, and thus achieves a high speedup over the CPU.
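The PageRank computation analyzed in this subsection can be sketched as a serial power iteration; this illustrative Python version (the uniform handling of dangling vertices is our assumption) shows why every iteration touches all vertices, keeping the parallelism uniformly high:

```python
def pagerank(adj, d=0.85, iters=20):
    """Power-iteration PageRank. adj[v] lists the outgoing neighbors
    of vertex v. Every iteration visits every vertex, so the
    available parallelism is constant across iterations."""
    n = len(adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for v, neighbors in enumerate(adj):
            if neighbors:
                share = d * rank[v] / len(neighbors)
                for u in neighbors:     # push rank along outgoing links
                    nxt[u] += share
            else:                       # dangling vertex (assumption):
                for u in range(n):      # spread its rank uniformly
                    nxt[u] += d * rank[v] / n
        rank = nxt
    return rank
```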
4.5 Graph coloring performance and analysis
Different from PageRank, graph coloring does not involve all vertices throughout the computation. It at first involves all vertices in the coloring; during subsequent iterations, vertices that have already been colored successfully no longer need to be involved. Hence, the number of active vertices decreases as the iterations proceed. Because the parallelism of the program relates to the number of active vertices, GPU performance can be affected.
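This shrinking active set can be sketched with a simple iterative independent-set coloring; the following Python version (our illustrative variant, not the benchmarked kernel) colors, in each round, every still-uncolored vertex whose id exceeds those of all its uncolored neighbors:

```python
def color_graph(adj):
    """Iterative independent-set coloring. Each round assigns the
    current color to the uncolored vertices that are local maxima
    among their uncolored neighbors; such vertices form an
    independent set, so the coloring is valid. The set of active
    (uncolored) vertices shrinks every round."""
    n = len(adj)
    color = [-1] * n
    c = 0
    while any(col == -1 for col in color):
        winners = [v for v in range(n)
                   if color[v] == -1 and
                   all(color[u] != -1 or u < v for u in adj[v])]
        for v in winners:
            color[v] = c
        c += 1
    return color
```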
Figure 8 shows the performance speedup of the GPU over the CPU. Because the CPU part has relatively steady performance, the decreasing performance trend on the GPU is obvious. The highest speedup is 10.6\(\times\) when the variation of edges is around 0.3; however, when the variation of edges is higher than \(10^1\), the CPU performs better than the GPU. Therefore, the variation of edges can be an important indicator for deciding whether graph coloring should run on the GPU or the CPU of a heterogeneous processor.
4.6 Connected component performance and analysis
Connected component computes the component that each vertex belongs to. In graph theory, a connected component is a subgraph in which any two vertices can be connected by a path. In this algorithm, we assign each vertex a number; the algorithm consists of several iterations, and during each iteration, each vertex compares its number with those of its neighbors and updates its number to the smaller value. The algorithm stops when the graph no longer changes, and finally the number of unique values equals the number of connected components. The parallelism trend is similar to graph coloring: at first, all vertices need to update and the algorithm has large parallelism; after several iterations, most components are fixed, and the parallelism can be low.
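The label-propagation procedure described above can be sketched as follows (an illustrative serial Python version):

```python
def connected_components(adj):
    """Min-label propagation: every vertex starts with its own id and
    repeatedly adopts the minimum label among itself and its neighbors
    until no label changes. The number of distinct final labels is the
    number of connected components."""
    label = list(range(len(adj)))
    changed = True
    while changed:
        changed = False
        for v, neighbors in enumerate(adj):
            m = min([label[v]] + [label[u] for u in neighbors])
            if m < label[v]:
                label[v] = m
                changed = True
    return label, len(set(label))
```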
Figure 9 shows the performance speedup of the GPU over the CPU. We can see a clear trend that the speedup decreases as the variation of edges increases. When the variation of edges is about \(10^{-2}\), the speedup is 4.7\(\times\) on average; however, when the variation is around \(10^{2}\), the speedup decreases to 0.05\(\times\).
4.7 BFS performance and analysis
BFS traverses the graph from a root to the other vertices to obtain the shortest path between the root and each of them, exploring vertices in order of their distance to the root. BFS consists of several iterations, and during each iteration it adds the unvisited neighbors to the set of visited vertices. The vertices newly visited in each iteration form the frontier for the next iteration, and only the neighbors of the frontier need to be explored. The exploration of the vertices in the frontier can be distributed to different threads in parallel, so the frontier size determines the parallelism. Hence, GPU performance may suffer from an inadequate number of active vertices. Moreover, compared with the other applications, BFS has low computation density; most operations in BFS relate to memory access.
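The level-synchronous traversal described above can be sketched as follows; the returned per-iteration frontier sizes expose the parallelism available at each step (an illustrative Python sketch):

```python
def bfs_levels(adj, root):
    """Level-synchronous BFS. Each iteration expands the current
    frontier; the frontier size at each level is the parallelism
    available to a GPU at that step."""
    dist = [-1] * len(adj)
    dist[root] = 0
    frontier = [root]
    level = 0
    sizes = []                        # frontier size per iteration
    while frontier:
        sizes.append(len(frontier))
        nxt = []
        for v in frontier:
            for u in adj[v]:
                if dist[u] == -1:     # unvisited neighbor joins the
                    dist[u] = level + 1  # next frontier
                    nxt.append(u)
        frontier = nxt
        level += 1
    return dist, sizes
```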
Figure 10 presents the performance speedup of the GPU over the CPU on the heterogeneous processor. It shows that when the GPU and the CPU share the same memory, the CPU generally performs better than the GPU for BFS. However, when the input is relatively regular (a low variation of edges), the GPU can still outperform the CPU.
4.8 Comparison between scientific computing kernels and graph computing kernels
From Sects. 4.4 to 4.7, we can see that graph computing kernels are more complex than scientific linear algebra kernels. Linear algebra programs usually launch a kernel once and use the whole sparse matrix for computation; graph computing kernels can be divided into several iterations, and these iterations may have different degrees of parallelism. For PageRank, the computation in each iteration involves all vertices, so PageRank behaves similarly to the linear algebra kernels. For graph coloring and connected component, the parallelism decreases over time, and each iteration may not involve all vertices; hence, they perform better on GPUs than on CPUs when the variation is not large, and vice versa. BFS has low parallelism at first, after which the parallelism increases. Another reason why GPUs perform worse than CPUs for BFS is that BFS is a memory-bound program; most of its operations relate to memory access. Because the CPU and GPU share the same bandwidth, the GPU's higher architectural parallelism has a negative influence on bandwidth utilization.
5 Related work
5.1 Performance analysis for coupled heterogeneous processors
Because coupled heterogeneous processors pose nontrivial challenges from both the programming and architecture perspectives, many researchers focus on performance analysis of coupled heterogeneous processors to understand their performance behaviors for optimization. Daga et al. (2011) analyzed the efficiency of coupled heterogeneous processors, and pointed out that such heterogeneous processors are a step in the right direction for efficient supercomputers. Doerksen et al. (2011) used 0–1 knapsack and Gaussian elimination as two examples to discuss design and performance on fused architectures. Spafford et al. (2012) studied the tradeoffs of the shared memory hierarchies on coupled heterogeneous processors, and identified a significant requirement for robust runtime systems on such architectures. Lee et al. (2013) provided a comprehensive performance characterization for data-intensive applications, and revealed that the fused architecture is promising for accelerating them. Zakharenko et al. (2013) used FusionSim (Zakharenko 2012) to characterize performance on fused and discrete architectures. Mekkat et al. (2013) analyzed the management policy for the shared last level cache. Zhang et al. (2015, 2017b) studied the co-running behaviors of different devices executing the same application, while Zhu et al. (2014, 2017b) studied co-running performance degradation of different devices executing separate applications. Garzón et al. (2017) proposed an approach to optimize the energy efficiency of iterative computation on heterogeneous processors. Zhu et al. (2017a) presented a systematic study on heterogeneous processors with power caps considered. Moreover, low-power, reliability, and performance-per-watt optimization are also crucial considerations (Branover et al. 2012; Zhu et al. 2017a; Liu et al. 2015, 2016). Different from this body of research, our study focuses on sparse matrix and graph kernels: we analyze load balancing, indirect addressing, memory reallocation, atomic operations, and the differences between those kernels.
5.2 Accelerating irregular applications on heterogeneous processors
Many researchers focus on optimizing applications on coupled heterogeneous processors. Kaleem et al. (2014) provided an adaptive workload dispatcher for heterogeneous processors to co-run CPUs and GPUs; Pandit and Govindarajan (2014) proposed Fluidic Kernels, which perform cooperative execution on multiple heterogeneous devices and can be applied to heterogeneous processors directly. However, these studies do not consider optimizing for workload irregularity. Shen et al. (2013) provided Glinda, a framework for accelerating imbalanced applications on heterogeneous platforms. Barik et al. (2014) mapped irregular C++ applications to the GPU device of heterogeneous processors. Fairness and efficiency are two major concerns for shared system users; Tang et al. (2016) introduced multi-resource fairness and efficiency on heterogeneous processors. Zhang et al. (2017a) considered the irregularity inside a workload and the architecture differences between CPUs and GPUs, and proposed a method that distributes the relatively regular part of the workload to the GPU while keeping the irregular part on the CPU of an integrated architecture. Daga et al. (2014) proposed a hybrid BFS algorithm that selects the appropriate algorithm and device for each iteration on heterogeneous processors. Zhang et al. (2018) further developed a performance model for the BFS algorithm on heterogeneous processors.
5.3 Accelerating irregular applications on discrete GPUs
There are many works on accelerating irregular algorithms in sparse matrix and graph computations. For example, Liu and Vinter (2015a) developed CSR5, an efficient storage format for SpMV on heterogeneous platforms. Sparse matrix–matrix multiplication (SpGEMM) is another fundamental building block of scientific computation, and Liu and Vinter (2015b) proposed a framework for SpGEMM on GPUs and integrated architectures. Shen et al. (2014, 2016) proposed a method to match imbalanced workloads to GPUs, and performed workload partitioning to accelerate applications.
6 Conclusions
In this work we have conducted a thorough empirical evaluation of four representative sparse matrix kernels, i.e., SpMV, SpTRANS, SpTRSV, and SpGEMM, and four graph computing kernels, i.e., PageRank, connected component, graph coloring, and BFS, on an AMD APU heterogeneous processor. We benchmarked 956 sparse matrices and obtained statistically significant experimental results. Based on the data, we analyzed the kernels' performance behaviors with respect to load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic characteristics on heterogeneous processors, and identified several interesting insights.
Footnotes
1. Since the Linux GPU driver of this integrated GPU is not officially available yet, we have to do all benchmarks on Microsoft Windows.
Acknowledgements
This work has been partly supported by the National Natural Science Foundation of China (Grant nos. 61732014, 61802412, 61671151), Beijing Natural Science Foundation (no. 4172031), and SenseTime Young Scholars Research Fund.
References
Agarwal, N., Nellans, D., Ebrahimi, E., Wenisch, T.F., Danskin, J., Keckler, S.W.: Selective GPU caches to eliminate CPU–GPU HW cache coherence. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 494–506 (2016)
Barik, R., Kaleem, R., Majeti, D., Lewis, B.T., Shpeisman, T., Hu, C., Ni, Y., Adl-Tabatabai, A.R.: Efficient mapping of irregular C++ applications to integrated GPUs. In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization, p. 33. ACM (2014)
Boggs, D., Brown, G., Tuck, N., Venkatraman, K.S.: Denver: NVIDIA's first 64-bit ARM processor. IEEE Micro 35(2), 46–55 (2015)
Branover, A., Foley, D., Steinman, M.: AMD Fusion APU: Llano. IEEE Micro 32(2), 28–37 (2012)
Buluç, A., Gilbert, J.: Parallel sparse matrix–matrix multiplication and indexing: implementation and experiments. SIAM J. Sci. Comput. 34(4), C170–C191 (2012)
Daga, M., Aji, A.M., Feng, W.C.: On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing. In: 2011 Symposium on Application Accelerators in High-Performance Computing, pp. 141–149 (2011)
Daga, M., Nutter, M., Meswani, M.: Efficient breadth-first search on a heterogeneous processor. In: 2014 IEEE International Conference on Big Data (Big Data), pp. 373–382 (2014)
Daga, M., Nutter, M.: Exploiting coarse-grained parallelism in B+ tree searches on an APU. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 240–247 (2012)
Dashti, M., Fedorova, A.: Analyzing memory management methods on integrated CPU–GPU systems. In: Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, ISMM 2017, pp. 59–69 (2017)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
Doerksen, M., Solomon, S., Thulasiraman, P.: Designing APU oriented scientific computing applications in OpenCL. In: 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), pp. 587–592. IEEE (2011)
Doweck, J., Kao, W., Lu, A.K., Mandelblat, J., Rahatekar, A., Rappoport, L., Rotem, E., Yasin, A., Yoaz, A.: Inside 6th-generation Intel Core: new microarchitecture code-named Skylake. IEEE Micro 37(2), 52–62 (2017)
Duff, I.S., Heroux, M.A., Pozo, R.: An overview of the sparse basic linear algebra subprograms: the new standard from the BLAS technical forum. ACM Trans. Math. Softw. (TOMS) 28(2), 239–267 (2002)
 Garzón, E.M., Moreno, J., Martínez, J.: An approach to optimise the energy efficiency of iterative computation on integrated GPU–CPU systems. J. Supercomput. 73(1), 114–125 (2017)CrossRefGoogle Scholar
 Greathouse, J.L., Daga, M.: Efficient sparse matrix–vector multiplication on GPUs using the CSR storage format. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pp. 769–780 (2014)Google Scholar
 Gregg, C., Hazelwood, K.: Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In: (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software, pp. 134–144 (2011)Google Scholar
 Kaleem, R., Barik, R., Shpeisman, T., Hu, C., Lewis, B.T., Pingali, K.: Adaptive heterogeneous scheduling for integrated GPUs. In: Parallel Architecture and Compilation Techniques (PACT), 2014 23rd International Conference on, pp. 151–162. IEEE (2014)Google Scholar
 Krishnan, G., Bouvier, D., Naffziger, S.: Energyefficient graphics and multimedia in 28nm Carrizo accelerated processing unit. IEEE Micro 36(2), 22–33 (2016)CrossRefGoogle Scholar
 Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pp. 451–460 (2010)Google Scholar
 Lee, K., Lin, H., Feng, Wc: Performance characterization of dataintensive kernels on AMD fusion architectures. Comput. Sci. Res. Dev. 28(2–3), 175–184 (2013)CrossRefGoogle Scholar
 Li, A., Liu, W., Kristensen, M.R.B., Vinter, B., Wang, H., Hou, K., Marquez, A., Song, S.L.: Exploring and analyzing the real impact of modern onpackage memory on HPC scientific kernels. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 26:1–26:14 (2017a)Google Scholar
 Li, A., Song, S.L., Liu, W., Liu, X., Kumar, A., Corporaal, H.: Localityaware CTA clustering for modern GPUs. In: Proceedings of the TwentySecond International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’17, pp. 297–311 (2017b)Google Scholar
 Liu, W.: Parallel and scalable sparse basic linear algebra subprograms. Ph.D. thesis, University of Copenhagen (2015)Google Scholar
 Liu, W., Vinter, B.: Adheap: an efficient heap data structure for asymmetric multicore processors. In: Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU7, pp. 54:54–54:63 (2014)Google Scholar
 Liu, W., Vinter, B.: CSR5: an efficient storage format for crossplatform sparse matrix–vector multiplication. In: Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, pp. 339–350 (2015a)Google Scholar
 Liu, W., Vinter, B.: A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors. J. Parallel Distrib. Comput. 85(C), 47–61 (2015b)CrossRefGoogle Scholar
 Liu, W., Vinter, B.: Speculative segmented sum for sparse matrix–vector multiplication on heterogeneous processors. Parallel Comput. 49(C), 179–193 (2015c)MathSciNetCrossRefGoogle Scholar
 Liu, T., Chen, C.C., Kim, W., Milor, L.: Comprehensive reliability and aging analysis on SRAMs within microprocessor systems. Microelectron. Reliab. 55(9), 1290–1296 (2015)CrossRefGoogle Scholar
 Liu, T., Chen, C.C., Wu, J., Milor, L.: SRAM stability analysis for different cache configurations due to bias temperature instability and hot carrier injection. In: Computer Design (ICCD), 2016 IEEE 34th International Conference on, pp. 225–232, IEEE (2016)Google Scholar
 Liu, W., Li, A., Hogg, J.D., Duff, I.S., Vinter, B.: Fast synchronizationfree algorithms for parallel sparse triangular solves with multiple righthand sides. Concurr. Comput. Pract. Exp. 29(21), e4244 (2017)CrossRefGoogle Scholar
 Liu, J., He, X., Liu, W., Tan, G.: Registerbased implementation of the sparse general matrix–matrix multiplication on gpus. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’18, pp. 407–408 (2018)Google Scholar
 Liu, J., He, X., Liu, W., Tan, G.: Registeraware optimizations for parallel sparse matrix–matrix multiplication. Int. J. Parallel Program. (2019). https://doi.org/10.1007/s107660180604 Google Scholar
 Mekkat, V., Holey, A., Yew, P.C., Zhai, A.: Managing shared lastlevel cache in a heterogeneous multicore processor. In: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, pp. 225–234, IEEE Press (2013)Google Scholar
 Merrill, D., Garland, M.: Mergebased parallel sparse matrix–vector multiplication. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 58. IEEE Press (2016)Google Scholar
 Nai, L., Xia, Y., Tanase, I.G., Kim, H., Lin, C.Y.: GraphBIG: understanding graph computing in the context of industrial solutions. In: High Performance Computing, Networking, Storage and Analysis, 2015 SCInternational Conference for, pp. 1–12, IEEE (2015)Google Scholar
 Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C.: GPU computing. Proc. IEEE 96(5), 879–899 (2008)CrossRefGoogle Scholar
 Pandit, P., Govindarajan, R.: Fluidic kernels: Cooperative execution of opencl programs on multiple heterogeneous devices. In: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, p. 273, ACM (2014)Google Scholar
 Power, J., Basu, A., Gu, J., Puthoor, S., Beckmann, B.M., Hill, M.D., Reinhardt, S.K., Wood, D.A.: Heterogeneous system coherence for integrated CPU–GPU systems. In: 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 457–467 (2013)Google Scholar
 Puthoor, S., Aji, A.M., Che, S., Daga, M., Wu, W., Beckmann, B.M., Rodgers, G.: Implementing directed acyclic graphs with the heterogeneous system architecture. In: Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit, GPGPU ’16, pp. 53–62 (2016)Google Scholar
 Said, I., Fortin, P., Lamotte, J., Calandra, H.: Leveraging the accelerated processing units for seismic imaging: a performance and power efficiency comparison against CPUs and GPUs. Int. J. High Perform. Comput. Appl. 32(6), 819–837 (2017)CrossRefGoogle Scholar
 Schulte, M.J., Ignatowski, M., Loh, G.H., Beckmann, B.M., Brantley, W.C., Gurumurthi, S., Jayasena, N., Paul, I., Reinhardt, S.K., Rodgers, G.: Achieving exascale capabilities through heterogeneous computing. IEEE Micro 35(4), 26–36 (2015)CrossRefGoogle Scholar
 Shen, J., Varbanescu, A.L., Sips, H., Arntzen, M., Simons, D.G.: Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In: Proceedings of the ACM International Conference on Computing Frontiers, CF ’13, pp. 14:1–14:10 (2013)Google Scholar
 Shen, J., Varbanescu, A.L., Zou, P., Lu, Y., Sips, H.: Improving performance by matching imbalanced workloads with heterogeneous platforms. In: Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pp. 241–250 (2014)Google Scholar
 Shen, J., Varbanescu, A.L., Lu, Y., Zou, P., Sips, H.: Workload partitioning for accelerating applications on heterogeneous platforms. IEEE Trans. Parallel Distrib. Syst. 27(9), 2766–2780 (2016)CrossRefGoogle Scholar
 Spafford, K.L., Meredith, J.S., Lee, S., Li, D., Roth, P.C., Vetter, J.S.: The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In: Proceedings of the 9th conference on Computing Frontiers, pp. 103–112, ACM (2012)Google Scholar
 Tang, S., He, B., Zhang, S., Niu, Z.: Elastic multiresource fairness: balancing fairness and efficiency in coupled CPU–GPU architectures. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 75. IEEE Press (2016)Google Scholar
 Vijayaraghavan, T., Eckert, Y., Loh, G.H., Schulte, M.J., Ignatowski, M., Beckmann, B.M., Brantley, W.C., Greathouse, J.L., Huang, W., Karunanithi, A., Kayiran, O., Meswani, M., Paul, I., Poremba, M., Raasch, S., Reinhardt, S.K., Sadowski, G., Sridharan, V.: Design and analysis of an APU for Exascale computing. In: 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 85–96 (2017)Google Scholar
 Wang, H., Liu, W., Hou, K., Feng, W.C.: Parallel transposition of sparse data structures. In: Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, pp. 33:1–33:13 (2016)Google Scholar
 Wang, X., Liu, W., Xue, W., Wu, L.: swsptrsv: A fast sparse triangular solve with sparse level tile layout on sunway architectures. In: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’18, pp. 338–353 (2018)Google Scholar
 Wang, H., Geng, L., Lee, R., Hou, K., Zhang, Y., Zhang, X.: Sepgraph: finding shortest execution paths for graph processing under a hybrid framework on GPU. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP ’19, pp. 38–52 (2019)Google Scholar
 Yang, Y., Xiang, P., Mantor, M., Zhou, H.: CPUassisted GPGPU on fused CPU–GPU architectures. In: IEEE International Symposium on HighPerformance Comp Architecture, pp. 1–12 (2012)Google Scholar
 Zakharenko, V., Aamodt, T., Moshovos, A.: Characterizing the performance benefits of fused CPU/GPU systems using FusionSim. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 685–688, EDA Consortium (2013)Google Scholar
 Zakharenko, V.: FusionSim: characterizing the performance benefits of fused CPU/GPU systems. Ph.D. thesis (2012)Google Scholar
 Zhang, F., Zhai, J., Chen, W., He, B., Zhang, S.: To corun, or not to corun: a performance study on integrated architectures. In: 2015 IEEE 23rd International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 89–92 (2015)Google Scholar
 Zhang, F., Wu, B., Zhai, J., He, B., Chen, W.: FinePar: Irregularityaware Finegrained Workload Partitioning on Integrated Architectures. In: Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO ’17, pp. 27–38 (2017a)Google Scholar
 Zhang, F., Zhai, J., He, B., Zhang, S., Chen, W.: Understanding corunning behaviors on integrated CPU/GPU architectures. IEEE Trans. Parallel Distrib. Syst. 28(3), 905–918 (2017b)CrossRefGoogle Scholar
 Zhang, F., Lin, H., Zhai, J., Cheng, J., Xiang, D., Li, J., Chai, Y., Du, X.: An Adaptive BreadthFirst Search Algorithm on Integrated Architectures. The Journal of Supercomputing (2018)Google Scholar
 Zhu, Q., Wu, B., Shen, X., Shen, L., Wang, Z.: Understanding corun degradations on integrated heterogeneous processors. In: International Workshop on Languages and Compilers for Parallel Computing, pp. 82–97. Springer (2014)Google Scholar
 Zhu, Q., Wu, B., Shen, X., Shen, L., Wang, Z.: Corun scheduling with power cap on integrated CPU–GPU systems. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 967–977 (2017a)Google Scholar
 Zhu, Q., Wu, B., Shen, X., Shen, K., Shen, L., Wang, Z.: Understanding corun performance on CPU–GPU integrated processors: observations, insights, directions. Front. Comput. Sci. 11(1), 130–146 (2017b)CrossRefGoogle Scholar