The Visual Computer, Volume 34, Issue 5, pp 633–643

Exploring hidden coherency of Ray-Tracing for heterogeneous systems using online feedback methodology

Original Article


Although naturally adopting an embarrassingly parallel paradigm, Ray-Tracing is also categorized as an irregular program that is troublesome to run on graphics processing units (GPUs). Conventional designs suffer from a performance penalty due to the irregularity of the control flow and memory access caused by incoherent rays. This work aims to explore the hidden coherency of rays by designing a feedback-guided mechanism that serves the following concept: extraction of the hidden regular portions out of the irregular execution flow. The method records the correlation of ray attributes and the traversed path and groups the newly generated rays to reduce potential irregularities for the ongoing execution. This mechanism captures the information from the entire ray space and can extract the hidden coherency from both primary and derived rays. The result leads to performance gains and an increase in resource utilization. The performance becomes 2 to 2.5 times higher than the original GPU and CPU versions.


Keywords: Ray-Tracing · Heterogeneous systems · Irregular program · HSA · Shared virtual memory

1 Introduction

Ray-Tracing is a rendering algorithm that requires massive computing power. It generates realistic images by emitting and tracing a variety of ray types to implement global illumination. Typically, each generated ray must be examined for intersection with objects in the scene. For this reason, the Ray-Tracing algorithm is often implemented in an embarrassingly parallel fashion that is potentially suitable for processing by a graphics processing unit (GPU). In this paradigm, each work item in a wavefront of a GPU is assigned to trace the path of a single ray, perform intersection tests with the objects and render the outcome by accumulating the radiance [4, 11, 21].

However, the geometric configuration of a scene is constructed by millions of primitives in many cases, which makes it impractical to perform the intersection test sequentially with all primitives (such as triangles). To speed up the testing procedure, tree-based hierarchical accelerators, such as Bounding Volume Hierarchy trees (BVH trees) or KD trees [15], are used to reduce the otherwise O(n) complexity of testing a ray for intersection with all n objects [26]. However, using these accelerators introduces serious control flow divergence and data locality issues that are troublesome for GPU computation, as the computing pattern becomes an irregular algorithm. If we adopt a sequential search in the intersection test, we could obtain a large speedup from a GPU, but the overall runtime would increase.
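
To make the complexity argument concrete, the following is a minimal sketch of a stack-based BVH traversal with a slab test against axis-aligned boxes. The names `Node`, `build`, `hits_box` and `traverse` are hypothetical, the tree is built by a naive median split over pre-sorted boxes, and leaves hold bounding boxes only; none of this is taken from the implementation evaluated in this paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec
    hi: Vec
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prim: Optional[int] = None  # set only on leaves

def build(boxes, first, last):
    """Naive median split over an already ordered list of (lo, hi) boxes."""
    if first == last:
        lo, hi = boxes[first]
        return Node(lo, hi, prim=first)
    mid = (first + last) // 2
    left, right = build(boxes, first, mid), build(boxes, mid + 1, last)
    lo = tuple(min(a, b) for a, b in zip(left.lo, right.lo))
    hi = tuple(max(a, b) for a, b in zip(left.hi, right.hi))
    return Node(lo, hi, left, right)

def hits_box(origin, direction, lo, hi):
    """Slab test for the ray origin + t * direction with t >= 0."""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return False  # parallel to this slab and outside it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def traverse(root, origin, direction):
    """Return (leaf primitive ids reached, number of nodes visited)."""
    stack, leaves, visited = [root], [], 0
    while stack:
        node = stack.pop()
        visited += 1
        if not hits_box(origin, direction, node.lo, node.hi):
            continue  # the whole subtree is skipped
        if node.prim is not None:
            leaves.append(node.prim)
        else:
            stack.extend([node.right, node.left])
    return leaves, visited

# 16 unit boxes spaced along the x axis; the ray passes through box 0 only.
boxes = [((2.0 * i, 0.0, 0.0), (2.0 * i + 1.0, 1.0, 1.0)) for i in range(16)]
root = build(boxes, 0, len(boxes) - 1)
leaves, visited = traverse(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0))
```

In this toy scene the ray reaches its leaf after visiting 9 nodes instead of testing all 16 boxes, but the data-dependent `continue` and stack operations are exactly the kind of branch that differs from ray to ray.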

Transformed into an irregular program execution, Ray-Tracing will inherit the penalty of mapping to heterogeneous systems since its behavior changes unpredictably during execution. In particular, the dynamically varying control flows create thread divergences, which dramatically reduces the level of parallelism and SIMD lane utilization in GPUs. For instance, given two rays with distinct origins and directions, the two work items for processing them will follow greatly diverged execution paths. If the two work items belong to the same wavefront, the divergence in both control and memory access causes a performance penalty on a GPU. Machines that adopt MIMD (multiple instruction, multiple data) parallelism, such as CPUs and the Intel Xeon Phi [16], are better at supporting programs with irregular controls but are less energy efficient if the process becomes regular. In contrast, GPUs can handle regular code effectively due to the nature of their SIMD (single instruction, multiple data) execution pattern. Consequently, mapping irregular code in Ray-Tracing efficiently onto a heterogeneous system remains difficult.
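
The effect of divergence on lane utilization can be illustrated with a deliberately simplified lockstep model. The sketch below assumes that a wavefront must issue one step for every distinct BVH node touched by any of its work items and that a lane is active only on the steps that belong to its own path; the function name `simd_utilization` and the node numbering are hypothetical, and real hardware behavior is more involved.

```python
def simd_utilization(paths):
    """Fraction of active lane-steps for one wavefront.

    paths: non-empty list with one list of visited node ids per work item.
    """
    steps = set().union(*paths)               # steps issued by the wavefront
    active = sum(len(set(p)) for p in paths)  # lane-steps doing useful work
    return active / (len(paths) * len(steps))

# Four rays following the same path keep every lane busy.
coherent = [[0, 1, 3]] * 4
# Four rays that split at each level leave most lanes masked off.
divergent = [[0, 1, 3], [0, 1, 4], [0, 2, 5], [0, 2, 6]]

u_coherent = simd_utilization(coherent)
u_divergent = simd_utilization(divergent)
```

Under this model the coherent wavefront reaches full utilization, whereas the divergent one uses 12 of 28 lane-steps.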

To solve this divergence issue, previous studies have proposed methods for ray grouping. Rays that are successfully grouped are expected to traverse similar nodes in the accelerator. These rays with similar traversal paths are called coherent rays. To explore coherent rays, methods such as the gathering of rays into packets, ray sorting and compression, or combining single and packet ray traversal have been introduced [6, 7, 13]. However, it has also been indicated that it is difficult to identify coherency. The proposed methods are effective only for primary rays. Derived rays, such as diffuse reflective rays and shadow rays, cannot be handled and could eventually break the coherency of rays. In fact, perfect information regarding coherency is only available after the traversal is complete. Thus, the methods for finding potentially coherent rays are typically based on heuristics.

Based on this observation, a question arises: would it be possible to identify hidden ray coherency in an execution flow that seems incoherent at first glance? To address this question, we keep the better-scaling approach of using BVH trees in intersection tests, and we take a feedback-directed approach to identify coherent rays and avoid the branch divergence caused by tree searching. With this feedback mechanism, the algorithm is able to analyze and extract the potential regular portions out of the irregular program execution and gather them to form a better group with less branch divergence for the GPU. For those workloads that cannot be compressed into a coherent group, we adopt a load balancing mechanism that enables the CPU to dynamically take over the duty of processing. The design concept is based on the following observation: rays with similar attributes, such as nearby origins and similar orientations, are likely to follow the same path in the traversal procedure. Because tree traversal is triggered multiple times, the correlation between ray attributes and the traversed path can be recorded and used as a feedback-guided mechanism to group newly generated rays the next time a similar pattern is encountered, which reduces potential irregularities in the ongoing execution. The result leads to performance gains and an increase in resource utilization.

In this paper, we aim to uncover the hidden coherent rays by designing a feedback-guided mechanism called Attribute-Based Ray Regrouping, which is implemented in a Ray-Tracing Runtime. The mechanism can extract the coherency from not only the primary rays but also the derived rays since it is designed to capture the information of the entire ray space. The design leverages the features of the Heterogeneous System Architecture (HSA) [29], such as shared virtual memory (SVM) and fast kernel dispatching, to further reduce the systematic overhead, such as memory transferring and kernel dispatching, which makes this approach more applicable. We analyzed the performance and design trade-offs of the Runtime construction.

This paper makes the following contributions:
  1. We propose a feedback-guided mechanism that is able to identify the hidden coherency from both primary and derived rays. The mechanism reduces irregularities, resulting in performance enhancement and an increase in hardware utilization.

  2. We propose a Tagging method that annotates the workloads based on their characteristics. The workloads are sent to the devices that are better suited for their execution through heterogeneous queues to achieve load balancing on heterogeneous cores.

  3. We utilize the features of HSA and analyze the performance advantage of the proposed method under different scenes to provide guidelines for further research in this area.


2 Related work

Many rendering systems based on Ray-Tracing adopt the single ray traversal algorithm. However, to exceed the performance of this algorithm, it is necessary to develop traversal algorithms that emphasize the exploitation of coherency from multiple rays that are following the same traversal path [10, 30, 36]. Packet ray traversal can achieve high speedup while handling a coherent workload, such as primary rays, but will suffer severely from incoherency [7, 23, 37]. Many studies have indicated that this is caused by branch divergence and memory access irregularity and have tried to minimize these issues. For example, Aila et al. [4] gave a comprehensive analysis of the impact of irregular execution and design considerations of efficiently mapping Ray-Tracing on GPU and other machines with wide SIMD. Kao et al. [17] also addressed the issue of branch divergence in Ray-Tracing by developing a Runtime technology.

To reduce branch divergence and improve coherency, several studies have proposed techniques such as ray queuing [3], ray reordering [8, 27], ray sorting [2, 12, 13] and ray stream filtering [14, 28, 32] for intersection testing or shading. In addition, methods for mining effective parallelism and extracting more coherency have been introduced. For instance, Moon et al. [20] proposed a heuristic method to order rays into coherent groups by finding an approximate intersection point for each ray on a simplified mesh of the scene. Tong et al. [31] further extended the concept by using a partial traversal heuristic to help identify coherent rays. However, different from the above methods, our proposed methodology adopts a feedback-guided approach that groups rays by using the exact hit information from the previous iterations rather than a heuristic method. It is able to handle both primary and derived rays effectively since the algorithm captures and tracks the information from the ray space and correlates it with the traversal paths. Moreover, our method can partition the workloads to heterogeneous cores.

Another trend of research emphasizes addressing a specific type of branch divergence called early termination. Methods such as active thread compaction, pipeline-based ray ordering and ray regeneration [18, 22, 34, 35] are proposed to improve SIMD efficiency via better utilization of GPU hardware resources. However, as more rays are compacted together, work items in a wavefront tend to become more divergent due to the lack of ray categorization and grouping processes.

The benefit and performance impact of decomposing a Ray-Tracing megakernel into multiple specialized kernel fragments have also been reported in [19]. However, decomposing a megakernel has been discouraged by the overhead resulting from kernel dispatching, queue management and data maintenance in memory. Our proposed technique overcomes these issues by constructing a Runtime that utilizes the features in HSA to lower the overhead associated with a split kernel implementation.

Many studies have also introduced the mechanism of cooperative execution on heterogeneous systems and have addressed load balancing with different purposes. For instance, Tzeng et al. demonstrated several task management techniques, task donation, and task stealing methodologies for irregular workloads [33]. Pajot et al. [24] presented a hybrid bidirectional path tracing implementation with optimization techniques such as double buffering, batch processing, and asynchronous execution to balance tasks between the CPU and GPU. The OptiX ray tracer also implemented a load balancing mechanism for GPUs by assigning local queues to individual GPUs [25]. With a similar concept, our Runtime system implements the heterogeneous queue mechanism based on the thread pool model. However, the workloads are tagged as either Coherent or Divergent based on their characteristics and will be assigned to the most suitable core for execution. We further improved the load balancing mechanism by dynamically adjusting the tile size and the pull operation.

3 Algorithm design and implementation

We first introduce the high-level concept of the feedback-guided mechanism and then elaborate on the details of the algorithm. We also describe how the Attribute Table is constructed and how to reduce data transfer overhead. Finally, we demonstrate how to achieve load balancing using a heterogeneous queue.

The concept of the proposed mechanism is to serve the following purpose: to combine individually incoherent rays into coherent groups to reduce possible irregularities when processed via a GPU. The irregularities include two aspects: control flow divergence and irregular memory access.
Fig. 1

Structure of the attribute table

3.1 Overview of the feedback-guided mechanism

To identify the coherency, we define an Attribute for each ray. The Attribute of a ray comprises Ray data, i.e., a plenoptic function [1] that represents the origin and direction, and Traversal path information that is bonded to a group identity. In addition, each ray is parameterized by the Primitive it originates from (the camera in the case of primary rays). This information is recorded during execution and stored in a structure called the Attribute Table, as illustrated in Fig. 1, comprising three arrays: Ray, Traversal and Primitive, which store RayInfo, TraversalInfo and PrimitiveID, accordingly. This design aims to capture the correlation of the ray attribute and the traversal path.
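
As a rough illustration of this structure, the sketch below models the Attribute Table as nested dictionaries, with the Primitive array as the outer layer, the Traversal array in the middle and the Ray array innermost (the layer order described in Sect. 3.2). The class and field names, the `CAMERA_SLOT` constant and the use of Python dictionaries instead of pointer-linked arrays in SVM are all assumptions made for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass(frozen=True)
class RayInfo:
    origin: Vec3
    direction: Vec3  # assumed to be normalized

CAMERA_SLOT = -1  # hypothetical PrimitiveID reserved for primary rays

@dataclass
class AttributeTable:
    # PrimitiveID -> TraversalInfo (hash) -> recorded RayInfo entries
    layers: Dict[int, Dict[int, List[RayInfo]]] = field(default_factory=dict)

    def record(self, primitive_id: int, traversal_info: int, ray: RayInfo) -> None:
        """Store a ray under the primitive it came from and the path it took."""
        rays = self.layers.setdefault(primitive_id, {}).setdefault(traversal_info, [])
        if ray not in rays:  # keep one entry per attribute
            rays.append(ray)

    def candidates(self, primitive_id: int) -> Dict[int, List[RayInfo]]:
        """Return the Traversal layer for one primitive (empty if unseen)."""
        return self.layers.get(primitive_id, {})

table = AttributeTable()
primary = RayInfo((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
table.record(CAMERA_SLOT, 42, primary)
table.record(CAMERA_SLOT, 42, primary)  # duplicate, ignored
```

Keying the outer layer by PrimitiveID means that a lookup for a new ray only has to inspect records that originated from the same primitive.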

The high-level concept of the mechanism is demonstrated in Fig. 2. Whenever new rays are generated from the previous bounce iteration (derived rays) or camera (primary rays), the algorithm queries their attributes from the Attribute Table (Attribute Lookup) before the start of the T&I procedure. If a correlation is found, the path information in the BVH tree will be returned (Path Lookup). After that, the rays with attributes that link to the same path will be grouped to form coherent workloads. The workloads that can be successfully grouped to fulfill a wavefront are tagged as Coherent, whereas the rest are marked as Divergent. All workloads are sent to the heterogeneous queue that dispatches the coherent parts to GPUs and leaves the divergent parts to CPUs.
Fig. 2

Overview of the Ray-Tracing runtime

Figure 3 illustrates the visualized BVH tree with path information of the Room scene. Areas with the same level of red indicate that all the rays intersecting the surface there have traversed the same number of nodes in the BVH tree, as explained in the following sections. By applying the proposed mechanism, the next time that similar rays are generated, they will be put into the same group. As a result, the rays with identical color levels will be computed simultaneously in a wavefront.
Fig. 3

Visualized bounding volume hierarchy

3.2 Constructing the attribute table

The structure of the Attribute Table is a triply indexed lookup table linked by pointers placed in the SVM. Accessing the Attribute Table is similar to a page table walk in computer architecture. To reduce the memory transfer overhead and the number of rays that need to be inspected, we restructure the layout of the table. For instance, the Primitive array is set as the first layer of the table. It is utilized as a filter to select valid rays since a valid ray for the current iteration must have been emitted, by diffuse or specular reflection or by refraction, from a unique primitive in the previous iteration. The Traversal array represents the traversal paths as group identities. Each element (TraversalInfo) of the Traversal array stores a hash value. Because TraversalInfo is designed to minimize branch divergence and must be unique without inducing extra overhead, it is computed as a simple weighted sum of the following parameters: the number of nodes traversed, the number of leaf nodes inspected and the identity of the node where the closest intersection point is found. The hash value of TraversalInfo is encoded as a bit stream to reduce the memory usage. Figure 4 illustrates the parameters of the corresponding path information. The Ray array, the last layer of the table, stores the ray information (RayInfo) of the paths (Fig. 2).
Fig. 4

Calculation of parameters for attributes
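
One possible reading of the weighted sum is sketched below. The weights are not given in the text, so the sketch assumes power-of-two weights, which makes the sum equivalent to packing the three parameters into disjoint bit fields; the field widths `NODE_BITS` and `LEAF_BITS` and the function name `traversal_info` are hypothetical.

```python
NODE_BITS = 12  # assumed width of the traversed-node counter
LEAF_BITS = 8   # assumed width of the inspected-leaf counter

def traversal_info(nodes_traversed: int, leaves_inspected: int, hit_node_id: int) -> int:
    """Weighted sum of the three path parameters with power-of-two weights."""
    if not 0 <= nodes_traversed < (1 << NODE_BITS):
        raise ValueError("nodes_traversed does not fit in NODE_BITS")
    if not 0 <= leaves_inspected < (1 << LEAF_BITS):
        raise ValueError("leaves_inspected does not fit in LEAF_BITS")
    if hit_node_id < 0:
        raise ValueError("hit_node_id must be non-negative")
    w_leaf = 1 << NODE_BITS               # weight of the leaf counter
    w_hit = 1 << (NODE_BITS + LEAF_BITS)  # weight of the hit node identity
    return nodes_traversed + w_leaf * leaves_inspected + w_hit * hit_node_id

h = traversal_info(5, 2, 3)
```

With these assumed weights, two paths receive the same value only if all three parameters agree, and the computation stays at two multiplications and two additions per ray.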

Fig. 5

Control flow of Ray-Tracing

3.3 Predictive regrouping with attribute analysis

The flowchart of the Ray-Tracing algorithm is illustrated in Fig. 5. The phases in Ray-Tracing are composed of three parts: Ray Generation, which generates the primary rays; Ray Intersection, which applies the T&I test; and Rendering, which integrates the radiance and emits the corresponding rays based on the characteristic of the surface. All rays are collected and sent back to the Ray Intersection stage for further computation, as shown in the red rectangle. The attribute analysis introduces two additional stages to the flowchart, Attribute Update and Lookup, highlighted in purple.

When rays traverse through the BVH tree and intersect with primitives, each individual path will be converted to a unique TraversalInfo and the RayInfo will be stored in the corresponding array at the end of the test. Before the next iteration starts, the Runtime queries the Attribute Table to find the closest match of the attribute for each newly generated ray by comparing against a threshold. The threshold is a metric that is dynamically adjusted by the Runtime and indicates the minimum differences of the distance and angle from the ray being observed. Ideally, if the threshold becomes small enough after adjustment, the corresponding records will be considered as valid matches, and their TraversalInfo (Hash value) will be assigned to the new rays as the group identity. The proposed algorithm is called Attribute Lookup, as shown in Algorithm 1 and is effective for both primary and derived rays in the entire ray space. It is computed by the GPU kernel due to its parallel characteristic, which has each ray being handled by a work item. After the Lookup process, the Runtime regroups the workload based on the Group identity to generate coherent permutations. Note that the rays are stored in a doubly indexed array, and only the indexes are swapped during regrouping.
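
The lookup and regrouping steps described above can be sketched as follows. This is a sequential Python stand-in for what the text describes as a GPU kernel, and it is not Algorithm 1 itself: the function names, the flat `records` list, the fixed thresholds `max_dist` and `max_angle`, and the ranking of candidates by the sum of distance and angle are all assumptions.

```python
import math

def attribute_lookup(ray, records, max_dist, max_angle):
    """Return the TraversalInfo of the closest recorded ray, or None.

    ray: (origin, direction) with a unit-length direction.
    records: iterable of (origin, direction, traversal_info).
    """
    origin, direction = ray
    best_info, best_cost = None, float("inf")
    for rec_origin, rec_dir, info in records:
        dist = math.dist(origin, rec_origin)
        cos = sum(a * b for a, b in zip(direction, rec_dir))
        angle = math.acos(max(-1.0, min(1.0, cos)))
        if dist > max_dist or angle > max_angle:
            continue  # outside the threshold: not a valid match
        cost = dist + angle  # assumed ranking of valid candidates
        if cost < best_cost:
            best_info, best_cost = info, cost
    return best_info

def regroup(rays, records, max_dist, max_angle):
    """Return (index order, group ids); only indices are permuted."""
    ids = [attribute_lookup(r, records, max_dist, max_angle) for r in rays]
    # Rays with equal group ids become adjacent; unmatched rays go last.
    order = sorted(range(len(rays)), key=lambda i: (ids[i] is None, ids[i] or 0))
    return order, ids

records = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 11),
           ((5.0, 0.0, 0.0), (1.0, 0.0, 0.0), 22)]
rays = [((5.1, 0.0, 0.0), (1.0, 0.0, 0.0)),
        ((0.1, 0.0, 0.0), (0.0, 0.0, 1.0)),
        ((9.0, 9.0, 9.0), (0.0, 1.0, 0.0)),
        ((0.0, 0.1, 0.0), (0.0, 0.0, 1.0))]
order, ids = regroup(rays, records, max_dist=0.5, max_angle=0.2)
```

Returning a permutation of indices rather than moving ray data mirrors the doubly indexed array mentioned above, where only the indexes are swapped.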

The Lookup process is predictive, and the traversal path may not follow the expected order. During the T&I process, the exact hash value of the traversed path is also calculated according to Fig. 4 and is compared with the assigned TraversalInfo at the end of the execution. The content of the Attribute Table needs to be updated if a ray does not follow the predicted path. In this case, the Runtime triggers the update procedure. The RayInfo is added to another slot in the array with a different TraversalInfo that belongs to the new path information. The size of the table grows dynamically: a new entry is created only if there is no entry for that attribute. The process is called Attribute Update, as described in Algorithm 2. Note that the records of the Attribute Table are updated only when a ray has made a successful intersection with a primitive but did not match the predicted path. In other words, only slight overhead is involved if the hit rate accuracy is high. For example, in our experiments the hit rate accuracy can reach 80% after several iterations. Thus, only a small portion of time is spent on updating the table afterward.
The prediction can fail because the closest match of a given attribute is not necessarily the correct match, except when the match is identical. However, the design of Attribute Update enables the proposed mechanism to calibrate itself to achieve higher grouping accuracy. For instance, if the prediction fails on a group of rays, the Attribute Update operation will record these rays to another entry slot of the Traversal array based on their actual traversal path. Thus, the next time similar rays are encountered, the record after calibration will become the closest match and therefore increase the accuracy of prediction. Note that the previous record does not need to be erased since a record is stored only if the exact intersection has occurred before, and thus it will still be a valid reference for another set of rays. Figure 6 gives an example of self-calibration. Assume that, at the first iteration, two rays (blue and red) have intersected with primitives A and B. At the start of the second iteration, the orange ray will be predictively given a group ID equal to the blue ray’s since it is the closest match. However, the prediction is wrong since the orange ray intersects with primitive B. Therefore, the group of the traversal path of the orange ray will be updated to become the same as the red ray’s. Thus, at the third iteration, the green ray will have a correct prediction since it is closer to the orange ray.
Fig. 6

An example of self-calibration with the order the rays are generated
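
The self-calibration example of Fig. 6 can be replayed with a toy model. In the sketch below a single scalar stands in for the full ray attribute (origin and direction), the group labels "A" and "B" stand in for TraversalInfo values, and the positions of the four rays are invented for illustration; `predict` and `update` are hypothetical names and do not reproduce Algorithm 2.

```python
def predict(records, attribute):
    """Return the group of the nearest recorded attribute, or None."""
    if not records:
        return None
    return min(records, key=lambda rec: abs(rec[0] - attribute))[1]

def update(records, attribute, predicted, actual):
    """Add a record only if the prediction missed and no such entry exists."""
    if predicted != actual and (attribute, actual) not in records:
        records.append((attribute, actual))

# Iteration 1: the blue ray (0.0) hit A and the red ray (10.0) hit B.
records = [(0.0, "A"), (10.0, "B")]

# Iteration 2: the orange ray (4.0) is closest to blue, so A is predicted,
# but it actually hits B; the table is calibrated with a new record.
orange_prediction = predict(records, 4.0)
update(records, 4.0, orange_prediction, "B")

# Iteration 3: the green ray (5.0) is now closest to the orange record.
green_prediction = predict(records, 5.0)
```

The old blue record is kept, in line with the remark above that earlier records remain valid references for other rays.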

3.4 Workload distribution with heterogeneous queues

GPU devices favor regular algorithms due to their parallel processing model. For instance, Ray Generation is a suitable candidate to be processed by a GPU since the algorithm contains no branches. In contrast, the T&I test may degrade the performance of a GPU due to the irregular property. With the design of the Attribute Table, we are able to assign workloads to different heterogeneous cores based on their characteristics. For the T&I test, each ray will be given a group ID by applying Algorithm 1. After the group assignment, we can forward these groups to individual heterogeneous cores based on their properties. This process is called Tagging. A group with size larger than a wavefront will be tagged as Coherent and will favor GPU execution, whereas groups with few rays will be merged together, and the merged group will be marked as Divergent and prioritized for the CPU. This minimizes the probability of causing branch divergence in a GPU.
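
A minimal sketch of Tagging is given below. It assumes a wavefront width of 64 work items and treats a group as Coherent when it has at least that many rays; both the constant and the exact comparison are assumptions, as are the function name `tag_groups` and the convention that unmatched rays carry a group ID of `None`.

```python
from collections import defaultdict

WAVEFRONT_SIZE = 64  # assumed wavefront width; not stated in the text

def tag_groups(group_ids, wavefront=WAVEFRONT_SIZE):
    """Split ray indices into Coherent groups and one merged Divergent group.

    group_ids: one group id per ray (None for rays without a match).
    Returns (coherent, divergent): a list of index lists and one index list.
    """
    groups = defaultdict(list)
    for ray_index, gid in enumerate(group_ids):
        groups[gid].append(ray_index)

    coherent, divergent = [], []
    for gid, members in groups.items():
        if gid is not None and len(members) >= wavefront:
            coherent.append(members)   # can fill a wavefront: favors the GPU
        else:
            divergent.extend(members)  # small groups are merged for the CPU
    return coherent, divergent

# Small wavefront width so the example stays readable.
coherent, divergent = tag_groups([1, 1, 1, 1, 2, 2, None, 1], wavefront=4)
```

In this example group 1 has five rays and is tagged Coherent, while the two rays of group 2 and the unmatched ray are merged into the Divergent workload.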

With the groups being tagged, the Runtime system must decide how to distribute them to all cores on the platform to ensure load balancing. To tackle this issue, we propose the mechanism of the heterogeneous queue. The heterogeneous queue is a software-based component that exercises the producer–consumer design pattern. During initialization, one queue is created for each core on the platform. An element stored in the queue is called an entry and wraps the code for execution and the required data by using a task function pointer and a data pointer. A task function pointer points to an instance of a phase in Ray-Tracing, such as Ray Generation, Intersection or Rendering. To achieve heterogeneity, the function is implemented as an OpenCL kernel, which can be dispatched to different devices in a heterogeneous system. The data pointer refers to a partitioned tile of the workload data. The heterogeneous queue adopts the Thread Pool design pattern: Worker threads are created during initialization and wait for incoming entries at the front of each queue. The workload is partitioned into tiles based on the group size before insertion into the queues. Figure 7 illustrates the function of Tagging and the heterogeneous queues.

To achieve load balancing, a pulling mechanism similar to task stealing is introduced. By design, each entry can be pulled from one queue to another. If a device has completed all the originally assigned entries, it simply pulls workloads from other queues to participate in the computation and prevent itself from being idle.
Fig. 7

Structure of heterogeneous queues with tagging
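
The queue structure and the pulling mechanism can be approximated in a few lines of Python. In this sketch plain callables stand in for OpenCL kernels, a single lock protects all queues, and an idle worker pulls from the back of any non-empty queue; the class name `HeterogeneousQueues`, the device labels and these policies are assumptions and are not taken from the Runtime described here.

```python
import threading
from collections import deque

class HeterogeneousQueues:
    """One queue per device; idle workers pull entries from other queues."""

    def __init__(self, devices):
        self.queues = {d: deque() for d in devices}
        self.lock = threading.Lock()
        self.results = []

    def push(self, device, task, tile):
        """An entry wraps the code to run (task) and its data (tile)."""
        self.queues[device].append((task, tile))

    def _next_entry(self, device):
        with self.lock:
            if self.queues[device]:
                return self.queues[device].popleft()
            for q in self.queues.values():  # own queue is empty: pull
                if q:
                    return q.pop()
            return None

    def _worker(self, device):
        while True:
            entry = self._next_entry(device)
            if entry is None:
                return  # nothing left anywhere
            task, tile = entry
            out = task(tile)
            with self.lock:
                self.results.append((device, out))

    def run(self):
        threads = [threading.Thread(target=self._worker, args=(d,))
                   for d in self.queues]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results

hq = HeterogeneousQueues(["gpu", "cpu"])
for tile in range(5):
    hq.push("gpu", lambda t: t * t, tile)  # stand-in for a Coherent tile
hq.push("cpu", lambda t: -t, 10)           # stand-in for a Divergent tile
results = hq.run()
```

Which device ends up processing a pulled entry depends on thread timing, so only the set of results is deterministic, not the device recorded next to each one.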

Fig. 8

Test scenes: Bunny (69.4 K), Caustic (56.1 K), Cornell (26 K), Room (331.9 K), Mirror (149.2 K), Sponza (289.9 K), Miguel (7.88 M) and Plant (12.7 M)

4 Experimental results and analysis

The experiment was carried out on an AMD Kaveri A10-7850K [9] with 16 GB of memory. The OpenCL Runtime is the AMD APP SDK v3.0. The screen size is fixed to 512 × 512. The test scenes are shown in Fig. 8. A BVH tree is used as the accelerator. We use the Radeon-Rays framework to drive multiple backends [5]. For instance, Embree v2.11 is utilized to demonstrate the performance improvement. Embree is a state-of-the-art Ray-Tracing library that is optimized for multicore CPUs with SIMD capability [38].

Since Ray-Tracing is iterative, to better explain the algorithm, we define two terms as iteration counters: Bounce and Sample Count. Bounce is a number that indicates how many times a ray is traced inside a scene. It is visualized as the loop (from Attribute Lookup to New Ray and back to Attribute Lookup) inside the red rectangle in Fig. 5. Based on the properties of surfaces, a complex scene with specular materials may require more Bounces to generate high-quality and accurate images [26]. The number of Bounces is set to six. Sample Count is a number measured per pixel on the screen that indicates how many rays have been emitted. Each ray for an individual pixel must complete the full algorithm path in Fig. 5 to be counted as one sample. A larger Sample Count leads to a better image quality.
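
The relation between the two counters can be written as a pair of nested loops. The sketch below passes each phase in as a callable so that it stays independent of any particular implementation; the function name `trace_samples` and the signatures of the phase callables are hypothetical, and only the bounce limit of six comes from the text.

```python
MAX_BOUNCES = 6  # value used in the experiments

def trace_samples(sample_count, generate, lookup, intersect, update, shade):
    """Run the loop of Fig. 5: one outer pass per sample, one inner pass per bounce."""
    bounces_run = 0
    for _ in range(sample_count):
        rays = generate()                   # Ray Generation (primary rays)
        for _ in range(MAX_BOUNCES):
            if not rays:
                break                       # every ray terminated early
            groups = lookup(rays)           # Attribute Lookup
            hits = intersect(rays, groups)  # T&I test
            update(rays, groups, hits)      # Attribute Update
            rays = shade(hits)              # Rendering emits derived rays
            bounces_run += 1
    return bounces_run

# Stub phases: each bounce loses one ray, as in an open scene.
total = trace_samples(
    sample_count=2,
    generate=lambda: [1, 2, 3, 4],
    lookup=lambda rays: None,
    intersect=lambda rays, groups: rays,
    update=lambda rays, groups, hits: None,
    shade=lambda hits: hits[:-1],
)
```

With these stubs each sample stops after four bounces because no active rays remain, which is the early-termination behavior discussed for open scenes.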

The proposed algorithm relies on the caching effect to function since the Attribute Table needs to collect path information during execution. Initially, the Attribute Table is empty. However, we show that the required information can quickly be filled in within several Sample Count iterations. To demonstrate the effect, all scenes were sampled three times before the following experiments were conducted. Theoretically, as the value of Sample Count increases, the hit rate accuracy can become higher since more information will be recorded as reference.

4.1 Branch divergence and the impact of grouping

We illustrate the impact of control flow divergence in a GPU and the effectiveness of grouping by comparing the performance of the T&I test. Figure 9 illustrates the performance enhancement for the GPU measured in throughput (Rays per second) and the GPU utilization of the Room scene. Because the Room scene is an enclosed scene, the number of rays for each bounce iteration will be maintained at roughly the same level since it is not heavily affected by the issue of early termination. In contrast, in open scenes such as Bunny, the number of active rays will decrease very quickly at the end of each Bounce iteration.

The degree of branch divergence is represented by the utilization of the GPU Vector Arithmetic and Logic Unit (VALU). A higher percentage of GPU utilization indicates fewer branch divergence occurrences. The experiment has shown that branch divergence affects the performance significantly. With the help of Attribute Table and Grouping, the GPU utilization increases from 20.3% to 70.3% on average. The result leads to higher effective throughput measured in sample rays per second for each bounce iteration. The throughput becomes three times higher than the original.
Fig. 9

Performance enhancement measured in throughput and utilization of the Room scene. A higher GPU utilization indicates fewer branch divergence occurrences

4.2 Grouping effectiveness and hit rate accuracy

The grouping mechanism is predictive. One might be curious about the effectiveness of grouping and how accurate the mechanism is. Figures 10 and 11 illustrate the number of grouped rays and the grouping accuracy of the hit rate. The hit rate indicates how many grouped rays are actually following the predictive path. In Fig. 10, because Room is an enclosed scene, the total number of active rays for each Bounce iteration does not change much. Comparatively, in Fig. 11, because Bunny is an open scene, many rays will terminate early and become invalid. Thus, the total number of rays decreases rather quickly. However, no matter what the characteristic of the scene is, the mechanism of the Attribute Table can still group over half of the active rays and achieve very high accuracy. On average, it achieves 81.3% hit accuracy in the Room scene and 86.45% in Bunny. Notice that the first bounce (in case 0 of both scenes) possesses a very high hit rate and a large number of grouped rays since it is from primary rays, which are relatively regular compared to derived rays.
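
For clarity, the hit rate used here can be computed as in the sketch below, assuming that each ray carries a predicted group ID (or `None` if it was not grouped) and that the actual TraversalInfo is known after traversal. The function name `hit_rate` and the example values are hypothetical.

```python
def hit_rate(predicted, actual):
    """Fraction of grouped rays whose actual path matches the predicted one."""
    grouped = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    if not grouped:
        return 0.0
    return sum(p == a for p, a in grouped) / len(grouped)

# Four of the five rays were grouped; three of them followed the prediction.
rate = hit_rate([1, 1, 2, None, 2], [1, 3, 2, 5, 2])
```

Rays that were never grouped are excluded from the denominator, so the metric measures prediction accuracy rather than grouping coverage.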

The effectiveness of grouping is independent of viewpoints since it captures and explores the hidden coherency of the ray space. To prove this effectiveness, we conduct an additional measurement that performs a camera flight. Figure 12 depicts the hit rate accuracy on 12 individual viewpoints, along with the accumulated number of grouped rays. The results show that the number of grouped rays and hit accuracy can still be maintained at a high level.
Fig. 10

Grouped rays with hit accuracy of the Room scene

Fig. 11

Grouped rays with hit accuracy of the Bunny scene

Fig. 12

Hit accuracy of multiple viewpoints

4.3 Impact of memory transfer overhead

Although the support of SVM removes the need for explicit data copying, the data required for computation still need to be implicitly transferred from the SVM region to the GPU’s internal local storage or registers before execution. The benefits of using the Attribute Table also include the reduction of data transfer and traffic. Figure 13 depicts the amount of data transfer along with the cache hit rate on a GPU. Because successfully grouped rays will follow almost the same path during traversal, only the nodes contained in that path need to be loaded onto the GPU. Thus, the memory access pattern of the work items in a wavefront becomes coherent. Compared to a divergent memory access pattern, this behavior also increases the chance of keeping valid nodes of a BVH tree inside caches, which leads to higher cache hit rates and less data retransmission. According to Fig. 13, the cache hit rate increases from 42 to 80% on average, and the amount of data transfer is reduced by 80%.
Fig. 13

Data transfer and cache hit rate of the Room scene

We also measure the data transfer overhead of accessing the Attribute Table. Figure 14 illustrates the results. With the design of multiple layers, we do not need to load all the arrays in the table during Attribute Lookup. Only the referenced data are required. For example, the first layer of the Attribute Table is the Primitive array. During the Attribute Lookup process, if multiple rays being observed have intersected with the same primitive before, then only one Traversal array will be loaded for those rays. The same principle applies to deeper layers. This phenomenon reduces the overhead of data access since it is common in many scenes to have a large primitive that occupies a space that leads to multiple ray intersections, such as the wall and ground floor in the Room scene. This is particularly the case for primary rays (in case 0) since they are all from the camera. A special slot in the Primitive array is reserved for them. Alternatively, since the table needs to be updated only if the rays are mispredicted but have made an intersection with primitives, the data access overhead for Attribute Update is insignificant if the hit rate accuracy is high. The total size of the Attribute Table is 468 MB when stabilized, but only 56 MB is loaded in this round.
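
The saving from the layered layout can be illustrated by counting how many records a batch of rays actually touches. The sketch below represents the table as nested dictionaries (PrimitiveID, then TraversalInfo, then a list of ray records) and counts records rather than bytes; the function name `touched_entries` and the toy table are hypothetical and unrelated to the measured sizes.

```python
def touched_entries(table, source_primitives):
    """Return (records loaded, records in the whole table).

    table: primitive_id -> {traversal_info: [ray records]}.
    source_primitives: the primitive each new ray was emitted from.
    """
    touched = {p for p in source_primitives if p in table}
    loaded = sum(len(rays) for p in touched for rays in table[p].values())
    total = sum(len(rays) for sub in table.values() for rays in sub.values())
    return loaded, total

table = {
    0: {11: ["a", "b"]},
    1: {22: ["c"], 33: ["d", "e"]},
    2: {44: ["f"]},
}
# Three rays come from primitive 1; one comes from a primitive not yet recorded.
loaded, total = touched_entries(table, [1, 1, 1, 5])
```

Because three of the four rays share the same source primitive, its Traversal layer is counted once, and the layers of primitives 0 and 2 are never touched.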

Although all the rays in the workload must be processed by individual work items, the data access overhead caused by Attribute Lookup may not significantly affect the performance since it is executed by a GPU. GPU processors possess high memory-level parallelism (MLP) with a stall-at-dependent instruction policy. This means that, although GPGPUs have in-order processors, cache misses do not prevent the execution of an instruction from the same work item. The processor could still execute instructions from the same wavefront until instructions that are dependent on the cache misses stall the wavefront. This mechanism can increase the number of in-flight memory requests. Therefore, if the kernel program has high thread-level parallelism, the lookup overhead may not have a significant effect because the latency can be hidden by vertical multithreading.
Fig. 14

Overhead of memory access in terms of data size caused by Attribute Lookup and Update in the Room scene

Fig. 15

Performance comparison with multiple backends and frameworks

Fig. 16

Time measurement of each phase on Embree, OpenCL 1.2 and the attribute table

4.4 Overall throughput comparison

We compare the performance gain of different configurations, as illustrated in Fig. 15. The configurations are as follows. Embree denotes the results of the Embree framework running on a CPU with SIMD extensions. OpenCL CPU denotes the results on a CPU using OpenCL 1.2. Note that Embree is highly optimized for multicore CPUs with advanced SIMD instructions specialized for Intel hardware. However, Kaveri has only four physical cores; therefore, the CPU-only throughput is sometimes lower than that of the GPU counterparts. OpenCL GPU and HSA denote the throughput on a single GPU device using the OpenCL 1.2 framework and HSA, respectively. Attribute Table denotes the feedback-guided mechanism running on the HSA framework on a single GPU. Finally, Attribute Table HQ adds a heterogeneous queue that runs on multiple CPU and GPU devices on Kaveri.

The throughput enhancement of the Attribute Table on a single GPU device is caused by the effective grouping mechanism, as indicated in the previous experiments. The heterogeneous queue boosts the performance results further since the workload can be distributed to all the cores on the platform based on the characteristics of tasks. In complex scenes such as Room and Sponza, the performance becomes 2 to 2.5 times greater than the original results on a single CPU or GPU.
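The idea of distributing tagged workloads across heterogeneous cores can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler used in the experiments: the two tags, the divergence score, the threshold and the rule "coherent groups go to the GPU, divergent groups go to the CPU" are all hypothetical choices made for the example.

```python
from collections import deque

# Hypothetical tags; the actual tag set is not specified in this section.
COHERENT, DIVERGENT = "coherent", "divergent"


def dispatch(groups, divergence_threshold=0.5):
    """Route each ray group to a device queue according to its tag.

    groups: iterable of (group_name, divergence) pairs, where divergence is
        an assumed score in [0, 1] describing how irregular the group is.
    Returns a dict with one FIFO queue per device.
    """
    queues = {"gpu": deque(), "cpu": deque()}
    for name, divergence in groups:
        # Tag the group by its execution pattern.
        tag = COHERENT if divergence < divergence_threshold else DIVERGENT
        # Assumption: regular work suits the GPU, irregular work the CPU.
        device = "gpu" if tag == COHERENT else "cpu"
        queues[device].append((name, tag))
    return queues


queues = dispatch([("g0", 0.1), ("g1", 0.9), ("g2", 0.3)])
print(list(queues["gpu"]), list(queues["cpu"]))
```

Keeping one queue per device lets every core pull work that matches its strengths, so that no device sits idle while another is overloaded with unsuitable tasks.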

4.5 Execution time measurement of each phase

We measure the execution time of all phases in Ray-Tracing to demonstrate the efficiency of the proposed method. Radeon-Rays is also utilized to attach distinct backends for comparison. The configuration is set as follows: Embree represents the result on the Embree framework, which is processed on a CPU with SIMD extension. OpenCL indicates the performance of OpenCL 1.2 powered by a GPU that exercises explicit data copying. Finally, Attribute Table HQ indicates the feedback-guided methodology with the heterogeneous queue on HSA. The overhead of the Attribute Table Lookup and Update is also measured. Figure 16 illustrates the performance evaluation.

Ray Generation is slower on Embree than on the other backends because this phase is regular and GPUs execute it more effectively. In contrast, T&I may cause branch divergence, which slows down processing on GPU devices. Depending on the surface properties, the Rendering phase includes control paths that take longer to execute and may generate extra rays. As a result, the GPU on OpenCL can perform worse than the CPU on Embree. Even worse, the overhead of data copying and marshaling for GPU devices drags the performance down further, making the overall execution time longer than that of the CPU. In scenes with many primitives, such as Room and Sponza, the data transfer overhead is nearly equal to the processing time of Rendering. Consequently, the OpenCL implementation cannot compete with the original Embree framework in some circumstances: it takes longer to execute in scenes that contain objects with various surface properties (Caustic) or huge tree data with divergent paths (Room, Sponza and Miguel). In summary, for GPU computation, the T&I time is closely correlated with the size of the BVH tree and the level of branch divergence, whereas the Rendering time is affected by the complexity of the surfaces.

By comparison, the Attribute Table outperforms the other cases because it reduces irregularities and improves the locality of the data objects, which shortens the processing times of both the T&I and Rendering phases. Compared with the OpenCL implementation, the improvement is significant in relatively complex scenes that contain many primitives, such as Room, Sponza and Miguel. It yields a performance boost of 2 to 3 times over the original version. Additionally, all of its execution times are shorter than those of Embree.

5 Conclusion

In this paper, we have proposed a feedback-guided mechanism called Attribute-Based Ray Regrouping that can effectively reveal the hidden coherency of the rays, leading to performance enhancements on a heterogeneous system. The mechanism identifies regular patterns out of incoherent workloads to form groups that improve the efficiency of GPU computation. Furthermore, the heterogeneous queue and the method of Tagging are proposed to resolve the issue of load balancing. After grouping, the workloads are tagged with labels that describe their characteristics and will be processed by the most suitable heterogeneous core based on the properties of the execution patterns. According to the experiments, the overall performance becomes 2 to 2.5 times better than the original GPU and CPU versions on Kaveri for both enclosed and open scenes. The amount of data transfer is also reduced by 80%.


References

  1. Adelson, E.H., Bergen, J.R.: The plenoptic function and the elements of early vision. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology (1991)
  2. Áfra, A.T., Benthin, C., Wald, I., Munkberg, J.: Local shading coherence extraction for SIMD-efficient path tracing on CPUs. In: Proceedings of High Performance Graphics, pp. 119–128. Eurographics Association (2016)
  3. Aila, T., Karras, T.: Architecture considerations for tracing incoherent rays. In: Proceedings of the Conference on High Performance Graphics, pp. 113–122. Eurographics Association (2010)
  4. Aila, T., Laine, S.: Understanding the efficiency of ray traversal on GPUs. In: Proceedings of the Conference on High Performance Graphics 2009, pp. 145–149. ACM (2009)
  5. AMD and GPUOpen: Radeon-rays.
  6. Barringer, R., Akenine-Möller, T.: Dynamic ray stream traversal. ACM Trans. Graph. 33(4), 151 (2014)
  7. Benthin, C., Wald, I., Woop, S., Ernst, M., Mark, W.R.: Combining single and packet-ray tracing for arbitrary ray distributions on the intel mic architecture. IEEE Trans. Vis. Comput. Graph. 18(9), 1438–1448 (2012)
  8. Boulos, S., Wald, I., Benthin, C.: Adaptive ray packet reordering. In: IEEE Symposium on Interactive Ray Tracing, 2008. RT 2008, pp. 131–138 (2008)
  9. Bouvier, D., Sander, B.: Applying amd’s kaveri apu for heterogeneous computing. In: Hot Chips: A Symposium on High Performance Chips (HC26) (2014)
  10. Dammertz, H., Hanika, J., Keller, A.: Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. In: Computer Graphics Forum, vol. 27, pp. 1225–1233. Wiley Online Library, New York (2008)
  11. Davidovič, T., Křivánek, J., Hašan, M., Slusallek, P.: Progressive light transport simulation on the GPU: survey and improvements. ACM Trans. Graph. 33(3), 29 (2014)
  12. Eisenacher, C., Nichols, G., Selle, A., Burley, B.: Sorted deferred shading for production path tracing. In: Computer Graphics Forum, vol. 32, pp. 125–132. Wiley Online Library, New York (2013)
  13. Garanzha, K., Loop, C.: Fast ray sorting and breadth-first packet traversal for gpu ray tracing. In: Computer Graphics Forum, vol. 29, pp. 289–298. Wiley Online Library, New York (2010)
  14. Gribble, C.P., Ramani, K.: Coherent ray tracing via stream filtering. In: IEEE Symposium on Interactive Ray Tracing, 2008. RT 2008, pp. 59–66 (2008)
  15. Gunther, J., Popov, S., Seidel, H.P., Slusallek, P.: Realtime ray tracing on GPU with BVH-based packet traversal. In: IEEE Symposium on Interactive Ray Tracing, 2007. RT’07, pp. 113–118 (2007)
  16. Jeffers, J., Reinders, J.: Intel Xeon Phi coprocessor high performance programming. 1st edn. Morgan Kaufmann Publishers, San Francisco, CA, USA (2013)
  17. Kao, C.C., Hsu, W.C.: Runtime techniques for efficient ray-tracing on heterogeneous systems. In: 2015 IEEE International Conference on Digital Signal Processing (DSP), pp. 100–104 (2015)
  18. Kao, C.C., Miao, Y.T., Hsu, W.C.: A pipeline-based runtime technique for improving ray-tracing on HSA-compliant systems. In: 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2016)
  19. Laine, S., Karras, T., Aila, T.: Megakernels considered harmful: wavefront path tracing on gpus. In: ACM Proceedings of the 5th High-Performance Graphics Conference, pp. 137–143 (2013)
  20. Moon, B., Byun, Y., Kim, T.J., Claudio, P., Kim, H.S., Ban, Y.J., Nam, S.W., Yoon, S.E.: Cache-oblivious ray reordering. ACM Trans. Graph. 29(3), 28 (2010)
  21. Munshi, A., Gaster, B., Mattson, T.G., Ginsburg, D.: OpenCL Programming Guide. Pearson Education, New Jersey (2011)
  22. Novák, J., Havran, V., Dachsbacher, C.: Path regeneration for interactive path tracing. In: Proceedings on EUROGRAPHICS Short Papers (2010)
  23. Overbeck, R., Ramamoorthi, R., Mark, W.R.: Large ray packets for real-time whitted ray tracing. In: IEEE Symposium on Interactive Ray Tracing, 2008. RT 2008, pp. 41–48 (2008)
  24. Pajot, A., Barthe, L., Paulin, M., Poulin, P.: Combinatorial bidirectional path-tracing for efficient hybrid CPU/GPU rendering. In: Computer Graphics Forum, vol. 30, pp. 315–324. Wiley Online Library, Hoboken (2011)
  25. Parker, S.G., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A., Stich, M.: Optix: a general purpose ray tracing engine. ACM Trans. Graph. 29, 66 (2010)
  26. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann, Burlington (2004)
  27. Pharr, M., Kolb, C., Gershbein, R., Hanrahan, P.: Rendering complex scenes with memory-coherent ray tracing. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 101–108. ACM Press/Addison-Wesley Publishing Co. (1997)
  28. Ramani, K., Gribble, C.P., Davis, A.: Streamray: a stream filtering architecture for coherent ray tracing. ACM Sigplan Not. 44, 325–336 (2009)
  29. Rogers, P., Fellow, A.: Heterogeneous system architecture overview. In: 2013 IEEE Hot Chips 25 Symposium (HCS), pp. 1–41 (2013). doi: 10.1109/HOTCHIPS.2013.7478286
  30. Sung, K., Craighead, J., Wang, C., Bakshi, S., Pearce, A., Woo, A.: Design and implementation of the maya renderer. In: IEEE Computer Graphics and Applications, 1998. Pacific Graphics’ 98. Sixth Pacific Conference on, pp. 150–159 (1998)
  31. Tong, W., Deng, Y.: Mining effective parallelism from hidden coherence for GPU based path tracing. In: ACM SIGGRAPH Asia 2013 Technical Briefs, p. 31 (2013)
  32. Tsakok, J.A.: Faster incoherent rays: multi-BVH ray stream tracing. In: ACM Proceedings of the Conference on High Performance Graphics 2009, pp. 151–158 (2009)
  33. Tzeng, S., Patney, A., Owens, J.D.: Task management for irregular-parallel workloads on the GPU. In: Proceedings of the Conference on High Performance Graphics, pp. 29–37. Eurographics Association (2010)
  34. Van Antwerpen, D.: Improving SIMD efficiency for parallel Monte Carlo light transport on the GPU. In: ACM Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 41–50 (2011)
  35. Wald, I.: Active thread compaction for GPU path tracing. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 51–58 (2011)
  36. Wald, I., Benthin, C., Boulos, S.: Getting rid of packets-efficient simd single-ray traversal using multi-branching bvhs. In: IEEE Symposium on Interactive Ray Tracing, 2008. RT 2008, pp. 49–57 (2008)
  37. Wald, I., Slusallek, P., Benthin, C., Wagner, M.: Interactive rendering with coherent ray tracing. In: Computer Graphics Forum, vol. 20, pp. 153–165. Wiley Online Library, New York (2001)
  38. Wald, I., Woop, S., Benthin, C., Johnson, G.S., Ernst, M.: Embree: a kernel framework for efficient cpu ray tracing. ACM Trans. Graph. 33(4), 143 (2014)

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. National Taiwan University, Taipei, Taiwan
