Exploring hidden coherency of Ray-Tracing for heterogeneous systems using online feedback methodology
Although naturally adopting an embarrassingly parallel paradigm, Ray-Tracing is also categorized as an irregular program that is troublesome to run on graphics processing units (GPUs). Conventional designs suffer from a performance penalty due to the irregularity of the control flow and memory access caused by incoherent rays. This work aims to explore the hidden coherency of rays by designing a feedback-guided mechanism that serves the following concept: extraction of the hidden regular portions out of the irregular execution flow. The method records the correlation of ray attributes and the traversed path and groups the newly generated rays to reduce potential irregularities for the ongoing execution. This mechanism captures the information from the entire ray space and can extract the hidden coherency from both primary and derived rays. The result leads to performance gains and an increase in resource utilization. The performance becomes 2 to 2.5 times higher than the original GPU and CPU versions.
Keywords: Ray-Tracing, Heterogeneous systems, Irregular program, HSA, Shared virtual memory
1 Introduction
Ray-Tracing is a rendering algorithm that requires massive computing power. It generates realistic images by emitting and tracing a variety of ray types to implement global illumination. Typically, each generated ray must be examined for intersection with the objects in the scene. For this reason, the Ray-Tracing algorithm is often implemented in an embarrassingly parallel fashion that is potentially suitable for processing by a graphics processing unit (GPU). In this paradigm, each work item in a wavefront of a GPU is assigned to trace the path of a single ray, perform intersection tests with the objects and render the outcome by accumulating the radiance [4, 11, 21].
However, the geometric configuration of a scene is constructed from millions of primitives in many cases, which makes it impractical to perform the intersection test sequentially against all primitives (such as triangles). To speed up the testing procedure, tree-based hierarchical accelerators, such as Bounding Volume Hierarchy (BVH) trees or KD trees, are used to reduce the otherwise O(n) complexity of testing a ray for intersection against all n objects. However, these accelerators introduce serious control flow divergence and data locality issues that are troublesome for GPU computation, as the computing pattern becomes an irregular algorithm. If we instead adopted a brute-force sequential search for the intersection test, the GPU would achieve a large speedup on that regular workload, but the overall runtime would still increase because of the O(n) testing cost.
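To make the divergence concrete, the following is a minimal sketch (not the paper's implementation) of stack-based BVH traversal over 1-D intervals; the node layout and point queries are our own illustrative assumptions. Two nearby queries follow the same node sequence, while a distant query prunes a whole subtree, which is the data-dependent control flow that breaks SIMD lockstep.

```python
from dataclasses import dataclass, field

# Minimal BVH over 1-D intervals; node layout, interval primitives
# and point queries are illustrative assumptions, not the paper's
# actual scene data.

@dataclass
class Node:
    lo: float
    hi: float
    left: "Node | None" = None
    right: "Node | None" = None
    prims: list = field(default_factory=list)  # primitives stored at a leaf

def traverse(node, point, visited):
    """Collect primitives of every leaf whose interval contains `point`.

    The number of iterations and the branch taken at each node depend
    on the query itself: on a GPU, work items of one wavefront taking
    different paths here is exactly the control flow divergence the
    text describes."""
    stack, hits = [node], []
    while stack:
        n = stack.pop()
        visited.append((n.lo, n.hi))
        if not (n.lo <= point <= n.hi):
            continue  # prune this subtree; a diverging lane would idle here
        if n.left is None:
            hits.extend(n.prims)  # reached a leaf
        else:
            stack.append(n.right)
            stack.append(n.left)
    return hits
```

A query near the left leaves descends three levels, while a query on the right side skips the left subtree entirely after one box test.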
Transformed into an irregular program execution, Ray-Tracing will inherit the penalty of mapping to heterogeneous systems since its behavior changes unpredictably during execution. In particular, the dynamically varying control flows create thread divergences, which dramatically reduce the level of parallelism and SIMD lane utilization in GPUs. For instance, given two rays with distinct origins and directions, the two work items for processing them will follow greatly diverged execution paths. If the two work items belong to the same wavefront, the divergence in both control and memory access causes a performance penalty on a GPU. Machines that adopt MIMD (multiple instruction, multiple data) parallelism, such as CPUs and the Intel Xeon Phi, are better at supporting programs with irregular controls but are less energy efficient if the process becomes regular. In contrast, GPUs can handle regular code effectively due to the nature of their SIMD (single instruction, multiple data) execution pattern. Consequently, mapping irregular code in Ray-Tracing efficiently onto a heterogeneous system remains difficult.
To solve this divergence issue, previous studies have proposed methods for ray grouping. Rays that are successfully grouped are expected to traverse similar nodes in the accelerator. These rays with similar traversal paths are called coherent rays. To explore coherent rays, methods such as gathering rays into packets, ray sorting and compression, or combining single and packet ray traversal have been introduced [6, 7, 13]. However, it has also been shown that coherency is difficult to identify, and these methods are effective only for primary rays. Derived rays, such as diffuse reflective rays and shadow rays, cannot be handled and could eventually break the coherency of rays. In fact, perfect information regarding coherency is only available after the traversal is complete. Thus, the methods for finding potentially coherent rays are typically based on heuristics.
Based on this observation, an idea is raised: Would it be possible to identify the hidden ray coherency from an execution flow that seems to be incoherent at first glance? To address this question, we conform to the better-scaling approach of using BVH trees in intersection tests, and we take a feedback-directed approach to identify coherent rays to avoid branch divergence caused by tree searching. With this feedback mechanism, the algorithm is able to analyze and extract the potential regular portions of the irregular program execution and gather them to form a better group with less branch divergence for the GPU. For those workloads that cannot be compressed into a coherent group, we adopt a load balancing mechanism that enables the CPU to dynamically take over the duty of processing. The design concept is based on the following facts. We found that rays with similar attributes, such as nearby origins and orientations, are likely to follow the same path in the traversal procedure. Because tree traversal will be triggered multiple times, if the correlation of ray attributes and the traversed path is recorded, it can be used as a feedback-guided mechanism to group the newly generated rays the next time a similar pattern is encountered, reducing potential irregularities for the ongoing execution. The result leads to performance gains and an increase in resource utilization.
In this paper, we aim to uncover the hidden coherent rays by designing a feedback-guided mechanism called Attribute-Based Ray Regrouping, which is implemented in a Ray-Tracing Runtime. The mechanism can extract the coherency from not only the primary rays but also the derived rays since it is designed to capture the information of the entire ray space. The design leverages the features of the Heterogeneous System Architecture (HSA), such as shared virtual memory (SVM) and fast kernel dispatching, to further reduce systematic overheads such as memory transfer and kernel dispatching, which makes this approach more applicable. We analyzed the performance and design trade-offs of the Runtime construction.
The contributions of this paper are as follows:
- We propose a feedback-guided mechanism that is able to identify the hidden coherency from both primary and derived rays. The mechanism reduces irregularities, resulting in performance enhancement and an increase in hardware utilization.
- We propose a Tagging method that annotates the workloads based on their characteristics. The workloads are sent to the devices that are better suited for their execution through heterogeneous queues to achieve load balancing on heterogeneous cores.
- We utilize the features of HSA and analyze the performance advantage of the proposed method under different scenes to provide guidelines for further research in this area.
2 Related work
Many rendering systems based on Ray-Tracing adopt the single ray traversal algorithm. However, to exceed the performance of this algorithm, it is necessary to develop traversal algorithms that emphasize the exploitation of coherency from multiple rays that follow the same traversal path [10, 30, 36]. Packet ray traversal can achieve high speedup while handling a coherent workload, such as primary rays, but suffers severely from incoherency [7, 23, 37]. Many studies have indicated that this is caused by branch divergence and memory access irregularity and have tried to minimize these issues. For example, Aila et al. gave a comprehensive analysis of the impact of irregular execution and design considerations for efficiently mapping Ray-Tracing onto GPUs and other machines with wide SIMD. Kao et al. also addressed the issue of branch divergence in Ray-Tracing by developing a Runtime technology.
To reduce branch divergence and improve coherency, several studies have proposed techniques such as ray queuing, ray reordering [8, 27], ray sorting [2, 12, 13] and ray stream filtering [14, 28, 32] for intersection testing or shading. In addition, methods for mining effective parallelism and extracting more coherency have been introduced. For instance, Moon et al. proposed a heuristic method to order rays into coherent groups by finding an approximate intersection point for each ray on a simplified mesh of the scene. Tong et al. further extended the concept by using a partial traversal heuristic to help identify coherent rays. However, different from the above methods, our proposed methodology adopts a feedback-guided approach that groups rays by using the exact hit information from the previous iterations rather than a heuristic method. It is able to handle both primary and derived rays effectively since the algorithm captures and tracks the information from the ray space and correlates it with the traversal paths. Moreover, our method can partition the workloads across heterogeneous cores.
Another trend of research emphasizes addressing a specific type of branch divergence called early termination. Methods such as active thread compaction, pipeline-based ray ordering and ray regeneration [18, 22, 34, 35] are proposed to improve SIMD efficiency via better utilization of GPU hardware resources. However, as more rays are compacted together, work items in a wavefront tend to become more divergent due to the lack of ray categorization and grouping processes.
The benefit and performance impact of decomposing a Ray-Tracing megakernel into multiple specialized kernel fragments have also been reported. However, this decomposition was discouraged by the overhead resulting from kernel dispatching, queue management and data maintenance in memory. Our proposed technique overcomes these issues by constructing a Runtime that utilizes the features of HSA to lower the overhead associated with a split-kernel implementation.
Many studies have also introduced the mechanism of cooperative execution on heterogeneous systems and have addressed load balancing with different purposes. For instance, Tzeng et al. demonstrated several task management, task donation, and task stealing methodologies for irregular workloads. Pajot et al. presented a hybrid bidirectional path tracing implementation with optimization techniques such as double buffering, batch processing, and asynchronous execution to balance tasks between the CPU and GPU. The OptiX ray tracer also implemented a load balancing mechanism for GPUs by assigning local queues to individual GPUs. With a similar concept, our Runtime system implements the heterogeneous queue mechanism based on the thread pool model. However, the workloads are tagged as either Coherent or Divergent based on their characteristics and will be assigned to the most suitable core for execution. We further improved the load balancing mechanism by dynamically adjusting the tile size and the pull operation.
3 Algorithm design and implementation
We first introduce the high-level concept of the feedback-guided mechanism and then elaborate on the details of the algorithm. We also describe how the Attribute Table is constructed and how to reduce data transfer overhead. Finally, we demonstrate how to achieve load balancing using a heterogeneous queue.
3.1 Overview of the feedback-guided mechanism
To identify the coherency, we define an Attribute for each ray. The Attribute of a ray comprises Ray data, i.e., a plenoptic function that represents the origin and direction, and Traversal path information that is bound to a group identity. In addition, each ray is parameterized by the Primitive it originates from (the camera in the case of primary rays). This information is recorded during execution and stored in a structure called the Attribute Table, as illustrated in Fig. 1, comprising three arrays: Ray, Traversal and Primitive, which store RayInfo, TraversalInfo and PrimitiveID, respectively. This design aims to capture the correlation between the ray attribute and the traversal path.
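As a rough sketch of this layout, the following Python fragment mirrors the three-array organization described above; the concrete field choices (3-tuples for origin and direction, an integer path hash as the group identity, a dict for the Primitive layer) are our assumptions, since the text only names the arrays and their record types.

```python
from dataclasses import dataclass, field

# Sketch of the Attribute Table's three-array layout (Fig. 1).
# Field choices are illustrative assumptions, not the paper's exact types.

@dataclass
class RayInfo:
    origin: tuple       # (x, y, z) of the ray origin
    direction: tuple    # normalized (dx, dy, dz)
    traversal_idx: int  # index into the Traversal array

@dataclass
class TraversalInfo:
    path_hash: int      # hash of the traversed BVH path,
                        # later reused as the group identity

@dataclass
class AttributeTable:
    # primitive_id -> indices of RayInfo records originating from it;
    # the camera would get a reserved slot for primary rays
    primitive: dict = field(default_factory=dict)
    ray: list = field(default_factory=list)        # RayInfo records
    traversal: list = field(default_factory=list)  # TraversalInfo records

    def record(self, prim_id, ray_info, path_hash):
        """Store one (ray, path) correlation after a traversal."""
        self.traversal.append(TraversalInfo(path_hash))
        ray_info.traversal_idx = len(self.traversal) - 1
        self.ray.append(ray_info)
        self.primitive.setdefault(prim_id, []).append(len(self.ray) - 1)
```

The Primitive layer acts as the entry point: rays are looked up by the primitive they originate from, so only the Traversal records reachable from that primitive need to be touched.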
3.2 Constructing the attribute table
3.3 Predictive regrouping with attribute analysis
The flowchart of the Ray-Tracing algorithm is illustrated in Fig. 5. The phases in Ray-Tracing are composed of three parts: Ray Generation, which generates the primary rays; Ray Intersection, which applies the T&I test; and Rendering, which integrates the radiance and emits the corresponding rays based on the characteristic of the surface. All rays are collected and sent back to the Ray Intersection stage for further computation, as shown in the red rectangle. The attribute analysis introduces two additional stages to the flowchart, Attribute Update and Lookup, highlighted in purple.
When rays traverse the BVH tree and intersect with primitives, each individual path is converted to a unique TraversalInfo, and the RayInfo is stored in the corresponding array at the end of the test. Before the next iteration starts, the Runtime queries the Attribute Table to find the closest matching attribute for each newly generated ray by comparing against a threshold. The threshold is a metric that is dynamically adjusted by the Runtime and bounds the allowed differences in distance and angle between the new ray and a recorded ray. Records that fall within the threshold are considered valid matches, and their TraversalInfo (hash value) is assigned to the new rays as the group identity. The proposed algorithm is called Attribute Lookup, as shown in Algorithm 1, and is effective for both primary and derived rays in the entire ray space. It is computed by a GPU kernel due to its parallel characteristic, with each ray handled by a work item. After the Lookup process, the Runtime regroups the workload based on the group identity to generate coherent permutations. Note that the rays are stored in a doubly indexed array, and only the indexes are swapped during regrouping.
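A hedged, sequential sketch of the Lookup and regrouping steps is given below; the linear scan, the fixed thresholds, and the tuple representations are simplifications of our own, whereas the paper performs the lookup in a GPU kernel with one work item per ray.

```python
import math

# Sketch of Attribute Lookup (Algorithm 1): for each new ray, find a
# recorded ray whose origin distance and direction angle both fall
# below the (runtime-adjusted) thresholds, and inherit its path hash
# as the group identity.

def lookup(new_ray, records, dist_thresh, angle_thresh):
    """new_ray: (origin, direction); records: (origin, direction, path_hash)
    tuples. Returns the path hash of the closest valid match, or None."""
    best, best_d = None, float("inf")
    for origin, direction, path_hash in records:
        d = math.dist(new_ray[0], origin)
        # angle between the unit direction vectors
        cos_a = max(-1.0, min(1.0, sum(a * b for a, b in zip(new_ray[1], direction))))
        angle = math.acos(cos_a)
        if d < dist_thresh and angle < angle_thresh and d < best_d:
            best, best_d = path_hash, d
    return best

def regroup(rays, group_ids):
    """Return a permutation of indices placing rays with the same group
    identity next to each other; only indexes are swapped, mirroring the
    doubly indexed array described in the text (unmatched rays go last)."""
    return sorted(range(len(rays)),
                  key=lambda i: (group_ids[i] is None, group_ids[i] or 0))
```

In the real system the thresholds are adjusted dynamically by the Runtime; here they are plain parameters for clarity.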
3.4 Workload distribution with heterogeneous queues
GPU devices favor regular algorithms due to their parallel processing model. For instance, Ray Generation is a suitable candidate to be processed by a GPU since the algorithm contains no branches. In contrast, the T&I test may degrade the performance of a GPU due to its irregularity. With the design of the Attribute Table, we are able to assign workloads to different heterogeneous cores based on their characteristics. For the T&I test, each ray will be given a group ID by applying Algorithm 1. After the group assignment, we can forward these groups to individual heterogeneous cores based on their properties. This process is called Tagging. A group with a size larger than a wavefront will be tagged as Coherent and will favor GPU execution, whereas groups with few rays will be merged together, and the merged group will be marked as Divergent and prioritized for the CPU. This minimizes the probability of causing branch divergence in a GPU.
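The Tagging rule can be sketched as follows, assuming a wavefront width of 64 (typical for AMD GPUs); the policy of merging all small groups into a single Divergent group is our simplification.

```python
# Sketch of the Tagging step: groups at least one wavefront wide are
# tagged Coherent (GPU-bound); the remaining small groups are merged
# into one Divergent group (CPU-bound). WAVEFRONT = 64 is an assumption
# matching typical AMD hardware.

WAVEFRONT = 64

def tag_groups(groups):
    """groups: dict group_id -> list of ray indices.
    Returns (coherent, divergent), where coherent is a list of
    (group_id, rays) tuples and divergent is one merged ray list."""
    coherent, divergent = [], []
    for gid, rays in groups.items():
        if len(rays) >= WAVEFRONT:
            coherent.append((gid, rays))   # wide enough for SIMD lanes
        else:
            divergent.extend(rays)         # merge the small groups
    return coherent, divergent
```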
With the groups being tagged, the Runtime system must decide how to distribute them to all cores on the platform to ensure load balancing. To tackle this issue, we propose the mechanism of the heterogeneous queue. The heterogeneous queue is a software-based component that exercises the producer–consumer design pattern. During initialization, one queue is created for each core on the platform. An element stored in the queue is called an entry and wraps the code for execution and the required data by using a task function pointer and a data pointer. A task function pointer points to an instance of a phase in Ray-Tracing, such as Ray Generation, Intersection or Rendering. To achieve heterogeneity, the function is implemented as an OpenCL kernel, which can be dispatched to different devices in a heterogeneous system. The data pointer refers to a partitioned tile of the workload data. The heterogeneous queue adopts the Thread Pool design pattern: Worker threads are created during initialization and wait for incoming entries at the front of each queue. The workload is partitioned into tiles based on the group size before insertion into the queues. Figure 7 illustrates the function of Tagging and the heterogeneous queues.
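The queue-per-core thread pool described above might be sketched as follows; here an ordinary Python callable stands in for the OpenCL kernel dispatch, which is an assumption for illustration only.

```python
import queue
import threading

# Minimal sketch of the heterogeneous queue: one FIFO per core, one
# worker thread per queue, and entries that pair a task function
# pointer with a tile of workload data. A real entry would dispatch an
# OpenCL kernel to the CPU or GPU; a Python callable is our stand-in.

class HeterogeneousQueue:
    def __init__(self, num_cores):
        self.queues = [queue.Queue() for _ in range(num_cores)]
        self.workers = [
            threading.Thread(target=self._worker, args=(q,), daemon=True)
            for q in self.queues
        ]
        for w in self.workers:
            w.start()

    def _worker(self, q):
        while True:
            entry = q.get()        # block until an entry arrives
            if entry is None:      # sentinel: shut this worker down
                q.task_done()
                return
            task, tile = entry     # task function pointer + data tile
            task(tile)
            q.task_done()

    def submit(self, core, task, tile):
        """Producer side: enqueue one entry on the chosen core's queue."""
        self.queues[core].put((task, tile))

    def join(self):
        """Drain all queues and stop the workers."""
        for q in self.queues:
            q.put(None)
        for w in self.workers:
            w.join()
```

A Coherent group would be submitted to a GPU-backed queue and a Divergent group to a CPU-backed one; the tile size per entry is what the load balancer adjusts dynamically.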
4 Experimental results and analysis
The experiments were carried out on an AMD Kaveri A10-7850K with 16 GB of memory. The OpenCL Runtime is the AMD APP SDK v3.0. The screen size is fixed at 512×512. The test scenes are shown in Fig. 8. A BVH tree is used as the accelerator. We use the Radeon-Rays framework to drive multiple backends. For instance, Embree v2.11 is utilized to demonstrate the performance improvement. Embree is a state-of-the-art Ray-Tracing library optimized for multicore CPUs with SIMD capability.
Since Ray-Tracing is iterative, to better explain the algorithm, we define two terms as iteration counters: Bounce and Sample Count. Bounce is a number that indicates how many times a ray is traced inside a scene. It is visualized as the loop (from Attribute Lookup to New Ray and back to Attribute Lookup) inside the red rectangle in Fig. 5. Based on the properties of surfaces, a complex scene with specular materials may require more Bounces to generate high-quality and accurate images. The number of Bounces is set to six. Sample Count is a number measured per pixel on the screen that indicates how many rays have been emitted. Each ray for an individual pixel must complete the full algorithm path in Fig. 5 to be counted as one sample. A larger Sample Count leads to a better image quality.
The proposed algorithm relies on a caching effect to function since the Attribute Table needs to collect path information during execution. Initially, the Attribute Table is empty. However, we show that the required information can quickly be populated within a few Sample Count iterations. To demonstrate the effect, all scenes were sampled three times before the following experiments were conducted. Theoretically, as the value of Sample Count increases, the hit rate accuracy can become higher since more information will be recorded as a reference.
4.1 Branch divergence and the impact of grouping
We illustrate the impact of control flow divergence in a GPU and the effectiveness of grouping by comparing the performance of the T&I test. Figure 9 illustrates the performance enhancement for the GPU measured in throughput (Rays per second) and the GPU utilization of the Room scene. Because the Room scene is an enclosed scene, the number of rays for each bounce iteration will be maintained at roughly the same level since it is not heavily affected by the issue of early termination. In contrast, in open scenes such as Bunny, the number of active rays will decrease very quickly at the end of each Bounce iteration.
4.2 Grouping effectiveness and hit rate accuracy
The grouping mechanism is predictive. One might be curious about the effectiveness of grouping and how accurate the mechanism is. Figures 10 and 11 illustrate the number of grouped rays and the grouping accuracy in terms of the hit rate. The hit rate indicates how many grouped rays actually follow the predicted path. In Fig. 10, because Room is an enclosed scene, the total number of active rays for each Bounce iteration does not change much. Comparatively, in Fig. 11, because Bunny is an open scene, many rays terminate early and become invalid. Thus, the total number of rays decreases rather quickly. However, regardless of the characteristics of the scene, the mechanism of the Attribute Table can still group over half of the active rays and achieve very high accuracy. On average, it achieves 81.3% hit accuracy in the Room scene and 86.45% in Bunny. Notice that the first bounce (case 0 in both scenes) possesses a very high hit rate and a large number of grouped rays since it consists of primary rays, which are relatively regular compared to derived rays.
4.3 Impact of memory transfer overhead
We also measure the data transfer overhead of accessing the Attribute Table. Figure 14 illustrates the results. With the design of multiple layers, we do not need to load all the arrays in the table during Attribute Lookup. Only the referenced data are required. For example, the first layer of the Attribute Table is the Primitive array. During the Attribute Lookup process, if multiple rays being observed have intersected with the same primitive before, then only one Traversal array will be loaded for those rays. The same principle applies to deeper layers. This reduces the overhead of data access since it is common in many scenes to have a large primitive that occupies a space leading to multiple ray intersections, such as the wall and ground floor in the Room scene. This is particularly the case for primary rays (case 0) since they all originate from the camera; a special slot in the Primitive array is reserved for them. In addition, since the table needs to be updated only if the rays are mispredicted but have made an intersection with primitives, the data access overhead for Attribute Update is insignificant when the hit rate accuracy is high. The total size of the Attribute Table is 468 MB when stabilized, but only 56 MB is loaded in this round.
4.4 Overall throughput comparison
We compare the performance gain with different configurations as illustrated in Fig. 15. The configurations are as follows. Embree represents the result on the Embree framework powered by a CPU with SIMD extension supported. OpenCL CPU denotes the results on a CPU but with the support of OpenCL 1.2. Notice that Embree is highly optimized for multicore CPUs with advanced SIMD instructions specialized for Intel hardware. However, Kaveri has only four physical cores; therefore, the throughput of using only the CPU will sometimes be lower than the GPU counterparts. OpenCL GPU and HSA represent the throughput on a single GPU device powered by the OpenCL 1.2 framework and HSA, respectively. Attribute Table represents the feedback-guided mechanism that utilizes the HSA framework on a single GPU. Additionally, Attribute Table HQ includes the support of a heterogeneous queue that runs on multiple CPU and GPU devices on Kaveri.
The throughput enhancement of the Attribute Table on a single GPU device is caused by the effective grouping mechanism, as indicated in the previous experiments. The heterogeneous queue boosts the performance results further since the workload can be distributed to all the cores on the platform based on the characteristics of tasks. In complex scenes such as Room and Sponza, the performance becomes 2 to 2.5 times greater than the original results on a single CPU or GPU.
4.5 Execution time measurement of each phase
We measure the execution time of all phases in Ray-Tracing to demonstrate the efficiency of the proposed method. Radeon-Rays is also utilized to attach distinct backends for comparison. The configuration is set as follows: Embree represents the result on the Embree framework, which is processed on a CPU with SIMD extension. OpenCL indicates the performance of OpenCL 1.2 powered by a GPU that exercises explicit data copying. Finally, Attribute Table HQ indicates the feedback-guided methodology with the heterogeneous queue on HSA. The overhead of the Attribute Table Lookup and Update is also measured. Figure 16 illustrates the performance evaluation.
Ray Generation on Embree performs slower than the other configurations since the phase is regular and can be executed more effectively by GPUs. In contrast, T&I may cause branch divergence and slows down processing on GPU devices. Depending on the surface properties, the Rendering phase includes control paths that take longer to execute and could generate extra rays. As a result, the performance of the GPU under OpenCL can be worse than that of the CPU under Embree. Even worse, the overhead of data copying and marshaling for GPU devices drags the performance down, which makes the overall execution time longer than that of the CPU. The impact of data transfer overhead is nearly equal to the processing time of Rendering in scenes with many primitives, such as Room and Sponza. Consequently, the efficiency of the OpenCL implementation cannot compete with the original Embree framework in some circumstances. It takes longer to execute in scenes that contain objects with various surface properties (Caustic) or in cases that involve huge tree data with divergent paths (Room, Sponza and Miguel). In summary, for GPU computation, T&I is closely correlated with the size of the BVH tree and the level of branch divergence, whereas Rendering is affected by the complexity of the surfaces.
Comparatively, the results of the Attribute Table outperform the other cases since it reduces irregularities and can improve the locality of the data objects, shortening the processing times of both the T&I and Rendering phases. Comparing the performance of the OpenCL implementation and the Attribute Table, the improvement is significant in scenes that are relatively complex and contain many primitives such as Room, Sponza and Miguel. It yields 2 to 3 times the performance boost compared to the original version. Additionally, all the execution time results are shorter than those of Embree.
5 Conclusion
In this paper, we have proposed a feedback-guided mechanism called Attribute-Based Ray Regrouping that can effectively reveal the hidden coherency of rays, leading to performance enhancements on a heterogeneous system. The mechanism identifies regular patterns in incoherent workloads to form groups that improve the efficiency of GPU computation. Furthermore, the heterogeneous queue and the Tagging method are proposed to resolve the issue of load balancing. After grouping, the workloads are tagged with labels that describe their characteristics and are processed by the most suitable heterogeneous core based on the properties of their execution patterns. According to the experiments, the overall performance becomes 2 to 2.5 times better than the original GPU and CPU versions on Kaveri for both enclosed and open scenes. The amount of data transfer is also reduced by 80%.
References
- 1. Adelson, E.H., Bergen, J.R.: The plenoptic function and the elements of early vision. Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology (1991)
- 2. Áfra, A.T., Benthin, C., Wald, I., Munkberg, J.: Local shading coherence extraction for SIMD-efficient path tracing on CPUs. In: Proceedings of High Performance Graphics, pp. 119–128. Eurographics Association (2016)
- 3. Aila, T., Karras, T.: Architecture considerations for tracing incoherent rays. In: Proceedings of the Conference on High Performance Graphics, pp. 113–122. Eurographics Association (2010)
- 4. Aila, T., Laine, S.: Understanding the efficiency of ray traversal on GPUs. In: Proceedings of the Conference on High Performance Graphics 2009, pp. 145–149. ACM (2009)
- 5. AMD and GPUOpen: Radeon-Rays. http://gpuopen.com/gaming-product/radeon-rays/
- 8. Boulos, S., Wald, I., Benthin, C.: Adaptive ray packet reordering. In: IEEE Symposium on Interactive Ray Tracing (RT 2008), pp. 131–138 (2008)
- 9. Bouvier, D., Sander, B.: Applying AMD's Kaveri APU for heterogeneous computing. In: Hot Chips: A Symposium on High Performance Chips (HC26) (2014)
- 10. Dammertz, H., Hanika, J., Keller, A.: Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. In: Computer Graphics Forum, vol. 27, pp. 1225–1233. Wiley Online Library, New York (2008)
- 11. Davidovič, T., Křivánek, J., Hašan, M., Slusallek, P.: Progressive light transport simulation on the GPU: survey and improvements. ACM Trans. Graph. 33(3), 29 (2014)
- 12. Eisenacher, C., Nichols, G., Selle, A., Burley, B.: Sorted deferred shading for production path tracing. In: Computer Graphics Forum, vol. 32, pp. 125–132. Wiley Online Library, New York (2013)
- 13. Garanzha, K., Loop, C.: Fast ray sorting and breadth-first packet traversal for GPU ray tracing. In: Computer Graphics Forum, vol. 29, pp. 289–298. Wiley Online Library, New York (2010)
- 14. Gribble, C.P., Ramani, K.: Coherent ray tracing via stream filtering. In: IEEE Symposium on Interactive Ray Tracing (RT 2008), pp. 59–66 (2008)
- 15. Gunther, J., Popov, S., Seidel, H.P., Slusallek, P.: Realtime ray tracing on GPU with BVH-based packet traversal. In: IEEE Symposium on Interactive Ray Tracing (RT'07), pp. 113–118 (2007)
- 16. Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming, 1st edn. Morgan Kaufmann Publishers, San Francisco (2013)
- 17. Kao, C.C., Hsu, W.C.: Runtime techniques for efficient ray-tracing on heterogeneous systems. In: 2015 IEEE International Conference on Digital Signal Processing (DSP), pp. 100–104 (2015)
- 18. Kao, C.C., Miao, Y.T., Hsu, W.C.: A pipeline-based runtime technique for improving ray-tracing on HSA-compliant systems. In: 2016 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2016)
- 19. Laine, S., Karras, T., Aila, T.: Megakernels considered harmful: wavefront path tracing on GPUs. In: Proceedings of the 5th High-Performance Graphics Conference, pp. 137–143. ACM (2013)
- 21. Munshi, A., Gaster, B., Mattson, T.G., Ginsburg, D.: OpenCL Programming Guide. Pearson Education, New Jersey (2011)
- 22. Novák, J., Havran, V., Dachsbacher, C.: Path regeneration for interactive path tracing. In: Proceedings of EUROGRAPHICS Short Papers (2010)
- 23. Overbeck, R., Ramamoorthi, R., Mark, W.R.: Large ray packets for real-time Whitted ray tracing. In: IEEE Symposium on Interactive Ray Tracing (RT 2008), pp. 41–48 (2008)
- 24. Pajot, A., Barthe, L., Paulin, M., Poulin, P.: Combinatorial bidirectional path-tracing for efficient hybrid CPU/GPU rendering. In: Computer Graphics Forum, vol. 30, pp. 315–324. Wiley Online Library, Hoboken (2011)
- 26. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann, Burlington (2004)
- 27. Pharr, M., Kolb, C., Gershbein, R., Hanrahan, P.: Rendering complex scenes with memory-coherent ray tracing. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 101–108. ACM Press/Addison-Wesley (1997)
- 29. Rogers, P.: Heterogeneous system architecture overview. In: 2013 IEEE Hot Chips 25 Symposium (HCS), pp. 1–41 (2013). doi:10.1109/HOTCHIPS.2013.7478286
- 30. Sung, K., Craighead, J., Wang, C., Bakshi, S., Pearce, A., Woo, A.: Design and implementation of the Maya renderer. In: Sixth Pacific Conference on Computer Graphics and Applications (Pacific Graphics '98), pp. 150–159 (1998)
- 31. Tong, W., Deng, Y.: Mining effective parallelism from hidden coherence for GPU based path tracing. In: ACM SIGGRAPH Asia 2013 Technical Briefs, p. 31 (2013)
- 32. Tsakok, J.A.: Faster incoherent rays: multi-BVH ray stream tracing. In: Proceedings of the Conference on High Performance Graphics 2009, pp. 151–158. ACM (2009)
- 33. Tzeng, S., Patney, A., Owens, J.D.: Task management for irregular-parallel workloads on the GPU. In: Proceedings of the Conference on High Performance Graphics, pp. 29–37. Eurographics Association (2010)
- 34. Van Antwerpen, D.: Improving SIMD efficiency for parallel Monte Carlo light transport on the GPU. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 41–50. ACM (2011)
- 35. Wald, I.: Active thread compaction for GPU path tracing. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 51–58 (2011)
- 36. Wald, I., Benthin, C., Boulos, S.: Getting rid of packets: efficient SIMD single-ray traversal using multi-branching BVHs. In: IEEE Symposium on Interactive Ray Tracing (RT 2008), pp. 49–57 (2008)
- 37. Wald, I., Slusallek, P., Benthin, C., Wagner, M.: Interactive rendering with coherent ray tracing. In: Computer Graphics Forum, vol. 20, pp. 153–165. Wiley Online Library, New York (2001)