When we are at our best as programmers, we recognize patterns in our work and apply time-tested techniques to solve the problem at hand. Parallel programming is no different, and it would be a serious mistake not to study the patterns that have proven useful in this space. Consider the MapReduce frameworks adopted for Big Data applications; their success stems largely from being built on two simple yet effective parallel patterns: map and reduce.

There are a number of common patterns in parallel programming that crop up time and again, independent of the programming language that we’re using. These patterns are versatile and can be employed at any level of parallelism (e.g., sub-groups, work-groups, full devices) and on any device (e.g., CPUs, GPUs, FPGAs). However, certain properties of the patterns (such as their scalability) may affect their suitability for different devices. In some cases, adapting an application to a new device may simply require choosing appropriate parameters or fine-tuning an implementation of a pattern; in others, we may be able to improve performance by selecting a different pattern entirely.

Developing an understanding of how, when, and where to use these common parallel patterns is a key part of improving our proficiency in SYCL (and parallel programming in general). For those with existing parallel programming experience, seeing how these patterns are expressed in SYCL can be a quick way to spin up and gain familiarity with the capabilities of the language.

This chapter aims to provide answers to the following questions:

  • What are some common patterns that we should understand?

  • How do the patterns relate to the capabilities of different devices?

  • Which patterns are already provided as SYCL functions and libraries?

  • How would the patterns be implemented using direct programming?

Understanding the Patterns

The patterns discussed here are a subset of the parallel patterns described in the book Structured Parallel Programming by McCool et al. We do not cover the patterns related to types of parallelism (e.g., fork-join, branch-and-bound) but focus on some of the algorithmic patterns most useful for writing data-parallel kernels.

We wholeheartedly believe that understanding this subset of parallel patterns is critical to becoming an effective SYCL programmer. The table in Figure 14-1 presents a high-level overview of the different patterns, including their primary use cases, their key attributes, and how their attributes impact their affinity for different hardware devices.

Figure 14-1
A table presents parallel patterns and their affinity for different device types. It consists of 4 columns and 4 rows. The column headers are pattern, useful for, key attributes, and device affinity.

Parallel patterns and their affinity for different device types

Map

The map pattern is the simplest parallel pattern of all and will be immediately familiar to readers with experience in functional programming languages. As shown in Figure 14-2, each input element of a range is independently mapped to an output by applying some function. Many data-parallel operations can be expressed as instances of the map pattern (e.g., vector addition).

Figure 14-2
An illustration presents a map pattern. Each input element is mapped to an output element through an intermediate element.

Map pattern

Since every application of the function is completely independent, expressions of map are often very simple, relying on the compiler and/or runtime to do most of the hard work. We should expect kernels written to the map pattern to be suitable for any device and for the performance of those kernels to scale very well with the amount of available hardware parallelism.

However, we should think carefully before deciding to rewrite entire applications as a series of map kernels! Such a development approach is highly productive and guarantees that an application will be portable to a wide variety of device types but encourages us to ignore optimizations that may significantly improve performance (e.g., improving data reuse, fusing kernels).

Stencil

The stencil pattern is closely related to the map pattern. As shown in Figure 14-3, a function is applied to an input and a set of neighboring inputs described by a stencil to produce a single output. Stencil patterns appear frequently in many domains, including scientific/engineering applications (e.g., finite difference codes) and computer vision/machine learning applications (e.g., image convolutions).

Figure 14-3
The formation of a stencil pattern is depicted. Among other square blocks, 5 blocks in a column and 5 blocks in a row form a plus-shaped pattern: 2 blocks above and below and 2 blocks to the left and right point to a central block.

Stencil pattern

When the stencil pattern is executed out-of-place (i.e., writing the outputs to a separate storage location), the function can be applied to every input independently. Scheduling stencils in the real world is often more complicated than this: computing neighboring outputs requires the same data, and loading that data from memory multiple times will degrade performance; we may also wish to apply the stencil in-place (i.e., overwriting the original input values) to decrease an application's memory footprint.

The suitability of a stencil kernel for different devices is therefore highly dependent on properties of the stencil and the input problem. Generally speaking,

  • Small stencils can benefit from the scratchpad storage of GPUs.

  • Large stencils can benefit from the (comparatively) large caches of CPUs.

  • Small stencils operating on small inputs can achieve significant performance gains via implementation as systolic arrays on FPGAs.

Since stencils are easy to describe but complex to implement efficiently, many stencil applications make use of a domain-specific language (DSL). There are already several embedded DSLs leveraging the template meta-programming capabilities of C++ to generate high-performance stencil kernels at compile time.

Reduction

A reduction is a common parallel pattern which combines partial results using an operator that is typically associative and commutative (e.g., addition). The most ubiquitous examples of reductions are computing a sum (e.g., while computing a dot product) or computing the minimum/maximum value (e.g., using maximum velocity to set time-step size).

Figure 14-4 shows the reduction pattern implemented by way of a tree reduction, which is a popular implementation requiring log2(N) combination operations for a range of N input elements. Although tree reductions are common, other implementations are possible—in general, we should not assume that a reduction combines values in a specific order.

Figure 14-4
The formation of a reduction pattern is depicted. Multiple input elements lead to a single output element through intermediate elements.

Reduction pattern

Kernels are rarely embarrassingly parallel in real life, and even when they are, they are often paired with reductions (as in MapReduce frameworks) to summarize their results. This makes reductions one of the most important parallel patterns to understand and one that we must be able to execute efficiently on any device.

Tuning a reduction for different devices is a delicate balancing act between the time spent computing partial results and the time spent combining them; using too little parallelism increases computation time, whereas using too much parallelism increases combination time.

It may be tempting to improve overall system utilization by using different devices to perform the computation and combination steps, but such tuning efforts must pay careful attention to the cost of moving data between devices. In practice, we find that performing reductions directly on data as it is produced and on the same device is often the best approach. Using multiple devices to improve the performance of reduction patterns therefore relies not on task parallelism but on another level of data parallelism (i.e., each device performs a reduction on part of the input data).

Scan

The scan pattern computes a generalized prefix sum using a binary associative operator, and each element of the output represents a partial result. A scan is said to be inclusive if the partial sum for element i is the sum of all elements in the range [0, i] (i.e., the sum including i). A scan is said to be exclusive if the partial sum for element i is the sum of all elements in the range [0, i) (i.e., the sum excluding i).
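For example, given the input sequence [1, 2, 3, 4] and addition as the operator, an inclusive scan produces [1, 3, 6, 10], whereas an exclusive scan produces [0, 1, 3, 6] (with the identity of the operator, 0, as the first element).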

At first glance, a scan appears to be an inherently serial operation: the value of each output depends on the value of the previous output! While it is true that scan has fewer opportunities for parallelism than other patterns (and may therefore be less scalable), Figure 14-5 shows that it is possible to implement a parallel scan using multiple sweeps over the same data.

Figure 14-5
The formation of a scan pattern is depicted. In this, there are multiple sweeps over the same data.

Scan pattern

Because the opportunities for parallelism within a scan operation are limited, the best device on which to execute a scan is highly dependent on problem size: smaller problems are a better fit for a CPU, since only larger problems will contain enough data parallelism to saturate a GPU. Problem size is less of a concern for FPGAs and other spatial architectures since scans naturally lend themselves to pipeline parallelism. As in the case of a reduction, it is usually a good idea to execute the scan operation on the same device that produced the data—considering where and how scan operations fit into an application during optimization will typically produce better results than focusing on optimizing the scan operations in isolation.

Pack and Unpack

The pack and unpack patterns are closely related to scans and are often implemented on top of scan functionality. We cover them separately here because they enable performant implementations of common operations (e.g., appending to a list) that may not have an obvious connection to prefix sums.

Pack

The pack pattern, shown in Figure 14-6, discards elements of an input range based on a Boolean condition, packing the elements that are not discarded into contiguous locations of the output range. This Boolean condition could be a precomputed mask or could be computed online by applying some function to each input element.

Figure 14-6
The formation of a pack pattern is depicted. A function is applied to each input element, which in turn gives output elements.

Pack pattern

As with scan, the pack operation has an inherently serial nature. Given an input element to pack/copy, computing its location in the output range requires information about how many prior elements were also packed/copied into the output. This information is equivalent to an exclusive scan over the Boolean condition driving the pack.

Unpack

As shown in Figure 14-7 (and as its name suggests), the unpack pattern is the opposite of the pack pattern. Contiguous elements of an input range are unpacked into noncontiguous elements of an output range, leaving other elements untouched. The most obvious use case for this pattern is to unpack data that was previously packed, but it can also be used to fill in “gaps” in data resulting from some previous computation.

Figure 14-7
An illustration of an unpack pattern. Output elements have gaps in between, due to some previous computation.

Unpack pattern

Using Built-In Functions and Libraries

Many of these patterns can be expressed directly using built-in functionality of SYCL or vendor-provided libraries written in SYCL. Leveraging these functions and libraries is the best way to balance performance, portability, and productivity in real large-scale software engineering projects.

The SYCL Reduction Library

Rather than require that each of us maintain our own library of portable and highly performant reduction kernels, SYCL provides a convenient abstraction for describing variables with reduction semantics. This abstraction simplifies the expression of reduction kernels and makes the fact that a reduction is being performed explicit, allowing implementations to select between different reduction algorithms for different combinations of device, data type, and reduction operation.

The kernel in Figure 14-8 shows an example of using the reduction library. Note that the kernel body doesn’t contain any reference to reductions—all we must specify is that the kernel contains a reduction which combines instances of the sum variable using the plus functor. This provides enough information for an implementation to automatically generate an optimized reduction sequence.

Figure 14-8
A code of three lines in a reduction library denotes reduction. It computes a sum.

Reduction expressed as a data-parallel kernel using the reduction library
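A minimal sketch in the same spirit (not the figure's exact code), assuming #include <sycl/sycl.hpp>, an in-scope queue q, and hypothetical USM pointers data (the N input values) and sum_ptr (the result, initialized to zero):

  q.submit([&](sycl::handler& h) {
    h.parallel_for(
        sycl::range<1>{N},
        // Declare that sum_ptr is a reduction variable combined with plus.
        sycl::reduction(sum_ptr, sycl::plus<>()),
        [=](sycl::id<1> i, auto& sum) {
          // The kernel body simply adds each work-item's contribution.
          sum += data[i];
        });
  });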

The result of a reduction is not guaranteed to be written back to the original variable until the kernel has completed. Apart from this restriction, accessing the result of a reduction behaves identically to accessing any other variable in SYCL: accessing a reduction result stored in a buffer requires the creation of an appropriate device or host accessor, and accessing a reduction result stored in a USM allocation may require explicit synchronization and/or memory movement.

One important way in which the SYCL reduction library differs from reduction abstractions found in other languages is that it restricts our access to the reduction variable during kernel execution—we cannot inspect the intermediate values of a reduction variable, and we are forbidden from updating the reduction variable using anything other than the specified combination function. These restrictions prevent us from making mistakes that would be hard to debug (e.g., adding to a reduction variable while trying to compute the maximum) and ensure that reductions can be implemented efficiently on a wide variety of different devices.

The reduction Class

The reduction class is the interface we use to describe the reductions present in a kernel. The only way to construct a reduction object is to use one of the functions shown in Figure 14-9. Note that there are three families of reduction function (for buffers, USM pointers, and spans), each with two overloads (with and without an identity value).

Figure 14-9
The function prototypes of the reduction function are represented, templated on typenames such as BufferT and BinaryOperation.

Function prototypes of the reduction function

If a reduction is initialized using a buffer or a USM pointer, the reduction is a scalar reduction, operating on the first object in an array. If a reduction is initialized using a span, the reduction is an array reduction. Each component of an array reduction is independent—we can think of an array reduction operating on an array of size N as equivalent to N scalar reductions with the same data type and operator.

The simplest overloads of the function allow us to specify the reduction variable and the operator used to combine the contributions from each work-item. The second set of overloads allows us to provide an optional identity value associated with the reduction operator; this is an optimization for user-defined reductions, which we will revisit later.
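For illustration, the different families might be used as follows inside a command group, where h is the handler; sum_buf, sum_ptr, hist_ptr, and the bin count B are hypothetical names, and the span's extent must be known at compile time:

  // Scalar reduction over the value stored in a buffer:
  auto r1 = sycl::reduction(sum_buf, h, sycl::plus<>());

  // Scalar reduction over the value pointed to by a USM allocation:
  auto r2 = sycl::reduction(sum_ptr, sycl::plus<>());

  // Array reduction over B independent values viewed through a span:
  auto r3 = sycl::reduction(sycl::span<int, B>(hist_ptr, B), sycl::plus<>());

  // The same USM reduction, with the operator's identity supplied explicitly:
  auto r4 = sycl::reduction(sum_ptr, 0, sycl::plus<>());

Each of these objects would then be passed to parallel_for alongside the kernel function.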

Note that the return type of the reduction function is unspecified, and the reduction class itself is completely implementation-defined. Although this may appear slightly unusual for a C++ class, it permits an implementation to use different classes (or a single class with any number of template arguments) to represent different reduction algorithms. Future versions of SYCL may decide to revisit this design in order to enable us to explicitly request specific reduction algorithms in specific execution contexts (most likely, via the property_list argument).

The reducer Class

An instance of the reducer class encapsulates a reduction variable, exposing a limited interface ensuring that we cannot update the reduction variable in any way that an implementation could consider to be unsafe. A simplified definition of the reducer class is shown in Figure 14-10. Like the reduction class, the precise definition of the reducer class is implementation-defined—a reducer’s type will depend on how the reduction is being performed, and it is important to know this at compile time in order to maximize performance. However, the functions and operators that allow us to update the reduction variable are well defined and are guaranteed to be supported by any SYCL implementation.

Figure 14-10
A simplified definition of the reducer class is represented. It includes the combination of partial result with the reducer's value.

Simplified definition of the reducer class

Specifically, every reducer provides a combine() function which combines the partial result (from a single work-item) with the value of the reduction variable. How this combine function behaves is implementation-defined but is not something that we need to worry about when writing a kernel. A reducer is also required to make other operators available depending on the reduction operator; for example, the += operator is defined for plus reductions. These additional operators are provided only as a programmer convenience and to improve readability; where they are available, these operators have identical behavior to calling combine() directly.

When working with array reductions, the reducer provides an additional subscript operator (i.e., operator[]), allowing access to individual elements of the array. Rather than returning a reference directly to an element of the array, this operator returns another reducer object, which exposes the same combine() function and shorthand operators as the reducers associated with a scalar reduction. Figure 14-11 shows a simple example of a kernel using an array reduction to compute a histogram, where the subscript operator is used to access only the histogram bin that is updated by the work-item.

Figure 14-11
A code of 6 lines. It computes a histogram using a reduction function.

An example kernel using an array reduction to compute a histogram
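A sketch in the same spirit (not the figure's exact code), assuming a compile-time bin count B, a USM pointer hist for the bins (initialized to zero), and a USM pointer data holding N non-negative integers (all hypothetical names):

  q.submit([&](sycl::handler& h) {
    h.parallel_for(
        sycl::range<1>{N},
        // An array reduction: B independent bins, each combined with plus.
        sycl::reduction(sycl::span<unsigned long, B>(hist, B), sycl::plus<>()),
        [=](sycl::id<1> i, auto& histogram) {
          // The subscript operator returns a reducer for a single bin.
          histogram[data[i] % B] += 1;
        });
  });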

User-Defined Reductions

Several common reduction algorithms (e.g., a tree reduction) do not see each work-item directly update a single shared variable, but instead accumulate some partial result in a private variable that will be combined at some point in the future. Such private variables introduce a problem: how should the implementation initialize them? Initializing variables to the first contribution from each work-item has potential performance ramifications, since additional logic is required to detect and handle uninitialized variables. Initializing variables to the identity of the reduction operator instead avoids the performance penalty but is only possible when the identity is known.

SYCL implementations can only automatically determine the correct identity value to use when a reduction is operating on simple arithmetic types and the reduction operator is one of several standard function objects (e.g., plus). For user-defined reductions (i.e., those operating on user-defined types and/or using user-defined function objects), we may be able to improve performance by specifying the identity value directly.

Support for user-defined reductions is limited to trivially copyable types and combination functions with no side effects, but this is enough to enable many real-life use cases. For example, the code in Figure 14-12 demonstrates the usage of a user-defined reduction to compute both the minimum element in a vector and its location.

Figure 14-12
A code of several lines demonstrates the usage of a user-defined reduction. It contains a template with typenames T and I.

Using a user-defined reduction to find the location of the minimum value
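A sketch of how such a reduction might be written (not the figure's exact code), using a hypothetical trivially copyable pair type, a USM pointer result_ptr for the output, and a USM array data of N floats:

  // A simple value/index pair; sycl::minimum uses operator< to pick the
  // smaller of two pairs.
  struct minloc_pair {
    float val;
    int idx;
    bool operator<(const minloc_pair& o) const {
      return (val < o.val) || (val == o.val && idx < o.idx);
    }
  };

  minloc_pair identity = {std::numeric_limits<float>::max(),
                          std::numeric_limits<int>::max()};

  q.submit([&](sycl::handler& h) {
    h.parallel_for(
        sycl::range<1>{N},
        // The identity cannot be deduced for a user-defined type, so we
        // supply it explicitly as an optimization hint.
        sycl::reduction(result_ptr, identity, sycl::minimum<minloc_pair>()),
        [=](sycl::id<1> i, auto& result) {
          result.combine({data[i], static_cast<int>(i[0])});
        });
  });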

Group Algorithms

Support for parallel patterns in SYCL device code is provided by a separate library of group algorithms. These functions exploit the parallelism of a specific group of work-items (i.e., a work-group or a sub-group) to implement common parallel algorithms at limited scope and can be used as building blocks to construct other more complex algorithms.

The syntax of the group algorithms in SYCL is based on that of the algorithm library in C++, and any restrictions from the C++ algorithms apply. However, there is a critical difference: whereas the STL’s algorithms are called from sequential (host) code and indicate an opportunity for a library to employ parallelism, SYCL’s group algorithms are designed to be called within (device) code that is already executing in parallel. To ensure that this difference cannot be overlooked, the group algorithms have slightly different syntax and semantics to their C++ counterparts.

SYCL distinguishes between two different kinds of parallel algorithm. If an algorithm is performed collaboratively by all work-items in a group but otherwise behaves identically to an algorithm from the STL, the algorithm is named with a “joint” prefix (because the members of the group “join” together to perform the algorithm). Such algorithms read their inputs from memory and write their results to memory and can only operate on data in memory locations visible to all work-items in a given group. If an algorithm instead operates over an implicit range reflecting the group itself, with inputs and outputs stored in work-item private memory, the algorithm name is modified to include the word “group” (because the algorithm is performed directly on data owned by the group).

The code examples in Figure 14-13 demonstrate these two different kinds of algorithm, comparing the behavior of std::reduce to the behaviors of sycl::joint_reduce and sycl::reduce_over_group.

Figure 14-13
A code of several lines demonstrates a comparison. It shows a work-group reducing jointly over a given input range and a work-group reducing over data held privately by its work-items.

A comparison of std::reduce, sycl::joint_reduce, and sycl::reduce_over_group
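As an illustrative sketch (not the figure's exact code), assume input is a USM allocation of N floats accessible on both host and device, and that N is a multiple of the work-group size B:

  // std::reduce: called from host code; any parallelism is internal to the
  // library (requires <numeric>).
  float serial_sum = std::reduce(input, input + N);

  // sycl::joint_reduce: every work-item in the group passes the same range,
  // and the group cooperates to reduce over the data in memory.
  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N}, {B}}, [=](sycl::nd_item<1> it) {
      float group_sum = sycl::joint_reduce(it.get_group(), input, input + N,
                                           sycl::plus<>());
      // group_sum holds the same value in every work-item of the group.
    });
  });

  // sycl::reduce_over_group: each work-item contributes one private value.
  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N}, {B}}, [=](sycl::nd_item<1> it) {
      float mine = input[it.get_global_id(0)];
      float group_sum = sycl::reduce_over_group(it.get_group(), mine,
                                                sycl::plus<>());
    });
  });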

Note that in both cases, the first argument to each group algorithm accepts a group or sub_group object in place of an execution policy, to describe the set of work-items that should be used to perform the algorithm. Since algorithms are performed collaboratively by all the work-items in the specified group, they must also be treated similarly to a group barrier—all work-items in the group must encounter the same algorithm in converged control flow (i.e., all work-items in the group must similarly encounter or not encounter the algorithm call), and the arguments provided by all work-items must be such that all work-items agree on the operation being performed. For example, sycl::joint_reduce requires all arguments to be the same for all work-items, to ensure that all work-items in the group operate on the same data and use the same operator to accumulate results.

The table in Figure 14-14 shows how the parallel algorithms available in the STL relate to the group algorithms, and whether there are any restrictions on the type of group that can be used. Note that in some cases, a group algorithm can only be used with sub-groups; these cases correspond to the “shuffle” operations introduced in earlier chapters.

Figure 14-14
A table presents mapping between C++ algorithms and the S Y C L group. The table has 4 columns and 10 rows. The column headers are C++ algorithm, S Y C L joint algorithm, S Y C L group algorithm, and group types.

Mapping between C++ algorithms and SYCL group algorithms

At the time of writing, the group algorithms are limited to supporting only primitive data types and a set of built-in operators recognized by SYCL (i.e., plus, multiplies, bit_and, bit_or, bit_xor, logical_and, logical_or, minimum, and maximum). This is enough to cover most common use cases, but future versions of SYCL are expected to extend collective support to user-defined types and operators.

Direct Programming

Although we recommend leveraging libraries wherever possible, we can learn a lot by looking at how each pattern could be implemented using “native” SYCL kernels.

The kernels in the remainder of this chapter should not be expected to reach the same level of performance as highly tuned libraries but are useful in developing a greater understanding of the capabilities of SYCL—and may even serve as a starting point for prototyping new library functionality.

Use Vendor-Provided Libraries!

When a vendor provides a library implementation of a function, it is almost always beneficial to use it rather than reimplementing the function as a kernel!

Map

Owing to its simplicity, the map pattern can be implemented directly as a basic parallel kernel. The code in Figure 14-15 shows such an implementation, using the map pattern to compute the square root of each input element in a range.

Figure 14-15
A code of three lines. It computes the square root of each input value.

Implementing the map pattern in a data-parallel kernel
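A minimal sketch of such a kernel might look like the following, where in and out are hypothetical USM pointers to N floats and q is an in-scope queue:

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      // Each work-item independently applies the function to one element.
      out[i] = sycl::sqrt(in[i]);
    });
  });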

Stencil

Implementing a stencil directly as a multidimensional basic data-parallel kernel with multidimensional buffers, as shown in Figure 14-16, is straightforward and easy to understand.

Figure 14-16
A code of several lines. It computes the average of each cell and its immediate neighbors.

Implementing the stencil pattern in a data-parallel kernel
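A sketch in the same spirit, assuming buffers in_buf and out_buf holding an N×N grid of floats and a five-point averaging stencil (hypothetical names):

  q.submit([&](sycl::handler& h) {
    sycl::accessor in{in_buf, h, sycl::read_only};
    sycl::accessor out{out_buf, h, sycl::write_only};
    // One work-item per interior point; boundary cells are left untouched.
    h.parallel_for(sycl::range<2>{N - 2, N - 2}, [=](sycl::id<2> idx) {
      int i = idx[0] + 1;
      int j = idx[1] + 1;
      out[i][j] = (in[i][j] + in[i - 1][j] + in[i + 1][j] +
                   in[i][j - 1] + in[i][j + 1]) / 5.0f;
    });
  });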

However, this expression of the stencil pattern is very naïve and should not be expected to perform very well. As mentioned earlier in the chapter, it is well known that leveraging locality (via spatial or temporal blocking) is required to avoid repeated reads of the same data from memory. A simple example of spatial blocking, using work-group local memory, is shown in Figure 14-17.

Figure 14-17
A code of several lines. It computes the average of each cell and its immediate neighbors.

Implementing the stencil pattern in an ND-range kernel, using work-group local memory
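One possible form of such a kernel (a sketch, not the figure's exact code), assuming the same buffers as before and a work-group size of B×B that evenly divides the (N−2)×(N−2) interior:

  q.submit([&](sycl::handler& h) {
    sycl::accessor in{in_buf, h, sycl::read_only};
    sycl::accessor out{out_buf, h, sycl::write_only};
    // A (B+2) x (B+2) tile in work-group local memory, including the halo.
    sycl::local_accessor<float, 2> tile{sycl::range<2>{B + 2, B + 2}, h};

    h.parallel_for(sycl::nd_range<2>{{N - 2, N - 2}, {B, B}},
                   [=](sycl::nd_item<2> it) {
      int gi = it.get_global_id(0) + 1;  // global row of this output point
      int gj = it.get_global_id(1) + 1;  // global column of this output point
      int li = it.get_local_id(0) + 1;   // position within the tile's interior
      int lj = it.get_local_id(1) + 1;

      // Each work-item loads its own cell; edge work-items also load the halo.
      tile[li][lj] = in[gi][gj];
      if (it.get_local_id(0) == 0)     tile[li - 1][lj] = in[gi - 1][gj];
      if (it.get_local_id(0) == B - 1) tile[li + 1][lj] = in[gi + 1][gj];
      if (it.get_local_id(1) == 0)     tile[li][lj - 1] = in[gi][gj - 1];
      if (it.get_local_id(1) == B - 1) tile[li][lj + 1] = in[gi][gj + 1];
      sycl::group_barrier(it.get_group());

      // All stencil loads now come from the local tile instead of global memory.
      out[gi][gj] = (tile[li][lj] + tile[li - 1][lj] + tile[li + 1][lj] +
                     tile[li][lj - 1] + tile[li][lj + 1]) / 5.0f;
    });
  });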

Selecting the best optimizations for a given stencil requires compile-time introspection of block size, the neighborhood, and the stencil function itself, requiring a much more sophisticated approach than discussed here.

Reduction

It is possible to implement reduction kernels in SYCL by leveraging language features that provide synchronization and communication capabilities between work-items (e.g., atomic operations, work-group and sub-group functions, sub-group “shuffles”). The kernels in Figure 14-18 and Figure 14-19 show two possible reduction implementations: a naïve reduction using a basic parallel_for and an atomic operation for every work-item, and a slightly smarter reduction that exploits locality using an ND-range parallel_for and a work-group reduce function, respectively. We revisit these atomic operations in more detail in Chapter 19.

Figure 14-18
A code of six lines. It implements a naïve reduction and computes the sum.

Implementing a naïve reduction expressed as a data-parallel kernel
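A sketch of this naïve approach, where data is a hypothetical USM array of N ints and sum_ptr is a USM pointer to a result initialized to zero:

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      // Every work-item atomically adds its element to the single result.
      sycl::atomic_ref<int, sycl::memory_order::relaxed,
                       sycl::memory_scope::device,
                       sycl::access::address_space::global_space>
          atomic_sum(*sum_ptr);
      atomic_sum += data[i];
    });
  });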

Figure 14-19
A code of several lines with an if condition. It implements a naïve reduction and computes the sum.

Implementing a naïve reduction expressed as an ND-range kernel
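A sketch of the ND-range variant, combining each work-group's contributions with reduce_over_group before issuing a single atomic update per group (same hypothetical names as before, with work-group size B dividing N):

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N}, {B}}, [=](sycl::nd_item<1> it) {
      // Combine the contributions of this work-group first...
      int partial = data[it.get_global_id(0)];
      int group_sum =
          sycl::reduce_over_group(it.get_group(), partial, sycl::plus<>());
      // ...then let one work-item per group update the global result.
      if (it.get_group().leader()) {
        sycl::atomic_ref<int, sycl::memory_order::relaxed,
                         sycl::memory_scope::device,
                         sycl::access::address_space::global_space>
            atomic_sum(*sum_ptr);
        atomic_sum += group_sum;
      }
    });
  });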

There are numerous other ways to write reduction kernels, and different devices will likely prefer different implementations, owing to differences in hardware support for atomic operations, work-group local memory size, global memory size, the availability of fast device-wide barriers, or even the availability of dedicated reduction instructions. On some architectures, it may even be faster (or necessary!) to perform a tree reduction using log2(N) separate kernel calls.

We strongly recommend that manual implementations of reductions should only be considered for cases that are not supported by the SYCL reduction library or when fine-tuning a kernel for the capabilities of a specific device—and even then, only after being 100% sure that SYCL’s built-in reductions are underperforming!

Scan

As we saw earlier in this chapter, implementing a parallel scan requires multiple sweeps over the data, with synchronization occurring between each sweep. Since SYCL does not provide a mechanism for synchronizing all work-items in an ND-range, a direct implementation of a device-wide scan must use multiple kernels that communicate partial results through global memory.

The code, shown in Figures 14-20, 14-21, and 14-22, demonstrates an inclusive scan implemented using several kernels. The first kernel distributes the input values across work-groups, computing work-group local scans in work-group local memory (note that we could have used the work-group inclusive_scan function instead). The second kernel computes a local scan using a single work-group, this time over the final value from each block. The third kernel combines these intermediate results to finalize the prefix sum. These three kernels correspond to the three layers of the diagram in Figure 14-5.

Figure 14-20
A code of several lines. It computes local scans over input blocks and performs inclusive scan in local memory, among others.

Phase 1 for implementing a global inclusive scan in an ND-range kernel: computing across each work-group
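A sketch of this first phase, using the group inclusive_scan function mentioned above rather than an explicit local-memory implementation; in, out, and block_sums are hypothetical USM pointers, and the work-group size B divides N:

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N}, {B}}, [=](sycl::nd_item<1> it) {
      size_t gid = it.get_global_id(0);
      // Each work-group scans its own block of the input.
      out[gid] = sycl::inclusive_scan_over_group(it.get_group(), in[gid],
                                                 sycl::plus<>());
      // The last work-item's result is the total for this block.
      if (it.get_local_id(0) == B - 1)
        block_sums[it.get_group(0)] = out[gid];
    });
  });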

Figure 14-21
A code of several lines. It computes scan over partial results and performs inclusive scan in local memory, among others.

Phase 2 for implementing a global inclusive scan in an ND-range kernel: scanning across the results of each work-group

Figure 14-22
A code of seven lines with an if condition. It updates local scans using partial results.

Phase 3 (final) for implementing a global inclusive scan in an ND-range kernel
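A sketch of this final phase, assuming block_sums now holds an inclusive scan of the block totals produced by phase 2 (same hypothetical names as before):

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N}, {B}}, [=](sycl::nd_item<1> it) {
      size_t gid = it.get_global_id(0);
      size_t block = it.get_group(0);
      // Every block after the first adds the total of all preceding blocks.
      if (block > 0)
        out[gid] += block_sums[block - 1];
    });
  });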

Figure 14-20 and Figure 14-21 are very similar; the only differences are the size of the range and how the input and output values are handled. A real-life implementation of this pattern could use a single function taking different arguments to implement these two phases, and they are only presented as distinct code here for pedagogical reasons.

Pack and Unpack

Pack and unpack are also known as gather and scatter operations. These operations handle differences in how data is arranged in memory and how we wish to present it to the compute resources.

Pack

Since pack depends on an exclusive scan, implementing a pack that applies to all elements of an ND-range must also take place via global memory and over the course of several kernel enqueues. However, there is a common use case for pack that does not require the operation to be applied over all elements of an ND-range—namely, applying a pack only across items in a specific work-group or sub-group.

The snippet in Figure 14-23 shows how to implement a group pack operation on top of an exclusive scan.

Figure 14-23
A code of three lines with an if condition. It includes an exclusive scan.

Implementing a group pack operation on top of an exclusive scan
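A minimal sketch of such an operation (not the figure's exact code), where g is a group or sub-group, predicate says whether this work-item's value should be kept, and dst points to group-visible output storage (hypothetical names):

  // Number of lower-indexed work-items whose predicate is also true.
  uint32_t offset = sycl::exclusive_scan_over_group(
      g, predicate ? 1u : 0u, sycl::plus<>());
  if (predicate)
    dst[offset] = value;
  // If needed, the total number of packed elements can be obtained with a
  // group reduction over the predicate.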

The code in Figure 14-24 demonstrates how such a pack operation could be used in a kernel to build a list of elements which require some additional postprocessing (in a future kernel). The example shown is based on a real kernel from molecular dynamics simulations: the work-items in the sub-group assigned to particle i cooperate to identify all other particles within a fixed distance of i, and only the particles in this “neighbor list” will be used to calculate the force acting on each particle.

Figure 14-24
A code of several lines. It computes the distance between i and neighbor j and keeps track of how many neighbors have been packed so far, among others.

Using a sub-group pack operation to build a list of elements needing additional postprocessing
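A simplified sketch of this idea (not the figure's exact code): one work-group per particle, with the work-group size B chosen to match the device's sub-group size so that each work-group contains a single sub-group; pos (an array of sycl::float3), neighbors, num_neighbors, CUTOFF, and MAX_NEIGHBORS are hypothetical names:

  q.submit([&](sycl::handler& h) {
    h.parallel_for(sycl::nd_range<1>{{N * B}, {B}}, [=](sycl::nd_item<1> it) {
      auto sg = it.get_sub_group();
      size_t i = it.get_group(0);  // the particle owned by this work-group
      uint32_t count = 0;          // how many neighbors have been packed so far

      // The sub-group strides through the candidate particles together.
      for (size_t base = 0; base < N; base += sg.get_local_range()[0]) {
        size_t j = base + sg.get_local_id()[0];
        bool close = (j < N) && (j != i) &&
                     (sycl::distance(pos[i], pos[j]) < CUTOFF);
        // Where this neighbor lands relative to others found this iteration.
        uint32_t offset = sycl::exclusive_scan_over_group(
            sg, close ? 1u : 0u, sycl::plus<>());
        if (close && count + offset < MAX_NEIGHBORS)
          neighbors[i * MAX_NEIGHBORS + count + offset] = j;
        count += sycl::reduce_over_group(sg, close ? 1u : 0u, sycl::plus<>());
      }
      if (sg.leader())
        num_neighbors[i] = (count < MAX_NEIGHBORS) ? count : MAX_NEIGHBORS;
    });
  });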

Note that the pack pattern never reorders elements—the elements that are packed into the output array appear in the same order as they did in the input. This property of pack is important and enables us to use pack functionality to implement other more abstract parallel algorithms (such as std::copy_if and std::stable_partition). However, there are other parallel algorithms that can be implemented on top of pack functionality where maintaining order is not required (such as std::partition).

Unpack

As with pack, we can implement unpack using scan. Figure 14-25 shows how to implement a sub-group unpack operation on top of an exclusive scan.

Figure 14-25
A code of three lines. It includes an exclusive scan.

Implementing a sub-group unpack operation on top of an exclusive scan
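A minimal sketch, mirroring the pack operation: g is a group or sub-group, predicate says whether this work-item should consume a fresh element, src points to packed group-visible input, and original is the value kept otherwise (hypothetical names):

  // Work-items that need a new element claim consecutive packed inputs.
  uint32_t offset = sycl::exclusive_scan_over_group(
      g, predicate ? 1u : 0u, sycl::plus<>());
  auto result = predicate ? src[offset] : original;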

The code in Figure 14-26 demonstrates how such a sub-group unpack operation could be used to improve load balancing in a kernel with divergent control flow (in this case, computing the Mandelbrot set). Each work-item is assigned a separate pixel to compute and iterates until convergence or a maximum number of iterations is reached. An unpack operation is then used to replace completed pixels with new pixels.

Figure 14-26
A code of several lines. It keeps iterating as long as one work-item has work to do and resets the iterator variables for the new i, among others.

Using a sub-group unpack operation to improve load balancing for kernels with divergent control flow

The degree to which an approach like this improves efficiency (and decreases execution time) is highly application- and input-dependent, since checking for completion and executing the unpack operation both introduce some overhead! Successfully using this pattern in realistic applications will therefore require some fine-tuning based on the amount of divergence present and the computation being performed (e.g., introducing a heuristic to execute the unpack operation only if the number of active work-items falls below some threshold).

Summary

This chapter has demonstrated how to implement some of the most common parallel patterns using SYCL features, including built-in functions and libraries.

The SYCL ecosystem is still developing, and we expect to uncover new best practices for these patterns as developers gain more experience with the language and from the development of production-grade applications and libraries.

For More Information

  • Structured Parallel Programming: Patterns for Efficient Computation by Michael McCool, Arch Robison, and James Reinders, © 2012, published by Morgan Kaufmann, ISBN 978-0-124-15993-8.

  • Algorithms library, C++ Reference, https://en.cppreference.com/w/cpp/algorithm.