In this chapter, we will learn about the buffer abstraction. We learned about Unified Shared Memory (USM), the pointer-based strategy for data management, in the previous chapter. USM forces us to think about where memory lives and what should be accessible where. The buffer abstraction is a higher-level model that hides this from the programmer. Buffers simply represent data, and it becomes the job of the runtime to manage how the data is stored and moved in memory.

This chapter presents an alternative approach to managing our data. The choice between buffers and USM often comes down to personal preference and the style of existing code, and applications are free to mix and match the two styles to represent different data within the same application.

USM and buffers simply expose different abstractions for memory. USM works with pointers, while buffers are a higher-level abstraction. The abstraction level of buffers allows the data contained within to be used on any device within the application, where the runtime manages whatever is needed to make that data available. The pointer-based model of USM is probably a better fit for applications that use pointer-based data structures such as linked lists or trees. Buffers can also be trickier to retrofit into existing codes that already use pointers. However, buffers are guaranteed to work on every device in the system, while some devices may not support specific (or any) modes of USM. Choices are good, so let’s dive into buffers.

We will look more closely at how buffers are created and used. A discussion of buffers would not be complete without also discussing the accessor. While buffers abstract how we represent and store data in a program, we do not directly access the data using the buffer. Instead, we use accessor objects that inform the runtime how we intend to use the data we are accessing, and accessors are tightly coupled to the powerful data dependence mechanisms within task graphs. After we cover all the things we can do with buffers, we will also explore how to create and use accessors in our programs.

Buffers

A buffer is a high-level abstraction for data. Buffers are not necessarily tied to a single location or virtual memory address. Indeed, the runtime is free to use many different locations in memory (even across different devices) to represent a buffer, but the runtime must be sure to always give us a consistent view of the data. A buffer is accessible on the host and on any device.

Figure 7-1. Buffer class definition
template <typename T, int Dimensions, typename AllocatorT> class buffer;

The buffer class is a template class with three template arguments, as shown in Figure 7-1. The first template argument is the type of the object that the buffer will contain. This type must be device copyable, which extends the notion of trivially copyable as defined by C++. Types that are trivially copyable are safe to copy byte by byte without using any special copy or move constructors. Device copyable types extend this notion recursively to certain C++ types like std::pair or std::tuple. The next template argument is an integer describing the dimensionality of the buffer. The final template argument is optional, and the default value is usually what is used. This argument specifies a C++-style allocator class that is used to perform any memory allocations on the host that are needed for the buffer. First, we will examine the many ways that buffer objects can be created.
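As a rough illustration of the trivially copyable requirement, candidate element types can be checked at compile time with the standard C++ type trait. The sketch below uses only standard C++17. The types Particle and Named and the helper byte_copyable are hypothetical names invented for this example. SYCL 2020 also provides its own trait, sycl::is_device_copyable, for the extended notion described above, which this sketch does not exercise.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical element types, used only for this illustration.
struct Particle {  // plain aggregate of floats: safe to copy byte by byte
  float position[3];
  float mass;
};

struct Named {  // owns heap memory through std::string: not byte-copyable
  std::string label;
  int value;
};

// Hypothetical helper: true when T is trivially copyable in the C++ sense.
template <typename T>
constexpr bool byte_copyable() {
  return std::is_trivially_copyable_v<T>;
}

static_assert(byte_copyable<Particle>(), "Particle can be copied bytewise");
static_assert(!byte_copyable<Named>(), "Named needs its copy constructor");
```

A type like Particle is a reasonable candidate for a buffer's element type, while a type like Named would need to be restructured first.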

Buffer Creation

In the following figures, we show several ways in which buffer objects can be created. Let’s walk through the example and look at each instance.

Figure 7-2. Creating buffers, Part 1

The first buffer we create in Figure 7-2, b1, is a two-dimensional buffer of ten integers. We explicitly pass all template arguments, even explicitly passing the default value of buffer_allocator<T> as the allocator type. Since buffer_allocator is also a templated type, we must explicitly specialize it just as we do the buffer by specifying buffer_allocator<int>. However, using modern C++, we can express this much more compactly. Buffer b2 is also a two-dimensional buffer of ten integers using the default allocator. Here we make use of C++17’s class template argument deduction (CTAD) to automatically infer template arguments. CTAD is an all-or-none tool—it must either infer every template argument for a class or infer none of them. In this case, we use the fact that we are initializing b2 with a range that takes two arguments to infer that it is a two-dimensional range. The allocator template argument has a default value, so we do not need to explicitly list it when creating the buffer.

With buffer b3, we create a buffer of 20 floats and use a default-constructed std::allocator to allocate any necessary memory on the host. When using a custom allocator type with a buffer, we often want to pass an actual allocator object to the buffer to use instead of the default-constructed one. Buffer b4 shows how to do this, taking the allocator object after the range in the call to its constructor.

For the first four buffers in our example, we let the buffer allocate any memory it needs and we do not initialize that data with any values at the time of their creation. It is a common pattern to use buffers to effectively wrap existing C++ allocations, which may already have been initialized with data. We can do this by passing a source of initial values to the buffer constructor. Doing so allows us to do several things, which we will see with the next example.

Figure 7-3. Creating buffers, Part 2

In Figure 7-3, buffer b5 creates a one-dimensional buffer of four doubles. We pass the host pointer to the C array myDoubles to the buffer constructor in addition to the range that specifies the size of the buffer. Here we can make full use of CTAD to infer all the template arguments of our buffer. The host pointer we pass points to doubles, which gives us the data type of our buffer. The number of dimensions is automatically inferred from the one-dimensional range, which itself is inferred because it is created with only one number. Finally, the default allocator is used, so we do not have to specify that.

Passing a host pointer has a few ramifications of which we should be aware. By passing a pointer to host memory, we are promising the runtime that we will not try to access the host memory during the lifetime of the buffer. This is not (and cannot be) enforced by a SYCL implementation—it is our responsibility to ensure that we do not break this contract. One reason that we should not try to access this memory while the buffer is alive is that the buffer may choose to use different memory on the host to represent the buffer content, often for optimization reasons. If it does so, the values will be copied into this new memory from the host pointer. If subsequent kernels modify the buffer, the original host pointer will not reflect the updated values until certain specified synchronization points. We will talk more about when data gets written back to a host pointer later in this chapter.

Buffer b6 is very similar to buffer b5 with one major difference. This time, we are initializing the buffer with a pointer to const double. This means that we can only read values through the host pointer and not write them. However, the type for our buffer in this example is still double, not const double since the deduction guides do not take const-ness into consideration. This means that the buffer may be written to by a kernel, but we must use a different mechanism to update the host after the buffer has outlived its use (covered later in this chapter).

Buffers can also be initialized using C++ shared pointer objects. This is useful if our application already uses shared pointers, as this method of initialization properly maintains the reference count and ensures that the memory is not deallocated while the buffer is using it. Buffer b7 creates a buffer containing a single integer and initializes it using a shared pointer.

Figure 7-4. Creating buffers, Part 3

Containers are commonly used in modern C++ applications, with examples including std::array, std::vector, std::list, or std::map. We can initialize one-dimensional buffers using containers in two different ways. The first way, as shown in Figure 7-4 by buffer b8, uses input iterators. Instead of a host pointer, we pass two iterators to the buffer constructor, one representing the beginning of the data and another representing the end. The size of the buffer is the number of times the start iterator must be incremented before it equals the end iterator. This is useful for any data type that implements the C++ InputIterator interface. If the container object that provides the initial values for a buffer is also contiguous, then we can use an even simpler form to create the buffer. Buffer b9 creates a buffer from a vector simply by passing the vector to the constructor. The size of the buffer is determined by the size of the container being used to initialize it, and the type for the buffer data comes from the type of the container data. This approach is common and recommended for contiguous containers such as std::vector and std::array.

The final example of buffer creation illustrates another feature of the buffer class. It is possible to create a sub-buffer, which is a view into part of another buffer. A sub-buffer requires three things: a reference to a parent buffer, a base index, and the range of the sub-buffer. A sub-buffer cannot be created from a sub-buffer. Multiple sub-buffers can be created from the same buffer, and they are free to overlap. Buffer b10 is created exactly like buffer b2, a two-dimensional buffer of integers with five integers per row. Next, we create two sub-buffers from buffer b10, sub-buffers b11 and b12. Sub-buffer b11 starts at index (0,0) and contains every element in the first row. Similarly, sub-buffer b12 starts at index (1,0) and contains every element in the second row. This yields two disjoint sub-buffers. Since the sub-buffers do not overlap, different kernels could operate on the different sub-buffers concurrently, but we will talk more about scheduling execution graphs and dependences in the next chapter.

Figure 7-5. Buffer properties

Buffer Properties

Buffers can also be created with special properties that alter their behavior. In Figure 7-5, we will walk through an example of the three different optional buffer properties and discuss how they might be used. Note that these properties are relatively uncommon in most codes.

use_host_ptr

The first property that may be optionally specified during buffer creation is use_host_ptr. When present, this property requires the buffer to not allocate any memory on the host, and any allocator passed or specified on buffer construction is effectively ignored. Instead, the buffer must use the memory pointed to by a host pointer that is passed to the constructor. Note that this does not require the device to use the same memory to hold the buffer’s data. A device is free to cache the contents of a buffer in its attached memory. Also note that this property may only be used when a host pointer is passed to the constructor. This option can be useful when the program wants full control over all host memory allocations—for example, it allows programmers to try to minimize the memory footprint of an application.

In our example in Figure 7-5, we create a buffer b as we saw in our previous examples. We next create buffer b1 and initialize it with a pointer to myInts. We also pass the property use_host_ptr, which means that buffer b1 will only use the memory pointed to by myInts and not allocate any additional temporary storage on the host.

use_mutex

The next property, use_mutex, concerns fine-grained sharing of memory between buffers and host code. Buffer b2 is created using this property. The property takes a reference to a mutex object that can later be queried from the buffer as we see in the example. This property also requires a host pointer be passed to the constructor, and it lets the runtime determine when it is safe to access updated values in host code through the provided host pointer. We cannot lock the mutex until the runtime guarantees that the host pointer sees the latest value of the buffer. While this could be combined with the use_host_ptr property, it is not required. use_mutex is a mechanism that allows host code to access data within a buffer while the buffer is still alive and without using the host accessor mechanism (described later). In general, the host accessor mechanism should be preferred unless we have a specific reason to use a mutex, particularly because there are no guarantees on how long it will take before the mutex will be successfully locked and the data ready for use by host code.

context_bound

The final property is shown in the creation of buffer b3 in our example. Here, our buffer of 42 integers is created with the context_bound property. The property takes a reference to a context object. Normally, a buffer is free to be used on any device or context. However, if this property is used, it locks the buffer to the specified context. Attempting to use the buffer on another context will result in a runtime error. This could be useful for debugging programs by identifying cases where a kernel might be submitted to the wrong queue, for instance. In practice, we do not expect to see this property used in many programs, and the ability for buffers to be accessed on any device in any context is one of the most powerful properties of the buffer abstraction (which this property undoes).

What Can We Do with a Buffer?

Many things can be done with buffer objects. We can query characteristics of a buffer, determine if and where any data is written back to host memory after the buffer is destroyed, or reinterpret a buffer as one with different characteristics. One thing that cannot be done, however, is to directly access the data that a buffer represents. Instead, we must create accessor objects to access the data, and we will learn all about this later in the chapter.

Examples of things that can be queried about a buffer include its range, the total number of data elements it represents, and the number of bytes required to store its elements. We can also query which allocator object is being used by the buffer and whether the buffer is a sub-buffer or not.

Updating host memory when a buffer is destroyed is an important aspect to consider when using buffers. Depending on how a buffer is created, host memory may or may not be updated with the results of a computation after buffer destruction. If a buffer is created and initialized from a host pointer to non-const data, that same pointer is updated with the latest data when the buffer is destroyed. However, there is also a way to update host memory regardless of how a buffer was created. The set_final_data method is a template method of buffer that can accept either a raw pointer, a C++ OutputIterator, or a std::weak_ptr (technically, a raw pointer is a special case of an OutputIterator). When the buffer is destroyed, data contained by the buffer will be written to the host using the supplied location. Note that if the buffer was created and initialized from a host pointer to non-const data, it’s as if set_final_data was called with that pointer. If the parameter passed to set_final_data is a std::weak_ptr, the data is not written to the host if the pointer has expired or has already been deleted. Whether or not writeback occurs can also be controlled by the set_write_back method.

Accessors

Data represented by a buffer cannot be directly accessed through the buffer object. Instead, we must create accessor objects that allow us to safely access a buffer’s data. Accessors inform the runtime where and how we want to access data, allowing the runtime to ensure that the right data is in the right place at the right time. This is a very powerful concept, especially when combined with the task graph that schedules kernels for execution based in part on data dependences.

Accessor objects are instantiated from the templated accessor class. This class has five template parameters. The first parameter is the type of the data being accessed. This should be the same as the type of data being stored by the corresponding buffer. Similarly, the second parameter describes the dimensionality of the data and buffer and defaults to a value of one.

Figure 7-6. Access modes

Access mode    Description
read           Read-only access
write          Write-only access; previous contents are preserved
read_write     Read and write access

The next three template parameters are unique to accessors. The first of these is the access mode. The access mode describes how we intend to use an accessor in a program. The possible modes are listed in Figure 7-6. We will learn how these modes are used to order the execution of kernels and perform data movement in Chapter 8. The access mode parameter does have a default value if none is specified or automatically inferred. If we do not specify otherwise, accessors will default to read_write access mode for non-const data types and read for const data types. These defaults are always correct but providing more accurate information may improve a runtime’s ability to perform optimizations. When starting application development, it is safe and concise to simply not specify an access mode, and we can then refine the access modes based on profiling of performance-critical regions of the application.

Figure 7-7. Access targets

Access target  Description
device         Access a buffer via the device's global memory
host_task      Access a buffer from a host task

The next template parameter is the access target. Buffers are an abstraction of data and do not describe where and how data is stored. The access target describes where we are accessing data. The two possible access targets are listed in Figure 7-7.

When using C++ with SYCL, there are only two targets: device and host_task. The default template value is device, and this means that we intend to access a buffer’s data on a device. This is reasonable as accessors are most commonly used in operations on a device such as kernels or data transfers. The other access target is host_task, which is used when a host task needs to access a buffer’s data.

Devices may have different types of memories available. In particular, many devices have some sort of fast local memory that is shared across multiple work-items in a work-group. Prior versions of SYCL had special access targets for local memory, but SYCL 2020 handles it in a different way. We will learn how to use work-group local memory in Chapter 9. Prior versions of SYCL also had a special access target for the host (outside of host tasks, which are new to SYCL 2020). This has been replaced with the new host_accessor class, which provides access to a buffer’s data in host code. That access remains valid for the entire lifetime of the host_accessor, and the buffer is locked to the host for as long as the host_accessor exists, so we should take special care to limit the scope of host_accessor objects.

The final template parameter governs whether an accessor is a placeholder accessor or not. This is not a parameter that a programmer is likely to ever directly set and is usually deduced by which constructor call is used to create the accessor. A placeholder accessor is one that is declared outside of a command group but meant to be used to access data on a device inside a kernel. We will see what differentiates a placeholder accessor from one that is not once we look at examples of accessor creation.

While accessors can be extracted from a buffer object using its get_access method, it’s simpler to directly create (construct) them. This is the style we will use in upcoming examples since it is very simple to understand and is compact.

Accessor Creation

Figure 7-8 shows an example program with everything that we need to get started with accessors. In this example, we have three buffers, A, B, and C. The first parallel task we submit to the queue creates accessors to each buffer and defines a kernel that uses these accessors to initialize the buffers with some values. Each accessor is constructed with a reference to the buffer it will access as well as the handler object defined by the command group we’re submitting to the queue. This effectively binds the accessor to the kernel we’re submitting as part of the command group. Regular accessors are device accessors since they, by default, target global buffers stored in device memory. This is the most common use case.

Figure 7-8. Simple accessor creation

The second task we submit also defines three accessors to the buffers. We then use those accessors in the second kernel to add the elements of buffers A and B into buffer C. Since this second task operates on the same data as the first one, the runtime will execute this task after the first one is complete. We will learn about this in detail in the next chapter.

The third task shows how we can use a placeholder accessor. The accessor pC is declared at the beginning of the example in Figure 7-8 after we create our buffers. Note that the constructor is not passed a handler object since we don’t have one to pass. This lets us create a reusable accessor object ahead of time. However, in order to use this accessor inside a kernel, we need to bind it to a command group during submission. We do this using the handler object’s require method. Once we have bound our placeholder accessor to a command group, we can then use it inside a kernel as we would any other accessor.

Finally, we create a host_accessor object in order to read the results of our computations back on the host. Note that this is a different type than we used inside our kernels. The host accessor result in this example also does not take a handler object, since we once again do not have one to pass, and the special type for host accessors lets us disambiguate them from placeholders. An important aspect of host accessors is that the constructor only completes when the data is available for use on the host, which means that construction of a host accessor can appear to take a long time. The constructor must wait for any kernels that produce the data to finish executing, as well as for the copy itself to finish. Once the host accessor construction is complete, it is safe to use the data that it accesses directly on the host, and we are guaranteed that the latest version of the data is available to us on the host.

While this example is perfectly correct, we don’t say anything about how we intend to use our accessors when we create them. Instead, we use the default access mode, which is read_write, for the non-const int data in our buffers. This is potentially overconservative and may create unnecessary dependences between operations or superfluous data movement. A runtime may be able to do a better job if it has more information about how we plan to use the accessors we create. However, before we go through an example where we do this, we should first introduce one more tool—the deduction tag.

Deduction tags are a compact way to express the desired combination of access mode and target for an accessor. Deduction tags, when used, are passed as a parameter to an accessor’s constructor. The possible tags are shown in Figure 7-9. When an accessor is constructed with a tag parameter, C++ CTAD can then properly deduce the desired access mode and target, providing an easy way to override the default values for those template parameters. We could also manually specify the desired template parameters, but tags provide a simpler, more compact way to get the same result without spelling out fully templated accessors.

Figure 7-9. Deduction tags

Tag value              Access mode   Access target
read_only              read          device
read_write             read_write    device
write_only             write         device
read_only_host_task    read          host_task
read_write_host_task   read_write    host_task
write_only_host_task   write         host_task

Let’s take our previous example and rewrite it to add deduction tags. This new and improved example is shown in Figure 7-10.

Figure 7-10. Accessor creation with specified usage

We begin by declaring our buffers as we did in Figure 7-8. We also create our placeholder accessor that we’ll use later. Let’s now look at the first task we submit to the queue. Previously, we created our accessors by passing a reference to a buffer and the handler object for the command group. Now, we add two extra parameters to our constructor calls. The first new parameter is a deduction tag. Since this kernel is writing the initial values for our buffers, we use the write_only deduction tag. This lets the runtime know that this kernel is producing new data and will not read from the buffer.

The second new parameter is an optional accessor property, similar to the optional properties for buffers that we saw earlier in the chapter. The property we pass, no_init, lets the runtime know that the previous contents of the buffer can be discarded. This is useful because it can let the runtime eliminate unnecessary data movement. In this example, since the first task is writing the initial values for our buffers, it’s unnecessary for the runtime to copy the uninitialized host memory to the device before the kernel executes. The no_init property is useful for this example, but it should not be used for read–modify–write cases or kernels where only some values in a buffer may be updated.

The second task we submit to our queue is identical to before, but now we add deduction tags to our accessors. Here, we add the tags read_only to accessors aA and aB to let the runtime know that we will only read the values of buffers A and B through these accessors. The third accessor, aC, gets the read_write deduction tag since we accumulate the sum of the elements of A and B into C. We explicitly use the tag in the example to be consistent, but this is unnecessary since the default access mode is read_write.

The default usage is retained in the third task where we use our placeholder accessor. This remains unchanged from the simplified example we saw in Figure 7-8. Our final accessor, the host accessor result, now receives a deduction tag when we create it. Since we only read the final values on the host, we pass the read_only tag to the constructor. If we rewrote the program in such a way that the host accessor was destroyed, launching another kernel that operated on buffer C would not require it to be written back to the device since the read_only tag lets the runtime know that it will not be modified by the host.

What Can We Do with an Accessor?

Many things can be done with an accessor object. However, the most important thing we can do is spelled out in the accessor’s name—access data. This is usually done through one of the accessor’s [] operators. We use the [] operator in our examples in Figures 7-8 and 7-10. This operator takes either an id object that can properly index multidimensional data or a single size_t. The second case can be used when an accessor has more than one dimension. In that case, it returns an object that is then meant to be indexed again with [] until we arrive at a scalar value, and this would be of the form a[i][j] in a two-dimensional case. Remember that the ordering of accessor dimensions follows the convention of C++ where the rightmost dimension is the unit-stride dimension (iterates “fastest”).

An accessor can also return a pointer to the underlying data. This pointer can be accessed directly following normal C++ rules. Note that there can be additional complexity involved with respect to the address space of this pointer.

Many things can also be queried from an accessor object. Examples include the number of elements accessible through the accessor, the size in bytes of the region of the buffer it covers, or the range of data accessible.

Accessors provide a similar interface to C++ containers and may be used in many situations where containers may be passed. The container interface supported by accessors includes the data method, which is equivalent to get_pointer, and several flavors of forward and backward iterators.

Summary

In this chapter, we have learned about buffers and accessors. Buffers are an abstraction of data that hides the underlying details of memory management from the programmer. They do this in order to provide a simpler, higher-level abstraction. We went through several examples that showed us the different ways to construct buffers as well as the different optional properties that can be specified to alter their behavior. We learned how to initialize a buffer with data from host memory as well as how to write data back to host memory when we are done with a buffer.

Since we cannot access buffers directly, we learned how to access the data in a buffer by using accessor objects. We learned the difference between device accessors and host accessors. We discussed the different access modes and targets and how they inform the runtime how and where an accessor will be used by the program. We showed the simplest way to use accessors using the default access modes and targets, and we learned how to distinguish between a placeholder accessor and one that is not. We then saw how to further optimize the example program by giving the runtime more information about our accessor usage by adding deduction tags to our accessor declarations. Finally, we covered many of the different ways that accessors can be used in a program.

In the next chapter, we will learn in greater detail how the runtime can use the information we give it through accessors to schedule the execution of different kernels. We will also see how this information informs the runtime about when and how the data in buffers needs to be copied between the host and a device. We will learn how we can explicitly control data movement involving buffers—and USM allocations too.