
1 Introduction


GPUs are massively parallel devices that promise a great return on investment, at a cost: they are notably difficult to program. In GPU programming, hundreds of lightweight threads share portions of arrays in parallel (without locks), a model very different from that of multithreaded programs written in C or Java with heavy-weight heterogeneous threads. Data-race freedom (DRF) analysis aims to guarantee that, for all possible executions, every array cell being written by one thread cannot be concurrently accessed by another thread.

In the field of static analysis of DRF in GPU programs, there is a tension between efficiency and correctness (no missed data-races and no false alarms) that thus far is unresolved. Bug-finding tools [26, 27, 33] favor correctness over efficiency: they provide correct results at small scales by simulating the program execution. Such tools are incapable of handling certain parameters symbolically (e.g., array size) and can easily exhaust users' resources (e.g., loops with long iteration spaces or unknown bounds). Approaches based on Hoare logic [5, 7, 22] can cope with medium-sized programs, do not miss data-races, and do not require array size information; however, they suffer from a high rate of false alarms and require code annotations written by concurrency experts. Finally, tools that can cope with larger programs and do not require array size information either miss data-races [24] or overwhelm the user with false alarms [37].

To resolve this tension, we introduce a novel static DRF analysis that can handle larger programs and produce fewer false alarms than related work, without missing data-races. Additionally, our analysis does not require code annotations or array size information. Our verification framework hinges on access memory protocols, a new family of behavioral types [1] that codify the way threads interact through shared memory. Our behavioral types also make evident two aspects of the analysis that can be separated: concurrency analysis (i.e., could these two expressions run in parallel?) and data-race conflict detection (i.e., do these array indices match?).

Fig. 1. Work-flow of the verification.

Contributions and Synopsis. This paper includes the following contributions.

  1.

    In Sect. 3, we formalize the syntax, semantics, and well-formedness conditions for access memory protocols, which are behavioral types for GPU programs. This behavioral abstraction results in a simpler yet more expressive theory than previous works, e.g., it does not require user-provided loop invariants.

  2.

    In Sect. 4, we show that our DRF analysis of access memory protocols can be soundly and completely reduced to the satisfiability of an SMT formula, see Theorems 1 and 3. Our theory and results on access memory protocols are fully mechanized in Coq. To the best of our knowledge, this is the first mechanized proof of correctness of a DRF analysis for GPU programs.

  3.

    We show that our DRF analysis of access memory protocols is compositional when protocols satisfy a structural property, see Theorem 2. Additionally, we show how to transform protocols when they do not meet this property.

  4.

    In Sect. 5, we present Faial, which infers access memory protocols from CUDA kernels and implements our theory. Our experiments show that Faial is more precise and scales better than existing tools.

  5.

    In Sect. 6, we present a thorough experimental evaluation of Faial against related work [5, 24, 26, 27], the largest comparative study of GPU verification (5 tools on 260 kernels, 3 tools compared on 487 kernels). Faial verified 218 out of 227 real-world kernels (at least \(1.42{\times }\) more than other tools) and correctly verified more handcrafted tests than other tools (4 out of 5). In a synthetic benchmark suite (250 kernels), Faial is the only tool to exhibit linear growth in 4 out of 5 experiments, while others grow exponentially in all 5 experiments.

Our paper is accompanied by an implementation (Faial), an evaluation framework (inc. datasets), and proof scripts (in Coq) for each theorem. All of these are available in our artifact [9].

2 Overview

This section gives an overview of our approach by examining a data-race we found in published work [17, 34]. We discuss the challenges that such examples pose to the state-of-the-art of DRF analysis. Then we introduce a verification framework based on access memory protocols: behavioral types [1] that codify the way threads interact via shared memory. Figure 1 gives an overview of the verification pipeline. We start from CUDA kernels, from which we infer access memory protocols. Protocols are then checked for well-formedness and transformed in three steps into formulas that are verified by an SMT solver.

2.1 Challenges of GPU Programming

GPU Programming Model. The key component of GPU programming is the kernel program, or just kernel, which runs according to the Single-Instruction-Multiple-Thread (SIMT) execution model, where multiple threads run a single instruction concurrently. A kernel is parameterized by a special variable that holds a thread identifier, henceforth named tid. In parallel, each member of a group of threads runs an instantiated copy of the kernel by supplying its identifier as an argument. Threads communicate via shared memory (arrays) and mediate communication via barrier synchronization (an execution point where all threads must wait for each other before advancing further). Writes are only guaranteed to be visible to other threads after a barrier synchronization.
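This execution model can be mimicked on a CPU. The sketch below uses plain Python threads and `threading.Barrier` to stand in for GPU threads and `__syncthreads()`; the kernel body is our own toy example, not from the paper. Removing the `barrier.wait()` call would reintroduce exactly the kind of race this paper targets.

```python
import threading

# Each "GPU thread" runs the same kernel instantiated with its thread
# identifier (tid), shares an array, and synchronizes at a barrier.
N_THREADS = 4
shared = [0] * (N_THREADS + 1)   # shared array (extra cell stays 0)
result = [0] * N_THREADS
barrier = threading.Barrier(N_THREADS)

def kernel(tid):
    shared[tid] = tid * 10       # each thread writes its own cell
    barrier.wait()               # writes become visible to all threads
    result[tid] = shared[tid + 1] # reading a neighbour's cell is now safe

threads = [threading.Thread(target=kernel, args=(t,)) for t in range(N_THREADS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(result)                    # [10, 20, 30, 0]
```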

GPU programming platforms usually group threads hierarchically in multiple levels, across which no inter-group synchronization is possible. In both the literature [6, 24] and this work, the focus is on intra-group communication.

[Listing 2.1]

Challenges. We motivate the difficulty of analyzing data-races by studying a programming error found in the wild, reported in Listing 2.1 (left). This excerpt comes from a tutorial [34] on optimizing numeric algorithms for GPUs. The code listing transposes a matrix a number of times with an outer loop.

Remarkably, the tutorial [34] does not inform the readers that Listing 2.1 contains a subtle data-race: one transpose-operation starts (the writes in line 3) without awaiting the termination of the previous transpose-operation (the reads in line 6), thus corrupting the data over time and possibly skewing the timing measurements so that the optimization appears faster than it is. We found a related data-race in [17], which reuses code from [34].

Our tool, Faial, successfully identifies the program state that triggers the data-race in Listing 2.1. However, state-of-the-art tools struggle to accurately analyze Listing 2.1, as evaluated in Sect. 6 (Claim 1: Test 1). Symbolic execution tools, such as [26, 27], time out for large loop bounds and, in general, cannot handle symbolic (unknown) bounds. GPUVerify [6], a tool based on Hoare logic, reports a false alarm: a spurious data-race. PUG [24] incorrectly identifies the example as DRF, as its analysis appears to ignore data-races originating from different iterations of a loop.

2.2 Access Memory Protocols by Example

We now investigate the data-race in Listing 2.1 with an access memory protocol. For presentation purposes, we focus our discussion on Listing 2.1 (r.h.s.), which simplifies the l.h.s. whilst retaining the root cause of its data-race: the interaction between both loops. We discuss how we support multi-dimensional arrays, multi-dimensional thread identifiers, and arbitrary loop strides in Sect. 5. In our Coq formalism the notion of "accesses" (and their dimensions) is a parameter of the theory, thus orthogonal to the theory presented here.

Consider the execution of the end of the first iteration and the beginning of the second iteration of the outer loop. In this case, the execution of the inner loop at the end of the first iteration is not synchronized with its execution at the start of the second, as there is no call to __syncthreads() in between.

The access memory protocol in Listing 2.2 captures this partial execution from the viewpoint of the shared array. By design, access memory protocols over-approximate kernels by abstracting away what data is being written to/read from an array, to focus on where data is being written. The protocol models the two problematic loops of Listing 2.1: the first loop reads from the array, while the second writes to it. Evaluation of a protocol follows the SIMT model: each thread evaluates the protocol by instantiating the thread-identifier variable with its unique identifier (hereafter, an integer), so threads 0 and 1 yield distinct instances of the protocol.

[Listing 2.2]

Analysis of Unsynchronized Protocols. We say that a protocol is DRF when all concurrent accesses are pair-wise DRF, i.e., when two accesses issued by different threads target the same index, neither access is a write. For instance, the respective sets of concurrent accesses of threads 0 and 1 in Listing 2.2 are given below.

[access sets omitted]

Thread 0 (l.h.s.) and thread 1 (r.h.s.) access a common index; thus, there is a data-race on index 1 of the array.

A fundamental challenge of static DRF verification is how to handle loops. Symbolic execution approaches that unroll loops, e.g., [26, 27], cannot handle large or symbolic iteration spaces. Static approaches that use Hoare logic, e.g., [5, 7, 22], require user-provided loop invariants. Another approach is to reduce loops to verifying the satisfiability of a corresponding universally quantified formula, e.g., [25, 30]. This has the advantage of being fast and not requiring invariants. However, its previous application to GPU programming, i.e., PUG, is unsound due to the interaction between barrier synchronizations and loops, e.g., PUG misses the data-race in Listing 2.1. We give more details in Sect. 6.

Our Approach. A key contribution of our work is to identify conditions that allow a kernel to be reduced to a first-order logic formula, by precisely characterizing the effect of barrier synchronization in loops. To this end, the language of access memory protocols distinguishes syntactically between protocol fragments that synchronize and those that do not. For instance, the protocol in Listing 2.2 is identified as unsynchronized, as it does not include any synchronization.

In Sect. 4, we show that the DRF analysis of unsynchronized protocols can be precisely reduced to a first-order logic formula, where universally quantified formulae represent loops, thus obviating the need to unroll them explicitly. For instance, we reduce the verification of Listing 2.2 to asking whether for all M, \(t_1\), and \(t_2\), where \(t_1 \ne t_2\) are thread identifiers, the following holds:

[formula omitted]

This formula is unprovable: a counterexample arises when, e.g., \(t_1 = 0\), \(t_2=1\), \(j_1 = 1\), and \(M > 1\). Hence, our technique flags Listing 2.2 as racy.
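The role the SMT solver plays here can be illustrated by brute-force falsification over small bounds. The index expressions `wr_idx` and `rd_idx` below are hypothetical placeholders of our own, not the exact expressions of Listing 2.2:

```python
# Searches for a counterexample to a DRF formula of the shape
#   forall t1 != t2, j1, j2 in [0, M): wr_idx(t1, j1) != rd_idx(t2, j2)

def wr_idx(t, j):   # hypothetical write index of thread t at iteration j
    return j + 1

def rd_idx(t, j):   # hypothetical read index of thread t at iteration j
    return t

def counterexample(M, n_threads=2):
    for t1 in range(n_threads):
        for t2 in range(n_threads):
            if t1 == t2:
                continue
            for j1 in range(M):
                for j2 in range(M):
                    if wr_idx(t1, j1) == rd_idx(t2, j2):
                        return (t1, t2, j1, j2)   # witness of a race
    return None                                   # formula holds: DRF

print(counterexample(M=2))   # (0, 1, 0, 0): thread 0 writes the index read by thread 1
```

In the actual analysis the quantifiers are handled symbolically by the solver, so no bound on M or on the thread identifiers is needed.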

Analysis of Synchronized Protocols. The protocol in Listing 2.3 (left) models all the interactions over the shared array from Listing 2.1. This protocol consists of one outer loop that contains two inner loops separated by a barrier synchronization. The first inner loop writes to the array, while the second reads from it.

[Listing 2.3]

This protocol illustrates how our language syntactically distinguishes protocol fragments that synchronize from those that do not. Concretely, our language precludes an unsynchronized loop from synchronizing anywhere in its body \(u\), and it requires that a synchronized loop includes at least one synchronization. The superscript U (resp. S) stands for unsynchronized (resp. synchronized). This distinction can be inferred automatically and yields a compositional analysis, as we explain below.

The behavior of synchronized loops is difficult to analyze because they may contain data-races that span more than one iteration. For instance, an instruction of one iteration in Listing 2.3 may race with an instruction of the next iteration.

Our Approach. In this work we show that the DRF analysis of synchronized protocols can safely be reduced to a first-order logic formula when such loops are aligned, i.e., when there is at least one synchronization immediately before the loop and one at the end of its body. In Sect. 4.1 we show how to transform an arbitrary access memory protocol into an aligned protocol using a syntax-driven transformation technique called barrier aligning. Intuitively, barrier aligning normalizes loops so that they do not "leak" accesses between iterations. The right-hand side of Listing 2.3 shows the result of applying barrier aligning to the protocol from Listing 2.3 (left). Observe that the fragment before the aligned loop (line 1) corresponds to the unsynchronized part of the original loop (before its synchronization). The original loop itself is rearranged so that the part succeeding the synchronization is moved to the beginning of the aligned loop (lines 3–6). The fragment following the aligned loop (line 7) corresponds to the unsynchronized loop that appears after the loop in the original protocol.

In Sect. 4.1 we show that aligned protocols enable compositional verification: protocol fragments between two barriers can be analyzed independently. This compositional analysis is possible because (i) there is no causality between instructions except through synchronization, and (ii) aligned protocols syntactically delimit the causality induced by synchronization. For instance, the aligned protocol in Listing 2.3 can be reduced to analyzing the following three protocol fragments independently:

[protocol fragments omitted]

The first two protocols are handled like Listing 2.2 because they are unsynchronized. Representing a synchronized loop as a formula becomes possible when the protocol is aligned: both threads must share the same value of the loop variable at each iteration. Hence, we reduce the verification to asking whether for all N, M, \(t_1\), and \(t_2\), where \(t_1 \ne t_2\), the following holds:

[formula omitted]

Our technique identifies Listing 2.3 as racy since this formula is unprovable, i.e., a counterexample arises when \(r = 1\), \(t_1 = 0\), \(t_2=1\), \(j_1 = 1\), \(N > 1\), and \(M > 1\).

3 Access Memory Protocols

Fig. 2. Syntax, semantics, and properties of access memory protocols.

An access memory protocol describes the interaction between a group of threads and a single shared-memory location. A protocol records where in memory accesses take place, but abstracts away from what data is read from/written to memory. The language of protocols distinguishes an unsynchronized protocol fragment \(u\in \mathcal {U}\), which disallows synchronization, from a synchronized fragment \(p\in \mathcal {S}\), which must include a synchronization. The syntax and semantics of access memory protocols are given in Figure 2. Our operational semantics is inspired by the synchronous, delayed visibility (SDV) semantics from Betts et al. [6], where threads execute independently and communicate upon reaching a barrier.

Hereafter, \(i\), \(j\), \(k\) are metavariables over non-negative integers picked from the set \(\mathbb N\). An arithmetic expression \(n\) is either: an integer variable \(x\), an integer \(i\), or a binary operation on integers that yields an integer. A boolean expression \(b\) is either a boolean literal, an arithmetic comparison \(\diamond \), or a propositional logic connective \(\circ \). We write \(n\downarrow i\) when expression \(n\) evaluates to integer \(i\), where evaluation is defined in the natural way. We overload the notation for Boolean expressions, e.g., \(b\downarrow \mathtt {true}\) means that expression \(b\) evaluates to \(\mathtt {true}\).

Unsynchronized Fragment. A protocol \(u\in \mathcal {U}\) either does nothing, accesses a shared memory location (reads from/writes to index \(i\)), performs sequential composition, or loops. Figure 2 gives the semantics of unsynchronized protocols, which is parameterized by a set of thread identifiers \(\mathcal {T}\subseteq \mathbb N\), where \(|\mathcal {T}|\ge 2\).

Evaluation of an unsynchronized protocol \(u\) by a thread identifier \(i\), written \(u\downarrow _iP\), yields a phase, i.e., a set \(P\in \mathcal P\) of access values \(\alpha \in \mathbb {A}\). Each access value, or just access, consists of its issuing thread identifier \(i\), an access mode \(o\) (read/write), and an index \(j\). The do-nothing protocol produces no accesses. A memory access evaluates its index and creates a singleton phase. Sequencing and looping are standard. Loop ranges include the lower bound and exclude the upper bound. Similarly to SDV, Rule \(\mathcal {U}\)-par executes a copy of the unsynchronized code \(u\) for each thread \(i\in \mathcal {T}\) by replacing the special thread-identifier variable with \(i\), which results in the union of the accesses of all threads. To simplify the presentation we omit unsynchronized conditionals; however, they are included in our Coq formalism and are fully supported by Faial, see Sect. 5.
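These evaluation rules can be sketched as a small interpreter. The tuple encoding of protocols and the use of Python functions for index expressions are our own conventions, not the paper's syntax:

```python
# A protocol denotes the set of accesses (tid, mode, index) it performs.
# Index and bound expressions are functions of an environment mapping
# variables (including 'tid') to integers.

def eval_u(u, env):
    kind = u[0]
    if kind == 'skip':                     # do-nothing: no accesses
        return set()
    if kind == 'acc':                      # ('acc', mode, index_fn)
        _, mode, idx = u
        return {(env['tid'], mode, idx(env))}
    if kind == 'seq':                      # ('seq', u1, u2)
        return eval_u(u[1], env) | eval_u(u[2], env)
    if kind == 'loop':                     # ('loop', x, lo, hi, body)
        _, x, lo, hi, body = u
        out = set()
        for v in range(lo(env), hi(env)):  # lower incl., upper excl.
            out |= eval_u(body, {**env, x: v})
        return out
    raise ValueError(kind)

def eval_par(u, tids):                     # rule U-par: union over threads
    return set().union(*(eval_u(u, {'tid': t}) for t in tids))

# Each thread t writes index t, then reads every index in [0, 2):
proto = ('seq', ('acc', 'wr', lambda e: e['tid']),
                ('loop', 'j', lambda e: 0, lambda e: 2,
                 ('acc', 'rd', lambda e: e['j'])))
print(sorted(eval_par(proto, {0, 1})))
```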

Synchronized Fragment. A protocol \(p\in \mathcal {S}\) may perform barrier synchronization, run unsynchronized code \(u\), perform sequential composition, and loop. Figure 2 gives the semantics of a protocol, notation \(p\downarrow H\). Evaluation of a protocol \(p\) yields a history (ranged over by \(H\)), i.e., a list of phases (\(P\)) that records how memory was accessed. We use \(::\) as the list constructor and \(\cdot \) for the usual list concatenation operator. Histories are merged using the special \(\odot \)-operator.

A barrier synchronization creates two empty phases, corresponding to the phases before and after synchronization. Running an unsynchronized protocol yields a single phase containing all accesses performed by that protocol. Sequencing two synchronized protocols \(p\) with \(q\) merges the last phase of the former with the first phase of the latter, as these two phases run concurrently. The base case of a synchronized loop produces a singleton history containing the empty phase. Running one iteration of a synchronized loop sequences the history of the first iteration with the rest of the loop, by merging the two histories.
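The history-building rules above can be sketched as follows. For brevity, unsynchronized fragments are pre-evaluated to phases (sets of accesses); this encoding is ours, not the paper's:

```python
# A history is a list of phases; sync contributes two empty phases, and
# sequencing merges the last phase of the first history with the first
# phase of the second (those two phases run concurrently).

def h_sync():
    return [set(), set()]

def h_unsync(phase):
    return [phase]

def h_seq(h1, h2):          # merge last of h1 with first of h2
    return h1[:-1] + [h1[-1] | h2[0]] + h2[1:]

def h_loop(bodies):         # synchronized loop: fold iterations with h_seq
    hist = [set()]          # base case: singleton history, empty phase
    for h in bodies:
        hist = h_seq(hist, h)
    return hist

# wr by thread 0; sync; rd by thread 1 -> two phases, no phase mixes them
hist = h_seq(h_unsync({(0, 'wr', 1)}),
             h_seq(h_sync(), h_unsync({(1, 'rd', 1)})))
print(hist)   # [{(0, 'wr', 1)}, {(1, 'rd', 1)}]
```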

Next, we introduce the notion of well-formed protocols, a restriction of structurally well-formed protocols, see \( swf ({p})\) in Figure 2. We discuss how well-formedness is enforced in Sect. 5. We write \( fv (p)\) (resp. \( fv (n)\)) for the free variables of \(p\) (resp. \(n\)).

Definition 1

(Well-formed protocol, \(p\in \mathcal {W}\)). We say that a protocol is well-formed, notation \(p\in \mathcal {W}\), when \( swf ({p})\) holds and every synchronized loop executes at least one iteration.

DRF is formalized at the bottom of Figure 2. Two accesses are in a data-race (or racy) when there exist two different threads that access the same index \(k\), and one of these accesses is a write. Our definition does not distinguish between harmful and benign data-races (a benign data-race is one in which both threads write the same value). Phase \(P\) is safe iff each pair of accesses it contains is not racy. History \(H\) is safe when all of its phases are safe. We say that \(p\) is DRF iff \(p\downarrow H\) and \( safe (H)\).
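This definition translates directly into a check, sketched here with accesses as (thread, mode, index) tuples, an encoding of our own:

```python
from itertools import combinations

# Two accesses race when issued by different threads on the same index
# and at least one of them is a write.
def racy(a, b):
    (t1, m1, i1), (t2, m2, i2) = a, b
    return t1 != t2 and i1 == i2 and 'wr' in (m1, m2)

def safe_phase(P):
    return not any(racy(a, b) for a, b in combinations(P, 2))

def safe_history(H):
    return all(safe_phase(P) for P in H)

# Two reads of the same index are fine; a write/read pair is not:
print(safe_history([{(0, 'rd', 3), (1, 'rd', 3)}]))   # True
print(safe_history([{(0, 'wr', 3), (1, 'rd', 3)}]))   # False
```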

4 DRF-Preserving Transformations of Protocols

This section presents the main steps of the DRF analysis summarized in Figure 1: barrier aligning (Sect. 4.1) and splitting (Sect. 4.2).

This section also includes our key theoretical results. We establish that these steps preserve and reflect data-races (i.e., any and all data-races are found), see Theorem 1 and Theorem 3. We make precise the notion of compositionality that makes our approach scalable in Theorem 2.

4.1 Aligning Protocols

The first transformation step normalizes protocols by aligning synchronized loops, which in turn enables a form of compositional verification. The goal of the transformation is to produce protocols that belong to \(\mathcal {A}\), see the top of Figure 3.

Fig. 3. Aligning protocols.

Barrier aligning (or just aligning) is performed by function \( align \), given in the bottom half of Figure 3. The function returns a pair whose first element is an aligned and synchronized protocol, and whose second element is an unsynchronized protocol. Intuitively, the pair represents a sequence whose parts we delimit syntactically. We note that the output of \( align \), say \((q,\,u)\), can be trivially made into an aligned protocol by sequencing its two components. The case for synchronization is simple: \( align \) returns the input protocol as the first component of the pair and the do-nothing protocol as the second (the input protocol is already fully aligned). The case for sequence consists of the sequential composition of the pair aligned with unsynchronized code. Sequencing a pair with unsynchronized code \(u\) amounts to sequencing \(u\) with the outer-most piece of unsynchronized code present in \(q\).

Dealing with synchronized loops is more involved. Given a loop, we produce a protocol consisting of the fragment preceding the loop and the synchronized part of its first iteration (\(q_1\)), an aligned loop starting at \(n{+}1\), and the unsynchronized part of its last iteration. See Listing 2.3 for an example of protocol aligning. We note that we can always unroll the loop because the analysis only considers non-empty synchronized loops; we discuss how to enforce this assumption in Sect. 5.
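As a sketch of the effect of aligning, consider the special case of a loop whose body is pre(x); sync; post(x), with exactly one synchronization and a non-empty range (matching the well-formedness assumption). Computing the history both ways shows the transformation preserves phases; the phase encoding is ours:

```python
# Aligning hoists pre(lo) before the loop, fuses post(x-1) with pre(x)
# inside the loop (each iteration now ends at the barrier), and emits
# post(hi-1) after the loop, so no accesses "leak" across iterations.

def history_original(pre, post, lo, hi):
    phases = [pre(lo)]
    for x in range(lo, hi):
        # post(x) runs together with pre(x+1) of the next iteration
        nxt = pre(x + 1) if x + 1 < hi else set()
        phases.append(post(x) | nxt)
    return phases

def history_aligned(pre, post, lo, hi):
    phases = [pre(lo)]                       # hoisted first pre
    for x in range(lo + 1, hi):              # aligned loop from lo+1
        phases.append(post(x - 1) | pre(x))  # fused, barrier-terminated
    phases.append(post(hi - 1))              # trailing unsynchronized code
    return phases

pre = lambda x: {(0, 'wr', x)}
post = lambda x: {(1, 'rd', x)}
print(history_original(pre, post, 0, 3) == history_aligned(pre, post, 0, 3))
```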

We now state two fundamental properties of barrier aligning: preserving and reflecting DRF (Theorem 1), and enabling compositional verification (Theorem 2). Theorem 1 states that verifying DRF of a well-formed protocol \(p\) is equivalent to verifying DRF of its aligned counterpart.

Theorem 1

(Correctness of Align). If \({p}\in \mathcal {W}\) and \( align (p) = (q,\,u)\), then \(p\) is DRF if and only if \(q;u\) is DRF.

To state our compositionality result, we introduce a language of contexts:

[context grammar omitted]

The base cases correspond to a hole \([\, \_ \,]\) or an unsynchronized protocol followed by a hole. The other cases follow the structure of access memory protocols.

Theorem 2

(Compositionality). Let \(\mathcal {C}\) be a context that is DRF. For all \(p\in \mathcal {A}\), if \(p\) is DRF, then \(\mathcal {C}[p] \in \mathcal {A}\) and \(\mathcal {C}[p]\) is also DRF.

Compositionality allows Faial to analyze each fragment of an aligned protocol independently, by splitting the given protocol into multiple symbolic traces.

Fig. 4. Syntax and semantics of symbolic traces, and splitting of protocols.

4.2 Splitting Protocols into Symbolic Traces

The second verification step, splitting, consists of transforming an aligned protocol into symbolic traces, i.e., symbolic representations of the sets of memory accesses that occur between two synchronizations.

Symbolic Traces. Intuitively, symbolic traces are a thin abstraction over an SMT formula. We describe how to translate a symbolic trace to a formula in Sect.Ā 5.

We give the syntax and semantics of symbolic traces in Figure 4. One expression terminates a trace. An access expression states that thread \(n\) accesses index \(m\) with mode \(o\). Expression \(h_1;h_2\) composes two symbolic traces using operator \(\otimes \), also given in Figure 4. A variable binder binds variable \(x\) in \(h\), where variable \(x\) is an integer in the range induced from \(n\) and \(m\). The semantics of a symbolic trace yields a history with a phase for each possible variable assignment. The terminating expression yields a single empty phase. An access expression evaluates to a singleton set that contains the access value obtained by evaluating the thread-identifier expression \(n\) and the index expression \(m\). Sequencing traces \(h_1;h_2\) consists of taking the product of phases (operator \(\otimes \)), which merges every phase of \(H_1\) with every phase of \(H_2\). A variable binder behaves like the terminating expression when its range of values is empty. Otherwise, we fork two histories \(H_1\) and \(H_2\): we assign the lower bound of the range in \(H_1\), and we recursively evaluate a variable binder with an incremented lower bound in \(H_2\).

Barrier splitting is the transformation from aligned protocols to symbolic traces, performed via functions \( trace \) and \( split \), defined in Figure 4. Function \( trace \) extracts the symbolic trace of an unsynchronized program for a single thread. Memory accesses are tagged with the owner thread, and unsynchronized loops are converted into variable bindings. Function \( split \) returns a list of symbolic traces. The case for \(p;q\) is trivial (operator \(\cdot \) stands for list concatenation). The base case of \( split \) is for an unsynchronized protocol fragment \(u\), which produces a list containing a single symbolic trace. It introduces fresh variables \(t_1\) and \(t_2\) that represent two (distinct) symbolic thread identifiers. The rest of the trace consists of the trace of \(u\) instantiated to the first thread identifier \(t_1\) followed by its instantiation to the second thread identifier \(t_2\). The case for synchronized loops simply reinterprets the loop as a variable binder. Function \( split \) leads to an exponential blow-up w.r.t. the nesting of synchronized loops, but this has not posed problems in practice, cf. Claim 2.
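A minimal sketch of the resulting compositional check: each barrier-delimited fragment is encoded (our convention) as a function from a thread identifier to its set of (tid, mode, index) accesses, and enumerating thread-id pairs up to a bound stands in for the symbolic \(t_1\), \(t_2\):

```python
# Each fragment of a split protocol is analysed independently.

def fragment_racy(fragment, n_threads):
    for t1 in range(n_threads):
        for t2 in range(n_threads):
            if t1 == t2:
                continue
            for (_, m1, i1) in fragment(t1):
                for (_, m2, i2) in fragment(t2):
                    if i1 == i2 and 'wr' in (m1, m2):
                        return True
    return False

def protocol_racy(fragments, n_threads=4):
    # compositionality: any racy fragment makes the protocol racy
    return any(fragment_racy(f, n_threads) for f in fragments)

frag_ok = lambda t: {(t, 'wr', t)}        # each thread writes its own cell
frag_bad = lambda t: {(t, 'wr', t // 2)}  # threads 2k and 2k+1 collide
print(protocol_racy([frag_ok]), protocol_racy([frag_ok, frag_bad]))  # False True
```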

Example 1

Let . We have that \( split ({\hat{p}})\) returns:

[symbolic traces omitted]

We show that barrier splitting preserves and reflects DRF.

Theorem 3

Let \(p\in \mathcal {A}\), such that \(p\downarrow H_1\), and \(H_2 = [ H\,\mid \,h\in split (p) \wedge h\Downarrow H]\), then \( safe (H_1)\) if and only if \( safe (H_2)\).

Hence we have established that aligning (Theorem 1) and splitting (Theorem 3) preserve and reflect data-races, i.e., any and all data-races are found. Thus, the only source of approximation in our analysis stems from the inference of protocols from CUDA kernels, which we discuss in the next section. Theorem 3 highlights the compositionality of our analysis, as each symbolic trace resulting from function \( split \) can be analyzed independently.

5 Implementation

In this section we present our tool, Faial, that implements the steps described in FigureĀ 1. Faial takes a CUDA kernel as input and produces results that either identify the kernel as DRF or list specific data-races. In this section, we describe the implementation of the protocol inference, well-formedness checks, and transformation to SMT.

Inference. This step transforms a CUDA kernel into access memory protocols (one for each shared array). It uses libclang [23] to parse the kernel, a standard static single assignment (SSA) transformation to simplify the analysis of indices and arrays, and code slicing to retain only code related to shared array accesses. We note that Faial supports constructs of the CUDA programming model that are not directly modeled by access memory protocols, e.g., unstructured loops, conditionals, function calls, and multi-dimensional arrays. To support multi-dimensional thread identifiers, we extend the language of protocols to support multiple thread identifiers, and adapt function \( split \) accordingly. The main challenges are related to loops and function calls.

Whenever possible, loops are transformed into loops with a stride of 1, following ideas from loop normalization [24] and abstraction [30]. For instance, we change a stride of \(k\) into 1 by executing the loop body only when the loop variable is divisible by the stride. Similarly, a loop ranging over the powers of a base becomes a stride-1 loop whose body is guarded by a test of whether the loop variable is a power of that base. We approximate while-loops as structured loops with an unknown upper bound.
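Both rewrites can be sketched by comparing the iteration sets before and after (our own illustration, assuming a lower bound of at least 1 and a base of at least 2 in the geometric case):

```python
# A strided loop "for (x = lo; x < hi; x += k)" becomes a stride-1 loop
# guarded by a divisibility test; a geometric loop "x *= b" becomes a
# stride-1 loop guarded by a power-of-b test.

def strided(lo, hi, k):
    return list(range(lo, hi, k))

def normalised_strided(lo, hi, k):
    return [x for x in range(lo, hi) if (x - lo) % k == 0]

def is_pow(b, x):                 # assumes b >= 2 and x >= 1
    while x % b == 0 and x > 1:
        x //= b
    return x == 1

def geometric(lo, hi, b):         # assumes lo >= 1 and b >= 2
    out, x = [], lo
    while x < hi:
        out.append(x)
        x *= b
    return out

def normalised_geometric(lo, hi, b):
    return [x for x in range(lo, hi) if is_pow(b, x)]

print(strided(0, 10, 3) == normalised_strided(0, 10, 3))      # True
print(geometric(1, 16, 2) == normalised_geometric(1, 16, 2))  # True
```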

Function calls that manipulate shared memory are uncommon in GPU programming. Additionally, auxiliary functions that manipulate shared memory carry a compiler annotation to inline their bodies, hence we can inline such calls easily. Faial cannot handle recursive functions, but these rarely occur in practice. Function calls that do not access shared memory are simply discarded.

Well-Formedness. This step ensures that the kernels Faial analyzes meet the well-formedness conditions, i.e., \(p\in \mathcal {W}\), including the assumption that synchronized loops iterate at least once, see Definition 1. First, Faial annotates loops with a synchronized/unsynchronized tag according to the presence of a synchronization in the loop body, then adjusts the precedence of sequencing to group all unsynchronized code preceding a synchronization or a synchronized loop. Synchronized loops of well-formed protocols cannot manipulate thread-local variables, an assumption shared by the CUDA programming model; Faial flags such kernels as erroneous. Next, Faial adds assertions before/after synchronized loops to check that the loop range is non-empty, i.e., that loops execute at least once. Similarly to loops, conditionals are tagged as synchronized or unsynchronized. Then, Faial inlines synchronized conditionals: when a synchronized conditional is found, two copies of the input program are created and each copy is prefixed by a global assertion corresponding to the condition. Faial does not support synchronized conditionals that appear within synchronized loops; we have not found real-world kernels that include such a construction.
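The duplication step for synchronized conditionals can be sketched as follows. The tuple encoding and the helper are illustrative, not Faial's implementation, and, matching the restriction above, the sketch does not handle conditionals nested inside synchronized loops:

```python
# "if b then P else Q" with synchronizing branches is replaced by two
# copies of the program, one guarded by the global assertion b and one
# guarded by its negation.

def inline_conditionals(p):
    """Return a list of (assertions, conditional-free protocol) pairs."""
    if p[0] == 'if':                       # ('if', b, then_p, else_p)
        _, b, then_p, else_p = p
        out = []
        for asserts, q in inline_conditionals(then_p):
            out.append(([b] + asserts, q))
        for asserts, q in inline_conditionals(else_p):
            out.append((['not ' + b] + asserts, q))
        return out
    if p[0] == 'seq':                      # ('seq', p1, p2)
        return [(a1 + a2, ('seq', q1, q2))
                for a1, q1 in inline_conditionals(p[1])
                for a2, q2 in inline_conditionals(p[2])]
    return [([], p)]                       # leaves: sync, accesses, ...

prog = ('seq', ('if', 'b', ('sync',), ('acc', 'wr')), ('acc', 'rd'))
for asserts, q in inline_conditionals(prog):
    print(asserts, q)
```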

Quantification. This step transforms each symbolic trace (Figure 4) into an SMT formula, to check for safety, cf. Figure 2. The presented formalism assumes a decidable fragment. However, as CUDA programs may include multiplication in index expressions, Faial uses an undecidable SMT-LIB logic. Essentially, the generated formula guarantees that the indices of array accesses are distinct when there is at least one write. We illustrate this straightforward transformation with Example 2.

Example 2

The formula generated from the trace in ExampleĀ 1 is given below:

[formula omitted]

where each symbolic access is translated to a conjunction representing its index (\(\mathsf {idx}\)) and access mode (\(\mathsf {m}\)). Observe that the formula enforces that indices \(\mathsf {idx}_1\) and \(\mathsf {idx}_2\) (executed by distinct threads) are different.

For multi-dimensional arrays, we generate one pair of indices per dimension, and check that at least one pair is distinct.
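The per-dimension disjointness check amounts to a disjunction of disequalities: two cells of an n-dimensional array differ iff they differ in at least one coordinate. A minimal sketch:

```python
# Two index tuples denote distinct cells iff some coordinate differs,
# so the generated formula is a disjunction of per-dimension
# disequalities rather than a single pair of indices.

def distinct_cells(idx1, idx2):
    return any(a != b for a, b in zip(idx1, idx2))

print(distinct_cells((1, 2), (1, 3)))  # True: rows equal, columns differ
print(distinct_cells((1, 2), (1, 2)))  # False: same cell, potential race
```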

6 Experimental Evaluation

We evaluate Faial over several datasets and show how it fares against existing approaches. We structure this evaluation in three claims.

  • Claim 1: Correctness. We claim that our approach finds more bugs and raises fewer false alarms than existing tools. To evaluate this claim, we compare Faial against four state-of-the-art kernel verification tools over 10 kernels that are known to be tricky to analyze.

  • Claim 2: Scalability. We claim that our approach scales better to larger programs. To evaluate this claim, we compare Faial against other tools over a set of synthetic benchmarks designed to test the limits of each tool, in terms of run time and memory usage.

  • Claim 3: Real-world usability. We claim that our approach is more usable than existing static verification tools on real-world CUDA programs. To evaluate this claim, we use a varied dataset of real-world DRF kernels and measure the false alarm rate, run time, and memory usage of Faial, GPUVerify, and PUG.

Benchmarking Environment. To make our evaluation reproducible, we developed a benchmarking framework to automate our experiments over the different tools and datasets. For Claim 1 and Claim 3, we designed a tool-agnostic file format for kernel functions and associated metadata (e.g., expected result of DRF analysis, grid and block dimensions, and include directives). For Claim 2, we created a tool that generates kernels according to given templates, e.g., see Figure 7.

We evaluate Faial against the following verification tools: GPUVerify [5] v2018-03-22; PUG [24] v0.2; and GKLEE [26] and SESA [27] v3.0. Experiments for Claim 1 use an Intel i5-6500 CPU, 7.7 GB RAM, and Fedora 33, while Claim 2 and Claim 3 use an Intel i7-10510U CPU, 16 GB RAM, and Pop!_OS.

Excluded Tools. We excluded ESBMC-GPU [33] and Simulee [37] from the evaluation because we were unable to get them to run satisfactorily. Both tools have rudimentary support for verifying arbitrary CUDA kernels: ESBMC-GPU did not find a single data-race in our benchmarks, while Simulee produced false alarms for every DRF kernel given.

Table 1. Results for Claim 1. DRF indicates that a (static analysis) tool reported a test case as DRF. NRR indicates that a (symbolic execution) tool did not report any data-race. Label x/y indicates that the tool reported y data-races, x of which are actual races. A timeout label indicates that the tool did not terminate within 90 s. A test passes if the tool returns the expected result and all reported races are valid.

Claim 1: Correctness

We have selected a set of tricky kernels to expose false alarms and missed data-races in Faial, GPUVerify, PUG, GKLEE, and SESA. Our results are reported in Table 1. The dataset consists of 5 tests, each comprising two variations of the same kernel: one racy and one DRF. The racy version of Test 1 (c.f., Listing 2.1) contains an inter-iteration data-race; the DRF version adds a barrier synchronization after the second inner loop. Tests 2 to 4 expose various loop-related data-races; their protocols are given in Figure 5. In the racy version of Test 2, an access conflicts with an access of the first iteration. Similarly, in the racy version of Test 3, an access of the last iteration races with a subsequent access. In the racy version of Test 4, the last iteration of a nested loop races with the first iteration of the following loop. Test 5 exposes the abstraction gap between kernels and access memory protocols (which abstract away array elements), see Figure 6.
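To illustrate what an inter-iteration data-race is, the following sketch enumerates conflicts for a hypothetical access pattern in which every thread t writes A[t] in each loop iteration and reads A[t+1] afterwards (this pattern is our assumption for illustration; it is not Listing 2.1). Without a barrier between iterations, a read in iteration i can overlap a write issued by a neighbouring thread in iteration i+1.

```python
# Enumerate (iteration, thread) pairs whose accesses conflict across
# consecutive loop iterations: read A[t1 + 1] in iteration i vs
# write A[t2] in iteration i + 1, for distinct threads t1 and t2.
def inter_iteration_conflicts(num_threads, num_iters):
    conflicts = []
    for i in range(num_iters - 1):
        for t1 in range(num_threads):
            for t2 in range(num_threads):
                if t1 != t2 and t1 + 1 == t2:
                    conflicts.append((i, t1, i + 1, t2))
    return conflicts

# With 2 threads and 2 iterations: thread 0's read of A[1] in iteration 0
# conflicts with thread 1's write of A[1] in iteration 1.
assert inter_iteration_conflicts(2, 2) == [(0, 0, 1, 1)]
```

Inserting a barrier between iterations orders all iteration-i accesses before all iteration-i+1 accesses, which is why the DRF variant adds a synchronization.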

Fig. 5.
figure 5

Protocols for Tests 2 to 4, c.f., Claim 1, where the unbound variable is free and thread-global. One shading marks code that appears only in the DRF versions of first-iter and last-iter; another marks code that appears only in the racy version of last-iter-first-iter (Color figure online).

Fig. 6.
figure 6

Kernels and protocols for Test 5 (read-index), c.f., Claim 1; the value read becomes a free thread-local variable, as protocols do not model array elements.

Faial passes more tests than any other tool. The failure on Test 5 is caused by access memory protocols abstracting away what data is being read from/written to arrays, i.e., array elements: in each case, Faial reports one spurious data-race. We report on performance trade-offs w.r.t. tracking array elements in Claim 2.

GPUVerify passes Test 5 because it tracks array elements, but fails the remaining 4 tests. Some reported false alarms are ill-formed, e.g., on the racy component of Test 2, the report has disjoint indices.

PUG obtains the worst score amongst static tools. Notably, the tool misses a data-race in Test 1, demonstrating its unsoundness, c.f., Sect. 2.1.

GKLEE and SESA time out on tests that include loops, as the loop bounds are unknown. Both tools miss the data-race in Test 5. Symbolic tools may be able to report data-races when the bound is known: e.g., timeouts start in Test 1 when the bound is at least 2, and in Test 2 when the bound is at least 23,000.

Claim 2: Scalability

We evaluate the scalability of our approach with a synthetic dataset that demonstrates how different kernel constructs affect the run time and memory usage of Faial, GKLEE, GPUVerify, PUG, and SESA. Our dataset is divided into five categories, one per syntactic construct in the language of access memory protocols, as well as conditionals, which are supported by our inference step, c.f., Sect. 5. Figure 7 shows the protocols of the kernel patterns we generate in each category: (i) repeated accesses (read then write), (ii) repeated barrier synchronizations separated by writes, (iii) repeated conditionals, (iv) increasingly nested unsynchronized loops, and (v) increasingly nested synchronized loops. In each category, we vary the problem size by repeating a pattern from 1 to 50 times. Note that all kernels generated this way are DRF.
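A template-based generator of this kind can be sketched as follows (the emitted protocol syntax and the helper names are ours for illustration, not the actual tool's):

```python
# Generate textual protocol patterns for three of the five categories;
# the problem size n ranges over 1..50 in the evaluation.
def repeat_accesses(n):
    # Category (i): n repetitions of "read then write" at the thread's index.
    return "\n".join(["rd A[tid]; wr A[tid];"] * n)

def repeat_barriers(n):
    # Category (ii): n barrier synchronizations separating n + 1 writes.
    return "\nsync;\n".join(["wr A[tid];"] * (n + 1))

def nest_sync_loops(n):
    # Category (v): n nested synchronized loops around a single write.
    body = "wr A[tid]; sync;"
    for _ in range(n):
        body = f"for i in 0..N {{ {body} }}"
    return body

print(nest_sync_loops(2))
```

Each pattern is DRF by construction (every thread only writes its own cell, and writes are separated by barriers where needed), so any alarm a tool raises on these kernels is a false alarm.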

Fig. 7.
figure 7

Synthetic protocols generated for Claim 2, parameterized by a free thread-global variable and by positive integer literals.

Fig. 8.
figure 8

Results for Claim 2. Run times (left plots) are given on a logarithmic scale, and memory usage (right plots) on a linear scale. Flatter and lower curves are better. Tools annotated with a triangle are excluded due to timeouts or errors.

Figure 8 shows the average run time and memory usage over five runs, on logarithmic and linear scales, respectively. For each run, we set a timeout of 90 s, and we exclude any run that times out or reports a false alarm. Cutoffs in the memory plots are determined by the cutoffs in the run time plots.

Overall, Faial is the most scalable tool. In 4 out of 5 categories, Faial has the slowest growth across all experiments, and it verifies all tests within 0.46 s. At the largest problem sizes, our tool is the fastest in 3 categories (access, conditional, unsynchronized loop), 2nd for barriers, and 3rd for synchronized loops. The memory usage of Faial is competitive with the other tools. Faial is the only tool with near-constant time/memory for up to 50 unsynchronized loops, indicating the scalability of reducing unsynchronized loops to universally quantified formulas. Faial only times out on kernels that consist of more than 17 nested synchronized loops. Such kernels are uncommon: e.g., the real-world kernels studied in Claim 3 have at most 3 levels of nested synchronized loops.

GPUVerify remains stable in the barrier and conditional categories but is negatively affected by loops and accesses. Loops are a known bottleneck in GPUVerify [2]. In the access category there is an exponential slowdown due to GPUVerify keeping track of what data is being written to/read from arrays.

PUG remains stable as the number of barrier synchronizations grows, but is negatively affected by the number of conditionals and loops. PUG is the fastest tool on smaller inputs, but it raises false alarms in the access category, hence these measurements are omitted from the corresponding plots.

Fig. 9.
figure 9

Results for Claim 3, on a set of 227 DRF CUDA kernels.

We discuss GKLEE and SESA together, since SESA processes GKLEE's NVCC byte code output by concretizing variables before passing it to GKLEE itself. Two main factors negatively affect these symbolic execution tools: (i) the number of loops, since they unroll each loop; and (ii) the amount of bookkeeping required to keep track of what is read from/written to memory. Figure 8 shows clear exponential curves for the access and barrier synchronization categories. Observe that these tools time out immediately in the loop categories.

Claim 3: Real-World Usability

We evaluate the usability of our approach by comparing Faial with other static verification tools (GPUVerify and PUG) on real-world kernels w.r.t. false alarm rate and run time. We curated a set of CUDA kernels from [2], which consists of 4 benchmark suites (totaling 227 CUDA kernels): NVIDIA GPU Computing SDK v2.0 (8 CUDA kernels); NVIDIA GPU Computing SDK v5.0 (166 CUDA kernels); Microsoft C++ AMP Sample Projects (20 kernels); and gpgpu-sim benchmarks (33 kernels). All kernels are DRF and have been pre-processed by the authors of [2] to facilitate verification. Each kernel is in a distinct file, all dependencies are available, and kernels are annotated with minimal pre-conditions to allow for automatic analysis (e.g., the thread count is given).

As we aim to evaluate the fully automatic verification of the three tools, we removed code annotations (pre-conditions and loop invariants) specific to GPUVerify. Additionally, we made minor changes to some kernels to meet the limitations of the front-ends of Faial and PUG. For instance, in 22 kernels we converted nested array lookups to use temporary variables and inlined function calls that operate on arrays. Another 8 kernels were modified to simplify their control flow. Our curated dataset will be included in our artifact submission.

Figures 9a, b, and c give the correctness results of Faial, GPUVerify, and PUG, respectively. Correct refers to the true-positive rate, i.e., when the tool correctly identifies the kernel as DRF. False Alarm refers to the false alarm rate, i.e., when the tool incorrectly identifies the kernel as racy. A kernel is Unsupported if it makes the tool crash. A Timeout occurs when the tool exceeds the limit of 60 s to verify a kernel. The values shown are averages over five runs. Figure 9d shows the average run time and memory usage of every true-positive report (we omit invalid reports) across the three tools.
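The classification into these four outcomes can be sketched as follows (the labels follow the text; the mapping from raw tool verdicts is our assumption, exploiting the fact that every kernel in the dataset is DRF):

```python
from collections import Counter

# Map a single tool run's raw verdict to a Fig. 9 category. Because the
# dataset is all-DRF, a "racy" verdict is by definition a false alarm.
def classify(outcome):
    return {"drf": "Correct",        # true positive: verified DRF
            "racy": "False Alarm",   # incorrect racy verdict
            "crash": "Unsupported",  # tool crashed on the kernel
            "timeout": "Timeout"}[outcome]

def rates(outcomes):
    counts = Counter(classify(o) for o in outcomes)
    return {label: counts[label] / len(outcomes)
            for label in ("Correct", "False Alarm", "Unsupported", "Timeout")}

print(rates(["drf", "drf", "racy", "timeout"]))
```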

Overall, Faial has the highest rate of true-positives at \(96\%\). Our tool is second in terms of run time and memory usage, showing a good compromise between time and space. Faial verifies most kernels within 1 s, and all kernels that need more time are only verified by Faial. GPUVerify shows lower memory usage at the cost of a higher verification run time. PUG verifies the lowest number of kernels (\(34.8\%\)), as most kernels are unsupported (\(62.6\%\)).

7 Related Work

SMT-Based DRF Analyses. Li and Gopalakrishnan propose a direct encoding of DRF analysis of GPU programs in SMT, with PUG [24, 25]. Both PUG and Faial follow a similar approach of barrier splitting: having a symbolic representation of a canonical interleaving, and dividing up the analysis over barrier intervals. The two major distinctions are that (1) PUG misses inter-thread data-races in synchronized loops, e.g., Listing 2.1, and (2) the algorithms of PUG are unspecified and lack soundness proofs. In [24, Sect. 6.3] the authors identify the challenge of detecting inter-thread data-races, but do not elaborate a solution. Ma et al. [30] present a similar technique to detect data-races and deadlocks in OpenMP programs (CPU-based parallelism); however, their work does not guarantee DRF, and they do not formalize their algorithms. In [8], Prasanth et al. propose a polyhedral encoding of DRF for OpenMP programs, which is only applicable to programs with affine array accesses. However, the prevalence of linearized array expressions in GPU kernels is known to stump polyhedral analysis [16].

Hoare-Logic-Based DRF Analyses. The main drawback of Hoare-logic-based tools is their high rate of false alarms. They also require code annotations from a concurrency expert to handle loops. GPUVerify [2, 3, 5, 6, 12] can verify CUDA and OpenCL kernels using Boogie [4] as a backend. GPUVerify also relies on a two-thread abstraction (pen-and-paper proof); in this paper, we present the first machine-checked proof of the two-thread abstraction idea. VeriCUDA [20, 21] focuses on reasoning about the functional correctness of GPU programs using Hoare logic. In [22], the authors extend VeriCUDA to prove DRF. In a similar vein, VerCors [7] uses separation logic to prove the functional correctness and DRF of GPU kernels. Both VeriCUDA and VerCors expect a tool-specific language, hence cannot handle real-world kernels directly.

Data-Race Finders. These include dynamic data-race detection, symbolic execution, and model checking. Such techniques are better suited for highly detailed analysis of smaller kernels, and typically are unable to prove DRF. Dynamic data-race detection executes a kernel to find data-races on a fixed input, e.g., [14, 18, 19, 28, 32, 38, 39]. This technique only reports real data-races, but suffers from a slowdown of at least \(10{\times }\) compared to the non-instrumented program, and requires the kernel input data, which might be unavailable or unknown. Symbolic execution and model checking have been extended to detect data-races [10, 11, 26, 33, 37]. These techniques do without the kernel input data and can detect more data-races than dynamic data-race detection.

Miscellaneous. Ferrel et al. introduce a machine-checked formalism to reason about the semantics of CUDA assembly [15]. Dabrowski et al. mechanize the DRF-analysis of multithreaded programs [13]. Muller and Hoffmann present a logic to reason about the evaluation cost of CUDA kernels [31].

Other behavioral types have been used to verify parallel and multithreaded systems that communicate via message-passing [29, 35, 36]. However, these do not capture shared memory (only message-passing), and thus cannot address data-races.

8 Conclusion

We tackle the problem of statically checking DRF in GPU kernels with a new family of behavioral types, i.e., access memory protocols. We provide a novel compositional analysis of access memory protocols, along with fully mechanized proofs and an implementation. Our evaluation explores challenging and diverse benchmarks (229 real-world and 258 synthetic kernels) to demonstrate that our approach is more precise (in terms of false alarms and missed races), more scalable (in time/memory growth), and more usable (real-world kernels correctly verified) than other tools.