International Journal of Parallel Programming, Volume 44, Issue 2, pp 309–324

Exploiting GPUs with the Super Instruction Architecture

  • Nakul Jindal
  • Victor Lotrich
  • Erik Deumens
  • Beverly A. Sanders


The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays which are typically very large. The SIA consists of a domain specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedups ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the “programming-with-blocks” approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.


Keywords: Parallel programming · Tensors · GPU · Domain specific language

1 Introduction

A holy grail of parallel computing is to find ways to allow application programmers to express their algorithms at a convenient level of abstraction and obtain good performance when the program is executed. In addition, it is desirable to be able to easily port applications to new architectures as they become available. Of particular current interest are heterogeneous systems with accelerators.

The Super Instruction Architecture (SIA) is a parallel programming environment designed for computations dominated by tensor algebra with very large dense arrays. The work was motivated by the concrete need to implement parallel software for coupled cluster methods for electronic structure prediction, an important class of problems in computational chemistry.

Most problems of interest in this domain will contain arrays that are too large to fit into the memory of a single processor, and sometimes are too large to be completely contained in the collective memory of all the processors, requiring them to be backed on disk. Thus, these arrays will necessarily be decomposed into blocks and distributed throughout the system.

Algorithms in the domain are extremely complex, and in many cases more significant improvements in performance can be obtained from algorithmic improvements than from tuning the details of the parallel program. Thus it is important that a programming environment offer a convenient level of abstraction for expressing algorithms that makes it easy for computational chemists to experiment.

The SIA consists of a domain specific programming language, Super Instruction Assembly Language (SIAL, pronounced “sail”) and its runtime system, the Super Instruction Processor (SIP). Computational chemists express their algorithms in SIAL, a simple parallel programming language with a parallel loop construct and intrinsic support for distributed and disk-backed arrays. An important feature of SIAL is that algorithms are expressed in terms of operations called super instructions whose arguments are (usually) blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. SIAL programs are compiled into SIA bytecode which is interpreted by the SIP. The SIP is a parallel virtual machine that manages the complexities of dealing with parallel hardware, including communication and I/O.

The SIA has been implemented and ported to several different architectures and has been successfully used to implement ACES III [1, 9, 10], a software package for computational chemistry which provides parallel implementations of several advanced methods for electronic structure computations using dozens of SIAL programs. ACES III is not a toy application or academic exercise; it is a serious tool used by computational chemists. It pushes boundaries in both high performance computing and computational chemistry and is available for download [1] under the GNU GPL.

As GPUs are becoming increasingly important sources of computation cycles on systems ranging from laptops to leadership class supercomputers, it is vital that computational chemists using ACES III be able to utilize these hardware resources. In this paper, we describe an extension to the SIA to enable SIAL programs to exploit GPUs. The approach is to extend SIAL with a set of directives that delineate regions of the program that should be executed on GPUs and specify data movement. The directives are given in terms of the blocks of multidimensional arrays, the data structures intrinsic to SIAL and supported by the SIP, and are thus simpler than the general purpose directive-based programming models. In contrast to most of the prior work done in implementing electronic structure computations on GPUs, this is not an effort to provide a new GPU implementation of a specific computational chemistry method. Rather it is an extension to the SIA programming environment that enables computational chemists to easily “GPU-enable” new or existing SIAL programs.

As mentioned above, in SIAL algorithms are expressed in terms of operations on blocks (or tiles) of multidimensional arrays, called super instructions, rather than individual floating point numbers. Although blocking arrays is a well-known technique in parallel programming, it is rarely supported at the programming language level. Super instructions simply take blocks as input and generate new blocks as output and do not involve communication. They are implemented in any convenient programming language such as Fortran, C/C++, and now CUDA [13], and thus can take advantage of high quality optimizing compilers. Frequently used super instructions such as tensor contractions are supported by the language syntax; additional super instructions can be implemented by domain programmers. SIAL is used as a high level scripting language to orchestrate the computation.

Expressing algorithms in terms of blocks enhances programmer productivity by eliminating the need for tedious and error-prone index arithmetic. This is very natural in the domain and has several significant consequences:
  • Data is naturally handled at a granularity that can be efficiently moved between nodes.

  • Computation steps will be time consuming enough for the runtime system to be able to effectively, and automatically, overlap communication and computation.

For the purposes of conveniently exploiting GPUs, programming with blocks provides the following additional benefits:
  • The computation is already partitioned into tasks that map conveniently and efficiently onto CUDA kernels.

  • Most super instructions lend themselves to straightforward data parallel implementations.

2 Overview

In this section, we give a brief overview of the SIA, first describing the programming with blocks concept, then the DSL, SIAL, and its runtime system, SIP. A more complete description can be found in Sanders et al. [15].

2.1 Programming with Blocks

Consider the following term expressing the contraction\(^{1}\) of two four-dimensional tensors (indexed by \(\mu , \nu , \lambda \), and \(\sigma \); and by \(\lambda , \sigma , i\), and \(j\), respectively) and yielding a third four-dimensional tensor (indexed by \(\mu , \nu , i\), and \(j\)).
$$\begin{aligned} R^{\mu \nu }_{ij} = \sum _{\lambda \sigma } V^{\mu \nu }_{\lambda \sigma } T^{\lambda \sigma }_{ij} \end{aligned}$$  (1)
This could be expressed as a contraction on blocks as
$$\begin{aligned} R(M, N, I, J)^{\mu \nu }_{ij} = \sum _{LS} \left( \sum _{\lambda \in L} \sum _{\sigma \in S} V(M,N,L,S)^{\mu \nu }_{\lambda \sigma } T(L,S,I,J)^{\lambda \sigma }_{ij} \right) \end{aligned}$$  (2)
In this form, the indices \(M,N,L,S,I,J\) refer to segments formed by partitioning the range of each index; the block \(V(M, N, L, S)\) is itself a four-dimensional array containing \(seg_M*seg_N*seg_L*seg_S\) elements, where \(seg_J\) is the segment size along that rank,\(^{2}\) which typically would be chosen to be between 10 and 50. If we define the operator ‘*’ appropriately as the contraction operator on blocks, then we can rewrite Eq. 2 as
$$\begin{aligned} R(M,N,I,J) = \sum _{LS} V(M,N,L,S)* T(L,S,I,J) \end{aligned}$$  (3)
Equation 3 is the way that the SIAL programmer thinks of the problem, which only involves the segment indices. The super instruction implementing ‘*’ internally performs the two inner summations over individual elements shown in Eq. 2.
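
To give a sense of the granularity involved (a back-of-the-envelope figure assuming, for illustration, a segment size of 40 along every rank), a single block of a four-dimensional array holds
$$40^4 = 2{,}560{,}000 \ \text{elements} \;\approx\; 20\ \text{MB of double-precision data},$$
large enough to amortize the cost of moving it between nodes, yet small enough that several such blocks fit comfortably in the memory of a single worker or GPU.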

2.2 SIAL

The most important features of SIAL are intrinsic distributed and disk backed arrays, explicit parallelism with a pardo statement, and support for expressing algorithms in terms of blocks of multi-dimensional arrays.

2.2.1 Arrays and Indices

SIAL exposes the following qualitative differences in the size of arrays: small enough to fit in the memory of a single process, distributed, and disk-backed. This is done by offering several array types: static, local, temp, distributed, and served. Static arrays are small and replicated in all processes. Distributed arrays are partitioned into blocks and distributed. Served arrays, also partitioned into blocks, are stored on disk and cached by IO server processes. Local and temp arrays are local to a worker process and are used for holding intermediate results. In the SIA extension, blocks of local and temp arrays may be allocated in GPU memory.

The shape of an array is defined in its declaration by specifying index variables for each dimension. As a result, the size of an array and the size of its segments are known and fixed during the program execution, but need not be known when the program is written.
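
For illustration (a sketch only; the index and array names are invented for this example and the exact declaration syntax in ACES III SIAL may differ slightly), such declarations take the following general form:

  aoindex mu = 1, norb          # atomic-orbital segment index; range fixed at initialization
  aoindex nu = 1, norb
  moindex i  = 1, nocc          # molecular-orbital segment index
  moindex j  = 1, nocc

  static      S(mu,nu)          # small; replicated in every worker process
  distributed R(mu,nu,i,j)      # partitioned into blocks spread over the workers
  served      T(mu,nu,i,j)      # partitioned into blocks stored on disk, cached by IO servers
  local       LT(mu,nu,i,j)     # per-worker intermediate, explicitly allocated and deallocated
  temp        t1(mu,nu,i,j)     # per-worker intermediate, managed automatically by the runtime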

One-sided access to blocks of distributed arrays occurs via get and put commands. Analogous commands for served arrays are request and prepare. Both put and prepare have variations that atomically accumulate results into the block on the home node or IO server.

Local and temp arrays are used within a process to hold intermediate results. Local arrays are explicitly allocated and deallocated and are typically fully formed in at least one dimension. Temp arrays are automatically allocated and deallocated from CPU memory by the runtime system. Blocks of local and temp arrays may be allocated and, if necessary, initialized in GPU memory. This will be explained in more detail in the context of the example in Sect. 3.1.

2.2.2 Expressing Coarse-Grained Parallelism

Coarse-grained parallelism, where tasks are mapped onto MPI processes, is explicitly expressed using a pardo command that is given a list of index variables and an optional list of where clauses, each with a boolean expression. The SIP executes iterations, in parallel across MPI processes, over all combinations of values in the ranges of the given indices that also satisfy the where clauses. The where clause is most frequently used to eliminate redundant computation when arrays are symmetric, as sketched below. Scheduling of pardo iterations and mapping onto processors is done by the SIP.
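
For example (a sketch with invented index names; the precise form and placement of the where clause may differ), a pardo over a symmetric pair of indices can skip the redundant half of the iteration space:

  pardo mu, nu, i, j
  where mu <= nu                # arrays symmetric in (mu,nu): compute each unordered pair once
    # ... operate on the blocks selected by (mu, nu, i, j) ...
  endpardo mu, nu, i, j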

2.2.3 Other Control Structures

Other control structures include procedure calls, if and if-else commands, and a do loop. The latter is given a single index variable and conducts a sequential iteration over the range of the index variable. Typically, a computation will require looping over blocks of two or four dimensional arrays. The combination of the pardo loop with the sequential do loop provides a convenient and straightforward way for the SIAL programmer to structure computations.

2.2.4 Example

Figure 1 shows a fragment of a SIAL program that computes the term shown in Eq. 2.
Fig. 1

SIAL implementation of Eq. 2

The pardo M, N, I, J statement in line 1 specifies parallel execution of the loop over the segment indices M, N, I and J. Their declarations have been omitted; the ranges of the indices were given when the index variables were defined. For example, we might have a declaration such as aoindex M \(=\) 1, norb, where norb (number of orbitals) is a symbolic constant, and distributed R(M, N, I, J) declares a distributed array R in terms of M and other index variables. Each worker task will perform the iterations for the index values that have been assigned to it. For each iteration, a block of the temp array tempsum will be used to accumulate the results of the summation and is local to the task. All of its elements are initialized to 0.0 in line 2. The statements do L and do S are sequential loops over the complete ranges of L and S. The statement get T(L, S, I, J) obtains the indicated block of the distributed array T. If the block happens to be stored at the local processor, then the statement does nothing; otherwise it initiates an asynchronous communication to request it from whichever processor holds it. In line 6, V represents an array of 2-electron integrals which could be as large as eight TB. Rather than storing the entire array, each block of V is computed on demand using the super instruction compute_integrals. In lines 7–8, the contraction of a block of T and a block of V is computed and stored in the local array tmp, while tempsum accumulates the sum in line 9. The super instruction implementing the contraction operator, ‘*’ in line 8, ensures that the necessary blocks are available and waits for them if necessary. The put statement saves the result to a distributed array, R. Like the get instruction, a put instruction may require communication with another processor if the indicated block is stored elsewhere. The sip_barrier instruction causes each worker process to wait until the execution of all code above the barrier, including the handling of any pending or in-transit messages involved in block transmission, has completed at all worker processes. The runtime system, SIP, handles the data allocation, communication, and locking necessary to properly implement the SIAL semantics.
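
Since the figure itself is not reproduced here, the following sketch, reconstructed from the description above, indicates the general shape of such a fragment (the exact syntax, and the line numbers cited in the text, belong to the original figure):

  pardo M, N, I, J                            # parallel loop over the segment indices
    tempsum(M,N,I,J) = 0.0                    # temp block accumulating the summation
    do L                                      # sequential loops over the full
      do S                                    #   ranges of L and S
        get T(L,S,I,J)                        # asynchronously fetch a block of distributed array T
        compute_integrals V(M,N,L,S)          # compute the needed block of 2-electron integrals on demand
        allocate tmp(M,N,I,J)                 # local block holding one partial contraction
        tmp(M,N,I,J) = V(M,N,L,S)*T(L,S,I,J)  # blocked contraction; waits for T if still in transit
        tempsum(M,N,I,J) += tmp(M,N,I,J)      # accumulate the partial result
        deallocate tmp(M,N,I,J)
      enddo S
    enddo L
    put R(M,N,I,J) = tempsum(M,N,I,J)         # save the finished block to distributed array R
  endpardo M, N, I, J
  sip_barrier                                 # wait for all workers and any in-flight block transfers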

2.3 SIP

SIAL programs are compiled into SIA bytecode, which is interpreted by the SIP. The SIP is a parallel virtual machine written in C/C++, Fortran, and MPI that manages the complexities of dealing with parallel hardware, including communication and I/O.

The SIP is organized as a master, a set of workers, and a set of I/O servers, each implemented (in the current release) using a sequential MPI process. When execution of a SIAL program is initiated, the master performs the management functions required to set up the calculation. The focus of the SIP design effort was to produce a well-engineered system that can efficiently execute SIAL programs and be easily ported to and tuned for different systems. A design principle of the SIP is to maximize asynchrony; all message passing is asynchronous and all barriers are explicit.

The SIAL get, put, prepare and request statements may require transferring blocks between nodes, either another worker node for a distributed array or an IO Server for a served array. The SIP first determines whether the indicated block is available at the current node. It may be available because it was assigned to be stored there, or because it is still available in the block cache from a recent use. If not, non-blocking communication is initiated to acquire or send the indicated block using information in the block’s data descriptor. As much as possible, instructions are executed asynchronously: those involving communication are started and then control returns to the SIP task so that more computations or different communications can be performed. When an instruction needing a block executes, it will transparently wait if the communication to acquire the block is still in progress.

2.4 Super Instructions

A SIAL programmer has a rich collection of super instructions at his or her disposal. Super instructions are provided for a variety of operations, including I/O and utility functions. Computational super instructions perform computationally intensive operations on blocks; they simply take blocks as input and generate new blocks as output and do not involve communication. For example, the super instruction implementing the (blocked) contraction shown in Eq. 2 provides an implementation for the ‘*’ operator as \(\sum _{\lambda \in L} \sum _{\sigma \in S} V(M,N,L,S)^{\mu \nu }_{\lambda \sigma } T(L,S,I,J)^{\lambda \sigma }_{ij}\). Contractions are often implemented using a two-dimensional matrix–matrix multiplication (DGEMM) combined with permutations of the input and result arrays that depend on the order of the indices. Super instruction implementations intended for the CPU can use Fortran or another general purpose programming language and thus can take advantage of existing high quality optimizing compilers. CUDA implementations have been provided for the intrinsic computational super instructions, enabling them to be executed on a GPU.
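
Concretely (a sketch of the standard reduction, stated under the assumption that the block's indices have already been permuted into the order shown), grouping the index pairs turns the block contraction into an ordinary matrix product,
$$R_{(\mu \nu ),(ij)} = V_{(\mu \nu ),(\lambda \sigma )}\, T_{(\lambda \sigma ),(ij)},$$
where the three matrices have shapes \((seg_M seg_N) \times (seg_I seg_J)\), \((seg_M seg_N) \times (seg_L seg_S)\), and \((seg_L seg_S) \times (seg_I seg_J)\), respectively, so the whole block contraction reduces to a single DGEMM on small dense matrices.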

3 Extensions for GPU Utilization

The SIA provides a straightforward approach to enabling applications to utilize GPUs by mapping super instructions to kernels. The super instructions, in most cases, lend themselves to straightforward data parallel implementations. For example, the contraction operator performs permutations on the GPU and calls CUBLAS DGEMM to perform the matrix multiplication.

Directives have been added to SIAL which perform the following functions:
  • Indicate which parts of a SIAL program should be executed on a GPU (if available). These should be the most computationally intense parts of the program.

  • Explicitly manage memory allocation on the GPU and data transfer between the host and device.

The consequence is that data can remain on the GPU exactly as long as it is useful there, and thus can be reused by multiple super instructions. Only results that will be used later in the computation need be transferred to the CPU. These directives are shown in Table 1. Arguments taken by the directives are blocks.
Table 1

Directives for GPU use

  gpu_begin                Start of a region of SIAL code whose super instructions will be executed on the GPU if one is available
  gpu_end                  End of the region of SIAL code
  gpu_allocate ⟨block⟩     Allocate memory on the GPU to hold the local or temp block ⟨block⟩ (initialized to 0)
  gpu_free ⟨block⟩         Free the GPU memory associated with ⟨block⟩
  gpu_put ⟨block⟩          Copy the local or temp block ⟨block⟩ from the CPU to the GPU; if necessary, allocate memory on the GPU
  gpu_get ⟨block⟩          Copy the contents of ⟨block⟩ from the GPU back to the CPU

Since a hardware platform may have fewer GPUs than compute cores, SIAL programs with GPU directives should remain correct when no GPU is available on a particular core and be able to perform the calculation on the CPU. This property is supported by the SIP.

3.1 Example

In this section, we show a fragment of a CCSD calculation that has been annotated for GPU execution.\(^{3}\) Declarations of index variables and arrays are not shown; TAO_ab and T2AO_ab are 4-dimensional served (disk-backed) arrays, LTAO_ab, LT2AO_ab1, and LT2AO_ab2 are local arrays, and Yab and Y1ab are temp arrays.

The PARDO lambda, sigma statement in line 1 sets up the parallel computation. The index space, formed by the ranges of segment indices lambda and sigma, is partitioned among the worker processes and the instances of the body are performed in parallel. Exactly how this is done is determined by the chosen load balancing mechanism. The next few statements allocate blocks of local array LTAO_ab and fill them with data obtained from served array TAO_ab. The DO command, first seen in line 10, indicates a serial loop over the range of the given index variable. The first GPU directive, gpu_begin, appears in line 17. If the node has a GPU, it will be used for the subsequent super instructions. If not, the GPU-related directives will have no effect, and the entire computation will be performed on the CPU. The command gpu_put allocates memory on the GPU and initializes it with data copied from the indicated block on the host. gpu_allocate allocates a temp block on the GPU. Lines 35–38 perform calculations on the GPU; these intrinsic super instructions, such as the contraction in line 35, are implemented as CUDA kernels. Note that it is not necessarily the case that each block of an array is the same size, so the temp blocks allocated on the GPU are freed and reallocated rather than being reused in the next iteration.

Lines 44–54 copy results stored in blocks of the arrays LT2AO_ab1 and LT2AO_ab2 back to the CPU and free GPU memory. The remainder of the code accumulates the computed results into blocks of the served array T2AO_ab in lines 59–60 and frees CPU memory.
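
Since the figure itself is not reproduced here, the following simplified sketch (with invented array and index names; it is not a line-for-line reconstruction of Fig. 2, and the exact syntax of some statements may differ) illustrates the general pattern in which the directives are used:

  pardo M, N, I, J
    gpu_begin                                   # super instructions below execute on the GPU if one is present
    gpu_allocate acc(M,N,I,J)                   # accumulator temp block created directly in GPU memory
    do L
    do S
      allocate LV(M,N,L,S)                      # local blocks staged on the host ...
      allocate LT(L,S,I,J)
      request V(M,N,L,S)                        # ... and filled from served arrays
      request T(L,S,I,J)
      LV(M,N,L,S) = V(M,N,L,S)
      LT(L,S,I,J) = T(L,S,I,J)
      gpu_put LV(M,N,L,S)                       # copy staged blocks host -> device, allocating if needed
      gpu_put LT(L,S,I,J)
      gpu_allocate tmp(M,N,I,J)                 # block sizes may vary, so allocate per iteration
      tmp(M,N,I,J) = LV(M,N,L,S)*LT(L,S,I,J)    # contraction runs as a CUDA kernel; result stays on the GPU
      acc(M,N,I,J) += tmp(M,N,I,J)              # accumulate on the GPU; nothing moves back yet
      gpu_free tmp(M,N,I,J)
      gpu_free LV(M,N,L,S)
      gpu_free LT(L,S,I,J)
      deallocate LV(M,N,L,S)
      deallocate LT(L,S,I,J)
    enddo S
    enddo L
    gpu_get acc(M,N,I,J)                        # copy only the finished block back device -> host
    gpu_free acc(M,N,I,J)
    gpu_end
    prepare R(M,N,I,J) += acc(M,N,I,J)          # accumulate into the served result array
  endpardo M, N, I, J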

4 Experimental Results

In this section, we provide results from experiments on Blue Waters, the NSF-funded petascale supercomputer at the National Center for Supercomputing Applications (NCSA). Blue Waters contains Cray XE6 cabinets and Cray XK7 cabinets; the XK7 nodes are equipped with NVIDIA Tesla Kepler GPUs. To eliminate, as much as possible, extraneous effects that might affect the timings, the GPU and CPU computations presented were run immediately after one another on exactly the same set of machines.

Figure 3 shows the total time and idle time (when nodes are waiting for data to arrive from another node) of a relatively small calculation ranging from 4 to 32 processors, each with a GPU attached. This was a CCSD calculation for \(\text{ Ar }_4\) in the cc-pVQZ basis with 236 basis functions and 36 correlated electronic orbitals. The segment sizes were 42 for arrays representing atomic and virtual orbitals and 36 for occupied orbitals; thus a block of an N-dimensional array would contain between \(36^N\) and \(42^N\) elements. For each processor count, the first bar is the total time using GPUs, and the second bar is the total time without GPUs. Speedups ranged from 2.0 to 2.2. The third and fourth bars show the total idle time, i.e., the time spent waiting for data to arrive from a different node, in the computation. As would be expected from the structure of the computation,\(^{4}\) these values are nearly the same with and without GPUs.
Fig. 2

SIAL CCSD fragment

Fig. 3

Total and idle time for the \(\text{ Ar }_4\) CCSD calculation

Figure 4 shows the time required for three of the most time-consuming procedures in the CCSD calculation. For each processor count, the first two bars show the time required with and without GPUs respectively for the LADDER procedure which scales as \(N^4O^2\). The third and fourth bars show the time for the WAEBF procedure, which scales as \(V^2O^4\). The fifth and sixth bars show the time for the WMEBJ procedure which scales as \(V^3O^3\) with a relatively large prefactor.
Fig. 4

Time consuming procedures in the \(\text{ Ar }_4\) CCSD calculation. Ladder scales as \(N^4O^2\), WAEBF scales as \(V^2O^4\), and WMEBJ scales as \(V^3O^3\) with a relatively large prefactor

CCSD(T) calculations are more accurate than CCSD, but the accuracy comes at a significant computational price. Figure 5 shows timing results for the (T) contribution for the same molecule, \(\text{ Ar }_4\), and the same basis set as in Figs. 3 and 4. However, for this calculation, the segment sizes were reduced in order for the calculation to fit in the available CPU memory.
Fig. 5

Triples contribution of Ar4 in CCSD(T) calculation

The (T) contribution comprises two parts, aaa and aab, consisting of 9 and 8 permutation steps, respectively. Each permutation step involves an initial permutation of the matrices, followed by a contraction, followed by a permutation; the different steps perform different permutations, but are otherwise the same. Figure 6 shows the time per permutation step. The varying results for the CPU-only computations reflect the different memory access patterns and their interaction with caches on the CPU. When the permutations and contractions are performed on the GPUs, the timing results are much more uniform.
Fig. 6

Time per permutation in (T) contribution

The \(\text{ Ar }_4\) molecule used in the previous results, being relatively small and admitting well-behaved calculations, is useful for studying the performance of GPU-enabled ACES III. However, it is also desirable to show results for larger scale calculations of genuine scientific interest. In Fig. 7, we show timing results for RDX, an organic explosive that requires significantly more computational power than \(\text{ Ar }_4\). These calculations used 534 basis functions (cc-pVTZ basis) with 84 correlated electrons and segment sizes of 21, 34, and 42 for atomic, virtual, and occupied orbitals, respectively. Note that the vertical axis is now hours rather than seconds, and the number of processors ranges from 500 to 1,000. The speedup achieved by exploiting the GPUs ranges from 3.4 for 1,000 processors to 3.7 for 500.
Fig. 7

Timing results for the RDX calculation

5 Discussion

The SIA naturally decomposes the programming effort in the application domain into computational kernels written in a general purpose programming language, and SIAL programs. The SIAL programs orchestrate coarse-grained parallelism, manage the flow of data across the system, and specify the sequence of super instructions that will be executed. There is a high degree of reuse of super instructions; typically, many of the super instructions required for a new SIAL program will already exist, either because they are intrinsic in SIAL or because they have already been implemented for another calculation.

The extension of the SIA to exploit GPUs follows the same philosophy. The super instructions are implemented in a general purpose language, CUDA, and the orchestration of the computation is still handled at the SIAL level with the addition of the provided directives. Thus, if the GPU implementations are available, the task of the SIAL programmer above and beyond a CPU-only version is to identify the parts of the SIAL program that would be beneficial to execute on the GPU and add directives to specify memory allocation on the device and the data transfer between host and device. Future work will further develop the SIAL compiler to automate much of this task.

If some part of the code to be executed on the GPU uses a super instruction that does not yet have an implementation for the GPU, this must be provided. This is the same process that is used for providing programmer-implemented super instructions. Because the super instructions simply take blocks as input and produce blocks as output, perform no communication (on the CPU) or data transfer (on the GPU), and operate on dense arrays, it is fairly straightforward to obtain data parallel implementations. For example, the implementation of the contraction requires (possibly) permuting a multi-dimensional array, making a CUBLAS DGEMM call, and (possibly) permuting the results. The SIAL directives are very simple because they are given in terms of the data types intrinsically supported by SIAL and supported by the SIP.

6 Related Work

Several general purpose directive-based programming models have been proposed for GPUs. A few examples include hiCUDA [5], OpenMPC [7], OpenACC [14], and OpenMP for Accelerators [2]. Several of these have been evaluated and compared by Lee and Vetter [8]. In contrast to the SIAL directives, which orchestrate the interaction of the GPU and host in a large-scale parallel computation requiring distributed data structures and inter-node communication, these models are aimed at allowing developers to work at a higher level of abstraction than CUDA, and do not yet target large scale multi-node computations. It might be possible to exploit one of these models when implementing super instructions, particularly where a CPU implementation already exists; a GPU-enabled version could then be obtained by adding directives rather than reimplementing in CUDA. However, even if we were not trying to leverage the investment already made in developing ACES III using the SIA and were starting from scratch, directive-based approaches would not offer much help for a large-scale parallel program such as ACES III.

Previous implementations of coupled cluster methods in computational chemistry on GPUs have been reported [3, 4, 11]. In these papers, the approach was to take a specific algorithm and implement it on a GPU in a highly optimized form. Because of the rapid development in GPU architectures, some of these optimizations are obsolete, or at least offer significantly less payoff now than when the papers were written.

The goal of Ma et al. [12] was to improve the implementation of contraction operators on a GPU by generating CUDA code that directly implements the contractions, optimized for the particular order of indices in the contraction. This should achieve better performance for a contraction than using permutations and a two-dimensional matrix–matrix multiply as we did. The generated CUDA contraction implementations were embedded into the computational chemistry package NWChem, and significant speedups were shown for a CCSD(T) calculation run on hybrid CPU/GPU systems. Optimized contraction operators could also be incorporated into ACES III.

7 Conclusion and Future Work

We have described an enhancement to the SIA to allow GPUs to be exploited in heterogeneous systems. The benefits for the end user of ACES III will be substantial; for example, using GPUs reduces the time required for CCSD(T) calculations on the RDX molecule on 1,000 cores from nearly 10 h to about 3 h.

The changes to the SIA were the implementation of a set of super instructions as CUDA kernels and fairly minor changes to the runtime system. The organization of the SIA allowed the effort to enable GPUs to proceed incrementally as more CUDA implementations were provided. At this point, we have GPU-enabled implementations for all of the intrinsic super instructions and those required by CCSD and CCSD(T) calculations. Changes to SIAL programs involved identifying the computationally intensive parts of the code and inserting directives indicating which super instructions should be executed on the GPU, when memory for a block should be allocated on the GPU, and when the data belonging to a block should be moved between the GPU and host, all expressed in terms of the abstractions supported by SIAL, the SIA's DSL.

The block-oriented approach of the SIA naturally decomposes computations into coarse-grained units that can be efficiently transferred between nodes and computed in parallel with good performance. The provision of a DSL enhances programmer productivity and provides a programming language infrastructure (type checking, AST generation etc.) that has been used to develop SiPMap [6], a tool that can automatically generate performance models from SIAL source code. The success of the enhancements to utilize accelerators provides further evidence supporting the benefits of the ideas underlying the SIA.

Future work will enhance the SIAL compiler to help automate placement of these directives. The first step will automatically determine appropriate memory allocation and data movement directives given programmer-inserted gpu_begin and gpu_end statements. Further efforts will explore using performance models to provide further automation.


Footnotes

  1. Tensor contraction operations occur frequently in the domain and are defined as follows: Let \(\alpha , \beta , \gamma \) be mutually disjoint, possibly empty lists of indices of multidimensional arrays representing the tensors. Then the contraction of \(A[\alpha ,\beta ]\) with \(B[\beta ,\gamma ]\) yields \(C[\alpha ,\gamma ] = \sum _{\beta } A[\alpha ,\beta ] * B[\beta ,\gamma ]\). Typically, contractions are implemented by (possibly) permuting one of the arrays and then performing a DGEMM.

  2. We refer to “the” segment size for convenience. It is not required that all segments within a rank be the same size. The way an index is segmented is part of its type and is fixed during program initialization. There are several segment index types corresponding to domain specific concepts. For example, aoindex and moindex represent atomic orbital and molecular orbital. This allows the type system to perform useful checks on the consistent use of index variables.

  3. The syntax has been slightly simplified.

  4. As can be seen from the SIAL code fragment in Fig. 2, data transfers between nodes do not overlap with GPU instructions.



Acknowledgments

Shawn McDowell provided the CUDA implementation of the contraction operator. This work was supported by the National Science Foundation Grant OCI-0725070 and the Office of Science of the U.S. Department of Energy under grant DE-SC0002565. The development of the SIA and ACES III has also been supported by the US Department of Defense's High Performance Computing Modernization Program (HPCMP) under two programs, the Common High Performance Computing Software Initiative (CHSSI), Project CBD-03, and User Productivity Enhancement and Technology Transfer (PET). We also thank the University of Florida High Performance Computing Center for use of its facilities.


References

  1.
  2. Beyer, J.C., Stotzer, E.J., Hart, A., de Supinski, B.R.: OpenMP for accelerators. In: Proceedings of the 7th International Conference on OpenMP in the Petascale Era, IWOMP'11, pp. 108–121. Springer, Berlin, Heidelberg (2011)
  3. Bhaskaran-Nair, K., Ma, W., Krishnamoorthy, S., Villa, O., van Dam, H.J.J., Aprà, E., Kowalski, K.: Noniterative multireference coupled cluster methods on heterogeneous CPU–GPU systems. J. Chem. Theory Comput. 9(4), 1949–1957 (2013). doi:10.1021/ct301130u
  4. DePrince, A.E., Hammond, J.R.: Coupled cluster theory on graphics processing units. I. The coupled cluster doubles method. J. Chem. Theory Comput. 7(5), 1287–1295 (2011). doi:10.1021/ct100584w
  5. Han, T.D., Abdelrahman, T.S.: hiCUDA: High-level GPGPU programming. IEEE Trans. Parallel Distrib. Syst. 22(1), 78–90 (2011). doi:10.1109/TPDS.2010.62
  6. Jindal, N., Lotrich, V., Deumens, E., Sanders, B.A.: SIPMaP: A tool for modeling irregular parallel computations in the Super Instruction Architecture. In: 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2013) (2013)
  7. Lee, S., Eigenmann, R.: OpenMPC: Extended OpenMP programming and tuning for GPUs. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC'10, pp. 1–11. IEEE Computer Society, Washington, DC, USA (2010). doi:10.1109/SC.2010.36
  8. Lee, S., Vetter, J.S.: Early evaluation of directive-based GPU programming models for productive exascale computing. In: SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, Salt Lake City, Utah, USA (2012). doi:10.1109/SC.2012.51
  9. Lotrich, V.F., Ponton, J.M., Perera, A.S., Deumens, E., Bartlett, R.J., Sanders, B.A.: Super Instruction Architecture for petascale electronic structure software: the story. Mol. Phys. (2010). Special issue: Electrons, Molecules, Solids, and Biosystems: Fifty Years of the Quantum Theory Project. (conditionally accepted)
  10. Lotrich, V., Flocke, N., Ponton, M., Yau, A.D., Perera, A., Deumens, E., Bartlett, R.J.: Parallel implementation of electronic structure energy, gradient and Hessian calculations. J. Chem. Phys. 128, 194104 (2008)
  11. Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K.: GPU-based implementations of the noniterative regularized-CCSD(T) corrections: applications to strongly correlated systems. J. Chem. Theory Comput. 7(5), 1316–1327 (2011). doi:10.1021/ct1007247
  12. Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K., Agrawal, G.: Optimizing tensor contraction expressions for hybrid CPU–GPU execution. Clust. Comput. 16(1), 131–155 (2013). doi:10.1007/s10586-011-0179-2
  13.
  14. OpenACC: Directives for accelerators.
  15. Sanders, B.A., Bartlett, R., Deumens, E., Lotrich, V., Ponton, M.: A block-oriented language and runtime system for tensor algebra with very large arrays. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pp. 1–11. IEEE Computer Society, Washington, DC, USA (2010). doi:10.1109/SC.2010.3

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Nakul Jindal (1)
  • Victor Lotrich (2)
  • Erik Deumens (2)
  • Beverly A. Sanders (1)

  1. Department of Computer and Information Science, University of Florida, Gainesville, USA
  2. Department of Chemistry, University of Florida, Gainesville, USA
