International Journal of Parallel Programming, Volume 44, Issue 2, pp 309–324

Exploiting GPUs with the Super Instruction Architecture

  • Nakul Jindal
  • Victor Lotrich
  • Erik Deumens
  • Beverly A. Sanders


The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry that involve complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays, which are typically very large. The SIA consists of a domain-specific programming language, the Super Instruction Assembly Language (SIAL), and its runtime system, the Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating-point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedups ranging from two to nearly four for computational chemistry calculations and thus saving hours of elapsed time on large-scale computations. The results provide evidence that the “programming-with-blocks” approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.
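To make the “programming-with-blocks” idea concrete, the following is a minimal Python sketch of a two-index contraction written over blocks rather than individual numbers. It is not SIAL and does not reflect the actual SIA runtime; every function name here (`split_into_blocks`, `contract_block`, `blocked_contraction`, `assemble`) is hypothetical and chosen only for illustration. The outer loops see whole blocks, and the block-level routine stands in for the kind of coarse-grained operation that a runtime could hand to a GPU.

```python
# Illustrative sketch only: plain Python, not SIAL and not the SIA runtime.
# All names below are hypothetical.

def split_into_blocks(matrix, block_size):
    """Tile a matrix (list of lists) into a grid of blocks of at most block_size x block_size."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    grid = []
    for bi in range(0, n_rows, block_size):
        block_row = []
        for bj in range(0, n_cols, block_size):
            block_row.append([row[bj:bj + block_size] for row in matrix[bi:bi + block_size]])
        grid.append(block_row)
    return grid


def contract_block(a, b):
    """Block-level contraction C[i][j] = sum_k A[i][k] * B[k][j].

    This is the coarse-grained unit of work; the element-wise loops live here only.
    """
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_into(target, update):
    """Accumulate one block into another in place."""
    for i, row in enumerate(update):
        for j, value in enumerate(row):
            target[i][j] += value


def blocked_contraction(a_blocks, b_blocks):
    """Contract two blocked matrices; the loops here range over blocks, not numbers."""
    result = []
    for i in range(len(a_blocks)):
        result_row = []
        for j in range(len(b_blocks[0])):
            rows = len(a_blocks[i][0])
            cols = len(b_blocks[0][j][0])
            acc = [[0] * cols for _ in range(rows)]
            for k in range(len(b_blocks)):
                add_into(acc, contract_block(a_blocks[i][k], b_blocks[k][j]))
            result_row.append(acc)
        result.append(result_row)
    return result


def assemble(blocks):
    """Stitch a grid of blocks back into a single list-of-lists matrix."""
    matrix = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            matrix.append([value for block in block_row for value in block[r]])
    return matrix
```

As a usage example, `assemble(blocked_contraction(split_into_blocks(A, 2), split_into_blocks(B, 2)))` gives the same result as contracting `A` and `B` directly with `contract_block(A, B)`. The point of the design is that only `contract_block` touches individual elements, so swapping in an accelerated implementation of that one routine leaves the block-level algorithm unchanged.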


Parallel programming, Tensors, GPU, Domain specific language



Shawn McDowell provided the CUDA implementation of the contraction operator. This work was supported by National Science Foundation Grant OCI-0725070 and by the Office of Science of the U.S. Department of Energy under Grant DE-SC0002565. The development of the SIA and ACES III has also been supported by the U.S. Department of Defense’s High Performance Computing Modernization Program (HPCMP) under two programs: Common High Performance Computing Software Initiative (CHSSI), Project CBD-03, and User Productivity Enhancement and Technology Transfer (PET). We also thank the University of Florida High Performance Computing Center for use of its facilities.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Nakul Jindal (1)
  • Victor Lotrich (2)
  • Erik Deumens (2)
  • Beverly A. Sanders (1)

  1. Department of Computer and Information Science, University of Florida, Gainesville, USA
  2. Department of Chemistry, University of Florida, Gainesville, USA
