Battling Memory Requirements of Array Programming Through Streaming

  • Mads R. B. Kristensen
  • James Avery
  • Troels Blum
  • Simon Andreas Frimann Lund
  • Brian Vinter
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)


A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations, entirely without loops, are most efficient on small inputs but can lead to explosions in memory use on larger ones. This paper presents a solution to this problem using array streaming, implemented in Bohrium, an automatically parallelizing high-performance framework. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine's capacity, since automatic streaming eliminates the temporary-array memory overhead by performing calculations in per-thread registers.
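The memory blow-up from loop-free array code, and the effect of streaming, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the hand-written `streamed_sum` below is a hypothetical chunk-granularity stand-in for what Bohrium does automatically at register granularity.

```python
import numpy as np

# Plain NumPy evaluates each sub-expression into a full-size temporary
# array: x * x, x * x + 1.0, and np.sqrt(...) each allocate N elements,
# so peak memory is several times the size of the input.
def full_sum(x):
    return np.sqrt(x * x + 1.0).sum()

# Illustrative hand-written streaming: process the array in fixed-size
# blocks so every temporary stays small and constant in size. Bohrium
# performs this transformation automatically, per thread, in registers.
def streamed_sum(x, block=65_536):
    total = 0.0
    for i in range(0, len(x), block):
        c = x[i:i + block]
        total += np.sqrt(c * c + 1.0).sum()
    return total

rng = np.random.default_rng(0)
x = rng.random(1_000_000)
```

Both variants compute the same reduction; only the peak size of the temporaries differs.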

Using Bohrium, we automatically fuse, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, with corresponding improvements in speed and utilization of GPGPU cores. The streaming-enabled Bohrium effortlessly runs programs on input sizes well beyond those that crash pure NumPy by exhausting system memory.
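A minimal sketch of the unmodified-code claim: Bohrium exposes a NumPy-compatible module, so existing array code can run fused and streamed simply by changing the import. The `bohrium` module name is taken from the Bohrium project; the fallback keeps this sketch runnable without Bohrium installed.

```python
# Drop-in usage sketch: unmodified NumPy array code runs under Bohrium
# when its NumPy-compatible module is importable; otherwise plain NumPy.
try:
    import bohrium as np  # module name assumed from the Bohrium project
except ImportError:
    import numpy as np

def leibniz_pi(n):
    # pi/4 = 1 - 1/3 + 1/5 - ...: pure array operations, no Python loop.
    # Each intermediate (signs, 2*k + 1, the quotient) would normally be
    # an O(n) temporary; a streaming backend fuses them into one pass.
    k = np.arange(n)
    signs = 1.0 - 2.0 * (k % 2)      # +1, -1, +1, ...
    terms = signs / (2.0 * k + 1.0)
    return 4.0 * terms.sum()
```

The function body is identical under both imports; the backend decides how the intermediates are materialized.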



James Avery was partially supported by the Danish Council for Independent Research Sapere Aude grant “Complexity through Logic and Algebra” (COLA).



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark
  2. Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
