Dynamic SIMD Vector Lane Scheduling
A classical technique for vectorizing code that contains control flow is control-flow to data-flow conversion. In this approach, statements are augmented with masks that denote whether a given vector lane participates in the statement's execution or idles. If work is assigned to vector lanes statically, some lanes run idle whenever control flow diverges or the work intensity varies across loop iterations. With a growing number of vector lanes, the likelihood of divergence or of heavily unbalanced work assignments increases, and static scheduling leads to poor resource utilization. In this paper, we investigate different approaches to dynamic SIMD vector lane scheduling, using the Mandelbrot set algorithm as a test case. To overcome the limitations of static scheduling, idle vector lanes are assigned new work items dynamically, thereby minimizing per-lane idle cycles. Our evaluation on the Knights Corner and Knights Landing platforms shows that our approaches can yield considerable performance gains over a static work assignment. By using the AVX-512 vector compress and expand instructions, we are able to improve the scheduling further.
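To illustrate the idea, the following sketch models dynamic lane scheduling for the Mandelbrot iteration in NumPy, with array positions standing in for SIMD vector lanes. It is a minimal illustration of the general technique, not the paper's implementation: all names (`mandelbrot_dynamic`, `lanes`, etc.) are hypothetical, and the masked `np.where` updates approximate what masked SIMD instructions do in hardware. Whenever a lane's point escapes or reaches the iteration limit, its result is retired and the lane is refilled with the next work item, so no lane idles while work remains.

```python
import numpy as np

def mandelbrot_dynamic(points, max_iter=256, lanes=8):
    """Iteration counts for complex points, processing `lanes` points at a
    time and refilling a lane as soon as its point finishes (hypothetical
    sketch of dynamic SIMD lane scheduling, not the paper's code)."""
    n = len(points)
    counts = np.zeros(n, dtype=np.int32)
    lane_idx = np.full(lanes, -1, dtype=np.int64)   # which work item each lane holds
    z = np.zeros(lanes, dtype=np.complex128)        # per-lane iteration state
    it = np.zeros(lanes, dtype=np.int32)            # per-lane iteration count
    active = np.zeros(lanes, dtype=bool)            # execution mask
    next_point = 0

    while True:
        # Dynamic scheduling: refill every idle lane with a fresh work item.
        for l in range(lanes):
            if not active[l] and next_point < n:
                lane_idx[l], z[l], it[l], active[l] = next_point, 0.0, 0, True
                next_point += 1
        if not active.any():
            break
        # One masked "SIMD" step: only active lanes update their state.
        c = np.where(active, points[np.maximum(lane_idx, 0)], 0.0)
        z = np.where(active, z * z + c, z)
        it = np.where(active, it + 1, it)
        # Lanes whose point escaped or hit max_iter retire their result.
        done = active & ((np.abs(z) > 2.0) | (it >= max_iter))
        for l in np.nonzero(done)[0]:
            counts[lane_idx[l]] = it[l]
            active[l] = False
    return counts
```

With static scheduling, a lane that finishes early would idle until every lane in its group completes; here the refill loop keeps all lanes busy until the work pool is drained, which is the effect the AVX-512 compress and expand instructions make cheap in actual vector code.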
Keywords: SIMD vectorization · Dynamic scheduling · Intel Xeon Phi
This work has been funded by SAXonPHI – Intel Parallel Computing Center Dresden at the Center for Information Services and High Performance Computing, TU Dresden, by the Research Center for Many-core HPC (IPCC) at Zuse Institute Berlin, and by the Intel Parallel Computing Center at RWTH Aachen University.