Abstract
Modern compilers perform sophisticated static analyses to enable optimization across a wide spectrum of code patterns. However, there are many cases where even the most sophisticated static analysis is insufficient, or where the computational complexity makes complete static analysis impractical. In these cases it is often possible to discover further opportunities for optimization through dynamic profiling and to provide this information to the compiler, either by adding directives or pragmas to the source, or by modifying the source algorithm or implementation. For current and emerging generations of chips, vectorization is one of the most important of these optimizations. This paper defines, implements, and applies a systematic process for combining the information acquired by the static analysis of modern compilers with information acquired by a targeted, high-resolution, low-overhead dynamic profiling tool to enable additional and more effective vectorization. Opportunities for more effective vectorization are frequent, and the performance gains obtained are substantial: we show a geometric mean speedup across several benchmarks of over 1.5x on the Intel Xeon Phi coprocessor.
Notes
- 5. The Rose compiler framework is not yet available on the Intel Xeon Phi coprocessor; hence the code could be instrumented to run only on the Intel Xeon processor and not on the Intel Xeon Phi coprocessor.
Acknowledgments
This work is funded in part by Intel Corporation and by the National Science Foundation under OCI award #0622780.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Rane, A., Krishnaiyer, R., Newburn, C.J., Browne, J., Fialho, L., Matveev, Z. (2015). Unification of Static and Dynamic Analyses to Enable Vectorization. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science(), vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0