Using Hardware Counters to Predict Vectorization

  • Neftali WatkinsonEmail author
  • Aniket Shivam
  • Zhi Chen
  • Alexander Veidenbaum
  • Alexandru Nicolau
  • Zhangxiaowen Gong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11403)


Vectorization is the process of transforming the scalar implementation of an algorithm into vector form. This transformation aims to benefit from parallelism through the generation of microprocessor vector instructions. Using abstract models and source level information, compilers can identify opportunities for auto-vectorization. However, compilers do not always predict the runtime effects accurately or completely fail to identify vectorization opportunities. This ultimately results in no performance improvement.

This paper takes on a new perspective by leveraging the use of runtime hardware counters to predict the potential for loop vectorization. Using supervised machine learning models, we can detect instances where vectorization can be applied (but the compilers fail to) with 80% validation accuracy. We also predict profitability and performance in different architectures.

We evaluate a wide range of hardware counters across different machine learning models. We show that dynamic features, extracted from performance data, implicitly include useful information about the host machine and runtime program behavior.


Machine learning Compilers Auto-vectorization Profitability 



This material is based upon work supported by the National Science Foundation under Award 1533912.


  1. 1.
    Aumage, O., Barthou, D., Haine, C., Meunier, T.: Detecting SIMDization opportunities through static/dynamic dependence analysis. In: an Mey, D., et al. (eds.) Euro-Par 2013. LNCS, vol. 8374, pp. 637–646. Springer, Heidelberg (2014). Scholar
  2. 2.
    Banerjee, U.: An introduction to a formal theory of dependence analysis. J. Supercomput. 2(2), 133–149 (1988)CrossRefGoogle Scholar
  3. 3.
    Cammarota, R., Beni, L.A., Nicolau, A., Veidenbaum, A.V.: Optimizing program performance via similarity, using a feature-agnostic approach. In: Wu, C., Cohen, A. (eds.) APPT 2013. LNCS, vol. 8299, pp. 199–213. Springer, Heidelberg (2013). Scholar
  4. 4.
    Demšar, J., et al.: Orange: data mining toolbox in python. J. Mach. Learn. Res. 14(1), 2349–2353 (2013)zbMATHGoogle Scholar
  5. 5.
    Kennedy, K., Allen, J.R.: Optimizing Compilers for Modern Architectures: A Dependence-Based Approach (2001)Google Scholar
  6. 6.
    Maleki, S., Gao, Y., Garzarán, M.J., Wong, T., Padua, D.A.: An evaluation of vectorizing compilers. In: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp. 372–382 (2011)Google Scholar
  7. 7.
    Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. In: Americas, vol. 32, pp. 2473–10013. Delhi Cambridge University Press (2008)Google Scholar
  8. 8.
    Reinders, J.: VTuneTM Performance Analyzer Essentials Measurement and Tuning Techniques for Software Developers (First.). Intel Press (2005)Google Scholar
  9. 9.
    Trouvé, A., et al.: Using machine learning in order to improve automatic SIMD instruction generation. Procedia Comput. Sci. 18, 1292–1301 (2013)CrossRefGoogle Scholar
  10. 10.
    Fursin, G., et al.: Milepost GCC: machine learning enabled self-tuning compiler. Int. J. Parallel Prog. 39(3), 296–327 (2011)CrossRefGoogle Scholar
  11. 11.
    Tournavitis, G., Wang, Z., Franke, B., O’Boyle, M.F.M.: Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. In: ACM SIGPLAN Notices, pp. 177–187 (2009)Google Scholar
  12. 12.
    Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)CrossRefGoogle Scholar
  13. 13.
    Weaver, V.M.: Linux perf_event features and overhead. In: The 2nd International Workshop on Performance Analysis of Workload Optimized Systems, FastPath, p. 80, April 2013Google Scholar
  14. 14.
    Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol. 14, no. 2, pp. 1137–1145, August 1995Google Scholar
  15. 15.
    Chen, Z., et al.: LORE: a loop repository for the evaluation of compilers. In: 2017 IEEE International Symposium on Workload Characterization (in press)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Neftali Watkinson
    • 1
    Email author
  • Aniket Shivam
    • 1
  • Zhi Chen
    • 1
  • Alexander Veidenbaum
    • 1
  • Alexandru Nicolau
    • 1
  • Zhangxiaowen Gong
    • 2
  1. 1.Department of Computer ScienceUniversity of California, IrvineIrvineUSA
  2. 2.Department of Computer ScienceUniversity of Illinois, Urbana-ChampaignChampaignUSA

Personalised recommendations