
Application Suitability Assessment for Many-Core Targets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 9945)

Abstract

Many-core hardware platforms offer a tremendous opportunity for scaling up performance, but not all codes that run on these platforms have been modernized sufficiently to fully utilize the hardware. Assessing whether a code will effectively utilize a given platform can be challenging, particularly for new or potential future platforms where native execution on real hardware is not possible. In this case, one typically relies on architecture simulators and other workload characterization tools, which are often not user-friendly for developers who want to do a quick initial assessment of an application’s suitability for a many-core architecture.

To help address this challenge, we present QMSprof, a tool and a set of analyses for an initial assessment of the suitability of a set of applications for a simulated extremely-parallel many-core target. QMSprof automates the process of running a suite of workload binaries through Intel® Software Development Emulator (SDE) and the Sniper multi-core simulator and extracting high-level summary statistics. The tool generates comparative plots summarizing key metrics across the workload suite, including the mix of vector and nonvector instructions, scalability with increasing thread count, memory bandwidth utilization, and statistics on cache misses and working set size. These summary metrics are designed to aid performance tuners in selecting promising codes for a many-core target and in pinpointing opportunities for additional tuning. To illustrate the utility of our tool, we also describe some sample results from characterizing applications on a hypothetical many-core architecture.

This research was, in part, funded by the U.S. Government and DOE. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.



Author information

Correspondence to Jim Sukha.

A Appendix: Interface for QMSprof

This appendix demonstrates the interface for QMSprof. We first present an example of configuring the collector to run simulations, and then describe how to run the analyzer to extract summary statistics and generate plots.

A.1 Collector Interface

The collector interface for QMSprof is divided into four major parts, namely configuration of (1) workload binaries and run arguments, (2) Sniper models, (3) environment setup, and (4) the experiment script.

For part (1), binaries and run arguments are configured by specifying a Benchmarks dictionary, which maps a key for each benchmark to a per-benchmark dictionary with additional information. When running an experiment, the collector uses information from a per-benchmark dictionary to stage each simulation run, i.e., creating a separate run folder for each simulation run in a staging area, and copying and/or renaming any necessary binaries and input files into that folder. This staging step is needed because workloads are not always built to support concurrent executions from the same run folder.
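
As a rough sketch of this staging step (our own illustration, not the actual QMSprof code), the collector might populate a run folder as follows:

    import os
    import shutil

    def stage_run(staging_root, run_name, bindir, file_pairs):
        # Create an isolated folder for one simulation run.
        run_dir = os.path.join(staging_root, run_name)
        os.makedirs(run_dir, exist_ok=True)
        # Copy (or link) each needed file into the run folder, renaming it
        # so that every run sees a consistent file layout.
        for src_name, dst_name in file_pairs:
            shutil.copy(os.path.join(bindir, src_name),
                        os.path.join(run_dir, dst_name))
        return run_dir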

The per-benchmark dictionary is built to support benchmark variants, i.e., different versions of the same workload, with possibly different binaries or run arguments. This dictionary has several expected fields:

  1. bindir: This string names the subdirectory of the root benchmark directory containing the files for this benchmark. The root benchmark directory is a global variable specified separately at the top level of the configuration file.

  2. files: This dictionary maps a variant of the benchmark to a list of files needed to run that variant. Each file is itself a pair, with the first value being the name of the file in the source binary directory and the second value being the name of the target file in the staging area. This pair allows an input file to be renamed in the staging area before it is run.

  3. runargs: This dictionary maps a variant of a benchmark to the command-line arguments needed to run the variant.

  4. gen_inputs: This dictionary maps a variant to a list of shell commands to execute in bindir to generate any input files that are needed for a run.

  5. requires_MPI: This flag is set if this particular binary requires the use of an MPI library to execute. Our current prototype assumes one MPI rank per program, but in principle this assumption could be relaxed.

For the files, runargs, and gen_inputs dictionaries, if no exact match for a particular variant name is found, QMSprof falls back to the key that matches the longest prefix of the variant name.
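
A minimal sketch of this longest-prefix fallback, assuming a plain dictionary lookup (the actual implementation may differ):

    def lookup_variant(table, variant):
        # Exact match wins; otherwise fall back to the key that is the
        # longest prefix of the variant name, e.g. "sim" for "sim_vec".
        if variant in table:
            return table[variant]
        best = max((key for key in table if variant.startswith(key)),
                   key=len, default=None)
        if best is None:
            raise KeyError(variant)
        return table[best]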

As an example, Fig. 7 shows part of a configuration file specifying binaries and arguments for two benchmarks: LULESH and SNAP. In this configuration, running the sim_vec variant of LULESH uses the source file lulesh2.0_vec from the subdirectory LULESH/binaries. The staging process copies and renames (or links) it to a file named LULESH_sim_vec in each simulation run directory. This staging allows QMSprof to use a consistent naming convention in its implementation, without requiring users to duplicate or rename input files in the source directory. To run the sim_vec variant of LULESH, QMSprof will use the sim arguments -s 27 -i 6 -p, since sim is the longest matching prefix of sim_vec.

SNAP has a slightly more complicated description, with a script command list in its gen_inputs parameter. This command list indicates that, before staging any files in the files list, the collector should run the script genFile from the SNAP/binaries directory to generate extra input files. Arguments and commands may also contain two special patterns: the collector replaces one with the thread count for a particular run, and the other with the run directory used to store and run the binary.

Fig. 7. Example Benchmarks dictionary for configuring workloads in QMSprof.
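
Since the figure is not reproduced here, the following sketch shows what such a Benchmarks dictionary might look like, reconstructed from the description above; the SNAP file names, the SNAP run arguments, and the <NTHREADS>/<RUNDIR> placeholder spellings are our assumptions.

    Benchmarks = {
        "LULESH": {
            "bindir": "LULESH/binaries",
            "files": {
                "sim_vec": [("lulesh2.0_vec", "LULESH_sim_vec")],
                # novec binary name assumed by analogy:
                "sim_novec": [("lulesh2.0_novec", "LULESH_sim_novec")],
            },
            # Matched by longest prefix for both sim_vec and sim_novec.
            "runargs": {"sim": "-s 27 -i 6 -p"},
        },
        "SNAP": {
            "bindir": "SNAP/binaries",
            # Run genFile before staging to generate extra input files;
            # <NTHREADS> and <RUNDIR> stand in for the special patterns
            # replaced with the thread count and run directory.
            "gen_inputs": {"sim": ["./genFile <NTHREADS> <RUNDIR>"]},
            "files": {"sim": [("snap", "SNAP_sim")]},  # names assumed
            "runargs": {"sim": "SNAP_input"},          # arguments assumed
        },
    }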

Configuration of parts (2) and (3) is relatively straightforward, as illustrated in Fig. 8. The SniperConfigs dictionary describes the Sniper models that can be used in an experiment. The key 16C_2wide is a descriptive (usually short) string that the user provides for the Sniper configuration file Manycore_16c_2wide.cfg. Similarly, in the EnvFiles dictionary, DefaultOpenMP is a description of the environment file ICCDefaultOpenMP.sh. Each run script created by QMSprof sources a particular environment file before each run, passing in the thread count of the run as its argument. Thus, the user should use the environment file to set up any necessary runtime libraries or tools (e.g., compiler libraries, Sniper, and SDE) and any other environment variables such as OMP_NUM_THREADS. Finally, Fig. 8 also specifies the job manager (e.g., Intel NetBatch) to use to run jobs in the desired compute environment.

Fig. 8. Example configuration for QMSprof for Sniper configurations and environment setup.
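
A plausible sketch of this configuration; the JobManager field name and value are assumptions, while the dictionary keys and file names come from the description above.

    SniperConfigs = {
        # Descriptive key -> Sniper configuration file.
        "16C_2wide": "Manycore_16c_2wide.cfg",
    }

    EnvFiles = {
        # Descriptive key -> environment file sourced before each run,
        # with the run's thread count passed as its argument.
        "DefaultOpenMP": "ICCDefaultOpenMP.sh",
    }

    # Job manager used to launch jobs (field name assumed).
    JobManager = "NetBatch"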

Finally, for part (4), Fig. 9 shows an example experiment script that specifies the runs in the experiment. Each run (e.g., Run0) is specified as a tuple with 5 entries:

  1. Sniper configuration: The Sniper configuration for the run, as defined by the keys in the input SniperConfigs dictionary. For example, Run0 runs the 16C_2wide config.

  2. Thread set: The thread counts to run. For example, Run0 uses thread counts of 1, 2, 4, 8, 12, and 16.

  3. Environment file: The environment file for the run, as defined by the keys in the input EnvFiles dictionary. For example, Run0 uses the DefaultOpenMP environment file.

  4. Program list: The workload variants to run. The example in Fig. 9 executes both the sim_vec and sim_novec variants of LULESH and SNAP.

  5. Experiment knobs: An object that captures all the other configuration knobs for a particular run. This example uses default values for all the knobs, but additional customization is possible.

Fig. 9. Example experiment script for QMSprof.
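
A sketch of what such an experiment script might contain, following the five-entry tuple described above; the ExperimentKnobs name and the program-list format are our assumptions.

    class ExperimentKnobs:
        # Placeholder for per-run configuration knobs; this example
        # keeps all knobs at their default values.
        pass

    Run0 = (
        "16C_2wide",                 # 1. Sniper configuration key
        [1, 2, 4, 8, 12, 16],        # 2. thread set
        "DefaultOpenMP",             # 3. environment file key
        [("LULESH", "sim_vec"), ("LULESH", "sim_novec"),  # 4. program list
         ("SNAP", "sim_vec"), ("SNAP", "sim_novec")],
        ExperimentKnobs(),           # 5. experiment knobs (defaults)
    )

    Runs = [Run0]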

A.2 Analyzer Interface

The analyzer for QMSprof is a separate script that takes a single input directory as its argument, scans the input directory for SDE and Sniper simulation output, parses the relevant raw statistics output files, and then generates summary plots. Our prototype for QMSprof has the specific plots demonstrated in Sect. 4 hard-coded as output, but in principle one could implement a more complex interface that would allow for some customization in the generated plots. The analyzer generates Gnuplot scripts and data files as output, which can be manually edited (e.g., to change titles, labels, or legends), and rerun manually to recreate plots.

Our prototype analyzer assumes that simulation output for each simulation run is placed in a separate folder, with the thread count appearing in the folder name. The analyzer uses the names of output folders to group different thread counts for a benchmark together in summary plots, and eliminates the common suffix across all runs to shorten legends in generated plots. These assumptions are designed for processing the output from the QMSprof collector, but one can also use the analyzer to generate summary plots from other simulation runs if the output directories follow a compatible naming convention.
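
As a concrete (assumed) illustration of these naming conventions, thread counts might be recovered from folder names roughly as follows; the "_t<N>" pattern is our own example, not necessarily the convention QMSprof uses.

    import re

    def group_runs_by_threads(dir_names):
        # Group run folders by benchmark/variant, keyed by thread count,
        # assuming folder names embed the thread count, e.g.
        # "LULESH_sim_vec_t8" -> ("LULESH_sim_vec", 8).
        groups = {}
        for name in dir_names:
            match = re.search(r"_t(\d+)\b", name)
            if match is None:
                continue
            threads = int(match.group(1))
            base = name[:match.start()]
            groups.setdefault(base, {})[threads] = name
        return groups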

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Newburn, C.J., Sukha, J., Sharapov, I., Nguyen, A.D., Miao, C.C. (2016). Application Suitability Assessment for Many-Core Targets. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_23


  • DOI: https://doi.org/10.1007/978-3-319-46079-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46078-9

  • Online ISBN: 978-3-319-46079-6

  • eBook Packages: Computer Science
