Tracking User-Perceived I/O Slowdown via Probing

Kunkel, Julian; Betke, Eugen

doi:10.1007/978-3-030-34356-9_15

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11887)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

The perceived I/O performance of a shared file system heavily depends on the usage pattern expressed by all concurrent jobs. From the perspective of a single user or job, the achieved I/O throughput can vary significantly due to activities conducted by other users or system services like RAID rebuilds. As these activities are hidden, users wonder about the cause of the observed slowdown and may contact the service desk to report an unusually slow system.

In this paper, we present a methodology to investigate and quantify the user-perceived slowdown, which sheds light on the perceivable file system performance. This is achieved by deploying a monitoring system on a client node that constantly probes the performance of various data and metadata operations and then computes a slowdown factor. This information could be acquired and visualized in a timely fashion, informing the users about the expected slowdown.
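The probing idea can be illustrated with a minimal sketch. It is not the authors' tool: the probe functions, file name, and file size are hypothetical, and the slowdown factor is assumed here to be the ratio of an observed latency to the median of previously measured latencies (the notes mention the median as the default statistic).

```python
import os
import statistics
import tempfile
import time


def time_probe(operation):
    """Return the wall-clock duration of one probe operation in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def slowdown_factor(observed, baseline):
    """Assumed definition: observed latency divided by the baseline median."""
    reference = statistics.median(baseline)
    if reference <= 0:
        raise ValueError("baseline median must be positive")
    return observed / reference


def probe_write(directory, size=1024 * 1024):
    """Hypothetical data probe: write and fsync one file, then delete it."""
    path = os.path.join(directory, "probe.tmp")
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)


def probe_stat(directory):
    """Hypothetical metadata probe: stat the probe directory."""
    os.stat(directory)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Collect a few baseline samples, then compare one new measurement.
        baseline = [time_probe(lambda: probe_write(d)) for _ in range(5)]
        observed = time_probe(lambda: probe_write(d))
        print(f"write slowdown: {slowdown_factor(observed, baseline):.2f}")
```

A factor near 1 would indicate typical conditions under this definition, while larger values would indicate that the probe ran slower than the baseline median.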

To evaluate the method, we deploy the monitoring on three data centers and explore the gathered data for a period of up to 60 days. A verification of the method is conducted by investigating the metrics while running the IO-500 benchmark. We conclude that this approach is able to reveal short-term and long-term interference.

Notes

1. When the specified job walltime limit is hit, jobs are terminated.
2. The value could be updated periodically in a sliding window to cover the typical operational conditions or it could utilize other statistics than the median.
3. http://www.archer.ac.uk.
4. https://github.com/joobog/io-probing.
5. To minimize this, the precreated file size could have been increased.
6. https://github.com/hpc/ior.
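Note 2 mentions updating the value in a sliding window and using statistics other than the median. A minimal sketch of that idea follows. It assumes "the value" is a reference latency against which new probe measurements are compared; the class name and window size are hypothetical and not taken from the paper or its repository.

```python
from collections import deque
import statistics


class SlidingBaseline:
    """Keeps the most recent probe latencies and derives a reference value."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        # A bounded deque drops the oldest sample once the window is full.
        self._samples = deque(maxlen=window_size)

    def update(self, latency):
        """Record one new probe latency in seconds."""
        self._samples.append(latency)

    def reference(self, statistic=statistics.median):
        """Return the chosen statistic over the current window."""
        if not self._samples:
            raise ValueError("no samples recorded yet")
        return statistic(self._samples)


if __name__ == "__main__":
    baseline = SlidingBaseline(window_size=3)
    for latency in [1.0, 2.0, 3.0, 10.0]:
        baseline.update(latency)
    # The window now holds the three most recent samples.
    print(baseline.reference())
    print(baseline.reference(statistics.mean))
```

Passing a different callable to `reference` swaps the statistic without changing how samples are collected.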

Acknowledgements

This work was supported by the UK National Supercomputing Service, ARCHER funded by EPSRC and NERC. We thank the German Climate Computing Center (DKRZ) for providing access to their machines to run the experiments.

Author information

Authors and Affiliations

University of Reading, Reading, UK
Julian Kunkel
DKRZ, Hamburg, Germany
Eugen Betke

Corresponding author

Correspondence to Julian Kunkel.

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, UK
Michèle Weiland
Helmholtz-Zentrum Dresden-Rossendorf, Dresden, Sachsen, Germany
Guido Juckeland
Swiss National Supercomputing Centre, Lugano, Ticino, Switzerland
Sadaf Alam
University of Tennessee at Knoxville, Knoxville, TN, USA
Heike Jagode

About this paper

Cite this paper

Kunkel, J., Betke, E. (2019). Tracking User-Perceived I/O Slowdown via Probing. In: Weiland, M., Juckeland, G., Alam, S., Jagode, H. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol 11887. Springer, Cham. https://doi.org/10.1007/978-3-030-34356-9_15

DOI: https://doi.org/10.1007/978-3-030-34356-9_15
Published: 03 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34355-2
Online ISBN: 978-3-030-34356-9
eBook Packages: Computer Science, Computer Science (R0)
