Abstract
The perceived I/O performance of a shared file system depends heavily on the usage pattern expressed by all concurrent jobs. From the perspective of a single user or job, the achieved I/O throughput can vary significantly due to activities conducted by other users or by system services such as RAID rebuilds. Because these activities are hidden from them, users wonder about the cause of the observed slowdown and may contact the service desk to report an unusually slow system.
In this paper, we present a methodology to investigate and quantify the user-perceived slowdown, which sheds light on the perceivable file system performance. This is achieved by deploying a monitoring system on a client node that constantly probes the performance of various data and metadata operations and then computes a slowdown factor. This information could be acquired and visualized in a timely fashion, informing users about the expected slowdown.
To evaluate the method, we deploy the monitoring system in three data centers and explore the gathered data over a period of up to 60 days. The method is verified by investigating the metrics while running the IO-500 benchmark. We conclude that this approach is able to reveal both short-term and long-term interference.
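The probing idea in the abstract can be illustrated with a minimal Python sketch. It assumes a probe that times a file write (forced to storage with `fsync`), a `stat` and a delete in a directory on the file system under test, and it assumes that the slowdown factor is the current probe time divided by the median of earlier probe times (note 2 hints at a median-based reference value). The probe file name, file size, operation set and the functions `probe_once` and `slowdown_factors` are hypothetical; this is not the paper's actual probe configuration.

```python
import os
import statistics
import tempfile
import time


def probe_once(directory: str, size: int = 1024 * 1024) -> dict:
    """Time one round of illustrative data and metadata probes in `directory`.

    The probe file name, the 1 MiB default size and the chosen operations are
    assumptions for illustration only.
    """
    path = os.path.join(directory, "probe.tmp")  # hypothetical probe file
    payload = b"\0" * size
    timings = {}

    start = time.perf_counter()
    with open(path, "wb") as f:  # data probe: create + write + flush to storage
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    timings["write"] = time.perf_counter() - start

    start = time.perf_counter()
    os.stat(path)  # metadata probe: stat (may be answered from a client cache)
    timings["stat"] = time.perf_counter() - start

    start = time.perf_counter()
    os.remove(path)  # metadata probe: delete
    timings["delete"] = time.perf_counter() - start
    return timings


def slowdown_factors(current: dict, history: list) -> dict:
    """Slowdown per operation: current time divided by the median of past times."""
    factors = {}
    for op, seconds in current.items():
        baseline = statistics.median([h[op] for h in history])
        factors[op] = seconds / max(baseline, 1e-9)  # guard against a zero baseline
    return factors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        history = [probe_once(d) for _ in range(5)]  # stands in for past probes
        current = probe_once(d)
        for op, factor in slowdown_factors(current, history).items():
            print(f"{op}: slowdown {factor:.2f}x")
```

Running the script prints one factor per operation: values near 1 indicate performance close to the recorded history, larger values indicate a slowdown. Client-side caching can mask server load, especially for `stat`, so a real deployment would have to account for it.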
Notes
1. When the specified job walltime limit is hit, jobs are terminated.
2. The value could be updated periodically in a sliding window to cover the typical operational conditions, or it could use statistics other than the median.
3.
4.
5. To minimize this, the precreated file size could have been increased.
6.
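Note 2 suggests that the reference value could be maintained in a sliding window or computed with a statistic other than the median. A minimal sketch of that idea, assuming one baseline object per probed operation; the class name `SlidingBaseline` and its methods are hypothetical and not taken from the paper.

```python
import statistics
from collections import deque
from typing import Callable, Sequence


class SlidingBaseline:
    """Hypothetical helper: a per-operation baseline over the last `window` probes.

    `stat` defaults to the median; any function mapping a sequence of timings
    to one number (e.g. min) can be swapped in.
    """

    def __init__(self, window: int,
                 stat: Callable[[Sequence[float]], float] = statistics.median):
        self.samples = deque(maxlen=window)  # oldest timings fall out automatically
        self.stat = stat

    def add(self, seconds: float) -> None:
        self.samples.append(seconds)

    def baseline(self) -> float:
        if not self.samples:
            raise ValueError("no probe timings recorded yet")
        return self.stat(list(self.samples))

    def slowdown(self, seconds: float) -> float:
        # A factor above 1 means the probe took longer than the baseline.
        return seconds / self.baseline()


if __name__ == "__main__":
    b = SlidingBaseline(window=3)
    for t in [1.0, 2.0, 3.0, 10.0]:  # 1.0 drops out of the 3-sample window
        b.add(t)
    print(b.baseline())    # 3.0, the median of [2.0, 3.0, 10.0]
    print(b.slowdown(6.0))  # 2.0
```

The bounded `deque` keeps the window update cheap, and passing `stat=min` instead of the median would make the baseline track the best recently observed performance rather than the typical one.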
Acknowledgements
This work was supported by the UK National Supercomputing Service, ARCHER, funded by EPSRC and NERC. We thank the German Climate Computing Center (DKRZ) for providing access to their machines to run the experiments.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kunkel, J., Betke, E. (2019). Tracking User-Perceived I/O Slowdown via Probing. In: Weiland, M., Juckeland, G., Alam, S., Jagode, H. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11887. Springer, Cham. https://doi.org/10.1007/978-3-030-34356-9_15
DOI: https://doi.org/10.1007/978-3-030-34356-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34355-2
Online ISBN: 978-3-030-34356-9
eBook Packages: Computer Science, Computer Science (R0)