Abstract
Operating system (OS) noise, or jitter, is a key limiter of application scalability in high-end computing systems. Several studies have attempted to quantify the sources and effects of system interference, but few of them show how architectural and system characteristics influence the impact of noise at scale. In this paper, we examine the impact of three such system properties: platform balance, noisy node distribution, and the choice of collective algorithm. Using a previously developed noise injection tool, we explore how the impact of noise varies with these platform characteristics. We provide detailed performance results indicating that a system with relatively less network bandwidth is able to absorb more noise than a system with more network bandwidth. Our results also show that application performance can be significantly degraded by only a subset of noisy nodes. Furthermore, the placement of the noisy nodes is important, especially for applications that make substantial use of tree-based collective communication operations. Lastly, performance results indicate that non-blocking collective operations can greatly mitigate the impact of OS interference. Taken together, these results show that the impact of OS noise is not solely a property of application communication behavior, but is also influenced by other properties of the system architecture and system software environment.
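The observation that a subset of noisy nodes can degrade an entire application can be illustrated with a toy model. The sketch below is not the noise injection tool or the methodology used in the paper; it assumes a bulk-synchronous application in which every step ends with a blocking collective, each noisy node is independently interrupted at most once per step with probability `p`, and each interruption has a fixed duration `noise`. The function name and all parameters are hypothetical, and network cost is ignored.

```python
def expected_slowdown(noisy_nodes: int, work: float, noise: float, p: float) -> float:
    """Expected per-step slowdown of a toy bulk-synchronous application.

    Assumptions (illustrative only, not taken from the paper):
      - every node computes for `work` seconds per step, then joins a
        blocking collective, so the step lasts as long as the slowest node;
      - each of the `noisy_nodes` is independently delayed by `noise`
        seconds with probability `p` in a given step;
      - quiet nodes are never delayed and network time is ignored.
    """
    if noisy_nodes < 0 or work <= 0 or noise < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("invalid model parameters")
    # Probability that at least one noisy node is interrupted in a step.
    p_any = 1.0 - (1.0 - p) ** noisy_nodes
    # One delayed node is enough to delay the whole step by `noise`.
    return (work + noise * p_any) / work


if __name__ == "__main__":
    # Sweep the number of noisy nodes; the total node count does not
    # appear because quiet nodes never extend the step in this model.
    for k in (0, 1, 8, 64, 512):
        print(k, round(expected_slowdown(k, work=1.0, noise=0.25, p=0.05), 3))
```

Under these assumptions the slowdown saturates once a modest number of nodes are noisy, regardless of how many quiet nodes the job has. The model says nothing about node placement, platform balance, or non-blocking collectives, which the paper evaluates experimentally.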
Additional information
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Cite this article
Ferreira, K.B., Bridges, P.G., Brightwell, R. et al. The impact of system design parameters on application noise sensitivity. Cluster Comput 16, 117–129 (2013). https://doi.org/10.1007/s10586-011-0178-3
DOI: https://doi.org/10.1007/s10586-011-0178-3