The impact of system design parameters on application noise sensitivity

Cluster Computing

Abstract

Operating system (OS) noise, or jitter, is a key limiter of application scalability in high-end computing systems. Several studies have attempted to quantify the sources and effects of system interference, but few show how architectural and system characteristics influence the impact of noise at scale. In this paper, we examine the impact of three such system properties: platform balance, noisy node distribution, and the choice of collective algorithm. Using a previously developed noise injection tool, we explore how the impact of noise varies with these platform characteristics. We provide detailed performance results indicating that a system with relatively less network bandwidth is able to absorb more noise than a system with more network bandwidth. Our results also show that noise on only a subset of nodes can significantly degrade application performance. Furthermore, the placement of the noisy nodes matters, especially for applications that make substantial use of tree-based collective communication operations. Lastly, performance results indicate that non-blocking collective operations can greatly mitigate the impact of OS interference. Taken together, these results show that the impact of OS noise is not solely a property of application communication behavior, but is also influenced by other properties of the system architecture and system software environment.
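
The claim that only a subset of noisy nodes can degrade performance can be illustrated with a toy bulk-synchronous model. The sketch below is not the noise injection tool used in the paper, and it ignores network effects and node placement. It assumes that every iteration ends with a blocking collective, so that each iteration takes as long as the slowest node. The function name `simulate_bsp` and the parameters `noise_prob` and `noise_len` are hypothetical, chosen only for illustration.

```python
import random


def simulate_bsp(n_nodes, n_noisy, iterations, work=1.0,
                 noise_prob=0.1, noise_len=0.25, seed=0):
    """Return the total runtime of a toy bulk-synchronous application.

    Assumptions (illustrative only): each node computes for `work` time
    units per iteration; each of the first `n_noisy` nodes is delayed by
    an extra `noise_len` with probability `noise_prob`; a blocking
    collective ends the iteration, so its cost is the maximum node time.
    """
    if not 0 <= n_noisy <= n_nodes:
        raise ValueError("n_noisy must be between 0 and n_nodes")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        times = []
        for node in range(n_nodes):
            delay = 0.0
            # Only noisy nodes can be interrupted in this model.
            if node < n_noisy and rng.random() < noise_prob:
                delay = noise_len
            times.append(work + delay)
        # Blocking collective: everyone waits for the slowest node.
        total += max(times)
    return total


if __name__ == "__main__":
    for noisy in (0, 16, 1024):
        runtime = simulate_bsp(n_nodes=1024, n_noisy=noisy, iterations=1000)
        print(f"{noisy:5d} noisy nodes -> runtime {runtime:.2f}")
```

In this model the slowdown depends on the probability that at least one node is delayed in a given iteration, which grows quickly with the number of noisy nodes. A small noisy subset can therefore account for most of the degradation seen when every node is noisy. Overlapping the wait with computation, as non-blocking collectives allow, is not modeled here.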


Author information

Corresponding author

Correspondence to Kurt B. Ferreira.

Additional information

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

About this article

Cite this article

Ferreira, K.B., Bridges, P.G., Brightwell, R. et al. The impact of system design parameters on application noise sensitivity. Cluster Comput 16, 117–129 (2013). https://doi.org/10.1007/s10586-011-0178-3

  • DOI: https://doi.org/10.1007/s10586-011-0178-3
