Abstract
Optimization of access patterns using collective I/O imposes the overhead of exchanging data between processes. In a multi-core-based cluster the costs of inter-node and intra-node data communication are vastly different, and this heterogeneity in data-exchange efficiency poses both a challenge and an opportunity for implementing efficient collective I/O. The opportunity is to exploit fast intra-node communication: we propose to improve communication locality for greater data-exchange efficiency. However, such an effort is at odds with improving access locality for I/O efficiency, which can also be critical to collective-I/O performance. To address this issue we propose a framework, Orthrus, that can accommodate multiple collective-I/O implementations, each optimized for particular performance aspects, and dynamically select the best-performing one according to the current workload and system state. We have implemented Orthrus in the ROMIO library. Our experimental results with representative MPI-IO benchmarks on both a small dedicated cluster and a large production HPC system show that Orthrus can significantly improve collective-I/O performance under various workload and system scenarios.
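The core idea of the abstract — maintaining several collective-I/O implementations and dynamically choosing whichever performs best under the current conditions — can be illustrated with a minimal sketch. This is not the Orthrus/ROMIO code (which operates on MPI-IO collectives in C); the function names `access_locality_io`, `communication_locality_io`, and `select_best` are hypothetical stand-ins for the selection logic only.

```python
import time

# Hypothetical placeholders for two collective-I/O strategies: one tuned
# for access locality (contiguous file domains) and one tuned for
# communication locality (favoring intra-node data exchange). Real
# implementations would issue MPI-IO calls.
def access_locality_io(data):
    return sorted(data)  # placeholder work standing in for an I/O phase

def communication_locality_io(data):
    return sorted(data)  # placeholder work standing in for an I/O phase

def select_best(candidates, sample):
    """Time each candidate implementation on a small sample workload and
    return the fastest one -- a loose analogue of dynamically selecting
    among multiple collective-I/O implementations at run time."""
    best, best_time = None, float("inf")
    for impl in candidates:
        start = time.perf_counter()
        impl(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = impl, elapsed
    return best

chosen = select_best(
    [access_locality_io, communication_locality_io],
    sample=list(range(1000)),
)
print(chosen.__name__)
```

In a real system the "sample" phase would be replaced by lightweight online measurement of the actual workload, and the winner could be re-evaluated periodically as workload and system conditions change.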
© 2014 Springer International Publishing Switzerland
Cite this paper
Zhang, X., Ou, J., Davis, K., Jiang, S. (2014). Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2014. Lecture Notes in Computer Science, vol 8488. Springer, Cham. https://doi.org/10.1007/978-3-319-07518-1_22
DOI: https://doi.org/10.1007/978-3-319-07518-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07517-4
Online ISBN: 978-3-319-07518-1
eBook Packages: Computer Science (R0)