Abstract
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, soon to reach tens of thousands of processors. Many functions of system software have failed to scale accordingly – systems are becoming more complex, less reliable, and less efficient. Our premise is that these deficiencies arise from a lack of global control and coordination of the processing nodes. In practice, current parallel machines are loosely coupled systems that are used for solving inherently tightly coupled problems. This paper demonstrates that existing and future systems can be made more scalable by using BSP-like parallel programming principles in the design and implementation of the system software, and by taking full advantage of the latest interconnection network hardware. Moreover, we show that this approach can also yield great improvements in efficiency, reliability, and simplicity.
This work is partially supported by the U.S. Department of Energy through Los Alamos National Laboratory contract W-7405-ENG-36 and the Spanish MCYT under grant TIC2003-08154-C06-03.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Frachtenberg, E., Davis, K., Petrini, F., Fernandez, J., Sancho, J.C. (2004). Designing Parallel Operating Systems via Parallel Programming. In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds) Euro-Par 2004 Parallel Processing. Euro-Par 2004. Lecture Notes in Computer Science, vol 3149. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27866-5_90
DOI: https://doi.org/10.1007/978-3-540-27866-5_90
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22924-7
Online ISBN: 978-3-540-27866-5
eBook Packages: Springer Book Archive