Abstract
This paper describes a system that enables parallel programs written using the BSPlib communications library to migrate processes between machines in a network of workstations. Not only does the system provide fault tolerance for BSPlib jobs, but by utilising a load manager that maintains an approximation of the global load of the system, it can continually schedule the migration of BSP processes onto the least loaded machines in the network. Results for an industrial electromagnetics application show that we can achieve throughput on a publicly available collection of workstations similar to that of a dedicated network of workstations (NOW).
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hill, J.M.D., Donaldson, S.R., Lanfear, T. (1998). Process migration and fault tolerance of BSPlib programs running on networks of workstations. In: Pritchard, D., Reeve, J. (eds) Euro-Par’98 Parallel Processing. Euro-Par 1998. Lecture Notes in Computer Science, vol 1470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0057841
DOI: https://doi.org/10.1007/BFb0057841
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64952-6
Online ISBN: 978-3-540-49920-6
eBook Packages: Springer Book Archive