Abstract
This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes the information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault tolerance is achieved by applying the results of previous system-level diagnosis research. An on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment.
DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper are the extension of a distributed system-level diagnosis algorithm to reliable broadcast of current global state, and the specification of DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced; it guarantees delivery of the most recently broadcast update without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed. Possible extensions to this work include other fault and system models and application to other distributed programs that require consistent distributed global state.
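The delivery guarantee of condensed reliable broadcast can be illustrated with a minimal sketch. The code below is not the DSMon algorithm: it only shows the idea that a receiver needs to hold the newest update from each source, so older updates may be dropped without violating the guarantee. The names `Update`, `CondensedStore`, `receive`, and `pending_for`, and the use of a per-source sequence number, are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Update:
    source: str   # processor that originated the update (hypothetical field)
    seq: int      # assumed per-source sequence number, increasing per broadcast
    payload: Any  # e.g. a CPU-load sample


@dataclass
class CondensedStore:
    """Holds only the newest update seen from each source."""
    latest: Dict[str, Update] = field(default_factory=dict)

    def receive(self, update: Update) -> bool:
        """Keep the update if it is newer than the one held; return True if kept."""
        held = self.latest.get(update.source)
        if held is not None and held.seq >= update.seq:
            return False  # stale or duplicate: already superseded, safe to drop
        self.latest[update.source] = update
        return True

    def pending_for(self, peer_view: Dict[str, int]) -> List[Update]:
        """Updates a peer still lacks, given the highest seq it reports per source."""
        return [u for u in self.latest.values()
                if peer_view.get(u.source, -1) < u.seq]


store = CondensedStore()
store.receive(Update("node-a", 1, {"load": 0.4}))
store.receive(Update("node-a", 3, {"load": 0.9}))
store.receive(Update("node-a", 2, {"load": 0.6}))  # arrives late and is dropped
```

After these three calls the store holds only the update with sequence number 3 for `node-a`; the intermediate history is not retained, which is the relaxation relative to ordinary reliable broadcast.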
This research was supported in part by the National Science Foundation (NSF) under Grant CCR-9257973 and by an NSF Graduate Research Fellowship.
Copyright information
© 1996 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Bearden, M., Bianchini, R. (1996). Efficient and fault-tolerant distributed host monitoring using system-level diagnosis. In: Schill, A., Mittasch, C., Spaniol, O., Popien, C. (eds) Distributed Platforms. IFIP — The International Federation for Information Processing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-34947-3_13
DOI: https://doi.org/10.1007/978-0-387-34947-3_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5010-2
Online ISBN: 978-0-387-34947-3
eBook Packages: Springer Book Archive