Efficient and fault-tolerant distributed host monitoring using system-level diagnosis

Bearden, M.; Bianchini, R.

doi:10.1007/978-0-387-34947-3_13

Part of the book series: IFIP — The International Federation for Information Processing (IFIPAICT)

Abstract

This paper presents an efficient and fault-tolerant distributed approach to monitoring the status of processors in a network. The Distributed System Monitor (DSMon) is a distributed, decentralized program that gathers processor information, such as CPU load, user information, and network and disk statistics, in parallel at each processor and reliably distributes that information on-line to all fault-free processors. Information is filtered at each processor and distributed at different priorities to conserve communication resources. Fault tolerance is achieved by applying the results of previous system-level diagnosis research: an on-line distributed system-level diagnosis algorithm that assumes the PMC fault model and a fully connected network is extended to consistently maintain user-defined state information in an unreliable environment.
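The filtering step mentioned above can be pictured with a small sketch. It is only an illustration under stated assumptions, not DSMon's actual mechanism: the abstract does not say how samples are filtered or how priorities are assigned, so the `UpdateFilter` class, its per-metric thresholds, and the numeric priority scheme below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateFilter:
    """Hypothetical local filter: drop samples that barely changed.

    thresholds: metric name -> minimum change worth reporting (assumed scheme)
    priorities: metric name -> priority, lower number = more urgent (assumed scheme)
    """
    thresholds: dict
    priorities: dict
    last_sent: dict = field(default_factory=dict)

    def offer(self, metric, value):
        """Return (priority, metric, value) if the sample should be sent, else None."""
        previous = self.last_sent.get(metric)
        threshold = self.thresholds.get(metric, 0.0)
        if previous is not None and abs(value - previous) < threshold:
            return None  # change too small: send nothing, save bandwidth
        self.last_sent[metric] = value
        return (self.priorities.get(metric, 9), metric, value)


# Usage: only the first sample and the one that moved by at least 0.5 are kept.
f = UpdateFilter(thresholds={"cpu_load": 0.5}, priorities={"cpu_load": 1})
first = f.offer("cpu_load", 1.0)
second = f.offer("cpu_load", 1.2)
third = f.offer("cpu_load", 1.6)
```

Comparing against the last value sent, rather than the last value sampled, keeps slow drift from going unreported indefinitely.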

DSMon has been implemented and currently operates on approximately 200 networked workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The key results of this paper are the extension of a distributed system-level diagnosis algorithm to provide reliable broadcast of current global state, and the specification of DSMon. A relaxed form of reliable broadcast, called condensed reliable broadcast, is introduced; it guarantees delivery of the most recently broadcast update without guaranteeing a complete history of all broadcast updates. The DSMon implementation is described, and its operation in an actual distributed network environment is analyzed. Extensions to this work include other fault and system models and applicability to other distributed applications that require consistent distributed global state.
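The delivery guarantee of condensed reliable broadcast, latest update only and no full history, can be illustrated with a minimal sketch. This is not the algorithm from the chapter: the abstract does not describe how DSMon orders or forwards updates, so the `Node` class, the per-source sequence numbers, and the table-merge step are assumptions made purely for illustration.

```python
class Node:
    """Hypothetical node that keeps only the newest update from each source."""

    def __init__(self, name):
        self.name = name
        self.seq = 0      # local sequence counter (assumed ordering scheme)
        self.latest = {}  # source name -> (sequence number, value)

    def publish(self, value):
        """Record a new local update, overwriting the previous one."""
        self.seq += 1
        self.latest[self.name] = (self.seq, value)

    def merge(self, other_table):
        """Adopt every entry newer than the local copy; return True if any changed."""
        changed = False
        for source, (seq, value) in other_table.items():
            known_seq = self.latest.get(source, (0, None))[0]
            if seq > known_seq:
                self.latest[source] = (seq, value)
                changed = True
        return changed


# Usage: b misses two intermediate updates and still converges on the newest one.
a, b = Node("a"), Node("b")
a.publish(0.3)
a.publish(0.7)
a.publish(0.9)
first_merge = b.merge(a.latest)
second_merge = b.merge(a.latest)
```

Because each source occupies one slot, the state a node must store and forward grows with the number of hosts rather than with the number of updates, which is the trade the relaxed guarantee makes.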

This research was supported in part by the National Science Foundation (NSF) under Grant CCR-9257973 and by an NSF Graduate Research Fellowship.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15213, USA
M. Bearden & R. Bianchini Jr.

Editor information

Editors and Affiliations

Dresden University of Technology, Dresden, Germany
Alexander Schill & Christian Mittasch
Aachen University of Technology, Aachen, Germany
Otto Spaniol & Claudia Popien

Cite this chapter

Bearden, M., Bianchini, R. (1996). Efficient and fault-tolerant distributed host monitoring using system-level diagnosis. In: Schill, A., Mittasch, C., Spaniol, O., Popien, C. (eds) Distributed Platforms. IFIP — The International Federation for Information Processing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-34947-3_13

DOI: https://doi.org/10.1007/978-0-387-34947-3_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5010-2
Online ISBN: 978-0-387-34947-3
eBook Packages: Springer Book Archive
