Diagnosing Distributed Systems with Self-propelled Instrumentation

Mirgorodskiy, Alexander V.; Miller, Barton P.

doi:10.1007/978-3-540-89856-6_5

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5346)

Included in the following conference series:

ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing

Abstract

We present a three-part approach for diagnosing bugs and performance problems in production distributed environments. First, we introduce a novel execution monitoring technique that dynamically injects a fragment of code, the agent, into an application process on demand. The agent inserts instrumentation ahead of the control flow within the process and propagates into other processes, following communication events, crossing host boundaries, and collecting a distributed function-level trace of the execution. Second, we present an algorithm that separates the trace into user-meaningful activities called flows. This step simplifies manual examination and enables automated analysis of the trace. Finally, we describe our automated root cause analysis technique that compares the flows to help the analyst locate an anomalous flow and identify a function in that flow that is a likely cause of the anomaly. We demonstrate the effectiveness of our techniques by diagnosing two complex problems in the Condor distributed scheduling system.
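
The second and third steps can be illustrated with a minimal sketch. This is not the authors' algorithm: the event format, the flow-attribution rules (a process works on the flow it started or most recently received a message for), the call-frequency profile, and the L1 distance are all simplifying assumptions chosen for illustration, and every function name below is hypothetical.

```python
from collections import Counter, defaultdict


def split_into_flows(events):
    """Group traced function calls into flows.

    events: list of (process, kind, value) in observed order (assumed format).
    kind is 'start' (value = flow id), 'call' (value = function name),
    'send' or 'recv' (value = message id).
    """
    current = {}    # process -> flow it is currently working on
    msg_flow = {}   # message id -> flow of the sending process
    flows = defaultdict(list)
    for process, kind, value in events:
        if kind == "start":
            current[process] = value
        elif kind == "send" and process in current:
            msg_flow[value] = current[process]
        elif kind == "recv" and value in msg_flow:
            # The receiver continues the sender's flow.
            current[process] = msg_flow[value]
        elif kind == "call" and process in current:
            flows[current[process]].append(value)
    return dict(flows)


def profile(calls):
    """Fraction of a flow's calls that went to each function."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()}


def distance(p, q):
    """L1 distance between two profiles."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))


def most_anomalous(flows):
    """Return (flow id, function) for the flow farthest from the others
    and the function that deviates most from the others' mean profile."""
    if len(flows) < 2:
        raise ValueError("need at least two flows to compare")
    profiles = {fid: profile(calls) for fid, calls in flows.items()}

    def score(fid):
        dists = [distance(profiles[fid], profiles[o])
                 for o in profiles if o != fid]
        return sum(dists) / len(dists)

    suspect = max(profiles, key=score)
    others = [profiles[o] for o in profiles if o != suspect]
    funcs = set(profiles[suspect]).union(*others)

    def deviation(f):
        mean = sum(p.get(f, 0.0) for p in others) / len(others)
        return abs(profiles[suspect].get(f, 0.0) - mean)

    return suspect, max(funcs, key=deviation)
```

Given a trace in which most flows call the same functions in similar proportions, `most_anomalous(split_into_flows(events))` points at the flow whose profile differs most and at the function that accounts for the largest share of that difference, which is the kind of hint the abstract describes handing to the analyst.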

Author information

Authors and Affiliations

VMware, Inc., USA
Alexander V. Mirgorodskiy
Computer Sciences Dept, University of Wisconsin, USA
Barton P. Miller

Editor information

Editors and Affiliations

INRIA-Rocquencourt, Domaine de Voluceau, 78153, Le Chesnay, France
Valérie Issarny
BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA
Richard Schantz

Cite this paper

Mirgorodskiy, A.V., Miller, B.P. (2008). Diagnosing Distributed Systems with Self-propelled Instrumentation. In: Issarny, V., Schantz, R. (eds) Middleware 2008. Lecture Notes in Computer Science, vol 5346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89856-6_5

DOI: https://doi.org/10.1007/978-3-540-89856-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89855-9
Online ISBN: 978-3-540-89856-6
eBook Packages: Computer Science, Computer Science (R0)
