Abstract
We present a three-part approach for diagnosing bugs and performance problems in production distributed environments. First, we introduce a novel execution monitoring technique that dynamically injects a fragment of code, the agent, into an application process on demand. The agent inserts instrumentation ahead of the control flow within the process and propagates into other processes, following communication events, crossing host boundaries, and collecting a distributed function-level trace of the execution. Second, we present an algorithm that separates the trace into user-meaningful activities called flows. This step simplifies manual examination and enables automated analysis of the trace. Finally, we describe our automated root cause analysis technique that compares the flows to help the analyst locate an anomalous flow and identify a function in that flow that is a likely cause of the anomaly. We demonstrate the effectiveness of our techniques by diagnosing two complex problems in the Condor distributed scheduling system.
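As a rough illustration of the third step (comparing flows to locate an anomalous one and a suspect function within it), the sketch below treats each flow as a map from function name to time spent. This is an assumption for illustration only: the abstract does not specify the comparison algorithm, and the profile representation, the Manhattan distance, the nearest-neighbour scoring, and every name in the code are hypothetical choices rather than the authors' method.

```python
from typing import Dict, Tuple

# Hypothetical representation: a flow is summarized as function name -> seconds spent.
Profile = Dict[str, float]


def normalize(times: Profile) -> Profile:
    """Convert absolute times into fractions of the flow's total time."""
    total = sum(times.values())
    if total == 0:
        return {func: 0.0 for func in times}
    return {func: t / total for func, t in times.items()}


def distance(a: Profile, b: Profile) -> float:
    """Manhattan distance between two profiles; missing functions count as 0."""
    funcs = set(a) | set(b)
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in funcs)


def most_anomalous_flow(flows: Dict[str, Profile]) -> Tuple[str, str]:
    """Return (flow name, function name) for the flow farthest from its nearest
    neighbour and the function that contributes most to that difference."""
    if len(flows) < 2:
        raise ValueError("need at least two flows to compare")
    profiles = {name: normalize(times) for name, times in flows.items()}

    def nearest(name: str) -> str:
        others = (o for o in profiles if o != name)
        return min(others, key=lambda o: distance(profiles[name], profiles[o]))

    # Assumed scoring: a flow is suspicious if even its closest peer is far away.
    suspect = max(
        profiles, key=lambda n: distance(profiles[n], profiles[nearest(n)])
    )
    reference = profiles[nearest(suspect)]
    funcs = set(profiles[suspect]) | set(reference)
    culprit = max(
        funcs,
        key=lambda f: abs(profiles[suspect].get(f, 0.0) - reference.get(f, 0.0)),
    )
    return suspect, culprit


# Made-up example data; the flow and function names are placeholders.
example_flows = {
    "flow1": {"submit": 1.0, "match": 2.0, "run": 7.0},
    "flow2": {"submit": 1.0, "match": 2.0, "run": 7.0},
    "flow3": {"submit": 1.0, "match": 2.0, "run": 7.0, "retry_connect": 10.0},
}
print(most_anomalous_flow(example_flows))
```

In this toy data, flow3 spends half of its time in a function the other flows never call, so it is reported together with that function. A real trace would need a more careful notion of similarity between flows than this sketch provides.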
Copyright information
© 2008 IFIP International Federation for Information Processing
About this paper
Cite this paper
Mirgorodskiy, A.V., Miller, B.P. (2008). Diagnosing Distributed Systems with Self-propelled Instrumentation. In: Issarny, V., Schantz, R. (eds) Middleware 2008. Lecture Notes in Computer Science, vol 5346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89856-6_5
DOI: https://doi.org/10.1007/978-3-540-89856-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89855-9
Online ISBN: 978-3-540-89856-6
eBook Packages: Computer Science, Computer Science (R0)