Sketching Distributed Data Provenance

Part of the book series: Studies in Computational Intelligence (SCI, volume 426)

Abstract

Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.
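
The scalability concern with recursive querying can be illustrated with a minimal sketch. It is not the matrix filter structure the chapter presents, and it is not SPADE's API. The per-host stores, the vertex names, and the `ancestors` function are all hypothetical, chosen only to show how an ancestor query over provenance kept on separate hosts accumulates cross-host lookups as it walks the graph.

```python
from collections import deque

# Hypothetical per-host provenance stores (illustrative data only).
# Each store maps a local vertex to its direct parents as (host, vertex) pairs.
STORES = {
    "hostA": {"out.dat": [("hostA", "proc2")], "proc2": [("hostB", "mid.dat")]},
    "hostB": {"mid.dat": [("hostB", "proc1")], "proc1": [("hostC", "in.dat")]},
    "hostC": {"in.dat": []},
}


def ancestors(host, vertex):
    """Walk parent edges breadth-first from (host, vertex).

    Returns the set of ancestors and the number of edges that crossed
    a host boundary, each of which would be a remote query in practice.
    """
    seen = set()
    remote_lookups = 0
    queue = deque([(host, vertex)])
    while queue:
        cur_host, cur_vertex = queue.popleft()
        for parent in STORES[cur_host].get(cur_vertex, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent[0] != cur_host:
                remote_lookups += 1  # edge leaves the current host
            queue.append(parent)
    return seen, remote_lookups


result, hops = ancestors("hostA", "out.dat")
print(sorted(result), hops)
```

In this toy graph every cross-host edge on the path forces a round trip to another store, so the query cost grows with the depth and spread of the lineage. That is the cost a compact per-host summary of graph information would aim to reduce.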

Author information

Corresponding author

Correspondence to Tanu Malik.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Malik, T., Gehani, A., Tariq, D., Zaffar, F. (2013). Sketching Distributed Data Provenance. In: Liu, Q., Bai, Q., Giugni, S., Williamson, D., Taylor, J. (eds) Data Provenance and Data Management in eScience. Studies in Computational Intelligence, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29931-5_4

  • DOI: https://doi.org/10.1007/978-3-642-29931-5_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29930-8

  • Online ISBN: 978-3-642-29931-5

  • eBook Packages: Engineering, Engineering (R0)
