Towards a Model of Provenance and User Views in Scientific Workflows

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4075)

Abstract

Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final results, are derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance.

In this paper, we provide a formal model of provenance for scientific workflows which is general (i.e., can be used with existing workflow systems, such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries we encountered in a number of case studies. Interestingly, our model not only takes into account the chained and nested structure of scientific workflows, but also allows users to ask for provenance at different levels of abstraction (user views).

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cohen, S., Cohen-Boulakia, S., Davidson, S. (2006). Towards a Model of Provenance and User Views in Scientific Workflows. In: Leser, U., Naumann, F., Eckman, B. (eds) Data Integration in the Life Sciences. DILS 2006. Lecture Notes in Computer Science, vol 4075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11799511_24

  • DOI: https://doi.org/10.1007/11799511_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-36593-8

  • Online ISBN: 978-3-540-36595-2

  • eBook Packages: Computer Science, Computer Science (R0)
