
A Brief Tour Through Provenance in Scientific Workflows and Databases

  • Bertram Ludäscher
Conference paper
Part of the Springer Proceedings in Business and Economics book series (SPBE)

Abstract

Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.

Keywords

Lineage; Prospective provenance; Provenance games; Provenance polynomials; Retrospective provenance; Why-not provenance

Notes

Acknowledgements

This work was supported in part by NSF grants ACI-1430508, DBI-{1147273, 1356751}, IIS-1118088, and SMA-1439603. With special thanks to Shawn Bowers, Timothy McPhillips, Manish K. Anand, Víctor Cuevas-Vicenttín, Saumen Dey, Lei Dou, Sven Köhler, Sean Riddle, and Daniel Zinn for fruitful years of collaboration on scientific workflows and database provenance. Also special thanks to Boris Glavic for comments on an earlier draft of this paper and for his collaboration on and implementation of games for why-not provenance.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Information Sciences and National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Champaign, USA
