Provenance Annotation and Analysis to Support Process Re-computation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11017)

Abstract

Many resource-intensive analytics processes evolve over time as new versions of the reference datasets and software dependencies they use are released. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case, for instance, in high-throughput genomics, where the same process is used to analyse large cohorts of patient genomes, or cases. As any single version change is unlikely to affect the entire population, an efficient strategy for restoring the currency of the outcomes must first identify the scope of the change, i.e., the subset of affected data products. In this paper we describe a generic and reusable provenance-based approach to this scope discovery problem. It applies to a scenario where the process consists of complex hierarchical components, where different input cases are processed using different version configurations of each component, and where separate provenance traces are collected for the executions of each component. We show how a new data structure, called a restart tree, is computed and exploited to manage the change scope discovery problem.
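The notion of change scope described in the abstract can be illustrated with a minimal sketch. This is not the paper's method and does not implement the restart tree; it only shows the basic idea of filtering cases by the dependency versions recorded in their provenance. The record layout (`ExecutionRecord`), the function `change_scope`, and all case, component, and version names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionRecord:
    """Hypothetical, simplified provenance record for one component execution."""
    case_id: str                 # the input case (e.g. one patient genome)
    component: str               # the process component that was executed
    used_versions: dict = field(default_factory=dict)  # dependency -> version used


def change_scope(records, dependency, new_version):
    """Return the sorted ids of cases with an execution that used
    `dependency` at a version other than `new_version`."""
    affected = set()
    for record in records:
        used = record.used_versions.get(dependency)
        # Executions that never used the dependency are outside the scope.
        if used is not None and used != new_version:
            affected.add(record.case_id)
    return sorted(affected)


# Illustrative data only: three cases processed with different configurations.
records = [
    ExecutionRecord("case-1", "align", {"ref-genome": "v37"}),
    ExecutionRecord("case-1", "annotate", {"variant-db": "2018-01"}),
    ExecutionRecord("case-2", "align", {"ref-genome": "v38"}),
    ExecutionRecord("case-3", "annotate", {"variant-db": "2018-01"}),
]

scope = change_scope(records, "variant-db", "2018-06")
print(scope)  # ['case-1', 'case-3']
```

In this toy setting only the cases whose recorded executions used the changed dependency fall within the scope, so `case-2` would not need to be re-computed after the `variant-db` update.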

Notes

  1. https://software.broadinstitute.org/gatk/best-practices.

  2. https://github.com/ReComp-team/IPAW2018.

Author information

Corresponding author

Correspondence to Jacek Cała.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Cała, J., Missier, P. (2018). Provenance Annotation and Analysis to Support Process Re-computation. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_1

  • DOI: https://doi.org/10.1007/978-3-319-98379-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98378-3

  • Online ISBN: 978-3-319-98379-0

  • eBook Packages: Computer Science, Computer Science (R0)
