Preserving the Value of Large Scale Data Analytics over Time Through Selective Re-computation

  • Conference paper
Data Analytics (BICOD 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10365)

Abstract

A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time, as the data and algorithms used to compute it change and the external knowledge embodied by reference datasets evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data change events, requires problem-specific functions to quantify their value and its decay over time, as well as models for estimating the cost of their re-computation. The challenge is to develop a decision support system for informing re-computation decisions over time that is both generic and customisable. In this paper, with the help of a case study from genomics, we offer an initial formalisation of this problem, highlight research challenges, and outline a possible approach based on the analysis of metadata from a history of past computations.

Notes

  1. https://www.ncbi.nlm.nih.gov/clinvar.

  2. https://www.omim.org.

  3. Analysis of the specific “backwards” cases will appear in a separate contribution.

Author information

Corresponding author

Correspondence to Paolo Missier.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Missier, P., Cała, J., Rathi, M. (2017). Preserving the Value of Large Scale Data Analytics over Time Through Selective Re-computation. In: Calì, A., Wood, P., Martin, N., Poulovassilis, A. (eds) Data Analytics. BICOD 2017. Lecture Notes in Computer Science, vol 10365. Springer, Cham. https://doi.org/10.1007/978-3-319-60795-5_6

  • DOI: https://doi.org/10.1007/978-3-319-60795-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60794-8

  • Online ISBN: 978-3-319-60795-5

  • eBook Packages: Computer Science, Computer Science (R0)
